BIG TIME OVERFETCH!!! (Perseus)

Gavin
Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40,586,633,509
RAC: 2,772,080

RE: I was worried it was

Quote:

I was worried it was just my hosts, but it's not - for example this host is doing the same according to its log. Same BOINC version, different OS (Windows vs. Linux).

Also I have found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are 7.0.64 or .65).

Neil,

This got my attention, as the host you linked to is one of mine, and looking at my other machines, it's not the only one displaying this behaviour.

Seven of my hosts are running BOINC 7.0.64, and including the one you linked to, four of the seven are contacting the server every minute. The others are 3641777, 6550900 and 5647876. All of these machines are single-GPU with the same preference settings, work caches and venue; they are on the same network and have the same ISP.

As for the remaining three, this host and this are again single-GPU machines with the same preferences etc., but they have smaller work caches and a different venue, and are also on a different network and ISP. No problem with this pair.

The final machine, my laptop, does not have a GPU and is used on both networks with no problem.

I think it would be safe to assume that the problem is not ISP or network related :-) and is more likely due to a quirk in the cache size settings within BOINC. The affected machines have caches of either 1 or 1.25 days (can't remember, will have to check), whereas the machines that behave are set at 0.75 days (all with a max. additional buffer of 0.10 days). That said, a quick look back at tbret's original post shows he has a cache of only 0.50 days... but then his machine was actually requesting and getting additional work, whereas mine are just saying "hello" regularly and repeatedly!
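To put rough numbers on the cache sizes above, here is a much-simplified sketch of how a minimum-buffer plus additional-buffer setting translates into a work request. This is not BOINC's actual client code (the real work fetch also accounts for resource shares, quotas and backoff), and the function and parameter names are illustrative:

```python
# Much-simplified model of BOINC-style work fetch; names are illustrative,
# not BOINC's own. The client tries to top the queue up to the buffer target.
def work_request_seconds(min_buffer_days, extra_buffer_days, queued_secs):
    """Seconds of work to request so the queue reaches the buffer target."""
    target_secs = (min_buffer_days + extra_buffer_days) * 86400
    return max(0.0, target_secs - queued_secs)

# A 1.25-day cache with 12 hours of work already queued:
print(work_request_seconds(1.25, 0.0, 12 * 3600))  # 64800.0 seconds (18 h)
```

This at least shows why a bigger cache asks for more work per request; it doesn't explain the once-a-minute polling, which looks like a separate bug.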

I will try Beyond's suggestion tomorrow: I'll upgrade or downgrade BOINC on one of my affected hosts to see what happens, lower the cache size on another, and report back.

Holmis
Joined: 4 Jan 05
Posts: 1,118
Credit: 1,055,935,564
RAC: 0

RE: I will try Beyond's

Quote:
I will try Beyond's suggestion tomorrow: I'll upgrade or downgrade BOINC on one of my affected hosts to see what happens, lower the cache size on another, and report back.

If you haven't tried it yet, just restart BOINC. I experienced the same thing about a week ago, when it repeatedly contacted both Einstein and Albert@home; I tried a few things that did not work, then just exited BOINC completely and restarted it. I haven't seen any problems since with version 7.0.64.

Beyond
Beyond
Joined: 28 Feb 05
Posts: 117
Credit: 1,405,032,740
RAC: 5,723,309

RE: RE: I will try

Quote:
Quote:
I will try Beyond's suggestion tomorrow: I'll upgrade or downgrade BOINC on one of my affected hosts to see what happens, lower the cache size on another, and report back.

If you haven't tried it yet, just restart BOINC. I experienced the same thing about a week ago, when it repeatedly contacted both Einstein and Albert@home; I tried a few things that did not work, then just exited BOINC completely and restarted it. I haven't seen any problems since with version 7.0.64.


I tried restarting BOINC on the 2 boxes that had the problem, but before too long they started downloading more work (WAY too much work). So far there's been no problem since I switched to 7.1.3 (knock on wood). BTW, my queue size on both machines was 0.4 days with no additional queue set.

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: .... Seven of my hosts

Quote:


....
Seven of my hosts are running BOINC 7.0.64, and including the one you linked to, four of the seven are contacting the server every minute.
...
As for the remaining three, this host and this are again single-GPU machines with the same preferences etc., but they have smaller work caches and a different venue, and are also on a different network and ISP. No problem with this pair.
...
I think it would be safe to assume that the problem is not ISP or network related :-) and is more likely due to a quirk in the cache size settings within BOINC. The affected machines have caches of either 1 or 1.25 days (can't remember, will have to check), whereas the machines that behave are set at 0.75 days (all with a max. additional buffer of 0.10 days). That said, a quick look back at tbret's original post shows he has a cache of only 0.50 days... but then his machine was actually requesting and getting additional work, whereas mine are just saying "hello" regularly and repeatedly!

I will try Beyond's suggestion tomorrow: I'll upgrade or downgrade BOINC on one of my affected hosts to see what happens, lower the cache size on another, and report back.

Gavin: Thanks for the feedback, and thanks to Beyond for the info! Given that I bumped my cache from 0.25 to 1.00 days, it does seem very possible it's related to some mix of scheduler changes in BOINC, cache sizing and (possibly) the long run time of BRP5 tasks. Still, assuming it's a bug, like most of them I'm sure it'll be 'obvious' once the cause is found :).

As a control I'll leave my hosts alone for now (although it's pretty clear that with 1300+ BRP5 tasks to process, I'll have to do 'something' in the next week or so!).

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 671,434,727
RAC: 637,184

Hi! So we have two issues

Hi!

So we have two issues here, let me address them one by one:

1) Fetching more work than seems reasonable: we are looking into it, and we will probably lower the quota as a workaround.

2) Clients contacting the server more often than is reasonable, with requests to fetch NO work at all. There have been reports at Seti@Home about this, and a rather strange workaround; quoting Eric Korpela on the boinc-dev list:

Quote:

[...] people have reported that it goes away when they select
"read config file", even if they don't have a config file.

As strange as it seems, you may want to try this.

Cheers
HB

Gavin
Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40,586,633,509
RAC: 2,772,080

The problem I had with the

The problem I had with the four hosts constantly contacting the server appears to be resolved. Following on from the suggestion by Holmis, I exited and restarted BOINC on each of those hosts in turn; two hours have now passed and these machines have returned to normal operation.

Thank you Holmis.

Now I just need to work out why this host keeps randomly trashing small batches of GW tasks...

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: 2) clients contacting

Quote:

2) Clients contacting the server more often than is reasonable, with requests to fetch NO work at all. There have been reports at Seti@Home about this, and a rather strange workaround; quoting Eric Korpela on the boinc-dev list:
Quote:

[...] people have reported that it goes away when they select
"read config file", even if they don't have a config file.


Thanks for the update, and confirmed this works! I did this on one of my affected hosts and it stopped the requests. I also tried stopping and restarting BOINC on another host, and (as Holmis suggested) that worked too.

I also subsequently performed a manual 'Update' on both hosts to see if it would re-trigger the polling issue, but it didn't.
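For anyone who wants to watch what the client is actually doing when it polls, a minimal cc_config.xml in the BOINC data directory is enough; the sched_op_debug log flag makes the client log each scheduler contact and the reason for it. This is a sketch of a standard BOINC config file, not anything specific to this bug:

```xml
<!-- Minimal cc_config.xml; <sched_op_debug> logs each scheduler request
     and why it was made, which helps diagnose the once-a-minute polling. -->
<cc_config>
  <log_flags>
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>
```

On a headless host, `boinccmd --read_cc_config` should be the command-line equivalent of the Manager's "read config file" item, so the workaround Eric describes can be triggered without the GUI.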

On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).

Neil Newell
Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169,699,457
RAC: 0

RE: Now I just need to

Quote:

Now I just need to work out why this host keeps randomly trashing small batches of GW tasks...

Perhaps this is best moved to another thread (as I've already sidetracked this one with the polling issue), but this looks suspicious:

2013-06-04 09:07:10.7220 (5000) [CRITICAL]: Required frequency-bins [864284, 864299] not covered by SFT-interval [895056, 895544]
		[Parameters: alpha:0, Dphi_alpha:8.642919e+005, Tsft:1.800000e+003, *Tdot_al:9.999594e-001]


Beyond
Beyond
Joined: 28 Feb 05
Posts: 117
Credit: 1,405,032,740
RAC: 5,723,309

RE: Thanks for the update,

Quote:
Thanks for the update, and confirmed this works! I did this on one of my affected hosts and it stopped the requests. I also tried stopping and restarting BOINC on another host, and (as Holmis suggested) that worked too.


Mine worked for a while too and then started up again. Since switching versions it hasn't recurred.

Quote:
On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).


I would think it's far better to abort them now. There's no problem with aborting tasks; they go right back into the queue. If you hang on to them, it'll be 2 weeks before you abort them, and then they'll still go back into the queue. Better to do it now IMO.

Horacio
Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80,557,243
RAC: 0

RE: RE: On the other

Quote:
Quote:
On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).

I would think it's far better to abort them now. There's no problem with aborting tasks; they go right back into the queue. If you hang on to them, it'll be 2 weeks before you abort them, and then they'll still go back into the queue. Better to do it now IMO.

I agree. Aborted tasks are instantly put on the resend list and will quickly be sent to other hosts, which will be able to process and return them faster than your host, which has to get through hundreds of them first... Not only is this good for the wingmen, it will also reduce the number of WUs that have to be kept on the DB servers.
It would also be better to abort the most recently received WUs, leaving a reasonable number of the older ones in your cache. Just sort the tasks by deadline and abort the ones with the longest deadlines first, until you have a doable amount of tasks. (The idea behind keeping the ones with shorter deadlines is that the wingmen of those tasks have already been waiting much longer than the wingmen of the others, and if you abort them now they will go to the bottom of the list on another host, which will take even more time.)
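That triage can be sketched in a few lines, assuming you have already got the task list out of the client somehow (e.g. by parsing `boinccmd --get_tasks`); the function and field names here are illustrative, not BOINC's own:

```python
# Keep the earliest-deadline tasks up to what the host can finish in time;
# everything with a later deadline gets aborted and returns to the resend
# queue. Field names ("deadline", "est_secs") are illustrative.
def split_keep_abort(tasks, capacity_secs):
    keep, abort, used = [], [], 0.0
    for task in sorted(tasks, key=lambda t: t["deadline"]):
        if used + task["est_secs"] <= capacity_secs:
            keep.append(task)
            used += task["est_secs"]
        else:
            abort.append(task)
    return keep, abort

# Host can finish ~20,000 s of work in time; three 10,000 s tasks queued:
tasks = [{"deadline": 3, "est_secs": 10000.0},
         {"deadline": 1, "est_secs": 10000.0},
         {"deadline": 2, "est_secs": 10000.0}]
keep, abort = split_keep_abort(tasks, 20000.0)
# keep holds the deadline-1 and deadline-2 tasks; abort holds deadline 3.
```

The greedy earliest-deadline-first choice is exactly Horacio's rule: the oldest work (which wingmen have waited longest for) stays, the newest goes back for resend.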
