BIG TIME OVERFETCH ! ! ! Perceus.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

RE: I was worried it was

Quote:

I was worried it was just my hosts, but it's not - for example this host is doing the same according to its log. Same BOINC version, different OS (windows vs. linux).

Also I have found that one of my single-GPU hosts isn't doing it - the only difference I'm aware of is that it's on 7.0.44 (all the others are 7.0.64 or .65).

Neil,

This got my attention, as the host you linked to is one of mine, and looking at my other machines it's not the only one displaying this behaviour.

Seven of my hosts are running BOINC 7.0.64 and, including the one you linked to, I actually have four of the seven contacting the server every minute. The others are 3641777, 6550900 and 5647876. All these machines are single-GPU with the same preference settings, work caches and venue; they are on the same network and have the same ISP.

As for the remaining three: this host and this one are again single-GPU machines with the same preferences etc., but they have smaller work caches and a different venue, and are also on a different network and ISP. No problem with this pair.

The final machine, my laptop, does not have a GPU and is used on both networks with no problem.

I think it would be safe to assume that the problem is not ISP or network related :-) and is more likely due to a quirk in the cache size settings within BOINC. The affected machines have caches of either 1 or 1.25 days (can't remember, will have to check), whereas the machines that behave are set at 0.75 days (all with a max. additional buffer of 0.10 days). That said, a quick look back at tbret's original post shows he has a cache of only 0.50 days... but then his machine was actually requesting and getting additional work, whereas mine are just saying "hello" regularly and repeatedly!
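
For anyone who wants to experiment with these settings locally rather than through the website: the cache sizes correspond to the work buffer values in a global_prefs_override.xml file in the BOINC data directory. A minimal sketch (the values here are just examples, not a recommendation):

    <global_preferences>
       <work_buf_min_days>0.75</work_buf_min_days>
       <work_buf_additional_days>0.10</work_buf_additional_days>
    </global_preferences>

The running client picks it up via "Read local prefs file" in the Manager, or boinccmd --read_global_prefs_override from the command line.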

I will try Beyond's suggestion tomorrow and either upgrade or downgrade BOINC on one of my affected hosts to see what happens, and will lower the cache size on another, then report back.

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

RE: I will try Beyond's

Quote:
I will try Beyond's suggestion tomorrow and either upgrade or downgrade BOINC on one of my affected hosts to see what happens, and will lower the cache size on another, then report back.

If you haven't tried it yet, just restart BOINC. I experienced the same thing a week or so ago, when it repeatedly contacted both Einstein and Albert@home; I tried a few things that did not work, then just exited BOINC completely and started it again. I haven't seen any problems since with version 7.0.64.
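
On a machine run without the Manager it's the client itself that needs restarting. How depends on the installation; on a Debian/Ubuntu-packaged Linux box of this vintage it would be something like "sudo /etc/init.d/boinc-client restart" (an assumption on my part, check your own packaging).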

Beyond
Joined: 28 Feb 05
Posts: 121
Credit: 2360096212
RAC: 5628461

RE: RE: I will try

Quote:
Quote:
I will try Beyond's suggestion tomorrow and either upgrade or downgrade BOINC on one of my affected hosts to see what happens, and will lower the cache size on another, then report back.

If you haven't tried it yet, just restart BOINC. I experienced the same thing a week or so ago, when it repeatedly contacted both Einstein and Albert@home; I tried a few things that did not work, then just exited BOINC completely and started it again. I haven't seen any problems since with version 7.0.64.


I tried restarting BOINC on the 2 boxes that had the problem; before too long they started downloading more work (WAY too much work). So far no problem since I switched to 7.1.3 (knock on wood). BTW, my queue size on both machines was 0.4 days with no additional queue set.

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169699457
RAC: 0

RE: .... Seven of my hosts

Quote:


....
Seven of my hosts are running BOINC 7.0.64 and, including the one you linked to, I actually have four of the seven contacting the server every minute
...
As for the remaining three: this host and this one are again single-GPU machines with the same preferences etc., but they have smaller work caches and a different venue, and are also on a different network and ISP. No problem with this pair.
...
I think it would be safe to assume that the problem is not ISP or network related :-) and is more likely due to a quirk in the cache size settings within BOINC. The affected machines have caches of either 1 or 1.25 days (can't remember, will have to check), whereas the machines that behave are set at 0.75 days (all with a max. additional buffer of 0.10 days). That said, a quick look back at tbret's original post shows he has a cache of only 0.50 days... but then his machine was actually requesting and getting additional work, whereas mine are just saying "hello" regularly and repeatedly!

I will try Beyond's suggestion tomorrow and either upgrade or downgrade BOINC on one of my affected hosts to see what happens, and will lower the cache size on another, then report back.

Gavin: Thanks for the feedback, and Beyond for the info! Given that I twiddled my cache from 0.25 to 1.00 days, it does seem very possible it's related to some mix of scheduler changes in BOINC, cache sizing and (possibly) the long run time of BRP5 tasks. Still, assuming it's a bug, like most of them I'm sure it'll be 'obvious' once the cause is found :).

As a control I'll leave my hosts alone for now (although it's pretty clear that with 1300+ BRP5s to process, I'll have to do 'something' in the next week or so!).

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 728115804
RAC: 1197322

Hi! So we have two issues

Hi!

So we have two issues here; let me address them one by one:

1) Fetching more work than seems reasonable: we are looking into it, and we will probably lower the quota as a workaround.

2) Clients contacting the server more often than is reasonable, with requests to fetch NO work at all. There have been reports at Seti@Home about this, and a rather strange workaround; quoting Eric Korpela on the boinc-dev list:

Quote:

[...] people have reported that it goes away when they select
"read config file", even if they don't have a config file.

As strange as it seems, you may want to try this.
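
For hosts run without the Manager, the same action can be triggered from the command line: boinccmd --read_cc_config issues the same request to the client (assuming the stock command-line tools are installed). And if you want an actual file for it to read, a minimal, empty cc_config.xml in the BOINC data directory is just:

    <cc_config>
       <options>
       </options>
    </cc_config>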

Cheers
HB

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

The problem I had with the

The problem I had with the four hosts constantly contacting the server appears to be resolved. Following on from the suggestion by Holmis, I exited BOINC and restarted each of those hosts in turn; two hours have now passed and these machines have returned to normal operation.

Thank you Holmis.

Now I just need to work out why this host keeps randomly trashing small batches of GW tasks...

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169699457
RAC: 0

RE: 2) clients contacting

Quote:

2) Clients contacting the server more often than is reasonable, with requests to fetch NO work at all. There have been reports at Seti@Home about this, and a rather strange workaround; quoting Eric Korpela on the boinc-dev list:
Quote:

[...] people have reported that it goes away when they select
"read config file", even if they don't have a config file.


Thanks for the update, and I can confirm this works! I did it on one of my affected hosts and it stopped the requests. I also tried stopping and restarting BOINC on another host, and (as Holmis suggested) that worked too.

I also subsequently performed a manual 'Update' on both hosts to see if it would re-trigger the polling issue, but it didn't.
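
(For anyone testing this on a headless host: the manual Update can also be issued from the command line as boinccmd --project <URL> update, assuming the standard boinccmd tool.)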

On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).

Neil Newell
Joined: 20 Nov 12
Posts: 176
Credit: 169699457
RAC: 0

RE: Now I just need to

Quote:

Now I just need to work out why this host keeps randomly trashing small batches of GW tasks...

Perhaps this is best moved to another thread (as I've already sidetracked this one with the polling issue), but this looks suspicious:

2013-06-04 09:07:10.7220 (5000) [CRITICAL]: Required frequency-bins [864284, 864299] not covered by SFT-interval [895056, 895544]
		[Parameters: alpha:0, Dphi_alpha:8.642919e+005, Tsft:1.800000e+003, *Tdot_al:9.999594e-001]


Beyond
Joined: 28 Feb 05
Posts: 121
Credit: 2360096212
RAC: 5628461

RE: Thanks for the update,

Quote:
Thanks for the update, and I can confirm this works! I did it on one of my affected hosts and it stopped the requests. I also tried stopping and restarting BOINC on another host, and (as Holmis suggested) that worked too.


Mine worked for a while too and then started up again. After switching versions it hasn't reoccurred.

Quote:
On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).


I would think it's far better to abort them now. There's no problem with aborting tasks; they go right back into the queue. If you hang on to them, it'll be 2 weeks before you abort them, and then they'll still go back into the queue. Better to do it now IMO.

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

RE: RE: On the other

Quote:
Quote:
On the other issue, any interim suggestions for those of us with too much work to process? Strategy here is probably to process as much as possible, then abort stuff that won't make the deadline (although this means slow turn-arounds for my wingmen).

I would think it's far better to abort them now. There's no problem with aborting tasks; they go right back into the queue. If you hang on to them, it'll be 2 weeks before you abort them, and then they'll still go back into the queue. Better to do it now IMO.

I agree; aborted tasks are instantly put on the resend list and will quickly be sent to other hosts, which will be able to process and return them faster than your host, which has hundreds of them to do first... Not only is this good for the wingmen, it will also reduce the number of WUs that have to be kept in the DB servers.
It would also be better to abort the most recently received WUs, leaving a reasonable number of the older ones in your cache. Just sort the tasks by deadline and abort the ones with the longest deadlines first, until you have a doable number of tasks. (The idea behind keeping the ones with shorter deadlines is that the wingmen of those tasks have been waiting for them much longer than the wingmen of the others, and if you abort them now they will go to the bottom of the list of another host, which is going to take even more time.)
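
Something like this rough, untested sketch could automate that (Python, scraping boinccmd output; the field names and the deadline format of --get_tasks vary between client versions, so check them on your own host before letting it abort anything):

    #!/usr/bin/env python
    # Keep the KEEP tasks with the nearest deadlines and abort the rest;
    # the most recently fetched work has the longest deadlines. This
    # aborts regardless of state, so you may want to skip tasks that
    # have already started crunching.
    import subprocess
    from datetime import datetime

    KEEP = 50  # how many tasks to keep

    out = subprocess.check_output(["boinccmd", "--get_tasks"]).decode()

    tasks = []
    name = url = None
    for line in out.splitlines():
        line = line.strip()
        if line.startswith("name:"):
            name = line.split(":", 1)[1].strip()
        elif line.startswith("project URL:"):
            url = line.split(":", 1)[1].strip()
        elif line.startswith("report deadline:") and name and url:
            # e.g. "report deadline: Mon Jun 17 09:07:10 2013"
            stamp = line.split(":", 1)[1].strip()
            tasks.append((datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y"),
                          name, url))
            name = None

    tasks.sort()  # nearest deadline first
    for deadline, name, url in tasks[KEEP:]:
        print("aborting %s (due %s)" % (name, deadline))
        subprocess.call(["boinccmd", "--task", url, name, "abort"])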
