FGRPB1G work shortage

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040243966
RAC: 22410153

Ian&Steve C. wrote:
... been happening well before the issues with work availability.

Whilst that's certainly true, it's happening much more regularly of late.

I've been monitoring last RPC time for all my hosts for many years.  At most, there might have been one or two instances of a useless 24hr back-off per year, over the entire fleet.  It was quite a rare thing for me to see.  After the work availability issue arose, I've seen at least six in that relatively short time.  Of course, this doesn't prove anything, but the observation of increasing frequency might point to something related.  In any case, it needed to be brought to Bernd's attention.

For the benefit of others reading and since the FGRPB1G shortage seems likely to be around for a while, I'll share some thoughts about what has worked for me in trying to maintain a steady work cache size.  Currently, all my hosts doing FGRPB1G seem to be able to do this.

My work fetch scheme had always centered around a 0.05 day default cache size with zero extra days.  A script runs 6 times a day (every 4 hrs) to do a number of things:

  1. Make sure each host is running normally.  Flag it for attention if not.
  2. Store/retrieve the 'current' large data files (not template files) from a local file share using rsync.  The idea is to download just once and deploy to all, to limit unnecessary data downloads and to reinstate deleted copies when needed to cover future resends for at least a couple of months.
  3. Increase the 0.05 day cache to a new value (eg 2 to 3 days) to keep enough work to last through all but the most severe outages.
  4. After the cache has been topped up, return to 0.05 days to prevent further work fetch activity until the next run, 4 hrs later.  (Steps 3 and 4 are sketched in code just below.)

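For anyone wanting to set up something similar, here's a stripped-down sketch of steps 3 and 4.  It assumes the standard global_prefs_override.xml mechanism and boinccmd; the data directory path, the exact cache values and the 20-minute fetch window are illustrative guesses rather than exactly what I run:

    #!/bin/bash
    # Sketch of steps 3 and 4: open the cache up, let BOINC top it up,
    # then close it again until the next run.
    PREFS=/var/lib/boinc/global_prefs_override.xml   # adjust for your install

    set_cache () {
        # Write a minimal override file with the requested cache size and
        # zero extra days, then tell the client to re-read it.
        {
            echo '<global_preferences>'
            echo "  <work_buf_min_days>$1</work_buf_min_days>"
            echo '  <work_buf_additional_days>0</work_buf_additional_days>'
            echo '</global_preferences>'
        } > "$PREFS"
        boinccmd --read_global_prefs_override
    }

    set_cache 3       # step 3: raise the cache so BOINC fetches work
    sleep 1200        # give the client time to top up (20 min, a guess)
    set_cache 0.05    # step 4: drop back to prevent further fetches
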
Of course, a new host joining the fleet is manually loaded with tasks over a period before joining the above automated scheme.  The idea of the scheme is just to top up less frequently, whilst avoiding scheduler contacts and unnecessary downloads as much as possible.

I've had to change the above to cope with the fact that many work requests are met with 'got 0 new tasks'.  Work requests would cease for too long if the cache was reduced to 0.05 days between runs.  It has to stay at the desired level at all times to maximise the chance of getting enough work.

The default cache size is now 3 days, which is incremented to 3.1 days during runs every 3 hours.  This means that missing large data files get checked and replenished a little more regularly than previously, and that the hosts will be able to keep requesting work between runs whenever the work on hand drops below 3 days.  Unfortunately, this necessarily results in a lot more scheduler contacts being attempted.  It seems to be working well: every time I look at an individual host, it has approximately the full 3 days of work.  They need to do it themselves since I don't use any form of extra intervention to force additional work requests.
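Using the set_cache helper from the sketch above, the revised scheme amounts to little more than this during each 3-hourly run (again illustrative only):

    set_cache 3.1    # nudge above the 3-day default so a fetch can trigger
    # ... steps 1 and 2 (health checks, rsync of large data files) here ...
    set_cache 3      # back to the 3-day default; requests continue as needed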

I've sometimes noticed people using a 'split' cache, eg. say x days + y extra days.  In the current environment, that's likely to be a bad idea because even if the cache got topped up to x+y days, it stops BOINC from requesting more work again until the amount on hand drops below x days.  It might be hard to get back up to the full x+y days after that.

EDIT:  It took me a while to compose the above and I hadn't seen the response from Bernd before I posted.  Thanks to him for dealing with the problem so quickly.


Cheers,
Gary.

Link
Joined: 15 Mar 20
Posts: 97
Credit: 605605
RAC: 291

Gary Roberts wrote:
I've sometimes noticed people using a 'split' cache, eg. say x days + y extra days.  In the current environment, that's likely to be a bad idea because even if the cache got topped up to x+y days, it stops BOINC from requesting more work again until the amount on hand drops below x days.  It might be hard to get back up to the full x+y days after that.

If there's enough work, that would make your 3+4 unnecessary. I've used that approach on my HD3850 crunching Moo. Since BOINC reports completed tasks after at most 1 hour, I set the additional days to 0.15, and that limited scheduler requests to the ones BOINC would make anyway. Old BOINC clients reported at the latest after 24 hours; with newer versions you can't limit contact to every 4 hours with a low cache setting unless you also disallow network connections.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3716
Credit: 34693533075
RAC: 26736263

The best way to get work will be to maintain slow hosts (~4M ppd, ~1150 tasks/day max on Gamma Ray, though that might have gone up slightly; I need to re-evaluate). That's what Gary does, and it's the ultimate reason he's been largely unaffected by this issue: he has 100+ hosts with slow AMD Polaris and older GPUs. Hosts that slow will have no problem, since they complete work much more slowly than the server is able to refill it, especially if they're set up to ask for work very often or on every RPC. The closer you get to the limit, the more often you need to ask for work, since the server is only sending a handful of tasks on each request, and sometimes none. Once your host's production exceeds that threshold, your cache starts to drain.

Think of it like a sink filling with water. Your host is the sink drain (your host crunching work), the project is the faucet (sending you work to crunch), the level of water in the sink is your work cache, and the interval at which the faucet is turned on is how often you request work (the project-set minimum interval is 60 seconds). If the rate of flow through the drain is higher than the flow from the faucet, the water level drops and the sink empties, and that's exactly the situation that's happening.
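To put rough numbers on the analogy (back-of-envelope only; the one-task-per-request figure is an assumption, since the server grants a handful at best and sometimes none):

    # Max possible inflow at one request per 60 s, assuming ~1 task granted
    # per request (an assumption; it's often a handful, sometimes none).
    tasks_per_request=1
    requests_per_day=$(( 24 * 60 ))                      # 1440 requests/day
    faucet=$(( tasks_per_request * requests_per_day ))   # max tasks/day in
    drain=1150      # roughly what a ~4M ppd host completes per day
    echo "inflow: $faucet tasks/day, outflow: $drain tasks/day"
    # A host that completes much more than ~1440 tasks/day can never keep
    # its sink full, no matter how often it asks.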

Folks with more slow hosts are effectively able to ask for work more often (per user, on average), since the project limits each host to one request every 60 seconds. A 3-host user can ask at most 3 times a minute, while a 100-host user can ask for work 100 times a minute.

Of my 3 active hosts, only the 2x RTX 3060 machine is able to maintain constant work with FGRPB1G, and only because it's brute-forcing RPCs every 65 seconds. The other, faster hosts (normally ~23M and ~17M ppd respectively) cannot maintain any cache on FGRPB1G alone and require supplementing with BRP7 to stay busy.


Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040243966
RAC: 22410153

Bernd Machenschalk wrote:
... Fixed, should not occur again.

Unfortunately, there was another one last night.

07-Nov-2022 17:59:59 [Einstein@Home] Sending scheduler request: To fetch work.
07-Nov-2022 17:59:59 [Einstein@Home] Reporting 1 completed tasks
07-Nov-2022 17:59:59 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
07-Nov-2022 18:00:02 [Einstein@Home] Scheduler request completed: got 0 new tasks
07-Nov-2022 18:00:02 [Einstein@Home] platform 'x86_64-pc-linux-gnu' not found
07-Nov-2022 18:00:02 [Einstein@Home] Project requested delay of 86400 seconds

As you can see, that's 6.00PM local (8.00AM UTC), which is around an hour after your post.

Hopefully, it might just mean that your fix took some time to be effective.  If any more occur I'll continue to report them.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040243966
RAC: 22410153

Link wrote:

Gary Roberts wrote:
I've sometimes noticed people using a 'split' cache, eg. say x days + y extra days.  In the current environment, that's likely to be a bad idea because even if the cache got topped up to x+y days, it stops BOINC from requesting more work again until the amount on hand drops below x days.  It might be hard to get back up to the full x+y days after that.
If there's enough work, that would make your 3+4 unnecessary

I don't understand the point you were trying to make.

You were responding to my comment that using an x+y days cache setting was probably a bad idea.  At first I thought you were suggesting I was using 3+4 days.  I ruled that out because it was clear that I was recommending zero for extra days.

On re-reading my whole post, it occurred to me that what you might be saying was that it was unnecessary to run scripts at either 3 hour or 4 hour intervals.  Since BOINC is supposed to be capable of 'set and forget' operation, then yes, running scripts at regular intervals would be 'unnecessary' for people with 'normal' numbers of hosts.  Since I run an insane number, I feel that I should try to stop them 'banging on the door' as much as possible.  In other words, I try to minimise my impact.

If neither of those two things addresses your point, could you please explain what you meant?

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110040243966
RAC: 22410153

Ian&Steve C. wrote:
Folks with more slow hosts are effectively able to ask for work more often (per user, on average), since the project limits each host to one request every 60 seconds. A 3-host user can ask at most 3 times a minute, while a 100-host user can ask for work 100 times a minute.

Thanks for taking time to add your comments after Link's query.  For the benefit of those with lesser experience, I thought it might be useful to point out that there are other limiting factors apart from the standard project enforced 60 sec delay.

The boinc client itself imposes an additional backoff when a work fetch event fails to get any work.  So, each consecutive fetch failure causes the client to stop requesting for an extended period which can be many minutes or longer.  The period keeps increasing with each failure and is eventually reset back to the default 60 secs when an 'in-progress' task completes.  So, under normal circumstances, a 100-host user is very unlikely to be making anything like 100 requests per minute.

Out of interest, I chose one of my faster hosts and counted up the number of work requests for a multi-day period.  I just used the Linux utility 'grep' to search stdoutdae.txt for the string "Sending scheduler request: To fetch work" and came up with 2772 requests for a period of 11.5 days - ie. a request every 6 mins on average.  I guess it would be a longer time for slower hosts, of which I do have quite a few.
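For anyone wanting to repeat that count, the command was essentially just this, run in the BOINC data directory (grep's -c option counts the matching lines):

    grep -c "Sending scheduler request: To fetch work" stdoutdae.txt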

In my original post, I made a point of saying, "They need to do it themselves since I don't use any form of extra intervention to force additional work requests."  It's difficult for everybody to get what they need, so I choose to let BOINC do its thing at its own pace and take my chances.

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 11971
Credit: 1834045326
RAC: 225382

Ian&Steve C. wrote:

Of my 3 active hosts, only the 2x RTX 3060 machine is able to maintain constant work with FGRPB1G, and only because it's brute-forcing RPCs every 65 seconds. The other, faster hosts (normally ~23M and ~17M ppd respectively) cannot maintain any cache on FGRPB1G alone and require supplementing with BRP7 to stay busy.

Are you still running multiple tasks at a time on the GPUs that can do so? If so, that's another reason your 'sink' is draining faster than those of people with less capable GPUs.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3716
Credit: 34693533075
RAC: 26736263

I'm ignoring the client's backoff behavior since it's possible to override that and force scheduler connections on time, every time, with a script using boinccmd. This is what I do to make sure it requests work every chance it gets. I'm talking about the maximum requests possible and allowable; notice I said hosts "can" ask for work, not "do" ask for work. I would request work more often to help get more work if I were allowed to.
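The script is essentially just a loop like this (simplified; boinccmd's "update" op forces an immediate scheduler contact, and the URL shown is illustrative, so check boinccmd --get_project_status for the exact master URL your client uses):

    # Force a scheduler RPC every 65 seconds, overriding the client backoff.
    while true; do
        boinccmd --project http://einstein.phys.uwm.edu/ update
        sleep 65
    done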

But your hosts are slow enough not to need a request every minute anyway.


Ian&Steve C.
Joined: 19 Jan 20
Posts: 3716
Credit: 34693533075
RAC: 26736263

mikey wrote:

Ian&Steve C. wrote:

Of my 3 active hosts, only the 2x RTX 3060 machine is able to maintain constant work with FGRPB1G, and only because it's brute-forcing RPCs every 65 seconds. The other, faster hosts (normally ~23M and ~17M ppd respectively) cannot maintain any cache on FGRPB1G alone and require supplementing with BRP7 to stay busy.

Are you still running multiple tasks at a time on the GPUs that can do so? If so, that's another reason your 'sink' is draining faster than those of people with less capable GPUs.

Intentionally slowing production for the sake of maintaining a cache is counterproductive.

But even at 1x those hosts would still be much too fast; moving to 1x would probably only drop overall production by about 10-20%.

I just set up two venues for crunching profiles: one for BRP7 and one for Gamma Ray. BRP7 is slower, so I fill the cache on the BRP7 profile, then switch to the Gamma Ray profile so the host asks for Gamma Ray tasks and backfills with those while slowly crunching BRP7. When BRP7 runs out, it crunches through whatever cache of GR has amassed, and when that gets close to running out I switch back to the BRP7 profile. Rinse. Repeat.

What I really need is a scheme of "worker bee" hosts to fetch work en masse and shuffle it along to my big hosts for crunching and reporting. I "could" split my multi-GPU hosts into single-GPU logical hosts with separate clients, but I really don't like split stats.


Link
Joined: 15 Mar 20
Posts: 97
Credit: 605605
RAC: 291

Gary Roberts wrote:

Link wrote:

Gary Roberts wrote:
I've sometimes noticed people using a 'split' cache, eg. say x days + y extra days.  In the current environment, that's likely to be a bad idea because even if the cache got topped up to x+y days, it stops BOINC from requesting more work again until the amount on hand drops below x days.  It might be hard to get back up to the full x+y days after that.
If there's enough work, that would make your 3+4 unnecessary

I don't understand the point you were trying to make.

You were responding to my comment that using an x+y days cache setting was probably a bad idea.  At first I thought you were suggesting I was using 3+4 days.  I ruled that out because it was clear that I was recommending zero for extra days.

On re-reading my whole post, it occurred to me that what you might be saying was that it was unnecessary to run scripts at either 3 hour or 4 hour intervals.  Since BOINC is supposed to be capable of 'set and forget' operation, then yes, running scripts at regular intervals would be 'unnecessary' for people with 'normal' numbers of hosts.  Since I run an insane number, I feel that I should try to stop them 'banging on the door' as much as possible.  In other words, I try to minimise my impact.

If neither of those two things addresses your point, could you please explain what you meant?

I meant that setting 0.15 (or more) additional days limits BOINC scheduler requests to one within an hour of finishing a WU (unless you also disable network access). You can't stop BOINC from reporting tasks any later than 1 hour after the result file was uploaded (older clients reported at the latest 24 hours after upload). The additional-days setting is there to limit scheduler requests, but with newer BOINC versions it doesn't change that much.

So points 3 & 4 in your script (changing the cache settings) are actually unnecessary (unless you forgot to mention that you disable net access until the next run of the script, or you run an old version of BOINC), because BOINC will contact the scheduler about once an hour anyway, not once in 4 hours, and it might as well request tasks at the same time. Your script might even create one or more additional scheduler requests if the last report happened shortly before the script runs.

