Of quotas, deferrals, and midnights

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7331401687
RAC: 2309281
Topic 223058

A couple of weeks ago I asserted

archae86 wrote:
There is a long-standing flaw in this picture, in that different components of the overall system have a different definition of when the "new day" boundary happens.  In particular, the bit that decides how long to put off the next request sent by your host does not agree on that boundary with the bit that decides whether to honor a request that is made.  This conflict has some secondary consequences that are both confusing and displeasing.

Gary Roberts brought up another matter

Gary Roberts wrote:
...an event that trashes the complete list of work locally for a particular search.
<snip>
and the client's response (not the server) has been to immediately issue a full 24-hour backoff, irrespective of the proximity to midnight UTC.

While Gary's description does provide a mechanism for differing end-of-deferral times, it matches none of the circumstances of my personal observations which prompted my guess of two different midnights.

Purely by happenstance, the recent jostling about of Gravity Wave GPU work availability here, coupled with my actions, generated an example.

When GW GPU work ceased to be issued a bit before noon UTC on July 7, my sole GW host had about 1.5 days of GW GPU tasks.  As I was due to run out of GW work in the middle of the night, I enabled GRP work with a very small queue (initially 0.1 + 0.05) on July 8.  When GW work dispatch resumed about 17:00 UTC on July 9 my host transitioned back to GW fetching, and resumed actual task work about 20:45 UTC.

In several stages I bumped up my queue request and built inventory of GW tasks, aiming to get fairly soon to the 2 to 3 day range I personally prefer.

At 02:30 UTC on July 10 I noticed that new work reporting was not getting any new tasks, and checked the most recent request log.  It advised that while 18,000+ seconds of GPU work was requested, none was granted, as the 416 daily maximum had been reached.  Reviewing my task download times suggested this was not the first denial, as the last task download was timestamped 01:36 UTC and download requests were frequent.

At the time I noticed all this, my projects tab reported a deferral in effect of 22:27:07, or until about 01:00 UTC the next day.

I went to bed thinking that, were I around to do a project update, my host would be granted new work sometime in the wee hours.  But I next visited the machine at 06:05 local (12:05 UTC) in the morning.  At that time the remaining deferral was 12:52:37, with 51 completed tasks ready to report.  I clicked project update and promptly received dozens of tasks.

There were no recent errors involved in this incident (I had aborted dozens of GW tasks in a personal deadline confusion error a few weeks earlier).

1. While it is a minor point, I don't see what method of counting found my host to have reached 416 tasks in "a day".  By normal human methods of counting, I think it was more like 300 or so, so long as we are talking about a "day" that lasts 24 hours, any 24 hours.  I may be wrong on this.

2. By the time I got deferred, a new UTC day had begun within which only about 50 tasks had been downloaded.  Clearly I did not get my "daily quota" counter reset at midnight UTC.

3. From previous experience, though not checked this time, I believe that had I done an update later in the evening, I'd still have gotten the daily quota exceeded denial.

4. I suspect I could have received work several hours sooner than the time I actually finally clicked.

Multiple mysteries here, so far as I am concerned.

1. Just how is the daily quota "day" implemented?
2. Why does the deferral displayed on the projects tab (and honored by BOINC on my host) mismatch the server-imposed download denial massively (12-hour mismatch observed and more suspected in this case)?
3. When really does the "penalty box" refusal to provide additional work on request terminate?

This specific case does not illustrate at least one vexing situation I've seen before, but I'll leave that out before this gets longer.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118897942360
RAC: 23453296

archae86 wrote:
Purely by happenstance, the recent jostling about of Gravity Wave GPU work availability here, coupled with my actions, generated an example.

I have seen a situation that might be the same thing.  I've noticed it a few times over the last few months.  I haven't been sufficiently inconvenienced to be motivated to do a proper investigation so please regard the following 'explanation' as speculative at best :-).

Last November (late southern spring) we had a foretaste of a long hot summer.  I decided to protect the fleet by shutting down over 50 of the oldest and most vulnerable hosts.  In the new year, I started refurbishing those ready for restarting when the cooler weather arrived.  Over the last 3 months, they have been progressively put back to work.

Quite often, I fire up such a machine late in the day.  I fill the cache progressively with a number of smaller work requests.  This can sometimes use up the full daily allowance.  The next morning at 10:00am local time it's midnight UTC.  If I try to get more tasks after that time, the server won't comply.  It will comply later in the day, seemingly after the machine has been running for a full 24 hours.  This only seems to happen for that first 24 hour period after a restart.

I've never been sufficiently inconvenienced to bother testing this properly.  I've just assumed that a 'new day' starts at midnight UTC, *provided* the search for which tasks are being requested has been running for at least 24 hours on that particular machine.  I can imagine that someone might have implemented a protection to prevent a newly joined host from "bunkering" for a challenge by deliberately getting 2 days worth of work from clever placement of work requests either side of midnight UTC.

None of my machines have a sufficiently productive GPU for this to ever be a problem for me so I haven't delved into it any further.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7331401687
RAC: 2309281

So at the moment, we have at least two live options for the "real" willingness to resume downloads after a daily quota pause.

1. 24 hours after "something"

2. Midnight "somewhere".

In my recent example, there seems nothing very plausible as a starting point for a 24-hour interval.  If it was midnight somewhere, somewhere was roughly between the US east coast and Hawaii.

I harbor a wicked suspicion that midnight "somewhere" just might be California.

If you ever run GRP jobs on a Radeon VII, you'll find the daily quota of much more interest.  Want one?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118897942360
RAC: 23453296

archae86 wrote:

So at the moment, we have at least two live options for the "real" willingness to resume downloads after a daily quota pause.

1. 24 hours after "something"

2. Midnight "somewhere".

If the 'problem' is truly, and only, that a daily quota has been reached, then I believe the scheduler's reaction is always to instruct the client to impose a variable time for a back-off which is not 24 hours but a calculated time to the next midnight UTC.  I have tested this out fairly fully and the precise messages in the event log support this.  The scheduler reactions to the various tests are listed in the following series of excerpts :-

13-Jul-2020 16:44:25 [Einstein@Home] Sending scheduler request: To fetch work.
13-Jul-2020 16:44:25 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
13-Jul-2020 16:44:28 [Einstein@Home] Scheduler request completed: got 0 new tasks
13-Jul-2020 16:44:28 [Einstein@Home] No work sent
13-Jul-2020 16:44:28 [Einstein@Home] No work is available for Gamma-ray pulsar search #5
13-Jul-2020 16:44:28 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
13-Jul-2020 16:44:28 [Einstein@Home] (reached daily quota of 288 tasks)
13-Jul-2020 16:44:28 [Einstein@Home] Project has no jobs available
13-Jul-2020 16:44:28 [Einstein@Home] Project requested delay of 64045 seconds

That request is timed at 16:44:25 in a UTC+10 timezone, ie. 06:44:25 UTC.  The back-off was for 17:47:25 - well short of 24 hrs.  The host was being instructed not to make further requests until 00:31:50 UTC on the next day.
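As a rough check on that arithmetic, here is a minimal Python sketch that converts the requested delay into h:mm:ss and works out when the back-off ends in UTC. The UTC+10 offset, the timestamp and the 64045 seconds come from the excerpt above; the function and variable names are made up for illustration. It starts from the 16:44:28 line that carries the delay, so it lands 3 seconds later than the figure quoted above.

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=10))  # the host's UTC+10 zone, per the post

def backoff_end_utc(logged_local: datetime, delay_seconds: int) -> datetime:
    """Return the UTC instant at which a requested delay expires."""
    end_local = logged_local + timedelta(seconds=delay_seconds)
    return end_local.astimezone(timezone.utc)

# Timestamp of the "Project requested delay" line and the delay it reports.
logged = datetime(2020, 7, 13, 16, 44, 28, tzinfo=LOCAL_TZ)
delay = 64045

print(timedelta(seconds=delay))        # 17:47:25
print(backoff_end_utc(logged, delay))  # 2020-07-14 00:31:53+00:00
```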

For the benefit of anyone reading this who is not familiar with daily limits and back-offs, here is a detailed explanation.  Please skip if you already fully understand this.

If your cache setting is such that lots of work is being requested, you may exceed a 'daily quota'.  The exact amount depends on numbers of both CPU and GPU 'cores' that BOINC has access to.  In the above example, there was 1 CPU core and one GPU.  The limit of 288 mentioned suggests that the 'device' limits are 32 for a CPU core and 256 for a GPU.  I had deliberately set BOINC's CPU cores limit to just 25% - 1 core on a quad core machine - to minimise the daily limit and to verify the exact numbers.  I was pretty sure the CPU limit was 32.

When making a work request where the scheduler notices that the daily limit would be exceeded (if granted in full), you will get a partial (or by chance, a full) supply that will take you right to the limit.  There is no warning that you have now reached the limit and no back-off at that precise point.  If there is a further work request before the next midnight UTC, it will be denied (as shown above) and you will get a back-off that takes you to some safety margin beyond the next midnight UTC.  I'm guessing this random extra bit (I've seen up to ~45 mins or so) is to prevent every host on the planet that might have been backed-off that particular day from all banging on the server at precisely the same moment.  I don't know this for sure but it seems likely.
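The rule guessed at above (time to the next midnight UTC, plus a safety margin) can be sketched in a few lines of Python. This is only a model of the observed behaviour, not the scheduler's actual code: the function name is hypothetical, and the margin is passed in explicitly because how the server picks it is unknown.

```python
from datetime import datetime, timedelta, timezone

def quota_backoff_seconds(now_utc: datetime, margin_seconds: int) -> int:
    """Seconds from now_utc to the next midnight UTC, plus a safety margin.

    A guess at the rule the scheduler appears to follow, nothing more.
    """
    next_midnight = (now_utc + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return int((next_midnight - now_utc).total_seconds()) + margin_seconds

# The denial above was logged at 16:44:28 UTC+10, i.e. 06:44:28 UTC.
# A margin of 1913 s (just under 32 minutes) reproduces the 64045 s delay.
now = datetime(2020, 7, 13, 6, 44, 28, tzinfo=timezone.utc)
print(quota_backoff_seconds(now, margin_seconds=1913))  # 64045
```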

There is an interesting consequence to this type of back-off.  It actually prevents the client from reporting completed work.  In the example above, I allowed the host to run for several hours and once it became very clear that the client wasn't going to report any tasks, I forced an update to clear the backlog.  The work cache was still set at the high value that had triggered the initial back-off as I wanted a further work request to happen as well as the reporting of completed work.  I had wondered if a second request whilst at the daily limit might create a full 24hr back-off.  At the time of the 'update', there was still over 14 hours showing as the remaining back-off.  Here is the relevant section of the event log:-

13-Jul-2020 20:04:17 [Einstein@Home] update requested by user
13-Jul-2020 20:04:19 [Einstein@Home] Sending scheduler request: Requested by user.
13-Jul-2020 20:04:19 [Einstein@Home] Reporting 11 completed tasks
13-Jul-2020 20:04:19 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
13-Jul-2020 20:04:24 [Einstein@Home] Scheduler request completed: got 32 new tasks
13-Jul-2020 20:04:24 [Einstein@Home] Project requested delay of 60 seconds

You can imagine my surprise at being granted more work rather than a further back-off.  So there's a definite bug.  As it was getting late (after 8pm) I decided to reduce the cache size and call it a day.  The next morning, I decided to continue to see how far the scheduler would 'cooperate' :-).  It took multiple further requests and a cache size of over 7.2 days, but here is what finally happened.  As you can see, the local time is still well before the original midnight UTC (10:00am local):-

14-Jul-2020 06:33:11 [Einstein@Home] Sending scheduler request: To fetch work.
14-Jul-2020 06:33:11 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
14-Jul-2020 06:33:15 [Einstein@Home] Scheduler request completed: got 0 new tasks
14-Jul-2020 06:33:15 [Einstein@Home] No work sent
14-Jul-2020 06:33:15 [Einstein@Home] No work is available for Gamma-ray pulsar search #5
14-Jul-2020 06:33:15 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
14-Jul-2020 06:33:15 [Einstein@Home] (reached daily quota of 288 tasks)
14-Jul-2020 06:33:15 [Einstein@Home] Project has no jobs available
14-Jul-2020 06:33:15 [Einstein@Home] Project requested delay of 12510 seconds

So, a second quota of 288 tasks and a new back-off of 3:28:30 which takes me to 10:01:45 local - less than 2 mins after the original midnight.  Still nothing like a 24hr back-off for exceeding 2 daily quotas within the one 'day' :-).

Despite the huge amount of work, I decided to attempt to continue.  The previous experiment had made a work request in conjunction with a reporting of 11 completed tasks.  I allowed a further number of completed tasks to accumulate.  This time, the two operations were done separately.  The reporting was done first with a low cache setting.  This cleared the remaining portion of the new back-off.  As expected for a 'no work' request, the result was just the normal 60 sec delay.  At 09:15am local, it's still ~45 mins before the original midnight UTC:-

14-Jul-2020 09:14:52 [Einstein@Home] update requested by user
14-Jul-2020 09:14:54 [Einstein@Home] Sending scheduler request: Requested by user.
14-Jul-2020 09:14:54 [Einstein@Home] Reporting 8 completed tasks
14-Jul-2020 09:14:54 [Einstein@Home] Not requesting tasks: don't need (CPU: ; AMD/ATI GPU: job cache full)
14-Jul-2020 09:14:58 [Einstein@Home] Scheduler request completed
14-Jul-2020 09:14:58 [Einstein@Home] Project requested delay of 60 seconds

Then I upped the cache setting to trigger a further work request.  Here's what happened:-

14-Jul-2020 09:15:58 [Einstein@Home] Sending scheduler request: To fetch work.
14-Jul-2020 09:15:58 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
14-Jul-2020 09:16:01 [Einstein@Home] Scheduler request completed: got 0 new tasks
14-Jul-2020 09:16:01 [Einstein@Home] No work sent
14-Jul-2020 09:16:01 [Einstein@Home] No work is available for Gamma-ray pulsar search #5
14-Jul-2020 09:16:01 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
14-Jul-2020 09:16:01 [Einstein@Home] (reached daily quota of 288 tasks)
14-Jul-2020 09:16:01 [Einstein@Home] Project has no jobs available
14-Jul-2020 09:16:01 [Einstein@Home] Project requested delay of 5726 seconds

This time the scheduler hasn't 'forgotten' about the daily limit and has issued a further back-off of 1.6hrs, ie. to around 50 mins after the original midnight UTC.  By this time I was getting sick of this 'game' so just waited until just after the original midnight UTC before trying again.  In retrospect, it would have been smarter to let the back-off count down to zero before trying again.  However, I wasn't smart enough and forced the next work request at around 6 mins after the original midnight.  Here's the result:-

14-Jul-2020 10:06:06 [Einstein@Home] Sending scheduler request: To fetch work.
14-Jul-2020 10:06:06 [Einstein@Home] Requesting new tasks for AMD/ATI GPU
14-Jul-2020 10:06:10 [Einstein@Home] Scheduler request completed: got 0 new tasks
14-Jul-2020 10:06:10 [Einstein@Home] No work sent
14-Jul-2020 10:06:10 [Einstein@Home] No work is available for Gamma-ray pulsar search #5
14-Jul-2020 10:06:10 [Einstein@Home] No work is available for Gamma-ray pulsar binary search #1 on GPUs
14-Jul-2020 10:06:10 [Einstein@Home] (reached daily quota of 288 tasks)
14-Jul-2020 10:06:10 [Einstein@Home] Project has no jobs available
14-Jul-2020 10:06:10 [Einstein@Home] Project requested delay of 88176 seconds

At last, the scheduler has caught up with me :-).  I have two full daily limits worth of tasks and the scheduler has now set a back-off of almost 24.5 hours which will take me to around 30 mins after the 'correct' midnight UTC, if allowed to run its course.  Things might have been a little different if I'd waited longer for the previous 1.6hr back-off to completely expire.  However, to my mind, exceeding the daily limit always attempts to create a requested delay that is associated with midnight UTC (plus margin), even if there are bugs in the process.  It's never a full 24hrs unless that's the appropriate delay to get to the desired point beyond midnight UTC.
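To put numbers on the "midnight UTC plus margin" pattern, the four quota denials above can be run through a short Python sketch that reports how far past midnight UTC each back-off would end. The timestamps and delays are copied from the excerpts and the UTC+10 offset is as stated earlier; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

LOCAL_TZ = timezone(timedelta(hours=10))  # UTC+10, as stated for these logs

# (local time of the "Project requested delay" line, delay in seconds)
OBSERVATIONS = [
    (datetime(2020, 7, 13, 16, 44, 28, tzinfo=LOCAL_TZ), 64045),
    (datetime(2020, 7, 14, 6, 33, 15, tzinfo=LOCAL_TZ), 12510),
    (datetime(2020, 7, 14, 9, 16, 1, tzinfo=LOCAL_TZ), 5726),
    (datetime(2020, 7, 14, 10, 6, 10, tzinfo=LOCAL_TZ), 88176),
]

def margin_past_midnight(logged_local: datetime, delay_seconds: int) -> int:
    """Seconds between the midnight UTC just passed and the end of the back-off."""
    end_utc = (logged_local + timedelta(seconds=delay_seconds)).astimezone(timezone.utc)
    midnight = end_utc.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((end_utc - midnight).total_seconds())

margins = [margin_past_midnight(t, d) for t, d in OBSERVATIONS]
print(margins)  # [1913, 105, 3087, 2146]
```

All four back-offs end between about 2 and 52 minutes after a midnight UTC, which fits the variable safety margin described earlier.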

If anyone sees a full 24hr back-off where the interval to the next midnight UTC is way less than 24hrs, there must be something else besides the daily quota causing it.  Trawling through the event log (stdoutdae.txt) - which should probably still be available in the archae86 case - might reveal exactly what that something else might be.

archae86 wrote:
If you ever run GRP jobs on a Radeon VII, you'll find the daily quota of much more interest.

Not really :-).  I'd just set the host to have some unimaginably large number of processors and forget about it :-).

archae86 wrote:
Want one?

If I did, it would be on the condition that the seller had stipulated a proper price that was in agreement with the proper value (to both parties) of the product.

I tend to avoid 'top end' devices.  When they were first available in Aus, I saw prices way above what I thought was reasonable - something like $AU1,100 - 1,200, if I remember correctly and I promptly lost interest.  At the moment, I don't really have much idea of 2nd hand values.  My next purchase might be an 8GB RX 570 that is around $AU230 - something like $US160 on current exchange rates.  I'd really like to use something like that on the current VelaJr tasks to see what difference extra VRAM makes.  None of my current 570s have more than 4GB.

Cheers,
Gary.

Tom M
Joined: 2 Feb 06
Posts: 6728
Credit: 9712676888
RAC: 2504146

Gary Roberts wrote:
archae86 wrote:
If you ever run GRP jobs on a Radeon VII, you'll find the daily quota of much more interest.
Not really :-).  I'd just set the host to have some unimaginably large number of processors and forget about it :-).

How/Where do you set the host to "some unimaginably large number of processors"?

Thank you.

Tom M

A Proud member of the O.F.A.  (Old Farts Association).

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7331401687
RAC: 2309281

Tom M wrote:
How/Where do you set the host to "some unimaginably large number of processors"?

There may be more than one way, but the one I know of involves the <options> section of a cc_config.xml file in the BOINC directory.

Here is an example of a line that specifies you have 6 cpus, which will take the place of whatever number would automatically have been obtained.

        <ncpus>6</ncpus>

As the machine you are pointing to already reports 32 processors, you'd need to set a higher number than 32 to get any positive effects.
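For context on where that line goes: in cc_config.xml the <ncpus> element sits inside <options>, which in turn sits inside the top-level <cc_config> element. A minimal Python sketch that builds that nesting follows; the value 64 is only an illustration and the function name is made up. If a cc_config.xml already exists, the line should be added to its existing <options> section rather than overwriting the file.

```python
import xml.etree.ElementTree as ET

def build_cc_config(ncpus: int) -> str:
    """Return minimal cc_config.xml content that overrides the CPU count."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "ncpus").text = str(ncpus)
    return ET.tostring(root, encoding="unicode")

print(build_cc_config(64))
# <cc_config><options><ncpus>64</ncpus></options></cc_config>
```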

I believe the current quota calculation is 32 * ncpus + 256 * GPUs, with no adjustments for performance.
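As a quick illustration of that formula, here is a small Python sketch. The per-device figures of 32 and 256 are the ones believed correct in this thread, not confirmed project constants, and the function name is hypothetical.

```python
CPU_QUOTA = 32   # tasks per CPU per day, as believed in this thread
GPU_QUOTA = 256  # tasks per GPU per day, as believed in this thread

def daily_quota(ncpus: int, ngpus: int) -> int:
    """Daily task limit under the assumed 32 * ncpus + 256 * GPUs rule."""
    return CPU_QUOTA * ncpus + GPU_QUOTA * ngpus

print(daily_quota(1, 1))   # 288
print(daily_quota(32, 2))  # 1536
```

The first result matches the "reached daily quota of 288 tasks" message in the log excerpts earlier in the thread, which came from a host limited to 1 CPU core and 1 GPU.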

Catches:

1. There is a weasel word, something like "available", applied to these counts.  So if on the one hand you tell it you have 49 CPUs, but on the other hand tell it to use no more than 87% of CPUs, you won't get the full quota increase.  As you report a quota limit of 1408, that suggests that currently only 28 of your actual 32 CPUs are recognized as available for the purpose of the quota computation.  Perhaps you have set Preferences|Computing|use at most nn % of processors to something near 88?

2. There is a maximum number of cpus recognized for this quota computation.  I think it is currently at least 50, and possibly as high as 100, but it is definitely not "unimaginably large".  I hope it just clamps to the maximum if you exceed it, but I've not seen direct evidence.

3. If you are actually running both GPU and CPU applications, and falsify the cpu number higher than what is real, you'll find BOINC will start extra tasks to use the extra capability.  That is likely to be inefficient.

3a. For extra credit you could perhaps lie with an app_config.xml entry that claims your applications need more than one CPU per task to manage things, so you both get a higher quota and also only start the number of CPU and GPU tasks you actually wish.  I'm not sure this works, as I've neither experienced nor seen what the behavior is when you specify cpu_usage as a number greater than 1.  If you do go this route, be aware that backing out of it is a bit more complicated than just removing the file.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5879
Credit: 118897942360
RAC: 23453296

archae86 wrote:
...  As you report a quota limit of 1408, ...

I've just responded to the post where he quoted that number.  Since he has thousands of aborted tasks, that number is no doubt transient and moving up or down as he either completes or aborts tasks.  I imagine the full quota is 32x32+256x2=1536.

Gavin has a host with an i5-3570k CPU that is listed in the top hosts list and shows 128 processors.  With the best CPU in my fleet having 6C/12T, 128 seems unimaginably large to me :-).

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3163
Credit: 7331401687
RAC: 2309281

Gary Roberts wrote:

archae86 wrote:
...  As you report a quota limit of 1408, ...

I've just responded to the post where he quoted that number.  Since he has thousands of aborted tasks, that number is no doubt transient and moving up or down as he either completes or aborts tasks.  I imagine the full quota is 32x32+256x2=1536.

Gavin has a host with an i5-3570k CPU that is listed in the top hosts list and shows 128 processors.  With the best CPU in my fleet having 6C/12T, 128 seems unimaginably large to me :-).

The weasel word I mentioned that Bikeman used when he posted a policy change in 2012 was "usable".

I actually see this "usable" constraint in action on one of my own systems, which has 6 real CPUs and 1 real GPU, yet hits and reports a daily quota limit of 416 rather than the 448 it would have by the formula without regard to "usable".   That machine has the "use at most nn% of processors" set to 85%.
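The effect of the "usable" constraint can be sketched in Python as below. The rounding rule is an assumption (the usable CPU count is taken as the percentage rounded down, with at least one CPU kept); it is chosen because it reproduces the 416 figure, not because the server is known to compute it this way. The function names are hypothetical.

```python
import math

def usable_cpus(ncpus: int, max_percent: float) -> int:
    # Assumption: the fraction is rounded down, with at least one CPU kept.
    return max(1, math.floor(ncpus * max_percent / 100))

def daily_quota(ncpus: int, ngpus: int, max_percent: float = 100.0) -> int:
    """Quota under the assumed rule 32 * usable CPUs + 256 * GPUs."""
    return 32 * usable_cpus(ncpus, max_percent) + 256 * ngpus

print(daily_quota(6, 1, 85))   # 416  (6 CPUs, 1 GPU, 85% setting)
print(daily_quota(6, 1))       # 448  (same host with no restriction)
print(daily_quota(32, 2, 88))  # 1408 (the 88% setting guessed at earlier in the thread)
```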

Yes, I do understand that one transiently loses daily quota in response to errors.  But the doubling of the reduced amount for each successfully returned task re-establishes the full limit very, very quickly on high productivity machines.  2**11 is already 2048.

In that same comment, Bikeman reported that as of that update in 2012 the maximum numbers for quota calculation purposes were 8 GPUs and 64 CPUs.  Those numbers may well have changed since.  (The quota per GPU increased from 160 at that time to 256 now, while the quota per CPU has stayed at 32.)  I suspect the limits for quota purposes are not necessarily the same as the limits for some other purposes.
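A small Python sketch of how quickly doubling recovers a reduced quota. It assumes the quota doubles once per successfully returned task and is capped at the full limit; the worst-case starting value of 1 is an assumption for illustration, and the function name is made up.

```python
def successes_to_restore(reduced: int, full: int) -> int:
    """Count successful returns needed for a doubling quota to reach full."""
    if reduced < 1:
        raise ValueError("reduced quota must be at least 1")
    count = 0
    quota = reduced
    while quota < full:
        quota = min(quota * 2, full)  # double, but never exceed the full limit
        count += 1
    return count

print(successes_to_restore(1, 416))   # 9
print(successes_to_restore(1, 2048))  # 11
```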
