System suddenly not reserving CPU for GPU tasks

Glenn Hawley, RASC Calgary
Joined: 6 Mar 05
Posts: 48
Credit: 893379976
RAC: 347679
Topic 224321

Starting December 23 my average credit rate started to drop off markedly, from about 550,000 per day to less than 500,000.

For months I've been running 2 GPU workunits and 14 CPU workunits on a 16 core CPU. WU Allocation has been automatic and smooth.

Now, I see that only one GPU workunit is running, but all 16 cores are also processing CPU workunits... so there's wasteful task swapping going on and my GPU is seriously underutilized.

I've tried hard-rebooting my machine, checked the GPU operating settings using Radeon's tool. I've tried suspending or aborting workunits to see if the correct replacement ones are picked up. I've reverified my Processing Preferences settings, showing utilization factor of 0.5 for the GPU for the "school" setting for this machine.

All to no avail, so I've reduced the cache to "Store at least 0.1 days of work" and, for the time being, set "No new tasks" to reduce the amount I have to mess with.

I am mystified. 

mikey
Joined: 22 Jan 05
Posts: 12693
Credit: 1839100036
RAC: 3700


Since you only have 2 PCs, go into BOINC Manager and, under Options, Computing preferences, on the Computing tab, set the top box to 99. The PC will then automatically reserve a CPU core for the GPU to use, or for anything else the PC wants to do that requires a CPU core, e.g. A/V scans, backups, etc. Be sure to click Save at the bottom of the page. You should then see a CPU task change to 'waiting to run', meaning a CPU core is available for the GPU to use.

Glenn Hawley, RASC Calgary

Now I'm REALLY mystified, for the thing is now working properly after I cut back the number of days' worth of work stored and set "No new tasks". The latter should not have come into effect anyway, since I previously had "Store at least 1.0 days of work" selected, and it would take time to catch up.

While I'm very happy to see it seems to be running again, I find it aesthetically unappealing to not know why it got fixed, nor why it was buggered in the first place.

 

Glenn Hawley, RASC Calgary

Now it's buggered again!!! Running 17 WUs, 16 CPU WUs and 1 GPU.

And I didn't do anything explicitly that would explain it.

Glenn Hawley, RASC Calgary

So, I once more set "No new tasks", and as soon as it completed the one GPU unit, it started two of them and kicked two of the CPU units off (waiting to run).

But it seems to me that "No new tasks" should not have such an effect, so I'll wait until these two WUs have finished, then allow new tasks again and see if that buggers up the scheduling again.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117704025758
RAC: 35063705

Glenn Hawley, RASC Calgary wrote:
But it seems to me that "No new tasks" should not have such an effect, so I'll wait until these two WUs have finished, and allow new tasks again and see if that buggers up the scheduling again.

It's really nothing to do with that setting (NNT).  This problem is just a continuation of your previous one.  BOINC is now in 'Panic Mode' because it thinks tasks are going to miss deadlines.

This was created initially when you were trying to encourage new tasks after you had hit the 5GB disk space limit.  You had been increasing the work cache size to no effect.  When you eventually increased the allowed disk space from 5GB to 50GB, you forgot to reduce your excessive work cache first.  You got a very large slab of work, as I mentioned in the previous thread.  At that time you had the full 14 day deadline so no panic mode initially.

Now, several days later, BOINC is starting to think that tasks are going to miss deadlines.  The main thing that has gone wrong is that using all 16 threads on a host with 8 true cores, with the GPU requirements to contend with as well, has meant that the odd CPU task has taken way longer than usual.  Take a look at the completed times for GW CPU tasks.  Values range from around 90,000 secs to a high of ~140,000 secs.  You only need one task reaching that high value and BOINC will revise the estimate for every task up to that value.  That's probably why BOINC is in panic mode all of a sudden.
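To put rough numbers on why a higher estimate matters, here is a back-of-the-envelope sketch in Python. It is only an illustration: it assumes tasks run in full batches of 14 at a single estimated run time, which is a simplification and not how BOINC's scheduler actually simulates deadlines. The function name `days_to_clear` is made up for this example; the figures (107 GW CPU tasks, 90,000 to 140,000 secs) are the ones quoted in this post.

```python
import math

SECONDS_PER_DAY = 86_400

def days_to_clear(n_tasks, concurrent, est_secs_per_task):
    """Days needed if tasks run in full batches of `concurrent`,
    each batch taking the estimated run time (a simplified model)."""
    batches = math.ceil(n_tasks / concurrent)
    return batches * est_secs_per_task / SECONDS_PER_DAY

# 107 GW CPU tasks, 14 running at a time, at the low and high run times
low = days_to_clear(107, 14, 90_000)
high = days_to_clear(107, 14, 140_000)

print(f"at  90,000 s per task: {low:.1f} days")
print(f"at 140,000 s per task: {high:.1f} days")
```

Under these assumptions the same queue needs roughly 8.3 days at the low estimate but about 13 days at the high one, which leaves very little room once several days of a 14 day deadline have already gone.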

The fix doesn't involve NNT.  You've done the first bit by setting the work cache to 0.1 days.  Make sure the second (extra days) setting is 0.0 days.  You currently have (when I checked) 107 GW CPU tasks in progress.  You also have another 47 Gamma-ray CPU tasks.  You just need to suspend enough tasks to allow BOINC to drop out of panic mode.  Pick tasks with the longest deadlines, say 10 GRP tasks and 20 GW tasks and suspend those 30 by highlighting them in BOINC Manager (tasks tab) and clicking the suspend button.  BOINC should immediately drop out of panic mode and everything should settle down.
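The "pick tasks with the longest deadlines" step can be sketched as a simple sort. This is only an illustration of the selection rule, not something that talks to BOINC: the task names and `days_left` values below are invented, and `pick_to_suspend` is a hypothetical helper. In practice you would do the same thing by sorting the Deadline column in BOINC Manager's tasks tab.

```python
def pick_to_suspend(tasks, count):
    """Return the `count` tasks whose deadlines are furthest away.

    Each task is a dict with a 'name' and 'days_left' (days until deadline).
    """
    by_longest_deadline = sorted(tasks, key=lambda t: t["days_left"], reverse=True)
    return by_longest_deadline[:count]

# Hypothetical task list, for illustration only
tasks = [
    {"name": "gw_task_a", "days_left": 3.5},
    {"name": "gw_task_b", "days_left": 9.0},
    {"name": "grp_task_c", "days_left": 11.2},
    {"name": "gw_task_d", "days_left": 4.1},
    {"name": "grp_task_e", "days_left": 10.4},
]

to_suspend = pick_to_suspend(tasks, 2)
print([t["name"] for t in to_suspend])
```

The short deadline tasks (`gw_task_a`, `gw_task_d` in this made-up list) stay runnable, which is the point: they are the ones BOINC is worried about.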

The next step is to allow BOINC to start reducing the high task estimates that have created the panic.  You will best achieve that by making sure that there are NEVER more than 14 max concurrent CPU tasks.  You should keep running 2 GPU tasks on the RX 580.  One thread to support each one should be quite enough.  If you need to stop more than 14 CPU tasks from running, just start reducing the CPU usage limits from 100% to whatever value (perhaps in the 90-98% region) until there are no more than 14 CPU tasks running.  If you are using website prefs, each time you make a setting change, click update in BOINC Manager for the client to be advised of the change immediately.  If you are using local prefs (in the manager), the change would be immediate.  Just use whatever you normally use.
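As a rough guide to which percentage gives which thread count, here is a small sketch. It rests on an assumption: that the client multiplies the thread count by the percentage and truncates to a whole number, with a minimum of 1. That is my understanding of the behaviour, not something confirmed in this thread, so check the result in BOINC Manager. `usable_cpus` is a name invented for this example.

```python
def usable_cpus(threads, pct):
    """Threads BOINC would use for a whole-number 'use at most pct% of the CPUs'.

    ASSUMPTION: threads * pct / 100 is truncated, with a floor of 1.
    """
    return max(1, (threads * pct) // 100)

# What a 16 thread host would get at various settings
for pct in (100, 99, 94, 93, 90, 88, 87):
    print(f"{pct:3d}% -> {usable_cpus(16, pct)} threads")

# Whole-number percentages that give exactly 14 threads on 16
pcts_for_14 = [p for p in range(1, 101) if usable_cpus(16, p) == 14]
print(pcts_for_14[0], "to", pcts_for_14[-1])
```

If that assumption holds, values from 94 to 99 would still leave 15 threads in use on this host, and 14 threads would need something in the 88 to 93 range.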

Each time a task finishes, BOINC should be able to start reducing all the remaining estimates.  In a day or two, if the estimates are lower, you may very well be able to release all the suspended tasks without causing BOINC to panic.

You only have around 180 GPU tasks in progress.  You are going to run out of those long before CPU tasks.  You won't be able to get new GPU tasks whilst any tasks are suspended.  The obvious time to try releasing all remaining CPU tasks from suspension will be when you are down to the last few GPU tasks.  With any luck at all you probably will be able to do that without causing a panic :-).

Ask if anything is not clear.

Cheers,
Gary.

Glenn Hawley, RASC Calgary

Gary: It was messed up again this morning, but I tried suspending the older CPU WUs like you suggested, and it immediately went to running 14 CPU and 2 GPU tasks, with 2 CPU tasks downgraded to "waiting to run".

I'll stick with 0.1 days of work and 0.0 additional days worth.

You have offered a coherent explanation as to why it's doing what it's doing. It was driving me nuts trying various experiments to figure it out... and failing to find anything consistently repeatable.


If I reduce the percent of CPUs available, though, it will shut down either CPU or GPU tasks and just leave 2 CPUs completely idle. With suspending those tasks, however, (as you suggested above), it's now running 14 CPU and 2 GPU tasks as it should. 

So, I have learned some interesting things about BOINC that I did not know before.

Thank you so much

Gary Roberts
Moderator

Glenn Hawley, RASC Calgary wrote:
Gary: It was messed up again this morning, but I tried suspending the older CPU WUs like you suggested, and it immediately went to running 14 CPU and 2 GPU tasks, with 2 CPU tasks downgraded to "waiting to run"

I didn't say to suspend the "older CPU WUs".  I said specifically, "Pick tasks with the longest deadlines".  They would be the very newest tasks - the ones with the longest remaining time before they expire.  The reason for that is you really want the "older CPU WUs" (which have the shortest deadlines) to be done first because they are the ones that will be creating BOINC's panic mode.  You need them finished and returned as soon as possible.

Check carefully!!  If you have any short deadline (ie 'older') tasks suspended, then suspend the same number of longest deadline tasks so that you can then safely release the short deadline ones to be crunched next.

Glenn Hawley, RASC Calgary wrote:
If I reduce the percent of CPUs available, though, it will shut down either CPU or GPU tasks and just leave 2 CPUs completely idle. With suspending those tasks, however, (as you suggested above), it's now running 14 CPU and 2 GPU tasks as it should.

Please understand what I wrote.  I said, "If you need to stop more than 14 CPU tasks from running ...".  All I was doing was giving you a quick way to get the number of CPU tasks to NOT exceed 14 - if you found more than 14 still running for some unexplained reason.  Of course, you're not expected to make that particular change if it's not needed.

Please realise that this isn't over yet.  It all depends on BOINC not going back into panic mode.  Only you can see the critical information - the current estimated crunch time (as seen in BOINC Manager) for CPU tasks (both GW and GRP).  If you report those values now, and again at the very end of your day, it will allow me to tell you if there is likely to be a future issue.  There are other things that can be done if necessary.

Cheers,
Gary.

Glenn Hawley, RASC Calgary

Yes... I misstated "older" while thinking of "longer deadline".

 

The reported estimated time for tasks that have not yet started ranges from 3.5-4.5 hours for the CPU tasks, and 0.5 hours for the GPU tasks.

Once started, the estimated time settles at about 30 hours for the CPU tasks and 0.5 hours for the GPU tasks, even though the latter in practice take 25 minutes each. This is as of 16:30 MST (UTC-7).

Gary Roberts
Moderator

Glenn Hawley, RASC Calgary wrote:
Yes... I misstated "older" while thinking of "longer deadline".

I wondered if that might be the case but didn't want to make assumptions.  I can't see what you see so you need to be careful with descriptions.

I notice that since you posted the message I'm responding to, you have aborted a bunch of GRP tasks.  Yesterday, you had 47 of them in progress and today you have just 1.  There are 46 that have been aborted quite recently - a short time ago as I write this.  I guess you may have seen BOINC return to panic mode and decided to abort.  Aborting was always a 'down the track' option if needed.  For the moment until we can work out the number to abort, the other option would have been to suspend a further 10-15 GW tasks to stay out of panic mode.

I asked you to give the estimated values that show for both GRP and GW CPU tasks - ie. two separate values.  You said all were 3.5-4.5 hours.  Am I supposed to assume that 3hrs:30mins is the GRP value and that 4hrs:30mins is the GW value?  Why couldn't you give two precise numbers, eg. GRP=xx:yy:zz and GW=aa:bb:cc for the two different values in hrs:mins:secs as listed on the tasks tab?  Each task type would have had a unique single value.

Suspending tasks just buys some time in order to accurately work out if aborting will be necessary and if so, how many to abort.  If you give ambiguous answers it just wastes time in trying to confirm.

If you just want the problem to go away without bothering to save as many as possible, just keep 42 GW CPU tasks on top of the 14 that are crunching right now (ie. 56 in total) and abort the remainder.  The worst case scenario will be that you need 5 days (based on your 30hr/task figure) to finish what remains, compared to close to 8 days to the deadline.  With a bit of patience you could probably save quite a few more but by staying nearly 3 days under the deadline limit, there should be no reason for BOINC to panic.
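For anyone wanting to check the arithmetic behind those figures, here it is as a short Python sketch. It uses only the numbers above (56 tasks, 14 at a time, 30 hours per task, roughly 8 days to the deadline) and assumes every task takes the full 30 hours and runs in full batches of 14, which is the worst case described.

```python
import math

tasks_kept = 42 + 14          # 42 waiting plus the 14 crunching now
concurrent = 14               # CPU tasks running at once
hours_per_task = 30           # reported estimate once a task has started
days_to_deadline = 8          # "close to 8 days" remaining

batches = math.ceil(tasks_kept / concurrent)   # rounds of 14 tasks
days_needed = batches * hours_per_task / 24    # worst-case days to finish
margin_days = days_to_deadline - days_needed   # slack before the deadline

print(batches, days_needed, margin_days)
```

That gives 4 batches, 5 days of crunching and about 3 days of slack, which matches the figures quoted.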

Cheers,
Gary.
