Why does my client think it needs more CPU tasks when it's got nearly 80 days' worth?

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3571203690
RAC: 622338
Topic 224548

Since upgrading to a faster GPU (1080 to 3070) and switching from 3 GPU tasks at a time to 1, my client seems to think it needs an unlimited number of CPU tasks. It is sucking them down to the point that my GPU ends up on backup projects because the server thinks I've got too many tasks.

A few minutes ago, after I aborted several hundred tasks I had no hope of completing, it requested more CPU tasks as well as more GPU tasks.

At that point I had 760 unstarted CPU tasks.  I'd been out of GPU work, so my DCF had adjusted to the point the CPU tasks were showing reasonably correct estimates of 15 hours each.  My compute preferences are such that I run 6 CPU and 1 GPU tasks at a time (leaving the 8th virtual core free for other use).  That meant the CPU tasks I had would take about 80 days to complete.
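For anyone checking the arithmetic behind the "about 80 days" figure, here is a minimal sketch. It assumes every task takes the full estimated 15 hours and that all 6 CPU slots stay busy the whole time; the function name is just for illustration.

```python
def days_to_clear(task_count, hours_per_task, concurrent_tasks):
    """Wall-clock days needed to finish a queue when tasks run in parallel slots."""
    total_task_hours = task_count * hours_per_task
    wall_clock_hours = total_task_hours / concurrent_tasks
    return wall_clock_hours / 24


# Figures from the post: 760 unstarted tasks, ~15 h each, 6 running at a time.
queued_days = days_to_clear(task_count=760, hours_per_task=15, concurrent_tasks=6)
print(f"{queued_days:.1f} days")  # 760 * 15 / 6 / 24
```

That works out to a little over 79 days of queued CPU work.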

My computing preferences for this PC are currently 1.2 + 0.25 days of work.  Server-side, I'm taking only GW CPU tasks and Fermi GPU tasks.

So there's absolutely no reason I should need more CPU tasks. But my first update requested 20 GPU tasks and 7 CPU tasks. Because my DCF was fully adjusted to the CPU rate, the 20 GPU tasks, estimated at 90 minutes each (actually 8.5 minutes), were enough to make it think I had enough GPU work. It kept asking for more CPU tasks, though, and got 49 of them in 7 batches over the next 9 minutes, followed by 7 CPU tasks and 1 GPU task after I finished my first GPU task. During this time my runtime estimates didn't change, so it asked for and got 5 more days of CPU work it won't be able to complete.

The net result over the last few days is that I've been up against the 1000 total task and 480 tasks/day limits, bloated with insane numbers of CPU tasks, while my GPU runs one or more backup projects because my client's insanity means all I've got from E@H are CPU tasks.

EDIT: I've set network activity to off as a short-term way to stop my client from pulling down more CPU tasks. I'm hoping that by completing the GPU tasks I've got I can swing my DCF to the GPU value and pull down a big batch of them all at once when I turn it back on. Even if it works, this level of hand-holding isn't viable long term.

Keith Myers
Joined: 11 Feb 11
Posts: 5044
Credit: 19049066319
RAC: 6522647

You are never going to get your host to work correctly while attempting to run both GW and GR tasks on the same host at the same time.

The project uses old server software that employs the DCF function.  You can only have ONE DCF value at any time.

But the DCF generated for GW work is 10X larger than the DCF generated for GR work.

So the scheduler is working with an incorrect DCF value when it requests work for a different sub-project from the one you were running previously, and so it requests way too much work.

The only solution is to use a very small cache value, 0.1 days plus 0.1 additional days, to limit the number of seconds of work you request at every connection.

Or only run one sub-project at a time on the host. Or run one project on one host and the other project on another host.
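To make the single-DCF problem concrete, here is a rough sketch using the figures from the opening post (GPU tasks really take 8.5 minutes but show as 90 minutes; CPU tasks take about 15 hours). It assumes a simplified model in which the displayed estimate is just a base estimate multiplied by the DCF; that model and the function name are illustrative assumptions, not the client's actual code.

```python
def shown_estimate(true_runtime, true_dcf, applied_dcf):
    """Runtime the client would display when it applies the wrong DCF.

    Simplified model (an assumption): estimate = base estimate * DCF, so
    applying another app's DCF scales the estimate by applied_dcf / true_dcf.
    """
    return true_runtime * applied_dcf / true_dcf


GPU_TRUE_MIN = 8.5    # real GPU task runtime from the opening post
GPU_SHOWN_MIN = 90.0  # what the client showed while the DCF was tuned to CPU work
dcf_ratio = GPU_SHOWN_MIN / GPU_TRUE_MIN  # how far apart the two DCFs are

# DCF tuned to CPU (GW) work: 20 GPU tasks look like far more work than they are.
believed_gpu_hours = 20 * shown_estimate(GPU_TRUE_MIN, 1.0, dcf_ratio) / 60
actual_gpu_hours = 20 * GPU_TRUE_MIN / 60

# DCF tuned to GPU (GR) work: a 15 h CPU task looks much shorter than it is.
cpu_shown_hours = shown_estimate(15.0, dcf_ratio, 1.0)

print(f"DCF ratio: {dcf_ratio:.1f}x")
print(f"GPU cache: believed {believed_gpu_hours:.1f} h, actual {actual_gpu_hours:.1f} h")
print(f"15 h CPU task shown as {cpu_shown_hours:.1f} h under the GPU-tuned DCF")
```

Under this model the ratio comes out a little over 10, in line with the 10X figure above, and whichever sub-project the DCF is not currently tuned for gets its cache size misjudged by that factor.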

 

Ian&Steve C.
Joined: 19 Jan 20
Posts: 4081
Credit: 48692422902
RAC: 34754642

Well, two of my hosts (the 2-GPU hosts) work OK running both GR and GW. But you know my situation isn't normal: I have a custom BOINC client that severely restricts how many tasks I have running at one time, I run a zero resource share, and I don't run CPU work for Einstein.

I think you can manage it with a small enough cache though, even with the stock client. But you might have to make it really small, maybe even 0.01 days or whatever the smallest value you can enter is.


DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3571203690
RAC: 622338

Keith Myers wrote:

You are never going to get your host to work correctly while attempting to run both GW and GR tasks on the same host at the same time.

The project uses old server software that employs the DCF function.  You can only have ONE DCF value at any time.

But the DCF generated for GW work is 10X larger than the DCF generated for GR work.

So the scheduler is working with an incorrect DCF value when it requests work for a different sub-project from the one you were running previously, and so it requests way too much work.

The only solution is to use a very small cache value, 0.1 days plus 0.1 additional days, to limit the number of seconds of work you request at every connection.

Or only run one sub-project at a time on the host. Or run one project on one host and the other project on another host.

 

I'm aware that E@H uses an old version of the platform that only has a single DCF, and that I need to set my quota accordingly. For years I have been using a client quota much smaller than the week and a half of work I actually want, and I did adjust it down to compensate for the faster GPU execution times.

This is happening at the client though.  According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks.  But my client was asking the server for more CPU work.

Keith Myers
Joined: 11 Feb 11
Posts: 5044
Credit: 19049066319
RAC: 6522647

The other factor coming into play is the rsc_fpops_est value the scientists assign to each individual task. That tells the client and scheduler the estimated number of floating-point operations needed to crunch the task, which affects the projected runtimes in the Manager. If that value is under- or overestimated, it can also confuse the scheduler.

And so it doesn't think you have that much work to crunch in your cache and asks for more than your cache setting for days of work.

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2992956119
RAC: 708944

DanNeely wrote:

This is happening at the client though.  According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks.  But my client was asking the server for more CPU work.

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3571203690
RAC: 622338

Richard Haselgrove wrote:

DanNeely wrote:

This is happening at the client though.  According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks.  But my client was asking the server for more CPU work.

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

Actually, yes, I do.

I'm not entirely sure why I had it set. I think it might have had something to do with running multiple E@H GPU tasks concurrently and keeping CPU cores reserved for them, while not leaving cores idle if I was running a backup GPU project. I'll have to try turning it off later today.

What you're describing does seem to fit what I'm seeing after turning on some debug messages in the event log. My client was requesting ~300k-350k seconds of CPU work each time, which would roughly align with what I'd need if I had no CPU tasks other than those currently in progress.

If so, there seems to be more to it, though. I set max_concurrent a long time ago and upgraded to 7.16.11 months ago, but didn't notice anything wrong until very recently. A problem that leaves me maxed out on CPU tasks and unable to get more GPU work should have been ongoing unless something else was stopping it.

Highlander
Joined: 1 Jul 05
Posts: 24
Credit: 141827363
RAC: 14882

Richard Haselgrove wrote:

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

 

Thanks for the idea; I can confirm it with the BOINC 7.16.11 Windows x64 version. It happens to me on LHC@home, with around 4 weeks' worth of CMS VMs... (with a normal 2-day cache)

Harri Liljeroos
Joined: 10 Dec 05
Posts: 4519
Credit: 3300700180
RAC: 1970548

For LHC there is something else in play as well. Subprojects other than CMS work differently there: each downloads a different amount of work (in CPU time) to your cache. There might be additional server-side limits set for Theory, ATLAS and SixTrack tasks.

What I understand from Richard's message is that BOINC's maximum allowed CPU count would be used instead of the <max_concurrent> setting. Anyway, CMS clearly sends too much work (not following the cache settings).

But enough of LHC; this is the Einstein forum.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3571203690
RAC: 622338

Highlander wrote:

Richard Haselgrove wrote:

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

 

Thanks for the idea; I can confirm it with the BOINC 7.16.11 Windows x64 version. It happens to me on LHC@home, with around 4 weeks' worth of CMS VMs... (with a normal 2-day cache)

Confirmed on my end as well: I pulled out all the max_concurrent entries in the file and it's back to working the way it used to.
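For anyone who would rather script the workaround than edit by hand, here is a minimal sketch that strips every <max_concurrent> element from app_config.xml text. The sample XML and the app name example_app are made up for illustration; take a backup of the real file first, and note that this parser drops any XML comments in the file.

```python
import xml.etree.ElementTree as ET


def strip_max_concurrent(xml_text, tag="max_concurrent"):
    """Return (cleaned XML text, number of elements removed)."""
    root = ET.fromstring(xml_text)
    removed = 0
    # Copy the element lists first so the tree is not mutated while iterating.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Hypothetical sample; a real app_config.xml will have different app names.
SAMPLE = """<app_config>
  <app>
    <name>example_app</name>
    <max_concurrent>3</max_concurrent>
  </app>
</app_config>"""

cleaned, count = strip_max_concurrent(SAMPLE)
print(f"removed {count} element(s)")
print(cleaned)
```

After changing the file, the client still has to re-read its config files (or be restarted) before the change takes effect.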
