Why does my client think it needs more CPU tasks when it's got nearly 80 days' worth?

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

19 Jan 2021 23:27:41 UTC

Topic 224548

Since upgrading to a faster GPU (1080 to 3070) and switching from 3 GPU tasks at a time to 1, my client seems to think it needs an unlimited number of CPU tasks. It is sucking them down to the point that my GPU ends up on backup projects because the server thinks I've got too many tasks.

A few minutes ago, after I aborted several hundred tasks I had no hope of completing, it requested both more CPU and more GPU tasks.

At that point I had 760 unstarted CPU tasks. I'd been out of GPU work, so my DCF (duration correction factor) had adjusted to the point where the CPU tasks were showing reasonably correct estimates of 15 hours each. My compute preferences are such that I run 6 CPU tasks and 1 GPU task at a time (leaving the 8th virtual core free for other use). That meant the CPU tasks I had would take about 80 days to complete.

My computing preferences for this PC are currently 1.2 + 0.25 days of work. Server-side I'm taking only GW CPU tasks and Fermi GPU tasks.
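As a quick sanity check on those numbers, here is a minimal Python sketch of the arithmetic. It only uses the figures quoted above; the variable names are mine, and it assumes the backlog is spread evenly over the 6 CPU slots.

```python
# Backlog estimate using the figures quoted in this post.
unstarted_cpu_tasks = 760      # CPU tasks waiting in the cache
hours_per_task = 15.0          # client's runtime estimate per CPU task
concurrent_cpu_tasks = 6       # CPU tasks running at the same time

# Assumption: the backlog is shared evenly across the 6 CPU slots.
backlog_hours = unstarted_cpu_tasks * hours_per_task / concurrent_cpu_tasks
backlog_days = backlog_hours / 24

# Cache preference: 1.2 days plus 0.25 additional days.
cache_days = 1.2 + 0.25

print(f"Backlog: {backlog_hours:.0f} h = {backlog_days:.1f} days")
print(f"Cache target: {cache_days:.2f} days")
print(f"Backlog is about {backlog_days / cache_days:.0f}x the cache target")
```

That works out to roughly 79 days of queued CPU work against a cache target of 1.45 days.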

So there's absolutely no reason I should need more CPU tasks. But my first update requested 20 GPU tasks and 7 CPU tasks. Because my DCF was fully adjusted to the CPU rate, the 20 GPU tasks were estimated at 90 minutes each (they actually take about 8.5 minutes), which was enough to make the client think I had enough GPU work. It kept asking for more CPU tasks, though, and got 49 of them in 7 batches over the next 9 minutes, followed by 7 CPU tasks and 1 GPU task after I finished my first GPU task. During this time my runtime estimates for tasks didn't change, so it asked for and got 5 more days of CPU work that it won't be able to complete.

The net result over the last few days is that I've been up against the 1000 total task and 480 tasks/day limits, weighed down with insane numbers of CPU tasks, while my GPU runs one or more backup projects because my client's insanity means all I've got from E@H are CPU tasks.

EDIT: I've set network activity to off as a short-term way to stop my client from pulling down more CPU tasks. I'm hoping that by completing the GPU tasks I've got I can swing my DCF to the GPU value and pull down a big batch of them all at once when I turn it back on. Even if it works, this level of hand-holding isn't viable long term.

Keith Myers

Joined: 11 Feb 11

Posts: 5023

Credit: 18929024517

RAC: 6493273

20 Jan 2021 2:10:11 UTC

Message 182621

You are never going to get your host to work correctly while trying to run both GW and GR tasks on the same host at the same time.

The project uses old server software that employs the DCF function. You can only have ONE DCF value at any time.

But the DCF generated for GW work is 10X larger than the DCF generated for GR work.

So the scheduler is working with an incorrect DCF value when it requests work for a different sub-project from the one you were running previously, and it requests way too much work.

The only solution is to use a very small cache setting, 0.1 days plus 0.1 additional days, to limit the number of seconds of work you request at every connection.

Or only run one sub-project at a time on the host. Or run one project on one host and the other project on another host.
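To illustrate why a single DCF cannot fit both sub-projects, here is a minimal Python sketch using the runtimes reported in the opening post. It assumes a simplified model in which the displayed estimate is just a base estimate multiplied by the project-wide DCF; the real client's calculation has more inputs, and the function name is made up for illustration.

```python
# Simplified model (an assumption, not the actual BOINC code):
#   displayed_estimate = base_estimate * dcf
# with one DCF shared by every app in the project.

def rescale_estimate(current_estimate_min, dcf_ratio):
    """Estimate after the shared DCF changes by the factor dcf_ratio."""
    return current_estimate_min * dcf_ratio

# Figures from the opening post, with the DCF tuned to CPU (GW) tasks.
cpu_estimate_min = 15 * 60     # CPU estimate, roughly correct
gpu_estimate_min = 90.0        # GPU estimate shown by the client
gpu_actual_min = 8.5           # real GPU runtime

# How far the GPU estimate is off while the DCF suits the CPU tasks.
overestimate = gpu_estimate_min / gpu_actual_min

# If the DCF drifts to the value that suits the GPU tasks, every estimate
# shrinks by the same factor, so CPU tasks look much shorter than they are.
cpu_estimate_after = rescale_estimate(cpu_estimate_min, 1 / overestimate)

print(f"GPU tasks overestimated by {overestimate:.1f}x")
print(f"CPU estimate with a GPU-tuned DCF: {cpu_estimate_after:.0f} min "
      f"(real: {cpu_estimate_min} min)")
```

Under this model a 15-hour CPU task would be shown as about 85 minutes once the DCF suits the GPU tasks, so whichever sub-project the DCF is not tuned for gets badly misjudged.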

Ian&Steve C.

Joined: 19 Jan 20

Posts: 4045

Credit: 48069108236

RAC: 34277225

20 Jan 2021 2:29:02 UTC

Message 182622

Well, two of my hosts (the 2-GPU hosts) work OK running both GR and GW. But you know my situation isn't normal: I have a custom BOINC client to severely restrict how many tasks I have running at one time, I run a zero resource share, and I don't run CPU work for Einstein.

I think you can manage it with a small enough cache tho, even with the stock client. But you might have to make it really small. Maybe even like .01 days or whatever is the smallest value you can put in.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

20 Jan 2021 4:27:20 UTC

Message 182626 in response to message 182621

Keith Myers wrote:

You are never going to get your host to work correctly while trying to run both GW and GR tasks on the same host at the same time.

The project uses old server software that employs the DCF function. You can only have ONE DCF value at any time.

But the DCF generated for GW work is 10X larger than the DCF generated for GR work.

So the scheduler is working with an incorrect DCF value when it requests work for a different sub-project from the one you were running previously, and it requests way too much work.

The only solution is to use a very small cache setting, 0.1 days plus 0.1 additional days, to limit the number of seconds of work you request at every connection.

Or only run one sub-project at a time on the host. Or run one project on one host and the other project on another host.

I'm aware that E@H uses an old version of the platform that has only a single DCF and that I need to set my quota accordingly. For years I have been using a client quota much smaller than the week and a half of work I actually want, and I did adjust it down to compensate for the faster GPU execution times.

This is happening at the client though. According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks. But my client was asking the server for more CPU work.

Keith Myers

Joined: 11 Feb 11

Posts: 5023

Credit: 18929024517

RAC: 6493273

20 Jan 2021 5:21:22 UTC

Message 182627

The other factor coming into play is the rsc_fpops estimate the scientists assign to each individual task. That tells the client and scheduler the expected number of floating-point operations needed to crunch the task, which affects the projected runtimes in the Manager. If that value is under- or overestimated for a task, it can also confuse the scheduler.

And so it doesn't think you have that much work to crunch in your cache and asks for more than your cache setting for days of work.

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2980380710

RAC: 765626

21 Jan 2021 8:50:12 UTC

Message 182651 in response to message 182626

DanNeely wrote:

This is happening at the client though. According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks. But my client was asking the server for more CPU work.

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

21 Jan 2021 12:44:16 UTC

Message 182657 in response to message 182651

Richard Haselgrove wrote:

DanNeely wrote:

This is happening at the client though. According to the DCF at request time, my client had 30h of GPU tasks (incorrect) and 80 days of CPU tasks. But my client was asking the server for more CPU work.

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

Actually yes I do.

I'm not entirely sure why I had it set. I think it might have had something to do with running multiple E@H GPU tasks concurrently and keeping CPU cores reserved for them, while not leaving cores idle if I was running a backup GPU project. I'll have to try turning it off later today.

What you're describing does seem to fit what I'm seeing after turning on some debug messages in the event log. My client was requesting ~300k-350k seconds of CPU work each time, which would roughly align with what I'd need if I had no CPU tasks other than those currently in progress.

If so, there seems to be more to it though. I set max_concurrent a long time ago and upgraded to 7.16.11 months ago, but didn't notice anything wrong until very recently. This problem, which leaves me maxed out on CPU tasks and unable to get more GPU work, should have been ongoing unless something else was stopping it.

Highlander

Joined: 1 Jul 05

Posts: 24

Credit: 141664032

RAC: 967

21 Jan 2021 12:48:35 UTC

Message 182658 in response to message 182651

Richard Haselgrove wrote:

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

Thanks for the idea. I can confirm it with the BOINC 7.16.11 Windows x64 version. It happens to me on LHC@home, with around 4 weeks' worth of CMS VMs (with a normal 2-day cache).

Harri Liljeroos

Joined: 10 Dec 05

Posts: 4462

Credit: 3264061117

RAC: 1909838

21 Jan 2021 13:25:19 UTC

Message 182660

For LHC there is something else in play as well. The subprojects other than CMS work differently there. They all download a different amount of work (CPU time) to your cache. There might be additional limits set on the server side for Theory, Atlas and sixtrack tasks.

What I understand from Richard's message is that BOINC's maximum allowed CPU count would be used instead of the <max_concurrent> setting. Anyway, CMS clearly sends too much work (not following the cache settings).

But enough of LHC, this is the Einstein forum.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

21 Jan 2021 17:38:12 UTC

Message 182662 in response to message 182658

Highlander wrote:

Richard Haselgrove wrote:

Do you happen to use the <max_concurrent> feature in an app_config.xml file? I think there may be a bug in the client which disregards tasks blocked by <max_concurrent> when calculating how much work is currently cached. I got caught a couple of weeks ago when a machine asked for more CPU work - every minute, for about three hours before I caught it. Seems only to happen for CPU tasks.

Thanks for the idea. I can confirm it with the BOINC 7.16.11 Windows x64 version. It happens to me on LHC@home, with around 4 weeks' worth of CMS VMs (with a normal 2-day cache).

Confirmed on my end as well. I pulled all the max_concurrent entries out of the file and it's back to working the way it used to.
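For anyone wanting to check whether their own app_config.xml contains such entries, here is a minimal Python sketch that lists them. The sample file content and the app name example_cpu_app are placeholders I made up; in practice you would read the app_config.xml from the project's folder in the BOINC data directory.

```python
import xml.etree.ElementTree as ET

# Hypothetical app_config.xml content; the app name is a placeholder.
SAMPLE = """
<app_config>
  <app>
    <name>example_cpu_app</name>
    <max_concurrent>6</max_concurrent>
  </app>
</app_config>
"""

def find_max_concurrent(xml_text):
    """Return (app_name, value) for every <app> with a <max_concurrent>."""
    root = ET.fromstring(xml_text)
    found = []
    for app in root.iter("app"):
        limit = app.find("max_concurrent")
        if limit is not None:
            name = app.findtext("name", default="?").strip()
            found.append((name, (limit.text or "").strip()))
    return found

for name, value in find_max_concurrent(SAMPLE):
    print(f"{name}: max_concurrent = {value}")
```

This only reports the entries; whether to remove them is a judgment call, since they may have been added for a reason.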
