Gravity Wave search on GPUs: do we have a problem?

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3952
Credit: 46782872642
RAC: 64161285

Richard Haselgrove wrote:

Ian&Steve C. wrote:
you're still getting only 1000 cred when it validates. 

If.

None of my O3 tasks has been validated yet.

When.

Credit reward is based on this value.

But they don't appear to have a validator set up/running for these tasks yet, so they probably won't even try to validate until that's implemented. Proof: https://einsteinathome.org/workunit/544521572

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3952
Credit: 46782872642
RAC: 64161285

Looks like they put up a validator for these tasks, ran it for a bit, then shut it off for more troubleshooting. Over a thousand tasks (including the one linked above) were marked inconclusive and resent to new hosts. Only 2 validated according to the SSP.

_________________________________________________________________________

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956606421
RAC: 714990

Ian&Steve C. wrote:

Looks like they put up a validator for these tasks, ran it for a bit, then shut it off for more troubleshooting. Over a thousand tasks (including the one linked above) were marked inconclusive and resent to new hosts. Only 2 validated according to the SSP.

I got a lot of those resends overnight, and I'm working through them. My current score is:

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250459996
RAC: 35134

Thanks for the reports, and sorry for the late reply.

Some remarks that may clear up things a bit, at least in hindsight:
- In contrast to previous transitions "within" the O2MDF application, we now needed to let the old workunit generator run completely dry before we could set up a new one. It was rather unfortunate that the O2MDF workunit generator ran out of work at the beginning of a weekend; this way we had no GPU GW work for a few days.

- Unfortunately the status shown on the server status page doesn't reflect the actual status in all cases - a daemon can still be running while it is "disabled". We disabled the O2MDF workunit generator to prevent it from being uselessly restarted every 5 minutes, just to terminate itself because there is no more work left to generate.

- Regarding the "cleanup at 99%": This is actually more than just cleanup. During the main computation the application cycles through millions of "templates" and matches it to the data. The result is a list of ~10000 "candidates" that match the data best. We got this algorithm running pretty efficient on the GPU. However in making this computation more efficient, a bit of statistical detail in the result is lost that is important later on to judge the quality of a candidate. Calculating this detail is clumsy; the memory access pattern is so random that running it on the GPU isn't any faster than on the CPU, and counting in the required memory transfers it would take even longer there. However, we don't need this this information for each of the millions of templates (most of which are thrown away), we only need it for the candidates that make it into the top n. So at the end we do a little more computation only on the candidates that come out of the GPU calculation, and do this on the CPU. The result list of O3ASE is a bit longer (30000 candidates) than that of O2MD (3x7500), so computation on the CPU will also take longer.

- Regarding CPU usage: Some years ago the behavior of the BOINC client regarding the CPU utilization ("estimate") for a GPU app was that when it's more than 0.9, it would lower the priority of the process to "nice" instead of "normal", which slows down the communication between GPU and CPU (especially on NVidia with "busy waiting"/"polling" on the CPU side) and also the CPU computation at the end. Has this changed since back then?
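
To make the two-stage scheme above concrete, here is a rough sketch in plain C++ (a hypothetical illustration, not the actual Einstein@Home code; all names are made up, and an ordinary loop stands in for the GPU stage):

    // Stage 1 (GPU in the real app): a cheap detection statistic for
    // millions of templates. Stage 2 (CPU): an expensive extra statistic,
    // recomputed only for the top-n candidates that survive the cut.
    #include <algorithm>
    #include <iostream>
    #include <random>
    #include <vector>

    struct Candidate {
        std::size_t template_id;
        double score;   // cheap statistic from the main search
        double detail;  // expensive statistic, filled in later on the CPU
    };

    int main() {
        const std::size_t n_templates = 5000000; // "millions of templates"
        const std::size_t n_keep = 30000;        // O3ASE keeps ~30000 candidates

        std::mt19937_64 rng(42);
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        // Stage 1: score every template (done on the GPU in the real app).
        std::vector<Candidate> all(n_templates);
        for (std::size_t i = 0; i < n_templates; ++i)
            all[i] = Candidate{i, dist(rng), 0.0};

        // Keep only the n_keep best-scoring templates; the rest are discarded.
        std::nth_element(all.begin(), all.begin() + n_keep, all.end(),
                         [](const Candidate& a, const Candidate& b) {
                             return a.score > b.score;
                         });
        all.resize(n_keep);

        // Stage 2: the clumsy, cache-unfriendly statistic, but only for the
        // few survivors - cheap enough to do on the CPU at the very end.
        for (Candidate& c : all)
            c.detail = c.score * c.score; // placeholder for the real computation

        std::cout << "refined " << all.size() << " candidates\n";
        return 0;
    }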

BM

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250459996
RAC: 35134

Indeed, I'm still working on the validator; the validation status might change back and forth, even for already reported (and validated) results.

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956606421
RAC: 714990

Bernd Machenschalk wrote:

- Regarding CPU usage: Some years ago the behavior of the BOINC client regarding the CPU utilization ("estimate") for a GPU app was that when it's more than 0.9, it would lower the priority of the process to "nice" instead of "normal", which slows down the communication between GPU and CPU (especially on NVidia with "busy waiting"/"polling" on the CPU side) and also the CPU computation at the end. Has this changed since back then?

Ah. I didn't know that. Yes, it seems it is still true:

https://github.com/BOINC/boinc/blob/master/client/app_start.cpp#L560

    // run it at above idle priority if it
    // - uses coprocs
    // - uses less than one CPU
    // - is a wrapper
    //
    bool high_priority = false;
    if (app_version->rsc_type()) high_priority = true;
    if (app_version->avg_ncpus < 1) high_priority = true;
    if (app_version->is_wrapper) high_priority = true;

But I don't know how case (1) and case (2) interact. I'll try to tease that out.

Edit - checked on a Windows machine, set up the same way with 1.0 CPU set via app_config.xml (I'm more familiar with the Windows tools).

Einstein GPU app - still O2MDF in this case - is shown as 'base priority: below normal', whereas pure CPU apps are 'base priority: low'. So the logic seems to be: if any one (or more than one) of these cases applies, increase the priority.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250459996
RAC: 35134

Richard Haselgrove wrote:

As usual with OpenCL on NVidia, it actually clocks up 100% CPU while running, but is sent out with an estimate of 0.9 - which allows BOINC to start another CPU task.

Will BOINC actually do this, i.e. start another CPU task for this 0.1 "free" CPU? Or will it do this only if you are running 10 GPU tasks in parallel, summing up to a free CPU core?

Would it help to set the CPU utilization (estimate) to 0.99 instead of 0.9?

BM

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2956606421
RAC: 714990

Bernd Machenschalk wrote:

Richard Haselgrove wrote:

As usual with OpenCL on NVidia, it actually clocks up 100% CPU while running, but is sent out with an estimate of 0.9 - which allows BOINC to start another CPU task.

Will BOINC actually do this, i.e. start another CPU task for this 0.1 "free" CPU? Or will it do this only if you are running 10 GPU tasks in parallel, summing up to a free CPU core?

Would it help to set the CPU utilization (estimate) to 0.99 instead of 0.9?

Yes and no.

Only the integer part is considered. 0.99 will allow a CPU task to be started. 2 x 0.99 (1.98) will allow one task to be started, but not two.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3952
Credit: 46782872642
RAC: 64161285

BOINC logic is that anything less than 1 really means 0 for accounting.

0.9 = 0 CPUs reserved

0.9+0.9 = 1 CPU reserved

0.9+0.9+0.9 = 2 CPUs reserved
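
A minimal worked sketch of that accounting rule (my reading of the behavior described above, not an excerpt from the BOINC client):

    // Sum the per-task CPU estimates; only the integer part of the sum
    // counts as "reserved", so the remainder stays free for CPU tasks.
    #include <cmath>
    #include <cstdio>

    int main() {
        const double cpu_estimate = 0.9; // per-GPU-task CPU usage estimate
        for (int gpu_tasks = 1; gpu_tasks <= 3; ++gpu_tasks) {
            double sum = gpu_tasks * cpu_estimate;
            int reserved = static_cast<int>(std::floor(sum)); // integer part
            std::printf("%d x %.1f = %.1f -> %d CPU(s) reserved\n",
                        gpu_tasks, cpu_estimate, sum, reserved);
        }
        return 0;
    }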

 

Some of this priority terminology is counterintuitive, though. Are you saying that the tasks will run slower if the CPU use estimate is greater than 1? Or less than 1?

I force all my tasks to 1.0 CPU via an app_config, as that best represents actual CPU use and lets BOINC account correctly for available resources. Is there a reason the project doesn't want to do this for NVidia tasks?
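
For reference, a minimal app_config.xml sketch of that kind of override (the app name below is a placeholder, not necessarily the real name of the GW GPU app; the actual name can be taken from the client's event log or client_state.xml):

    <app_config>
        <app>
            <name>einstein_O3AS</name> <!-- placeholder app name -->
            <gpu_versions>
                <gpu_usage>1.0</gpu_usage>
                <cpu_usage>1.0</cpu_usage>
            </gpu_versions>
        </app>
    </app_config>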

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3952
Credit: 46782872642
RAC: 64161285

Richard Haselgrove wrote:

Einstein GPU app - still O2MDF in this case - is shown as 'base priority: below normal', whereas pure CPU apps are 'base priority: low'. So the logic seems to be "if any one (or more than one) of these cases apply, increase the priority".

Being that these are GPU tasks, it seems that case 1 (uses coprocs) will always be satisfied regardless of the others, no?

_________________________________________________________________________
