Gravitational Wave All-sky search on LIGO O1 Open Data

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5883

Credit: 119033211559

RAC: 24712091

Bernd Machenschalk wrote:FWIW

12 Apr 2019 0:38:17 UTC

Message 170660 in response to message 170648

Bernd Machenschalk wrote:

FWIW ...

Strangely enough, it's actually worth quite a lot! :-). It's always helpful to have a brief idea of why things change!

Congratulations and sincere thanks for a very welcome speed increase. I think your 20% estimate is a little conservative, since some people in this thread now report close to a doubling in speed (a halving of crunch time). It may vary with operating system or processor architecture. I'm seeing the same, and a 0.06 result of mine has already validated against a 0.03.

I'm using two machines with well over 200 completed tasks between them (mainly 0.03) and no invalids so far that I've noticed. The current 0.06 tasks are project estimated at over 20 hours but are only taking 4.5 hours. The 0.03s were taking 8.5 hours. If this improvement is to remain, is it possible to refine the estimate, please?
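As a rough check of the figures above, here is a minimal Python sketch of the arithmetic. The runtimes are the ones quoted in this post; the function names are made up for illustration, and since the estimate is given only as "over 20 hours", the last figure is a lower bound.

```python
def speedup(old_hours: float, new_hours: float) -> float:
    """Return how many times faster the new version runs."""
    return old_hours / new_hours


def overestimate_factor(estimated_hours: float, actual_hours: float) -> float:
    """Return how many times larger the estimate is than the real runtime."""
    return estimated_hours / actual_hours


OLD_RUNTIME_H = 8.5   # 0.03 tasks, as reported in the post
NEW_RUNTIME_H = 4.5   # 0.06 tasks, as reported in the post
ESTIMATE_H = 20.0     # "over 20 hours", so this is a lower bound

gain = speedup(OLD_RUNTIME_H, NEW_RUNTIME_H)
time_saved = 1 - NEW_RUNTIME_H / OLD_RUNTIME_H
over = overestimate_factor(ESTIMATE_H, NEW_RUNTIME_H)

print(f"speedup: {gain:.2f}x")
print(f"crunch time saved: {time_saved:.0%}")
print(f"estimate is at least {over:.1f}x the actual runtime")
```

On these numbers the speedup is close to, but a little short of, a full doubling, and the project estimate is more than four times the observed runtime.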

Cheers,
Gary.

Eskomorko

Joined: 15 Jan 09

Posts: 39

Credit: 870934733

RAC: 0

I don't know where to put

13 Apr 2019 13:00:06 UTC

Message 170682

I don't know where to put this, but my RAC has been plummeting lately and completed tasks waiting for validation are piling up in big numbers. I now have at least 12 pages of completed tasks waiting, and the oldest tasks are almost 1 month old. Is this normal, or is there some ongoing problem with validation?

Tasks are mostly:

- Gamma-ray pulsar binary search #1 on GPUs v1.20
- Gravitational Wave Engineering run on LIGO O1 Open Data v0.04

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5883

Credit: 119033211559

RAC: 24712091

Eskomorko wrote:I don't know

13 Apr 2019 21:56:00 UTC

Message 170686 in response to message 170682

Eskomorko wrote:

I don't know where to put this ....

A good way to work that out is to think about the purpose of the various forums and how that relates to exactly what is troubling you. In your case you probably had three choices:-

Technical News - A place where the staff start threads to make announcements and give ongoing information about things of a technical nature. Volunteers make comments directly related to the announcement.
Cruncher's Corner - A good place to discuss all sorts of performance observations and issues. Unless your comment is directly related to an ongoing discussion, it's best to start a new thread rather than take an existing discussion off topic in a new direction.
Problems & Bug Reports - A place for getting help with problems you are having or bugs you think may exist in the way your host is interacting with the project servers.

Your concern is about the number of tasks you have that are 'pending validation'. Having tasks in the pending category is a normal everyday fact of life. It's not news, it's not a problem, it's just the way things have always been. Pendings increase in two particular cases. Firstly, if you have a fast GPU, you can churn out lots of results before your partners can catch up. Secondly, for new searches where locality scheduling is being used to control the distribution of tasks. You can end up not having quorum partners in a timely manner. Locality scheduling is necessary to minimise the bandwidth needed to efficiently deploy the large numbers of large data files to volunteer computers.

You actually have (at the time I looked) 179 pendings - 9 pages - so if you had 12 pages earlier, things are definitely on the improve. There were 99 pendings for FGRPB1G and 34 for FGRP5. There were only 46 for the new O1OD1E engineering run. None of these seem to be particularly excessive. You have a fast, modern GPU so pendings for the FGRPB1G search are to be expected.
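The counts above can be cross-checked with a few lines of Python. The per-search numbers are taken from this post; the page size of 20 tasks per page is an assumption about the task list view, chosen because it is consistent with "179 pendings - 9 pages".

```python
import math

# Pending counts per search, as quoted in the post.
pending = {"FGRPB1G": 99, "FGRP5": 34, "O1OD1E": 46}

# Assumption: the task list shows 20 tasks per page.
TASKS_PER_PAGE = 20

total = sum(pending.values())
pages = math.ceil(total / TASKS_PER_PAGE)

# Smallest number of pendings that would fill "at least 12 pages".
min_tasks_for_12_pages = (12 - 1) * TASKS_PER_PAGE + 1

print(f"total pendings: {total}, pages: {pages}")
print(f"12 pages means at least {min_tasks_for_12_pages} pendings")
```

Under that page-size assumption, 12 pages earlier would have meant more than 220 pendings, so a drop to 179 is indeed an improvement.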

Just realise that the number of pendings is entirely due to factors beyond the control of the project. If you are unlucky enough to be partnered with lots of other hosts that don't return valid work promptly (but you do) you will end up with lots of pendings. Your only recourse is to write a stern letter to all your quorum partners, telling them to hurry up and get their fingers out :-).

Cheers,
Gary.

Eskomorko

Joined: 15 Jan 09

Posts: 39

Credit: 870934733

RAC: 0

Gary Roberts wrote:Eskomorko

14 Apr 2019 1:17:00 UTC

Message 170688 in response to message 170686

Gary Roberts wrote:

Eskomorko wrote:
I don't know where to put this ....

A good way to work that out is to think about the purpose of the various forums and how that relates to exactly what is troubling you. In your case you probably had three choices:-

Technical News - A place where the staff start threads to make announcements and give ongoing information about things of a technical nature. Volunteers make comments directly related to the announcement.

Cruncher's Corner - A good place to discuss all sorts of performance observations and issues. Unless your comment is directly related to an ongoing discussion, it's best to start a new thread rather than take an existing discussion off topic in a new direction.

Problems & Bug Reports - A place for getting help with problems you are having or bugs you think may exist in the way your host is interacting with the project servers.

Your concern is about the number of tasks you have that are 'pending validation'. Having tasks in the pending category is a normal everyday fact of life. It's not news, it's not a problem, it's just the way things have always been. Pendings increase in two particular cases. Firstly, if you have a fast GPU, you can churn out lots of results before your partners can catch up. Secondly, for new searches where locality scheduling is being used to control the distribution of tasks. You can end up not having quorum partners in a timely manner. Locality scheduling is necessary to minimise the bandwidth needed to efficiently deploy the large numbers of large data files to volunteer computers.

You actually have (at the time I looked) 179 pendings - 9 pages - so if you had 12 pages earlier, things are definitely on the improve. There were 99 pendings for FGRPB1G and 34 for FGRP5. There were only 46 for the new O1OD1E engineering run. None of these seem to be particularly excessive. You have a fast, modern GPU so pendings for the FGRPB1G search are to be expected.

Just realise that the number of pendings is entirely due to factors beyond the control of the project. If you are unlucky enough to be partnered with lots of other hosts that don't return valid work promptly (but you do) you will end up with lots of pendings. Your only recourse is to write a stern letter to all your quorum partners, telling them to hurry up and get their fingers out :-).

Thank you for your answer.

I'm not here to blame anyone, I just wondered how things have been lately. I switched from a GTX 1060 to an RTX 2070 roughly 2 months ago, so that might explain something.

Sometimes I just don't get how RAC goes down that fast when my computer keeps crunching numbers all the time.

mmonnin

Joined: 29 May 16

Posts: 292

Credit: 3444726540

RAC: 27555

Mad_Max wrote:Yeah, i know

15 Apr 2019 13:18:24 UTC

Message 170720 in response to message 170602

Mad_Max wrote:

Yeah, I know about such utilities. But I don't use them, as there is a built-in "native" BOINC option to tune this:

Via the options section of cc_config.xml
<process_priority>N</process_priority>, <process_priority_special>N</process_priority_special>
    The OS process priority at which tasks are run. Values are 0 (lowest priority, the default), 1 (below normal), 2 (normal), 3 (above normal), 4 (high) and 5 (real-time - not recommended). 'special' process priority is used for coprocessor (GPU) applications, wrapper applications, and non-compute-intensive applications, 'process priority' for all others. The two options can be used independently.
But you are missing the point: there are many ways to control process priority from the user side, if a particular user pays attention to it and knows how to tune it, i.e. for some geeks only.

We were talking about the default behavior for ALL users, which can be set from the project side.

Windows will still move around the load and the GPU exe will still end up waiting. Even with AMD cards that use low CPU util for much of the task run time, GPU util will drop unless the GPU exe is set to its own free CPU thread.
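For anyone who wants to try the cc_config.xml options quoted above, the following Python sketch writes out a minimal file of that shape. The helper name `build_cc_config` and the example values (0 for ordinary tasks, 2 for GPU tasks) are illustrative assumptions, not a recommendation from this thread; the element names and the 0 to 5 value range come from the quoted text. `ET.indent` needs Python 3.9 or later.

```python
import xml.etree.ElementTree as ET


def build_cc_config(process_priority: int, process_priority_special: int) -> str:
    """Return cc_config.xml text that sets the two priority options.

    Both values must be in 0..5, the range given in the quoted text.
    """
    for value in (process_priority, process_priority_special):
        if value not in range(0, 6):
            raise ValueError(f"priority must be 0..5, got {value}")

    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    ET.SubElement(options, "process_priority").text = str(process_priority)
    ET.SubElement(options, "process_priority_special").text = str(
        process_priority_special
    )
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Example values only: lowest priority for CPU tasks, normal for GPU tasks.
xml_text = build_cc_config(process_priority=0, process_priority_special=2)
print(xml_text)
```

The output would be saved as cc_config.xml in the BOINC data directory. Note that this only changes the priority of the process, which is a different thing from the CPU thread reservation described in the reply above.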

Mad_Max

Joined: 2 Jan 10

Posts: 165

Credit: 2250039517

RAC: 618703

mmonnin, yes, you are right -

15 Apr 2019 23:20:20 UTC

Message 170730

mmonnin, yes, you are right - this is a very old, well-known problem of almost all E@H GPU apps, while it is usually not a problem for other BOINC GPU projects.

Some volunteers even pointed out the root cause of this problem long ago. But for unknown reasons the issue is still here, popping up again and again.

A quick reminder of what causes the need to reserve a full CPU core and/or run multiple GPU tasks to avoid a significant loss of GPU performance:

It is not a low priority of the E@H app (process), but the low priority of the main thread inside the app process.
For example, the main thread of the current app for Gamma-ray pulsar binary search #1 on GPUs v1.18 (FGRPopencl1K-ati) is set to 1 (one), the very lowest possible value. This is independent of the app's process priority: even if I assign high priority (=13) to the FGRP app, the main thread still remains at the lowest priority (=1).

For other GPU apps I have seen it work quite differently: threads inherit their priority from the process priority by default. E.g. with normal priority (=8) set on the process, the thread also gets normal (8) priority. When the process gets high priority (=13), its thread gets high priority too.
Here, for example, are screenshots of the current GPU apps from E@H and MW@H running on the same Windows machine. Both apps are set to run at normal priority, but this is what happens with the thread priorities inside them:

FGRPB - https://yadi.sk/i/_DwVUckqo0iUhA

MW - https://yadi.sk/i/tcQ5n3RuMU_WXA

This is the reason why FGRPB runs at max speed only if there is no competition AT ALL for CPU resources, even from other BOINC CPU apps running at the lowest priority, i.e. when the app has a whole CPU core to itself.

This applies to the current GW GPU test app too: its main thread priority is always the lowest possible (1), regardless of the priority of the process. So ANY other thread/process can take resources from it and thus slow down the GPU computations.
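The numbers 1, 8 and 13 in this post match the Windows base-priority scheme, where a thread's base priority is derived from the process priority class plus the thread's relative priority level. The Python sketch below models that mapping for the non-realtime classes. It is only a lookup table written from the documented Windows scheduling priorities, the function name is made up for illustration, and the idea that the app's main thread is set to the "idle" thread level is an assumption that happens to fit the observation, not something confirmed in this thread.

```python
# Base priority of a thread at the "normal" thread level, per process class
# (non-realtime classes only).
CLASS_BASE = {
    "IDLE": 4,
    "BELOW_NORMAL": 6,
    "NORMAL": 8,
    "ABOVE_NORMAL": 10,
    "HIGH": 13,
}

# Relative thread levels that shift the class base up or down.
THREAD_OFFSET = {
    "LOWEST": -2,
    "BELOW_NORMAL": -1,
    "NORMAL": 0,
    "ABOVE_NORMAL": 1,
    "HIGHEST": 2,
}


def base_priority(process_class: str, thread_level: str) -> int:
    """Return the Windows base priority (1..15) for a non-realtime process."""
    if thread_level == "IDLE":
        return 1    # pinned to 1 regardless of the process class
    if thread_level == "TIME_CRITICAL":
        return 15   # pinned to 15 regardless of the process class
    return CLASS_BASE[process_class] + THREAD_OFFSET[thread_level]


for process_class in ("NORMAL", "HIGH"):
    inherited = base_priority(process_class, "NORMAL")
    idle_thread = base_priority(process_class, "IDLE")
    print(f"{process_class}: normal thread = {inherited}, idle thread = {idle_thread}")
```

In this model a thread at the idle level stays at 1 whether the process is normal (8) or high (13), which is the pattern described for the FGRPB1G and GW apps, while a thread left at the normal level simply follows the process class, as described for the MW@H app.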

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252682520

RAC: 35637

Sorry, this complaint didn't

16 Apr 2019 11:57:00 UTC

Message 170741

Sorry, this complaint didn't get through to me yet, or I have been too busy with other things to listen carefully enough. App Version 0.12 is in the pipeline, which should have this fixed. If so, I hope to get another FGRPB1G App version out this week.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252682520

RAC: 35637

Hm, 0.12 didn't work. 0.13 is

17 Apr 2019 11:30:31 UTC

Message 170768

Hm, 0.12 didn't work. 0.13 is out.

Mad_Max

Joined: 2 Jan 10

Posts: 165

Credit: 2250039517

RAC: 618703

Thanks BERND. I have got few

17 Apr 2019 23:46:39 UTC

Message 170783

Thanks BERND. I have got a few GW WUs of the new 0.13 version. Now it inherits thread priority from process priority, as expected: https://yadi.sk/i/1jdbcRjgqebyCg
This should greatly increase performance for users who do not pay attention to things like reserving a CPU core for GPU apps.

Meanwhile I have started testing 4 GW WUs in parallel on one GPU on one of my computers. Looks good so far:
- GPU load at least tripled (from ~20% to 60-65%)
- GPU RAM consumption quadrupled as expected, but that is not a problem: it's only ~500 MB
- average runtimes and validation results are pending...
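The post does not say how the four parallel tasks were configured. One common way to do this in BOINC is an app_config.xml file in the project directory, and the Python sketch below generates one under that assumption. The helper name and the app name `hypothetical_gw_gpu_app` are placeholders; the real application name would have to be looked up in the client's state for the project.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, tasks_per_gpu: int, cpu_per_task: float) -> str:
    """Return app_config.xml text that runs `tasks_per_gpu` tasks on each GPU."""
    if tasks_per_gpu < 1:
        raise ValueError("tasks_per_gpu must be at least 1")

    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    gpu_versions = ET.SubElement(app, "gpu_versions")
    # Each task claims this fraction of a GPU, so 4 tasks -> 0.25.
    ET.SubElement(gpu_versions, "gpu_usage").text = f"{1 / tasks_per_gpu:.3g}"
    # CPU cores budgeted per GPU task.
    ET.SubElement(gpu_versions, "cpu_usage").text = f"{cpu_per_task:.3g}"
    ET.indent(root)  # pretty-print; available from Python 3.9
    return ET.tostring(root, encoding="unicode")


# Placeholder app name; 4 tasks per GPU as in the test described above.
app_config_text = build_app_config("hypothetical_gw_gpu_app", 4, 1.0)
print(app_config_text)
```

With four tasks per GPU, a fourfold increase in GPU memory use is what one would expect, which fits the ~500 MB figure reported above.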

I have also noticed that the main load in the CPU part of the computation for the current GW app is created by the OpenCL library (amdocl64.dll in my case), especially at the start of each computation cycle (as captured in the screenshot above), when the OCL DLL consumes a whole CPU core while GPU load is ~0%. The rest of the time amdocl64.dll still creates about 60-70% of the app's total CPU load.
Is this expected behavior, with the part of the computation that has not been ported to GPU code yet being done via calls into the OCL DLL? Or did something go wrong, so that some functions which should run on the GPU actually run on the CPU in emulation mode?
I have seen such errors a few times on other projects: a wrong OpenCL call can lead to emulation instead of actual GPU computation.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4343

Credit: 252682520

RAC: 35637

We essentially stopped the

18 Apr 2019 10:13:01 UTC

Message 170785

We essentially stopped the "O1 Engineering run", i.e. we aren't generating new workunits anymore.

Instead we will continue the previously suspended "O2AS" run as a GW run. The current GPU App will not give much benefit in that setup, so "O2AS" will (for now) continue to be CPU-only.

If all goes as planned we will start with the "O1OD1 injection run" on GPUs.
