When will the 90% problem be fixed?

Luca

Joined: 20 Jan 23

Posts: 7

Credit: 21280304

RAC: 3819

29 Oct 2023 14:29:10 UTC

Topic 230278


When the WU reaches 90% it stops. I know why, but when will this be fixed? I feel like I'm wasting around 33% of my, let's call it, GPU time. My GPU crunches for 15-20 minutes and then sits idle for almost 7-8 minutes while the task uses the CPU. Why can't I start a new GPU WU while the last WU is finishing on the CPU? Or when will this be fixed?
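To put a rough number on that estimate, here is a minimal Python sketch. It uses only the figures quoted in the post (15-20 minutes of GPU work followed by 7-8 minutes of CPU-only finishing) and assumes one task runs at a time; the function name `idle_fraction` is made up for illustration.

```python
def idle_fraction(gpu_minutes: float, cpu_minutes: float) -> float:
    """Share of a task's wall time during which the GPU has nothing to do.

    Assumes one task at a time, so the GPU idles for the whole CPU stage.
    """
    return cpu_minutes / (gpu_minutes + cpu_minutes)


# Figures quoted in the post: 15-20 min on the GPU, 7-8 min on the CPU.
best = idle_fraction(20, 7)   # longest GPU stage, shortest CPU stage
worst = idle_fraction(15, 8)  # shortest GPU stage, longest CPU stage

print(f"GPU idle between {best:.0%} and {worst:.0%} of the time")
# prints: GPU idle between 26% and 35% of the time
```

So "around 33%" is in the right range for the run times described.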

mikey

Joined: 22 Jan 05

Posts: 12678

Credit: 1839078536

RAC: 4018


29 Oct 2023 15:00:20 UTC

Message 218663


Luca wrote:

When the WU reaches 90% it stops. I know why, but when will this be fixed? I feel like I'm wasting around 33% of my, let's call it, GPU time. My GPU crunches for 15-20 minutes and then sits idle for almost 7-8 minutes while the task uses the CPU. Why can't I start a new GPU WU while the last WU is finishing on the CPU? Or when will this be fixed?

The easy answer is to start a 2nd task when the first one gets to the 90% point, where it switches to CPU crunching to finish up. You can set this up on the website: go below where you pick which kinds of tasks you want to run and set the GPU usage to 50%, so that two tasks can share the GPU. Once the 2nd task starts, suspend it until the first task is no longer doing any work on the GPU, then unsuspend the 2nd task. In practice you may have to suspend every Einstein GPU task except the 1st one and then resume them once the 1st task gets far enough along for you. Just be sure your GPU has at least 8 GB of RAM on it so you don't get out-of-memory errors for the tasks.
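As a rough illustration of why offsetting a 2nd task can help, the Python sketch below compares three idealized cases. It assumes that two tasks in their GPU stage share the GPU perfectly evenly (each runs at half speed), that the CPU stage needs no GPU at all, and that every task has identical run times. None of this is measured on real Einstein@Home tasks, the function names are hypothetical, and the 17.5 and 7.5 minute figures are simply midpoints of the ranges Luca quoted.

```python
def util_single(gpu_min: float, cpu_min: float) -> float:
    """GPU busy fraction with one task at a time."""
    return gpu_min / (gpu_min + cpu_min)


def util_two_in_step(gpu_min: float, cpu_min: float) -> float:
    """Two tasks that start together: they share the GPU (each at half
    speed, so the GPU stage lasts 2 * gpu_min), then both move to the
    CPU at the same moment and the GPU idles for cpu_min."""
    return 2 * gpu_min / (2 * gpu_min + cpu_min)


def util_two_staggered(gpu_min: float, cpu_min: float) -> float:
    """Idealized upper bound for two offset tasks: one task's CPU stage
    overlaps the other's GPU stage, so the GPU only idles when the CPU
    stage is longer than the GPU stage."""
    return min(1.0, 2 * gpu_min / (gpu_min + cpu_min))


g, c = 17.5, 7.5  # assumed midpoints of "15-20 min GPU" and "7-8 min CPU"
print(f"1 task at a time:   {util_single(g, c):.0%}")         # 70%
print(f"2 tasks, in step:   {util_two_in_step(g, c):.0%}")    # 82%
print(f"2 tasks, staggered: {util_two_staggered(g, c):.0%}")  # 100%
```

Real tasks vary in run time, so an offset can drift and the tasks can fall back in step; treat the staggered figure as a best case, not a prediction.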

Luca

Joined: 20 Jan 23

Posts: 7

Credit: 21280304

RAC: 3819


29 Oct 2023 15:08:41 UTC

Message 218664


Should I do it manually? It looks quite time-consuming and impossible for most of the tasks.

GWGeorge007

Joined: 8 Jan 18

Posts: 3060

Credit: 4962344353

RAC: 1404302


29 Oct 2023 15:30:17 UTC

Message 218666 in response to message 218664


Luca wrote:

Should I do it manually? It looks quite time-consuming and impossible for most of the tasks.

Unless your tasks all have exactly the same run time, you will either need to get involved manually from time to time, or just let them ride it out. If you are intent on using 100% of the GPU 100% of the time, then yes, you will have to monitor it constantly and intervene when necessary.

The reason your tasks use both the GPU and the CPU is that finishing the task needs double-precision arithmetic, which the CPU handles better than the GPU. It will not be "fixed" by the task's developers. Just monitor your tasks and start a second task when the first one has finished using the GPU.

George

Proud member of the Old Farts Association

Tom M

Joined: 2 Feb 06

Posts: 6432

Credit: 9562034566

RAC: 9851043


29 Oct 2023 15:24:45 UTC

Message 218668 in response to message 218664


Luca wrote:

Should I do it manually? It looks quite time-consuming and impossible for most of the tasks.

Exactly. It would be very time-consuming on your part.

There is no easy fix. And even when I tried running 2x and suspending the tasks till they were offset significantly, they still ended up in the 90% together most of the time.

It appears I get my best production at 1x.

Tom M

A Proud member of the O.F.A. (Old Farts Association). Be well, do good work, and keep in touch.® (Garrison Keillor) I want some more patience. RIGHT NOW!

B.I.G

Joined: 26 Oct 07

Posts: 117

Credit: 1170469039

RAC: 965479


30 Oct 2023 6:02:15 UTC

Message 218691 in response to message 218668


Tom M wrote:

And even when I tried running 2x and suspending the tasks till they were offset significantly, they still ended up in the 90% together most of the time.

Interesting. With GW tasks I offset them once and didn't have to intervene. However, I run 2 GW tasks but only 1 MeerKAT task at a time, so if the application switches task types and then goes back to GW, of course 2 tasks start at the same time again. So in my experience the solution is to either run all tasks at 2x, run just one type, or manually offset them again.

With my AMD W7600 I get the highest RAC with the BRP7 (MeerKAT) tasks running one task at a time, but the scheduler prefers to give that machine GW tasks, so I go with it. GW tasks profit a lot from being offset, as they require more CPU crunching. If your goal is the highest possible RAC, you might want to test which tasks run best on your GPU and then stick to them.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 315225814

RAC: 312311


31 Oct 2023 4:30:00 UTC

Message 218719


In one sense it will never be fixed in the way that you mean. In another fashion it already has been!

By that I mean : the re-examination of the GPU data by the CPU is inevitable ( various reasons including double precision ) given the relatively poor implementation of IEEE standards for floating point on the commonest GPUs that E@H contributors have ( on 'consumer' or 'gaming' cards ). That lack of standards compliance is just not going to yield sensible science if not accounted for in the search strategy ie. the validity of the entire search is at risk otherwise.
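For readers wondering why precision matters so much, here is a small Python sketch of the general effect. It is not the E@H search code and says nothing about any particular GPU; it only emulates IEEE 754 single precision by rounding through `struct`, and the helper names `to_float32` and `sum_float32` are made up for this example.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_float32(values):
    """Accumulate a sum, rounding to single precision after every addition."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


# One large value followed by a thousand small contributions.
values = [1e8] + [1.0] * 1000

print(sum_float32(values))  # 100000000.0  (every +1.0 is rounded away)
print(sum(values))          # 100001000.0  (double precision keeps them)
```

Near 1e8 neighbouring single-precision values are 8 apart, so each small addition vanishes, while double precision keeps all of them. A weak signal sitting on a large background is exposed to exactly this kind of loss.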

{ Aside : we don't want to get a reputation for misleading work! }

However the search is still faster overall ( in general ) than doing the initial search ( fast Fourier transforms ) via CPU followed by a toplist candidate filtering scheme, again on CPU. An FFT of the size typical for E@H ( ~2²² points ) simply runs at awesome speed on the parallel architecture that GPUs offer. In this sense we have already converged on the best solution - or close to it - for the commonest host hardware combinations that E@H encounters.
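To give a feel for the size of a 2²² point transform, here is a back-of-envelope Python sketch. The counts are textbook scaling figures ( N log2 N for an FFT, N² for a direct DFT ), not measured operation counts of the E@H application, and the function names are invented for this example.

```python
N = 2 ** 22  # FFT length quoted above ( ~2^22 points )


def fft_ops(n: int) -> int:
    """Textbook N * log2(N) scaling for a power-of-two FFT."""
    log2_n = n.bit_length() - 1  # exact log2 for a power of two
    return n * log2_n


def direct_dft_ops(n: int) -> int:
    """Textbook N^2 scaling for a direct ( non-fast ) DFT."""
    return n * n


print(f"{N:,} points")                                  # 4,194,304 points
print(f"{fft_ops(N):,} FFT steps")                      # 92,274,688 FFT steps
print(f"{direct_dft_ops(N):,} direct steps")            # 17,592,186,044,416 direct steps
print(f"{direct_dft_ops(N) // fft_ops(N):,}x fewer")    # 190,650x fewer
```

Each stage of the FFT is made of many independent small operations, which is the kind of work that suits a GPU's parallel architecture.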

So that's the balance that has been struck between speed on the one hand versus reliable answers on the other. But please do try the other suggestions made here, they may help.

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

{ The currently unobtainable 'unicorn solution' for this is a coherent search over a year long data set. There is not enough computing power on the planet for that! }

Cheers, Mike.

( edit ) The sensitivity of the numerical analysis depends upon the methodology of the search, as does the computational cost. With regard to searching for continuous GWs, the raison d'etre of E@H, see the full gore of that, say, here and here. To date we have not conclusively discovered a continuous GW, but have set bounds on the parameters of any that might exist, see here for example. Note that one is accustomed to thinking of noise as a fraction of the signal strength, but for continuous GW detection the reverse is true. The expected signal is a mild drift back & forth upon a noisy ( non-target ) background and this is responsible for much subtlety in the signal processing. If only there could be a system-on-chip solution for this quandary!

( edit ) It occurs to me that it would be useful to know to what degree, if any, the candidate data has to remain on the GPU while the CPU is doing its follow-on thing. Put another way : what's the exact detail of the handover of the tasks from GPU to CPU ? Does anyone know this ?

( edit ) Silly me. Take, say, an 'All-Sky Gravitational Wave search on O3 v1.06 (GW-opencl-nvidia)' work unit stderr output. It looks to me like the results from the GPU stage are written to a temporary file ( on Windows, found in the <*PUT_YOUR_BOINC_DISK_HERE*>:\ProgramData\BOINC\projects\einstein.phys.uwm.edu directory ) which is then taken up by the CPU for candidate filtering. So that would imply that once the initial candidate list is formed by the GPU, the GPU is indeed free for other things.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

mikey

Joined: 22 Jan 05

Posts: 12678

Credit: 1839078536

RAC: 4018


31 Oct 2023 10:33:23 UTC

Message 218726 in response to message 218719


Mike Hewson wrote:

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

So would the Tesla GPUs work here at Einstein?

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6588

Credit: 315225814

RAC: 312311


31 Oct 2023 11:09:52 UTC

Message 218727 in response to message 218726


mikey wrote:

Mike Hewson wrote:

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

So would the Tesla GPUs work here at Einstein?

If I had that system I wouldn't care, it's 7nm technology ! ;-)

Seriously : if OpenCL compliant drivers emerge then they might, but I can't find any reference to that in NVidia documents. At least it is IEEE compliant for FP64. Anyway if we all had one then I'm sure that E@H devs would oblige with 8 x 7936 = 63,488 CUDA cores per system to play with.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Ian&Steve C.

Joined: 19 Jan 20

Posts: 3945

Credit: 46628862642

RAC: 64175710


31 Oct 2023 11:57:22 UTC

Message 218729 in response to message 218726


mikey wrote:

Mike Hewson wrote:

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

So would the Tesla GPUs work here at Einstein?

Why wouldn't they? They've shown up here before. I've temporarily rented some hosts like this before. They work fine as long as you have drivers installed.


Boca Raton Comm...

Joined: 4 Nov 15

Posts: 238

Credit: 10519195586

RAC: 27120022


31 Oct 2023 14:12:01 UTC

Message 218732 in response to message 218726


mikey wrote:

Mike Hewson wrote:

{ Now in a perfect world we could all afford DGX-A100 systems that carry eight Nvidia A100 Tesla cards @ $200K USD ..... drool :-) }

So would the Tesla GPUs work here at Einstein?

I feel like those show up every once in a while.

Also, I will call your DGX-A100 system and raise you with the DGX-H100 system. What's a couple more hundred thousand dollars? Almost 3x the FP32 computational power of the A100. Absolutely insane. Also, you will need a small power plant to use one of these.
