Multi-Directional Gravitational Wave Search on O3 data (O3MD1/F)

[AF>EDLS]zOU
Joined: 5 May 15
Posts: 65
Credit: 384235373
RAC: 0


Ian&Steve C. wrote:

how many tasks are you trying to run at once? 
 

you’re getting : 

CL_MEM_OBJECT_ALLOCATION_FAILURE

which means you’re running out of GPU memory. 

 

Just one ;)

Unrelated to the topic then, since it's not an app error; I thought I saw the same error code.
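For reference, OpenCL status codes like the one quoted above are defined in the `cl.h` header. A minimal lookup sketch (only a handful of codes shown; the helper name `describe` is just for illustration):

```python
# A few OpenCL runtime status codes from cl.h; -4 is the
# out-of-GPU-memory failure quoted above.
CL_ERRORS = {
    0: "CL_SUCCESS",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

def describe(status):
    """Map an OpenCL status code to its symbolic name."""
    return CL_ERRORS.get(status, f"unknown OpenCL status {status}")
```

Note that the `(1152,)` / `(114,)` exit codes in the task list below are BOINC process exit statuses, not OpenCL codes.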

 

Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2754_1    00:02:23 (00:00:59)    1/10/2023 8:04:01 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    41,26    Reported: Computation error (1,)    Hades    53.65 MB    43.17 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2751_0    00:02:14 (00:01:00)    1/10/2023 8:05:01 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    44,78    Reported: Computation error (1,)    Hades    53.66 MB    43.27 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2749_0    00:02:08 (00:00:56)    1/10/2023 8:09:06 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    43,75    Reported: Computation error (114,)    Hades    53.66 MB    43.22 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2748_0    00:02:07 (00:00:58)    1/10/2023 8:09:06 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    45.67    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3051_1    00:01:57 (00:01:01)    1/9/2023 8:22:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    52,14    Reported: Computation error (1152,)    Hades    1260.33 MB    459.21 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3049_1    00:02:04 (00:01:02)    1/9/2023 8:22:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    50,00    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3052_1    00:02:14 (00:01:03)    1/9/2023 8:26:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    47,01    Reported: Computation error (1152,)    Hades    502.23 MB    126.55 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3048_1    00:02:07 (00:00:59)    1/9/2023 8:26:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    46,46    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3050_1    00:02:06 (00:01:04)    1/9/2023 8:30:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    50,79    Reported: Computation error (1152,)    Hades    53.66 MB    43.15 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_7_1    00:05:41 (00:04:04)    1/9/2023 8:04:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    71,55    Reported: Computation error (1152,)    Hades    3859.27 MB    686.07 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_8_1    00:01:49 (00:00:51)    1/9/2023 8:04:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    46,79    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_6_1    00:01:45 (00:00:50)    1/9/2023 8:05:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    47,62    Reported: Computation error (1152,)    Hades    50.05 MB    39.20 MB    
 

I haven't had any WUs for this app in a few days, but I'll keep an eye on it

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8323426
RAC: 9004


Conan wrote:

With O3MDF now fixed (hopefully), is it possible to get O3MD1 CPU work units going again? There are none in the queue and haven't been for a while.

I would like to run some more of them please.

 

Conan

Thanks for getting more work out for O3MD1 and the CPUs. 

I have some questions though concerning these new work units.

The previous ones that I ran took somewhere from 33,000 to 43,000 seconds (I think) and gave 1,000 credits. So a bit longer than my Ryzen 9 5900X takes to run a Binary-type WU (around twice as long as a Gamma WU under Linux).

These new ones have so far run 18 hours and are just 33% done, with 1 day 12 hours still to go.

They were sent out saying they would take 4 hours 54 minutes, so I ended up with 55 of them.

 

My questions are:

How long do these things run?

Are they still only getting 1,000 credits?

 

Thanks

Conan 

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18747093082
RAC: 7057049


Einstein does not allot credit based on runtimes.  Credit is allotted only in fixed amounts for each species of task.

So the 1,000 credits will not change for O3MD1/F.

 

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8323426
RAC: 9004


Thanks Keith for the response.

 

I was sure the amounts were not going to change, but I could hope.

 

The work units now have almost 21 hours done and are 39.909% complete, with still 1 day 7 hours to go.

With 55 downloaded, I will contemplate whether I will complete all of them; we will see how it goes.

I suppose I was the one asking for them, so I will have to try to finish them.

 

Thanks

Conan

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117675505966
RAC: 35143769


Conan wrote:
The work units have now almost 21 hours done and 39.909% complete, still 1 Day 7 Hours to go.

How many concurrent CPU tasks are you running?  Your host is listed as 12 cores / 24 threads.  I hope you're not trying to run tasks on all 24 threads.

If you are, you may get better performance by experimenting with a smaller number.  You could try halving the number of threads in use and checking the new rate of progress.  You may find it more than doubles.  There will probably be some 'sweet spot' for the optimum number to use.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4964
Credit: 18747093082
RAC: 7057049


You may finish them in time.  Since they are brand new tasks that your host has never seen before, BOINC doesn't have a good estimate of how long they take to crunch.

With normal BOINC server software, the APR (average processing rate) for a species of task takes 10 validated results before the estimate is correct.

But with Einstein using its custom server software, that mechanism is not used; they use a variation of the older DCF (duration correction factor) method of runtime estimation.  I don't fully understand how it works, other than knowing that it is continually updated and often gets confused when the task species changes from what was running before, producing wildly inaccurate estimates of remaining runtimes.  Eventually, after enough work of the same type is returned, the estimates become accurate.

You might reduce your cache levels until your DCF stabilizes and you know how fast work is completed; then you can increase your cache levels again.
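To make the behaviour concrete, here is a rough sketch of how a DCF-style correction tends to behave. This is an illustration of the general idea only, not the actual BOINC client or Einstein server code; the asymmetric update rule and the 0.1 decay rate are assumptions.

```python
def update_dcf(dcf, estimated_hours, actual_hours):
    """Simplified DCF-style update (illustrative, not BOINC's exact logic):
    the correction factor jumps up immediately after an underestimate,
    but only drifts down slowly after an overestimate."""
    ratio = actual_hours / estimated_hours
    if ratio > dcf:
        dcf = ratio                   # one long task inflates all estimates
    else:
        dcf += 0.1 * (ratio - dcf)    # recovers slowly once tasks run short
    return dcf

dcf = 1.0
# A 4h54m estimate that actually runs ~45h, roughly Conan's situation:
dcf = update_dcf(dcf, 4.9, 45.0)
remaining_estimate = 4.9 * dcf  # every queued task of this type now looks ~45h long
```

This is why one wildly underestimated species of task can blow out the estimates for the whole cache at once, and why the estimates only settle after a run of same-type tasks completes.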

 

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8323426
RAC: 9004


Thanks Gary and Keith,

 

I run a lot of projects, so when that computer finished up with World Community Grid, PRIVATE GFN SERVER, RamanujanMachine and ClimatePrediction, it was left with the 55 tasks downloaded by BOINC.

So yes, it started running 24 tasks at once to get through them.

As I said, the BOINC estimated run time of 4 hours 54 minutes was a mile off; even after running some of them before Christmas, BOINC did not learn the run times.

 

After run times now passing 1 day and not yet reaching 47%, I have suspended all bar 12 work units.

 

However, this also paused the 12 left running, as BOINC downloaded LHC and LODA work units and started work on them instead, probably due to the hours spent on Einstein and now doing some catch-up.

 

Will let it run and see how it goes; the work units don't run out till the 26th, so I should be good, enabling more Einstein work units as the others complete.

 

Conan

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250562784
RAC: 34400


Our CPU tasks are designed to run on the order of 10-12 h (varying a lot with CPU type etc.). At the lower frequency range (mostly at the beginning of a GW search), there are often tasks for which we don't have enough data for this runtime, so these tasks may run significantly shorter.

That is, however, our design. In reality, once the search is launched, things may turn out to deviate quite a bit from it. Every search is a little different, not only in data but also in parameters. For the GPU search we first found that the memory requirement was much higher than we had estimated. Then there was this bug that occurred only with a combination of certain parameters (frequency and coherence time) that we possibly never had before (and it occurred in a part of the code that wasn't used before).

The CPU (part of the) search seems to suffer from even more technical issues this time. We stopped it before the holiday season as our (attention) capacity for the project was very limited during that time, and the problem on the GPU side was already becoming large enough.

We started the CPU part again a few days ago, and are still seeing problems occurring (you may notice on the Server Status page that the number of "failed" tasks is about 50% of the "valid" tasks, which is alarming).

Please bear with us as we fix these issues one by one. Credit, I have to admit, is of lower priority to us right now. We'll fix that too, but in due time.

BM

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8323426
RAC: 9004


Bernd Machenschalk wrote:

Our CPU tasks are designed to run on the order of 10-12 h (varying a lot with CPU type etc.). At the lower frequency range (mostly at the beginning of a GW search), there are often tasks for which we don't have enough data for this runtime, so these tasks may run significantly shorter.

That is, however, our design. In reality, once the search is launched, things may turn out to deviate quite a bit from it. Every search is a little different, not only in data but also in parameters. For the GPU search we first found that the memory requirement was much higher than we had estimated. Then there was this bug that occurred only with a combination of certain parameters (frequency and coherence time) that we possibly never had before (and it occurred in a part of the code that wasn't used before).

The CPU (part of the) search seems to suffer from even more technical issues this time. We stopped it before the holiday season as our (attention) capacity for the project was very limited during that time, and the problem on the GPU side was already becoming large enough.

We started the CPU part again a few days ago, and are still seeing problems occurring (you may notice on the Server Status page that the number of "failed" tasks is about 50% of the "valid" tasks, which is alarming).

Please bear with us as we fix these issues one by one. Credit, I have to admit, is of lower priority to us right now. We'll fix that too, but in due time.

Thanks Bernd, I am continuing to process; they haven't failed yet, just running a long time. With fewer running at once, they should run in shorter times from now on.

 

Conan

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117675505966
RAC: 35143769


Conan wrote:
... I have suspended all bar 12 work units.

That's not what I suggested you might try.

I suggested that - "You could try halving the number of threads in use" - which means run 12 CPU threads, not 24.  You do that by going into the manager and changing the setting for the number of 'cores' the client is allowed to use from 100% to 50%.  You are trying to find the 'sweet spot' - the number of active threads which gives you the best 'rate of progress'.  If 50% of the cores gives more than double the rate of progress you had previously, then you are using your resources more efficiently and getting more 'output' as a result.  You then try other settings (say 75%, etc.) until you find what works best (or run out of patience while doing so) :-).

By suspending "all but 12" you created a shortfall in the work on hand.  The client can't download replacements from Einstein (tasks are suspended) so of course it will fill the shortage from whatever other projects are available.  Then the client can run 24 threads once again and resource share considerations will determine the project mix for the 24 running tasks.  Obviously the client decided that other projects were 'owed' more share.  If you want to test just a single project, temporarily suspend the others while testing that project.

You are no closer to working out if there is an efficiency benefit from running less than the full 24 threads.  Please be aware that different projects may be more or less 'compute intensive', using CPU resources differently and consequently giving different 'sweet spots'.

It's quite likely there will be a gain in output by using somewhat less than 24 threads.  It's worthwhile doing the experiments to find out.
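A simple way to frame those experiments is to compare throughput (tasks finished per day) rather than per-task runtime at each thread count. A small sketch, using invented runtimes purely for illustration (these are not measurements from Conan's host):

```python
# Hypothetical per-task runtimes (hours) at different concurrent-task counts.
# The numbers are made up to show the shape of the trade-off.
measured = {24: 45.0, 12: 16.0, 6: 9.0}

def throughput(n_tasks, hours_per_task):
    """Tasks finished per day when n_tasks run concurrently."""
    return 24.0 * n_tasks / hours_per_task

best = max(measured, key=lambda n: throughput(n, measured[n]))
for n in sorted(measured):
    print(f"{n:2d} threads -> {throughput(n, measured[n]):.1f} tasks/day")
# With these invented numbers, 12 threads wins: halving the thread count
# more than halves the per-task runtime, so total output goes up.
```

The 'sweet spot' is simply the thread count that maximises this throughput figure on your own hardware, which is what the 50% / 75% experiments above are measuring.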

Cheers,
Gary.
