Multi-Directional Gravitational Wave Search on O3 data (O3MD1/F)

[AF>EDLS]zOU
Joined: 5 May 15
Posts: 65
Credit: 384235373
RAC: 0

Ian&Steve C. wrote:

how many tasks are you trying to run at once? 
 

you’re getting : 

CL_MEM_OBJECT_ALLOCATION_FAILURE

which means you’re running out of GPU memory. 

 

Just one ;)

Then it's unrelated to this topic, since it's not an app error. I thought I had seen the same error code.

 

Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2754_1    00:02:23 (00:00:59)    1/10/2023 8:04:01 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    41,26    Reported: Computation error (1,)    Hades    53.65 MB    43.17 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2751_0    00:02:14 (00:01:00)    1/10/2023 8:05:01 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    44,78    Reported: Computation error (1,)    Hades    53.66 MB    43.27 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2749_0    00:02:08 (00:00:56)    1/10/2023 8:09:06 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    43,75    Reported: Computation error (114,)    Hades    53.66 MB    43.22 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_2748_0    00:02:07 (00:00:58)    1/10/2023 8:09:06 PM    1/10/2023 8:14:40 PM    0,9C + 1NV    45.67    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3051_1    00:01:57 (00:01:01)    1/9/2023 8:22:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    52,14    Reported: Computation error (1152,)    Hades    1260.33 MB    459.21 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3049_1    00:02:04 (00:01:02)    1/9/2023 8:22:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    50,00    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3052_1    00:02:14 (00:01:03)    1/9/2023 8:26:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    47,01    Reported: Computation error (1152,)    Hades    502.23 MB    126.55 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3048_1    00:02:07 (00:00:59)    1/9/2023 8:26:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    46,46    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.80_O3aC01Cl1In0__O3MDFG1_G34731_467.00Hz_3050_1    00:02:06 (00:01:04)    1/9/2023 8:30:53 PM    1/9/2023 8:57:57 PM    0,9C + 1NV    50,79    Reported: Computation error (1152,)    Hades    53.66 MB    43.15 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_7_1    00:05:41 (00:04:04)    1/9/2023 8:04:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    71,55    Reported: Computation error (1152,)    Hades    3859.27 MB    686.07 MB    
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_8_1    00:01:49 (00:00:51)    1/9/2023 8:04:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    46,79    Reported: Computation error (1152,)    Hades            
Einstein@Home    1.03 Multi-Directional Gravitational Wave search on O3 (GPU) (GW-opencl-nvidia)    h1_0466.40_O3aC01Cl1In0__O3MDFG1_G34731_466.50Hz_6_1    00:01:45 (00:00:50)    1/9/2023 8:05:51 PM    1/9/2023 8:15:52 PM    0,9C + 1NV    47,62    Reported: Computation error (1152,)    Hades    50.05 MB    39.20 MB    
 

I haven't had any WUs for this app in a few days, but I'll keep an eye on it.

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8086420
RAC: 4597

Conan wrote:

With O3MDF now fixed (hopefully), is it possible to get O3MD1 CPU work units going again? There are none in the queue and haven't been for a while.

I would like to run some more of them please.

 

Conan

Thanks for getting more work out for O3MD1 and the CPUs. 

I have some questions though concerning these new work units.

The previous ones that I ran took somewhere from 33,000 to 43,000 seconds (I think) and gave 1,000 credits. So a bit longer than my RYZEN 9 5900X takes to run a Binary type WU (around twice as long as a Gamma WU with Linux).

These new ones have so far run 18 hours and are just 33% done, with 1 day 12 hours still to go.

They were sent out saying they would take 4 hours 54 minutes, so I ended up with 55 of them.
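
For what it's worth, BOINC's remaining-time figure matches a simple linear extrapolation from the elapsed time and the fraction done. A minimal sketch, assuming progress is linear over the whole task (the function name is just for illustration):

```python
def estimate_runtime(elapsed_hours, fraction_done):
    """Linearly extrapolate total and remaining runtime from progress so far."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    total = elapsed_hours / fraction_done
    return total, total - elapsed_hours


# Figures from this post: 18 hours elapsed, 33% done.
total, remaining = estimate_runtime(18, 0.33)
print(f"total ~{total:.1f} h, remaining ~{remaining:.1f} h")
```

That puts the total at roughly 54.5 hours, about eleven times the 4 hours 54 minutes originally estimated.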

 

My questions are

"How long do these things run?"

Are they still only getting 1,000 points?

 

Thanks

Conan 

Keith Myers
Joined: 11 Feb 11
Posts: 4944
Credit: 18575255362
RAC: 5662694

Einstein does not allot credit based on runtimes.  Credit is allotted only in fixed amounts for each species of task.

So the 1000 credits will not change for O3MD1/F.

 

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8086420
RAC: 4597

Thanks Keith for the response.

 

I was sure the amounts were not going to change, but I could hope.

 

The work units have now almost 21 hours done and 39.909% complete, still 1 Day 7 Hours to go.

With 55 downloaded, I will contemplate whether to complete them all; we will see how it goes.

I suppose I was the one asking for them so I will have to try and finish them.

 

Thanks

Conan

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116720584549
RAC: 36531836

Conan wrote:
The work units have now almost 21 hours done and 39.909% complete, still 1 Day 7 Hours to go.

How many concurrent CPU tasks are you running?  Your host is listed as 12 cores / 24 threads.  I hope you're not trying to run tasks on all 24 threads.

If you are, you may get better performance by experimenting with a smaller number.  You could try halving the number of threads in use and checking the new rate of progress.  You may find it more than doubles.  There will probably be some 'sweet spot' for the optimum number to use.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4944
Credit: 18575255362
RAC: 5662694

You may finish them in time.  Since they are brand new tasks that your host has never seen before, BOINC doesn't have a good estimate of how long they take to crunch.

With normal BOINC server software, the APR for a species of task takes 10 validated results before the estimate is correct.

But with Einstein using its custom server software, that mechanism is not used and they use a variation of the older DCF method of runtime estimation.  I don't fully understand how it works, other than knowing that it is continually updated.  It often gets confused when the task species changes from what the host was running before, and then produces wildly inaccurate estimates of remaining runtimes.  Eventually, after enough work of the same type is returned, the estimates become accurate.

You might reduce your cache levels until your DCF stabilizes and you know how fast work is completed; then you can increase your cache levels again.
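
To illustrate why a single shared correction factor can go wrong when the task type changes, here is a toy model of DCF-style estimation. It is a simplified sketch, not BOINC's or Einstein@Home's actual code: the update rule (jump up at once when a task overruns, drift down slowly otherwise), the 10% smoothing weight, and the sample runtimes are all assumptions chosen for illustration.

```python
class DcfEstimator:
    """Toy duration-correction-factor model (illustrative, not BOINC's real code)."""

    def __init__(self, dcf=1.0, smoothing=0.1):
        self.dcf = dcf              # one factor shared by every task type
        self.smoothing = smoothing  # assumed weight for downward corrections

    def estimate(self, raw_estimate_hours):
        """Corrected estimate = raw estimate times the shared factor."""
        return raw_estimate_hours * self.dcf

    def record(self, raw_estimate_hours, actual_hours):
        """Update the factor after a task finishes."""
        ratio = actual_hours / raw_estimate_hours
        if ratio > self.dcf:
            # Underestimates are corrected immediately.
            self.dcf = ratio
        else:
            # Overestimates are corrected gradually.
            self.dcf += self.smoothing * (ratio - self.dcf)


est = DcfEstimator()
# Hypothetical short task type: raw estimate 5 h, really takes 5 h.
est.record(raw_estimate_hours=5.0, actual_hours=5.0)
print(est.estimate(5.0))  # a new, much longer task type still looks like a short job
# The first long task finishes after 50 h (hypothetical figure).
est.record(raw_estimate_hours=5.0, actual_hours=50.0)
print(est.estimate(5.0))  # the shared factor has now jumped for every task type
```

In this model the estimate stays far too low until the first long task actually completes, which is why a small cache is the safer setting while the factor settles.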

 

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8086420
RAC: 4597

Thanks Gary and Keith,

 

I run a lot of projects, so when that computer finished up with World Community Grid, PRIVATE GFN SERVER, RamanujanMachine and ClimatePrediction it was left with the 55 tasks downloaded by BOINC.

So yes it started running 24 tasks at once to get through the tasks.

As I said, the BOINC estimated run time of 4 hours 54 minutes was a mile off. Even after running some of them before Christmas, BOINC did not learn the run times.

 

Now that run times have passed 1 day and progress has not yet reached 47%, I have suspended all bar 12 work units.

 

However, this also paused the 12 left running, as BOINC downloaded LHC and LODA work units and started work on them instead, probably catching up after the hours spent on Einstein.

 

I will let it run and see how it goes. The work units aren't due until the 26th, so I should be good, enabling more Einstein work units as the others complete.

 

Conan

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4307
Credit: 249637960
RAC: 34372

Our CPU tasks are designed to run on the order of 10-12h (varying a lot with CPU type etc.). In the lower frequency range (mostly at the beginning of a GW search), there are often tasks for which we don't have enough data for this runtime, so these tasks may run significantly shorter.

That's, however, our design. In reality, once the search is launched, things may turn out to deviate quite a bit from that. Every search is a little bit different not only in data, but also in parameters. For the GPU search we first found that the memory requirement was much higher than what we estimated. Then there was this bug that occurred only with a combination of certain parameters (frequency and coherence time) that we possibly never had before (and it occurred in a part of the code that wasn't used before).

The CPU (part of the) search seems to suffer from even more technical issues this time. We stopped it before the holiday season as our (attention) capacity for the project was very limited during that time, and the problem on the GPU side was already becoming large enough.

We started the CPU part again a few days ago, and are still seeing problems occurring (you may notice on the Server Status page that the number of "failed" tasks is about 50% of the number of "valid" tasks, which is alarming).

Please bear with us as we fix these issues one by one. Credit, I have to admit, is of lower priority to us right now. We'll fix that, too, but in due time.

BM

Conan
Joined: 19 Jun 05
Posts: 172
Credit: 8086420
RAC: 4597

Bernd Machenschalk wrote:

Our CPU tasks are designed to run on the order of 10-12h (varying a lot with CPU type etc.). In the lower frequency range (mostly at the beginning of a GW search), there are often tasks for which we don't have enough data for this runtime, so these tasks may run significantly shorter.

That's, however, our design. In reality, once the search is launched, things may turn out to deviate quite a bit from that. Every search is a little bit different not only in data, but also in parameters. For the GPU search we first found that the memory requirement was much higher than what we estimated. Then there was this bug that occurred only with a combination of certain parameters (frequency and coherence time) that we possibly never had before (and it occurred in a part of the code that wasn't used before).

The CPU (part of the) search seems to suffer from even more technical issues this time. We stopped it before the holiday season as our (attention) capacity for the project was very limited during that time, and the problem on the GPU side was already becoming large enough.

We started the CPU part again a few days ago, and are still seeing problems occurring (you may notice on the Server Status page that the number of "failed" tasks is about 50% of the number of "valid" tasks, which is alarming).

Please bear with us as we fix these issues one by one. Credit, I have to admit, is of lower priority to us right now. We'll fix that, too, but in due time.

Thanks Bernd. I am continuing to process them; they haven't failed yet, they are just running a long time. With fewer running at once, they should run for shorter times from now on.

 

Conan

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5870
Credit: 116720584549
RAC: 36531836

Conan wrote:
... I have suspended all bar 12 work units.

That's not what I suggested you might try.

I suggested that - "You could try halving the number of threads in use" - which means run 12 CPU threads, not 24.  You do that by going into the manager and changing the setting for the number of 'cores' the client is allowed to use from 100% to 50%.  You are trying to find the 'sweet spot' - the number of active threads which gives you the best 'rate of progress'.  If 50% of the cores gives more than double the rate of progress you had previously, then you are using your resources more efficiently and getting more 'output' as a result.  You then try other settings (say 75%, etc.) until you find what works best (or run out of patience while doing so) :-).

By suspending "all but 12" you created a shortfall in the work on hand.  The client can't download replacements from Einstein (tasks are suspended) so of course it will fill the shortage from whatever other projects are available.  Then the client can run 24 threads once again and resource share considerations will determine the project mix for the 24 running tasks.  Obviously the client decided that other projects were 'owed' more share.  If you want to test just a single project, temporarily suspend the others while testing that project.

You are no closer to working out if there is an efficiency benefit from running fewer than the full 24 threads.  Please be aware that different projects may be more or less 'compute intensive', using CPU resources differently and consequently giving different 'sweet spots'.

It's quite likely there will be a gain in output by using somewhat fewer than 24 threads.  It's worthwhile doing the experiments to find out.
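
As a rough sketch of the comparison: throughput is the number of running tasks divided by the time each one takes, so fewer threads win whenever the per-task time drops by more than the thread count does. The runtimes below are placeholders, not measurements; substitute your own figures.

```python
def tasks_per_day(threads, hours_per_task):
    """Throughput of a host running `threads` tasks at once."""
    return threads * 24.0 / hours_per_task


# Placeholder measurements: {threads in use: observed hours per task}.
measurements = {24: 52.0, 18: 36.0, 12: 22.0}

for threads, hours in sorted(measurements.items()):
    print(f"{threads:2d} threads: {tasks_per_day(threads, hours):.1f} tasks/day")

# Pick the setting with the highest throughput among those tried.
best = max(measurements, key=lambda t: tasks_per_day(t, measurements[t]))
print(f"best of the settings tried: {best} threads")
```

The comparison is only meaningful if each setting runs the same type of task for long enough to get a steady per-task time.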

Cheers,
Gary.
