All-Sky Gravitational Wave Search on O3 data (O3ASHF1)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,240
Credit: 244,349,460
RAC: 23,737

Originally, you all were working on the recalculation step (from CPU to GPU) but said it didn't seem to speed up the work. Is anything in the works related to this?

There's still work being done on the "recalc" step. The problem is that this step requires heavily data-dependent random memory access, which is pretty bad for GPU memory. There are some tricks you can play with CUDA to help with that, and we plan to bring out a CUDA version of the app for NVidia GPUs. But this is still work in progress. And it will speed up the whole runtime only by 10-20% max, depending on the card.
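
As a rough illustration of why a faster recalc step only buys a bounded gain in total runtime, here is a minimal Python sketch of the usual Amdahl's-law estimate. The function name and the numbers (recalc taking 25% of the runtime, becoming 4x faster) are made up for the example and are not measurements from the app.

```python
def overall_runtime_saving(step_fraction: float, step_speedup: float) -> float:
    """Fraction of total runtime saved when one step gets faster (Amdahl's law).

    step_fraction: share of the total runtime spent in that step (0..1)
    step_speedup:  how many times faster the step becomes (> 0)
    """
    if not 0.0 <= step_fraction <= 1.0:
        raise ValueError("step_fraction must be between 0 and 1")
    if step_speedup <= 0.0:
        raise ValueError("step_speedup must be positive")
    # Only the time spent in the step shrinks; the rest is unchanged.
    return step_fraction * (1.0 - 1.0 / step_speedup)


# Hypothetical numbers, for illustration only.
saving = overall_runtime_saving(step_fraction=0.25, step_speedup=4.0)
print(f"total runtime saved: {saving:.2%}")
```

However large the step speedup gets, the saving can never exceed the step's own share of the runtime.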

The main problem for us is that by losing the computing power from <=4GB GPUs the search is progressing half as fast as we expected and designed it for. Getting more GPUs to help with that is therefore our higher priority.

BM

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 214
Credit: 8,358,552,691
RAC: 4,731,379

Bernd Machenschalk wrote:

The main problem for us is that by losing the computing power from <=4GB GPUs the search is progressing half as fast as we expected and designed it for. Getting more GPUs to help with that is therefore our higher priority.

Completely understand! We have seen the impact of the random memory access on our systems. The Threadripper Pros with 8 memory channels have been FAR superior to systems with the same/similar CPU and memory speeds but fewer channels. Our older systems that still have relatively fast CPUs but slower memory and only 2 channels really struggle with the recalc step. It has been a fun problem for us to optimize on our end (or, attempt to optimize!).
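
For a sense of scale on the channel count, here is a small Python sketch of theoretical peak memory bandwidth. It assumes DDR4-3200 (3200 MT/s) and a 64-bit (8-byte) bus per channel; both are assumptions for the example, not details given above, and the helper name is made up. Random access is also limited by latency, so real gains will not track these peak figures exactly.

```python
def peak_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                        bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels x transfer rate x bus width."""
    return channels * mega_transfers_per_s * 1e6 * bus_bytes / 1e9


# Assumed DDR4-3200 on both systems; only the channel count differs.
for channels in (2, 8):
    print(f"{channels} channels: {peak_bandwidth_gb_s(channels, 3200):.1f} GB/s")
```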

DF1DX
Joined: 14 Aug 10
Posts: 95
Credit: 2,873,932,897
RAC: 1,606,982

Thank you for the information.


What is the difference between

GW-opencl-ati / GW-opencl-ati-2 and

GW-opencl-nvidia / GW-opencl-nvidia-2 for version 1.06?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,240
Credit: 244,349,460
RAC: 23,737

The "-2" plan classes don't really exist yet. Ultimately these will be used to specify a lower VRAM requirement for the new workunits.

BM

Boca Raton Community HS
Joined: 4 Nov 15
Posts: 214
Credit: 8,358,552,691
RAC: 4,731,379

Could someone help me out with this error? I had this happen for a group of work units. I restarted the system and have not seen it again, but would like to have more insight into what it means. Thanks!

[14:51:54][4797][ERROR] Error synchronising after CUDA device->host HS data transfer (dirty phase 2) (error: 700)
[14:51:54][4797][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 700)
[14:51:54][4797][ERROR] Demodulation failed (error: 1007)!
14:51:54 (4797): called boinc_finish(1007)

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,631
Credit: 32,939,318,056
RAC: 6,784,200

my guess is some kind of problem with the driver.
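
For anyone who wants to collect these codes from several task logs, here is a minimal Python sketch that pulls the error codes out of the lines quoted above. It assumes the app prints the raw CUDA runtime error code; under that assumption 700 is cudaErrorIllegalAddress (an illegal memory access), after which later CUDA calls in the same process typically fail with the same code. The 1007 value comes from the application itself and is left unlabeled here. The function and table names are made up for the example.

```python
import re

# Label for CUDA runtime error 700, assuming the log prints raw cudaError_t values.
CUDA_ERROR_LABELS = {700: "cudaErrorIllegalAddress (illegal memory access)"}

# Matches "[ERROR] <message> (error: <code>)" anywhere in a line.
ERROR_LINE = re.compile(r"\[ERROR\]\s*(?P<msg>.*?)\s*\(error:\s*(?P<code>\d+)\)")


def extract_errors(log_text):
    """Return (message, code, label) for every [ERROR] line that carries a code."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line)
        if match:
            code = int(match.group("code"))
            label = CUDA_ERROR_LABELS.get(code, "unlabeled")
            results.append((match.group("msg"), code, label))
    return results


LOG = """\
[14:51:54][4797][ERROR] Error synchronising after CUDA device->host HS data transfer (dirty phase 2) (error: 700)
[14:51:54][4797][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 700)
[14:51:54][4797][ERROR] Demodulation failed (error: 1007)!
14:51:54 (4797): called boinc_finish(1007)
"""

for message, code, label in extract_errors(LOG):
    print(code, label, "|", message)
```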

_________________________________________________________________________

wujj123456
Joined: 16 Sep 08
Posts: 18
Credit: 1,426,416,678
RAC: 2,014,841

Is it possible to free all the GPU memory the moment O3AS is done using the GPU? From reading posts here and monitoring with nvtop, I believe that once the GPU calculation phase is done, the GPU is never used afterwards. However, I see all the memory is still allocated. It would be nice to free it sooner if possible. The benefit is tens of watts of savings if the GPU is not used by anything else, and I have this one crappy laptop that allocates power to the CPU based on how much the GPU is pulling...

I'm curious whether the latest source code is available. I couldn't find anything related to EAH following the instructions on the source code page. I don't see lalapps/src/pulsar/EinsteinAtHome/eah_build2.sh or anything related to EAH in the git repository. I doubt this is useful for most people, so it's probably not worth the core developers' time. I was hoping to check whether I could get lucky in case it's a simple change. :-D

In addition, could the "recalc" phase benefit from multiple threads? This could be helpful for systems with a weaker CPU, or without enough VRAM to stagger two tasks. Otherwise, it basically becomes mostly a CPU app, and throwing more cores at it might be useful.

Thanks.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,240
Credit: 244,349,460
RAC: 23,737

1. Indeed, freeing the GPU memory is theoretically possible, although it's not easy technically, i.e. within the current function call structure. Are you sure that just keeping something in memory draws noticeable power on the GPU even if its processing units are not used?

2. We thought about that ourselves. However, the change to the workunits that we are about to deploy may screw up all benefit of it. The current workunits analyze a 2 Hz frequency band, which is something like a sweet spot between efficiency and memory requirement. We plan to make this a workunit with two passes, each analyzing a 1 Hz band. This will add some time, e.g. for overhead because of the two calls, but should roughly cut the required memory in half (+ overhead). Freeing that in the first pass just to allocate it again right afterwards won't help you much, I'm afraid.

BM

mikey
Joined: 22 Jan 05
Posts: 11,757
Credit: 1,821,569,375
RAC: 495,898

Bernd Machenschalk wrote:

2. We thought about that ourselves. However, the change to the workunits that we are about to deploy may screw up all benefit of it. The current workunits analyze a 2 Hz frequency band, which is something like a sweet spot between efficiency and memory requirement. We plan to make this a workunit with two passes, each analyzing a 1 Hz band. This will add some time, e.g. for overhead because of the two calls, but should roughly cut the required memory in half (+ overhead).

Is there any reason to keep going up the Hz band, i.e. 3 Hz, 4 Hz, 5 Hz etc.? Or is that beyond the point of whatever you are looking for in this dataset?

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,240
Credit: 244,349,460
RAC: 23,737

I'm not sure if I understand the question. In O3ASHF1 we are analyzing O3 data in a "high" (for GW) frequency range (800-1500 Hz), in 2 Hz per workunit. These 2 Hz of a workunit will be split in halves and done in two 1 Hz passes. Does that help?
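
The band arithmetic can be sketched in a few lines of Python. This only counts 2 Hz frequency bands across 800-1500 Hz and splits each one into two 1 Hz passes; the real number of workunits presumably depends on other search parameters as well, and the function names are made up for the example.

```python
from typing import List, Tuple

Band = Tuple[int, int]  # (start_hz, end_hz)


def workunit_bands(f_min_hz: int = 800, f_max_hz: int = 1500,
                   width_hz: int = 2) -> List[Band]:
    """List the frequency bands of width_hz that tile [f_min_hz, f_max_hz)."""
    return [(f, f + width_hz) for f in range(f_min_hz, f_max_hz, width_hz)]


def split_band(band: Band, passes: int = 2) -> List[Band]:
    """Split one band into equally wide sub-bands, one per pass."""
    start_hz, end_hz = band
    width = end_hz - start_hz
    if width % passes != 0:
        raise ValueError("band width must be divisible by the number of passes")
    step = width // passes
    return [(start_hz + i * step, start_hz + (i + 1) * step) for i in range(passes)]


bands = workunit_bands()
print(len(bands), "bands of 2 Hz")
print("first band:", bands[0], "->", split_band(bands[0]))
print("last band: ", bands[-1], "->", split_band(bands[-1]))
```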

BM
