Originally, you all were working on moving the recalculation step from the CPU to the GPU, but said it didn't seem to speed up the work. Is anything in the works related to this?
There's still work being done on the "recalc" step. The problem is that this step requires heavily data-dependent random memory access, which is a poor fit for GPU memory. There are some tricks you can play to help with that in CUDA, and we plan to bring out a CUDA version of the app for NVidia GPUs. But this is still work in progress, and it will speed up the whole runtime by only 10-20% at most, depending on the card.
The main problem for us is that by losing the computing power of <=4GB GPUs, the search is progressing only half as fast as we expected and designed it for. Getting more GPUs to help with that is therefore our higher priority.
BM
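For readers wondering why data-dependent access is so hostile to GPUs, here is a minimal sketch (the kernel and all names are invented for illustration; this is not the actual Einstein@Home code). Each thread fetches from an index computed from the data itself, so neighbouring threads hit scattered addresses and the loads cannot coalesce. Routing the reads through the read-only cache with __ldg() is one of the CUDA tricks that can soften the penalty.

    // Hypothetical illustration of the "recalc" access pattern, NOT the
    // actual Einstein@Home kernel: each thread reads from an address
    // that depends on previously computed data, so a warp touches
    // scattered cache lines and the loads cannot be coalesced.
    __global__ void gather_recalc(const float* __restrict__ table,
                                  const int*   __restrict__ indices,
                                  float*       out,
                                  int          n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        // Data-dependent index: neighbouring threads may end up reading
        // addresses that are megabytes apart.
        int j = indices[i];

        // __ldg() routes the read through the read-only/texture cache
        // (compute capability 3.5+), which tolerates scattered access
        // somewhat better than a plain global load.
        out[i] = __ldg(&table[j]) * 2.0f;
    }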
Completely understand! We have seen the impact of the random memory access on our systems. The Threadripper Pros with 8 memory channels have been FAR superior to systems with the same or similar CPU and memory speeds but fewer channels. Our older systems that still have relatively fast CPUs but slower memory and only 2 channels really struggle with the recalc step. It has been a fun problem for us to optimize on our end (or attempt to!).
Thank you for the information. What is the difference between GW-opencl-ati / GW-opencl-ati-2 and GW-opencl-nvidia / GW-opencl-nvidia-2 for version 1.06?
The "-2" plan classes don't really exist yet. Ultimately these will be used to specify a lower VRAM requirement for the new workunits.
BM
Could someone help me out with this error? It happened for a group of work units. I restarted the system and have not seen it since, but I would like more insight into what it means. Thanks!
[14:51:54][4797][ERROR] Error synchronising after CUDA device->host HS data transfer (dirty phase 2) (error: 700)
[14:51:54][4797][ERROR] Error during CUDA host->device HS thresholds data transfer (error: 700)
[14:51:54][4797][ERROR] Demodulation failed (error: 1007)!
14:51:54 (4797): called boinc_finish(1007)
My guess is some kind of problem with the driver.
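For reference, CUDA error 700 is cudaErrorIllegalAddress ("an illegal memory access was encountered"). Kernel faults like this are reported asynchronously, which is why the error surfaces at the next synchronisation or transfer rather than in the kernel that caused it, consistent with a driver or hardware hiccup. A minimal sketch of how such errors are typically caught (hypothetical helper, not the project's code):

    #include <cstdio>
    #include <cuda_runtime.h>

    // Minimal error-checking helper (hypothetical, for illustration).
    // Faults such as cudaErrorIllegalAddress (700) only show up at the
    // next API call that synchronises with the device, e.g. the memcpy
    // after a kernel launch.
    #define CUDA_CHECK(call)                                           \
        do {                                                           \
            cudaError_t err = (call);                                  \
            if (err != cudaSuccess) {                                  \
                fprintf(stderr, "%s failed: %s (error: %d)\n",         \
                        #call, cudaGetErrorString(err), (int)err);     \
            }                                                          \
        } while (0)

    // Usage note: the error attributed to a transfer may really stem
    // from an earlier kernel launch.
    // CUDA_CHECK(cudaDeviceSynchronize());
    // CUDA_CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost));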
Is it possible to free all the GPU memory the moment O3AS is done using the GPU? From reading posts here and monitoring with nvtop, I believe that once the GPU calculation phase is done, the GPU is never used again. However, I see that all the memory is still held. It would be nice if it could be freed sooner. The benefit is tens of watts of savings when the GPU is not used by anything else, and I have one crappy laptop that allocates power to the CPU based on how much the GPU is pulling...
I'm curious whether the latest source code is available. I couldn't find anything related to EAH following the instructions on the source code page. I don't see lalapps/src/pulsar/EinsteinAtHome/eah_build2.sh or anything related to EAH in the git repository. I doubt this is useful for most people, so it's probably not worth a core developer's time; I was just hoping to get lucky in case it's a simple change. :-D
In addition, could the "recalc" phase benefit from multiple threads? This could be helpful for systems with a weaker CPU, or without enough VRAM to stagger two tasks. Otherwise, it basically becomes a CPU app at that point, and throwing more cores at it might help. Thanks.
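To make the suggestion concrete, here is a minimal sketch of what a multi-threaded recalc loop could look like, assuming the per-candidate work is independent (all names here are invented for illustration; this is not the Einstein@Home source):

    // Hypothetical sketch of a multi-threaded "recalc" loop. Compile
    // with OpenMP enabled (e.g. -fopenmp); without it the pragma is
    // ignored and the loop runs serially.
    #include <vector>

    struct Candidate { double freq; double stat; };

    // Stand-in for the per-candidate recalculation work, assumed to be
    // CPU-heavy and random-access bound.
    void recalc_candidate(Candidate& c) { c.stat = c.freq * 0.5; }

    void recalc_all(std::vector<Candidate>& cands)
    {
        // Each iteration touches different, data-dependent memory, so
        // the speed-up would be limited by memory bandwidth (and the
        // number of memory channels) rather than by core count.
        #pragma omp parallel for schedule(dynamic)
        for (long i = 0; i < (long)cands.size(); ++i)
            recalc_candidate(cands[i]);
    }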
1. Freeing the GPU memory is indeed theoretically possible, although it's not technically easy within the current function call structure. Are you sure that just keeping something in memory draws noticeable power on the GPU even when its processing units are not in use?
2. We thought about that ourselves. However, the change to the workunits that we are about to deploy may wipe out any benefit of it. The current workunits analyze a 2 Hz frequency band, which is something like a sweet spot between efficiency and memory requirement. We plan to make this a workunit with two passes, each analyzing a 1 Hz band. This will add some time, e.g. overhead from the two calls, but should roughly cut the required memory in half (plus overhead). Freeing that memory in the first pass just for it to be allocated again right afterwards won't help you much, I'm afraid (see the sketch after this post).
BM
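A rough sketch of the planned shape, inferred from the description above (buffer size, names, and the pass interface are all assumptions, not the actual code): one half-sized buffer is reused across two 1 Hz passes, which is both why peak VRAM roughly halves and why freeing it between passes buys little.

    #include <cuda_runtime.h>

    // Hypothetical outline of the planned two-pass workunit: the 2 Hz
    // band is processed as two 1 Hz passes over one half-sized device
    // buffer, roughly halving peak VRAM at the cost of a second
    // setup/teardown overhead.
    void analyze_band(double f0, void (*do_pass)(float*, double, double))
    {
        // Placeholder size: roughly half of the old 2 Hz footprint
        // (assumption for illustration, here 1 GiB).
        const size_t half_band_bytes = 1u << 30;
        float* d_buf = nullptr;
        cudaMalloc(&d_buf, half_band_bytes);

        do_pass(d_buf, f0,       f0 + 1.0);  // first 1 Hz pass
        do_pass(d_buf, f0 + 1.0, f0 + 2.0);  // second 1 Hz pass
        // Freeing d_buf between the passes would only see it
        // re-allocated immediately, hence little benefit (as noted
        // above). Freeing it here, once the GPU phase is over, is the
        // early-release idea from point 1.
        cudaFree(d_buf);
        cudaDeviceSynchronize();
    }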
Is there any reason to keep going up in band size, i.e. 3 Hz, 4 Hz, 5 Hz, etc.? Or is that beyond the point of whatever you are looking for in this dataset?
I'm not sure I understand the question. In O3ASHF1 we are analyzing O3 data in a "high" (for GW) frequency range (800-1500 Hz), at 2 Hz per workunit. The 2 Hz of a workunit will be split in half and done in two 1 Hz passes. Does that help?
BM