It's like this: the search code can be thought of as a loop over "templates", where each iteration of that loop has several stages.
After several years of incremental optimization, almost all of the search code within this main loop runs on the GPU. The only exception now is the management of the list of candidates to send back to the server, the "toplist". This is still done on the CPU, e.g. because the list of candidates found so far must periodically be written to disk as "checkpoints", something that code on the GPU cannot do.
Originally, near the end of each loop iteration, we copied the entire result of the GPU processing step back to main RAM, where the candidate-selection code would go sequentially through those results and insert into the toplist every candidate that qualified, i.e. every candidate "better" than the toplist's current weakest (last) entry.
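To illustrate (a minimal sketch only, not the actual source; Candidate, Toplist and every other name here are made up for this example), such a host-side toplist, including the periodic checkpoint write that ties it to the CPU, might look like this in C++:

    #include <algorithm>
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    struct Candidate {
        double strength;    // detection statistic; higher is "better"
        long   template_id; // which template produced it
    };

    class Toplist {
    public:
        explicit Toplist(std::size_t capacity) : capacity_(capacity) {}

        // Weakest strength a candidate needs to enter the toplist.
        // While the list is not yet full, anything qualifies.
        double threshold() const {
            return entries_.size() < capacity_ ? -1e300
                                               : entries_.back().strength;
        }

        // Sequential candidate selection: keep the candidate only if
        // it beats the current weakest (last) entry.
        void maybe_insert(const Candidate& c) {
            if (c.strength <= threshold()) return;
            auto pos = std::upper_bound(
                entries_.begin(), entries_.end(), c,
                [](const Candidate& a, const Candidate& b) {
                    return a.strength > b.strength; // descending order
                });
            entries_.insert(pos, c);
            if (entries_.size() > capacity_) entries_.pop_back();
        }

        // Periodic checkpoint: this is the kind of disk I/O that
        // keeps the toplist management on the CPU side.
        void write_checkpoint(const char* path) const {
            if (std::FILE* f = std::fopen(path, "w")) {
                for (const Candidate& c : entries_)
                    std::fprintf(f, "%ld %.6f\n", c.template_id, c.strength);
                std::fclose(f);
            }
        }

    private:
        std::size_t capacity_;
        std::vector<Candidate> entries_;
    };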
This is somewhat wasteful. In the new version we look at the toplist *before* starting the GPU part of the iteration, which gives us a threshold: the minimum "strength" a candidate needs to make it into the toplist. During the GPU processing we take note whenever this threshold is crossed. If we find that the threshold was never crossed, we can skip copying the results back to main memory entirely for that iteration, because nothing in them can make it into the toplist. This saves PCIe bandwidth (for dedicated GPU cards) and CPU processing time, because we don't need to inspect those results for candidates either.
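Continuing that sketch, the new control flow could look roughly like the following. gpu_stage() only simulates the kernel launch on the host here; in the real app the "crossed" flag would live in device memory and be the only thing read back in the common case:

    #include <random>

    // Simulated GPU stage: compute one strength per template and
    // record whether any of them crossed the threshold.
    std::vector<double> gpu_stage(std::size_t n_templates, double threshold,
                                  bool* crossed, std::mt19937& rng) {
        std::normal_distribution<double> noise(0.0, 1.0);
        std::vector<double> results(n_templates);
        *crossed = false;
        for (double& r : results) {
            r = noise(rng);
            if (r > threshold) *crossed = true;
        }
        return results;
    }

    void process_iteration(Toplist& toplist, std::mt19937& rng) {
        // 1. Read the threshold from the toplist *before* the GPU stage.
        const double threshold = toplist.threshold();

        // 2. Run the GPU stage; afterwards only the flag is inspected.
        bool crossed = false;
        std::vector<double> results = gpu_stage(4096, threshold, &crossed, rng);

        // 3. If the threshold was never crossed, skip the copy-back and
        //    the CPU-side scan entirely for this iteration.
        if (!crossed) return;
        long id = 0;
        for (double s : results)
            toplist.maybe_insert({s, id++});
    }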
This also explains why some workunits can be "lucky": if many strong signal candidates are found early in the search, the threshold is raised for all remaining templates, which cuts down the number of transfers needed. If a workunit has no clear outliers at all, however, the toplist fills with candidates more evenly over the runtime and the saving is much smaller.
This is a bit simplified and doesn't cover all the details, but it should describe the effect quite well. A further optimization I'll work on now is to allow partial transfers of results from GPU memory to host memory, instead of the yes/no decision implemented now.
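Since that part is still in development, the following is purely speculative: a partial transfer could, for instance, work at the granularity of fixed-size blocks of the result buffer, with one GPU-maintained flag per block, so that the host only copies back and scans the flagged blocks:

    #include <algorithm> // std::min

    constexpr std::size_t kBlockSize = 1024; // illustrative granularity

    // flags[b] is true if block b of the (simulated) device buffer
    // holds at least one above-threshold result. In the real app only
    // flagged blocks would cross the PCIe bus; the rest are never
    // transferred or scanned.
    void scan_flagged_blocks(const std::vector<double>& device_results,
                             const std::vector<bool>& flags,
                             Toplist& toplist) {
        for (std::size_t b = 0; b < flags.size(); ++b) {
            if (!flags[b]) continue; // skipped: no transfer, no scan
            const std::size_t begin = b * kBlockSize;
            const std::size_t end =
                std::min(device_results.size(), begin + kBlockSize);
            for (std::size_t i = begin; i < end; ++i)
                toplist.maybe_insert({device_results[i],
                                      static_cast<long>(i)}); // buffer index as stand-in id
        }
    }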
Hi HBE,
Is the processing methodology described above in the opencl-ati beta app or is it something that can/will be added in the future?
Gord
As far as I know the part after "In the new version..." is in the current beta, whereas "A further optimization..." is still in development.
MrS
Scanning for our furry friends since Jan 2002
Exactly!
HB
An update of the cuda version (toolkit) from the old 3.2 to a more recent 5.5 or even 6.5 was discussed some time ago. It should be quite easy to do and could yield a few extra % in processing speed. Is this still on the road map or was it dropped?
Planning is still as described here: http://einsteinathome.org/node/197990&nowrap=true#138717
In a nutshell, once we have this app version stable we are planning to offer both CUDA 3.2 and 5.5 app versions for a transition period, and then we will see a) what we gain by including CUDA 5.5 support but also b) how many hosts we would lose by dropping CUDA 3.2 support and requiring CUDA 5.5+ in the future. We hope to be able to drop CUDA 3.2 support and switch to 5.5. We'll see.
HB
I see I'm getting an updated v1.52 for cuda32 and - new this time - intel-gpu. Anything in particular you'd like us to watch out for?
Getting them for AMD also... promoted a few to run now.
My first quick feedback: link
MrS
Scanning for our furry friends since Jan 2002
I promoted a full set of 1.52, so have run a total of eleven, on five different GPUs residing on three hosts. Uneventful during run time, so far as I could tell, with execution times and CPU times never far above the base population for 1.47/1.50. Perhaps this means 1.52 implements the tail-curtailing scheme Bikeman has been foreshadowing, and it works nicely; or perhaps this first batch I got just happened to fall in the base population anyway, and the real change is something else.
Sadly, one of the eleven raised a Validate error (58:00111010). This was on the GPU which had already generated more than one of these on 1.50, so it may have nothing specific to do with the 1.52 changes.
Yes, the version 1.52 beta apps should hopefully have a more uniform run time, not far from the mean runtime of the previous beta app.
Cheers
HB