You are speaking of "mismatched pairs", of "fortunate" and "unfortunate" beta (work)units, of "good or bad fortune", of "fine-grained gradation". You suppose an "unbalanced switching process" between CPU and GPU ...
Maybe I would understand better what happens if I could get more information about the mentioned data dependency.
It's like this: The search code can be thought of as a loop over "templates", where the loop has different stages.
After several years of incremental optimization, almost the complete search code within this main loop runs on the GPU. The only exception now is the management of the list of candidates to send back to the server, the "toplist". This is still done on the CPU, e.g. to periodically write the list of candidates found so far to the disk as "checkpoints", something that code on the GPU cannot do.
Originally, near the end of each loop iteration, we copied the entire result of the GPU processing step back to main RAM, where the candidate-selection code would go sequentially through those results and put them into the toplist of candidates to keep if they make it onto this toplist (candidates that are "better" than the last entry in the toplist).
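To illustrate the sequential CPU-side selection, here is a minimal Python sketch. Everything in it is assumed for illustration (the function name, the min-heap representation, the toplist size of 3); the actual search code is of course different:

```python
import heapq

TOPLIST_SIZE = 3  # assumed capacity; the real size is a search parameter

def toplist_insert(toplist, strength):
    """Keep the strongest TOPLIST_SIZE candidates in a min-heap.
    toplist[0] is the weakest entry currently kept, so a new candidate
    is accepted only if it is "better" than that last entry."""
    if len(toplist) < TOPLIST_SIZE:
        heapq.heappush(toplist, strength)
        return True
    if strength > toplist[0]:
        heapq.heapreplace(toplist, strength)  # drop the weakest, keep the new one
        return True
    return False

# Sequential scan of one iteration's results after copying them from the GPU:
toplist = []
for strength in [0.3, 2.1, 0.7, 5.4, 1.0]:
    toplist_insert(toplist, strength)
# toplist now holds only the three strongest candidates
```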
This is somewhat wasteful. In the new version we look at the toplist *before* starting the GPU part of the iteration to give us a threshold of the minimum "strength" of a candidate for it to make it to the toplist. During the GPU processing, we take note when this threshold is crossed. If we find that the threshold was never crossed during the GPU processing, we can completely skip writing the results back to the main memory in that iteration because there can't be anything in it that will make it to the toplist. This saves PCIe bandwidth (for dedicated GPU cards) and CPU processing time because we don't need to inspect those results for candidates either.
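The decision at the heart of this optimization can be sketched in Python like this (a sketch under assumptions: the per-iteration maximum strength stands in for the flag the GPU actually sets during processing, and all names are invented):

```python
def iteration_needs_transfer(toplist, toplist_capacity, gpu_max_strength):
    """Decide, before copying anything over PCIe, whether this iteration's
    GPU results can possibly contain a new toplist entry.

    The threshold is read from the toplist *before* the GPU stage: once
    the toplist is full, it is the strength of its weakest (last) entry;
    while the toplist is still filling up, every candidate qualifies."""
    if len(toplist) < toplist_capacity:
        return True                      # toplist not full yet: everything qualifies
    threshold = min(toplist)             # weakest kept candidate
    return gpu_max_strength > threshold  # threshold never crossed -> skip transfer
```

An iteration whose strongest result never crosses the threshold then costs neither PCIe bandwidth nor CPU scanning time.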
This also explains why some workunits can be "lucky": if many strong signal candidates are found early in the search, this sets higher thresholds for all the rest of the templates and cuts down on the number of transfers needed. If a work unit has no clear outliers at all, however, the toplist builds up with candidates more evenly over the runtime and the saving effect is much smaller.
This is a bit simplified and doesn't explain all the details but the gist of it should describe this effect quite well. A further optimization I'll do now is to allow for partial transfers of results from GPU memory to host memory instead of the yes/no decision implemented now.
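One possible shape of that planned partial transfer, sketched in Python — this is purely an assumption about how it could work, not the actual implementation:

```python
def blocks_to_transfer(block_maxima, threshold):
    """Hypothetical partial-transfer selection: if the GPU tracks a
    per-block maximum strength, only the blocks whose maximum crosses
    the current toplist threshold need to be copied back and scanned,
    instead of the all-or-nothing transfer implemented now."""
    return [i for i, block_max in enumerate(block_maxima) if block_max > threshold]

# Only blocks 1 and 3 cross the threshold, so only those would be copied:
selected = blocks_to_transfer([1.0, 4.0, 0.5, 3.2], threshold=3.0)
```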
Is there a similar improvement in performance expected for BRP4G too, since it uses the same application?
That is a very good question. It's using the same application, but different search parameters, and to make things more complicated, the BRP4G tasks go out to a very special breed of GPUs (Intel GPUs integrated in the CPU, not dedicated GPUs). Too many variables for me to make a good guess; we will try this later.
HBE
Wait - what about BRP4G-cuda32-nv301? These go to DGPUs.
We do not expect to have work for this application most of the time. Our main supply of GPU workunits for the near future will come from BRP6, with only a few WUs from BRP4 set aside for Android, ARM Linux and Intel HD GPUs.
Not any more - they've run out of data, and what little there is left is being preferentially held back for intel_gpu, which can't run the bigger tasks. There are announcements about that somewhere round here.
Here is a first summary of running 2 tasks in parallel on a 750Ti:
Using v1.39 the average of 40 workunits was:
20366s runtime and 2103s CPU time.
The long v1.50 task (PM0007_01161_126_1) was:
22643s runtime and 5254s CPU time. (!)
The other six v1.49/v1.50 tasks I've done so far have each taken pretty much the same time, on average:
15942s runtime and 550s CPU time.
Single work unit v1.50 per GPU:
PM0007_016D1_316_0: 14360 s runtime, 10533 s CPU time
PM0007_016D1_326_0: 14363 s runtime, 10345 s CPU time
PM0007_016D1_362_0: 15212 s runtime, 10400 s CPU time
PM0007_016D1_206_1: 16121 s runtime, 7034 s CPU time
Edit... Each work unit uses between 68-72% of one core.
Noticed changes in the calculation duration: while a WU usually took about 2 hours with version 1.39 cuda, with the new 1.50 beta cuda32 it now takes 3 hours 45 minutes.
Just an observation, no complaint.
There is quite a lot of variability in the new app version depending on the tasks you receive. Looking back at your completed v1.50 beta units, some have finished in ~90 minutes! You're just experiencing a bit of the rough with the smooth, so stick with it :-) The beta app will pay dividends in the long run ;-)
Ah, I thought that the BRP4G shortage was just temporary. OK, let's get back to the topic.
Thank you very much for this summary, Bikeman :-)
I know I am a part of a story that starts long before I can remember and continues long beyond when anyone will remember me [Danny Hillis, Long Now]
Ah, good to know. Thanks for the info :)
Om mani padme hum.