Binary Radio Pulsar Search (Parkes PMPS XT) "BRP6"

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729963622
RAC: 1191206

RE: You are speaking of

Quote:

You are speaking of "mismatched pairs", "fortunate" and "unfortunate" beta (work)units, about "good or bad fortune", of "fine-grained gradation". You suppose an "unbalanced switching process" between CPU and GPU ...

Maybe I better understand what happens if I could get more information about the mentioned data-dependency.

It's like this: The search code can be thought of as a loop over "templates", where the loop has different stages.

After several years of incremental optimization, almost all of the search code within this main loop runs on the GPU. The only exception now is the management of the list of candidates to send back to the server, the "toplist". This is still done on the CPU, e.g. to periodically write the candidates found so far to disk as "checkpoints", something that code on the GPU cannot do.

Originally, near the end of each loop iteration, we copied the entire result of the GPU processing step back to main RAM, where the candidate-selection code would go through those results sequentially and insert into the toplist every candidate that is "better" than the current last entry in the toplist.

This is somewhat wasteful. In the new version we look at the toplist *before* starting the GPU part of the iteration to give us a threshold of the minimum "strength" of a candidate for it to make it to the toplist. During the GPU processing, we take note when this threshold is crossed. If we find that the threshold was never crossed during the GPU processing, we can completely skip writing the results back to the main memory in that iteration because there can't be anything in it that will make it to the toplist. This saves PCIe bandwidth (for dedicated GPU cards) and CPU processing time because we don't need to inspect those results for candidates either.
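To make the skip logic concrete, here is a minimal Python sketch of one loop iteration. It is not the actual application code: the function name `process_iteration`, the min-heap toplist and the plain list standing in for GPU results are all assumptions made for illustration, and on a real GPU the "threshold crossed" flag would be set during processing rather than computed afterwards.

```python
import heapq

def process_iteration(toplist, capacity, gpu_results):
    """Emulate one iteration; return True if results had to be transferred.

    toplist is a min-heap of candidate strengths (hypothetical layout),
    so toplist[0] is always the weakest candidate currently kept.
    """
    # Look at the toplist *before* the GPU step: once it is full, only
    # candidates stronger than its weakest entry can still get in.
    threshold = toplist[0] if len(toplist) >= capacity else float("-inf")

    # Stand-in for the flag the GPU code would raise during processing.
    crossed = any(power > threshold for power in gpu_results)
    if not crossed:
        return False  # skip the device-to-host copy and the CPU scan

    # Only now "copy back" and scan the results sequentially on the CPU.
    for power in gpu_results:
        if len(toplist) < capacity:
            heapq.heappush(toplist, power)
        elif power > toplist[0]:
            heapq.heapreplace(toplist, power)
    return True
```

With a capacity of 3, an iteration yielding `[1.0, 5.0, 3.0]` fills the toplist; a following iteration yielding `[0.5, 0.9]` is then skipped entirely because nothing exceeds the weakest kept candidate.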

This also explains why some workunits can be "lucky": if many strong signal candidates are found early in the search, this sets higher thresholds for all the rest of the templates and cuts down on the number of transfers needed. If a work unit has no clear outliers at all, however, the toplist fills up more evenly over the runtime and the savings are much smaller.
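The "lucky" versus "unlucky" effect can be reproduced with a toy simulation. This is only a sketch with made-up numbers: `count_transfers`, the toplist capacity of 3 and the synthetic candidate strengths are assumptions for illustration, not real search data or parameters.

```python
import heapq

def count_transfers(iterations, capacity):
    """Count how many iterations need a GPU-to-host transfer."""
    toplist = []  # min-heap: toplist[0] is the weakest kept candidate
    transfers = 0
    for results in iterations:
        threshold = toplist[0] if len(toplist) >= capacity else float("-inf")
        if max(results) <= threshold:
            continue  # nothing can enter the toplist: skip the transfer
        transfers += 1
        for power in results:
            if len(toplist) < capacity:
                heapq.heappush(toplist, power)
            elif power > toplist[0]:
                heapq.heapreplace(toplist, power)
    return transfers

# Synthetic data: ten iterations of weak, slowly rising values ...
noise = [[0.1 * i, 0.1 * i + 0.05] for i in range(1, 11)]
# ... preceded, in the "lucky" case, by three strong early outliers.
lucky = [[100.0, 90.0, 80.0]] + noise
unlucky = noise

print(count_transfers(lucky, 3), count_transfers(unlucky, 3))
```

In this toy setup the lucky sequence needs a single transfer out of eleven iterations, because the early outliers fix a high threshold, while the sequence without outliers transfers in all ten of its iterations.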

This is a bit simplified and doesn't explain all the details but the gist of it should describe this effect quite well. A further optimization I'll do now is to allow for partial transfers of results from GPU memory to host memory instead of the yes/no decision implemented now.
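How a partial transfer might work is not specified above, so the following is only a guess at the general idea, written in Python for readability: instead of copying everything or nothing, keep just the entries that beat the threshold, together with their positions. The helper names `select_for_transfer` and `transfer_fraction` are hypothetical.

```python
def select_for_transfer(gpu_results, threshold):
    """Return (index, value) pairs that beat the threshold, in original order."""
    return [(i, p) for i, p in enumerate(gpu_results) if p > threshold]

def transfer_fraction(gpu_results, threshold):
    """Fraction of the result buffer that would still need to be copied."""
    if not gpu_results:
        return 0.0
    return len(select_for_transfer(gpu_results, threshold)) / len(gpu_results)
```

For example, with results `[0.2, 7.5, 0.9, 3.1]` and a threshold of `1.0`, only the entries at positions 1 and 3 would be copied, i.e. half of the buffer instead of all of it.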

HBE

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3525227848
RAC: 1504904

RE: RE: Is there a

Quote:
Quote:
Is there a similar improvement in performance expected for BRP4G too, since it uses the same application?

That is a very good question. It's using the same application, but different search parameters, and to make things more complicated, the BRP4G tasks go out to a very special breed of GPUs (Intel GPUs integrated in the CPU, not dedicated GPUs). Too many variables for me to make a good guess; we will try this later.

HBE

Wait - what about BRP4G-cuda32-nv301? These go to DGPUs.

-----

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 729963622
RAC: 1191206

RE: Wait - what about

Quote:

Wait - what about BRP4G-cuda32-nv301? These go to DGPUs.

We do not expect to have work for this application most of the time. Our main supply of GPU workunits for the near future will come from BRP6, with only a few WUs from BRP4 set aside for Android, ARM Linux and Intel HD GPUs.

Cheers
HB

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2959282820
RAC: 707336

RE: RE: RE: Is there a

Quote:
Quote:
Quote:
Is there a similar improvement in performance expected for BRP4G too, since it uses the same application?

That is a very good question. It's using the same application, but different search parameters, and to make things more complicated, the BRP4G tasks go out to a very special breed of GPUs (Intel GPUs integrated in the CPU, not dedicated GPUs). Too many variables for me to make a good guess; we will try this later.

HBE


Wait - what about BRP4G-cuda32-nv301? These go to DGPUs.


Not any more - they've run out of data, and what little there is left is being preferentially held back for intel_gpu, which can't run the bigger tasks. There are announcements about that somewhere round here.

Mumak
Joined: 26 Feb 13
Posts: 325
Credit: 3525227848
RAC: 1504904

Ah, I thought that the BRP4G

Ah, I thought that the BRP4G shortage was just temporary. OK, let's get back on topic.

-----

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

RE: So the long v1.50

Quote:

So the long v1.50 finished over night.

Here is a first summary of running 2 tasks in parallel on a 750Ti:

Using v1.39 the average of 40 workunits was:
20366s runtime and 2103s CPU time.

The long v1.50 task (PM0007_01161_126_1) was:
22643s runtime and 5254s CPU time. (!)

The other six v1.49/v1.50 tasks I've done so far have each taken pretty much the same time, on average:
15942s runtime and 550s CPU time.

Single work unit v1.50 per GPU:

Task                  Runtime (s)  CPU time (s)
PM 0007_016D1_316_0   14360        10533
PM 0007_016D1_326_0   14363        10345
PM 0007_016D1_362_0   15212        10400
PM 0007_016D1_206_1   16121        7034

Edit: each work unit uses between 68% and 72% of one CPU core.

Daniels_Parents
Joined: 9 Feb 05
Posts: 101
Credit: 1877689213
RAC: 0

RE: It's like this:

Quote:
It's like this: ...

Thank you very much for this summary, Bikeman :-)

I know I am a part of a story that starts long before I can remember and continues long beyond when anyone will remember me [Danny Hillis, Long Now]

Michael Hoffmann
Joined: 31 Oct 10
Posts: 32
Credit: 31031260
RAC: 0

Noticed changes with the

I noticed a change in the calculation duration:

While a WU usually took about 2 hours with version 1.39 cuda, now with the new 1.50 beta cuda32 it takes 3 hours 45 minutes.

Just an observation, no complaint.

Om mani padme hum.

Gavin
Joined: 21 Sep 10
Posts: 191
Credit: 40644337738
RAC: 1

RE: Noticed changes with

Quote:

I noticed a change in the calculation duration:

While a WU usually took about 2 hours with version 1.39 cuda, now with the new 1.50 beta cuda32 it takes 3 hours 45 minutes.

Just an observation, no complaint.

There is quite a lot of variability in the new app version depending on the tasks you receive. Looking back at your completed v1.50 beta units, some have finished in ~90 minutes! You're just experiencing a bit of the rough with the smooth, so stick with it :-) The beta app will pay dividends in the long run ;-)

Michael Hoffmann
Joined: 31 Oct 10
Posts: 32
Credit: 31031260
RAC: 0

RE: RE: Noticed changes

Quote:
Quote:

I noticed a change in the calculation duration:

While a WU usually took about 2 hours with version 1.39 cuda, now with the new 1.50 beta cuda32 it takes 3 hours 45 minutes.

Just an observation, no complaint.

There is quite a lot of variability in the new app version depending on the tasks you receive. Looking back at your completed v1.50 beta units, some have finished in ~90 minutes! You're just experiencing a bit of the rough with the smooth, so stick with it :-) The beta app will pay dividends in the long run ;-)

Ah, good to know. Thanks for the info :)

Om mani padme hum.
