You are speaking of "mismatched pairs", of "fortunate" and "unfortunate" beta (work)units, of "good or bad fortune", of "fine-grained gradation". You suppose an "unbalanced switching process" between CPU and GPU ...
Maybe I would understand better what happens if I could get more information about the mentioned data dependency.
It's like this: The search code can be thought of as a loop over "templates", where the loop has different stages.
After several years of incremental optimization, almost the complete search code within this main loop runs on the GPU. The only exception now is the management of the list of candidates to send back to the server, the "toplist". This is still done on the CPU, e.g. to periodically write the list of candidates found so far to the disk as "checkpoints", something that code on the GPU cannot do.
Originally, near the end of each loop iteration, we copied the entire result of the GPU processing step back to main RAM, where the candidate-selection code would go sequentially through those results and put them into the toplist of candidates to keep if they make it onto this toplist (candidates that are "better" than the last entry in the toplist).
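To illustrate the sequential CPU-side selection, here is a minimal Python sketch. Everything in it is assumed for illustration (the function name, the min-heap representation, the toplist size of 3); the actual search code is of course different:

```python
import heapq

TOPLIST_SIZE = 3  # assumed capacity; the real size is a search parameter

def toplist_insert(toplist, strength):
    """Keep the strongest TOPLIST_SIZE candidates in a min-heap.
    toplist[0] is the weakest entry currently kept, so a new candidate
    is accepted only if it is "better" than that last entry."""
    if len(toplist) < TOPLIST_SIZE:
        heapq.heappush(toplist, strength)
        return True
    if strength > toplist[0]:
        heapq.heapreplace(toplist, strength)  # drop the weakest, keep the new one
        return True
    return False

# Sequential scan of one iteration's results after copying them from the GPU:
toplist = []
for strength in [0.3, 2.1, 0.7, 5.4, 1.0]:
    toplist_insert(toplist, strength)
# toplist now holds only the three strongest candidates
```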
This is somewhat wasteful. In the new version we look at the toplist *before* starting the GPU part of the iteration to give us a threshold of the minimum "strength" of a candidate for it to make it to the toplist. During the GPU processing, we take note when this threshold is crossed. If we find that the threshold was never crossed during the GPU processing, we can completely skip writing the results back to the main memory in that iteration because there can't be anything in it that will make it to the toplist. This saves PCIe bandwidth (for dedicated GPU cards) and CPU processing time because we don't need to inspect those results for candidates either.
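The decision at the heart of this optimization can be sketched in Python like this (a sketch under assumptions: the per-iteration maximum strength stands in for the flag the GPU actually sets during processing, and all names are invented):

```python
def iteration_needs_transfer(toplist, toplist_capacity, gpu_max_strength):
    """Decide, before copying anything over PCIe, whether this iteration's
    GPU results can possibly contain a new toplist entry.

    The threshold is read from the toplist *before* the GPU stage: once
    the toplist is full, it is the strength of its weakest (last) entry;
    while the toplist is still filling up, every candidate qualifies."""
    if len(toplist) < toplist_capacity:
        return True                      # toplist not full yet: everything qualifies
    threshold = min(toplist)             # weakest kept candidate
    return gpu_max_strength > threshold  # threshold never crossed -> skip transfer
```

An iteration whose strongest result never crosses the threshold then costs neither PCIe bandwidth nor CPU scanning time.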
This also explains why some workunits can be "lucky": if many strong signal candidates are found early in the search, this sets higher thresholds for all the rest of the templates and cuts down on the number of transfers needed. If a work unit has no clear outliers at all, however, the toplist builds up with candidates more evenly over the runtime and the saving effect is much smaller.
This is a bit simplified and doesn't explain all the details but the gist of it should describe this effect quite well. A further optimization I'll do now is to allow for partial transfers of results from GPU memory to host memory instead of the yes/no decision implemented now.
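One possible shape of that planned partial transfer, sketched in Python — this is purely an assumption about how it could work, not the actual implementation:

```python
def blocks_to_transfer(block_maxima, threshold):
    """Hypothetical partial-transfer selection: if the GPU tracks a
    per-block maximum strength, only the blocks whose maximum crosses
    the current toplist threshold need to be copied back and scanned,
    instead of the all-or-nothing transfer implemented now."""
    return [i for i, block_max in enumerate(block_maxima) if block_max > threshold]

# Only blocks 1 and 3 cross the threshold, so only those would be copied:
selected = blocks_to_transfer([1.0, 4.0, 0.5, 3.2], threshold=3.0)
```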
Is there a similar improvement in performance expected for BRP4G too, since it uses the same application?
That is a very good question. It's using the same application, but different search parameters, and to make things more complicated, the BRP4G tasks go out to a very special breed of GPUs (Intel GPUs integrated in the CPU, not dedicated GPUs). Too many variables for me to make a good guess; we will try this later.
HBE
Wait - what about BRP4G-cuda32-nv301? These go to DGPUs.
We do not expect to have work for this application most of the time. Our main supply of GPU workunits for the near future will come from BRP6, with only a few WUs from BRP4 set aside for Android, ARM Linux and Intel HD GPUs.
Not any more - they've run out of data, and what little there is left is being preferentially held back for intel_gpu, which can't run the bigger tasks. There are announcements about that somewhere round here.
Here is a first summary of running 2 tasks in parallel on a 750Ti:
Using v1.39 the average of 40 workunits was:
20366s runtime and 2103s CPU time.
The long v1.50 task (PM0007_01161_126_1) was:
22643s runtime and 5254s CPU time. (!)
The other six v1.49/v1.50 tasks I've done so far have each taken pretty much the same time, on average:
15942s runtime and 550s CPU time.
Single work unit v1.50 per GPU:
PM0007_016D1_316_0: 14360 s runtime, 10533 s CPU time
PM0007_016D1_326_0: 14363 s runtime, 10345 s CPU time
PM0007_016D1_362_0: 15212 s runtime, 10400 s CPU time
PM0007_016D1_206_1: 16121 s runtime, 7034 s CPU time
Edit... Each work unit uses between 68-72% of one core.
Noticed changes in the calculation duration: while a WU usually took about 2 hours with version 1.39 cuda, with the new 1.50 beta cuda32 it now takes 3 hours 45 minutes.
Just an observation, no complaint.
There is quite a lot of variability in the new app version depending on the tasks you receive. Looking back at your completed v1.50 beta units, some have finished in ~90 minutes! You're just experiencing a bit of the rough with the smooth, so stick with it :-) The beta app will pay dividends in the long run ;-)
Ah, I thought that the BRP4G shortage was just temporary. OK, let's get back to the topic.
Thank you very much for this summary, Bikeman :-)
I know I am a part of a story that starts long before I can remember and continues long beyond when anyone will remember me [Danny Hillis, Long Now]
Ah, good to know. Thanks for the info :)
Om mani padme hum.