Fun with run_harmonic_summing() on CUDA

Janus
Janus
Joined: 10 Nov 04
Posts: 27
Credit: 23862534
RAC: 5
Topic 196666

Heya

I've been spending a bit of spare time learning CUDA and decided to try looking at one of the BRP4 CUDA functions to see if I could make any sense of it - and maybe even improve it a bit. The Einstein code is all quite well documented, so it is a nice place to pick up new tricks for learners like me.

The choice of function ended up being run_harmonic_summing() because it is a fairly simple function. As is visible from the image below, the function uploads some data to the GFX card device (HtoD), runs two little programs on the data (the blue-ish tinted blocks) and then downloads some of the results back to the host CPU memory (DtoH):

Both of the little programs (also known as kernels in CUDA-lingo) are available in the BRP4 source tree in the file "harmonic_summing_kernel.cuh" in src/cuda/app and contain the actual science code. What exactly this code does is a bit of a mystery to me still - it searches for harmonics in the power spectrum; which probably has something to do with detecting signals which are periodic but where you only have a certain chance of hearing each "blip". Ah well, someone else can probably chime in with some enlightening details here =)

One striking thing about the performance printout for the function call (above) is that it seems to spend relatively little time doing actual computation and a lot of time moving around data. This seems to mainly be because of the PCI-E bus latency involved when sending a command from the CPU to the GFX card.
A PCI-E bus is a bit like a network: in your software you make a "packet", the packet goes to a driver, the driver sends it to a controller in the CPU or on the motherboard, the controller puts it on the bus and the GFX card's controller picks it up from the bus and then informs the graphics processor chip about the new data. Then you wait for all of that to happen in reverse to get an answer to whether your command went fine.
Obviously, if the bus is fast then it will be able to make this round-trip much more quickly than a slow bus - and in turn it will be able to send more data in the same amount of time. The size of the packets also has something to say. These two things in unison is why a PCI-E 2.0 x16 lane is important for the Einstein CUDA app: it makes frequent round-trips and sometimes sends decently sized packages across the bus.

Taking a closer look at what happens:

Upload loop:
Data is uploaded (6 round-trips)

Computation
This is done as soon as the last piece of data is available, so it actually doesn't seem to require a round-trip

sumspec[0] download and border calculation
Power spectrum data is downloaded and used to pick interesting areas (1 round-trip)

Sync
The CPU asks the GPU to be done with everything it is doing and waits for it to finish (1 round-trip)

Download loop
Data is downloaded (5 round-trips)

So all in all it seems like it is doing 10-15 round-trips in order to get the results.
Luckily it is possible to:
1) Do more than one thing in parallel
2) Request that the GFX card accepts or delivers more than one piece of data in one round-trip, or performs a sequence of commands and only reports back once the entire sequence is done.
The GPU makers call this asynchronous transfers or asynchronous computation. They enable parallelism by using multiple "streams" and each stream accepts sequences of commands - which they will try to carry out simultaneously with the sequences running in the other streams. This is somewhat like multiprocessing on the CPU where multiple programs can run their code sequences at the same time.

So just for the fun of it here's another view of what the function could potentially look like if it used two streams performing asynchronous sequences of commands:

The key differences are:
a) All the input data is uploaded asynchronously together with the calls to start the kernels in one roundtrip - and all in one stream to make sure that they are run in sequence on the GFX card.
b) Assuming that the spectrum data isn't changed by the harmonic summing, another stream can be started to download the power spectrum while the first stream does its thing
c) Once all that is done all the relevant data can be downloaded asynchronously in one roundtrip
d) Remaining cleanup can be done after the data has been successfully downloaded.

Even with multiple asynchronous streams it seems like the function would still spend some time doing nothing but waiting for round-trips to complete.
This is why most people with modern GFX cards will find that it improves their total computing performance if they run 2 or more CUDA WUs on their card at the same time - WU2 would be able to take advantage of the "downtime" from WU1 by doing computations when WU1 is transfering data or transfering data when WU1 is computing and so on. As is clear from the plots it won't always be a 100% increase - it all depends on timing.

So, there you have it - a bit of fun with run_harmonic_summing().

__
ps. I'm clueless and I probably have half of this wrong and am missing the other half, so please correct flaws in the post or add your view to it. That is the best way to learn.
pps. Everything is somewhat photoshopped - and the curvy braces are all 10% additionally curvy; you get that for free ;)

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 324196738
RAC: 165235

Fun with run_harmonic_summing() on CUDA

Thanks for describing this for us! :-)

You've identified the basic character of GPU work : blindingly fast massive multi-threading but the lag is in setup and readout, hence a need for the best CPU to GPU pathway specs you can muster.

Cheers, Mike.

( edit ) And in this context one can think of a GPU as an outstanding parallel co-processor that just so happens to also run a visual device .... :-)

( edit ) This general behaviour is akin to many parallel devices - the Adapteva Parallela ( recently launched on KickStarter ) comes to mind. I've seen analysis of matrix multiplication procedures that spend about 5% of the overall execution time on the processor array actually calculating the result matrix entries, with the remainder being largely input/output to the memory shared with the host ( transferring both data and code in this case ). As that's quite a ratio ( 19:1 ) then any decent optimisation from that state is going to be focussed on the transfers.

[ I remember an ( apocryphal? ) story about maintenance programmers who wound up optimising the busy/wait loop of an operating system because they hadn't profiled correctly in advance ... :-) ]

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Dad
Dad
Joined: 18 Nov 12
Posts: 1
Credit: 456538
RAC: 0

SO the i7 with an X-58 board

SO the i7 with an X-58 board and nvgtx560 would be very fast. Maybe I'll update to two of the 600 series nvidia cards next year.

Sorry,I can run 3 x16 slots. Hope Santa sends me some cash.

Bikeman (Heinz-Bernd Eggenstein)
Bikeman (Heinz-...
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 760301958
RAC: 1123662

Hi Janus, Excellent

Hi Janus,

Excellent analysis.

As a piece of trivia, the first version of this code was actually contributed 2 years ago by me in my former life (while I was a volunteer developer , before joining the AEI and doing this stuff for a living). So you can see that E@H is very open to feedback and improvements of the code by developers outside the project staff. That's the beauty of Open Source.

So let me comment on your experiment in some detail:

Quote:

s is visible from the image below, the function uploads some data to the GFX card device (HtoD), runs two little programs on the data (the blue-ish tinted blocks) and then downloads some of the results back to the host CPU memory (DtoH)

Right. The first part, uploading stuff to the gfx card, is actually just zeroing out arrays on the card, at least in the current version. If you downloaded the source code package some months ago, you might still have an older version that had an additional memory transfer .

Quote:
what exactly this code does is a bit of a mystery to me still - it searches for harmonics in the power spectrum; which probably has something to do with detecting signals which are periodic but where you only have a certain chance of hearing each "blip". Ah well, someone else can probably chime in with some enlightening details here =)

To find a periodic signal in radio data, BRP4 computes the Discrete Fourier Transform and then the power spectrum, which will give you the signal power at small discrete frequency bins in our search range of frequencies. If the signal was a sinusoid, we would be done at this point. Just search thru the spectrum to find the "loudest" signal. However, the signals we are looking for (even after removing the effects of binary orbits etc) are anything but sinusoids, they are more like sharp, periodic spikes. That's why you will see the signal also in integer multiples of the pulse frequency in the spectrum. See also here, last paragraph: http://einstein.phys.uwm.edu/radiopulsar/html/topic3.php

Quote:
These two things in unison is why a PCI-E 2.0 x16 lane is important for the Einstein CUDA app: it makes frequent round-trips and sometimes sends decently sized packages across the bus.

Exactly. Over the past 2 years we improved the code mostly by reducing the amount of memory that has to be transferred between the gfx card and the host.

Quote:
So just for the fun of it here's another view of what the function could potentially look like if it used two streams performing asynchronous sequences of commands:

Actually the situation is even brighter, because the two computation blocks (cyan and blue in the diagram) are independent, so they can execute in parallel.

The first profiling diagram is a bit misleading, the two kernels are actually started asynchronously in separate streams:

cuLaunchGridAsync(kernelHarmonicSumming, dg1.x, dg1.y, stream[0]);


and

cuResult = cuLaunchGridAsync(kernelHarmonicSummingGaps, dg2.x, dg2.y, stream[1])

Profiling under CUDA can actually cause the streams to be serialized (greatly reducing the usefulness for optimizing stuff, e.g. see here : https://devtalk.nvidia.com/default/topic/418148/cuda-programming-and-performance/stream-serialization-with-cuda-visual-profiler-v2-3-11/

The memory transfer is a different thing, it is in fact synchronous. In CUDA3.2, in order to perform asynchronous memory transfer, you have to use a special kind of host memory, so called page-locked memory. Not all cards did support this, so one would actually have to implement some logic to do it asynchronously for most cards and synchronously for legacy cards. There's a reminder in source code to include this, actually:

/* FIXME:  Improvement: if powerspectrum is in pinned memory, this could be done async. */


;-)

I nevertheless once tried async transfer back then in 2010 on some cards that did support it, and found no noticable performance improvement. So I lost interest in this detail. Maybe it's time to repeat the experiment with the latest Kepler generation, the outcome might be different than back then. I also see some further optimization opportunities there, e.g. with the latest changes it might be possible to transfer only part of the powerspectrum[0] array back to the host (but this would completely break the powerspectrum display in the screensaver...hmmm...does anyone care?).

Another caveat is that the two harmonic summing kernels have quite chaotic memory access patterns, that's why we are using texture memory (as this offered some caching even on legacy cards) to speed it up. Further saturating the on board memory subsystem by making an async transfer to host memory in parallel might or might not make things faster overall.

Thanks for sharing your experiments results.

Cheers
HB

joe areeda
joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

I'd also like to thank Janus

I'd also like to thank Janus and Heinz for the interesting discussion.

I'm just getting started in CUDA as something to wile away my less than copious free time.

What software did you use to profile the GPU?

Also would you please describe the procedure to run a single instance of BRP4 without installing BOINC?

I have access to a few K10's I'd like to benchmark but they are cluster computers and I can't install BOINC (without risking the ire of the PTB).

Actually now that I think about it, Bikeman probably has access to those same machines.

Best,
Joe

Janus
Janus
Joined: 10 Nov 04
Posts: 27
Credit: 23862534
RAC: 5

Nice answers! RE: To

Nice answers!

Quote:
To find a periodic signal in radio data, BRP4 computes the Discrete Fourier Transform and then the power spectrum, which will give you the signal power at small discrete frequency bins in our search range of frequencies. If the signal was a sinusoid, we would be done at this point. Just search thru the spectrum to find the "loudest" signal. However, the signals we are looking for (even after removing the effects of binary orbits etc) are anything but sinusoids, they are more like sharp, periodic spikes. That's why you will see the signal also in integer multiples of the pulse frequency in the spectrum. See also here, last paragraph: http://einstein.phys.uwm.edu/radiopulsar/html/topic3.php


Ah! That makes great sense indeed - quite a smart way to get extra sensitivity out of the search.

Quote:
Actually the situation is even brighter, because the two computation blocks (cyan and blue in the diagram) are independent, so they can execute in parallel. [snip] Profiling under CUDA can actually cause the streams to be serialized (greatly reducing the usefulness for optimizing stuff, e.g. see here : https://devtalk.nvidia.com/default/topic/418148/cuda-programming-and-performance/stream-serialization-with-cuda-visual-profiler-v2-3-11/


Very interesting little gotcha! I was wondering why the kernels were executing in-order on two different streams under profiling when there was seemingly no connection between them except they both use the same powerspectrum input and the output; something that shouldn't really block anything. This explains it.

Quote:
Another caveat is that the two harmonic summing kernels have quite chaotic memory access patterns, that's why we are using texture memory (as this offered some caching even on legacy cards) to speed it up. Further saturating the on board memory subsystem by making an async transfer to host memory in parallel might or might not make things faster overall.


True that - but since there's a sync right after it things will grind to a halt anyways; so giving it the opportunity probably shouldn't hurt

Quote:
I nevertheless once tried async transfer back then in 2010 on some cards that did support it, and found no noticable performance improvement. So I lost interest in this detail. Maybe it's time to repeat the experiment with the latest Kepler generation, the outcome might be different than back then.


That is a good point actually - it would be logical to think that as the speed of the processors and memory on the cards increase the computation part of the graph will shorten but the transfer parts will remain mostly the same - so as the cards get faster and faster the transfer time will start to dominate more and more.
Even with PCIe 3.0 that is capable of doing faster transfers the latency is still comparatively high.

Quote:
I also see some further optimization opportunities there, e.g. with the latest changes it might be possible to transfer only part of the powerspectrum[0] array back to the host (but this would completely break the powerspectrum display in the screensaver...hmmm...does anyone care?).


Don't tell anyone ;)
(or even better, check if the screensaver is actually running if that is the only thing using the full spectrum - mostly people will have it turned off)

Thanks a lot for the enlightening comments on the experiment - things are always much clearer when you know the why and the how.

Janus
Janus
Joined: 10 Nov 04
Posts: 27
Credit: 23862534
RAC: 5

RE: What software did you

Quote:
What software did you use to profile the GPU?


There are several useful tools to do this. Some of them use counters and debug registers directly in hardware while others wrap the CUDA dlls to catch calls before they hit the driver. The plots are from the "Nvidia Visual Profiler" which does a little bit of both I recon.

Quote:
Also would you please describe the procedure to run a single instance of BRP4 without installing BOINC?


If BOINC is not around then the Einstein app switches to non-shared-memory-mode and simply becomes a commandline program like any other program. The easiest way to get a working standalone app (if you don't like compiling it yourself) is by taking the slots/X directory from a running instance of the app and then:
1) Copy it somewhere else
2) Replace relevant file links with their actual files from the projects/einstein directory - this most importantly includes the binary executable and any other files that the app cannot look up
3) Set up a similar directory structure as the one BOINC had - so ../../projects/einsteintralala/* needs to exist
4) Delete any "boinc_lockfile" in the destination dir

Now you are pretty much good to go - you only need the commandline that BOINC was going to run the WU with. You can get that from client_state.xml. Obviously it, and the relevant input files, changes for every new WU you want to run.

The output from the app, just like anything else here at Einstein, seems pretty well-described and is very helpful when there is an error of some kind.

Funny fact:
Once the WU has been completed you can move the output files back to the original machine. Once it re-launches the WU it will detect that all the work is already done and it will return it as if it had computed it itself. I once used a setup like that to do a few Einstein WUs on an android phone - for the fun of it.

Mike Hewson
Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6591
Credit: 324196738
RAC: 165235

RE: RE: I also see some

Quote:
Quote:
I also see some further optimization opportunities there, e.g. with the latest changes it might be possible to transfer only part of the powerspectrum[0] array back to the host (but this would completely break the powerspectrum display in the screensaver...hmmm...does anyone care?).

Don't tell anyone ;)
(or even better, check if the screensaver is actually running if that is the only thing using the full spectrum - mostly people will have it turned off)


I seriously doubt that anyone wanting to maximise their GPU crunching will have screensaver(s) enabled. :-)

[ Especially so if texture memory is being preferred. ]

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

joe areeda
joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Thank you. I'm sure I'll

Thank you.

I'm sure I'll have many hours of fun getting it to work.

I'll report K10 numbers as soon as I have them.

Joe

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.