Discussion Thread for the Continuous GW Search known as O2MD1 (now O2MDF - GPUs only)

Rolf
Joined: 7 Aug 17
Posts: 27
Credit: 135377187
RAC: 0

About the PCI bandwidth. My own conclusion, based only on very superficial observations of how the tasks run and how much CPU usage they report, is that the bottleneck is not the bandwidth as such, but that the CPU seems to transfer data to the GPU in very small batches. I also got the impression that while the CPU is preparing the next batch, the GPU is idling, and possibly vice versa. (Which in a way would suggest that PCI bandwidth is a factor, but I still have a hard time believing that it's the bottleneck. It can't be huge amounts of data; rather, it's the chattiness of it, a very high number of very small transfers.)

Why it's done that way I don't know. Possibly because the problem requires it: the CPU can't calculate the next batch until it has the results of the previous batch back from the GPU.
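
To illustrate what I mean, here is a toy sketch of my guess, not the project's actual code (I don't even know which API the real app uses, so this is written as CUDA purely for illustration, and the kernel, batch size, and names are all invented). The CPU prepares one small batch at a time and has to wait for the GPU's result before it can prepare the next one, so the two processors take turns idling; with that pattern, per-transfer latency and overhead matter far more than raw PCIe bandwidth.

// Illustrative only: a made-up host loop showing the "many small, dependent
// transfers" pattern described above. Not Einstein@Home code.
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void crunch_batch(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // stand-in for the real math
}

int main() {
    const int BATCH = 4096;      // deliberately small: ~16 KB per copy
    const int BATCHES = 10000;   // so ~20,000 PCIe round trips in total
    float h_in[BATCH], h_out[BATCH];
    float *d_in, *d_out;
    cudaMalloc(&d_in, BATCH * sizeof(float));
    cudaMalloc(&d_out, BATCH * sizeof(float));

    for (int b = 0; b < BATCHES; ++b) {
        // CPU prepares the next small batch; the GPU sits idle meanwhile.
        for (int i = 0; i < BATCH; ++i)
            h_in[i] = (float)(b + i);

        // Tiny host->device copy, short kernel, tiny device->host copy.
        cudaMemcpy(d_in, h_in, BATCH * sizeof(float), cudaMemcpyHostToDevice);
        crunch_batch<<<(BATCH + 255) / 256, 256>>>(d_in, d_out, BATCH);
        cudaMemcpy(h_out, d_out, BATCH * sizeof(float), cudaMemcpyDeviceToHost);
        // The blocking copy back means the CPU now waits on the GPU, so the
        // two sides alternate instead of overlapping: lots of small, chatty,
        // dependent transfers rather than one big one.
    }

    printf("last value: %f\n", h_out[BATCH - 1]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}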

 

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

Rolf wrote:

About the PCI bandwidth. My own conclusion, based only on very superficial observations of how the tasks run and how much CPU usage they report, is that the bottleneck is not the bandwidth as such, but that the CPU seems to transfer data to the GPU in very small batches. I also got the impression that while the CPU is preparing the next batch, the GPU is idling, and possibly vice versa. (Which in a way would suggest that PCI bandwidth is a factor, but I still have a hard time believing that it's the bottleneck. It can't be huge amounts of data; rather, it's the chattiness of it, a very high number of very small transfers.)

Why it's done that way I don't know. Possibly because the problem requires it: the CPU can't calculate the next batch until it has the results of the previous batch back from the GPU.

 

 

Probably because the way the E@H devs do GPU apps is different from how most people do them. The conventional approach is to move all of the calculation code onto the GPU in a single pass, only using the CPU for startup/shutdown/checkpointing. E@H is using the GPU as an accelerator, only porting over one tiny part at a time and leaving everything else on the CPU.

The notional advantages of doing it incrementally are that problems are harder to debug on the GPU, so adding only a tiny bit of GPU code at a time makes debugging easier, and that because the innermost loops are where the vast majority of the computing is done, a small amount of porting can capture most of the advantage of going to the GPU.

The flip side is that GPUs are so much faster than CPUs that moving 95% of the work to the GPU still leaves the CPU as a bottleneck and, as you've observed, causes the GPU to spend a lot of time waiting for the CPU to do a tiny bit of work and send it over so the GPU can continue.

As an aside, there's nothing intrinsically wrong with porting incrementally to near or total completion vs. all at once in a single pass - other than writing lots of data-passing code that gets replaced each time you move another loop to the GPU - the issue is that E@H's devs stop early in the process and leave the CPU a major bottleneck on any but the slowest cards. The steady relative increase of GPU power vs. single-threaded CPU power only makes it worse. Nor would it be fundamentally wrong for E@H to throw minimally ported apps out to the public to test against a much wider set of hardware than they have in the lab, if they continued to regularly iterate on performance by increasing the GPU share instead of declaring victory once the app stops returning invalid data and is 2-3x faster than the CPU app while using maybe 10 or 20% of the GPU.
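
For contrast, here's a rough sketch of the "conventional" fully ported structure I'm describing (hypothetical code, not E@H's; the kernel, array size, and pass count are made up, and I'm writing it as CUDA just for illustration): all the input goes across the bus once at startup, every pass of the main loop runs on the device with no PCIe round trips, and results come back once at the end. The CPU's only ongoing job is queuing kernel launches and the occasional checkpoint, so it stops being the bottleneck no matter how fast the card is.

// Hypothetical sketch of the "whole calculation on the GPU" approach,
// for contrast with the accelerator style described above. Not E@H code.
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__global__ void search_pass(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 0.5f + 1.0f;   // stand-in for one pass of the analysis
}

int main() {
    const int N = 1 << 20;        // the whole working set lives on the GPU
    const int PASSES = 10000;
    float *h_data = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h_data[i] = (float)i;

    float *d_data;
    cudaMalloc(&d_data, N * sizeof(float));

    // One big upload at startup...
    cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

    // ...then every iteration stays on the device; no per-pass PCIe traffic.
    // The CPU only queues launches (and could checkpoint now and then).
    for (int p = 0; p < PASSES; ++p)
        search_pass<<<(N + 255) / 256, 256>>>(d_data, N);

    // One download at shutdown.
    cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("sample result: %f\n", h_data[0]);

    cudaFree(d_data);
    free(h_data);
    return 0;
}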

 

cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909688736
RAC: 2108629

Here are updated data from my initial post on possible limiting factors for optimal GW GPU task performance.

Data from my system in the table below support Rolf's conclusion that the amount of crosstalk between the GPU and CPU is a main determinant of the slowdown. I think the data show that increasing CPU capacity is a good initial step to increase GW GPU task productivity. Others have stated this elsewhere, so here are some numbers...

From the table, comparing one vs. two GPUs running a single task, power increases and the load average doubles, but task times are comparable; there is no throttling of the analysis.
    Similarly, comparing one GPU running 1 vs. 2 tasks, CPU load doubles to only 50%, and the GPU is used more efficiently, resulting in a lower task time.
    And again, when 3 tasks are run concurrently, CPU load doubles to nearly 100%, but now, being near system capacity, task time slightly increases; that is, there are no gains in GPU crunch efficiency.
    When run at 4x, especially when two GPUs are running 2 tasks each, the load average exceeds 100%; that is, processes are waiting to run, and task times double or quadruple (as power usage decreases), so now there is throttling of the GPU as it waits for the CPU to do its thing.
    Under none of these conditions did CPU or GPU usage reach 100% (though CPU usage did so intermittently at 1GPU@4x).
    Does anybody have any ideas on what the much larger %load increase for 2GPU@2x vs. 1GPU@4x might mean? Is it a PCIe lane thing?
    By comparison, this system running 2GPU@3x with the FGRBP gamma-ray GPU app has a %CPU of ~25%, %GPU of 100%, and a %load of ~35% (data not shown). Indeed, for GR tasks, utilization factors of 2x-4x peg individual GPU usage at 98%-100%, so again, % usage is not a good indicator of crunching capacity.

No doubt a faster GPU can speed things along, but my first step will be to upgrade my CPU from 2c/4t to 8 cores. Rolf has also pointed out that system memory, specifically the number of channels, can be a limiting factor with higher task multiplicities. The CPU I'm considering can only support 2 memory channels, so I won't see any advantages there even with an upgrade to faster DDR4 to support a beefier CPU.

Gravitational Wave GPU app v2.08 performance across different run conditions

config      t    Wc    %CPU  %GPU  %load
1_GPU@1x    36   72    24    43     25
1_GPU@2x    21   81    47    42     50
1_GPU@3x    25   78    66    41     98
1_GPU@4x    42   64    64    60    118
2_GPU@1x    38   114   45    24     56
2_GPU@2x    85   96    82    20    175

Table footnotes
config, boinc-client run configuration for one or two active GPUs, where 1x is a GPU utilization factor (GUF) of 1, 2x is 0.5 GUF, etc.
t, Realized single task time in minutes = BOINC run completion time in minutes × GUF.
Wc, Crunch Watts = Wattage as measured with a power meter at the wall, minus the host's resting state of 56W.
%CPU, average CPU usage over 10 min, reading every second, from the 'top' command.
%GPU, average GPU usage over 10 min, reading every two seconds, from the 'amdgpu-monitor --log' command.
%load, normalized 15-minute CPU load average (LA) from the 'top' command, where %load = LA/4, for 4 available processors (threads). The load average is the average number of kernel processes using or waiting for the CPU over a given time interval.
Most tasks were run 17-18 April from the O2MDFG3a_G34731_18xx.xx Hz data series.
BBcode table formatting was generated at https://theenemy.dk/table/

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47043602642
RAC: 65084180

Honestly I think you’re shooting yourself in the foot trying to run a CPU-intensive task (GW) with a very weak CPU. That Pentium is only a dual core with hyperthreading enabled. I don’t think you’re going to get meaningful increases in work processing until you upgrade the CPU to something more powerful. The amount of CPU cache available also probably plays some role here; that chip only has 4MB of L3 cache, which is tiny.

I observed very significant improvements in runtime and GPU utilization when I changed from an E5-2630Lv2 (6c12t/2.6GHz/15MB L3) to an E5-2667v2 (8c16t/3.6GHz/25MB L3). I was driving 7 Nvidia GPUs, hence the need for more cores.

_________________________________________________________________________

cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909688736
RAC: 2108629

Ian&Steve C. wrote:

Honestly I think you’re shooting yourself in the foot trying to run a CPU-intensive task (GW) with a very weak CPU. That Pentium is only a dual core with hyperthreading enabled. I don’t think you’re going to get meaningful increases in work processing until you upgrade the CPU to something more powerful. The amount of CPU cache available also probably plays some role here; that chip only has 4MB of L3 cache, which is tiny.

I observed very significant improvements in runtime and GPU utilization when I changed from an E5-2630Lv2 (6c12t/2.6GHz/15MB L3) to an E5-2667v2 (8c16t/3.6GHz/25MB L3). I was driving 7 Nvidia GPUs, hence the need for more cores.

Agreed. I'm looking at an i7-9700K. Just waiting for that COVID economic stimulus check to arrive. :)

Ideas are not fixed, nor should they be; we live in model-dependent reality.

cecht
Joined: 7 Mar 18
Posts: 1535
Credit: 2909688736
RAC: 2108629

Ian&Steve C. wrote:
..I observed very significant improvements in runtime and GPU utilization when I changed from an E5-2630Lv2 (6c12t/2.6GHz/15MB L3) to an E5-2667v2 (8c16t/3.6GHz/25MB L3). I was driving 7 Nvidia GPUs, hence the need for more cores.

Your hosts currently list only gamma-ray pulsar GPU tasks as completed. How do those E5 CPUs run with the current batch of GW GPU tasks? What sort of GW completion times have you seen?  It's the GW GPU app that really puts the pressure on CPU performance.  My Pentium does a fine job running 6 concurrent pulsar GPU tasks on a couple of RX 570s, but chokes with more than two GW GPU tasks. If I can remove that CPU bottleneck with an affordable pre-owned E5 chip, then I'll go that route.

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519315189
RAC: 13902

cecht wrote:
Ian&Steve C. wrote:
..I observed very significant improvements in runtime and GPU utilization when I changed from an E5-2630Lv2 (6c12t/2.6GHz/15MB L3) to an E5-2667v2 (8c16t/3.6GHz/25MB L3). I was driving 7 Nvidia GPUs, hence the need for more cores.
Your hosts currently list only gamma-ray pulsar GPU tasks as completed. How do those E5 CPUs run with the current batch of GW GPU tasks? What sort of GW completion times have you seen?  It's the GW GPU app that really puts the pressure on CPU performance.  My Pentium does a fine job running 6 concurrent pulsar GPU tasks on a couple of RX 570s, but chokes with more than two GW GPU tasks. If I can remove that CPU bottleneck with an affordable pre-owned E5 chip, then I'll go that route.

I'd like to see gravity tasks that can use multiple CPU cores.  My machines could all manage gravity if that were done.  But one core is never enough to keep up with a decent GPU, and since the tasks are now getting bigger, I can no longer put multiple WUs on one GPU because they don't fit in GPU RAM.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47043602642
RAC: 65084180

cecht wrote:
Ian&Steve C. wrote:
..I observed very significant improvements in runtime and GPU utilization when I changed from an E5-2630Lv2 (6c12t/2.6GHz/15MB L3) to an E5-2667v2 (8c16t/3.6GHz/25MB L3). I was driving 7 Nvidia GPUs, hence the need for more cores.
Your hosts currently list only gamma-ray pulsar GPU tasks as completed. How do those E5 CPUs run with the current batch of GW GPU tasks? What sort of GW completion times have you seen?  It's the GW GPU app that really puts the pressure on CPU performance.  My Pentium does a fine job running 6 concurrent pulsar GPU tasks on a couple of RX 570s, but chokes with more than two GW GPU tasks. If I can remove that CPU bottleneck with an affordable pre-owned E5 chip, then I'll go that route.

It was several months ago, when I first signed up for Einstein, that I was doing some testing on different hardware and configs. I noticed that I had good performance (~80% GPU utilization) with the GW GPU app on the system running 10x 2070s with the E5-2680v2 CPUs, but pretty poor performance (40-50% GPU utilization) on the system with 7x 2080s and just the E5-2630Lv2. Swapping to the E5-2667v2 pretty much solved that and brought the GPU utilization back up. But I’ve just stuck to running the Gamma Ray tasks since then, since they run better.

Now that I’ve got all my systems configured for GPUGRID, I’ll really only be crunching Einstein as a backup project when GPUGRID is out of work. I can load up a couple of GW tasks to test, but completion times might not be very relevant since I run Nvidia cards: RTX 2070s (w/ E5-2680v2) and RTX 2080s (w/ E5-2667v2).

_________________________________________________________________________

Mr P Hucker
Joined: 12 Aug 06
Posts: 838
Credit: 519315189
RAC: 13902

Ian&Steve C. wrote:
Now that I’ve got all my systems configured for GPUGRID, I’ll really only be crunching Einstein as a backup project when GPUGRID is out of work. I can load up a couple of GW tasks to test, but completion times might not be very relevant since I run Nvidia cards: RTX 2070s (w/ E5-2680v2) and RTX 2080s (w/ E5-2667v2).

As you're on GPU Grid a lot, do you happen to know if there's any chance of them making an AMD GPU version?  I don't have any Nvidias, but I'd like to contribute to their project.

If this page takes an hour to load, reduce posts per page to 20 in your settings, then the tinpot 486 Einstein uses can handle it.

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3958
Credit: 47043602642
RAC: 65084180

Just to update: I switched this system to run GW tasks. I watched a couple of them (not VelaJr), running just 1 task per GPU (RTX 2070 fed by an E5-2680v2 @ 3.0GHz):

~11:30 run time

~75-78% GPU utilization

~1800MB VRAM use

~110-120% (spikes to 140) CPU thread utilization (meaning dipping into a second thread)

~2% PCIe utilization of a 3.0 x1 link

 

_________________________________________________________________________
