CPU speed impact to GPU processing

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583
Topic 220467

Hey guys, new user here, though I know a few of you from the SETI forums. :)

Since SETI is having a little break at the moment, I figured I would sign up here and check it out. I attached 2 of my biggest systems to E@h as a backup project for testing, and I have some questions about the nature of the jobs and how the CPU impacts GPU processing.

Disclaimer: I will only be asking about GPU WUs. I do not run CPU WUs; the CPUs are only being used to support GPU work.

For SETI, CPU speed mostly does not impact GPU work on the CUDA apps there, and since I built the systems for SETI, I use relatively weak CPUs in some systems to keep costs down. The two systems I have attached here are as follows:

  • Beast
  • CPU: 2x Xeon E5-2680v2 10c/20t (20c/40t total), runs at 3.1GHz "all core"
  • MB: Supermicro X9DRX+-F
  • RAM: 32GB DDR3 1600MHz ECC RDIMM
  • GPU: 10 (yes, ten) RTX 2070
  • PCIe bandwidth: [x8/x8/x8/x8/x8/x8/x8/x8/x1/x1]
  • https://imgur.com/inoa538

  • Water
  • CPU: 1x Xeon E5-2630Lv2 6c/12t, runs at 2.6GHz "all core"
  • MB: ASUS P9X79-E WS
  • RAM: 32GB DDR3 1600MHz ECC UDIMM
  • GPU: 7x RTX 2080 (watercooled, single slot)
  • PCIe bandwidth: [x16/x8/x8/x16/x8/x8/x8]
  • https://imgur.com/izVgWFh

So one thing I noticed is that there is a big difference in GPU utilization between the two systems. When running 1 WU per GPU, the GPUs in the watercooled 2080 system run at about 35-50% utilization, but the Beast 2070 system runs at about 70-80%.

I switched the Beast system to run 2 WUs per GPU, and it brought GPU utilization up to about 90-99%, but it doesn't seem I can do that on the watercooled 2080 system due to lack of CPU resources (I tried, and GPU utilization plummeted due to a pegged/overworked CPU).

So my main question is, does this experience make sense to you guys? Do the GPU WUs require this much CPU resources, and do they benefit greatly from faster CPU clock speeds when feeding work to the GPUs? I'm trying to figure out what might cause the low utilization on the 2080 system, even when not starving the CPU (1 WU at a time runs about 70% CPU usage).

Thanks :)

Ian

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583

Secondary question.

Is there a specific reason that no one seems to be running the GW O2MD jobs on GPU? Looking through other systems to compare run times, it looks like everyone only runs the Gamma-Ray Pulsar Binary Search WUs. Why?

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,157,634,931
RAC: 587,505

Ian&Steve C. wrote:
do the GPU WUs require this much CPU resources

I think in recent months we have had two different major applications here at Einstein using GPU calculation.  The Gamma-Ray Pulsar search is mature, has changed very little in a long time, and makes rather little use of host CPU and other assets.  So people run it quite successfully with weak CPUs, older generation PCI-e bus, narrow PCI-e bus, and so on and still can get to pretty high GPU utilization running just two tasks at a time.

The Gravitational Wave GPU search is much newer, not as mature, and has changed several times in the last few months.  It also seems more strongly data dependent.  Compared to the GRP search, it is much more dependent on host resources.  It will use a lot more of your CPU on a given system.  You will find it much harder to get high GPU utilization on it.  The science goals are quite different, and many of our participants strongly favor it over the GRP work from the point of view of their science support agendas.

Also, in scheduler terms, the two flavors do not mix well on a given system.  I suggest you try one at a time and choose the one you think better suited to your system and your personal priorities.

In discussing them, it can be helpful to single them out, rather than lumping them together as "Einstein".

Welcome to a change of pace from SETI.

Zalster
Joined: 26 Nov 13
Posts: 3,117
Credit: 4,050,672,230
RAC: 0

Hello Ian, the Gamma Rays are dependent on PCIe, the speed of the GPU's RAM, and the speed of the system RAM.

People tend to run the Gamma Rays more as there is a higher credit reward compared to the Gravity Wave GPU work. The latter tends to run longer and the reward is much less. So you can see why the switch to Gamma Rays.

x16 PCIe will crunch faster than x8, and a 1080Ti will crunch faster than a 1080 due to the difference in RAM on the card. The faster you can get the RAM on the MoBo, the more it will boost your results.

Does running more than 1 at a time help? Yes, I think 3 was the sweet spot, but since I only run this as a backup (once in a while), I leave it at 1 at a time so when SETI gives work, the machine can switch back.

As you noticed, you need to have enough CPU threads for a 1 per work unit ratio.
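
The one-thread-per-work-unit rule can be checked with simple arithmetic. The sketch below is only an illustration: the thread and GPU counts are the ones given for the Beast and Water hosts in the first post, and the assumption that every GPU task occupies one full CPU thread is a rule of thumb, not a measured value. The function names are hypothetical.

```python
def cpu_threads_needed(gpus, tasks_per_gpu, threads_per_task=1.0):
    """CPU threads consumed by GPU support work alone.

    threads_per_task=1.0 is an assumption (one full thread per GPU task).
    """
    return gpus * tasks_per_gpu * threads_per_task


def fits(cpu_threads, gpus, tasks_per_gpu, threads_per_task=1.0):
    """True if the host has at least as many threads as the GPU tasks need."""
    return cpu_threads_needed(gpus, tasks_per_gpu, threads_per_task) <= cpu_threads


if __name__ == "__main__":
    # (CPU threads, GPU count) as listed for the two hosts in the first post
    hosts = {"Beast": (40, 10), "Water": (12, 7)}
    for name, (threads, gpus) in hosts.items():
        for per_gpu in (1, 2):
            need = cpu_threads_needed(gpus, per_gpu)
            ok = fits(threads, gpus, per_gpu)
            print(f"{name}: {per_gpu} WU/GPU needs {need:.0f} of {threads} threads, fits={ok}")
```

Under that assumption, Water (12 threads, 7 GPUs) has room for 1 WU per GPU but not 2, which is consistent with the pegged CPU reported above, while Beast (40 threads, 10 GPUs) still has headroom at 2 per GPU.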

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583

Thanks guys for the responses. I was initially running the Gravity Wave WUs, which did not seem to like my watercooled system. Now that I have gotten some Gamma-Ray WUs, they seem to run a lot better.

Zalster wrote:

Hello Ian, the Gamma Rays are dependent on PCIe, the speed of the GPU's RAM, and the speed of the system RAM.

People tend to run the Gamma Rays more as there is a higher credit reward compared to the Gravity Wave GPU work. The latter tends to run longer and the reward is much less. So you can see why the switch to Gamma Rays.

x16 PCIe will crunch faster than x8, and a 1080Ti will crunch faster than a 1080 due to the difference in RAM on the card. The faster you can get the RAM on the MoBo, the more it will boost your results.

Does running more than 1 at a time help? Yes, I think 3 was the sweet spot, but since I only run this as a backup (once in a while), I leave it at 1 at a time so when SETI gives work, the machine can switch back.

As you noticed, you need to have enough CPU threads for a 1 per work unit ratio.

I know I've heard you say this before, but it's not matching up with what I'm seeing. The Gamma-Ray WUs don't appear to use much PCIe bandwidth at all. I see very similar run times between cards running on an x8 link and on an x1 link, and when watching PCIe bus usage throughout a GR WU run, PCIe bus utilization stays at 0%. So where did the idea that PCIe bandwidth is so important come from? Or do you mean that PCIe bandwidth is more important for the Gravity Wave WUs? The SETI CUDA special app appears to rely on PCIe bandwidth more than this.

This is what it looks like running Gamma-Ray WUs on the watercooled system with RTX 2080s (this card is running at PCIe 3.0 x8):

Full GPU utilization, about 50% GPU memory bus utilization, 0% PCIe bus utilization. I don't see any difference between the cards running at x16 and the cards running at x8 here. They all take about 8 min on the 2080s running 1 WU at a time.
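
For anyone who wants to log similar numbers per card, here is a minimal sketch that reads GPU utilization, memory-controller utilization, and the current PCIe link generation and width through `nvidia-smi --query-gpu`. It is not the tool used for the readings above. It assumes an Nvidia driver with `nvidia-smi` on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are made up for this example. PCIe throughput is not one of the queried fields here; `nvidia-smi dmon -s t` reports PCIe Rx/Tx throughput separately.

```python
import subprocess

# Field names accepted by `nvidia-smi --query-gpu=...`
FIELDS = [
    "index",
    "utilization.gpu",
    "utilization.memory",
    "pcie.link.gen.current",
    "pcie.link.width.current",
]


def parse_gpu_csv(text):
    """Parse `--format=csv,noheader,nounits` output into one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) != len(FIELDS):
            continue  # skip anything that is not a complete data row
        # Non-numeric values such as "[N/A]" are stored as None
        rows.append({name: int(value) if value.isdigit() else None
                     for name, value in zip(FIELDS, parts)})
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an Nvidia driver)."""
    cmd = ["nvidia-smi",
           "--query-gpu=" + ",".join(FIELDS),
           "--format=csv,noheader,nounits"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)


if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

Calling `query_gpus()` in a loop while a WU runs gives a rough per-card trace that can be compared between the x16, x8, and x1 slots.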

 

This system is running all RTX 2070s on x1 USB risers: https://einsteinathome.org/host/12803503

This system is running most RTX 2070s on x8 risers: https://einsteinathome.org/host/12803486

There is only about a 10-20 second difference over a 10-minute run. Since PCIe utilization is so low, I think the CPU resource difference between these systems is more likely the limiting factor.

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583

I have a suspicion that these WUs (particularly the Gravity Wave) depend more on CPU clock speed than anything else. I need to do some more testing on my 2070 setup that's on USB risers. I have Gravity Wave disabled for now. I also haven't had a chance to test PCIe utilization on a GW WU; I only thought to check after I switched to GR.

I've ordered an E5-2667v2 (8c/16t 3.6GHz all-core) to replace the E5-2630Lv2 in the watercooled build to see if the GPU utilization of GW WUs increases substantially from the 30-50% they were running before.

Even with similar levels of reported GPU utilization, SETI CUDA special definitely stresses the cards more: higher power draw, more resources used, more heat generated.

_________________________________________________________________________

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583

Messed around a little more with the 3rd system, this time with Gravity Wave WUs:

  • CPU: i7-7700k 4c/8t, underclocked to 3.0GHz "all core"
  • MB: ASUS Prime Z270-P
  • RAM: 16GB DDR4 2400MHz UDIMM
  • GPU: 7x RTX 2070 (most on USB x1 risers)
  • PCIe bandwidth: [x16/x1/x1/x1/x1/x1/x1]
  • https://imgur.com/EZ99VKo

Same story here as I saw on the watercooled system: low GPU utilization, about 20-40%, with these tasks, and almost non-existent PCIe bus use on the x16 card or the x1 cards (x16 shows 0%, the x1 cards show 1-2%). Each GPU WU uses about a full CPU core.

Then I bumped the CPU back up to its stock 4.2GHz to see what, if anything, would change. GPU utilization maybe bumped slightly, to 30-50% (it varies a lot over the run), with no change to PCIe resource use. It looks like the CPU just doesn't have enough free cycles to fully feed the GPU GW WUs on this system. That's the best guess right now: to get good GPU utilization you need a good number of free CPU threads and good clock speed. I should be able to confirm if I see a noticeable change on the watercooled 2080 system when the new CPU arrives. As Arch mentioned, perhaps the GW application just needs more time/work to mature.

So far the best results running that app came from the Beast system, which was consistently pushing GPU utilization to 70-80%, even on the cards on an x1 link. The only advantages I can see for that one over the watercooled system are spare/unused CPU resources and a slightly faster core clock (+500MHz).

So far there's no evidence that PCIe bandwidth makes the slightest bit of difference with the tasks available right now. Maybe that was the case in the past, but it doesn't look to be now.

_________________________________________________________________________

rromanchuk
Joined: 4 May 18
Posts: 7
Credit: 9,902,647
RAC: 0

I have a Vega 56 8G, and I can't get GW GPU tasks above 20% utilization. I've tried .50 and .33, and the GPU remains bored and cold. This smells like it's CPU bound.
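
The .50 and .33 values are presumably the fraction of a GPU that each task claims (2 or 3 tasks per card). One way to set the same thing locally is BOINC's `app_config.xml`, placed in the project's data directory. The sketch below generates such a file. Everything specific in it is an assumption to verify on your own host: the app name `einstein_O2MD1` is only a guess at the GW application's short name (check `client_state.xml` for the real one), and reserving one full CPU thread per task is a rule of thumb. `build_app_config` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, tasks_per_gpu, cpu_per_task=1.0):
    """Return app_config.xml text that runs `tasks_per_gpu` tasks on each GPU.

    gpu_usage is the share of one GPU each task claims, so 2 tasks -> 0.50
    and 3 tasks -> 0.33. cpu_usage is the CPU share budgeted per task.
    """
    root = ET.Element("app_config")
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name  # assumed name, verify locally
    versions = ET.SubElement(app, "gpu_versions")
    ET.SubElement(versions, "gpu_usage").text = f"{1.0 / tasks_per_gpu:.2f}"
    ET.SubElement(versions, "cpu_usage").text = f"{cpu_per_task:.2f}"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical app name; two tasks per GPU, one CPU thread reserved per task
    print(build_app_config("einstein_O2MD1", tasks_per_gpu=2))
```

After writing the file, the client has to re-read its config files (or be restarted) before the new task count takes effect.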

 

rromanchuk
Joined: 4 May 18
Posts: 7
Credit: 9,902,647
RAC: 0

I just looked at my GW GPU tasks, lol. Such bizarre run time vs cpu time

https://einsteinathome.org/host/12645379/tasks/0/54

Ian&Steve C.
Joined: 19 Jan 20
Posts: 3,874
Credit: 41,930,932,644
RAC: 59,541,583

rromanchuk wrote:

I just looked at my GW GPU tasks, lol. Such bizarre run time vs cpu time

https://einsteinathome.org/host/12645379/tasks/0/54

At least comparing the GW and GR tasks, your CPU use % isn't too far off: it's using about 20% of a thread on your GR tasks and 25% of a thread on GW tasks. But I think I've been reading that AMD cards have less CPU use than the Nvidia cards do. My tasks use about 100% of a CPU thread for each GPU WU on both GW and GR, but at least the GRs use nearly 100% GPU utilization.
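
The "% of a thread" figures are just the CPU time divided by the run time shown on the task list. A small sketch of that calculation follows. The numbers in the example are hypothetical, chosen to give 25%; they are not taken from the linked host.

```python
def cpu_fraction(cpu_time_s, run_time_s):
    """Share of one CPU thread a task kept busy: CPU time / elapsed run time."""
    if run_time_s <= 0:
        raise ValueError("run time must be positive")
    return cpu_time_s / run_time_s


if __name__ == "__main__":
    # Hypothetical task: 1000 s elapsed, 250 s of CPU time
    share = cpu_fraction(250.0, 1000.0)
    print(f"{share:.0%} of a thread")
```

A value near 1.0 means the task held a full thread for its whole run, which is what the Nvidia hosts in this thread report.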

_________________________________________________________________________

archae86
Joined: 6 Dec 05
Posts: 3,153
Credit: 7,157,634,931
RAC: 587,505

Ian&Steve C. wrote:
AMD cards have less CPU use than the Nvidia cards do. my tasks use about 100% of a CPU thread for each GPU WU on both GW and GR, but at least the GRs use nearly 100% GPU utilization.

This is an application difference.  The current OpenCL code base for Nvidia applications here employs a polling loop to check whether the work being processed on the GPU needs CPU services.  I believe it runs 100% of the time, except when it is knocked away by the OS awarding the slot to another process.

So the apparent CPU usage by these tasks will go way down if you arrange that the host has lots of competing high-priority CPU work, but that will not mean things are more efficient.  And a faster CPU will spin the wheels faster but still clock near 100% utilization.  That does not mean a faster CPU does zero good, as eventually there is some service to be performed, but as mentioned before, rather slow CPUs can support quite fast GPUs pretty well for the current Einstein GRP application.
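
The effect of a polling loop on reported CPU usage can be reproduced with a toy example. This is only an analogy in plain Python, not the Einstein application's code: a timer stands in for the GPU finishing its work, and the host thread either spins on a flag or blocks until it is signalled. The function name and the 0.3 s delay are arbitrary choices for the sketch.

```python
import threading
import time


def cpu_seconds_while_waiting(spin, delay_s=0.3):
    """Return the process CPU seconds burned while waiting `delay_s` for a signal.

    spin=True  : poll the flag in a tight loop (busy-wait)
    spin=False : block until the flag is set
    """
    done = threading.Event()
    # The timer plays the role of the GPU signalling that it needs service
    timer = threading.Timer(delay_s, done.set)
    start = time.process_time()
    timer.start()
    if spin:
        while not done.is_set():
            pass  # polling: the thread stays runnable the whole time
    else:
        done.wait()  # blocking: the thread sleeps until signalled
    used = time.process_time() - start
    timer.join()
    return used


if __name__ == "__main__":
    print(f"polling : {cpu_seconds_while_waiting(True):.3f} CPU s")
    print(f"blocking: {cpu_seconds_while_waiting(False):.3f} CPU s")
```

Both variants wait the same wall-clock time, but the polling one is charged CPU time for nearly all of it. That is why a polling task shows close to 100% of a thread regardless of how much real service work the CPU performs.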

 
