Does speed step affect GPU tasks adversely?

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1578367945
RAC: 21169
Topic 217167

Recently I ran two test scenarios with tasks on an AMD GPU (W10, newest driver).

Test 1: Two tasks per GPU, 100% cores allowed, no CPU tasks, so all cores free.
Test 2: Two tasks per GPU, 100% cores allowed, a task running on CPU, at least two cores free.

My expectation was to see identical run times for GPU tasks in both tests. However, test 1 tasks took about 15% more time than test 2.

How can that be interpreted? Was there a bottleneck in test 1 or a speed-up in test 2? A look at the account page tells me that the extra time in test 1 comes from additional CPU time, which almost doubled. We know that a CPU core is in use all the time for an Nvidia GPU, but not for an AMD GPU. In test 2 the CPU ran at max core frequency all the time, whereas in test 1 the core frequency was hovering somewhere above minimum. So I can imagine that the AMD GPU task doesn't put a full load on the core, which therefore runs throttled, as opposed to test 2. While Intel Speed Step is welcome for reducing power draw, it adversely affects performance in this case. So for such a system, the options seem to be either disabling Speed Step or running an extra task on the CPU. Other suggestions are welcome as always. :-)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118718826841
RAC: 20394154

solling2 wrote:
How can that be interpreted?

I run Linux and not Windows.  About a month or more ago, on all my machines with GPUs that crunch, I disabled CPU crunching completely.  I did this when there was an early bout of summer heat in order to try to keep the machine room at a less than extreme temperature.  Apart from expected changes due to the nature of the two different classes of FGRPB1G GPU tasks, I haven't noticed any significant change in RAC.  This is both for machines on an individual basis and for the RAC of the fleet as a whole.

A lot of my hosts started out years ago as CPU only crunchers and have been upgraded at a later date with a modern GPU.  I was in the habit in those earlier times of turning off speedstep in the BIOS since I just wanted the machines to run at the proper speed at all times anyway.  That option is still probably off as I didn't enable it when upgrading with a GPU (CPU tasks still running) or, more recently, when I turned off CPU tasks.

I did intend eventually to do so as allowing non-used cores to throttle down should be a way of saving some further power and heat.  It's a rather big job to disturb a big working fleet so I haven't got around to doing it yet :-).  Also, there will be some hosts for sure that have speedstep not disabled.  I'll have a closer look for hosts that currently show lower than average performance and see if they are of that type.  If I can find any evidence, I'll post again later on.

I might start the process of deliberately enabling speedstep now on a couple of hosts and closely monitor the crunch times to see if I can duplicate what you are reporting.  I can't think of a more logical explanation for the slowdown than the speedstep setting.

I routinely monitor the number of CPU 'clock ticks' consumed by GPU tasks as a way of detecting a GPU crash.  This has turned out to be a very reliable technique.  The Linux kernel maintains a virtual filesystem in RAM (under /proc) where lots of statistics about running tasks are kept and updated in real time.  It's extremely easy to interrogate these stats (in scripts designed to do so) and calculate the 'ticks' consumed for a set period.  It will be very interesting to see if the number of ticks increases substantially as a result of enabling speedstep.

I routinely monitor for 2 secs and for AMD Polaris GPUs, I see around 5-10 clock ticks accumulate over that interval.  The 'clock' runs at 100Hz so those ticks represent around 2.5 - 5% CPU utilization on average.  There are huge variations right at task start and end so I detect and avoid measurement at those times.  If the ticks go to zero outside those times, it's a perfect indicator that the GPU has crashed.  I have yet to see a report of zero ticks that hasn't been a GPU crash.  GPU drivers have improved so GPU crashes are a lot less frequent than they were 12 months ago.
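
For anyone wanting to try the tick-counting approach described above, here is a minimal Python sketch. It is not the script used on these hosts, just an illustration. It assumes Linux, a known PID for the process supporting the GPU task, and the 100Hz tick rate mentioned above (the actual rate can be checked with os.sysconf("SC_CLK_TCK")). All function names are made up for illustration.

```python
import time


def ticks_from_stat(stat_line: str) -> int:
    """Return utime + stime (in clock ticks) from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so cut at the last ')' before splitting on whitespace.
    fields = stat_line.rsplit(")", 1)[1].split()
    # fields[0] is field 3 (state); utime and stime are fields 14 and 15.
    return int(fields[11]) + int(fields[12])


def ticks_over_interval(pid: int, seconds: float = 2.0) -> int:
    """Ticks consumed by `pid` during one sampling window (Linux only)."""
    path = f"/proc/{pid}/stat"
    with open(path) as f:
        before = ticks_from_stat(f.read())
    time.sleep(seconds)
    with open(path) as f:
        after = ticks_from_stat(f.read())
    return after - before


def utilization_percent(ticks: int, seconds: float, hz: int = 100) -> float:
    """Convert a tick count into percent of one CPU core."""
    return 100.0 * ticks / (hz * seconds)


def looks_crashed(ticks: int) -> bool:
    """Zero ticks outside task start/end is the crash signal described above."""
    return ticks == 0
```

With a 2 second window at 100Hz, utilization_percent(5, 2.0) gives 2.5 and utilization_percent(10, 2.0) gives 5.0, matching the range quoted above. Skipping measurements near task start and end is left out of the sketch.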

 

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118718826841
RAC: 20394154

After writing the above, I picked an older box that now hosted an RX 460 GPU.  I rebooted it and enabled speedstep in the BIOS.  It has now completed 6 additional tasks since the change.  I ignored the two in progress at the time of the change so that just leaves 4 to average on.  That is not sufficient (of course) but the times are very tightly clustered both before and after so I think they tell the story fairly well.

The host has a Pentium dual core CPU (e6300 @ 2.8 GHz) dating from around 2009/2010.  Prior to the change it was completing tasks 2x, with elapsed/CPU times very tightly clustered around 2460/68 seconds - i.e. averaging around 2.75% CPU utilization.  After the change, the four new tasks have averaged 2468/87 seconds.  The range for the elapsed time was 2466 - 2471.  The range for the CPU component was 86 - 88, which represents a CPU utilization of around 3.5%.

So it looks like the CPU component has increased 19 seconds and the elapsed time has increased less than 10 seconds.  My assessment is that under Linux, perhaps around 10-15% more CPU cycles are needed for GPU support (presumably since the CPU frequency is being reduced) whilst the elapsed time is less affected.  Of course, the sample size is woefully inadequate and so the statements should be regarded in that light.  I certainly don't see your 15% increase in run time so I don't know what could be causing it.
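
The arithmetic behind those figures can be checked with a few lines of Python. This is only a sketch using the averages quoted above. The helper name is illustrative, and with just four tasks after the change the result is indicative at best.

```python
def cpu_utilization_percent(cpu_seconds: float, elapsed_seconds: float) -> float:
    """CPU time as a percentage of elapsed (wall-clock) time."""
    return 100.0 * cpu_seconds / elapsed_seconds


# Averages quoted in the post (seconds): speedstep disabled vs enabled.
before = {"elapsed": 2460, "cpu": 68}
after = {"elapsed": 2468, "cpu": 87}

util_before = cpu_utilization_percent(before["cpu"], before["elapsed"])
util_after = cpu_utilization_percent(after["cpu"], after["elapsed"])
extra_cpu = after["cpu"] - before["cpu"]
extra_elapsed = after["elapsed"] - before["elapsed"]

print(f"CPU utilization: {util_before:.2f}% -> {util_after:.2f}%")
print(f"CPU time: +{extra_cpu} s ({100.0 * extra_cpu / before['cpu']:.1f}%)")
print(f"Elapsed time: +{extra_elapsed} s ({100.0 * extra_elapsed / before['elapsed']:.2f}%)")
```

Note that the CPU figure here is seconds of CPU time, not a cycle count. With the core clocked lower, the same number of cycles takes more seconds, so the percentage rise in CPU time is not directly comparable with an estimate of extra cycles.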

I'm not surprised at the above result, as I think more about it.  I do have the same model of GPU supported by a range of different CPU generations and speeds.  The elapsed time for GPU tasks on the same model GPU over this range of hosts, has never varied all that much.  This is one of the main reasons why I decided to upgrade the old hosts.  They could produce pretty much the same GPU task output as a much more modern CPU and PCIe version could.

 

Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1578367945
RAC: 21169

Thanks for checking that. Very plausible. I'll report if I come across any other reason for the somewhat surprising behavior of my system. My overall goal of course was to maximize output and minimize power draw. Maybe getting a wattmeter and see what works best practically is the next option. :-)

solling2
Joined: 20 Nov 14
Posts: 219
Credit: 1578367945
RAC: 21169

Meanwhile I tested the energy settings of the operating system (W10: system / energy / additional). The recommended and apparently default setting is 'balanced'. With that setting and no CPU task (test scenario 1), a GPU task takes 15% more time than with the 'max performance' setting. The latter keeps core frequency and processor cache frequency at maximum all the time. Thus it appears that the 'balanced' setting more or less overrides the CPU support needs of the GPU task, in the sense that the feeding core won't run at full speed. So it may well be useful to pay attention to that setting (or check the BIOS); at least that draws a little less power than running a separate CPU task all the time.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

My 'feeling' (again) has been that a system with AMD GPU and Windows has completed GPU tasks slower if Speed Step was enabled. And I 'believe' the same has gone for Nvidia + Windows too. I'll try to bring fresh observations here soon.

 

Here's a host with W10 and R9 270X: host/12329446

CPU is Xeon X5670 6c/12t which in this case is overclocked from stock 2.93 GHz to 4.13 GHz.

Computer is running only GPU tasks (2x), nothing else. GPU is running with stock speeds.

This computer has also C-State options enabled in BIOS and power mode in Windows is "balanced".

The computer has been running for a long time with Speed Step enabled, so all the tasks currently visible in the history had Speed Step "ON" and the other settings as mentioned above. I changed Speed Step to OFF in the BIOS on 04 Jan 2019 at 02:30 UTC (time zone here is UTC+2). That was the only change I made (well... Windows did update itself to the next Insider Preview version at this point... but I don't think that would have a noticeable effect on the speeds).

 

Quick CPU clock observations:

Speed Step still ON :

- Windows Task Manager showed "Base clock" 2.93 GHz. Looks like it doesn't recognize that the CPU is overclocked if Speed Step is on. Minimum CPU frequencies hovered around 1.6 GHz while idle.

- CPU-Z showed minimum cpu frequencies to be 2.25 GHz while idle. Crunching 2x GPU tasks still kept the freq most of the time at 2.25 GHz, spiking now and then to 3.4 - 3.6 GHz.

Speed Step OFF :

- Windows Task Manager now shows "Base clock" is 4.13 GHz

- CPU-Z shows minimum cpu frequencies to be 2.25 GHz while idle. Crunching 2x GPU tasks still keeps freq most of the time at 2.25 GHz but it's spiking now to 4.13 GHz.

 edit:

- even the last calculation phase (after 89.997%) isn't enough to keep cpu clock high, if the other task is going somewhere in the pre phase ... clock is still jumping and average seems only somewhat higher than in pre phase

Changing only Speed Step from ON to OFF doesn't necessarily change the average CPU clock much. It also depends on the other settings. I'm going to let this host crunch tasks for a day or so to get some completion times with these settings. Then I'll disable C-States or change the power mode in Windows to "max performance".

** I 'believe' it might depend on the "size" of the cpu how quickly the clocks jump. For example 1 core cpu vs. plenty of cores / threads and cache... there might be difference when the system will consider if it should start whipping the cpu. If that is true, then of course that could have an effect on a host that is running only GPU tasks. A smaller processor could experience just enough load for the higher clocks to jump in more often, even if power saving options were enabled.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Richie wrote:
a host with W10 and R9 270X: host/12329446

Okay, the work batch seems to have changed from 2006L to 2007L during the last hours. With AMD cards the tasks from 2007L low range seem to be somewhat faster than last tasks from 2006L high range:

270X ... about 3 minutes difference in completion times, and the same goes for the R9 390 (for this card the proportional difference is already 20% or so). With the Nvidia GTX 960, on the other hand, the difference looks to be 1-2 minutes, which is a very small proportion of the run time.

edit: First tasks from this batch must have been much faster again, but looks like those larger differences are vanishing and this is going to be generally the same as 2006L.

But I'll let the computer crunch some more 2007L's  to get something for a comparison, until I change the CPU to max speed.

Observations after changing only the Speed Step from On to OFF :

Yes, it had an effect on the completion times. I see there was a trend. With 2006L's the CPU times went down about 30 seconds (270 to 240) and runtimes went down also... about the same amount (1760 to 1730). That's not much. I'm sure the "max performance" setting will give yet more reduction in run times.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118718826841
RAC: 20394154

Richie wrote:
Okay, the work batch seems to have changed from 2006L to 2007L during the last hours. With AMD cards the tasks from 2007L low range seem to be somewhat faster than last tasks from 2006L high range

It's not the change from 2006L to 2007L that is causing what you are seeing.  It's the pulsar spin frequency being analysed for that causes the crunch times to start out faster and then gradually revert to the more usual, somewhat slower values.  When you look at a task name - eg LATeah2006L_1132.0_.... - that 2nd parameter between the underscores is related to spin frequency - I think it is some multiple, maybe x4, of the spin frequency in Hz.

When tasks for each data file are distributed, the low frequencies are sent out first and usually don't last for long.  This is followed by a longish period of steadily increasing frequencies and slowly increasing crunch times.

If you look at the very first graph in this thread you can see a very distinct relationship between crunch time and spin frequency.  For frequencies less than ~200Hz, the times are significantly faster.  Above ~200Hz, the times steadily increase until a plateau is reached in the 800-1000Hz range.  There is then a further small step to a new plateau above 1000Hz.

I wouldn't be at all surprised if the current 2006L and 2007L tasks behave in much the same way.  If you want accurate comparisons, you'll need to choose the frequency component fairly carefully :-).  Thank you for deciding to look at this.  I'm sure the information will be of interest to quite a few of us who try to tweak performance.
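
As a rough illustration of picking comparable tasks, the sketch below pulls the data file and the frequency-related parameter out of a task name and only treats two tasks as comparable when both match. It is a minimal sketch under assumptions: the function names are hypothetical, only the first two underscore-separated fields are interpreted, and "REST" stands in for the remainder of the name, which is elided above. The parameter is not converted to Hz because the multiple is not confirmed.

```python
from typing import Tuple


def parse_task_name(name: str) -> Tuple[str, float]:
    """Split a task name into (data file, frequency-related parameter).

    Only the first two underscore-separated fields are used; the meaning
    of any later fields is not assumed here.
    """
    parts = name.split("_")
    if len(parts) < 2:
        raise ValueError(f"unexpected task name: {name!r}")
    return parts[0], float(parts[1])


def comparable(name_a: str, name_b: str) -> bool:
    """True if both tasks share the data file and the frequency parameter."""
    return parse_task_name(name_a) == parse_task_name(name_b)


# "REST" is a placeholder for the elided tail of a real task name.
print(parse_task_name("LATeah2006L_1132.0_REST"))
print(comparable("LATeah2007L_916.0_REST", "LATeah2007L_916.0_OTHER"))
print(comparable("LATeah2006L_1132.0_REST", "LATeah2007L_916.0_REST"))
```

Requiring an exact match is the strictest choice. A looser comparison within a narrow band of the parameter might also work, but how narrow that band would need to be is not something this thread establishes.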

 

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Gary Roberts wrote:
If you look at the very first graph in this thread you can see a very distinct relationship between crunch time and spin frequency.  For frequencies less than ~200Hz, the times are significantly faster.  Above ~200Hz, the times steadily increase until a plateau is reached in the 800-1000Hz range.  There is then a further small step to a new plateau above 1000Hz.

Those graphs by archae86 are excellent. I've taken a look at them a few times and tried to keep in mind how these work batches typically evolve. Ideally that would help me avoid getting too excited when I think I've found something out (and then I'd be more likely to double-check and avoid writing nonsense here).

After I woke up today and saw batch had changed my thought was "those 2007L tasks must be already from the "plateau" phase". But as that computer had very small work cache it actually had been sent some of the first ones... and had crunched them. I've missed the first ones many times (NNT has been in effect) so they still feel quite rare to me. I didn't look at the task names carefully enough.

Gary Roberts wrote:
If you want accurate comparisons, you'll need to choose the frequency component fairly carefully :-)

I'm probably too lazy to bring here anything precise. But there's always great value when somebody else does that kind of work, and I respect them.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5877
Credit: 118718826841
RAC: 20394154

Richie wrote:
I'm probably too lazy to bring here anything precise ....

I'm not so sure about that :-).  If you stick around for long enough, the need for precision will probably grow on you :-).

 

Cheers,
Gary.

Richie
Joined: 7 Mar 14
Posts: 656
Credit: 1702989778
RAC: 0

Gary Roberts wrote:
I'm not so sure about that :-).  If you stick around for long enough, the need for precision will probably grow on you :-).

I started my test from the beginning to provide a bit more precision.

Just as a reminder, this host has an Intel CPU and an almost 10-year-old motherboard. It has a BIOS with a CPU configuration setting called "C-State", which can also affect how the clock speeds behave. Modern UEFI boards might differ in options and behavior. Boards with an AMD CPU surely will.

 

Test computer: host/12329446

- AMD GPU driver is stock version that came with this Windows version (Adrenalin 18.40.21.06 DCH , Beta)

- GPU tasks run 2x and enough "out-of-sync" for the last computation phase (89-100%) to never occur simultaneously for both tasks

I downloaded 120+ tasks that all were LATeah2007L_916.0...

Four different configurations were tested and each version of this system crunched about 30 tasks.

I think CPU clock speeds are not relevant after all, so I will leave those observations out. I'll just say that changing only SpeedStep from ON to OFF in the BIOS might not be enough to make a system of similar age use its highest available CPU clock speeds. Even without any precise monitoring tools it was easy to see that average or max clock speeds were not identical.

Completion times are shown below (run time / CPU time):

 

Test 1 --- SpeedStep ON, C-State ON, Windows 'Power plan' Balanced

Average time (34 tasks): 1689 / 218

 

Test 2 --- SpeedStep OFF, C-State ON, Windows 'Power plan' Balanced

Average time (32 tasks): 1679 / 188

 

Test 3 --- SpeedStep OFF, C-State ON, Windows 'Power plan' High performance

Average time (32 tasks): 1683 / 186

 

Test 4 --- SpeedStep OFF, C-State OFF, Windows 'Power plan' High performance

* Among the finished tasks there was one 'validated' task that was something of an anomaly (run time 1356 / 175)

Average time (27 tasks, including anomaly): 1659 / 175

Average time (26 tasks, without anomaly): 1671 / 175


I had expected greater differences in run times. I'll continue to run this host with SpeedStep ON. It will run cooler and that's what I prefer most of the time.
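
To put the four configurations side by side, the averages above can be expressed as percentage changes relative to Test 1. This is a minimal Python sketch using only the posted averages. It assumes the times are in seconds, and the helper name is illustrative. Since per-task times are not listed here, it says nothing about whether the small run time differences are statistically meaningful.

```python
def percent_change(new: float, base: float) -> float:
    """Change of `new` relative to `base`, in percent."""
    return 100.0 * (new - base) / base


# (label, average run time, average CPU time), as posted above.
results = [
    ("Test 1", 1689, 218),
    ("Test 2", 1679, 188),
    ("Test 3", 1683, 186),
    ("Test 4 (without anomaly)", 1671, 175),
]

base_run, base_cpu = results[0][1], results[0][2]
for label, run, cpu in results[1:]:
    print(
        f"{label}: run time {percent_change(run, base_run):+.1f}%, "
        f"CPU time {percent_change(cpu, base_cpu):+.1f}%"
    )

# Consistency check on Test 4: adding the 1356 s anomaly back into the
# 26-task average should give roughly the posted 27-task average.
mean_with_anomaly = (26 * 1671 + 1356) / 27
print(f"Test 4 with anomaly: {mean_with_anomaly:.0f}")
```

On these numbers the run time changes by about 1% or less across all configurations, while the CPU time drops by roughly 14% to 20%.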
