Comprehensive GPU performance list?

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407361243
RAC: 35328410


cecht wrote:
.... I'm just wondering what a sustainable long-term temperature range is for those GPUs.

My machines run with forced ventilation but no aircon, so the ambient is always high.  It's never less than about 28C and for most of the year it ranges between 32 and 37C.  I suspend all crunching if the room temp goes above 36C.  That doesn't happen very often since the forced cooling air flow rate is reasonably high and seems to do a remarkably good job, even in the peak of summer with a lot of heat to dissipate :-).

The hosts are open and in open racks but the machine density is such that the ambient in the vicinity of each box is over 40C  a lot of the time.  It has to be more, right at the motherboard level.  I've been running GPUs since about 2010 - initially a small number of HD4850s on Milkyway.  It was around 2012 when I first started running Einstein GPU tasks.

I used to closely monitor both CPU and GPU temperatures.  I don't stress about it so much these days.  The HD4850s ran hot - around 90C.  They would crash if they got to about 100C.  I had some fan failures (fairly easy to work around) but never saw a card die.  I stopped using them when Milkyway stopped providing an app that was OpenCL 1.0 capable.  They would have run for around 5-6 years at least.

I've grown quite used to relatively high temperatures - 80 to 90C - and the RX 570s are usually above the mid 70s and into the low 80s.  I've had remarkably few issues over the years, despite the adverse ambient conditions.  I guess I'll come a cropper if the more recent purchases don't have the same longevity as my earlier stuff.  I've had over 30 HD7850 (Southern Islands - Pitcairn series) cards running for over 5 years now and I'm scratching my head to think of a single failure in any of them.  They run at least in the high 70s all the time.

I wouldn't worry too much at the temperature values you're seeing - in the sixties is positively freezing :-) ;-).

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17548813915
RAC: 6438829


Gary Roberts wrote:

Thanks for the extra information about coolbits - should be helpful to the user.

Your last paragraph is a bit of an eye-opener.  I didn't realise that nvidia did sneaky things like that!  I guess they are trying to ensure fewer warranty claims from people running 24/7 heavy compute operations - perhaps of the mining type :-).

Hi Gary, actually Nvidia has been doing this since long before the mining craze - all the way back to the Maxwell generation.

They did this to differentiate their product stack, ensuring that anyone doing compute, scientific or encoding work had to buy from the Tesla or Quadro product lines instead of the consumer lines.  The stated reason for enforcing the clock and power penalty is to "ensure calculation integrity".
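
For anyone who wants to see it for themselves, watching the reported power state and clocks while a task runs should show the card sitting in the P2 state with the memory clock down from its P0 value.  A rough sketch (field names come from nvidia-smi's query documentation; exact output varies by driver and card):

nvidia-smi --query-gpu=pstate,clocks.sm,clocks.mem --format=csv -l 5    # refresh every 5 seconds while a compute task runs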

I thought this was well known across the Nvidia user community.

 

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17548813915
RAC: 6438829


HWpecker wrote:

Thank you,

 

I'll read up a bit before I might burn the card.

As long as you don't go crazy, you won't burn anything up.  The GPUs, just like the CPUs, protect themselves if you go too far.  Nvidia GPU Boost 4.0 automatically increases the core clock speeds whenever the driver can stay within the thermal and power limits.  That is why, with a compute load, you can only gain maybe a 30-40 MHz offset on the core clock.  The card will automatically crank back the core clock if the temps get too high, to stay within the thermal limits.  The memory clock, OTOH, gets penalized by up to 1600 MHz in frequency depending on the card generation.  There is a large penalty on Pascal cards, not so much now with Turing cards and their different memory type.

For example, I use a 2000 MHz offset on my Pascal 1080 cards to get back to the normal video-load memory clock of 11 GHz, but then push them farther to 12 GHz.  As long as you keep the cards cool and don't produce task errors, you are golden.

For my new Turing cards, I only have an offset of 400 MHz to get them back to 14.2 GHz, which is only 200 MHz past their stock video-load memory clocks.  I'm fairly certain they will go much further with their GDDR6 memory, but I am still feeling them out since the technology is newer and I don't have as much experience with Turing yet compared to Pascal.
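
For reference, with coolbits enabled those offsets get applied through nvidia-settings.  Just a sketch, not my exact setup - the performance-level index in the brackets varies by card (often 2 or 3), X has to be running, and the values here are only examples:

nvidia-settings -a "[gpu:0]/GPUGraphicsClockOffset[3]=40"           # small core clock offset in MHz
nvidia-settings -a "[gpu:0]/GPUMemoryTransferRateOffset[3]=2000"    # larger memory transfer rate offset in MHz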

You can also control the power limits of Nvidia cards with nvidia-smi.  For example, I crank the power limit down from 250W to 215W for my 1080 Ti's and from 225W to 200W for my 2080's.  Doesn't impact the crunching times all that much and the cards run cooler.
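
Something along these lines is all it takes (sketch only - the GPU index and wattage are examples, and the value has to sit inside the board's allowed min/max limits, which you can check with nvidia-smi -q -d POWER):

sudo nvidia-smi -pm 1            # persistence mode so the setting sticks
sudo nvidia-smi -i 0 -pl 215     # cap GPU 0 at 215 W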

 

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2445806267
RAC: 1498442


Gary Roberts wrote:
I wouldn't worry too much at the temperature values you're seeing - in the sixties is positively freezing :-) ;-).

That's good to know.  With that in mind, I might lower the cards' fan speeds to prolong fan life. From casual observations, I've had a vague notion that higher GPU temperatures slightly increase compute times, so slowing the fans (raising GPU temps) will provide data to test that.
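
In case it's useful to anyone else, one way to slow the fans on these amdgpu cards under Linux is through the hwmon interface.  Just a sketch - the card and hwmon numbers below are assumptions that differ per host, and pwm values run 0-255:

echo 1 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon*/pwm1_enable    # 1 = manual fan control
echo 140 | sudo tee /sys/class/drm/card0/device/hwmon/hwmon*/pwm1         # roughly 55% fan speed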

Ideas are not fixed, nor should they be; we live in model-dependent reality.

abcde12345
Joined: 14 Apr 14
Posts: 10
Credit: 10676522
RAC: 0


The coolbits worked on the gtx780ti. After setting the coolbits and rebooting, the gui-desktop couldn't start the BOINC Manager. The simple fix was to start it once from the command line with boincmanager and then close it; from there the gui picked up and started the BOINC Manager from the menu again.
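
For anyone else trying it, the coolbits option ends up as a line in the X config. Sketch only - I generated mine with sudo nvidia-xconfig --cool-bits=28 (28 turns on the fan control, clock offset and overvoltage bits), and the Identifier will differ on your system:

Section "Device"
    Identifier "Device0"
    Driver "nvidia"
    Option "Coolbits" "28"
EndSection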

With the same OC clock speeds, win10 wins against linux (gui-desktop) with about 1850 s/WU on average vs 1870 s/WU. Windows also wins here by not slowing down other windows (like a browser) while crunching WUs, as the linux desktop did. Maybe I can win some seconds by dropping the gui-desktop and going full command line, but that's not really my thing. yikes xD

Also, I couldn't really crank linux further than win10 to the point of erroring out or getting coil whine. Underclocking is no problem on either install.

Thanks for helping to get it working.

 

Does anyone run boinc on a minimal command-line-only linux install with the single purpose of boinc? Is that faster?

 

edit: gtx780ti in OC-mode on win10 is 45-47 WU/day, linux (Leap 15.1 gui-desktop) 44-46 WU/day

Keith Myers
Joined: 11 Feb 11
Posts: 4704
Credit: 17548813915
RAC: 6438829


Quote:
Does anyone run boinc on a minimal command-line-only linux install with the single purpose of boinc? Is that faster?

That would be what is called running headless.  I run some hosts that way - just a command terminal session running, no Manager.  I remote into the host for control with boinccmd.  I don't think there is any benefit in crunching times though; it's still dependent on the hardware.
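
A typical remote command looks something like this (sketch only - the hostname and password are placeholders; the host needs its GUI RPC password set and your machine allowed in remote_hosts.cfg):

boinccmd --host crunchbox --passwd mypassword --get_tasks    # list tasks on the remote host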

 

 

cecht
Joined: 7 Mar 18
Posts: 1421
Credit: 2445806267
RAC: 1498442


Just to follow up on my previous posting: no surprise that raising GPU temps (by lowering fan speeds) on my RX 460s and RX 570s had no effect on crunch times.  The 24-hr average temperature for all cards went from ~67 C to ~80 C. Both hosts are quieter with the slower fans, so I'm going to let the GPUs run on the toasty side. Thank you for enlightening me about sustainable GPU temps, Gary!

BTW, those RX 570s, running @ 3X tasks, average 593 sec/task, which gives a calculated yield of ~145 tasks/day (86,400 sec/day ÷ 593 sec/task ≈ 145.7).

Ideas are not fixed, nor should they be; we live in model-dependent reality.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109407361243
RAC: 35328410


cecht wrote:
... I'm going to let the GPUs run on the toasty side.


You don't want/need to be any more "toasty", I would suggest :-).  80C should be fine but that's a fairly decent step-up from the mid sixties :-).  The silicon can certainly take that level.

Running fans more slowly will give you less noise, but noise never killed any hardware :-).  A slower fan speed will most likely help fan life by reducing wear, but higher temperatures probably contribute to evaporation/degradation of the lubricant, so what you gain with one you may lose with the other.

Higher temps have a damaging effect on electrolytic capacitors.  I know - old technology - but it certainly affects me :-).  Fortunately new kit seems to be pretty much all polymer caps these days.  That fixes the electrolyte degradation problem but I don't know how sensitive to heat the polymer variety might still be.  I wouldn't just assume they are immune to it.

Cheers,
Gary.
