My first significant hardware problem with a GPU at Einstein.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661826046
RAC: 35180293
Topic 198365

I have around 50 GPUs crunching at Einstein. The oldest is a GTX-550Ti that has been running for more than 4 years. When I decided around 3 years ago to use Kepler series GTX-650s, I also bought a single AMD HD7770 which cost virtually the same but might have had better performance. It didn't. I found out later this was because of sub-standard drivers. Since the GTX-650 initially performed better, I ended up with around 20 GTX-650s. A little while later, a new driver version ended up resolving the problem and the HD7770 was able to outperform the GTX-650s. More recently, the cuda55 app has narrowed the gap and the HD7770 has been only marginally better than a GTX-650.

A bit more than a month ago, GPU tasks on the 7770 machine started failing and, on checking, the GPU fan was hardly rotating at all. The GPU was alarmingly hot, just by feel alone. No big deal - I'll just re-oil the bearings. I'm quite used to removing the rubber bung and filling the well with machine oil on all sorts of fans. However this one was totally sealed - no bung. I've encountered this before and have a pencil-shaped wire drill with a series of fine bits. I chose a size just larger than the needle of a diabetes syringe and drilled through the plastic to where the bearing would be. I was able to inject enough oil to get the fan freed up but was unhappy with the rough running it was still displaying.

Undeterred, I went through my various boxes of ex-server fans. I love these fans - they are so well built that they seem to run forever. I found one that I could screw and zip-tie onto the fins of the GPU heat sink. It had a 2-wire connector, the same as the original but not the same pin pitch. I bent the card's pins slightly and got the plug to fit the posts. It all worked fine and gave a strong air flow. I reassembled the machine and was very happy to see the thing fire up and to note the cool heat sink.

The machine seemed to be running OK so I restarted crunching with the same settings as previously. It's a Phenom II X4 and runs GPU tasks 3x alongside 2 CPU tasks. I watched it for a couple of days and all seemed to be good. Tasks were validating and crunch times were as expected. The heat sink continued to feel cool. I then promptly forgot about it - job done. This would have been in early December.

Because of the failed tasks and the ensuing downtime, the RAC was low and it does take quite a while to recover. In my control script logs, complaints about low RAC continued to be issued but, because I'm such a know-all, I continued to ignore them ("It's OK, I know the reason") and didn't bother to look in more detail. I did look early on and the RAC was slowly recovering so I wasn't at all concerned.

About a week ago, I thought I'd better look into why there was still a low RAC warning and was shocked to see how much further the RAC had declined. A trip to the tasks list on the website showed no compute errors but a high proportion of inconclusive or invalid GPU results - perhaps 70% ultimately being declared invalid. There were no CPU task problems. The GPU heat sink was still cool to the touch and the air flow was strong.

I decided to try to isolate the real issue by making single changes to various things and watching what happened as a result. Nothing was overclocked and the fan was keeping the GPU heat sink quite cool. I was fully aware that a cool heat sink doesn't necessarily mean a cool chip, but I was too stupid to set myself up to measure the core temperature. The machines don't have peripherals and I don't know of a GUI app for displaying temperatures under Linux. There is a CLI utility (aticonfig) with zillions of options for looking at all sorts of parameters but I was too lazy to hook up a monitor, keyboard and mouse and consult the manpage for the options I needed. After all, the fan was the problem and the heat sink is now cool, so I just assumed that the temps would be OK and that it must be something else :-).

Rebooting (a few times actually) made no ultimate difference and crunch times were all OK. So I started to adjust the number of concurrent GPU tasks. I changed from 3x to 2x whilst still leaving only 2 cores running CPU tasks. After a couple of days, the new crunch times were as expected but there appeared to be no real change in the rate of invalid tasks. I can't be sure about that because it wasn't left long enough to really know.

Then I put the GPU back to single tasks and allowed 3 CPU tasks to run. That was yesterday. Today I'm seeing a string of consecutive GPU tasks that are all valid, with not a single invalid. There are currently six such tasks, all completed and validated. This is certainly not enough to be sure, but at least it's a good indication that perhaps the cause isn't my pet theory - progressive degradation of the GPU as a result of the unknown amount of time it spent vastly overheated while the fan was progressively failing. I had started to consider that I might be retiring this original AMD GPU cruncher, so the fact it was kicking out good tasks again was a pleasant relief. Thus encouraged, I decided to do what I should have done a week ago when I first saw all the invalid tasks. I hooked up the monitor and ran aticonfig with the correct options for a printout of things like frequencies, load and the all-important core temperature.
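For the record, the options I used were along the following lines (from memory, so treat the exact names as approximate; they need to be run with the GPU's X display available):

export DISPLAY=:0
aticonfig --odgc     # 'overdrive get clocks' - current core/memory frequencies and GPU load
aticonfig --odgt     # 'overdrive get temperature' - the all-important core temperature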

YIKES!!!

It's returning all valid tasks but the temperature is 106C!!! I wonder what it was when it was running 3x and only about 30% were actually validating. Memo to self: Don't be so stupid and lazy, and next time this happens (as it's bound to), actually measure the temperature in preference to just feeling the heat sink.

I've now pulled the heat sink and examined what's left of the TIM (thermal interface material). It had set like concrete and no amount of gentle rubbing with solvent would shift it. With a suitably shaped plastic scraper, I managed to dislodge all the concrete, both on the chip and the heat sink. After that, alcohol on a tissue cleaned the chip to a mirror-like finish. The heat sink is rough as guts (machining grooves) and I'd be happy to lap it (I've done that before) but there are 4 screw receptacles sitting proud of the surface so lapping isn't really possible. So I cleaned it as best I could, applied some new thermal grease to the shiny chip surface and reassembled the heat sink to the card.

After assembly, I fired up the machine (but not BOINC) and let it idle for a while. The temperature was a nice steady 40C (ambient about 33C today). I fired up BOINC and kept rerunning aticonfig to get a feel for how fast the temperature was rising. The first measurement was 50C, followed by 54C and then 57C. I was happy that it wasn't going through the roof. A few minutes later it was 61C. Now, a couple of hours later, it's 63C and seems stable.

It's at the last-used settings of 1x on the GPU and 3 CPU tasks. Tomorrow, all being well, I anticipate being able to return it to the long-term settings. While I've been documenting all this, it's suddenly occurred to me that I should write a new function for my control script. It would be quite straightforward: every time the script 'visits' a host with a GPU, it should run aticonfig (or the nvidia equivalent, whatever it is), parse the output to extract the key numbers and log these with the BOINC-related info on task numbers, file downloads, RAC values, etc. I will then have an ongoing record to look at without having to do anything special - just browse the logs as per normal. It should be easy enough to maintain an 'expected value' like I do with RAC, and flag it if the temperature gets above a certain upper limit. That should make it a lot easier to spot failing fans (or failing TIM) long before the temperature gets to anything like 106C!!
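Just to make the idea concrete, something along these lines is what I have in mind - purely a sketch at this stage, with made-up function, file and limit names, run on each host as part of the normal 'visit':

# sketch only - names and thresholds are placeholders, not what's in the script yet
check_gpu_temp() {
    local limit=$1    # per-host limit in degrees C, e.g. 80
    local temp
    if command -v aticonfig > /dev/null 2>&1 ; then
        # AMD: needs access to the X display that owns the GPU
        temp=$(export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | grep -o '[0-9]*\.[0-9]*' | cut -d. -f1)
    elif command -v nvidia-smi > /dev/null 2>&1 ; then
        # NVIDIA: assumes a driver recent enough for the --query-gpu form
        temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits)
    else
        return 0    # no GPU utility found - nothing to log
    fi
    echo "$(date '+%Y-%m-%d %H:%M') GPU temp ${temp}C" >> gpu_temp.log
    [ "$temp" -gt "$limit" ] && echo "WARNING: GPU temp ${temp}C exceeds ${limit}C"
}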

It's pretty impressive what these chips will take without being destroyed.

Cheers,
Gary.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0


Interesting and well written. You should write detective stories :)
Only flaw I spotted in the plot was the need to hook up monitor and keyboard to run the CLI - surely you can SSH into your Linux boxes?

archae86
Joined: 6 Dec 05
Posts: 3157
Credit: 7224904931
RAC: 1042310


Gary, interesting account.

I'm curious as to what use you make of monitoring invalid result production. Trying to take advantage of inconclusives is labor-intensive and disturbingly dependent on the luck of the quorum partner, so not, I think, helpful for routine monitoring. But it appears to me that actual invalid results are currently quite rare on healthy machines in Parkes PMPS work, so that even a very small number clustered on a machine are likely to signify a problem.

My guess as to the typical behavior of healthy machines is not very scientific, but during my recent overclocking exercises on five 750 cards I was trying to take (labor-intensive) advantage of inconclusive results, and in that pursuit clicked on the Parkes task lists for dozens of machines. A great many of them had under 1% as many invalid results as valid results, with the most common number being zero, even on fast machines with many hundreds of tasks reported.

Contrary-wise, I came across a considerable number of unhealthy machines, generating 10% or more, sometimes much more, as many logged invalid results as valid. I'll hazard a guess the two predominant reasons are partially failed cooling (dust buildup, stopped fan, aged thermal paste...) and overenthusiastic overclocking, but the common element is that I suspect most of the owners are quite unaware.

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0


The factory TIM application is generally not all that great and does tend to wear down given a long enough run time. Often far too much TIM is applied at the factory.

I recently had a 7970 which would downclock due to overheating at times. I have seen the temps spike above 90C during these cases. I took it apart and the TIM was in bad shape and the memory and VRM pads were done for. There was also a thick hard layer of old TIM on the heatsink surface that took a while to scrub off.

I was able to find some replacements for the pads of the right thickness. The pads I got are 0.5mm thick for the memory chips and 1mm for the VRMs. Contact looks good as near as I can tell. I applied Tuniq TX4 TIM on the GPU, which has worked well with other devices. I still need to test out the card but hopefully all will be well.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: a GUI app for

Quote:
a GUI app for displaying temperatures under Linux.

I use AMDOverdriveCtl, which allows for overclocking and setting fan profiles. Recommended.


Quote:

There is a CLI utility (aticonfig) with zillions of options for looking at all sorts of parameters but I was too lazy to hook up a monitor, keyboard and mouse and consult the manpage for the options I needed. After all, the fan was the problem and the heat sink is now cool, so I just assumed that the temps would be OK and that it must be something else :-).


Ah yes, I think (-: is the hindsight smiley... There is a thread going on about invalid results between Windows, Linux and Mac.

I know with a degree of certainty that if I over-clock the GPU, I will get 1-2% errors (and run 1-2% faster), but I can down-volt without much problem. I hate invalid tasks... Recently I noticed a few invalids and yes, the arch-villains of the Windows world were picking on me - always a quorum of Windows. How dare they! I started compiling the evidence and could pinpoint exactly when the trouble started, but there was no good pattern: XP64, Win7 and Win10, all different.

I also run aticonfig in a cron script every 10 minutes to log a number of system temps and fan speeds (and shut down if too hot), so I looked to see if anything had happened around that date. Sure enough, I'd started overclocking the GPU following a restart of AMDOverdriveCtrl. Oops - an hour wasted; I should have looked there first to find the arch-villain!
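For what it's worth, the cron side of it is roughly this shape - the paths, the fan-speed query and the shutdown threshold are illustrative rather than my exact script:

# crontab entry - every 10 minutes
*/10 * * * * /home/agentb/bin/gpu_watch.sh

#!/bin/bash
# gpu_watch.sh - rough shape only; the real script logs several sensors
export DISPLAY=:0
temp=$(aticonfig --odgt | tail -1 | grep -o '[0-9]*\.[0-9]*' | cut -d. -f1)
fan=$(aticonfig --pplib-cmd "get fanspeed 0" | grep -o '[0-9]*%' | tail -1)
echo "$(date '+%F %T') temp=${temp}C fan=${fan}" >> /var/log/gpu_watch.log
[ "$temp" -gt 95 ] && sudo shutdown -h now    # too hot - power the box down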

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661826046
RAC: 35180293

RE: Interesting and well

Quote:
Interesting and well written. You should write detective stories :)


Thanks! :-).

Quote:
... surely you can SSH into your Linux boxes?


Sure. I could easily ssh from the server machine to the problem one that's maybe a few metres away. However, through force of habit, when dealing with a troublesome machine, I tend to walk those few metres and plug in the three cables just in case I want to do other things as well :-). Old habits die hard :-).

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661826046
RAC: 35180293

RE: I'm curious as to what

Quote:
I'm curious as to what use you make of monitoring invalid result production.


I don't really monitor invalids unless I'm prompted to look for errors or invalids by some other parameter being 'out of spec' so to speak. I get warnings for low RAC, low GPU/CPU task fetch rate, low GPU/CPU task return rate, all things that can be calculated locally using boinccmd or by parsing the content of stdoutdae.txt. I don't try to 'screen scrape' web pages. Problem hosts causing errors or invalids are not frequent so I have no problem just looking through a task list if something is amiss.

I have found that stable crunching conditions lead to fairly predictable values for each of the above-mentioned things. If I get warnings about any of these for no apparent reason, I will check the control script logs and then look on the website. This is the main reason why I would deliberately look at the actual numbers of errors and invalids.
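To give an idea of the locally calculated stuff, these are rough examples only of the sort of thing that gets extracted (the real fields and thresholds live in the script, and the parsing here is simplified; boinccmd's output format can vary a little between client versions):

# run on the host (or over ssh), somewhere boinccmd can reach the client
rac=$(boinccmd --get_project_status | grep host_expavg_credit | head -1 | awk '{print $2}')
tasks=$(boinccmd --get_tasks | grep -c 'WU name:')
echo "RAC=$rac  tasks on board=$tasks"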

Quote:
... it appears to me that actual invalid results are currently quite rare on healthy machines in Parkes PMPS work, so that even a very small number clustered on a machine are likely to signify a problem.


I agree completely. I have at times gone to my account page and clicked on the 'view tasks' link, which is, for me, a list of thousands of actual tasks but hopefully very few errors or invalids. If I select just the BRP6 invalids, I get a (hopefully) quite short list of tasks which includes the hostID of each host involved. What I don't want to see is a single hostID with multiple entries. If I do, like the example in this thread, I know I have to find the cause.

Quote:
... clicked on the Parkes task lists for dozens of machines. A great many of them had under 1% as many invalid results as valid results, with the most common number being zero, even on fast machines with many hundreds of tasks reported.


Yes, when I look at my 90 machines as a group, most are zero with a few odd hosts perhaps having one. If there is a host in the 90 with a problem, it really shows up.

Quote:
Contrary-wise, I came across a considerable number of unhealthy machines, generating 10% or more, sometimes much more, as many logged invalid results as valid. I'll hazard a guess the two predominant reasons are partially failed cooling (dust buildup, stopped fan, aged thermal paste...) and overenthusiastic overclocking, but the common element is that I suspect most of the owners are quite unaware.


I've also seen examples of such machines and I agree completely with your analysis. Sometimes when people come to the problems board and want to know why their machine is producing so many invalids, they tend to ask if there is a 'problem with the WUs' rather than thinking about what could be wrong at their end - which is a far more likely source of the solution.

Cheers,
Gary.

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1887
Credit: 1410997852
RAC: 1191498


Yeah I imagine it wouldn't be easy to check the video card (and CPU) temps all the time on 90 hosts.

I only have 7 GPUs running 24/7 here and try to check the temps when I go to check if they need to be reloaded with more tasks.

Mine are all OC'd or SC'd, and the main one I have to watch is a 560Ti since the factory fans quit (even the plastic frame fell apart). Since I have to keep them running all the time, I just rigged up a couple of AC fans I had in my stash to keep it at 83C, or in the 70s if it is cold enough outside.

The 650Ti's so far have always run cooler than the others, and I still have a 550Ti running. The 660Ti SC can run up into the upper 60s if it gets warm in the room, but I use EVGA PrecisionX to check all the stats and crank up the fans if they need it (and check the OC'ing and voltage levels)

http://eu.evga.com/precision/

You always wonder how long they will run, since they usually have a 3-year warranty and the way we run them 24/7 is not *typical*.

I never saved the date that I even started any of mine so I usually just check my account from the place I bought the cards to see the dates.

And I just use that Windows *Task Manager* to take a quick look at the CPU and Memory performance.

I am surprised that the one laptop I am on right now has been running GPU tasks non-stop for almost 4 years here. It is also an 8-core, so I have it running vLHC x2 and Atlas x1, and can even use all the other cores to run those Pogs since they aren't CPU or RAM hogs... At first the GeForce 610M it has was hard to keep running at low-80sC, until I switched to an SSD. SSDs run cooler, which made the 610M also run cooler, so right now it is at 72C.

I have all of mine running Windows 10 now after all those updates, except my original XP Pro on a 3-core x86 box that I am going to run as long as possible before I update it. The best thing about updating it is that it will then be x64 and will actually *see* the total RAM it has, instead of only the 4GB that x86 *thinks* it has.

The wife complains enough about my 7 hosts, so since they are all upstairs where she never goes, I just say I am ordering new parts and not another PC deal for me to plug another GeForce card into.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661826046
RAC: 35180293

RE: I use AMDOverdriveCtl

Quote:
I use AMDOverdriveCtl, which allows for overclocking and setting fan profiles. Recommended.


Thanks for that -- looks like a nice utility.

Quote:
I know with a degree of certainty that if I over-clock the GPU, I will get 1-2% errors (and run 1-2% faster) ...


When I first got the 7770 running at its full potential with a decent driver, I did start playing around with a bit of overclocking through the aticonfig utility but quickly came to the conclusion that the gains were too modest to justify the various risks - from potential invalid results to hotter running leading to premature hardware failures. I figured the safest (and easiest) course of action was to leave well enough alone and stay on stock settings. So when I got all the 7850s they just followed the same pattern - all stock settings.

As I continued adding to my control script, it was all about monitoring BOINC related stuff. I wasn't smart enough to also see the potential for predictive monitoring of hardware. With me, something needs to happen to make what should be blindingly obvious actually become blindingly obvious :-).

I find I can solve problems a whole lot faster by grabbing some innocent bystander and saying to them, "Have you got a minute? I just want to explain this to you." Of course, their eyes will glaze over, and you can see them desperately wanting to escape, but it's amazing how often the process of being forced to organise your thoughts at a sufficiently basic level to explain the problem triggers the light-bulb moment where you find (with hindsight) the simple and obvious solution.

This whole thread is a case in point. I had spent a week (on and off - mostly off) fiddling in the wrong direction. I started writing the opening post with no solution but with the intention of asking if people had any experience with possible hardware degradation due to overly high temps for an overlong period. I was organising my thoughts by explaining all the background and had got to the point of noting that the final change to 1x on the GPU was allowing results to validate when I suddenly realised that this didn't fit 'degraded hardware' but it fitted perfectly with temperature having been reduced - but how could that be, because the heat sink was always so coo.... Of course, the TIM!!!!

So I worked out the solution to the problem even before I'd captured my "innocent bystanders" for the "eye glazing" experience and could have cancelled all the background stuff already written and nobody would have known how stupid I had been. I decided to complete the story and fess up to my stupidity because lots of people will probably have this sort of failure with their GPUs over time and I hope the story might be of use to someone else.

As I was documenting the 106C temperature, I suddenly realised that since I 'ssh' into each machine to extract all the BOINC stuff and I already know what GPU (if any) a particular host has, it would be trivial to run an appropriate utility and parse the output at the same time. The ongoing log would be useful to spot temperature changes over time and so allow preemptive maintenance even before limits were reached and a warning triggered. I should have been doing this much earlier so this failure has been a good wake-up call. Sooner or later other GPU fans are going to start failing.

Cheers,
Gary.

AgentB
Joined: 17 Mar 12
Posts: 915
Credit: 513211304
RAC: 0

RE: RE: I use

Quote:
Quote:
I use AMDOverdriveCtl, which allows for overclocking and setting fan profiles. Recommended.

Thanks for that -- looks like a nice utility.


I have tinkered around with quite a few settings and the main value is to down-volt from 1.2V to about 1.05V, leaving the rest stock. I think that saves a few watts without losing much. I can't remember who gave me the hint on the down-volt but it was here.

Quote:
Sooner or later other GPU fans are going to start failing.

...and generate a few invalids as they start to struggle. I mentioned a few months ago on the BOINC forum the lack of feedback about invalids and suggested a simple API through which the BOINC client/manager could get feedback about its "health". I'm sure if the system tray turned red for each invalid, or made a passing wind noise, folks might oil a squeaky wheel (edit: or fan). In the meantime I log the invalids daily and, if any are found, send out Watson looking for clues.

Nice to know you can push them to 100C+ before bad things happen. You should give AMD/Nvidia a call and offer to crash test their GPUs...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117661826046
RAC: 35180293


I've been testing out commands suitable for use over ssh to measure the temperature of GPUs while crunching. My aim is to add suitable commands to my control script so that these measurements will be made regularly and recorded to an ongoing log. For the moment, I've been running commands manually over ssh from a terminal session on the server. I've just run through all current hosts with GPUs, 52 in total. Because of the power of the Linux shell and utilities, this was actually quite quick and easy to do.

Because I often use ssh to run odd commands on specific hosts, I organised things many years ago to make this easy. Each host has a meaningful name and a static IP address. There is an easy-to-use alias for the static address. The alias is just the last octet of the IP address with an 'H' in front of it. So $H10 translates to the host whose IP is 192.168.0.10, etc.
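For anyone wondering, they're just shell variables rather than true aliases. Something like this in the server's ~/.bashrc would do it (not necessarily exactly what I have, but it's the idea):

# define H2 .. H254 as 192.168.0.2 .. 192.168.0.254
for i in $(seq 2 254) ; do
    eval "H$i=192.168.0.$i"
done
# so:  ssh $H10   is the same as   ssh 192.168.0.10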

Here is a small excerpt of the temperatures I've just measured manually. I've selected some AMD and NVIDIA results so you can see how it went. I only need to type the full command once as I can use the shell's history mechanism to change just the bit that needs to be changed in order to repeat the command on a different host. The command is structured to return up to a 3-digit integer with 'C' for the scale. Repeating the command on a different host is trivial - ^xx^yy - which is interpreted by the shell as, "For the previous command, change the first occurrence of xx into yy and run the whole command again." The shell reprints the new command before running it so you can see exactly what was changed.

AMD GPUs start here
[gary@server ~]$ ssh $H10 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 70C
[gary@server ~]$ ^10^11
ssh $H11 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 70C
[gary@server ~]$ ^11^17
ssh $H17 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 69C
[gary@server ~]$ ^17^18
ssh $H18 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 65C
[gary@server ~]$ ^18^19
ssh $H19 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 71C
[gary@server ~]$ ^19^21
ssh $H21 "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | cut -c42- | sed s/\.00\ //"
 68C
....
....
NVIDIA GPUs start here
[gary@server ~]$ ssh $H27 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 63C
[gary@server ~]$ ^27^29
ssh $H29 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 61C
[gary@server ~]$ ^29^32
ssh $H32 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 64C
[gary@server ~]$ ^32^34
ssh $H34 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 80C
[gary@server ~]$ ^34^43
ssh $H43 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 66C
[gary@server ~]$ ^43^63
ssh $H63 "nvidia-smi | head -9 | tail -1 | cut -c8-11"
 67C
....

After doing the entire 52 GPUs, the highest temperature is actually in the above list for H34 -> 80C. I'm not surprised as this is a GTX-550Ti which is known to be a hot running GPU. The hottest AMD GPU was 76C. Considering the ambient in the room is currently 34C, I'm very happy there are none above 80C. The average temperature for all AMD GPUs was about 68-70C. The average for all NVIDIA GPUs was probably a tad lower but was skewed by a couple close to 80 that I'll need to check for fan speed and/or bad TIM.

So now I just need to add the functionality to the control script and start producing regular logs so I can see whether any trends toward hotter running start to develop. I'll code in a set value for a limit for each host (like I do for RAC) so I'll get a specific warning if the limit is exceeded.
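In rough outline, the server-side check might look something like this (the limits file, its name and format, and the messages are placeholders; only the AMD command is shown - the NVIDIA hosts would use the nvidia-smi pipeline from above instead):

# gpu_limits.txt holds one "last-octet limit" pair per GPU host, e.g.
#   10 75
#   34 85
while read -r octet limit ; do
    temp=$(ssh "192.168.0.$octet" "export DISPLAY=:0 ; xhost + > /dev/null ; aticonfig --odgt | tail -1 | grep -o '[0-9]*\.[0-9]*' | cut -d. -f1")
    echo "$(date '+%F %T') host $octet ${temp}C (limit ${limit}C)" >> gpu_temps.log
    [ "$temp" -gt "$limit" ] && echo "WARNING: host $octet at ${temp}C, limit is ${limit}C"
done < gpu_limits.txt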

Cheers,
Gary.
