Parkes PMPS XT GPU Errors

ritterm
Joined: 18 Jun 08
Posts: 23
Credit: 46657826
RAC: 0
Topic 198172

My Q6600 GTX 570 host has been throwing off errors in the past few days (sorry if those links are public). Not all of these tasks have returned errors.

It's running two GPU tasks at a time along with 4 CPU tasks (CMS-dev, Cosmology, Einstein). Not dedicating a CPU (or CPUs) to Einstein has not been a problem before, that I know of. Although this host has 4GB RAM, I haven't seen it use more than ~70% recently. I'm seeing no errors on other projects.

The following messages are typical of what's included in the stderr output. There may be other messages that are significant, but I'm not sure.

[12:35:51][3140][ERROR] Error during CUDA device->host HS data transfer (error: 999)
[12:35:51][3140][ERROR] Demodulation failed (error: 1008)!
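To see which error codes keep coming up across several failed tasks, the stderr lines can be tallied with a short script. This is only a sketch: the line layout is inferred from the two lines quoted above, the meaning of the second bracketed field is not stated in the output (it is called `tag` here as a placeholder), and the function names are made up for illustration.

```python
import re
from collections import Counter

# Layout assumed from the quoted stderr: [HH:MM:SS][number][LEVEL] message
LINE_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\]\[(?P<tag>\d+)\]\[(?P<level>[A-Z]+)\]\s+(?P<msg>.*)$"
)
# Optional "(error: N)" suffix inside the message
CODE_RE = re.compile(r"\(error:\s*(?P<code>\d+)\)")

SAMPLE = """\
[12:35:51][3140][ERROR] Error during CUDA device->host HS data transfer (error: 999)
[12:35:51][3140][ERROR] Demodulation failed (error: 1008)!
"""


def parse_stderr(text):
    """Return one dict per line that matches the assumed layout."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip lines in any other format
        code = CODE_RE.search(match.group("msg"))
        records.append({
            "time": match.group("time"),
            "tag": int(match.group("tag")),
            "level": match.group("level"),
            "msg": match.group("msg"),
            "code": int(code.group("code")) if code else None,
        })
    return records


def count_error_codes(text):
    """Count how often each numeric error code appears on ERROR lines."""
    return Counter(
        r["code"]
        for r in parse_stderr(text)
        if r["level"] == "ERROR" and r["code"] is not None
    )
```

Pasting the stderr of each failed task into `count_error_codes` shows whether the same pair of codes appears every time or whether the failures vary.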

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109390106771
RAC: 35896364

My first guess is that you might have a heat issue. Your errors are all in BRP6. You have none with BRP4G and you have no invalids. BRP6 was made more efficient by HB not that long ago, so perhaps the higher throughput is pushing your card 'over the edge' occasionally. There are two things I would do to check this. Run only BRP4G and measure the temperature, and then run only BRP6 and see if the temperature is noticeably higher. If it is, try a 'poor man's cooling upgrade' - i.e., take the side panel off and point a desk fan at the GPU and see if the errors go away. If that works, try renewing the thermal grease under the GPU heat sink and/or checking the condition of the GPU fan.
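The BRP4G versus BRP6 temperature comparison could be done by logging samples during each run and comparing the averages. The following is a rough sketch, not a tested tool: it assumes the installed NVIDIA driver ships an `nvidia-smi` that supports the `--query-gpu=temperature.gpu` query on this card, the 5 °C margin for "noticeably higher" is an arbitrary placeholder, and the function names are made up for illustration.

```python
import statistics
import subprocess


def read_gpu_temp_c(index=0):
    """Read the current GPU temperature in degrees C via nvidia-smi.

    Assumes nvidia-smi is on PATH and supports this query for the card.
    """
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={index}",
            "--query-gpu=temperature.gpu",
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return int(out.strip().splitlines()[0])


def compare_runs(brp4g_temps, brp6_temps, margin_c=5.0):
    """Compare mean temperatures of two runs.

    margin_c is an assumed threshold, not a value from the thread.
    """
    mean_4g = statistics.mean(brp4g_temps)
    mean_6 = statistics.mean(brp6_temps)
    delta = mean_6 - mean_4g
    return {
        "brp4g_mean": mean_4g,
        "brp6_mean": mean_6,
        "delta": delta,
        "noticeably_hotter": delta >= margin_c,
    }
```

Call `read_gpu_temp_c()` every minute or so while only BRP4G is running, repeat while only BRP6 is running, then pass the two lists to `compare_runs`.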

If it's not heat, I would check out the board, PSU and possibly RAM. 4GB is OK for RAM for that host - I have a couple of Q6600s using that amount with no issues. It doesn't sound like a RAM problem because you don't have issues with CPU tasks, so I would check other things first unless you have a spare set you could try out immediately.

The age of Q6600s is such that you could be having problems with electrolytic capacitors, either on the mainboard or in the PSU, particularly if both of these are 'original'. If I find a system of mine starting to do strange things, I check for signs of bulging caps. These are commonly the culprit and once replaced with good quality low ESR (equivalent series resistance) ones, the problems disappear. I've been doing quite a few repairs of late due to deterioration of filter caps in the secondary side of the PSU or in the VRM circuit of a board. There's lots of good info for dealing with bad caps in these forums and the board dealing with PSU issues is quite active.

It's very easy to spot a lot of problem caps - the swelling of the flat top is a dead giveaway. Some caps fail before they bulge and an ESR meter can detect these in circuit. Even if the host still appears OK, it's only a matter of time before you'll start to see increasing problems. The PSU will probably be the first place for issues to appear and the extra ripple that's allowed through to the board will hasten the demise there as well. So, to prevent problems later on, it's a good idea to fix or replace PSUs with bulging and/or high ESR caps as soon as you see them.

Cheers,
Gary.

Logforme
Joined: 13 Aug 10
Posts: 332
Credit: 1714373961
RAC: 0

Google found this old thread.
Seemed to be either a driver issue or GPU clock problem.

ritterm
Joined: 18 Jun 08
Posts: 23
Credit: 46657826
RAC: 0

Logforme wrote:
Google found this old thread...


Ack! I must remember to use Google. A forum search using the same words in that thread's title came up empty... :-/

I've updated the driver and will try other GPU tasks when I get them (the BRP4G queue has no tasks to send as of now). I'll also inspect components and dust out the innards -- this one is due for some housekeeping.

Thanks for the feedback, you guys! :-)

ritterm
Joined: 18 Jun 08
Posts: 23
Credit: 46657826
RAC: 0

I finally got enough time to open up my problem host for a thorough cleaning and inspection. I didn't see any obviously bad components on the mobo or GPU board -- no bulging caps, no telltale flash residue, both of which I've seen before. I doubt the PSU is an issue as it's a fairly recent high quality upgrade. I did remove the GPU heatsink and gave it a fresh layer of thermal paste (the heatsink literally fell off with zero effort on my part -- the paste had completely dried out). Unfortunately, none of that seemed to do much to reduce the GPU temperature, or temperatures in general, and it threw off another computation error not too long after it started back up.

So, I'm not sure what I'm going to do in this case. The only problem this host seems to have developed is running Einstein GPU tasks, either one or two at a time. It doesn't have trouble running GPUGrid or Collatz alongside the same CPU tasks. One thing I haven't tried is running only Einstein GPU tasks to see if that changes anything; however, that's not something I'd want to do long term. I'd rather run Einstein, but since it runs GPUGrid okay, I can't justify buying a new GPU.
