Many invalid GPU WUs

Tom Philippart
Joined: 1 Oct 06
Posts: 11
Credit: 11670137
RAC: 0
Topic 196568

Could anyone please help me identify the problem of this host: http://einsteinathome.org/host/5573148/tasks

The WUs run for their normal time and all finish without error, but the server marks them as invalid.

What could be wrong? This host was very reliable in the past...

Betreger
Joined: 25 Feb 05
Posts: 992
Credit: 1594732322
RAC: 775029

Use GPU-Z to check your temps. It could be dust buildup on the heat sinks. Also try reseating the card.

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

Besides temps... sometimes a minor glitch can leave "garbage" in the GPU memory that interferes with the apps. This kind of thing can be fixed with a full power cycle (turning the machine off and on; just restarting it may not be enough).

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117811321817
RAC: 34697212

Quote:
Could anyone please help me identify the problem of this host: http://einsteinathome.org/host/5573148/tasks


All your tasks are failing with a validate error. There is a sticky thread about validate errors which you might like to peruse for some background information.

In short, when two results are in for a particular work unit, the validator gets called to do the validation. The first thing it does is to perform some basic sanity checks on each result to see if it's worthwhile to proceed to the full comparison. When this is being done on your results, the validator is very unhappy with what it finds. Here is the output for just one of your failed results.

Outcome 	Validate error [6] (00111010)
- result file has entries that aren't numbers
- a number is out of valid range for this result
- result file has (lines with) wrong number of columns
- result file has too few or too many rows

As you can see, the validator found lots of things to get upset about. The most likely cause of this is a hardware problem with your GPU possibly caused by overheating and/or overclocking. You need to check the fan and the heatsink fins for any blockage. I've had quite a bit of success not only with servicing the fan but also with removing the heatsink and reapplying the thermal paste. It often dries out after prolonged use and can easily be the cause of overheating.

Of course, there are many other possible causes of hardware instability but overheating and overclocking are two of the most likely.
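
To make the sanity checks mentioned above a bit more concrete, here is a minimal Python sketch. It is not the Einstein@Home validator: the whitespace-separated layout, the column count, the row limits and the value range (`EXPECTED_COLUMNS`, `MIN_ROWS`, `MAX_ROWS`, `VALUE_RANGE`) are all assumptions made up for illustration.

```python
import math

# All limits below are hypothetical; the real result format is not described in this thread.
EXPECTED_COLUMNS = 3
MIN_ROWS, MAX_ROWS = 1, 100
VALUE_RANGE = (-1.0e6, 1.0e6)


def sanity_check(text):
    """Return a sorted list of problems found in a whitespace-separated numeric table."""
    problems = set()
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not MIN_ROWS <= len(rows) <= MAX_ROWS:
        problems.add("too few or too many rows")
    for fields in rows:
        if len(fields) != EXPECTED_COLUMNS:
            problems.add("wrong number of columns")
        for field in fields:
            try:
                value = float(field)
            except ValueError:
                problems.add("entry is not a number")
                continue
            # float() accepts "nan" and "inf", so reject non-finite values explicitly.
            if not math.isfinite(value):
                problems.add("entry is not a number")
            elif not VALUE_RANGE[0] <= value <= VALUE_RANGE[1]:
                problems.add("number out of valid range")
    return sorted(problems)


good = "1.0 2.0 3.0\n4.0 5.0 6.0\n"
bad = "1.0 nan 3.0\n4.0 5e30\n7.0 8.0 x9\n"
print(sanity_check(good))
print(sanity_check(bad))
```

A clean table comes back with an empty list, while the deliberately broken one trips several checks at once, much like the failed result shown above.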

Cheers,
Gary.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Gary,

I am just a noob trying to understand what's going on and get better at debugging my own problems, so please excuse my ignorance.

I don't see how those errors lead to the overheating conclusion.

My (amateur) read of them suggests file formatting problems (not a number, wrong number of columns, wrong number of rows).

I would expect a GPU failure to produce erroneous results, but unless results like nan and inf are being ignored, I would expect to see results if the program completed without errors being reported and the output being produced.

I agree with you it's most probably a hardware error of some sort, just given the fact that it's isolated to one system.

I'm not arguing with your conclusion, just trying to follow the logic.

Joe

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

Quote:

Gary,

I am just a noob trying to understand what's going on and get better at debugging my own problems, so please excuse my ignorance.

I don't see how those errors lead to the overheating conclusion.

My (amateur) read of them suggests file formatting problems (not a number, wrong number of columns, wrong number of rows).

I would expect a GPU failure to produce erroneous results, but unless results like nan and inf are being ignored, I would expect to see results if the program completed without errors being reported and the output being produced.

I agree with you it's most probably a hardware error of some sort, just given the fact that it's isolated to one system.

I'm not arguing with your conclusion, just trying to follow the logic.

Joe


Amateur thoughts: The result files are plain text files, so if the GPU memory or bus fails when the app reads a certain byte to write it to the file, that data converted to text could lead to any of those errors... (a NAN is the most obvious, but if the string contains a "new line", a "" or any other "reserved" character, it could ruin the XML convention)
I have a 560Ti that was giving a lot of invalids (both kinds: the ones that do not fit the expected result and others that did not validate against wingmen). I fixed it by downclocking the memory speed a bit...

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Quote:
Amateur thoughts: The result files are plain text files, so if the GPU memory or bus fails when the app reads a certain byte to write it to the file, that data converted to text could lead to any of those errors... (a NAN is the most obvious, but if the string contains a "new line", a "" or any other "reserved" character, it could ruin the XML convention)
I have a 560Ti that was giving a lot of invalids (both kinds: the ones that do not fit the expected result and others that did not validate against wingmen). I fixed it by downclocking the memory speed a bit...


Horacio,
Again, all the disclaimers apply: I'm putting this up to find out what's missing from my understanding, certainly not to say I know what's going on and you or Gary don't.

Data sent to and received from the GPU is binary single precision floating point AFAIK.

If the result printing routines get 4 bytes of garbage, they still look like a floating point number, with a few special bit patterns that represent NAN (e.g. the result of 0/0 or the log of a negative number) or INF (the result of an overflow or of dividing a nonzero number by zero). I don't see how that screws up the text file output.

I do understand the not validating by wingman but I don't understand the formatting problems due to calculation errors by the GPU.
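
As a quick way to check the claim about bit patterns, here is a small Python sketch using the standard `struct` module. It only shows how 4 arbitrary bytes decode as an IEEE 754 single precision value and how they print as text. It says nothing about what the science app actually does with its buffers, and the helper name `decode_float32` is made up for illustration.

```python
import math
import struct


def decode_float32(raw):
    """Interpret 4 bytes as a little-endian IEEE 754 single precision float."""
    (value,) = struct.unpack("<f", raw)
    return value


# Exponent bits all ones with a nonzero mantissa is NaN; with a zero mantissa it is infinity.
nan_value = decode_float32(struct.pack("<I", 0x7FC00000))
inf_value = decode_float32(struct.pack("<I", 0x7F800000))
# An arbitrary "garbage" pattern still decodes to some float, here a very large one.
garbage = decode_float32(struct.pack("<I", 0x7F000000))

for value in (nan_value, inf_value, garbage):
    # Formatting never fails, but the text may be "nan", "inf" or a huge number.
    print(f"{value:.6g}")
```

So every 4-byte pattern does decode to something, but the printed text can still be "nan", "inf" or a wildly out-of-range number, which a strict validator could flag. By itself that would not explain wrong row or column counts.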

Joe

Horacio
Joined: 3 Oct 11
Posts: 205
Credit: 80557243
RAC: 0

Quote:
If the result printing routines get 4 bytes of garbage, they still look like a floating point number, with a few special bit patterns that represent NAN (e.g. the result of 0/0 or the log of a negative number) or INF (the result of an overflow or of dividing a nonzero number by zero). I don't see how that screws up the text file output.


If it's garbage, what leads you to think that it will look like an FP number, or that it will have the right marker to specify a NAN?
And also, if the error is in the variable that specifies the number of bytes that the app should read/transfer, then the result can have more or less data than expected (be it rows, columns or both)...

I'm not saying that I know for sure what happens, I just know what can be happening...

Of course, no matter what, it's always possible to add more validation checks to the app, but I (wildly) guess it's not worth it, because the servers will need to do the same validation again anyway, just in case there was some issue during the data transfer over the internet...

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117811321817
RAC: 34697212

Quote:

I don't see how those errors lead to the overheating conclusion.

My (amateur) read of them suggests file formatting problems (not a number, wrong number of columns, wrong number of rows).


The data that the validator program is trying to decipher is obviously scrambled in some way and obviously I have no way of really being 100% sure of exactly what did the scrambling. All I was trying to do was suggest the most likely cause of the scrambling. I'm doing this not from absolute knowledge but simply from personal, direct observation of similar situations with my own hardware. I did say "the most likely cause ... " and "possibly caused by ... " because how can anyone really be 100% sure of these things.

From personal experience, when machines overheat, sometimes the machine will lock up or crash and sometimes the machine will keep running and individual tasks will lock up or crash or produce results that don't validate. When this happens, if there are no obvious environmental causes (failed aircon, failed case fans, exceptionally hot day, etc), I pull the machine out of the rack and inspect the CPU fan and heatsink. Sometimes you can see that that's the problem right there. If the fan and heatsink look clean and the fan is fully free to rotate, I remove the heatsink and replace the thermal paste. I've done this many times over the years and in the vast majority of cases, the problem immediately goes away. The next thing I usually try is to back off the overclock a bit. If problems persist, I then start changing hardware and checking thoroughly for swollen capacitors.

At one point several years ago, I was running over 200 machines. Most were moderately overclocked so there was not a big margin before excess heat would cause problems. I got a lot of experience dealing with heat because I live in a sub-tropical climate and most of my machines are not in air conditioned offices.

Quote:
I would expect a GPU failure to produce erroneous results, but unless results like nan and inf are being ignored, I would expect to see results if the program completed without errors being reported and the output being produced.


Erroneous results are being produced and nothing is being ignored. The results are being uploaded and reported and it's only when the validator is examining the contents that they are being rejected. We can go to the website and click on a taskID link and read the stderr.txt output to see that the crunching terminated normally. Unless we take measures to trap the data before it gets uploaded and wiped, we can't readily examine the actual contents of the result data once we find out later that the validator doesn't like it. In the case of the OP, since all results are failing, it would be worth his while to temporarily disable uploads and make the effort to save copies of the 8 result files for a particular task that has just completed. He could browse all 8 copies and see if anything stands out as being the likely cause of the validator's unhappiness. If he did that and then restored network activity so that the files could upload, we could compare what the validator reports with the actual saved file contents.
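
One way to save those copies could be sketched in Python as below. The file and directory names in the demo are made up, and the assumption that the result files sit together in one directory with names starting with the task name should be checked on the actual machine. Network activity would need to be suspended in BOINC first so that the files are not uploaded and removed while copying.

```python
import shutil
import tempfile
from pathlib import Path


def save_result_files(project_dir, task_name, backup_dir):
    """Copy every file whose name starts with task_name into backup_dir."""
    source = Path(project_dir)
    target = Path(backup_dir)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(source.iterdir()):
        if path.is_file() and path.name.startswith(task_name):
            shutil.copy2(path, target / path.name)  # copy2 keeps timestamps
            copied.append(path.name)
    return copied


# Demo on a throwaway directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    project.mkdir()
    for suffix in ("_0", "_1"):
        (project / f"EXAMPLE_TASK{suffix}").write_text("1.0 2.0\n")
    (project / "other_file").write_text("unrelated\n")
    saved = save_result_files(project, "EXAMPLE_TASK", Path(tmp) / "backup")
    print(saved)
```

On a real host, `save_result_files` would be called with the BOINC project directory and the name of a task that has just finished, and the copies could then be browsed at leisure after uploads are re-enabled.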

At the end of the day, only the owner can (by replacing hardware/firmware/drivers bit by bit) really determine the cause of the problem. It's really worthwhile to eliminate things like heat/overclocking first though.

Cheers,
Gary.

joe areeda
Joined: 13 Dec 10
Posts: 285
Credit: 320378898
RAC: 0

Thanks Gary.

If the OP gets one of those corrupted files, I would be interested in seeing what it looks like.

Not that I expect to be able to make sense out of it.

Joe
