Could anyone please help me identify the problem of this host: http://einsteinathome.org/host/5573148/tasks
The WUs run for their normal time and all finish without error, but they are marked as invalid by the server.
What could be wrong? This host was very reliable in the past...
Copyright © 2024 Einstein@Home. All rights reserved.
many invalid gpu WUs
Use GPU-Z to check your temps. It could be dust buildup on the heatsinks. Also try reseating the card.
Beside temps... sometimes a
Besides temps... sometimes a minor glitch can leave "garbage" in the GPU memory that interferes with the apps. This kind of thing can be fixed by a full power cycle (turning the machine off and on; just restarting it may not be enough).
RE: Could anyone please
All your tasks are failing with a validate error. There is a sticky thread about validate errors which you might like to peruse for some background information.
In short, when two results are in for a particular work unit, the validator gets called to do the validation. The first thing it does is to perform some basic sanity checks on each result to see if it's worthwhile to proceed to the full comparison. When this is being done on your results, the validator is very unhappy with what it finds. Here is the output for just one of your failed results.
As you can see, the validator found lots of things to get upset about. The most likely cause of this is a hardware problem with your GPU possibly caused by overheating and/or overclocking. You need to check the fan and the heatsink fins for any blockage. I've had quite a bit of success not only with servicing the fan but also with removing the heatsink and reapplying the thermal paste. It often dries out after prolonged use and can easily be the cause of overheating.
Of course, there are many other possible causes of hardware instability but overheating and overclocking are two of the most likely.
Cheers,
Gary.
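The kind of basic sanity check described above can be sketched roughly as follows. This is only an illustration, not the project's validator: the whitespace-separated table layout, the function name `sanity_check` and the `expected_columns` parameter are all assumptions made for the example.

```python
import math


def sanity_check(text, expected_columns):
    """Return a list of problems found in a whitespace-separated numeric table.

    Hypothetical format: one row per line, a fixed number of numeric columns.
    """
    problems = []
    rows = [line.split() for line in text.splitlines() if line.strip()]
    if not rows:
        problems.append("no data rows")
    for number, fields in enumerate(rows, start=1):
        # Wrong number of columns in this row.
        if len(fields) != expected_columns:
            problems.append(
                f"row {number}: expected {expected_columns} columns, got {len(fields)}"
            )
        for field in fields:
            try:
                value = float(field)
            except ValueError:
                # The text could not be parsed as a number at all.
                problems.append(f"row {number}: {field!r} is not a number")
                continue
            # float() accepts "nan" and "inf", so reject them explicitly.
            if not math.isfinite(value):
                problems.append(f"row {number}: non-finite value {field!r}")
    return problems
```

A result that trips any of these checks would be rejected before a full comparison with the wingman's result is even attempted, which matches the behaviour described for the failed tasks.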
Gary, I am just a noob
Gary,
I am just a noob trying to understand what's going on and get better at debugging my own problems, so please excuse my ignorance.
I don't see how those errors lead to the overheating conclusion.
My (amateur) read of them suggests file formatting problems (not a number, wrong number of columns, wrong number of rows).
I would expect a GPU failure to produce erroneous results, but unless results like nan and inf are being ignored, I would expect to see results if the program completed without errors being reported and the output being produced.
I agree with you that it's most probably a hardware error of some sort, just given the fact that it's isolated to one system.
I'm not arguing with your conclusion, just trying to follow the logic.
Joe
RE: Gary, I am just a noob
Amateur thoughts: The result files are plain text files, so if the GPU memory or bus fails when the app reads a certain byte to write it to the file, that data converted to text could lead to any of those errors... (a NAN is the most obvious, but if the string contains a "new line", a "" or any other "reserved" character it could ruin the XML convention)
I have a 560Ti that was giving a lot of invalids (both kinds: the ones that don't fit the expected result and others that were not validated against wingmen). I fixed it by downclocking the memory speed a bit...
RE: Amateur thoughts: The
Horacio,
Again, all the disclaimers apply: I'm putting this up to find out what's missing from my understanding, certainly not to say I know what's going on and you or Gary don't.
Data sent to and received from the GPU is binary single precision floating point AFAIK.
If the result printing routines get 4 bytes of garbage they still look like a floating point number with a few special bit patterns that represent NAN (result of a divide by zero or log of a neg number) or INF for an exponent under/overflow. I don't see how that screws up the text file output.
I do understand the not validating by wingman but I don't understand the formatting problems due to calculation errors by the GPU.
Joe
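The point about 4 bytes of garbage can be checked with a small sketch. It assumes little-endian IEEE 754 single precision values; the sample byte strings and the helper names `decode_float32` and `as_text` are made up for the illustration and say nothing about how the app itself stores or prints its results.

```python
import struct


def decode_float32(raw):
    """Interpret exactly 4 bytes as a little-endian IEEE 754 single."""
    (value,) = struct.unpack("<f", raw)
    return value


def as_text(raw):
    """Format the decoded value the way a simple printing routine might."""
    return f"{decode_float32(raw):g}"


# Hypothetical sample bit patterns, written as little-endian bytes.
samples = {
    "ordinary": bytes.fromhex("0000803f"),  # 1.0
    "nan": bytes.fromhex("0000c07f"),       # a quiet NaN
    "inf": bytes.fromhex("0000807f"),       # +infinity
    "garbage": bytes.fromhex("deadbeef"),   # arbitrary bytes
}

for name, raw in samples.items():
    print(name, as_text(raw))
```

Every 32-bit pattern decodes to some finite value, an infinity or a NaN, so corrupting the value bytes alone always yields a printable token. That would explain "not a number" complaints, but not by itself a wrong number of rows or columns.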
RE: If the result printing
If it's garbage, what leads you to think that it will look like an FP number, or that it will have the right marker to specify a NAN?
Also, if the error is in the variable that specifies the number of bytes the app should read/transfer, then the result can have more or less data than expected (be it rows, columns or both)...
I'm not saying that I know for sure what happens, I just know what could be happening...
Of course, no matter what, it's always possible to add more validation checks to the app, but I (wildly) guess it's not worth it, because the servers will need to do the same validation again anyway, just in case there was some issue during the data transfer over the internet...
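The idea that a corrupted size variable changes the shape of the output can be sketched like this. The flat list of values, the `format_table` helper and its `rows` and `columns` parameters are all hypothetical; the sketch only shows how intact numbers can still end up in the wrong number of rows or columns when the dimensions used to print them are wrong.

```python
def format_table(values, rows, columns):
    """Format a flat list of numbers as text lines, trusting the given dimensions."""
    lines = []
    for r in range(rows):
        chunk = values[r * columns:(r + 1) * columns]
        if chunk:
            lines.append(" ".join(f"{v:g}" for v in chunk))
    return lines


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

good = format_table(values, rows=2, columns=3)    # intended layout
short = format_table(values, rows=1, columns=3)   # row count corrupted
skewed = format_table(values, rows=2, columns=4)  # column count corrupted
```

Here `good` has two rows of three values, `short` is missing a row, and `skewed` has one row of four values and one of two, even though none of the numbers themselves were damaged.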
RE: I don't see how those
The data that the validator program is trying to decipher is obviously scrambled in some way and obviously I have no way of really being 100% sure of exactly what did the scrambling. All I was trying to do was suggest the most likely cause of the scrambling. I'm doing this not from absolute knowledge but simply from personal, direct observation of similar situations with my own hardware. I did say "the most likely cause ... " and "possibly caused by ... " because how can anyone really be 100% sure of these things.
From personal experience, when machines overheat, sometimes the machine will lock up or crash, and sometimes the machine will keep running and individual tasks will lock up, crash or produce results that don't validate. When this happens, if there are no obvious environmental causes (failed aircon, failed case fans, exceptionally hot day, etc.), I pull the machine out of the rack and inspect the CPU fan and heatsink. Sometimes you can see that that's the problem right there. If the fan and heatsink look clean and the fan is fully free to rotate, I remove the heatsink and replace the thermal paste. I've done this many times over the years and in the vast majority of cases, the problem immediately goes away. The next thing I usually try is to back off the overclock a bit. If problems persist, I then start changing hardware and checking thoroughly for swollen capacitors.
At one point several years ago, I was running over 200 machines. Most were moderately overclocked so there was not a big margin before excess heat would cause problems. I got a lot of experience dealing with heat because I live in a sub-tropical climate and most of my machines are not in air conditioned offices.
Erroneous results are being produced and nothing is being ignored. The results are being uploaded and reported and it's only when the validator is examining the contents that they are being rejected. We can go to the website and click on a taskID link and read the stderr.txt output to see that the crunching terminated normally. Unless we take measures to trap the data before it gets uploaded and wiped, we can't readily examine the actual contents of the result data once we find out later that the validator doesn't like it. In the case of the OP, since all results are failing, it would be worth his while to temporarily disable uploads and make the effort to save copies of the 8 result files for a particular task that has just completed. He could browse all 8 copies and see if anything stands out as being the likely cause of the validator's unhappiness. If he did that and then restored network activity so that the files could upload, we could compare what the validator reports with the actual saved file contents.
At the end of the day, only the owner can (by replacing hardware/firmware/drivers bit by bit) really determine the cause of the problem. It's really worthwhile to eliminate things like heat/overclocking first though.
Cheers,
Gary.
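Saving copies of the result files before they upload could be done by hand, or with a small script along these lines. This is a sketch under assumptions: it presumes the result files sit in the project directory and that their names start with the task name, and the function name `save_result_copies` and all paths are placeholders to be replaced with whatever is actually on the host.

```python
import shutil
from pathlib import Path


def save_result_copies(project_dir, task_name, backup_dir):
    """Copy every file whose name starts with task_name into backup_dir.

    Returns the sorted list of file names that were copied.
    """
    project_dir = Path(project_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(project_dir.iterdir()):
        # Assumption: result files share the task name as a prefix.
        if path.is_file() and path.name.startswith(task_name):
            shutil.copy2(path, backup_dir / path.name)
            copied.append(path.name)
    return copied
```

It would be run while uploads are still disabled, right after the task finishes, so that the copies exist before the originals are uploaded and wiped.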
Thanks Gary. If the OP
Thanks Gary.
If the OP gets one of those corrupted files, I would be interested in seeing what it looks like.
Not that I expect to be able to make sense out of it.
Joe