Hi folks,
it's me again, this time with sth a bit more serious. One of my crunching boxes, namely a P3 Coppermine (host ID 846283), keeps getting computation errors. This happens with QMC WUs as well as Einstein ones, so it can't have anything to do with the project or the WUs. The WUs get crunched to the end (afaik), but they never validate. This has been going on for more than a week now.
I've tried updating the BOINC client but it didn't help. The box is not overclocked and I think I can rule out a heat problem.
The computer is running Windows XP and was crunching away just fine for months before things suddenly turned evil...
Advice is very much appreciated since I really don't know what might be wrong (except maybe a hardware problem, but Windows seems to be running alright, no crashes...)
Thanks in advance
Annika
Copyright © 2024 Einstein@Home. All rights reserved.
Need help- keep getting computation errors
This could be a memory, CPU or cache problem. I recommend installing a hardware diagnostic tool and testing your system. For memory tests, memtest86+ is a very good tool.
Yes, I know that one, I've even got it here. I simply thought that if the hardware was failing, the problem would have resulted in stuff like freezing, reboots, bluescreens and so on, and that doesn't seem to be the case. Still, running memtest can't hurt, so I'll do it as soon as my dad stops using the box ;-)
The first two results (still) listed for that box show as aborted by user. The most recent one shows an unrecoverable error with exit code -1073741819 (0xc0000005).
You could check here what else to check on that error. As for the other two showing, did you (or your dad) ever abort those results by hand?
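For anyone wondering how -1073741819 and 0xc0000005 are the same number: the client reports the exit code as a signed 32-bit integer, and the hex form is just the same bits read as unsigned. On Windows, 0xC0000005 is the access violation status code. A minimal sketch of the conversion (the function name `to_unsigned_hex` is made up for illustration and is not part of BOINC):

```python
def to_unsigned_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as unsigned and format it as hex."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits, which turns a negative
    # two's-complement value into its unsigned equivalent.
    return f"0x{exit_code & 0xFFFFFFFF:08X}"


if __name__ == "__main__":
    print(to_unsigned_hex(-1073741819))  # 0xC0000005
```

An access violation means the application touched memory it was not allowed to, which can come from a software bug but also from corrupted data in faulty RAM or cache.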
Yes, I did, because my dad had paused BOINC and forgotten to switch it back on, so they were well beyond the deadline by the time I noticed. But as I said the box also produced computation errors in QMC, as you see here: http://qah.uni-muenster.de/results.php?hostid=40747
5 aborted by user (and not after deadline) and the other two have exit code 1 (0x1)
That exit code is the 'normal' exit code returned by the science application when it hits an error that the programmers haven't defined. So you should report that one on the QMC forums.
Well, thanks for the advice. Strange, no idea what's going on there. I didn't abort any QMC WUs.
Running memtest86+ showed some errors. Do you think that's part of the problem? I have no idea what would be normal for a 7-year-old box; maybe you always get some errors there? I've only ever used the tool on my own computer (which is less than a year old) and never got an error there. On the P3 it was about 600 errors after the first pass.
Well now, that looks to me like a memory error. Whether it's in the CPU cache or the RAM itself, I have no idea. If there's more than one SIMM or DIMM, remove one and run the test again. If you still get errors, remove the other one, put the one you took out back in, and try again. If you still get errors, it may well be the cache. If you no longer get errors, then the stick that was taken out is bad.
Note: I am not an expert and I'm sure others can give better advice, though what I just gave is standard fare for memtest. And no, memory errors are not normal; they mean that whatever data is written to that section of RAM will be corrupt.
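The swap-and-retest procedure above boils down to a small decision table. Here is a rough sketch of it, assuming exactly two sticks; the labels "A" and "B" and the function `diagnose` are hypothetical, and the verdicts are rules of thumb, not a definitive diagnosis:

```python
def diagnose(errors_a_alone: bool, errors_b_alone: bool, errors_both: bool) -> str:
    """Map memtest results for two sticks (tested alone and together) to a likely cause."""
    if errors_a_alone and errors_b_alone:
        # Both sticks failing on their own points away from a single bad stick.
        return "errors with either stick: suspect cache, slot or board"
    if errors_a_alone:
        return "stick A is the likely culprit"
    if errors_b_alone:
        return "stick B is the likely culprit"
    if errors_both:
        # Each stick passes alone but the pair fails.
        return "sticks pass alone but fail together: suspect a slot or a bad pairing"
    return "no errors now: reseating may have fixed a bad contact"


if __name__ == "__main__":
    print(diagnose(errors_a_alone=False, errors_b_alone=True, errors_both=True))
```

Each input is simply whether memtest reported any errors in that configuration, so it is worth letting every run finish at least one full pass before recording the result.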
seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.
Well, seems like you were right: it really was a memory error, though neither of the RAM sticks seems to be faulty. After taking each of them out in turn and getting no errors with either, I did another run with both and it worked fine. Apparently one of the sticks was not properly seated, or maybe the problem was dust or sth. All WUs have validated okay since then. Thanks a lot for your help!
Welcome. Sounds like perhaps a DIMM worked loose somehow, or perhaps some dust got between a few contacts and the RAM. Glad it was a simple repair :)