Need help- keep getting computation errors

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0
Topic 192602

Hi folks,
it's me again, this time with something a bit more serious. One of my crunching boxes, namely a P3 Coppermine (host ID 846283), keeps getting computation errors. This happens with QMC WUs as well as Einstein ones, so it can't have anything to do with the project or the WUs. The WUs get crunched to the end (afaik), but they never validate. This has been going on for more than a week now.
I've tried updating the BOINC client but it didn't help. The box is not overclocked and I think I can rule out a heat problem.
The computer is running Windows XP and was crunching away just fine for months before things suddenly turned evil...
Advice is very much appreciated since I really don't know what might be wrong (except maybe a hardware problem, but Windows seems to be running alright, no crashes...)
Thanks in advance
Annika

Dotsch
Joined: 1 May 05
Posts: 50
Credit: 643422
RAC: 1716

This could be a memory, CPU or cache problem. I recommend installing a hardware diagnostic tool and testing your system. For memory tests, memtest86+ is a very good tool.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Yes, I know that one, I've even got it here. I simply thought that if the hardware was failing, the problem would have resulted in stuff like freezing, reboots, bluescreens and so on, and that doesn't seem to be the case. Still, running memtest can't hurt, so I'll do it as soon as my dad stops using the box ;-)

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 42

The first two results (still) showing on that box show aborted by user. The most recent one shows Unrecoverable error exit code -1073741819 (0xc0000005).

You could check here what else to check on that error. As for the other two showing, did you (or your dad) ever abort those results by hand?
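
For readers wondering how the two forms of that exit code relate: the decimal value is the signed 32-bit reading of the hex value. Below is a minimal Python sketch of the conversion. The function names and the small lookup table are illustrative assumptions, not part of BOINC or of any project application; the one entry in the table, 0xC0000005, is the Windows status code for an access violation.

```python
# Hypothetical helpers for illustration only; not BOINC code.

# A single well-known Windows NTSTATUS value (deliberately not exhaustive).
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_unsigned_hex(exit_code: int) -> str:
    """Return the unsigned 32-bit hex form of a signed exit code."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def describe(exit_code: int) -> str:
    """Look up a symbolic name for the exit code, if we know one."""
    return KNOWN_STATUS.get(exit_code & 0xFFFFFFFF, "unknown")


print(to_unsigned_hex(-1073741819))  # 0xc0000005
print(describe(-1073741819))         # STATUS_ACCESS_VIOLATION
```

An access violation only says that the application touched memory it should not have; faulty RAM is one possible cause among several, so this narrows the search rather than settling it.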

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Yes, I did, because my dad had paused BOINC and forgotten to switch it back on, so they were well beyond the deadline by the time I noticed. But as I said the box also produced computation errors in QMC, as you see here: http://qah.uni-muenster.de/results.php?hostid=40747

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 42

Message 62646 in response to message 62645

5 aborted by user (and not after deadline) and the other two have exit code 1 (0x1)

That exit code is the 'normal' exit code returned by the science application when it hits an error that the programmers haven't defined. So you should report that one on the QMC forums.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well, thanks for the advice. Strange, no idea what's going on there. I didn't abort any QMC WUs.
Running memtest86+ showed some errors; do you think that's part of the problem? I have no idea what would be normal for a 7-year-old box; maybe you always get some errors there? I only ever used the tool on my own computer (which is less than a year old) and never got an error there. On the P3 it was about 600 errors after the first pass.

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35825044
RAC: 0

Message 62648 in response to message 62647

Quote:
Well, thanks for the advice. Strange, no idea what's going on there. I didn't abort any QMC WUs.
Running memtest86+ showed some errors; do you think that's part of the problem? I have no idea what would be normal for a 7-year-old box; maybe you always get some errors there? I only ever used the tool on my own computer (which is less than a year old) and never got an error there. On the P3 it was about 600 errors after the first pass.

Well now, that looks to me like there's a memory error. Whether it's with the CPU cache or the RAM itself I have no idea. If there's more than one SIMM or DIMM, remove one and run it again. If you still get errors, remove the other one, put the one you took out back in, and try again. If you still get errors, it may well be the cache. If you no longer get errors, then the stick that was taken out is bad.

Note: I am not an expert and I'm sure others can give better advice, though what I just gave is standard fare for memtest. And no, memory errors are not normal; they mean that whatever data is written to that section of RAM will be corrupt.
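
The swap-and-retest procedure above can be summed up as a small decision table. Here is a rough Python sketch for the two-stick case; the function name, the labels "A" and "B", and the wording of the outcomes are assumptions made for illustration, and the last branch (no errors with either stick alone) is an added guess rather than something stated in the advice.

```python
# Hypothetical helper: interprets memtest results for two RAM sticks,
# each tested on its own. Not part of memtest86+ or any real tool.

def diagnose(errors_a_alone: bool, errors_b_alone: bool) -> str:
    """Suggest a likely culprit from two single-stick memtest runs."""
    if errors_a_alone and errors_b_alone:
        # Both runs fail, so a single bad stick is unlikely.
        return "errors with either stick: suspect the CPU cache or the board"
    if errors_a_alone:
        return "stick A looks bad"
    if errors_b_alone:
        return "stick B looks bad"
    # Assumption beyond the original advice: neither stick fails alone.
    return "no errors with either stick: reseat both and test them together"


print(diagnose(errors_a_alone=False, errors_b_alone=True))
```

This only covers two sticks and one pass each; intermittent faults may need several passes before a run can be called clean.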

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Well, seems like you were right: it really was a memory error, though none of the RAM sticks seems to be faulty. After taking each of them out in turn and getting no errors with either, I did another run with both and it worked fine. Apparently one of the RAM sticks was not properly installed, or maybe the problem was dust or something. All WUs have validated okay since then. Thanks a lot for your help!

paul milton
Joined: 16 Sep 05
Posts: 329
Credit: 35825044
RAC: 0

Message 62650 in response to message 62649

Quote:
Well, seems like you were right: it really was a memory error, though none of the RAM sticks seems to be faulty. After taking each of them out in turn and getting no errors with either, I did another run with both and it worked fine. Apparently one of the RAM sticks was not properly installed, or maybe the problem was dust or something. All WUs have validated okay since then. Thanks a lot for your help!

Welcome. Sounds like perhaps a DIMM worked loose somehow, or perhaps some dust got between a few contacts and the RAM. Glad it was a simple repair :)

seeing without seeing is something the blind learn to do, and seeing beyond vision can be a gift.
