Computation Error

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2143
Credit: 2984563625
RAC: 737685

RE: ABP1 Workunit 57514358

Message 93497 in response to message 93496

Quote:

ABP1 Workunit 57514358 ran to 100% (23893.08 sec) and failed to validate with error code 2, on a dual core 64bit WinVista machine with plenty (4GB) of memory. No client or host system crashes noted; system not rebooted during 57514358's run. BOINC client is 6.6.28. I've run many S5 and a few ABPS tasks with no problem.

There's 6.6 cpu-hours down the toilet. Oh well.


There's a significant error message in there that somebody might like to look at:

Quote:
[05:02:31][7196][INFO ] Checkpoint committed!
[05:03:23][7196][ERROR] Couldn't rename temporary checkpoint file (status.cpt.tmp) to final checkpoint file: status.cpt (Result too large).
[05:03:23][7196][ERROR] Demodulation failed (error: 2)!
called boinc_finish
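
For anyone trying to picture the step that failed: the log shows the app writing `status.cpt.tmp` and then renaming it over `status.cpt`. As far as I know, "Result too large" is the Microsoft C runtime's message text for `ERANGE`, which is an odd error for a rename to report. Below is a minimal sketch of the write-then-rename pattern with a few retries, in Python. It is not the Einstein@Home code; `write_checkpoint`, the retry count and the delay are all hypothetical.

```python
import os
import tempfile
import time


def write_checkpoint(path, data, retries=3, delay=0.1):
    """Write data to path via a temporary file followed by a rename.

    Hypothetical helper for illustration only; the retry count and the
    delay between attempts are assumptions, not values from the real app.
    """
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming

    last_error = None
    for _ in range(retries):
        try:
            # os.replace overwrites an existing checkpoint in one step
            os.replace(tmp_path, path)
            return
        except OSError as exc:
            # e.g. another process briefly holding the file open
            last_error = exc
            time.sleep(delay)
    raise RuntimeError(f"could not rename {tmp_path} to {path}") from last_error


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "status.cpt")
        write_checkpoint(target, b"step=1")
        write_checkpoint(target, b"step=2")  # replaces the first checkpoint
        with open(target, "rb") as f:
            print(f.read())
```

The point of the retry loop is that a single failed rename need not throw away hours of work; whether that would have helped in this particular case is a guess.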
Xin Jing
Joined: 29 Aug 09
Posts: 2
Credit: 0
RAC: 0

Hey guys,

I've had nearly all of my tasks register an outcome of "client error", which is defined as a user-side error. The number of credits I would have been awarded is small, but it's concerning nevertheless. I'm not sure exactly what you need to see to help me, but here is some information on a few:

138154537
137864090
137846426

Like I said, the credits are fairly small, so it's not the lost credit that bothers me but the repeated, ambiguous fault known as "client error".

Then there's the next one, which is a bit more alarming because of the type of error and the amount of cpu time involved to complete the task. The outcome was "validate error" which is defined as a server-side error:

137821169

This all seems a bit strange, since I'm concurrently running Seti, Cosmology and Milkyway. With the exception of Cosmology, all are generating credits. Cosmology's tasks and error descriptions are a bit more vague and labeled with a simple "error".

I'm running BOINC Manager version 6.6.36 for Windows.

Here is my PC, as reported by CPU-Z version 1.52.2.

Any help would be appreciated.

-Xin

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Hmmmm...

You have a recent CBNC over on SAH right now with this host as well.

The 'hard' errors all look to be memory access faults (aka General Protection Faults).

Generally speaking, this usually indicates transient overheating in the processor's FPU, flaky memory, or both.

I see it just reported a task over on MW and it validated OK.

So the good news is that most likely whatever is wrong is fixable. :-)

The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

HTH,

Alinator

Xin Jing
Joined: 29 Aug 09
Posts: 2
Credit: 0
RAC: 0

RE: The bad news is it

Message 93500 in response to message 93499

Quote:


The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

I'm guessing your response was for my post and if so, thank you.

I have had a few BSODs recently and I'm not sure exactly what's triggering them. I installed a Cooler Master CPU cooler, which brought my CPU temperature at 100% usage down from 53°C to a much more manageable 33°C. When I went to Asus and used their PSU calculator to see how much wattage I need to run my current configuration, I came up short. I'll upgrade the PSU and continue to monitor the GPFs.

However, this doesn't directly address the times when my PC operated without fail for an entire task cycle, or the fact that I'm still generating a good amount of credits at Milkyway@Home. I could print out a certificate now that shows real accomplishments from contributions I've made so far. I can't figure out why Seti and Milkyway are working while Einstein and Cosmology are zombie flatliners.

I appreciate your observations and insight; I'll try the diagnostic apps you mentioned. I guess I've come full circle back to your comment that it's intermittent.

-Xin

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

RE: RE: The bad news is

Message 93501 in response to message 93500

Quote:
Quote:


The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

I'm guessing your response was for my post and if so, thank you.

I have had a few BSODs recently and I'm not sure exactly what's triggering them. I installed a Cooler Master CPU cooler, which brought my CPU temperature at 100% usage down from 53°C to a much more manageable 33°C. When I went to Asus and used their PSU calculator to see how much wattage I need to run my current configuration, I came up short. I'll upgrade the PSU and continue to monitor the GPFs.

However, this doesn't directly address the times when my PC operated without fail for an entire task cycle, or the fact that I'm still generating a good amount of credits at Milkyway@Home. I could print out a certificate now that shows real accomplishments from contributions I've made so far. I can't figure out why Seti and Milkyway are working while Einstein and Cosmology are zombie flatliners.

I appreciate your observations and insight; I'll try the diagnostic apps you mentioned. I guess I've come full circle back to your comment that it's intermittent.

-Xin

No Problemo.

Anyway, I'm assuming the PSU has been in the host for a while. If the size calculator is telling you you're near the borderline, this makes perfect sense.

Even if everything looks and tests fine during diagnostics, the point others were trying to make is that the dynamic load on the host when crunching in a real-world setting can bring on a fault you wouldn't otherwise see. In other words, when you have multiple projects running on multi-core processors, which projects are running and how many tasks from each can make a big difference to the localized thermal loading on the processor die itself. If you also use the machine for your own work, you have to factor that into the equation too.

If you look back through the forum to when the Q6600 was new, you'll find there were more than a few reports of them being sensitive to localized dynamic thermal loading in their FPUs, especially with the early steppings and when running what is now the stock EAH app (it was a Beta/Power app back then).

If your PSU is borderline for your host now, and is showing excessive sag or, worse, breaking regulation (unlikely), this won't help matters.

It might even be that normal semiconductor aging is making the processor (or memory) itself more sensitive to this. In that case, since you have a beefy Cooler Master on it now, you might try nudging the processor voltages up a little (if you can) and see if that helps.

So the bottom line is that this is probably just the result of a number of little things that by themselves don't cause trouble, but can come together unexpectedly from time to time and show up as intermittent compute and/or validate errors.

Alinator

koniiiik
Joined: 8 Feb 09
Posts: 5
Credit: 288224
RAC: 0

Hmm, I attached a new machine a few days ago and it hasn't yet finished a single WU successfully; all crashed with compute errors, e.g. tasks 139775953, 139775951, 139775945, …
All of them crashed on signal 8, which means floating point exception. The computer has a C2D CPU; probably some CPU feature was detected incorrectly.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 767396393
RAC: 1082219

RE: Hmm, I attached a new

Message 93503 in response to message 93502

Quote:
Hmm, I attached a new machine a few days ago and it hasn't yet finished a single WU successfully; all crashed with compute errors, e.g. tasks 139775953, 139775951, 139775945, …
All of them crashed on signal 8, which means floating point exception. The computer has a C2D CPU; probably some CPU feature was detected incorrectly.

Signal 8 on Linux can be tracked down to one of two causes most of the time:

1) a specific bug in the Linux kernel. However, that bug was fixed some kernel versions ago already, and your Linux kernel seems to be a very recent one. Still, you might try installing BOINC and Einstein@Home on a Linux live system booted from a CD/DVD to see whether this problem depends on your distro and kernel patches (you seem to use TuxOnIce, right?)

or

2) excessive overclocking and/or overheating.
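
To double-check that a task really died on signal 8, the number can be mapped back to its name. A minimal sketch in Python, assuming a Linux host (where signal 8 is `SIGFPE`); `describe_exit` is a hypothetical helper, and it relies on the convention used by Python's `subprocess` module that a return code of -N means the child was killed by signal N. Keep in mind that on x86 `SIGFPE` is also raised for an integer divide by zero, so despite the name it does not always point at floating point code.

```python
import signal


def describe_exit(returncode):
    """Turn a subprocess-style return code into a readable description.

    Hypothetical helper: a negative value -N is taken to mean
    "terminated by signal N", as subprocess reports it on POSIX systems.
    """
    if returncode < 0:
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # the number is not a signal known on this platform
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    print(describe_exit(-8))
    print(describe_exit(0))
```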

Good luck
Bikeman

koniiiik
Joined: 8 Feb 09
Posts: 5
Credit: 288224
RAC: 0

RE: Signal 8 on Linux can

Message 93504 in response to message 93503

Quote:

Signal 8 on Linux can be tracked down to one of two causes most of the time:

1) a specific bug in the Linux kernel. However, that bug was fixed some kernel versions ago already, and your Linux kernel seems to be a very recent one. Still, you might try installing BOINC and Einstein@Home on a Linux live system booted from a CD/DVD to see whether this problem depends on your distro and kernel patches (you seem to use TuxOnIce, right?)


Hmm, I can try that sometime. Yes, the kernel is a Gentoo TuxOnIce one.

Quote:
2) excessive overclocking and/or overheating.


No overclocking here, as well as no overheating.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5876
Credit: 118524997872
RAC: 26253878

RE: No overclocking here,

Message 93505 in response to message 93504

Quote:
No overclocking here, as well as no overheating.


Are you really sure about that? I presume you are using the stock Intel HSF? I've built quite a number of new dual and quad cores recently, and I'd like to think I've really worked out how to lock down the stock Intel HSF properly and evenly. Early on (even though I was ultra careful) I found it quite easy to end up with the HSF not quite squarely clamped to the CPU heat spreader (one corner not fully locked despite seeming to be). The characteristics were that the machine would boot and run fine on normal loads but would not be able to complete full load testing. The problem was identified when I pulled vigorously on the HSF and one corner rather easily popped out of lock. The problem ended when I reapplied the HSF with all 4 posts properly locked down.

Cheers,
Gary.

koniiiik
Joined: 8 Feb 09
Posts: 5
Credit: 288224
RAC: 0

Message 93506 in response to message 93505

Heh, it's a laptop. (-;