Computation Error

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 2984563625

RAC: 737685

RE: ABP1 Workunit 57514358

29 Aug 2009 14:38:59 UTC

Message 93497 in response to message 93496


Quote:

ABP1 Workunit 57514358 ran to 100% (23893.08 sec) and failed to validate with error code 2, on a dual core 64bit WinVista machine with plenty (4GB) of memory. No client or host system crashes noted; system not rebooted during 57514358's run. BOINC client is 6.6.28. I've run many S5 and a few ABPS tasks with no problem.

There's 6.6 cpu-hours down the toilet. Oh well.

There's a significant error message in there that somebody might like to look at:

Quote:

[05:02:31][7196][INFO ] Checkpoint committed!
[05:03:23][7196][ERROR] Couldn't rename temporary checkpoint file (status.cpt.tmp) to final checkpoint file: status.cpt (Result too large).
[05:03:23][7196][ERROR] Demodulation failed (error: 2)!
called boinc_finish
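
The log above suggests the app writes its checkpoint to a temporary file and then renames it over the final one. A minimal Python sketch of that write-then-rename pattern follows, with a few retries in case the rename fails transiently. The idea that a retry would help here is an assumption, not something the log confirms; the function name `commit_checkpoint` and the retry parameters are hypothetical, and this is not the Einstein@Home code.

```python
import os
import time


def commit_checkpoint(data: bytes, final_path: str,
                      retries: int = 5, delay: float = 0.2) -> None:
    """Write data to <final_path>.tmp, then rename it over final_path."""
    tmp_path = final_path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # push the bytes to disk before renaming

    last_error = None
    for _ in range(retries):
        try:
            os.replace(tmp_path, final_path)  # overwrites an existing file
            return
        except OSError as err:
            # Hypothetical handling: wait briefly and try the rename again.
            last_error = err
            time.sleep(delay)
    raise RuntimeError(
        f"could not rename {tmp_path} to {final_path}"
    ) from last_error
```

With this shape, a failed rename leaves the previous checkpoint in place, so the task could in principle carry on from the older state instead of aborting.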

Xin Jing

Joined: 29 Aug 09

Posts: 2

Credit: 0

RAC: 0

Hey guys, I've had nearly

2 Sep 2009 3:58:30 UTC

Message 93498


Hey guys,

I've had nearly all of my tasks register an outcome of "client error", which is defined as a user-side error. The number of credits that I would have been awarded is small, but it's concerning nevertheless. I'm not sure what exactly you need to see to help me, but here is some information on a few:

138154537
137864090
137846426

Like I said, the credits are fairly small so it's not that issue that bothers me but the repeated ambiguous fault known as "client error".

Then there's the next one, which is a bit more alarming because of the type of error and the amount of cpu time involved to complete the task. The outcome was "validate error" which is defined as a server-side error:

137821169

This all seems a bit strange, since I'm concurrently running Seti, Cosmology and Milkyway. With the exception of Cosmology, all are generating credits. Cosmology's tasks and error descriptions are a bit more vague and labeled with a simple "error".

I'm running BOINC Manager version 6.6.36 for Windows.

Here is my PC, as reported by CPU-Z version 1.52.2.

Any help would be appreciated.

-Xin

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

Hmmmm... You have a recent

2 Sep 2009 18:14:26 UTC

Message 93499


Hmmmm...

You have a recent CBNC over on SAH right now with this host as well.

The 'hard' errors all look to be memory access faults (aka General Protection Errors).

Generally speaking, this usually indicates either transient overheating in the processor's FPU and/or flaky memory.

I see it just reported a task over on MW and it validated OK.

So the good news is that most likely whatever is wrong is fixable. :-)

The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

HTH,

Alinator

Xin Jing

Joined: 29 Aug 09

Posts: 2

Credit: 0

RAC: 0

RE: The bad news is it

2 Sep 2009 23:20:28 UTC

Message 93500 in response to message 93499


Quote:

The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

I'm guessing your response was for my post and if so, thank you.

I have had a few BSODs recently and I'm not sure exactly what's triggering them. I installed a Cooler Master CPU cooler, which brought my CPU temperature at 100% usage down from 53C to a much more manageable 33C. When I went to Asus and used their PSU calculator to see how much wattage I need to run my current configuration, I came up short. I'll upgrade the PSU and continue to monitor the GPFs.

However, this doesn't directly address the times when my PC operated without fail for an entire task cycle, or the fact that I'm still generating a good amount of credits at Milkyway@Home. I could print out a certificate now that shows real accomplishments from contributions I've made so far. I can't figure out why Seti and Milkyway are working while Einstein and Cosmology are zombie flatliners.

I appreciate your observations and insight, I'll try the diagnostic apps you mentioned. I guess I've come full circle back to your comment that it's intermittent.

-Xin

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

RE: RE: The bad news is

3 Sep 2009 2:03:51 UTC

Message 93501 in response to message 93500


Quote:

Quote:

The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(

I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.

I'm guessing your response was for my post and if so, thank you.

I have had a few BSODs recently and I'm not sure exactly what's triggering them. I installed a Cooler Master CPU cooler, which brought my CPU temperature at 100% usage down from 53C to a much more manageable 33C. When I went to Asus and used their PSU calculator to see how much wattage I need to run my current configuration, I came up short. I'll upgrade the PSU and continue to monitor the GPFs.

However, this doesn't directly address the times when my PC operated without fail for an entire task cycle, or the fact that I'm still generating a good amount of credits at Milkyway@Home. I could print out a certificate now that shows real accomplishments from contributions I've made so far. I can't figure out why Seti and Milkyway are working while Einstein and Cosmology are zombie flatliners.

I appreciate your observations and insight, I'll try the diagnostic apps you mentioned. I guess I've come full circle back to your comment that it's intermittent.

-Xin

No Problemo.

Anyway, I'm assuming the PSU has been in the Host for a while. If the size calculator is telling you you're near the borderline this makes perfect sense.

Even if everything looks and tests fine during diagnostics, the point others were trying to make is that the dynamic load on the host when crunching in a real-world setting can bring on a fault you wouldn't otherwise see. In other words, when you have multiple projects running on multi-core processors, which projects are running and how many tasks from each can make a big difference to the localized thermal loading on the processor die itself. If you also use the machine for your own work, you have to factor that into the equation too.

If you look back through the forum to when the Q6600 was new, you'll find there were more than a few reports of them being sensitive to localized dynamic thermal loading in their FPUs, especially with the early steppings and when running what is now the stock EAH app (it was a Beta/Power app back then).

If your PSU is borderline for your host now, and is showing excessive sag, or worse breaking regulation (unlikely), this won't help matters.

It might even be that normal semiconductor aging is making the processor (memory) itself more sensitive to this. In that case, since you have a beefy Cooler Master on it now, you might try giving the Processor voltages an up tweak (if you can) and see if that helps.

So the bottom line is that this is probably just the result of a number of little things that by themselves don't cause trouble, but can come together unexpectedly from time to time and show up as intermittent compute and/or validate errors.

Alinator

koniiiik

Joined: 8 Feb 09

Posts: 5

Credit: 288224

RAC: 0

Hmm, I attached a new machine

16 Sep 2009 9:59:06 UTC

Message 93502


Hmm, I attached a new machine a few days ago and it didn't yet finish a single WU successfully, all crashed with Compute errors, e.g. tasks 139775953, 139775951, 139775945, …
All of them crashed on signal 8, which means floating point exception. The computer has a C2D CPU, probably some CPU feature was detected incorrectly.
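
As a quick check of the signal number: on Linux, signal 8 is SIGFPE. Despite the name, SIGFPE is also raised for integer faults such as an integer division by zero, so it does not necessarily point at floating-point code. A small Python sketch (the helper name `signal_name` is made up for illustration) that turns a signal number into its symbolic name:

```python
import signal


def signal_name(number: int) -> str:
    """Return the symbolic name for a signal number, e.g. 8 -> SIGFPE."""
    return signal.Signals(number).name


print(signal_name(8))
```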

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 767396393

RAC: 1082219

RE: Hmm, I attached a new

16 Sep 2009 10:56:40 UTC

Message 93503 in response to message 93502


Quote:

Hmm, I attached a new machine a few days ago and it didn't yet finish a single WU successfully, all crashed with Compute errors, e.g. tasks 139775953, 139775951, 139775945, …
All of them crashed on signal 8, which means floating point exception. The computer has a C2D CPU, probably some CPU feature was detected incorrectly.

Signal 8 on Linux can be tracked down to one of two causes most of the time:

1) a specific bug in the Linux kernel. However, that bug was fixed some kernel versions ago, and your Linux kernel seems to be a very recent one. Still, you might try installing BOINC and Einstein@Home on a Linux live system booted from a CD/DVD to see whether this problem depends on your distro and kernel patches (you seem to use TuxOnIce, right?).

2) excessive overclocking and/or overheating.

Good luck
Bikeman

koniiiik

Joined: 8 Feb 09

Posts: 5

Credit: 288224

RAC: 0

RE: Signal 8 on Linux can

16 Sep 2009 11:07:08 UTC

Message 93504 in response to message 93503


Quote:

Signal 8 on Linux can be tracked down to one of two causes most of the time:

1) a specific bug in the Linux kernel. However, that bug was fixed some kernel versions ago, and your Linux kernel seems to be a very recent one. Still, you might try installing BOINC and Einstein@Home on a Linux live system booted from a CD/DVD to see whether this problem depends on your distro and kernel patches (you seem to use TuxOnIce, right?).

Hmm, I can try that sometime. Yes, the kernel is a Gentoo tuxonice one.

Quote:

2) excessive overclocking and/or overheating.

No overclocking here, as well as no overheating.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5876

Credit: 118524997872

RAC: 26253878

RE: No overclocking here,

17 Sep 2009 3:22:20 UTC

Message 93505 in response to message 93504


Quote:

No overclocking here, as well as no overheating.

Are you really sure about that?? I presume you are using the stock Intel HSF? I've built quite a number of new dual and quad cores recently and I'd like to think I've really worked out how to lock down the stock Intel HSF properly and evenly. Early on (even though I was ultra careful) I found it quite easy to end up with the HSF not quite squarely clamped to the CPU heat spreader (one corner not fully locked despite seeming to be). The symptoms were that the machine would boot and run fine on normal loads but would not be able to complete full load testing. The problem was identified when I pulled vigorously on the HSF and one corner rather easily popped out of lock. The problem ended when I reapplied the HSF with all 4 posts properly locked down.

Cheers,
Gary.

koniiiik

Joined: 8 Feb 09

Posts: 5

Credit: 288224

RAC: 0

Heh, it's a laptop. (-;

17 Sep 2009 20:52:26 UTC

Message 93506 in response to message 93505


Heh, it's a laptop. (-;
