ABP1 Workunit 57514358 ran to 100% (23893.08 sec) and then failed to validate with error code 2. The host is a dual-core, 64-bit Windows Vista machine with plenty of memory (4 GB). No client or host system crashes were noted, and the system was not rebooted during 57514358's run. The BOINC client is 6.6.28. I've run many S5 and a few ABPS tasks with no problem.
There's 6.6 cpu-hours down the toilet. Oh well.
There's a significant error message in there that somebody might like to look at:
Quote:
[05:02:31][7196][INFO ] Checkpoint committed!
[05:03:23][7196][ERROR] Couldn't rename temporary checkpoint file (status.cpt.tmp) to final checkpoint file: status.cpt (Result too large).
[05:03:23][7196][ERROR] Demodulation failed (error: 2)!
called boinc_finish
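For anyone wondering what that step in the log is doing: writing the checkpoint to a temporary file and then renaming it over the final file is a common way to avoid leaving a half-written checkpoint behind. Below is a minimal Python sketch of that pattern with a few retries around the rename. It is not the Einstein@Home application's actual code. The function name `commit_checkpoint`, the retry count and the delay are assumptions for illustration, and only the file names `status.cpt` and `status.cpt.tmp` come from the log. The retries are there because on Windows a rename can fail transiently when another process (a virus scanner or indexer, for example) has the file open. That is one plausible cause here, not something the log confirms.

```python
import os
import time


def commit_checkpoint(data: bytes, final_path: str,
                      retries: int = 5, delay: float = 0.2) -> None:
    """Write data to '<final_path>.tmp', then rename it over final_path.

    Hypothetical helper: retries and delay are illustrative values only.
    """
    tmp_path = final_path + ".tmp"  # e.g. status.cpt.tmp for status.cpt

    # Write the complete checkpoint to the temporary file first.
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk

    # Rename over the final file; retry a few times on failure
    # instead of aborting the whole task on the first error.
    for attempt in range(retries):
        try:
            os.replace(tmp_path, final_path)  # overwrites an existing file
            return
        except OSError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

Calling `commit_checkpoint(b"...", "status.cpt")` twice in a row should leave only `status.cpt` behind, holding the second payload.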
Hey guys,
I've had nearly all of my tasks register an outcome of "client error", which is defined as a user-side error. The number of credits that I would have been awarded would have been small, but it's concerning nevertheless. I'm not sure what exactly you need to see to help me, but here is some information on a few:
138154537
137864090
137846426
Like I said, the credits are fairly small so it's not that issue that bothers me but the repeated ambiguous fault known as "client error".
Then there's the next one, which is a bit more alarming because of the type of error and the amount of cpu time involved to complete the task. The outcome was "validate error" which is defined as a server-side error:
137821169
This all seems a bit strange, since I'm concurrently running Seti, Cosmology and Milkyway. With the exception of Cosmology, all are generating credits. Cosmology's tasks and error descriptions are a bit more vague and labeled with a simple "error".
I'm running BOINC Manager version 6.6.36 for Windows.
Here is my PC, as reported by CPU-Z version 1.52.2.
Any help would be appreciated.
-Xin
Hmmmm...
You have a recent CBNC over on SAH right now with this host as well.
The 'hard' errors all look to be memory access faults (aka General Protection Errors).
Generally speaking, this usually indicates transient overheating in the processor's FPU and/or flaky memory.
I see it just reported a task over on MW and it validated OK.
So the good news is that most likely whatever is wrong is fixable. :-)
The bad news is it looks like it's intermittent, which increases the aggravation and difficulty factor in finding and fixing it at least an order of magnitude. :-(
I'd suggest starting with the usual steps. Cleaning, and then diagnostics with Memtest86, Prime95, etc. and see what comes out in the wash.
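To illustrate the idea behind that kind of stress testing: repeat a computation whose answer should never change, and flag any run that disagrees. Here is a minimal Python sketch under that assumption. It is not a substitute for Memtest86 or Prime95, which load the hardware far harder. The function names, the workload and the default sizes are all made up for illustration.

```python
import hashlib
import math
import struct


def workload_digest(size: int) -> str:
    """Run a fixed floating-point workload and hash every intermediate result."""
    h = hashlib.sha256()
    acc = 0.0
    for i in range(1, size + 1):
        acc += math.sin(i) * math.sqrt(i) / (1.0 + math.cos(i) ** 2)
        h.update(struct.pack("<d", acc))  # feed the raw 8-byte double into the hash
    return h.hexdigest()


def find_mismatches(rounds: int = 10, size: int = 100_000) -> list:
    """Repeat the workload; return the round numbers whose digest differs.

    On healthy hardware the list should be empty, because the same
    inputs must always produce bit-identical results.
    """
    reference = workload_digest(size)
    return [r for r in range(1, rounds + 1)
            if workload_digest(size) != reference]
```

A non-empty result from `find_mismatches()` would point at the hardware rather than at any one project's application, though an empty result on such a light load proves very little.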
HTH,
Alinator
I'm guessing your response was for my post and if so, thank you.
I have had a few BSODs recently and I'm not sure exactly what's triggering them. I installed a Cooler Master CPU cooler, which brought my CPU temperature at 100% CPU usage down from 53 °C to a much more manageable 33 °C. When I used the PSU calculator on the Asus site to see how much wattage my current configuration needs, I came up short. I'll upgrade the PSU and continue to monitor the GPFs.
However, this doesn't directly address the times when my PC operated without fail for an entire task cycle, or the fact that I'm still generating a good amount of credits at Milkyway@Home. I could print out a certificate now that shows real accomplishments from contributions I've made so far. I can't figure out why Seti and Milkyway are working while Einstein and Cosmology are zombie flatliners.
I appreciate your observations and insight, and I'll try the diagnostic apps you mentioned. I guess I've come full circle back to your comment that it's intermittent.
-Xin
No Problemo.
Anyway, I'm assuming the PSU has been in the Host for a while. If the size calculator is telling you you're near the borderline this makes perfect sense.
Even if everything looks and tests fine during diagnostics, the point others were trying to make is that the dynamic load on the host when crunching in a real-world setting can bring on a fault you wouldn't otherwise see. IOW, when you have multiple projects running on multi-core processors, which projects are running and how many tasks from each can make a big difference to the localized thermal loading on the processor die itself. If you also use the machine for your own work, you have to factor that into the equation as well.
If you look back through the forum to when the Q6600 was new, you'll find there were more than a few reports of them being sensitive to localized dynamic thermal loading in their FPUs, especially with the early steppings and when running what is now the stock EAH app (it was a Beta/Power app back then).
If your PSU is borderline for your host now, and is showing excessive sag, or worse breaking regulation (unlikely), this won't help matters.
It might even be that normal semiconductor aging is making the processor (memory) itself more sensitive to this. In that case, since you have a beefy Cooler Master on it now, you might try giving the Processor voltages an up tweak (if you can) and see if that helps.
So the bottom line is that this is probably just the result of a number of little things that by themselves don't cause trouble, but can come together unexpectedly from time to time and show up as intermittent compute and/or validate errors.
Alinator
Hmm, I attached a new machine a few days ago and it hasn't yet finished a single WU successfully. All of them crashed with compute errors, e.g. tasks 139775953, 139775951, 139775945, …
All of them crashed on signal 8, which means floating point exception. The computer has a C2D CPU, so probably some CPU feature was detected incorrectly.
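For reference, a signal number can be mapped to its name with Python's standard `signal` module. A minimal sketch follows, in which the helper name `describe_signal` is made up for illustration. Note that despite its name, SIGFPE is also raised for integer arithmetic errors such as integer division by zero, so the signal alone does not prove a floating-point problem.

```python
import signal


def describe_signal(number: int) -> str:
    """Return the symbolic name for a signal number, e.g. 8 -> 'SIGFPE'."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known on this platform.
        return f"unknown signal {number}"


print(describe_signal(8))  # on Linux this prints SIGFPE
```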
Signal 8 on Linux can be tracked down to one of two causes most of the time:
1) a specific bug in the Linux kernel. However, that bug was already fixed some kernel versions ago, and your kernel seems to be a very recent one. Still, you might try installing BOINC and Einstein@Home on a Linux live system booted from a CD/DVD to see whether the problem depends on your distro and kernel patches (you seem to use TuxOnIce, right?)
or
2) excessive overclocking and/or overheating.
Good luck
Bikeman
Hmm, I can try that sometime. Yes, the kernel is a Gentoo tuxonice one.
No overclocking here, as well as no overheating.
Are you really sure about that? I presume you are using the stock Intel HSF? I've built quite a number of new dual- and quad-core machines recently, and I'd like to think I've worked out how to lock down the stock Intel HSF properly and evenly. Early on (even though I was ultra careful) I found it quite easy to end up with the HSF not quite squarely clamped to the CPU heat spreader (one corner not fully locked despite seeming to be). The symptom was that the machine would boot and run fine on normal loads but could not complete full-load testing. I identified the problem when I pulled vigorously on the HSF and one corner rather easily popped out of lock. The problem ended when I reapplied the HSF with all 4 posts properly locked down.
Cheers,
Gary.
Heh, it's a laptop. (-;