One should also consider that in previous versions certain (if not all?) FPEs appear to have been ignored -- thus we might see design/programming flaws today (unless those traps fire only on faulty hardware and never on faulty software design, which would be okay, and human, and bound to happen). But then, if it were a design flaw, it shouldn't appear only here.
From the release info:
throws floating-point exception on NaNs and FPU stack errors
Of course I could verify that by running an older version without those new traps, but that might mean incorrect/drifting data - so my decision would then rather be "this old, P3-driven host cannot participate in Einstein". (Which would be okay. It would be interesting whether other users with the same CPU get the same FPEs, but I suspect those old P3 Coppermine users haven't discovered the latest version yet and are still running old ones.)
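To illustrate what "untrapped" means in practice - a minimal sketch in plain C (not E@H code, just the general mechanism) of how an invalid operation silently yields a NaN that poisons everything computed after it:

/* nan_drift.c -- build with: gcc -std=c99 nan_drift.c -lm
   With FPE traps disabled (the default), an invalid operation
   quietly returns NaN instead of stopping the program. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 4; i++) {
        double x = (i == 2) ? sqrt(-1.0) : 1.0; /* sqrt(-1.0) -> NaN, no signal */
        sum += x;                               /* NaN propagates into the sum */
    }
    printf("sum = %f (isnan = %d)\n", sum, isnan(sum)); /* prints nan, 1 */
    return 0;
}

Every result derived from that sum is garbage, yet the run "succeeds" - which is exactly the incorrect/drifting-data risk of the versions without traps.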
Update: the 4.35 SSE version also seems to produce errors:
2008-02-23 10:15:34 [Einstein@Home] Resuming task h1_0851.85_S5R3__372_S5R3b_1 using einstein_S5R3 version 435
2008-02-23 10:15:46 [Einstein@Home] Deferring communication for 1 min 0 sec
2008-02-23 10:15:46 [Einstein@Home] Reason: Unrecoverable error for result h1_0851.85_S5R3__372_S5R3b_1 (process exited with code 99 (0x63))
RE: One should also
I'm not sure since when FPEs have been trapped in the E@H code, but it's been quite a while. In any case, an untrapped floating-point exception would almost certainly lead to a secondary error like a segmentation fault, or at least to a validation error after the result is submitted. So the trapping should not increase the overall error count; it just makes it easier to differentiate between potential software problems and likely hardware problems.
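For the curious: on Linux/glibc, enabling such traps is essentially a one-liner. This is only a sketch of the general mechanism (feenableexcept is my assumption here; the actual E@H code may do it differently):

/* fpe_trap.c -- build with: gcc fpe_trap.c -lm
   Once FE_INVALID is trapped, the kind of operation that would
   silently produce a NaN now raises SIGFPE, whose default action
   terminates the process immediately. */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;       /* volatile defeats constant folding */

    feenableexcept(FE_INVALID);       /* trap invalid operations (NaN producers) */
    double r = zero / zero;           /* 0/0 is invalid -> SIGFPE right here */
    printf("never reached: %f\n", r); /* the process died one line above */
    return 0;
}

With the trap enabled, the process dies at the faulty instruction instead of submitting a silently corrupted result, which is what makes the hardware-vs-software distinction visible in the logs.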
CU
Bikeman
RE: I'm not sure since
I found that. Maybe it helps.
RE: RE: One should also
Shouldn't it be possible to do some server-side statistics on platform/CPU, Einstein client version, and error type/error count, in order to detect code that is probably broken only in certain environments, rather than relying on user reports?
RE: Shouldn't it be
This is indeed already done. But the individual user error reports are still very useful, since you can actually ask questions to get information that is otherwise not directly available from the logs (e.g. are the hosts overclocked? What other projects are running, and did they show problems as well?).
The "signal 11" problem on Linux that we discussed here recently is a good example. It turned out to be a bug in the BOINC library, but to analyse the problem it was crucial to learn from users that it was related to broken network connections. Only then was it possible to pinpoint the root cause.
CU
Bikeman
RE: RE: Shouldn't it be
Thanks for your answers. :-)
I see you also have a P3 Coppermine without troubles - I suppose you run >= 4.31 there.
(Is there a version optimized not for speed but for stability/fault tolerance that I could try?)
RE: Thanks for your
Yes, I've got a dual PIII-866 Coppermine. It's an HP Kayak XM-600 Series 2 that I bought used for 25 euros :-). It's probably more than 7 years old but (knocking on wood) still runs rock solid. I had to replace the power supply recently, though, because the fan was failing. Even so, I would not be surprised to see it die any day now.
The P-III already supports SSE, so I'm using the 4.35 "Power Users" App on this box, which is considerably faster than the 4.31 App.
Stability and validity of results are the top concerns for E@H, and the FPE trapping feature is in fact part of the error-detection code. Once the app notices that something might be wrong, it terminates the computation, which is safer than trying to be "fault-tolerant" and carrying on regardless. There is no special "stability" edition of the app.
This particular fault (FPE) is really hard to produce with software bugs (other than compiler bugs); the most likely explanation is failing hardware. After all, how old must a PIII Coppermine be by now? Six years? Seven?
Most of the time it's not the CPU itself but things like failing fans, swollen capacitors on the motherboard, or glitches in the power supply... Gary will be able to expand on this better than me. The E@H app has now reached a significant level of optimization and squeezes quite a bit of performance out of the FPU, so it's not surprising that E@H is the first app to show symptoms of hardware failure.
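As a sketch of that "notice, then terminate" behaviour (hypothetical code, not the actual E@H error-detection source - the handler and the exit status are invented):

/* fpe_abort.c -- build with: gcc fpe_abort.c -lm */
#define _GNU_SOURCE
#include <fenv.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void fpe_handler(int sig)
{
    (void)sig;
    /* only async-signal-safe calls here, then a hard exit */
    static const char msg[] = "FPE detected: terminating computation\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1); /* the real app's exit status will differ */
}

int main(void)
{
    volatile double zero = 0.0;

    signal(SIGFPE, fpe_handler);
    feenableexcept(FE_INVALID | FE_DIVBYZERO);
    double r = zero / zero; /* -> SIGFPE -> fpe_handler -> _exit */
    printf("%f\n", r);      /* never reached */
    return 0;
}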
CU
Bikeman
RE: RE: Hi! This
The FPEs (at least the ones I've seen here) would almost always lead to a NaN in a certain variable, and from there to an error with exit status 99 a few instructions later (where there is a sanity check on array bounds). So these errors are taken out of the "99" bunch of computing errors in order to get closer to the point where the error actually occurs, that's all.
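A hypothetical sketch of that failure chain (the variable names and the bounds check are invented for illustration; only the exit status 99 comes from the client logs):

/* exit99.c -- how a NaN turns into "exit code 99" a few instructions later */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NBINS 1024

int main(void)
{
    volatile double zero = 0.0;
    double freq = zero / zero;     /* stand-in for a NaN from flaky hardware */
    int bin = (int)(freq * NBINS); /* converting NaN to int yields garbage */

    if (isnan(freq) || bin < 0 || bin >= NBINS) {
        fprintf(stderr, "sanity check failed: bin=%d\n", bin);
        exit(99);                  /* the status reported in the client log */
    }
    printf("bin = %d\n", bin);
    return 0;
}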
There is at least one other possible reason for FPEs: a flaw in the operating system (or even in the compiler it was built with). The OS should protect one process context against whatever is happening in other contexts, but apparently this doesn't always work correctly. At least on Windows I have read reports where a hardware driver (usually a printer driver) could mess up the FPU stack and flags so badly that they weren't properly restored when switching back to a user process, generating an FPE there. I'm not sure that all possible Linux kernels (including self-built ones) have sufficient protection against bad drivers and other code running in kernel mode. I'm not even sure they all properly save all registers - I've seen the CPU type / register detection of the Linux kernel fail to identify the right CPU (and thus the available register set) in at least two cases.
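One way to cross-check the kernel's CPU detection (my suggestion, nothing from this thread): ask the CPU directly and compare with the flags line in /proc/cpuinfo. With gcc >= 4.8 on x86 this is only a few lines:

/* cpu_flags.c -- compare CPUID-reported features against /proc/cpuinfo */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init(); /* populate gcc's CPU feature cache */
    printf("CPUID reports SSE:  %s\n", __builtin_cpu_supports("sse")  ? "yes" : "no");
    printf("CPUID reports SSE2: %s\n", __builtin_cpu_supports("sse2") ? "yes" : "no");
    /* now compare against:  grep '^flags' /proc/cpuinfo */
    return 0;
}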
You seem to have quite a number of machines running; can you point me to the machine, or even better the result, where the error happened with 4.35?
BM
RE: There is at least one
The system uses (well, used) Gentoo's glibc (currently 2.6.1) and gcc (just switched to 4.1.2), plus a vanilla kernel (2.6.22.1, which will be updated/rebuilt).
Host 1117633
The last two (92666821 and 92655563) were done with 4.35.
Probably worthless info: this host performs well on LHC, Rosetta and, apparently, SETI (no results back from SETI yet, but no bad news either).
I've looked into some results
I've looked into some results with this error, and without doing a real statistical analysis, it "feels" as if Gentoo Linux runs a higher risk of encountering this problem. Here are just a few hosts:
http://einsteinathome.org/host/806536
http://einsteinathome.org/host/645060
http://einsteinathome.org/host/728561
http://einsteinathome.org/host/982121
http://einsteinathome.org/host/382784
http://einsteinathome.org/host/692446
http://einsteinathome.org/host/1078848
http://einsteinathome.org/host/1116760
http://einsteinathome.org/host/904093
http://einsteinathome.org/host/536424
Some Coppermines, but also some AMD CPUs among those hosts.
Gentoo isn't that widespread anymore, is it? Ubuntu, Redhat and Suse must be far more popular.
CU
Bikeman
I upgraded to "power users"
)
I upgraded to "power users" app 4.35 without problems.
Really very fast!