@Mike: The E@H app might be a bit special in that it actually catches certain FPU problems which are not handled by default. Before this change was made, earlier E@H app versions just continued their computations, but with some incorrect values. Unless those values were part of a computation leading to a "candidate" that is sent back to the server, no validation error would appear. In this sense, the E@H app is especially sensitive to certain FPU problems.
Currently I can't see how a programming bug or compiler bug on the app side could cause this behavior (depending on this special kernel configuration parameter), as this parameter should really have no effect at all on any ordinary (non-kernel) processes.
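To illustrate the difference between "just continuing with bad values" and "catching the problem", here is a rough analogy in plain Python. It is only a sketch: the real app traps hardware FPU exceptions (which show up as signal 8), whereas this example merely checks values for NaN or infinity. The function names `sum_unchecked` and `sum_checked` are made up for the illustration.

```python
import math

def sum_unchecked(values):
    """Add up values without any checks; a NaN silently contaminates the total."""
    total = 0.0
    for v in values:
        total += v
    return total

def sum_checked(values):
    """Add up values, but stop as soon as a non-finite value shows up."""
    total = 0.0
    for i, v in enumerate(values):
        if not math.isfinite(v):
            # Stands in for the app aborting on an FPU exception.
            raise FloatingPointError("non-finite value at index %d: %r" % (i, v))
        total += v
    return total
```

With a corrupted input such as `[1.0, float("nan"), 2.0]`, the unchecked version quietly returns NaN, while the checked version raises an error at the first bad value. That is roughly why the newer app versions appear more "sensitive" than the older ones.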
RE: @Mike: The E@H app
This is due to FPU exceptions being caught by the current E@H S5R3 app. The exception itself is in turn, I suppose, caused by the "lazy" FPU state restoration being "out of sync" when "CONFIG_PREEMPT" is used.
So far I've tested it on two of my machines (2.6.20.7 and 2.6.24.2) and I can confirm that changing "CONFIG_PREEMPT" to the less aggressive "CONFIG_PREEMPT_VOLUNTARY" seemingly solves the issue.
I tried to find a patch in the kernel git repos that might have solved the issue by using preempt_disable()/preempt_enable() in the critical FPU code sections, but without success so far, so it might still be an open issue. I'm still waiting for a reply from a kernel hacker who seems to have implemented a fix for the same issue on the sh architecture (thanks to Bikeman for the hint). In the meantime you might want to have a look at this to get an idea what this is all about.
Has anyone tested more recent kernels like the 2.6.25 series or the 2.6.26-RCs?
Cheers,
Oliver
Einstein@Home Project
I'm not sure everyone could follow the discussion of this issue so far, so I'll try to wrap it up in a few sentences:
Mostly thanks to input from juergen.mell and Oliver's experiments, the current theory is that there is a rather serious bug in some versions of the kernel of the Linux operating system itself that can cause rather random errors in floating point calculations (not just in the E@H app!).
Unlike off-the-shelf operating systems like Windows or Mac OS X that are distributed in binary packages (that is, already compiled), Linux is open source and allows customization to your particular needs, so you can build many different variants of Linux by configuring many different switches and options before compiling your very own Linux kernel.
One of these options (CONFIG_PREEMPT) seems to be involved with this special suspected bug, that is, only kernel versions that are compiled with special values for this option are affected.
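If you are not sure which setting your kernel was built with, the following minimal sketch may help. It assumes the kernel configuration is available as plain text in the usual `CONFIG_...=y` format; the helper names `preemption_model` and `read_kernel_config` are made up for this example, and the two file locations it tries are common but not guaranteed on every distribution.

```python
import gzip
import os
import platform
import re

# Matches only the three preemption-model choices, not e.g. CONFIG_PREEMPT_BKL.
_PREEMPT_RE = re.compile(r"^(CONFIG_PREEMPT(?:_NONE|_VOLUNTARY)?)=y$")

def preemption_model(config_text):
    """Return the preemption option enabled in kernel config text, or None."""
    for line in config_text.splitlines():
        match = _PREEMPT_RE.match(line.strip())
        if match:
            return match.group(1)
    return None

def read_kernel_config():
    """Return the running kernel's config text, or None if it cannot be found."""
    proc_path = "/proc/config.gz"  # only present if the kernel exports its config
    if os.path.exists(proc_path):
        with gzip.open(proc_path, "rt") as f:
            return f.read()
    boot_path = "/boot/config-" + platform.release()  # used by many distributions
    if os.path.exists(boot_path):
        with open(boot_path) as f:
            return f.read()
    return None

# Example usage:
#   text = read_kernel_config()
#   print(preemption_model(text) if text else "kernel config not found")
```

A result of `CONFIG_PREEMPT` corresponds to the setting suspected here; `CONFIG_PREEMPT_VOLUNTARY` is the less aggressive setting that Oliver switched to in his tests.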
If your Linux app is running fine so far ==> no need to worry, no need to change anything. All will be fine :-).
*Only* if you are frequently getting "signal 8" errors AND your kernel uses the 'Preemptible Kernel (Low-Latency Desktop)' kernel option setting ==> consider switching to another kernel without this option. Suspending the E@H project until a kernel fix is available would be wise if you can't switch the kernel. I cannot recommend switching to another BOINC project because this suspected bug would likely affect other projects' apps as well (detected or undetected).
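For those wondering what "signal 8" stands for: on Linux it is SIGFPE, the signal a process receives on a floating point (arithmetic) exception. A small sketch using Python's standard `signal` module to translate the number into its name (the helper name `describe_signal` is made up for this example):

```python
import signal

def describe_signal(number):
    """Return the symbolic name for a signal number, or a fallback string."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # Raised when the number is not a known signal on this platform.
        return "unknown signal %d" % number

print(describe_signal(8))
```

On a Linux machine this prints `SIGFPE`, which fits the theory that the crashes are floating point exceptions caught by the app.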
If you are a Linux kernel geek, you are most welcome to experiment with the 2.6.25 kernel series or the 2.6.26-RCs and post your results here, as suggested by Oliver (see previous message).
CU
Bikeman
RE: Has anyone tested more
I have been running a vanilla 2.6.26-rc1 kernel with CONFIG_PREEMPT set for 12 hours now, and so far there has been no crash of Einstein (normally the first crash occurs after 15 to 30 minutes). I will continue testing, but this looks good.
Bye,
Jürgen
RE: I am running a vanilla
Unfortunately it does not look good anymore. I just had my first crash with the 2.6.26-rc1. The only change is that it takes much longer until Einstein crashes (in this case more than 10 CPU hours). So the bug is still present in the most recent kernel.
Bye,
Jürgen
You may want to try the new 4.49 Beta App.
BM
RE: RE: I am running a
Hi Jürgen.
The 2.6.26 Release Candidate 2 now contains another bugfix which looks like it might be related to this issue:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd3c3ed5d1e3ceb37635cbe6d220ab94aae0781d
Might be worth a try.
CU
Bikeman
RE: Working fine on Fedora 8 as well, Core 2 Quad Q6600
32 or 64 bits?
Alex.
"I am tired of all this sort of thing called science here... We have spent
millions in that sort of thing for the last few years, and it is time it
should be stopped."
-- Simon Cameron, U.S. Senator, on the Smithsonian Institute, 1901.