GNU/Linux S5R3 App 4.38 available for Beta test

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 978
Credit: 25,170,813
RAC: 23

Message 79910 in response to message 79909

Quote:

@Mike: The E@H app might be a bit special in that it actually catches certain FPU problems which are not handled by default. Before this change was made, earlier E@H app versions just continued their computations, but with some incorrect values. Unless those values were part of a computation leading to a "candidate" that is sent back to the server, no validation error would appear. In this sense, the E@H app is especially sensitive to certain FPU problems.

Currently I can't see how a programming bug or compiler bug on the app side could cause this behavior (depending on this special kernel configuration parameter), as this parameter should really have no effect at all on any ordinary (non-kernel) processes.

This is due to FPU exceptions being caught by the current E@H S5R3 app. The exception itself is in turn, I suppose, caused by the "lazy" FPU state restoration being "out of sync" when "CONFIG_PREEMPT" is used.
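
To illustrate the difference described in the quoted paragraph, here is a small sketch. It is Python, not the app's actual code, and it checks values explicitly instead of relying on hardware FPU traps, so treat it as an analogy only. The function names and the threshold are made up for this example.

```python
import math

def candidates_unchecked(stats, threshold):
    """Old behaviour (analogy): corrupted values are never noticed.

    NaN > threshold evaluates to False, so a corrupted statistic is
    silently dropped and never shows up as a candidate.
    """
    return [s for s in stats if s > threshold]

def candidates_checked(stats, threshold):
    """New behaviour (analogy): stop as soon as an invalid value appears."""
    for s in stats:
        if not math.isfinite(s):
            raise FloatingPointError("invalid floating point value: %r" % s)
    return [s for s in stats if s > threshold]

if __name__ == "__main__":
    corrupted = [1.0, float("nan"), 5.0]
    # The unchecked variant reports only the 5.0; the NaN goes unnoticed.
    print(candidates_unchecked(corrupted, 2.0))
    try:
        candidates_checked(corrupted, 2.0)
    except FloatingPointError as err:
        print("aborted:", err)
```

The unchecked variant finishes without any sign that something went wrong, which is why such errors would rarely surface as validation errors. The checked variant fails loudly, which is roughly what users now see as a crash.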

So far I've tested it on two of my machines (2.6.20.7 and 2.6.24.2) and I can confirm that changing "CONFIG_PREEMPT" to the less aggressive "CONFIG_PREEMPT_VOLUNTARY" seemingly solves the issue.

I tried to find a patch in the kernel git repos that might have solved the issue using preempt_disable()/preempt_enable() in the critical FPU code sections, but without success so far, hence it might still be an open issue. I'm still waiting for a reply from a kernel hacker who seems to have implemented a fix for the same issue on the sh architecture (thanks to Bikeman for the hint). In the meantime you might want to have a look at this to get an idea what this is all about.

Has anyone tested more recent kernels like the 2.6.25 series or the 2.6.26-RCs?

Cheers,
Oliver

Einstein@Home Project

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 692,293,122
RAC: 2,489

I'm not sure everyone could follow the discussion of this issue so far, so I'll try to wrap it up in a few sentences:

Mostly thanks to input from juergen.mell and Oliver's experiments, the current theory is that there is a rather serious bug in some versions of the kernel of the Linux operating system itself that can cause rather random errors in floating point calculations (not just in the E@H app!).
Unlike off-the-shelf operating systems like Windows or Mac OS X that are distributed in binary packages (that is, already compiled), Linux is open source and allows customization to your particular needs, so you can build many different variants of Linux by configuring many different switches and options before compiling your very own Linux kernel.

One of these options (CONFIG_PREEMPT) seems to be involved in this suspected bug; that is, only kernels compiled with this option enabled appear to be affected.

If your Linux app is running fine so far ==> no need to worry, no need to change anything. All will be fine :-).

*Only* if you are frequently getting "signal 8" errors AND your kernel uses the 'Preemptible Kernel (Low-Latency Desktop)' kernel option setting ==> consider switching to another kernel without this option. Suspending the E@H project until a kernel fix is available would be wise if you can't switch the kernel. I cannot recommend switching to another BOINC project because this suspected bug would likely affect other projects' apps as well (detected or undetected).
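
If you are not sure which preemption setting your kernel was built with, the following Python sketch may help. It assumes the kernel configuration is exposed as /proc/config.gz or as /boot/config-<release>, which not every distribution provides; the function names are made up for this example.

```python
import gzip
import os
import platform

# Preemption options of 2.6-era kernels and their menuconfig labels.
PREEMPT_OPTIONS = {
    "CONFIG_PREEMPT_NONE": "No Forced Preemption (Server)",
    "CONFIG_PREEMPT_VOLUNTARY": "Voluntary Kernel Preemption (Desktop)",
    "CONFIG_PREEMPT": "Preemptible Kernel (Low-Latency Desktop)",
}

def preempt_model(config_text):
    """Return the preemption option set to 'y' in config_text, or None."""
    for line in config_text.splitlines():
        name, sep, value = line.strip().partition("=")
        # Lines like '# CONFIG_PREEMPT is not set' contain no '=' and are skipped.
        if sep and value == "y" and name in PREEMPT_OPTIONS:
            return name
    return None

def read_kernel_config():
    """Return the running kernel's config as text, or None if unavailable."""
    try:
        if os.path.exists("/proc/config.gz"):
            with gzip.open("/proc/config.gz", "rt") as f:
                return f.read()
        path = "/boot/config-" + platform.release()
        if os.path.exists(path):
            with open(path) as f:
                return f.read()
    except OSError:
        pass
    return None

if __name__ == "__main__":
    text = read_kernel_config()
    model = preempt_model(text) if text is not None else None
    if model is None:
        print("Could not determine the preemption model.")
    else:
        print(model + ": " + PREEMPT_OPTIONS[model])
```

If this reports CONFIG_PREEMPT and you are seeing frequent "signal 8" errors, your kernel matches the configuration suspected above.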

If you are a Linux kernel geek, you are most welcome to experiment with kernel versions 2.6.25 series or the 2.6.26-RCs and post your results here, as suggested by Oliver (see previous message).

CU

Bikeman

juergen.mell
Joined: 9 Feb 05
Posts: 9
Credit: 11,685,774
RAC: 11,411

Message 79912 in response to message 79910

Quote:

Has anyone tested more recent kernels like the 2.6.25 series or the 2.6.26-RCs?

I have been running a vanilla 2.6.26-rc1 kernel with CONFIG_PREEMPT set for 12 hours now, and so far Einstein has not crashed (normally the first crash occurred after 15 to 30 minutes). I will continue testing, but this looks good.

Bye,
Jürgen

juergen.mell
Joined: 9 Feb 05
Posts: 9
Credit: 11,685,774
RAC: 11,411

Message 79913 in response to message 79912

Quote:

I have been running a vanilla 2.6.26-rc1 kernel with CONFIG_PREEMPT set for 12 hours now, and so far Einstein has not crashed (normally the first crash occurred after 15 to 30 minutes). I will continue testing, but this looks good.


Unfortunately it does not look good anymore. I just had my first crash with 2.6.26-rc1. The only change is that it takes much longer until Einstein crashes (in this case more than 10 CPU hours). So the bug is still present in the most recent kernel.

Bye,
Jürgen

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,301
Credit: 248,186,413
RAC: 32,492

You may want to try the new 4.49 Beta App.

BM

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 692,293,122
RAC: 2,489

Message 79915 in response to message 79913

Quote:
Quote:

I have been running a vanilla 2.6.26-rc1 kernel with CONFIG_PREEMPT set for 12 hours now, and so far Einstein has not crashed (normally the first crash occurred after 15 to 30 minutes). I will continue testing, but this looks good.

Unfortunately it does not look good anymore. I just had my first crash with 2.6.26-rc1. The only change is that it takes much longer until Einstein crashes (in this case more than 10 CPU hours). So the bug is still present in the most recent kernel.

Bye,
Jürgen

Hi Jürgen.

The 2.6.26 Release Candidate 2 now contains another bugfix that looks like it might be related to this issue:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fd3c3ed5d1e3ceb37635cbe6d220ab94aae0781d

Might be worth a try.

CU
Bikeman

Alexander W. Janssen
Joined: 20 Feb 05
Posts: 56
Credit: 4,543,686
RAC: 0

Message 79916 in response to message 79884

Quote:
Working fine on Fedora 8 as well, Core 2 Quad Q6600


32 or 64 bits?

Alex.

"I am tired of all this sort of thing called science here... We have spent
millions in that sort of thing for the last few years, and it is time it
should be stopped."
-- Simon Cameron, U.S. Senator, on the Smithsonian Institute, 1901.
