GNU/Linux S5R3 App 4.38 available for Beta test

Ed1934158

Joined: 10 Nov 04

Posts: 62

Credit: 14481483

RAC: 0

RE: You will lie awake at

25 Apr 2008 22:34:20 UTC

Message 79900 in response to message 79898

Quote:

You will lie awake at night thinking about how much more you could do with three ...

Happens to me all the time ...

Heck I just got a dual 3.2 GHz Mac Pro and I am already thinking about how to get more done with upgrading other computers ... Haven't even paid the bill for the Pro yet (due the 8th, check in the mail)...

And no, you don't want to know how much I spent ...

That is why I've convinced my girlfriend that she needs a Q6600. :)
Sorry for the off-topic post.

juergen.mell

Joined: 9 Feb 05

Posts: 9

Credit: 12897831

RAC: 10457

I am having big problems with

26 Apr 2008 15:39:42 UTC

Message 79901

I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3, currently with kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes that make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?

tullio

Joined: 22 Jan 05

Posts: 2118

Credit: 61407735

RAC: 0

RE: I am having big

27 Apr 2008 7:00:14 UTC

Message 79902 in response to message 79901

Quote:

I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3, currently with kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes that make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?

I am using SuSE 10.3 with this kernel: Linux 2.6.22.5-31-default i686
and everything works in Einstein, SETI, QMC/ORCA, climateprediction.net, CPDN.beta and LHC on an AMD Opteron 1210. I have always been afraid of updating kernels; one never knows what one will get. Cheers.
Tullio

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 752980860

RAC: 1193643

RE: I am having big

27 Apr 2008 7:55:43 UTC

Message 79903 in response to message 79901

Quote:

I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3, currently with kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes that make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?

Yes, this kind of error seems to happen frequently with non-standard kernels (self-compiled Gentoo kernels, for example). What exactly causes it has remained a mystery. Maybe it's not so much the kernel itself as a compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.

CU
Bikeman

Jürgen Mell

Joined: 2 Feb 08

Posts: 1

Credit: 7606979

RAC: 0

RE: Yes, this kind of

27 Apr 2008 9:47:43 UTC

Message 79904 in response to message 79903

Quote:

Yes, this kind of error seems to happen frequently with non-standard kernels (self-compiled Gentoo kernels, for example). What exactly causes it has remained a mystery. Maybe it's not so much the kernel itself as a compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.

No, in this case it is definitely only the kernel, as I did not change anything else. What worries me is that if the kernel corrupts Einstein's FPU stack, the same might happen to my own application, and that is something I really do not like :-(. Do you have any hint (kernel configuration options which might influence the FPU stack) as to where I might start searching?
I am back on a 2.6.22.17-0.2-default kernel now and will try to tweak some options there to get the behavior I need for my application (simply a stable timer tick with 1 ms resolution).

juergen.mell

Joined: 9 Feb 05

Posts: 9

Credit: 12897831

RAC: 10457

RE: Yes, this kind of

3 May 2008 8:54:55 UTC

Message 79905 in response to message 79903

Quote:

Yes, this kind of error seems to happen frequently with non-standard kernels (self-compiled Gentoo kernels, for example). What exactly causes it has remained a mystery. Maybe it's not so much the kernel itself as a compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.

So, 5 kernels later I have slightly better information. It seems that activating the kernel option 'Preemptible Kernel (Low-Latency Desktop)' (CONFIG_PREEMPT) causes this problem. I still have to verify this, as I have also modified some other options and the failure might be caused by a combination of changed options, but before I set this option everything was fine. It would be most helpful if some other people could also verify this. For me, kernels 2.6.22 and 2.6.23 are affected.
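For anyone who wants to check their own machine, here is a minimal sketch that reads the preemption model and timer frequency out of kernel config text. The sample config and the function names are made up for illustration; only the option names (CONFIG_PREEMPT_NONE, CONFIG_PREEMPT_VOLUNTARY, CONFIG_PREEMPT, CONFIG_HZ) come from the kernel configuration itself.

```python
import re

# Hypothetical excerpt of a kernel .config, for illustration only.
SAMPLE_CONFIG = """\
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000
"""


def parse_kernel_config(text):
    """Return a dict mapping CONFIG_* option names to their values."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Enabled options look like CONFIG_FOO=y or CONFIG_HZ=1000.
        match = re.match(r"^(CONFIG_\w+)=(.*)$", line)
        if match:
            options[match.group(1)] = match.group(2).strip('"')
            continue
        # Disabled options are recorded as "# CONFIG_FOO is not set".
        match = re.match(r"^# (CONFIG_\w+) is not set$", line)
        if match:
            options[match.group(1)] = "n"
    return options


def preemption_model(options):
    """Return the name of the enabled preemption option, or 'unknown'."""
    for name in ("CONFIG_PREEMPT", "CONFIG_PREEMPT_VOLUNTARY",
                 "CONFIG_PREEMPT_NONE"):
        if options.get(name) == "y":
            return name
    return "unknown"


if __name__ == "__main__":
    opts = parse_kernel_config(SAMPLE_CONFIG)
    print("preemption model:", preemption_model(opts))
    print("HZ:", opts.get("CONFIG_HZ", "unknown"))
```

On a real system the text would typically come from a file such as /boot/config-<kernel version>, or from /proc/config.gz if the kernel was built to expose it; which of these exists depends on the distribution.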

Bye,
Jürgen

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 752980860

RAC: 1193643

Thanks very much for your

3 May 2008 11:24:11 UTC

Message 79906

Thanks very much for your report!

Hmmm....very interesting indeed. The Linux kernel 2.6.20 had some significant changes with respect to the policy on restoring the FPU state when switching between tasks.

As I understand it, earlier 2.6 kernels intentionally did NOT restore the FPU state immediately when switching back to a userland task (so the task continued with a "wrong" FPU context), but restored the context only when the first FPU instruction was encountered (lazy FPU state restoration). This is not so cool for FPU-intensive processes, as the mechanism to trap the first FPU instruction after a context switch is rather expensive when it fires, but it does pay off for tasks that don't use the FPU much. So the new policy uses some heuristics to decide whether to restore the FPU state immediately or in the previous, lazy fashion (on demand).
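To make the trade-off concrete, here is a toy model of the two policies. It is only a sketch under my own assumptions, not the kernel's code: the function name, the "streak" counter and the threshold of 5 are all made up for illustration. It counts how often one task would hit the expensive trap versus how often its state would be restored up front.

```python
def simulate(fpu_use_per_slice, threshold=None):
    """Count FPU traps and eager restores for a single task.

    fpu_use_per_slice: list of bools, True if the task touches the FPU
    during that timeslice.
    threshold: None models the purely lazy policy; an integer models a
    heuristic that restores eagerly once the task has used the FPU in
    more than `threshold` consecutive timeslices (illustrative value).
    """
    traps = 0
    eager_restores = 0
    streak = 0  # consecutive timeslices in which the FPU was used
    for uses_fpu in fpu_use_per_slice:
        restored_eagerly = threshold is not None and streak > threshold
        if restored_eagerly:
            # State is restored at the context switch, so no trap later.
            eager_restores += 1
        elif uses_fpu:
            # Lazy path: the first FPU instruction traps into the kernel.
            traps += 1
        streak = streak + 1 if uses_fpu else 0
    return traps, eager_restores


if __name__ == "__main__":
    cruncher = [True] * 20   # FPU-heavy task, like a science app
    shell = [False] * 20     # task that never touches the FPU
    print("lazy, cruncher:     ", simulate(cruncher))
    print("heuristic, cruncher:", simulate(cruncher, threshold=5))
    print("heuristic, shell:   ", simulate(shell, threshold=5))
```

In this model the FPU-heavy task traps on every timeslice under the lazy policy, while the heuristic stops the traps after a short warm-up, and the task that never uses the FPU costs nothing either way.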

http://kernelnewbies.org/Linux_2_6_20#head-b73db42e6026d0a7d99ddf0b81e905d74fa8ecbf

Still, it would be a rather catastrophic bug if this had issues with the optionally finer-grained preemption of the kernel.

But it is rather suspicious that many hosts that have this problem seem to have kernels optimized for RealTime processing (e.g. 2.6.x-bla-rt), which would indeed include a low latency preemption option. And most of these kernels (all I've seen lately) are >= 2.6.20 .... hmmm.....

CU
Bikeman

juergen.mell

Joined: 9 Feb 05

Posts: 9

Credit: 12897831

RAC: 10457

RE: Still, it would be a

4 May 2008 8:27:22 UTC

Message 79907 in response to message 79906

Quote:

Still, it would be a rather catastrophic bug if this had issues with the optionally finer-grained preemption of the kernel.

But it is rather suspicious that many hosts that have this problem seem to have kernels optimized for RealTime processing (e.g. 2.6.x-bla-rt), which would indeed include a low latency preemption option. And most of these kernels (all I've seen lately) are >= 2.6.20 .... hmmm.....

I have finally verified the problem with the 'Preemptible Kernel (Low-Latency Desktop)' (CONFIG_PREEMPT) option. I used a default kernel source from openSUSE and changed only this option. After that, Einstein started crashing :-(
I have filed a bug report with the openSUSE people. I hope they can help, as I really would not like to have to dig into this myself.

Bye,
Jürgen

MikeB

Joined: 22 Jan 05

Posts: 55

Credit: 4795711

RAC: 0

After reading the thread I

4 May 2008 11:47:10 UTC

Message 79908

After reading the thread I recompiled my kernel, switching from "Preemptible Kernel" to "No Forced Preemption", and now my computer is no longer trashing WUs. From an admittedly shallow search, I gather that most interrupts or syscalls to the kernel use fixed-point integer math when needed, since a full save/restore of the FPU wasn't implemented. I have had this setting before in 2.6.2y-rxx and it didn't cause a problem (afaik) then.

Funny thing is I have only seen FPU errors with this project. My machine was doing cosmology@home and QMC@home (not at the same time as E@H) and those tasks validated. I wonder how different the build environments at those projects are from this one. If I may ask, what version of glibc and linux-headers (or their equivalent packages) is the Linux binary compiled against?

I am currently using gentoo-sources 2.6.25-r1.

"But it's turtles all the way down!"

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 752980860

RAC: 1193643

@Juergen: Thanks for nailing

4 May 2008 21:15:06 UTC

Message 79909

@Juergen: Thanks for nailing this down to a single kernel parameter. It will be interesting to see what happens with the bug report.

@Mike: The E@H app might be a bit special in that it actually catches certain FPU problems which are not handled by default. Before this change was made, earlier E@H app versions just continued their computations, but with some false values. Unless those values were part of a computation leading to a "candidate" that is sent back to the server, no validation error would appear. In this sense, the E@H app is especially sensitive to certain FPU problems.
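As a rough analogy only (how the E@H app actually detects these problems is not described in this thread), the difference between "keep going with false values" and "catch the problem" can be sketched like this. The function names are invented for the example.

```python
import math


def accumulate_unchecked(values):
    """Sum of squares that silently carries on after an overflow."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def accumulate_checked(values):
    """Same sum, but stops as soon as the running total is not finite."""
    total = 0.0
    for v in values:
        total += v * v
        if not math.isfinite(total):
            raise FloatingPointError(
                "non-finite intermediate result for input %r" % (v,))
    return total


if __name__ == "__main__":
    data = [1.0, 1e200, 2.0]  # 1e200 squared overflows a double
    print("unchecked:", accumulate_unchecked(data))
    try:
        accumulate_checked(data)
    except FloatingPointError as err:
        print("checked:  ", err)
```

The unchecked version returns a useless result that only matters if something downstream depends on it, which matches the point above: a task that checks is the one that reports the error.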

Currently I can't see how a programming bug or compiler bug on the app side could cause this behavior (depending on this special kernel configuration parameter), as this parameter should really have no effect at all on any ordinary (non-kernel) processes.

Bikeman
