GNU/Linux S5R3 App 4.38 available for Beta test

Ed1934158
Joined: 10 Nov 04
Posts: 62
Credit: 14481483
RAC: 0

RE: You will lie awake at

Message 79900 in response to message 79898

Quote:

You will lie awake at night thinking about how much more you could do with three ...

Happens to me all the time ...

Heck I just got a dual 3.2 GHz Mac Pro and I am already thinking about how to get more done with upgrading other computers ... Haven't even paid the bill for the Pro yet (due the 8th, check in the mail)...

And no, you don't want to know how much I spent ...


That is why I've convinced my girlfriend that she needs a Q6600. :)
Sorry for going off-topic.

juergen.mell
Joined: 9 Feb 05
Posts: 9
Credit: 12897831
RAC: 10457

I am having big problems with

I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3 and right now I am using kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All the aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes which make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?

tullio
Joined: 22 Jan 05
Posts: 2118
Credit: 61407735
RAC: 0

RE: I am having big

Message 79902 in response to message 79901

Quote:
I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3 and right now I am using kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All the aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes which make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?


I am using SuSE 10.3 with this kernel: Linux 2.6.22.5-31-default i686,
and everything works in Einstein, SETI, QMC/ORCA, climateprediction.net, CPDN.beta and LHC on an AMD Opteron 1210. I have always been afraid of updating kernels; one never knows what one will get. Cheers.
Tullio

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753089730
RAC: 1193078

RE: I am having big

Message 79903 in response to message 79901

Quote:
I have been having big problems with the application recently. At the beginning everything ran very well, but this week I started experimenting with real-time optimized kernels (I am running openSUSE 10.3 and right now I am using kernel 2.6.23.17-ccj64-rt). Now I am getting lots of computation errors, e.g. work units 95960123 or 95960103. All the aborted work units show either the floating point error or the XLAL error. Since I switched kernels, only a single work unit has been computed without failure. Otherwise the machine seems to run without any problems. The major changes in the kernel configuration compared with the standard kernels are an increased HZ value (1000 instead of 250) and several changes which make preemption possible at nearly every point in the kernel. Any ideas what is wrong here?

Yes, this kind of error seems to happen frequently with non-standard kernels (self compiled Gentoo kernels for example). What exactly causes this has remained a mystery. Maybe it's not so much the kernel itself, but some compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.

CU
Bikeman

Jürgen Mell
Joined: 2 Feb 08
Posts: 1
Credit: 7606979
RAC: 0

RE: Yes, this kind of

Message 79904 in response to message 79903

Quote:

Yes, this kind of error seems to happen frequently with non-standard kernels (self compiled Gentoo kernels for example). What exactly causes this has remained a mystery. Maybe it's not so much the kernel itself, but some compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.


No, in this case it is definitely only the kernel, as I did not change anything else. What worries me is that if the kernel corrupts Einstein's FPU stack, this might also happen to my own application, and that is something I really do not like :-(. Do you have any hints (kernel configuration options which might influence the FPU stack) as to where I might start searching?
I am back on a 2.6.22.17-0.2-default kernel now and will try to tweak some options there to get the behavior I need for my application (simply a stable timer tick with 1 ms resolution).

juergen.mell
Joined: 9 Feb 05
Posts: 9
Credit: 12897831
RAC: 10457

RE: Yes, this kind of

Message 79905 in response to message 79903

Quote:

Yes, this kind of error seems to happen frequently with non-standard kernels (self compiled Gentoo kernels for example). What exactly causes this has remained a mystery. Maybe it's not so much the kernel itself, but some compilation option of some shared libraries. Something seems to corrupt the FPU stack of the E@H app.


So, 5 kernels later I have slightly better information. It seems that activating the kernel option 'Preemptible Kernel (Low-Latency Desktop)' (CONFIG_PREEMPT) causes this problem. I still have to verify this, as I have also modified some other options and the failure might be caused by a combination of changed options, but before I set this option everything was fine. It would be most helpful if some other people could also verify this. For me, kernels 2.6.22 and 2.6.23 are affected.
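For anyone who wants to check their own host, here is a minimal sketch of how one might read the preemption model and timer frequency out of a kernel configuration. It assumes the config is available either as /proc/config.gz (only present when the kernel was built with that feature) or as /boot/config-<release>; the function names and the sample text are made up for illustration.

```python
import gzip
import os
import platform

# Hypothetical sample in kernel .config syntax, used only for illustration.
SAMPLE_CONFIG = """
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_HZ_1000=y
CONFIG_HZ=1000
"""


def parse_kernel_config(text):
    """Return a dict mapping CONFIG_* names to their values ('n' if unset)."""
    options = {}
    unset_suffix = " is not set"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("# CONFIG_") and line.endswith(unset_suffix):
            # Disabled options appear as comments: "# CONFIG_FOO is not set"
            options[line[2:-len(unset_suffix)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            options[name] = value
    return options


def preemption_summary(options):
    """Summarise the settings discussed in this thread."""
    return {
        "full_preempt": options.get("CONFIG_PREEMPT") == "y",
        "voluntary": options.get("CONFIG_PREEMPT_VOLUNTARY") == "y",
        "none": options.get("CONFIG_PREEMPT_NONE") == "y",
        "hz": options.get("CONFIG_HZ"),
    }


def read_running_kernel_config():
    """Return the running kernel's config text, or None if it is not found."""
    candidates = ["/proc/config.gz", "/boot/config-" + platform.release()]
    for path in candidates:
        if os.path.exists(path):
            opener = gzip.open if path.endswith(".gz") else open
            with opener(path, "rt") as handle:
                return handle.read()
    return None


if __name__ == "__main__":
    text = read_running_kernel_config() or SAMPLE_CONFIG
    print(preemption_summary(parse_kernel_config(text)))
```

If "full_preempt" comes out True on a host that is producing FPU errors, that matches the configuration described above; it does not by itself prove the option is the cause.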

Bye,
Jürgen

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753089730
RAC: 1193078

Thanks very much for your

Thanks very much for your report!

Hmmm... very interesting indeed. Linux kernel 2.6.20 brought some significant changes to the policy for restoring the FPU state when switching between tasks.

As I understand it, earlier 2.6 kernels intentionally did NOT restore the FPU state immediately when switching back to a userland task (so the task continued with a "wrong" FPU context), but restored the context only when the first FPU instruction was encountered (lazy FPU state restoration). This pays off for tasks that don't use the FPU much, but it is not so cool for FPU-intensive processes, because the mechanism that traps the first FPU instruction after a context switch is rather expensive each time it fires. So the new policy uses a heuristic to decide whether to restore the FPU state immediately or in the previous, lazy fashion (on demand).

http://kernelnewbies.org/Linux_2_6_20#head-b73db42e6026d0a7d99ddf0b81e905d74fa8ecbf
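To make the trade-off concrete, here is a toy model of the two policies. It is not kernel code: it only counts, for a single task, how many time slices start with an expensive trap versus an up-front restore. The streak-based rule and the threshold value of 5 are assumptions chosen for illustration; this thread does not state the heuristic the kernel actually uses.

```python
def simulate(fpu_usage, threshold=None):
    """Count traps and eager restores for one task over its time slices.

    fpu_usage: iterable of booleans, one per time slice; True means the task
               executes at least one FPU instruction in that slice.
    threshold: None models the purely lazy policy. An integer models a
               heuristic that restores eagerly once the task has used the FPU
               in more than `threshold` consecutive slices (assumed rule).
    """
    traps = 0            # lazy path: first FPU instruction faults, then restore
    eager_restores = 0   # state restored at switch-in, no trap needed
    wasted_restores = 0  # eager restore for a slice that never touched the FPU
    streak = 0           # consecutive slices in which the FPU was used
    for uses_fpu in fpu_usage:
        eager = threshold is not None and streak > threshold
        if eager:
            eager_restores += 1
            if not uses_fpu:
                wasted_restores += 1
        elif uses_fpu:
            traps += 1
        streak = streak + 1 if uses_fpu else 0
    return {
        "traps": traps,
        "eager_restores": eager_restores,
        "wasted_restores": wasted_restores,
    }


if __name__ == "__main__":
    cruncher = [True] * 100   # FPU-heavy task, like a science app
    daemon = [False] * 100    # task that never touches the FPU
    print("cruncher, lazy:     ", simulate(cruncher))
    print("cruncher, heuristic:", simulate(cruncher, threshold=5))
    print("daemon, heuristic:  ", simulate(daemon, threshold=5))
```

In this model the FPU-heavy task takes a trap in every slice under the lazy policy, but only during the first few slices under the heuristic, while the task that never uses the FPU pays nothing in either case.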

Still, it would be a rather catastrophic bug if this had issues with the optionally finer-grained preemption of the kernel.

But it is rather suspicious that many hosts that have this problem seem to have kernels optimized for RealTime processing (e.g. 2.6.x-bla-rt), which would indeed include a low latency preemption option. And most of these kernels (all I've seen lately) are >= 2.6.20 .... hmmm.....

CU
Bikeman

juergen.mell
Joined: 9 Feb 05
Posts: 9
Credit: 12897831
RAC: 10457

RE: Still, it would be a

Message 79907 in response to message 79906

Quote:


Still, it would be a rather catastrophic bug if this had issues with the optionally finer-grained preemption of the kernel.

But it is rather suspicious that many hosts that have this problem seem to have kernels optimized for RealTime processing (e.g. 2.6.x-bla-rt), which would indeed include a low latency preemption option. And most of these kernels (all I've seen lately) are >= 2.6.20 .... hmmm.....

I have finally verified the problem with the 'Preemptible Kernel (Low-Latency Desktop)' (CONFIG_PREEMPT) option. I used a default kernel source from openSUSE and changed only this option. After that, Einstein started crashing :-(
I have filed a bug report with the openSUSE people. I hope they can help, as I really would not like to have to dig into this myself.

Bye,
Jürgen

MikeB
Joined: 22 Jan 05
Posts: 55
Credit: 4795711
RAC: 0

After reading the thread I

After reading the thread I recompiled my kernel and moved over to "No Forced Preemption" from "Preemptible Kernel", and now my computer is no longer trashing WUs. From searching, albeit shallowly, I gather that most interrupt handlers and syscalls in the kernel use fixed-point integer math when needed, since full save/restore of the FPU wasn't implemented there. I have had this setting before in 2.6.2y-rxx and it didn't cause a problem (afaik) then.

Funny thing is, I have only seen FPU errors with this project. My machine was doing cosmology@home and QMC@home (not at the same time as E@H) and those tasks validated. I wonder how different the build environments at those projects are from this one. If I may ask, what versions of glibc and linux-headers (or their equivalent packages) is the Linux binary compiled against?

I am currently using gentoo-sources 2.6.25-r1.

"But it's turtles all the way down!"

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753089730
RAC: 1193078

@Juergen: Thanks for nailing

@Juergen: Thanks for nailing this down to a single kernel parameter. It will be interesting to see what happens with the bug report.

@Mike: The E@H app might be a bit special in that it actually catches certain FPU problems which by default go unhandled. Before this change was made, earlier E@H app versions just continued their computations, but with some false values. Unless those values were part of a computation leading to a "candidate" that is sent back to the server, no validation error would appear. In this sense, the E@H app is especially sensitive to certain FPU problems.
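As a rough illustration of why unchecked FPU problems can go unnoticed, consider the following sketch. It is not the E@H code (the real app is native code and its actual checks are not shown in this thread); the statistic, the threshold and all names are invented. It relies only on the fact that any comparison with NaN is false, so a corrupted value simply never becomes a candidate.

```python
import math

# Hypothetical input: four data blocks, one of them corrupted by a NaN.
BLOCKS = [[0.1, 0.2], [3.0, 4.0], [float("nan"), 1.0], [0.3, 0.1]]


class FloatingPointProblem(ArithmeticError):
    """Raised by the checked variant when a result is not finite."""


def detection_statistic(samples):
    """Toy statistic (mean of squares); stands in for the real search code."""
    return sum(x * x for x in samples) / len(samples)


def find_candidates_silent(blocks, threshold):
    """Old behaviour in spirit: keep computing, never inspect the values.

    NaN > threshold is False, so a corrupted block is dropped without a
    trace and the returned list still looks plausible.
    """
    return [i for i, block in enumerate(blocks)
            if detection_statistic(block) > threshold]


def find_candidates_checked(blocks, threshold):
    """New behaviour in spirit: abort as soon as a value is not finite."""
    candidates = []
    for i, block in enumerate(blocks):
        value = detection_statistic(block)
        if not math.isfinite(value):
            raise FloatingPointProblem(f"non-finite statistic in block {i}")
        if value > threshold:
            candidates.append(i)
    return candidates


if __name__ == "__main__":
    print("silent: ", find_candidates_silent(BLOCKS, 5.0))
    try:
        find_candidates_checked(BLOCKS, 5.0)
    except FloatingPointProblem as err:
        print("checked:", err)
```

The silent variant returns a candidate list that gives no hint that one block was corrupted, whereas the checked variant turns the same corruption into an immediate error, which would explain why only the project that checks reports failures.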

Currently I can't see how a programming bug or compiler bug on the app side could cause this behavior (depending on this particular kernel configuration parameter), as this parameter should really have no effect at all on any ordinary (non-kernel) process.

Bikeman
