GNU/Linux S5R3 App 4.31 available for Beta test

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 753644781
RAC: 1179530

RE: One should also

Message 78502 in response to message 78501

Quote:

One should also consider that in previous versions certain (if not all?) FPEs appear to have been ignored -- thus we might see design/programming flaws today (unless those traps appear only on faulty hardware and never on faulty software design (which is okay, and human, and must happen)).

From the release info:

throws floating-point exception on NaNs and FPU stack errors

Of course I could verify that by running an older version without those new traps, but this might mean incorrect/drifting data.

I'm not sure since when FPEs have been trapped in the E@H code, but it's been quite a while. In any case, an untrapped floating-point exception would almost certainly lead to a secondary error like a segmentation fault, or at least a validation error after the result is submitted. So the trapping should not increase the overall error count; it just makes it easier to differentiate between potential software problems and likely hardware problems.
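To make the difference concrete, here is a minimal Python sketch. It is not taken from the E@H sources: the function names and the explicit `math.isnan` check are assumptions that only imitate what a hardware trap does. Without the check, the NaN travels silently to the end of the computation; with it, the failure is reported at the step that produced it.

```python
import math

def checked(value, where):
    """Hypothetical stand-in for a trap: stop as soon as a NaN appears."""
    if math.isnan(value):
        raise FloatingPointError(f"NaN produced in {where}")
    return value

def untrapped_pipeline(x):
    # inf - inf quietly yields NaN, and every later step inherits it.
    a = x - x
    b = a * 2.0 + 1.0
    return b

def trapped_pipeline(x):
    # Same arithmetic, but each intermediate result is checked.
    a = checked(x - x, "step 1")
    b = checked(a * 2.0 + 1.0, "step 2")
    return b

if __name__ == "__main__":
    print(untrapped_pipeline(math.inf))  # a NaN, with no hint where it came from
    try:
        trapped_pipeline(math.inf)
    except FloatingPointError as err:
        print("trapped:", err)
```

In both variants the work unit is lost, so the error count stays the same; only the trapped variant says where things went wrong.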

CU
Bikeman

rroonnaalldd
Joined: 12 Dec 05
Posts: 116
Credit: 537221
RAC: 0

RE: I'm not sure since

Message 78503 in response to message 78502

Quote:


I'm not sure since when FPEs have been trapped in the E@H code, but it's been quite a while. In any case, an untrapped floating-point exception would almost certainly lead to a secondary error like a segmentation fault, or at least a validation error after the result is submitted. So the trapping should not increase the overall error count; it just makes it easier to differentiate between potential software problems and likely hardware problems.

CU
Bikeman

I found that. Maybe it helps.

Robert Felber
Joined: 18 Feb 08
Posts: 8
Credit: 2275453
RAC: 0

RE: RE: One should also

Message 78504 in response to message 78502

Quote:
Quote:

One should also consider that in previous versions certain (if not all?) FPEs appear to have been ignored -- thus we might see design/programming flaws today (unless those traps appear only on faulty hardware and never on faulty software design (which is okay, and human, and must happen)).

From the release info:

throws floating-point exception on NaNs and FPU stack errors

Of course I could verify that by running an older version without those new traps, but this might mean incorrect/drifting data.

I'm not sure since when FPEs have been trapped in the E@H code, but it's been quite a while. In any case, an untrapped floating-point exception would almost certainly lead to a secondary error like a segmentation fault, or at least a validation error after the result is submitted. So the trapping should not increase the overall error count; it just makes it easier to differentiate between potential software problems and likely hardware problems.

CU
Bikeman

Shouldn't it be possible to do some server-side statistics about platform/CPU, Einstein client version and error type/error count in order to detect code that is probably broken only in certain environments, rather than relying on user reports?

Bikeman (Heinz-Bernd Eggenstein)

RE: Shouldn't it be

Message 78505 in response to message 78504

Quote:

Shouldn't it be possible to do some server-side statistics about platform/CPU, Einstein client version and error type/error count in order to detect code that is probably broken only in certain environments, rather than relying on user reports?

This is indeed already done. But the individual user error-reports are still very useful since you can actually ask questions to get information that is otherwise not directly available from the logs (e.g. are the hosts overclocked? what other projects are running and did they show problems as well...).

The "signal 11" problem on Linux that we discussed here recently is a good example. It turned out to be a bug in the BOINC library, but to analyse the problem it was crucial to learn from users that the problem had some connection with broken network connections. Only then was it possible to pinpoint the root cause.
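For readers wondering what the server-side statistics mentioned above could look like, here is a rough Python sketch. It says nothing about how the project actually does it: the record layout, the field names and the sample values are all made up for illustration. It groups results by platform, CPU and app version and reports the fraction that ended with a non-zero exit code.

```python
from collections import defaultdict

def error_rates(results):
    """Return the error fraction per (platform, cpu, app_version) group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for r in results:
        key = (r["platform"], r["cpu"], r["app_version"])
        totals[key] += 1
        if r["exit_code"] != 0:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}

# Made-up sample records, not real project data.
sample = [
    {"platform": "linux", "cpu": "PIII", "app_version": 431, "exit_code": 0},
    {"platform": "linux", "cpu": "PIII", "app_version": 435, "exit_code": 99},
    {"platform": "linux", "cpu": "PIII", "app_version": 435, "exit_code": 0},
    {"platform": "windows", "cpu": "Athlon", "app_version": 431, "exit_code": 0},
]

if __name__ == "__main__":
    for key, rate in sorted(error_rates(sample).items()):
        print(key, f"{rate:.0%}")
```

A table like this can show that a group of hosts fails more often, but it cannot tell whether those hosts are overclocked or had a flaky network connection, which is where user reports come in.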

CU

Bikeman

Robert Felber

RE: RE: Shouldn't it be

Message 78506 in response to message 78505

Quote:
Quote:

Shouldn't it be possible to do some server-side statistics about platform/CPU, Einstein client version and error type/error count in order to detect code that is probably broken only in certain environments, rather than relying on user reports?

This is indeed already done. But the individual user error-reports are still very useful since you can actually ask questions to get information that is otherwise not directly available from the logs (e.g. are the hosts overclocked? what other projects are running and did they show problems as well...).

Thanks for your answers. :-)

I see you have also got a P3 Coppermine without troubles - I suppose you run >= 4.31 there.

(Is there some non-speed but stability/fault-tolerance optimized version which I could try?)

Bikeman (Heinz-Bernd Eggenstein)

RE: Thanks for your

Message 78507 in response to message 78506

Quote:

Thanks for your answers. :-)

I see you have also got a P3 Coppermine without troubles - I suppose you run >= 4.31 there.

Yes, I've got a Dual PIII-866 Coppermine. It's an HP Kayak XM-600 Series 2 that I bought used for 25 Euros :-). It's probably more than 7 years old but (knocking on wood) still runs rock solid. I had to replace the power supply recently, though, because the fan was failing, and I would not be surprised to see it die any day now.

The P-III already supports SSE, so I'm using the 4.35 "Power Users" App on this box, which is considerably faster than the 4.31 App.

Quote:

(Is there some non-speed but stability/fault-tolerance optimized version which I could try?)


Stability and validity of results are the top concerns for E@H, and the FPE trapping feature is in fact part of the error detection code. Once the app notices there might be something wrong, it terminates the computation, which is safer than trying to be "fault-tolerant" and carry on regardless. There is no special "stability" edition of the app.
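A small Python sketch of why stopping is the safer choice. This is an illustration only, not E@H code; both function names and the sample values are made up. The "carry on" variant skips the bad value and returns a number that looks plausible but is wrong, while the fail-fast variant refuses to return anything.

```python
import math

def accumulate_fail_fast(samples):
    """Sum the samples, aborting as soon as a NaN shows up."""
    total = 0.0
    for i, s in enumerate(samples):
        if math.isnan(s):
            raise FloatingPointError(f"NaN in sample {i}, aborting")
        total += s
    return total

def accumulate_carry_on(samples):
    """'Fault-tolerant' variant: silently drops NaN samples."""
    return sum(s for s in samples if not math.isnan(s))

if __name__ == "__main__":
    data = [1.0, math.nan, 2.0]  # made-up input with one corrupted value
    print(accumulate_carry_on(data))  # a number comes back, but a sample is missing
    try:
        accumulate_fail_fast(data)
    except FloatingPointError as err:
        print("stopped:", err)
```

A result that is silently wrong could still end up in the science data, whereas an aborted task is simply reported as an error.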


CU
Bikeman

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4330
Credit: 251389201
RAC: 36912

RE: RE: Hi! This

Message 78508 in response to message 78501

Quote:
Quote:

Hi!

This particular fault (FPE) is really hard to produce with software bugs (other than compiler bugs); the most likely explanation is failing hardware. After all, a PIII Coppermine must be how old by now? 6 years? 7 years?

Most of the time it's not the CPU itself but things like failing fans, swollen capacitors on the motherboard, glitches in the power supply... Gary will be able to expand on this better than me. The E@H app has now reached a significant level of optimization and squeezes quite a bit of performance out of the FPU, so it's not surprising that E@H is the first app to show symptoms of hardware failure.

CU
Bikeman

One should also consider that in previous versions certain (if not all?) FPEs appear to have been ignored -- thus we might see design/programming flaws today (unless those traps appear only on faulty hardware and never on faulty software design (which is okay, and human, and must happen)). But then, if it were flawed design, it shouldn't appear only here.

From the release info:

throws floating-point exception on NaNs and FPU stack errors

Of course I could verify that by running an older version without those new traps, but this might mean incorrect/drifting data - so my decision would then rather be "this old, P3-driven host cannot participate in Einstein". (That would be okay. It would be interesting to know whether other users with the same CPU get the same FPEs, but I think those old P3 Coppermine users haven't discovered the latest version yet and still run old versions.)

Update: the 4.35 SSE version also seems to produce errors:

2008-02-23 10:15:34 [Einstein@Home] Resuming task h1_0851.85_S5R3__372_S5R3b_1 using einstein_S5R3 version 435
2008-02-23 10:15:46 [Einstein@Home] Deferring communication for 1 min 0 sec
2008-02-23 10:15:46 [Einstein@Home] Reason: Unrecoverable error for result h1_0851.85_S5R3__372_S5R3b_1 (process exited with code 99 (0x63))


The FPEs (at least the ones I've seen here) would almost always lead to a NaN in a certain variable, and this in turn to an error with exit status 99 a few instructions later (when a sanity check for array bounds is hit). So these errors are simply taken out of the "99" bunch of computing errors in order to get closer to the point where the error actually occurs, that's all.
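The chain from a NaN to exit status 99 can be sketched in a few lines of Python. This is a hedged illustration, not the app's code: `bin_position`, `lookup` and the constant name are hypothetical, and only the value 99 comes from the log above. Because every comparison with NaN is false, a NaN position fails the bounds check, and the process exits there rather than at the instruction that created the NaN.

```python
import math

EXIT_SANITY_CHECK = 99  # value from the log above; the constant name is made up

def bin_position(freq, f_min, bin_width):
    """Fractional bin position of freq; a NaN input gives a NaN position."""
    return (freq - f_min) / bin_width

def lookup(table, position):
    """Bounds-checked table access that exits with status 99 on failure."""
    # Comparisons with NaN are always False, so NaN never passes this check.
    if not (0.0 <= position < len(table)):
        raise SystemExit(EXIT_SANITY_CHECK)
    return table[int(position)]

if __name__ == "__main__":
    table = [10, 20, 30]
    print(lookup(table, bin_position(2.5, 0.0, 1.0)))  # valid position, index 2
    lookup(table, bin_position(math.nan, 0.0, 1.0))    # exits with status 99
```

With a trap in place the run stops at the operation that produced the NaN instead, which is what moves these cases out of the "99" group.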

There is at least one other reason for FPEs: a flaw in the operating system (or even in the compiler it was built with). The OS should actually protect one process context against whatever is happening in other contexts, but apparently this doesn't always work correctly. At least on Windows I read reports where a hardware driver (usually a printer driver) could mess up the FPU stack and flags so badly that they weren't properly restored when switching back to a user process, generating an FPE there. I'm not sure that all possible Linux kernels (including self-built ones) have sufficient protection against bad drivers and other stuff running in kernel mode. I'm not even sure they all properly save all registers - I've seen the CPU type / register detection of Linux fail to detect the right CPU (and thus the available set of registers) in at least two cases.

You seem to have quite a number of machines running; can you point me to the machine, or even better the result, where the error happened with the 4.35?

BM


Robert Felber

RE: There is at least one

Message 78509 in response to message 78508

Quote:

There is at least one other reason for FPEs: a flaw in the operating system (or even in the compiler it was built with). The OS should actually protect one process context against whatever is happening in other contexts, but apparently this doesn't always work correctly. At least on Windows I read reports where a hardware driver (usually a printer driver) could mess up the FPU stack and flags so badly that they weren't properly restored when switching back to a user process, generating an FPE there. I'm not sure that all possible Linux kernels (including self-built ones) have sufficient protection against bad drivers and other stuff running in kernel mode. I'm not even sure they all properly save all registers - I've seen the CPU type / register detection of Linux fail to detect the right CPU (and thus the available set of registers) in at least two cases.

The system uses (well, used) gentoo libc (currently 2.6.1) and gcc (just switched to 4.1.2) and vanilla kernel (2.6.22.1, will be updated/rebuilt).

Quote:

You seem to have quite a number of machines running; can you point me to the machine, or even better the result, where the error happened with the 4.35?

Host 1117633
The last two (92666821 and 92655563) are done with 4.35.

Probably worthless info: this host performs well on LHC, Rosetta and obviously SETI (for SETI no results, and no bad news yet).

Bikeman (Heinz-Bernd Eggenstein)

I've looked into some results

I've looked into some results with this error, and without doing a real statistical analysis, it "feels" as if Gentoo Linux runs a higher risk of encountering this problem. Here are just a few hosts:

http://einsteinathome.org/host/806536

http://einsteinathome.org/host/645060

http://einsteinathome.org/host/728561

http://einsteinathome.org/host/982121

http://einsteinathome.org/host/382784

http://einsteinathome.org/host/692446

http://einsteinathome.org/host/1078848

http://einsteinathome.org/host/1116760

http://einsteinathome.org/host/904093

http://einsteinathome.org/host/536424

Some Coppermines, but also some AMD CPUs among those hosts.

Gentoo isn't that widespread anymore, is it? Ubuntu, Redhat and Suse must be far more popular.

CU
Bikeman

Jos van Wolput
Joined: 11 Feb 05
Posts: 47
Credit: 800840
RAC: 0

I upgraded to "power users"

I upgraded to "power users" app 4.35 without problems.
Really very fast!
