One should also consider that in previous versions certain (if not all?) FPEs appear to have been ignored -- thus we might see design/programming flaws today (unless those traps fire only on faulty hardware and never on faulty software design, which would be okay, and human, and bound to happen). But then, if it were a design flaw, it shouldn't appear only here.
From the release info:
throws floating-point exception on NaNs and FPU stack errors
Of course I could verify that by running an older version without those new traps, but that might mean incorrect/drifting data - so my decision would then rather be "this old, P3-driven host cannot participate in Einstein". (Which would be okay. It would be interesting whether other users with the same CPU get the same FPEs, but I suspect those old P3 Coppermine users haven't discovered the latest version yet and are still running old ones.)
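To illustrate what "untrapped" means in practice - a minimal sketch in plain C (not E@H code, just the general mechanism) of how an invalid operation silently yields a NaN that poisons everything computed after it:

/* nan_drift.c -- build with: gcc -std=c99 nan_drift.c -lm
   With FPE traps disabled (the default), an invalid operation
   quietly returns NaN instead of stopping the program. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double sum = 0.0;
    for (int i = 0; i < 4; i++) {
        double x = (i == 2) ? sqrt(-1.0) : 1.0; /* sqrt(-1.0) -> NaN, no signal */
        sum += x;                               /* NaN propagates into the sum */
    }
    printf("sum = %f (isnan = %d)\n", sum, isnan(sum)); /* prints nan, 1 */
    return 0;
}

Every result derived from that sum is garbage, yet the run "succeeds" - which is exactly the incorrect/drifting-data risk of the versions without traps.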
Update: the 4.35 SSE version also seems to produce errors:
2008-02-23 10:15:34 [Einstein@Home] Resuming task h1_0851.85_S5R3__372_S5R3b_1 using einstein_S5R3 version 435
2008-02-23 10:15:46 [Einstein@Home] Deferring communication for 1 min 0 sec
2008-02-23 10:15:46 [Einstein@Home] Reason: Unrecoverable error for result h1_0851.85_S5R3__372_S5R3b_1 (process exited with code 99 (0x63))
RE: One should also
I'm not sure since when FPEs have been trapped in the E@H code, but it's been quite a while. In any case, an untrapped floating-point exception would almost certainly lead to a secondary error like a segmentation fault, or at least to a validation error after the result is submitted. So the trapping should not increase the overall error count; it just makes it easier to differentiate between potential software problems and likely hardware problems.
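For the curious: on Linux/glibc, enabling such traps is essentially a one-liner. This is only a sketch of the general mechanism (feenableexcept is my assumption here; the actual E@H code may do it differently):

/* fpe_trap.c -- build with: gcc fpe_trap.c -lm
   Once FE_INVALID is trapped, the kind of operation that would
   silently produce a NaN now raises SIGFPE, whose default action
   terminates the process immediately. */
#define _GNU_SOURCE
#include <fenv.h>
#include <stdio.h>

int main(void)
{
    volatile double zero = 0.0;       /* volatile defeats constant folding */

    feenableexcept(FE_INVALID);       /* trap invalid operations (NaN producers) */
    double r = zero / zero;           /* 0/0 is invalid -> SIGFPE right here */
    printf("never reached: %f\n", r); /* the process died one line above */
    return 0;
}

With the trap enabled, the process dies at the faulty instruction instead of submitting a silently corrupted result, which is what makes the hardware-vs-software distinction visible in the logs.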
CU
Bikeman
RE: I'm not sure since
I found that. Maybe it helps.
RE: RE: One should also
Shouldn't it be possible to do some server-side statistics on platform/CPU, Einstein client version, and error type/error count, in order to detect code that is probably broken only in certain environments, rather than relying on user reports?
RE: Shouldn't it be
This is indeed already done. But the individual user error reports are still very useful, since you can actually ask questions to get information that is otherwise not directly available from the logs (e.g. are the hosts overclocked? What other projects are running, and did they show problems as well?).
The "signal 11" problem on Linux that we discussed here recently is a good example. It turned out to be a bug in the BOINC library, but to analyse the problem it was crucial to learn from users that it was related to broken network connections. Only then was it possible to pinpoint the root cause.
CU
Bikeman
RE: RE: Shouldn't it be
Thanks for your answers. :-)
I see you also have a P3 Coppermine without troubles - I suppose you run >= 4.31 there.
(Is there a version optimized not for speed but for stability/fault tolerance that I could try?)
RE: Thanks for your
Yes, I've got a dual PIII-866 Coppermine. It's an HP Kayak XM-600 Series 2 that I bought used for 25 euros :-). It's probably more than 7 years old but (knocking on wood) still runs rock solid. I had to replace the power supply recently, though, because the fan was failing. Even so, I would not be surprised to see it die any day now.
The P-III already supports SSE, so I'm using the 4.35 "Power Users" App on this box, which is considerably faster than the 4.31 App.
Stability and validity of results are the top concerns for E@H, and the FPE trapping feature is in fact part of the error-detection code. Once the app notices that something might be wrong, it terminates the computation, which is safer than trying to be "fault-tolerant" and carrying on regardless. There is no special "stability" edition of the app.
This particular fault (FPE) is really hard to produce with software bugs (other than compiler bugs); the most likely explanation is failing hardware. After all, how old must a PIII Coppermine be by now? Six years? Seven?
Most of the time it's not the CPU itself but things like failing fans, swollen capacitors on the motherboard, or glitches in the power supply... Gary will be able to expand on this better than me. The E@H app has now reached a significant level of optimization and squeezes quite a bit of performance out of the FPU, so it's not surprising that E@H is the first app to show symptoms of hardware failure.
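As a sketch of that "notice, then terminate" behaviour (hypothetical code, not the actual E@H error-detection source - the handler and the exit status are invented):

/* fpe_abort.c -- build with: gcc fpe_abort.c -lm */
#define _GNU_SOURCE
#include <fenv.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

static void fpe_handler(int sig)
{
    (void)sig;
    /* only async-signal-safe calls here, then a hard exit */
    static const char msg[] = "FPE detected: terminating computation\n";
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1); /* the real app's exit status will differ */
}

int main(void)
{
    volatile double zero = 0.0;

    signal(SIGFPE, fpe_handler);
    feenableexcept(FE_INVALID | FE_DIVBYZERO);
    double r = zero / zero; /* -> SIGFPE -> fpe_handler -> _exit */
    printf("%f\n", r);      /* never reached */
    return 0;
}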
CU
Bikeman
RE: RE: Hi! This
The FPEs (at least the ones I've seen here) would almost always lead to a NaN in a certain variable, and from there to an error with exit status 99 a few instructions later (where there is a sanity check on array bounds). So these errors are taken out of the "99" bunch of computing errors in order to get closer to the point where the error actually occurs, that's all.
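A hypothetical sketch of that failure chain (the variable names and the bounds check are invented for illustration; only the exit status 99 comes from the client logs):

/* exit99.c -- how a NaN turns into "exit code 99" a few instructions later */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NBINS 1024

int main(void)
{
    volatile double zero = 0.0;
    double freq = zero / zero;     /* stand-in for a NaN from flaky hardware */
    int bin = (int)(freq * NBINS); /* converting NaN to int yields garbage */

    if (isnan(freq) || bin < 0 || bin >= NBINS) {
        fprintf(stderr, "sanity check failed: bin=%d\n", bin);
        exit(99);                  /* the status reported in the client log */
    }
    printf("bin = %d\n", bin);
    return 0;
}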
There is at least one other possible reason for FPEs: a flaw in the operating system (or even in the compiler it was built with). The OS should protect one process context against whatever is happening in other contexts, but apparently this doesn't always work correctly. At least on Windows I have read reports where a hardware driver (usually a printer driver) could mess up the FPU stack and flags so badly that they weren't properly restored when switching back to a user process, generating an FPE there. I'm not sure that all possible Linux kernels (including self-built ones) have sufficient protection against bad drivers and other code running in kernel mode. I'm not even sure they all properly save all registers - I've seen the CPU type / register detection of the Linux kernel fail to identify the right CPU (and thus the available register set) in at least two cases.
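One way to cross-check the kernel's CPU detection (my suggestion, nothing from this thread): ask the CPU directly and compare with the flags line in /proc/cpuinfo. With gcc >= 4.8 on x86 this is only a few lines:

/* cpu_flags.c -- compare CPUID-reported features against /proc/cpuinfo */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init(); /* populate gcc's CPU feature cache */
    printf("CPUID reports SSE:  %s\n", __builtin_cpu_supports("sse")  ? "yes" : "no");
    printf("CPUID reports SSE2: %s\n", __builtin_cpu_supports("sse2") ? "yes" : "no");
    /* now compare against:  grep '^flags' /proc/cpuinfo */
    return 0;
}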
You seem to have quite a number of machines running; can you point me to the machine, or even better the result, where the error happened with 4.35?
BM
RE: There is at least one
The system uses (well, used) Gentoo's glibc (currently 2.6.1) and gcc (just switched to 4.1.2), plus a vanilla kernel (2.6.22.1, which will be updated/rebuilt).
Host 1117633
The last two (92666821 and 92655563) were done with 4.35.
Probably worthless info: this host performs well on LHC, Rosetta and, apparently, SETI (no results back from SETI yet, but no bad news either).
I've looked into some results
I've looked into some results with this error, and without doing a real statistical analysis, it "feels" as if Gentoo Linux runs a higher risk of encountering this problem. Here are just a few hosts:
http://einsteinathome.org/host/806536
http://einsteinathome.org/host/645060
http://einsteinathome.org/host/728561
http://einsteinathome.org/host/982121
http://einsteinathome.org/host/382784
http://einsteinathome.org/host/692446
http://einsteinathome.org/host/1078848
http://einsteinathome.org/host/1116760
http://einsteinathome.org/host/904093
http://einsteinathome.org/host/536424
Some Coppermines, but also some AMD CPUs among those hosts.
Gentoo isn't that widespread anymore, is it? Ubuntu, Redhat and Suse must be far more popular.
CU
Bikeman
I upgraded to "power users"
)
I upgraded to "power users" app 4.35 without problems.
Really very fast!