Client Errors of S5R2/S5R3 Apps

KWSN Sir Clark

Joined: 26 Jun 05

Posts: 42

Credit: 1200171

RAC: 0

No problems reported so far

12 Dec 2007 1:53:32 UTC

Message 71176

No problems reported so far with MemTest. The CPU is running at about 50°C or lower under load, so no overheating problems.

I managed to do a couple of units in November....

I'll try another and see if I get the same.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5872

Credit: 117503333674

RAC: 35418820

RE: I'll try another and

12 Dec 2007 4:20:09 UTC

Message 71177 in response to message 71176

Quote:

I'll try another and see if I get the same.

Sounds good. If Memtest runs OK for a couple of hours, if CPU temp is <50C at full load and if you're not aggressively overclocking then perhaps it is one of the code 99 errors that Bernd will be interested in.

Cheers,
Gary.

KWSN Sir Clark

Joined: 26 Jun 05

Posts: 42

Credit: 1200171

RAC: 0

It must be one for Bernd to

13 Dec 2007 14:42:35 UTC

Message 71178

It must be one for Bernd to look at

I've just successfully completed another WU

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 984

Credit: 25171376

RAC: 43

Hi guys, For the last

16 Dec 2007 11:20:48 UTC

Message 71179

Hi guys,

For the last couple of WUs I've repeatedly got the same compute error on one of my machines (running BOINC 5.10.27):

-------snip--------------------------------------------

APP DEBUG: Application caught signal 8.

FPU status word ffff80c1, flags: ERR_SUMM STACK_FAULT INVALID
Obtained 7 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]
einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1849)[0x80ace69]
einstein_S5R3_4.20_i686-pc-linux-gnu(MAIN+0x352d)[0x80a495d]
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a5b34]
../../projects/einstein.phys.uwm.edu/einstein_S5R3_4.20_i686-pc-linux-gnu.so(_Z6foobarPv+0x14)[0xb7cd9e24]
/lib/libpthread.so.0[0xb7ed4383]
/lib/libc.so.6(clone+0x5e)[0xb7e5863e]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED

-------snip--------------------------------------------

There seems to be some floating-point exception. Any idea?

Cheers, Oliver
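For anyone reading the log above: signal 8 is SIGFPE on Linux, and the flags line is a decode of the x87 FPU status word. Here is a minimal sketch of that decode, assuming only the low 16 bits of `ffff80c1` are the status word (the leading `ffff` looks like padding, but that is a guess). The bit positions follow the standard x87 layout; the names `FLAG_BITS` and `decode_fpu_status`, and the flag labels other than the three printed in the log, are made up for illustration.

```python
# Bit positions of the x87 FPU status word exception flags (bits 0-7).
# Only ERR_SUMM, STACK_FAULT and INVALID are labels taken from the log;
# the other labels are assumptions chosen to match that naming style.
FLAG_BITS = {
    0: "INVALID",      # IE: invalid operation
    1: "DENORMAL",     # DE: denormalized operand
    2: "ZERO_DIVIDE",  # ZE: division by zero
    3: "OVERFLOW",     # OE: numeric overflow
    4: "UNDERFLOW",    # UE: numeric underflow
    5: "PRECISION",    # PE: inexact result
    6: "STACK_FAULT",  # SF: FPU register stack fault
    7: "ERR_SUMM",     # ES: error summary
}


def decode_fpu_status(word):
    """Decode an x87 status word; bits above bit 15 are ignored (assumption)."""
    sw = word & 0xFFFF
    flags = [name for bit, name in sorted(FLAG_BITS.items(), reverse=True)
             if sw & (1 << bit)]
    return {
        "flags": flags,
        "c1": (sw >> 9) & 1,     # with SF set: 1 = stack overflow, 0 = underflow
        "top": (sw >> 11) & 7,   # top-of-stack pointer
        "busy": (sw >> 15) & 1,  # B flag
    }


info = decode_fpu_status(0xFFFF80C1)
print(info)
```

If the decode is right, the word carries the same three flags the app printed, and C1 = 0 together with STACK_FAULT would point to an FPU register stack underflow rather than an overflow. That narrows down the kind of invalid operation, though it says nothing about whether the cause is hardware or software.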

Einstein@Home Project

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 722917149

RAC: 1155043

Hi! The last time I saw

16 Dec 2007 18:05:45 UTC

Message 71180 in response to message 71179

Hi!

The last time I saw something similar was here.

The PC affected by this turned out to produce errors in another BOINC project (QMC) as well, so the most likely cause for this was a hardware failure.

This could well be the case for your PC as well. Is it overclocked, or aging?

Bikeman

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 984

Credit: 25171376

RAC: 43

Well, the affected machine is

18 Dec 2007 0:09:44 UTC

Message 71181 in response to message 71180

Well, the affected machine is neither overclocked nor aging. Obviously the latter depends on how you define aging: the mobo is just a few months old and the CPU (as well as the RAM) is 2-3 years old. Anyway, I ran cpuburn (http://pages.sbcglobal.net/redelm, K7 and MMX for one hour each) and memtest86+ (for 7.5 hours) without any noticeable crash or even a single error... Although this doesn't prove anything beyond doubt, it does suggest that the issue is probably not hardware-related. By the way, during the tests mentioned above the CPU never got over 66°C, whereas it runs at about 40°C when idle (incl. BOINC!). Maybe I should mention here that I use powernowd, which throttles (underclocks/undervolts) the CPU dynamically.

Please note: einstein successfully crunched quite a few WUs on this machine over three months, and I think this issue first occurred just a few weeks ago (beginning of December, I'd say). Since then not a single WU has been computed successfully...

Oliver

Quote:

Hi!

The last time I saw something similar was here.

The PC affected by this turned out to produce errors in another BOINC project (QMC) as well, so the most likely cause for this was a hardware failure.

This could well be the case for your PC as well. Is it overclocked, or aging?

CU

Bikeman

Einstein@Home Project

BackGroundMAN

Joined: 25 Feb 05

Posts: 58

Credit: 246736656

RAC: 0

Hi to all, I am running

23 Dec 2007 12:42:29 UTC

Message 71182

Hi to all,
I have been running BOINC and E@H for over a year on a dual Athlon XP machine with Gentoo Linux. Over the last few weeks, only 3 of 17 WUs completed without errors. Most of the WUs exit with error 99 and some with error 38. I have tried several BOINC clients (5.10.21, 5.10.28, 5.8.16, 5.4.11) with the same problems. The E@H client is S5R3_4.20_i686. There is no hardware failure, and the machine is stable and has been running for over 90 days. I stressed the machine with several kernel compiles (more than 20) and ran cpuburn for over 6 hours on each CPU. As you already know, Gentoo Linux is a source-based distribution, and every 3 or 4 days I update the system with several program compiles with no problems.

I think the problem is in the E@H client, and I would like your help to solve it.

Thank you, and sorry for my English...

Keith Jillings

Joined: 3 Sep 05

Posts: 20

Credit: 4668603

RAC: 0

I'm having the same problem

23 Dec 2007 18:03:26 UTC

Message 71183

I'm having the same problem as several have mentioned above - my machine spends hours crunching, and then something somewhere decides there's an error and the time is wasted. The most recent is 89819460 which is yet again "Client error / Compute error / 86,407.09 secs / 191.23 claimed score." That's 24 hours of my electricity bill wasted. That's just the last one. On checking, I find that's happened with six out of the last seven Einstein tasks processed. They run quite happily using "einstein_s5r3 version 415" - and then get thrown out.

Last time this happened, it was "my fault" (I even got flamed) for not knowing that I should have downloaded a new version of the Einstein software. Not that anyone bothered to e-mail me to tell me, of course. Not that there was any message anywhere to alert me. It came to light when I checked the graph of results and saw a long horizontal line. That time, it was something like 20 results in a row that had been rejected - but still the server sent new projects to the computer.

I process Einstein units voluntarily, out of goodwill, and at my own cost. It seems the goodwill is only one way. This "client" is now fed up with being blamed for "errors" that aren't notified unless I go looking for them. The server has my e-mail address - a little software work would allow it to send an automated message saying "your work units are failing - you need to do X or Y".

I don't have time to browse the web to try to find what "I'm" doing wrong now, so I'm taking the easy way out and stopping my machine wasting time and electricity on Einstein. I'm sure the other projects I process (none of which exhibit this problem) will make better use of my computer resources.

Keef, Essex or Norfolk, England

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 984

Credit: 25171376

RAC: 43

RE: I process Einstein

23 Dec 2007 21:02:19 UTC

Message 71184 in response to message 71183

Quote:

I process Einstein units voluntarily, out of goodwill, and at my own cost. It seems the goodwill is only one way. This "client" is now fed up with being blamed for "errors" that aren't notified unless I go looking for them. The server has my e-mail address - a little software work would allow it to send an automated message saying "your work units are failing - you need to do X or Y".

Hi Keith,

Who blames you for these errors? As you said, it's all voluntary - if you don't want to support a project then just don't do it ;-) On the other hand, there's no software without bugs and there's always room for improvement! I agree with you that it'd be nice to have some sort of notification that something went wrong (maybe only after crossing a user-defined threshold?), but have you filed a feature request with the BOINC project for that (as it's not an E@H-specific feature)? It's an open source project, so it's again up to you - even if you don't implement a feature, you could just let others know of your idea.
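To make the threshold idea concrete, here is a rough sketch of what such a check could look like. This is not an existing BOINC feature or API; `should_notify` and its parameters are hypothetical, and the rule used here (a streak of consecutive failures) is just one possible reading of "crossing a user-defined threshold".

```python
def should_notify(outcomes, threshold):
    """Return True if the most recent tasks form a failure streak of at
    least `threshold`.

    outcomes:  list of booleans, oldest first; True means the task succeeded.
    threshold: user-defined number of consecutive failures that triggers
               a notification (hypothetical setting).
    """
    streak = 0
    for ok in reversed(outcomes):
        if ok:
            break  # the streak ends at the most recent success
        streak += 1
    return streak >= threshold


# Hypothetical history: one success followed by six failures in a row.
recent = [True] + [False] * 6
print(should_notify(recent, threshold=5))
```

A streak rule stays quiet for the occasional bad work unit but fires for a host that fails every task. A failure-rate rule over a sliding window would be an equally reasonable choice.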

Just my two cents...

Einstein@Home Project

Oliver Behnke

Moderator

Administrator

Joined: 4 Sep 07

Posts: 984

Credit: 25171376

RAC: 43

OK, back to topic: I ran

23 Dec 2007 21:19:41 UTC

Message 71185 in response to message 71181

OK, back to topic:

I ran E@H on the affected machine (another one is still crunching fine with BOINC 5.4.11) for a sustained period in order to exclude frequent shutdowns/reboots from the list of potential root causes. No luck, WUs kept failing...

However, LHC and Rosetta are working fine without any glitches. Again this doesn't mean anything but considering the fact that all errors (all SIGFPE) occur at the same two code positions (see below), it makes me wonder if it's not just some mean bug.

GetSemiCohToplist at line 3173 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/HierarchicalSearch.c

LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
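One way to check whether failures really cluster at the same code positions is to tally the "at line N of file F" entries from the stderr output of several failed tasks. Below is a minimal sketch under the assumption that every LAL stack trace line has exactly the format quoted above; `crash_sites` and `LOCATION` are hypothetical names, and the sample text is simply the three lines from this post plus the `(null)` entry from the earlier log.

```python
import re
from collections import Counter

# Matches lines such as:
#   LocalComputeFStat at line 289 of file /path/to/LocalComputeFstat.c
# Entries like "(null) at line 0 of file (null)" do not match and are skipped.
LOCATION = re.compile(r"^(\w+) at line (\d+) of file (\S+)$")


def crash_sites(stderr_text):
    """Count (function, line, source file name) triples in stderr text."""
    sites = Counter()
    for line in stderr_text.splitlines():
        match = LOCATION.match(line.strip())
        if match:
            func, lineno, path = match.groups()
            filename = path.rsplit("/", 1)[-1]  # drop the build directory
            sites[(func, int(lineno), filename)] += 1
    return sites


PREFIX = ("/home/bema/einsteinathome/HierarchicalSearch/"
          "EaH_build_release_einstein_S5R3_4.20/extra_sources/"
          "lalapps-CVS/src/pulsar/hough/src2/")
sample = "\n".join([
    f"GetSemiCohToplist at line 3173 of file {PREFIX}HierarchicalSearch.c",
    f"LocalComputeFStatFreqBand at line 201 of file {PREFIX}LocalComputeFstat.c",
    f"LocalComputeFStat at line 289 of file {PREFIX}LocalComputeFstat.c",
    "(null) at line 0 of file (null)",
])

for site, count in crash_sites(sample).most_common():
    print(count, site)
```

Fed with the concatenated stderr of many failed results, a tally dominated by one or two entries would support the "same code positions" observation. A wide scatter would look more like random hardware trouble.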

Hope this helps,

Oliver

Quote:

Well, the affected machine is neither overclocked nor aging. Obviously the latter depends on how you define aging: the mobo is just a few months old and the CPU (as well as the RAM) is 2-3 years old. Anyway, I ran cpuburn (http://pages.sbcglobal.net/redelm, K7 and MMX for one hour each) and memtest86+ (for 7.5 hours) without any noticeable crash or even a single error... Although this doesn't prove anything beyond doubt, it does suggest that the issue is probably not hardware-related. By the way, during the tests mentioned above the CPU never got over 66°C, whereas it runs at about 40°C when idle (incl. BOINC!). Maybe I should mention here that I use powernowd, which throttles (underclocks/undervolts) the CPU dynamically.

Please note: einstein successfully crunched quite a few WUs on this machine over three months, and I think this issue first occurred just a few weeks ago (beginning of December, I'd say). Since then not a single WU has been computed successfully...

Einstein@Home Project
