Client Errors of S5R2/S5R3 Apps

KWSN Sir Clark
Joined: 26 Jun 05
Posts: 42
Credit: 1200171
RAC: 0

No problems reported so far with MemTest. The CPU is running at about 50°C or lower under load, so no overheating problems.

I managed to do a couple of units in November....

I'll try another and see if I get the same.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117627659573
RAC: 35211081

Message 71177 in response to message 71176

Quote:
I'll try another and see if I get the same.

Sounds good. If Memtest runs OK for a couple of hours, if CPU temp is <50C at full load and if you're not aggressively overclocking then perhaps it is one of the code 99 errors that Bernd will be interested in.

Cheers,
Gary.

KWSN Sir Clark
Joined: 26 Jun 05
Posts: 42
Credit: 1200171
RAC: 0

It must be one for Bernd to look at

I've just successfully completed another WU

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Hi guys,

For the last couple of WUs I've repeatedly got the same compute error on one of my machines (running BOINC 5.10.27):

-------snip--------------------------------------------

APP DEBUG: Application caught signal 8.

FPU status word ffff80c1, flags: ERR_SUMM STACK_FAULT INVALID
Obtained 7 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]
einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1849)[0x80ace69]
einstein_S5R3_4.20_i686-pc-linux-gnu(MAIN+0x352d)[0x80a495d]
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a5b34]
../../projects/einstein.phys.uwm.edu/einstein_S5R3_4.20_i686-pc-linux-gnu.so(_Z6foobarPv+0x14)[0xb7cd9e24]
/lib/libpthread.so.0[0xb7ed4383]
/lib/libc.so.6(clone+0x5e)[0xb7e5863e]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED

-------snip--------------------------------------------
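
The log suggests feeding the frame addresses to gdb with 'info line *0xADDRESS'. A minimal sketch of generating those commands from the backtrace follows; the function name `gdb_commands` and the regular expression are illustrative assumptions, not part of the E@H app or BOINC. Only frames that belong to the main binary are kept, since addresses inside shared libraries would need to be resolved differently.

```python
import re

# One backtrace frame: module, optional (symbol+offset), then [0xADDRESS].
FRAME = re.compile(
    r"^(?P<module>[^\s(\[]+)(?:\((?P<symbol>[^)]*)\))?\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)

def gdb_commands(backtrace, module_suffix):
    """Return 'info line' commands for frames whose module ends with module_suffix."""
    commands = []
    for line in backtrace.splitlines():
        match = FRAME.match(line.strip())
        if match and match.group("module").endswith(module_suffix):
            commands.append("info line *" + match.group("addr"))
    return commands

if __name__ == "__main__":
    sample = (
        "einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]\n"
        "einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1849)[0x80ace69]\n"
        "/lib/libc.so.6(clone+0x5e)[0xb7e5863e]\n"
    )
    for command in gdb_commands(sample, "einstein_S5R3_4.20_i686-pc-linux-gnu"):
        print(command)
```

The printed commands can then be pasted into a gdb session started on the same binary.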

There seems to be some floating-point exception. Any idea?
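
For reference, the FPU status word from the log can be decoded by hand. The sketch below assumes the standard x87 status word layout (exception flags in bits 0-5, stack fault in bit 6, error summary in bit 7, C1 in bit 9, TOP in bits 11-13, busy in bit 15) and looks only at the low 16 bits of the printed value. INVALID, STACK_FAULT and ERR_SUMM are the names the app prints; the other flag names and the function name `decode_x87_status` are my own labels.

```python
# Bit positions of the x87 status word flags (low byte).
X87_FLAGS = [
    (0, "INVALID"),
    (1, "DENORMAL"),
    (2, "ZERO_DIVIDE"),
    (3, "OVERFLOW"),
    (4, "UNDERFLOW"),
    (5, "PRECISION"),
    (6, "STACK_FAULT"),
    (7, "ERR_SUMM"),
]

def decode_x87_status(word):
    """Decode the low 16 bits of a printed FPU status word."""
    status = word & 0xFFFF
    return {
        "flags": [name for bit, name in X87_FLAGS if status & (1 << bit)],
        "c1": (status >> 9) & 0x1,     # with STACK_FAULT: 1 = overflow, 0 = underflow
        "top": (status >> 11) & 0x7,   # top-of-stack pointer
        "busy": bool(status & 0x8000),
    }

if __name__ == "__main__":
    print(decode_x87_status(0xFFFF80C1))
```

For the word in the log this gives the same three flags the app printed. With STACK_FAULT set and C1 = 0, my reading of the x87 documentation is that this is an FPU register stack underflow rather than an overflow, but treat that as an interpretation, not a diagnosis.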

Cheers, Oliver

Einstein@Home Project

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 727950019
RAC: 1225710

Message 71180 in response to message 71179

Hi!

The last time I saw something similar was here.

The PC affected by this turned out to produce errors in another BOINC project (QMC) as well, so the most likely cause for this was a hardware failure.

This could well be the case for your PC as well. Is it overclocked, or aging?

CU

Bikeman

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Message 71181 in response to message 71180

Well, the affected machine is neither overclocked nor aging. Obviously the latter depends on how you define aging - the mobo is just a few months old and the CPU (as well as the RAM) is 2-3 years old. Anyway, I ran cpuburn (http://pages.sbcglobal.net/redelm, K7 and MMX for one hour each) and memtest86+ (for 7.5 hours) without any noticeable crash or even a single error... Although this doesn't prove anything beyond doubt, it suggests that the issue is probably not hardware-related. By the way, during the tests mentioned above the CPU never got over 66°C, whereas it runs at about 40°C when idle (incl. BOINC!). Maybe I should mention here that I use powernowd, which throttles (underclocks/undervolts) the CPU dynamically.

Please note: einstein successfully crunched quite a few WUs on this machine over three months, and I think this issue first occurred just a few weeks ago (beginning of December, I'd say). Since then not a single WU has been computed successfully...

Oliver

BackGroundMAN
Joined: 25 Feb 05
Posts: 58
Credit: 246736656
RAC: 0

Hi to all,
I have been running BOINC and E@H for over a year on a dual Athlon XP machine with Gentoo Linux. Over the last few weeks, only 3 of 17 WUs completed without errors. Most of the WUs exit with error 99 and some with error 38. I have tried several BOINC clients (5.10.21, 5.10.28, 5.8.16, 5.4.11) with the same problems. The E@H client is S5R3_4.20_i686. There is no hardware failure; the machine is stable and has been running for over 90 days. I stressed the machine with several kernel compiles (more than 20) and ran cpuburn for over 6 hours on each CPU. As you already know, Gentoo Linux is a source-based distribution, and every 3 or 4 days I update the system, which involves compiling several programs, with no problems.

I think that the problem is in the E@H client and I would like your help to solve it.

Thank you, and sorry for my English...

Keith Jillings
Joined: 3 Sep 05
Posts: 20
Credit: 4668603
RAC: 0

I'm having the same problem as several have mentioned above - my machine spends hours crunching, and then something somewhere decides there's an error and the time is wasted. The most recent is 89819460 which is yet again "Client error / Compute error / 86,407.09 secs / 191.23 claimed score." That's 24 hours of my electricity bill wasted. That's just the last one. On checking, I find that's happened with six out of the last seven Einstein tasks processed. They run quite happily using "einstein_s5r3 version 415" - and then get thrown out.

Last time this happened, it was "my fault" (I even got flamed) for not knowing that I should have downloaded a new version of the Einstein software. Not that anyone bothered to e-mail me to tell me, of course. Not that there was any message anywhere to alert me. It came to light when I checked the graph of results and saw a long horizontal line. That time, it was something like 20 results in a row that had been rejected - but still the server sent new projects to the computer.

I process Einstein units voluntarily, out of goodwill, and at my own cost. It seems the goodwill is only one way. This "client" is now fed up with being blamed for "errors" that aren't notified unless I go looking for them. The server has my e-mail address - a little software work would allow it to send an automated message saying "your work units are failing - you need to do X or Y".

I don't have time to browse the web to try to find what "I'm" doing wrong now, so I'm taking the easy way out and stopping my machine wasting time and electricity on Einstein. I'm sure the other projects I process (none of which exhibit this problem) will make better use of my computer resources.

Keef, Essex or Norfolk, England

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Message 71184 in response to message 71183

Quote:
I process Einstein units voluntarily, out of goodwill, and at my own cost. It seems the goodwill is only one way. This "client" is now fed up with being blamed for "errors" that aren't notified unless I go looking for them. The server has my e-mail address - a little software work would allow it to send an automated message saying "your work units are failing - you need to do X or Y".

Hi Keith,

Who blames you for these errors? As you said, it's all voluntary - if you don't want to support a project then just don't do it ;-) On the other hand, there's no software without bugs and there's always room for improvement! I agree with you that it'd be nice to have some sort of notification that something went wrong (maybe only after crossing a user-defined threshold?), but have you filed a feature request with the BOINC project for that (as it's not an E@H-specific feature)? It's an open source project so it's again up to you - even if you don't implement a feature yourself, you could just let others know of your idea.
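
To make the threshold idea concrete, here is a rough sketch of what such a check could look like. It is purely hypothetical: `should_notify` and its parameters are made up for illustration and don't correspond to anything in the BOINC client or server code. It only answers the question "have the most recent results failed often enough in a row to warrant a message?".

```python
def should_notify(outcomes, threshold=3):
    """Decide whether to alert the user about failing work units.

    outcomes  -- list of booleans, oldest first, True meaning the task succeeded
    threshold -- user-defined number of consecutive recent failures that triggers an alert
    """
    streak = 0
    for succeeded in reversed(outcomes):
        if succeeded:
            break          # the failure streak ends at the latest success
        streak += 1
    return streak >= threshold

if __name__ == "__main__":
    history = [True, True, False, False, False]
    if should_notify(history, threshold=3):
        print("Your work units are failing - please check the message boards.")
```

Counting only consecutive failures is one possible design choice; a ratio over the last N results would be another.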

Just my two cents...

Oliver Behnke
Moderator
Administrator
Joined: 4 Sep 07
Posts: 984
Credit: 25171438
RAC: 34

Message 71185 in response to message 71181

OK, back to topic:

I ran E@H on the affected machine (another one is still crunching fine with BOINC 5.4.11) for a sustained period in order to exclude frequent shutdowns/reboots from the list of potential root causes. No luck, WUs kept failing...

However, LHC and Rosetta are working fine without any glitches. Again this doesn't mean anything but considering the fact that all errors (all SIGFPE) occur at the same two code positions (see below), it makes me wonder if it's not just some mean bug.

GetSemiCohToplist at line 3173 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/HierarchicalSearch.c

LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
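
In case anyone wants to check their own failed results for the same pattern, the comparison can be sketched roughly as below. It assumes you have saved the stderr text of each failed result as a string; the function name `tally_locations` and the regular expression are my own and only match the "X at line N of file PATH" lines of the LAL stack trace.

```python
import re
from collections import Counter

# Matches e.g. "LocalComputeFStat at line 289 of file /path/to/LocalComputeFstat.c"
LOCATION = re.compile(r"^(\w+) at line (\d+) of file (\S+)$")

def tally_locations(stderr_texts):
    """Count how often each (function, line, source file) appears across stderr dumps."""
    counts = Counter()
    for text in stderr_texts:
        for line in text.splitlines():
            match = LOCATION.match(line.strip())
            if match:
                func, lineno, path = match.groups()
                # Keep only the file name so differing build paths compare equal.
                counts[(func, int(lineno), path.rsplit("/", 1)[-1])] += 1
    return counts

if __name__ == "__main__":
    dump = (
        "Stack trace of LAL functions in worker thread:\n"
        "LocalComputeFStat at line 289 of file /tmp/build/LocalComputeFstat.c\n"
        "(null) at line 0 of file (null)\n"
    )
    for location, count in tally_locations([dump, dump]).most_common():
        print(count, location)
```

If the same one or two locations dominate the tally across many results and hosts, that would be consistent with a bug rather than random hardware faults, although it doesn't prove it.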

Hope this helps,

Oliver
