No problems reported so far with MemTest. The CPU is running at about 50°C or lower under load, so no overheating problems.
I managed to do a couple of units in November....
I'll try another and see if I get the same.
Sounds good. If Memtest runs OK for a couple of hours, if CPU temp is <50C at full load and if you're not aggressively overclocking then perhaps it is one of the code 99 errors that Bernd will be interested in.
Cheers,
Gary.
It must be one for Bernd to look at
I've just successfully completed another WU
Hi guys,
For the last couple of WUs I've repeatedly got the same compute error on one of my machines (running BOINC 5.10.27):
-------snip--------------------------------------------
APP DEBUG: Application caught signal 8.
FPU status word ffff80c1, flags: ERR_SUMM STACK_FAULT INVALID
Obtained 7 stack frames for this thread.
Use gdb command: 'info line *0xADDRESS' to print corresponding line numbers.
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]
einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1849)[0x80ace69]
einstein_S5R3_4.20_i686-pc-linux-gnu(MAIN+0x352d)[0x80a495d]
einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a5b34]
../../projects/einstein.phys.uwm.edu/einstein_S5R3_4.20_i686-pc-linux-gnu.so(_Z6foobarPv+0x14)[0xb7cd9e24]
/lib/libpthread.so.0[0xb7ed4383]
/lib/libc.so.6(clone+0x5e)[0xb7e5863e]
Stack trace of LAL functions in worker thread:
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
(null) at line 0 of file (null)
At lowest level status code = 0, description: NO LAL ERROR REGISTERED
-------snip--------------------------------------------
There seems to be some floating-point exception. Any idea?
Cheers, Oliver
Einstein@Home Project
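For anyone wondering what the "FPU status word" line in the trace above means, here is a minimal sketch that decodes it. It assumes the value is the 16-bit x87 status word with the upper 16 bits being padding, and it uses the bit layout from the Intel manuals. The names ERR_SUMM, STACK_FAULT and INVALID come from the trace; the other flag names and the function name are made up for illustration and are not taken from the Einstein@Home code.

```python
# Bit positions of the x87 FPU status word (low 16 bits).
# Only ERR_SUMM, STACK_FAULT and INVALID appear in the trace;
# the remaining names are illustrative labels (assumption).
X87_STATUS_FLAGS = [
    (0, "INVALID"),      # IE: invalid operation
    (1, "DENORMAL"),     # DE: denormalized operand
    (2, "ZERO_DIVIDE"),  # ZE: divide by zero
    (3, "OVERFLOW"),     # OE: numeric overflow
    (4, "UNDERFLOW"),    # UE: numeric underflow
    (5, "PRECISION"),    # PE: inexact result
    (6, "STACK_FAULT"),  # SF: x87 register stack fault
    (7, "ERR_SUMM"),     # ES: error summary
]


def decode_x87_status(word):
    """Decode an x87 status word; bits above 15 are ignored (assumed padding)."""
    sw = word & 0xFFFF
    return {
        "flags": [name for bit, name in X87_STATUS_FLAGS if sw & (1 << bit)],
        "c1": (sw >> 9) & 1,      # condition code C1
        "top": (sw >> 11) & 0x7,  # top-of-stack pointer
        "busy": (sw >> 15) & 1,   # B flag
    }


info = decode_x87_status(0xFFFF80C1)
print(info)
```

For ffff80c1 this gives the same three flags the application printed, with C1 clear. As far as I read the Intel documentation, STACK_FAULT together with INVALID and C1 = 0 points to an x87 register stack underflow rather than a bad arithmetic result, but treat that as a hint and not a diagnosis.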
Hi!
The last time I saw something similar was here.
The PC affected by this turned out to produce errors in another BOINC project (QMC) as well, so the most likely cause for this was a hardware failure.
This could well be the case for your PC as well. Is it overclocked, or aging?
CU
Bikeman
Well, the affected machine is neither overclocked nor aging. Obviously the latter depends on how you define aging - the mobo is just a few months old and the CPU (as well as the RAM) is 2-3 years old. Anyway, I ran cpuburn (http://pages.sbcglobal.net/redelm, K7 and MMX for one hour each) and memtest86+ (for 7.5 hours) without any noticeable crash or even a single error... Although this doesn't really prove anything beyond doubt, it might show, however, that this issue's probably not hardware-related. By the way, during the tests mentioned above the CPU never got over 66°C whereas the CPU runs at about 40°C when idle (incl. BOINC!). Maybe I should mention here that I use powernowd which throttles (underclocks/undervoltages) the CPU dynamically.
Please note: Einstein successfully crunched quite a few WUs on this machine over three months, and I think this issue first occurred just a few weeks ago (beginning of December, I'd say). Since then not a single WU has been computed successfully...
Oliver
Einstein@Home Project
Hi to all,
I have been running BOINC and E@H for over a year on a dual Athlon XP machine with Gentoo Linux. Over the last few weeks, only 3 of 17 WUs completed without errors. Most of the WUs exit with error 99 and some with error 38. I have tried several BOINC clients (5.10.21, 5.10.28, 5.8.16, 5.4.11) with the same result. The E@H client is S5R3_4.20_i686. There is no hardware failure, and the machine has been stable and running for over 90 days. I stressed the machine with several kernel compiles (more than 20) and ran cpuburn for over 6 hours on each CPU. As you probably know, Gentoo Linux is a source-based distribution, and every 3 or 4 days I update the system, which compiles several programs, with no problems.
I think that the problem is in the E@H client, and I would like your help to solve it.
Thank you and sorry for my English...
I'm having the same problem as several have mentioned above - my machine spends hours crunching, and then something somewhere decides there's an error and the time is wasted. The most recent is 89819460 which is yet again "Client error / Compute error / 86,407.09 secs / 191.23 claimed score." That's 24 hours of my electricity bill wasted. That's just the last one. On checking, I find that's happened with six out of the last seven Einstein tasks processed. They run quite happily using "einstein_s5r3 version 415" - and then get thrown out.
Last time this happened, it was "my fault" (I even got flamed) for not knowing that I should have downloaded a new version of the Einstein software. Not that anyone bothered to e-mail me to tell me, of course. Not that there was any message anywhere to alert me. It came to light when I checked the graph of results and saw a long horizontal line. That time, it was something like 20 results in a row that had been rejected - but still the server sent new projects to the computer.
I process Einstein units voluntarily, out of goodwill, and at my own cost. It seems the goodwill is only one way. This "client" is now fed up with being blamed for "errors" that aren't notified unless I go looking for them. The server has my e-mail address - a little software work would allow it to send an automated message saying "your work units are failing - you need to do X or Y".
I don't have time to browse the web to try to find what "I'm" doing wrong now, so I'm taking the easy way out and stopping my machine wasting time and electricity on Einstein. I'm sure the other projects I process (none of which exhibit this problem) will make better use of my computer resources.
Keef, Essex or Norfolk, England
Hi Keith,
Who blames you for these errors? As you said, it's all voluntary - if you don't want to support a project then just don't do it ;-) On the other hand there's no software without bugs and there's always room for improvement! I agree with you that it'd be nice to have some sort of notification that something went wrong (maybe only after crossing a user-defined threshold?), but have you filed a feature request with the BOINC project for that (as it's not an E@H-specific feature)? It's an open source project so it's again up to you - even if you don't implement a feature you could just let others know of your idea.
Just my two cents...
Einstein@Home Project
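To make the "user-defined threshold" idea a bit more concrete, here is a rough sketch of the check such a notification could be based on. Nothing in it is BOINC API: the function name, the parameters and their defaults are all hypothetical, and how the list of recent task outcomes would be obtained is left open.

```python
def should_notify(outcomes, threshold=0.5, min_tasks=5):
    """Return True if the share of failed tasks reaches the threshold.

    outcomes:  list of booleans for the most recent tasks (True = success).
    threshold: user-defined failure fraction that triggers a notification.
    min_tasks: do not notify before this many results exist (hypothetical knob).
    """
    if len(outcomes) < min_tasks:
        return False
    failures = sum(1 for ok in outcomes if not ok)
    return failures / len(outcomes) >= threshold


# The case described above: six of the last seven tasks failed.
recent = [False] * 6 + [True]
print(should_notify(recent))
```

The min_tasks guard is only there so that one or two unlucky results on a new host would not trigger a message; whether that is the right trade-off would be for the feature request to settle.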
OK, back to topic:
I ran E@H on the affected machine (another one is still crunching fine with BOINC 5.4.11) for a sustained period in order to exclude frequent shutdowns/reboots from the list of potential root causes. No luck, WUs kept failing...
However, LHC and Rosetta are working fine without any glitches. Again this doesn't mean anything but considering the fact that all errors (all SIGFPE) occur at the same two code positions (see below), it makes me wonder if it's not just some mean bug.
GetSemiCohToplist at line 3173 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/HierarchicalSearch.c
LocalComputeFStatFreqBand at line 201 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
LocalComputeFStat at line 289 of file /home/bema/einsteinathome/HierarchicalSearch/EaH_build_release_einstein_S5R3_4.20/extra_sources/lalapps-CVS/src/pulsar/hough/src2/LocalComputeFstat.c
Hope this helps,
Oliver
Einstein@Home Project
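The application output quoted earlier in this thread suggests the gdb command 'info line *0xADDRESS' for mapping frame addresses to source lines. As a small convenience, the sketch below pulls the addresses out of backtrace lines in that format and prints the matching gdb commands. The regular expression and the function name are my own assumptions about the format, based only on the frames posted here. Frames from shared libraries are skipped because their addresses depend on where the library was loaded, and useful line numbers would presumably still require a binary with debug information.

```python
import re

# One backtrace frame as posted in this thread: module(symbol+offset)[address],
# where the "(symbol+offset)" part is optional.
FRAME_RE = re.compile(
    r"^(?P<module>[^\s(\[]+)"
    r"(?:\((?P<symbol>[^)]*)\))?"
    r"\[(?P<addr>0x[0-9a-fA-F]+)\]$"
)


def gdb_info_line_commands(frames, executable):
    """Build gdb 'info line' commands for frames that belong to `executable`."""
    commands = []
    for frame in frames:
        match = FRAME_RE.match(frame.strip())
        if match and match.group("module") == executable:
            commands.append("info line *" + match.group("addr"))
    return commands


frames = [
    "einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a4b9e]",
    "einstein_S5R3_4.20_i686-pc-linux-gnu(LocalComputeFStatFreqBand+0x1849)[0x80ace69]",
    "einstein_S5R3_4.20_i686-pc-linux-gnu(MAIN+0x352d)[0x80a495d]",
    "einstein_S5R3_4.20_i686-pc-linux-gnu[0x80a5b34]",
    "/lib/libpthread.so.0[0xb7ed4383]",
    "/lib/libc.so.6(clone+0x5e)[0xb7e5863e]",
]

commands = gdb_info_line_commands(frames, "einstein_S5R3_4.20_i686-pc-linux-gnu")
for command in commands:
    print(command)
```

The printed lines can be pasted into a gdb session opened on the Einstein executable.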