5th Computing error for S5R2 on one host

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

No Luck :-(

Just had the 6th one error out. They never fail after a few minutes (or hours); this one died after 11 hours of crunching.

I have never before considered moving to a different project, but if this keeps up, I will be left with no choice. It's one thing to ask me to donate computer cycles for a good cause (which I gladly do); it's another to ask me to use my computer as a space heater for my room.

Quote:
I'm going to remove the execute permission for this library and see what breaks/works :)

Please keep us posted on your results, hope it works!

There are 10^11 stars in the galaxy. That used to be a huge number. But it's only a hundred billion. It's less than the national deficit! We used to call them astronomical numbers. Now we should call them economical numbers. - Richard Feynman

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686138814
RAC: 555564

Message 63005 in response to message 63003

Quote:

G'day Bikeman

After conducting a quick survey of the distros on my boxes, one thing stood out: the problem box, an AthlonXP 2400 running FC6 (32-bit), has the executable bit set on libGL.so.1

./libGL.so.1 throws up a Segmentation Fault message

In my other distros across various boxes, libGL.so.1 is either non-existent or not executable.

I'm going to remove the execute permission for this library and see what breaks/works :)

I don't think it's the permission bit. Anyway, this is a shared library (like a Windows DLL) and can't be executed on its own.

But the problem might be a different version / vendor of libGL.so. Also make sure you follow symbolic links; for example, this is what ls -l /usr/lib/libGL*.so.* looks like on one of my systems:

lrwxrwxrwx 1 root root 10 2006-05-04 18:46 /usr/lib/libGL.so -> libGL.so.1
lrwxrwxrwx 1 root root 12 2006-05-04 18:46 /usr/lib/libGL.so.1 -> libGL.so.1.2
-rw-r--r-- 1 root root 447560 2006-03-20 09:16 /usr/lib/libGL.so.1.2

So in the end libGL.so.1.2 is used. It doesn't have the x bit set, but Einstein has problems nevertheless (until I prevented libGL from being loaded at all, see my post above).
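For anyone who wants to repeat this check on several boxes, here is a minimal Python sketch that follows the symlink chain hop by hop and reports whether the final file has any x bit set. The function name `describe_library` and the hop limit are my own choices for illustration, and the path in the demo is simply the one from the listing above; it may live elsewhere on your distro.

```python
import os
import stat


def describe_library(path, max_hops=40):
    """Follow symbolic links from `path`; return (chain, executable).

    `chain` lists every path visited, `executable` is True if the final
    file has any execute bit set. The name and max_hops are illustrative.
    """
    chain = [path]
    current = path
    hops = 0
    # Walk the chain one link at a time so each hop stays visible.
    while os.path.islink(current):
        if hops >= max_hops:
            raise OSError("too many symbolic links: " + path)
        target = os.readlink(current)
        # Relative targets are resolved against the directory of the link.
        current = os.path.normpath(
            os.path.join(os.path.dirname(current), target))
        chain.append(current)
        hops += 1
    mode = os.stat(current).st_mode
    executable = bool(mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH))
    return chain, executable


if __name__ == "__main__":
    lib = "/usr/lib/libGL.so"  # assumed location, taken from the listing above
    if os.path.exists(lib):
        chain, executable = describe_library(lib)
        print(" -> ".join(chain))
        print("x bit set:", executable)
```

This only tells you which file is finally loaded and what its permissions are. It says nothing about the vendor or version of the library, which is the more likely suspect.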

CU

BRM

rudylis
Joined: 18 Jan 05
Posts: 5
Credit: 73932522
RAC: 8905

Message 63006 in response to message 62987

Quote:
IIRC I've seen win machines erroring out too.

Hi
I have 2 machines producing errors with WUs.
One runs Vista Premium, the other Slackware Linux (custom-compiled kernel), but both are based on similar hardware.
So I think the clue is rather in the Nvidia chipset, especially the 420 and 430 southbridges.
I've observed several errors on these machines before, but recently, with S5R2, almost every WU returns an error.

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

Message 63007 in response to message 62993

Quote:
Quote:
That couldn't have been the reason for my WU to crash. I'm completely sure I didn't pause/resume that. Maybe this can trigger SIGABRT errors, but it can't be the only thing that causes them...
...

Hi!
SIGABRT seems to happen sometimes on *any* interruption/suspension of the science client:

- manual update
- manual suspend
- automatic interruption to perform benchmark
- automatic switch to other project if E@H isn't your only one
- power-down
- disconnecting the power supply and running on batteries (if configured this way) on notebooks
- ....

CU

BRM

You were very right, the moment my WU crashed, the BOINC client was contacting the scheduler trying to get new work. Interesting.

@rudylis: I don't really think so. I also got one of those errors (I count myself lucky it was only one) and my PC has a VIA chipset.

Hartmut Geissbauer
Joined: 5 Jan 06
Posts: 31
Credit: 152941307
RAC: 0

I discovered the same reason for an error.

2007-05-03 16:27:03 [Einstein@Home] Sending scheduler request: To fetch work
2007-05-03 16:27:03 [Einstein@Home] Requesting 33 seconds of new work, and reporting 1 completed tasks
2007-05-03 16:27:08 [Einstein@Home] Scheduler RPC succeeded [server version 509]
2007-05-03 16:27:08 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-03 16:27:08 [Einstein@Home] Reason: requested by project
2007-05-03 16:27:10 [Einstein@Home] Deferring communication for 1 min 0 sec
2007-05-03 16:27:10 [Einstein@Home] Reason: Unrecoverable error for result h1_0294.95_S5R2__150_S5R2c_2 (process got signal 11)

M. Schmitt
Joined: 27 Jun 05
Posts: 478
Credit: 15872262
RAC: 0

Same here, but the second request for work didn't cause an error:

03/05/2007 18:33|Einstein@Home|Sending scheduler request: To fetch work
03/05/2007 18:33|Einstein@Home|Requesting 52 seconds of new work, and reporting 1 completed tasks
03/05/2007 18:33|Einstein@Home|Scheduler RPC succeeded [server version 509]
03/05/2007 18:33|Einstein@Home|Deferring communication for 1 min 0 sec
03/05/2007 18:33|Einstein@Home|Reason: requested by project
03/05/2007 18:33|Einstein@Home|Deferring communication for 1 min 0 sec
03/05/2007 18:33|Einstein@Home|Reason: Unrecoverable error for result h1_0243.60_S5R2__40_S5R2c_0 (process got signal 11)
03/05/2007 18:33|Einstein@Home|Computation for task h1_0243.60_S5R2__40_S5R2c_0 finished
03/05/2007 18:33|Einstein@Home|Output file h1_0243.60_S5R2__40_S5R2c_0_0 for task h1_0243.60_S5R2__40_S5R2c_0 absent
03/05/2007 18:33|Einstein@Home|Starting h1_0243.60_S5R2__28_S5R2c_0
03/05/2007 18:33|Einstein@Home|Starting task h1_0243.60_S5R2__28_S5R2c_0 using einstein_S5R2 version 418
03/05/2007 18:35|Einstein@Home|Sending scheduler request: To fetch work
03/05/2007 18:35|Einstein@Home|Requesting 79 seconds of new work, and reporting 1 completed tasks
03/05/2007 18:35|Einstein@Home|Scheduler RPC succeeded [server version 509]
03/05/2007 18:35|Einstein@Home|Deferring communication for 1 min 0 sec
03/05/2007 18:35|Einstein@Home|Reason: requested by project

Anyway, the next host that loses more than 2h of work to one of these ugly SIGABRT errors will leave this project.
This app is not even beta; it's just alpha and should therefore only be tested by a small group of volunteers.

cu
Michael

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

No, it doesn't happen every time with every WU. My desktop is a single core with a 1-day-cache, so with the largest WUs taking about 28 hours it has to contact the servers about daily, and still all WUs except this one made it in one piece. Still, IF something happens, it's a very likely situation.

Lisandro Firman
Joined: 17 May 06
Posts: 22
Credit: 49004
RAC: 0

I have lost 85,000 seconds of work... a whole day because of E@H crashes...

Dave Burbank
Joined: 30 Jan 06
Posts: 275
Credit: 1548376
RAC: 0

A 6th WU has failed on this host just after contacting the server. I'm going to suspend network activity on this host and see if that does anything. If another WU fails, I'm going to have to find a secondary project until all of this is sorted out [keeping his fingers crossed].

EDIT : Just as I suspended network activity the host contacted the server, and the WU kept on crunching... beats me.

Voyager
Joined: 8 May 05
Posts: 6
Credit: 155181553
RAC: 196271

Just had my third WU fail on a win/intel machine.
They all end with the same text:

===== WARNING: XLALComputeFaFb() should not be used with upsampled-SFTs!
XLAL Error - XLALComputeFaFb (ComputeFstat.c:390): Invalid argument
Level 0: $Id: HierarchicalSearch.c,v 1.164 2007/04/25 14:31:47 bema Exp $
Function call `COMPUTEFSTATFREQBAND ( &status, fstatVector.data + k, &thisPoint, stackMultiSFT.data[k], stackMultiNoiseWeights.data[k], stackMultiDetStates.data[k], &CFparams)' failed.
file HierarchicalSearch.c, line 1010
2007-05-06 09:24:59.2656 [normal]:
Level 1: $Id: ComputeFstat.c,v 1.67 2007/02/26 14:15:43 reinhard Exp $
2007-05-06 09:24:59.2656 [normal]: Status code -1: Recursive error
2007-05-06 09:24:59.2656 [normal]: function ComputeFStatFreqBand, file ComputeFstat.c, line 151
2007-05-06 09:24:59.2656 [normal]:
Level 2: $Id: ComputeFstat.c,v 1.67 2007/02/26 14:15:43 reinhard Exp $
2007-05-06 09:24:59.2656 [normal]: Status code 5: XLAL function call failed
2007-05-06 09:24:59.2656 [normal]: function ComputeFStat, file ComputeFstat.c, line 286
2007-05-06 09:24:59.2656 [CRITICAL]: BOINC_LAL_ErrHand(): now calling boinc_finish()
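If you want to compare several failed WUs quickly, something like the following Python sketch could pull the originating XLAL error and the reported call frames out of the stderr text. The two regular expressions are written against the excerpt above only, so other error layouts may slip through; the function name `summarize_stderr` is made up for this example.

```python
import re

# "XLAL Error - XLALComputeFaFb (ComputeFstat.c:390): Invalid argument"
XLAL_ERROR = re.compile(r"XLAL Error - (\w+) \(([^:()]+):(\d+)\): (.+)")
# "function ComputeFStat, file ComputeFstat.c, line 286"
FRAME = re.compile(r"function (\w+), file ([\w.]+), line (\d+)")


def summarize_stderr(text):
    """Return (origin, frames) extracted from a stderr dump.

    origin: (function, file, line, message) of the first XLAL error, or None.
    frames: list of (function, file, line) in the order they were logged.
    """
    origin = None
    frames = []
    for line in text.splitlines():
        m = XLAL_ERROR.search(line)
        if m and origin is None:
            origin = (m.group(1), m.group(2), int(m.group(3)),
                      m.group(4).strip())
        m = FRAME.search(line)
        if m:
            frames.append((m.group(1), m.group(2), int(m.group(3))))
    return origin, frames


SAMPLE = """\
XLAL Error - XLALComputeFaFb (ComputeFstat.c:390): Invalid argument
2007-05-06 09:24:59.2656 [normal]: function ComputeFStatFreqBand, file ComputeFstat.c, line 151
2007-05-06 09:24:59.2656 [normal]: function ComputeFStat, file ComputeFstat.c, line 286
"""

if __name__ == "__main__":
    origin, frames = summarize_stderr(SAMPLE)
    print("origin:", origin)
    for frame in frames:
        print("frame:", frame)
```

If every failed WU shows the same origin tuple, that at least suggests they all die in the same place.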

Does anybody know what causes this failure?
My win/amd machine has not had any errors, so a win/intel problem maybe.
