Client Errors of S5R2/S5R3 Apps

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 727422027
RAC: 1224352

RE: My first S5R3 result

Message 71136 in response to message 71135

Quote:

My first S5R3 result failed with signal 11

http://einsteinathome.org/task/87088144

Thank you very much indeed for your error report. This confirms the suspicion that there's something wrong with the screensaver code.

Yeah, I know, the Linux app doesn't have a screensaver. Don't tell me, tell the app :-) :

Quote:

GLUT: Fatal Error in screensaver: could not open display: :1.0

Bikeman

Mike Francis
Joined: 18 Mar 06
Posts: 4
Credit: 6564723
RAC: 0

I recently was sent 6 units

I was recently sent 6 units to crunch.
2 were successful; 1 was a compute error when sent in, and I don't know what the problem was.
3 were in the process of running, at various stages.
When BOINC switched over to those 3 units, at about the same time, they all went bad.
I am running 5.10.20. No problems with the other projects I run.

This is what I got as messages.

9/24/2007 6:09:54 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__135_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:09:54 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__135_S5R2c_1 finished
9/24/2007 6:09:54 PM|Einstein@Home|Output file h1_0535.50_S5R2__135_S5R2c_1_0 for task h1_0535.50_S5R2__135_S5R2c_1 absent
9/24/2007 6:09:54 PM|Einstein@Home|Restarting task h1_0535.50_S5R2__123_S5R2c_1 using einstein_S5R2 version 438
9/24/2007 6:09:57 PM|Einstein@Home|Deferring communication for 1 min 0 sec
9/24/2007 6:09:57 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__123_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:09:57 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__123_S5R2c_1 finished
9/24/2007 6:09:57 PM|Einstein@Home|Output file h1_0535.50_S5R2__123_S5R2c_1_0 for task h1_0535.50_S5R2__123_S5R2c_1 absent
9/24/2007 6:09:57 PM|Einstein@Home|Restarting task h1_0535.50_S5R2__106_S5R2c_1 using einstein_S5R2 version 438
9/24/2007 6:10:03 PM|Einstein@Home|Deferring communication for 1 min 0 sec
9/24/2007 6:10:03 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__106_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:10:03 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__106_S5R2c_1 finished

Hope this helps
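
For anyone triaging a message log like the one above, here is a minimal Python sketch that pulls the task name and exit code out of each "Unrecoverable error" line. It is not part of BOINC; the function name `failed_tasks` and the regular expression are assumptions based only on the line format quoted in this post.

```python
import re

# Pattern assumed from the log lines quoted in this thread:
# <timestamp>|<project>|Reason: Unrecoverable error for result <task> ( - exit code <dec> (0x<hex>))
ERROR_RE = re.compile(
    r"^(?P<when>[^|]*)\|(?P<project>[^|]*)\|"
    r"Reason: Unrecoverable error for result (?P<task>\S+) "
    r"\( - exit code (?P<code>-?\d+) \(0x(?P<hex>[0-9a-fA-F]+)\)\)"
)

def failed_tasks(lines):
    """Return (task, exit_code) pairs for every unrecoverable-error line."""
    found = []
    for line in lines:
        match = ERROR_RE.match(line.strip())
        if match:
            found.append((match.group("task"), int(match.group("code"))))
    return found

# Two lines copied from the log above: one error line, one ordinary line.
sample = [
    "9/24/2007 6:09:54 PM|Einstein@Home|Reason: Unrecoverable error for result "
    "h1_0535.50_S5R2__135_S5R2c_1 ( - exit code 99 (0x63))",
    "9/24/2007 6:09:54 PM|Einstein@Home|Computation for task "
    "h1_0535.50_S5R2__135_S5R2c_1 finished",
]
print(failed_tasks(sample))
```

Grouping the pairs by exit code makes it easy to see that all the failures in a log share one status, as they do here.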

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250505763
RAC: 34713

RE: Bernd wrote that 'exit

Message 71138 in response to message 71134

Quote:
Bernd wrote that 'exit code 10' is mostly related to disk failures. But his result file has a line which looks very strange...


This is definitely disk corruption; it even affects the file that the stderr output is kept in.

BM


Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250505763
RAC: 34713

RE: I recently was sent 6

Message 71139 in response to message 71137

Quote:

I was recently sent 6 units to crunch.
2 were successful; 1 was a compute error when sent in, and I don't know what the problem was.
3 were in the process of running, at various stages.
When BOINC switched over to those 3 units, at about the same time, they all went bad.
I am running 5.10.20. No problems with the other projects I run.

This is what I got as messages.

9/24/2007 6:09:54 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__135_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:09:54 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__135_S5R2c_1 finished
9/24/2007 6:09:54 PM|Einstein@Home|Output file h1_0535.50_S5R2__135_S5R2c_1_0 for task h1_0535.50_S5R2__135_S5R2c_1 absent
9/24/2007 6:09:54 PM|Einstein@Home|Restarting task h1_0535.50_S5R2__123_S5R2c_1 using einstein_S5R2 version 438
9/24/2007 6:09:57 PM|Einstein@Home|Deferring communication for 1 min 0 sec
9/24/2007 6:09:57 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__123_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:09:57 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__123_S5R2c_1 finished
9/24/2007 6:09:57 PM|Einstein@Home|Output file h1_0535.50_S5R2__123_S5R2c_1_0 for task h1_0535.50_S5R2__123_S5R2c_1 absent
9/24/2007 6:09:57 PM|Einstein@Home|Restarting task h1_0535.50_S5R2__106_S5R2c_1 using einstein_S5R2 version 438
9/24/2007 6:10:03 PM|Einstein@Home|Deferring communication for 1 min 0 sec
9/24/2007 6:10:03 PM|Einstein@Home|Reason: Unrecoverable error for result h1_0535.50_S5R2__106_S5R2c_1 ( - exit code 99 (0x63))
9/24/2007 6:10:03 PM|Einstein@Home|Computation for task h1_0535.50_S5R2__106_S5R2c_1 finished

Hope this helps


Though they all failed with exit status 99, at least two of the four tasks failed with completely different symptoms: one with an error reading the data files (though your client should check their integrity (md5sum) before starting the App), and the other with what looks like a programming error (NULL pointer) that apparently nobody else has stumbled over yet. I couldn't get anything useful out of the stderr output of the other two, as the actual message was truncated.

My guess is that your memory went faulty right at the moment when the first crash happened. I'd suggest running a memory checker.
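
As an illustration of the integrity check (md5sum) mentioned above, here is a minimal Python sketch that compares a data file against a known MD5 checksum. It is not how the BOINC client does this internally; the function names are hypothetical, and where the expected checksum comes from is left as an assumption.

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def file_is_intact(path, expected_md5):
    """Return True if the file's MD5 matches the expected hex digest."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If a file passes this check on disk but the App still fails while reading it, the corruption is more likely happening in memory than on the disk, which fits the suggestion to run a memory checker.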

BM


Mike Francis
Joined: 18 Mar 06
Posts: 4
Credit: 6564723
RAC: 0

Hi Bernd, Thanks for the

Hi Bernd,

Thanks for the reply, I will do a memory check ASAP.
Like I said, I've had no problems with my other projects, and I have run two S3 units with no problem. I'm thinking it might be that all three S2 units started at the same time and, with the large size, there was a FUBAR.

Mike F,

josep
Joined: 9 Mar 05
Posts: 63
Credit: 1156542
RAC: 0

My two firsts S5R3 WU's have

My first two S5R3 WUs completed successfully on my Duron 1600 running OpenSUSE Linux, with credit granted...

...but the third one finished with a compute error, exit status 11 (0xb). I had no client errors at all with S5R2.

Here are my results:

http://einsteinathome.org/account/tasks

Any idea what it means?

Before this failed WU was reported, my DSL internet connection went down, so I had to reset it manually. In BoincView's "Messages" screen I have lots of messages like this:

> Requesting 8640 seconds of new work, and reporting 1 completed tasks
> Sending scheduler request: To report completed tasks
> Reason: scheduler request failed
> Deferring communication for 1 min 0 sec
> Scheduler request failed: couldn't resolve host name

After I reset the DSL line, BOINC reported the task and downloaded new work, which is now running.

I suppose that the temporary lack of Internet connection should not cause a compute error...

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 76

A "process got signal 11"

Message 71142 in response to message 71141

A "process got signal 11" error is a segmentation fault, which can be caused by anything from a wrong bit in memory to a problem with the CPU. Yet seeing that you otherwise return flawless results, just consider this a bug in that one result.
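
To look up what a signal number means on your own machine, here is a small Python sketch using the standard `signal` module. Signal numbering is platform-specific; on Linux, 11 is SIGSEGV. The helper name `describe_signal` is just for illustration.

```python
import signal

def describe_signal(number):
    """Map a signal number to its symbolic name on this platform."""
    try:
        return signal.Signals(number).name
    except ValueError:
        # The number is not a signal known to this platform.
        return "unknown signal %d" % number

print(describe_signal(11))
```

The name only tells you how the process died (an invalid memory access), not why; the cause can still be either hardware or software.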

josep
Joined: 9 Mar 05
Posts: 63
Credit: 1156542
RAC: 0

Thank you for your answer. A


Thank you for your answer. A fourth result has now been completed and successfully validated, so everything seems OK.

ccpilla
Joined: 4 Jun 05
Posts: 6
Credit: 133109
RAC: 0

RE: RE: Bernd wrote that

Message 71144 in response to message 71138

Quote:
Quote:
Bernd wrote that 'exit code 10' is mostly related to disk failures. But his result file has a line which looks very strange...

This is definitely disk corruption; it even affects the file that the stderr output is kept in.

BM


I've got some "exit code 10" errors; see http://einsteinathome.org/task/87136558
Should these be considered disk errors, too?
Thank you for your help.
CCP

rhb
Joined: 15 Aug 06
Posts: 6
Credit: 1287768
RAC: 0

I also had one result with a

I also had one result with a signal 11 error, running Ubuntu Linux 6.06. It failed during a time when I was having internet issues. First, the xtremlab site was down, so I turned communications on and off a few times, then I lost my internet connection completely for a while.

I was wondering if it would be feasible to respond to some errors of this type by restarting at the last checkpoint. If that were done, there would need to be some way to ensure that it didn't restart repeatedly. Perhaps the task could be suspended, with an option for the user to restart or abort it. I don't think it would be a good idea to require user input to make the decision, though. Perhaps it could be aborted automatically if the error happened more than once, or more than once without significant progress from the last checkpoint.
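
To make the idea concrete, here is a rough Python sketch of such a policy. It does not describe how BOINC or the Einstein@Home App actually behaves; every name in it (`TaskState`, `on_crash`, `min_progress`, `max_crashes`) is hypothetical, and progress is assumed to be a fraction between 0 and 1 read from the last checkpoint.

```python
from dataclasses import dataclass

@dataclass
class TaskState:
    crash_count: int = 0                 # crashes seen so far for this task
    progress_at_last_crash: float = 0.0  # checkpoint progress at the previous crash

def on_crash(state, checkpoint_progress, min_progress=0.05, max_crashes=5):
    """Decide whether to 'restart' from the last checkpoint or 'abort'.

    The first crash always gets a restart. A later crash gets one only if
    the task advanced by at least min_progress since the previous crash
    and the total number of crashes stays within max_crashes.
    """
    state.crash_count += 1
    gained = checkpoint_progress - state.progress_at_last_crash
    state.progress_at_last_crash = checkpoint_progress
    if state.crash_count == 1:
        return "restart"
    if state.crash_count > max_crashes or gained < min_progress:
        return "abort"
    return "restart"
```

No user input is needed for the decision, and the hard cap on crashes keeps a task with a persistent fault from restarting forever.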

If anyone has more to add about possible causes of the error, I would be interested in hearing that too. For now, I'm assuming it was just a fluke.

http://einsteinathome.org/task/87532936

5.4.9

process got signal 11
