exited with zero status!!!

KnB-Construction

Joined: 3 Mar 05

Posts: 8

Credit: 954576

RAC: 0

17 Jun 2007 21:11:02 UTC

Message 68243

nice idea, but the way i solved the problem was a little bit quicker, i think. ;-)

i deleted the complete boinc folder after finishing my units, installed everything fresh, attached to my projects and finally merged the host in every project. since then the "exited with zero status..." message hasn't occurred in the last 5h.

thanks a lot for your help and happy crunching!!!

greetings KnB-Construction

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 761715067

RAC: 1112000

17 Jun 2007 21:55:38 UTC

Message 68244 in response to message 68241

Quote:

Quote:
Quote:
Yes, when this happens the work from the last checkpoint is lost.

Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?

E.g. this one : http://einsteinathome.org/task/84929543

As to versions of BOINC: I think this is fixed only in the 5.9 beta versions, which is kind of unfortunate given the severity of the bug.

CU

BRM

Generally speaking, even this beta app (v0.44??) should be able to handle the CC getting blocked and exit gracefully. Here's a log snippet from a test where I deliberately killed the CC for v4.17.

From stderr:

.
.
.
17305, 17306, 17307, c
17308, 17309, 17310, c
17311, 17312, 17313, c
17314, No heartbeat from core client for 31 sec - exiting

2007-06-17 09:44:24.1599 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-06-17 09:44:28.2199 [debug]: Reading SFTs and setting up stacks ... done
2007-06-17 09:46:36.3599 [debug]: Found checkpoint - reading...
2007-06-17 09:46:36.3599 [debug]: Read checkpoint - reading previous output...
2007-06-17 09:46:38.7799 [debug]: Read exactly 1008759 == maxbytes from Fstat-file, that's enough.
2007-06-17 09:46:38.8299 [debug]: DEBUG: read_fstat_toplist_from_fp() returned 1008759
2007-06-17 09:46:38.8299 [debug]: Total skypoints = 35581. Progress: 17314, c
17315, 17316, 17317, c
17318, 17319, 17320, c
17321, 17322, 17323,

Keep in mind the lost heartbeat message in the stderr file is generated and written by the app, not by the CC. The issue with some of the earlier CC versions is that a slow DNS response blocks any other IO from the CC to the science app, which leads to a lost heartbeat, which in turn causes the app to exit, as it's supposed to do. The problem there is that the CC keeps coming back to retry the lookup so frequently that progress on the result comes to a screeching halt due to all the lost heartbeats.
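To make the heartbeat idea concrete, here is a minimal sketch of the kind of watchdog the science app runs. This is not the actual BOINC API code: the class name, the method names and the 30 second timeout are assumptions for illustration only (the log above just shows the app giving up after 31 sec of silence).

```python
import time

# Assumed timeout for illustration; the stderr excerpt only shows the app
# exiting after 31 sec without a heartbeat.
HEARTBEAT_TIMEOUT_S = 30.0


class HeartbeatWatchdog:
    """Hypothetical app-side watchdog, not the real BOINC implementation."""

    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the core client just arrived."""
        self._last_beat = self._clock()

    def seconds_silent(self):
        """Seconds elapsed since the last heartbeat was seen."""
        return self._clock() - self._last_beat

    def should_exit(self):
        """True once the core client has been silent longer than the timeout."""
        return self.seconds_silent() > self._timeout_s
```

In the slow DNS scenario the CC is still alive but cannot deliver its heartbeats, so beat() is never called, should_exit() turns true and the app quits on its own, even though nothing crashed.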

The reason you get the message blurb from the CC on the restart is that the CC didn't initiate the exit, as it normally would when you shut BOINC down, for example. It sees that the app has reported a successful exit (status zero), which usually means it has finished the computation, but there is no finished output file. So it concludes that something 'bad' must have happened that it didn't know about, and it tries restarting the result.
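Roughly, the decision described here could be sketched like this. It is only an illustration of the reasoning in this post, not the CC's actual code: the function name, the Action values and the exact message wording are made up for the example.

```python
from enum import Enum
from typing import Optional, Tuple


class Action(Enum):
    REPORT_SUCCESS = "report_success"
    RESUME_LATER = "resume_later"
    RESTART_FROM_CHECKPOINT = "restart_from_checkpoint"
    REPORT_ERROR = "report_error"


def classify_exit(exit_status: int,
                  finished_file_present: bool,
                  client_requested_exit: bool) -> Tuple[Action, Optional[str]]:
    """Hypothetical client-side handling of a science app that has exited."""
    if client_requested_exit:
        # Normal shutdown initiated by the client: nothing to complain about.
        return Action.RESUME_LATER, None
    if exit_status != 0:
        return Action.REPORT_ERROR, f"exited with status {exit_status}"
    if finished_file_present:
        # Status zero plus a finished output file: the result is done.
        return Action.REPORT_SUCCESS, None
    # Status zero but no finished output file: the app quit on its own
    # (for example after a lost heartbeat), so log it and restart the result.
    return (Action.RESTART_FROM_CHECKPOINT,
            "exited with zero status but no finished file")
```

The lost heartbeat case lands in the last branch: the app exits cleanly by itself, so the client sees status zero with no finished file and logs the message before restarting.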

So in this particular case I don't think the lost heartbeat caused the subsequent abort per se; I think it's more likely the result aborted when the CC tried to restart it. From a quick search for the error code, the best I could find was that this is a Windows disk/file system error. So I guess it's possible that the app didn't clean up the file system properly when exiting, and thus one or more of the output and/or state files was fatally flawed and led to the abort.

Alinator

That's interesting; all the "heartbeat" stuff I've noticed so far was at the very end of the stderr for a failed result. Strange. The error code seems to indicate a problem when initializing a DLL, which doesn't help much either. Could this be a failure to load the runtime debugger in response to the error condition, or is it related to the DNS problem?

CU BRM

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

18 Jun 2007 15:47:10 UTC

Message 68245

OK, I looked a little deeper, and you're right. The error code for the abort seems to correspond to an 'app failed to initialize correctly' problem. Was there any other info in the Win Event log as to which app/dll wasn't initializing?

The part I find interesting in your snippet is there is no restart message in stderr, like you see when I restart BOINC in my test case. This seems to indicate the app never really restarted after the lost heartbeat.

BTW, I should have explicitly said that killing the CC is not quite like blocking the IO (like what happens in the slow DNS scenario).

Perhaps as you suggest, it's a combination of a number of factors which leads to the abort, so that would make the lost heartbeat a symptom rather than a root cause.

Alinator

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 761715067

RAC: 1112000

18 Jun 2007 15:56:22 UTC

Message 68246 in response to message 68245

Quote:

OK, I looked a little deeper, and you're right. The error code for the abort seems to correspond to an 'app failed to initialize correctly' problem. Was there any other info in the Win Event log as to which app/dll wasn't initializing?

The part I find interesting in your snippet is there is no restart message in stderr, like you see when I restart BOINC in my test case. This seems to indicate the app never really restarted after the lost heartbeat.

BTW, I should have explicitly said that killing the CC is not quite like blocking the IO (like what happens in the slow DNS scenario).

Perhaps as you suggest, it's a combination of a number of factors which leads to the abort, so that would make the lost heartbeat a symptom rather than a root cause.

Alinator

Will check in a few hours whether there's still some info to be found about the event. If not, is there a way to reproduce this DNS lookup blocking? Just pulling the network cable won't do, I guess. Would be interesting to provoke this with the new beta that does better error reporting.

BRM
