exited with zero status!!!

KnB-Construction
Joined: 3 Mar 05
Posts: 8
Credit: 954576
RAC: 0

nice idea but the way i solved the problem was a little bit quicker i think. ;-)

i deleted the complete boinc folder after finishing my units, installed everything new, attached to my projects and finally merged the host in every project. after this the "exited with zero status..." message hasn't occurred in the last 5h.

thanks a lot for your help and happy crunching!!!

greetings KnB-Construction

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 762253131
RAC: 1092534

RE: RE: RE: Yes, when

Message 68244 in response to message 68241

Quote:
Quote:
Quote:

Yes, when this happens the work from the last checkpoint is lost.

Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?

E.g. this one : http://einsteinathome.org/task/84929543

As to versions of BOINC: I think this is fixed only in the 5.9. beta versions, which is kind of unfortunate given the severity of the bug.

CU

BRM

Generally speaking, even this beta app (v0.44??) should be able to handle the CC getting blocked and exit gracefully. Here's a log snippet from a test where I deliberately killed the CC for v4.17.

From stderr:

.
.
.
17305, 17306, 17307, c
17308, 17309, 17310, c
17311, 17312, 17313, c
17314, No heartbeat from core client for 31 sec - exiting

2007-06-17 09:44:24.1599 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-06-17 09:44:28.2199 [debug]: Reading SFTs and setting up stacks ... done
2007-06-17 09:46:36.3599 [debug]: Found checkpoint - reading...
2007-06-17 09:46:36.3599 [debug]: Read checkpoint - reading previous output...
2007-06-17 09:46:38.7799 [debug]: Read exactly 1008759 == maxbytes from Fstat-file, that's enough.
2007-06-17 09:46:38.8299 [debug]: DEBUG: read_fstat_toplist_from_fp() returned 1008759
2007-06-17 09:46:38.8299 [debug]: Total skypoints = 35581. Progress: 17314, c
17315, 17316, 17317, c
17318, 17319, 17320, c
17321, 17322, 17323,

Keep in mind that the lost heartbeat message in the stderr file is generated and written by the app, not by the CC. The issue with some of the earlier CC versions is that a slow DNS response will block any other IO from the CC to the science app. That leads to a lost heartbeat, which causes the app to exit, as it's supposed to. The problem there is that the CC keeps coming back to retry the lookup so frequently that progress on the result comes to a screeching halt due to all the lost heartbeats.
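To make the heartbeat logic concrete, here is a minimal sketch of an app-side watchdog. It is not the actual BOINC API code: the class name, method names and the 30 second timeout are assumptions for illustration (the log above only shows the app giving up after 31 seconds of silence).

```python
import time


class HeartbeatMonitor:
    """Hypothetical app-side watchdog; not the real BOINC implementation."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        # timeout_s is an assumed value; clock is injectable for testing.
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the core client just arrived."""
        self.last_beat = self.clock()

    def seconds_silent(self):
        """Seconds elapsed since the last heartbeat was recorded."""
        return self.clock() - self.last_beat

    def should_exit(self):
        """True once the core client has been silent longer than the timeout."""
        return self.seconds_silent() > self.timeout_s
```

In this picture, a CC that is stuck in a blocking DNS lookup simply never calls `beat()`, so `should_exit()` eventually turns true even though the CC process is still alive.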

The reason you get the message blurb from the CC on the restart is that the CC didn't initiate the exit, like it normally would when you shut BOINC down, for example. It sees that the app has reported it exited successfully (status zero), which usually means it has finished the computation, but there is no finished output file. So it concludes something 'bad' must have happened that it didn't know about, and it tries restarting the result.
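The decision described above can be sketched as a small function. This is only an illustration of the reasoning, not the CC's real code: the function name, its parameters and the returned labels are all made up for the example.

```python
def classify_exit(exit_status, finished_file_present, client_requested_quit):
    """Hypothetical sketch of how a client might interpret an app exit.

    Returns one of the illustrative labels:
    'suspended', 'error', 'done' or 'restart'.
    """
    if client_requested_quit:
        # The client itself asked the app to stop (e.g. BOINC shutdown).
        return "suspended"
    if exit_status != 0:
        # The app reported a failure.
        return "error"
    if finished_file_present:
        # Status zero plus a finished output file: normal completion.
        return "done"
    # Status zero but no finished file: the "exited with zero status"
    # case, where the result gets restarted.
    return "restart"
```

A lost-heartbeat exit lands in the last branch: the app exits cleanly on its own, the client never asked for it, and no finished file exists.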

So in this particular case I don't think the lost heartbeat caused the subsequent abort per se; I think it's more likely the result aborted when the CC tried to restart it. From a quick search for the error code, the best I could find was that this is a Windows disk/file system error. So I guess it's possible that the app didn't clean up the file system properly when exiting, and thus one or more of the output and/or state files was fatally flawed and led to the abort.

Alinator

That's interesting; all the "heartbeat" stuff I've noticed so far was at the very end of the stderr for a failed result. Strange. The error code seems to indicate a problem when initializing a DLL, which doesn't help much either. Could this be a failure to load the runtime debugger in response to the error condition, or is it related to the DNS problem?

CU BRM

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

OK, I looked a little deeper, and you're right. The error code for the abort seems to correspond to an 'app failed to initialize correctly' problem. Was there any other info in the Win Event log as to which app/dll wasn't initializing?

The part I find interesting in your snippet is there is no restart message in stderr, like you see when I restart BOINC in my test case. This seems to indicate the app never really restarted after the lost heartbeat.
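A quick way to check a stderr dump for this is sketched below. It only relies on the two message fragments visible in the log earlier in the thread; the function name is made up, and other app versions may word these lines differently.

```python
HEARTBEAT_LOST = "No heartbeat from core client"
APP_START = "Start of BOINC application"


def restarted_after_heartbeat_loss(stderr_text):
    """Check whether an app start line follows the last heartbeat loss.

    Returns None if the text contains no heartbeat-loss line at all,
    otherwise True or False.
    """
    lines = stderr_text.splitlines()
    last_loss = None
    for i, line in enumerate(lines):
        if HEARTBEAT_LOST in line:
            last_loss = i
    if last_loss is None:
        return None
    # Look for a start message anywhere after the last heartbeat loss.
    return any(APP_START in line for line in lines[last_loss + 1:])
```

A result of `False` would match the case described here, where the heartbeat message is the last thing in stderr and the app never logged a restart.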

BTW, I should have explicitly said that killing the CC is not quite like blocking the IO (like what happens in the slow DNS scenario).

Perhaps as you suggest, it's a combination of a number of factors which leads to the abort, so that would make the lost heartbeat a symptom rather than a root cause.

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 762253131
RAC: 1092534

RE: OK, I looked a little

Message 68246 in response to message 68245

Quote:

OK, I looked a little deeper, and you're right. The error code for the abort seems to correspond to an 'app failed to initialize correctly' problem. Was there any other info in the Win Event log as to which app/dll wasn't initializing?

The part I find interesting in your snippet is there is no restart message in stderr, like you see when I restart BOINC in my test case. This seems to indicate the app never really restarted after the lost heartbeat.

BTW, I should have explicitly said that killing the CC is not quite like blocking the IO (like what happens in the slow DNS scenario).

Perhaps as you suggest, it's a combination of a number of factors which leads to the abort, so that would make the lost heartbeat a symptom rather than a root cause.

Alinator

Will check in a few hours whether there's still some info to be found about the event. If not, is there a way to reproduce this DNS lookup blocking? Just pulling the network cable won't do, I guess. Would be interesting to provoke this with the new beta that does better error reporting.

CU

BRM
