Nice idea, but the way I solved the problem was a little bit quicker, I think. ;-)
I deleted the complete BOINC folder after finishing my units, installed everything fresh, attached to my projects and finally merged the host in every project. Since then the "exited with zero status..." message hasn't occurred in the last 5 hours.
Yes, when this happens the work from the last checkpoint is lost.
Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?
As to versions of BOINC: I think this is fixed only in the 5.9. beta versions, which is kind of unfortunate given the severity of the bug.
CU
BRM
Generally speaking, even this beta app (v0.44??) should be able to handle the CC getting blocked and exit gracefully. Here's a log snippet from a test where I deliberately killed the CC for v4.17.
From stderr:
.
.
.
17305, 17306, 17307, c
17308, 17309, 17310, c
17311, 17312, 17313, c
17314, No heartbeat from core client for 31 sec - exiting
2007-06-17 09:44:24.1599 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-06-17 09:44:28.2199 [debug]: Reading SFTs and setting up stacks ... done
2007-06-17 09:46:36.3599 [debug]: Found checkpoint - reading...
2007-06-17 09:46:36.3599 [debug]: Read checkpoint - reading previous output...
2007-06-17 09:46:38.7799 [debug]: Read exactly 1008759 == maxbytes from Fstat-file, that's enough.
2007-06-17 09:46:38.8299 [debug]: DEBUG: read_fstat_toplist_from_fp() returned 1008759
2007-06-17 09:46:38.8299 [debug]: Total skypoints = 35581. Progress: 17314, c
17315, 17316, 17317, c
17318, 17319, 17320, c
17321, 17322, 17323,
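To make the resume behaviour in the log above concrete, here is a minimal Python sketch of checkpoint-and-resume. It is not the Einstein@Home app's code: the function name `crunch`, the JSON checkpoint file and the "checkpoint every 3 points" interval are assumptions for illustration (the interval is a guess based on the "c" marks in the log, which appear to flag checkpoints).

```python
import json
import os
import tempfile


def crunch(total_points, checkpoint_path, checkpoint_every=3, stop_after=None):
    """Process points, resuming from a checkpoint file if one exists.

    stop_after simulates an abrupt exit after that many points in this run.
    Returns (resumed_from, next_point).
    """
    next_point = 0
    if os.path.exists(checkpoint_path):
        # "Found checkpoint - reading..."
        with open(checkpoint_path) as f:
            next_point = json.load(f)["next_point"]
    resumed_from = next_point

    done_this_run = 0
    while next_point < total_points:
        if stop_after is not None and done_this_run >= stop_after:
            break  # exit without writing a final checkpoint
        # The real work on one sky point would happen here.
        next_point += 1
        done_this_run += 1
        if next_point % checkpoint_every == 0:
            with open(checkpoint_path, "w") as f:
                json.dump({"next_point": next_point}, f)
    return resumed_from, next_point


def demo():
    """Run once with a simulated exit, then run again to completion."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "checkpoint.json")
        first = crunch(10, path, stop_after=5)
        second = crunch(10, path)
        return first, second
```

In `demo()`, the first run stops after 5 points but the last checkpoint was written at point 3, so the second run starts again from 3 and the work on points 3 and 4 is redone. That is the "work since the last checkpoint is lost" effect discussed earlier in the thread.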
Keep in mind that the lost heartbeat message in the stderr file is generated and written by the app, not by the CC. The issue with some of the earlier CC versions is that a slow DNS response blocks all other IO from the CC to the science app, which leads to a lost heartbeat, which in turn causes the app to exit, as it's supposed to. The problem is that the CC retries the lookup so frequently that progress on the result comes to a screeching halt due to all the lost heartbeats.
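As a rough illustration of the app-side heartbeat check, here is a minimal Python sketch. It is not BOINC's implementation: the class `HeartbeatWatchdog`, the function `simulate` and the 30 second timeout are assumptions, with the timeout inferred only from the "31 sec" in the log above.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; assumed, inferred from the "31 sec" log line


class HeartbeatWatchdog:
    """Tracks when the science app last heard from the core client."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_heartbeat = clock()

    def heartbeat_received(self):
        self.last_heartbeat = self.clock()

    def should_exit(self):
        # True once the client has been silent for longer than the timeout.
        return self.clock() - self.last_heartbeat > self.timeout


def simulate(heartbeat_times, end_time, step=1.0):
    """Step a fake clock; return the time at which the app gives up, or None."""
    now = [0.0]
    dog = HeartbeatWatchdog(clock=lambda: now[0])
    pending = sorted(heartbeat_times)
    while now[0] <= end_time:
        # Deliver every heartbeat that is due at or before the current time.
        while pending and pending[0] <= now[0]:
            pending.pop(0)
            dog.heartbeat_received()
        if dog.should_exit():
            return now[0]
        now[0] += step
    return None
```

For example, `simulate(range(11), 60)` models a client that sends a heartbeat every second and then blocks at t=10 (say, on a slow DNS lookup); the app gives up at t=41, after 31 seconds of silence. With heartbeats arriving throughout, `simulate(range(61), 60)` never exits.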
The reason you get the message blurb from the CC on the restart is that the CC didn't initiate the exit, as it normally would when you shut BOINC down, for example. It sees that the app reported a successful exit (status zero), which usually means it has finished the computation, but there is no finished output file. It concludes that something 'bad' must have happened that it didn't know about, and tries restarting the result.
So in this particular case I don't think the lost heartbeat caused the subsequent abort per se; I think it's more likely the result aborted when the CC tried to restart it. From a quick search for the error code, the best I could find was that this is a Windows disk/file system error. So I guess it's possible that the app didn't clean up the file system properly when exiting, and thus one or more of the output and/or state files was fatally flawed and led to the abort.
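The client-side reasoning in the paragraph above can be sketched as a small decision function. This is only an illustration under assumptions, not the core client's actual logic: the name `classify_exit`, its parameters and the returned labels are all hypothetical.

```python
def classify_exit(exit_status, client_requested_exit, finished_file_present):
    """Decide what an app exit means from the client's point of view.

    A real client would look at the file system to see whether the finished
    output exists; here that check is passed in as a boolean.
    """
    if client_requested_exit:
        # The client asked the app to stop, e.g. when BOINC is shut down.
        return "stopped by client"
    if exit_status != 0:
        return "failed"
    if finished_file_present:
        return "finished"
    # Zero status, but no finished output: something happened that the
    # client did not know about, so it tries the result again.
    return "restart"
```

An app that quits on a lost heartbeat exits with status zero on its own initiative and without finished output, so it lands in the last branch: `classify_exit(0, False, False)` returns `"restart"`.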
Alinator
That's interesting, all the "heartbeat" stuff I've noticed so far was at the very end of the stderr for a failed result. Strange. The error code seems to indicate a problem when initializing a DLL, which doesn't help much either. Could this be a failure to load the runtime debugger in response to the error condition, or is it related to the DNS problem?
OK, I looked a little deeper, and you're right. The error code for the abort seems to correspond to an 'app failed to initialize correctly' problem. Was there any other info in the Win Event log as to which app/dll wasn't initializing?
The part I find interesting in your snippet is there is no restart message in stderr, like you see when I restart BOINC in my test case. This seems to indicate the app never really restarted after the lost heartbeat.
BTW, I should have explicitly said that killing the CC is not quite like blocking the IO (like what happens in the slow DNS scenario).
Perhaps as you suggest, it's a combination of a number of factors which leads to the abort, so that would make the lost heartbeat a symptom rather than a root cause.
Alinator
Will check in a few hours whether there's still some info to be found about the event. If not, is there a way to reproduce this DNS lookup blocking? Just pulling the network cable won't do, I guess. Would be interesting to provoke this with the new beta that does better error reporting.
thanks a lot for your help and happy crunching!!!
greetings KnB-Construction