exited with zero status!!!

KnB-Construction
Joined: 3 Mar 05
Posts: 8
Credit: 954576
RAC: 0
Topic 192855

hi,

i'm crunching for seti and einstein, and sometimes for rosetta, on an athlon XP 3000+ with windows xp pro SP2. until three days ago i didn't have any problems with seti or einstein. the problems began after i installed the newest boinc version, 5.8.16 (until then i had been using 5.7.5 from crunch3r). after more than 33h of crunching, the first complete einstein unit was invalid. i then uninstalled the new boinc software completely and switched back to the old 5.7.5 from crunch3r. since then i get a lot of "exited with zero status but no 'finished' file" messages, as you can see in the log below.

has any of you had the same problem after switching boinc versions, or does anyone know what to do when this happens? resetting the project does not solve the problem.

thanks for your help in advance

6/16/2007 9:30:44 AM||Starting BOINC client version 5.7.5 for windows_intelx86
6/16/2007 9:30:44 AM||log flags: task, file_xfer, sched_ops
6/16/2007 9:30:44 AM||Libraries: libcurl/7.15.5 OpenSSL/0.9.8a zlib/1.2.3
6/16/2007 9:30:44 AM||Data directory: C:\Programme\BOINC
6/16/2007 9:30:44 AM|SETI@home|Found app_info.xml; using anonymous platform
6/16/2007 9:30:44 AM||Processor: 1 AuthenticAMD AMD Athlon(tm) XP 3000+
6/16/2007 9:30:44 AM||Memory: 2.00 GB physical, 3.81 GB virtual
6/16/2007 9:30:44 AM||Disk: 29.29 GB total, 18.18 GB free
6/16/2007 9:30:44 AM|rosetta@home|URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 387212; location: (none); project prefs: default
6/16/2007 9:30:44 AM|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 747357; location: home; project prefs: home
6/16/2007 9:30:44 AM|lhcathome|URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 8956785; location: home; project prefs: default
6/16/2007 9:30:44 AM|SETI@home|URL: http://setiathome.berkeley.edu/; Computer ID: 2694622; location: home; project prefs: home
6/16/2007 9:30:44 AM||General prefs: from Einstein@Home (last modified 2007-06-06 23:39:56)
6/16/2007 9:30:44 AM||Host location: home
6/16/2007 9:30:44 AM||General prefs: no separate prefs for home; using your defaults
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||BOINC 5.7.5.32 - 32 bit Edition by Crunch3r
6/16/2007 9:30:44 AM||enabled features:
6/16/2007 9:30:44 AM||-cpu_affinity
6/16/2007 9:30:44 AM||-return_results_immediately
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 10:26:21 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 10:26:21 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 10:26:21 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 11:19:28 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 11:19:28 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 11:19:28 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 12:12:59 PM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 12:12:59 PM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 12:12:59 PM|SETI@home|Restarting task 03ap99ab.13555.31634.767320.3.229_0 using setiathome_enhanced version 517
6/16/2007 1:05:50 PM|SETI@home|Task 03ap99ab.13555.31634.767320.3.229_0 exited with zero status but no 'finished' file
6/16/2007 1:05:50 PM|SETI@home|If this happens repeatedly you may need to reset the project.
6/16/2007 1:05:50 PM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417

greetings KnB-Construction

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 761671041
RAC: 1113002

I'm not sure downgrading BOINC was a good idea. The validation problem seems to be an instance of the "cross-platform validation" issue: your result was initially compared to one from a Mac, and the two results didn't agree sufficiently, so a third host was asked to crunch the unit. This happened to be another Mac, and in the end the validator found the two Mac results in agreement, so they got the credit.

I'd try to go back to the more recent BOINC version. If that doesn't help, try to reset the project.

CU

BRM

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Also, the 'Exited with zero status...' message is not a problem per se. It just means that the science app didn't see a heartbeat signal from the CC (core client) for too long and exited the way it was designed to. This doesn't mean there is anything wrong, and it doesn't cause validation problems.

From looking at the log you posted it seems there is something going on about every hour on your host which is preempting the CC and keeping it from sending the heartbeat to the science app temporarily.
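The "about every hour" spacing can be checked directly against the posted log. Below is a minimal Python sketch, assuming the message log keeps the `timestamp|project|text` layout shown in the original post; the function names are made up for illustration and are not part of BOINC.

```python
from datetime import datetime

# Lines copied from the log in the original post (timestamp|project|text).
LOG = """\
6/16/2007 10:26:21 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 11:19:28 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 12:12:59 PM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 1:05:50 PM|SETI@home|Task 03ap99ab.13555.31634.767320.3.229_0 exited with zero status but no 'finished' file
"""

MARKER = "exited with zero status"


def exit_times(log_text):
    """Return the timestamps of all 'exited with zero status' messages."""
    times = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp, project, message text
        if len(parts) == 3 and MARKER in parts[2]:
            times.append(datetime.strptime(parts[0], "%m/%d/%Y %I:%M:%S %p"))
    return times


def gaps_in_seconds(times):
    """Seconds between each pair of consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(exit_times(LOG))
print(gaps)  # [3187, 3211, 3171], i.e. roughly 53 minutes each
```

The gaps come out within a minute of each other, which is the kind of regularity that points to something periodic on the host.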

HTH,

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 761671041
RAC: 1113002

Message 68235 in response to message 68234

Quote:

Also, the 'Exited with zero status...' message is not a problem per se. It just means that the science app didn't see a heartbeat signal from the CC for too long and exited the way it was designed to. This doesn't mean there is anything wrong, and it doesn't cause validation problems.

From looking at the log you posted it seems there is something going on about every hour on your host which is preempting the CC and keeping it from sending the heartbeat to the science app temporarily.

HTH,

Alinator

I understand that this "heartbeat" problem can be caused by an issue in BOINC 5.8.x?

http://boinc.berkeley.edu/trac/ticket/113

What is unfortunate about this is that after the science app exits, the core client will look for a result file, find none, and report the result as having failed: the computation work is lost. Right?

Wouldn't it be better if after a missed heartbeat, BOINC would allow the science app to "fall asleep"/suspend so that it can be restarted after heartbeats resume?

CU

H-B

Erik
Joined: 14 Feb 06
Posts: 2815
Credit: 2645600
RAC: 0

Message 68236 in response to message 68235

Quote:
What is unfortunate about this is that after the science app exits, the core client will look for a result file, find none, and report the result as having failed: the computation work is lost. Right?

I used to have this problem frequently and it still happens once in a great while. I assumed that the work was lost and the wu resumed from the last checkpoint.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Yes, when this happens the work from the last checkpoint is lost. That's what makes it really annoying when you have the slow DNS issue block the CC in the current release versions. With the 1 minute delay intervals in some circumstances this essentially brings progress on the result to a screeching halt.

The part of the OP's particular case which made me dismiss the DNS part of the equation was that there didn't seem to be any external BOINC comm traffic at the time, and also the interval was only approximately hourly. Typically, when the occurrence is internal to BOINC, you can almost set your watch by when the stall and app exit occur.

Alinator

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

The version that you are running (5.7.5) had the bug in it that reported this error a lot. It's a version issue.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 761671041
RAC: 1113002

Message 68239 in response to message 68237

Quote:

Yes, when this happens the work from the last checkpoint is lost.

Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?

E.g. this one : http://einsteinathome.org/task/84929543

As to versions of BOINC: I think this is fixed only in the 5.9. beta versions, which is kind of unfortunate given the severity of the bug.

CU

BRM

KnB-Construction
Joined: 3 Mar 05
Posts: 8
Credit: 954576
RAC: 0

@Alinator
i've checked the log after 6h of crunching einstein without switching to another project. the 'Exited with zero status...' message came approximately every 50 minutes, but i don't know what happens at that moment.

i don't think this is a problem with version 5.7.5. i have used this version since crunch3r released it and never had such problems before. it seems as if the problems come from the 5.8.16 i used for a short time.

it looks like the best way is to wait until all packages are done, then delete the complete boinc folder and reinstall everything. hopefully the problems will be gone.

greetings KnB-Construction

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Message 68241 in response to message 68239

Quote:
Quote:

Yes, when this happens the work from the last checkpoint is lost.

Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?

E.g. this one : http://einsteinathome.org/task/84929543

As to versions of BOINC: I think this is fixed only in the 5.9. beta versions, which is kind of unfortunate given the severity of the bug.

CU

BRM

Generally speaking, even this beta app (v0.44??) should be able to handle the CC getting blocked and exit gracefully. Here's a log snippet from a test where I deliberately killed the CC for v4.17.

From stderr:

.
.
.
17305, 17306, 17307, c
17308, 17309, 17310, c
17311, 17312, 17313, c
17314, No heartbeat from core client for 31 sec - exiting

2007-06-17 09:44:24.1599 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-06-17 09:44:28.2199 [debug]: Reading SFTs and setting up stacks ... done
2007-06-17 09:46:36.3599 [debug]: Found checkpoint - reading...
2007-06-17 09:46:36.3599 [debug]: Read checkpoint - reading previous output...
2007-06-17 09:46:38.7799 [debug]: Read exactly 1008759 == maxbytes from Fstat-file, that's enough.
2007-06-17 09:46:38.8299 [debug]: DEBUG: read_fstat_toplist_from_fp() returned 1008759
2007-06-17 09:46:38.8299 [debug]: Total skypoints = 35581. Progress: 17314, c
17315, 17316, 17317, c
17318, 17319, 17320, c
17321, 17322, 17323,

Keep in mind the lost heartbeat message in the stderr file is generated and written by the app, not by the CC. The issue with some of the earlier CC versions is that a slow DNS response will block any other IO from the CC to the science app, which leads to a lost heartbeat, which in turn causes the app to exit, as it's supposed to. The problem there is that the CC keeps coming back to retry the lookup so frequently that all the lost heartbeats bring progress on the result to a screeching halt.
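To make the app-side behaviour concrete, here is a toy Python sketch of a heartbeat timeout. It is not the actual BOINC API code: the class name, the `simulate` helper and the 30-second threshold are assumptions for illustration (the stderr above only shows the exit being logged at 31 sec).

```python
class HeartbeatWatchdog:
    """App-side view: remember the last heartbeat and decide when to give up."""

    def __init__(self, timeout_s=30, now=0):
        self.timeout_s = timeout_s  # assumed threshold, not taken from BOINC source
        self.last_beat = now

    def beat(self, now):
        """Record that a heartbeat arrived from the client at time `now`."""
        self.last_beat = now

    def should_exit(self, now):
        """True once the client has been silent for longer than the timeout."""
        return (now - self.last_beat) > self.timeout_s


def simulate(beat_times, end_time, timeout_s=30):
    """Step one second at a time; return the second the app would exit, or None."""
    dog = HeartbeatWatchdog(timeout_s)
    beats = set(beat_times)
    for t in range(end_time + 1):
        if t in beats:
            dog.beat(t)
        if dog.should_exit(t):
            return t
    return None


# Client sends heartbeats for 10 seconds, then gets blocked for good.
blocked = simulate(range(0, 11), 60)
# Client is blocked for about 20 seconds, then heartbeats resume in time.
recovered = simulate(list(range(0, 11)) + list(range(30, 61)), 60)
print(blocked, recovered)  # 41 None
```

In the first run the app gives up 31 seconds after the last heartbeat; in the second the stall is shorter than the timeout, so the app keeps crunching.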

The reason you get the message blurb from the CC on the restart is that the CC didn't initiate the exit, as it normally would when you shut BOINC down, for example. It sees the app has reported that it exited successfully (status zero), which usually means it has finished the computation, but there is no finished output file. It therefore concludes that something 'bad' must have happened that it didn't know about, and tries restarting the result.
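The decision described here can be written down as a small Python sketch. This is only my reading of the behaviour as described in this post, not the core client's real code; the function name, its parameters and the returned labels are all hypothetical.

```python
def classify_exit(exit_status, finished_file_exists, client_requested_exit):
    """Guess what the client concludes when a science app process ends.

    All three inputs are simplified stand-ins for what the real client tracks.
    """
    if client_requested_exit:
        # Normal shutdown or preemption: the client asked the app to stop.
        return "stopped by client"
    if exit_status != 0:
        # The app itself reported an error.
        return "failed"
    if finished_file_exists:
        # Status zero plus the 'finished' marker: the result is complete.
        return "finished"
    # Status zero but no 'finished' marker: the case seen in the log above.
    return "restart from checkpoint"


print(classify_exit(0, False, False))  # restart from checkpoint
print(classify_exit(0, True, False))   # finished
```

The last branch is the one that produces the "exited with zero status but no 'finished' file" line followed by "Restarting task".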

So in this particular case I don't think the lost heartbeat caused the subsequent abort per se; I think it's more likely the result aborted when the CC tried to restart it. From a quick search for the error code, the best I could find was that this is a Windows disk/file system error. So I guess it's possible that the app didn't clean up the file system properly when exiting, and thus one or more of the output and/or state files was fatally flawed and led to the abort.

Alinator

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

Message 68242 in response to message 68240

Quote:

@Alinator
i've checked the log after 6h of crunching einstein without switching to another project. the 'Exited with zero status...' message came approximately every 50 minutes, but i don't know what happens at that moment.

i don't think this is a problem with version 5.7.5. i have used this version since crunch3r released it and never had such problems before. it seems as if the problems come from the 5.8.16 i used for a short time.

it looks like the best way is to wait until all packages are done, then delete the complete boinc folder and reinstall everything. hopefully the problems will be gone.

Another thing you could try in the meantime would be to selectively eliminate any other background tasks one by one and see if you can find the one which causes the CC to get blocked periodically. Although with a time period of ~1 hour that could take a while. ;-)

Alinator
