hi,
I'm crunching for SETI and Einstein, and sometimes for Rosetta, with an Athlon XP 3000+ and Windows XP Pro SP2. Until three days ago I didn't have any problems with SETI or Einstein. The problems began after I installed the newest BOINC version, 5.8.16 (I had been using 5.7.5 from Crunch3r until then). After more than 33 hours of crunching, the first complete Einstein unit was invalid. I then uninstalled the new BOINC software completely and switched back to the old 5.7.5 from Crunch3r. Since then I get a lot of "exited with zero status but no 'finished' file" messages, as you can see in the log below.
Has anyone had the same problems after switching BOINC versions, or does anyone know what to do when this happens? Resetting the project does not solve the problem.
Thanks in advance for your help.
6/16/2007 9:30:44 AM||Starting BOINC client version 5.7.5 for windows_intelx86
6/16/2007 9:30:44 AM||log flags: task, file_xfer, sched_ops
6/16/2007 9:30:44 AM||Libraries: libcurl/7.15.5 OpenSSL/0.9.8a zlib/1.2.3
6/16/2007 9:30:44 AM||Data directory: C:\Programme\BOINC
6/16/2007 9:30:44 AM|SETI@home|Found app_info.xml; using anonymous platform
6/16/2007 9:30:44 AM||Processor: 1 AuthenticAMD AMD Athlon(tm) XP 3000+
6/16/2007 9:30:44 AM||Memory: 2.00 GB physical, 3.81 GB virtual
6/16/2007 9:30:44 AM||Disk: 29.29 GB total, 18.18 GB free
6/16/2007 9:30:44 AM|rosetta@home|URL: http://boinc.bakerlab.org/rosetta/; Computer ID: 387212; location: (none); project prefs: default
6/16/2007 9:30:44 AM|Einstein@Home|URL: http://einstein.phys.uwm.edu/; Computer ID: 747357; location: home; project prefs: home
6/16/2007 9:30:44 AM|lhcathome|URL: http://lhcathome.cern.ch/lhcathome/; Computer ID: 8956785; location: home; project prefs: default
6/16/2007 9:30:44 AM|SETI@home|URL: http://setiathome.berkeley.edu/; Computer ID: 2694622; location: home; project prefs: home
6/16/2007 9:30:44 AM||General prefs: from Einstein@Home (last modified 2007-06-06 23:39:56)
6/16/2007 9:30:44 AM||Host location: home
6/16/2007 9:30:44 AM||General prefs: no separate prefs for home; using your defaults
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||BOINC 5.7.5.32 - 32 bit Edition by Crunch3r
6/16/2007 9:30:44 AM||enabled features:
6/16/2007 9:30:44 AM||-cpu_affinity
6/16/2007 9:30:44 AM||-return_results_immediately
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM||
6/16/2007 9:30:44 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 10:26:21 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 10:26:21 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 10:26:21 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 11:19:28 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 11:19:28 AM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 11:19:28 AM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
6/16/2007 12:12:59 PM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file
6/16/2007 12:12:59 PM|Einstein@Home|If this happens repeatedly you may need to reset the project.
6/16/2007 12:12:59 PM|SETI@home|Restarting task 03ap99ab.13555.31634.767320.3.229_0 using setiathome_enhanced version 517
6/16/2007 1:05:50 PM|SETI@home|Task 03ap99ab.13555.31634.767320.3.229_0 exited with zero status but no 'finished' file
6/16/2007 1:05:50 PM|SETI@home|If this happens repeatedly you may need to reset the project.
6/16/2007 1:05:50 PM|Einstein@Home|Restarting task h1_0457.20_S5R2__226_S5R2c_0 using einstein_S5R2 version 417
greetings KnB-Construction
Copyright © 2024 Einstein@Home. All rights reserved.
exited with zero status!!!
I'm not sure downgrading BOINC was a good idea. The validation problem seems to be an instance of the "cross platform validation" issue: your result was initially compared to one from a Mac, and the two results didn't agree sufficiently, so a third host was asked to crunch the unit. This happened to be another Mac, and finally the validator found the two Mac results in agreement, so they got the credit.
I'd try to go back to the more recent BOINC version. If that doesn't help, try to reset the project.
CU
BRM
Also, the 'Exited with zero status...' message is not a problem per se. It just means that the science app didn't see a heartbeat signal from the CC for too long and exited the way it was designed to. This doesn't mean there is anything wrong, and it doesn't cause validation problems.
From looking at the log you posted it seems there is something going on about every hour on your host which is preempting the CC and keeping it from sending the heartbeat to the science app temporarily.
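To put a number on that interval, a small Python sketch like the following could pull the timestamps out of the client messages. It assumes only the `date time|project|message` layout visible in the posted log; the function names are made up for illustration, and the four lines are copied from that log.

```python
from datetime import datetime

# The four "exited" messages from the log posted above.
LOG_LINES = [
    "6/16/2007 10:26:21 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file",
    "6/16/2007 11:19:28 AM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file",
    "6/16/2007 12:12:59 PM|Einstein@Home|Task h1_0457.20_S5R2__226_S5R2c_0 exited with zero status but no 'finished' file",
    "6/16/2007 1:05:50 PM|SETI@home|Task 03ap99ab.13555.31634.767320.3.229_0 exited with zero status but no 'finished' file",
]

MARKER = "exited with zero status but no 'finished' file"


def exit_times(lines):
    """Return the timestamps of all 'exited with zero status' messages."""
    times = []
    for line in lines:
        if MARKER in line:
            stamp = line.split("|", 1)[0]  # text before the first '|'
            times.append(datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"))
    return times


def gaps_in_seconds(times):
    """Return the whole-second gaps between consecutive timestamps."""
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = gaps_in_seconds(exit_times(LOG_LINES))
print(gaps)  # prints [3187, 3211, 3171]
```

That works out to roughly 53 minutes between exits, which narrows the search to something on the host that runs on about that schedule.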
HTH,
Alinator
I understand that this "heartbeat" problem can be caused by an issue in BOINC 5.8.x?
http://boinc.berkeley.edu/trac/ticket/113
What is unfortunate about this is that after the science app exits, the core client will look for a result file, find none, and report the result as having failed: the computation work is lost. Right?
Wouldn't it be better if after a missed heartbeat, BOINC would allow the science app to "fall asleep"/suspend so that it can be restarted after heartbeats resume?
CU
H-B
I used to have this problem frequently and it still happens once in a great while. I assumed that only the work since the last checkpoint was lost and the WU resumed from that checkpoint.
Yes, when this happens the work since the last checkpoint is lost. That's what makes it really annoying when the slow DNS issue blocks the CC in the current release versions. With the 1-minute delay intervals, in some circumstances this essentially brings progress on the result to a screeching halt.
What made me dismiss the DNS part of the equation in the OP's particular case was that there didn't seem to be any external BOINC comm traffic at the time, and the interval was only approximately hourly. Typically, when the occurrence is internal to BOINC, you can almost set your watch by when the stall and app exit occur.
Alinator
The version that you are running (5.7.5) had the bug in it that reported this error a lot. It's a version issue.
Are you sure the workunit resumes from the last checkpoint? As I understand it, the workunit gets terminated when this happens, because to BOINC it looks as if it had crashed. Otherwise, why would you see this kind of error message at the end of a half-crunched result (and not just in the middle of a finished one)?
E.g. this one: http://einsteinathome.org/task/84929543
As to versions of BOINC: I think this is fixed only in the 5.9 beta versions, which is kind of unfortunate given the severity of the bug.
CU
BRM
@Alinator
I checked the log after 6 hours of crunching Einstein without switching to another project. The 'Exited with zero status...' message came approximately every 50 minutes, but I don't know what happens at that moment.
I don't think this is a problem with version 5.7.5. I have used this version since Crunch3r released it and never had such problems before. It seems as if the problems come from the 5.8.16 I used for a short time.
It looks like the best way is to wait until all work units are done, then delete the complete BOINC folder and reinstall everything. Hopefully the problems will then be gone.
greetings KnB-Construction
Generally speaking, even this beta app (v0.44??) should be able to handle the CC getting blocked and exit gracefully. Here's a log snippet from a test where I deliberately killed the CC for v4.17.
From stderr:
.
.
.
17305, 17306, 17307, c
17308, 17309, 17310, c
17311, 17312, 17313, c
17314, No heartbeat from core client for 31 sec - exiting
2007-06-17 09:44:24.1599 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R2_4.17_windows_intelx86.exe'.
2007-06-17 09:44:28.2199 [debug]: Reading SFTs and setting up stacks ... done
2007-06-17 09:46:36.3599 [debug]: Found checkpoint - reading...
2007-06-17 09:46:36.3599 [debug]: Read checkpoint - reading previous output...
2007-06-17 09:46:38.7799 [debug]: Read exactly 1008759 == maxbytes from Fstat-file, that's enough.
2007-06-17 09:46:38.8299 [debug]: DEBUG: read_fstat_toplist_from_fp() returned 1008759
2007-06-17 09:46:38.8299 [debug]: Total skypoints = 35581. Progress: 17314, c
17315, 17316, 17317, c
17318, 17319, 17320, c
17321, 17322, 17323,
Keep in mind the lost heartbeat message in the stderr file is generated and written by the app, not by the CC. The issue with some of the earlier CC versions is that a slow DNS response will block any other IO from the CC to the science app, and thus leads to a lost heartbeat, which causes the app to exit as it's supposed to. The problem there is that the CC keeps coming back to retry the lookup so frequently that all the lost heartbeats bring progress on the result to a screeching halt.
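As a rough illustration of the app-side behaviour, here is a minimal Python sketch of a heartbeat watchdog. It is not the BOINC API: the class and method names are hypothetical, and the 30-second limit is an assumption inferred from the "for 31 sec" in the stderr message above.

```python
import time

# Assumed limit, inferred from "No heartbeat from core client for 31 sec".
HEARTBEAT_TIMEOUT = 30.0


class HeartbeatWatcher:
    """Hypothetical app-side watchdog: tracks when the last heartbeat arrived."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the core client was just seen."""
        self._last_beat = self._clock()

    def seconds_silent(self):
        """Return how long it has been since the last heartbeat."""
        return self._clock() - self._last_beat

    def should_exit(self):
        """True once the core client has been silent longer than the limit."""
        return self.seconds_silent() > self._timeout
```

In a sketch like this the app would check `should_exit()` in its main loop, write its message to stderr, and exit cleanly, leaving the last checkpoint on disk for the restart.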
The reason you get the message blurb from the CC on the restart is that the CC didn't initiate the exit, as it normally would when you shut BOINC down, for example. It sees that the app has reported a successful exit (status zero), which usually means it has finished the computation, but there is no 'finished' file. It therefore concludes that something 'bad' must have happened that it didn't know about, and tries restarting the result.
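The decision described here can be sketched as a small Python function. This is a simplified model for illustration, not the actual client code: the function name and return values are hypothetical, and the non-zero branch is an assumption, since only the zero-status case is discussed in this thread.

```python
def classify_exit(exit_status, finished_file_exists, client_requested_exit):
    """Model how the core client might interpret a science app exit."""
    if client_requested_exit:
        # The CC shut the app down itself (e.g. BOINC is closing): nothing odd.
        return "stopped by client"
    if exit_status == 0 and finished_file_exists:
        # Clean exit plus the 'finished' file: the computation is done.
        return "finished"
    if exit_status == 0:
        # Clean exit but no 'finished' file: log the message and restart.
        return "restart"
    # Assumption: any non-zero status is treated as a failure.
    return "failed"


# The case from the posted log: status zero, no 'finished' file.
print(classify_exit(0, False, False))  # prints restart
```

A lost-heartbeat exit lands in the third branch, which is why the task restarts from its checkpoint instead of being reported as an error.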
So in this particular case I don't think the lost heartbeat caused the subsequent abort per se; I think it's more likely the result aborted when the CC tried to restart it. From a quick search for the error code, the best I could find was that this is a Windows disk/file system error. So I guess it's possible that the app didn't clean up the file system properly when exiting, and thus one or more of the output and/or state files was fatally flawed, which led to the abort.
Alinator
Another thing you could try in the meantime would be to selectively eliminate any other background tasks one by one and see if you can find the one which causes the CC to get blocked periodically. Although with a time period of ~1 hour that could take a while. ;-)
Alinator