The explanation here is simple. When the machine goes into hibernation, the EAH app is not given enough time to complete its exit tasks and "go to sleep" gracefully. So when the app comes back after hibernation, it sees that it exited, thinks it completed the result, but can't find the "finished" file and says, "WHAT??!! This ain't right, I guess I'll have to go back to the last checkpoint and try again."
FWIW, this will happen with SAH too (I have seen it there as well), and applications not behaving well in the hibernation process is not all that uncommon. Also, as has been mentioned, this is not grounds for a project reset; the fact that it restarts and continues testifies to that. And yes, you could avoid the error messages by manually quitting BOINC, but I never bother and haven't lost a result for this reason per se.
HTH,
Alinator
I wasn't so much concerned about the recovery process (which works fine a lot of the time) as about the repeated (though intermittent) failures to find a usable checkpoint to resume from, logically enough followed by a complete restart of the WU from the very beginning.
OK, I get your drift now.
Two things come to mind here:
1.) Make sure it has gone back to ground zero. On mine, when this happens the majority of times it will start off looking like it's back to the start, but a few minutes later jumps back to where it was before the problem more or less.
2.) When it does start from scratch on this kind of event (which does happen so you're not imagining it BTW ;-)), it has been after a BSOD or some other type of lockup which required a hard reset. I discovered that Scandisk either deletes or just truncates the EAH file which was left open at the crash, although I don't recall the filename offhand since it hasn't happened in a while. Either way the app didn't like it, and would restart from the beginning. I currently work around this problem by using a better disk check utility by default, which lets me fix the file and now I get back to the last checkpoint recoveries when system crashes occur.
From the way you're describing the worst case in your return-from-hibernation scenario, I would say the best thing to do is to exit BOINC before hibernating. The problem is that on a wakeup you don't get the chance to intercede and fix the damaged file before the app restarts, like you would on a cold or warm boot.
I just took a look at some of your results and noticed, for the back-to-ground-zero ones, the message about the checkpoint file exceeding its maximum size. This is a definite clue that the file is not getting closed before the machine takes a "nap" for whatever reason. :-)
HTH,
Alinator
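For what it's worth, the "file left open and truncated" failure described above is the sort of thing a write-to-temp-then-rename pattern is usually meant to guard against. The following is only a minimal sketch of that general pattern, not how the Albert app actually writes `Fstat.out.ckp`; the function names `write_checkpoint` and `read_checkpoint` and the 4-byte CRC32 trailer are assumptions made up for the illustration.

```python
import os
import tempfile
import zlib


def write_checkpoint(path: str, payload: bytes) -> None:
    """Write payload plus a CRC32 trailer to a temp file, then swap it into place.

    Hypothetical helper: the previous checkpoint stays intact until the new
    one has been fully written and flushed to disk.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.write(zlib.crc32(payload).to_bytes(4, "big"))
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_path, path)  # replace the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path: str):
    """Return the payload, or None if the file is missing, truncated, or corrupt."""
    try:
        with open(path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return None
    if len(data) < 4:
        return None
    payload, trailer = data[:-4], data[-4:]
    if zlib.crc32(payload).to_bytes(4, "big") != trailer:
        return None
    return payload
```

With something like this, a reader that gets `None` back knows it has to start over, while an interrupted write should leave the previous good checkpoint in place rather than a half-written one.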
Quote:
1.) Make sure it has gone back to ground zero. On mine, when this happens the majority of times it will start off looking like it's back to the start, but a few minutes later jumps back to where it was before the problem more or less.
It truly went back to zero. It's still plowing through the same work unit today and hasn't caught up to where the checkpoint failed yet.
Quote:
2.) When it does start from scratch on this kind of event (which does happen so you're not imagining it BTW ;-)), it has been after a BSOD or some other type of lockup which required a hard reset. I discovered that Scandisk either deletes or just truncates the EAH file which was left open at the crash, although I don't recall the filename offhand since it hasn't happened in a while. Either way the app didn't like it, and would restart from the beginning. I currently work around this problem by using a better disk check utility by default, which lets me fix the file and now I get back to the last checkpoint recoveries when system crashes occur.
No lockups/hard resets - XP SP2 has been amazingly stable compared to all my previous windoze experience - I don't think I have had more than one or two in almost two years of running two XP machines (fingers crossed).
Quote:
From the way you're describing the worst case in your return-from-hibernation scenario, I would say the best thing to do is to exit BOINC before hibernating. The problem is that on a wakeup you don't get the chance to intercede and fix the damaged file before the app restarts, like you would on a cold or warm boot.
I just took a look at some of your results and noticed, for the back-to-ground-zero ones, the message about the checkpoint file exceeding its maximum size. This is a definite clue that the file is not getting closed before the machine takes a "nap" for whatever reason. :-)
I noticed the compactifying message too - I'll be very interested if the exact same sequence of messages comes up on the next "restart".
Thanks for the suggestions.
Jon
No Problemo, and it sounds like you have a handle on the problem.
Just to clarify on point 2, I didn't mean to imply you would have to have a BSOD or lockup for it to happen.
I gave it some more thought and I'm thinking this is more of a BOINC problem than an EAH one.
The reason is that when you tell the laptop to hibernate, Windows tells everything that's running to tie up any loose ends and get ready to go to sleep. What should happen is that BOINC gets the command from Windows, relays it to Albert, waits to make sure it exited cleanly, and then tells Windows it's good to go. Apparently this isn't how it always happens.
I have observed, when running in command line mode, that if you give the CC the Ctrl-Break command to exit, it quits immediately, but sometimes the science app continues to run as an "orphaned" process (this is on 9x). The science app will eventually exit on its own in 30 seconds or so, but you get a "No Heartbeat" message in the stderr file as the reason for the exit. IOW, Mama disappeared, so I must die!
I'm speculating that the hibernate issue is a variation on this theme; it's something I've seen with other software going into and coming out of hibernation mode, even on 2K and XP.
HTH,
Alinator
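To make the "Mama disappeared, so I must die" behaviour a bit more concrete, here is a rough sketch of a heartbeat watchdog on the worker side. It is not the BOINC API; `HeartbeatWatchdog`, `worker_step`, the injectable `clock`, and the choice to save a checkpoint before exiting are all assumptions made up for the illustration. Only the roughly 30 second timeout comes from the behaviour described above.

```python
import time


class HeartbeatWatchdog:
    """Tracks how long it has been since the parent last checked in (hypothetical)."""

    def __init__(self, timeout_sec: float = 30.0, clock=time.monotonic):
        self.timeout_sec = timeout_sec
        self._clock = clock          # injectable so the logic can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Record a heartbeat received from the parent process."""
        self._last_beat = self._clock()

    def silent_for(self) -> float:
        """Seconds elapsed since the last recorded heartbeat."""
        return self._clock() - self._last_beat

    def should_exit(self) -> bool:
        """True once the parent has been silent longer than the timeout."""
        return self.silent_for() > self.timeout_sec


def worker_step(watchdog: HeartbeatWatchdog, save_checkpoint) -> bool:
    """Run one check; return False when the worker should stop.

    Saving a checkpoint before giving up is a design choice of this sketch,
    not something the thread confirms the real app does.
    """
    if watchdog.should_exit():
        save_checkpoint()
        return False
    return True
```

A worker loop would call `beat()` whenever the parent checks in and `worker_step()` once per iteration, stopping as soon as it returns `False`.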
Well, I'm back again. I can see this workunit is toast on my portable - I'm back to 10% done with three failed checkpoint recoveries since my last post.
Seti has no problems reported in stderr and does not appear to have any problem with the frequent hibernations and subsequent recovery. It appears to happen only on Einstein.
Here is stderr. The "compactifying" message occurs some of the time in conjunction with the complete restart of computations from the beginning, but not always. The "no heartbeat" message also only occurs some of the time, but I didn't see any obvious correlation between that message and subsequent "resume" failures.
Quote:
2006-03-02 20:28:25.6555 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-02 20:28:25.6655 [normal]: Started search at lalDebugLevel = 0
2006-03-02 20:28:28.7099 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-03-02 20:28:28.7099 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting
2006-03-02 22:46:36.9092 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-02 22:46:36.9092 [normal]: Started search at lalDebugLevel = 0
2006-03-02 22:46:41.1853 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-03-02 22:46:41.1853 [normal]: No usable checkpoint found, starting from beginning.
2006-03-02 22:57:55.9155 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
No heartbeat from core client for 31 sec - exiting
2006-03-03 09:32:33.9106 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 09:32:33.9206 [normal]: Started search at lalDebugLevel = 0
2006-03-03 09:32:40.7104 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 09:32:40.8105 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 09:32:41.0909 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 09:32:41.0909 [normal]: Resuming computation at (3174/34021664/683574).
No heartbeat from core client for 31 sec - exiting
2006-03-03 12:59:33.5687 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 12:59:33.5687 [normal]: Started search at lalDebugLevel = 0
2006-03-03 12:59:36.9435 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 12:59:37.0537 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 12:59:39.4171 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 12:59:39.4171 [normal]: Resuming computation at (11221/72685318/1461171).
No heartbeat from core client for 31 sec - exiting
2006-03-03 15:17:04.3513 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 15:17:04.3713 [normal]: Started search at lalDebugLevel = 0
2006-03-03 15:17:10.5502 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 15:17:10.6504 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 15:17:13.8650 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 15:17:13.8650 [normal]: Resuming computation at (17945/86461248/1738444).
No heartbeat from core client for 31 sec - exiting
2006-03-03 16:54:42.2082 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 16:54:42.2783 [normal]: Started search at lalDebugLevel = 0
2006-03-03 16:54:46.1940 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 16:54:46.2941 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 16:54:49.6990 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 16:54:49.6990 [normal]: Resuming computation at (20798/91164118/1833095).
No heartbeat from core client for 31 sec - exiting
2006-03-03 17:58:52.6205 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 17:58:52.7908 [normal]: Started search at lalDebugLevel = 0
2006-03-03 17:59:04.8381 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 17:59:04.8381 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 17:59:08.8739 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 17:59:08.9740 [normal]: Resuming computation at (23454/95235763/1914616).
2006-03-03 18:22:05.1386 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 18:22:05.4190 [normal]: Started search at lalDebugLevel = 0
2006-03-03 18:22:21.9428 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 18:22:22.0830 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 18:22:27.2805 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 18:22:27.2805 [normal]: Resuming computation at (25041/97084878/1951683).
2006-03-03 18:22:48.9717 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 18:22:48.9717 [normal]: Started search at lalDebugLevel = 0
2006-03-03 18:22:50.9545 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 18:22:50.9545 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 18:22:54.1291 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 18:22:54.1291 [normal]: Resuming computation at (25041/97084878/1951683).
No heartbeat from core client for 31 sec - exiting
2006-03-03 20:26:59.8963 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 20:26:59.8963 [normal]: Started search at lalDebugLevel = 0
2006-03-03 20:27:10.4115 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 20:27:10.6818 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 20:27:15.9895 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 20:27:15.9895 [normal]: Resuming computation at (28293/101014762/2030441).
No heartbeat from core client for 31 sec - exiting
2006-03-04 11:51:37.8113 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 11:51:37.8113 [normal]: Started search at lalDebugLevel = 0
2006-03-04 11:51:44.0102 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 11:51:44.2005 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 11:51:50.0389 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 11:51:50.0389 [normal]: Resuming computation at (32039/104506072/2100447).
No heartbeat from core client for 31 sec - exiting
2006-03-04 17:15:18.9807 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 17:15:19.0007 [normal]: Started search at lalDebugLevel = 0
2006-03-04 17:15:22.7361 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 17:15:22.8262 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 17:15:26.4114 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 17:15:26.4114 [normal]: Resuming computation at (35392/108555321/2181559).
No heartbeat from core client for 31 sec - exiting
2006-03-04 17:59:57.5086 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 17:59:57.5086 [normal]: Started search at lalDebugLevel = 0
2006-03-04 18:00:00.8334 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 18:00:00.8534 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 18:00:08.3642 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 18:00:08.3642 [normal]: Resuming computation at (47268/119044525/2391828).
No heartbeat from core client for 31 sec - exiting
2006-03-05 07:38:13.8878 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-05 07:38:13.8878 [normal]: Started search at lalDebugLevel = 0
2006-03-05 07:38:16.8120 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-05 07:38:16.8120 [normal]: Trying to read Fstat-file into toplist ...
2006-03-05 07:38:24.3628 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-05 07:38:24.3829 [normal]: Resuming computation at (48515/119748081/2405917).
No heartbeat from core client for 31 sec - exiting
2006-03-05 15:28:09.9693 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-05 15:28:09.9693 [normal]: Started search at lalDebugLevel = 0
2006-03-05 15:28:14.4457 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-05 15:28:14.5458 [normal]: Trying to read Fstat-file into toplist ...
2006-03-05 15:28:20.0638 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-05 15:28:20.0638 [normal]: Resuming computation at (63214/128065678/2572804).
No heartbeat from core client for 31 sec - exiting
2006-03-06 08:10:24.1482 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:10:24.2283 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:10:29.3757 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:10:29.3757 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:10:39.4202 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:10:39.4202 [normal]: Resuming computation at (72129/132473208/2661225).
No heartbeat from core client for 31 sec - exiting
2006-03-06 08:59:31.7375 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:59:31.7775 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:59:35.0723 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:59:35.1824 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:59:40.1696 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:59:40.1696 [normal]: Resuming computation at (82232/135915533/2730222).
2006-03-06 17:24:33.4859 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 17:24:34.0367 [normal]: Started search at lalDebugLevel = 0
2006-03-06 17:24:37.1712 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 17:24:37.3915 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 17:24:46.3544 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 17:24:46.3844 [normal]: Resuming computation at (112925/144766999/2907377).
No heartbeat from core client for 31 sec - exiting
2006-03-06 19:44:39.2983 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 19:44:42.0522 [normal]: Started search at lalDebugLevel = 0
2006-03-06 19:44:46.9092 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-06 19:44:47.9708 [normal]: No usable checkpoint found, starting from beginning.
2006-03-06 20:03:13.8610 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
2006-03-06 22:09:00.0439 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 22:09:00.0540 [normal]: Started search at lalDebugLevel = 0
2006-03-06 22:09:02.7378 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 22:09:02.7879 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 22:09:06.1627 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 22:09:06.1627 [normal]: Resuming computation at (29550/102404351/2058304).
No heartbeat from core client for 31 sec - exiting
2006-03-07 18:36:58.6439 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-07 18:36:58.6439 [normal]: Started search at lalDebugLevel = 0
2006-03-07 18:37:08.0975 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-07 18:37:08.1175 [normal]: Trying to read Fstat-file into toplist ...
2006-03-07 18:37:12.8743 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-07 18:37:12.8743 [normal]: Resuming computation at (34226/107105211/2152512).
No heartbeat from core client for 31 sec - exiting
2006-03-07 20:15:38.6407 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-07 20:15:38.6407 [normal]: Started search at lalDebugLevel = 0
2006-03-07 20:15:41.5850 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-07 20:15:41.6150 [normal]: Trying to read Fstat-file into toplist ...
2006-03-07 20:15:45.4105 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-07 20:15:45.4105 [normal]: Resuming computation at (43321/116443685/2339635).
No heartbeat from core client for 31 sec - exiting
2006-03-08 11:08:57.5604 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 11:08:57.5604 [normal]: Started search at lalDebugLevel = 0
2006-03-08 11:09:07.3444 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 11:09:07.7450 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 11:09:15.5863 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 11:09:15.5863 [normal]: Resuming computation at (74498/133436502/2680562).
No heartbeat from core client for 31 sec - exiting
2006-03-08 11:23:05.7804 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 11:23:05.7904 [normal]: Started search at lalDebugLevel = 0
2006-03-08 11:23:09.4357 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 11:23:09.5058 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 11:23:14.4729 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 11:23:14.4729 [normal]: Resuming computation at (75437/133751102/2686860).
No heartbeat from core client for 31 sec - exiting
2006-03-08 13:10:32.7834 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 13:10:32.7834 [normal]: Started search at lalDebugLevel = 0
2006-03-08 13:10:41.6261 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 13:10:41.7262 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 13:10:47.5045 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 13:10:47.5045 [normal]: Resuming computation at (77878/134559603/2703071).
No heartbeat from core client for 31 sec - exiting
2006-03-09 06:58:30.6560 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 06:58:30.6560 [normal]: Started search at lalDebugLevel = 0
2006-03-09 06:58:40.7605 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 06:58:40.9208 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 06:58:51.3057 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 06:58:51.3057 [normal]: Resuming computation at (80155/135264907/2717188).
No heartbeat from core client for 31 sec - exiting
2006-03-09 13:18:26.1195 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 13:18:26.1195 [normal]: Started search at lalDebugLevel = 0
2006-03-09 13:18:36.1839 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 13:18:36.4043 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 13:18:42.7133 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 13:18:42.7233 [normal]: Resuming computation at (80289/135290995/2717712).
No heartbeat from core client for 31 sec - exiting
2006-03-09 18:13:28.8593 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 18:13:28.8593 [normal]: Started search at lalDebugLevel = 0
2006-03-09 18:13:33.0754 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-09 18:13:33.0754 [normal]: No usable checkpoint found, starting from beginning.
2006-03-09 18:25:01.4352 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
No heartbeat from core client for 31 sec - exiting
2006-03-09 20:19:25.7060 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 20:19:26.0565 [normal]: Started search at lalDebugLevel = 0
2006-03-09 20:19:32.3255 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 20:19:32.4156 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 20:19:32.8863 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 20:19:32.8863 [normal]: Resuming computation at (3469/37219157/747697).
No heartbeat from core client for 31 sec - exiting
2006-03-09 21:06:14.5372 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 21:06:14.5572 [normal]: Started search at lalDebugLevel = 0
2006-03-09 21:06:19.5945 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 21:06:19.7247 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 21:06:21.4571 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 21:06:21.4571 [normal]: Resuming computation at (6457/56669894/1138774).
No heartbeat from core client for 31 sec - exiting
2006-03-10 07:16:21.0208 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 07:16:21.0208 [normal]: Started search at lalDebugLevel = 0
2006-03-10 07:16:31.2555 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-10 07:16:31.3456 [normal]: Trying to read Fstat-file into toplist ...
2006-03-10 07:16:33.6990 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-10 07:16:33.6990 [normal]: Resuming computation at (7627/61162741/1229280).
No heartbeat from core client for 31 sec - exiting
2006-03-10 17:11:06.0193 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 17:11:06.0293 [normal]: Started search at lalDebugLevel = 0
2006-03-10 17:11:13.3999 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-10 17:11:13.4801 [normal]: Trying to read Fstat-file into toplist ...
2006-03-10 17:11:16.2641 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-10 17:11:16.2641 [normal]: Resuming computation at (16001/82865606/1666019).
No heartbeat from core client for 31 sec - exiting
2006-03-10 21:26:23.8836 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 21:26:23.8936 [normal]: Started search at lalDebugLevel = 0
2006-03-10 21:26:28.4401 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-10 21:26:28.4401 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting
2006-03-11 10:04:15.0007 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-11 10:04:15.0107 [normal]: Started search at lalDebugLevel = 0
2006-03-11 10:04:24.9250 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-11 10:04:24.9250 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting
2006-03-11 15:39:46.3198 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-11 15:39:46.3198 [normal]: Started search at lalDebugLevel = 0
2006-03-11 15:39:55.0324 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-11 15:39:55.0824 [normal]: Trying to read Fstat-file into toplist ...
2006-03-11 15:40:05.6877 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-11 15:40:05.6877 [normal]: Resuming computation at (2571/253182612/5088438).
2006-03-11 15:42:14.0022 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
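A stderr dump this long is easier to reason about once it is tallied. The sketch below is just one way to do that, assuming the log has been saved to a text string; the function names `summarize_stderr` and `resume_positions` and the counter keys are made up for the illustration, and the matched phrases are copied from the messages above.

```python
import re
from collections import Counter

# Matches e.g. "Resuming computation at (3174/34021664/683574)."
RESUME_RE = re.compile(r"Resuming computation at \((\d+)/(\d+)/(\d+)\)")


def summarize_stderr(text: str) -> Counter:
    """Count the interesting events in an Albert stderr dump (hypothetical helper).

    Independent `if` checks are used so that a line carrying two messages
    run together is still counted for both.
    """
    counts = Counter()
    for line in text.splitlines():
        if "Start of BOINC application" in line:
            counts["starts"] += 1
        if "No usable checkpoint found" in line:
            counts["restarts_from_beginning"] += 1
        if RESUME_RE.search(line):
            counts["resumes"] += 1
        if "No heartbeat from core client" in line:
            counts["no_heartbeat_exits"] += 1
        if "compactifying" in line:
            counts["compactions"] += 1
    return counts


def resume_positions(text: str) -> list:
    """Return the three counters from every 'Resuming computation' line, in order."""
    return [tuple(int(n) for n in m.groups()) for m in RESUME_RE.finditer(text)]
```

Comparing consecutive entries from `resume_positions` would show at a glance where the first counter drops back, which is where a restart from the beginning cost progress.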
I finally pushed the WU to completion by simply leaving the portable on all day, long enough to finish the work unit uninterrupted. So the work unit is completed and uploaded if you want to check it out:
Result ID 20056495
Name r1_1498.5__2759_S4R2a_2
Workunit 5309203
Since I don't expect to be able to complete an Einstein WU under normal circumstances anymore, I'd like to keep the portable working on WUs it can complete (such as Seti). For now, I've set the portable to not accept new work from Einstein and left the other machines in my group working on Einstein and Seti. If I detach or leave the portable set to accept no new work, will that affect just the portable computer or will the others in my group also follow along? (I checked the wiki and wasn't able to find the answer.)
Thanks,
j
Quote:
If I detach or leave the portable set to accept no new work, will that affect just the portable computer or will the others in my group also follow along?
Only the one host will be affected by the settings in its BOINC manager, unlike the case of changes to the preferences in your BOINC or project-specific accounts.
Thanks for the information. I'll let the portable chew on the more "checkpointable" jobs then and leave the workhorse computers that are up all the time working on Einstein.
One thing you may want to try before giving up on the laptop for EAH is to experiment with the switch task interval and/or disk write interval (assuming you haven't already). Try setting the task interval to something like 3 or 4 hours, instead of the default 60 mins. This may give it a chance to generate more usable checkpoints to work with.
Although if the problem is that the EAH app is just not getting enough time to do what it needs to before hibernating, this won't make any difference.
Just a thought, no guarantees. ;-)
Alinator