Einstein Restarts?

Archie & Mehitabel

Joined: 28 Nov 05

Posts: 11

Credit: 24412

RAC: 0

RE: The explanation here is

7 Mar 2006 18:24:09 UTC

Message 25653 in response to message 25648

Quote:

The explanation here is simple. When the machine goes into hibernation, the EAH app is not being given enough time to complete its exit tasks and "go to sleep" gracefully, hence when the app comes back after hibernation it sees that it exited, thinks it completed the result, but can't find the "finished" file and says, "WHAT??!! This ain't right, I guess I'll have to go back to the last checkpoint and try again."

FWIW, this will happen with SAH (I have seen it happen there as well), and applications not behaving well in the hibernation process is not all that uncommon. Also, as has been mentioned this is not grounds for a project reset, the fact it restarts and continues testifies to that, and yes you could avoid the error messages by manually quitting BOINC, but I never bother and haven't lost a result due to it for this reason per se.

HTH,

Alinator

I wasn't so much concerned about the recovery process (which works fine a lot of the time) as about the repeated (though intermittent) failures to find a usable checkpoint to resume from, logically enough followed by a complete restart of the WU from the very beginning.

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

RE: I wasn't so much

7 Mar 2006 19:10:21 UTC

Message 25654 in response to message 25653

Quote:

I wasn't so much concerned about the recovery process (which works fine a lot of the time) as about the repeated (though intermittent) failures to find a usable checkpoint to resume from, logically enough followed by a complete restart of the WU from the very beginning.

OK, I get your drift now.

Two things come to mind here:

1.) Make sure it has gone back to ground zero. On mine, when this happens the majority of times it will start off looking like it's back to the start, but a few minutes later jumps back to where it was before the problem more or less.

2.) When it does start from scratch on this kind of event (which does happen so you're not imagining it BTW ;-)), it has been after a BSOD or some other type of lockup which required a hard reset. I discovered that Scandisk either deletes or just truncates the EAH file which was left open at the crash, although I don't recall the filename offhand since it hasn't happened in a while. Either way the app didn't like it, and would restart from the beginning. I currently work around this problem by using a better disk check utility by default, which lets me fix the file and now I get back to the last checkpoint recoveries when system crashes occur.

From the way you're describing the worst case in your return from hibernation scenario, I would say the best thing to do is to exit BOINC before entering hibernation. The problem is you don't get the chance to intercede before the app restart to fix the damaged file on a wakeup like you would on a cold or warm boot.

Just took a look at some of your results and noticed, for the back-to-ground-zero ones, the item about the checkpoint file exceeding its maximum size. This is a definite clue that the file is not getting closed before the machine takes a "nap" for whatever reason. :-)
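For what it's worth, the usual way an application guards against a half-written checkpoint is to write the new state to a temporary file, flush it to disk, and only then rename it over the old checkpoint. The Python sketch below only illustrates that general pattern. It is not how the EAH app does it (I haven't seen that code), and the function names, the JSON format and the "state.ckp" file name in the usage line are all made up for the example.

```python
import json
import os
import tempfile


def write_checkpoint(path, state):
    """Write state to path so a reader sees either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap the finished file into place
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def read_checkpoint(path):
    """Return the saved state, or None if the file is missing or unreadable."""
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, ValueError):
        return None
```

Usage would be something like write_checkpoint("state.ckp", {"done": 3174}) at each checkpoint and read_checkpoint("state.ckp") at startup. If the machine goes down mid-write, the worst case is a stray temp file, with the last good checkpoint still in place.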

HTH,

Alinator

Archie & Mehitabel

Joined: 28 Nov 05

Posts: 11

Credit: 24412

RAC: 0

RE: 1.) Make sure it has

7 Mar 2006 20:05:03 UTC

Message 25655 in response to message 25654

Quote:

1.) Make sure it has gone back to ground zero. On mine, when this happens the majority of times it will start off looking like it's back to the start, but a few minutes later jumps back to where it was before the problem more or less.

It truly went back to zero. It's still plowing through the same work unit today and hasn't caught up to where the checkpoint failed yet.

Quote:

2.) When it does start from scratch on this kind of event (which does happen so you're not imagining it BTW ;-)), it has been after a BSOD or some other type of lockup which required a hard reset. I discovered that Scandisk either deletes or just truncates the EAH file which was left open at the crash, although I don't recall the filename offhand since it hasn't happened in a while. Either way the app didn't like it, and would restart from the beginning. I currently work around this problem by using a better disk check utility by default, which lets me fix the file and now I get back to the last checkpoint recoveries when system crashes occur.

no lockups/hard resets - XP SP2 has been amazingly stable compared to all my previous windoze experience - don't think I have had more than one or two in almost two years of running two XP machines (fingers crossed).

Quote:

From the way you're describing the worst case in your return from hibernation scenario, I would say the best thing to do is to exit BOINC before entering hibernation. The problem is you don't get the chance to intercede before the app restart to fix the damaged file on a wakeup like you would on a cold or warm boot.

Just took a look at some of your results and noticed, for the back-to-ground-zero ones, the item about the checkpoint file exceeding its maximum size. This is a definite clue that the file is not getting closed before the machine takes a "nap" for whatever reason. :-)

I noticed the compactifying message too - I'll be very interested if the exact same sequence of messages comes up on the next "restart".

Thanks for the suggestions.
Jon

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

No Problemo, and it sounds

7 Mar 2006 22:44:49 UTC

Message 25656

No Problemo, and it sounds like you have a handle on the problem.

Just to clarify on point 2, I didn't mean to imply you would have to have a BSOD or lockup for it to happen.

I gave it some more thought and I'm thinking this is more of a BOINC problem than an EAH one.

The reason is that when you tell the laptop to hibernate, Windows tells everything that's running to tie up any loose ends and get ready to go to sleep. What should happen is that BOINC gets the command from Windows, relays it to Albert, waits to make sure it exited cleanly, and then tells Windows it's good to go. Apparently this isn't how it always happens.

I have observed, when running in command line mode, that if you give the CC the Ctrl-Break command to exit, it quits immediately, but sometimes the science app continues to run as an "orphaned" process (this is on 9x). The science app will eventually exit on its own in 30 seconds or so, but you get a "No Heartbeat" message in the stderr file as the reason for the exit. IOW, Mama disappeared, so I must die!

I'm speculating the hibernate issue is a variation on this theme, and is something I've seen with other software going into and coming out of hibernation mode even on 2K and XP.
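To make the "Mama disappeared" idea concrete, here is a rough Python sketch of a heartbeat watchdog: the parent calls beat() periodically, and the child gives up once no beat has arrived within the timeout. This is only my illustration of the general idea, not BOINC's actual code. The class name, the injectable clock and the 30-second default (taken from the "30 seconds or so" above) are assumptions.

```python
import time

HEARTBEAT_TIMEOUT = 30.0  # seconds; assumed value, not taken from BOINC source


class HeartbeatWatchdog:
    """Tracks when the parent process was last heard from."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the logic can be tested without waiting
        self.last_beat = clock()

    def beat(self):
        """Record that a heartbeat from the parent has just arrived."""
        self.last_beat = self.clock()

    def expired(self):
        """True once more than `timeout` seconds have passed without a beat."""
        return self.clock() - self.last_beat > self.timeout
```

A worker loop would check expired() between chunks of computation and, if it returns True, save what it can and exit rather than keep running as an orphan.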

HTH,

Alinator

Archie & Mehitabel

Joined: 28 Nov 05

Posts: 11

Credit: 24412

RAC: 0

Well, I'm back again. I can

11 Mar 2006 22:06:34 UTC

Message 25657

Well, I'm back again. I can see this workunit is toast on my portable - I'm back to 10% done with three failed checkpoint recoveries since my last post.

Seti has no problems reported in stderr and does not appear to have any problem with the frequent hibernations and subsequent recovery. It appears to happen only on Einstein.

Here is stderr. The "compactifying" message occurs some of the time in conjunction with the complete restart of computations from the beginning, but not always. The "no heartbeat" message also only occurs some of the time, but I didn't see any obvious correlation between that message and subsequent "resume" failures.

Quote:

2006-03-02 20:28:25.6555 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-02 20:28:25.6655 [normal]: Started search at lalDebugLevel = 0
2006-03-02 20:28:28.7099 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-03-02 20:28:28.7099 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting

2006-03-02 22:46:36.9092 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-02 22:46:36.9092 [normal]: Started search at lalDebugLevel = 0
2006-03-02 22:46:41.1853 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-03-02 22:46:41.1853 [normal]: No usable checkpoint found, starting from beginning.
2006-03-02 22:57:55.9155 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
No heartbeat from core client for 31 sec - exiting

2006-03-03 09:32:33.9106 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 09:32:33.9206 [normal]: Started search at lalDebugLevel = 0
2006-03-03 09:32:40.7104 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 09:32:40.8105 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 09:32:41.0909 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 09:32:41.0909 [normal]: Resuming computation at (3174/34021664/683574).
No heartbeat from core client for 31 sec - exiting

2006-03-03 12:59:33.5687 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 12:59:33.5687 [normal]: Started search at lalDebugLevel = 0
2006-03-03 12:59:36.9435 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 12:59:37.0537 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 12:59:39.4171 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 12:59:39.4171 [normal]: Resuming computation at (11221/72685318/1461171).
No heartbeat from core client for 31 sec - exiting

2006-03-03 15:17:04.3513 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 15:17:04.3713 [normal]: Started search at lalDebugLevel = 0
2006-03-03 15:17:10.5502 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 15:17:10.6504 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 15:17:13.8650 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 15:17:13.8650 [normal]: Resuming computation at (17945/86461248/1738444).
No heartbeat from core client for 31 sec - exiting

2006-03-03 16:54:42.2082 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 16:54:42.2783 [normal]: Started search at lalDebugLevel = 0
2006-03-03 16:54:46.1940 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 16:54:46.2941 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 16:54:49.6990 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 16:54:49.6990 [normal]: Resuming computation at (20798/91164118/1833095).
No heartbeat from core client for 31 sec - exiting

2006-03-03 17:58:52.6205 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 17:58:52.7908 [normal]: Started search at lalDebugLevel = 0
2006-03-03 17:59:04.8381 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 17:59:04.8381 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 17:59:08.8739 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 17:59:08.9740 [normal]: Resuming computation at (23454/95235763/1914616).

2006-03-03 18:22:05.1386 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 18:22:05.4190 [normal]: Started search at lalDebugLevel = 0
2006-03-03 18:22:21.9428 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 18:22:22.0830 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 18:22:27.2805 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 18:22:27.2805 [normal]: Resuming computation at (25041/97084878/1951683).

2006-03-03 18:22:48.9717 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 18:22:48.9717 [normal]: Started search at lalDebugLevel = 0
2006-03-03 18:22:50.9545 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 18:22:50.9545 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 18:22:54.1291 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 18:22:54.1291 [normal]: Resuming computation at (25041/97084878/1951683).
No heartbeat from core client for 31 sec - exiting

2006-03-03 20:26:59.8963 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-03 20:26:59.8963 [normal]: Started search at lalDebugLevel = 0
2006-03-03 20:27:10.4115 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-03 20:27:10.6818 [normal]: Trying to read Fstat-file into toplist ...
2006-03-03 20:27:15.9895 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-03 20:27:15.9895 [normal]: Resuming computation at (28293/101014762/2030441).
No heartbeat from core client for 31 sec - exiting

2006-03-04 11:51:37.8113 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 11:51:37.8113 [normal]: Started search at lalDebugLevel = 0
2006-03-04 11:51:44.0102 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 11:51:44.2005 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 11:51:50.0389 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 11:51:50.0389 [normal]: Resuming computation at (32039/104506072/2100447).
No heartbeat from core client for 31 sec - exiting

2006-03-04 17:15:18.9807 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 17:15:19.0007 [normal]: Started search at lalDebugLevel = 0
2006-03-04 17:15:22.7361 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 17:15:22.8262 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 17:15:26.4114 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 17:15:26.4114 [normal]: Resuming computation at (35392/108555321/2181559).
No heartbeat from core client for 31 sec - exiting

2006-03-04 17:59:57.5086 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-04 17:59:57.5086 [normal]: Started search at lalDebugLevel = 0
2006-03-04 18:00:00.8334 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-04 18:00:00.8534 [normal]: Trying to read Fstat-file into toplist ...
2006-03-04 18:00:08.3642 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-04 18:00:08.3642 [normal]: Resuming computation at (47268/119044525/2391828).
No heartbeat from core client for 31 sec - exiting

2006-03-05 07:38:13.8878 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-05 07:38:13.8878 [normal]: Started search at lalDebugLevel = 0
2006-03-05 07:38:16.8120 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-05 07:38:16.8120 [normal]: Trying to read Fstat-file into toplist ...
2006-03-05 07:38:24.3628 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-05 07:38:24.3829 [normal]: Resuming computation at (48515/119748081/2405917).
No heartbeat from core client for 31 sec - exiting

2006-03-05 15:28:09.9693 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-05 15:28:09.9693 [normal]: Started search at lalDebugLevel = 0
2006-03-05 15:28:14.4457 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-05 15:28:14.5458 [normal]: Trying to read Fstat-file into toplist ...
2006-03-05 15:28:20.0638 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-05 15:28:20.0638 [normal]: Resuming computation at (63214/128065678/2572804).
No heartbeat from core client for 31 sec - exiting

2006-03-06 08:10:24.1482 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:10:24.2283 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:10:29.3757 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:10:29.3757 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:10:39.4202 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:10:39.4202 [normal]: Resuming computation at (72129/132473208/2661225).
No heartbeat from core client for 31 sec - exiting

2006-03-06 08:59:31.7375 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:59:31.7775 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:59:35.0723 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:59:35.1824 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:59:40.1696 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:59:40.1696 [normal]: Resuming computation at (82232/135915533/2730222).

2006-03-06 17:24:33.4859 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 17:24:34.0367 [normal]: Started search at lalDebugLevel = 0
2006-03-06 17:24:37.1712 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 17:24:37.3915 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 17:24:46.3544 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 17:24:46.3844 [normal]: Resuming computation at (112925/144766999/2907377).
No heartbeat from core client for 31 sec - exiting

2006-03-06 19:44:39.2983 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 19:44:42.0522 [normal]: Started search at lalDebugLevel = 0
2006-03-06 19:44:46.9092 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-06 19:44:47.9708 [normal]: No usable checkpoint found, starting from beginning.
2006-03-06 20:03:13.8610 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.

2006-03-06 22:09:00.0439 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 22:09:00.0540 [normal]: Started search at lalDebugLevel = 0
2006-03-06 22:09:02.7378 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 22:09:02.7879 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 22:09:06.1627 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 22:09:06.1627 [normal]: Resuming computation at (29550/102404351/2058304).
No heartbeat from core client for 31 sec - exiting

2006-03-07 18:36:58.6439 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-07 18:36:58.6439 [normal]: Started search at lalDebugLevel = 0
2006-03-07 18:37:08.0975 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-07 18:37:08.1175 [normal]: Trying to read Fstat-file into toplist ...
2006-03-07 18:37:12.8743 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-07 18:37:12.8743 [normal]: Resuming computation at (34226/107105211/2152512).
No heartbeat from core client for 31 sec - exiting

2006-03-07 20:15:38.6407 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-07 20:15:38.6407 [normal]: Started search at lalDebugLevel = 0
2006-03-07 20:15:41.5850 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-07 20:15:41.6150 [normal]: Trying to read Fstat-file into toplist ...
2006-03-07 20:15:45.4105 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-07 20:15:45.4105 [normal]: Resuming computation at (43321/116443685/2339635).
No heartbeat from core client for 31 sec - exiting

2006-03-08 11:08:57.5604 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 11:08:57.5604 [normal]: Started search at lalDebugLevel = 0
2006-03-08 11:09:07.3444 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 11:09:07.7450 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 11:09:15.5863 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 11:09:15.5863 [normal]: Resuming computation at (74498/133436502/2680562).
No heartbeat from core client for 31 sec - exiting

2006-03-08 11:23:05.7804 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 11:23:05.7904 [normal]: Started search at lalDebugLevel = 0
2006-03-08 11:23:09.4357 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 11:23:09.5058 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 11:23:14.4729 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 11:23:14.4729 [normal]: Resuming computation at (75437/133751102/2686860).
No heartbeat from core client for 31 sec - exiting

2006-03-08 13:10:32.7834 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-08 13:10:32.7834 [normal]: Started search at lalDebugLevel = 0
2006-03-08 13:10:41.6261 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-08 13:10:41.7262 [normal]: Trying to read Fstat-file into toplist ...
2006-03-08 13:10:47.5045 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-08 13:10:47.5045 [normal]: Resuming computation at (77878/134559603/2703071).
No heartbeat from core client for 31 sec - exiting

2006-03-09 06:58:30.6560 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 06:58:30.6560 [normal]: Started search at lalDebugLevel = 0
2006-03-09 06:58:40.7605 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 06:58:40.9208 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 06:58:51.3057 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 06:58:51.3057 [normal]: Resuming computation at (80155/135264907/2717188).
No heartbeat from core client for 31 sec - exiting

2006-03-09 13:18:26.1195 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 13:18:26.1195 [normal]: Started search at lalDebugLevel = 0
2006-03-09 13:18:36.1839 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 13:18:36.4043 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 13:18:42.7133 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 13:18:42.7233 [normal]: Resuming computation at (80289/135290995/2717712).
No heartbeat from core client for 31 sec - exiting

2006-03-09 18:13:28.8593 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 18:13:28.8593 [normal]: Started search at lalDebugLevel = 0
2006-03-09 18:13:33.0754 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-09 18:13:33.0754 [normal]: No usable checkpoint found, starting from beginning.
2006-03-09 18:25:01.4352 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
No heartbeat from core client for 31 sec - exiting

2006-03-09 20:19:25.7060 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 20:19:26.0565 [normal]: Started search at lalDebugLevel = 0
2006-03-09 20:19:32.3255 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 20:19:32.4156 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 20:19:32.8863 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 20:19:32.8863 [normal]: Resuming computation at (3469/37219157/747697).
No heartbeat from core client for 31 sec - exiting

2006-03-09 21:06:14.5372 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-09 21:06:14.5572 [normal]: Started search at lalDebugLevel = 0
2006-03-09 21:06:19.5945 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-09 21:06:19.7247 [normal]: Trying to read Fstat-file into toplist ...
2006-03-09 21:06:21.4571 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-09 21:06:21.4571 [normal]: Resuming computation at (6457/56669894/1138774).
No heartbeat from core client for 31 sec - exiting

2006-03-10 07:16:21.0208 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 07:16:21.0208 [normal]: Started search at lalDebugLevel = 0
2006-03-10 07:16:31.2555 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-10 07:16:31.3456 [normal]: Trying to read Fstat-file into toplist ...
2006-03-10 07:16:33.6990 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-10 07:16:33.6990 [normal]: Resuming computation at (7627/61162741/1229280).
No heartbeat from core client for 31 sec - exiting

2006-03-10 17:11:06.0193 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 17:11:06.0293 [normal]: Started search at lalDebugLevel = 0
2006-03-10 17:11:13.3999 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-10 17:11:13.4801 [normal]: Trying to read Fstat-file into toplist ...
2006-03-10 17:11:16.2641 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-10 17:11:16.2641 [normal]: Resuming computation at (16001/82865606/1666019).
No heartbeat from core client for 31 sec - exiting

2006-03-10 21:26:23.8836 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-10 21:26:23.8936 [normal]: Started search at lalDebugLevel = 0
2006-03-10 21:26:28.4401 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-10 21:26:28.4401 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting

2006-03-11 10:04:15.0007 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-11 10:04:15.0107 [normal]: Started search at lalDebugLevel = 0
2006-03-11 10:04:24.9250 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-11 10:04:24.9250 [normal]: No usable checkpoint found, starting from beginning.
No heartbeat from core client for 31 sec - exiting

2006-03-11 15:39:46.3198 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-11 15:39:46.3198 [normal]: Started search at lalDebugLevel = 0
2006-03-11 15:39:55.0324 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-11 15:39:55.0824 [normal]: Trying to read Fstat-file into toplist ...
2006-03-11 15:40:05.6877 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-11 15:40:05.6877 [normal]: Resuming computation at (2571/253182612/5088438).
2006-03-11 15:42:14.0022 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.

Archie & Mehitabel

Joined: 28 Nov 05

Posts: 11

Credit: 24412

RAC: 0

I finally pushed the wu to

14 Mar 2006 21:01:11 UTC

Message 25658

I finally pushed the wu to completion by simply leaving the portable on all day, long enough to finish the work unit uninterrupted. So the work unit is completed and uploaded if you want to check it out:
Result ID 20056495
Name r1_1498.5__2759_S4R2a_2
Workunit 5309203

Since I don't expect to be able to complete an Einstein WU under normal circumstances anymore, I'd like to keep the portable working on WUs it can complete (such as Seti). For now, I've set the portable to not accept new work from Einstein and left the other machines in my group working on Einstein and Seti. If I detach or leave the portable set to accept no new work, will that affect just the portable computer or will the others in my group also follow along? (I checked the wiki and wasn't able to find the answer.)

Thanks,
j

Odysseus

Joined: 17 Dec 05

Posts: 372

Credit: 20699288

RAC: 8950

RE: If I detach or leave

15 Mar 2006 8:41:37 UTC

Message 25659 in response to message 25658

Quote:

If I detach or leave the portable set to accept no new work, will that affect just the portable computer or will the others in my group also follow along?

Only the one host will be affected by the settings in its BOINC manager, unlike the case of changes to the preferences in your BOINC or project-specific accounts.

Archie & Mehitabel

Joined: 28 Nov 05

Posts: 11

Credit: 24412

RAC: 0

Thanks for the information.

15 Mar 2006 12:41:34 UTC

Message 25660

Thanks for the information. I'll let the portable chew on the more "checkpointable" jobs then and leave the workhorse computers that are up all the time working on Einstein.

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

One thing you may want to try

15 Mar 2006 16:32:30 UTC

Message 25661

One thing you may want to try before giving up on the laptop for EAH is to experiment with the switch task interval and/or disk write interval (assuming you haven't already). Try setting the task interval to something like 3 or 4 hours, instead of the default 60 mins. This may give it a chance to generate more usable checkpoints to work with.

Although if the problem is the EAH app is just not getting enough time to do what it needs to before hibernating, this won't make any difference.

Just a thought, no guarantees. ;-)

Alinator
