Einstein Restarts?

Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0
Topic 190874

I am running BOINC on a laptop with Windows XP Home (SP2), attached to the seti@home and einstein@home projects. When I hibernate the laptop (which happens several times a day), einstein often complains of a problem and appears to restart the work unit. The messages I get are:

Quote:
3/6/2006 5:24:17 PM|Einstein@Home|Restarting result r1_1498.5__2759_S4R2a_2 using albert version 437
3/6/2006 5:24:17 PM|SETI@home|Pausing result 29my01ab.21787.21457.242328.1.157_2 (removed from memory)
3/6/2006 5:24:20 PM||request_reschedule_cpus: process exited
3/6/2006 7:44:35 PM|Einstein@Home|Result r1_1498.5__2759_S4R2a_2 exited with zero status but no 'finished' file
3/6/2006 7:44:35 PM|Einstein@Home|If this happens repeatedly you may need to reset the project.
3/6/2006 7:44:35 PM||request_reschedule_cpus: process exited
3/6/2006 7:44:36 PM|Einstein@Home|Restarting result r1_1498.5__2759_S4R2a_2 using albert version 437


I have reset the project as suggested in the messages and the problem continues to occur. Is there something else I should try? Or is the only cure going to be shutting down boinc before I close and hibernate the laptop?

Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0

A bit more info - the stderr file shows the following:

Quote:


2006-03-05 15:28:09.9693 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-05 15:28:09.9693 [normal]: Started search at lalDebugLevel = 0
2006-03-05 15:28:14.4457 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-05 15:28:14.5458 [normal]: Trying to read Fstat-file into toplist ...
2006-03-05 15:28:20.0638 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-05 15:28:20.0638 [normal]: Resuming computation at (63214/128065678/2572804).
No heartbeat from core client for 31 sec - exiting

2006-03-06 08:10:24.1482 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:10:24.2283 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:10:29.3757 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:10:29.3757 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:10:39.4202 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:10:39.4202 [normal]: Resuming computation at (72129/132473208/2661225).
No heartbeat from core client for 31 sec - exiting

2006-03-06 08:59:31.7375 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 08:59:31.7775 [normal]: Started search at lalDebugLevel = 0
2006-03-06 08:59:35.0723 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 08:59:35.1824 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 08:59:40.1696 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 08:59:40.1696 [normal]: Resuming computation at (82232/135915533/2730222).

2006-03-06 17:24:33.4859 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 17:24:34.0367 [normal]: Started search at lalDebugLevel = 0
2006-03-06 17:24:37.1712 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-03-06 17:24:37.3915 [normal]: Trying to read Fstat-file into toplist ...
2006-03-06 17:24:46.3544 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-03-06 17:24:46.3844 [normal]: Resuming computation at (112925/144766999/2907377).
No heartbeat from core client for 31 sec - exiting

2006-03-06 19:44:39.2983 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 19:44:42.0522 [normal]: Started search at lalDebugLevel = 0
2006-03-06 19:44:46.9092 [normal]: Found checkpoint-file 'Fstat.out.ckp'
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-06 19:44:47.9708 [normal]: No usable checkpoint found, starting from beginning.
2006-03-06 20:03:13.8610 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ...

If I'm reading it right, it looks like einstein usually picks up the checkpoint file and resumes OK, but every once in a while, maybe once in ten times, it will give the above error message and start over.
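
One way to check that rough "once in ten times" estimate is to tally the outcomes recorded in stderr. The following is a minimal Python sketch, assuming the log lines keep exactly the wording quoted above; the function name `tally_restarts` and the shortened sample are only illustrative.

```python
def tally_restarts(stderr_text):
    """Count app starts, checkpoint resumes, and from-scratch restarts."""
    counts = {"starts": 0, "resumed": 0, "from_scratch": 0, "heartbeat_exits": 0}
    for line in stderr_text.splitlines():
        if "Start of BOINC application" in line:
            counts["starts"] += 1
        elif "Resuming computation at" in line:
            counts["resumed"] += 1
        elif "No usable checkpoint found" in line:
            counts["from_scratch"] += 1
        elif "No heartbeat from core client" in line:
            counts["heartbeat_exits"] += 1
    return counts


# Shortened sample copied from the last two runs quoted above.
sample = """\
2006-03-06 17:24:33.4859 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
2006-03-06 17:24:46.3844 [normal]: Resuming computation at (112925/144766999/2907377).
No heartbeat from core client for 31 sec - exiting
2006-03-06 19:44:39.2983 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/albert_4.37_windows_intelx86.exe'.
Failed to read checkpoint-counters from 'Fstat.out.ckp'!
2006-03-06 19:44:47.9708 [normal]: No usable checkpoint found, starting from beginning.
"""

print(tally_restarts(sample))
```

Running it over the whole stderr file rather than this excerpt would give the real ratio of "from_scratch" to "starts".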

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157718
RAC: 0

OK, for starters, let's clear this one up. "3/6/2006 7:44:35 PM|Einstein@Home|Result r1_1498.5__2759_S4R2a_2 exited with zero status but no 'finished' file
3/6/2006 7:44:35 PM|Einstein@Home|If this happens repeatedly you may need to reset the project."

A search of the BOINC Wiki (a wonderful and exhaustive document) turns up this explanation, among several: This message indicates that the BOINC Client Software is telling the Participant that they may have a problem on their computer. In those cases where the problem is continual the cure may be to "detach" from the Project or to do a Project "Reset". This will let the BOINC Client Software delete all of the files related to that Project so that, hopefully, the bad file will be eliminated. However, most of the time the best thing to do is to do nothing; the BOINC Client Software will normally recover with no intervention by the Participant.

Before you do a Project "Reset" or "detach" from a Project, check with the Participants that assist on the "Questions & Problems" forums of the affected project!

I would add that I have never heard of a case where that message could not be safely ignored; it is an artifact from the early days of BOINC. Note that, as you read further into the explanation, "resetting" can and does have ugly consequences (lost results, wasted work, extra server load, etc.). Resetting and "detaching" are a last resort, and should only be done after consulting the helpdesk boards, as you have now done.

As to "restarting result", it is an unfortunate choice of words - it should properly be "resuming result", meaning that it picks up computation from the last checkpoint. Depending upon your preference settings ("write to disk at most every...") usually very little work will be wasted.

You may run into some problem by hibernating so often - is there some compelling reason why you choose that option?

Regards,

Michael

microcraft
"The arc of history is long, but it bends toward justice" - MLK

Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0

Yep, I read the wiki and FAQ pre-posting but neither seemed to quite apply.

I tried the reset over a week ago, before searching the wiki. Sorry I resorted to the "last resort" too soon, but I don't think a great deal of work was lost, since BOINC had just restarted the WU from scratch anyway. If I screwed up someone else's prompt credits, my bad.

Note the last few lines in my second message - there was some sort of problem with resuming from the checkpoint (failed to read checkpoint-counters). The problem is intermittent, as the wiki says, "most of the time" the client software recovers and "resumes" work as you suggest the message should have been worded. Maybe 10% of the time however, the description of "restarting" is all too accurate and Boinc/Einstein completely restarts the calculation as it did tonight.
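
For anyone wondering how a checkpoint could end up unreadable: one common failure mode in general is a file that was only partly written when the process stopped. The usual guard is to write to a temporary file and rename it into place. Below is a minimal Python sketch of that general technique. It is not a description of how the Einstein@Home app actually writes `Fstat.out.ckp`; the function name `write_checkpoint_atomically` and the file contents are assumptions made up for illustration.

```python
import os
import tempfile


def write_checkpoint_atomically(path, payload: bytes):
    """Write payload to path so readers see the old or the new file, never half."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())    # push the bytes to disk before renaming
        os.replace(tmp_path, path)    # swap the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)       # leave no stray temp file behind
        raise


# Usage with made-up checkpoint contents in a scratch directory.
with tempfile.TemporaryDirectory() as workdir:
    ckp = os.path.join(workdir, "Fstat.out.ckp")
    write_checkpoint_atomically(ckp, b"counters: 1\n")
    write_checkpoint_atomically(ckp, b"counters: 2\n")
    with open(ckp, "rb") as fh:
        print(fh.read())
```

If the process is stopped in the middle of a write, the previous checkpoint is still intact under its normal name, so at worst one checkpoint interval of work is lost.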

Seti has never shown the symptoms.

The disks have been checked, and there were no problems in the file system.

The frequent hibernations are simply because I close the laptop when I'm not using it during the day, or when I have something to do away from the keyboard for a while. I can pick up where I left off from a hibernation more easily than beginning again from a complete reboot (which gets done about once a week anyway). No other application is showing any problem with the frequent hibernation.

I've got no problem ignoring the error message and the restarted calculations (though they may make me late once in a while in delivering the results) - I've been noticing the symptom on the laptop almost since I joined einstein@home.

I just thought y'all would want to know in case it was a reasonably common problem and just maybe that my symptoms would help you chase it down a bit better.

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157718
RAC: 0

Hehe :-)

I'm not scolding you for resetting, and any delay in crediting your co-crunchers will be minimal: your result, although errored, was turned in quickly, so it can also be resent to someone else quickly. I intended nothing negative at all, just informative. Ditto the hibernating, although it may be a factor in the failure to resume from checkpoint. In that light, could you tell me what your "write to disk at most every ..." setting is (in "General preferences") and, if possible, how far into that particular WU the "no usable checkpoint found" occurred?


Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0

The "write to disk at most every..." is set to 60 seconds. "Connect to Network every ..." is 0.1 days. Boinc is set to "run always" and "Network always available". The boinc manager screen was open, but minimized.

Unfortunately I didn't know that this would be one of the restart times, so I didn't look too closely at where I was in the WU. I had seen one of the Boinc scheduling interventions (the "overcommitted" message) in favor of Einstein@home in the previous day or two and at that point I was about 60-70% done as I remember. Boinc had begun accepting Seti work again, so the "crisis" had passed at the time of the restart.

Let me know if I can provide any other information that might help?

j

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

The explanation here is simple. When the machine goes into hibernation, the EAH app is not given enough time to complete its exit tasks and "go to sleep" gracefully. So when the app comes back after hibernation, it sees that it exited, thinks it completed the result, but can't find the "finished" file and says, "WHAT??!! This ain't right, I guess I'll have to go back to the last checkpoint and try again."

FWIW, this will happen with SAH too (I have seen it happen there as well), and applications not behaving well during hibernation are not all that uncommon. Also, as has been mentioned, this is not grounds for a project reset; the fact that it restarts and continues testifies to that. And yes, you could avoid the error messages by manually quitting BOINC, but I never bother and haven't lost a result to this per se.
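
The stderr lines quoted earlier in the thread ("No heartbeat from core client for 31 sec - exiting") suggest a timeout measured against the wall clock. Here is a minimal Python sketch of that idea. It is not the actual BOINC code: the class name `HeartbeatWatchdog`, the 30-second limit, and the hand-driven clock are all assumptions for illustration. It shows why a clock that jumps forward across a hibernation can look like a long silence even though nothing crashed.

```python
class HeartbeatWatchdog:
    """Hypothetical watchdog: give up if no heartbeat was seen recently."""

    def __init__(self, clock, timeout_sec=30.0):
        self.clock = clock              # callable returning wall-clock seconds
        self.timeout_sec = timeout_sec  # assumed limit, not taken from BOINC
        self.last_beat = clock()

    def beat(self):
        """Record a heartbeat from the client."""
        self.last_beat = self.clock()

    def should_exit(self):
        """True when the gap since the last heartbeat exceeds the limit."""
        return self.clock() - self.last_beat > self.timeout_sec


class FakeClock:
    """Clock advanced by hand, used here to mimic a hibernation."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
dog = HeartbeatWatchdog(clock)

clock.now += 1.0           # normal running: a heartbeat arrives
dog.beat()
print(dog.should_exit())   # False

clock.now += 2 * 3600      # lid closed for two hours, no heartbeats recorded
print(dog.should_exit())   # True
```

Under these assumptions the app would exit cleanly on wake-up, which fits the "exited with zero status but no 'finished' file" message on the client side.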

HTH,

Alinator

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157718
RAC: 0

Message 25649 in response to message 25647

Quote:
The "write to disk at most every..." is set to 60 seconds. "Connect to Network every ..." is 0.1 days. Boinc is set to "run always" and "Network always available". The boinc manager screen was open, but minimized.

OK, there seems to be nothing out-of-kilter there.

Quote:

Unfortunately I didn't know that this would be one of the restart times, so I didn't look too closely at where I was in the WU. I had seen one of the Boinc scheduling interventions (the "overcommitted" message) in favor of Einstein@home in the previous day or two and at that point I was about 60-70% done as I remember. Boinc had begun accepting Seti work again, so the "crisis" had passed at the time of the restart.

Let me know if I can provide any other information that might help?

j

Nothing is "jumping out" as the cause, but it may be helpful if you would "un-hide" your computer/s - there may be something in the info we could see in there that would explain. Up to you ...

Michael


Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0

My computers are visible now. The relevant computer is Toshiba-User. I was surprised that I've only completed 2 wu's on that machine - it seems too low, but I suppose it's possible with the occasional complete restarts and only running part-time.

Michael Roycraft
Joined: 10 Mar 05
Posts: 846
Credit: 157718
RAC: 0

Message 25651 in response to message 25650

Quote:
My computers are visible now. The relevant computer is Toshiba-User. I was surprised that I've only completed 2 wu's on that machine - it seems too low, but I suppose it's possible with the occasional complete restarts and only running part-time.

They aren't listed by name, except on your private page, but I am assuming that you refer to the Celeron-powered host, # 458419. It shows one WU finished, turned in on Mar 3, and another in-progress, which I will further assume must be the offender by the date of the error message, Mar 6. When that WU finishes, I'll be able to look at the detail on the result, to see if there's anything "funny" there.

Your laptop has surely done more than 2 WUs. Credit earned for that host is ~615, so going by average granted credit, it has probably completed at least 10-12 WUs. (Results don't remain on the page for long after credit has been granted; they're deleted from the database after 2-3 weeks to free up room on the database server.) Don't misunderstand - your credit for doing the work remains, and the result of the work remains at the project; only the info on the 'net database server is culled periodically.
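
As a rough check on that estimate, the arithmetic can be run the other way. This minimal Python sketch uses only the ~615 credit and the 10-12 WU range from this post to see what average credit per WU they imply; the function name `credit_per_wu` is made up for illustration.

```python
def credit_per_wu(total_credit, completed_wus):
    """Average granted credit per work unit implied by the totals."""
    return total_credit / completed_wus


host_credit = 615                 # approximate credit shown for the host
for completed in (10, 12):        # estimated range of finished WUs
    print(f"{completed} WUs -> {credit_per_wu(host_credit, completed):.2f} credits each")
```

So the 10-12 WU guess corresponds to an average somewhere around 51 to 62 credits per WU.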


Archie & Mehitabel
Joined: 28 Nov 05
Posts: 11
Credit: 24412
RAC: 0

That's the right computer. If you want to chase it further, I'll be glad to post another note here when the wu completes.

I don't think the problem is specific to the work unit though - this has happened somewhat regularly and the other work units have appeared to go through properly.

No worries re the credits. I haven't found any place to spend 'em yet.
