how often does einstein checkpoint workunits?

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0
Topic 187811

(i see this has been asked before, but i'll ask again as the board seems to have been restructured)

my system crashed and einstein resumed at about 45% after it had finished around 65% of the work unit. there seems to be no message log from either the crash or the previous processing, so i'm unable to provide more information.

how often does einstein checkpoint workunits?

einstein 4.79
BOINC 4.19
Win98SE
Credit still 0

--
searching for gravitational waves since 2005

Rytis
Joined: 10 Nov 04
Posts: 56
Credit: 1049463
RAC: 0

how often does einstein checkpoint workunits?

> how often does einstein checkpoint workunits?

Einstein checkpoints every time BOINC lets it do so, unlike Predictor or CPDN. You should check the prefs under "write to disk at most xx seconds" - that might be set too high, preventing Einstein from checkpointing.


Administrator
Message@Home

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> > how often does einstein

Message 3961 in response to message 3960

> > how often does einstein checkpoint workunits?
>
> Einstein checkpoints every time BOINC lets it do so, unlike Predictor or
> CPDN. You should check the prefs under "write to disk at most xx seconds" -
> that might be set too high, preventing Einstein from checkpointing.
>
my setting is once every 60 seconds (the default?) ... and i see there is a boinc file called client_state.xml that is updated every minute, however it didn't roll back, bug?

any comment about the lack of a message log file?

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245202788
RAC: 13466

The App writes two temp files

The App writes two temp files during two stages of the analysis (0-49.5%,49.5-99%). It writes checkpoints as you specified in the preferences. However, it checks the temp files when it is resumed. I think that during the crash the second file got corrupted, so the App decided to repeat the calculation.
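
As a rough illustration of that resume-time check (a minimal sketch only; the file name, record layout and expected count below are made up, not the App's real ones), the App can compare the temp file's size against what a finished stage should have produced and redo the stage if they don't match:

#include <stdio.h>

struct record { double freq; double power; };   /* made-up record layout */

/* Returns 1 if the stage's temp file looks complete, 0 if the stage must be redone. */
static int stage_file_ok(const char *path, long expected_records)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;                      /* no file at all: redo the stage */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);
    /* a crash can leave a short or partially written file behind */
    return size == expected_records * (long)sizeof(struct record);
}

int main(void)
{
    if (!stage_file_ok("stage2.tmp", 100000L))
        printf("temp file missing or damaged - repeating second stage\n");
    else
        printf("temp file intact - resuming from checkpoint\n");
    return 0;
}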

BM

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> The App writes two temp

Message 3963 in response to message 3962

> The App writes two temp files during two stages of the analysis
> (0-49.5%,49.5-99%). It writes checkpoints as you specified in the preferences.
> However, it checks the temp files when it is resumed. I think that during the
> crash the second file got corrupted, so the App decided to repeat the
> calculation.

yes that would explain it. perhaps the temp files should be written more often, especially for those of us with slower processors.

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245202788
RAC: 13466

> yes that would explain it.

Message 3964 in response to message 3963

> yes that would explain it. perhaps the temp files should be written more
> often, especially for those of us with slower processors.

They are written more or less continuously (though not more often than the checkpoints). The "problem", if any, is that they are not closed until they have been fully written. This isn't a problem as long as the processes on that machine are properly shut down. When, however, the machine crashes severely or someone pulls the plug and the OS hasn't time to properly terminate the running processes, the file _might_ get damaged, just like other files of other running applications.
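
A common way to avoid even that (shown only as a sketch of the general technique, not what the App currently does) is to write each new copy of the file under a scratch name and only rename it over the old one once it is complete, so a crash leaves either the old file or the new file intact, never a torn one:

#include <stdio.h>

/* Sketch: write the new checkpoint to "<path>.new", flush it, then replace
 * the old file. rename() is atomic on POSIX; on Windows the old file has to
 * be removed first, which leaves a tiny window but never a half-written file. */
static int write_checkpoint(const char *path, const void *data, size_t len)
{
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.new", path);

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    if (fwrite(data, 1, len, f) != len || fflush(f) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    fclose(f);

    remove(path);                  /* needed where rename() won't overwrite */
    return rename(tmp, path);
}

int main(void)
{
    const char state[] = "checkpoint payload";   /* made-up payload */
    return write_checkpoint("einstein.ckpt", state, sizeof state) ? 1 : 0;
}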

BM

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

> > yes that would explain

Message 3965 in response to message 3964

> > yes that would explain it. perhaps the temp files should be written more
> > often, especially for those of us with slower processors.
>
> They are written more or less continuously (though not more often than the
> checkpoints). The "problem", if any, is that they are not closed until they
> have been fully written. This isn't a problem as long as the processes on that
> machine are properly shut down. When, however, the machine crashes severely or
> someone pulls the plug and the OS hasn't time to properly terminate the
> running processes, the file _might_ get damaged, just like other files of
> other running applications.
>
> BM

I looked for source code, but couldn't find a link to it. Is it available? It would help in looking into some of these problems.

Does the program flush the buffered data to disk? It won't help anything if the PC crashes while the temporary files are being written, but it sure does help if the PC crashes later. Like:

fflush( outstream );
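
For what it's worth, fflush() only pushes data from the C runtime's buffer into the operating system's cache; getting it onto the disk itself takes an extra OS-level call. A rough sketch (the stream name is just illustrative):

#include <stdio.h>
#ifdef _WIN32
#include <io.h>                      /* _commit(), _fileno() */
#else
#include <unistd.h>                  /* fsync() */
#endif

static int flush_to_disk(FILE *outstream)
{
    if (fflush(outstream) != 0)              /* C library buffer -> OS cache */
        return -1;
#ifdef _WIN32
    return _commit(_fileno(outstream));      /* OS cache -> disk (MS runtime) */
#else
    return fsync(fileno(outstream));         /* OS cache -> disk (POSIX) */
#endif
}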

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> > > yes that would explain

Message 3966 in response to message 3965

> > > yes that would explain it. perhaps the temp files should be written more
> > > often, especially for those of us with slower processors.
> >
> > They are written more or less continuously (though not more often than the
> > checkpoints). The "problem", if any, is that they are not closed until they
> > have been fully written. This isn't a problem as long as the processes on that
> > machine are properly shut down. When, however, the machine crashes severely or
> > someone pulls the plug and the OS hasn't time to properly terminate the
> > running processes, the file _might_ get damaged, just like other files of
> > other running applications.
> >
> > BM
>
> I looked for source code, but couldn't find a link to it. Is it available?
> It would help in looking into some of these problems.
>
> Does the program flush the buffered data to disk? It won't help anything if
> the PC crashes while the temporary files are being written, but it sure does
> help if the PC crashes later. Like:
>
> fflush( outstream );

Walt, the code internally allocates a 2MB buffer and writes to the buffer. Checkpointing consists of flushing that buffer to disk. The frequency can be set by the user in their preferences. The E@h default is 60 secs. Users attached to other projects should beware that they inherit whatever checkpoint default was used for *those* projects.
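
In outline, the flush decision follows the usual BOINC pattern - roughly like the sketch below (illustrative only, not the actual E@h source; do_one_step() and flush_buffer_to_disk() are hypothetical stand-ins). boinc_time_to_checkpoint() only says yes once the user's "write to disk at most xx seconds" interval has elapsed:

#include "boinc_api.h"    /* boinc_time_to_checkpoint(), boinc_checkpoint_completed(), boinc_fraction_done() */

extern void do_one_step(long step);        /* hypothetical: one unit of analysis work */
extern void flush_buffer_to_disk(void);    /* hypothetical: writes out the 2 MB result buffer */

void analysis_loop(long total_steps)
{
    for (long step = 0; step < total_steps; step++) {
        do_one_step(step);

        if (boinc_time_to_checkpoint()) {   /* user's disk-write interval elapsed? */
            flush_buffer_to_disk();
            boinc_checkpoint_completed();   /* tell BOINC the saved state is consistent */
        }

        boinc_fraction_done((double)step / (double)total_steps);
    }
}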

Bruce

Director, Einstein@Home

Ned
Joined: 22 Jan 05
Posts: 18
Credit: 24493621
RAC: 0

The original problem seems to

The original problem seems to be related to a corrupted file...
May I suggest that the client use two files. Use one at a time, but close it every half hour then copy to the second and use it for the next half hour. That way a hard system crash should only lose up to a half hour of work on restart.

Ned

Ol' Retired IT Geezer

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> The original problem seems

Message 3968 in response to message 3967

> The original problem seems to be related to a corrupted file...
> May I suggest that the client use two files. Use one at a time, but close it
> every half hour then copy to the second and use it for the next half hour.
> That way a hard system crash should only lose up to a half hour of work on
> restart.
>
> Ned

it looks like the two checkpoint files (client_state.xml and client_state_prev.xml) are cycled every 60 seconds in the default mode. a reliable file protection scheme will ensure that they are both available for resume with a maximum data loss of only 60 seconds or whatever value is set in the preferences.

--
searching for gravitational waves since 2005

Ned
Joined: 22 Jan 05
Posts: 18
Credit: 24493621
RAC: 0

> > The original problem

Message 3969 in response to message 3968

> > The original problem seems to be related to a corrupted file...
> > May I suggest that the client use two files. Use one at a time, but close it
> > every half hour then copy to the second and use it for the next half hour.
> > That way a hard system crash should only lose up to a half hour of work on
> > restart.
> >
> > Ned
>
> it looks like the two checkpoint files (client_state.xml and
> client_state_prev.xml) are cycled every 60 seconds in the default mode. a
> reliable file protection scheme will ensure that they are both available for
> resume with a maximum data loss of only 60 seconds or whatever value is set in
> the preferences.
>
>
The checkpoint files probably point to positions in the "temp" files mentioned in BM's earlier post.
"The App writes two temp files during two stages of the analysis (0-49.5%,49.5-99%)."
My post was meant to refer to those files... especially under Windows, since Windows doesn't have a journalling file system like Linux to perform any file recovery in a restart situation.

Ned

Ol' Retired IT Geezer
