(I see this has been asked before, but I'll ask again, as the board seems to have been restructured.)
My system crashed and Einstein resumed at about 45% after it had finished around 65% of the work unit. There seems to be no message log of either the crash or the previous processing, so I'm unable to provide more information.
How often does Einstein checkpoint workunits?
einstein 4.79
BOINC 4.19
Win98SE
Credit still 0
--
searching for gravitational waves since 2005
> How often does Einstein checkpoint workunits?
Einstein checkpoints every time BOINC lets it do so, unlike in Predictor or CPDN. You should check the prefs under "write to disk at most xx seconds" - that might be set too high and therefore prevent Einstein from checkpointing.
Administrator
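For illustration only, since the Einstein@Home source isn't linked in this thread: a minimal sketch of how a BOINC application typically honours that preference. boinc_time_to_checkpoint() and boinc_checkpoint_completed() are real calls from the public BOINC API; the loop and the helper names are made up.

    /* Sketch only: the app computes continuously, but only writes a
       checkpoint when the core client says enough time has passed
       (the "write to disk at most every xx seconds" preference). */
    #include "boinc_api.h"

    void do_one_iteration(long i);    /* hypothetical work step          */
    void write_checkpoint(long i);    /* hypothetical state-file writer  */

    void science_loop(long n_iterations)
    {
        long i;
        for (i = 0; i < n_iterations; i++) {
            do_one_iteration(i);
            if (boinc_time_to_checkpoint()) {   /* nonzero only when allowed */
                write_checkpoint(i);
                boinc_checkpoint_completed();   /* restart the client's timer */
            }
        }
    }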
> > How often does Einstein checkpoint workunits?
>
> Einstein checkpoints every time BOINC lets it do so, unlike in Predictor or
> CPDN. You should check the prefs under "write to disk at most xx seconds" -
> that might be set too high and therefore prevent Einstein from checkpointing.
>
My setting is once every 60 seconds (the default?) ... and I see there is a BOINC file called client_state.xml that is updated every minute. However, it didn't roll back - bug?
Any comment about the lack of a message log file?
--
searching for gravitational waves since 2005
The App writes two temp files during two stages of the analysis (0-49.5%, 49.5-99%). It writes checkpoints as often as you specified in the preferences. However, it checks the temp files when it is resumed. I think what happened is that during the crash the second file got corrupted, so the App decided to repeat the calculation.
BM
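As a rough sketch of the resume behaviour described above (hypothetical names, not the actual E@H code): on restart the app could check whether a stage's temp file is complete and, if it is missing or truncated, fall back and redo that stage, which would show up as the progress figure dropping after a restart.

    /* Hypothetical resume check: treat a missing or truncated stage
       temp file as corrupted and signal that the stage must be redone. */
    #include <stdio.h>

    int stage_file_is_valid(const char *path, long expected_bytes)
    {
        FILE *f = fopen(path, "rb");
        long size;
        if (!f) return 0;                 /* missing: redo the stage   */
        fseek(f, 0, SEEK_END);
        size = ftell(f);
        fclose(f);
        return size == expected_bytes;    /* truncated: redo the stage */
    }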
> The App writes two temp files during two stages of the analysis
> (0-49.5%, 49.5-99%). It writes checkpoints as often as you specified in the
> preferences. However, it checks the temp files when it is resumed. I think
> what happened is that during the crash the second file got corrupted, so the
> App decided to repeat the calculation.
Yes, that would explain it. Perhaps the temp files should be written more often, especially for those of us with slower processors.
--
searching for gravitational waves since 2005
> Yes, that would explain it. Perhaps the temp files should be written more
> often, especially for those of us with slower processors.
They are written more or less continuously (though not more often than the checkpoints). The "problem", if any, is that they are not closed until they have been fully written. This isn't a problem as long as the processes on that machine are properly shut down. When, however, the machine crashes severely or someone pulls the plug and the OS hasn't time to properly terminate the running processes, the file _might_ get damaged, just like other files of other running applications.
BM
> > Yes, that would explain it. Perhaps the temp files should be written more
> > often, especially for those of us with slower processors.
>
> They are written more or less continuously (though not more often than the
> checkpoints). The "problem", if any, is that they are not closed until they
> have been fully written. This isn't a problem as long as the processes on that
> machine are properly shut down. When, however, the machine crashes severely or
> someone pulls the plug and the OS hasn't time to properly terminate the
> running processes, the file _might_ get damaged, just like other files of
> other running applications.
>
> BM
I looked for source code, but couldn't find a link to it. Is it available? It would help in looking into some of these problems.
Does the program flush the buffered data to disk? It won't help anything if the PC crashes while the temporary files are being written, but it sure does help if the PC crashes later. Like:
fflush(outstream);
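For what it's worth, a small illustrative example (not taken from the E@H source): fflush() only moves data from the stdio buffer into the operating system's cache, so surviving a hard crash or power loss also needs an OS-level flush. fsync()/fileno() below are POSIX; on Windows the rough equivalent is _commit() from <io.h>.

    /* Illustrative only: push buffered data all the way to the disk. */
    #include <stdio.h>
    #include <unistd.h>                     /* fsync(), fileno() - POSIX */

    int flush_to_disk(FILE *outstream)
    {
        if (fflush(outstream) != 0)         /* stdio buffer -> OS cache  */
            return -1;
        if (fsync(fileno(outstream)) != 0)  /* OS cache -> disk platter  */
            return -1;
        return 0;
    }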
> > > Yes, that would explain it. Perhaps the temp files should be written more
> > > often, especially for those of us with slower processors.
> >
> > They are written more or less continuously (though not more often than the
> > checkpoints). The "problem", if any, is that they are not closed until they
> > have been fully written. This isn't a problem as long as the processes on
> > that machine are properly shut down. When, however, the machine crashes
> > severely or someone pulls the plug and the OS hasn't time to properly
> > terminate the running processes, the file _might_ get damaged, just like
> > other files of other running applications.
> >
> > BM
>
> I looked for source code, but couldn't find a link to it. Is it available?
> It would help in looking into some of these problems.
>
> Does the program flush the buffered data to disk? It won't help anything if
> the PC crashes while the temporary files are being written, but it sure does
> help if the PC crashes later. Like:
>
> fflush(outstream);
Walt, the code internally allocates a 2MB buffer and writes to the buffer. Checkpointing consists of flushing that buffer to disk. The frequency can be set by the user in their preferences. The E@h default is 60 secs. Users attached to other projects should beware that they inherit whatever checkpoint default was used for *those* projects.
Bruce
Director, Einstein@Home
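One plausible reading of that, offered purely as a guess at the structure rather than the actual implementation: give the output stream a 2 MB stdio buffer with setvbuf(), so that results accumulate in memory and a checkpoint amounts to flushing that buffer.

    /* Guesswork sketch, not the real E@H code: a fully buffered output
       stream with a 2 MB buffer; nothing reaches the file until the
       buffer fills or the app flushes it at a checkpoint. */
    #include <stdio.h>
    #include <stdlib.h>

    #define CKPT_BUF_SIZE (2 * 1024 * 1024)   /* the 2 MB Bruce mentions */

    FILE *open_buffered_output(const char *path)
    {
        FILE *f = fopen(path, "wb");
        char *buf;
        if (!f) return NULL;
        buf = malloc(CKPT_BUF_SIZE);          /* must outlive the stream */
        if (!buf || setvbuf(f, buf, _IOFBF, CKPT_BUF_SIZE) != 0) {
            free(buf);
            fclose(f);
            return NULL;
        }
        return f;                             /* checkpoint = fflush(f)  */
    }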
The original problem seems to be related to a corrupted file...
May I suggest that the client use two files. Use one at a time, but close it every half hour, then copy it to the second and use that for the next half hour. That way a hard system crash should only lose up to a half hour of work on restart.
Ned
Ol' Retired IT Geezer
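A sketch of the kind of two-file scheme Ned is suggesting (illustrative only; the file names and helper are made up): write each new checkpoint to a scratch file, keep the previous good copy, and only promote the new one after it has been written and closed, so a crash always leaves at least one intact checkpoint on disk.

    /* Illustrative two-file checkpoint rotation: at any moment at least
       one of checkpoint.dat / checkpoint_prev.dat is complete on disk. */
    #include <stdio.h>

    int save_checkpoint(const char *data, size_t len)
    {
        FILE *f = fopen("checkpoint_next.tmp", "wb");
        if (!f) return -1;
        if (fwrite(data, 1, len, f) != len) { fclose(f); return -1; }
        if (fclose(f) != 0) return -1;        /* data handed to the OS   */

        remove("checkpoint_prev.dat");                    /* drop oldest */
        rename("checkpoint.dat", "checkpoint_prev.dat");  /* keep last   */
        rename("checkpoint_next.tmp", "checkpoint.dat");  /* promote new */
        return 0;
    }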
> The original problem seems to be related to a corrupted file...
> May I suggest that the client use two files. Use one at a time, but close it
> every half hour, then copy it to the second and use that for the next half hour.
> That way a hard system crash should only lose up to a half hour of work on
> restart.
>
> Ned
It looks like the two checkpoint files (client_state.xml and client_state_prev.xml) are cycled every 60 seconds in the default mode. A reliable file protection scheme will ensure that they are both available for resume with a maximum data loss of only 60 seconds, or whatever value is set in the preferences.
--
searching for gravitational waves since 2005
> > The original problem seems to be related to a corrupted file...
> > May I suggest that the client use two files. Use one at a time, but close it
> > every half hour, then copy it to the second and use that for the next half
> > hour. That way a hard system crash should only lose up to a half hour of
> > work on restart.
> >
> > Ned
>
> It looks like the two checkpoint files (client_state.xml and
> client_state_prev.xml) are cycled every 60 seconds in the default mode. A
> reliable file protection scheme will ensure that they are both available for
> resume with a maximum data loss of only 60 seconds, or whatever value is set
> in the preferences.
>
The checkpoint files probably point to positions in the "temp" files mentioned in BM's earlier post.
"The App writes two temp files during two stages of the analysis (0-49.5%,49.5-99%)."
My post was ment to refer to those files... Especially under Windows, since windows doesn't have a journalling file system like Linux to perform any file recovery in a restart situation.
Ned
Ol' Retired IT Geezer