how often does einstein checkpoint workunits?

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0
Topic 187811

(i see this has been asked before, but i'll ask again as the board seems to have been restructured)

my system crashed and einstein resumed at about 45% after it had finished around 65% of the work unit. there seems to be no message log from either the crash or the previous processing, so i'm unable to provide more information.

how often does einstein checkpoint workunits?

einstein 4.79
BOINC 4.19
Win98SE
Credit still 0

--
searching for gravitational waves since 2005

Rytis
Joined: 10 Nov 04
Posts: 56
Credit: 1049463
RAC: 0

how often does einstein checkpoint workunits?

> how often does einstein checkpoint workunits?

Einstein checkpoints every time BOINC lets it do so, unlike Predictor or CPDN. You should check the prefs under "write to disk at most xx seconds" - that might be set too high, preventing Einstein from checkpointing.


Administrator
Message@Home

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> > how often does einstein

Message 3961 in response to message 3960

> > how often does einstein checkpoint workunits?
>
> Einstein checkpoints every time BOINC lets it do so, unlike Predictor or
> CPDN. You should check the prefs under "write to disk at most xx seconds" -
> that might be set too high, preventing Einstein from checkpointing.
>
my setting is once every 60 seconds (the default?) ... and i see there is a boinc file called client_state.xml that is updated every minute, however it didn't roll back, bug?

any comment about the lack of a message log file?

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245202788
RAC: 13466

The App writes two temp files

The App writes two temp files during two stages of the analysis (0-49.5%,49.5-99%). It writes checkpoints as you specified in the preferences. However, it checks the temp files when it is resumed. I think that during the crash the second file got corrupted, so the App decided to repeat the calculation.
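
As a rough illustration of that resume-time check (a minimal sketch only; the file name, record layout and expected count below are made up, not the App's real ones), the App can compare the temp file's size against what a finished stage should have produced and redo the stage if they don't match:

#include <stdio.h>

struct record { double freq; double power; };   /* made-up record layout */

/* Returns 1 if the stage's temp file looks complete, 0 if the stage must be redone. */
static int stage_file_ok(const char *path, long expected_records)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;                      /* no file at all: redo the stage */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fclose(f);
    /* a crash can leave a short or partially written file behind */
    return size == expected_records * (long)sizeof(struct record);
}

int main(void)
{
    if (!stage_file_ok("stage2.tmp", 100000L))
        printf("temp file missing or damaged - repeating second stage\n");
    else
        printf("temp file intact - resuming from checkpoint\n");
    return 0;
}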

BM

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> The App writes two temp

Message 3963 in response to message 3962

> The App writes two temp files during two stages of the analysis
> (0-49.5%,49.5-99%). It writes checkpoints as you specified in the preferences.
> However, it checks the temp files when it is resumed. I think that during the
> crash the second file got corrupted, so the App decided to repeat the
> calculation.

yes that would explain it. perhaps the temp files should be written more often, especially for those of us with slower processors.

--
searching for gravitational waves since 2005

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4273
Credit: 245202788
RAC: 13466

> yes that would explain it.

Message 3964 in response to message 3963

> yes that would explain it. perhaps the temp files should be written more
> often, especially for those of us with slower processors.

They are written more or less continuously (though not more often than the checkpoints). The "problem", if any, is that they are not closed until they have been fully written. This isn't a problem as long as the processes on that machine are properly shut down. When, however, the machine crashes severely or someone pulls the plug and the OS hasn't time to properly terminate the running processes, the file _might_ get damaged, just like other files of other running applications.
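
A common way to avoid even that (shown only as a sketch of the general technique, not what the App currently does) is to write each new copy of the file under a scratch name and only rename it over the old one once it is complete, so a crash leaves either the old file or the new file intact, never a torn one:

#include <stdio.h>

/* Sketch: write the new checkpoint to "<path>.new", flush it, then replace
 * the old file. rename() is atomic on POSIX; on Windows the old file has to
 * be removed first, which leaves a tiny window but never a half-written file. */
static int write_checkpoint(const char *path, const void *data, size_t len)
{
    char tmp[512];
    snprintf(tmp, sizeof tmp, "%s.new", path);

    FILE *f = fopen(tmp, "wb");
    if (!f)
        return -1;
    if (fwrite(data, 1, len, f) != len || fflush(f) != 0) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    fclose(f);

    remove(path);                  /* needed where rename() won't overwrite */
    return rename(tmp, path);
}

int main(void)
{
    const char state[] = "checkpoint payload";   /* made-up payload */
    return write_checkpoint("einstein.ckpt", state, sizeof state) ? 1 : 0;
}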

BM

Walt Gribben
Joined: 20 Feb 05
Posts: 219
Credit: 1645393
RAC: 0

> > yes that would explain

Message 3965 in response to message 3964

> > yes that would explain it. perhaps the temp files should be written more
> > often, especially for those of us with slower processors.
>
> They are written more or less continuously (though not more often than the
> checkpoints). The "problem", if any, is that they are not closed until they
> have been fully written. This isn't a problem as long as the processes on that
> machine are properly shut down. When, however, the machine crashes severely or
> someone pulls the plug and the OS hasn't time to properly terminate the
> running processes, the file _might_ get damaged, just like other files of
> other running applications.
>
> BM

I looked for source code, but couldn't find a link to it. Is it available? It would help in looking into some of these problems.

Does the program flush the buffered data to disk? It won't help anything if the PC crashes while the temporary files are being written, but it sure does help if the PC crashes later. Like:

fflush( outstream );
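
For what it's worth, fflush() only pushes data from the C runtime's buffer into the operating system's cache; getting it onto the disk itself takes an extra OS-level call. A rough sketch (the stream name is just illustrative):

#include <stdio.h>
#ifdef _WIN32
#include <io.h>                      /* _commit(), _fileno() */
#else
#include <unistd.h>                  /* fsync() */
#endif

static int flush_to_disk(FILE *outstream)
{
    if (fflush(outstream) != 0)              /* C library buffer -> OS cache */
        return -1;
#ifdef _WIN32
    return _commit(_fileno(outstream));      /* OS cache -> disk (MS runtime) */
#else
    return fsync(fileno(outstream));         /* OS cache -> disk (POSIX) */
#endif
}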

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

> > > yes that would explain

Message 3966 in response to message 3965

> > > yes that would explain it. perhaps the temp files should be written more
> > > often, especially for those of us with slower processors.
> >
> > They are written more or less continuously (though not more often than the
> > checkpoints). The "problem", if any, is that they are not closed until they
> > have been fully written. This isn't a problem as long as the processes on that
> > machine are properly shut down. When, however, the machine crashes severely or
> > someone pulls the plug and the OS hasn't time to properly terminate the
> > running processes, the file _might_ get damaged, just like other files of
> > other running applications.
> >
> > BM
>
> I looked for source code, but couldn't find a link to it. Is it available?
> It would help in looking into some of these problems.
>
> Does the program flush the buffered data to disk? It won't help anything if
> the PC crashes while the temporary files are being written, but it sure does
> help if the PC crashes later. Like:
>
> fflush( outstream );

Walt, the code internally allocates a 2MB buffer and writes to the buffer. Checkpointing consists of flushing that buffer to disk. The frequency can be set by the user in their preferences. The E@h default is 60 secs. Users attached to other projects should beware that they inherit whatever checkpoint default was used for *those* projects.
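
In outline, the flush decision follows the usual BOINC pattern - roughly like the sketch below (illustrative only, not the actual E@h source; do_one_step() and flush_buffer_to_disk() are hypothetical stand-ins). boinc_time_to_checkpoint() only says yes once the user's "write to disk at most xx seconds" interval has elapsed:

#include "boinc_api.h"    /* boinc_time_to_checkpoint(), boinc_checkpoint_completed(), boinc_fraction_done() */

extern void do_one_step(long step);        /* hypothetical: one unit of analysis work */
extern void flush_buffer_to_disk(void);    /* hypothetical: writes out the 2 MB result buffer */

void analysis_loop(long total_steps)
{
    for (long step = 0; step < total_steps; step++) {
        do_one_step(step);

        if (boinc_time_to_checkpoint()) {   /* user's disk-write interval elapsed? */
            flush_buffer_to_disk();
            boinc_checkpoint_completed();   /* tell BOINC the saved state is consistent */
        }

        boinc_fraction_done((double)step / (double)total_steps);
    }
}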

Bruce

Director, Einstein@Home

Ned
Joined: 22 Jan 05
Posts: 18
Credit: 24493621
RAC: 0

The original problem seems to

The original problem seems to be related to a corrupted file...
May I suggest that the client use two files. Use one at a time, but close it every half hour then copy to the second and use it for the next half hour. That way a hard system crash should only lose up to a half hour of work on restart.

Ned

Ol' Retired IT Geezer

cIclops
Joined: 19 Feb 05
Posts: 26
Credit: 450
RAC: 0

> The original problem seems

Message 3968 in response to message 3967

> The original problem seems to be related to a corrupted file...
> May I suggest that the client use two files. Use one at a time, but close it
> every half hour then copy to the second and use it for the next half hour.
> That way a hard system crash should only lose up to a half hour of work on
> restart.
>
> Ned

it looks like the two checkpoint files (client_state.xml and client_state_prev.xml) are cycled every 60 seconds in the default mode. a reliable file protection scheme will ensure that they are both available for resume with a maximum data loss of only 60 seconds or whatever value is set in the preferences.

--
searching for gravitational waves since 2005

Ned
Joined: 22 Jan 05
Posts: 18
Credit: 24493621
RAC: 0

> > The original problem

Message 3969 in response to message 3968

> > The original problem seems to be related to a corrupted file...
> > May I suggest that the client use two files. Use one at a time, but close it
> > every half hour then copy to the second and use it for the next half hour.
> > That way a hard system crash should only lose up to a half hour of work on
> > restart.
> >
> > Ned
>
> it looks like the two checkpoint files (client_state.xml and
> client_state_prev.xml) are cycled every 60 seconds in the default mode. a
> reliable file protection scheme will ensure that they are both available for
> resume with a maximum data loss of only 60 seconds or whatever value is set in
> the preferences.
>
>
The checkpoint files probably point to positions in the "temp" files mentioned in BM's earlier post.
"The App writes two temp files during two stages of the analysis (0-49.5%,49.5-99%)."
My post was meant to refer to those files... especially under Windows, since Windows doesn't have a journalling file system like Linux to perform any file recovery in a restart situation.

Ned

Ol' Retired IT Geezer
