Handling of power-outage crashes?

ADDMP

Joined: 25 Feb 05

Posts: 104

Credit: 7332049

RAC: 0

13 Sep 2007 20:40:11 UTC

Topic 193132

In the past, it appeared that if a computer crashed due to a power-outage, E@H would check the partial results on startup & then usually continue processing them.

Recently a restart after a power outage seems to cause a rejection of all partial results & the downloading of all new units.

Has there been some change in E@H or have I just been having bad luck recently?

Thanks, ADDMP.

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5893653

RAC: 2

Handling of power-outage crashes?

13 Sep 2007 20:50:51 UTC

Message 73045

Usually Einstein can restart from a previous checkpoint that it made. In the case of an unexpected power-outage, it's possible Einstein was writing its checkpoint to disk at the time the power went out. That causes slight disk corruption and no checkpoint to fall back on.

When you then restart, you get "The environment is incorrect. (0xa) - exit code 10 (0xa)" as the error message. Since Einstein doesn't know where to restart from, it dumps the task and gets you a new one.

Checkpoints are made every minute in EAH, so it is reasonably easy to get corruption when the light goes out.

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

There's a simple solution to

13 Sep 2007 21:06:45 UTC

Message 73046

There's a simple solution to the problem of losing all data from a power failure during a write.

1. Save new data to a temp file.

2. Delete the old data file.

3. Rename the temp file to the same name as the original file.

Since the old data isn't destroyed until the new file is written, there's always good data to recover from.
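The steps above can be sketched in Python as follows. This is only an illustration, not the Einstein@Home code: the function name `save_with_temp_file` is made up, and steps 2 and 3 are merged into a single `os.replace` call, which renames the temp file over the old one so that there is no moment where neither file exists under the final name.

```python
import os
import tempfile


def save_with_temp_file(path: str, data: bytes) -> None:
    """Write data to a temp file, then swap it in for the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Step 1: write the new data to a temp file in the same directory,
    # so the final rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the bytes to disk
        # Steps 2 and 3 combined: rename over the old file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

How safe the final rename is against a power cut still depends on the operating system and filesystem, so treat this as a reduction of the risk rather than a guarantee.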

Jord

Joined: 26 Jan 05

Posts: 2952

Credit: 5893653

RAC: 2

A bit like client_state.xml

13 Sep 2007 21:29:44 UTC

Message 73047 in response to message 73046

A bit like client_state.xml that's backed up to client_state_prev.xml?
It's a good idea, but it would mean EAH has to rewrite their application so that it checks at startup whether the *.cpt or backup*.cpt file is available.
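A startup check of that kind might look roughly like the sketch below. Everything in it is hypothetical: the function name `load_checkpoint`, the idea that a checkpoint is a JSON object, and the file names are only for illustration and say nothing about the real *.cpt format.

```python
import json
from typing import Optional


def load_checkpoint(primary: str, backup: str) -> Optional[dict]:
    """Return the first readable checkpoint, trying primary then backup."""
    for candidate in (primary, backup):
        try:
            with open(candidate, "r", encoding="utf-8") as f:
                data = json.load(f)
        except (OSError, ValueError):
            # Missing, unreadable, or corrupted file: try the next one.
            continue
        if isinstance(data, dict):
            return data
    # Nothing usable: the caller has to start the task from the beginning.
    return None
```

For example, `load_checkpoint("task.cpt", "backup_task.cpt")` would return the backup's contents if `task.cpt` was cut off mid-write.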

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

RE: A bit like

13 Sep 2007 21:51:21 UTC

Message 73048 in response to message 73047

Quote:

A bit like client_state.xml that's backed up to client_state_prev.xml?

Exactly.

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4332

Credit: 251887728

RAC: 33844

The Einstein@home App writes

14 Sep 2007 11:13:21 UTC

Message 73049

The Einstein@home App writes two files: a (temporary) output file and a checkpoint file.

The checkpoint file contains information about some internal states and about the temporary output file, too (e.g. a checksum). The Einstein@Home Application writes the checkpoint file to a temporary file and then renames it, overwriting a possible previous checkpoint file in a single, "atomic" operation. This way there's always a checkpoint it can recover from, which might, in rare cases, be a previous one. There is, however, no way to influence how "atomic" the rename operation actually is within the operating system.
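For readers who want to see the idea in code, here is a minimal Python sketch of a checkpoint that stores a checksum of the output file and is renamed into place. It is not the actual App code: the JSON layout, the use of SHA-256, and the names `write_checkpoint` and `checkpoint_matches_output` are all assumptions made for the example.

```python
import hashlib
import json
import os


def file_sha256(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checkpoint(cpt_path: str, output_path: str, state: dict) -> None:
    """Record state plus the output file's checksum, then rename into place."""
    record = {"state": state, "output_sha256": file_sha256(output_path)}
    tmp_path = cpt_path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    # Rename over any previous checkpoint in one step.
    os.replace(tmp_path, cpt_path)


def checkpoint_matches_output(cpt_path: str, output_path: str) -> bool:
    """True if the checkpoint's stored checksum matches the output file."""
    with open(cpt_path, "r", encoding="utf-8") as f:
        record = json.load(f)
    return record["output_sha256"] == file_sha256(output_path)
```

On restart, a check like `checkpoint_matches_output` is what lets an application notice that the output file on disk is not the one the checkpoint was written for.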

The output file is normally only appended to. There is a certain operation, however, which we call "compacting", which is automatically done when the file grows larger than a certain limit. Then the output is rewritten (to a temporary file first, then renamed to the original name, overwriting the older file). You'll see a message "Compacting toplist..." in stderr output. This usually only happens a few times during a run, and less frequently with time.
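A rough Python sketch of such a size-triggered rewrite is shown below. The post does not say what compacting actually keeps or discards, so the reduction step is left to a caller-supplied function; `compact_if_large` and `keep_top` are hypothetical names, and the "one numeric score per line" file format is purely an assumption for the example.

```python
import os
from typing import Callable, List


def compact_if_large(path: str, limit_bytes: int,
                     reduce_lines: Callable[[List[str]], List[str]]) -> bool:
    """Rewrite the file via a temp file if it exceeds limit_bytes."""
    if os.path.getsize(path) <= limit_bytes:
        return False  # still small enough: keep appending
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        for line in reduce_lines(lines):
            f.write(line + "\n")
        f.flush()
        os.fsync(f.fileno())
    # Rename the rewritten file over the older one.
    os.replace(tmp_path, path)
    return True


def keep_top(n: int) -> Callable[[List[str]], List[str]]:
    """Hypothetical reducer: keep the n largest numeric lines."""
    def reduce(lines: List[str]) -> List[str]:
        return sorted(lines, key=float, reverse=True)[:n]
    return reduce
```

A call such as `compact_if_large("out.txt", 1_000_000, keep_top(100))` would leave the file untouched until it passes the size limit, and only then rewrite it.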

In case of "compacting" the output file it might happen that for a short time the checkpoint does not reflect the state of the temporary output file. Being interrupted between having written a new output file and a new checkpoint causes an inconsistent state on disk. It would be possible to avoid this by first deleting the checkpoint file, so if the App gets interrupted before having written a new one it would start over from the beginning. However, as the time already spent on this task would be wasted anyway in this case, and as it carries the additional risk of having Tasks that endlessly restart from the beginning, I didn't take that option.

It is unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time, and that the power outage happened just then. More likely this failure was caused by the operating system corrupting the directory or more than one file.

You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.

ADDMP

Joined: 25 Feb 05

Posts: 104

Credit: 7332049

RAC: 0

RE: It is unlikely up to

16 Sep 2007 5:27:50 UTC

Message 73050 in response to message 73049

Quote:

It is unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time, and that the power outage happened just then. More likely this failure was caused by the operating system corrupting the directory or more than one file.

You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.

BM

Thanks to Bernd & the others who responded here.

No, I would not want you to use your time checking it.

I thought there might be some setting I had wrong or some change in the way the E@H software ran, but no one here thinks that is the case.

If an OS glitch is in the running as a cause, this is the first computer I have had running Vista. Maybe MS is still working out some problems.

ADDMP
