Handling of power-outage crashes?

ADDMP
Joined: 25 Feb 05
Posts: 104
Credit: 7332049
RAC: 0
Topic 193132

In the past, it appeared that if a computer crashed due to a power-outage, E@H would check the partial results on startup & then usually continue processing them.

Recently a restart after a power outage seems to cause a rejection of all partial results & the downloading of all new units.

Has there been some change in E@H or have I just been having bad luck recently?

Thanks, ADDMP.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 2

Usually Einstein can restart from a previous checkpoint that it made. In the case of an unexpected power-outage, it's possible Einstein was writing its checkpoint to disk at the time the power went out. That causes slight disk corruption and no checkpoint to fall back on.

When you then restart, you get "The environment is incorrect. (0xa) - exit code 10 (0xa)" as the error message. Since Einstein doesn't know where to restart from, it dumps the task and gets you a new one.

Checkpoints are made every minute in EAH, so it is reasonably easy to get corruption when the light goes out.

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

There's a simple solution to the problem of losing all data from a power failure during a write.

1. Save new data to a temp file.

2. Delete the old data file.

3. Rename the temp file to the same name as the original file.

Since the old data isn't destroyed until the new file is written, there's always good data to recover from.
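The three steps above can be sketched roughly as follows. This is only an illustration, not code from the Einstein@Home application: the function names `save_three_step` and `recover` and the `.tmp` suffix are made up for the example. The recovery half shows why a crash at any point still leaves one complete copy on disk, assuming the OS really flushed the temp file before the old file was deleted.

```python
import os

def save_three_step(path, data):
    """Hypothetical helper: write `data` (bytes) following the three steps."""
    tmp = path + ".tmp"  # assumed naming convention for the temp file
    # Step 1: save new data to a temp file and push it to disk.
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # Step 2: delete the old data file (if there is one).
    if os.path.exists(path):
        os.remove(path)
    # Step 3: rename the temp file to the original name.
    os.rename(tmp, path)

def recover(path):
    """Hypothetical startup check: return the newest complete data, or None."""
    tmp = path + ".tmp"
    if os.path.exists(path):
        # The old file was never deleted, so a leftover temp file may be
        # half-written (crash during step 1). Discard it.
        if os.path.exists(tmp):
            os.remove(tmp)
    elif os.path.exists(tmp):
        # Crash between steps 2 and 3: the temp file was already complete.
        os.rename(tmp, path)
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return f.read()
```

The short window between steps 2 and 3, where only the temp file exists, is the reason the startup check has to look at both names.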

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5893653
RAC: 2

Message 73047 in response to message 73046

A bit like client_state.xml that's backed up to client_state_prev.xml?
It's a good idea, but it would mean EAH has to rewrite their application so that it checks at startup whether the *.cpt or backup*.cpt file is available.
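A startup check of that kind might look something like the sketch below. Everything in it is an assumption for illustration: the real checkpoint format is not described in this thread, so the example simply pretends the checkpoint is JSON, and `load_checkpoint` is a made-up name. The idea is just "try the main file, fall back to the backup if the main one is missing or unreadable".

```python
import json

def load_checkpoint(primary, backup):
    """Hypothetical loader: return (state, path_used), or (None, None).

    Tries the primary checkpoint first and falls back to the backup
    when the primary is missing, truncated, or otherwise unparsable.
    """
    for candidate in (primary, backup):
        try:
            with open(candidate, "r", encoding="utf-8") as f:
                return json.load(f), candidate
        except (OSError, ValueError):
            # OSError: file missing or unreadable.
            # ValueError: covers JSON and text decoding errors.
            continue
    return None, None
```

If both files are unusable, the caller would have to start the task over, which is the situation described earlier in the thread.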

DanNeely
Joined: 4 Sep 05
Posts: 1364
Credit: 3562358667
RAC: 0

Message 73048 in response to message 73047

Quote:
A bit like client_state.xml that's backed up to client_state_prev.xml?

Exactly.

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4332
Credit: 251885649
RAC: 33768

The Einstein@Home App writes two files: a (temporary) output file and a checkpoint file.

The checkpoint file contains information about some internal states and about the temporary output file, too (e.g. a checksum). The Einstein@Home Application writes the checkpoint to a temporary file and then renames it, overwriting any previous checkpoint file in a single, "atomic" operation. This way there's always a checkpoint it can recover from, which might, in rare cases, be a previous one. There is, however, no way to influence how "atomic" the rename operation actually is within the operating system.
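For readers who want to see the write-then-rename pattern in code, here is a minimal sketch in Python. It is not the application's actual code; `write_checkpoint_atomically` and the `.tmp` suffix are names invented for the example. It relies on `os.replace`, which overwrites an existing target in one rename call, so there is no separate delete step.

```python
import os

def write_checkpoint_atomically(path, payload):
    """Hypothetical helper: replace the checkpoint at `path` with `payload` (bytes)."""
    tmp = path + ".tmp"  # assumed temp name, kept in the same directory
    with open(tmp, "wb") as f:
        f.write(payload)
        f.flush()
        # Ask the OS to push the data to disk before the rename.
        os.fsync(f.fileno())
    # A single rename that overwrites any previous checkpoint.
    os.replace(tmp, path)
```

As noted above, how atomic and how durable that rename really is across a power loss is up to the operating system and filesystem, not the application.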

The output file is normally only appended to. There is a certain operation, however, which we call "compacting", which is automatically done when the file grows larger than a certain limit. Then the output is rewritten (to a temporary file first, then renamed to the original name, overwriting the older file). You'll see a message "Compacting toplist..." in stderr output. This usually only happens a few times during a run, and less frequently with time.

When "compacting" the output file, it can happen that for a short time the checkpoint does not reflect the state of the temporary output file. Being interrupted after a new output file has been written but before a new checkpoint has been written leaves an inconsistent state on disk. It would be possible to avoid this by first deleting the checkpoint file, so that if the App gets interrupted before having written a new one, it would start over from the beginning. However, as the time already spent on this task would be wasted anyway in this case, and as it bears the additional risk of having Tasks that endlessly restart from the beginning, I didn't take that option.
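To make the inconsistent state concrete, here is a small sketch of how a checkpoint carrying a checksum of the output file can detect it at restart. This is an assumption-laden illustration only: the real checkpoint layout and checksum algorithm are not given here, so the example uses SHA-256 and a plain dictionary, and the function names are invented.

```python
import hashlib

def file_checksum(path):
    """Return the SHA-256 hex digest of a file (algorithm is an assumption)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def make_checkpoint(output_path, state):
    """Hypothetical checkpoint: internal state plus the output file's checksum."""
    return {"state": state, "output_checksum": file_checksum(output_path)}

def is_consistent(checkpoint, output_path):
    """True if the output file on disk is the one the checkpoint describes."""
    try:
        return checkpoint["output_checksum"] == file_checksum(output_path)
    except (OSError, KeyError):
        # Missing output file or malformed checkpoint counts as inconsistent.
        return False
```

If the output file is rewritten (compacted) and the run is interrupted before a matching checkpoint is written, `is_consistent` returns False for the old checkpoint, which corresponds to the failure window described above.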

It is extremely unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time and that the power outage happened just then. More likely, the failure in this case was caused by the operating system corrupting the directory or more than one file.

You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.

BM

ADDMP
Joined: 25 Feb 05
Posts: 104
Credit: 7332049
RAC: 0

Message 73050 in response to message 73049

Quote:


It is extremely unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time and that the power outage happened just then. More likely, the failure in this case was caused by the operating system corrupting the directory or more than one file.

You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.

BM

Thanks to Bernd & the others who responded here.

No, I would not want you to use your time checking it.

I thought there might be some setting I had wrong or some change in the way the E@H software ran, but no one here thinks that is the case.

If an OS glitch is in the running as a cause, this is the first computer I have had running Vista. Maybe MS is still working out some problems.

ADDMP
