In the past, it appeared that if a computer crashed due to a power-outage, E@H would check the partial results on startup & then usually continue processing them.
Recently a restart after a power outage seems to cause a rejection of all partial results & the downloading of all new units.
Has there been some change in E@H or have I just been having bad luck recently?
Thanks, ADDMP.
Copyright © 2024 Einstein@Home. All rights reserved.
Handling of power-outage crashes?
Usually Einstein can restart from a previous checkpoint that it made. In the case of an unexpected power-outage, it's possible Einstein was writing its checkpoint to disk at the time the power went out. That causes slight disk corruption and no checkpoint to fall back on.
When you then restart, you get "The environment is incorrect. (0xa) - exit code 10 (0xa)" as the error message. Since Einstein doesn't know where to restart from, it dumps the task and gets you a new one.
Checkpoints are made every minute in EAH, so it is reasonably easy to get corruption when the light goes out.
There's a simple solution to the problem of losing all data from a power failure during a write.
1 Save new data to a temp file.
2 Delete the old data file.
3 Rename the temp file to the same name as the original file.
Since the old data isn't destroyed until the new file is written, there's always good data to recover from.
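The three steps above can be sketched roughly like this in Python. This is only an illustration, not how the Einstein@Home application is written; the function name save_safely and the ".tmp" suffix are made up for the example. Note that os.replace overwrites an existing destination, so steps 2 and 3 collapse into one call.

```python
import os

def save_safely(path, data):
    """Write data to a temp file first, then swap it into place."""
    tmp_path = path + ".tmp"  # hypothetical temp-file name
    # Step 1: save the new data to a temp file.
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the bytes to disk
    # Steps 2 and 3: os.replace renames the temp file over the old one,
    # so the old file is only discarded once the new one is complete.
    os.replace(tmp_path, path)
```

If the power goes out during step 1, the old file is still untouched and only the temp file is damaged.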
A bit like client_state.xml that's backed up to client_state_prev.xml?
It's a good idea, but it would mean EAH has to rewrite their application so that it checks at startup time whether the *.cpt or backup*.cpt files are available.
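A startup check of that kind might look something like the following Python sketch. The function name pick_checkpoint and the "non-empty file" test are assumptions for illustration; the real application may decide differently what counts as a usable checkpoint.

```python
import os

def pick_checkpoint(main_path, backup_path):
    """Return the first usable checkpoint path, or None to start over."""
    for candidate in (main_path, backup_path):
        # Assumption: a checkpoint is "usable" if it exists and is not empty.
        if os.path.isfile(candidate) and os.path.getsize(candidate) > 0:
            return candidate
    return None
```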
Exactly.
The Einstein@home App writes two files: a (temporary) output file and a checkpoint file.
The checkpoint file contains information about some internal states and about the temporary output file, too (e.g. a checksum). The Einstein@Home Application writes the checkpoint to a temporary file and then renames it, overwriting a possible previous checkpoint file in a single, "atomic" operation. This way there's always a checkpoint it can recover from, which might, in rare cases, be a previous one. There is, however, no way to influence how "atomic" the rename operation actually is within the operating system.
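To illustrate the idea of a checkpoint that carries a checksum of the output file, here is a minimal Python sketch. The JSON layout, the SHA-256 checksum, and the function names are all assumptions made for the example; the actual checkpoint format of the application is not described in this thread.

```python
import hashlib
import json
import os

def file_checksum(path):
    """Checksum of a whole file (SHA-256 chosen only for illustration)."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def write_checkpoint(cpt_path, out_path, state):
    """Write state plus the output file's checksum, via temp file and rename."""
    record = {"state": state, "output_checksum": file_checksum(out_path)}
    tmp_path = cpt_path + ".tmp"  # hypothetical temp-file name
    with open(tmp_path, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, cpt_path)  # overwrites any previous checkpoint

def load_checkpoint(cpt_path, out_path):
    """Return the saved state, or None if checkpoint and output disagree."""
    try:
        with open(cpt_path) as f:
            record = json.load(f)
        if record["output_checksum"] != file_checksum(out_path):
            return None  # output file changed after the checkpoint was written
        return record["state"]
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing or unreadable checkpoint
```

With this layout, a restart only resumes when the output file on disk is the one the checkpoint was written for.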
The output file is normally only appended to. There is a certain operation, however, which we call "compacting", which is automatically done when the file grows larger than a certain limit. Then the output is rewritten (to a temporary file first, then renamed to the original name, overwriting the older file). You'll see a message "Compacting toplist..." in stderr output. This usually only happens a few times during a run, and less frequently with time.
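As a rough picture of what such a compacting step could look like, here is a Python sketch. It assumes, purely for illustration, that each output line starts with a numeric score and that compacting means keeping only the best-scoring lines; the real toplist format and selection rule are not given here, and compact_output, keep, and size_limit are hypothetical names.

```python
import os

def compact_output(path, keep, size_limit):
    """Rewrite the output file with only the `keep` best lines if it is too big.

    Returns True if the file was compacted, False if it was left alone.
    """
    if os.path.getsize(path) <= size_limit:
        return False
    with open(path) as f:
        lines = [line.rstrip("\n") + "\n" for line in f if line.strip()]
    # Assumed line format: "<score> <rest>", where a higher score is better.
    lines.sort(key=lambda line: float(line.split()[0]), reverse=True)
    tmp_path = path + ".tmp"  # hypothetical temp-file name
    with open(tmp_path, "w") as f:
        f.writelines(lines[:keep])
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # rename over the older, larger file
    return True
```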
While "compacting" the output file, it can happen that for a short time the checkpoint does not reflect the state of the output file. Being interrupted between having written a new output file and a new checkpoint leaves an inconsistent state on disk. It would be possible to avoid this by first deleting the checkpoint file, so that if the App gets interrupted before having written a new one, it would start over from the beginning. However, as the time already spent on this task would be wasted anyway in this case, and as it bears the additional risk of having Tasks that endlessly restart from the beginning, I didn't take that option.
It is unlikely, to the point of practical impossibility, that all Tasks were compacting their output at the very same time and that the power outage happened just then. More likely, this failure was caused by a corruption of the directory or of more than one file by the operating system.
You can point me to the results in question (when they have been reported) if you want me to take a closer examination of your case.
BM
Thanks to Bernd & the others who responded here.
No, I would not want you to use your time checking it.
I thought there might be some setting I had wrong or some change in the way the E@H software ran, but no one here thinks that is the case.
If an OS glitch is in the running as a cause, this is the first computer I have had running Vista. Maybe MS is still working out some problems.
ADDMP