The Gravitational Wave Search tasks run for a long time without writing a checkpoint, almost 40 minutes on my computer, so a lot of work is lost at every shutdown. I'd like to be able to stop at any time without having to worry about wasting too much CPU time. Please consider adding more checkpoints, the more frequent the better.
Copyright © 2024 Einstein@Home. All rights reserved.
More checkpoints
What is your setting in the Computing preferences for "Tasks checkpoint to disk at most every xx seconds"? Set this too high and the app won't checkpoint...
That setting is unchanged from the default value of 60 seconds. In fact I've never changed global preferences through any project, I always use BOINC Manager.
Other projects' checkpoints never seem to be much older than a minute, and I think other Einstein tasks checkpointed more frequently, too. But Gravitational Wave Search tasks are the only ones I get now, and those seem to be split into 13 blocks. The progress bar in BOINC Manager advances in steps of 7.7% and a checkpoint is written only after each step.
Right now, the Einstein task currently running is 24 minutes past its latest checkpoint, and if I decided to shut down the computer now all that work would be lost. I'd hate that. For the Malariacontrol task running in parallel it would be 40 seconds, nothing to even think about. I really wish I didn't have to care about Einstein, either.
I've noticed this too; it seems to be a feature of S6CasA (and to a lesser extent, GRP2). Obviously the speed of the host plays into this to some extent, but even on reasonable hosts it can be tens of minutes between checkpoints.
The request for more checkpoints seems valid. But as a work-around you could use standby (costs some energy, but saves you some time upon waking up) or suspend to disk (takes longer to sleep and wake, but should still be faster than a regular boot).
MrS
Scanning for our furry friends since Jan 2002
In general we design our Apps to (potentially) checkpoint as often as possible / feasible, i.e. after each reasonably independent computation.
Feasibility limits here include the programming effort (parameters in data structures that are modified in nested loops have to be saved and restored) and the data volume (storage space and time to write) of the necessary checkpoints. It doesn't make much sense to checkpoint every minute when writing the checkpoint takes several seconds (multiplied by the number of instances that may be running and checkpointing at once) and thus noticeably slows down computation, or when initializing the application from a checkpoint alone takes several minutes.
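To illustrate the idea of checkpointing only after each reasonably independent computation, here is a minimal Python sketch. It is not the Einstein@Home application code: the function name `run_with_checkpoints`, the JSON state file and the `min_interval_s` parameter are all made up for illustration, with `min_interval_s` standing in for the "checkpoint to disk at most every xx seconds" preference.

```python
import json
import os
import time


def run_with_checkpoints(units, state_path, min_interval_s=60.0, clock=time.monotonic):
    """Run work units in order, checkpointing at most every min_interval_s seconds.

    Hypothetical sketch: `units` is a list of callables, each one standing for
    one reasonably independent computation. Returns the number of checkpoints
    written during this call.
    """
    # Resume from an earlier checkpoint if one exists.
    start = 0
    if os.path.exists(state_path):
        with open(state_path) as f:
            start = json.load(f)["next_unit"]

    last_checkpoint = clock()
    written = 0
    for i in range(start, len(units)):
        units[i]()  # cannot be interrupted and resumed midway
        # A checkpoint is only possible at a unit boundary, and only
        # written if enough time has passed since the previous one.
        if clock() - last_checkpoint >= min_interval_s:
            tmp_path = state_path + ".tmp"
            with open(tmp_path, "w") as f:
                json.dump({"next_unit": i + 1}, f)
            # Swap the new file in atomically so a crash never leaves
            # a half-written checkpoint behind.
            os.replace(tmp_path, state_path)
            last_checkpoint = clock()
            written += 1
    return written
```

In a scheme like this, the preference only sets a lower bound on the time between checkpoints. If a single unit takes 40 minutes, the checkpoints end up 40 minutes apart no matter how small `min_interval_s` is.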
BM
Currently the GW S6 CasA app checkpoints about once an hour on this host. That host runs for about 8 hours on weekdays only, so it basically gets shut down each day.
That means that in a worst-case scenario 59 minutes of crunching time are lost, or about 1/8 of the available crunching time.
I'd consider that suboptimal.
In an effort to mitigate the situation, the owner of said host has taken to monitoring the timestamp of the checkpoint file in the slot directory and, unless the checkpoint is fairly recent, waiting for the next checkpoint before shutting down.
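That manual check could be scripted roughly as follows. This is only a sketch: the function names and the 5-minute threshold are arbitrary, and the path to the checkpoint file has to be supplied by the user, since the actual file name in the slot directory depends on the app.

```python
import os
import time


def checkpoint_age_seconds(path, now=None):
    """Seconds since the file at `path` was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(path)


def safe_to_shut_down(path, max_age_s=300.0, now=None):
    """True if the checkpoint file is no older than max_age_s seconds."""
    return checkpoint_age_seconds(path, now) <= max_age_s
```

Calling `safe_to_shut_down(path)` before shutting down would replace looking up the timestamp by hand, but it does not remove the need to wait for the next checkpoint.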
Very frequent checkpoints (e.g. every minute or every 5 minutes) may not be feasible, but said owner feels that checkpointing at least every 15 minutes would be highly desirable - keeping an eye on the checkpoint file is tedious.
Losses of up to 15 minutes crunching time are more easily stomached. To potentially lose an hour is very annoying.
So if you can make the app checkpoint a little more frequently, that would be nice.
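For reference, the worst-case figures above can be worked out with a few lines of Python. The function name is made up for this example, and it assumes a shutdown at the worst possible moment, just before the next checkpoint would have been written.

```python
def worst_case_loss_fraction(checkpoint_interval_min, session_min):
    """Fraction of one session lost if shutdown comes just before a checkpoint."""
    # At most one whole session can be lost, hence the min().
    return min(checkpoint_interval_min, session_min) / session_min


# 8-hour day with checkpoints about an hour apart: roughly 12% (about 1/8).
hourly = worst_case_loss_fraction(59, 8 * 60)

# Same day with checkpoints every 15 minutes: roughly 3%.
quarter_hourly = worst_case_loss_fraction(15, 8 * 60)
```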
Another curve ball is that BOINC 7.2.38 introduced the following:
client: if app doesn't report fraction done, estimate it.
client: if app doesn't report fraction done, estimate fraction done in a way that converges to but never reaches 100%.
With BOINC 7.2.39 now the latest Recommended version, the app will seemingly be progressing, before jumping back if you restart BOINC, or if you have "suspend when in use" selected, don't have "Leave Tasks in Memory" selected, and interrupt crunching.
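As an illustration of what "converges to but never reaches 100%" can look like, here is a small Python sketch. The exponential form is an assumption chosen for the example, not the formula the BOINC client actually uses, and the function name is made up.

```python
import math


def estimated_fraction_done(elapsed_s, estimated_runtime_s):
    """Illustrative estimate that grows with elapsed time and approaches 1.

    Assumed formula for this example only: 1 - exp(-elapsed / estimated_runtime).
    Mathematically it stays below 1 for any finite elapsed time.
    """
    return 1.0 - math.exp(-elapsed_s / estimated_runtime_s)
```

An estimate like this depends only on elapsed time, so it keeps climbing between checkpoints. After a restart the task resumes from the last checkpoint, which is why the displayed progress can jump back.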
Claggy
Thanks for the note.
Under different circumstances I would just advise not running GW work on this computer and running BRP4 instead. However, we currently don't have enough Arecibo data to produce enough BRP4 work for normal CPUs, so there is currently no alternative (I suspect FGRP3 won't run the last stage reasonably fast either).
It turns out that the setup of the current GW analysis run makes more extensive use of a feature ("second order spindown") that was negligible when we added checkpointing to the application; you may see this as a communication problem between scientists and (software) engineers.
I'll see what I can do about that, but given my current pile of work I doubt that I can fix it within the remaining time of the S6CasA run. As we probably need to change a few things for the next GW run anyway, I am confident, though, that the next GW app will checkpoint more frequently.
BM
Thanks, Bernd :)
I wasn't expecting an update to the current app; it would just be nice if this was taken into consideration for future app development.
Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.