More checkpoints

floyd

Joined: 12 Sep 11

Posts: 133

Credit: 186610495

RAC: 0

9 Dec 2013 18:13:47 UTC

Topic 197302

The Gravitational Wave Search tasks run for a long time without writing a checkpoint, almost 40 minutes on my computer, so a lot of work is lost at every shutdown. I'd like to be able to stop at any time without having to worry about wasting too much CPU power. Please consider adding more checkpoints, the more frequent the better.

Holmis

Joined: 4 Jan 05

Posts: 1118

Credit: 1055935564

RAC: 0

9 Dec 2013 23:28:54 UTC

Message 119509

What is your setting in the Computing preferences for "Tasks checkpoint to disk at most every xx seconds"? Set this too high and the app won't checkpoint...

floyd

Joined: 12 Sep 11

Posts: 133

Credit: 186610495

RAC: 0

10 Dec 2013 2:54:10 UTC

Message 119510

That setting is unchanged from the default value of 60 seconds. In fact I've never changed global preferences through any project, I always use BOINC Manager.

Other projects' checkpoints never seem to be much older than a minute, and I think other Einstein tasks checkpointed more frequently, too. But Gravitational Wave Search tasks are the only ones I get now, and those seem to be split into 13 blocks. The progress bar in BOINC Manager advances in steps of 7.7% and a checkpoint is written only after each step.

Right now, the Einstein task currently running is 24 minutes past its latest checkpoint, and if I decided to shut down the computer now all that work would be lost. I'd hate that. For the Malariacontrol task running in parallel it would be 40 seconds, nothing to even think about. I really wish I didn't have to care about Einstein, either.
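
The numbers in this post fit together as simple arithmetic. A minimal sketch, assuming the task really is split into 13 equal blocks with one checkpoint per block, and assuming the moment of shutdown is uniformly random within a checkpoint interval; the function names are hypothetical and the 40-minute interval is the figure from the first post:

```python
def progress_step_percent(blocks):
    """Progress-bar increment if a task is split into equal blocks."""
    return 100.0 / blocks


def loss_per_shutdown(interval_min):
    """Return (average, worst-case) minutes of work lost per shutdown.

    Assumes the shutdown moment is uniformly random within one
    checkpoint interval, so on average half an interval is lost.
    """
    return interval_min / 2.0, interval_min


step = progress_step_percent(13)
average, worst = loss_per_shutdown(40)
print(f"{step:.1f}% per block, {average:.0f} min lost on average, {worst} min worst case")
```

Under these assumptions, 13 blocks give the 7.7% steps seen in BOINC Manager, and a 40-minute interval costs about 20 minutes per shutdown on average.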

Neil Newell

Joined: 20 Nov 12

Posts: 176

Credit: 169699457

RAC: 0

10 Dec 2013 9:59:15 UTC

Message 119511 in response to message 119510

I've noticed this too; it seems to be a feature of S6CasA (and, to a lesser extent, GRP2). Obviously the speed of the host plays into this to some extent, but even on reasonable hosts it can be tens of minutes between checkpoints.

ExtraTerrestria...

Joined: 10 Nov 04

Posts: 770

Credit: 586304742

RAC: 120119

11 Dec 2013 21:21:13 UTC

Message 119512

The request for more checkpoints seems valid. But as a work-around you could use standby (costs some energy, but saves you some time upon waking up) or suspend to disk (takes longer to sleep and wake, but should still be faster than a regular boot).

MrS

Scanning for our furry friends since Jan 2002

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4342

Credit: 252640035

RAC: 35394

29 Jan 2014 10:27:59 UTC

Message 119513

In general we design our Apps to (potentially) checkpoint as often as possible / feasible, i.e. after each reasonably independent computation.

Feasibility limits here include the programming effort (parameters in data structures modified in nested loops saved and restored) and the data volume (storage space and time to write) of the necessary checkpoints. It doesn't make much sense to checkpoint every minute when writing the checkpoint takes several seconds (multiplied by the number of instances that may be running and checkpointing at once) and thus noticeably slows down computation, or if initializing the application picking up from a checkpoint takes several minutes alone.
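
The trade-off described here can be put into rough numbers. A minimal sketch, assuming computation pauses while a checkpoint is written and that instances checkpointing at the same moment serialize on the disk (a pessimistic simplification); the function name and the 5-second write time are hypothetical stand-ins for "several seconds":

```python
def checkpoint_overhead(interval_s, write_s, instances=1):
    """Fraction of wall-clock time spent writing checkpoints.

    interval_s: seconds of computation between two checkpoints
    write_s:    seconds one instance needs to write its checkpoint
    instances:  instances assumed to checkpoint at once and to
                queue up behind each other on the same disk
    """
    effective_write = write_s * instances
    return effective_write / (interval_s + effective_write)


# Compare a one-minute and a fifteen-minute interval for 1 and 4 instances.
for interval in (60, 900):
    for n in (1, 4):
        share = checkpoint_overhead(interval, 5, n)
        print(f"every {interval:>3} s, {n} instance(s): {share:.1%} overhead")
```

With these assumed numbers, checkpointing every minute costs several percent of the run time even for a single instance, while a 15-minute interval keeps the overhead well below one percent.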

Richard Haselgrove

Joined: 10 Dec 05

Posts: 2143

Credit: 3000482033

RAC: 698802

12 Feb 2014 17:37:20 UTC

Message 119514

Currently the GW S6 CasA app checkpoints about once an hour on this host. That host runs for about 8 hours on weekdays only, so it basically gets shut down each day.
That means that in a worst-case scenario 59 minutes of crunching time are lost, or about 1/8 of the available crunching time.
I'd consider that suboptimal.
In an effort to mitigate the situation, the owner of said host has taken to monitoring the timestamp of the checkpoint file in the slot directory and not shutting down unless the checkpoint is fairly recent, waiting for the next checkpoint instead.

Very frequent checkpoints (e.g. every minute or every 5 minutes) may not be feasible, but said owner feels that checkpointing at least every 15 minutes would be highly desirable - keeping an eye on the checkpoint file is tedious.
Losses of up to 15 minutes crunching time are more easily stomached. To potentially lose an hour is very annoying.

So if you can make the app checkpoint a little more frequently, that would be nice.
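
The manual check described in this post (looking at the timestamp of the checkpoint file before shutting down) could be scripted. A minimal sketch using only the file's modification time; the function names are hypothetical, and the path to the checkpoint file has to be supplied by the user, since the actual file name in the slot directory is not given in this thread:

```python
import os
import time


def checkpoint_age_minutes(path, now=None):
    """Minutes since the file at `path` was last modified."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / 60.0


def safe_to_shut_down(path, max_age_minutes=5.0, now=None):
    """True if the checkpoint is recent enough that little work is at risk."""
    return checkpoint_age_minutes(path, now) <= max_age_minutes


# Example call (the path is a placeholder, not a real BOINC file name):
# if safe_to_shut_down("/path/to/slot/checkpoint-file", max_age_minutes=5):
#     print("Checkpoint is recent, shutting down loses little work.")
```

The 5-minute threshold is an arbitrary default; anyone using such a script would pick whatever loss they are willing to accept.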

Claggy

Joined: 29 Dec 06

Posts: 560

Credit: 2793143

RAC: 2671

12 Feb 2014 19:31:52 UTC

Message 119515 in response to message 119514

Another curve ball is that BOINC 7.2.38 introduced the following:

client: if app doesn't report fraction done, estimate it.
client: if app doesn't report fraction done, estimate fraction done in a way that converges to but never reaches 100%.

With BOINC 7.2.39 now the latest recommended version, the app will seemingly be progressing before jumping back if you restart BOINC, or if you have suspend when in use selected, don't have Leave Tasks in Memory selected, and interrupt crunching.

Claggy
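
An estimate that "converges to but never reaches 100%" can be illustrated with a saturating function. A minimal sketch, assuming an exponential form based on elapsed time and the estimated total run time; this is one function with the described property, not necessarily the formula the BOINC client uses, and the function name is hypothetical:

```python
import math


def estimated_fraction_done(elapsed_s, estimated_total_s):
    """Illustrative progress estimate for an app that reports none.

    Rises with elapsed time and approaches 1.0, but mathematically
    stays below it however long the task runs.
    """
    if estimated_total_s <= 0:
        raise ValueError("estimated_total_s must be positive")
    return 1.0 - math.exp(-elapsed_s / estimated_total_s)


# Progress shown after 0.5x, 1x, 2x and 4x the estimated run time (1 hour).
for ratio in (0.5, 1, 2, 4):
    print(f"{ratio}x estimate: {estimated_fraction_done(ratio * 3600, 3600):.1%}")
```

Such an estimate depends only on elapsed time, not on saved state, which is consistent with the displayed progress jumping back once the task restarts from its last real checkpoint.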

Bernd Machenschalk

Moderator

Administrator

Joined: 15 Oct 04

Posts: 4342

Credit: 252640035

RAC: 35394

14 Feb 2014 11:03:41 UTC

Message 119516 in response to message 119514

Quote:

So if you can make the app checkpoint a little more frequently, that would be nice.

Thanks for the note.

Under different circumstances I would just advise not running GW work on this computer and running BRP4 instead. However, we currently don't have enough Arecibo data to produce enough BRP4 work for normal CPUs, so there is currently no alternative (I suspect FGRP3 won't run the last stage reasonably fast either).

It turns out that the setup of the current GW analysis run makes more extensive use of a feature ("second order spindown") that was negligible when we added checkpointing to the application; you may see this as a communication problem between scientists and (software) engineers.

I'll see what I can do about that, but seeing my current pile of work I doubt that I can fix that within the remaining time of the S6CasA run. As we probably need to change a few things for the next GW run anyway, I am confident, though, that the next GW app will checkpoint more frequently.

Eyrie

Joined: 20 Feb 14

Posts: 4

Credit: 4415

RAC: 0

24 Feb 2014 17:53:09 UTC

Message 119517

Thanks, Bernd :)

I wasn't expecting an update to the current app, it would just be nice if this was taken into consideration for future app development.

Queen of Aliasses, wielder of the SETI rolling pin, Mistress of the red shoes, Guardian of the orange tree, Slayer of very small dragons.
