FGRPSSE Not checkpointing to disk

wwamann
Joined: 19 Nov 11
Posts: 2
Credit: 47702517
RAC: 0
Topic 223038

The Gamma-ray pulsar search does not appear to checkpoint to disk on loss of power or when exiting BOINC (shutting down all tasks). Upon restart, all CPU tasks start back at 0%. It is frustrating to lose 80% (hours) of work. What, if anything, am I doing wrong? Thank you...

Keith Myers
Joined: 11 Feb 11
Posts: 4972
Credit: 18773465502
RAC: 7212152

Check your preferences. The FGRPSSE tasks DO checkpoint. The default interval is 60 seconds, but that can be overridden by local preferences. Check which location or venue your host is set to and what the checkpoint interval is for that location.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117792061904
RAC: 34679285

wwamann wrote:
Gamma-ray pulsar search does not appear to checkpoint to disk ...

You can easily check this for yourself.  With the CPU GRP search, there are two parameters that control checkpointing -  the number of 'sky points' and the value of 'nf1dots'.  Each time a sky point is completed, a checkpoint will be written.  How long a sky point will take depends on how large nf1dots is.

The way to check this is to go to your tasks list on the website and take a look at what has been sent back to the project for any completed task you like to choose.  I picked the first validated task in your current list.  All tasks that are using the same data file will have the same parameters so it doesn't really matter which one you choose.  If you click on the task ID link for a completed task you get the complete processing history for that task.  Scroll down below the Stderr Output sub-heading and look for the very first sky point.

Here is what I found.  There are 6 sky points (so 6 checkpoints) and nf1dots is 738.  In other words, there are 738 calculation loops to complete a sky point and it's only then that a checkpoint can be written.  Adjusting a value in the website preferences will not allow you to shorten that.  I have deliberately truncated the very long line of 738 dots in order to conserve space.  The line that starts "% C 1 0" signifies the writing of checkpoint #1.  The calculations then move to sky point #2 (of 6).



% Sky point 1/6
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 738  df1dot: 1.356999071e-015  f1dot_start: -1.6e-011  f1dot_band: 1e-012
% Filling array of photon pairs
.......................................................   ..................................
INFO: Major Windows version: 6
% C 1 0
% Sky point 2/6
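
The same check can be scripted. What follows is only a minimal Python sketch, assuming the Stderr Output has been copied from the task page into a string. The function name `summarize_stderr` and the regular expressions are illustrative and rely solely on the line formats visible in the excerpt above.

```python
import re

# Excerpt of the stderr output shown above (the long line of dots is omitted).
SAMPLE = """\
% Sky point 1/6
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 738  df1dot: 1.356999071e-015  f1dot_start: -1.6e-011  f1dot_band: 1e-012
% Filling array of photon pairs
% C 1 0
% Sky point 2/6
"""

def summarize_stderr(text):
    """Return (total sky points, nf1dots, number of checkpoint lines seen).

    Assumes the line formats shown in the excerpt; a value is None when
    the corresponding line is missing from the text.
    """
    sky = re.search(r"^% Sky point \d+/(\d+)", text, re.MULTILINE)
    nf1 = re.search(r"^% nf1dots:\s*(\d+)", text, re.MULTILINE)
    # Checkpoint lines look like "% C 1 0": a C followed by two integers.
    checkpoints = re.findall(r"^% C \d+ \d+\s*$", text, re.MULTILINE)
    return (
        int(sky.group(1)) if sky else None,
        int(nf1.group(1)) if nf1 else None,
        len(checkpoints),
    )

total_sky_points, nf1dots, checkpoints_seen = summarize_stderr(SAMPLE)
print(total_sky_points, nf1dots, checkpoints_seen)
```

For a completed task, the number of checkpoint lines should match the number of sky points if every sky point ends with a checkpoint, as described above.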

Unfortunately, you are currently processing tasks that don't use many checkpoints.  The total number of calculation loops in a task is the product of sky points and nf1dots.  That total is usually reasonably constant, so the calculation time for a task is usually reasonably constant as well.  This is not always the case though.

There is usually an inverse relationship between sky points and nf1dots.  For example, there may be an order of magnitude more sky points and an order of magnitude fewer nf1dots.  In that case the checkpoint interval would be an order of magnitude less than what it currently is, whilst the overall crunch time would stay much the same.
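
As a rough illustration of that trade-off, here is a small Python sketch. The first pair of values (6 sky points, 738 nf1dots) comes from the task above; the second pair (60 and 74) is made up purely to show an order-of-magnitude shift, and the function names are hypothetical.

```python
def total_loops(sky_points, nf1dots):
    """Total calculation loops in the main stage: sky points times nf1dots."""
    return sky_points * nf1dots

def interval_fraction(sky_points):
    """Fraction of the main stage between checkpoints (one per sky point)."""
    return 1.0 / sky_points

observed = (6, 738)      # values taken from the stderr excerpt above
hypothetical = (60, 74)  # made-up values: ~10x the sky points, ~1/10 the nf1dots

for sky_points, nf1dots in (observed, hypothetical):
    loops = total_loops(sky_points, nf1dots)
    frac = interval_fraction(sky_points)
    print(f"{sky_points} sky points x {nf1dots} nf1dots = {loops} loops, "
          f"checkpoint every {frac:.1%} of the main stage")
```

The total work is almost the same in both cases, but the second task would checkpoint ten times as often.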

If you select a task that is crunching on your machine (use the tasks tab of BOINC Manager - advanced view) and then click the properties button, you will see information about how long ago the last checkpoint was written.  If no time is listed, then the very first checkpoint is yet to be written.  These tasks seem to be taking around 16 hours on your machine.  In that case, the current checkpoint interval will be greater than 2.5 hours.  You can't shorten that but you can make sure that tasks are kept in memory if suspended (otherwise progress will be lost) and that you try to shut down your machine just after a checkpoint rather than just before :-).

For the tasks that are just like the example above, checkpoints will be written when the progress shows 15%, 30%, 45%, 60%, 75% and 90% so you can tell this at a glance.  The main calculations are complete at 90%.  The last 10% is for a followup stage where the top ten candidate signals are reprocessed.  There are no checkpoints in the followup stage.  Progress jumps from 90% to 100% at the very end.
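
Those percentages can be reproduced with a few lines of Python. This is a sketch only: it assumes progress advances evenly with elapsed time, that the main stage ends at 90%, and that the run takes exactly 16 hours. The function name `checkpoint_schedule` is hypothetical.

```python
def checkpoint_schedule(sky_points, run_hours, main_fraction=0.9):
    """Return a list of (progress percent, elapsed hours) for each checkpoint.

    Assumes one checkpoint per completed sky point, progress that is
    proportional to elapsed time, and a main stage ending at main_fraction.
    """
    step = main_fraction / sky_points
    return [
        (round(100 * step * k, 1), round(run_hours * step * k, 2))
        for k in range(1, sky_points + 1)
    ]

for percent, hours in checkpoint_schedule(sky_points=6, run_hours=16):
    print(f"checkpoint at {percent:.0f}% (about {hours} h elapsed)")
```

For a run of exactly 16 hours this gives 2.4 hours between checkpoints, close to the 2.5-hour figure mentioned above. The real interval depends on the actual run time of the task.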

I hope some of this might be useful to you.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4972
Credit: 18773465502
RAC: 7212152

Thanks for the explanation Gary.

 

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117792061904
RAC: 34679285

Keith Myers wrote:
Thanks for the explanation Gary.

You're most welcome!

On thinking about what I wrote, there's one assumption in there that I should highlight.  It's quite a few years now since I studied these types of tasks to work out what controlled checkpointing.  At that time I looked at quite a large number of individual tasks, and it seemed that many (if not all) tasks that used a particular data file used the same combination of sky points and nf1dots values.  Knowing the checkpoint interval, and the elapsed percentages in BOINC Manager at which checkpoints would occur, was therefore routine for me and always seemed to work.

These days I do GPU tasks pretty much exclusively so I don't know if there is any different behaviour for current CPU tasks compared to what I saw previously.  There is probably no reason why tasks for one data file might not use variable numbers of sky points.  In that case, the checkpoint interval could change quite drastically without any forewarning.  In recent times, data files have been changing much more frequently than they used to and there have been many examples of significantly different run times.

So the statement that, "All tasks that are using the same data file will have the same parameters" may not be true these days.  It may not have been true in the past either.  It's just that I never noticed any change during the life of a given data file so just assumed there weren't any.

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4972
Credit: 18773465502
RAC: 7212152

Well one difference for the gpu tasks is there are no sky points listed.  Just the nf1dots values.

<core_client_version>7.17.0</core_client_version>
<![CDATA[
<stderr_txt>
ch over f0 and f1.
% nf1dots: 41  df1dot: 2.512676418e-15  f1dot_start: -1e-13  f1dot_band: 1e-13
% Filling array of photon pairs
.
.
.

 

wwamann
Joined: 19 Nov 11
Posts: 2
Credit: 47702517
RAC: 0

Thank you Keith for the quick response and information last night. It helped, or at least I thought it did.

Thank you Gary for the in-depth response, which I am struggling to understand. I will go over it till I do.

Basically you're saying I could go many hours without a checkpoint and be at risk of wasting time and money (electricity) at any moment, based on the whim of the power company.

That's not good.

Oh well, buy your ticket, ride your ride. Also, with my equipment setup, would it be better to run any of the other tasks in addition to or instead of the Gamma-ray pulsar search?

Thanks again  for the response.

Waren...

Keith Myers
Joined: 11 Feb 11
Posts: 4972
Credit: 18773465502
RAC: 7212152

That is why a prudent computer user purchases an uninterruptible power supply unit to power a crunching host.

Then you don't worry about the power company outages.

I would say no. The GR sub-project is the easiest to crunch for both your CPU and GPU.

 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117792061904
RAC: 34679285

Keith Myers wrote:
Well one difference for the gpu tasks is there are no sky points listed.  Just the nf1dots values.

Well, actually there are:-).  The GPU search is looking for pulsars in binary systems - the B1G in the string FGRPB1G stands for 'Binary search #1 on GPUs'.  Here, there is not a series of 'sky points' but rather a series of 'binary points' which I guess probably means some sort of 'region' encompassing the orbiting objects.

What you have shown in your example is part of a much larger and truncated output.  There is a limit on the file size that gets returned so you see none of the initial stuff like you do with CPU task output.  You just get to see the final stages of the output.  When these GPU tasks first started, the first part of the file got returned.  That got changed because the interesting stuff is at the tail end of the output.

If you scroll down below what you posted until you get to the end of the column of dots you will see something like the following which comes from one of mine:-



% Binary point 1393/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.512676418e-15  f1dot_start: -1e-13  f1dot_band: 1e-13
% Filling array of photon pairs

In this example, everything from the start of the output up to here (binary point 1393 out of a total of 1631) has been removed to keep the file size within limits - I think the limit is 64KB but I might be wrong.  If you keep scrolling down you will eventually come to a slightly different binary point that looks like the following:-



.
.
% C 0 1418
% Binary point 1419/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.512676418e-15  f1dot_start: -1e-13  f1dot_band: 1e-13
% Filling array of photon pairs
.
.

The initial extra line tells you that once binary point 1418 was completed, a checkpoint was written prior to moving on to binary point 1419.  The GPU performs these calculations so quickly that it would be a real waste to keep writing checkpoints for every single binary point.  With the GPU app, the checkpoint interval is designed to be as close as possible to one minute.
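
To find the last checkpoint in a long GPU output without scrolling, something like the following Python sketch could be used. It assumes the second number on a "% C" line is the last completed binary point, as described above for "% C 0 1418". The function name `gpu_progress` is hypothetical.

```python
import re

# Tail of a GPU task's stderr, as in the excerpt above (dots omitted).
TAIL = """\
% C 0 1418
% Binary point 1419/1631
% Starting semicoherent search over f0 and f1.
% nf1dots: 41  df1dot: 2.512676418e-15  f1dot_start: -1e-13  f1dot_band: 1e-13
% Filling array of photon pairs
"""

def gpu_progress(text):
    """Return (last checkpointed binary point, latest binary point started, total).

    Assumes "% C <a> <b>" marks a checkpoint written after binary point <b>.
    Each value is None when no matching line is found.
    """
    ckpts = re.findall(r"^% C \d+ (\d+)\s*$", text, re.MULTILINE)
    points = re.findall(r"^% Binary point (\d+)/(\d+)", text, re.MULTILINE)
    last_ckpt = int(ckpts[-1]) if ckpts else None
    if points:
        current, total = int(points[-1][0]), int(points[-1][1])
    else:
        current, total = None, None
    return last_ckpt, current, total

print(gpu_progress(TAIL))
```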

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117792061904
RAC: 34679285

wwamann wrote:
Thank you Gary for the in-depth response, which I am struggling to understand. I will go over it till I do.

Please ask further questions if anything is not clear.

wwamann wrote:
Basically you're saying I could go many hours without a checkpoint and be at risk of wasting time and money (electricity) at any moment, based on the whim of the power company.

Are you saying that you have a very unreliable power company?  Imagine that you did and the power went off at some random time once a week.  With the current type of task I looked at, on average you might lose about 1.3 hours of crunching every week to that outage (half a checkpoint interval).  If you were running 24/7, 1.3 hours in 168 hours is less than a 1% loss.  I imagine it would be an extremely bad power utility to give you an outage each week.
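
The arithmetic behind that estimate can be sketched in Python. The assumptions here are that an outage is equally likely at any moment, so on average half a checkpoint interval of work is lost, and that the interval is 2.6 hours, which is simply the value that reproduces the 1.3-hour figure. The function name is hypothetical.

```python
def expected_weekly_loss(checkpoint_interval_hours, outages_per_week=1,
                         hours_per_week=168):
    """Return (hours lost per week, fraction of a 24/7 week lost).

    An outage at a uniformly random moment loses, on average, half of
    one checkpoint interval of work.
    """
    lost = outages_per_week * checkpoint_interval_hours / 2
    return lost, lost / hours_per_week

lost_hours, lost_fraction = expected_weekly_loss(2.6)  # 2.6 h is an assumed interval
print(f"{lost_hours:.1f} h lost per week, {lost_fraction:.2%} of a 24/7 week")
```

The result is per running task. With several tasks running at once, each loses about the same amount, so the percentage lost stays the same.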

The thing that's a bigger concern is how you have set your preferences and how you share your computer between crunching and your daily activities.  If you only switch on your machine when you have your normal work to do and if your preferences suspend crunching when you are using your computer, then progress might be pretty much non-existent, if you don't set the preference to keep tasks in memory when suspended.  You need to look through your preferences and tell us how various things are set and what your usage habits are.

The tasks you are crunching are the best for your machines which look like they have lots of CPU cores.  If you haven't done so already, you should try allowing your machine to crunch while you continue to use it for your daily activities.  BOINC is quite OK with getting out of the way when you need the resources for your own use.  I'd be surprised if you noticed any slowdown.  If you do suspend crunching when you use the machine, make sure you keep tasks in memory when suspended.
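
If local preferences have been set through BOINC Manager, the two relevant settings can also be read from the override file. The Python sketch below assumes those local settings live in `global_prefs_override.xml` in the BOINC data directory and use the `disk_interval` and `leave_apps_in_memory` tags. The sample text is made up and the function name is hypothetical. Remember that a short `disk_interval` will not make these CPU tasks checkpoint any sooner than the end of a sky point.

```python
import xml.etree.ElementTree as ET

# Made-up sample of what a local override file might contain.
SAMPLE_PREFS = """\
<global_preferences>
   <disk_interval>60</disk_interval>
   <leave_apps_in_memory>0</leave_apps_in_memory>
</global_preferences>
"""

def read_checkpoint_prefs(xml_text):
    """Return (checkpoint interval in seconds, keep-in-memory flag).

    Either value is None when the tag is absent or empty.
    """
    root = ET.fromstring(xml_text)
    interval = root.findtext("disk_interval")
    in_memory = root.findtext("leave_apps_in_memory")
    return (
        float(interval) if interval and interval.strip() else None,
        bool(float(in_memory)) if in_memory and in_memory.strip() else None,
    )

interval_s, keep_in_memory = read_checkpoint_prefs(SAMPLE_PREFS)
print(f"checkpoint interval: {interval_s} s, keep in memory: {keep_in_memory}")
```

With the sample above, the keep-in-memory flag is off, which is the situation where suspended tasks fall back to their last checkpoint.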

Cheers,
Gary.

Keith Myers
Joined: 11 Feb 11
Posts: 4972
Credit: 18773465502
RAC: 7212152

I always forget that Einstein truncates the stderr.txt output when a task is reported.  I'm too used to how SETI worked.

 
