No progress saved on Gamma-ray pulsar search #5

Chris
Joined: 19 Jan 10
Posts: 1
Credit: 3987010
RAC: 0
Topic 213432

I run several Gamma-ray pulsar search #5 tasks, but their progress gets reset every time I restart BOINC Manager.

I'm having to abort those tasks until I find a solution.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110034303372
RAC: 22395166

Chris_440 wrote:
I run several Gamma-ray pulsar search #5 tasks, but their progress gets reset every time I restart BOINC Manager.

You don't give enough details to be sure, but it sounds like you have a situation where crunching is stopping and restarting - perhaps due to a preference setting such as "Suspend crunching when user is active", particularly if tasks are not being kept in memory while suspended. Normally this is not too much of a problem (as long as new checkpoints are made regularly) because crunching is designed to restart from a checkpoint saved on disk, even if the 'in memory' image has been discarded. All you lose is the time since the last checkpoint was saved.

The problem is really that many of the current FGRP CPU tasks have a long interval between checkpoints - it could easily be something like an hour. If a task happens to get suspended without being kept in memory, up to an hour's worth of crunching could disappear, and if the initial checkpoint had never been created, the task would restart right back at the beginning when crunching is resumed. If you select a task in BOINC Manager (advanced view) and click on Properties, you can see if/when a checkpoint was last saved.
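
If you'd rather look at the on-disk checkpoint files directly, here is a minimal Python sketch (not a BOINC tool, just an illustration) that lists any *.cpt checkpoint files under the BOINC slots directory and shows how long ago each one was written. The data directory path is only an assumption for a typical Linux install - adjust it for your own system (e.g. C:\ProgramData\BOINC on Windows).

import os
import time

# Assumed location of the BOINC data directory - change to suit your system.
BOINC_DATA = "/var/lib/boinc-client"

slots_dir = os.path.join(BOINC_DATA, "slots")
for slot in sorted(os.listdir(slots_dir)):
    slot_path = os.path.join(slots_dir, slot)
    if not os.path.isdir(slot_path):
        continue
    for name in os.listdir(slot_path):
        if name.endswith(".cpt"):   # FGRP checkpoint files end in .out.cpt
            age_min = (time.time() - os.path.getmtime(os.path.join(slot_path, name))) / 60
            print(f"slot {slot}: {name} written {age_min:.0f} minutes ago")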

Quote:
I'm having to abort those tasks until I find a solution.

This won't solve anything because the next task will just behave the same way. One of your options is to work out how you can avoid suspending running tasks in the first place. If tasks must be suspended, make sure they are set to remain in memory while suspended. There is nothing you can do to change the very long checkpoint save interval itself. These tasks require a lot of memory, so if your machine feels sluggish, an option is to reduce the number of concurrent tasks (the '% of cores' setting) that BOINC is allowed to start.

As I mentioned, the above comments are just guesses.  If it's something else, please give more details.

 

Cheers,
Gary.

mitchB
Joined: 22 Aug 12
Posts: 2
Credit: 1171950
RAC: 0

Gary, I have exactly the same issue as Chris while crunching Gamma-ray pulsar search #5.

I work odd and irregular hours. When I'm home, BOINC is closed down. When I go to work I open BOINC and let it run. I've set my machine to trade off between SETI and Einstein every hour, but lately I've suspended SETI and given every minute to Einstein because the deadline is near. For the last 3 days I've worked at least 3 hours a day, and every time it started crunching Einstein from the beginning, so now there's no way I can finish on time. I have always operated BOINC this way; this issue has only just started. If you need more information, just let me know what it is and I will get it to you.

Thanks Mitch

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110034303372
RAC: 22395166

mitchB wrote:
Gary, I have exactly the same issue as Chris while crunching Gamma-ray pulsar search #5.

Your issue isn't the same. I can't demonstrate this across different tasks since there is only one task currently on the website with output that can be browsed. That one does show it was able to be restarted from a saved checkpoint (several times) without going back to the beginning. There are two other tasks that show as 'in progress' (nothing returned yet), but unless they are both close to finishing, you should probably consider aborting them to avoid wasting time on a lost cause. That decision is entirely up to you.

Your computer shows as a 3.0GHz P4. I presume it is a single core with HT (2 threads) and that both threads are crunching tasks? I'm guessing it was probably new around 2006/2007. If so, it's really not up to crunching current Einstein tasks in a reasonable time, and with HT enabled, tasks are going to be incredibly slow. You can get some idea of how much stress the machine is under by looking at the difference between elapsed time and CPU time for any returned tasks.

The single task you have was returned on Feb 6 with a computation error. The CPU time was 19K secs and the elapsed time was 41K secs. In other words, the CPU was doing work on that task for less than half the time the task was listed as 'running', and in case you are wondering, this time gap has nothing to do with restarting from a saved checkpoint. Normally (if a machine is not under other loads) the two times are very much closer than that. The entry for that task could be removed from the database any time now, so if you want to see what I'm talking about you should check it out before it disappears.
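
Just to make that arithmetic explicit, here it is as a tiny Python calculation using the numbers quoted above:

# Duty cycle of the returned task - numbers from the task entry described above.
cpu_time = 19_000      # seconds the CPU actually spent working on the task
elapsed_time = 41_000  # seconds the task was listed as 'running'

print(f"CPU busy for {cpu_time / elapsed_time:.0%} of the elapsed time")
# -> about 46%; on a machine not under other loads the two times would be much closer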

To do that, go to your account page, click on the link to your computer, and on the details page click on the tasks link. You will then see the three tasks I'm talking about. Under the Task ID column for the task whose status is listed as aborted, there is a clickable link that will take you to a page showing exactly what was returned to the project when that task failed. Below are some small snips that show what happened, starting some lines below the Stderr output heading.

10:15:02 (2648): [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/hsgamma_FGRP5_1.08_windows_intelx86__FGRPSSE.exe'.

The above shows the time crunching started - unfortunately no date.

% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0021F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0021F_1016.0_255565_0.0_0_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/79
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.840486273e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
INFO: Major Windows version: 6
% C 1 0

The above shows several things.  The data file being used was LATeah0021F.dat.  No checkpoint file could be found so this was a normal 'start from the very beginning'.  This is the first Sky point out of a total of 79 to be processed.  The parameter nf1dots has a value of 56 so there will be 56 calculation 'loops' before this Sky point is completed.  Each of those 'loops' is represented by a '.' in that line of '....'.  Count them if you wish :-).  When that Sky point is finished, a checkpoint is written.  The '% C 1 0' line represents the saving of the very first checkpoint.  Then the sequence repeats with Sky point 2/79 and so on.
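
If you'd rather not count dots by hand, here is a small Python sketch that tallies the checkpoint lines and the progress dots from a saved copy of a task's Stderr output. The file name 'stderr.txt' is just an assumption - paste the output from the task page into any local file.

import re

checkpoints_written = 0
dots_per_line = []

with open("stderr.txt") as f:
    for raw in f:
        line = raw.strip()
        if re.match(r"% C \d+ \d+$", line):
            checkpoints_written += 1                   # each '% C n 0' line means checkpoint n was saved
        else:
            m = re.match(r"(\.+)", line)
            if m:
                dots_per_line.append(len(m.group(1)))  # one '.' per nf1dots 'loop' completed

print("checkpoints written:", checkpoints_written)
print("dots in each progress line:", dots_per_line)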

Eventually, if you follow the full output, you will see a total of 4 successful checkpoints, with the following for Sky point 5:

% Sky point 5/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.840486273e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
................................................09:44:50 (7348): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43

Notice the incomplete line of '......' compared with the earlier ones. Those dots ended abruptly when the machine was turned off. They represent crunching that was not saved because the machine was stopped and restarted. Unfortunately, that's the way it goes if the machine has to be stopped. However, the task doesn't restart from the beginning but from checkpoint 4, as you can clearly see in the next bit of the record. You can see the restart showing as the current time, 09:44:50, tacked straight on to where the partial row of dots finished.

The following snip is taken after some further startup messages, right at the point where you can see the checkpoint for Sky point 4 being successfully read. It continues right through to the successful writing of checkpoint 5 (after the full 56 '.....').

% checkpoint read: skypoint 4 binarypoint 0
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 5/79
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.840486273e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
INFO: Major Windows version: 6
% C 5 0

If you continue scrolling through the record, you will see a number of additional stops and restarts, but all of them clearly show restarting from a saved checkpoint. The last one successfully written was #16 out of 79 - in other words, only around 20% of the task had been crunched at that point. The times I showed earlier don't include any of the time used to crunch part of the way towards the next checkpoint before the machine was stopped and restarted. That time always reverts back to the time accumulated when the checkpoint was written. So, reading in a saved checkpoint when crunching restarts always recovers the total crunching time used up to that checkpoint, without recording any of the extra time that was 'lost' before the stop.

After checkpoint 16 when Sky point 17/79 was being crunched, here's what happened.

% C 16 0
% Sky point 17/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.840486273e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
...................

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Breakpoint Encountered (0x80000003) at address 0x74B91072

Engaging BOINC Windows Runtime Debugger...

You can see a partial line of '....' so it never got to the next checkpoint.  I don't use Windows at all and I'm not a programmer so I don't know what has caused this crash.  I think an 'unhandled exception' means there was an error condition that the programmer didn't anticipate and didn't provide a mechanism to recover from.

The status of the task says aborted, but did you abort the task while it was running? I was just wondering if aborting the running task might be what triggered the crash rather than allowing a more graceful termination. Seeing as it was already past deadline (received 20 Jan, returned 6 Feb), perhaps BOINC just listed it as 'aborted' automatically even if you didn't specifically abort it.

So the summary of this whole episode is that your machine really isn't powerful enough to do these tasks in a reasonable time in the first place. Because checkpoints are fairly widely spaced, if you stop and start the machine a lot, you will lose some work (the partial progress towards the next checkpoint) each time you do that. If your machine only runs a short time each day, you'll probably be hard-pressed to get tasks finished within the allowed deadline. There is no evidence that tasks are restarting right from the beginning. The large difference between CPU time and elapsed time suggests that your machine is overloaded; that may simply be the result of running two simultaneous crunching tasks on a machine with the much less efficient old-style HT. Have you tried seeing what happens if you turn HT off and run just one task at a time?

 

Cheers,
Gary.

mitchB
Joined: 22 Aug 12
Posts: 2
Credit: 1171950
RAC: 0

Wow Gary, thanks for your thoroughness. You pretty much pegged my system right on the head. I've been doing this from the start. I'm getting old and so is my system. Back in 2000 I started working for Intel, so I had the best sh*t for the time. Then I left there in '04 with the fastest machine on the planet - so I thought, but there it stayed. Thanks for reminding me of my mortality, Gary. (jk, I'm bustin yur balls.) So it's time to step up or step back - hell, I thought we'd be talking to ET by now anyway. I'll try turning off HT and see what happens. If that fails I'll have to see about upgrading.

 

Thanks Gary

Peace!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110034303372
RAC: 22395166

mitchB wrote:
Wow Gary, thanks for your thoroughness.

You're most welcome.  I'm glad it was of some use to you.

I like to do this sort of investigation and then write it up - I learn a lot myself in the process. I know that for every person who asks for help about a problem, there will probably be multiple readers interested in the details of the solution. My hope is that if I do it thoroughly and try to make it easy for others to follow along, there will be some who see the sort of things they can do to investigate problems for themselves.

Lots of us oldies hang around projects like Einstein.  I'm sure most of us are reminded of our mortality every day :-).   In the back of our minds runs a constant theme - use it or lose it.  I have no idea if there is even a smidgen of truth to it but I'm quite motivated by the hope that there is.  I get a lot of satisfaction out of making a contribution to a science that particularly interests me as well as keeping my brain as active as possible.

I notice you have now aborted the other two tasks that were about to expire anyway. Those two tasks were for a different data file (LATeah0002F.dat) from the one used for the task in the previous analysis I posted. The replacement tasks you have recently received are also for this same data file, so there are some things you need to consider if you are to get these crunched and returned. To illustrate the difference, here is a snip from one of the tasks you have just aborted. It shows why checkpoints didn't seem to be working and why those particular tasks were in fact always restarting from the beginning.

15:19:52 (4616): [debug]: Flags: i386 SSE GNUC X86 GNUX86
15:19:52 (4616): [debug]: Set up communication with graphics process.
% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0002F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0002F_56.0_5552_-5e-11_1_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/8
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 548 df1dot: 1.830806352e-015 f1dot_start: -5.1e-011 f1dot_band: 1e-012
% Filling array of photon pairs
..........................................................................................................................................................................................................................................................................................................10:56:20 (6736): [normal]: This Einstein@home App was built at: Jul 26 2017 09:32:43

I took this snip right near the end of what was reported back to the project for this aborted task.  The task had been stopped and started many times but (as the above shows) it was always restarting from the very beginning - simply because there had never been enough time to get to the point of writing the first checkpoint.

The reason for that can be seen in the above - there are only 8 Sky points in total and the value of nf1dots is 548. In other words, tasks for LATeah0002F.dat have only a tenth of the Sky points and ~10 times the work for each Sky point, so the interval between checkpoints will be around 10 times longer than for the data file used by the task returned on Feb 06. Whilst there are lots of dots in the above snip, there aren't 548.

Unless you allow your computer to run continuously until the first checkpoint is written, you will always lose whatever work has been done up to the point you turn it off. In BOINC Manager's Advanced view, you can go to the tasks tab, highlight the currently running task and select 'Properties' to see if it has actually got to the point of writing the very first checkpoint. You must keep the machine running until that first checkpoint is written, otherwise you will never make progress. At that point you will be able to see how much time was needed. If you multiply that by eight (and add an hour or three for the followup stage), you will know how long it will take to crunch the complete task.
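
As a rough illustration of that calculation (the 3.5 hours per checkpoint below is only a placeholder - substitute the time you actually measure to the first checkpoint):

# Back-of-the-envelope estimate of total crunch time for one of these tasks.
hours_per_checkpoint = 3.5   # measured time to the first checkpoint (placeholder value)
sky_points = 8               # from '% Sky point 1/8' for these LATeah0002F tasks
followup_hours = 2           # allowance for the followup stage ('an hour or three')

total_hours = hours_per_checkpoint * sky_points + followup_hours
print(f"estimated total crunch time: about {total_hours:.0f} hours")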

You could complete a task in 8 or 9 days if you could allow your machine to be on continuously each day for as long as it takes to write a new checkpoint. My guess (which could be quite wrong) is that you might have to budget for a continuous on-time of around 3-4 hours (maybe even more) before you turn it off. At some point in the future, the data file will change (the new task name will tell you). The next file could have different Sky points and nf1dots, so you would need to repeat the procedure of finding out how long a checkpoint takes, since it could be different from what it is currently. Of course, if you were prepared to let the machine run continuously, you wouldn't need to worry about checkpoints at all :-).

Let me know if you're not properly understanding what I'm waffling on about.

 

Cheers,
Gary.

Darren Peets
Joined: 19 Nov 09
Posts: 37
Credit: 98766954
RAC: 28527

I suspect the time between checkpoints changes with how the data file is sliced for analysis:  Based on a quick look, my LATeah0002F_88.0 tasks have nf1dots: 548 and my LATeah0002F_216.0 tasks have nf1dots: 56.

It's pretty difficult to extrapolate based on just that, though...

Another setting that could conceivably be involved is "Switch between tasks every ____ minutes".  If the time between checkpoints is longer than this interval and it tries to switch tasks without waiting for the next checkpoint (hopefully it would wait?), you might lose everything.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5850
Credit: 110034303372
RAC: 22395166

Darren Peets wrote:
I suspect the time between checkpoints changes with how the data file is sliced for analysis:  Based on a quick look, my LATeah0002F_88.0 tasks have nf1dots: 548 and my LATeah0002F_216.0 tasks have nf1dots: 56.

Now that is interesting! Tasks where the frequency component is relatively low (your 02F_88.0 example) have only 8 Sky points, so only 8 checkpoints will be written during the main analysis. At some point as the frequency component becomes higher (somewhere between 88.0 and 216.0), the number of Sky points changes to 79 and nf1dots reduces from 548 to 56. The measure of the work content of a task (8x548=4384 as opposed to 79x56=4424) remains essentially the same, but the number of checkpoints written increases dramatically, to 79 all told.
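
Spelled out as a quick calculation (Sky point counts and nf1dots values as quoted above):

# Work content vs checkpoint frequency for the two task types discussed above.
task_types = {
    "low frequency (e.g. 02F_88.0)":   {"sky_points": 8,  "nf1dots": 548},
    "high frequency (e.g. 02F_216.0)": {"sky_points": 79, "nf1dots": 56},
}

for name, t in task_types.items():
    work = t["sky_points"] * t["nf1dots"]   # rough measure of total work per task
    print(f"{name}: work ~ {work}, checkpoints written: {t['sky_points']}")
# low frequency:  work ~ 4384, only 8 checkpoints
# high frequency: work ~ 4424, 79 checkpoints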

I had thought the number of Sky points was changing with the particular data file, because MitchB's task list ended up with several 02F tasks, all of which have 8 Sky points, while the single example of a task with 79 Sky points was for a previous data file (021F). I made the error of assuming it was the change in data file that caused the change in Sky points. Because MitchB's newest tasks are low frequency (56.0), there will be long intervals between checkpoints for him.

Darren Peets wrote:
Another setting that could conceivably be involved is "Switch between tasks every ____ minutes".  If the time between checkpoints is longer than this interval and it tries to switch tasks without waiting for the next checkpoint (hopefully it would wait?), you might lose everything.

It depends on the setting for "Keep tasks in memory when suspended". If that option is 'yes' then checkpoints won't matter, because BOINC will be able to resume such a task from what is in memory rather than having to go looking for a checkpoint. Even if you are short of RAM and a suspended task gets swapped to disk, BOINC will use the swapped image rather than try to read a checkpoint (I believe - YMMV :-) ). Of course, such an image would be lost if BOINC gets stopped and restarted before the task is resumed.

 

Cheers,
Gary.
