Losing progress on shutdown

Redvibe
Joined: 5 Apr 18
Posts: 11
Credit: 2189846
RAC: 0
Topic 218406

For the past three days I have been losing my progress every time I shut down. When I start up again, everything starts from scratch. I am using macOS High Sierra 10.13.4.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118372482166
RAC: 25524153

Normally, tasks that are being crunched will have their state saved at regular intervals.  This is referred to as 'checkpointing'.  If you select any 'in progress' task on the tasks tab of BOINC Manager and click on 'properties', you will be able to see the current CPU time and also the CPU time when a checkpoint was last saved for that task.  If you shut down, you always lose that (usually small) bit of crunching between when the checkpoint was saved and the time when you shut down.
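To make the idea of checkpointing concrete, here is a minimal sketch in Python. It is not the Einstein@Home or BOINC code; the function name `run_task`, the JSON file format and the `stop_after` parameter are all hypothetical, chosen only for illustration. It saves state after every unit of work, whereas a real application would typically save at intervals. On restart it resumes from the last saved state, so only work done since the last checkpoint is lost.

```python
import json
import os
import tempfile


def run_task(total_units, checkpoint_path, stop_after=None):
    """Process units 0..total_units-1, saving a checkpoint after each one.

    stop_after (hypothetical) simulates a shutdown: the function returns
    once that many units have been processed in this session.
    Returns the state dict {"next_unit": ..., "acc": ...}.
    """
    # Resume from a saved checkpoint if one exists, else start from scratch.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"next_unit": 0, "acc": 0}

    done_this_session = 0
    while state["next_unit"] < total_units:
        if stop_after is not None and done_this_session >= stop_after:
            return state  # simulated shutdown
        state["acc"] += state["next_unit"]  # stand-in for real crunching
        state["next_unit"] += 1
        done_this_session += 1
        # Write to a temporary file and rename it, so that a shutdown
        # in the middle of a write cannot corrupt the previous checkpoint.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state


def demo():
    """Interrupt a 10-unit task after 4 units, then restart it."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "task.cpt")
        first = run_task(10, path, stop_after=4)  # "shutdown" after 4 units
        second = run_task(10, path)               # restart resumes at unit 4
        return first, second
```

If the checkpoint file were missing or deleted between the two calls in `demo()`, the second call would start again from unit 0, which is the symptom being described in this thread.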

There are two very recent threads (in this very forum) that suggest that some recent tasks were not properly saving checkpoints and then that got fixed, apparently.  Check out those threads for yourself.  Perhaps you got a rather large group of these problem tasks and are still working through them?  It would seem that you should eventually get new tasks that work correctly.  Nothing has actually been announced about this problem that I'm aware of.

Until the situation changes, perhaps you could time your shutdown to coincide with task completion and reporting.  You should be able to see when the problem goes away by checking the properties, perhaps 10 mins after a task starts.  I would expect to see a checkpoint written by that amount of time.

Cheers,
Gary.

Ged
Joined: 7 May 05
Posts: 4
Credit: 12143322
RAC: 0

I'm experiencing the same 'not checkpointing' issues with FGRPSSE #51.08 app running LATeah0052F_... and ...51F... tasks which run on a Windows 10 Pro, Xeon powered machine.

Ahead of machine shutdown, I checked the progress of the tasks that were running; one was at 89.989% with eight hours elapsed and 53 minutes to completion, another at 4.6%, 1:20hrs elapsed and 8:20hrs to go and six others that were 82%, 6:45hrs elapsed and 1:10 to completion (within a minute or two of each other). On starting the machine this morning, the latter six tasks have reverted to 0% complete - they did not checkpoint during execution nor as part of their 'app exit' routine.

I suffered similar issues, documented in another thread, for the O1OD1 (Gravity Wave Search v0.03) so deselected these from download until that issue got resolved.

From my 'contributor' perspective, I truly enjoy supporting this and other BOINC-based projects which run on my own kit at home, consuming electricity at a cost to me that is part of that contribution. However, the cost of processing a work unit for Einstein is rising rapidly, compared to other projects, due to the fact that non-checkpointing work units have to be reprocessed. Because I run multiple BOINC projects, this has a knock-on consequence for those other projects, too.

With that in mind, can someone undertake a code review to establish why some Einstein workunits checkpoint but others don't? Are there common, cross-application factors? For example: has the checkpointing subroutine changed, is that routine common to many Einstein apps, where in an app's execution cycle is the 'checkpointing need' being assessed, is compiler code optimisation causing the checkpoint to be bypassed... It should be noted that in this thread and the other O1OD1 threads, problems were observed and reported by people with Linux, Mac and Windows platforms.

Finally, and because in the end it's about doing reliable processing to support the science, another great concern in all this is that if there are 'code problems' with checkpointing that seem to affect some workunit instances, but not others of the same generation, how reliable are the results?

Rgds,

Ged

Redvibe
Joined: 5 Apr 18
Posts: 11
Credit: 2189846
RAC: 0

Gary, I have not had time to read the 'other posts' you mentioned (very busy right now) but I now think this is a more general problem with BOINC. I have the same problem on other projects (as noted by GED above). When I look at the event log I see "Scheduler request completed: got 0 new tasks" and, perhaps more worrying, "Host location: none" (see full copy of today's log below). This makes me wonder if BOINC is not connecting with the host. What can be done about this?

Event log

Wed 20 Mar 09:09:01 2019 |  | cc_config.xml not found - using defaults
Wed 20 Mar 09:09:01 2019 |  | Starting BOINC client version 7.14.2 for x86_64-apple-darwin
Wed 20 Mar 09:09:01 2019 |  | log flags: file_xfer, sched_ops, task
Wed 20 Mar 09:09:01 2019 |  | Libraries: libcurl/7.58.0 OpenSSL/1.1.0g zlib/1.2.11 c-ares/1.13.0
Wed 20 Mar 09:09:01 2019 |  | Data directory: /Library/Application Support/BOINC Data
Wed 20 Mar 09:09:01 2019 |  | OpenCL: Intel GPU 0: Intel(R) Iris(TM) Plus Graphics 640 (driver version 1.2(Mar 15 2018 22:04:21), device version OpenCL 1.2, 1536MB, 1536MB available, 384 GFLOPS peak)
Wed 20 Mar 09:09:01 2019 |  | OpenCL CPU: Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz (OpenCL driver vendor: Apple, driver version 1.1, device version OpenCL 1.2)
Wed 20 Mar 09:09:01 2019 |  | Host name: Ruths-MBP.home
Wed 20 Mar 09:09:01 2019 |  | Processor: 4 GenuineIntel Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz [x86 Family 6 Model 142 Stepping 9]
Wed 20 Mar 09:09:01 2019 |  | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clfsh ds acpi mmx fxsr sse sse2 ss htt tm pbe pni pclmulqdq dtes64 mon dscpl vmx smx est tm2 ssse3 fma cx16 tpr pdcm sse4_1 sse4_2 x2apic movbe popcnt aes pcid xsave osxsave seglim64 tsctmr avx rdrand f16c
Wed 20 Mar 09:09:01 2019 |  | OS: Mac OS X 10.13.4 (Darwin 17.5.0)
Wed 20 Mar 09:09:01 2019 |  | Memory: 8.00 GB physical, 40.02 GB virtual
Wed 20 Mar 09:09:01 2019 |  | Disk: 233.47 GB total, 38.39 GB free
Wed 20 Mar 09:09:01 2019 |  | Local time is UTC +0 hours
Wed 20 Mar 09:09:01 2019 |  | VirtualBox version: 5.2.22r126460
Wed 20 Mar 09:09:01 2019 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12639019; resource share 100
Wed 20 Mar 09:09:01 2019 | Milkyway@Home | URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 800327; resource share 100
Wed 20 Mar 09:09:01 2019 | Einstein@Home | General prefs: from Einstein@Home (last modified 11-Dec-2018 09:39:06)
Wed 20 Mar 09:09:01 2019 | Einstein@Home | Host location: none
Wed 20 Mar 09:09:01 2019 | Einstein@Home | General prefs: using your defaults
Wed 20 Mar 09:09:01 2019 |  | Reading preferences override file
Wed 20 Mar 09:09:01 2019 |  | Preferences:
Wed 20 Mar 09:09:01 2019 |  | max memory usage when active: 1228.80 MB
Wed 20 Mar 09:09:01 2019 |  | max memory usage when idle: 4096.00 MB
Wed 20 Mar 09:09:01 2019 |  | max disk usage: 8.00 GB
Wed 20 Mar 09:09:01 2019 |  | max CPUs used: 3
Wed 20 Mar 09:09:01 2019 |  | suspend work if non-BOINC CPU load exceeds 25%
Wed 20 Mar 09:09:01 2019 |  | (to change preferences, visit a project web site or select Preferences in the Manager)
Wed 20 Mar 09:09:01 2019 |  | Setting up project and slot directories
Wed 20 Mar 09:09:01 2019 |  | Checking active tasks
Wed 20 Mar 09:09:01 2019 |  | Setting up GUI RPC socket
Wed 20 Mar 09:09:01 2019 |  | Checking presence of 61 project files
Wed 20 Mar 09:09:02 2019 | Milkyway@Home | Sending scheduler request: To fetch work.
Wed 20 Mar 09:09:02 2019 | Milkyway@Home | Requesting new tasks for Intel GPU
Wed 20 Mar 09:09:04 2019 | Milkyway@Home | Scheduler request completed: got 0 new tasks
Wed 20 Mar 09:09:04 2019 | Milkyway@Home | General prefs: from Milkyway@Home (last modified 18-Mar-2019 07:35:53)
Wed 20 Mar 09:09:04 2019 | Milkyway@Home | Host location: none
Wed 20 Mar 09:09:04 2019 | Milkyway@Home | General prefs: using your defaults
Wed 20 Mar 09:09:04 2019 |  | Reading preferences override file
Wed 20 Mar 09:09:04 2019 |  | Preferences:
Wed 20 Mar 09:09:04 2019 |  | max memory usage when active: 1228.80 MB
Wed 20 Mar 09:09:04 2019 |  | max memory usage when idle: 4096.00 MB
Wed 20 Mar 09:09:04 2019 |  | max disk usage: 8.00 GB
Wed 20 Mar 09:09:04 2019 |  | max CPUs used: 3
Wed 20 Mar 09:09:04 2019 |  | suspend work if non-BOINC CPU load exceeds 25%
Wed 20 Mar 09:09:04 2019 |  | (to change preferences, visit a project web site or select Preferences in the Manager)
Wed 20 Mar 09:09:09 2019 | Einstein@Home | Sending scheduler request: To fetch work.
Wed 20 Mar 09:09:09 2019 | Einstein@Home | Requesting new tasks for Intel GPU
Wed 20 Mar 09:09:11 2019 | Einstein@Home | Scheduler request completed: got 0 new tasks
Wed 20 Mar 09:09:11 2019 | Einstein@Home | No work sent
Wed 20 Mar 09:09:11 2019 | Einstein@Home | Binary Radio Pulsar Search (Arecibo) is not available for your type of computer.
Wed 20 Mar 09:09:11 2019 | Einstein@Home | see scheduler log messages on https://einsteinathome.org/host/12639019/log
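The log lines above follow a `timestamp | project | message` layout, with an empty project field for messages from the client itself. As a rough sketch (the function names `parse_event_log` and `scheduler_results` are made up for this example and are not part of BOINC), the scheduler outcomes per project could be pulled out like this:

```python
def parse_event_log(text):
    """Split event log text into (timestamp, project, message) tuples.

    project is None for client-level messages with an empty project field.
    Lines that do not have three '|'-separated fields are skipped.
    """
    entries = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue
        timestamp, project, message = parts
        entries.append((timestamp, project or None, message))
    return entries


def scheduler_results(entries):
    """Return (project, message) for each completed scheduler request."""
    return [
        (project, message)
        for _, project, message in entries
        if message.startswith("Scheduler request completed")
    ]
```

A "Scheduler request completed" entry for a project indicates the client did reach that project's server, whatever number of tasks came back.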

Ged
Joined: 7 May 05
Posts: 4
Credit: 12143322
RAC: 0

@REDVIBE

I think the "Wed 20 Mar 09:09:01 2019 | Einstein@Home | Host location: none" just means that you haven't specified, in your Einstein Account->Preferences, whether your computer (the Host) is at Home, Work or School.

Rgds,

Ged

mikey
Joined: 22 Jan 05
Posts: 12780
Credit: 1867890186
RAC: 1854095

Ged wrote:

I'm experiencing the same 'not checkpointing' issues with FGRPSSE #51.08 app running LATeah0052F_... and ...51F... tasks which run on a Windows 10 Pro, Xeon powered machine.

Ahead of machine shutdown, I checked the progress of the tasks that were running; one was at 89.989% with eight hours elapsed and 53 minutes to completion, another at 4.6%, 1:20hrs elapsed and 8:20hrs to go and six others that were 82%, 6:45hrs elapsed and 1:10 to completion (within a minute or two of each other). On starting the machine this morning, the latter six tasks have reverted to 0% complete - they did not checkpoint during execution nor as part of their 'app exit' routine.

I suffered similar issues, documented in another thread, for the O1OD1 (Gravity Wave Search v0.03) so deselected these from download until that issue got resolved.

From my 'contributor' perspective, I truly enjoy supporting this and other BOINC-based projects which run on my own kit at home, consuming electricity at a cost to me that is part of that contribution. However, the cost of processing a work unit for Einstein is rising rapidly, compared to other projects, due to the fact that non-checkpointing work units have to be reprocessed. Because I run multiple BOINC projects, this has a knock-on consequence for those other projects, too.

With that in mind, can someone undertake a code review to establish why some Einstein workunits checkpoint but others don't? Are there common, cross-application factors? For example: has the checkpointing subroutine changed, is that routine common to many Einstein apps, where in an app's execution cycle is the 'checkpointing need' being assessed, is compiler code optimisation causing the checkpoint to be bypassed... It should be noted that in this thread and the other O1OD1 threads, problems were observed and reported by people with Linux, Mac and Windows platforms.

Finally, and because in the end it's about doing reliable processing to support the science, another great concern in all this is that if there are 'code problems' with checkpointing that seem to affect some workunit instances, but not others of the same generation, how reliable are the results?

Rgds, Ged

Some tasks checkpoint and some don't because that's how the programmers (at Einstein, in this case) want things to work. BOINC itself is not in charge of checkpointing a task, only the timing of it, and only if the programmers built checkpointing into the work units. Each project builds and then maintains its own workunits.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118372482166
RAC: 25524153

Ged wrote:

@REDVIBE

I think the "Wed 20 Mar 09:09:01 2019 | Einstein@Home | Host location: none" just means that you haven't specified, in your Einstein Account->Preferences, whether your computer (the Host) is at Home, Work or School.

Rgds,

Ged

This is correct but it needs to be mentioned that, for most people, there is no need to define a location.  The use of the word "none" is perhaps unfortunate.  There is a 'default' location used if none of the other locations (home, school, work) has been set.  The message would probably be less concerning if it read, "Host location:  The default location is in use."

The idea for having the 4 separate 'locations' (aka 'venues') is to allow people with more than a single computer to use different preference sets for different machines - if they so desire.  If you don't need that functionality, just ignore it.

Cheers,
Gary.

archae86
Joined: 6 Dec 05
Posts: 3161
Credit: 7272611730
RAC: 1819488

Gary Roberts wrote:

This is correct but it needs to be mentioned that, for most people, there is no need to define a location.  The use of the word "none" is perhaps unfortunate.  There is a 'default' location used if none of the other locations (home, school, work) has been set.  The message would probably be less concerning if it read, "Host location:  The default location is in use."

The idea for having the 4 separate 'locations' (aka 'venues') is to allow people with more than a single computer to use different preference sets for different machines - if they so desire.  If you don't need that functionality, just ignore it.

It is an endless source of confusion that the alternate names "location" and "venue" are both used, but in different places, to mean precisely the same thing.

Regarding the four alternative locations, three have consistent names:

Home, School, and Work

But the fourth is in some places called "generic" and I believe in other places such as the quoted message "none".

But none of the four is called "default".  Rather it is true that a user has the authority to designate any one of the four standard locations as their personal default choice.  The functional meaning is that after setting that preference (available in the project preference page for each location) if the user attaches a new computer to the project it will start off assigned to that location.

Anyway, that is my current understanding.  The terminology is a bit awkward, and is not consistently used, and is a source of misunderstanding during troubleshooting, as frequently users make settings that apply to one location, which is not the actual location assigned to the computer they are trying to influence.  Naturally, such a user confidently thinks that either the project or BOINC on their computer is ignoring them. 

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118372482166
RAC: 25524153

Redvibe wrote:
Gary, I have not had time to read the 'other posts' you mentioned (very busy right now) but I now think this is a more general problem with BOINC.

As Mikey suggests, checkpointing (or the lack of it) is a function of the application and not BOINC.  Each project (Einstein, Milkyway, etc) decides (for each individual search being conducted) if checkpointing is to be used and how it is to be implemented.

I'm not running CPU applications at the moment but I've previously done so for many years at Einstein.  In the past, each production run has always used checkpointing for the very reason of saving state regularly for long running tasks so I'd be extremely surprised if it wasn't being used for current production searches.

The two other comments I suggested you look at were simply to show that there may have been a temporary glitch with the O1OD1 search.  The followup there suggested that the issue had been resolved.  The absence of ongoing reports in those threads seems to confirm that.

Redvibe wrote:
I have the same problem on other projects (as noted by GED above). When I look at the event log I see "Scheduler request completed: got 0 new tasks" and, perhaps more worrying, "Host location: none" (see full copy of today's log below). This makes me wonder if BOINC is not connecting with the host. What can be done about this?

The BOINC client is correctly running on your host - the event log shows that - and it is communicating with the Einstein servers.  The "scheduler request completed" message shows there was a successful two-way transaction.  Apparently, your host was requesting work for the old BRP (Arecibo) search and I think these days, that sort of work is designed for mobile devices like phones and tablets, etc.  Are you normally able to get those sorts of tasks?

In the last line there is a link to the scheduler logs that you can use to see a lot more details about the decision making process the scheduler goes through in deciding what work to send.  Did you have a look there to see if there was additional information?  There is this pinned thread in the 'Getting Started' forum which gives some information about interpreting the scheduler logs.

With reference to Ged's post, the 'no checkpointing' report was about the FGRP5 search.  That's quite different to your report about the O1OD1 search.  As far as I'm aware, there are no other reports about the FGRP5 search and there have been no changes to the app or the nature of the data there.  I would expect to see lots of reports if in general, checkpointing wasn't working correctly for that search.

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118372482166
RAC: 25524153

Ged wrote:

I'm experiencing the same 'not checkpointing' issues with FGRPSSE #51.08 app running LATeah0052F_... and ...51F... tasks which run on a Windows 10 Pro, Xeon powered machine.

Ahead of machine shutdown, I checked the progress of the tasks that were running ....
....

Thanks for the report about the FGRP5 search.  I'm not running any CPU searches at the moment so I have no direct knowledge of a checkpointing problem with this search.  I've looked at a couple of machines owned by relatives and running this search and checkpointing seems normal there.

As far as I'm aware, regular checkpoints have always been used for this search.  I imagine there would be quite a few complaints if others were seeing a problem as well.  I haven't seen any so I'm wondering if something else is going on.

When you went to the trouble of recording the state of all tasks prior to shutdown, did you use the 'task properties' option to record the current CPU time and the time when the last checkpoint was written, as well?  If all your tasks (especially the one at 89.989%) had no previous checkpoint information recorded, that would prove there was a problem creating checkpoints.  If all had checkpoint times recorded, but all subsequently restarted from scratch, that would suggest a different issue - checkpoints are recorded but (for some odd reason) are not being used when BOINC is restarted.

I picked one of your recently completed tasks listed on the website at random.  By clicking on the TaskID link for such a task, you can see exactly what was returned to the servers in the stderr message output.  Below is a snippet of such information which clearly shows that checkpoints were being written.

This task contains 79 'Sky points'.  For each of these there are 56 'nf1dots'.  Each nf1dot is a calculation loop which, when complete, results in a single 'dot' (or decimal point) being written to the output.  If you count them you will find there are 56 on a single line.  When that line is complete, the sky point is complete and then a checkpoint is written.  The very first checkpoint can be seen in the line "% C 1 0" and then the calculation moves to sky point 2 (of 79).  This example also shows the read_checkpoint() function being used to look for a previously saved checkpoint.  Since none was found, the calculations started right at the beginning.

% Opening inputfile: ../../projects/einstein.phys.uwm.edu/LATeah0050F.dat
% Total amount of photon times: 30000
% Preparing toplist of length: 10
read_checkpoint(): Couldn't open file 'LATeah0050F_1224.0_52219_0.0_0_0.out.cpt': No such file or directory (2)
% fft_size: 67108864 (0x4000000, 2^26); alloc: 268435464
% Sky point 1/79
% Creating FFT (3.3.4 22109fa) plan.
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.846420581e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
INFO: Major Windows version: 6
% C 1 0
% Sky point 2/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.846420581e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
% C 2 0
% Sky point 3/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.846420581e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
% C 3 0
% Sky point 4/79
% Starting semicoherent search over f0 and f1.
% nf1dots: 56 df1dot: 1.846420581e-015 f1dot_start: -1e-013 f1dot_band: 1e-013
% Filling array of photon pairs
........................................................
% C 4 0
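The stderr excerpt above can also be checked mechanically rather than by eye. The following Python sketch assumes, as described in this post, that a `% C n 0` line marks the checkpoint written after sky point n and that the `read_checkpoint(): Couldn't open file` line means no earlier checkpoint was found. The function name `summarize_stderr` and the returned field names are invented for this example.

```python
import re

SKY_RE = re.compile(r"^% Sky point (\d+)/(\d+)$")
CPT_RE = re.compile(r"^% C (\d+) (\d+)$")


def summarize_stderr(text):
    """Summarise sky points started and checkpoints written in stderr text."""
    started = []
    checkpointed = []
    total = None
    started_from_scratch = False
    for raw in text.splitlines():
        line = raw.strip()
        m = SKY_RE.match(line)
        if m:
            started.append(int(m.group(1)))
            total = int(m.group(2))
            continue
        m = CPT_RE.match(line)
        if m:
            checkpointed.append(int(m.group(1)))
            continue
        if line.startswith("read_checkpoint(): Couldn't open file"):
            # Assumed meaning: no saved checkpoint existed at start-up.
            started_from_scratch = True
    return {
        "total_sky_points": total,
        "started": started,
        "checkpointed": checkpointed,
        "started_from_scratch": started_from_scratch,
    }
```

A task whose `started` list keeps growing while `checkpointed` stays empty would point to checkpoints not being written; a task that shows checkpoints and yet later reports `started_from_scratch` again would point to checkpoints being lost between sessions.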

It would seem to me that your problem could be caused if something was deleting saved checkpoints.  I'm wondering if perhaps you might have an overly aggressive virus/malware checker that might be identifying checkpoint files as some sort of malware and removing them??

The above task took about 30,000 secs to complete for 79 sky points (plus the followup stage after 89.989%).  Roughly speaking, that means a checkpoint was written approximately every 6 mins.
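The estimate above is simple arithmetic, sketched here in Python. The function name and the optional `sky_fraction` parameter are hypothetical. Setting `sky_fraction` to 0.89989 assumes that the progress percentage tracks elapsed time, which is only an approximation.

```python
def checkpoint_interval_minutes(total_seconds, sky_points, sky_fraction=1.0):
    """Average minutes between checkpoints, one checkpoint per sky point.

    sky_fraction is the (assumed) share of run time spent on sky points,
    as opposed to the followup stage.
    """
    return total_seconds * sky_fraction / sky_points / 60.0
```

With 30,000 secs and 79 sky points this gives roughly 6.3 minutes; allowing for the followup stage with `sky_fraction=0.89989` brings it down to about 5.7 minutes.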

Cheers,
Gary.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5874
Credit: 118372482166
RAC: 25524153

archae86 wrote:

Regarding the four alternative locations, three have consistent names:

Home, School, and Work

But the fourth is in some places called "generic" and I believe in other places such as the quoted message "none".

But none of the four is called "default"....

I think the "generic" term might be specific to Einstein.  It's quite a while since I've run other projects so I'm not familiar with the naming used elsewhere.  In the past I've seen the default location expressed as "--" or maybe it was three dashes :-).

Whilst there have been cosmetic changes to long standing terminology (eg venues -> locations) that are directly attributable to BOINC devs, I think there are also server-side changes that individual projects make that further add to this potential for confusion.  I imagine that there are probably several nomenclature differences that now exist due to the fact that Einstein uses older (and heavily customised) server code versions.

I didn't actually say (or imply) that the name of the default location was "default" :-).  I said there was a 'default' location, but I steered clear of the thorny issue of what its proper name was :-).

For the benefit of people with a single computer, or for those with a small number where the same settings for each is appropriate, I would suggest that users carefully consider if additional locations are really needed.  They could tend to create the type of problems that archae86 mentions.  A user could check this out by going to the account page and clicking the right hand menu item of "Preferences".  On the next page click the 2nd sub-menu item "Project".

This will show the set of project preferences in play for the location shown in the "Preference set:" drop down box near the top left of the page.  For me, that shows "Generic".  Right down in the bottom right hand corner of the full page, there is a "(show comparison view)" link.  By clicking that you will be able to see if you are using more locations than just the generic one as there will perhaps be entries for any or all of the 4 different locations.  There are controls there for clearing any locations that aren't needed.

From time to time - particularly when applications change or new versions come along - it's useful to carefully review the project preferences to make sure they meet your expectations.  There is a lot of complexity there and also in other preference sets, like computing preferences, as well.

Cheers,
Gary.
