CPU time since checkpoint: 4h

hyphens
Joined: 7 Jan 22
Posts: 3
Credit: 11781
RAC: 0
Topic 226803

Seems wasteful. Can we get more frequent checkpoints?

("Request tasks to checkpoint at most every" is at its default of 60 s. And yes, it's an older machine, 6 years or so.)

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17687506
RAC: 12873

This is an ongoing problem, or rather an annoyance you have to watch out for to avoid wasting CPU time. It affects the FGRP5 CPU app, and only workunits that process a very small number of skypoints. A typical FGRP5 CPU WU contains, for example, 63 skypoints. A checkpoint can only be written after a skypoint has been fully processed, regardless of the setting in the host configuration ("checkpoint at most every ... 60 sec"). Certain workunits contain ONLY SIX (6) skypoints. They can be identified by a small number (< 100) following the prefix "LATeah" and the ID of the raw data file (e.g. "1090F"), for example "88.0" in "LATeah1090F_88.0_4314_-3.2e-11_0". These WUs checkpoint very rarely while their progress is between 0% and 90%: exactly six times, at 15%, 30%, 45%, ..., 90%. After 90% (calculation of the final toplists), 10 more checkpoints are written, one after each of the 10 candidates has been processed (see the progress output in the stderr.txt file in the WU's slot directory).
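The naming rule and the checkpoint positions described above can be sketched in a few lines of Python. This is only an illustration: the regular expression is inferred from the single example name in this post, the function names are made up, and the 100 threshold and the 90% share of the skypoint phase are simply the figures quoted above, not anything taken from the FGRP5 source.

```python
import re

# Assumed name layout, inferred only from the one example in the post:
# "LATeah" + raw data file ID + "_" + number + "_" + remaining fields.
WU_NAME = re.compile(r"^LATeah(?P<data_id>[0-9A-Za-z]+)_(?P<number>\d+(?:\.\d+)?)_")


def is_sparse_checkpoint_wu(name, threshold=100.0):
    """Return True if the number after the data file ID is below threshold."""
    match = WU_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised workunit name: {name!r}")
    return float(match.group("number")) < threshold


def skypoint_checkpoint_progress(skypoints, main_phase_share=0.90):
    """Progress fractions at which a checkpoint can be written,
    assuming one checkpoint per skypoint, evenly spread over the
    first 90% of the task."""
    return [round(main_phase_share * (i + 1) / skypoints, 4)
            for i in range(skypoints)]


print(is_sparse_checkpoint_wu("LATeah1090F_88.0_4314_-3.2e-11_0"))
print(skypoint_checkpoint_progress(6))
```

With 63 skypoints the same function gives a checkpoint opportunity roughly every 1.4% of progress, which is why the problem goes unnoticed on ordinary workunits.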

With an older notebook, it easily takes 90 minutes between two checkpoints, or even several hours if the CPU is throttled to reduce heat and fan noise (through BOINC's CPU throttle setting or external tools like TThrottle). If you shut the computer down in between, all computing time since the last checkpoint is lost. So you have to monitor the progress of the running WU closely and delay system shutdown until 15%, 30%, ... is reached. Alternatively, you have to put the computer into Suspend to RAM or Suspend to Disk instead of shutting it down completely to prevent loss of computation.
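To put rough numbers on this, here is a small Python sketch. It assumes that throttling to a given share of CPU time stretches the wall-clock time between checkpoints proportionally, and that a shutdown happens at a random moment within an interval, so on average half an interval is lost. Both are simplifying assumptions, and the function names are made up for illustration.

```python
def checkpoint_interval_minutes(unthrottled_minutes, cpu_time_fraction=1.0):
    """Wall-clock minutes between checkpoints when the CPU only works
    for the given fraction of the time (assumed proportional slowdown)."""
    if not 0.0 < cpu_time_fraction <= 1.0:
        raise ValueError("cpu_time_fraction must be in (0, 1]")
    return unthrottled_minutes / cpu_time_fraction


def expected_loss_minutes(interval_minutes):
    """Average loss for a shutdown at a uniformly random moment;
    the worst case is the full interval."""
    return interval_minutes / 2.0


interval = checkpoint_interval_minutes(90, cpu_time_fraction=0.5)
print(interval, expected_loss_minutes(interval))
```

So 90 minutes per skypoint at a 50% throttle means up to three hours of wall-clock time can be lost by one badly timed shutdown.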

It would be desirable if the FGRP5 app could checkpoint more frequently, including within a single skypoint.

HAL
Joined: 9 Mar 20
Posts: 2038
Credit: 40453712
RAC: 39819

Scrooge McDuck wrote:

This is an ongoing problem, or rather an annoyance you have to watch out for to avoid wasting CPU time. It affects the FGRP5 CPU app, and only workunits that process a very small number of skypoints. A typical FGRP5 CPU WU contains, for example, 63 skypoints. A checkpoint can only be written after a skypoint has been fully processed, regardless of the setting in the host configuration ("checkpoint at most every ... 60 sec"). Certain workunits contain ONLY SIX (6) skypoints. They can be identified by a small number (< 100) following the prefix "LATeah" and the ID of the raw data file (e.g. "1090F"), for example "88.0" in "LATeah1090F_88.0_4314_-3.2e-11_0". These WUs checkpoint very rarely while their progress is between 0% and 90%: exactly six times, at 15%, 30%, 45%, ..., 90%. After 90% (calculation of the final toplists), 10 more checkpoints are written, one after each of the 10 candidates has been processed (see the progress output in the stderr.txt file in the WU's slot directory).

Thanks for that explanation. I ran into this question just this last week. I thought something had gone awry because I had changed BOINC Manager (on Linux Mint) from the version you get through Synaptic Package Manager to the one you get from the Software Manager. The one from the Software Manager is 7.22.0, which is not the version that comes from Synaptic Package Manager.

The one I got from SPM, 7.18.1, never received any update that I saw, for maybe a year. But the one from the Software Manager has already had a new version since I installed it. Then I noticed that the time between checkpoints was very, very long and mistakenly thought something was wrong. I had just never paid any attention before.

Processing work units with "outdated" (according to Microsoft) Ryzen 7 1700

mikey
Joined: 22 Jan 05
Posts: 12639
Credit: 1839023411
RAC: 5631

hyphens wrote:
Seems wasteful. Can we get more frequent checkpoints? ("Request tasks to checkpoint at most every" is at its default of 60 s. And yes, it's an older machine, 6 years or so.)

The 60 seconds is an 'at most' setting, and most people change it to something like 900 seconds (15 minutes). But the task itself decides how often it checkpoints, and that is determined by the programmers who write the application. In short, your question is a valid one, but no one can help you except an admin, and even then it would need to be sent up the chain to get into the next batch of tasks.

Scrooge McDuck
Joined: 2 May 07
Posts: 1035
Credit: 17687506
RAC: 12873

I reopened this unanswered, one-year-old thread for my 'wish'. I later stumbled across a nine-year-old post on this topic by project admin Bernd Machenschalk, in this old thread from 2014:

https://einsteinathome.org/de/content/more-checkpoints#comment-119513

Bernd Machenschalk wrote:

In general we design our Apps to (potentially) checkpoint as often as possible / feasible, i.e. after each reasonably independent computation.

Feasibility limits here include the programming effort (parameters in data structures modified in nested loops saved and restored) and the data volume (storage space and time to write) of the necessary checkpoints. It doesn't make much sense to checkpoint every minute when writing the checkpoint takes several seconds (multiplied by the number of instances that may be running and checkpointing at once) and thus noticeably slows down computation, or if initializing the application picking up from a checkpoint takes several minutes alone.
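A back-of-the-envelope version of that trade-off, as a Python sketch. The figures in the example call are hypothetical (5 seconds per checkpoint write, 4 instances), and the model assumes that concurrent checkpoint writes queue up behind each other on the disk, which is a simplification rather than a statement about how the Einstein@Home apps behave.

```python
def checkpoint_overhead_fraction(write_seconds, interval_seconds, instances=1):
    """Share of each checkpoint interval spent writing checkpoints,
    assuming the writes of all running instances are serialised."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return min(1.0, write_seconds * instances / interval_seconds)


# Hypothetical numbers: 5 s per write, 4 instances running at once.
print(f"{checkpoint_overhead_fraction(5, 60, 4):.1%}")   # every minute
print(f"{checkpoint_overhead_fraction(5, 900, 4):.1%}")  # every 15 minutes
```

Under these assumptions, checkpointing every minute would cost about a third of the run time, while a 15-minute interval costs only a couple of percent.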

Mr Anderson
Joined: 28 Oct 17
Posts: 39
Credit: 149238722
RAC: 38277

In the case of shutting the computer down, the wasted time could be avoided if it were possible to perform the shutdown only after all tasks have checkpointed (or completed). If multiple tasks are running, each would suspend after its next checkpoint and no others would start running. In other words, instead of shutting the computer down in the normal way, the user initiates the shutdown process in BOINC and then goes away. Later, when all the tasks have checkpointed and suspended themselves (or been suspended), the PC shuts down.

I realise this might be more of a BOINC thing, or perhaps a combination of BOINC and Einstein, but the people at Einstein surely have more influence over BOINC than ordinary users such as myself and could get such a change organised.
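As a rough sketch of the idea in Python: the model below is purely hypothetical and does not use any real BOINC interface. It assumes the client somehow knows how long each running task needs until its next checkpoint, suspends each task at that moment, and shuts down once the last one has been suspended.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    name: str                          # hypothetical task label
    minutes_to_next_checkpoint: float  # assumed to be known to the client


def plan_drain_shutdown(tasks):
    """Return (suspend schedule, minutes until the PC can shut down).

    Each task is suspended right after its next checkpoint; the shutdown
    waits for the slowest task, so no computed work is thrown away.
    """
    ordered = sorted(tasks, key=lambda t: t.minutes_to_next_checkpoint)
    schedule = [(t.name, t.minutes_to_next_checkpoint) for t in ordered]
    wait = schedule[-1][1] if schedule else 0.0
    return schedule, wait


schedule, wait = plan_drain_shutdown([
    RunningTask("A", 40.0),
    RunningTask("B", 5.0),
    RunningTask("C", 75.0),
])
print(schedule, wait)
```

The price is that cores go idle while waiting for the slowest task: in this example task B's core sits unused for 70 minutes before the shutdown.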

mikey
Joined: 22 Jan 05
Posts: 12639
Credit: 1839023411
RAC: 5631

Mr Anderson wrote:

In the case of shutting the computer down, the wasted time could be avoided if it were possible to perform the shutdown only after all tasks have checkpointed (or completed). If multiple tasks are running, each would suspend after its next checkpoint and no others would start running. In other words, instead of shutting the computer down in the normal way, the user initiates the shutdown process in BOINC and then goes away. Later, when all the tasks have checkpointed and suspended themselves (or been suspended), the PC shuts down.

I realise this might be more of a BOINC thing, or perhaps a combination of BOINC and Einstein, but the people at Einstein surely have more influence over BOINC than ordinary users such as myself and could get such a change organised.

That last part could be addressed by the BOINC software group, but given the WIDE variety of OSes people use, I'm not sure it would be as easy as adding a few lines of code.

What I would prefer is a system where the tasks checkpoint at certain points along the way, as they do now, with the added ability for me to click to create a checkpoint, so I can then shut down the PC and pick back up from that point when BOINC is restarted. If BOINC wanted to, they could even pop up a box saying 'this task was checkpointed X minutes ago, do you want to make another checkpoint now?' with a simple yes or no click before moving on to the next task. They could then have an option to say 'do this for all remaining running tasks', so a user running 20 or 60 tasks wouldn't have to spend time clicking yes or no for each one.

Sometimes you just have to reboot the PC, and losing hours of work is not helpful to the Project or our stats.

Mr Anderson
Joined: 28 Oct 17
Posts: 39
Credit: 149238722
RAC: 38277

Although there is a wide variety of OSes, such a feature could be implemented on the one or two most often used. Then, if it proves to be popular, it could be implemented on others, if possible. I'm not a fan of the "if we can't do it for all systems then we won't do it at all" style of thinking, since then you're limited to a bland set of features common to all systems and cannot build on any of the strengths of individual ones.

Regardless of how this could work (and I'd like to extend my suggestion to include rebooting in addition to shutting down), this is a long-standing problem that really needs addressing. When I first got involved, I was under the impression that the computing would just run in the background whenever you use your PC and do useful work regardless of how long the PC is running. Eventually I gave up running Einstein on my home PC, since I sometimes only use it for brief periods and I found that the task progress kept reverting to where it had been previously, meaning that work (and energy) was being wasted. I wonder how many others have had the same experience, resulting in the loss of that computing capacity to Einstein and other projects. Not everyone wants to or has the means to set up an array of computers to run 24/7; some just want to contribute what they can.

mikey
Joined: 22 Jan 05
Posts: 12639
Credit: 1839023411
RAC: 5631

Mr Anderson wrote:

Although there is a wide variety of OSes, such a feature could be implemented on the one or two most often used. Then, if it proves to be popular, it could be implemented on others, if possible. I'm not a fan of the "if we can't do it for all systems then we won't do it at all" style of thinking, since then you're limited to a bland set of features common to all systems and cannot build on any of the strengths of individual ones.

Regardless of how this could work (and I'd like to extend my suggestion to include rebooting in addition to shutting down), this is a long-standing problem that really needs addressing. When I first got involved, I was under the impression that the computing would just run in the background whenever you use your PC and do useful work regardless of how long the PC is running. Eventually I gave up running Einstein on my home PC, since I sometimes only use it for brief periods and I found that the task progress kept reverting to where it had been previously, meaning that work (and energy) was being wasted. I wonder how many others have had the same experience, resulting in the loss of that computing capacity to Einstein and other projects. Not everyone wants to or has the means to set up an array of computers to run 24/7; some just want to contribute what they can.

You might consider sending your ideas to Richard Haselgrove (I think that's the right way to spell it), as he's one of the main BOINC software people dealing with the team of BOINC software writers.

HAL
Joined: 9 Mar 20
Posts: 2038
Credit: 40453712
RAC: 39819

After learning of this issue, I was going to simply put this computer to sleep until the next day. The second one I use for this project runs 24/7 unless I have to reboot Linux Mint for a required update. Even then, I now look at the % completion and only reboot when it means the least loss of work time.

The one I'm on now (also Linux Mint) I only use for about five hours in the evening, so this issue is causing wasted computing time when I have to shut down for the night. So I thought I'd just put it to sleep and wake it the next evening ... so sorry, no worky. It's a known issue with some systems. It looks like it goes to sleep, but it doesn't wake up. It goes into brain-lock, so you have to power it down and back on.

Otherwise, it would have been a solution ... :-(

mikey
Joined: 22 Jan 05
Posts: 12639
Credit: 1839023411
RAC: 5631

HAL wrote:

After learning of this issue, I was going to simply put this computer to sleep until the next day. The second one I use for this project runs 24/7 unless I have to reboot Linux Mint for a required update. Even then, I now look at the % completion and only reboot when it means the least loss of work time.

The one I'm on now (also Linux Mint) I only use for about five hours in the evening, so this issue is causing wasted computing time when I have to shut down for the night. So I thought I'd just put it to sleep and wake it the next evening ... so sorry, no worky. It's a known issue with some systems. It looks like it goes to sleep, but it doesn't wake up. It goes into brain-lock, so you have to power it down and back on.

Otherwise, it would have been a solution ... :-(

I use Linux Mint as well and have turned the screensaver and sleep settings off. That way it runs until I tell it not to, which for me is almost never. I also turned off the updates because mine are BOINC-only machines, so updates don't matter: I don't browse the net or do mail, documents, etc., just BOINC crunching 24/7. Obviously, if you do things besides crunching with your machines, you need the updates so others can't do bad things to them.
