Hello.
About 4 hours ago, 2 of these tasks began their run. Just a few minutes ago, I had to reboot, so I suspended my project in BOINC Manager and then rebooted. I do the suspension first whenever I have to reboot because otherwise my system misbehaves and refuses to start BOINC Manager after the reboot. This has always worked fine for all BOINC projects, including Einstein.
After reboot, when I started BOINC, I noticed I had lost all progress on these tasks. First time this has happened. Thoughts?
Copyright © 2024 Einstein@Home. All rights reserved.
Hmmm, I looked at my running FGRP5 tasks and I don't see any evidence of any checkpointing.
Are you sure they previously checkpointed?
You'll normally lose progress back to the most recent checkpoint. Checkpoint spacing depends on the project and the app, and perhaps to some extent on the work unit.
I normally monitor with the add-on application BoincTasks, which shows the most recent checkpoint. I don't know where to find that information otherwise, but someone else may come by and inform us both.
You have to look at the work unit's properties in the Manager, which show you the CPU time since the last checkpoint.
But to be absolutely certain, you should examine the slot directory that the task is running in and see whether any checkpoint file is present.
Whether a task checkpoints at all depends on whether the application supports that feature.
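For anyone who wants to do the slot check without hunting through a file browser, here is a minimal sketch that lists the files in a slot directory that were modified recently. The function name `recently_modified` is just something made up for this post, and the location of the slot directory (a `slots/N` folder under the BOINC data directory) varies by install, so treat the path in the usage comment as a placeholder. It does not know which file names the app uses for its checkpoints; it only shows what has been written lately.

```python
from pathlib import Path
import time


def recently_modified(slot_dir, max_age_seconds):
    """Return (file name, age in seconds) for files in slot_dir
    modified within the last max_age_seconds, newest first."""
    now = time.time()
    hits = []
    for entry in Path(slot_dir).iterdir():
        if entry.is_file():
            age = now - entry.stat().st_mtime
            if age <= max_age_seconds:
                hits.append((entry.name, age))
    # Smallest age first, i.e. most recently written file on top.
    return sorted(hits, key=lambda item: item[1])


# Usage (placeholder path, adjust to your own BOINC data directory):
# for name, age in recently_modified("/path/to/BOINC/slots/0", 3600):
#     print(f"{name}: modified {age / 60:.1f} minutes ago")
```

If nothing in the slot has been touched since the task started, that would be consistent with no checkpoint having been written yet.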
Thanks, all. I selected each one and it appears that neither of them has checkpointed. CPU time = time since last checkpoint = total run time.
Just yesterday afternoon I rebooted while it was in the final 89.9% stage of some earlier tasks (where it shows no progress until it suddenly completes) and it even saved the progress there, because it completed those tasks in the typical time after I rebooted. So it must have checkpointed yesterday, but it's not doing so today.
I'd say that your unusual need to suspend the project before rebooting is the issue and is disrupting the normal checkpointing mechanism.
I'm curious whether this behavior has previously been reported as an issue on the BOINC GitHub repository. Might want to check.
https://github.com/BOINC/boinc/issues
This closed issue comes closest to describing your problem. But it was closed with no action because discussed changes would be viewed unfavorably by the majority of Boinc users.
https://github.com/BOINC/boinc/issues/4748
Thanks. As you all have noted, the checkpoint request from BOINC is just a request. I set the value really high to see if it changes anything and I’ll try really low as well. Either way, if the request is ignored, that tells me nothing. But given that it checkpointed before, you seem to have nailed it that something screwed up on my machine.
PS - the enthusiasm on this thread was mildly amusing but I did appreciate the critique of the logic: https://github.com/BOINC/boinc/issues/5106
Update 1 - I suspended (did not reboot) and resumed. It saved progress although under task properties it still shows CPU time since last checkpoint is the same as total run time. Will monitor.
Update 2 - Yep, shutting down is the problem. Removed project. Will reinstall tomorrow and see if that fixes it.
Reinstalled Einstein. Still the same issue. Tried with a different project (Milkyway), no issue.
The Milkyway N-Body applications have many more opportunities where checkpointing can occur :-)
It appears that this Einstein dataset has 20 "skypoints" and as far as I know it can only checkpoint when it has completed processing of one of those... As your system seems to take about 11 hours to run one of these tasks (including the second pass over some of the data that happens around the 90% mark) that means there won't be an initial checkpoint for at least half an hour.
So it might be that whether you get a checkpoint or not depends on how long you let a task run before halting it! -- I noticed that one of your tasks did checkpoint and resume after skypoint 7, so it seems to work when it has a chance.
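To put rough numbers on that, here is a small back-of-the-envelope sketch. It assumes the run time is spread evenly over the skypoints, which is a simplification (it ignores the second pass around the 90% mark), and the function names are made up for this post.

```python
def minutes_per_skypoint(total_hours, skypoints):
    """Rough spacing between checkpoints, assuming equal time per skypoint."""
    return total_hours * 60 / skypoints


def skypoints_completed(run_minutes, total_hours, skypoints):
    """How many skypoints (and so possible checkpoints) fit into run_minutes."""
    spacing = minutes_per_skypoint(total_hours, skypoints)
    return min(skypoints, int(run_minutes // spacing))


spacing = minutes_per_skypoint(11, 20)
print(f"about {spacing:.0f} minutes per skypoint")  # about 33 minutes per skypoint
print(skypoints_completed(20, 11, 20))  # 0: stopped before the first checkpoint
```

So under that assumption, a task stopped within roughly the first half hour has nothing to resume from and starts over.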
Cheers - Al.
P.S. I'm sure one of the experts will be along to correct me about this :-)
Al, your summation is correct. The tasks do checkpoint when given the chance. Best to not abort any task prematurely.
Thank you both.