After getting my GPU working on a new iMac, I am now seeing the following error message in the stderr log for tasks:
Checkpoint file unavailable: status.cpt
and this is after a number of "Checkpoint committed!" messages. Here's the context (below).
The messages seem to indicate that when BOINC Manager suspends activity, it writes to a file so that it can start up again later. But somehow the file is not there. Oddly, the tasks do seem to resume once the scheduled activity period begins again.
Anyone know what the message means and how to create the missing file (status.cpt) if it is indeed needed?
Thanks.
[22:04:08][39840][INFO ] Checkpoint committed!
[22:05:09][39840][INFO ] Checkpoint committed!
[22:06:09][39840][INFO ] Checkpoint committed!
[22:07:10][39840][INFO ] Checkpoint committed!
[22:08:10][39840][INFO ] Checkpoint committed!
[22:09:11][39840][INFO ] Checkpoint committed!
[22:10:11][39840][INFO ] Checkpoint committed!
[22:11:12][39840][INFO ] Checkpoint committed!
[22:11:41][39840][INFO ] Statistics: count dirty SumSpec pages 675 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 329052
[22:11:41][39840][INFO ] Data processing finished successfully!
[22:11:41][39840][INFO ] Starting data processing...
[22:11:41][39840][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
------> Starting from scratch...
Copyright © 2024 Einstein@Home. All rights reserved.
This is because every GPU task actually consists of 8 subtasks. If you look a bit closer, the quoted log says "Data processing finished successfully!" and then it starts one of the other 7 subtasks, for which no checkpoint exists yet.
So everything is quite normal.
The reason for packing 8 tasks into a single one is that the GPUs are too quick and the load on the servers was getting out of control. After the change, things have settled down.
Thank you for the explanation.
So in this case the previously reported successful checkpoint commits are written to status.cpt, but in the case of GPU tasks, the system checks for the file when it does not need to?
I imagine this is a temporary file that gets deleted, since I was not able to find it on my system?
If I understand things correctly, each BRP4 task is actually 8 tasks bundled together into one, which means that as one "subtask" completes, the checkpoint file is deleted because it can't be used for the next "subtask". The program always checks for a checkpoint file, as it can't otherwise know whether it should start from the beginning of the calculation or is resuming after some kind of interruption. The program also checks whether there are result files from any of the 8 subtasks to know which one it should process next.
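To make that logic concrete, here is a rough Python sketch of how such a startup check could work. It is only an illustration of my understanding, not the actual Einstein@Home application code: the result file naming (`result_0.out` and so on) and the function names are made up, and only `status.cpt` and the count of 8 come from the log and the posts above.

```python
from pathlib import Path

N_SUBTASKS = 8                  # bundle size mentioned in this thread
CHECKPOINT_NAME = "status.cpt"  # file name taken from the stderr log


def result_path(slot_dir: Path, index: int) -> Path:
    # Hypothetical naming scheme; the real application's file names may differ.
    return slot_dir / f"result_{index}.out"


def next_step(slot_dir: Path):
    """Return (subtask_index, mode), or None when every subtask has a result.

    mode is "resume" if a checkpoint file is present, otherwise "scratch".
    """
    for index in range(N_SUBTASKS):
        if result_path(slot_dir, index).exists():
            continue  # this subtask already finished, look at the next one
        has_checkpoint = (slot_dir / CHECKPOINT_NAME).exists()
        return index, ("resume" if has_checkpoint else "scratch")
    return None


def finish_subtask(slot_dir: Path, index: int) -> None:
    # Record the result, then delete the checkpoint: it belongs to the
    # finished subtask and is of no use to the next one.
    result_path(slot_dir, index).write_text("done\n")
    (slot_dir / CHECKPOINT_NAME).unlink(missing_ok=True)
```

Under these assumptions, calling `next_step` right after `finish_subtask` always reports "scratch" for the following subtask, which matches the "Checkpoint file unavailable ... Starting from scratch" lines appearing directly after "Data processing finished successfully!".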
All these files reside in the BOINC data directory\slots\xx, where xx is some number. The location of the data directory is given in the startup messages in the event log. To find which of the numbered folders to look in, select a running task in the BOINC advanced view and then click the button labeled "Properties" on the left.
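If you would rather not click through each task, a small script can look through every slot folder at once. This is just a sketch: the `checkpoint_status` function is my own helper, not part of BOINC, and you have to pass in the data directory path shown in your own event log.

```python
from pathlib import Path


def checkpoint_status(data_dir, checkpoint_name="status.cpt"):
    """Map each slot folder name to True/False: is the checkpoint file there?"""
    slots = Path(data_dir) / "slots"
    status = {}
    if not slots.is_dir():
        return status  # wrong path, or no slots folder in this directory
    for slot in sorted(slots.iterdir()):
        if slot.is_dir():
            status[slot.name] = (slot / checkpoint_name).exists()
    return status
```

Call it as `checkpoint_status("<your BOINC data directory>")`. A slot showing `False` is not a problem in itself: the file will be missing between subtasks and until the first checkpoint of a new subtask is written.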