The following log items are all from a single stderr.txt file in a crunching slot with my notes and questions in bold:
The WU is just starting...
2006-07-18 11:40:41.4950 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-18 11:40:41.5107 [normal]: Started search at lalDebugLevel = 0
2006-07-18 11:40:46.9013 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-18 11:40:46.9169 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
What does the next line mean?
2006-07-18 12:28:33.3857 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.
Note: the WU finished at 08:53 and everything looked OK.
2006-07-19 08:53:17.9638 [normal]: Search finished successfully.
I restarted my computer at 10:11, and then...
2006-07-19 10:11:19.0625 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-19 10:11:19.1562 [normal]: Started search at lalDebugLevel = 0
2006-07-19 10:11:23.6875 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-19 10:11:23.6875 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
So why would the checkpoint file be lost, forcing this WU to start from the beginning? I've run into this situation a few times, and any explanation would be appreciated, since many hours of crunching have been wasted.
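For anyone who wants to check their own stderr.txt for the same pattern, here is a rough Python sketch. It assumes the log lines look like the ones quoted above; the function name `find_restarts_after_finish` is just made up for illustration and is not part of BOINC or the Einstein@Home application.

```python
import re

# Matches the timestamped lines quoted above, e.g.
# "2006-07-19 08:53:17.9638 [normal]: Search finished successfully."
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \[\w+\]: (.*)$")

def find_restarts_after_finish(lines):
    """Return (finish_time, restart_time) pairs where a finished search
    is followed by a fresh start that found no usable checkpoint."""
    hits = []
    finished_at = None  # timestamp of the most recent "Search finished" line
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip untimestamped lines such as "Detected CPU type 1"
        stamp, msg = m.groups()
        if msg.startswith("Search finished successfully"):
            finished_at = stamp
        elif msg.startswith("Found checkpoint-file"):
            finished_at = None  # an ordinary resume, not the problem case
        elif msg.startswith("No usable checkpoint found") and finished_at:
            hits.append((finished_at, stamp))
            finished_at = None
    return hits
```

Calling it as `find_restarts_after_finish(open("stderr.txt"))` inside the slot directory should list every place where the search was reported as finished and was then started again from the beginning.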
Welcome To Team China!
'Fstat.out.ckp' not found after a workunit has been finished
Each new start shows that message. It just means you are at the beginning of the result you are crunching. Since the new results can come in several pieces, the names are all the same except for the last little part.
Nothing is wrong; BOINC and Einstein are working correctly.
Hope this helps.
Sorry, it seems that I failed to make clear what I mean :(
My question is why this message sometimes appears for a workunit that has already finished.
// So I've changed the title of this thread...:)
Again, a WU can be made up of several parts (not all of them are). When the first part finishes, it tells you the search is done, and then the next part starts without a checkpoint. It looks like the same unit, but it's just a different part of the unit.
That is what I tried to explain.
Please confirm that ALL of the above is from a SINGLE stderr.txt file in the slots/N/ directory.
This is strange. It could indicate some problem in our code, so I'll have someone take a closer look at it. Does this happen repeatedly, or was this a one-time occurrence?
Bruce
Director, Einstein@Home
Hi Bruce,
For sure it's all from a single stderr.txt file in the BOINC/Slot/1/ directory (it's still there).
1. The name of this workunit is h1_1268.0_S5R1__2575_S5R1a_1. Normally this kind of workunit should finish in about 20 hours, but now it will take about 40 hours.
2. I also checked the stdoutdae.txt just now and found the following log items:
2006-07-19 08:53:19 [---] Suspending computation - user is active
2006-07-19 08:53:19 [Einstein@Home] Pausing result h1_1268.0_S5R1__2575_S5R1a_1 (left in memory)
That is almost exactly the time at which the workunit finished its search, and the BOINC client didn't know anything about it; normally there should be a log item such as "Computation for result h1_1268.0_S5R1__2575_S5R1a_1 finished".
IMHO, it looks as if the workunit finished its search and cleared the checkpoint file, and was then suspended by the client before it could tell the client what had been done.
3. It doesn't always happen; I have only encountered it a few times (fewer than five).
Hope this helps :)
@Pooh Bear, thanks for your help as well :)
Regards,
Yin Gang
Pretty strange. Which Core Client are you using?
As a first guess, I would suspect a problem with access rights on the directories under BOINC. Are you running BOINC as a service, or as a different user than the one who installed it?
BM
I'm using the 5.2.13 client as a system service on all my machines and have applied trux's 5.3.12 calibrating client to all of them (AFAIK it shouldn't do any harm, though S5 doesn't need it).
YG
This workunit has now finished successfully:
http://einsteinathome.org/task/36984717
YG
This problem just happened again. This time I'm using the official 5.4.11 BOINC manager for crunching.
stderr.txt
2006-09-22 15:29:07.5468 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.24_windows_intelx86.exe'.
2006-09-22 15:29:07.6093 [normal]: Started search at lalDebugLevel = 0
2006-09-22 15:29:10.9218 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-09-22 15:29:10.9218 [normal]: Trying to read Fstat-file into toplist ...
2006-09-22 15:29:16.5312 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-09-22 15:29:16.5312 [normal]: Resuming computation at (84656/108495836/2188027).
Detected CPU type 1
small x
small x
2006-09-22 17:28:27.7656 [normal]: Search finished successfully.
2006-09-22 17:33:15.0468 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.24_windows_intelx86.exe'.
2006-09-22 17:33:15.0781 [normal]: Started search at lalDebugLevel = 0
2006-09-22 17:33:18.2031 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-09-22 17:33:18.2187 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
stdoutdae.txt
2006-09-22 17:28:27 [---] Suspending computation - user is active
2006-09-22 17:28:27 [Einstein@Home] Pausing task h1_1057.0_S5R1__572_S5R1a_1 (removed from memory)
2006-09-22 17:28:27 [Einstein@Home] Pausing task h1_1057.0_S5R1__829_S5R1a_2 (removed from memory)
2006-09-22 17:28:27 [---] Suspending network activity - user is active
It looks like you went active at exactly the moment it finished a result. So the checkpoint file no longer needed to be written, but the result never had time to finish the writes it needed to report its completion before being removed from memory.
Were you watching the screen saver and started using the computer exactly when the result finished? If not, are you using the screen saver at all? The screen saver can be mistaken for user activity at times, and many people do not use it.
Also, you have the "remove from memory" option on. I think you'd have fewer problems if you allowed the application to stay in memory. That would keep the information in memory when the computer goes active and allow the application to finish its writes when the computer goes idle again.
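To check whether the client really suspended in the same moment the search finished, the two logs can be compared by timestamp. This is only a sketch that assumes the line formats quoted in this thread; the helper names `stamps` and `suspends_near_finish` and the two-second window are arbitrary choices for illustration, not anything provided by BOINC.

```python
from datetime import datetime, timedelta
import re

# Both logs start each timestamped line with "YYYY-MM-DD HH:MM:SS".
STAMP_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")

def stamps(lines, needle):
    """Timestamps (to the second) of the lines that contain `needle`."""
    out = []
    for line in lines:
        m = STAMP_RE.match(line)
        if m and needle in line:
            out.append(datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S"))
    return out

def suspends_near_finish(stderr_lines, client_lines, window_s=2):
    """Pair each 'Search finished' time from stderr.txt with any
    'Suspending computation' time from stdoutdae.txt within window_s seconds."""
    finishes = stamps(stderr_lines, "Search finished successfully")
    suspends = stamps(client_lines, "Suspending computation")
    window = timedelta(seconds=window_s)
    return [(f, s) for f in finishes for s in suspends if abs(f - s) <= window]
```

A non-empty result for a given stderr.txt and stdoutdae.txt would be consistent with the suspension landing right as the result finished, although it would not prove that this is the cause.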