'Fstat.out.ckp' not found after a workunit has been finished

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0
Topic 191585

The following log items are all from a single stderr.txt file in a crunching slot with my notes and questions in bold:

The WU is just starting...

2006-07-18 11:40:41.4950 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-18 11:40:41.5107 [normal]: Started search at lalDebugLevel = 0
2006-07-18 11:40:46.9013 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-18 11:40:46.9169 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1

What does the next line mean?

2006-07-18 12:28:33.3857 [normal]: Fstat file reached MaxFileSizeKB ==> compactifying ... done.

Note: the WU finished at 08:53, and everything looked OK.

2006-07-19 08:53:17.9638 [normal]: Search finished successfully.

I restarted my computer at 10:11, then...

2006-07-19 10:11:19.0625 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-19 10:11:19.1562 [normal]: Started search at lalDebugLevel = 0
2006-07-19 10:11:23.6875 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-19 10:11:23.6875 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1

So... why was the checkpoint file lost, forcing this WU to start from the beginning? I've encountered this situation a few times, and any explanation would be appreciated, since many hours of crunching have been wasted.

Welcome To Team China!

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

Every fresh start logs that message. It just means the application is at the beginning of the result you are crunching. Since the new results can come in several pieces, the names are all the same except for the last little part.

Nothing is wrong; BOINC and Einstein are working correctly.

Hope this helps.

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0

Sorry, it seems I failed to make clear what I meant :(

My question is about this message sometimes appearing for a workunit that has already finished.

// So I've changed the title of this thread...:)

Welcome To Team China!

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

Again, a WU can be made up of several parts (not all of them are). When the first part finishes, it tells you the search is done, and then the next part starts without a checkpoint. It looks like the same unit, but it's just a different part of the unit.

That is what I tried to explain.

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

Quote:

The following log items are all from a single stderr.txt file in a crunching slot with my notes and questions in bold:

2006-07-19 08:53:17.9638 [normal]: Search finished successfully.

I restarted my computer at 10:11, then...

2006-07-19 10:11:19.0625 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.02_windows_intelx86.exe'.
2006-07-19 10:11:19.1562 [normal]: Started search at lalDebugLevel = 0
2006-07-19 10:11:23.6875 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-19 10:11:23.6875 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1

So... why was the checkpoint file lost, forcing this WU to start from the beginning? I've encountered this situation a few times, and any explanation would be appreciated, since many hours of crunching have been wasted.

Please confirm that ALL of the above is from a SINGLE stderr.txt file in the slots/N/ directory.

This is strange -- it could indicate some problem in our code -- I'll have someone take a closer look at it. Does this happen repeatedly or was this a 'one-time' occurrence?

Bruce

Director, Einstein@Home

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0

Hi Bruce,

For sure it's all from a single stderr.txt file in the BOINC/Slot/1/ directory (it's still there).

1. The name of this workunit is h1_1268.0_S5R1__2575_S5R1a_1. Normally this kind of workunit finishes in about 20 hours; now it will take about 40 hours.

2. I also checked the stdoutdae.txt just now and found the following log items:

2006-07-19 08:53:19 [---] Suspending computation - user is active
2006-07-19 08:53:19 [Einstein@Home] Pausing result h1_1268.0_S5R1__2575_S5R1a_1 (left in memory)

This is almost exactly the time at which the workunit finished its search, yet the BOINC client knew nothing about it: normally there would be a log item such as "Computation for result h1_1268.0_S5R1__2575_S5R1a_1 finished".

IMHO, it looks as if the workunit finished its search and cleared the checkpoint file, and was then suspended by the client before it could tell the client what it had done.

3. It doesn't always happen; I have only encountered it a few times (fewer than five).
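
The check described in point 2 can also be scripted. Below is a minimal Python sketch, assuming the stderr.txt line format shown in this thread. The function name `find_restarts_after_finish` is made up for illustration, and a hit is only meaningful if the file really belongs to a single task.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2006-07-19 08:53:17.9638 [normal]: Search finished successfully.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+ \[\w+\]: (.*)$"
)


def find_restarts_after_finish(stderr_text):
    """Return (finished_at, restarted_at) pairs where a finished search
    was later followed by a start that found no usable checkpoint."""
    hits = []
    finished_at = None
    for raw in stderr_text.splitlines():
        match = LINE_RE.match(raw.strip())
        if not match:
            continue  # lines without a timestamp, e.g. "Detected CPU type 1"
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message.startswith("Search finished successfully"):
            finished_at = stamp
        elif message.startswith("Found checkpoint-file"):
            finished_at = None  # a normal resume, nothing suspicious
        elif message.startswith("No usable checkpoint found"):
            if finished_at is not None:
                hits.append((finished_at, stamp))
                finished_at = None
    return hits


# Lines copied from the log earlier in this thread.
sample = """\
2006-07-18 11:40:46.9169 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1
2006-07-19 08:53:17.9638 [normal]: Search finished successfully.
2006-07-19 10:11:23.6875 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-07-19 10:11:23.6875 [normal]: No usable checkpoint found, starting from beginning.
"""

for finished, restarted in find_restarts_after_finish(sample):
    print(f"finished {finished}, restarted from scratch {restarted}")
```

The "finished" timestamp of each hit can then be compared by hand with the "Suspending computation" entries in stdoutdae.txt.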

Hope this helps :)

@Pooh Bear, Thanks for your help also:)

Regards,
Yin Gang

Welcome To Team China!

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244931831
RAC: 16380

Pretty strange. Which Core Client are you using?

My first guess would be a problem with access rights on the directories under BOINC. Are you running BOINC as a service, or as a different user from the one who installed it?

BM

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0

Message 42606 in response to message 42605

I'm using the 5.2.13 client as a system service on all my machines, and I have applied trux's 5.3.12 calibrating client to all of them (AFAIK it shouldn't do any harm, though S5 doesn't need it).

YG

Welcome To Team China!

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0

This workunit has now finished successfully:

http://einsteinathome.org/task/36984717

YG

Welcome To Team China!

Yin Gang
Joined: 23 Feb 05
Posts: 52
Credit: 120187750
RAC: 0

This problem just happened again. I'm using the official 5.4.11 BOINC Manager for crunching.

stderr.txt

2006-09-22 15:29:07.5468 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.24_windows_intelx86.exe'.
2006-09-22 15:29:07.6093 [normal]: Started search at lalDebugLevel = 0
2006-09-22 15:29:10.9218 [normal]: Found checkpoint-file 'Fstat.out.ckp'
2006-09-22 15:29:10.9218 [normal]: Trying to read Fstat-file into toplist ...
2006-09-22 15:29:16.5312 [normal]: Checksum Ok. Successfully read_toplist_from_fp()
2006-09-22 15:29:16.5312 [normal]: Resuming computation at (84656/108495836/2188027).
Detected CPU type 1
small x
small x
2006-09-22 17:28:27.7656 [normal]: Search finished successfully.
2006-09-22 17:33:15.0468 [normal]: Start of BOINC application 'projects/einstein.phys.uwm.edu/einstein_S5R1_4.24_windows_intelx86.exe'.
2006-09-22 17:33:15.0781 [normal]: Started search at lalDebugLevel = 0
2006-09-22 17:33:18.2031 [normal]: Checkpoint-file 'Fstat.out.ckp' not found.
2006-09-22 17:33:18.2187 [normal]: No usable checkpoint found, starting from beginning.
Detected CPU type 1

stdoutdae.txt

2006-09-22 17:28:27 [---] Suspending computation - user is active
2006-09-22 17:28:27 [Einstein@Home] Pausing task h1_1057.0_S5R1__572_S5R1a_1 (removed from memory)
2006-09-22 17:28:27 [Einstein@Home] Pausing task h1_1057.0_S5R1__829_S5R1a_2 (removed from memory)
2006-09-22 17:28:27 [---] Suspending network activity - user is active

Welcome To Team China!

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1376
Credit: 20312671
RAC: 0

It looks like you went active at exactly the time it finished a result. So no checkpoint needed to be written, but the application never had time to finish the write it needed to report completion before it was removed from memory.

Are you watching the screen saver and starting to use the computer exactly when the result finishes? If not, are you using the screen saver at all? The screen saver can be treated as activity at times. Many people do not use the screen saver.

Also, you have the remove-from-memory option on. I think you'd have fewer problems if you allowed tasks to stay in memory. This would keep the information in memory when the computer goes active, and allow the application to finish its writes when the computer goes idle again.
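
To make the suspected window concrete, here is a hypothetical Python sketch of one way an application could make "finished" survive being removed from memory. It is not the Einstein@Home or BOINC code. The marker file name `search.done` and both function names are invented for illustration; only the checkpoint name `Fstat.out.ckp` comes from the logs in this thread.

```python
import os
import tempfile

CHECKPOINT = "Fstat.out.ckp"   # name taken from the stderr.txt logs
DONE_MARKER = "search.done"    # hypothetical marker file, not from the real app


def finish_search(workdir):
    """Record completion on disk *before* removing the checkpoint, so that
    being killed between the two steps never loses the finished state."""
    marker = os.path.join(workdir, DONE_MARKER)
    tmp = marker + ".tmp"
    with open(tmp, "w") as fh:
        fh.write("done\n")
        fh.flush()
        os.fsync(fh.fileno())   # push the marker contents to disk
    os.replace(tmp, marker)     # atomic rename into place
    checkpoint = os.path.join(workdir, CHECKPOINT)
    if os.path.exists(checkpoint):
        os.remove(checkpoint)


def startup_action(workdir):
    """Decide what to do when the application is (re)started in a slot."""
    if os.path.exists(os.path.join(workdir, DONE_MARKER)):
        return "report_finished"       # work is done, only the report is left
    if os.path.exists(os.path.join(workdir, CHECKPOINT)):
        return "resume"
    return "start_from_beginning"      # the case seen in the logs above


with tempfile.TemporaryDirectory() as slot:
    open(os.path.join(slot, CHECKPOINT), "w").close()
    print(startup_action(slot))   # a checkpoint exists
    finish_search(slot)
    print(startup_action(slot))   # marker exists, checkpoint removed
```

With only the two states "checkpoint present" and "checkpoint absent", a finished but unreported task is indistinguishable from a task that has never run, which would match the restart from the beginning shown in the logs.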
