My Linux box forgets half-done work after being turned off

Tetsuji Maverick Rai
Joined: 11 Apr 05
Posts: 22
Credit: 3658667
RAC: 0
Topic 211919

Hi all,

My CentOS Linux box forgets workunits that are half-done by the CPU; they start from the beginning after a reboot.  I think there's an option to store progress every 60 seconds (or some specified interval), but I forget where it is.

Can anyone help me?

Thanks in advance!

-Tetsuji

Tetsuji Maverick Rai
Joined: 11 Apr 05
Posts: 22
Credit: 3658667
RAC: 0

Addition: I checked the BOINC directory and found that in the slots directories, init_data.xml and boinc_task_state.xml aren't rewritten every 60 seconds.  Maybe that's why the crunchers forget the time already spent.

mikey
Joined: 22 Jan 05
Posts: 3896
Credit: 364244002
RAC: 183707

Tetsuji Maverick Rai wrote:

Hi all,

My CentOS Linux box forgets workunits that are half-done by the CPU; they start from the beginning after a reboot.  I think there's an option to store progress every 60 seconds (or some specified interval), but I forget where it is.

Can anyone help me?

Thanks in advance!

-Tetsuji

Not all workunits actually set checkpoints; it's the Einstein programmers' choice whether to use them, and I don't think they do.

Tetsuji Maverick Rai
Joined: 11 Apr 05
Posts: 22
Credit: 3658667
RAC: 0

But on Windows, checkpointing works.  Only on Linux (only CentOS?) does it fail.  I suspected it was because I had added only Einstein@Home, which has no checkpoint settings on its homepage, so I added another project that does have checkpoint settings, crunched some tasks, and then detached from that project.

In /var/lib/boinc/global_prefs.xml, there is a <disk_interval> element, and it is set to 60.  I suspect this is the checkpoint interval in seconds.

 EDIT: bingo.  /var/lib/boinc/slots/0/boinc_mmap_file seems to keep the status, updated every 60 seconds.
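For anyone who wants to check that setting from a shell, the value can be pulled out of the preferences file with sed.  This is just a sketch: the sample below is a minimal illustrative fragment, and on a real CentOS host you would point sed at /var/lib/boinc/global_prefs.xml instead (the path can differ by distro).

```shell
# Extract <disk_interval> from a global_prefs.xml-style fragment.
# The sample file here is illustrative, not a full real prefs file.
cat > /tmp/global_prefs_sample.xml <<'EOF'
<global_preferences>
   <run_on_batteries>0</run_on_batteries>
   <disk_interval>60</disk_interval>
</global_preferences>
EOF

# Pull out the number between the tags (':' as the sed delimiter so the
# '/' in the closing tag needs no escaping).
interval=$(sed -n 's:.*<disk_interval>\([0-9.]*\)</disk_interval>.*:\1:p' /tmp/global_prefs_sample.xml)
echo "disk_interval: ${interval} seconds"
```

On a live system, swap the sample path for /var/lib/boinc/global_prefs.xml.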

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4340
Credit: 14976636886
RAC: 21802798

Tetsuji Maverick Rai wrote:
My CentOS Linux box forgets workunits that are half-done by the CPU; they start from the beginning after a reboot.  I think there's an option to store progress every 60 seconds (or some specified interval), but I forget where it is.

First up, all Einstein apps do set checkpoints.  You can see when these are set using BOINC Manager.  Just select a task that is crunching and click the 'properties' button.  It will show you the current CPU time and the CPU time for when the checkpoint for that task was last saved.
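If you prefer the command line to BOINC Manager, much the same numbers can be read from `boinccmd --get_tasks`.  Here is a rough sketch that compares the current CPU time with the last checkpoint's CPU time, using a mocked-up fragment of that output; the field names are from memory, so treat them as assumptions and check them against your own client's output.

```shell
# Mock-up of a few lines in the style of `boinccmd --get_tasks` output.
# On a live host you would pipe the real command into awk instead.
cat > /tmp/tasks_sample.txt <<'EOF'
   name: LATeah0010F_1234.0_0
   current CPU time: 1843.520000
   checkpoint CPU time: 1790.110000
EOF

# How long (in CPU seconds) since the task last checkpointed.
awk -F': ' '
  /current CPU time/    { cur  = $2 }
  /checkpoint CPU time/ { ckpt = $2 }
  END { printf "seconds since last checkpoint: %.0f\n", cur - ckpt }
' /tmp/tasks_sample.txt
```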

There is a computing preference for checkpoints (under Account -> Preferences -> Computing -> Advanced) which defaults to 60 sec.  The preference is called "Request tasks to checkpoint at most every:".  As I understand it, its purpose is to prevent tasks from checkpointing too frequently, not to set the checkpoint interval.  To save a checkpoint, crunching has to stop momentarily so that the complete state of the task can be written to disk.  You really wouldn't want this happening too frequently, as it might impact the overall crunch time.

FGRPB1G tasks do checkpoint about every 60 seconds but the CPU tasks (FGRP5) take somewhat longer.  Using the 'properties' button, I've observed CPU tasks checkpoint about every couple of minutes.  It depends on how fast the CPU is.  I've observed intervals between about 1.5 minutes and perhaps 4 minutes for a much slower CPU.  In my experience, the first checkpoint is written within a fairly short time after startup so if your tasks are "half done" without a checkpoint being created, there might be some sort of file creation problem.  Perhaps you could observe the creation of successive checkpoints using the 'properties' function?  It would be good to know if checkpoints are being reported there.  If checkpoints are being created, they will be used after a reboot and restart of BOINC so it's important to confirm that they are being written.  Checkpoint files have a .cpt extension.  Can you find a file with this extension in an active slot directory?
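That last check doesn't need any clicking around; a find one-liner does it.  A sketch, using a throwaway directory in place of the real data directory (/var/lib/boinc/slots on a stock CentOS install; your path may differ):

```shell
# Look for checkpoint (.cpt) files across slot directories.
# A throwaway tree stands in for the real /var/lib/boinc/slots here.
slots=$(mktemp -d)
mkdir -p "$slots/0" "$slots/1"
touch "$slots/0/checkpoint.cpt"    # pretend slot 0 has written a checkpoint
                                   # slot 1 has not

# On a live system: find /var/lib/boinc/slots -name '*.cpt' -ls
n=$(find "$slots" -name '*.cpt' | wc -l)
echo "checkpoint files found: $((n))"
```

Adding `-ls` (as in the comment) also shows the timestamps, so you can watch successive checkpoints land.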

Cheers,
Gary.

Tetsuji Maverick Rai
Joined: 11 Apr 05
Posts: 22
Credit: 3658667
RAC: 0

Thank you for the explanation, Gary.

Actually, checkpoint.cpt exists, but its timestamp was so old that I ignored it.  Maybe the real checkpoint interval is longer than 60 seconds.  I didn't watch for that long because I thought the checkpoint interval was 60 seconds.  It could be 4 minutes or longer; that seems very probable.

Now I have built my own cruncher for the continuous gravitational-wave search and am trying to crunch in standalone mode.  It created a checkpoint.cpt file some minutes after it began.  I want to improve this application and help LIGO; currently it's too slow, and I'd like an OpenCL version.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4340
Credit: 14976636886
RAC: 21802798

There seems to be something a bit odd with tasks for the latest data file - LATeah0010F.dat.  Compared with the previous data file (LATeah0009F.dat), tasks for the new one are taking a lot longer, perhaps two or three times as long.  It just occurred to me that you would probably have tasks for the new data, whereas I've been talking about what I observed for the previous one.  There has been variation in the past in how long tasks for particular data files take to crunch, but this one does seem to be an unusually large increase in crunch time.  No doubt there will be some reason for this behaviour - perhaps a lot more computation per task.

Out of interest, I just started observing a newly started task for the new file.  So far it has been running for over 30 mins and it is yet to write the initial checkpoint.  I rarely stop and start BOINC so the checkpoint interval isn't much of a concern for me and I don't usually keep track of it.  The interval is so large now that I can't imagine it was an intended change to make it this long.  As you found, this will really waste a lot of crunching if people stop and start crunching regularly or if crunching gets temporarily suspended without the task being kept in memory when crunching is suspended.

Hopefully Bernd will see this thread and perhaps let us know what is going on.  I'll send him a PM about it.  Thanks for reporting this.

 EDIT:  I've just had a further look at the situation.  The initial checkpoint for the above task was written at 55min 51sec CPU time.  I'll ask Bernd if this can be improved.

Cheers,
Gary.

solling2
Joined: 20 Nov 14
Posts: 79
Credit: 110263826
RAC: 200882

Gary Roberts wrote:
  No doubt there will be some reason for this behaviour - perhaps a lot more computation per task.

...which isn't reflected in the estimated GFLOPs in the task properties, since 105,000 GFLOPs are estimated for both ...0009F... and ...0010F...  The behaviour isn't restricted to Linux, one could add.  Thanks for flagging what I was wondering about as well.  However, apart from the huge run-time increase, things seem to behave normally, i.e. valid results. :-)

mikey
Joined: 22 Jan 05
Posts: 3896
Credit: 364244002
RAC: 183707

Gary Roberts wrote:

There seems to be something a bit odd with tasks for the latest data file - LATeah0010F.dat.  Compared with the previous data file (LATeah0009F.dat), tasks for the new one are taking a lot longer, perhaps two or three times as long.  It just occurred to me that you would probably have tasks for the new data, whereas I've been talking about what I observed for the previous one.  There has been variation in the past in how long tasks for particular data files take to crunch, but this one does seem to be an unusually large increase in crunch time.  No doubt there will be some reason for this behaviour - perhaps a lot more computation per task.

Out of interest, I just started observing a newly started task for the new file.  So far it has been running for over 30 mins and it is yet to write the initial checkpoint.  I rarely stop and start BOINC so the checkpoint interval isn't much of a concern for me and I don't usually keep track of it.  The interval is so large now that I can't imagine it was an intended change to make it this long.  As you found, this will really waste a lot of crunching if people stop and start crunching regularly or if crunching gets temporarily suspended without the task being kept in memory when crunching is suspended.

Hopefully Bernd will see this thread and perhaps let us know what is going on.  I'll send him a PM about it.  Thanks for reporting this.

 EDIT:  I've just had a further look at the situation.  The initial checkpoint for the above task was written at 55min 51sec CPU time.  I'll ask Bernd if this can be improved.

Which is why I thought they weren't being checkpointed at all.  Sorry for not waiting long enough; the PC I was watching finishes workunits in about 63 minutes, and I just didn't wait long enough.

Tetsuji Maverick Rai
Joined: 11 Apr 05
Posts: 22
Credit: 3658667
RAC: 0

Hi Gary,

I've got proof that the continuous GW detection app forgot its progress.  See this stderr:

https://einsteinathome.org/task/707710369

It was stopped a few times, but when it restarted, it always started with "0.% --- CG:1237602 FG:206109 f1dotmin_fg:-8.6262976525e-09 df1dot_fg:1.082695e-13 f2dotmin_fg:-2.662708636364e-19".  Does that mean it started from 0%?

I don't have the checkpoint.cpt file any more at this point, but it must have been there.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 4340
Credit: 14976636886
RAC: 21802798

Up to this point, I've been talking about the checkpointing behaviour of FGRP style tasks (FGRP5 and FGRPB1G) and not GW tasks.  I haven't had any response from Bernd but I'm pretty sure the checkpointing behaviour of FGRP5 tasks has been corrected.  I've just watched a recently downloaded task on an old machine write the first checkpoint at around 7 mins and the second one at 14 mins.  On that machine CPU tasks take around 11 hours.  The CPU is a quad core Q8400 from around 2010.  It was a machine that I upgraded recently with an RX 460 GPU.

I think the checkpoint interval must be settable at the time of task generation and needed to be tweaked for the new data file in use currently.  My guess is that tasks are designed to checkpoint at certain % completed stages, or something like that.  With longer running tasks, smaller % completed intervals would be needed.

Tetsuji Maverick Rai wrote:

I've got proof that the continuous GW detection app forgot its progress.  See this stderr:

https://einsteinathome.org/task/707710369

It was stopped a few times, but when it restarted, it always started with "0.% --- CG:1237602 FG:206109 f1dotmin_fg:-8.6262976525e-09 df1dot_fg:1.082695e-13 f2dotmin_fg:-2.662708636364e-19".  Does that mean it started from 0%?

I don't have the checkpoint.cpt file any more at this point, but it must have been there.

I know very little about the checkpointing behaviour of GW tasks.  I seem to recall that when you see a row of 'dots' followed by a 'c', the dots represent certain 'cycles' in the calculations and the 'c' represents a cycle where a checkpoint was actually written at the end of that particular cycle.  If you look at the output you linked to, you can see the task was started around 2017-12-15 10:20 and then, around 10 minutes later, the process was sent a sigterm signal:

2017-12-15 10:30:12.738662
-- signal handler called: signal 15

However, look at the dots before the sigterm was issued.  There was a checkpoint and it was almost to the point of writing a second one when it was so rudely interrupted by a sigterm :-).  I'm not a programmer and stacktraces are way beyond my pay grade so I'm a bit bemused as to why you would want to keep on hitting the poor dear thing with sigterms all the time :-).
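Those 'c' markers can even be counted mechanically.  A sketch, run against a made-up stderr fragment in the same style as the linked output (not the real log, so the counts below are only for the sample):

```shell
# Count the checkpoint markers ('c' at the end of a row of dots) in a
# GW task's stderr.  The sample is an illustrative fragment only.
cat > /tmp/stderr_sample.txt <<'EOF'
% --- CG:1237602 FG:206109
.........c
.........c
......
-- signal handler called: signal 15
EOF

# Lines ending in 'c' are cycles that finished with a checkpoint write.
checkpoints=$(grep -c 'c$' /tmp/stderr_sample.txt)
echo "checkpoints written: $checkpoints"
```

Against a real stderr dump you would point grep at the saved log instead of the sample file.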

When you eventually let it continue, you can see the normal output of rows of dots with a checkpoint at the end of each one with the task eventually finishing without problems.

So I really don't know how you could interpret the output as proving that the app forgets its progress.  If there was any problem with checkpointing in the GW app, I'm sure someone would have complained long before now :-).

Cheers,
Gary.
