Stalled task at 89.989 % and another at 89.980 %

David Weiner
Joined: 27 Apr 18
Posts: 3
Credit: 92264
RAC: 0
Topic 215501

Q: Should I abort this task or let it be?

I aborted my 1st stalled task at 89.989 % - after the deadline had passed.

Now the same or a similar task has stalled in an eerily similar way:

- the elapsed time resets at 06:08:20 to 06:06:48

- progress is stuck at 89.980 % (The previous stalled task time also reset.)

Thanks and Good Hunting!

 
Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah0026F_968.0_154050_0.0
State: Running
Received: Tuesday, June 19, 2018 at 10:34:19 AM
Report deadline: Tuesday, July 03, 2018 at 10:34:17 AM
Estimated computation size: 105,000 GFLOPs
CPU time: 04:37:33
CPU time since checkpoint: 00:00:00
Elapsed time: 06:07:41
Estimated time remaining: 01:11:13
Fraction done: 89.980%
Virtual memory size: 4.09 GB
Working set size: 1.45 MB
Directory: slots/2
Process ID: 1822
Progress rate: 7.560% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

 

Zalster
Joined: 26 Nov 13
Posts: 3117
Credit: 4050672230
RAC: 0

How long did you wait before aborting the work unit? OpenCL tasks are known to hold at 89.9% while the GPU and CPU communicate back and forth. I believe this is when the double-precision work is done. It usually takes about 6-7 minutes, depending on your motherboard and RAM speeds. Let it run for a while. If it goes beyond an hour, then I would abort it. Also, since your computers are hidden, we can't check any other tasks.

mikey
Joined: 22 Jan 05
Posts: 11889
Credit: 1828122138
RAC: 206506

You can also try pausing or suspending BOINC and then, after a slow count of 10, resuming it. That can restart tasks that are 'stuck' too.

David Weiner
Joined: 27 Apr 18
Posts: 3
Credit: 92264
RAC: 0

Hi Zalster,

I left it in the queue until about 2 weeks after the deadline, then aborted it.

However, the current stalled task is still hanging where it was. Here it is today:

 
Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah0026F_968.0_154050_0.0
State: Running
Received: Tuesday, June 19, 2018 at 10:34:19 AM
Report deadline: Tuesday, July 03, 2018 at 10:34:17 AM
Estimated computation size: 105,000 GFLOPs
CPU time: 04:37:33
CPU time since checkpoint: ---
Elapsed time: 06:08:11
Estimated time remaining: 01:02:50
Fraction done: 89.980%
Virtual memory size: 4.08 GB
Working set size: 1.45 MB
Directory: slots/2
Process ID: 2620
Progress rate: 14.760% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

David Weiner
Joined: 27 Apr 18
Posts: 3
Credit: 92264
RAC: 0

Hi Mikey,

I do suspend BOINC 1x - 2x per day, when I'm doing graphics-intensive work.

And for this stalled task, I've rebooted 2x in the past 2 days...

My plan is to let it stay in the queue until past the deadline again, as it doesn't appear to be draining any resources!

I'm guessing now that the data set is corrupted and was sent again after I aborted, but I don't know... I wish I had copied the info on the 1st instance!

 

Alan Ettridge
Joined: 5 Sep 07
Posts: 3
Credit: 91152
RAC: 0

Hi,

 

My task has stalled at the 89.979% point, going through a loop of 1D 1H 29m 00s +/- about a minute. This has been going on for 24 hours now. I have suspended and then restarted it, but there is no progress. Any suggestions apart from abort??

 

Alan E

Alan Ettridge
Joined: 5 Sep 07
Posts: 3
Credit: 91152
RAC: 0

Hi,

 

Further to my last; this is the data:
Application: Gamma-ray pulsar search #5 1.08 (FGRPSSE)
Name: LATeah0034F_1496.0_322557_0.0
State: Waiting to run
Received: Sunday, 16 September 2018 20:53:53
Report deadline: Sunday, 30 September 2018 20:53:13
Estimated computation size: 105,000 GFLOPs
CPU time: 12:39:58
CPU time since checkpoint: ---
Elapsed time: 1d 01:28:11
Estimated time remaining: 02:46:00
Fraction done: 89.980%
Virtual memory size: 2.33 GB
Working set size: 968.02 KB
Directory: slots/1
Progress rate: 3.600% per hour
Executable: hsgamma_FGRP5_1.08_x86_64-apple-darwin__FGRPSSE

 

Alan E

Alan Ettridge
Joined: 5 Sep 07
Posts: 3
Credit: 91152
RAC: 0

Hi,

Final bit of data to the previous two posts:

The loop is 1D 01:28:11 then counts to 1D 01:29:25 before resetting to the earlier time again.

 

Alan E

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109394476721
RAC: 35818935

Alan Ettridge wrote:
...  Any suggestions apart from abort??

Before aborting, are you really sure it won't finish normally?

I had a quick look at your list of computers and on the one I found, you seem to have just one task - a FGRP5 CPU task - that is listed as 'in progress'.  I was hoping to see any previous tasks so as to get an idea about the performance of your machine.  Unfortunately, there aren't any, so I'll take a bit of a stab at how long that task might take.

I wouldn't be surprised if it took 24hrs or more to crunch.  The 'stall' at 89.979% is completely normal.  It is caused by the fact that crunching takes place in two separate stages.  The first one ends at 89.979% (for CPU tasks) and unfortunately, there is no further progress indication until the 2nd stage is complete.  At that point, the indication will immediately jump to 100% and the result will be uploaded.  It wouldn't surprise me to find that the task might sit at 89.979% for an hour or more, given the type of CPU you have.

You can get more details about what is happening if you browse what I wrote in this pinned thread.  Whilst the comments there were in reference to the behaviour of GPU tasks, you see exactly the same sort of thing in slow motion for CPU tasks.  I've explained this behaviour quite a few times previously so if you do a search for "follow-up AND stage" you'll probably be able to find some of them.

I guess the key question is how long has it been 'stuck' at 89.979%?  If you stop and restart a task (or even suspend it without it staying in memory while suspended) you will lose some of the crunch time already accumulated because it will need to reload the last 'saved state' (known as a checkpoint) from disk.  Some tasks 'checkpoint' relatively frequently, whilst others have a much longer interval between checkpoints.

The worst I've seen is just 8 checkpoints for the entire first stage of crunching.  For a task taking 24 hrs to crunch, this could be around 3 hrs between checkpoints.  I'm not saying this applies for the current task.  I'm just warning that stopping and restarting a task will likely cause some loss of progress.

If you use BOINC Manager - Advanced view and highlight a running task, a 'properties' button will become available to use.  By clicking that, you will be able to see the current CPU time used and the CPU time when the current checkpoint was written.

Checkpoints are written for both stages of crunching, even though you don't see progress during the follow-up stage.  I'm guessing they may well be more frequent during the follow-up stage because I suspect (I haven't tried to confirm it) that a checkpoint will be written after each of the 10 top candidate signals is re-evaluated in double precision.  So even if the follow-up stage took a total of 2 hrs, you may well see a new checkpoint around every 12 mins.
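
Both rough figures above (about 3 hrs between checkpoints for 8 checkpoints over 24 hrs, and about 12 mins for 10 candidates over 2 hrs) follow from assuming that checkpoints are evenly spaced. A minimal Python sketch of that arithmetic, where the function name is made up for illustration and even spacing is only an assumption:

```python
def checkpoint_interval_minutes(stage_hours: float, checkpoints: int) -> float:
    """Average minutes between checkpoints, ASSUMING they are evenly spaced."""
    if checkpoints <= 0:
        raise ValueError("checkpoints must be a positive number")
    return stage_hours * 60 / checkpoints


# Worst case seen for the first stage: 8 checkpoints over a 24 hr task.
first_stage = checkpoint_interval_minutes(24, 8)

# Guess for the follow-up stage: one checkpoint per candidate, 10 in 2 hrs.
follow_up = checkpoint_interval_minutes(2, 10)

print(f"first stage: about {first_stage:.0f} min between checkpoints")
print(f"follow-up stage: about {follow_up:.0f} min between checkpoints")
```

The interval is also the most crunch time that a stop and restart could throw away, since the task resumes from the last checkpoint written.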

You should use this technique (checking properties to see if checkpoints are being written) to help you assess whether or not your task is really 'stuck' or if it is making progress slowly.
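
As a rough illustration of that check, here is a small Python sketch that compares two readings of the properties window taken some time apart. It is not a BOINC API: the function names are hypothetical, and the time strings are assumed to be copied by hand in the form the Manager displays them (`HH:MM:SS`, `1d 01:28:11`, or `---` when no value is shown).

```python
import re
from typing import Optional


def parse_duration(text: str) -> Optional[int]:
    """Convert 'HH:MM:SS' or 'Nd HH:MM:SS' to seconds; anything else gives None."""
    match = re.fullmatch(r"(?:(\d+)d\s+)?(\d+):(\d{2}):(\d{2})", text.strip())
    if match is None:
        return None  # e.g. '---' when the Manager shows no value
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def verdict(before: tuple, after: tuple) -> str:
    """Each argument is (CPU time, CPU time since checkpoint) as displayed."""
    cpu_before, cpu_after = parse_duration(before[0]), parse_duration(after[0])
    if cpu_before is None or cpu_after is None:
        raise ValueError("CPU time must be readable in both readings")
    if cpu_after <= cpu_before:
        return "no CPU time gained"
    since_before, since_after = parse_duration(before[1]), parse_duration(after[1])
    if since_before is None or since_after is None:
        return "crunching, checkpoint status unknown"
    # CPU time at which the latest checkpoint was written, for each reading.
    if cpu_after - since_after > cpu_before - since_before:
        return "new checkpoint written"
    return "crunching, no new checkpoint yet"


# The two readings posted earlier in this thread for the same task:
print(verdict(("04:37:33", "00:00:00"), ("04:37:33", "---")))
# Made-up readings 20 minutes of CPU time apart:
print(verdict(("12:00:00", "00:05:00"), ("12:20:00", "00:03:00")))
```

A "no CPU time gained" result over a long wall-clock gap is the case worth worrying about. The other results mean the task is still crunching.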

 

Cheers,
Gary.

DAF
Joined: 11 Jan 16
Posts: 1
Credit: 16096347
RAC: 0

In the meantime, that's what's happening. gpu boot 9 times then it falls then rises. It turns out so necessary, well, okay.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5842
Credit: 109394476721
RAC: 35818935

I'm sorry but I don't know what you mean by "gpu boot 9 times".  Do you mean that you rebooted your machine 9 times when you thought a task was 'stuck' just short of 90% completed?

If you do, just leave it alone and it will complete when it's good and ready to do so :-).  If you keep rebooting your machine, you will always lose whatever crunching was done since the last checkpoint was written.

I looked at the tasks list for your computer.  Everything looks normal.

 

 

Cheers,
Gary.
