Task appears stuck at 100%

Wolfy Naughtious
Joined: 2 Nov 20
Posts: 2
Credit: 131670
RAC: 0

My problem is similar.  I run an old Dell on Win 10, dual processor, with a solid-state C: drive.

The process completes, then hangs and the time to complete starts to go up rapidly.  It has reached 181d in one case before I aborted it.  This happens on every job I have finished.

If I suspend the job and then resume it, the time to complete starts counting back down, but only at 1 second per second.  At that rate it would take 181 days to get back to 0.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117703592426
RAC: 35072329

Wolfy Naughtious wrote:
My problem is similar.  I run an old Dell on Win 10 ...

It's very difficult to guess what the problem might be when you give so little useful information.

Do you know if you are running the same type of task as the OP?  What tasks are you trying to run?  Are you using CPU cores or a GPU or perhaps both?  Are you running GW tasks or gamma-ray pulsar tasks or perhaps both types?  Are all tasks failing (you seem to imply that) or just some?  You have a positive RAC so something must be succeeding.  Unless you have exactly the same problem as the OP, it's best to start your own thread and give as much detail about your particular issue as you can.

Your computers are 'hidden' which means that people willing to help can't get any information about how your hardware is performing.  If you go to your account -> preferences -> privacy on the website, you can change settings there to allow others to see basic details of your hardware and the tasks you are crunching which removes the need for you to specify all those details in your problem report.  If you don't want to 'un-hide' your computers, at least post a link to the machine having the problem.

Do you know if any 'real' progress is being made on tasks that seem to go on forever?  In BOINC Manager (Advanced view), have you selected one of these tasks and examined its properties?  Tasks that are making some progress will be creating regular 'checkpoints' and the properties page will show information about that.  Until the very first checkpoint is written to disk, the BOINC client just uses 'simulated' progress based on the estimate that the task comes with.

The symptoms you mention sound like simulated progress only with no actual checkpoint ever being written.  Depending on the type of work you are crunching, checkpoints should be occurring fairly regularly - in the order of minutes to tens of minutes, rather than many hours or days.  If you haven't seen the first checkpoint being laid down after say an hour or so (but the progress is still gradually increasing) it's possible there is no true progress being made at all.  In that case maybe your hardware/software setup is not capable of tackling the type of work you have selected to run.  This is why you need to make those details available when you ask for help.

Cheers,
Gary.

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

This is happening more often now. Not quite daily but close. Many tasks run fine but the ones that get stuck will sit there stuck until I abort them.

Model Name: iMac
Model Identifier: iMac18,1
Processor Name: Dual-Core Intel Core i5
Processor Speed: 2.3 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Hyper-Threading Technology: Enabled
Memory: 16 GB
System Firmware Version: 429.140.8.0.0
SMC Version (system): 2.39f40

Computer 12823020
Domain name: mycroft.local
Local standard time: UTC -5 hours
Name: mycroft.local
Created: 8 Apr 2020 9:08:53 UTC
Total credit: 1,563,503
Average credit: 5,902.08
Cross project credit:
CPU type: GenuineIntel Intel(R) Core(TM) i5-7360U CPU @ 2.30GHz [x86 Family 6 Model 142 Stepping 9]
Number of processors: 4
Coprocessors: INTEL Intel(R) Iris(TM) Plus Graphics 640 (1536MB)
Operating system: Darwin 20.6.0
BOINC client version: 7.16.19
Memory: 16384 MiB
Cache: 0 KiB
Swap space: 168080.31 MiB
Total disk space: 233.47 GiB
Free disk space: 150.37 GiB
Measured floating point speed: 5085.62 million ops/sec
Measured integer speed: 16386.82 million ops/sec
Average upload rate: 42.64 KiB/sec
Average download rate: 8424.91 KiB/sec
Average turnaround time: 8.12 days

Application: Gamma-ray pulsar search #5 1.11 (FGRPSSE)
Name: LATeah1085F_1080.0_81618_0.0
State: Running
Received: Tuesday, October 19, 2021 at 04:03:31 AM
Report deadline: Tuesday, November 02, 2021 at 04:03:31 AM
Estimated computation size: 105,000 GFLOPs
CPU time: 02:08:21
CPU time since checkpoint: 00:00:00
Elapsed time: 02:24:24
Estimated time remaining: ---
Fraction done: 100.000%
Virtual memory size: 4.10 GB
Working set size: 1.39 MB
Directory: slots/1
Process ID: 28031
Progress rate: 41.400% per hour
Executable: hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE

Stephen Hawkins

73 49 111 01001001

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5872
Credit: 117703592426
RAC: 35072329

Stephen Hawkins wrote:
This is happening more often now. Not quite daily but close. Many tasks run fine but the ones that get stuck will sit there stuck until I abort them.

This is unlikely to be anything to do with the tasks themselves.  To work out what might be causing this, you need to look at the log that is returned to the project when you abort a task.  That might contain some clues.

Your tasks list shows just 4 aborted tasks - the latest two being returned on 26 Oct 2021 12:47:21 UTC.  If you click on the Task ID link for one of those, you can see information up to the point the task was aborted.  For a normal task, you see the whole log.  In both these cases, you only see the end part (and not the beginning) of the complete log because there is a size limit for what can be returned and it looks like the two tasks had been 'spinning their wheels' for long enough for the full output to exceed that limit.  Unfortunately, any clues as to what happened to cause the problem would have been in the stuff that was truncated.

If you select any in-progress task on the tasks tab of BOINC Manager, you can examine its properties.  For tasks proceeding normally, you would be able to see details about checkpoints and the time from when the last one was created.  A 'stuck' task will not be creating checkpoints.  If you find a stuck task this way, you could go into the relevant slots directory where that task is being crunched and capture a full copy of the log from there, while the task is running.  That file should be the full record, up to the point of capture.  The truncation only occurs after the task is aborted and the information is being prepared for uploading to the project.
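
Capturing the log from the slot directory while the task is still running could be sketched roughly as below. This is a minimal Python sketch, not a BOINC tool: the function name `snapshot_slot_log` is hypothetical, the log file is assumed to be `stderr.txt` (the name that appears in the output posted later in this thread), and the slot path has to be supplied by hand because the location of the BOINC data directory differs between platforms.

```python
import shutil
import time
from pathlib import Path


def snapshot_slot_log(slot_dir, dest_dir, log_name="stderr.txt"):
    """Copy a running task's log out of its slot directory.

    slot_dir -- path to the slot, e.g. <BOINC data dir>/slots/1 (assumed layout)
    dest_dir -- any directory outside the BOINC data directory
    log_name -- assumed to be stderr.txt; change it if the file is named differently
    """
    src = Path(slot_dir) / log_name
    if not src.is_file():
        raise FileNotFoundError(f"no {log_name} in {slot_dir}")

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # Timestamp the copy so repeated snapshots do not overwrite each other.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = dest / f"{src.stem}-{stamp}{src.suffix}"
    shutil.copy2(src, target)
    return target
```

Taking a snapshot before aborting the task means the copy is made before any truncation happens on upload.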

I've never had to do this myself so I'm just guessing and NOT talking from experience.  I'm just guessing that the missing part of a full log might contain some clue as to why the task became stuck.  You're the only person that can access the full log.

If you can't find the file that contains the full log, you might be able to abort a stuck task early enough so that the log hasn't grown so large with huge amounts of repetition.  If you examine the website logs I mentioned above, you will see what I mean by "repetition" :-).

I've never actually needed to examine logs as they are growing in a slot directory before so I don't know what the actual file name is.  It shouldn't be too hard for you to find that out :-).

Cheers,
Gary.

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

I think this is what you are looking for:

---------------------

19:06:21 (11663): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE'.
19:06:21 (11663): [debug]: 2.1e+15 fp, 5.2e+09 fp/s, 406674 s, 112h57m54s25
command line: hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1085F.dat --alpha 4.9667881722 --delta -0.9496737886 --skyRadius 0.00174881991 --ldiBins 15 --f0start 1416 --f0Band 16 --firstSkyPoint 1564772 --numSkyPoints 61 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.428289491e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mm
steve@mycroft 1 % tail -20 stderr.txt
19:22:03 (11845): [debug]: Set up communication with graphics process.
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
19:25:12 (11878): [normal]: This Einstein@home App was built at: Jul 26 2017 12:06:48

19:25:12 (11878): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE'.
19:25:12 (11878): [debug]: 2.1e+15 fp, 5.2e+09 fp/s, 406674 s, 112h57m54s25
command line: hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1085F.dat --alpha 4.9667881722 --delta -0.9496737886 --skyRadius 0.00174881991 --ldiBins 15 --f0start 1416 --f0Band 16 --firstSkyPoint 1564772 --numSkyPoints 61 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.428289491e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1085F_1432.0_1564772_0.0_0_0.out
output files: 'LATeah1085F_1432.0_1564772_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_0' 'LATeah1085F_1432.0_1564772_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_1'
19:25:12 (11878): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
19:25:12 (11878): [normal]: WARNING: Resultfile '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_1' present - doing nothing
19:25:12 (11878): [debug]: Set up communication with graphics process.
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory

--------------------------------------------
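
The repetition in the excerpt above is easier to read once consecutive duplicate lines are collapsed and the application restarts are counted. A rough Python sketch, assuming the log has been saved as plain text; `summarize_log` is a hypothetical helper, and the "Start of BOINC application" marker is simply taken from the lines shown above.

```python
from itertools import groupby


def summarize_log(text, start_marker="Start of BOINC application"):
    """Collapse runs of identical lines and count application starts.

    Blank lines are dropped first, so duplicates separated only by
    blank lines are treated as one run.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]

    collapsed = []
    for line, run in groupby(lines):
        count = sum(1 for _ in run)
        collapsed.append(line if count == 1 else f"{line}  [x{count}]")

    # Each start marker corresponds to one launch of the science app.
    starts = sum(start_marker in ln for ln in lines)
    return collapsed, starts
```

Something like `collapsed, starts = summarize_log(open("stderr.txt").read())` would then give a short view of the log plus the number of times the application was started.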

Stephen Hawkins

73 49 111 01001001

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

more boinc_task*

<active_task>
    <project_master_url>http://einstein.phys.uwm.edu/</project_master_url>
    <result_name>LATeah1085F_1432.0_1564772_0.0_0</result_name>
    <checkpoint_cpu_time>7338.350000</checkpoint_cpu_time>
    <checkpoint_elapsed_time>7964.547632</checkpoint_elapsed_time>
    <fraction_done>0.899795</fraction_done>
    <peak_working_set_size>480616448</peak_working_set_size>
    <peak_swap_size>35446243328</peak_swap_size>
    <peak_disk_usage>23819</peak_disk_usage>
</active_task>
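
The fields in that file can also be read programmatically, for example to see how much CPU time had accumulated at the last checkpoint. A small Python sketch, assuming the file holds a single `<active_task>` element like the one above; `read_task_state` is a hypothetical name, not part of BOINC.

```python
import xml.etree.ElementTree as ET


def read_task_state(xml_text):
    """Extract checkpoint and progress fields from an <active_task> block."""
    root = ET.fromstring(xml_text)
    return {
        "result_name": root.findtext("result_name"),
        # Seconds of CPU time and wall-clock time at the last checkpoint.
        "checkpoint_cpu_time": float(root.findtext("checkpoint_cpu_time")),
        "checkpoint_elapsed_time": float(root.findtext("checkpoint_elapsed_time")),
        # Fraction of the task completed, between 0 and 1.
        "fraction_done": float(root.findtext("fraction_done")),
    }
```

Reading the file twice, some minutes apart, and comparing `checkpoint_cpu_time` would show whether new checkpoints are still being written.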


----------------Last entry in stderr.txt after abort------------------


tail -f stderr.txt
03:35:02 (17065): [normal]: Start of BOINC application 'hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE'.
03:35:02 (17065): [debug]: 2.1e+15 fp, 5.2e+09 fp/s, 406674 s, 112h57m54s25
command line: hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE --inputfile ../../projects/einstein.phys.uwm.edu/LATeah1085F.dat --alpha 4.9667881722 --delta -0.9496737886 --skyRadius 0.00174881991 --ldiBins 15 --f0start 1416 --f0Band 16 --firstSkyPoint 1564772 --numSkyPoints 61 --f1dot -1.0e-13 --f1dotBand 1.0e-13 --df1dot 1.428289491e-15 --ephemdir ../../projects/einstein.phys.uwm.edu/JPLEPH --Tcoh 4194304.0 --toplist 10 --cohFollow 10 --numCells 1 --useWeights 1 --Srefinement 1 --CohSkyRef 1 --cohfullskybox 1 --mmfu 0.15 --reftime 56757.0 --f0orbit 0.005 --freeRadiusFactor 2 --mismatch 0.15 --debug 0 -o LATeah1085F_1432.0_1564772_0.0_0_0.out
output files: 'LATeah1085F_1432.0_1564772_0.0_0_0.out' '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_0' 'LATeah1085F_1432.0_1564772_0.0_0_0.out.cohfu' '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_1'
03:35:02 (17065): [debug]: Flags: X64 SSE SSE2 GNUC X86 GNUX86
03:35:02 (17065): [normal]: WARNING: Resultfile '../../projects/einstein.phys.uwm.edu/LATeah1085F_1432.0_1564772_0.0_0_1' present - doing nothing
03:35:02 (17065): [debug]: Set up communication with graphics process.
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory
mv: LATeah1085F_1432.0_1564772_0.0_0_0.out: No such file or directory

Stephen Hawkins

73 49 111 01001001

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

Aborting two more this morning.

Stephen Hawkins

73 49 111 01001001

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

New variant.  Some appear to get stuck at 89.979%.  The time remaining will start at 1:04:00 and count down to 1:00:00, then jump back up to 1:04:00 and start counting down again.  While this is happening, other tasks are running and finishing successfully.  The one below has been doing this for at least two days.

Application: Gamma-ray pulsar search #5 1.11 (FGRPSSE)
Name: LATeah1086F_1400.0_70560_0.0
State: Running
Received: Saturday, October 30, 2021 at 07:13:02 AM
Report deadline: Saturday, November 13, 2021 at 06:13:01 AM
Estimated computation size: 105,000 GFLOPs
CPU time: 03:39:04
CPU time since checkpoint: 00:00:00
Elapsed time: 04:02:23
Estimated time remaining: 01:02:40
Fraction done: 89.980%
Virtual memory size: 32.57 GB
Working set size: 1.42 MB
Directory: slots/1
Process ID: 66043
Progress rate: 22.320% per hour
Executable: hsgamma_FGRP5_1.11_x86_64-apple-darwin__FGRPSSE

Stephen Hawkins

73 49 111 01001001

Stephen Hawkins
Joined: 11 Mar 15
Posts: 70
Credit: 92362641
RAC: 136446

I believe the problem is that BOINC Manager believes the task is being processed but the computer's operating system (macOS Monterey 12.1) does not.

Running "top" I see:

79134  hsgamma_FGRP 98.6  04:09:46 2/1   0    15    748M   0B     0B     595   619   running  *0[1]           0.00000 0.00000    503  370907    146     13635      6672       32961615+
82538  hsgamma_FGRP 88.7  40:43.65 2/1   0    15    748M   0B     0B     595   619   running  *0[1]           0.00000 0.00000    503  192905    159     2234       1065       5333897+

There should be three of these running, one for each task.  If I tell BOINC Manager to "suspend" the stuck task it does so, and immediately starts running a new task in its place.  At that point macOS shows three processes, one for each task.  When I unsuspend the stuck task a new process does not get started for it.  Is there a way to get BOINC Manager to request a new process for the orphan task?
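
One way to cross-check what BOINC Manager reports against the operating system is to test whether the Process ID shown on the task's properties page still exists. A minimal Python sketch for POSIX systems such as macOS; `pid_is_alive` is a hypothetical helper, and a PID that exists is not proof it belongs to the task, since PIDs can be reused.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        # Signal 0 sends nothing; it only checks that the PID can be signalled.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True
```

Calling it with the Process ID from the properties page (for example `pid_is_alive(66043)`) would indicate whether the process BOINC Manager thinks it is running is actually present.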

Stephen Hawkins

73 49 111 01001001
