When this happens no progress is made in the run. I have to quit Boinc and re-start it to un-hang whatever-it-is; suspending the run is not sufficient. This has happened thrice in the past six weeks.
Has this been seen before? What if anything should I do to diagnose it further?
Copyright © 2024 Einstein@Home. All rights reserved.
Some recent change causes OSX 10.9 on PPC4 to hang in a system c
I had one work unit hang about a week ago using a G5, Mac OS 10.4.10, Boinc 5.8.17, App PPC 4.34. The processor appeared to continue at full steam but no progress had been made for five hours so apparently it was caught in a loop. First time I've ever seen that. Restarting Boinc fixed it. I check it every couple of hours now.
So, yes, I have seen it before but have nothing else useful to offer.
If this happens again, it would be useful to know which of the processes consumes all the CPU (you could use "top" in Terminal to find out):
a) The "einstein*..." science app?
b) The "boinc" core client?
c) Some other process (which would also have the net effect of freezing einstein and any other BOINC science app)?
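For anyone who would rather capture this as text than watch "top", here is a minimal Python sketch of the same check. It assumes a `ps` that accepts `-Ao pid,pcpu,comm` (true of the stock `ps` on Mac OS X and Linux, as far as I know). The function names and the name matching in `classify` are illustrative assumptions based only on the "einstein*" and "boinc" process names mentioned above.

```python
import subprocess


def parse_ps(text):
    """Parse `ps -Ao pid,pcpu,comm` output into (pid, cpu_percent, command) tuples."""
    rows = []
    for line in text.splitlines()[1:]:      # skip the header line
        parts = line.split(None, 2)         # the command path may contain spaces
        if len(parts) < 3:
            continue
        pid, cpu, command = parts
        try:
            rows.append((int(pid), float(cpu), command.strip()))
        except ValueError:
            continue                        # skip lines that do not parse cleanly
    return rows


def busiest(rows, count=5):
    """Return the `count` rows with the highest CPU percentage."""
    return sorted(rows, key=lambda row: row[1], reverse=True)[:count]


def classify(command):
    """Map a command to case a), b) or c) above (name patterns are assumptions)."""
    name = command.rsplit("/", 1)[-1].lower()
    if name.startswith("einstein"):
        return "science app"
    if name == "boinc":
        return "core client"
    return "other"


def ps_snapshot():
    """Run ps once and return the parsed process list."""
    result = subprocess.run(["ps", "-Ao", "pid,pcpu,comm"],
                            capture_output=True, text=True, check=True)
    return parse_ps(result.stdout)
```

Calling `busiest(ps_snapshot())` while the task is stuck and passing each command to `classify` should show which of the three cases applies.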
CU
BRM
Einstein_S5R2_4.
In my case I had two Einstein WUs running. Only one of the two got stuck.
Even more helpful would be the end of stderr (/Library/Application Support/BOINC Data/slots/*/stderr.txt) of the task while it is stuck, and especially whether the output changes (it might be that just the display of progress is broken). And I would be really thankful for a Shark tree (Shark is a performance analysis tool that is installed with Xcode or the CHUD tools) to see which function keeps the CPU busy without progress.
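One way to check whether the output still changes is to compare file size and modification time over a few minutes. The following is only a rough Python sketch under that assumption; the function names are made up for illustration, and the glob pattern is the stderr path given above.

```python
import glob
import os
import time


def signature(path):
    """Return (size, modification time) for one file."""
    info = os.stat(path)
    return (info.st_size, info.st_mtime)


def snapshot(pattern):
    """Record the signature of every file matching the glob pattern."""
    return {path: signature(path) for path in glob.glob(pattern)}


def changed(before, after):
    """List files that are new or whose size or modification time differs."""
    return sorted(path for path in after if before.get(path) != after[path])


def tail(path, lines=5):
    """Return the last few lines of a text file."""
    with open(path, "r", errors="replace") as handle:
        return handle.readlines()[-lines:]


def watch(pattern, interval=60, rounds=5):
    """Print which matching files changed during each interval."""
    previous = snapshot(pattern)
    for _ in range(rounds):
        time.sleep(interval)
        current = snapshot(pattern)
        print(changed(previous, current) or "no change")
        previous = current
```

For example, `watch("/Library/Application Support/BOINC Data/slots/*/stderr.txt")` would report once a minute whether any task's stderr grew, and `tail(path)` gives the last lines to post here.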
BM
BTW: Probably VTune could do the same on Linux and Windows for "stuck" tasks. All current Apps are compiled with debugging symbols.
BM
Not sure (my evaluation key has expired), but I believe VTune under Windows requires a local PDB file and won't download it automatically from the symstore. But there was a download link, right?
CU
H-B
For 4.33 the PDB is distributed with the App again.
BM
Ah, I see... I'm still running the beta :-). I will switch when the dry-run point is at a time of day where I can delete app_info.xml without losing too much work.
CU
BRM
Got a stuck one now, Bernd. The display in Boinc Manager has not altered for over 60 minutes, nor have the stderr or .cpt files updated in that time. Nevertheless, it is maxing out *both* CPUs. Interesting. It only maxes out one when it is actually working.
End of stderr.txt reads:
44669, 44670, c
44671, 44672, 44673, 44674, 44675, 44676, 44677, 44678, 44679, 44680, 44681, 44682, 44683, 44684, 44685, 44686, 44687, 44688, 44689, 44690, 44691, 44692, 44693, 44694, 44695, 44696, 44697, 44698, 44699, 44700, 44701, c
44702, 44703, 44704, 44705, 44706,
End of file h1_0543.40_S5R2__145_S5R2c_1_0 reads:
543.5073357382 3.712175 0.2344938 -9.7777e-11 5.0811
543.5073424488 3.725723 0.2344938 2.568e-11 5.15934
543.5073625804 3.739271 0.2344938 -3.4469e-10 5.53302
543.5073625804 3.752819 0.2344938 -9.7777e-11 5.38567
File h1_0543.40_S5R2__145_S5R2c_1_0.cpt reads:
3.712175,0.234494,44702,72482,41356798,829571
Part of client_state.xml reads:
http://einstein.phys.uwm.edu/
h1_0543.40_S5R2__145_S5R2c_1
1
434
1
2
81639.828535
0.616788
81648.509865
80920576.000000
39030784.000000
39030784.000000
0.000000
1
Shark was a disappointment. I've installed CHUD 4.4.4 but although Shark appears to take a sample, at the end of processing it does not display an output window. I reinstalled but with the same result. Any advice?
Activity Monitor shows Einstein making a lot of Mach system calls, but I don't know how to interpret the significance (if any) of that.
In Boinc Manager I hit Suspend (and the task showed as suspended, but the CPUs didn't stop), and then Resume. The task then showed as Waiting to Run (although still with full CPU activity). The checkpoint file did not update.
During all of this a Rosetta WU finished and uploaded so Boinc Manager appears working OK, at least in part.
Mac G5, 2GB, Boinc 5.8.17, App 4.34