Task http://einsteinathome.org/task/101714739 froze after 2 1/2 hours of run time: it showed 100% complete and was no longer using any CPU time. I couldn't shake it loose with suspend/resume. When I rebooted and restarted it, it restarted OK and now shows 7 hours to completion. I will wait for it to finish and report the status. I have no clue what might have caused it to freeze.
Copyright © 2024 Einstein@Home. All rights reserved.
Task stalled with Linux 4.49 test app
I saw a similar thing on one of my machines recently.
In my case it was sufficient to stop and restart BOINC.
Mine finished and the result was accepted successfully by the servers. It happened quite recently and a quick scan of the results list shows no invalid results. I assume it was validated correctly but I didn't specifically check. It was so recent that I'm fairly sure I would still see it if there were a problem with it.
My machine was moderately overclocked but has performed without other issue for quite a while now. There have been a couple of occasions where the room got a little on the warm side so I'm assuming the lockup was heat related. The machine is a dual core and the other core continued crunching normally. It would appear that the core that froze is a little more sensitive to heat than the other.
Cheers,
Gary.
I've seen similar reports over at Rosetta and have had the same experience running Rosetta on my Linux box. I've not seen it at E@H.
More rarely, you'll see similar behavior reported on Windows boxes (I think I've seen it most at Seti).
Kathryn :o)
Einstein@Home Moderator
Thanks for your comments. The task died due to an error:
389, 390, 391, 392, *** glibc detected *** corrupted double-linked list: 0x0bfe9218 ***
Why it then froze, I'm not sure -- it probably tried to provide debug info and wasn't set up correctly. In any case, it was left in a state from which it could be restarted, and it ran to successful completion. Good fumble recovery!
There are two issues here: debugging the underlying problem of the corrupted list (see the stderr file for what info is available), and if possible, producing an immediate exit and restart instead of hanging -- but hanging up may be preferable to aborting.
I will attempt to set up a better debug environment if someone advises me what to do. As for the "stuck task" problem, that could happen and not be noticed, since the task recovers when BOINC is restarted. That causes wasted CPU cycles, and might encourage a reduction of resource share, but in my opinion I should still run Einstein, even if it occurred quite often.
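For anyone who wants to notice a stuck task sooner, here is a rough sketch of one way to check for it on Linux: sample the CPU time of the science app's process every so often and flag it if the CPU time has stopped growing. This is only an illustration, not anything BOINC provides. The function names, the 10-minute window and the 1-second threshold are all assumptions, and you would have to find the process ID of the science app yourself (for example with ps or top).

```python
import os


def read_cpu_seconds(pid):
    """Return user + system CPU seconds used by a process (Linux /proc only)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name sits in parentheses and may contain spaces,
    # so split only what follows the last ')'.
    fields = data[data.rindex(")") + 2:].split()
    # fields[0] is the process state (field 3 in proc(5));
    # utime and stime are fields 14 and 15, counted in clock ticks.
    utime, stime = int(fields[11]), int(fields[12])
    return (utime + stime) / os.sysconf("SC_CLK_TCK")


def is_stalled(samples, min_wall_seconds=600.0, min_cpu_seconds=1.0):
    """Decide whether a task looks stuck.

    samples: list of (wall_clock_seconds, cpu_seconds) pairs, oldest first.
    The thresholds are arbitrary illustrative values, not BOINC settings.
    """
    if len(samples) < 2:
        return False
    wall_elapsed = samples[-1][0] - samples[0][0]
    cpu_used = samples[-1][1] - samples[0][1]
    # Stalled: enough wall-clock time has passed, but almost no CPU was used.
    return wall_elapsed >= min_wall_seconds and cpu_used < min_cpu_seconds
```

To use it, append a (time.time(), read_cpu_seconds(pid)) pair to a list every minute or so and call is_stalled on the last ten minutes of samples. A task that is merely suspended would also look stalled to this check, so it is only a hint that something needs a look.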
If anyone has had this happen more than once, I think it would be interesting to know how often, and how many restarts it took to reach a successful result. Also, a look at the stderr files might show other bugs that can be fixed.
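If you want to check your own stderr output for the same kind of failure, a small script along these lines could pull out the glibc messages. It is a sketch only: the pattern is modelled on the single error line quoted above, other glibc versions may format the message differently, and the function name is made up for the example.

```python
import re

# Pattern modelled on the one line quoted in this thread; treat it as an assumption.
GLIBC_ERROR = re.compile(
    r"\*\*\* glibc detected \*\*\* (?P<message>.+?): (?P<address>0x[0-9a-fA-F]+) \*\*\*"
)


def find_glibc_errors(stderr_text):
    """Return a list of (message, address) pairs found in the stderr text."""
    return [
        (m.group("message"), m.group("address"))
        for m in GLIBC_ERROR.finditer(stderr_text)
    ]


if __name__ == "__main__":
    sample = (
        "389, 390, 391, 392, *** glibc detected *** "
        "corrupted double-linked list: 0x0bfe9218 ***"
    )
    for message, address in find_glibc_errors(sample):
        print(f"{message} at {address}")
```

Paste the stderr text from a task's result page into a file, read it in, and pass the contents to find_glibc_errors. Counting the matches per task would give a rough answer to the "how often" question.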