Detect Non-progressing WorkUnits?

peanut
Joined: 4 May 07
Posts: 162
Credit: 9644812
RAC: 0
Topic 193159

It is likely someone has already wished for this, but it is worth repeating.

I wonder if it would be possible for the BOINC manager to detect when a WU has not progressed in a certain amount of real time. I don't know whether it would be possible to communicate with an app and have it "restart" at a checkpoint, but that might be a nice feature.

My dual and quad cores, and even my single-core computers, have occasionally had a WU get stuck. Usually, I have to restart my computer to fix this problem. After the restart, the stuck WU has always continued and validated successfully. So it appears that even good WUs can get stuck for some reason. The reasons could be many, and I don't have the knowledge to guess exactly what happens. I could imagine a race condition of some kind in multicore processors, but even my single-processor computer has had WUs freeze.

Jord
Joined: 26 Jan 05
Posts: 2952
Credit: 5779100
RAC: 0

The best way to go at this is to register at BOINC Trac and make a new ticket with exactly that request to the BOINC developers.

John McLeod VII
Moderator
Joined: 10 Nov 04
Posts: 547
Credit: 632255
RAC: 0

There is a basic problem. Some of the projects have tasks that never report % complete, and never checkpoint even though they are making progress. They complete after several hours. Some projects also have tasks that immediately jump to some % complete, and then freeze there until they are complete.

If by non-progressing, you mean that they keep switching back to the same checkpoint, that is different.
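To illustrate the second case (a task that keeps going back to the same checkpoint), here is a minimal Python sketch. It is not BOINC client code: the class name, the `record_restart` method, and the idea of identifying a checkpoint by its CPU time are all assumptions made for the example. It only counts consecutive restarts from an identical checkpoint, so it does not depend on the task reporting % complete.

```python
from collections import defaultdict


class CheckpointLoopDetector:
    """Flags a task that restarts repeatedly from the same checkpoint.

    Hypothetical helper for illustration only; not part of BOINC.
    """

    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        # task name -> checkpoint CPU time seen at the most recent restart
        self._last_checkpoint = {}
        # task name -> number of consecutive restarts from that checkpoint
        self._repeats = defaultdict(int)

    def record_restart(self, task, checkpoint_cpu_time):
        """Record a restart; return True if the task looks stuck in a loop."""
        if self._last_checkpoint.get(task) == checkpoint_cpu_time:
            # Same checkpoint as last time: no net progress was saved.
            self._repeats[task] += 1
        else:
            # A new checkpoint means the task advanced; start counting again.
            self._last_checkpoint[task] = checkpoint_cpu_time
            self._repeats[task] = 1
        return self._repeats[task] >= self.max_repeats
```

With `max_repeats=3`, a task that resumes three times in a row from the checkpoint at 120 s of CPU time is flagged; as soon as it resumes from a later checkpoint, the count starts over.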

rbpeake
Joined: 18 Jan 05
Posts: 266
Credit: 978833130
RAC: 668204

Message 73210 in response to message 73209

Quote:
....If by non-progressing, you mean that they keep switching back to the same checkpoint, that is different.


Rosetta built a kind of "watchdog" into their application: if the application freezes for some specified number of seconds, the watchdog aborts the work unit.
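For anyone curious what such a watchdog might look like, here is a rough Python sketch of the general idea. It is not Rosetta's code, and I don't know how theirs is implemented; the `Watchdog` class, its `heartbeat` and `is_stalled` methods, and the injectable `clock` parameter are all made up for the example. The application calls `heartbeat()` whenever it makes progress, and a monitor checks `is_stalled()` to decide whether to abort.

```python
import time


class Watchdog:
    """Reports a stall when no heartbeat arrives within the timeout.

    Hypothetical illustration only; not taken from any project's application.
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock              # injectable so the class can be tested
        self._last_beat = clock()        # time of the most recent heartbeat

    def heartbeat(self):
        """Called by the worker each time it completes a unit of progress."""
        self._last_beat = self._clock()

    def is_stalled(self):
        """True if more than timeout_seconds have passed since the last heartbeat."""
        return self._clock() - self._last_beat > self.timeout_seconds
```

A monotonic clock is used by default so that changes to the system clock do not trigger or hide a stall. What to do once `is_stalled()` returns True (abort the work unit, restart from a checkpoint, or just report it) is a separate policy decision.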
