Stagnant BRP4-file

astro-marwil
Joined: 28 May 05
Posts: 527
Credit: 601,066,543
RAC: 1,092,495
Topic 195897

Yesterday I became aware of a stagnant BRP3 file. Normally they finish within 80 - 85 min, but this one had run for 9:45 h and was still counting up, with progress stuck at 66.7% over the next 15 min. As GPU load was 0%, memory controller load just 7% and CPU load <0.01% (Process Explorer), I decided to abort this file. Since then everything has been working well.

For me this is the first time. But how often does this happen to other participants? And is there no mechanism that watches the progress and aborts the task at some limit?

That cost me the successful crunching of 7 other files. And it was just luck that I happened to look. It could have gone on much longer, even for some days, with a proportionally bigger loss.

What's your experience in this regard?

Kind regards
Martin

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,522
Credit: 695,194,649
RAC: 123,701

Stagnant BRP4-file

Quote:

Yesterday I became aware of a stagnant BRP3 file. Normally they finish within 80 - 85 min, but this one had run for 9:45 h and was still counting up, with progress stuck at 66.7% over the next 15 min. As GPU load was 0%, memory controller load just 7% and CPU load <0.01% (Process Explorer), I decided to abort this file. Since then everything has been working well.

For me this is the first time. But how often does this happen to other participants?

I had a similar problem once, which went away after I moved the PC to a cooler room. In my case, though, the CUDA task would lock up only right at the start (= at 0% progress), so your problem might be different.

Quote:

And is there no mechanism that watches the progress and aborts the task at some limit?


Yes, every BOINC task comes with a measure of its complexity (assigned by the work generator) that will translate (based on benchmark) to a maximum allowed runtime, based on the individual computer's speed. Because the benchmarks are not very reliable, projects tend to set this value very conservatively so that tasks may run VERY long before timing out. But eventually, they will time out.
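
As a rough illustration of how such a limit scales with host speed, here is a minimal sketch. It assumes the limit is simply the task's operation bound (the `rsc_fpops_bound` workunit field, as far as I know) divided by the host's benchmarked speed; the numbers are hypothetical and chosen only to show the proportionality.

```python
def max_allowed_runtime_s(fpops_bound: float, host_flops: float) -> float:
    """Upper runtime limit in seconds for one task on one host.

    fpops_bound: upper bound on floating-point operations for the task
                 (assumed to be set by the project's work generator).
    host_flops:  benchmarked speed of the host in operations per second.
    """
    if host_flops <= 0:
        raise ValueError("host_flops must be positive")
    return fpops_bound / host_flops


# Hypothetical numbers, chosen only for illustration.
FPOPS_BOUND = 3.0e15
slow_host_limit = max_allowed_runtime_s(FPOPS_BOUND, 1.0e9)
fast_host_limit = max_allowed_runtime_s(FPOPS_BOUND, 4.0e9)

print(f"slow host limit: {slow_host_limit / 3600:.1f} h")
print(f"fast host limit: {fast_host_limit / 3600:.1f} h")
```

A conservative (large) bound pushes both limits out, which is why a hung task can sit for a very long time before the client gives up on it.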

HBE

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1,079
Credit: 341,280
RAC: 0

RE: I decided to abort this

Quote:
I decided to abort this file.


You could have tried to suspend/resume the task and then to see if it continued normally. Sometimes a restart of the BOINC client or a reboot helps too.
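
If you prefer the command line to the BOINC Manager, the same suspend/resume cycle can be scripted. This is only a sketch: it assumes `boinccmd` is on the PATH and can reach the local client, and the project URL and task name are placeholders you would replace with the values shown for your own task.

```python
import subprocess
import time


def build_task_cmd(project_url: str, task_name: str, op: str) -> list[str]:
    """Build a 'boinccmd --task <url> <name> <op>' command line."""
    if op not in ("suspend", "resume", "abort"):
        raise ValueError(f"unsupported task operation: {op}")
    return ["boinccmd", "--task", project_url, task_name, op]


def suspend_then_resume(project_url: str, task_name: str, pause_s: float = 10.0) -> None:
    """Suspend a task, wait a moment, then resume it."""
    subprocess.run(build_task_cmd(project_url, task_name, "suspend"), check=True)
    time.sleep(pause_s)
    subprocess.run(build_task_cmd(project_url, task_name, "resume"), check=True)


# Usage (placeholders, not real identifiers):
# suspend_then_resume("http://project.example/", "EXAMPLE_TASK_NAME_0")
```

If the task still shows no progress after resuming, restarting the client or aborting the task are the remaining options.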

Quote:
And is there no mechanism installed watching the progress, aborting the task at some limit?


There is. After a runtime ten times the originally estimated time to completion, the task is aborted with an exit code of -177 "Maximum elapsed time exceeded".
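
A minimal sketch of that rule, assuming the factor of ten and the exit code exactly as stated above (the function and constant names are mine, not BOINC's):

```python
from typing import Optional

ERR_MAX_ELAPSED_TIME = -177  # exit code quoted above
TIMEOUT_FACTOR = 10          # "ten times the originally estimated time"


def timeout_exit_code(elapsed_s: float, estimated_s: float) -> Optional[int]:
    """Return the abort exit code once the limit is exceeded, else None."""
    if elapsed_s > TIMEOUT_FACTOR * estimated_s:
        return ERR_MAX_ELAPSED_TIME
    return None


# Assumption for illustration: the original estimate was about 85 min.
estimate_s = 85 * 60
stuck_for_s = (9 * 60 + 45) * 60  # the 9:45 h reported in the first post

print(timeout_exit_code(stuck_for_s, estimate_s))
print(timeout_exit_code(15 * 3600, estimate_s))
```

Under that assumed estimate the limit would be 850 min, so a task stuck for 9:45 h would not yet have been aborted.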

Regards,
Gundolf
[edit]Less than a minute difference! ;-)[/edit]

Computers aren't everything in life. (Just kidding)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4,305
Credit: 249,076,723
RAC: 33,501

RE: There is. After a

Quote:
There is. After a runtime ten times the originally estimated time to completion, the task is aborted with an exit code of -177 "Maximum elapsed time exceeded".

Note that (AFAIK) for this timeout the client still measures 'runtime' as CPU time. If the CUDA app runs 20x as fast as the CPU app on your system, this means that the task will time out after 200x the normal execution time, in your case after 16 days (if running 24h a day).
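
To make the multiplication explicit, here is a small sketch under the same assumptions (a 10x limit measured against the CPU-app estimate, and a 20x GPU speedup). The 60-minute run in the example is a hypothetical figure, not taken from this thread.

```python
def wall_clock_timeout_factor(cpu_time_factor: float = 10.0,
                              gpu_speedup: float = 20.0) -> float:
    """Multiple of the normal GPU runtime after which the task times out."""
    return cpu_time_factor * gpu_speedup


def timeout_days(normal_runtime_min: float, factor: float) -> float:
    """Wall-clock days until timeout when running 24 h a day."""
    return normal_runtime_min * factor / (60 * 24)


factor = wall_clock_timeout_factor()          # 10 * 20
example_days = timeout_days(60.0, factor)     # hypothetical 60 min task

print(f"timeout after {factor:.0f}x the normal runtime")
print(f"a 60 min task would hang for about {example_days:.1f} days")
```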

BM

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102,567,115
RAC: 0

RE: Yesterday I became

Quote:

Yesterday I became aware of a stagnant BRP3 file. Normally they finish within 80 - 85 min, but this one had run for 9:45 h and was still counting up, with progress stuck at 66.7% over the next 15 min. As GPU load was 0%, memory controller load just 7% and CPU load <0.01% (Process Explorer), I decided to abort this file. Since then everything has been working well.

For me this is the first time. But how often does this happen to other participants? And is there no mechanism that watches the progress and aborts the task at some limit?

That cost me the successful crunching of 7 other files. And it was just luck that I happened to look. It could have gone on much longer, even for some days, with a proportionally bigger loss.

What's your experience in this regard?

Kind regards
Martin

I got the same problem, and it could be related to BRP4, because it didn't occur with BRP3 tasks.
Since BRP4 tasks do not give my GPUs enough workload, I've added some new projects with GPU tasks. What happens is that when the switch between the projects occurs, the NVIDIA driver sometimes crashes and there is a black screen for a few seconds before it resumes. Significantly, GPU tasks hang only after such a driver crash. The problem is easily solved by restarting the BOINC client. The GPU tasks then resume from the last checkpoint.

The reason I mention that this could be related to BRP4 is that I have one machine where I'm running only Einstein and Milkyway GPU tasks, none of the new GPU projects. Even on this machine the same problem occurs, and it was NOT the case with BRP3 tasks.

NVIDIA Driver version is 275.33 on three machines
BOINC version is 2.12.26
OS is Windows 7 Ultimate x64 on two machines and Vista 32-bit on the third

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,143
Credit: 2,928,361,848
RAC: 764,710

RE: I got the same problem

Quote:

I got the same problem, and it could be related to BRP4, because it didn't occur with BRP3 tasks.
Since BRP4 tasks do not give my GPUs enough workload, I've added some new projects with GPU tasks. What happens is that when the switch between the projects occurs, the NVIDIA driver sometimes crashes and there is a black screen for a few seconds before it resumes. Significantly, GPU tasks hang only after such a driver crash. The problem is easily solved by restarting the BOINC client. The GPU tasks then resume from the last checkpoint.

The reason I mention that this could be related to BRP4 is that I have one machine where I'm running only Einstein and Milkyway GPU tasks, none of the new GPU projects. Even on this machine the same problem occurs, and it was NOT the case with BRP3 tasks.

NVIDIA Driver version is 275.33 on three machines
BOINC version is 2.12.26
OS is Windows 7 Ultimate x64 on two machines and Vista 32-bit on the third


Driver crashes at task switch could be related to the BOINC API issue we discussed here.

Driver 275.33 on Windows 7 is certainly in the vulnerable zone for that problem, which will only be fixed when the BRP app (the same app for BRP3 and BRP4) is modified and re-compiled against the new API, as described in AppCoprocessor at 'Cleanup on premature exit'.

Until then, the only known workaround is to revert to the 266.xx series video drivers, for those cards which are supported by drivers from that era.

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102,567,115
RAC: 0

I have reinstalled the 266.xx

I have reinstalled the 266.xx drivers on all machines. There is another problem with that driver. For reasons that I cannot determine, BOINC sometimes doesn't request new BRP4 or CPU tasks, even if there are none to crunch. The message is: Not reporting or requesting tasks.

After stopping and restarting the BOINC client, it immediately requests new tasks. My guess is that this still occurs at the switch between the projects.