BRP5 Version 1.36 not running with Boinc 7.0.64

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0
Topic 197031

Since BRP5 version 1.36 came out, I've been getting computation errors on all BRP5 tasks. It seems to be a problem with creating and/or writing to the file status.cpt.
The BOINC version is 7.0.64.

I updated another computer, which had been crunching BRP5 tasks successfully with an old BOINC 6.x version, to BOINC 7.0.64, and after that the task made no more progress.

I've switched both computers back from BOINC 7.0.64 to BOINC 7.0.42, and BRP5 is now running successfully on both.

Here is an example of a failed task:
http://einsteinathome.org/task/387531737

Seraphim401
Joined: 16 Sep 09
Posts: 1
Credit: 12474030
RAC: 0

Great to see that I'm not the only one with this problem.
I'm also using BOINC version 7.0.64.
My GPU is 92% utilized, but the project shows no progress; it just stays at 0%.
I hope this gets fixed soon, because I'm not downgrading.

http://einsteinathome.org/task/387507144

Gustav
Joined: 13 Aug 10
Posts: 4
Credit: 8019583
RAC: 0

I have been seeing the same thing since a couple of days ago. All BRP5 tasks restart every few minutes and eventually fail, getting marked as errored tasks.

I am running Windows 7 with an ATI Radeon HD 5700 series GPU. BOINC version is 7.0.64 (x64).

Example error task:
http://einsteinathome.org/task/387613382

When the task eventually completes, the output files cannot be found:

6/28/2013 8:03:47 PM | Einstein@Home | Computation for task PA0079_01251_348_1 finished
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_0 for task PA0079_01251_348_1 absent
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_1 for task PA0079_01251_348_1 absent
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_2 for task PA0079_01251_348_1 absent

The log on the web site reports "too many exit(0)s" and many messages about status.cpt unavailable.

7.0.64

too many exit(0)s

Checkpoint file unavailable: status.cpt (No such file or directory).


Artur Anikeich
Joined: 19 Feb 12
Posts: 3
Credit: 18123572
RAC: 6354

I think I've got the same problem, even with BOINC 7.0.28.
I canceled this task, http://einsteinathome.org/task/387465118, because it had run too long without any progress.
Although the message
[23:45:45][3840][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
seems to be normal, something goes wrong between [01:08:08] and [06:03:45]:

[00:46:17][4676][INFO ] Checkpoint committed!
[00:57:13][4676][INFO ] Checkpoint committed!
[01:08:08][4676][INFO ] Checkpoint committed!
[06:03:45][4676][INFO ] OpenCL shutdown complete!
[06:03:45][4676][INFO ] Statistics: count dirty SumSpec pages 1824 (not checkpointed), Page Size 1024, fundamental_idx_hi-window_2: 1100505
[06:03:45][4676][INFO ] Data processing finished successfully!


Checkpointing is set to every 10 minutes, and with Binary Radio Pulsar Search (Perseus Arm Survey) v1.34 (opencl-ati) it worked fine. Example: http://einsteinathome.org/task/385491213

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 760322651
RAC: 1122087

Thanks for reporting this.

We do see an increased failure rate for this app version, but by no means are all tasks failing, so there must be some combination of things that triggers this problem.

It seems that when this problem occurs, the task is interrupted (suspended??) before it can make any progress that would be written to the checkpoint file. So after a restart such a task begins from scratch, is again terminated before reaching the first checkpoint, and so on. Eventually BOINC will terminate the task with "too many exits".
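
A minimal sketch of that restart loop, under stated assumptions: the function name `simulate` and every number below are hypothetical and chosen only for illustration; none of them is taken from the BOINC client. The idea is that only work saved at the last checkpoint survives a restart, so if every run is cut short before the first checkpoint, nothing accumulates and the exit limit is eventually reached.

```python
def simulate(total_work_s, checkpoint_every_s, interrupted_after_s, max_exits):
    """Model a task that is stopped `interrupted_after_s` seconds into every run
    and resumes from its last checkpoint.

    Returns (finished, exits, saved_work_s). All parameters are illustrative.
    """
    saved = 0  # seconds of work preserved in the checkpoint file
    for exits in range(max_exits):
        if total_work_s - saved <= interrupted_after_s:
            # The remaining work fits into one uninterrupted stretch.
            return True, exits, total_work_s
        # Only whole checkpoint intervals completed during this run are kept.
        saved += (interrupted_after_s // checkpoint_every_s) * checkpoint_every_s
    return False, max_exits, saved


if __name__ == "__main__":
    # Stopped 188 s into each run, first checkpoint due at 600 s: no progress.
    print(simulate(20000, 600, 188, 100))   # (False, 100, 0)
    # Stopped after 1300 s: two checkpoints survive per run, the task finishes.
    print(simulate(20000, 600, 1300, 100))  # (True, 16, 20000)
```

In the first case the task never gets past zero saved work, which would match a progress display that stays at 0% until the client gives up.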

The big question is what is causing the task to stop only seconds after starting. If the task simply crashed, the error messages would be different. The same goes for memory shortages (and the logs for the NVIDIA tasks indicate the amount of free memory, which is more than enough).

Could those who have experienced this problem please report their setting for "Leave tasks in memory while suspended?" in the Computing preferences (for the venue matching the hosts in question)?

If you want to help us in tracking down this problem and you are experienced in BOINC configuration, you can also set some debugging flags in the cc_config.xml, restart BOINC and report back the messages in the event log after seeing the problem happening. I think the flags that might help are

(see http://boinc.berkeley.edu/wiki/Client_configuration)
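
The list of flags is not present in this copy of the post. As a hedged sketch only: the event logs posted later in this thread carry [task], [coproc] and [checkpoint] prefixes, which correspond to BOINC's `task_debug`, `coproc_debug` and `checkpoint_debug` log flags. Whether these are exactly the flags that were recommended here is an assumption. The helper name `build_cc_config` is made up for illustration; it only produces the text of a cc_config.xml, which would then be saved in the BOINC data directory.

```python
import xml.etree.ElementTree as ET

# Assumed flag set, inferred from the [task], [coproc] and [checkpoint]
# prefixes in the logs in this thread; not confirmed by the original post.
ASSUMED_FLAGS = ["task_debug", "coproc_debug", "checkpoint_debug"]


def build_cc_config(flags):
    """Return cc_config.xml text that switches on the given BOINC log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        ET.SubElement(log_flags, name).text = "1"
    if hasattr(ET, "indent"):  # pretty-printing is available from Python 3.9
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_cc_config(ASSUMED_FLAGS))
```

After saving the file, BOINC has to be restarted (as described above) before the extra messages appear in the event log.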

HBE

Artur Anikeich
Joined: 19 Feb 12
Posts: 3
Credit: 18123572
RAC: 6354

"Leave tasks in memory while suspended" is not checked.

30/06/2013 19:04:15 | Einstein@Home | [coproc] Assigning ATI instance 0 to PA0071_00551_267_3
30/06/2013 19:04:15 | Einstein@Home | [task] task_state=EXECUTING for PA0071_00551_267_3 from start
30/06/2013 19:04:15 | Einstein@Home | Restarting task PA0071_00551_267_3 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 3
30/06/2013 19:05:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:06:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
...
30/06/2013 19:14:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:14:17 | Einstein@Home | [checkpoint] result PA0071_00551_267_3 checkpointed
30/06/2013 19:15:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
...
30/06/2013 19:23:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:24:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:24:17 | Einstein@Home | [checkpoint] result PA0071_00551_267_3 checkpointed
30/06/2013 19:25:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
...
30/06/2013 19:32:16 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:33:11 | World Community Grid | General prefs: from World Community Grid (last modified 03-Nov-2012 23:26:52)
...
30/06/2013 19:33:15 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:33:16 | Einstein@Home | task LATeah0032U_688.0_83160_0.0_1 resumed by user
30/06/2013 19:33:36 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:33:36 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from start
30/06/2013 19:33:36 | Einstein@Home | Restarting task LATeah0032U_688.0_83160_0.0_1 using hsgamma_FGRP2 version 109 in slot 1
30/06/2013 19:33:38 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
30/06/2013 19:33:55 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:33:56 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
30/06/2013 19:34:15 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:34:16 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
30/06/2013 19:34:35 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:34:36 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
30/06/2013 19:34:55 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:34:55 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:34:56 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
...
30/06/2013 19:35:55 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:35:55 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:35:56 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
...
...
30/06/2013 19:43:35 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:43:35 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:43:36 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
...
30/06/2013 19:44:35 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:44:35 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3
30/06/2013 19:44:36 | Einstein@Home | [task] task_state=SUSPENDED for LATeah0032U_688.0_83160_0.0_1 from suspend
...
30/06/2013 19:45:35 | Einstein@Home | [task] task_state=EXECUTING for LATeah0032U_688.0_83160_0.0_1 from unsuspend
30/06/2013 19:45:35 | Einstein@Home | [coproc] ATI instance 0: confirming for PA0071_00551_267_3


After I resumed LATeah0032U_688.0_83160_0.0_1, the progress bar stopped at 1.938% although the elapsed time keeps increasing, and there are no more "checkpointed" messages. This time einsteinbinary_BRP5_1.36_windows_x86_64__opencl-ati.exe is running without interruptions.
is not set because it generates too many messages about the state of the network.
By the way, the task's percentage is shown correctly in the graphics.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 760322651
RAC: 1122087

Thanks!

Interesting... There is a lot of stop-and-go going on, this time for the LAT task, after 19:33:36 in your log.

In general, all settings that are related to suspending tasks might be interesting when this kind of problem happens.

Background: the new app version 1.36 fixes a problem in the BOINC API part, which caused the previous app versions to ignore suspend requests in many circumstances. Now that this is fixed and the app really does suspend when asked to do so by the core client, this might cause other problems when BOINC, for some yet unknown reasons, is excessively suspending and resuming in short intervals.

Those who experience this problem might want to try settings that minimize suspension of the app, e.g. allow GPU tasks while host is in use.

Thanks for the feedback
HB

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 760322651
RAC: 1122087

For now, version 1.36 has been deprecated. However, to help us solve the problem, those who experienced it can attach to Albert@Home at http://albert.phys.uwm and try to reproduce it with the debugging settings I mentioned above.

Sorry for the inconvenience
HBE

Artur Anikeich
Joined: 19 Feb 12
Posts: 3
Credit: 18123572
RAC: 6354

Quote:

Interesting... There is a lot of stop-and-go going on, this time for the LAT task, after 19:33:36 in your log.


Well, that is because the option "Use at most _% CPU time" was set to 5%. With the previous version of the app there was no problem with that.
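
As a rough check of that, here is a sketch only: the transition times below are copied from the [task] lines for LATeah0032U_688.0_83160_0.0_1 in the log earlier in this thread, and the function name `duty_cycle` is made up for illustration. The timestamps have one-second resolution, so the result is approximate.

```python
from datetime import datetime

# (time, new task_state) pairs taken from the [task] log lines for the LAT task.
TRANSITIONS = [
    ("19:33:55", "EXECUTING"),
    ("19:33:56", "SUSPENDED"),
    ("19:34:15", "EXECUTING"),
    ("19:34:16", "SUSPENDED"),
    ("19:34:35", "EXECUTING"),
    ("19:34:36", "SUSPENDED"),
    ("19:34:55", "EXECUTING"),
]


def duty_cycle(transitions):
    """Fraction of wall-clock time spent EXECUTING between the first and last entry."""
    running = 0.0
    total = 0.0
    for (start, state), (end, _) in zip(transitions, transitions[1:]):
        delta = datetime.strptime(end, "%H:%M:%S") - datetime.strptime(start, "%H:%M:%S")
        total += delta.total_seconds()
        if state == "EXECUTING":
            running += delta.total_seconds()
    return running / total


if __name__ == "__main__":
    print(f"{duty_cycle(TRANSITIONS):.0%}")  # 5%
```

The task runs for about 1 second and is then suspended for about 19 seconds, which is the 5% setting at work.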

Gustav
Joined: 13 Aug 10
Posts: 4
Credit: 8019583
RAC: 0

"Leave applications in memory while suspended" is checked.

"Use at most _% CPU time" is set to 75%. Setting it to 100% CPU time doesn't fix the issue.

Here is the output with the requested debug flags. I removed other tasks and SUSPENDING/EXECUTING messages.

6/30/2013 11:11:37 PM | Einstein@Home | [coproc] Assigning ATI instance 0 to PA0079_01171_369_1
6/30/2013 11:11:37 PM | Einstein@Home | [task] task_state=EXECUTING for PA0079_01171_369_1 from start
6/30/2013 11:11:37 PM | Einstein@Home | Restarting task PA0079_01171_369_1 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 4
6/30/2013 11:12:38 PM | Einstein@Home | [coproc] ATI instance 0: confirming for PA0079_01171_369_1
6/30/2013 11:13:38 PM | Einstein@Home | [coproc] ATI instance 0: confirming for PA0079_01171_369_1
6/30/2013 11:14:38 PM | Einstein@Home | [coproc] ATI instance 0: confirming for PA0079_01171_369_1
6/30/2013 11:14:45 PM | Einstein@Home | Restarting PA0079_01171_369_1 - message timeout
6/30/2013 11:14:46 PM | Einstein@Home | [task] Process for PA0079_01171_369_1 exited, exit code 0, task state 1
6/30/2013 11:14:46 PM | Einstein@Home | Task PA0079_01171_369_1 exited with zero status but no 'finished' file
6/30/2013 11:14:46 PM | Einstein@Home | [task] task_state=UNINITIALIZED for PA0079_01171_369_1 from handle_premature_exit
6/30/2013 11:14:46 PM | Einstein@Home | [coproc] Assigning ATI instance 0 to PA0079_01171_369_1
6/30/2013 11:14:46 PM | Einstein@Home | [task] task_state=EXECUTING for PA0079_01171_369_1 from start
6/30/2013 11:14:46 PM | Einstein@Home | Restarting task PA0079_01171_369_1 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 4
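
For anyone who wants to tally these events over a longer event log, a minimal sketch follows. The function name `count_events` and the regular expressions are my own and match only the message wording seen in the lines above; other BOINC versions may word these messages differently.

```python
import re
from collections import Counter

# Sample lines copied from the log above.
LOG = """\
6/30/2013 11:11:37 PM | Einstein@Home | Restarting task PA0079_01171_369_1 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 4
6/30/2013 11:14:45 PM | Einstein@Home | Restarting PA0079_01171_369_1 - message timeout
6/30/2013 11:14:46 PM | Einstein@Home | [task] Process for PA0079_01171_369_1 exited, exit code 0, task state 1
6/30/2013 11:14:46 PM | Einstein@Home | Task PA0079_01171_369_1 exited with zero status but no 'finished' file
6/30/2013 11:14:46 PM | Einstein@Home | Restarting task PA0079_01171_369_1 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 4
"""

PREMATURE_EXIT = re.compile(r"Task (\S+) exited with zero status but no 'finished' file")
MESSAGE_TIMEOUT = re.compile(r"Restarting (\S+) - message timeout")


def count_events(log_text):
    """Count premature exits and message timeouts per task name."""
    exits, timeouts = Counter(), Counter()
    for line in log_text.splitlines():
        parts = line.split(" | ", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue
        message = parts[2]
        match = PREMATURE_EXIT.search(message)
        if match:
            exits[match.group(1)] += 1
        match = MESSAGE_TIMEOUT.search(message)
        if match:
            timeouts[match.group(1)] += 1
    return exits, timeouts


if __name__ == "__main__":
    exits, timeouts = count_events(LOG)
    print(dict(exits), dict(timeouts))
```

In the excerpt above the "message timeout" line comes one second before the exit with status zero, so counting both per task shows whether every premature exit is preceded by a timeout.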

Thanks,
Gustav

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

Quote:

Thanks!

Interesting... There is a lot of stop-and-go going on, this time for the LAT task, after 19:33:36 in your log.

In general, all settings that are related to suspending tasks might be interesting when this kind of problem happens.

Background: the new app version 1.36 fixes a problem in the BOINC API part, which caused the previous app versions to ignore suspend requests in many circumstances. Now that this is fixed and the app really does suspend when asked to do so by the core client, this might cause other problems when BOINC, for some yet unknown reasons, is excessively suspending and resuming in short intervals.

Those who experience this problem might want to try settings that minimize suspension of the app, e.g. allow GPU tasks while host is in use.

Thanks for the feedback
HB

On my systems, "Leave in memory..." is activated. On the big systems, I'm using only 5 of 8 cores to crunch LAT tasks, plus 0.20 CPU for BRP5. The configuration is just set to use 80% of the CPU. That is the only setting that should cause suspension.
At the moment I'm not able to retest, because I've downgraded to BOINC 7.0.42. Everything works fine here, even with app version 1.36, and the workaround fixes the problem on both Vista and Windows 7.
So the starting point for an analysis could be the difference between BOINC 7.0.42 and 7.0.64: app version 1.36 works with the old BOINC, but not with the new one.
