Since BRP5 version 1.36 came out, I've been getting computation errors on all BRP5 tasks. It seems to be a problem with creating and/or writing to the file status.cpt.
My BOINC version is 7.0.64.
I updated another computer, which had been crunching BRP5 tasks successfully with an old BOINC version 6.x, to BOINC 7.0.64, and after that the task made no more progress.
I've switched both computers back from BOINC 7.0.64 to BOINC 7.0.42, and BRP5 is now running successfully on both.
Here is an example error task:
http://einsteinathome.org/task/387531737
Copyright © 2024 Einstein@Home. All rights reserved.
BRP5 Version 1.36 not running with Boinc 7.0.64
Great to see that I'm not the only one with this problem.
I'm also using Boinc Version 7.0.64.
My GPU is 92% utilized, but the project shows no progress; it just stays at 0%.
Hope this gets fixed soon, because I'm not downgrading.
http://einsteinathome.org/task/387507144
I see the same thing starting a couple days ago. All BRP5 tasks restart every few minutes and eventually fail, getting marked as errored tasks.
I am running Windows 7 with an ATI Radeon HD 5700 series GPU. BOINC version is 7.0.64 (x64).
Example error task:
http://einsteinathome.org/task/387613382
When the task eventually completes, the output files cannot be found:
6/28/2013 8:03:47 PM | Einstein@Home | Computation for task PA0079_01251_348_1 finished
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_0 for task PA0079_01251_348_1 absent
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_1 for task PA0079_01251_348_1 absent
6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_2 for task PA0079_01251_348_1 absent
The log on the web site reports "too many exit(0)s" and many messages about status.cpt unavailable.
too many exit(0)s
Checkpoint file unavailable: status.cpt (No such file or directory).
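For anyone who wants to check their own event log for the same symptom, here is a minimal sketch that collects the "absent" output files per task. It assumes the message format shown in the log lines quoted above; the function name absent_outputs is made up for illustration.

```python
import re

# Matches the event-log message format quoted in this thread (an assumption
# about the format in general; other BOINC versions may word it differently).
ABSENT_RE = re.compile(r"Output file (\S+) for task (\S+) absent")

def absent_outputs(log_lines):
    """Map each task name to the list of output files reported absent."""
    missing = {}
    for line in log_lines:
        match = ABSENT_RE.search(line)
        if match:
            output_file, task = match.group(1), match.group(2)
            missing.setdefault(task, []).append(output_file)
    return missing

sample = [
    "6/28/2013 8:03:47 PM | Einstein@Home | Computation for task PA0079_01251_348_1 finished",
    "6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_0 for task PA0079_01251_348_1 absent",
    "6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_1 for task PA0079_01251_348_1 absent",
    "6/28/2013 8:03:47 PM | Einstein@Home | Output file PA0079_01251_348_1_2 for task PA0079_01251_348_1 absent",
]
result = absent_outputs(sample)
print(result)
```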
I think I've got the same problem even with BOINC 7.0.28.
I canceled this task, http://einsteinathome.org/task/387465118, because it ran too long without any progress.
Although
[23:45:45][3840][INFO ] Checkpoint file unavailable: status.cpt (No such file or directory).
seems to be normal, there is something wrong between [01:08:08] and [06:03:45]:
Checkpointing is set to every 10 minutes, and in Binary Radio Pulsar Search (Perseus Arm Survey) v1.34 (opencl-ati) it worked fine. Example: http://einsteinathome.org/task/385491213
Thanks for reporting this.
We do see an increased failure rate for this app version, but by no means are all tasks failing, so there must be some combination of things that trigger this problem.
It seems that when this problem occurs, the task is interrupted (suspended??) before it can make any progress that would be written to the checkpoint file. So after a restart such a task begins from scratch, is again terminated before reaching the first checkpoint etc... Eventually BOINC will terminate the task with "too many exits".
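To make the suspected loop concrete, here is a toy simulation. It is only a sketch under stated assumptions, not the actual BOINC or application logic: the function name, the restart limit of 100 and all the numbers are illustrative. A task runs for run_slice_s seconds before it is interrupted, and only work covered by a completed checkpoint survives the restart.

```python
def simulate_task(run_slice_s, checkpoint_interval_s, work_total_s, max_restarts=100):
    """Return (finished, restarts) for a task that loses un-checkpointed work on restart.

    max_restarts is a hypothetical stand-in for the client's "too many exits" limit.
    """
    saved = 0  # seconds of work preserved in the checkpoint file
    for restarts in range(max_restarts + 1):
        # The task resumes from the last checkpoint and runs one slice.
        if saved + run_slice_s >= work_total_s:
            return True, restarts
        # Only checkpoints actually reached during this slice are kept.
        saved += (run_slice_s // checkpoint_interval_s) * checkpoint_interval_s
    return False, max_restarts

# Interrupted after 30 s with a 10-minute checkpoint interval: nothing is ever saved.
print(simulate_task(30, 600, 3600))
# Interrupted after 25 minutes: two checkpoints per slice are kept, so the task finishes.
print(simulate_task(1500, 600, 3600))
```

As long as the run slice is shorter than the checkpoint interval, the saved progress stays at zero no matter how many restarts happen, which matches the "begins from scratch" behaviour described above.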
The big question is what is causing the task to stop only seconds after starting. If the task simply crashed, the error messages would be different. The same goes for memory shortages (and the log for the NVIDIA tasks indicates the amount of free memory, which is more than enough).
Could those who have experienced this problem please report their setting for "Leave tasks in memory while suspended?" in the Computing preferences (for the venue matching the hosts in question)?
If you want to help us in tracking down this problem and you are experienced in BOINC configuration, you can also set some debugging flags in the cc_config.xml, restart BOINC and report back the messages in the event log after seeing the problem happening. I think the flags that might help are
(see http://boinc.berkeley.edu/wiki/Client_configuration)
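As a rough illustration of what such a cc_config.xml could look like, here is a small sketch that generates one. The three flags used (task_debug, cpu_sched_debug, checkpoint_debug) are an assumption chosen for the example and are not necessarily the flags meant in this post; check the Client configuration page linked above before using them. The helper name build_cc_config is made up.

```python
import xml.etree.ElementTree as ET

def build_cc_config(flags):
    """Build a cc_config.xml document that switches on the given log flags."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each enabled flag is an element containing 1.
        ET.SubElement(log_flags, name).text = "1"
    return ET.tostring(root, encoding="unicode")

# Assumed flag selection for illustration only.
ASSUMED_FLAGS = ["task_debug", "cpu_sched_debug", "checkpoint_debug"]
config_xml = build_cc_config(ASSUMED_FLAGS)
print(config_xml)
```

The printed text would go into cc_config.xml in the BOINC data directory, followed by a restart of BOINC as described above.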
HBE
"Leave tasks in memory while suspended" is not checked.
After I resumed LATeah0032U_688.0_83160_0.0_1, the progress bar stopped at 1.938% although the elapsed time keeps increasing, and there are no more "checkpointed" messages. This time einsteinbinary_BRP5_1.36_windows_x86_64__opencl-ati.exe is running without interruptions.
is not set because it generates too many messages about the state of the network.
The task percentage shown in the graphics is correct, by the way.
Thanks!
Interesting... There is a lot of stop-and-go going on, this time for the LAT task, after 19:33:36 in your log.
In general, all settings that are related to suspending tasks might be interesting when this kind of problem happens.
Background: the new app version 1.36 fixes a problem in the BOINC API part, which caused the previous app versions to ignore suspend requests in many circumstances. Now that this is fixed and the app really does suspend when asked to do so by the core client, this might cause other problems when BOINC, for some yet unknown reasons, is excessively suspending and resuming in short intervals.
Those who experience this problem might want to try settings that minimize suspension of the app, e.g. allow GPU tasks while host is in use.
Thanks for the feedback
HB
For now, version 1.36 has been deprecated. However, to help us solve the problem, those who experienced it can attach to Albert@Home at http://albert.phys.uwm and try to reproduce the problem with the debugging settings I mentioned above.
Sorry for the inconvenience
HBE
Well, it is because the option "use at most _ % CPU time" was set to 5%. With the previous version of the app there was no problem with that.
"Leave applications in memory while suspended" is checked.
"Use at most _% CPU time" is set to 75%. Setting it to 100% CPU time doesn't fix the issue.
Here is the output with the requested debug flags. I removed other tasks and SUSPENDING/EXECUTING messages.
Thanks,
Gustav
On my systems, "Leave in memory..." is activated. On the big systems, I'm using only 5 of 8 cores to crunch LAT tasks, plus 0.20 CPU for BRP5. The configuration is simply set to use 80% of the CPU. That's the only setting that should cause suspension.
At the moment I'm not able to retest, because I've downgraded to BOINC 7.0.42. Here everything works fine, even with app version 1.36. The workaround fixes the problem on both Vista and Windows 7.
So the starting point for an analysis could be the difference between BOINC 7.0.42 and 7.0.64. App version 1.36 works with the old BOINC, but not with the new one.