I've switched both computers back from Boinc 7.0.64 to Boinc 7.0.42, and on both computers BRP5 is now running successfully.
This is interesting indeed. It would be really helpful if you could help us identify the exact client version that introduced the error. If you're willing to do so, these are the steps to follow:
1) Attach to our test project at http://albert.phys.uwm.edu/
2) Reproduce the error with 7.0.64
3) Reproduce no error with 7.0.42
4) Iteratively bisect the versions between those two, available here: http://boinc.berkeley.edu/dl/
By 4) I mean: given the following available versions...
4.1) pick the last working version (start with 7.0.42) and the first non-working version (start with 7.0.64)
4.2) pick the version in the middle of the set (start with 7.0.55) and test if it's working
4.3) if it's working it becomes your new "last working version". If it's failing it becomes your new "first non-working version"
4.4) continue with 4.2 until you have found the failing version that introduced the error.
Please note: each test you run halves the number of versions remaining to be tested, so this will quickly tell us which client version introduced the problem (assuming that the bug was introduced in one of the versions between the two you reported).
Thanks in advance,
Oliver
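For readers who prefer code, steps 4.1 to 4.4 can be sketched roughly as below. This is only an illustration, not a tool used by the project: the function name `bisect_first_bad`, the `is_working` callback (which stands in for "install that client, run a BRP5 task and see whether it fails") and the placeholder version labels `v0` to `v9` are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def bisect_first_bad(versions: List[str],
                     is_working: Callable[[str], bool]) -> Tuple[str, int]:
    """Return (first failing version, number of tests run).

    Assumes versions are sorted oldest to newest, versions[0] is known
    to work, versions[-1] is known to fail, and every version after the
    first failing one fails as well.
    """
    if len(versions) < 2:
        raise ValueError("need at least one known-good and one known-bad version")
    good, bad = 0, len(versions) - 1      # step 4.1
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2           # step 4.2: version in the middle
        tests += 1
        if is_working(versions[mid]):     # step 4.3: move one of the bounds
            good = mid
        else:
            bad = mid
    return versions[bad], tests           # step 4.4: first non-working version


# Hypothetical example: ten placeholder versions, where "v6" is pretended
# to be the first one that fails.
versions = [f"v{i}" for i in range(10)]
first_bad, tests = bisect_first_bad(versions, lambda v: int(v[1:]) < 6)
print(first_bad, tests)  # v6 3
```

Because each test halves the remaining range, the number of tests grows only with the logarithm of the number of versions in between.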
Ah, ok, thanks. Gustav, just to be sure: did you perform an "update" on the client side after making the change in the web settings?
I didn't perform an "update" on the client side. But I changed "Use at most _% CPU time" locally using BOINC Manager, which seems to take effect immediately.
I am now running BRP5 through Albert@Home with fewer cores in use. Results so far:
"Use at most 75% CPU time" -- BRP5 restarts every 3-4 minutes, usually every 00:03:08 but occasionally longer.
"Use at most 100% CPU time" -- BRP5 has restarted only once in 2 hours so far. Here is the one restart:
7/1/2013 1:10:28 PM | Albert@Home | Restarting PA0080_00281_276_0 - message timeout
7/1/2013 1:10:29 PM | Albert@Home | [task] Process for PA0080_00281_276_0 exited, exit code 0, task state 1
7/1/2013 1:10:29 PM | Albert@Home | Task PA0080_00281_276_0 exited with zero status but no 'finished' file
7/1/2013 1:10:29 PM | Albert@Home | If this happens repeatedly you may need to reset the project.
7/1/2013 1:10:29 PM | Albert@Home | [task] task_state=UNINITIALIZED for PA0080_00281_276_0 from handle_premature_exit
7/1/2013 1:10:29 PM | Albert@Home | [coproc] Assigning ATI instance 0 to PA0080_00281_276_0
7/1/2013 1:10:29 PM | Albert@Home | [task] task_state=EXECUTING for PA0080_00281_276_0 from start
7/1/2013 1:10:29 PM | Albert@Home | Restarting task PA0080_00281_276_0 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 7
stderr.txt shows checkpoints committed once a minute for the first 15-20 minutes after restart, then checkpoints stop. I don't know if that means anything.
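If someone wants to measure restart intervals from an event log rather than by eye, a rough sketch could look like this. It assumes the three-column "timestamp | project | message" layout seen in the excerpt above, assumes the dates are month/day/year, and counts only messages that start with "Restarting task"; the function names are made up for the example.

```python
from datetime import datetime
from typing import List


def restart_times(log_lines: List[str]) -> List[datetime]:
    """Return the timestamp of every 'Restarting task' message."""
    times = []
    for line in log_lines:
        parts = [p.strip() for p in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        stamp, _project, message = parts
        if message.startswith("Restarting task"):
            # Assumption: month/day/year with a 12-hour clock.
            times.append(datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p"))
    return times


def intervals_seconds(times: List[datetime]) -> List[float]:
    """Seconds between consecutive restarts."""
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]


# Two lines taken from the excerpt above; only the second one matches.
log = [
    "7/1/2013 1:10:28 PM | Albert@Home | Restarting PA0080_00281_276_0 - message timeout",
    "7/1/2013 1:10:29 PM | Albert@Home | Restarting task PA0080_00281_276_0 "
    "using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 7",
]
print(restart_times(log))
```

With a longer log, `intervals_seconds(restart_times(lines))` would show whether the restarts really cluster around 3 minutes.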
For now, version 1.36 has been deprecated. However, to help solve the problem, those who experienced it can attach to Albert@Home at http://albert.phys.uwm and try to reproduce it with the debugging settings I mentioned above.
Sorry for the inconvenience
HBE
I think it's not that easy. And they wouldn't have deprecated version 1.36 if there weren't a problem. I think the developers will find it and fix it. Everything else is speculation.
My system had no problems running many version 1.36 tasks via BOINC 7.0.65 in Linux. Occasionally there is a validation error but I saw those with version 1.34 as well.
Hi Oliver,
I tried to perform the testing. The problem is that there are no tasks available for BRP5: tasks to send = 0, and the BOINC client reports that there is no work available. So we have to suspend the test until tomorrow evening. Please make sure that there are tasks available on Albert@Home.
regards
There should be more BRP5 work on Albert@Home in less than 15 minutes from now.
BM
Hi Oliver,
here is the test history. You can't see the stderr.txt files on the web page because I aborted the tasks once the result was clear.
success with 7.0.42 reproduced
error with 7.0.64 reproduced
7.0.55 error
7.0.47 error (plus the graphics card froze twice when restarting from scratch)
7.0.45 error
7.0.43 success
7.0.44 success
7.0.45 error reproduced
7.0.44 success reproduced
7.0.45 is the first bad one.
Regards
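As a cross-check, the test history above can be reduced to a "last good / first bad" pair mechanically. The sketch below is only an illustration: the `boundary` helper is hypothetical, the observations are transcribed from the list above, and it assumes plain dotted numeric version strings.

```python
from typing import List, Tuple


def boundary(observations: List[Tuple[str, bool]]) -> Tuple[str, str]:
    """Return (last working version, first failing version).

    observations is a list of (version, worked) pairs. Raises ValueError
    if a failing version is older than a working one, i.e. if the results
    do not point to a single version that introduced the error.
    """
    def key(version: str) -> Tuple[int, ...]:
        return tuple(int(part) for part in version.split("."))

    good = [key(v) for v, worked in observations if worked]
    bad = [key(v) for v, worked in observations if not worked]
    if not good or not bad:
        raise ValueError("need at least one success and one failure")
    last_good, first_bad = max(good), min(bad)
    if last_good >= first_bad:
        raise ValueError("results are not consistent with a single breaking version")
    return ".".join(map(str, last_good)), ".".join(map(str, first_bad))


# Test history as reported in the post above, in the order it was run.
history = [
    ("7.0.42", True), ("7.0.64", False), ("7.0.55", False),
    ("7.0.47", False), ("7.0.45", False), ("7.0.43", True),
    ("7.0.44", True), ("7.0.45", False), ("7.0.44", True),
]
print(boundary(history))  # ('7.0.44', '7.0.45')
```

This only confirms that 7.0.45 is the first failing version among those actually tested, which matches the conclusion in the post.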
Thanks,
Gustav
Thanks very much.
This here:
looks suspicious again, we'll have to dig in the BOINC client code to see where this is thrown.
Thanks again, this was extremely useful.
Cheers
HB
Excellent, many thanks!
Cheers
HB
Strange. Here is a v1.36 task that succeeded with Boinc 7.0.64:
http://einsteinathome.org/task/387640831
Why would it be strange!?
As of this post I have 36 tasks completed with v1.36 under Boinc 7.0.64 x64 and out of that 27 are valid and 9 are pending validation. I have exactly 0 errors or invalids.
The main theme of this thread has been that if you somehow limit the amount of time Boinc can run, then problems arise. I'm not saying that's the cause, as there have been posts that say setting it to 100% of the time won't help. I'm just saying that some of us didn't have any problems with v1.36. If everyone had problems I'd expect more threads and posts about it.