BRP5 Version 1.36 not running with Boinc 7.0.64

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

RE: RE: I've switched

Quote:
Quote:

I've switched back both computers from Boinc 7.0.64 to Boinc 7.0.42 and on both computers BRP5 is running successfully

This is interesting indeed. It would be really helpful if you could help us identify the exact client version that introduced the error. If you're willing to do so, these are the steps to follow:

1) Attach to our test project at http://albert.phys.uwm.edu/
2) Reproduce the error with 7.0.64
3) Reproduce no error with 7.0.42
4) Iteratively bisect the versions between those two, available here: http://boinc.berkeley.edu/dl/

By 4) I mean: given the following available versions...

* 7.0.42
* 7.0.43
* 7.0.44
* 7.0.45
* 7.0.46
* 7.0.47
* 7.0.48
* 7.0.52
* 7.0.54
* 7.0.55
* 7.0.56
* 7.0.57
* 7.0.58
* 7.0.59
* 7.0.60
* 7.0.61
* 7.0.62
* 7.0.63
* 7.0.64

4.1) pick the last working version (start with 7.0.42) and the first non-working version (start with 7.0.64)
4.2) pick the version in the middle of the remaining range (start with 7.0.55) and test whether it works
4.3) if it works, it becomes your new "last working version"; if it fails, it becomes your new "first non-working version"
4.4) repeat from 4.2 until the last working version and the first non-working version are adjacent in the list; that first non-working version is the one that introduced the error.

Please note: each test you run halves the number of versions remaining to be tested, so this will quickly tell us which client version introduced the problem (assuming that the bug was introduced in one of the versions between the two you reported).
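The bisection in steps 4.1 to 4.4 can be sketched in Python. This is only an illustration and not part of BOINC: `first_bad_version` and the `works` callback are hypothetical names. In practice, `works` stands for installing that client version by hand and checking whether a BRP5 task runs.

```python
from typing import Callable, List, Sequence, Tuple

# Client versions from the list above, in release order.
VERSIONS = [
    "7.0.42", "7.0.43", "7.0.44", "7.0.45", "7.0.46", "7.0.47", "7.0.48",
    "7.0.52", "7.0.54", "7.0.55", "7.0.56", "7.0.57", "7.0.58", "7.0.59",
    "7.0.60", "7.0.61", "7.0.62", "7.0.63", "7.0.64",
]


def first_bad_version(
    versions: Sequence[str], works: Callable[[str], bool]
) -> Tuple[str, List[str]]:
    """Return the first failing version and the versions tested, in order.

    Assumes versions[0] works, versions[-1] fails, and every version after
    the first failing one fails too (step 4.1 and the note above).
    """
    good, bad = 0, len(versions) - 1      # step 4.1
    tested: List[str] = []
    while bad - good > 1:                 # step 4.4: stop when adjacent
        mid = (good + bad) // 2           # step 4.2
        tested.append(versions[mid])
        if works(versions[mid]):          # step 4.3
            good = mid
        else:
            bad = mid
    return versions[bad], tested


if __name__ == "__main__":
    # Hypothetical stand-in for a manual test: pretend 7.0.58 broke things.
    culprit = VERSIONS.index("7.0.58")
    print(first_bad_version(VERSIONS, lambda v: VERSIONS.index(v) < culprit))
```

With 19 versions in the list, at most five tests are needed once the two endpoints have been confirmed.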

Thanks in advance,
Oliver

Hi Oliver,

I tried to run the tests. The problem is that there are no tasks available for BRP5: "Tasks to send" is 0, and the BOINC client reports that there is no work available. So we have to suspend the test until tomorrow evening. Please make sure that there are tasks available on Albert@Home.

regards

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4312
Credit: 250515710
RAC: 34286

There should be more BRP5

There should be more BRP5 work on Albert@Home in less than 15 minutes from now.

BM

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

Hi Oliver, here is the

Hi Oliver,

here is the test history. You can't see the stderr.txt files on the web page because I aborted the tasks once the result was clear.

success with 7.0.42 reproduced
error with 7.0.64 reproduced
7.0.55 error
7.0.47 error (plus two graphics card freezes when restarting from scratch)
7.0.45 error
7.0.43 success
7.0.44 success
7.0.45 error reproduced
7.0.44 success reproduced

7.0.45 is the first bad one.
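As a sanity check, the test history above can be reduced to a "last good / first bad" pair with a few lines of Python. This is a minimal sketch: `summarize` and `OBSERVATIONS` are hypothetical names, and the observations are simply the results listed above, in order.

```python
from typing import List, Tuple

# (client version, task ran successfully?) exactly as reported above.
OBSERVATIONS = [
    ("7.0.42", True), ("7.0.64", False), ("7.0.55", False),
    ("7.0.47", False), ("7.0.45", False), ("7.0.43", True),
    ("7.0.44", True), ("7.0.45", False), ("7.0.44", True),
]


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '7.0.45' into (7, 0, 45) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def summarize(observations: List[Tuple[str, bool]]) -> Tuple[str, str, bool]:
    """Return (last good version, first bad version, results consistent?).

    The results are consistent if every working version is older than
    every failing version.
    """
    good = [v for v, ok in observations if ok]
    bad = [v for v, ok in observations if not ok]
    last_good = max(good, key=version_key)
    first_bad = min(bad, key=version_key)
    return last_good, first_bad, version_key(last_good) < version_key(first_bad)


if __name__ == "__main__":
    print(summarize(OBSERVATIONS))
```

Since 7.0.44 and 7.0.45 are consecutive entries in the list of available versions, no untested version remains between the last good and the first bad one.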

Regards

Gustav
Joined: 13 Aug 10
Posts: 4
Credit: 8019583
RAC: 0

RE: Ah, ok, thanks. Gustav,

Quote:
Ah, ok, thanks. Gustav, just to be sure: did you perform an "update" on the client side after making the change in the web settings?

I didn't perform an "update" on the client side. But I changed "Use at most _% CPU time" locally using BOINC Manager, which seems to take effect immediately.

I am now running BRP5 through Albert@Home with fewer cores in use. Results so far:

"Use at most 75% CPU time" -- BRP5 restarts every 3-4 minutes, usually every 00:03:08 but occasionally after a longer interval.

"Use at most 100% CPU time" -- BRP5 has restarted only once in 2 hours so far. Here is the one restart:

7/1/2013 1:10:28 PM | Albert@Home | Restarting PA0080_00281_276_0 - message timeout
7/1/2013 1:10:29 PM | Albert@Home | [task] Process for PA0080_00281_276_0 exited, exit code 0, task state 1
7/1/2013 1:10:29 PM | Albert@Home | Task PA0080_00281_276_0 exited with zero status but no 'finished' file
7/1/2013 1:10:29 PM | Albert@Home | If this happens repeatedly you may need to reset the project.
7/1/2013 1:10:29 PM | Albert@Home | [task] task_state=UNINITIALIZED for PA0080_00281_276_0 from handle_premature_exit
7/1/2013 1:10:29 PM | Albert@Home | [coproc] Assigning ATI instance 0 to PA0080_00281_276_0
7/1/2013 1:10:29 PM | Albert@Home | [task] task_state=EXECUTING for PA0080_00281_276_0 from start
7/1/2013 1:10:29 PM | Albert@Home | Restarting task PA0080_00281_276_0 using einsteinbinary_BRP5 version 136 (opencl-ati) in slot 7
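For anyone who wants to count these restarts in a longer event log, here is a minimal Python sketch. It assumes the three-field `timestamp | project | message` layout shown above and a month/day/year date order; `timeout_restarts` is a hypothetical helper, not a BOINC tool.

```python
import re
from datetime import datetime
from typing import List, Tuple

# Two lines copied from the log excerpt above.
LOG = """\
7/1/2013 1:10:28 PM | Albert@Home | Restarting PA0080_00281_276_0 - message timeout
7/1/2013 1:10:29 PM | Albert@Home | [task] Process for PA0080_00281_276_0 exited, exit code 0, task state 1
"""

# Matches only the "message timeout" restart notice and captures the task name.
TIMEOUT = re.compile(r"^Restarting (\S+) - message timeout$")


def timeout_restarts(log_text: str) -> List[Tuple[datetime, str, str]]:
    """Return (time, project, task name) for each 'message timeout' restart."""
    events = []
    for line in log_text.splitlines():
        parts = [part.strip() for part in line.split("|", 2)]
        if len(parts) != 3:
            continue  # not a "timestamp | project | message" line
        stamp, project, message = parts
        match = TIMEOUT.match(message)
        if match:
            # Assumption: month/day/year with a 12-hour clock, as in the excerpt.
            when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
            events.append((when, project, match.group(1)))
    return events


if __name__ == "__main__":
    for when, project, task in timeout_restarts(LOG):
        print(when.isoformat(), project, task)
```

Differences between consecutive timestamps would then give the restart interval directly.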

stderr.txt shows checkpoints committed once a minute for the first 15-20 minutes after restart, then checkpoints stop. I don't know if that means anything.

Thanks,
Gustav

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 727960019
RAC: 1224865

Thanks very much. This

Thanks very much.

This here:

Quote:

7/1/2013 1:10:28 PM | Albert@Home | Restarting PA0080_00281_276_0 - message timeout

looks suspicious again. We'll have to dig into the BOINC client code to see where this is thrown.

Thanks again, this was extremely useful.

Cheers
HB

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 727960019
RAC: 1224865

RE: 7.0.45 is the first

Quote:

7.0.45 is the first bad one.

Excellent, many thanks!
Cheers
HB

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

Strange. Here is a V1.36 with

Strange. Here is a V1.36 with Boinc 7.0.64 and with success

http://einsteinathome.org/task/387640831

Holmis
Joined: 4 Jan 05
Posts: 1118
Credit: 1055935564
RAC: 0

RE: Strange. Here is a

Quote:

Strange. Here is a V1.36 with Boinc 7.0.64 and with success

http://einsteinathome.org/task/387640831

Why would it be strange!?
As of this post I have 36 tasks completed with v1.36 under Boinc 7.0.64 x64; of those, 27 are valid and 9 are pending validation. I have exactly 0 errors or invalids.

The main theme of this thread has been that problems arise if you somehow limit the amount of time Boinc can run. I'm not saying that's the cause, as there have been posts saying that setting it to 100% of the time doesn't help. I'm just saying that some of us didn't have any problems with v1.36. If everyone had problems, I'd expect more threads and posts about it.

Rechenkuenstler
Joined: 22 Aug 10
Posts: 138
Credit: 102567115
RAC: 0

RE: For now, the version

Quote:

For now, version 1.36 has been deprecated. However, to help solve the problem, those who experienced it can attach to Albert@Home at http://albert.phys.uwm and try to reproduce it with the debugging settings I mentioned above.

Sorry for the inconvenience
HBE

I think it's not that easy. They wouldn't have deprecated version 1.36 if there weren't a problem. I think the developers will find it and fix it. Everything else is speculation.

Jeroen
Joined: 25 Nov 05
Posts: 379
Credit: 740030628
RAC: 0

My system had no problems

My system had no problems running many version 1.36 tasks via BOINC 7.0.65 on Linux. Occasionally there is a validation error, but I saw those with version 1.34 as well.
