Results showing "Aborted by user"

archae86
Joined: 6 Dec 05
Posts: 3,064
Credit: 5,771,795,602
RAC: 3,900,163

Richard Haselgrove wrote:
...so that wheel-spinner wouldn't trigger that characteristic sawtooth uptick in runtime estimates for the following tasks, when - I suspect - it's killed by BOINC for 'maximum time exceeded'.


I failed to pay close enough attention to recognize that in the case of immediate interest none of the tasks were running to completion, so my stretching comment did not apply. Thanks for the correction.

But I do think it properly describes how BOINC, as Einstein runs it, currently behaves when new successful completions do not match the current estimate. I pay fairly close attention at times to the progressive modification of completion time estimates, and I see significant disparities any time I look, as the scheme does not handle my two hosts, each with a mismatched pair of GPUs, especially gracefully.

I don't currently run work on other projects, and was not aware that this particular behavior also changed when so much else was "modernized" for projects keeping more up-to-date with BOINC changes. So in addition to not addressing the specific situation at hand, I spoke too broadly. Thanks again.

Lastly regarding queue length--I'm to some degree following my own advice in that, hoping for distribution soon of new Parkes code (beta, presumably) with more modern CUDA possibly giving better performance on my three Maxwells, I've lowered my queue of unstarted work to a tenth the size it was at the time of the latest announcement/tease/hint. One of these times, perhaps the press of more important current work will be low enough that it will actually happen.

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,540
Credit: 76,390,612,544
RAC: 65,890,794

Quote:
I didn't work my way all the way back to the start of the problem ...


I've tried to, and the interesting thing is that valid tasks for both BRP4G and BRP6 were last returned on 20th April and after that BRP4G tasks (14 in total) have run for about a full day before reaching the MAX_TIME_LIMIT_EXCEEDED stage and failing. I presume this is what has created the log jam so that a whole bunch of further tasks have exceeded the deadline and been branded as "Aborted by user" because of Einstein's old version of the server software.

The BRP6 tasks have behaved differently. Mostly they fail very quickly rather than running forever until a time limit is reached.

Quote:
... but if tasks stopped running at all after one or other upgrade ...


The OP said he only upgraded BOINC and nothing else, but to me it looks like the problem has nothing to do with BOINC and is instead the result of an OS update. I'm wondering if there might have been an update from Apple that has interfered with the driver/OpenCL libs. Both BRP4G and BRP6 were running flawlessly and around 20th April that all changed. The OP also said that he reverted to the previous BOINC version and, "things are back to normal". If that were really true, there would be lots of recently completed GPU tasks. There are plenty of 'in progress' tasks received on 27th April but not a single completed one. CPU tasks have been and are still completing normally - it's just a GPU issue.

Quite apart from dealing with the problem of unduly large work cache settings when one of the allowed GPU runs (BRP4G) has only a 7 day deadline, Michael needs to check whether any OS updates affecting GPU drivers/OpenCL libs have occurred. In the meantime, he should stop the flow of GPU tasks as they are simply going to be trashed until the real problem is identified and corrected. Since 20th April, there has not been a single valid GPU task returned and I would be highly skeptical that this has anything to do with a BOINC upgrade.

To try to pin this down more closely, I found the oldest error result still in the online database - 21st April 16:05:07 UTC. Two 'max time limit exceeded' tasks, each with close to a full day run time were reported at this time. There may have been earlier failing tasks that have been deleted since but I would suspect the problem occurred sometime around when the most recent successful tasks were reported on 20th April 15:46:38 UTC. Funnily enough, that is just over a full day earlier than the oldest error task. Unless Michael was running two concurrent GPU tasks, I don't really understand why two error tasks, each with a full day of elapsed time, were able to be reported together just a day after two successful tasks.
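As a quick check on the "just over a full day" figure, the gap between the two timestamps above can be computed directly. This is only an illustrative sketch: the year 2015 is an assumption (the posts give day, month and UTC time only), and the variable names are made up for the example.

```python
from datetime import datetime

# Assumption: the year is 2015; the thread only states day, month and time (UTC).
last_valid_report = datetime(2015, 4, 20, 15, 46, 38)   # most recent successful tasks
oldest_error_report = datetime(2015, 4, 21, 16, 5, 7)   # oldest error result still online

gap = oldest_error_report - last_valid_report
print(gap)  # 1 day, 0:18:29
```

With only about 24 hours and 18 minutes between the two reports, two tasks that each ran for close to a full day could not have run one after the other, which is consistent with them running concurrently.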

Hopefully Michael can say exactly when BOINC was upgraded and can check if there have been any Apple updates around the critical time of 20-21 April.

Cheers,
Gary.

Michael Robertson
Joined: 5 Nov 12
Posts: 18
Credit: 89,478,168
RAC: 0

Apologies, all...I've been unable to attend to this for a few days, but see that my attempts to downgrade--which at first seemed to remedy the issue--have in fact made no difference. I will check to see if there was an OS X update applied which coincides with the interruption of work on that box.

I had increased my work cache after the system outage a while back, and prior to this issue don't recall a single instance of a missed deadline. The machine in question is a 2014 Mac Pro which is currently more-or-less dedicated to E@H, so it has more than enough horsepower to crank out these units in time.

From the front end (the BOINC client and Activity Monitor), everything looks correct. BOINC reports that all processors and both GPUs are processing, Activity Monitor shows CPU usage in line with my setting. There are no pending transfers. I'm a bit stumped, but then again the inner workings of the project and how all of this happens behind the scenes aren't my forte.

Thanks everyone for the input. I will attempt to get you more data as soon as I am able.

Michael Robertson
Joined: 5 Nov 12
Posts: 18
Credit: 89,478,168
RAC: 0

There were indeed two system updates applied to that machine on April 20: OS X 10.10.3 and the Yosemite Supplemental Recovery Update, the latter of which seems extremely unlikely to be the culprit. I think we have a winner.

archae86
Joined: 6 Dec 05
Posts: 3,064
Credit: 5,771,795,602
RAC: 3,900,163

Quote:

The estimated time for all the remaining tasks gets boosted as soon as the unusually long-completing task finishes, not by a small averaged-in adjustment, but to the full effect of the single slow observation.

Recovery begins as soon as the first task with normal completion time is done, but for "faster than currently predicted" completion the programming responds intentionally slowly, whereas if something is slower than expected by an appreciable margin (don't know the current definition of appreciable, but maybe something like 20%), then the new prediction is bumped all the way up.


As bad luck would have it, I got to observe this effect in action today.

For the first time in many weeks, my Haswell/GTX 970 host suffered an event which caused the three in-process tasks to error out, and apparently caused a big downclock of the GPU. By the time I noticed, the tasks in process had already taken over triple the normal time.

So, just before the first of the long jobs completed, the estimated completion time for Parkes jobs was somewhat over three hours. Immediately after two completed (with reported elapsed times of 12:55:55 and 12:56:39) the unstarted jobs in queue showed estimated times of 12:55:59. While I was not quick enough to observe it, I believe that the first completion triggered the "too big a change, so bump up all the way" rule, while the second was just a little longer, so only caused a modest further adjustment.
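The two-step behaviour described above can be sketched in a few lines. This is a minimal illustration, not BOINC's actual code: the 1.1 jump threshold, the 0.1 blend weight, the starting estimate of 03:05:00 and all function names are assumptions chosen for the example (the quoted post itself only guesses at "something like 20%").

```python
# Sketch of an asymmetric runtime-estimate update.
# JUMP_THRESHOLD and BLEND_WEIGHT are assumed values, not confirmed BOINC constants.
JUMP_THRESHOLD = 1.1   # assumed: observation this much over estimate counts as "appreciably slower"
BLEND_WEIGHT = 0.1     # assumed: weight given to one ordinary observation


def to_seconds(hms: str) -> int:
    """Convert 'hh:mm:ss' to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds: float) -> str:
    """Convert seconds to 'hh:mm:ss', truncating fractions of a second."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def update_estimate(estimate: float, observed: float) -> float:
    """Jump straight to a much slower observation; otherwise blend it in slowly."""
    if observed > estimate * JUMP_THRESHOLD:
        return float(observed)
    return (1 - BLEND_WEIGHT) * estimate + BLEND_WEIGHT * observed


estimate = to_seconds("03:05:00")  # hypothetical "somewhat over three hours"
for elapsed in ("12:55:55", "12:56:39"):
    estimate = update_estimate(estimate, to_seconds(elapsed))
    print(elapsed, "->", to_hms(estimate))
```

Under these assumed values the first completion bumps the estimate all the way to 12:55:55, and the second, being only slightly longer, nudges it to 12:55:59, which happens to match the figure reported above. A later run of normal three-hour completions would pull the estimate down by only a tenth of the difference each time, which is the slow recovery described in the quote.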

Michael Robertson
Joined: 5 Nov 12
Posts: 18
Credit: 89,478,168
RAC: 0

So, I guess it's stupid question time: what happens from here?

I'm not one to freak out because my RAC has dropped to (relatively speaking) jack squat, but it does suck that I'm unable to contribute to the project to the same degree. Is this an issue to be addressed by a future BOINC upgrade, or...?

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2,059
Credit: 961,305,522
RAC: 1,477,165

Quote:

So, I guess it's stupid question time: what happens from here?

I'm not one to freak out because my RAC has dropped to (relatively speaking) jack squat, but it does suck that I'm unable to contribute to the project to the same degree. Is this an issue to be addressed by a future BOINC upgrade, or...?


I thought we'd established that the BOINC version upgrade was out of the loop. It seems to be doing its job: giving the Einstein science application the instruction to start. That it doesn't actually do any work, after being launched, suggests that perhaps it's this specific Einstein application that needs some attention, to make it compatible with the new updates to OS X. Didn't Oliver say he was going to look into it, a few days ago?

Moldr
Joined: 3 Apr 15
Posts: 11
Credit: 145,119
RAC: 0

Quote:
That it doesn't actually do any work, after being launched, suggests that perhaps it's this specific Einstein application that needs some attention, to make it compatible with the new updates to OS X. Didn't Oliver say he was going to look into it, a few days ago?

He did, and at this point all we can do is wait. I periodically download beta Parkes tasks to check if any progress is being made, because at least when they fail they do it quickly. I leave the non-beta apps on 'yes' because Bernd Machenschalk has disabled them for Yosemite until the issue is resolved.
