Here's the gist of a conversation Gary and I started last night on the SAH Boards.
Didn't have anything to do with SAH per se, but since you might run up against this here I thought it was worth transplanting.
Quote:IIRC, there is one case where EAH will issue an unconditional abort to a host, and that is when the task has overrun the deadline and a quorum has already been formed.

Nope, not as far as I'm aware. I'm pretty sure that serverside aborts never happen at E@H.
I can say this with a fair degree of confidence because I've had a few examples where a host at a different site has been accidentally turned off and not noticed for quite a while. When it is restarted it invariably restarts crunching what it has, even though all work on the machine is way out of date with quorums already formed. There will be a client message to the effect that there may be no credit for the work, and a suggestion to abort it, but the user has to do the aborting manually.
OK, I see the caveat here. You won't get the unconditional abort command until the first scheduler contact your host makes after the deadline has passed and a quorum has formed. As far as the CC is concerned, it would crunch the task even if it were a century overdue, so long as it was still under the Max CPU time. ;-)
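For what it's worth, here's a minimal sketch of that client-side behaviour as I understand it (illustrative Python, not the actual BOINC source; the function and parameter names are made up):

```python
# Sketch of why the CC keeps crunching past the deadline: as far as I can
# tell, only the Max CPU time bound triggers a client-side abort, while a
# blown deadline just earns a "may not get credit" warning.

def client_keeps_crunching(cpu_time, max_cpu_time, now, deadline):
    if cpu_time > max_cpu_time:
        return False           # client-side abort: Max CPU time exceeded
    if now > deadline:
        print("Warning: deadline passed, you may not get credit")
    return True                # otherwise crunch on, even a century overdue
```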
The way I found out about it was back in S5R2. If you remember the TauWU graph, there was a big discontinuity at around 300 MHz, which marked the changeover from short tasks to long ones.
The problem for me was that as I approached 300 MHz from below, the run times for my K6/300s exceeded the 2 week deadline for the last two steps before the break. IIRC, at the end of the last range it would have taken just about a month of flat-out crunching to complete.
So what happened was that when the host got to the -TSI (Task Switch Interval) work fetch gate for EAH and contacted the project, the new wingman had already reported back ahead of me and the WU had validated. As a result, the project sent the unconditional abort for the current task, plus another new 250+ MHz task to chew on. That was when I decided I was not going to abort it (since S5R2 was starting to wind down), and I was not going to get poofed either!
Solution: Using the TauWU graph, I calculated how much 'horsepower' help it would need to make the deadline just in the nick of time, and then it got that help by transplanting the task over onto its K6-2/500 brothers, as well as its Katmai and Northie Intel cousins for a little extra 'kick', and then back 'home' to finish up and report. :-)
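For anyone who wants to redo that arithmetic, here's the back-of-the-envelope version (the numbers are illustrative, not the real S5R2 figures):

```python
# Rough 'horsepower' estimate: how much faster than the K6/300 the helper
# hosts needed to be, collectively, to bring a ~1 month task in under the
# 2 week deadline. Illustrative values only.

deadline_days   = 14.0   # E@H deadline at the time
est_crunch_days = 30.0   # flat-out time on the K6/300, per the TauWU graph

speedup_needed = est_crunch_days / deadline_days
print(f"helpers need to be ~{speedup_needed:.1f}x the K6/300 overall")
# ~2.1x here, which the K6-2/500s plus the Katmai/Northie hosts supplied
```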
Alinator
Note on EAH Scheduler Behaviour
In the most recent case where I saw this, there were about three tasks on the computer: an uploaded but not reported one, a partially crunched one, and one that hadn't been started (I think). I remember the one that had been uploaded but not reported because I clearly remember feeling peeved at the waste.
Now I might be wrong in my recollection of things, but the first thing that happened when the machine was restarted was that the crunched task was reported and crunching resumed on the partly crunched task. Because I only use service installs, this had all happened before I loaded BOINC Manager.

I remember realising that the on_frac would be way down, so I'm pretty sure I then stopped BOINC and edited client_state.xml to restore on_frac to close to 1. I remember checking the website and seeing that there were zero tasks in the list (all quorums had been completed and the results deleted), so I recall restarting BOINC and aborting all the tasks in BOINC Manager manually.

So whilst I wasn't even thinking about serverside aborts, the initial reporting of the completed task was a contact with the server, and no abort was generated as a result of that contact. Perhaps because the server no longer had any record of those tasks, it could hardly generate an abort for tasks it no longer knew about :-).
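For the record, the on_frac fix-up I did by hand could be sketched like this (assumptions flagged: that on_frac sits under <time_stats> in client_state.xml, which matches my memory of the file, and that the file parses as plain XML; the path varies by install, and BOINC must be stopped first since the client rewrites this file):

```python
# Sketch of the manual on_frac restoration described above.
import xml.etree.ElementTree as ET

STATE = "/var/lib/boinc-client/client_state.xml"  # example path; varies

tree = ET.parse(STATE)
on_frac = tree.getroot().find("./time_stats/on_frac")
print("old on_frac:", on_frac.text)
on_frac.text = "0.990000"   # restore to close to 1 after the long outage
tree.write(STATE)
```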
Concerning other cases of this sort of thing, with a fleet of 100+ machines to monitor, there will always be some that fall through the cracks when problems occur and don't get noticed until too late. I have personally seen quite a few cases where the warning about "not getting credit because the deadline has passed" is given. In a lot of those cases I've taken a punt, sizing up the third wingman and whether or not I should be able to beat him back. Mostly I win but occasionally I lose. However, win or lose, I've never seen an example of a serverside abort.
So I fully appreciate the incident you're reporting, and I'm willing to concede that perhaps I just haven't been involved in a suitable situation. I tend to run small caches, and the situations I'm recalling all involve deadline stress caused by a crashed or shut-down machine, not by a machine that is too slow or has too much work on board. In many of my cases, the client is simply trying to complete remaining work under conditions where it wouldn't have wanted to contact the server until after the event anyway.
Cheers,
Gary.
Hmmm... Like you I was
Hmmm...
Like you, I was shooting craps on beating the new wingman back, and came up snake eyes. So I wasn't expecting to get credit and would have aborted it myself.
The surprise was to find out I had aborted it (showed as a 197-User abort) while I had been in bed asleep! :-)
That's interesting about being so far overdue that the WU(s) would have been purged before the next scheduler session from the host. That might well explain the difference between your case and mine. Although, for current server builds, I would have to classify it as a bug that tasks don't get aborted just because one of the ones in the scheduler request is no longer listed as 'active' in the BOINC DB.
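Going purely by the behaviour we've both seen (I haven't checked the actual scheduler source, and these names are hypothetical), the decision seems to boil down to something like:

```python
# Sketch of the serverside-abort decision, inferred from the two incidents.
from collections import namedtuple

# Hypothetical stand-in for the scheduler's DB lookup result.
WUStatus = namedtuple("WUStatus", "deadline quorum_formed")

def reply_for_reported_task(db_row, now):
    if db_row is None:
        # WU already purged from the BOINC DB: the scheduler says nothing,
        # so the host keeps crunching 'mummified' work (Gary's case).
        return None
    if now > db_row.deadline and db_row.quorum_formed:
        # Task still known, overdue, quorum complete: unconditional abort,
        # which the host logs as a user abort (my case).
        return "abort"
    return None                # otherwise leave the host to finish normally

# e.g. reply_for_reported_task(WUStatus(deadline=100, quorum_formed=True), 150)
# -> "abort", while reply_for_reported_task(None, 150) -> None
```

My gripe is with the first branch: treating 'not in the DB' as 'say nothing' rather than as grounds for an abort.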
The other thought which comes to mind is: how long ago was your experience? IIRC, mine was right around the time they were adding the 221 support to the CC. If you were running a version which was not sending the list of onboard work back to the project, then the scheduler would have no way to know you were crunching on 'mummified' tasks, and the outcome would be just what you observed.
I went back through my old hosts, and it was around May of last year when the incident occurred on mine.
Alinator
RE: ... I would have to
At first I considered that behaviour to be a bug too. On further reflection, I'm not so sure. Given the fairly high proportion of new users who download tasks and return few or none of them, it would be a very severe burden on the servers to keep track of those tasks long after a quorum has formed, just so that if one of these MIAs suddenly turned up at some future point in time, the server could interrogate its records and send out aborts for any other long-overdue results the host might still have in its cache. So I regard it as a 'feature' to minimise server load rather than a bug :-).
I believe the machine was turned off before Christmas 07 and was restarted in early Feb 08. It was probably running 5.8.16 but it could have been 5.10.28 since I had been doing that upgrade on a number of machines late last year. As I intimated earlier, this is all from memory as I've got too many hosts to worry about keeping detailed records. I try to 'set and forget' most of them although I do follow a few 'favourites' more closely.
Cheers,
Gary.