Scheduler partially offline

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 3,936
Credit: 198,926,084
RAC: 37,429
Topic 195302

Due to high request load, the scheduler is (automatically) disabled occasionally for five minutes to let the DB and transitioner catch up. We apologize for the inconvenience.
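[The load-shedding Bernd describes — temporarily refusing scheduler requests when the backlog grows, so the database and transitioner can catch up — can be sketched roughly as below. The threshold, the backlog metric, and the function itself are illustrative assumptions, not Einstein@Home's actual server code; only the five-minute pause comes from the post.]

```python
PAUSE_SECONDS = 300          # the five-minute pause mentioned above
BACKLOG_THRESHOLD = 10_000   # hypothetical transitioner/DB backlog limit

def scheduler_enabled(backlog_depth, paused_until, now):
    """Decide whether to serve scheduler requests.

    backlog_depth: current transitioner/DB backlog (hypothetical metric)
    paused_until:  timestamp until which the scheduler stays disabled
    now:           current time (epoch seconds)
    Returns (enabled, new_paused_until).
    """
    if now < paused_until:
        return False, paused_until          # still inside a pause window
    if backlog_depth > BACKLOG_THRESHOLD:
        return False, now + PAUSE_SECONDS   # trip a fresh pause
    return True, paused_until               # normal operation
```

A client that hits a pause window gets a "temporarily shut down" reply and simply retries a few minutes later, which matches the behaviour reported further down this thread.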

BM

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,560,244
RAC: 7,961

Scheduler partially offline

Actually, since yesterday (Sunday - October 3), the scheduler has been *totally* offline -- which means completed work can't be reported (including work coming up on deadline).

Is this related to the validator (storage location) problem mentioned elsewhere?

I have seen nothing from the project regarding the scheduler going offline for an extended period (as it has been).

Quote:

Due to high request load, the scheduler is (automatically) disabled occasionally for five minutes to let the DB and transitioner catch up. We apologize for the inconvenience.

BM


Gundolf Jahn
Joined: 1 Mar 05
Posts: 1,079
Credit: 341,280
RAC: 0

RE: -- which means

Message 99517 in response to message 99516

Quote:
-- which means completed work can't be reported (including work coming up on deadline).


Sorry to contradict you once again, but reporting is possible, as of 21:00 UTC:

    04/10/2010 23:01:11|Einstein@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 1 completed tasks
    04/10/2010 23:01:16|Einstein@Home|Scheduler request succeeded: got 0 new tasks
    04/10/2010 23:01:16||[sched_op_debug] handle_scheduler_reply(): got ack for result h1_0953.55_S5R4__42_S5GC1a_2
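[The `[sched_op_debug]` line in the log above comes from an optional BOINC client log flag. To get the same detail in your own event log, it can be enabled in `cc_config.xml` in the BOINC data directory, then applied with a client restart or "Read config file":]

```xml
<cc_config>
  <log_flags>
    <!-- log details of scheduler RPCs: requests, replies, acks for reported results -->
    <sched_op_debug>1</sched_op_debug>
  </log_flags>
</cc_config>
```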

Gruß,
Gundolf

Computer sind nicht alles im Leben. (Kleiner Scherz)

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,124
Credit: 126,479,473
RAC: 19,791

RE: Sorry to contradict you

Message 99518 in response to message 99517

Quote:
Sorry to contradict you once again, but reporting is possible ......


Yeah, I haven't been copping any of the 5 minute scheduler pauses over the weekend, either ... I think Barry has just been unlucky.

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,560,244
RAC: 7,961

Well, I'd rather be badly

Message 99519 in response to message 99517

Well, I'd rather be badly unlucky than right on this one:

10/4/2010 3:38:53 PM Einstein@Home Sending scheduler request: Requested by user.
10/4/2010 3:38:53 PM Einstein@Home Reporting 4 completed tasks, requesting new tasks for GPU
10/4/2010 3:38:57 PM Einstein@Home Scheduler request completed: got 0 new tasks
10/4/2010 3:38:57 PM Einstein@Home Message from server: Project is temporarily shut down for maintenance

I did a successful retry a few minutes later.

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,124
Credit: 126,479,473
RAC: 19,791

RE: Well, I'd rather be

Message 99520 in response to message 99519

Quote:

Well, I'd rather be badly unlucky than right on this one:

10/4/2010 3:38:53 PM Einstein@Home Sending scheduler request: Requested by user.
10/4/2010 3:38:53 PM Einstein@Home Reporting 4 completed tasks, requesting new tasks for GPU
10/4/2010 3:38:57 PM Einstein@Home Scheduler request completed: got 0 new tasks
10/4/2010 3:38:57 PM Einstein@Home Message from server: Project is temporarily shut down for maintenance

I did a successful retry a few minutes later.


Ah, good. So the tasks were uploaded OK, but you had a retry for more. Probably the language ought to be 'the scheduler is responding, but we have short periods where you won't necessarily get new work upon request.'

PS - what's your turnaround time on GPU work?

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,560,244
RAC: 7,961

Mike -- I run a batch of

Message 99521 in response to message 99520

Mike -- I run a batch of workstations, many of which have Einstein running, and some of these have 9800GT GPUs. I will take a look at the run times.

Regarding reporting work -- frankly, I'd not encountered this scheduler issue with Einstein that I could see until the past day. Today, I've seen it on every one of my workstations running Einstein -- and instead of, say, 5 minutes every hour, I'd see rather extended 'maintenance' failures. It's not so much a problem for new work: I typically have a decent cache, and since nearly all my workstations have 6 or more projects on them, I never run out of work.

Over the past few months, I've increased the amount of Einstein work I am processing -- from around 5K credits a day to double that. Mostly shifting cycles from SETI (which is simply too problematic these days), GPUGrid (for those 9800GT workstations), and to a degree from 'low yield' projects like Rosetta and Malaria for CPU only.

The lion's share of my processing though is ATI GPU processing for MilkyWay, Collatz and DNetc.

Regarding the scheduler, my sense is that there may be some other things going on there which, for the past day, have made it inaccessible more often than, say, even last week.

Quote:

Ah, good. So the tasks were uploaded OK, but you had a retry for more. Probably the language ought be 'the scheduler is responding but we have short periods where you won't necessarily get new work upon request.'

PS - what's your turnaround time on GPU work?

Cheers, Mike.


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,124
Credit: 126,479,473
RAC: 19,791

RE: Regarding the

Message 99522 in response to message 99521

Quote:
Regarding the scheduler, my sense is that there may be some other things going on ....


Well that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line, it will purge but only after some burping and backfires ..... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,560,244
RAC: 7,961

OK -- I thought it might be

Message 99523 in response to message 99522

OK -- I thought it might be something like that. I am seeing quite a rise in pending credit just now -- I figured that might also be a side effect of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently they are over 9K.

Quote:


Well that's correct. With the validator disparity between ( worldwide ) servers being corrected now, one challenge has been to reconcile the DB without loss of coherence or user credits - a sin of replication and redundancy. That requires more accesses on the fly to achieve, and the SQL constructs aren't simple. The validator problem has turned out to affect about one WU per currently active contributor ( average ). It's like a bubble in a fuel line, it will purge but only after some burping and backfires ..... :-)

Cheers, Mike.


Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,124
Credit: 126,479,473
RAC: 19,791

RE: OK -- I thought it

Message 99524 in response to message 99523

Quote:
OK -- I thought it might be something like that -- just as I am seeing quite a rise in pending credit -- figured that might also be a piece of the side effects of the database locational oopsie being dealt with. A couple of weeks ago, my pendings ran between 4K and 6K -- currently it is over 9K.


Actually, now that I think about it, we probably haven't been quite explicit enough. My understanding is that the ~70K work units that didn't correctly validate were likely to have nearly all been processed adequately by the user computers, i.e. with the usual (low) error rates. It would probably be silly to send them all out again if so, even though that would 'solve' the problem with fairly minimal admin. You could guess how contributors might not be happy with that, though. I, like yourself, have fast machines/bandwidth, so that's likely no great insult, but we have the vast hordes ( ** we luv you!! ** ) not on the bleeding edge to politely cater for as well. Thus the fiddle is to identify those that landed on the relevant naughty validator during the period in question, rinse them again, and not double/re-validate those adequately checked elsewhere, etc.

I've peeked at the SQL syntax being discussed. I mean, it's not like the DB entries have a 'we stuffed a validator' flag/field to inspect and pluck with. :-)
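[The reconciliation Mike describes — picking out only the results that the faulty validator handled during the affected window, and leaving everything validated elsewhere untouched — might be sketched like this. The record layout, the validator identifier, and the time window are all invented for illustration; as he notes, the real database carries no such convenient flag, which is exactly why the actual SQL is hard.]

```python
# Hypothetical result records; field names are illustrative, not BOINC's schema.
BAD_VALIDATOR_HOST = "validator-b"   # assumed identifier of the faulty server
WINDOW = (1_000, 2_000)              # assumed affected time window (epoch secs)

def needs_revalidation(results):
    """Return ids of results checked by the bad host inside the window.

    Results validated elsewhere, or outside the window, are left alone so
    contributors' already-granted credit is not disturbed.
    """
    lo, hi = WINDOW
    return [
        r["id"]
        for r in results
        if r["validated_by"] == BAD_VALIDATOR_HOST
        and lo <= r["validated_at"] <= hi
    ]
```

The selected ids would then be queued for a second validation pass, rather than resending every work unit to hosts.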

[ Mike awaits dev wrath via email .... ]

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,560,244
RAC: 7,961

Mike, thanks for your

Message 99525 in response to message 99524

Mike, thanks for your explanation here. For what it is worth, since Thursday, I've seen my pendings go from under 5800 to approaching double that (10.5K), and during that time the daily credits have dropped from 11K to under 6K.

I figure this will eventually flush through the system, and things will return to the level I had increased to before this drive-mapping issue.

For me, I tend to tolerate project problems when they fit two characteristics: they are 'rifle shots' rather than 'machine-gun fire' (compare Einstein, where problems are quite rare, to SETI, where problems are varied and very frequent); and there is a sense of someone minding the store (what you are doing here) instead of either denial or, even more exasperating, silence regarding reports.

So I really appreciate you taking the time to provide explanations here.
