Scheduler partially offline

ajinbc

Joined: 25 Sep 10

Posts: 1

Credit: 89039

RAC: 0

Hello there BarryAZ, I'm

5 Oct 2010 7:03:02 UTC

Message 99526 in response to message 99525

Hello there BarryAZ,

I'm new to E@H, but I do agree with your comments on SETI@Home. I've been crunching for about four years, off and on, and the "down-time" rate drives me mental.

E@H is now getting 33% of my on-line time...

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6537

Credit: 286298690

RAC: 103405

RE: Mike, thanks for your

5 Oct 2010 9:36:20 UTC

Message 99527 in response to message 99525

Quote:

Mike, thanks for your explanation here. For what it is worth, since Thursday, I've seen my pendings go from under 5800 to approaching double that (10.5K), and during that time the daily credits have dropped from 11K to under 6K.

I figure eventually this will flush thru the system and things return to the level I had increased to before this drive mapping issue.

For me, I tend to tolerate project problems when they fit two characteristics: that they are 'rifle shots' rather than 'machine gun' fire (compare Einstein, where problems are quite rare, to SETI, where problems are varied and very frequent); and that there is a sense of someone minding the store (what you are doing here) instead of either denial or, even more exasperating, silence regarding reports.

So I really appreciate you taking the time to provide explanations here.

You're welcome, and thank you for saying so. :-)

Soooo ..... gulp ..... now for some mild bad news it seems, which I've just viewed at the back end. As the DB server is the center of the storm here ( coherence is King ) it may be prudent to pause the project - cause it not to respond temporarily to users - for a couple of hours soon. Likely in the next 12 to 24 hours. This will enable all of this query business to proceed/catch-up without distraction and get the issue off the table for good.

Alas, the DB server has proven to be the limiting cog on this occasion. Something always is; logically it has to be one ( at least ) of the processes shown on the server status page.

The original validator problem ( incorrect mirroring of result files ) was rather unusual and has caused a special circumstance to occur in the logical position of the entire project's workflows. As they say in the classics - we didn't expect the unexpected!! :-)

Cheers, Mike.

( edit ) No need to panic about a project pause. A user's BOINC manager will just progressively back off on contacts with E@H and resume normal conversations afterwards. It may even slot in another project's work in the meantime, depending on your preferences/config. Just looking at the messages pane on the machine I'm writing with right now, it's OK at present. BTW, I run my BOINC instances with the work-fetch-debug flag/feature enabled; it's way more informative in these scenarios.
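To give a feel for what "progressively back off" can look like, here is a rough sketch of a generic randomized exponential backoff. It is only an illustration: the function name, the 60-second base and the 4-hour cap are made-up values, and this is not the BOINC client's actual code or schedule.

```python
import random


def backoff_delays(attempts, base=60.0, cap=4 * 3600.0, seed=None):
    """Return wait times (seconds) for successive failed contacts.

    Hypothetical constants: base and cap are illustrative only,
    not values taken from the BOINC client.
    """
    rng = random.Random(seed)
    delays = []
    for n in range(attempts):
        # The ceiling doubles after every failure until it hits the cap.
        ceiling = min(cap, base * 2 ** n)
        # Randomize within the upper half of the window so that many
        # clients do not all retry at the same instant.
        delays.append(rng.uniform(ceiling / 2, ceiling))
    return delays


if __name__ == "__main__":
    for attempt, delay in enumerate(backoff_delays(8, seed=1), start=1):
        print(f"failed contact {attempt}: wait about {delay / 60:.1f} min")
```

Once a contact succeeds, a client using a scheme like this would simply reset to the base delay, which matches the "resume normal conversations afterwards" behaviour described above.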

( edit ) As for project management overall, I ought to point out that the E@H admins and developers at the various worldwide locations kindly take the time to give the moderators timely snippets/updates on what they're up to as they go through their daily work ( via a dedicated mailing list ). Then hopefully the likes of myself can pass that coherently on to the multitude .... :-)

( edit ) Ah, Bernd has just updated matters in another thread. You know, I'm not sure when the guy sleeps .......

I have made this letter longer than usual because I lack the time to make it shorter ... Blaise Pascal

... and my other CPU is a Ryzen 5950X :-)

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7052594931

RAC: 1623437

Just an update on how this

5 Oct 2010 17:26:57 UTC

Message 99528

Just an update on how this looks on my hosts.

ABP2 downloads have indeed resumed. I run relatively long queues (most target 6 days) so on most of my hosts the problem (10-fold ABP2) results had not yet started returning at the onset of the server limitations.

But two of the hosts were already returning 10-folds, and the others have since joined in. As of this writing the 10-fold returns show as pending, not invalid, but it appears none of them for my flotilla have validated. On the other hand, the 4-folds which got validation postponement while things were being worked have been validating away at a great rate over the last few hours. Spot checking just before writing this note suggested that the considerable majority of my remaining 4-folds in pending are actually waiting for a quorum partner to return a result, not sitting with fulfilled quorum awaiting judgment from a validation task--though I was still easily able to find a few of those.

So from where I sit, ongoing operations have resumed well enough not to trouble a moderately diligent cruncher who does not scrutinize daily RAC and such closely, but close watchers will see clear signs of abnormality until a process for clearing the 10-folds dispatched during the problem period (and subsequently?) is operating well enough to make a big dent in a considerable backlog.

I'd like to add my voice to those praising the relatively fast response and the considerable amount of informative communication in this unfortunate event. It would be better had it not happened, but it is good to see capable incident response.

This note feels like it belongs in Cruncher's Corner--but this seems the active thread on the topic at the moment. I'll not protest at all if it is modded elsewhere.

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7052594931

RAC: 1623437

RE: As of this writing the

5 Oct 2010 23:23:31 UTC

Message 99529 in response to message 99528

Quote:

As of this writing the 10-fold returns show as pending, not invalid, but it appears none of them for my flotilla have validated.

Not true now, and it may not have been true when I wrote this. I actually have two 10-fold ABP2 jobs sent to me on October 3 which have validated. However, none of the couple of dozen 10-fold ABP2 jobs my hosts have returned which were sent out between September 29 and 22:24 on October 2 have validated, and many (most, I think) have reached quorum. At least one I checked has six returns from six separate hosts! (Of course, when they were being marked invalid, that meant they were getting sent out again.)

archae86

Joined: 6 Dec 05

Posts: 3145

Credit: 7052594931

RAC: 1623437

Things on my flotilla have

6 Oct 2010 20:48:06 UTC

Message 99530

Things on my flotilla have taken a turn for the worse.

Many validate errors are now shown. The common features at the moment seem to be:

1. all are ABP2
2. nearly all are 10-fold jobs
3. dates they were sent to my hosts mostly range from 29 Sep 16:58 to 1 Oct 6:06
4. dates my hosts returned them mostly range from 5 Oct 19:00 to 6 Oct 17:18
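For anyone wanting to check their own task lists against the four features above, the following is a minimal sketch of such a filter. The record layout (`app`, `fold`, `sent`, `returned`) is a hypothetical one chosen for illustration, not a format the project exports, and the check is strict even though the list says "nearly all" and "mostly".

```python
from datetime import datetime

# Windows taken from the list above; all times are UTC, year 2010.
SENT_WINDOW = (datetime(2010, 9, 29, 16, 58), datetime(2010, 10, 1, 6, 6))
RETURNED_WINDOW = (datetime(2010, 10, 5, 19, 0), datetime(2010, 10, 6, 17, 18))


def matches_pattern(task):
    """True if a task record shows all four common features.

    `task` is a hypothetical dict with keys: app, fold, sent, returned.
    """
    return (
        task["app"] == "ABP2"
        and task["fold"] == 10
        and SENT_WINDOW[0] <= task["sent"] <= SENT_WINDOW[1]
        and RETURNED_WINDOW[0] <= task["returned"] <= RETURNED_WINDOW[1]
    )


if __name__ == "__main__":
    # Made-up example record, not a real task from any host.
    example = {
        "app": "ABP2",
        "fold": 10,
        "sent": datetime(2010, 9, 30, 8, 0),
        "returned": datetime(2010, 10, 6, 9, 30),
    }
    print(matches_pattern(example))
```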

I wonder if the fixes adopted so far work properly for results reported after sometime on October 5?

Or perhaps what I am seeing is just a transient state as the fix is applied.

I'm pretty sure there is not a common fault which suddenly cropped up on four separate PCs.

Here is the invalid list for one of my hosts.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109878279306

RAC: 30578175

RE: Things on my flotilla

7 Oct 2010 1:02:28 UTC

Message 99531 in response to message 99530

Quote:

Things on my flotilla have taken a turn for the worse.

Thanks for reporting this. If it's any consolation, I'm seeing exactly the same. I've sent a message to Bernd about it since there seems to be a problem with the 'recovery procedure' - hopefully only a temporary problem.

Here is the content of my message. Note that the 'strategy' being referred to in the first sentence involved changing the minimum quorum from 2 to 21, to prevent a quorum being reached and the validator being called prematurely, and at the same time setting the initial task replication to -1 instead of 2, so that no extra tasks would be generated.
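The intent of that strategy can be sketched in a few lines. This is only a toy model of the logic as described in this post; the function names are hypothetical, and treating a negative replication as "never create more tasks" is an assumption about the intended effect, not a statement of how the BOINC server code actually handles it.

```python
def validator_should_run(successful_results, min_quorum):
    """The validator is only called once enough results are in."""
    return successful_results >= min_quorum


def extra_tasks_needed(results_so_far, initial_replication):
    """How many further tasks a workunit would generate.

    Assumption for this sketch: a negative replication value means
    "never create additional tasks".
    """
    if initial_replication < 0:
        return 0
    return max(0, initial_replication - results_so_far)


if __name__ == "__main__":
    # Normal settings: two results reach quorum and trigger validation.
    print(validator_should_run(2, min_quorum=2))
    # Holding pattern: even six results stay below a quorum of 21.
    print(validator_should_run(6, min_quorum=21))
    # With replication set to -1, no extra tasks are requested.
    print(extra_tasks_needed(1, initial_replication=-1))
```

Under these assumptions a workunit just sits as 'pending' however many results arrive, until the quorum is put back to its normal value.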

Quote:

Hi Bernd,

I saw yesterday that you must have invoked this strategy, since all the x10 pendings on my hosts that I examined were showing a minimum quorum of 21 and an initial replication of -1. At that stage (24 hours ago) I didn't see any validate errors - just 'pending' for the x10 tasks. I also saw x4 tasks that were validating OK so everything was looking good.

Today, there appears to be a problem. I saw your post in the news thread that you made about 9 hours ago (12:30AM local time here) regarding the transfer of 65K results (out of 140K total results for 70K WUs, I presume) and the expectation that these results should validate. On inspection of a couple of my hosts at random, I see lots of x10 validate errors and not a single example of a valid result. There are still lots of x10 pendings and these appear to be normal - 2nd result not yet in or too recently completed to have been transferred across yet. All the x4 results appear to be quite OK.
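As a quick check of the figures in the paragraph above (which are themselves a presumption on the writer's part), the arithmetic works out as follows; the variable names are just for illustration.

```python
# Figures as quoted above; the 140K and 70K totals are presumed, not confirmed.
transferred_results = 65_000
total_results = 140_000
total_workunits = 70_000

# Two results per workunit, consistent with a normal quorum of 2.
results_per_wu = total_results / total_workunits

# Share of all affected results covered by the transfer.
transferred_share = 100 * transferred_results / total_results

print(f"results per WU: {results_per_wu:.0f}")
print(f"share transferred: {transferred_share:.1f}%")
```

So a little under half of the presumed total had been transferred at that point, which fits with many x10 results still showing as pending.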

I'm not the only person seeing this continuing problem with x10 results. Peter Stoll (archae86) has recently posted exactly the same thing in your Technical News thread about the 5min scheduler disabling pauses. This thread has morphed into a discussion about increasing pendings and hence the underlying cause of this - the incorrect file upload URL.

A common scenario with the simplest of these WUs seems to be:

* Initial x10 tasks sent out in the 'problem' period (ie around 29th, 30th Sept or so)
* One task returned relatively quickly and which remains 'validation inconclusive'
* 2nd task returned much later - around 4th, 5th October and which now shows 'validate error'
* New task issued to a third host on 6th October.

There are variations on the above of course, particularly when comp errors are involved.

Just like Peter does, I keep a relatively large cache of work on all my hosts (~6 days) so the original problem was noticed (and the cause identified) before any x10 tasks were even started on any of my hosts. I even contemplated visiting each machine and editing the state file to fix the incorrect URL. To get a feel for the time that would be involved, I actually did the correction on one host with a global search and replace in a text editor. I also wrote a one-line sed script for all my Linux hosts so it would have been quite feasible to do it. In the end, I decided not to make the effort in case it impacted on what you were trying to achieve. Interestingly, the one host I did 'convert' (and where the results would have been uploaded to the correct server) is not showing any benefit from the conversion. It too has lots of the same validate errors.
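The kind of global search and replace described above can be sketched as below. This is not the sed script mentioned in the message, which was not posted; the URLs and the sample text are made-up placeholders (the sample is not the real state file layout), and any real edit would presumably be done with the BOINC client stopped and a backup of the file kept.

```python
def replace_upload_url(state_text, wrong_url, right_url):
    """Replace every occurrence of wrong_url and report how many were changed."""
    count = state_text.count(wrong_url)
    return state_text.replace(wrong_url, right_url), count


if __name__ == "__main__":
    # Hypothetical placeholder URLs, not the project's real servers.
    wrong = "http://wrong-host.example/upload"
    right = "http://right-host.example/upload"

    # Made-up fragment standing in for the contents of a state file.
    sample = (
        "<url>http://wrong-host.example/upload</url>\n"
        "<url>http://wrong-host.example/upload</url>\n"
    )
    fixed, changed = replace_upload_url(sample, wrong, right)
    print(f"{changed} occurrence(s) replaced")
    print(fixed)
```

Returning the count makes it easy to confirm on each host that the expected number of entries was touched before the client is restarted.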

The nasty side effect is that all these validate errors are generating extra tasks which are being sent out, despite the -1 trick which was supposed to stop generation of extra tasks (so I thought??).

Hopefully, you can identify the cause of this latest problem and do something to fix it.

Cheers,
Gary.

Cheers,
Gary.

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5845

Credit: 109878279306

RAC: 30578175

RE: Things on my flotilla

7 Oct 2010 22:38:25 UTC

Message 99532 in response to message 99530

Quote:

Things on my flotilla have taken a turn for the worse.

....

Here is the invalid list for one of my hosts.

Bernd has now given a further update over in the main news thread. Looks like things are very much on the improve. I notice your list of validate errors from the above link has now shrunk to just one. It's not really clear to me why that one still remains. Perhaps the result didn't get transferred across in time to make the validation run. I'm also a bit puzzled as to why the third task remains 'unsent'. With ABP, extra tasks seem to be sent out quite quickly when needed.

I haven't yet had time to check my hosts to see how widespread these 'misses' might be.

Cheers,
Gary.
