"Project is down" for 19 hours now

Winterknight

Joined: 4 Jun 05

Posts: 1478

Credit: 383815754

RAC: 403012

RE: RE: It always takes

22 Apr 2009 12:42:48 UTC

Message 92480 in response to message 92474

Quote:

Quote:
It always takes longer than you expect, even when you take into account Hofstadter's Law.

I've just looked that up. What a beauty! Hofstadter is related to Murphy perhaps? .... :-)

Well I've taken the interruption in my stride : defragging a few disks, re-routing the cabling on my small farm ( gotta love those cable ties ), laying baits for the Dust Bunnies ....

.... that sort of thing. :-)

Cheers, Mike.

There is a formula that someone wrote for Hofstadter's Law, that says you should double the figure given and increase the units by one.

Unfortunately this becomes wildly variable if someone says half an hour and it is calculated using 30 minutes.

The example being, when BM hoped for later in the day: should we have used half a day, which would be one week, or 12 hours, which would be 24 days?
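As a rough sketch of that rule (double the figure, then move up one unit), assuming a unit ladder of my own choosing, since the post only implies hours to days and days to weeks. The function name hofstadter_estimate is hypothetical:

```python
# Assumed unit ladder; only hours -> days and days -> weeks come from the post.
UNITS = ["seconds", "minutes", "hours", "days", "weeks", "months", "years"]


def hofstadter_estimate(figure, unit):
    """Double the figure and move up one unit on the ladder."""
    index = UNITS.index(unit)
    if index + 1 >= len(UNITS):
        raise ValueError("no larger unit available")
    return figure * 2, UNITS[index + 1]


print(hofstadter_estimate(0.5, "days"))   # "half a day"
print(hofstadter_estimate(12, "hours"))   # the same span stated in hours
```

The two calls describe the same span of time but land on one week and 24 days respectively, which is the variability complained about above.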

I use Velcro cable ties if I think cable re-routing might be frequent. Spiral wrap if I know it is to be permanent. But then I was taught cable routing and installation in the days when we used waxed string. Some time ago in the early 60's.

And when did you last clean that fan? Looks like my youngest son's computer. Which reminds me, as the weather is getting warmer, I must get him to clean it or else I am going to get the "Dad, my computer is making strange noises" and/or "Dad, my computer keeps crashing" message.

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

Well, I have one suggestion

22 Apr 2009 15:37:42 UTC

Message 92481

Well, I have one suggestion for the next restart attempt.

Given the duration of the difficulties here at EAH, which really date back to the fileserver crash of March 27th, it might be wise to shut off new work generation for a while at first and let the outstanding work in the field catch up a bit. I'm sure that I'm not the only one who has work near, at, or over the deadline, some of which has been uploaded and just needs to be reported.

I don't know if anyone else has noticed, but with the scheduler coming on and offline like a crazy monkey at times, it is possible to get the situation where you will get a 24 hour comm deferral from the CC due to comm sessions being interrupted abnormally. It should be noted that on some older CC versions, this deferral could be up to 1 week in length.

In my case, I'm not going to be too thrilled if one of my old timers gets burned for the 911+ KSecs of work it completed before the deadline, but has been trying to report for the last 36 hours.

General Note:

Please spare me any commentary about 'Green Computing', 'efficiency', I should shut off any of my machines, or any other claptrap like that.

Basically, it is no one else's business but mine what, where, or why I choose to run any given hardware on any given project.

Alinator

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 757240856

RAC: 1164083

Indeed. The longer the

22 Apr 2009 16:44:50 UTC

Message 92482

Indeed. The longer the outage, the harder it gets to cope with the traffic. On a normal day, EAH would award (say) 15 million credits, so roughly 100,000 results are processed. A single scheduler request can fetch and report many tasks, but let's assume that it's one scheduler request per result, so there would be about 100,000 scheduler requests per day, or an all-day average of ca. 1.2 requests per second.

Now let there be a really long outage. Let's say most of the hosts run dry and will try once every hour to fetch new (and report old) work (most of my hosts are doing that right now). Assume 50 k hosts, so 50k requests per hour or more than 13 requests per second, about 10 times what we calculated above. This might be a little bit simplified but you get the point.
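For anyone who wants to redo the back-of-the-envelope numbers, here is a minimal sketch using only the figures assumed in this post (100,000 requests per normal day, 50k hosts retrying once per hour). The helper name requests_per_second is made up for illustration:

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def requests_per_second(requests, period_seconds):
    """Average request rate over the given period."""
    return requests / period_seconds


# Figures taken from the post; they are rough assumptions, not measurements.
normal = requests_per_second(100_000, SECONDS_PER_DAY)
outage = requests_per_second(50_000, SECONDS_PER_HOUR)

print(f"normal day:   {normal:.2f} requests/s")
print(f"after outage: {outage:.2f} requests/s")
print(f"ratio:        {outage / normal:.1f}x")
```

With these inputs the ratio comes out at roughly an order of magnitude, in line with the "about 10 times" estimate.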

CU
Bikeman

BarryAZ

Joined: 8 May 05

Posts: 190

Credit: 325576842

RAC: 16492

One of the problems here is

22 Apr 2009 17:27:29 UTC

Message 92483 in response to message 92449

One of the problems here is perhaps one of communications approach. That is, to the extent the home page (and for that matter the message board) has information about the outage, it is (based on what seems to be going on) less than clear (database problems), and more optimistic than seems to be the case.

My sense is that the database performance problems may not yet be fully understood, and because of that perhaps not even 'incompletely' resolved. One approach to tech support holds that it is important to give the 'user' a more optimistic message than may be warranted, to reduce user angst. An alternative approach that has its advocates is along the lines of full and accurate disclosure. The first approach often works if the fix to a problem happens in a relatively short time frame. That first approach, though, can be counterproductive if the fix takes significantly longer.

Here we've seen a major problem (or problems) over a period approaching a month. Thus the optimistic report approach employed now has the opposite effect of its intent -- that is, optimistic reports tend to strain the credibility of the admin folks to the user community.

At this point those who are actually watching here KNOW there is a serious database problem. We don't know if the nature of that problem has been identified, or if there is actually a recovery plan being followed. It may well be a case of ongoing, not yet completed troubleshooting to identify the actual root cause -- that is, the root cause has not yet been found (that sort of thing does happen), so the 'what to do next' is still indeterminate.

In any event, as a long time Einstein user, I lament the extended outage, but as an active multiproject BOINC user, I've simply suspended Einstein and let the other active projects pick up the cycles (as I'd suggest others do as well).

At some point (days or weeks) I am confident this project will come back to the ongoing reliable project that it has been over the years. Until then, we can only watch, wait, hope, and seek some amount of status clarity.

Quote:

Quote:
But there are no messages about it. What's up?

The database performance problems that are mentioned in the latest news item on the project home page are not completely resolved yet. The project admins are working on it, please stay tuned.

CU
Bikeman

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

@ Bikeman: Yep, and that

22 Apr 2009 17:31:58 UTC

Message 92484

@ Bikeman:

Yep, and that isn't even taking into account things like folks who whip up and publish for general consumption 'gadfly' fetch scripts to bypass the builtin inadvertent DoS features of BOINC because they figure their hosts are so much more important than someone else's they're entitled to 'cut to the front of the line' all the time. ;-)

You only have to examine the net output from another popular project to see where that kind of thinking leads to. :-/

Alinator

MAGIC Quantum M...

Joined: 18 Jan 05

Posts: 1907

Credit: 1438786145

RAC: 1220251

Requesting 961229 seconds of

22 Apr 2009 18:01:20 UTC

Message 92485

Requesting 961229 seconds of work, reporting 14 completed tasks
4/22/2009 9:58:44 AM|Einstein@Home|Scheduler request succeeded: got 0 new tasks
4/22/2009 9:58:44 AM|Einstein@Home|Message from server: Project is temporarily shut down for maintenance
4/22/2009 10:09:35 AM|lhcathome|Sending scheduler request: To fetch work. Requesting 962899 seconds of work, reporting 0 completed tasks
4/22/2009 10:09:40 AM|lhcathome|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:24:51 AM|lhcathome|Sending scheduler request: To fetch work. Requesting 965222 seconds of work, reporting 0 completed tasks
4/22/2009 10:24:56 AM|lhcathome|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:52:29 AM|Einstein@Home|Sending scheduler request: Requested by user. Requesting 968582 seconds of work, reporting 14 completed tasks
4/22/2009 10:52:34 AM|Einstein@Home|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:52:34 AM|Einstein@Home|Message from server: Server error: feeder not running
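If you want to pull just the server replies out of a message log like the one above, a minimal sketch follows. It assumes each complete line has the shape timestamp|project|message as seen in this paste; the function names parse_line and server_messages are hypothetical and not part of BOINC:

```python
def parse_line(line):
    """Split 'timestamp|project|message' into three fields, or return None."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None  # e.g. a truncated line without the leading fields
    return tuple(parts)


def server_messages(lines):
    """Return (project, text) for every 'Message from server:' entry."""
    prefix = "Message from server: "
    found = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2].startswith(prefix):
            found.append((parsed[1], parsed[2][len(prefix):]))
    return found


# Sample lines copied from the log above.
log = [
    "4/22/2009 9:58:44 AM|Einstein@Home|Scheduler request succeeded: got 0 new tasks",
    "4/22/2009 9:58:44 AM|Einstein@Home|Message from server: Project is temporarily shut down for maintenance",
    "4/22/2009 10:09:40 AM|lhcathome|Scheduler request succeeded: got 0 new tasks",
    "4/22/2009 10:52:34 AM|Einstein@Home|Message from server: Server error: feeder not running",
]

for project, text in server_messages(log):
    print(project, "->", text)
```

Run against the sample, this shows the change from the maintenance message to the "feeder not running" error within the hour.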

Alinator

Joined: 8 May 05

Posts: 927

Credit: 9352143

RAC: 0

Yes, I've been getting that

22 Apr 2009 18:44:44 UTC

Message 92486

Yes, I've been getting that now for a while too.

It's an improvement anyway, in that you won't get hit with the big 24 hour up to 1 week deferral as long as you can get through to the scheduler normally, even if you get turned away for whatever you were requesting. ;-)

Alinator

DanNeely

Joined: 4 Sep 05

Posts: 1364

Credit: 3562358667

RAC: 0

RE: RE: I've been running

22 Apr 2009 21:51:02 UTC

Message 92487 in response to message 92477

Quote:

Quote:
I've been running a 4 day cache for several years to ride out outages, but I've just had to activate my backup project on 3 of my 4 computers (my netbook of all things managed to catch the server during a brief period of uptime).

Your netbook is taking 26 HOURS to finish just one unit, while your wingman is taking about 4!!! Those N270 processors are not very efficient at this kind of stuff are they? I see you are crunching 2 units at once, how is it as a pc? My wife wants one and I am making her use Linux, like on this pc, so we can get more than 1 gig of memory in one. Damn MS limits any XP Home pc to a max of 1 gig of memory when it is sold, before upgrading!!!! Just dumb IMO!

Reasonably well. It's not a powerhouse by any means and the 1024x600 screen is annoying at times (lots of vertical scrolling in longer web pages), but if you're just using it as a web browser it's more than capable. The RAM limit's a non-issue; for the sorts of task that the CPU can handle, 1GB of RAM's plenty. On my Wind, access to install more is trivial. Popping the bottom off the chassis is easier than popping most of the access panels on my 15" Acer.

Its FPU is more limited than the rest of the system (vs my other core1 laptop and athlonx2 desktop it's a 3:1 performance ratio vs 2:1). 24hrs is about the shortest I've seen for a WU on it; the longer ones from this run have taken up to 40ish.

Bikeman (Heinz-...

Moderator

Joined: 28 Aug 06

Posts: 3522

Credit: 757240856

RAC: 1164083

Dan, your i7 is looking real

22 Apr 2009 23:03:21 UTC

Message 92488

Dan, your i7 is looking real nice, performance-wise. Is it running at stock speed or OC'ed? It will sure have to report quite a lot of results once the scheduler is up again.

RandyC

Joined: 18 Jan 05

Posts: 6680

Credit: 111139797

RAC: 0

From the Home

23 Apr 2009 0:03:48 UTC

Message 92489

From the Home page:

Quote:

Apr 22, 2009

We think we have found a simple fix for the database problems. The database has grown to 45 GB in size and has gotten too large for the physical memory of the machine that hosts it. However it turns out that due to some mistakes made in the project operations during the past weeks, about 80% of the work in the database is already completed. So we are running a db_purge task tonight that should remove this already-completed work from the database, leaving only the work-in-progress still in place in the database. If this is successful, then Einstein@Home should be back up and operating normally in about another 24 hours. Thank you for your patience!

45GB is overdoing it a LITTLE bit I'd say.
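A quick sanity check on what the purge might leave behind, assuming (and this is only an assumption) that database size scales linearly with the share of work kept. The function name size_after_purge is hypothetical:

```python
def size_after_purge(total_gb, completed_percent):
    """Estimate the remaining size if size is proportional to work kept."""
    return total_gb * (100 - completed_percent) / 100


# 45 GB and 80% are the figures from the news item quoted above.
print(size_after_purge(45, 80))
```

In practice indexes and per-row overhead mean the real figure could differ noticeably from this estimate.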

Seti Classic Final Total: 11446 WU.