"Project is down" for 19 hours now

Winterknight
Joined: 4 Jun 05
Posts: 482
Credit: 120,611,204
RAC: 198,025

RE: RE: It always takes

Message 92480 in response to message 92474

Quote:
Quote:
It always takes longer than you expect, even when you take into account Hofstadter's Law.

I've just looked that up. What a beauty! Hofstadter is related to Murphy perhaps? .... :-)

Well I've taken the interruption in my stride : defragging a few disks, re-routing the cabling on my small farm ( gotta love those cable ties ), laying baits for the Dust Bunnies ....

.... that sort of thing. :-)

Cheers, Mike.


There is a formula that someone wrote for Hofstadter's Law, that says you should double the figure given and increase the units by one.

Unfortunately this becomes wildly variable if someone says half an hour and it is calculated using 30 minutes.

The example being: when BM hoped for "later in the day", should we have used half a day, which would become one week, or 12 hours, which would become 24 days?
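For anyone who wants to play with it, here is a rough sketch of that rule in Python. The function name and the unit ladder are my own assumptions; the rule as quoted only says "double the figure and increase the units by one", so which unit follows which is a guess.

```python
# Hypothetical unit ladder: the order (and skipping straight from days
# to weeks) is an assumption, not part of the quoted rule.
UNITS = ["second", "minute", "hour", "day", "week", "month", "year"]


def pad_estimate(value, unit):
    """Double the figure and move up one unit on the ladder."""
    index = UNITS.index(unit)
    if index + 1 >= len(UNITS):
        raise ValueError("no larger unit than " + unit)
    return value * 2, UNITS[index + 1]


# The same real duration gives very different answers depending on
# how it was phrased.
print(pad_estimate(0.5, "day"))     # half a day
print(pad_estimate(12, "hour"))     # 12 hours
print(pad_estimate(0.5, "hour"))    # half an hour
print(pad_estimate(30, "minute"))   # 30 minutes
```

Half a day comes out as 1 week while 12 hours comes out as 24 days, which is exactly the variability complained about above.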

I use Velcro cable ties if I think cable re-routing might be frequent, and spiral wrap if I know it is to be permanent. But then I was taught cable routing and installation in the days when we used waxed string, some time ago in the early '60s.

And when did you last clean that fan? Looks like my youngest son's computer. Which reminds me, as the weather is getting warmer, I must get him to clean it or else I am going to get the "Dad, my computer is making strange noises" and/or "Dad, my computer keeps crashing" message.

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Well, I have one suggestion

Well, I have one suggestion for the next restart attempt.

Given the duration of the difficulties here at EAH, which really date back to the fileserver crash of March 27th, it might be wise to shut off new work generation for a while at first and let the outstanding work in the field catch up a bit. I'm sure that I'm not the only one who has work near, at, or over the deadline, some of which has been uploaded and just needs to be reported.

I don't know if anyone else has noticed, but with the scheduler coming on and offline like a crazy monkey at times, it is possible to get the situation where you will get a 24 hour comm deferral from the CC due to comm sessions being interrupted abnormally. It should be noted that on some older CC versions, this deferral could be up to 1 week in length.

In my case, I'm not going to be too thrilled if one of my old timers gets burned for the 911+ KSecs of work it completed before the deadline, but has been trying to report for the last 36 hours.

General Note:

Please spare me any commentary about 'Green Computing', 'efficiency', I should shut off any of my machines, or any other claptrap like that.

Basically, it is no one else's business but mine what, where, or why I choose to run any given hardware on any given project.

Alinator

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 454,533,746
RAC: 45,375

Indeed. The longer the

Indeed. The longer the outage, the harder it gets to cope with the traffic. On a normal day, EAH would award (say) 15 million credits, so roughly 100,000 results are processed. A single scheduler request can fetch and report many tasks, but let's assume that it's one scheduler request per result, so there would be about 100,000 scheduler requests per day, or an all-day average of ca. 1.2 requests per second.

Now let there be a really long outage. Let's say most of the hosts run dry and will try once every hour to fetch new (and report old) work (most of my hosts are doing that right now). Assume 50k hosts, so 50k requests per hour or more than 13 requests per second, about 10 times what we calculated above. This might be a little bit simplified but you get the point.
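The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the same simplified model (one scheduler request per result on a normal day, one request per host per hour after the outage); the variable and function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_HOUR = 60 * 60


def requests_per_second(requests, period_seconds):
    """Average request rate over the given period."""
    return requests / period_seconds


# Normal day: roughly 100,000 results, assumed one request each.
normal = requests_per_second(100_000, SECONDS_PER_DAY)

# After a long outage: assumed 50,000 dry hosts retrying once per hour.
outage = requests_per_second(50_000, SECONDS_PER_HOUR)

ratio = outage / normal

print(f"normal day:   {normal:.2f} requests/s")
print(f"after outage: {outage:.2f} requests/s")
print(f"ratio:        {ratio:.1f}x")
```

With these assumptions the ratio works out to about 12, in line with the rough "about 10 times" figure.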

CU
Bikeman

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 54,550,165
RAC: 8,268

One of the problems here is

Message 92483 in response to message 92449

One of the problems here is perhaps one of communications approach. That is, to the extent the home page (and for that matter the message board) has information about the outage, it is (based on what seems to be going on) less than clear (database problems), and more optimistic than seems to be the case.

My sense is that the database performance problems may not yet be fully understood, and because of that perhaps not even 'incompletely' resolved. One approach to tech support suggests that it is important to relate a more optimistic message than may be warranted to the 'user', to reduce user angst. An alternative approach that has its advocates is along the lines of full and accurate disclosure. The first approach often works if the fix to a problem happens in a relatively short time frame. That first approach, though, can be counterproductive if the fix takes a significantly longer time.

Here we've seen a major problem (or problems) over a period approaching a month. Thus the optimistic report approach employed now has the opposite effect of its intent -- that is, optimistic reports tend to strain the credibility of the admin folks to the user community.

At this point those who are actually watching here KNOW there is a serious database problem. We don't know if the nature of that problem has been identified, or if there is actually a recovery plan being followed. It may well be a case of ongoing, not yet completed troubleshooting to identify the actual root cause; that is, the root cause has not yet been found (that sort of thing does happen), so the 'what to do next' is still indeterminate.

In any event, as a long time Einstein user, I lament the extended outage, but as an active multiproject BOINC user, I've simply suspended Einstein and let the other active projects pick up the cycles (as I'd suggest others do as well).

At some point (days or weeks) I am confident this project will come back to the ongoing reliable project that it has been over the years. Until then, we can only watch, wait, hope, and seek some amount of status clarity.

Quote:
Quote:
But there are no messages about it. What's up?

The database performance problems that are mentioned in the latest news item on the project home page are not completely resolved yet. The project admins are working on it, please stay tuned.

CU
Bikeman


Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

@ Bikeman: Yep, and that

@ Bikeman:

Yep, and that isn't even taking into account things like folks who whip up and publish for general consumption 'gadfly' fetch scripts to bypass the builtin inadvertent DoS features of BOINC because they figure their hosts are so much more important than someone else's they're entitled to 'cut to the front of the line' all the time. ;-)

You only have to examine the net output from another popular project to see where that kind of thinking leads to. :-/

Alinator

MAGIC Quantum Mechanic
Joined: 18 Jan 05
Posts: 1,303
Credit: 415,647,247
RAC: 84,008

Requesting 961229 seconds of

Requesting 961229 seconds of work, reporting 14 completed tasks
4/22/2009 9:58:44 AM|Einstein@Home|Scheduler request succeeded: got 0 new tasks
4/22/2009 9:58:44 AM|Einstein@Home|Message from server: Project is temporarily shut down for maintenance
4/22/2009 10:09:35 AM|lhcathome|Sending scheduler request: To fetch work. Requesting 962899 seconds of work, reporting 0 completed tasks
4/22/2009 10:09:40 AM|lhcathome|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:24:51 AM|lhcathome|Sending scheduler request: To fetch work. Requesting 965222 seconds of work, reporting 0 completed tasks
4/22/2009 10:24:56 AM|lhcathome|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:52:29 AM|Einstein@Home|Sending scheduler request: Requested by user. Requesting 968582 seconds of work, reporting 14 completed tasks
4/22/2009 10:52:34 AM|Einstein@Home|Scheduler request succeeded: got 0 new tasks
4/22/2009 10:52:34 AM|Einstein@Home|Message from server: Server error: feeder not running


 

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9,352,143
RAC: 0

Yes, I've been getting that

Yes, I've been getting that now for a while too.

It's an improvement anyway, in that you won't get hit with the big 24 hour (up to 1 week) deferral as long as you can get through to the scheduler normally, even if you get turned away for whatever you were requesting. ;-)

Alinator

DanNeely
Joined: 4 Sep 05
Posts: 1,313
Credit: 1,698,849,766
RAC: 917,238

RE: RE: I've been running

Message 92487 in response to message 92477

Quote:
Quote:
I've been running a 4 day cache for several years to ride out outages, but I've just had to activate my backup project on 3 of my 4 computers (my netbook of all things managed to catch the server during a brief period of uptime).

Your netbook is taking 26 HOURS to finish just one unit, while your wingman is taking about 4!!! Those N270 processors are not very efficient at this kind of stuff are they? I see you are crunching 2 units at once, how is it as a pc? My wife wants one and I am making her use Linux, like on this pc, so we can get more than 1 gig of memory in one. Damn MS limits any XP Home pc to a max of 1 gig of memory when it is sold, before upgrading!!!! Just dumb IMO!

Reasonably well. It's not a powerhouse by any means and the 1024x600 screen is annoying at times (lots of vertical scrolling in longer web pages), but if you're just using it as a web browser it's more than capable. The RAM limit's a non-issue; for the sorts of tasks that the CPU can handle, 1GB of RAM's plenty. On my Wind, access to install more is trivial. Popping the bottom off the chassis is easier than popping most of the access panels on my 15" Acer.

Its FPU is more limited than the rest of the system (vs my other core1 laptop and athlonx2 desktop it's a 3:1 performance ratio vs 2:1). 24hrs is about the shortest I've seen for a WU on it; the longer ones from this run have taken up to 40ish.

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3,516
Credit: 454,533,746
RAC: 45,375

Dan, your i7 is looking real

Dan, your i7 is looking real nice, performance-wise. Is it running at stock speed or OC'ed? It will sure have to report quite a lot of results once the scheduler is up again.

RandyC
Joined: 18 Jan 05
Posts: 2,827
Credit: 111,004,849
RAC: 19,730

From the Home

From the Home page:

Quote:


Apr 22, 2009

We think we have found a simple fix for the database problems. The database has grown to 45 GB in size and has gotten too large for the physical memory of the machine that hosts it. However it turns out that due to some mistakes made in the project operations during the past weeks, about 80% of the work in the database is already completed. So we are running a db_purge task tonight that should remove this already-completed work from the database, leaving only the work-in-progress still in place in the database. If this is successful, then Einstein@Home should be back up and operating normally in about another 24 hours. Thank you for your patience!

45GB is overdoing it a LITTLE bit I'd say.

Seti Classic Final Total: 11446 WU.
