"Project is down" for 19 hours now

Mikkie
Joined: 2 Apr 07
Posts: 25
Credit: 242066
RAC: 0

Not worth waiting any longer. Project is almost 7 days down after I rejoined E@H 16 days ago.

"souls ain't born, souls don't die"

Bikeman (Heinz-Bernd Eggenstein)
Moderator
Joined: 28 Aug 06
Posts: 3522
Credit: 686159127
RAC: 549612

Message 92460 in response to message 92459

Quote:
Not worth waiting any longer. Project is almost 7 days down after I rejoined E@H 16 days ago.

The project experienced several days of downtime during April, mostly on weekends, but it was up most of the time, and certainly most of the time last week.

No question the current string of downtime is bad, but let's not exaggerate it either.

CU
Bikeman

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2139
Credit: 2752987905
RAC: 1373867

Message 92461 in response to message 92458

Quote:
Some of my hosts did get fresh work today. Until the problem is resolved (which will be announced on the home page), I'm afraid there will be phases where the scheduler is turned off and on repeatedly to see how the system reacts to different countermeasures for the problem until the performance is acceptable again. Please ignore the status page for the time being, generating it causes extra load on the database which would interfere with the problem solving right now.


None of mine did, but let that pass - not important.

I haven't seen any description yet of the nature of the additional load that the database is experiencing. Although there was the recent release of Windows app v3.04 and v3.05, these would affect downloads only, not the database. I haven't read of any change *** within Einstein *** that should have changed the database loading so spectacularly.

There is the possibility that an inwards migration of active users - say following a SETI outage - might have produced a temporary blip, but Einstein has coped with that before - it has never caused such sustained loading problems.

Which leaves a nagging concern. The first report of database loading issues on the Einstein front page is dated 12 April 2009 - three days after BOINC v6.6.20 was made the official "recommended" version. I have identified at least one situation where BOINC (v6.6.23 actually, but it's unchanged v6.6.20 code) can flood a server with spurious "Requesting 0.00 seconds of work" RPC calls - see my post on Friday in Not getting new WU's for CPU projects after upgrade to 6.6.20 on the BOINC message boards, and subsequent discussion.

Is the excessive database loading caused by an increase in the number of scheduler RPCs coming from external users? Or am I barking up the wrong tree?

Edit - judging by the new front page news since I started composing that, the load is indeed scheduler related. Alarm bells are starting to ring.

gravitysmith
Joined: 8 Nov 04
Posts: 55
Credit: 90243370
RAC: 9370

Message 92462 in response to message 92461

Quote:

Edit - judging by the new front page news since I started composing that, the load is indeed scheduler related. Alarm bells are starting to ring.

Another possibility is a problem with Norton AntiVirus which occurred around the same time. I had this myself, and for about a day I was going through WUs as fast as Einstein@Home would hand them out to me. I would get some WUs, and Norton would flag the data files as suspicious and remove them. At that point BOINC would error out when it tried to start a WU whose data file was gone. Those results were returned as errored out, but then Einstein would request more. This repeated non-stop until I figured out how to shut off the Norton advanced heuristics detection which was causing the problem. Because my Einstein resource share is so high, I quickly hit the limit of 1 WU/core/day, but other crunchers may take a little longer. I imagine there are lots of people using Norton ...

Gravitysmith

Gundolf Jahn
Joined: 1 Mar 05
Posts: 1079
Credit: 341280
RAC: 0

Message 92463 in response to message 92462

Quote:
Another possibility is a problem with Norton AntiVirus which occurred around the same time...


I don't think that would cause such a big problem, since Einstein initially only allows a "Maximum daily WU quota per CPU" of 16/day, unlike SETI, which has a quota of 100 WUs/CPU (and 400/GPU).
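As a rough illustration of that bound, here is a minimal sketch. It assumes the worst case, where a misbehaving host burns through its full daily quota on every CPU; the function name and the four-core host are made up for the example, and the quota figures are the ones quoted above.

```python
def max_failed_tasks_per_day(quota_per_cpu: int, cpus: int) -> int:
    """Worst-case number of tasks one host can receive (and error out) per day."""
    return quota_per_cpu * cpus


EINSTEIN_QUOTA = 16  # initial "Maximum daily WU quota per CPU" quoted above
SETI_QUOTA = 100     # SETI per-CPU quota quoted above

cpus = 4  # hypothetical quad-core host, chosen only for illustration
print("Einstein:", max_failed_tasks_per_day(EINSTEIN_QUOTA, cpus))
print("SETI:", max_failed_tasks_per_day(SETI_QUOTA, cpus))
```

Under these assumptions a single broken host is capped at a few dozen wasted tasks a day on Einstein, so it would take a large number of affected hosts to add up to a noticeable load.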

Regards,
Gundolf

Computers aren't everything in life. (Just kidding)

Bernd Machenschalk
Moderator
Administrator
Joined: 15 Oct 04
Posts: 4267
Credit: 244933143
RAC: 16332

Message 92464 in response to message 92461

Quote:
I haven't seen any description yet of the nature of the additional load that the database is experiencing.


Surely a number of things added together. We haven't spent much time investigating this yet, but we surely will.

Under heavy load our old DB server finally crashed, and we couldn't bring it back online for more than a last dump (which itself took several hours). So we put together the hardware we could find into a new, larger server (32G RAM) that is running now. While the additional memory surely helped, the performance of this new DB is still poor. We are looking into this and have already solved some issues. For now the project has been stopped again to run some offline analysis, indexing and other optimization on the database.

We'll continue to fix the problems as they occur. So far we've had all kinds of things, ranging from hardware failures to bugs in software not written by us (the last one fixed just minutes ago), all in just a few days.

I strongly intend to have the project up and running later today.

BM

archae86
Joined: 6 Dec 05
Posts: 3145
Credit: 7024814931
RAC: 1811359

I've had a few episodes on my hosts recently which have resulted in rapid rejection of my entire queue of Einstein tasks with "access is denied" error messages.

Until now I have assumed that this was my fault in failing to be present when my firewall (COMODO with the Defense+ option which requires initial authorization of new programs) popped up a permission request.

However, some of the failures appear to have been _not_ on new science apps. So I suppose there is a small chance that what I saw has also been happening to other users who don't run COMODO. If so, it could be contributing to the excessive database load.

By way of giving a signature, so others can chime in if they think they have seen the same thing, this is a typical set of log messages regarding the attempt to run a single WU during such an episode:

	Host	Project	Date	Message
	stoll3	Einstein@Home	4/20/2009 3:49:36 AM	Output file h1_0609.60_S5R4__123_S5R5a_0_0 for task h1_0609.60_S5R4__123_S5R5a_0 absent
	stoll3	Einstein@Home	4/20/2009 3:49:36 AM	Computation for task h1_0609.60_S5R4__123_S5R5a_0 finished
	stoll3	Einstein@Home	4/20/2009 3:49:36 AM	Reason: Unrecoverable error for result h1_0609.60_S5R4__123_S5R5a_0 (CreateProcess() failed - Access is denied. (0x5))
	stoll3	Einstein@Home	4/20/2009 3:49:36 AM	Deferring communication for 1 min 0 sec
	stoll3	Einstein@Home	4/20/2009 3:49:35 AM	[error] Process creation failed: Access is denied. (0x5)
	stoll3	Einstein@Home	4/20/2009 3:49:34 AM	[error] Process creation failed: Access is denied. (0x5)
	stoll3	Einstein@Home	4/20/2009 3:49:33 AM	[error] Process creation failed: Access is denied. (0x5)
	stoll3	Einstein@Home	4/20/2009 3:49:32 AM	[error] Process creation failed: Access is denied. (0x5)
	stoll3	Einstein@Home	4/20/2009 3:49:32 AM	[error] Process creation failed: Access is denied. (0x5)
	stoll3	Einstein@Home	4/20/2009 3:49:31 AM	Starting h1_0609.60_S5R4__123_S5R5a_0
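For anyone who wants to check their own message log for the same signature, here is a minimal sketch. It assumes the log has been saved as plain text lines shaped like the excerpt above; the function name is made up for the example, and the pattern is taken from the "Unrecoverable error" line in that excerpt.

```python
import re

# Matches the "Unrecoverable error" line from the excerpt above and
# captures the task name that follows "for result".
SIGNATURE = re.compile(
    r"Unrecoverable error for result (\S+) "
    r"\(CreateProcess\(\) failed - Access is denied\. \(0x5\)\)"
)


def tasks_denied(log_lines):
    """Return the names of tasks that failed with the 'Access is denied' signature."""
    failed = []
    for line in log_lines:
        match = SIGNATURE.search(line)
        if match and match.group(1) not in failed:
            failed.append(match.group(1))
    return failed


sample = [
    "stoll3\tEinstein@Home\t4/20/2009 3:49:36 AM\tReason: Unrecoverable error "
    "for result h1_0609.60_S5R4__123_S5R5a_0 "
    "(CreateProcess() failed - Access is denied. (0x5))",
    "stoll3\tEinstein@Home\t4/20/2009 3:49:35 AM\t[error] Process creation "
    "failed: Access is denied. (0x5)",
]
print(tasks_denied(sample))
```

Counting the names returned for a whole day's log gives a quick idea of how many tasks a host threw away in one episode.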
gaz
Joined: 11 Oct 05
Posts: 650
Credit: 1902306
RAC: 0

give the guys some slack, they will sort it
at least it's given my CPU a rest

Elphidieus
Joined: 20 Feb 05
Posts: 245
Credit: 20603702
RAC: 0

Message 92468 in response to message 92467

Quote:
give the guys some slack, they will sort it
at least it's given my CPU a rest

I feel restless whenever my CPUs are at rest...

gaz
Joined: 11 Oct 05
Posts: 650
Credit: 1902306
RAC: 0

chill out man, flower power and all that
(1960s)
