Some of my hosts did get fresh work today. Until the problem is resolved (which will be announced on the home page), I'm afraid there will be phases where the scheduler is turned off and on repeatedly, so we can see how the system reacts to different countermeasures until performance is acceptable again. Please ignore the status page for the time being; generating it causes extra load on the database, which would interfere with the problem solving right now.
None of mine did, but let that pass - not important.
I haven't seen any description yet of the nature of the additional load that the database is experiencing. Although there was the recent release of the Windows apps v3.04 and v3.05, these would affect downloads only, not the database. I haven't read of any change *** within Einstein *** that should have changed the database loading so spectacularly.
There is the possibility that an inwards migration of active users - say following a SETI outage - might have produced a temporary blip, but Einstein has coped with that before - it has never caused such sustained loading problems.
Which leaves a nagging concern. The first report of database loading issues on the Einstein front page is dated 12 April 2009 - three days after BOINC v6.6.20 was made the official "recommended" version. I have identified at least one situation where BOINC (v6.6.23 actually, but it's unchanged v6.6.20 code) can flood a server with spurious "Requesting 0.00 seconds of work" RPC calls - see my post on Friday in Not getting new WU's for CPU projects after upgrade to 6.6.20 on the BOINC message boards, and subsequent discussion.
Is the excessive database loading caused by an increase in the number of scheduler RPCs coming from external users? Or am I barking up the wrong tree?
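To illustrate what I mean: after an empty reply, a client ought to back off before asking again, roughly along the lines of the sketch below. This is only an illustration of the expected behaviour, not actual BOINC code, and every name and constant in it is invented.

# Sketch of client-side back-off after unproductive scheduler RPCs.
# Illustrative only - not BOINC source; names and constants are invented.
import random
import time

MIN_DELAY = 60           # seconds to wait after the first empty reply
MAX_DELAY = 4 * 3600     # never defer longer than a few hours

def next_delay(previous):
    """Double the deferral after each unproductive RPC, with some jitter."""
    if previous is None:
        return MIN_DELAY
    return min(previous * 2, MAX_DELAY) * random.uniform(0.9, 1.1)

def fetch_work_loop(scheduler_rpc):
    """Keep asking for work, but never hammer the server with repeat requests."""
    delay = None
    while True:
        seconds_granted = scheduler_rpc()   # hypothetical call returning seconds of work granted
        if seconds_granted > 0:
            delay = None                    # got work, reset the back-off
        else:
            delay = next_delay(delay)       # the "Requesting 0.00 seconds of work" case
            time.sleep(delay)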
Edit - judging by the new front page news since I started composing that, the load is indeed scheduler related. Alarm bells are starting to ring.
Another possibility is a problem with Norton AntiVirus which occurred around the same time. I had this myself for about a day: I was going through WUs as fast as Einstein@Home would hand them out to me. I would get some WUs, Norton would flag the data files as suspicious and remove them, and BOINC would then error out when it tried to start a WU because its data file was gone. Those results were returned as errored out, and Einstein would then send more. This repeated non-stop until I figured out how to shut off the Norton advanced heuristics detection which was causing the problem. Because my Einstein resource share is so high, I quickly hit the limit of 1 WU/core/day, but other crunchers may take a little longer. I imagine there are lots of people using Norton ...
Gravitysmith
I don't think that would cause such a big problem, since Einstein initially only allows a "Maximum daily WU quota per CPU" of 16/day, unlike SETI, which has a quota of 100 WUs/CPU (and 400/GPU).
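To put those numbers in perspective, here is a toy model of how a shrinking daily quota chokes off an error loop like the one described above. It is only a sketch: the halving rule is just an assumption, not the actual BOINC scheduler logic.

# Toy model of a per-host daily quota shrinking on errored results.
# Not the real scheduler code; the halving rule is assumed for illustration.
START_QUOTA = 16   # Einstein's initial "Maximum daily WU quota per CPU"

def update_quota(quota, result_ok):
    """Grow the quota on valid results, shrink it on errors (floor of 1)."""
    if result_ok:
        return min(quota * 2, START_QUOTA)
    return max(quota // 2, 1)

# A host whose antivirus deletes every data file errors out task after task:
quota, errored = START_QUOTA, 0
while quota > 1:
    errored += 1
    quota = update_quota(quota, result_ok=False)
print(f"{errored} errored tasks before the host is throttled to 1 WU/core/day")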
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
I haven't seen any description yet of the nature of the additional load that the database is experiencing.
Surely a number of things added together; we haven't spent much time investigating this yet, but we certainly will.
Under heavy load our old DB server finally crashed, and we couldn't bring it back online for anything more than a last dump (which itself took several hours). So we put together what hardware we could find into a new, larger server (32 GB RAM) that is running now. While the additional memory certainly helped, the performance of this new DB is still poor. We are looking into this and have already solved some issues. For now the project has been stopped again to run some offline analysis, indexing and other optimization on the database.
We'll continue to fix the problems as they occur. So far we have had all kinds of things, ranging from hardware failures to bugs in software not written by us (the last one fixed just minutes ago), all within just a few days.
I strongly intend to have the project up and running later today.
BM
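For those wondering what "offline analysis, indexing and other optimization" typically involves on a MySQL-backed BOINC database, a rough sketch is below. The table names are the usual BOINC ones; the credentials, database name and the example index are placeholders, not what the project actually ran.

# Sketch of routine MySQL maintenance on a BOINC database.
# Credentials, database name and the example index are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="boincadm",
                               password="change_me", database="einstein")
cur = conn.cursor()

for table in ("result", "workunit", "host"):   # typically the largest BOINC tables
    cur.execute(f"ANALYZE TABLE {table}")      # refresh index statistics
    cur.fetchall()
    cur.execute(f"OPTIMIZE TABLE {table}")     # defragment / rebuild
    cur.fetchall()

# Example of adding an index a slow query might be missing (hypothetical):
cur.execute("CREATE INDEX res_state ON result (server_state)")

conn.commit()
cur.close()
conn.close()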
I've had a few episodes on my hosts recently which have resulted in rapid rejection of my entire queue of Einstein tasks with "access is denied" error messages.
Until now I have assumed that this was my fault in failing to be present when my firewall (COMODO with the Defense+ option which requires initial authorization of new programs) popped up a permission request.
However, some of the failures appear to have been _not_ on new science apps. So I suppose there is a small chance that what I saw has been happening to other non-COMODO users. If so, it could be a contribution to the excessive database load.
By way of giving a signature, so others can chime in if they think they have seen the same thing, this is a typical set of log messages regarding the attempt to run a single WU during such an episode:
Host Project Date Message
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Output file h1_0609.60_S5R4__123_S5R5a_0_0 for task h1_0609.60_S5R4__123_S5R5a_0 absent
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Computation for task h1_0609.60_S5R4__123_S5R5a_0 finished
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Reason: Unrecoverable error for result h1_0609.60_S5R4__123_S5R5a_0 (CreateProcess() failed - Access is denied. (0x5))
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Deferring communication for 1 min 0 sec
stoll3 Einstein@Home 4/20/2009 3:49:35 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:34 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:33 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:32 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:32 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:31 AM Starting h1_0609.60_S5R4__123_S5R5a_0
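A quick way to tell this failure mode apart from the antivirus-deleted-file case described above is to try starting the science app by hand and see which error comes back. The sketch below shows the idea; the app path and file name are only examples and will differ per host and app version.

# Crude check: is the science app missing, or is process creation being blocked?
# The path below is only an example and will differ per host/app version.
import subprocess
from pathlib import Path

app = Path(r"C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu"
           r"\einstein_S5R5_3.05_windows_intelx86.exe")

if not app.exists():
    print("Executable is missing (the antivirus-deleted-file case)")
else:
    try:
        proc = subprocess.Popen([str(app)])    # outside BOINC it should exit quickly
        proc.terminate()                       # don't let it actually run
        print("Process creation works; the denial must be intermittent")
    except PermissionError:                    # WinError 5 == 'Access is denied. (0x5)'
        print("Access denied: security software is blocking CreateProcess")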
Not worth waiting any longer. The project has been down for almost 7 days since I rejoined E@H 16 days ago.
"souls ain't born, souls don't die"
The project experienced several days of downtime during April, mostly on weekends, but was up most of the time, certainly most of last week.
No question the current string of downtime is bad, but let's not exaggerate it either.
CU
Bikeman
Give the guys some slack, they will sort it out.
At least it's given my CPU a rest.
I feel restless whenever my CPUs are at rest...
Chill out man, flower power and all that (1960s).