Some of my hosts did get fresh work today. Until the problem is resolved (which will be announced on the home page), I'm afraid there will be phases where the scheduler is turned off and on repeatedly, so we can see how the system reacts to different countermeasures until performance is acceptable again. Please ignore the status page for the time being; generating it causes extra load on the database, which would interfere with the problem solving right now.
None of mine did, but let that pass - not important.
I haven't seen any description yet of the nature of the additional load that the database is experiencing. Although there was the recent release of the Windows apps v3.04 and v3.05, these would affect downloads only, not the database. I haven't read of any change *** within Einstein *** that should have changed the database loading so spectacularly.
There is the possibility that an inwards migration of active users - say following a SETI outage - might have produced a temporary blip, but Einstein has coped with that before - it has never caused such sustained loading problems.
Which leaves a nagging concern. The first report of database loading issues on the Einstein front page is dated 12 April 2009 - three days after BOINC v6.6.20 was made the official "recommended" version. I have identified at least one situation where BOINC (v6.6.23 actually, but it's unchanged v6.6.20 code) can flood a server with spurious "Requesting 0.00 seconds of work" RPC calls - see my post on Friday in Not getting new WU's for CPU projects after upgrade to 6.6.20 on the BOINC message boards, and subsequent discussion.
Is the excessive database loading caused by an increase in the number of scheduler RPCs coming from external users? Or am I barking up the wrong tree?
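To illustrate what I mean: after an empty reply, a client ought to back off before asking again, roughly along the lines of the sketch below. This is only an illustration of the expected behaviour, not actual BOINC code, and every name and constant in it is invented.

# Sketch of client-side back-off after unproductive scheduler RPCs.
# Illustrative only - not BOINC source; names and constants are invented.
import random
import time

MIN_DELAY = 60           # seconds to wait after the first empty reply
MAX_DELAY = 4 * 3600     # never defer longer than a few hours

def next_delay(previous):
    """Double the deferral after each unproductive RPC, with some jitter."""
    if previous is None:
        return MIN_DELAY
    return min(previous * 2, MAX_DELAY) * random.uniform(0.9, 1.1)

def fetch_work_loop(scheduler_rpc):
    """Keep asking for work, but never hammer the server with repeat requests."""
    delay = None
    while True:
        seconds_granted = scheduler_rpc()   # hypothetical call returning seconds of work granted
        if seconds_granted > 0:
            delay = None                    # got work, reset the back-off
        else:
            delay = next_delay(delay)       # the "Requesting 0.00 seconds of work" case
            time.sleep(delay)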
Edit - judging by the new front page news since I started composing that, the load is indeed scheduler related. Alarm bells are starting to ring.
Another possibility is a problem with Norton AntiVirus which occurred around the same time. I had this myself for about a day: I was going through WUs as fast as Einstein@Home would hand them out to me. I would get some WUs, Norton would flag the data files as suspicious and remove them, and BOINC would then error out when it tried to start a WU because its data file was gone. Those results were returned as errored out, and Einstein would then send more. This repeated non-stop until I figured out how to shut off the Norton advanced heuristics detection which was causing the problem. Because my Einstein resource share is so high, I quickly hit the limit of 1 WU/core/day, but other crunchers may take a little longer. I imagine there are lots of people using Norton ...
Gravitysmith
I don't think that would cause such a big problem, since Einstein initially only allows a "Maximum daily WU quota per CPU" of 16/day, unlike SETI, which has a quota of 100 WUs/CPU (and 400/GPU).
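To put those numbers in perspective, here is a toy model of how a shrinking daily quota chokes off an error loop like the one described above. It is only a sketch: the halving rule is just an assumption, not the actual BOINC scheduler logic.

# Toy model of a per-host daily quota shrinking on errored results.
# Not the real scheduler code; the halving rule is assumed for illustration.
START_QUOTA = 16   # Einstein's initial "Maximum daily WU quota per CPU"

def update_quota(quota, result_ok):
    """Grow the quota on valid results, shrink it on errors (floor of 1)."""
    if result_ok:
        return min(quota * 2, START_QUOTA)
    return max(quota // 2, 1)

# A host whose antivirus deletes every data file errors out task after task:
quota, errored = START_QUOTA, 0
while quota > 1:
    errored += 1
    quota = update_quota(quota, result_ok=False)
print(f"{errored} errored tasks before the host is throttled to 1 WU/core/day")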
Regards,
Gundolf
Computers aren't everything in life. (Just a little joke)
I haven't seen any description yet of the nature of the additional load that the database is experiencing.
Surely a number of things added together; we haven't spent much time investigating this yet, but we certainly will.
Under heavy load our old DB server finally crashed, and we couldn't bring it back online for anything more than a last dump (which itself took several hours). So we put together what hardware we could find into a new, larger server (32 GB RAM) that is running now. While the additional memory certainly helped, the performance of this new DB is still poor. We are looking into this and have already solved some issues. For now the project has been stopped again to run some offline analysis, indexing and other optimization on the database.
We'll continue to fix the problems as they occur. So far we have had all kinds of things, ranging from hardware failures to bugs in software not written by us (the last one fixed just minutes ago), all within just a few days.
I strongly intend to have the project up and running later today.
BM
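For those wondering what "offline analysis, indexing and other optimization" typically involves on a MySQL-backed BOINC database, a rough sketch is below. The table names are the usual BOINC ones; the credentials, database name and the example index are placeholders, not what the project actually ran.

# Sketch of routine MySQL maintenance on a BOINC database.
# Credentials, database name and the example index are placeholders.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="boincadm",
                               password="change_me", database="einstein")
cur = conn.cursor()

for table in ("result", "workunit", "host"):   # typically the largest BOINC tables
    cur.execute(f"ANALYZE TABLE {table}")      # refresh index statistics
    cur.fetchall()
    cur.execute(f"OPTIMIZE TABLE {table}")     # defragment / rebuild
    cur.fetchall()

# Example of adding an index a slow query might be missing (hypothetical):
cur.execute("CREATE INDEX res_state ON result (server_state)")

conn.commit()
cur.close()
conn.close()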
I've had a few episodes on my hosts recently which have resulted in rapid rejection of my entire queue of Einstein tasks with "access is denied" error messages.
Until now I have assumed that this was my fault in failing to be present when my firewall (COMODO with the Defense+ option which requires initial authorization of new programs) popped up a permission request.
However, some of the failures appear to have been _not_ on new science apps. So I suppose there is a small chance that what I saw has been happening to other non-COMODO users. If so, it could be a contribution to the excessive database load.
By way of giving a signature, so others can chime in if they think they have seen the same thing, this is a typical set of log messages regarding the attempt to run a single WU during such an episode:
Host Project Date Message
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Output file h1_0609.60_S5R4__123_S5R5a_0_0 for task h1_0609.60_S5R4__123_S5R5a_0 absent
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Computation for task h1_0609.60_S5R4__123_S5R5a_0 finished
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Reason: Unrecoverable error for result h1_0609.60_S5R4__123_S5R5a_0 (CreateProcess() failed - Access is denied. (0x5))
stoll3 Einstein@Home 4/20/2009 3:49:36 AM Deferring communication for 1 min 0 sec
stoll3 Einstein@Home 4/20/2009 3:49:35 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:34 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:33 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:32 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:32 AM [error] Process creation failed: Access is denied. (0x5)
stoll3 Einstein@Home 4/20/2009 3:49:31 AM Starting h1_0609.60_S5R4__123_S5R5a_0
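A quick way to tell this failure mode apart from the antivirus-deleted-file case described above is to try starting the science app by hand and see which error comes back. The sketch below shows the idea; the app path and file name are only examples and will differ per host and app version.

# Crude check: is the science app missing, or is process creation being blocked?
# The path below is only an example and will differ per host/app version.
import subprocess
from pathlib import Path

app = Path(r"C:\ProgramData\BOINC\projects\einstein.phys.uwm.edu"
           r"\einstein_S5R5_3.05_windows_intelx86.exe")

if not app.exists():
    print("Executable is missing (the antivirus-deleted-file case)")
else:
    try:
        proc = subprocess.Popen([str(app)])    # outside BOINC it should exit quickly
        proc.terminate()                       # don't let it actually run
        print("Process creation works; the denial must be intermittent")
    except PermissionError:                    # WinError 5 == 'Access is denied. (0x5)'
        print("Access denied: security software is blocking CreateProcess")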
Not worth waiting any longer. The project has been down for almost 7 days since I rejoined E@H 16 days ago.
"souls ain't born, souls don't die"
The project experienced several days of downtime during April, mostly on weekends, but was up most of the time, certainly most of last week.
No question the current string of downtime is bad, but let's not exaggerate it either.
CU
Bikeman
Give the guys some slack, they will sort it out.
At least it's given my CPU a rest.
I feel restless whenever my CPUs are at rest...
Chill out man, flower power and all that (1960s).