When will the remaining 2 disks crash?

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140,550,008
RAC: 0
Topic 192246

As I remember, there was a message saying that two disks crashed on December 7th. Now another two crashed yesterday.

I also remember that there is a RAID6 system. Since 4 disks crashed, it is now a RAID2 system, and the question comes up: "When will the remaining two disks crash, causing the project to use a RAID0?".

SCNR, I am in some troll mood :-)

Dotsch
Joined: 1 May 05
Posts: 50
Credit: 215,423
RAC: 0

When will the remaining 2 disks crash?

Quote:

As I remember, there was a message saying that two disks crashed on December 7th. Now another two crashed yesterday.

I also remember that there is a RAID6 system. Since 4 disks crashed, it is now a RAID2 system, and the question comes up: "When will the remaining two disks crash, causing the project to use a RAID0?".

SCNR, I am in some troll mood :-)


You're totally wrong. The RAID level has nothing to do with the number of installed disks. The RAID level describes the algorithm by which redundancy is built across the disks.
A RAID6 is a RAID5 with dual parity, so two disks can fail within a RAID set. If two disks fail in a RAID 6, it has no more redundancy. If a third disk fails, you lose data.
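
To make the point about RAID levels concrete, here is a minimal Python sketch. The failure counts follow the usual definitions of each level (RAID 1 is assumed to be a plain two-disk mirror); the names `FAULT_TOLERANCE` and `array_state` are made up for illustration and say nothing about how the project's own array is configured.

```python
# Number of simultaneous disk failures each level survives within one RAID set.
# Assumption: RAID 1 means a two-disk mirror.
FAULT_TOLERANCE = {"RAID0": 0, "RAID1": 1, "RAID5": 1, "RAID6": 2}


def array_state(level: str, failed_disks: int) -> str:
    """Describe a RAID set of the given level after `failed_disks` failures."""
    tolerated = FAULT_TOLERANCE[level]
    if failed_disks > tolerated:
        return "data lost"
    remaining = tolerated - failed_disks
    if remaining == 0:
        return "running, no redundancy"
    return f"running, can survive {remaining} more failure(s)"


if __name__ == "__main__":
    for failed in range(4):
        print("RAID6 with", failed, "failed disk(s):", array_state("RAID6", failed))
```

The level stays "RAID6" no matter how many disks have failed; only the remaining redundancy changes.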

Bruce has written that this is a known problem with the storage array (backplane): http://einsteinathome.org/node/192165&nowrap=true#58955

Thunder
Joined: 18 Jan 05
Posts: 138
Credit: 46,754,541
RAC: 0

Errr... I think he was

Message 58399 in response to message 58398

Errr...

I think he was just shooting for a bit of a play on words/math with that post. ;)

Heh, I at least got a chuckle out of it!

P.S. Thanks for getting the project back up even faster than anticipated, E@H team!! :D

Wurgl (speak^Wcrunching for Special: Off-Topic)
Joined: 11 Feb 05
Posts: 321
Credit: 140,550,008
RAC: 0

RE: RE: SCNR, I am in

Message 58400 in response to message 58398

Quote:
Quote:

SCNR, I am in some troll mood :-)

You're totally wrong.

Just to help you understand what I meant by "troll mood":
An article on Wikipedia in German (http://de.wikipedia.org/wiki/Troll_(Internet)) and in English.

##############################

Quote:

P.S. Thanks for getting the project back up even faster than anticipated, E@H team!! :D

And I totally agree with you. But seriously, I am afraid that more disks will give up soon :-( Hopefully not ...

Harvey
Joined: 12 Dec 06
Posts: 4
Credit: 73,652
RAC: 0

Greetings :) Here are some

Greetings :)

Here are some links for any who may be interested.

Intel's answer:
http://www.intel.com/technology/magazine/computing/RAID-6-0505.htm

Hardware Secrets' answer:
http://www.hardwaresecrets.com/article/314

And the math behind it all:
http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

The only problem we have with disk failure is the delays during rebuild time. My current update communication has been deferred for 150 hours :P
Hmmm, perhaps this is something else?

Any word on why 2M of the results were deleted?

Thanks, Pooh Bear, I did a manual update and it's working again :)

Pooh Bear 27
Joined: 20 Mar 05
Posts: 1,376
Credit: 20,312,671
RAC: 0

RE: The only problem we

Message 58402 in response to message 58401

Quote:
The only problem we have with disk failure is the delays during rebuild time. My current update communication has been deferred for 150 hours :P
Hmmm, perhaps this is something else?


No, BOINC version 5.6 and lower had a bad setting that allowed it to go into a week of no contact if pushed back enough. This is being fixed. If you notice your machine has been deferred for a lot of hours (more than 24), please UPDATE so that the counter resets and your machines return to normal.

This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.
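
For anyone wondering how a client ends up a week out, here is a rough Python sketch of a capped exponential backoff. This is not BOINC's actual code: the 60-second base, the plain doubling and the function name `backoff_seconds` are assumptions for illustration, and the real client's constants and randomization are not given in this thread.

```python
WEEK = 7 * 24 * 3600  # one week, in seconds


def backoff_seconds(failures: int, base: float = 60.0, cap: float = WEEK) -> float:
    """Delay before the next retry after `failures` consecutive failed contacts.

    The delay doubles with every failure and is clamped to `cap`.
    A manual update corresponds to resetting `failures` to 0.
    """
    if failures <= 0:
        return 0.0
    return min(cap, base * 2 ** (failures - 1))


if __name__ == "__main__":
    for n in (1, 5, 10, 15, 20):
        hours = backoff_seconds(n) / 3600
        print(f"after {n:2d} failures: wait {hours:.2f} h")
```

With these assumed values the one-week cap is reached at the 15th consecutive failure, which is why a long outage can leave a machine idle well after the project is back. A smaller `cap` (a few hours, say) would bound the damage.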

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6,120
Credit: 119,948,773
RAC: 67,357

RE: This is the ONLY time I

Message 58403 in response to message 58402

Quote:
This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.


Yeah... I did that too, clearing out about 40 pending results!! Mainly 'shorts'.

Though the server messages do indicate some backing and filling - deletions, re-downloading, new grid data, re-sending lost results, duplicate reports & such. So I'd assume there has been some loss (beyond time) from the recent downtime. However now my little farm has resumed crunching more slices of the pie... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter. Blaise Pascal

Jim Bailey
Joined: 31 Aug 05
Posts: 91
Credit: 1,452,829
RAC: 0

I didn't even know we had an

I didn't even know we had an outage for a while. Just so happens that I lost the boot drive on the main machine about the time Einstein went down. Spent most of yesterday working on that mess.

I love the "resent lost WU" feature. Lost everything on this one when the drive failed. Have them all back now, and they will be done before the deadline. :)

Now, I'm off to play with my RAID. This RAID card does not play nice with this machine. Takes forever to reboot unless you pull the three power cords first. Works fine after that, but it's a real pain if you are trying to load an OS!

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,126
Credit: 36,865,950,929
RAC: 37,761,263

RE: No, BOINC version 5.6

Message 58405 in response to message 58402

Quote:


No, BOINC version 5.6 and lower had a bad setting that allowed it to go into a week of no contact if pushed back enough. This is being fixed. If you notice your machine has been deferred for a lot of hours (more than 24), please UPDATE so that the counter resets and your machines return to normal.

This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.

The one week backoff looks like it was introduced somewhere in the 5.3.x or later series of BOINC. I have a couple of boxes still on 5.2.13 which have managed to recover by themselves, although there were still quite a few hours of lost crunching time. At least it wasn't a week for them.

This is now the third time recently that the vast bulk of my farm has had to be manually rescued from its one week coma. I can't wait until they release a production version of the client that fixes this stupidity. I call it stupidity because there appears to be no easy way to break out of this coma automatically, even though there would appear to be usable triggers. For example, whilst the project is offline, the inability to contact the scheduling server quickly leads to the one week backoff. At the same time, the backoff for uploading seems to peak out at around 1-4 hours for each finished result. If you have a reasonable cache, you will end up with multiple finished results, all trying to upload at random times. Now whilst the upload server and the scheduling server are different processes perhaps on different machines, why couldn't the sudden resumption of uploads when the project comes back online be used to trigger a resumption (even if only temporarily) of trying to contact the scheduler so that new work can be downloaded?
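
The trigger suggested above could look roughly like the following Python sketch. It only illustrates the idea from this post and does not describe how any BOINC client behaves; the class `ProjectState` and its method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ProjectState:
    """Tracks when the client may next contact a project's scheduler."""

    scheduler_retry_at: float = 0.0  # earliest allowed contact time, in seconds

    def on_scheduler_failure(self, now: float, delay: float) -> None:
        # A failed scheduler request pushes the next attempt into the future.
        self.scheduler_retry_at = now + delay

    def on_upload_success(self, now: float) -> None:
        # A finished result just uploaded, so the project is presumably back
        # online: allow a scheduler request right away.
        self.scheduler_retry_at = min(self.scheduler_retry_at, now)

    def may_contact_scheduler(self, now: float) -> bool:
        return now >= self.scheduler_retry_at


if __name__ == "__main__":
    state = ProjectState()
    state.on_scheduler_failure(now=0.0, delay=7 * 24 * 3600)  # one-week backoff
    print(state.may_contact_scheduler(now=3600.0))  # still backed off
    state.on_upload_success(now=3600.0)             # an upload gets through
    print(state.may_contact_scheduler(now=3600.0))  # scheduler contact allowed
```

If the scheduler is still down when the upload succeeds, the next failed request simply applies the backoff again, so the shortcut costs one extra attempt at most.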

I'm probably missing something obvious, but my only recourse for the problem of losing a week's crunching has been to manually update each machine as soon as the project is back online. This is fine for most people but quite a problem for me. At the time of these last three outages, I've had close to 200 machines to manually update. Some of this had to be done by going physically from box to box. However most of it could be done from one machine by using Boinc Manager to access the remote machine. By the time you actually establish the connection, examine the various work queues and message lists, hit the update button, watch the messages list to ensure that results are reported and that new work is downloaded, etc, several minutes can easily fly by. Multiply say 3 mins per machine by 200 machines (600 minutes, or 10 hours) and that's a helluva lot of unnecessary (and very mind numbing) hours that I've just wasted :).

So if anybody knows of an easier way to update 200 machines, I'd love to hear about it. I'm getting mighty sick of doing it manually. The machines are in two separate locations (my home and my work) and are spread over several rooms. The work location has 180+ on a single subnet sharing (through about 15 hubs/switches each from 8 to 24 ports) a single ADSL internet connection. As long as the project is up and stable, everything works fine and everything is very much set and forget. However these recent outages and server flakiness problems have been giving me real headaches.
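
One possible approach, sketched in Python and untested against a real farm: loop over the hosts and ask each client to update through BOINC's command-line tool. This assumes `boinccmd` (called `boinc_cmd` in older clients) is installed on the controlling machine, that each client accepts remote GUI RPC connections, and that the RPC password is known. The host names, password and project URL are placeholders, and the function names are made up for illustration.

```python
import subprocess
from typing import Callable, Iterable, List


def build_update_command(host: str, password: str, project_url: str) -> List[str]:
    """Command line that asks the client on `host` to update one project."""
    return ["boinccmd", "--host", host, "--passwd", password,
            "--project", project_url, "update"]


def update_all(hosts: Iterable[str], password: str, project_url: str,
               runner: Callable = subprocess.run) -> List[str]:
    """Request an update on every host and return the hosts that failed."""
    failed = []
    for host in hosts:
        cmd = build_update_command(host, password, project_url)
        try:
            result = runner(cmd, capture_output=True, text=True, timeout=30)
            if result.returncode != 0:
                failed.append(host)
        except (OSError, subprocess.TimeoutExpired):
            # Tool not found, host unreachable, or the client did not answer.
            failed.append(host)
    return failed


if __name__ == "__main__":
    # Placeholder values: replace with the real host list, password and URL.
    hosts = [f"192.168.0.{n}" for n in range(10, 13)]
    print("needs attention:", update_all(hosts, "rpc-password", "http://example.invalid/"))
```

The returned list narrows the manual work down to the machines that did not respond. Note that a password passed on the command line is visible to other local users while the command runs.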

Cheers,
Gary.

anders n
Joined: 29 Aug 05
Posts: 123
Credit: 1,656,300
RAC: 0

Hi Gary I don´t know if

Hi Gary,

I don't know if you'd like to hear this, but in my world the best thing to do is to have a backup project so the computers don't get downtime. There should be some project out there that could get a 1% share of your computer time :)

Anders n

KAMasud
Joined: 6 Oct 06
Posts: 13
Credit: 9,559,237
RAC: 0

:-) Well Said


:-) Well Said :-)
Regards
Masud.
