When will the remaining 2 disks crash?

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

29 Dec 2006 17:50:27 UTC

Topic 192246


As I remember, there was a message saying that two disks crashed on December 7th. Now another two crashed yesterday.

I also remember that there is a RAID6 system. Since 4 disks crashed, it is now a RAID2 system, and the question comes up: "When will the remaining two disks crash, causing the project to use a RAID0?".

SCNR, I am in some troll mood :-)

Dotsch

Joined: 1 May 05

Posts: 50

Credit: 741828

RAC: 1248

When will the remaining 2 disks crash?

29 Dec 2006 18:02:24 UTC

Message 58398


Quote:

As I remember, there was a message saying that two disks crashed on December 7th. Now another two crashed yesterday.

I also remember that there is a RAID6 system. Since 4 disks crashed, it is now a RAID2 system, and the question comes up: "When will the remaining two disks crash, causing the project to use a RAID0?".

SCNR, I am in some troll mood :-)

You're totally wrong. The RAID level has nothing to do with the number of installed disks. The RAID level describes an algorithm for how redundancy is built across the disks.
A RAID6 is a RAID5 with dual parity, so up to 2 disks can fail within a RAID set. Once two disks have failed in a RAID 6, it has no redundancy left. If a third disk fails, you lose data.
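
To make the point above concrete, here is a minimal Python sketch. It only encodes the usual rule of thumb for how many simultaneous disk failures a level survives (RAID 0: none, RAID 5: one, RAID 6: two). The names `FAULT_TOLERANCE` and `array_status` are made up for illustration, and nothing here describes the project's actual storage array.

```python
# Simultaneous disk failures each level survives (rule of thumb).
# RAID 1 is left out because its tolerance depends on the number of mirrors.
FAULT_TOLERANCE = {0: 0, 5: 1, 6: 2}


def array_status(raid_level: int, failed_disks: int) -> str:
    """Describe an array of the given level after some disks have failed.

    The level itself never changes when disks fail; only the remaining
    redundancy does.
    """
    tolerance = FAULT_TOLERANCE[raid_level]
    if failed_disks > tolerance:
        return "data lost"
    if failed_disks == tolerance:
        return "running, no redundancy left"
    return "running, still redundant"


if __name__ == "__main__":
    for failed in range(4):
        print(f"RAID 6 with {failed} failed disk(s): {array_status(6, failed)}")
```

So a RAID 6 set with two dead disks is still a RAID 6 set, just one with no safety margin until the rebuild finishes.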

Bruce has written that this is a known problem with the storage array (backplane): http://einsteinathome.org/node/192165&nowrap=true#58955

Thunder

Joined: 18 Jan 05

Posts: 138

Credit: 46754541

RAC: 0

Errr... I think he was

29 Dec 2006 18:18:53 UTC

Message 58399 in response to message 58398


Errr...

I think he was just shooting for a bit of a play on words/math with that post. ;)

Heh, I at least got a chuckle out of it!

P.S. Thanks for getting the project back up even faster than anticipated, E@H team!! :D

Wurgl (speak^Wc...

Joined: 11 Feb 05

Posts: 321

Credit: 140550008

RAC: 0

RE: RE: SCNR, I am in

29 Dec 2006 18:33:10 UTC

Message 58400 in response to message 58398


Quote:

Quote:

SCNR, I am in some troll mood :-)

You're totally wrong.

Just to explain what I meant by "troll mood":
There is a Wikipedia article in German (http://de.wikipedia.org/wiki/Troll_(Internet)) and in English.

##############################

Quote:

P.S. Thanks for getting the project back up even faster than anticipated, E@H team!! :D

And I totally agree with you. But seriously, I am afraid that more disks will give up soon :-( Hopefully not ...

Harvey

Joined: 12 Dec 06

Posts: 4

Credit: 73652

RAC: 0

Greetings :) Here are some

29 Dec 2006 19:04:21 UTC

Message 58401


Greetings :)

Here are some links for any who may be interested.

Intel's answer:
http://www.intel.com/technology/magazine/computing/RAID-6-0505.htm

Hardware Secrets answer:
http://www.hardwaresecrets.com/article/314

And the Math behind it all:
http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

The only problem we have with disk failure is the delays during rebuild time. My current update communication has been deferred for 150 hours :P
Hmmm, perhaps this is something else?

Any word on why 2M of the results were deleted?

Thanks, Pooh Bear, I did a manual update and it's working again :)

Pooh Bear 27

Joined: 20 Mar 05

Posts: 1376

Credit: 20312671

RAC: 0

RE: The only problem we

29 Dec 2006 19:17:34 UTC

Message 58402 in response to message 58401


Quote:

The only problem we have with disk failure is the delays during rebuild time. My current update communication has been deferred for 150 hours :P
Hmmm, perhaps this is something else?

No, BOINC version 5.6 and lower had a bad setting that allowed it to go into a week of no contact if pushed back enough. This is being fixed. If you notice your machine has been deferred for a lot of hours (more than 24), please UPDATE, so that the counter resets and your machines return to normal.

This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.

Mike Hewson

Moderator

Joined: 1 Dec 05

Posts: 6597

Credit: 340286563

RAC: 95020

RE: This is the ONLY time I

30 Dec 2006 1:26:29 UTC

Message 58403 in response to message 58402


Quote:

This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.

Yeah... I did that too, clearing out about 40 pending results!! Mainly 'shorts'.

Though the server messages do indicate some backing and filling - deletions, re-downloading, new grid data, re-sending lost results, duplicate reports & such. So I'd assume there has been some loss ( beyond timewise ) from the recent downtime. However now my little farm has resumed crunching more slices of the pie... :-)

Cheers, Mike.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Jim Bailey

Joined: 31 Aug 05

Posts: 91

Credit: 1452829

RAC: 0

I didn't even know we had an

30 Dec 2006 3:02:18 UTC

Message 58404


I didn't even know we had an outage for a while. Just so happens that I lost the boot drive on the main machine about the time Einstein went down. Spent most of yesterday working on that mess.

I love the "resent lost WU" feature. Lost everything on this one when the drive failed. Have them all back now, and they will be done before the deadline. :)

Now, I'm off to play with my RAID. This RAID card does not play nice with this machine. Takes forever to reboot unless you pull the three power cords first. Works fine after that, but it's a real pain if you are trying to load an OS!

Gary Roberts

Moderator

Joined: 9 Feb 05

Posts: 5888

Credit: 119996408298

RAC: 26439511

RE: No, BOINC version 5.6

30 Dec 2006 12:28:13 UTC

Message 58405 in response to message 58402


Quote:

No, BOINC version 5.6 and lower had a bad setting that allowed it to go into a week of no contact if pushed back enough. This is being fixed. If you notice your machine has been deferred for a lot of hours (more than 24), please UPDATE, so that the counter resets and your machines return to normal.

This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.

The one week backoff looks like it was introduced somewhere in the 5.3.x or later series of BOINC. I have a couple of boxes still on 5.2.13 which have managed to recover by themselves, although there were still quite a few hours of lost crunching time. At least it wasn't a week for them.

This is now the third time recently that the vast bulk of my farm has had to be manually rescued from its one week coma. I can't wait until they release a production version of the client that fixes this stupidity. I call it stupidity because there appears to be no easy way to break out of this coma automatically, even though there would appear to be usable triggers. For example, whilst the project is offline, the inability to contact the scheduling server quickly leads to the one week backoff. At the same time, the backoff for uploading seems to peak out at around 1-4 hours for each finished result. If you have a reasonable cache, you will end up with multiple finished results, all trying to upload at random times. Now whilst the upload server and the scheduling server are different processes perhaps on different machines, why couldn't the sudden resumption of uploads when the project comes back online be used to trigger a resumption (even if only temporarily) of trying to contact the scheduler so that new work can be downloaded?
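
The behaviour described above can be sketched as a capped exponential backoff. This is only an illustration, not the BOINC client's actual code: the 60-second base delay, the doubling rule, and the `Client` class are assumptions, and the one-week and four-hour caps are taken from the figures mentioned in this post. The `upload_succeeded` hook shows the suggested trigger, where a successful upload resets the scheduler backoff.

```python
WEEK = 7 * 24 * 3600      # cap seen for scheduler contact (from the post)
FOUR_HOURS = 4 * 3600     # rough cap seen for uploads (from the post)


def backoff_seconds(failures: int, base: int = 60, cap: int = WEEK) -> int:
    """Delay before the next retry: doubles per failure, never exceeds cap.

    The base delay and the doubling rule are assumptions for illustration.
    """
    return min(base * 2 ** failures, cap)


class Client:
    """Hypothetical client tracking failed scheduler contacts."""

    def __init__(self) -> None:
        self.scheduler_failures = 0

    def scheduler_failed(self) -> None:
        self.scheduler_failures += 1

    def upload_succeeded(self) -> None:
        # Suggested trigger: a working upload implies the project is back,
        # so start contacting the scheduler again without the long wait.
        self.scheduler_failures = 0

    def scheduler_delay(self) -> int:
        return backoff_seconds(self.scheduler_failures)


if __name__ == "__main__":
    client = Client()
    for _ in range(20):
        client.scheduler_failed()
    print("delay after a long outage:", client.scheduler_delay(), "s")
    client.upload_succeeded()
    print("delay after one good upload:", client.scheduler_delay(), "s")
```

With these assumed numbers the scheduler delay reaches the one-week cap after 14 consecutive failures, while the same run of failures under a four-hour cap never waits longer than four hours.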

I'm probably missing something obvious, but my only recourse for the problem of losing a week's crunching has been to manually update each machine as soon as the project is back online. This is fine for most people but quite a problem for me. At the time of these last three outages, I've had close to 200 machines to manually update. Some of this had to be done by going physically from box to box. However most of it could be done from one machine by using BOINC Manager to access the remote machine. By the time you actually establish the connection, examine the various work queues and message lists, hit the update button, watch the messages list to ensure that results are reported and that new work is downloaded, etc, several minutes can easily fly by. Multiply say 3 mins per machine by 200 machines and that's 600 minutes, or about 10 hours: a helluva lot of unnecessary (and very mind numbing) hours that I've just wasted :).

So if anybody knows of an easier way to update 200 machines, I'd love to hear about it. I'm getting mighty sick of doing it manually. The machines are in two separate locations (my home and my work) and are spread over several rooms. The work location has 180+ on a single subnet sharing (through about 15 hubs/switches each from 8 to 24 ports) a single ADSL internet connection. As long as the project is up and stable, everything works fine and everything is very much set and forget. However these recent outages and server flakiness problems have been giving me real headaches.
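
One possible approach, sketched in Python under several assumptions: the BOINC command-line tool `boinccmd` (named `boinc_cmd` in older clients) is installed on the controlling machine, every remote client is configured to accept GUI RPC connections from it, and all boxes share one RPC password. The host addresses, the password, and the project URL below are placeholders, not real values. The script defaults to a dry run that only prints the commands.

```python
import subprocess
from typing import Iterable, List


def update_command(host: str, password: str, project_url: str) -> List[str]:
    """Build the boinccmd call asking one remote client to contact the project."""
    return ["boinccmd", "--host", host, "--passwd", password,
            "--project", project_url, "update"]


def update_all(hosts: Iterable[str], password: str, project_url: str,
               dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one update command per host."""
    commands = [update_command(h, password, project_url) for h in hosts]
    if not dry_run:
        for cmd in commands:
            try:
                # One unreachable box should not stop the rest of the run.
                subprocess.run(cmd, check=False, timeout=30)
            except (subprocess.TimeoutExpired, OSError) as err:
                print("failed:", cmd[2], err)
    return commands


if __name__ == "__main__":
    hosts = [f"192.168.0.{n}" for n in range(10, 13)]  # hypothetical addresses
    project = "http://project.example/"                # placeholder URL
    for cmd in update_all(hosts, "secret", project):
        print(" ".join(cmd))
```

Calling `update_all(..., dry_run=False)` would actually issue the requests. It does not check that results were reported or that new work arrived, so a spot check on a few machines would still be sensible.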

Cheers,
Gary.

anders n

Joined: 29 Aug 05

Posts: 123

Credit: 1656300

RAC: 0

Hi Gary I don't know if

30 Dec 2006 12:46:16 UTC

Message 58406


Hi Gary

I don't know if you'd like to hear this, but in my world the best thing to do is to have a backup project so the computers don't get any downtime. There should be some project out there that could get a 1% share of your computer time :)

Anders n

KAMasud

Joined: 6 Oct 06

Posts: 14

Credit: 76997758

RAC: 303031

:-) Well Said

30 Dec 2006 13:40:49 UTC

Message 58407


:-) Well Said :-)
Regards
Masud.
