As I remember, there was a message saying that two disks crashed on December 7th. Now another two crashed yesterday.
I also remember that there is a RAID6 system; since 4 disks have crashed, it is now a RAID2 system, and the question comes up: "When will the remaining two disks crash, leaving the project with a RAID0?"
SCNR, I am in some troll mood :-)
When will the remaining 2 disks crash?
You're totally wrong. The RAID level has nothing to do with the number of installed disks. The RAID level describes an algorithm for how the redundancy is built across the disks.
A RAID6 is a RAID5 with dual parity, so 2 disks can fail within a RAID set. If two disks fail in a RAID6, it has no more redundancy; if a third disk fails, you lose data.
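For illustration, that dual parity is just an ordinary XOR (P) plus a second checksum (Q) computed over the Galois field GF(2^8), as in the kernel.org raid6 paper linked later in this thread. A rough Python sketch of the idea (per byte; real implementations vectorise this over whole blocks):
[code]
# Minimal sketch of RAID6 dual parity over GF(2^8). Illustrative only.

def gf_mul2(b):
    """Multiply a byte by x (i.e. by 2) in GF(2^8), polynomial 0x11D."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11D
    return b & 0xFF

def raid6_parity(data):
    """P (plain XOR) and Q (Reed-Solomon) parity for one byte
    position across the data disks, via Horner's rule."""
    p, q = 0, 0
    for d in reversed(data):   # q = (((D3)*2 + D2)*2 + D1)*2 + D0
        p ^= d
        q = gf_mul2(q) ^ d
    return p, q

# Example: one byte from each of 4 data disks.
p, q = raid6_parity([0x11, 0x22, 0x33, 0x44])
print(hex(p), hex(q))
[/code]
Losing any two disks leaves a solvable pair of equations in the two missing bytes, which is exactly why the third failure is the fatal one.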
Bruce has written that this is a known problem with the storage array (backplane): http://einsteinathome.org/node/192165&nowrap=true#58955
Errr... I think he was
Errr...
I think he was just shooting for a bit of a play on words/math with that post. ;)
Heh, I at least got a chuckle out of it!
P.S. Thanks for getting the project back up even faster than anticipated, E@H team!! :D
RE: RE: SCNR, I am in
Just to clarify what I meant by "troll mood":
An article on Wikipedia in [url=http://de.wikipedia.org/wiki/Troll_(Internet)]German[/url] and in English.
##############################
And I totally agree with you. But seriously, I am afraid that more disks will give up soon :-( Hopefully not ...
Greetings :) Here are some
Greetings :)
Here are some links for any who may be interested.
Intel's answer:
http://www.intel.com/technology/magazine/computing/RAID-6-0505.htm
Hardware Secrets' answer:
http://www.hardwaresecrets.com/article/314
And the Math behind it all:
http://kernel.org/pub/linux/kernel/people/hpa/raid6.pdf
The only problem we have with disk failure is the delays during rebuild time. My current update communication has been deferred for 150 hours :P
Hmmm, perhaps this is something else?
Any word on why 2M of the results were deleted?
Thanx Pooh Bear, I did a manual update and it's working again :)
RE: The only problem we
No, BOINC versions 5.6 and lower had a bad setting that allowed the client to go into a week of no contact if pushed back enough. This is being fixed. If you notice your machine has a lot of hours (more than 24), please UPDATE so that the counter resets and your machines return to normal.
This is the ONLY time I say UPDATE, because updating is hard on the databases, but you do not want your computer idle for a week.
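If you'd rather not click through the GUI on every box, the same Update can be triggered from a script. A sketch only, assuming the BOINC command-line tool is installed and on your PATH (boinccmd in current clients, boinc_cmd in the 5.x era):
[code]
# Poke the "Update" button for one project without opening the GUI.
# Sketch only -- assumes boinccmd is installed and talking to the
# local core client.
import subprocess

PROJECT_URL = "http://einstein.phys.uwm.edu/"  # the project's master URL

subprocess.run(["boinccmd", "--project", PROJECT_URL, "update"], check=True)
[/code]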
RE: This is the ONLY time I
Yeah... I did that too, clearing out about 40 pending results!! Mainly 'shorts'.
Though the server messages do indicate some backing and filling - deletions, re-downloading, new grid data, re-sending lost results, duplicate reports & such. So I'd assume there has been some loss (beyond just time) from the recent downtime. However, my little farm has now resumed crunching more slices of the pie... :-)
Cheers, Mike.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
I didn't even know we had an
I didn't even know we had an outage for a while. Just so happens that I lost the boot drive on the main machine about the time Einstein went down. Spent most of yesterday working on that mess.
I love the "resent lost WU" feature. Lost everything on this one when the drive failed. Have them all back now, and they will be done before the deadline. :)
Now, I'm off to play with my RAID. This RAID card does not play nice with this machine. Takes forever to reboot unless you pull the three power cords first. Works fine after that, but it's a real pain if you are trying to load an OS!
RE: No, BOINC version 5.6
The one week backoff looks like it was introduced somewhere in the 5.3.x or later series of BOINC. I have a couple of boxes still on 5.2.13 which have managed to recover by themselves, although there were still quite a few hours of lost crunching time. At least it wasn't a week for them.
This is now the third time recently that the vast bulk of my farm has had to be manually rescued from its one-week coma. I can't wait until they release a production version of the client that fixes this stupidity. I call it stupidity because there appears to be no easy way to break out of this coma automatically, even though there seem to be usable triggers.
For example, whilst the project is offline, the inability to contact the scheduling server quickly leads to the one-week backoff. At the same time, the backoff for uploading seems to peak at around 1-4 hours for each finished result. If you have a reasonable cache, you will end up with multiple finished results, all trying to upload at random times. Now, whilst the upload server and the scheduling server are different processes, perhaps on different machines, why couldn't the sudden resumption of uploads when the project comes back online be used to trigger a resumption (even if only temporarily) of attempts to contact the scheduler so that new work can be downloaded?
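As far as I can tell, it behaves like a capped, randomised exponential backoff with very different caps for the two servers. A little Python sketch of the idea (the base and cap constants are my guesses for illustration, not taken from the BOINC source):
[code]
import random

def next_backoff(failures, cap_hours):
    """Capped exponential backoff with jitter. The constants are
    guesses for illustration, not the real client's values."""
    base_hours = 1.0 / 60.0                  # start around one minute
    delay = min(base_hours * 2 ** failures, cap_hours)
    return delay * random.uniform(0.5, 1.0)  # jitter so hosts don't sync up

# Scheduler requests seem to cap out near a week (168 h), uploads near a
# few hours -- which is why finished results keep retrying while the
# scheduler request stays asleep.
for n in (5, 10, 14):
    print(n, "failures:",
          round(next_backoff(n, 168.0), 1), "h scheduler,",
          round(next_backoff(n, 4.0), 1), "h upload")
[/code]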
I'm probably missing something obvious, but my only recourse for the problem of losing a week's crunching has been to manually update each machine as soon as the project is back online. This is fine for most people but quite a problem for me. At the time of these last three outages, I've had close to 200 machines to manually update. Some of this had to be done by going physically from box to box. However, most of it could be done from one machine by using BOINC Manager to access the remote machine. By the time you actually establish the connection, examine the various work queues and message lists, hit the update button, watch the messages list to ensure that results are reported and that new work is downloaded, etc., several minutes can easily fly by. Multiply say 3 mins per machine by 200 machines and that's a helluva lot of unnecessary (and very mind-numbing) hours that I've just wasted :).
So if anybody knows of an easier way to update 200 machines, I'd love to hear about it. I'm getting mighty sick of doing it manually. The machines are in two separate locations (my home and my work) and are spread over several rooms. The work location has 180+ on a single subnet sharing (through about 15 hubs/switches each from 8 to 24 ports) a single ADSL internet connection. As long as the project is up and stable, everything works fine and everything is very much set and forget. However these recent outages and server flakiness problems have been giving me real headaches.
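Thinking out loud: the same GUI RPC that BOINC Manager uses can be driven from a script, so something along these lines might beat all that clicking. A sketch only, assuming each box allows remote RPC connections and the command-line tool (boinccmd, or boinc_cmd on older clients) is available; the hostnames and password handling below are placeholders, not a tested recipe:
[code]
# Hypothetical mass-update of many BOINC hosts over GUI RPC.
# Assumes remote RPC is enabled on each client and that the password
# matches each box's gui_rpc_auth.cfg. Hostnames are placeholders.
import subprocess

PROJECT_URL = "http://einstein.phys.uwm.edu/"
HOSTS = ["box%03d" % i for i in range(1, 201)]  # placeholder hostnames
RPC_PASSWD = "secret"                           # placeholder password

for host in HOSTS:
    try:
        subprocess.run(
            ["boinccmd", "--host", host, "--passwd", RPC_PASSWD,
             "--project", PROJECT_URL, "update"],
            check=True, timeout=30)
        print(host, "updated")
    except Exception as exc:
        print(host, "FAILED:", exc)
[/code]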
Cheers,
Gary.
Hi Gary I don't know if
Hi Gary
I don't know if you'd like to hear this, but in my world the best thing to do is to have a backup project so the computers don't get downtime. There should be some project out there that could get a 1% share of your computer time :)
Anders n
:-) Well Said
:-) Well Said :-)
Regards
Masud.