Congrats E@H team!

Richard Haselgrove
Joined: 10 Dec 05
Posts: 2140
Credit: 2768243592
RAC: 983999

RE: RE: RE: I have one

Message 56220 in response to message 56219

Quote:
Quote:
Quote:

I have one question for the users here - how long will it be until this computer sends/receives WU's again?

TIA.


Looking at your hosts, I see one that last connected on 12/7; if that one does not connect in the next day, it will probably wait a week or more before it tries again.

John, thanks for the response. The remote computer is running BOINC 5.4.11. Assuming it doesn't try to connect in the next day or so, it will run out of work in less than a week. Will it try to connect again before it runs out of work? (I hope so!)

FD.


With this sort of project downtime, the delay goes up to 168 hours (1 week) very quickly, but it won't be longer than that. When the week starts depends on how far into the outage your machine was when it last tried to contact the server.

AFAIK, once BOINC has decided to back off for a week, it won't ask for any more work until that time is up. If you haven't already set up a backup project, and you can't get anyone on-site to click the update button for you, then it may well run dry.

But I notice that you have some results which will be getting close to deadline inside a week, and BOINC has some quite strong rules about not missing deadlines. So it's possible that BOINC will try to report completed results before the week is up, and if it succeeds, then the logjam will be broken and it'll ask for new work again as well.
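For the curious, here is a minimal sketch (in Python) of the behaviour described above. The 168-hour cap is the figure from this thread; the initial delay, doubling factor, and function names are illustrative assumptions, not BOINC's actual implementation.

MAX_BACKOFF_HOURS = 168  # the one-week cap mentioned above


def next_backoff(previous_hours: float) -> float:
    """Double the deferral after each failed scheduler contact, up to the cap.
    The starting value and doubling factor are assumptions for illustration."""
    if previous_hours <= 0:
        return 1.0  # assumed initial deferral
    return min(previous_hours * 2, MAX_BACKOFF_HOURS)


def should_contact_server(hours_until_deadline: float,
                          hours_until_backoff_expires: float) -> bool:
    """A completed result nearing its deadline can break the logjam:
    the client reports early rather than risk missing the deadline."""
    if hours_until_deadline < hours_until_backoff_expires:
        return True  # report now; a successful contact also clears the way for new work
    return hours_until_backoff_expires <= 0


# Example: 40 h left on the backoff, but a result is due in 30 h,
# so the client would try to contact the project before the week is up.
print(should_contact_server(hours_until_deadline=30,
                            hours_until_backoff_expires=40))  # True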

Fuzzy Duck
Joined: 3 Dec 05
Posts: 37
Credit: 936924
RAC: 0

Thanks for the

Message 56221 in response to message 56220

Thanks for the information.

I will install LogMeIn on the remote PC next visit, so I can update it in cases like this!

Annika
Joined: 8 Aug 06
Posts: 720
Credit: 494410
RAC: 0

I can confirm from own

I can confirm from my own experience that with a small cache the 168-hour deferral is reached very quickly; one of my boxes did so yesterday afternoon (CET). It doesn't matter though, as it sits at home where I can force updates manually, so it's crunching normally again.

BarryAZ
Joined: 8 May 05
Posts: 190
Credit: 321050369
RAC: 15069

Not only did the Einstein

Not only did the Einstein folks respond quickly, they also did an excellent job of posting status information as soon as they had it. Good job there.

Quote:

Well done!

Great job in rebuilding the RAID-6 array.

How come two disks crashed simultaneously? A defective power supply?

BTW, could you teach other project teams how to keep their servers as reliable as yours? Even if Murphy sometimes comes by, you react quickly: this could also be training for other teams ;)


Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: Well done ! Great job

Quote:

Well done!

Great job in rebuilding the RAID-6 array.

How come two disks crashed simultaneously? A defective power supply?

BTW, could you teach other project teams how to keep their servers as reliable as yours? Even if Murphy sometimes comes by, you react quickly: this could also be training for other teams ;)

Since this forum seems to attract tech junkies, here are the gory details:

Einstein@Home has three large file servers. Each one is based on a dual-Opteron Supermicro motherboard, an AIC chassis with 24 hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.
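For anyone checking the arithmetic, a quick back-of-the-envelope sketch (Python), using the figures above:

# Back-of-the-envelope check of the RAID-6 capacity figures.
disks = 24
disk_tb = 0.4             # 400 GB Western Digital RAID-edition drives
parity_disks = 2          # RAID-6 reserves two disks' worth of parity

raw_tb = disks * disk_tb                      # 9.6 TB raw per file server
usable_tb = (disks - parity_disks) * disk_tb  # 8.8 TB usable

print(f"raw: {raw_tb:.1f} TB, usable: {usable_tb:.1f} TB")
# raw: 9.6 TB, usable: 8.8 TB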

Each AIC chassis contains six backplanes (each supporting four SATA disks). It turned out that AIC had designed and manufactured these with an incorrect value for a critical resistor. When all the drives were active, the drive supply voltage would sometimes dip below 4.75 V (the bottom of the 5 V rail's nominal ±5% tolerance) and one of the drives would RESET.

The Areca controller would then think that the drive had been removed from the array and a new drive added. It would begin rebuilding the RAID array, a process that takes several hours and puts a heavy load on the server and its disk drives.

In the middle of one of these rebuilds, a second drive reset itself. At that point we lost all redundancy and decided to shut down to reduce the risk of actual data loss.
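In other words, RAID-6 can absorb two concurrent device failures, and each reset eats into that margin; a toy model of the sequence (Python, purely illustrative):

RAID6_FAULT_TOLERANCE = 2   # drives a RAID-6 array can lose without data loss

degraded = 0
degraded += 1   # first drive resets; controller drops it and starts a rebuild
degraded += 1   # second drive resets in the middle of the rebuild

print(RAID6_FAULT_TOLERANCE - degraded)   # 0 -- no margin left, hence the shutdown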

We have now replaced the backplanes in the AIC cases with new ones that contain the correct resistor value. I hope that from now on these critical file servers will be stable!

Cheers,
Bruce

Director, Einstein@Home

Alinator
Joined: 8 May 05
Posts: 927
Credit: 9352143
RAC: 0

It's always nice when you can

It's always nice when you can find a definitive reason (as well as the solution) when things like this happen. :-)

Alinator

Mike Hewson
Moderator
Joined: 1 Dec 05
Posts: 6537
Credit: 286312022
RAC: 103610

RE: Since this forum seems

Message 56226 in response to message 56224

Quote:
Since this forum seems to attract tech junkies


Really? Naaa.... :-)

Quote:
Einstein@Home has three large file servers. Each one is based on a dual-Opteron Supermicro motherboard, an AIC chassis with 24 hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.


Lift jaw, slap face, wipe drool ... :-)

Quote:
Each AIC chassis contains ........ that contain the correct resistor value. I hope that from now on these critical file servers will be stable!


A classic case of the 20c component!

Cheers, Mike.

NB: For those of us, like me, who perhaps thought we knew our RAID stuff, here's a Wiki article which explains it nicely.

I have made this letter longer than usual because I lack the time to make it shorter ...

... and my other CPU is a Ryzen 5950X :-) Blaise Pascal

Jim Bailey
Joined: 31 Aug 05
Posts: 91
Credit: 1452829
RAC: 0

RE: Lift jaw, slap face,

Message 56227 in response to message 56226

Quote:

Lift jaw, slap face, wipe drool ... :-)

I'll second that! Sure makes the array I put together a few weeks ago look sad!

Bruce Allen
Moderator
Joined: 15 Oct 04
Posts: 1119
Credit: 172127663
RAC: 0

RE: RE: Lift jaw, slap

Message 56228 in response to message 56227

Quote:
Quote:

Lift jaw, slap face, wipe drool ... :-)

I'll second that! Sure makes the array I put together a few weeks ago look sad!

OK, hardware junkies. We have a webcam in our cluster room and you can see the Einstein@Home racks in real time.

The two racks in the middle left with lots of blinking blue lights are the Einstein@Home servers. There are six 5U machines (each with 24 disks): (1) a database server, (2) a project server, and (3) a file server, with all machines duplicated in the second rack to give us some hardware redundancy. At the base of the racks are the four UPSes (each machine has four independent power supplies, each connected to a different UPS).

I think that the webcam can only handle four connections at a time, so please disconnect after a couple of minutes of spying!

Cheers,
Bruce

Director, Einstein@Home

Jim Bailey
Joined: 31 Aug 05
Posts: 91
Credit: 1452829
RAC: 0

Very nice! Thanks for letting

Message 56229 in response to message 56228

Very nice! Thanks for letting us have a peek at what is on the other end of the line.
