I have one question for the users here - how long will it be until this computer sends/receives WUs again?
TIA.
Looking at your hosts, I see one that last connected on 12/7. If that one does not connect in the next day, it will probably wait a week or more before it tries again.
John, thanks for the response. The remote computer is running BOINC 5.4.11. Assuming it doesn't try to connect in the next day or so, it will run out of work in less than a week. Will it try to connect again before it runs out of work? (I hope so!)
FD.
With this sort of project downtime, the delay goes up to 168 hours (1 week) very quickly, but it won't be longer than that. When the week starts depends how far into the outage it was when your machine next tried to contact the server.
AFAIK, once BOINC has decided to back off for a week, it won't ask for any more work until that time is up. If you haven't already set up a backup project, and you can't get anyone on-site to click the update button for you, then it may well run dry.
But I notice that you have some results which will be getting close to deadline inside a week, and BOINC has some quite strong rules about not missing deadlines. So it's possible that BOINC will try to report completed results before the week is up, and if it succeeds, then the logjam will be broken and it'll ask for new work again as well.
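For the curious, the deferral behaviour described above is essentially a capped exponential backoff. A minimal sketch of the idea (the doubling factor and jitter here are illustrative, not BOINC's exact algorithm; the 168-hour cap matches the figure mentioned above):

```python
import random

MAX_BACKOFF_H = 168.0  # one week, the cap mentioned above


def next_backoff(prev_backoff_h: float) -> float:
    """Roughly double the previous delay, cap it at a week, and jitter
    it slightly so thousands of clients don't all retry at once.
    (Illustrative model, not the actual BOINC client code.)"""
    delay = min(prev_backoff_h * 2.0, MAX_BACKOFF_H)
    return delay * random.uniform(0.9, 1.0)


# Starting from a 1-hour deferral, the cap is reached within about
# ten retries -- which is why the week-long backoff arrives "very
# quickly" during an extended outage.
delay = 1.0
for attempt in range(10):
    delay = next_backoff(delay)
```

This also shows why a manual "Update" click breaks the logjam: it bypasses the timer entirely and resets the deferral on success.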
I can confirm from my own experience that with a small cache the 168-hour deferral is reached very quickly; one of my boxes hit it yesterday afternoon (CET). It doesn't matter though, as it sits at home where I can force updates manually, so it's crunching normally again.
Not only did the Einstein folks respond quickly, they did an excellent job in posting the status information as soon as they did. Good job there.
Well done !
Great job in rebuilding the raid6 array.
How come two disks crashed simultaneously? A defective power supply?
BTW, could you teach other project teams how to keep their servers as reliable as yours? Even if Murphy sometimes comes by, you react quickly: this could also be training for other teams ;)
Since this forum seems to attract tech junkies, here are the gory details:
Einstein@Home has three large file servers. Each one is based on a dual-opteron supermicro motherboard, an AIC chassis with 24-hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.
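For anyone wanting to check those capacity figures: RAID-6 sacrifices two disks' worth of space per array to dual parity, which accounts exactly for the gap between 9.6 TB raw and 8.8 TB usable.

```python
disks = 24
disk_tb = 0.4        # 400 GB Western Digital RAID-edition drives
parity_disks = 2     # RAID-6 stores two parity blocks per stripe

raw_tb = disks * disk_tb                       # 9.6 TB raw
usable_tb = (disks - parity_disks) * disk_tb   # 8.8 TB usable
```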
Each AIC chassis contains six backplanes (each supporting four SATA disks). It turned out that AIC designed and manufactured these with an incorrect value for a critical resistor. When all the drives were active, the drive voltage would sometimes dip below 4.75V and one of the drives would reset.
The Areca controller would then think that the drive had been removed from the array and a new drive added. It would begin to rebuild the RAID array, a process that takes several hours and loads the server and disk drives.
In the middle of one of these rebuilds, a second drive reset itself. At that point we lost all redundancy and decided to shut down to reduce the risk of actual data loss.
We have now replaced the backplanes in the AIC cases with new ones that contain the correct resistor value. I hope that from now on these critical file servers will be stable!
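The failure sequence described above can be thought of as a redundancy countdown: RAID-6 starts with two disks of slack, a spurious reset consumes one for the hours the rebuild runs, and a second reset during that window leaves none. A toy model of that decision (class and method names are illustrative, not Areca firmware behaviour):

```python
class Raid6Array:
    """Toy model of RAID-6 fault tolerance during a rebuild."""

    def __init__(self, disks: int = 24):
        self.disks = disks
        self.redundancy = 2  # RAID-6 tolerates two concurrent failures

    def drive_reset(self) -> str:
        """A drive reset looks to the controller like a removal, so
        redundancy drops by one until the (hours-long) rebuild ends."""
        self.redundancy -= 1
        if self.redundancy <= 0:
            # One more fault now would mean real data loss, so the
            # safe move is to shut down -- as the admins did.
            return "shutdown"
        return "rebuilding"


array = Raid6Array()
first = array.drive_reset()   # one parity disk of slack remains
second = array.drive_reset()  # no redundancy left: stop the array
```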
Lift jaw, slap face, wipe drool ... :-)
A classic case of the 20c component!
Cheers, Mike.
NB. For those of us, like me, who perhaps thought we knew our RAID stuff, here's a Wiki that explains it nicely.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Thanks for the information.
I will install LogMeIn on the remote PC next visit, so I can update it in cases like this!
It's always nice when you can find a definitive reason (as well as the solution) when things like this happen. :-)
Alinator
Really? Naaa.... :-)
I'll second that! Sure makes the array I put together a few weeks ago look sad!
OK, hardware junkies. We have a webcam in our cluster room and you can see the Einstein@Home racks in real time.
The two racks in the middle left with lots of blinking blue lights are the Einstein@Home servers. There are six 5U machines (each with 24 disks). These are: (1) database server, (2) project server, (3) file server. All machines are duplicated in the second rack, to give us some hardware redundancy. At the base of the racks are the four UPS power supplies (each of the machines has four independent power supplies, each connected to a different UPS).
I think that the webcam can only handle four connections at a time, so please disconnect after a couple of minutes of spying!
Cheers,
Bruce
Director, Einstein@Home
Very nice! Thanks for letting us have a peek at what is on the other end of the line.