I have one question for the users here - how long will it be until this computer sends/receives WUs again?
TIA.
Looking at your hosts, I see one that last connected on 12/7. If that one does not connect in the next day, it will probably wait a week or more before it tries again.
John, thanks for the response. The remote computer is running BOINC 5.4.11. Assuming it doesn't try to connect in the next day or so, it will run out of work in less than a week. Will it try to connect again before it runs out of work? (I hope so!)
FD.
With this sort of project downtime, the delay goes up to 168 hours (1 week) very quickly, but it won't be longer than that. When the week starts depends how far into the outage it was when your machine next tried to contact the server.
AFAIK, once BOINC has decided to back off for a week, it won't ask for any more work until that time is up. If you haven't already set up a backup project, and you can't get anyone on-site to click the update button for you, then it may well run dry.
But I notice that you have some results which will be getting close to deadline inside a week, and BOINC has some quite strong rules about not missing deadlines. So it's possible that BOINC will try to report completed results before the week is up, and if it succeeds, then the logjam will be broken and it'll ask for new work again as well.
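For the curious, the deferral behaviour described above is essentially a capped exponential backoff. A minimal sketch of the idea (the doubling factor and jitter here are illustrative, not BOINC's exact algorithm; the 168-hour cap matches the figure mentioned above):

```python
import random

MAX_BACKOFF_H = 168.0  # one week, the cap mentioned above


def next_backoff(prev_backoff_h: float) -> float:
    """Roughly double the previous delay, cap it at a week, and jitter
    it slightly so thousands of clients don't all retry at once.
    (Illustrative model, not the actual BOINC client code.)"""
    delay = min(prev_backoff_h * 2.0, MAX_BACKOFF_H)
    return delay * random.uniform(0.9, 1.0)


# Starting from a 1-hour deferral, the cap is reached within about
# ten retries -- which is why the week-long backoff arrives "very
# quickly" during an extended outage.
delay = 1.0
for attempt in range(10):
    delay = next_backoff(delay)
```

This also shows why a manual "Update" click breaks the logjam: it bypasses the timer entirely and resets the deferral on success.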
I can confirm from my own experience that with a small cache the 168-hour deferral is reached very quickly; one of my boxes hit it yesterday afternoon (CET). It doesn't matter though, as it sits at home where I can force updates manually, so it's crunching normally again.
Not only did the Einstein folks respond quickly, they did an excellent job in posting the status information as soon as they did. Good job there.
Well done !
Great job in rebuilding the raid6 array.
How come two disks crashed simultaneously? A defective power supply?
BTW, could you teach other project teams how to keep their servers as reliable as yours? Even if Murphy sometimes comes by, you react quickly: this could also be training for other teams ;)
Since this forum seems to attract tech junkies, here are the gory details:
Einstein@Home has three large file servers. Each one is based on a dual-opteron supermicro motherboard, an AIC chassis with 24-hot-swap SATA disks, and an Areca 1170 RAID controller. We use 24 Western Digital 400GB RAID-edition disk drives (9.6 TB of raw disk space per file server) and run the RAID arrays in a RAID-6 configuration. This gives us 8.8 TB of usable storage space and two redundant disks per storage server.
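For anyone wanting to check those capacity figures: RAID-6 sacrifices two disks' worth of space per array to dual parity, which accounts exactly for the gap between 9.6 TB raw and 8.8 TB usable.

```python
disks = 24
disk_tb = 0.4        # 400 GB Western Digital RAID-edition drives
parity_disks = 2     # RAID-6 stores two parity blocks per stripe

raw_tb = disks * disk_tb                       # 9.6 TB raw
usable_tb = (disks - parity_disks) * disk_tb   # 8.8 TB usable
```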
Each AIC chassis contains six backplanes (each supporting four SATA disks). It turned out that AIC designed and manufactured these with an incorrect value for a critical resistor. When all the drives were active, the drive voltage would sometimes dip below 4.75V and one of the drives would reset.
The Areca controller would then think that the drive had been removed from the array and a new drive added. It would begin to rebuild the RAID array, a process that takes several hours and loads the server and disk drives.
In the middle of one of these rebuilds, a second drive reset itself. At that point we lost all redundancy and decided to shut down to reduce the risk of actual data loss.
We have now replaced the backplanes in the AIC cases with new ones that contain the correct resistor value. I hope that from now on these critical file servers will be stable!
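The failure sequence described above can be thought of as a redundancy countdown: RAID-6 starts with two disks of slack, a spurious reset consumes one for the hours the rebuild runs, and a second reset during that window leaves none. A toy model of that decision (class and method names are illustrative, not Areca firmware behaviour):

```python
class Raid6Array:
    """Toy model of RAID-6 fault tolerance during a rebuild."""

    def __init__(self, disks: int = 24):
        self.disks = disks
        self.redundancy = 2  # RAID-6 tolerates two concurrent failures

    def drive_reset(self) -> str:
        """A drive reset looks to the controller like a removal, so
        redundancy drops by one until the (hours-long) rebuild ends."""
        self.redundancy -= 1
        if self.redundancy <= 0:
            # One more fault now would mean real data loss, so the
            # safe move is to shut down -- as the admins did.
            return "shutdown"
        return "rebuilding"


array = Raid6Array()
first = array.drive_reset()   # one parity disk of slack remains
second = array.drive_reset()  # no redundancy left: stop the array
```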
Lift jaw, slap face, wipe drool ... :-)
A classic case of the 20c component!
Cheers, Mike.
NB. For those of us, like me, who perhaps thought we knew our RAID stuff, here's a Wiki that explains it nicely.
I have made this letter longer than usual because I lack the time to make it shorter ...
... and my other CPU is a Ryzen 5950X :-) Blaise Pascal
Thanks for the information.
I will install LogMeIn on the remote PC next visit, so I can update it in cases like this!
It's always nice when you can find a definitive reason (as well as the solution) when things like this happen. :-)
Alinator
Really? Naaa.... :-)
I'll second that! Sure makes the array I put together a few weeks ago look sad!
OK, hardware junkies. We have a webcam in our cluster room and you can see the Einstein@Home racks in real time.
The two racks in the middle left with lots of blinking blue lights are the Einstein@Home servers. There are six 5U machines (each with 24 disks). These are: (1) database server, (2) project server, (3) file server. All machines are duplicated in the second rack, to give us some hardware redundancy. At the base of the racks are the four UPS power supplies (each of the machines has four independent power supplies, each connected to a different UPS).
I think that the webcam can only handle four connections at a time, so please disconnect after a couple of minutes of spying!
Cheers,
Bruce
Director, Einstein@Home
Very nice! Thanks for letting us have a peek at what is on the other end of the line.