> I've been in the business continuity racket for years. Stuff like UPSes is
> not a priority with folks like SETI because there is no financial pain
> involved in being down. Maybe this last experience will be painful enough to
> motivate thinking in a different way. But no matter how much of a pain in the
> rear it might be, if there are no $$$ involved in being down, it's difficult
> for some to make the business case to invest in uptime.
Some of my associates think I'm insane, but from time to time I go around and yank the UPS plugs out of the wall.
Because, ultimately, that is the only way to tell if a UPS works -- and if you are afraid to do that, you need new UPSes.
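Between the plug-pull tests, a script can at least watch for the UPS dropping to battery. Here is a minimal sketch in Python, assuming the open-source Network UPS Tools (NUT) daemon is running and the UPS is configured under the name "myups" (the name is my assumption):

# Minimal sketch: poll the UPS through NUT's "upsc" tool and complain when
# it switches to battery. Assumes Network UPS Tools is installed and the
# UPS is configured under the name "myups" (my assumption, not a standard).
import subprocess
import time

def ups_status(ups="myups@localhost"):
    # "upsc <ups> ups.status" prints e.g. "OL" (online) or "OB" (on battery).
    result = subprocess.run(["upsc", ups, "ups.status"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

while True:
    if "OB" in ups_status():
        print("UPS is on battery -- mains power lost!")
    time.sleep(30)  # poll every 30 seconds

It tells you the UPS thinks it is working; only pulling the plug tells you the batteries actually hold.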
The problem with a project like SETI@home is that much of the hardware is partially or fully donated rather than bought outright. And it seems to be much easier to get a new server as a donation than a UPS.
The second problem is that there are UPSes in the server room, but because of the migration they have to run two projects at the moment, and not all of the hardware can be placed in the server room. I am sure that on their notoriously small budget it is not possible to buy extra UPS capacity just for the transition period, until everything is back to one project in one server room.
All in all, this is typical of such a migration scenario. I work in the IT business and have done a lot of these projects in the past. You always live with the risk that you cannot fully protect all systems at all times until every piece of hardware is in its final place.
One more problem - and the report sounds like this was the problem - is that it is not enough to have a UPS: you must configure a safe automatic shutdown of all applications and the operating system before the UPS runs out, but not trigger it for every small outage. With all the different systems, some of them new, unfamiliar, and having issues, it is difficult to get this working and then test it carefully.
I am sure the system administrators of the SETI@home project are not having a nice time at the moment. As others have posted in this thread, migrating a 24/7 project with 500,000 active users to new hardware and software on an extremely limited budget is a very special situation. I believe I am really good at my job, but I wouldn't dare claim I could do it better.
Hopefully they will have a little more luck in the future. :)
Greetings from Bremen/Germany
Jens Seidler (TheBigJens)
> My current UPS is about dead. I went to Best Buy (German translation:
> Wonderful Adult/Child Electronics Playland/Store, hehehe).
>
> The UPSes ran from $40-$100. So, SETI can't afford $500 for all new UPSes?
I'm not going to say that this isn't a good idea, but....
We're not talking typical desktop machines, but servers with multiple disks and multiple processors, and I'm not sure these would be suitable -- and I'm not talking output power, I'm talking battery capacity.
If you read the announcement, the folks in Berkeley had UPSes, but they didn't run long enough for everything to be gracefully shut down. I'm second-guessing what happened, but it's probably a combination of a UPS that's a little small, and old batteries. New batteries might have been enough.
> We're not talking typical desktop machines, but servers with multiple disks
> and multiple processors, and I'm not sure these would be suitable -- and I'm
> not talking output power, I'm talking battery capacity.
To run my 6-10 computers I use a 3000 VA UPS that will hold them up for about 15 minutes ... it cost about $1,800 ... and it is not actually a UPS, it is an SPS (standby power supply), which is even cheaper.
But your main point - that this protection is not a $100 purchase - is correct ...
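As a rough sanity check of runtime figures like that, here is a back-of-the-envelope estimate in Python. Every number in it is an assumed illustrative value, not the spec of my unit:

# Back-of-the-envelope UPS runtime estimate. Every number here is an
# assumed illustrative value, not the spec of any particular unit.
battery_voltage = 96.0     # e.g. eight 12 V blocks in series (assumption)
battery_capacity_ah = 7.2  # amp-hours per string (assumption)
inverter_efficiency = 0.85 # DC-to-AC conversion losses (assumption)
high_rate_derating = 0.75  # lead-acid capacity shrinks at high discharge
                           # rates (Peukert effect); rough factor (assumption)
load_watts = 1800.0        # 6-10 machines at roughly 200-300 W each (assumption)

usable_wh = (battery_voltage * battery_capacity_ah
             * inverter_efficiency * high_rate_derating)
runtime_minutes = usable_wh / load_watts * 60
print("Usable energy: %.0f Wh" % usable_wh)
print("Estimated runtime: %.0f minutes" % runtime_minutes)
# Prints roughly 15 minutes -- consistent with the 3000 VA unit above.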
It might interest some of you to see the museum pieces that SETI runs on - if you haven't already. ;)
Piccies
Be lucky,
Neil
>
> > If BOINC was made smart enough to not download WUs from projects that
> > wouldn't get CPU time in the next couple of days, we could set it up to
> > crunch WUs serially instead of in parallel like today. Then deadlines
> > wouldn't be a problem, even if we set the resource share to project
> > A=10000, B=1.
>
> Actually, it might be. The server knows how long it takes you to return work
> units based on past history, and it certainly seems that it is adjusting how
> much work it offers based on how fast stuff comes back.
>
> ... and if you are crunching more projects, keeping "days between connections"
> low seems like a good thing.
The problem is that you will always have at least one WU from each project you are attached to on your computer. It is the client that requests work, based on your "days between connections" setting; the servers only know about their own project. What I basically want to do is crunch one WU from start to finish. Yes, I know I can do this now, but the problem is that the WU from the other project was already downloaded before I started crunching this one.
My "days between connections" is set to 0.02 so I can return work before the deadline. Because of the way BOINC downloads work at the moment, I can't participate in multiple projects and still meet the deadlines of all of them.
When you're really interested in a subject, there is no way to avoid it. You have to read the Manual.
> To run my 6-10 computers I use a 3000 VA UPS that will hold them up for about
> 15 minutes ... it cost about $1,800 ... and it is not a UPS either, it is an
> SPS which is even cheaper.
On roughly the same size server farm I run a pair of 2200VA units. I managed to acquire them at a very good price because the batteries were dead.
For one of my UPSes, APC wants something like $600 for new battery cartridges, and all they are is standard AGM-type cells attached with double-sided sticky tape. I think I paid about $100 per UPS to replace them.
The automatic transfer switch shifts the whole load from one UPS to the other as needed.
> One more problem - and the report sounds like this was the problem - is that
> it is not enough to have a UPS: you must configure a safe automatic shutdown
> of all applications and the operating system before the UPS runs out, but not
> trigger it for every small outage. With all the different systems, some of
> them new, unfamiliar, and having issues, it is difficult to get this working
> and then test it carefully.
The way I read this was that they had the UPSes, had the shutdown software, but the batteries just didn't last long enough for things to come down gracefully.
Yep, having the equipment is only half of the problem. The other half is having documented operational procedures for regular, periodic testing of the system. I had a large customer (Broadwing) that tested almost everything - but not the transfer switch. Well, there was a power outage, the transfer switch failed, of course, and the huge data center sucked the UPS dry in 10 minutes. Lots of unhappy people. If the switch had at least failed in testing, they would have been ready for it, with everyone on standby.
You have to test EVERYTHING, REGULARLY, and have documentation on how to do it and what to do if it fails. Putting together an ROI in these cases is very easy.
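For example, the ROI argument can be as simple as this toy calculation (every figure invented for illustration):

# Toy ROI calculation for regular failover testing; every figure is invented.
# It assumes, for simplicity, that testing catches the failure in time.
outage_probability = 0.5        # chance of a utility outage in a given year
failure_rate_untested = 0.30    # chance an untested transfer switch fails
downtime_cost_per_hour = 50000  # revenue, penalties, staff time (dollars)
expected_downtime_hours = 8.0
annual_testing_cost = 20000     # staff time plus a maintenance window (dollars)

expected_loss = (outage_probability * failure_rate_untested
                 * downtime_cost_per_hour * expected_downtime_hours)
print("Expected annual loss if untested: $%.0f" % expected_loss)   # $60,000
print("Annual testing cost:              $%.0f" % annual_testing_cost)
print("Net expected benefit of testing:  $%.0f"
      % (expected_loss - annual_testing_cost))                     # $40,000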
> The way I read this was that they had the UPSes, had the shutdown software,
> but the batteries just didn't last long enough for things to come down
> gracefully.
Yes, that's what I was talking about. You have to configure a safe shutdown (possibly synchronized between servers) that runs without user action, then test how long it takes to shut down all operations, test how long the UPS holds the power, decide how large the safety buffer should be, and then decide how long to wait before starting the shutdown. Start your shutdown too early and every small outage of a few minutes may stop the project for much longer: the shutdown itself takes time, and it may be difficult to bring this combined network of integrated services back up automatically. That can mean waiting for the operators to begin work in the morning.
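In code, that decision boils down to something like the following sketch; the thresholds are placeholders you have to measure for your own systems:

# Sketch of the "when do we start the shutdown?" decision. The thresholds
# are placeholders -- you have to measure your own shutdown time and your
# own UPS runtime, then choose your own safety buffer.
import time

MEASURED_SHUTDOWN_SEC = 300  # how long a clean, synchronized shutdown takes
UPS_RUNTIME_SEC = 900        # measured battery runtime under full load
SAFETY_BUFFER_SEC = 120      # margin for aging batteries and bad luck

# Ride out short outages: wait this long on battery before committing,
# so an outage of a minute or two never takes the whole project down.
GRACE_SEC = UPS_RUNTIME_SEC - MEASURED_SHUTDOWN_SEC - SAFETY_BUFFER_SEC

def watch(on_battery, start_shutdown):
    # on_battery and start_shutdown are callables the caller supplies,
    # e.g. a UPS status query and a site-wide shutdown script.
    outage_started = None
    while True:
        if on_battery():
            if outage_started is None:
                outage_started = time.monotonic()
            elif time.monotonic() - outage_started >= GRACE_SEC:
                start_shutdown()  # point of no return
                return
        else:
            outage_started = None  # power came back: reset the timer
        time.sleep(5)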
And when all this is done well, you still have to check and test it again after every change. The batteries get older and don't last as long as before; one server gets a new hard disk, another more RAM and a new fan. The probability that your plan fails when it meets reality is not small. ;)
Many words, short meaning. ;) Protecting a server system like the one currently running SETI@home is not a matter of 100 or 1,000 dollars. You have to be lucky, too. ;)
Greetings from Bremen/Germany
Jens Seidler (TheBigJens)