Or maybe an alien radio signal from outer space ^^ so we could beat the SETI guys to it.
Ha!
That would be something... Searching for gravitational waves and finding that ET uses them for communication... I guess this would provide two answers at the same time. :D
I don't know if you'd like to hear this, but in my world
the best thing to do is to have a backup project so the computers
don't get downtime. There should be some project out there that
could get a 1% share of your computer time :)
Anders n
Dear Anders,
Sorry for the belated reply but the last couple of outages have kept me rather busy with some "farm management" issues :).
It might surprise you to know that I actually have backup projects so that is not the issue. The issue is that I have very specific goals as to where my crunching cycles are to be directed and I don't wish to have to micromanage BOINC in order to undo actions that BOINC takes which are tending to defeat those goals.
Luckily, by a stroke of good fortune, one of my preferred backup projects, LHCatHome, just happened to have a bunch of work in the nick of time. Any of you who follow LHC will know just how patchy that project has been since about last June. At just the right time, I got a swag of units which just about lasted until EAH came back up about 15 hours ago.
Of course, as the LHC work was running out, all my machines were in their 1 week coma for EAH so I had to manually update them all in order to allow EAH to download new work. Otherwise I would have been "wasting" my cycles on my third project which is not something I wish to do if my primary project has work.
Hopefully I have explained my situation a little better so that you won't feel the need to tell me to get a backup project :).
Cheers,
Gary.
Actually, for all v5.2.14 and earlier clients, the backoff was 2 weeks; backoffs were introduced on 19.12.2002, which is pre-v0.14.
This was decreased to only 1 week on 06.12.2005, for v5.3.3.
And finally, it was decreased again to 1 day on 29.07.2006, for v5.5.10 and later.
Yes, backoffs of one form or another have been around since virtually the year dot. I should have phrased my statement better by saying that the value of specifically 1 week seemed to have been set somewhere in the 5.3.x series because I had just a couple of boxes on 5.2.13 which didn't seem to go into a long backoff. The max I ever saw on those was about 1-2 days at most. Once both the scheduler and webserver are down, the 1 week backoff is generated extremely rapidly in 5.4.11.
With 5.4.11 as the recommended stable version, it's hard to avoid the one week coma. I've started gradually trialling 5.8.x as one possible way of avoiding this issue in the future (having just completed yet another "rescue" of comatose boxes :). I'm also in the process of writing a script along the lines you have suggested - thanks very much for that most useful suggestion. I'd never previously bothered to do the research to find out just why we were given a boinccmd.exe. I should have been looking at that a long time ago :).
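For anyone wanting to check which boxes are exposed to the long deferral, the quoted timeline can be written down as a small lookup. This is only a sketch in Python built from the three dates and versions quoted above; the function names are made up for illustration, versions between 5.2.14 and 5.3.3 are not covered by the quote and so return None, and it does not account for the observation that some 5.2.13 boxes seemed to max out at 1-2 days.

```python
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, int, int]:
    """Turn a client version such as '5.4.11' into a comparable tuple."""
    major, minor, release = (int(part) for part in text.split("."))
    return (major, minor, release)


def max_backoff_days(version: str) -> Optional[int]:
    """Maximum backoff in days according to the quoted timeline.

    Returns None for versions the quoted timeline does not mention.
    """
    v = parse_version(version)
    if v >= (5, 5, 10):
        return 1    # decreased to 1 day for v5.5.10 and later
    if v >= (5, 3, 3):
        return 7    # decreased to 1 week for v5.3.3
    if v <= (5, 2, 14):
        return 14   # 2 weeks for v5.2.14 and earlier
    return None     # between 5.2.14 and 5.3.3: not stated in the quote


if __name__ == "__main__":
    for client in ("5.2.13", "5.4.11", "5.5.10"):
        print(client, max_backoff_days(client))
```

On those figures, the recommended 5.4.11 sits in the one-week band, which matches the "one week coma" described above.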
Quote:
Basically, the scheduling server going down isn't a problem; it's only if the web server also goes down that you'll be hit with the long deferral. So, using a separate web server is an advantage, and if you can also put it on a separate internet connection and power source, even better.
Yes, with EAH, it's basically one down - all down.
Cheers,
Gary.
- Then I'd assume Gary could sit at home on HOME_BV managing the home subnet directly using its BOINCView, and could open up a Remote Desktop Connection (so definitely XP Pro required) to WORK_BV and thus use its BOINCView installation for those 180+ other PCs at work.
Just a tip about remote desktop connectivity: there is a very good, stable, free program available, called RealVNC, that allows for Windows service installation and password lockout.
A note on RealVNC, the free version: it does not allow password encryption. You can get around the security issues by setting the remote host firewall to reject all unknown connections to the ports it uses (I think 5801 {HTTP} and 5901 {VNC}, if I recall right).
If using 2 batch files, #1 would basically be list_of_boinc_computers.cmd:
call update_boinc.cmd computername0001 password0001
call update_boinc.cmd computername0002 password0002
...
rem continue listing all computers and their passwords
This is actually quite easy for me to do because I've been quite structured (and therefore very dull) in my choice of computer names :). For example, I have about 40 HP Vectras so guess what I called them :). vectra-01, vectra-02, ..., vectra-40. I have about 50 HP e-PCs and, you guessed it, e-pc-01, e-pc-02, ..., e-pc-50, etc, etc. All machines now have exactly the same password and it isn't 32 chars long :). I was silly enough to let each one keep its unique 32 char random string for a while until I wised up :). It should be quite easy to create comps.cmd, even more so as I can put the single password in update_all.cmd.
Quote:
#2 is update_boinc.cmd, and includes 1 line:
boinccmd --host %1 --passwd %2 --project http://einstein.phys.uwm.edu/ update
I found the page on the BOINC website that describes all the switches for boinccmd.exe so the above is perfectly clear, thanks very much for prompting me to go find the info.
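Since the names follow such a regular pattern, the host list does not have to be typed out by hand. Below is a rough Python sketch of the same two-batch-file idea, not the batch files themselves: it generates the numbered names and builds the boinccmd argument list from the quoted one-liner. The function names and the placeholder password are invented for illustration, and it only prints the commands rather than running them.

```python
from typing import Iterable, List, Tuple

# Project URL taken from the quoted update_boinc.cmd line.
PROJECT_URL = "http://einstein.phys.uwm.edu/"


def host_names(prefix: str, count: int, width: int = 2) -> List[str]:
    """Return numbered names such as vectra-01 ... vectra-40."""
    return [f"{prefix}-{n:0{width}d}" for n in range(1, count + 1)]


def update_command(host: str, password: str,
                   project_url: str = PROJECT_URL) -> List[str]:
    """Build the same arguments as the quoted one-line batch file."""
    return ["boinccmd", "--host", host, "--passwd", password,
            "--project", project_url, "update"]


def all_update_commands(families: Iterable[Tuple[str, int]],
                        password: str) -> List[List[str]]:
    """One update command per host, for every (prefix, count) family."""
    commands = []
    for prefix, count in families:
        for host in host_names(prefix, count):
            commands.append(update_command(host, password))
    return commands


if __name__ == "__main__":
    # Counts are the approximate figures mentioned in the post.
    farm = [("vectra", 40), ("e-pc", 50)]
    for command in all_update_commands(farm, "shared-password"):
        print(" ".join(command))  # dry run: print instead of executing
```

To actually send the updates, each argument list could be handed to subprocess.run, assuming boinccmd is on the PATH of the machine running the script.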
Quote:
Batch-file #1 will run batch-file #2 for each computer, and therefore can very quickly update all computers.
I'm toying with the idea of working out how to automatically examine the local network so as to make a list of all currently active hosts, then make a smaller list of those currently stuck in deferral and just update that subset. It's probably not worth the effort to work out how to do that though :).
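The "currently active hosts" half of that idea can be approximated without scanning the whole network: just try to open a TCP connection to each known hostname on the port the BOINC client listens on for boinccmd. A minimal Python sketch follows, assuming the client uses the default GUI RPC port 31416 and that remote access is already allowed on each box; the function names are invented for illustration. Working out which of those hosts are stuck in deferral would need the output of boinccmd to be parsed, which is not attempted here.

```python
import socket
from typing import Iterable, List

# Assumed default GUI RPC port of the BOINC client.
BOINC_GUI_RPC_PORT = 31416


def is_reachable(host: str, port: int = BOINC_GUI_RPC_PORT,
                 timeout: float = 1.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unknown hostnames.
        return False


def active_hosts(candidates: Iterable[str],
                 port: int = BOINC_GUI_RPC_PORT,
                 timeout: float = 1.0) -> List[str]:
    """Filter a list of hostnames down to those that accept a connection."""
    return [host for host in candidates if is_reachable(host, port, timeout)]
```

Because it connects by hostname, the check keeps working when DHCP hands out new IP numbers. With close to 200 machines and a one second timeout, a sequential pass could take a few minutes if many boxes are switched off.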
Quote:
So, the time-consuming part is making the list of all the computer names or, possibly easier, the IP numbers.
No, not IP numbers as DHCP running on the ADSL modem/router insists on changing these from time to time, particularly if I have to power cycle the router. When attempting to manage close to 200 machines, the router freaks out occasionally and has to be rebooted. I've never had any trouble attaching to a host using its actual hostname, even if the IP has changed.
Quote:
If all computers use the same password, you can put it directly in #2 instead of %2, and you don't need to list password0001 and so on in batch file #1.
Oh, and the advantage of using 2 batch files is that you can easily change #2 to also do other things. ;)
Yes, all this is very clear, thanks very much. Your advice is much appreciated.
Cheers,
Gary.
Hmm, use boinccmd, and make a batch file, or 2... Oh, and the advantage of using 2 batch files is that you can easily change #2 to also do other things. ;)
Here's my two cents:
- Gary has two separate subnets, ~180 PCs at work and ~20 PCs at home.
- Set up BOINCView on ONE computer per subnet, call these machines 'WORK_BV' and 'HOME_BV' for this discussion. There will be considerable jiggery-pokery with 'gui_rpc_auth.cfg' and 'remote_hosts.cfg' on EACH computer in EACH subnet - but it's do-able and only needs to be done once....
....
much useful and interesting info snipped
....
Mike,
Thanks very much for taking the time to do all that.
I used BOINCView extensively when my farm had about 20 or so machines. By the time the number had reached about 30-40, the machine on which the app was running was brought to its knees and the app started locking up at regular intervals. I had to abandon BOINCView. Maybe there are now more recent versions which scale better.
Until very recently, I've had little need to monitor each box. During 2006, most machines ran for weeks or months at a time with no intervention from me and that's exactly how I like it. My daily ritual was to look at the list of computers on the website and scroll down to the bottom of the page. Since the boxes are listed in order of last contact with the website, those at the bottom represented a possible problem if the last contact interval was unusually long. I would only look at those which represented a possible problem. I would actually be quite oblivious to the goings on for the vast majority of machines that were just happily crunching away with normal contact intervals.
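That daily check boils down to "sort by last contact and look at the stragglers". As a rough Python sketch, assuming the last-contact times have somehow been collected into a dictionary (how to get them off the website is not covered here, and the names and the two day threshold in the example are made up):

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def overdue_hosts(last_contact: Dict[str, datetime],
                  now: datetime,
                  max_gap: timedelta) -> List[Tuple[str, timedelta]]:
    """Hosts whose last contact is older than max_gap, longest gap first."""
    gaps = [(host, now - seen) for host, seen in last_contact.items()]
    overdue = [(host, gap) for host, gap in gaps if gap > max_gap]
    return sorted(overdue, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    now = datetime(2007, 1, 10, 12, 0)  # arbitrary example time
    seen = {
        "vectra-01": now - timedelta(hours=6),
        "vectra-02": now - timedelta(days=3),
        "e-pc-01": now - timedelta(days=8),
    }
    for host, gap in overdue_hosts(seen, now, timedelta(days=2)):
        print(host, gap)
```

Only the boxes past the threshold are listed, so the vast majority that are crunching normally never show up.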
Recently, problems have arisen because of the one week backoff. Ingleside's suggestion of using boinccmd is very doable so I'm about to give it a whirl. Murphy's Law being what it is, as soon as I've finished setting it all up and have tested it to ensure it all works, the servers will suddenly develop enormous stability and run without a hiccup for the next six months :).
I wish!!! :).
Cheers,
Gary.
Just an idea, not tested .. not even thought about it too much.
What if you let your computers talk with the project servers for just one hour each day? If they talk only within that timeframe, the 7-day delay will maybe show up a lot later. I have checked only one thing: my Apple can retrieve the maximum of 144 WUs within half an hour. So I expect that you would still get enough WUs to crunch.
Another plus: you just need to change one (or maybe 3) settings on the web.
Interesting idea -- thanks for making the suggestion. For my setup, there are a couple of negatives I can think of.
*Cache size. This would need to be increased to cover the possibility that contact might not be able to be established in the small daily window of opportunity.
*Backup projects. My guess is that these would be a lot harder to manage than currently is the case. Particularly since LHC is my strongly preferred backup from which I very much like to grab work when it is available.
*Spotting misbehaving machines. My current strategy (explained in a previous post) would probably not work as well.
*Bandwidth saturation. Imagine 200 machines all trying to get new work and perhaps many trying to download new 15.7MB data files all in that single window of opportunity.
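To put a rough number on the bandwidth point, here is a back-of-envelope calculation in Python. It assumes the worst case, where every one of the 200 machines needs one new 15.7MB data file inside the one hour window from the suggestion, and it ignores the work units themselves and any protocol overhead.

```python
def total_download_mb(machines: int, file_mb: float) -> float:
    """Total data to fetch if every machine needs one data file."""
    return machines * file_mb


def required_mbit_per_s(machines: int, file_mb: float,
                        window_s: float) -> float:
    """Average link speed needed to finish within the window."""
    total_megabits = total_download_mb(machines, file_mb) * 8
    return total_megabits / window_s


if __name__ == "__main__":
    # 200 machines, 15.7 MB each, one hour window (worst-case assumption).
    print(round(total_download_mb(200, 15.7)), "MB in total")
    print(round(required_mbit_per_s(200, 15.7, 3600), 2), "Mbit/s sustained")
```

That works out to roughly 3140 MB, or close to 7 Mbit/s sustained for the whole hour.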
Actually, on balance, I think I'll stick with the devil I know :).
But many thanks for the suggestion anyway.
Cheers,
Gary.