Techniques I use for Managing an Excessive Number of Linux Hosts :-)

Gary Roberts
Moderator
Joined: 9 Feb 05
Posts: 5,870
Credit: 115,683,843,516
RAC: 34,691,125
Topic 223388

Recently, Mikey sent me a PM where he raised a particular question.  I tend to ignore personal help requests by PM since I don't have an abundance of time with nothing better to do.  If I'm going to put some effort into a reply, I'd rather do so in open forum where what I write can be 'peer reviewed' (so to speak) and challenged where necessary.  Others can offer counter opinions and the wider readership can perhaps benefit from the discussion.  Mikey was fine with me transferring the question and any response to a new thread here.

Mikey wrote:
I am STILL working on a distro of Linux I like, yes it's been a LONG time, but I think I like Linux Mint the best so far...my question is you have a bunch of computers all with the same OS etc on all of them do you clone the drives for a new setup/replacement hard drive or do a manual load for each pc?

I currently do have quite a "bunch", so I could potentially spend a huge amount of time managing the things Mikey mentions.  So, yes, trying to handle each one manually and individually is quite impossible.  The short answer is to automate as much as possible.  I thought I'd try to explain what works for me - and why - and share it with anyone else who cares to read it.

I don't regard myself as particularly 'computer literate' - 'weirdly obsessed' might be a better description.  I had no background training in IT or programming.  I did a degree in Metallurgical Engineering and worked in a University environment (a Mining and Metallurgical Engineering Department) for many years in a variety of roles.  In the late 1970s, the Department installed a DEC PDP 11/23 and, instead of the proprietary OS, a version of Unix was installed on it.  Later on, a Sun SPARCstation running BSD Unix was installed as a departmental minicomputer.  As a simple user on a dumb terminal, I was enthralled with what you could do with the Unix shell - nothing GUI about that.  That was where I first discovered what simple shell scripts and the beautifully simple but powerful Unix utilities could do.

I first got interested in volunteer computing in 1999 when I joined Seti.  When BOINC came along, I transferred my attention to Einstein in 2005.  As my interest grew, I looked for a Unix-like OS, and Linux was the obvious choice.  I (very briefly) tried an Ubuntu CD and hated it.  I tried Mandrake (which became Mandriva, and now there is Mageia) and OpenSUSE, liked Mandrake the best, and then I discovered PCLinuxOS - PCLOS for short.  The guy who started it was a Mandrake contributor, I believe.  PCLOS at that time (2006) was the new kid on the block.  I can remember being really blown away by the 2007 release of PCLOS.  I knew right then it really suited my temperament and requirements.

Between then and now, I've spent a lot of time improving my understanding of 'bash', the Linux implementation of the traditional Unix (Bourne) shell.  As the number of hosts has grown, I've had a lot of satisfaction from teaching myself how to write various control and management scripts.  Mikey's question relates to new installs, or repair installs after a disk failure, so I'll document exactly what I do in those situations.

His key question was, "clone?" or "manual load?".  The answer is "Neither!".  When you do a Linux install, there can be quite a lot of extras, updates or configuration changes needed to get everything settled just the way you like it.  Making a clone (i.e. an exact copy) of an existing setup to put on the next machine might seem attractive, but there are some things that should be unique, and unless the hardware is identical there may well be driver-related problems.  Loading from the original install medium isn't attractive either, because you still have that potentially long, error-prone road to your ultimate preferred configuration.  The solution is called 'remastering'.  You take your 'just right' final setup and turn it into a new installer, in the form of an ISO image which you can put onto whatever install medium (optical or flash) you prefer.

The remastering process handles stuff that needs to be unique and includes all the hardware detection routines that come with the original install media but with all your additions and customisations already done.  So when you install from a remaster you get an exact duplicate of the setup that existed on the machine you remastered and during the install all the potential 'gotchas' from hardware differences, etc, get sorted out automatically.

So far this year, I have done a huge number of installs from a remaster I made back in January.  I've been lazy and haven't made a new remaster since then.  When I want to do an install, I boot the remaster and use gparted (a GUI partitioning utility included on it) to partition the disk the way I want.  The basic install process then takes less than 5 mins.  The thing that takes the most time is applying the updates that have arrived since January.  That can take from 10 minutes to perhaps half an hour, depending on how many there are and how new (or ancient) the machine is.  The number of updates is getting quite large now, so I need to make a new remaster to cut out the bulk of that time spent updating.
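
For anyone curious what that update step looks like, PCLOS uses APT (the apt-rpm variant) and Synaptic under the hood, so from a root shell it boils down to something like the two commands below.  This is just a sketch - Synaptic does exactly the same thing graphically.

    # refresh the package lists from whichever repo is configured
    # (in my case, the local mirror described further down)
    apt-get update
    # then pull in everything that has changed since the remaster was made
    apt-get dist-upgrade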

I then run a special script that does further tweaking and checking which is tedious to do manually and quite easy to overlook: everything from setting specific host names and static IP addresses to checking the ssh configuration, so that all hosts have proper communications with a central server machine where I cache a lot of BOINC and EAH related files.  This can save a lot of unnecessary downloads of specific files that may change quickly.  I choose not to have potentially transient stuff on the remaster.  Editing a BOINC template to set a particular host ID is also done here.  There is also a script for checking and upgrading the installed version of the OpenCL libs.  The remaster has a version from last year, but I can easily upgrade to the recently released 20.30 version of amdgpu-pro.  PCLOS doesn't package that, so I steal the bits I need from the Red Hat version that AMD provides.
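
To give a flavour of what that post-install script does, here is a rough sketch.  The host name, IP address, gateway, paths and the BOINC template file name are made-up examples, and the /etc/sysconfig file layout is an assumption based on the old Mandriva-style setup that PCLOS inherited - adjust to suit whatever your distro actually uses.

    #!/bin/bash
    # illustrative post-install tweaks - run as root on the freshly installed host
    NEWHOST="host42"           # example host name
    NEWIP="192.168.1.42"       # example static address
    HOSTID="12345678"          # the EAH host ID this machine should resume as

    # set the host name (Mandriva/PCLOS style sysconfig file)
    sed -i "s/^HOSTNAME=.*/HOSTNAME=$NEWHOST/" /etc/sysconfig/network
    hostname "$NEWHOST"

    # give the first NIC a static address instead of DHCP
    printf '%s\n' \
        "DEVICE=eth0" \
        "BOOTPROTO=static" \
        "IPADDR=$NEWIP" \
        "NETMASK=255.255.255.0" \
        "GATEWAY=192.168.1.1" \
        "ONBOOT=yes" > /etc/sysconfig/network-scripts/ifcfg-eth0

    # quick check that key-based ssh to the central server works
    ssh -o BatchMode=yes server 'echo ssh OK' || echo "ssh to server needs attention"

    # drop the required host ID into the BOINC state template
    # (the template path and name are purely illustrative)
    sed -i "s|<hostid>.*</hostid>|<hostid>$HOSTID</hostid>|" \
        /home/gary/BOINC/client_state_template.xml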

PCLOS is a rolling release, unlike the 'buntus which have fixed release schedules.  The project does produce regular, official ISOs, but if you update your existing install, you'll already have all that is in the new ISO without having to disturb your special setups and configurations.  The philosophy is essentially install once, configure once, update forever.  With the number of machines I have, a pre-configured remaster is an absolute necessity.

One of my home-built scripts has the job of keeping a local repository mirror up to date with the full PCLOS repository.  I have a 1.5TB external USB drive that holds dated copies of the full repository from 2014 up to just last week.  About 3 times a year, when everything looks nice and stable, a new dated copy gets created.  If something were to go wrong and I wanted to install a system as it stood at a particular stable date, then I can.  The local repo is on a USB 3 device, so installing hundreds of updates is extremely fast compared with doing so over the internet.
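
The mirror script itself is nothing fancy - at its heart it is just rsync against one of the public PCLOS mirrors, plus an occasional snapshot for the dated copies.  A rough sketch only: the mirror URL and local paths below are placeholders, not the ones I actually use, and the hard-link trick (cp -al) is just one way of keeping the dated copies from eating the whole drive.

    #!/bin/bash
    # keep the local PCLOS repo copy in sync (mirror URL is a placeholder)
    MIRROR="rsync://some.pclos.mirror/pclinuxos/"
    REPO="/mnt/usb3/pclos-repo/current"

    rsync -av --delete "$MIRROR" "$REPO"

    # about 3 times a year, take a dated snapshot; cp -al hard-links unchanged
    # files, so each snapshot costs very little extra space
    SNAP="/mnt/usb3/pclos-repo/$(date +%Y%m%d)"
    [ -d "$SNAP" ] || cp -al "$REPO" "$SNAP"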

I don't know how other distros handle remastering, but PCLOS's tools are internally created by some pretty skilled people.  There are 3 particular utilities.  The first is 'mylivecd', a CLI utility with command line options that control what gets included in the output ISO image.  One of the talented PCLOS forum regulars created a GUI front end called 'mylivegtk', which allows the user to point and click at the required options and then launch mylivecd with all the correct options - no CLI needed.
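
The basic CLI usage is about as simple as it gets - run it as root on the machine you want to capture and point it at an output file.  There are options for excluding directories, choosing compression and so on, but they are best checked against mylivecd --help on your own system, so this is just the bare-bones form.

    # run as root on the 'just right' machine; the output path is up to you
    mylivecd /home/gary/remaster-$(date +%Y%m%d).iso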

The third utility is called 'myliveusb', which allows a user to take a PCLOS ISO, either an official one or a remaster, and create an installer on a USB device like a flash stick or an external HDD.  It can add further ISOs over time, up to the storage capacity of the device.  It can also use spare space on the device to save stuff from the host receiving the new install, and then put it back after the install has been done.  I quite often use this to save the complete BOINC tree on a host, do a fresh install, put the BOINC tree back and have it continue on from where it left off.  Remastering tools are probably specific to a given distro, so you would need to explore what your distro of choice offers.
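
The save-and-restore part of that is conceptually nothing more than copying the BOINC tree onto the spare space and copying it back afterwards.  myliveusb wraps it all up nicely, but done by hand it would look roughly like this - the paths and mount point are purely illustrative.

    # before the reinstall: stop the BOINC client, then stash the whole tree
    # on the USB installer's spare partition
    rsync -a /home/gary/BOINC/ /mnt/usb-spare/BOINC-saved/

    # ... do the fresh install from the remaster ...

    # after the reinstall: put the tree back; on restart the client carries on
    # from exactly where it left off
    rsync -a /mnt/usb-spare/BOINC-saved/ /home/gary/BOINC/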

Here is a classic example of why remasters are useful to me.  I have a recent new build into which I installed an old AMD HD7850 GPU that was left over from an RX 570 upgrade on an existing machine.  This new build had been running without issue for several weeks and then, all of a sudden, it shut itself down.  A monitoring script reported the problem and I tried to reboot it.  It immediately crashed during startup.  Except for the GPU, the machine had all new components, including a 120GB SSD.  I retried the reboot, but from my remaster USB stick rather than the internal SSD.  Everything booted fine, so I ran the file system consistency checker (fsck) utility and was greeted with lots of damage, mainly to the home partition on the SSD.  fsck 'repaired' the damage, so I tried rebooting (this time successfully) from the SSD.  After rebooting, I discovered that the complete BOINC tree was missing.  I'd never seen that happen before, but I suspected it must have been part of the damaged stuff that fsck had 'repaired'.
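
For anyone who hasn't done this before, the check has to be run from somewhere other than the damaged disk (hence booting from the remaster stick) and against unmounted partitions.  The device name below is just an example - use whatever your home partition actually is.

    # booted from the USB stick, the internal SSD is not mounted, so it is
    # safe to check; -f forces a full check even if the fs looks 'clean'
    fsck -f /dev/sda6

    # or answer yes to every repair prompt automatically (use with care)
    fsck -fy /dev/sda6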

To cut a long story short, I did find partial bits of the BOINC tree in the /lost+found directory where fsck had put the bits it couldn't reattach to the damaged filesystem. I was even able to find the einstein.phys.uwm.edu directory but there were lots of bits missing.  So the plan was to do a complete reinstall from my remaster to get a clean set of files and then use my post-install scripts to edit the BOINC template to have the host ID set to match that of the one that had gone belly up.  Of course, that would be after finding (if possible) what caused the problem in the first place so there could be no repeat :-).

I got lucky there.  I noticed the heat pipes in the GPU were quite hot, even though the machine was essentially just idling.  The fan on the GPU was spinning but, on closer inspection, a bit on the slow side, so the cause became obvious.  Under crunching load, the temperature must have gone through the roof.  I removed the failing fan and attached a server fan to the opening where the old fan had been, and now the card runs nice and cool :-).  After a quick install from my remaster, with the edits in place to reconstruct the host ID, the machine contacted the scheduler and all the lost work came flooding back courtesy of that very nice "resend lost tasks" feature.

I probably spent far too much time looking through the huge stack of stuff in lost+found/, as I'd never tried that before and was interested in what might be retrievable.  The answer is, "probably nothing much, it's there for a reason, it's all damaged, stop wasting time, get on, find the problem, install the remaster and get all the lost tasks back." :-).  The machine has now been crunching for well over a day and everything is fine with it.

This post has become far larger than I originally intended - sorry about that :-).  Hopefully any reader silly enough to be still reading may have found something useful on the long journey to here :-).

Cheers,
Gary.

mikey
Joined: 22 Jan 05
Posts: 12,531
Credit: 1,838,580,393
RAC: 3,628

Thank you very much!!!! I will look and see if I can figure out whether that works with my favorite Linux distro right now, Linux Mint.  I am using the 'clone' method at the moment but, as you said, different hardware etc can cause problems, and updates are also an issue.  Cloning also has the problem of not being able to use a smaller drive than the original, so I have to be careful which drive I use as the original; your system would eliminate that altogether.

MarkJ
Joined: 28 Feb 08
Posts: 437
Credit: 139,002,861
RAC: 62

I only have a small fleet of 14 x64 machines and a bunch of Raspberry Pis.

I clean install each of the x64 machines from a thumb drive which has the OS installed.  I have a set of instructions in a Word document that I follow, and once the OS is installed it's mostly a matter of cutting and pasting into an ssh session.

My OS of choice is Debian.  They do new releases about every 2 years and have "point" releases about every 3 months.  I have a caching proxy server (running squid) that helps with the bandwidth there.  Once one machine has installed an update it's in the cache, and the rest of them use the cached version.  It also works nicely for caching the gravity wave data files.  This server doubles as an ntp server for the rest of the network.
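
Pointing apt at the squid cache only needs one small config file on each Debian client.  The proxy host name below is made up for this example, and 3128 is just squid's default port.

    # on each Debian machine, tell apt to fetch packages through the caching proxy
    echo 'Acquire::http::Proxy "http://proxybox.local:3128";' | \
        sudo tee /etc/apt/apt.conf.d/01proxy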

For the Pis I have two support nodes.  One is an NFS server with an external USB hard drive and the other is a caching proxy server with an external USB SSD.  I clean install the Raspbian or Raspberry Pi OS Lite version onto an SD card and, after running the initial raspi-config to set things up, I copy the relevant files off the NFS server.  Again the instructions are in a Word document, so it's a bunch of cut and paste commands.  They only take about 10-15 minutes to install.  They are usually fine until the SD card wears out in a year or two.  At that point I replace the SD card and repeat the process.  Updates are a bit more of a pain due to the number of them.  It's usually a process of logging into each one via ssh (they are all headless) and doing a "sudo apt update && sudo apt upgrade -y && sync && sudo reboot" on each one.  This is where the proxy server really helps.
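
Since the update command is identical on every Pi, that round of ssh logins can be wrapped in a simple loop run from any machine that has key-based ssh access to them all.  The host names here are made up.

    # run the same update/reboot sequence on each headless Pi in turn
    for h in pi01 pi02 pi03; do
        echo "=== $h ==="
        ssh "$h" 'sudo apt update && sudo apt upgrade -y && sync && sudo reboot'
    done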
