Cleaning house (and ssh known_hosts)...
By mrbill on Jun 24, 2008
One issue that nagged us for quite a while during this project was cleaning up the provisioning servers (JET/Jumpstart or N1SPS/N1SM). Since we were developing in a lab environment, our provisioned and JET'd hosts were re-installed hundreds of times over the course of the project. As part of the installation of a target host and integrating it into the management framework, several user accounts use ssh to pop into the machine and twiddle bits (add packages, change configuration files, move stuff around, start services, etc.). I thought I would add this blahg entry so that others might Google the symptoms and find some relief.
Have I mentioned that Google is my favorite debugging tool? If you are having issues, odds are that you are not the first to stumble upon them. Even if it is a "silly user error", someone has probably asked about that exact (or a very similar) error message or symptom in some newsgroup or support alias, or documented it in an FAQ somewhere. Even if the answer is buried somewhere on docs.sun.com, Google will likely find it for you.
The symptom that we were seeing is rather simple. When a host is installed (or re-installed), ssh-keygen is run on first boot to generate the host keys needed for SSH to work. Since this machine is re-using an IP address / hostname that has already been seen as a "real machine", remote machines trying to connect to it will already have an entry in ~/.ssh/known_hosts associated with this IP address, and that entry still holds the old host key. This causes an interesting error message when you try to SSH into the machine, informing you that there is a key mismatch, and that someone is possibly doing something very evil and spoofing the target machine. Connection closed.
Simple answer: when you are going to re-install or re-provision a machine, just "cd" all over the place and delete all of the known_hosts entries for that IP address. Yeah, like we are going to remember to do that, or remember all of the accounts that have used SSH to access that machine. With 7-8 re-installs happening every day in our environment, this was ugly. We had failures 2-3 hours into the provisioning testing with N1SPS/N1SM because application stuff would fail to install/configure with SSH errors. Lots of wasted time.
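To get a feel for the "remember all of the accounts" problem, here is a rough sketch that lists which known_hosts files still mention a given IP address. The function name list_stale and the idea of scanning one base directory of home directories are assumptions made for illustration, not something from the original setup. It only sees plain-text entries; hashed entries (OpenSSH's HashKnownHosts) would not match.

```shell
#!/bin/sh
# Sketch only. list_stale is a hypothetical helper name.
# Usage: list_stale IP BASEDIR
# Prints every BASEDIR/<user>/.ssh/known_hosts that has an entry
# for IP, either as "IP ..." or as "hostname,IP ...".
list_stale() {
    # Escape the dots so "10.0.0.5" does not match "10a0b0c5".
    ip_re=$(printf '%s' "$1" | sed 's/\./\\./g')
    for f in "$2"/*/.ssh/known_hosts; do
        [ -f "$f" ] || continue
        # Optional "something," prefix, then the IP, then a space or comma.
        if grep "^\([^ ]*,\)\{0,1\}${ip_re}[ ,]" "$f" >/dev/null; then
            echo "$f"
        fi
    done
}
```

Something like list_stale 10.0.0.5 /export/home (both values are placeholders) would then print one path per account that needs cleaning.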
So here is my simple answer. A stupid little shell script that is run on the JET / Jumpstart server and/or the N1SPS/N1SM server in the lab when re-using a target system and IP. Yes, you can put this into JET to run when a "make_client -f" is run. Yes, you can have N1SPS/N1SM run this automagically as part of preparing to provision a new host. Yes, this is a really simple task and a really simple shell script, but if it saves someone 2-3 hours on failed provisioning testing, then I am happy to waste bandwidth to stick it on my blahg.
#!/bin/sh
#
#
# 1.0 - Bill Walker
Just edit "USERS" to be the list of users that you want to clean out and away you go. This script cleans out entries that have the IP address at the beginning of the line, and entries that have "hostname,IP". It saves off a copy of the known_hosts file in ~/.ssh/known_hosts.orig just in case something goes awry. I could modify this to go through /etc/passwd or something like that to get the USERS list, but what do you want for a quick hack written at 11pm in a hotel room in the middle of the desert? As always, your mileage may vary, objects in mirror may be closer than they appear, and (of course), there is no warranty expressed or implied. Use at your own risk. I hope it saves someone some aggravation and time.
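For anyone who wants a starting point, here is a minimal sketch of a script that behaves the way the paragraph above describes. It is not the original script. The function names, the default USERS value, and the /etc/passwd lookup for home directories are all assumptions, and it expects a POSIX-style sh. Like the description, it only handles plain-text entries that start with the IP or have the "hostname,IP" form.

```shell
#!/bin/sh
# Sketch only, not the original script from this post.
# USERS is a placeholder list; edit it for your site.
USERS="${USERS:-root}"

# clean_known_hosts FILE IP
# Saves FILE as FILE.orig, then drops entries for IP.
clean_known_hosts() {
    file=$1
    # Escape dots so they match literally in the sed patterns.
    ip_re=$(printf '%s' "$2" | sed 's/\./\\./g')
    [ -f "$file" ] || return 0
    cp -p "$file" "$file.orig" || return 1
    # First pattern: IP at the beginning of the line.
    # Second pattern: "hostname,IP" style entries.
    sed -e "/^${ip_re}[ ,]/d" \
        -e "/^[^ ]*,${ip_re}[ ,]/d" \
        "$file.orig" > "$file"
}

# clean_users IP
# Runs clean_known_hosts for every account listed in USERS.
# Assumption: home directories can be read from /etc/passwd.
clean_users() {
    for user in $USERS; do
        home=$(grep "^${user}:" /etc/passwd | cut -d: -f6)
        [ -n "$home" ] || continue
        clean_known_hosts "$home/.ssh/known_hosts" "$1"
    done
}
```

Calling clean_users with the IP address of the host about to be re-installed would then clean each listed account. Accounts that come from a naming service rather than /etc/passwd would be skipped by this sketch.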
Oh yeah, any comments about how my shell scripts look like old school top-down FORTRAN are definitely >/dev/null 2>&1. I'm old, I write linear code, it (mostly) works.