Daaaaaaaad, the computer isn't booting
By Alan Hargreaves-Oracle on Apr 06, 2009
These were the words that my 10 year old boy yelled to me on Sunday.
I'm documenting this because I tried to imagine facing it as an end user, rather than as a Solaris kernel support engineer, and shuddered.
The machine he is talking about is the OpenSolaris box that I installed for them and recently upgraded to SRU4 on Friday (more on supported updates shortly).
The box (silver) had been sitting at the boot load screen (those of us old enough to remember the original Battlestar Galactica would refer to it as the cylon screen) with the disk light hard on and the disk rattling away threatening to send itself off into hyperspace. It had apparently been like this for a few hours (he lost interest and went to watch TV before thinking to tell me).
He'd tried resetting it and it didn't help.
"OK", I thought, "I've just upgraded the box, maybe there was a problem with SRU4, let's boot into the prior boot environment." Easy enough to do, just reset and select the prior environment in grub.
No dice. Same issue.
Failsafe boot? No, it appears that we don't have one of those.
Right, a single user boot. I want to have a look at what is going on on the console, so we need to get rid of the graphics crud at the start.
This isn't too hard. Have a look at the options in the text boot entry to be sure, but all I did was hit 'e' (edit) in grub, 'd' (delete) the splash and graphics lines, 'e' (edit) the unix line to take the ",Graphics... " stuff off the end of the command, hit Enter to go back a screen, then hit 'b' (boot) and watch what happens.
I didn't have to wait long.
Let me give you a little more background on this machine. It really is scrounged together. The root pool consists solely of a 4GB disc removed from an Ultra 10.
The root zpool was 100% full. The disc full messages scrolled for a while.
OK, once we waited for a few minutes we got the prompt asking for a login name and password to drop us to a root single user. OK, let's go looking for where the space issue is.
'zfs list' showed me that rpool/export/home was a little larger than I expected. Unfortunately, as the pool was full, I couldn't mount it. No worries, let's poke around on / to try to find something to remove to make enough space so we can mount things.
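For anyone following along, a per-dataset view of the pool can be sketched roughly like this. The wrapper name show_usage is made up for illustration, and the column selection is just one reasonable choice, not necessarily what I typed at the time. It only calls zfs if the command exists, so it is harmless on a box without ZFS.

```shell
#!/bin/sh
# show_usage POOL: hypothetical helper that lists every dataset in POOL
# with its space usage and mountpoint. POOL defaults to rpool, the pool
# name used in this post.
show_usage() {
    pool=${1:-rpool}
    if command -v zfs >/dev/null 2>&1; then
        # -r recurses into child datasets, -o picks the columns to print
        zfs list -r -o name,used,available,referenced,mountpoint "$pool"
    else
        echo "zfs not found on this system" >&2
        return 0
    fi
}
```

Calling show_usage rpool from the single user shell would show which dataset is holding the space, and an available column at or near zero confirms the pool is full.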
A good place to look for such space on a workstation is in /var/log, specifically the Xorg logs.
Let's remove one of those, ....
Copy on write, .... In order to unlink a file we need to write a new block for the directory entry. Oops, no free blocks.
The trick is to lose the space without having to rewrite the directory entry. We need to truncate one of the logs.
# : > Xorg.0.log.old
Much better. For good measure I zapped Xorg.0.log as well.
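If you want to convince yourself what ': > file' actually does, here is a minimal sketch on a scratch file. The temporary file is a stand-in for the Xorg log, and this only shows the truncation idiom on an ordinary filesystem; it does not reproduce the full-pool condition.

```shell
#!/bin/sh
# Scratch-file demonstration of truncating in place with ': > file'.
log=$(mktemp)                       # hypothetical stand-in for Xorg.0.log.old
echo "some old log output" > "$log"

inode_before=$(ls -i "$log" | awk '{print $1}')
size_before=$(wc -c < "$log" | tr -d ' ')

: > "$log"                          # truncate: same file, same directory entry

inode_after=$(ls -i "$log" | awk '{print $1}')
size_after=$(wc -c < "$log" | tr -d ' ')

echo "inode $inode_before -> $inode_after, size $size_before -> $size_after bytes"
rm -f "$log"
```

The inode number stays the same and the size drops to zero, which is the point: the name in the directory is left alone and only the file's data is released.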
OK, that looks much better.
Let's mount rpool/export/home and have a look.
# zfs mount rpool/export/home
Ahhh, the kids' home directories each have a largish core file in them. Remove those, unmount /export/home. Now, as I mounted rpool/export/home and not rpool/export, a directory got created in /export. We need to remove that or the filesystem/local service won't start correctly (it will complain about /export having stuff in it).
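Roughly, that cleanup looks like the sketch below. The remove_cores helper is a made-up name for illustration, and it assumes the core files are literally named "core" and sit directly in each home directory. The zfs and rmdir steps are left as comments because they only make sense as root on the affected box.

```shell
#!/bin/sh
# remove_cores DIR: hypothetical helper that deletes regular files named
# "core" one level below DIR (that is, DIR/<user>/core), printing each path.
remove_cores() {
    for f in "$1"/*/core; do
        [ -f "$f" ] || continue     # skip if the glob matched nothing
        echo "removing $f"
        rm -f "$f"
    done
}

# On the real box, as root in single user mode, the sequence would be roughly:
#   remove_cores /export/home
#   zfs unmount rpool/export/home
#   rmdir /export/home     # clear the stray mountpoint directory left in /export
```

Using rmdir rather than rm -r for the last step is deliberate: it refuses to remove the directory if anything is still in it.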