Dr. Live Upgrade - Or How I Learned to Stop Worrying and Love Solaris Patching
By User12611829-Oracle on Mar 22, 2009
That's right, nobody. Or if you do, perhaps we should start a local support group to help you come to terms with this unusual fascination. Patching, and to a lesser extent upgrades (which can be thought of as patches delivered more efficiently through package replacement), is the most common complaint that I hear when meeting with system administrators and their management.
Most of the difficulties seem to fit into one of the following categories.
- Analysis: What patches need to be applied to my system ?
- Effort: What do I have to do to perform the required maintenance ?
- Outage: How long will the system be down to perform the maintenance ?
- Recovery: What happens when something goes wrong ?
Before we look at Live Upgrade, let's start with a definition. A boot environment is the set of all file systems and devices that are unique to an instance of Solaris on a system. If you have several boot environments then some data will be shared (applications not installed as SVR4 packages, data, local home directories) and some will be exclusive to one boot environment. Without making this more complicated than it needs to be, a boot environment is generally your root (including /usr and /etc), /var (frequently split out on a separate file system), and /opt. Swap may or may not be a part of a boot environment - it is your choice. I prefer to share swap, but there are some operational situations where this may not be feasible. There may be additional items, but generally everything else is shared. Network mounted file systems and removable media are assumed to be shared.
With this definition behind us, let's proceed.
Analysis: What patches need to be applied to my system ?
For all of the assistance that Live Upgrade offers, it doesn't do anything to help with the analysis phase. Fortunately there are plenty of tools that can help with this phase. Some of them work nicely with Live Upgrade, others take a bit more effort.
smpatch(1M) has an analyze capability that can determine which patches need to be applied to your system. It will get a list of patches from an update server, most likely one at Sun, and match up the dependencies and requirements with your system. smpatch can be used to download these patches for future application or it can apply them for you. smpatch works nicely with Live Upgrade, so from a single command you can upgrade an alternate boot environment. With containers!
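As a rough sketch of how the analysis output can feed later steps: I am assuming here that `smpatch analyze` prints one line per needed patch, beginning with the patch ID (for example `120199-13 SunOS 5.10: sysidtool patch`). The helper name `patch_ids` is made up for this example. It keeps only the leading IDs and drops any other chatter.

```shell
# patch_ids: read "smpatch analyze"-style lines on stdin and print only the
# leading patch IDs (six digits, a dash, two digits), one per line.
# Lines whose first field is not a patch ID are ignored.
# The output format of smpatch is an assumption; check it on your release.
patch_ids() {
    awk '$1 ~ /^[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]$/ { print $1 }'
}
```

On a real system this would be used as `smpatch analyze | patch_ids > /var/tmp/needed.txt`, giving a plain list that can be reviewed before anything is downloaded or applied.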
The Sun Update Manager is a simple to use graphical front end for smpatch. It gives you a little more flexibility during the inspection phase by allowing you to look at individual patch README files. It is also much easier to see what collection a patch belongs to (recommended, security, none) and if the application of that patch will require a reboot. For all of that additional flexibility you lose the integration with Live Upgrade. Not for lack of trying, but I have not found a good way to make Update Manager and Live Upgrade play together.
Sun xVM Ops Center has a much more sophisticated patch analysis system that uses additional knowledge engines beyond those used by smpatch and Update Manager. The result is a higher quality patch bundle tailored for each individual system, automated deployment of the patch bundle, detailed auditing of what was done and simple backout should problems occur. And it basically does the same for Windows and Linux. It is this last feature that makes things interesting. Neither Windows nor Linux has anything like Live Upgrade, and the lowest common denominator approach of Ops Center in its current state means that it doesn't work with Live Upgrade. Fortunately this will change in the not too distant future, and when it does I will be shouting about this feature from rooftops (OK, what I really mean is I'll post a blog and a tweet about it). If I can coax Ops Center into doing the analysis and download pieces then I can manually bolt it onto Live Upgrade for a best of both worlds solution.
These are our offerings and there are others. Some of them are quite good and in use in many places. Patch Check Advanced (PCA) is one of the more common tools in use. It operates on a patch dependency cross reference file and does a good job with the dependency analysis (this is obsoleted by that, etc). It can be used to maintain an alternate boot environment and in simple cases that would be fine. If the alternate boot environment contains any containers then I would use Live Upgrade's luupgrade instead of PCA's patchadd -R approach. If I was familiar with PCA then I would still use it for the analysis and download feature. Just let luupgrade apply the patches. You might have to uncompress the patches downloaded by PCA before handing them over to luupgrade, but that is a minor implementation detail.
In summary, use an analysis tool appropriate to the task (based on familiarity, budget and complexity) to figure out what patches are needed. Then use Live Upgrade (luupgrade) to deploy the desired patches.
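A minimal sketch of that hand-off, written for a POSIX shell such as ksh. The assumptions are mine: the analysis tool has left patches in one directory as `NNNNNN-NN.zip` files (which is how I understand `pca -d missing` to deliver most of them), and the names `run` and `patch_abe` are invented for this example. By default it only prints the commands it would run; set `DRY_RUN=0` to execute them.

```shell
# run: print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# patch_abe BE DIR: unpack every *.zip patch in DIR, then ask luupgrade to
# apply the unpacked patches to the alternate boot environment BE.
patch_abe() {
    be=$1
    dir=$2
    ids=""
    for z in "$dir"/*.zip; do
        [ -f "$z" ] || continue          # no zip files: the glob stays literal
        run unzip -q -o "$z" -d "$dir" || return 1
        ids="$ids $(basename "$z" .zip)" # patch ID is the file name minus .zip
    done
    if [ -z "$ids" ]; then
        echo "no patches found in $dir" >&2
        return 1
    fi
    # $ids is deliberately unquoted so each patch ID is a separate argument.
    run luupgrade -t -n "$be" -s "$dir" $ids
}
```

For example, `patch_abe patched /var/tmp/patches` prints one unzip command per archive followed by a single `luupgrade -t -n patched -s /var/tmp/patches ...` line. The running boot environment is never touched.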
Effort: What does it take to perform the required maintenance ?
This is a big topic and I could write pages on the subject. Even if I use an analysis tool like smpatch or pca to save me hours of trolling through READMEs drawing dependency graphs, there is still a lot of work to do in order to survive the ordeal of applying patches. Some of the more common techniques include:
Backing up your boot environment.
I should not have to mention this, but there are some operational considerations unique to system maintenance. Although the chance is small, you are more likely to render your system non-bootable during system maintenance than during any other operational task. Even with mature processes, human factors can come into play and bad things can happen (oops - that was my fallback boot environment that I just ran newfs(1M) on).
This is why automation and time tested scripting become so important. Should you do the unthinkable and render a system nonfunctional, rapid restoration of the boot environment is important. And getting it back to the last known good state is just as important. A fresh backup that can be restored by utilities from install media or a JumpStart miniroot is a very good idea. Flash archives (see flarcreate(1M)) are even better, although complications with containers make this less interesting now than in previous releases of Solaris. How many of you take a backup before applying patches ? Probably about the same number as replace batteries in your RAID controllers or change out your UPS systems after their expiration date.
Split Mirrors
One interesting technique is to split mirrors instead of taking backups. Of course this only works if you mirror your boot environment (a recommended practice for those systems with adequate disk space). Break your mirror, apply patches to the non-running half, cut over the updated boot environment during the next maintenance window and see how this goes. At first glance this seems like a good idea, but there are two catches.
- Do you synchronize dynamic boot environment elements ? Things like /etc/passwd, /etc/shadow, /var/adm/messages, print and mail queues are constantly changing. It is possible that these have changed between the mirror split and subsequent activation.
- How long are you willing to run without your boot environment being mirrored ? This may cause you to certify the new boot environment too quickly. You want to reestablish your mirror, but if that is your fallback in case of trouble you have a conundrum. And if you are the sort that seems to have a black cloud following you through life, you will discover a problem shortly after you start the mirror resync.
Pez disks ?
OK, the mirror split thing can be solved by swinging in another disk. Operationally a bit more complex and you have at least one disk that you can't use for other purposes (like hosting a few containers), but it can be done. I wouldn't do it (mainly because I know where this story is heading) but many of you do.
Better living through Live Upgrade
Everything we do to try to make it better adds complexity, or another hundred lines of scripting. It doesn't need to be this way, and if you become one with the LU commands it won't be for you either. Live Upgrade will take care of building and updating multiple boot environments. It will check to make sure the disks being used are bootable and not part of another boot environment. It works with the Solaris Volume Manager, Veritas encapsulated root devices, and starting with Solaris 10 10/08 (update 6) ZFS. It also takes care of the synchronization problem. Starting with Solaris 10 8/07 (update 4), Live Upgrade also works with containers, both native and branded (and with Solaris 10 10/08 your zoneroots can be in a ZFS pool).
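To give a feel for how little scripting this takes, here is a hedged sketch of creating an alternate boot environment, written for a POSIX shell such as ksh. The function names `run` and `create_abe`, the boot environment name and the disk slice are all placeholders of mine. With a ZFS root, `lucreate -n` alone is enough; with a UFS root you tell lucreate where to copy the root file system using `-m`. The sketch prints the commands unless `DRY_RUN=0` is set.

```shell
# run: print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# create_abe NAME [SLICE]: create a new boot environment called NAME.
# With SLICE (for example /dev/dsk/c0t1d0s0) the UFS form is used and the
# root file system is copied onto that slice; without it, the ZFS form is
# used and the new boot environment lives in the existing root pool.
create_abe() {
    name=$1
    slice=${2:-}
    if [ -n "$slice" ]; then
        run lucreate -n "$name" -m "/:$slice:ufs" || return 1
    else
        run lucreate -n "$name" || return 1
    fi
    # Show all boot environments and their status afterwards.
    run lustatus
}
```

Calling `create_abe patched` covers the ZFS case and `create_abe patched /dev/dsk/c0t1d0s0` the UFS case. Separate /var or /opt file systems would need additional `-m` options, which this sketch leaves out.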
Outage: How long will my system be down for the maintenance?
Or perhaps more to the point, how long will my applications be unavailable ? The proper reply is it depends on how big the patch bundle is and how many containers you have. And if a kernel patch is involved, double or triple your estimate. This can be a big problem and cause you to take short cuts like only installing some patches now and others later when it is more convenient. Our good friend Bart Smaalders has a nice discussion on the implications of this approach and what we are doing in OpenSolaris to solve this. That solution will eventually work its way into the Next Solaris, but in the meantime we have a problem to solve.
There is a large set (not really large, but more than one) of patches that require a quiescent system to be properly applied. An example would be a kernel patch that causes a change to libc. It is sort of hard to rip out libc on a running system (new processes get the new libc but may have issues with the running kernel, old processes get the old libc and tend to be fine, until they do a fork(2) and exec(2)). So we developed a brilliant solution to this problem - deferred activation patching. If you apply one of these troublesome patches then we will throw it in a queue to be applied the next time the system is quiesced (a fancy term for the next time we're in single user mode). This solves the current system stability concerns but may make the next reboot take a bit longer. And if you forgot you have deferred patches in your queue, don't get anxious and interrupt the shutdown or next boot. Grab a noncaffeinated beverage and put some Bobby McFerrin on your iPod. Don't Worry, Be Happy.
So deferred activation patching seems like a good way to deal with the situation where everything goes well. And some brilliant engineers are working on applying patches in parallel (where applicable) which will make this even better. But what happens when things go wrong ? This is when you realize that patchrm(1M) is not your friend. It has never been your friend, nor will it ever be. I have an almost paralyzing fear of dentists, but would rather visit one than start down a path where patchrm is involved. Well tested tools and some automation can reduce this to simple anxiety, but if I could eliminate patchrm altogether I would be much happier.
For all that Live Upgrade can do to ease system maintenance, it is the areas of outage and recovery that make it special. And when speaking about Solaris, either in training or evangelism events, this is why I urge attendees to drop whatever they are doing and adopt Live Upgrade immediately.
Since Live Upgrade (lucreate, lumake, luupgrade) operates on an alternate boot environment, the currently running set of applications is not affected. The system stays up, applications stay running and nothing is changing underneath them so there is no cause for concern. The only impact is some additional load from the Live Upgrade operations. If that is a concern then run Live Upgrade in a project and cap resource consumption for that project.
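One way this could look, as a sketch rather than a recommendation, written for a POSIX shell such as ksh. It assumes a release that supports the `project.cpu-cap` resource control (expressed as a percentage of one CPU). The project name and the helper names `run` and `capped_luupgrade` are invented for this example. The commands are only printed unless `DRY_RUN=0` is set.

```shell
# run: print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# capped_luupgrade PROJECT CAP luupgrade-args...
# Create PROJECT with a CPU cap of CAP percent of a single CPU, then run
# luupgrade inside that project so the patching cannot exceed the cap.
# projadd fails if the project already exists; create it once and reuse it.
capped_luupgrade() {
    project=$1
    cap=$2
    shift 2
    run projadd -U root -K "project.cpu-cap=(privileged,$cap,deny)" "$project" || return 1
    run newtask -p "$project" luupgrade "$@"
}
```

For example, `capped_luupgrade lupatch 50 -t -n patched -s /var/tmp/patches 120199-13` would hold the patch run to roughly half a CPU. The patching takes longer, but since the applications keep running that is usually a fine trade.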
An interesting implication of Live Upgrade is that the operational sanity of each step is no longer required. All that matters is the end state. This gives us more freedom to apply patches in a more efficient fashion than would be possible on a running boot environment. This is especially noticeable on a system with containers. The time that the upgrade runs is significantly reduced, and all the while applications are running. No more deferred activation patches, no more single user mode patching. And if all goes poorly after activating the new boot environment you still have your old one to fall back on. Cue Bobby McFerrin for another round of "Don't Worry, Be Happy".
This brings up another feature of Live Upgrade - the synchronization of system files in flight between boot environments. After a boot environment is activated, a synchronization process is queued as a K0 script to be run during shutdown. Live Upgrade will catch a lot of private files that we know about and the obvious public ones (/etc/passwd, /etc/shadow, /var/adm/messages, mail queues). It also provides a place (/etc/lu/synclist) for you to include things we might not have thought about or are unique to your applications.
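Entries in /etc/lu/synclist are a path followed by an action (OVERWRITE, APPEND or PREPEND). The sketch below, for a POSIX shell such as ksh, adds an entry only if the path is not already listed, so it is safe to call repeatedly. The function name `add_sync_entry` and the application path in the usage note are made up for illustration.

```shell
# add_sync_entry LIST PATH [ACTION]
# Append "PATH ACTION" to the synclist file LIST unless PATH already has an
# entry. ACTION defaults to OVERWRITE and must be one of the three actions
# Live Upgrade understands.
add_sync_entry() {
    list=$1
    path=$2
    action=${3:-OVERWRITE}
    case $action in
        OVERWRITE|APPEND|PREPEND) ;;
        *) echo "unknown synclist action: $action" >&2; return 1 ;;
    esac
    if [ -f "$list" ]; then
        # The first field of each line is the path; the rest is ignored here.
        while read -r entry rest; do
            [ "$entry" = "$path" ] && return 0
        done < "$list"
    fi
    printf '%s\t%s\n' "$path" "$action" >> "$list"
}
```

A call such as `add_sync_entry /etc/lu/synclist /etc/myapp/users.db OVERWRITE` would carry a (hypothetical) application file across at the next activation. Use APPEND for log-style files where the contents of both boot environments should be kept.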
When using Live Upgrade, applications are only unavailable for the amount of time it takes to shut down the system (the synchronization process) and boot the new boot environment. This may include some minor SMF manifest importing but that should not add much to the new boot time. You only have to complete the restart during a maintenance window, not the entire upgrade. While vampires are all the rage for teenagers these days, system administrators can now come out into the light and work regular hours.
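The part that has to happen inside the maintenance window is short enough to sketch in a few lines of POSIX shell. The helper names `run` and `activate_abe` and the boot environment name are placeholders of mine. The commands are printed rather than executed unless `DRY_RUN=0` is set.

```shell
# run: print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# activate_abe NAME: make NAME the boot environment for the next boot and
# restart the system.
activate_abe() {
    run luactivate "$1" || return 1
    # Use init (or shutdown), not reboot or halt: the switch to the new boot
    # environment and the file synchronization happen during an orderly
    # shutdown.
    run init 6
}
```

So the whole outage is `activate_abe patched`: everything before it (creating the boot environment, applying patches) happened while the applications were still running.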
Recovery: What happens when something goes wrong?
This is when you will fully appreciate Live Upgrade. After activation of a new boot environment, now called the Primary Boot Environment (PBE), your old boot environment, now called an Alternate Boot Environment (ABE), can still be called upon in case of trouble. Just activate it and shut down the system. Applications will be down for a short period (the K0 sync and subsequent startup), but there will be no more wringing of the hands, reaching for beverages with too much caffeine and vitamin B12, trying to remember where you kept your bottle of Tums. Cue Bobby McFerrin one more time and "Don't Worry, Be Happy". You will be back to your previous operational state in a matter of a few minutes (longer if you have a large server with many disks). Then you can mount up your ABE and troll through the logs trying to determine what went wrong. If you have a service contract then we will troll through the logs with you.
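Falling back is just `luactivate` on the old boot environment followed by `init 6`. The log inspection afterwards can be sketched as below, for a POSIX shell. The names `run` and `inspect_be`, the default mount point and the choice of log file are my own assumptions. The commands are printed unless `DRY_RUN=0` is set.

```shell
# run: print the command instead of executing it unless DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        printf '+ %s\n' "$*"
    fi
}

# inspect_be NAME [MOUNTPOINT]: mount the inactive boot environment NAME,
# show the tail of its system log, then unmount it again.
inspect_be() {
    be=$1
    mnt=${2:-/mnt}
    run lumount "$be" "$mnt" || return 1
    run tail -50 "$mnt/var/adm/messages"
    run luumount "$be"
}
```

After falling back from a boot environment called, say, `patched`, running `inspect_be patched` gives a first look at what it logged before you gave up on it, without rebooting into it again.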
I neglected to mention earlier, disks that comprise boot environments can be mirrored, so there is no rush to certification. Everything can be mirrored, at all times. Which is a very good thing. You still need to back up your boot environments, but you will find yourself reaching for the backup media much less often when using Live Upgrade.
All that is left are a few simple examples of how to use Live Upgrade. I'll save that for next time.