Dr. Live Upgrade - Or How I Learned to Stop Worrying and Love Solaris Patching

Who loves to patch or upgrade a system?

That's right, nobody. Or if you do, perhaps we should start a local support group to help you come to terms with this unusual fascination. Patching, and to a lesser extent upgrades (which can be thought of as patches delivered more efficiently through package replacement), is the most common complaint that I hear when meeting with system administrators and their management.

Most of the difficulties seem to fit into one of the following categories.
  • Analysis: What patches need to be applied to my system?
  • Effort: What do I have to do to perform the required maintenance?
  • Outage: How long will the system be down to perform the maintenance?
  • Recovery: What happens when something goes wrong?
And if a single system gives you a headache, adding a few containers into the mix will bring on a full migraine. And without some relief you may be left with the impression that containers aren't worth the effort. That's unfortunate, because containers don't have to be troublesome and patching doesn't have to be hard. But it does take getting to know one of the most important and sadly least used features in Solaris: Live Upgrade.

Before we look at Live Upgrade, let's start with a definition. A boot environment is the set of all file systems and devices that are unique to an instance of Solaris on a system. If you have several boot environments then some data will be shared (applications not installed from SVR4 packages, data, local home directories) and some will be exclusive to one boot environment. Without making this more complicated than it needs to be, a boot environment is generally your root (including /usr and /etc), /var (frequently split out onto a separate file system), and /opt. Swap may or may not be a part of a boot environment - it is your choice. I prefer to share swap, but there are some operational situations where this may not be feasible. There may be additional items, but generally everything else is shared. Network mounted file systems and removable media are assumed to be shared.

With this definition behind us, let's proceed.

Analysis: What patches need to be applied to my system?

For all of the assistance that Live Upgrade offers, it doesn't do anything to help with the analysis phase. Fortunately there are plenty of tools that can help with this phase. Some of them work nicely with Live Upgrade, others take a bit more effort.

smpatch(1M) has an analyze capability that can determine which patches need to be applied to your system. It will get a list of patches from an update server, most likely one at Sun, and match up the dependencies and requirements with your system. smpatch can be used to download these patches for future application or it can apply them for you. smpatch works nicely with Live Upgrade, so from a single command you can upgrade an alternate boot environment. With containers!
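
To make that concrete, here is roughly what the smpatch plus Live Upgrade flow looks like. Consider it a sketch rather than a recipe: the boot environment name (s10u6_patched) is made up, the patch IDs are just examples borrowed from later in this discussion, and the exact options depend on how your update connection is configured, so check smpatch(1M) and luupgrade(1M) on your release.

    # See what the update server thinks this system needs
    smpatch analyze

    # Download a couple of the recommended patches (they land in /var/sadm/spool by default)
    smpatch download -i 120011-14 -i 127127-11

    # Apply them to the alternate boot environment rather than the running system
    luupgrade -t -n s10u6_patched -s /var/sadm/spool 120011-14 127127-11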

The Sun Update Manager is a simple to use graphical front end for smpatch. It gives you a little more flexibility during the inspection phase by allowing you to look at individual patch README files. It is also much easier to see what collection a patch belongs to (recommended, security, none) and if the application of that patch will require a reboot. For all of that additional flexibility you lose the integration with Live Upgrade. Not for lack of trying, but I have not found a good way to make Update Manager and Live Upgrade play together.

Sun xVM Ops Center has a much more sophisticated patch analysis system that uses additional knowledge engines beyond those used by smpatch and Update Manager. The result is a higher quality patch bundle tailored for each individual system, automated deployment of the patch bundle, detailed auditing of what was done and simple backout should problems occur. And it basically does the same for Windows and Linux. It is this last feature that makes things interesting. Neither Windows nor Linux have anything like Live Upgrade and the least common denominator approach of Ops Center in its current state means that it doesn't work with Live Upgrade. Fortunately this will change in the not too distant future, and when it does I will be shouting about this feature from rooftops (OK, what I really mean is I'll post a blog and a tweet about it). If I can coax Ops Center into doing the analysis and download pieces then I can manually bolt it onto Live Upgrade for a best of both worlds solution.

These are our offerings and there are others. Some of them are quite good and in use in many places. Patch Check Advanced (PCA) is one of the more common tools in use. It operates on a patch dependency cross reference file and does a good job with the dependency analysis (this is obsoleted by that, etc). It can be used to maintain an alternate boot environment, and in simple cases that would be fine. If the alternate boot environment contains any containers then I would use Live Upgrade's luupgrade instead of PCA's patchadd -R approach. If I were familiar with PCA then I would still use it for the analysis and download feature. Just let luupgrade apply the patches. You might have to uncompress the patches downloaded by PCA before handing them over to luupgrade, but that is a minor implementation detail.
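
A sketch of that combination, with a made-up boot environment name and a scratch directory of my own choosing; PCA's options and defaults move around between versions, so treat this as an outline and check the PCA documentation before copying it.

    # Let PCA do the analysis, then pull down whatever is missing
    cd /export/patches
    pca -l missing
    pca -d missing

    # luupgrade wants unpacked patch directories, so unzip what PCA fetched
    for p in *.zip; do unzip -q $p && rm $p; done

    # Apply the lot to the alternate boot environment
    luupgrade -t -n s10u6_patched -s /export/patches 120011-14 127127-11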

In summary, use an analysis tool appropriate to the task (based on familiarity, budget and complexity) to figure out what patches are needed. Then use Live Upgrade (luupgrade) to deploy the desired patches.

Effort: What does it take to perform the required maintenance?

This is a big topic and I could write pages on the subject. Even if I use an analysis tool like smpatch or PCA to save me hours of trolling through READMEs and drawing dependency graphs, there is still a lot of work to do in order to survive the ordeal of applying patches. Some of the more common techniques include the following.
Backing up your boot environment
I should not have to mention this, but there are some operational considerations unique to system maintenance. Though the chance is tiny, you are more likely to render your system non-bootable during system maintenance than during any other operational task. Even with mature processes, human factors can come into play and bad things can happen (oops - that was my fallback boot environment that I just ran newfs(1M) on).

This is why automation and time tested scripting become so important. Should you do the unthinkable and render a system nonfunctional, rapid restoration of the boot environment is important. And getting it back to the last known good state is just as important. A fresh backup that can be restored by utilities from the install media or a jumpstart miniroot is a very good idea. A flash archive (see flarcreate(1M)) is even better, although complications with containers make this less interesting now than in previous releases of Solaris. How many of you take a backup before applying patches? Probably about the same number who replace the batteries in your RAID controllers or change out your UPS systems after their expiration date.
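
As a gentle nudge, a flash archive of the running boot environment is essentially a one-liner. The NFS path below is purely illustrative; see flarcreate(1M) for the options that fit your environment.

    # Capture the running boot environment to a compressed flash archive before any maintenance
    flarcreate -n "`hostname`-prepatch-`date +%Y%m%d`" -c /net/backupserver/flar/`hostname`-prepatch.flar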

Split Mirrors
One interesting technique is to split mirrors instead of taking backups. Of course this only works if you mirror your boot environment (a recommended practice for those systems with adequate disk space). Break your mirror, apply patches to the non-running half, cut over to the updated boot environment during the next maintenance window and see how it goes (a rough sketch of the mechanics follows the list below). At first glance this seems like a good idea, but there are two catches.
  1. Do you synchronize dynamic boot environment elements? Things like /etc/passwd, /etc/shadow, /var/adm/messages, print and mail queues are constantly changing. It is possible that these have changed between the mirror split and subsequent activation.
  2. How long are you willing to run without your boot environment being mirrored? This may cause you to certify the new boot environment too quickly. You want to reestablish your mirror, but if that is your fallback in case of trouble you have a conundrum. And if you are the sort that seems to have a black cloud following you through life, you will discover a problem shortly after you start the mirror resync.
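
For the curious, here is roughly what the split-mirror dance looks like with the Solaris Volume Manager. The metadevice names are placeholders and the cut-over details (vfstab edits, boot device changes, test boots) are deliberately left out; that is exactly the manual work that Live Upgrade does for you.

    # d10 is the root mirror, d12 the submirror we split off for patching
    metadetach d10 d12

    # Mount the detached half and patch it, not the running root
    mount /dev/md/dsk/d12 /mnt
    patchadd -R /mnt 120011-14
    umount /mnt

    # ... adjust vfstab and the boot device on the patched half, cut over at the next
    # maintenance window, and eventually reattach to resync (giving up your fallback)
    metattach d10 d12
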
Pez disks?
OK, the mirror split thing can be solved by swinging in another disk. Operationally a bit more complex and you have at least one disk that you can't use for other purposes (like hosting a few containers), but it can be done. I wouldn't do it (mainly because I know where this story is heading) but many of you do.
Better living through Live Upgrade
Everything we do to try to make it better adds complexity, or another hundred lines of scripting. It doesn't need to be this way, and if you become one with the LU commands it won't be for you either. Live Upgrade will take care of building and updating multiple boot environments. It will check to make sure the disks being used are bootable and not part of another boot environment. It works with the Solaris Volume Manager, Veritas encapsulated root devices, and starting with Solaris 10 10/08 (update 6), ZFS. It also takes care of the synchronization problem. Starting with Solaris 10 8/07 (update 4), Live Upgrade also works with containers, both native and branded (and with Solaris 10 10/08 your zone roots can be in a ZFS pool).
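
To give a flavor of how little is involved, a minimal sketch on a ZFS root system follows (on UFS or SVM roots you would add -m options to tell lucreate where the new boot environment should live). The boot environment name and patch directory are placeholders.

    # Clone the running boot environment - on a ZFS root this is a snapshot and clone, so it is quick
    lucreate -n s10u6_patched

    # Patch the clone while applications keep running on the current boot environment
    luupgrade -t -n s10u6_patched -s /export/patches 120011-14 127127-11

    # See where things stand
    lustatus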

Outage: How long will my system be down for the maintenance?

Or perhaps more to the point, how long will my applications be unavailable? The proper reply is that it depends on how big the patch bundle is and how many containers you have. And if a kernel patch is involved, double or triple your estimate. This can be a big problem and cause you to take short cuts, like installing some patches now and others later when it is more convenient. Our good friend Bart Smaalders has a nice discussion on the implications of this approach and what we are doing in OpenSolaris to solve it. That solution will eventually work its way into the Next Solaris, but in the mean time we have a problem to solve.

There is a large set (not really large, but more than one) of patches that require a quiescent system to be properly applied. An example would be a kernel patch that causes a change to libc. It is sort of hard to rip out libc on a running system (new processes get the new libc but may have issues with the running kernel; old processes get the old libc and tend to be fine, until they do a fork(2) and exec(2)). So we developed a brilliant solution to this problem - deferred activation patching. If you apply one of these troublesome patches then we will throw it in a queue to be applied the next time the system is quiesced (a fancy term for the next time we're in single user mode). This solves the current system stability concerns but may make the next reboot take a bit longer. And if you forget you have deferred patches in your queue, don't get anxious and interrupt the shutdown or next boot. Grab a noncaffeinated beverage and put some Bobby McFerrin on your iPod. Don't Worry, Be Happy.

So deferred activation patching seems like a good way to deal with the situation where everything goes well. And some brilliant engineers are working on applying patches in parallel (where applicable), which will make this even better. But what happens when things go wrong? This is when you realize that patchrm(1M) is not your friend. It has never been your friend, nor will it ever be. I have an almost paralyzing fear of dentists, but would rather visit one than start down a path where patchrm is involved. Well tested tools and some automation can reduce this to simple anxiety, but if I could eliminate patchrm altogether I would be much happier.

For all that Live Upgrade can do to ease system maintenance, it is in the areas of outage and recovery that it really stands out. And when speaking about Solaris, either in training or at evangelism events, this is why I urge attendees to drop whatever they are doing and adopt Live Upgrade immediately.

Since Live Upgrade (lucreate, lumake, luupgrade) operates on an alternate boot environment, the currently running set of applications is not affected. The system stays up, applications stay running and nothing is changing underneath them, so there is no cause for concern. The only impact is some additional load from the Live Upgrade operations. If that is a concern then run Live Upgrade in a project and cap the resource consumption of that project.
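
Here is one hedged way to do that with a Solaris project; the project name and share value are arbitrary, and CPU shares only have teeth if the fair share scheduler is in use.

    # Create a project with a modest number of FSS CPU shares for Live Upgrade work
    projadd -K "project.cpu-shares=(privileged,5,none)" lupatch

    # Run the heavy lifting inside that project
    newtask -p lupatch lucreate -n s10u6_patched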

An interesting implication of Live Upgrade is that the operational sanity of each step is no longer required. All that matters is the end state. This gives us more freedom to apply patches in a more efficient fashion than would be possible on a running boot environment. This is especially noticeable on a system with containers. The time that the upgrade runs is significantly reduced, and all the while applications are running. No more deferred activation patches, no more single user mode patching. And if all goes poorly after activating the new boot environment, you still have your old one to fall back on. Cue Bobby McFerrin for another round of "Don't Worry, Be Happy".
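
Whether you are patching or moving to a whole new release, the command is nearly the same; the network install image path below is illustrative.

    # Upgrade the alternate boot environment to a newer release from a network install image
    luupgrade -u -n s10u6_be -s /net/installserver/export/solaris10u6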

This brings up another feature of Live Upgrade - the synchronization of system files in flight between boot environments. After a boot environment is activated, a synchronization process is queued as a K0 script to be run during shutdown. Live Upgrade will catch a lot of private files that we know about and the obvious public ones (/etc/passwd, /etc/shadow, /var/adm/messages, mail queues). It also provides a place (/etc/lu/synclist) for you to include things we might not have thought about or are unique to your applications.
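
If your applications keep their own passwd-like files, a couple of extra lines in /etc/lu/synclist will carry them across. The paths below are hypothetical, and from memory the action keywords are OVERWRITE, APPEND and PREPEND; check synclist(4) before relying on this.

    # /etc/lu/synclist additions (format: pathname  action)
    /etc/opt/myapp/myapp.conf       OVERWRITE
    /var/opt/myapp/queue            OVERWRITE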

When using Live Upgrade, applications are only unavailable for the amount of time it takes to shut down the system (the synchronization process) and boot the new boot environment. This may include some minor SMF manifest importing, but that should not add much to the new boot time. You only have to complete the restart during a maintenance window, not the entire upgrade. While vampires are all the rage for teenagers these days, system administrators can now come out into the light and work regular hours.
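
The cut-over itself is short and, importantly, should finish with init or shutdown rather than reboot or halt so that the synchronization scripts actually get to run; the boot environment name is again a placeholder.

    # Make the patched boot environment the one to boot next, then restart cleanly
    luactivate s10u6_patched
    init 6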

Recovery: What happens when something goes wrong?

This is when you will fully appreciate Live Upgrade. After activation of a new boot environment, now called the Primary Boot Environment (PBE), your old boot environment, now called an Alternate Boot Environment (ABE), can still be called upon in case of trouble. Just activate it and shut down the system. Applications will be down for a short period (the K0 sync and subsequent start up), but there will be no more wringing of hands, reaching for beverages with too much caffeine and vitamin B12, or trying to remember where you kept your bottle of Tums. Cue Bobby McFerrin one more time and "Don't Worry, Be Happy". You will be back to your previous operational state in a matter of a few minutes (longer if you have a large server with many disks). Then you can mount up your ABE and troll through the logs trying to determine what went wrong. If you have a service contract then we will troll through the logs with you.
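
Falling back is the same two steps in the other direction, assuming the previous boot environment was named s10u6_orig.

    # Point the system back at the previous boot environment and restart
    luactivate s10u6_orig
    init 6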

I neglected to mention earlier that the disks comprising boot environments can be mirrored, so there is no rush to certification. Everything can be mirrored, at all times. Which is a very good thing. You still need to back up your boot environments, but you will find yourself reaching for the backup media much less often when using Live Upgrade.

All that is left are a few simple examples of how to use Live Upgrade. I'll save that for next time.

Comments:

Wow, Nice article on what is the most pressing issue of Unix admins. Thank you.

Posted by Ashish on March 22, 2009 at 06:47 PM CDT #

Thank you, Ashish. I appreciate the comment. The proof point will be in the examples which I hope to get posted in the next couple of days.

Posted by Bob Netherton on March 23, 2009 at 02:02 AM CDT #

Hi Bob,

Great article... more ammo to persuade folks about the benefits of Live Upgrade.

Just one question, are you going to be covering backing out of a failed BE with a zfs pool?

Cheers

Posted by steve foster on March 23, 2009 at 02:15 AM CDT #

Thanks Steve. And yes, I am going to cover that, including some really nasty recovery scenarios. It may take on a bit more of a dissection of ZFS boot, but when you are trying to recover a broken system you need to know how things fit together.

Posted by Bob Netherton on March 23, 2009 at 02:42 AM CDT #

Excellent article,

Any comments on how LU and xVM will come to live together? Can we instruct xVM to patch the non-current BE?

Posted by Bernardo Prigge on March 23, 2009 at 08:52 AM CDT #

Unfortunately what I'm running into is that LU seems to basically have been broken somewhere between u4 and u6.

Not literally broken as in "you can't live upgrade from here to there" but rather broken as in you're having to take an outage to properly patch to actually get the LU to run properly.

Which in my mind is counter to the LU mindset -- i.e. you're only taking one outage to reboot into the new BE.

Not that this is the end of the world, but still ;-)

P.S. Hope you get back to STL soon!

Posted by John Kotches on March 25, 2009 at 09:52 AM CDT #

Hey John.

I think the way to look at this is that u6 demands a lot of Live Upgrade, especially where ZFS root is concerned. And we have to upgrade a lot of the supporting components to minimum levels for it to work as you would expect. Without giving away the next article too much :-) , you can use Live Upgrade in a couple of stages to get there.

Step one is the minimum number of changes I need to make to the current boot environment to survive just the ordeal of the Live Upgrade patch set. Then use Live Upgrade to deploy the rest of the LU patch set.

There is a Live Upgrade patch bundle on SunSolve and the install script actually works in this 2 step manner (--apply-prereq and then -B for application of the rest to another BE). There's also a zones starter bundle that is supposed to be zones friendly.

I'm always up for a trip back home. Just tell the local guys that you want to meet about - whatever :-) And please do it soon: playoff hockey seems to be heating up and it's about baseball season! Probably shouldn't say that in public, should I?

Posted by Robert Netherton on March 25, 2009 at 10:50 AM CDT #

Bernardo - I'll ask my friends over on the Ops Center side of things for some information. I was hoping one of them might jump in.

Posted by Robert Netherton on March 25, 2009 at 10:52 AM CDT #

It is very fantastic documentations.

Posted by Paul Orapin on March 27, 2009 at 05:12 AM CDT #

Bob:

That's all well and good, but it doesn't avoid multiple reboots, which is what was the beauty of LU.

I also realize that a lot of these issues are specifically because of the new stuff that u6 brings to the table. Unfortunately, given our infrastructure that makes it quite painful when you're talking about 150 Oracle instances in our development environment.

Well clearly we need a LU seminar in St. Louis so that we can get all the facts straight.

As an FYI, with our new DCs coming on line the base OS will be 10u6. That avoids much of the heartache we have now.

Posted by John Kotches on March 27, 2009 at 05:29 AM CDT #

LU is a great product, and after all these years it is becoming a staple in our Solaris environment.

A few caveats when using LU with systems with Containers. Our consolidation project will be using this setup. I've spent a decent amount of time getting it all set up. Works great taking a few things into consideration. We're using a zfs root pool for the global zone and a zfs raidz for our multitude of zones for system consolidation.

A few points
- If the containers are on a zfs pool other than rpool make sure their mount point is not the name of that zfs pool. I set the mountpoint to none for the ZFS pool and created a directory /zones and put all the dataset mountpoints in there: i.e. zonepool/zone1 -o mountpoint=/zones/zone1
- The ZFS clone that is made by LU does not inherit attributes... so, reset your quotas and make sure they're set to mount at boot, and other stuff like that.

Posted by meb on April 15, 2009 at 06:39 AM CDT #

Hello,

A couple of questions if someone will answer:

Is it better to use luupgrade or smpatch?

I want to use smpatch for ease but is it OK to ignore All the special handling (immediate reboot, single user, interactive, etc)? If it is how would I do that?

Thanks,
A

Posted by A on April 23, 2009 at 09:10 AM CDT #

Thanks for all the valuable information.

We are in need to remove BE from our training machine. How can we get rid of all BE's? Ludelete will not delete the last "current active" BE. Is there a way to do this "cleanly".

Posted by Bernardo Prigge on April 24, 2009 at 03:31 AM CDT #

A, we use luupgrade instead of smpatch when we upgrade an alternate boot environment. We use smpatch to download (usually to another directory, not into /var), then we use smpatch order (or just use the patchpro download file).

If we used smpatch to download the patches into download:

luupgrade -t -n ABE -s /download /download/patch_order > lu.out 2>&1

Since the ABE isn't running you can ignore the reboot/single user/etc stuff for patches. Of course, always check the special install instructions, some patches do require manual steps...

Posted by meb on April 24, 2009 at 03:54 AM CDT #

My only problem with lu is that if you change/delete any pools, you cannot delete an old BE. Currently you need to have your pools/file structure set in stone otherwise old BE's cannot be deleted or booted properly.

Posted by Mark Christoph on April 29, 2009 at 04:21 AM CDT #

Hi
Deferred Activation Patching (otherwise known as DAP) means that the patch gets applied using lofs mounts. Basically we replicate / and any sub mounts out to the var directory using lofs nosub, then we apply patches to this alternate environment using patchadd -R. We copy the files being replaced to another directory and then, once the new patched file is installed, it is mounted back over, so that the live system appears to be running the older prepatched bits. This is a simplistic view, but is basically what deferred activation patching means; a reboot then clears out the mounts etc, and exposes the new patched files on the way back up.

smpatch does use a mechanism where kernel patches and other complex patches are not installed until the system is shut down; an smf service then installs the patches during shutdown.

But Deferred Activation Patching is what the first method is called.
It is delivered by patchadd itself, and is only invoked when applying certain patches like kernel patches 120011-14, 127127-11 and 137137-09, plus their x86 counterparts.
Enda

Posted by Enda O'Connor on May 08, 2009 at 03:53 AM CDT #

Just ran into an issue with containers on ZFS pools other than rpool. We had it all working fine (I can explain in more detail later), but after upgrading to patch 121430-36, the lucreate would fail when it tried to mount one of the containers:

Creating snapshot for <zonepool/z-erf-be5.10-20090505> on <zonepool/z-erf-be5.10-20090505@be5.10-20090508>.
Creating clone for <zonepool/z-erf-be5.10-20090505@be5.10-20090508> on <zonepool/z-erf-be5.10-20090508>.
cannot mount 'zonepool/z-erf-be5.10-20090508': no mountpoint set
/usr/lib/lu/luclonefs: none/lu_moved: cannot create

Backing out of the patch (we're running 121430-29) fixed it back up again. Haven't done anymore digging, but I just submitted the issue to Sun. I don't think this is quite a fully supported environment, but it's such a great feature I don't want to give it up.

Posted by meb on May 08, 2009 at 09:00 AM CDT #

although live upgrade works great I have hit a wall with getting NIS servers correctly patched. The new BE during the upgrade comes back with "pfinstall failed. processing profile". Doesn't matter if I use a dvd or jumpstart image. All other system upgrades have occurred without issue. Any thoughts?

Posted by john beward on February 15, 2011 at 11:03 PM CST #

About

Bob Netherton is a Principal Sales Consultant for the North American Commercial Hardware group, specializing in Solaris, Virtualization and Engineered Systems. Bob is also a contributing author of Solaris 10 Virtualization Essentials.

This blog will contain information about all three, but primarily focused on topics for Solaris system administrators.
