For a long time, I have advocated that Solaris users adopt ZFS for root, storing the operating system in ZFS. I've also strongly advocated for using Live Upgrade as a patching tool in this case. The benefits are intuitive and striking, but are they actual and quantifiable?
You can find a number of bloggers on BOC talking about the hows and whys of ZFS root. Suffice it to say that ZFS has a number of great qualities that make management of the operating system simpler, especially when combined with other tools like Live Upgrade. ZFS allows for the immediate creation of as many snapshots as you might want, simply by preserving the view of the filesystem metadata and taking advantage of the fact that all writes in ZFS use copy-on-write, completely writing the new data before releasing the old. This gives us snapshots for free.
Like chocolate and peanut butter, ZFS and Live Upgrade are two great tastes that taste great together. Live Upgrade was traditionally used just to upgrade systems from one release of Solaris (or update release) to another. However, in Solaris 10, it has become a tremendous tool for patching. With Live Upgrade (LU), the operating system is replicated in an Alternate Boot Environment (ABE) and all of the changes (patches, upgrades, whatever) are made to the copy of the OS while the OS is running, rather than taking the system down to apply maintenance. Then, when the time is right, during a maintenance window, the new changes are activated by rebooting using the ABE.
With this approach, downtime is minimized since changes are applied while the system is running. Moreover, there is a fall-back procedure since the original boot environment is still there. Rebooting again into the original environment effectively cancels out the changes exactly.
The Problem with Patching
Patching, generally speaking, is something that everyone knows they need to do, but few are really happy with how they do it. It's not much fun. It takes time. It keeps coming around like clockwork. You have to do it to every system. You have to work around the schedules of the applications on the system. But, it needs to be done. Sort of like mowing the grass every week in the summer.
Live Upgrade can take a lot of the pain out of patching, since the actual application of the patches no longer has to be done while the system is shut down. A typical non-LU approach is to shut the system down to single-user before applying the patches. In this way, you are certain that nothing else is running on the system and you can safely change anything that might need to be updated. But the application is also down during this entire period. And that is the crux of the problem. Patching takes too long; we expect systems to be always available these days.
How long does patching take? That all depends. It depends on the number of changes and patches being applied to the system. If you have not patched in a very long time, then a large number of patches are required to bring the system current. The more patches you apply, the longer it takes.
It depends on the complexity of the system. If, for example, there are Solaris Zones on the system for virtualization, patches applied to the system are automatically applied to each of the zones as well. This takes extra time. If patches are being applied to a shut-down system, that just extends the outage.
It's hard to get outage windows in any case, and long outage windows are especially hard to schedule when the perceived benefit is small. Patches are like a flu shot. They can vaccinate you against problems that have been found, but they won't help if this year has a new strain of flu that we've not seen before. So, long outages across lots of systems are hard to justify.
So, How Long Does It Really Take?
I have long heard people talk about how patching takes too long, but I've not measured it in some time. So, I decided to do a bit of an experiment. Using a couple of different systems, neither one very fast nor very new, I applied the Solaris 10 Recommended patch set from mid-July 2011. I applied the patches to systems running different update releases of Solaris 10, so that a different number of patches had to be applied each time to bring the system current. As far as procedure goes, for each test, I shut the system down to single-user (init S), applied the patches, and rebooted. The times listed are just the time for the patching; the actual maintenance window in real life would also include time to shut down, time to reboot, and time to validate system operation. The two systems I used for my tests were an X4100 server with 2 dual-core Opteron processors and 16GB of memory and a Sun Fire V480 server with 4 UltraSPARC III+ processors. Clearly, these are not new systems, but they will show what we need to see.
|System ||Operating System ||Patches Applied ||Elapsed Time (hh:mm:ss) |
|X4100 ||Solaris 10 9/10 ||105 ||00:17:00 |
|X4100 ||Solaris 10 10/09 ||166 ||00:26:00 |
|X4100 ||Solaris 10 10/08 ||216 ||00:36:06 |
|V480 ||Solaris 10 9/10 ||99 ||00:47:29 |
For each of these tests, the server was installed with root on ZFS and patches were applied from the Recommended Patchset via the command "./installpatchset -d --<pw>" for whatever password this patchset has. All of this was done while the system was in single-user rather than while running multi-user.
It appears that clock speed is important when applying patches. The older V480 took nearly three times as long as the X4100 (00:47:29 versus 00:17:00) on the same Solaris 10 9/10 release and the same patchset, even though it had slightly fewer patches to apply (99 versus 105).
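To put rough numbers on that comparison, here is a minimal sketch that converts the elapsed times from the table into seconds and works out seconds per patch and the V480-to-X4100 ratio. The function names are mine, purely for illustration, and the figures are only as good as the single run behind each row.

```shell
#!/bin/sh
# Convert an hh:mm:ss string to a number of seconds.
# awk treats "06" as decimal 6, so leading zeros are safe here.
to_seconds() {
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# Average seconds per patch. Args: elapsed (hh:mm:ss), patches applied.
secs_per_patch() {
    awk -v s="$(to_seconds "$1")" -v n="$2" 'BEGIN { printf "%.1f\n", s / n }'
}

# Ratio of two elapsed times. Args: slower (hh:mm:ss), faster (hh:mm:ss).
ratio() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", a / b }'
}

# Rows taken from the table above.
echo "X4100 9/10 : $(secs_per_patch 00:17:00 105) s/patch"
echo "X4100 10/09: $(secs_per_patch 00:26:00 166) s/patch"
echo "X4100 10/08: $(secs_per_patch 00:36:06 216) s/patch"
echo "V480  9/10 : $(secs_per_patch 00:47:29 99) s/patch"
echo "V480 vs X4100 on 9/10: $(ratio 00:47:29 00:17:00)x"
```

On the X4100 this works out to roughly ten seconds per patch regardless of the release, while the V480 needs closer to half a minute per patch.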
And this is the crux of the problem. Even to apply patches to a pretty current system requires an extended outage. This does not even take into account the time required for whatever internal validation of the work done, reboot time, application restart time, etc. How can we make this better? Let's make it worse first.
More Complicated Systems Take Longer to Patch
Nearly a quarter of all production systems running Solaris 10 are deployed using Solaris Zones. Many more non-production systems may also use zones. Zones reduce my administrative overhead, since I only have to patch the global zone rather than each virtualized environment separately. But when patches are applied to the global zone, they are automatically applied to each zone in turn. So, the time to patch a system can be significantly increased by having multiple zones. Let's first see how much longer this might take, and then we will show two solutions.
|System ||Operating System ||Number of Zones ||Patches Applied ||Elapsed Time (hh:mm:ss) |
|X4100 ||Solaris 10 9/10 ||2 ||105 ||00:46:51 |
|X4100 ||Solaris 10 9/10 ||20 ||105 ||03:03:59 |
|X4100 ||Solaris 10 10/09 ||2 ||166 ||01:17:17 |
|X4100 ||Solaris 10 10/08 ||2 ||216 ||01:37:17 |
|V480 ||Solaris 10 9/10 ||2 ||99 ||01:53:59 |
Again, all of these patches were applied to systems in single-user in the same way as the previous set. Just having two (sparse-root) zones defined took nearly three times as long as just the global zone alone. Having 20 zones installed took the patch time from 17 minutes to over three hours for even the smallest tested patchset.
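The cost of zones can be worked out from the two tables with a little arithmetic. The sketch below compares the X4100 Solaris 10 9/10 runs with zones against the global-zone-only baseline of 00:17:00, printing the slowdown factor and the extra minutes each zone adds on average. The helper names are illustrative only.

```shell
#!/bin/sh
# Convert an hh:mm:ss string to a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# Slowdown factor. Args: time with zones, baseline time (both hh:mm:ss).
slowdown() {
    awk -v z="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", z / b }'
}

# Average extra minutes per zone.
# Args: time with zones, baseline time (both hh:mm:ss), number of zones.
extra_minutes_per_zone() {
    awk -v z="$(to_seconds "$1")" -v b="$(to_seconds "$2")" -v n="$3" \
        'BEGIN { printf "%.1f\n", (z - b) / n / 60 }'
}

BASE=00:17:00   # X4100, Solaris 10 9/10, global zone only

echo " 2 zones: $(slowdown 00:46:51 $BASE)x, +$(extra_minutes_per_zone 00:46:51 $BASE 2) min per zone"
echo "20 zones: $(slowdown 03:03:59 $BASE)x, +$(extra_minutes_per_zone 03:03:59 $BASE 20) min per zone"
```

The per-zone cost is an average over one run at each zone count, so treat it as a rough planning number rather than a constant.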
How Can We Improve This? Live Upgrade is Your Friend
There are two main ways that this patch time can be improved. One applies to systems with or without zones, while the second improves on the first for systems with zones installed.
I mentioned before that Live Upgrade is very much your friend. Rather than go into all the details of LU, I would refer you to the many other blogs and documents on LU. Check out especially Bob Netherton's Blog for lots of LU articles.
When we use LU, rather than taking the system down to single-user, we are able to create a new alternate boot environment, using ZFS snapshot and clone capability, while the system is up and running in production. Then, we apply the patches to that new boot environment, still using the installpatchset command. For example, "./installpatchset -d -B NewABE --<pw>" applies the patches into NewABE rather than the current boot environment. When we use this approach, the patch times that we saw before don't change very much, since the same work is being done. However, all of this is time during which the system is not out of service. The outage is only the time required to reboot into the new boot environment.
So, Live Upgrade saves us all of that outage time. Customers who have older servers and are fairly out of date on patches say that applying a patch bundle can take more than four or five hours, an outage window that is completely unworkable. With Live Upgrade, the outage is reduced to the time for a reboot, scheduled when it can be most convenient.
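Put together, the sequence looks something like the sketch below. It is a dry run by default: it only prints the commands, and nothing is executed unless DRY_RUN=0 is set. The lucreate, lustatus, luactivate and init commands are the standard Live Upgrade tools, and the installpatchset line is the one quoted above, with the <pw> placeholder left as is. The patchset directory and the function names are assumptions for illustration, so adjust them for your own system.

```shell
#!/bin/sh
# Dry-run sketch of patching with Live Upgrade on a ZFS root.
# Prints each command unless DRY_RUN=0 is set in the environment.
DRY_RUN=${DRY_RUN:-1}
ABE=${ABE:-NewABE}                                    # ABE name from the example above
PATCHSET_DIR=${PATCHSET_DIR:-/var/tmp/10_Recommended} # hypothetical unpack location
PASSCODE=${PASSCODE:-'<pw>'}                          # placeholder, see the patchset README

# Either print a command (dry run) or execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Done while the system is up and in production.
prepare_abe() {
    run lucreate -n "$ABE"            # snapshot and clone the running root
    run cd "$PATCHSET_DIR"
    run ./installpatchset -d -B "$ABE" "--$PASSCODE"
    run lustatus                      # confirm the new BE is complete
}

# Done in the maintenance window: the only outage is the reboot.
activate_abe() {
    run luactivate "$ABE"
    run init 6
}

# Fallback: reactivate the original boot environment and reboot.
# Arg: name of the original BE as shown by lustatus.
fall_back() {
    run luactivate "$1"
    run init 6
}

prepare_abe
activate_abe
```

Splitting the work into prepare_abe and activate_abe mirrors the point of the exercise: the slow part runs in production hours, and only the activation needs a window.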
Live Upgrade Plus Parallel Patching
Recently, another enhancement was made to patching so that multiple zones are patched in parallel. Check out Jeff Victor's blog where he explains how this all works. As it turns out, this parallel patching works whether you are patching zones in single-user or via Live Upgrade. So, just to get an idea of how much this might help, I took some simple measurements with 2 and 20 sparse-root zones created on a system running Solaris 10 9/10.
|System ||Operating System ||Number of Zones ||Patches Applied ||num_procs ||Elapsed Time (hh:mm:ss) |
|X4100 ||Solaris 10 9/10 ||2 ||105 ||1 ||00:46:51 |
|X4100 ||Solaris 10 9/10 ||2 ||105 ||2 ||00:36:04 |
|X4100 ||Solaris 10 9/10 ||20 ||105 ||1 ||03:03:59 |
|X4100 ||Solaris 10 9/10 ||20 ||105 ||2 ||01:55:58 |
|X4100 ||Solaris 10 9/10 ||20 ||105 ||4 ||01:25:53 |
num_procs is used as a guide for the number of threads to be engaged in parallel patching. Jeff Victor's blog (above) and the man page for pdo.conf talk about how this relates to the actual number of processes that are used for patching.
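For anyone who wants to script the change, here is a hedged sketch of a helper that sets a key in a pdo.conf-style file. It assumes the file is a simple list of key=value lines, that the live file is /etc/patch/pdo.conf, and that the key is spelled num_proc there. All three are my recollection rather than something stated in this article, so check the pdo.conf man page on your own system first. The demonstration works on a scratch copy, not the live file.

```shell
#!/bin/sh
# Set KEY=VALUE in a key=value style config file.
# Replaces an existing KEY= line, or appends one if none exists.
# Args: file, key, value. Uses a temp file because Solaris sed has no -i.
set_conf_value() {
    file=$1
    key=$2
    value=$3
    tmp="$file.tmp.$$"
    if grep "^${key}=" "$file" >/dev/null 2>&1; then
        sed "s/^${key}=.*/${key}=${value}/" "$file" > "$tmp" && mv "$tmp" "$file"
    else
        echo "${key}=${value}" >> "$file"
    fi
}

# Demonstrate on a scratch file (assumed layout, not a copy of the real one).
scratch=${TMPDIR:-/tmp}/pdo.conf.demo.$$
printf '# scratch file for illustration\nnum_proc=1\n' > "$scratch"

set_conf_value "$scratch" num_proc 4
cat "$scratch"
rm -f "$scratch"
```

To apply it for real, you would point the helper at the actual configuration file as root, after backing it up.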
With only two zones, doubling the number of threads has an effect, but not a huge effect, since the amount of parallelism is limited. However, with 20 zones on a system, boosting the number of zones patched in parallel can significantly reduce the time taken for patching.
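To quantify that, the sketch below computes the speedup for each parallel run in the table relative to the num_procs=1 run with the same number of zones. The function names are illustrative, and each figure rests on a single measurement.

```shell
#!/bin/sh
# Convert an hh:mm:ss string to a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print ($1 * 3600) + ($2 * 60) + $3 }'
}

# Speedup of a parallel run over the serial run.
# Args: serial time, parallel time (both hh:mm:ss).
speedup() {
    awk -v s="$(to_seconds "$1")" -v p="$(to_seconds "$2")" \
        'BEGIN { printf "%.2f\n", s / p }'
}

# Rows taken from the table above (X4100, Solaris 10 9/10, 105 patches).
echo " 2 zones, num_procs=2: $(speedup 00:46:51 00:36:04)x"
echo "20 zones, num_procs=2: $(speedup 03:03:59 01:55:58)x"
echo "20 zones, num_procs=4: $(speedup 03:03:59 01:25:53)x"
```

The speedup is well short of linear in num_procs, which is what you would expect given that patching the global zone itself is part of every run.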
Recall that all of this is done within the application of patches with Live Upgrade. Used alone, outside of Live Upgrade, this can help reduce the time required to patch a system during a maintenance window. Used with Live Upgrade, it reduces the time required to apply patches to the alternate boot environment.
So, what should you do to speed up patching and reduce the outage required for patching?
Use ZFS root and Live Upgrade so that you can apply your patches to an alternate boot environment while the system is up and running. Then, use parallel patching to reduce the time required to apply the patches to the alternate boot environment where you have zones deployed.