Rethinking patching

As Stephen mentioned recently, several of us have been thinking about revising the way we manage software change on Solaris. I've been particularly focused on the difficulties Sun and its customers have with the patching process, and the kinds of changes we need to make as a result in our technology and development processes.

 Today, most customers don't run OpenSolaris; they run a supported version of Solaris such as Solaris 8, 9 or 10.  A supported release means that someone will answer the phone, and that patches for problems are available.

Patches are a separate software change control mechanism distinct from package versions in Solaris.  Each patch may affect portions of several packages; patches are intended to include all the files necessary to fix one or more problems, either directly or by specifying dependencies.  If a patch affects packages which are not installed on this system (typically because it has been minimized), those portions of the patch are not installed.  If the administrator later adds the missing package, he must remember (good luck) to re-apply the patches since the packaging code knows nothing of patches.
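
The re-apply problem described above can be sketched in a few lines of Python. This is only a toy model: the package names, the patch ID, and the `System` class are invented for illustration and do not reflect how the Solaris patch tools actually record state.

```python
from dataclasses import dataclass, field


@dataclass
class System:
    """Toy model of a host; all names and structures here are hypothetical."""
    installed_packages: set[str]
    # patch id -> every package the patch touches
    applied_patches: dict[str, set[str]] = field(default_factory=dict)
    # (patch id, package) pairs whose files were actually laid down
    patched: set[tuple[str, str]] = field(default_factory=set)

    def apply_patch(self, patch_id: str, affected: set[str]) -> None:
        """Install only the portions of the patch whose packages are present."""
        self.applied_patches[patch_id] = set(affected)
        for pkg in affected & self.installed_packages:
            self.patched.add((patch_id, pkg))

    def add_package(self, pkg: str) -> None:
        """Add a package; the packaging code knows nothing about patches."""
        self.installed_packages.add(pkg)

    def patches_to_reapply(self) -> list[str]:
        """Patches recorded as applied that skipped a now-installed package."""
        stale = []
        for patch_id, affected in self.applied_patches.items():
            for pkg in affected & self.installed_packages:
                if (patch_id, pkg) not in self.patched:
                    stale.append(patch_id)
                    break
        return sorted(stale)


# A minimized system: the driver package is absent when the patch goes on.
system = System(installed_packages={"pkg-core"})
system.apply_patch("P-0001", {"pkg-core", "pkg-driver"})
system.add_package("pkg-driver")
print(system.patches_to_reapply())  # ['P-0001']
```

The point of the sketch is that nothing in `add_package` can flag the stale patch, because patch state and package state are tracked separately; someone has to go looking for it.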

Customers are today free to install whichever patches they feel are appropriate for their environment, consistent with the built-in dependency requirements. This customization is a technique I refer to as Dim Sum patching, and is a major cause of patching difficulties. Many customers pick and choose amongst the thousands of patches available for Solaris 10, for example; this means that customers are often pioneering new configurations. Note that each Solaris release consists of a single source base; all Solaris 10 updates, for example, are but snapshots of the same Solaris patch gate at different times. As a result, the developers are working on a cumulative set of all previous changes; when a new patch is created, the files in the patch contain not only the desired fix, but all previous fixes as well. Thus, the software change is constructed as a linear stream of change, but customers install selected binaries from the various builds via patches.
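
To make the mismatch between a linear stream of change and selectively installed binaries concrete, here is a small Python sketch. The file names and build contents are made up, and the model ignores patch dependencies entirely; it only illustrates how picking patches can produce a mix of files that never existed together in any build of the gate.

```python
# Each entry is one build in the gate: the set of files that build changed.
# File names and build contents are invented for illustration.
GATE = [
    {"libc.so"},             # build 1
    {"genunix", "libc.so"},  # build 2
    {"ip"},                  # build 3
    {"genunix", "ip"},       # build 4
]


def snapshot(build: int) -> dict[str, int]:
    """File -> last build that changed it, as of `build` (0 = base release)."""
    state = {name: 0 for changed in GATE for name in changed}
    for number, changed in enumerate(GATE[:build], start=1):
        for name in changed:
            state[name] = number
    return state


def install(selected: set[int]) -> dict[str, int]:
    """State after installing only the patches for the selected builds.

    A patch carries whole files, so each delivered file includes every
    earlier change to that file, whether or not those patches were chosen.
    """
    state = snapshot(0)
    for number in sorted(selected):
        for name in GATE[number - 1]:
            state[name] = number
    return state


def matches_a_build(state: dict[str, int]) -> bool:
    """True if this mix of files was ever a snapshot of the gate."""
    return any(state == snapshot(b) for b in range(len(GATE) + 1))


print(matches_a_build(install({1, 2, 3, 4})))  # True: same as build 4
print(matches_a_build(install({4})))           # False: new genunix and ip, old libc.so
```

In the second case the system runs build 4's `genunix` and `ip` against the base release's `libc.so`, a combination no developer ever built or booted.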

 

When I've discussed the hazards of Dim Sum patching with customers, the reasons given can typically be characterized as:

  1. we don't need all those patches; we don't have those drivers loaded
  2. we're reducing downtime by not installing so many patches
  3. the less change, the less risk.

To these, I reply:

  1. If you don't need those drivers, then remove them with pkgrm rather than leaving them in an unpatched state awaiting the introduction of new hardware or software to expose problems. Minimization, not spotty patching, is the answer. This is akin to disposing of an unused car, rather than simply leaving it unmaintained.
  2. Today, you should be using Live Upgrade and patching the alternate boot image to reduce downtime.  This allows machines in production to be safely patched, and will not leave the system in an inconsistent or unbootable state in the case of power failure during patching operations.  In the future, the new packaging system will always patch a clone of the current system to avoid the potential for disaster in case of power failure.
  3. Our experience has been that customers running all of the changes in an update are generally far less likely to experience problems than those who select only the fixes and features that appeal to them, and hope that our QA processes found all hidden dependencies on previous changes.

For our new packaging system, there is a powerful incentive to eliminate Dim Sum patching: since we wish to use a single version numbering space for any package, attempting to support fine-grained Dim Sum patching would require very small packages, hurting the performance of packaging operations and significantly increasing the workload of OpenSolaris developers. Instead, we can set package boundaries according to what makes sense for minimization purposes.

This implies that future (post Solaris 11) patches will be completely cumulative (aside from some exceptions for urgent security fixes), at least for the core OS.  Your system will be able to determine what is needed to bring the installed software up to the desired revision level automatically; needing to pick and choose patches will be a thing of the past. 
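
As a rough illustration of what "bring the installed software up to the desired revision level" could look like with one version number per package, consider the sketch below. The integer versions, the package names, and the `plan_update` function are all assumptions made for the example; the new packaging system's actual design is not described here.

```python
def plan_update(
    installed: dict[str, int], target: dict[str, int]
) -> dict[str, tuple[int, int]]:
    """Return package -> (installed version, target version) for every
    installed package that is behind the target revision level.

    Packages that are not installed are left alone, so a minimized
    system stays minimized.
    """
    return {
        pkg: (have, target[pkg])
        for pkg, have in sorted(installed.items())
        if pkg in target and have < target[pkg]
    }


# Hypothetical package names and version numbers.
installed = {"kernel": 3, "libc": 5, "zfs": 2}
target = {"kernel": 5, "libc": 5, "zfs": 4, "x11": 7}

print(plan_update(installed, target))
# {'kernel': (3, 5), 'zfs': (2, 4)}
```

Because each package moves along a single line of versions, the plan is a simple comparison; there is no set of individual patches for the administrator to choose among.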


 

No Dim Sum Patching!


Comments:

Ah, but there is a key component to reply number 1 that is missing and/or hard to achieve.

How do you \*know\* what you can delete? Given a running/production system how do you determine which of the multitude of installed packages can be removed without adverse impact?

Posted by David Bryant on July 25, 2007 at 10:46 AM PDT #

There is a second issue with reason number 1. In numerous Sun documents, they recommend installing the Entire distribution plus OEM support. There are also several tools (smpatch being one of them) that are only supported with the developer cluster or higher. Since numerous pieces of Sun documentation recommend installing everything, a lot of people end up doing so. - Ryan

Posted by Matty on July 25, 2007 at 12:31 PM PDT #

More often than not Sun support recommends individual patches, even when it is communicated that an in-house tested patch set that is largely based on a "recommended and security" or EIS patch cluster that includes the recommended patches is available. The deviations that make up "largely based on" are those required to address Sun Alerts and have review by Sun SSE's or better. Such recommendations commonly come in in the course of support cases that involve anyone from front line support to high level escalations that involve PTS, Sun Architects on-site, and Cx0 involvement.

The first step in addressing this problem is to get support to be able to identify the specific bug that they think a patch will fix before recommending a patch. In very few cases has a patch Sun recommends addressed the root cause of the problem I am experiencing. By and large, Sun Support's reactive recommendations actually drive variability into the environment and create the Dim Sum patching you describe.

Posted by Mike Gerdts on July 25, 2007 at 12:51 PM PDT #

"How do you \*know\* what you can delete? Given a running/production system how do you determine which of the multitude of installed packages can be removed without adverse impact?"

Long story short: preconfigured, prepatched and stringently tested Solaris images.
This is what system engineering and platform lifecycle management are all about. At the very core, we can summarize the above terms to standardized Flash(TM) builds, preconfigured, prepatched and rigorously tested Solaris images. No ad-hoc changes are ever allowed in such a setup; configurations come on as tightly controlled package payload; and adding or removing something from the image is a matter of request/bug tracking, specification (in writing), panel approval, and finally, if the request / fix is approved, integration into the next platform release cycle.

Note that I'm not referring to Sun development, although one can surmise they have similar practices.

Also, ad-hoc changes, and this includes patching, are strictly banned, prohibited, and forbidden, unless engineering tested them and approved them. For example, it would be strictly forbidden for an SA to log into the system and start modifying configuration files with `vi` (or `emacs`, or whichever editor); changes would come in form of revised package payload, so they would be reproducible and uniform across systems.

Posted by UX-admin on July 25, 2007 at 10:46 PM PDT #

We also use standardized Flash(TM) builds across all sun4u systems, and therefore the recommended way is the sunwcxall cluster.

Posted by mario on July 26, 2007 at 02:50 AM PDT #

David wrote: "Ah, but there is a key component to reply number 1 that is missing and/or hard to achieve. How do you \*know\* what you can delete? Given a running/production system how do you determine which of the multitude of installed packages can be removed without adverse impact?"
Providing better support for minimization is going to be increasingly important as OpenSolaris continues to grow. Clearly, we need better pkg dependencies, and tools to help maintain them. We also need more automated ways of determining which components may be removed.

Posted by barts on July 26, 2007 at 04:37 AM PDT #

Sadly, core install, ldd, and truss have become the only minimization technique for production servers. Dependencies in pkgs don't hack it here. NB that I said SERVERS ... SMCC has never provided a sensible cluster for real use in the field, it still thinks it's selling workstations ! We started having to play these games in 4.0.3 days, and are yet to find an alternative. ANY sort of automated patching, esp. that which drags in dependent pkgs would be a retrograde step. pjc, who just had to do it all again for b68

Posted by PeterC on July 29, 2007 at 07:48 PM PDT #

Here's a thought: Improve the process and quality of update releases until we can do one every month. Then just stop issuing 95% of all the patches we currently issue. Then leave the patch process the way it is, only reserved for critical security problems. Once you make that jump, customers will be willing to install 100% of the security patches, and all is wonderful. All the magic was in the first sentence. :-)

Posted by Chris Quenelle on July 31, 2007 at 02:10 PM PDT #

Or, in other words, since your customers don't want to apply all patches, you're planning to force them to. That's screwed beyond all recognition. As an example of where this policy of forcing your customers to a specific revision would have forced me to abandon a project costing hundreds of thousands of dollars, note that patch 126312-01 panics all of my systems on boot. While I now have a T-patch, that took over a month to be produced. In the meantime, I'm supposed to do without security patches? Don't think so.

Posted by Ceri Davies on July 31, 2007 at 11:49 PM PDT #

Ceri - What happened in your case is that the fix for 6509271 introduced 6574028, which tips your boxes. Clearly, undoing the fix for 6509271 and re-releasing immediately would be the right answer.
We're still exploring what the approach should be for security fixes, particularly those that are more separable. It's clear, however, that the ability to piece a running kernel and set of core libraries together by selecting binaries from all the builds over the last two years is fraught with hazard, and is difficult and expensive for Sun. Much of the reason that patches take so long to deliver is that they need to be tested in so many configurations; restricting the ability to mix and match binaries vastly eases the work and testing required to release a fix.

Posted by barts on August 01, 2007 at 08:15 AM PDT #
