Wednesday Sep 22, 2010

SS7000 Software Updates

In this entry I'll explain some of the underlying principles behind software upgrades for the 7000 series. Keep in mind that nearly all of this information describes implementation details of the system and is thus subject to change.

Entire system image

One of the fundamental design principles behind SS7000 software updates is that every release updates the entire system, no matter how small the underlying software change is. Releases never update individual components of the system separately. This may sound surprising to people familiar with traditional OS patches, but the model provides a critical guarantee: for a given release, the software components are identical across all systems running that release. This guarantee makes it possible to test every combination of software the system can run, which isn't the case for traditional OS patches.

When operating systems allow users to apply separate patches to fix individual issues, different systems running the same release may (obviously) have different patches applied to different components. It's impossible to test every combination of components before releasing each new patch, so engineers rely heavily on understanding the scope of the underlying software change (as it affects the rest of the system at several different versions) to know which combinations of patches may conflict with one another. In a complex system, this is very hard to get right. What's worse is that it introduces unnecessary risk to the upgrade and patching process, making customers wary of upgrading, which results in more customers running older software. With a single system image, we can (and do) test every combination of component versions a customer can have.
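
To put rough numbers on that difference, here is a minimal sketch in Python. It assumes, purely for illustration, that each patch is independent and can be either applied or skipped; real patch systems have dependencies that shrink the space somewhat. The function names are hypothetical and not part of any product.

```python
def patch_model_configs(num_patches: int) -> int:
    """Distinct configurations when each optional patch may independently
    be applied or skipped (an illustrative simplification)."""
    return 2 ** num_patches


def image_model_configs(num_releases: int) -> int:
    """Distinct configurations when every release replaces the whole
    system: exactly one configuration per release."""
    return num_releases


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(f"{n} patches: {patch_model_configs(n):,} combinations; "
              f"{n} full-image releases: {image_model_configs(n)} configurations")
```

The patch model grows exponentially while the whole-image model grows linearly, which is why exhaustive testing is only practical in the latter.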

This model has a few other consequences, some of which are more obvious than others:

  • Updates are complete and self-contained. There's no chance for interaction between older and newer software, and there's no chance for user error in applying illegal combinations of patches.
  • An update's version number implicitly identifies the versions of all software components on the system. This helps customers, support, and engineering know exactly what it means to say that a bug was fixed in release X or that a system is running release Y (without having to guess or specify the versions of dozens of smaller components).
  • Updates are cumulative; any version of the software has all bug fixes from all previous versions. This makes it easy to identify which releases have a given bug fixed. "Previous" here refers to the version order, not chronological order. For example, V1 and V3 may be in the field when a V2 is released that contains all fixes in V1 plus a few others, but not the fixes in V3. More below on when this happens.
  • Updates are effectively atomic. They either succeed or they fail, but the system is never left in some in-between state running partly old bits and partly new bits. This doesn't follow immediately from having an entire system image, but doing it that way makes this possible.
Of course, it's much easier to achieve this model in the appliance context (which constrains supported configurations and actions for exactly this kind of purpose) than on a general purpose operating system.
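
Because updates are cumulative in version order, the question "does this system have the fix?" reduces to a single comparison. The sketch below illustrates that idea with the V1/V2/V3 example from the list above. The `Version` tuple and its field names are a hypothetical encoding chosen for illustration, not the appliance's actual version format.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """Hypothetical version encoding; tuples compare field by field."""
    major: int
    minor: int = 0


def has_fix(running: Version, fixed_in: Version) -> bool:
    """With cumulative updates, a release contains every fix from all
    releases that precede it in version order."""
    return running >= fixed_in


# Illustrative stand-ins for V1, V2, and V3.
v1, v2, v3 = Version(1, 0), Version(1, 1), Version(2, 0)

# V2 ships after V3 chronologically, yet still sorts before it.
chronological_order = [v1, v3, v2]
version_order = sorted(chronological_order)
```

Here `has_fix(v2, v1)` is true and `has_fix(v2, v3)` is false, even though V2 was released later than V3.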

Types of updates

SS7000 software releases come in a few types characterized by the scope of the software changes contained in the update.

  • Major updates are the primary release vehicle. These are scheduled releases that deliver new features and the vast majority of bug fixes. Major updates generally include a complete sync with the underlying Solaris OS.
  • Minor updates are much smaller, scheduled releases that include a small number of valuable bug fixes. Bugs fixed in minor updates must have high impact or high likelihood of being experienced by customers and have a relatively low risk fix.
  • Micro updates are unscheduled releases issued to address significant issues. These would usually be potential data loss issues, pathological service interruptions (e.g., frequent reboots), or significant performance regressions.

Enterprise customers are often reluctant to apply patches and upgrades to working systems, since any software change carries some risk that it may introduce new problems (even if it only contains bug fixes). This breakdown allows customers to make risk management decisions about upgrading their systems based on the risk associated with each type of update. In particular, the scope of minor and micro releases is highly constrained to minimize risk.
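
The breakdown above can be summarized as a small data structure. This is only a sketch of the taxonomy as described in this entry; the `UpdateType` enum and the `constrained_scope` helper are hypothetical names, not part of any appliance interface.

```python
from enum import Enum


class UpdateType(Enum):
    # Each value is (scheduled?, scope summary paraphrased from the text).
    MAJOR = (True, "new features and the vast majority of bug fixes")
    MINOR = (True, "a small number of high-value, low-risk bug fixes")
    MICRO = (False, "fixes for significant issues such as potential data loss")

    @property
    def scheduled(self) -> bool:
        return self.value[0]

    @property
    def scope(self) -> str:
        return self.value[1]


def constrained_scope(update: UpdateType) -> bool:
    """Minor and micro releases have a highly constrained scope."""
    return update in (UpdateType.MINOR, UpdateType.MICRO)
```

A site policy could then, for example, apply anything for which `constrained_scope` is true promptly and plan major updates more deliberately; that policy is an assumption, not a recommendation from the original text.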

For example, four major software updates have been released: 2008.Q4, 2009.Q2, 2009.Q3, and 2010.Q1. All four of these have had several updates. We've also released a few micro updates.
