Wednesday Sep 22, 2010

SS7000 Software Updates

In this entry I'll explain some of the underlying principles behind software updates for the 7000 series. Keep in mind that nearly all of this information describes implementation details of the system and is thus subject to change.

Entire system image

One of the fundamental design principles behind SS7000 software updates is that every release updates the entire system, no matter how small the underlying software change. Releases never update individual components of the system separately. This may sound surprising to people familiar with traditional OS patches, but the model provides a critical guarantee: for a given release, the software components are identical across all systems running that release. That guarantee makes it possible to test every combination of software the system can run, which isn't the case with traditional OS patches.

When operating systems allow users to apply separate patches to fix individual issues, different systems running the same release may (obviously) have different patches applied to different components. It's impossible to test every combination of components before releasing each new patch, so engineers rely heavily on understanding the scope of the underlying software change (as it affects the rest of the system at several different versions) to know which combinations of patches may conflict with one another. In a complex system, this is very hard to get right. Worse, it introduces unnecessary risk to the upgrade and patching process, making customers wary of upgrading, which results in more customers running older software. With a single system image, we can (and do) test every combination of component versions a customer can have.

This model has a few other consequences, some of which are more obvious than others:

  • Updates are complete and self-contained. There's no chance for interaction between older and newer software, and there's no chance for user error in applying illegal combinations of patches.
  • An update's version number implicitly identifies the versions of all software components on the system. This makes it easy for customers, support, and engineering to know exactly what it means to say that a bug was fixed in release X or that a system is running release Y (without having to guess or specify the versions of dozens of smaller components).
  • Updates are cumulative; any version of the software has all bug fixes from all previous versions. This makes it easy to identify which releases have a given bug fixed. "Previous" here refers to the version order, not chronological order. For example, V1 and V3 may be in the field when a V2 is released that contains all fixes in V1 plus a few others, but not the fixes in V3. More below on when this happens.
  • Updates are effectively atomic. They either succeed or they fail, but the system is never left in some in-between state running partly old bits and partly new bits. This doesn't follow immediately from having an entire system image, but doing it that way makes this possible.

Of course, it's much easier to achieve this model in the appliance context (which constrains supported configurations and actions for exactly this purpose) than on a general-purpose operating system.
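
To make the cumulative-by-version-order point above concrete, here's a tiny sketch (in Python, with made-up version numbers and bug IDs -- this isn't how the appliance actually tracks fixes):

    # Hypothetical illustration: each release's fix list is cumulative along
    # the version order, regardless of when the release actually shipped.
    fixes_introduced = {
        "V1": {"bug-100", "bug-101"},
        "V2": {"bug-102"},              # shipped after V3 chronologically
        "V3": {"bug-103", "bug-104"},
    }

    def fixes_in(release):
        """Every fix present in `release`: all fixes introduced at or before
        that version, independent of ship date."""
        order = sorted(fixes_introduced)        # version order: V1 < V2 < V3
        included = set()
        for r in order[: order.index(release) + 1]:
            included |= fixes_introduced[r]
        return included

    print(fixes_in("V2"))   # V1's fixes plus bug-102, but not V3's fixes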

Types of updates

SS7000 software releases come in a few types characterized by the scope of the software changes contained in the update.

  • Major updates are the primary release vehicle. These are scheduled releases that deliver new features and the vast majority of bug fixes. Major updates generally include a complete sync with the underlying Solaris OS.
  • Minor updates are much smaller, scheduled releases that include a small number of valuable bug fixes. Bugs fixed in minor updates must have high impact or a high likelihood of being experienced by customers, and the fix itself must be relatively low-risk.
  • Micro updates are unscheduled releases issued to address significant issues. These would usually be potential data loss issues, pathological service interruptions (e.g., frequent reboots), or significant performance regressions.

Enterprise customers are often reluctant to apply patches and upgrades to working systems, since any software change carries some risk that it may introduce new problems (even if it only contains bug fixes). This breakdown allows customers to make risk management decisions about upgrading their systems based on the risk associated with each type of update. In particular, the scope of minor and micro releases is highly constrained to minimize risk.

For example, four major software updates have been released: 2008.Q4, 2009.Q2, 2009.Q3, and 2010.Q1. Each of these has had several minor updates, and we've also released a few micro updates.

Sunday Apr 18, 2010

Replication in 2010.Q1

This post is long overdue since 2010.Q1 came out over a month ago now, but it's better late than never. The bullet-point feature list for 2010.Q1 typically includes something like "improved remote replication", but what do we mean by that? The summary is vague because, well, it's hard to summarize what we did concisely. Let's break it down:

Improved stability. We've rewritten the replication management subsystem. Informed by the shortcomings of its predecessor, the new design avoids large classes of problems that were customer pain points in older releases. The new implementation also keeps more of the relevant debugging data, which allows us to drive new issues to root cause faster and more reliably.

Enhanced management model. We've formalized the notion of packages, which were previously just "replicas" or "replicated projects". Older releases mandated that a given project could only be replicated to a given target once (at a time) and that only one copy of a project could exist on a particular target at a time. 2010.Q1 supports multiple actions for a given project and target, each one corresponding to an independent copy on the target called a "package." This allows administrators to replicate a fresh copy without destroying the one that's already on the target.
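
To illustrate the new model (a hypothetical sketch only -- these aren't the appliance's actual data structures or names), each replication action pairs a project with a target, and each action corresponds to its own package on that target:

    from dataclasses import dataclass, field

    # Hypothetical sketch: a project may have several replication actions,
    # even to the same target, and each action corresponds to an independent
    # package (copy) on that target.
    @dataclass
    class ReplicationAction:
        project: str
        target: str
        package_id: str        # identifies the independent copy on the target

    @dataclass
    class Project:
        name: str
        actions: list = field(default_factory=list)

        def replicate_to(self, target, package_id):
            # Multiple actions to the same target are allowed; each gets its
            # own package, so a fresh copy never clobbers an existing one.
            action = ReplicationAction(self.name, target, package_id)
            self.actions.append(action)
            return action

    proj = Project("engineering")
    proj.replicate_to("dr-site", "pkg-1")   # original copy stays intact
    proj.replicate_to("dr-site", "pkg-2")   # fresh, independent copy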

Share-level replication. 2010.Q1 supports more fine-grained control of replication configuration, like leaving an individual share out of its project's replication configuration or replicating a share by itself without the other shares in its project.

Optional SSL encryption for improved performance. Older releases always encrypt the data sent over the wire. 2010.Q1 still supports this, but also lets customers disable SSL encryption for significantly improved performance when the security of data on the wire isn't so critical (as in many internal environments).

Bandwidth throttling. The system now supports limiting the bandwidth used by individual replication actions. With this, customers with limited network resources can keep replication from hogging the available bandwidth and starving the client data path.
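
The details of how the limit is enforced aren't interesting here, but a token-bucket-style limiter is one common way to implement this kind of per-action cap; the sketch below is purely illustrative, not the actual implementation:

    import time

    class BandwidthThrottle:
        """Illustrative token-bucket throttle: callers ask permission before
        sending nbytes, and the throttle sleeps as needed to stay under
        limit_bps."""
        def __init__(self, limit_bps):
            self.limit_bps = limit_bps
            self.tokens = limit_bps           # start with one second's budget
            self.last = time.monotonic()

        def consume(self, nbytes):
            now = time.monotonic()
            # Refill tokens for elapsed time, capped at one second's worth.
            self.tokens = min(self.limit_bps,
                              self.tokens + (now - self.last) * self.limit_bps)
            self.last = now
            if nbytes > self.tokens:
                time.sleep((nbytes - self.tokens) / self.limit_bps)
                self.tokens = 0
            else:
                self.tokens -= nbytes

    throttle = BandwidthThrottle(limit_bps=10 * 1024 * 1024)   # ~10 MB/s cap
    throttle.consume(4096)   # called before each replication chunk is sent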

Improved target-side management. Administrators can browse replicated projects and shares in the BUI and CLI just like local projects and shares. You can also view properties of these shares and even change them where appropriate. For example, the NFS export list can be customized on the target, which is important for disaster-recovery plans where the target will serve different clients in a different datacenter. Or you could enable stronger compression on the target, saving disk space at the expense of performance, which may be less important on a backup site.

Read-only view of replicated filesystems and snapshots. This is pretty self-explanatory. You can now export replicated filesystems read-only over NFS, CIFS, HTTP, FTP, etc., allowing you to verify the data, run incremental NDMP backups, or perform data analysis that's too expensive to run on the primary system. You can also see and clone the non-replication snapshots.

Then there are lots of small improvements, like being able to disable replication globally, per-action, or per-package, which is very handy when trying it out or measuring performance. Check out the documentation (also much improved) for details.

Tuesday Aug 18, 2009

Threshold alerts

I've previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, like "disk failure" or "backup completion" for example. Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments which use SNMP for monitoring systems across a datacenter). A simple alert configuration might be "send mail to admin@mydomain when a disk fails."
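
That kind of configuration amounts to a mapping from alert categories to actions. Purely as an illustration (hypothetical category and action names, not the appliance's actual interfaces):

    # Hypothetical illustration of alert configuration: each alert category
    # maps to the actions the appliance takes when such an event is posted.
    alert_actions = {
        "hardware/disk/failure": ["send_email:admin@mydomain", "snmp_trap"],
        "backup/job/completion": ["send_email:admin@mydomain"],
    }

    def post_alert(category, detail):
        # Carry out every action configured for this category of alert.
        for action in alert_actions.get(category, []):
            print(f"[{category}] {detail} -> {action}")

    post_alert("hardware/disk/failure", "disk #4 faulted")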

While most alerts come predefined on the system, there's one interesting class of parametrized alerts that we call threshold-based alerts, or just thresholds. Thresholds use DTrace-based Analytics to post (trigger) alerts based on anything you can measure with Analytics. It's best summed up with this screenshot of the configuration screen:

Threshold alert configuration screen

The possibilities here are pretty wild, particularly because you can enable or disable Analytics datasets as an alert action. For datasets with some non-negligible cost of instrumentation (like NFS file operations broken down by client), you can automatically enable them only when you need them. For example, you could have one threshold that enables data collection when the number of operations exceeds some limit, and another that turns collection off again if the number falls below a slightly lower limit.
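
Roughly speaking, such a pair of thresholds amounts to something like the following (hypothetical field names; the real settings are configured through the screen shown above):

    # Hypothetical representation of a pair of thresholds with hysteresis: the
    # upper one enables an expensive Analytics dataset, the lower one disables
    # it again once load has clearly subsided.
    thresholds = [
        {"statistic": "nfs.ops_per_second", "condition": "exceeds",
         "limit": 10000, "action": "enable dataset nfs.ops_by_client"},
        {"statistic": "nfs.ops_per_second", "condition": "falls below",
         "limit": 8000,                       # lower, to avoid flapping
         "action": "disable dataset nfs.ops_by_client"},
    ]

    for t in thresholds:
        print(f"when {t['statistic']} {t['condition']} {t['limit']}: {t['action']}")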


Implementation

One of the nice things about developing a software framework like the appliance kit is that once it's accumulated a critical mass of basic building blocks, you can leverage existing infrastructure to quickly build powerful features. The threshold alerts system sits atop several pieces of the appliance kit, including the Analytics, alerts, and data persistence subsystems. The bulk of the implementation is just a few hundred lines of code that monitors stats and posts alerts as needed.
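
In spirit (and only in spirit -- this is a hypothetical sketch, not appliance kit code), that monitoring logic looks something like this:

    import random
    import time

    # Hypothetical sketch of a threshold monitor: poll a statistic, post an
    # alert when it crosses the configured limit, and re-arm once the value
    # drops back below a lower re-arm point.
    def read_statistic():
        return random.randint(0, 20000)      # stand-in for an Analytics query

    def post_alert(message):
        print("ALERT:", message)             # stand-in for the alerts subsystem

    def monitor(limit=10000, rearm=8000, interval=1.0, iterations=30):
        armed = True
        for _ in range(iterations):
            value = read_statistic()
            if armed and value > limit:
                post_alert(f"statistic exceeded {limit} (value={value})")
                armed = False                # don't re-post until we recover
            elif not armed and value < rearm:
                armed = True
            time.sleep(interval)

    monitor(interval=0.01)    # quick demo run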

Monday Apr 27, 2009

2009.Q2 Released

Today we've released the first major software update for the 7000 series, called 2009.Q2. This update contains a boatload of bugfixes and new features, including support for HTTPS (HTTP with SSL using self-signed certificates). This makes HTTP user support more tenable in less secure environments because credentials don't have to be transmitted in the clear.

Another updated feature that's important for RAS is enhanced support bundles. Support bundles are tarballs containing core files, log files, and other debugging output generated by the appliance that can be sent directly to support personnel. In this release, support bundles collect more useful data about failed services and (critically) can be created even when the appliance's core services have failed or network connectivity has been lost. You can also monitor support bundle progress on the Maintenance -> System screen, and bundles can be downloaded directly from the browser for environments where the appliance cannot connect directly to Sun Support. All of these improvements help us track down problems remotely, relying as little as possible on a functioning system or the administrator's savvy.

See the Fishworks blog for more details on this release. Enjoy!

Saturday Feb 14, 2009

Fault management

The Fishworks storage appliance stands on the shoulders of giants. Many of the most exciting features -- Analytics, the hybrid storage pool, and integrated fault management, for example -- are built upon existing technologies in OpenSolaris (DTrace, ZFS, and FMA, respectively). The first two of these have been covered extensively elsewhere, but I'd like to discuss our integrated fault management, a central piece of our RAS (reliability/availability/serviceability) architecture.

Let's start with a concrete example: suppose hard disk #4 is acting up in your new 7000 series server. Rather than returning user data, it's giving back garbage. ZFS checksums the garbage, immediately detects the corruption, reconstructs the correct data from the redundant disks, and writes the correct block back to disk #4. This is great, but if the disk is really going south, such failures will keep happening, and the system will generate a fault and take the disk out of service.
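
The "keeps happening" part is the interesting bit: a single checksum error doesn't condemn a disk, but a burst of them does. The real diagnosis lives in FMA (its SERD engines are far more sophisticated than this), but as a purely illustrative sketch, the policy boils down to "too many errors in a short window means retire the disk":

    import time
    from collections import deque

    # Illustrative sketch only -- not FMA's actual diagnosis logic.
    class DiskDiagnosis:
        def __init__(self, max_errors=10, window_seconds=600):
            self.max_errors = max_errors
            self.window = window_seconds
            self.errors = deque()

        def record_checksum_error(self):
            now = time.monotonic()
            self.errors.append(now)
            # Forget errors that fell outside the diagnosis window.
            while self.errors and now - self.errors[0] > self.window:
                self.errors.popleft()
            if len(self.errors) >= self.max_errors:
                self.fault_disk()

        def fault_disk(self):
            print("FAULT: disk exceeded checksum-error threshold;"
                  " taking it out of service")

    diag = DiskDiagnosis()
    for _ in range(10):
        diag.record_checksum_error()   # repeated corruption eventually faults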

Faults represent active problems on the system, usually hardware failures. OpenSolaris users are familiar with observing and managing faults through fmadm(1M). The appliance integrates fault management in several ways: as alerts, which allow administrators to configure automated responses to these events of interest; in the active problems screen, which provides a summary view of the current faults on the system; and through the maintenance view, which correlates faults with the actual failed hardware components. Let's look at each of these in turn.

Alerts

Faults are part of a broader group of events we call alerts. Alerts represent events of interest to appliance administrators, ranging from hardware failures to backup job notifications. When one of these events occurs, the system posts an alert, taking whatever action has been configured for it. Most commonly, administrators configure the appliance to send mail or trigger an SNMP trap in response to certain alerts:

Alerts configuration

Managing faults

In our example, you'd probably discover the failed hard disk because you previously configured the appliance to send mail on hardware failure (or hot spare activation, or resilvering completion...). Once you get the mail, you'd log into the appliance web UI (BUI) and navigate to the active problems screen:

Active problems screen

The above screen presents all current faults on the system, summarizing each failure, its impact on the system, and suggested actions for the administrator. You might next click the "more info" button (next to the "out of service" text), which would bring you to the maintenance screen for the faulted chassis, highlighting the broken disk both in the diagram and the component list:

Faulted chassis

This screen connects the fault with the actual physical component that's failed. From here you could also activate the locator LED (which is no simple task behind the scenes) and have a service technician go replace the blinking disk. Of course, once they do, you'll get another mail saying that ZFS has finished resilvering the new disk.

Beyond disks

Disks are interesting examples because they are the heart of the storage server. Moreover, disks are often some of the first components to fail (in part because there are so many of them). But FMA allows us to diagnose many other kinds of components. For example, here are the same screens on a machine with a broken CPU cache:

Failed CPU

Failed CPU (chassis)

Under the hood

This complete story -- from hardware failure to replaced disk, for example -- is built on foundational technologies in OpenSolaris like FMA. Schrock has described much of the additional work that makes this simple but powerful user experience possible for the appliance. Best of all, little of the code is specific to our NAS appliances -- we could conceivably leverage the same infrastructure to manage faults on other kinds of systems.

If you want to see more, download our VMware simulator and try it out for yourself.
