FMA on x64 and at DSN 06

Last Monday, Sun officially released Solaris 10 6/06, our second update to Solaris 10. Among the many new features are some exciting enhancements to our Solaris Predictive Self-Healing feature set, including:
  • Fault management support for Opteron x64 systems, including CPU and memory diagnosis and recovery,
  • Fault management support for SNMP traps and a new MIB for browsing fault management results, and
  • Fault management support for ZFS, which also is new in Solaris 10 6/06.
And last week at Dependable Systems and Networks 2006, Dong Tang presented a paper we co-authored demonstrating some of the quantitative benefits of Solaris's self-healing features, showing that our unique memory page retirement feature can decrease annual downtime by 37-54%. If you haven't already tried some of these features out through OpenSolaris or Solaris Express, and availability is important to you, you should definitely give Solaris 10 6/06 a try. In particular, the combination of ZFS protecting your data, AMD's Opteron RAS features, and our Solaris Predictive Self-Healing capabilities provides unprecedented availability for x64 platforms. Here are a few more details:

Opteron/x64 Features

The folks at AMD have put together an impressive (and growing) list of hardware RAS features in Opteron. These include:
  • Hardware cache and memory scrubbers,
  • ChipKill ECC for main memory,
  • Extended hardware error registers for first-fault analysis, and
  • a hardware watchdog for HyperTransport transactions.
With Solaris 10 6/06 (or Solaris Express or the latest OpenSolaris), we've provided a new kernel module loading mechanism that lets Solaris load enhanced CPU-specific support for a particular type of CPU. For example, the Opteron fault management support is provided in this module:
 15 fffffffffbbd3eb0   3d10   -   1  cpu.AuthenticAMD.15 (AMD Athlon64/Opteron CPU Module)
which we load automatically on any Athlon64 or Opteron system. This new module permits Solaris to use the Opteron-specific hardware features to convert hardware error state into telemetry to drive our automated diagnosis software, and then trigger reactions like dynamically offlining a CPU core or retiring a physical page of memory. All of this is done automatically for you, with a first-class administrative model built into Solaris, rather than bursting into flames or spewing random bits of hardware error state out to the poor administrator. Gavin posted more low-level details and examples of our Opteron fault management features on his blog.

Memory Page Retire Benefits

Memory page retire (MPR) is a unique feature of Solaris's self-healing system. It lets the kernel, at the request of fmd and based upon a diagnosis of an underlying memory fault, remove a particular physical page of memory from use in the system. Thanks to virtual memory, we can actually copy the content to another physical page first (assuming we have a series of correctable errors or a ChipKill event), thereby making the entire operation transparent to running user processes. Similarly, if we have an uncorrectable error (UE) on a clean page, we can retire the page, and then let the kernel fault in the page content again by reading it from the backing object (e.g. a page of text from libc sitting on your filesystem). In the unlikely event of an uncorrectable memory error on a dirty page, we can kill the process, letting smf(5) restart the associated service according to its dependencies.
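
The three cases above can be sketched as a small decision function. This is only an illustrative model in Python, not the actual Solaris kernel or fmd code: the names MemoryFault, ErrorKind, Action, and choose_action are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from enum import Enum, auto


class ErrorKind(Enum):
    # Hypothetical classification of the diagnosed memory fault.
    CORRECTABLE = auto()    # a series of correctable errors or a ChipKill event
    UNCORRECTABLE = auto()  # an uncorrectable error (UE)


class Action(Enum):
    # Hypothetical names for the three outcomes described in the text.
    COPY_THEN_RETIRE = auto()         # copy content to a new page, retire the old one
    RETIRE_AND_REFAULT = auto()       # retire the page, re-read it from the backing object
    RETIRE_AND_KILL_PROCESS = auto()  # retire the page, kill the process, let smf(5) restart it


@dataclass
class MemoryFault:
    kind: ErrorKind
    dirty: bool  # True if the page holds changes not present in its backing object


def choose_action(fault: MemoryFault) -> Action:
    """Pick the page retire response for a diagnosed memory fault."""
    if fault.kind is ErrorKind.CORRECTABLE:
        # The data is still readable, so it can be moved transparently.
        return Action.COPY_THEN_RETIRE
    if not fault.dirty:
        # The data is lost, but an intact copy exists in the backing object.
        return Action.RETIRE_AND_REFAULT
    # The data is lost and no clean copy exists.
    return Action.RETIRE_AND_KILL_PROCESS


if __name__ == "__main__":
    for fault in (
        MemoryFault(ErrorKind.CORRECTABLE, dirty=True),
        MemoryFault(ErrorKind.UNCORRECTABLE, dirty=False),
        MemoryFault(ErrorKind.UNCORRECTABLE, dirty=True),
    ):
        print(fault, "->", choose_action(fault).name)
```

Note that only the last case is visible to a running application, which is why the text calls it the unlikely one.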

The great part about page retire is that it's free: it comes at no performance cost, and at a significantly smaller space and dollar cost than hardware memory redundancy. And of course the software to implement it, OpenSolaris, is entirely free too :) Page retire is therefore complementary to the hardware RAS features on Opteron and SPARC, and maximizes the benefit of everything when you have our diagnosis software looking at the underlying failures and figuring out which hammer is the right one to use on the problem. Recently, Dong Tang, Peter Carruthers, Zuheir Totari, and I wrote up a paper describing a quantitative model for the benefits of MPR, demonstrating that it can reduce annual downtime by 37-54%. You can read our DSN '06 paper here. Solaris now offers memory diagnosis and page retire on all our systems, including Opteron, UltraSPARC III, UltraSPARC IV, and UltraSPARC T1.
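
To put the 37-54% range in concrete terms, here is a small arithmetic sketch in Python. The 100-minute baseline is a made-up figure chosen only to make the percentages easy to read; it is not a number from the paper, and the function name is hypothetical.

```python
def remaining_downtime(baseline_minutes: float, reduction_percent: int) -> float:
    """Annual downtime left after applying a percentage reduction."""
    return baseline_minutes * (100 - reduction_percent) / 100


# Hypothetical baseline, for illustration only (not taken from the paper).
HYPOTHETICAL_BASELINE_MINUTES = 100

# The two ends of the 37-54% reduction range quoted above.
smallest_benefit = remaining_downtime(HYPOTHETICAL_BASELINE_MINUTES, 37)
largest_benefit = remaining_downtime(HYPOTHETICAL_BASELINE_MINUTES, 54)

if __name__ == "__main__":
    print(f"Downtime left at 37% reduction: {smallest_benefit} minutes")
    print(f"Downtime left at 54% reduction: {largest_benefit} minutes")
```

Under that assumed baseline, the system would be left with somewhere between 46 and 63 minutes of downtime per year instead of 100.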

SNMP Features

We've also introduced a connector between the Solaris fault management stack and SNMP in Solaris 10 6/06, permitting the fault manager to publish SNMP traps and provide MIB browsing for diagnosed problems. This is implemented using the Solaris Net-SNMP stack; Keith has posted examples of how to use these features on his blog. The SNMP MIB provides an ideal connection between the fault management features in Solaris and whatever heterogeneous management software you already use. You can examine the MIB itself if you want to learn more.

