Solaris Fault Management: A Look Back and Looking Forward

The Solaris Fault Management Architecture has come a long way since Mike Shapiro and I started talking about it way back in 2001. We started out with a bang as the industry leader in fault management technology:


  • August 10, 2001: First discussions of a new approach to fault management begin at Sun.

  • January 15, 2002: First internal presentation of plans for a Solaris Fault Management Architecture

  • March 18, 2004: FMA integrates into Solaris 10 Build 56, providing CPU/memory support for US-III and US-IV

  • March 7, 2005: FMA ships to customers as part of Solaris 10 GA

     

The members of our original development team have changed along the way, but our commitment to improving the architecture and adding new content remains steadfast. Since the introduction of FMA in Solaris 10, additional content has been added to support new platforms and extend FMA concepts into other subsystems. Just look at what we've delivered since S10 was released a short 2 years ago:

    • New for SPARC: US-IV+, US-T1, Niagara & Niagara-2, Fire PCI-E I/O

    • New for x64: CPU/Memory error handling and diagnosis for AMD Opteron and Athlon 64

        • Enables all detector banks and sets all documented MCi_CTL bits

        • Full machine-check and error-poller handling for all error types documented in the BKDG

        • Diagnosis engine rules for all error types

        • Response agent: core offline, page retire

    • New for x64: PCI-Express

        • Diagnostic correlation based on transmitter/receiver error information

        • Connections to platform machine-check error handling

        • Connections to FMA-aware leaf drivers for increased availability and diagnosability

        • Diagnosis engine rules for all errors described in the PCI-E Base Specification

    Generates SNMP traps (notifications) for FMA diagnosis

    FM MIB permits additional details by UUID

    Web-browsable interface to view:

        • 3730 FMA Events

        • 338 FMA Knowledge Articles

    CLIs to extract event payload and message content

    • New for Developers: Public interfaces for IO FMA

        • Updated WDD chapter for writing FMA-aware drivers

    • Deployment: FMA Demo Package

        • Infrastructure to inject errors in a simulation environment

    Best of all, Solaris FMA is getting noticed and showing real benefits. The Sun Service organization estimates that platforms shipping without FMA support can cost $252 per unit per year. Let's do the math: if Sun sells 100,000 units per year, then after 3 years Solaris with FMA has saved Sun $75,600,000.

    100000 units per year x $252 per unit x 3 years = $75,600,000
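The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the flat estimate used in this post (units x cost x years); the function name `flat_savings` is made up for illustration, and the model assumes the $252 figure applies uniformly to every unit counted.

```python
def flat_savings(units_per_year: int, cost_per_unit_year: int, years: int) -> int:
    """Flat estimate from the post: units sold per year x cost avoided per unit x years."""
    return units_per_year * cost_per_unit_year * years


# Figures quoted in the post.
UNITS_PER_YEAR = 100_000
COST_PER_UNIT_YEAR = 252  # dollars per unit per year, per the Sun Service estimate

three_year_total = flat_savings(UNITS_PER_YEAR, COST_PER_UNIT_YEAR, 3)
single_year = flat_savings(UNITS_PER_YEAR, COST_PER_UNIT_YEAR, 1)

print(f"3-year total: ${three_year_total:,}")  # 3-year total: $75,600,000
print(f"Per year:     ${single_year:,}")       # Per year:     $25,200,000
```

Note that the $75,600,000 figure is the three-year total; under the same flat model, a single year works out to $25,200,000.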

    I don't know about you, but I wouldn't mind saving more than $75,000,000 over three years. A paper presented by Mike Shapiro and Dong Tang at the 2006 International Conference on Dependable Systems and Networks (DSN) used quantitative analysis of the FMA memory page retirement capabilities to demonstrate a 37-54% decrease in annual system downtime. InfoWorld gave Solaris FMA a nod by awarding our team members its 2005 Innovation of the Year Award.

    So, what are we working on now? We are continuing to deliver on the promise of Predictive Self-Healing. Work is ongoing to support out-the-door fault management capabilities for new processors, platforms and I/O subsystems. With the announced support for Intel on Solaris (or is it Solaris on Intel?), we are busily working on an FMA implementation for Intel processors. Solaris will be the first OS to take full advantage of industry-leading x86 processor error handling features.

    In the I/O space, we are beefing up leaf drivers, adding FMA error handling and diagnosis for SCSI problems, and using SMART disk data to actively predict impending disk failures on all platforms. The Xen project gives us an opportunity to deploy FMA in a virtualized environment. We'll take some of the infrastructure we delivered for LDOMs and use it to connect hypervisor error handling to a DOM0 diagnosis environment.

    But that's not all. We are looking at ways to use sensor telemetry to offer better fault prediction, manage resource guarantees and budget power. On the software front, we are adapting the techniques we've used to diagnose hardware problems so they are useful for software diagnosis. This is a huge, under-explored area that will keep Solaris at the forefront of availability and serviceability.

    Stay tuned, we're not done with FMA just yet.

    Cindi
