Friday Jun 08, 2007


Just today, I posted the first draft of the Sensor Abstraction Layer design document. The project addresses the problem of aggregating and analyzing telemetry exported by disparate sources so that the results can be observed via standard interfaces. The basic design is composed of three distinct sub-layers: a provider layer, a collection layer and an analyzer layer. At the lowest level, the provider layer exports interfaces to read sensor or statistical values without having to understand the implementation details of the subsystem exporting the telemetry.

Telemetry data is logged according to the collection parameters established for a collector. Sensor telemetry is passed from collectors to the analyzer layer for online analysis. For example, we may want to collect telemetry for our network subsystem based upon GLD-aware NIC driver kstats, protocol-specific errors and memory usage as seen in netstat(1M), either to help predict unhealthy hardware or software or to ensure QoS guarantees.
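To make the layering a bit more concrete, here is a minimal sketch in Python. It is not taken from the design document: the names `Provider`, `Collector`, `analyze` and `demo`, the `max_samples` parameter, the sensor names and the threshold-on-mean analysis are all assumptions chosen for illustration. It only shows how a provider can hide where a value comes from, how a collector can log it under a collection parameter, and how an analyzer can consume the log.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, Dict, List

# Provider layer (hypothetical shape): a callable that returns one reading
# and hides how the underlying subsystem produces it.
Provider = Callable[[], float]


@dataclass
class Collector:
    """Collection layer: samples named providers into a bounded log."""
    providers: Dict[str, Provider]
    max_samples: int = 100  # illustrative collection parameter
    log: Dict[str, List[float]] = field(default_factory=dict)

    def sample(self) -> None:
        for name, read in self.providers.items():
            history = self.log.setdefault(name, [])
            history.append(read())
            # Keep only the newest max_samples readings.
            while len(history) > self.max_samples:
                history.pop(0)


def analyze(log: Dict[str, List[float]], thresholds: Dict[str, float]) -> List[str]:
    """Analyzer layer: names of sensors whose mean reading exceeds its threshold."""
    return sorted(
        name for name, samples in log.items()
        if samples and name in thresholds and mean(samples) > thresholds[name]
    )


def demo() -> List[str]:
    # Stand-in providers; real ones would read kstats or protocol counters.
    nic_errors = iter([0.0, 2.0, 4.0])
    collector = Collector(
        providers={
            "nic0.errors": lambda: next(nic_errors),
            "mem.used_pct": lambda: 40.0,
        },
        max_samples=2,
    )
    for _ in range(3):
        collector.sample()
    return analyze(collector.log, {"nic0.errors": 2.5, "mem.used_pct": 90.0})


print(demo())
```

In this toy run the collector keeps only the two newest readings per sensor, so the analyzer flags the made-up `nic0.errors` sensor and leaves `mem.used_pct` alone. A real analyzer would presumably do something smarter than a mean against a fixed threshold.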

We can use many of the concepts and the infrastructure developed for the Solaris Fault Manager. For example, telemetry data can be passed as FMA standard events and logged using the Extended Accounting format developed for the errlog and fltlog. We can also leverage the fmd(1M) tool set to observe telemetry logs and analysis results.




Hope to have more details soon...




Friday May 04, 2007

Solaris Fault Management: A Look Back and Looking Forward

The Solaris Fault Management Architecture has come a long way since Mike Shapiro and I started talking about it way back in 2001. We started out with a bang as the industry leader in fault management technology:

  • August 10, 2001: First discussions of a new approach to fault management begin at Sun.

  • January 15, 2002: First internal presentation of plans for a Solaris Fault Management Architecture.

  • March 18, 2004: FMA integrates into Solaris 10 Build 56, providing CPU/memory fault management for US-III and US-IV.

  • March 7, 2005: FMA ships to customers as part of Solaris 10 GA.


The members of our original development team have changed along the way, but our commitment to improving the architecture and adding new content remains steadfast. Since the introduction of FMA in Solaris 10, additional content has been added to support new platforms and extend FMA concepts into other subsystems. Just look at what we've delivered since S10 was released a short two years ago:

    • New for SPARC: US-IV+, US-T1, Niagara & Niagara-2, Fire PCI-E I/O

    • New for x64: CPU/Memory error handling and diagnosis for AMD Opteron and Athlon 64

      • Enables all detector banks and sets all documented MCi_CTL bits

      • Full machine-check and error-poller handling for all error types documented in the BKDG

      • Diagnosis engine rules for all error types

      • Response agent: core offline, page retire

    • New for x64: PCI-Express

      • Diagnostic correlation based on transmitter/receiver error information

      • Connections to platform machine-check error handling

      • Connections to FMA-aware leaf drivers for increased availability and diagnosability

      • Diagnosis engine rules for all errors described in the PCI-E Base Specification

    Generates SNMP traps (notifications) for FMA diagnosis

    FM MIB permits additional details by UUID

    Web browsable interface to view

    3730 FMA Events

    338 FMA Knowledge Articles

    CLIs to extract event payload and message content

    • New for Developers: Public interfaces for IO FMA

      • Updated WDD chapter for writing FMA-aware drivers

    • Deployment: FMA Demo Package

      • Infrastructure to inject errors in a simulation environment

    Best of all, Solaris FMA is getting noticed and showing real benefits. The Sun Service organization estimates that platforms shipping without FMA support can cost $252 per unit per year. Let's do the math: if Sun sells 100,000 units per year, then after 3 years Solaris with FMA is saving Sun $75,600,000.

    100,000 units per year x $252 per unit x 3 years = $75,600,000

    I don't know about you, but I wouldn't mind saving more than $75,000,000 over three years. A paper presented by Mike Shapiro and Dong Tang at the Dependable Systems and Networks (DSN) 2006 conference used quantitative analysis of the FMA memory retirement capabilities to demonstrate a 37-54% decrease in annual system downtime. InfoWorld gave Solaris FMA a nod by awarding our team members its 2005 Innovation of the Year Award.

    So, what are we working on now? Well, we are continuing to deliver on the promise of Predictive Self-Healing. Work is ongoing to support out-the-door fault management capabilities for new processors, platforms and I/O subsystems. With the announced support for Intel on Solaris (or is it Solaris on Intel?), we are busily working on an FMA implementation for Intel processors. Solaris will be the first OS to take full advantage of industry-leading x86 processor error handling features. In the I/O space, we are beefing up leaf drivers, adding FMA error handling and diagnosis for SCSI problems and using SMART disk data to actively predict impending disk failures for all platforms. The Xen project gives us an opportunity to deploy FMA in a virtualized environment. We'll take some of the infrastructure we delivered for LDOMs and use it to connect hypervisor error handling to a DOM0 diagnosis environment. But that's not all: we are looking at ways to use sensor telemetry to offer better fault prediction and to manage resource guarantees and power budgeting. On the software front, we are adapting the techniques we've used to diagnose hardware problems so that they are useful for software diagnosis. This is a huge, under-explored area that will keep Solaris at the forefront with leading-edge availability and serviceability.

    Stay tuned, we're not done with FMA just yet.




