Tuesday Apr 07, 2009

SNMP Traps on T2/T2plus Systems

Earlier this week, an account architect for a large European bank asked me about the SNMP traps that are generated on the SPARC CMT systems, specifically the T5220 and T5240 systems. On these systems, both the ILOM stack on the SP and the Solaris FMA stack on the host are capable of generating SNMP traps for fault conditions. Our customer wanted to know whether it was sufficient to monitor only the traps from Solaris FMA. The short answer is no: both Solaris FMA and ILOM traps must be monitored to ensure full coverage. The rest of this entry details the reasons why.

For the purposes of this discussion, I'll refer to an SNMP trap from Solaris FMA as an "FMA SNMP trap" and one from ILOM as an "ILOM SNMP trap".

The key to understanding why both ILOM and Solaris must be monitored is the flow of fault information in the system. This picture (admittedly simplified) should help:

Each stack produces traps for the faults diagnosed in its own domain: ILOM-diagnosed faults produce an ILOM SNMP trap, and Solaris FMA faults produce an FMA SNMP trap (via the snmp-trapgen plugin). There is also a level of fault sharing between ILOM and Solaris, but notice that the flow of fault information is from Solaris to ILOM.

In Solaris, an FMD plugin called the Event Transport Module (ETM) subscribes to selected fault events and (you guessed it) transports them to ILOM. ILOM then updates its state and view of the components in the system. And for faults received from Solaris, ILOM will also generate an SNMP trap. However, ETM does not transport all fault events. Some fault events are not meaningful to ILOM as they represent components beyond ILOM's visibility. Precisely which faults are forwarded by ETM is driven by a configuration file, etm.conf, tailored for each platform or platform family.

This gives us a few flows for SNMP trap generation.

  • FMA-diagnosed fault that is transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap, and ILOM generates an ILOM SNMP trap when the fault is received in the service processor.
  • FMA-diagnosed fault that is not transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap.
  • ILOM-diagnosed fault: ILOM generates an ILOM SNMP trap.
  • ILOM chassis event: ILOM generates an ILOM SNMP trap.
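As a rough illustration, the four flows above can be written down as a small Python function. This is only a sketch of the logic described in this entry: the function name, the source labels ("fma", "ilom", "ilom-chassis") and the forwarded_by_etm flag are names made up for the example, not part of Solaris or ILOM.

```python
# Sketch of the trap flows listed above. All names here are hypothetical
# and exist only for this illustration.

FMA_TRAP = "FMA SNMP trap"
ILOM_TRAP = "ILOM SNMP trap"


def traps_for(source, forwarded_by_etm=False):
    """Return the set of SNMP traps expected for one fault or event.

    source: "fma" (diagnosed by Solaris FMA), "ilom" (diagnosed by ILOM),
            or "ilom-chassis" (ILOM chassis event).
    forwarded_by_etm: only meaningful for "fma"; True when ETM transports
            the fault to ILOM.
    """
    if source == "fma":
        traps = {FMA_TRAP}          # snmp-trapgen always reports FMA faults
        if forwarded_by_etm:
            traps.add(ILOM_TRAP)    # ILOM reports faults it receives from Solaris
        return traps
    if source in ("ilom", "ilom-chassis"):
        return {ILOM_TRAP}          # information never flows from ILOM to Solaris
    raise ValueError("unknown source: %r" % (source,))


if __name__ == "__main__":
    print(sorted(traps_for("fma", forwarded_by_etm=True)))
    print(sorted(traps_for("fma", forwarded_by_etm=False)))
    print(sorted(traps_for("ilom")))
```

The asymmetry is the whole point: no branch ever adds an FMA trap for something that originates in ILOM.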

Summing this up, taking into account the faults that ETM will forward to ILOM, we can expect the following SNMP trap generation for the various subsystems:

Subsystem                     FMA SNMP Trap   ILOM SNMP Trap
Processor/Cache               yes             yes
Memory                        yes             yes
PCI/PCIE                      yes             yes
Coherency Links[1]            yes             yes
ZFS                           yes             no
Disks                         yes             no
SCSI                          yes             no
Power/Cooling                 no              yes
Environmental/Sensors         no              yes
ASR Disables                  no              yes
Component Insertion/Removal   no              yes

As I'm not an expert on ILOM, the table above may not be an exhaustive list of all of the ILOM events that can trigger an ILOM SNMP trap. But I believe it's sufficient to illustrate the point that if you're monitoring your SPARC CMT systems via SNMP, you must monitor both ILOM SNMP and FMA SNMP traps.
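To make the gaps concrete, here is a minimal Python sketch that encodes the table above and lists which subsystems go unreported when only one trap source is monitored. The data is copied from the table; the dictionary and function names are illustrative only.

```python
# Coverage table from this entry: subsystem -> (FMA trap?, ILOM trap?).
# The structure and names are illustrative, not taken from any Sun tool.
COVERAGE = {
    "Processor/Cache": (True, True),
    "Memory": (True, True),
    "PCI/PCIE": (True, True),
    "Coherency Links": (True, True),
    "ZFS": (True, False),
    "Disks": (True, False),
    "SCSI": (True, False),
    "Power/Cooling": (False, True),
    "Environmental/Sensors": (False, True),
    "ASR Disables": (False, True),
    "Component Insertion/Removal": (False, True),
}


def blind_spots(monitor_fma, monitor_ilom):
    """Return subsystems that produce no trap the monitor would see."""
    missed = []
    for subsystem, (fma, ilom) in COVERAGE.items():
        seen = (monitor_fma and fma) or (monitor_ilom and ilom)
        if not seen:
            missed.append(subsystem)
    return missed


if __name__ == "__main__":
    print("FMA only misses: ", blind_spots(True, False))
    print("ILOM only misses:", blind_spots(False, True))
    print("Both misses:     ", blind_spots(True, True))
```

Monitoring only FMA traps misses the power, environmental, ASR and insertion/removal events; monitoring only ILOM traps misses ZFS, disks and SCSI. Only the combination leaves the list empty.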


Wednesday Apr 09, 2008

Predictive Self Healing (FMA) for T5140/T5240

April 9, 2008: Sun announced the T5140/T5240 platforms centered around the UltraSPARC T2 Plus processor. The T2 Plus extends the capabilities of the UltraSPARC T2 processor, the most obvious being the capability for multiple processors in a system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for T2 Plus.

With respect to fault management, T2 Plus is very similar to the T2. The fault management features of T5140/T5240 are listed below, along with example output for a couple of the new T2 Plus diagnosis features.

  • Base UltraSPARC T2 features: All of the FMA features present on the T2 processor are also available with the T2 Plus-based systems
  • Coherency plane diagnosis: The T2 Plus processors in the T5140/T5240 systems communicate with one another across a coherency plane, similar in nature to a Fully Buffered DIMM (FB-DIMM) channel. Error handling and diagnosis have been enhanced to detect and diagnose errors (single-lane/multi-lane/protocol errors) on the coherency plane.
  • Local vs. remote errors: With multiple processors in the system, it is possible that one T2 Plus can trigger an error in another T2 Plus (e.g. a remote read of memory/cache). The error handlers have been extended to recognize local vs. remote errors and produce the proper telemetry so diagnosis engines indict the correct T2 Plus.
  • Automatic FB-DIMM lane failover: The UltraSPARC T2 Plus memory controller seamlessly handles a single lane failover on an FB-DIMM link without a system crash. The fault management subsystem has been updated to differentiate between FB-DIMM errors resulting in lane failovers vs. those that do not. Additional information on FB-DIMM diagnosis is in one of my earlier blogs.
  • Extended IO controller diagnosis: The embedded IO controller in the T2 Plus added a few new error detectors, and the FMA software has been extended to include diagnoses for these.

Example output of some of the new features is below.

Additionally, the FMA Demo Kit has been updated for the T5140/T5240 as well. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: Automatic Recovery from a Coherency Plane Single Lane Failure
If the T2 Plus hardware detects errors on the coherency planes between the processors, the lanes can be retrained. In the face of a persisting error, a lane may be failed by the hardware. For a single lane failure, the system continues to operate and the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007
PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12
DESC: A CPU chip's Link Framing Unit has stopped using a bad lane. Refer to http://sun.com/msg/SUN4V-8001-MR for more information.
AUTO-RESPONSE: No other automated response.
IMPACT: The system's capacity to correct transmission errors between CPU chips has been reduced.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Similar messaging is produced for multi-lane failures or protocol failures, although such failures are fatal and cause a system reset.
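If you want to pick these console messages apart in a monitoring script, the labelled fields are regular enough to split with a regular expression. The following Python sketch is one possible approach and not an official parser: the function name is made up, and the list of labels is simply the set that appears in the message above. It splits on the labels rather than on line breaks, so it works whether or not the message has been wrapped.

```python
import re

# Field labels as they appear in the console message above.
KEYS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]
# A label only counts when it is not glued to a preceding word or hyphen,
# so the "VER" inside "SEVERITY" is never mistaken for a field.
KEY_RE = re.compile(r"(?<![\w-])(" + "|".join(re.escape(k) for k in KEYS) + r"):")


def parse_fma_message(text):
    """Split an FMA console message into a {label: value} dictionary."""
    matches = list(KEY_RE.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        value = " ".join(text[m.end():end].split())   # collapse whitespace
        fields[m.group(1)] = value.rstrip(",").strip()
    return fields


SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major\n"
    "EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007\n"
    "PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134\n"
    "SOURCE: cpumem-diagnosis, REV: 1.6\n"
    "EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12\n"
    "DESC: A CPU chip's Link Framing Unit has stopped using a bad lane.\n"
)

if __name__ == "__main__":
    for key, value in parse_fma_message(SAMPLE).items():
        print("%-12s %s" % (key, value))
```

The EVENT-ID value extracted this way is what you would pass to fmdump -v -u, as the REC-ACTION text suggests.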

Example: FB-DIMM Single Lane Failover
In my blog a few months ago, I covered the addition of FB-DIMM channel diagnosis. New with T5140/T5240 is the hardware's capability to perform a single lane failover without system interruption. When a lane failover occurs, a message like the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-7R, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Feb 22 16:20:37 EST 2008
PLATFORM: SUNW,T5240, CSN: -, HOSTNAME: wgs48-113
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 24c57072-1f59-6054-9af4-a5515325ab0c
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller. A lane failover has taken place. Refer to http://sun.com/msg/SUN4V-8001-7R for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.
REC-ACTION: At convenient time, try reseating the memory module(s). If problem persists, contact Sun to schedule part replacement.

Note the bit about a lane failover taking place in the DESC portion of the console message. Also, when examining the telemetry leading to the fault, we see a single FB-DIMM recoverable ereport:

# fmdump -e -u 24c57072-1f59-6054-9af4-a5515325ab0c
TIME                 CLASS
Feb 22 16:20:37.7236 ereport.cpu.ultraSPARC-T2plus.fbr

Since the hardware has already experienced a lane failover, messaging of the fault is immediate. This type of correctable error is not put through a Soft Error Rate Discriminator (SERD) engine.
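For readers who haven't met a SERD engine before, the general idea is "declare a fault only when N errors are seen within a time window T". The Python sketch below illustrates that idea and contrasts it with the immediate reporting used for a lane failover. It is a simplified model under my own assumptions, not the fmd implementation: the class, the lane_failover flag, and the N and T values are all invented for the example.

```python
from collections import deque


class SerdEngine:
    """Toy Soft Error Rate Discriminator: fires on N events within T seconds."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = deque()   # timestamps still inside the window

    def record(self, timestamp):
        """Record one error; return True when the rate threshold is reached."""
        self.events.append(timestamp)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


def should_report(engine, timestamp, lane_failover):
    """Decide whether an error leads to a fault message right now.

    lane_failover is a hypothetical flag meaning the hardware has already
    failed over a lane; such errors bypass the SERD engine entirely.
    """
    if lane_failover:
        return True
    return engine.record(timestamp)


if __name__ == "__main__":
    engine = SerdEngine(n=3, t=10.0)   # illustrative thresholds only
    for ts in (0.0, 4.0, 20.0, 25.0, 29.0):
        print(ts, should_report(engine, ts, lane_failover=False))
    print("failover:", should_report(SerdEngine(3, 10.0), 0.0, lane_failover=True))
```

With these made-up thresholds, the ordinary errors only produce a fault on the fifth event, when three of them fall inside the same 10-second window, while the failover case is reported on the very first event.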




