Tuesday Apr 07, 2009

SNMP Traps on T2/T2plus Systems

Earlier this week, an account architect for a large European bank asked me about the SNMP traps that are generated on the SPARC CMT systems, specifically the T5220 and T5240 systems. On these systems, both the ILOM stack on the SP and the Solaris FMA stack on the host are capable of generating SNMP traps for fault conditions. Our customer wanted to know whether it was sufficient to monitor only the traps from Solaris FMA. The short answer is that both Solaris FMA and ILOM traps must be monitored to ensure full coverage. The rest of this entry explains why.

For the purposes of this discussion, I'll refer to an SNMP trap from Solaris FMA as an "FMA SNMP trap" and one from ILOM as an "ILOM SNMP trap".

The key to understanding why both ILOM and Solaris must be monitored is the flow of fault information in the system. This picture (admittedly simplified) should help:

Each stack produces traps for the faults diagnosed in its own precinct: ILOM-diagnosed faults produce an ILOM SNMP trap, and Solaris FMA faults produce an FMA SNMP trap (via the snmp-trapgen plugin). There's also a level of fault sharing between ILOM and Solaris, but notice that the flow of fault information is from Solaris to ILOM.

In Solaris, an FMD plugin called the Event Transport Module (ETM) subscribes to selected fault events and (you guessed it) transports them to ILOM. ILOM then updates its state and its view of the components in the system, and it also generates an ILOM SNMP trap for faults received from Solaris. However, ETM does not transport all fault events: some are not meaningful to ILOM because they represent components beyond ILOM's visibility. Precisely which faults ETM forwards is driven by a configuration file, etm.conf, tailored for each platform or platform family.

This gives us a few flows for SNMP trap generation.

  • FMA diagnosed fault that is transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap, and ILOM generates an ILOM SNMP trap when the fault is received in the service processor.
  • FMA diagnosed fault that is not transported to ILOM: snmp-trapgen in Solaris generates an FMA SNMP trap.
  • ILOM diagnosed fault: ILOM generates an ILOM SNMP trap.
  • ILOM chassis event: ILOM generates an ILOM SNMP trap.
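
The four flows above can be sketched as a small decision function. This is only an illustration of the logic described in this entry, not code from Solaris or ILOM; the `Fault` class, its field names, and `expected_traps` are hypothetical names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fault:
    # Hypothetical model: who diagnosed the fault ("fma" or "ilom"),
    # and whether ETM forwards it to ILOM (only relevant for "fma").
    diagnosed_by: str
    forwarded_by_etm: bool = False


def expected_traps(fault: Fault) -> set[str]:
    """Return the SNMP trap sources expected to fire for a fault."""
    if fault.diagnosed_by == "ilom":
        # ILOM-diagnosed faults and chassis events stay on the SP;
        # fault information does not flow from ILOM to Solaris.
        return {"ILOM"}
    if fault.diagnosed_by == "fma":
        traps = {"FMA"}  # snmp-trapgen reports the FMA-diagnosed fault
        if fault.forwarded_by_etm:
            traps.add("ILOM")  # ILOM raises its own trap on receipt
        return traps
    raise ValueError(f"unknown diagnoser: {fault.diagnosed_by!r}")


if __name__ == "__main__":
    print(sorted(expected_traps(Fault("fma", forwarded_by_etm=True))))
    print(sorted(expected_traps(Fault("fma"))))
    print(sorted(expected_traps(Fault("ilom"))))
```

Note that no branch returns an empty set, and no single trap source appears in every branch, which is the reason both must be monitored.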

Summing this up, taking into account the faults that ETM will forward to ILOM, we can expect the following SNMP trap generation for the various subsystems:

Subsystem                      FMA SNMP Trap   ILOM SNMP Trap
Processor/Cache                yes             yes
Memory                         yes             yes
PCI/PCIE                       yes             yes
Coherency Links [1]            yes             yes
ZFS                            yes             no
Disks                          yes             no
SCSI                           yes             no
Power/Cooling                  no              yes
Environmental/Sensors          no              yes
ASR Disables                   no              yes
Component Insertion/Removal    no              yes

As I'm not an expert on ILOM, the table above may not be an exhaustive list of all of the ILOM events that can trigger an ILOM SNMP trap. But I believe it's sufficient to illustrate the point that if you're monitoring your SPARC CMT systems via SNMP, you must monitor both ILOM SNMP and FMA SNMP traps.
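
To make the coverage gap concrete, the table can be encoded as a lookup and queried for the subsystems that would go unreported under a given monitoring setup. This is a minimal sketch: the data is copied from the table in this entry (which may not be exhaustive), and the names `TRAP_SOURCES` and `uncovered` are made up for the illustration.

```python
# Trap sources per subsystem, transcribed from the table in this entry.
BOTH = frozenset({"FMA", "ILOM"})
FMA_ONLY = frozenset({"FMA"})
ILOM_ONLY = frozenset({"ILOM"})

TRAP_SOURCES = {
    "Processor/Cache": BOTH,
    "Memory": BOTH,
    "PCI/PCIE": BOTH,
    "Coherency Links": BOTH,
    "ZFS": FMA_ONLY,
    "Disks": FMA_ONLY,
    "SCSI": FMA_ONLY,
    "Power/Cooling": ILOM_ONLY,
    "Environmental/Sensors": ILOM_ONLY,
    "ASR Disables": ILOM_ONLY,
    "Component Insertion/Removal": ILOM_ONLY,
}


def uncovered(monitored: set[str]) -> list[str]:
    """List subsystems whose faults raise no trap from any monitored source."""
    return [name for name, sources in TRAP_SOURCES.items()
            if not (sources & monitored)]


if __name__ == "__main__":
    print("FMA only misses: ", uncovered({"FMA"}))
    print("ILOM only misses:", uncovered({"ILOM"}))
    print("Both misses:     ", uncovered({"FMA", "ILOM"}))
```

Monitoring only FMA traps leaves the power, environmental, ASR, and insertion/removal events unreported, while monitoring only ILOM traps leaves ZFS, disk, and SCSI faults unreported.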


Monday Oct 13, 2008

T5440 Fault Management

10/13/2008: Today Sun announced the T5440 platforms centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems, packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature, also known as the Fault Management Architecture (FMA), has been extended to include coverage for the T5440.

With respect to fault management, T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Some of the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs remote
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the four UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. An example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU. Refer to http://sun.com/msg/SUN4V-8001-R3 for more information.
AUTO-RESPONSE: No automated reponse.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.
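
If you collect messages like this one in a monitoring script, a rough way to pull out the fields is to split on the labels. The sketch below is only an illustration and assumes the label set seen in the message above; the names `FIELD_KEYS`, `parse_fma_message`, and `fmdump_command` are hypothetical, and the sample text is abridged from that message. It only builds the fmdump command line named in REC-ACTION and does not run it.

```python
import re

# Labels as they appear in the message above (assumed, not exhaustive).
FIELD_KEYS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]
_KEY_RE = re.compile(r"\b(" + "|".join(re.escape(k) for k in FIELD_KEYS) + r"):\s*")


def parse_fma_message(text: str) -> dict[str, str]:
    """Split a fault message into a {label: value} dictionary."""
    parts = _KEY_RE.split(text)  # [preamble, key, value, key, value, ...]
    fields = {}
    for key, value in zip(parts[1::2], parts[2::2]):
        # Collapse line breaks and drop the comma that separates header fields.
        fields[key] = " ".join(value.split()).rstrip(",")
    return fields


def fmdump_command(fields: dict[str, str]) -> list[str]:
    """Build the 'fmdump -v -u <EVENT_ID>' command from REC-ACTION."""
    return ["fmdump", "-v", "-u", fields["EVENT-ID"]]


# Abridged copy of the message shown above.
SAMPLE = """SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
IMPACT: The system's integrity is seriously compromised."""

if __name__ == "__main__":
    parsed = parse_fma_message(SAMPLE)
    print(parsed["SEVERITY"], parsed["PLATFORM"])
    print(" ".join(fmdump_command(parsed)))
```

Splitting on known labels rather than on line breaks means the same function works whether the message arrives on several lines or flattened onto one.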



