Wednesday Apr 09, 2008

Predictive Self Healing (FMA) for T5140/T5240

April 9, 2008: Sun announced the T5140/T5240 platforms centered around the UltraSPARC T2 Plus processor. The T2 Plus extends the capabilities of the UltraSPARC T2 processor, the most obvious being the capability for multiple processors in a system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for T2 Plus.

With respect to fault management, T2 Plus is very similar to the T2. The fault management features of T5140/T5240 are listed below, along with example output for a couple of the new T2 Plus diagnosis features.

  • Base UltraSPARC T2 features: All of the FMA features present on the T2 processor are also available with the T2 Plus-based systems
  • Coherency plane diagnosis: The T2 Plus processors in the T5140/T5240 systems communicate with one another across a coherency plane, similar in nature to a Fully Buffered DIMM (FB-DIMM) channel. Error handling and diagnosis have been enhanced to detect and diagnose errors (single-lane/multi-lane/protocol errors) on the coherency plane.
  • Local vs. remote errors: With multiple processors in the system, it is possible that one T2 Plus can trigger an error in another T2 Plus (e.g. a remote read of memory/cache). The error handlers have been extended to recognize local vs. remote errors and produce the proper telemetry so diagnosis engines indict the correct T2 Plus.
  • Automatic FB-DIMM lane failover: The UltraSPARC T2 Plus memory controller seamlessly handles a single lane failover on an FB-DIMM link without a system crash. The fault management subsystem has been updated to differentiate between FB-DIMM errors resulting in lane failovers vs. those that do not. Additional information on FB-DIMM diagnosis is in one of my earlier blogs.
  • Extended IO controller diagnosis: The embedded IO controller in the T2 Plus added a few new error detectors, and the FMA software has been extended to include diagnoses for these.

Example output of some of the new features is below.

Additionally, the FMA Demo Kit has been updated for the T5140/T5240 as well. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: Automatic Recovery from a Coherency Plane Single Lane Failure
If the T2 Plus hardware detects errors on the coherency planes between the processors, the lanes can be retrained. In the face of a persisting error, a lane may be failed by the hardware. For a single lane failure, the system continues to operate and the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007
PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12
DESC: A CPU chip's Link Framing Unit has stopped using a bad lane. Refer to for more information.
AUTO-RESPONSE: No other automated response.
IMPACT: The system's capacity to correct transmission errors between CPU chips has been reduced.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Similar messaging is produced for multi-lane failures or protocol failures, although such failures are fatal and cause a system reset.
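
If you want to pick these console messages apart in a script (say, to pull out the EVENT-ID to hand to fmdump), something along these lines may help. This is only a minimal Python sketch, not part of FMA or Solaris: the parse_fma_message helper and the FIELDS list are my own assumptions, with the field names taken from the message shown above.

```python
import re

# Field labels as they appear in the console message above (assumed complete
# for this sketch; other message types may carry different labels).
FIELDS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]

# Match any known label followed by a colon and whitespace. Longer labels are
# tried first so a short label never shadows a longer one.
_KEY_RE = re.compile(
    r"\b(" + "|".join(re.escape(f) for f in sorted(FIELDS, key=len, reverse=True)) + r"):\s"
)


def parse_fma_message(text):
    """Hypothetical helper: split an FMA console message into a dict of fields."""
    matches = list(_KEY_RE.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip().rstrip(",").strip()
    return fields


SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major "
    "EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007 "
    "PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134 "
    "SOURCE: cpumem-diagnosis, REV: 1.6 "
    "EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12 "
    "DESC: A CPU chip's Link Framing Unit has stopped using a bad lane. "
    "AUTO-RESPONSE: No other automated response."
)

fields = parse_fma_message(SAMPLE)
print(fields["SUNW-MSG-ID"], fields["EVENT-ID"])
```

The parser keys on the known labels rather than splitting on commas, because values such as the PLATFORM string contain commas themselves.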

Example: FB-DIMM Single Lane Failover
In my blog a few months ago, I covered the addition of FB-DIMM channel diagnosis. New with T5140/T5240 is the hardware's capability to ensure a single lane failover without system interruption. When a failover occurs, the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-7R, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Feb 22 16:20:37 EST 2008
PLATFORM: SUNW,T5240, CSN: -, HOSTNAME: wgs48-113
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 24c57072-1f59-6054-9af4-a5515325ab0c
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller. A lane failover has taken place. Refer to for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.
REC-ACTION: At convenient time, try reseating the memory module(s). If problem persists, contact Sun to schedule part replacement.

Note the bit about a lane failover taking place in the DESC portion of the console message. Also, when examining the telemetry leading to the fault, we see a single FB-DIMM recoverable ereport:

# fmdump -e -u 24c57072-1f59-6054-9af4-a5515325ab0c
TIME                 CLASS
Feb 22 16:20:37.7236 ereport.cpu.ultraSPARC-T2plus.fbr

Since the hardware has already experienced a lane failover, messaging of the fault is immediate. This type of correctable error is not put through a Soft Error Rate Discriminator (SERD) engine.
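
For readers who haven't met a SERD engine before, the general idea is a counter that only fires once N events have been seen within a time window T. The Python sketch below illustrates that idea and contrasts it with the immediate path described above. It is not the fmd implementation: the SerdEngine class, the diagnose function, and the N=3 / T=60 values are hypothetical and chosen purely for illustration; only the ereport class name comes from the fmdump output above.

```python
from collections import deque


class SerdEngine:
    """Illustrative SERD-style engine: fires once n events fall within t seconds."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = deque()

    def record(self, timestamp):
        """Record one event; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


# Ereport classes diagnosed immediately, bypassing the SERD engine.
# The class name is the one shown in the fmdump output above.
IMMEDIATE_CLASSES = {"ereport.cpu.ultraSPARC-T2plus.fbr"}


def diagnose(ereport_class, timestamp, engine):
    """Hypothetical diagnosis step: return True if a fault should be reported."""
    if ereport_class in IMMEDIATE_CLASSES:
        return True  # lane failover already happened, so report right away
    return engine.record(timestamp)


engine = SerdEngine(n=3, t=60)  # hypothetical thresholds
print(diagnose("ereport.cpu.ultraSPARC-T2plus.fbr", 0, engine))
```

With these made-up thresholds, an ordinary correctable error would need three occurrences inside a minute before a fault is reported, whereas the lane failover ereport produces a fault on its first occurrence.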




