Friday Dec 11, 2009

Rainbow Falls Integration

Several months ago, the Rainbow Falls processor got press. Today, support for the processor was integrated into OpenSolaris. While I played a small role in the specifics of the Rainbow Falls project, I'm personally excited about this integration, as it's the first SPARC processor to utilize the Platform Independent FMA work I spearheaded roughly one year ago. From an implementation standpoint, Rainbow Falls required few changes to the topo enumeration code or diagnosis rules.

Looking forward to the forthcoming products!


Thursday Aug 27, 2009

Rainbow Falls and FMA

Many of you have no doubt heard about the Rainbow Falls processor that Sun provided details on at the Hot Chips conference. I'm excited about this new SPARC processor, and not just because it's a nifty package bundling in more cores and the chops to scream running Oracle. This SPARC chipset will be the first to fully deploy the work I led on sun4v platform independent FMA.

In OpenSolaris today, the groundwork is there for building FMA topology sourced from platform firmware structures, along with a rich set of platform agnostic CPU and memory diagnosis rules. I've also heard that the IO diagnosis rules will be moving toward platform independent constructs in the Rainbow Falls time frame. (I'm talking about the chipset specific IO rules; PCI/PCIE rules have been common across platforms for quite some time.)

Mental note for a future blog: once the reference implementation is out, I should write in more detail about how a platform team can go about deploying the platform independent enumerator and diagnosis rules. For those looking at OpenSPARC and building a system, delivering top notch CPU and Memory FMA requires zero Solaris code changes. :wq

Wednesday Apr 09, 2008

Predictive Self Healing (FMA) for T5140/T5240

April 9, 2008: Sun announced the T5140/T5240 platforms centered around the UltraSPARC T2 Plus processor. The T2 Plus extends the capabilities of the UltraSPARC T2 processor, the most obvious being the capability for multiple processors in a system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for T2 Plus.

With respect to fault management, T2 Plus is very similar to the T2. The fault management features of T5140/T5240 are listed below, along with example output for a couple of the new T2 Plus diagnosis features.

  • Base UltraSPARC T2 features: All of the FMA features present on the T2 processor are also available with the T2 Plus-based systems
  • Coherency plane diagnosis: The T2 Plus processors in the T5140/T5240 systems communicate with one another across a coherency plane, similar in nature to a Fully Buffered DIMM (FB-DIMM) channel. Error handling and diagnosis have been enhanced to detect and diagnose errors (single-lane/multi-lane/protocol errors) on the coherency plane.
  • Local vs. remote errors: With multiple processors in the system, it is possible that one T2 Plus can trigger an error in another T2 Plus (e.g. a remote read of memory/cache). The error handlers have been extended to recognize local vs. remote errors and produce the proper telemetry so diagnosis engines indict the correct T2 Plus.
  • Automatic FB-DIMM lane failover: The UltraSPARC T2 Plus memory controller seamlessly handles a single lane failover on an FB-DIMM link without a system crash. The fault management subsystem has been updated to differentiate between FB-DIMM errors resulting in lane failovers vs. those that do not. Additional information on FB-DIMM diagnosis is in one of my earlier blogs.
  • Extended IO controller diagnosis: The embedded IO controller in the T2 Plus added a few new error detectors, and the FMA software has been extended to include diagnoses for these.

Example output of some of the new features is below.

Additionally, the FMA Demo Kit has been updated for the T5140/T5240 as well. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: Automatic Recovery from a Coherency Plane Single Lane Failure
If the T2 Plus hardware detects errors on the coherency planes between the processors, the lanes can be retrained. In the face of a persisting error, a lane may be failed by the hardware. For a single lane failure, the system continues to operate and the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-MR, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Aug 29 18:24:37 EDT 2007
PLATFORM: SUNW,T5140, CSN: -, HOSTNAME: wgs48-134
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 64fe5c4b-894c-e49e-b14d-ba7cbe809a12
DESC: A CPU chip's Link Framing Unit has stopped using a bad lane. Refer to for more information.
AUTO-RESPONSE: No other automated response.
IMPACT: The system's capacity to correct transmission errors between CPU chips has been reduced.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u <EVENT_ID>.

Similar messaging is produced for multi-lane failures or protocol failures, although such failures are fatal and cause a system reset.
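
Because these console messages use the same fixed field labels, they can be picked apart mechanically. The following is a minimal sketch in Python, not a Solaris tool: the function name parse_fma_message and the FIELD_KEYS list are assumptions for illustration, with the labels taken only from the messages shown in this post. It accepts the message on one line or split across several.

```python
import re

# Field labels as they appear in the console messages in this post.
# (Assumption: other message types may carry labels not listed here.)
FIELD_KEYS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME",
    "PLATFORM", "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID",
    "DESC", "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]

# Match any known label followed by a colon; the capturing group makes
# re.split() keep the label in its result list.
_KEY_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in FIELD_KEYS) + r"):\s*"
)


def parse_fma_message(text):
    """Return a dict mapping each field label to its value (hypothetical helper)."""
    parts = _KEY_RE.split(text)
    fields = {}
    # parts = [text before first label, label1, value1, label2, value2, ...]
    for key, value in zip(parts[1::2], parts[2::2]):
        # Collapse whitespace and drop the comma that separates header fields.
        fields[key] = " ".join(value.split()).rstrip(",")
    return fields
```

The EVENT-ID entry of the returned dict is the identifier that the REC-ACTION text says to pass to fmdump -v -u.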

Example: FB-DIMM Single Lane Failover
In my blog a few months ago, I covered the addition of FB-DIMM channel diagnosis. New with T5140/T5240 is the hardware's capability to perform a single lane failover without system interruption. When this happens, the following is printed to the console:

SUNW-MSG-ID: SUN4V-8001-7R, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Fri Feb 22 16:20:37 EST 2008
PLATFORM: SUNW,T5240, CSN: -, HOSTNAME: wgs48-113
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: 24c57072-1f59-6054-9af4-a5515325ab0c
DESC: A problem was detected in the interconnect between a memory DIMM module and its memory controller. A lane failover has taken place. Refer to for more information.
AUTO-RESPONSE: No automated response.
IMPACT: System performance may be impacted.
REC-ACTION: At convenient time, try reseating the memory module(s). If problem persists, contact Sun to schedule part replacement.

Note the bit about a lane failover taking place in the DESC portion of the console message. Also, when examining the telemetry leading to the fault, we see a single FB-DIMM recoverable ereport:

# fmdump -e -u 24c57072-1f59-6054-9af4-a5515325ab0c
TIME                 CLASS
Feb 22 16:20:37.7236 ereport.cpu.ultraSPARC-T2plus.fbr

Since the hardware has already experienced a lane failover, messaging of the fault is immediate. This type of correctable error is not put through a Soft Error Rate Discriminator (SERD) engine.
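
For readers unfamiliar with SERD, the general idea is to count recurring correctable errors and declare a fault only once N of them land within a time window T. The sketch below, in Python, illustrates that idea along with the bypass described above. It is not the fmd implementation: the class and function names are hypothetical, and any N and T values used with it are made up rather than the real thresholds.

```python
from collections import deque


class SerdEngine:
    """Toy SERD engine: fires once n events fall within a window of t seconds."""

    def __init__(self, n, t):
        self.n = n              # events needed to fire (illustrative value)
        self.t = t              # window length in seconds (illustrative value)
        self.events = deque()   # timestamps of events still inside the window

    def record(self, timestamp):
        """Record one event; return True if the engine fires."""
        self.events.append(timestamp)
        # Discard events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        return len(self.events) >= self.n


def should_fault(timestamp, lane_failover, engine):
    """Decide whether an FB-DIMM recoverable error yields a fault (hypothetical)."""
    if lane_failover:
        # The hardware has already failed over a lane: fault immediately,
        # without consulting the SERD engine.
        return True
    return engine.record(timestamp)
```

With a lane failover, a single ereport is enough to produce the fault, which matches the single fbr ereport in the fmdump output above.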




