T5440 Fault Management

10/13/2008: Today Sun announced the T5440 platforms centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture (FMA)) has been extended to include coverage for the T5440.

With respect to fault management, the T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Just some of the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs remote
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the 4 UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.


Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. An example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU. Refer to http://sun.com/msg/SUN4V-8001-R3 for more information.
AUTO-RESPONSE: No automated reponse.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.
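
For anyone who wants to pick a message like the one above apart in a script (say, to grab the EVENT-ID for the fmdump step), here is a minimal Python sketch. It is not part of FMA or the Demo Kit. The label list is an assumption taken from this single example, and parse_fault_message is a hypothetical helper name.

```python
import re

# Field labels seen in the example message above. This list is an assumption
# drawn from that one message, not a complete catalogue of fmd output fields.
LABELS = [
    "SUNW-MSG-ID", "TYPE", "VER", "SEVERITY", "EVENT-TIME", "PLATFORM",
    "CSN", "HOSTNAME", "SOURCE", "REV", "EVENT-ID", "DESC",
    "AUTO-RESPONSE", "IMPACT", "REC-ACTION",
]

# A label only counts at the start of the text or right after whitespace or a
# comma, so "VER" is not matched inside "SEVERITY".
_LABEL_RE = re.compile(
    r"(?:^|(?<=[\s,]))(" + "|".join(re.escape(name) for name in LABELS) + r"):\s*"
)


def parse_fault_message(text):
    """Split a console fault message into a {label: value} dictionary."""
    matches = list(_LABEL_RE.finditer(text))
    fields = {}
    for i, m in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[m.group(1)] = text[m.end():end].strip().rstrip(",").strip()
    return fields


# Header portion of the message shown above.
SAMPLE = (
    "SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical "
    "EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007 "
    "PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88 "
    "SOURCE: eft, REV: 1.16 "
    "EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95"
)

if __name__ == "__main__":
    fields = parse_fault_message(SAMPLE)
    print(fields["SEVERITY"], fields["EVENT-ID"])
```

The EVENT-ID value extracted this way is what the REC-ACTION text asks you to pass to fmdump -v -u. The sketch works whether the message arrives on one line or split across several.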

Comments:

One of the scenarios they wanted to test is when all strands on a single UltraSPARC-T2plus chip are offlined. First, it was explained that the fault policies for FMA on T5140/T5240 (and those of all sun4v platforms presently) do not offline all strands on an entire chip.
------------------

Adam

Posted by Adamgilly on October 13, 2008 at 08:57 PM PDT #

Correct. Present sun4v policy is that for a chip-wide failure,
no strands or cores are taken offline. However, that will
change in the future for uncorrectable errors:

6743826 sun4v CPU DE should offline chip for chip-wide UEs
http://bugs.opensolaris.org/view_bug.do?bug_id=6743826

As for testing behavior of a full chip offline, this can be
done via a series of core offlines. The FMA Demo Kit is one
method to do so. I've described that here:
http://blogs.sun.com/sdaven/entry/chip_offline_with_fma_demokit

The end result is a chip whose strands are all offlined -
Solaris will not schedule processes to such strands. The
rest of the chip (IO, memory controller) is still active.
Also note the strands are not parked - the hypervisor can
still execute on the chip. What the offline buys is using a
faulty chip as little as possible until a repair can be scheduled.

Posted by Scott Davenport on October 14, 2008 at 02:33 AM PDT #
