Predictive Self Healing on T5120/T5220

In my first posting I covered some of the new predictive self-healing features and enhancements we're bringing forth with the UltraSPARC-T2 processor. Now that the T5120/T5220 systems have been announced, I wanted to show examples of some of these new features on the T5120/T5220. I won't rehash my previous posting here, so let's dive into some examples.

Core Offlining
On my particular system, I've used tools to cause errors in core 2 of the T2's 8 cores. The resource I'm triggering errors in, the DTLB, is shared among all strands in that core. Once the diagnosis engine determines there is a fault, the fault event is issued and we see the following on the console:

SUNW-MSG-ID: SUN4V-8000-L5, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Sep 26 13:49:03 EDT 2007
PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-100
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: bc3e3685-a8c3-e728-efd3-f3382b633458
DESC: The number of DTLB errors associated with this thread has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-L5 for more information.
AUTO-RESPONSE: The fault manager will attempt to remove all threads associated with this resource from service.
IMPACT: System performance is likely to be affected.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u EVENT_ID.

As the AUTO-RESPONSE portion of the console message indicates, the system will attempt to offline all affected cpu threads. With T2, there are 8 threads per core. And sure enough, when looking at psrinfo output, we see 8 strands taken offline:

# psrinfo |grep fault
16  faulted  since 09/26/2007 13:49:06
17  faulted  since 09/26/2007 13:49:06
18  faulted  since 09/26/2007 13:49:06
19  faulted  since 09/26/2007 13:49:06
20  faulted  since 09/26/2007 13:49:06
21  faulted  since 09/26/2007 13:49:06
22  faulted  since 09/26/2007 13:49:06
23  faulted  since 09/26/2007 13:49:03
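Since psrinfo reports one line per strand, a quick count confirms the whole core was removed. A minimal sketch, run here against the sample output above (on a live system you would simply pipe psrinfo itself):

```shell
#!/bin/sh
# Count faulted strands. The sample data below mirrors the psrinfo
# output above; on a live T5120/T5220 you would use:
#   psrinfo | grep -c faulted
psrinfo_out='16  faulted  since 09/26/2007 13:49:06
17  faulted  since 09/26/2007 13:49:06
18  faulted  since 09/26/2007 13:49:06
19  faulted  since 09/26/2007 13:49:06
20  faulted  since 09/26/2007 13:49:06
21  faulted  since 09/26/2007 13:49:06
22  faulted  since 09/26/2007 13:49:06
23  faulted  since 09/26/2007 13:49:03'
faulted=$(printf '%s\n' "$psrinfo_out" | grep -c faulted)
echo "faulted strands: $faulted"
```

With 8 threads per core on the T2, a count of 8 matches exactly one offlined core.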

Single 'fmadm repair' operation
Another improvement with T5120/T5220 is a single CLI to mark a component as repaired. When PSH in Solaris issues a fault, the fault information is also made available to the Service Processor (SP). In the ALOM shell, the showfaults command displays all faults in the system, including faults diagnosed outside of PSH. For example, after the above fault is issued, the following is displayed on the SP:

sc> showfaults
Last POST Run: Mon Aug 20 19:30:59 2007
Post Status: Passed all devices
ID FRU       Fault
 0 /SYS/PS0  SP detected fault: Input power unavailable for PSU at PS0
 1 /SYS/MB   Host detected fault MSGID: SUN4V-8000-L5 UUID: bc3e3685-a8c3-e728-efd3-f3382b633458

Entry 1 is from the PSH fault I caused. Now suppose I've received my new motherboard, done the replacement, and want to clean up the resource state of the system. I only need to issue a single fmadm repair command in Solaris. The repair operation issues a list.repaired event, which the SP picks up and uses to update its state:

# fmadm repair bc3e3685-a8c3-e728-efd3-f3382b633458
fmadm: recorded repair to bc3e3685-a8c3-e728-efd3-f3382b633458

sc> showfaults
Last POST Run: Mon Aug 20 19:30:59 2007
Post Status: Passed all devices
ID FRU       Fault
 0 /SYS/PS0  SP detected fault: Input power unavailable for PSU at PS0

For those familiar with the T1000/T2000 systems, I no longer have to run the ALOM clearfault CLI.
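One side effect of the UUID appearing in both the Solaris fault message and the SP's showfaults output is that the repair step is easy to script. A sketch, using the showfaults line above as sample data (the UUID extraction is an illustration, not a supported interface):

```shell
#!/bin/sh
# Sketch: pull the UUID out of a host-detected fault line (as shown by
# the SP's showfaults command above) and build the matching fmadm
# repair command. The line is sample data copied from the output above.
line=' 1 /SYS/MB   Host detected fault MSGID: SUN4V-8000-L5 UUID: bc3e3685-a8c3-e728-efd3-f3382b633458'
uuid=$(printf '%s\n' "$line" | sed -n 's/.*UUID: *//p')
echo "fmadm repair $uuid"
```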

Offlining of cryptographic units
A cool new feature in the crypto arena is offlining faulty crypto units instead of the threads that use them. With UltraSPARC-T2, there are 8 crypto units, one per core. If there's a fault in a crypto unit, PSH will diagnose it and issue a message. For example:

SUNW-MSG-ID: SUN4V-8000-PH, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Wed Sep 26 13:51:41 EDT 2007
PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-100
SOURCE: cpumem-diagnosis, REV: 1.6
EVENT-ID: e89ce6f4-d961-e6f5-f84f-def8914f5f4e
DESC: The number of modular arithmetic unit errors associated with this unit has exceeded acceptable levels. Refer to http://sun.com/msg/SUN4V-8000-PH for more information.
AUTO-RESPONSE: Cryptographic software will not use this modular arithmetic unit.
IMPACT: System performance is likely to be affected.
REC-ACTION: Schedule a repair procedure to replace the affected resource, the identity of which can be determined using fmdump -v -u EVENT_ID.

As the AUTO-RESPONSE portion of the console message states, the crypto unit that experienced the problem is no longer used by the crypto drivers. The threads in the core with the faulty crypto unit, however, are not offlined, and the other crypto units configured in your domain remain available and online. There are kstats that give insight into which crypto units are offline or online. In this example, the modular arithmetic unit (MAU) portion of a crypto unit failed:

# kstat |grep mau |grep state
mau0state    online
mau1state    offline
mau2state    online
mau3state    online
mau4state    online
mau5state    online

Only one of the MAU units in my domain is disabled.
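If you wanted to flag offline MAUs programmatically, the kstat names above are easy to filter. A sketch against the sample output (on a live system you would pipe kstat directly, as shown above):

```shell
#!/bin/sh
# Sketch: list offline MAUs from kstat-style "mauNstate <state>" lines.
# Sample data mirrors the kstat output above; on a live system:
#   kstat | grep mau | grep state
kstat_out='mau0state    online
mau1state    offline
mau2state    online
mau3state    online
mau4state    online
mau5state    online'
offline=$(printf '%s\n' "$kstat_out" | awk '$2 == "offline" {print $1}')
echo "offline: $offline"
```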

Diagnosis of the memory subsystem
A new feature of PSH introduced with the T5120/T5220 (and retroactive to the T1000/T2000) is SERDing of correctable memory errors at the memory-page level, which means we retire pages more accurately. In previous generations of the diagnosis engines, we'd SERD at the DIMM level. For example, I've caused a few page errors in the system. The PSH statistics show the active SERD engines:

# fmstat -s -m cpumem-diagnosis
NAME                 >N    T CNT           DELTA STAT
page_172964000serd   >2   3d   1   77288254474ns pend
page_32000000serd    >2   3d   1    3374714362ns pend
page_32006000serd    >2   3d   1   51151186037ns pend
page_DF7EC000serd    >2   3d   1   31292650388ns pend

If too many errors occur within a specific memory page, that page is retired. As with previous generations, page retires are not messaged to the console, since memory CEs are expected to occur over time. There is messaging, however, if there is an abundance of page retires, which is more indicative of a systemic problem with a DIMM. One can see whether any pages have been retired via the PSH statistics:

# fmstat -m cpumem-retire -z
NAME      VALUE DESCRIPTION
page_flts     1 page faults resolved
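For monitoring purposes, that page_flts counter is simple to extract. A sketch against the sample output above (on a live system you would pipe fmstat -m cpumem-retire -z directly):

```shell
#!/bin/sh
# Sketch: read the resolved-page-fault count from fmstat-style output.
# Sample data mirrors the fmstat -m cpumem-retire -z output above.
fmstat_out='NAME      VALUE DESCRIPTION
page_flts     1 page faults resolved'
retired=$(printf '%s\n' "$fmstat_out" | awk '$1 == "page_flts" {print $2}')
echo "pages retired: $retired"
```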

Diagnosis of the IO subsystem
Now for an example fault in the IO subsystem. In this particular case, the fault is in the IO controller, which is embedded in the UltraSPARC-T2 chip itself. On the console, PSH reports:

SUNW-MSG-ID: SUN4-8000-0Y, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Sep 26 14:17:29 EDT 2007
PLATFORM: SUNW,SPARC-Enterprise-T5220, CSN: -, HOSTNAME: wgs48-100
SOURCE: eft, REV: 1.16
EVENT-ID: 72f9822e-6d5a-c36d-adcb-93e273804925
DESC: A problem was detected in the PCI-Express subsystem. Refer to http://sun.com/msg/SUN4-8000-0Y for more information.
AUTO-RESPONSE: This fault does not have an automated response agent and thus requires interaction from the user and/or Sun Services.
IMPACT: Loss of services provided by the device instances associated with this fault
REC-ACTION: Schedule a repair procedure to replace the affected device. Use fmdump -v -u EVENT_ID to identify the device or contact Sun for support.

The DESC is a little vague, but when looking at the fmdump output as suggested by the console message, there's a clearer picture of what has failed:

# fmdump -v -u 72f9822e-6d5a-c36d-adcb-93e273804925
TIME                 UUID                                 SUNW-MSG-ID
Sep 26 14:17:29.3407 72f9822e-6d5a-c36d-adcb-93e273804925 SUN4-8000-0Y
  100%  fault.io.fire.asic
        Problem in: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100/motherboard=0/chip=0/hostbridge=0/pciexrc=0
           Affects: dev:////pci@0
               FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
          Location: MB

So this is the root complex and the FRU is the motherboard. Also, note that the AUTO-RESPONSE says there is no automated response agent. That's rapidly going to change thanks to some great work being done in the io-retire agent. Check out PSARC/2007/290 for details.
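The fault class and Location line are the two fields you usually want first when triaging one of these. A sketch that pulls them out of fmdump-style output, using the sample above as data (the field extraction is an illustration, not a supported interface):

```shell
#!/bin/sh
# Sketch: pull the fault class and FRU location out of fmdump -v style
# output. Sample data mirrors the fmdump output above.
fmdump_out='100% fault.io.fire.asic
Problem in: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100/motherboard=0/chip=0/hostbridge=0/pciexrc=0
Affects: dev:////pci@0
FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0
Location: MB'
class=$(printf '%s\n' "$fmdump_out" | awk '/fault\./ {print $2}')
loc=$(printf '%s\n' "$fmdump_out" | sed -n 's/^Location: *//p')
echo "$class at $loc"
```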

Inclusion of part/serial numbers in fault events
I'll close by highlighting that the fault events include the chassis serial number, and the part number and serial number of the FRU that is faulty. Repeating the FRU line from the IO fault above:

FRU: hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0

The FRU portion of the fault event includes all the necessary information to order the replacement part. To prove to myself the information is accurate, I'll check the values using the SP commands:

sc> showplatform
SUNW,SPARC-Enterprise-T5220
Chassis Serial Number: 0704BB5053
...
sc> showfru /SYS/MB
...
SEGMENT: SD
/ManR
/ManR/UNIX_Timestamp32: Sat Jan 20 18:39:47 GMT 2007
/ManR/Fru_Description: ASY,HURON,MB,TRAY,6-CORE,SKT
/ManR/Manufacture_Loc: San Jose, California USA
/ManR/Sun_Part_No: 5412151
/ManR/Sun_Serial_No: 101083
/ManR/Vendor_Name: Flextronics Semiconductor
/ManR/Initial_HW_Dash_Level: 01
/ManR/Initial_HW_Rev_Level: 11
/ManR/Fru_Shortname: MB
/SpecPartNo: 885-0897-03
/HazardClassCode: N
/Partner_Part_NumberR
/Partner_Part_NumberR/Vendor_Name: Fujitsu
/Partner_Part_NumberR/Partner_Part_Number: 9999999

The part= in the fmdump output is a concatenation of the part number (5412151) and the dash level (01). The intention is to save you time and avoid having to hunt down this information: when you call Sun Services to order a replacement part, the info you'll need is in front of you.
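Those fields are easy to pull straight out of the FRU string. A sketch, using the FRU line above as sample data (the serial=/part= parsing is an illustration of the string layout, not a supported interface):

```shell
#!/bin/sh
# Sketch: parse serial= and part= out of the FRU portion of a fault
# event. The FRU string is sample data copied from the fault above.
fru='hc://:product-id=SUNW,SPARC-Enterprise-T5220:chassis-id=0704BB5053:server-id=wgs48-100:serial=101083:part=541215101/motherboard=0'
serial=$(printf '%s\n' "$fru" | sed -n 's/.*:serial=\([^:/]*\).*/\1/p')
part=$(printf '%s\n' "$fru" | sed -n 's/.*:part=\([^:/]*\).*/\1/p')
echo "part $part serial $serial"
```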

That's the nickel tour. Thanks for playing.

Find other T5120/T5220 blog entries here

:wq
