Monday Oct 13, 2008

T5440 Fault Management

Today Sun announced the T5440 platform, centered around the UltraSPARC T2 Plus processor. The T5440 is the big brother of the T5140/T5240 systems, packing 256 strands of execution into a single system. And I'm happy to report that Solaris' Predictive Self Healing feature (also known as the Fault Management Architecture, or FMA) has been extended to include coverage for the T5440.

With respect to fault management, the T5440 inherits both the T5140/T5240 FMA features and the T5120/T5220 FMA features. Among the features included are:

  • diagnosis of CPU errors at the strand, core, and chip level
  • offlining of problematic strands and cores
  • diagnosis of the memory subsystem
  • automatic FB-DIMM lane failover
  • extended IO controller diagnosis
  • identification of local vs remote
  • cryptographic unit offlining

The T5440 introduces a coherency interconnect tying the 4 UltraSPARC T2 Plus processors together. The interconnect ASICs have their own set of error detectors. The fault management system covers interconnect-detected errors as well.

Additionally, the FMA Demo Kit has been updated for the T5440. For those not familiar with the kit, it provides a harness for executing fault management demos on CPU, Memory and PCI errors. It runs on out-of-the-box Solaris - no custom hardware, error injectors, or configuration necessary. It's a great - and if using the simulation mode, safe - way to get familiar with how Solaris will respond to certain types of errors.

Example: T5440 Coherency Plane ASIC Error
If one of the coherency ASICs detects errors with coherency between the processors, the system may or may not continue to operate, depending on the nature of the error. A prime tenet of the hardware is to disallow propagation of bogus transactions - we don't want data corruption. Here's an example of a fatal error on the coherency plane:

SUNW-MSG-ID: SUN4V-8001-R3, TYPE: Fault, VER: 1, SEVERITY: Critical
EVENT-TIME: Wed Oct 24 15:28:05 EDT 2007
PLATFORM: SUNW,T5440, CSN: -, HOSTNAME: wgs48-88
SOURCE: eft, REV: 1.16
EVENT-ID: bc7a8eb5-86be-c138-eece-e65e57840b95
DESC: The ultraSPARC-T2plus Coherency Interconnect has detected a serious communication problem between it and the CPU. Refer to for more information.
AUTO-RESPONSE: No automated response.
IMPACT: The system's integrity is seriously compromised.
REC-ACTION: Schedule a repair procedure to replace the affected component, the identity of which can be determined by using fmdump -v -u <EVENT_ID>.

Tuesday Sep 16, 2008

sun4v Chip Offline Using the FMA Demo Kit

I met with a customer today - and it's been far too long since I'd done that. They were putting their T5140/T5240 systems through their paces, including a variety of fault scenarios. One of the scenarios they wanted to test is the one in which all strands on a single UltraSPARC-T2plus chip are offlined. First, I explained that the fault policies for FMA on T5140/T5240 (and those of all sun4v platforms presently) do not offline all strands on an entire chip. However, the situation could arise if there were a series of failures at the core level. A contrived situation, perhaps, since after the first or second core failure, corrective action would likely take place (we're all very good about addressing faulted components in our systems, right?).

Enter the FMA Demo Kit. The demo for T2plus processors in the kit is for a core offline. The demo kit finds the lowest numbered online processor strand and uses that strand as the simulated error detector. By running a series of FMA Demos, we can have FMA offline all strands on a chip.

WARNING: This uses the Demo Kit in "live" mode. It is not recommended to run live mode on your production systems. Faults generated will be transported to your SP and to any other monitoring software (e.g. SNMP).

Assuming a healthy, full-up T5140/T5240 with 128 strands online:

# ./run_fmdemo -d cpu -L    ### offlines VF0 core 0
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 1
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 2
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 3
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 4
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 5
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 6
# ./run_fmdemo -d cpu -L    ### offlines VF0 core 7
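A quick sanity check on the arithmetic behind those eight runs: the UltraSPARC T2 Plus has 8 hardware strands per core, so faulting all 8 cores of VF0 accounts for half of the 128-strand system:

```shell
# 8 cores offlined, 8 strands per core on UltraSPARC T2 Plus
strands_per_core=8
cores_offlined=8
echo "faulted strands: $((strands_per_core * cores_offlined))"   # 64 of 128
```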

The end result:

# psrinfo | grep faulted | wc -l
      64
# psrinfo | grep faulted
0       faulted   since 09/15/2008 12:42:03
1       faulted   since 09/15/2008 12:42:03
2       faulted   since 09/15/2008 12:42:03
...
63      faulted   since 09/15/2008 12:42:03

Of course, after running the demo kit in live mode, you've got cleanup to do with 'fmadm repair'.

And yes, I'm aware that a straight psradm command can also offline a slew of strands. A small semantic difference is that the psrinfo status would then read off-line instead of faulted.


Thursday Jan 03, 2008

FMA Demo Kit - Bug in UltraSPARC-T1/T2 CPU Support

For the past couple of months, I've been working on adding memory support to the FMA Demo Kit for UltraSPARC-T1 and UltraSPARC-T2 processors. That coding and testing is actually done; I'm just working on amending a contract in one of the PSARC cases (sadly, not an open one, otherwise I'd hyperlink to it here).

But, in testing the memory support, I found a bug in the CPU functionality. It's exposed when the physical to logical mapping of CPUs isn't 1-to-1. In cpu/predemo.sun4v, the following logic is used to find an online CPU and get its brand and serial number:

&common::get_online_cpuids;
my $brand = &common::get_cpu_brand($cpuids[0]);
if ($brand eq "UNKNOWN") {
    &common::log_message(1, 1, "The CPU fault demo is not supported ".
        "on this platform.\n");
    exit 1;
}
my $serial = &common::get_cpu_serial($cpuids[0]);
if ($serial eq "UNKNOWN") {
    &common::log_message(1, 1, "Unable to obtain serial number for ".
        "cpuid $cpuids[0].\n");
    exit 1;
}

The problem is with how get_online_cpuids() in lib/ generates the list of online CPUs. Here's the routine:

sub get_online_cpuids {
    my $cpuid = $_[0];
    my @output = `$PSRINFO | $GREP on-line`;
    foreach (@output) {
        if ($_ =~ /^(\d+)\s+\S+/) {
            push(@main::cpuids, $1);
        }
    }
}

It's using psrinfo(1M) to list online CPUs. That in and of itself is fine, but psrinfo displays logical cpuids, whereas FMA works in the physical world. So cpuid 0 is not necessarily physical CPU strand 0. For example, suppose an UltraSPARC-T2 with only CORE 1 enabled. The system has 8 strands. Logically, they are strands 0-7. Physically, they are 8-15. Compare the output of psrinfo with fmtopo -s cpu:
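To make the mismatch concrete, here's a small self-contained shell sketch. The sample lines are copied from the outputs shown below; the parsing expressions are illustrative only, not the kit's actual code:

```shell
# Sample lines from psrinfo and fmtopo -s cpu (verbatim from this post)
psrinfo_sample='0       on-line   since 11/14/2007 10:57:39
1       on-line   since 11/14/2007 11:03:17'
fmtopo_sample='cpu:///cpuid=8/serial=5d67334847
cpu:///cpuid=9/serial=5d67334847'

# Logical ids, as the kit's psrinfo parse effectively collects them
logical=$(echo "$psrinfo_sample" | awk '/on-line/ { print $1 }' | xargs)
# Physical ids, as FMA enumerates them in the cpu scheme
physical=$(echo "$fmtopo_sample" | \
    sed -n 's|^cpu:///cpuid=\([0-9]*\)/.*|\1|p' | xargs)

echo "logical:  $logical"    # logical:  0 1
echo "physical: $physical"   # physical: 8 9
```

Same strands, two different numbering schemes - and the Demo Kit is handing the diagnosis engine the wrong one.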

# psrinfo
0       on-line   since 11/14/2007 10:57:39
1       on-line   since 11/14/2007 11:03:17
2       on-line   since 11/14/2007 11:03:17
3       on-line   since 11/14/2007 11:03:17
4       on-line   since 11/14/2007 11:03:17
5       on-line   since 11/14/2007 11:03:17
6       on-line   since 11/14/2007 11:03:17
7       on-line   since 11/14/2007 11:03:17
# /usr/lib/fm/fmd/fmtopo -s cpu
TIME                 UUID
Nov 14 20:10:12      bd97c9f8-7d8d-c3b6-f336-92de8d821111

cpu:///cpuid=8/serial=5d67334847
cpu:///cpuid=9/serial=5d67334847
cpu:///cpuid=10/serial=5d67334847
cpu:///cpuid=11/serial=5d67334847
cpu:///cpuid=12/serial=5d67334847
cpu:///cpuid=13/serial=5d67334847
cpu:///cpuid=14/serial=5d67334847
cpu:///cpuid=15/serial=5d67334847

So, when the Demo Kit finds the first online processor, what it has is a logical cpuid. In this example, the Demo Kit will use a cpuid of 0 when constructing the fminject input file. When the CPU/Mem Diagnosis Engine receives the ereport, it'll drop it, since cpuid 0 is not physically present in the system. End result - a very uninteresting demo.
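The drop decision amounts to a membership test against the enumerated topology. Here's a minimal shell sketch of that idea (illustrative only - the actual diagnosis engine logic lives elsewhere and differs in detail):

```shell
# An ereport naming a cpuid that the topology does not enumerate is dropped.
present="8 9 10 11 12 13 14 15"   # physical cpuids from fmtopo -s cpu above
ereport_cpuid=0                   # logical id the Demo Kit injected
case " $present " in
  *" $ereport_cpuid "*) verdict="diagnosed" ;;
  *)                    verdict="dropped"   ;;
esac
echo "ereport for cpuid $ereport_cpuid: $verdict"   # dropped
```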

The fix is reasonably straightforward: it just needs a little jiggering after finding the first online CPU via psrinfo to translate that to a physical CPU before constructing the fminject input file. It'd be wonderful to just use fmtopo -S -s cpu and key on the "Unusable" property, but the -S option isn't available in S10 currently. That, and not all platforms enumerate in the cpu scheme.
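One way to do that translation - an assumed approach, not the committed fix - is to index into the fmtopo-ordered physical id list by the logical position psrinfo reported:

```shell
# Physical cpuids, in fmtopo enumeration order (from the example above)
set -- 8 9 10 11 12 13 14 15
logical_index=0                # first on-line strand per psrinfo
shift "$logical_index"         # step to the matching physical entry
physical_cpuid=$1
echo "logical 0 maps to physical cpuid $physical_cpuid"   # 8
```

This relies on psrinfo and fmtopo listing the online strands in the same order, which holds for the example here but would need verifying in the general case.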

I've filed CR 6647037 to track this. Anyone out in the community that wants to contribute a fix is welcome to it. It could be a few weeks before I can tackle this (that pesky day job again).




