UltraSPARC-T2 Fault Management

As Sun announced the UltraSPARC-T2 processor today, I thought I'd take a quick dip into the blogging pool and describe, at a high level, some of the fault management and predictive self-healing capabilities we have forthcoming in support of the chip. If you haven't read about the UltraSPARC-T2, the chip overview is:
  • 8 cores, each core with 8 CPU strands
  • integrated IO root complex
  • integrated 10GbE networking
  • dedicated cryptographic and floating point units per core
For the last 18 months or so, I've led the team that designed and developed the FMA implementation for the UltraSPARC-T2, and I'm happy to say that we have fault management support for the T2 in all of the areas above. In many areas, the feature set exceeds what was available on UltraSPARC-T1. Some of the fault management highlights for platforms centered around the UltraSPARC-T2:
  • Diagnosis of CPU errors at the strand, core, and chip level: This is an improvement over UltraSPARC-T1, which did everything at the strand level. Now, with the next release of Solaris 10, a fault in a resource that is shared across all strands within a core will offline all impacted strands, not just the detecting one. And, yes, this will apply retroactively to UltraSPARC-T1 with the next release of Solaris 10 (or today, if you're running OpenSolaris).
  • Diagnosis of the memory subsystem: diagnosis to the memory page level, and page retire operations, are available on UltraSPARC-T2. Additionally, memory SERDing is done at the page level (vs. the DIMM level).
  • Diagnosis of the IO subsystem, including PCI Express Fabric: In addition to diagnosis of the root complex itself, UltraSPARC-T2 takes advantage of additional support put into Solaris for PCIE Fabric diagnosis.
  • Offlining of cryptographic units: If a fault is diagnosed to one of the crypto units in a core, the crypto drivers will stop using that crypto unit. Other crypto units in your set of domain resources are still available. And all CPU strands remain active.
  • Diagnosis of the on-chip network unit: the new 10 GbE network unit can report errors, and we'll diagnose them.
  • POST/FMA interaction: In T1000/T2000 systems, POST initially would fail a DIMM based on a single correctable error. By virtue of the configuration requirements of the memory subsystem, this had the net effect of taking away half of system memory. Starting with UltraSPARC-T2 platforms, when POST encounters a correctable memory error, the error is queued up for FMA diagnosis. When the domain comes online, if a page retire is necessary, it is performed. At worst, you lose an 8K page instead of 50% of memory.
  • Inclusion of part/serial numbers in fault events: for those that service systems, fault events on UltraSPARC-T2 platforms now include the FRU part and serial number of the faulted component(s).
  • Single 'fmadm repair' operation: Faults diagnosed by FMA in the Solaris domain can be repaired with a single 'fmadm repair' command on the OS side. The SP state of the component(s) is kept in sync. For those familiar with the two-step process on T1000/T2000 systems, this goes away.
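To make the strand/core/chip distinction from the first highlight a bit more concrete, here's a rough Python sketch of which strands get offlined for a given fault scope. This is an illustration only, not the actual FMA code: the `Scope` enum, the `strands_to_offline` helper, and the assumption that strands are numbered 0 through 63 with eight consecutive strands per core are all hypothetical.

```python
from enum import Enum
from typing import List

CORES_PER_CHIP = 8     # from the chip overview
STRANDS_PER_CORE = 8   # from the chip overview


class Scope(Enum):
    """Level at which a CPU fault was diagnosed (hypothetical names)."""
    STRAND = "strand"
    CORE = "core"
    CHIP = "chip"


def strands_to_offline(detecting_strand: int, scope: Scope) -> List[int]:
    """Return the strand IDs impacted by a fault of the given scope.

    Assumes strands are numbered 0..63 and that strand // 8 is the core
    number. That numbering is a simplification for illustration.
    """
    total = CORES_PER_CHIP * STRANDS_PER_CORE
    if not 0 <= detecting_strand < total:
        raise ValueError("strand ID out of range")
    if scope is Scope.STRAND:
        # Fault is private to one strand: offline only the detector.
        return [detecting_strand]
    if scope is Scope.CORE:
        # Fault is in a resource shared by the core: offline all siblings.
        first = (detecting_strand // STRANDS_PER_CORE) * STRANDS_PER_CORE
        return list(range(first, first + STRANDS_PER_CORE))
    # Chip-level fault: every strand on the chip is impacted.
    return list(range(total))
```

With this numbering, a core-level fault detected on strand 19 takes out strands 16 through 23, whereas the older strand-only behavior would have offlined strand 19 alone.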
And as you'd expect from any FMA implementation, the same CLIs apply: fmdump(1M), fmstat(1M), fmadm(1M). The error/fault persistence, common messaging, and knowledge articles that are provided by the FMA architecture in Solaris are there as well. And as on prior platforms, UltraSPARC-T2 systems offer support for environmental monitoring, LED illumination, and FRUID. For those familiar with the Sensor Abstraction Layer... sorry, UltraSPARC-T2 systems aren't there yet. But we're working on it.
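For readers who haven't run into SERD before, the page-level SERDing mentioned in the memory highlight boils down to counting correctable errors per page and acting once N of them land within a time window T. Here's a minimal Python sketch of that idea. The `PageSerd` class and the N and T values used with it are made up for illustration; the real thresholds and the fmd SERD engine's exact semantics may differ.

```python
from collections import defaultdict, deque

PAGE_SIZE = 8 * 1024  # 8K pages, as mentioned in the POST/FMA highlight


class PageSerd:
    """Toy per-page SERD engine: fire when n errors occur within t seconds.

    Hypothetical sketch; n and t are illustrative, not real FMA thresholds.
    """

    def __init__(self, n, t_seconds):
        self.n = n
        self.t = t_seconds
        self.events = defaultdict(deque)  # page number -> recent timestamps
        self.retired = set()              # pages already retired

    def record(self, phys_addr, timestamp):
        """Record a correctable error; return True if the page fires now."""
        page = phys_addr // PAGE_SIZE
        if page in self.retired:
            return False
        window = self.events[page]
        window.append(timestamp)
        # Drop events that have aged out of the time window.
        while window and timestamp - window[0] > self.t:
            window.popleft()
        if len(window) >= self.n:
            # Threshold reached: retire just this page, not the whole DIMM.
            self.retired.add(page)
            del self.events[page]
            return True
        return False
```

Because the engine is keyed by page rather than by DIMM, errors scattered across different pages don't add up toward a single threshold, and the response to a firing engine is confined to one 8K page.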

:wq
