Accurately diagnosing faults on a large and complex system like a Sun SPARC Enterprise M-class server is critical to maintaining server availability. But fault diagnosis is complicated by the fact that Solaris sees a limited view of the universe, while the service processor (SP) sees a different, limited view. In the M-class server line, things are further complicated because two Solaris instances may be sharing the same hardware, and as a result, seeing different errors due to the same hardware fault.
Solaris basically sees three types of hardware: CPUs, memory and I/O devices. A real machine, though, is composed of many more components: interconnect ASICs, cables and connectors, power supplies and voltage regulators, and fans, to name just a few. The SP sees these other components. A fault may manifest itself as different errors depending on one's point of view. For example, a fault in an ASIC that connects a CPU to memory might be seen by the SP as protocol errors on the data interconnect, while the effect that Solaris sees is an uncorrectable memory error. Accurate diagnosis and recovery require the SP and Solaris to coordinate.
In the M-class family of servers, the SP/Solaris coordination basically comes in three areas: Memory, I/O and CPU.
For the most part, Solaris handles correctable errors in memory. Single-bit, correctable errors in DIMMs are a natural result of the physics of dynamic RAMs, and do not, in general, reflect faulty hardware. A stray cosmic ray may hit the device and flip a bit. Thanks to ECC (error correction code) bits, the single-bit error is corrected, and the data is rewritten to memory.
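To make the ECC idea concrete, here is a toy sketch using a Hamming(7,4) code, which can locate and flip back any single bad bit in a 7-bit codeword. This is an illustration only: the codes used by real memory controllers protect much wider words, and nothing here reflects the actual M-class ECC implementation. The function names are my own.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits [d1, d2, d3, d4] into a 7-bit Hamming codeword."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # parity over codeword positions 4, 5, 6, 7
    # Codeword layout, positions 1..7: p1 p2 d1 p3 d2 d3 d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(codeword):
    """Return (corrected codeword, 1-based position of the flipped bit, or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome, read as a binary number, is the position of the bad bit.
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the bad bit back
    return c, syndrome
```

Flipping any one bit of an encoded word and passing it through `hamming74_correct` returns the original codeword along with the position that was repaired, which is the "corrected and rewritten" step described above in miniature.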
There are, however, several memory errors that may imply faulty hardware:
- Multi-bit uncorrectable errors (UEs).
- Permanent correctable errors (also called PCEs, or "Sticky" CEs). These are correctable errors that are not cleared by rewriting to memory. They typically represent a stuck bit.
- Excessive correctable errors in a short period of time, which could indicate a faulty DIMM.
Solaris detects all of these errors, and handles each slightly differently.
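The third case, excessive correctable errors in a short period, amounts to a rate check. A minimal sketch follows, assuming a sliding time window. The class name, the threshold and the window length are invented for illustration and are not the values Solaris FMA actually uses.

```python
from collections import deque


class CeRateMonitor:
    """Flag a DIMM when too many CEs arrive within a sliding time window."""

    def __init__(self, max_errors=3, window_seconds=60.0):
        # Both limits are hypothetical tuning values.
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, timestamp):
        """Record one CE; return True if the rate is now excessive."""
        self._timestamps.append(timestamp)
        # Drop errors that have aged out of the window.
        while timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) > self.max_errors
```

With `max_errors=2` and a 10-second window, errors at t=0, 5 and 8 trip the monitor on the third call, while errors at t=0, 20 and 40 never do, since each earlier one has aged out by the time the next arrives.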
When a UE is detected, the memory controller notifies Solaris and the SCF at the same time. The Solaris Fault Management Architecture (FMA) will retire the page containing the UE, to avoid hitting the same error in the future. On the SCF, the DIMM is immediately considered faulty, and the next time Solaris is rebooted, POST will map-out a 64k chunk of memory around the UE to ensure that OBP and Solaris do not use that memory again. This allows a system to continue running as long as possible, using memory that's safe to use, until a service action can be scheduled to replace the DIMM.
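As a rough picture of what "a 64k chunk around the UE" could mean, the sketch below computes the address range that would be mapped out. It assumes the chunk is a naturally aligned 64 KB block containing the failing address. That alignment is my assumption; POST may choose the chunk boundaries differently. The function name is hypothetical.

```python
CHUNK_SIZE = 64 * 1024  # 64 KB, per the description above


def mapout_range(ue_address, chunk_size=CHUNK_SIZE):
    """Return (first, last) byte address of the aligned chunk holding the UE.

    Assumes chunk_size is a power of two and chunks are naturally aligned.
    """
    base = ue_address & ~(chunk_size - 1)  # round down to the chunk boundary
    return base, base + chunk_size - 1
```

For example, a UE at physical address `0x12345` would fall in the chunk `0x10000` through `0x1FFFF` under this assumption.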
When a CE is detected, only Solaris gets notified. At first, Solaris FMA may decide that the DIMM is slightly degraded and will retire a page of memory (issuing a fault.memory.page event), but after enough PCEs have been collected for a single DIMM, Solaris FMA will declare the DIMM faulted and issue a fault.memory.dimm event.
The fault event gets sent to the SCF over the internal network. On the SCF, the fault.memory.dimm is processed by the SCF FMA, and produces a fault.chassis.SPARC-Enterprise.memory.block.pce event. At this point, the SCF will consider the entire DIMM faulty, and the next time Solaris is rebooted, the SCF will isolate this DIMM.
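The escalation from page retirement to a DIMM fault can be sketched as a per-DIMM counter. The event names come from the description above. The threshold, the class name and the DIMM labels are made up for illustration, and the real diagnosis engine is certainly more involved than this.

```python
from collections import Counter

DIMM_PCE_THRESHOLD = 3  # hypothetical; the real limit is not stated here


class DimmDiagnosis:
    """Toy model: retire pages first, fault the DIMM after enough PCEs."""

    def __init__(self, threshold=DIMM_PCE_THRESHOLD):
        self.threshold = threshold
        self.pce_counts = Counter()
        self.retired_pages = set()
        self.faulted_dimms = set()

    def on_pce(self, dimm, page):
        """Handle one PCE; return the fault events it would trigger."""
        if dimm in self.faulted_dimms:
            return []  # already diagnosed, nothing new to report
        self.pce_counts[dimm] += 1
        self.retired_pages.add((dimm, page))
        events = ["fault.memory.page"]
        if self.pce_counts[dimm] >= self.threshold:
            # This is the event that gets forwarded to the SCF.
            self.faulted_dimms.add(dimm)
            events.append("fault.memory.dimm")
        return events
```

With the default threshold, the first two PCEs on a DIMM each produce only `fault.memory.page`, and the third adds `fault.memory.dimm`.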
Errors in the PCI-Express fabric are detected by PCI-Express devices, and handled by Solaris device drivers. Generally, the errors are reported to the Solaris FMA, which may decide an I/O device is faulty and issue one of the fault.io.\* events. If possible, Solaris will retire the I/O device, or prevent it from being used in the future. The fault.io.\* event is also forwarded to the SCF.
For memory, the SCF was able to remember that a DIMM was faulty, and POST could map-out a chunk of memory to prevent the faulty pages from being used. In the case of I/O, the SCF really can't do either.
First, POST can't map-out a PCI device. POST presents to OBP a list of PCI-Express root complexes. Once OBP takes over, it can probe the PCI-Express fabric to discover PCI devices. POST could map-out an entire root complex, but that would be a "big hammer" solution to the problem. Instead, the SCF allows Solaris to handle mapping-out the faulty devices, since Solaris can map-out devices at a much finer granularity.
Second, while the SCF could remember that an I/O device is faulty, as I've just written, it doesn't do much good since it doesn't map-out PCI devices. Furthermore, unlike memory DIMMs, PCI cards do not have FRU ID PROMs and serial numbers. So if you powered off your machine and replaced a faulty PCI-Express card, the SCF would not know that the faulty device was replaced. This would require another manual step when you replace a PCI card, i.e., logging into the SCF and telling the SCF the card has been replaced.
So instead, the SCF just logs that Solaris detected the I/O fault, and it relies on Solaris to handle mapping-out the device in the future. The fault event is visible using fmdump on the SCF.
CPU errors are reported to both the SCF and Solaris. In some cases, it won't do Solaris much good; if a CPU is faulty, it may result in a panic, or even a complete hardware reset. In many other cases, Solaris is able to identify a CPU exhibiting excessive correctable errors, and offline the CPU.
The SCF, however, sees a superset of errors from the CPU, the support ASICs, memory controllers, and I/O controllers. Errors could be reported by ASICs that don't belong to a single Solaris instance, for example, a crossbar ASIC which is routing data for all Solaris domains in the chassis.
In some cases, the SCF may decide that a CPU chip is generating excessive correctable errors on the interconnect between chips, errors that are completely transparent to Solaris. In this case, the SCF will diagnose the CPU chip as faulty, and the next time Solaris boots, the SCF will map-out that CPU chip.
If nothing else is done, eventually the faulty CPU chip will probably emit an uncorrectable error, resulting in a complete domain stop and reset. To minimize that likelihood, the SCF issues a fault.chassis.SPARC-Enterprise.cpu.SPARC64-VI.core.ce-offlinereq fault event. This event gets forwarded to Solaris FMA over the internal network. In Solaris, this fault event is treated like any other CPU fault event that Solaris might have diagnosed itself -- it gets logged to /var/adm/messages and the console, results in an snmp trap, and the CPU is offlined.
One peculiarity about the ce-offlinereq event is that the Solaris FMA stack receives the event from the SCF, and does not generate the event itself. As a result, the ce-offlinereq does not show up using fmdump in Solaris.
But wait, there's one more case... The switch ASIC can detect excessive correctable errors in the L2 cache tags for a specific CPU cache way. The SCF can handle this on its own; it deconfigures the L2 cache way and the CPU does not need to be offlined. Solaris is unaffected, except for a slight performance degradation. In this case, however, the Solaris administrator should know that a CPU is being degraded. So the SCF emits the event fault.chassis.SPARC-Enterprise.asic.sc.ce-l2tagcpu, which is forwarded to the affected Solaris domain. Solaris logs the fault in /var/adm/messages, on the console, and through an snmp trap, but otherwise, does nothing to offline any CPU.
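The fault events that cross between Solaris and the SCF can be summarized as a small lookup table. The sketch below is only a summary of the cases described so far: the table layout and the `route` function are my own invention, and the event name `fault.io.example` used in the note afterwards is a placeholder, not a real event class.

```python
from fnmatch import fnmatchcase

# event pattern -> (diagnosed by, forwarded to, what the receiver does)
FORWARDED_EVENTS = {
    "fault.memory.dimm":
        ("Solaris", "SCF", "isolate the DIMM at next reboot"),
    "fault.io.*":
        ("Solaris", "SCF", "log only"),
    "fault.chassis.SPARC-Enterprise.cpu.SPARC64-VI.core.ce-offlinereq":
        ("SCF", "Solaris", "log and offline the CPU"),
    "fault.chassis.SPARC-Enterprise.asic.sc.ce-l2tagcpu":
        ("SCF", "Solaris", "log only"),
}


def route(event_class):
    """Return (source, destination, action) for an event, or None."""
    for pattern, handling in FORWARDED_EVENTS.items():
        if fnmatchcase(event_class, pattern):
            return handling
    return None  # not one of the forwarded events covered here
```

Calling `route("fault.io.example")` matches the `fault.io.*` row and returns the Solaris-to-SCF, log-only handling, while an event class that is not in the table returns `None`.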
All The Rest
All of the rest of the faults fall into two broad categories: Solaris-only and SCF-only. Solaris can detect things like SCSI errors, zfs errors, and software errors. These are handled in Solaris, and do not involve the SCF at all. The SCF, on the other hand, can detect a wide range of errors -- power and voltages, over temperature and fan speeds, crossbar and switch ASICs, and many more. In these cases, the SCF diagnoses the fault and performs appropriate fault recovery on its own, and there's nothing that Solaris needs to do.
The Sun SPARC Enterprise M-class servers employ the same Fault Management Architecture on the SCF that has existed in Solaris since S10 first shipped. Having both the SCF and Solaris running the same FMA stack enables the two entities to communicate using a common event protocol, and coordinate their activities in handling errors, diagnosing faults, and recovering from those faults.