Thursday Oct 22, 2009

Solaris 10 Update 8 FMA Fixes

Solaris 10 Update 8 is now posted and available for download, and there's been plenty of bug fix work for FMA. Beyond lots of memory leak and core dump fixes, here are my favorite fixes.

6743295 fault.memory.dimm is overloaded
6758561 KA pages for fault.memory.dimm\* are needlessly different

These two fixes provide some cleanup for DIMM faults. CR 6743295 explains the more tangible benefit, IMHO. In addition to learning that a DIMM has been declared faulty, you now get better separation, via the fault and the knowledge article, of why the DIMM was deemed bad.

6394503 fmdump should show contents of rotated logs without specifying them explicitly
6535637 Add Severity level to payload of list.suspects event

I lump these together as administrative improvements. The first expands fmdump to show information from historical logs and not just the current log. FMA error and fault logs are rotated periodically, and the new behavior is to display data from all logs. The second adds a severity level to the payload of the list.suspects event.
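To make the "all logs" behavior concrete, here is a rough Python sketch of the ordering problem, not fmdump's actual code. It assumes rotated copies sit next to the live log and carry a numeric suffix (errlog.0, errlog.1, ...) with higher numbers being older; that naming, and the helper name logs_oldest_first, are assumptions for illustration only.

```python
import re
from pathlib import Path


def logs_oldest_first(log_dir, base):
    """Return rotated copies of `base` (highest suffix first), then the live log.

    Assumes rotated files are named `<base>.<N>` and that a larger N means
    an older file. Both are illustrative assumptions, not fmdump internals.
    """
    pattern = re.compile(re.escape(base) + r"\.(\d+)")
    rotated = []
    for path in Path(log_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            rotated.append((int(match.group(1)), path))

    # Sort numerically so that errlog.10 is treated as older than errlog.2.
    rotated.sort(key=lambda item: item[0], reverse=True)
    ordered = [path for _, path in rotated]

    live = Path(log_dir) / base
    if live.exists():
        ordered.append(live)
    return ordered
```

Reading the files in the returned order gives one continuous history, which is the effect the fix delivers without you having to name each rotated log by hand.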

6618751 Include memboard in T5440 FBR/FBU diagnosis

Beyond fixing a nasty core dump in the diagnosis flow, this fix also improves the diagnosis. This was a gap from when T5440 first shipped. On configurations where DIMMs on memory boards are in the mix, the memory board itself is part of the FB-DIMM channel. That component is now included in the diagnosis of channel errors.

6747341 Add FMA to hermon driver
6656720 Initial hxge driver

That adds two more FMA-hardened drivers to Solaris 10!


The impetus for this change is the higher core counts coming in the x86 world. With newer chips from both AMD and Intel, the previously defined maximums would be insufficient to fault manage all the cores and strands. Not anymore :)

6818561 FMA topology fails on Sun Blade T6300

This was just a flat-out embarrassment. Topology was completely missing from the T6300 blades, rendering FMA largely ineffective (errors were still logged, but diagnosis was crippled). Glad this one is fixed.


Thursday Aug 27, 2009

Rainbow Falls and FMA

Many of you no doubt heard about the Rainbow Falls processor that Sun provided details on at the Hot Chips conference. I'm excited about this new SPARC processor, and not just because it's a nifty package bundling in more cores and the chops to scream running Oracle. This SPARC chipset will be the first to fully deploy the work I lead on sun4v platform-independent FMA.

In OpenSolaris today, the groundwork is there for building FMA topology sourced from platform firmware structures, along with a rich set of platform-agnostic CPU and memory diagnosis rules. I've also heard that the IO diagnosis rules will be moving toward platform-independent constructs in the Rainbow Falls time frame as well. (I'm talking about the chipset-specific IO rules; PCI/PCIE rules have been common across platforms for quite some time.)

Mental note for a future blog... once the reference implementation is out, I should write in more detail about how a platform team can go about deploying the platform-independent enumerator and diagnosis rules. For those looking at OpenSPARC and building a system, delivering top-notch CPU and memory FMA requires zero Solaris code changes. :wq

Friday May 01, 2009

Solaris 10 Update 7 Available

Solaris 10 Update 7 is now posted and available for download, and there have been 65+ bug fixes and enhancements for FMA. Here are a few of my favorites (can one have favorite bugs? :) fixed in S10U7:

6540058 libldom enhancements for sun4v root domains
6540055 ETM enhancements for sun4v root domains
6540080 topology enhancements for sun4v root domains

For quite a long time, when running SPARC Logical Domains (LDOMs), FMA had real gaps when the IO subsystem was divided across logical domains. Namely, when an IO root complex is granted to a non-control domain (a so-called "root" domain), FMA for IO was disjoint and could break. Some of the IO diagnosis rules need to pair up root complex ereports (created in the SP) with PCIE fabric ereports (created in Solaris). With these fixes, the event transport plumbing is in place so a given instance of Solaris gets all the ereports it needs to produce an accurate diagnosis.

Update 05/04/2009: Eric Sharakan posted a blog detailing some of the LDOM side requirements needed to ensure FMA is fully featured - namely the LDOMs 1.2 release planned for this summer.

6706543 FMA for Intel Nehalem
There are actually several other bug fixes that go along with the Nehalem support. Please refer to my prior blog entry on Nehalem FMA.

6722048 diagnosis of and KA for SUNOS-8000-1L should be split

For those of you who have gotten a SUNOS-8000-1L message, you've been annoyed. It means there's a bug in the FMA stack somewhere, and a diagnosis engine received an ereport it couldn't understand. Before you get excited, this fix doesn't fix all the bugs. But it does help us developers better identify where a bug might be. Several new message IDs are introduced, which better classify why a particular ereport was deemed bad. They are SUNOS-8000-E8, SUNOS-8000-G7, SUNOS-8000-HV, and the unfortunately named SUNOS-8000-FU.

6639248 RFE: Eversholt should allow dynamic SERD engine names
6639255 RFE: Eversholt should allow bumping SERD by an arbitrary value

If you're not developing diagnosis code in Eversholt (which is most everyone on the planet), then you won't care about this. But these changes allow us to do some more interesting things to make diagnosis engines more flexible. I asked for these changes as part of the SPARC/sun4v Platform Independent FMA work. The language extensions allow diagnosis rules to be tailored by ereport payload members. And in the sun4v world, where telemetry is generated outside of Solaris on the Service Processor, we've designed diagnosis rules that can be "guided" by platform-controlled telemetry.
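For readers who don't live in Eversholt, here is a small Python model of what the two RFEs make possible. It is a conceptual sketch only, not Eversholt syntax and not the fmd implementation. A SERD engine fires when N events land within a time window T. The sketch picks the engine name from an ereport payload member and bumps the count by a payload-supplied value. The payload keys ("resource", "count"), the name prefix, and the N and T values are all made up for illustration.

```python
from collections import deque


class SerdEngine:
    """Fires once when at least n events land within a window of t seconds."""

    def __init__(self, n, t):
        self.n = n
        self.t = t
        self.events = deque()  # one timestamp per counted event
        self.fired = False

    def record(self, timestamp, count=1):
        """Count `count` events at `timestamp`; return True if this call trips the engine."""
        if self.fired:
            return False
        # "Bump by an arbitrary value": one ereport may stand for many events.
        self.events.extend([timestamp] * count)
        # Drop events that have aged out of the window.
        while self.events and timestamp - self.events[0] > self.t:
            self.events.popleft()
        if len(self.events) >= self.n:
            self.fired = True
            return True
        return False


engines = {}


def handle_ereport(ereport, now, n=3, t=60):
    """Route an ereport to an engine whose name is built from its payload."""
    # "Dynamic engine name": one engine per resource named in the payload.
    # The "resource" and "count" keys are hypothetical payload members.
    name = "serd.example." + ereport["resource"]
    if name not in engines:
        engines[name] = SerdEngine(n, t)
    return name, engines[name].record(now, ereport.get("count", 1))
```

With a threshold of three events in 60 seconds, a single ereport carrying count=2 that arrives after one earlier event for the same resource trips that resource's engine, while other resources keep their own independent counts.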




