Wednesday Sep 08, 2010

Oracle Solaris 10 Update 9: FMA Fixes

It's been a long time since my last blog on FMA - I've since moved into another technology area. Maybe someday I'll garner enough knowledge there to start blogging. In the meantime, with the release of Oracle Solaris 10 Update 9 today, I thought I'd reprise my "favorite FMA fixes" blog entry for this release. So here's my (thankfully short :) list of favorites:

6627041 Add a PSN nv-pair to the authority portion of the FMRI scheme

This may be less exciting to end users, but it's my top favorite. This fix adds the product serial number (PSN) to the FMA fault telemetry. It's a key piece of information for improving the Oracle Auto Service Request (ASR) program. With the PSN, the "Auto" part of ASR is sped up. Hmm...maybe this is exciting for customers - you'll get a faster turnaround time on parts servicing.
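To make "an nv-pair in the authority portion" a bit more concrete, here is a minimal Python sketch that splits the authority part of an FMRI-like string into name-value pairs. The string layout, the key names (including product-sn for the PSN), and all values are assumptions made up for illustration; check real fmdump output for the actual form.

```python
def parse_fmri_authority(fmri):
    """Split the authority portion of an FMRI-like string into name-value pairs.

    Assumes the shape scheme://:name=value:name=value/path (an illustrative
    layout, not taken from the CR text).
    """
    scheme, sep, rest = fmri.partition("://")
    if not sep:
        raise ValueError("expected scheme://authority/path, got %r" % fmri)
    authority, _, _path = rest.partition("/")
    pairs = {}
    for field in authority.split(":"):
        if not field:
            continue  # the authority starts with ':', so the first field is empty
        name, eq, value = field.partition("=")
        if not eq:
            raise ValueError("authority field without '=': %r" % field)
        pairs[name] = value
    return pairs


# Made-up example string; every identifier and value here is hypothetical.
EXAMPLE_FMRI = (
    "hc://:product-id=EXAMPLE-SERVER:product-sn=SN0000001"
    ":chassis-id=CH0000001/motherboard=0/chip=0"
)

if __name__ == "__main__":
    authority = parse_fmri_authority(EXAMPLE_FMRI)
    print(authority.get("product-sn"))
```

With the serial number carried in the fault event itself, a service tool would not need a separate lookup to tie the fault to a specific machine.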

6502086 DBU errors should be diagnosed as HV defect/fault
6502089 ferg.invalid errors should be diagnosed as a fault

These two CRs provide a level of diagnosis for firmware issues on the sun4v systems. The first covers an error that should not happen unless there's a bug in the hypervisor. The second warrants a little explanation. On CMT systems, FMA telemetry is sourced in the SP - the hardware error is collected in the SP, packaged into an ereport, and provided to Solaris. If the error bits collected can't be mapped to a defined ereport, a "ferg.invalid" ereport is produced, basically saying "There's been an error, but I can't determine what kind". This shouldn't happen outside the lab. If it does, it indicates a logic flaw in the firmware on the SP.
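The fallback behavior described above can be sketched in a few lines of Python. This is only an illustration of the idea: the bit values, the table, and the class names (including the full name given to the fallback class) are hypothetical and do not come from the SP firmware.

```python
# Hypothetical table mapping collected error bits to ereport classes.
KNOWN_ERROR_BITS = {
    0x1: "ereport.example.mem-ce",
    0x2: "ereport.example.mem-ue",
    0x4: "ereport.example.cpu-l2",
}

# Hypothetical full class name for the catch-all "ferg.invalid" ereport.
FALLBACK_CLASS = "ereport.example.ferg.invalid"


def classify(error_bits):
    """Return the ereport class for the collected bits, or the fallback class."""
    # Any pattern missing from the table means "an error happened, kind unknown".
    return KNOWN_ERROR_BITS.get(error_bits, FALLBACK_CLASS)


if __name__ == "__main__":
    for bits in (0x2, 0x80):
        print(hex(bits), classify(bits))
```

The point of the fix is what happens after the fallback: instead of leaving such an ereport undiagnosed, it is now treated as evidence of a firmware fault.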

6764337 Needs level 2 FMA compliance for chipset 5100 MCH
6889350 fma fails if DIMM sizes are mixed

A couple of fixes for various Intel configurations. The first enhances FMA support to include another memory controller chipset; Oracle's CP3250 blades use the 5100 chipset. The second fixes a bug, and the bug description is self-explanatory.

6860401 FMA CPU Topology & Memory Topology needs to support Magny Cours(Multi chip Module)
6812502 Enable Generic-AMD FMA memory topology for Istanbul

These fixes enable CPU and memory diagnosis for the more recent AMD processor offerings. Beyond the typical changes of new model numbers, these processors have a more interesting topology with multiple processing nodes within a single package. Solaris 10 Update 9 understands these chips and creates the correct topology to support FMA diagnosis.


Thursday Dec 10, 2009

Solaris 10 System Administration Essentials

I'm a little behind on this one...but after many, many months, the Solaris 10 System Administration Essentials book is available for purchase. It covers all aspects of Solaris 10, from features you'd expect like ZFS, DTrace, FMA, and Zones to packaging & patching, user & network administration, and filesystems.

I had the distinct honor of authoring the Fault Management chapter. A little over a year ago, followers of this blog got a taste of that chapter in the Managing Fault Management Log Files entry.



12/11/2009 Update: The text is live on Safari Books now.

Thursday Oct 22, 2009

Solaris 10 Update 8 FMA Fixes

Solaris 10 Update 8 is now posted and available for download. And there's been plenty of bug fix work for FMA. Beyond lots of memory leak and core dump fixes, here are my favorite fixes.

6743295 fault.memory.dimm is overloaded
6758561 KA pages for fault.memory.dimm\* are needlessly different

These two fixes provide some cleanup for DIMM faults. CR 6743295 explains the more tangible benefit, IMHO. In addition to learning that a DIMM has been declared faulty, you now get better separation, via the fault class and its knowledge article, of why the DIMM is deemed bad.

6394503 fmdump should show contents of rotated logs without specifying them explicitly
6535637 Add Severity level to payload of list.suspects event

I lump these together as administrative improvements. The first expands fmdump to show information from historical logs and not just the current log. FMA error and fault logs are rotated periodically, and the new behavior is to display data from all logs.
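As a rough picture of "display data from all logs", the Python sketch below gathers a current log plus its rotated copies and orders them oldest first. It is not how fmdump is implemented. The numeric-suffix naming (errlog.0, errlog.1, with a higher number meaning an older file) is an assumption about the rotation scheme, and logs_oldest_first is a hypothetical helper.

```python
import os
import re
import tempfile


def logs_oldest_first(directory, base):
    """Return paths of the rotated copies of `base` and then `base`, oldest first."""
    pattern = re.compile(r"^%s\.(\d+)$" % re.escape(base))
    rotated = []
    for name in os.listdir(directory):
        match = pattern.match(name)
        if match:
            rotated.append((int(match.group(1)), name))
    # Assumption: a higher suffix means an older file, so sort descending.
    ordered = [name for _, name in sorted(rotated, reverse=True)]
    if os.path.exists(os.path.join(directory, base)):
        ordered.append(base)  # the live log holds the newest events
    return [os.path.join(directory, name) for name in ordered]


if __name__ == "__main__":
    # Build a throwaway directory with made-up log names for the demo.
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("errlog", "errlog.0", "errlog.1", "fltlog"):
            open(os.path.join(tmp, name), "w").close()
        for path in logs_oldest_first(tmp, "errlog"):
            print(os.path.basename(path))
```

Before this fix, seeing older events meant naming a rotated log file on the fmdump command line yourself.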

6618751 Include memboard in T5440 FBR/FBU diagnosis

Beyond fixing a nasty core dump in the diagnosis flow, this fix also improves the diagnosis. This was a gap from when T5440 first shipped. On configurations where DIMMs on memory boards are in the mix, the memory board itself is part of the FB-DIMM channel. That component is now included in the diagnosis of channel errors.

6747341 Add FMA to hermon driver
6656720 Initial hxge driver

Add two more FMA-hardened drivers to Solaris 10!


The impetus for this change is the higher core counts coming in the x86 world. With newer chips from both AMD and Intel, the previously defined maximums would be insufficient to fault manage all the cores and strands. Not anymore :)

6818561 FMA topology fails on Sun Blade T6300

This was just a flat-out embarrassment. Topology was completely missing from the T6300 blades, rendering FMA largely ineffective (errors were still logged, but diagnosis was crippled). Glad this one is fixed.




