Solaris FMA for Nehalem
By user9148476 on Mar 30, 2009
Several subsystems were updated so that Solaris FMA continues to support the expected features on Nehalem, including CPU offlining and memory page retire. Changes include:
- Machine Check Handler updates: Most notably, support has been added for the new corrected machine check interrupt (CMCI). Error throttling on CMCIs is in place to mitigate correctable error storms. Also, an Intel plugin for refined error telemetry has been added to Solaris' MCA framework.
- Memory Topology: For DIMM diagnosis and page retire, FMA requires a memory topology. For Nehalem, the memory topology is read directly from the memory controllers on the system via the intel_nhm driver, and then post-processed by FMA's topology enumerators.
- Diagnosis Rule updates: Coverage for new Nehalem ereports, notably those raised when the QuickPath Interconnect detects errors. A particularly interesting one is the notification of a memory sparing event.
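To make the CMCI throttling idea above a bit more concrete, here is a minimal sketch in Python. It is not the Solaris implementation: the class name, the thresholds, and the "fall back to polling, re-enable after a quiet period" policy are all assumptions chosen for illustration. It only shows the general shape of rate-limiting corrected-error interrupts during a storm.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CmciThrottle:
    """Hypothetical CMCI throttle; names and defaults are illustrative only."""
    max_events: int = 5          # assumed: CMCIs tolerated per window
    window: float = 1.0          # assumed: sliding window length, in seconds
    quiet_period: float = 10.0   # assumed: error-free time before re-enabling
    interrupts_enabled: bool = True
    _stamps: deque = field(default_factory=deque)
    _last_error: float = 0.0

    def on_cmci(self, now: float) -> bool:
        """Record one corrected-error interrupt; return True if CMCI stays enabled."""
        self._last_error = now
        self._stamps.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while self._stamps and now - self._stamps[0] > self.window:
            self._stamps.popleft()
        if len(self._stamps) > self.max_events:
            # Storm detected: stop taking interrupts and rely on polling.
            self.interrupts_enabled = False
            self._stamps.clear()
        return self.interrupts_enabled

    def on_poll(self, now: float, errors_found: int) -> bool:
        """Periodic poll; re-enable CMCI once no errors were seen for quiet_period."""
        if errors_found:
            self._last_error = now
        elif not self.interrupts_enabled and now - self._last_error >= self.quiet_period:
            self.interrupts_enabled = True
        return self.interrupts_enabled


# Five interrupts 100 ms apart against a limit of three per second.
throttle = CmciThrottle(max_events=3, window=1.0, quiet_period=5.0)
results = [throttle.on_cmci(0.1 * i) for i in range(5)]
print(results)
```

In this sketch the fourth interrupt inside the window trips the throttle, and a later poll that finds no errors for the full quiet period turns interrupts back on. A real handler would do this per CPU or per machine-check bank and would mask the interrupt in hardware rather than flip a flag.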
Oh... and all of the above is forthcoming in the next Solaris 10 update release.
UPDATE 04/01/2009: There's a great video from Dave Stewart at Intel on Nehalem's CMCI and FMA interaction.
UPDATE 04/14/2009: Sun announced Nehalem-based systems. Check out my blog describing fault management on Sun's new Nehalem systems.