For the last two years, I've been leading teams delivering FMA diagnosis and recovery for sun4v-based platforms. Sun has gotten T5120 and T5220 out the door, along with blade derivatives of these that use the same FMA stack. And there are more products in the pipeline. It's been a mix of fun, stressful, hectic, and satisfying.
But one common theme across the umpteen platforms I've been involved with has been a royal pain: putting code back into Solaris. Now, don't get me wrong. I love Solaris. I enjoy working with it, coding to it, and running it. But when it comes to bringing a hardware platform to market, the desired release date of the platform rarely aligns with a Solaris update release train. Yes, there's the patch route, but the Solaris 10 putback must happen several months before the system ships, and the Nevada putback even earlier than that. For example, T5120 was released in October 2007. The Nevada putback had to happen in December 2006, to align with an interim build of Solaris 10 Update 4 in March 2007. That's a lot of up-front schedule pressure.
So in 2008, a new endeavor I'll be working on is laying down some FMA groundwork to enable platform delivery with zero FMA putbacks required to support the initial release of the product. The basic pieces of this are to enumerate the topology in a generic manner and deliver some generic diagnosis engines (DEs) for CPU, Memory, and IO. x86 has already done some work in this area. I'll certainly be leveraging that for SPARC, and I intend to improve upon it. For example, generic x86 FMA doesn't have the concept of cores or chips; it's strands only. Since sun4v systems have Service Processors, and Sun has control over the firmware, I think for SPARC we can teach generic DEs about cores and chips.
And looking farther out, I'll be working (with the help of lots of other engineers) to enable moving diagnosis for most of the hardware out of Solaris and into the Service Processor. This is huge. First, we get out of the situation we have on T5120/T5220 and related products, where the diagnosis happens within the entity experiencing the problem. It also opens the door for more RAS features, such as automatic ASR disables of components triggered off FMA suspect lists. SP-resident diagnosis also means we can have FMA support even if Solaris isn't the OS running on the host hardware (e.g. Ubuntu). That's pretty frickin' cool.
If you're not following this, take an example: the host hardware experiences some unrecoverable error and Solaris must be restarted. With diagnosis in the OS, we take the error and reset. Error telemetry (ereports) gets queued up. Once the OS is back up and running, the ereports flow in, FMA diagnoses the fault, and the suspect lists are messaged. Now, the nature of the error could prevent the OS from booting. So no diagnosis. Or even if the OS recovers (which is actually the norm), if you want to ASR disable the component entirely (i.e. blacklist it), the host hardware must be taken down again, the component manually ASR disabled, and the OS brought back up. Two hits to the domains to handle one error event. Not ideal, to say the least.
So I'm excited about 2008. As I wrote in one of the project descriptions, we will be "liberating platforms from the shackles of an FMA putback". The revolution will not be televised....