By relling on Apr 23, 2007
My colleague, Gary Combs, put together a podcast describing the new RAS features found in the Sun SPARC Enterprise Servers. The M4000, M5000, M8000, and M9000 servers have very advanced RAS features, which put them head and shoulders above the competition. Here is my list of favorites, in no particular order:
- Memory mirroring. This is like RAID-1 for main memory. As I've said many times, four types of components tend to break most often: disks, DIMMs (memory), fans, and power supplies. Memory mirroring brings to DIMMs the fully redundant reliability techniques often used for disks, fans, and power supplies.
- Extended ECC for main memory. Full chip failures on a DIMM can be tolerated.
- Instruction retry. The processor can detect faulty operation and retry instructions. This feature has long been available on mainframes, and is now available for the general-purpose computing market.
- Improved data path protection. Many improvements here, along the entire data path. ECC protection is provided for all of the on-processor memory.
- Reduced part count from the older generation Sun Fire E25K. Better integration allows us to do more with fewer parts while simultaneously improving the error detection and correction capabilities of the subsystems.
- Open-source Solaris Fault Management Architecture (FMA) integration. This lets system administrators see which faults the system has detected, and lets the system heal itself automatically.
- Enhanced dynamic reconfiguration. Dynamic reconfiguration can be done at the processor, DIMM (bank), and PCI-E (pairs) levels of granularity.
- Solaris Cluster support. Of course, Solaris Cluster is supported, including clustering between Solaris containers, dynamic system domains, or chassis.
- Comprehensive service processor. The service processor monitors the health of the system and controls system operation and reconfiguration. This is the most advanced service processor we've developed. Another welcome feature is the ability to delegate responsibilities to different system administrators, with restrictions so that no one of them has to be given control of the entire chassis. This will be greatly appreciated in large organizations where multiple groups need computing resources.
- Dual power grid. You can connect the power supplies to two different power grids. Many people do not have the luxury of access to two different power grids, but those who have been bitten by a grid outage will really appreciate this feature. Think of this as RAID-1 for your power source.
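The RAID-1 analogy used above for memory mirroring and dual power grids can be made concrete with a little arithmetic. The sketch below is only an illustration, not a model of the actual servers: it assumes the redundant copies fail independently of each other, the 0.999 availability figure is invented for the example, and the function names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def mirrored_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` redundant units, assuming independent failures.

    The group is down only when every copy is down at the same time,
    so its unavailability is (1 - single) ** copies.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if copies < 1:
        raise ValueError("need at least one copy")
    return 1.0 - (1.0 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.999  # illustrative value only, not a measured figure
    mirrored = mirrored_availability(single)
    print(f"single unit:   {expected_downtime_hours(single):.4f} hours/year down")
    print(f"mirrored pair: {expected_downtime_hours(mirrored):.4f} hours/year down")
```

Under these assumptions, mirroring squares the unavailability, which is why a second copy buys so much. Real systems see correlated failures (shared power, shared firmware, operator error), so the actual gain is smaller than this idealized figure.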
I don't think you'll see anything revolutionary in my favorites list. This is due to the continuous improvements in the RAS technologies. The older Sun Fire servers were already very reliable, and it is hard to create a revolutionary change for mature technologies. We have goals to make every generation better, and we've made many advances with this new generation. If the RAS guys do their job right, you won't notice it - things will just keep working.