CoolThreads, T2000, and RAS
By relling on Dec 06, 2005
Now that we've officially announced the Sun Fire™ CoolThreads servers, I can talk about some of the work we've been doing in the Reliability, Availability, and Serviceability (RAS) Engineering group for the past year. We are very excited because these servers offer some of the best RAS features ever. Of course, you'll hear from many folks about the price, performance, power, space, and other advantages of these new servers. I'll concentrate on why these servers will keep running and running for a long, long time.
When designing data centers, we often spend a lot of time designing the heating, ventilation, and air conditioning (HVAC) systems. The laws of thermodynamics dictate that we move heat or energy around and that doing so isn't free. We begin the design with an enclosure which provides good cooling paths, including thermal isolation between the power supplies, disk drives, and motherboard. For the Sun Fire™ T2000 server we use basically the same enclosure as the Sun Fire™ X4200 servers, which I've discussed previously. By designing this enclosure for the heat generation envelope of Opteron servers, we have plenty of margin for the lower-power UltraSPARC™ T1 processor-based servers. The laws of semiconductor physics say that as the temperature rises, the reliability decreases. Thus the physical design of the T2000 server leads to increased reliability.
In some of the competitive comparisons you'll see today, people will compare one T2000 server against competing products which use multiple processors. From a reliability perspective, we know that heat kills, and more processors generate more heat.
High Integration Leads to High Reliability
We perform reliability projections for new products early in the design cycle. In RAS-industry lingo, we use the methods described by MIL-HDBK-217 and Telcordia for reliability modeling. In general, the more parts you have, the lower your reliability. This is really common sense: if I have some widgets, then the probability that at least one of them will fail grows with the number of widgets. Moore's law says that over time, we can integrate more widgets into one widget. Think of it this way: we've taken half of a Sun Enterprise™ 10000 server and put it in a single chip, effectively reducing hundreds of widgets into one UltraSPARC™ T1 processor.
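To make the parts-count intuition concrete, here is a minimal Python sketch. It assumes a simple series model with constant failure rates, where the system fails if any part fails and the part failure rates add. The FIT values and part counts are hypothetical, chosen only for illustration; they are not taken from MIL-HDBK-217, Telcordia, or any real bill of materials.

```python
import math

HOURS_PER_YEAR = 8760


def system_fit(part_fits):
    """Series (parts-count) model: the system fails if any part fails,
    so constant per-part failure rates simply add.
    Rates are in FIT (failures per 1e9 device-hours)."""
    return sum(part_fits)


def prob_failure_within(fit, hours):
    """Probability of at least one failure within `hours`,
    assuming a constant (exponential) failure rate."""
    failures_per_hour = fit / 1e9
    return 1.0 - math.exp(-failures_per_hour * hours)


# Hypothetical numbers: 200 discrete parts versus one integrated part,
# each assigned the same illustrative 100 FIT.
discrete = system_fit([100.0] * 200)
integrated = system_fit([100.0])

for label, fit in (("discrete", discrete), ("integrated", integrated)):
    p = prob_failure_within(fit, HOURS_PER_YEAR)
    print(f"{label:>10}: {fit:8.0f} FIT, P(failure within 1 year) = {p:.4f}")
```

With equal per-part rates, the total failure rate scales linearly with the number of parts, which is the sense in which integration helps.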
In some of the performance comparisons you'll see today, people will compare a single T1 processor against several competing processors. A funny thing happens when you look at the failure rate of the processors. To a large degree, complex, contemporary semiconductors have the same failure rate under the same environmental conditions. If you compare a T2000 against a server with four processor chips, the T2000's processor complement is approximately four times as reliable as the competing system's, because there is one chip to fail instead of four. This sort of analysis extends to multiple servers also. For example, if one T2000 has equivalent performance to a set of two competing servers, then the failure rate of the T2000 will be approximately half (or less :-) that of the competing solution because there are half as many power supplies, disks, semiconductors, etc. Not only do we gain a power, space, and cost advantage, but we get higher reliability as well.
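The arithmetic behind that comparison can be sketched as follows. This assumes identical components have identical constant failure rates and counts every component failure as one service event, ignoring redundancy. The component names, quantities, and FIT values are hypothetical placeholders, not measured data for any server.

```python
HOURS_PER_YEAR = 8760

# Hypothetical per-component failure rates in FIT (failures per 1e9 hours).
FIT = {"processor": 500.0, "power_supply": 2000.0, "disk": 3000.0, "dimm": 200.0}


def server_fit(parts):
    """Total failure rate of one server, given a mapping of
    component name -> quantity installed."""
    return sum(FIT[name] * qty for name, qty in parts.items())


def expected_failures_per_year(parts, servers=1):
    """Expected component failures per year across `servers` identical boxes."""
    return servers * server_fit(parts) * HOURS_PER_YEAR / 1e9


# Hypothetical configurations: same box, one processor chip versus four.
one_socket = {"processor": 1, "power_supply": 2, "disk": 2, "dimm": 8}
four_socket = dict(one_socket, processor=4)

cpu_ratio = (FIT["processor"] * four_socket["processor"]) / (
    FIT["processor"] * one_socket["processor"]
)
print(f"processor failure-rate ratio, four chips vs one: {cpu_ratio:.1f}x")

one_box = expected_failures_per_year(one_socket, servers=1)
two_boxes = expected_failures_per_year(one_socket, servers=2)
print(f"expected failures/year, one server:  {one_box:.4f}")
print(f"expected failures/year, two servers: {two_boxes:.4f}")
```

Doubling the number of servers doubles the expected failure count regardless of the per-component rates chosen, which is why the "approximately half" figure does not depend on the specific numbers.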
But Reliability Is Only One Letter in RAS...
So, what about availability and serviceability? We do the usual things to make the T2000 more available and serviceable. The power supplies, fans, and disk drives are all redundant and hot swappable. Main memory has ECC and chip kill, which complement the Solaris Fault Management Architecture (FMA) and its memory page retirement. The UltraSPARC™ T1 processor has built-in reliability and redundancy features. Automatic System Recovery (ASR) provides automatic recovery from failures in PCI cards or memory (though ASR is disabled by default). There is an ALOM system controller which allows remote management of the platform. In other words, all of the availability and serviceability methods and processes for this class of server are carried forward; there is no sacrificing of these important features.
More on Redundancy
Above I mentioned that we basically shrunk half of a Sun Enterprise™ 10000 server (aka Starfire) and put it in a single chip. One feature that we lose when we do that is the granularity of the redundancy. We also lose Dynamic System Domains (DSD), which provide some allocation and deployment flexibility, but for now, I'll stick to the basic RAS concepts. The Starfire has many redundant and hot-pluggable features which are lost when we integrate onto a single chip. In other words, we are replacing a highly redundant system with a system containing many single points of failure (SPOFs). In general, reliability trumps redundancy, so this trade-off isn't always a bad thing. But perhaps more specifically, for the target market and price range, we intentionally designed the T2000 server to have such SPOFs. There is more good news here: the actual parts count for the motherboard is still lower than competing systems in the same market and price range. It doesn't really make sense to replace a Starfire with a T2000, for RAS purposes, but it makes excellent sense to replace a pile of Xeon or PowerPC based servers with a T2000 or two. OK, to be fair, it also makes sense to replace a pile of older 1-2 RU UltraSPARC-based servers, too, with the added benefit of maintaining the software binary interface.
Of course, another approach is to use the no-cost (!) Solaris Enterprise System, which includes the Sun Java Availability Suite and, with it, the Sun Cluster software, to achieve full redundancy between two or more T2000 servers. From a cluster perspective, the T2000 is an excellent choice for a small form factor cluster node.
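As a rough illustration of what node-level redundancy buys, the sketch below compares the steady-state availability of a single node with that of a two-node cluster. It assumes independent node failures and perfect, instantaneous failover, which real clusters only approximate. The MTBF and MTTR figures are hypothetical and are not T2000 or Sun Cluster specifications.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one node: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def parallel_availability(node_availability, nodes):
    """Service is up if at least one node is up.
    Assumes independent failures and perfect failover."""
    return 1.0 - (1.0 - node_availability) ** nodes


def downtime_minutes_per_year(a):
    """Expected downtime per year implied by availability `a`."""
    return (1.0 - a) * MINUTES_PER_YEAR


# Hypothetical node: 50,000 h between failures, 4 h to repair.
single = availability(mtbf_hours=50_000, mttr_hours=4)
pair = parallel_availability(single, nodes=2)

print(f"single node:   {downtime_minutes_per_year(single):.2f} min/year down")
print(f"two-node pair: {downtime_minutes_per_year(pair):.4f} min/year down")
```

In practice, failover time, shared components, and software faults reduce the gain, so the two-node figure is best read as an upper bound under these assumptions.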
The new Sun Fire T2000 server based on the CoolThreads technology is a breakthrough in RAS as well as price, performance, space, and power consumption. If you remember only three things, remember these:
Sun Fire CoolThreads servers offer extraordinarily high reliability due to high integration
Best-in-class RAS features
CoolThreads servers are very cool!