RAS and the X4100/X4200 Servers
By relling on Sep 12, 2005
Here's a peek inside the reliability, availability, and serviceability (RAS) features of the new Sun Fire X4100 and Sun Fire X4200 servers. In this post, I'll talk about heat flow and power distribution, two important constraints in computer system design. You will see why the new X4100 and X4200 servers are not just another pair of run-of-the-mill x64 servers.
Heat costs money and kills
By now, you've probably seen the evidence and the talk about power consumption and heat. It has long been known that cranking up the number of components and the speed of their operation generates more heat. You can do the math and figure out how the heat affects your pocketbook. From a RAS perspective, I'll add that heat kills. The ambient temperature has a direct effect on the reliability of electronics. Cool servers don't break as often as hot servers.
Mechanical systems are the most seriously affected, and in modern computer systems that usually means disk drives. The X4100/X4200 designs use new 2.5" serial attached SCSI (SAS) disks, which draw about 40% less power than the equivalent 3.5" disk drives. Performance gurus will note that average seek times also drop by about 15% for the smaller drives. For power estimation purposes, plan on about 8W for a reasonably busy 2.5" SAS disk. The reduced form factor and power consumption mean that we can offer four disks in the space and power budget formerly needed for two 3.5" disks. OK, so most thin servers don't need more than two disks, and I fully expect that people will deploy many more X4100/X4200 servers with zero, one, or two disks than with four.
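For planning purposes, the 8W figure above can be turned into a quick disk power and heat estimate. The following is a minimal sketch, not an official sizing tool: the function names are made up for illustration, the 8W value is the rough planning number from this post, and the watts-to-BTU/hr factor is the standard conversion (1 W is about 3.412 BTU/hr).

```python
# Rough disk power and heat estimate for a thin server.
# Assumption: every installed disk is "reasonably busy" and draws about 8 W,
# the planning figure quoted in the post. Real drives will vary.
WATTS_PER_BUSY_SAS_DISK = 8.0

# Standard conversion factor from watts to BTU per hour.
BTU_PER_HOUR_PER_WATT = 3.412


def disk_power_watts(disk_count):
    """Estimated power draw, in watts, for the given number of 2.5" SAS disks."""
    if disk_count < 0:
        raise ValueError("disk_count must not be negative")
    return disk_count * WATTS_PER_BUSY_SAS_DISK


def watts_to_btu_per_hour(watts):
    """Convert electrical power in watts to a heat load in BTU per hour."""
    return watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    # The X4100 holds up to four disks, so tabulate zero through four.
    for disks in range(5):
        watts = disk_power_watts(disks)
        heat = watts_to_btu_per_hour(watts)
        print(f"{disks} disk(s): {watts:5.1f} W, {heat:6.1f} BTU/hr")
```

This covers only the disks. The rest of the system (CPUs, memory, fans, power conversion losses) has to be added separately to get a per-server cooling number.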
But the reduced form factor also allows us to improve overall system RAS because we regain a bunch of space in the front bezel area. In older thin servers such as the wildly popular Netra t1 series, the disks basically consume all of the front bezel area. Consider what this means for the airflow in the system. You have heat generators sitting in front of the other electronics. Airflow is front-to-back, so the air that passes over your motherboard is already hotter than the ambient air. By using the 2.5" disk drives, we were able to move the disks out of the way and fully isolate the airflow for the disks from the motherboard.
I'll use the X4100 to demonstrate. The disks and DVD are located on the right side of the front of the server. Behind the drives are the hot swappable power supplies. There is a wall between the drives/power supplies and the motherboard to keep the air separate. The air flowing over the motherboard comes directly from the exterior and flows out the back. The CPUs and memory are oriented so that they all get clean (cool) airflow directly. Pushing air over the motherboard are two rows of hot-pluggable, redundant fans. The bezel in front of the fans is not blocked by bulky disk drives, further ensuring good, cool airflow into the server.
By contrast, the Sun Fire V60x and Sun Fire V65x chassis were designed by "another company," and Sun took that design and re-badged it. The problem is that the other company wasn't used to designing data center class systems. The V60x has a series of holes in the side of the chassis. When you put a bunch of them into a rack, hot air circulates through the rack and back into the chassis. The result is that the systems run very hot. In the front, the disk drives further block the airflow, such that the motherboard sees inconsistent, pre-warmed airflow. These systems also use Xeon processors, which tend to run hot. The net result is that the environmental requirements are de-rated to adjust for the additional cooling requirements of the system. The elegant and clean design of the X4100 and X4200 is vastly superior to the older design in this respect.
Any discussion about power would not be complete without a discussion of power conversion. The Sun Fire X4100 and Sun Fire X4200 servers have RAS improvements there as well. The AC/DC power supplies are a new, remarkably simple design. Only two DC voltage levels are provided: 3.3 and 12 VDC. The 3.3V level is used for control logic and the iLOM controller. Most of the power is internally distributed as 12V. This 12V is converted to the various logic levels elsewhere in the system using reliable DC-DC converters. This design decision allowed Sun to simplify the power supply, and any such simplification improves reliability.
Power supplies operate in a hostile environment. We added a metal-oxide varistor (MOV, aka surge suppressor) into the power supply to help protect the system from unwanted power surges. This protection, the simplicity, low power consumption, and low parts count will result in a highly reliable power supply subsystem. This is systems engineering at its finest.
By contrast, the ATX-style power supplies used in systems such as the new Sun Fire X2100 server provide +3.3, +5, -5, +12, and -12 V. This is an old-school design and is significantly more complex than the new design. Simple designs tend to use fewer parts and thus have higher reliability. While browsing the power supplies at a local computer store last week, I noticed that most ATX power supplies were rated at around 80,000 hours MTBF. The power supplies in the X4100 and X4200 are projected to have more than twice as many hours MTBF.
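To put those MTBF numbers in more familiar terms, here is a minimal sketch that converts MTBF to an annualized failure rate (AFR). It assumes a constant failure rate (the usual exponential model) and 24x7 operation, which is a simplification. The 160,000 hour figure is only an illustrative lower bound taken from "more than twice" 80,000 hours, not a published specification, and the function name is made up for this example.

```python
import math

HOURS_PER_YEAR = 8760  # 24 hours x 365 days, assuming 24x7 operation


def annualized_failure_rate(mtbf_hours):
    """Probability that a unit fails within one year of continuous operation.

    Assumes a constant failure rate (exponential lifetime model), so
    AFR = 1 - exp(-hours_per_year / MTBF).
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


if __name__ == "__main__":
    # 80,000 hours: the typical ATX rating mentioned in the post.
    # 160,000 hours: illustrative lower bound for "more than twice" that.
    for label, mtbf in [("typical ATX supply", 80_000),
                        ("twice the ATX MTBF", 160_000)]:
        afr = annualized_failure_rate(mtbf)
        print(f"{label}: MTBF {mtbf} h, AFR about {afr:.1%}")
```

Under these assumptions, an 80,000 hour MTBF works out to roughly a 10% chance of failure per supply per year, and doubling the MTBF cuts that to roughly 5%.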
The new Sun Fire X4100 and Sun Fire X4200 servers are definitely not just another repackaging of some old-school ATX design. The entire design was approached from a data center perspective with the intent of providing a highly reliable, high performance, small form factor server. The good RAS design should translate directly into long life, fewer service calls, and happy customers. Do you want to be happy?