Monday Oct 13, 2008

Evolution of RAS in the Sun SPARC T5440 server

Reliability, Availability, and Serviceability (RAS) in the Sun SPARC Enterprise T5440 builds upon the solid foundations created for the Sun SPARC Enterprise T5140, T5240, and Sun Fire X4600 M2 servers. The large number of CPU cores available in the T5440 needs large amounts of I/O capability to balance the design. The physical design of the X4600 M2 servers was a natural candidate for the new design: modular CPU and memory cards along with plenty of slots for I/O expansion. We've also seen good field reliability from the X4600 M2 servers and their components. The T5440 is an excellent example of how leveraging the best parts of these other designs has resulted in a very reliable and serviceable system.

The trade-offs required for scaling from a single-board design to a larger, multiple-board design always impact the reliability of the server. Additional connectors and other parts contribute to increased failure rates, or lower reliability. On the other hand, the ability to replace a major component without replacing a whole motherboard increases serviceability and lowers operating costs. The additional parts which enable the system to scale also have an impact on performance, as some of my colleagues have noted. When comparing systems on a single aspect of the RAS and performance spectrum, you can miss important design characteristics or, worse, misunderstand how the trade-offs impact the overall suitability of a system. To get better insight into how to apply highly scalable systems to a complex task, we prefer to do a performability analysis.

The T5440 has almost exactly twice the performance capability of the T5220. If you have a workload which previously required four T5220s with a spare (for availability), then you should be able to host that workload on only two T5440s and a spare. Using benchmarks for sizing is the best way to compare, and we can generally see that a T5440 is about six times as capable as a Sun Fire V490 server. That gives us configurations of comparable performance to model.
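The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration: the capability ratios are the rough ones quoted in the text (a T5440 is about 2x a T5220 and about 6x a V490), and the function name and the normalization to "T5220 equivalents" are assumptions made for the example, not part of any Sun sizing tool.

```python
import math

# Relative capability, normalized so that one T5220 = 1.0.
# Ratios are the approximate ones quoted in the text, not benchmark results.
RELATIVE_CAPABILITY = {
    "T5440": 2.0,
    "T5220": 1.0,
    "V490": 2.0 / 6.0,  # a T5440 is roughly six times a V490
}

def units_required(workload, server, spares=1):
    """Return (active, spares) needed to host `workload` T5220-equivalents."""
    capability = RELATIVE_CAPABILITY[server]
    # The small epsilon guards against floating-point noise before rounding up.
    active = math.ceil(workload / capability - 1e-9)
    return active, spares

if __name__ == "__main__":
    workload = 4.0  # the example workload: four T5220s' worth of work
    for name in RELATIVE_CAPABILITY:
        active, spares = units_required(workload, name)
        print(f"{name}: {active} + {spares}")
```

With the four-T5220 workload from the example, this yields 2 + 1 for the T5440, 4 + 1 for the T5220, and 12 + 1 for the V490.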

On the RAS side, a single T5440 is more reliable than two T5220s, so there is a reliability gain. But for a performability analysis, that gain has to be weighed against the smaller number of T5440s. For example, if the workload requires 4 servers and we add a spare, then the system is considered performant when 4 of 5 servers are available. As we consolidate onto fewer servers, the model changes accordingly: for 2 servers and a spare, the system is performant when 2 of 3 servers are available. The reliability gain of using fewer servers can be readily seen in the number of yearly service calls expected. Fewer servers tend to mean fewer service calls. The math behind this can become complicated for large clusters and is arguably counter-intuitive at times. Fortunately, our RAS modeling tools can handle very complicated systems relatively easily.

We build availability models for all of our systems and use the same service parameters to permit easy comparisons. For example, we would model all systems with an 8-hour service response time. The models are then compared as follows:

System                               Units     Performability   Yearly Services
Sun SPARC Enterprise T5440 server    2 + 1     0.99999903       0.585
Sun SPARC Enterprise T5240 server    4 + 1     0.99999909       0.661
Sun SPARC Enterprise T5140 server    4 + 1     0.99999915       0.687
Sun Fire V490 server                 12 + 1    0.99998644       1.402
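To put the table's figures on a more familiar scale, the sketch below converts them with simple arithmetic. It assumes that performability can be read as the long-run fraction of time the configuration delivers the required performance, and that yearly services is an average rate. Both readings are assumptions made for illustration. The function names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored

# (performability, yearly services) copied from the table above.
RESULTS = {
    "T5440 (2 + 1)": (0.99999903, 0.585),
    "T5240 (4 + 1)": (0.99999909, 0.661),
    "T5140 (4 + 1)": (0.99999915, 0.687),
    "V490 (12 + 1)": (0.99998644, 1.402),
}

def shortfall_minutes_per_year(performability):
    """Expected minutes per year below the required performance level."""
    return (1.0 - performability) * MINUTES_PER_YEAR

def years_between_services(yearly_services):
    """Average interval between service calls, in years."""
    return 1.0 / yearly_services

if __name__ == "__main__":
    for name, (perf, services) in RESULTS.items():
        print(
            f"{name}: {shortfall_minutes_per_year(perf):.2f} min/year short, "
            f"one service call every {years_between_services(services):.2f} years"
        )
```

Read this way, the T5x40 configurations each fall short for roughly half a minute per year, while the V490 configuration falls short for about seven minutes per year.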

In these results, you can see that the T5440 clearly wins on both the number of units and yearly services. Both of these metrics impact total cost of ownership (TCO), as the complexity of an environment is generally attributed to the number of OS instances, and fewer servers generally means fewer OS instances. Fewer service calls means fewer problems that require physical human interaction.

You can also see that the performability of the T5x40 systems is very similar. Any of these systems will be much better than a system of V490 servers.

More information on the RAS features of these servers can be found in the white paper we wrote, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140, T5240, and T5440 Servers. OK, I'll admit that someone else wrote the title...

About

relling
