Friday Oct 24, 2008

Oldie but goodie

This week, several colleagues and I were discussing problems posing people who run datacenters due to the increasingly high churn rates in hardware. Some customers prefer systems which don't change every other month, so they can plan for longevity.  Other customers want the latest and greatest bling. Vendors are stuck in the middle and have their own issues with stocking, manufacturing, plant obsolescence, etc. So, how can you design your datacenter architecture to cope with these conflicting trends?  Well, back in 1999 I wrote a paper for the SuperG conference which discusses this trend, to some degree.  It is called A Model for Datacenter.com.  It is now nearly 10 years old, but seems to be holding its own, even though much has changed in the past decade. I'm posting it here, because I never throw things away and the SuperG proceedings are not generally available.

Enjoy.

Monday Oct 13, 2008

Evolution of RAS in the Sun SPARC T5440 server

Reliability, Availability, and Serviceability (RAS) in the Sun SPARC Enterprise T5440 builds upon the solid foundations created for the Sun SPARC Enterprise T5140, T5240, and Sun Fire X4600 M2 servers. The large number of CPU cores available in the T5440 needs large amounts of I/O capability to balance the design. The physical design of the X4600 M2 servers was a natural candidate for the new design – modular CPU and memory cards along with plenty of slots for I/O expansion. We've also seen good field reliability from the X4600 M2 servers and their components. The T5440 is a excellent example of how leveraging the best parts of these other designs has resulted in a very reliable and serviceable system.

The trade-offs required for scaling from a single board design to a larger, multiple board design always impact reliability of the server. Additional connectors and other parts also contribute to increased failure rates, or lower reliability. On the other hand, the ability to replace a major component without replacing a whole motherboard increases serviceability – and lowers operating costs. The additional parts which enable the system to scale also have an impact on performance, as some of my colleagues have noted. When comparing systems on a single aspect of the RAS and performance spectrum, you can miss important design characteristics, or worse, misunderstand how the trade-offs impact the overall suitability of a system. To get a better insight on how to apply highly scalable systems to a complex task prefer to do a performability analysis.

The T5440 has almost exactly twice the performance capabilities of the T5220. If you have a workload which previously required four T5220s with a spare (for availability), then you should be able to host that workload on only two T5440s, and a spare. Using benchmarks for sizing is the best way to compare, and we can generally see that a T5440 is six times more capable than a Sun Fire V490 server. This will complete a comparable performance sizing.

On the RAS side, a single T5440 is more reliable than two T5220s, so there is a reliability gain. But for a performability analysis, that is contrasted with the fewer numbers of T5440. For example, if the workload requires 4 servers and we add a spare, then the system is considered performant when 4 of 5 servers are available. As we consolidate onto fewer servers, the model changes accordingly: for 2 servers and a spare, the system is performant when 2 of 3 servers are available. The reliability gain of using fewer servers can be readily seen in the number of yearly service calls expected. Fewer servers tends to mean fewer service calls. The math behind this can become complicated for large clusters and is arguably counter-intuitive at times. Fortuntately, our RAS modeling tools can handle very complicated systems relatively easily.

We build availability models for all of our systems and use the same service parameters to permit easy comparisons. For example, we would model all systems with 8 hour service response time. The models are then compared, thusly

System

Units

Performability

Yearly Services

Sun SPARC Enterprise 5440 server

2 + 1

0.99999903

0.585

Sun SPARC Enterprise 5240 server

4 + 1

0.99999909

0.661

Sun SPARC Enterprise 5140 server

4 + 1

0.99999915

0.687

Sun Fire V490 server

12 + 1

0.99998644

1.402

In these results, you can see that T5440 clearly wins the number of units and yearly services. Both of these metrics impact total cost of ownership (TCO) as the complexity of an environment is generally attributed to the number of OS instances – fewer servers generally means fewer OS instances. Fewer service calls means fewer problems that require physical human interactions.

You can also see that the performability of the T5x40 systems are very similar. Any of these systems will be much better than a system of V490 servers.

More information on the RAS features these servers can be found in the white paper we wrote, Maximizing IT Service Uptime by Utilizing Dependable Sun SPARC Enterprise T5140, T5240, and T5440 Servers. Ok, I'll admit that someone else wrote the title...

Wednesday Oct 01, 2008

Hey! Where did my snapshots go? Ahh, new feature...

I've been running Solaris NV b99 for a week or so.  I've also been experimenting with the new automatic snapshot tool, which should arrive in b100 soon. To see what snapshots are taken, you can use zfs's list subcommand.

# zfs list
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool                   3.63G    12.0G    57.5K   /rpool
rpool@install             17K        -    57.5K   -

...

This is typical output and shows that my rpool (root pool) file system has a snapshot which was taken at install time: rpool@install

But in NV b99, suddenly, the snapshots are no longer listed by default. In general, this is a good thing because there may be thousands of snapshots and the long listing is too long for humans to understand. But what if you really do want to see the snapshots?  A new flag has been added to the zfs list subcommand which will show the snapshots.

# zfs list -t snapshot
NAME                     USED    AVAIL    REFER   MOUNTPOINT
rpool@install             17K        -    57.5K   -

...


This should clear things up a bit and make it easier to manage large numbers of snapshots when using the CLI. If you want to see more details on this change, see the head's up notice for PSARC 2008/469.
About

relling

Search

Archives
« October 2008 »
SunMonTueWedThuFriSat
   
2
3
4
5
6
7
8
9
10
11
12
14
15
16
17
18
19
20
21
22
23
25
26
27
28
29
30
31
 
       
Today