Tuesday Oct 09, 2007

Performability analysis of T5120 and T5220

In complex systems, we must often trade off performance against reliability, availability, or serviceability. In many cases, a system design will include both performance and availability requirements. We use performability analysis to examine the performance versus availability trade-off. Performability is simply the ability to perform. A performability analysis combines a performance characterization of the system in each of its possible degraded states with the probability that the system will be operating in each of those states.

The simplest performability analysis is often appropriate for multiple node, shared nothing clusters which scale performance perfectly. For example, in a simple web server farm, you might have N servers capable of delivering M pages per server. Disregarding other bottlenecks in the system, such as the capacity of the internet connection to the server farm, we can say that N+1 servers will deliver M * (N+1) performance. Thus we can estimate the aggregate performance of any number of web servers.

We can also perform an availability analysis on a web server. We can build Markov models which consider the reliability of the components in a server and their expected time to repair. The output of the models will provide the estimated time per year that each web server may be operational. More specifically, we will know the staying time per year for each of the model states. For a simple model, the performance reward for an up state is M and a down state is 0. A system which provides 99.99% (four-nines) availability can be expected to be down for approximately 53 minutes per year and up for the remainder.
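
For reference, the arithmetic behind that figure is simply (1 - 0.9999) * 525,600 minutes per year ≈ 52.6 minutes of expected annual downtime.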

For a shared nothing cluster, we can further simplify the analysis by ignoring common fault effects. In practice, this means that a failure or repair in one web server does not affect any other web servers. In many respects, this is the same simplifying assumption we made with performance, where the performance of a web server does not depend on any of the other web servers.

The shared nothing cluster availability model will contain the following system states and the annual staying time in each state: all up, one down (N-1 up), two down (N-2 up), three down (N-3 up), and so on. The availability model inputs include the unscheduled mean time between system interruptions (U_MTBSI) and mean time to repair (MTTR) for the nodes. We often choose an MTTR value by considering the cost of service response time. For many shared nothing clusters, a service response time of 48 hours may be reasonable – a value which may not be reasonable for a database or storage tier. Model results might look like this:

System State    Annual Staying Time (minutes)    Cumulative Uptime (%)    Performance Reward
All up          521,395.20                       99.2                     M * N
1 down          4,162.75                         99.992                   M * (N - 1)
2 down          39.95                            99.9996                  M * (N - 2)
3 down          2.00                             99.99998                 M * (N - 3)
> 3 down        0.11                             100                      < M * (N - 4)
Total           525,600.00                       100


Now we have enough data to evaluate the performability of the system. For the simple analysis, we take the cumulative uptime of the states which meet the minimum required performance. We can then compare various systems on the basis of performability.
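
As a rough illustration of the method, here is a minimal Python sketch under the same simplifying assumptions: independent node failures, a single up/down state per node, and a performance reward of M per node. The U_MTBSI and MTTR inputs shown are hypothetical placeholders, not the values behind the table above.

    from math import comb

    MINUTES_PER_YEAR = 525_600.0

    def node_availability(u_mtbsi_hours, mttr_hours):
        # Steady-state availability of a single node.
        return u_mtbsi_hours / (u_mtbsi_hours + mttr_hours)

    def performability(n_nodes, n_required, u_mtbsi_hours, mttr_hours):
        # Probability (in percent) that at least n_required of n_nodes are up,
        # assuming independent failures and repairs (shared nothing).
        a = node_availability(u_mtbsi_hours, mttr_hours)
        p_ok = 0.0
        for down in range(n_nodes - n_required + 1):
            p_state = comb(n_nodes, down) * (1 - a) ** down * a ** (n_nodes - down)
            p_ok += p_state
            print(f"{down} down: {p_state * MINUTES_PER_YEAR:12.2f} minutes/year")
        return 100.0 * p_ok

    # Hypothetical inputs: 10,000 hour U_MTBSI, 48 hour MTTR, N+1 = 7 nodes
    # provisioned where N = 6 are required to meet the performance target.
    print(f"performability = {performability(7, 6, 10_000, 48):.5f}%")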

We have modeled the new Sun SPARC Enterprise T5120 and Sun SPARC Enterprise T5220 servers against the venerable Sun Fire V490 servers. For this analysis we chose a performance benchmark with a metric that showed we needed 6 T5120 or T5220 servers to match the performance of 9 V490 servers. We will choose to overprovision by one server, which is often optimum for such architectures. The performability results are:

Servers                       Units    Performability (%)
Sun SPARC Enterprise T5120    6 + 1    99.99988
Sun SPARC Enterprise T5220    6 + 1    99.99988
Sun Fire V490                 9 + 1    99.99893

You might notice that the T5120 and T5220 have the same performability results. This is because they share the same motherboard design, disks, power supplies, etc. It is much more interesting to compare these to the V490. Even though we use more V490 systems, the T5120 and T5220 solution provides better performability. Fewer, faster, more reliable servers should generally have better performability than more, slower, less reliable servers.

 

Thursday Oct 04, 2007

Performability Analysis for Storage

I'll be blogging about performability analysis over the next few weeks. Last year Hairong Sun, Tina Tyan, Steven Johnson, Nisha Talagala, Bob Wood, and I published a paper on how we do performability analysis at Sun.  It is titled Performability Analysis of Storage Systems in Practice: Methodology and Tools, and is available online at SpringerLink. Here is the abstract:

This paper presents a methodology and tools used for performability analysis of storage systems in Sun Microsystems. A Markov modeling tool is used to evaluate the probabilities of normal and fault states in the storage system, based on field reliability data collected from customer sites. Fault injection tests are conducted to measure the performance of the storage system in various degraded states with a performance benchmark developed within Sun Microsystems. A graphic metric is introduced for performability assessment and comparison. An example is used throughout the paper to illustrate the methodology and process.

I'm giving a presentation on performability at Sun's Customer Engineering Conference next week, so if you're attending stop by and visit.

Tuesday Jan 30, 2007

ZFS RAID recommendations: space, performance, and MTTDL

In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.

The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.

maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)

iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))

Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.

Rotational Speed (rpm)    Maximum Rotational Latency (ms)
4,200                     14.3
5,400                     11.1
7,200                     8.3
10,000                    6.0
15,000                    4.0

For example, if we have a 73 GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:

iops = 1000 / (4.1 + (6.0 / 2)) = 140.8

By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has an 8.5 ms average read seek and rotational speed of 7,200 rpm:

iops = 1000 / (8.5 + (8.3 / 2)) = 79.0

I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.
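
For those who prefer code, here is a small Python sketch of the model above; the seek time and rotational speed inputs are taken from the two example drives, and command overhead and caching are ignored, as noted earlier.

    def max_rotational_latency_ms(rpm):
        # Time for one full platter revolution, in milliseconds.
        return 60_000.0 / rpm

    def small_random_read_iops(avg_read_seek_ms, rpm):
        # Expected small, random read iops for a single disk, ignoring
        # command overhead and any caching.
        return 1000.0 / (avg_read_seek_ms + max_rotational_latency_ms(rpm) / 2)

    print(small_random_read_iops(4.1, 10_000))   # 2.5" 10,000 rpm SAS example, ~140.8
    print(small_random_read_iops(8.5, 7_200))    # 3.5" 7,200 rpm SATA example, ~79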

Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.

  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.

  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.

  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.

  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance:

RAID Configuration (6 disks)                     Small, Random Read Performance Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                   6
3-set dynamic stripe, 2-way mirror (RAID-1+0)    6
2-set dynamic stripe, 3-way mirror (RAID-1+0)    6
6-disk raidz (RAID-5)                            1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)    2
2-way mirror, 3-disk raidz (RAID-5+1)            2
6-disk raidz2 (RAID-6)                           1
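
Here is a minimal sketch of those aggregation rules in Python. The nested-tuple encoding of a configuration is purely my own illustration (not anything ZFS exposes); it simply reproduces the table above.

    def relative_read_iops(config):
        # Small, random read performance relative to a single disk.
        # config is the string "disk" or a (kind, children) tuple where
        # kind is "stripe", "mirror", "raidz", or "raidz2". Stripes and
        # mirrors sum their children; raidz and raidz2 sets perform like
        # a single disk for this workload.
        if config == "disk":
            return 1
        kind, children = config
        if kind in ("stripe", "mirror"):
            return sum(relative_read_iops(c) for c in children)
        if kind in ("raidz", "raidz2"):
            return 1
        raise ValueError("unknown configuration: " + kind)

    disks = ["disk"] * 6
    print(relative_read_iops(("stripe", disks)))                           # 6-disk RAID-0 -> 6
    print(relative_read_iops(("stripe", [("mirror", ["disk"] * 2)] * 3)))  # RAID-1+0 -> 6
    print(relative_read_iops(("raidz", disks)))                            # 6-disk raidz -> 1
    print(relative_read_iops(("stripe", [("raidz", ["disk"] * 3)] * 2)))   # RAID-5+0 -> 2
    print(relative_read_iops(("mirror", [("raidz", ["disk"] * 3)] * 2)))   # RAID-5+1 -> 2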

Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small, random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.

Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.

So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates.) I've kicked around the idea of a total disk bandwidth model which will describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it will offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.

Single Parity Model Results 

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.

You'll also note that I did not label the left-side Y axis (MTTDL) again, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotion. I didn't label the MTTDL axis values as I explained previously. But I did label the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of Millions of years, instead of years above.

Double Parity Model Results

Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place.  If you want to be happier, you should use mirroring with at least one hot spare.

Conclusion

We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.

