In this blog I connect the space vs MTTDL models with a performance model. This gives enough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems.
The best thing about a model is that it is a simplification of
real life. The worst thing about a model is that it is a
simplification of real life.
Small, Random Read Performance Model
For this analysis, we will use a small, random read performance
model. The calculations for the model can be made with data which is
readily available from disk data sheets. We calculate the expected
I/O operations per second (iops) based on the average read seek and
rotational speed of the disk. We don't consider the command overhead,
as it is generally small for modern drives and is not always
specified in disk data sheets.
maximum rotational latency (ms) = 60,000 (ms/min) / rotational speed (rpm)
iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))
Since most disks use consistent rotational speeds, this small
table may help you to see what the rotational speed contribution will
be.
Rotational Speed (rpm)    Maximum Rotational Latency (ms)
7,200                     8.3
10,000                    6.0
For example, if we have a 73
GByte, 2.5" Seagate Savvio SAS drive which has a 4.1 ms
average read seek and rotational speed of 10,000 rpm:
iops = 1000 / (4.1 + (6.0 / 2)) = 140.8
By comparison, a 750
GByte, 3.5" Seagate Barracuda SATA drive which has an 8.5 ms
average read seek and rotational speed of 7,200 rpm:
iops = 1000 / (8.5 + (8.3 / 2)) = 79.0
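As a rough sketch, the two formulas above can be written as a few lines of Python. The function names are made up for illustration and the only inputs are the data sheet values quoted in the examples. The code uses the unrounded rotational latency, so the 7,200 rpm result may differ in the last digit from the hand calculation, which rounds the latency to 8.3 ms first.

```python
def max_rotational_latency_ms(rpm):
    """Time for one full platter revolution, in milliseconds."""
    return 60_000.0 / rpm


def small_random_read_iops(avg_read_seek_ms, rpm):
    """Expected small, random read iops for one disk.

    Average rotational latency is taken as half a revolution.
    Command overhead is ignored, as in the text.
    """
    avg_rotational_latency_ms = max_rotational_latency_ms(rpm) / 2
    return 1000.0 / (avg_read_seek_ms + avg_rotational_latency_ms)


# Data sheet values quoted in the two examples above.
small_fast_drive = small_random_read_iops(avg_read_seek_ms=4.1, rpm=10_000)
large_slow_drive = small_random_read_iops(avg_read_seek_ms=8.5, rpm=7_200)

print(f"10,000 rpm, 4.1 ms seek: {small_fast_drive:.1f} iops")
print(f" 7,200 rpm, 8.5 ms seek: {large_slow_drive:.1f} iops")
print(f"relative improvement: {100 * (small_fast_drive / large_slow_drive - 1):.0f}%")
```

Swapping in the seek time and rotational speed from any other data sheet gives the per-disk number used in the rest of the model.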
I purposely used those two examples because people are always
wondering why we tend to prefer smaller, faster, and (unfortunately)
more expensive drives over larger, slower, less expensive drives - a
78% performance improvement is rather significant. The 3.5"
drives also use about 25-75% more power than their smaller cousins,
largely due to the rotating mass. Small is beautiful in a SWaP
(space, watts, and performance) sense.
Next we use the RAID set configuration information to calculate
the total small, random read iops for the zpool or volume. Here we
need to talk about sets of disks which may make up a multi-level
zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of
mirrored sets (RAID-1). RAID-0 is a stripe of disks.
For dynamic striping (RAID-0), add the iops for each set or
disk. On average the iops are spread randomly across all sets or
disks, gaining concurrency.
For mirroring (RAID-1), add the iops for each set or disk.
For reads, any set or disk can satisfy a read, so we also get
concurrency.
For single parity raidz (RAID-5), the set operates at the
performance of one disk. See below.
For double parity raidz2 (RAID-6), the set operates at the
performance of one disk. See below.
For example, if you have 6 disks, then there are many different
ways you can configure them, with varying performance calculations:
RAID Configuration (6 disks)                     Small, Random Read Performance Relative to a Single Disk
6-disk dynamic stripe (RAID-0)                   6
3-set dynamic stripe, 2-way mirror (RAID-1+0)    6
2-set dynamic stripe, 3-way mirror (RAID-1+0)    6
6-disk raidz (RAID-5)                            1
2-set dynamic stripe, 3-disk raidz (RAID-5+0)    2
2-way mirror, 3-disk raidz (RAID-5+1)            2
6-disk raidz2 (RAID-6)                           1
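The composition rules above are simple enough to sketch in Python. This is only an illustration of the model as stated in this post: the helper names (stripe, mirror, raidz_set) are hypothetical and are not ZFS commands or APIs. Performance is expressed relative to a single disk.

```python
DISK = 1.0  # small, random read performance of one disk (relative units)


def stripe(members):
    """Dynamic stripe (RAID-0): reads spread across members, so iops add."""
    return sum(members)


def mirror(members):
    """Mirror (RAID-1): any side can satisfy a read, so iops add."""
    return sum(members)


def raidz_set(n_disks, disk=DISK):
    """raidz or raidz2 set: modeled as the performance of one disk,
    regardless of how many disks are in the set."""
    return disk


configs = [
    ("6-disk dynamic stripe (RAID-0)", stripe([DISK] * 6)),
    ("3-set dynamic stripe, 2-way mirror (RAID-1+0)", stripe([mirror([DISK] * 2)] * 3)),
    ("2-set dynamic stripe, 3-way mirror (RAID-1+0)", stripe([mirror([DISK] * 3)] * 2)),
    ("6-disk raidz (RAID-5)", raidz_set(6)),
    ("2-set dynamic stripe, 3-disk raidz (RAID-5+0)", stripe([raidz_set(3)] * 2)),
    ("2-way mirror, 3-disk raidz (RAID-5+1)", mirror([raidz_set(3)] * 2)),
    ("6-disk raidz2 (RAID-6)", raidz_set(6)),
]

for name, relative_perf in configs:
    print(f"{relative_perf:>3.0f}x  {name}")
```

Multiplying any of these relative values by the per-disk iops from a data sheet gives the model's estimate for the whole zpool or volume.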
Clearly, using mirrors improves both performance and data
reliability. Using stripes increases performance, at the cost of data
reliability. raidz and raidz2 offer data reliability, at the cost of
performance. This leads us down a rathole...
The Parity Performance Rathole
Many people expect that data protection schemes based on parity,
such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance
of striped volumes, except for the parity disk. In other words, they
expect that a 6-disk raidz zpool would have the same small, random
read performance as a 5-disk dynamic stripe. Similarly, they expect
that a 6-disk raidz2 zpool would have the same performance as a
4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a
checksum to validate the contents of a block of data written. The
block is spread across the disks (vdevs) in the set. In order to
validate the checksum, ZFS must read the blocks from more than one
disk, thus not taking advantage of spreading unrelated, random reads
concurrently across the disks. In other words, the small, random read
performance of a raidz or raidz2 set is, essentially, the same as the
single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.
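To put numbers on the gap between the common expectation and the model, here is a small sketch. It assumes the 79.0 iops figure from the 7,200 rpm drive example earlier in this post. The function names are hypothetical, and neither function measures anything: they only restate the two views described above.

```python
DISK_IOPS = 79.0  # assumption: the 7,200 rpm example drive from earlier


def expected_parity_iops(n_disks, n_parity, disk_iops=DISK_IOPS):
    """The common expectation: a stripe of the non-parity disks."""
    return (n_disks - n_parity) * disk_iops


def modeled_raidz_iops(n_disks, n_parity, disk_iops=DISK_IOPS):
    """The model in this post: a raidz or raidz2 set performs small,
    random reads like a single disk, whatever its width."""
    return disk_iops


for name, n_disks, n_parity in [("6-disk raidz", 6, 1), ("6-disk raidz2", 6, 2)]:
    expected = expected_parity_iops(n_disks, n_parity)
    modeled = modeled_raidz_iops(n_disks, n_parity)
    print(f"{name}: expected {expected:.0f} iops, modeled {modeled:.0f} iops")
```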
Many people also think that this is a design deficiency. As a RAS
guy, I value the data validation offered by the checksum over the
performance supposedly gained by RAID-5. Reasonable people can
disagree, but perhaps some day a clever person will solve this for
us.
So, what do other logical volume managers or RAID arrays do? The
results seem mixed. I have seen some RAID array performance
characterization data which is very similar to the ZFS performance
for parity sets. I have heard anecdotes that other implementations
will read the blocks and only reconstruct a failed block as
needed. The problem is, how do such systems know that a block has
failed? It seems that some of them trust what is read from the disk. To
implement a per-disk block checksum verification, you'd still have to
perform at least two reads from different disks, so it seems to me
that you are trading off data integrity for performance. In ZFS, data
integrity is paramount. Perhaps there is more room for research here,
or perhaps it is just one of those engineering trade-offs that we
must live with.
Other Performance Models
I'm also looking for other performance models which can be applied
to generic disks with data that is readily available to the public.
The reason that the small, random read iops model works is that it
doesn't need to consider caching or channel resource utilization.
Adding these variables would require some knowledge of the
configuration topology and the cache policies (which may also change
with firmware updates). I've kicked around the idea of a total disk
bandwidth model which will describe a range of possible bandwidths
based upon the media speed of the drives, but it is not clear to me
that it will offer any satisfaction. Drop me a line if you have a
good model or further thoughts on this topic.
You should be cautious about extrapolating the performance results
described here to other workloads. You could consider this to be a
worst-case model because it assumes 0% disk cache hits. I would hope
that most workloads exhibit better performance, but rather than
guessing (hoping), the best way to find out is to run the workload and
measure the performance. If you characterize a number of different
configurations, then you might build your own performance graphs
which fit your workload.
Putting It All Together
Now we have a method to compare a variety of different ZFS or RAID
disk configurations by evaluating space, performance, and MTTDL.
First, let's look at single parity schemes such as 2-way mirrors and
raidz on the Sun Fire X4500 (aka Thumper) server.
Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better
performance and MTTDL than raidz for any specific space requirement
except for the case where we run out of hot spares for the 2-way
mirror (using all 46 disks for data). By contrast, all of the raidz
configurations here have hot spares. You can use this to help make
design trade-offs by prioritizing space, performance, and MTTDL.
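If you want to tabulate the space and performance sides of this comparison yourself, a sketch along these lines may help. It assumes 46 disks available for data and hot spares (the figure mentioned above), uses the 79.0 iops example drive as a stand-in for whatever drive is actually installed, and treats any disks left over after building full sets as hot spares. The layouts listed are illustrative and are not necessarily the ones plotted. MTTDL comes from the earlier models and is not computed here.

```python
TOTAL_DISKS = 46   # disks available for data and hot spares, per the text
DISK_IOPS = 79.0   # assumption: the 7,200 rpm example drive from earlier


def layout(kind, set_width, min_spares=0):
    """Return (sets, hot_spares, usable_space_in_disks, read_iops).

    kind is "mirror" (set_width-way mirror) or "raidz" (single parity).
    Disks that do not fill a complete set are counted as hot spares.
    """
    sets = (TOTAL_DISKS - min_spares) // set_width
    hot_spares = TOTAL_DISKS - sets * set_width
    if kind == "mirror":
        space = sets                             # one disk of data per mirror set
        iops = sets * set_width * DISK_IOPS      # every side can serve reads
    elif kind == "raidz":
        space = sets * (set_width - 1)           # one disk of parity per set
        iops = sets * DISK_IOPS                  # each set performs like one disk
    else:
        raise ValueError(f"unknown kind: {kind}")
    return sets, hot_spares, space, iops


candidates = [
    ("mirror", 2, 0),
    ("mirror", 2, 2),
    ("raidz", 3, 1),
    ("raidz", 5, 1),
    ("raidz", 9, 1),
]

print("kind    width  sets  spares  space(disks)  iops")
for kind, width, min_spares in candidates:
    sets, spares, space, iops = layout(kind, width, min_spares)
    print(f"{kind:<7} {width:>5} {sets:>5} {spares:>7} {space:>13} {iops:>6.0f}")
```

Space is reported in units of one disk's capacity, so multiply by the drive size to get GBytes.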
You'll also note that I did not label the left-side Y axis (MTTDL)
again, but I did label the right-side Y axis (small, random read
iops). I did this with mixed emotion. I didn't label the MTTDL axis
values as I explained previously. But I did label the performance
axis so that you can do a rough comparison to the double parity graph
below. Note that in the double parity graph, the MTTDL axis is in
units of millions of years, instead of the years used above.
Here you can see the same sort of comparison between 3-way mirrors
and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.
Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place. If you want to be happier, you should use mirroring with at least one hot spare.
We can make design trade-offs between space, performance, and
MTTDL for disk storage systems. As with most engineering decisions,
there often is not a clear best solution given all of the possible
solutions. By using some simple models, we can see the trade-offs