Thursday Apr 05, 2007

ZFS ported to FreeBSD!

The FreeBSD team has added ZFS to the FreeBSD-7.0 release!  This is excellent news and all of us are happy to share with the FreeBSD community.  Pawel Jakub Dawidek has posted this note to the FreeBSD and ZFS community. This will greatly expand the use of ZFS and will no doubt lead to more innovative developments in the community.  Well done!

Tuesday Feb 13, 2007

Solaris Express Developer Edition solves "support" problem

Last month, I blogged about open source Solaris Nevada build 55. I normally don't single out specific Nevada builds as worthy of a blog entry -- not because I think they are unworthy, but because they come out every two weeks which gets to be a lot of regular work to blog about.

So, why is build 55 so special? It is the first build which has become known as the Solaris Express Developer Edition. This is a milestone release because it is available with a support contract. Many people don't care about support, and you will often hear them complaining about the fact that Solaris 10 doesn't have some version of an application which was just put up on the web last week. The quality process involved in producing, distributing, and supporting a large software product, such as Solaris, takes time to crank through. If you really want the lastest, greatest, and riskiest version of Solaris, then you need to be on a Solaris Express release. The problem is that we couldn't provide a support contract for Solaris Express. That problem is now solved. Those who demand support and want to be closer to the leading edge can now get both in the Solaris Express Developer Edition.

Go ahead, give it a try. Then hang out with the Solaris community at the website where there is always interesting discussions going on.


Wednesday Jan 31, 2007

ZFS RAID recommendations: space, performance, and MTTDL all-in

Wrapping up the thread on space, performance, and MTTDL, I thought that you might like to see one graph which would show the entire design space I've been using.  Here it is:

All-in graph 

This shows the data I've previously blogged about in scale. You can easily see that for MTTDL, double parity protection is better than single parity protection which is better than striping (no parity protection). Mirroring is also better than raidz or raidz2 for MTTDL and small, random read iops. I call this the "all-in" slide because, in a sense, it puts everything in one pot.

While this sort of analysis is useful, the problem is that there are more dimensions of the problem. I will show you some of the other models we use to evaluate and model systems in later blogs, but it might not be so easy to show so many useful factors on one graph.  I'll try my best...

Tuesday Jan 30, 2007

ZFS RAID recommendations: space, performance, and MTTDL

In this blog I connect the space vs MTTDL models with a performance model. This gives engough information for you to make RAID configuration trade-offs for various systems. The examples here target the Sun Fire X4500 (aka Thumper) server, but the models work for other systems too. In particular, the model results apply generically to RAID implementations on just a bunch of disks (JBOD) systems. 

The best thing about a model is that it is a simplification of real life.
The worst thing about a model is that it is a simplification of real life.

Small, Random Read Performance Model

For this analysis, we will use a small, random read performance model. The calculations for the model can be made with data which is readily available from disk data sheets. We calculate the expected I/O operations per second (iops) based on the average read seek and rotational speed of the disk. We don't consider the command overhead, as it is generally small for modern drives and is not always specified in disk data sheets.

maximum rotational latency = 60,000 (ms/min) / rotational speed (rpm)

iops = 1000 (ms/s) / (average read seek time (ms) + (maximum rotational latency (ms) / 2))

Since most disks use consistent rotational speeds, this small table may help you to see what the rotational speed contribution will be.

Rotational Speed (rpm)

Maximum Rotational Latency (ms)











For example, if we have a 73 GByte, 2.5" Seagate Saviio SAS drive which has a 4.1 ms average read seek and rotational speed of 10,000 rpm:

iops = 1000 / (4.1 + (6.0 / 2)) = 140.8

By comparison, a 750 GByte, 3.5" Seagate Barracuda SATA drive which has a 8.5 ms average read seek and rotational speed of 7,200 rpm:

iops = 1000 / (8.5 + (8.3 / 2)) = 79.0

I purposely used those two examples because people are always wondering why we tend to prefer smaller, faster, and (unfortunately) more expensive drives over larger, slower, less expensive drives - a 78% performance improvement is rather significant. The 3.5" drives also use about 25-75% more power than their smaller cousins, largely due to the rotating mass. Small is beautiful in a SWaP sense.

Next we use the RAID set configuration information to calculate the total small, random read iops for the zpool or volume. Here we need to talk about sets of disks which may make up a multi-level zpool or volume. For example, RAID-1+0 is a stripe (RAID-0) of mirrored sets (RAID-1). RAID-0 is a stripe of disks.

  • For dynamic striping (RAID-0), add the iops for each set or disk. On average the iops are spread randomly across all sets or disks, gaining concurrency.

  • For mirroring (RAID-1), add the iops for each set or disk. For reads, any set or disk can satisfy a read, so we also get concurrency.

  • For single parity raidz (RAID-5), the set operates at the performance of one disk. See below.

  • For double parity raidz2 (RAID-6), the set operates at the performance of one disk. See below.

For example, if you have 6 disks, then there are many different ways you can configure them, with varying performance calculations

RAID Configuration (6 disks)

Small, Random Read Performance Relative to a Single Disk

6-disk dynamic stripe (RAID-0)


3-set dynamic stripe, 2-way mirror (RAID-1+0)


2-set dynamic stripe, 3-way mirror (RAID-1+0)


6-disk raidz (RAID-5)


2-set dynamic stripe, 3-disk raidz (RAID-5+0)


2-way mirror, 3-disk raidz (RAID-5+1)


6-disk raidz2 (RAID-6)


Clearly, using mirrors improves both performance and data reliability. Using stripes increases performance, at the cost of data reliability. raidz and raidz2 offer data reliability, at the cost of performance. This leads us down a rathole...

The Parity Performance Rathole

Many people expect that data protection schemes based on parity, such as raidz (RAID-5) or raidz2 (RAID-6), will offer the performance of striped volumes, except for the parity disk. In other words, they expect that a 6-disk raidz zpool would have the same small. random read performance as a 5-disk dynamic stripe. Similarly, they expect that a 6-disk raidz2 zpool would have the same performance as a 4-disk dynamic stripe. ZFS doesn't work that way, today. ZFS uses a checksum to validate the contents of a block of data written. The block is spread across the disks (vdevs) in the set. In order to validate the checksum, ZFS must read the blocks from more than one disk, thus not taking advantage of spreading unrelated, random reads concurrently across the disks. In other words, the small, random read performance of a raidz or raidz2 set is, essentially, the same as the single disk performance. The benefit of this design is that writes should be more reliable and faster because you don't have the RAID-5 write hole or read-modify-write performance penalty.

Many people also think that this is a design deficiency. As a RAS guy, I value the data validation offered by the checksum over the performance supposedly gained by RAID-5. Reasonable people can disagree, but perhaps some day a clever person will solve this for ZFS.

So, what do other logical volume managers or RAID arrays do? The results seem mixed. I have seen some RAID array performance characterization data which is very similar to the ZFS performance for parity sets. I have heard anecdotes that other implementations will read the blocks and only reconstruct a failed block as needed. The problem is, how do such systems know that a block has failed? Anecdotally, it seems that some of them trust what is read from the disk. To implement a per-disk block checksum verification, you'd still have to perform at least two reads from different disks, so it seems to me that you are trading off data integrity for performance. In ZFS, data integrity is paramount. Perhaps there is more room for research here, or perhaps it is just one of those engineering trade-offs that we must live with.

Other Performance Models

I'm also looking for other performance models which can be applied to generic disks with data that is readily available to the public. The reason that the small, random read iops model works is that it doesn't need to consider caching or channel resource utilization. Adding these variables would require some knowledge of the configuration topology and the cache policies (which may also change with firmware updates.) I've kicked around the idea of a total disk bandwidth model which will describe a range of possible bandwidths based upon the media speed of the drives, but it is not clear to me that it will offer any satisfaction. Drop me a line if you have a good model or further thoughts on this topic.

You should be cautious about extrapolating the performance results described here to other workloads. You could consider this to be a worst-case model because it assumes 0% disk cache hits. I would hope that most workloads exhibit better performance, but rather than guessing (hoping) the best way to find out is to run the workload and measure the performance. If you characterize a number of different configurations, then you might build your own performance graphs which fit your workload.

Putting It All Together

Now we have a method to compare a variety of different ZFS or RAID disk configurations by evaluating space, performance, and MTTDL. First, let's look at single parity schemes such as 2-way mirrors and raidz on the Sun Fire X4500 (aka Thumper) server.

Single Parity Model Results 

Here you can see that a 2-way mirror (RAID-1, RAID-1+0) has better performance and MTTDL than raidz for any specific space requirement except for the case where we run out of hot spares for the 2-way mirror (using all 46 disks for data). By contrast, all of the raidz configurations here have hot spares. You can use this to help make design trade-offs by prioritizing space, performance, and MTTDL.

You'll also note that I did not label the left-side Y axis (MTTDL) again, but I did label the right-side Y axis (small, random read iops). I did this with mixed emotion. I didn't label the MTTDL axis values as I explained previously. But I did label the performance axis so that you can do a rough comparison to the double parity graph below. Note that in the double parity graph, the MTTDL axis is in units of Millions of years, instead of years above.

Double Parity Model Results

Here you can see the same sort of comparison between 3-way mirrors and raidz2 sets. The mirrors still outperform the raidz2 sets and have better MTTDL.

Some people ask me which layout I prefer. For the vast majority of cases, I prefer mirroring. Then, of course, we get into some sort of discussion about how raidz, raidz2, RAID-5, RAID-6 or some other configuration offers more GBytes/$. For my 2007 New Year's resolution I've decided to help make the world a happier place.  If you want to be happier, you should use mirroring with at least one hot spare.


We can make design trade-offs between space, performance, and MTTDL for disk storage systems. As with most engineering decisions, there often is not a clear best solution given all of the possible solutions. By using some simple models, we can see the trade-offs more clearly.

Wednesday Jan 17, 2007

A story of two MTTDL models

Mean Time to Data Loss (MTTDL) is a metric that we find useful for comparing data storage systems. I think it is particularly useful for determining what sort of data protection you may want to use for a RAID system. For example, suppose you have a Sun Fire X4550 (aka Thumper) server with 48 internal disk drives. What would be the best way to configure the disks for redundancy? Previously, I explored space versus MTTDL and space versus unscheduled Mean Time Between System Interruptions (U_MTBSI) for the X4500 running ZFS. The same analysis works for SVM or LVM, too.

For this blog, I want to explore the calculation of MTTDL for a bunch of disks. It turns out, there are multiple models for calculating MTTDL. The one described previously here is the simplest and only considers the Mean Time Between Failure (MTBF) of a disk and the Mean Time to Repair (MTTR) of the repair and reconstruction process. I'll call that model #1 which solves for MTTDL[1]. To quickly recap:

For non-protected schemes (dynamic striping, RAID-0)
For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):
MTTDL[1] = MTBF2 / (N \* (N-1) \* MTTR)
For double parity schemes (3-way mirror, raidz2, RAID-6):
MTTDL[1] = MTBF3 / (N \* (N-1) \* (N-2) \* MTTR2)
You can often get MTBF data from your drive vendor and you can measure or estimate your MTTR with reasonable accuracy. But MTTDL[1] does not consider the Unrecoverable Error Rate (UER) for read operations on disk drives. It turns out that the UER is often easier to get from the disk drive data sheets, because sometimes the drive vendors don't list MTBF (or Annual Failure Rate, AFR) for all of their drive models. Typically, UER will be 1 per 1014 bits read for consumer class drives and 1 per 1015 for enterprise class drives. This can be alarming, because you could also say that consumer class drives should see 1 UER per 12.5 TBytes of data read. Today, 500-750 GByte drives are readily available and 1 TByte drives are announced. Most people will be unhappy if they get an unrecoverable read error once every dozen or so times they read the whole disk. Worse yet, if we have the data protected with RAID and we have to replace a drive, we really do hope that the data reconstruction completes correctly. To add to our nightmare, the UER does not decrease by adding disks. If we can't rely on the data to be correctly read, we can't be sure that our data reconstruction will succeed, and we'll have data loss. Clearly, we need a model which takes this into account. Let's call that model #2, for MTTDL[2]:
First, we calculate the probability of unsuccessful reconstruction due to a UER for N disks of a given size (unit conversion omitted):
Precon_fail = (N-1) \* size / UER
For single-disk failure protection:
MTTDL[2] = MTBF / (N \* Precon_fail)
For double-disk failure protection:
MTTDL[2] = MTBF2/ (N \* (N-1) \* MTTR \* Precon_fail)
Comparing the MTTDL[1] model to the MTTDL[2] model shows some interesting aspects of design. First, there is no MTTDL[2] model for RAID-0 because there is no data reconstruction – any failure and you lose data. Second, the MTTR doesn't enter into the MTTDL[2] model until you get to double-disk failure scenarios. You could nit pick about this, but as you'll soon see, it really doesn't make any difference for our design decision process. Third, you can see that the Precon_fail is a function of the size of the data set. This is because the UER doesn't change as you grow the data set. Or, to look at it from a different direction, if you use consumer class drives with 1 UER for 1014 bits, and you have 12.5 TBytes of data, the probability of an unrecoverable read during the data reconstruction is 1. Ugh. If the Precon_fail is 1, then the MTTDL[2] model looks a lot like the RAID-0 model and friends don't let friends use RAID-0! Maybe you could consider a smaller sized data set to offset this risk. Let's see how that looks in pictures.

 MTTDL models for 2-way mirror

2-way mirroring is an example of a configuration which provides single-disk failure protection. Each data point represents the space available when using 2-way mirrors in a zpool. Since this is for a X4500, we consider 46 total available disks and any disks not used for data are available as spares. In this graph you can clearly see that the MTTDL[1] model encourages the use of hot spares. More importantly, although the results of the calculations of the two models are around 5 orders of magnitude different, the overall shape of the curve remains the same. Keep in mind that we are talking years here, perhaps 10 million years, which is well beyond the 5-year expected life span of a disk. This is the nature of the beast when using a constant MTBF. For models which consider the change in MTBF as the device ages, you should never see such large numbers. But the wish for more accurate models does not change the relative merits of the design decision, which is what we really care about – the best RAID configuration given a bunch of disks. Should I use single disk failure protection or double disk failure protection? To answer that, lets look at the model for raidz2.

MTTDL models for raidz2 

From this graph you can see that double disk protection is clearly better than single disk protection above, regardless of which model we choose. Good, this makes sense. You can also see that with raidz2 we have a larger number of disk configuration options. A 3-disk raidz2 set is somewhat similar to a 3-way mirror with the best MTTDL, but doesn't offer much available space. A 4-disk set will offer better space, but not quite as good MTTDL. This pattern continues through 8 disks/set. Judging from the graphs, you should see that a 3-disk set will offer approximately an order of magnitude better MTTDL than an 8-disk, for either MTTDL model. This is because the UER remains constant while the data to be reconstructed increases.
I hope that these models give you an insight into how you can model systems for RAS. In my experience, most people get all jazzed up with the space and forget that they are often making a space vs. RAS trade-off. You can use these models to help you make good design decisions when configuring RAID systems. Since the graphs use Space on the X-axis, it is easy to look at the design trade-offs for a given amount of available space.

Just one more teaser... there are other MTTDL models, but it is unclear if they would help make better decisions, and I'll explore those in another blog.

Thursday Jan 11, 2007

ZFS RAID recommendations: space vs MTTDL

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets.  Which would be best? One analysis is the trade-off space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0)


For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF2 / (N \* (N-1) \* MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF3 / (N \* (N-1) \* (N-2) \* MTTR2)

Where MTBF is the Mean Time Between Failure and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it.  You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously.  In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM.  This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume. 

So, I took a large number of possible ZFS configurations for a X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.

X4500 MTTDL vs Space

The most obvious conclusion from the above data is that you shouldn't use simply dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You've noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is because they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL because the expected lifetime of a disk is on the order of 5 years.  I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole. 

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk zpool versus a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500.  You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than any of the other options for the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.


There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.


Monday Jan 01, 2007

Nevada build 55 -- nice!

Over the holiday break I installed Solaris Nevada build 55 on a few machines.  This is an important build and I think you'll see lots of people talking and blogging about it.  Here are a few reasons why I think it is cool:

  • StarOffice 8 is now integrated. Previous Nevada builds had StarOffice 7, which meant that I had to install and patch StarOffice 8 separately.  This significantly reduces the amount of time I needed to get my desktop up to snuff.
  • Studio 11 is now integrated.  This is another time saver.
  • NetBeans 5.5 is also integrated.  I use NetBeans quite a bit for Java programming. I fell in love with refactoring and the debugging environment.  I also do quite a bit of GUI work, and NetBeans has saved me many hundreds of hours of drudgery.
  • NVidia 3-D drivers are integrated.  Still more installation time savings.
I do install every other build of Nevada, and these changes have significantly reduced the time it takes for me to go from install boot to being ready as my primary desktop.  More importantly, it shows that once we can get past some of the (internal political) roadblocks, we can create a richly featured, easy to install, open source Solaris environment that showcases some of the best of Sun's technologies.

BTW, one of the cool things I've been playing with over the break is the integration of StarOffice with NetBeans. There are NetBeans wizards which you can use to create add-ons for StarOffice. Check out how easy it is to add a function to StarOffice Calc!




« July 2016