Tuesday Jan 16, 2007

Diversity revisited

The recent disruption of the internet connection between Taiwan and the mainland is an example of a system without much diversity. This is a rather large scale example, perhaps one of the largest such examples on earth. The lesson is that the diversity in your design needs to be able to cover the expected fault (pun intended) zone. In this case, what was lost was capacity: the international connection capacity was reduced to approximately 50%.

Another, more local example of diversity in communications recently happened to my colleague, Mike Anuta, in Wisconsin. A construction accident severed the underground fiber link, disconnecting the local area from the rest of the world. Mike says the accident also clobbered a bunch of the cellular service. Being a clever engineer, and having a working internet link, Mike was able to access the most important application at Sun, e-mail. The big question is whether it would also be worth the cost to add a VOIP account, if the pricing were reasonable. So, this is another example of how you can work around system faults using diversity. Remember, you can run IP over almost anything from cups-n-strings to radio links to fiber -- all using a protocol (IP) designed to overcome link faults.

Thursday Jan 11, 2007

ZFS RAID recommendations: space vs MTTDL

It is not always obvious what the best RAID set configuration should be for a given set of disks. This is even more difficult to see as the number of disks grows large, like on a Sun Fire X4500 (aka Thumper) server. By default, the X4500 ships with 46 disks available for data. This leads to hundreds of possible permutations of RAID sets. Which would be best? One analysis is the trade-off between space and Mean Time To Data Loss (MTTDL). For this blog, I will try to stick with ZFS terminology in the text, but the principles apply to other RAID systems, too.

The space calculation is straightforward.  Configure the RAID sets and sum the total space available.

The MTTDL calculation is one attribute of Reliability, Availability, and Serviceability (RAS) which we can also calculate relatively easily. For large numbers of disks, MTTDL is particularly useful because we only need to consider the reliability of the disks, and not the other parts of the system (fodder for a later blog :-). While this doesn't tell the whole RAS story, it is a very good method for evaluating a big bunch of disks. The equations are fairly straightforward:

For non-protected schemes (dynamic striping, RAID-0):

MTTDL = MTBF / N

For single parity schemes (2-way mirror, raidz, RAID-1, RAID-5):

MTTDL = MTBF^2 / (N * (N-1) * MTTR)

For double parity schemes (3-way mirror, raidz2, RAID-6):

MTTDL = MTBF^3 / (N * (N-1) * (N-2) * MTTR^2)

Where N is the number of disks in the set, MTBF is the Mean Time Between Failure, and MTTR is the Mean Time To Recover. You can get MTBF values from disk data sheets, which are usually readily available. You could also adjust them for your situation or based upon your actual experience. At Sun, we have many years of field failure data for disks and use design values which are consistent with our experiences. YMMV, of course. For MTTR you need to consider the logistical repair time, which is usually the time required to identify the failed disk and physically replace it. You also need to consider the data reconstruction time, which may be a long time for large disks, depending on how rapidly ZFS or your logical volume manager (LVM) will reconstruct the data. Obviously, a spreadsheet or tool helps ease the computational burden.
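If you would rather script it than use a spreadsheet, the equations translate almost directly into code. The following is a minimal sketch, not the tool I used: the function names are my own for illustration, the unprotected case assumes the usual MTBF / N form, and the MTBF and MTTR values in the usage lines are placeholders rather than Sun design values. All times are in hours.

```python
HOURS_PER_YEAR = 24 * 365


def mttdl_no_protection(mtbf, n):
    """Dynamic striping / RAID-0: any single disk failure loses data."""
    return mtbf / n


def mttdl_single_parity(mtbf, mttr, n):
    """2-way mirror, raidz, RAID-1, RAID-5: data is lost when a second
    disk fails while the first is still being recovered."""
    return mtbf ** 2 / (n * (n - 1) * mttr)


def mttdl_double_parity(mtbf, mttr, n):
    """3-way mirror, raidz2, RAID-6: data is lost on a third overlapping failure."""
    return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)


if __name__ == "__main__":
    mtbf = 1_000_000.0  # hours; placeholder, take yours from a data sheet
    mttr = 24.0         # hours; placeholder for logistics plus reconstruction
    n = 6               # disks in the set
    for name, hours in [
        ("stripe", mttdl_no_protection(mtbf, n)),
        ("raidz", mttdl_single_parity(mtbf, mttr, n)),
        ("raidz2", mttdl_double_parity(mtbf, mttr, n)),
    ]:
        print(f"{name:7s} MTTDL = {hours / HOURS_PER_YEAR:.3g} years")
```

Keeping everything in hours avoids unit slips, since both MTBF and MTTR appear raised to different powers.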

Note: the reconstruction time for ZFS is a function of the amount of data, not the size of the disk. Traditional LVMs or hardware RAID arrays have no context of the data and therefore have to reconstruct the entire disk rather than just reconstruct the data. In the best case (0% used), ZFS will reconstruct the data almost instantaneously.  In the worst case (100% used) ZFS will have to reconstruct the entire disk, just like a traditional LVM.  This is one of the advantages of ZFS over traditional LVMs: faster reconstruction time, lower MTTR, better MTTDL.
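To get a feel for how much this matters, here is a rough sketch of the reconstruction part of MTTR. It assumes reconstruction time scales linearly with the amount of data and that the reconstruction rate is constant; both are simplifications, and the function name, disk size, and rate are hypothetical values chosen only for illustration.

```python
def reconstruction_hours(disk_bytes, rate_bytes_per_sec, fraction_used=1.0):
    """Estimate reconstruction time in hours.

    fraction_used=1.0 models a traditional LVM or hardware RAID array,
    which rebuilds the whole disk. Smaller values model ZFS, which only
    rebuilds the data actually stored (assumed linear in the data size).
    """
    if not 0.0 <= fraction_used <= 1.0:
        raise ValueError("fraction_used must be between 0 and 1")
    return disk_bytes * fraction_used / rate_bytes_per_sec / 3600.0


if __name__ == "__main__":
    disk = 500e9  # bytes; hypothetical disk size
    rate = 50e6   # bytes/second; hypothetical reconstruction rate
    for used in (0.0, 0.25, 0.5, 1.0):
        hours = reconstruction_hours(disk, rate, used)
        print(f"{used:4.0%} used: {hours:.2f} hours")
```

Add your logistical repair time to this figure to get the MTTR to plug into the MTTDL equations.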

Note: if you have a multi-level RAID set, such as RAID-1+0, then you need to use both the single parity and no protection MTTDL calculations to get the MTTDL of the top-level volume. 
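As a sketch of that two-step calculation, consider a dynamic stripe across several 2-way mirrors. One way to combine the equations, and it is an approximation, is to treat each mirror as a single "disk" whose MTBF is the mirror's MTTDL, and then apply the no-protection equation (MTBF / N) across the mirrors. The function names and input values below are hypothetical.

```python
def mttdl_two_way_mirror(mtbf, mttr):
    """Single parity equation with N = 2 disks."""
    return mtbf ** 2 / (2 * 1 * mttr)


def mttdl_striped(mttdl_per_set, num_sets):
    """No-protection equation applied across the sets: losing any one
    set loses the top-level volume."""
    return mttdl_per_set / num_sets


if __name__ == "__main__":
    mtbf = 1_000_000.0  # hours; placeholder value
    mttr = 24.0         # hours; placeholder value
    mirror = mttdl_two_way_mirror(mtbf, mttr)
    pool = mttdl_striped(mirror, num_sets=23)  # 46 disks as 23 mirrors
    print(f"one mirror: {mirror:.3g} hours, striped pool: {pool:.3g} hours")
```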

So, I took a large number of possible ZFS configurations for an X4500 and calculated the space and MTTDL for the zpool. The interesting thing is that the various RAID protection schemes fall out in clumps. For example, you would expect that a 3-way mirror has better MTTDL and less available space than a 2-way mirror. As you vary the configurations, you can see the changes in space and MTTDL, but you would never expect a 2-way mirror to have better MTTDL than a 3-way mirror. The result is that if you plot the available space against the MTTDL, then the various RAID configurations will tend to clump together.
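If you want to try a sweep like this yourself, the sketch below shows one way to do it. It is not the tool behind my numbers: the disk size, MTBF, and MTTR are placeholders, leftover disks that don't fill a complete set are simply ignored, and the pool MTTDL is approximated as the per-set MTTDL divided by the number of sets.

```python
def set_mttdl(mtbf, mttr, n, parity):
    """MTTDL in hours for one RAID set of n disks with 0, 1, or 2 levels of redundancy."""
    if parity == 0:
        return mtbf / n
    if parity == 1:
        return mtbf ** 2 / (n * (n - 1) * mttr)
    if parity == 2:
        return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)
    raise ValueError("parity must be 0, 1, or 2")


def evaluate(kind, set_size, total_disks, disk_gb, mtbf, mttr):
    """Return (space_gb, mttdl_hours) for a pool built from identical sets."""
    if kind == "stripe":
        data_disks, parity = set_size, 0
    elif kind == "mirror":          # 2-way -> parity 1, 3-way -> parity 2
        data_disks, parity = 1, set_size - 1
    elif kind == "raidz":
        data_disks, parity = set_size - 1, 1
    elif kind == "raidz2":
        data_disks, parity = set_size - 2, 2
    else:
        raise ValueError(f"unknown kind: {kind}")
    num_sets = total_disks // set_size  # leftover disks are ignored here
    space_gb = num_sets * data_disks * disk_gb
    mttdl = set_mttdl(mtbf, mttr, set_size, parity) / num_sets
    return space_gb, mttdl


if __name__ == "__main__":
    configs = [("stripe", 1), ("mirror", 2), ("mirror", 3)]
    configs += [("raidz", n) for n in range(3, 10)]
    configs += [("raidz2", n) for n in range(4, 10)]
    for kind, size in configs:
        # 46 data disks; 500 GB, MTBF, and MTTR are placeholder values.
        space, mttdl = evaluate(kind, size, 46, 500, 1_000_000.0, 24.0)
        print(f"{kind:6s} x{size}: {space:6d} GB, MTTDL {mttdl:.3g} hours")
```

Plot the space against the log of the MTTDL and you should see the same kind of clumping by protection scheme.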

[Figure: X4500 MTTDL vs Space]

The most obvious conclusion from the above data is that you shouldn't use simple dynamic striping or RAID-0. Friends don't let friends use RAID-0!

You will also notice that I've omitted the values on the MTTDL axis. You may have noticed that the MTTDL axis uses a log scale, so that should give you a clue as to the magnitude of the differences. The real reason I've omitted the values is because they are a rat hole opportunity with a high entrance probability. It really doesn't matter if the MTTDL calculation shows that you should see a trillion years of MTTDL because the expected lifetime of a disk is on the order of 5 years. I don't really expect any disk to last more than a decade or two. What you should take away from this is that bigger MTTDL is better, and you get a much bigger MTTDL as you increase the number of redundant copies of the data. It is better to stay out of the MTTDL value rat hole.

The other obvious conclusion is that you should use hot spares. The reason for this is that when a hot spare is used, the MTTR is decreased because we don't have to wait for the physical disk to be replaced before we start data reconstruction on a spare disk. The time you must wait for the data to be reconstructed and available is time where you are exposed to another failure which may cause data loss. In general, you always want to increase MTBF (the numerator) and decrease MTTR (the denominator) to get high RAS.
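A quick sketch shows the size of the effect. It assumes that a hot spare removes the logistical repair time from MTTR entirely, leaving only the reconstruction time, which is a simplification. The function name and all of the values are hypothetical.

```python
def hot_spare_improvement(mtbf, n, logistical_hours, reconstruct_hours, parity=1):
    """Return (mttdl_without_spare, mttdl_with_spare) in hours.

    Without a spare, MTTR = logistical + reconstruction time.
    With a spare, MTTR is assumed to be the reconstruction time alone.
    """
    def mttdl(mttr):
        if parity == 1:
            return mtbf ** 2 / (n * (n - 1) * mttr)
        if parity == 2:
            return mtbf ** 3 / (n * (n - 1) * (n - 2) * mttr ** 2)
        raise ValueError("parity must be 1 or 2")

    return mttdl(logistical_hours + reconstruct_hours), mttdl(reconstruct_hours)


if __name__ == "__main__":
    # Placeholder values: 6-disk sets, 40 hours to get a replacement disk
    # installed, 10 hours to reconstruct the data.
    for parity, name in ((1, "raidz"), (2, "raidz2")):
        without, with_spare = hot_spare_improvement(1_000_000.0, 6, 40.0, 10.0, parity)
        print(f"{name}: hot spare improves MTTDL by {with_spare / without:.0f}x")
```

Because MTTR is squared in the double parity equation, the same reduction in MTTR buys proportionally more for raidz2 than for raidz.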

The most interesting result of this analysis is that the RAID configurations will tend to clump together. For example, there isn't much difference between the MTTDL of a 5-disk raidz zpool versus a 6-disk raidz zpool.

But if you look at this data another way, there is a huge difference in the RAS. For example, suppose you want 15,000 GBytes of space in your X4500. You could use either raidz or raidz2 with or without spares. Clearly, you would have better RAS if you choose raidz2 with spares than with any of the other options that meet the space requirement. Whether you use 6, 7, 8, or 9 disks in your raidz2 set makes less difference in MTTDL.


There are other considerations when choosing the ZFS or RAID configurations which I plan to address in later blogs. For now, I hope that this will encourage you to think about how you might approach the space and RAS trade-offs for your storage configurations.




