By bill on May 12, 2006
I still remember the smell...
Before Xerox took over the copying business, the world used the ditto machine. You know the one - blue ink, nice smell. I still remember popping a few brain cells sniffing my fresh-off-the-press homework papers in grade school. Alas, I'm sure many readers of this blog won't have had the chance to have the ditto machine experience, but that's progress for you.
Anyway, let's take a left at the next intersection and turn off of memory lane and rejoin the present.
One block to rule them all
In the abstract, you can think of ZFS (or any other filesystem, for that matter) as a tree of blocks. By this, I mean that there is a root block from which all other blocks are discoverable. Let's now imagine a case where you have a petabyte of data in your filesystem (or storage pool, in ZFS' case) and think about having enough failures such that a single block becomes unavailable. What happens?
If the failure is at a leaf block, that one block is now unreadable and an application will get EIO if it tries to access that block. Ok, fine, that doesn't seem too bad. Also, since typically more than 98% of the data in a storage pool is user data blocks (leaf blocks), this is the most likely scenario.
But let's consider what happens if that single block failure is near the top of the tree. Now, we've got a problem. This single failed block casts an expanding shadow of undiscoverable blocks all the way down the tree. If you have a lot of data, a significant portion of it may now be unavailable. Potentially hundreds of terabytes of data at the mercy of a single block. Probably not what you had in mind.
What is a user to do? Set up a 3-way mirror for all their data? Even though storage is cheap, it's not that cheap. A while back, we decided we needed to do better than this for ZFS. The result? Ditto blocks.
What are ditto blocks, you ask? ZFS has block pointers, which, as you might imagine, point to blocks on disk. The on-disk addresses they hold are what we call Disk Virtual Addresses (DVAs). Before we did our initial integration of ZFS into Solaris, I made room in the block pointers for not just one DVA, but up to three DVAs. Using these extra DVAs, we can store up to three copies of a block in three separate locations. Mind you, this is on top of whatever replication the pool already has. If you were to use a 3-way ditto block in a pool with mirrored disks, there would be six physical copies of that block.
We use ditto blocks to ensure that the more "important" a filesystem block is (the closer to the root of the tree), the more replicated it becomes. Our current policy is that we store one DVA for user data, two DVAs for filesystem metadata, and three DVAs for metadata that's global across all filesystems in the storage pool.
This has several nice properties. First, the blocks that are more critical to the health of the pool are more replicated, making them far less likely to suffer catastrophic failure. If P is the chance a given block will suffer an unrecoverable error in a given unit of time, then P^3 is the chance that a 3-way ditto block will fail in that same amount of time. This approaches zero very quickly.
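To put rough numbers on that, here is a minimal sketch in Python. It assumes the copies fail independently of one another, which is the assumption behind the cube above. The function name and the sample value of P are made up for illustration and are not measurements from any real drive.

```python
def ditto_failure_probability(p: float, copies: int) -> float:
    """Chance that every copy of a block is lost, assuming independent failures."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p ** copies


# Illustrative value only: a 1-in-1000 chance of losing any single copy.
p = 1e-3
for copies in (1, 2, 3):
    print(copies, ditto_failure_probability(p, copies))
```

Each extra copy multiplies the failure chance by P again, so with a small P the third copy buys several more orders of magnitude.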
Second, since almost every storage pool has the vast majority of data in user blocks (well over 98%), there is very little impact in terms of I/O and space consumption for utilizing ditto blocks. Very little data is global to all filesystems (for which we store three copies), and usually about 1.5-2% of the data is per-filesystem metadata, which means that there is about a 2% hit in terms of space and I/O for this added redundancy.
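The arithmetic behind that estimate can be sketched in a few lines of Python. The 2% figure for per-filesystem metadata comes from the text above; the tiny fraction used for pool-wide metadata is purely an assumed placeholder, and the function name is hypothetical.

```python
def ditto_space_overhead(fs_meta_fraction: float, pool_meta_fraction: float) -> float:
    """Extra space used by ditto copies, as a fraction of the pool's contents.

    Filesystem metadata gets one extra copy (two DVAs in total) and
    pool-wide metadata gets two extra copies (three DVAs in total).
    User data gets no extra copies under the default policy.
    """
    return 1 * fs_meta_fraction + 2 * pool_meta_fraction


# 2% per-filesystem metadata (from the text); 0.01% pool-wide metadata (assumed).
overhead = ditto_space_overhead(fs_meta_fraction=0.02, pool_meta_fraction=0.0001)
print(f"{overhead:.2%}")
```

Because the pool-wide metadata is so small, tripling it barely registers; the one extra copy of the per-filesystem metadata accounts for nearly all of the cost.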
Once we had ditto blocks, the next question was: where should we put the extra copies? The answer seems pretty obvious: as far apart as possible.
In a storage pool with only a single disk, we spread the blocks physically across the disk. Our policy aims to place the blocks at least 1/8 of the disk apart. This way, if there is a localized media failure (not all that uncommon on today's drives), you still have a copy elsewhere on that disk.
In a storage pool with multiple devices (vdevs), things get a little spicier. We allocate each copy of a block on a separate vdev. So even if an entire top-level vdev fails (a mirror or RAID-Z stripe), we can still access data. Furthermore, if you think of all the vdevs in your pool as forming a ring, we always try to allocate ditto blocks on the vdevs adjacent to the first copy. The reasoning behind this is a little subtle.
Imagine you have 100 vdevs making up your storage pool. To simplify things, further assume that all blocks in the pool are 2-way ditto blocks. If you simply placed the two copies of each block on two randomly chosen vdevs, any two-vdev failure would guarantee that you lose at least some data, because with enough blocks every pair of vdevs ends up holding both copies of something. Now consider our policy of only mirroring using neighboring vdevs. If two top-level vdevs fail, the only way you could possibly lose data is if they were two adjacent vdevs. There are 100 adjacent pairs out of 4,950 possible pairs, so given two failures, the probability of data loss goes from 100% to about 2%.
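The neighbor-policy number is easy to check by brute force. The sketch below, in Python, enumerates every possible pair of failed vdevs and counts the ones that are adjacent on the ring. It models only the simplified scenario above (all blocks 2-way ditto, second copy on the next vdev around the ring, every two-vdev failure equally likely); the function name is hypothetical and this is not how the ZFS allocator itself is written.

```python
from itertools import combinations


def loss_probability_ring(n_vdevs: int) -> float:
    """Fraction of two-vdev failures that take out both copies of some block
    when each block's second copy lives on the next vdev around the ring."""
    if n_vdevs < 2:
        raise ValueError("need at least two vdevs")
    # Each vdev shares blocks only with its ring neighbor.
    risky = {frozenset((i, (i + 1) % n_vdevs)) for i in range(n_vdevs)}
    all_pairs = [frozenset(pair) for pair in combinations(range(n_vdevs), 2)]
    lost = sum(1 for pair in all_pairs if pair in risky)
    return lost / len(all_pairs)


print(f"{loss_probability_ring(100):.4f}")
```

For 100 vdevs this counts 100 risky pairs out of 4,950, or roughly 2%. With random placement and enough blocks, every one of the 4,950 pairs would be risky.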
Don't try this at home
After writing this code and testing it, I thought what fun it would be to see it in action on my laptop. I created a new storage pool using a slice on my laptop drive, put a bunch of data on there, then wiped clean the first 1GB of that slice. As you might imagine, any of the file blocks that were unlucky enough to be allocated in that first 1GB were unreadable. However, I could still navigate the entire filesystem, typing "ls", "rm" and creating new files as much as I wanted. Pretty damn sweet. ZFS just survived a failure scenario that would send any other filesystem to tape. I know you'd still have to go to tape for the file contents that were damaged, but the filesystem was still 100% usable and I could get a list of files that were damaged by running zpool status -v. Careful readers will note that this command currently only gives you the object number, but it will give you the actual filename in the near future.
It's all good, mate
In the future, ditto blocks will be available for user data as well, on a per-filesystem basis. Imagine having your laptop with you on vacation (as I recently did). A single disk. You can create a filesystem for your digital photos and specify that you want it to use 2- or 3-way ditto blocks for your pictures. ZFS will spread each copy of each block far apart on the disk so that you wind up with an effect very close to mirroring with only a single disk. Furthermore, since we have a pretty decent I/O scheduler, it doesn't rattle the crap out of your disk drive.
This can lead to even more fun if you create a storage pool with several non-replicated disks. You'll be able to keep non-replicated data (for a build area or web cache) and, in the same storage pool, use ditto blocks to mirror your "important" data. How cool would that be?
But wait! What if a filesystem requests 2-way ditto blocks for user data? Wouldn't that mean that the filesystem metadata is no more replicated than its contents? Actually, we calculate the per-filesystem and per-pool metadata replication to be +1 and +2 compared to the user data (capped at 3, of course). So we do our best to have the same semantics, even when user data utilizes ditto blocks. More fun than you can shake a stick at.
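The +1/+2 rule with its cap is small enough to write down. This is only a sketch in Python of the policy as described here, not the actual ZFS code; the function name, the dictionary keys, and the constant name are all made up for illustration.

```python
MAX_DVAS = 3  # a block pointer has room for at most three DVAs


def ditto_copies(user_copies: int = 1) -> dict:
    """Copies stored per class of block for a filesystem that requests
    user_copies for its data: +1 for filesystem metadata and +2 for
    pool-wide metadata, each capped at MAX_DVAS."""
    if not 1 <= user_copies <= MAX_DVAS:
        raise ValueError(f"user_copies must be between 1 and {MAX_DVAS}")
    return {
        "user_data": user_copies,
        "filesystem_metadata": min(user_copies + 1, MAX_DVAS),
        "pool_metadata": min(user_copies + 2, MAX_DVAS),
    }


for n in (1, 2, 3):
    print(n, ditto_copies(n))
```

With the default of one copy for user data this gives the 1/2/3 policy described earlier. At two copies the metadata classes both land on 3, and at three copies everything sits at the cap.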
Finally, you have to admit that the name is kinda catchy. I can smell the blue ink from here...