I still remember the smell...
Back before the photocopier took over the copying business, the world used the ditto machine.
You know the one - blue ink, nice smell. I still remember popping a few
brain cells sniffing my fresh-off-the-press homework papers in grade
school. Alas, I'm sure many readers of this blog won't have had the
chance to experience the ditto machine, but that's progress for you.
Anyway, let's take a left at the next intersection and turn off of
memory lane and rejoin the present.
One block to rule them all
In the abstract, you can think of ZFS (or any other filesystem, for
that matter) as a tree of blocks. By this, I mean that there is a root
block from which all other blocks are discoverable. Let's now imagine
a case where you have a petabyte of data in your filesystem (or storage
pool, in ZFS' case) and think about having enough failures such that a
single block becomes unavailable. What happens?
If the failure is at a leaf block, that one block is now unreadable and
an application will get EIO if it tries to access that block. Ok,
fine, that doesn't seem too bad. Also, since typically more than 98%
of the data in a storage pool is user data blocks (leaf blocks), this
is the most likely scenario.
But let's consider what happens if that single block failure is near
the top of the tree. Now, we've got a problem. This single failed
block casts an expanding shadow of undiscoverable blocks all the way
down the tree. If you have a lot of data, a significant portion of it
may now be unavailable. Potentially hundreds of terabytes of data at
the mercy of a single block. Probably not what you had in mind.
What is a user to do? Set up a 3-way mirror for all their data? Even
though storage is cheap, it's not that cheap. A while back, we
decided we needed to do better than this for ZFS. The result? Ditto blocks.
What are ditto blocks, you ask? ZFS has block pointers, which, as you
might imagine, point to blocks on disk. We call these Disk Virtual
Addresses (DVAs). Before we did our initial integration of ZFS into
Solaris, I made room in the block pointers for not just one DVA, but up
to three DVAs. Using these extra DVAs, we can store up to three copies
of a block in three separate locations. Mind you, this is on top of
whatever replication the pool already has. If you were to use a 3-way
ditto block in a pool with mirrored disks, that means that there would
be six physical copies of that block.
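To make that concrete, here's a rough sketch in C of what a block
pointer with room for three DVAs looks like. This is simplified for
illustration, not the actual blkptr_t from the ZFS source (the real
one also carries a checksum, birth transaction group, and other
goodies):

    #include <stdint.h>

    /* A DVA names a physical location: which vdev, and where on it. */
    typedef struct dva {
            uint64_t dva_word[2];   /* packed vdev id, offset, size */
    } dva_t;

    /*
     * Simplified block pointer: up to three DVAs, so one logical
     * block can live in up to three physical places.  Unused slots
     * are simply left zeroed.
     */
    typedef struct blkptr {
            dva_t    blk_dva[3];    /* one, two, or three copies */
            uint64_t blk_prop;      /* block size, type, etc. */
    } blkptr_t;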
We use ditto blocks to ensure that the more "important" a filesystem
block is (the closer to the root of the tree), the more replicated it
becomes. Our current policy is that we store one DVA for user data,
two DVAs for filesystem metadata, and three DVAs for metadata that's
global across all filesystems in the storage pool.
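In sketch form, the policy boils down to something like this (the
function and names here are invented for illustration, not lifted
from the ZFS source):

    enum block_class { USER_DATA, FS_METADATA, POOL_METADATA };

    /* How many DVAs the policy above fills in for a given block. */
    static int
    ditto_copies(enum block_class bclass)
    {
            switch (bclass) {
            case POOL_METADATA:
                    return (3);     /* global across all filesystems */
            case FS_METADATA:
                    return (2);     /* per-filesystem metadata */
            default:
                    return (1);     /* plain user data */
            }
    }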
This has several nice properties. First, the blocks that are more
critical to the health of the pool are more replicated, making them far
less likely to suffer catastrophic failure. If P is the chance that
a given block will suffer an unrecoverable error in a given unit of
time, then P³ is the chance that a 3-way ditto block will fail in
that same amount of time (assuming the failures are independent).
This approaches zero very quickly: if P is one in a million, P³ is
one in a quintillion.
Second, since almost every storage pool has the vast majority of data
in user blocks (well over 98%), there is very little impact in terms of
I/O and space consumption for utilizing ditto blocks. Very little data
is global to all filesystems (for which we store three copies), and
usually about 1.5-2% of the data is per-filesystem metadata, which
means that there is about a 2% hit in terms of space and I/O for this
feature.
Once we had ditto blocks, the next question was: where should we put
the extra copies? The answer seems pretty obvious: as far apart as
possible.
In a storage pool with only a single disk, we spread the blocks
physically across the disk. Our policy aims to place the blocks at
least 1/8 of the disk apart. This way, if there is a localized media
failure (not all that uncommon on today's drives), you still have a
copy elsewhere on that disk.
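A minimal sketch of that placement rule, with a made-up helper (the
real allocator also has to worry about free space and fragmentation,
so treat this as the idea, not the implementation):

    /*
     * Choose a starting offset for the next copy of a block, at
     * least 1/8 of the device away from the previous copy, wrapping
     * around the end of the disk.
     */
    static uint64_t
    spread_offset(uint64_t prev_offset, uint64_t disk_size)
    {
            return ((prev_offset + disk_size / 8) % disk_size);
    }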
In a storage pool with multiple devices (vdevs), things get a little
spicier. We allocate each copy of a block on a separate vdev. So even
if an entire top-level vdev fails (a mirror or RAID-Z stripe), we can
still access data. Furthermore, if you think of all the vdevs in your
pool as forming a ring, we always try to allocate ditto blocks on the
vdevs adjacent to the first copy. The reasoning behind this is a bit
subtle.
Imagine you have 100 vdevs making up your storage pool. To simplify
things, further assume that all blocks in the pool are 2-way ditto
blocks. If you just allocated the two copies of each block on two
random vdevs, then with enough blocks in the pool, any two-vdev
failure is guaranteed to lose at least some data.
Now consider our policy of only mirroring onto neighboring vdevs. If
you have two top-level vdevs fail, the only way you could possibly lose
data is if they were two adjacent vdevs. Given the first failure, only
two of the remaining 99 vdevs are its neighbors, so the probability of
data loss given two failures drops from 100% to about 2%.
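You can sanity-check that figure with a few lines of C, assuming a
ring of 100 vdevs and that every block is a 2-way ditto block as
above:

    #include <stdio.h>

    int
    main(void)
    {
            int n = 100;                            /* top-level vdevs */
            double pairs = (double)n * (n - 1) / 2; /* possible double failures */
            double adjacent = n;                    /* adjacent pairs in a ring */

            /*
             * Random placement: with enough blocks, every pair of
             * vdevs shares some block, so any double failure loses
             * data.  Neighbor placement: only an adjacent pair does.
             */
            printf("random policy:   100.00%% chance of data loss\n");
            printf("neighbor policy: %.2f%% chance of data loss\n",
                100.0 * adjacent / pairs);
            return (0);
    }

which prints a hair over 2% (2/99, to be exact) for the neighbor
policy.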
Don't try this at home
After writing this code and testing it, I thought what fun it would be
to see it in action on my laptop. I created a new storage pool using a
slice on my laptop drive, put a bunch of data on there, then wiped
clean the first 1GB of that slice. As you might imagine, any of the
file blocks that were unlucky enough to be allocated in that first 1GB
were unreadable. However, I could still navigate the entire
filesystem, typing "ls", "rm" and creating new files as much as I
wanted. Pretty damn sweet. ZFS just survived a failure scenario that
would send any other filesystem to tape. I know you'd still have to
go to tape for the file contents that were damaged, but the filesystem
was still 100% usable, and I could get a list of the damaged files by
running zpool status -v. The careful reader will note that this
command currently only gives you the object number, but it will give
you the actual filename in the near future.
It's all good, mate
In the future, ditto blocks will be available for user data as well on a
per-filesystem basis. Imagine having your laptop with you on vacation (as I
recently did). A single disk. You can create a filesystem for your digital
photos and specify that you want it to use 2- or 3-way ditto blocks for your
pictures. ZFS will spread each copy of each block far apart on the disk so
that you wind up with an effect very close to mirroring with only a single
disk. Furthermore, since we have a pretty decent I/O scheduler,
it doesn't rattle the crap out of your disk drive.
This can lead to even more fun if you create a storage pool with
several non-replicated disks. You'll be able to keep non-replicated
data (for a build area or web cache) and, in the same storage pool,
use ditto blocks to mirror your "important" data. How cool
would that be?
But wait! What if a filesystem requests 2-way ditto blocks for user
data? Wouldn't that mean that the filesystem metadata is no more
replicated than its contents? Actually, we calculate the
per-filesystem and per-pool metadata replication to be +1 and +2
compared to the user data (capped at 3, of course). So we do our best
to have the same semantics, even when user data utilizes ditto blocks.
More fun than you can shake a stick at.
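As a sketch (again with illustrative names, not the actual ZFS code),
that rule looks like:

    /*
     * Metadata replication relative to the user data setting:
     * filesystem metadata gets one extra copy, pool-wide metadata
     * two, never exceeding the three DVAs in a block pointer.
     */
    static int
    metadata_copies(int data_copies, int extra)  /* extra: 1 or 2 */
    {
            int copies = data_copies + extra;

            return (copies > 3 ? 3 : copies);
    }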
Finally, you have to admit that the name is kinda catchy. I can smell
the blue ink from here...