The job of any filesystem boils down to this: when asked to read a block,
it should return the same data that was previously written to that block.
If it can't do that -- because the disk is offline or the data has been
damaged or tampered with -- it should detect this and return an error.
Incredibly, most filesystems fail this test. They depend on
the underlying hardware to detect and report errors. If a disk
simply returns bad data, the average filesystem won't even detect it.
Even if we could assume that all disks were perfect, the data would still
be vulnerable to damage in transit: controller bugs, DMA parity errors,
and so on. All you'd really know is that the data was intact when it
left the platter. If you think of your data as a package, this would
be like UPS saying, "We guarantee that your package wasn't damaged
when we picked it up." Not quite the guarantee you were looking for.
In-flight damage is not a mere academic concern: even something as mundane
as a bad power supply can cause silent data corruption.
Arbitrarily expensive storage arrays can't solve the problem. The I/O path
remains just as vulnerable, but becomes even longer: after leaving the platter,
the data has to survive whatever hardware and firmware bugs the array
has to offer.
And if you're on a SAN, you're using a network designed by disk firmware
writers. God help you.
What to do? One option is to store a checksum with every disk block.
Most modern disk drives can be formatted with sectors that are slightly
larger than the usual 512 bytes -- typically 520 or 528. These extra
bytes can be used to hold a block checksum. But making good use of this
checksum is harder than it sounds: the effectiveness of a checksum
depends tremendously on where it's stored and when it's evaluated.
In many storage arrays (see the Dell|EMC PowerVault paper for a typical
example with an excellent description of the issues),
the data is compared to its checksum inside the array.
Unfortunately this doesn't help much. It doesn't detect common firmware
bugs such as phantom writes (the previous write never made it to disk) because
the data and checksum are stored as a unit -- so they're self-consistent
even when the disk returns stale data.
And the rest of the I/O path from the array to the host remains unprotected.
In short, this type of block checksum provides a good way to ensure that
an array product is not any less reliable than the disks it contains,
but that's about all.
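To make the phantom-write problem concrete, here's a minimal Python sketch (a toy model, not real array firmware) of a block-appended checksum scheme. Because the data and its checksum are stored as a unit, a write that never reaches the disk leaves behind a stale block that still validates perfectly:

```python
import hashlib

# Toy model of a disk whose sectors store data and checksum together,
# as block-appended schemes do.
class Disk:
    def __init__(self):
        self.sectors = {}

    def write(self, lba, data, phantom=False):
        if phantom:          # firmware bug: the write is silently dropped
            return
        self.sectors[lba] = (data, hashlib.sha256(data).digest())

    def read(self, lba):
        data, csum = self.sectors[lba]
        # The array compares the data to the checksum stored *with* it...
        assert hashlib.sha256(data).digest() == csum, "checksum mismatch"
        return data

disk = Disk()
disk.write(7, b"old contents")
disk.write(7, b"new contents", phantom=True)  # never made it to the platter
# The read passes validation -- stale data and stale checksum are
# self-consistent -- yet the caller gets the wrong data.
assert disk.read(7) == b"old contents"
```

The checksum here catches bit rot within a sector, but it cannot say anything about whether this is the data you last wrote.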
NetApp's block-appended checksum approach appears similar but is in fact
much stronger. Like many arrays, NetApp formats its drives with 520-byte
sectors. It then groups them into 8-sector blocks: 4K of data (the WAFL
filesystem blocksize) and 64 bytes of checksum. When WAFL reads a block
it compares the checksum to the data just like an array would,
but there's a key difference: it does this comparison after
the data has made it through the I/O path, so it validates that the
block made the journey from platter to memory without damage in transit.
This is a major improvement, but it's still not enough. A block-level
checksum only proves that a block is self-consistent; it doesn't prove
that it's the right block. Reprising our UPS analogy,
"We guarantee that the package you received is not damaged.
We do not guarantee that it's your package."
The fundamental problem with all of these schemes is that they don't
provide fault isolation
between the data and the checksum that protects it.
ZFS Data Authentication
End-to-end data integrity requires that each data block be verified
against an independent checksum, after the data has arrived
in the host's memory. It's not enough to know that each block is merely
consistent with itself, or that it was correct at some earlier point
in the I/O path. Our goal is to detect every possible form of damage,
including human mistakes like swapping on a filesystem disk or mistyping
the arguments to dd(1). (Have you ever typed "of=" when you meant "if="?)
A ZFS storage pool is really just a tree of blocks.
ZFS provides fault isolation between data and checksum by storing the
checksum of each block in its parent block pointer -- not in the block
itself. Every block in the tree contains the checksums for all its children,
so the entire pool is self-validating. [The uberblock (the root of the
tree) is a special case because it has no parent; more on how we handle
that in another post.]
When the data and checksum disagree, ZFS knows that the checksum can be
trusted because the checksum itself is part of some other block that's
one level higher in the tree, and that block has already been validated.
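The shape of this scheme can be sketched in a few lines of Python (a hypothetical miniature, not ZFS code): a block pointer records both where a child lives and what its checksum must be, so the checksum travels with the parent and is verified in host memory, independently of the disk:

```python
import hashlib

def sha256(b):
    return hashlib.sha256(b).digest()

# A block pointer holds the child's address *and* its expected checksum.
class BlockPtr:
    def __init__(self, addr, csum):
        self.addr, self.csum = addr, csum

storage = {}  # addr -> raw bytes, standing in for the disks

def write_block(addr, data):
    storage[addr] = data
    return BlockPtr(addr, sha256(data))   # checksum lives in the parent

def read_block(bp):
    data = storage[bp.addr]
    if sha256(data) != bp.csum:           # validated in the host, against
        raise IOError("checksum error")   # an independent copy
    return data

bp = write_block(1, b"user data")
storage[1] = b"stale or corrupt data"     # silent corruption on disk
try:
    read_block(bp)
except IOError as e:
    print(e)                              # the parent's checksum catches it
```

Because the pointer was itself read from a block that was validated the same way, a corrupt disk block can never vouch for itself.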
ZFS uses its end-to-end checksums to detect and correct silent data
corruption. If a disk
returns bad data transiently, ZFS will detect it and retry the read.
If the disk is part of a mirror or
RAID-Z group, ZFS will both detect and correct the error: it will use the checksum
to determine which copy is correct, provide good data to the application,
and repair the damaged copy.
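A sketch of the self-healing read, again as hypothetical Python rather than the real ZFS code path: given the expected checksum from the parent block pointer and the copies from a two-way mirror, pick the copy that matches, repair the one that doesn't, and only fail if no copy is good:

```python
import hashlib

def sha256(b):
    return hashlib.sha256(b).digest()

def mirror_read(sides, expected_csum):
    """Return good data from a mirror, repairing any damaged copy."""
    for data in sides:
        if sha256(data) == expected_csum:
            # Rewrite any sibling that fails the checksum with good data.
            for j in range(len(sides)):
                if sha256(sides[j]) != expected_csum:
                    sides[j] = data
            return data
    raise IOError("no valid copy")   # all copies damaged: report, don't guess

good = b"important data"
sides = [b"\x00garbage\x00", good]          # side 0 silently corrupted
assert mirror_read(sides, sha256(good)) == good
assert sides[0] == good                     # damaged copy has been repaired
```

The crucial point is that the checksum, not a vote between copies, decides which side is right, so even a single surviving good copy is enough.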
As always, note that ZFS end-to-end data integrity doesn't require
any special hardware.
You don't need pricey disks or arrays,
you don't need to reformat drives with 520-byte sectors, and you don't have
to change anything to benefit from it. It's entirely automatic, and it works with cheap disks.
But wait, there's more!
The blocks of a ZFS storage pool form a
Merkle tree in which
each block validates all of its children. Merkle trees have been proven
to provide cryptographically-strong authentication for any component of
the tree, and for the tree as a whole.
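One consequence of the Merkle structure is easy to demonstrate (a simplified sketch with a fixed binary fan-out, unlike real ZFS indirect blocks): the root checksum is a digest of every block beneath it, so changing any leaf anywhere changes the root:

```python
import hashlib

def h(b):
    return hashlib.sha256(b).digest()

# Each parent "block" is just the concatenation of its children's
# checksums, hashed in turn -- a Merkle tree.
def root_digest(leaves):
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

pool = [b"block0", b"block1", b"block2", b"block3"]
before = root_digest(pool)
pool[2] = b"flipped bit"                  # change any leaf anywhere...
assert root_digest(pool) != before        # ...and the root digest changes
```

This is what makes the uberblock checksum act as a signature for the whole pool: two pools with the same root digest hold the same data.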
ZFS employs 256-bit checksums for every block,
and offers checksum functions ranging from the simple-and-fast
fletcher2 (the default) to the slower-but-secure SHA-256.
When using a cryptographic hash like SHA-256, the uberblock checksum provides
a constantly up-to-date digital signature for the entire storage pool.
Which comes in handy if you ask UPS to move it.