The following is a write-up by Darrick Wong, an Oracle mainline Linux kernel
engineer, providing some background on his work on Linux filesystem metadata
checksumming, which, after many years of work, will be turned on by default in
the upcoming e2fsprogs 1.43 and xfsprogs 3.2.3.
One of the bigger problems facing filesystems today is the problem of online
verification of the integrity of the metadata. Even though storage bandwidth
has increased considerably and (in some cases) seek times have dropped to
nearly zero, the forensic work required to square a filesystem back to sense
increases at least as quickly as metadata size, which scales up about as
quickly as total storage capacity. Furthermore, the threat of random bit
corruption in a critical piece of metadata causing unrecoverable filesystem
damage remains as true as it ever was — the author has encountered scenarios
where corruption in the block usage data structure results in the block
allocator crosslinking file data with metadata, which multiplies the resulting
damage.
Self-describing metadata helps both the kernel and the repair tools to decide
if a block actually contains the data the filesystem is trying to read. In
most cases, this involves tagging each metadata block with a tuple describing
the type of the block, the block number where the block lives, a unique
identifier tying the block to the filesystem (typically the FS UUID), the
checksum of the data in the block, and some sort of pointer to the metadata
object that points to the block (the owner). For a transactional filesystem,
it is useful also to record the transaction ID to facilitate analyzing where in
time a corruption happened. Storing the FS UUID is useful in deciding whether
an arbitrary metadata block actually belongs to this filesystem, or if it
belongs to something else — a previous filesystem or perhaps an FS image
stored inside the filesystem. Given a mental model of an FS as a forest of
trees all reachable from a single root, owner pointers in principle enable a
repair effort to reconstruct missing parts of the tree.
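The self-description tuple can be sketched in a few lines of Python. This is
only an illustration: `BlockHeader`, `payload_checksum` and `verify` are
hypothetical names, the field layout is invented rather than taken from either
filesystem, and `zlib.crc32` stands in for the real checksum function.

```python
import uuid
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockHeader:
    """Hypothetical self-description tuple stored with a metadata block."""
    block_type: str     # what kind of metadata the block claims to hold
    blocknr: int        # where the block claims to live
    fs_uuid: uuid.UUID  # which filesystem the block claims to belong to
    owner: int          # metadata object that points at this block
    checksum: int       # checksum of the block payload


def payload_checksum(payload: bytes) -> int:
    # Stand-in only: plain CRC32 from zlib, not the CRC32c used on disk.
    return zlib.crc32(payload) & 0xFFFFFFFF


def verify(header, payload, want_type, want_blocknr, want_uuid):
    """Return the list of fields that do not match what the reader expected."""
    problems = []
    if header.block_type != want_type:
        problems.append("type")
    if header.blocknr != want_blocknr:
        problems.append("blocknr")
    if header.fs_uuid != want_uuid:
        problems.append("fs_uuid")
    if header.checksum != payload_checksum(payload):
        problems.append("checksum")
    return problems
```

An empty list means the block is what the reader asked for. A misdirected
read shows up as a `blocknr` mismatch, and a leftover block from an older
filesystem as an `fs_uuid` mismatch, even when the checksum itself is valid.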
The checksum, while neither fool-proof nor tamper-proof, is usually a fast
method to detect random bit corruption. While it is possible to choose
stronger schemes such as sha256 (or even cryptographically signed hashes),
these come with high performance and management overhead, which is why most
systems choose a checksum of some sort. Both filesystems chose CRC32c,
primarily for its ability to detect bit flips and the presence of hardware
acceleration on a number of platforms.
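For readers who have not met CRC32c before, here is a minimal table-driven
sketch in pure Python. It implements the generic CRC32c (Castagnoli)
algorithm with the usual initial value and final inversion; the exact seeding
conventions that ext4 and XFS apply on disk are not shown and should not be
inferred from it. The kernel uses optimised or hardware-accelerated routines
rather than anything resembling this loop.

```python
def _make_table():
    poly = 0x82F63B78  # bit-reflected form of the Castagnoli polynomial 0x1EDC6F41
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """CRC32c of data; pass a previous result as crc to continue a running sum."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


# Standard check value for this CRC: the ASCII digits 1 through 9.
assert crc32c(b"123456789") == 0xE3069283

# A single flipped bit in a 4 KiB block always changes the checksum.
block = bytes(4096)
damaged = bytearray(block)
damaged[100] ^= 0x04
assert crc32c(bytes(damaged)) != crc32c(block)
```

The guarantee exercised by the last assertion (every single-bit error is
detected) is what makes a CRC a good fit for random bit corruption, and also
why it offers no protection against deliberate tampering.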
One area that neither XFS nor ext4 has touched is data checksumming. While it
is technically possible to record the same self-description tuple for data
blocks (btrfs stores at least the checksum), this was deliberately left out of
the design for both XFS and ext4, for three reasons. (There will be more to say
about data block back-references later.) First, requiring a
metadata update (and log transaction) for every write of every block will have
a sharply negative impact on rewrite performance. Second, some applications
ensure that their internal file formats already provide the integrity data that
the application requires; for them, the filesystem overhead is unnecessary.
Migration of the data and its integrity information is easier when both are
encapsulated in a single file. Third, performing file data integrity in
userspace has the advantage that the integrity profiles can be customised for
each program — some may deem bitflip detection via CRC to be sufficient;
others might want sha256 to take advantage of the reduced probability of
collisions; and still more might go all the way to verification through digital
signatures. There does not seem to be a pressing need to provide data block
integrity specifically through the filesystem, unlike metadata, which is
accessible only through the filesystem.
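As an illustration of the userspace approach, an application that wants
sha256-strength integrity can embed the digest in its own file format. The
sketch below is hypothetical (the `seal` and `unseal` names and the
digest-then-payload layout are invented for this example), but it shows how
data and integrity information travel together in one file.

```python
import hashlib

DIGEST_LEN = hashlib.sha256().digest_size  # 32 bytes


def seal(payload: bytes) -> bytes:
    """Prefix the payload with its sha256 digest (hypothetical file layout)."""
    return hashlib.sha256(payload).digest() + payload


def unseal(blob: bytes) -> bytes:
    """Return the payload, or raise ValueError if it fails verification."""
    digest, payload = blob[:DIGEST_LEN], blob[DIGEST_LEN:]
    if hashlib.sha256(payload).digest() != digest:
        raise ValueError("payload does not match its embedded digest")
    return payload
```

A program content with bitflip detection could swap the hash for a CRC, and
one that needs authenticity could replace it with a signature, without the
filesystem having to know about either choice.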
In XFS, self-describing metadata was introduced with a new (v5) on-disk
format. All existing v4 structures were enlarged to store (type, blocknr,
fsuuid, owner, lsn); this allowed XFS to deploy a set of block verifiers to
decide quickly if a block being read in matches what the reader expects. These
verifiers also perform a quick check of the block’s metadata at read and write
time to detect bad metadata resulting from coding bugs. Unfortunately, it is
necessary to reformat the filesystem to accommodate the resized metadata
headers. For now, the kernel and the repair tool are still quick to discard
broken metadata; as we will see, though, this new metadata format extension
opens the door to enhanced recovery efforts.
For ext4, it was discovered that every metadata structure had sufficient room
to squeeze in an extra two- or four-byte field to store checksum data while
leaving the structure size and layout otherwise intact. This meant making a
few compromises in the design: instead of adding the five attributes to each
block, a single 32-bit checksum is calculated over the type, blocknr, fsuuid,
owner and block data; this value is then plugged into the checksum field. This
scheme allows ext4 to decide if a block’s contents match what we thought we
were reading, but it will not enable us to reconstruct missing parts of the FS
metadata object hierarchy. However, existing ext2/3/4 filesystems can be
upgraded easily via tune2fs.
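The idea of folding the identifying attributes into the checksum, rather than
storing them, can be sketched as follows. Everything here is an assumption
made for illustration: `implicit_checksum` is a hypothetical helper, the
packing of the context fields is invented, and `zlib.crc32` stands in for
CRC32c. The real ext4 structures each define their own checksum coverage.

```python
import struct
import uuid
import zlib


def implicit_checksum(block_type: int, blocknr: int, fs_uuid: uuid.UUID,
                      owner: int, data: bytes) -> int:
    """Checksum the block data together with context that is never stored."""
    # Invented layout: 32-bit type, 64-bit block number, 16-byte UUID, 64-bit owner.
    context = struct.pack("<IQ16sQ", block_type, blocknr, fs_uuid.bytes, owner)
    # Continue the running CRC from the context into the block data.
    return zlib.crc32(data, zlib.crc32(context)) & 0xFFFFFFFF


fs = uuid.UUID(int=1)
data = bytes(64)
stored = implicit_checksum(1, 7, fs, 12, data)

# Reading the same bytes at the expected location verifies.
assert implicit_checksum(1, 7, fs, 12, data) == stored
# The same bytes turning up at a different block number do not.
assert implicit_checksum(1, 8, fs, 12, data) != stored
```

Only the 32-bit result occupies space on disk. The reader must already know
the expected type, location, UUID and owner in order to verify, which is why
this scheme can reject a wrong block but cannot say where an orphaned block
belongs.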
In the near future, XFS could grow a few new features to enable an even greater
level of self-directed integrity checking and repair. Inodes may soon grow
parent directory pointers, which enable XFS to reconstruct directories by
scanning all non-free (link count > 0) inodes in the filesystem. Similarly, a
proposed block reverse-mapping btree makes it possible for XFS to rebuild a
file by iterating all the rmaps looking for extent data. These two operations
can even be performed online, which means that the filesystem can evolve
towards self-healing abilities. Major factors blocking this development are
(a) the inability to close an open file and (b) the need to shut down the
allocators while we repair per-AG data. These improvements will be harder or
impossible to implement for ext4, unfortunately.
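Both reconstruction ideas reduce to scanning back-references and grouping them
by owner. The toy sketch below uses invented record shapes (`Rmap`, `Inode`)
and hypothetical helper names; the real on-disk records for these proposed
features are not described in this article and may look quite different.

```python
from collections import defaultdict
from typing import NamedTuple


class Rmap(NamedTuple):
    """Hypothetical reverse-mapping record: which file owns a run of blocks."""
    start_block: int
    length: int
    owner: int   # inode number of the owning file
    offset: int  # logical offset of the extent within that file


class Inode(NamedTuple):
    """Hypothetical inode carrying a parent directory pointer."""
    ino: int
    nlink: int
    parent: int  # inode number of the parent directory
    name: str    # name under which the parent links to this inode


def rebuild_extents(rmaps, ino):
    """Rebuild a file's extent map as (offset, start_block, length) tuples."""
    return sorted((r.offset, r.start_block, r.length)
                  for r in rmaps if r.owner == ino)


def rebuild_directories(inodes):
    """Rebuild directory contents from the parent pointers of in-use inodes."""
    dirs = defaultdict(dict)
    for inode in inodes:
        if inode.nlink > 0:  # skip free inodes
            dirs[inode.parent][inode.name] = inode.ino
    return dict(dirs)
```

Neither helper needs the damaged forward structure (the file's block map or
the directory's own blocks) to be readable; both rely only on records stored
elsewhere, which is the property that makes this kind of repair possible.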
The metadata checksumming features as described will be enabled by default in
the respective mkfs tools as part of the next releases of e2fsprogs (1.43) and
xfsprogs (3.2.3). Existing filesystems must be upgraded (ext4) or reformatted
and reloaded (xfs) manually.
— D
Further reading:
- https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
- https://www.kernel.org/doc/Documentation/filesystems/xfs-self-describing-metadata.txt