By Jamesmorris-Oracle on May 29, 2015
The following is a write-up by Oracle mainline Linux kernel engineer Darrick Wong, providing some background on his work on Linux FS metadata checksumming, which, after many years of work, will be turned on by default in the upcoming e2fsprogs 1.43 and xfsprogs 3.2.3.
One of the bigger problems facing filesystems today is online verification of the integrity of the metadata. Even though storage bandwidth has increased considerably and (in some cases) seek times have dropped to nearly zero, the forensic work required to restore a damaged filesystem to a consistent state increases at least as quickly as metadata size, which scales up about as quickly as total storage capacity. Furthermore, the threat of random bit corruption in a critical piece of metadata causing unrecoverable filesystem damage remains as real as it ever was -- the author has encountered scenarios where corruption in the block usage data structure results in the block allocator crosslinking file data with metadata, which multiplies the resulting damage.
Self-describing metadata helps both the kernel and the repair tools to decide if a block actually contains the data the filesystem is trying to read. In most cases, this involves tagging each metadata block with a tuple describing the type of the block, the block number where the block lives, a unique identifier tying the block to the filesystem (typically the FS UUID), the checksum of the data in the block, and some sort of pointer to the metadata object that points to the block (the owner). For a transactional filesystem, it is also useful to record the transaction ID to facilitate analyzing where in time a corruption happened. Storing the FS UUID is useful in deciding whether an arbitrary metadata block actually belongs to this filesystem, or if it belongs to something else -- a previous filesystem or perhaps an FS image stored inside the filesystem. Given a mental model of an FS as a forest of trees all reachable from a single root, owner pointers in principle enable a repair effort to reconstruct missing parts of the tree.
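As a rough illustration of the tuple described above, the following Python sketch tags a block with a header and checks it against what the reader expects. The field order, field widths, and the names `Expectation`, `make_block` and `check_block` are assumptions made for this example; this is not the XFS or ext4 on-disk format, and `zlib.crc32` is only a stand-in checksum that happens to ship with Python.

```python
import struct
import uuid
import zlib
from dataclasses import dataclass

# Hypothetical layout, for illustration only (not the XFS or ext4 format):
# type (u32), block number (u64), owner (u64), transaction ID (u64),
# filesystem UUID (16 bytes), checksum (u32), all little-endian.
HEADER = struct.Struct("<IQQQ16sI")
CSUM_OFFSET = HEADER.size - 4  # the checksum is the last header field


@dataclass
class Expectation:
    """What the reader believes it is about to read."""
    block_type: int
    blocknr: int
    owner: int
    fs_uuid: uuid.UUID


def checksum(block: bytes) -> int:
    """Checksum of the whole block with the checksum field treated as zero.

    zlib.crc32 is a stand-in, chosen only because it is in the stdlib.
    """
    zeroed = block[:CSUM_OFFSET] + b"\0\0\0\0" + block[HEADER.size:]
    return zlib.crc32(zeroed)


def make_block(exp: Expectation, txn_id: int, payload: bytes) -> bytes:
    """Build header + payload, then stamp the checksum into the header."""
    header = HEADER.pack(exp.block_type, exp.blocknr, exp.owner, txn_id,
                         exp.fs_uuid.bytes, 0)
    block = header + payload
    return block[:CSUM_OFFSET] + struct.pack("<I", checksum(block)) + payload


def check_block(block: bytes, exp: Expectation) -> list[str]:
    """Return the list of mismatches; an empty list means the block looks right."""
    btype, blocknr, owner, _txn, fsid, csum = HEADER.unpack_from(block)
    problems = []
    if fsid != exp.fs_uuid.bytes:
        problems.append("foreign filesystem UUID")
    if btype != exp.block_type:
        problems.append("unexpected block type")
    if blocknr != exp.blocknr:
        problems.append("block is not where it says it lives")
    if owner != exp.owner:
        problems.append("unexpected owner")
    if csum != checksum(block):
        problems.append("checksum mismatch")
    return problems


if __name__ == "__main__":
    exp = Expectation(block_type=1, blocknr=4096, owner=12, fs_uuid=uuid.uuid4())
    blk = make_block(exp, txn_id=77, payload=b"directory entries")
    print(check_block(blk, exp))  # no mismatches for an intact block
```

Because each field is checked separately, a failure says something about what went wrong: a UUID mismatch suggests a stale block from an earlier filesystem, while a block number mismatch suggests a misdirected write.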
The checksum, while neither foolproof nor tamper-proof, is usually a fast way to detect random bit corruption. While it is possible to choose stronger schemes such as sha256 (or even cryptographically signed hashes), these come with high performance and management overhead, which is why most systems choose a checksum of some sort. Both filesystems chose CRC32c, primarily for its ability to detect bit flips and the presence of hardware acceleration on a number of platforms.
One area that neither XFS nor ext4 has touched on is data checksumming. While it is technically possible to record the same self-description tuple for data blocks (btrfs stores at least the checksum), this was deliberately left out of the design for both XFS and ext4; there will be more to say about data block back-references later. There are three reasons for the omission. First, requiring a metadata update (and log transaction) for every write of every block will have a sharply negative impact on rewrite performance. Second, some applications ensure that their internal file formats already provide the integrity data that the application requires; for them, the filesystem overhead is unnecessary. Migration of the data and its integrity information is also easier when both are encapsulated in a single file. Third, performing file data integrity in userspace has the advantage that the integrity profiles can be customised for each program -- some may deem bitflip detection via CRC to be sufficient; others might want sha256 to take advantage of the reduced probability of collisions; and still more might go all the way to verification through digital signatures. There does not seem to be a pressing need to provide data block integrity specifically through the filesystem, unlike metadata, which is accessible only through the filesystem.
In XFS, self-describing metadata was introduced with a new (v5) on-disk format. All existing v4 structures were enlarged to store (type, blocknr, fsuuid, owner, lsn); this allowed XFS to deploy a set of block verifiers to decide quickly if a block being read in matches what the reader expects. These verifiers also perform a quick check of the block's metadata at read and write time to detect bad metadata resulting from coding bugs. Unfortunately, it is necessary to reformat the filesystem to accommodate the resized metadata headers. The kernel and the repair tool are still quick to discard broken metadata; however, as we will see, this new metadata format extension opens the door to enhanced recovery efforts.
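To make the CRC32c choice and the idea of read- and write-time verifiers a little more concrete, here is a small software-only Python sketch. The `crc32c` routine is the standard table-driven form of the Castagnoli CRC; the block layout (a four-byte magic, a four-byte checksum, then the payload) and the names `stamp_block` and `verify_block` are invented for this example and do not correspond to the real XFS verifiers, which check considerably more.

```python
import struct


def _make_table() -> list[int]:
    """Precompute the lookup table for the reflected Castagnoli polynomial."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes, crc: int = 0) -> int:
    """CRC32c of data; pass a previous result as crc to continue a running CRC."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


MAGIC = b"DEMO"  # hypothetical block-type magic number


def stamp_block(payload: bytes) -> bytes:
    """Write-side step: compute the CRC with the CRC field zeroed, then store it."""
    zeroed = MAGIC + b"\0\0\0\0" + payload
    return MAGIC + struct.pack("<I", crc32c(zeroed)) + payload


def verify_block(block: bytes) -> bool:
    """Read-side step: check the magic first, then recompute and compare the CRC."""
    if block[:4] != MAGIC:
        return False
    (stored,) = struct.unpack_from("<I", block, 4)
    zeroed = block[:4] + b"\0\0\0\0" + block[8:]
    return stored == crc32c(zeroed)


if __name__ == "__main__":
    # Well-known CRC32c check value for the ASCII string "123456789".
    assert crc32c(b"123456789") == 0xE3069283

    block = stamp_block(b"inode btree records")
    assert verify_block(block)

    # Flip every bit in turn; each damaged copy must be rejected.
    for bit in range(len(block) * 8):
        damaged = bytearray(block)
        damaged[bit // 8] ^= 1 << (bit % 8)
        assert not verify_block(bytes(damaged))
    print("all single-bit flips detected")
```

A pure-Python loop like this is far slower than the hardware-accelerated instructions mentioned above; it is only meant to show what is being computed.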
For ext4, it was discovered that every metadata structure had sufficient room to squeeze in an extra two- or four-byte field to store checksum data while leaving the structure size and layout otherwise intact. This meant making a few compromises in the design -- instead of adding the five attributes to each block, a single 32-bit checksum is calculated over the type, blocknr, fsuuid, owner and block data; this value is then plugged into the checksum field. This scheme allows ext4 to decide if a block's contents match what we thought we were reading, but it will not enable us to reconstruct missing parts of the FS metadata object hierarchy. However, existing ext2/3/4 filesystems can be upgraded easily via tune2fs.
In the near future, XFS could grow a few new features to enable an even greater level of self-directed integrity checking and repair. Inodes may soon grow parent directory pointers, which would enable XFS to reconstruct directories by scanning all non-free (link count > 0) inodes in the filesystem. Similarly, a proposed block reverse-mapping btree would make it possible for XFS to rebuild a file by iterating over all the rmaps looking for extent data. These two operations can even be performed online, which means that the filesystem can evolve towards self-healing abilities. Major factors blocking this development are (a) the inability to close an open file and (b) the need to shut down the allocators while we repair per-AG data. These improvements will be harder or impossible to implement for ext4, unfortunately.
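The idea of rebuilding a file from reverse-mapping records can be sketched in a few lines of Python. Since the rmap btree is only a proposal at this point, the record fields and the names `Rmap` and `rebuild_extent_map` are hypothetical; the sketch assumes each record says which inode owns a run of physical blocks and at which logical offset within the file that run belongs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rmap:
    """Hypothetical reverse-mapping record: who owns a run of disk blocks."""
    start_block: int  # first physical block of the extent
    length: int       # number of blocks in the extent
    owner: int        # owning inode number (assumed encoding)
    file_offset: int  # logical block offset within the owning file


def rebuild_extent_map(rmaps, inode):
    """Collect the extents owned by inode, ordered by logical file offset.

    Returns (file_offset, start_block, length) tuples.
    """
    extents = sorted((r for r in rmaps if r.owner == inode),
                     key=lambda r: r.file_offset)
    # Overlapping logical ranges would mean the rmap data itself is inconsistent.
    for prev, cur in zip(extents, extents[1:]):
        if prev.file_offset + prev.length > cur.file_offset:
            raise ValueError(f"overlapping extents for inode {inode}")
    return [(r.file_offset, r.start_block, r.length) for r in extents]


if __name__ == "__main__":
    rmaps = [
        Rmap(start_block=900, length=4, owner=12, file_offset=8),
        Rmap(start_block=100, length=8, owner=12, file_offset=0),
        Rmap(start_block=500, length=2, owner=99, file_offset=0),
    ]
    print(rebuild_extent_map(rmaps, 12))  # [(0, 100, 8), (8, 900, 4)]
```

The scan touches every record in the filesystem, which is why this kind of rebuild is a repair-time operation rather than something done on a normal lookup path.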
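The ext4 compromise, folding the identifying attributes into a single checksum rather than storing them, can be illustrated with a short Python sketch. The order and width of the fields fed to the CRC, and the name `folded_checksum`, are assumptions made for this example and are not the actual ext4 on-disk calculation.

```python
import struct
import uuid


def crc32c(data: bytes, crc: int = 0) -> int:
    """Bitwise CRC32c (Castagnoli); pass a previous result to continue a CRC."""
    crc ^= 0xFFFFFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
    return crc ^ 0xFFFFFFFF


def folded_checksum(fs_uuid: uuid.UUID, block_type: int, blocknr: int,
                    owner: int, data: bytes) -> int:
    """One 32-bit value covering the identity of the block and its contents.

    The identity fields are never stored; only the reader's expectation of
    them feeds the CRC. Field order and widths here are illustrative.
    """
    ident = fs_uuid.bytes + struct.pack("<IQQ", block_type, blocknr, owner)
    return crc32c(data, crc32c(ident))


if __name__ == "__main__":
    fs = uuid.uuid4()
    data = b"extent tree node"
    stored = folded_checksum(fs, block_type=2, blocknr=100, owner=12, data=data)

    # Reading the block from the expected place with the expected owner matches.
    assert stored == folded_checksum(fs, 2, 100, 12, data)
    # The same bytes found at a different block number do not match.
    assert stored != folded_checksum(fs, 2, 101, 12, data)
    print(f"stored checksum: {stored:#010x}")
```

A mismatch only says that something differs from what the reader expected. Because the owner and block number are not recorded in the block, they cannot be recovered from the four stored bytes, which is why this scheme cannot help rebuild the metadata hierarchy.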
The metadata checksumming features as described will be enabled by default in the respective mkfs tools as part of the next releases of e2fsprogs (1.43) and xfsprogs (3.2.3). Existing filesystems must be upgraded (ext4) or reformatted and reloaded (xfs) manually.