Darrick Wong, upstream XFS maintainer and kernel developer for Oracle Linux, wrote up this retrospective on a year's worth of work on XFS, and it hints at some great features coming up in the latest Linux kernels. I'm really excited about online filesystem checking.
Over the past year, we've spent the bulk of our time stabilizing the new reverse mapping and reflink features that were introduced in Linux 4.8-4.9 and covered in last year's entry on this blog. We are preparing to remove the EXPERIMENTAL tag from reflink in mid-2018, which will require a rework of the in-core file extent map cache to scale better. The conversion of the VFS block mapping interface to iomap wrapped up last January; now all three IO access methods (buffered, direct, DAX) benefit from the reduced overhead and better performance on NVMe flash.
For Linux 4.12, we introduced the GETFSMAP ioctl in both XFS and ext4 to enable userspace administrative tools to query the entire filesystem space map. This lets system administrators perform live analysis of the filesystem state, including metadata overhead and free space fragmentation. In XFS, the new xfs_spaceman utility provides this capability, while in ext4 the e2freefrag tool in e2fsprogs 1.43.5 has been adapted to perform live queries when it detects a mounted filesystem. xfsprogs releases now track kernel releases.
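To make the kind of analysis described above concrete, here is a minimal sketch of how a tool might summarize space map records once it has them. It does not call the GETFSMAP ioctl itself: `SpaceRecord`, its `owner` labels, and both helper functions are hypothetical simplifications invented for illustration, not the kernel's record layout or anything from xfs_spaceman.

```python
from collections import Counter
from typing import NamedTuple


class SpaceRecord(NamedTuple):
    """Hypothetical, simplified stand-in for one space map record."""
    physical: int  # start offset on the device, in bytes
    length: int    # extent length, in bytes
    owner: str     # assumed labels: "free", "file", or a metadata type


def free_space_histogram(records, block_size=4096):
    """Count free extents per power-of-two size bucket (in blocks)."""
    hist = Counter()
    for rec in records:
        if rec.owner != "free":
            continue
        blocks = rec.length // block_size
        if blocks == 0:
            continue
        # Largest power of two that is <= the extent size in blocks.
        bucket = 1 << (blocks.bit_length() - 1)
        hist[bucket] += 1
    return dict(sorted(hist.items()))


def metadata_overhead(records):
    """Fraction of mapped bytes that are neither file data nor free space."""
    total = sum(r.length for r in records)
    meta = sum(r.length for r in records if r.owner not in ("free", "file"))
    return meta / total if total else 0.0


if __name__ == "__main__":
    bs = 4096
    sample = [
        SpaceRecord(0 * bs, 4 * bs, "metadata"),
        SpaceRecord(4 * bs, 10 * bs, "file"),
        SpaceRecord(14 * bs, 1 * bs, "free"),
        SpaceRecord(15 * bs, 5 * bs, "file"),
        SpaceRecord(20 * bs, 12 * bs, "free"),
    ]
    print(free_space_histogram(sample, bs))
    print(metadata_overhead(sample))
```

Many small buckets with high counts would suggest fragmented free space, while a few large buckets suggest the opposite. A real tool would fill the record list from the ioctl rather than from hand-written samples.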
Throughout 2017, we have also been working on getting an online fsck tool ready for XFS. The existing XFS codebase has simple metadata block verifiers that spot-check metadata buffers as they are read in from disk and before they are written out to disk. These verifiers are very limited in scope because they run in the IO hot path. They cannot perform computationally expensive checks; they cannot cross-reference other metadata structures; and errors in the write path shut down the filesystem.
The solution is a separate online fsck utility. This tool extends the verifiers by letting userspace schedule expensive checking operations, so that we avoid paying a high performance cost on every operation. A dedicated online fsck tool can check that a metadata value is exactly correct, not merely within a rough ballpark range. The most notable example of this is using the reverse mappings to check block reference counts: this requires walking the reverse mapping tree several times, a cost we cannot afford during regular operations. Everything else that can be cross-referenced with other metadata is also checked. For example, given an extent of file data we can ensure that it is not an inode, not free space, not part of some btree, and not crosslinked with extended attribute data. This kind of check requires a lot of IO, which we can manage and throttle from the userspace driver program. As part of the scrub work we will restructure the XFS verifiers to provide more precise reporting of where corruption was found, and to detect corruption of in-memory buffers.
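The reference count check can be sketched in a few lines. This is only an illustration of the idea, not the kernel's algorithm: the `Rmap` record, the two functions, and the assumption that only shared ranges (two or more owners) carry a recorded reference count are all simplifications made up for this example.

```python
from typing import NamedTuple


class Rmap(NamedTuple):
    """Hypothetical reverse mapping record: which owner maps which blocks."""
    start: int   # first block of the extent
    length: int  # number of blocks
    owner: int   # owning inode number


def expected_refcounts(rmaps):
    """Derive (start, length, count) for every block range with count >= 2."""
    # Record +1 where an extent begins and -1 just past where it ends.
    events = {}
    for r in rmaps:
        events[r.start] = events.get(r.start, 0) + 1
        end = r.start + r.length
        events[end] = events.get(end, 0) - 1

    result = []
    count = 0
    prev = None
    for pos in sorted(events):
        if prev is not None and count >= 2 and pos > prev:
            last = result[-1] if result else None
            if last and last[0] + last[1] == prev and last[2] == count:
                # Adjacent range with the same count: extend it.
                result[-1] = (last[0], last[1] + pos - prev, count)
            else:
                result.append((prev, pos - prev, count))
        count += events[pos]
        prev = pos
    return result


def check_refcounts(rmaps, recorded):
    """Return (missing, unexpected) reference count records."""
    expected = set(expected_refcounts(rmaps))
    seen = set(recorded)
    return sorted(expected - seen), sorted(seen - expected)


if __name__ == "__main__":
    rmaps = [Rmap(0, 10, 100), Rmap(5, 10, 101), Rmap(8, 4, 102)]
    recorded = [(5, 3, 2), (8, 2, 3)]  # deliberately missing one range
    print(check_refcounts(rmaps, recorded))
```

Even this toy version has to visit every reverse mapping record and sort the boundaries, which hints at why a check like this is scheduled from userspace rather than run in the IO path.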
Online filesystem repair will land some time after scrub. This type of repair is very difficult because the only viable repair strategy is to rebuild metadata from scratch. To do that we must lock and scan all the metadata in the entire filesystem, formulate all the new records in memory, and then write them out to disk. Then we must carefully reset all the in-memory state.
We included the first part of scrub in Linux 4.15, and the remaining pieces will follow in subsequent releases. This functionality, once stable, will enable us to reduce filesystem downtime even further.
Allison Henderson has recently restarted development on the XFS parent pointer patchset, which enables files to record a back-link to a directory entry in a parent directory by setting extended attributes. This first requires us to improve the atomicity of extended attribute operations. Unlike regular file attribute operations, where we can fail and return an error to userspace, parent pointer operations must succeed or the filesystem is corrupt. Once the parent pointers are in place, we can then adapt the online fsck tool to perform bidirectional checking of the directory tree and to rebuild directories. We will also be able to resolve write errors into a richer log message containing the affected file path and offset.
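As a rough illustration of what bidirectional checking and directory rebuilding could look like, the sketch below models directories and parent pointers as plain dictionaries. The data layout and the function names are assumptions made for this example only; they do not reflect the on-disk format or the actual patchset.

```python
def check_directory_tree(dirents, parent_ptrs):
    """Cross-check directory entries against parent pointers, both ways.

    dirents:     {directory inode: {name: child inode}}
    parent_ptrs: {child inode: {(parent inode, name), ...}}
    Returns a sorted list of (problem, parent, name, child) tuples.
    """
    problems = []
    # Forward: every directory entry needs a matching parent pointer.
    for parent, entries in dirents.items():
        for name, child in entries.items():
            if (parent, name) not in parent_ptrs.get(child, set()):
                problems.append(("missing parent pointer", parent, name, child))
    # Backward: every parent pointer needs a matching directory entry.
    for child, ptrs in parent_ptrs.items():
        for parent, name in ptrs:
            if dirents.get(parent, {}).get(name) != child:
                problems.append(("dangling parent pointer", parent, name, child))
    return sorted(problems)


def rebuild_directories(parent_ptrs):
    """Reconstruct directory contents using only the parent pointers."""
    rebuilt = {}
    for child, ptrs in parent_ptrs.items():
        for parent, name in ptrs:
            rebuilt.setdefault(parent, {})[name] = child
    return rebuilt


if __name__ == "__main__":
    dirents = {1: {"a": 2, "b": 3}, 2: {"c": 4}}
    parent_ptrs = {2: {(1, "a")}, 3: {(1, "b")}, 4: {(2, "c")}}
    print(check_directory_tree(dirents, parent_ptrs))
    print(rebuild_directories(parent_ptrs) == dirents)
```

The rebuild helper shows why the back-links matter: with them, a damaged directory can in principle be reconstructed from its children, and walking the pointers upward is also how a block-level error could be turned into a file path.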
Further in the future, we hope to improve support for DAX and persistent memory once we are able to hash out a desirable userspace interface to these new storage media. This has provoked a lot of discussion on the mailing lists, but usage models and hardware have been slow to appear. There are also changes proposed to mkfs that will enable distributions and administrators to provide default mkfs options in a configuration file, similar to mke2fs. There is also a patchset proposed to reuse the old realtime volume to build an XFS capable of recording archival data on an SMR drive while keeping metadata and small files on an SSD.
Stay tuned for all the XFS work that will be landing as technical previews in 2018!