Upcoming XFS Work in Linux v4.8, v4.9, and v4.10+, by Darrick Wong
By Jamesmorris-Oracle on Nov 08, 2016
The following is a write-up by Oracle mainline Linux kernel engineer, Darrick Wong.
For the past year I have been working on a bunch of new features for the XFS filesystem on Linux. Modern-day XFS is a direct descendant of the original XFS code from SGI Irix that was donated long ago. The goals are the same -- XFS is intended to behave consistently as it scales to large storage and many files.
v4.8: Reverse Mapping
Back in 2014, Dave Chinner prototyped a mechanism for storing extent owner reverse mapping information in the filesystem. The idea here is that for a given block, the filesystem retains a record identifying which metadata structure or file owns that block. In 2015 I took over development of this feature, enlarging the record definitions to include both owner and file offset information with the goal of using the reverse mapping information to rebuild damaged metadata from the ground up.
While working on block sharing, it became obvious that the generic XFS btree index implementation needed to gain the ability to treat extents as a first class data type. In classic XFS where extents never overlap this wasn't a problem because the three query types (LE, EQ, GE) were expressive enough. If something needs to find the records overlapping a given range of blocks, XFS can perform an LE lookup of the first block and iterate upward until it reaches the end of the range. However, in a world where extents _can_ overlap, it becomes very useful to have a query function that takes a range and returns any record overlapping that range. Enhancing interior btree nodes to track both the lowest and highest keys accessible under the subtree made it possible to perform these queries efficiently.
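As a rough illustration of why tracking both keys helps, here is a minimal Python sketch of an overlapping-range query. The `Node` layout and the `leaf`, `interior`, and `query_range` names are hypothetical and are not the XFS btree code; the only idea taken from the text above is that each subtree knows its lowest and highest key, so whole subtrees can be skipped.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical toy structures for illustration; not the XFS on-disk format.
Extent = Tuple[int, int]  # (start_block, length)


@dataclass
class Node:
    records: List[Extent] = field(default_factory=list)    # leaves only
    children: List["Node"] = field(default_factory=list)   # interior nodes only
    low: int = 0    # lowest block covered by anything in this subtree
    high: int = 0   # highest block covered by anything in this subtree


def leaf(records: List[Extent]) -> Node:
    recs = sorted(records)
    return Node(records=recs,
                low=min(s for s, _ in recs),
                high=max(s + n - 1 for s, n in recs))


def interior(children: List[Node]) -> Node:
    return Node(children=children,
                low=min(c.low for c in children),
                high=max(c.high for c in children))


def query_range(node: Node, lo: int, hi: int) -> List[Extent]:
    """Return every extent overlapping blocks [lo, hi], inclusive."""
    # Prune: nothing under this node can touch the query range.
    if node.high < lo or node.low > hi:
        return []
    if not node.children:
        return [(s, n) for s, n in node.records
                if s <= hi and s + n - 1 >= lo]
    found: List[Extent] = []
    for child in node.children:
        found.extend(query_range(child, lo, hi))
    return found


root = interior([
    leaf([(0, 10), (5, 20)]),   # overlapping extents, as with shared blocks
    leaf([(30, 5), (32, 2)]),
    leaf([(100, 8)]),
])
print(query_range(root, 8, 31))
```

With only a low key per subtree, the query could not rule out a subtree whose first record starts well before the range, because a long extent there might still reach into it; the high key is what makes that pruning safe.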
The other big piece to land in 4.8 was the deferred operations transaction control structure, which tracks redo items in the XFS log. XFS already had a single logical redo item that was used to split unmapping an extent from a file and freeing the extent into two transactions while maintaining logical consistency. All updates related to the unmapping process were logged in the first transaction along with a promise to free the blocks "later". The first transaction is then committed and a new transaction allocated ("rolled"). The updates necessary to free the blocks are logged in this second transaction along with a confirmation that the promise has been fulfilled. Log recovery replays all committed transactions and maintains a list of unfulfilled promises. After the first round of transaction replay is complete, the promised updates are written as new transactions.
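The recovery rule described above can be sketched in a few lines of Python. The log format here (a list of transactions, each a list of `(kind, intent_id, description)` tuples) and the `unfulfilled_intents` name are assumptions made for the sketch, not the XFS log format.

```python
from typing import Dict, List, Tuple

# Hypothetical log format: each committed transaction is a list of items,
# and each item is (kind, intent_id, description).
LogItem = Tuple[str, int, str]


def unfulfilled_intents(log: List[List[LogItem]]) -> List[LogItem]:
    """Replay committed transactions; return promises with no confirmation."""
    pending: Dict[int, LogItem] = {}
    for transaction in log:
        for kind, intent_id, what in transaction:
            if kind == "intent":
                pending[intent_id] = (kind, intent_id, what)
            elif kind == "done":
                pending.pop(intent_id, None)
    return list(pending.values())


log = [
    [("update", 0, "unmap blocks 100-107 from inode 42"),
     ("intent", 1, "free blocks 100-107")],
    [("update", 0, "unmap blocks 200-203 from inode 43"),
     ("intent", 2, "free blocks 200-203")],
    [("update", 0, "mark blocks 100-107 free"),
     ("done", 1, "free blocks 100-107")],
    # Crash here: the promise with id 2 was never confirmed.
]
print(unfulfilled_intents(log))
```

After replay, the only outstanding promise is the second free, which recovery would then carry out in a new transaction.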
The new deferred operations tracker centralized control of these redo items and introduced new redo item types. This makes implementing compound atomic updates very easy in XFS -- all deferred ops are logged as promises in the first transaction along with the first update. XFS then loops through the deferred work, rolling the transaction, making updates, and logging a confirmation when each work item is done. When there are no more deferred ops, the compound transaction is complete. For reverse mapping, this means that XFS can defer rmap updates for file mapping activity to a second transaction, which helps us avoid overfilling transactions and violating locking rules, which can lead to deadlocks.
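The loop described above might look roughly like the following Python sketch. `ToyLog` and `run_compound_update` are hypothetical names, and the log is just a list of strings; the sketch only mirrors the sequence in the text: log the promises with the first update, then roll, do one piece of work, and log its confirmation.

```python
from collections import deque
from typing import Callable, List, Tuple


class ToyLog:
    """Hypothetical stand-in for the journal: a list of committed transactions."""

    def __init__(self) -> None:
        self.committed: List[List[str]] = []
        self.current: List[str] = []

    def add(self, item: str) -> None:
        self.current.append(item)

    def roll(self) -> None:
        # Commit the running transaction and start a fresh one.
        self.committed.append(self.current)
        self.current = []


def run_compound_update(log: ToyLog, first_update: str,
                        deferred: List[Tuple[str, Callable[[], str]]]) -> None:
    # First transaction: the first update plus a promise for each deferred op.
    log.add(first_update)
    queue = deque(deferred)
    for name, _ in queue:
        log.add(f"intent:{name}")
    # One transaction per deferred op: do the work, then confirm it.
    while queue:
        log.roll()
        name, work = queue.popleft()
        log.add(work())
        log.add(f"done:{name}")
    log.roll()


log = ToyLog()
run_compound_update(log, "map extent into file", [
    ("rmap", lambda: "add rmap record"),
    ("refcount", lambda: "increase refcount"),
])
for number, transaction in enumerate(log.committed, start=1):
    print(number, transaction)
```

Each transaction stays small because it carries at most one deferred update, which is the property the paragraph above relies on to avoid overfilling transactions.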
Note that reverse mapping itself is not yet a user-visible feature. We will return to this in the section covering online scrub and repair.
v4.9: Reflink and Deduplication
For years, btrfs and ocfs2 have both had the ability to share blocks between files. This was first exposed to users as a "quick copy" feature that bypassed the usual read-allocate-write cycle by copying mappings from one file to another; this feature is more commonly known as "reflink". btrfs later added a variant that only performed the mapping if the block contents were identical with both files locked -- a basic building block of data deduplication.
XFS historically never had either of these features, but it would be very useful to have this capability to build quick-deploy VM and container hosting farms. Conceptually, it's not difficult to add reflink support to an existing filesystem -- all the filesystem needs to do is to store extent reference count information in a new btree and extend the write path to detect overwrites to shared blocks and redirect the writes to new blocks (copy on write) and update the reference counts.
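The conceptual design in the paragraph above fits in a small toy model. This Python sketch uses a hypothetical `ToyFS` class with one reference count per block (a real filesystem tracks extents in a btree, not individual blocks in a dictionary); it shows only the two ingredients named above: reference counts, and a write path that redirects writes to shared blocks.

```python
from typing import Dict, List


class ToyFS:
    """Hypothetical block store with per-block reference counts."""

    def __init__(self) -> None:
        self.blocks: Dict[int, bytes] = {}      # physical block -> contents
        self.refcount: Dict[int, int] = {}      # physical block -> owner count
        self.files: Dict[str, List[int]] = {}   # file -> physical block per offset
        self.next_block = 0

    def _alloc(self, data: bytes) -> int:
        pblk = self.next_block
        self.next_block += 1
        self.blocks[pblk] = data
        self.refcount[pblk] = 1
        return pblk

    def create(self, name: str, contents: List[bytes]) -> None:
        self.files[name] = [self._alloc(data) for data in contents]

    def reflink(self, src: str, dst: str) -> None:
        # Copy the mappings, not the data, and bump each reference count.
        self.files[dst] = list(self.files[src])
        for pblk in self.files[dst]:
            self.refcount[pblk] += 1

    def write(self, name: str, offset: int, data: bytes) -> None:
        pblk = self.files[name][offset]
        if self.refcount[pblk] > 1:
            # Shared block: redirect the write to a new block (copy on write).
            self.refcount[pblk] -= 1
            self.files[name][offset] = self._alloc(data)
        else:
            self.blocks[pblk] = data

    def read(self, name: str, offset: int) -> bytes:
        return self.blocks[self.files[name][offset]]


fs = ToyFS()
fs.create("a", [b"one", b"two"])
fs.reflink("a", "b")
fs.write("b", 1, b"TWO")
print(fs.read("a", 1), fs.read("b", 1))
```

After the write, the two files still share their first block, while the second block of each file is now private.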
Implementing reflink in XFS was not this simple. Once the basic work was complete, performance testing revealed that repeated queries to the refcount btree took a lot of time, which meant that the cost of discovering whether or not a write had to be CoW'd was high. VM write tests showed horrifying fragmentation problems that greatly increased metadata overhead. The continued use of buffer heads to convey mapping information to the VFS contributed to inefficiencies. There was no good way to cram a remapping operation into a single log transaction for atomicity. Clearly, a lot of optimization work had to be done prior to merging this feature.
The first things to get fixed were the CoW discovery overhead and the fragmentation problems. To the inode core I added a status flag and a CoW extent size hint. The status flag indicates whether or not this inode has had shared extents at some point in the past; if not set, the CoW discovery checks can be skipped safely. I also rewrote the parts of the write path that implement CoW to create a new in-memory block mapping "fork" so that I could reuse the existing delayed allocation mechanism to try to allocate larger extents for CoW writes. This was the first step to combatting fragmentation; the second step was the CoW extent size hint. The hint tells the allocator to create larger delayed allocation reservations in the hope that future writes can land in the one large extent that was allocated. Writes to non-shared blocks are also promoted to CoW to reduce fragmentation. ocfs2 also uses this trick to combat fragmentation.
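To make the two inode additions concrete, here is a small Python sketch. The flag value, the function names, and the exact rounding rule are all assumptions made for illustration: the sketch assumes the reservation is widened by rounding the write outward to multiples of the hint, which is one plausible way to end up with larger extents.

```python
from typing import Tuple

# Hypothetical flag value and helper names, chosen for this sketch only.
INODE_HAS_SHARED = 0x1


def must_check_for_sharing(inode_flags: int) -> bool:
    """If the flag was never set, the CoW discovery checks can be skipped."""
    return bool(inode_flags & INODE_HAS_SHARED)


def cow_reservation(start: int, length: int, hint: int) -> Tuple[int, int]:
    """Widen a write of `length` blocks at `start` to hint-aligned boundaries.

    Assumption: round the start down and the end up to a multiple of `hint`.
    Returns (start_block, length) of the reservation.
    """
    if hint <= 1:
        return (start, length)
    aligned_start = (start // hint) * hint
    end = start + length
    aligned_end = -(-end // hint) * hint   # ceiling division, then scale
    return (aligned_start, aligned_end - aligned_start)


print(must_check_for_sharing(0))        # inode never shared: skip the checks
print(cow_reservation(37, 2, 32))       # small write inside one hint-sized chunk
print(cow_reservation(30, 4, 32))       # write straddling a chunk boundary
```

Under this assumed rule, a two-block write at block 37 with a hint of 32 reserves blocks 32 through 63, so later writes nearby can land in the same extent.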
Next on the list of things to fix was the inability to perform remap operations atomically. One theoretical advantage of using copy on write to modify file data is that those writes become (sort of) atomic. Given any write to a data block, reads should return either the old contents or the new contents. XFS has operations to unmap an extent, map an extent, and increase or decrease the reference count of an extent, but each of these operations was designed to use a single transaction to convey the metadata changes to disk. Now, remember the deferred update system introduced in 4.8? This is exactly what XFS needed to make CoW updates atomic. In the first transaction, XFS unmaps a chunk of the old file and promises to remove the old rmap, decrease the refcount of the old extent, (possibly) free the old extent, map in a new chunk, and add a new rmap. XFS can then use the deferred update mechanism to complete each intended work item in a separate transaction. If a crash should happen during a compound update, log recovery will continue the work at the exact point of the crash, so CoW writes are totally atomic in XFS.
As part of broader efforts to remove buffer heads from XFS and retrofit XFS for persistent memory, Christoph Hellwig hoisted the internal 'iomap' mechanism that XFS used to expose extent data to pNFS clients into the VFS as a more general mechanism for XFS to communicate extent mapping information upwards. Buffer heads are supposed to wrap a memory buffer that caches a range of sectors on a disk; filesystems were instead abusing them to convey file mapping data (with no memory buffer involved at all). Rewriting the write paths to eliminate this kludge cut down the overhead of page faults and repetitive looping of writeback, which was a big help amortizing the pain of figuring out which writes had to be CoW'd.
Finally, the offline repair tool xfs_repair had to be taught to regenerate the reference count data from all the reverse mappings in the filesystem. Reverse mappings are already collected from primary metadata in order to rebuild the rmap btree, so xfs_repair can reuse this data: it iterates the reverse mappings to determine how many rmaps overlap the current block and for how many blocks that count holds.
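One way to picture that computation is a sweep over the start and end points of every reverse mapping. The Python sketch below is not the xfs_repair implementation; `refcounts_from_rmaps` is a hypothetical name, rmaps are reduced to `(start_block, length)` pairs, and the sketch assumes that only runs with a count of two or more need a refcount record.

```python
from typing import List, Tuple


def refcounts_from_rmaps(
        rmaps: List[Tuple[int, int]]) -> List[Tuple[int, int, int]]:
    """Turn (start_block, length) rmaps into (start, length, refcount) runs.

    Assumption for this sketch: only runs shared by two or more owners
    are emitted.
    """
    events = []
    for start, length in rmaps:
        events.append((start, +1))            # an rmap begins here
        events.append((start + length, -1))   # an rmap ends just before here
    events.sort()

    runs: List[Tuple[int, int, int]] = []
    count = 0
    run_start = 0
    i = 0
    while i < len(events):
        pos = events[i][0]
        delta = 0
        while i < len(events) and events[i][0] == pos:
            delta += events[i][1]
            i += 1
        if delta == 0:
            continue   # one rmap ended as another began: count is unchanged
        if count >= 2:
            runs.append((run_start, pos - run_start, count))
        count += delta
        run_start = pos
    return runs


rmaps = [(0, 10), (5, 10), (8, 4), (20, 5)]
print(refcounts_from_rmaps(rmaps))
```

For these four rmaps, blocks 5 to 7 have two owners, blocks 8 and 9 have three, blocks 10 and 11 have two again, and everything else has at most one.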
v4.10?: Online Scrub and Repair
The next phase of XFS development will enhance XFS error recovery. Step one is to add to the kernel an online scrubbing capability that walks all metadata looking for problems. Nearly all XFS metadata records are indexed by btrees, so first the online scrubber looks at every btree block looking for structural problems. It also performs basic sanity checks of every record and cross-references each record with the other metadata to look for discrepancies. Those discrepancies can be reported to the administrator for further action.
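The cross-referencing step can be pictured as comparing two independent descriptions of the same blocks. The following Python sketch uses hypothetical record types and a hypothetical `cross_reference` function, and it assumes that a file mapping and its reverse mapping cover exactly the same range; it is a toy model of the idea, not the scrubber's actual checks.

```python
from typing import List, NamedTuple


class FileMapping(NamedTuple):      # primary metadata: one file block map entry
    inode: int
    offset: int
    start_block: int
    length: int


class ReverseMapping(NamedTuple):   # secondary metadata: one rmap record
    start_block: int
    length: int
    owner: int
    offset: int


def cross_reference(bmaps: List[FileMapping],
                    rmaps: List[ReverseMapping]) -> List[str]:
    """Report records present in one structure but missing from the other."""
    bmap_keys = {(b.inode, b.offset, b.start_block, b.length) for b in bmaps}
    rmap_keys = {(r.owner, r.offset, r.start_block, r.length) for r in rmaps}
    problems = []
    for inode, offset, _, _ in sorted(bmap_keys - rmap_keys):
        problems.append(
            f"inode {inode} offset {offset}: mapping has no reverse mapping")
    for inode, offset, _, _ in sorted(rmap_keys - bmap_keys):
        problems.append(
            f"inode {inode} offset {offset}: reverse mapping has no file mapping")
    return problems


bmaps = [FileMapping(42, 0, 10, 5), FileMapping(42, 5, 30, 2)]
rmaps = [ReverseMapping(10, 5, 42, 0), ReverseMapping(50, 1, 43, 0)]
for problem in cross_reference(bmaps, rmaps):
    print(problem)
```

Neither structure is trusted more than the other here; the sketch only reports the disagreement, which matches the scrub phase's role of detecting problems rather than fixing them.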
The next step is to provide an online repair facility. Prior to 4.8, block usage information was only encoded in a single btree -- free space, inodes, or block mappings. Damage to any of this primary metadata meant losing file data or taking the filesystem offline to rebuild the free space and inode indices. However, the recording of reverse block mappings means that XFS now has secondary metadata from which it can reconstruct the block-related primary metadata.
This means that XFS can rebuild a file's block map, or reconstruct the free space data, or even find lost inodes. The reverse mapping btree itself can be rebuilt by (mostly) freezing the filesystem and scanning all primary data internally, though this can have an adverse impact on any other IO going on at the same time. Better yet, this reconstruction can happen without taking the filesystem offline, though xfs_repair will also gain the ability to rebuild a damaged file block map from rmap data.
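To show how reverse mappings can stand in for lost primary metadata, here is a minimal Python sketch. The `Rmap` record, the `-1` owner used for filesystem metadata, and both function names are hypothetical; the sketch assumes a complete, trustworthy list of rmaps and a fixed number of blocks.

```python
from typing import List, NamedTuple, Tuple


class Rmap(NamedTuple):
    """Hypothetical reverse mapping record; not the on-disk format."""
    start_block: int
    length: int
    owner: int     # inode number, or -1 here for filesystem metadata
    offset: int    # file offset, in blocks


def rebuild_block_map(rmaps: List[Rmap],
                      inode: int) -> List[Tuple[int, int, int]]:
    """Return the file's mappings as (file_offset, start_block, length)."""
    return sorted((r.offset, r.start_block, r.length)
                  for r in rmaps if r.owner == inode)


def rebuild_free_space(rmaps: List[Rmap],
                       total_blocks: int) -> List[Tuple[int, int]]:
    """Return (start_block, length) for every run that no rmap claims."""
    free = []
    cursor = 0
    for r in sorted(rmaps):   # sorts by start_block first
        if r.start_block > cursor:
            free.append((cursor, r.start_block - cursor))
        cursor = max(cursor, r.start_block + r.length)
    if cursor < total_blocks:
        free.append((cursor, total_blocks - cursor))
    return free


rmaps = [
    Rmap(10, 5, 42, 0),
    Rmap(30, 2, 42, 5),
    Rmap(0, 4, -1, 0),
    Rmap(20, 5, 43, 0),
]
print(rebuild_block_map(rmaps, 42))
print(rebuild_free_space(rmaps, 40))
```

Both results are derived purely from the secondary metadata: the block map comes from filtering by owner, and the free space is whatever no owner claims.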
In a future development sprint we will make inodes store pointers to the directories from which they are linked. This would enable us to reconstruct directory trees as well. Stay tuned!