The following is a write-up by Darrick Wong, a mainline Linux kernel
engineer at Oracle.
For the past year I have been working on a bunch of new features for the
XFS filesystem on Linux. Modern-day XFS is a direct descendant of the
original XFS code from SGI Irix that was donated long ago. The goals
are the same -- XFS is intended to behave consistently as it scales to
large storage and many files.
Back in 2014, Dave Chinner prototyped a mechanism for storing extent
owner reverse mapping information in the filesystem. The idea here is
that for a given block, the filesystem retains a record identifying
which metadata structure or file owns that block. In 2015 I took over
development of this feature, enlarging the record definitions to include
both owner and file offset information with the goal of using the
reverse mapping information to rebuild damaged metadata from the ground
up.
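
To make the record layout concrete, here is a minimal sketch of a
reverse mapping record carrying both owner and file offset, with a
lookup that answers "who owns this block?". The class name, field
names, and owner labels are assumptions made for illustration; they are
not the on-disk XFS format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RmapRecord:
    # All field names are hypothetical; this is not the XFS on-disk layout.
    start_block: int  # first physical block of the extent
    length: int       # number of blocks in the extent
    owner: str        # a file (inode) or a metadata structure
    offset: int       # file offset, in blocks, that start_block maps to


def owners_of(records: List[RmapRecord], block: int) -> List[Tuple[str, int]]:
    """Return (owner, file offset) for every record covering `block`."""
    hits = []
    for r in records:
        if r.start_block <= block < r.start_block + r.length:
            hits.append((r.owner, r.offset + (block - r.start_block)))
    return hits


records = [
    RmapRecord(0, 8, "static metadata", 0),
    RmapRecord(8, 16, "inode 131", 0),
    RmapRecord(24, 4, "inode 131", 100),
]
print(owners_of(records, 25))
```

A linear scan keeps the sketch short; the real filesystem indexes these
records in a btree.
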
While working on block sharing, it became obvious that the generic XFS
btree index implementation needed to gain the ability to treat extents
as a first-class data type. In classic XFS, where extents never overlap,
this wasn't a problem because the three query types (LE, EQ, GE) were
expressive enough. If something needs to find the record overlapping a
given block, XFS can perform an LE lookup of that block and iterate
upward until it reaches the end of the range. However, in a world where
extents _can_ overlap, it becomes very useful to have a query function
that takes a range and returns any record overlapping that range.
Enhancing interior btree nodes to track both the lowest and highest keys
accessible under the subtree made it possible to perform these queries
efficiently.
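
The idea of tracking low and high keys in interior nodes can be
sketched as follows. This is a small static tree built for
illustration, not the XFS btree code; the Node class, the build() and
query() helpers, and the fanout parameter are all assumptions. Each
node remembers the lowest start block and the highest end block beneath
it, so a range query can skip any subtree that cannot contain an
overlapping record.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Extent = Tuple[int, int]  # (start block, length)


@dataclass
class Node:
    low: int    # lowest start block anywhere under this node
    high: int   # highest end block (inclusive) anywhere under this node
    children: List["Node"] = field(default_factory=list)
    records: List[Extent] = field(default_factory=list)  # leaves only


def build(records: List[Extent], fanout: int = 2) -> Node:
    """Build a static tree bottom-up from a list of extents."""
    records = sorted(records)
    if not records:
        return Node(low=0, high=-1)  # empty tree: nothing can overlap it
    level = []
    for i in range(0, len(records), fanout):
        chunk = records[i:i + fanout]
        high = max(s + n - 1 for s, n in chunk)
        level.append(Node(low=chunk[0][0], high=high, records=chunk))
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), fanout):
            kids = level[i:i + fanout]
            parents.append(Node(low=kids[0].low,
                                high=max(k.high for k in kids),
                                children=kids))
        level = parents
    return level[0]


def query(node: Node, start: int, end: int) -> List[Extent]:
    """Return every record overlapping the inclusive range [start, end]."""
    if node.high < start or node.low > end:
        return []  # nothing under this subtree can overlap; prune it
    if not node.children:
        return [(s, n) for s, n in node.records
                if s <= end and s + n - 1 >= start]
    found = []
    for child in node.children:
        found.extend(query(child, start, end))
    return found


tree = build([(0, 10), (5, 3), (20, 5), (22, 10), (40, 2)])
print(query(tree, 9, 9))
```

Note that a plain LE lookup for block 9 would land on the record
starting at block 5, which ends at block 7 and does not cover block 9
at all; the overlapping query finds the record starting at block 0
instead.
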
The other big piece to land in 4.8 was the deferred operations
transaction control structure, which tracks redo items in the XFS log.
XFS already had a single logical redo item that was used to split
unmapping an extent from a file and freeing the extent into two
transactions while maintaining logical consistency. All updates related
to the unmapping process were logged in the first transaction along with
a promise to free the blocks "later". The first transaction is then
committed and a new transaction allocated ("rolled"). The updates
necessary to free the blocks are logged in this second transaction along
with a confirmation that the promise has been fulfilled. Log recovery
replays all committed transactions and maintains a list of unfulfilled
promises. After the first round of transaction replay is complete, the
promised updates are written as new transactions.
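
The promise-and-confirmation scheme described above can be sketched
with a toy log. The tuple shapes ("update", "intent", "done") and the
recover() helper are assumptions made for this example; they do not
mirror the real XFS log item formats.

```python
def recover(log):
    """Replay a list of committed transactions.

    Each transaction is a list of items:
      ("update", description)        an ordinary logged update
      ("intent", intent_id, work)    a promise to do `work` later
      ("done", intent_id)            confirmation the promise was kept

    Returns (replayed updates, new transactions that finish the
    unfulfilled promises).
    """
    replayed = []
    unfulfilled = {}  # intent_id -> promised work
    for txn in log:
        for item in txn:
            kind = item[0]
            if kind == "update":
                replayed.append(item[1])
            elif kind == "intent":
                unfulfilled[item[1]] = item[2]
            elif kind == "done":
                unfulfilled.pop(item[1], None)
    # Second round: write each outstanding promise as a new transaction.
    new_txns = [[("update", work), ("done", intent_id)]
                for intent_id, work in unfulfilled.items()]
    return replayed, new_txns


# A crash after the first transaction leaves the promise outstanding.
crashed_log = [
    [("update", "unmap blocks 100-109 from inode 7"),
     ("intent", 1, "free blocks 100-109")],
]
print(recover(crashed_log))
```
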
The new deferred operations tracker centralized control of these redo
items and introduced new redo item types. This makes implementing
compound atomic updates very easy in XFS -- all deferred ops are logged
as promises in the first transaction along with the first update. Then
it loops through the deferred work, rolling the transaction, making
updates, and logging a confirmation when the work item is done. When
there are no more deferred ops, the compound transaction is complete.
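
As a rough illustration of that loop, the sketch below logs every
promise alongside the first update, then rolls the transaction once per
deferred work item. The Transaction class and run_compound_update()
are hypothetical names invented for this example, not kernel
interfaces.

```python
from collections import deque


class Transaction:
    """Toy transaction: collects items, appends them to a log on commit."""

    def __init__(self, log):
        self.log = log
        self.items = []

    def log_item(self, item):
        self.items.append(item)

    def commit(self):
        self.log.append(self.items)

    def roll(self):
        """Commit this transaction and hand back a fresh one."""
        self.commit()
        return Transaction(self.log)


def run_compound_update(log, first_update, deferred_work):
    tp = Transaction(log)
    # The first transaction holds the first update plus every promise.
    tp.log_item(("update", first_update))
    pending = deque()
    for intent_id, work in enumerate(deferred_work):
        tp.log_item(("intent", intent_id, work))
        pending.append((intent_id, work))
    # Roll, do one work item, log its confirmation; repeat until empty.
    while pending:
        tp = tp.roll()
        intent_id, work = pending.popleft()
        tp.log_item(("update", work))
        tp.log_item(("done", intent_id))
    tp.commit()
    return log


for txn in run_compound_update([], "unmap extent",
                               ["remove rmap", "free extent"]):
    print(txn)
```
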
For reverse mapping, this means that XFS can defer rmap updates for file
mapping activity to a second transaction, which helps us avoid
overfilling transactions and violating locking rules, which can lead to
deadlocks.
Note that reverse mapping itself is not yet a user-visible feature. We
will return to this in the section covering online scrub and repair.
For years, btrfs and ocfs2 have both had the ability to share blocks
between files. This was first exposed to users as a "quick copy"
feature that bypassed the usual read-allocate-write cycle by copying
mappings from one file to another; this feature is more commonly known
as "reflink". btrfs later added a variant that only performed the
mapping if the block contents were identical with both files locked -- a
basic building block of data deduplication.
XFS historically never had either of these features, but it would be
very useful to have this capability to build quick-deploy VM and
container hosting farms. Conceptually, it's not difficult to add
reflink support to an existing filesystem -- all the filesystem needs to
do is to store extent reference count information in a new btree and
extend the write path to detect overwrites to shared blocks and redirect
the writes to new blocks (copy on write) and update the reference
counts.
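
The conceptual recipe above fits in a few lines. The following is a toy
in-memory model, assuming per-block reference counts kept in a
dictionary rather than extent records in a btree; the TinyFS class and
its methods are invented for illustration.

```python
class TinyFS:
    """Toy block store with reflink and copy on write."""

    def __init__(self):
        self.blocks = {}    # physical block number -> contents
        self.refcount = {}  # physical block number -> number of mappings
        self.files = {}     # file name -> list of physical block numbers
        self.next_block = 0

    def _alloc(self, data):
        b = self.next_block
        self.next_block += 1
        self.blocks[b] = data
        self.refcount[b] = 1
        return b

    def create(self, name, chunks):
        self.files[name] = [self._alloc(c) for c in chunks]

    def reflink(self, src, dst):
        # "Quick copy": duplicate the mappings, not the data.
        self.files[dst] = list(self.files[src])
        for b in self.files[dst]:
            self.refcount[b] += 1

    def write(self, name, index, data):
        b = self.files[name][index]
        if self.refcount[b] > 1:
            # Shared block: redirect the write to a new block.
            self.refcount[b] -= 1
            self.files[name][index] = self._alloc(data)
        else:
            self.blocks[b] = data  # unshared: overwrite in place

    def read(self, name):
        return [self.blocks[b] for b in self.files[name]]


fs = TinyFS()
fs.create("a", ["x", "y"])
fs.reflink("a", "b")
fs.write("b", 0, "z")
print(fs.read("a"), fs.read("b"))
```
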
Implementing reflink in XFS was not this simple. Once the basic work
was complete, performance testing revealed that repeated queries to the
refcount btree took a lot of time, which meant that the cost of
discovering whether or not a write had to be CoW'd was high. VM write
tests showed horrifying fragmentation problems that greatly increased
metadata overhead. The continued use of buffer heads to convey mapping
information to the VFS contributed to inefficiencies. There was no good
way to cram a remapping operation into a single log transaction for
atomicity. Clearly, a lot of optimization work had to be done prior to
merging this feature.
The first things to get fixed were the CoW discovery overhead and the
fragmentation problems. To the inode core I added a status flag and a
CoW extent size hint. The status flag indicates whether or not this
inode has had shared extents at some point in the past; if not set, the
CoW discovery checks can be skipped safely. I also rewrote the parts of
the write path that implement CoW to create a new in-memory block
mapping "fork" so that I could reuse the existing delayed allocation
mechanism to try to allocate larger extents for CoW writes. This was
the first step to combatting fragmentation; the second step was the CoW
extent size hint. The hint tells the allocator to create larger delayed
allocation reservations in the hope that future writes can land in the
one large extent that was allocated. Writes to non-shared blocks are
also promoted to CoW to reduce fragmentation. ocfs2 also uses this
trick to combat fragmentation.
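
A simplified view of how the flag and the hint interact is sketched
below. It assumes the hint simply rounds the reservation outward to
multiples of the hint size; the actual allocator logic is more
involved, and the function and parameter names here are hypothetical.

```python
def cow_reservation(has_shared_flag, write_start, write_len, cowextsize):
    """Pick the block range to reserve for a CoW write.

    Returns None when the inode has never had shared extents, in which
    case the CoW checks can be skipped. Otherwise returns
    (start, length) in blocks, rounded outward to multiples of the CoW
    extent size hint (an assumption of this sketch).
    """
    if not has_shared_flag:
        return None
    if cowextsize <= 0:
        raise ValueError("cowextsize must be positive")
    start = (write_start // cowextsize) * cowextsize
    # Ceiling division, then scale back up to a hint boundary.
    end = -(-(write_start + write_len) // cowextsize) * cowextsize
    return (start, end - start)


print(cow_reservation(True, 30, 4, 32))
```

Because the reservation covers whole hint-sized chunks, a later write
to a neighbouring block can land in the same extent instead of
triggering another small allocation.
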
Next on the list of things to fix was the inability to perform remap
operations atomically. One theoretical advantage of using copy on write
to modify
file data is that those writes become (sort of) atomic. Given any write
to a data block, reads should return either the old contents or the new
contents. XFS has operations to unmap an extent, map an extent, and
increase or decrease the reference count of an extent, but each of these
three operations was designed to use a single transaction to convey the
metadata changes to disk. Now, remember the deferred update system
introduced in 4.8? This is exactly what XFS needed to make CoW updates
atomic. In the first transaction, XFS unmaps a chunk of the old file and
logs promises to remove the old rmap, decrease the refcount of the old
extent, (possibly) free the old extent, map in a new chunk, and add a
new rmap. XFS can then use the deferred update mechanism to complete
each intended work item in a separate transaction. If a crash should
happen during a compound update, log recovery will continue the work at
the exact point of the crash, so CoW writes are totally atomic in XFS.
As part of broader efforts to remove buffer heads from XFS and retrofit
XFS for persistent memory, Christoph Hellwig hoisted the internal
'iomap' mechanism that XFS used to expose extent data to pNFS clients
into the VFS as a more general mechanism for XFS to communicate extent
mapping information upwards. Buffer heads are supposed to wrap a memory
buffer that caches a range of sectors on a disk; filesystems were
instead abusing them to convey file mapping data (with no memory buffer
involved at all). Rewriting the write paths to eliminate this kludge
cut down the overhead of page faults and repetitive looping of
writeback, which was a big help amortizing the pain of figuring out
which writes had to be CoW'd.
Finally, the offline repair tool xfs_repair had to be taught to
regenerate the reference count data from all the reverse mappings in the
filesystem. Reverse mappings are already collected from primary
metadata in order to rebuild the rmap btree, so xfs_repair can reuse
this data: it iterates the reverse mappings, determines how many rmaps
overlap the current block, and works out how many blocks that same
count continues for.
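
One way to picture that computation is a sweep over the start and end
points of the reverse mappings. The sketch below is an illustration
only, not the xfs_repair code; it assumes the input is a list of
(start block, length) pairs for file data, and it reports only extents
with a count of two or more, which is a choice made for this example.

```python
def refcounts_from_rmaps(rmaps):
    """Derive (start, length, refcount) extents from overlapping rmaps."""
    # +1 where a mapping begins, -1 just past where it ends.
    deltas = {}
    for start, length in rmaps:
        deltas[start] = deltas.get(start, 0) + 1
        end = start + length
        deltas[end] = deltas.get(end, 0) - 1

    out = []
    count = 0    # number of rmaps covering the blocks since `prev`
    prev = None
    for pos in sorted(deltas):
        if prev is not None and count >= 2:
            last = out[-1] if out else None
            if last and last[2] == count and last[0] + last[1] == prev:
                # Same count continues: extend the previous extent.
                out[-1] = (last[0], last[1] + pos - prev, count)
            else:
                out.append((prev, pos - prev, count))
        count += deltas[pos]
        prev = pos
    return out


print(refcounts_from_rmaps([(0, 10), (5, 10), (8, 4), (20, 5)]))
```
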
The next step is to provide an online repair facility. Prior to 4.8,
the usage of any given block was recorded in only one btree -- free
space, inodes, or block mappings. Damage to any of this primary
metadata meant losing file data or taking the filesystem offline to
rebuild the free space and inode indices. However, the recording of
reverse block mappings means that XFS now has secondary metadata from
which it can reconstruct the block-related primary metadata.
This means that XFS can rebuild a file's block map, or reconstruct the
free space data, or even find lost inodes. The reverse mapping btree
itself can be rebuilt by (mostly) freezing the filesystem and scanning
all primary data internally, though this can have an adverse impact on
any other IO going on at the same time. Better yet, this reconstruction
can happen without taking the filesystem offline, though xfs_repair will
also gain the ability to rebuild a damaged file block map from rmap data.
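
Reconstructing the free space data is the easiest of these to picture:
if every in-use block has a reverse mapping, then anything not covered
by a reverse mapping must be free. The sketch below assumes a flat
block range of total_blocks and (start block, length) rmaps; the
function name is hypothetical and this is not how the repair code is
actually structured.

```python
def free_space_from_rmaps(rmaps, total_blocks):
    """Return free extents as (start, length): the gaps between rmaps."""
    free = []
    cursor = 0  # first block not yet known to be in use
    for start, length in sorted(rmaps):
        if start > cursor:
            free.append((cursor, start - cursor))
        # max() copes with shared (overlapping) extents.
        cursor = max(cursor, start + length)
    if cursor < total_blocks:
        free.append((cursor, total_blocks - cursor))
    return free


print(free_space_from_rmaps([(0, 8), (8, 16), (20, 4), (40, 10)], 64))
```
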
In a future development sprint we will make inodes store pointers to the
directories from which they are linked. This would enable us to
reconstruct directory trees as well. Stay tuned!