Following on from his recent blog XFS – 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel.

Three years ago, I introduced to XFS a new experimental “reflink” feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees, and to deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation.

I’m pleased to announce that with xfsprogs 5.1, the reflink feature is now production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below:

iomap for Faster I/O

Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS’ IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page (“bufferhead”) basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient.

The new IO paths, known as “iomap”, iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4.

In-Core Extent Tree

For many years, the in-core file extent cache in XFS used a contiguous chunk of memory to store the mappings. This introduces a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented.

Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports.

Users need only upgrade their kernel to take advantage of this improvement.

To begin experimenting with XFS’s reflink support, one must format a new filesystem:

# mkfs.xfs /dev/sda1
meta-data=/dev/sda1              isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

If you do not see the exact phrase “reflink=1” in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem:

# mount /dev/sda1 /storage
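
Once the filesystem is mounted, xfs_info prints the same geometry report as mkfs.xfs, so the “reflink=1” check can be scripted. The following is a minimal sketch: the helper name has_reflink is hypothetical, and the sample text is copied from the mkfs output above rather than read from a live filesystem.

```shell
#!/bin/sh
# Hypothetical helper: reads mkfs.xfs or xfs_info geometry text on stdin
# and succeeds only if it contains the exact token "reflink=1".
has_reflink() {
    grep -qw 'reflink=1'
}

# Sample geometry lines copied from the mkfs.xfs output above.
sample='         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1'

if printf '%s\n' "$sample" | has_reflink; then
    echo "reflink enabled"
else
    echo "reflink not reported"
fi
```

On a live system the same helper could presumably be fed directly, for example with "xfs_info /storage | has_reflink".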

At this point, the filesystem is ready to absorb some new files. Let’s pretend that we’re running a virtual machine (VM) farm and therefore need to manage deployment images. This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details.

# mkdir /storage/images
# truncate -s 30g /storage/images/os8_base.img
# qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso

Now we install a base OS image that we will later use for fast deployment. Once that’s done, we shut down the QEMU process and check that everything’s in order:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 000000
<listing shortened for brevity>

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   32G   68G  32% /storage

Now, let’s say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly:

# /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img
0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k
0inputs+0outputs (0major+108minor)pagefaults 0swaps

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 100000
<listing shortened for brevity>
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end   on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end   on stripe width

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   32G   68G  32% /storage

This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there’s about as much free space as there was before the copy. Now let’s start that new VM and let it run for a little while before re-querying the block mapping:

# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..15728639]:        52428960..68157599    1 (160..15728799)      15728640 100000
<listing shortened for brevity>

# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET           BLOCK-RANGE          AG AG-OFFSET               TOTAL FLAGS
   0: [0..255]:             102762656..102762911  1 (50333856..50334111)      256 000000
   1: [256..15728639]:      52429216..68157599    1 (416..15728799)      15728384 100000
<listing shortened for brevity>

# df -h /storage
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G   36G   64G  36% /storage

Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We’ve apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days.
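
The “128K” figure follows from the units xfs_bmap uses: offsets and lengths are reported in 512-byte blocks, so the 256-block extent at the start of vm1.img is 256 × 512 bytes. A small sketch of that conversion, with a hypothetical helper name:

```shell
#!/bin/sh
# xfs_bmap reports offsets and lengths in units of 512-byte blocks.
SECTOR_BYTES=512

# Hypothetical helper: convert a count of 512-byte blocks to KiB.
sectors_to_kib() {
    echo $(( $1 * SECTOR_BYTES / 1024 ))
}

# TOTAL column of extent 0 in the vm1.img listing above.
sectors_to_kib 256    # prints 128
```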

Let’s turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After freshly formatting and mounting the filesystem, create a directory tree for our container base:

# mkdir -p /storage/containers/os8_base

In the directory we just created, install a base container OS image that we will later use for fast deployment. Once that’s done, we shut down the container and check that everything’s in order:

# df -h /storage/
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G  2.0G   98G   2% /storage
# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 000000

OK, that looks like a reasonable base system. Let’s use reflink to make a fast copy of this system:

# /usr/bin/time cp -pRdu --reflink=always /storage/containers/os8_base /storage/containers/container1
0.01user 0.64system 0:00.68elapsed 96%CPU (0avgtext+0avgdata 2744maxresident)k
0inputs+0outputs (0major+129minor)pagefaults 0swaps

# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 100000

# df -h /storage/
Filesystem       Size  Used Avail Use% Mounted on
/dev/sda1        100G  2.0G   98G   2% /storage
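
The --reflink=always form used here makes cp fail if the filesystem cannot share blocks. For scripts that might also run on filesystems without reflink support, GNU cp accepts --reflink=auto, which falls back to an ordinary data copy. A minimal sketch follows; the helper name clone_file and the scratch paths are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: copy SRC to DST, sharing blocks when possible.
# --reflink=auto asks GNU cp to fall back to a normal data copy when
# the filesystem cannot reflink.
clone_file() {
    cp -p --reflink=auto -- "$1" "$2"
}

# Demonstration in a scratch directory (paths are illustrative only).
workdir=$(mktemp -d)
printf 'base image contents\n' > "$workdir/base.img"
clone_file "$workdir/base.img" "$workdir/vm1.img"

# Whichever path cp took, the copy must be byte-identical.
cmp -s "$workdir/base.img" "$workdir/vm1.img" && echo "copy verified"
rm -rf "$workdir"
```

The trade-off is that with auto a slow full copy can happen silently, so always remains the better choice when block sharing is the whole point.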

Now we let the container runtime do some work and update (for example) the bash binary:

# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559  1 (11584..13759)    2176 000000

# xfs_bmap -e -l -p -v -v -v /storage/containers/container1/bin/bash
/storage/containers/container1/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE        AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52442824..52444999  1 (14024..16199)    2176 000000
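
The FLAGS column in listings like these can also be checked mechanically. The sketch below assumes the column is an octal bitmask, as the “FLAG Values” legend shown earlier suggests, with 0100000 marking a shared extent; the helper name is_shared is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an xfs_bmap FLAGS value has the
# "Shared extent" bit (0100000 in the legend) set.
# Assumption: FLAGS is an octal bitmask, so a leading 0 is prepended
# to make the shell parse the value as octal.
is_shared() {
    [ $(( 0$1 & 0100000 )) -ne 0 ]
}

# FLAGS values taken from the listings in this article.
for flags in 100000 000000; do
    if is_shared "$flags"; then
        echo "$flags: shared"
    else
        echo "$flags: not shared"
    fi
done
```
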

Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!