Following on from his recent blog XFS - 2019 Development Retrospective, XFS Upstream maintainer Darrick Wong dives a little deeper into the Reflinks implementation for XFS in the mainline Linux Kernel.
Three years ago, I introduced to XFS a new experimental "reflink" feature that enables users to share data blocks between files. With this feature, users gain the ability to make fast snapshots of VM images and directory trees; and deduplicate file data for more efficient use of storage hardware. Copy on write is used when necessary to keep file contents intact, but XFS otherwise continues to use direct overwrites to keep metadata overhead low. The filesystem automatically creates speculative preallocations when copy on write is in use to combat fragmentation.
I'm pleased to announce that, as of xfsprogs 5.1, the reflink feature is production ready and enabled by default on new installations, having graduated from the experimental and stabilization phases. Based on feedback from early adopters of reflink, we also redesigned some of our in-kernel algorithms for better performance, as noted below:
Beginning with Linux 4.18, Christoph Hellwig and I have migrated XFS' IO paths away from the old VFS/MM infrastructure, which dealt with IO on a per-block and per-block-per-page ("bufferhead") basis. These mechanisms were introduced to handle simple filesystems on Linux in the 1990s, but are very inefficient.
The new IO paths, known as "iomap", iterate IO requests on an extent basis as much as possible to reduce overhead. The subsystem was written years ago to handle file mapping leases and the like, but nowadays we can use it as a generic binding between the VFS, the memory manager, and XFS whenever possible. The conversion finished as of Linux 5.4.
For many years, the in-core file extent cache in XFS used a contiguous chunk of memory to store the mappings. This introduces a serious and subtle pain point for users with large sparse files, because it can be very difficult for the kernel to fulfill such an allocation when memory is fragmented.
Christoph Hellwig rewrote the in-core mapping cache in Linux 4.15 to use a btree structure. Instead of using a single huge array, the btree structure reduces our contiguous memory requirements to 256 bytes per chunk, with no maximum on the number of chunks. This enables XFS to scale to hundreds of millions of extents while eliminating a source of OOM killer reports.
Users need only upgrade their kernel to take advantage of this improvement.
To begin experimenting with XFS's reflink support, one must format a new filesystem:
# mkfs.xfs /dev/sda1
meta-data=/dev/sda1              isize=512    agcount=4, agsize=6553600 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=26214400, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=12800, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
If you do not see the exact phrase "reflink=1" in the mkfs output then your system is too old to support reflink on XFS. Now one must mount the filesystem:
# mount /dev/sda1 /storage
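The "reflink=1" check described above can also be scripted. The following is a minimal sketch, assuming the geometry text has the same layout as the mkfs.xfs output shown earlier. The function name `has_reflink` is hypothetical and is not part of xfsprogs.

```shell
#!/bin/sh
# has_reflink (hypothetical helper): read mkfs.xfs-style geometry text
# on stdin and succeed only if it contains the exact token "reflink=1".
has_reflink() {
    grep -qw 'reflink=1'
}

# Sample input: an excerpt of the mkfs.xfs output shown above
sample='         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1'

if printf '%s\n' "$sample" | has_reflink; then
    echo "reflink enabled"
else
    echo "reflink not enabled"
fi
```

On a mounted filesystem, the same function could be fed from xfs_info, which ships with xfsprogs and prints the geometry in the same format: `xfs_info /storage | has_reflink`.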
At this point, the filesystem is ready to absorb some new files. Let's pretend that we're running a virtual machine (VM) farm and therefore need to manage deployment images. This and the next example are admittedly contrived, as any serious VM and container farm manager takes care of all these details.
# mkdir /storage/images
# truncate -s 30g /storage/images/os8_base.img
# qemu-system-x86_64 -hda /storage/images/os8_base.img -cdrom /isoz/os8_install.iso
Now we install a base OS image that we will later use for fast deployment. Once that's done, we shut down the QEMU process and check that everything's in order:
# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET          TOTAL FLAGS
   0: [0..15728639]:   52428960..68157599    1 (160..15728799) 15728640 000000
<listing shortened for brevity>
# df -h /storage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   32G   68G  32% /storage
Now, let's say that we want to provision a new VM using the base image that we just created. In the old days we would have had to copy the entire image, which can be very time consuming. Now, we can do this very quickly:
# /usr/bin/time cp -pRdu --reflink /storage/images/os8_base.img /storage/images/vm1.img
0.00user 0.00system 0:00.02elapsed 39%CPU (0avgtext+0avgdata 2568maxresident)k
0inputs+0outputs (0major+108minor)pagefaults 0swaps
# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET          TOTAL FLAGS
   0: [0..15728639]:   52428960..68157599    1 (160..15728799) 15728640 100000
<listing shortened for brevity>
 FLAG Values:
    0100000 Shared extent
    0010000 Unwritten preallocated extent
    0001000 Doesn't begin on stripe unit
    0000100 Doesn't end on stripe unit
    0000010 Doesn't begin on stripe width
    0000001 Doesn't end on stripe width
# df -h /storage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   32G   68G  32% /storage
This was a very quick copy! Notice how the extent map on the new image file shows file data pointing to the same physical storage as the original base image, but is now marked as a shared extent, and there's about as much free space as there was before the copy. Now let's start that new VM and let it run for a little while before re-querying the block mapping:
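To make the FLAGS column easier to read, here is a small sketch that translates a FLAGS value into the descriptions from the "FLAG Values" legend in the listing above. It assumes every digit of the field is 0 or 1, as in the listings in this post. The function name `decode_flags` is hypothetical.

```shell
#!/bin/sh
# decode_flags (hypothetical helper): print one description per set
# digit in an xfs_bmap FLAGS field such as "100000". Digits are read
# right to left and matched against the "FLAG Values" legend.
decode_flags() {
    flags=$1
    pos=0
    while [ -n "$flags" ]; do
        rest=${flags%?}           # everything except the last digit
        digit=${flags#"$rest"}    # the last digit
        if [ "$digit" = 1 ]; then
            case $pos in
                0) echo "Doesn't end on stripe width" ;;
                1) echo "Doesn't begin on stripe width" ;;
                2) echo "Doesn't end on stripe unit" ;;
                3) echo "Doesn't begin on stripe unit" ;;
                4) echo "Unwritten preallocated extent" ;;
                5) echo "Shared extent" ;;
            esac
        fi
        flags=$rest
        pos=$((pos + 1))
    done
}

decode_flags 100000   # FLAGS of vm1.img right after the reflink copy
decode_flags 000000   # FLAGS of the base image before the copy (prints nothing)
```

Because the digits are counted from the right, the six-digit form printed in the extent rows and the seven-digit form printed in the legend decode the same way.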
# xfs_bmap -e -l -p -v -v -v /storage/images/os8_base.img
/storage/images/os8_base.img:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
   0: [0..15728639]:   52428960..68157599      1 (160..15728799)     15728640 100000
<listing shortened for brevity>
# xfs_bmap -e -l -p -v -v -v /storage/images/vm1.img
/storage/images/vm1.img:
 EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET              TOTAL FLAGS
   0: [0..255]:        102762656..102762911    1 (50333856..50334111)     256 000000
   1: [256..15728639]: 52429216..68157599      1 (416..15728799)     15728384 100000
<listing shortened for brevity>
# df -h /storage
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   36G   64G  36% /storage
Notice how the first 128K of the file now points elsewhere. This is evidence that the VM guest wrote to its storage, causing XFS to employ copy on write on the file so that the original base image remains unmodified. We've apparently used another 4GB of space, which is far better than the 64GB that would have been required in the old days.
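The 128K figure follows from the extent listing, because xfs_bmap reports offsets in units of 512-byte blocks. A minimal sketch of that arithmetic is shown here. The function name `range_to_kib` is hypothetical.

```shell
#!/bin/sh
# range_to_kib (hypothetical helper): convert an inclusive xfs_bmap
# range "START..END", expressed in 512-byte units, to a size in KiB.
range_to_kib() {
    start=${1%%..*}    # text before the ".."
    end=${1##*..}      # text after the ".."
    echo $(( (end - start + 1) * 512 / 1024 ))
}

range_to_kib 0..255          # the extent rewritten by copy on write; prints 128
range_to_kib 256..15728639   # the extent still shared with the base image
```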
Let's turn our attention to the second major feature for reflink: fast(er) snapshotting of directory trees. Suppose now that we want to manage containers with XFS. After a fresh formatting, create a directory tree for our container base:
# mkdir -p /storage/containers/os8_base
In the directory we just created, install a base container OS image that we will later use for fast deployment. Once that's done, we shut down the container and check that everything's in order:
# df -h /storage/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G  2.0G   98G   2% /storage
# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559    1 (11584..13759)    2176 000000
Ok, that looks like a reasonable base system. Let's use reflink to make a fast copy of this system:
# /usr/bin/time cp -pRdu --reflink=always /storage/containers/os8_base /storage/containers/container1
0.01user 0.64system 0:00.68elapsed 96%CPU (0avgtext+0avgdata 2744maxresident)k
0inputs+0outputs (0major+129minor)pagefaults 0swaps
# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559    1 (11584..13759)    2176 100000
# df -h /storage/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G  2.0G   98G   2% /storage
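With `--reflink=always`, GNU cp fails outright if the filesystem cannot share blocks. For scripts that must also work on filesystems without reflink support, one possible approach is to try the clone first and fall back to an ordinary copy. This is a sketch only, and the function name `clone_or_copy` is hypothetical.

```shell
#!/bin/sh
# clone_or_copy SRC DST (hypothetical helper): try a reflink clone
# first; if that fails, fall back to an ordinary copy.
clone_or_copy() {
    if cp --reflink=always "$1" "$2" 2>/dev/null; then
        echo "cloned"
    else
        cp "$1" "$2" && echo "copied"
    fi
}

# Demonstration on throwaway files in a temporary directory
tmpdir=$(mktemp -d)
printf 'hello\n' > "$tmpdir/src"
clone_or_copy "$tmpdir/src" "$tmpdir/dst"
cmp "$tmpdir/src" "$tmpdir/dst" && echo "contents match"
rm -rf "$tmpdir"
```

GNU cp offers the same fallback behaviour directly through `--reflink=auto`. The explicit version above only adds a report of which path was taken.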
Now we let the container runtime do some work and update (for example) the bash binary:
# xfs_bmap -e -l -p -v -v -v /storage/containers/os8_base/bin/bash
/storage/containers/os8_base/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52440384..52442559    1 (11584..13759)    2176 000000
# xfs_bmap -e -l -p -v -v -v /storage/containers/container1/bin/bash
/storage/containers/container1/bin/bash:
 EXT: FILE-OFFSET      BLOCK-RANGE          AG AG-OFFSET        TOTAL FLAGS
   0: [0..2175]:       52442824..52444999    1 (14024..16199)    2176 000000
Notice that the two copies of bash no longer share blocks. This concludes our demonstration. We hope you enjoy this major new feature!