This blog post introduces a new XFS feature in the Unbreakable Enterprise Kernel (UEK) 8: the ability to exchange arbitrary file contents atomically. The feature has been under development since 2020 and in QA since 2022, and it is now available as a technology preview in UEK 8. The code landed upstream in Linux 6.10.

Note: There is a separate new feature in UEK 8 that provides untorn file block writes to storage. That feature is covered in a separate blog post.

What Problems Does This Solve?

The first problem is that the Linux file I/O interface does not specify that writes to multiple ranges of a file must be persisted in an all or nothing fashion, which means that file contents can be inconsistent after a crash. This makes persisting data difficult for files where updates cannot be written to the end of the file with a single write() operation. A common example of this type of file is a complex non-appending data structure, such as a B-tree index or a database table.

Most programs solve this problem in one of two expensive ways:

  1. Creating a new file, writing the new contents to the new file, and renaming the new file atop the old file. This requires programs to know how to copy all relevant file attributes and extended attributes from the old file to the new file. Most don’t. After a crash, the temporary file will be left behind and must be cleaned up.

  2. Double writing all file updates – first the updates are appended to a log file and flushed; and then the updates are written into the file a second time. This is 2x write amplification, which is expensive.
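For reference, the first approach can be sketched roughly as follows. This is a minimal illustration rather than code from any particular program: the function name replace_by_rename is hypothetical, only the permission bits are carried over to the new file, and the containing directory is not fsynced.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the contents of `path` with `buf` by
 * writing a new file next to it and renaming it over the original.
 * Ownership, timestamps, ACLs and extended attributes are NOT copied,
 * which is exactly the weakness described above.
 * Returns 0 on success, -1 on error.
 */
int replace_by_rename(const char *path, const void *buf, size_t len)
{
    char tmp_path[4096];
    const char *p = buf;
    struct stat st;
    int fd, n;

    n = snprintf(tmp_path, sizeof(tmp_path), "%s.XXXXXX", path);
    if (n < 0 || (size_t)n >= sizeof(tmp_path))
        return -1;

    /* The temporary file must live on the same filesystem as `path`. */
    fd = mkstemp(tmp_path);
    if (fd < 0)
        return -1;

    /* Carry over the old file's permission bits, if an old file exists. */
    if (stat(path, &st) == 0 && fchmod(fd, st.st_mode & 07777) < 0)
        goto fail;

    while (len > 0) {
        ssize_t written = write(fd, p, len);
        if (written <= 0)
            goto fail;
        p += written;
        len -= (size_t)written;
    }

    /* The new data must be durable before the rename makes it visible. */
    if (fsync(fd) < 0)
        goto fail;
    close(fd);
    fd = -1;

    if (rename(tmp_path, path) < 0)
        goto fail;
    return 0;

fail:
    if (fd >= 0)
        close(fd);
    unlink(tmp_path);   /* the cleanup that a crash would skip */
    return -1;
}
```

A crash between mkstemp() and rename() leaves the temporary file behind, and every attribute that the helper does not copy is silently lost.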

The second problem is that some external readers of a structured file must never see an update in progress. For example, a database backup program must never see a partially written update because the table data is not consistent. POSIX file locks are not sufficient to fix this problem because they are advisory.

A third problem in this space concerns software defined storage. Database vendors have demonstrated repeatedly that they can increase the speed of database updates if the underlying storage can persist large quantities of data without tearing in the case of a crash. Software block devices might wish to advertise this ability, but if the underlying layer is a filesystem, it may need the ability to commit multiple fs blocks without tearing. For efficiency reasons, it may not be desirable to increase the block size.

How Does XFS Solve These Problems?

In the past, the XFS log only recorded updates to ondisk metadata buffers in transactions, and each transaction captured only the ondisk changes required to make one change to a single file record. The log did not record higher-level intentions, so it could not describe operations involving multiple file records. Compound operations required the ondisk format to accommodate partially complete structures; consequently, only one such operation (for extended attributes) was ever implemented. As part of the reflink project of the mid-2010s, XFS formalized the idea of tracking multi-step, high-level filesystem updates through the log. In other words, it is now possible for XFS to perform long-running compound operations cheaply.

The first step of a compound operation is that the filesystem commits to its log a first transaction containing a record of an intent to apply a transformation over an interval starting at some position X and proceeding for some length Y. That first transaction is committed and a new one is begun for the following loop:

The filesystem determines the amount of work w that can be done over the interval [X, X + Y) and attaches ondisk metadata changes to the transaction. Typically this quantity will be the maximum distance along that interval that can be performed with a single update to a metadata record. Upon finishing that work, the filesystem commits in the same transaction three things:

  • The actual metadata buffer updates created by the unit of work.
  • A record that the previously committed intent is now resolved.
  • A new record of intent for the transformation beginning at position X' = X + w and proceeding for a length Y' = Y - w, but only if Y' > 0.

This loop repeats (with a fresh transaction each time through the loop) until Y' becomes zero, at which point all work is complete. Because the progress of the work is persisted to the filesystem log, the recovery process can restart the operation after a crash.
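The loop described above can be modelled in a few lines of C. This is a toy in-memory simulation, not XFS code: struct toy_log, log_first_intent, run_one_transaction, and finish_pending_work are hypothetical names invented for illustration, and "committing a transaction" is reduced to incrementing a counter.

```c
#include <stddef.h>

/* Toy model only; none of these names are XFS internals. */
struct intent {
    unsigned long long pos;     /* X: where the remaining work starts */
    unsigned long long len;     /* Y: how much work remains */
    int pending;                /* nonzero while an intent is unresolved */
};

struct toy_log {
    struct intent intent;       /* the single outstanding intent */
    unsigned int commits;       /* number of transactions committed */
};

/* First transaction: record the intent to transform [x, x + y). */
void log_first_intent(struct toy_log *log, unsigned long long x,
                      unsigned long long y)
{
    log->intent.pos = x;
    log->intent.len = y;
    log->intent.pending = y > 0;
    log->commits++;
}

/*
 * One pass through the loop, i.e. one transaction: perform up to
 * max_step units of work, resolve the old intent, and log a new intent
 * for whatever remains. Returns 1 if more work remains, 0 otherwise.
 */
int run_one_transaction(struct toy_log *log, unsigned long long max_step)
{
    unsigned long long w;

    if (!log->intent.pending || max_step == 0)
        return 0;

    w = log->intent.len < max_step ? log->intent.len : max_step;

    /* Metadata updates for [pos, pos + w) would be attached here. */
    log->intent.pos += w;                       /* X' = X + w */
    log->intent.len -= w;                       /* Y' = Y - w */
    log->intent.pending = log->intent.len > 0;  /* new intent iff Y' > 0 */
    log->commits++;
    return log->intent.pending;
}

/* Normal operation and log recovery both drain the logged intent. */
void finish_pending_work(struct toy_log *log, unsigned long long max_step)
{
    while (run_one_transaction(log, max_step))
        ;
}
```

In this model a crash is simply stopping between calls to run_one_transaction(); because the remaining interval lives in the log, calling finish_pending_work() afterwards picks up at X' rather than starting over.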

This capability is key to supporting atomic exchanges of potentially discontiguous file content updates. If a userspace program opens a file A, it can then create an unlinked clone B in which to stage updates. After writing changes to B, the program requests an atomic exchange of A and B’s contents. Upon completion, the file update will be fully applied. If the system crashes after the first intent has been written to disk, log recovery at the next mount will resume the operation. Either way, a subsequent read of the file will see either all of the old data or all of the new data, never a mixture of the two.

An exchange operation has multiple steps:

  1. Create a temporary file to stage the new contents.
  2. Clone the target file’s contents to the temporary file and write whatever changes are desired to the temporary file.
  3. Exchange the contents.

But first, we must configure the filesystem.

Configuring the System

Because this is new functionality being offered as a technology preview, it isn’t yet enabled by default either in Oracle Linux or UEK 8. Therefore, one must specially format a data filesystem to enable the new feature. Here is an example of enabling this feature on a fresh 200TiB storage device:

$ mkfs.xfs /dev/nvme0n1 -f -i exchange=1
meta-data=/dev/nvme0n1           isize=512    agcount=200, agsize=268435455 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   
data     =                       bsize=4096   blocks=53687091000, imaxpct=1
         =                       sunit=0      swidth=0 blks 
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1 
realtime =none                   extsz=4096   blocks=0, rtextents=0

The key thing to notice here is exchange=1 in the output – this is the signal that atomic file content exchanges are ready to go.

Note: Automatic fsck requires this feature to repair file metadata; if you format with -m autofsck=1 or use the provided /usr/share/xfsprogs/mkfs/ol_autofsck_10.0.conf file, then the atomic file content exchange feature will be turned on.

The C API for this functionality is defined in xfs/xfs.h, so it is also necessary to install the xfsprogs-devel package.

Whole File Exchanges

This is what a naïve file editor’s Save command would do – clone the whole file, make some changes, and commit them even if someone else already overwrote the file.

Step 1: Create a Temporary File

This is sample code, so error handling is minimal. A robust program should check the return value of every call shown here.

First, open the file and create an unlinked staging file:

struct xfs_exchange_range exchange = { };  /* start with every field zeroed */
int real_fd, temp_fd, ret;

real_fd = open("peppers.btree", O_RDWR);
temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);

Step 2: Clone the Contents and Change Them

Clone the real file’s contents to the temporary file:

ioctl(temp_fd, FICLONE, real_fd);

Note that one does not have to copy file or extended attributes to the temporary file – it is private to the process. Now make some changes to temp_fd:

for (int i = 0; i < 100; i++) {
    pwrite(temp_fd, some_buf, some_buf_len, random());
}

At this point, real_fd hasn’t changed. If the program decides that it isn’t happy with the changes that it wrote to temp_fd, it can close the temporary file and the filesystem will erase it.
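The staging loop above ignores the return value of pwrite(), which may fail or write fewer bytes than requested. One way to harden it is sketched below. The helper names pwrite_all and stage_random_updates are hypothetical, and treating a zero-byte write as ENOSPC is an assumption made for this sketch.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all `len` bytes of `buf` at `pos`,
 * retrying after short writes and EINTR.
 * Returns 0 on success, -1 with errno set on failure.
 */
int pwrite_all(int fd, const void *buf, size_t len, off_t pos)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, pos);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            return -1;
        }
        if (n == 0) {
            errno = ENOSPC;     /* assumption: no progress means no space */
            return -1;
        }
        p += n;
        pos += n;
        len -= (size_t)n;
    }
    return 0;
}

/* The staging loop from the text, with every write checked. */
int stage_random_updates(int temp_fd, const void *buf, size_t len, int count)
{
    for (int i = 0; i < count; i++) {
        if (pwrite_all(temp_fd, buf, len, random()) < 0)
            return -1;
    }
    return 0;
}
```

If staging fails partway through, the program can simply close temp_fd; nothing has been committed to real_fd.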

Step 3: Exchange the Contents

If the program wishes to commit the random changes it just staged, then it should do so:

exchange.file1_fd     = temp_fd;
exchange.file1_offset = 0;
exchange.length       = 0;
exchange.file2_offset = 0;
exchange.flags        = XFS_EXCHANGE_RANGE_TO_EOF |
                        XFS_EXCHANGE_RANGE_DSYNC;

ret = ioctl(real_fd, XFS_IOC_EXCHANGE_RANGE, &exchange);

Note that the file1_offset and file2_offset fields are both zero, which means that the exchange starts at the beginning of both files. These offsets need not be zero, or even the same. Although the length field is set to zero, the XFS_EXCHANGE_RANGE_TO_EOF flag tells the kernel to exchange all data between the specified offsets and the end of the file and to exchange the EOF position if necessary. The XFS_EXCHANGE_RANGE_DSYNC flag instructs the kernel to persist all dirty metadata in both files before returning to userspace. All dirty data in the target file ranges are persisted to disk before the metadata changes begin.

The changes that were staged in temp_fd are now readable from real_fd. The old contents of real_fd are now attached to temp_fd.

Note also that both files are locked from reads and writes for the entire operation, so reader programs won’t see intermediate results.

The program can now close temp_fd to free all resources associated with the temporary staging file and the old file contents. It may also ftruncate() the temporary file and jump back to the FICLONE step to stage another series of updates.

If ret is zero, then the file contents have been exchanged.

Software Defined Storage

Software defined storage that wants to support untorn writes to the exported block device can use exchange-range to persist writes that are larger than a single file block or cluster. If the block device’s LBA size matches the file cluster size, no read-modify-write cycles are needed to commit the new contents. The XFS_EXCHANGE_RANGE_FILE1_WRITTEN flag can be used to avoid the initial clone and only commit the written areas of the temporary file.

Step 1: Preparation

The first step is the same as above.

Step 2: Write Changes to the Temporary File

Here’s our simplistic file write example again. Let’s suppose that the file cluster size is 4096 bytes. A real program would discover this from stat(3) and perhaps have some more interesting data to write. Assuming that real_fd is at least 1MiB in size, let’s make some cluster-aligned block updates:

loff_t pos = (random() & 127) * 4096;
pwrite(temp_fd, some_buf, 4096, pos);
pwrite(temp_fd, some_buf, 4096, pos + 8192);
pwrite(temp_fd, some_buf, 4096, pos + 16384);

Observe that the writes can be to discontiguous ranges of the file. There are no maximum write size constraints, but there is now an alignment constraint.
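A program that stages several writes needs the single offset and length that cover all of them for the exchange call. The sketch below computes that span and rejects any write that is not cluster-aligned. It assumes the alignment constraint means that every offset and length must be a multiple of the file cluster size; struct byte_range and exchange_span are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical description of one staged write. */
struct byte_range {
    uint64_t offset;
    uint64_t length;
};

/*
 * Compute the smallest range covering all `nr` staged writes.
 * Returns 0 and fills *span on success; returns -1 if there are no
 * writes, the cluster size is zero, or any write is empty or not
 * aligned to `cluster` bytes.
 */
int exchange_span(const struct byte_range *writes, int nr,
                  uint64_t cluster, struct byte_range *span)
{
    uint64_t start = UINT64_MAX;
    uint64_t end = 0;

    if (nr <= 0 || cluster == 0)
        return -1;

    for (int i = 0; i < nr; i++) {
        uint64_t w_end = writes[i].offset + writes[i].length;

        /* Assumed rule: both ends of every write sit on a boundary. */
        if (writes[i].length == 0 ||
            writes[i].offset % cluster || writes[i].length % cluster)
            return -1;

        if (writes[i].offset < start)
            start = writes[i].offset;
        if (w_end > end)
            end = w_end;
    }

    span->offset = start;
    span->length = end - start;
    return 0;
}
```

For the three 4096-byte writes above at pos, pos + 8192, and pos + 16384, the span starts at pos and is 20480 bytes long, even though only 12288 bytes were actually written.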

Step 3: Exchange the Contents

Commit the change:

exchange.file1_fd     = temp_fd;
exchange.file1_offset = pos;
exchange.length       = 20480;
exchange.file2_offset = pos;
exchange.flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                        XFS_EXCHANGE_RANGE_DSYNC;

ret = ioctl(real_fd, XFS_IOC_EXCHANGE_RANGE, &exchange);

Notice that we’ve added the XFS_EXCHANGE_RANGE_FILE1_WRITTEN flag here. This flag causes the exchange operation to skip any areas of file1 that haven’t been written. Note that the written regions do not have to be flushed to disk by the calling program.

If ret is zero, then the three written blocks within the 20KiB range starting at pos have been committed into real_fd.

Twists

If specified, the XFS_EXCHANGE_RANGE_DRY_RUN flag performs all checking to ensure that the operation could be performed, but does not perform the actual exchange. This flag can be used by application software to determine the filesystem’s capabilities, in case the program could fall back to another method (e.g. write ahead log).
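A capability probe built on the dry-run flag might look like the sketch below. The function name exchange_range_supported is hypothetical, and combining XFS_EXCHANGE_RANGE_DRY_RUN with XFS_EXCHANGE_RANGE_TO_EOF for a whole-file probe is an assumption. The preprocessor guards exist only so that the sketch also compiles on systems without the xfsprogs-devel headers, in which case it reports the feature as unavailable.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/ioctl.h>

/* Pull in the XFS definitions only when the headers are installed. */
#if defined(__has_include)
# if __has_include(<xfs/xfs.h>)
#  include <xfs/xfs.h>
# endif
#endif

/*
 * Hypothetical probe: returns 1 if a whole-file exchange between
 * real_fd and temp_fd passes the kernel's checks, 0 otherwise.
 * No file contents are modified because of the dry-run flag.
 */
int exchange_range_supported(int real_fd, int temp_fd)
{
#ifdef XFS_IOC_EXCHANGE_RANGE
    struct xfs_exchange_range probe = { 0 };

    probe.file1_fd = temp_fd;
    probe.flags = XFS_EXCHANGE_RANGE_TO_EOF |
                  XFS_EXCHANGE_RANGE_DRY_RUN;

    return ioctl(real_fd, XFS_IOC_EXCHANGE_RANGE, &probe) == 0;
#else
    /* Headers are missing or too old: treat the feature as absent. */
    (void)real_fd;
    (void)temp_fd;
    errno = EOPNOTSUPP;
    return 0;
#endif
}
```

A program could call this once after opening both files and choose its write-ahead-log path whenever the probe returns 0.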

Unlike the untorn block writes requested with RWF_ATOMIC, there are no alignment, maximum size, or contiguity requirements to use this feature.

Conclusion

As you have seen, it is now possible to commit a collection of arbitrary file updates to a file and expect that readers see either the old version or the new version. Linux 6.13 builds upon this capability to enable limited compare-and-exchange operations on file contents.

Appendix A: Whole File Exchanges

/* LICENSE: GPLv2 */
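#define _GNU_SOURCE
#include <stdio.h>      /* perror */
#include <stdlib.h>     /* random */
#include <fcntl.h>      /* open, O_TMPFILE */
#include <unistd.h>     /* pwrite, close */
#include <sys/ioctl.h>  /* ioctl */
#include <xfs/xfs.h>    /* struct xfs_exchange_range, XFS_IOC_EXCHANGE_RANGE */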

int change_peppers(void *some_buf, size_t some_buf_len)
{
    struct xfs_exchange_range exchange = { };
    int real_fd, temp_fd, ret;

    real_fd = open("peppers.txt", O_RDWR);
    if (real_fd < 0) {
        perror("peppers.txt");
        return -1;
    }

    temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
    if (temp_fd < 0) {
        perror("temporary file");
        close(real_fd);
        return -1;
    }

    ret = ioctl(temp_fd, FICLONE, real_fd);
    if (ret) {
        perror("cloning tempfile");
        goto out_close;
    }

    for (int i = 0; i < 100; i++) {
        ssize_t written = pwrite(temp_fd, some_buf, some_buf_len, random());
        if (written < 0) {
            perror("writing temp file");
            ret = written;
            goto out_close;
        }
    }

    exchange.file1_fd     = temp_fd;
    exchange.file1_offset = 0;
    exchange.length       = 0;
    exchange.file2_offset = 0;
    exchange.flags        = XFS_EXCHANGE_RANGE_TO_EOF | XFS_EXCHANGE_RANGE_DSYNC;

    ret = ioctl(real_fd, XFS_IOC_EXCHANGE_RANGE, &exchange);
    if (ret) {
        perror("file commit failed");
        goto out_close;
    }

out_close:
    close(temp_fd);
    close(real_fd);
    return ret;
}

Appendix B: Software Defined Storage

/* LICENSE: GPLv2 */
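#define _GNU_SOURCE
#include <errno.h>      /* errno, ENOSPC */
#include <stdio.h>      /* perror */
#include <stdlib.h>     /* random */
#include <fcntl.h>      /* open, O_TMPFILE */
#include <unistd.h>     /* pwrite, close */
#include <sys/ioctl.h>  /* ioctl */
#include <xfs/xfs.h>    /* struct xfs_exchange_range, XFS_IOC_EXCHANGE_RANGE */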

int atomic_write_disk(void *some_buf)
{
    struct xfs_exchange_range exchange = { };
    loff_t pos;
    ssize_t written;
    int real_fd, temp_fd, ret;

    /* disk.img must be at least 1MiB. */
    real_fd = open("disk.img", O_RDWR);
    if (real_fd < 0) {
        perror("disk.img");
        return -1;
    }

    temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
    if (temp_fd < 0) {
        perror("temp file");
        close(real_fd);
        return -1;
    }

    pos = (random() & 127) * 4096;
    /* A short write does not set errno, so preset it for perror(). */
    errno = ENOSPC;
    written = pwrite(temp_fd, some_buf, 4096, pos);
    if (written == 4096)
        written = pwrite(temp_fd, some_buf, 4096, pos + 8192);
    if (written == 4096)
        written = pwrite(temp_fd, some_buf, 4096, pos + 16384);
    if (written < 4096) {
        perror("pwrite");
        ret = -1;
        goto out_close;
    }

    exchange.file1_fd     = temp_fd;
    exchange.file1_offset = pos;
    exchange.length       = 20480;
    exchange.file2_offset = pos;
    exchange.flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                            XFS_EXCHANGE_RANGE_DSYNC;

    ret = ioctl(real_fd, XFS_IOC_EXCHANGE_RANGE, &exchange);
    if (ret) {
        perror("exchange");
        goto out_close;
    }

out_close:
    close(temp_fd);
    close(real_fd);
    return ret;
}