In a separate blog post, Atomic File Content Exchange, I discussed the new atomic file content exchange capability that landed in Linux 6.10 and the Unbreakable Enterprise Kernel 8. This blog post discusses a more useful variant of that interface that landed in Linux 6.13: atomic commit of file contents. In this context, the difference between “commit” and “exchange” is that the commit ioctl exchanges the two files’ contents only if the target file has not been modified since an earlier point in time that the caller sampled.

What Problem Does This Solve?

Astute readers of that first blog post might have noticed a gap in the functionality of the XFS_IOC_EXCHANGE_RANGE ioctl – any program using that functionality must coordinate writes to the file or else overlapping updates can collide. For programs that implement their own file caches (e.g. databases) this isn’t a big deal because they must already coordinate access to and writeback of dirty cache buffers.

However, there is a certain class of programs that might prefer something like a “compare and swap” operation, which only performs the exchange if the file hasn’t otherwise been modified. The XFS defragmentation program requires this behavior, because it copies the contents of a fragmented file to a donor file, but it should only exchange the contents if the first file has not been modified.

How Does XFS Solve This Problem?

The mechanism to exchange the file contents is the same one powering the atomic file exchange interface. Unlike a compare and swap operation on the CPU, we don’t want to require the calling program to pass in both the new contents and a copy of what it thinks are the old file contents, because files on XFS can be 8EiB in size. Userspace must therefore sample a proxy for file writes (metadata that changes whenever the file is written) and pass that sample back to the kernel when it performs the commit.
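
To make the proxy idea concrete, here is a minimal in-memory sketch. The toy_file structure, its generation counter, and the toy_* functions are hypothetical names invented for illustration. The kernel's real freshness data is opaque and should not be assumed to be a simple counter. The sketch only shows the shape of the protocol: sample first, stage changes elsewhere, and commit only if the sample still matches.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical in-memory stand-in for a file; not an XFS structure. */
struct toy_file {
    unsigned long generation;   /* bumped on every write: the proxy */
    char contents[64];
};

/* Sample the proxy instead of copying the whole contents. */
unsigned long toy_sample(const struct toy_file *f)
{
    return f->generation;
}

/* Any write changes the proxy, which invalidates older samples. */
void toy_write(struct toy_file *f, const char *data)
{
    assert(strlen(data) < sizeof(f->contents));
    strcpy(f->contents, data);
    f->generation++;
}

/* Install the staged contents only if nothing was written since the sample. */
int toy_commit(struct toy_file *f, unsigned long sample, const char *staged)
{
    if (f->generation != sample) {
        errno = EBUSY;          /* lost a race with another writer */
        return -1;
    }
    toy_write(f, staged);
    return 0;
}
```

A commit with a fresh sample succeeds, while reusing the same sample afterwards fails with EBUSY because the first commit was itself a write.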

A commit operation has multiple steps:

  1. Create a temporary file to stage the new contents.
  2. Sample representative metadata about the target.
  3. Clone the target file’s contents to the temporary file and write whatever changes are desired to the temporary file.
  4. Commit the changes.

Filesystem configuration is the same as it is for the atomic file exchange interface (passing -i exchange=1 to mkfs.xfs), so I will not detail that here. Please see the previous blog post for more details.

Whole File Commits

This is roughly what a simple file editor’s Save command would do – clone the whole file, make some changes, and try to commit those changes. If the commit fails due to a race, then it can try to resolve the conflict and try again.

Step 1: Creating a Temporary File

This is sample code, so error handling is omitted here. A robust program should check the return value of every call, as the listings in the appendices do.

First, open the file and create an unlinked staging file:

struct xfs_commit_range commit = { };
int real_fd, temp_fd, ret;

real_fd = open("diary.txt", O_RDWR);
temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
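
The snippet above opens the staging file in "." because diary.txt lives there. For a target elsewhere, one option is to create the staging file in the target's own directory, on the assumption that the clone and commit steps need both files on the same filesystem. The sketch below uses hypothetical helper names (staging_dir, open_staging_file) and relies only on dirname(3) and open(2).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <libgen.h>
#include <limits.h>
#include <string.h>

/* Hypothetical helper: copy the directory part of target into dir. */
int staging_dir(const char *target, char *dir, size_t dirlen)
{
    char copy[PATH_MAX];
    const char *d;

    assert(target != NULL && dir != NULL);
    if (strlen(target) >= sizeof(copy))
        return -1;
    strcpy(copy, target);       /* dirname() may modify its argument */
    d = dirname(copy);
    if (strlen(d) >= dirlen)
        return -1;
    strcpy(dir, d);
    return 0;
}

/* Hypothetical helper: open an unlinked staging file next to target. */
int open_staging_file(const char *target)
{
    char dir[PATH_MAX];

    if (staging_dir(target, dir, sizeof(dir)))
        return -1;
    return open(dir, O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
}
```

With this, open_staging_file("diary.txt") behaves like the open(".") call above, and a path such as /var/db/table.db would stage its file in /var/db.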

Step 2: Sampling the Target

In this step, the program asks the kernel to start a commit by sampling some of the target file’s metadata:

ioctl(real_fd, XFS_IOC_START_COMMIT, &commit);

Programs should not access the file2_freshness region of struct xfs_commit_range because the contents are implementation specific and may change. As of Linux 6.13, the sampled metadata include the status change time (aka the ctime), which means that the multi-grain timestamp code is required for correct detection of file writes.

Step 3: Clone and Update

Next, clone the target file’s contents to the temporary file, and write any changes as desired. This is a simplistic example:

ioctl(temp_fd, FICLONE, real_fd);

for (int i = 0; i < 100; i++) {
    pwrite(temp_fd, some_buf, some_buf_len, random());
}

Observe that the writes can be to discontiguous ranges of the file. There are no alignment or maximum write size constraints.

This code is exactly the same as for exchange-range. It is also not necessary to copy file or extended attributes to the temporary file.
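
The loop above ignores the return value of pwrite, which may write fewer bytes than requested. A hedged sketch of a wrapper that keeps writing until the whole buffer has been staged follows. The name pwrite_all is hypothetical, and treating a zero-byte write as ENOSPC is an assumption borrowed from the appendix listings.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write all len bytes at pos, or fail with errno set. */
int pwrite_all(int fd, const void *buf, size_t len, off_t pos)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, pos);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing; retry */
            return -1;
        }
        if (n == 0) {
            errno = ENOSPC;     /* assumption: no progress means no space */
            return -1;
        }
        p += n;                 /* advance past the bytes already written */
        pos += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Each pwrite call in the staging loop could then be replaced with pwrite_all(temp_fd, some_buf, some_buf_len, offset) followed by a check for a nonzero return.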

Step 4: Commit the Changes

In the final step, the user program gives the sampled metadata and the new contents to the kernel and asks it to exchange the contents if the target file has not changed:

commit.file1_fd     = temp_fd;
commit.file1_offset = 0;
commit.length       = 0;
commit.file2_offset = 0;
commit.flags        = XFS_EXCHANGE_RANGE_TO_EOF |
                      XFS_EXCHANGE_RANGE_DSYNC;

ret = ioctl(real_fd, XFS_IOC_COMMIT_RANGE, &commit);

Here, we’ve set both offsets to 0 and supplied the XFS_EXCHANGE_RANGE_TO_EOF flag to indicate that we want to commit all changes in the file, including EOF changes.

If ret is zero, then the new contents have been moved from temp_fd to real_fd and the old contents have been moved from real_fd to temp_fd. temp_fd can be closed to dispose of the old contents.

If ret is nonzero and errno is set to EBUSY, then the update lost a race with another write to real_fd. The application may decide to restart the operation at step 2.
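
Restarting on EBUSY can loop forever if the file is under constant write load, so a program may want to bound the number of attempts. The following is a minimal sketch under that assumption. retry_on_ebusy, attempt_fn, and lose_n_races are hypothetical names, and lose_n_races is only a stand-in for a real function that would perform steps 2 through 4.

```c
#include <assert.h>
#include <errno.h>

/* One full attempt (sample, stage, commit); returns 0 or -1 with errno set. */
typedef int (*attempt_fn)(void *ctx);

/* Hypothetical helper: retry only when the attempt lost a race (EBUSY). */
int retry_on_ebusy(attempt_fn attempt, void *ctx, int max_tries)
{
    assert(max_tries > 0);
    for (int i = 0; i < max_tries; i++) {
        int ret = attempt(ctx);

        if (ret == 0 || errno != EBUSY)
            return ret;         /* success, or an error retrying won't fix */
    }
    errno = EBUSY;              /* every attempt lost the race */
    return -1;
}

/* Stand-in attempt: loses the race until *remaining reaches zero. */
int lose_n_races(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        errno = EBUSY;
        return -1;
    }
    return 0;
}
```

A caller that exhausts its attempts still sees EBUSY and can decide whether to back off, merge the conflicting changes, or report the failure.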

Slim-Lock Database Update

A database could choose to store its data in a file with blocks that are aligned to file allocation unit (i.e. cluster) boundaries. If the database design allows for issuing multiple transactions and retrying ones that fail, it might choose to stage updates in a temporary file and rely on the filesystem to commit those changes or reject them due to races.

For this case, the XFS_EXCHANGE_RANGE_FILE1_WRITTEN flag can be used to avoid the initial clone and only commit the written areas of the temporary file. Commit-range is only useful if there are multiple independent writers to the file; if writes are coordinated, then the extra sampling is not necessary.

Steps 1 and 2: Preparation

The first two steps are the same as above, but steps 3 and 4 are different.

Step 3: Cluster-Aligned Updates

Here’s our simplistic file write example again. Let’s suppose that the file cluster size is 4096 bytes. A real program would discover this from stat(2). Assuming that the file behind real_fd is at least 1MiB in size, let’s make some cluster-aligned block updates:

loff_t pos = (random() & 127) * 4096;
pwrite(temp_fd, some_buf, 4096, pos);
pwrite(temp_fd, some_buf, 4096, pos + 8192);
pwrite(temp_fd, some_buf, 4096, pos + 16384);

Observe that the writes can be to discontiguous ranges of the file. There are no maximum write size constraints, but there is now an alignment constraint.

Step 4: Commit the Changes

Here’s the code to commit the changes:

commit.file1_fd     = temp_fd;
commit.file1_offset = pos;
commit.length       = 20480;
commit.file2_offset = pos;
commit.flags        = XFS_EXCHANGE_RANGE_DSYNC |
                      XFS_EXCHANGE_RANGE_FILE1_WRITTEN;

ret = ioctl(real_fd, XFS_IOC_COMMIT_RANGE, &commit);

Notice that we’ve added the XFS_EXCHANGE_RANGE_FILE1_WRITTEN flag here. This flag causes the exchange operation to skip any areas of file1 that haven’t been written. Note that the written regions do not have to be flushed to disk by the calling program.

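If ret is zero, then the three written 4KiB blocks within the 20KiB range starting at pos have been committed into real_fd. The unwritten gaps between them are skipped.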

If ret is nonzero and errno is set to EBUSY, then the update lost a race with another write to real_fd. The application may decide to restart the operation at step 2.
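
The offset and length used in the commit above were worked out by hand. A program that stages a variable set of blocks could derive them instead. The sketch below is one way to do that, assuming every staged block is exactly one cluster long. byte_range and covering_range are hypothetical names and are not part of the XFS headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: the span to pass as offset and length. */
struct byte_range {
    uint64_t offset;
    uint64_t length;
};

/*
 * Compute the smallest range covering nr one-cluster blocks.
 * Returns -1 if there are no blocks or any block is not cluster aligned.
 */
int covering_range(const uint64_t *block_offsets, size_t nr,
                   uint64_t cluster, struct byte_range *out)
{
    uint64_t lo, hi;

    assert(out != NULL);
    if (nr == 0 || cluster == 0)
        return -1;

    lo = hi = block_offsets[0];
    for (size_t i = 0; i < nr; i++) {
        if (block_offsets[i] % cluster)
            return -1;          /* violates the alignment constraint */
        if (block_offsets[i] < lo)
            lo = block_offsets[i];
        if (block_offsets[i] > hi)
            hi = block_offsets[i];
    }

    out->offset = lo;
    out->length = hi + cluster - lo;    /* end of the last block minus start */
    return 0;
}
```

For blocks at pos, pos + 8192, and pos + 16384 with a 4096-byte cluster, this yields an offset of pos and a length of 20480, matching the values in step 4.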

Conclusion

As you have seen, it is now possible to commit a collection of arbitrary updates to a file only if that file has not been modified in the meantime, and to expect that readers see either the old version or the new version.

Appendix A: Whole File Commit

/* LICENSE: GPLv2 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <xfs/xfs.h>

int commit_peppers(void *some_buf, size_t some_buf_len)
{
    struct xfs_commit_range commit = { };
    int real_fd, temp_fd, ret;

    real_fd = open("diary.txt", O_RDWR);
    if (real_fd < 0) {
        perror("diary.txt");
        return -1;
    }

restart:
    temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
    if (temp_fd < 0) {
        perror("temp file");
        close(real_fd);
        return -1;
    }

    ret = ioctl(real_fd, XFS_IOC_START_COMMIT, &commit);
    if (ret) {
        perror("start commit");
        goto out_close;
    }

    ret = ioctl(temp_fd, FICLONE, real_fd);
    if (ret) {
        perror("clone");
        goto out_close;
    }

    /* generate some data to commit */
    for (int i = 0; i < 100; i++) {
        ssize_t written;

        errno = ENOSPC;
        written = pwrite(temp_fd, some_buf, some_buf_len, random());
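        /* Check for -1 first; a bare comparison would promote it to a huge unsigned value. */
        if (written < 0 || (size_t)written < some_buf_len) {
            perror("pwrite");
            ret = -1;
            goto out_close;
        }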
    }

    commit.file1_fd     = temp_fd;
    commit.file1_offset = 0;
    commit.length       = 0;
    commit.file2_offset = 0;
    commit.flags        = XFS_EXCHANGE_RANGE_TO_EOF |
                          XFS_EXCHANGE_RANGE_DSYNC;

    ret = ioctl(real_fd, XFS_IOC_COMMIT_RANGE, &commit);
    if (ret && errno == EBUSY) {
        fprintf(stderr, "raced with another writer, retrying\n");
        close(temp_fd);
        goto restart;
    }
    if (ret) {
        perror("commit");
        goto out_close;
    }

out_close:
    close(temp_fd);
    close(real_fd);
    return ret;
}

Appendix B: Slim-Lock Database Update

/* LICENSE: GPLv2 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>

int maybe_write_db(void *some_buf)
{
    struct xfs_commit_range commit = { };
    loff_t pos;
    ssize_t written;
    int real_fd, temp_fd, ret;

    /* table.db must be at least 1MiB */
    real_fd = open("table.db", O_RDWR);
    if (real_fd < 0) {
        perror("table.db");
        return -1;
    }

restart:
    temp_fd = open(".", O_TMPFILE | O_RDWR | O_EXCL | O_CLOEXEC, 0600);
    if (temp_fd < 0) {
        perror("temp file");
        close(real_fd);
        return -1;
    }

    ret = ioctl(real_fd, XFS_IOC_START_COMMIT, &commit);
    if (ret) {
        perror("start commit");
        goto out_close;
    }

    /* generate some data to commit */
    pos = (random() & 127) * 4096;
    errno = ENOSPC;
    written = pwrite(temp_fd, some_buf, 4096, pos);
    if (written == 4096)
        written = pwrite(temp_fd, some_buf, 4096, pos + 8192);
    if (written == 4096)
        written = pwrite(temp_fd, some_buf, 4096, pos + 16384);
    if (written < 4096) {
        perror("pwrite");
        ret = -1;   /* ret is still 0 from the start-commit call */
        goto out_close;
    }

    commit.file1_fd     = temp_fd;
    commit.file1_offset = pos;
    commit.length       = 20480;
    commit.file2_offset = pos;
    commit.flags        = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                          XFS_EXCHANGE_RANGE_DSYNC;

    ret = ioctl(real_fd, XFS_IOC_COMMIT_RANGE, &commit);
    if (ret && errno == EBUSY) {
        fprintf(stderr, "raced with another writer, retrying\n");
        close(temp_fd);
        goto restart;
    }
    if (ret) {
        perror("commit");
        goto out_close;
    }

out_close:
    close(temp_fd);
    close(real_fd);
    return ret;
}