This blog post discusses the new block atomic writes (or untorn writes) feature in Unbreakable Enterprise Kernel (UEK) 8. For further information on UEK 8, please consult the following: https://blogs.oracle.com/linux/post/oracle-linux-now-includes-the-latest-linux-kernel-with-uek-8

The feature has been developed in Linux mainline as multiple sub-features, covering the VFS layer, block layer, block drivers, MD/DM, and XFS support. These sub-features were released as follows:

 

FEATURE(S)                                    LINUX VERSION
vfs, block layer, block driver (SCSI, NVMe)   6.11
MD RAID 0/1/10                                6.13
DM linear/stripe/mirror                       6.14
XFS support                                   6.16

Note: There is a separate new feature in UEK 8 that provides XFS atomic file exchange range support – this will not be discussed here.

What problems does this solve?

Databases – like MySQL – work on a fixed internal page size, which is often larger than the filesystem block size. A typical MySQL OCI (Oracle Cloud Infrastructure) deployment will use a 16KB page size, while the filesystem will have a 4KB block size.

For background, InnoDB is a general purpose, transactional engine used by MySQL.

InnoDB logs writes to the database in what is known as the doublewrite buffer. The doublewrite buffer is a storage area where InnoDB writes pages flushed from the buffer pool before writing them to their proper positions in the InnoDB data files. If there is an operating system crash, storage subsystem failure, or unexpected MySQL process exit in the middle of a page write, InnoDB can find a good copy of the page in the doublewrite buffer during crash recovery.

The downside of using the doublewrite buffer is that data is written twice, thus increasing application latency, decreasing data throughput, and increasing wear on the storage device (and thus decreasing lifespan).

Block atomic writes provides an alternative to using the doublewrite buffer, by guaranteeing that a corrupted (or torn) InnoDB page will never be committed to the storage device, even in the case of an operating system crash, storage subsystem failure, or unexpected MySQL process exit in the middle of a write.

When block atomic writes are used by MySQL, a good copy of the page is always available from the InnoDB data files; as such, the doublewrite buffer is not required in that scenario.

For a write issued with torn write protection, if an unexpected power loss, operating system crash, or other system failure occurs, a subsequent read of the written range will return either all the old data or all the new data, but never a mix of old and new.

The storage device logical block size is the inherent limit up to which Linux guarantees writes are atomic. Block atomic writes allows that limit to be raised to many filesystem blocks.

Atomic writes are supported for both block devices and regular files.

What problems does this not solve?

Block atomic writes does not provide serialization of racing reads and writes (to the same data range). For a read racing with an atomic write, the read may return a mix of old and new data. In addition, a regular write racing with an atomic write may result in the data range being written with a mix of the two writes. Indeed, racing reads and writes to the same data range can be characterised as an application bug.

Note however that due to the atomic nature of storage technologies providing atomic write support – NVMe and SCSI – and how the Linux storage stack won’t unnecessarily split reads or writes into multiple transactions, the user may experience serialization of atomic writes and reads – but this is not guaranteed.

Block atomic writes will be supported for Linux DM (Device Mapper) and MD RAID mirror personalities. It is not guaranteed that copies of data will be committed atomically to all mirrors when the write is issued with torn write protection – technology does not exist to provide such a feature. However, the same data is committed atomically to each mirror individually. As such, it is guaranteed that each mirror will contain all the old data or all the new data (and never a mix of old and new). So, when we read from a DM or MD RAID device with a mirror-based personality, whichever mirrored copy is selected for the read will return an untorn copy. More information on Linux MD/DM support is provided in a later section.

Block atomic writes does not provide atomic file updates. This means that scattered, random-sized writes to a file cannot be committed atomically as a whole; only a single contiguous range can be committed atomically.

Only Direct I/O is supported – buffered I/O is not supported.

What is atomic write HW offload?

Atomic write HW offload is supported by both NVMe and SCSI. HW offload means that the storage device itself commits the data atomically. For Linux support, NVMe and SCSI atomic writes both provide two inseparable properties:

  1. Regular reads and writes are serialized with atomic writes.
  2. Data is committed atomically for a power fail.

It is because of 1. that the user may experience serialization of racing reads and atomic writes. As mentioned earlier, the Linux storage stack may split reads – even though this is unlikely – which is why serialization cannot be guaranteed. Indeed, atomic reads are not supported.

NVMe and SCSI have certain complementary atomic write features; therefore, special rules need to be adhered to when issuing atomic writes so that both technologies' requirements are met.

NVMe and SCSI have discoverable atomic write capabilities, which will be used to determine the block device atomic limits.

The NVMe and SCSI sd drivers need to report atomic write capabilities to the block layer. These include:

  • atomic write unit min
  • atomic write unit max
  • atomic write max bytes
  • atomic write boundary

The atomic write unit min and max are the lower and upper limits on the size supported for an atomic write. Typically the lower limit will be the logical block size. Both values need to be a power-of-2.

The block layer may merge atomic writes. The atomic write max bytes limit is the maximum size of a merged atomic write. Typically this will be the same as the unit max value, and should only differ if the max bytes value is not a power-of-2.

When issuing an atomic write, the block layer must issue a single request to the block drivers (NVMe and SCSI), otherwise the write is inherently not atomic.

The atomic write boundary is a logical block address space boundary which may be present on NVMe devices. Writes which straddle an atomic write boundary will not be issued atomically, and so this must not occur.
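
To illustrate, the following is a minimal sketch of the boundary check (the helper name is illustrative, not a kernel or libc API): a write straddles a boundary if its first and last bytes fall into different boundary-sized regions of the logical block address space.

#include <stdbool.h>
#include <stdint.h>

// Illustrative helper: returns true if a write of 'len' bytes starting at
// byte offset 'start' would straddle an atomic write boundary of 'boundary'
// bytes. A boundary of 0 means the device reports no boundary.
static bool straddles_atomic_boundary(uint64_t start, uint64_t len,
                                      uint64_t boundary)
{
    if (boundary == 0 || len == 0)
        return false;
    return (start / boundary) != ((start + len - 1) / boundary);
}

For example, with a 16KB boundary, an 8KB write starting at offset 12KB would cross the boundary at 16KB, so it would not be issued atomically.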

The atomic write limits for a block device can be read from sysfs, like:

$ cat /sys/block/sda/queue/atomic_write_max_bytes
32768
$ cat /sys/block/sda/queue/atomic_write_boundary_bytes
0
$ cat /sys/block/sda/queue/atomic_write_unit_min_bytes
512
$ cat /sys/block/sda/queue/atomic_write_unit_max_bytes
32768

A value of 0 in atomic_write_boundary_bytes means that there is no atomic write boundary.

SCSI has a dedicated command to issue an atomic write, WRITE ATOMIC (16). This command will error when an atomic write is issued which does not adhere to the atomic write capabilities of the device.

NVMe has no dedicated command: a write which adheres to the device atomic write capabilities is simply issued atomically. This is an unfortunate design, as there is no failsafe for detecting an atomic write which will not be issued atomically by the NVMe device. The NVMe driver, however, will detect if it has been sent an atomic write request which does not adhere to the device atomic write limits.

NVMe and SCSI allow a single range of data to be written atomically, i.e. there is no method to atomically write multiple ranges of data.

XFS Filesystem-Based Atomic Write Support

XFS may still support atomic writes when HW offload from the mounted block device is not available.

This method achieves atomic write behaviour by atomically updating the file range mappings. It involves a 3-step approach:

  1. Allocate new disk blocks for the atomic write, i.e. out-of-place write.
  2. Write new data with regular write.
  3. When all the data has been written, atomically update the mappings.

Because no HW offload – with its special alignment and single data range requirements – is used, the disk blocks allocated in step 1. can be located anywhere.

The XFS CoW (Copy-on-Write) reflink framework is reused to provide this out-of-place write feature. See the following for more information on XFS CoW: https://blogs.oracle.com/linux/post/xfs-data-block-sharing-reflink.

The max atomic write limit of this filesystem-based atomic write is determined by how many mappings can be updated in a single transaction in step 3. This will typically be very large compared to HW offload limits.

How does XFS provide Atomic Write support?

XFS may use either of two methods of achieving atomic write behaviour:

  1. HW offload in the storage device (when available)
  2. a filesystem-based method of atomically updating the file data range mapping

Issuing an atomic write with the HW offload method is preferred, as it is expected to be much faster.

To avail of HW offload, it must be ensured that the range of filesystem blocks being written maps to a contiguous range of aligned disk blocks, i.e. the range must be covered by a single aligned extent. XFS has no method to guarantee extent alignment and granularity (apart from rtvol). As such, when atomically writing a range of data, it is possible that the range covers multiple extents on disk. When it is not possible to use HW offload, XFS falls back on the filesystem-based method.

The filesystem-based method of atomic writes will always be used when the size of the atomic write exceeds the HW offload limit. Because of this, the user may experience a significant drop-off in atomic write performance for larger atomic write sizes.

As described, it cannot be guaranteed that HW offload will be used when available. However, for an atomic write issued with the filesystem-based method, the XFS block allocator is hinted to allocate an aligned extent. This increases the chance of a HW offload-based write being used on subsequent writes to the same file range. The file extent size hint should also be set to a size at least as large as the expected atomic write size, as in the sketch below.
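
As a rough sketch of setting that hint, the generic FS_IOC_FSSETXATTR ioctl can be used on the open file; the 16KB value below is only an example, and XFS typically expects the hint to be set before the file has any data extents:

#include <linux/fs.h>
#include <sys/ioctl.h>

// Sketch: set a 16KB extent size hint on an open XFS file descriptor so
// that subsequent allocations are more likely to yield aligned extents
// suitable for HW offload. Error handling is abbreviated.
struct fsxattr fsx;

if (ioctl(fd, FS_IOC_FSGETXATTR, &fsx))
    return error;

fsx.fsx_xflags |= FS_XFLAG_EXTSIZE;   /* enable the extent size hint */
fsx.fsx_extsize = 16 * 1024;          /* at least the expected atomic write size */

if (ioctl(fd, FS_IOC_FSSETXATTR, &fsx))
    return error;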

Software RAID and Device Mapper Atomic Writes support

Linux RAID 0/1/10 (MD) and device mapper (DM) linear, stripe, and mirror personalities will support atomic writes.

For a RAID or DM device to support atomic writes, all underlying disks must support atomic writes. Furthermore, the atomic write limits for the MD/DM device are the largest limits which can be supported by all underlying disks; e.g. if the unit max for disk #0 is 16KB and the unit max for disk #1 is 8KB, then the unit max for the MD/DM device is 8KB. Ideally the limits would be the same for all disks.

When creating an MD/DM array, nothing specific needs to be done to enable atomic writes – it is done automatically.

Note though that the atomic write unit max for the MD/DM device is limited by any stripe (chunk) size, as would be expected. This is because an atomic write spanning multiple disks cannot be issued by the HW offload method. If the chunk size is not a power-of-2, the atomic write unit max is limited to its largest power-of-2 factor. For example, for a 24KB chunk size, the largest power-of-2 factor is 8KB.
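
Since the largest power-of-2 factor of an integer is simply its lowest set bit, this limit can be computed with a one-line helper (a sketch, not an MD/DM interface):

#include <stdint.h>

// Sketch: largest power-of-2 dividing the chunk size, which caps the atomic
// write unit max of a striped MD/DM device. For a 24KB (24576 byte) chunk,
// the lowest set bit is 8192 bytes, i.e. 8KB.
static uint64_t largest_pow2_factor(uint64_t chunk_size)
{
    return chunk_size & -chunk_size;
}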

Regarding mirrored personalities, as mentioned earlier, each mirror will not always contain the same data if written by an atomic write. We are just guaranteed that each mirror will have all old or all new data, but never a mix of old and new. If it is required to have all mirrors contain the same data after the write syscall, then RWF_SYNC needs to be used to guarantee persistence to all mirrors. See below for more details on guaranteeing persistence.

Getting Atomic Write limits

The atomic write limits of a block device or regular file may be obtained with the statx syscall. For a block device, additional limits can also be read from sysfs, as already mentioned.

The STATX_WRITE_ATOMIC flag is used to request these limits, and the STATX_ATTR_WRITE_ATOMIC attribute in stx_attributes indicates whether atomic writes are supported. The limits are as follows:

/* Direct I/O atomic write limits */
__u32 stx_atomic_write_unit_min;
__u32 stx_atomic_write_unit_max;
__u32 stx_atomic_write_segments_max;
__u32 stx_atomic_write_unit_max_opt;
  • stx_atomic_write_unit_min and stx_atomic_write_unit_max are the inclusive atomic write lower and upper limits. Both values are guaranteed to be a power-of-2.
  • stx_atomic_write_segments_max is the upper limit in iovcnt for pwritev2() or aio IOCB_CMD_PWRITEV.
  • stx_atomic_write_unit_max_opt describes the upper limit of HW offload, if available. A value of 0 indicates no threshold, i.e. no HW offload support or only HW offload support.
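
A minimal sketch of querying these limits with statx follows; the file path is illustrative, and it assumes kernel and libc headers new enough to define STATX_WRITE_ATOMIC:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>

// Sketch: request the atomic write limits for a file and check that atomic
// writes are actually supported before using them.
struct statx stx = {};

if (statx(AT_FDCWD, "/mnt/file", 0, STATX_WRITE_ATOMIC, &stx))
    return error;

if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
    return error;    /* atomic writes not supported here */

printf("unit min %u, unit max %u, segments max %u\n",
       stx.stx_atomic_write_unit_min,
       stx.stx_atomic_write_unit_max,
       stx.stx_atomic_write_segments_max);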

Atomic Writes rules

  • The write must be a power-of-2 in length between stx_atomic_write_unit_min and stx_atomic_write_unit_max, inclusive.
  • The write must be at a naturally-aligned offset, e.g. at least 32KB aligned for a 32KB write.
  • The number of segments must not exceed stx_atomic_write_segments_max. Typically this value will be 1.
  • All other Direct I/O rules apply, e.g. userspace buffer must be at least LBS (Logical Block Size) aligned.
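
Putting the rules together, a hypothetical pre-flight check (the helper name is illustrative) might look like:

#include <stdbool.h>
#include <stdint.h>

// Illustrative helper: validate an atomic write against the statx-reported
// limits before issuing it.
static bool atomic_write_ok(uint64_t offset, uint32_t len,
                            uint32_t unit_min, uint32_t unit_max)
{
    if (len < unit_min || len > unit_max)
        return false;
    if (len & (len - 1))            /* length must be a power-of-2 */
        return false;
    if (offset & (len - 1))         /* offset must be naturally aligned */
        return false;
    return true;
}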

Issuing an Atomic Write

For pwritev2(), set the RWF_ATOMIC flag. Similarly, for an aio IOCB_CMD_PWRITEV request or an io_uring write, pass RWF_ATOMIC in the request's RWF flags (tracked internally by the kernel as IOCB_ATOMIC).

Note that just using RWF_ATOMIC guarantees that the data is committed as all-or-nothing. However, it does not guarantee that the data is persisted when the write syscall returns. To guarantee that the data is persisted after the write syscall, use RWF_SYNC and friends.
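
For example, with liburing the same RWF_ATOMIC value is passed as the rw_flags argument of the prepared write; a rough sketch, assuming liburing is installed and that fd and iov have been set up as in the sample usage section below:

#include <liburing.h>

// Sketch: submit a single atomic write through io_uring. RWF_ATOMIC goes in
// rw_flags; add RWF_SYNC if persistence on completion is also required.
struct io_uring ring;
struct io_uring_sqe *sqe;
struct io_uring_cqe *cqe;

io_uring_queue_init(8, &ring, 0);
sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev2(sqe, fd, &iov, 1, 0, RWF_ATOMIC);
io_uring_submit(&ring);
io_uring_wait_cqe(&ring, &cqe);
io_uring_cqe_seen(&ring, cqe);
io_uring_queue_exit(&ring);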

Configuring your system for Atomic Writes

Very little configuration is required to enable a system for atomic writes.

With respect to the block device, as mentioned earlier, the atomic write limits can be read from sysfs or with the statx syscall.

For XFS, if the filesystem instance supports reflink, then at least the filesystem-based atomic write method is available. This allows atomic writes to be used on pre-existing deployments. No XFS on-disk format updates are required.

It is possible to format and mount the filesystem for a guaranteed maximum atomic write support. This is recommended.

The mkfs.xfs support currently available is experimental, and will work like this:

// format for 16KB atomic writes
$ mkfs.xfs -i max_atomic_write=16KB /dev/sda

If the max_atomic_write option is not provided in the above example, then it is not guaranteed that 16KB atomic writes can be achieved for that filesystem.

To mount the filesystem for a required maximum atomic write support, use the mount option, like:

// mount for 16KB atomic writes
$ mount -o max_atomic_write=16KB /dev/sda mnt

The mount will be rejected if the max_atomic_write option is invalid or cannot be supported. This will not happen if the filesystem has been formatted for at least the same atomic write size.

Sample Atomic Write usage

The following is an example of usage of the atomic writes feature:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/uio.h>

// write max atomic unit for a file at offset 0
struct statx stx = {};
void *buffer;
int fd;

if (statx(AT_FDCWD, "/mnt/file", 0,
          STATX_BASIC_STATS | STATX_BTIME | STATX_WRITE_ATOMIC, &stx))
    return error;

if (!(stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC) ||
    !stx.stx_atomic_write_unit_max)
    return error;

if (posix_memalign(&buffer, 512, stx.stx_atomic_write_unit_max))
    return error;

struct iovec iov = {
    .iov_base = buffer,
    .iov_len = stx.stx_atomic_write_unit_max,
};

fd = open("/mnt/file", O_RDWR | O_DIRECT);
ssize_t datawritten = pwritev2(fd, &iov, 1, 0, RWF_SYNC | RWF_ATOMIC);

Summary

Block atomic writes provides a means to ensure that data written to a storage device will not be torn due to an operating system crash or unexpected power failure. This guarantees applications will always see a ‘good’ copy of the data on disk and so are not required to employ inefficient logging methods.