Hello! Did you know that the upstream XFS community has solved some longstanding problems in XFS by adjusting the ondisk format in Linux 6.13? The new format supports much more interesting configurations of quotas and the realtime volume.

Note: The name “realtime” in the context of the realtime volume is a misnomer — back in the 1990s it was intended for use with specialty hardware, but in the 2020s it is simply a secondary volume that does not contain filesystem metadata. Modern realtime volumes are more appropriate for odd devices such as persistent memory and zoned storage.

What Problems Does Linux 6.13 Solve?

  • Free space on the realtime volume is managed with a single large bitmap file. This has severely limited the scalability of the realtime volume because multiple writer threads all contend for the bitmap file’s lock. It is not possible to shard the free space information across multiple files because there is not enough space in the superblock to store pointers to many inodes.
  • Historically, the XFS mount procedure has treated the quota mount options (e.g. -o uquota) as the final word on the desired quota state. System administrators must remember the desired quota options, either through /etc/fstab or by hand when mounting during an emergency. Forgetting the options turns quota off, and re-enabling it later forces a quotacheck, which wastes time; see the fstab sketch after this list.
  • Realtime volumes are not labeled, so tools that try to guess the contents of a block device can be misled by the contents of block 0 of the first realtime file written to an XFS filesystem. Did you write an ext4 image to that file? Well, now the realtime volume looks like an ext4 filesystem to blkid and friends.
  • Quota has never been supported on filesystems that have realtime volumes, even if there are no realtime files.
  • Metadata files that are linked into the main filesystem can be accessed directly by userspace programs. This is extremely bad.
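
To make the quota problem concrete, here is the sort of /etc/fstab entry that a sysadmin has had to keep around purely to preserve the quota state across mounts (the device names and mount point are illustrative):

/dev/nvme0n1  /mnt  xfs  rtdev=/dev/nvme1n1,uquota  0 0

Mount the filesystem from a rescue shell without that uquota option and quotas are off; turning them back on at the next mount costs a full quotacheck.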

How Does XFS Address These Deficiencies?

XFS solves all five problems by introducing an ondisk format change known as the metadata directory tree (metadir). Instead of separate fields in the primary superblock pointing to metadata inodes, the superblock now points to a second root directory inode. This creates a second directory tree, solely for metadata. We will refer to this as the metadata directory tree. Metadata files can be looked up by path name underneath this metadata root directory.
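
If you want to poke at the tree yourself, the xfs_db ls -m command used in the demonstration below operates on metadata directory paths; listing the metadata root is presumably as simple as the following (a hedged example, and the listing will vary with the filesystem’s features):

$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /'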

Sharding the Realtime Volume

Metadir enables sharding because we can create paths such as /rtgroups/0.bitmap and /rtgroups/0.summary to refer to rtgroup 0’s free space bitmap and summary file, respectively. Directories are much more scalable than a static list of inode numbers, so it is now very easy to support sharding the realtime volume into hundreds of smaller pieces that writer threads can use without needing to coordinate access. Each shard is called a realtime allocation group and has a maximum size of 2^30 filesystem blocks, much like an allocation group on the data device.
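
For a feel for the numbers, consider the 20TiB realtime volume formatted in the demonstration below: it has 5242880000 blocks, so a requested concurrency level of 8 shards it into eight equal groups:

$ echo $(( 5242880000 / 8 ))
655360000

That quotient is exactly the rgsize that mkfs reports.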

Persistent Quota State

This inconvenience is solved by modifying the mount option parsing code: if the sysadmin does not pass in any quota-related mount options, the quota state encoded in the sb_qflags field of the primary superblock is treated as the desired quota functionality. To change the available quota functionality, an administrator must specify the desired functionality through the existing mount options; this hasn’t changed from previous versions of XFS. To disable quota, however, an administrator must now pass the noquota mount option. No changes to the ondisk format are necessary to support this new behavior, but it is sufficiently different that it is gated by the metadir feature. Quota files are also now stored in the metadata directory tree as /quota/{user,group,project}.
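
Disabling quota is now an explicit act rather than an accident of forgotten options. A minimal sketch, reusing the device names from the demonstration below:

$ sudo mount -o rtdev=/dev/nvme1n1,noquota /dev/nvme0n1 /mnt

Mounting with no quota options at all, as we do later in this post, simply carries the sb_qflags state forward.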

Labeling the Realtime Volume

The first extent of the realtime volume is now reserved, and a superblock is written to the start of the device. This superblock contains enough information about the data device (e.g. filesystem label, user-visible UUID, and metadata UUID) so that the data and realtime volumes can be discovered. In the opinion of the upstream community, the reduction in confusion is worth the slight loss in capacity of the realtime volume.

Realtime Quotas

This was the easiest problem to solve: all that was necessary was to fix the remaining problems in the quota codebase so that it handles realtime files correctly. Most of this refactoring happened during the creation of the bigtime feature to support timestamps beyond 2038, but the functionality wasn’t turned on until now.
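
A minimal sketch of what this enables, assuming the filesystem is mounted at /mnt with user quotas on (the user name, file name, and limit are made up; xfs_io’s chattr +t flags an empty file as a realtime file):

$ sudo xfs_quota -x -c 'limit bhard=10g testuser' /mnt
$ sudo xfs_io -f -c 'chattr +t' -c 'pwrite 0 1m' /mnt/rtfile

Blocks that /mnt/rtfile occupies on the realtime volume are now charged against its owner’s quota, exactly as data device blocks always have been.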

Forbidding User-Accessible Metadata

All metadata files are now tagged as belonging to the metadata directory tree. They cannot be opened by handle, nor can they be opened if they somehow appear in the user-visible directory tree by accident.

Demonstration

Let’s format a filesystem with a metadata directory tree, a concurrency level of 8 for the rt device, user quotas, and a user-visible label. The data device is 200GiB and the realtime volume is 20TiB:

$ sudo mkfs.xfs -m metadir=1,uquota /dev/nvme0n1 -f -r rtdev=/dev/nvme1n1,concurrency=8 -L frogs
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=13107200 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=1
data     =                       bsize=4096   blocks=52428800, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=25600, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/nvme1n1           extsz=4096   blocks=5242880000, rtextents=5242880000
         =                       rgcount=8    rgsize=655360000 extents

Notice a few new things in the output here: metadir=1 signals that the metadata directory tree is active. We told mkfs that we wanted to support 8 threads writing to realtime files, so it set rgcount=8 and rgsize=655360000 to indicate that the realtime volume has been sharded into eight groups of 2500GiB apiece.
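
You can check the group size arithmetic yourself; with 4096-byte blocks each group is exactly 2500GiB:

$ echo $(( 655360000 * 4096 / 1024 / 1024 / 1024 ))
2500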

Let’s mount the filesystem with no quota options:

$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt

Now let’s look around:

$ sudo xfs_quota -x -c 'report -a -h -u' /mnt
User quota on /mnt (/dev/nvme0n1)
                        Blocks              
User ID      Used   Soft   Hard Warn/Grace   
---------- --------------------------------- 
root            0      0      0  00 [------]

User quotas are active, just as we asked mkfs to set up. Now let’s try again with group quotas:

$ sudo xfs_quota -x -c 'report -a -h -g' /mnt

These aren’t enabled, as expected.
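
For completeness, project quotas should be just as absent; a report with -p ought to come back empty too:

$ sudo xfs_quota -x -c 'report -a -h -p' /mnt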

If we unmount, we can probe the data volume’s superblock:

$ sudo xfs_db -R /dev/nvme1n1 /dev/nvme0n1 -c sb -c 'print fname uuid'
fname = "frogs\000\000\000\000\000\000\000"
uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61

Does the identifying information match the realtime volume’s superblock?

$ sudo xfs_db -R /dev/nvme1n1 /dev/nvme0n1 -c rtsb -c print -c stack
magicnum = 0x46726f67
crc = 0x365f30a9 (correct)
pad = 0
fname = "frogs\000\000\000\000\000\000\000"
uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61
meta_uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61
1: 
        byte offset 0, length 4096
        buffer block 0 (rtbno 0), 8 bbs
        inode -1, dir inode -1, type rtsb

Matching superblocks, excellent! As you can see, rt block 0 contains a superblock, so the possibility for confusion is eliminated. Support in libblkid is pending.

Let’s see if the debugger finds all the metadata we’re looking for:

$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /rtgroups' -c 'ls -m /quota'
/rtgroups:
8          268640384          directory      0x0000002e   1 . (good)
10         129                directory      0x0000172e   2 .. (good)
12         268640385          regular        0x9efbcbe6   8 0.bitmap (good)
15         268640386          regular        0xede5b6d7   9 0.summary (good)
18         268640387          regular        0x9ef9cbe6   8 1.bitmap (good)
21         268640388          regular        0xece5b6d7   9 1.summary (good)
24         268640389          regular        0x9effcbe6   8 2.bitmap (good)
27         268640390          regular        0xefe5b6d7   9 2.summary (good)
30         268640391          regular        0x9efdcbe6   8 3.bitmap (good)
33         268640392          regular        0xeee5b6d7   9 3.summary (good)
36         268640393          regular        0x9ef3cbe6   8 4.bitmap (good)
39         268640394          regular        0xe9e5b6d7   9 4.summary (good)
42         268640395          regular        0x9ef1cbe6   8 5.bitmap (good)
45         268640396          regular        0xe8e5b6d7   9 5.summary (good)
48         268640397          regular        0x9ef7cbe6   8 6.bitmap (good)
51         268640398          regular        0xebe5b6d7   9 6.summary (good)
54         268640399          regular        0x9ef5cbe6   8 7.bitmap (good)
57         268640400          regular        0xeae5b6d7   9 7.summary (good)
/quota:
8          268640401          directory      0x0000002e   1 . (good)
10         129                directory      0x0000172e   2 .. (good)
12         268640402          regular        0x0ebcf2f2   4 user (good)

I count 8 bitmaps and 8 summary files in /rtgroups, and a single file for user quotas under /quota. This is exactly what we’re looking for.

Now let’s mount with the gquota option to see if group quotas come online and user quotas disappear.

$ sudo mount -o rtdev=/dev/nvme1n1,gquota /dev/nvme0n1 /mnt

Did we get what we asked for?

$ sudo xfs_quota -x -c 'report -a -h -g' /mnt
Group quota on /mnt (/dev/nvme0n1)
                        Blocks              
Group ID     Used   Soft   Hard Warn/Grace   
---------- --------------------------------- 
root            0      0      0  00 [------]

$ sudo xfs_quota -x -c 'report -a -h -u' /mnt

Group quotas are enabled and user quotas are no longer enabled.
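
As a final check on the persistent quota state, unmounting and remounting with no quota options at all should preserve the group quota setup we just established, since it is now recorded in sb_qflags. A sketch of that round trip:

$ sudo umount /mnt
$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt
$ sudo xfs_quota -x -c 'report -a -h -g' /mnt

We would expect the group quota report to come back without a quotacheck.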

Conclusion

We’re very excited about metadata directory trees landing in the upstream Linux 6.13 codebase! In addition to the new functionality that is available right now, the metadata directory tree will at long last enable us to add reverse mapping and reflink support to the realtime volume, but that’s a topic for the next blog post.