Hello! Did you know that the upstream XFS community has solved some longstanding problems in XFS by adjusting the ondisk format in Linux 6.13? The new format supports much more interesting configurations of quotas and the realtime volume.
Note: The name “realtime” in the context of the realtime volume is a misnomer — back in the 1990s it was intended for use with specialty hardware, but in the 2020s it is simply a secondary volume that does not contain filesystem metadata. Modern realtime volumes are more appropriate for odd devices such as persistent memory and zoned storage.
What Problems Does Linux 6.13 Solve?
- Free space on the realtime volume is managed with a single large free space bitmap file. This has severely limited the scalability of the realtime volume because multiple writer threads all contend on the free space bitmap file’s lock. It is not possible to shard the free space information across multiple files because there is not enough space in the superblock to store pointers to many inodes.
- Historically, the XFS mount procedure considers the quota mount options (e.g. -o uquota) to be the final word in the desired quota state. System administrators must remember the desired quota options, either through /etc/fstab or by hand when mounting during an emergency. Forgetting the options forces mount to run a quotacheck, which wastes time.
- Realtime volumes are not labeled, so tools that try to guess the contents of a file can be misled by the contents of block 0 of the first realtime file written to an XFS filesystem. Did you write an ext4 image to that file? Well, now the realtime volume looks like an ext4 filesystem to blkid and friends.
- Quota has never been supported on filesystems that have realtime volumes, even if there are no realtime files.
- Metadata files that are linked into the main filesystem can be accessed directly by userspace programs. This is extremely bad.
How Does XFS Address These Deficiencies?
XFS solves all five problems by introducing an ondisk format change known as the metadata directory tree (metadir). Instead of separate fields in the primary superblock pointing to individual metadata inodes, the superblock now points to a second root directory inode. This creates a second directory tree, solely for metadata. We will refer to this as the metadata directory tree. Metadata files can be looked up by path name underneath this metadata root directory.
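To make the path lookup concrete, here is a quick sketch of what inspecting the metadata root directory might look like, reusing the xfs_db ls -m syntax from the demonstration later in this article (the exact entries depend on which features are enabled):

$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /'
# we would expect entries such as 'rtgroups' and 'quota' here,
# matching the metadir paths discussed below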
Sharding the Realtime Volume
Metadir enables sharding because we can create paths such as /rtgroups/0.bitmap and /rtgroups/0.summary to refer to rtgroup 0’s free space bitmap and summary file, respectively. Directories are much more scalable than a static list of inode numbers, so it is now very easy to support sharding the realtime volume into hundreds of smaller pieces that writer threads can use without needing to coordinate access. Each shard is called a realtime allocation group and has a maximum size of 2^30 filesystem blocks or 1TiB of space, just like the data device.
Persistent Quota State
This inconvenience is solved by modifying the mount option parsing code: if the sysadmin does not pass any quota-related mount options, the quota state encoded in the sb_qflags field of the primary superblock is treated as the desired quota functionality. To change the quota functionality, an administrator specifies the desired state through the existing mount options, just as in previous versions of XFS. To disable quota entirely, however, an administrator must now pass the noquota mount option. No changes to the ondisk format are necessary to support this new behavior, but it is sufficiently different that it is gated by the metadir feature. Quota files are also now stored in the metadata directory tree as /quota/{user,group,project}.
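Concretely, the new behavior might look like this (a sketch with a placeholder device name; noquota is the existing XFS mount option):

$ sudo mount /dev/sda /mnt              # no quota options: quota state comes from sb_qflags
$ sudo umount /mnt
$ sudo mount -o noquota /dev/sda /mnt   # explicitly disable all quota accounting

In other words, the superblock, not the command line, is now the source of truth for quota state.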
Labeling the Realtime Volume
The first extent of the realtime volume is now reserved, and a superblock is written to the start of the device. This superblock contains enough information about the data device (e.g. filesystem label, user-visible UUID, and metadata UUID) so that the data and realtime volumes can be discovered. In the opinion of the upstream community, the reduction in confusion is worth the slight loss in capacity of the realtime volume.
Realtime Quotas
This was the easiest problem to solve — all that was necessary was to fix the remaining problems in the quota codebase to handle realtime files correctly. Most of this refactoring happened during the creation of the bigtime feature to support timestamps beyond 2038, but the realtime support wasn’t turned on until now.
Forbidding User-Accessible Metadata
All metadata files are now tagged as belonging to the metadata directory tree. They cannot be opened by handle, nor can they be opened even if they appear in the user-visible directory tree by accident.
Demonstration
Let’s format a filesystem with a metadata directory tree, a concurrency level of 8 for the rt device, user quotas, and a user-visible label. The data device is 200GiB and the realtime volume is 20TiB:
$ sudo mkfs.xfs -m metadir=1,uquota /dev/nvme0n1 -f -r rtdev=/dev/nvme1n1,concurrency=8 -L frogs
meta-data=/dev/nvme0n1 isize=512 agcount=4, agsize=13107200 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=0
= reflink=0 bigtime=1 inobtcount=1 nrext64=1
= exchange=1 metadir=1
data = bsize=4096 blocks=52428800, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=25600, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =/dev/nvme1n1 extsz=4096 blocks=5242880000, rtextents=5242880000
= rgcount=8 rgsize=655360000 extents
Notice a few new things in the output here — metadir=1 signals that the metadata directory tree is active. We told mkfs that we wanted to support 8 threads writing to realtime files, so it set rgcount=8 and rgsize=655360000 to indicate that the realtime volume has been sharded into eight groups of 2500GiB apiece.
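As a quick sanity check of that geometry, a little shell arithmetic (not an xfs tool) confirms the group size: 5242880000 realtime blocks split into 8 groups is 655360000 blocks per group, and at 4096 bytes per block that is exactly 2500GiB:

$ echo $(( 5242880000 / 8 ))
655360000
$ echo $(( 655360000 * 4096 / 1024 / 1024 / 1024 ))
2500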
Let’s mount the filesystem with no quota options:
$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt
Now let’s look around:
$ sudo xfs_quota -x -c 'report -a -h -u' /mnt
User quota on /mnt (/dev/nvme0n1)
Blocks
User ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 0 0 0 00 [------]
User quotas are active, just like we asked mkfs to do. Now let’s try again with group quotas:
$ sudo xfs_quota -x -c 'report -a -h -g' /mnt
The command prints nothing because group quotas aren’t enabled, just as we expected.
If we unmount, we can probe the data volume’s superblock:
$ sudo xfs_db -R /dev/nvme1n1 /dev/nvme0n1 -c sb -c 'print fname uuid'
fname = "frogs\000\000\000\000\000\000\000"
uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61
Does the identifying information match the realtime volume’s superblock?
$ sudo xfs_db -R /dev/nvme1n1 /dev/nvme0n1 -c rtsb -c print -c stack
magicnum = 0x46726f67
crc = 0x365f30a9 (correct)
pad = 0
fname = "frogs\000\000\000\000\000\000\000"
uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61
meta_uuid = 4c633f5f-9fef-49ff-a28e-4b2599719d61
1:
byte offset 0, length 4096
buffer block 0 (rtbno 0), 8 bbs
inode -1, dir inode -1, type rtsb
Matching superblocks, excellent! As you can see, rt block 0 contains a superblock, so the possibility of confusion is eliminated. Support in libblkid is pending.
Let’s see if the debugger finds all the metadata we’re looking for:
$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /rtgroups' -c 'ls -m /quota'
/rtgroups:
8 268640384 directory 0x0000002e 1 . (good)
10 129 directory 0x0000172e 2 .. (good)
12 268640385 regular 0x9efbcbe6 8 0.bitmap (good)
15 268640386 regular 0xede5b6d7 9 0.summary (good)
18 268640387 regular 0x9ef9cbe6 8 1.bitmap (good)
21 268640388 regular 0xece5b6d7 9 1.summary (good)
24 268640389 regular 0x9effcbe6 8 2.bitmap (good)
27 268640390 regular 0xefe5b6d7 9 2.summary (good)
30 268640391 regular 0x9efdcbe6 8 3.bitmap (good)
33 268640392 regular 0xeee5b6d7 9 3.summary (good)
36 268640393 regular 0x9ef3cbe6 8 4.bitmap (good)
39 268640394 regular 0xe9e5b6d7 9 4.summary (good)
42 268640395 regular 0x9ef1cbe6 8 5.bitmap (good)
45 268640396 regular 0xe8e5b6d7 9 5.summary (good)
48 268640397 regular 0x9ef7cbe6 8 6.bitmap (good)
51 268640398 regular 0xebe5b6d7 9 6.summary (good)
54 268640399 regular 0x9ef5cbe6 8 7.bitmap (good)
57 268640400 regular 0xeae5b6d7 9 7.summary (good)
/quota:
8 268640401 directory 0x0000002e 1 . (good)
10 129 directory 0x0000172e 2 .. (good)
12 268640402 regular 0x0ebcf2f2 4 user (good)
I count 8 bitmaps and 8 summary files in /rtgroups, and a single file for user quotas under /quota. This is exactly what we’re looking for.
Now let’s mount with the gquota option to see if group quotas come online and user quotas disappear.
$ sudo mount -o rtdev=/dev/nvme1n1,gquota /dev/nvme0n1 /mnt
Did we get what we asked for?
$ sudo xfs_quota -x -c 'report -a -h -g' /mnt
Group quota on /mnt (/dev/nvme0n1)
Blocks
Group ID Used Soft Hard Warn/Grace
---------- ---------------------------------
root 0 0 0 00 [------]
$ sudo xfs_quota -x -c 'report -a -h -u' /mnt
The second command prints nothing: group quotas are now enabled and user quotas are no longer active.
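As one last check of the persistent quota state, we can unmount and mount again without any quota options at all. If sb_qflags was updated as described earlier, the group quota report should come back by itself; this is the expected behavior, not captured output:

$ sudo umount /mnt
$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt
$ sudo xfs_quota -x -c 'report -a -h -g' /mnt
# group quotas should still be active, even though we passed no gquota option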
Conclusion
We’re very excited about metadata directory trees landing in the upstream Linux 6.13 codebase! In addition to the new functionality that is available right now, the directory tree will at long last enable us to add reverse mapping and reflink support to the realtime volume, but that’s a topic for the next blog.