Hello! This blog post is a follow-up to our discussion of the metadata directory tree feature introduced to XFS in Linux 6.13. Here we cover the features added in 6.14 that build on that new code.
What Problems Does Linux 6.14 Solve?
Before 6.14, online fsck was not available for the realtime volume because there was no redundancy in the space allocation metadata. Furthermore, realtime files could not be cloned because there was no reference count index like the one on the data device. Both of these features have been available on the data device for years, so the disparity was a real problem. Unfortunately, bringing them to the realtime volume required a major redesign of how metadata for the realtime volume is stored and accessed.
How Does XFS Address This?
XFS uses the new metadata directory feature to add one reverse space mapping index and one extent reference count index for each realtime allocation group. Each metadata file can be found through the metadata directory tree under such names as /rtgroups/0.rmap and /rtgroups/0.refcount for the rtgroup 0 metadata.
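The naming scheme is entirely regular, so for the eight-group realtime volume formatted below, you can predict every new path. This little loop just prints the expected names; it does not touch a filesystem:

```shell
# Print the metadata directory paths for the rmap and refcount files
# of an 8-group realtime volume (names only; no filesystem access).
for g in $(seq 0 7); do
  echo "/rtgroups/$g.rmap /rtgroups/$g.refcount"
done
```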
Demonstration
Let’s format a filesystem with a metadata directory tree, a concurrency level of 8 for the rt device, user quotas, and a user-visible label.
The data device is 500GiB and the realtime volume is 4TiB. Let’s set up the filesystem to target the realtime volume whenever possible with the -d rtinherit=1 option:
$ sudo mkfs.xfs -m metadir=1,uquota /dev/nvme0n1 -f -r rtdev=/dev/nvme1n1,concurrency=8 -L frogs -d rtinherit=1
meta-data=/dev/nvme0n1 isize=512 agcount=4, agsize=32768000 blks
= sectsz=512 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=1, rmapbt=1
= reflink=1 bigtime=1 inobtcount=1 nrext64=1
= exchange=1 metadir=1
data = bsize=4096 blocks=131072000, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0, ftype=1, parent=0
log =internal log bsize=4096 blocks=64000, version=2
= sectsz=512 sunit=0 blks, lazy-count=1
realtime =/dev/nvme1n1 extsz=4096 blocks=1048576000, rtextents=1048576000
= rgcount=8 rgsize=131072000 extents
Notice this time that rmapbt=1 and reflink=1, signifying that reverse mapping and reflink are both enabled.
Let’s mount the filesystem and copy some data:
$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt
$ sudo cp -pRdu /etc /mnt/
$ sudo cp --reflink=always -pRdu /mnt/etc /mnt/copy_of_etc
Now let’s look around at a shared file:
$ filefrag -v /mnt/copy_of_etc/issue
File size of /mnt/copy_of_etc/issue is 25 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 655360014.. 655360014: 1: last,shared,eof
/mnt/copy_of_etc/issue: 1 extent found
$ filefrag -v /mnt/etc/issue
File size of /mnt/etc/issue is 25 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 655360014.. 655360014: 1: last,shared,eof
/mnt/etc/issue: 1 extent found
As you can see, the two copies of the issue file share the same block 655360014. Judging from the large physical block number, this file’s data landed in rtgroup 5. Can we verify this with the fsmap command?
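The guess is simple arithmetic: the mkfs output above reports rgsize=131072000 blocks per realtime group, so dividing the physical block number by the group size yields the group. A quick sketch, using the numbers from this demo:

```shell
# rgsize comes from the mkfs output above: 131072000 blocks per rtgroup.
rgsize=131072000
blockno=655360014
echo "rtgroup $((blockno / rgsize)), block $((blockno % rgsize)) within the group"
```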
First let’s find the inode numbers:
$ stat /mnt/copy_of_etc/issue /mnt/etc/issue
File: /mnt/copy_of_etc/issue
Size: 25 Blocks: 8 IO Block: 4096 regular file
Device: 253,1 Inode: 537649948 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-02-18 09:18:09.815899091 -0800
Modify: 2023-10-26 14:32:05.000000000 -0700
Change: 2025-02-18 16:10:23.023395759 -0800
Birth: 2025-02-18 16:10:23.011149838 -0800
File: /mnt/etc/issue
Size: 25 Blocks: 8 IO Block: 4096 regular file
Device: 253,1 Inode: 320 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2025-02-18 09:18:09.815899091 -0800
Modify: 2023-10-26 14:32:05.000000000 -0700
Change: 2025-02-18 16:10:16.102451361 -0800
Birth: 2025-02-18 16:10:16.101056970 -0800
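If you only want the inode numbers, GNU stat's --format option can print them directly instead of the full listing. A sketch against a throwaway file (the demo paths above would of course need the filesystem mounted):

```shell
# %i prints just the inode number; any existing path works as a demo.
tmp=$(mktemp)
stat -c 'inode=%i' "$tmp"
rm -f "$tmp"
```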
The inode numbers we’re looking for are 537649948 and 320. What happens when we grep the fsmap output for those?
$ sudo xfs_io -c 'fsmap -vvvv -r' /mnt/ | grep -w -E '(537649948|320)'
166: 253:3 [1048576320..1048576327]: 537649256 0..7 1 (320..327) 8 0100000
167: 253:3 [1048576320..1048576327]: 537650093 0..7 1 (320..327) 8 0100000
320: 253:3 [3145728272..3145728287]: 268435657 0..15 3 (272..287) 16 0100000
330: 253:3 [3145728320..3145728327]: 268435773 0..7 3 (320..327) 8 0100000
331: 253:3 [3145728320..3145728327]: 537650051 0..7 3 (320..327) 8 0100000
480: 253:3 [5242880112..5242880119]: 320 0..7 5 (112..119) 8 0100000
481: 253:3 [5242880112..5242880119]: 537649948 0..7 5 (112..119) 8 0100000
From lines 480 and 481, we see that both inodes map the same extent of eight 512-byte sectors (i.e. one 4K filesystem block) at offset 0. The extent starts at daddr 5242880112, which corresponds to the block 655360014 that we saw above. It is also in rtgroup 5, as guessed.
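The unit conversion is worth spelling out: fsmap reports addresses in 512-byte sectors, while filefrag reports 4096-byte filesystem blocks, so the two agree after dividing by 8:

```shell
daddr=5242880112                            # 512-byte sectors, from fsmap
echo "fs block $((daddr / 8))"              # filefrag's block number
echo "rtgroup $((daddr / 8 / 131072000))"   # 131072000 blocks per rtgroup
```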
Now, can we activate online fsck? Yes!
$ sudo xfs_scrub -d -v -n /mnt
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/mnt: using 4 threads to scrub.
Phase 2: Check internal metadata.
Info: AG 2 superblock: Optimization is possible. (scrub.c line 241)
Phase 3: Scan all inodes.
Phase 5: Check directory tree.
Phase 7: Check summary counters.
83.5GiB data used; 4.0KiB realtime data used; 2.4K inodes used.
381.5MiB data found; 5.1MiB realtime data found; 2.5K inodes found.
2.5K inodes counted; 2.5K inodes checked.
Phase 8: Trim filesystem storage.
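With scrub now covering the realtime volume, it makes sense to run it on a schedule. xfsprogs ships systemd units for this; the exact unit names depend on your distribution's packaging, so treat this as a sketch:

```shell
# Enable the periodic background scrub timer shipped with xfsprogs.
# Unit name is an assumption; check your distribution's packaging.
sudo systemctl enable --now xfs_scrub_all.timer
```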
Let’s look at the rtgroup metadata. Unmount and run the debugger to confirm that we got the features we asked for:
$ sudo umount /mnt
$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /rtgroups' -c 'ls -m /quota'
/rtgroups:
8 537383040 directory 0x0000002e 1 . (good)
10 129 directory 0x0000172e 2 .. (good)
12 537383041 regular 0x9efbcbe6 8 0.bitmap (good)
15 537383042 regular 0xede5b6d7 9 0.summary (good)
18 537383043 regular 0xee5b7172 6 0.rmap (good)
21 537383044 regular 0x1318e05a 10 0.refcount (good)
24 537383045 regular 0x9ef9cbe6 8 1.bitmap (good)
27 537383046 regular 0xece5b6d7 9 1.summary (good)
30 537383047 regular 0xee5b717a 6 1.rmap (good)
33 537383048 regular 0x9318e05a 10 1.refcount (good)
36 537383049 regular 0x9effcbe6 8 2.bitmap (good)
39 537383050 regular 0xefe5b6d7 9 2.summary (good)
42 537383051 regular 0xee5b7162 6 2.rmap (good)
45 537383052 regular 0x1318e05b 10 2.refcount (good)
48 537383053 regular 0x9efdcbe6 8 3.bitmap (good)
51 537383054 regular 0xeee5b6d7 9 3.summary (good)
54 537383055 regular 0xee5b716a 6 3.rmap (good)
57 537383056 regular 0x9318e05b 10 3.refcount (good)
60 537383057 regular 0x9ef3cbe6 8 4.bitmap (good)
63 537383058 regular 0xe9e5b6d7 9 4.summary (good)
66 537383059 regular 0xee5b7152 6 4.rmap (good)
69 537383060 regular 0x1318e058 10 4.refcount (good)
72 537383061 regular 0x9ef1cbe6 8 5.bitmap (good)
75 537383062 regular 0xe8e5b6d7 9 5.summary (good)
78 537383063 regular 0xee5b715a 6 5.rmap (good)
81 537383064 regular 0x9318e058 10 5.refcount (good)
84 537383065 regular 0x9ef7cbe6 8 6.bitmap (good)
87 537383066 regular 0xebe5b6d7 9 6.summary (good)
90 537383067 regular 0xee5b7142 6 6.rmap (good)
93 537383068 regular 0x1318e059 10 6.refcount (good)
96 537383069 regular 0x9ef5cbe6 8 7.bitmap (good)
99 537383070 regular 0xeae5b6d7 9 7.summary (good)
102 537383071 regular 0xee5b714a 6 7.rmap (good)
105 537383072 regular 0x9318e059 10 7.refcount (good)
/quota:
8 537383073 directory 0x0000002e 1 . (good)
10 129 directory 0x0000172e 2 .. (good)
12 537383074 regular 0x0ebcf2f2 4 user (good)
Compared to a filesystem with only metadir, this filesystem has two additional metadata files per realtime allocation group: the rmap and refcount btrees.
Conclusion
We’re very excited about realtime rmap and reflink landing in the upstream Linux 6.14 codebase! This completes three major XFS projects started by Oracle in the 2010s — reverse mapping to support shrink and free space defragmentation; reflink to support cloning; and online fsck to avoid downtime due to fsck.