Hello! This post is a follow-up to the discussion of the new metadata directory tree feature that was introduced to XFS in Linux 6.13. Here we talk about the features added in Linux 6.14 that build on that new code.

What Problems Does Linux 6.14 Solve?

Online fsck is not available for the realtime volume because there is no redundancy in the space allocation metadata. Furthermore, realtime files cannot be cloned because there is no reference count index like there is on the data device. Both of these features have been available on the data device for years, so the feature disparity has been very problematic. Unfortunately, adding them to the realtime volume required a huge redesign of how we store and access metadata for the realtime volume.

How Does XFS Address This?

XFS uses the new metadata directory feature to add one reverse space mapping index and one extent reference count index for each realtime allocation group. Each metadata file can be found through the metadata directory tree under such names as /rtgroups/0.rmap and /rtgroups/0.refcount for the rtgroup 0 metadata.

Demonstration

Let’s format a filesystem with a metadata directory tree, a concurrency level of 8 for the rt device, user quotas, and a user-visible label.

The data device is 500GiB and the realtime volume is 4TiB. Let’s set up the filesystem to target the realtime volume whenever possible with the -d rtinherit=1 option:

$ sudo mkfs.xfs -m metadir=1,uquota /dev/nvme0n1 -f -r rtdev=/dev/nvme1n1,concurrency=8 -L frogs -d rtinherit=1
meta-data=/dev/nvme0n1           isize=512    agcount=4, agsize=32768000 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=1
         =                       reflink=1    bigtime=1 inobtcount=1 nrext64=1
         =                       exchange=1   metadir=1
data     =                       bsize=4096   blocks=131072000, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1, parent=0
log      =internal log           bsize=4096   blocks=64000, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =/dev/nvme1n1           extsz=4096   blocks=1048576000, rtextents=1048576000
         =                       rgcount=8    rgsize=131072000 extents

Notice this time that rmapbt=1 and reflink=1, signifying that both features are enabled.
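
If you ever want to re-check these feature flags later, xfs_info prints the same geometry summary; a reasonably recent xfsprogs can read it straight off the unmounted device (output omitted, since it mirrors the mkfs report above):

$ sudo xfs_info /dev/nvme0n1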

Let’s mount the filesystem and copy some data:

$ sudo mount -o rtdev=/dev/nvme1n1 /dev/nvme0n1 /mnt
$ sudo cp -pRdu /etc /mnt/
$ sudo cp --reflink=always -pRdu /mnt/etc /mnt/copy_of_etc

Now let’s look around at a shared file:

$ filefrag -v /mnt/copy_of_etc/issue
File size of /mnt/copy_of_etc/issue is 25 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  655360014.. 655360014:      1:             last,shared,eof
/mnt/copy_of_etc/issue: 1 extent found
$ filefrag -v /mnt/etc/issue
File size of /mnt/etc/issue is 25 (1 block of 4096 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       0:  655360014.. 655360014:      1:             last,shared,eof
/mnt/etc/issue: 1 extent found

As you can see, the two copies of the issue file share the same block 655360014. Judging from the large physical block number, this file’s data landed in rtgroup 5. Can we verify this with the fsmap command?
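
That guess is simple arithmetic: mkfs reported rgsize=131072000 blocks, so dividing the physical block number by the group size gives the group index.

$ echo $((655360014 / 131072000))
5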

First let’s find the inode numbers:

$ stat /mnt/copy_of_etc/issue /mnt/etc/issue
  File: /mnt/copy_of_etc/issue
  Size: 25              Blocks: 8          IO Block: 4096   regular file
Device: 253,1   Inode: 537649948   Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-02-18 09:18:09.815899091 -0800
Modify: 2023-10-26 14:32:05.000000000 -0700
Change: 2025-02-18 16:10:23.023395759 -0800
 Birth: 2025-02-18 16:10:23.011149838 -0800
  File: /mnt/etc/issue
  Size: 25              Blocks: 8          IO Block: 4096   regular file
Device: 253,1   Inode: 320         Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2025-02-18 09:18:09.815899091 -0800
Modify: 2023-10-26 14:32:05.000000000 -0700
Change: 2025-02-18 16:10:16.102451361 -0800
 Birth: 2025-02-18 16:10:16.101056970 -0800

The inode numbers we’re looking for are 537649948 and 320. What happens when we grep for them?

$ sudo xfs_io -c 'fsmap -vvvv -r' /mnt/ | grep -w -E '(537649948|320)'
 166: 253:3 [1048576320..1048576327]: 537649256          0..7             1  (320..327)                  8 0100000
 167: 253:3 [1048576320..1048576327]: 537650093          0..7             1  (320..327)                  8 0100000
 320: 253:3 [3145728272..3145728287]: 268435657          0..15            3  (272..287)                 16 0100000
 330: 253:3 [3145728320..3145728327]: 268435773          0..7             3  (320..327)                  8 0100000
 331: 253:3 [3145728320..3145728327]: 537650051          0..7             3  (320..327)                  8 0100000
 480: 253:3 [5242880112..5242880119]: 320                0..7             5  (112..119)                  8 0100000
 481: 253:3 [5242880112..5242880119]: 537649948          0..7             5  (112..119)                  8 0100000

From lines 480 and 481, we see that both inodes map the same extent of 8 512-byte blocks (aka one 4k filesystem block) at offset 0. This extent has an LBA address of 5242880112, which matches the 655360014 that we saw above. It is also in rtgroup 5, as guessed.
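
If you want to check the math, fsmap reports ranges in 512-byte units, so dividing by 8 converts back to 4KiB filesystem blocks:

$ echo $((5242880112 / 8))
655360014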

Now, can we activate online fsck? Yes!

$ sudo xfs_scrub -d -v -n /mnt
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/mnt: using 4 threads to scrub.
Phase 2: Check internal metadata.
Info: AG 2 superblock: Optimization is possible. (scrub.c line 241)
Phase 3: Scan all inodes.
Phase 5: Check directory tree.
Phase 7: Check summary counters.
83.5GiB data used;  4.0KiB realtime data used;  2.4K inodes used.
381.5MiB data found; 5.1MiB realtime data found; 2.5K inodes found.
2.5K inodes counted; 2.5K inodes checked.
Phase 8: Trim filesystem storage.
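
The -n flag makes this a dry run that only reports problems. To let xfs_scrub actually repair or optimize whatever it finds, run it again without -n; the output will vary with what it finds:

$ sudo xfs_scrub -d -v /mnt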

Let’s look at the rtgroup metadata. Unmount the filesystem and run the debugger to confirm that we got the features we asked for:

$ sudo umount /mnt
$ sudo xfs_db /dev/nvme0n1 -c 'ls -m /rtgroups' -c 'ls -m /quota'
/rtgroups:
8          537383040          directory      0x0000002e   1 . (good)
10         129                directory      0x0000172e   2 .. (good)
12         537383041          regular        0x9efbcbe6   8 0.bitmap (good)
15         537383042          regular        0xede5b6d7   9 0.summary (good)
18         537383043          regular        0xee5b7172   6 0.rmap (good)
21         537383044          regular        0x1318e05a  10 0.refcount (good)
24         537383045          regular        0x9ef9cbe6   8 1.bitmap (good)
27         537383046          regular        0xece5b6d7   9 1.summary (good)
30         537383047          regular        0xee5b717a   6 1.rmap (good)
33         537383048          regular        0x9318e05a  10 1.refcount (good)
36         537383049          regular        0x9effcbe6   8 2.bitmap (good)
39         537383050          regular        0xefe5b6d7   9 2.summary (good)
42         537383051          regular        0xee5b7162   6 2.rmap (good)
45         537383052          regular        0x1318e05b  10 2.refcount (good)
48         537383053          regular        0x9efdcbe6   8 3.bitmap (good)
51         537383054          regular        0xeee5b6d7   9 3.summary (good)
54         537383055          regular        0xee5b716a   6 3.rmap (good)
57         537383056          regular        0x9318e05b  10 3.refcount (good)
60         537383057          regular        0x9ef3cbe6   8 4.bitmap (good)
63         537383058          regular        0xe9e5b6d7   9 4.summary (good)
66         537383059          regular        0xee5b7152   6 4.rmap (good)
69         537383060          regular        0x1318e058  10 4.refcount (good)
72         537383061          regular        0x9ef1cbe6   8 5.bitmap (good)
75         537383062          regular        0xe8e5b6d7   9 5.summary (good)
78         537383063          regular        0xee5b715a   6 5.rmap (good)
81         537383064          regular        0x9318e058  10 5.refcount (good)
84         537383065          regular        0x9ef7cbe6   8 6.bitmap (good)
87         537383066          regular        0xebe5b6d7   9 6.summary (good)
90         537383067          regular        0xee5b7142   6 6.rmap (good)
93         537383068          regular        0x1318e059  10 6.refcount (good)
96         537383069          regular        0x9ef5cbe6   8 7.bitmap (good)
99         537383070          regular        0xeae5b6d7   9 7.summary (good)
102        537383071          regular        0xee5b714a   6 7.rmap (good)
105        537383072          regular        0x9318e059  10 7.refcount (good)
/quota:
8          537383073          directory      0x0000002e   1 . (good)
10         129                directory      0x0000172e   2 .. (good)
12         537383074          regular        0x0ebcf2f2   4 user (good)

As compared to a filesystem with only metadir, this filesystem has twice as many metadata files per realtime allocation group: each rtgroup now has rmap and refcount files alongside its bitmap and summary files.
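
For those who want to poke further, xfs_db's inode and print commands accept the inode numbers from the listing above; for instance, this dumps the on-disk inode behind the rtgroup 0 rmap file (number taken from the 0.rmap entry; output omitted here):

$ sudo xfs_db /dev/nvme0n1 -c 'inode 537383043' -c 'print'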

Conclusion

We’re very excited about realtime rmap and reflink landing in the upstream Linux 6.14 codebase! This completes three major XFS projects started by Oracle in the 2010s: reverse mapping to support shrink and free space defragmentation, reflink to support file cloning, and online fsck to avoid the downtime of an offline fsck.