Hi! Back in 2020, I wrote about the online filesystem checking feature that we’re developing here at Oracle. We’ve made a lot of progress on that feature in the four years since then, so I decided it was time to write a new blog about where we are now.

In short: XFS now has self healing capabilities! In the near future, that will become autonomous self healing capabilities.

Online repair will be fully merged in Linux 6.10 and xfsprogs 6.10, which means that the kernel can rebuild all space management and directory tree metadata while the filesystem continues to run. The ability to detect and correct inconsistencies caused by latent corruption and software bugs is very exciting, because this results in far less downtime for the system administrator. The overall online fsck feature is still marked experimental because we want to make absolutely sure this feature is solid before enabling it by default for all users. Metadata checking has proven stable over the last few years, so now it is time to concentrate on performance and stability of the repair code.

Autonomous self healing with repair is still under development and will be merged after the basic functionality stabilizes.

But Why?

Quoting from the online repair design document, these are the problems that online fsck (xfs_scrub) solves:

  1. User programs suddenly lose access to the filesystem when unexpected shutdowns occur as a result of silent corruptions in the metadata. These occur unpredictably and often without warning.
  2. Users experience a total loss of service during the recovery period after an unexpected shutdown occurs.
  3. Users experience a total loss of service if the filesystem is taken offline to look for problems proactively.
  4. Data owners cannot check the integrity of their stored data without reading all of it. This may expose them to substantial billing costs when a linear media scan performed by the storage system administrator might suffice.
  5. System administrators cannot schedule a maintenance window to deal with corruptions if they lack the means to assess filesystem health while the filesystem is online.
  6. Fleet monitoring tools cannot automate periodic checks of filesystem health when doing so requires manual intervention and downtime.
  7. Users can be tricked into doing things they do not desire when malicious actors exploit quirks of Unicode to place misleading names in directories.

New Supporting Filesystem Features

Before getting into a demonstration of the new self-healing features, it is important to acknowledge several recent feature additions that support that healing:

Reverse Mapping

As I stated last time, the reverse space mapping feature in XFS provides a secondary index of storage space usage. This index is crucial both to establishing the consistency of the on-disk space management metadata and to generating new indices after problems are found, though it comes at a slight cost to performance. This feature is enabled by default as of xfsprogs 6.5.
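If your filesystem was formatted by an older xfsprogs, the feature can be enabled explicitly at format time; note that this only applies to newly created filesystems, since reformatting destroys the existing contents:

# mkfs.xfs -m rmapbt=1 /dev/sda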

Health Reporting

As of Linux 6.9, the kernel now fully captures and reports metadata health events. As xfs_scrub runs, each metadata object examined is marked as checked, and either sick or not sick, depending on whether problems were found. Upon observing obviously corrupt metadata buffers at runtime, the filesystem marks the affected object as sick. When online repair rebuilds an object, it re-checks the object after the repair completes to validate the repair and to update the object’s health status. All of this health status can be queried by the system administrator:

# xfs_spaceman -c 'health -c -n' /opt/
Health status has not been collected for this filesystem.
Please run xfs_scrub(8) to remedy this situation.

No health information has been collected for this filesystem, so we must run xfs_scrub to scan the metadata (the -n flag requests a read-only check that changes nothing):

# xfs_scrub -n /opt/
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Info: AG 1 superblock: Optimization is possible.

Now let’s try it again, with some editing for brevity:

# xfs_spaceman -c 'health -c -n' /opt/
filesystem summary counters: ok
filesystem user quota: ok
filesystem group quota: ok
filesystem project quota: ok
filesystem quota counts: ok
filesystem inode link counts: ok
AG 0 superblock: ok
AG 0 AGF header: ok
AG 0 AGFL header: ok
AG 0 AGI header: ok
AG 0 free space by block btree: ok
AG 0 free space by length btree: ok
AG 0 inode btree: ok
AG 0 free inode btree: ok
AG 0 reverse mappings btree: ok
AG 0 reference count btree: ok
<snip>
/opt inode core: ok
/opt data fork: ok
/opt directory: ok
/opt parent pointers: ok
/opt directory tree structure: ok
<snip>
/opt/git/dht11/.git/refs/remotes/origin/master inode core: ok
/opt/git/dht11/.git/refs/remotes/origin/master data fork: ok
/opt/git/dht11/.git/refs/remotes/origin/master extended attribute fork: ok
/opt/git/dht11/.git/refs/remotes/origin/master extended attributes: ok
/opt/git/dht11/.git/refs/remotes/origin/master parent pointers: ok
<snip>

You can see that we have completed and logged a comprehensive scan of the filesystem metadata. Everything has checked out, so we’re ok to continue. Note that since the 2020 blog post, the raw inode numbers in the output have been replaced with file paths. This takes us to the next topic!

Directory Parent Pointers

Allison Henderson and Catherine Hoang delivered a new XFS feature that increases the redundancy of the directory tree by adding backpointers from child files to directory parents. These backpointers encode a filesystem handle to the parent directory as well as the name used in that directory. If a directory’s entries become corrupted, online fsck scans all files in the filesystem for parent pointers referencing the corrupt directory. Each parent pointer becomes a directory entry in a new temporary directory, and then the directory contents are exchanged to complete the repair. With this feature enabled, XFS can also report file paths for any file on the filesystem.

For example, given a file that is hardlinked into multiple directory trees, we can now find out which directory trees:

# xfs_io -c 'parent -p' /opt/PPTRS/SUB3/moofile
/opt/PPTRS/moofile
/opt/PPTRS/SUB0/moofile
/opt/PPTRS/SUB1/moofile
/opt/PPTRS/SUB2/moofile
/opt/PPTRS/SUB3/moofile

Now we know that PPTRS and SUB0 through SUB3 all contain links to this moofile. xfs_scrub also uses this functionality to improve reporting:

# xfs_scrub -n /opt/
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Info: /opt/S_IFDIR.FMT_NODE/motd: Optimizations of inode record are possible.
537.7MiB data used;  38.9K inodes used.
118.3MiB data found; 38.9K inodes found.

Reporting file paths (instead of inode numbers) reduces the amount of work that a system administrator must do to identify the source of problems.

This feature is new for Linux 6.10, so it is marked EXPERIMENTAL and not yet enabled by default. To try out this feature, pass the parameter -n 'parent=1' to mkfs.xfs when formatting a new filesystem.

Exchanges of File Data Ranges

A key design point of online repair is that all repairs must commit atomically, which is to say that partially constructed data structures must never be exposed to users. Either we commit a correct structure, or we change nothing at all. If we do commit, then the new structure must survive permanently. In other words, repairs of file-based data structures (e.g. directories, extended attributes, symbolic links) must construct a completely new data structure and exchange it with the old one in an atomic and durable fashion.

To satisfy this requirement, XFS now has a feature to initiate an exchange of data (or extended attribute) blocks between two files. Progress of this high-level exchange operation is tracked in the XFS log so that log recovery can complete the exchange if the system goes down midway. The ability to exchange file contents has uses outside of filesystem repair: a database program could, in theory, update an indexed table file atomically by reflinking the table file into a temporary file, writing potentially sparse updates to the temporary file, and then using the new file range exchange ioctl to commit the changes to the original table file.

An example C program might look like this (error handling elided for clarity):

#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* FICLONE */
#include <xfs/xfs.h>    /* XFS_IOC_EXCHANGE_RANGE, from the xfsprogs headers */

int main(void)
{
    static char data1[1000000], data2[320];
    /* ... fill data1 and data2 with new records and index entries ... */

    int fd = open("/some/file", O_RDWR);

    /* O_TMPFILE creates an anonymous file in /some; note the mode argument */
    int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);

    /* share the original file's blocks with the temporary file */
    ioctl(temp_fd, FICLONE, fd);

    /* append 1MB of records */
    lseek(temp_fd, 0, SEEK_END);
    write(temp_fd, data1, 1000000);

    /* update record index */
    pwrite(temp_fd, data1, 600, 98765);
    pwrite(temp_fd, data2, 320, 54321);
    pwrite(temp_fd, data2, 15, 0);

    /* commit the entire update by exchanging contents from offset 0 to EOF */
    struct xfs_exchange_range args = {
        .file1_fd = temp_fd,
        .flags = XFS_EXCHANGE_RANGE_TO_EOF,
    };
    ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

    /* discard the temporary file; O_TMPFILE files vanish on last close */
    close(temp_fd);
    return 0;
}
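Note that log recovery completes an in-progress exchange rather than tearing it down, so after a crash the table file contains either the complete old contents or the complete new contents, never a mix. And because the scratch file was created with O_TMPFILE, it never appears in the directory tree and is freed automatically once the last file descriptor is closed.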

This feature is new for Linux 6.10, so it is marked EXPERIMENTAL and not yet enabled by default. To try out this feature, pass the parameter -i 'exchange=1' to mkfs.xfs when formatting a new filesystem.

Demonstrations

Let’s see the online repair system in action!

Online File System Repair

Online filesystem repair must be built into the Linux kernel at compile time by enabling the CONFIG_XFS_ONLINE_REPAIR kernel option. Your Linux distributor must enable the kernel option and provide the userspace program xfs_scrub for the feature to work.

On Debian and Ubuntu systems, the xfs_scrub program is shipped in the regular xfsprogs package. On Red Hat and Fedora systems, it is shipped in the xfsprogs-xfs_scrub package, which must be installed separately. You can, of course, compile the kernel and userspace tools from source.
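To verify that the running kernel was built with online repair before relying on it, check its build configuration; this sketch assumes your distribution installs the kernel config file under /boot, as most do:

# grep XFS_ONLINE_REPAIR /boot/config-$(uname -r)
CONFIG_XFS_ONLINE_REPAIR=y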

Let us begin by creating a test filesystem and populating it with a directory tree:

# mkfs.xfs -f -i exchange=1 -n parent=1 /dev/sda > /dev/null
# mount /dev/sda /opt
# echo testdata > /opt/a
# mkdir -p "/opt/some/victimdir"
# for ((i = 0; i < 1000; i++)); do
    fname="$(printf "%08d" "$i")"
    ln /opt/a /opt/some/victimdir/$fname
done
# ls /opt/some/victimdir | wc -l
1000

Now let’s corrupt the second block of the directory and try to read it again:

# umount /opt
# xfs_db -x -c 'path /some/victimdir' -c 'dblock 1' -c 'blocktrash -z -0 -o 0 -x 2048 -y 2048 -n 2048' /dev/sda
blocktrash: seed 1718292864
blocktrash: 1/13 unknown block 2048 bits starting 0:0 zeroed
# mount /dev/sda /opt
# ls /opt/some/victimdir | wc -l
ls: reading directory '/opt/some/victimdir': Structure needs cleaning
166

Next, query the health reporting system to make sure that it logged the corruption:

# xfs_spaceman -c 'health -c -q -n' /opt
/opt/some/victimdir directory: unhealthy

Ok, so something is wrong. Check the metadata to see if there are any other problems:

# xfs_scrub -n /opt
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Info: AG 3 superblock: Optimization is possible.
Info: AG 1 superblock: Optimization is possible.
Info: AG 2 superblock: Optimization is possible.
Corruption: /opt/some/victimdir directory entries: Repairs are required.
Info: /opt/a parent pointer: Cross-referencing failed.
Info: /opt: Optimizations of extended attributes are possible.
Info: inode link counts: Check incomplete.
Info: inode link counts: Cross-referencing failed.
Info: /opt: Filesystem has errors, skipping connectivity checks.
483.8MiB data used;  6 inodes used.
64.4MiB data found; 6 inodes found.
/opt: corruptions found: 1
/opt: Re-run xfs_scrub without -n.

xfs_scrub checked the directory and reported the corruption, which is as it should be. Notice that the parent pointer check for the testdata file also complained that it could not cross-reference one of the parent pointers against the parent directory. Next, let’s try repairing the filesystem:

# xfs_scrub /opt
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Info: /opt/a parent pointer: Cross-referencing failed.
Info: /opt/some/victimdir directory entries: Attempting repair.
Repaired: /opt/some/victimdir directory entries: Repairs successful.
483.8MiB data used;  6 inodes used.
64.4MiB data found; 7 inodes found.
/opt: repairs made: 1; optimizations made: 5.

xfs_scrub claims that it fixed the directory. Can we list the directory again?

# ls /opt/some/victimdir | wc -l
1000
# xfs_spaceman -c 'health -c -q -n' /opt

Success!
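Under the hood, xfs_scrub drives the kernel’s XFS_IOC_SCRUB_METADATA ioctl, checking (or repairing) one piece of metadata per call. Here is a minimal sketch of a single check, assuming the xfsprogs development headers are installed; xfs_scrub(8) remains the supported interface:

#include <stdio.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <xfs/xfs.h>    /* XFS_IOC_SCRUB_METADATA, struct xfs_scrub_metadata */

int main(void)
{
    /* any file descriptor on the target filesystem will do */
    int fd = open("/opt", O_RDONLY);

    /* ask the kernel to check AG 0's free space header (the AGF) */
    struct xfs_scrub_metadata sm = {
        .sm_type = XFS_SCRUB_TYPE_AGF,
        .sm_agno = 0,
    };

    if (ioctl(fd, XFS_IOC_SCRUB_METADATA, &sm) < 0)
        perror("scrub");
    else if (sm.sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
        printf("AG 0 AGF: repairs needed\n");
    else
        printf("AG 0 AGF: ok\n");
    return 0;
}

Setting XFS_SCRUB_IFLAG_REPAIR in sm_flags asks the kernel to repair the object instead of merely checking it.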

Patrol Scrub and Repair Service

As I noted in my previous blog post, xfsprogs ships with a background service that periodically checks and optimizes the filesystem. The ultimate goal is to have it repair corruptions as well, but that will not be enabled until later in the stabilization phase. The user-visible aspect of this service remains the same:

# systemctl start xfs_scrub_all.timer
# systemctl list-timers
NEXT                        LEFT          LAST                        PASSED        UNIT                                 ACTIVATES
Thu 2024-06-06 03:10:10 PDT 12h left      Wed 2024-06-05 08:22:08 PDT 6h ago        xfs_scrub_all.timer                  xfs_scrub_all.service

Administrators can configure when the service runs with systemctl edit xfs_scrub_all.timer, as shown below. However, one major change since 2020 is that sysadmins must edit the EMAIL_ADDR variable in both the xfs_scrub_fail@.service and xfs_scrub_media_fail@.service units to have failure reports emailed to the system administrator. The systemd service definitions now take fuller advantage of systemd’s sandboxing capabilities to restrict the programs’ runtime resource usage and to run them with as few privileges as possible.
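For example, a drop-in override that changes when the periodic run fires might look like this; the calendar expression here is only an illustration (see systemd.time(7) for the syntax):

# systemctl edit xfs_scrub_all.timer
[Timer]
OnCalendar=
OnCalendar=Sat *-*-* 01:30:00

The empty OnCalendar= line clears the schedule shipped in the packaged unit before the new one takes effect.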

The background patrol service is not yet enabled by default, but your distributor may choose to enable it earlier than upstream.

Autonomous Self Healing

Earlier, I mentioned the possibility of autonomous self healing of the filesystem. This final phase of the online repair project is still under development, so let’s look at a preview. Starting with the filesystem with the corrupted directory from above, let’s enable the autonomous self healing service and mount the filesystem. The service is tentatively named xfs_scrubbed:

# systemctl unmask xfs_scrubbed@.service
# systemctl enable xfs_scrubbed@.service
# mount /dev/sda /opt
# systemctl status xfs_scrubbed@opt.service
● xfs_scrubbed@opt.service - Self Healing of XFS Metadata for /opt
     Loaded: loaded (/lib/systemd/system/xfs_scrubbed@.service; disabled; preset: enabled)
     Active: active (running) since Wed 2024-06-05 08:21:39 PDT; 6h ago
       Docs: man:xfs_scrubbed(8)
    Process: 5686 ExecCondition=/usr/libexec/xfsprogs/xfs_scrubbed --check /opt (code=exited, status=0/SUCCESS)
   Main PID: 8595 (xfs_scrubbed)
      Tasks: 1 (limit: 309445)
     Memory: 11.7M
        CPU: 245ms
     CGroup: /system.slice/system-xfs_scrub.slice/xfs_scrubbed@opt.service
             └─8595 /usr/bin/python3 /usr/libexec/xfsprogs/xfs_scrubbed --repair --log /opt

The self healing service activates automatically at mount time, so it’s good to see it running. Now let’s access the corrupt directory to trigger an automatic repair:

# ls /opt/some/victimdir | wc -l
ls: reading directory '/opt/some/victimdir': Structure needs cleaning
166
# xfs_spaceman -c 'health -c -q -n' /opt
/opt/some/victimdir directory: unhealthy

As before, reading the directory failed and the filesystem recorded the corruption. Did the service notice? Scanning the log of xfs_scrubbed@opt.service (e.g. with journalctl) produces the following:

/opt/some/victimdir: {'type': 'sick', 'domain': 'inode',
'structures': ['directory'], 'inode': 16777344, 'generation': 2651534240,
'time': '2024-06-05 15:07:35.670629-07:00', 'path': 'some/victimdir'}

/opt/some/victimdir: {'type': 'sick', 'domain': 'inode',
'structures': ['directory'], 'inode': 16777344, 'generation': 2651534240,
'time': '2024-06-05 15:07:35.746629-07:00', 'path': 'some/victimdir'}

/opt/some/victimdir: {'type': 'corrupt', 'domain': 'inode',
'structures': [], 'inode': 16777344, 'generation': 2651534240,
'time': '2024-06-05 15:07:35.746629-07:00', 'path': 'some/victimdir'}

/opt/some/victimdir: directory: Repairs successful.

/opt/some/victimdir: {'type': 'healthy', 'domain': 'inode',
'structures': ['directory'], 'inode': 16777344, 'generation': 2651534240,
'time': '2024-06-05 15:07:35.790629-07:00', 'path': 'some/victimdir'}

The first “sick” log message captures the ls process stumbling over the corrupt directory. About 76 milliseconds later, xfs_scrubbed called the kernel to ask for a repair. The next two messages (“sick” and “corrupt”) reflect the kernel checking the metadata and finding errors. The fourth message is the daemon reporting that the directory was repaired successfully. The final message is the post-repair re-check logging that the directory is now healthy. Let’s see if the directory is usable again:

# ls /opt/some/victimdir | wc -l
1000
# xfs_spaceman -c 'health -c -q -n' /opt

Success! The filesystem repaired itself without needing direct intervention from the system administrator. This service is not enabled by default and, as the final phase of this long project, may not land any time soon.

Conclusion

As you have seen, online fsck in XFS is nearly complete. We hope you’ll try out these new features and let us know what you think!