News, tips, partners, and perspectives for the Oracle Linux operating system and upstream Linux kernel work

XFS - Online Filesystem Checking

Matt Keenan
Principal Software Engineer

XFS upstream maintainer Darrick Wong provides another installment, this time focusing on how XFS helps sysadmins keep their filesystems healthy.

Since Linux 4.17, I have been working on an online filesystem checking feature for XFS. As I mentioned in the previous update, the online fsck tool (named xfs_scrub) walks all internal filesystem metadata records. Each record is checked for obvious corruptions before being cross-referenced with all other metadata in the filesystem. If problems are found, they are reported to the system administrator through both xfs_scrub and the health reporting system.

As of Linux 5.3 and xfsprogs 5.3, online checking is feature complete and has entered the stabilization and performance optimization stage. For the moment it remains tagged experimental, though it should be stable. We seek early adopters to try out this new functionality and give us feedback.

Health Reporting

Under development since Linux 5.2 is a new metadata health reporting feature. In its current draft form, it collects checking and corruption reports from the online filesystem checker and can report them to userspace via the xfs_spaceman health command. Soon, we will begin connecting it to all the other places in the XFS codebase where we test for metadata problems, so that administrators can find out whether a filesystem has observed any errors during operation.

Reverse Mapping

Three years ago, I also introduced the reverse space mapping feature to XFS. At its core is a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy.

The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness.

To try out this feature, pass the parameter -m rmapbt=1 to mkfs.xfs when formatting a new filesystem.

Online Filesystem Repair

Work has continued on online repair over the past two years. The basic core of how it works has not changed (we use reverse mapping information to reconnect damaged primary metadata), but our rigorous review processes have revealed other areas of XFS that could be improved significantly ahead of landing online repair support.

For example, the offline repair tool (xfs_repair) rebuilds the filesystem btrees in bulk by regenerating all the records in memory and then writing out fully formed btree blocks all at once. The original online repair code would rebuild indices one record at a time to avoid running afoul of other transactions, which was not efficient. Because this is an opportunity to share code, I have cleaned up xfs_repair's code into a generic btree bulk load function and have refactored both repair tools to use it.

Another part of repair that has been re-engineered significantly is how we stage those new records in memory. In the original design, we simply used kernel memory to hold all the records. The memory pressure this introduced made running repair a risky operation, until I realized that repair will be running on a fully operational system. This means we can stage the records in memory that can be swapped out, conserving the working set size.

A potential third area for improvement is avoiding filesystem freezes to repair metadata. While freezing the filesystem to run a repair probably involves less downtime than unmounting, it would be very useful if we could isolate an allocation group that is found to be bad. This will reduce service impacts and is probably the only practical way to repair the reverse mapping index.

I look forward to sending out a new revision of the online repair code in 2020 for further review.

Demonstration: Online File System Check

Online filesystem checking must be built into the Linux kernel at compile time by enabling the CONFIG_XFS_ONLINE_SCRUB kernel option, so your kernel distributor must enable it for the feature to work. Checks are driven by a userspace utility named xfs_scrub, which announces itself as an experimental technical preview when run.

On Debian and Ubuntu systems, the program is shipped in the regular xfsprogs package. On Red Hat and Fedora systems, it is shipped in the separate xfsprogs-xfs_scrub package, which must be installed explicitly. You can, of course, also compile the kernel and userspace tools from source.

Let's try out the new program. It isn't very chatty by default, so we invoke it with the -v option to display status information and the -n option so that it only checks the metadata without changing anything:

# xfs_scrub -n -v /storage/
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/storage/: using 4 threads to scrub.
Phase 2: Check internal metadata.
Info: AG 1 superblock: Optimization is possible.
Info: AG 2 superblock: Optimization is possible.
Info: AG 3 superblock: Optimization is possible.
Phase 3: Scan all inodes.
Info: /storage/: Optimizations of inode record are possible.
Phase 5: Check directory tree.
Info: inode 139431063 (1/5213335): Unicode name "arn.lm" in directory could be confused with "am.lm".
Info: inode 407937855 (3/5284671): Unicode name "obs-l.I" in directory could be confused with "obs-1.I".
Info: inode 407937855 (3/5284671): Unicode name "obs-l.X" in directory could be confused with "obs-1.X".
Info: inode 688764901 (5/17676261): Unicode name "empty-fl.I" in directory could be confused with "empty-f1.I".
Info: inode 688764901 (5/17676261): Unicode name "empty-fl.X" in directory could be confused with "empty-f1.X".
Info: inode 688764901 (5/17676261): Unicode name "l.I" in directory could be confused with "1.I".
Info: inode 688764901 (5/17676261): Unicode name "l.X" in directory could be confused with "1.X".
Info: inode 944886180 (7/5362084): Unicode name "l.I" in directory could be confused with "1.I".
Info: inode 944886180 (7/5362084): Unicode name "l.X" in directory could be confused with "1.X".
Phase 7: Check summary counters.
279.1GiB data used;  3.5M inodes used.
262.2GiB data found; 3.5M inodes found.
3.5M inodes counted; 3.5M inodes checked.

As you can see, metadata checking is split into different phases:

  1. This phase gathers information about the filesystem and tests whether or not online checking is supported.
  2. Here we examine allocation group metadata and aggregated filesystem metadata for problems. These include free space indices, inode indices, reverse mapping and reference count information, and quota records. In this example, the program lets us know that the secondary superblocks could be updated, though they are not corrupt.
  3. Now we scan all inodes for problems in the storage mappings, extended attributes, and directory contents, if applicable. No problems found here!
  4. Repairs are performed on the filesystem in this phase, though only if the user did not invoke the program with -n.
  5. Directories and extended attributes are checked for connectivity and naming problems. Here, we see that the program has identified several directories containing file names that could render similarly enough to be confusing. These aren't filesystem errors per se, but should be reviewed by the administrator.
  6. If enabled with -x, this phase scans the underlying disk media for latent failures.
  7. In the final phase, we compare the summary counters against what we've seen and report on the effectiveness of our scan. As you can see, we found all the files and most of the file data.

Our sample filesystem is in good shape! We saw a few things that could be optimized or reviewed, but no corruptions were reported. No data have been lost.

However, this is not the only way we can run xfs_scrub! System administrators can set it up to run in the background when the system is idle. xfsprogs ships with the appropriate job control files to run as a systemd timer service or a cron job.

The background service can be run automatically by starting the systemd timer (use systemctl enable to make this persist across reboots):

# systemctl start xfs_scrub_all.timer
# systemctl list-timers
NEXT                         LEFT          LAST                         PASSED       UNIT                           ACTIVATES
Thu 2019-11-28 03:10:59 PST  12h left      Wed 2019-11-27 07:25:21 PST  7h ago       xfs_scrub_all.timer            xfs_scrub_all.service
<listing shortened for brevity>

When enabled, the background service will email failure reports to root. Administrators can configure when the service runs with systemctl edit xfs_scrub_all.timer, and where the failure reports are sent with systemctl edit xfs_scrub_fail@.service (changing the EMAIL_ADDR variable). The service takes advantage of systemd's sandboxing capabilities to run the program at idle priority and with as few privileges as possible.
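For example, a drop-in override created by systemctl edit xfs_scrub_all.timer could reschedule the scan for early Sunday morning. The OnCalendar value below is purely illustrative, not the shipped default:

```ini
[Timer]
# Clear the shipped schedule, then set a new one (illustrative).
OnCalendar=
OnCalendar=Sun *-*-* 03:30:00
```

Resetting OnCalendar= to empty first is the usual systemd idiom for replacing, rather than adding to, a list-valued setting.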

For systems that have cron installed (but not systemd), a sample cronjob file is shipped in /usr/lib/xfsprogs/xfs_scrub_all.cron. This file can be edited as necessary and copied to /etc/cron.d/. Failure reports are dispatched to wherever cronjob errors are sent.
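An installed copy under /etc/cron.d might look roughly like the line below. The schedule, user, and program path are illustrative assumptions; the shipped sample in /usr/lib/xfsprogs/xfs_scrub_all.cron is the authoritative version:

```
# Scrub all XFS filesystems early Sunday morning (illustrative schedule).
30 3 * * 0 root /usr/sbin/xfs_scrub_all
```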

Demonstration: Health Reporting

A comprehensive health report can be generated with the xfs_spaceman tool. The report contains health status for the allocation group metadata and the inodes in the filesystem:

# xfs_spaceman -c 'health -c' /storage
filesystem summary counters: ok
AG 0 superblock: ok
AG 0 AGF header: ok
AG 0 AGFL header: ok
AG 0 AGI header: ok
AG 0 free space by block btree: ok
AG 0 free space by length btree: ok
AG 0 inode btree: ok
AG 0 free inode btree: ok
AG 0 overall inode state: ok
inode 501370 inode core: ok
inode 501370 data fork: ok
inode 501370 extended attribute fork: ok

This concludes our demonstrations. We hope you'll try out these new features and let us know what you think!
