XFS upstream maintainer Darrick Wong provides another instalment, this time focusing on how to help sysadmins maintain healthy filesystems.
Since Linux 4.17, I have been working on an online filesystem checking feature for XFS. As I mentioned in the previous update, the online fsck tool (named xfs_scrub) walks all internal filesystem metadata records. Each record is checked for obvious corruptions before being cross-referenced with all other metadata in the filesystem. If problems are found, they are reported to the system administrator through both xfs_scrub and the health reporting system.
As of Linux 5.3 and xfsprogs 5.3, online checking is feature complete and has entered the stabilization and performance optimization stage. For the moment it remains tagged experimental, though it should be stable. We seek early adopters to try out this new functionality and give us feedback.
A new feature under development since Linux 5.2 is metadata health reporting. In its current draft form, it collects checking and corruption reports from the online filesystem checker and reports them to userspace via the xfs_spaceman health command. Soon, we will begin connecting it to all the other places in the XFS codebase where we test for metadata problems, so that administrators can find out if a filesystem observed any errors during operation.
Three years ago, I also introduced the reverse space mapping feature to XFS. At its core is a secondary index of storage space usage that effectively provides a redundant copy of primary space usage metadata. This adds some overhead to filesystem operations, but its inclusion in a filesystem makes cross-referencing very fast. It is an essential feature for repairing filesystems online because we can rebuild damaged primary metadata from the secondary copy.
The feature graduated from EXPERIMENTAL status in Linux 4.16 and is production ready. However, online filesystem checking and repair is (so far) the only use case for this feature, so it will remain opt-in at least until online checking graduates to production readiness.
To try out this feature, pass the parameter -m rmapbt=1 to mkfs.xfs when formatting a new filesystem.
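As a sketch (the device name /dev/sdb1 and mount point /storage are placeholders, not taken from the article), formatting a filesystem with reverse mapping enabled and confirming that the feature took effect might look like this:

```shell
# Format with the reverse-mapping btree enabled.  /dev/sdb1 is a
# placeholder -- substitute your own (empty!) block device.
mkfs.xfs -m rmapbt=1 /dev/sdb1

# After mounting, xfs_info prints the filesystem geometry; an
# rmapbt=1 flag in its output confirms the index is present.
mount /dev/sdb1 /storage
xfs_info /storage | grep rmapbt
```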
Work has continued on online repair over the past two years. The basic core of how it works has not changed (we use reverse mapping information to reconnect damaged primary metadata), but our rigorous review processes have revealed other areas of XFS that could be improved significantly ahead of landing online repair support.
For example, the offline repair tool (xfs_repair) rebuilds the filesystem btrees in bulk by regenerating all the records in memory and then writing out fully formed btree blocks all at once. The original online repair code would rebuild indices one record at a time to avoid running afoul of other transactions, which was not efficient. Because this is an opportunity to share code, I have cleaned up xfs_repair's code into a generic btree bulk load function and have refactored both repair tools to use it.
Another part of repair that has been re-engineered significantly is how we stage those new records in memory. In the original design, we simply used kernel memory to hold all the records. The memory pressure that this introduced made running repair a risky operation, until I realized that repair runs on a fully operational system. This means we can stage those records in memory that can be swapped out, which keeps the working set small.
A potential third area for improvement is avoiding filesystem freezes to repair metadata. While freezing the filesystem to run a repair probably involves less downtime than unmounting, it would be very useful if we could isolate an allocation group that is found to be bad. This will reduce service impacts and is probably the only practical way to repair the reverse mapping index.
I look forward to sending out a new revision of the online repair code in 2020 for further review.
Online filesystem checking is a component that must be built into the Linux kernel at compile time by enabling the CONFIG_XFS_ONLINE_SCRUB kernel option. Checks are driven by a userspace utility named xfs_scrub. When run, this program announces itself as an experimental technical preview. Your kernel distributor must enable the option for the feature to work.
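One quick way to verify that your distribution kernel was built with the option is to search the kernel config file. The path below is an assumption that holds for most mainstream distributions; adjust it as needed:

```shell
# Look for CONFIG_XFS_ONLINE_SCRUB=y in the running kernel's config.
# The config file location varies by distribution; /proc/config.gz is
# an alternative when CONFIG_IKCONFIG_PROC is enabled.
grep CONFIG_XFS_ONLINE_SCRUB "/boot/config-$(uname -r)"
```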
On Debian and Ubuntu systems, the program is shipped in the regular xfsprogs package. On Red Hat and Fedora systems, it is shipped in the xfsprogs-xfs_scrub package and must be installed separately. You can, of course, compile the kernel and userspace from source.
Let's try out the new program. It isn't very chatty by default, so we invoke it with the -v option to display status information and the -n option because we only want to check metadata:
# xfs_scrub -n -v /storage/
EXPERIMENTAL xfs_scrub program in use! Use at your own risk!
Phase 1: Find filesystem geometry.
/storage/: using 4 threads to scrub.
Phase 2: Check internal metadata.
Info: AG 1 superblock: Optimization is possible.
Info: AG 2 superblock: Optimization is possible.
Info: AG 3 superblock: Optimization is possible.
Phase 3: Scan all inodes.
Info: /storage/: Optimizations of inode record are possible.
Phase 5: Check directory tree.
Info: inode 139431063 (1/5213335): Unicode name "arn.lm" in directory could be confused with "am.lm".
Info: inode 407937855 (3/5284671): Unicode name "obs-l.I" in directory could be confused with "obs-1.I".
Info: inode 407937855 (3/5284671): Unicode name "obs-l.X" in directory could be confused with "obs-1.X".
Info: inode 688764901 (5/17676261): Unicode name "empty-fl.I" in directory could be confused with "empty-f1.I".
Info: inode 688764901 (5/17676261): Unicode name "empty-fl.X" in directory could be confused with "empty-f1.X".
Info: inode 688764901 (5/17676261): Unicode name "l.I" in directory could be confused with "1.I".
Info: inode 688764901 (5/17676261): Unicode name "l.X" in directory could be confused with "1.X".
Info: inode 944886180 (7/5362084): Unicode name "l.I" in directory could be confused with "1.I".
Info: inode 944886180 (7/5362084): Unicode name "l.X" in directory could be confused with "1.X".
Phase 7: Check summary counters.
279.1GiB data used; 3.5M inodes used.
262.2GiB data found; 3.5M inodes found.
3.5M inodes counted; 3.5M inodes checked.
As you can see, metadata checking is split into numbered phases, from discovering the filesystem geometry through checking internal metadata, inodes, and the directory tree, to verifying the summary counters.
Our sample filesystem is in good shape! We saw a few things that could be optimized or reviewed, but no corruptions were reported. No data have been lost.
However, this is not the only way we can run xfs_scrub! System administrators can set it up to run in the background when the system is idle. xfsprogs ships with the appropriate job control files to run as a systemd timer service or a cron job.
The systemd timer service can be run automatically by enabling the timer:
# systemctl start xfs_scrub_all.timer
# systemctl list-timers
NEXT                        LEFT     LAST                        PASSED UNIT                ACTIVATES
Thu 2019-11-28 03:10:59 PST 12h left Wed 2019-11-27 07:25:21 PST 7h ago xfs_scrub_all.timer xfs_scrub_all.service
<listing shortened for brevity>
When enabled, the background service will email failure reports to root. Administrators can configure when the service runs by running systemctl edit xfs_scrub_all.timer, and where the failure reports are sent by running systemctl edit xfs_scrub_fail@.service to change the EMAIL_ADDR variable. The systemd service takes advantage of systemd's sandboxing capabilities to restrict the program to idle priority and to run with as few privileges as possible.
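For example, a drop-in override could move the scheduled run to early Sunday morning. This is a sketch: the OnCalendar value shown is just an illustration, not the shipped default.

```shell
# systemctl edit opens an editor and saves the result as a drop-in
# under /etc/systemd/system/xfs_scrub_all.timer.d/:
systemctl edit xfs_scrub_all.timer

# Example drop-in contents; the first, empty OnCalendar= clears the
# shipped schedule before the replacement is set (systemd.time(7)
# describes the calendar syntax):
#   [Timer]
#   OnCalendar=
#   OnCalendar=Sun *-*-* 04:00:00
```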
For systems that have cron installed (but not systemd), a sample cronjob file is shipped in /usr/lib/xfsprogs/xfs_scrub_all.cron. This file can be edited as necessary and copied to /etc/cron.d/. Failure reports are dispatched to wherever cronjob errors are sent.
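Installing the cron job might look like the following; the destination filename is a plausible choice rather than a mandated one:

```shell
# Review the schedule (and any MAILTO setting) in the sample file,
# then install it system-wide for cron to pick up.
cp /usr/lib/xfsprogs/xfs_scrub_all.cron /etc/cron.d/xfs_scrub_all
```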
A comprehensive health report can be generated with the xfs_spaceman tool. The report contains health status about allocation group metadata and inodes in the filesystem:
# xfs_spaceman -c 'health -c' /storage
filesystem summary counters: ok
AG 0 superblock: ok
AG 0 AGF header: ok
AG 0 AGFL header: ok
AG 0 AGI header: ok
AG 0 free space by block btree: ok
AG 0 free space by length btree: ok
AG 0 inode btree: ok
AG 0 free inode btree: ok
AG 0 overall inode state: ok
<snip>
inode 501370 inode core: ok
inode 501370 data fork: ok
inode 501370 extended attribute fork: ok
This concludes our demonstrations. We hope you'll try out these new features and let us know what you think!