Introduction
The mkfs.xfs command is used to create an XFS filesystem. In this article, we explore the mkfs.xfs code flow and what it writes to disk to create an XFS filesystem. The mkfs.xfs command is part of the xfsprogs package.
Processing Arguments
The mkfs.xfs command comes with a handful of options to configure the filesystem in the way that best suits the user. Some of these options are enabled by default and others can be passed as command line options. The first stage of mkfs involves processing and validating arguments.
- Get all the predefined default values. These do not need any calculations; a few of them are listed here:

      /* build time defaults */
      struct mkfs_default_params dft = {
          .source = _("package build definitions"),
          .sectorsize = XFS_MIN_SECTORSIZE,
          .blocksize = 1 << XFS_DFL_BLOCKSIZE_LOG,
          .sb_feat = {
              ...
              .lazy_sb_counters = true,
              .crcs_enabled = true,
              .finobt = true,
              .reflink = true,
              .inobtcnt = true,
              ...
          }
      };
- Process the command line (CLI) options:
  - Changes to the inode can be specified using the -i option. Use the -d option for data section changes and the -m option for metadata changes.
  - Validate block size.
  - Validate sector size, as other operations depend on it. Along with this, disable direct I/O when formatting an image file, so that a sector size mismatch between the new filesystem and the underlying host filesystem does not cause an error. mkfs.xfs uses blkid_get_topology() to query the logical and physical sector size of the underlying storage device. For the filesystem to function, the defined sector size must be a multiple of the underlying sector size.
  - If CRC (checksum) is enabled, specific parameters need to be enabled too. If the user overrides them, catch the conflict. For example, with CRC enabled, the minimum inode size is 512 bytes, so specifying an inode size under 512 bytes triggers an error.
  - Check CLI features. Some features depend on others, so ensure that all dependencies are enabled. For example, reflink requires crc: if reflink=1 and crc=0, stop the mkfs process.
  - Validate block size, inode size, and sector size if they are passed on the CLI.
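The dependency checks described above can be sketched in C. This is a minimal illustration rather than the xfsprogs implementation: the function name validate_features(), the enum feat_err, and its values are hypothetical. Only the two rules themselves (reflink requires crc, and crc requires an inode size of at least 512 bytes) come from the text.

```c
#include <stdbool.h>

/* Hypothetical result codes for this sketch; not xfsprogs identifiers. */
enum feat_err {
    FEAT_OK = 0,
    FEAT_ERR_REFLINK_NEEDS_CRC,
    FEAT_ERR_INODE_TOO_SMALL
};

/* Minimum inode size when CRCs are enabled, as stated in the article. */
#define CRC_MIN_INODE_SIZE 512u

/*
 * Check the two example rules from the text:
 *  - reflink depends on crc
 *  - with crc enabled, the inode size must be at least 512 bytes
 */
enum feat_err validate_features(bool crc, bool reflink, unsigned int inodesize)
{
    if (reflink && !crc)
        return FEAT_ERR_REFLINK_NEEDS_CRC;
    if (crc && inodesize < CRC_MIN_INODE_SIZE)
        return FEAT_ERR_INODE_TOO_SMALL;
    return FEAT_OK;
}
```

A caller would stop the format on any result other than FEAT_OK, for example for validate_features(false, true, 512), which corresponds to reflink=1 with crc=0.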
Getting Allocation Group Geometry
The XFS filesystem is divided into Allocation Groups (AGs), with each group tracking its own allocation and free space information. Because each AG largely manages itself like an independent filesystem, operations on different AGs can run in parallel. After argument processing is done, mkfs calculates the AG geometry: the AG size, the AG count, and the journal size.
- If the user specifies agsize, check that it is a multiple of the block size. If agcount is specified, calculate agsize as the filesystem size divided by agcount.
- If the user does not provide a value for agsize, call the calc_default_ag_geometry() function to get the default AG geometry:
  - For a single underlying storage device, if 128 MB <= filesystem size <= 4 TB, use 4 AGs. For filesystems larger than 4 TB, set the AG size to 1 TB (the maximum possible AG size) and calculate the AG count from the AG size.
  - Configurations with more than one storage device can exploit a greater degree of parallelism, so for those choose a larger agcount based on the filesystem size:
    - If the filesystem size is larger than 32 TB, the AG size is the maximum AG size = XFS_AG_MAX_BLOCKS(blocklog) = 1 TB.
    - Otherwise, the default AG count is 32. That decreases if the filesystem is smaller, as shown in the code sample and table that follow:

          #define XFS_MULTIDISK_AGLOG 5 /* 32 AGs */

          shift = XFS_MULTIDISK_AGLOG;
          if (dblocks <= GIGABYTES(512, blocklog))
              shift--;
          if (dblocks <= GIGABYTES(8, blocklog))
              shift--;
          if (dblocks < MEGABYTES(128, blocklog))
              shift--;
          if (dblocks < MEGABYTES(64, blocklog))
              shift--;
          if (dblocks < MEGABYTES(32, blocklog))
              shift--;

      | Filesystem Size | AG Count |
      |---|---|
      | > 32 TB | AG size = XFS_AG_MAX_BLOCKS = 1 TB; AG count = (filesystem size / AG size) |
      | 32 TB >= size > 512 GB | 32 |
      | <= 512 GB | 16 |
      | <= 8 GB | 8 |
      | < 128 MB | 4 |
      | < 64 MB | 2 |
      | < 32 MB | 1 |
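As a worked illustration of the multi-disk rules above, here is a small self-contained C sketch that works in bytes rather than filesystem blocks. The function name multidisk_agcount() and the MIB/GIB/TIB macros are hypothetical, the 1 TB maximum AG size is taken from the article, and rounding the AG count up for filesystems larger than 32 TB is an assumption of this sketch.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)
#define TIB (1024ULL * GIB)

/* 2^5 = 32 AGs, mirroring XFS_MULTIDISK_AGLOG from the snippet above. */
#define MULTIDISK_AGLOG 5

/*
 * Hypothetical helper: default AG count for a multi-disk filesystem
 * of fs_bytes bytes, following the thresholds quoted in the article.
 */
uint64_t multidisk_agcount(uint64_t fs_bytes)
{
    int shift = MULTIDISK_AGLOG;

    /* Above 32 TB the AG size is pinned at 1 TB; round the count up
     * (assumption) so that the whole filesystem is covered. */
    if (fs_bytes > 32 * TIB)
        return (fs_bytes + TIB - 1) / TIB;

    /* Each threshold the filesystem falls under halves the AG count. */
    if (fs_bytes <= 512 * GIB)
        shift--;
    if (fs_bytes <= 8 * GIB)
        shift--;
    if (fs_bytes < 128 * MIB)
        shift--;
    if (fs_bytes < 64 * MIB)
        shift--;
    if (fs_bytes < 32 * MIB)
        shift--;

    return 1ULL << shift;
}
```

For example, a 1 TB multi-disk filesystem gets 32 AGs, while a 512 GB one gets 16.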
Calculating the Default Log Size
XFS is a metadata journaling filesystem. The journal (log) improves performance and ensures reliability, because the filesystem can return to a consistent state after a system crash. The log can either occupy reserved space within the filesystem (an internal log) or live on an external device. The rest of this blog covers filesystems using an internal log.
The log size is decided while getting the AG geometry, and it is based on the size of the filesystem at creation time.
- If the log size is not specified on the CLI, use the following rules to get the log size:

  | FS Size | Log Value | Calculations |
  |---|---|---|
  | < 300 MB | Actual minimum | Calculated based on the XFS transaction code. |
  | 300 MB <= size <= 128 GB | 64 MB | log size = max(ratio, reasonable log size). ratio = 2048:1, so every 2 GB of the filesystem adds 1 MB to the log size. A reasonable log size = 64 MB. |
  | > 128 GB | min(ratio, XFS_MAX_LOG_BYTES) | ratio = 2048:1, so every 2 GB of the filesystem adds 1 MB to the log size. XFS_MAX_LOG_BYTES = 2^31 bytes (2 GB) - XFS_MIN_LOG_BYTES, which is just under 2 GB. XFS_MIN_LOG_BYTES = 10 MB. |

- The log should fit inside a single AG. Once the log size is known, it is adjusted to fit the AG: if the log size is larger than the AG size, the log size is reduced to the maximum usable space in the AG (that is, AG size - preallocated reservations such as header fields - 1).
- For an internal log filesystem, the default AG containing the log is the AG numbered half the total number of AGs. That can be overridden with the -l agnum= option of mkfs. For a filesystem with 4 AGs, the journal is placed in AG 2.
- Getting the actual minimum log:
  - The term "actual minimum" is used to distinguish this value from XFS_MIN_LOG_BYTES, which is 10 MB. You can optionally create an XFS filesystem with a log smaller than 10 MB. The xfs_log_calc_minimum_size() function returns the minimum possible log size based on the superblock configuration. The following points factor into this calculation:
    - No single transaction can be larger than half the size of the log. This ensures that at any point in time, two transactions fit in the log, so in the event of a system crash, at least one valid transaction is not overwritten.
    - The log stripe unit or stripe width, if set, should be considered while calculating this value.
- Find the log size based on the ratio 2048:1, where every 2048 data blocks get 1 log block:

      cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
      cfg->logblocks = cfg->logblocks >> cfg->blocklog;

- The default reasonable log size is set to 64 MB to improve performance on threaded workloads. Note that the filesystem can be grown using xfs_growfs, but the log cannot be grown.

      #define XFS_MIN_REALISTIC_LOG_BLOCKS(blog) (MEGABYTES(64, (blog)))

      cfg->logblocks = max(cfg->logblocks,
                           XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
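The sizing rules above can be put together in a short C sketch. It is only an approximation under stated assumptions: it works in bytes rather than filesystem blocks, the function names default_log_bytes() and default_log_agno() are hypothetical, and the "actual minimum" for small filesystems is passed in as a parameter instead of being derived from transaction reservations as xfs_log_calc_minimum_size() does. Clamping the log to the AG size is left out.

```c
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
#define GIB (1024ULL * MIB)

/* Values quoted in the article. */
#define MIN_LOG_BYTES        (10 * MIB)                 /* XFS_MIN_LOG_BYTES */
#define MAX_LOG_BYTES        (2 * GIB - MIN_LOG_BYTES)  /* XFS_MAX_LOG_BYTES */
#define REASONABLE_LOG_BYTES (64 * MIB)
#define LOG_RATIO            2048ULL                    /* data : log */

/*
 * Hypothetical helper: default internal log size in bytes.
 * actual_min_bytes stands in for the transaction-based minimum,
 * which this sketch does not compute.
 */
uint64_t default_log_bytes(uint64_t fs_bytes, uint64_t actual_min_bytes)
{
    uint64_t ratio = fs_bytes / LOG_RATIO;

    if (fs_bytes < 300 * MIB)
        return actual_min_bytes;
    if (fs_bytes <= 128 * GIB)
        return ratio > REASONABLE_LOG_BYTES ? ratio : REASONABLE_LOG_BYTES;
    return ratio < MAX_LOG_BYTES ? ratio : MAX_LOG_BYTES;
}

/* Hypothetical helper: default AG that holds an internal log. */
uint64_t default_log_agno(uint64_t agcount)
{
    return agcount / 2;
}
```

With these rules, a 256 GB filesystem gets a 128 MB log, and any filesystem between 300 MB and 128 GB gets 64 MB.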
Header Reservations
In XFS, each AG acts like a standalone filesystem and manages its own space. Here are the headers saved in each AG for a regular mkfs command with reflink enabled. (Command: mkfs.xfs)

- XFSB – Superblock.
- XAGF – AG free space info.
- XAGI – AG inode info.
- XAFL – Free space list.
- AB3B – Free space by Block B+ tree.
- AB3C – Free space by Size B+ tree.
- IAB3 – Inode B+ tree.
- FIB3 – Free Inode B+ tree.
- R3FC – Reference count B+ tree (added only if reflink is enabled).

Each AG has a copy of the superblock to help with recovery, and each AG has free space and inode info to aid with space allocation and file creation. The superblock, AG free space info, AG inode info, and AG free space list each take one sector (512 bytes each), and the remaining headers take up one block each. The space after these headers can be used for actual inode allocation and storage.
Allocating the Root Inode
The process of allocating the root inode is the same as inode creation or allocation for regular files in XFS. mkfs chooses AG 0, creates a chunk of 64 inodes (XFS allocates inodes in chunks of 64), and picks the first inode of the chunk as the root inode.
In XFS, the inode number can be translated into the exact block location that contains the inode structure. In the default case with the inode size set to 512 bytes, the root inode number would be 128.
- The inode number 128 indicates that the inode is located in AG 0, in block 16, at offset 0. You can use the xfs_db shell to verify this:

      xfs_db> sb 0
      xfs_db> p rootino
      rootino = 128
      xfs_db> convert inode 128 agno      # AG containing the inode
      0x0 (0)
      xfs_db> convert inode 128 fsblock   # Block containing the inode
      0x10 (16)
      xfs_db> convert inode 128 offset    # Offset of the inode in the block
      0x0 (0)

- The first 16 blocks of the AG (blocks 0 to 15) contain headers such as the superblock, AG inode information, and AG free block information.
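To make the translation from inode number to location concrete, here is a minimal C sketch that assumes the inode number packs three bit fields: the AG number in the high bits, then the block within the AG, then the inode's index within the block. The struct ino_location type and the decode_ino() function are hypothetical names for this illustration. The parameter inopblog is log2 of the inodes per block (3 for 4096-byte blocks and 512-byte inodes), and agblklog is log2 of the AG size in blocks rounded up. The value 20 used in the example is an assumed, illustrative value.

```c
#include <stdint.h>

/* Hypothetical result type for this sketch. */
struct ino_location {
    uint64_t agno;    /* allocation group number */
    uint64_t agbno;   /* block number within the AG */
    uint64_t offset;  /* inode index within the block */
};

/*
 * Split an inode number into its (agno, agbno, offset) bit fields.
 * inopblog: log2(inodes per block); agblklog: log2(blocks per AG).
 */
struct ino_location decode_ino(uint64_t ino, unsigned int inopblog,
                               unsigned int agblklog)
{
    struct ino_location loc;

    loc.offset = ino & ((1ULL << inopblog) - 1);
    loc.agbno = (ino >> inopblog) & ((1ULL << agblklog) - 1);
    loc.agno = ino >> (inopblog + agblklog);
    return loc;
}
```

With inopblog = 3, decode_ino(128, 3, 20) yields AG 0, block 16, offset 0, which matches the xfs_db output for the root inode: 128 >> 3 = 16, and the low three bits are zero.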
Finishing mkfs
After all the checks and calculations are complete, it is time to write the changes and finish formatting the filesystem.
- Align AG geometry if the underlying device is striped, or if the stripe unit and width are specified.
- Get the maximum inode percentage based on the filesystem size if it is not passed via the CLI.
- Initialize the minimum parameters required for log calculations to proceed.
- Set up the mount parameters for log calculations. (Here we add parameters that help calculate the minimum pre-allocated blocks to help with log size calculations).
- Calculate the log size.
- Update the incore superblock with those log calculations.
- Now that the validations are done, discard the old device layout using discard_devices().
- Use prepare_devices() to make the device ready for mounting:
  - Zero out the beginning of the devices to obliterate any old filesystem signatures.
  - Write the superblock to the disk.
  - Zero out the filesystem journal log.
- Several XFS macros use the mount structure. To use these macros, initialize the xfs_mount_t structure by calling the libxfs_mount() function.
- Initialize the AG headers and update the secondary superblock.
- Allocate the root inode.
- Call the libxfs_umount() function to free all the resources obtained during the mount.
- The first mount and unmount is done by mkfs. To review the result, use xfs_logprint to dump the journal contents:

      xfs_logprint:
          data device: 0xffffffffffffffff
          log device: 0xffffffffffffffff daddr: 10485808 length: 204800

      cycle: 1 version: 2 lsn: 1,0 tail_lsn: 1,0
      length of Log Record: 512 prev offset: -1 num ops: 1
      uuid: 393834a0-61f6-46ad-9659-b1a3c66863c6 format: little endian linux
      h_size: 32768
      ----------------------------------------------------------------------------
      Oper (0): tid: b0c0d0d0 len: 8 clientid: LOG flags: UNMOUNT
      Unmount filesystem

      ============================================================================
      xfs_logprint: skipped 204798 zeroed blocks in range: 2 - 204799
      xfs_logprint: physical end of log
      ============================================================================
      xfs_logprint: logical end of log
      ============================================================================

Defaults
The following table lists key defaults specified within the mkfs.xfs command:

[Table not recovered. Surviving cell fragments: "xfs_growfs command.", "reflink copy.", "btree in each allocation group."]
Constructing the XFS Filesystem
- To create a regular filesystem with a default configuration, run the following command:

      sudo mkfs.xfs /dev/sdb1

- To create a filesystem with an external log, run the following command:

      sudo mkfs.xfs -l logdev=/dev/sdb2 /dev/sdb1

- To create a filesystem with a specified AG count, run the following command:

      sudo mkfs.xfs -d agcount=32 /dev/sdb1

- To verify the sanity of a RAID stripe, extra steps are required.

  If the underlying device is configured for RAID, the mkfs process picks up the underlying device geometry and sets the stripe unit and width. Optionally, sunit and swidth can be used to specify a custom stripe unit and width. Note that these are specified in units of 512-byte blocks, so mkfs.xfs -d sunit=8 sets the stripe unit to 4096 bytes. The su option can be used instead to specify the stripe unit in bytes, together with sw, which gives the stripe width as a multiple of the stripe unit.

  For a RAID level 5 device created with the mdadm command, mkfs reports sunit=128 and swidth=768 blocks. Here the block size is the filesystem block size (bsize=4096).

      mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 /dev/sdb2 /dev/sdb3 /dev/sdb4

      # mkfs.xfs /dev/md0
      meta-data=/dev/md0               isize=512    agcount=16, agsize=491136 blks
               =                       sectsz=4096  attr=2, projid32bit=1
               =                       crc=1        finobt=1, sparse=1, rmapbt=0
               =                       reflink=1
      data     =                       bsize=4096   blocks=7857408, imaxpct=25
               =                       sunit=128    swidth=768 blks
      naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
      log      =internal log           bsize=4096   blocks=25600, version=2
               =                       sectsz=4096  sunit=1 blks, lazy-count=1
      realtime =none                   extsz=4096   blocks=0, rtextents=0

  You can then use details from the /sys/block directory to verify the stripe unit and width:

      cat /sys/block/md0/md/chunk_size
      524288 = (128 * 4096)
      cat /sys/block/md0/md/raid_disks
      3

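Because stripe values appear in three different units (512-byte units for the sunit and swidth options, filesystem blocks in the mkfs output, and bytes in /sys/block), a few conversion helpers make the arithmetic explicit. The following C sketch is only illustrative, and all three function names are hypothetical.

```c
#include <stdint.h>

/* Unit used by the sunit and swidth options of mkfs.xfs -d. */
#define SECTOR_BYTES 512ULL

/* Hypothetical helper: an sunit/swidth option value to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
    return sectors * SECTOR_BYTES;
}

/* Hypothetical helper: bytes to the 512-byte units the options expect. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes / SECTOR_BYTES;
}

/* Hypothetical helper: a stripe value that mkfs reports in
 * filesystem blocks ("blks") to bytes. */
uint64_t fsblocks_to_bytes(uint64_t blocks, uint64_t block_size)
{
    return blocks * block_size;
}
```

For instance, sectors_to_bytes(8) gives 4096 bytes, matching the sunit=8 example, and fsblocks_to_bytes(128, 4096) gives 524288, the chunk size read from /sys/block.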
References
- man mkfs.xfs
- xfsprogs source code – https://kernel.googlesource.com/pub/scm/fs/xfs/xfsprogs-dev/