Formatting an XFS Filesystem

December 14, 2023 | 13 minute read

Introduction

The mkfs.xfs command is used to create an XFS filesystem. In this article, we will explore the mkfs.xfs code flow, and what it writes to disk to create an XFS filesystem. The mkfs.xfs command is part of the xfsprogs package.

Processing Arguments

The mkfs.xfs command comes with a handful of options to configure the filesystem in the way that best suits the user. Some of these options are enabled by default and others can be passed as command line options. The first stage of mkfs involves processing and validating arguments.

  • Get all the predefined default values. These need no calculation; a few of them are listed here:

    /* build time defaults */
    struct mkfs_default_params  dft = {
        .source = _("package build definitions"),
        .sectorsize = XFS_MIN_SECTORSIZE,
        .blocksize = 1 << XFS_DFL_BLOCKSIZE_LOG,
        .sb_feat = {
            ...
            .lazy_sb_counters = true,
            .crcs_enabled = true,
            .finobt = true,
            .reflink = true,
            .inobtcnt = true,
            ...
        }
    };
  • Processing Command line options (CLI)

    • Changes to the inode can be specified using the -i option. Use the -d option for data section changes and the -m option for metadata changes.
    • Validate block size.
    • Validate sector size as other operations depend on it. Along with this, disable direct IO on the image file so that a sector mismatch between the new and underlying host filesystem does not create an error. XFS uses blkid_get_topology() to query the logical and physical sector size of the underlying storage device. For the filesystem to function, the defined sector size must be a multiple of the underlying sector size.
    • If CRC (checksum) is enabled, specific parameters need to be enabled too. If they are overridden by the user, catch them. For example, with CRC enabled, the minimum inode size is 512 bytes. If the inode size is specified as under 512 bytes, this will trigger an error.
    • Check CLI features. Some features depend on others, so ensure that all dependencies are enabled. For example, reflink requires crc: if reflink=1 and crc=0, stop the mkfs process.
    • Validate block size, inode size, and sector size if they are passed on the command line.
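The dependency checks described above can be sketched in a few lines of C. This is a minimal illustration and not the xfsprogs code: the struct cli_opts type, the validate_features() function, and the MIN_CRC_INODE_SIZE name are hypothetical, and only the two rules stated in the text (reflink requires crc, and crc requires an inode size of at least 512 bytes) are modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of the options discussed in the text. */
struct cli_opts {
    bool     crc;        /* -m crc=     */
    bool     reflink;    /* -m reflink= */
    unsigned inodesize;  /* -i size=, in bytes */
};

/* Minimum inode size when CRCs are enabled, as stated in the text. */
#define MIN_CRC_INODE_SIZE 512u

/*
 * Returns NULL when the combination is acceptable, otherwise a short
 * message describing the first conflict found.
 */
static const char *validate_features(const struct cli_opts *o)
{
    assert(o != NULL);

    if (o->reflink && !o->crc)
        return "reflink requires crc";
    if (o->crc && o->inodesize < MIN_CRC_INODE_SIZE)
        return "inode size too small for crc";
    return NULL;
}
```

A caller would stop the format as soon as validate_features() returns a non-NULL message, which mirrors how mkfs stops when reflink=1 is combined with crc=0.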

Getting Allocation Group Geometry

The XFS filesystem is divided into Allocation Groups (AGs), with each group tracking its own allocation and free space info. Because each AG acts as its own filesystem, operations in different AGs can run in parallel. After argument processing is done, the next step is to calculate the AG geometry, such as the AG size and count, along with the journal size.

  • If the user specifies agsize, then check if it is a multiple of the blocksize, and if agcount is specified, calculate agsize as filesystem size divided by agcount.

  • If the user does not provide a value for agsize, call the calc_default_ag_geometry() method to get the default AG geometry:

    • For a single underlying storage device, if 128 MB <= filesystem size <= 4 TB, then use 4 AGs. For filesystems larger than 4 TB, set the AG size to 1 TB (the maximum possible AG size) and calculate the AG count based on the AG size.

    • Configurations with more than one storage device can enable a greater extent of parallelism, so for those choose a larger agcount based on the filesystem size.

      • If the filesystem size is larger than 32 TB, AG size is the maximum AG size = XFS_AG_MAX_BLOCKS(blocklog) = 1 TB.

      • The default AG count is 32. That can decrease if the filesystem is smaller, as shown in the code sample and table that follows:

        #define XFS_MULTIDISK_AGLOG     5   /* 32 AGs */
        
        shift = XFS_MULTIDISK_AGLOG;
        if (dblocks <= GIGABYTES(512, blocklog))
            shift--;
        if (dblocks <= GIGABYTES(8, blocklog))
            shift--;
        if (dblocks < MEGABYTES(128, blocklog))
            shift--;
        if (dblocks < MEGABYTES(64, blocklog))
            shift--;
        if (dblocks < MEGABYTES(32, blocklog))
            shift--;

         

        Filesystem Size           AG Count
        > 32 TB                   AG size = XFS_AG_MAX_BLOCKS = 1 TB; AG count = (filesystem size / AG size)
        32 TB >= size > 512 GB    32
        <= 512 GB                 16
        <= 8 GB                   8
        < 128 MB                  4
        < 64 MB                   2
        < 32 MB                   1
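To make the shift logic easier to try out, here is a small self-contained sketch that works in bytes instead of filesystem blocks. The function name multidisk_agcount() and the size macros are hypothetical, and rounding the AG count up for filesystems above 32 TB is an assumption made for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define KiB 1024ULL
#define MiB (1024ULL * KiB)
#define GiB (1024ULL * MiB)
#define TiB (1024ULL * GiB)

#define MULTIDISK_AGLOG 5             /* 32 AGs, mirrors XFS_MULTIDISK_AGLOG */
#define MAX_AG_BYTES    (1ULL * TiB)  /* maximum AG size quoted in the text */

/* Default AG count for a multi-device filesystem of fs_bytes bytes. */
static uint64_t multidisk_agcount(uint64_t fs_bytes)
{
    int shift = MULTIDISK_AGLOG;

    assert(fs_bytes > 0);

    /*
     * Above 32 TB the AG size is pinned at the maximum. Rounding up so
     * that a partial last AG is counted is an assumption of this sketch.
     */
    if (fs_bytes > 32 * TiB)
        return (fs_bytes + MAX_AG_BYTES - 1) / MAX_AG_BYTES;

    /* Same comparisons as the code sample, expressed in bytes. */
    if (fs_bytes <= 512 * GiB)
        shift--;
    if (fs_bytes <= 8 * GiB)
        shift--;
    if (fs_bytes < 128 * MiB)
        shift--;
    if (fs_bytes < 64 * MiB)
        shift--;
    if (fs_bytes < 32 * MiB)
        shift--;

    return 1ULL << shift;
}
```

For example, a 100 GB multi-device filesystem lands in the "<= 512 GB" row and gets 16 AGs, while a 1 TB filesystem keeps the default of 32.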

         

Calculating the Default Log Size

XFS is a journaling filesystem. You can reserve space within the filesystem as an internal log, or use an external device for storing the log. The rest of this blog covers filesystems using an internal log.

The log size must be decided while getting the AG geometry, and it is based on the size of the filesystem at creation time. XFS journals metadata only. The journal improves performance and ensures the reliability of the filesystem, as the filesystem can return to a consistent state after a system crash.

  1. If the log size is not specified on the CLI, use the following formula to get the log size:

    FS Size                     Log Value                        Calculations
    < 300 MB                    Actual minimum                   Calculated based on the XFS transaction code.
    300 MB <= size <= 128 GB    64 MB                            log size = max(ratio, reasonable log size). ratio = 2048:1, so every 2 GB of the filesystem adds 1 MB to the log size. The reasonable log size is 64 MB.
    > 128 GB                    min(ratio, XFS_MAX_LOG_BYTES)    ratio = 2048:1, so every 2 GB of the filesystem adds 1 MB to the log size. XFS_MAX_LOG_BYTES = 2^31 - XFS_MIN_LOG_BYTES = 2 GB - 10 MB, where XFS_MIN_LOG_BYTES = 10 MB.

  2. The log should fit inside a single AG. After you know the log size, you can adjust the complete log to fit the AG. So if the log size is larger than the AG size, the log size is reduced to the maximum usable space in the AG (That is AG size - prealloc reservations like header fields - 1).
  3. For an internal log filesystem, the default AG number containing the log is half the total number of AGs. That can be overridden by the -l agnum= option of mkfs. For a filesystem with 4 AGs, the journal is placed in AG 2.
  4. Getting the actual minimum log:
    • The term actual minimum is used because XFS_MIN_LOG_BYTES is 10 MB, yet you can optionally create an XFS filesystem with a log size of less than 10 MB. The xfs_log_calc_minimum_size() function returns the minimum possible log size based on the superblock configuration. The following points factor into these calculations:

      • No single transaction can be larger than half the size of the log. This ensures that at any point in time, you can fit two transactions in the log. So in the event of a system crash, at least one valid transaction is not overwritten.
      • Log stripe unit or stripe width, if set, should be considered while calculating this value.
  5. Find the log size based on the ratio 2048:1, where every 2048 data blocks get 1 log block:
    cfg->logblocks = (cfg->dblocks << cfg->blocklog) / 2048;
    cfg->logblocks = cfg->logblocks >> cfg->blocklog;
  6. The default reasonable log size is set to 64 MB to improve performance on threaded workloads. Note that the filesystem can be grown using xfs_growfs, but the log cannot be grown.
    #define XFS_MIN_REALISTIC_LOG_BLOCKS(blog)  (MEGABYTES(64, (blog)))
    cfg->logblocks = max(cfg->logblocks,
            XFS_MIN_REALISTIC_LOG_BLOCKS(cfg->blocklog));
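The sizing rules in this section can be combined into one small sketch. It works in bytes, and it is not the xfsprogs implementation: the function names and the parameters actual_min_bytes and max_usable_ag_bytes are hypothetical placeholders for values that mkfs derives from the transaction code and the AG layout.

```c
#include <assert.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)
#define GiB (1024ULL * MiB)

#define MIN_LOG_BYTES       (10ULL * MiB)                /* XFS_MIN_LOG_BYTES */
#define MAX_LOG_BYTES       (2ULL * GiB - MIN_LOG_BYTES) /* XFS_MAX_LOG_BYTES */
#define REALISTIC_LOG_BYTES (64ULL * MiB)                /* reasonable log size */
#define LOG_RATIO           2048ULL                      /* 2048:1 */

/*
 * Default internal log size in bytes.
 * actual_min_bytes: result of the transaction-based minimum calculation
 *                   (supplied by the caller; not modeled here).
 * max_usable_ag_bytes: largest log that still fits inside one AG.
 */
static uint64_t default_log_bytes(uint64_t fs_bytes, uint64_t actual_min_bytes,
                                  uint64_t max_usable_ag_bytes)
{
    uint64_t log;

    assert(max_usable_ag_bytes > 0);

    if (fs_bytes < 300 * MiB) {
        log = actual_min_bytes;
    } else {
        log = fs_bytes / LOG_RATIO;       /* every 2 GB adds 1 MB */
        if (log < REALISTIC_LOG_BYTES)
            log = REALISTIC_LOG_BYTES;    /* floor at 64 MB */
        if (log > MAX_LOG_BYTES)
            log = MAX_LOG_BYTES;          /* cap at 2 GB - 10 MB */
    }

    /* The log must fit inside a single AG. */
    if (log > max_usable_ag_bytes)
        log = max_usable_ag_bytes;
    return log;
}

/* Default AG that holds the internal log: half the AG count. */
static uint32_t default_log_agno(uint32_t agcount)
{
    return agcount / 2;
}
```

With these rules, a 1 TB filesystem gets a 512 MB log, and a filesystem with 4 AGs places the log in AG 2.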

Header Reservations

In XFS, each AG acts like a standalone filesystem, and manages its space. Here are the headers saved in each AG for a regular mkfs command with reflink enabled. (Command: mkfs.xfs )

  • XFSB - Superblock.
  • XAGF - AG free space info.
  • XAGI - AG inode info.
  • XAFL - Free space list.
  • AB3B - Free space by Block B+ tree.
  • AB3C - Free space by Size B+ tree.
  • IAB3 - Inode B+ tree.
  • FIB3 - Free Inode B+ tree.
  • R3FC - Reference count B+ tree (This is added if reflink is enabled).

Each AG has a copy of the superblock to help with recovery, and each AG has free space and inode info to aid with space allocation and file creation. The superblock, AG free space info, AG inode info, and AG free space list, all take one sector each (512 bytes each) and the remaining take up one block each. The space after these headers can be used for actual inode allocation and storage.
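As a rough illustration of the sector-sized headers, the sketch below computes the byte offset of each one from the start of the device. The enum and function names are hypothetical, and the sketch assumes that the four headers sit in consecutive sectors at the start of each AG in the order listed above (superblock, AGF, AGI, AGFL) and that AGs are laid out back to back.

```c
#include <assert.h>
#include <stdint.h>

/* Sector-sized headers at the start of each AG, in the order listed above. */
enum ag_header {
    AG_HDR_SB   = 0,  /* XFSB */
    AG_HDR_AGF  = 1,  /* XAGF */
    AG_HDR_AGI  = 2,  /* XAGI */
    AG_HDR_AGFL = 3   /* XAFL */
};

/* Byte offset of a sector-sized header, measured from the start of the device. */
static uint64_t ag_header_offset(uint32_t agno, uint64_t agsize_bytes,
                                 uint32_t sector_bytes, enum ag_header hdr)
{
    assert(sector_bytes >= 512);
    return (uint64_t)agno * agsize_bytes + (uint64_t)hdr * sector_bytes;
}
```

With 512-byte sectors, the free space list of AG 0 would start at byte 1536 under these assumptions.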

Allocating the Root Inode

The process of allocating the root inode is the same as inode creation or allocation for regular files in XFS. Choose AG 0 and create 64 inodes (In XFS, inodes are allocated in chunks of 64) and choose the first inode from the chunk as the root inode.

In XFS, the inode number can be translated into the exact block location that contains the inode structure. In the default case with the inode size set to 512 bytes, the root inode number would be 128.

  • The inode number 128 indicates that the inode is located in AG 0, block 16, at offset 0. You can use the xfs_db shell to verify this:

    xfs_db> sb 0
    xfs_db> p
    rootino = 128
    
    xfs_db> convert inode 128 agno       # AG containing the inode
    0x0 (0)
    xfs_db> convert inode 128 fsblock    # Block containing the inode
    0x10 (16)
    xfs_db> convert inode 128 offset # Offset of the inode in the block
    0x0 (0)
  • The first 16 blocks of the AG contain headers such as the superblock, AG inode information, and AG free block information.
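The translation from inode number to location is a matter of bit fields, and can be sketched as follows. The struct ino_loc type and the decode_ino() function are hypothetical names; the value 19 used for agblklog in the example is also an assumption, since the real value depends on the AG size of the filesystem.

```c
#include <assert.h>
#include <stdint.h>

struct ino_loc {
    uint32_t agno;    /* allocation group number */
    uint32_t agbno;   /* block number within the AG */
    uint32_t offset;  /* inode index within the block */
};

/*
 * inopblog: log2(inodes per block), e.g. 3 for 4096-byte blocks
 *           holding 512-byte inodes.
 * agblklog: log2 of the AG size in blocks, rounded up.
 */
static struct ino_loc decode_ino(uint64_t ino, unsigned inopblog,
                                 unsigned agblklog)
{
    struct ino_loc loc;

    assert(inopblog + agblklog < 64);

    /* Lowest bits: inode index in the block. */
    loc.offset = (uint32_t)(ino & ((1ULL << inopblog) - 1));
    /* Middle bits: block within the AG. */
    loc.agbno = (uint32_t)((ino >> inopblog) & ((1ULL << agblklog) - 1));
    /* Remaining high bits: AG number. */
    loc.agno = (uint32_t)(ino >> (inopblog + agblklog));
    return loc;
}
```

Calling decode_ino(128, 3, 19) yields AG 0, block 16, offset 0, which matches the xfs_db output for the root inode.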

Finishing mkfs

After all the checks and calculations are complete, it is time to write the changes and finish formatting the filesystem.

  1. Align AG geometry if the underlying device is striped, or if the stripe unit and width are specified.
  2. Get the maximum inode percentage based on the filesystem size if it is not passed via the CLI.
  3. Initialize the minimum parameters required for log calculations to proceed.
  4. Set up the mount parameters for log calculations. (Here we add parameters that help calculate the minimum pre-allocated blocks to help with log size calculations).
  5. Calculate the log size.
  6. Update the incore superblock with those log calculations.
  7. Now that the validations are done, discard the old device layout using discard_devices().
  8. Use prepare_devices() to make the device ready for mounting:
    • Zero out the beginning of the devices to obliterate any old filesystem signatures.
    • Write the superblock to the disk.
    • Zero out the filesystem journal log.
  9. Several XFS macros use the mount structure. To use macros, initialize the xfs_mount_t structure by using the libxfs_mount() method.
  10. Initialize the AG headers and update the secondary superblock.
  11. Allocate the root inode.
  12. Call the libxfs_umount() method to free all the resources obtained during the mount.
  13. The first mount and unmount is done by mkfs. To review the result, use xfs_logprint to dump journal contents:
    xfs_logprint:
        data device: 0xffffffffffffffff
        log device: 0xffffffffffffffff daddr: 10485808 length: 204800

    cycle: 1        version: 2              lsn: 1,0        tail_lsn: 1,0
    length of Log Record: 512       prev offset: -1         num ops: 1
    uuid: 393834a0-61f6-46ad-9659-b1a3c66863c6   format: little endian linux
    h_size: 32768
    ----------------------------------------------------------------------------
    Oper (0): tid: b0c0d0d0  len: 8  clientid: LOG  flags: UNMOUNT
    Unmount filesystem

    ============================================================================
    xfs_logprint: skipped 204798 zeroed blocks in range: 2 - 204799
    xfs_logprint: physical end of log
    ============================================================================
    xfs_logprint: logical end of log
    ============================================================================

Defaults

The following table lists key defaults specified within the mkfs.xfs command:

Parameter               Default Value                                                                              Details
Block size              4096
Inode size              512
Max inode percentage    25% if filesystem < 1 TB. 5% if 1 TB <= filesystem < 50 TB. 1% if filesystem >= 50 TB.    Maximum percentage of space that can be used for inode allocation. This can be changed using the xfs_growfs command.
AG count                4 AGs for filesystems smaller than 4 TB.
crc                     1 (enabled)                                                                                Enable checksum for metadata.
reflink                 1 (enabled)                                                                                Enable reflink copy.
finobt                  1                                                                                          Maintains a separate free inode btree in each allocation group.
lazy-count              1                                                                                          Free space and inode counters are maintained outside of the superblock.
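The maximum inode percentage default is a simple size-based lookup, sketched below in bytes. The function name default_imaxpct() is hypothetical, and the treatment of the exact 1 TB and 50 TB boundaries is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define TiB (1024ULL * 1024ULL * 1024ULL * 1024ULL)

/* Default maximum inode percentage for a filesystem of fs_bytes bytes. */
static unsigned default_imaxpct(uint64_t fs_bytes)
{
    assert(fs_bytes > 0);

    if (fs_bytes < 1 * TiB)
        return 25;   /* small filesystems */
    if (fs_bytes < 50 * TiB)
        return 5;    /* 1 TB up to 50 TB */
    return 1;        /* 50 TB and above */
}
```

A value passed on the command line would bypass this lookup entirely, since the default is only computed when the option is not given.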

 

Constructing the XFS Filesystem

  • To create a regular filesystem with a default configuration, run the following command:

    sudo mkfs.xfs /dev/sdb1
  • To create a filesystem with external log, run the following command:

    sudo mkfs.xfs -l logdev=/dev/sdb2 /dev/sdb1
  • To create a filesystem with a specified AG count, run the following command:

    sudo mkfs.xfs -d agcount=32 /dev/sdb1
  • To verify the sanity of a raid stripe, extra steps are required.

    If the underlying device is configured for RAID, the mkfs process picks up the underlying device geometry and sets the stripe unit and width. Optionally, sunit and swidth can be used to specify a custom stripe unit and width. Note that these are specified in units of 512-byte blocks, so mkfs.xfs -d sunit=8 sets the stripe unit to 4096 bytes. Alternatively, the su option specifies the stripe unit in bytes, and the sw option specifies the stripe width as a multiple of the stripe unit.

    For a RAID level 5 device created with the mdadm command, mkfs.xfs reports sunit=128 and swidth=768 blocks. Here the block size is the filesystem block size (bsize=4096).

    mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdb1 \
                            /dev/sdb2 /dev/sdb3 /dev/sdb4
    # mkfs.xfs /dev/md0
    meta-data=/dev/md0               isize=512    agcount=16, agsize=491136 blks
             =                       sectsz=4096  attr=2, projid32bit=1
             =                       crc=1        finobt=1, sparse=1, rmapbt=0
             =                       reflink=1
    data     =                       bsize=4096   blocks=7857408, imaxpct=25
             =                       sunit=128    swidth=768 blks
    naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
    log      =internal log           bsize=4096   blocks=25600, version=2
             =                       sectsz=4096  sunit=1 blks, lazy-count=1
    realtime =none                   extsz=4096   blocks=0, rtextents=0

    You can then use details from the /sys/block directory to verify the stripe unit and width:

    cat /sys/block/md0/md/chunk_size
    524288 = (128 * 4096)
    cat /sys/block/md0/md/raid_disks
    3
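Because the stripe unit appears in two different units (512-byte blocks on the command line, filesystem blocks in the mkfs output), a pair of conversion helpers can make the check less error-prone. The function names below are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>

#define SUNIT_SECTOR_BYTES 512u  /* unit used by -d sunit= and swidth= */

/* Convert a sunit/swidth value given in 512-byte blocks to bytes. */
static uint64_t sunit_sectors_to_bytes(uint32_t sectors)
{
    return (uint64_t)sectors * SUNIT_SECTOR_BYTES;
}

/* Convert a value reported in filesystem blocks ("blks") to bytes. */
static uint64_t fsblocks_to_bytes(uint32_t blocks, uint32_t blocksize)
{
    assert(blocksize >= 512);
    return (uint64_t)blocks * blocksize;
}
```

Here sunit_sectors_to_bytes(8) gives 4096 bytes, and fsblocks_to_bytes(128, 4096) gives 524288 bytes, which is the chunk_size reported under /sys/block.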

Srikanth C S

