Why does my new XFS filesystem use so many blocks?

System administrators and power users watch disk usage and performance closely. Many have wondered why a newly created, empty XFS filesystem reports 2% to 8% of its disk space as already in use.

Consider the df output below for 8 AG, CRC enabled, external log XFS filesystems of several sizes on 4096 byte sector storage with all features enabled. It shows the blocks used and the percentage used:

type	512m	1GB	10GB	100 GB	1000 GB
all opts	9153	13384	58568	510304	5027760
% used	8	6	3	2	2

This blog looks at how the storage sector size, the filesystem size, the Allocation Group (AG) count, the log type (internal or external) and the enabled optional XFS features affect the disk usage reported by df on a new XFS filesystem, and explains what that number means.

XFS evolution

XFS is a high performance filesystem created by Silicon Graphics Inc. for the IRIX operating system. Initially, XFS had a behavior layer to interface with SGI’s Clustered XFS filesystem (CXFS) and with the Data Management API (DMAPI) calls for SGI’s Data Management Framework Hierarchical Storage Management Facility (DMF).

XFS was ported to Linux in 2001 and shipped in the Linux 2.6 kernel series. The Linux community version of XFS removed the CXFS behavior layer and the DMAPI support in the Linux 3 kernel series. The community Linux XFS has continued to evolve, adding CRC checksums to the on-disk metadata blocks along with many new features and performance improvements.

Pagesize, Basic Blocks, Disk Sectors, Filesystem Blocks

The Linux pagesize is the basic unit of kernel memory and is architecture specific. For example, Intel based x86 systems use a basic pagesize of 4KB and ARM based systems use a basic pagesize of 4KB or 64KB.

The sector size is the smallest unit the storage device can physically read or write. In the past, 512 bytes was the common sector size but 4096 bytes is now more common.

The disk block is the filesystem's basic allocation unit. Linux also has a 512 byte logical basic block size. The default XFS disk block size is 4096 bytes. In XFS, the disk offset from the beginning of the filesystem is referred to as the disk address and is measured in logical basic blocks. For example, in an XFS filesystem with 4KB blocks, the second block is at disk address 8.
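The disk address arithmetic above can be sketched in a few lines. This is only an illustration, assuming the default 4096 byte filesystem block and a simple linear block index; the function name is hypothetical and not an XFS interface.

```python
BASIC_BLOCK_SIZE = 512   # Linux logical basic block, in bytes
FS_BLOCK_SIZE = 4096     # default XFS filesystem block, in bytes (assumed)

def fs_block_to_disk_address(block_index, fs_block_size=FS_BLOCK_SIZE):
    """Return the disk address, in 512 byte basic blocks, of a filesystem block.

    block_index counts filesystem blocks from 0 at the start of the filesystem.
    """
    return block_index * (fs_block_size // BASIC_BLOCK_SIZE)

# The second filesystem block has index 1.
print(fs_block_to_disk_address(1))
```

With a 64KB filesystem block the same index would land at disk address 128, because each filesystem block then spans 128 basic blocks.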

The filesystem block size must be a whole multiple of the sector size. Before the folio memory management system, the filesystem block size could not be larger than the memory pagesize. With folios, it is possible to have a 64KB filesystem block size on 4KB pagesize architectures. The examples in this blog use the default 4KB block size but the concepts apply to any filesystem block size.

df(1) versus du(1)

The program df reports the filesystem size in blocks, the number of blocks in use, the number of blocks free and the percentage of the filesystem blocks that are used. By default, df specifies counts in 1024 byte size blocks.

The df percentage in use is simply the number of blocks used divided by the size of the filesystem. Certain fixed filesystem structures, such as the internal log blocks, are counted neither in the filesystem size nor as in use in the df calculation.

The program du estimates the space, in 1024 byte blocks, used by the specified file or by a recursively descended directory.

Both df and du accept the “--block-size=” argument to display the output in the specified block size instead of the default 1024 byte block size.

In general, du measures the data used by visible files and directories, and df measures all space internally allocated and reserved in the filesystem. A new XFS filesystem has no files, so du would be zero, but df reports many blocks in use.
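To make the df side concrete, here is a rough sketch of how df-style numbers can be derived from the counters returned by os.statvfs. The helper names are hypothetical, and rounding the percentage up is an assumption about how GNU df presents the value, not something taken from this blog.

```python
import os

def df_use_percent(total_blocks, free_blocks, avail_blocks):
    """Approximate a df style use percentage from statvfs style counters.

    used is everything that is not free; the percentage is taken against
    used + available and rounded up (assumed df behavior).
    """
    used = total_blocks - free_blocks
    denominator = used + avail_blocks
    if denominator == 0:
        return 0
    return (used * 100 + denominator - 1) // denominator

def df_for_path(path):
    """Return (size, used, available, use%) for a mounted path.

    The block counts are in units of the f_frsize reported by statvfs.
    """
    st = os.statvfs(path)
    used = st.f_blocks - st.f_bfree
    percent = df_use_percent(st.f_blocks, st.f_bfree, st.f_bavail)
    return st.f_blocks, used, st.f_bavail, percent

if __name__ == "__main__":
    print(df_for_path("/"))
```

Pointing df_for_path at the mount point of a freshly made XFS filesystem should show a nonzero used count even though du finds no file data there.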

XFS Concepts

B+ Trees

XFS metadata such as inode extent lists, AG free blocks and allocated inodes is stored in a data structure called a B+ tree (btree). B+ trees are commonly used in filesystems because they organize the metadata under a searchable key, they are block oriented for fast disk retrieval and they are cacheable for repeated use.

A B+ tree block is either a leaf block, holding a specific type of metadata entry, or an internal block that holds keys and block references to the next level closer to the leaves.

The root block of the btree can be either a leaf node, which has a btree level of 1, or an internal node of level greater than one that points to two or more btree entries.

Each btree block has a small generic header and the remaining space holds either internal or leaf entries. It is beyond the scope of this blog to explain the specifics of B+ trees. The important idea here is that B+ trees are self balancing and, when estimating space consumption, the worst case is that the internal and leaf blocks are only half full.

XFS Allocation Groups

The XFS filesystem is divided into equally sized regions called Allocation Groups (AGs). The AG is the basic allocation region for inodes and disk blocks. XFS expects at least 2 AGs so that there is a duplicate superblock, but the default minimum AG count is 4.

The total number of blocks in the filesystem may not be an exact multiple of the AG size, so the last AG may well be a smaller “runt”. Also, in an internal log filesystem, the middle AG has fewer allocatable blocks because the log blocks cannot be allocated.

The XFS AG ranges from 16MB to 1TB in size. Since each AG is limited to 1TB, and the basic block is 512 bytes in size, the AG block number (agbno) can be addressed in a 32 bit variable (2^40 / 2^9 = 2^31). Similarly, CRC filesystems use inodes of 512 bytes or more, so the AG relative inode number also fits in 32 bits. Both inodes and blocks therefore have a 32 bit AG relative value and a 64 bit filesystem global value. The global filesystem block number (fsb) is a combination of the AG specific prefix and the AG relative block offset.
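The combination of an AG prefix and an AG relative offset can be sketched as a shift and a mask. This is an illustration only: it assumes the AG number sits in the bits above the AG relative block number and that the shift is log2 of the AG size rounded up. The function names are hypothetical.

```python
def ag_block_log(ag_blocks):
    """Number of bits needed to hold any AG relative block number."""
    return (ag_blocks - 1).bit_length()

def to_fs_block(ag_number, ag_block, ag_blocks):
    """Combine an AG number and an AG relative block into a global block number."""
    return (ag_number << ag_block_log(ag_blocks)) | ag_block

def from_fs_block(fs_block, ag_blocks):
    """Split a global block number back into (AG number, AG relative block)."""
    shift = ag_block_log(ag_blocks)
    return fs_block >> shift, fs_block & ((1 << shift) - 1)

AG_BLOCKS = 327680  # blocks per AG in a 10GB, 8 AG filesystem with 4KB blocks
fsb = to_fs_block(3, 100, AG_BLOCKS)
print(fsb, from_fs_block(fsb, AG_BLOCKS))
```

Because the shift is rounded up to a power of two, global block numbers are sparse: the first block of AG 1 is numbered 524288 here, not 327680.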

Inodes are also numbered by the AG block that holds the inode chunk and the offset within the inode chunk. This association allows easy conversion from inode to disk block and from disk block to inode without disk access.

Each AG has a copy of the filesystem superblock, an inode descriptor (AGI) structure, a free block (AGF) structure that holds the roots of the by-block and by-count (by-size) free block btrees, and an AG free list (AGFL) structure. The AGFL contains enough blocks (4 total) to perform a complete btree split or combine in both the by-block and by-count btrees during a transaction that cannot make a block allocation call.

The AG superblock, AGI, AGF and AGFL are each one sector in size. If the sector is smaller than the XFS block size, then all of these items share one filesystem block. If the sector size is the same as the block size, then each item consumes its own filesystem block (4 total).

In XFS, inode and block allocation decisions are made within an AG. This allows block allocations to be performed in different AGs in parallel. Having more AGs in a filesystem allows more allocation parallelism but also leads to smaller pools of blocks, which can lead to more fragmentation in the filesystem. Larger filesystems tend to have more AGs because the AG size limit requires them.

XFS Journal (log)

XFS uses a metadata log to record metadata changes on permanent storage. The log allows the filesystem to be recovered to a consistent state after a system crash and speeds up synchronous metadata writes. The log can be either external or internal. An external log is created using separate data and log storage volumes. An internal log is created in the data storage volume using disk space borrowed from the middle AG of the filesystem, which leaves fewer allocatable blocks in that AG.

The space used for the internal log is removed from the df calculation of (available - free) / available, which can also be written as 1 - (free / available). Written this way, it is easier to see that the smaller available space in the denominator (because of the internal log) makes the filesystem appear a little fuller than an external log filesystem using the same amount of disk space.
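A small numeric sketch shows the size of this effect. The 16384 block log and the 10000 used blocks are assumed values chosen for illustration, and the function name is hypothetical.

```python
def df_used_fraction(available, free):
    """Fraction in use, computed as (available - free) / available."""
    return (available - free) / available

DATA_BLOCKS = 10 * 1024**3 // 4096   # 2621440 blocks in a 10GB filesystem
LOG_BLOCKS = 16384                   # assumed internal log size, in blocks
USED_BLOCKS = 10000                  # hypothetical, identical usage in both cases

# External log: the whole data volume counts as available space.
external = df_used_fraction(DATA_BLOCKS, DATA_BLOCKS - USED_BLOCKS)

# Internal log: the log blocks are removed from the available space.
internal_size = DATA_BLOCKS - LOG_BLOCKS
internal = df_used_fraction(internal_size, internal_size - USED_BLOCKS)

print(f"external log: {external:.4%}, internal log: {internal:.4%}")
```

The two fractions differ only slightly, which is consistent with the log type having a minor effect on the reported usage.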

When the filesystem is created, mkfs.xfs scales the internal log size based on the number of data blocks in the filesystem. See the blog Formatting an XFS Filesystem for more information. Basically, larger filesystems have larger internal logs. The filesystem creator may also manually specify the log size as long as the log fits inside an AG.

Using an internal versus external log has only minor effect on the disk usage results. The decision to use an internal or external log is generally a filesystem performance decision rather than a space decision.

Allocated versus reserved space in XFS

XFS block allocation is performed at the Allocation Group level but the block can be used anywhere in the filesystem. For example, the XFS inode and block allocation policies may prefer to place a file in the same AG as its directory, and a file's extents in the same AG as the file, for seek and rotational locality on spinning media. These items may nevertheless end up in different AGs.

Space in the XFS filesystem is allocated for a specific purpose such as data blocks, or metadata such as superblock, directories, files, links, extents, extended attributes, btree etc. Allocated space reserves blocks and also removes the allocated blocks from the free block list. Every allocated block has a defined lifetime and the block is returned to the free block list when freed. All allocated blocks can be found in a fixed location (such as the sectors defining the AG) or stored in metadata structures and can be found with a metadata walk.

Space in the XFS filesystem can also be reserved for future use but specific blocks are not allocated and no blocks are removed from the free block list in any AG. Reserved space prevents other tasks from depleting the filesystem below the reserved space amount. The intent of reserving space is to keep the space available for a future allocation of blocks for the reserved specific purpose. When the specific allocation is performed, the reserved space is decreased by the allocation amount. The remaining unused reserved space is returned at the unmount of the filesystem.

XFS Allocated space

When a new XFS filesystem is created with mkfs.xfs, a few special blocks are allocated for every AG in the filesystem.

A new XFS filesystem on 512 byte sector storage allocates only 8 blocks per AG: one block shared by the superblock, AGI, AGF and AGFL, an inode btree root, a by-block free block btree root, a by-count free block btree root, and 4 blocks in the AGFL (2 for by-block and 2 for by-count btree splits).

A new XFS filesystem on 4KB sector storage is similar, but the superblock, AGI, AGF and AGFL are in separate blocks, which uses 3 more blocks per AG.

A new XFS filesystem also allocates one inode chunk for the root inode. Pre-CRC filesystems default to 256 byte inodes and CRC filesystems default to 512 byte inodes. That means the pre-CRC inode chunk is (64 * 256 bytes) or 4 4KB blocks and the CRC inode chunk is 8 4KB blocks.
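The inode chunk sizes follow directly from the inode size. A minimal sketch, assuming 4KB filesystem blocks (the function name is hypothetical):

```python
INODES_PER_CHUNK = 64

def inode_chunk_blocks(inode_size, block_size=4096):
    """Filesystem blocks needed to hold one chunk of 64 inodes."""
    chunk_bytes = INODES_PER_CHUNK * inode_size
    return -(-chunk_bytes // block_size)   # integer division, rounded up

print(inode_chunk_blocks(256))   # pre-CRC default inode size
print(inode_chunk_blocks(512))   # CRC default inode size
```

The 4 block difference between the two results is the only allocation difference between a pre-CRC filesystem and a CRC filesystem with no optional features.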

The free inode, reference count (reflink) and reverse map features each allocate another btree root block per AG, and the reverse map allocates 2 more blocks in the AGFL per AG.

In summary, the optional features add only a very small amount of allocated space to each AG. We will see below that the feature allocated space is dwarfed by the feature reserved space.

XFS Reserved space

Space is reserved at the filesystem mount time. There are two kinds of reserved spaces, the filesystem level privileged transaction reservation and the optional feature specific per AG reservations.

XFS privileged transaction reservation

Every XFS filesystem reserves a small number of blocks to allow privileged metadata transactions to still allocate blocks when normal allocation fails due to conditions such as no space left in the filesystem. An example of an operation on a full filesystem is removing data blocks and inodes. These removals may require the inode, by-block and by-count btree blocks to split from the leaf to the root as the btrees are rebalanced.

In filesystems smaller than 640 MB in size, the privileged transaction reserved space is limited to 5% of the total number of data blocks. In filesystems that are 640MB and greater in size, the privileged transaction reserved space is fixed at 8192 blocks.
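The rule above amounts to taking the smaller of 5% of the data blocks and 8192 blocks. A minimal sketch, assuming 4KB blocks and integer division for the 5% (the function name is hypothetical):

```python
MAX_RESERVED_BLOCKS = 8192

def privileged_reserve_blocks(data_blocks):
    """Blocks set aside for privileged transactions: 5% of the data blocks, at most 8192."""
    return min(data_blocks // 20, MAX_RESERVED_BLOCKS)

BLOCK_SIZE = 4096
for size_mb in (512, 640, 1024):
    data_blocks = size_mb * 1024 * 1024 // BLOCK_SIZE
    print(size_mb, "MB:", privileged_reserve_blocks(data_blocks), "blocks")
```

The 640 MB crossover is where 5% of the data blocks (163840 / 20) reaches exactly 8192.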

Below are examples of new filesystems with different sector sizes and volume sizes, created with 4 or 8 AGs, that show this reserved space:

Blocks used on a new 4 AG XFS filesystem with 512 byte sectors:

XFS type	512m	1GB	10GB	100 GB	1000 GB
crc=0	6617	8244	8244	8244	8244
crc noopt	6621	8248	8248	8248	8248

Blocks used on a new 8 AG XFS filesystem with 512 byte sectors:

XFS type	512m	1GB	10GB	100 GB	1000 GB
crc=0	6677	8292	8292	8292	8292
crc noopt	6681	8296	8296	8296	8296

Blocks used on a new 4 AG XFS filesystem with 4096 byte sectors:

XFS type	512m	1GB	10GB	100 GB	1000 GB
crc=0	6617	8256	8256	8256	8256
crc noopt	6621	8260	8260	8260	8260

Blocks used on a new 8 AG XFS filesystem with 4096 byte sectors:

XFS type	512m	1GB	10GB	100 GB	1000 GB
crc=0	6677	8316	8316	8316	8316
crc noopt	6681	8320	8320	8320	8320

The first thing to note is that the pre-CRC (crc=0) XFS filesystem and the CRC XFS filesystem with no enabled features differ only in the inode chunk size. The pre-CRC XFS filesystem uses a 256 byte default inode size and the CRC XFS filesystem uses a 512 byte default inode size. With 64 inodes per chunk, the CRC XFS filesystem uses 4 more blocks than the pre-CRC filesystem.

The 512 MB filesystem shows the privileged transaction reservation limited to 5% of the filesystem size, and the larger filesystems show the fixed 8192 block privileged transaction reservation.

An internal log XFS filesystem allocates the same number of blocks per AG and reserves the same number of privileged transaction blocks as an external log XFS filesystem. But since the internal log uses space from the middle AG, the number of available blocks will be smaller on the internal log XFS filesystem.

The examples also show that the 512 byte sector filesystem uses 3 fewer blocks per AG than the 4096 byte sector filesystem because the superblock, AGI, AGF and AGFL share a filesystem block on the 512 byte sector filesystem.

CRC XFS per AG reservations by feature

The per AG reservation for each CRC filesystem advanced feature reserves enough btree metadata space to cover the hypothetical situation where the entire AG needs to be described by that feature. This reservation also gives credit for the blocks already allocated for the feature. In a new filesystem, the feature’s btree root is the only block allocated for the feature. As blocks are allocated for the feature, the reservation is decreased by the allocation size. Unused reservation blocks are returned when the filesystem is unmounted.

For example, if the entire AG consists of inode chunks and each inode chunk has at least one free inode, then the free inode btree feature will need a free inode btree entry for every allocated inode chunk in the AG. The btree space reserved for the free inode feature also assumes the worst btree case, where the free inode btree leaf and internal blocks are only half full.

CRC XFS per AG reservations example (finobt)

Below are the short form btree block header and the inode btree record and key. We will use these example structures to show how to calculate the number of finobt entries that fit into a btree block, and to calculate the worst case number of blocks that must be reserved if every AG were filled with inode chunks and every chunk had at least one free inode. The reflink and rmap values are calculated in a similar way.

Btree short form

struct xfs_btree_block_shdr {
	__be32		bb_leftsib;
	__be32		bb_rightsib;

	__be64		bb_blkno;
	__be64		bb_lsn;
	uuid_t		bb_uuid;
	__be32		bb_owner;
	__le32		bb_crc;
};

struct xfs_btree_block {
	__be32		bb_magic;
	__be16		bb_level;
	__be16		bb_numrecs;
	union {
		struct xfs_btree_block_shdr s;
		struct xfs_btree_block_lhdr l;
	} bb_u;
};

inobt leaf entry
typedef struct xfs_inobt_rec {
	__be32			ir_startino;
	union {
		struct {
			__be32	ir_freecount;
		} f;
		struct {
			__be16	ir_holemask;
			__u8	ir_count;
			__u8	ir_freecount;
		} sp;
	} ir_u;
	__be64		ir_free;
} xfs_inobt_rec_t;

An inobt internal btree block records the key (the starting inode number) and the AG block number (agbno) of the next level block:

typedef struct xfs_inobt_key {
	__be32		ir_startino;
} xfs_inobt_key_t;

typedef	__be32	xfs_inobt_ptr_t;

To calculate the number of reservation btree blocks for a feature, first determine how many leaf and internal entries will fit in the btree block.

The short form btree header uses 56 bytes, so that leaves 4040 bytes of a 4KB block for the btree entry records.

The xfs_inobt_rec size is 16 bytes which means 252 (4040 / 16) max leaf entries or 126 (252 / 2) minimum (half full) leaf entries will fit into one btree block.

Similarly, the inobt key is 4 + 4 = 8 bytes which means 505 (4040 / 8) max internal entries or 252 (505 / 2) minimum (half full) internal entries will fit into one btree block.

The minimum and maximum leaf and internal entries per btree block for each feature are listed below:

feature	leaf min	leaf max	internal min	internal max
finobt	126	252	252	505
reflink	168	336	252	505
rmap	84	168	45	91
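The table can be reproduced from the 56 byte header and the entry sizes. In the sketch below, the 16 byte finobt record and the 8 byte key plus pointer come from the structures above. The reflink and rmap entry sizes (12, 24 and 44 bytes) are assumptions chosen to match the table and are not shown in this blog.

```python
BLOCK_SIZE = 4096
SHORT_HEADER_SIZE = 56   # CRC short form btree block header

def entries_per_block(entry_size):
    """Return (minimum, maximum) entries per block; the minimum is the half full case."""
    maximum = (BLOCK_SIZE - SHORT_HEADER_SIZE) // entry_size
    return maximum // 2, maximum

# (leaf record bytes, internal key plus pointer bytes)
# The reflink and rmap sizes are assumptions, see the text above.
ENTRY_SIZES = {
    "finobt": (16, 8),
    "reflink": (12, 8),
    "rmap": (24, 44),
}

for feature, (record_size, keyptr_size) in ENTRY_SIZES.items():
    print(feature, entries_per_block(record_size), entries_per_block(keyptr_size))
```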

Once the number of leaf and internal entries that can fit into a btree block is known, the next step is to find the number of btree blocks that are needed for the leaf blocks. This is done by dividing the total number of blocks (chunks) in each AG by the number of leaf entries per btree block.

Note that the free inode feature counts inode chunks per AG instead of data blocks. If the filesystem uses an internal log, then the log size is subtracted from the middle AG that contains the log.

Then the number of internal btree blocks is calculated at each level by dividing the previous level by the internal entries per btree block until the root btree node is reached. In the following sections we will show the specific calculation to calculate the number of blocks needed to be reserved for each feature.

The reservation process also subtracts blocks that have been allocated in each AG for feature use. On a new XFS filesystem, the feature’s btree root is the only allocated block.
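The level by level procedure can be sketched as a short loop. This is an illustration of the steps described above, not the kernel's code, and the function names are hypothetical. The example uses 40960 records, which is the number of 8 block inode chunks that fit in a 327680 block AG.

```python
def worst_case_levels(records, leaf_min, node_min):
    """Return the block count at each btree level, leaves first, with half full blocks."""
    blocks = -(-records // leaf_min)          # leaf blocks, rounded up
    levels = [blocks]
    while blocks > 1:
        blocks = -(-blocks // node_min)       # internal blocks for the next level up
        levels.append(blocks)
    return levels

def blocks_to_reserve(records, leaf_min, node_min, already_allocated=1):
    """Worst case btree size minus the blocks the feature has already allocated."""
    return sum(worst_case_levels(records, leaf_min, node_min)) - already_allocated

levels = worst_case_levels(40960, 126, 252)   # finobt minimums: 126 leaf, 252 internal
print(levels, sum(levels))
print(blocks_to_reserve(40960, 126, 252))
```

Since df counts both the allocated root block and the remaining reservation as used, the number that shows up in df is the full worst case size rather than the reduced reservation.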

CRC XFS free inode btree extension (-m finobt=1)

For every allocated inode chunk there is an inode btree entry that specifies the starting inode number of the chunk and a mask representing which inodes in the chunk are free.

Allocating a new XFS inode originally involved searching the AG inode btree from the beginning, looking for a chunk with a free inode. This search can require multiple disk accesses to find an inode chunk with a free inode. If no free inode is found in the AG, another AG may be chosen or a new inode chunk is allocated in this AG. To avoid this time consuming and disk access intensive search, the free inode btree (finobt) feature added a new per AG btree that points to the inode chunks that have at least one free inode. The free inode btree entry has the same starting inode number and mask as the inode btree entry.

A free inode btree entry is added or removed when a file, directory, link, device special file (block or character device), etc. is created or removed and the operation leaves a free inode in the inode chunk. For example, allocating a new chunk for a new inode also adds the chunk to the free inode btree, and removing an inode from a full chunk also adds the chunk to the free inode btree.

The free inode btree will be either empty, a subset of the inode btree or, at worst, the same as the inode btree. The entries in the free inode btree will match the entries in the inode btree. The free inode btree reserved space ensures there is metadata space available for the free inode btree.

XFS inodes are allocated in chunks of 64 inodes. Originally the XFS default inode size was 256 bytes, meaning the inode chunk was a 16KB sized and aligned block. When the CRC checksum was added to the metadata, the minimum inode size had to be increased to 512 bytes. It was also found that allocating 32KB sized and aligned blocks can be difficult as the filesystem becomes fragmented, so “sparse inode” support was added. A sparse inode allocation is considered part of a logical chunk, with a mask that specifies which blocks of the chunk are valid.

The df output below compares the disk usage reported with the free inode feature to that of a filesystem without any features:

Blocks used on a new 4 AG XFS external log filesystem with 4096 byte sectors:

feature	512m	1GB	10GB	100 GB	1000 GB
crc noopt	6621	8260	8260	8260	8260
finobt	6757	8528	10880	34376	269368

Blocks used on a new 8 AG XFS external log filesystem with 4096 byte sectors:

feature	512m	1GB	10GB	100 GB	1000 GB
crc noopt	6681	8320	8320	8320	8320
finobt	6825	8592	10952	34440	269424

Blocks used on a new 8 AG XFS internal log filesystem with 4096 byte sectors:

feature	512m 4AG	1GB	10GB	100 GB	1000 GB
crc noopt	6621	8320	8320	8320	8320
finobt	6741	8576	10935	34424	269297

The free inode btree will have an entry for each inode chunk (64 inodes) that has at least one free inode. The finobt reservation calculation assumes that the AG is completely filled with inode chunks that each have at least one free inode.

The number of inode chunks that can completely fill an AG would be the number of data blocks in the AG divided by the number of blocks in an inode chunk. The chunk size for the CRC filesystem with the default 512 byte inode is 8 ((512 * 64) / 4096) blocks.

A 10GB 8 AG filesystem has 327680 blocks in each AG. If this filesystem has an internal log, then 16384 blocks would not be available in AG 4. With 512 byte inodes, there are 40960 (327680 / 8) chunks in an AG. The middle AG of an internal log XFS filesystem would have 38912 ((327680 - 16384) / 8) chunks.

For the AGs in this filesystem that do not hold the log, the free inode feature reservation will have 326 (40960 / 126 rounded up) btree leaf blocks. This reservation also requires 2 (326 / 252 rounded up) level 1 internal btree blocks and the level 2 root. This results in a total reservation of 329 (326 + 2 + 1) btree blocks for this AG.

If the filesystem has an internal log, then the middle AG will have at most 309 (38912 / 126 rounded up) leaf btree blocks, 2 (309 / 252 rounded up) level 1 blocks and one level 2 root block, for a total reservation of 312 (309 + 2 + 1) blocks for this AG.

An external log filesystem total reservation will be 2632 (329 * 8) blocks and an internal log filesystem reservation will be 2615 ((329 * 7) + 312) blocks for the finobt feature.

As we see above, the calculated finobt reservation matches the df output for both the external log (10952 - 8320 = 2632) and the internal log (10935 - 8320 = 2615).

In summary, the number of AGs makes only a small difference in the number of reserved blocks (2620 with 4 AGs versus 2632 with 8 AGs). The additional amount comes from the added root block in each AG and small differences in rounding up.
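The whole finobt calculation for the 10GB example can be put together in one sketch. It assumes 8 block inode chunks, the finobt minimums of 126 leaf and 252 internal entries, and that an internal log sits in the middle AG. The function names are hypothetical.

```python
def btree_blocks(records, leaf_min, node_min):
    """Worst case (half full) number of btree blocks needed for the given records."""
    blocks = total = -(-records // leaf_min)   # leaf blocks, rounded up
    while blocks > 1:
        blocks = -(-blocks // node_min)        # next level up, rounded up
        total += blocks
    return total

def finobt_total(fs_blocks, ag_count, log_blocks=0, chunk_blocks=8):
    """Sum the per AG finobt worst case sizes over the whole filesystem."""
    ag_blocks = fs_blocks // ag_count
    total = 0
    for agno in range(ag_count):
        # An internal log takes its blocks from the middle AG only.
        usable = ag_blocks - (log_blocks if agno == ag_count // 2 else 0)
        total += btree_blocks(usable // chunk_blocks, 126, 252)
    return total

FS_BLOCKS = 10 * 1024**3 // 4096                         # 10GB in 4KB blocks
print(finobt_total(FS_BLOCKS, 8))                        # 8 AGs, external log
print(finobt_total(FS_BLOCKS, 8, log_blocks=16384))      # 8 AGs, internal log
print(finobt_total(FS_BLOCKS, 4))                        # 4 AGs, external log
```

The three results line up with the df differences quoted above: 2632, 2615 and 2620 blocks.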

CRC XFS reference count btree extension (-m reflink=1)

The reflink feature allows data blocks to be deduplicated and shared between multiple files. It allows quick (shallow) copies and uses less space for duplicated data.

The XFS reflink implementation adds a new per AG btree to track the number of times (the reference count) that data blocks (block start and count) are shared. When a shared block range is modified, Copy On Write processing allocates new data block(s) for the modified range, possibly splits the reference count btree range and decrements the reference count for the modified shared block range. If the reference count becomes 1, the block range is no longer shared and is removed from the reflink btree.

Although the reference count btree is designed to track block ranges, the reflink btree block reservation space calculation assumes the worst case situation where every block in every AG is shared in such a way that each block must be tracked separately.

Reference count btree entries are inserted when blocks are (re)shared by remapping an inode’s data block range (for example, with the cp --reflink command). A write operation may add new reference count subranges or remove the reference count entry if the block range is no longer shared. Reserved blocks for the reference count btree are needed to ensure the write can complete.

Blocks used on a new 8 AG XFS external log filesystem with 4096 byte sectors:

feature	512m	1GB	10GB	100 GB	1000 GB
crc noopt	6681	8320	8320	8320	8320
reflink	7473	9896	24000	164992	1574936

The reservation calculation process for the reflink feature is similar to the finobt feature.

The leaf and internal entry sizes for the reflink feature are similar to the finobt sizes, but reflink reserves about 6 times more space. This difference arises because finobt has an entry for every 32KB inode chunk while reflink assumes the worst case of an entry for every 4KB block.

For the 10GB XFS example, the reflink reservation calculation gives 1951 (327680 / 168 rounded up) leaf btree blocks, 8 (1951 / 252 rounded up) level 1 internal btree blocks and a root, for a total of 1960 (1951 + 8 + 1) reserved reflink blocks per AG. The total reflink reservation for an external log filesystem is 15680 (8 * 1960) btree blocks, which matches the df output (24000 - 8320).
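The same worst case arithmetic, applied per 4KB block instead of per 8 block inode chunk, gives the reflink figure and the roughly 6 times ratio. This is a sketch under the minimum entry counts listed earlier (168 leaf and 252 internal for reflink, 126 and 252 for finobt). The function name is hypothetical.

```python
def btree_blocks(records, leaf_min, node_min):
    """Worst case (half full) number of btree blocks needed for the given records."""
    blocks = total = -(-records // leaf_min)   # leaf blocks, rounded up
    while blocks > 1:
        blocks = -(-blocks // node_min)        # next level up, rounded up
        total += blocks
    return total

AG_COUNT = 8
AG_BLOCKS = 327680                 # 10GB filesystem split into 8 AGs

# reflink worst case: one refcount record for every block in the AG
reflink_per_ag = btree_blocks(AG_BLOCKS, 168, 252)
# finobt worst case: one record for every 8 block inode chunk in the AG
finobt_per_ag = btree_blocks(AG_BLOCKS // 8, 126, 252)

print(reflink_per_ag, reflink_per_ag * AG_COUNT)
print(round(reflink_per_ag / finobt_per_ag, 2))
```

Reflink has 8 times as many worst case records as finobt, but its smaller record packs more entries into each leaf, which is why the ratio lands near 6 rather than 8.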

CRC XFS reverse map extension (-m rmapbt=1)

The reverse mapping (rmap) feature adds a per AG btree and 2 additional AGFL blocks. The btree tracks the owner of each block (the metadata owner, or the inode and offset for data blocks) to aid filesystem repair and shrinking. In the case of reflink shared data blocks, there will be multiple owners.

This feature is enabled by default in Oracle Linux 10 mkfs.xfs; in Oracle Linux 9 it is available but disabled by default.

A reverse map btree entry is inserted when a block range is allocated and also when a data block range is shared through reflink. The added AGFL blocks are required for rmap btree splits during block allocation, just as the original AGFL blocks are for the free space btrees. The information in the reverse map btree must match the block allocation and reference count btrees. The reverse map reservation space ensures the metadata information can be kept consistent.

Blocks used on a new 8 AG XFS external log filesystem with 4096 byte sectors:

feature	512m	1GB	10GB	100 GB	1000 GB
crc noopt	6681	8320	8320	8320	8320
rmapbt	8305	11536	40256	327512	3200040

The rmap btree record [startblock, blockcount, owner, offset] and key [startblock, owner, offset] are much larger than those of the finobt and reflink features. The larger record and key result in a reservation that is over 10 times larger than the finobt reservation and 2 times larger than the reflink reservation. The rmap reservation is also never smaller than 1% of the AG.

For the 10GB XFS example, the rmap reservation calculation gives 3901 (327680 / 84 rounded up) leaf blocks, 87 (3901 / 45 rounded up) internal btree blocks at the next level and 2 (87 / 45 rounded up) internal btree blocks above them, for a total of 3990 (3901 + 87 + 2) rmap reservation blocks, plus 2 additional AGFL blocks per AG. The total rmap reservation and allocation for an external log filesystem is 31936 (3992 * 8) blocks, which matches the df output (40256 - 8320).

xfs_db

The user space XFS debugger, xfs_db, can report the number of records in each btree level, the number of blocks at each level, and the total levels and blocks needed for a specific btree type. Options select whether the btree blocks are as full as possible (-w max) or only half full (-w min).

To generate the worst case btree disk use for the free inode, reference count and reverse map btrees, use the commands:

 xfs_db -r -c 'sb 0' -c 'p agblocks' [-l LOGDEV] DEV

 xfs_db -r -c 'sb 0' -c 'p inopblock' [-l LOGDEV] DEV

 # for internal logs:
 xfs_db -r -c 'sb 0' -c 'p logblocks' [-l LOGDEV] DEV

 xfs_db -r -c 'btheight -n CLUSTERS -w min finobt' [-l LOGDEV] DEV

 xfs_db -r -c 'btheight -n AGBLKS -w min refcountbt' [-l LOGDEV] DEV

 xfs_db -r -c 'btheight -n AGBLKS -w min rmapbt' [-l LOGDEV] DEV

Where the first command returns the number of blocks in the AG (AGBLKS) and the second command returns the number of inodes per block (INOPBLK). The number of clusters in an AG (CLUSTERS) is (AGBLKS * INOPBLK) / 64.
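The CLUSTERS value and the matching btheight command line can be derived with a small helper. This sketch only builds the command string and does not run xfs_db. The function names are hypothetical, and the device paths are placeholders.

```python
import shlex

INODES_PER_CHUNK = 64

def clusters_per_ag(agblocks, inopblock, logblocks=0):
    """Inode chunks that fit in an AG, after removing any internal log blocks."""
    return (agblocks - logblocks) * inopblock // INODES_PER_CHUNK

def btheight_command(records, btree, device, log_device=None):
    """Build the xfs_db btheight command line for a worst case (-w min) estimate."""
    cmd = ["xfs_db", "-r", "-c", f"btheight -n {records} -w min {btree}"]
    if log_device:
        cmd += ["-l", log_device]
    cmd.append(device)
    return shlex.join(cmd)

clusters = clusters_per_ag(327680, 8)
print(clusters)
print(clusters_per_ag(327680, 8, logblocks=16384))   # the AG that holds an internal log
print(btheight_command(clusters, "finobt", "/dev/sda", log_device="/dev/sdb"))
```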

External log Example:
 # xfs_db -r -c 'sb 0' -c 'p agblocks'  -l /dev/sdb /dev/sda
   agblocks = 327680

 # xfs_db -r -c 'sb 0' -c 'p inopblock' -l /dev/sdb /dev/sda
   inopblock = 8

 Note: 327680 * 8 / 64 = 40960 clusters per AG without internal log
 # xfs_db -r -c 'btheight -n 40960 -w min finobt' -l /dev/sdb /dev/sda
  finobt: worst case per 4096-byte block: 126 records (leaf) / 252 keyptrs (node)
   level 0: 40960 records, 326 blocks
   level 1: 326 records, 2 blocks
   level 2: 2 records, 1 block
   3 levels, 329 blocks total

 # xfs_db -r -c 'btheight -n 327680 -w min refcountbt' -l /dev/sdb /dev/sda
  refcountbt: worst case per 4096-byte block: 168 records (leaf) / 252 keyptrs (node)
  level 0: 327680 records, 1951 blocks
  level 1: 1951 records, 8 blocks
  level 2: 8 records, 1 block
  3 levels, 1960 blocks total

If there is an internal log, the third command returns the number of blocks in the log, which is subtracted from AGBLKS when calculating the btree reservation for the AG that holds the log, as shown in the earlier sections.

Conclusion

We saw that each XFS Allocation Group allocates a sector each for a superblock, the inode description (AGI), the free space description (AGF) and the list of special free blocks (AGFL). If the sector size is smaller than the filesystem block size, then these sectors are stored in one filesystem block. Also, each Allocation Group allocates a few blocks for btree roots and the free block list.

The rest of the reported used space is reservation. XFS reserves at most 8192 blocks for metadata use when the filesystem is full and, more importantly, reserves space for the optional features.

Each optional feature reserves enough btree space for the hypothetical case where the entire AG is allocated with that feature's structures. The optional feature reserved space is heavily affected by the number of blocks in the AGs. The larger the filesystem, the larger the total number of feature blocks reserved. Each feature reserves a specific amount of space: the free inode feature reserves the least, followed by the reference count feature, and the reverse map feature reserves the most.

On Oracle Linux version 9 mkfs.xfs, finobt and reflink are currently enabled by default and rmap is optional. On Oracle Linux version 10 mkfs.xfs, finobt, reflink and rmap are all enabled by default. These features may be manually disabled when creating the filesystem if they are not needed. Disabling features results in significant reserved space savings.

Since the internal log area uses blocks in the middle AG of the filesystem, these log blocks are not used in the calculation of the optional feature reserved space. An internal log filesystem will reserve fewer optional feature blocks, but overall it will have fewer available blocks for data and metadata use. The choice between an internal and an external log is made for performance reasons rather than disk usage considerations.

Authors

Mark Tinguely