Growing a file requires additional allocation of space and the filesystem is responsible for finding an extent or a group of extents to fit the new data. In this blog post, we are going to look at the extent allocation process for an XFS filesystem. XFS supports features such as extent hint that allows some intervention over the size of the extent allocated, more about that further in the blog. XFS also supports realtime volumes, which use a simple bitmap to track free space on that volume. The realtime volume details are not covered in this blog.
XFS is a smart filesystem that maintains all its free space info in the Allocation Group (AG) headers. This free space info is tracked by two b+trees, with one tree sorted based on the starting block number of the extent (bnoroot) and the other based on the extent length (cntroot). The rationale behind this is to allow XFS to quickly find an extent either close to the previous extent or of a suitable length. Each AG has the root of these two trees saved in its free space management header (AGF) and it also tracks the longest free extent in that AG for easy access. Extent allocation makes use of these headers to choose where the allocation goes and which extent best suits the request.
# Example xfs_db> agf 0 xfs_db> p magicnum = 0x58414746 ... bnoroot = 1 cntroot = 2 ... freeblks = 104502 longest = 93465 ... ...
Delayed Allocation
Like many other filesystems, XFS supports delayed file extent allocation. Delayed allocations can combine individual buffered writes when pagecache is written back. Delay Allocation of the buffered file combines multiple small and possibly out of order writes into one large contiguous extent when the required space is available. For larger files, delayed allocation can help reduce the number of extents required by grouping data before requesting an allocation. This delayed allocation is done in XFS by xfs_bmap_add_extent_delay_real(), which converts delayed allocation to a real allocation.
Allocation Algorithms
Block allocation is done when a file is grown or when changes to metadata, COW, or unwritten extents occur. All the interesting extent allocation stuff happens in xfs_bmapi_write() and xfs_bmapi_allocate(). We check if the allocation is for a file, COW, or metadata and then work accordingly.
When a file is grown, the file requests the filesystem for an extent and now the filesystem has to look around for free space that best fits the file and return with an extent. We have the file/inode and the space required as our parameters, let’s say that the space required is minblocks. Now, as the filesystem is divided into AGs, the requested minblocks may be bigger than the AG size. In such cases, allocation of a single extent that large is not possible as we do not have extents bigger than the AG size. Trim the minblocks to max usable space in an AG (AG size – header reservations).
In the extent allocation flow we start from iomap_writeback_ops for XFS and reach xfs_bmap_btalloc(). This function starts by selecting an AG with an extent large enough to satisfy the requirements. Traverse all the AGs starting from the AG that contains the previous extent. If no previous extent is to be found, select AG of the inode. Check if this is a metadata preferred AG. If so, as we are trying to allocate user data, try selecting a different AG. Initialize the perag (Incore AG header) structure to get AG free space details. Each AG free space header holds a reference to the longest extent available in that AG, this can be used to determine if this AG works for us. If no AG has the required free space, pick one with the most free space and adjust minblocks to the newly found length. If there are available data blocks at EOF, optimize allocation for contiguous file extension and extend the previous extent.
All these extent allocation functions make use of a lot of repeated parameters and to avoid having to pass them all the time, they are stored in the structure xfs_alloc_arg and then passed around. This structure can be seen below and contains most of the extent-related data such as a reference to the AG incore structure (pag), AG data, extent length, and mod, prod for extent adjustments and others.
typedef struct xfs_alloc_arg { struct xfs_trans *tp; /* transaction pointer */ struct xfs_mount *mp; /* file system mount point */ struct xfs_buf *agbp; /* buffer for a.g. freelist header */ struct xfs_perag *pag; /* per-ag struct for this agno */ xfs_fsblock_t fsbno; /* file system block number */ xfs_agnumber_t agno; /* allocation group number */ xfs_agblock_t agbno; /* allocation group-relative block # */ xfs_extlen_t minlen; /* minimum size of extent */ xfs_extlen_t maxlen; /* maximum size of extent */ xfs_extlen_t mod; /* mod value for extent size */ xfs_extlen_t prod; /* prod value for extent size */ xfs_extlen_t minleft; /* min blocks must be left after us */ xfs_extlen_t total; /* total blocks needed in xaction */ xfs_extlen_t alignment; /* align answer to multiple of this */ xfs_extlen_t minalignslop; /* slop for minlen+alignment calcs */ xfs_agblock_t min_agbno; /* set an agbno range for NEAR allocs */ xfs_agblock_t max_agbno; /* ... */ xfs_extlen_t len; /* output: actual size of extent */ int datatype; /* mask defining data type treatment */ char wasdel; /* set if allocation was prev delayed */ char wasfromfl; /* set if allocation is from freelist */ struct xfs_owner_info oinfo; /* owner of blocks being allocated */ enum xfs_ag_resv_type resv; /* block reservation to use */ #ifdef DEBUG bool alloc_minlen_only; /* allocate exact minlen extent */ #endif } xfs_alloc_arg_t;
If we have the locality information, i.e., we have set the start block of the new extent based on the previous extent, try out a near allocation using xfs_alloc_ag_vextent_near() to find an extent close to the previous extent.
Once the start block is set and we have an AG with the required free space, we can start searching for an extent close to the start block. To do this, make use of the perag incore reference and get the cnt and bno free space trees. Try to find an extent close to the start block set earlier. Using start block and minblocks, there are two algorithms to select the most suitable extent.
- Algorithm 1 – If the minblocks is large when compared with the available free space in that AG, then we would reach the right leaf of the cnt tree, examine the entries in this block and pick the most suitable one.
- Algorithm 2 – Search cnt and bno btrees in parallel by making use of the start block and minblocks. Combine these two trees and search for the required extent, this can help find the best suitable extent.
Here is a visual representation of cnt and bno trees that can be looked at from the AG free space header using the tool xfs_db. We can see that for AG 1, the entries in the bno tree are sorted based on the start block of the extent and entries in the cnt tree are sorted based on the extent length.
bno tree
xfs_db> agf 0 xfs_db> addr bnoroot xfs_db> btdump bnobt level 0 block 1 daddr 8 recs[1-5] = [startblock,blockcount] 1:[39169,10000] 2:[64945,583] 3:[87501,10000] 4:[107501,10000] 5:[127501,3571]
cnt tree
xfs_db> agf 0 xfs_db> addr cntroot xfs_db> btdump cntbt level 0 block 1 daddr 16 recs[1-5] = [startblock,blockcount] 1:[64945,583] 2:[127501,3571] 3:[39169,10000] 4:[87501,10000] 5:[107501,10000]
If the allocation fails in this AG, try other AGs. If the filesystem is fragmented and there may not be an extent large enough for the allocation, the allocation process loops requesting the remaining amount until the whole amount is fulfilled or the user aborts with a failure.
Conclusion
Once the required free space is found, the members of xfs_alloc_arg are filled and xfs_bmap_process_allocated_extent() is called to finish extent allocation. If the extent size hint is set, we try and round the extent length based on the hint. Meanwhile, update all the metadata related to the free space. The allocation discussed above is for general cases, if the filesystem is low on space or if busy extents exist (busy extents are freed metadata blocks that have not yet been committed and cannot be reallocated for data until the log is committed to storage), the allocation flow is altered. Different xfs_alloc_ventent() policies exist to cater to different situations; the case discussed above is for near allocation. Other policies include exact_bno, start_ag, and first_ag allocations, which try to allocate as their names indicate. And at any point of time if we fail to find an extent, -ENOSPC is returned to the caller and the caller can decide to look for the extent in a different AG. Once an extent is allocated, we can see the extent details using the tools xfs_bmap and xfs_db. The difference in the outputs seen below is because xfs_bmap presents data in Linux basic blocks (512 bytes) and xfs_db presents it in filesystem blocks (4096 bytes). Consider the file extent_demo.data, which initially had only one extent and after growing was allocated a new extent. The details of which can be seen below. We can also see the changes in free extent btree done in the cnt and bno trees after allocation.
Before extent allocation: There are two files of size 109M:
# ls -li /mnt 132 -rw-------. 1 root root 114117140 May 20 11:43 copy.data 132 -rw-------. 1 root root 114117140 May 20 11:43 extent_demo.data # xfs_bmap -e -l -p -v -v -v extent_demo.data extent_demo.data: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..222887]: 192..223079 0 (192..223079) 222888 000000 # xfs_bmap -e -l -p -v -v -v copy.data copy.data: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..222887]: 223080..445967 0 (223080..445967) 222888 000000 # xfs_db -x -c "inode 131" -c bmap filesystem.img data offset 0 startblock 24 (0/24) count 27861 flag 0 # xfs_db -c "inode 132" -c bmap filesystem.img data offset 0 startblock 27885 (0/27885) count 27861 flag 0 # xfs_db -c "agf 0" -c "addr cntroot" -c p -c "agf 0" -c "addr bnoroot" -c p filesystem.img cnt tree: cntbt level 0 block 1 daddr 16 recs[1-2] = [startblock,blockcount] 1:[10,6] 2:[55746,75326] bno tree: bnobt level 0 block 1 daddr 8 recs[1-2] = [startblock,blockcount] 1:[10,6] 2:[55746,75326]
After allocation of new extent: We grow the extent_demo.data file to 218M and the new allocation alters the free blocks which can be seen below:
# xfs_bmap -e -l -p -v -v -v extent_demo.data extent_demo.data: EXT: FILE-OFFSET BLOCK-RANGE AG AG-OFFSET TOTAL FLAGS 0: [0..222887]: 192..223079 0 (192..223079) 222888 000000 1: [222888..445775]: 445968..668855 0 (445968..668855) 222888 000000 # xfs_db -x -c "inode 131" -c bmap filesystem.img data offset 0 startblock 24 (0/24) count 27861 flag 0 data offset 27861 startblock 55746 (0/55746) count 27861 flag 0 # xfs_db -c "agf 0" -c "addr cntroot" -c p -c "agf 0" -c "addr bnoroot" -c p filesystem.img cnt tree: cntbt level 0 block 1 daddr 16 recs[1-2] = [startblock,blockcount] 1:[10,6] 2:[83607,47465] bno tree: bnobt level 0 block 1 daddr 8 recs[1-2] = [startblock,blockcount] 1:[10,6] 2:[83607,47465]