Introduction
The blog post "XFS: Ondisk layout of checkpoint transactions" described how checkpoint transactions are laid out in the ondisk log. This article delves deeper by providing examples of actual metadata being logged. We will look at the details of logging the following metadata items:
- Inodes
- Btree blocks
- Intent items
Logging an Inode
The XFS inode can be divided into the following regions:
- Core
This category includes fields like the uid, gid, hard link count, access/modified/change times, size of the file, number of extents used to hold the data of the file and number of extents used to hold the attributes of the file.
- Data fork
This can contain one of the following:
- Directory entries in case of small directories.
- Extent mappings if the mappings can fit in the region.
- The root node of a Btree. Among other things, the Btree can be used to index the mappings of file extents.
- Attribute fork
This can contain one of the following:
- Attributes of the file if they can fit here.
- Extent mappings if the mappings can fit in the region.
- The root node of a Btree whose leaves contain the mappings of extents.
The extents whose mappings are mentioned above contain a Btree whose leaves hold the attributes of the inode.
Consider the case of writing a 4k block at offset zero of an empty file. This will cause some of the core fields of the inode, such as the size, various timestamps and the data extent count, to be modified. It will also add the new extent mapping to the data fork. The checkpoint transaction through which the inode modification is logged will contain the following:
struct xfs_inode_log_format
This structure has the following layout:
struct xfs_inode_log_format {
    uint16_t ilf_type;
    uint16_t ilf_size;
    uint32_t ilf_fields;
    uint16_t ilf_asize;
    uint16_t ilf_dsize;
    uint32_t ilf_pad;
    uint64_t ilf_ino;
    union {
        uint32_t ilfu_rdev;
        uint8_t __pad[16];
    } ilf_u;
    int64_t ilf_blkno;
    int32_t ilf_len;
    int32_t ilf_boffset;
};
  - ilf_type will be set to XFS_LI_INODE (i.e. 0x123b). This member helps the program reading the checkpoint transaction on the disk figure out the type of the structure that it is supposed to read.
  - ilf_size will be set to the number of regions logged for the inode, including the struct xfs_inode_log_format itself. In our example, this field will be set to 3: one for the struct xfs_inode_log_format and one each for the core and data fork regions of the inode.
  - ilf_fields is a bit field that indicates the regions of the inode being logged in a more fine-grained manner. The modifications made in the example will cause the following bits to be set:
    - XFS_ILOG_CORE: the core inode region is to be logged.
    - XFS_ILOG_DEXT: the single extent mapping stored in the inode is to be logged.
    Each of the regions indicated by the bit field will be logged following the struct xfs_inode_log_format, each with its own struct xlog_op_header.
  - ilf_dsize is set to the size of the space occupied by the single extent mapping in the data fork.
  - ilf_ino is set to the number of the inode being logged.
- struct xfs_log_dinode
This structure has a copy of the core region of the inode.
- Logging extent mapping in the data fork.
The last logged region in the example contains a copy of the single extent map owned by the inode.
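The relationship between ilf_fields and the regions that follow the format structure can be sketched as below. This is only an illustration: the helper names are hypothetical, and the two flag values are assumed to match the kernel's log format header. It is not how the kernel or xfs_logprint is actually written.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed to match the kernel's definitions; treat them as illustrative. */
#define XFS_ILOG_CORE 0x001u /* inode core is logged */
#define XFS_ILOG_DEXT 0x004u /* data fork extent mappings are logged */

/*
 * Hypothetical helper: count the inode regions that a reader should expect
 * to find after the struct xfs_inode_log_format. Only the two flags used
 * in the example above are considered.
 */
int ilog_region_count(uint32_t ilf_fields)
{
    int regions = 0;

    if (ilf_fields & XFS_ILOG_CORE)
        regions++;
    if (ilf_fields & XFS_ILOG_DEXT)
        regions++;
    return regions;
}

/* Hypothetical helper: describe the regions selected by the bit field. */
void ilog_print_fields(uint32_t ilf_fields)
{
    if (ilf_fields & XFS_ILOG_CORE)
        printf("core region logged\n");
    if (ilf_fields & XFS_ILOG_DEXT)
        printf("data fork extent mapping logged\n");
}
```

For the 4k write example, ilog_region_count(XFS_ILOG_CORE | XFS_ILOG_DEXT) reports two regions after the format structure: the copy of the inode core and the copy of the extent mapping.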
Logging a Btree block
Most of the ondisk metadata structures are tracked in memory using a struct xfs_buf. A disk block holding a node or a leaf of a btree is an example of a metadata structure represented in memory by a struct xfs_buf.
The following is the layout of struct xfs_buf with only members of interest included:
struct xfs_buf {
    int b_length;
    void *b_addr;
    struct xfs_buf_map *b_maps;
    int b_map_count;
    ...
};
- b_length: size of the metadata in 512 byte units.
- b_addr: address in memory holding the contents of the metadata block read from the disk.
- b_maps: an array which maps metadata offset to disk offset. Some of the metadata blocks can span more than one filesystem block. For example, we could have a filesystem with 4k as the block size and 64k as the directory block size. In such a case, the filesystem blocks making up a directory block may not be contiguous on disk. This array helps in mapping a directory offset to the underlying filesystem blocks.
- b_map_count: number of elements in the b_maps array.
Let us consider the case of adding a new extent mapping to an inode which has its extent mappings stored in a btree. The new mapping record will be added to one of the leaves of the btree. So only a small portion of the btree leaf is actually being modified. This is a general pattern applicable to many metadata modifications. Hence, to reduce the space used in the ondisk log, a metadata block is divided virtually into 128-byte sized chunks and only the dirty chunks are logged.
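The chunking idea can be sketched as follows. This is a minimal illustration and not the kernel's implementation: the helper name and the macro names are hypothetical, and the bitmap is assumed to be an array of 32-bit words with one bit per 128-byte chunk.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE    128u /* bytes covered by one bitmap bit */
#define BITS_PER_WORD 32u  /* the bitmap is an array of 32-bit words */

/*
 * Hypothetical helper: mark as dirty every chunk touched by the byte range
 * [first, last] of a metadata block. The caller must supply a bitmap that
 * is large enough to cover the whole block.
 */
void mark_dirty_chunks(uint32_t *map, unsigned int first, unsigned int last)
{
    unsigned int chunk;

    assert(first <= last);
    for (chunk = first / CHUNK_SIZE; chunk <= last / CHUNK_SIZE; chunk++)
        map[chunk / BITS_PER_WORD] |= UINT32_C(1) << (chunk % BITS_PER_WORD);
}
```

With this scheme, a small record written at byte offset 130 of a 4k block dirties only the second chunk, so 128 bytes need to be logged rather than the full 4096.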
The checkpoint transaction through which the btree leaf modification is logged will contain the following:
struct xfs_buf_log_format
This structure has the following layout:
typedef struct xfs_buf_log_format {
    unsigned short blf_type;
    unsigned short blf_size;
    unsigned short blf_flags;
    unsigned short blf_len;
    int64_t blf_blkno;
    unsigned int blf_map_size;
    unsigned int blf_data_map[XFS_BLF_DATAMAP_SIZE];
} xfs_buf_log_format_t;
  - blf_type will be set to XFS_LI_BUF (i.e. 0x123c).
  - blf_size will be set to the number of regions logged for a mapping. The following are the regions logged for every physically contiguous mapping of an xfs_buf:
    - struct xfs_buf_log_format
    - one or more modified chunk sequences.
    As an example, if a mapping had two modified chunk sequences, the blf_size field is set to 3: one for the struct xfs_buf_log_format and one each for the two modified chunk sequences.
  - blf_flags holds flags describing the contents of the struct xfs_buf being logged.
  - blf_len will be set to the length of the mapping expressed in units of 512 byte blocks. For an xfs_buf with a single map, this is the size of the metadata block.
  - blf_blkno is the first disk block number of the mapping in units of 512 byte blocks.
  - blf_data_map is a bitmap where each bit refers to a 128 byte chunk of the metadata block at offset (bit_position * 128) bytes. A set bit indicates that the corresponding 128 byte chunk has been logged.
  - blf_map_size indicates the size of the bitmap at blf_data_map in units of 4 byte words.
The struct xfs_buf_log_format will be followed by an alternating sequence of xlog_op_header and a chunk sequence. The size of the chunk sequence is indicated by the oh_len field.
Metadata blocks spanning multiple non-contiguous physical blocks will have multiple struct xfs_buf_log_format structures written in the checkpoint transactions. Each of these structures will have an accompanying list of chunk sequences.
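To make the relationship between the bitmap and the region count concrete, here is a rough sketch of how a reader of the log could derive the number of chunk sequences from blf_data_map. The function names are hypothetical, and the sketch assumes that each run of consecutive set bits is logged as exactly one chunk sequence.

```c
#include <assert.h>
#include <stdint.h>

#define BITS_PER_WORD 32u /* blf_data_map is an array of 4 byte words */

/* Return non-zero if the chunk at the given bit position was logged. */
int chunk_is_logged(const uint32_t *map, unsigned int bit)
{
    return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1u;
}

/*
 * Hypothetical helper: count the runs of consecutive set bits in a bitmap
 * of map_size words. Each run stands for one chunk sequence.
 */
unsigned int count_chunk_sequences(const uint32_t *map, unsigned int map_size)
{
    unsigned int nbits = map_size * BITS_PER_WORD;
    unsigned int bit, runs = 0;
    int prev = 0;

    for (bit = 0; bit < nbits; bit++) {
        int cur = chunk_is_logged(map, bit);

        if (cur && !prev)
            runs++; /* a new sequence starts at this bit */
        prev = cur;
    }
    return runs;
}

/* Expected region count: the format structure plus one per sequence. */
unsigned int expected_region_count(const uint32_t *map, unsigned int map_size)
{
    return 1 + count_chunk_sequences(map, map_size);
}
```

A bitmap word of 0x33 has bits 0, 1, 4 and 5 set, which gives two chunk sequences and therefore a region count of 3, matching the example given for blf_size.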
Logging an Intent item
Intent items are used to split large metadata modifications across multiple high level transactions. This prevents ondisk log reservations from becoming too large. Intent items do not have any corresponding metadata on the filesystem.
A filesystem operation which causes large amounts of metadata modification is split up into multiple high level transactions as described below:
- The first high level transaction logs (among other things) an intent item expressing the fact that the filesystem operation associated with the intent item will be executed by a future high level transaction.
- The future high level transaction logs a done item and also performs the actual filesystem operation dictated by the intent item.
Freeing an extent associated with an inode is an example of a filesystem operation which is split into multiple high level transactions. The first high level transaction will disassociate the extent from the inode. It will also log an extent free intent item thereby expressing the need to add the extent to the filesystem’s free space tracking structures.
The checkpoint transaction through which the extent free intent item is logged will contain the following:
struct xfs_efi_log_format
typedef struct xfs_efi_log_format {
    uint16_t efi_type;          /* efi log item type */
    uint16_t efi_size;          /* size of this item */
    uint32_t efi_nextents;      /* # extents to free */
    uint64_t efi_id;            /* efi identifier */
    xfs_extent_t efi_extents[]; /* array of extents to free */
} xfs_efi_log_format_t;
- efi_type will be set to XFS_LI_EFI (i.e. 0x1236). This member helps the program reading the checkpoint transaction on the disk figure out the type of the structure that it is supposed to read.
- efi_size will be set to 1.
- efi_id will be set to a unique identifier which will be used by log recovery to match it to the corresponding extent free done item.
- efi_extents[] is set to the array of extents to be freed.
- efi_nextents will be set to the number of elements in the efi_extents[] array.
The second high level transaction will log an extent free done item and will actually mark the extent as free space by adding it to the free space tracking structures.
The checkpoint transaction through which the extent free done item is logged will contain the following:
struct xfs_efd_log_format
struct xfs_efd_log_format {
    uint16_t efd_type;
    uint16_t efd_size;
    uint32_t efd_nextents;
    uint64_t efd_efi_id;
    xfs_extent_t efd_extents[];
};
- efd_type will be set to XFS_LI_EFD (i.e. 0x1237).
- efd_size will be set to 1.
- efd_efi_id will be set to the unique identifier used in the extent free intent item that was logged previously.
- efd_extents[] is set to the array of extents that will be freed by the current high level transaction.
- efd_nextents will be set to the number of elements in the efd_extents[] array.
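The pairing of the two items can be illustrated with a small userspace sketch. Everything here is a simplified stand-in: struct extent approximates xfs_extent_t, struct efi and struct efd mirror only the fields described above, and efi_alloc() and efd_for_efi() are hypothetical helpers rather than kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define XFS_LI_EFI 0x1236
#define XFS_LI_EFD 0x1237

/* Simplified stand-in for xfs_extent_t: start block and length. */
struct extent {
    uint64_t ext_start;
    uint32_t ext_len;
};

struct efi {
    uint16_t efi_type;
    uint16_t efi_size;
    uint32_t efi_nextents;
    uint64_t efi_id;
    struct extent efi_extents[];
};

struct efd {
    uint16_t efd_type;
    uint16_t efd_size;
    uint32_t efd_nextents;
    uint64_t efd_efi_id;
    struct extent efd_extents[];
};

/* Hypothetical helper: build an intent item for n extents. */
struct efi *efi_alloc(uint64_t id, const struct extent *ext, uint32_t n)
{
    struct efi *efi = calloc(1, sizeof(*efi) + n * sizeof(*ext));

    if (!efi)
        return NULL;
    efi->efi_type = XFS_LI_EFI;
    efi->efi_size = 1;
    efi->efi_nextents = n;
    efi->efi_id = id;
    if (n)
        memcpy(efi->efi_extents, ext, n * sizeof(*ext));
    return efi;
}

/* Hypothetical helper: build the done item that completes an intent. */
struct efd *efd_for_efi(const struct efi *efi)
{
    size_t bytes = efi->efi_nextents * sizeof(struct extent);
    struct efd *efd = calloc(1, sizeof(*efd) + bytes);

    if (!efd)
        return NULL;
    efd->efd_type = XFS_LI_EFD;
    efd->efd_size = 1;
    efd->efd_nextents = efi->efi_nextents;
    efd->efd_efi_id = efi->efi_id; /* ties the done item to its intent */
    if (bytes)
        memcpy(efd->efd_extents, efi->efi_extents, bytes);
    return efd;
}
```

The point of the sketch is the efd_efi_id assignment: the done item carries the identifier of the intent it completes, which is what allows log recovery to pair the two.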
The following illustrates how the intent item and the corresponding done item help in restoring the consistency of a filesystem in the event of a filesystem crash.
Consider the following sequence of filesystem operations:
- The first high level transaction removes the extent from the inode’s extent map and creates an extent free intent log item.
- It commits the log items to the CIL.
- The contents of the log items in the CIL are written to a new checkpoint transaction in the ondisk log and the log items are moved to the AIL.
- The contents of these log items are written to their respective locations on the filesystem and removed from the AIL. This would mean that the extent is no longer associated with the inode.
- The second high level transaction starts its operation to return the freed extent to the filesystem’s free space tracking structures. However, the filesystem crashes before the changes are written to the ondisk log.
At this point, the metadata on the filesystem is inconsistent since the extent which has been disassociated from the inode is not tracked by the ondisk free space tracking structures.
The log recovery logic executed as part of the mount procedure notices the extent free intent item and executes the steps to add the extent to the free space tracking structure thereby restoring the consistency of the filesystem.
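The matching performed during recovery can be sketched roughly as below. This is a deliberately simplified model and not the kernel's recovery code: struct log_item and find_unfinished_intents() are hypothetical, and the sketch assumes recovery sees intent and done items in the order they were logged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XFS_LI_EFI 0x1236
#define XFS_LI_EFD 0x1237

/* Hypothetical, simplified view of an intent or done item in the log. */
struct log_item {
    uint16_t type; /* XFS_LI_EFI or XFS_LI_EFD */
    uint64_t id;   /* efi_id for an intent, efd_efi_id for a done item */
};

/*
 * Scan the recovered items in log order. Each intent is remembered in
 * pending[], and a later done item with the same id cancels it. Whatever
 * remains in pending[] is an intent whose operation never completed and
 * must be carried out by recovery. pending[] must have room for nitems
 * entries. Returns the number of unfinished intents.
 */
size_t find_unfinished_intents(const struct log_item *items, size_t nitems,
                               uint64_t *pending)
{
    size_t npending = 0, i, j;

    for (i = 0; i < nitems; i++) {
        if (items[i].type == XFS_LI_EFI) {
            pending[npending++] = items[i].id;
        } else if (items[i].type == XFS_LI_EFD) {
            for (j = 0; j < npending; j++) {
                if (pending[j] == items[i].id) {
                    /* Drop the matched intent by moving the last one here. */
                    pending[j] = pending[--npending];
                    break;
                }
            }
        }
    }
    return npending;
}
```

In the crash scenario above, the log holds the intent item but no done item, so the intent's id is left in pending[] and recovery knows that the extent still has to be added to the free space tracking structures.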
Also, the extent free intent log item and its extent free done log item counterpart are not written to the ondisk log if both end up being part of the same checkpoint transaction.
Conclusion
With the examples described in this article, we hope that the reader now has a better understanding of the XFS logging mechanism. The reader can also experiment with the xfs_logprint command to observe the contents of a dirty ondisk log. This should help reinforce the learning from the above article.