Introduction

The blog post "XFS: Ondisk layout of checkpoint transactions" described how checkpoint transactions are laid out in the ondisk log. This article goes deeper by walking through examples of actual metadata being logged. We will look at how the following metadata items are logged:

  1. Inodes
  2. Btree blocks
  3. Intent items

Logging an Inode

The XFS inode can be divided into the following regions:

  1. Core
    This category includes fields like the uid, gid, hard link count, access/modified/change times, size of the file, number of extents used to hold the data of the file and number of extents used to hold the attributes of the file.
     
  2. Data fork
    This can contain one of the following:
    • Directory entries in case of small directories.
    • Extent mappings if the mappings can fit in the region.
    • The root node of a Btree. Among other things, the Btree can be used to index the mappings of file extents.
       
  3. Attribute fork
    This can contain one of the following:
    • Attributes of the file if they can fit here.
    • Extent mappings if the mappings can fit in the region.
    • The root node of a Btree whose leaves contain the mappings of extents.

    The extents whose mappings are mentioned above contain a Btree whose leaves hold the attributes of the inode.

Consider the case of writing a 4k block at offset zero of an empty file. This will cause some of the core fields of the inode, such as the size, various timestamps and the data extent count, to be modified. It will also add the new extent mapping to the data fork. The checkpoint transaction through which the inode modification is logged will contain the following:

[Figure: Logging Inode]

  1. struct xfs_inode_log_format

    This structure has the following layout:

    struct xfs_inode_log_format {
            uint16_t                ilf_type;
            uint16_t                ilf_size;
            uint32_t                ilf_fields;
            uint16_t                ilf_asize;
            uint16_t                ilf_dsize;
            uint32_t                ilf_pad;
            uint64_t                ilf_ino;
            union {
                    uint32_t        ilfu_rdev;
                    uint8_t         __pad[16];
            } ilf_u;
            int64_t                 ilf_blkno;
            int32_t                 ilf_len;
            int32_t                 ilf_boffset;
    };
    • ilf_type will be set to XFS_LI_INODE (i.e. 0x123b). This member helps the program reading the checkpoint transaction on the disk figure out the type of the structure that it is supposed to read.

    • ilf_size will be set to the number of regions logged for the inode, counting the struct xfs_inode_log_format itself. In our example, this field will be set to 3: one for struct xfs_inode_log_format and one each for the core and data fork regions of the inode.

    • ilf_fields is a bit field that indicates the regions of the inode being logged in a more fine-grained manner. The modifications made in the example will cause the following bits to be set:

      1. XFS_ILOG_CORE: the core inode region is to be logged.
      2. XFS_ILOG_DEXT: the single extent mapping stored in the inode's data fork is to be logged.

      Each of the regions indicated by the bit field will be logged following the struct xfs_inode_log_format with their own struct xlog_op_header.

    • ilf_dsize is set to the size, in bytes, of the single extent mapping in the data fork.

    • ilf_ino is set to the number of the inode being logged.

  2. struct xfs_log_dinode

    This structure has a copy of the core region of the inode.

  3. Logging extent mapping in the data fork.

    The last logged region in the example contains a copy of the single extent map owned by the inode.
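
As a rough illustration of how ilf_fields relates to the regions that follow struct xfs_inode_log_format, the C sketch below counts the inode regions implied by a given bit field. The flag values are assumed to match the XFS_ILOG_* definitions in the kernel headers, and count_inode_regions() is a hypothetical helper written for this article, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

/* Flag values assumed to match XFS_ILOG_* in the kernel headers. */
#define XFS_ILOG_CORE   0x001u  /* inode core */
#define XFS_ILOG_DDATA  0x002u  /* data fork: inline data */
#define XFS_ILOG_DEXT   0x004u  /* data fork: extent mappings */
#define XFS_ILOG_DBROOT 0x008u  /* data fork: btree root */
#define XFS_ILOG_ADATA  0x040u  /* attribute fork: inline attributes */
#define XFS_ILOG_AEXT   0x080u  /* attribute fork: extent mappings */
#define XFS_ILOG_ABROOT 0x100u  /* attribute fork: btree root */

/*
 * Hypothetical helper: number of inode regions (core, data fork,
 * attribute fork) that follow the format structure in the log.
 */
int count_inode_regions(uint32_t fields)
{
        int regions = 0;

        if (fields & XFS_ILOG_CORE)
                regions++;
        if (fields & (XFS_ILOG_DDATA | XFS_ILOG_DEXT | XFS_ILOG_DBROOT))
                regions++;
        if (fields & (XFS_ILOG_ADATA | XFS_ILOG_AEXT | XFS_ILOG_ABROOT))
                regions++;
        return regions;
}

/* The example from the text: the core plus one data fork extent mapping. */
void check_article_example(void)
{
        assert(count_inode_regions(XFS_ILOG_CORE | XFS_ILOG_DEXT) == 2);
}
```

For the 4k write example, the bit field XFS_ILOG_CORE | XFS_ILOG_DEXT yields two inode regions after the format structure: the core and the data fork.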

Logging a Btree block

Most of the ondisk metadata structures are tracked in memory using a struct xfs_buf. A disk block holding a node or a leaf of a btree is an example of a metadata structure represented in memory by a struct xfs_buf.

The following is the layout of struct xfs_buf with only members of interest included:

struct xfs_buf {
        int                     b_length;
        void                    *b_addr;
        struct xfs_buf_map      *b_maps;
        int                     b_map_count;
        ...
};
  • b_length: size of the metadata in 512 byte units.
  • b_addr: address in memory holding the contents of the metadata block read from the disk.
  • b_maps: an array which maps metadata offset to disk offset. Some metadata blocks can span more than one filesystem block. For example, we could have a filesystem with 4k as the block size and 64k as the directory block size. In such a case, the filesystem blocks making up a directory block may not be contiguous on disk. This array helps in mapping a directory offset to the underlying filesystem blocks.
  • b_map_count: number of elements in the b_maps array.

Let us consider the case of adding a new extent mapping to an inode which has its extent mappings stored in a btree. The new mapping record will be added to one of the leaves of the btree. So only a small portion of the btree leaf is actually being modified. This is a general pattern applicable to many metadata modifications. Hence, to reduce the space used in the ondisk log, a metadata block is divided virtually into 128-byte sized chunks and only the dirty chunks are logged.
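
The chunking scheme can be illustrated with a small C sketch. It takes the first and last modified byte of a metadata block and sets one bit per 128-byte chunk touched by that range. The helper name mark_dirty() and the constants are hypothetical and chosen for this illustration; this is not the kernel's implementation.

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE      128u    /* bytes covered by one bitmap bit, per the text */
#define BITS_PER_WORD   32u     /* the bitmap is an array of 32-bit words */

/*
 * Hypothetical helper: mark every 128-byte chunk touched by the byte
 * range [first_byte, last_byte] (inclusive) as dirty in the bitmap.
 */
void mark_dirty(uint32_t *map, unsigned int map_words,
                unsigned int first_byte, unsigned int last_byte)
{
        unsigned int first_bit = first_byte / CHUNK_SIZE;
        unsigned int last_bit = last_byte / CHUNK_SIZE;
        unsigned int bit;

        assert(first_byte <= last_byte);
        assert(last_bit < map_words * BITS_PER_WORD);

        for (bit = first_bit; bit <= last_bit; bit++)
                map[bit / BITS_PER_WORD] |= 1u << (bit % BITS_PER_WORD);
}
```

For a 4k btree block, a 16-byte record written at byte offset 200 touches only chunk 1, so a single 128-byte chunk needs to be logged instead of the whole block.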

The checkpoint transaction through which the btree leaf modification is logged will contain the following:

  1. struct xfs_buf_log_format

    This structure has the following layout:

    typedef struct xfs_buf_log_format {
            unsigned short  blf_type;
            unsigned short  blf_size;
            unsigned short  blf_flags;
            unsigned short  blf_len;
            int64_t         blf_blkno;
            unsigned int    blf_map_size;
            unsigned int    blf_data_map[XFS_BLF_DATAMAP_SIZE];
    } xfs_buf_log_format_t;
    • blf_type will be set to XFS_LI_BUF (i.e. 0x123c).

    • blf_size will be set to the number of regions logged for a mapping. The following are the regions logged for every physically contiguous mapping of an xfs_buf:

      1. struct xfs_buf_log_format
      2. one or more modified chunk sequences.

      As an example, if a mapping had two modified chunk sequences, the blf_size field is set to 3; one for struct xfs_buf_log_format and one each for the two modified chunk sequences.

    • blf_flags holds flags that describe the contents of the struct xfs_buf being logged.

    • blf_len will be set to the length of the mapping expressed in units of 512 byte blocks. For an xfs_buf with a single map, this is the size of the whole metadata block.

    • blf_blkno is the first disk block number of the mapping in units of 512 byte blocks.

    • blf_data_map is a bitmap where each bit refers to a 128 byte chunk of the metadata block at offset (bit_position * 128) bytes. A set bit indicates that the corresponding 128 byte chunk has been logged.

    • blf_map_size indicates the size of bitmap at blf_data_map in units of 4 byte words.

The struct xfs_buf_log_format will be followed by an alternating sequence of xlog_op_header and a chunk sequence. The size of the chunk sequence is indicated by the oh_len field.
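
To make the relationship between the bitmap and the logged regions concrete, here is a minimal C sketch. It assumes a chunk sequence is simply a maximal run of consecutive set bits in blf_data_map; the kernel may split runs further for reasons not covered here. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count maximal runs of consecutive set bits, i.e. chunk sequences. */
unsigned int count_chunk_sequences(const uint32_t *map, unsigned int map_words)
{
        unsigned int runs = 0;
        unsigned int bit;
        int prev = 0;

        assert(map != NULL);

        for (bit = 0; bit < map_words * 32; bit++) {
                int cur = (map[bit / 32] >> (bit % 32)) & 1u;

                if (cur && !prev)       /* a new run starts at this bit */
                        runs++;
                prev = cur;
        }
        return runs;
}

/* Number of 32-bit words needed to hold one bit per 128-byte chunk. */
unsigned int bitmap_words(unsigned int block_bytes)
{
        unsigned int chunks = (block_bytes + 127) / 128;

        return (chunks + 31) / 32;
}
```

With bits 1, 2 and 5 set (a bitmap word of 0x26) there are two chunk sequences, so by the rule given for blf_size the field would be 1 + 2 = 3. A 4k metadata block has 32 chunks and therefore needs a single bitmap word, while a 64k block needs 16.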

Metadata blocks spanning multiple non-contiguous physical blocks will have multiple struct xfs_buf_log_format structures written in the checkpoint transaction. Each of these structures will have an accompanying list of chunk sequences.

Logging an Intent item

Intent items are used to spread large metadata modifications across multiple high level transactions. This prevents ondisk log reservations from becoming too large. Intent items do not have any corresponding metadata on the filesystem.

A filesystem operation which causes large amounts of metadata modification is split up into multiple high level transactions as described below:

  1. The first high level transaction logs (among other things) an intent item expressing the fact that the filesystem operation associated with the intent item will be executed by a future high level transaction.
  2. The future high level transaction logs a done item and also performs the actual filesystem operation dictated by the intent item.

Freeing an extent associated with an inode is an example of a filesystem operation which is split into multiple high level transactions. The first high level transaction will disassociate the extent from the inode. It will also log an extent free intent item thereby expressing the need to add the extent to the filesystem’s free space tracking structures.

The checkpoint transaction through which the extent free intent item is logged will contain the following:

  1. struct xfs_efi_log_format
    typedef struct xfs_efi_log_format {
            uint16_t        efi_type;   /* efi log item type */
            uint16_t        efi_size;   /* size of this item */
            uint32_t        efi_nextents;   /* # extents to free */
            uint64_t        efi_id;     /* efi identifier */
            xfs_extent_t    efi_extents[];  /* array of extents to free */
    } xfs_efi_log_format_t;
    • efi_type will be set to XFS_LI_EFI (i.e. 0x1236). This member helps the program reading the checkpoint transaction on the disk figure out the type of the structure that it is supposed to read.
    • efi_size will be set to 1.
    • efi_id will be set to a unique identifier which will be used by log recovery to match the intent item to the corresponding extent free done item.
    • efi_extents[] will be set to the array of extents to be freed.
    • efi_nextents will be set to the number of elements in the efi_extents[] array.
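
The field assignments above can be sketched as a small C constructor. The definition of xfs_extent_t (a 64-bit start block and a 32-bit length) is an assumption made for this illustration, and efi_alloc() is a hypothetical helper; the kernel builds this structure differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define XFS_LI_EFI      0x1236

/* Assumed layout of an extent: start block and length in blocks. */
typedef struct xfs_extent {
        uint64_t        ext_start;
        uint32_t        ext_len;
} xfs_extent_t;

typedef struct xfs_efi_log_format {
        uint16_t        efi_type;       /* efi log item type */
        uint16_t        efi_size;       /* size of this item */
        uint32_t        efi_nextents;   /* # extents to free */
        uint64_t        efi_id;         /* efi identifier */
        xfs_extent_t    efi_extents[];  /* array of extents to free */
} xfs_efi_log_format_t;

/* Hypothetical helper: build an EFI for the given extents. Caller frees. */
xfs_efi_log_format_t *efi_alloc(uint64_t id, const xfs_extent_t *extents,
                                uint32_t nextents)
{
        size_t bytes = sizeof(xfs_efi_log_format_t) +
                       (size_t)nextents * sizeof(xfs_extent_t);
        xfs_efi_log_format_t *efi;
        uint32_t i;

        assert(extents != NULL || nextents == 0);

        efi = calloc(1, bytes);
        if (!efi)
                return NULL;

        efi->efi_type = XFS_LI_EFI;
        efi->efi_size = 1;              /* only the format structure is logged */
        efi->efi_nextents = nextents;
        efi->efi_id = id;
        for (i = 0; i < nextents; i++)
                efi->efi_extents[i] = extents[i];
        return efi;
}
```

Because efi_extents[] is a flexible array member, the allocation size depends on efi_nextents, which is why a reader of the log needs that count to know how much to consume.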

The second high level transaction will log an extent free done item and will actually mark the extent as free space by adding it to the free space tracking structures.

The checkpoint transaction through which the extent free done item is logged will contain the following:

  1. struct xfs_efd_log_format
    struct xfs_efd_log_format {
            uint16_t        efd_type;
            uint16_t        efd_size;
            uint32_t        efd_nextents;
            uint64_t        efd_efi_id;
            xfs_extent_t    efd_extents[];
    };
    • efd_type will be set to XFS_LI_EFD (i.e. 0x1237).
    • efd_size will be set to 1.
    • efd_efi_id will be set to the unique identifier used in the extent free intent item that was logged previously.
    • efd_extents[] will be set to the array of extents that will be freed by the current high level transaction.
    • efd_nextents will be set to the number of elements in the efd_extents[] array.
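
Each of the format structures covered in this article begins with a 16-bit type field, which is what lets a reader of the log decide how to interpret the bytes that follow. The C sketch below shows one way such a dispatch could look. It assumes the type field is stored in the byte order of the machine reading the log, and log_item_name() is a hypothetical function, not part of any XFS tool.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Type values as given in the text. */
#define XFS_LI_EFI      0x1236
#define XFS_LI_EFD      0x1237
#define XFS_LI_INODE    0x123b
#define XFS_LI_BUF      0x123c

/* Hypothetical helper: name the structure found at the start of a region. */
const char *log_item_name(const void *region, size_t len)
{
        uint16_t type;

        assert(region != NULL);

        if (len < sizeof(type))
                return "truncated";

        /* memcpy avoids alignment problems; host byte order is assumed. */
        memcpy(&type, region, sizeof(type));

        switch (type) {
        case XFS_LI_INODE:
                return "struct xfs_inode_log_format";
        case XFS_LI_BUF:
                return "struct xfs_buf_log_format";
        case XFS_LI_EFI:
                return "struct xfs_efi_log_format";
        case XFS_LI_EFD:
                return "struct xfs_efd_log_format";
        default:
                return "unknown";
        }
}
```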

The following illustrates how the intent item and the corresponding done item help in restoring the consistency of a filesystem after a crash.

Consider the following sequence of filesystem operations:

  1. The first high level transaction removes the extent from the inode’s extent map and creates an extent free intent log item.
  2. It commits the log items to the CIL.
  3. The contents of the log items in the CIL are written to the ondisk log as a new checkpoint transaction, and the log items are moved to the AIL.
  4. The contents of these log items are written to their respective locations on the filesystem and removed from the AIL. This means that the extent is no longer associated with the inode.
  5. The second high level transaction starts its operation to return the freed extent to the filesystem’s free space tracking structures. However, the filesystem crashes before the changes are written to the ondisk log.

At this point, the metadata on the filesystem is inconsistent since the extent which has been disassociated from the inode is not tracked by the ondisk free space tracking structures.

The log recovery logic executed as part of the mount procedure notices the extent free intent item and executes the steps to add the extent to the free space tracking structure thereby restoring the consistency of the filesystem.
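
The matching performed during recovery can be sketched as follows, assuming recovery has already collected the identifiers of all extent free intent items and extent free done items found in the log. The function name and the flat arrays are hypothetical simplifications; the kernel tracks these items with its own data structures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy into pending[] every intent id that has no
 * done item with the same id, and return how many were found. pending[]
 * must have room for nefi entries.
 */
size_t find_pending_intents(const uint64_t *efi_ids, size_t nefi,
                            const uint64_t *efd_ids, size_t nefd,
                            uint64_t *pending)
{
        size_t npending = 0;
        size_t i, j;

        assert(pending != NULL);

        for (i = 0; i < nefi; i++) {
                int done = 0;

                for (j = 0; j < nefd; j++) {
                        if (efd_ids[j] == efi_ids[i]) {
                                done = 1;       /* matching done item found */
                                break;
                        }
                }
                if (!done)
                        pending[npending++] = efi_ids[i];
        }
        return npending;
}
```

In the crash scenario above, the log contains the intent item but no done item, so the intent's identifier ends up in pending[] and the extent free operation is carried out during recovery.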

Also, the extent free intent log item and its extent free done log item counterpart are not written to the ondisk log if both end up being part of the same checkpoint transaction.

Conclusion

With the examples described in this article, we hope that the reader now has a better understanding of the XFS logging mechanism. The reader can also experiment with the xfs_logprint command to observe the contents of a dirty ondisk log, which should help reinforce what this article covers.