Welcome to the second blog post of the Understanding Ext4 Disk Layout series. In the previous blog, we explored the concepts of the Superblock, GDT, Bitmaps, and Inode Table. In this blog, our focus will shift toward the on-disk structures of the extent tree, hash tree, and their corresponding data structures.
The i_block
field, present in the struct ext4_inode
, occupies 60 bytes (EXT4_N_BLOCKS == 15
). This field serves as a container for block-related information associated with the inode.
struct ext4_inode { __le16 i_mode; /* File mode */ __le16 i_uid; /* Low 16 bits of Owner Uid */ __le32 i_size_lo; /* Size in bytes */ __le32 i_atime; /* Access time */ __le32 i_ctime; /* Inode Change time */ __le32 i_mtime; /* Modification time */ __le32 i_dtime; /* Deletion Time */ __le16 i_gid; /* Low 16 bits of Group Id */ __le16 i_links_count; /* Links count */ __le32 i_blocks_lo; /* Blocks count */ __le32 i_flags; /* File flags */ union { struct { __le32 l_i_version; } linux1; struct { __u32 h_i_translator; } hurd1; struct { __u32 m_i_reserved1; } masix1; } osd1; /* OS dependent 1 */ __le32 i_block[EXT4_N_BLOCKS];/* Pointers to blocks */ __le32 i_generation; /* File version (for NFS) */ __le32 i_file_acl_lo; /* File ACL */ __le32 i_size_high; __le32 i_obso_faddr; /* Obsoleted fragment address */ union { struct { __le16 l_i_blocks_high; /* were l_i_reserved1 */ __le16 l_i_file_acl_high; __le16 l_i_uid_high; /* these 2 fields */ __le16 l_i_gid_high; /* were reserved2[0] */ __le16 l_i_checksum_lo;/* crc32c(uuid+inum+inode) LE */ __le16 l_i_reserved; } linux2; struct { __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */ __u16 h_i_mode_high; __u16 h_i_uid_high; __u16 h_i_gid_high; __u32 h_i_author; } hurd2; struct { __le16 h_i_reserved1; /* Obsoleted fragment number/size which are removed in ext4 */ __le16 m_i_file_acl_high; __u32 m_i_reserved2[2]; } masix2; } osd2; /* OS dependent 2 */ __le16 i_extra_isize; __le16 i_checksum_hi; /* crc32c(uuid+inum+inode) BE */ __le32 i_ctime_extra; /* extra Change time (nsec << 2 | epoch) */ __le32 i_mtime_extra; /* extra Modification time(nsec << 2 | epoch) */ __le32 i_atime_extra; /* extra Access time (nsec << 2 | epoch) */ __le32 i_crtime; /* File Creation time */ __le32 i_crtime_extra; /* extra FileCreationtime (nsec << 2 | epoch) */ __le32 i_version_hi; /* high 32 bits for 64-bit version */ __le32 i_projid; /* Project ID */ };
An extent refers to a group of physically contiguous blocks. This grouping reduces the need for direct block mapping between logical and physical blocks, which, in turn, reduces the amount of metadata to be maintained. That results in improved performance and efficiency.
An extent tree is a data structure that maintains the extents associated with an inode. The extent tree helps in faster traversal and retrieval of data.
The i_block
field stores structures such as struct ext4_extent_header
, struct ext4_extent_idx
, and struct ext4_extent
, which are integral components of the extent tree. Each of these structures have a size of 12 bytes.
As previously stated, the i_block
field has a size of 60 bytes. Within these 60 bytes, the initial 12 bytes are allocated for the extent header. The remaining 48 bytes can be utilized by either ext4_extent_idx
or ext4_extent
, depending on the depth of the extent tree. Consequently, the i_block
field can store an extent header along with a maximum of 4 extents or 4 extent indices, depending on the configuration of the extent tree.
struct ext4_extent_header { __le16 eh_magic; /* probably will support different formats */ __le16 eh_entries; /* number of valid entries */ __le16 eh_max; /* capacity of store in entries */ __le16 eh_depth; /* has tree real underlying blocks? */ __le32 eh_generation; /* generation of the tree */ };
struct ext4_extent_idx { __le32 ei_block; /* index covers logical blocks from 'block' */ __le32 ei_leaf_lo; /* pointer to the physical block of the next * * level. leaf or next index could be there */ __le16 ei_leaf_hi; /* high 16 bits of physical block */ __u16 ei_unused; };
struct ext4_extent { __le32 ee_block; /* first logical block extent covers */ __le16 ee_len; /* number of blocks covered by extent */ __le16 ee_start_hi; /* high 16 bits of physical block */ __le32 ee_start_lo; /* low 32 bits of physical block */ }
Now the question comes, what is the purpose of these structures? * ext4_extent_header
- The header provides vital information about the extent tree. It holds the magic number, depth of the extent tree, and the number of valid entries (these entries can be either ext4_extent_idx
or ext4_extent
), etc. When the depth is zero, then the entries are ext4_extent
s otherwise ext4_extent_idx
s. The header occupies the first 12 bytes in all blocks of the extent tree. * ext4_extent_idx
- This structure serves as an index, it holds references to further levels of the tree. * ext4_extent
- Represents an extent. These structures are exclusively present in the leaf blocks of the extent tree and provides details regarding the initial block and the length of the extent.
Let’s take an example by analyzing the hexdump of an inode.
Note: Ext4 uses little endian notation. That means the least significant byte is stored first and the most significant byte is stored last. For example, if 0x1234
is written on disk, then its value in memory is 0x3412
.
Above is the hexdump of an inode, with the highlighted region representing the extent header. From the extent header, it is evident that the extent tree has a depth of 1. Consequently, the extent header is followed by struct ext4_extent_idx
instances.
The image below showcases the highlighted struct ext4_extent_idx
. The extent index is pointing to the block 0x85ca
.
Let’s analyze the dump of that particular block for further insights.
Upon inspection, we observe that the first 12 bytes correspond to the extent header. Notably, the maximum number of extents that can follow the extent header is 340. This calculation is derived by considering that each block has a size of 4k bytes, with the first 12 bytes allocated to the extent header and the final 4 bytes utilized by struct ext4_extent_tail
. Consequently, this leaves 4080 bytes available. Since each extent occupies 12 bytes, dividing 4080 by 12 results in 340 extents. Additionally, it is worth noting that the depth of this block is zero, indicating that we reached leaf block of the extent tree. As previously mentioned, in a leaf block, the entries following the extent header are ext4_extent
structures.
The number of valid extents in the given context is 182, which means 182 extents follow the extent header. Each of these extents is associated with a specific range of data blocks and contains a pointer that indicates the location of the first data block within its respective extent.
Here is the hexdump of the same block, but this time ext4_extent
are highlighted:
Upon analyzing the provided hexdump, we can observe the following information from the first three struct ext4_extent
entries:
The first extent has a logical block value of 0. This extent encompasses 140 blocks, with the first block address pointing to the 3,576,372nd block. Therefore, the 140 consecutive blocks starting from the 3,576,372nd block belong to this extent, and they represent actual data blocks.
In the second extent, the logical block value is 140, indicating that the first 140 blocks are part of the previous extent. This extent comprises 160 blocks, and the first block of this extent is located at the 36,139,252nd block.
The third extent has a logical block value of 300, which is the sum of the blocks present in the previous two extents (140 + 160). This extent encompasses 158 blocks, and its first block is situated at the 3700229th block.
Similarly, the remaining 179 ext4_extent
entries provide information about the extents associated with the inode, allowing for the determination of the corresponding data blocks and their respective ranges.
Debugfs also provides extent tree information of an inode. ex
command serves this purpose. Here are the first few lines of the output obtained from the debugfs ex
command:
Looking at the 2nd, 3rd, and 4th rows of the provided output, we can observe that the Logical, Physical, and Length columns align with the logical block, first physical block, and length of the corresponding extents discussed earlier. By cross-referencing these values, we can confirm the accurate representation of the extent tree information for the given inode.
We have seen the case of an extent tree with a depth of 1, there is a single intermediate block between the i_block
field and the actual data blocks. This means that there is one level of indirection between the inode and the data blocks.
To further illustrate, if the depth of the extent tree were 2, then there would be two intermediate blocks between the i_block
field and the actual data blocks.
However, in the special case of a zero depth extent tree, the i_block
field directly holds the ext4_extent
structures. There are no ext4_extent_idx
structures in this case, as there is no need for intermediate blocks. The ext4_extent
structures within the i_block
directly point to the extents that represent the data blocks. In this case of zero depth, the i_block
has a maximum capacity of holding four ext4_extent
s and an ext4_extent_header
whose size adds up to 60 bytes. Since the i_block
itself is 60 bytes, when a 5th extent is allocated, the extent tree’s depth is increased to accommodate the new extent.
This mechanism allows for efficient navigation and retrieval of data blocks within the Ext4 file system, providing different levels of indirection based on the depth of the extent tree.
A directory stores the dirents of all its files in its data blocks. In the context of a directory that contains a large number of files, performing a linear search for a specific directory entry within its data blocks can become time-consuming and inefficient. To address this issue, the Ext4 file system implements a hash tree.
The hash tree provides a more efficient way to locate, insert or delete a directory entry within a directory. The hierarchical structure of hash trees allows for faster lookup than linear searchs. A specific hash algorithm is used to hash the directory entries, and the resulting hash values are used to navigate through the hash tree.
Typically, a directory in the Ext4 file system initially follows an unindexed approach for storing its directory entries. In this unindexed mode, directory entries are stored sequentially within a single data block of the directory. This approach is efficient when the number of directory entries are contained within a single data block.
However, once the number of directory entries exceeds the capacity of a single data block, the directory can be converted to an indexed tree structure, i.e. a hash tree. This conversion occurs when the hash tree feature is enabled.
The hash tree structure includes root node, intermediate nodes and leaf nodes.
The root node holds essential metadata about the structure and characteristics of the tree. This metadata includes information like the depth of the tree, hash algorithm used, etc. The intermediate nodes act as branches in the tree, guiding the search process based on hashed values derived from the directory entries. These intermediate nodes direct the search algorithm to the appropriate subtree that potentially contains the desired directory entry. On the other hand, the leaf nodes hold the actual directory entries themselves.
Below is the dirent structure which is stored by the directory in its data block:
struct ext4_dir_entry_2 { __le32 inode; /* Inode number */ __le16 rec_len; /* Directory entry length */ __u8 name_len; /* Name length */ __u8 file_type; char name[EXT4_NAME_LEN]; /* File name */ };
Note: EXT4_NAME_LEN
is 255, so the maximum size for a file name in EXT4 is 255.
Let’s examine a directory with an inode number of 131007, which contains a total of 2500 files.
Below is the inode dump of the directory:
If a directory is unindexed, its data blocks will store dirents (directory entries). However, if a directory is indexed (i.e is using hash tree), the first few data blocks will store information related to the hash tree structure, while the remaining data blocks will store the dirents. In the case of a hash tree, the 1st data block serves as the host for the hash tree root, represented by struct dx_root
, followed by struct dx_entry
instances. These dx_entry
structures provide the addresses of the next level blocks within the hash tree hierarchy.
struct dx_root { struct fake_dirent dot; char dot_name[4]; struct fake_dirent dotdot; char dotdot_name[4]; struct dx_root_info { __le32 reserved_zero; u8 hash_version; u8 info_length; /* 8 */ u8 indirect_levels; u8 unused_flags; } info; struct dx_entry entries[0]; };
struct fake_dirent { __le32 inode; __le16 rec_len; u8 name_len; u8 file_type; };
The struct fake_dirent
serves as the directory entry for the dot (.) and dotdot (..) directories within the dx_root
structure. It is similar to the struct ext4_dir_entry_2
, but it lacks a name field since the name field is incorporated within the dx_root
structure.
struct dx_entry { __le32 hash; __le32 block; };
From the above image, it is evident that the depth of the extent tree is zero. Furthermore, there is only a single extent present in the extent tree. The first block of this extent is the 532,576th block, and it is followed by 29 consecutive blocks, collectively representing the extent associated with the inode.
Here is the dump of the initial few bytes of the first block (Block number: 532,576):
The first 36 bytes are used by dx_root
, and following dx_root
we have 28 dx_entry
s. We can observe that the number of valid entries in dx_root
is 29, considering the 28 dx_entry
s and the dx_root
itself. The first three dx_entry
s are highlighted in the provided hexdump. Additionally, the depth of the tree is indicated as 0, indicating that it is a leaf level in the hash tree structure.
In a dx_entry
, the first 4 bytes represent the hash value, and the next 4 bytes represent the logical block number. This logical block corresponds to a data block that stores directory entries (dirents) whose hash values fall between the hash value of the current dx_entry
and the hash value of the next dx_entry
in the hash tree structure.
Below is a table of hash values and logical blocks of the first 3 dx_entry
s:
Let’s consider an example where we are trying to add a dirent with a hash value of 0x092e0111. By comparing the hash value with the hash values in the table above, we find that it is greater than 0x082e0162 and less than 0x103050d2. Therefore, the directory entry will be inserted into the logical block 12, which corresponds to the data block that stores dirents whose hash values are between 0x082e0162 and 0x103050d2.
Similarly, if we are searching for a dirent with a hash value greater than 0 and less than 0x082e0162, we need to search block 1 to find the target directory entry. On the other hand, if the target dirent hash value is greater than 0x103050d2 and less than 0x17358dcc, we need to search block 9 to find the desired dirent. This process allows us to locate the directory entry based on its hash value within the hash tree structure.
In the case where the hash tree depth is 2, there will be two levels of intermediate blocks between the leaf block (which stores directory entries) and the dx_root
block. These intermediate blocks exclusively store dx_entry
blocks.
During the search for a file in the hash tree, we only examine a single block at each level until we reach the leaf block. At the leaf block, we perform a linear search to find the desired dirent. This hierarchical search process significantly improves efficiency compared to an unindexed directory, where a linear search is conducted across all blocks of the directory.
Different Hash algorithms:
Below is the hexdump of one of the leaf blocks in the hash tree:
a1
, b1
, a2
, b2
etc are the files in the directory, so their dirents are stored in the data block of the directory.
Let’s take a closer look at the dirent of file a1 in the directory. The structure of the dirent is as follows:
The rec_len
of the last dirent will be such that it spans till the end of the block.
This structure allows us to retrieve important details about the file, such as its inode number, length, type, and name, from the dirent stored in the directory’s data block.
The table below provides a list of file types:
As the number of dirents increases in a directory, the size and depth of the tree used to store them will also increase.
Initially, a directory is in an unindexed state, where dirents are stored in a single block. However, if the number of dirents exceeds the capacity of a single block, the directory is converted to an indexed tree structure. During this conversion process, two new blocks are created. One of these blocks becomes the root block and is initialized with a dx_root
structure. The dirents are then sorted in ascending order, and the first half of the dirents remain in the original block while the remaining dirents are moved to the other new block. Two dx_entry
structures are added to the dx_root
to maintain the directory’s indexed state.
Suppose the directory was initially unindexed and contained 40 dirents, with the data block of the directory being completely full. When the 41st dirent is added, the directory undergoes a conversion to an indexed structure.
During the conversion, two new blocks are created. One of these blocks becomes the root block, and the 40 dirents are sorted. The first 20 dirents remain in the original block, while the other 20 dirents are moved to the new block. The dx_entry
of the new block will have a hash value equal to the hash of the first dirent in that block. On the other hand, the dx_entry
of the initial block will have a hash value of zero.
With this reorganization, the addition of the 41st dirent can now be accommodated in the directory’s indexed structure.
After the conversion of the directory to an indexed structure, if any leaf block becomes filled with dirents, a new block will be created to accommodate additional dirents. The dirents from the filled leaf block will be redistributed equally between the filled block and the new block.
This redistribution ensures that the dirents are evenly distributed among the leaf blocks, preventing any single block from becoming overloaded with dirents. By creating a new block and redistributing the dirents, the indexed structure of the directory can efficiently store and manage a larger number of dirents while maintaining balanced block utilization.
Example: The image below depicts the initial configuration of a directory’s hash tree. At this point, the hash tree has a depth of 0 and consists of three leaf blocks (1, 2, and 3).
Now, let’s consider a scenario where 20 new files are created within the same directory. After adding these new dirents, the hash tree configuration is updated as follows:
The depth of the tree remains unchanged, indicating that no additional levels were required. However, a new leaf block (block 4) is created and inserted between block 1 and block 2 in the hash tree. This is necessary because some of the newly added entries filled up block 1 completely. To accommodate the additional dirents, a new block (block 4) is created, and the dirents are evenly redistributed between block 1 and block 4 to maintain a balanced hash tree structure.
In the event of a hash collision, it is important to note that the hash value is primarily used to locate the block that holds dirents within a specific hash value range. If a collision occurs and multiple dirents share the same hash value, they will be stored together in the same block. Once the target block is located, a linear search is performed within that block to find the specific dirent.
When adding a new dirent that experiences a hash collision with a dirent residing in a completely filled block, a new block will be created. The dirents will be redistributed equally between the two blocks based on their hash values. However, it is worth noting that dirents with the same hash values will still remain together in the same block. This ensures that despite the redistribution, dirents with colliding hash values are kept in the same block, allowing for efficient retrieval during linear search operations.
In this blog, we explored two important concepts: extent tree and hash tree, along with the supporting structures associated with them. We learned that filesystems often allocate non-contiguous blocks for storing data, and the extent tree provides a mechanism to link these scattered blocks together, enabling efficient data retrieval and management.
Additionally, we delved into the hash tree, which serves as a powerful tool for directories with a significant number of files. The hash tree scheme facilitates quick and efficient search, addition, and deletion of directory entries (dirents). This feature becomes particularly valuable in directories that house a large volume of files, as it minimizes the need for linear searches and ensures swift access to the desired dirents.
In summary, the extent tree and hash tree play vital roles in optimizing filesystem performance and organization. They enable efficient allocation of non-contiguous blocks and streamline operations within directories, leading to improved overall system efficiency and effectiveness.