In our day-to-day usage, we often request the kernel to retrieve a specific file for us. We do this in various ways, whether through commands like ls
, touch
, or vi
, specifying the file path as an argument. In each of these cases, the kernel locates and retrieves the requested file. But how does it go about this? What processes does it undertake to retrieve the file? In this discussion, we will provide answers to these questions in relation to the ext4 filesystem.
For simplicity, we will focus on the scenario where we use ls
on a file. It’s important to note that the process the kernel follows to access the file will be similar for all other cases, such as touch
or vi
.
When we request the kernel to look up a file, the kernel goes through the following process:
For the purpose of demonstration, we have created a directory containing a large number of files, currently totaling 5086 files. In general, a directory stores the 1struct ext4_dir_entry_2` (referred to as “dirent”) of all the files within its data block. However, if it’s not possible to accommodate all the dirents within a single data block, the directory is converted into an indexed htree structure, which allows for faster retrieval of dirents.
In our case, with approximately 5,000 files, it’s not feasible to store all the 5,000 dirents in a single data block. Therefore, the directory we are examining utilizes an indexed htree structure with an indirect level of 1, indicating a depth of 1 in the htree structure.
Information regarding the htree structure for the ‘dir1/’ directory: The debugfs htree
command provides htree information for a specified directory, provided that the specified directory utilizes htree.
Below is the output of debugfs htree
:
The file we are searching for in the ‘dir1/’ directory is named ‘target-file’. When we execute ls target-file
, the kernel will initially search for the file, and if the file is located, a successful result is returned, as demonstrated in the image below.
On this page, we will examine the code path that the kernel follows to access the ‘target-file’.
As previously mentioned, the ‘dir1/’ directory utilizes an indexed htree structure to store dirents. The position of a dirent within the htree is determined by its hash value. When traversing the htree to locate a specific dirent, knowledge of its hash value is essential. The hash value of a file is computed based on its filename, and the exact hash value relies on the hash algorithm and hash seed in use.
Below is a portion of the debugfs stats -h
output, which provides superblock information. This command gives the hash algorithm and hash seed used for the filesystem. In this particular case, the filesystem is configured to use the half_md4
hash algorithm with c23d5b1e-1bde-4792-b9eb-dea82015a4a0
seed.
With this information, we can get the hash value for our ‘target-file’.
To calculate the hash value for a given filename, you can use the following command within debugfs
:
hash -h <hash algorithm> -s <seed> <filename>
We use the hash algorithm and hash seed information given by debugfs stat -h
to calculate the hash value:
Now, we have obtained the major and minor hash values, which the kernel uses to determine the on-disk location of the ‘target-file’ dirent.
Major hash: 0xde6e061c
Minor hash: 0xcd0c4d8e
When searching for any file, the kernel first checks the dcache. If it doesn’t find the requested file in the dcache, the kernel retrieves it from the filesystem.
Note: Discussing the dcache is beyond the scope of this article. For our purposes, it is sufficiet to understand that the dcache is a cache used to store the dentries of files that have been accessed thus far.
There are two functions involved:
lookup_fast()
- This function performs an RCU walk of the path and retrieves the dentry of the target file. If it fails to find the file, it returns an error, and the caller must perform a REF walk.
lookup_slow()
- This function conducts a REF walk of the path, obtaining the dentry from the filesystem. Regardless of whether the requested file exists, a dcache entry will be created.
In our case, we will be examining the scenario of lookup_slow()
. This function calls \__lookup_slow()
, which in turn invokes a filesystem-specific callback function via inode->i_op->lookup();
.
static struct dentry *__lookup_slow(const struct qstr *name, struct dentry *dir, unsigned int flags) { ... old = inode->i_op->lookup(inode, dentry, flags); d_lookup_done(dentry); if (unlikely(old)) { dput(dentry); dentry = old; } ... return dentry; }
In the case of the ext4 filesystem, the operation inode->i_op->lookup()
is implemented as ext4_lookup()
. Functions called by ext4_lookup()
eventually lead to dx_probe()
, which carries out the majority of the work involved in locating the on-disk location.
dx_probe()
initially locates the dx_root
, which serves as the root of the htree. This dx_root
is situated in the 0th logical block of the directory inode.
To determine the location of the 0th logical block, the kernel searches for it within the extent tree. This task is performed by the ext4_es_lookup_extent()
function, which searches for the requested logical block in the extent tree. In our scenario, the function ext4_es_lookup_extent()
looks up for 0th logical block as it hosts the htree root, which is the dx_root
.
int ext4_es_lookup_extent(struct inode *inode, ext4_lblk_t lblk, ext4_lblk_t *next_lblk, struct extent_status *es) { struct ext4_es_tree *tree; struct ext4_es_stats *stats; struct extent_status *es1 = NULL; struct rb_node *node; int found = 0; ... node = tree->root.rb_node; while (node) { es1 = rb_entry(node, struct extent_status, rb_node); if (lblk < es1->es_lblk) node = node->rb_left; else if (lblk > ext4_es_end(es1)) node = node->rb_right; else { found = 1; break; } } ... }
ext4_es_lookup_extent()
function.Following is the inode dump of the dir1/
:
Note: Ext4 uses little endian notation
The highlighted region represents the top level of the extent tree. This inode possesses an extent tree with a depth of 1. Below the extent header, we find the extent index (ext_extent_idx
), which points to the next-level extent index block. In our specific case, the address of the next-level extent index block is 0x6027e9.
The following section provides a dump of block 0x6027e9, which contains information about the next-level extent tree:
Here, we can find from the extent header that the depth is 0, indicating that we’ve reached the next level in the extent tree. Following the extent header, we find a series of extents (ext4_extent
), each of which provides the address of the first block within its respective extent.
In order for the kernel to read the htree root, it needs to access the zeroth logical block, which is specified by 1st of these extents. In this case, that address is 0x602092, and it corresponds to the zeroth logical block of the inode, as it represents the first block within the first extent. Thus, we have successfully located the zeroth logical block and, consequently, the htree root.
For more detailed information regarding extents and the extent tree in ext4, please refer to: Ext4 Disk Layout 2
Below is the dump of block 0x602092 (Zeroth Logical block):
As previously mentioned, the image above is a dump of the zeroth logical block, which houses the dx_root
and other top-level dx_entries
of the htree. This block will be provided as an argument to dx_probe()
. Subsequently, dx_probe()
, with help of the target-file’s hash value, traverses through the htree to locate the on-disk block that contains the dirent for the target-file.
In the htree structure, when performing a lookup for a file, there are specific hash value ranges that determine the relative block number where the desired dentry can be found (Taking the above table as a reference):
If the hash value of the file is greater than or equal to 0x082e0162 and lesser than 0x103050d2, then the desired dentry can be found in relative block number 12.
If the hash value is greater than 0 (positive) and less than 0x082e0162, then the desired dentry can be found in relative block number 1.
This lookup method is efficient compared to a linear search, as it allows for quick and direct access to the block where the desired dentry is likely to be located based on the hash value range.
static struct dx_frame * dx_probe(struct ext4_filename *fname, struct inode *dir, struct dx_hash_info *hinfo, struct dx_frame *frame_in) { ... level = 0; blocks[0] = 0; while (1) { ... p = entries + 1; q = entries + count - 1; while (p <= q) { m = p + (q - p) / 2; dxtrace(printk(KERN_CONT ".")); if (dx_get_hash(m) > hash) q = m - 1; else p = m + 1; } at = p - 1; frame->entries = entries; frame->at = at; block = dx_get_block(at); ... if (++level > indirect) return frame; blocks[level] = block; frame++; frame->bh = ext4_read_dirblock(dir, block, INDEX); ... entries = ((struct dx_node *) frame->bh->b_data)->entries; ... } ... }
dx_probe()
function, a binary search is employed to locate the hash value of the target_file
.++level > indirect
when the leaf block of the htree is reached. If this condition fails (i.e., level
is not greater than indirect
), it indicates that the leaf block hosting the dirent for the ‘target_file’ has been found.Once dx_probe()
has identified the leaf block that likely contains the dirent for the ‘target_file’, ext4_search_dir()
takes over. ext4_search_dir()
is responsible for iterating through all the dirents within the leaf block to locate the requested dirent. This ensures that the correct dirent is found within the identified block
int ext4_search_dir(struct buffer_head *bh, char *search_buf, int buf_size, struct inode *dir, struct ext4_filename *fname, unsigned int offset, struct ext4_dir_entry_2 **res_dir) { struct ext4_dir_entry_2 * de; char * dlimit; int de_len; de = (struct ext4_dir_entry_2 *)search_buf; dlimit = search_buf + buf_size; while ((char *) de < dlimit - EXT4_BASE_DIR_LEN) { /* this code is executed quadratically often */ /* do minimal checking `by hand' */ if (de->name + de->name_len <= dlimit && ext4_match(dir, fname, de)) { /* found a match - just to be sure, do * a full check */ if (ext4_check_dir_entry(dir, NULL, de, bh, search_buf, buf_size, offset)) return -1; *res_dir = de; return 1; } /* prevent looping on a bad block */ de_len = ext4_rec_len_from_disk(de->rec_len, dir->i_sb->s_blocksize); if (de_len <= 0) return -1; offset += de_len; de = (struct ext4_dir_entry_2 *) ((char *) de + de_len); } return 0; }
Now that we have a better understanding of the process, let’s apply this knowledge to the hexdump of the htree structure. As mentioned earlier, the hash values for the ‘target-file’ are Major hash: 0xde6e061c and Minor hash: 0xcd0c4d8e. We use this information, to determine the on-disk location of the dirent for the ‘target-file’.
From the image above, it’s clear that the next level block is 0x3ec4. This implies that in order to locate dirents with hash values falling between 0xdb1ebe56 and 0xdf264dc2, we should examine logical block 0x3ec4.
The debugfs
command, specifically bd -f <inode> <logical block number>
, dumps the logical block content for a given inode.
The following is the dump of logical block number 0x3ec4:
We are now at a depth of 1 within the htree, signifying that the block number provided in the image above corresponds to a leaf block. Within this leaf block, we find dirents with hash values falling between 0xde6c4378 and 0xde6e93f6. Which means, the block 0x1825 contains the dirent for the ‘target-file’.
The following is the dump of logical block 0x1825 (leaf block):
As we can observe in the highlighted region, we have successfully located the dirent we were searching for. The struct ext4_dir_entry_2
is the data structure used to represent directory entries on the disk.
struct ext4_dir_entry_2 { __le32 inode; /* Inode number */ __le16 rec_len; /* Directory entry length */ __u8 name_len; /* Name length */ __u8 file_type; char name[EXT4_NAME_LEN]; /* File name */ };
We can confirm the inode number using ls -i
, and as a result, we have successfully determined the on-disk location of the requested file’s dirent.
When a filename is given, we’ve explored the process of searching for the file. However, it’s worth considering what happens in the event of a hash collision. In such cases, it’s important to note that the hash value is primarily used to locate the block that contains dirents with a range of hash values. Initially, in the event of collison, dirents with the same hash values are accommodated within a block. However, if all the colliding dirents cannot fit within a single block, a new block is created. The dirents are then redistributed between these two blocks. This process is not different from what usually happens when a block becomes filled in a normal case. The key difference arises during the search for a dirent whose hash has multiple collisions. In this case, hash collided dirents may span multiple blocks, the search operation is perfomed on all of these blocks.
In this article, we have explored various algorithms involved in the process of searching for a file within an ext4 filesystem. These algorithms, including hash-based indexing, binary search, and linear searches, work together to efficiently locate and retrieve files from the filesystem.