Mounting and unmounting filesystems are routine tasks for a Linux user. A filesystem, once created, cannot be accessed immediately; it has to be mounted before users can access its contents. Later, when the filesystem is no longer needed, it can be unmounted. Once unmounted, its contents cannot be accessed even if the device containing the filesystem remains attached to the system.
The mount operation involves creating relevant in-memory data structures, initializing the superblock, and attaching the root dentry (directory entry) of the filesystem to the mount point, whereas the umount operation undoes what mount does: dereferencing, freeing, and other cleanup tasks.
In this blog, we will explore what mount (sys_mount) and umount (sys_umount) operations do.
Understanding Mount Operation
Mounting involves attaching a filesystem to a mount point; once the filesystem is mounted, the mount point acts as a gateway to the filesystem's contents. The mount operation attaches the filesystem's root dentry to the mount point's dentry, so that when the mount point is accessed, the contents of the filesystem are accessed.
The mount operation involves:
- Creating and initializing data structures such as struct mount, struct super_block, and their fields.
- Creating sysfs entries, allocating per-CPU counters, filesystem-specific caches, etc.
- Most importantly, setting the DCACHE_MOUNTED flag on the mount point's dentry. This flag allows the Virtual File System (VFS) layer to differentiate between a regular directory and a mount point. When looking up a path, if the DCACHE_MOUNTED flag is set, the VFS identifies that it is a mount point and follows the mounted filesystem.
Let’s see with an example how mounting affects path lookup (For more info, refer: Path Lookup). Let’s mount the device /dev/sdc1, which contains an ext4 filesystem, to the mount point /mnt. We’ll then see how the struct path of /mnt changes before and after the mount operation.
Before Mounting:
>>> mnt_struct_path = path_lookup("/mnt")
>>>
>>> mnt_struct_path
(struct path){
.mnt = (struct vfsmount *)0xffffa01abf8c51a0,
.dentry = (struct dentry *)0xffffa01a31843dd0,
}
>>>
>>> mnt_struct_path.dentry.d_iname
(unsigned char [32])"mnt"
>>>
>>> mnt_struct_path.mnt.mnt_sb.s_type.name
(const char *)0xffffffffc0666c2d = "xfs"
We are using drgn to read in-memory structures (drgn is a programmable debugger that helps you read and explore a running Linux kernel or a vmcore. It's very useful for kernel debugging. To get started, you can check this blog: Enter the drgn). The drgn function path_lookup() is equivalent to the kernel function kern_path(), which resolves a given path to a struct path.
struct path plays a crucial role in path resolution. It represents a specific location in the filesystem hierarchy and consists of two components:
- struct dentry *dentry – Points to the dentry.
- struct vfsmount *mnt – Represents the filesystem the dentry is part of.
Currently, we can see that the dentry name (mnt_struct_path.dentry.d_iname) is "mnt". The root filesystem here is an XFS filesystem; hence, mnt_struct_path.mnt.mnt_sb.s_type.name is "xfs".
Now, let’s create an ext4 filesystem, mount it on “/mnt”, and observe how the struct path changes.
Creating and Mounting an ext4 Filesystem:
[root@sridara-s opc]# mkfs.ext4 /dev/sdc1
mke2fs 1.46.2 (28-Feb-2021)
Creating filesystem with 2621440 4k blocks and 655360 inodes
Filesystem UUID: a1a52116-f7b9-4af7-b8be-f62e74fd50a2
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632
Allocating group tables: done
Writing inode tables: done
Creating journal (16384 blocks): done
Writing superblocks and filesystem accounting information: done
[root@sridara-s opc]# mount /dev/sdc1 /mnt
[root@sridara-s opc]#
Now, let's again use drgn to see how things have changed.
After Mounting:
>>> mnt_struct_path = path_lookup("/mnt")
>>> mnt_struct_path
(struct path){
.mnt = (struct vfsmount *)0xffffa01a7bc1c2a0,
.dentry = (struct dentry *)0xffffa018fea8b4e0,
}
>>>
>>> mnt_struct_path.dentry.d_iname
(unsigned char [32])"/"
>>>
>>> mnt_struct_path.mnt.mnt_sb.s_type.name
(const char *)0xffffffffc09980f4 = "ext4"
>>>
We can see that both path->dentry and path->mnt are now pointing to different addresses. The dentry name for this path is “/”, which represents the root of the newly mounted filesystem. As expected, the filesystem type is ext4, since we mounted an ext4 filesystem.
So, what happened? As we discussed earlier, when a filesystem is mounted, the mount point's dentry is marked with the DCACHE_MOUNTED flag. When the kernel encounters this flag during path lookup, it recognizes that this is not a regular dentry but rather a mount point where another filesystem is mounted. At this point, the kernel knows that path resolution has to cross the mount point boundary. So, instead of following the usual lookup process, it takes a different path, which involves searching for the mounts associated with that dentry. This is primarily handled by a function called __lookup_mnt(), which returns the struct mount of the mounted filesystem, through which we obtain the updated struct path.
Before diving into how __lookup_mnt() works, let’s first understand what the mount system call does in the simplest way.
Let’s say we run the command mount /dev/sdc1 /mnt. This command mounts the root of the filesystem present on /dev/sdc1 to the mount point /mnt. Now, the question comes: What does the mount process do to ensure that accessing /mnt leads to the mounted filesystem?
To answer this, we need to understand the key operations performed during mounting: creating and initializing different structures, and linking them to the mount point. Let's break down the complexity of mounting into simple steps for better understanding:
Step 1: Resolving the struct path of the Mount Point
- The first step is to resolve the struct path of the given mount point, which in our case is "/mnt".
Step 2: Initializing the Superblock
- The next step is initializing the superblock by calling a filesystem-specific function, FSTYPE_fill_super().
- For example:
- If the filesystem is ext4, the function ext4_fill_super() is called.
- If the filesystem is btrfs, the function btrfs_fill_super() is used.
- These FSTYPE_fill_super() functions:
- Read the on-disk superblock.
- Create an in-memory struct super_block and initialize its fields.
- Read the on-disk root inode and create an in-memory struct dentry for it.
- Set up filesystem-specific sysfs entries.
- Allocate per-CPU counters and filesystem specific caches, etc.
Step 3: Retrieving the Root Dentry of the Filesystem
- Once the superblock is initialized, the root dentry of the filesystem being mounted is retrieved.
- This is straightforward since the sb->s_root field in the superblock points to the root dentry.
Step 4: Creating and Initializing struct mount
- The struct mount structure is created and initialized.
- This structure is crucial because it represents the mounted filesystem and maintains references to:
- Child mounts, sibling mounts, and other related mount structures.
- The superblock of the mounted filesystem.
- The mount point dentry and its parent mount.
Step 5: Grafting the Root Dentry to the Mount Point
- This is the most critical step: it involves linking the root dentry of the filesystem (which we obtained in Step 3) to the mount point dentry. This is done by the function graft_tree().
- graft_tree() updates the dentry flags of the mount point to include DCACHE_MOUNTED, indicating that this dentry is now a mount point.
- After that, graft_tree() calls __attach_mnt():
- __attach_mnt() adds the struct mount to the global mount hash list.
- The key for this hash is computed using the addresses of the struct vfsmount and the struct dentry of the mount point, which we got from path resolution (Step 1).

static void __attach_mnt(struct mount *mnt, struct mount *parent)
{
    hlist_add_head_rcu(&mnt->mnt_hash,
               m_hash(&parent->mnt, mnt->mnt_mountpoint));
    list_add_tail(&mnt->mnt_child, &parent->mnt_mounts);
}

- hlist_add_head_rcu() adds the mount's hash entry to the global hash list. Here, you can see that the m_hash() function is applied to parent->mnt and mnt->mnt_mountpoint (which are the struct vfsmount and struct dentry of the mount point) to compute the key.
- list_add_tail() appends the mount to the parent's mount list, linking the new mount into the parent's hierarchy.
Another task that graft_tree() performs is propagating the new mount to all the shared and slave mounts in the propagation tree of the parent mount (the parent mount being the struct mount of the mount point "/mnt"). This function also handles some bookkeeping tasks, such as adding the new mount to the list of child mounts in the parent struct mount.
After the mount operation is complete, if a process or the user attempts to access any file within the mounted filesystem, say “/mnt/ext4-file”, the kernel proceeds with the normal path lookup process. When it reaches the ‘mnt’ dentry while looking up the path, the kernel sees that the dentry’s flags have the DCACHE_MOUNTED flag set, indicating that it is a mount point. At this point, the kernel computes a hash using the struct vfsmount and struct dentry of the mount point. It then looks for the corresponding struct mount of the mounted filesystem in the global mount hash list. This lookup is performed by the kernel function __lookup_mnt().
struct mount *__lookup_mnt(struct vfsmount *mnt, struct dentry *dentry)
{
struct hlist_head *head = m_hash(mnt, dentry);
struct mount *p;
hlist_for_each_entry_rcu(p, head, mnt_hash)
if (&p->mnt_parent->mnt == mnt && p->mnt_mountpoint == dentry)
return p;
return NULL;
}
The __lookup_mnt() function finds the struct mount corresponding to a filesystem mounted at a given mount point. It does so by computing a hash based on the original struct vfsmount and struct dentry of the mount point, which is then used to access the appropriate list in the global mount hash table. It iterates through the entries in this list to find a mount whose parent mount and mount point match the provided parameters, effectively identifying the struct mount of the filesystem mounted on top of the given dentry.
Once the correct struct mount is found, struct path fields can be retrieved:
- The struct vfsmount from struct mount->mnt.
- The dentry from struct mount->mnt.mnt_root, which points to the root of the mounted filesystem.
With both the struct vfsmount and the struct dentry, we get the updated struct path, enabling the kernel to access the root of the mounted filesystem and complete the file lookup process.
This is why the fields in struct path changed before and after the mount operation. After mounting, the fields in struct path now correspond to the root of the mounted filesystem.
Bind and Subvolume Mounts
Now that we have understood what happens during a mount, let's explore additional mount options: bind mount and subvolume mount. The subvolume mount is specific to the btrfs filesystem, while bind mount is a VFS feature and is therefore supported by all filesystems.
Here are the main differences between a traditional mount, bind mount, and subvolume mount:
- In a traditional mount, we attach the root dentry of the filesystem to the mount point. However, in the case of bind and subvolume mounts, we attach the dentry of a specific directory within the filesystem. In the mount process described above, the key difference occurs in Step 3: in a traditional mount, the kernel retrieves the dentry corresponding to the root of the filesystem being mounted, whereas in a bind or subvolume mount, the kernel instead retrieves the dentry of the specific directory that is to be bind-mounted or subvolume-mounted.
- The difference between a bind mount and a subvolume mount is:
- For a bind mount, the filesystem must already be mounted. Once the filesystem is mounted, any directory within it can be bind mounted. The original filesystem can be unmounted later, but the directory remains accessible through the bind mount.
- Subvolume mounts, on the other hand, are a feature of the btrfs filesystem. The main difference here is that the filesystem does not need to be mounted first: btrfs allows directly mounting a directory as long as that directory has been initialized as a subvolume. Not every directory can be mounted this way; only those explicitly created as subvolumes can be mounted.
Let’s understand these differences using an example.
The device /dev/sdc2 is initialized with a btrfs filesystem.
Normal mount:
[root@sridara-s opc]# mount /dev/sdc2 /media/
>>> path_lookup("/media").mnt.mnt_root.d_iname
(unsigned char [32])"/"
>>>
>>>
>>> struct_path = path_lookup("/media")
>>> struct_path
(struct path){
.mnt = (struct vfsmount *)0xffffa018a8bb1a60,
.dentry = (struct dentry *)0xffffa018ff5d7410,
}
>>> struct_path.dentry.d_iname
(unsigned char [32])"/"
In this case, you can see that struct_path.dentry.d_iname is “/”, which is the root of the filesystem that is mounted.
Bind mount:
To demonstrate a bind mount, I’ll create a new directory called “bind-dir” and place a few files in it.
[root@sridara-s opc]# mount /dev/sdc2 /media/
[root@sridara-s opc]# cd /media/
[root@sridara-s media]# mkdir bind-dir/
[root@sridara-s media]# for i in {1..3}; do touch bind-dir/file$i; done
[root@sridara-s media]# ls bind-dir/
file1 file2 file3
The filesystem is mounted on /media, and we have the bind-dir/ directory which we want to bind mount.
Now, let’s bind mount the bind-dir directory:
[root@sridara-s media]# mount --bind /media/bind-dir/ /mnt1
[root@sridara-s media]# ls /mnt1
file1  file2  file3
We've successfully bind-mounted the bind-dir directory. Observe that we provided the path to the directory as an argument to the --bind option of the mount utility. This is not possible unless the filesystem is already mounted. Once bind-mounted, we can access the contents of bind-dir through /mnt1.
Now, let’s use drgn to check which directory the struct path->dentry is pointing to:
>>> struct_path = path_lookup("/mnt1")
>>> struct_path
(struct path){
.mnt = (struct vfsmount *)0xffffa01aa63836a0,
.dentry = (struct dentry *)0xffffa018ff1c09c0,
}
>>> struct_path.dentry.d_iname
(unsigned char [32])"bind-dir"
As you can see, the dentry in struct path is now pointing to the dentry of bind-dir (struct_path.dentry.d_iname = “bind-dir”). That is why when this mount point is accessed, we are able to access the contents of bind-dir.
One of the flexibilities that bind mount offers is that once a directory is bind-mounted, even if the parent filesystem is unmounted, the bind mount will still function.
I’ve unmounted /media, but you can see that accessing /mnt1 still allows access to the contents of bind-dir:
[root@sridara-s /]# umount /media
[root@sridara-s /]# ls /mnt1
file1  file2  file3
[root@sridara-s /]#
Now, let’s move on to Subvolume mount.
Subvolume mount:
As mentioned earlier, subvolume is a feature specific to btrfs, and to use this feature, you must initialize the subvolume before mounting it.
The command btrfs subvolume create <subvolume-name> creates a subvolume:
[root@sridara-s /]# mount /dev/sdc2 /media
[root@sridara-s /]# btrfs subvolume create /media/subvolume/
Create subvolume '/media/subvolume'
[root@sridara-s /]# btrfs subvolume list /media
ID 257 gen 15 top level 5 path subvolume
[root@sridara-s /]# for i in {1..5};do touch /media/subvolume/test$i;done
[root@sridara-s /]# ls /media/subvolume/
test1 test2 test3 test4 test5
[root@sridara-s /]# umount /dev/sdc2
I initially mounted /dev/sdc2, so I could create a subvolume. After creating the subvolume named “subvolume”, I added some files to it and then unmounted the device.
[root@sridara-s /]# mount -o subvol=subvolume /dev/sdc2 /mnt2
[root@sridara-s /]# ls /mnt2
test1  test2  test3  test4  test5
We’ve now successfully mounted the subvolume named subvolume. The btrfs filesystem allows us to directly retrieve the dentry of the subvolume, which is why the filesystem doesn’t need to be mounted to mount a subvolume. As a result, subvolumes can be mounted regardless of whether the parent filesystem is mounted or not. Once mounted, we can access the contents of the subvolume by accessing /mnt2.
Now, let’s use drgn to check which directory the struct path->dentry is pointing to:
>>> struct_path = path_lookup("/mnt2")
>>> struct_path
(struct path){
.mnt = (struct vfsmount *)0xffffa018a8bb11a0,
.dentry = (struct dentry *)0xffffa01931eb1680,
}
>>> struct_path.dentry.d_iname
(unsigned char [32])"subvolume"
As you can see, the dentry in struct path is now pointing to the dentry of the subvolume (struct_path.dentry.d_iname=“subvolume”). That’s why when this mount point is accessed, we access the contents of subvolume.
The key difference between normal mounts, bind mounts, and subvolume mounts lies in the dentry that is attached to the mount point. In a normal mount, the root dentry of the mounted filesystem is attached to the mount point. For bind mounts, instead of attaching the root dentry, a specific directory’s dentry within the already mounted filesystem is linked to the mount point. In the case of subvolume mounts (specific to btrfs), a directory initialized as a subvolume is attached to the mount point, and this can be done even if the parent filesystem is not mounted.
Now, let us turn our discussion to the umount operation.
Understanding Unmount Operation
Mounting makes a filesystem's contents accessible to the system. When we no longer need this connection, we use the umount utility to disconnect the filesystem. The umount operation undoes the setup performed by mount: it dereferences, deallocates, unsets flags, and frees all the resources created by the mount operation, ultimately removing the struct mount from the global mount hash list. After that, the mount point acts as a regular directory and the filesystem becomes inaccessible.
Unmounting behaves differently depending on the flags provided:
- MNT_EXPIRE: This flag allows userspace programs to mark a mount point as expired if it is not currently busy. When the mark is being set for the first time, the call returns -EAGAIN. If the mark was already set and the mount point has not been used since, the umount call proceeds to unmount. This use case is for unmounting inactive mounts: a filesystem marked with MNT_EXPIRE will be unmounted if it remains inactive and untouched since the last call to ksys_umount().
- MNT_FORCE: Forcefully unmounts the filesystem, aborting pending requests. However, this option does not guarantee data integrity, and it is not supported by popular filesystems such as ext4, btrfs, and xfs.
- MNT_DETACH: This is the lazy unmount flag. If specified, it disables mount point access for new processes, but processes currently holding a reference to it can still access it. Once those processes release their references, the unmount takes place and all resources associated with the mount point are released.
Let us see how the lazy unmount option works in action:
I have kept a file file1 open, and I am attempting to unmount the device from another terminal. If I try to unmount this device, it is going to fail with an EBUSY error, as there is a process still using the filesystem.

[root@sridara-s opc]# lsof /mnt
COMMAND       PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
vim       1018759 root    6u   REG   0,51    12288  269 /mnt/bind-dir/.file1.swp
[root@sridara-s opc]#
[root@sridara-s opc]# umount /mnt
umount: /mnt: target is busy.
[root@sridara-s opc]#
You can see that the lsof command has listed the process that is still using the filesystem. Here, it is vim. But I still want to unmount it, and I can do that using the lazy unmount option.
[root@sridara-s opc]# umount -l /mnt
[root@sridara-s opc]# echo $?
0
[root@sridara-s opc]#
Here, you can see umount exited without any error.
Now that I have unmounted it lazily, no new process should be able to access the mount point. However, processes that are already using the filesystem (in our case, vim) will still have access to it. Let's verify this with lsof:

[root@sridara-s opc]# lsof /mnt
[root@sridara-s opc]#
It returned nothing. Strange, right? I still have that file open, so why is lsof not able to find the process?

The answer is: as already mentioned, new processes cannot access the filesystem once it is lazily unmounted. Since lsof itself is a new process (spawned after the lazy unmount), it no longer sees /mnt as a mount point. As a result, it couldn't find the vim process, which is still using the filesystem.

So now the question comes: how do we find processes that are using a lazily unmounted filesystem?
To find such processes, the best approach is to list all open files and grep for files associated with the specific device number of the unmounted filesystem.
For example:
Using lsof | grep "major,minor":

[root@sridara-s opc]# lsof | grep "0,51"
vim       1018759 root    6u   REG   0,51    12288  269 /bind-dir/.file1.swp
[root@sridara-s opc]#
Now you can see that the process is still present.
Note: You have to get the major and minor device numbers using the stat command on the mount point before attempting the unmount operation:

[root@sridara-s opc]# stat -c "%d" /mnt   # Run this before you unmount
51
[root@sridara-s opc]#
This will give the encoded device number. From this, the major and minor numbers can be calculated as follows:
major = devnumber >> 8    (51 >> 8 = 0)
minor = devnumber & 255   (51 & 255 = 51)
For most filesystems, these major and minor numbers will be the same as those of the block device they are on. However, for multi-device filesystems like btrfs, they will differ from the major and minor numbers of the backing block device: since btrfs can create and manage filesystems that span multiple devices, it creates a virtual device number which is different from that of the backing devices.
As lsof also gets the device number using the stat() system call, grepping for the major and minor numbers obtained from the stat command is the right way.

- No Flag Specified: Without any special flags, the system follows the standard unmounting process. It first checks whether any processes are still using the mount point. If active references exist, the call fails with an -EBUSY error. Otherwise, it proceeds to umount_tree(), initiating the cleanup process.
Just like mount, the unmount process also propagates to associated shared or slave mounts in the propagation tree.
The key steps in unmounting process are:
- Removing the struct mount from the namespace by adjusting reference counts and setting the struct mount->mnt_ns field to NULL.
- Decrementing the reference count of the struct mount.
- Marking the struct mount as doomed using the MNT_DOOMED flag, which indicates that the umount operation is in progress.
- Calling cleanup_mnt(), which triggers deactivate_super() to release filesystem-related resources.
- Dereferencing the root/directory dentry of the filesystem being unmounted.
- Dereferencing and freeing the in-memory superblock, and pruning dentries in the dcache associated with the unmounted filesystem.
- Calling the filesystem-specific FSTYPE_put_super() function to release filesystem-specific resources. For example, ext4 and btrfs have ext4_put_super() and btrfs_put_super() respectively, which perform filesystem-specific cleanup actions, such as:
  - Releasing in-memory superblock info structures (struct ext4_sb_info for ext4 and struct btrfs_fs_info for btrfs).
  - Unregistering sysfs entries and freeing associated caches.
  - Undoing the allocations made by the FSTYPE_fill_super() function during the initial mount operation.
Conclusion
In this blog, we went through an overview of the mount and umount operations. Mounting involves creating relevant structures, initializing the superblock, and attaching the filesystem dentry to the mount point, while unmounting undoes what mount does: dereferencing, freeing, and other cleanup tasks.
References
- Linux Source Code
- https://docs.kernel.org/filesystems/vfs.html
- https://litux.nl/mirror/kerneldevelopment/0672327201/ch12lev1sec7.html
- https://blogs.oracle.com/linux/post/enter-the-drgn
- https://www.kernel.org/doc/html/latest/filesystems/path-lookup.html