drgn, a Python-based programmable kernel debugger whose name stands for “debugger with runtime introspection”, stands out for how readily it can be extended to walk and manipulate complex data structures. drgn-tools is a library of Python helper scripts built on drgn that mainly targets the Oracle Unbreakable Enterprise Kernel (UEK). This blog introduces a new list_lru iterator for drgn-tools and walks through a couple of practical examples that use it.
Introduction
The Linux kernel provides a least-recently-used list structure (list_lru) that holds inactive items of a specific type (for example dentry, inode, xfs_buf, memory management shadow node, zswap, etc.) in the hope that the items can be reused without expensive reinitialization, at the cost of temporarily holding on to kernel memory. The list_lru structure is also integrated with the Linux kernel memory management shrinker routines, which quickly discard the inactive LRU items when the kernel is experiencing memory pressure. The list_lru organizes these items into sublists by their Non-Uniform Memory Access (NUMA) node-id and, optionally, by their memory control group (memcg) index.
NUMA is a high performance multiprocessor architecture that groups the memory and processors into interconnected nodes, such that all physical memory is accessible by all processors but the memory in the local node is accessed faster. In contrast, many servers are configured as Uniform Memory Access (UMA), which, for list_lru purposes, can be thought of as having only one node.
Control groups (cgroups) in the Linux kernel organize and limit system resources (CPU, memory, storage I/O, open files, etc.). The memory control groups (memcg) manage the memory resource for processes. A memcg can be organized as “per page” or “per slab entry”. The memcg aware list_lru entries use the “per slab entry” organization.
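The organization described in this introduction can be pictured with a small toy model. This is a minimal sketch in plain Python, not the kernel implementation: the class name ToyListLru and its methods add(), reuse() and shrink() are made up for illustration, and a dictionary keyed by (node-id, memcg index) stands in for the kernel's sublists.

```python
from collections import OrderedDict, defaultdict


class ToyListLru:
    """Toy model: one LRU sublist per (node-id, memcg index) pair."""

    def __init__(self, memcg_aware=True):
        self.memcg_aware = memcg_aware
        # (nid, memcg) -> items in insertion order, oldest first
        self.sublists = defaultdict(OrderedDict)

    def _key(self, nid, memcg):
        # A list that is not memcg aware keeps a single sublist per node.
        return (nid, memcg if self.memcg_aware else None)

    def add(self, item, nid, memcg=None):
        """Park an inactive item at the tail of its sublist."""
        self.sublists[self._key(nid, memcg)][item] = True

    def reuse(self, item, nid, memcg=None):
        """Take an item back; False means it was already discarded."""
        return self.sublists[self._key(nid, memcg)].pop(item, None) is not None

    def shrink(self, nid, memcg=None, nr_to_scan=1):
        """Discard up to nr_to_scan of the oldest items from one sublist."""
        sublist = self.sublists[self._key(nid, memcg)]
        freed = []
        while sublist and len(freed) < nr_to_scan:
            item, _ = sublist.popitem(last=False)
            freed.append(item)
        return freed


lru = ToyListLru()
lru.add("dentry-a", nid=0, memcg=2)
lru.add("dentry-b", nid=0, memcg=2)
lru.add("dentry-c", nid=1, memcg=71)
freed = lru.shrink(nid=0, memcg=2, nr_to_scan=1)  # memory pressure on one sublist
```

In this toy, shrinking the (0, 2) sublist discards only its oldest item, while the item parked on node 1 for memcg 71 is untouched and can still be reused cheaply.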
The Linux list_lru
The Linux kernel list_lru was introduced in Linux 3.12 as a simple list with a spin lock and NUMA support. The memcg support was added in Linux 4.0 (there was no 3.20 release; the version after 3.19 was numbered 4.0). The memcg support for a particular list_lru remains optional and is specified when the list_lru is created. For example, the list_lru for XFS metadata buffers and the list_lru for the XFS quotas are not memcg aware, whereas the NFS list_lru and the superblock inode and dentry list_lru are memcg aware. A boolean (memcg_aware) was added to the list_lru in Linux 5.3 to aid in the detection of memcg aware lists without assuming that NUMA node-id 0 exists. In Linux 5.17, the API transposed the NUMA node and memcg hierarchy ordering in the list_lru, and the unshrinkable memcg array was replaced with a shrinkable xarray. This change resulted in substantial kernel memory savings, especially in container environments, and these changes were backported into Oracle UEK 7 using a UEK 7 specific list_lru_ext structure.
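The transposition of the node and memcg ordering can be illustrated with nested dictionaries. This is only a sketch of the idea, with plain Python dicts standing in for the kernel arrays and xarray; the names node_major, memcg_major, lookup_old() and lookup_new() are hypothetical.

```python
# Older ordering as described above: NUMA node first, then memcg index.
node_major = {
    0: {2: ["a", "b"], 71: ["c"]},
    1: {2: ["d"]},
}


def transpose(by_node):
    """Re-key {nid: {memcg: items}} as {memcg: {nid: items}}."""
    by_memcg = {}
    for nid, per_memcg in by_node.items():
        for memcg, items in per_memcg.items():
            by_memcg.setdefault(memcg, {})[nid] = items
    return by_memcg


# Newer ordering: memcg index first, then NUMA node.
memcg_major = transpose(node_major)


def lookup_old(nid, memcg):
    return node_major[nid][memcg]


def lookup_new(memcg, nid):
    return memcg_major[memcg][nid]
```

Both lookups resolve to the same sublist; only the order of the two keys differs. An iterator that hides this difference from its callers can therefore return the same (node-id, memcg index) information on either layout.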
The list_lru iterator
The drgn-tools list_lru iterator implementation has to detect whether the list_lru is memcg aware and has to cope with all the list_lru API changes across the various Linux versions, including the Oracle UEK 7 specific change.
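One way to picture the memcg aware detection is a check that prefers the explicit flag and falls back to inspecting the per-node data. The sketch below is an assumption about how such a check could look, not the drgn-tools code: it runs on plain Python stand-ins rather than drgn objects, the function name is_memcg_aware() is made up, and the fallback on a per-node memcg_lrus pointer is an assumed older layout.

```python
from types import SimpleNamespace


def is_memcg_aware(lru):
    """Guess whether a list_lru stand-in is memcg aware."""
    # Linux 5.3 and later (per the text) carry an explicit boolean.
    if hasattr(lru, "memcg_aware"):
        return bool(lru.memcg_aware)
    # ASSUMPTION: older layouts hang a per-memcg pointer off each node, so a
    # non-NULL pointer on any present node is treated as "memcg aware".
    return any(getattr(node, "memcg_lrus", None) for node in lru.node)


# Stand-ins for three kinds of list_lru (addresses are made up).
new_style = SimpleNamespace(memcg_aware=1, node=[])
old_aware = SimpleNamespace(node=[SimpleNamespace(memcg_lrus=0xFFFF8D3D00001000)])
old_unaware = SimpleNamespace(node=[SimpleNamespace(memcg_lrus=0)])
```

Scanning every present node in the fallback, instead of only node 0, reflects the concern raised above about systems where NUMA node-id 0 does not exist.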
The list_lru eventually resolves to a list_lru_one entry, which is the fundamental list of entries for a particular type and for a particular NUMA node-id and memcg index combination. The list_lru iterator code provides both functions that return list_lru_one and functions that return each entry in the list_lru along with the memcg index and NUMA node-id when appropriate.
The list_lru_for_each_entry() and list_lru_from_memcg_node_for_each_entry() functions return the entries grouped first by the NUMA node-id and then by the memcg.
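The layering and the grouping order can be sketched with two generators over a toy structure. The names toy_lru, toy_for_each_list() and toy_for_each_entry() are hypothetical, the nested dictionary is not the kernel layout, and sorting the keys is an assumption used here to make the grouping visible.

```python
# Toy stand-in for a list_lru: {nid: {memcg: [entries]}}.
toy_lru = {
    1: {71: ["e4"], 2: ["e3"]},
    0: {2: ["e1", "e2"]},
}


def toy_for_each_list(lru):
    """Yield (nid, memcg, sublist), by node-id first and then by memcg."""
    for nid in sorted(lru):
        for memcg in sorted(lru[nid]):
            yield nid, memcg, lru[nid][memcg]


def toy_for_each_entry(lru):
    """Yield (nid, memcg, entry) by walking every sublist in turn."""
    for nid, memcg, sublist in toy_for_each_list(lru):
        for entry in sublist:
            yield nid, memcg, entry


walk = list(toy_for_each_entry(toy_lru))
```

The per-entry generator is a thin wrapper around the per-list generator, so any change in how the sublists are located only has to be handled in one place.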
The list_lru iterator in drgn-tools provides the following functions:
- list_lru_for_each_list(list_lru) iterates through the list_lru and returns the next list_lru_one entry, the corresponding NUMA node-id and the memcg index.
- list_lru_for_each_entry(type, list_lru, member) calls list_lru_for_each_list() and iterates through the resulting list_lru_one list and returns the next entry of the specified type and the corresponding NUMA node-id and the memcg index.
- list_lru_from_memcg_node_for_each_list(memcg_idx, node_id, list_lru) returns the specific list_lru_one for a particular memcg index, NUMA node-id from the list_lru. Since the memcg index and NUMA node-id are known, these entries are not returned.
- list_lru_from_memcg_node_for_each_entry(memcg_idx, node_id, type, list_lru, member) calls list_lru_from_memcg_node_for_each_list() and iterates through the resulting list_lru_one list and returns the next entry of the specified type. Since the memcg index and NUMA node-id are known, these entries are not returned.
- slab_object_to_memcgidx(entry) returns the memcg index of the slab entry.
- slab_object_to_nodeid(entry) returns the NUMA node-id of the slab entry.
slab_object_to_memcgidx() and slab_object_to_nodeid() are helper functions that return the memcg index and NUMA node-id respectively, for a particular list_lru entry. The output from these functions could be used as input to list_lru_from_memcg_node_for_each_list() and list_lru_from_memcg_node_for_each_entry(). Currently, these routines are used in the list_lru validation test to verify the results from list_lru_for_each_entry().
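The kind of cross-check such a validation test performs can be sketched as follows. This is not the drgn-tools test: validate_walk() is a hypothetical helper, the (nid, memcg, entry) tuple order is an assumption, and the walk and lookup arguments are stand-ins for the full iteration and for the direct per-sublist lookup.

```python
def validate_walk(walk, lookup, ident=lambda entry: entry):
    """Cross-check a full walk against direct per-sublist lookups.

    walk   -- iterable of (nid, memcg, entry) tuples (tuple order assumed)
    lookup -- callable (memcg, nid) -> iterable of entries in that sublist
    ident  -- maps an entry to a hashable identity (for example its address)
    Returns the walk tuples whose entry the direct lookup did not find.
    """
    cache = {}
    missing = []
    for nid, memcg, entry in walk:
        key = (memcg, nid)
        if key not in cache:
            cache[key] = {ident(e) for e in lookup(memcg, nid)}
        if ident(entry) not in cache[key]:
            missing.append((nid, memcg, entry))
    return missing


# Made-up data: sublists keyed by (memcg, nid).
sublists = {(2, 0): ["a", "b"], (71, 1): ["c"]}
good_walk = [(0, 2, "a"), (0, 2, "b"), (1, 71, "c")]
bad_walk = good_walk + [(1, 2, "z")]


def direct(memcg, nid):
    return sublists.get((memcg, nid), [])
```

An empty result means every entry reported by the full walk was also found in the sublist that its own memcg index and NUMA node-id point to.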
Directory Entry Examples
The following examples use the Linux superblock directory entry (dentry) list_lru. The dentry list_lru (s_dentry_lru) holds both inactive dentry items and negative dentry items (a negative dentry caches an invalid name from a failed lookup in a directory).
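A negative dentry has no inode attached, which suggests a simple test. The sketch below assumes that a NULL d_inode pointer is enough to identify a negative dentry; is_negative() is a hypothetical helper, the count_neg_dentry.py script may use a different check, and the objects here are plain Python stand-ins rather than drgn objects.

```python
from types import SimpleNamespace


def is_negative(dentry):
    """A negative dentry caches a failed lookup, so no inode is attached."""
    # ASSUMPTION: a NULL d_inode is sufficient to call the dentry negative.
    return not dentry.d_inode


hit = SimpleNamespace(d_inode=0xFFFF8D3D4796D000)  # made-up inode address
miss = SimpleNamespace(d_inode=0)                  # NULL pointer stand-in
```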
The Python script count_neg_dentry.py uses the for_each_mount() helper from drgn.helpers.linux.fs, so it can process the dentry items of every mounted filesystem, of a specific filesystem type (fstype=“xfs”, “ext4”, “nfs4”, “tmpfs”, “cgroup2”, etc.) or of a specific mount point (dst=). The specific mount device (src=) option is not used at this time but could easily be added.
This example script uses the list_lru_for_each_entry() function to iterate over the superblock s_dentry_lru list_lru to print the mount point and the total number of dentry and negative dentry items on a mounted filesystem. The optional “verbose=N” argument provides more information:
- verbose=1: also prints the path of the negative dentry items.
- verbose=2: also prints the number of dentry and negative dentry items by (node-id, memcg) combination, by NUMA node-id and by memcg index. These additional counts are not printed if there are no dentry or negative dentry items for a particular filesystem.
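The three breakdowns printed at verbose=2 can be produced in a single pass over the iterator output. This is a sketch of the counting only, not the script itself: tally() is a hypothetical helper, the (nid, memcg, dentry) tuple order is an assumption, and the demo walk uses a boolean in place of a real dentry.

```python
from collections import Counter


def tally(walk, is_negative):
    """Count entries per (nid, memcg), per nid and per memcg.

    walk        -- iterable of (nid, memcg, dentry) tuples (order assumed)
    is_negative -- predicate deciding whether a dentry is negative
    """
    def empty():
        return {"nid/memcg": Counter(), "nid": Counter(), "memcg": Counter()}

    counts = {"dentry": empty(), "neg dentry": empty()}
    for nid, memcg, dentry in walk:
        # Every entry counts as a dentry; negative ones are counted twice.
        kinds = ["dentry"] + (["neg dentry"] if is_negative(dentry) else [])
        for kind in kinds:
            counts[kind]["nid/memcg"][(nid, memcg)] += 1
            counts[kind]["nid"][nid] += 1
            counts[kind]["memcg"][memcg] += 1
    return counts


# Made-up walk: the third field is True for a negative dentry.
demo = [(0, 2, False), (0, 2, True), (1, 2, True), (1, 71, False)]
result = tally(demo, is_negative=lambda d: d)
```

Keeping the negative dentry counts as a subset of the dentry counts matches the way the totals are reported in the examples, where the negative dentry number never exceeds the dentry number.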
Example 1 (no verbose)
Print the total number of dentry and negative dentry entries of all the XFS filesystems (fstype=‘xfs’):
>>> count_neg_dentry(prog, fstype='xfs')
mntpt / dentry 262182 neg dentry 65257
mntpt /var dentry 2365 neg dentry 194
mntpt /home dentry 183 neg dentry 118
mntpt /tmp dentry 10700 neg dentry 2909
mntpt /var/log dentry 10819 neg dentry 8272
mntpt /u01 dentry 673303 neg dentry 298159
mntpt /boot dentry 30 neg dentry 12
mntpt /var/log/audit dentry 6 neg dentry 1
Example 2 (optional verbose=1)
Print the dentry and negative dentry entries totals and also print the negative dentry entry paths (verbose=1) for the filesystem mounted at ‘/boot’ (dst=‘/boot’):
>>> count_neg_dentry(prog, dst='/boot', verbose=1)
mntpt /boot dentry 0xffff8cdd4e7d0b60 name /boot/.reclaim_non_active_sys
mntpt /boot dentry 0xffff8d17d3392560 name /boot/loader
mntpt /boot dentry 0xffff8d281b227330 name /boot/app
mntpt /boot dentry 0xffff8d5ea469d520 name /boot/.initramfs-5.15.0-300.163.18.7.el8uek.x86_64.img.default
mntpt /boot dentry 0xffff8d9c752cc270 name /boot/.elasticConfig
mntpt /boot dentry 0xffff8d3d4fa4da00 name /boot/.upgrade.out.place
mntpt /boot dentry 0xffff8d3d4fa9e630 name /boot/.saved_ssh_host_keys
mntpt /boot dentry 0xffff8d3d4fa9c820 name /boot/.switch_to_non_ovm
mntpt /boot dentry 0xffff8ddd6f9ac410 name /boot/dracut
mntpt /boot dentry 0xffff8e3f5d88eff0 name /boot/grub2/grub.cfg
mntpt /boot dentry 0xffff8e98702678e0 name /boot/base
mntpt /boot dentry 0xffff8ebd4edd6cb0 name /boot/boot_backup
mntpt /boot dentry 30 neg dentry 12
Example 3 (optional verbose=2)
Print the dentry and negative dentry totals along with the counts per (NUMA node-id, memcg index) combination, per NUMA node-id and per memcg index. In this example, the sublist for NUMA node-id 0 and memcg index 2 has 708 dentry items, of which 400 are negative dentry items:
>>> count_neg_dentry(prog, dst='/sys', verbose=2)
mntpt /sys dentry 87425 neg dentry 17657
dentry by nid/memcg {(0, 2): 708, (0, 71): 4, (0, 94): 94, (0, 95): 4, (0, 107): 2, (0, 111): 5, (0, 114): 1, (1, 2): 1061, (1, 63): 4, (1, 91): 47, (1, 94): 51, (1, 95): 3, (1, 100): 30, (1, 111): 21, (1, 114): 89, (1, 131): 1, (2, 2): 10565, (2, 4): 1, (2, 8): 5, (2, 10): 59, (2, 24): 16, (2, 71): 25762, (2, 74): 125, (2, 86): 231, (2, 88): 13, (2, 89): 4, (2, 91): 9, (2, 94): 250, (2, 95): 115, (2, 98): 49, (2, 100): 20, (2, 105): 2, (2, 111): 3345, (2, 114): 50, (2, 131): 6, (3, 2): 1346, (3, 4): 2, (3, 24): 1, (3, 30): 2, (3, 91): 67, (3, 94): 149, (3, 95): 5, (3, 98): 1, (3, 111): 2081, (4, 2): 1012, (4, 71): 22, (4, 95): 1, (4, 100): 18, (4, 114): 3, (5, 2): 1536, (5, 10): 1, (5, 71): 23, (5, 91): 78, (5, 94): 13, (5, 95): 3, (5, 100): 130, (5, 111): 16, (5, 114): 1, (5, 131): 13, (6, 2): 4565, (6, 4): 4, (6, 10): 61, (6, 30): 2, (6, 71): 136, (6, 94): 15, (6, 95): 12, (6, 100): 16, (6, 105): 8, (6, 111): 26, (6, 114): 4, (6, 131): 1, (7, 2): 33010, (7, 10): 3, (7, 71): 75, (7, 94): 77, (7, 95): 3, (7, 100): 35, (7, 114): 2, (7, 131): 4, (7, 133): 90}
dentry by nid {0: 818, 1: 1307, 2: 40627, 3: 3654, 4: 1056, 5: 1814, 6: 4850, 7: 33299}
dentry by memcg {2: 53803, 71: 26022, 94: 649, 95: 146, 107: 2, 111: 5494, 114: 150, 63: 4, 91: 201, 100: 249, 131: 25, 4: 7, 8: 5, 10: 124, 24: 17, 74: 125, 86: 231, 88: 13, 89: 4, 98: 50, 105: 10, 30: 4, 133: 90}
neg dentry by nid/memcg {(0, 2): 400, (0, 95): 2, (0, 107): 2, (0, 114): 1, (1, 2): 610, (1, 63): 1, (1, 94): 4, (1, 95): 2, (1, 111): 3, (1, 131): 1, (2, 2): 4632, (2, 10): 7, (2, 71): 3238, (2, 74): 4, (2, 88): 13, (2, 89): 1, (2, 94): 2, (2, 95): 40, (2, 98): 31, (2, 100): 1, (2, 105): 2, (3, 2): 932, (3, 98): 1, (3, 111): 25, (4, 2): 938, (4, 71): 22, (4, 114): 3, (5, 2): 1208, (5, 71): 23, (5, 95): 2, (5, 100): 15, (5, 114): 1, (5, 131): 9, (6, 2): 2751, (6, 10): 21, (6, 71): 35, (6, 95): 1, (6, 100): 2, (6, 105): 3, (6, 111): 1, (6, 131): 1, (7, 2): 2634, (7, 10): 2, (7, 71): 25, (7, 95): 2, (7, 131): 3}
neg dentry by nid {0: 405, 1: 621, 2: 7971, 3: 958, 4: 963, 5: 1258, 6: 2815, 7: 2666}
neg dentry by memcg {2: 14105, 95: 49, 107: 2, 114: 5, 63: 1, 94: 6, 111: 29, 131: 14, 10: 30, 71: 3343, 74: 4, 88: 13, 89: 1, 98: 32, 100: 18, 105: 5}
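The breakdowns in this output are internally consistent, which is easy to confirm by adding them back up. The short check below copies figures from the Example 3 output; the variable names are made up, and the per-node percentage of negative dentry items is an extra derived figure, not something the script prints.

```python
# Figures copied from the Example 3 output above.
total_dentry, total_neg = 87425, 17657
dentry_by_nid = {0: 818, 1: 1307, 2: 40627, 3: 3654,
                 4: 1056, 5: 1814, 6: 4850, 7: 33299}
neg_by_nid = {0: 405, 1: 621, 2: 7971, 3: 958,
              4: 963, 5: 1258, 6: 2815, 7: 2666}
# Node 0 slice of the "dentry by nid/memcg" dictionary.
node0 = {(0, 2): 708, (0, 71): 4, (0, 94): 94, (0, 95): 4,
         (0, 107): 2, (0, 111): 5, (0, 114): 1}

# Each breakdown should add back up to the level above it.
assert sum(node0.values()) == dentry_by_nid[0]
assert sum(dentry_by_nid.values()) == total_dentry
assert sum(neg_by_nid.values()) == total_neg

# Share of negative dentry items on each node, as a percentage.
neg_share = {nid: round(100 * neg_by_nid[nid] / count, 1)
             for nid, count in dentry_by_nid.items()}
```

In this data set the share varies widely between nodes: roughly 91% of the dentry items on node 4 are negative, compared with about 8% on node 7.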
XFS Metadata Buffer Example
The Python script drgn_xfs_routines.py also uses the for_each_mount() helper from drgn.helpers.linux.fs, so it can process either every mounted XFS filesystem or a specific XFS filesystem (dst=).
The xfs_print_all_bufs() function uses the list_lru_for_each_entry() function to iterate over the XFS metadata buffers in the bt_lru list_lru for the XFS data xfs_buftarg device and optionally the log xfs_buftarg device. For each XFS metadata buffer in the list_lru, the routine prints the kernel pointer, the XFS block number, the flags and the AG number, and attempts to print the XFS buffer header identity (“magic”) for the metadata buffer if it appears to be ASCII. The XFS list_lru are not memcg aware.
>>> xfs_print_all_bufs(prog, dst='/home')
/dev/mapper/VGExaDb-LVDbHome /home (struct xfs_mount *)0xffff8d3d4796d000
buftarg 0xffff8d3d4b0c7100
bp 0xffff8cdd861f88c0 bno 0x80 flgs 0x110030 ag 0
bp 0xffff8d3c4aa832c0 bno 0x100 flgs 0x100020 ag 0
(char *)0xffff8cdd4f0df000 = "XDB3.>\x9c\x05"
bp 0xffff8d3c4aa83d40 bno 0xa0 flgs 0x100020 ag 0
bp 0xffff8d3c4881a680 bno 0x200080 flgs 0x110030 ag 1
bp 0xffff8cdd479263c0 bno 0x2003b0 flgs 0x100020 ag 1
bp 0xffff8d3c57ccaf40 bno 0x432080 flgs 0x110030 ag 2
bp 0xffff8d3c57ccfa80 bno 0x2000a0 flgs 0x100020 ag 1
bp 0xffff8d3c57cce740 bno 0x600080 flgs 0x110030 ag 3
bp 0xffff8d3c57ccd780 bno 0x200100 flgs 0x100020 ag 1
(char *)0xffff8d3301089000 = "XDD3\xb2\xf0\a\x95"
bp 0xffff8d3c57ccc280 bno 0x201b80 flgs 0x100020 ag 1
bp 0xffff8d3d4fbf24c0 bno 0x1 flgs 0x200020 ag 0
(char *)0xffff8d3d4b7ec600 = "XAGF"
bp 0xffff8d3d4fbf7380 bno 0x20 flgs 0x100020 ag 0
(char *)0xffff8d420b112000 = "FIB3"
bp 0xffff8d3d4fbf08c0 bno 0x2 flgs 0x200020 ag 0
(char *)0xffff8d3d4b7ee800 = "XAGI"
bp 0xffff8d3d4fbf5080 bno 0x200001 flgs 0x200020 ag 1
(char *)0xffff8d3d4b7ebe00 = "XAGF"
bp 0xffff8d3d4fbf2a00 bno 0x200020 flgs 0x100020 ag 1
(char *)0xffff8d43100e2000 = "FIB3"
bp 0xffff8d3d4fbf7e00 bno 0x200002 flgs 0x200020 ag 1
(char *)0xffff8d3d4b7e8000 = "XAGI"
bp 0xffff8d3d4fbf2300 bno 0x400001 flgs 0x200020 ag 2
(char *)0xffff8d3d4b7eba00 = "XAGF"
bp 0xffff8d3d4fbf1dc0 bno 0x400020 flgs 0x100020 ag 2
(char *)0xffff8d420b284000 = "FIB3"
bp 0xffff8d3d4fbf1a40 bno 0x400002 flgs 0x200020 ag 2
(char *)0xffff8d3d4b7ea000 = "XAGI"
bp 0xffff8d3d4fbf7000 bno 0x600001 flgs 0x200020 ag 3
(char *)0xffff8d3d4b7efc00 = "XAGF"
bp 0xffff8d3d4fbf01c0 bno 0x600020 flgs 0x100020 ag 3
(char *)0xffff8d420b113000 = "FIB3"
bp 0xffff8d3d4fbf5940 bno 0x600002 flgs 0x200020 ag 3
(char *)0xffff8d3d4b7ebc00 = "XAGI"
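The “print the magic if it appears to be ASCII” step can be sketched as a small filter over the first bytes of a buffer. This is an illustration rather than the code in drgn_xfs_routines.py: ascii_magic() is a hypothetical helper, and the 4-byte length and the letters-and-digits rule are assumptions chosen to fit the strings seen in the output above.

```python
def ascii_magic(raw, length=4):
    """Return the leading bytes as text if they look like an ASCII magic.

    raw    -- bytes read from the start of the metadata buffer
    length -- number of leading bytes to test (4 is an assumption)
    Returns None when the leading bytes are not ASCII letters or digits.
    """
    head = bytes(raw[:length])
    if len(head) == length and all(b < 128 and chr(b).isalnum() for b in head):
        return head.decode("ascii")
    return None


# Leading bytes taken from strings shown in the output above.
agf = ascii_magic(b"XAGF")
xdb = ascii_magic(b"XDB3.>\x9c\x05")
# Made-up non-ASCII header: nothing would be printed for this buffer.
none = ascii_magic(b"\x00\x00\x3e\xe1")
```

Testing only a short, fixed-length prefix keeps the check cheap and avoids printing binary data for buffers whose header does not start with a readable identifier.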
Conclusion
This blog introduced the new drgn-tools list_lru iterator, which can be used to debug problems in the Linux kernel. The list_lru iterator works on UMA and NUMA architectures and with both memcg aware and memcg unaware lists. The iterator works across a range of Linux versions, from Oracle UEK 5 (Linux 4.14) through Linux 6.15. The iterator callers can use or ignore the returned NUMA node-id and memcg index information as desired. The dentry and XFS metadata buffer list_lru examples showed the information that can be obtained from these list_lru.