Debugging Linux kernel issues is a challenging task that requires specialized tools to identify what went wrong. Traditionally, tools like GDB and the crash utility are widely used for kernel debugging. When debugging a customer issue in a running kernel, this usually means manually navigating kernel data structures and resolving pointers. The process can be slow, tedious, and error prone. Alternatively, we could use diagnostic kernel modules or Ksplice patches, but these involve significant development effort and risk, as any bugs could crash the customer’s system.
Drgn is a newer tool that combines Python scripting with a debugger. It allows us to write scripts that accomplish the same tasks efficiently and safely on a live kernel. Once written, these scripts can be quickly run on multiple machines and be adapted to new situations. With this ability, we’re increasingly able to resolve customer kernel issues quickly, without even needing downtime for a vmcore. In short, drgn transforms a slow, error-prone, and highly manual process into a programmable, repeatable, and scalable solution, making live kernel diagnostics far more practical and efficient.
In this article, we’ll walk through five customer case studies that demonstrate how we used drgn to solve problems much more quickly than we ever could before, all without needing downtime for a vmcore!
If you’re not yet familiar with drgn, or want to learn more about it, you can read our article introducing it, as well as the drgn documentation. Some of the solutions presented here have been upstreamed to drgn, while others are part of Oracle’s corelens tool, which you can read about here.
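For a quick taste of what working with drgn looks like, attaching it to the running kernel drops you into a Python interpreter in which kernel variables and data structures become ordinary Python objects (the values shown below are illustrative):
# drgn --ctf -c /proc/kcore
>>> prog["jiffies"]
(volatile unsigned long)4318236828
>>> from drgn.helpers.linux.pid import find_task
>>> find_task(prog, 1).comm
(char [16])"systemd"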
Case 1: BTRFS – Mount Failure
One of our Oracle Linux customers encountered a problem where a btrfs filesystem failed to mount, returning an EEXIST error.
# mount /dev/sdd1 /mnt
mount: /mnt: mount(2) system call failed: File exists.
8bd2037b-2d14-42a7-90d7-49200ecc6f70 is the UUID of the filesystem being mounted.
# lsblk -f
NAME FSTYPE LABEL UUID MOUNTPOINT
...
sdd
├─sdd1 btrfs 8bd2037b-2d14-42a7-90d7-49200ecc6f70
...
This issue could occur if there is a filesystem mounted with the same UUID but under a different device name. However, userspace utilities such as mount and lsblk did not show any other filesystem mounted with the UUID 8bd2037b-2d14-42a7-90d7-49200ecc6f70.
Since userspace showed no filesystem mounted with that UUID, we suspected a scenario where userspace is not aware of a mounted filesystem but the kernel still is – a situation that can occur after a lazy unmount.
btrfs maintains an internal record of filesystems (the fs_uuids cache) that are currently or were recently mounted. If an entry exists in this cache with the same UUID but under a different device name, and that entry is still marked as mounted, the mount operation will fail with an EEXIST error. We wanted to verify whether such an entry existed in the cache.
Unfortunately, there is no official way to peek into this cache – the kernel neither exports it to userspace nor provides any interface to inspect its contents. This is where drgn proved invaluable.
Using drgn, we wrote a script to read the fs_uuids cache directly. Since drgn can read kernel variables and data structures, we could navigate through the cache and verify our hypothesis. This script was later contributed to upstream drgn and can be found here.
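At its core, the script just walks that list. Here is a minimal sketch of the idea – member names such as fs_list and dev_list are taken from recent upstream kernels and vary across versions, so treat it as an illustration rather than a replacement for the contributed script:

import uuid

from drgn import cast
from drgn.helpers.linux.list import list_for_each_entry

# prog is provided by the drgn CLI.
# fs_uuids is the global list of struct btrfs_fs_devices in fs/btrfs/volumes.c.
for fs_devs in list_for_each_entry(
    "struct btrfs_fs_devices", prog["fs_uuids"].address_of_(), "fs_list"
):
    print("FS UUID:", uuid.UUID(bytes=bytes(fs_devs.fsid.value_())))
    print("opened :", fs_devs.opened.value_())
    # Each entry keeps its member devices on the ->devices list.
    for dev in list_for_each_entry(
        "struct btrfs_device", fs_devs.devices.address_of_(), "dev_list"
    ):
        if dev.name:
            # dev->name is an RCU-protected string (struct rcu_string).
            name = cast("char *", dev.name.str.address_of_())
            print("  device:", name.string_().decode())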
From the script’s output, we confirmed our suspicion – userspace was unaware of the mounted filesystem, but the kernel still held a reference to it. As a result, the filesystem remained open from the kernel’s perspective, causing the mount operation to fail with an EEXIST error.
Here is the output of the script (trimmed for brevity):
# drgn --ctf -c /proc/kcore /usr/share/drgn/contrib/btrfs_print_fs_uuids_cache.py
------------------------------------------------------------
FS Devices:
FS UUID: 8bd2037b-2d14-42a7-90d7-49200ecc6f70 <<<<<<<<<<<<<<<<<<<<<<<< Matching UUID
...
opened: (int)1 <<<<<<<<<<<<<<<<<<<<<<<< FS is open
...
Super Block:
sb ref count: (int)1
sb s_count: (int)1
...
sb s_id (char [32])"sdc1" <<<<<<<<<<<<<<<<<<<<<<<< Different Device name
...
Note: You may have noticed that I haven’t provided any debuginfo packages to drgn. That’s because drgn is using CTF (Compact C Type Format) here, a lightweight debugging format that Oracle helped bring to the Linux community. CTF is designed to be extremely compact: while a kernel debuginfo package typically consumes around 4.7 GiB, the latest UEK7 kernel’s CTF data is only about 14 MiB. With CTF support, you can start debugging the UEK kernel immediately, without installing large debuginfo packages. You can read more about it here.
You can see that there is a device in the fs_uuids cache with the same UUID as the filesystem being mounted, but with a different device name (“sdc1”). The opened field confirms that the filesystem is still open (i.e., it is mounted).
As expected, userspace tools such as mount and lsblk are not aware that sdc1 is mounted.
# lsblk | grep sdc1
# mount | grep sdc1
#
With this information, we checked with the customer, and they confirmed that they had indeed lazily unmounted the filesystem and detached it from the system without waiting for the references held by some processes to be released. As a result, the filesystem remained mounted in the kernel (it stays mounted as long as the processes using it are still active) but was not visible to userspace. When they reattached the device, the kernel assigned it a new device name. Upon attempting to mount it again, the kernel detected that a filesystem with the same UUID but a different device name already existed in the cache and was still marked as open, causing the mount operation to fail with an EEXIST error.
In this way, drgn helped us access internal btrfs kernel caches directly, confirm our hypothesis about a lingering filesystem reference, and pinpoint the exact cause of the mount failure – all from the running kernel, and without needing downtime for a vmcore.
Case 2: RDS – Connection Stuck
On a cluster of EXADATA VMs, the RDS (Reliable Datagram Service) interfaces were observed to be flapping, which caused the Cluster Synchronization Service (CSS), the component responsible for monitoring network health, to crash the systems.
From the logs, it appeared that network communication between nodes was missing for a prolonged period, leading to timeouts and node evictions.
2023-01-15 18:32:08.899 [OCSSD(25871)]CRS-1612: Network communication with node vm01 (4) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.990 seconds
2023-01-15 18:32:09.900 [OCSSD(25871)]CRS-1612: Network communication with node vm02 (2) has been missing for 50% of the timeout interval. If this persists, removal of this node from cluster will occur in 14.230 seconds
Further checks showed that the route resolution step during RDS connection setup was taking longer than usual. The RDS connection between the DB and cell node goes through the following steps, which normally complete within a couple of seconds:
- Initiate connection
- Address resolution (RDMA_CM_EVENT_ADDR_RESOLVED)
- Route resolution (RDMA_CM_EVENT_ROUTE_RESOLVED)
- Connection established (RDMA_CM_EVENT_ESTABLISHED)
The logs indicate that the route resolution step is taking longer than expected:
kworker/u256:33-355025 [007] 1844011.231580: rds_rdma_cm_event_handler: RDS/IB: <::ffff:100.107.0.50,::ffff:100.107.0.15,0> lguid 0x0 fguid 0x0 qps <0,0> dev <none> reason [RDMA_CM_EVENT_ADDR_RESOLVED], err [0] <<<<<<<< Step 2
kworker/u256:21-355012 [002] 1844021.747680: rds_rdma_cm_event_handler: RDS/IB: <::ffff:100.107.0.50,::ffff:100.107.0.15,0> lguid 0x0 fguid 0x0 qps <0,0> dev <none> reason [RDMA_CM_EVENT_ROUTE_RESOLVED], err [0] <<<<<<<< Step 3
The time gap between Step 2 and Step 3 is around 10 seconds (1844021.747680 – 1844011.231580), which is unusually long.
To investigate, we used a drgn script (part of corelens) to examine the RDS connection states from the vmcore. The results revealed that several RDS connections were stuck in the RDS_CONN_CONNECTING state, while the corresponding InfiniBand Connection Manager (CM) state was IB_CM_REP_SENT.
>>> rds.rds_conn_info(prog)
rds_conn ib_conn Conn Path ToS Local Addr Remote Addr State NextTX NextRX Flags Conn-Time i_cm_id RDMA CM State ib_cm_id IB CM State
0xffff931622ac4f30 0xffff931542006000 0xffff9326c1babc00 0 100.107.0.42 100.107.0.16 RDS_CONN_CONNECTING 301712 1361778 -c-- N/A 0xffff930ff1165800 RDMA_CM_CONNECT 0xffff9311ee01e400 IB_CM_REP_SENT
0xffff9326c10f5950 0xffff931617dec000 0xffff931739761400 0 100.107.0.42 100.107.0.20 RDS_CONN_CONNECTING 164831 251587 -c-- N/A 0xffff9326bc047800 RDMA_CM_CONNECT 0xffff9311ee01c000 IB_CM_REP_SENT
0xffff9314025b8d80 0xffff9313f7d5c000 0xffff931538dab400 0 100.107.0.42 100.107.0.18 RDS_CONN_CONNECTING 350199 483029 -c-- N/A 0xffff9313f701b400 RDMA_CM_CONNECT 0xffff930eaf295a00 IB_CM_REP_SENT
This state combination suggested that the MAD packets were not actually being transmitted on the wire by the card’s firmware.
With the help of drgn, we started from the stack trace of one of the stuck kworker threads, pulled the cm_id argument out of one of its stack frames, converted it to the containing cm_id_private, and followed its MAD send buffer to the MAD agent’s qp_info. Walking qp_info.send_queue.list and qp_info.overflow_list for the affected connections showed MAD packets stuck in both the send and overflow queues.
>>> stack = prog.stack_trace(355030)
>>> cm_id = stack[8]["cm_id"]
>>> cm_id_priv = container_of(cm_id, "struct cm_id_private", "id")
>>> msg = cm_id_priv.msg;
>>> mad_sw = container_of(msg, "struct ib_mad_send_wr_private", "send_buf")
>>> sw = Object(prog, "struct ib_mad_send_wr_private *",0xffff9222bed0a100)
>>> qp_info = sw.mad_agent_priv.qp_info
>>> sq_list = qp_info.send_queue.list
>>> ovfl_list = qp_info.overflow_list
>>> count = 1
>>> for mad_list in list_for_each_entry("struct ib_mad_list_head", sq_list.address_of_(), "list"):
send_wr = container_of(mad_list, "struct ib_mad_send_wr_private", "mad_list")
print("{:<6d}: {:<#20x} (sq_list)".format(count, send_wr.value_()))
count += 1
1 : 0xffff9205759c5100 (sq_list)
2 : 0xffff920e7269b500 (sq_list)
3 : 0xffff920868158d00 (sq_list)
4 : 0xffff92107296a900 (sq_list)
5 : 0xffff920822766d00 (sq_list)
.
.
.
123 : 0xffff920722d63100 (sq_list)
124 : 0xffff920e72917900 (sq_list)
125 : 0xffff920dfde4bd00 (sq_list)
126 : 0xffff920e72913100 (sq_list)
127 : 0xffff920dfde4e900 (sq_list)
128 : 0xffff920e72915500 (sq_list)
>>> count = 1
>>> for mad_list in list_for_each_entry("struct ib_mad_list_head", ovfl_list.address_of_(), "list"):
send_wr = container_of(mad_list, "struct ib_mad_send_wr_private", "mad_list")
print("{:<6d} : {:<#20x} (ovfl_list)".format(count, send_wr.value_()))
count += 1
1 : 0xffff920e72914500 (ovfl_list)
2 : 0xffff920e72915900 (ovfl_list)
3 : 0xffff920c6941ed00 (ovfl_list)
4 : 0xffff920e7278d500 (ovfl_list)
5 : 0xffff9222c0d7d500 (ovfl_list)
.
.
.
80 : 0xffff92125da20d00 (ovfl_list)
81 : 0xffff92120cc0b500 (ovfl_list)
82 : 0xffff920811126100 (ovfl_list)
83 : 0xffff92051a795100 (ovfl_list)
84 : 0xffff9222c5813100 (ovfl_list)
85 : 0xffff921072969d00 (ovfl_list)
...
These findings confirmed that the issue was not in RDS itself, but rather with MAD packets stuck at the InfiniBand layer.
With this evidence, we approached the vendor, who later confirmed that it was indeed a card firmware issue.
The data gathered through drgn allowed us to pinpoint where the problem truly lay and guided our investigation in the right direction. Working with the InfiniBand team, we were able to deliver a fix for the issue.
You can find the drgn script we used here. This script is now part of corelens – refer to the rds_conn_info() function to see how we extracted the RDS connection details from the vmcore.
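At its heart, that function is a walk of the kernel’s RDS connection hash table. Here is a condensed sketch of the idea, assuming the upstream net/rds field names (rds_conn_hash, c_hash_node, c_faddr, c_tos, c_path, cp_state); UEK’s RDS carries additional fields, which is where the IB-specific columns above come from:

from socket import AF_INET6, inet_ntop

from drgn.helpers.linux.list import hlist_for_each_entry

# Connection-path states are preprocessor constants in net/rds/rds.h, so they
# are not available through type information.
RDS_CONN_STATES = ["DOWN", "CONNECTING", "DISCONNECTING", "UP", "RESETTING", "ERROR"]

conn_hash = prog["rds_conn_hash"]  # global hash of struct rds_connection
for i in range(len(conn_hash)):
    for conn in hlist_for_each_entry(
        "struct rds_connection", conn_hash[i].address_of_(), "c_hash_node"
    ):
        state = conn.c_path[0].cp_state.counter.value_()
        name = RDS_CONN_STATES[state] if state < len(RDS_CONN_STATES) else str(state)
        faddr = inet_ntop(AF_INET6, bytes(conn.c_faddr.in6_u.u6_addr8.value_()))
        print(f"{conn.value_():#x} tos={int(conn.c_tos)} remote={faddr} "
              f"state=RDS_CONN_{name}")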
Case 3: Zombie Memory Control Groups
On one of our customer’s EXADATA machines, we noticed that the available memory was quite low and that per-CPU memory usage was unusually high. While investigating, we observed that clearing out zombie memory control groups (zombie memcgs for short) consistently freed up per-CPU memory. This led us to conclude that zombie memcgs were exhausting the system memory.
Zombie Memory Control Groups are memory control groups that have been deleted from /sys/fs/cgroup/ using rmdir. When a cgroup is deleted in this way, all associated kernel data structures are freed only if no active references remain. However, if there are still active references, those structures stay allocated, potentially holding onto significant memory. The memory overhead grows with the number of CPUs and NUMA nodes, and over time, these “zombie” memcgs can cause memory congestion.
Unfortunately, there’s no standard way to list all zombie memcgs and accurately measure their memory usage. We needed a way to identify such memcgs, especially on live systems – and drgn is the perfect tool for this.
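To give a flavor of the idea, here is a minimal sketch that approximates a zombie-memcg count using drgn’s cgroup helpers. It assumes the root_mem_cgroup global and the CSS_ONLINE flag value from include/linux/cgroup-defs.h; the corelens functions described next are the supported way to do this:

from drgn.helpers.linux.cgroup import css_for_each_descendant_pre

CSS_ONLINE = 1 << 1  # from the css flags enum in include/linux/cgroup-defs.h

# Walk every memcg css below the root; the ones that have been offlined
# (rmdir'd) but not yet released are the zombies.
root_css = prog["root_mem_cgroup"].css.address_of_()
zombies = sum(
    1
    for css in css_for_each_descendant_pre(root_css)
    if not (css.flags & CSS_ONLINE)
)
print("zombie memcgs:", zombies)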
As part of the corelens project, we developed several functions to extract key information about memory control groups (memcgs).
- get_num_dying_mem_cgroups() – This function returns the number of zombie memory control groups currently present in the kernel.
>>> from drgn_tools.kernfs_memcg import *
>>> get_num_dying_mem_cgroups(prog)
10565
The system was holding 10,565 zombie memory control groups at the time.
- dump_memcg_kernfs_nodes() – This function outputs all kernfs_node objects associated with memory control groups (memcgs).
>>> dump_memcg_kernfs_nodes(prog)
kernfs_node: 0xffff8fba8020ed80 /system.slice
kernfs_node: 0xffff8fba80343000 /system.slice/sysstat-collect.service
kernfs_node: 0xffff8fba80343300 /system.slice/sysstat-collect.service
kernfs_node: 0xffff8fba80343380 /system.slice/sysstat-collect.service
kernfs_node: 0xffff8fba80343500 /system.slice/pmlogger_check.service
kernfs_node: 0xffff8fba80343600 /system.slice/dnf-makecache.service
kernfs_node: 0xffff8fba80343780 /system.slice/pmlogger_daily.service
kernfs_node: 0xffff8fba80343800 /system.slice/sysstat-collect.service
kernfs_node: 0xffff8fba80343880 /system.slice/proxyt-update.service
kernfs_node: 0xffff8fba80343b80 /system.slice/dnf-makecache.service
kernfs_node: 0xffff8fba80343c00 /system.slice/sysstat-collect.service
kernfs_node: 0xffff8fba80343c80 /system.slice/sysstat-collect.service
- dump_page_cache_pages_pinning_cgroups() – This function lists all pages pinned by memory control groups (memcgs) and the files cached by those pages.
To display only the pages associated with zombie cgroups, you can filter the output for “ZOMBIE”. I used a bit of Python to capture the output and print only the zombie cgroups’ pinned pages:
>>> from drgn_tools.kernfs_memcg import *
>>> from contextlib import redirect_stdout
>>> import io
>>> buf = io.StringIO()
>>> with redirect_stdout(buf):
... dump_page_cache_pages_pinning_cgroups(prog) # Redirect the output to a buffer
...
>>> count = 0
>>> for line in buf.getvalue().splitlines():
... if "ZOMBIE" in line:
... print(line)
... count += 1
... if count >= 5:
... break
...
page: 0xffffe7a5442801c0 cgroup: /system.slice/dnf-makecache.service state: ZOMBIE path: /var/cache/dnf/ol8_ksplice-5610f1234dfc73b7/repodata/8a8e75807757cb31aa5fb80a6904b3bfcaccbf22e0bc7011050961f61e37df81-filelists.xml.gz
page: 0xffffe7a544284500 cgroup: /system.slice/setroubleshootd.service state: ZOMBIE path: /usr/lib64/libdb-5.3.so
page: 0xffffe7a544284700 cgroup: /system.slice/pmlogger_check.service state: ZOMBIE path: /var/oled/pcp/pmlogger/sridara-source/20251113.0.xz
page: 0xffffe7a5442848c0 cgroup: /system.slice/pmlogger_check.service state: ZOMBIE path: /var/oled/pcp/pmlogger/sridara-source/20251117.00.10.0.xz
page: 0xffffe7a544284900 cgroup: /system.slice/NetworkManager-wait-online.service state: ZOMBIE path: /usr/lib64/libnm.so.0.1.0
Note: The outputs shown here are trimmed for brevity.
The output of dump_page_cache_pages_pinning_cgroups() will be extremely large, so redirecting it into a buffer/file may take a long time. Alternatively, if you become comfortable with drgn, you can directly modify the drgn_tools source code to filter the output and customize the scripts as needed.
You can find the relevant code corresponding to these functions in this corelens script.
By running these functions on live kernels, we could determine the number of zombie memcgs, assess their memory impact, and trace back the pages – and eventually the processes – responsible for creating them.
Since the Linux kernel currently lacks an official fix for zombie memcgs, identifying the offending processes is crucial. In some cases, applications may need to adjust how they use cgroups. drgn made it possible to perform this deep analysis on live systems, giving us actionable insights to mitigate memory overhead caused by zombie memcgs.
For a detailed walkthrough of this case, refer to our blog: Zombie memcg issues.
Case 4: Long RDS Pings
We observed an issue where RDS pings were taking unusually long, caused by CPU scale-down events.
RDS ping latency before the scale down event:
# rds-ping 192.200.2.16 -I 192.200.2.6 -Q 2
1: 79 usec
2: 72 usec
3: 69 usec
4: 77 usec
5: 71 usec
6: 58 usec
7: 67 usec
8: 65 usec
9: 62 usec
10: 71 usec
RDS ping latency after the scale down event:
# rds-ping 192.200.2.16 -I 192.200.2.6 -Q 2
1: 2641 usec
2: 1970 usec
3: 2958 usec
4: 3008 usec
5: 105 usec
6: 1870 usec
7: 1972 usec
8: 1978 usec
9: 2007 usec
10: 2007 usec
You can see that the latency increased drastically after the scale-down event: the average latency jumped from around 70 usec before the event to around 2000 usec after it.
The latency arose because worker threads responsible for sending the replies were not being scheduled promptly.
Here’s what was happening:
- RDS ping packets sent from a source node reach the destination node, where reply packets (pongs) are created and placed in a send queue associated with the connection.
- In some corner cases, sending the pongs has to be deferred through delayed work, which gets queued on a particular (preferred) CPU. If that CPU is offline when the delayed work is queued, the work never gets scheduled for as long as the CPU remains offline.
- The packets in the send queue wait either for the worker to be scheduled (which wouldn’t happen as long as the CPU remains offline) or for another code path to push them to the HCA.
- In short, the offline CPU caused work functions to be stuck in the timer subsystem, leading to the observed long RDS ping latency. Once we confirmed this root cause, we were able to deliver a fix.
In this case, we used drgn to dump the delayed works with unexpired timers on the workqueues of offline CPUs.
CPU: 20 state: offline
timer: ff1d52cd4bf2e0d8 tte(jiffies): -162181455 work: ff1d52cd4bf2e098 func: UNKNOWN: 0xffffffffc0819b40
timer: ff1d52d220ada8d8 tte(jiffies): -162157443 work: ff1d52d220ada898 func: UNKNOWN: 0xffffffffc0819b40
CPU: 22 state: offline
timer: ff1d52da7618c8d8 tte(jiffies): -8940 work: ff1d52da7618c898 func: UNKNOWN: 0xffffffffc0819b40
timer: ff1d53090742e8d8 tte(jiffies): -9003 work: ff1d53090742e898 func: UNKNOWN: 0xffffffffc0819b40
You can find the relevant code in the show_unexpired_delayed_works() function in this corelens script.
Without drgn, verifying this would have required manually iterating through each CPU, each workqueue, and each timer – a tedious and error-prone process. With drgn’s programmability, we could write a small program to perform this entire analysis automatically on a live kernel or a vmcore.
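For illustration, here is a condensed sketch of such a program, assuming the upstream timer-wheel layout (per-CPU timer_bases, with delayed works identified by their timer pointing at delayed_work_timer_fn); the corelens implementation handles more corner cases:

from drgn import container_of
from drgn.helpers.linux.cpumask import for_each_online_cpu, for_each_possible_cpu
from drgn.helpers.linux.list import hlist_for_each_entry
from drgn.helpers.linux.percpu import per_cpu

# Delayed works park their timer in the per-CPU timer wheel, with the timer
# callback set to delayed_work_timer_fn.
dwork_timer_fn = prog.symbol("delayed_work_timer_fn").address
jiffies = prog["jiffies"].value_()
online = set(for_each_online_cpu(prog))

for cpu in for_each_possible_cpu(prog):
    if cpu in online:
        continue  # only interested in offline CPUs
    bases = per_cpu(prog["timer_bases"], cpu)
    for i in range(len(bases)):
        vectors = bases[i].vectors
        for j in range(len(vectors)):
            for timer in hlist_for_each_entry(
                "struct timer_list", vectors[j].address_of_(), "entry"
            ):
                if timer.function.value_() != dwork_timer_fn:
                    continue  # not a delayed work timer
                dwork = container_of(timer, "struct delayed_work", "timer")
                tte = timer.expires.value_() - jiffies  # time to expiry, in jiffies
                print(f"CPU {cpu}: timer {timer.value_():#x} tte(jiffies): {tte} "
                      f"work {dwork.work.address_:#x} "
                      f"func {dwork.work.func.value_():#x}")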
Case 5: Exposing IRQ Usage
In one scenario, we observed that all host IRQs were exhausted. While exhaustion itself wasn’t necessarily a problem, we wanted to understand how different drivers were using the IRQs.
Since drgn can read kernel data structures, we were able to write a script to read per-CPU IRQ vectors and their usage. This allowed us to quickly analyze IRQ distribution across drivers. The script is now part of corelens – you can check it out here.
Here is the output of the script:
System vector matrix:
Vector bits : 256
Vector allocation start : 32
Vector allocation end : 236
Vector allocation size : 204
Global available : 785
Global reserved : 6
Total allocated : 16
Online maps : 4
Per-CPU IRQ vector map (CPU 0):
Available : 196
Allocated : 4
Managed : 1
Managed Allocated : 0
Per-CPU IRQ vector map (CPU 1):
Available : 196
Allocated : 4
Managed : 1
Managed Allocated : 0
Per-CPU IRQ vector map (CPU 2):
Available : 197
Allocated : 3
Managed : 1
Managed Allocated : 0
Per-CPU IRQ vector map (CPU 3):
Available : 196
Allocated : 5
Managed : 1
Managed Allocated : 1
CPU : 0
00000000 00000000 00000000 00000000 01101000 00000000 11111111 11111111
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU : 1
00000000 00000000 00000000 00000000 01101010 00000000 00100000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU : 2
00000000 00000000 00000000 00000000 01101010 00000000 00100000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU : 3
00000000 00000000 00000000 00000000 01111100 00000000 00100000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU      Total Vectors   Used Vectors   Reserved Vectors   Available Vectors
0        256             19             53                 184
1        256             5              53                 198
2        256             5              53                 198
3        256             6              53                 197
Total    1024            35             212                777
As you can see, drgn allows us to extract highly detailed and insightful information from the system.
Without drgn, this would have required:
- Requesting a vmcore
- Manually iterating through each CPU and each IRQ vector to verify usage
- Generating and uploading the vmcore (time-consuming), and
- Repeating the process for multiple machines if necessary – especially challenging during live customer investigations.
With drgn, we could write a single script that works on live kernels or vmcores, providing immediate visibility into IRQ usage without the cumbersome manual process.
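For a flavor of what such a script does under the hood, here is a minimal sketch that reads the x86 vector matrix. It assumes the struct irq_matrix and struct cpumap layout from kernel/irq/matrix.c and the vector_matrix global from arch/x86/kernel/apic/vector.c; the corelens script additionally decodes the per-vector bitmaps and builds the summary table shown above:

from drgn.helpers.linux.cpumask import for_each_online_cpu
from drgn.helpers.linux.percpu import per_cpu_ptr

m = prog["vector_matrix"]  # x86 IRQ vector allocation matrix
print(f"Vector bits             : {m.matrix_bits.value_()}")
print(f"Vector allocation start : {m.alloc_start.value_()}")
print(f"Vector allocation end   : {m.alloc_end.value_()}")
print(f"Global available        : {m.global_available.value_()}")
print(f"Total allocated         : {m.total_allocated.value_()}")

# Per-CPU allocation maps live behind a per-CPU pointer in the matrix.
for cpu in for_each_online_cpu(prog):
    cm = per_cpu_ptr(m.maps, cpu)
    print(f"CPU {cpu}: available={cm.available.value_()} "
          f"allocated={cm.allocated.value_()} "
          f"managed={cm.managed.value_()} "
          f"managed_allocated={cm.managed_allocated.value_()}")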
Conclusion
Across all these examples, one thing is clear: drgn transforms kernel debugging from a slow, manual, error-prone process into a programmable and scalable workflow. It enables live kernel inspection through flexible Python scripting, allowing engineers to obtain precise diagnostic information quickly without disturbing production systems. Complex issues can be analyzed in real time, often while the system is still running, leading to faster turnaround and minimal customer impact.
On top of that, corelens, built on drgn, offers a set of scripts that gather comprehensive diagnostic information from various kernel subsystems and automate much of this work, significantly speeding up the debugging process.