Introduction
When a memory cgroup (memcg) is removed from user space, it goes offline and becomes invisible to user space, but the kernel still maintains its internal data structures for it. In other words, the memcg becomes a zombie. Linux kernels as of version 6.12 address some of the ways zombie memcgs can arise, but they do not fix zombies left behind due to pinning from pagecache pages. This blog shows a Linux user how to determine whether this problem exists on a live system or in a kernel core dump (vmcore).
Background
The Linux kernel provides control groups (or "cgroups") to hierarchically organize processes, so that their usage of different system resources, such as CPUs and system memory, can be controlled and monitored. Memory control groups (or "memcgs") are used to manage the memory resources allotted to a process or group of processes.
The kernel’s cgroup interface is provided through a pseudo filesystem called cgroupfs, which is usually mounted at /sys/fs/cgroup/<controller> for cgroup v1 or at /sys/fs/cgroup for cgroup v2. memcgs are created/removed by creating/removing directories under cgroupfs.
A memcg removed from user space (using rmdir or some other method) becomes offline, and new allocations can no longer be charged to it. But any prior allocations done on behalf of this memcg and charged to it hold a reference (refcount) to the memcg. Depending on the kernel version, such allocations can be of different types: slab objects, kernel stacks, pagecache pages, memory shared between applications of two or more cgroups, etc. The kernel destroys its internal memcg objects only when it is sure that there are no more users of that memcg, i.e., when the reference count drops to zero. This means that kernel objects corresponding to an offlined memcg can remain in memory for a long time (even indefinitely), and this causes memory wastage due to zombie memcgs.
The pinning of offlined memcgs by several of these allocation types has been addressed in newer kernels, but the issue of pagecache pages pinning offlined memcgs for a long time still exists. This blog describes some methods to identify this problem on a live system or from a vmcore.
It is assumed that the reader has working knowledge of the bcc toolset and drgn.
Memory consumption of memcg Objects
Memcg objects have a significant memory footprint which scales with the number of CPUs and NUMA nodes. For example, in a v5.4.233 kernel a memcg object (struct mem_cgroup) is more than 2.5 KB in size and gets allocated from the kmalloc-4k slab cache, as can be seen in the following drgn snippet:
>>> prog.type("struct mem_cgroup").size
2688
Each mem_cgroup has an associated cgroup object as well, and on this kernel it is almost 1 KB in size:
>>> prog.type("struct cgroup").size
968
It should be noted that the size of these objects can differ across kernel versions, or across different configs of the same kernel version, but in any case these objects have a significant memory footprint.
Further, each memcg uses percpu memory to maintain various statistics.
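On this v5.4-based kernel the per-CPU stat block is an allocation of struct memcg_vmstats_percpu (plus per-node lruvec stats). The structure and field names are version specific, so treat the following snippet as illustrative rather than exact; its size can be checked the same way as the sizes above:

>>> prog.type("struct memcg_vmstats_percpu").size   # size of one per-CPU stats block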
Together these percpu blocks take more than 1.5 KB per CPU on this kernel. So on a 256 CPU system we need around 256 * 1.5 KB = 384 KB just to store percpu stats for each memory cgroup. Then there is some memory needed for per-node stat maintenance as well.
So we can see that on such a system each mem_cgroup object needs around 400 KB of memory, and if we have, say, 10K zombie memcgs, we are wasting around 4 GB (unless the kernel reclaims it). This bloats further on larger systems with more CPUs and more NUMA nodes.
Such numbers of zombie memcgs are not impractical. In fact, on large scale systems that have been up for months and where this issue is present, we have seen zombie memcgs occupying hundreds of GBs of memory.
Memory taken by zombie memcgs is not lost forever; for example, a memory reclaim can free up the slab or pagecache pages that were pinning the memcg, and this in turn can free up the memcg objects. But memory reclaim has its own overheads and we would not like to trigger it every now and then. Further, reclaim may be triggered too late, or it may take too long to complete, and this in turn can cause other issues. In any case, wasting memory to hold objects corresponding to offline memcgs is not an optimal use of system memory resources.
How to identify the problem of memory consumption due to zombie memcgs
In order to figure out whether we are seeing high memory consumption due to zombie memcgs, we first need to see whether we have zombie memcgs at all. This can be done by taking the difference between the number of memcgs reported in /proc/cgroups and the number seen under /sys/fs/cgroup/memory (for v1) or under /sys/fs/cgroup (for v2). /proc/cgroups gives the total number of memcgs (including zombies) while the cgroupfs directory count gives the number of active memcgs. The difference is the number of zombie memcgs at that point in time.
The below snippet is from a system (using cgroup v1) with more than 10K zombie memcgs. From cgroupfs we see the number of active memcgs as:
# find /sys/fs/cgroup/memory/ -type d | wc -l
140
and from procfs we see the total number of memcgs as:
# cat /proc/cgroups | grep memory | awk '{print $3}'
10206
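The same check can be scripted. The following is a small sketch of mine (assuming cgroup v1 with the memory controller mounted at /sys/fs/cgroup/memory), not an existing tool:

import os

# Total memcgs (including zombies) from /proc/cgroups.
with open("/proc/cgroups") as f:
    total = next(int(line.split()[2]) for line in f if line.startswith("memory"))

# Active memcgs: one directory per memcg under cgroupfs (os.walk yields one
# entry per directory, including the root, just like 'find -type d').
active = sum(1 for _ in os.walk("/sys/fs/cgroup/memory"))

print("zombie memcgs:", total - active)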
If we are dealing with a vmcore or kcore, we can use the drgn helper named get_num_dying_mem_cgroups from drgn_tools to see how many (if any) zombie memcgs there are. For example, the following snippet is from the above mentioned system, but gets the number of zombie memcgs from kcore:
>>> from drgn_tools.kernfs_memcg import *
>>> get_num_dying_mem_cgroups(prog)
10066
Once we have verified that there are zombie memcgs, we don’t strictly need to verify the memory consumption due to them (the corresponding kernel data structures are known to consume memory). Still, if you want to see the result of this issue in terms of memory consumption, it shows up in the Percpu field of /proc/meminfo:
# cat /proc/meminfo | grep Percpu
Percpu: 1674240 kB
Further, depending on the size of the mem_cgroup and cgroup objects, one can check slabinfo for the corresponding kmalloc-X slabs to see the number of objects in use. For example, on a mostly idle system that has around 10K zombie memcgs, with mem_cgroup allocated from the kmalloc-4k slab cache, we can see that the number of in-use kmalloc-4k objects is very close to the number of zombie memcgs:
# cat /proc/slabinfo | grep kmalloc-4k | grep -v dma
kmalloc-4k 10199 10216 4096 8 8 : tunables 0 0 0 : slabdata 2777 2777 0
This can be further correlated with the increase/decrease in the number of zombie memcgs that we might be seeing.
Zombie memcg(s) Due To Page Cache Pages
Let’s now see some ways to debug zombie memcg issues caused by pagecache pages, and how to find the corresponding files, on a live system or from a vmcore.
Once we know the files, we can find the applications using those files, and once we know the application we can use a workaround (since a kernel fix is not available yet) to change the application’s memcg creation and usage pattern.
On a live system, if the number of zombie memcgs is increasing in a noticeable manner, or if we know the time window during which they get created, we can use a bcc script (zombiememcgstat.py) to identify them.
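The script itself is not reproduced here. Purely to illustrate the idea, the following is a minimal bcc sketch of my own (not the actual zombiememcgstat.py): it assumes that mem_cgroup_css_offline and mem_cgroup_css_free can be kprobed, and it records the task that removed the cgroup rather than the one that created it:

#!/usr/bin/env python3
# Illustrative sketch only, not the actual zombiememcgstat.py.
from bcc import BPF
import time

bpf_text = """
struct info_t {
    u64 offline_ns;
    u32 pid;
    char comm[16];
};
BPF_HASH(offlined, u64, struct info_t);

// Record when a memcg's css goes offline, keyed by the css pointer.
int trace_offline(struct pt_regs *ctx, void *css)
{
    struct info_t info = {};
    u64 key = (u64)css;

    info.offline_ns = bpf_ktime_get_ns();
    info.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&info.comm, sizeof(info.comm));
    offlined.update(&key, &info);
    return 0;
}

// Once the css is freed it is no longer a zombie, so drop the entry.
int trace_free(struct pt_regs *ctx, void *css)
{
    u64 key = (u64)css;
    offlined.delete(&key);
    return 0;
}
"""

b = BPF(text=bpf_text)
b.attach_kprobe(event="mem_cgroup_css_offline", fn_name="trace_offline")
b.attach_kprobe(event="mem_cgroup_css_free", fn_name="trace_free")

THRESHOLD = 30  # seconds a memcg must stay offline before we report it
print("Tracing memcg offline/free events... Hit Ctrl-C to end")
while True:
    try:
        time.sleep(THRESHOLD)
    except KeyboardInterrupt:
        break
    now_ns = time.monotonic_ns()  # same clock as bpf_ktime_get_ns()
    print("%-18s %-16s %-8s %s" % ("CSS", "COMM", "PID", "AGE(secs)"))
    for k, v in b["offlined"].items():
        age = (now_ns - v.offline_ns) / 1e9
        if age >= THRESHOLD:
            print("0x%-16x %-16s %-8d %d" % (k.value, v.comm.decode(), v.pid, age))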
zombiememcgstat.py shows zombie memcgs that have been offline for more than 30 seconds (by default), along with the pid and comm of the task that created each one and how long it has been offline:
# ./zombiememcgstat.py
Show zombie memcgroups at specified intervals... Hit Ctrl-C to end
MEMCG              NAME                 COMM       PID      AGE(secs)
0xffff9efd798ce000 session-8047.scope   systemd    1        84
0xffff9f017b3c3000 session-8049.scope   systemd    1        84
0xffff9f06bd247000 session-8054.scope   systemd    1        83
0xffff9eff7b553000 session-8060.scope   systemd    1        82
0xffff9efdbfd58000 session-8065.scope   systemd    1        81
0xffff9efd7e9db000 session-8070.scope   systemd    1        80
0xffff9f05bbdcc000 session-8076.scope   systemd    1        79
0xffff9f037f493000 session-8081.scope   systemd    1        78
0xffff9f073f484000 dummy                python3    473782   66
One can use this script to monitor at different intervals (other than 30 seconds), or to monitor only memcgs created by tasks with a specific pid or comm; see the script’s examples for how the other options can be used. So far we have seen one way to locate zombie memcgs on a live system. But if zombie memcgs slowly build up over time, or if we only have a vmcore, then we have to rely on vmcore/kcore to find them and the corresponding application(s).
For example, we may have a system that has been up for, say, a year and shows thousands of zombie memcgs. It may be that a slow increase in the number of zombie memcgs has accumulated over this long time window, or it may be that the workload has certain time windows when the count increases in bursts and remains stable until the next burst.
In other words, when we need to locate zombie memcgs from the current system state, or when we need to locate them in a vmcore, the above bcc script will not help. In such cases one can use drgn and the internals of the kernfs subsystem to find the applications creating zombie memcgs.
One thing to note here is that an application does not explicitly create a zombie memcg; it is just that the way an application uses its memcgs can leave them as zombies after the application is gone. I will show an example and workaround at the end of this blog, but first let’s see how to find zombie memcgs from a vmcore or kcore.
Finding all zombie memcgs
Zombie memcgs are not present in any hierarchy tree so we need to rely on kernfs internals to locate the dead or zombie memcgs.
Kernfs is a kernel subsystem that can be used to implement a pseudo file system. The kernel exports its cgroup interface via a pseudo file system called cgroupfs, and kernfs acts as the backend of this pseudo file system. Kernfs also acts as a backend for sysfs, so it is important to distinguish kernfs objects that actually correspond to a cgroup from other kernfs objects. This can be done using the fact that each cgroup object has a pointer to a corresponding kernfs object (struct kernfs_node), and each kernfs_node object has a private void * pointer (priv) which points back to the cgroup object if the kernfs_node corresponds to a cgroup. So using the two functions kernfs_node_of_cgroup and kernfs_node_of_memcgroup from drgn_tools, we can check whether a given kernfs_node object actually corresponds to a memcg or not.
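A minimal drgn sketch of that cross-check is shown below (illustrative only; kernfs_node_of_cgroup and kernfs_node_of_memcgroup in drgn_tools are the real helpers, and kn_points_to_cgroup is a hypothetical name). kn is assumed to be a struct kernfs_node * drgn Object:

from drgn import FaultError, cast

def kn_points_to_cgroup(kn):
    """Hypothetical helper: does this kernfs_node's priv point back to a cgroup that owns it?"""
    if not kn.priv:
        return False
    cgrp = cast("struct cgroup *", kn.priv)
    try:
        return cgrp.kn == kn        # a real cgroup points back at its kernfs_node
    except FaultError:
        return False                # priv pointed at something unreadable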
So far we have seen how to confirm whether a given kernfs_node object corresponds to a memcg. But how can we do this for all kernfs_node objects?
kernfs_node objects are allocated from a dedicated slab cache named kernfs_node_cache, so one can traverse the objects belonging to this slab cache to find all kernfs_node objects.
The drgn function named for_each_kernfs_node from drgn_tools does exactly that.
Once we have the kernfs_node objects, we can use kernfs_node_of_cgroup and kernfs_node_of_memcgroup, mentioned earlier, to filter out the kernfs_nodes belonging to memcgs, as has been done in dump_memcg_kernfs_nodes of drgn_tools. Next, from the priv pointer of these kernfs_node objects we can get the cgroup objects and check their status in cgroup.self.flags. For zombie memcgs cgroup.self.flags will be 0. This way we can find all the zombie memcgs in the system. The next section describes how to find zombie memcgs pinned by pagecache pages.
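Before that, here is a condensed drgn sketch of the walk just described, using drgn’s generic slab helpers; the drgn_tools helpers do essentially this with more validation, and kn_points_to_cgroup() is the hypothetical cross-check from the earlier sketch:

from drgn import cast
from drgn.helpers.linux.slab import find_slab_cache, slab_cache_for_each_allocated_object

cache = find_slab_cache(prog, "kernfs_node_cache")
for kn in slab_cache_for_each_allocated_object(cache, "struct kernfs_node"):
    if not kn_points_to_cgroup(kn):       # cross-check from the earlier sketch
        continue
    cgrp = cast("struct cgroup *", kn.priv)
    # Restricting this further to the memory controller's hierarchy is what
    # kernfs_node_of_memcgroup does in drgn_tools; it is skipped here for brevity.
    if cgrp.self.flags == 0:              # neither CSS_ONLINE nor CSS_VISIBLE: a zombie
        print(hex(cgrp.value_()), kn.name.string_().decode())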
Finding page cache pages (and corresponding files) that are pinning zombie memcgs:
Pagecache pages charged to a zombie memcg point to it via page->mem_cgroup or page->memcg_data (depending on the kernel version). To find pagecache pages that are pinning zombie memcgs, we can iterate through all pages and ignore the ones that don’t have page->mem_cgroup or page->memcg_data set. We can also ignore pages belonging to slab caches, or pages that don’t have page->mapping set, because such pages can’t belong to the pagecache.
For the remaining pages, page->mem_cgroup or page->memcg_data can point to a memcg, but we are interested only in the zombie ones. Once we have a mem_cgroup object, we can determine the memcg’s state from the cgroup_subsys_state flags of the corresponding cgroup (i.e., mem_cgroup.css.cgroup->self.flags).
For an active cgroup, the CSS_ONLINE and CSS_VISIBLE bits are set in cgroup.self.flags, but when a cgroup is removed (using rmdir) these two bits get cleared by cgroup_rmdir. So one can check mem_cgroup.css.cgroup->self.flags for these two bits to confirm whether that memcg is a zombie.
Once we have such a page, we can find its address_space, and if that corresponds to a valid inode we can get the dentry. This dentry corresponds to a file whose contents were being cached by this pagecache page. Next we can see which applications are accessing those files, and from that information we can find workarounds to mitigate or stop the zombie memcg leakage.
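A simplified drgn sketch of this flow is shown below. It assumes a v5.4-era kernel where struct page still has a mem_cgroup field (newer kernels store this information in page->memcg_data, with flag bits in the low bits); the drgn_tools helper described next handles more of these details:

from drgn import FaultError
from drgn.helpers.linux.fs import inode_path
from drgn.helpers.linux.mm import PageSlab, for_each_page

CSS_ONLINE = prog.constant("CSS_ONLINE")   # enumerator from include/linux/cgroup-defs.h

for page in for_each_page(prog):
    try:
        if PageSlab(page) or not page.mapping or not page.mem_cgroup:
            continue
        if page.mapping.value_() & 1:      # PAGE_MAPPING_ANON: not a pagecache page
            continue
        cgrp = page.mem_cgroup.css.cgroup
        if cgrp.self.flags & CSS_ONLINE:   # still online, not a zombie
            continue
        print(hex(page.value_()), inode_path(page.mapping.host))
    except FaultError:
        continue                           # skip pages we cannot read (e.g. holes in a vmcore)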
The drgn function dump_page_cache_pages_pinning_cgroups from drgn_tools lists pagecache pages, the memcgs they are charged to, the state of those memcgs, and the files cached by these pagecache pages. One can filter this output for the ZOMBIE state, or tweak the function a bit so that it shows only pages charged to zombie memcgs.
As mentioned earlier, once we know the files we can find the applications using them and work out workarounds to avoid zombie memcg creation.
Example:
As an example of how the above approach can be used to find the source of zombie memcgs, let’s consider a database system that has been up for a few months and whose users log in using ssh. For each ssh session systemd creates a new cgroup, which may remain as a zombie even after the ssh session has been terminated. Depending on the number and frequency of logins, the corresponding zombie memcgs can pile up quickly or over a period of time.
Further, it may happen that, right at the time of debugging, no new zombie memcgs are being created.
We can use the earlier mentioned drgn helper on kcore (or on the vmcore if one is available) and see which pages are pinning the zombie memcgs. In this case we see something like:
page: 0xfffff204f2f56f40 cgroup: /user.slice/user-1000.slice/session-1860.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f3bec500 cgroup: /user.slice/user-1000.slice/session-1774.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f3cdae00 cgroup: /user.slice/user-1000.slice/session-1742.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f3ce9640 cgroup: /user.slice/user-1000.slice/session-1871.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f4bc7340 cgroup: /user.slice/user-1000.slice/session-1924.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f57c21c0 cgroup: /user.slice/user-1000.slice/session-1983.scope state: ZOMBIE dpath: var/log/wtmp
page: 0xfffff204f57e9e00 cgroup: /user.slice/user-1000.slice/session-1790.scope state: ZOMBIE dpath: var/log/wtmp
So we can see that pages corresponding to the wtmp file are pinning the zombie memcgs here. wtmp is a log file that records logins and logouts, and this system had the pam_lastlog module enabled in the postlogin file:
# cat /etc/pam.d/postlogin
# Generated by authselect on Thu Feb 29 19:01:09 2024
# Do not modify this file manually.
session required pam_lastlog.so showfailed
Now, since systemd was creating a new cgroup for each ssh login, the wtmp file was getting accessed in the context of that cgroup, and the pagecache pages corresponding to the wtmp file ended up pinning the memcg.
By disabling the pam_lastlog module in the postlogin file, we could disable this behaviour and thus avoid the corresponding pagecache pages pinning the zombie memcgs.
This was one example of a user space workaround to avoid creation and accumulation of zombie memcgs.
One can also use the workaround of dropping the pagecache (echo 1 > /proc/sys/vm/drop_caches) or writing to the memory.force_empty interface just before removing the cgroup. Dropping the pagecache will reclaim pages that were pinning the memcg. Similarly, memory.force_empty will release all pages charged to the memcg, so none will be left pinning it at the time of removal. But each of these workarounds has performance overheads: dropping all of the pagecache will impact other applications too, and using memory.force_empty can block the task for long durations.
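For instance, a minimal sketch of the "empty before removing" approach for cgroup v1 (the cgroup path here is a hypothetical example):

import os

cg = "/sys/fs/cgroup/memory/app.slice/app-job1.scope"   # hypothetical example cgroup

# Ask the kernel to reclaim/uncharge everything charged to this memcg first.
# The write can block for a while if there is a lot to reclaim.
with open(os.path.join(cg, "memory.force_empty"), "w") as f:
    f.write("0")

# With nothing left charged to it, removing the cgroup should not leave a
# pagecache-pinned zombie behind.
os.rmdir(cg)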
There can be other use cases and workarounds too, but first we need to figure out which application is causing zombie memcgs and how. This blog intends to help on this front.
Of course we can’t have a workaround for all situations, but until we have a permanent fix in the kernel for this issue we can try workarounds or, in the worst case, stop the application or change it so that it utilizes cgroups differently.
Conclusion
In this article we looked at the problem of zombie memcgs caused by pinning from pagecache pages. We saw some ways to detect this issue and, where possible, workarounds to avoid it. Hopefully the information presented here will help others who run into this issue.