When cgroup v2 meets nftables: A migration story

How libcgroup saved the day

In the evolving landscape of Linux resource control, the transition from cgroup v1 to cgroup v2 has been both inevitable and challenging. While cgroup v2 introduces a unified hierarchy and improved consistency, it also brings stricter delegation rules, more complex controller enablement, and subtle behavioral changes.

In this article, we explore a real-world migration from a functional cgroup v1 (using cgrules) + nftables setup to cgroup v2. The goal was to filter OUTPUT packets from network applications after they had been migrated to their appropriate cgroups based on configured rules. Along the way, libcgroup tools like cgexec and cgrulesengd played a crucial role, both as part of the problem and, ultimately, the solution.

This post walks through the entire migration process: the initial issues, step-by-step debugging, and the workarounds that led to a successful outcome. It aims to assist others navigating similar cgroup v2 transitions, particularly those relying on libcgroup to manage network traffic.

The Problem I: Sockets left behind

The goal was to filter OUTPUT packets of various applications/tools using cgroup v2 and nftables. This setup had worked seamlessly under cgroup v1, but with the deprecation of v1, there was a need to migrate to v2. However, the migration was not as straightforward as expected.

Here’s an example of the configuration in /etc/cgrules.conf for managing firefox:

# egrep firefox /etc/cgrules.conf
*:firefox            cpu,io,memory,pids,rdma,misc john/apps/firefox/
*:firefox-bin        cpu,io,memory,pids,rdma,misc john/apps/firefox/

# nft list chain inet filter check-cgroup-user | grep firefox
    socket cgroupv2 level 3 "john/apps/firefox" tcp dport { 80, 443 } counter packets 124 bytes 7440 accept comment "https/http"
    socket cgroupv2 level 3 "john/apps/firefox" udp dport 443 counter packets 42 bytes 53471 accept comment "google quic protocol/http3"
    socket cgroupv2 level 3 "john/apps/firefox" tcp dport { 3000, 4433, 5443, 8080, 9090 } ip daddr 192.168.1.0/24 counter packets 0 bytes 0 accept comment "for firefox non standard https/http"

This worked perfectly for firefox, whose packets were filtered through nftables. However, when similar rules were applied to terminal-based tools like ping or ssh, the behavior was inconsistent: sometimes the packets were filtered correctly, and sometimes they were not.

# egrep -ir ssh /etc/cgrules.conf
*:sshfs              cpu,io,memory,pids,rdma,misc john/apps/ssh/
*:ssh                cpu,io,memory,pids,rdma,misc john/apps/ssh/

# nft list chain inet filter check-cgroup-user | grep ssh
    socket cgroupv2 level 3 "john/apps/ssh" meta l4proto tcp counter packets 2 bytes 120 accept
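
In the rules above, the level given to socket cgroupv2 matches the depth of the quoted path: john/apps/firefox has three components, hence level 3. The following is a minimal sketch of deriving the match expression from a cgrules.conf destination. The function names are hypothetical helpers introduced for illustration, not part of nftables or libcgroup.

```shell
#!/bin/sh
# Hypothetical helpers: derive the nftables match expression from a cgroup path.

# Strip one leading and one trailing slash,
# e.g. "john/apps/firefox/" -> "john/apps/firefox".
normalize_cgroup_path() {
    p="${1#/}"
    printf '%s\n' "${p%/}"
}

# Depth of the path = number of "/"-separated components.
cgroup_level() {
    normalize_cgroup_path "$1" | awk -F/ '{ print NF }'
}

# Print the match expression in the form used by the rules above.
nft_cgroup_match() {
    printf 'socket cgroupv2 level %s "%s"\n' \
        "$(cgroup_level "$1")" "$(normalize_cgroup_path "$1")"
}
```

Calling nft_cgroup_match john/apps/ssh/ prints the same match expression that appears in the ssh rule.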

The Debugging:

Upon investigation, it became clear that the issue stemmed from the handling of ssh processes. The logs showed an error related to task migration in the cgroup:

# CGROUP_LOGLEVEL=debug cgrulesengd
...
cgrulesengd[13882]: Warning: cgroup_attach_task_pid failed: 50001
cgrulesengd[13882]: Warning: failed to apply the rule. Error was: 50001
cgrulesengd[13882]: Cgroup change for PID: 15280, UID: 1000, GID: 1000, PROCNAME: /usr/bin/ssh FAILED! (Error Code: 50001)

These warnings were logged while running the command:
$ ssh root@192.168.1.1
Enter passphrase for key '/home/john/.ssh/id_rsa':

Let’s break down the likely cause:

1. Process starts → socket created (inherits parent/current cgroup)
2. cgrulesengd moves the process to the target cgroup, based on cgrules (/etc/cgrules.conf)
3. Existing sockets stay behind, in the initial cgroup

The cgroup_attach_task_pid failed: 50001 warning suggested that cgrulesengd had failed to move the ssh process into the target cgroup. In reality, the process did join the cgroup; its sockets, created before the move, did not.
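
The process half of that observation can be checked by reading the process's cgroup membership from /proc, where the cgroup v2 entry is the line beginning with 0::. The following is a minimal sketch; the function names are hypothetical, and it shows only where the process lives, not where its sockets were created.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting cgroup v2 membership.

# Print the cgroup v2 path recorded in a /proc/<pid>/cgroup-style file.
# $1: file to read (defaults to this shell's own entry).
cgroup_v2_path() {
    sed -n 's/^0:://p' "${1:-/proc/self/cgroup}"
}

# Convenience wrapper: look up a running process by PID.
cgroup_v2_path_of_pid() {
    cgroup_v2_path "/proc/$1/cgroup"
}
```

For example, cgroup_v2_path_of_pid 15280 (the PID from the log above) would print the cgroup that the ssh process currently belongs to, while the nftables counters show whether its packets were matched there.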

Note: This blog was written while investigating a libcgroup issue on GitHub. Examples, logs, cgroup paths, and nftables rules are taken directly from that scenario.

Solution: Delegated Scope + cgexec

The issue was tracked down to kernel behavior: when a task is moved to a new cgroup, its existing sockets are not migrated with it. The proposed solution has two steps:

Create a delegated systemd scope using cgcreate, to prevent systemd from interfering with task placement decisions:

# cgcreate -c -g cpu,io,memory,pids,rdma,misc:john.slice/apps.scope

Launch each command through cgexec, naming the scope, so that the task starts in the correct cgroup and creates its sockets there:

# cgexec -g cpu:john.slice/apps.scope <command>

This approach avoided the race condition and made the /etc/cgrules.conf configuration unnecessary for migrating ssh or ping tasks.
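
For day-to-day use, the cgexec invocation can be wrapped in a small shell function. The following is a minimal sketch: run_in_scope, CG_TARGET and CGEXEC are hypothetical names introduced here, and the default target simply repeats the scope from the example above.

```shell
#!/bin/sh
# Hypothetical wrapper around the cgexec call shown above.

# Controller:path pair passed to cgexec -g (assumed default).
CG_TARGET="${CG_TARGET:-cpu:john.slice/apps.scope}"
# Set CGEXEC=echo for a dry run that only prints the arguments.
CGEXEC="${CGEXEC:-cgexec}"

# Launch "$@" inside the target cgroup, so that any sockets the
# command opens are created after it is already in place.
run_in_scope() {
    "$CGEXEC" -g "$CG_TARGET" "$@"
}
```

With this in place, run_in_scope ssh root@192.168.1.1 starts ssh directly inside the scope, without relying on cgrulesengd to move it afterwards.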

The Problem II: Creating a cgroup under scope

A new challenge appeared when trying to organize cgroups hierarchically under apps.scope, e.g., for classification of all user tasks under the CG_APPS_DIR. Attempting to create a child cgroup failed:

# cgcreate -S -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}
# cgcreate -g cpu,io,memory,pids:${CG_APPS_DIR}
cgcreate: can't create cgroup cg_apps_dir: No such file or directory

Solution:

The kernel does not allow controllers to be enabled in a cgroup's cgroup.subtree_control while that cgroup itself contains tasks (the cgroup v2 "no internal processes" rule). The solution is to move the idle task created by libcgroup from CG_SCOPE into CG_APPS_DIR before enabling the controllers:

# cgcreate -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}
# cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}
# pid=$(cgget -n -v -r cgroup.procs ${CG_SLICE}/${CG_SCOPE})
# cgset -r cgroup.procs="$pid" ${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}
# cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}

This approach allowed the subtree control to be applied successfully to the slice and scope.
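
The same ordering can be expressed as direct writes to the cgroup filesystem, which makes it clearer why it matters: the parent must be emptied before its cgroup.subtree_control is written. This is a sketch under stated assumptions, not a description of what cgcreate or cgset do internally; CGROOT and enable_subtree are hypothetical names.

```shell
#!/bin/sh
# Hypothetical sketch: empty a parent cgroup into a child, then enable
# controllers for the parent's children.

# Mount point of the cgroup v2 hierarchy (override for experiments).
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# Usage: enable_subtree <parent-path> <child-name> <controller>...
enable_subtree() {
    parent="$CGROOT/$1"   # e.g. john.slice/apps.scope
    child="$parent/$2"    # e.g. apps
    shift 2               # remaining arguments are controller names

    mkdir -p "$child" || return 1

    # Snapshot the parent's PIDs first, then migrate them one write at a time.
    for pid in $(cat "$parent/cgroup.procs"); do
        echo "$pid" > "$child/cgroup.procs" || return 1
    done

    # Build "+cpu +io ..." and enable it for the parent's children.
    spec=""
    for ctrl in "$@"; do
        spec="${spec:+$spec }+$ctrl"
    done
    echo "$spec" > "$parent/cgroup.subtree_control"
}
```

Swapping the last two steps reproduces the failure: with a task still in the parent, the write to cgroup.subtree_control is rejected.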

Important: Ensure that at least one task remains alive under CG_SCOPE or its children; otherwise, the scope may be removed automatically. libcgroup mitigates this by spawning an idle task, libcgroup_systemd_idle_thread, when a delegated slice/scope is created by passing the -c option to cgcreate. This is another systemd quirk: it expects a task under the scope or its children. We will not go into it further here, as it is beyond the scope of this post.

The Problem III: Scaling to multiple cgroups under scope

Another challenge arose when trying to scale the above solution under the delegated slice/scope to manage multiple cgroups, hosting different tasks:

${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}
${CG_SLICE}/${CG_SCOPE}/${CG_OTHERS_DIR}

While the approach from the previous section worked for a single child cgroup, it failed to enable controllers once sibling cgroups were involved. The fix was to park the idle task in a temporary cgroup and enable subtree control at the scope level first.

Solution:

To ensure that both directories could have controllers applied, the following sequence of commands was used:

# Create the slice and scope
# cgcreate -c -g cpu,io,memory,pids:${CG_SLICE}/${CG_SCOPE}

# Create tmp cgroup for task delegation
# cgcreate -g:${CG_SLICE}/${CG_SCOPE}/_tmp

# Move the idle task to tmp cgroup
# pid=$(cgget -n -v -r cgroup.procs ${CG_SLICE}/${CG_SCOPE})
# cgset -r cgroup.procs="$pid" ${CG_SLICE}/${CG_SCOPE}/_tmp

# Enable controllers for the scope
# cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}

# Create apps dir and enable controllers
# cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}
# cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}

# Create others dir and enable controllers
# cgcreate -g:${CG_SLICE}/${CG_SCOPE}/${CG_OTHERS_DIR}
# cgset -r cgroup.subtree_control="+cpu +cpuset +io +memory +pids" ${CG_SLICE}/${CG_SCOPE}/${CG_OTHERS_DIR}

This method ensured that multiple directories under the same slice and scope could be managed, with controllers enabled for each.
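
Whether the controllers actually reached each sibling can be checked by reading its cgroup.controllers file, which lists the controllers made available by the parent's cgroup.subtree_control. The following is a minimal sketch; missing_controllers is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical check: which of the requested controllers are NOT listed
# in a given cgroup.controllers file?

# Usage: missing_controllers <cgroup.controllers file> <controller>...
# Prints one missing controller per line; prints nothing if all are present.
missing_controllers() {
    file="$1"
    shift
    # Pad with spaces so that "cpu" does not match inside "cpuset".
    available=" $(tr '\n' ' ' < "$file")"
    for ctrl in "$@"; do
        case "$available" in
            *" $ctrl "*) ;;
            *) echo "$ctrl" ;;
        esac
    done
}
```

Run against /sys/fs/cgroup/${CG_SLICE}/${CG_SCOPE}/${CG_APPS_DIR}/cgroup.controllers and the equivalent file under ${CG_OTHERS_DIR}, empty output for both means the siblings received the same set of controllers.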

The Problem IV: systemd’fying the script

When attempting to automate the bash commands at boot using a systemd service, an error occurred:

cgcreate: can't create cgroup john.slice/libcgroup.scope: Cgroup operation failed
Error: failed to open the system bus: 2

Solution

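This issue was resolved by modifying the systemd unit configuration. The critical changes were to drop sysinit.target from the Before directive and to add a Requires dependency on cgrulesengd.service and dbus.service, so that the system bus is available before the script runs: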

Before:

[Unit]
Description=Control Group configuration service
...
Before=sysinit.target nftables.service network-pre.target umount.target shutdown.target
After=dbus.service cgrulesengd.service
...

After:

[Unit]
Description=Control Group configuration service
...
Requires=cgrulesengd.service dbus.service
After=cgrulesengd.service dbus.service
Before=nftables.service network-pre.target umount.target shutdown.target
...

By making these changes, the systemd service was successfully executed at boot, and the error was resolved.
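
As an extra safeguard, the script itself could check that the system bus socket exists before calling cgcreate -c. This is a sketch under assumptions, not part of the original fix: wait_for_system_bus is a hypothetical helper, and /run/dbus/system_bus_socket is assumed to be the bus socket path, which may differ between distributions.

```shell
#!/bin/sh
# Hypothetical guard: wait until the D-Bus system bus socket appears.

# Usage: wait_for_system_bus [socket-path] [tries]
# Returns 0 once the socket exists, 1 after the given number of tries.
wait_for_system_bus() {
    sock="${1:-/run/dbus/system_bus_socket}"   # assumed default path
    tries="${2:-10}"
    while [ "$tries" -gt 0 ]; do
        [ -S "$sock" ] && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}
```

A boot script could then call wait_for_system_bus || exit 1 ahead of the cgcreate commands, turning the "failed to open the system bus" error into an explicit early failure.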

The Problem V: Allowing Regular Users to Add Tasks to cgroups

A final hurdle was ensuring that regular users could add processes to the cgroup. This proved more complex with cgroup v2, which enforces stricter delegation rules compared to v1.

Solution:

The solution involved setting appropriate ownership and permissions on the necessary files:

# Set permissions for cgexec
# chown root:cgroups /usr/bin/cgexec
# chmod 2750 /usr/bin/cgexec

# Set permissions for cgroup files
# chown root:cgroups /sys/fs/cgroup/cgroup.procs
# chown root:cgroups /sys/fs/cgroup/cgroup.threads
# chmod 660 /sys/fs/cgroup/cgroup.procs
# chmod 660 /sys/fs/cgroup/cgroup.threads

# Ensure proper ownership and permissions recursively
# find /sys/fs/cgroup/john.slice -name cgroup.procs | while IFS= read -r procs_file; do
    chown root:cgroups "$procs_file"
    chmod 660 "$procs_file"
done

# find /sys/fs/cgroup/john.slice -name cgroup.threads | while IFS= read -r threads_file; do
    chown root:cgroups "$threads_file"
    chmod 660 "$threads_file"
done

This allowed regular users to interact with the cgroups while maintaining proper delegation and permissions.
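
To confirm that the ownership and mode changes took effect across the whole subtree, the files can be listed together with their permissions. The following is a minimal sketch, assuming GNU stat (for the -c format option); list_cgroup_perms is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical audit: print "mode group path" for every cgroup.procs and
# cgroup.threads file under the given cgroup subtree.

# Usage: list_cgroup_perms <cgroup directory>
list_cgroup_perms() {
    find "$1" \( -name cgroup.procs -o -name cgroup.threads \) \
        -exec stat -c '%a %G %n' {} +
}
```

After the commands above, list_cgroup_perms /sys/fs/cgroup/john.slice should show 660 and the cgroups group on every line; any other line points at a cgroup that was created after the permissions were set.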

Conclusion

Migrating from cgroup v1 to cgroup v2 posed several challenges, particularly with integrating nftables for packet filtering. Through debugging, task delegation, and adjustments to systemd configuration, the issues were systematically addressed. The transition was eventually successful, allowing for efficient management of cgroup v2 resources with nftables while ensuring compatibility for both system and user-level processes.

Authors

Kamalesh Babulal
