Realtime and cgroups – a Cautionary Tale

Linux provides a mechanism to promote processes to realtime; realtime processes take priority over default (SCHED_OTHER) processes in the Linux kernel. To ensure that the kernel and other critical processes with lower priority than realtime tasks (SCHED_RT) get enough time to run, there is a system-wide realtime budget setting, kernel.sched_rt_runtime_us. Furthermore, some distros (Oracle Linux 8 and older) enable CONFIG_RT_GROUP_SCHED in their kernels, which allows realtime budgets to be set on a per-cgroup basis. But with great power comes great responsibility, and realtime can be misconfigured on any Linux system. This blog outlines best practices for realtime on enterprise Linux servers.

Why Realtime

The major advantage of realtime is deterministic task execution: tasks complete within expected time limits and with lower latency. This makes it suitable for areas that require predictable response times, such as industrial robotics, aerospace, multimedia, and many others. Realtime works well on customized systems, often built on embedded chips, that repeat the same tasks over and over without human interaction. It gets more interesting on general-purpose systems, which run a mix of higher-priority realtime tasks and lower-priority (SCHED_OTHER) tasks and must also respond to user interaction, as with gaming, web servers, or any application that runs one of its tasks at realtime priority.

System-Wide Realtime Budget

Realtime tasks, like any other task, are automatically placed under a cgroup, determined mostly by systemd or by an application managing its own cgroup sub-tree. Realtime (henceforth abbreviated to rt) tasks run until they are preempted by a higher-priority task or yield by waiting for I/O or sleeping. Such monopolization can destabilize the system by preventing Linux kernel housekeeping tasks and periodic tasks from running. rt tasks are expected to be very short and latency-sensitive. Oracle Linux comes with a safeguard tunable, kernel.sched_rt_runtime_us, which by default is set to 950,000 microseconds, or 95% of a one-second period. This guarantees that the remaining 5% of each second is available for non-rt tasks. We will see more about this setting later.
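To make the arithmetic behind the safeguard concrete, here is a minimal sketch that computes the share of each period left for non-rt tasks. The function name non_rt_share is hypothetical, and the special value -1 (no throttling, covered later in this post) is treated as leaving nothing reserved.

```python
def non_rt_share(runtime_us: int, period_us: int) -> float:
    """Fraction of each period that stays available to non-rt tasks."""
    if period_us <= 0:
        raise ValueError("period must be positive")
    if runtime_us == -1:
        # -1 disables rt throttling, so no time is reserved for non-rt tasks.
        return 0.0
    if not 0 <= runtime_us <= period_us:
        raise ValueError("runtime must be -1 or within [0, period]")
    return (period_us - runtime_us) / period_us


if __name__ == "__main__":
    # Default values described above: 950,000 us runtime in a 1,000,000 us period.
    print(f"{non_rt_share(950_000, 1_000_000):.0%} of each period is left for non-rt tasks")
```

With the defaults, the reserved share works out to 5%, matching the description above.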

CONFIG_RT_GROUP_SCHED and Realtime in cgroups

Before diving into the topic, let’s cover a few basics of how time is allocated to the tasks of a cgroup. Every CPU controller cgroup has two settings, period and quota. The period sets the length of the time window and the quota sets how much of that window the tasks may use, i.e., if a cgroup’s period is set to one second, the quota can range from 0 up to one second. In other words, the quota restricts the CPU time the cgroup’s tasks can get within each period. For illustration, setting the quota to 0.50 seconds allows the tasks in a cgroup to run for up to 50% of a second and throttles them for the remaining 50%, before allowing them to run again. Setting the quota to match the period is legal, but the side effect is that the tasks can monopolize the CPU until they are preempted by higher-priority tasks or yield, so the quota effectively restricts nothing.

Building upon this simple example, the rt quota is a global setting distributed among the cgroups in a hierarchical fashion. Consider the analogy of a pie: if a pie had to be shared among children, each child would get a slice, the slices might be equal or unequal, and adding up every slice would give back the whole pie.
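The period/quota mechanics can be sketched with a small calculation. This is a simplified model, not the kernel's implementation: it assumes a single CPU and a task that is always runnable and consumes its quota at the start of every period. The name cpu_time_granted is hypothetical.

```python
def cpu_time_granted(quota_us: int, period_us: int, wall_us: int) -> int:
    """CPU time (in us) an always-runnable task gets over wall_us of wall-clock time.

    Simplified model: the task runs from the start of each period until the
    quota is exhausted, then is throttled until the next period begins.
    """
    if period_us <= 0 or not 0 <= quota_us <= period_us:
        raise ValueError("need period > 0 and 0 <= quota <= period")
    if wall_us < 0:
        raise ValueError("wall-clock time cannot be negative")
    full_periods, remainder = divmod(wall_us, period_us)
    # Each full period grants the whole quota; a partial period grants
    # at most the quota, and no more than the time that has elapsed in it.
    return full_periods * quota_us + min(remainder, quota_us)


if __name__ == "__main__":
    # Quota of 0.5 s in a 1 s period, observed over 2.5 s of wall-clock time.
    print(cpu_time_granted(500_000, 1_000_000, 2_500_000))
    # Quota equal to the period: the task is never throttled.
    print(cpu_time_granted(1_000_000, 1_000_000, 2_500_000))
```

The second call illustrates the side effect described above: when the quota matches the period, the task receives all of the elapsed time.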

Similarly, the rt quota is shared between cgroups at the same level of the cgroup hierarchy, and each cgroup’s rt quota is further distributed among its child cgroups. The global runtime (quota) and period can be read or set via:

/proc/sys/kernel/sched_rt_period_us
/proc/sys/kernel/sched_rt_runtime_us

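Both settings are in microseconds, and the maximum allowed value is INT_MAX - 1. Values written to these files do not persist across reboots. To make them persistent, set the corresponding sysctl keys in /etc/sysctl.conf or in a file under /etc/sysctl.d/. The keys, which can also be queried with the sysctl command as shown, are:

sysctl kernel.sched_rt_runtime_us
sysctl kernel.sched_rt_period_us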
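As an illustration, the two global values can also be read programmatically from the /proc files listed above. This is a minimal sketch; the helper names read_rt_setting and read_rt_budget are hypothetical, and the base directory is a parameter only so the function can be pointed at a different location.

```python
from pathlib import Path


def read_rt_setting(name: str, base: str = "/proc/sys/kernel") -> int:
    """Read one integer setting, e.g. sched_rt_runtime_us, from base."""
    return int(Path(base, name).read_text().strip())


def read_rt_budget(base: str = "/proc/sys/kernel") -> tuple:
    """Return (runtime_us, period_us) for the system-wide rt budget."""
    return (
        read_rt_setting("sched_rt_runtime_us", base),
        read_rt_setting("sched_rt_period_us", base),
    )


if __name__ == "__main__":
    try:
        runtime_us, period_us = read_rt_budget()
    except OSError as exc:
        # The files are absent on non-Linux systems or restricted sandboxes.
        print(f"could not read rt settings: {exc}")
    else:
        print(f"runtime={runtime_us} us, period={period_us} us")
```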

By default, the runtime (quota) is set to 0.95 seconds and the period to one second. The root cgroup of the CPU controller hierarchy is assigned the system-wide global quota, which is shared among its children and any future grandchildren cgroups. This distribution differs from the quota/period handling of CFS tasks, and a diagram of an example cgroup tree makes it easier to untangle:

Figure: cgroup hierarchy

The sum of the rt runtime (quota) of all leaf cgroups must not exceed the global sched_rt_runtime_us, i.e., 250,000 μs (cgrpA1) + 200,000 μs (cgrpA2) + 400,000 μs (cgrpB) = 850,000 μs, which fits within the global 950,000 μs.

The rules for rt quota allocation are as follows:

A cgroup’s rt quota is limited by the share of quota allocated to its parent cgroup.
It’s not mandatory for a cgroup to have an rt quota allocated to it if it contains no rt priority tasks.
For sibling cgroups, the sum of their rt quotas must be less than or equal to their parent’s rt quota.
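The allocation rules above can be checked with a short sketch over an in-memory cgroup tree. It is not how the kernel validates budgets; it assumes every cgroup uses the same period so that runtimes can be compared directly. RtGroup and find_overcommitted are hypothetical names, and the 450,000 μs budget given to the intermediate cgroup cgrpA is an assumed value chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RtGroup:
    """A cgroup with an rt runtime budget (in us) and optional child cgroups."""
    name: str
    runtime_us: int
    children: List["RtGroup"] = field(default_factory=list)


def find_overcommitted(group: RtGroup) -> List[str]:
    """Names of cgroups whose children together claim more rt runtime than they hold."""
    offenders = []
    if sum(child.runtime_us for child in group.children) > group.runtime_us:
        offenders.append(group.name)
    for child in group.children:
        offenders.extend(find_overcommitted(child))
    return offenders


def example_tree(cgrp_a2_runtime_us: int = 200_000) -> RtGroup:
    # cgrpA's own budget (450,000 us) is an assumption for this example.
    cgrp_a = RtGroup("cgrpA", 450_000, [
        RtGroup("cgrpA1", 250_000),
        RtGroup("cgrpA2", cgrp_a2_runtime_us),
    ])
    return RtGroup("root", 950_000, [cgrp_a, RtGroup("cgrpB", 400_000)])


if __name__ == "__main__":
    print(find_overcommitted(example_tree()))         # budgets fit
    print(find_overcommitted(example_tree(300_000)))  # cgrpA's children exceed its budget
```

A cgroup with a runtime of 0 passes the check trivially, which matches the rule that a cgroup without rt tasks does not need an rt quota.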

Out of Realtime

kernel.sched_rt_runtime_us can also be set to -1, which lets rt tasks run without any time restriction. This can lead to instability or even a crash if the tasks don’t yield and allow non-rt Linux kernel housekeeping tasks to run. In most cases, the default setting of 95% of a second works well. Setting it too low might cause rt tasks to fail, so use this setting with caution.

One word of caution to application developers: don’t assume that the total rt quota of the system will be available to the cgroup in which your application is placed. Claiming all of it starves other applications that require rt runtime, which then fail because no rt budget is available to their cgroup. Always calculate the rt time your application requires and carve out only that budget from the parent cgroup’s budget by writing it to the cgroup’s rt runtime setting (cpu.rt_runtime_us in cgroup v1).
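An application can defend against a missing rt budget by treating promotion to realtime as something that may fail. The sketch below uses Python's os.sched_setscheduler, which is available on Linux; the function name try_promote_to_fifo and the priority value 10 are illustrative choices. It assumes that the kernel rejects the request with a permission error when the caller lacks privileges or, on kernels built with CONFIG_RT_GROUP_SCHED, when the task's cgroup has no rt runtime allocated.

```python
import os


def try_promote_to_fifo(priority: int = 10, pid: int = 0) -> bool:
    """Ask the kernel to run pid (0 means the calling process) under SCHED_FIFO.

    Returns True on success and False if the request is refused or the
    platform does not expose the scheduler API.
    """
    if not hasattr(os, "sched_setscheduler"):
        return False  # not available on this platform
    try:
        os.sched_setscheduler(pid, os.SCHED_FIFO, os.sched_param(priority))
    except OSError:
        # Includes PermissionError: missing privileges or no rt budget.
        return False
    return True


def demote_to_other(pid: int = 0) -> None:
    """Return pid to the default SCHED_OTHER policy (priority must be 0)."""
    os.sched_setscheduler(pid, os.SCHED_OTHER, os.sched_param(0))


if __name__ == "__main__":
    if try_promote_to_fifo():
        print("running with SCHED_FIFO")
        demote_to_other()
    else:
        print("realtime not available; continuing with the default policy")
```

Falling back to the default policy keeps the application running, at reduced determinism, instead of failing outright when its cgroup has no rt budget.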

Summary

Linux realtime cgroups can provide advantages. Grouping critical realtime tasks can give application performance a notable boost, and budgeting the realtime quota among those tasks makes CPU cycle allocation easier to manage. However, elevating CPU-intensive tasks to realtime, or over- or under-allocating the realtime budget, can lead to system stalls or slowdowns. Proper tuning of the global and per-cgroup realtime settings can serve both specialized applications and general-purpose systems, giving them the benefits of realtime and cgroups.


Kamalesh Babulal
