The Linux kernel contains more than 1,500 tunables – and setting these parameters correctly can significantly improve system performance and utilization! For years, we’ve tried to provide the right suggestions for these tunables, via software release notes and improved default values, but many system loads will benefit from dynamic tuning of these values.
Introducing bpftune, an automatic configurator that monitors your workloads and sets the correct kernel parameter values! bpftune is an open source project, available via dnf install from the Oracle Linux ol_developer repositories and at https://github.com/oracle-samples/bpftune.
bpftune aims to provide lightweight, always-on auto-tuning of system behaviour.
It is currently focused on some of the most common issues with tunables we have run into at Oracle, but with a pluggable infrastructure that is open to contributions. We hope you find it useful too!
Even as the number of sysctls in the kernel grows, individual systems get far less care and administrator attention than they used to; phrases like “cattle, not pets” exemplify this. Given the modern cloud architectures used for most deployments, most systems never see any human administrator interaction after initial provisioning; in fact, given the scale requirements, this is often an explicit design goal: “no ssh’ing in!”.
These two observations are not unrelated; in an earlier era of fewer, larger systems, tuning by administrators was more feasible.
These trends - growing system complexity combined with minimal administrator interaction - suggest a rethink of how tunables are managed.
A lot of lore accumulates around these tunables, and to help clarify why we developed bpftune, consider a straw-man version of the traditional approach to tunables:
“find the set of magic numbers that will work for the system forever”
This is obviously a caricature of how administrators approach the problem, but it does highlight a critical implicit assumption - that systems are static.
And that gets to the “BPF” in bpftune; BPF provides the means to carry out low-overhead observations of a system. So not only can we observe the system and tune appropriately, we can also observe the effect of that tuning and re-tune if necessary. This is a key feature of bpftune which we will return to.
bpftune is a daemon that manages a set of plugin tuners; each is a shared object (.so) loaded at start-up.
Tuners can be enabled or disabled; a tuner is automatically disabled if the admin changes associated tunables manually. Tuners share a global BPF ring buffer which allows posting of events from BPF programs to userspace. For example, if the sysctl tuner sees a sysctl being set, it posts an event. Each tuner has an associated id (set when it is loaded), and events posted contain the tuner id.
Each tuner has a BPF component (built using a BPF skeleton) and a userspace component. The latter has init(), fini() and event_handler() entrypoints. When an event is received, the tuner id identifies the appropriate tuner, and that tuner's event_handler() callback is run. The init(), fini() and event_handler() functions are loaded from the tuner .so object.
bpftune is also available in the ol9_developer and ol8_developer repositories for Oracle Linux and can be installed via:
$ sudo yum install --enablerepo=ol9_developer bpftune
For OL8:
$ sudo yum install --enablerepo=ol8_developer,ol8_UEKR7 bpftune
To start bpftune as a service:
$ sudo service bpftune start
…and to enable it by default at boot:
$ sudo systemctl enable bpftune
bpftune logs to syslog so /var/log/messages will contain details of any tuning carried out.
bpftune can also be run in the foreground as a program; to redirect output to stdout/stderr, run
$ sudo bpftune -s
On exit, bpftune will summarize any tuning done.
Simply starting bpftune and observing changes made via /var/log/messages can be instructive. For example, on a standard VM with sysctl defaults, I ran
$ service bpftune start
…and went about normal development activities such as cloning git trees from upstream, building kernels, etc. From the log we see some of the adjustments bpftune made to accommodate these activities
$ sudo grep bpftune /var/log/messages
...
Apr 19 16:14:59 bpftest bpftune[2778]: bpftune works fully
Apr 19 16:14:59 bpftest bpftune[2778]: bpftune supports per-netns policy (via netns cookie)
Apr 19 16:18:40 bpftest bpftune[2778]: Scenario 'specify bbr congestion control' occurred for tunable 'TCP congestion control' in global ns. Because loss rate has exceeded 1 percent for a connection, use bbr congestion control algorithm instead of default
Apr 19 16:18:40 bpftest bpftune[2778]: due to loss events for 145.40.68.75, specify 'bbr' congestion control algorithm
Apr 19 16:26:53 bpftest bpftune[2778]: Scenario 'need to increase TCP buffer size(s)' occurred for tunable 'net.ipv4.tcp_rmem' in global ns. Need to increase buffer size(s) to maximize throughput
Apr 19 16:26:53 bpftest bpftune[2778]: Due to need to increase max buffer size to maximize throughput change net.ipv4.tcp_rmem(min default max) from (4096 131072 6291456) -> (4096 131072 7864320)
Apr 19 16:26:53 bpftest bpftune[2778]: Scenario 'need to increase TCP buffer size(s)' occurred for tunable 'net.ipv4.tcp_rmem' in global ns. Need to increase buffer size(s) to maximize throughput
Apr 19 16:26:53 bpftest bpftune[2778]: Due to need to increase max buffer size to maximize throughput change net.ipv4.tcp_rmem(min default max) from (4096 131072 7864320) -> (4096 131072 9830400)
Apr 19 16:29:04 bpftest bpftune[2778]: Scenario 'specify bbr congestion control' occurred for tunable 'TCP congestion control' in global ns. Because loss rate has exceeded 1 percent for a connection, use bbr congestion control algorithm instead of default
Apr 19 16:29:04 bpftest bpftune[2778]: due to loss events for 140.91.12.81, specify 'bbr' congestion control algorithm
Developers can find build dependencies, instructions and source code layout, as well as instructions for contributing to this project, at the source repo: https://github.com/oracle-samples/bpftune