Linux kernel developer Tom Hromatka has been working on seccomp, looking to improve the performance of large seccomp filters.

Background

Seccomp is a critical component for safely isolating and securing containers: it restricts the syscalls that a container is allowed to invoke. In a nod to the many security threats that have arisen lately, the current seccomp best practice is to create a (typically large) whitelist of allowed syscalls. This is safer than a small blacklist because new syscalls are occasionally added to the Linux kernel. If a blacklist is used and the seccomp filter is not updated together with the kernel, malicious code could call a new syscall and use it as an attack vector to harm the system.
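To make the whitelist-versus-blacklist reasoning concrete, here is a minimal sketch in Python. It does not use seccomp at all; the syscall names in each set are illustrative assumptions, and "new_syscall" is a hypothetical name standing in for a syscall added by a newer kernel.

```python
# Illustrative sets only; these are NOT Docker's or libseccomp's real lists.
WHITELIST = {"read", "write", "exit_group", "getppid"}
BLACKLIST = {"reboot", "kexec_load", "mount"}


def allowed_by_whitelist(name):
    """A whitelist permits only the syscalls it names."""
    return name in WHITELIST


def allowed_by_blacklist(name):
    """A blacklist permits everything it does not name."""
    return name not in BLACKLIST


# "new_syscall" is hypothetical: a call that neither list was written to know about.
print(allowed_by_whitelist("new_syscall"))  # False: unknown calls are denied
print(allowed_by_blacklist("new_syscall"))  # True: unknown calls slip through
```

The whitelist fails closed for anything it has never heard of, which is why it stays safe when the kernel gains syscalls before the filter is updated.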

But libseccomp isn’t equipped to manage large whitelists at present. In its current form, it generates a series of sequential if syscall == n statements. Thus, a large seccomp filter can consist of hundreds of classic Berkeley Packet Filter (cBPF) instructions. The kernel must execute every if syscall == n cBPF instruction until the if statement matching the syscall being processed is found. For syscalls near the end of the filter, it can take milliseconds to process the cBPF instructions.
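The cost of the sequential layout can be modeled with a short Python sketch. This is only a model of the comparison chain described above, not real cBPF and not libseccomp code; the function name and the use of the numbers 0 to 299 as syscall numbers are assumptions made for illustration.

```python
def linear_comparisons(allowed, syscall_nr):
    """Count the equality checks a sequential chain runs before it matches."""
    for count, nr in enumerate(allowed, start=1):
        if nr == syscall_nr:
            return count
    return len(allowed)  # no match: every check in the chain was executed


# Assumed filter: 300 allowed syscalls, numbered 0..299 for simplicity.
allowed = list(range(300))

print(linear_comparisons(allowed, 0))    # 1: first entry in the filter
print(linear_comparisons(allowed, 299))  # 300: last entry in the filter
```

The count grows in step with the position of the syscall in the filter, which is the O(n) behavior in question.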

Hope (and better seccomp performance) is on the way!

At Oracle, we are working on significantly improving seccomp performance when running large filters. We have proposed changes to libseccomp to use a binary tree, which will reduce the cBPF computation time from O(n) down to O(log n). For a seccomp filter with 300 syscalls, this will drastically decrease the number of cBPF instructions executed, from 300+ down to as few as 9.
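The O(log n) figure can be checked with a small Python sketch that walks a balanced search over a sorted list of syscall numbers and counts the nodes visited. This is an assumption-laden model, not the proposed libseccomp implementation: it counts tree levels rather than actual cBPF instructions (a real node may need more than one instruction), and the numbering 0 to 299 is invented for illustration.

```python
def tree_comparisons(sorted_allowed, syscall_nr):
    """Count the nodes visited by a balanced binary search for syscall_nr."""
    lo, hi = 0, len(sorted_allowed) - 1
    steps = 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_allowed[mid] == syscall_nr:
            return steps
        if sorted_allowed[mid] < syscall_nr:
            lo = mid + 1  # continue in the upper half
        else:
            hi = mid - 1  # continue in the lower half
    return steps  # not found: the search ran out of candidates


# Assumed filter: 300 allowed syscalls, numbered 0..299 for simplicity.
allowed = list(range(300))
worst = max(tree_comparisons(allowed, nr) for nr in allowed)
print(worst)  # 9
```

With 300 entries no lookup visits more than 9 nodes, because 2^8 = 256 is less than 300 and 2^9 = 512 is more.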

An Example

We created a simple test program using Docker’s default libseccomp filter. In this test program, we call getppid() (a very fast syscall) millions of times and record how quickly it executes. By modifying libseccomp and the cBPF instructions it generates, we’re able to measure the impact of large syscall filters on performance. The results are even better than we hoped:

[Figure: Libseccomp Performance Comparison]

The performance of the current libseccomp implementation degrades linearly as the syscall falls later in the filter. Conversely, the performance of the binary tree remains consistent regardless of the location of the syscall within the filter. It is nearly as fast as the best case of the current filter.
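The shape of the measurement described above can be approximated with a short timing loop. This is a rough sketch, not the test program used for the results: it is written in Python, so interpreter overhead dominates the numbers, it installs no seccomp filter, and the function name and iteration count are assumptions. To reproduce the comparison, the same loop would have to run inside processes with differently generated filters loaded.

```python
import os
import time


def time_getppid(iterations=1_000_000):
    """Return the average wall-clock cost of os.getppid() in nanoseconds."""
    start = time.perf_counter()
    for _ in range(iterations):
        os.getppid()  # cheap syscall, so per-call overhead dominates the timing
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1e9


print(f"{time_getppid():.0f} ns per getppid() call")
```

Because getppid() does almost no work in the kernel, any extra time spent evaluating a filter shows up clearly in the per-call average.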

Expect to see this feature in libseccomp in 2019!

Resources

The binary tree RFC for libseccomp is available here:

  https://github.com/seccomp/libseccomp/issues/116

Tom presented this topic at Linux Plumbers Conference 2018. The video and presentation from that talk can be found here:

  https://lpc.events/event/2/contributions/213/