If you've been following Linux kernel development or system call fuzzing at all, then you probably know about trinity and syzkaller. These programs have uncovered many kernel bugs over the years. The way they work is basically throwing random system calls at the kernel in the hope that some of them will crash the kernel or otherwise trigger a detectable fault (e.g. a buffer overflow) in the kernel code.
While these fuzzers effectively test the system calls themselves (and the code reachable through system calls), one thing they don't test very well is what happens at the actual transition point between userspace and the kernel. There is more to this boundary than meets the eye; it is written in assembly code and there is a lot of architectural state (CPU state) that must be verified or sanitized before the kernel can safely start executing its C code.
This blog post explores how one might go about writing a fuzzer targeting the Linux kernel entry code on x86.
Before continuing, you may want to have a brief look at the main two files involved on 64-bit kernels:
In all, the entry code is around 1,700 lines of assembly code (including comments), so it's not exactly trivial, but at the same time only a very tiny part of the whole kernel.
To start off, I would like to give a concrete example of CPU state that the kernel needs to verify when entering the kernel from userspace.
On x86, memset() is typically implemented using the rep stos instruction, since it is highly optimized by the CPU/microcode for writing to a contiguous range of bytes. Conceptually, this is a hardware loop that repeats (rep) a store (stos) a number of times; the target address is given by the %rdi register and the number of iterations is given by the %rcx register. You could for example implement memset() using inline assembly like this:
static inline void memset(void *dest, int value, size_t count) { asm volatile ("rep stosb" // 4 : "+D" (dest), "+c" (count) // 1, 2 : "a" (value) // 3 : "cc", "memory"); // 5 }
If you are unfamiliar with inline assembly, this tells GCC to:
For reference, you can also have a look at the current mainline implementation of memset() on x86.
Here is the kicker: The %rflags register contains a rarely used bit called DF (the "Direction flag"). This bit controls whether rep stos increments or decrements %rdi after writing each byte; when DF is set to 0, the affected memory range is %rdi..(%rdi + %rcx), whereas when DF is set to 1, the affected memory range is (%rdi - %rcx)..%rdi! This has pretty big implications for what memset() ends up doing, so we had better make sure that DF is always set to 0.
The x86_64 SysV ABI (which is more or less what the kernel uses) actually mandates that the DF bit must always be 0 on function entry and return (page 15):
The direction flag DF in the %rFLAGS register must be cleared (set to “forward” direction) on function entry and return. Other user flags have no specified role in the standard calling sequence and are not preserved across calls.
This is a convention that the kernel actually relies heavily on internally; if the DF flag was somehow set to 1 when calling memset() it would overwrite a completely wrong part of memory. Consequently, one of the jobs of the kernel's entry code is to make sure that the DF flag is always set to 0 before we enter any of the kernel's C code. We can do this with a single instruction: cld ("clear direction flag"), and this is indeed what the kernel does in many of its entry paths, see e.g. paranoid_entry() or error_entry().
Now that we understand how important even a single bit of CPU state can affect the kernel, let's try to enumerate all the various CPU state variables that the entry code needs to handle:
Something we've skirted around until now is that there are many different ways to enter the kernel from userspace, not just system calls (and not just one mechanism for system calls). Let's try to enumerate those too:
The goal of the fuzzer should be to test all possible combinations of CPU states and userspace/kernel transitions. Ideally, we would do an exhaustive search, but if you consider all possible combinations of register values and methods of entry the search space is simply too large. We will consider two main strategies to improve our chances of finding a bug:
Focus on those values/cases that we suspect are more likely to cause interesting/unusual things to happen. This typically comes down to looking at x86 documentation (Wikipedia, Intel manuals, etc.) and the entry code itself. For example, the entry code documents several cases of processor errata which we can use directly to hit known edge cases.
Collapse classes of values that we think will not make a difference. For example, when picking random values to load a register with, it is much more important to try different classes of pointers (e.g. kernel, userspace, non-canonical, mapped, non-mapped, etc.) than trying all possible values.
It's also worth mentioning that the kernel already has an excellent regression test suite for x86-specific code under tools/testing/selftests/x86/, developed mainly by Andy Lutomirski. It contains test cases for various methods of entering/leaving the kernel and can be a useful source of inspiration.
Our fuzzer will be a userspace program run by the kernel we are fuzzing. Since we need very precise control over some of the instructions used to trigger a transition to the kernel we will actually not write this code directly in C; instead, we will dynamically generate x86 machine code at runtime and then execute it. For simplicity, and in order to avoid having to recover to a clean state after setting up the desired CPU state (if that is even possible), we will execute the generated machine code in a child process which can be thrown away after the entry attempt.
We start with a basic fork loop:
#include <sys/mman.h> #include <sys/wait.h> #include <error.h> #include <errno.h> #include <stdint.h> #include <stdio.h> #include <stdlib.h> #include <unistd.h> static void *mem; static void emit_code(); typedef void (*generated_code_fn)(void); int main(int argc, char *argv[]) { mem = mmap(NULL, PAGE_SIZE, // prot PROT_READ | PROT_WRITE | PROT_EXEC, // flags MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, // fd, offset -1, 0); if (mem == MAP_FAILED) error(EXIT_FAILURE, errno, "mmap()"); while (1) { emit_code(); pid_t child = fork(); if (child == -1) error(EXIT_FAILURE, errno, "fork()"); if (child == 0) { // we're the child; call our newly generated function ((generated_code_fn) mem)(); exit(EXIT_SUCCESS); } // we're the parent; wait for the child to exit while (1) { int status; if (waitpid(child, &status, 0) == -1) { if (errno == EINTR) continue; error(EXIT_FAILURE, errno, "waitpid()"); } break; } } return 0; }
We may then also implement a very basic emit_code(), which so far just creates a function containing just a single retq instruction:
static void emit_code() { uint8_t *out = (uint8_t *) mem; // retq *out++ = 0xc3; }
If you read the code carefully, you might wonder why we're creating the mapping using the MAP_32BIT flag. This is because we'll want the fuzzer to be able to enter the kernel while executing in 32-bit compatibility mode, and for that we need to be executing at a valid 32-bit address.
System calls are a bit of a messy affair on x86. First of all, there is the fact that system calls first evolved on 32-bit, where they started out using the relatively slow int instruction. Then Intel and AMD developed their own separate mechanisms for fast system calls (using brand new mutually-incompatible instructions, sysenter and syscall, respectively). And to make matters worse, 64-bit needs to handle both 32-bit processes (using any of the 32-bit system call mechanisms), 64-bit processes, and (potentially) a third mode of operation referred to as x32, where code is 64-bit like usual (and has access to 64-bit registers), but pointers are 32 bits to save memory. As they vary in exactly what CPU state is saved/modified when entering kernel mode, most of these different system call mechanisms take slightly different paths through the kernel's entry code. This is also one of the reasons why the entry code can be quite difficult to follow!
For a more in-depth guide to system calls on x86, see this excellent (if a bit dated by now) series over at LWN:
A good way to get started with system calls by hand is to get comfortable using the GNU assembler to prototype assembly snippets that we can then use in the fuzzer. For example, we can make a single read(STDIN_FILENO, NULL, 0) call to the kernel like this (save as e.g. read.S):
.text .global main main: movl $0, %eax # SYS_read/__NR_read movl $0, %edi # fd = STDIN_FILENO movl $0, %esi # buf = NULL movl $0, %edx # count = 0 syscall movl $0, %eax retq
As you can see from this snippet, when using the syscall instruction the system call number itself is passed in %rax and arguments are passed in %rdi, %rsi, %rdx, etc. The Linux x86 syscall ABI is as far as I know documented "officially" in entry_SYSCALL_64() in the entry code itself (We use %eXX instead of %rXX here since the machine code is slightly shorter; setting %eXX to 0 will also clear the upper 32 bits of %rXX).
We can build this with gcc read.S and check that it does the right thing using strace:
$ strace ./a.out execve("./a.out", ["./a.out"], [/* 53 vars */]) = 0 [...] read(0, NULL, 0) = 0 exit_group(0) = ? +++ exited with 0 +++
To get the bytes of machine code after assembling, we can compile with gcc -c read.S and use objdump -d read.o:
0000000000000000 <main>: 0: b8 00 00 00 00 mov $0x0,%eax 5: bf 00 00 00 00 mov $0x0,%edi a: be 00 00 00 00 mov $0x0,%esi f: ba 00 00 00 00 mov $0x0,%edx 14: 0f 05 syscall 16: b8 00 00 00 00 mov $0x0,%eax 1b: c3 retq
To add this sequence to our JIT-assembled function, we can use code such as this:
// mov $0, %eax *out++ = 0xb8; *out++ = 0x00; *out++ = 0x00; *out++ = 0x00; *out++ = 0x00; [...] // syscall *out++ = 0x0f; *out++ = 0x05;
We now have most of the pieces we need to write a test for our memset() example above. To set DF, we can use the std instruction ("set direction flag") before making our system call:
// std *out++ = 0xfd;
Since we're writing a fuzzer we probably want to actually randomize the value of the flag. If we're using C++ we can initialize a PRNG with this code:
#include <random> static std::default_random_engine rnd; int main(...) { std::random_device rdev; rnd = std::default_random_engine(rdev()); ... }
and then we can set (or clear) the flag before making the system call using something like this:
switch (std::uniform_int_distribution<int>(0, 1)(rnd)) { case 0: // cld *out++ = 0xfc; break; case 1: // std *out++ = 0xfd; break; }
(Again, these bytes came from assembling a short test program and then looking at the objdump output.)
Note: We need to be careful about generating random numbers in a child process; we don't want all the children to generate the same thing! That's why we're actually generating the code in the parent and simply executing it in the child process.
Be sure to check out part 2, where we dig into the stack pointer, segment registers (including 32-bit compatibility mode), debug registers, and actually entering the kernel!
Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities that were discovered in recent years:
Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively.
If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team, spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch- and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection or porting Ksplice to new architectures like ARM.