Fuzzing the Linux kernel (x86) entry code, Part 3 of 3

August 27, 2020 | 11 minute read

Note: Make sure to read part 1 and part 2 if you missed them.

In the previous blog post we ended up with a basic working fuzzer, but there are still many things we can do to improve the usability and utility of the fuzzer. In this part we will explore some obvious and some non-obvious extensions.

General-purpose registers + delta states

One thing we haven't really covered so far is setting the other general-purpose registers to random values as well. The entry code does use some of those general-purpose registers in the course of its work and if we do run into a bug somewhere then it might be more likely to crash with a random value.
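As a rough illustration, loading random values into the general-purpose registers could be emitted along these lines. This is only a sketch under a few assumptions: the helper names (emit_movabs, emit_random_gprs) are made up for this example, out is assumed to be a uint8_t * output cursor like the one used in part 2, and %rsp is skipped on the assumption that the stack pointer is set up separately.

```cpp
#include <stdint.h>
#include <string.h>

#include <random>

// Hypothetical helper: emit "movabs $value, %reg" (REX.W + B8+rd imm64).
// reg is the hardware register number: 0 = %rax, 1 = %rcx, ..., 15 = %r15.
static uint8_t *emit_movabs(uint8_t *out, unsigned reg, uint64_t value)
{
    *out++ = 0x48 | (reg >> 3);          // REX.W, plus REX.B for %r8..%r15
    *out++ = 0xb8 | (reg & 7);           // opcode B8+rd
    memcpy(out, &value, sizeof(value));  // imm64 (little-endian host assumed)
    return out + sizeof(value);
}

// Hypothetical helper: load a random value into every general-purpose
// register except %rsp (register number 4).
static uint8_t *emit_random_gprs(uint8_t *out, std::mt19937_64 &rng)
{
    for (unsigned reg = 0; reg < 16; ++reg) {
        if (reg == 4)
            continue;
        out = emit_movabs(out, reg, rng());
    }
    return out;
}
```

Each instruction is 10 bytes long, so the full sequence for 15 registers takes 150 bytes of the code buffer.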

We might also want to try to find more subtle bugs where the kernel doesn't crash outright, but perhaps leaks a kernel address in some register that userspace never otherwise looks at. One way to check that the kernel behaves correctly and preserves our registers/flags/etc. would be to write out the register state just after returning from kernel mode. It's not very difficult to achieve, as we can movq all (or at least most of) the registers into fixed addresses (e.g. in the data page which we're already using for other things). The difficulty here is integrating this with running multiple entry attempts/syscalls in a single child process, since one would need to interleave the sanity checking with the entry attempts and that could be quite fiddly to get right.
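The register dump described above could be sketched as follows. The helper names and the layout of the save area are assumptions made for this example, not something taken from the fuzzer itself. The encoding uses an absolute 32-bit displacement, so the save area must live below 2 GiB (the displacement is sign-extended). A plain mov does not modify the flags, but saving %rflags itself would need something extra such as pushf.

```cpp
#include <stdint.h>
#include <string.h>

// Hypothetical helper: emit "movq %reg, addr" with an absolute disp32 address.
// reg is the hardware register number: 0 = %rax, 1 = %rcx, ..., 15 = %r15.
static uint8_t *emit_save_reg(uint8_t *out, unsigned reg, uint32_t addr)
{
    *out++ = 0x48 | ((reg >> 3) << 2);  // REX.W, plus REX.R for %r8..%r15
    *out++ = 0x89;                      // mov r64 -> r/m64
    *out++ = 0x04 | ((reg & 7) << 3);   // ModRM: mod=00, reg, rm=100 (SIB follows)
    *out++ = 0x25;                      // SIB: no base, no index, disp32
    memcpy(out, &addr, sizeof(addr));   // disp32 (little-endian host assumed)
    return out + sizeof(addr);
}

// Hypothetical helper: dump all 16 registers into consecutive 8-byte slots
// starting at save_area (assumed to be a mapped, writable address < 2 GiB).
static uint8_t *emit_save_all_regs(uint8_t *out, uint32_t save_area)
{
    for (unsigned reg = 0; reg < 16; ++reg)
        out = emit_save_reg(out, reg, save_area + 8 * reg);
    return out;
}
```

After the child has run the generated code, the parent (or the child itself) can compare the saved slots against the values that were loaded before the entry attempt.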

Minimising the probability of crashing

We already mentioned in part 2 that crashing the child process is pretty expensive, since it means starting an entirely new child process. Avoiding crashing as much as possible (and running as many entry attempts as possible in the same child process) could therefore be a viable strategy towards improving the performance of the fuzzer. There are two main parts to this:

  • Saving/restoring state that is needed down the line, e.g. you'll want to save and restore %rsp so that subsequent pushf/popf instructions will continue to work.
  • Recovering from signal handlers, e.g. by installing handlers which can restore the process to a known good state.
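The second point could look roughly like the sketch below, which uses sigsetjmp()/siglongjmp() to get back to a known good state. The function names are made up for this example. An alternate signal stack is used on the assumption that %rsp may be garbage when the signal arrives. Note that this only covers the easy cases: if the test case has clobbered state that the handler itself depends on (e.g. segment registers), recovery may still fail and the child will have to be restarted.

```cpp
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf recovery_point;
static volatile sig_atomic_t last_signal;

static void fault_handler(int signum)
{
    last_signal = signum;
    siglongjmp(recovery_point, 1);
}

// Hypothetical helper: install handlers for the signals a test case is
// likely to raise. Returns 0 on success, -1 on failure.
static int install_fault_handlers(void)
{
    // The handler must not rely on the (possibly bogus) stack pointer
    static char alt_stack[64 * 1024];
    stack_t ss = {};
    ss.ss_sp = alt_stack;
    ss.ss_size = sizeof(alt_stack);
    if (sigaltstack(&ss, NULL) == -1)
        return -1;

    struct sigaction sa = {};
    sa.sa_handler = fault_handler;
    sa.sa_flags = SA_ONSTACK;
    sigemptyset(&sa.sa_mask);

    const int signals[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGTRAP };
    for (int signum : signals) {
        if (sigaction(signum, &sa, NULL) == -1)
            return -1;
    }
    return 0;
}

// Hypothetical helper: run one entry attempt. Returns 0 if it completed
// normally, or the number of the signal that interrupted it.
static int run_guarded(void (*attempt)(void))
{
    // Passing 1 saves the signal mask so that siglongjmp() restores it
    if (sigsetjmp(recovery_point, 1) != 0)
        return last_signal;

    attempt();
    return 0;
}
```

With this in place, the main loop of the child can call run_guarded() repeatedly and only fall back to exiting when recovery is not possible.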

Checking the generated assembly code

It's very easy to make a mistake in the assembly code generation and never notice because the program was crashing anyway and you couldn't tell that you were getting an unexpected result. I had a bug like that where I didn't notice for 2 years that I had accidentally used the wrong byte order when encoding the address of the ljmp operand, so it never actually ran anything in 32-bit compatibility mode — oops!

One very quick and easy way to check the assembly code is to use a disassembly library like udis86 and then verify some of the generated code by hand. All you need is something like this:

#include <udis86.h>


// mem points to the start of the generated code, out just past its end
ud_t u;
ud_init(&u);

ud_set_vendor(&u, UD_VENDOR_INTEL);
ud_set_mode(&u, 64);
ud_set_pc(&u, (uint64_t) mem);
ud_set_input_buffer(&u, (unsigned char *) mem, (char *) out - (char *) mem);

ud_set_syntax(&u, UD_SYN_ATT);

// Print the offset and AT&T-syntax disassembly of each instruction
while (ud_disassemble(&u))
    fprintf(stderr, "  %08lx %s\n", ud_insn_off(&u), ud_insn_asm(&u));

fprintf(stderr, "\n");

KVM/Xen/Intel/AMD interactions

In at least one case, we saw an interaction with KVM where starting any KVM instance would corrupt the size of the GDTR (GDT register) and allow the fuzzer to hit a crash by using one of the segments outside the intended size of the GDT. This turned out to be exploitable to get ring 0 execution. In at least one other case, we saw an interaction when running in a hardware-accelerated nested guest (guest within a guest).

In general, KVM needs to emulate some aspects of the underlying hardware and this adds quite a lot of complexity. It is quite possible that the fuzzer can find bugs in hypervisors such as KVM or Xen, so it is probably valuable to run the fuzzer both on different bare metal CPUs and under a variety of hypervisors.

To create a KVM instance programmatically, see KVM host in a few lines of code by Serge Zaitsev.
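For a quick experiment, creating an (empty) VM through the KVM API takes only a few calls. The following is a minimal sketch and not the code from the article mentioned above; the function name is made up, and whether an empty VM without any vCPUs is enough to trigger a given interaction is an assumption that would need to be checked.

```cpp
#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <unistd.h>

// Hypothetical helper: create an empty KVM virtual machine.
// Returns the VM file descriptor, or -1 if KVM is not available.
static int create_kvm_vm(void)
{
    int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
    if (kvm_fd == -1)
        return -1;

    // The stable KVM API reports version 12 (KVM_API_VERSION)
    if (ioctl(kvm_fd, KVM_GET_API_VERSION, 0) != KVM_API_VERSION) {
        close(kvm_fd);
        return -1;
    }

    int vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, 0);

    // The VM file descriptor stays valid after /dev/kvm is closed
    close(kvm_fd);
    return vm_fd;
}
```

The fuzzer could call this once at startup (keeping the descriptor open) for some fraction of runs and simply carry on if it returns -1.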

A related fun experiment could be to compile the fuzzer for Windows or other operating systems running on x86 and see how they fare. I briefly tested the Linux binary on WSL (Windows Subsystem for Linux) and nothing bad happened, so there's that.

Config/boot options

Config and boot options affect the exact operation of the entry code. On a recent kernel, I get these:

$ grep -o 'CONFIG_[A-Z0-9_]*' arch/x86/entry/entry_64*.S | sort | uniq

There are actually more options hidden behind header files as well. Building multiple kernels with different combinations of these options could help reveal combinations that are broken, perhaps only in edge cases triggered by the fuzzer.

By looking through Documentation/admin-guide/kernel-parameters.txt you can also find a number of boot options that may influence the entry code. Here's an example Python script that generates random combinations of these options, which is useful for building the kernel command line when booting under KVM:

import random

flags = """nopti nospectre_v1 nospectre_v2 spectre_v2_user=off
spec_store_bypass_disable=off l1tf=off mds=off tsx_async_abort=off
kvm.nx_huge_pages=off noapic noclflush nosmap nosmep noexec32 nofxsr
nohugeiomap nosmt noxsave noxsaveopt noxsaves intremap=off
nolapic nomce nopat nopcid norandmaps noreplace-smp nordrand nosep
nosmp nox2apic""".split()

print(' '.join(random.sample(flags, 5)), "nmi_watchdog=%u" % (random.randrange(2), ))


Ftrace inserts some code into the entry code when enabled, e.g. for system call and irqflags tracing. This could be worth testing as well, so I would recommend occasionally tweaking these files (under /sys/kernel/tracing) before running the fuzzer:




We've already seen that ptrace changes the way system call entry/exit is handled (since the process needs to be stopped and the tracer needs to be notified), so it's a good idea to run some fraction of entry attempts under ptrace() using PTRACE_SYSCALL. It could also be interesting to try to tweak some or all of the traced process's registers while it is stopped by ptrace. Getting this completely right is pretty hairy, so I'll consider it out of scope for this blog post.
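To give an idea of the simple case (PTRACE_SYSCALL without touching any registers), here is a minimal sketch. The function names are made up for this example, and the error handling is deliberately crude: the tracer just resumes the child at every stop, counts the syscall stops, and forwards any real signals.

```cpp
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: kill and reap a child we can no longer trace.
static long abandon_child(pid_t child)
{
    kill(child, SIGKILL);
    waitpid(child, NULL, 0);
    return -1;
}

// Hypothetical helper: run attempt() in a child traced with PTRACE_SYSCALL.
// Returns the number of syscall stops observed, or -1 on error.
static long run_traced(void (*attempt)(void))
{
    pid_t child = fork();
    if (child == -1)
        return -1;

    if (child == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(127);
        raise(SIGSTOP);  // wait for the tracer to take over
        attempt();
        _exit(0);
    }

    int status;
    if (waitpid(child, &status, 0) == -1)
        return -1;
    if (!WIFSTOPPED(status))
        return -1;  // the child already exited and has been reaped

    // TRACESYSGOOD marks syscall stops as SIGTRAP | 0x80
    long options = PTRACE_O_TRACESYSGOOD | PTRACE_O_EXITKILL;
    if (ptrace(PTRACE_SETOPTIONS, child, NULL, (void *) options) == -1)
        return abandon_child(child);

    long stops = 0;
    long deliver = 0;  // signal to inject when resuming (0 = none)
    for (;;) {
        if (ptrace(PTRACE_SYSCALL, child, NULL, (void *) deliver) == -1)
            return abandon_child(child);
        if (waitpid(child, &status, 0) == -1)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            return stops;

        deliver = 0;
        if (WSTOPSIG(status) == (SIGTRAP | 0x80))
            ++stops;                     // syscall entry or exit stop
        else
            deliver = WSTOPSIG(status);  // forward real signals (e.g. SIGSEGV)
    }
}
```

Since most test cases end in a fatal signal, forwarding signals to the child matters here: otherwise the child would just re-execute the faulting instruction forever.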


When I do my testing in a VM I prefer to bundle up the program in an initrd and run it as init (pid 1) so that I don't need to copy it onto a filesystem image. You can use a simple script like this:

#! /bin/bash

set -e
set -x

rm -rf initrd/
mkdir initrd/
g++ -static -Wall -std=c++14 -O2 -g -o initrd/init main.cc -lm

(cd initrd/ && (find | cpio -o -H newc)) \
    | gzip -c \
    > initrd.entry-fuzz.gz

If you're using Qemu/KVM, just pass -initrd initrd.entry-fuzz.gz and it will run the fuzzer as the first thing after booting.

Taint checking

If the fuzzer ever does stumble upon some kind of kernel crash or bug, it's useful to make sure we don't miss it! I personally pass oops=panic panic_on_warn panic=-1 on the kernel command line and -no-reboot to Qemu/KVM; this will ensure that any warning or panic will immediately cause Qemu to exit (leaving any diagnostics on the terminal). If you are doing a dedicated bare metal run (e.g. using the initrd method above), you would probably want panic=0 instead, which just hangs the machine.

If you're doing a bare metal run on your regular workstation and don't want your whole machine to hang, another thing you can do is to check whether the kernel becomes tainted (which it does whenever there is a WARNING or a BUG) and simply exit.

// Remember the taint state at startup; the kernel may already be tainted
// for unrelated reasons, so we only care about changes
int tainted_fd = open("/proc/sys/kernel/tainted", O_RDONLY);
if (tainted_fd == -1)
    error(EXIT_FAILURE, errno, "open()");

char tainted_orig_buf[16];
ssize_t tainted_orig_len = pread(tainted_fd, tainted_orig_buf, sizeof(tainted_orig_buf), 0);
if (tainted_orig_len == -1)
    error(EXIT_FAILURE, errno, "pread()");

while (1) {
    // generate + run test case


    // Re-read the taint state after each test case
    char tainted_buf[16];
    ssize_t tainted_len = pread(tainted_fd, tainted_buf, sizeof(tainted_buf), 0);
    if (tainted_len == -1)
        error(EXIT_FAILURE, errno, "pread()");

    if (tainted_len != tainted_orig_len || memcmp(tainted_buf, tainted_orig_buf, tainted_len)) {
        fprintf(stderr, "Kernel became tainted, stopping.\n");
        // TODO: dump hex bytes or disassembly
        exit(EXIT_FAILURE);
    }
}

Network logging

In case the kernel ever crashes and it's not clear from the crash what the problem was, it can be very useful to log everything that is being attempted to the network. I'll just give a quick sketch of a UDP logging scheme:

int main(...)
{
    // Create a UDP socket and connect() it so that a plain write() sends to the logging server
    int udp_socket = socket(AF_INET, SOCK_DGRAM, 0);
    if (udp_socket == -1)
        error(EXIT_FAILURE, errno, "socket(AF_INET, SOCK_DGRAM, 0)");

    struct sockaddr_in remote_addr = {};
    remote_addr.sin_family = AF_INET;
    remote_addr.sin_port = htons(21000);
    // Fill in the IP address of your logging server here
    if (inet_pton(AF_INET, "", &remote_addr.sin_addr.s_addr) != 1)
        error(EXIT_FAILURE, 0, "inet_pton()");

    if (connect(udp_socket, (const struct sockaddr *) &remote_addr, sizeof(remote_addr)) == -1)
        error(EXIT_FAILURE, errno, "connect()");


Then, after the code for each entry/exit has been generated, you can simply dump it on this socket:

write(udp_socket, (char *) mem, out - (uint8_t *) mem);

Hopefully the last data received by the logging server will contain the assembly code that caused the crash. Depending on the exact use case, it might be worth adding some kind of framing so you can easily tell exactly where a test case starts and ends.
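One possible framing scheme is sketched here, purely as an illustration: the header layout, the magic value, and the function name are all made up for this example. Each UDP datagram already carries one test case, so the main benefit of the header is the sequence number, which lets the logging server notice lost or reordered datagrams.

```cpp
#include <stdint.h>
#include <string.h>

#include <vector>

// Hypothetical frame header; all fields are in host byte order, which is
// fine as long as the fuzzer and the logging server are both x86 machines
struct frame_header {
    uint32_t magic;  // marks the start of a test case
    uint32_t seq;    // test case number, to detect lost/reordered datagrams
    uint32_t len;    // number of code bytes that follow
};

static const uint32_t FRAME_MAGIC = 0x5a5a4645;  // arbitrary value

// Hypothetical helper: prepend a header to the generated code so that the
// result can be sent with a single write() on the UDP socket.
static std::vector<uint8_t> frame_test_case(uint32_t seq, const uint8_t *code, uint32_t len)
{
    frame_header header = { FRAME_MAGIC, seq, len };

    std::vector<uint8_t> frame(sizeof(header) + len);
    memcpy(frame.data(), &header, sizeof(header));
    if (len > 0)
        memcpy(frame.data() + sizeof(header), code, len);
    return frame;
}
```

The resulting buffer would then be sent with something like write(udp_socket, frame.data(), frame.size()) instead of writing the raw code bytes.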

Check that the fuzzer catches known bugs

There have been a number of bugs in the entry code over the years. It could be interesting to build some of these old, buggy kernels and run the fuzzer on them to make sure it actually catches those known bugs as a sanity check. We could perhaps also use the time it takes to find the bug as a measure of the fuzzer's efficiency, although we have to be careful not to optimize it to only find these bugs!

Code coverage/instrumentation feedback


One of the things that makes fuzzers like AFL and syzkaller so effective is that they use code coverage to very accurately gauge the effect of tweaking individual bits of a test case. This is usually achieved by compiling C code with a special compiler flag that emits extra code to gather the coverage data. That's a lot trickier with assembly code, especially the entry code, since we don't know exactly what state the CPU is in (and what registers/state we can clobber) without manually inspecting each instruction of the code.

However, if we really want code coverage, there is a way to do it. The x86 instruction set fortunately includes an instruction that takes both an immediate value and an immediate address, and which doesn't affect any other state (e.g. flags): movb $value, (addr). The only thing we need to be careful about is making sure that addr is a compile time constant address which is always mapped to some physical memory and marked present in the page tables so we don't incur a page fault while accessing it. Linux fortunately already has a mechanism for this: fixmaps AKA "compile-time virtual memory allocation". With this we can statically allocate a compile-time constant virtual address which points to the same underlying physical page for all tasks and contexts. Since it is shared between tasks, we would have to clear or otherwise save/restore these values when switching between processes.

By using a combination of C macros and assembler macros we can obtain a fairly non-intrusive coverage primitive that you can drop in anywhere in the entry code to record a taken code path. I have a patch for this, but there are a few corner cases to work out (e.g. it doesn't quite work when SMAP is enabled). Besides, I doubt the x86 maintainers would relish the thought of littering the entry code with these coverage annotations.
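To make the idea a bit more concrete, here is a userspace analogue of such a coverage primitive. This is not the kernel patch mentioned above, just an x86-only sketch with made-up names: an ordinary static array stands in for the fixmap page, and the compiler picks the addressing mode, whereas the entry code would need a compile-time constant address. The important property is the same, though: a single movb with an immediate operand that touches neither registers nor flags.

```cpp
#include <stddef.h>

// Stand-in for the fixmap-backed coverage page (hypothetical)
static volatile unsigned char coverage_map[256];

// Record that coverage point `id` was reached. movb with an immediate
// source and a memory destination clobbers no registers and no flags.
#define COVERAGE_POINT(id) \
    __asm__ volatile("movb $1, %0" : "=m"(coverage_map[(id)]))

// Example of instrumenting two code paths
static int classify(int x)
{
    if (x < 0) {
        COVERAGE_POINT(0);
        return -1;
    }

    COVERAGE_POINT(1);
    return 1;
}
```

After running a test case, the fuzzer would read the map to see which points were hit and then clear it before the next test case.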

One thing that makes instrumentation feedback a lot more complicated on the fuzzer side is that you need a whole system to keep track of test cases, outcomes, and (possibly) which mutations you've applied to each test case. Because of this I've chosen to ignore code coverage for now; in any case, that's a general fuzzing topic and doesn't pertain much to x86 or the entry code in particular.

Performance counters/hardware feedback

A completely different approach to gathering code coverage would be to use performance counters. I know of two recent projects that are doing this:

The big benefit here is obviously that no instrumentation (modification of the kernel) is required. Probably the biggest potential drawback is that performance counters are not completely deterministic (perhaps due to external factors like hardware interrupts or thermal throttling). Perhaps it also won't really work for the entry code, since only a really short amount of time is spent in the assembly code. In any case, here are a couple of links for further reading:

Bugs found

Ksplice — also, we're hiring!

Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities that were discovered in recent years:

Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively.

If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team, spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch- and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection or porting Ksplice to new architectures like ARM.

Vegard Nossum
