Fuzzing the Linux kernel (x86) entry code, Part 2 of 3

August 26, 2020 | 27 minute read

In part 1 of this series we looked at what the Linux kernel entry code does and how to JIT-assemble and call a system call. In this part, we'll have a closer look at flag registers, the stack pointer, segment registers, debug registers, and different ways to enter the kernel.

More flags (%rflags)

The Direction flag is not the only flag that could be interesting to play with. The Wikipedia article for %rflags lists a couple of others that look interesting to me:

  • bit 8: Trap flag (used for single-step debugging)
  • bit 18: Alignment check

Most of the arithmetic-related flags (carry, parity, etc.) are not so interesting because they change a lot during normal operation of regular code, which means the kernel's handling of those is probably quite well tested. Some of the flags that could be interesting (e.g. the Interrupt Enable flag) may not be modified by userspace, so it's not very useful to even try.

The Trap flag is interesting because when set, the CPU delivers a debug exception after every instruction, which naturally also interferes with the normal operation of the entry code.

The Alignment check flag is interesting because it causes the CPU to deliver an alignment check exception when a misaligned pointer is dereferenced. Although the CPU is not supposed to perform alignment checking when executing in ring 0, it could still be interesting to see whether there are any bugs related to entering the kernel because of an alignment check exception (we'll get back to this later).

The Wikipedia article gives a procedure for modifying these flags, but we can do a tiny bit better:

   0:   9c                              pushfq
   1:   48 81 34 24 00 01 00 00         xorq   $0x100,(%rsp)
   9:   48 81 34 24 00 04 00 00         xorq   $0x400,(%rsp)
  11:   48 81 34 24 00 00 04 00         xorq   $0x40000,(%rsp)
  19:   9d                              popfq  

This code pushes the contents of %rflags onto the stack, modifies the flag bits directly in the value on the stack, and then pops that value back into %rflags. We actually have a choice here between using orq and xorq; I'm going with xorq since it will toggle whatever value was already in the register. This way, if we do multiple system calls (or kernel entries) in a row, we can toggle the flags at random without having to care what the existing values were.

Since we're modifying %rflags anyway, we might as well bake the Direction Flag change into it and combine the modification of all three flags together into a single instruction. It's a tiny optimization, but there is no reason not to do it. The result is something like this:

// pushfq
*out++ = 0x9c;

uint32_t mask = 0;

// trap flag
mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 8;

// direction flag
mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 10;

// alignment check
mask |= std::uniform_int_distribution<unsigned int>(0, 1)(rnd) << 18;

// xorq $mask, 0(%rsp)
*out++ = 0x48;
*out++ = 0x81;
*out++ = 0x34;
*out++ = 0x24;
*out++ = mask;
*out++ = mask >> 8;
*out++ = mask >> 16; 
*out++ = mask >> 24;

// popfq 
*out++ = 0x9d;

If we don't want our process to be immediately killed with SIGTRAP when the trap flag is set, we need to register a signal handler that will effectively ignore this signal (apparently using SIG_IGN is not enough):

static void handle_child_sigtrap(int signum, siginfo_t *siginfo, void *ucontext)
{
    // this gets called when TF is set in %rflags; do nothing
}

struct sigaction sigtrap_act = {};
sigtrap_act.sa_sigaction = &handle_child_sigtrap;
sigtrap_act.sa_flags = SA_SIGINFO | SA_ONSTACK;
if (sigaction(SIGTRAP, &sigtrap_act, NULL) == -1)
    error(EXIT_FAILURE, errno, "sigaction(SIGTRAP)");

You might wonder about the reason for the SA_ONSTACK flag; we will talk about that in the next section!

Stack pointer (%rsp)

After modifying %rflags, we don't really need to use the stack again, which means we are free to change it without impacting the execution of our program. Why would we want to change the stack pointer, though? It's not like the kernel will use our userspace stack for anything, right? Well, actually, it might:

  • Debugging tools like ftrace and perf occasionally dereference the userspace stack, e.g. during system call tracing. In fact, I found at least two different bugs in this area.
  • When delivering signals to userspace, the signal handler's stack frame is created by the kernel and usually located just above the interrupted thread's current stack pointer.

  • If, by some mistake, %rsp is accessed directly by the kernel, it might not be noticed during normal operation, since the stack pointer usually always points to a valid address. To catch this kind of bug, we can simply point it at a non-mapped address (or perhaps even a kernel address!).

To help us test various potentially interesting values of the stack pointer, we can define a helper:

static void *page_not_present;
static void *page_not_writable;
static void *page_not_executable;

static uint64_t get_random_address()
{
    // very occasionally hand out a non-canonical address
    if (std::uniform_int_distribution<int>(0, 100)(rnd) < 5)
        return 1UL << 63;

    uint64_t value = 0;

    switch (std::uniform_int_distribution<int>(0, 4)(rnd)) {
    case 0:
    case 1:
        value = (uint64_t) page_not_present;
        break;
    case 2:
        value = (uint64_t) page_not_writable;
        break;
    case 3:
        value = (uint64_t) page_not_executable;
        break;
    case 4: {
        static const uint64_t kernel_pointers[] = {
            ...
        };

        value = kernel_pointers[std::uniform_int_distribution<int>(0, ARRAY_SIZE(kernel_pointers) - 1)(rnd)];

        // random ~2MiB offset
        value += PAGE_SIZE * std::uniform_int_distribution<unsigned int>(0, 512)(rnd);
        break;
    }
    }

    // occasionally intentionally misalign it
    if (std::uniform_int_distribution<int>(0, 100)(rnd) < 25)
        value += std::uniform_int_distribution<int>(-7, 7)(rnd);

    return value;
}

int main(...)
{
    page_not_present = mmap(NULL, PAGE_SIZE, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    page_not_writable = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    page_not_executable = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    ...
}

Here I used a few kernel pointers that I found in /proc/kallsyms on my system. They are not necessarily very good choices, but you get the idea. As I mentioned earlier, there is a balance to be struck here: we want values crazy enough that nobody ever thought of handling them (we are, after all, trying to find corner cases), but without getting lost in the huge sea of uninteresting values. We could just pick any 64-bit value uniformly at random, but that is exceedingly unlikely to ever give us a valid pointer (most of the results would be non-canonical addresses). Part of the art of fuzzing is smoking out the relevant corner cases by making informed guesses about what will and what won't matter.

Now it's just a matter of setting the value, which we can luckily do by loading the 64-bit value directly into %rsp:

movq $0x12345678aabbccdd, %rsp

In code:

uint64_t rsp = get_random_address();

// movq $imm, %rsp
*out++ = 0x48;
*out++ = 0xbc;
for (int i = 0; i < 8; ++i)
    *out++ = rsp >> (8 * i);

However, there is an important interaction with %rflags mentioned above that we need to take care of. The problem is that once we enable single-stepping in %rflags, the CPU will deliver a debug exception after every subsequently executed instruction. The kernel handles the debug exception by delivering a SIGTRAP signal to the process. By default, this signal is delivered on the stack given by the value of %rsp when the trap is delivered... and if %rsp is not valid, the kernel instead kills the process with an uncatchable SIGSEGV.

In order to deal with situations like this, the kernel does offer a mechanism to set %rsp to a known-good value when delivering the signal: sigaltstack(). All we have to do is use it like this:

stack_t ss = {};

ss.ss_sp = malloc(SIGSTKSZ);
if (!ss.ss_sp)
    error(EXIT_FAILURE, errno, "malloc()");

ss.ss_size = SIGSTKSZ;
ss.ss_flags = 0;
if (sigaltstack(&ss, NULL) == -1)
    error(EXIT_FAILURE, errno, "sigaltstack()");

and then pass SA_ONSTACK in the sa_flags of the sigaction() call for SIGTRAP.

Segment registers

When it comes to segment registers, you will frequently see the claim that they are not really used anymore on 64-bit. However, that is not the whole truth. It is true that segment base addresses and limits are mostly ignored in 64-bit mode, but almost everything else is still relevant. In particular, some things that matter for us are:

  • %cs, %ds, %es, and %ss must hold valid 16-bit segment selectors referring to valid entries in the GDT (global descriptor table) or the LDT (local descriptor table).

  • %cs cannot be loaded using the mov instruction, but we can use the ljmp (far/long jump) instruction.

  • The CPL (current privilege level) field of %cs is the privilege level that the CPU is executing at. Normally, 64-bit userspace processes run with a %cs of 0x33, which is index 6 of the GDT and privilege level 3, and the kernel runs with a %cs of 0x10, which is index 2 of the GDT and privilege level 0 (hence the term "ring 0").

  • We can actually install entries in the LDT using the modify_ldt() system call, but note that the kernel does sanitize the entries so that we can't for example create a call gate pointing to a segment with DPL 0.

  • %fs and %gs have base addresses specified by MSRs. These registers are typically used for TLS (Thread-Local Storage) and per-CPU data by userspace processes and the kernel, respectively. We can change the values of these registers using the arch_prctl() system call. On some CPUs/kernels, we can use the wrfsbase and wrgsbase instructions.

  • Using the mov or pop instructions to set %ss causes the CPU to inhibit interrupts, NMIs, breakpoints, and single-stepping traps for one instruction following the mov or pop instruction. If this next instruction causes an entry into the kernel, those interrupts, NMIs, breakpoints, or single-stepping traps will be delivered after the CPU has started executing in kernel space. This was the source of CVE-2018-8897, where the kernel did not properly handle this case.


Since we'll potentially load segment registers with segments from the LDT, we might as well start by setting up the LDT. There is no glibc wrapper for modify_ldt(), so we have to call it using the syscall() function:

#include <asm/ldt.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>

for (unsigned int i = 0; i < 4; ++i) {
    struct user_desc desc = {};
    desc.entry_number = i;
    desc.base_addr = std::uniform_int_distribution<unsigned long>(0, ULONG_MAX)(rnd);
    desc.limit = std::uniform_int_distribution<unsigned int>(0, UINT_MAX)(rnd);
    desc.seg_32bit = std::uniform_int_distribution<int>(0, 1)(rnd);
    desc.contents = std::uniform_int_distribution<int>(0, 3)(rnd);
    desc.read_exec_only = std::uniform_int_distribution<int>(0, 1)(rnd);
    desc.limit_in_pages = std::uniform_int_distribution<int>(0, 1)(rnd);
    desc.seg_not_present = std::uniform_int_distribution<int>(0, 1)(rnd);
    desc.useable = std::uniform_int_distribution<int>(0, 1)(rnd);

    syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
}

We may want to check the return value here; we shouldn't be generating invalid LDT entries, so it's useful to know if we ever do.

static uint16_t get_random_segment_selector()
{
    unsigned int index;

    switch (std::uniform_int_distribution<unsigned int>(0, 2)(rnd)) {
    case 0:
        // The LDT is small, so favour smaller indices
        index = std::uniform_int_distribution<unsigned int>(0, 3)(rnd);
        break;
    case 1:
        // Linux defines 32 GDT entries by default
        index = std::uniform_int_distribution<unsigned int>(0, 31)(rnd);
        break;
    case 2:
        // Max table size
        index = std::uniform_int_distribution<unsigned int>(0, 255)(rnd);
        break;
    }

    unsigned int ti = std::uniform_int_distribution<unsigned int>(0, 1)(rnd);
    unsigned int rpl = std::uniform_int_distribution<unsigned int>(0, 3)(rnd);

    return (index << 3) | (ti << 2) | rpl;
}

Data segment (%ds)

And here is how we use get_random_segment_selector() (only showing %ds here):

if (std::uniform_int_distribution<unsigned int>(0, 100)(rnd) < 20) {
    uint16_t sel = get_random_segment_selector();

    // movw $imm, %ax
    *out++ = 0x66;
    *out++ = 0xb8;
    *out++ = sel;
    *out++ = sel >> 8;

    // movw %ax, %ds
    *out++ = 0x8e;
    *out++ = 0xd8;
}

%fs and %gs

For %fs and %gs we need to use the system call arch_prctl(). In normal (non-JIT-assembled) code, this would be:

#include <asm/prctl.h>
#include <sys/prctl.h>


syscall(SYS_arch_prctl, ARCH_SET_FS, get_random_address());
syscall(SYS_arch_prctl, ARCH_SET_GS, get_random_address());

Unfortunately, doing this is very likely to cause glibc/libstdc++ to crash on any code that uses thread-local storage (which may happen even as soon as the second get_random_address() call). If we want to generate the system calls to do it, we can do it more easily with a bit of support code:

enum machine_register {
    RAX, RCX, RDX, RBX, RSP, RBP, RSI, RDI,  // 0..7
    R8, R9, R10, R11, R12, R13, R14, R15,    // 8..15
};

const unsigned int REX = 0x40;
const unsigned int REX_B = 0x01;
const unsigned int REX_W = 0x08;

static uint8_t *emit_mov_imm64_reg(uint8_t *out, uint64_t imm, machine_register reg)
{
    *out++ = REX | REX_W | (REX_B * (reg >= 8));
    *out++ = 0xb8 | (reg & 7);
    for (int i = 0; i < 8; ++i)
        *out++ = imm >> (8 * i);

    return out;
}

static uint8_t *emit_call_arch_prctl(uint8_t *out, int code, unsigned long addr)
{
    // int arch_prctl(int code, unsigned long addr);
    out = emit_mov_imm64_reg(out, SYS_arch_prctl, RAX);
    out = emit_mov_imm64_reg(out, code, RDI);
    out = emit_mov_imm64_reg(out, addr, RSI);

    // syscall
    *out++ = 0x0f;
    *out++ = 0x05;

    return out;
}

Note that in addition to needing a few registers to do the system call itself, the syscall instruction also overwrites %rcx with the return address (i.e. the address of the instruction after the syscall instruction), so we'll probably want to make these calls before anything else.

Stack segment (%ss)

%ss should be the last register we set before the instruction that enters the kernel, so that we're sure to see the effect of any delayed traps or exceptions. We can use the same code as for %ds above; the reason we don't use popw %ss is that we may have already set %rsp to point to a "weird" location, so the stack is probably not usable at that point.

32-bit compatibility mode (%cs)

Fun fact: you can actually change your 64-bit process into a 32-bit process on the fly; there's no need to even tell the kernel about it. The CPU includes a mechanism for this which is allowed in ring 3: far jumps.

In particular, the instruction we'll be using is "jump far, absolute indirect, address given in m16:32". Since this can be a bit tricky to work out the exact syntax and bytes for, I'll give a full assembly example first:

    .global main
main:
    ljmpl *target

.code32
1:
    movl $1, %eax # __NR_exit == 1 from asm/unistd_32.h
    movl $0, %ebx # status == 0
    sysenter

target:
    .long 1b # address (32 bits)
    .word 0x23 # segment selector (16 bits)

Here, the ljmpl instruction uses the memory at our target label, which holds a 32-bit instruction pointer followed by a 16-bit segment selector (here pointing to the 32-bit code segment for userspace, 0x23). The target address, 1b, is not a hexadecimal value; it is a reference to the label 1, where the b stands for "backwards". The code at this label is 32-bit, which is why we use sysenter rather than the syscall instruction we used before. The calling conventions are also different and, in fact, we need to use the system call numbers from the 32-bit ABI (SYS_exit is 60 on 64-bit, but 1 here). Another fun thing is that if you run this under strace, you will see something like this:

write(1, "\366\242[\204\374\177\0\0\0\0\0\0\0\0\0\0\376\242[\204\374\177\0\0\t\243[\204\374\177\0\0"..., 140722529079224 <unfinished ...>
+++ exited with 0 +++

strace clearly still thought we were a 64-bit process and that we had called write(), when we were really calling exit() (as evidenced by the last line, which plainly tells us the process exited).

Now that we know what bytes to use, we can port the whole thing to C. Since both the ljmp memory operand and the target address are 32 bits, we need to make sure that both are located at addresses where the upper 32 bits are all 0. The best way to do this is to allocate memory using mmap() and the MAP_32BIT flag.

struct ljmp_target {
    uint32_t rip;
    uint16_t cs;
} __attribute__((packed));

struct data {
    struct ljmp_target ljmp;
    uint32_t bound[2]; // lower/upper bound, used later for the 'bound' instruction
};

static struct data *data;

int main(...)
{
    void *addr = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    if (addr == MAP_FAILED)
        error(EXIT_FAILURE, errno, "mmap()");

    data = (struct data *) addr;


void emit_code()
{
    ...

    // ljmp *target
    *out++ = 0xff;
    *out++ = 0x2c;
    *out++ = 0x25;
    for (unsigned int i = 0; i < 4; ++i)
        *out++ = ((uint64_t) &data->ljmp) >> (8 * i);

    // cs:rip (jump target; in our case, the next instruction)
    data->ljmp.cs = 0x23;
    data->ljmp.rip = (uint64_t) out;
}

A couple of things to note here:

  • This changes the CPU mode, which means subsequent instructions must be valid in 32-bit (otherwise, you may get a general protection fault or invalid opcode exception).

  • The instruction sequence we used to load segment registers above (e.g. movw ..., %ax; movw %ax, %ss) has the exact same encoding on 32-bit and 64-bit, so we can execute it after switching to a 32-bit code segment without any trouble — this is particularly useful for ensuring that we can still load %ss just before entering the kernel.

  • We can choose whether to always change to segment 4 (segment selector 0x23), or we can try changing to a random segment selector (e.g. using get_random_segment_selector()). If we select a random one, we might not even know whether we would still be executing in 32-bit or 64-bit mode.

  • We may want to try to jump back to our normal code segment (segment 6, segment selector 0x33) after returning from the kernel, assuming we didn't exit, crash, or get killed. The procedure is exactly the same modulo the different segment selector.

Debug registers (%dr0, etc.)

Debug registers on x86 are used to set code breakpoints and data watchpoints. The registers %dr0 through %dr3 are used to set the actual breakpoint/watchpoint addresses and register %dr7 is used to control how those four addresses are used (whether they are breakpoints or watchpoints, etc.).

Setting debug registers is a bit trickier than what we've seen so far because you can't load them directly from userspace. As with changing the LDT, the kernel wants to make sure that we don't, for example, set a breakpoint or watchpoint on a kernel address; but even more importantly, the CPU itself doesn't allow ring 3 to modify these registers directly. The only way I know of to set the debug registers is using ptrace().

ptrace() is a notoriously difficult API to use. There is a lot of implicit state that the tracer needs to track manually and a lot of corner cases around signal handling. Luckily, in this case, we can get by with just attaching to the child process, setting the debug registers, and detaching; the debug register changes will persist even after we stop tracing:

#include <sys/ptrace.h>
#include <sys/user.h>

#include <signal.h>
#include <stddef.h> // for offsetof()

int main(...)
{
    pid_t child = fork();
    if (child == -1)
        error(EXIT_FAILURE, errno, "fork()");

    if (child == 0) {
        // make us a tracee of the parent
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1)
            error(EXIT_FAILURE, errno, "ptrace(PTRACE_TRACEME)");

        // give the parent control
        raise(SIGTRAP);

        ...
    }

    // parent; wait for child to stop
    while (1) {
        int status;
        if (waitpid(child, &status, 0) == -1) {
            if (errno == EINTR)
                continue;

            error(EXIT_FAILURE, errno, "waitpid()");
        }

        if (WIFEXITED(status))
            exit(WEXITSTATUS(status));
        if (WIFSIGNALED(status))
            exit(EXIT_FAILURE);

        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
            break;
    }

    // set debug registers and stop tracing
    if (ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[0]), ...) == -1)
        error(EXIT_FAILURE, errno, "ptrace(PTRACE_POKEUSER)");
    if (ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), ...) == -1)
        error(EXIT_FAILURE, errno, "ptrace(PTRACE_POKEUSER)");
    if (ptrace(PTRACE_DETACH, child, 0, 0) == -1)
        error(EXIT_FAILURE, errno, "ptrace(PTRACE_DETACH)");


Even in this small example, waiting for the child to stop is a bit fiddly. It's always possible that waitpid() returns before the child has reached raise(SIGTRAP), e.g. if it was killed by some external process. We handle those cases by simply exiting as well.

Since setting debug registers requires tracing, signals, and multiple context switches (which are all pretty slow), I would suggest doing this just once for each child process and then letting the child run multiple attempts at entering the kernel in a row.

Setting any of the debug registers could fail, so in the actual fuzzer we would probably want to ignore any errors and set %dr7 one breakpoint at a time, e.g. something like:

// stddef.h offsetof() doesn't always allow non-const array indices,
// so precompute them here.
const unsigned int debugreg_offsets[] = {
    offsetof(struct user, u_debugreg[0]),
    offsetof(struct user, u_debugreg[1]),
    offsetof(struct user, u_debugreg[2]),
    offsetof(struct user, u_debugreg[3]),
};

for (unsigned int i = 0; i < 4; ++i) {
    // try random addresses until we succeed
    while (true) {
        unsigned long addr = get_random_address();
        if (ptrace(PTRACE_POKEUSER, child, debugreg_offsets[i], addr) != -1)
            break;
    }

    // Condition:
    // 0 - execution
    // 1 - write
    // 2 - (unused)
    // 3 - read or write
    unsigned int condition = std::uniform_int_distribution<unsigned int>(0, 2)(rnd);
    if (condition == 2)
        condition = 3;

    // Size:
    // 0 - 1 byte
    // 1 - 2 bytes
    // 2 - 8 bytes
    // 3 - 4 bytes
    unsigned int size = std::uniform_int_distribution<unsigned int>(0, 3)(rnd);

    // DR7 layout: local-enable bit for breakpoint i at bit 2i,
    // condition (R/W) at bits 16 + 4i, size (LEN) at bits 18 + 4i.
    unsigned long dr7 = ptrace(PTRACE_PEEKUSER, child, offsetof(struct user, u_debugreg[7]), 0);
    dr7 &= ~((3UL << (2 * i)) | (15UL << (16 + 4 * i)));
    dr7 |= (1UL << (2 * i)) | ((unsigned long) condition << (16 + 4 * i)) | ((unsigned long) size << (18 + 4 * i));
    ptrace(PTRACE_POKEUSER, child, offsetof(struct user, u_debugreg[7]), dr7);
}

Entering the kernel

We already saw how to emit the code for making a system call in part 1 of this blog series; here, we use the same basic approach, but also take into account all the other ways to enter the kernel. As I mentioned earlier, the syscall instruction is not the only way to enter the kernel on 64-bit; it's not even the only way to make a system call. For system calls, we have the following options:

  • int $0x80
  • sysenter
  • syscall

It can also be useful to look at the table of hardware-generated exceptions. Many of these exceptions are handled slightly differently from system calls and regular interrupts; for example, when you try to load a segment register with an invalid segment selector, the CPU will push an error code onto the (kernel) stack.

We can trigger many of the exceptions, but not all of them. For example, it is trivial to generate a division by zero by simply carrying out a division by zero, but we can't easily generate an NMI on demand. (That said, there may be things we can do to make NMIs more likely to happen, albeit in an uncontrollable fashion: if we are testing the kernel in a VM, we can inject NMIs from the host, or we can enable the kernel NMI watchdog feature.)

enum entry_type {
    // system calls + software interrupts
    ENTRY_SYSCALL,
    ENTRY_SYSENTER,
    ENTRY_INT,
    ENTRY_INT_80,
    ENTRY_INT3,

    // exceptions
    ENTRY_DE, // Divide error
    ENTRY_OF, // Overflow
    ENTRY_BR, // Bound range exceeded
    ENTRY_UD, // Undefined opcode
    ENTRY_SS, // Stack segment fault
    ENTRY_GP, // General protection fault
    ENTRY_PF, // Page fault
    ENTRY_MF, // x87 floating-point exception
    ENTRY_AC, // Alignment check

    NR_ENTRY_TYPES,
};

enum entry_type type = (enum entry_type) std::uniform_int_distribution<int>(0, NR_ENTRY_TYPES - 1)(rnd);

// Some entry types require a setup/preamble; do that here
switch (type) {
case ENTRY_DE:
    // xor %eax, %eax
    *out++ = 0x31;
    *out++ = 0xc0;
    break;
case ENTRY_MF:
    // pxor %xmm0, %xmm0
    *out++ = 0x66;
    *out++ = 0x0f;
    *out++ = 0xef;
    *out++ = 0xc0;
    break;
case ENTRY_BR:
    // xor %eax, %eax
    *out++ = 0x31;
    *out++ = 0xc0;
    break;
case ENTRY_SS: {
        uint16_t sel = get_random_segment_selector();

        // movw $imm, %bx
        *out++ = 0x66;
        *out++ = 0xbb;
        *out++ = sel;
        *out++ = sel >> 8;
    }
    break;
default:
    // do nothing
    break;
}

switch (type) {
    // system calls + software interrupts

case ENTRY_SYSCALL:
    // syscall
    *out++ = 0x0f;
    *out++ = 0x05;
    break;
case ENTRY_SYSENTER:
    // sysenter
    *out++ = 0x0f;
    *out++ = 0x34;
    break;
case ENTRY_INT:
    // int $x
    *out++ = 0xcd;
    *out++ = std::uniform_int_distribution<int>(0, 255)(rnd);
    break;
case ENTRY_INT_80:
    // int $0x80
    *out++ = 0xcd;
    *out++ = 0x80;
    break;
case ENTRY_INT3:
    // int3
    *out++ = 0xcc;
    break;

    // exceptions

case ENTRY_DE:
    // div %eax
    *out++ = 0xf7;
    *out++ = 0xf0;
    break;
case ENTRY_OF:
    // into (32-bit only!)
    *out++ = 0xce;
    break;
case ENTRY_BR:
    // bound %eax, data
    *out++ = 0x62;
    *out++ = 0x05;
    for (unsigned int i = 0; i < 4; ++i)
        *out++ = ((uint64_t) &data->bound) >> (8 * i);
    break;
case ENTRY_UD:
    // ud2
    *out++ = 0x0f;
    *out++ = 0x0b;
    break;
case ENTRY_SS:
    // Load %ss again, with a random segment selector (this is not
    // guaranteed to raise #SS, but most likely it will). The reason
    // we don't just rely on the load above to do it is that it could
    // be interesting to trigger #SS with a "weird" %ss too.

    // movw %bx, %ss
    *out++ = 0x8e;
    *out++ = 0xd3;
    break;
case ENTRY_GP:
    // wrmsr
    *out++ = 0x0f;
    *out++ = 0x30;
    break;
case ENTRY_PF:
    // testl %eax, (page_not_present)
    *out++ = 0x85;
    *out++ = 0x04;
    *out++ = 0x25;
    for (int i = 0; i < 4; ++i)
        *out++ = ((uint64_t) page_not_present) >> (8 * i);
    break;
case ENTRY_MF:
    // divss %xmm0, %xmm0
    *out++ = 0xf3;
    *out++ = 0x0f;
    *out++ = 0x5e;
    *out++ = 0xc0;
    break;
case ENTRY_AC:
    // testl %eax, (page_not_writable + 1)
    *out++ = 0x85;
    *out++ = 0x04;
    *out++ = 0x25;
    for (int i = 0; i < 4; ++i)
        *out++ = ((uint64_t) page_not_writable + 1) >> (8 * i);
    break;
}

Putting it all together

We now have almost everything we need to actually start some fuzzing! Just a couple more things, though...

If you run the code we have so far, you'll quickly run into some issues. First of all, many of the instructions we've used may cause crashes (and deliberately so), which makes the fuzzer slow. By installing signal handlers for a few common terminating signals (SIGBUS, SIGSEGV, etc.), we can skip over the faulting instruction and (hopefully) continue executing within the same child process.

Secondly, some of the system calls we make may have unintended side effects. In particular, we don't really want to block on I/O, since that would stop the fuzzer in its tracks. One way to deal with this is to install an interval timer alarm to detect when a child process has hung. Another is to filter out certain system calls which are known to block (e.g. read(), select(), sleep(), etc.). Other "unfortunate" system calls might be fork(), exit(), and kill(). It's less likely that the fuzzer is able to delete files or otherwise mess up the system, but we might still want to use some form of sandboxing (e.g. setuid(65534)).

If you just want to see the final result, here is a link to the code (made available under the Universal Permissive License).

To be continued...

Be sure to check out part 3, where we discuss further improvements and ideas for the fuzzer.

Ksplice — also, we're hiring!

Ksplice is Oracle's technology for patching security vulnerabilities in the Linux kernel without rebooting. Ksplice supports patching entry code, and we have shipped several updates that do exactly this, including workarounds for many of the CPU vulnerabilities discovered in recent years.

Some of these updates were pretty challenging for various reasons and required ingenuity and a lot of attention to detail. Part of the reason we decided to write a fuzzer for the entry code was so that we could test our updates more effectively.

If you've enjoyed this blog post and you think you would enjoy working on these kinds of problems, feel free to drop us a line at ksplice-support_ww@oracle.com. We are a diverse, fully remote team, spanning 3 continents. We look at a ton of Linux kernel patches and ship updates for 5-6 different distributions, totalling more than 1,100 unique vulnerabilities in a year. Of course, nobody can ever hope to be familiar with every corner of the kernel (and vulnerabilities can appear anywhere), so patch- and source-code comprehension are essential skills. We also patch important userspace libraries like glibc and OpenSSL, which enables us to update programs using those libraries without restarting anything and without requiring any special support in those applications themselves. Other projects we've worked on include Known Exploit Detection or porting Ksplice to new architectures like ARM.

Vegard Nossum
