Monday Jun 27, 2011

Building a physical CPU load meter

I built this analog CPU load meter for my dev workstation:

Physical CPU load meter

All I did was drill a few holes into the CPU and probe the power supply lines...

Okay, I lied. This is actually a fun project that would make a great intro to embedded electronics, or a quick afternoon hack for someone with a bit of experience.

The parts

The main components are:

  • Current meter: I got this at MIT Swapfest. The scale printed on the face is misleading: the meter itself measures only about 600 microamps in each direction. (It's designed for use with a circuit like this one). We can determine the actual current scale by connecting (in series) the analog meter, a variable resistor, and a digital multimeter, and driving them from a 5 volt power supply. This lets us adjust and reliably measure the current flowing through the analog meter.

  • Arduino: This little board has a 16 MHz processor, with direct control of a few dozen input/output pins, and a USB serial interface to a computer. In our project, it will take commands over USB and produce the appropriate amount of current to drive the meter. We're not even using most of the capabilities of the Arduino, but it's hard to beat as a platform for rapid development.

  • Resistor: The Arduino board is powered over USB; its output pins produce 5 volts for a logic 'high'. We want this 5 volt potential to push 600 microamps of current through the meter, according to the earlier measurement. Using Ohm's law we can calculate that we'll need a resistance of about 8.3 kilohms. Or you can just measure the variable resistor from earlier.
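As a quick check of the Ohm's-law arithmetic, here's the calculation in Python (the 5 volt and 600 microamp figures are the measurements described above):

```python
# R = V / I: the resistance needed so that 5 volts pushes 600 microamps
voltage = 5.0        # volts, Arduino logic 'high'
full_scale = 600e-6  # amps, the meter's measured full-scale current
resistance = voltage / full_scale
print(resistance)    # about 8333 ohms, i.e. roughly 8.3 kilohms
```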

We'll also use some wire, solder, and tape.

Building it

The resistor goes in series with the meter. I just soldered it directly to the back:

Back of the meter

Some tape over these components prevents them from shorting against any of the various junk on my desk. Those wires run to the Arduino, hidden behind my monitor, which is connected to the computer by USB:

The Arduino

That's it for hardware!

Code for the Arduino

The Arduino IDE will compile code written in a C-like language and upload it to the Arduino board over USB. Here's our program:

#define DIRECTION 2
#define MAGNITUDE 3

void setup() {
    pinMode(DIRECTION, OUTPUT);
    Serial.begin(57600);
}

void loop() {
    int x = Serial.read();
    if (x == -1)
        return;

    if (x < 128) {
        digitalWrite(DIRECTION, LOW);
        analogWrite (MAGNITUDE, 2*(127 - x));
    } else {
        digitalWrite(DIRECTION, HIGH);
        analogWrite (MAGNITUDE, 255 - 2*(x - 128));
    }
}

When it turns on, the Arduino will execute setup() once, and then call loop() over and over, forever. On each iteration, we try to read a byte from the serial port. A value of -1 indicates that no byte is available, so we return and try again a moment later. Otherwise, we translate a byte value from 0 to 255 into a meter deflection between −600 and 600 microamps.

Pins 0 and 1 are used for serial communication, so I connected the meter to pins 2 and 3, and named them DIRECTION and MAGNITUDE respectively. When we call analogWrite on the MAGNITUDE pin with a value between 0 and 255, we get a proportional voltage between 0 and 5 volts. Actually, the Arduino fakes this by alternating between 0 and 5 volts very rapidly, but our meter is a slow mechanical object and won't know the difference.

Suppose the MAGNITUDE pin is at some intermediate voltage between 0 and 5 volts. If the DIRECTION pin is low (0 V), conventional current will flow from MAGNITUDE to DIRECTION through the meter. If we set DIRECTION high (5 V), current will flow from DIRECTION to MAGNITUDE. So we can send current through the meter in either direction, and we can control the amount of current by controlling the effective voltage at MAGNITUDE. This is all we need to make the meter display whatever reading we want.
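To make the scheme concrete, here's a small Python model of the drive circuit (a sketch, not the Arduino code; it treats the PWM output as its average voltage, and uses the 8.3 kΩ series resistance and the byte mapping from the sketch above):

```python
# Model of the two-pin drive scheme. x is the byte sent by the host (0..255);
# returns the meter current in microamps, positive when current flows from
# MAGNITUDE to DIRECTION through the meter.
R = 8333.0  # series resistance in ohms, from the parts list

def meter_current(x):
    if x < 128:
        direction = 0.0                          # DIRECTION pin low (0 V)
        magnitude = 5.0 * (2 * (127 - x)) / 255  # analogWrite duty as voltage
    else:
        direction = 5.0                          # DIRECTION pin high (5 V)
        magnitude = 5.0 * (255 - 2 * (x - 128)) / 255
    return (magnitude - direction) / R * 1e6

print(meter_current(0))    # near +600: full deflection one way
print(meter_current(255))  # near -600: full deflection the other way
print(meter_current(127))  # 0: meter centered
```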

Code for the Linux host

On Linux we can get CPU load information from the proc special filesystem:

keegan@lyle$ head -n 1 /proc/stat
cpu  965348 22839 479136 88577774 104454 5259 24633 0 0

These numbers tell us how much time the system's CPUs have spent in each of several states:

  1. user: running normal user processes
  2. nice: running user processes of low priority
  3. system: running kernel code, often on behalf of user processes
  4. idle: doing nothing because all processes are sleeping
  5. iowait: doing nothing because all runnable processes are waiting on I/O devices
  6. irq: handling asynchronous events from hardware
  7. softirq: performing tasks deferred by irq handlers
  8. steal: not running, because we're in a virtual machine and some other VM is using the physical CPU
  9. guest: acting as the host for a running virtual machine

The numbers in /proc/stat are cumulative totals since boot, measured in arbitrary time units. We can read the file twice and subtract, in order to get a measure of where CPU time was spent recently. Then we'll use the fraction of time spent in states other than idle as a measure of CPU load, and send this to the Arduino.

We'll do all this with a small Python script. The pySerial library lets us talk to the Arduino over USB serial. We'll configure it for 57,600 bits per second, the same rate specified in the Arduino's setup() function. Here's the code:

#!/usr/bin/env python

import serial
import time

port = serial.Serial('/dev/ttyUSB0', 57600)

old = None
while True:
    with open('/proc/stat') as stat:
        new = map(float, stat.readline().strip().split()[1:])
    if old is not None:
        diff = [n - o for n, o in zip(new, old)]
        idle = diff[3] / sum(diff)
        port.write(chr(int(255 * (1 - idle))))
    old = new
    time.sleep(0.25)

That's it!

That's all it takes to make a physical, analog CPU meter. It's been done before and will be done again, but we're interested in what you'd do (or have already done!) with the concept. You could measure website hits, or load across a whole cluster, or your profits from trading Bitcoins. One standard Arduino can run at least six meters of this type (being the number of pins which support analogWrite), and a number of switches, knobs, buzzers, and blinky lights besides. If your server room has a sweet control panel, we'd love to see a picture!


Monday Feb 21, 2011

Mapping your network with nmap

If you run a computer network, be it home WiFi or a global enterprise system, you need a way to investigate the machines connected to your network. When ping and traceroute won't cut it, you need a port scanner.

nmap is the port scanner. It's a powerful, sophisticated tool, not to mention a movie star. The documentation on nmap is voluminous: there's an entire book, with a free online edition, as well as a detailed manpage. In this post I'll show you just a few of the cool things nmap can do.

The law and ethics of port scanning are complex. A network scan can be detected by humans or automated systems, and treated as a malicious act, resulting in real costs to the target. Depending on the options you choose, the traffic generated by nmap can range from "completely innocuous" to "watch out for admins with baseball bats". A safe rule is to avoid scanning any network without the explicit permission of its administrators — better yet if that's you.

You'll need root privileges on the scanning system to run most interesting nmap commands, because nmap likes to bypass the standard network stack when synthesizing esoteric packets.

A firm handshake

Let's start by scanning my home network for web and SSH servers:

root@lyle# nmap -sS -p22,80
Nmap scan report for
22/tcp filtered ssh
80/tcp open     http

Nmap scan report for
22/tcp filtered ssh
80/tcp filtered http

Nmap scan report for
22/tcp open   ssh
80/tcp closed http

Nmap done: 256 IP addresses (3 hosts up) scanned in 6.05 seconds

We use -p22,80 to ask for a scan of TCP ports 22 and 80, the most popular ports for SSH and web servers respectively. If you don't specify a -p option, nmap will scan the 1,000 most commonly-used ports. You can give a port range like -p1-5000, or even use -p- to scan all ports, but your scan will take longer.

We describe the subnet to scan using CIDR notation; nmap also accepts other target specifications, such as explicit ranges of addresses or hostnames.

The option -sS requests a TCP SYN scan. nmap will start a TCP handshake by sending a SYN packet. Then it waits for a response. If the target replies with SYN/ACK, then some program is accepting our connection. A well-behaved client should respond with ACK, but nmap will simply record an open port and move on. This makes an nmap SYN scan both faster and more stealthy than a normal call to connect().

If the target replies with RST, then there's no service on that port, and nmap will record it as closed. Or we might not get a response at all. Perhaps a firewall is blocking our traffic, or the target host simply doesn't exist. In that case the port state is recorded as filtered after nmap times out.

You can scan UDP ports by passing -sU. There's one important difference from TCP: Since UDP is connectionless, there's no particular response required from an open port. Therefore nmap may show UDP ports in the ambiguous state open|filtered, unless you can prod the target application into sending you data (see below).

To save time, nmap tries to confirm that a target exists before performing a full scan. By default it will send ICMP echo (the ubiquitous "ping") as well as TCP SYN and ACK packets. You can use the -P family of options to customize this host-discovery phase.

Weird packets

nmap has the ability to generate all sorts of invalid, useless, or just plain weird network traffic. You can send a TCP packet with no flags at all (null scan, -sN) or one that's lit up "like a Christmas tree" (Xmas scan, -sX). You can chop your packets into little fragments (--mtu) or send an invalid checksum (--badsum). As a network administrator, you should know if the bad guys can confuse your security systems by sending weird packets. As the manpage advises, "Let your creative juices flow".

There's a second benefit to sending weird traffic: We can identify the target's operating system by seeing how it responds to unusual situations. nmap will perform this OS detection if you specify the -O flag:

root@lyle# nmap -sS -O
Nmap scan report for
Not shown: 998 filtered ports
23/tcp   closed telnet
80/tcp   open   http
MAC Address: 00:1C:10:33:6B:99 (Cisco-Linksys)
Device type: WAP|broadband router
Running: Linksys embedded, Netgear embedded, Netgear VxWorks 5.X
Nmap scan report for
Not shown: 998 filtered ports
139/tcp   open  netbios-ssn
445/tcp   open  microsoft-ds
MAC Address: 00:1F:3A:7F:7C:26 (Hon Hai Precision Ind.Co.)
Warning: OSScan results may be unreliable because we could not find
  at least 1 open and 1 closed port
Device type: general purpose
Running (JUST GUESSING) : Microsoft Windows Vista|2008|7 (98%)
Nmap scan report for
All 1000 scanned ports on are closed
MAC Address: 7C:61:93:53:9F:E5 (Unknown)
Too many fingerprints match this host to give specific OS details
TCP/IP fingerprint:

Since the first target has both an open and a closed port, nmap has many protocol corner cases to explore, and it easily recognizes a Linksys home router. With the second target, there's no port in the closed state, so nmap isn't as confident. It guesses a Windows OS, which seems especially plausible given the open NetBIOS ports. In the last case nmap has no clue, and gives us some raw findings only. If you know the OS of the target, you can contribute this fingerprint and help make nmap even better.

Behind the port

It's all well and good to discover that port 1234 is open, but what's actually listening there? nmap has a version detection subsystem that will spam a host's open ports with data in hopes of eliciting a response. Let's pass -sV to try this out:

root@lyle# nmap -sS -sV
Nmap scan report for
Not shown: 998 closed ports
443/tcp  open  ssh     OpenSSH 5.5p1 Debian 6 (protocol 2.0)
8888/tcp open  http    thttpd 2.25b 29dec2003

nmap correctly spotted an HTTP server on non-standard port 8888. The SSH server on port 443 (usually HTTPS) is also interesting. I find this setup useful when connecting from behind a restrictive outbound firewall. But I've also had network admins send me worried emails, thinking my machine has been compromised.

nmap also gives us the exact server software versions, straight from the server's own responses. This is a great way to quickly audit your network for any out-of-date, insecure servers.

Since a version scan involves sending application-level probes, it's more intrusive and can cause more trouble. From the book:

In the nmap-service-probes included with Nmap the only ports excluded are TCP port 9100 through 9107. These are common ports for printers to listen on and they often print any data sent to them. So a version detection scan can cause them to print many pages full of probes that Nmap sends, such as SunRPC requests, help statements, and X11 probes.

This behavior is often undesirable, especially when a scan is meant to be stealthy.

Trusting the source

It's a common (if questionable) practice for servers or firewalls to trust certain traffic based on where it appears to come from. nmap gives you a variety of tools for mapping these trust relationships. For example, some firewalls have special rules for traffic originating on ports 53, 67, or 20. You can set the source port for nmap's TCP and UDP packets by passing --source-port.

You can also spoof your source IP address using -S, and the target's responses will go to that fake address. This normally means that nmap won't see any results. But these responses can affect the unwitting source machine's IP protocol state in a way that nmap can observe indirectly. You can read about nmap's TCP idle scan for more details on this extremely clever technique. Imagine making any machine on the Internet — or your private network — port-scan any other machine, while you collect the results in secret. Can you use this to map out trust relationships in your network? Could an attacker?

Bells and whistles

So that's an overview of a few cool nmap features. There's a lot we haven't covered, such as performance tuning, packet traces, or nmap's useful output modes like XML or ScRipT KIdd|3. There's even a full scripting engine with hundreds of useful plugins written in Lua.


Monday Jan 10, 2011

Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!
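A tiny Python sketch shows the idea: walk the numeric directories under /proc to enumerate live processes, a bare-bones ps (the comm file holds each process's executable name):

```python
import os

# Each numeric directory under /proc is a running process (PID).
pids = [name for name in os.listdir('/proc') if name.isdigit()]
for pid in sorted(pids, key=int):
    try:
        with open('/proc/%s/comm' % pid) as f:
            print(pid, f.read().strip())
    except IOError:
        pass  # the process exited between listdir() and open()
```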

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python 
IndexError: list index out of range
keegan@lyle$ echo hello | python -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python /proc/self/fd/0

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat 
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(4096):
    time.sleep(0.01)
keegan@lyle$ python bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.
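The plugin check itself can be scripted. This Python sketch (the helper name is ours) pulls the distinct backing files out of /proc/$PID/maps, which would flag the mixed foo-1.2/foo-1.3 plugins at a glance:

```python
def mapped_files(pid='self'):
    """Return the distinct files mapped into a process's address space."""
    files = set()
    with open('/proc/%s/maps' % pid) as maps:
        for line in maps:
            # Fields: address, perms, offset, dev, inode, pathname (optional).
            fields = line.split(None, 5)
            # Keep only real files; skip [heap], [stack], and anonymous maps.
            if len(fields) == 6 and fields[5].startswith('/'):
                files.add(fields[5].strip())
    return sorted(files)

for path in mapped_files():
    print(path)
```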

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
    /* spin: pure computation, no system calls */
  }
  return 0;
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!
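The counter check is easy to script, too. This sketch (the helper name is ours) reads both fields from /proc/$PID/status so you can compare them programmatically:

```python
import os

def ctxt_switches(pid):
    """Return the context-switch counters from /proc/$PID/status."""
    counts = {}
    with open('/proc/%d/status' % pid) as status:
        for line in status:
            # Matches voluntary_ctxt_switches and nonvoluntary_ctxt_switches.
            if 'ctxt_switches' in line:
                key, value = line.split(':')
                counts[key.strip()] = int(value)
    return counts

print(ctxt_switches(os.getpid()))
```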

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NUL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

int main() {
  int i;
  for (i=0; i<50; i++) {
    void *m = mmap(NULL, SIZE, PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memset(m, 0x80, SIZE);
    sleep(1);  /* slowly, so we can watch the scores change */
  }
  return 0;
}

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!


Wednesday Sep 22, 2010

Anatomy of an exploit: CVE-2010-3081

It has been an exciting week for most people running 64-bit Linux systems. Shortly after "Ac1dB1tch3z" released his or her exploit of the vulnerability known as CVE-2010-3081, we saw this exploit aggressively compromising machines, with reports of compromises all over the hosting industry, and many machines that ran our diagnostic tool tested positive for the backdoors left by the exploit.

The talk around the exploit has mostly been panic and mitigation, though, so now that people have had time to patch their machines and triage their compromised systems, what I'd like to do for you today is talk about how this bug worked, how the exploit worked, and what we can learn about Linux security.

The Ingredients of an Exploit

There are three basic ingredients that typically go into a kernel exploit: the bug, the target, and the payload. The exploit triggers the bug -- a flaw in the kernel -- to write evil data corrupting the target, which is some kernel data structure. Then it prods the kernel to look at that evil data and follow it to run the payload, a snippet of code that gives the exploit the run of the system.

The bug is the one ingredient that is unique to a particular vulnerability. The target and the payload may be reused by an attacker in exploits for other vulnerabilities -- if 'Ac1dB1tch3z' didn't copy them already from an earlier exploit, by himself or by someone else, he or she will probably reuse them in future exploits.

Let's look at each of these in more detail.

The Bug: CVE-2010-3081

An exploit starts with a bug, or vulnerability, some kernel flaw that allows a malicious user to make a mess -- to write onto its target in the kernel. This bug is called CVE-2010-3081, and it allows a user to write a handful of words into memory almost anywhere in the kernel.

The bug was present in Linux's 'compat' subsystem, which is used on 64-bit systems to maintain compatibility with 32-bit binaries by providing all the system calls in 32-bit form. Now Linux has over 300 different system calls, so this was a big job. The Linux developers made certain choices in order to keep the task manageable:

  • We don't want to rewrite the code that actually does the work of each system call, so instead we have a little wrapper function for compat mode.
  • The wrapper function needs to take arguments from userspace in 32-bit form, then put them in 64-bit form to pass to the code that does the system call's work. Often some arguments are structs which are laid out differently in the 32-bit and 64-bit worlds, so we have to make a new 64-bit struct based on the user's 32-bit struct.
  • The code that does the work expects to find the struct in the user's address space, so we have to put ours there. Where in userspace can we find space without stepping on toes? The compat subsystem provides a function to find it on the user's stack.
Now, here's the core problem. That allocation routine went like this:
  static inline void __user *compat_alloc_user_space(long len)
  {
          struct pt_regs *regs = task_pt_regs(current);
          return (void __user *)regs->sp - len;
  }
The way you use it looks a lot like the old familiar malloc(), or the kernel's kmalloc(), or any number of other memory-allocation routines: you pass in the number of bytes you need, and it returns a pointer where you are supposed to read and write that many bytes to your heart's content. But it comes -- came -- with a special catch, and it's a big one: before you used that memory, you had to check that it was actually OK for the user to use that memory, with the kernel's access_ok() function. If you've ever helped maintain a large piece of software, you know it's inevitable that someone will eventually be fooled by the analogy, miss the incongruence, and forget that check.

Fortunately the kernel developers are smart and careful people, and they defied that inevitability almost everywhere. Unfortunately, they missed it in at least two places. One of those is this bug. If we call getsockopt() in 32-bit fashion on the socket that represents a network connection over IP, and pass an optname of MCAST_MSFILTER, then in a 64-bit kernel we end up in compat_mc_getsockopt():

  int compat_mc_getsockopt(struct sock *sock, int level, int optname,
          char __user *optval, int __user *optlen,
          int (*getsockopt)(struct sock *,int,int,char __user *,int __user *))
This function calls compat_alloc_user_space(), and it fails to check the result is OK for the user to access -- and by happenstance the struct it's making room for has a variable length, supplied by the user.

So the attacker's strategy goes like so:

  • Make an IP socket in a 32-bit process, and call getsockopt() on it with optname MCAST_MSFILTER. Pass in a giant length value, almost the full possible 2GB. Because compat_alloc_user_space() finds space by just subtracting the length from the user's stack pointer, with a giant length the address wraps around, down past zero, to where the kernel lives at the top of the address space.
  • When the bug fires, the kernel will copy the original struct, which the attacker provides, into the space it has just 'allocated', starting at that address up in kernel-land. So fill that struct with, say, an address for evil code.
  • Tune the length value so that the address where the 'new struct' lives is a particularly interesting object in the kernel, a target.
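The wrap-around described in the first bullet is plain unsigned arithmetic, and can be sketched in a couple of lines of Python (the stack pointer and length here are illustrative numbers, not values taken from the exploit):

```python
# Toy model of compat_alloc_user_space(): the kernel computes
# regs->sp - len, which on a 64-bit machine is arithmetic mod 2**64.
MASK = (1 << 64) - 1

def compat_alloc_user_space(sp, length):
    return (sp - length) & MASK

sp = 0x50000000          # a made-up 32-bit process stack pointer
length = 0x7fff0000      # attacker-supplied length, almost 2 GB

addr = compat_alloc_user_space(sp, length)
# The subtraction wraps past zero to the very top of the address
# space -- which is where the kernel lives on x86-64.
print(hex(addr))         # 0xffffffffd0010000
```

With a length bigger than the stack pointer, the "allocated" buffer lands up in kernel memory -- exactly why the missing access_ok() check mattered.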
The fix for CVE-2010-3081 was to make compat_alloc_user_space() call access_ok() to check for itself.

More technical details are ably explained in the original report by security researcher Ben Hawkes, who brought the vulnerability to light.

The Target: Function Pointers Everywhere

The target is some place in the kernel where if we make the right mess, we can leverage that into the kernel running the attacker's code, the payload. Now the kernel is full of function pointers, because secretly it's object oriented. So for example the attacker may poke some userspace object like a special file to cause the kernel to invoke a certain method on it -- and before doing so will target that method's function pointer in the object's virtual method table (called an "ops struct" in kernel lingo) which says where to find all the methods, scribbling over it with the address of the payload.
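The ops-struct pattern is easy to model: it's just a table of function pointers, and the attack is a one-slot overwrite followed by a perfectly ordinary method call. A toy Python model (no kernel involved; the names and strings are purely illustrative):

```python
# A toy "ops struct": a virtual method table for a special file.
def generic_poll(filename):
    return "normal poll behavior"

def payload(filename):
    return "attacker code running in kernel mode"

timer_list_fops = {"poll": generic_poll}

# The exploit scribbles over one rarely-used method pointer...
timer_list_fops["poll"] = payload

# ...and then triggers it through the completely normal code path.
print(timer_list_fops["poll"]("/proc/timer_list"))
```

The kernel-side call site never changes; it faithfully calls whatever pointer is in the table, which is what makes these tables such attractive targets.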

A key constraint for the attacker is to pick something that will never be used in normal operation, so that nothing goes awry to catch the user's attention. This exploit uses one of three targets: the interrupt descriptor table, timer_list_fops, and the LSM subsystem.

  • The interrupt descriptor table (IDT) is morally a big table of function pointers. When an interrupt happens, the hardware looks it up in the IDT, which the kernel has set up in advance, and calls the handler function it finds there. It's more complicated than that because each entry in the table also needs some metadata to say who's allowed to invoke the interrupt, whether the handler should be called with user or kernel privileges, etc. This exploit picks interrupt number 221, higher than anybody normally uses, and carefully sets up that entry in the IDT so that its own evil code is the handler and runs in kernel mode. Then with the single instruction int $221, it makes that interrupt happen.
  • timer_list_fops is the "ops struct" or virtual method table for a special file called /proc/timer_list. Like many other special files that make up the proc filesystem, /proc/timer_list exists to provide kernel information to userspace. This exploit scribbles on the pointer for the poll method, which is normally not even provided for this file (so it inherits a generic behavior), and which nobody ever uses. Then it just opens that file and calls poll(). I believe this could just as well have been almost any file in /proc/.
  • The LSM approach attacks several different ops structs of type security_operations, the tables of methods for different 'Linux security modules'. These are gigantic structs with hundreds of function pointers; the one the exploit targets in each struct is msg_queue_msgctl, the 100th one. Then it issues a msgctl system call, which causes the kernel to check whether it's authorized by calling the msg_queue_msgctl method... which is now the exploit's code.
Why three different targets? One is enough, right? The answer is flexibility. Some kernels don't have timer_list_fops. Some kernels have it, but don't make available a symbol to find its address, and the address will vary from kernel to kernel, so it's tricky to find. Other kernels pose the same obstacle with the security_operations structs, or use a different security_operations than the ones the exploit corrupts. Different kernels offer different targets, so a widely applicable exploit has to have several targets in its repertoire. This one picks and chooses which one to use depending on what it can find.

The Payload: Steal Privileges

Finally, once the bug is used to corrupt the target and the target is triggered, the kernel runs the attacker's payload, or shellcode. A simple exploit will run the bare minimum of code inside the kernel, because it's much easier to write code that can run in userspace than in kernelspace -- so it just sets the process up to have the run of the system, and then returns.

This means setting the process's user ID to 0, root, so that everything else it does is with root privileges. A process's user ID is stored in different places in different kernel versions -- the system became more complicated in 2.6.29, and again in 2.6.30 -- so the exploit needs to have flexibility again. This one checks the version with uname and assembles the payload accordingly.
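That version check is simple to sketch. Release strings get parsed into tuples so they compare numerically; only the 2.6.29 and 2.6.30 cutoffs come from the discussion above, and the layout names are just labels for illustration:

```python
def uid_layout(release):
    """Pick which (made-up) payload variant to use, by kernel release string."""
    # "2.6.32-5-amd64" -> (2, 6, 32); tuples compare element by element.
    version = tuple(int(x) for x in release.split("-")[0].split(".")[:3])
    if version < (2, 6, 29):
        return "pre-2.6.29 layout"
    elif version < (2, 6, 30):
        return "2.6.29 layout"
    else:
        return "2.6.30+ layout"

print(uid_layout("2.6.28.4"))        # pre-2.6.29 layout
print(uid_layout("2.6.32-5-amd64"))  # 2.6.30+ layout
```

A real exploit gets the release string from uname(2) -- the same information Python exposes as platform.release().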

This exploit can also clear a couple of flags to turn off SELinux, with code it optionally includes in the payload -- more flexibility. Then it lets the kernel return to userspace, and starts a root shell.

In a real attack, that root shell might be used to replace key system binaries, steal data, start a botnet daemon, or install backdoors on disk to cement the attacker's control and hide their presence.

Flexibility, or, You Can't Trust a Failing Exploit

All the points of flexibility in this exploit illustrate a key lesson: you can't determine you're vulnerable just because an exploit fails. For example, on a Fedora 13 system, this exploit errors out with a message like this:
  $ ./ABftw
  Ac1dB1tCh3z VS Linux kernel 2.6 kernel 0d4y
  $$$ Kallsyms +r
  $$$ K3rn3l r3l3as3:
  !!! Err0r 1n s3tt1ng cr3d sh3llc0d3z 
Sometimes a system administrator sees an exploit fail like that and concludes they're safe. "Oh, Red Hat / Debian / my vendor says I'm vulnerable", they may say. "But the exploit doesn't work, so they're just making stuff up, right?"

Unfortunately, this can be a fatal mistake. In fact, the machine above is vulnerable. The error message only comes about because the exploit can't find the symbol per_cpu__current_task, whose value it needs in the payload; it's the address at which to find the kernel's main per-process data structure, the task_struct. But a skilled attacker can find the task_struct without that symbol, by following pointers from other known data structures in the kernel.

In general, there is almost infinitely much work an exploit writer could put in to make the exploit function on more and more kernels. Use a wider repertoire of targets; find missing symbols by following pointers or by pattern-matching in the kernel; find missing symbols by brute force, with a table prepared in advance; disable SELinux, as this exploit does, or grsecurity; or add special code to navigate the data structures of unusual kernels like OpenVZ. If the bug is there in a kernel but the exploit breaks, it's only a matter of work or more work to extend the exploit to function there too.

That's why the only way to know that a given kernel is not affected by a vulnerability is a careful examination of the bug against the kernel's source code and configuration, and never to rely on a failing exploit -- and even that examination can sometimes be mistakenly optimistic. In practice, for a busy system administrator this means that when the vendor recommends you update, the only safe choice is to update.


Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm between having to deal with a file or the network or some other part of the system, and then strace picks them up again.
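If you want a guinea pig to try this on, a trivial script that touches a file and the clock will do -- run it as `strace -tt python guinea_pig.py` (the filename is made up) and you'll see the open()/openat(), write(), read(), and sleep-related syscalls go by:

```python
import os
import tempfile
import time

# Every step here crosses the user/kernel boundary, so each one
# shows up in the strace output as one or more syscalls.
path = os.path.join(tempfile.mkdtemp(), "scratch.txt")

with open(path, "w") as f:    # open()/openat() and write()
    f.write("hello, strace\n")

with open(path) as f:         # open()/openat() and read()
    data = f.read()

time.sleep(0.01)              # nanosleep() or clock_nanosleep()
print(data.strip())           # write() to stdout
```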

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.

By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.
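The side-by-side hex/ASCII dump that -e write= produces is also easy to reproduce yourself, say for post-processing captured data the same way. Here's a rough sketch (simplified; not strace's exact layout):

```python
def hexdump(data, width=16):
    """Format bytes roughly the way strace's -e write= output looks."""
    lines = []
    for i in range(0, len(data), width):
        chunk = data[i:i + width]
        hexpart = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f" | {i:05x}  {hexpart:<{width * 3 - 1}}  {text} |")
    return "\n".join(lines)

print(hexdump(b"SELECT timestamp FROM artifacts WHERE id = 1454"))
```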

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.
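This flock contention is easy to reproduce on your own machine. Here's a toy Python demo (the lock file name is made up) where a second file descriptor fails to get the exclusive lock the first one holds:

```python
import fcntl
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foobar.lock")

holder = open(path, "w")             # plays the part of process 21856
fcntl.flock(holder, fcntl.LOCK_EX)   # takes and keeps the exclusive lock

waiter = open(path, "w")             # plays the part of process 22067
try:
    # LOCK_NB makes this fail immediately instead of blocking the way
    # the real process was blocked in the strace output above.
    fcntl.flock(waiter, fcntl.LOCK_EX | fcntl.LOCK_NB)
    print("got the lock")
except BlockingIOError:
    print("lock is held elsewhere")
```

Since flock() locks are attached to the open file description, two separate open() calls conflict even within one process, which makes the demo self-contained.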

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?


Thursday Jul 29, 2010

Choose Your Own Sysadmin Adventure

Today is System Administrator Appreciation Day, and being system administrators ourselves,  we here at Ksplice decided to have a little fun with this holiday.

We've taken a break, drank way too much coffee, and created a very special Choose Your Own Adventure for all the system administrators out there.

Click here to begin the adventure.

Feedback and comments welcome. Above all: Happy System Administrator Appreciation Day. Share the love with your friends, colleagues, and especially any sysadmins you might know.


Wednesday Jul 28, 2010

Learning by doing: Writing your own traceroute in 8 easy steps

Anyone who administers even a moderately sized network knows that when problems arise, diagnosing and fixing them can be extremely difficult. They're usually non-deterministic and difficult to reproduce, and very similar symptoms (e.g. a slow or unreliable connection) can be caused by any number of problems — congestion, a broken router, a bad physical link, etc.

One very useful weapon in a system administrator's arsenal for dealing with network issues is traceroute (or tracert, if you use Windows). This is a neat little program that will print out the path that packets take to get from the local machine to a destination — that is, the sequence of routers that the packets go through.

Using traceroute is pretty straightforward. On a UNIX-like system, you can do something like the following:

    $ traceroute
    traceroute to (, 30 hops max, 60 byte packets
     1  router.lan (  0.595 ms  1.276 ms  1.519 ms
     2 (  13.669 ms  17.583 ms  18.242 ms
     3 (  18.710 ms  19.192 ms  19.640 ms
     4 (  20.642 ms  21.160 ms  21.571 ms
     5 (  28.870 ms  29.788 ms  30.437 ms
     6 (  30.911 ms  17.377 ms  15.442 ms
     7 (  40.081 ms  41.018 ms  39.229 ms
     8 (  20.139 ms  21.629 ms  20.965 ms
     9 (  25.771 ms  26.196 ms  26.633 ms
    10 (  23.856 ms  24.820 ms  27.722 ms

Pretty nifty. But how does it work? After all, when a packet leaves your network, you can't monitor it anymore. So when it hits all those routers, the only way you can know about that is if one of them tells you about it.

The secret behind traceroute is a field called "Time To Live" (TTL) that is contained in the headers of the packets sent via the Internet Protocol. When a host receives a packet, it checks if the packet's TTL is greater than 1 before sending it on down the chain. If it is, it decrements the field. Otherwise, it drops the packet and sends an ICMP TIME_EXCEEDED packet to the sender. This packet, like all IP packets, contains the address of its sender, i.e. the intermediate host.

traceroute works by sending consecutive requests to the same destination with increasing TTL fields. Most of these attempts result in messages from intermediate hosts saying that the packet was dropped. The IP addresses of these intermediate hosts are then printed on the screen (generally with an attempt made at determining the hostname) as they arrive, terminating when the maximum number of hosts have been hit (on my machine's traceroute the default maximum is 30, but this is configurable), or when the intended destination has been reached.
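The increasing-TTL trick can be simulated without any networking at all. A toy model in Python (router names are made up):

```python
def probe(path, ttl):
    """Simulate one probe packet with the given TTL along a fixed route."""
    *routers, dest = path
    for hop in routers:
        if ttl <= 1:
            return ("TIME_EXCEEDED", hop)  # dropped here; ICMP sent back
        ttl -= 1                           # forwarded, TTL decremented
    return ("REACHED", dest)               # the destination accepts the packet

route = ["router.lan", "isp-gateway", "backbone-1", "backbone-2", "destination"]
for ttl in range(1, len(route) + 1):
    print(ttl, probe(route, ttl))
```

Each TTL value flushes out exactly one hop: TTL 1 names the first router, TTL 2 the second, and so on until the destination itself answers.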

The rest of this post will walk through implementing a very primitive version of traceroute in Python. The real traceroute is of course more complicated than what we will create, with many configurable features and modes. Still, our version will implement the basic functionality, and at the end, we'll have a really nice and short Python script that will do just fine for performing a simple traceroute.

So let's begin. Our algorithm, at a high level, is an infinite loop whose body creates a connection, prints out information about it, and then breaks out of the loop if a certain condition has been reached. So we can start with the following skeletal code:

    def main(dest_name):
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 1: Turn a hostname into an IP address.

The socket module provides a gethostbyname() method that attempts to resolve a domain name into an IP address:


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
Step 2: Create sockets for the connections.

We'll need two sockets for our connections — one for receiving data and one for sending. We have a lot of choices for what kind of probes to send; let's use UDP probes, which require a datagram socket (SOCK_DGRAM). The routers along our traceroute path are going to send back ICMP packets, so for those we need a raw socket (SOCK_RAW).


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 3: Set the TTL field on the packets.

We'll simply use a counter that begins at 1 and is incremented with each iteration of the loop. We set the TTL using the setsockopt() method of the socket object:


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 4: Bind the sockets and send some packets.

Now that our sockets are all set up, we can put them to work! We first tell the receiving socket to listen to connections from all hosts on a specific port (most implementations of traceroute use ports from 33434 to 33534 so we will use 33434 as a default). We do this using the bind() method of the receiving socket object, by specifying the port and an empty string for the hostname. We can then use the sendto() method of the sending socket object to send to the destination host (on the same port). The first argument of the sendto() method is the data to send; in our case, we don't actually have anything specific we want to send, so we can just give the empty string:


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
Step 5: Get the intermediate hosts' IP addresses.

Next, we need to actually get our data from the receiving socket. For this, we can use the recvfrom() method of the object, whose return value is a tuple containing the packet data and the sender's address. In our case, we only care about the latter. Note that the address is itself actually a tuple containing both the IP address and the port, but we only care about the former. recvfrom() takes a single argument, the blocksize to read — let's go with 512.

It's worth noting that some hosts and routers are configured not to send ICMP responses, often specifically to hinder utilities like traceroute, since the detailed layout of a network can be sensitive information (another common reason to restrict ICMP is that the ping utility can be abused for denial-of-service attacks). It is therefore entirely possible that we'll get a timeout error, which will result in an exception. Thus, we'll wrap this call in a try/except block. Traditionally, traceroute prints asterisks when it can't get the address of a hop. We'll do the same once we print out results.


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
            except socket.error:
                pass
            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 6: Turn the IP addresses into hostnames and print the data.

To match traceroute's behavior, we want to try to display the hostname along with the IP address. The socket module provides the gethostbyaddr() function for reverse DNS resolution. The resolution can fail and result in an exception, in which case we'll want to catch it and make the hostname the same as the address. Once we get the hostname, we have all the information we need to print our data:
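As a quick sketch of that fallback pattern, here it is applied to the loopback address (the name you get back depends on your resolver configuration):

```python
import socket

addr = "127.0.0.1"
try:
    # gethostbyaddr() returns (hostname, aliaslist, ipaddrlist);
    # we only want the hostname.
    name = socket.gethostbyaddr(addr)[0]
except socket.error:
    # Reverse resolution failed; fall back to the bare IP address.
    name = addr
print("%s (%s)" % (name, addr))
```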


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 7: End the loop.

There are two conditions for exiting our loop — either we have reached our destination (that is, curr_addr is equal to dest_addr)1 or we have exceeded some maximum number of hops. We will set our maximum at 30:


    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        max_hops = 30
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            if curr_addr == dest_addr or ttl > max_hops:
                break

    if __name__ == "__main__":
        main('google.com')

Step 8: Run the code!

We're done! Let's save this to a file and run it! Because raw sockets require root privileges, traceroute is typically setuid. For our purposes, we can just run the script as root:

    $ sudo python
    [sudo] password for leonidg:
    1       router.lan (
    2 (
    3 (
    4 (
    5 (
    6 (
    7 (
    8 (
    9 (
    10 (

Hurrah! The data matches the real traceroute's perfectly.

Of course, there are many improvements that we could make. As I mentioned, the real traceroute has a whole slew of other features, which you can learn about by reading the manpage. In the meantime, I wrote a slightly more complete version of the above code that allows configuring the port and max number of hops, as well as specifying the destination host. You can download it at my github repository.

Alright folks, what UNIX utility should we write next? strace, anyone? :-) 2

1 This is actually not quite how the real traceroute works. Rather than checking the IP addresses of the hosts and stopping when the destination address matches, it stops when it receives an ICMP "port unreachable" message, which means that the host has been reached. For our purposes, though, this simple address heuristic is good enough.

2 Ksplice blogger Nelson took up a DIY strace on his personal blog, Made of Bugs.


Monday Jul 12, 2010

Source diving for sysadmins

As a system administrator, I work with dozens of large systems every day--Apache, MySQL, Postfix, Dovecot, and the list goes on from there. While I have a good idea of how to configure all of these pieces of software, I'm not intimately familiar with all of their code bases. And every so often, I'll run into a problem which I can't configure around.

When I'm lucky, I can reproduce the bug in a testing environment. I can then drop in arbitrary print statements, recompile with debugging flags, or otherwise modify my application to give me useful data. But all too often, I find that either the bug vanishes when it's not in my production environment, or it would simply take too much time or resources to even set up a testing deployment. When this happens, I find myself left with no alternative but to sift through the source code of the failing system, hoping to find clues as to the cause of the bug of the day. Doing so is never painless, but over time I've developed a set of techniques to make the source diving experience as focused and productive as possible.

To illustrate these techniques, I'll walk you through a real-world debugging experience I had a few weeks ago. I am a maintainer of the XVM project, an MIT-internal VPS service. We keep the disks of our virtual servers in shared storage, and we use clustering software to coordinate changes to the disks.

For a long time, we had happily run a cluster of four nodes. After receiving a grant for new hardware, we attempted to increase the size of our cluster from four nodes to eight nodes. But once we added the new nodes to the cluster, disaster struck. With five or more nodes in the cluster, no matter what we tried, we received the same error message:

root@babylon-four:~# lvs
 cluster request failed: Cannot allocate memory
 Can't get lock for xenvg
 Skipping volume group xenvg
 cluster request failed: Cannot allocate memory
 Can't get lock for babylon-four

And to make matters even more exciting, by the time we observed the problem, users had already booted enough virtual servers that we did not have the RAM to go back to four nodes. So there we were, with a broken cluster to debug.

Tip 1: Check the likely causes of failure first.

It can be quite tempting to believe that a given problem is caused by a bug in someone else's code rather than your own error in configuration. In reality, the common bugs in large, widely-used projects have already been squashed, meaning the most likely cause of error is something that you are doing wrong. I've lost track of the number of times I was sure I encountered a bug in some software, only to later discover that I had forgotten to set a configuration variable. So when you encounter a failure of some kind, make sure that your environment is not obviously at fault. Check your configuration files, check resource usage, check log files.

In the case of XVM, after seeing memory errors, we naturally figured we were out of memory--but free -m showed plenty of spare RAM. Thinking a rogue process might be to blame, we ran ps aux and top, but no process was consuming abnormal amounts of RAM or CPU. We consulted man pages, we scoured the relevant configuration files in /etc, and we even emailed the clustering software's user list, trying to determine if we were doing something wrong. Our efforts failed to uncover any problems on our end.

Tip 2: Gather as much debugging output as you can. You're going to need it.

Once you're sure you actually need to do a source dive, you should make sure you have all the information you can get about what your program is doing wrong. See if your program has a "debugging" or "verbosity" level you can turn up. Check /var/log/ for dedicated log files for the software under consideration, or perhaps check a standard log such as syslog. If your program does not provide enough output on its own, try using strace -p to dump the system calls it's issuing.

Before doing our clustering-software source dive, we cranked debugging as high as it would go to get the following output:

Got new connection on fd 5
Read on local socket 5, len = 28
creating pipe, [9, 10]
Creating pre&post thread
in sub thread: client = 0x69f010
Sub thread ready for work.
doing PRE command LOCK_VG 'V_xenvg' at 1 (client=0x69f010)
lock_resource 'V_xenvg-1', flags=0, mode=1
Created pre&post thread, state = 0
Writing status 12 down pipe 10
Waiting for next pre command
read on PIPE 9: 4 bytes: status: 12
background routine status was 12, sock_client=0x69f010
Send local reply

Note that this spew does not contain an obvious error message. Still, it had enough information for us to ultimately track down and fix the problem that beset us.

Tip 3: Use the right tools for the job

Perhaps the worst part of living in a world with many debugging tools is that it's easy to waste time using the wrong ones. If you are seeing a segmentation fault or an apparent deadlock, then your first instinct should be to reach for gdb. gdb has all sorts of nifty capabilities, including the ability to attach to an already-running process. But if you don't have an obvious crash site, often the information you glean from dynamic debugging is too narrow or voluminous to be helpful. Some, such as Linus Torvalds, have even vehemently opposed debuggers in general.

Sometimes the simplest tools are the best: together grep and find can help you navigate an entire codebase knowing only fragments of text or filenames (or guesses thereof). It can also be helpful to use a language-specific tool. For C, I recommend cscope, a tool which lets you find symbol usages or definitions.

XVM's clustering problem was with a multiprocess network-based program, and we had no idea where the failure was originating. Both properties would have made the use of a dynamic debugger quite onerous. Thus we elected to dive into the source code using nothing but our familiar command-line utilities.

Tip 4: Know your error.

Examine your system's failure mode. Is it printing an error message? If so, where is that error message originating? What can possibly cause that error? If you don't understand the symptoms of a failure, you certainly won't be able to diagnose its cause.

Often, grep as you might, you won't find the text of the error message in the codebase under consideration. Rather, a standard UNIX error-reporting mechanism is to internally set the global variable errno, which is converted to a string using strerror.

Here's a Python script that I've found useful for converting the output of strerror to the corresponding symbolic error name. (Just pass the script any substring of your error as an argument.)

#!/usr/bin/env python
import errno, os, sys
msg = ' '.join(sys.argv[1:]).lower()
for i in xrange(256):
    err = os.strerror(i)
    if msg in err.lower():
        print '%s [errno %d]: %s' % (errno.errorcode.get(i, '(unknown)'), i, err)

This script shows that the "Cannot allocate memory" message we had seen was caused by errno being set to the code ENOMEM.
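You can verify that mapping directly from Python's errno module (the numeric value and the exact message are platform-specific; the comments below describe Linux):

```python
import errno
import os

# ENOMEM is the symbolic name behind "Cannot allocate memory" on Linux.
print(errno.ENOMEM)                    # the numeric errno value (12 on Linux)
print(errno.errorcode[errno.ENOMEM])   # 'ENOMEM'
print(os.strerror(errno.ENOMEM))       # the human-readable message
```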

Tip 5: Map lines of output to lines of code.

You can learn a lot about the state of a program by determining which lines of code it is executing. First, fetch the source code for the version of the software you are running (generally one of apt-get source and yumdownloader --source). Using your handy command-line tools, you should then be able to trace lines of debugging output back to the lines of code that emitted them. You can thus determine a set of lines that are definitively being executed.

Returning to the XVM example, we used apt-get source to fetch the relevant source code and dpkg -l to verify we were running the same version. We then ran a grep for each line of debugging output we had obtained. One such invocation, grep -r "lock_resource '.*'" ., showed that the corresponding log entry was emitted by a line in the middle of a function entitled _lock_resource.

Tip 6: Be systematic.

If you've followed the preceding tips, you'll know what parts of the code the program is executing and how it's erroring out. From there, you should work systematically, eliminating parts of the code that you can prove are not contributing to your error. Be sure you have actual evidence for your conclusions--the existence of a bug indicates that the program is in an unexpected state, and thus the core assumptions of the code may be violated.

At this point in the XVM debugging, we examined the _lock_resource function. After the debugging message we had in our logs, all paths of control flow except one printed a message we had not seen. That path terminated with an error from a function called saLckResourceLock. Hence we had found the source of our error.

We also noticed that _lock_resource transforms error values returned by the function it calls using ais_to_errno. Reading the body of ais_to_errno, we noted that it just maps internal error values to standard UNIX error codes. So instead of ENOMEM, the real culprit was one of SA_AIS_ERR_NO_MEMORY, SA_AIS_ERR_NO_RESOURCES, or SA_AIS_ERR_NO_SECTIONS. This certainly explained why we could see this error message even on machines with tens of gigabytes of free memory!

Ultimately, our debugging process brought us to the following block of code:

if (global_lock_count == LCK_MAX_NUM_LOCKS) {
        goto error_exit;
}

This chunk of code felt exactly right. It was bound by some hard-coded limit (namely, LCK_MAX_NUM_LOCKS, the maximum number of locks) and hitting it returned one of the error codes we were seeking. We bumped the value of the constant and have been running smoothly ever since.

Tip 7: Make sure you really fixed it.

How many times have you been certain you finally found an elusive bug, spent hours recompiling and redeploying, and then found that the bug was actually still there? Or, even better, that the bug simply changed when it appeared, and you failed to find this out before telling everyone that you had fixed it?

It's important that after squashing a bug, you examine, test, and sanity-check your changes. Perhaps explain your reasoning to someone else. It's all too easy to "fix" code that isn't broken, only cover a subset of the relevant cases, or introduce a new bug in your patch.

After bumping the value of LCK_MAX_NUM_LOCKS, we checked the project's changelog. We found a commit increasing the maximum number of locks without any changes to code, so our patch seemed safe. We explained our reasoning and findings to the other project developers, quietly deployed our patched version, and then after a week of stability sent an announce email proclaiming that we had fixed the cluster.

Your turn

What techniques have you found useful for debugging unfamiliar code?


Wednesday Jun 30, 2010

Let's Play Vulnerability Bingo!

Dear Fellow System Administrators,

I like excitement in my life. I go on roller coasters, I ride my bike without a helmet, I make risky financial decisions. I treat my servers no differently. When my Linux vendor releases security updates, I think: I could apply these patches, but why would I? If I did, I'd have to coordinate with my users to schedule a maintenance window for 2am on a Sunday and babysit those systems while they reboot, which is seriously annoying, hurts our availability, and interrupts my beauty sleep (and trust me, I need my beauty sleep). Plus, where's the fun in having a fully-patched system? Without open vulnerabilities, how else would I have won a ton of money in my office's Vulnerability Bingo games?

vulnerability bingo card

How can I get in on some Vulnerability Bingo action, you ask? Simple: get yourself some bingo cards, be sure not to patch your systems, and place chips on appropriate squares when your machines are compromised. Or, as a fun variant, place chips when your friends' machines get compromised! For the less adventurous, place chips as relevant Common Vulnerabilities and Exposures are announced.

What's really great is the diversity of vulnerabilities. In 2009 alone, Vulnerability Bingo featured:

physically proximate denial of service attacks (CVE-2009-1046).

local denial of service attacks (CVE-2009-0322, CVE-2009-0031, CVE-2009-0269, CVE-2009-1242, CVE-2009-2406, CVE-2009-2407, CVE-2009-2287, CVE-2009-2692, CVE-2009-2909, CVE-2009-2908, CVE-2009-3290, CVE-2009-3547, CVE-2009-3621, CVE-2009-3620) coming in at least 5 great flavors: faults, memory corruption, system crashes, hangs, and the kernel OOPS!

And the perennial favorite, remote denial of service attacks (CVE-2009-1439, CVE-2009-1633, CVE-2009-3613, CVE-2009-2903) including but not limited to system crashes, IOMMU space exhaustion, and memory consumption!

How about leaking potentially sensitive information from kernel memory (CVE-2009-0676, CVE-2009-3002, CVE-2009-3612, CVE-2009-3228) and remote access to potentially sensitive information from kernel memory (CVE-2009-1265)?

Perhaps I can interest you in some privilege escalation (CVE-2009-2406, CVE-2009-2407, CVE-2009-2692, CVE-2009-3547, CVE-2009-3620), or my personal favorites, arbitrary code execution (CVE-2009-2908) and unknown impact (CVE-2009-0065, CVE-2009-1633, CVE-2009-3638).

Sometimes you get a triple threat like CVE-2009-1895, which "makes it easier for local users to leverage the details of memory usage to (1) conduct NULL pointer dereference attacks, (2) bypass the mmap_min_addr protection mechanism, or (3) defeat address space layout randomization (ASLR)". Three great tastes that taste great together -- and a great multi-play Bingo opportunity!

Linux vendors release kernel security updates almost every month (take Red Hat for example), so generate some cards and get in on the action before you miss the next round of exciting CVEs!

Happy Hacking,

Ben Bitdiddle
System Administrator
HackMe Inc.



Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.

