Wednesday Mar 16, 2011

disown, zombie children, and the uninterruptible sleep

PID 1 Riding, by Albrecht Dürer

It's the end of the day on Friday. On your laptop, in an ssh session on a work machine, you check on long.sh, which has been running all day and has another 8 or 9 hours to go. You start to close your laptop.

You freeze for a second and groan.

This was supposed to be running under a screen session. You know that if you kill the ssh connection, that'll also kill long.sh. What are you going to do? Leave your laptop for the weekend? Kill the job, losing the last 8 hours of work?

You think about what long.sh does for a minute, and breathe a sigh of relief. The output is written to a file, so you don't care about terminal output. This means you can use disown.

How does this little shell built-in let your jobs finish even when you kill the parent process (the shell inside the ssh connection)?

Dissecting disown

As we'll see, disown synthesizes 3 big UNIX concepts: signals, process states, and job control.

The point of disowning a process is that it will continue to run even when you exit the shell that spawned it. Getting this to work requires a prelude. The steps are:

  1. suspend the process with Ctrl-Z.
  2. background with bg.
  3. disown the job.

What does each of these steps accomplish?

First, here's a summary of the states that a process can be in, from the ps man page:

PROCESS STATE CODES
       Here are the different values that the s, stat and state output specifiers (header "STAT" or "S")
       will display to describe the state of a process.
       D    Uninterruptible sleep (usually IO)
       R    Running or runnable (on run queue)
       S    Interruptible sleep (waiting for an event to complete)
       T    Stopped, either by a job control signal or because it is being traced.
       W    paging (not valid since the 2.6.xx kernel)
       X    dead (should never be seen)
       Z    Defunct ("zombie") process, terminated but not reaped by its parent.

       For BSD formats and when the stat keyword is used, additional characters may be displayed:
       <    high-priority (not nice to other users)
       N    low-priority (nice to other users)
       L    has pages locked into memory (for real-time and custom IO)
       s    is a session leader
       l    is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
       +    is in the foreground process group

And here is a transcript of the steps to disown long.sh. Shell 1 runs each command; after every step, a second shell checks in with some useful ps output, in particular the parent process ID (PPID), the process state of our long job (STAT), and the controlling terminal (TT). Watch how those three columns change from step to step:

1. Start program
$ sh long.sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S+   pts/0    sh long.sh
2. Suspend program with Ctl-z
^Z
[1]+  Stopped     sh long.sh
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 T    pts/0    sh long.sh
3. Resume program in background
$ bg
[1]+ sh long.sh &
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S    pts/0    sh long.sh
4. disown job 1, our program
$ disown %1
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298 26145 S    pts/0    sh long.sh
5. Exit the shell
$ exit
logout
$ ps -o pid,ppid,stat,tty,cmd $(pgrep -f long)
  PID  PPID STAT TT       CMD
26298     1 S    ?        sh long.sh

Putting this information together:

  1. When we run long.sh from the command line, its parent is the shell (PID 26145 in this example). Even though it looks like it is running as we watch it in the terminal, it mostly isn't; long.sh is waiting on some resource or event, so it is in process state S for interruptible sleep. It is in fact in the foreground, so it also gets a +.
  2. First, we suspend the program with Ctrl-Z. By ``suspend'', we mean send it the SIGTSTP signal, which is like SIGSTOP except that you can install your own signal handler for it or ignore it. We see proof in the state change: it's now in T for stopped.
  3. Next, bg sets our process running again, but in the background, so we get the S for interruptible sleep, but no +.
  4. Finally, we can use disown to remove the process from the jobs list that our shell maintains. Our process has to be running, not stopped, when it is removed from the list -- once we exit the shell, nothing is left to resume a stopped, disowned job -- which is why we needed the bg step.
  5. When we exit the shell (or the ssh connection drops and the shell itself gets a SIGHUP), the shell sends a SIGHUP to all children in its jobs table**. By default, a SIGHUP will terminate a process. Because we removed our job from the jobs table, it doesn't get the SIGHUP and keeps on running (STAT S). However, since its parent the shell died, and the shell was the session leader in charge of the controlling tty, it doesn't have a tty anymore (TT ?). Additionally, our long job needs a new parent, so init, with PID 1, becomes the new parent process.
**This is not always true, as it turns out. In the bash shell, for example, there is a huponexit shell option. If this option is disabled, a SIGHUP to the shell isn't propagated to the children. This means if you have a backgrounded, active process (you followed steps 1, 2, and 3 above, or you started the process backgrounded with ``&'') and you exit the shell, you don't have to use disown for the process to keep running. You can check or toggle the huponexit shell option with the shopt shell built-in.
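
For example, a quick way to check and flip the option in bash (a sketch):

shopt huponexit        # is it currently on or off?
shopt -s huponexit     # turn it on: the shell will SIGHUP its jobs when it exits
shopt -u huponexit     # turn it off again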

And that is disown in a nutshell.

What else can we learn about process states?

Dissecting disown presents enough interesting tangents about signals, process states, and job control for a small novel. Focusing on process states for this post, here are a few such tangents:

1. There are a lot of process states and modifiers. We saw some interruptible sleeps and suspended processes with disown, but what states are most common?

Using my laptop as a data source and taking advantage of ps format specifiers, we can get counts for the different process states:

jesstess@aja:~$ ps -e h -o stat | sort | uniq -c | sort -rn
     90 S
     31 Sl
     17 Ss
      9 Ss+
      8 Ssl
      4 S<
      3 S+
      2 SNl
      1 S<sl
      1 S<s
      1 SN
      1 SLl
      1 R+

So the vast majority are in an interruptible sleep (S), and a few processes are extra nice (N) and extra mean (<).

We can drill down on process ``niceness'', or scheduling priority, with the ni format specifier to ps:

jesstess@aja:~$ ps -e h -o ni | sort -n | uniq -c
      1 -11
      2  -5
      1  -4
      2  -2
      4   -
    156   0
      1   1
      1   5
      1  10

The numbers range from 19 (super friendly, low scheduling priority) to -20 (a total bully, high scheduling priority). The 6 processes with negative numbers are the 6 with a < process state modifier in the ``ps -e h -o stat'' output, and the 3 with positive numbers have the Ns. Most processes don't run under a special scheduling priority.
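
If you want to play with scheduling priority yourself, nice and renice are the standard knobs (a sketch, reusing the long.sh job and its PID from earlier as a stand-in):

nice -n 10 sh long.sh &        # start a job at low priority (nice 10)
renice -n 15 -p 26298          # make an already-running PID even nicer; no root needed
sudo renice -n -5 -p 26298     # raising priority (a negative nice value) requires root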

Why is almost nothing actually running?

In the ``ps -e h -o stat'' output above, only 1 process was in the R state, running or runnable. This is a multi-processor machine, and there are over 150 other processes, so why isn't something running on the other processor?

The answer is that on an unloaded system, most processes really are waiting on an event or resource, so they can't run. On the laptop where I ran these tests, uptime tells us that we have a load average under 1:

jesstess@aja:~$ uptime
 13:09:10 up 16 days, 14:09,  5 users,  load average: 0.92, 0.87, 0.82

So we'd only expect to see 1 process in the R state at any given time for that load.

If we hop over to a more loaded machine -- a shell machine at MIT -- things are a little more interesting:

dr-wily:~> ps -e -o stat,cmd | awk '{if ($1 ~/R/) print}'
R+   /mit/barnowl/arch/i386_deb50/bin/barnowl.real.zephyr3
R+   ps -e -o stat,cmd
R+   w
dr-wily:~> uptime
 23:23:16 up 22 days, 20:09, 132 users,  load average: 3.01, 3.66, 3.43
dr-wily:~> grep processor /proc/cpuinfo
processor	: 0
processor	: 1
processor	: 2
processor	: 3

The machine has 4 processors, and with a load average between 3 and 4, on average 3 or 4 processes are running or runnable (in the R state) at any given moment. To get a sense of how the running processes change over time, throw the ps line under watch:

watch -n 1 "ps -e -o stat,cmd | awk '{if (\$1 ~/R/) print}'"

We get something like:

watching the changing output of ps

2. What about the zombies?

Noticeably absent in the process state summaries above are zombie processes (STAT Z) and processes in uninterruptible sleep (STAT D).

A process becomes a zombie when it has completed execution but hasn't been reaped by its parent. If a program produces long-lived zombies, this is usually a bug; zombies are undesirable because they take up process IDs, which are a limited resource.

I had to dig around a bit to find real examples of zombies. The winners were old barnowl zephyr clients (zephyr is a popular instant messaging system at MIT):

jesstess@linerva:~$ ps -e h -o stat,cmd | awk '{if ($1 ~/Z/) print}'
Z+   [barnowl] <defunct>
Z+   [barnowl] <defunct>

However, since all it takes to produce a zombie is a child exiting without the parent reaping it, it's easy to construct our own zombies of limited duration:

jesstess@aja:~$ cat zombie.c 
#include <sys/types.h>
#include <unistd.h>

int main () {
    pid_t child_pid = fork();
    if (child_pid > 0) {
        /* parent: hang around while the exited child sits unreaped */
        sleep(60);
    }
    /* the child skips the sleep and exits right away, lingering as a
       zombie until it is reaped */
    return 0;
}
jesstess@aja:~$ gcc -o zombie zombie.c
jesstess@aja:~$ ./zombie
^Z
[1]+  Stopped                 ./zombie
jesstess@aja:~$ ps -o stat,cmd $(pgrep -f zombie)
T    ./zombie
Z    [zombie] <defunct>

When you run this program, the child exits immediately and lingers as a zombie while the parent sleeps; after 60 seconds the parent dies, init becomes the zombie child's new parent, and init quickly reaps the child by making a wait system call on the child's PID, which removes it from the system process table.
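
By contrast, a parent that reaps its own children leaves no lingering zombies. Here's a minimal sketch (not from the post above) using waitpid:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
    pid_t child_pid = fork();
    if (child_pid == 0) {
        return 0;                 /* child exits immediately */
    }
    /* parent reaps the child right away; it spends at most an instant as a zombie */
    waitpid(child_pid, NULL, 0);
    sleep(60);                    /* plenty of time to confirm with ps that no Z remains */
    return 0;
}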

3. What about the uninterruptible sleeps?

A process is put in an uninterruptible sleep (STAT D) when it needs to wait on something (typically I/O) and shouldn't be handling signals while waiting. This means you can't kill it, because all kill does is send it signals. This might happen in the real world if you unplug your NFS server while other machines have open network connections to it.

We can create our own uninterruptible processes of limited duration by taking advantage of the vfork system call. vfork is like fork, except the address space is not copied from the parent into the child, in anticipation of an exec which would just throw out the copied data. Conveniently for us, when you vfork the parent waits uninterruptibly (by way of wait_on_completion) on the child's exec or exit:

jesstess@aja:~$ cat uninterruptible.c 
#include <unistd.h>

int main() {
    /* vfork blocks the parent, uninterruptibly, until the child execs
       or exits; the child just sleeps for a minute before exiting */
    vfork();
    sleep(60);
    return 0;
}
jesstess@aja:~$ gcc -o uninterruptible uninterruptible.c
jesstess@aja:~$ echo $$
13291
jesstess@aja:~$ ./uninterruptible

and in another shell:

jesstess@aja:~$ ps -o ppid,pid,stat,cmd $(pgrep -f uninterruptible)
 PPID   PID STAT CMD
13291  1972 D+   ./uninterruptible
 1972  1973 S+   ./uninterruptible

We see the child (PID 1973, PPID 1972) in an interruptible sleep and the parent (PID 1972, PPID 13291 -- the shell) in an uninterruptible sleep while it waits for 60 seconds on the child.

One neat (mischievous?) thing about this program is that processes in an uninterruptible sleep contribute to the load average for a machine. So you could run it 100 times to temporarily give a machine a load average elevated by 100, as reported by uptime.
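
Here's a sketch of that prank, assuming the uninterruptible binary from above is in the current directory:

for i in $(seq 100); do ./uninterruptible & done
sleep 30; uptime    # the one-minute load average climbs toward +100

Each copy exits after its 60-second sleep, so the load settles back down on its own.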

It's a family affair

Signals, process states, and job control offer a wealth of opportunities for exploration on a Linux system: we've already disowned children, killed parents, witnessed adoption (by init), crafted zombie children, and more. If this post inspires fun tangents or fond memories, please share in the comments!


*Albrecht had some help from Adam and Photoshop Elements. Larger version here.
Props to Nelson for his boundless supply of sysadmin party tricks, which includes this vfork example.

~jesstess

Monday Jan 10, 2011

Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
...
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.
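
As a tiny illustration (a sketch, not from the article): /proc/uptime holds the machine's uptime in seconds, so a one-line awk script is all it takes to report it in friendlier units:

awk '{ printf "up %.1f hours\n", $1 / 3600 }' /proc/uptime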

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat crap-prog.py 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python crap-prog.py 
IndexError: list index out of range
keegan@lyle$ echo hello | python crap-prog.py -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python crap-prog.py /proc/self/fd/0
hello

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.
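
On many Linux distributions those convenience names are themselves just symlinks back into proc; a quick check (a sketch):

ls -l /dev/stdin /dev/stdout /dev/stderr    # typically symlinks into /proc/self/fd (or /dev/fd)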

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat slow-reader.py 
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(1024):
  time.sleep(0.01)
keegan@lyle$ python slow-reader.py bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.
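
As an aside (not from the article), the vDSO even shows up when you run ldd on an ordinary binary, listed like a shared library but with no path, since it has no file on disk:

ldd /bin/ls | grep -i vdso    # "linux-vdso.so.1" on x86-64; 32-bit x86 calls it linux-gate.so.1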

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python slow-reader.py bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
  }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.
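
If you do want to experiment with that on a test machine, the knobs live in the vm sysctls (a sketch, not a recommendation; read up on vm.overcommit_memory first):

sudo sysctl -w vm.overcommit_memory=2   # 2 = strict accounting: refuse to overcommit
sudo sysctl -w vm.overcommit_ratio=80   # commit limit = swap + 80% of physical RAM
sudo sysctl -w vm.overcommit_memory=0   # back to the default heuristic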

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
#!/bin/bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NULL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer
...

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

int main() {
  int i;
  for (i=0; i<50; i++) {
    void *m = mmap(NULL, SIZE, PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    memset(m, 0x80, SIZE);
    sleep(1);
  }
  return 0;
}

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!

~keegan

Wednesday Oct 06, 2010

Anatomy of a Debian package

Ever wondered what a .deb file actually is? How is it put together, and what's inside it, besides the data that is installed to your system when you install the package? Today we're going to break out our sysadmin's toolbox and find out. (While we could just turn to deb(5), that would ruin the fun.) You'll need a Debian-based system to play along. Ubuntu and other derivatives should work just fine.

Finding a file to look at

Whenever APT downloads a package to install, it saves it in a package cache, located in /var/cache/apt/archives/. We can poke around in this directory to find a package to look at.
spang@sencha:~> cd /var/cache/apt/archives
spang@sencha:/var/cache/apt/archives>
spang@sencha:/var/cache/apt/archives> ls
apache2-utils_2.2.16-2_amd64.deb
app-install-data_2010.08.21_all.deb
apt_0.8.0_amd64.deb
apt_0.8.5_amd64.deb
aptitude_0.6.3-3.1_amd64.deb
...
nano, the text editor, ought to be a simple package. Let's take a look at that one.

spang@sencha:/var/cache/apt/archives> cp nano_2.2.5-1_amd64.deb ~/tmp/blog
spang@sencha:/var/cache/apt/archives> cd ~/tmp/blog

Digging in

Let's see what we can figure out about this file. The file command is a nifty tool that tries to figure out what kind of data a file contains.

spang@sencha:~/tmp/blog> file --raw --keep-going nano_2.2.5-1_amd64.deb 
nano_2.2.5-1_amd64.deb: Debian binary package (format 2.0)
- current ar archive
- archive file
Hmm, so file, which identifies filetypes by performing tests on them (rather than by looking at the file extension or something else cosmetic), must have a special test that identifies Debian packages. Because we passed the --keep-going option, though, file carried on and reported the other, less specific tests that also match -- and in our case those tell us what a "Debian binary package" actually is under the hood: an "ar" archive!

Aside: a little bit of history

Back in the day, in 1995 and before, Debian packages used to use their own ad-hoc archive format. These days, you can find that old format documented in deb-old(5). The new format was added to be "saner and more extensible" than the original. You can still find binaries in the old format on archive.debian.org. You'll see that file tells us that these debs are different; it doesn't know how to identify them in a more specific way than "a bunch of bits":

spang@sencha:~/tmp/blog> file --raw --keep-going adduser-1.94-1.deb
adduser-1.94-1.deb: data
Now we can crack open the deb using the ar utility to see what's inside.

Inside the box

ar takes an operation code, optional modifier flags, and the archive to act on as its arguments. The x operation tells it to extract files, and the v modifier tells it to be verbose.

spang@sencha:~/tmp/blog> ar vx nano_2.2.5-1_amd64.deb
x - debian-binary
x - control.tar.gz
x - data.tar.gz
So, we have three files.

debian-binary

spang@sencha:~/tmp/blog> cat debian-binary
2.0
This is just the version number of the binary package format being used, so tools know what they're dealing with and can modify their behaviour accordingly. One of file's tests uses the string in this file to add the package format to its output, as we saw earlier.

control.tar.gz

spang@sencha:~/tmp/blog> tar xzvf control.tar.gz
./
./postinst
./control
./conffiles
./prerm
./postrm
./preinst
./md5sums
These control files are used by the tools that work with the package and install it to the system—mostly dpkg.

control
spang@sencha:~/tmp/blog> cat control
Package: nano
Version: 2.2.5-1
Architecture: amd64
Maintainer: Jordi Mallach 
Installed-Size: 1824
Depends: libc6 (>= 2.3.4), libncursesw5 (>= 5.7+20100313), dpkg (>= 1.15.4) | install-info
Suggests: spell
Conflicts: pico
Breaks: alpine-pico (<= 2.00+dfsg-5)
Replaces: pico
Provides: editor
Section: editors
Priority: important
Homepage: http://www.nano-editor.org/
Description: small, friendly text editor inspired by Pico
 GNU nano is an easy-to-use text editor originally designed as a replacement
 for Pico, the ncurses-based editor from the non-free mailer package Pine
 (itself now available under the Apache License as Alpine).
 .
 However, nano also implements many features missing in pico, including:
  - feature toggles;
  - interactive search and replace (with regular expression support);
  - go to line (and column) command;
  - auto-indentation and color syntax-highlighting;
  - filename tab-completion and support for multiple buffers;
  - full internationalization support.
This file contains a lot of important metadata about the package. In this case, we have:
  • its name
  • its version number
  • binary-specific information: which architecture it was built for, and how many bytes it takes up after it is installed
  • its relationship to other packages (on the Depends, Suggests, Conflicts, Breaks, and Replaces lines)
  • the person who is responsible for this package in Debian (the "maintainer")
  • How the package is categorized in Debian as a whole: nano is in the "editors" section. A complete list of archive sections can be found here.
  • A "priority" rating. "Important" means that the package "should be found on any Unix-like system". You'd be hard-pressed to find a Debian system without nano.
  • a homepage
  • a description which should provide enough information for an interested user to figure out whether or not she wants to install the package
One line that takes a bit more explanation is the "Provides:" line. It means that installing nano counts not only as installing the nano package, but also as installing the editor package -- a package that doesn't really exist on its own and is only ever provided by other packages. This way, other packages that need a text editor can depend on "editor" and not have to worry about the fact that there are many different sufficient choices available.
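
To see why this is convenient, imagine a hypothetical package that just needs some text editor installed; its control file could say:

Package: my-notes-tool
Depends: nano | editor

Installing nano, vim, emacs, or anything else that Provides: editor satisfies that dependency. (Debian policy prefers listing a real package like nano first, with the virtual package as the fallback alternative.)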

You can get most of this same information for installed packages and packages from your configured package repositories using the command aptitude show <packagename>, or dpkg --status <packagename> if the package is installed.

postinst, prerm, postrm, preinst
These files are maintainer scripts. If you take a look at one, you'll see that it's just a shell script that is run at some point during the [un]installation process.

spang@sencha:~/tmp/blog> cat preinst
#!/bin/sh

set -e

if [ "$1" = "upgrade" ]; then
    if dpkg --compare-versions "$2" lt 1.2.4-2; then
	if [ ! -e /usr/man ]; then
	    ln -s /usr/share/man /usr/man
	    update-alternatives --remove editor /usr/bin/nano || RET=$?
	    rm /usr/man
	    if [ -n "$RET" ]; then
	        exit $RET
	    fi
	else
	    update-alternatives --remove editor /usr/bin/nano
	fi
    fi
fi
More on the nitty-gritty of maintainer scripts can be found here.

conffiles
spang@sencha:~/tmp/blog> cat conffiles 
/etc/nanorc
Any configuration files for the package, generally found in /etc, are listed here, so that dpkg knows not to blindly overwrite any local configuration changes you've made when it upgrades the package.

md5sums
This file contains checksums of each of the data files in the package so dpkg can make sure they weren't corrupted or tampered with.

data.tar.gz

Here are the actual data files that will be added to your system's / when the package is installed.
spang@sencha:~/tmp/blog> tar xzvf data.tar.gz
./
./bin/
./bin/nano
./usr/
./usr/bin/
./usr/share/
./usr/share/doc/
./usr/share/doc/nano/
./usr/share/doc/nano/examples/
./usr/share/doc/nano/examples/nanorc.sample.gz
./usr/share/doc/nano/THANKS
./usr/share/doc/nano/changelog.gz
./usr/share/doc/nano/BUGS.gz
./usr/share/doc/nano/TODO.gz
./usr/share/doc/nano/NEWS.gz
./usr/share/doc/nano/changelog.Debian.gz
[...]
./etc/
./etc/nanorc
./bin/rnano
./usr/bin/nano

Finishing up

That's it! That's all there is inside a Debian package. Of course, no one building a package for Debian-based systems would do the reverse of what we just did, using raw tools like ar, tar, and gzip. Debian packages use a make-based build system, and learning how to build them using all the tools that have been developed for this purpose is a topic for another time. If you're interested, the new maintainer's guide is a decent place to start.

And next time, if you need to take a look inside a .deb again, use the dpkg-deb utility:

spang@sencha:~/tmp/blog> dpkg-deb --extract nano_2.2.5-1_amd64.deb datafiles
spang@sencha:~/tmp/blog> dpkg-deb --control nano_2.2.5-1_amd64.deb controlfiles
spang@sencha:~/tmp/blog> dpkg-deb --info nano_2.2.5-1_amd64.deb
 new debian package, version 2.0.
 size 566450 bytes: control archive= 3569 bytes.
      12 bytes,     1 lines      conffiles            
    1010 bytes,    26 lines      control              
    5313 bytes,    80 lines      md5sums              
     582 bytes,    19 lines   *  postinst             #!/bin/sh
     160 bytes,     5 lines   *  postrm               #!/bin/sh
     379 bytes,    20 lines   *  preinst              #!/bin/sh
     153 bytes,    10 lines   *  prerm                #!/bin/sh
 Package: nano
 Version: 2.2.5-1
 Architecture: amd64
 Maintainer: Jordi Mallach 
 Installed-Size: 1824
 Depends: libc6 (>= 2.3.4), libncursesw5 (>= 5.7+20100313), dpkg (>= 1.15.4) | install-info
 Suggests: spell
 Conflicts: pico
 Breaks: alpine-pico (<= 2.00+dfsg-5)
 Replaces: pico
 Provides: editor
 Section: editors
 Priority: important
 Homepage: http://www.nano-editor.org/
 Description: small, friendly text editor inspired by Pico
  GNU nano is an easy-to-use text editor originally designed as a replacement
  for Pico, the ncurses-based editor from the non-free mailer package Pine
  (itself now available under the Apache License as Alpine).
  .
  However, nano also implements many features missing in pico, including:
   - feature toggles;
   - interactive search and replace (with regular expression support);
   - go to line (and column) command;
   - auto-indentation and color syntax-highlighting;
   - filename tab-completion and support for multiple buffers;
   - full internationalization support.

If the package format ever changes again, dpkg-deb will too, and you won't even need to notice.

~spang


Ksplice is hiring!

Do you love tinkering with, exploring, and debugging Linux systems? Does writing Python clones of your favorite childhood computer games sound like a fun weekend project? Have you ever told a joke whose punch line was a git command?

Join Ksplice and work on technology that most people will tell you is impossible: updating the Linux kernel while it is running.

Help us develop the software and infrastructure to bring rebootless kernel updates to Linux, as well as new operating system kernels and other parts of the software stack. We're hiring backend, frontend, and kernel engineers. Say hello at jobs@ksplice.com!

Friday Sep 17, 2010

CVE-2010-3081

Hi. I'm the original developer of Ksplice and the CEO of the company. Today is one of those days that reminds me why I created Ksplice.

I'm writing this blog post to provide some information and assistance to anyone affected by the recent Linux kernel vulnerability CVE-2010-3081, which unfortunately is just about everyone running 64-bit Linux. To make matters worse, in the last day we've received many reports of people attacking production systems using an exploit for this vulnerability, so if you run Linux systems, we recommend that you strongly consider patching this vulnerability. (Linux vendors release important security updates every month, but this vulnerability is particularly high profile and people are using it aggressively to exploit systems).

This vulnerability was introduced into the Linux kernel in April 2008, and so essentially every distribution is affected, including RHEL, CentOS, Debian, Ubuntu, Parallels Virtuozzo Containers, OpenVZ, CloudLinux, and SuSE, among others. A few vendors have released kernels that fix the vulnerability if you reboot, but other vendors, including Red Hat, are still working on releasing an updated kernel.

The published workarounds that we've seen, including the workaround recommended by Red Hat, can themselves be worked around by an attacker to still exploit the system. For now, to be responsible and avoid helping attackers, we don't want to provide those technical details publicly; we've contacted Red Hat and other vendors with the details and we'll cover them in a future blog post, in a few weeks.

Although it might seem self-serving, I do know of one sure way to fix this vulnerability right away on running production systems, and it doesn't even require you to reboot: you can (for free) download Ksplice Uptrack and fully update any of the distributions that we support (We support RHEL, CentOS, Debian, Ubuntu, Parallels Virtuozzo Containers, OpenVZ, and CloudLinux. For high profile updates like this one, Ksplice optionally makes available an update for your distribution before your distribution officially releases a new kernel). We provide a free 30-day trial of Ksplice Uptrack on our website, and you can use this free trial to protect your systems, even if you cannot arrange to reboot anytime soon. It's the best that we can do to help in this situation, and I hope that it's useful to you.

Note: If an attacker has already compromised one of your machines using an exploit for CVE-2010-3081, simply updating the system will not eliminate the presence of an attacker. Similarly, if a machine has already been exploited, then the exploit may continue working on that system even after it has been updated, because of a backdoor that the exploit installs. We've published a test tool to check whether your system has already been compromised by the public CVE-2010-3081 exploit code that we've seen. If one or more of your machines has already been compromised by an attacker, we recommend that you use your normal procedure for dealing with that situation.

~jbarnold

Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm between having to deal with a file or the network or some other part of the system, and then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock
$

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

Times
When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.
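
Putting those together, a typical timing run might look like this (a sketch; the PID and filename are just placeholders):

strace -tt -T -o /tmp/app.trace -p 4242
# -tt timestamps each syscall to the microsecond, -T appends how long it took,
# and -o sends the trace to a file instead of cluttering your terminal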

Data
By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
[...]

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.
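
For example (a sketch):

strace -ff -o /tmp/make-trace make
# writes /tmp/make-trace.<pid>, one trace file per process make spawns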

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?

~price

Monday Jul 26, 2010

Six things I wish Mom told me (about ssh)

If you've ever seriously used a Linux system, you're probably already familiar with at least the basics of ssh. But you're hungry for more. In this post, we'll show you six ssh tips that'll help take you to the next level. (We've also found that they make for excellent cocktail party conversation talking points.)

(1) Take command!

Everyone knows that you can use ssh to get a remote shell, but did you know that you can also use it to run commands on their own? Well, you can--just stick the command name after the hostname! Case in point:

wdaher@rocksteady:~$ ssh bebop uname -a
Linux bebop 2.6.32-24-generic #39-Ubuntu SMP Wed Jul 28 05:14:15 UTC 2010 x86_64 GNU/Linux

Combine this with passwordless ssh logins, and your shell scripting powers have just leveled up. Want to figure out what version of Python you have installed on each of your systems? Just stick ssh hostname python -V in a for loop, and you're done!
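
A minimal version of that loop might look like this (the host list is invented, and python -V writes to stderr, hence the redirect):

$ for host in bebop rocksteady krang; do
>   printf '%s: ' "$host"; ssh "$host" python -V 2>&1
> done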

Some commands, however, don't play along so nicely:

wdaher@rocksteady:~$ ssh bebop top
TERM environment variable not set.

What gives? Some programs need a pseudo-tty, and aren't happy if they don't have one (anything that wants to draw on arbitrary parts of the screen probably falls into this category). But ssh can handle this too--the -t option will force ssh to allocate a pseudo-tty for you, and then you'll be all set.

# Revel in your process-monitoring glory
wdaher@rocksteady:~$ ssh bebop -t top
# Or, resume your session in one command, if you're a GNU Screen user
wdaher@rocksteady:~$ ssh bebop -t screen -dr

(2) Please, try a sip of these ports

But wait, there's more! ssh's ability to forward ports is incredibly powerful. Suppose you have a web dashboard at work that runs on port 80 on a machine named analytics and is only accessible from inside the office network, and you're at home but need to access it because it's 2 a.m. and your pager is going off.

Fortunately, you can ssh to your desktop at work, desktop, which is on the same network as analytics. So if we can connect to desktop, and desktop can connect to analytics, surely we can make this work, right?

Right. We'll start out with something that doesn't quite do what we want:

wdaher@rocksteady:~$ ssh desktop -L 8080:desktop:80

OK, the ssh desktop is the straightforward part. The -L port:hostname:hostport option says "Set up a port forward from local port (in this case, 8080) to hostname:hostport (in this case, desktop:80)."

So now, if you visit http://localhost:8080/ in your web browser at home, you'll actually be connected to port 80 on desktop. Close, but not quite! Remember, we wanted to connect to the web dashboard, running on port 80, on analytics, not desktop.

All we do, though, is adjust the command like so:

wdaher@rocksteady:~$ ssh desktop -L 8080:analytics:80

Now, the remote end of the port forward is analytics:80, which is precisely where the web dashboard is running. But wait, isn't analytics behind the firewall? How can we even reach it? Remember: this connection is being set up on the remote system (desktop), which is the only reason it works.
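
If you'd like to sanity-check the tunnel from a terminal before pointing your browser at it, curl through the local end does the job (-I fetches just the response headers):

$ curl -I http://localhost:8080/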

If you find yourself setting up multiple such forwards, you're probably better off doing something more like:

wdaher@rocksteady:~$ ssh -D 8080 desktop

This will set up a SOCKS proxy at localhost:8080. If you configure your browser to use it, all of your browser traffic will go over SSH and through your remote system, which means you could just navigate to http://analytics/ directly.
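
You can also test the proxy without touching your browser settings: curl speaks SOCKS directly, and --socks5-hostname asks the remote end to resolve the analytics name, which is exactly what we want here:

$ curl --socks5-hostname localhost:8080 http://analytics/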

(3) Til-de do us part

Riddle me this: ssh into a system, press Enter a few times, and then type in a tilde. Nothing appears. Why?

Because the tilde is ssh's escape character. Right after a newline, you can type ~ and a number of other keystrokes to do interesting things to your ssh connection (like give you 30 extra lives in each continue.) ~? will display a full list of the escape sequences, but two handy ones are ~. and ~^Z.

~. (a tilde followed by a period) will terminate the ssh connection, which is handy if you lose your network connection and don't want to wait for your ssh session to time out.  ~^Z (a tilde followed by Ctrl-Z) will put the connection in the background, in case you want to do something else on the host while ssh is running. An example of this in action:

wdaher@rocksteady:~$ ssh bebop
wdaher@bebop:~$ sleep 10000
wdaher@bebop:~$ ~^Z [suspend ssh]

[1]+  Stopped                 ssh bebop
wdaher@rocksteady:~$ # Do something else
wdaher@rocksteady:~$ fg # and you're back!

(4) Dusting for prints

I'm sure you've seen this a million times, and you probably just type "yes" without thinking twice:

wdaher@rocksteady:~$ ssh bebop
The authenticity of host 'bebop (192.168.1.106)' can't be established.
RSA key fingerprint is a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68.
Are you sure you want to continue connecting (yes/no)?

What's actually going on here? Well, if this is your first time connecting to bebop, you can't really tell whether the machine you're talking to is actually bebop, or just an impostor pretending to be bebop. All you know is the key fingerprint of the system you're talking to.  In principle, you're supposed to verify this out-of-band (i.e. call up the remote host and ask them to read off the fingerprint.)

Let's say you and your incredibly security-minded friend actually want to do this. How does one actually find this fingerprint? On the remote host, have your friend run:

sbaker@bebop:~$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
2048 a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68 /etc/ssh/ssh_host_rsa_key.pub (RSA)

Tada! They match, and it's safe to proceed. From now on, this is stored in your list of ssh "known hosts" (in ~/.ssh/known_hosts), so you don't get the prompt every time. And if the key ever changes on the other end, you'll get an alert--someone's trying to read your traffic! (Or your friend reinstalled their system and didn't preserve the key.)

(5) Losing your keys

Unfortunately, some time later, you and your friend have a falling out (something about Kirk vs. Picard), and you want to remove their key from your known hosts. "No problem," you think, "I'll just remove it from my list of known hosts." You open up the file and are unpleasantly surprised: a jumbled file full of all kinds of indecipherable characters. They're actually hashes of the hostnames (or IP addresses) that you've connected to before, and their associated keys.

Before you proceed, surely you're asking yourself: "Why would anyone be so cruel? Why not just list the hostnames in plain text, so that humans could easily edit the file?" In fact, that's actually how it was done until very recently. But it turns out that leaving them in the clear is a potential security risk, since it provides an attacker a convenient list of other places you've connected (places where, e.g., an unwitting user might have used the same password).
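
The hashing doesn't stop you from checking whether a particular host already has an entry, though: ssh-keygen -F searches the file for you, hashed or not:

$ ssh-keygen -F bebop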

Fortunately, ssh-keygen -R <hostname> does the trick:

wdaher@rocksteady:~$ ssh-keygen -R bebop
/home/wdaher/.ssh/known_hosts updated.
Original contents retained as /home/wdaher/.ssh/known_hosts.old

I'm told there's still no easy way to remove now-bitter memories of your friendship together, though.

(6) A connection by any other name...

If you've read this far, you're an ssh pro. And like any ssh pro, you log into a bajillion systems, each with their own usernames, ports, and long hostnames. Like your accounts at AWS, Rackspace Cloud, your dedicated server, and your friend's home system.

And you already know how to do this. username@host or -l username to specify your username, and -p portnumber to specify the port:

wdaher@rocksteady:~$ ssh -p 2222 bob.example.com
wdaher@rocksteady:~$ ssh -p 8183 waseem@alice.example.com
wdaher@rocksteady:~$ ssh -p 31337 -l waseemio wsd.example.com

But this gets really old really quickly, especially when you need to pass a slew of other options for each of these connections. Enter ~/.ssh/config, a file where you specify convenient aliases for each of these sets of settings:

Host bob
    HostName bob.example.com
    Port 2222
    User wdaher

Host alice
    HostName alice.example.com
    Port 8183
    User waseem

Host self
    HostName wsd.example.com
    Port 31337
    User waseemio

So now it's as simple as:

wdaher@rocksteady:~$ ssh bob
wdaher@rocksteady:~$ ssh alice
wdaher@rocksteady:~$ ssh self
And yes, the config file lets you specify port forwards or commands to run as well, if you'd like--check out the ssh_config manual page for the details.
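
For example, sticking with the dashboard scenario from tip 2 (hostnames invented to match it), an entry like this sets up the port forward automatically every time you connect:

Host work
    HostName desktop.example.com
    User wdaher
    LocalForward 8080 analytics:80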

ssh! It's (not) a secret

This list is by no means exhaustive, so I turn to you: what other ssh tips and tricks have you learned over the years? Leave ’em in the comments!

~wdaher

Monday Jul 12, 2010

Source diving for sysadmins

As a system administrator, I work with dozens of large systems every day--Apache, MySQL, Postfix, Dovecot, and the list goes on from there. While I have a good idea of how to configure all of these pieces of software, I'm not intimately familiar with all of their code bases. And every so often, I'll run into a problem which I can't configure around.

When I'm lucky, I can reproduce the bug in a testing environment. I can then drop in arbitrary print statements, recompile with debugging flags, or otherwise modify my application to give me useful data. But all too often, I find that either the bug vanishes when it's not in my production environment, or it would simply take too much time or resources to even set up a testing deployment. When this happens, I find myself left with no alternative but to sift through the source code of the failing system, hoping to find clues as to the cause of the bug of the day. Doing so is never painless, but over time I've developed a set of techniques to make the source diving experience as focused and productive as possible.

To illustrate these techniques, I'll walk you through a real-world debugging experience I had a few weeks ago. I am a maintainer of the XVM project, an MIT-internal VPS service. We keep the disks of our virtual servers in shared storage, and we use clustering software to coordinate changes to the disks.

For a long time, we had happily run a cluster of four nodes. After receiving a grant for new hardware, we attempted to increase the size of our cluster from four nodes to eight nodes. But once we added the new nodes to the cluster, disaster struck. With five or more nodes in the cluster, no matter what we tried, we received the same error message:

root@babylon-four:~# lvs
 cluster request failed: Cannot allocate memory
 Can't get lock for xenvg
 Skipping volume group xenvg
 cluster request failed: Cannot allocate memory
 Can't get lock for babylon-four

And to make matters even more exciting, by the time we observed the problem, users had already booted enough virtual servers that we did not have the RAM to go back to four nodes. So there we were, with a broken cluster to debug.

Tip 1: Check the likely causes of failure first.

It can be quite tempting to believe that a given problem is caused by a bug in someone else's code rather than your own error in configuration. In reality, the common bugs in large, widely-used projects have already been squashed, meaning the most likely cause of error is something that you are doing wrong. I've lost track of the number of times I was sure I encountered a bug in some software, only to later discover that I had forgotten to set a configuration variable. So when you encounter a failure of some kind, make sure that your environment is not obviously at fault. Check your configuration files, check resource usage, check log files.

In the case of XVM, after seeing memory errors, we naturally figured we were out of memory--but free -m showed plenty of spare RAM. Thinking a rogue process might be to blame, we ran ps aux and top, but no process was consuming abnormal amounts of RAM or CPU. We consulted man pages, we scoured the relevant configuration files in /etc, and we even emailed the clustering software's user list, trying to determine if we were doing something wrong. Our efforts failed to uncover any problems on our end.

Tip 2: Gather as much debugging output as you can. You're going to need it.

Once you're sure you actually need to do a source dive, you should make sure you have all the information you can get about what your program is doing wrong. See if your program has a "debugging" or "verbosity" level you can turn up. Check /var/log/ for dedicated log files for the software under consideration, or perhaps check a standard log such as syslog. If your program does not provide enough output on its own, try using strace -p to dump the system calls it's issuing.

Before doing our clustering-software source dive, we cranked debugging as high as it would go to get the following output:

Got new connection on fd 5
Read on local socket 5, len = 28
creating pipe, [9, 10]
Creating pre&post thread
in sub thread: client = 0x69f010
Sub thread ready for work.
doing PRE command LOCK_VG 'V_xenvg' at 1 (client=0x69f010)
lock_resource 'V_xenvg-1', flags=0, mode=1
Created pre&post thread, state = 0
Writing status 12 down pipe 10
Waiting for next pre command
read on PIPE 9: 4 bytes: status: 12
background routine status was 12, sock_client=0x69f010
Send local reply

Note that this spew does not contain an obvious error message. Still, it had enough information for us to ultimately track down and fix the problem that beset us.

Tip 3: Use the right tools for the job

Perhaps the worst part of living in a world with many debugging tools is that it's easy to waste time using the wrong ones. If you are seeing a segmentation fault or an apparent deadlock, then your first instinct should be to reach for gdb. gdb has all sorts of nifty capabilities, including the ability to attach to an already-running process. But if you don't have an obvious crash site, often the information you glean from dynamic debugging is too narrow or voluminous to be helpful. Some, such as Linus Torvalds, have even vehemently opposed debuggers in general.

Sometimes the simplest tools are the best: together grep and find can help you navigate an entire codebase knowing only fragments of text or filenames (or guesses thereof). It can also be helpful to use a language-specific tool. For C, I recommend cscope, a tool which lets you find symbol usages or definitions.
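
The invocations themselves are nothing fancy. From the top of the source tree, a starting point might look like this (the search strings are just placeholders):

# Every file mentioning a fragment of the error or log message
$ grep -rn "lock_resource" .
# Files whose names hint at the subsystem in question
$ find . -name '*lock*.c'
# Build a cscope cross-reference for the whole tree and browse it interactively
$ cscope -R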

XVM's clustering problem was with a multiprocess network-based program, and we had no idea where the failure was originating. Both properties would have made the use of a dynamic debugger quite onerous. Thus we elected to dive into the source code using nothing but our familiar command-line utilities.

Tip 4: Know your error.

Examine your system's failure mode. Is it printing an error message? If so, where is that error message originating? What can possibly cause that error? If you don't understand the symptoms of a failure, you certainly won't be able to diagnose its cause.

Often, grep as you might, you won't find the text of the error message in the codebase under consideration. Rather, a standard UNIX error-reporting mechanism is to internally set the global variable errno, which is converted to a string using strerror.

Here's a Python script that I've found useful for converting the output of strerror to the corresponding symbolic error name. (Just pass the script any substring of your error as an argument.)

#!/usr/bin/env python
import errno, os, sys
msg = ' '.join(sys.argv[1:]).lower()
for i in xrange(256):
    err = os.strerror(i)
    if msg in err.lower():
        print '%s [errno %d]: %s' % (errno.errorcode.get(i, '(unknown)'), i, err)
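
For instance, saved as (say) errno-name.py and fed our error message, it should print something like:

$ python errno-name.py cannot allocate memory
ENOMEM [errno 12]: Cannot allocate memory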

This script shows that the "Cannot allocate memory" message we had seen was caused by errno being set to the code ENOMEM.

Tip 5: Map lines of output to lines of code.

You can learn a lot about the state of a program by determining which lines of code it is executing. First, fetch the source code for the version of the software you are running (generally with apt-get source or yumdownloader --source). Using your handy command-line tools, you should then be able to trace lines of debugging output back to the lines of code that emitted them. You can thus determine a set of lines that are definitively being executed.

Returning to the XVM example, we used apt-get source to fetch the relevant source code and dpkg -l to verify we were running the same version. We then ran a grep for each line of debugging output we had obtained. One such invocation,

grep -r "lock_resource '.*'" .

showed that the corresponding log entry was emitted by a line in the middle of a function entitled _lock_resource.

Tip 6: Be systematic.

If you've followed the preceding tips, you'll know what parts of the code the program is executing and how it's erroring out. From there, you should work systematically, eliminating parts of the code that you can prove are not contributing to your error. Be sure you have actual evidence for your conclusions--the existence of a bug indicates that the program is in an unexpected state, and thus the core assumptions of the code may be violated.

At this point in the XVM debugging, we examined the _lock_resource function. After the debugging message we had in our logs, all paths of control flow except one printed a message we had not seen. That path terminated with an error from a function called saLckResourceLock. Hence we had found the source of our error.

We also noticed that _lock_resource transforms error values returned by the function it calls using ais_to_errno. Reading the body of ais_to_errno, we noted it just maps internal error values to standard UNIX error codes. So instead of ENOMEM, the real culprit was one of SA_AIS_ERR_NO_MEMORY, SA_AIS_ERR_NO_RESOURCES, or SA_AIS_ERR_NO_SECTIONS. This certainly explained why we could see this error message even on machines with tens of gigabytes of free memory!

Ultimately, our debugging process brought us to the following block of code:

if (global_lock_count == LCK_MAX_NUM_LOCKS) {
        error = SA_AIS_ERR_NO_RESOURCES;
        goto error_exit;
}

This chunk of code felt exactly right. It was bound by some hard-coded limit (namely, LCK_MAX_NUM_LOCKS, the maximum number of locks) and hitting it returned one of the error codes we were seeking. We bumped the value of the constant and have been running smoothly ever since.

Tip 7: Make sure you really fixed it.

How many times have you been certain you finally found an elusive bug, spent hours recompiling and redeploying, and then found that the bug was actually still there? Or even better, the bug simply changed when it appears, and you failed to find this out before telling everyone that you fixed it?

It's important that after squashing a bug, you examine, test, and sanity-check your changes. Perhaps explain your reasoning to someone else. It's all too easy to "fix" code that isn't broken, only cover a subset of the relevant cases, or introduce a new bug in your patch.

After bumping the value of LCK_MAX_NUM_LOCKS, we checked the project's changelog. We found a commit increasing the maximum number of locks without any changes to code, so our patch seemed safe. We explained our reasoning and findings to the other project developers, quietly deployed our patched version, and then after a week of stability sent an announce email proclaiming that we had fixed the cluster.

Your turn

What techniques have you found useful for debugging unfamiliar code?

~gdb

Thursday Jun 24, 2010

Attack of the Cosmic Rays!

It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn't make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: configure.ac: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
…

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD
[nelhage@psychotique]$ 

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure is strange and sudden, it seemed perfectly reproducible, which means I had the luxury of being able to run as many tests as I wanted to debug it.

The problem is a segfault, so I decided to pull up a debugger and figure out where it's segfaulting. First, though, I'd want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
…
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few times confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. gdb will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.
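
A quick way to convince yourself of that is to XOR the two addresses in the shell; only the differing bits survive, and a result that's a power of two means exactly one bit changed:

$ printf '%x\n' $(( 0x401a70 ^ 0x1a70 ))
400000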

But why?

So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it's different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that two bytes differ, and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and preserved by the package manager when it installed the package. So unless something really fishy is going on, the binary on disk hasn't been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
3
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I use to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches won't work, because the redirection is performed by my unprivileged shell, not by the sudo'd echo, and the shell doesn't have permission to open the file for writing.)

Conclusion

As I mentioned earlier, I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug.

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- they're just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?

~nelhage

Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.

Wednesday May 26, 2010

The top 10 tricks of Perl one-liners

I'm a recovering perl hacker. Perl used to be far and away my language of choice, but these days I'm more likely to write new code in Python, largely because far more of my friends and coworkers are comfortable with it.

I'll never give up perl for quick one-liners on the command-line or in one-off scripts for munging text, though. Anything that lasts long enough to make it into git somewhere usually gets rewritten in Python, but nothing beats perl for interactive messing with text.

Perl, never afraid of obscure shorthands, has accrued an impressive number of features that help with this use case. I'd like to share some of my favorites that you might not have heard of.

One-liners primer

We'll start with a brief refresher on the basics of perl one-liners before we begin. The core of any perl one-liner is the -e switch, which lets you pass a snippet of code on the command-line: perl -e 'print "hi\n"' prints "hi" to the console.

The second standard trick to perl one-liners are the -n and -p flags. Both of these make perl put an implicit loop around your program, running it once for each line of input, with the line in the $_ variable. -p also adds an implicit print at the end of each iteration.

Both of these use perl's special "ARGV" magic file handle internally. What this means is that if there are any files listed on the command-line after your -e, perl will loop over the contents of the files, one at a time. If there aren't any, it will fall back to looping over standard input.

perl -ne 'print if /foo/' acts a lot like grep foo, and perl -pe 's/foo/bar/' replaces foo with bar.

Most of the rest of these tricks assume you're using either -n or -p, so I won't mention it every time.

The top 10 one-liner tricks

Trick #1: -l

Smart newline processing. Normally, perl hands you entire lines, including a trailing newline. With -l, it will strip the trailing newline off of any lines read, and automatically add a newline to anything you print (including via -p).

Suppose I wanted to strip trailing whitespace from a file. I might naïvely try something like

perl -pe 's/\s*$//'
The problem, however, is that the line ends with "\n", which is whitespace, and so that snippet will also remove all newlines from my file! -l solves the problem, by pulling off the newline before handing my script the line, and then tacking a new one on afterwards:
perl -lpe 's/\s*$//'

Trick #2: -0

Occasionally, it's useful to run a script over an entire file, or over larger chunks at once. -0 makes -n and -p feed you chunks split on NULL bytes instead of newlines. This is often useful for, e.g. processing the output of find -print0. Furthermore, perl -0777 makes perl not do any splitting, and pass entire files to your script in $_.

find . -name '*~' -print0 | perl -0ne unlink

Could be used to delete all ~-files in a directory tree, without having to remember how xargs works.

Trick #3: -i

-i tells perl to operate on files in-place. If you use -n or -p with -i, and you pass perl filenames on the command-line, perl will run your script on those files, and then replace their contents with the output. -i optionally accepts a backup suffix as argument; Perl will write backup copies of edited files to names with that suffix added.

perl -i.bak -ne 'print unless /^#/' script.sh

Would strip all whole-line comments from script.sh, but leave a copy of the original in script.sh.bak.

Trick #4: The .. operator

Perl's .. operator is a stateful operator -- it remembers state between evaluations. As long as its left operand is false, it returns false; once the left operand evaluates to true, it returns true, and keeps returning true until the right operand evaluates to true, after which it resets and goes back to testing the left operand again.

What does that mean in practice? It's a range operator: It can be easily used to act on a range of lines in a file. For instance, I can extract all GPG public keys from a file using:

perl -ne 'print if /-----BEGIN PGP PUBLIC KEY BLOCK-----/../-----END PGP PUBLIC KEY BLOCK-----/' FILE
Trick #5: -a

-a turns on autosplit mode – perl will automatically split input lines on whitespace into the @F array. If you ever run into any advice that accidentally escaped from 1980 telling you to use awk because it automatically splits lines into fields, this is how you use perl to do the same thing without learning another, even worse, language.

As an example, you could print a list of files along with their link counts using

ls -l | perl -lane 'print "$F[7] $F[1]"'

Trick #6: -F

-F is used in conjunction with -a, to choose the delimiter on which to split lines. To print every user in /etc/passwd (which is colon-separated with the user in the first column), we could do:

perl -F: -lane 'print $F[0]' /etc/passwd
Trick #7: \K

\K is undoubtedly my favorite little-known-feature of Perl regular expressions. If \K appears in a regex, it causes the regex matcher to drop everything before that point from the internal record of "Which string did this regex match?". This is most useful in conjunction with s///, where it gives you a simple way to match a long expression, but only replace a suffix of it.

Suppose I want to replace the From: field in an email. We could write something like

perl -lape 's/(^From:).*/$1 Nelson Elhage <nelhage\@ksplice.com>/'

But having to parenthesize the right bit and include the $1 is annoying and error-prone. We can simplify the regex by using \K to tell perl we won't want to replace the start of the match:

perl -lape 's/^From:\K.*/ Nelson Elhage <nelhage\@ksplice.com>/'
Trick #8: $ENV{}

When you're writing a one-liner using -e in the shell, you generally want to quote it with ', so that dollar signs inside the one-liner aren't expanded by the shell. But that makes it annoying to use a ' inside your one-liner, since you can't escape a single quote inside of single quotes, in the shell.

Let's suppose we wanted to print the username of anyone in /etc/passwd whose name included an apostrophe. One option would be to use a standard shell-quoting trick to include the ':

perl -F: -lane 'print $F[0] if $F[4] =~ /'"'"'/' /etc/passwd

But counting apostrophes and backslashes gets old fast. A better option, in my opinion, is to use the environment to pass the regex into perl, which lets you dodge a layer of parsing entirely:

env re="'" perl -F: -lane 'print $F[0] if $F[4] =~ /$ENV{re}/' /etc/passwd

We use the env command to place the regex in a variable called re, which we can then refer to from the perl script through the %ENV hash. This way is slightly longer, but I find the savings in counting backslashes or quotes to be worth it, especially if you need to end up embedding strings with more than a single metacharacter.

Trick #9: BEGIN and END

BEGIN { ... } and END { ... } let you put code that gets run entirely before or after the loop over the lines.

For example, I could sum the values in the second column of a CSV file using:

perl -F, -lane '$t += $F[1]; END { print $t }'
Trick #10: -MRegexp::Common

Using -M on the command line tells perl to load the given module before running your code. There are thousands of modules available on CPAN, numerous of them potentially useful in one-liners, but one of my favorite for one-liner use is Regexp::Common, which, as its name suggests, contains regular expressions to match numerous commonly-used pieces of data.

The full set of regexes available in Regexp::Common is available in its documentation, but here's an example of where I might use it:

Neither ifconfig nor the ip tool that is supposed to replace it provides, as far as I know, an easy way of extracting information for use by scripts. The ifdata program provides such an interface, but isn't installed everywhere. Using perl and Regexp::Common, however, we can do a pretty decent job of extracting an IP from ip's output:

ip address list eth0 | \
  perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'

So, those are my favorite tricks, but I always love learning more. What tricks have you found or invented for messing with perl on the command-line? What's the most egregious perl "one-liner" you've wielded, continuing to tack on statements well after the point where you should have dropped your code into a real script?

~nelhage

Wednesday May 05, 2010

PXE dust: scalable day-to-day diskless booting

If you're a veteran system administrator, you might remember an era of extremely expensive hard disk storage, when any serious network would have a beefy central file server (probably accessed using the Network File System, NFS) that formed the lifeblood of its operations. It was a well-loved feature as early as Linux kernel 2.0 that you could actually boot your machine with a root filesystem in NFS and have no local disk at all. Hardware costs went down, similar machines could share large parts of their system binaries, upgrades could be done without touching anything but the central server—sysadmins loved this.

But that was then. Diskless booting these days seems a lot less common, even though the technology still exists. You hear about supercomputer clusters using it, but not the "typical" IT department. What happened?

Part of it, I'm sure, is that hard disks became speedier and cheaper more quickly than consumer network technology gained performance. With local disks, it's still difficult to roll out updates to a hundred or a thousand computers simultaneously, but many groups don't start with a hundred or a thousand computers, and multicast system re-imaging software like Norton Ghost prevents the hassle from being unbearable enough to force a switch.

More important, though, is that after a few years of real innovation, the de facto standard in network booting has been stagnant for over a decade. Back in 1993, when the fastest Ethernet anyone could use transferred a little over a megabyte of data per second and IDE hard drives didn't go much faster, network card manufacturers were already including boot ROMs on their expansion cards, each following its own proprietary protocol for loading and executing a bootstrap program. A first effort at standardization, Jamie Honan's "Net Boot Image Proposal", was informally published that year, and soon enough two open-source projects, Etherboot (1995) and Netboot (1996), were providing generic ROM images with pluggable driver support. (Full disclosure: I'm an Etherboot Project developer.) They took care of downloading and executing a boot file, but that file would have no way of going back to the network for more data unless it had a network card driver built in. These tools thus became rather popular for booting Linux, and largely useless for booting simpler system management utilities that couldn't afford the maintenance cost of their own network stack and drivers.

Around this time, Intel was looking at diskless booting from a more commercial point of view: it made management easier, consolidated resources, avoided leaving sysadmins at the mercy of users who broke their systems thinking themselves experts. They published a specification for the Preboot Execution Environment (PXE), as part of a larger initiative called Wired for Management. Network cards started replacing their proprietary boot ROMs with PXE, and things looked pretty good; the venerable SYSLINUX bootloader grew a PXELINUX variant for PXE-booting Linux, and a number of enterprise system management utilities became available in PXE-bootable form.

But, for whatever reason, the standard hasn't been updated since 1999. It still operates in terms of the ancient x86 real mode, only supports UDP and a "slow, simple, and stupid" file transfer protocol called TFTP, and officially limits boot program size to 32kB. For modern-day applications, this is less than ideal.

Luckily for us, the Etherboot Project still exists, and Etherboot's successor gPXE has been picking up where Intel left off, and supports a number of more modern protocols. Between that, excellent support in recent Linux kernels for both accessing and serving SAN disks with high performance, and the flexibility gained by booting with an initial ramdisk, diskless booting is making a big comeback. It's not even very hard to set up: read on!

The general idea

PXE booting flowchart
PXE Booting Flowchart

While it can get a lot more complicated to support boot menus and proprietary operating systems, the basic netbooting process these days is pretty straightforward. The PXE firmware (usually burned into ROM on the network card) performs a DHCP request, just like most networked computers do to get an IP address. The DHCP server has been configured to provide additional "options" in its reply, specifying where to find boot files. All PXE stacks support booting from a file (a network bootstrap program, NBP); PXELINUX is the NBP most commonly used for booting Linux. The NBP can call back to the PXE stack to ask it to download more files using TFTP.

Alternatively, some PXE stacks (including gPXE) support booting from a networked disk, accessed using a SAN protocol like ATA over Ethernet or iSCSI. Since it's running in real mode, the firmware can hook the interrupt table to cause other boot-time programs (like the GRUB bootloader) to see an extra hard disk attached to the system; unbeknownst to these programs, requests to read sectors from that hard disk are actually being routed over the network.

If you want to boot a real-mode operating system like DOS, you can stop there; DOS never looks beyond the hooked interrupt, so it never has to know about the network at all. We're interested in booting Linux, though, and Linux has to manage the network card itself. When the kernel is booted, the PXE firmware becomes inaccessible, so it falls to the initial ramdisk (initrd or initramfs) to establish its own connection to the boot server so it can mount the root filesystem.

Setting up the client

We're going to walk through setting up network booting for a group of similar Ubuntu Lucid systems using disk images served over iSCSI. (The instructions should work with Karmic as well.) iSCSI runs on top of a TCP/IP stack, so it'll work fine within your current network infrastructure. Even over 100Mbps Ethernet, it's not significantly slower than a local disk boot, and certainly faster than a live CD. Rebooting may be obsolete, but short bootup times are still good to have!

You'll want to start by installing one Ubuntu system and configuring it how you'll want all of your diskless clients to be configured. There's room for individual configuration (like setting unique hostnames and maybe passwords) later on, but the more you can do once here, the less you'll have to do or script for all however-many clients you have. When they're booted, the clients will find your networked drive just like a real hard drive; it'll show up as /dev/sda, in /proc/scsi/scsi, and so forth, so you can pretty much configure them just like any local machine.

No matter what site-specific configuration choices you make, there are some steps you'll need to perform to make the image iSCSI bootable. First, you'll need to install the iSCSI initiator, which makes the connection to the server housing the boot disk image:

client# aptitude install open-iscsi

That connection will need to occur during the earliest stages of bootup, in the scripts on the initial ramdisk. open-iscsi can automatically update the initramfs to find and mount the iSCSI device, but it assumes you'll be setting a bunch of parameters in a configuration file to point it in the right place. It's quite cumbersome to set this up separately for every node, so I have prepared a patch that will make the initramfs automatically set these values based on the "boot firmware table" created by the iSCSI boot firmware from the information provided by your DHCP server. You should apply it now with

client# wget http://etherboot.org/share/oremanj/ubuntu-iscsi-ibft.patch
client# patch -p0 -i ubuntu-iscsi-ibft.patch
patching file /usr/share/initramfs-tools/hooks/iscsi
patching file /usr/share/initramfs-tools/scripts/local-top/iscsi

Next, tell the setup scripts you want boot-time iSCSI and regenerate the ramdisk to include your modifications:

client# touch /etc/iscsi/iscsi.initramfs
client# update-initramfs -u

Finally, make sure the clients will get an IP address at boot time, so they can get to their root filesystem:

client# vi /etc/default/grub
    [find the GRUB_CMDLINE_LINUX line and add ip=dhcp to it;
     e.g. GRUB_CMDLINE_LINUX="" becomes GRUB_CMDLINE_LINUX="ip=dhcp"]
client# update-grub

Setting up the storage

Let's assume you've set up the prototype client as above, and you now have an image of its hard disk in a file somewhere. Because the disk-level utilities we'll be using don't know how to deal with files, it's necessary to create a loop device to bridge the two:

server# losetup -v -f /path/to/ubuntu-image
Loop device is /dev/loop0
If you get different output, or if the client disk image you created is already on a "real" block device (e.g. using LVM), replace /dev/loop0 with that device in the below examples.

You may be familiar with the Linux device mapper, probably as the backend behind LVM, but there's a lot more it can do. In particular, it gives us very easy copy-on-write (COW) semantics: you can create multiple overlays on a shared image, such that writes to the overlay get stored separately from the underlying image, reads of areas you've written come from the overlay, and reads of areas you've not modified come from the underlying image, all transparently. The shared image is not modified, and the overlays are only as large as is necessary to store each one's changes. Let's suppose you've got some space in /cow that you want to use for the overlay images; then you can create five of them named /cow/overlay-1.img through /cow/overlay-5.img with

server# for i in $(seq 1 5); do
> dd if=/dev/zero of=/cow/overlay-$i.img bs=512 count=1 seek=10M
> done
(10M blocks * 512 bytes per block = 5GB per overlay; this represents the amount of new data that can be written "on top of" the base image.)

Now for the fun part. The aforementioned snapshot functionality is provided by the dm-snapshot module; it's a standard part of the Linux device mapper, but you might not have it loaded if you've not used the snapshotting feature before. Rectify that if necessary:

server# modprobe dm-snapshot
and set up the copy-on-write like this:
server# for i in $(seq 1 5); do
> loopdev=$(losetup -f)
> losetup $loopdev /cow/overlay-$i.img
> echo "0 $(blockdev --getsize /dev/loop0) snapshot /dev/loop0 $loopdev p 8" | dmsetup create image-$i
> done

This sequence of commands assigns a loopback device to each COW overlay file (to make it look like a normal block device) and creates a bunch of /dev/mapper/image-n devices from which each client will boot. The 8 in the above line is the "chunk size" in 512-byte blocks, i.e. the size of the modified regions that the overlay device will record. A large chunk size wastes more disk space if you're only modifying a byte here and there, but may increase performance by lowering the overhead of the COW setup. p makes the overlays "persistent"; i.e., all relevant state is stored in the image itself, so it can survive a reboot.
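
If you're curious later how much of an overlay is actually in use, dmsetup can report on each snapshot; the status output includes how many sectors of the overlay have been allocated out of the total:

server# dmsetup status image-1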

You can tear down the overlays with dmsetup remove:

server# for i in $(seq 1 5); do
> dmsetup remove image-$i
> done

It's generally not safe to modify the base image when there are overlays on top of it. However, if you script the changes (hostname and such) that you need to make in the overlays, it should be pretty easy to just blow away the COW files and regenerate everything when you need to do an upgrade.

The loopback device and dmsetup configuration need to be performed again after every reboot, but you can reuse the /cow/overlay-n.img files.

Setting up the server for iSCSI

We now have five client images set up to boot over iSCSI; currently they're all passing reads through to the single prototype client image, but when the clients start writing to their disks they won't interfere with each other. All that remains is to set up the iSCSI server and actually boot the clients.

The iSCSI server we'll be using is called ietd, the iSCSI Enterprise Target daemon; there are several others available, but ietd is simple and mature—perfect for our purposes. Install it:

server# aptitude install iscsitarget

Next, we need to tell ietd where it can find our disk images. The relevant configuration file is /etc/ietd.conf; edit it and add lines like the following:

Target iqn.2008-01.com.ksplice.servertest:client-1
    Lun 0 Path=/dev/mapper/image-1,Type=blockio
Target iqn.2008-01.com.ksplice.servertest:client-2
    Lun 0 Path=/dev/mapper/image-2,Type=blockio
...

Each Target line names an image that can be mounted over iSCSI, using a hierarchical naming scheme called the "iSCSI Qualified Name" or IQN. In the example above, the com.ksplice.servertest should be replaced with the reverse DNS name of your organization's domain, and 2008-01 with a year and month as of which that name validly referred to you. The part after the colon determines the specific resource being named; in this case these are the client drives client-1, client-2, etc. None of this is required—your clients will quite happily boot images introduced as Target worfle-blarg—but it's customary, and useful in managing large setups. The Lun 0 line specifies a backing store to use for the first SCSI logical unit number of the exported device. (Multi-LUN configurations are outside the scope of this post.)

Edit /etc/default/iscsitarget and change the one line in that file to:

ISCSITARGET_ENABLE=true
You can then start ietd with
server# /etc/init.d/iscsitarget start

To test that it's working, you can install open-iscsi and ask the server what images it's serving up:

server# aptitude install open-iscsi
server# iscsiadm -m discovery -p localhost -t sendtargets
[::1]:3260,1 iqn.2008-01.com.ksplice.servertest:client-1
[::1]:3260,1 iqn.2008-01.com.ksplice.servertest:client-2
...

Setting up DHCP

The only piece that remains is somehow communicating to your clients what they'll be booting from; if they're diskless, they don't have any way to read that information locally. Luckily, you probably already have a DHCP server set up in your organization, and as we mentioned before, it can hand out boot information just as easily as it can hand out IP addresses. You need to have it supply the root-path option (number 17); detailed instructions for ISC dhcpd, the most popular DHCP server, are below.

In order to make sure each client gets the right disk, you'll need to know their MAC addresses; for this demo's sake, we'll assume the addresses are 52:54:00:00:00:0n where n is the client number (1 through 5). Then the lines you'll need to add to /etc/dhcpd.conf, inside the subnet block corresponding to your network, look like this:

        host client-1 {
                hardware ethernet 52:54:00:00:00:01;
                option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1";
        }

        host client-2 {
                hardware ethernet 52:54:00:00:00:02;
                option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-2";
        }
        ...

Replace 192.168.1.90 with the IP address of your iSCSI server. The syntax of the root-path option is actually iscsi:server:protocol:port:lun:iqn, but the middle three fields can be left blank because the defaults (TCP, port 3260, LUN 0) are exactly what we want.

Booting the clients

If your clients are equipped with particularly high-end, "server" network cards, you can likely boot them now and everything will Just Work. Most network cards, though, don't contain an iSCSI initiator; they only know how to boot files downloaded using TFTP. To bridge the gap, we'll be using gPXE.

gPXE is a very flexible open-source boot firmware that implements the PXE standard as well as a number of extensions: you can download files over HTTP, use symbolic DNS names instead of IP addresses, and (most importantly for our purposes) boot off a SAN disk served over iSCSI. You can burn gPXE into your network card, replacing the less-capable PXE firmware, but that's likely more hassle than you'd like to go to. You can start it from a CD or USB key, which is great for testing. For long-term use you probably want to set up PXE chainloading; the basic idea is to configure the DHCP server to hand out your root-path when it gets a DHCP request with user class "gPXE", and the gPXE firmware (in PXE netboot program format) when it gets a request without that user class (coming from your network card's simple PXE firmware).
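
On the DHCP side, that chainloading logic is just a conditional. In ISC dhcpd it looks roughly like the following sketch (undionly.kpxe is a gPXE build that drives the NIC through its existing PXE ROM; the filename and server address are placeholders, and in a real setup the root-path lines stay in the per-host blocks shown earlier):

if exists user-class and option user-class = "gPXE" {
        # gPXE itself is asking: hand out the SAN root path and no boot file
        option root-path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1";
} else {
        # The card's plain PXE ROM is asking: chainload the gPXE netboot program over TFTP
        filename "undionly.kpxe";
        next-server 192.168.1.90;
}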

For now, let's go the easy-testing route and start gPXE from a CD. Download this 600kB ISO image, burn it to a CD, and boot one of your client machines using it. It will automatically perform DHCP and boot, yielding output something like the below:

gPXE 1.0.0+ -- Open Source Boot Firmware -- http://etherboot.org
Features: AoE HTTP iSCSI DNS TFTP bzImage COMBOOT ELF Multiboot NBI PXE PXEXT

net0: 52:54:00:00:00:01 on PCI00:03.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
DHCP (net0 52:54:00:00:00:01).... ok
net0: 192.168.1.110/255.255.255.0 gw 192.168.1.54
Booting from root path "iscsi:192.168.1.90::::iqn.2008-01.com.ksplice.servertest:client-1"
Registered as BIOS drive 0x80
Booting from BIOS drive 0x80
after which, thanks to the client setup performed earlier, the boot will proceed just like from a local hard disk. You can eject the CD as soon as you see the gPXE banner; it's just being used as an oversized ROM chip here.

You'll probably want to boot each client in turn and, at a minimum, set its hostname to something unique. It's also possible to script this on the server side by using kpartx on the /dev/mapper/image-n devices, mounting each client's root partition, and modifying the configuration files therein.
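
A sketch of that server-side approach (the partition device names and the hostname scheme here are guesses; check kpartx's output and adjust for your layout):

server# for i in $(seq 1 5); do
> kpartx -av /dev/mapper/image-$i           # map the image's partitions
> mount /dev/mapper/image-${i}p1 /mnt       # root filesystem, assumed to be the first partition
> echo client-$i > /mnt/etc/hostname
> umount /mnt
> kpartx -d /dev/mapper/image-$i            # tear the mappings back down
> done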

That's it: if you've followed these instructions, you now have a basic but complete architecture for network booting a bunch of similar clients. You've set up servers to handle iSCSI and DHCP, set up one prototype client from which client disks can be automatically generated, and can easily scale to hosting many more clients just by increasing the number 5 to something larger. (You'd probably want to switch to using LVM logical volumes instead of file-backed loopback devices for performance reasons, though.) The number of clients you can quickly provision is limited only by the capacity of your network. And the next time one of your users decides their computer is an excellent place to stick refrigerator magnets, they won't be creating any additional headaches for you.

~oremanj

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.
