Wednesday May 12, 2010

Constant factors: constantly foiling you

There are many different kinds of bugs with many different causes. Sometimes, it's just some silly typo or type mismatch; other times, it's an algorithmic issue or a correctness error; and occasionally, it appears to be something much stranger. This is a story about that last kind of bug, where finding the location of the problem is relatively straightforward, but figuring out how to fix it, or even why it's happening, isn't so easy.

We first meet our hero, yours truly, as he debugs an efficiency problem in a particular program for what is now the third day in a row. This program is actually just a small part in an otherwise large and complex system that Ksplice uses for its internal operations, at the heart of which lies a large collection of data, stored in a carefully constructed and extensively optimized MySQL database.

But this had not always been the case. In the past, much of the data was stored in a large collection of JSON-encoded archives. Seamlessly moving all that data to MySQL was a project that I had been involved in, and this was, in a sense, the final stage of the transition. We still had the JSON files, but having moved to a much shinier and spiffier database, it seemed a good idea to port all the programs that used the data to it as well.

Now, this particular program was designed to make five queries to the MySQL database — all totally identical except for a time column that all the rows contained. The queries each covered a different 48-hour period, distributed at various points throughout the last two months (relative when the program gets run).

These queries seemed simple enough, but mysteriously, the queries for older time periods, which returned fewer rows, also took considerably longer to run!

Eventually, I was able to reduce this to a consistently reproducible example:

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-04-25' AND time < '2010-04-27';

| COUNT(*) |
|  1196740 |

1 row in set (12.13 sec)

mysql> SELECT COUNT(*) FROM events
        WHERE time > '2010-02-24' AND time < '2010-02-26';

| COUNT(*) |
|   740034 |

1 row in set (61.92 sec)

As you can see, the more recent range has about a million rows and took just over 12 seconds to run. The older range has just over half as many, but takes almost 5 times as long. This really doesn't make any sense — after all, the queries are totally identical except for the ranges we are looking across, and those ranges are of completely equal sizes. There really shouldn't be such a huge disparity in the running times.

Let's look at how MySQL is performing these queries:

                                 WHERE time > '2010-04-25' AND time < '2010-04-27';

| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 2234568 |   100.00 | Using where; Using index |

1 row in set, 1 warning (0.04 sec)


                         WHERE time > '2010-02-24' AND time < '2010-02-26';

| id | select_type | table  | type  | possible_keys | key         | key_len | ref  | rows    | filtered | Extra                    |
|  1 | SIMPLE      | events | range | events_time   | events_time | 8       | NULL | 1202716 |   100.00 | Using where; Using index |

1 row in set, 1 warning (0.04 sec)

Besides confirming that we are using an index on the time column, there isn't too much here. I tried using MySQL's built in profiler by running SET PROFILING = 1 and then, after rerunning the second query, running


| Status             | Duration  |
| starting           |  0.000116 |
| Opening tables     |  0.000018 |
| System lock        |  0.000006 |
| Table lock         |  0.000013 |
| init               |  0.000032 |
| optimizing         |  0.000016 |
| statistics         |  0.047627 |
| preparing          |  0.000038 |
| executing          |  0.000007 |
| Sending data       | 62.852915 |
| end                |  0.000035 |
| query end          |  0.000007 |
| freeing items      |  0.000047 |
| logging slow query |  0.000005 |
| cleaning up        |  0.000005 |

15 rows in set (0.03 sec)

Huh. Pretty much all of the time is spent "sending data". What the heck is going on?

Fragmented Database
Fragmented Database. For the newer events, we can just get them all in one go, since they're already sorted by time.

The answer turns out to be fragmentation. Remember when I said that we had migrated to this database from an older flat-file based one? Well, those files weren't read and inserted into the database in the chronological order of their contents, but rather based on the contents itself. So the old data that we inserted was fragmented across the database relative the time column. Newer data, on the other hand, was inserted as it comes in, so it is already pretty much ordered by time.

MySQL stores its indices using B-trees, so we can locate any given row (or range of rows) in logarithmic time. But unfortunately we can't just get any random row off the disk immediately. Instead, MySQL can only read data off the disk on the granularity of a page, which means that for a row living in a page that hasn't yet been read into memory, we have to first do a disk seek to load the right page.

Since the rows are fragmented across the disk, we have to do far more expensive disk seeks. So even though we're getting fewer total rows and the query is locating them in logarithmic time, the constant factor ends up being very large due to all the disk seeking.

What are your favorite bitten-by-a-constant-factor stories?


Wednesday May 05, 2010

PXE dust: scalable day-to-day diskless booting

If you're a veteran system administrator, you might remember an era of extremely expensive hard disk storage, when any serious network would have a beefy central file server (probably accessed using the Network File System, NFS) that formed the lifeblood of its operations. It was a well-loved feature as early as Linux kernel 2.0 that you could actually boot your machine with a root filesystem in NFS and have no local disk at all. Hardware costs went down, similar machines could share large parts of their system binaries, upgrades could be done without touching anything but the central server—sysadmins loved this.

But that was then. Diskless booting these days seems a lot less common, even though the technology still exists. You hear about supercomputer clusters using it, but not the "typical" IT department. What happened?

Part of it, I'm sure, is that hard disks became speedier and cheaper more quickly than consumer network technology gained performance. With local disks, it's still difficult to roll out updates to a hundred or a thousand computers simultaneously, but many groups don't start with a hundred or a thousand computers, and multicast system re-imaging software like Norton Ghost prevents the hassle from being unbearable enough to force a switch.

More important, though, is that after a few years of real innovation, the de facto standard in network booting has been stagnant for over a decade. Back in 1993, when the fastest Ethernet anyone could use transferred a little over a megabyte of data per second and IDE hard drives didn't go much faster, network card managers were already including boot ROMs on their expansion cards, each following its own proprietary protocol for loading and executing a bootstrap program. A first effort at standardization, Jamie Honan's "Net Boot Image Proposal", was informally published that year, and soon enough two open-source projects, Etherboot (1995) and Netboot (1996), were providing generic ROM images with pluggable driver support. (Full disclosure: I'm an Etherboot Project developer.) They took care of downloading and executing a boot file, but that file would have no way of going back to the network for more data unless it had a network card driver built in. These tools thus became rather popular for booting Linux, and largely useless for booting simpler system management utilities that couldn't afford the maintenance cost of their own network stack and drivers.

Around this time, Intel was looking at diskless booting from a more commercial point of view: it made management easier, consolidated resources, avoided leaving sysadmins at the mercy of users who broke their systems thinking themselves experts. They published a specification for the Preboot Execution Environment (PXE), as part of a larger initiative called Wired for Management. Network cards started replacing their proprietary boot ROMs with PXE, and things looked pretty good; the venerable SYSLINUX bootloader grew a PXELINUX variant for PXE-booting Linux, and a number of enterprise system management utilities became available in PXE-bootable form.

But, for whatever reason, the standard hasn't been updated since 1999. It still operates in terms of the ancient x86 real mode, only supports UDP and a "slow, simple, and stupid" file transfer protocol called TFTP, and officially limits boot program size to 32kB. For modern-day applications, this is less than ideal.

Luckily for us, the Etherboot Project still exists, and Etherboot's successor gPXE has been picking up where Intel left off, and supports a number of more modern protocols. Between that, excellent support in recent Linux kernels for both accessing and serving SAN disks with high performance, and the flexibility gained by booting with an initial ramdisk, diskless booting is making a big comeback. It's not even very hard to set up: read on!

The general idea

PXE booting flowchart
PXE Booting Flowchart

While it can get a lot more complicated to support boot menus and proprietary operating systems, the basic netbooting process these days is pretty straightforward. The PXE firmware (usually burned into ROM on the network card) performs a DHCP request, just like most networked computers do to get an IP address. The DHCP server has been configured to provide additional "options" in its reply, specifying where to find boot files. All PXE stacks support booting from a file (a network bootstrap program, NBP); PXELINUX is the NBP most commonly used for booting Linux. The NBP can call back to the PXE stack to ask it to download more files using TFTP.

Alternatively, some PXE stacks (including gPXE) support booting from a networked disk, accessed using a SAN protocol like ATA over Ethernet or iSCSI. Since it's running in real mode, the firmware can hook the interrupt table to cause other boot-time programs (like the GRUB bootloader) to see an extra hard disk attached to the system; unbeknownst to these programs, requests to read sectors from that hard disk are actually being routed over the network.

If you want to boot a real-mode operating system like DOS, you can stop there; DOS never looks beyond the hooked interrupt, so it never has to know about the network at all. We're interested in booting Linux, though, and Linux has to manage the network card itself. When the kernel is booted, the PXE firmware becomes inaccessible, so it falls to the initial ramdisk (initrd or initramfs) to establish its own connection to the boot server so it can mount the root filesystem.

Setting up the client

We're going to walk through setting up network booting for a group of similar Ubuntu Lucid systems using disk images served over iSCSI. (The instructions should work with Karmic as well.) iSCSI runs on top of a TCP/IP stack, so it'll work fine within your current network infrastructure. Even over 100Mbps Ethernet, it's not significantly slower than a local disk boot, and certainly faster than a live CD. Rebooting may be obsolete, but short bootup times are still good to have!

You'll want to start by installing one Ubuntu system and configuring it how you'll want all of your diskless clients to be configured. There's room for individual configuration (like setting unique hostnames and maybe passwords) later on, but the more you can do once here, the less you'll have to do or script for all however-many clients you have. When they're booted, the clients will find your networked drive just like a real hard drive; it'll show up as /dev/sda, in /proc/scsi/scsi, and so forth, so you can pretty much configure them just like any local machine.

No matter what site-specific configuration choices you make, there are some steps you'll need to perform to make the image iSCSI bootable. First, you'll need to install the iSCSI initiator, which makes the connection to the server housing the boot disk image:

client# aptitude install open-iscsi

That connection will need to occur during the earliest stages of bootup, in the scripts on the initial ramdisk. open-iscsi can automatically update the initramfs to find and mount the iSCSI device, but it assumes you'll be setting a bunch of parameters in a configuration file to point it in the right place. It's quite cumbersome to set this up separately for every node, so I have prepared a patch that will make the initramfs automatically set these values based on the "boot firmware table" created by the iSCSI boot firmware from the information provided by your DHCP server. You should apply it now with

client# wget
client# patch -p0 -i ubuntu-iscsi-ibft.patch
patching file /usr/share/initramfs-tools/hooks/iscsi
patching file /usr/share/initramfs-tools/scripts/local-top/iscsi

Next, tell the setup scripts you want boot-time iSCSI and regenerate the ramdisk to include your modifications:

client# touch /etc/iscsi/iscsi.initramfs
client# update-initramfs -u

Finally, make sure the clients will get an IP address at boot time, so they can get to their root filesystem:

client# vi /etc/default/grub
    [find the GRUB_CMDLINE_LINUX line and add ip=dhcp to it;
     e.g. GRUB_CMDLINE_LINUX="" becomes GRUB_CMDLINE_LINUX="ip=dhcp"]
client# update-grub

Setting up the storage

Let's assume you've set up the prototype client as above, and you now have an image of its hard disk in a file somewhere. Because the disk-level utilities we'll be using don't know how to deal with files, it's necessary to create a loop device to bridge the two:

server# losetup -v -f /path/to/ubuntu-image
Loop device is /dev/loop0
If you get different output, or if the client disk image you created is already on a "real" block device (e.g. using LVM), replace /dev/loop0 with that device in the below examples.

You may be familiar with the Linux device mapper, probably as the backend behind LVM, but there's a lot more it can do. In particular, it gives us very easy copy-on-write (COW) semantics: you can create multiple overlays on a shared image, such that writes to the overlay get stored separately from the underlying image, reads of areas you've written come from the overlay, and reads of areas you've not modified come from the underlying image, all transparently. The shared image is not modified, and the overlays are only as large as is necessary to store each one's changes. Let's suppose you've got some space in /cow that you want to use for the overlay images; then you can create five of them named /cow/overlay-1.img through /cow/overlay-5.img with

server# for i in $(seq 1 5); do
> dd if=/dev/zero of=/cow/overlay-$i.img bs=512 count=1 seek=10M
> done
(10M blocks * 512 bytes per block = 5GB per overlay; this represents the amount of new data that can be written "on top of" the base image.)

Now for the fun part. The aforementioned snapshot functionality is provided by the dm-snapshot module; it's a standard part of the Linux device mapper, but you might not have it loaded if you've not used the snapshotting feature before. Rectify that if necessary:

server# modprobe dm-snapshot
and set up the copy-on-write like this:
server# for i in $(seq 1 5); do
> loopdev=$(losetup -f)
> losetup $loopdev /cow/overlay-$i.img
> echo "0 $(blockdev --getsize /dev/loop0) snapshot /dev/loop0 $loopdev p 8" | dmsetup create image-$i
> done

This sequence of commands assigns a loopback device to each COW overlay file (to make it look like a normal block device) and creates a bunch of /dev/mapper/image-n devices from which each client will boot. The 8 in the above line is the "chunk size" in 512-byte blocks, i.e. the size of the modified regions that the overlay device will record. A large chunk size wastes more disk space if you're only modifying a byte here and there, but may increase performance by lowering the overhead of the COW setup. p makes the overlays "persistent"; i.e., all relevant state is stored in the image itself, so it can survive a reboot.

You can tear down the overlays with dmsetup remove:

server# for i in $(seq 1 5); do
> dmsetup remove image-$i
> done

It's generally not safe to modify the base image when there are overlays on top of it. However, if you script the changes (hostname and such) that you need to make in the overlays, it should be pretty easy to just blow away the COW files and regenerate everything when you need to do an upgrade.

The loopback device and dmsetup configuration need to be performed again after every reboot, but you can reuse the /cow/overlay-n.img files.

Setting up the server for iSCSI

We now have five client images set up to boot over iSCSI; currently they're all passing reads through to the single prototype client image, but when the clients start writing to their disks they won't interfere with each other. All that remains is to set up the iSCSI server and actually boot the clients.

The iSCSI server we'll be using is called ietd, the iSCSI Enterprise Target daemon; there are several others available, but ietd is simple and mature—perfect for our purposes. Install it:

server# aptitude install iscsitarget

Next, we need to tell ietd where it can find our disk images. The relevant configuration file is /etc/ietd.conf; edit it and add lines like the following:

    Lun 0 Path=/dev/mapper/image-0,Type=blockio
    Lun 0 Path=/dev/mapper/image-1,Type=blockio

Each Target line names an image that can be mounted over iSCSI, using a hierarchical naming scheme called the "iSCSI Qualified Name" or IQN. In the example above, the com.ksplice.servertest should be replaced with the reverse DNS name of your organization's domain, and 2008-01 with a year and month as of which that name validly referred to you. The part after the colon determines the specific resource being named; in this case these are the client drives client-0, client-1, etc. None of this is required—your clients will quite happily boot images introduced as Target worfle-blarg—but it's customary, and useful in managing large setups. The Lun 0 line specifies a backing store to use for the first SCSI logical unit number of the exported device. (Multi-LUN configurations are outside the scope of this post.)

Edit /etc/default/iscsitarget and change the one line in that file to:

You can then start ietd with
server# /etc/init.d/iscsitarget start

To test that it's working, you can install open-iscsi and ask the server what images it's serving up:

server# aptitude install open-iscsi
server# iscsiadm -m discovery -p localhost -t sendtargets

Setting up DHCP

The only piece that remains is somehow communicating to your clients what they'll be booting from; if they're diskless, they don't have any way to read that information locally. Luckily, you probably already have a DHCP server set up in your organization, and as we mentioned before, it can hand out boot information just as easily as it can hand out IP addresses. You need to have it supply the root-path option (number 17); detailed instructions for ISC dhcpd, the most popular DHCP server, are below.

In order to make sure each client gets the right disk, you'll need to know their MAC addresses; for this demo's sake, we'll assume the addresses are 52:54:00:00:00:0n where n is the client number (1 through 5). Then the lines you'll need to add to /etc/dhcpd.conf, inside the subnet block corresponding to your network, look like this:

        host client-1 {
                hardware ethernet 52:54:00:00:00:01;
                option root-path "";

        host client-2 {
                hardware ethernet 52:54:00:00:00:02;
                option root-path "";

Replace with the IP address of your iSCSI server. The syntax of the root-path option is actually iscsi:server:protocol:port:lun:iqn, but the middle three fields can be left blank because the defaults (TCP, port 3260, LUN 0) are exactly what we want.

Booting the clients

If your clients are equipped with particularly high-end, "server" network cards, you can likely boot them now and everything will Just Work. Most network cards, though, don't contain an iSCSI initiator; they only know how to boot files downloaded using TFTP. To bridge the gap, we'll be using gPXE.

gPXE is a very flexible open-source boot firmware that implements the PXE standard as well as a number of extensions: you can download files over HTTP, use symbolic DNS names instead of IP addresses, and (most importantly for our purposes) boot off a SAN disk served over iSCSI. You can burn gPXE into your network card, replacing the less-capable PXE firmware, but that's likely more hassle than you'd like to go to. You can start it from a CD or USB key, which is great for testing. For long-term use you probably want to set up PXE chainloading; the basic idea is to configure the DHCP server to hand out your root-path when it gets a DHCP request with user class "gPXE", and the gPXE firmware (in PXE netboot program format) when it gets a request without that user class (coming from your network card's simple PXE firmware).

For now, let's go the easy-testing route and start gPXE from a CD. Download this 600kB ISO image, burn it to a CD, and boot one of your client machines using it. It will automatically perform DHCP and boot, yielding output something like the below:

gPXE 1.0.0+ -- Open Source Boot Firmware --

net0: 52:54:00:00:00:01 on PCI00:03.0 (open)
  [Link:up, TX:0 TXE:0 RX:0 RXE:0]
DHCP (net0 52:54:00:00:00:01).... ok
net0: gw
Booting from root path ""
Registered as BIOS drive 0x80
Booting from BIOS drive 0x80
after which, thanks to the client setup peformed earlier, the boot will proceed just like from a local hard disk. You can eject the CD out as soon as you see the gPXE banner; it's just being used as an oversized ROM chip here.

You'll probably want to boot each client in turn and, at a minimum, set its hostname to something unique. It's also possible to script this on the server side by using kpartx on the /dev/mapper/image-n devices, mounting each client's root partition, and modifying the configuration files therein.

That's it: if you've followed these instructions, you now have a basic but complete architecture for network booting a bunch of similar clients. You've set up servers to handle iSCSI and DHCP, set up one prototype client from which client disks can be automatically generated, and can easily scale to hosting many more clients just by increasing the number 5 to something larger. (You'd probably want to switch to using LVM logical volumes instead of file-backed loopback devices for performance reasons, though.) The number of clients you can quickly provision is limited only by the capacity of your network. And the next time one of your users decides their computer is an excellent place to stick refrigerator magnets, they won't be creating any additional headaches for you.



Tired of rebooting to update systems? So are we -- which is why we invented Ksplice, technology that lets you update the Linux kernel without rebooting. It's currently available as part of Oracle Linux Premier Support, Fedora, and Ubuntu desktop. This blog is our place to ramble about technical topics that we (and hopefully you) think are interesting.


« April 2014