Thursday Sep 23, 2010

Leaving Oracle

I first came to Sun over 4 years ago for an internship in the Solaris kernel group.  I was excited to work with such a highly regarded group of talented engineers, and my experience that summer was no disappointment: I learned a lot and had a blast.

After college, I joined Sun's Fishworks team.  Despite my previous experience at Sun, I didn't really know what to expect here, and I certainly didn't imagine then where I'd be now and how much I'd value the experiences of the last three years.  In that short time, I've worked with many bright people with diverse backgrounds, I've learned so much about so many topics (both technical and otherwise), and I've continued having fun contributing to a great product.

Now I've decided to try something different, this time outside Sun/Oracle.  My last day here is 9/24.  While I'm excited about the future, I'll miss both the people and the work here.  I wish you all the best of luck, and thanks for everything!

Update: This blog has moved to, where comments are open for this entry.

Wednesday Sep 22, 2010

SS7000 Software Updates

In this entry I'll explain some of the underlying principles around software upgrade for the 7000 series.  Keep in mind that nearly all of this information is implementation detail of the system and thus subject to change.

Entire system image

One of the fundamental design principles about SS7000 software updates is that all releases update the entire system no matter how small the underlying software change is. Releases never update individual components of the system separately. This sounds surprising to people familiar with traditional OS patches, but this model provides a critical guarantee: for a given release, the software components are identical across all systems running that release. This guarantee makes it possible to test every combination of software the system can run, which isn't the case for traditional OS patches.

When operating systems allow users to apply separate patches to fix individual issues, different systems running the same release (obviously) may have different patches applied to different components. It's impossible to test every combination of components before releasing each new patch, so engineers rely heavily on understanding the scope of the underlying software change (as it affects the rest of the system at several different versions) to know which combinations of patches may conflict with one another. In a complex system, this is very hard to get right. What's worse is that it introduces unnecessary risk to the upgrade and patching process, making customers wary of upgrading, which results in more customers running older software. With a single system image, we can (and do) test every combination of component versions a customer can have.

This model has a few other consequences, some of which are more obvious than others:

  • Updates are complete and self-contained. There's no chance for interaction between older and newer software, and there's no chance for user error in applying illegal combinations of patches.
  • An update's version number implicitly identifies the versions of all software components on the system. This is very helpful for customers, support, and engineering to know exactly what it means to say a bug was fixed in release X or that a system is running release Y (without having to guess or specify the versions of dozens of smaller components).
  • Updates are cumulative; any version of the software has all bug fixes from all previous versions. This makes it easy to identify which releases have a given bug fixed. "Previous" here refers to the version order, not chronological order. For example, V1 and V3 may be in the field when a V2 is released that contains all fixes in V1 plus a few others, but not the fixes in V3. More below on when this happens.
  • Updates are effectively atomic. They either succeed or they fail, but the system is never left in some in-between state running partly old bits and partly new bits. This doesn't follow immediately from having an entire system image, but doing it that way makes this possible.
Of course, it's much easier to achieve this model in the appliance context (which constrains supported configurations and actions for exactly this kind of purpose) than on a general purpose operating system.
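
To illustrate what ordering by version (rather than by release date) looks like, here's a small Python sketch. The version string layout (year, quarter, minor, micro) and the function names are made up for this example; they are an assumption, not the appliance's actual numbering scheme.

```python
import re

# Hypothetical layout for illustration only: YEAR.Qn.MINOR.MICRO
_VERSION_RE = re.compile(r"(\d{4})\.Q([1-4])\.(\d+)\.(\d+)")


def parse_version(text):
    """Turn a string like '2009.Q3.1.0' into a tuple that sorts in version order."""
    match = _VERSION_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    return tuple(int(part) for part in match.groups())


def has_all_fixes_from(release, other):
    """With cumulative updates, a release carries every fix from any lower version."""
    return parse_version(release) >= parse_version(other)


# Chronological release order: V1, then V3, then V2 (a late update to the older line).
released_in_order = ["2009.Q2.1.0", "2009.Q3.0.0", "2009.Q2.2.0"]
by_version = sorted(released_in_order, key=parse_version)
```

Here the late-arriving V2 sorts between V1 and V3, so it is treated as having V1's fixes but not V3's, which matches the example in the list above.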

Types of updates

SS7000 software releases come in a few types characterized by the scope of the software changes contained in the update.

  • Major updates are the primary release vehicle. These are scheduled releases that deliver new features and the vast majority of bug fixes. Major updates generally include a complete sync with the underlying Solaris OS.
  • Minor updates are much smaller, scheduled releases that include a small number of valuable bug fixes. Bugs fixed in minor updates must have high impact or high likelihood of being experienced by customers and have a relatively low risk fix.
  • Micro updates are unscheduled releases issued to address significant issues. These would usually be potential data loss issues, pathological service interruptions (e.g., frequent reboots), or significant performance regressions.

Enterprise customers are often reluctant to apply patches and upgrades to working systems, since any software change carries some risk that it may introduce new problems (even if it only contains bug fixes). This breakdown allows customers to make risk management decisions about upgrading their systems based on the risk associated with each type of update. In particular, the scope of minor and micro releases is highly constrained to minimize risk.

For example, four major software updates have been released: 2008.Q4, 2009.Q2, 2009.Q3, and 2010.Q1. The first four of these have had several updates. We've also released a few micro updates.

Tuesday Sep 21, 2010

Replication for disaster recovery

When designing a disaster recovery solution using remote replication, two important parameters are the recovery time objective (RTO) and the recovery point objective (RPO). For these purposes, a disaster is any event resulting in permanent data loss at the primary site which requires restoring service using data recovered from the disaster recovery (DR) site. The RTO is how soon service must be restored after a disaster. Designing a DR solution to meet a specified RTO is complex but essentially boils down to ensuring that the recovery plan can be executed within the allotted time. Usually this means keeping the plan simple, automating as much of it as possible, documenting it carefully, and testing it frequently. (It helps to use storage systems with built-in features for this.)

In this entry I want to talk about RPO.  RPO describes how recent the data recovered after the disaster must be (in other words, how much data loss is acceptable in the event of disaster). An RPO of 30 minutes means that the recovered data must include all changes up to 30 minutes before the disaster.  But how do businesses decide how much data loss is acceptable in the event of a disaster? The answer varies greatly from case to case. Photo storage for a small social networking site may have an RPO of a few hours; in the worst case, users who uploaded photos a few hours before the event will have to upload them again, which isn't usually a big deal. A stock exchange, by contrast, may have a requirement for zero data loss: the disaster recovery site absolutely must have the most up-to-date data because otherwise different systems may disagree about whose account owns a particular million dollars. Of course, most businesses are somewhere in between: it's important that data be "pretty fresh," but some loss is acceptable.

Replication solutions used for disaster recovery typically fall into two buckets: synchronous and asynchronous. Synchronous replication means that clients making changes on disk at the primary site can't proceed until that data also resides on disk at the disaster recovery site. For example, a database at the primary site won't consider a transaction to have completed until the underlying storage system indicates that the changed data has been stored on disk. If the storage system is configured for synchronous replication, that also means that the change is on disk at the DR site, too. If a disaster occurs at the primary site, there's no data loss when the database is recovered from the DR site because no transactions were committed that weren't also propagated to the DR site. So using synchronous replication it's possible to implement a DR strategy with zero data loss in the event of disaster -- at great cost, discussed below.

By contrast, asynchronous replication means that the storage system can acknowledge changes before they've been replicated to the DR site. In the database example, the database (still using synchronous i/o at the primary site) considers the transaction completed as long as the data is on disk at the primary site. If a disaster occurs at the primary site, the data that hasn't been replicated will be lost. This sounds bad, but if your RPO is 30 minutes and the primary site replicates to the target every 10 minutes, for example, then you can still meet your DR requirements.

Synchronous replication

As I suggested above, synchronous replication comes at great cost, particularly in three areas:

  • availability: In order to truly guarantee no data loss, a synchronous replication system must not acknowledge writes to clients if it can't also replicate that data to the DR site. If the DR site is unavailable, the system must either explicitly fail or just block writes until the problem is resolved, depending on what the application expects. Either way, the entire system now fails if any combination of the primary site, the DR site, or the network link between them fails, instead of just the primary site. If you've got 99% reliability in each of these components, the system's probability of failure goes from 1% to almost 3% (1 - 0.99^3, or about 2.97%).
  • performance: In a system using synchronous replication, the latency of client writes includes the time to write to both storage systems plus the network latency between the sites. DR sites are typically located several miles or more from the primary site in case the disaster takes the form of a datacenter-wide power outage or an actual natural disaster. All things being equal, the farther the sites are apart, the higher the network latency between them. To make things concrete, consider a single threaded client that sees 300us write operations at the primary site (total time including network round trip plus time to write the data to local storage only, not the remote site). Add a synchronous replication target a few miles away with just 500us latency from the primary site and the operation now takes 800us, dropping IOPS from about 3330 to about 1250.
  • money: Of course, this is what it ultimately boils down to. Synchronous replication software alone can cost quite a bit, but you can also end up spending a lot to deal with the above availability and performance problems: clustered head nodes at both sites, redundant network hardware, and very low latency switches and network connections. You might buy some of this for asynchronous replication, too, but network latency is much less important than bandwidth for asynchronous replication so you often can save on the network side.
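
The availability and performance numbers above are simple enough to check by hand, but here's a minimal Python sketch of the arithmetic. It assumes the three components fail independently and that the client is strictly single-threaded (one operation in flight at a time); the function names are mine, chosen for illustration.

```python
def failure_probability(availability, components):
    """Chance that at least one of several independent components is down."""
    return 1 - availability ** components


def single_threaded_iops(latency_us):
    """Operations per second when each operation takes latency_us microseconds."""
    return 1_000_000 / latency_us


# Primary site alone vs. primary site + DR site + network link, each 99% reliable.
primary_only = failure_probability(0.99, 1)
with_sync_replication = failure_probability(0.99, 3)

# 300us local write vs. the same write plus 500us to reach the DR site.
local_iops = single_threaded_iops(300)
sync_iops = single_threaded_iops(300 + 500)

print(f"failure: {primary_only:.2%} -> {with_sync_replication:.2%}")
print(f"IOPS:    {local_iops:.0f} -> {sync_iops:.0f}")
```

A client with many outstanding writes would see less of a throughput drop than this, since the added latency only caps IOPS directly when operations are issued one at a time.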

Asynchronous replication

The above availability costs scale back as the RPO increases.  If the RPO is 24 hours and it only takes 1 hour to replicate a day's changes, then you can sustain a 23-hour outage at the DR site or on the network link without impacting primary site availability at all.  The only performance cost of asynchronous replication is the added latency resulting from a system's additional load for the replication software (which you'd also have to worry about with synchronous replication).  There's no additional latency from the DR network link or the DR site as long as the system is able to keep up with sending updates.
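
Here's a rough Python sketch of that reasoning. It uses a deliberately simple model that is my assumption rather than anything the appliance computes: with snapshot-based updates, the DR copy can be as stale as one full replication interval plus the time it takes to transfer an update. The function names and the 5-minute transfer time are hypothetical.

```python
def worst_case_loss_minutes(interval_min, transfer_min):
    """Oldest the DR copy can be: one full interval plus the time to ship an update."""
    return interval_min + transfer_min


def meets_rpo(rpo_min, interval_min, transfer_min):
    """True if the worst-case staleness stays within the recovery point objective."""
    return worst_case_loss_minutes(interval_min, transfer_min) <= rpo_min


def tolerable_outage_hours(rpo_hours, catch_up_hours):
    """How long the DR site or link can be down before the RPO is at risk."""
    return max(0, rpo_hours - catch_up_hours)


# RPO of 30 minutes, updates every 10 minutes, assumed 5 minutes per transfer.
print(meets_rpo(rpo_min=30, interval_min=10, transfer_min=5))

# RPO of 24 hours, 1 hour to replicate a day's changes.
print(tolerable_outage_hours(rpo_hours=24, catch_up_hours=1))
```

The second call reproduces the 23-hour figure above; the first shows why a 10-minute schedule leaves plenty of headroom against a 30-minute RPO as long as transfers finish quickly.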

The 7000 series provides two types of automatic asynchronous replication: scheduled and continuous. Scheduled replication sends discrete, snapshot-based updates at predefined hourly, daily, weekly, or monthly intervals. Continuous replication sends the same discrete, snapshot-based updates as frequently as possible.  The result is essentially a continuous stream of filesystem changes to the DR site.

Of course, there are tradeoffs to both of these approaches. In most cases, continuous replication minimizes data loss in the event of a disaster (i.e., it will achieve minimal RPO), since the system is replicating changes as fast as possible to the DR site without actually holding up production clients. However, if the RPO is a given parameter (as it often is), you can just as well choose a scheduled replication interval that will achieve that RPO, in which case using continuous replication doesn't buy you anything.  In fact, it can hurt because continuous replication can result in transferring significantly more data than necessary. For example, if an application fills a 10GB scratch file and rewrites it over the next half hour, continuous replication will send 20GB, while half-hourly scheduled replication will only send 10GB.  If you're paying for bandwidth, or if the capacity of a pair of systems is limited by the available replication bandwidth, these costs can add up.
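
The scratch file example can be sketched in a few lines of Python. This is a toy model under stated assumptions: writes are tracked per 1GB block, each update sends every block dirtied during its interval exactly once, and continuous replication is approximated as a very short (1-minute) interval. None of this reflects how the 7000 series actually implements replication.

```python
GB = 1024 ** 3


def bytes_sent(writes, interval_min, block_size=GB):
    """Total bytes replicated when each interval sends its dirty blocks once.

    writes is an iterable of (minute, block_id) pairs.
    """
    dirty_by_interval = {}
    for minute, block_id in writes:
        window = int(minute // interval_min)
        dirty_by_interval.setdefault(window, set()).add(block_id)
    return sum(len(blocks) for blocks in dirty_by_interval.values()) * block_size


# Fill a 10GB file (one block per minute), then rewrite it later in the same half hour.
fill = [(minute, minute) for minute in range(10)]
rewrite = [(15 + block, block) for block in range(10)]
writes = fill + rewrite

continuous = bytes_sent(writes, interval_min=1)    # approximates continuous replication
half_hourly = bytes_sent(writes, interval_min=30)  # one scheduled update

print(continuous // GB, half_hourly // GB)
```

Under this model the short interval ships both the original and the rewritten blocks, while the half-hourly update only ships the final contents.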


Storage systems provide many options for configuring remote replication for disaster recovery, including synchronous (zero data loss) and both continuous and scheduled asynchronous replication.  It's easy to see these options and guess a reasonable strategy for minimizing data loss in the event of disaster, but it's important that actual recovery point objectives be defined based on business needs and that those objectives drive the planning and deployment of the DR solution. Incidentally, some systems use a hybrid approach that uses synchronous replication when possible, but avoids the availability cost (described above) of that approach by falling back to continuous asynchronous replication if the DR link or DR system fails. This seems at first glance a good compromise, but it's unclear what problem this solves because the resulting system pays the performance and monetary costs of synchronous replication without actually guaranteeing zero data loss.

Thursday Sep 09, 2010

Another detour: short-circuiting cat(1)

What do you think happens when you do this:

        # cat vmcore.4 > /dev/null

If you've used Unix systems before, you might expect this to read vmcore.4 into memory and do nothing with it, since cat(1) reads a file, and "> /dev/null" sends it to the null driver, which accepts data and does nothing. This appears pointless, but can actually be useful to bring a file into memory, for example, or to evict other files from memory (if this file is larger than total cache size).

But here's a result I found surprising:

        # ls -l vmcore.1
        -rw-r--r--   1 root     root     5083361280 Oct 30  2009 vmcore.1

        # time cat vmcore.1 > /dev/null
        real    0m0.007s
        user    0m0.001s
        sys     0m0.007s

That works out to 726GB/s. That's way too fast, even reading from main memory. The obvious question is how does cat(1) know that I'm sending to /dev/null and not bother to read the file at all?

Of course, you can answer this by examining the cat source in the ON gate. There's no special case for /dev/null (though that does exist elsewhere), but rather this behavior is a consequence of an optimization in which cat(1) maps the input file and writes the mapped buffer instead of using read(2) to fill a buffer and write that. With truss(1) it's clear exactly what's going on:

        # truss cat vmcore.1 > /dev/null
        execve("/usr/bin/cat", 0x08046DC4, 0x08046DD0)  argc = 2
        [ ... ]
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
        [ ... ]
        mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E000000) = 0xFE600000
        write(1, ..., 8388608) = 8388608
        mmap64(0xFE600000, 8253440, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E800000) = 0xFE600000
        write(1, ..., 8253440) = 8253440
        llseek(3, 0x000000012EFDF000, SEEK_SET)         = 0x12EFDF000
        munmap(0xFE600000, 8388608)                     = 0
        llseek(3, 0, SEEK_CUR)                          = 0x12EFDF000
        close(3)                                        = 0
        close(1)                                        = 0

cat(1) really is issuing tons of writes from the mapped file, but the /dev/null device just returns immediately without doing anything. The file mapping is never even read. If you actually wanted to read the file (for the side effects mentioned above, for example), you can defeat this with an extra pipe:

        # time cat vmcore.1 | cat > /dev/null
        real    0m32.661s
        user    0m0.865s
        sys     0m32.127s

That's more like it: about 155MB/s streaming from a single disk. In this case the second cat invocation can't use this optimization since stdin is actually a pipe, not the input file.
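
For anyone who wants to play with the map-and-write pattern outside of cat(1), here's a minimal Python sketch of the same idea. It's an illustration, not a port of the cat source: it maps the whole file once instead of remapping 8MB windows as in the truss output, and the function name is made up. Whether the null device skips touching the mapped pages depends on the operating system.

```python
import mmap
import os

CHUNK = 8 * 1024 * 1024  # same size as the writes in the truss output


def cat_via_mmap(path, out_fd, chunk=CHUNK):
    """Copy path to out_fd by writing straight from a read-only mapping.

    Returns the number of bytes written.
    """
    with open(path, "rb") as src:
        size = os.fstat(src.fileno()).st_size
        if size == 0:
            return 0  # mmap cannot map an empty file
        with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            with memoryview(mapping) as view:
                offset = 0
                while offset < size:
                    # Slicing a memoryview does not copy, so the file's pages
                    # are only touched if the receiver actually reads them.
                    with view[offset:offset + chunk] as piece:
                        offset += os.write(out_fd, piece)
                return offset
```

Calling this with a descriptor opened on /dev/null should return almost immediately on systems where the null driver discards writes without reading them, while passing a pipe or a regular file forces the data to actually be read.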

There's another surprising result of the initial example: the file's access time actually gets updated even though it was never read:

        # ls -lu vmcore.2
        -rw-r--r--   1 root     root     6338052096 Nov  3  2009 vmcore.2

        # time cat vmcore.2 > /dev/null
        real    0m0.040s
        user    0m0.001s
        sys     0m0.008s

        # ls -lu vmcore.2
        -rw-r--r--   1 root     root     6338052096 Aug  6 15:55 vmcore.2

This wasn't always the case, but it was fixed back in 1995 under bug 1193522, which is where this comment and code probably came from:

        /*
         * NFS V2 will let root open a file it does not have permission
         * to read. This read() is here to make sure that the access
         * time on the input file will be updated. The VSC tests for
         * cat do this:
         *      cat file > /dev/null
         * In this case the write()/mmap() pair will not read the file
         * and the access time will not be updated.
         */
        if (read(fi_desc, &x, 1) == -1)
                read_error = 1;

I found this all rather surprising because I think of cat(1) as one of the basic primitives that's dead-simple by design.  If you really want something simple to read a file into memory, you might be better off with dd(1M).

Monday May 24, 2010

A ZFS Home Server

This entry will be a departure from my recent focus on the 7000 series to explain how I replaced my recently-dead Sun Ultra 20 with a home-built ZFS workstation/NAS server. I hope others considering building a home server can benefit from this experience.

Of course, if you're considering building a home server, it pays to think through your goals, constraints, and priorities, and then develop a plan.

Goals: I use my desktop as a general purpose home computer, as a development workstation for working from home and on other hobby projects, and as a NAS file server. This last piece is key because I do a lot of photography work in Lightroom and Photoshop. I want my data stored redundantly on ZFS with snapshots and such, but I want to work from my Macbook. I used pieces of this ZFS CIFS workgroup sharing guide to set this up, and it's worked out pretty well. More recently I've also been storing videos on the desktop and streaming them to my Tivo using pytivo and streambaby.

Priorities: The whole point is to store my data, so data integrity is priority #1 (hence ZFS and a mirrored pool). After that, I want something unobtrusive (quiet) and inexpensive to run (low-power). Finally, it should be reasonably fast and cheap, but I'm willing to sacrifice speed and up-front cost for quiet and low-power.

The Pieces

Before sinking money into components I wanted to be sure they'd work well with Solaris and ZFS, so I looked around at what other people have been doing and found Constantin Gonzalez's "reference config." Several people have successfully built ZFS home NAS systems with similar components so I started with that design and beefed it up slightly for my workstation use:

  • CPU: AMD Athlon II x3 405e. 3 cores, 2.3GHz, 45 watts, supports ECC. I had a very tough time finding the "e" line of AMD processors in stock anywhere, but finally got one via Amazon.
  • Motherboard: Asus M4A785T-M CSM. It's uATX, has onboard video, audio, NIC, and 6 SATA ports, and supports ECC and DDR3 (which uses less power than DDR2).
  • RAM: 2x Super Talent 2GB DDR3-1333MHz with ECC. ZFS loves memory, and NAS loves cache.
  • Case: Sonata Elite. Quiet and lots of internal bays.
  • Power supply: Cooler Master Silent Pro 600W modular power supply. Quiet, energy-efficient, and plenty of power. Actually, this unit supplies much more power than I need.

I was able to salvage several components from the Ultra 20, including the graphics card, boot disk, mirrored data disks, and a DVD-RW drive. Total cost of the new components was $576.

The Build

Putting the system together went pretty smoothly. A few comments about the components:

  • The CoolerMaster power supply comes with silicone pieces that wrap around both ends, ostensibly to dampen the vibration transferred to the chassis. I only ran with this on so I can't say how effective it was, but it made the unit more difficult to install.
  • The case has similar silicone grommets on the screws that lock the disks in place.
  • The rails for installing 5.25" external drives are actually sitting on the inside of each bay's cover. I didn't realize this at first and bought (incompatible) rails separately.
  • Neither the case nor the motherboard has an onboard speaker, so you can't hear the POST beeps. This can be an issue if there's a problem with the electrical connections when you're first bringing it up. I had an extra case from another system around, so I wired up its speaker, but I've since purchased a standalone internal speaker. (I did not find one of these at any of several local RadioShacks or computer stores, contrary to common advice.)
  • I never got the case's front audio connector working properly. Whether I connected the AC97 or HDA connectors (both supplied by the case, and both supported by the motherboard), the front audio jack never worked. I don't care because the rear audio is sufficient for me and works well.

The trickiest part of this whole project was migrating the boot disk from the Ultra 20 to this new system without reinstalling. First, identifying the correct boot device in the BIOS was a matter of trial and error. For future reference, it's worth writing the serial numbers for your boot and data drives down somewhere so you can tell what's what if you need to use them in another system. Don't save this in files on the drives themselves, of course.

When I first booted from the correct disk, the system reset immediately after picking the correct GRUB entry. I booted into kmdb and discovered that the system was actually panicking early because it couldn't mount the root filesystem. With no dump device, the system had been resetting before the panic message ever showed up on the screen. With kmdb, I was able to actually see the panic message. Following the suggestion from someone else who had hit this same problem, I burned an OpenSolaris LiveCD, booted it, and imported and exported the root pool. After this, the system successfully booted from the pool. This experience would be made much better with 6513775/6779374.


Since several others have used similar setups for home NAS, my only real performance concern was whether the low-power processor would be enough to handle desktop needs. So far, it seems about the same as my old system. When using the onboard video, things are choppy moving windows around, but Flash videos play fine. The performance is plenty for streaming video to my Tivo and copying photos from my Macbook.

The new system is pretty quiet. While I never measured the Ultra 20's power consumption, the new system runs at just 85W idle and up to 112W with disks and CPU pegged. That's with the extra video card, which sucks about 30W. That's about $7.50/month in power to keep it running. Using the onboard video (and removing the extra card), the system idles at just 56W. Not bad at all.


On Fishworks, Sun, and software engineering

