Recent Posts


SS7000 Software Updates

In this entry I'll explain some of the underlying principles around software upgrades for the 7000 series. Keep in mind that nearly all of this information consists of implementation details of the system and is thus subject to change.

Entire system image

One of the fundamental design principles of SS7000 software updates is that every release updates the entire system, no matter how small the underlying software change is. Releases never update individual components of the system separately. This sounds surprising to people familiar with traditional OS patches, but this model provides a critical guarantee: for a given release, the software components are identical across all systems running that release. That guarantee makes it possible to test every combination of software the system can run, which isn't the case for traditional OS patches. When operating systems allow users to apply separate patches to fix individual issues, different systems running the same release (obviously) may have different patches applied to different components. It's impossible to test every combination of components before releasing each new patch, so engineers rely heavily on understanding the scope of the underlying software change (as it affects the rest of the system at several different versions) to know which combinations of patches may conflict with one another. In a complex system, this is very hard to get right. Worse, it introduces unnecessary risk to the upgrade and patching process, making customers wary of upgrading, which results in more customers running older software. With a single system image, we can (and do) test every combination of component versions a customer can have.

This model has a few other consequences, some of which are more obvious than others:

Updates are complete and self-contained. There's no chance of interaction between older and newer software, and there's no chance of user error in applying illegal combinations of patches.
An update's version number implicitly identifies the versions of all software components on the system. This is very helpful for customers, support, and engineering to know exactly what it means to say a bug was fixed in release X or that a system is running release Y (without having to guess or specify the versions of dozens of smaller components).

Updates are cumulative; any version of the software has all bug fixes from all previous versions. This makes it easy to identify which releases have a given bug fixed. "Previous" here refers to version order, not chronological order. For example, V1 and V3 may be in the field when a V2 is released that contains all the fixes in V1 plus a few others, but not the fixes in V3. More below on when this happens.

Updates are effectively atomic. They either succeed or they fail, but the system is never left in some in-between state, running partly old bits and partly new bits. This doesn't follow immediately from having an entire system image, but that model is what makes it possible. Of course, it's much easier to achieve this in the appliance context (which constrains supported configurations and actions for exactly this kind of purpose) than on a general-purpose operating system.

Types of updates

SS7000 software releases come in a few types, characterized by the scope of the software changes contained in the update. Major updates are the primary release vehicle. These are scheduled releases that deliver new features and the vast majority of bug fixes. Major updates generally include a complete sync with the underlying Solaris OS. Minor updates are much smaller, scheduled releases that include a small number of valuable bug fixes. Bugs fixed in minor updates must have high impact or a high likelihood of being experienced by customers, and must have a relatively low-risk fix. Micro updates are unscheduled releases issued to address significant issues.
These are usually potential data loss issues, pathological service interruptions (e.g., frequent reboots), or significant performance regressions. Enterprise customers are often reluctant to apply patches and upgrades to working systems, since any software change carries some risk of introducing new problems (even if it only contains bug fixes). This breakdown allows customers to make risk-management decisions about upgrading their systems based on the risk associated with each type of update. In particular, the scope of minor and micro releases is highly constrained to minimize risk. For example, four major software updates have been released: 2008.Q4, 2009.Q2, 2009.Q3, and 2010.Q1. The first three of these have had several minor updates, and we've also released a few micro updates.
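Because fixes are cumulative in version order, determining whether a release contains a given fix reduces to comparing version strings. As a toy illustration (made-up version strings, and GNU sort's -V version comparison rather than anything the appliance itself uses):

```shell
# Version order determines fix inclusion: a minor update like 2009.Q2.3
# sorts before 2009.Q3.0 regardless of when it actually shipped.
printf '2009.Q3.0\n2008.Q4.1\n2009.Q2.3\n2009.Q2.1\n' | sort -V
```

The output lists 2008.Q4.1 first and 2009.Q3.0 last; a bug fixed in 2009.Q2.1 is therefore fixed in everything that sorts after it, even though 2009.Q2.3 may have shipped chronologically after 2009.Q3.0.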



Replication for disaster recovery

When designing a disaster recovery solution using remote replication, two important parameters are the recovery time objective (RTO) and the recovery point objective (RPO). For these purposes, a disaster is any event resulting in permanent data loss at the primary site which requires restoring service using data recovered from the disaster recovery (DR) site. The RTO is how soon service must be restored after a disaster. Designing a DR solution to meet a specified RTO is complex but essentially boils down to ensuring that the recovery plan can be executed within the allotted time. Usually this means keeping the plan simple, automating as much of it as possible, documenting it carefully, and testing it frequently. (It helps to use storage systems with built-in features for this.)

In this entry I want to talk about RPO. RPO describes how recent the data recovered after the disaster must be (in other words, how much data loss is acceptable in the event of disaster). An RPO of 30 minutes means that the recovered data must include all changes up to 30 minutes before the disaster. But how do businesses decide how much data loss is acceptable in the event of a disaster? The answer varies greatly from case to case. Photo storage for a small social networking site may have an RPO of a few hours; in the worst case, users who uploaded photos a few hours before the event will have to upload them again, which isn't usually a big deal. A stock exchange, by contrast, may have a requirement for zero data loss: the disaster recovery site absolutely must have the most up-to-date data, because otherwise different systems may disagree about whose account owns a particular million dollars. Of course, most businesses are somewhere in between: it's important that data be "pretty fresh," but some loss is acceptable.

Replication solutions used for disaster recovery typically fall into two buckets: synchronous and asynchronous.
Synchronous replication means that clients making changes on disk at the primary site can't proceed until that data also resides on disk at the disaster recovery site. For example, a database at the primary site won't consider a transaction complete until the underlying storage system indicates that the changed data has been stored on disk. If the storage system is configured for synchronous replication, that means the change is on disk at the DR site, too. If a disaster occurs at the primary site, there's no data loss when the database is recovered from the DR site, because no transactions were committed that weren't also propagated to the DR site. So using synchronous replication it's possible to implement a DR strategy with zero data loss in the event of disaster -- at great cost, discussed below.

By contrast, asynchronous replication means that the storage system can acknowledge changes before they've been replicated to the DR site. In the database example, the database (still using synchronous I/O at the primary site) considers the transaction complete as long as the data is on disk at the primary site. If a disaster occurs at the primary site, any data that hasn't been replicated will be lost. This sounds bad, but if your RPO is 30 minutes and the primary site replicates to the target every 10 minutes, for example, then you can still meet your DR requirements.

Synchronous replication

As I suggested above, synchronous replication comes at great cost, particularly in three areas:

availability: In order to truly guarantee no data loss, a synchronous replication system must not acknowledge writes to clients if it can't also replicate that data to the DR site. If the DR site is unavailable, the system must either explicitly fail or just block writes until the problem is resolved, depending on what the application expects.
Either way, the entire system now fails if any combination of the primary site, the DR site, or the network link between them fails, instead of just the primary site. If you've got 99% reliability in each of these three components, the system's probability of failure goes from 1% to almost 3% (1 - 0.99^3).

performance: In a system using synchronous replication, the latency of client writes includes the time to write to both storage systems plus the network latency between the sites. DR sites are typically located several miles or more from the primary site in case the disaster takes the form of a datacenter-wide power outage or an actual natural disaster. All things being equal, the farther apart the sites are, the higher the network latency between them. To make things concrete, consider a single-threaded client that sees 300us write operations at the primary site (total time including the network round trip plus the time to write the data to local storage only, not the remote site). Add a synchronous replication target a few miles away with just 500us of latency from the primary site, and the operation now takes 800us, dropping IOPS from about 3330 to about 1250.

money: Of course, this is what it ultimately boils down to. Synchronous replication software alone can cost quite a bit, but you can also end up spending a lot to deal with the above availability and performance problems: clustered head nodes at both sites, redundant network hardware, and very low-latency switches and network connections. You might buy some of this for asynchronous replication, too, but network latency is much less important than bandwidth for asynchronous replication, so you can often save on the network side.

Asynchronous replication

The above availability costs scale back as the RPO increases. If the RPO is 24 hours and it only takes 1 hour to replicate a day's changes, then you can sustain a 23-hour outage at the DR site or on the network link without impacting primary site availability at all.
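The arithmetic above is easy to check; a quick sketch (the numbers are from the example, not measurements):

```shell
# probability that at least one of three 99%-reliable components fails
awk 'BEGIN { printf "P(failure) = %.4f\n", 1 - 0.99^3 }'

# single-threaded IOPS before and after adding 500us of replication latency
# (1,000,000 microseconds per second divided by per-operation latency)
awk 'BEGIN { printf "IOPS: %d -> %d\n", 1e6/300, 1e6/800 }'
```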
The only performance cost of asynchronous replication is the added latency resulting from the system's additional load from the replication software (which you'd also have to worry about with synchronous replication). There's no additional latency from the DR network link or the DR site as long as the system is able to keep up with sending updates.

The 7000 series provides two types of automatic asynchronous replication: scheduled and continuous. Scheduled replication sends discrete, snapshot-based updates at predefined hourly, daily, weekly, or monthly intervals. Continuous replication sends the same discrete, snapshot-based updates as frequently as possible. The result is essentially a continuous stream of filesystem changes to the DR site.

Of course, there are tradeoffs to both of these approaches. In most cases, continuous replication minimizes data loss in the event of a disaster (i.e., it will achieve minimal RPO), since the system is replicating changes as fast as possible to the DR site without actually holding up production clients. However, if the RPO is a given parameter (as it often is), you can just as well choose a scheduled replication interval that will achieve that RPO, in which case using continuous replication doesn't buy you anything. In fact, it can hurt, because continuous replication can result in transferring significantly more data than necessary. For example, if an application fills a 10GB scratch file and rewrites it over the next half hour, continuous replication will send 20GB, while half-hourly scheduled replication will only send 10GB. If you're paying for bandwidth, or if the capacity of a pair of systems is limited by the available replication bandwidth, these costs can add up.

Conclusions

Storage systems provide many options for configuring remote replication for disaster recovery, including synchronous (zero data loss) and both continuous and scheduled asynchronous replication.
It's easy to see these options and guess at a reasonable strategy for minimizing data loss in the event of disaster, but it's important that actual recovery point objectives be defined based on business needs, and that those objectives drive the planning and deployment of the DR solution.

Incidentally, some systems use a hybrid approach that uses synchronous replication when possible, but avoids the availability cost (described above) of that approach by falling back to continuous asynchronous replication if the DR link or DR system fails. At first glance this seems like a good compromise, but it's unclear what problem it solves: the resulting system pays the performance and monetary costs of synchronous replication without actually guaranteeing zero data loss.



Another detour: short-circuiting cat(1)

What do you think happens when you do this:

# cat vmcore.4 > /dev/null

If you've used Unix systems before, you might expect this to read vmcore.4 into memory and do nothing with it, since cat(1) reads a file, and "> /dev/null" sends it to the null driver, which accepts data and does nothing. This appears pointless, but can actually be useful to bring a file into memory, for example, or to evict other files from memory (if this file is larger than the total cache size). But here's a result I found surprising:

# ls -l vmcore.1
-rw-r--r-- 1 root root 5083361280 Oct 30  2009 vmcore.1
# time cat vmcore.1 > /dev/null

real    0m0.007s
user    0m0.001s
sys     0m0.007s

That works out to 726GB/s. That's way too fast, even reading from main memory. The obvious question is: how does cat(1) know that I'm sending to /dev/null and not bother to read the file at all? Of course, you can answer this by examining the cat source in the ON gate. There's no special case for /dev/null (though that does exist elsewhere), but rather this behavior is a consequence of an optimization in which cat(1) maps the input file and writes the mapped buffer instead of using read(2) to fill a buffer and write that. With truss(1) it's clear exactly what's going on:

# truss cat vmcore.1 > /dev/null
execve("/usr/bin/cat", 0x08046DC4, 0x08046DD0)  argc = 2
[ ... ]
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 8388608) = 0xFE600000
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x01000000) = 0xFE600000
[ ... ]
mmap64(0xFE600000, 8388608, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E000000) = 0xFE600000
write(1, ..., 8388608) = 8388608
mmap64(0xFE600000, 8253440, PROT_READ, MAP_SHARED|MAP_FIXED, 3, 0x000000012E800000) = 0xFE600000
write(1, ..., 8253440) = 8253440
llseek(3, 0x000000012EFDF000, SEEK_SET) = 0x12EFDF000
munmap(0xFE600000, 8388608) = 0
llseek(3, 0, SEEK_CUR) = 0x12EFDF000
close(3) = 0
close(1) = 0
_exit(0)

cat(1) really is issuing tons of writes from the mapped file, but the /dev/null device just returns immediately without doing anything. The file mapping is never even read. If you actually wanted to read the file (for the side effects mentioned above, for example), you can defeat this optimization with an extra pipe:

# time cat vmcore.1 | cat > /dev/null

real    0m32.661s
user    0m0.865s
sys     0m32.127s

That's more like it: about 155MB/s streaming from a single disk. In this case the second cat invocation can't use the optimization, since its stdin is a pipe, not the input file.

There's another surprising result of the initial example: the file's access time actually gets updated even though the file was never read:

# ls -lu vmcore.2
-rw-r--r-- 1 root root 6338052096 Nov  3  2009 vmcore.2
# time cat vmcore.2 > /dev/null

real    0m0.040s
user    0m0.001s
sys     0m0.008s

# ls -lu vmcore.2
-rw-r--r-- 1 root root 6338052096 Aug  6 15:55 vmcore.2

This wasn't always the case, but it was fixed back in 1995 under bug 1193522, which is where this comment and code probably came from:

363  /*
364   * NFS V2 will let root open a file it does not have permission
365   * to read. This read() is here to make sure that the access
366   * time on the input file will be updated. The VSC tests for
367   * cat do this:
368   *      cat file > /dev/null
369   * In this case the write()/mmap() pair will not read the file
370   * and the access time will not be updated.
371   */
372
373  if (read(fi_desc, &x, 1) == -1)
374      read_error = 1;

I found this all rather surprising because I think of cat(1) as one of the basic primitives that's dead-simple by design. If you really want something simple to read a file into memory, you might be better off with dd(1M).
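Since dd copies with read(2), it can't be short-circuited the way cat(1)'s mmap/write path can. A small self-contained sketch (using a scratch file rather than a real crash dump):

```shell
# create a scratch file, then actually read it; dd uses read(2), so the
# data really is pulled through the filesystem (and into the cache)
dd if=/dev/zero of=/tmp/scratch.bin bs=1024k count=16 2>/dev/null
dd if=/tmp/scratch.bin of=/dev/null bs=1024k
rm -f /tmp/scratch.bin
```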



A ZFS Home Server

This entry will be a departure from my recent focus on the 7000 series to explain how I replaced my recently-dead Sun Ultra 20 with a home-built ZFS workstation/NAS server. I hope others considering building a home server can benefit from this experience. Of course, if you're considering building a home server, it pays to think through your goals, constraints, and priorities, and then develop a plan.

Goals: I use my desktop as a general-purpose home computer, as a development workstation for working from home and on other hobby projects, and as a NAS file server. This last piece is key because I do a lot of photography work in Lightroom and Photoshop. I want my data stored redundantly on ZFS with snapshots and such, but I want to work from my Macbook. I used pieces of this ZFS CIFS workgroup sharing guide to set this up, and it's worked out pretty well. More recently I've also been storing videos on the desktop and streaming them to my Tivo using pytivo and streambaby.

Priorities: The whole point is to store my data, so data integrity is priority #1 (hence ZFS and a mirrored pool). After that, I want something unobtrusive (quiet) and inexpensive to run (low-power). Finally, it should be reasonably fast and cheap, but I'm willing to sacrifice speed and up-front cost for quiet and low power.

The Pieces

Before sinking money into components I wanted to be sure they'd work well with Solaris and ZFS, so I looked around at what other people have been doing and found Constantin Gonzalez's "reference config." Several people have successfully built ZFS home NAS systems with similar components, so I started with that design and beefed it up slightly for my workstation use:

CPU: AMD Athlon II x3 405e. 3 cores, 2.3GHz, 45 watts, supports ECC. I had a very tough time finding the "e" line of AMD processors in stock anywhere, but finally got one from buy.com via Amazon.

Motherboard: Asus M4A785T-M CSM.
It's uATX; has onboard video, audio, NIC, and 6 SATA ports; and supports ECC and DDR3 (which uses less power than DDR2).

RAM: 2x Super Talent 2GB DDR3-1333MHz with ECC. ZFS loves memory, and NAS loves cache.

Case: Sonata Elite. Quiet, with lots of internal bays.

Power supply: Cooler Master Silent Pro 600W modular power supply. Quiet, energy-efficient, and plenty of power. Actually, this unit supplies much more power than I need.

I was able to salvage several components from the Ultra 20, including the graphics card, boot disk, mirrored data disks, and a DVD-RW drive. Total cost of the new components was $576.

The Build

Putting the system together went pretty smoothly. A few comments about the components:

The Cooler Master power supply comes with silicone pieces that wrap around both ends, ostensibly to dampen the vibration transferred to the chassis. I've only run with them on, so I can't say how effective they are, but they made the unit more difficult to install. The case has similar silicone grommets on the screws that lock the disks in place.

The rails for installing 5.25" external drives actually sit on the inside of each bay's cover. I didn't realize this at first and bought (incompatible) rails separately.

Neither the case nor the motherboard has an onboard speaker, so you can't hear the POST beeps. That's a problem if something's wrong with the electrical connections when you're first bringing the system up. I had an extra case speaker from another system around, so I wired that up, but I've since purchased a standalone internal speaker. (Contrary to common advice, I did not find one of these at any of several local RadioShacks or computer stores.)

I never got the case's front audio connector working properly. Whether I connected the AC97 or HDA connector (both supplied by the case, and both supported by the motherboard), the front audio jack never worked. I don't much care, because the rear audio is sufficient for me and works well.
The trickiest part of this whole project was migrating the boot disk from the Ultra 20 to the new system without reinstalling. First, identifying the correct boot device in the BIOS was a matter of trial and error. For future reference, it's worth writing down the serial numbers of your boot and data drives somewhere so you can tell what's what if you need to use them in another system. Don't save this in files on the drives themselves, of course.

When I first booted from the correct disk, the system reset immediately after I picked the correct GRUB entry. I booted into kmdb and discovered that the system was actually panicking early because it couldn't mount the root filesystem. With no dump device, the system had been resetting before the panic message ever showed up on the screen; with kmdb, I was able to actually see the panic message. Following the suggestion of someone else who had hit this same problem, I burned an OpenSolaris LiveCD, booted it, and imported and exported the root pool. After that, the system booted from the pool successfully. This experience would be made much better by 6513775/6779374.

Performance

Since several others have used similar setups for home NAS, my only real performance concern was whether the low-power processor would be enough to handle desktop needs. So far, it seems about the same as my old system. When using the onboard video, things are choppy moving windows around, but Flash videos play fine. The performance is plenty for streaming video to my Tivo and copying photos from my Macbook.

The new system is pretty quiet. While I never measured the Ultra 20's power consumption, the new system runs at just 85W idle and up to 112W with disks and CPU pegged. That's with the extra video card, which draws about 30W. That's about $7.50/month in power to keep it running. Using the onboard video (and removing the extra card), the system idles at just 56W. Not bad at all.
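As a sanity check on that monthly cost, the arithmetic works out (assuming an electricity rate of about $0.12/kWh, which is my assumption, not a figure from the post):

```shell
# 85 W continuous draw over a 30-day month at an assumed $0.12/kWh
awk 'BEGIN { kwh = 85/1000 * 24 * 30; printf "%.1f kWh, $%.2f/month\n", kwh, kwh * 0.12 }'
```

That lands in the ballpark of the $7.50/month figure above; the exact number depends on your local rate.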



Replication in 2010.Q1

This post is long overdue since 2010.Q1 came out over a month ago now, but better late than never. The bullet-point feature list for 2010.Q1 typically includes something like "improved remote replication", but what do we mean by that? The summary is vague because, well, it's hard to summarize what we did concisely. Let's break it down:

Improved stability. We've rewritten the replication management subsystem. Informed by the failings of its predecessor, the new design avoids large classes of problems that were customer pain points in older releases. The new implementation also keeps more of the relevant debugging data, which allows us to drive new issues to root cause faster and more reliably.

Enhanced management model. We've formalized the notion of packages, which were previously just "replicas" or "replicated projects". Older releases mandated that a given project could only be replicated to a given target once (at a time) and that only one copy of a project could exist on a particular target at a time. 2010.Q1 supports multiple actions for a given project and target, each one corresponding to an independent copy on the target called a "package." This allows administrators to replicate a fresh copy without destroying the one that's already on the target.

Share-level replication. 2010.Q1 supports more fine-grained control of replication configuration, like leaving an individual share out of its project's replication configuration or replicating a share by itself without the other shares in its project.

Optional SSL encryption for improved performance. Older releases always encrypt the data sent over the wire. 2010.Q1 still supports this, but also lets customers disable SSL encryption for significantly improved performance when the security of data on the wire isn't so critical (as in many internal environments).

Bandwidth throttling. The system now supports limiting the bandwidth used by individual replication actions.
With this, customers with limited network resources can keep replication from hogging the available bandwidth and starving the client data path.

Improved target-side management. Administrators can browse replicated projects and shares in the BUI and CLI just like local projects and shares. You can also view properties of these shares and even change them where appropriate. For example, the NFS export list can be customized on the target, which is important for disaster-recovery plans where the target will serve different clients in a different datacenter. Or you could enable stronger compression on the target, saving disk space at the expense of performance, which may be less important on a backup site.

Read-only view of replicated filesystems and snapshots. This is pretty self-explanatory. You can now export replicated filesystems read-only over NFS, CIFS, HTTP, FTP, etc., allowing you to verify the data, run incremental NDMP backups, or perform data analysis that's too expensive to run on the primary system. You can also see and clone the non-replication snapshots.

Then there are lots of small improvements, like being able to disable replication globally, per-action, or per-package, which is very handy when trying it out or measuring performance. Check out the documentation (also much improved) for details.



Remote Replication Introduction

When we first announced the SS7000 series, we made available a simulator (a virtual machine image) so people could easily try out the new software. At a keynote session that evening, Bryan and Mike challenged audience members to be the first to set up remote replication between two simulators. They didn't realize how quickly someone would take them up on that. Having worked on this feature, it was very satisfying to see it all come together in a new user's first, easy experience setting up replication. The product has come a long way in the short time since then. This week sees the release of 2010.Q1, the fourth major software update in just over a year. Each update has come packed with major features, from online data migration to on-disk data deduplication. 2010.Q1 includes several significant enhancements (and bug fixes) to the remote replication feature.

And while it was great to see one of our first users find it so easy to replicate an NFS share to another appliance, remote replication remains one of the most complex features of our management software. The problem sounds simple enough: just copy this data to that system. But people use remote replication to solve many different business problems, and supporting each of these requires related features that together add significant complexity to the system. Examples include:

Backup. Disk-to-disk backup is the most obvious use of data replication. Customers need the ability to recover data in the event of data loss on the primary system, whether as a result of system failure or administrative error.

Disaster recovery (DR). This sounds like backup, but it's more than that: customers running business-critical services backed by a storage system need the ability to recover service quickly in the event of an outage of their primary system (be it a result of system failure, a datacenter-wide power outage, or an actual disaster).
Replication can be used to copy data to a secondary system off-site that can be configured to quickly take over service from the primary site in the event of an extended outage there. Of course, you also need a way to switch back to the primary site without copying all the data back.

Data distribution. Businesses spread across the globe often want a central store of documents that clients around the world can access quickly. They use replication to copy documents and other data from a central data center to many remote appliances, providing fast local caches for employees working far from the main office.

Workload distribution. Many customers replicate data to a second appliance to run analysis or batch jobs that are too expensive to run on the primary system without impacting the production workload.

These use cases inform the design requirements for any replication system:

Data replication must be configurable on some sort of schedule. We don't just want one copy of the data on another system; we want an up-to-date copy. For example, data changes every day, and we want nightly backups. Or we have DR agreements that require restoring service using data no more than 10 minutes out of date. Some deployments wanting to maximize the freshness of replicated data may want to replicate continuously (as frequently as possible). Very critical systems may even want to replicate synchronously (so that the primary system does not acknowledge client writes until they're on stable storage at the DR site), though this has significant performance implications.

Data should only be replicated once. This one's obvious, but important. When we update the copy, we don't want to have to send an entire new copy of the data. That wastes potentially expensive network bandwidth and disk space. We only want to send the changes made since the previous update. This is also important when restoring primary service after a disaster-recovery event.
In that case, we only want to copy the changes made while running on the secondary system back to the primary system.

Copies should be point-in-time consistent. Source data may always be changing, but with asynchronous replication, copies will usually be updated at discrete intervals. At the very least, the copy should represent a snapshot of the data at a single point in time. (By contrast, a simple "cp" or "rsync" copy of an actively changing filesystem would result in a copy where each file's state was captured at a slightly different time, potentially introducing inconsistencies in the copy that didn't exist (and could not exist) in the source data.) This is particularly important for databases and other applications with complex persistent state. Traditional filesystem guarantees about state after a crash make it possible to write applications that can recover from any point-in-time snapshot, but it's much harder to write software that can recover from arbitrary inconsistencies introduced by sloppy copying.

Replication performance is important, but so is performance observability and control (e.g., throttling). Backup and DR operations can't be allowed to significantly impact the performance of primary clients of the system, so administrators need to be able to see the impact of replication on system performance, as well as limit the system resources used for replication if that impact becomes too large.

Complete backup solutions should replicate data-related system configuration (like filesystem block size, quotas, or protocol sharing properties), since this needs to be recovered with a full restore, too. But some properties should be changeable in each copy. Backup copies may use higher compression levels, for example, because performance is less important than disk space on the backup system. DR sites may have different per-host NFS sharing restrictions because they're in different data centers than the source system.

Management must be clear and simple.
When you need to use your backup copy, whether to restore the original system or bring up a disaster recovery site, you want the process to be as simple as possible. Delays cost money, and missteps can lead to loss of the only good copy of your data.

That's an overview of the design goals of our remote replication feature today. Some of these elements have been part of the product since the initial release, while others are new to 2010.Q1. The product will evolve as we see how people use the appliance to address other business needs. Expect more details in the coming weeks and months.
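On the throttling goal mentioned above: the kind of rate limiting a replication sender needs is commonly built as a token bucket. Here's a minimal sketch, invented purely for illustration (the class, names, and numbers are not from the appliance):

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter (illustrative only)."""
    def __init__(self, rate_bytes_per_sec, burst_bytes):
        self.rate = rate_bytes_per_sec
        self.capacity = burst_bytes
        self.tokens = burst_bytes
        self.last = time.monotonic()

    def throttle(self, nbytes):
        """Block until nbytes may be sent; returns total seconds slept."""
        slept = 0.0
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return slept
            deficit = (nbytes - self.tokens) / self.rate
            time.sleep(deficit)
            slept += deficit

# Cap a hypothetical replication stream at 4 MB/s with a 256 KB burst.
bucket = TokenBucket(rate_bytes_per_sec=4 << 20, burst_bytes=256 << 10)
start = time.monotonic()
for _ in range(8):
    bucket.throttle(256 << 10)   # send one 256 KB chunk
elapsed = time.monotonic() - start
# 2 MB at 4 MB/s, minus the 256 KB head start: roughly half a second.
```

The same shape of limiter can sit in front of network sends or disk reads, which is what makes the administrator-visible "limit resources used for replication" control possible.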



Threshold alerts

I've previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, like "disk failure" or "backup completion." Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments that use SNMP to monitor systems across a datacenter). A simple alert configuration might be "send mail to admin@mydomain when a disk fails."

While most alerts come predefined on the system, there's one interesting class of parameterized alerts that we call threshold-based alerts, or just thresholds. Thresholds use DTrace-based Analytics to post (trigger) alerts based on anything you can measure with Analytics. It's best summed up with this screenshot of the configuration screen:

The possibilities here are pretty wild, particularly because you can enable or disable Analytics datasets as an alert action. For datasets with some non-negligible cost of instrumentation (like NFS file operations broken down by client), you can automatically enable them only when you need them. For example, you could have one threshold that enables data collection when the number of operations exceeds some limit, and another that turns collection off again if the number falls below a slightly lower limit:

Implementation

One of the nice things about developing a software framework like the appliance kit is that once it's accumulated a critical mass of basic building blocks, you can leverage existing infrastructure to quickly build powerful features. The threshold alerts system sits atop several pieces of the appliance kit, including the Analytics, alerts, and data persistence subsystems. The bulk of the implementation is just a few hundred lines of code that monitor stats and post alerts as needed.
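The two-limit enable/disable pattern described above is classic hysteresis: using a lower "off" limit than the "on" limit keeps a stat hovering near one value from flapping the action on and off. A minimal sketch of the logic (invented for this post; the class and numbers are not the appliance's code):

```python
class HysteresisThreshold:
    """Enable an action above `high`, disable below `low` (low < high)."""
    def __init__(self, high, low):
        assert low < high
        self.high, self.low = high, low
        self.enabled = False

    def update(self, value):
        if not self.enabled and value > self.high:
            self.enabled = True    # e.g., enable an expensive Analytics dataset
        elif self.enabled and value < self.low:
            self.enabled = False   # e.g., disable it again
        return self.enabled

# Samples of ops/sec crossing a 1000-op threshold with an 800-op floor:
t = HysteresisThreshold(high=1000, low=800)
states = [t.update(v) for v in [500, 1200, 900, 1100, 700, 750]]
# → [False, True, True, True, False, False]
```

Note that the sample of 900 (between the two limits) leaves collection enabled rather than toggling it, which is the whole point of using two limits.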



Anatomy of a DTrace USDT provider

I've previously mentioned that the 7000 series HTTP/WebDAV Analytics feature relies on USDT, the mechanism by which developers can define application-specific DTrace probes to provide stable points of observation for debugging or analysis. Many projects already use USDT, including the Firefox Javascript engine, mysql, python, perl, and ruby. But writing a USDT provider is (necessarily) somewhat complicated, and documentation is sparse. While there are some USDT examples on the web, most do not make use of newer USDT features (for example, they don't use translated arguments). Moreover, there are some rough edges around the USDT infrastructure that can make the initial development somewhat frustrating. In this entry I'll explain the structure of a USDT provider with translated arguments in hopes that it's helpful to those starting out on such a project. I'll use the HTTP provider as an example since I know that one best. I also refer to the source of the iscsi provider since its complete source is freely available (as part of iscsitgtd in ON).
Overview

A USDT provider with translated arguments is made up of the following pieces:

- a provider definition file (e.g., http_provider.d), from which an application header and object files will be generated
- a provider header file (e.g., http_provider.h), generated from the provider definition file with dtrace -h and included by the application; it defines macros used by the application to fire probes
- a provider object file (e.g., http_provider.o), generated from the provider definition and application object files with dtrace -G
- a provider application header file (e.g., http_provider_impl.h), which defines the C structures passed into probes by the application
- native (C) code that invokes the probes
- a provider support file (e.g., http.d), delivered into /usr/lib/dtrace, which defines the D structures and translators used in probes at runtime

Putting these together: the build process takes the provider definition and generates the provider header and provider object files. The application includes the provider header file as well as the application header file and uses DTrace macros and C structures to fire probes. The compiled application is linked with the generated provider object file that encodes the probes. DTrace consumers (e.g., dtrace(1M)) read in the provider support file and instrument any processes containing the specified probes. It's okay if you didn't follow all that. Let's examine these pieces in more detail.

1. Provider definition (and generated files)

The provider definition describes the set of probes made available by that provider. For each probe, the definition describes the arguments passed to DTrace by the probe implementation (within the application) as well as the arguments passed by DTrace to probe consumers (D scripts).
For example, here's the heart of the http provider definition, http_provider.d:

```
provider http {
	probe request__start(httpproto_t *p) :
	    (conninfo_t *p, http_reqinfo_t *p);
	probe request__done(httpproto_t *p) :
	    (conninfo_t *p, http_reqinfo_t *p);
	/* ... */
};
```

The http provider defines two probes: request-start and request-done. (DTrace converts double underscores to hyphens.) Each of these consumes an httpproto_t from the application (in this case, an Apache module) and provides a conninfo_t and an http_reqinfo_t to DTrace scripts using these probes. Don't worry about the details of these structures just yet.

From the provider definition we build the related http_provider.h and http_provider.o files. The header file is generated by dtrace -h and contains macros used by the application to fire probes. Here's a piece of the generated file (edited to show only the x86 version for clarity):

```
#define	HTTP_REQUEST_DONE(arg0) \
	__dtrace_http___request__done(arg0)
#define	HTTP_REQUEST_DONE_ENABLED() \
	__dtraceenabled_http___request__done()
```

So for each probe, we have a macro of the form PROVIDER_PROBENAME(probeargs) that the application uses to fire a probe with the specified arguments. Note that the argument to HTTP_REQUEST_DONE should be an httpproto_t *, since that's what the request-done probe consumes from the application. The provider object file generated by dtrace -G is also necessary to make all this work, but the mechanics are well beyond the scope of this entry.

2. Application components

The application header file http_provider_impl.h defines the httpproto_t structure for the application, which passes a pointer to this object into the probe macro to fire the probe. Here's an example:

```
typedef struct {
	const char *http_laddr;		/* local IP address (as string) */
	const char *http_uri;		/* URI requested */
	const char *http_useragent;	/* user's browser (User-agent header) */
	uint64_t http_byteswritten;	/* bytes RECEIVED from client */
	/* ... */
} httpproto_t;
```

The application uses the macros from http_provider.h and the structure defined in http_provider_impl.h to fire a DTrace probe. We also use the is-enabled macros to avoid constructing the arguments when they're not needed. For example:

```
static void
mod_dtrace_postrequest(request_rec *rr)
{
	httpproto_t hh;

	/* ... */
	if (!HTTP_REQUEST_DONE_ENABLED())
		return;
	/* fill in hh based on request rr ... */
	HTTP_REQUEST_DONE(&hh);
}
```

3. DTrace consumer components

What we haven't specified yet is exactly what defines a conninfo_t or http_reqinfo_t, or how to translate an httpproto_t object into a conninfo_t or http_reqinfo_t. These structures and translators must be defined when a consuming D script is compiled (i.e., when a user runs dtrace(1M) and wants to use our probes). These definitions go into what I've called the provider support file, which includes definitions like these:

```
typedef struct {
	uint32_t http_laddr;
	uint32_t http_uri;
	uint32_t http_useragent;
	uint64_t http_byteswritten;
	/* ... */
} httpproto_t;

typedef struct {
	string hri_uri;			/* URI requested */
	string hri_useragent;		/* "User-agent" header (browser) */
	uint64_t hri_byteswritten;	/* bytes RECEIVED from the client */
	/* ... */
} http_reqinfo_t;

#pragma D binding "1.6.1" translator
translator conninfo_t <httpproto_t *dp> {
	ci_local = copyinstr((uintptr_t)
	    *(uint32_t *)copyin((uintptr_t)&dp->http_laddr,
	    sizeof (uint32_t)));
	/* ... */
};

#pragma D binding "1.6.1" translator
translator http_reqinfo_t <httpproto_t *dp> {
	hri_uri = copyinstr((uintptr_t)
	    *(uint32_t *)copyin((uintptr_t)&dp->http_uri,
	    sizeof (uint32_t)));
	/* ... */
};
```

There are a few things to note here. The httpproto_t structure must exactly match the one being used by the application; there's no way to enforce this with just one definition because neither file can rely on the other being available. Also, the above example only works for 32-bit applications.
For a similar example that uses the ILP of the process to do the right thing for both 32-bit and 64-bit apps, see /usr/lib/dtrace/iscsi.d.

We didn't define conninfo_t. That's because it's defined in /usr/lib/dtrace/net.d. Instead of redefining it, we have our file depend on net.d with this line at the top:

```
#pragma D depends_on library net.d
```

This provider support file gets delivered into /usr/lib/dtrace. dtrace(1M) automatically imports all .d files in this directory (or another directory specified with -xlibdir) on startup. When a DTrace consumer goes to use our probes, the conninfo_t, httpproto_t, and http_reqinfo_t structures are defined, as are the needed translators. More concretely, when a user writes:

```
# dtrace -n 'http*:::request-start { printf("%s\n", args[1]->hri_uri); }'
```

DTrace knows exactly what to do.

Rough edges

Remember http_provider.d? It contained the actual provider and probe definitions. It referred to the httpproto_t, conninfo_t, and http_reqinfo_t structures, which we didn't actually define in that file. We've already explained that these structures and translators are defined by the provider support file and used at runtime, so they shouldn't actually be necessary here. Unfortunately, there's a piece missing that's necessary to work around buggy behavior in DTrace: the D compiler insists on having these definitions and translators available when processing the provider definition file, even though they won't be used (since this file is not even available at runtime anyway). Even worse, dtrace(1M) doesn't automatically import the files in /usr/lib/dtrace when compiling the provider file, so we can't simply depends_on them.
The end result is that we must define "dummy" structures and translators in the provider definition file, like this:

```
typedef struct http_reqinfo {
	int dummy;
} http_reqinfo_t;

typedef struct httpproto {
	int dummy;
} httpproto_t;

typedef struct conninfo {
	int dummy;
} conninfo_t;

translator conninfo_t <httpproto_t *dp> {
	dummy = 0;
};

translator http_reqinfo_t <httpproto_t *dp> {
	dummy = 0;
};
```

We also need stability attributes like the following to use the probes:

```
#pragma D attributes Evolving/Evolving/ISA provider http provider
#pragma D attributes Private/Private/Unknown provider http module
#pragma D attributes Private/Private/Unknown provider http function
#pragma D attributes Private/Private/ISA provider http name
#pragma D attributes Evolving/Evolving/ISA provider http args
```

You can see both of these in the iscsi provider as well.

Tips

I've now covered all the pieces, but there are other considerations in implementing a provider. For example, what arguments should the probes consume from the application, and what should be provided to D scripts? We chose structures on both sides because it's much less unwieldy (especially as the provider evolves), but that necessitates the ugly translators and multiple structure definitions. If I'd used pointer and integer arguments, we'd need no structures, and therefore no translators, and thus we could leave out several of the files described above. But it would be a bit unwieldy to use, and consumers would need to use copyin/copyinstr themselves.

Both the HTTP and iSCSI providers instrument network-based protocols. For consistency, providers for these protocols (which also include NFS, CIFS, and FTP on the 7000 series) use the same conventions for probe argument types and names (e.g., conninfo_t as the first argument, followed by protocol-specific arguments).

Conclusion

USDT with translated arguments is extremely powerful, but the documentation is somewhat lacking and there are still some rough edges for implementers.
I hope this example is valuable for people trying to put the pieces together. If people want to get involved in documenting this, contact the DTrace community at opensolaris.org.



2009.Q2 Released

Today we've released the first major software update for the 7000 series, called 2009.Q2. This update contains a boatload of bugfixes and new features, including support for HTTPS (HTTP with SSL using self-signed certificates). This makes HTTP user support more tenable in less secure environments because credentials don't have to be transmitted in the clear.

Another updated feature that's important for RAS is enhanced support bundles. Support bundles are tarballs containing core files, log files, and other debugging output generated by the appliance that can be sent directly to support personnel. In this release, support bundles collect more useful data about failed services and (critically) can be created even when the appliance's core services have failed or network connectivity has been lost. You can also monitor support bundle progress on the Maintenance -> System screen, and bundles can be downloaded directly from the browser for environments where the appliance cannot connect directly to Sun Support. All of these improvements help us to track down problems remotely, relying as little as possible on the functioning system or the administrator's savvy.

See the Fishworks blog for more details on this release. Enjoy!



Compression followup

My previous post discussed compression in the 7000 series. I presented some Analytics data showing the effects of compression on a simple workload, but I observed something unexpected: the system never used more than 50% CPU doing the workloads, even when the workload was CPU-bound. This caused the CPU-intensive runs to take a fair bit longer than expected.

This happened because ZFS uses at most 8 threads for processing writes through the ZIO pipeline. With a 16-core system, only half the cores could ever be used for compression - hence the 50% CPU usage we observed. When I asked the ZFS team about this, they suggested that nthreads = 3/4 the number of cores might be a more reasonable value, leaving some headroom available for miscellaneous processing. So I reran my experiment with 12 ZIO threads. Here are the results of the same workload (the details of which are described in my previous post):

Summary: text data set

  Compression  Ratio   Total  Write  Read
  off          1.00x   3:29   2:06   1:23
  lzjb         1.47x   3:36   2:13   1:23
  gzip-2       2.35x   5:16   3:54   1:22
  gzip         2.52x   8:39   7:17   1:22
  gzip-9       2.52x   9:13   7:49   1:24

Summary: media data set

  Compression  Ratio   Total  Write  Read
  off          1.00x   3:39   2:17   1:22
  lzjb         1.00x   3:38   2:16   1:22
  gzip-2       1.01x   5:46   4:24   1:22
  gzip         1.01x   5:57   4:34   1:23
  gzip-9       1.01x   6:06   4:43   1:23

We see that read times are unaffected by the change (not surprisingly), but write times for the CPU-intensive workloads (gzip) are improved by over 20%:

From the Analytics, we can see that CPU utilization is now up to 75% (exactly what we'd expect):

Note that in order to run this experiment, I had to modify the system in a very unsupported (and unsupportable) way. Thus, the above results do not represent current performance of the 7410; they only suggest what's possible with future software updates.
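The arithmetic behind those 50% and 75% ceilings is simple enough to write down. A quick sketch (the thread and core counts come from the discussion above; the function name is invented):

```python
def max_compression_cpu(zio_threads, ncores):
    """Upper bound on whole-system CPU utilization when compression work
    is confined to a fixed number of ZIO pipeline threads."""
    return min(zio_threads, ncores) / ncores

# 16-core 7410 with the default 8 ZIO threads: writes cap out at 50% CPU.
default_cap = max_compression_cpu(8, 16)    # → 0.5
# With nthreads = 3/4 of the cores (12 threads): a 75% ceiling.
tuned_cap = max_compression_cpu(12, 16)     # → 0.75
```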
For these kinds of ZFS tunables (as well as those in other components of Solaris, like the networking stack), we'll continue to work with the Solaris teams to find optimal values, exposing configurables to the administrator through our web interface when necessary. Expect future software updates for the 7000 series to include tunable changes to improve performance.

Finally, it's also important to realize that if you run into this limit, you've got 8 cores (or 12, in this case) running compression full-tilt, and your workload is CPU-bound. Frankly, you're using more CPU for compression than many enterprise storage servers even have today, and it may very well be the right tradeoff if your environment values disk space over absolute performance.

Update Mar 27, 2009: Updated charts to start at zero.



Compression on the Sun Storage 7000

Built-in filesystem compression has been part of ZFS since day one, but is only now gaining some enterprise storage spotlight. Compression reduces the disk space needed to store data, not only increasing effective capacity but often improving performance as well (since fewer bytes means less I/O). Beyond that, having compression built into the filesystem (as opposed to using an external appliance between your storage and your clients to do compression, for example) simplifies the management of an already complicated storage architecture.

Compression in ZFS

Your mail client might use WinZIP to compress attachments before sending them, or you might unzip tarballs in order to open the documents inside. In these cases, you (or your program) must explicitly invoke a separate program to compress and uncompress the data before actually using it. This works fine in these limited cases, but it isn't a very general solution. You couldn't easily store your entire operating system compressed on disk, for example.

With ZFS, compression is built directly into the I/O pipeline. When compression is enabled on a dataset (filesystem or LUN), data is compressed just before being sent to the spindles and decompressed as it's read back. Since this happens in the kernel, it's completely transparent to userland applications, which need not be modified at all. Beyond the initial configuration (which, as we'll see in a moment, is rather trivial), users need not do anything to take advantage of the space savings offered by compression.

A simple example

Let's take a look at how this works on the 7000 series. Like all software features, compression comes free. Enabling compression for user data is simple because it's just a share property. After creating a new share, double-click it to modify its properties, select a compression level from the drop-down box, and apply your changes: After that, all new data written to the share will be compressed with the specified algorithm.
Turning compression off is just as easy: just select 'Off' from the same drop-down. In both cases, extant data will remain as-is - the system won't go rewrite everything that already existed on the share. Note that when compression is enabled, all data written to the share is compressed, no matter where it comes from: NFS, CIFS, HTTP, and FTP clients all reap the benefits. In fact, we use compression under the hood for some of the system data (Analytics data, for example), since the performance impact is negligible (as we will see below) and the space savings can be significant. You can observe the compression ratio for a share in the sidebar on the share properties screen. This is the ratio of uncompressed data size to actual (compressed) disk space used, and it tells you exactly how much space you're saving.

The cost of compression

People are often concerned about the CPU overhead associated with compression, but the actual cost is difficult to calculate. On the one hand, compression does trade CPU utilization for disk space savings. And up to a point, if you're willing to trade more CPU time, you can get more space savings. But by reducing the space used, you end up doing less disk I/O, which can improve overall performance if your workload is bandwidth-limited. And even when reduced I/O doesn't improve overall performance (because bandwidth isn't the bottleneck), it's important to keep in mind that the 7410 has a great deal of CPU horsepower (up to 4 quad-core 2GHz Opterons), making the "luxury" of compression very affordable.

The only way to really know the impact of compression on your disk utilization and system performance is to run your workload with different levels of compression and observe the results. Analytics is the perfect vehicle for this: we can observe CPU utilization and I/O bytes per second over time on shares configured with different compression algorithms.
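The "transparent to applications" idea, and the compression ratio reported in the share sidebar, can both be illustrated in miniature. This toy sketch is invented for this post and uses Python's zlib rather than ZFS's lzjb/gzip: it compresses on write and decompresses on read, so callers never see compressed bytes, much as ZFS does below the application's view:

```python
import zlib

class CompressedStore:
    """Toy key-value store that compresses transparently, the way a
    filesystem might compress blocks below the application's view."""
    def __init__(self, level=6):
        self.level = level
        self._blocks = {}

    def write(self, name, data):
        # Compress just before "hitting the spindles."
        self._blocks[name] = zlib.compress(data, self.level)

    def read(self, name):
        # Decompress on the way back; the caller sees the original bytes.
        return zlib.decompress(self._blocks[name])

    def compressratio(self, name):
        """Uncompressed size / space actually used, like a share's ratio."""
        return len(self.read(name)) / len(self._blocks[name])

store = CompressedStore(level=6)
text = b"the quick brown fox jumps over the lazy dog\n" * 1000
store.write("words", text)
assert store.read("words") == text       # callers see the original bytes
ratio = store.compressratio("words")     # repetitive data: a large ratio
```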
Analytics results

I ran some experiments to show the impact of compression on performance. Before we get to the good stuff, here's the nitty-gritty about the experiment and results:

- These results do not demonstrate maximum performance. I intended to show the effects of compression, not the maximum throughput of our box. Brendan's already got that covered.
- The server is a quad-core 7410 with 1 JBOD (configured with mirrored storage) and 16GB of RAM. No SSD. The client machine is a quad-core 7410 with 128GB of DRAM.
- The basic workload consists of 10 clients, each writing 3GB to its own share and then reading it back, for a total of 30GB in each direction. This fits entirely in the client's DRAM, but it's about twice the size of the server's total memory. While each client has its own share, they all use the same compression level for each run, so only one level is tested at a time.
- The experiment is run for each of the compression levels supported on the 7000 series: lzjb, gzip-2, gzip (which is gzip-6), gzip-9, and none.
- The experiment uses two data sets: 'text' (copies of /usr/dict/words, which is fairly compressible) and 'media' (copies of the Fishworks code swarm video, which is not very compressible).
- I saw similar results with between 3 and 30 clients (with the same total write/read throughput, so they were each handling more data). I saw similar results whether each client had its own share or not.

Now, below is an overview of the text (compressible) data set experiments in terms of NFS ops and network throughput. This gives a good idea of what the test does. For all graphs below, five experiments are shown, each with a different compression level in increasing order of CPU usage and space savings: off, lzjb, gzip-2, gzip, gzip-9.
Within each experiment, the first half is writes and the second half reads: Not surprisingly, from the NFS and network levels, the experiments basically look the same, except that the writes are spread out over a longer period at higher compression levels. The read times are pretty much unchanged across all compression levels. The total NFS and network traffic should be the same for all runs.

Now let's look at CPU utilization over these experiments: Notice that CPU usage increases with higher compression levels, but caps out at about 50%. I need to do some digging to understand why this happens on my workload, but it may have to do with the number of threads available for compression. Anyway, since it only uses 50% of the CPU, the more expensive compression runs end up taking longer.

Let's shift our focus now to disk I/O. Keep in mind that the disk throughput rate is twice that of the data we're actually reading and writing because the storage is mirrored: We expect to see an actual decrease in disk bytes written and read as the compression level increases, because we're writing and reading more compressed data. I collected similar data for the media (uncompressible) data set.
The three important differences were that, with higher compression levels, each workload took less time than the corresponding text one; the CPU utilization during reads was less than in the text workload; and the total disk I/O didn't decrease nearly as much with the compression level as it did in the text workloads (which is to be expected).

The results can be summarized by looking at the total execution time for each workload at various levels of compression:

Summary: text data set

  Compression  Ratio   Total  Write  Read
  off          1.00x   3:30   2:08   1:22
  lzjb         1.47x   3:26   2:04   1:22
  gzip-2       2.35x   6:12   4:50   1:22
  gzip         2.52x   11:18  9:56   1:22
  gzip-9       2.52x   12:16  10:54  1:22

Summary: media data set

  Compression  Ratio   Total  Write  Read
  off          1.00x   3:29   2:07   1:22
  lzjb         1.00x   3:31   2:09   1:22
  gzip-2       1.01x   6:59   5:37   1:22
  gzip         1.01x   7:18   5:57   1:21
  gzip-9       1.01x   7:37   6:15   1:22

What conclusions can we draw from these results? At a high level, what we already knew: compression performance and space savings vary greatly with the compression level and the type of data. But more specifically, with my workloads:

- Read performance is generally unaffected by compression.
- lzjb can afford decent space savings, but performs well whether or not it's able to generate much savings.
- Even modest gzip imposes a noticeable performance hit, whether or not it reduces I/O load. gzip-9 in particular can spend a lot of extra time for marginal gain.
- Moreover, the 7410 has plenty of CPU headroom to spare, even with high compression.

Summing it all up

We've seen that compression is free, built-in, and very easy to enable on the 7000 series. The performance effects vary based on the workload and compression algorithm, but powerful CPUs allow compression to be used even on top of serious loads. Moreover, the appliance provides great visibility into overall system performance and the effectiveness of compression, allowing administrators to see whether compression is helping or hurting their workload.
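The level-versus-data-type effect in the tables above is easy to reproduce in miniature. This sketch uses Python's zlib (the DEFLATE algorithm underlying gzip) on stand-ins for the two data sets; the appliance runs lzjb and gzip in-kernel, so this is only an illustration of the shape of the tradeoff, not a reproduction of it:

```python
import os
import zlib

def ratio(data, level):
    """Uncompressed size / compressed size, like a share's compression ratio."""
    return len(data) / len(zlib.compress(data, level))

# Stand-ins for the two data sets: repetitive text vs. incompressible media.
text = b"aardvark abacus abandon abbey abdomen aberrant\n" * 20000
media = os.urandom(1 << 20)   # random bytes, like already-compressed video

for level in (1, 6, 9):       # lowest, default, and highest DEFLATE effort
    text_ratio = ratio(text, level)
    media_ratio = ratio(media, level)
# Text compresses many-fold at every level; the random "media" stays at
# about 1.0x no matter how much CPU the higher levels burn on it.
```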



Fault management

The Fishworks storage appliance stands on the shoulders of giants. Many of the most exciting features -- Analytics, the hybrid storage pool, and integrated fault management, for example -- are built upon existing technologies in OpenSolaris (DTrace, ZFS, and FMA, respectively). The first two of these have been covered extensively elsewhere, but I'd like to discuss our integrated fault management, a central piece of our RAS (reliability/availability/serviceability) architecture.

Let's start with a concrete example: suppose hard disk #4 is acting up in your new 7000 series server. Rather than returning user data, it's giving back garbage. ZFS checksums the garbage, immediately detects the corruption, reconstructs the correct data from the redundant disks, and writes the correct block back to disk #4. This is great, but if the disk is really going south, such failures will keep happening, and the system will generate a fault and take the disk out of service.

Faults represent active problems on the system, usually hardware failures. OpenSolaris users are familiar with observing and managing faults through fmadm(1M). The appliance integrates fault management in several ways: as alerts, which allow administrators to configure automated responses to these events of interest; in the active problems screen, which provides a summary view of the current faults on the system; and through the maintenance view, which correlates faults with the actual failed hardware components. Let's look at each of these in turn.

Alerts

Faults are part of a broader group of events we call alerts. Alerts represent events of interest to appliance administrators, ranging from hardware failures to backup job notifications. When one of these events occurs, the system posts an alert, taking whatever action has been configured for it.
Most commonly, administrators configure the appliance to send mail or trigger an SNMP trap in response to certain alerts:

Managing faults

In our example, you'd probably discover the failed hard disk because you previously configured the appliance to send mail on hardware failure (or hot spare activation, or resilvering completion...). Once you get the mail, you'd log into the appliance web UI (BUI) and navigate to the active problems screen:

The above screen presents all current faults on the system, summarizing each failure, its impact on the system, and suggested actions for the administrator. You might next click the "more info" button (next to the "out of service" text), which would bring you to the maintenance screen for the faulted chassis, highlighting the broken disk both in the diagram and in the component list:

This screen connects the fault with the actual physical component that's failed. From here you could also activate the locator LED (which is no simple task behind the scenes) and have a service technician go replace the blinking disk. Of course, once they do, you'll get another mail saying that ZFS has finished resilvering the new disk.

Beyond disks

Disks are interesting examples because they are the heart of the storage server. Moreover, disks are often some of the first components to fail (in part because there are so many of them). But FMA allows us to diagnose many other kinds of components. For example, here are the same screens on a machine with a broken CPU cache:

Under the hood

This complete story -- from hardware failure to replaced disk, for example -- is built on foundational technologies in OpenSolaris like FMA. Schrock has described much of the additional work that makes this simple but powerful user experience possible for the appliance. Best of all, little of the code is specific to our NAS appliances - we could conceivably leverage the same infrastructure to manage faults on other kinds of systems.
If you want to see more, download our VMware simulator and try it out for yourself.



HTTP/WebDAV Analytics

Mike calls Analytics the killer app of the 7000 series NAS appliances. Indeed, this feature enables administrators to quickly understand what's happening on their systems in unprecedented depth. Most of the interesting Analytics data comes from DTrace providers built into Solaris. For example, the iSCSI data are gathered by the existing iSCSI provider, which allows users to drill down on iSCSI operations by client. We've got analogous providers for NFS and CIFS, too, which incorporate the richer information we have for those file-level protocols (including file name, user name, etc.).

We created a corresponding provider for HTTP in the form of a pluggable Apache module called mod_dtrace. mod_dtrace hooks into the beginning and end of each request and gathers typical log information, including local and remote IP addresses, the HTTP request method, URI, user, user agent, bytes read and written, and the HTTP response code. Since we have two probes, we also have latency information for each request. We could, of course, collect other data, as long as it's readily available when we fire the probes. The upshot of all this is that you can observe HTTP traffic in our Analytics screen and drill down in all the ways you might hope (click image for larger size):

Caveat user

One thing to keep in mind when analyzing HTTP data is that we're tracking individual requests, not lower-level I/O operations. With NFS, for example, each operation might be a read of some part of a file. If you read a whole file, you'll see a bunch of operations, each one reading a chunk of the file. With HTTP, there's just one request, so you'll only see a data point when that request starts or finishes, no matter how big the file is. If one client is downloading a 2GB file, you won't see it until they're done (and the latency might be very high, but that's not necessarily indicative of poor performance). This is a result of the way the protocol works (or, more precisely, the way it's used).
While NFS is defined in terms of small filesystem operations, HTTP is defined in terms of requests, which may be arbitrarily large (depending on the limits of the hardware). One could imagine a world in which an HTTP client that's implementing a filesystem (like the Windows mini-redirector) makes smaller requests using HTTP Range headers. This would look more like the NFS case: there would be requests for ranges of files corresponding to the sections of files that were being read. (This could have serious consequences for performance, of course.) But as things are now, users must understand the nature of protocol-level instrumentation when drawing conclusions based on HTTP Analytics graphs.

Implementation

For the morbidly curious, mod_dtrace is actually a fairly straightforward USDT provider, consisting of the following components:

http.d defines http_reqinfo_t, the stable structure used as an argument to probes (in D scripts). This file also defines translators to map between httpproto_t, the structure passed to the DTrace probe macro (by the actual code that fires probes in mod_dtrace.c), and the pseudo-standard conninfo_t and the aforementioned http_reqinfo_t. This file is analogous to any of the files shipped in /usr/lib/dtrace on a stock OpenSolaris system.

http_provider_impl.h defines httpproto_t, the structure that mod_dtrace passes into the probes. This structure contains enough information for the aforementioned translators to fill in both the conninfo_t and the http_reqinfo_t.

http_provider.d defines the provider's probes:

	provider http {
		probe request__start(httpproto_t *p) :
		    (conninfo_t *p, http_reqinfo_t *p);
		probe request__done(httpproto_t *p) :
		    (conninfo_t *p, http_reqinfo_t *p);
	};

mod_dtrace.c implements the provider itself. We hook into Apache's existing post_read_request and log_transaction hooks to fire the probes (if they are enabled). The only tricky bit here is counting bytes, since Apache doesn't normally keep that information around.
We use an input filter to count bytes read, and we override mod_logio's optional function to count bytes written. This is basically the same approach that mod_logio itself uses, though it's admittedly pretty nasty.

We hope this will shed some light on performance problems in actual customer environments. If you're interested in using HTTP/WebDAV on the NAS appliance, check out my recent post on our support for system users.
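Given the probe definitions above, watching HTTP latency from the shell is a one-script affair. Note that DTrace presents the double underscores in request__start/request__done as hyphens, so the probes appear as request-start and request-done. A minimal sketch (the aggregation key is my own choice, not anything Analytics-specific):

```d
#!/usr/sbin/dtrace -qs

/*
 * Sketch: distribution of HTTP request latency using the http provider
 * described above. Each request's start time is stashed in a thread-local
 * variable and subtracted when the request completes.
 */
http*:::request-start
{
	self->ts = timestamp;
}

http*:::request-done
/self->ts/
{
	@["request latency (ns)"] = quantize(timestamp - self->ts);
	self->ts = 0;
}
```

From there, the translated arguments (the conninfo_t in args[0] and the http_reqinfo_t in args[1]) are what would let a script drill down by client address, URI, and so on, in the same spirit as the Analytics screen.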



User support for HTTP

In building the 7000 series of NAS appliances, we strove to create a solid storage product that's revolutionary both in features and price/performance. This process frequently entailed rethinking old problems and finding new solutions that challenge the limitations of previous ones. Bill has a great example in making CIFS (SMB) a first-class data protocol on the appliance, from our management interface down to the kernel itself. I'll discuss here the ways in which we've enhanced support for HTTP/WebDAV sharing, particularly as it coexists with other protocols like NFS, CIFS, and FTP.

WebDAV is a set of extensions to HTTP that allows clients (like web browsers) to treat websites like filesystems. Windows, Mac OS, and GNOME all have built-in WebDAV clients that allow users to "mount" WebDAV shares on their desktop and treat them like virtual disks. Since the client setup is about as simple as it could be (just enter a URL), this makes a great file sharing solution where performance is not critical and ease of setup is important. Users can simply click their desktop environment's "connect to server" button and start editing files on the local company file server, where the data is backed up, automatically shared with their laptop, and even exported over NFS or CIFS as well.

User support

Many existing HTTP/WebDAV implementations consist of Apache and mod_dav. While this provides a simple, working implementation, its generality creates headaches when users want to share files over both HTTP and other protocols (like NFS). For one, HTTP allows the server to interpret users' credentials (which are optional to begin with) however it chooses. This provides for enormous flexibility in building complex systems, but it means that you've got to do some work to make web users correspond to something meaningful in your environment (i.e., users in your company's name service).
The other limitation of a basic Apache-based setup is that Apache itself has no support for assuming the identity of logged-in users. Typically, the web server runs as 'webservd' or 'httpd' or some other predefined system user. So in order for files to be accessible via the web, they must be accessible by this arbitrary user, which usually means making them world-accessible. Moreover, when files are created via the WebDAV interface, they end up being owned by the web server, rather than the actual user that created them.

By contrast, we've included strong support for system users in our HTTP stack. We use basic HTTP authentication to check a user's credentials against the system's name services, and then we process the request under their identity. The result is that proper filesystem permissions are enforced over HTTP, and newly created files are correctly owned by the user that created them, rather than the web server's user. (This is not totally unlike a union of mod_auth_pam and mod_become, except that those are not very well supported.)

The user experience goes something like this: on my Mac laptop, I use the Finder's "Connect to Server" option and enter the appliance's WebDAV URL. I'm prompted for my username and password, which are checked against the local NIS directory server. Once in, my requests are handled by an httpd process which uses seteuid(2) to assume my identity. That means I can see exactly the same set of files I could see if I were using NFS, FTP, CIFS with identity mapping, etc. If I'm accessing someone else's 644 file, then I can read but not write it. If I'm accessing my group's 775 directory, then I can create files in it. It's just as though I were using the local filesystem.

The mod_dav FAQ vaguely describes how one could do this, but implies that making it work requires introducing a huge security hole. Using Solaris's fine-grained privileges, we give Apache worker processes just the proc_setid privilege (see privileges(5)).
We don't need httpd to run as root -- we just need it to change among a set of unprivileged users. Any service expected to serve data on behalf of users must be granted this privilege -- the NFS, CIFS, and FTP servers all do this (admittedly by running as root). Of course, such a system is only as safe as its authentication and authorization mechanisms, and we've done our best to ensure that this code is safe and to mitigate the possible damage from potential exploits. It's built on top of libpam and (of course) Solaris, so we know the foundation is solid.

Implementation notes

mod_user is our custom module which authenticates users and causes httpd to assume the identity of said users. It primarily consists of hooks into Apache's request processing pipeline to authenticate users and change uids. We authenticate using pam(3PAM), which uses whatever name services have been set up for the appliance (NIS, LDAP, or locally created users).

Though mod_user itself is fairly simple, it's also somewhat delicate from a security perspective. For example, since seteuid(2) changes the effective uid of the entire process, we must be sure that we're never handling multiple requests concurrently. This is made pretty easy with the one-request-per-process threading module that is Apache's default, but there's still a bit of complexity around subrequests, where we may be running as some user other than Apache's usual 'webservd' (since we're processing a request for a particular user), but we need to authenticate the user as part of processing the subrequest. For local users, authentication requires reading our equivalent of /etc/shadow, which of course we can't allow to be world-readable. But these kinds of issues are easily solved.

Other WebDAV enhancements

Enhancing the user model is one of a few updates we've made for HTTP sharing on the NAS appliance. I'll soon discuss mod_dtrace, which facilitates HTTP Analytics by implementing a USDT provider for HTTP.
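As an aside on the basic-authentication step: before any module in this style can hand credentials to PAM, it has to pull them out of the Authorization header, which is just base64("user:password"). A minimal, self-contained sketch of that decoding step follows -- the helper names are mine for illustration, not mod_user's (real Apache modules would lean on Apache's own utility routines):

```c
#include <string.h>

/*
 * Decode a base64 string into buf; returns the number of bytes written,
 * or -1 on malformed input or overflow. Enough for "Basic" credentials;
 * not a general-purpose decoder (no whitespace handling).
 */
static int
b64_decode(const char *in, char *buf, size_t bufsz)
{
	static const char tbl[] =
	    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
	    "0123456789+/";
	int out = 0, bits = 0;
	unsigned long acc = 0;

	for (; *in != '\0' && *in != '='; in++) {
		const char *p = strchr(tbl, *in);
		if (p == NULL)
			return (-1);
		acc = (acc << 6) | (unsigned long)(p - tbl);
		bits += 6;
		if (bits >= 8) {
			bits -= 8;
			if ((size_t)out >= bufsz)
				return (-1);
			buf[out++] = (char)((acc >> bits) & 0xff);
		}
	}
	return (out);
}

/*
 * Split the decoded "user:password" pair in place.
 * Returns 0 on success, -1 if there is no colon separator.
 */
static int
basic_auth_split(char *creds, char **user, char **pass)
{
	char *colon = strchr(creds, ':');

	if (colon == NULL)
		return (-1);
	*colon = '\0';
	*user = creds;
	*pass = colon + 1;
	return (0);
}
```

The resulting user and password strings are exactly what would be fed to pam_authenticate(3PAM) via a conversation function; everything after that point is the delicate part discussed above.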
We hope these sorts of features make life better in many environments, whether you're using WebDAV as a primary means of sharing files or you're just giving read-only access to people's home directories as a remote convenience. Stay tuned to the Fishworks blogs for lots more discussion of Sun's new line of storage appliances.



Event ports and performance

So lots of people have been talking about event ports. They were designed to solve the problem with poll(2) and lots of file descriptors. The goal is to scale with the number of actual events of interest rather than the number of file descriptors one is listening on, since the former is often much less than the latter.

To make this a little more concrete, consider your giant new server which maintains a thousand connections, only about a hundred of which are active at any given time. It has its array with a thousand socket file descriptors, and sits there in a loop calling poll to figure out which of them can be read without blocking. poll waits until at least one of them is ready, returns, and then the program iterates over the whole array to see which descriptor(s) poll said were ready. Finally, it processes them one by one. You waste all this time iterating over the array, not to mention the time copying it between the kernel and userland every time you call poll. It's true that you could multi-thread this and save some time, but the overhead of thousands of threads (one for each fd) adds up as well.

This is the problem event ports are designed to solve. Interestingly enough, this was not the first attempt to speed up this scenario. /dev/poll was introduced in Solaris 7 for the same purpose, and works great -- except that multi-threaded apps can't use it without [lots of slow] synchronization. Linux has epoll and BSD has kqueue, about neither of which do I pretend to know much.

Event ports, as applied to file descriptors, are basically a way of telling the OS that you're interested in certain events on certain descriptors, and to deliver (synchronous) notification of said events on said descriptors to a particular object (...which also happens to be a file descriptor). Then you can have one or more worker threads sleeping, waiting for the OS to wake them up and hand them an event, saying "here you go, this socket's got something to read."
You can actually have lots of worker threads hanging around the same event port and let the OS do your "synchronization" -- they probably don't need to communicate or share any common state (depending on your app, of course). No expensive copies. No traversing huge arrays finding needles in haystacks.

At least, in theory. Until now, nobody's really measured their performance with any microbenchmarks. My recent addition of event port support to libevent gave me an excuse to run libevent's own benchmarks and see how it performed. The benchmark basically just schedules lots of read/write events, and has a few parameters: the number of file descriptors, the number of writes, and the number of active connections (the higher this number, the more the benchmark spreads its writes over the many file descriptors).

Results

All benchmarks below were run on a uniprocessor x86 machine with 1GB of memory. Note the vertical axes -- some of the graphs use logarithmic scales. And remember, these are the results of the libevent benchmark, so differences in performance can be a result of the libevent implementation as well as the underlying polling mechanism (though most of the effect is assumed to be attributable to the underlying polling mechanism).

Let's start with the simplest example. Consider small numbers of file descriptors, and various methods of polling them:

On this machine, event ports are fast, even for small numbers of file descriptors. In some circumstances, event ports are slightly more expensive for small numbers of file descriptors, particularly when many of them are active. This is because (a) when most of your descriptors are active, the work done by poll is not such a waste (because you have to do it anyway), and (b) event ports require two system calls per descriptor event rather than one.

But let's look at much larger numbers of file descriptors, where event ports really show their stuff.
Note that /dev/poll has a maximum of about 250 fd's (Update 7/12/06: Alan Bateman points out that you can use setrlimit to raise RLIMIT_NOFILE to much higher values), and select(3c) can only handle up to about 500, so they're omitted from the rest of these graphs. This graph has a logarithmic scale.

Wow. That's about a 10x difference for 7,500 fd's. 'Nuff said.

Finally, let's examine what happens when the connections become more active (more of the descriptors are written to):

Event ports still win, but not by nearly as much (about 2x, for up to 15,000 fd's). Less of poll's array traversal is unnecessary, so the difference is smaller. But in the long run, event ports scale at a pretty much constant rate, while poll scales linearly. select is even worse, since it's implemented on top of poll (on Solaris). /dev/poll is good, but has the problem mentioned earlier of being difficult to use in a multi-threaded application. For large numbers of descriptors, event ports are the clear winner.

Benchmarking can be difficult, and it's easy to miss subtle effects. I invite your comments, especially if you think there's something I've overlooked.
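For readers who haven't written this kind of loop before, here's a tiny, self-contained sketch of the poll(2) pattern the benchmark exercises -- using pipes as stand-ins for client sockets, with only two of them active (the constants and function name are mine for illustration):

```c
#include <poll.h>
#include <unistd.h>

#define	NCLIENTS	8

/*
 * Classic poll(2) loop body: one array of descriptors, and a full O(n)
 * scan after every wakeup to find the few that are actually ready.
 * Returns the number of ready descriptors found by the scan.
 */
int
count_ready(void)
{
	struct pollfd fds[NCLIENTS];
	int pipes[NCLIENTS][2];
	int i, nready = 0;

	for (i = 0; i < NCLIENTS; i++) {
		(void) pipe(pipes[i]);
		fds[i].fd = pipes[i][0];
		fds[i].events = POLLIN;
		fds[i].revents = 0;
	}

	/* Make only two of the descriptors "active". */
	(void) write(pipes[2][1], "x", 1);
	(void) write(pipes[5][1], "x", 1);

	/* poll() blocks until at least one descriptor is ready... */
	(void) poll(fds, NCLIENTS, -1);

	/*
	 * ...and then we pay for a scan of the whole array regardless of
	 * how few fired. This linear walk (plus copying the array in and
	 * out of the kernel on every call) is the cost event ports avoid.
	 */
	for (i = 0; i < NCLIENTS; i++) {
		if (fds[i].revents & POLLIN)
			nready++;
	}

	for (i = 0; i < NCLIENTS; i++) {
		(void) close(pipes[i][0]);
		(void) close(pipes[i][1]);
	}
	return (nready);
}
```

The event-port version of the same loop would associate each descriptor with a port via port_associate(3C) and then call port_get(3C) (or port_getn(3C)), which hands back only the descriptors that actually fired -- no scan, though since associations are one-shot you re-associate after handling each event, which is the second system call per event mentioned above.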



libevent and Solaris event ports

For those who dwell in subterranean shelters, event ports (or the "event completion framework," if you want) are the slick new way to deal with events from various sources, including file descriptors, asynchronous I/O, other user processes, etc. Adam Leventhal even thinks they're the 20th best thing about Solaris 10. Okay, that's not huge, but competing with ZFS, DTrace, and openness, that ain't half bad.

Though the interface is, of course, well-defined, the framework is still evolving. People are talking about adding signals and filesystem events to the event completion framework, though there's definitely some debate about how to go about doing that.

Now I hear you asking me: "Okay, that sounds pretty cool, but why would I port my sw33t @pp to Solaris just for event ports? Or port my existing poll(2) implementation on Solaris to use these newfangled doohickeys?" Well, if amazing scalability isn't an important enough reason, then I don't know what is. (One of the great properties of event ports is they scale with the number of actual events, not the number of descriptors you're listening to.) But reworking your code could come with added benefits.

Enter libevent. Libevent is a portable library for dealing with file descriptor events. "But wait," you ask, "I thought that's what event ports are for!" Well, libevent provides a portable API for doing this, but the implementation varies from system to system. On BSD, it might use the kqueue facility, while on GNU/Linux it might choose epoll. This way, you can write your program on one platform, move to another, recompile, and all your file descriptor polling code works fine -- albeit probably with a different implementation, which may affect performance substantially (possibly for the better, since the library chooses the best one for your platform).

Now libevent can use event ports as a backend on Solaris 10. It's a project I've been working on for the last week or so.
With that, you get the portability of libevent plus (some of[1]) the scalability of event ports. Sweet.

Oh, and if you're wondering more about event ports, check out the engine room, which includes an example and more technical information.

[1]: The event port libevent backend isn't quite as scalable as event ports alone, because a bit of per-descriptor bookkeeping has to be maintained. D'oh.
