Wednesday Mar 18, 2009

Compression followup

My previous post discussed compression in the 7000 series. I presented some Analytics data showing the effects of compression on a simple workload, but I observed something unexpected: the system never used more than 50% CPU doing the workloads, even when the workload was CPU-bound. This caused the CPU-intensive runs to take a fair bit longer than expected.

This happened because ZFS uses at most 8 threads for processing writes through the ZIO pipeline. With a 16-core system, only half the cores could ever be used for compression - hence the 50% CPU usage we observed. When I asked the ZFS team about this, they suggested that nthreads = 3/4 the number of cores might be a more reasonable value, leaving some headroom available for miscellaneous processing. So I reran my experiment with 12 ZIO threads. Here are the results of the same workload (the details of which are described in my previous post):

Summary: text data set
  lzjb     1.47x
  gzip-9   2.52x

Summary: media data set
  off      1.00x
  gzip-2   1.01x
  gzip-9   1.01x

We see that read times are unaffected by the change (not surprisingly), but write times for the CPU-intensive workloads (gzip) are improved by over 20%:

From the Analytics, we can see that CPU utilization is now up to 75% (exactly what we'd expect):

CPU usage with 12 ZIO threads
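
The 50% and 75% ceilings fall straight out of the thread counts. Here's a minimal sketch of that arithmetic, assuming each ZIO thread can keep at most one core busy (the function name is mine, purely for illustration):

```python
def cpu_ceiling(zio_threads: int, cores: int) -> float:
    """Upper bound on CPU utilization if each thread saturates at most one core."""
    return min(zio_threads, cores) / cores


for threads in (8, 12):
    # 8 threads -> 50%, 12 threads -> 75% on the 16-core system
    print(f"{threads} threads on 16 cores: {cpu_ceiling(threads, 16):.0%}")
```

This is only an upper bound for the compression work itself; other activity on the system still needs the remaining cores.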

Note that in order to run this experiment, I had to modify the system in a very unsupported (and unsupportable) way. Thus, the above results do not represent current performance of the 7410, but only suggest what's possible with future software updates. For these kinds of ZFS tunables (as well as those in other components of Solaris, like the networking stack), we'll continue to work with the Solaris teams to find optimal values, exposing configurables to the administrator through our web interface when necessary. Expect future software updates for the 7000 series to include tunable changes to improve performance.

Finally, it's also important to realize that if you run into this limit, you've got 8 cores (or 12, in this case) running compression full-tilt and your workload is CPU-bound. Frankly, you're using more CPU for compression than many enterprise storage servers even have today, and it may very well be the right tradeoff if your environment values disk space over absolute performance.

Update Mar 27, 2009: Updated charts to start at zero.

Monday Mar 16, 2009

Compression on the Sun Storage 7000

Built-in filesystem compression has been part of ZFS since day one, but is only now gaining some enterprise storage spotlight. Compression reduces the disk space needed to store data, not only increasing effective capacity but often improving performance as well (since fewer bytes means less I/O). Beyond that, having compression built into the filesystem (as opposed to using an external appliance between your storage and your clients to do compression, for example) simplifies the management of an already complicated storage architecture.

Compression in ZFS

Your mail client might use WinZIP to compress attachments before sending them, or you might unzip tarballs in order to open the documents inside. In these cases, you (or your program) must explicitly invoke a separate program to compress and uncompress the data before actually using it. This works fine in these limited cases, but isn't a very general solution. You couldn't easily store your entire operating system compressed on disk, for example.

With ZFS, compression is built directly into the I/O pipeline. When compression is enabled on a dataset (filesystem or LUN), data is compressed just before being sent to the spindles and decompressed as it's read back. Since this happens in the kernel, it's completely transparent to userland applications, which need not be modified at all. Besides the initial configuration (which we'll see in a moment is rather trivial), users need not do anything to take advantage of the space savings offered by compression.

A simple example

Let's take a look at how this works on the 7000 series. Like all software features, compression comes free. Enabling compression for user data is simple because it's just a share property. After creating a new share, double-click it to modify its properties, select a compression level from the drop-down box, and apply your changes:

GZIP options

After that, all new data written to the share will be compressed with the specified algorithm. Turning compression off is just as easy: just select 'Off' from the same drop-down. In both cases, extant data will remain as-is - the system won't go rewrite everything that already existed on the share.

Note that when compression is enabled, all data written to the share is compressed, no matter where it comes from: NFS, CIFS, HTTP, and FTP clients all reap the benefits. In fact, we use compression under the hood for some of the system data (analytics data, for example), since the performance impact is negligible (as we will see below) and the space savings can be significant.

You can observe the compression ratio for a share in the sidebar on the share properties screen. This is the ratio of uncompressed data size to actual (compressed) disk space used and tells you exactly how much space you're saving.
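
To make the ratio concrete, here's a small sketch of the calculation and the space savings it implies. The byte counts are made-up illustrative numbers, not measurements from the appliance, and the function names are hypothetical:

```python
def compression_ratio(logical_bytes: int, physical_bytes: int) -> float:
    """Uncompressed (logical) size divided by compressed (on-disk) size."""
    return logical_bytes / physical_bytes


def space_saved(ratio: float) -> float:
    """Fraction of disk space saved at a given compression ratio."""
    return 1 - 1 / ratio


GB = 1024 ** 3
ratio = compression_ratio(30 * GB, 12 * GB)   # illustrative sizes only
print(f"{ratio:.2f}x, {space_saved(ratio):.0%} saved")
```

A 2.50x ratio means the data occupies 40% of its uncompressed size, so 60% of the space is saved.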

The cost of compression

People are often concerned about the CPU overhead associated with compression, but the actual cost is difficult to calculate. On the one hand, compression does trade CPU utilization for disk space savings. And up to a point, if you're willing to trade more CPU time, you can get more space savings. But by reducing the space used, you end up doing less disk I/O, which can improve overall performance if your workload is bandwidth-limited.

But even when reduced I/O doesn't improve overall performance (because bandwidth isn't the bottleneck), it's important to keep in mind that the 7410 has a great deal of CPU horsepower (up to 4 quad-core 2GHz Opterons), making the "luxury" of compression very affordable.

The only way to really know the impact of compression on your disk utilization and system performance is to run your workload with different levels of compression and observe the results. Analytics is the perfect vehicle for this: we can observe CPU utilization and I/O bytes per second over time on shares configured with different compression algorithms.
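
Before running a full experiment on the appliance, you can get a rough feel for how compressible your data is from a client. The sketch below is not how the appliance measures anything: it uses Python's zlib at levels 2, 6, and 9 as approximate stand-ins for gzip-2, gzip, and gzip-9 (lzjb isn't in the Python standard library), with synthetic data standing in for the 'text' and 'media' cases:

```python
import os
import time
import zlib


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds spent) for one zlib level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


# Synthetic stand-ins: repetitive text compresses well, random bytes do not.
text = b"the quick brown fox jumps over the lazy dog\n" * 20000
media = os.urandom(len(text))

for name, data in (("text", text), ("media", media)):
    for level in (2, 6, 9):
        ratio, secs = measure(data, level)
        print(f"{name:5s} level {level}: {ratio:8.2f}x in {secs * 1000:.1f} ms")
```

Timings from a laptop say nothing about the 7410, but the ratios give a hint about whether a data set is worth compressing at all.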

Analytics results

I ran some experiments to show the impact of compression on performance. Before we get to the good stuff, here's the nitty-gritty about the experiment and results:

  • These results do not demonstrate maximum performance. I intended to show the effects of compression, not the maximum throughput of our box. Brendan's already got that covered.
  • The server is a quad-socket, 16-core 7410 with 1 JBOD (configured with mirrored storage) and 16GB of RAM. No SSD.
  • The client machine is a quad-core 7410 with 128GB of DRAM.
  • The basic workload consists of 10 clients, each writing 3GB to its own share and then reading it back for a total of 30GB in each direction. This fits entirely in the client's DRAM, but it's about twice the size of the server's total memory. While each client has its own share, they all use the same compression level for each run, so only one level is tested at a time.
  • The experiment is run for each of the compression levels supported on the 7000 series: lzjb, gzip-2, gzip (which is gzip-6), gzip-9, and none.
  • The experiment uses two data sets: 'text' (copies of /usr/dict/words, which is fairly compressible) and 'media' (copies of the Fishworks code swarm video, which is not very compressible).
  • I saw similar results with between 3 and 30 clients (with the same total write/read throughput, so they were each handling more data).
  • I saw similar results whether each client had its own share or not.

Now, below is an overview of the text (compressible) data set experiments in terms of NFS ops and network throughput. This gives a good idea of what the test does. For all graphs below, five experiments are shown, each with a different compression level in increasing order of CPU usage and space savings: off, lzjb, gzip-2, gzip, gzip-9. Within each experiment, the first half is writes and the second half reads:

NFS and network stats

Not surprisingly, from the NFS and network levels, the experiments basically appear the same, except that the writes are spread out over a longer period for higher compression levels. The read times are pretty much unchanged across all compression levels. The total NFS and network traffic should be the same for all runs. Now let's look at CPU utilization over these experiments:

CPU usage

Notice that CPU usage increases with higher compression levels, but caps out at about 50%. I need to do some digging to understand why this happens on my workload, but it may have to do with the number of threads available for compression. Anyway, since it only uses 50% of CPU, the more expensive compression runs end up taking longer.

Let's shift our focus now to disk I/O. Keep in mind that the disk throughput rate is twice that of the data we're actually reading and writing because the storage is mirrored:

Disk I/O

We expect to see an actual decrease in disk bytes written and read as the compression level increases because we're writing and reading more compressed data.
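
As a back-of-the-envelope check, the expected disk traffic can be estimated from the logical data size, the compression ratio, and the mirroring factor. This sketch ignores metadata and caching, so treat it as a rough estimate under those assumptions (the function name is hypothetical):

```python
def disk_gb_written(logical_gb: float, ratio: float, mirror_copies: int = 2) -> float:
    """Estimate GB written to disk: compressed size times the number of mirror copies."""
    return logical_gb / ratio * mirror_copies


# 30GB of client writes onto mirrored storage
for label, ratio in (("off", 1.00), ("lzjb", 1.47), ("gzip-9", 2.52)):
    print(f"{label:7s} {disk_gb_written(30, ratio):5.1f} GB")
```

With compression off, 30GB of writes becomes 60GB of disk writes on a mirror; at a 2.52x ratio the same workload needs a bit under 24GB.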

I collected similar data for the media (uncompressible) data set. The three important differences were that with higher compression levels, each workload took less time than the corresponding text one:

Network bytes

the CPU utilization during reads was less than in the text workload:

CPU utilization

and the total disk I/O didn't decrease nearly as much with the compression level as it did in the text workloads (which is to be expected):

Disk throughput

The results can be summarized by looking at the total execution time for each workload at various levels of compression:

Summary: text data set
  lzjb     1.47x
  gzip-9   2.52x

Summary: media data set
  off      1.00x
  gzip-2   1.01x
  gzip-9   1.01x
Space chart Time chart

What conclusions can we draw from these results? Of course, they confirm what we already knew: compression performance and space savings vary greatly with the compression level and type of data. But more specifically, with my workloads:

  • Read performance is generally unaffected by compression.
  • lzjb can afford decent space savings, but performs well whether or not it's able to generate much savings.
  • Even modest gzip imposes a noticeable performance hit, whether or not it reduces I/O load.
  • gzip-9 in particular can spend a lot of extra time for marginal gain.

Moreover, the 7410 has plenty of CPU headroom to spare, even with high compression.

Summing it all up

We've seen that compression is free, built-in, and very easy to enable on the 7000 series. The performance effects vary based on the workload and compression algorithm, but powerful CPUs allow compression to be used even on top of serious loads. Moreover, the appliance provides great visibility into overall system performance and effectiveness of compression, allowing administrators to see whether compression is helping or hurting their workload.

Saturday Feb 14, 2009

Fault management

The Fishworks storage appliance stands on the shoulders of giants. Many of the most exciting features -- Analytics, the hybrid storage pool, and integrated fault management, for example -- are built upon existing technologies in OpenSolaris (DTrace, ZFS, and FMA, respectively). The first two of these have been covered extensively elsewhere, but I'd like to discuss our integrated fault management, a central piece of our RAS (reliability/availability/serviceability) architecture.

Let's start with a concrete example: suppose hard disk #4 is acting up in your new 7000 series server. Rather than returning user data, it's giving back garbage. ZFS checksums the garbage, immediately detects the corruption, reconstructs the correct data from the redundant disks, and writes the correct block back to disk #4. This is great, but if the disk is really going south, such failures will keep happening, and the system will generate a fault and take the disk out of service.
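
The detect-reconstruct-repair cycle can be illustrated with a toy model of a mirrored read. This is only a sketch of the idea, not ZFS's implementation: the use of SHA-256, the in-memory "disks", and the function names are all assumptions made for the example:

```python
import hashlib


def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def self_healing_read(copies: list[bytearray], expected: str) -> bytes:
    """Return the first copy matching the checksum and repair any bad copies."""
    good = next((bytes(c) for c in copies if checksum(bytes(c)) == expected), None)
    if good is None:
        raise IOError("no valid copy: unrecoverable corruption")
    for c in copies:
        if bytes(c) != good:
            c[:] = good          # rewrite the damaged copy in place
    return good


data = b"user data"
expected = checksum(data)                              # stored separately from the data
mirror = [bytearray(b"garbage!!"), bytearray(data)]    # first disk returns garbage
assert self_healing_read(mirror, expected) == data
assert bytes(mirror[0]) == data                        # bad copy was repaired
```

The key point is that the checksum is kept apart from the block it protects, so a disk that returns garbage can't also vouch for it.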

Faults represent active problems on the system, usually hardware failures. OpenSolaris users are familiar with observing and managing faults through fmadm(1M). The appliance integrates fault management in several ways: as alerts, which allow administrators to configure automated responses to these events of interest; in the active problems screen, which provides a summary view of the current faults on the system; and through the maintenance view, which correlates faults with the actual failed hardware components. Let's look at each of these in turn.


Faults are part of a broader group of events we call alerts. Alerts represent events of interest to appliance administrators, ranging from hardware failures to backup job notifications. When one of these events occurs, the system posts an alert, taking whatever action has been configured for it. Most commonly, administrators configure the appliance to send mail or trigger an SNMP trap in response to certain alerts:

Alerts configuration

Managing faults

In our example, you'd probably discover the failed hard disk because you previously configured the appliance to send mail on hardware failure (or hot spare activation, or resilvering completion...). Once you get the mail, you'd log into the appliance web UI (BUI) and navigate to the active problems screen:

Active problems screen

The above screen presents all current faults on the system, summarizing each failure, its impact on the system, and suggested actions for the administrator. You might next click the "more info" button (next to the "out of service" text), which would bring you to the maintenance screen for the faulted chassis, highlighting the broken disk both in the diagram and the component list:

Faulted chassis

This screen connects the fault with the actual physical component that's failed. From here you could also activate the locator LED (which is no simple task behind the scenes) and have a service technician go replace the blinking disk. Of course, once they do, you'll get another mail saying that ZFS has finished resilvering the new disk.

Beyond disks

Disks are interesting examples because they are the heart of the storage server. Moreover, disks are often some of the first components to fail (in part because there are so many of them). But FMA allows us to diagnose many other kinds of components. For example, here are the same screens on a machine with a broken CPU cache:

Failed CPU

Failed CPU (chassis)

Under the hood

This complete story -- from hardware failure to replaced disk, for example -- is built on foundational technologies in OpenSolaris like FMA. Schrock has described much of the additional work that makes this simple but powerful user experience possible for the appliance. Best of all, little of the code is specific to our NAS appliances - we could conceivably leverage the same infrastructure to manage faults on other kinds of systems.

If you want to see more, download our VMware simulator and try it out for yourself.

Monday Nov 17, 2008

HTTP/WebDAV Analytics

Mike calls Analytics the killer app of the 7000 series NAS appliances. Indeed, this feature enables administrators to quickly understand what's happening on their systems in unprecedented depth. Most of the interesting Analytics data comes from DTrace providers built into Solaris. For example, the iSCSI data are gathered by the existing iSCSI provider, which allows users to drill down on iSCSI operations by client. We've got analogous providers for NFS and CIFS, too, which incorporate the richer information we have for those file-level protocols (including file name, user name, etc.).

We created a corresponding provider for HTTP in the form of a pluggable Apache module called mod_dtrace. mod_dtrace hooks into the beginning and end of each request and gathers typical log information, including local and remote IP addresses, the HTTP request method, URI, user, user agent, bytes read and written, and the HTTP response code. Since we have two probes, we also have latency information for each request. We could, of course, collect other data as long as it's readily available when we fire the probes.
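
To see why two probes are enough for latency, here's a sketch that pairs start and done events and subtracts their timestamps. The event tuples and request ids are hypothetical stand-ins for whatever a consumer of the probes would record; this is not mod_dtrace's actual interface:

```python
def request_latencies(events):
    """Pair start/done events by request id and return latency per request."""
    started = {}
    latencies = {}
    for kind, req_id, timestamp_ms in events:
        if kind == "start":
            started[req_id] = timestamp_ms
        elif kind == "done" and req_id in started:
            latencies[req_id] = timestamp_ms - started.pop(req_id)
    return latencies


events = [
    ("start", 1, 0),
    ("start", 2, 10),
    ("done", 2, 50),      # small request finishes quickly
    ("done", 1, 2500),    # large download takes much longer
]
print(request_latencies(events))   # {2: 40, 1: 2500}
```

Note that a request only shows up once its done event arrives, which matters for long-running transfers.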

The upshot of all this is that you can observe HTTP traffic in our Analytics screen, and drill down in all the ways you might hope.

Caveat user

One thing to keep in mind when analyzing HTTP data is that we're tracking individual requests, not lower level I/O operations. With NFS, for example, each operation might be a read of some part of the file. If you read a whole file, you'll see a bunch of operations, each one reading a chunk of the file. With HTTP, there's just one request, so you'll only see a data point when that request starts or finishes, no matter how big the file is. If one client is downloading a 2GB file, you won't see it until they're done (and the latency might be very high, but that's not necessarily indicative of poor performance).

This is a result of the way the protocol works (or, more precisely, the way it's used). While NFS is defined in terms of small filesystem operations, HTTP is defined in terms of requests, which may be arbitrarily large (depending on the limits of the hardware). One could imagine a world in which an HTTP client that's implementing a filesystem (like the Windows mini-redirector) makes smaller requests using HTTP Range headers. This would look more like the NFS case - there would be requests for ranges of files corresponding to the sections of files that were being read. (This could have serious consequences for performance, of course.) But as things are now, users must understand the nature of protocol-level instrumentation when drawing conclusions based on HTTP Analytics graphs.
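
For illustration, here's roughly what such a client would send if it read a file in fixed-size chunks. The sketch only builds the Range header values (byte ranges are inclusive on both ends); the chunk size and function name are arbitrary choices for the example:

```python
def range_headers(file_size: int, chunk_size: int):
    """Yield HTTP Range header values covering a file in fixed-size chunks."""
    for start in range(0, file_size, chunk_size):
        end = min(start + chunk_size, file_size) - 1   # byte ranges are inclusive
        yield f"bytes={start}-{end}"


print(list(range_headers(2500, 1000)))
# ['bytes=0-999', 'bytes=1000-1999', 'bytes=2000-2499']
```

Each of those ranges would appear as its own request in Analytics, much like the individual reads in the NFS case.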


For the morbidly curious, mod_dtrace is actually a fairly straightforward USDT provider, consisting of the following components:

  • http.d defines http_reqinfo_t, the stable structure used as an argument to probes (in D scripts). This file also defines translators to map between httpproto_t, the structure passed to the DTrace probe macro (by the actual code that fires probes in mod_dtrace.c), and the pseudo-standard conninfo_t and aforementioned http_reqinfo_t. This file is analogous to any of the files shipped in /usr/lib/dtrace on a stock OpenSolaris system.
  • http_provider_impl.h defines httpproto_t, the structure that mod_dtrace passes into the probes. This structure contains enough information for the aforementioned translators to fill in both the conninfo_t and http_reqinfo_t.
  • http_provider.d defines the provider's probes:
      provider http {
              probe request__start(httpproto_t *p) :
                  (conninfo_t *p, http_reqinfo_t *p);
              probe request__done(httpproto_t *p) :
                  (conninfo_t *p, http_reqinfo_t *p);
      };
  • mod_dtrace.c implements the provider itself. We hook into Apache's existing post_read_request and log_transaction hooks to fire the probes (if they are enabled). The only tricky bit here is counting bytes, since Apache doesn't normally keep that information around. We use an input filter to count bytes read, and we override mod_logio's optional function to count bytes written. This is basically the same approach that mod_logio uses, though it is admittedly pretty nasty.

We hope this will shed some light on performance problems in actual customer environments. If you're interested in using HTTP/WebDAV on the NAS appliance, check out my recent post on our support for system users.

Monday Nov 10, 2008

User support for HTTP

In building the 7000 series of NAS appliances, we strove to create a solid storage product that's revolutionary both in features and price/performance. This process frequently entailed rethinking old problems and finding new solutions that challenge the limitations of previous ones. Bill has a great example in making CIFS (SMB) a first-class data protocol on the appliance, from our management interface down to the kernel itself. I'll discuss here the ways in which we've enhanced support for HTTP/WebDAV sharing, particularly as it coexists with other protocols like NFS, CIFS, and FTP.

WebDAV is a set of extensions to HTTP that allows clients (like web browsers) to treat web sites like filesystems. Windows, MacOS, and Gnome all have built-in WebDAV clients that allow users to "mount" WebDAV shares on their desktop and treat them like virtual disks. Since the client setup is about as simple as it could be (just enter a URL), this makes a great file sharing solution where performance is not critical and ease of setup is important. Users can simply click their desktop environment's "connect to server" button and start editing files on the local company file server, where the data is backed up, automatically shared with their laptop, and even exported over NFS or CIFS as well.

User support

Many existing HTTP/WebDAV implementations consist of Apache and mod_dav. While this provides a simple, working implementation, its generality creates headaches when users want to share files over both HTTP and other protocols (like NFS). For one, HTTP allows the server to interpret users' credentials (which are optional to begin with) however it chooses. This provides for enormous flexibility in building complex systems, but it means that you've got to do some work to make web users correspond to something meaningful in your environment (i.e., users in your company's name service).

The other limitation of a basic Apache-based setup is that Apache itself has no support for assuming the identity of logged-in users. Typically, the web server runs as 'webservd' or 'httpd' or some other predefined system user. So in order for files to be accessible via the web, they must be accessible by this arbitrary user, which usually means making them world-accessible. Moreover, when files are created via the WebDAV interface, they end up being owned by the web server, rather than the actual user that created them.

By contrast, we've included strong support for system users in our HTTP stack. We use basic HTTP authentication to check a user's credentials against the system's name services, and then we process the request under their identity. The result is that proper filesystem permissions are enforced over HTTP, and newly created files are correctly owned by the user that created them, rather than the web server's user. (This is not totally unlike a union of mod_auth_pam and mod_become, except that those are not very well supported.)
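
For readers unfamiliar with basic HTTP authentication, here's a sketch of the first step: extracting the username and password from the Authorization header. It stops there; checking the credentials against name services is a separate step that this example doesn't attempt, and the function name is just for illustration:

```python
import base64


def parse_basic_auth(header: str) -> tuple[str, str]:
    """Split an 'Authorization: Basic ...' header value into (user, password)."""
    scheme, _, token = header.partition(" ")
    if scheme.lower() != "basic" or not token:
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(token).decode("utf-8")
    user, sep, password = decoded.partition(":")   # password may itself contain ':'
    if not sep:
        raise ValueError("missing ':' separator")
    return user, password


header = "Basic " + base64.b64encode(b"alice:s3cret").decode("ascii")
print(parse_basic_auth(header))   # ('alice', 's3cret')
```

Keep in mind that Basic credentials are only base64-encoded, not encrypted, so they're readable by anyone who can see the request unless the connection itself is protected.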

The user experience goes something like this: on my Mac laptop, I use the Finder's "Connect to Server" option and enter the appliance's WebDAV URL. I'm prompted for my username and password, which are checked against the local NIS directory server. Once in, my requests are handled by an httpd process which uses seteuid(2) to assume my identity. That means I can see exactly the same set of files I could see if I were using NFS, FTP, CIFS with identity mapping, etc. If I'm accessing someone else's 644 file, then I can read but not write it. If I'm accessing my group's 775 directory, then I can create files in it. It's just as though I were using the local filesystem.
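
The 644 and 775 cases come down to the classic owner/group/other permission check. On the appliance the filesystem enforces this once the process has assumed the user's identity; the sketch below just illustrates which bits apply, ignoring ACLs and privileged users. The uids and gids are made up for the example:

```python
import stat


def can_access(mode: int, file_uid: int, file_gid: int,
               uid: int, gids: set[int], want: str) -> bool:
    """Classic owner/group/other check for 'r', 'w', or 'x'."""
    bits = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if uid == file_uid:
        return bool(mode & bits[0])
    if file_gid in gids:
        return bool(mode & bits[1])
    return bool(mode & bits[2])


# Someone else's 644 file: readable but not writable.
assert can_access(0o644, 100, 10, uid=200, gids={20}, want="r")
assert not can_access(0o644, 100, 10, uid=200, gids={20}, want="w")
# My group's 775 directory: group members may write (create files).
assert can_access(0o775, 100, 10, uid=200, gids={10}, want="w")
```

Only one class of bits is consulted per request: the owner's bits if you own the file, otherwise the group's if you're a member, otherwise the "other" bits.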

The mod_dav FAQ vaguely describes how one could do this, but implies that making it work requires introducing a huge security hole. Using Solaris's fine-grained privileges, we give Apache worker processes just the proc_setid privilege (see privileges(5)). We don't need httpd to run as root - we just need it to change among a set of unprivileged users. Any service expected to serve data on behalf of users must be granted this privilege -- the NFS, CIFS, and FTP servers all do this (admittedly by running as root).

Of course, such a system is only as safe as its authentication and authorization mechanisms, and we've done our best to ensure that this code is safe and to mitigate the possible damage from potential exploits. It's built on top of libpam and (of course) Solaris, so we know the foundation is solid.

Implementation notes

mod_user is our custom module which authenticates users and causes httpd to assume the identity of said users. It primarily consists of hooks into Apache's request processing pipeline to authenticate users and change uids. We authenticate using pam(3PAM), which uses whatever name services have been set up for the appliance (NIS, LDAP, or locally created users).

Though mod_user itself is fairly simple, it's also somewhat delicate from a security perspective. For example, since seteuid(2) changes the effective uid of the entire process, we must be sure that we're never handling multiple requests concurrently. This is made pretty easy with the one-request-per-process threading module that is Apache's default, but there's still a bit of complexity around subrequests, where we may be running as some user other than Apache's usual 'webservd' (since we're processing a request for a particular user), but we need to authenticate the user as part of processing the subrequest. For local users, authentication requires reading our equivalent of /etc/shadow, which of course we can't allow to be world-readable. But these kinds of issues are easily solved.

Other WebDAV enhancements

Enhancing the user model is one of a few updates we've made for HTTP sharing on the NAS appliance. I'll soon discuss mod_dtrace, which facilitates HTTP Analytics by implementing a USDT provider for HTTP. We hope these sorts of features make life better in many environments, whether you're using WebDAV as a primary means of sharing files or you're just giving read-only access to people's home directories as a remote convenience.

Stay tuned to the Fishworks blogs for lots more discussion of Sun's new line of storage appliances.


On Fishworks, Sun, and software engineering

