Tuesday Aug 18, 2009

Threshold alerts

I've previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, like "disk failure" or "backup completion" for example. Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments which use SNMP for monitoring systems across a datacenter). A simple alert configuration might be "send mail to admin@mydomain when a disk fails."

While most alerts come predefined on the system, there's one interesting class of parametrized alerts that we call threshold-based alerts, or just thresholds. Thresholds use DTrace-based Analytics to post (trigger) alerts based on anything you can measure with Analytics. It's best summed up with this screenshot of the configuration screen:

Threshold alert configuration screen

The possibilities here are pretty wild, particularly because you can enable or disable Analytics datasets as an alert action. For datasets with some non-negligible cost of instrumentation (like NFS file operations broken down by client), you can automatically enable them only when you need them. For example, you could have one threshold that enables data collection when the number of operations exceeds some limit, and a second that turns collection off again if the number falls below a slightly lower limit.
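As a rough illustration of what such a dataset measures under the hood, an "NFS operations broken down by client" statistic corresponds to a DTrace enabling along these lines (this uses the Solaris nfsv3 provider as a stand-in; it is not the appliance's actual Analytics implementation):

# dtrace -n 'nfsv3:::op-*-start{@[args[0]->ci_remote] = count();}'

Keeping probes like these enabled across every NFS operation has a non-negligible cost, which is exactly why it's handy to turn such datasets on only while a threshold alert says they're interesting.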


Implementation

One of the nice things about developing a software framework like the appliance kit is that once it's accumulated a critical mass of basic building blocks, you can leverage existing infrastructure to quickly build powerful features. The threshold alerts system sits atop several pieces of the appliance kit, including the Analytics, alerts, and data persistence subsystems. The bulk of the implementation is just a few hundred lines of code that monitors stats and posts alerts as needed.

Monday May 04, 2009

Anatomy of a DTrace USDT provider

I've previously mentioned that the 7000 series HTTP/WebDAV Analytics feature relies on USDT, the mechanism by which developers can define application-specific DTrace probes to provide stable points of observation for debugging or analysis. Many projects already use USDT, including the Firefox JavaScript engine, MySQL, Python, Perl, and Ruby. But writing a USDT provider is (necessarily) somewhat complicated, and documentation is sparse. While there are some USDT examples on the web, most do not make use of newer USDT features (for example, they don't use translated arguments). Moreover, there are some rough edges around the USDT infrastructure that can make the initial development somewhat frustrating. In this entry I'll explain the structure of a USDT provider with translated arguments in hopes that it's helpful to those starting out on such a project. I'll use the HTTP provider as an example since I know that one best. I'll also refer to the source of the iscsi provider, since its complete source is freely available (as part of iscsitgtd in ON).

Overview

A USDT provider with translated arguments is made up of the following pieces:

  • a provider definition file (e.g., http_provider.d), from which an application header and object files will be generated
  • a provider header file (e.g., http_provider.h), generated from the provider definition file with dtrace -h and included by the application, which defines the macros the application uses to fire probes
  • a provider object file (e.g., http_provider.o), generated from the provider definition and application object files with dtrace -G
  • a provider application header file (e.g., http_provider_impl.h), which defines the C structures passed into probes by the application
  • native (C) code that invokes the probe
  • a provider support file (e.g., http.d), delivered into /usr/lib/dtrace, which defines the D structures and translators used in probes at runtime

Putting these together (a command-line sketch of the build follows this list):

  1. The build process takes the provider definition and generates the provider header and provider object files.
  2. The application includes the provider header file as well as the application header file and uses DTrace macros and C structures to fire probes.
  3. The compiled application is linked with the generated provider object file that encodes the probes.
  4. DTrace consumers (e.g., dtrace(1M)) read in the provider support file and instrument any processes containing the specified probes.
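Concretely, the build might look something like this. The file names are the examples used in this entry; the cc invocations for building a shared-object Apache module are illustrative assumptions, not a prescribed build:

$ dtrace -h -s http_provider.d -o http_provider.h      # generate the probe macros
$ cc -c mod_dtrace.c -o mod_dtrace.o                    # compile the application code
$ dtrace -G -s http_provider.d -o http_provider.o mod_dtrace.o
$ cc -G -o mod_dtrace.so mod_dtrace.o http_provider.o   # link in the provider object

The dtrace -h and dtrace -G steps are the only DTrace-specific parts; everything else is the application's ordinary compile and link.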
It's okay if you didn't follow all that. Let's examine these pieces in more detail.

1. Provider definition (and generated files)

The provider definition describes the set of probes made available by that provider. For each probe, the definition describes the arguments passed to DTrace by the probe implementation (within the application) as well as the arguments passed by DTrace to probe consumers (D scripts).

For example, here's the heart of the http provider definition, http_provider.d:

provider http {
        probe request__start(httpproto_t *p) :
            (conninfo_t *p, http_reqinfo_t *p);
        probe request__done(httpproto_t *p) :
            (conninfo_t *p, http_reqinfo_t *p);
        /* ... */
};

The http provider defines two probes: request-start and request-done. (DTrace converts double underscores to hyphens.) Each of these consumes an httpproto_t from the application (in this case, an Apache module) and provides a conninfo_t and an http_reqinfo_t to DTrace scripts using these probes. Don't worry about the details of these structures just yet.

From the provider definition we build the related http_provider.h and http_provider.o files. The header file is generated by dtrace -h and contains macros used by the application to fire probes. Here's a piece of the generated file (edited to show only the x86 version for clarity):

#define	HTTP_REQUEST_DONE(arg0) \
	__dtrace_http___request__done(arg0)

#define	HTTP_REQUEST_DONE_ENABLED() \
	__dtraceenabled_http___request__done()

So for each probe, we have a macro of the form provider_probename(probeargs) that the application uses to fire a probe with the specified arguments. Note that the argument to HTTP_REQUEST_DONE should be an httpproto_t *, since that's what the request-done probe consumes from the application.

The provider object file generated by dtrace -G is also necessary to make all this work, but the mechanics are well beyond the scope of this entry.

2. Application components

The application header file http_provider_impl.h defines the httpproto_t structure for the application; the application fills in one of these structures and passes a pointer to it into the probe macro to fire the probe. Here's an example:

typedef struct {
        const char *http_laddr;        /* local IP address (as string) */
        const char *http_uri;          /* URI requested */
        const char *http_useragent;    /* user's browser (User-agent header) */
        uint64_t http_byteswritten;    /* bytes RECEIVED from client */
        /* ... */
} httpproto_t;

The application uses the macros from http_provider.h and the structure defined in http_provider_impl.h to fire a DTrace probe. We also use the is-enabled macros to avoid constructing the arguments when they're not needed. For example:

static void
mod_dtrace_postrequest(request_rec *rr)
{
        httpproto_t hh;

        /* ... */

        if (!HTTP_REQUEST_DONE_ENABLED())
                return;

        /* fill in hh object based on request rr ... */

        HTTP_REQUEST_DONE(&hh);
}

3. DTrace consumer components

What we haven't specified yet is exactly what defines a conninfo_t or http_reqinfo_t or how to translate an httpproto_t object into a conninfo_t or http_reqinfo_t. These structures and translators must be defined when a consuming D script is compiled (i.e., when a user runs dtrace(1M) and wants to use our probe). These definitions go into what I've called the provider support file, which includes definitions like these:

typedef struct {
        uint32_t http_laddr;
        uint32_t http_uri;
        uint32_t http_useragent;
        uint64_t http_byteswritten;
        /* ... */
} httpproto_t;

typedef struct {
        string hri_uri;                 /* URI requested */
        string hri_useragent;           /* "User-agent" header (browser) */
        uint64_t hri_byteswritten;      /* bytes RECEIVED from the client */
        /* ... */
} http_reqinfo_t;

#pragma D binding "1.6.1" translator
translator conninfo_t <httpproto_t *dp> {
        ci_local = copyinstr((uintptr_t)
            *(uint32_t *)copyin((uintptr_t)&dp->http_laddr, sizeof (uint32_t)));
        /* ... */
};

#pragma D binding "1.6.1" translator
translator http_reqinfo_t <httpproto_t *dp> {
        hri_uri = copyinstr((uintptr_t)
            *(uint32_t *)copyin((uintptr_t)&dp->http_uri, sizeof (uint32_t)));
        /* ... */
};

There are a few things to note here:

  • The httpproto_t structure must exactly match the one being used by the application. There's no way to enforce this with just one definition because neither file can rely on the other being available.
  • The above example only works for 32-bit applications. For a similar example that uses the data model (ILP32 or LP64) of the traced process to do the right thing for both 32-bit and 64-bit apps, see /usr/lib/dtrace/iscsi.d; a rough sketch of the idiom follows this list.
  • We didn't define conninfo_t. That's because it's defined in /usr/lib/dtrace/net.d. Instead of redefining it, we have our file depend on net.d with this line at the top:
    #pragma D depends_on library net.d
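
For the curious, the core of that dual-data-model idiom looks roughly like this: instead of the 32-bit-only httpproto_t above, the support file defines both layouts and switches on the data model of the traced process. This is a sketch modeled on iscsi.d, not the actual contents of http.d, and httpproto32_t is a name I've made up here:

typedef struct {
        uint32_t http_laddr;            /* pointers are 4 bytes in ILP32 apps */
        /* ... */
} httpproto32_t;

typedef struct {
        uint64_t http_laddr;            /* pointers are 8 bytes in LP64 apps */
        /* ... */
} httpproto_t;

#pragma D binding "1.6.1" translator
translator conninfo_t <httpproto_t *dp> {
        /* pick the member offset and pointer size based on the data model;
           curpsinfo and PR_MODEL_ILP32 come from the procfs.d definitions */
        ci_local = curpsinfo->pr_dmodel == PR_MODEL_ILP32 ?
            copyinstr((uintptr_t)*(uint32_t *)copyin(
            (uintptr_t)&((httpproto32_t *)dp)->http_laddr, sizeof (uint32_t))) :
            copyinstr((uintptr_t)*(uint64_t *)copyin(
            (uintptr_t)&dp->http_laddr, sizeof (uint64_t)));
        /* ... */
};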
    

This provider support file gets delivered into /usr/lib/dtrace. dtrace(1M) automatically imports all .d files in this directory (or another directory specified with -xlibdir) on startup. When a DTrace consumer goes to use our probes, the conninfo_t, httpproto_t, and http_reqinfo_t structures are defined, as well as the needed translators. More concretely, when a user writes:

# dtrace -n 'http*:::request-start{printf("%s\n", args[1]->hri_uri);}'

DTrace knows exactly what to do.
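
Here's a slightly bigger (hypothetical) consumer to give a flavor of what translated arguments buy you. It assumes the conninfo_t translator also fills in ci_remote, the client's address, as the net.d conninfo_t convention intends:

#!/usr/sbin/dtrace -s

#pragma D option quiet

http*:::request-done
{
        /* aggregate request counts and bytes by client address and URI */
        @reqs[args[0]->ci_remote, args[1]->hri_uri] = count();
        @bytes[args[0]->ci_remote, args[1]->hri_uri] =
            sum(args[1]->hri_byteswritten);
}

END
{
        printa("%-16s %-40s %@6d reqs %@10d bytes\n", @reqs, @bytes);
}

Note that the script never does its own copyin work; the translators hide all of that.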

Rough edges

Remember http_provider.d? It contained the actual provider and probe definitions, and it referred to the httpproto_t, conninfo_t, and http_reqinfo_t structures, which we never actually defined in that file. We already explained that these structures and translators are defined by the provider support file and used at runtime, so they shouldn't actually be necessary here.

Unfortunately, there's a piece missing that's necessary to work around buggy behavior in DTrace: the D compiler insists on having these definitions and translators available when it processes the provider definition file, even though they won't actually be used there (this file isn't even present at runtime anyway). Even worse, dtrace(1M) doesn't automatically import the files in /usr/lib/dtrace when compiling the provider file, so we can't simply pull them in with depends_on.

The end result is that we must define "dummy" structures and translators in the provider file, like this:

typedef struct http_reqinfo {
        int dummy;
} http_reqinfo_t;

typedef struct httpproto {
        int dummy;
} httpproto_t;

typedef struct conninfo {
        int dummy;
} conninfo_t;

translator conninfo_t <httpproto_t *dp> {
        dummy = 0;
};

translator http_reqinfo_t <httpproto_t *dp> {
        dummy = 0;
};

We also need stability attributes like the following to use the probes:

#pragma D attributes Evolving/Evolving/ISA provider http provider
#pragma D attributes Private/Private/Unknown provider http module
#pragma D attributes Private/Private/Unknown provider http function
#pragma D attributes Private/Private/ISA provider http name
#pragma D attributes Evolving/Evolving/ISA provider http args

You can see both of these in the iscsi provider as well.

Tips

I've now covered all the pieces, but there are other considerations in implementing a provider. For example, what arguments should the probes consume from the application, and what should be provided to D scripts? We chose structures on both sides because they're much less unwieldy (especially as the interface evolves), but that choice necessitates the ugly translators and the duplicated structure definitions. If we'd used plain pointer and integer arguments instead, we'd need no structures and therefore no translators, and we could leave out several of the files described above. But the probe macros would become unwieldy as arguments were added, and consumers would have to use copyin/copyinstr themselves.
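To make that tradeoff concrete, here's a rough sketch of the untranslated alternative (not what the http provider actually does): the provider definition shrinks to plain arguments, and there are no support files or translators at all.

provider http {
        /* untranslated: raw pointers and integers from the application */
        probe request__done(char *uri, char *useragent, uint64_t bytes);
};

A consumer would then write something like dtrace -n 'http*:::request-done{@[copyinstr(arg0)] = sum(arg2);}', using arg0/arg2 and copyinstr directly rather than the stable args[] members provided by the translators.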

Both the HTTP and iSCSI providers instrument network-based protocols. For consistency, providers for these protocols (which also include NFS, CIFS, and FTP on the 7000 series) use the same conventions for probe argument types and names (e.g., conninfo_t as the first argument, followed by protocol-specific arguments).

Conclusion

USDT with translated arguments is extremely powerful, but the documentation is somewhat lacking and there are still some rough edges for implementers. I hope this example is valuable for people trying to put the pieces together. If people want to get involved in documenting this, contact the DTrace community at opensolaris.org.

Monday Apr 27, 2009

2009.Q2 Released

Today we've released the first major software update for the 7000 series, called 2009.Q2. This update contains a boatload of bugfixes and new features, including support for HTTPS (HTTP with SSL using self-signed certificates). This makes HTTP user support more tenable in less secure environments because credentials don't have to be transmitted in the clear.

Another updated feature that's important for RAS is enhanced support bundles. Support bundles are tarballs containing core files, log files, and other debugging output generated by the appliance that can be sent directly to support personnel. In this release, support bundles collect more useful data about failed services and (critically) can be created even when the appliance's core services have failed or network connectivity has been lost. You can also monitor support bundle progress on the Maintenance -> System screen, and bundles can be downloaded directly from the browser for environments where the appliance cannot connect directly to Sun Support. All of these improvements help us to track down problems remotely, relying as little as possible on the functioning system or the administrator's savvy.

See the Fishworks blog for more details on this release. Enjoy!

Wednesday Mar 18, 2009

Compression followup

My previous post discussed compression in the 7000 series. I presented some Analytics data showing the effects of compression on a simple workload, but I observed something unexpected: the system never used more than 50% CPU doing the workloads, even when the workload was CPU-bound. This caused the CPU-intensive runs to take a fair bit longer than expected.

This happened because ZFS uses at most 8 threads for processing writes through the ZIO pipeline. With a 16-core system, only half the cores could ever be used for compression - hence the 50% CPU usage we observed. When I asked the ZFS team about this, they suggested that nthreads = 3/4 the number of cores might be a more reasonable value, leaving some headroom available for miscellaneous processing. So I reran my experiment with 12 ZIO threads. Here are the results of the same workload (the details of which are described in my previous post):

Summary: text data set (times in min:sec)

  Compression   Ratio   Total   Write   Read
  off           1.00x    3:29    2:06   1:23
  lzjb          1.47x    3:36    2:13   1:23
  gzip-2        2.35x    5:16    3:54   1:22
  gzip          2.52x    8:39    7:17   1:22
  gzip-9        2.52x    9:13    7:49   1:24

Summary: media data set (times in min:sec)

  Compression   Ratio   Total   Write   Read
  off           1.00x    3:39    2:17   1:22
  lzjb          1.00x    3:38    2:16   1:22
  gzip-2        1.01x    5:46    4:24   1:22
  gzip          1.01x    5:57    4:34   1:23
  gzip-9        1.01x    6:06    4:43   1:23

We see that read times are unaffected by the change (not surprisingly), but write times for the CPU-intensive workloads (gzip) are improved by over 20%.

From the Analytics, we can see that CPU utilization is now up to 75% (exactly what we'd expect):

CPU usage with 12 ZIO threads

Note that in order to run this experiment, I had to modify the system in a very unsupported (and unsupportable) way. Thus, the above results do not represent current performance of the 7410, but only suggest what's possible with future software updates. For these kinds of ZFS tunables (as well as those in other components of Solaris, like the networking stack), we'll continue to work with the Solaris teams to find optimal values, exposing configurables to the administrator through our web interface when necessary. Expect future software updates for the 7000 series to include tunable changes to improve performance.

Finally, it's also important to realize that if you run into this limit, you've got 8 cores (or 12, in this case) running compression full-tilt and your workload is CPU-bound. Frankly, you're using more CPU for compression than many enterprise storage servers even have today, and it may very well be the right tradeoff if your environment values disk space over absolute performance.

Update Mar 27, 2009: Updated charts to start at zero.

Monday Mar 16, 2009

Compression on the Sun Storage 7000

Built-in filesystem compression has been part of ZFS since day one, but is only now gaining some enterprise storage spotlight. Compression reduces the disk space needed to store data, not only increasing effective capacity but often improving performance as well (since fewer bytes means less I/O). Beyond that, having compression built into the filesystem (as opposed to using an external appliance between your storage and your clients to do compression, for example) simplifies the management of an already complicated storage architecture.

Compression in ZFS

Your mail client might use WinZIP to compress attachments before sending them, or you might unzip tarballs in order to open the documents inside. In these cases, you (or your program) must explicitly invoke a separate program to compress and uncompress the data before actually using it. This works fine in these limited cases, but isn't a very general solution. You couldn't easily store your entire operating system compressed on disk, for example.

With ZFS, compression is built directly into the I/O pipeline. When compression is enabled on a dataset (filesystem or LUN), data is compressed just before being sent to the spindles and decompressed as it's read back. Since this happens in the kernel, it's completely transparent to userland applications, which need not be modified at all. Besides the initial configuration (which we'll see in a moment is rather trivial), users need not do anything to take advantage of the space savings offered by compression.
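
For readers running plain ZFS outside the appliance, the same mechanism is exposed as an ordinary dataset property via zfs(1M); the dataset name below is just a placeholder. A minimal sketch:

# enable gzip-2 compression; only data written from now on is compressed
zfs set compression=gzip-2 tank/home/docs

# see the configured algorithm and the achieved compression ratio
zfs get compression,compressratio tank/home/docs

# turn it back off; already-compressed blocks stay compressed until rewritten
zfs set compression=off tank/home/docs

On the appliance, of course, you use the share property described below rather than the command line.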

A simple example

Let's take a look at how this works on the 7000 series. Like all software features, compression comes free. Enabling compression for user data is simple because it's just a share property. After creating a new share, double-click it to modify its properties, select a compression level from the drop-down box, and apply your changes:

Share properties: compression drop-down (GZIP options)

After that, all new data written to the share will be compressed with the specified algorithm. Turning compression off is just as easy: select 'Off' from the same drop-down. In both cases, extant data will remain as-is - the system won't go back and rewrite everything that already existed on the share.

Note that when compression is enabled, all data written to the share is compressed, no matter where it comes from: NFS, CIFS, HTTP, and FTP clients all reap the benefits. In fact, we use compression under the hood for some of the system data (analytics data, for example), since the performance impact is negligible (as we will see below) and the space savings can be significant.

You can observe the compression ratio for a share in the sidebar on the share properties screen. This is the ratio of uncompressed data size to actual (compressed) disk space used and tells you exactly how much space you're saving.



The cost of compression

People are often concerned about the CPU overhead associated with compression, but the actual cost is difficult to calculate. On the one hand, compression does trade CPU utilization for disk space savings. And up to a point, if you're willing to trade more CPU time, you can get more space savings. But by reducing the space used, you end up doing less disk I/O, which can improve overall performance if your workload is bandwidth-limited.

But even when reduced I/O doesn't improve overall performance (because bandwidth isn't the bottleneck), it's important to keep in mind that the 7410 has a great deal of CPU horsepower (up to 4 quad-core 2GHz Opterons), making the "luxury" of compression very affordable.

The only way to really know the impact of compression on your disk utilization and system performance is to run your workload with different levels of compression and observe the results. Analytics is the perfect vehicle for this: we can observe CPU utilization and I/O bytes per second over time on shares configured with different compression algorithms.

Analytics results

I ran some experiments to show the impact of compression on performance. Before we get to the good stuff, here's the nitty-gritty about the experiment and results:

  • These results do not demonstrate maximum performance. I intended to show the effects of compression, not the maximum throughput of our box. Brendan's already got that covered.
  • The server is a quad-core 7410 with 1 JBOD (configured with mirrored storage) and 16GB of RAM. No SSD.
  • The client machine is a quad-core 7410 with 128GB of DRAM.
  • The basic workload consists of 10 clients, each writing 3GB to its own share and then reading it back for a total of 30GB in each direction. This fits entirely in the client's DRAM, but it's about twice the size of the server's total memory. While each client has its own share, they all use the same compression level for each run, so only one level is tested at a time.
  • The experiment is run for each of the compression levels supported on the 7000 series: lzjb, gzip-2, gzip (which is gzip-6), gzip-9, and none.
  • The experiment uses two data sets: 'text' (copies of /usr/dict/words, which is fairly compressible) and 'media' (copies of the Fishworks code swarm video, which is not very compressible).
  • I saw similar results with between 3 and 30 clients (with the same total write/read throughput, so they were each handling more data).
  • I saw similar results whether each client had its own share or not.

Now, below is an overview of the text (compressible) data set experiments in terms of NFS ops and network throughput. This gives a good idea of what the test does. For all graphs below, five experiments are shown, each with a different compression level in increasing order of CPU usage and space savings: off, lzjb, gzip-2, gzip, gzip-9. Within each experiment, the first half is writes and the second half reads:

NFS and network stats

Not surprisingly, from the NFS and network levels, the experiments basically appear the same, except that the writes are spread out over a longer period for higher compression levels. The read times are pretty much unchanged across all compression levels. The total NFS and network traffic should be the same for all runs. Now let's look at CPU utilization over these experiments:

CPU usage

Notice that CPU usage increases with higher compression levels, but caps out at about 50%. I need to do some digging to understand why this happens on my workload, but it may have to do with the number of threads available for compression. Anyway, since it only uses 50% of CPU, the more expensive compression runs end up taking longer.

Let's shift our focus now to disk I/O. Keep in mind that the disk throughput rate is twice that of the data we're actually reading and writing because the storage is mirrored:

Disk I/O

We expect to see an actual decrease in disk bytes written and read as the compression level increases, because the data being written to and read from disk is more highly compressed.

I collected similar data for the media (uncompressible) data set. The three important differences were that with higher compression levels, each workload took less time than the corresponding text one:

Network bytes

the CPU utilization during reads was less than in the text workload:

CPU utilization

and the total disk I/O didn't decrease nearly as much with the compression level as it did in the text workloads (which is to be expected):

Disk throughput

The results can be summarized by looking at the total execution time for each workload at various levels of compression:

Summary: text data set (times in min:sec)

  Compression   Ratio   Total   Write   Read
  off           1.00x    3:30    2:08   1:22
  lzjb          1.47x    3:26    2:04   1:22
  gzip-2        2.35x    6:12    4:50   1:22
  gzip          2.52x   11:18    9:56   1:22
  gzip-9        2.52x   12:16   10:54   1:22

Summary: media data set (times in min:sec)

  Compression   Ratio   Total   Write   Read
  off           1.00x    3:29    2:07   1:22
  lzjb          1.00x    3:31    2:09   1:22
  gzip-2        1.01x    6:59    5:37   1:22
  gzip          1.01x    7:18    5:57   1:21
  gzip-9        1.01x    7:37    6:15   1:22
Space chart
Time chart

What conclusions can we draw from these results? At a high level, they confirm what we already knew: compression performance and space savings vary greatly with the compression level and the type of data. More specifically, with my workloads:

  • read performance is generally unaffected by compression
  • lzjb delivers decent space savings on compressible data, and it performs well whether or not it manages to save much space.
  • Even the lower gzip levels impose a noticeable performance hit on writes, whether or not they reduce the I/O load.
  • gzip-9 in particular can spend a lot of extra time for marginal gain.

Moreover, the 7410 has plenty of CPU headroom to spare, even with high compression.

Summing it all up

We've seen that compression is free, built-in, and very easy to enable on the 7000 series. The performance effects vary based on the workload and compression algorithm, but powerful CPUs allow compression to be used even on top of serious loads. Moreover, the appliance provides great visibility into overall system performance and effectiveness of compression, allowing administrators to see whether compression is helping or hurting their workload.
