Sunday Apr 18, 2010

Replication in 2010.Q1

This post is long overdue since 2010.Q1 came out over a month ago now, but better late than never. Bullet-point feature lists for 2010.Q1 typically include an item like "improved remote replication", but what do we mean by that? The summary is vague because, well, it's hard to summarize concisely what we did. Let's break it down:

Improved stability. We've rewritten the replication management subsystem. Informed by the shortcomings of its predecessor, the new design avoids large classes of problems that were pain points for customers on older releases. The new implementation also keeps more of the relevant debugging data, which lets us drive new issues to root cause faster and more reliably.

Enhanced management model. We've formalized the notion of packages, which were previously just "replicas" or "replicated projects". Older releases mandated that a given project could only be replicated to a given target once (at a time) and that only one copy of a project could exist on a particular target at a time. 2010.Q1 supports multiple actions for a given project and target, each one corresponding to an independent copy on the target called a "package." This allows administrators to replicate a fresh copy without destroying the one that's already on the target.

Share-level replication. 2010.Q1 supports more fine-grained control of replication configuration, like leaving an individual share out of its project's replication configuration or replicating a share by itself without the other shares in its project.

Optional SSL encryption for improved performance. Older releases always encrypted the data sent over the wire. 2010.Q1 still supports this, but also lets customers disable SSL encryption for significantly improved performance when the security of data on the wire isn't so critical (as in many internal environments).

Bandwidth throttling. The system now supports limiting the bandwidth used by individual replication actions. With this, customers with limited network resources can keep replication from hogging the available bandwidth and starving the client data path.
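
Throttling like this is typically implemented with a token bucket. The sketch below is illustrative Python, not the appliance's implementation: a sender asks how long to pause before each chunk so that sustained throughput stays at the configured rate while still allowing short bursts.

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: sustained throughput capped at `rate` bytes/sec."""

    def __init__(self, rate, burst):
        self.rate = float(rate)       # refill rate, bytes per second
        self.capacity = float(burst)  # maximum burst size, bytes
        self.tokens = float(burst)    # start with a full bucket
        self.last = time.monotonic()

    def consume(self, nbytes):
        """Account for sending `nbytes`; return seconds the sender should pause."""
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        self.tokens -= nbytes
        # A negative balance means we overdrew; pause until it would refill.
        return 0.0 if self.tokens >= 0 else -self.tokens / self.rate
```

A replication sender would call consume() before each buffer it writes to the socket and sleep for the returned duration, leaving the remaining bandwidth for the client data path.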

Improved target-side management. Administrators can browse replicated projects and shares in the BUI and CLI just like local projects and shares. You can also view properties of these shares and even change them where appropriate. For example, the NFS export list can be customized on the target, which is important for disaster-recovery plans where the target will serve different clients in a different datacenter. Or you could enable stronger compression on the target, saving disk space at the expense of performance, which may be less important on a backup site.

Read-only view of replicated filesystems and snapshots. This is pretty self-explanatory. You can now export replicated filesystems read-only over NFS, CIFS, HTTP, FTP, etc., allowing you to verify the data, run incremental NDMP backups, or perform data analysis that's too expensive to run on the primary system. You can also see and clone the non-replication snapshots.

Then there are lots of small improvements, like being able to disable replication globally, per-action, or per-package, which is very handy when trying it out or measuring performance. Check out the documentation (also much improved) for details.

Wednesday Mar 10, 2010

Remote Replication Introduction

When we first announced the SS7000 series, we made available a simulator (a virtual machine image) so people could easily try out the new software. At a keynote session that evening, Bryan and Mike challenged audience members to be the first to set up remote replication between two simulators. They didn't realize how quickly someone would take them up on that. Having worked on this feature, it was very satisfying to watch a new user set up replication so easily on the first try.

The product has come a long way in the short time since then. This week sees the release of 2010.Q1, the fourth major software update in just over a year. Each update has come packed with major features from online data migration to on-disk data deduplication. 2010.Q1 includes several significant enhancements (and bug fixes) to the remote replication feature. And while it was great to see one of our first users find it so easy to replicate an NFS share to another appliance, remote replication remains one of the most complex features of our management software.

The problem sounds simple enough: just copy this data to that system. But people use remote replication to solve many different business problems, and supporting each of these requires related features that together add significant complexity to the system. Examples include:

  • Backup. Disk-to-disk backup is the most obvious use of data replication. Customers need the ability to recover data in the event of data loss on the primary system, whether a result of system failure or administrative error.
  • Disaster recovery (DR). This sounds like backup, but it's more than that: customers running business-critical services backed by a storage system need the ability to recover service quickly in the event of an outage of their primary system (be it a result of system failure or a datacenter-wide power outage or an actual disaster). Replication can be used to copy data to a secondary system off-site that can be configured to quickly take over service from the primary site in the event of an extended outage there. Of course, you also need a way to switch back to the primary site without copying all the data back.
  • Data distribution. Businesses spread across the globe often want a central store of documents that clients around the world can access quickly. They use replication to copy documents and other data from a central data center to many remote appliances, providing fast local caches for employees working far from the main office.
  • Workload distribution. Many customers replicate data to a second appliance to run analysis or batch jobs that are too expensive to run on the primary system without impacting the production workload.

These use cases inform the design requirements for any replication system:

  • Data replication must be configurable on some sort of schedule. We don't just want one copy of the data on another system. We want an up-to-date copy. For example, data changes every day, and we want nightly backups. Or we have DR agreements that require restoring service using data no more than 10 minutes out-of-date. Some deployments wanting to maximize freshness of replicated data may want to replicate continuously (as frequently as possible). Very critical systems may even want to replicate synchronously (so that the primary system does not acknowledge client writes until they're on stable storage on the DR site), though this has significant performance implications.
  • Data should only be replicated once. This one's obvious, but important. When we update the copy, we don't want to have to send an entire new copy of the data. This wastes potentially expensive network bandwidth and disk space. We only want to send the changes made since the previous update. This is also important when restoring primary service after a disaster-recovery event. In that case, we only want to copy the changes made while running on the secondary system back to the primary system.
  • Copies should be point-in-time consistent. Source data may always be changing, but with asynchronous replication, copies will usually be updated at discrete intervals. But at the very least, the copy should represent a snapshot of the data at a single point in time. (By contrast, a simple "cp" or "rsync" copy of an actively changing filesystem would result in a copy where each file's state was copied at slightly different times, potentially resulting in inconsistencies in the copy that didn't exist (and could not exist) in the source data.) This is particularly important for databases or other applications with complex persistent state. Traditional filesystem guarantees about their state after a crash make it possible to write applications that can recover from any point-in-time snapshot, but it's much harder to write software that can recover from arbitrary inconsistencies introduced by sloppy copying.


  • Replication performance is important, but so is performance observability and control (e.g., throttling). Backup and DR operations can't be allowed to significantly impact the performance of primary clients of the system, so administrators need to be able to see the impact of replication on system performance as well as limit system resources used for replication if this impact becomes too large.
  • Complete backup solutions should replicate data-related system configuration (like filesystem block size, quotas, or protocol sharing properties), since this needs to be recovered with a full restore, too. But some properties should be changeable in each copy. Backup copies may use higher compression levels, for example, because performance is less important than disk space on the backup system. DR sites may have different per-host NFS sharing restrictions because they're in different data centers than the source system.
  • Management must be clear and simple. When you need to use your backup copy, whether to restore the original system or bring up a disaster recovery site, you want the process to be as simple as possible. Delays cost money, and missteps can lead to loss of the only good copy of your data.
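
Several of the requirements above (point-in-time consistency, sending only changes) combine into a snapshot-and-delta model. Here's a toy sketch in Python, purely illustrative, not the appliance's code, which operates on filesystem snapshots rather than dictionaries: take a consistent snapshot, compute only the changes since the previous snapshot, and apply them to the replica.

```python
def snapshot(fs):
    """A point-in-time, consistent copy of the dataset (here, a dict of files)."""
    return dict(fs)

def diff(old_snap, new_snap):
    """Compute only what changed between two snapshots: updates and deletions."""
    changes = {k: v for k, v in new_snap.items() if old_snap.get(k) != v}
    deletions = [k for k in old_snap if k not in new_snap]
    return changes, deletions

def apply_update(target, changes, deletions):
    """Apply an incremental update, bringing the replica to the new snapshot."""
    target.update(changes)
    for k in deletions:
        target.pop(k, None)
```

Because diffs are computed between snapshots rather than against the live (changing) source, every update leaves the replica at a state the source actually passed through, which is exactly the consistency property described above.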

That's an overview of the design goals of our remote replication feature today. Some of these elements have been part of the product since the initial release, while others are new to 2010.Q1. The product will evolve as we see how people use the appliance to solve other business needs. Expect more details in the coming weeks and months.

Tuesday Aug 18, 2009

Threshold alerts

I've previously discussed alerts in the context of fault management. On the 7000, alerts are just events of interest to administrators, like "disk failure" or "backup completion". Administrators can configure actions that the system takes in response to particular classes of alert. Common actions include sending email (for notification) and sending an SNMP trap (for integration into environments that use SNMP to monitor systems across a datacenter). A simple alert configuration might be: "send mail to admin@mydomain when a disk fails."

While most alerts come predefined on the system, there's one interesting class of parametrized alerts that we call threshold-based alerts, or just thresholds. Thresholds use DTrace-based Analytics to post (trigger) alerts based on anything you can measure with Analytics. It's best summed up with this screenshot of the configuration screen:


The possibilities here are pretty wild, particularly because you can enable or disable Analytics datasets as an alert action. For datasets with some non-negligible cost of instrumentation (like NFS file operations broken down by client), you can automatically enable them only when you need them. For example, you could have one threshold that enables data collection when the number of operations exceeds some limit, and then turn collection off again if the number fell below a slightly lower limit.
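
That enable/disable pattern is classic hysteresis: two limits, with the action toggling only on crossings of the outer one, so a stat hovering near a single threshold doesn't flap the dataset on and off. A minimal sketch (illustrative Python; the names are mine, not the appliance's):

```python
class HysteresisThreshold:
    """Fire `enable` when the stat exceeds `high`; fire `disable` below `low`.

    Two limits (low < high) keep the action from flapping when the stat
    hovers near a single threshold.
    """

    def __init__(self, low, high, enable, disable):
        assert low < high
        self.low, self.high = low, high
        self.enable, self.disable = enable, disable
        self.active = False  # is the expensive dataset currently enabled?

    def update(self, value):
        """Feed the latest stat sample; toggle the action on threshold crossings."""
        if not self.active and value > self.high:
            self.active = True
            self.enable()
        elif self.active and value < self.low:
            self.active = False
            self.disable()
```

Feeding it samples like 500, 1200, 900, 1100, 700 with limits (800, 1000) enables collection once at 1200 and disables it once at 700; the dips to 900 in between don't toggle anything.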


One of the nice things about developing a software framework like the appliance kit is that once it's accumulated a critical mass of basic building blocks, you can leverage existing infrastructure to quickly build powerful features. The threshold alerts system sits atop several pieces of the appliance kit, including the Analytics, alerts, and data persistence subsystems. The bulk of the implementation is just a few hundred lines of code that monitor stats and post alerts as needed.

Monday May 04, 2009

Anatomy of a DTrace USDT provider

I've previously mentioned that the 7000 series HTTP/WebDAV Analytics feature relies on USDT, the mechanism by which developers can define application-specific DTrace probes to provide stable points of observation for debugging or analysis. Many projects already use USDT, including the Firefox Javascript engine, mysql, python, perl, and ruby. But writing a USDT provider is (necessarily) somewhat complicated, and documentation is sparse. While there are some USDT examples on the web, most do not make use of newer USDT features (for example, they don't use translated arguments). Moreover, there are some rough edges around the USDT infrastructure that can make the initial development somewhat frustrating. In this entry I'll explain the structure of a USDT provider with translated arguments in hopes that it's helpful to those starting out on such a project. I'll use the HTTP provider as an example since I know that one best. I also refer to the source of the iscsi provider since the complete source is freely available (as part of iscsitgtd in ON).


A USDT provider with translated arguments is made up of the following pieces:

  • a provider definition file (e.g., http_provider.d), from which an application header and object files will be generated
  • a provider header file (e.g., http_provider.h), generated from the provider definition file with dtrace -h and included by the application, which defines the macros the application uses to fire probes
  • a provider object file (e.g., http_provider.o), generated from the provider definition and application object files with dtrace -G
  • a provider application header file (e.g., http_provider_impl.h), which defines the C structures passed into probes by the application
  • native (C) code that invokes the probe
  • a provider support file (e.g., http.d), delivered into /usr/lib/dtrace, which defines the D structures and translators used in probes at runtime

Putting these together:

  1. The build process takes the provider definition and generates the provider header and provider object files.
  2. The application includes the provider header file as well as the application header file and uses DTrace macros and C structures to fire probes.
  3. The compiled application is linked with the generated provider object file that encodes the probes.
  4. DTrace consumers (e.g., dtrace(1M)) read in the provider support file and instrument any processes containing the specified probes.
It's okay if you didn't follow all that. Let's examine these pieces in more detail.
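
Concretely, the build described above might look like this for the http provider. The file names come from the examples in this entry; dtrace -h and dtrace -G are the standard dtrace(1M) invocations, while the cc flags are illustrative:

```
# 1. generate the header the application includes
dtrace -h -s http_provider.d -o http_provider.h

# 2. compile the application, which includes http_provider.h
cc -c mod_dtrace.c

# 3. generate the provider object from the definition plus the app objects
dtrace -G -s http_provider.d mod_dtrace.o -o http_provider.o

# 4. link the provider object into the final binary
cc -G -o mod_dtrace.so mod_dtrace.o http_provider.o
```

Note that dtrace -G must be run over the already-compiled application objects, since it rewrites the probe call sites it finds there.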

1. Provider definition (and generated files)

The provider definition describes the set of probes made available by that provider. For each probe, the definition describes the arguments passed to DTrace by the probe implementation (within the application) as well as the arguments passed by DTrace to probe consumers (D scripts).

For example, here's the heart of the http provider definition, http_provider.d:

provider http {
        probe request__start(httpproto_t *p) :
            (conninfo_t *p, http_reqinfo_t *p);
        probe request__done(httpproto_t *p) :
            (conninfo_t *p, http_reqinfo_t *p);
        /* ... */
};

The http provider defines two probes: request-start and request-done. (DTrace converts double-underscores to hyphens.) Each of these consumes an httpproto_t from the application (in this case, an Apache module) and provides a conninfo_t and an http_reqinfo_t to DTrace scripts using these probes. Don't worry about the details of these structures just yet.

From the provider definition we build the related http_provider.h and http_provider.o files. The header file is generated by dtrace -h and contains macros used by the application to fire probes. Here's a piece of the generated file (edited to show only the x86 version for clarity):

#define	HTTP_REQUEST_DONE(arg0) \
	__dtrace_http___request__done(arg0)
#define	HTTP_REQUEST_DONE_ENABLED() \
	__dtraceenabled_http___request__done()

So for each probe, we have a macro of the form PROVIDER_PROBENAME(probeargs) that the application uses to fire a probe with the specified arguments. Note that the argument to HTTP_REQUEST_DONE should be an httpproto_t *, since that's what the request-done probe consumes from the application.

The provider object file generated by dtrace -G is also necessary to make all this work, but the mechanics are well beyond the scope of this entry.

2. Application components

The application header file http_provider_impl.h defines the httpproto_t structure for the application; the application passes a pointer to such an object into the probe macro to fire the probe. Here's an example:

typedef struct {
        const char *http_laddr;        /* local IP address (as string) */
        const char *http_uri;          /* URI requested */
        const char *http_useragent;    /* user's browser (User-agent header) */
        uint64_t http_byteswritten;    /* bytes sent to the client */
        /* ... */
} httpproto_t;

The application uses the macros from http_provider.h and the structure defined in http_provider_impl.h to fire a DTrace probe. We also use the is-enabled macros to avoid constructing the arguments when they're not needed. For example:

static void
mod_dtrace_postrequest(request_rec *rr)
{
        httpproto_t hh;

        /* ... */

        if (HTTP_REQUEST_DONE_ENABLED()) {
                /* fill in hh object based on request rr ... */
                HTTP_REQUEST_DONE(&hh);
        }
}


3. DTrace consumer components

What we haven't specified yet is exactly what defines a conninfo_t or http_reqinfo_t, or how to translate an httpproto_t object into a conninfo_t or http_reqinfo_t. These structures and translators must be defined when a consuming D script is compiled (i.e., when a user runs dtrace(1M) and wants to use our probes). These definitions go into what I've called the provider support file, which includes definitions like these:

typedef struct {
        uint32_t http_laddr;
        uint32_t http_uri;
        uint32_t http_useragent;
        uint64_t http_byteswritten;
        /* ... */
} httpproto_t;

typedef struct {
        string hri_uri;                 /* uri requested */
        string hri_useragent;           /* "User-agent" header (browser) */
        uint64_t hri_byteswritten;      /* bytes sent to the client */
        /* ... */
} http_reqinfo_t;

#pragma D binding "1.6.1" translator
translator conninfo_t <httpproto_t *dp> {
        ci_local = copyinstr((uintptr_t)
            *(uint32_t *)copyin((uintptr_t)&dp->http_laddr, sizeof (uint32_t)));
        /* ... */
};

#pragma D binding "1.6.1" translator
translator http_reqinfo_t <httpproto_t *dp> {
        hri_uri = copyinstr((uintptr_t)
            *(uint32_t *)copyin((uintptr_t)&dp->http_uri, sizeof (uint32_t)));
        /* ... */
};

There are a few things to note here:

  • The httpproto_t structure must exactly match the one being used by the application. There's no way to enforce this with just one definition because neither file can rely on the other being available.
  • The above example only works for 32-bit applications. For a similar example that uses the data model (ILP32 vs. LP64) of the process to do the right thing for both 32-bit and 64-bit apps, see /usr/lib/dtrace/iscsi.d.
  • We didn't define conninfo_t. That's because it's defined in /usr/lib/dtrace/net.d. Instead of redefining it, we have our file depend on net.d with this line at the top:
    #pragma D depends_on library net.d

This provider support file gets delivered into /usr/lib/dtrace. dtrace(1M) automatically imports all .d files in this directory (or another directory specified with -xlibdir) on startup. When a DTrace consumer goes to use our probes, the conninfo_t, httpproto_t, and http_reqinfo_t structures are defined, as well as the needed translators. More concretely, when a user writes:

# dtrace -n 'http*:::request-start{printf("%s\n", args[1]->hri_uri);}'

DTrace knows exactly what to do.

Rough edges

Remember http_provider.d? It contained the actual provider and probe definitions, and it referred to the httpproto_t, conninfo_t, and http_reqinfo_t structures without defining them. We've already explained that these structures and translators are defined by the provider support file and used at runtime, so they shouldn't actually be necessary here.

Unfortunately, there's a piece missing that's necessary to work around buggy behavior in DTrace: the D compiler insists on having these definitions and translators available when processing the provider definition file, but those structures and translators won't be used (since this file is not even available at runtime anyway). Even worse, dtrace(1M) doesn't automatically import the files in /usr/lib/dtrace when compiling the provider file, so we can't simply depends_on them.

The end result is that we must define "dummy" structures and translators in the provider file, like this:

typedef struct http_reqinfo {
        int dummy;
} http_reqinfo_t;

typedef struct httpproto {
        int dummy;
} httpproto_t;

typedef struct conninfo {
        int dummy;
} conninfo_t;

translator conninfo_t <httpproto_t *dp> {
        dummy = 0;
};

translator http_reqinfo_t <httpproto_t *dp> {
        dummy = 0;
};

We also need stability attributes like the following to use the probes:

#pragma D attributes Evolving/Evolving/ISA provider http provider
#pragma D attributes Private/Private/Unknown provider http module
#pragma D attributes Private/Private/Unknown provider http function
#pragma D attributes Private/Private/ISA provider http name
#pragma D attributes Evolving/Evolving/ISA provider http args

You can see both of these in the iscsi provider as well.


I've now covered all the pieces, but there are other design decisions in implementing a provider. For example, what arguments should the probes consume from the application, and what should be provided to D scripts? We chose structures on both sides because structures are much easier to evolve as the provider grows, but that necessitates the ugly translators and multiple definitions. Had we used plain pointer and integer arguments instead, we'd need no structures, and therefore no translators, and could leave out several of the files described above; but the probes would be unwieldy to use, and consumers would need to use copyin/copyinstr themselves.
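
For contrast, here's what a consumer of a hypothetical raw-argument version of the probe might look like. I'm inventing the argument layout for illustration: say arg1 were a char * URI in the application's address space and arg2 a byte count. The script would have to copy data out itself:

```
http*:::request-done
{
        /* arg1 is a user-space pointer; the script must copyinstr() it */
        printf("%s %d bytes\n", copyinstr(arg1), arg2);
}
```

With translated arguments, that copying lives once in the translator, and every consumer just reads args[1]->hri_uri.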

Both the HTTP and iSCSI providers instrument network-based protocols. For consistency, providers for these protocols (which also include NFS, CIFS, and FTP on the 7000 series) use the same conventions for probe argument types and names (e.g., conninfo_t as the first argument, followed by protocol-specific arguments).


USDT with translated arguments is extremely powerful, but the documentation is somewhat lacking and there are still some rough edges for implementers. I hope this example is valuable for people trying to put the pieces together. If you want to get involved in documenting this, contact the DTrace community.

Monday Apr 27, 2009

2009.Q2 Released

Today we've released the first major software update for the 7000 series, called 2009.Q2. This update contains a boatload of bugfixes and new features, including support for HTTPS (HTTP with SSL using self-signed certificates). This makes HTTP user support more tenable in less secure environments because credentials don't have to be transmitted in the clear.

Another updated feature that's important for RAS is enhanced support bundles. Support bundles are tarballs containing core files, log files, and other debugging output generated by the appliance that can be sent directly to support personnel. In this release, support bundles collect more useful data about failed services and (critically) can be created even when the appliance's core services have failed or network connectivity has been lost. You can also monitor support bundle progress on the Maintenance -> System screen, and bundles can be downloaded directly from the browser for environments where the appliance cannot connect directly to Sun Support. All of these improvements help us to track down problems remotely, relying as little as possible on the functioning system or the administrator's savvy.

See the Fishworks blog for more details on this release. Enjoy!

