Tuesday Sep 21, 2010

Replication for disaster recovery

When designing a disaster recovery solution using remote replication, two important parameters are the recovery time objective (RTO) and the recovery point objective (RPO). For these purposes, a disaster is any event resulting in permanent data loss at the primary site which requires restoring service using data recovered from the disaster recovery (DR) site. The RTO is how soon service must be restored after a disaster. Designing a DR solution to meet a specified RTO is complex but essentially boils down to ensuring that the recovery plan can be executed within the allotted time. Usually this means keeping the plan simple, automating as much of it as possible, documenting it carefully, and testing it frequently. (It helps to use storage systems with built-in features for this.)

In this entry I want to talk about RPO. RPO describes how recent the data recovered after the disaster must be (in other words, how much data loss is acceptable in the event of disaster). An RPO of 30 minutes means that the recovered data must include every change made up to the point 30 minutes before the disaster. But how do businesses decide how much data loss is acceptable in the event of a disaster? The answer varies greatly from case to case. Photo storage for a small social networking site may have an RPO of a few hours; in the worst case, users who uploaded photos a few hours before the event will have to upload them again, which isn't usually a big deal. A stock exchange, by contrast, may have a requirement for zero data loss: the disaster recovery site absolutely must have the most up-to-date data because otherwise different systems may disagree about whose account owns a particular million dollars. Of course, most businesses are somewhere in between: it's important that data be "pretty fresh," but some loss is acceptable.

Replication solutions used for disaster recovery typically fall into two buckets: synchronous and asynchronous. Synchronous replication means that clients making changes on disk at the primary site can't proceed until that data also resides on disk at the disaster recovery site. For example, a database at the primary site won't consider a transaction complete until the underlying storage system indicates that the changed data has been stored on disk. If the storage system is configured for synchronous replication, that also means that the change is on disk at the DR site. If a disaster occurs at the primary site, there's no data loss when the database is recovered from the DR site because no transactions were committed that weren't also propagated to the DR site. So using synchronous replication it's possible to implement a DR strategy with zero data loss in the event of disaster -- at great cost, discussed below.

By contrast, asynchronous replication means that the storage system can acknowledge changes before they've been replicated to the DR site. In the database example, the database (still using synchronous I/O at the primary site) considers the transaction completed as long as the data is on disk at the primary site. If a disaster occurs at the primary site, the data that hasn't been replicated will be lost. This sounds bad, but if your RPO is 30 minutes and the primary site replicates to the target every 10 minutes, for example, then you can still meet your DR requirements.
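
To make the RPO check above concrete, here is a minimal Python sketch. It assumes a simple model that is not spelled out in the text: each update captures changes up to the moment it starts, and it only becomes usable at the DR site once the transfer finishes. The function names and the 5-minute transfer time are hypothetical.

```python
def worst_case_data_loss_minutes(interval_min: float, transfer_min: float) -> float:
    """Worst-case age (in minutes) of the newest data available at the DR site.

    Assumed model: an update that starts at time t contains changes up to t
    and is complete at t + transfer_min. Just before the *next* update
    completes, the DR copy is interval_min + transfer_min old.
    """
    return interval_min + transfer_min


def meets_rpo(rpo_min: float, interval_min: float, transfer_min: float) -> bool:
    """True if the worst-case data loss stays within the RPO."""
    return worst_case_data_loss_minutes(interval_min, transfer_min) <= rpo_min


if __name__ == "__main__":
    # RPO of 30 minutes, updates every 10 minutes, each taking 5 minutes to send.
    print(meets_rpo(rpo_min=30, interval_min=10, transfer_min=5))
```

Under this model the replication interval alone is not the whole story: the time each update takes to transfer also counts against the RPO.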

Synchronous replication

As I suggested above, synchronous replication comes at great cost, particularly in three areas:

  • availability: In order to truly guarantee no data loss, a synchronous replication system must not acknowledge writes to clients if it can't also replicate that data to the DR site. If the DR site is unavailable, the system must either explicitly fail or just block writes until the problem is resolved, depending on what the application expects. Either way, the entire system now fails if any combination of the primary site, the DR site, or the network link between them fails, instead of just the primary site. If you've got 99% reliability in each of these components, the system's probability of failure goes from 1% to almost 3% (1 - 0.99^3, assuming the three fail independently).
  • performance: In a system using synchronous replication, the latency of client writes includes the time to write to both storage systems plus the network latency between the sites. DR sites are typically located several miles or more from the primary site in case the disaster takes the form of a datacenter-wide power outage or an actual natural disaster. All things being equal, the farther apart the sites are, the higher the network latency between them. To make things concrete, consider a single-threaded client that sees 300us write operations at the primary site (total time including network round trip plus time to write the data to local storage only, not the remote site). Add a synchronous replication target a few miles away with just 500us latency from the primary site and the operation now takes 800us, dropping IOPS from about 3330 to about 1250.
  • money: Of course, this is what it ultimately boils down to. Synchronous replication software alone can cost quite a bit, but you can also end up spending a lot to deal with the above availability and performance problems: clustered head nodes at both sites, redundant network hardware, and very low latency switches and network connections. You might buy some of this for asynchronous replication, too, but network latency is much less important than bandwidth for asynchronous replication, so you can often save on the network side.
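
The availability and performance figures in the list above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes the three components fail independently and that the client issues exactly one write at a time. The function names are made up for illustration.

```python
def failure_probability(component_reliability: float, components: int) -> float:
    """Probability that at least one of `components` independent parts is down."""
    return 1 - component_reliability ** components


def single_threaded_iops(latency_us: float) -> float:
    """Writes per second for a client that waits for each write to finish."""
    return 1_000_000 / latency_us


if __name__ == "__main__":
    # Primary site alone vs. primary site + DR site + network link.
    print(f"primary only:        {failure_probability(0.99, 1):.2%}")
    print(f"with synchronous DR: {failure_probability(0.99, 3):.2%}")

    # 300us local write vs. 300us + 500us to reach the replication target.
    print(f"local only:  {single_threaded_iops(300):.0f} IOPS")
    print(f"synchronous: {single_threaded_iops(300 + 500):.0f} IOPS")
```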

Asynchronous replication

The above availability costs scale back as the RPO increases. If the RPO is 24 hours and it only takes 1 hour to replicate a day's changes, then you can sustain a 23-hour outage at the DR site or on the network link without impacting primary site availability at all. The only performance cost of asynchronous replication is the added latency from the extra load that the replication software places on the system (which you'd also have to worry about with synchronous replication). There's no additional latency from the DR network link or the DR site as long as the system is able to keep up with sending updates.
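
The 23-hour figure is simply the RPO minus the time needed to catch up. A small Python sketch of that calculation, with a hypothetical function name, assuming the catch-up transfer takes the same time after an outage as it does normally:

```python
def tolerable_outage_hours(rpo_hours: float, catch_up_hours: float) -> float:
    """Longest DR-site or link outage after which replication can still
    finish an update before the RPO is exceeded (never negative)."""
    return max(0.0, rpo_hours - catch_up_hours)


if __name__ == "__main__":
    # 24-hour RPO, 1 hour to replicate a day's changes.
    print(tolerable_outage_hours(rpo_hours=24, catch_up_hours=1))
```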

The 7000 series provides two types of automatic asynchronous replication: scheduled and continuous. Scheduled replication sends discrete, snapshot-based updates at predefined hourly, daily, weekly, or monthly intervals. Continuous replication sends the same discrete, snapshot-based updates as frequently as possible.  The result is essentially a continuous stream of filesystem changes to the DR site.

Of course, there are tradeoffs to both of these approaches. In most cases, continuous replication minimizes data loss in the event of a disaster (i.e., it will achieve minimal RPO), since the system is replicating changes as fast as possible to the DR site without actually holding up production clients. However, if the RPO is a given parameter (as it often is), you can just as well choose a scheduled replication interval that will achieve that RPO, in which case using continuous replication doesn't buy you anything.  In fact, it can hurt because continuous replication can result in transferring significantly more data than necessary. For example, if an application fills a 10GB scratch file and rewrites it over the next half hour, continuous replication will send 20GB, while half-hourly scheduled replication will only send 10GB.  If you're paying for bandwidth, or if the capacity of a pair of systems is limited by the available replication bandwidth, these costs can add up.
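
The scratch-file example can be checked with a toy model of snapshot-based incremental replication. The sketch below is not how the appliance is implemented; it assumes 1GB blocks, treats "continuous" as a snapshot every minute, and assumes each update sends every block modified since the previous snapshot exactly once. All names are hypothetical.

```python
def gb_replicated(writes, snapshot_times, block_gb=1):
    """Total GB sent by snapshot-based incremental replication.

    writes: iterable of (time_in_minutes, block_index) pairs.
    snapshot_times: times at which an update (snapshot) is taken.
    A block rewritten several times between two snapshots is sent once.
    """
    writes = list(writes)
    total = 0
    previous = float("-inf")
    for snapshot in sorted(snapshot_times):
        dirty = {block for t, block in writes if previous < t <= snapshot}
        total += len(dirty) * block_gb
        previous = snapshot
    return total


# A 10GB scratch file: all 10 blocks are filled at t=0, then each block is
# rewritten once over the next half hour (one block every 3 minutes).
FILL = [(0, block) for block in range(10)]
REWRITE = [(3 * (block + 1), block) for block in range(10)]
WRITES = FILL + REWRITE

if __name__ == "__main__":
    print("half-hourly:", gb_replicated(WRITES, [30]), "GB")
    print("every minute:", gb_replicated(WRITES, range(0, 31)), "GB")
```

In this model the half-hourly schedule sends each block once, while the frequent snapshots send the fill and then every rewrite, which matches the 10GB versus 20GB figures above.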


Storage systems provide many options for configuring remote replication for disaster recovery, including synchronous (zero data loss) and both continuous and scheduled asynchronous replication. It's easy to see these options and guess a reasonable strategy for minimizing data loss in the event of disaster, but it's important that actual recovery point objectives be defined based on business needs and that those objectives drive the planning and deployment of the DR solution. Incidentally, some systems use a hybrid approach that uses synchronous replication when possible, but avoids the availability cost (described above) of that approach by falling back to continuous asynchronous replication if the DR link or DR system fails. This seems at first glance a good compromise, but it's unclear what problem this solves because the resulting system pays the performance and monetary costs of synchronous replication without actually guaranteeing zero data loss.

Sunday Apr 18, 2010

Replication in 2010.Q1

This post is long overdue since 2010.Q1 came out over a month ago now, but it's better late than never. The bullet-point feature list for 2010.Q1 typically includes something like "improved remote replication", but what do we mean by that? The summary is vague because, well, it's hard to summarize what we did concisely. Let's break it down:

Improved stability. We've rewritten the replication management subsystem. Informed by the downfalls of its predecessor, the new design avoids large classes of problems that were customer pain points in older releases. The new implementation also keeps more of the relevant debugging data that allows us to drive new issues to root-cause faster and more reliably.

Enhanced management model. We've formalized the notion of packages, which were previously just "replicas" or "replicated projects". Older releases mandated that a given project could only be replicated to a given target once (at a time) and that only one copy of a project could exist on a particular target at a time. 2010.Q1 supports multiple actions for a given project and target, each one corresponding to an independent copy on the target called a "package." This allows administrators to replicate a fresh copy without destroying the one that's already on the target.

Share-level replication. 2010.Q1 supports more fine-grained control of replication configuration, like leaving an individual share out of its project's replication configuration or replicating a share by itself without the other shares in its project.

Optional SSL encryption for improved performance. Older releases always encrypt the data sent over the wire. 2010.Q1 still supports this, but also lets customers disable SSL encryption for significantly improved performance when the security of data on the wire isn't so critical (as in many internal environments).

Bandwidth throttling. The system now supports limiting the bandwidth used by individual replication actions. With this, customers with limited network resources can keep replication from hogging the available bandwidth and starving the client data path.
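
For readers unfamiliar with bandwidth limiting, one common general technique is a token bucket. The Python sketch below illustrates that technique only; it is not the appliance's implementation, and the class, its parameters, and the simulated clock are all hypothetical.

```python
class TokenBucket:
    """Token-bucket rate limiter driven by a simulated clock (in seconds)."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.now = 0.0             # simulated time; no real sleeping

    def send(self, nbytes: float) -> float:
        """Return the simulated time at which `nbytes` may be sent."""
        if nbytes > self.capacity:
            raise ValueError("chunk is larger than the burst size")
        if self.tokens < nbytes:
            # Wait just long enough for the bucket to refill to nbytes.
            self.now += (nbytes - self.tokens) / self.rate
            self.tokens = nbytes
        self.tokens -= nbytes
        return self.now


def time_to_send(chunks: int, chunk_bytes: int, bucket: TokenBucket) -> float:
    """Simulated time at which the last of `chunks` equal chunks is sent."""
    finished = 0.0
    for _ in range(chunks):
        finished = bucket.send(chunk_bytes)
    return finished


if __name__ == "__main__":
    # Ten 1MB chunks through a 1MB/s limit with a 1MB burst allowance.
    bucket = TokenBucket(rate_bytes_per_s=1_000_000, burst_bytes=1_000_000)
    print(time_to_send(10, 1_000_000, bucket), "seconds")
```

The first chunk goes out immediately using the burst allowance, and each later chunk waits for the bucket to refill, so the sender never exceeds the configured rate over time.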

Improved target-side management. Administrators can browse replicated projects and shares in the BUI and CLI just like local projects and shares. You can also view properties of these shares and even change them where appropriate. For example, the NFS export list can be customized on the target, which is important for disaster-recovery plans where the target will serve different clients in a different datacenter. Or you could enable stronger compression on the target, saving disk space at the expense of performance, which may be less important on a backup site.

Read-only view of replicated filesystems and snapshots. This is pretty self-explanatory. You can now export replicated filesystems read-only over NFS, CIFS, HTTP, FTP, etc., allowing you to verify the data, run incremental NDMP backups, or perform data analysis that's too expensive to run on the primary system. You can also see and clone the non-replication snapshots.

Then there are lots of small improvements, like being able to disable replication globally, per-action, or per-package, which is very handy when trying it out or measuring performance. Check out the documentation (also much improved) for details.



