When designing a disaster recovery solution using remote
replication, two important parameters are the recovery time
objective (RTO) and the recovery point objective
(RPO). For these purposes, a disaster is any event resulting
in permanent data loss at the primary site which requires restoring
service using data recovered from the disaster recovery (DR) site. The
RTO is how soon service must be restored after a disaster. Designing a DR solution to meet a specified RTO is complex but essentially boils down to ensuring that the
recovery plan can be executed within the allotted time. Usually this
means keeping the plan simple, automating as much of it as possible,
documenting it carefully, and testing it frequently. (It helps to use
storage systems with built-in features
for this.)
In this entry I want to talk about RPO. RPO
describes how recent the data recovered after the disaster must be (in other words, how much
data loss is acceptable in the event of disaster). An RPO of 30 minutes means that the recovered data
must include all changes up to 30 minutes before the disaster. But how do businesses decide how much data loss is
acceptable in the event of a disaster? The answer varies greatly from
case to case. Photo storage for a small social networking site may have
an RPO of a few hours; in the worst case, users who uploaded photos a
few hours before the event will have to upload them again, which isn't usually a
big deal. A stock exchange, by contrast, may have a requirement for
zero data loss: the disaster recovery site absolutely must have the most
up-to-date data because otherwise different systems may disagree about
whose account owns a particular million dollars. Of course, most businesses are
somewhere in between: it's important that data be "pretty fresh," but
some loss is acceptable.
Replication solutions used for disaster recovery typically fall into two
buckets: synchronous and
asynchronous. Synchronous replication means
that clients making changes on disk at the primary site can't proceed until that data also resides on disk at the disaster recovery
site. For example, a database at the primary site won't consider a
transaction to have completed until the underlying storage system
indicates that the changed data has been stored on disk. If the storage
system is configured for synchronous replication, that also means that
the change is on disk at the DR site, too. If a disaster occurs at the
primary site, there's no data loss when the database is recovered from
the DR site because no transactions were committed that weren't also
propagated to the DR site. So using synchronous replication it's
possible to implement a DR strategy with zero data loss in the event of
disaster -- at great cost, discussed below.
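To make the acknowledgement rule concrete, here is a minimal sketch in Python. It is a toy model, not how any real storage system is implemented: the ReplicatedVolume class and its method names are made up for illustration, and "on disk" is just a list in memory. The asynchronous mode described in the next paragraph is included for contrast.

```python
class ReplicatedVolume:
    """Toy model of a primary volume with a DR copy (hypothetical names)."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = []  # records "on disk" at the primary site
        self.dr = []       # records "on disk" at the DR site

    def write(self, record):
        """Store a record and return once it may be acknowledged."""
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the acknowledgement waits for the DR copy.
            self.dr.append(record)
        return "ack"

    def replicate(self):
        """Asynchronous catch-up: copy whatever the DR site is missing."""
        self.dr.extend(self.primary[len(self.dr):])

    def recover_after_disaster(self):
        """The primary is gone; only what reached the DR site survives."""
        return list(self.dr)


def demo(synchronous):
    vol = ReplicatedVolume(synchronous)
    vol.write("txn-1")
    vol.replicate()     # a no-op in synchronous mode
    vol.write("txn-2")  # acknowledged, then disaster strikes
    return vol.recover_after_disaster()
```

In this model demo(True) recovers both transactions, while demo(False) recovers only txn-1, because txn-2 was acknowledged before it reached the DR site.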
By contrast, asynchronous replication means that the
storage system can acknowledge changes before they've been replicated to
the DR site. In the database example, the database (still using
synchronous I/O at the primary site) considers the transaction completed
as long as the data is on disk at the primary site. If a disaster
occurs at the primary site, the data that hasn't been replicated will be
lost. This sounds bad, but if your RPO is 30 minutes and the primary
site replicates to the target every 10 minutes, for example, then you
can still meet your DR requirements.
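As a rough check of that arithmetic, the sketch below estimates the worst-case data loss for a periodic schedule. It assumes (my assumption, not something every system guarantees) that an update only becomes usable at the DR site once it has been fully received, so the worst case is the replication interval plus the time one update takes to transfer. The function names and the transfer times in the usage note are hypothetical.

```python
def worst_case_data_loss(interval_min, transfer_min):
    """Worst-case age, in minutes, of the newest complete update at the DR site.

    Just before an update finishes arriving, the DR site still holds the
    previous one, which was taken one interval earlier.
    """
    return interval_min + transfer_min


def meets_rpo(rpo_min, interval_min, transfer_min):
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(interval_min, transfer_min) <= rpo_min
```

With a 30-minute RPO and 10-minute updates, meets_rpo(30, 10, 5) is True if each update takes 5 minutes to send, but meets_rpo(30, 10, 25) is False if updates take 25 minutes, so the transfer time matters as much as the interval.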
As I suggested above, synchronous replication comes at great cost,
particularly in three areas: availability, performance, and money.
With asynchronous replication, the availability costs shrink as the RPO increases. If the RPO is 24 hours and it takes only 1 hour to replicate a day's changes, then you can sustain a 23-hour outage at the DR site or on the network link without impacting primary site availability at all. The only performance cost of asynchronous replication is the added latency from the extra load that the replication software places on the system (which you'd also have to worry about with synchronous replication). There's no additional latency from the DR network link or the DR site as long as the system is able to keep up with sending updates.
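The outage budget in that example is simple subtraction, sketched here in Python. The function name is hypothetical, and the model assumes the catch-up time after an outage is known in advance.

```python
def tolerable_outage_hours(rpo_hours, catch_up_hours):
    """Longest DR-site or link outage that still leaves time to catch up
    before the DR copy becomes older than the RPO allows."""
    return max(0, rpo_hours - catch_up_hours)
```

For the numbers above, tolerable_outage_hours(24, 1) gives 23. As the RPO approaches the catch-up time the budget drops to zero, which is the situation synchronous replication is always in.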
The 7000 series provides two types of automatic asynchronous replication: scheduled and continuous. Scheduled replication sends discrete, snapshot-based updates at predefined hourly, daily, weekly, or monthly intervals. Continuous replication sends the same discrete, snapshot-based updates as frequently as possible. The result is essentially a continuous stream of filesystem changes to the DR site.
Of course, there are tradeoffs to both of these approaches. In most cases, continuous replication minimizes data loss in the event of a disaster (i.e., it will achieve minimal RPO), since the system is replicating changes as fast as possible to the DR site without actually holding up production clients. However, if the RPO is a given parameter (as it often is), you can just as well choose a scheduled replication interval that will achieve that RPO, in which case using continuous replication doesn't buy you anything. In fact, it can hurt because continuous replication can result in transferring significantly more data than necessary. For example, if an application fills a 10GB scratch file and rewrites it over the next half hour, continuous replication will send 20GB, while half-hourly scheduled replication will only send 10GB. If you're paying for bandwidth, or if the capacity of a pair of systems is limited by the available replication bandwidth, these costs can add up.
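The scratch-file example can be reproduced with a small simulation. This is a sketch under simplifying assumptions of my own: the file is 10 blocks of 1 GB each, continuous replication is approximated by a snapshot every minute, and each update sends every block modified since the previous snapshot exactly once. The function and variable names are hypothetical.

```python
def blocks_sent(writes, interval_min):
    """Count blocks sent by snapshot-based replication at a fixed interval.

    writes: iterable of (time_in_minutes, block_id) pairs.
    Each update sends every block modified since the previous snapshot
    once, no matter how many times it was rewritten in between.
    """
    dirty_by_window = {}
    for time_min, block_id in writes:
        window = int(time_min // interval_min)
        dirty_by_window.setdefault(window, set()).add(block_id)
    return sum(len(blocks) for blocks in dirty_by_window.values())


# Hypothetical workload: fill the file one block per minute, then rewrite
# it one block per minute starting at minute 15.
fill = [(t, t) for t in range(10)]
rewrite = [(15 + t, t) for t in range(10)]
writes = fill + rewrite

half_hourly = blocks_sent(writes, 30)     # every write lands in one window
near_continuous = blocks_sent(writes, 1)  # one snapshot per minute
```

Here half_hourly is 10 blocks (10 GB) and near_continuous is 20 blocks (20 GB), matching the figures in the example: the half-hourly update only sees the final contents of each block, while the near-continuous stream ships both versions.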
Storage systems provide many options for configuring remote replication for disaster recovery, including synchronous (zero data loss) and both continuous and scheduled asynchronous replication. It's easy to see these options and guess a reasonable strategy for minimizing data loss in the event of disaster, but it's important that actual recovery point objectives be defined based on business needs and that those objectives drive the planning and deployment of the DR solution.
Incidentally, some systems use a hybrid approach that uses synchronous replication when possible, but avoids the availability cost (described above) of that approach by falling back to continuous asynchronous replication if the DR link or DR system fails. This seems at first glance a good compromise, but it's unclear what problem this solves because the resulting system pays the performance and monetary costs of synchronous replication without actually guaranteeing zero data loss.