Friday Apr 03, 2015

Oracle Data Protection: How Do You Measure Up? - Part 2

In part 1 of this blog series, we reviewed the results of a database protection survey conducted last year by Database Trends and Applications (DBTA) Magazine. In summary, the most prominent backup and recovery challenges faced by DBAs today are: 

• Poor performance and impact on productivity
• Management complexity
• Lack of continuous data protection

In today's blog, we will take a look at the reasons why these challenges still exist for databases, even after decades of advancements in backup and storage technologies.

Let’s review a few of today’s common approaches to database backup and recovery and the trade-offs that exist with each.

Weekly Full with Daily Incremental

This strategy involves weekly full (level 0) and daily incremental (level 1) backups to disk and/or tape. This means businesses incur the overhead of full backups every week – leading to the dilemma of the “backup window”, the period when backups are ideally performed with the least amount of disruption to production. But with larger data sets and longer production cycles, full backups increasingly extend past their backup windows and encroach on production systems, which become too stretched to support all production applications at normal performance levels. 
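
To make the backup-window pressure concrete, here is a rough back-of-the-envelope sketch. The function names, the database sizes, the 500 MB/s sustained throughput, and the 6-hour window are all illustrative assumptions, not figures from the survey.

```python
def full_backup_hours(db_size_tb: float, throughput_mb_per_s: float) -> float:
    """Estimated hours to copy db_size_tb at a sustained throughput (decimal units)."""
    size_mb = db_size_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_per_s / 3600


def fits_window(db_size_tb: float, throughput_mb_per_s: float, window_hours: float) -> bool:
    """True if the estimated full backup finishes inside the backup window."""
    return full_backup_hours(db_size_tb, throughput_mb_per_s) <= window_hours


if __name__ == "__main__":
    # Assumed sizes and throughput; substitute your own measurements.
    for size_tb in (5, 20, 50):
        hours = full_backup_hours(size_tb, throughput_mb_per_s=500)
        print(f"{size_tb} TB: {hours:.1f} h, fits 6 h window: {fits_window(size_tb, 500, 6)}")
```

The estimate grows linearly with database size while the window stays fixed, which is the dilemma described above.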

Incrementally Updated Backups

With this strategy, an initial image copy backup (level 0) is taken to disk, followed by daily incremental backups. The image copy is then ‘rolled forward’ by applying incrementals to produce a new on-disk copy that corresponds to the most recent or a previous incremental timestamp. The copy can be restored as needed; however, once the copy is rolled forward, it cannot be ‘rolled back’, thus constraining the range of recovery points. In addition, the act of applying the incrementals to the copy requires database resources and further impacts production systems.
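
The one-way nature of the rolled-forward copy can be sketched with a toy model. This is not RMAN; the `ImageCopy` class, its method names, and the use of day numbers as timestamps are hypothetical, chosen only to show why the recovery range shrinks.

```python
from dataclasses import dataclass, field


@dataclass
class ImageCopy:
    """Toy model of an on-disk image copy that can only move forward in time."""
    as_of_day: int                                  # day the copy currently corresponds to
    pending: list = field(default_factory=list)     # incremental days not yet applied

    def take_incremental(self, day: int) -> None:
        self.pending.append(day)

    def roll_forward(self, to_day: int) -> None:
        """Apply every pending incremental up to to_day; there is no inverse operation."""
        if to_day < self.as_of_day:
            raise ValueError("an image copy cannot be rolled back")
        applied = [d for d in self.pending if d <= to_day]
        if applied:
            self.as_of_day = max(applied)
        self.pending = [d for d in self.pending if d > to_day]

    def earliest_recovery_day(self) -> int:
        """The copy itself cannot produce a state older than this day."""
        return self.as_of_day


if __name__ == "__main__":
    copy = ImageCopy(as_of_day=0)
    for day in range(1, 8):
        copy.take_incremental(day)
    copy.roll_forward(5)
    print(copy.earliest_recovery_day(), copy.pending)
```

Once `roll_forward(5)` has run, days 0 through 4 are no longer reachable from this copy alone, which is the constraint on recovery points mentioned above.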

Daily Full to Deduplication Appliances

In recent years, backups to deduplication appliances have become more prevalent, with the goal of driving down storage costs through automatic elimination of redundant backup data. In order to drive down storage costs, deduplication (i.e. savings) ratios must be driven up – thus, storage vendors generally recommend a full backup-only strategy to these appliances. However, for most enterprise databases, a full backup-only strategy is effective from neither a backup window nor a system utilization perspective.
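
A small worked example shows how the backup strategy alone moves the ratio. It assumes an idealized appliance that stores each new or changed block exactly once, a 10 TB database, and a 2% daily change rate; all of these numbers and the function names are assumptions for illustration.

```python
def weekly_totals(db_tb: float, daily_change: float, strategy: str):
    """Return (logical TB sent, TB actually stored) for one week of backups."""
    # Unique data in the week: one full image plus six days of changed blocks.
    unique_tb = db_tb + 6 * db_tb * daily_change
    if strategy == "daily_full":
        logical_tb = 7 * db_tb          # the whole database is sent every day
    elif strategy == "full_plus_incremental":
        logical_tb = unique_tb          # only the changes are sent after day one
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return logical_tb, unique_tb


def dedup_ratio(logical_tb: float, stored_tb: float) -> float:
    """Deduplication ratio: data sent to the appliance divided by data stored."""
    return logical_tb / stored_tb


if __name__ == "__main__":
    for strategy in ("daily_full", "full_plus_incremental"):
        logical, stored = weekly_totals(10, 0.02, strategy)
        print(f"{strategy}: sent {logical:.1f} TB, stored {stored:.1f} TB, "
              f"ratio {dedup_ratio(logical, stored):.2f}")
```

Under these assumptions both strategies store the same amount of data. The daily-full strategy reports the higher ratio only because it sends far more redundant data, and sending that data is what costs backup-window time and system resources.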

Deduplication appliances are typically based on a single controller architecture, in which the compute power and bandwidth are fixed in a given appliance and cannot be altered per business needs. The user can add storage expansion shelves, but this only increases the total storage capacity managed by the system. Once the maximum storage limit is reached in a single appliance, the user must perform a forklift upgrade to the next higher model or buy additional, independently managed appliances, whose compute and storage resources cannot be shared, thereby increasing management complexity. 

Finally, single controller architecture systems severely limit the resilience of the system. Any single component failure in the controller can render the appliance unusable until the component is replaced, thus impacting backup service levels and the ability to support continuous data protection.

Storage Snapshots

Storage-based snapshots of production databases represent another ‘backup’ strategy whereby only new and changed data is stored. A file snapshot is just a set of pointers to all the unchanged and before-change blocks that make up the file, as of the snapshot time. Since a snapshot is tied to the production storage, it cannot serve as a true backup in the event of storage corruption, loss, or site disaster. Since snapshots are created outside of the Oracle database, they are not validated for Oracle block correctness until they are restored and the database is opened. Thus, snapshots by design cannot truly provide continuous data protection against storage corruptions and failures.
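
The "set of pointers" idea can be sketched with a toy pointer-based volume (redirect-on-write style). The `Volume` class and its methods are hypothetical and do not correspond to any storage vendor's implementation.

```python
class Volume:
    """Toy volume in which a snapshot is a map of block number -> physical block id."""

    def __init__(self, blocks):
        self.store = {}       # physical block id -> contents (shared production storage)
        self.live = {}        # block number -> physical block id for the live file
        self._next_id = 0
        for block_no, data in enumerate(blocks):
            self.write(block_no, data)

    def write(self, block_no, data):
        # New data lands in a new physical block; older blocks stay for snapshots.
        self.store[self._next_id] = data
        self.live[block_no] = self._next_id
        self._next_id += 1

    def snapshot(self):
        # Only pointers are copied, never the data itself.
        return dict(self.live)

    def read(self, pointers, block_no):
        return self.store[pointers[block_no]]


if __name__ == "__main__":
    vol = Volume(["a", "b", "c"])
    snap = vol.snapshot()
    vol.write(1, "B")
    print(vol.read(snap, 1), vol.read(vol.live, 1))   # snapshot view vs live view

    vol.store.clear()                                 # simulate loss of the storage
    try:
        vol.read(snap, 1)
    except KeyError:
        print("snapshot is unreadable once the underlying storage is gone")
```

The snapshot keeps its before-change view only as long as the shared store survives, which is why it cannot stand in for a backup held on separate storage.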

In summary, several shortcomings come to the fore with today’s backup technologies, illustrating their inability to address the backup and recovery challenges highlighted above:

Poor performance and impact on productivity

• Prolonged Backup Windows: As databases continue to grow, so do backup windows – and that can result in network and storage resources being tied up longer and more frequently, with much less efficient utilization of overall IT resources.

• Reduced Production Performance: Longer backup windows mean prolonged impact on production performance, stealing cycles and resources away from more critical production workloads.

Management complexity

• Fragmented Backup Processes: Storage appliances treat Oracle backups as just generic files, with no connection back to the databases they support, leading to a lack of visibility and no assurance that the backup is healthy and usable, whether it is on disk, tape, or a replica appliance.

Lack of continuous data protection

• Increased Data Loss Exposure: Data can only be recovered to the last good backup, e.g. hours or days ago. In addition, generic storage systems cannot inherently validate backups at the Oracle block level for restore consistency.

• Reduced Operational Scalability: Storage appliances cannot easily scale to handle massive backup workloads and concurrent connections from hundreds to thousands of databases across the enterprise.

In the upcoming final blog post in the series, we will discuss how Oracle’s Zero Data Loss Recovery Appliance provides a comprehensive backup and recovery solution that directly addresses these shortcomings for Oracle databases of any size, providing continuous Oracle-validated database protection.

Tuesday Sep 20, 2011

To SYNC or not to SYNC – Part 2

It’s less than two weeks from Oracle OpenWorld! We are going to have an exciting set of sessions from the Oracle HA Development team. Needless to say, all of us are a wee bit busy these days. I think that’s just the perfect time for Part 2 of this multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC).

In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In case you are wondering what the truth is, and don’t have time to read the previous article, the answer is - No, Data Guard synchronous redo transport is NOT the same as two-phase commit.

Now, let’s look into how network latency may or may not impact a Data Guard SYNC configuration.


The network latency issue is a valid concern. That’s a simple law of physics. We have heard of the term “lightspeed” (remember Star Wars?), but still - as you know from your high school physics days, light takes time to travel. So the acknowledgement from RFS back to NSS will take some milliseconds to traverse the network, and that is typically proportional to the network distance.

Actually - it is both network latency and disk I/O latency. Why disk I/O latency? Remember, on the standby database, RFS is writing the incoming redo blocks on disk-resident SRLs. This is governed by the AFFIRM attribute of the log_archive_dest_n parameter corresponding to the standby database. We had one customer whose SYNC performance on the primary was suffering because of an improperly tuned standby storage system.

However, in most cases, network latency is likely to be the gating factor. For example, refer to this real-time network latency chart from AT&T: at the time of writing this blog, US coast-to-coast latency (SF - NY) is shown to be around 75 ms. Trans-Atlantic latency is shown to be around 80 ms, whereas Trans-Pacific latency is shown to be around 140 ms. Of course, you can measure the latency between your own primary and standby servers using utilities such as “ping” and “traceroute”.
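
For a feel of the physics floor, here is a minimal sketch that computes the lowest possible round-trip time over a given distance. It assumes light travels through fiber at roughly two-thirds of its vacuum speed and that the cable follows a straight path; the distances in the example are rough assumptions, and the function name is made up for illustration.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # speed of light in a vacuum
FIBER_FACTOR = 2 / 3                # assumption: light in fiber moves at about 2/3 of c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over a straight fiber path, in milliseconds."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    # Approximate straight-line distances, assumed for illustration only.
    for label, km in (("same metro area", 50), ("coast to coast", 4_100)):
        print(f"{label}: at least {min_round_trip_ms(km):.1f} ms round trip")
```

Measured latencies will be higher than this bound, since real routes are not straight and every switch and router along the way adds delay, so treat the result as a floor and rely on your own measurements.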

Here is some good news - in Oracle Database 11g Release 2, the write to local online redo logs (by LGWR) and the remote write through the network layer (by NSS) happen in parallel. So we do get some efficiency through these parallel local write and network send operations.
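
A simple timing model shows what the parallel write buys. It is only a sketch: the function name and the millisecond figures are assumptions, and the remote time is taken to be the network round trip plus the standby disk write.

```python
def sync_commit_ms(local_write_ms: float, network_rtt_ms: float,
                   standby_write_ms: float, parallel: bool = True) -> float:
    """Estimated time a commit waits on redo under SYNC, in milliseconds."""
    remote_ms = network_rtt_ms + standby_write_ms
    if parallel:
        # Local write and remote send overlap, so the slower of the two dominates.
        return max(local_write_ms, remote_ms)
    # If the two steps ran one after the other, their times would add up.
    return local_write_ms + remote_ms


if __name__ == "__main__":
    print(sync_commit_ms(1, 10, 1, parallel=True))
    print(sync_commit_ms(1, 10, 1, parallel=False))
```

With overlap, the local write is effectively hidden behind the remote leg whenever the network is the slower path, so the saving equals the local write time at most.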

Still - you have to make the determination whether the commit operations issued by your application can tolerate the network latency. Remember - if you are testing this out, do it under peak load conditions. Obviously, latency will have minimal impact on a read-intensive application (which generates little or no redo). There are also two elements of application impact - your application response time, and your overall application throughput. For example, your application may have a heavy interactive mode - especially if this interaction happens programmatically (e.g. a trading application accessing an authentication application which in turn is configured with Data Guard SYNC). In such cases, measuring the impact on the application response time is critical. However, if your application has enough parallelism built-in, you may notice that overall throughput doesn’t degrade much with higher latencies. In the database layer, you can measure this with the redo generation rate before and after configuring synchronous redo transport (using AWR).
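
The difference between response time and throughput can be sketched with a deliberately simple model. It assumes every session alternates between a fixed amount of work and a commit wait, and that sessions do not contend with each other; the function names and numbers are illustrative only.

```python
import math


def commits_per_second(sessions: int, work_ms: float, commit_wait_ms: float) -> float:
    """Aggregate commit rate when each session loops: do work, then wait on commit."""
    return sessions * 1000.0 / (work_ms + commit_wait_ms)


def sessions_needed(target_commits_per_s: float, work_ms: float, commit_wait_ms: float) -> int:
    """Concurrent sessions required to sustain a target commit rate."""
    return math.ceil(target_commits_per_s * (work_ms + commit_wait_ms) / 1000.0)


if __name__ == "__main__":
    for wait_ms in (5, 15):
        per_txn_ms = 5 + wait_ms
        print(f"commit wait {wait_ms} ms: response {per_txn_ms} ms, "
              f"1 session -> {commits_per_second(1, 5, wait_ms):.0f}/s, "
              f"sessions for 100/s -> {sessions_needed(100, 5, wait_ms)}")
```

In this model the per-transaction response time always rises with the commit wait, while the aggregate rate can be held steady by running more sessions in parallel. A real system has contention and other limits, so measure under peak load as suggested above.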

Not all Latencies are Equal

The cool thing about configuring synchronous redo transport in the database layer is just that - we do it in the database layer, and we just send redo blocks. Imagine if you had configured it in the storage layer instead. All the usual database file structures - data files, online redo logs, archived redo logs, flashback logs, control file - that get updated as part of the usual database activities would have to be synchronously updated across the network. You have to closely monitor the performance of database checkpointing in this case! We discuss these aspects in this OTN article.

So Why Bother?

So where are we? I stated that Data Guard synchronous redo transport does not have the overhead of two-phase-commit - so that’s good, and at the same time I stated that you have to watch out for network latency impact because of simple laws of physics - so that’s not so good - right? So, why bother, right?

This is why you have to bother - Data Guard synchronous redo transport, and hence - the zero data loss assurance, is a good thing! But to appreciate fully why this is a good thing, you have to wait for the next blog article. It’s coming soon, I promise!

For now, let me get back to my session presentation slides for Oracle OpenWorld! See you there!

Wednesday Sep 07, 2011

Key HA-related sessions at Oracle OpenWorld

At this year's Oracle OpenWorld, the Database High Availability and various related Development teams have put together several highly informative sessions, hands-on labs and demos for various components of our HA / MAA solution set. They include sessions on tips & tricks, solution and integration best practices, and span product areas such as clustering, disaster recovery, backup & recovery, replication, storage, Exadata, etc.

A comprehensive datasheet featuring information related to some of these sessions / labs / demos is available as a PDF file. It includes a quick reference guide at the end - handy to print as a single page. This link is available through the OTN HA site.

Customers interested in database high availability & data protection should use this datasheet as a planning guide for attending some of these sessions. As you may notice, some high-profile customers have already committed to be co-speakers for some of these sessions, featuring their implementation case studies and lessons learned. So this presents some excellent learning and networking opportunities for the attendees.

Tuesday Sep 06, 2011

To SYNC or not to SYNC – Part 1

Zero Data Loss – Nervously So?

As part of our Maximum Availability Architecture (MAA) conversations with customers, one issue that is often discussed is the capability of zero data loss in the event of a disaster. Naturally, this offers the best RPO (Recovery Point Objective), as far as disaster recovery (DR) is concerned. The Oracle solution that is a must-have for this is Oracle Data Guard, configured for synchronous redo transport. However, whenever the word “synchronous” is mentioned, the nervousness barometer rises. Some objections I have heard:

  • “Well, we don’t want our application to be impacted by network hiccups.”
  • “Well, what Data Guard does is two-phase-commit, which is so expensive!”
  • “Well, our DR data center is on the other coast, so we can’t afford a synchronous network.”

And a few others.

Some of these objections are valid, some are not. In this multi-part blog series, I will address these concerns, and more. In this particular blog, which is Part 1 of this series, I will debunk the myth that Data Guard synchronous redo transport is similar to two-phase commit.

SYNC != 2PC

Let’s be as clear as possible. Data Guard synchronous redo transport (SYNC) is NOT two-phase-commit. Unlike distributed transactions, there is no concept of a coordinator node initiating the transaction, there are no participating nodes, there are no prepare and commit phases working in tandem.

So what really happens with Data Guard SYNC? Let’s look under the covers.

Upon every commit operation in the database, the LGWR process flushes the redo buffer to local online redo logs - this is the standard way Oracle database operates. With Data Guard SYNC, in addition, the LGWR process tells the NSS process on the primary database to make these redo blocks durable on the standby database disk as well. Until LGWR hears back from NSS that the redo blocks have been written successfully in the standby location, the commit operation is held up. That’s what provides the zero data loss assurance. The local storage on the primary database gets damaged? No problem. The bits are available on the standby storage.

But how long should LGWR wait to hear back from NSS? Well, that is governed by the NET_TIMEOUT attribute of the log_archive_dest_n parameter corresponding to the standby. Once LGWR hears back from NSS that life is good, the commit operation completes.

Now, let’s look into how the NSS process operates. Upon every commit, the NSS process on the primary database dutifully sends the committed redo blocks to the standby database, and then waits till the RFS process on the standby receives them, writes them on disk on the standby (in standby redo logs or SRLs), and then sends the acknowledgement back to the NSS process.

So - on the standby database, what’s happening is just disk I/O to write the incoming redo blocks into the SRLs. This should not be confused with two-phase-commit, and naturally this process is much faster compared to a distributed transaction involving two-phase-commit coordination.
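
The flow described above can be sketched as a small deterministic model. It is not Oracle code: the `Primary` and `Standby` classes, the `remote_ms` argument, and the millisecond values are hypothetical. What the primary does after a timeout depends on how the configuration is set up and is deliberately not modelled here.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    srl: list = field(default_factory=list)   # standby redo log contents

    def receive(self, redo: str) -> bool:
        """Stand-in for RFS: write the redo to the SRL, then acknowledge."""
        self.srl.append(redo)
        return True


@dataclass
class Primary:
    standby: Standby
    net_timeout_ms: float                      # plays the role of NET_TIMEOUT
    online_log: list = field(default_factory=list)

    def commit(self, redo: str, remote_ms: float) -> bool:
        """Return True if the standby acknowledged the redo within the timeout."""
        self.online_log.append(redo)           # stand-in for LGWR's local redo write
        if remote_ms > self.net_timeout_ms:
            return False                       # no acknowledgement arrived in time
        return self.standby.receive(redo)      # stand-in for NSS send and RFS ack


if __name__ == "__main__":
    primary = Primary(Standby(), net_timeout_ms=30)
    print(primary.commit("redo-1", remote_ms=12))   # acknowledged by the standby
    print(primary.commit("redo-2", remote_ms=45))   # timed out waiting for the ack
    print(primary.online_log, primary.standby.srl)
```

Note that there is a single write and a single acknowledgement per commit, with no prepare phase and no coordinator, which is the contrast with two-phase commit drawn above.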

In case you are wondering what happens to these incoming redo blocks in the SRLs - well, they get picked up - asynchronously, by the Managed Recovery Process (MRP) as part of Redo Apply, and the changes get applied to the standby data files in a highly efficient manner. But this Redo Apply process is a completely separate process from Redo Transport - and that is an important thing to remember whenever these two-phase-commit questions come up.

Now that you are convinced that Data Guard SYNC is not the same as two-phase commit, in the next blog article, I will talk about the impact of network latency on Data Guard SYNC redo transport.


Musings on Oracle's Maximum Availability Architecture (MAA), by members of Oracle Development team. Note that we may not have the bandwidth to answer generic questions on MAA.

