Friday Sep 21, 2012

To SYNC or not to SYNC – Part 4

This is Part 4 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration. In Part 3, I talked in details regarding why Data Guard SYNC is a good thing, and the distance implications you have to keep in mind.

In this final article of the series, I will talk about how you can nicely complement Data Guard SYNC with the ability to failover in seconds.

Wait - Did I Say “Seconds”?

Did I just say that some customers do Data Guard failover in seconds? Yes, Virginia, there is a Santa Claus.

Data Guard has an automatic failover capability, aptly called Fast-Start Failover. Initially available with Oracle Database 10g Release 2 for Data Guard SYNC transport mode (and enhanced in Oracle Database 11g to support Data Guard ASYNC transport mode), this capability, managed by Data Guard Broker, lets your Data Guard configuration automatically failover to a designated standby database. Yes, this means no human intervention is required to do the failover. This process is controlled by a low footprint Data Guard Broker client called Observer, which makes sure that the primary database and the designated standby database are behaving like good kids. If something bad were to happen to the primary database, the Observer, after a configurable threshold period, tells that standby, “Your time has come, you are the chosen one!” The standby dutifully follows the Observer directives by assuming the role of the new primary database. The DBA or the Sys Admin doesn’t need to be involved.

And - in case you are following this discussion very closely, and are wondering … “Hmmm … what if the old primary is not really dead, but just network isolated from the Observer or the standby - won’t this lead to a split-brain situation?” The answer is No - It Doesn’t. With respect to why-it-doesn’t, I am sure there are some smart DBAs in the audience who can explain the technical reasons. Otherwise - that will be the material for a future blog post.

So - this combination of SYNC and Fast-Start Failover is the nirvana of lights-out, integrated HA and DR, as practiced by some of our advanced customers. They have observed failover times (with no data loss) ranging from single-digit seconds to tens of seconds. With this, they support operations in industry verticals such as manufacturing, retail, telecom, Internet, etc. that have the most demanding availability requirements.

One of our leading customers with massive cloud deployment initiatives tells us that they know about server failures only after Data Guard has automatically completed the failover process and the app is back up and running! Needless to mention, Data Guard Broker has the integration hooks for interfaces such as JDBC and OCI, or even for custom apps, to ensure the application gets automatically rerouted to the new primary database after the database level failover completes.

Net Net?

To sum up this multi-part blog article, Data Guard with SYNC redo transport mode, plus Fast-Start Failover, gives you the ideal triple-combo - that is, it gives you the assurance that for critical outages, you can failover your Oracle databases:

  1. very fast

  2. without human intervention, and

  3. without losing any data.

In short, it takes the element of risk out of critical IT operations. It does require you to be more careful with your network and systems planning, but as far as HA is concerned, the benefits outweigh the investment costs.

So, this is what we in the MAA Development Team believe in. What do you think? How has your deployment experience been? We look forward to hearing from you!

Monday Sep 10, 2012

To SYNC or not to SYNC – Part 3

I can't believe it has been almost a year since my last blog post. I know, that's an absolute no-no in the blogosphere. And I know that "I have been busy" is not a good excuse. So - without trying to come up with an excuse - let me state this - my apologies for taking such a long time to write the next Part.

Without further ado, here goes.

This is Part 3 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration.

In this article, I will talk in details regarding why Data Guard SYNC is a good thing. I will also talk about distance implications for setting up such a configuration.

So, Why Good?

Why is Data Guard SYNC a good thing? Because, at the end of the day, this gives you the assurance of zero data loss - it doesn’t matter what outage may befall your primary system. Befall! Boy, that sounds theatrical. But seriously - think about this - it minimizes your data risks. That’s a big deal. Whether you have an outage due to bad disks, faulty hardware components, hardware / software bugs, physical data corruptions, power failures, lightning that takes out significant part of your data center, fire that melts your assets, water leakage from the cooling system, human errors such as accidental deletion of online redo log files - it doesn’t matter - you can have that “Om - peace” look on your face and then you can failover to the standby system, without losing a single bit of data in your Oracle database. You will be a hero, as shown in this not so imaginary conversation:

IT Manager: Well, what’s the status?

You: John is doing the trace analysis on the storage array.

IT Manager: So? How long is that gonna take?

You: Well, he is stuck, waiting for a response from <insert your not-so-favorite storage vendor here>.

IT Manager: So, no root cause yet?

You: I told you, he is stuck. We have escalated with their Support, but you know how long these things take.

IT Manager: Darn it - the site is down!

You: Not really …

IT Manager: What do you mean?

You: John is stuck, but Sreeni has already done a failover to the Data Guard standby.

IT Manager: Whoa, whoa - wait! Failover means we lost some data, why did you do this without letting the Business group know?

You: We didn’t lose any data. Remember, we had set up Data Guard with SYNC? So now, any problems on the production – we just failover. No data loss, and we are up and running in minutes. The Business guys don’t need to know.

IT Manager: Wow! Are we great or what!!

You: I guess …

Ok, so you get it - SYNC is good. But as my dear friend Larry Carpenter says, “TANSTAAFL”, or "There ain't no such thing as a free lunch". Yes, of course - investing in Data Guard SYNC means that you have to invest in a low-latency network, you have to monitor your applications and database especially in peak load conditions, and you cannot under-provision your standby systems. But all these are good and necessary things, if you are supporting mission-critical apps that are supposed to be running 24x7. The peace of mind that this investment will give you is priceless, especially if you are serious about HA.

How Far Can We Go?

Someone may say at this point - well, I can’t use Data Guard SYNC over my coast-to-coast deployment. Most likely - true. So how far can you go? Well, we have customers who have deployed Data Guard SYNC over 300+ miles! Does this mean that you can also deploy over similar distances? Duh - no! I am going to say something here that most IT managers don’t like to hear - “It depends!” It depends on your application design, application response time / throughput requirements, network topology, etc. However, because of the optimal way we do SYNC, customers have been able to stretch Data Guard SYNC deployments over longer distances compared to traditional, storage-centric ways of doing this. The MAA Database 10.2 best practices paper Data Guard Redo Transport & Network Configuration, and Oracle Database 11.2 High Availability Best Practices Manual talk about some of these SYNC-related metrics. For example, a test deployment of Data Guard SYNC over 330 miles with 10ms latency showed an impact less than 5% for a busy OLTP application.

Even if you can’t deploy Data Guard SYNC over your WAN distance, or if you already have an ASYNC standby located 1000-s of miles away, here’s another nifty way to boost your HA. Have a local standby, configured SYNC. How local is “local”? Again - it depends. One customer runs a local SYNC standby across the campus. Another customer runs it across 15 miles in another data center. Both of these customers are running Data Guard SYNC as their HA standard. If a localized outage affects their primary system, no problem! They have all the data available on the standby, to which they can failover. Very fast. In seconds.

Wait - did I say “seconds”? Yes, Virginia, there is a Santa Claus. But you have to wait till the next blog article to find out more. I assure you tho’ that this time you won’t have to wait for another year for this.


Musings on Oracle's Maximum Availability Architecture (MAA), by members of Oracle Development team. Note that we may not have the bandwidth to answer generic questions on MAA.


« September 2012 »