Tuesday Jun 03, 2014

How to Achieve Real-Time Data Protection and Availability… For Real

There is a class of business- and mission-critical applications where downtime or data loss has a substantial negative impact on revenue, customer service, reputation, and cost. Because the Oracle Database is used extensively to provide reliable performance and availability for this class of application, it also provides an integrated set of capabilities for real-time data protection and availability.


Active Data Guard, depicted in the figure below, is the cornerstone for accomplishing these objectives because it provides the absolute best real-time data protection and availability for the Oracle Database. This is a bold statement, but it is supported by the facts. It isn’t so much that alternative solutions are bad, it’s just that their architectures prevent them from achieving the same levels of data protection, availability, simplicity, and asset utilization provided by Active Data Guard. Let’s explore further.



Active Data Guard - Overview

Backups are the most popular method used to protect data and are an essential best practice for every database. Not surprisingly, Oracle Recovery Manager (RMAN) is one of the most commonly used features of the Oracle Database. But comparing Active Data Guard to backups is like comparing apples to motorcycles. Active Data Guard uses a hot (open read-only), synchronized copy of the production database to provide real-time data protection and HA. In contrast, a restore from backup takes time and often has many moving parts - people, processes, software and systems – that can create a level of uncertainty during an outage that critical applications can’t afford. This is why backups play a secondary role for your most critical databases by complementing real-time solutions that can provide both data protection and availability.
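
For reference, an RMAN routine for this baseline protection can be as simple as the following sketch (the seven-day recovery window is purely illustrative; set it to match your own recovery requirements):

  RMAN> CONNECT TARGET /
  RMAN> CONFIGURE RETENTION POLICY TO RECOVERY WINDOW OF 7 DAYS;
  RMAN> BACKUP DATABASE PLUS ARCHIVELOG;
  RMAN> DELETE NOPROMPT OBSOLETE;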


Before Data Guard, enterprises used storage remote-mirroring for real-time data protection and availability. Remote-mirroring is a sophisticated storage technology promoted as a generic infrastructure solution that makes a simple promise - whatever is written to a primary volume will also be written to the mirrored volume at a remote site. Keeping this promise is also what causes data loss and downtime when the data written to primary volumes is corrupt - the same corruption is faithfully mirrored to the remote volume, making both copies unusable. This happens because remote-mirroring is a generic process: it has no intrinsic knowledge of Oracle data structures to enable advanced protection, nor can it perform independent Oracle validation BEFORE changes are applied to the remote copy. There is also nothing to prevent human error (e.g. a storage admin accidentally deleting critical files) from impacting the remote mirrored copy.


Remote-mirroring tricks users by creating a false impression that there are two separate copies of the Oracle Database. In truth, while remote-mirroring maintains two copies of the data on different volumes, both are part of a single, closely coupled system. Not only will remote-mirroring propagate corruptions and administrative errors, but the changes applied to the mirrored volume are the result of the same Oracle code path that applied the change to the source volume. There is no isolation, either from a storage mirroring perspective or from an Oracle software perspective. Bottom line: storage remote-mirroring lacks both the smarts and the isolation necessary to provide true data protection.


Active Data Guard offers much more than storage remote-mirroring when your objective is protecting your enterprise from downtime and data loss. Like remote-mirroring, an Active Data Guard replica is an exact, block-for-block copy of the primary. Unlike remote-mirroring, an Active Data Guard replica is NOT a tightly coupled copy of the source volumes - it is a completely independent Oracle Database. Active Data Guard’s inherent knowledge of Oracle data block and redo structures enables a separate Oracle Database, using a different Oracle code path than the primary, to apply the full complement of Oracle data validation methods before changes reach the synchronized copy. These include physical checksum validation, logical intra-block checking, lost write validation, and automatic block repair. The figure below illustrates the stark difference between what remote-mirroring can discern from an Oracle data block and what Active Data Guard can discern.



Oracle Block Structure
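
To make the contrast concrete, here is a sketch of the initialization parameters that control this validation. The specific values are illustrative, and you should evaluate their overhead in your own environment; automatic block repair itself requires no parameter - it is inherent to an Active Data Guard configuration.

  -- Physical (checksum) validation of data blocks
  ALTER SYSTEM SET DB_BLOCK_CHECKSUM = FULL;
  -- Logical intra-block checking
  ALTER SYSTEM SET DB_BLOCK_CHECKING = MEDIUM;
  -- Lost write validation (set on both primary and standby)
  ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL;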


An Active Data Guard standby also provides a range of additional services enabled by the fact that it is a running Oracle Database - not just a mirrored copy of data files. An Active Data Guard standby database can be open read-only while it is synchronizing with the primary. This enables read-only workloads to be offloaded from the primary system and run on the active standby - boosting performance by putting all assets to productive use. An Active Data Guard standby can also be used to implement many types of system and database maintenance in a rolling fashion. Maintenance and upgrades are first implemented on the standby while production runs unaffected at the primary. After the primary and standby are synchronized and all changes have been validated, the production workload is quickly switched to the standby. The only downtime is the time required for user connections to transfer from one system to the next. These capabilities further expand the expectations of availability offered by a data protection solution beyond what is possible using storage remote-mirroring.
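
If you haven’t seen it before, here is a minimal sketch of what “open read-only while synchronizing” looks like on the standby side, assuming a mounted physical standby with standby redo logs configured:

  -- On the standby: pause redo apply, open read-only, then resume
  -- real-time apply while the database stays open.
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
  ALTER DATABASE OPEN READ ONLY;
  ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
    USING CURRENT LOGFILE DISCONNECT FROM SESSION;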


So don’t be fooled by appearances. Storage remote-mirroring and Active Data Guard replication may look similar on the surface - but the devil is in the details. Only Active Data Guard has the smarts, the isolation, and the simplicity to provide the best data protection and availability for the Oracle Database.


Stay tuned for future blog posts that dive into the many differences between storage remote-mirroring and Active Data Guard along the dimensions of data protection, data availability, cost, asset utilization and return on investment.






Friday Sep 21, 2012

To SYNC or not to SYNC – Part 4

This is Part 4 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration. In Part 3, I talked in detail about why Data Guard SYNC is a good thing, and the distance implications you have to keep in mind.


In this final article of the series, I will talk about how you can nicely complement Data Guard SYNC with the ability to fail over in seconds.


Wait - Did I Say “Seconds”?


Did I just say that some customers do Data Guard failover in seconds? Yes, Virginia, there is a Santa Claus.


Data Guard has an automatic failover capability, aptly called Fast-Start Failover. Initially available with Oracle Database 10g Release 2 for Data Guard SYNC transport mode (and enhanced in Oracle Database 11g to support Data Guard ASYNC transport mode), this capability, managed by Data Guard Broker, lets your Data Guard configuration automatically fail over to a designated standby database. Yes, this means no human intervention is required to do the failover. This process is controlled by a low-footprint Data Guard Broker client called the Observer, which makes sure that the primary database and the designated standby database are behaving like good kids. If something bad were to happen to the primary database, the Observer, after a configurable threshold period, tells that standby, “Your time has come, you are the chosen one!” The standby dutifully follows the Observer’s directives by assuming the role of the new primary database. The DBA or the sys admin doesn’t need to be involved.
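
For the curious, here is roughly what enabling Fast-Start Failover looks like in DGMGRL, assuming a Broker configuration and the usual prerequisites (such as Flashback Database) are already in place - the connect string and the 30-second threshold are illustrative:

  DGMGRL> CONNECT sys@primary
  DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
  DGMGRL> ENABLE FAST_START FAILOVER;
  DGMGRL> START OBSERVER;  -- best run from a third host, separate from primary and standby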


And - in case you are following this discussion very closely, and are wondering … “Hmmm … what if the old primary is not really dead, but just network isolated from the Observer or the standby - won’t this lead to a split-brain situation?” The answer is no - it doesn’t. As to why it doesn’t, I am sure there are some smart DBAs in the audience who can explain the technical reasons. Otherwise - that will be material for a future blog post.


So - this combination of SYNC and Fast-Start Failover is the nirvana of lights-out, integrated HA and DR, as practiced by some of our advanced customers. They have observed failover times (with no data loss) ranging from single-digit seconds to tens of seconds. With this, they support operations in industry verticals with the most demanding availability requirements, such as manufacturing, retail, telecom, and Internet services.


One of our leading customers with massive cloud deployment initiatives tells us that they know about server failures only after Data Guard has automatically completed the failover process and the app is back up and running! Needless to say, Data Guard Broker has the integration hooks for interfaces such as JDBC and OCI, or even for custom apps, to ensure the application gets automatically rerouted to the new primary database after the database-level failover completes.
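
One common pattern - and this is a sketch, with hypothetical host and service names - is to have clients connect through a descriptor that lists both sites, with the service itself configured to be active only on the database in the primary role, so that after a failover new connections automatically land on the new primary:

  jdbc:oracle:thin:@(DESCRIPTION=
    (ADDRESS_LIST=
      (ADDRESS=(PROTOCOL=TCP)(HOST=primary-host)(PORT=1521))
      (ADDRESS=(PROTOCOL=TCP)(HOST=standby-host)(PORT=1521)))
    (CONNECT_DATA=(SERVICE_NAME=sales_rw)))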


Net Net?


To sum up this multi-part blog article, Data Guard with SYNC redo transport mode, plus Fast-Start Failover, gives you the ideal triple-combo - that is, it gives you the assurance that for critical outages, you can fail over your Oracle databases:



  1. very fast

  2. without human intervention, and

  3. without losing any data.


In short, it takes the element of risk out of critical IT operations. It does require you to be more careful with your network and systems planning, but as far as HA is concerned, the benefits outweigh the investment costs.


So, this is what we in the MAA Development Team believe in. What do you think? How has your deployment experience been? We look forward to hearing from you!

Monday Sep 10, 2012

To SYNC or not to SYNC – Part 3

I can't believe it has been almost a year since my last blog post. I know, that's an absolute no-no in the blogosphere. And I know that "I have been busy" is not a good excuse. So - without trying to come up with an excuse - let me state this - my apologies for taking such a long time to write the next Part.


Without further ado, here goes.






This is Part 3 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration.


In this article, I will talk in detail about why Data Guard SYNC is a good thing. I will also talk about the distance implications of setting up such a configuration.


So, Why Good?


Why is Data Guard SYNC a good thing? Because, at the end of the day, this gives you the assurance of zero data loss - it doesn’t matter what outage may befall your primary system. Befall! Boy, that sounds theatrical. But seriously - think about this - it minimizes your data risks. That’s a big deal. Whether you have an outage due to bad disks, faulty hardware components, hardware / software bugs, physical data corruptions, power failures, lightning that takes out a significant part of your data center, fire that melts your assets, water leakage from the cooling system, or human errors such as accidental deletion of online redo log files - it doesn’t matter - you can have that “Om - peace” look on your face and then fail over to the standby system, without losing a single bit of data in your Oracle database. You will be a hero, as shown in this not-so-imaginary conversation:


IT Manager: Well, what’s the status?

You: John is doing the trace analysis on the storage array.

IT Manager: So? How long is that gonna take?

You: Well, he is stuck, waiting for a response from <insert your not-so-favorite storage vendor here>.

IT Manager: So, no root cause yet?

You: I told you, he is stuck. We have escalated with their Support, but you know how long these things take.

IT Manager: Darn it - the site is down!

You: Not really …

IT Manager: What do you mean?

You: John is stuck, but Sreeni has already done a failover to the Data Guard standby.

IT Manager: Whoa, whoa - wait! Failover means we lost some data, why did you do this without letting the Business group know?

You: We didn’t lose any data. Remember, we had set up Data Guard with SYNC? So now, any problems on the production – we just failover. No data loss, and we are up and running in minutes. The Business guys don’t need to know.

IT Manager: Wow! Are we great or what!!

You: I guess …


Ok, so you get it - SYNC is good. But as my dear friend Larry Carpenter says, “TANSTAAFL”, or "There ain't no such thing as a free lunch". Yes, of course - investing in Data Guard SYNC means that you have to invest in a low-latency network, you have to monitor your applications and database especially in peak load conditions, and you cannot under-provision your standby systems. But all these are good and necessary things, if you are supporting mission-critical apps that are supposed to be running 24x7. The peace of mind that this investment will give you is priceless, especially if you are serious about HA.


How Far Can We Go?


Someone may say at this point - well, I can’t use Data Guard SYNC over my coast-to-coast deployment. Most likely - true. So how far can you go? Well, we have customers who have deployed Data Guard SYNC over 300+ miles! Does this mean that you can also deploy over similar distances? Duh - no! I am going to say something here that most IT managers don’t like to hear - “It depends!” It depends on your application design, application response time / throughput requirements, network topology, etc. However, because of the optimal way we do SYNC, customers have been able to stretch Data Guard SYNC deployments over longer distances compared to traditional, storage-centric ways of doing this. The MAA best practices paper Data Guard Redo Transport & Network Configuration (for Database 10.2) and the Oracle Database 11.2 High Availability Best Practices manual discuss some of these SYNC-related metrics. For example, a test deployment of Data Guard SYNC over 330 miles with 10 ms latency showed an impact of less than 5% for a busy OLTP application.


Even if you can’t deploy Data Guard SYNC over your WAN distance, or if you already have an ASYNC standby located thousands of miles away, here’s another nifty way to boost your HA - have a local standby, configured SYNC (see the sketch below). How local is “local”? Again - it depends. One customer runs a local SYNC standby across the campus. Another customer runs it across 15 miles in another data center. Both of these customers are running Data Guard SYNC as their HA standard. If a localized outage affects their primary system, no problem! They have all the data available on the standby, to which they can fail over. Very fast. In seconds.
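
As a sketch, such a topology is just two redo transport destinations on the primary - the database and service names below are hypothetical:

  -- Local standby, synchronous, zero data loss
  ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
    'SERVICE=local_stby SYNC AFFIRM NET_TIMEOUT=30
     VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=local_stby';
  -- Remote standby, asynchronous, for regional disasters
  ALTER SYSTEM SET LOG_ARCHIVE_DEST_3 =
    'SERVICE=remote_stby ASYNC NOAFFIRM
     VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=remote_stby';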


Wait - did I say “seconds”? Yes, Virginia, there is a Santa Claus. But you have to wait till the next blog article to find out more. I assure you tho’ that this time you won’t have to wait for another year for this.


Tuesday Sep 20, 2011

To SYNC or not to SYNC – Part 2


It’s less than two weeks from Oracle OpenWorld! We are going to have an exciting set of sessions from the Oracle HA Development team. Needless to say, all of us are a wee bit busy these days. I think that’s just the perfect time for Part 2 of this multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC).


In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In case you are wondering what the truth is, and don’t have time to read the previous article, the answer is - No, Data Guard synchronous redo transport is NOT the same as two-phase commit.


Now, let’s look into how network latency may or may not impact a Data Guard SYNC configuration.


LATEncy


The network latency issue is a valid concern. That’s a simple law of physics. We have heard of the term “lightspeed” (remember Star Wars?), but still - as you know from your high school physics days, light takes time to travel. So the acknowledgement from RFS back to NSS will take some milliseconds to traverse the network, and that is typically proportional to the network distance.


Actually - it is both network latency and disk I/O latency. Why disk I/O latency? Remember, on the standby database, RFS is writing the incoming redo blocks to disk-resident SRLs. This is governed by the AFFIRM attribute of the log_archive_dest parameter corresponding to the standby database. We had one customer whose SYNC performance on the primary was suffering because of an improperly tuned standby storage system.


However, for most cases, network latency is likely to be the gating factor - for example, refer to this real-time network latency chart from AT&T - http://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html. At the time of writing this blog, US coast-to-coast latency (SF - NY) is shown to be around 75 ms. Trans-Atlantic latency is shown to be around 80 ms, whereas Trans-Pacific latency is shown to be around 140 ms. Of course, you can measure the latency between your own primary and standby servers using utilities such as “ping” and “traceroute”.


Here is some good news - in Oracle Database 11g Release 2, the write to local online redo logs (by LGWR) and the remote write through the network layer (by NSS) happen in parallel. So we do get some efficiency through these parallel local write and network send operations.


Still - you have to make the determination whether the commit operations issued by your application can tolerate the network latency. Remember - if you are testing this out, do it under peak load conditions. Obviously, latency will have minimal impact on a read-intensive application (which, by its nature, generates little redo). There are also two elements of application impact - your application response time and your overall application throughput. For example, your application may have a heavy interactive mode - especially if this interaction happens programmatically (e.g. a trading application accessing an authentication application which in turn is configured with Data Guard SYNC). In such cases, measuring the impact on the application response time is critical. However, if your application has enough parallelism built in, you may notice that overall throughput doesn’t degrade much with higher latencies. In the database layer, you can measure this by comparing the redo generation rate before and after configuring synchronous redo transport (using AWR).
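
If you want a quick check outside of a full AWR report, the underlying counters live in V$SYSSTAT - sample them over a fixed interval before and after enabling SYNC and compare the deltas:

  SELECT name, value
  FROM   v$sysstat
  WHERE  name IN ('redo size', 'redo writes', 'user commits');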


Not all Latencies are Equal


The cool thing about configuring synchronous redo transport in the database layer is just that - we do it in the database layer, and we send only redo blocks. Imagine if you had configured it in the storage layer. All the usual database file structures - data files, online redo logs, archived redo logs, flashback logs, control file - that get updated as part of the usual database activities would have to be synchronously updated across the network. You have to closely monitor the performance of database checkpointing in this case! We discuss these aspects in this OTN article.


So Why Bother?


So where are we? I stated that Data Guard synchronous redo transport does not have the overhead of two-phase-commit - so that’s good, and at the same time I stated that you have to watch out for network latency impact because of simple laws of physics - so that’s not so good - right? So, why bother, right?


This is why you have to bother - Data Guard synchronous redo transport, and hence - the zero data loss assurance, is a good thing! But to appreciate fully why this is a good thing, you have to wait for the next blog article. It’s coming soon, I promise!


For now, let me get back to my session presentation slides for Oracle OpenWorld! See you there!


Tuesday Sep 06, 2011

To SYNC or not to SYNC – Part 1

Zero Data Loss – Nervously So?

As part of our Maximum Availability Architecture (MAA) conversations with customers, one issue that is often discussed is the capability of zero data loss in the event of a disaster. Naturally, this offers the best RPO (Recovery Point Objective), as far as disaster recovery (DR) is concerned. The Oracle solution that is a must-have for this is Oracle Data Guard, configured for synchronous redo transport. However, whenever the word “synchronous” is mentioned, the nervousness barometer rises. Some objections I have heard:

  • “Well, we don’t want our application to be impacted by network hiccups.”
  • “Well, what Data Guard does is two-phase-commit, which is so expensive!”
  • “Well, our DR data center is on the other coast, so we can’t afford a synchronous network.”

And a few others.

Some of these objections are valid, some are not. In this multi-part blog series, I will address these concerns, and more. In this particular blog, which is Part 1 of this series, I will debunk the myth that Data Guard synchronous redo transport is similar to two-phase commit.

SYNC != 2PC

Let’s be as clear as possible. Data Guard synchronous redo transport (SYNC) is NOT two-phase-commit. Unlike distributed transactions, there is no concept of a coordinator node initiating the transaction, there are no participating nodes, there are no prepare and commit phases working in tandem.

So what really happens with Data Guard SYNC? Let’s look under the covers.

Upon every commit operation in the database, the LGWR process flushes the redo buffer to the local online redo logs - this is the standard way the Oracle database operates. With Data Guard SYNC, in addition, the LGWR process tells the NSS process on the primary database to make these redo blocks durable on the standby database disk as well. Until LGWR hears back from NSS that the redo blocks have been written successfully at the standby location, the commit operation is held up. That’s what provides the zero data loss assurance. The local storage on the primary database gets damaged? No problem. The bits are available on the standby storage.

But how long should LGWR wait to hear back from NSS? Well, that is governed by the NET_TIMEOUT attribute of the log_archive_dest parameter corresponding to the standby. Once LGWR hears back from NSS that life is good, the commit operation completes.
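
In practical terms - and this is illustrative, not a recommendation - NET_TIMEOUT rides along on the same destination setting that enables SYNC in the first place (the value is in seconds, and 30 is the default):

  ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
    'SERVICE=standby_db SYNC AFFIRM NET_TIMEOUT=30 DB_UNIQUE_NAME=standby_db';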

Now, let’s look into how the NSS process operates. Upon every commit, the NSS process on the primary database dutifully sends the committed redo blocks to the standby database, and then waits till the RFS process on the standby receives them, writes them on disk on the standby (in standby redo logs or SRLs), and then sends the acknowledgement back to the NSS process.

So - on the standby database, what’s happening is just disk I/O to write the incoming redo blocks into the SRLs. This should not be confused with two-phase-commit, and naturally this process is much faster compared to a distributed transaction involving two-phase-commit coordination.

In case you are wondering what happens to these incoming redo blocks in the SRLs - well, they get picked up - asynchronously, by the Managed Recovery Process (MRP) as part of Redo Apply, and the changes get applied to the standby data files in a highly efficient manner. But this Redo Apply process is a completely separate process from Redo Transport - and that is an important thing to remember whenever these two-phase-commit questions come up.

Now that you are convinced that Data Guard SYNC is not the same as two-phase commit, in the next blog article I will talk about the impact of network latency on Data Guard SYNC redo transport.


