To SYNC or not to SYNC – Part 2
By Ashishray-Oracle on Sep 20, 2011
It’s less than two weeks from Oracle OpenWorld! We are going to have an exciting set of sessions from the Oracle HA Development team. Needless to say, all of us are a wee bit busy these days. I think that’s just the perfect time for Part 2 of this multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC).
In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In case you are wondering what the truth is, and don’t have time to read the previous article, the answer is - No, Data Guard synchronous redo transport is NOT the same as two-phase commit.
Now, let’s look into how network latency may or may not impact a Data Guard SYNC configuration.
The network latency issue is a valid concern. That’s a simple law of physics. We have heard of the term “lightspeed” (remember Star Wars?), but still - as you know from your high school physics days, light takes time to travel. So the acknowledgement from RFS back to NSS will take some milliseconds to traverse the network, and that is typically proportional to the network distance.
Actually - it is both network latency and disk I/O latency. Why disk I/O latency? Remember, on the standby database, RFS is writing the incoming redo blocks on disk-resident SRLs. This is governed by the AFFIRM attribute of the log_archive_dest parameter corresponding to the standby database. We had one customer whose SYNC performance on the primary was suffering because of improperly tuned standby storage system.
However, for most cases, network latency is likely to be the gating factor - for example, refer to this real-time network latency chart from AT&T - http://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html. At the time of writing this blog, US coast-coast latency (SF - NY) is shown to be around 75 ms. Trans-Atlantic latency is shown to be around 80 ms, whereas Trans-Pacific latency is shown to be around 140 ms. Of course you can measure the latency between your own primary and standby servers using utilities such as “ping” and “traceroute”.
Here is some good news - in Oracle Database 11g Release 2, the write to local online redo logs (by LGWR) and the remote write through the network layer (by NSS) happen in parallel. So we do get some efficiency through these parallel local write and network send operations.
Still - you have to make the determination whether the commit operations issued by your application can tolerate the network latency. Remember - if you are testing this out, do it under peak load conditions. Obviously latency will have minimal impact on a read-intensive application (which, by definition, does not generate redo). There are also two elements of application impact - your application response time, and your overall application throughput. For example, your application may have a heavy interactive mode - especially if this interaction happens programmatically (e.g. a trading application accessing an authentication application which in turn is configured with Data Guard SYNC). In such cases, measuring the impact on the application response time is critical. However, if your application has enough parallelism built-in, you may notice that overall throughput doesn’t degrade much with higher latencies. In the database layer, you can measure this with the redo generation rate before and after configuring synchronous redo transport (using AWR).
Not all Latencies are Equal
The cool thing about configuring synchronous redo transport in the database layer, is just that - we do it in the database layer, and we just send redo blocks. Imagine if you have configured it in the storage layer. All the usual database file structures - data files, online redo logs, archived redo logs, flashback logs, control file - that get updated as part of the usual database activities, will have to be synchronously updated across the network. You have to closely monitor the performance of database checkpointing in this case! We discuss these aspects in this OTN article.
So Why Bother?
So where are we? I stated that Data Guard synchronous redo transport does not have the overhead of two-phase-commit - so that’s good, and at the same time I stated that you have to watch out for network latency impact because of simple laws of physics - so that’s not so good - right? So, why bother, right?
This is why you have to bother - Data Guard synchronous redo transport, and hence - the zero data loss assurance, is a good thing! But to appreciate fully why this is a good thing, you have to wait for the next blog article. It’s coming soon, I promise!
For now, let me get back to my session presentation slides for Oracle OpenWorld! See you there!