To SYNC or not to SYNC – Part 1
By AshishRay on Sep 06, 2011
Zero Data Loss – Nervously So?
As part of our Maximum Availability Architecture (MAA) conversations with customers, one issue that is often discussed is the capability of zero data loss in the event of a disaster. Naturally, this offers the best RPO (Recovery Point Objective), as far as disaster recovery (DR) is concerned. The Oracle solution that is a must-have for this is Oracle Data Guard, configured for synchronous redo transport. However, whenever the word “synchronous” is mentioned, the nervousness barometer rises. Some objections I have heard:
- “Well, we don’t want our application to be impacted by network hiccups.”
- “Well, what Data Guard does is two-phase-commit, which is so expensive!”
- “Well, our DR data center is on the other coast, so we can’t afford a synchronous network.”
And a few others.
Some of these objections are valid, some are not. In this multi-part blog series, I will address these concerns, and more. In this particular blog, which is Part 1 of this series, I will debunk the myth that Data Guard synchronous redo transport is similar to two-phase commit.
SYNC != 2 PC
Let’s be as clear as possible. Data Guard synchronous redo transport (SYNC) is NOT two-phase-commit. Unlike distributed transactions, there is no concept of a coordinator node initiating the transaction, there are no participating nodes, there are no prepare and commit phases working in tandem.
So what really happens with Data Guard SYNC? Let’s look under the covers.
Upon every commit operation in the database, the LGWR process flushes the redo buffer to local online redo logs - this is the standard way Oracle database operates. With Data Guard SYNC, in addition, the LGWR process tells the NSS process on the primary database to make these redo blocks durable on the standby database disk as well. Until LGWR hears back from NSS that the redo blocks have been written successfully in the standby location, the commit operation is held up. That’s what provides the zero data loss assurance. The local storage on the primary database gets damaged? No problem. The bits are available on the standby storage.
But how long should LGWR wait to hear back from NSS? Well, that is governed by the NET_TIMEOUT attribute of the log_archive_dest parameter corresponding to the standby. Once LGWR hears back from NSS that life is good, the commit operation completes.
Now, let’s look into how the NSS process operates. Upon every commit, the NSS process on the primary database dutifully sends the committed redo blocks to the standby database, and then waits till the RFS process on the standby receives them, writes them on disk on the standby (in standby redo logs or SRLs), and then sends the acknowledgement back to the NSS process.
So - on the standby database, what’s happening is just disk I/O to write the incoming redo blocks into the SRLs. This should not be confused with two-phase-commit, and naturally this process is much faster compared to a distributed transaction involving two-phase-commit coordination.
In case you are wondering what happens to these incoming redo blocks in the SRLs - well, they get picked up - asynchronously, by the Managed Recovery Process (MRP) as part of Redo Apply, and the changes get applied to the standby data files in a highly efficient manner. But this Redo Apply process is a completely separate process from Redo Transport - and that is an important thing to remember whenever these two-phase-commit questions come up.
Now that you are convinced that Data Guard SYNC is not the same as two-phase commit, in the next blog article, I will talk about impact of network latency on Data Guard SYNC redo transport.