Wednesday Nov 05, 2014

Oracle MAA Part 4: Gold HA Reference Architecture

Welcome to the fourth installment in a series of blog posts describing MAA Best Practices that define four standard reference architectures for HA and data protection: BRONZE, SILVER, GOLD and PLATINUM. The objective of each reference architecture is to deploy an optimal set of Oracle HA capabilities that reliably achieve a given service level (SLA) at the lowest cost.

This article provides details for the Gold reference architecture.

Gold substantially raises the service level for business critical applications that cannot accept vulnerability to single points of failure. Gold builds upon Silver by using database replication technology to eliminate single points of failure and provide a much higher level of data protection and HA against all types of unplanned and planned outages.

An overview of Gold is provided in the figure below.

Gold delivers substantially enhanced service levels using the following capabilities:

Oracle Active Data Guard replaces backups used by Bronze and Silver reference architectures as the first line of defense against an unrecoverable outage of the production database. Recovery time (RTO) for outages caused by data corruption, database failure, cluster failure, and site failures is reduced to seconds or minutes with an accompanying data loss exposure (RPO) of zero or near zero depending upon configuration. While backups are no longer used for availability, they are still included in the Gold reference architecture for archival purposes and as an additional level of data protection.

Active Data Guard uses simple physical replication to maintain one or more synchronized copies (standby databases) of the production database (primary database). If the primary becomes unavailable for any reason, production is quickly failed over to the standby and availability is restored.

Active Data Guard offers a unique set of capabilities for availability and Oracle data protection that exceed other alternatives based upon storage remote-mirroring or other methods of database replication. These capabilities include:

  • Choice of zero data loss (sync) or near-zero data loss (async) disaster protection.
  • Direct transmission of database changes (redo) from the log buffer of the primary database, providing strong isolation from lower-layer hardware and software faults.
  • Use of intimate knowledge of Oracle data block and redo structures to perform continuous Oracle data validation at the standby, further isolating the standby database from corruptions that can impact a primary database.
  • Native support for all Oracle data types and features combined with high performance capable of supporting all applications and workloads.
  • Manual or automatic failover to quickly transfer production to the standby database should the primary become unavailable for any reason.
  • Integrated application failover to quickly transition application connections to the new primary database after a failover has occurred.
  • Database rolling maintenance to reduce downtime and risk during planned maintenance.
  • High return on investment by offloading read-only workloads and backups to an Active Data Guard standby while it is being synchronized by the primary database.
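To make this concrete, the core of such a configuration can be sketched in a few statements. This is a simplified illustration only: the standby name "boston" is a hypothetical placeholder, and a complete configuration also requires standby redo logs, Oracle Net aliases, and ideally Data Guard Broker.

```sql
-- On the primary: ship redo to the standby. SYNC AFFIRM gives zero data
-- loss; substitute ASYNC NOAFFIRM for near-zero data loss at distance.
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
  'SERVICE=boston SYNC AFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=boston';

-- On the standby: open it read-only and start real-time redo apply.
-- An open, actively applying standby is what makes it an Active Data
-- Guard standby, usable for offloaded queries and backups.
ALTER DATABASE OPEN READ ONLY;
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```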

Oracle GoldenGate logical replication is also included in the Gold reference architecture, either to complement Active Data Guard when performing planned maintenance or to use as an alternative replication mechanism for maintaining a synchronized copy (target database) of a production database (source database).

GoldenGate reads changes from disk at a source database, transforms the data into a platform independent file format, transmits the file to a target database, then transforms the data into SQL (updates, inserts, and deletes) native to a target database that is open read-write. The target database contains the same data, but is a different physical database from the source (for example, backups are not interchangeable). This enables GoldenGate to easily support heterogeneous environments across different hardware platforms and relational database management systems. This flexibility makes it ideal for a wide range of planned maintenance and other replication requirements.  GoldenGate can:

  • Efficiently replicate subsets of a source database to distribute data to other target databases. It can also be used to consolidate data into a single target database (for example, an Operational Data Store) from multiple source databases. This function of GoldenGate is relevant to each of the four MAA reference architectures and is complementary to the use of Active Data Guard.
  • Perform maintenance and migrations in a rolling manner for use-cases that cannot be supported using Data Guard replication. For example, Oracle GoldenGate enables replication from a source database running on a big-endian platform to a target database running on a little-endian platform. This enables cross-platform migration with the additional advantage of being able to reverse replication for fast fallback to the prior version after cutover. When used in this fashion GoldenGate is complementary to Active Data Guard.
  • Maintain a complete replica of a source database for high availability or disaster protection that is ready for immediate failover should the source database become unavailable. GoldenGate would be an alternative to Active Data Guard when used for this purpose. The primary use-case where you would use GoldenGate instead of Active Data Guard for complete database replication is when there is a requirement for the target database to be open read-write at all times (remember an Active Data Guard standby is open read-only).
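As a rough sketch of what this looks like in practice, a minimal one-table GoldenGate configuration consists of an Extract (capture) parameter file at the source and a Replicat (apply) parameter file at the target, along the following lines. All names here - process names, schema, trail path, credentials - are hypothetical placeholders, and a real deployment also involves manager configuration, trail management, and supplemental logging.

```
-- Extract: capture changes for one table at the source database
EXTRACT exta
USERID ggadmin, PASSWORD *****
EXTTRAIL ./dirdat/aa
TABLE sales.orders;

-- Replicat: apply the captured changes at the open read-write target
REPLICAT repa
USERID ggadmin, PASSWORD *****
MAP sales.orders, TARGET sales.orders;
```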

Note that there are several trade-offs that must be accepted when using logical replication in place of Active Data Guard for data protection and availability:

  • Logical replication has additional pre-requisites and operational complexity.
  • Logical replication is inherently an asynchronous process and thus cannot provide zero data loss protection; only Active Data Guard can provide zero data loss protection.
  • A logical copy is not a physical replica of the source database. Rather than offloading backups to a standby, you must back up both source and target. Logical replication also cannot support advanced data protection features that come with Data Guard physical replication: lost-write detection and automatic block repair.

Oracle Site Guard is optional in the Gold tier but is useful to reduce administrative overhead and the potential for human error. Site Guard enables administrators to automate the orchestration of switchover (a planned event) and failover (in response to an unplanned outage) of their complete Oracle environment - multiple databases and applications - between a production site and a remote disaster recovery site. Oracle Site Guard is included with the Oracle Enterprise Manager Life-Cycle Management Pack.

Oracle Site Guard offers the following benefits:

  • Reduction of errors due to prepared response to site failure. Recovery strategies are mapped out, tested, and rehearsed in prepared responses within the application. Once an administrator initiates a Site Guard operation for disaster recovery, human intervention is not required.
  • Coordination across multiple applications, databases, and various replication technologies. Oracle Site Guard automatically handles dependencies between different targets while starting or stopping a site. Site Guard integrates with Oracle Active Data Guard to coordinate multiple concurrent database failovers. Site Guard also integrates with storage remote mirroring that may be used for data that resides outside of the Oracle Database.
  • Faster recovery time. Oracle Site Guard automation minimizes time spent in the manual coordination of recovery activities. 

Gold = Better HA and Data Protection

Gold builds upon Silver by addressing all fault domains. Even in the worst cases of a complete cluster or site outage, database service can be resumed within seconds or minutes of a failure occurring. Gold eliminates the downtime and potential uncertainty of a restore from backup. Gold also eliminates data loss by protecting every database transaction in real-time.

Database-aware replication is key to achieving Gold service levels. It is network efficient. It enforces a high degree of isolation between replicated copies for optimal data protection and HA. It enables fast failover to an already synchronized and running copy of production. It achieves high ROI by enabling workloads to be offloaded from production to the replicated copy. As important as these tangible benefits are, there is the equally significant benefit of reducing risk. By running workloads at the replicated copy you are performing continuous application-level validation that it is ready for production when needed.

So what is left to address after Gold? There is a class of application where it is desirable to mask the effect of an outage from the end user. Imagine you are a customer in the process of making a purchase online - you don't want to be left in an uncertain state should there be a database outage. Did my purchase go through? Do I resubmit my payment, and if I do, will I be charged twice? From a data loss perspective, what if the DR site is hundreds or thousands of miles away - how do you guarantee zero data loss protection? Finally, this same class of application frequently cannot tolerate downtime for planned maintenance - how can you shrink maintenance windows to zero so that applications can be available at all times? To learn how to address this set of requirements, stay tuned for the final installment in this MAA series when we cover the Platinum reference architecture.

Thursday Sep 11, 2014

Oracle MAA Part 3: Silver HA Reference Architecture

This is the third installment in a series of blog posts describing Oracle Maximum Availability Architecture (Oracle MAA) best practices that define four standard reference architectures for data protection and high availability: BRONZE, SILVER, GOLD and PLATINUM.  Each reference architecture uses an optimal set of Oracle HA capabilities that reliably achieve a given service level (SLA) at the lowest cost.

This article provides details for the Silver reference architecture.

Silver builds upon Bronze by adding clustering technology - either Oracle RAC or RAC One Node. This enables automatic failover if there is an unrecoverable outage of a database instance or a complete failure of the server on which it runs. Oracle RAC also delivers substantial benefit by eliminating downtime for many types of planned maintenance. It does this by performing maintenance in a rolling manner across Oracle RAC nodes so that services remain available at all times. As in the case of Bronze, RMAN provides database-optimized backups to protect data and restore availability should an outage prevent the cluster from being able to restart. An overview of Silver is provided in the figure below.

Silver HA Reference Architecture

Oracle RAC

Oracle RAC is an active-active clustering solution that provides instantaneous failover should there be an outage of a database instance or of the server on which it runs. A quick review of how Oracle RAC functions helps to understand its many benefits. There are two major components to any Oracle RAC cluster: Oracle Database instances and the Oracle Database itself.

  • A database instance is defined as a set of server processes and memory structures running on a single node (or server) which make a particular database available to clients.
  • The database is a particular set of shared files (data files, index files, control files, and initialization files) that reside on persistent storage, and together can be opened and used to read and write data.
  • Oracle RAC uses an active-active architecture that enables multiple database instances, each running on different nodes, to simultaneously read and write to the same database.

Oracle RAC is the MAA best practice for server HA and provides a number of advantages:

  • Improved HA: If a server or database instance fails, connections to surviving instances are not affected; connections to the failed instance are quickly failed over to surviving instances that are already running and open on other servers in the cluster.
  • Scalability: Oracle RAC is ideal for applications with high workloads or consolidated environments where scalability and the ability to dynamically add or reprioritize capacity are required. Additional servers, database instances, and database services can be provisioned online. The ability to easily distribute workload across the cluster makes Oracle RAC an ideal solution when Oracle Multitenant is used for database consolidation.
  • Reliable performance: Oracle Quality of Service (QoS) can be used to allocate capacity for high priority database services to deliver consistent high performance in database consolidated environments. Capacity can be dynamically shifted between workloads to quickly respond to changing requirements.
  • HA during planned maintenance: High availability is maintained by implementing changes in a rolling manner across Oracle RAC nodes. This includes hardware, OS, or network maintenance that requires a server to be taken offline; software maintenance to patch the Oracle Grid Infrastructure or database; or if a database instance needs to be moved to another server to increase capacity or balance the workload.
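As an illustration of rolling maintenance with database services, the srvctl sketch below shows the general shape of defining a service and draining a node before taking it offline. The database, service, and instance names are hypothetical, and exact option names vary by Oracle Database version.

```
# Define a service with preferred and available instances
srvctl add service -db orcl -service sales \
    -preferred orcl1,orcl2 -available orcl3

# Before patching node 1, relocate its service to another instance;
# new connections land on the surviving instances while node 1 drains
srvctl relocate service -db orcl -service sales \
    -oldinst orcl1 -newinst orcl3
```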

Oracle RAC One Node

RAC One Node provides an alternative to Oracle RAC when scalability and instant failover are not required. A RAC One Node license is one-half the price of an Oracle RAC license, providing a lower cost option when an RTO of minutes is sufficient for server outages.

RAC One Node is an active-passive failover technology. During normal operation it only allows a single database instance to be open at one time. If the server hosting the open instance fails, RAC One Node automatically starts a new database instance on a second node to quickly resume service.

RAC One Node provides several advantages over alternative active-passive clustering technologies.

  • It automatically responds to both database instance and server failures.
  • Oracle Database HA Services, Grid Infrastructure, and database listeners are always running on the second node. At failover time only the database instance and database services need to start, reducing the time required to resume service, and enabling service to resume in minutes. 
  • It provides the same advantages for planned maintenance as Oracle RAC. RAC One Node allows two active database instances during periods of planned maintenance to allow graceful migration of users from one node to another with zero downtime; database services remain available to users at all times.
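For planned maintenance, the online relocation described above is a single command. This is a sketch: the database and node names are hypothetical, and the timeout is the number of minutes existing sessions are given to drain before the original instance is stopped.

```
# Migrate a RAC One Node database to another server with no downtime;
# both instances run until sessions have drained or the timeout expires
srvctl relocate database -db orcl -node node2 -timeout 30
```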

Silver = Better HA

Silver represents a significant increase in HA compared to the Bronze reference architecture and is very well suited to a broad range of application requirements.  Oracle RAC immediately responds to an instance or server outage and reconnects users to surviving instances.

While Silver has substantial benefits, it is only one step above Bronze - there is still a much broader fault domain beyond instance or server failure. This includes events that can impact the availability of an entire cluster - data corruptions, storage array failures, bugs, human error, site outages, etc. There is also a class of application where the impact of outages must be completely transparent to the user. Stay tuned for future installments when we address this expanded set of requirements with the Gold and Platinum reference architectures.

Wednesday Jul 30, 2014

Welcome to the MAA Blog!!

Welcome to the MAA blog! This set of blogs is created and maintained by members of Oracle’s Maximum Availability Architecture (MAA) team within Oracle’s Server Technology Development group. The MAA team interacts with Oracle’s customers around the world on various critical high availability (HA) initiatives, and with this blog forum we hope to bring you musings on some of the rich experiences we have gained to date. Our goal is to enrich the Oracle ecosystem with an interesting, informative and interactive conversation around Oracle MAA.

Please refer to the MAA website on OTN - http://www.oracle.com/goto/maa - for the latest collection of Oracle MAA best practices.

Ashish Ray


Wednesday Jul 16, 2014

Oracle MAA Part 2: Bronze HA Reference Architecture

In the first installment of this series we discussed how one size does not fit all when it comes to HA architecture. We described Oracle Maximum Availability Architecture (Oracle MAA) best practices that define four standard reference architectures for data protection and high availability: BRONZE, SILVER, GOLD and PLATINUM.  Each reference architecture uses an optimal set of Oracle HA capabilities that reliably achieve a given service level (SLA) at the lowest cost. As you progress from one level to the next, each architecture expands upon the one that preceded it in order to handle an expanded fault domain and deliver a high level of service.

This article provides details for the Bronze reference architecture.

Bronze is appropriate for databases where simple restart or restore from backup is ‘HA enough’. It uses single instance Oracle database (no cluster) to provide a very basic level of HA and data protection in exchange for reduced cost and implementation complexity. An overview is provided in the figure below.

Bronze Reference Architecture

When a database instance or the server on which it is running fails, the recovery time objective (RTO) is a function of how quickly the database can be restarted to resume service. If a database is unrecoverable, the RTO becomes a function of how quickly a backup can be restored. In the worst case of a complete site outage, additional time is required to provision new systems and perform these tasks at a secondary location; in some cases this can take days.

The potential data loss from an unrecoverable outage (the recovery point objective, or RPO) is equal to the data generated since the last backup was taken. Copies of database backups are retained locally and at a remote location or in the Cloud, for the dual purpose of archival and DR should a disaster strike the primary data center.

Major components of the Bronze reference architecture and the service levels achieved include:

Oracle Database HA and Data Protection

  • Oracle Restart automatically restarts the database, the listener, and other Oracle components after a hardware or software failure or whenever a database host computer restarts.
  • Oracle corruption protection checks for physical corruption and logical intra-block corruptions. In-memory corruptions are detected and prevented from being written to disk and in many cases can be repaired automatically. For more details see Preventing, Detecting, and Repairing Block Corruption.
  • Automatic Storage Management (ASM) is an Oracle-integrated file system and volume manager that includes local mirroring to protect against disk failure.
  • Oracle Flashback Technologies provide fast error correction at a level of granularity that is appropriate to repair an individual transaction, a table, or the full database.
  • Oracle Recovery Manager (RMAN) enables low-cost, reliable backup and recovery optimized for the Oracle Database.
  • Online maintenance includes online redefinition and reorganization for database maintenance, online file movement, and online patching. 
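To give a feel for two of these capabilities, the sketch below shows a typical Bronze-style RMAN backup and a Flashback repair. This is illustrative only: the table name and time windows are hypothetical, and Flashback features must be enabled and sized in advance.

```sql
-- RMAN: database-optimized incremental backup plus archived logs
BACKUP INCREMENTAL LEVEL 1 CUMULATIVE DATABASE PLUS ARCHIVELOG;

-- Flashback: repair at the granularity of the error, without a restore
FLASHBACK TABLE hr.employees
  TO TIMESTAMP SYSTIMESTAMP - INTERVAL '15' MINUTE;
```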

Database Consolidation

  • Databases deployed using Bronze often include development and test databases and databases supporting smaller work group and departmental applications that are often the first candidates for database consolidation.
  • Oracle Multitenant is the MAA best practice for database consolidation from Oracle Database 12c onward.

Life Cycle Management

  • Oracle Enterprise Manager Cloud Control enables self-service deployment of IT resources for business users, along with resource pooling models that cater to various multitenant architectures. It supports Database as a Service (DBaaS), a paradigm in which end users (database administrators, application developers, quality assurance engineers, project leads, and so on) can request database services, consume them for the lifetime of a project, and then have them automatically de-provisioned and returned to the resource pool.

Oracle Engineered Systems

  • Oracle Engineered Systems are an efficient deployment option for database consolidation and DBaaS. Oracle Engineered Systems reduce lifecycle cost by standardizing on a pre-integrated and optimized platform for Oracle Database that is completely supported by Oracle.

Bronze Summary:  Data Protection, RTO, and RPO

Table 1 summarizes the data protection capabilities and service levels provided by the Bronze tier. The first column indicates when validations for physical and logical corruption are performed:

  • Manual checks are initiated by the administrator or at regular intervals by a scheduled job.
  • Runtime checks are automatically executed on a continuous basis by background processes while the database is open.
  • Background checks are run on a regularly scheduled interval, but only during periods when resources would otherwise be idle.
  • Each check is unique to Oracle Database using specific knowledge of Oracle data block and redo structures.

Table 1: Bronze - Data Protection

Type       | Capability        | Physical Block Corruption                        | Logical Block Corruption
Manual     | Dbverify, Analyze | Physical block checks                            | Logical checks for intra-block and inter-object consistency
Manual     | RMAN              | Physical block checks during backup and restore  | Intra-block logical checks
Runtime    | Database          | In-memory block and redo checksum                | In-memory intra-block logical checks
Runtime    | ASM               | Automatic corruption detection and repair using local extent pairs
Runtime    | Exadata           | HARD checks on write                             | HARD checks on write
Background | Exadata           | Automatic hard disk scrub and repair

Note that HARD validation and Automatic Hard Disk Scrub and Repair (the last two rows of Table 1) are unique to Exadata storage. HARD validation ensures that Oracle Database does not write physically corrupt blocks to disk. Automatic Hard Disk Scrub and Repair periodically inspects and repairs hard disks with damaged or worn-out disk sectors or other physical or logical defects, running when there are idle resources.
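The runtime database checks described above are governed by initialization parameters. A commonly recommended combination looks like the following sketch; validate the performance overhead of these settings against your own workload and your version's documentation before changing production parameters.

```sql
-- Checksum every block and redo record written (detects physical corruption)
ALTER SYSTEM SET DB_BLOCK_CHECKSUM = FULL SCOPE=BOTH;

-- Perform in-memory logical block checks as blocks are changed
ALTER SYSTEM SET DB_BLOCK_CHECKING = MEDIUM SCOPE=BOTH;

-- Detect lost writes (particularly useful with Data Guard)
ALTER SYSTEM SET DB_LOST_WRITE_PROTECT = TYPICAL SCOPE=BOTH;
```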

Table 2 summarizes RTO and RPO for the Bronze tier for various unplanned and planned outages.

Table 2: Bronze - Recovery Time and Data Loss Potential

Type      | Event                                                                                | Downtime           | Data Loss Potential
Unplanned | Database instance failure                                                            | Minutes            | Zero
Unplanned | Recoverable server failure                                                           | Minutes to an hour | Zero
Unplanned | Data corruptions, unrecoverable server failure, database failures, or site failures  | Hours to days      | Since last backup
Planned   | Online file move, online reorganization and redefinition, online patching           | Zero               | Zero
Planned   | Hardware or operating system maintenance and database patches that cannot be performed online | Minutes to hours | Zero
Planned   | Database upgrades: patch sets and full database releases                             | Minutes to hours   | Zero
Planned   | Platform migrations                                                                  | Hours to a day     | Zero
Planned   | Application upgrades that modify back-end database objects                           | Hours to days      | Zero

So when would you use Bronze? Bronze is useful when users can wait for a backup to be restored after an unrecoverable outage, and can accept that any data generated since the last backup will be lost. The Oracle Database includes a number of capabilities, described above, that provide unique levels of data protection and availability for a low-cost environment based upon the Bronze reference architecture.

But what if you can't accept this level of downtime or data loss potential? That is where the Silver, Gold and Platinum reference architectures come in. Bronze is only a starting point that establishes the foundation for subsequent HA reference architectures that provide a higher quality of service. Stay tuned for future blog posts that will dive into the details of each reference architecture.

Thursday May 29, 2014

Oracle MAA Part 1: When One Size Does Not Fit All

The good news is that Oracle Maximum Availability Architecture (MAA) best practices combined with Oracle Database 12c (see video) introduce first-in-the-industry database capabilities that truly make unplanned outages and planned maintenance transparent to users. The trouble with such good news is that Oracle’s enthusiasm in evangelizing its latest innovations may leave some to wonder if we’ve lost sight of the fact that not all database applications are created equal. After all, many databases don’t have the business requirements for high availability and data protection that require all of Oracle’s ‘stuff’. For many real-world applications, a controlled amount of downtime and/or data loss is OK if it saves money and effort.


Well, not to worry. Oracle knows that enterprises need solutions that address the full continuum of requirements for data protection and availability. Oracle MAA accomplishes this by defining four HA service level tiers: BRONZE, SILVER, GOLD and PLATINUM. The figure below shows the progression in service levels provided by each tier.




Each tier uses a different MAA reference architecture to deploy the optimal set of Oracle HA capabilities that reliably achieve a given service level (SLA) at the lowest cost.  Each tier includes all of the capabilities of the previous tier and builds upon the architecture to handle an expanded fault domain.



  • Bronze is appropriate for databases where simple restart or restore from backup is ‘HA enough’. Bronze is based upon a single instance Oracle Database with MAA best practices that use the many capabilities for data protection and HA included with every Oracle Enterprise Edition license. Oracle-optimized backups using Oracle Recovery Manager (RMAN) provide data protection and are used to restore availability should an outage prevent the database from being able to restart.

  • Silver provides an additional level of HA for databases that require minimal or zero downtime in the event of database instance or server failure as well as many types of planned maintenance. Silver adds clustering technology - either Oracle RAC or RAC One Node. RMAN provides database-optimized backups to protect data and restore availability should an outage prevent the cluster from being able to restart.

  • Gold raises the game substantially for business critical applications that can’t accept vulnerability to single points-of-failure. Gold adds database-aware replication technologies, Active Data Guard and Oracle GoldenGate, which synchronize one or more replicas of the production database to provide real time data protection and availability. Database-aware replication greatly increases HA and data protection beyond what is possible with storage replication technologies. It also reduces cost while improving return on investment by actively utilizing all replicas at all times.

  • Platinum introduces all of the sexy new Oracle Database 12c capabilities that Oracle staff will gush over with great enthusiasm. These capabilities include Application Continuity for reliable replay of in-flight transactions that masks outages from users; Active Data Guard Far Sync for zero data loss protection at any distance; new Oracle GoldenGate enhancements for zero downtime upgrades and migrations; and Global Data Services for automated service management and workload balancing in replicated database environments. Each of these technologies requires additional effort to implement. But they deliver substantial value for your most critical applications where downtime and data loss are not an option.


The MAA reference architectures are inherently designed to address conflicting realities. On one hand, not every application has the same objectives for availability and data protection – the Not One Size Fits All title of this blog post. On the other hand, standard infrastructure is an operational requirement and a business necessity in order to reduce complexity and cost.


MAA reference architectures address both realities by providing a standard infrastructure optimized for Oracle Database that enables you to dial-in the level of HA appropriate for different service level requirements. This makes it simple to move a database from one HA tier to the next should business requirements change, or from one hardware platform to another – whether it’s your favorite non-Oracle vendor or an Oracle Engineered System.


Please stay tuned for additional blog posts in this series that dive into the details of each MAA reference architecture.


Meanwhile, more information on Oracle HA solutions and the Maximum Availability Architecture can be found at:






Tuesday Aug 13, 2013

CAP: Consistency and Availability except when Partitioned - Part 2

Follow-up on the CAP theorem, exploring the consistency and availability tradeoffs available to well-designed systems.

[Read More]

Friday Jul 19, 2013

The CAP theorem: Consistency and Availability except when Partitioned

In recent years NoSQL databases have justified providing eventual or other weak read/write consistency as an inevitable consequence of Brewer's CAP Theorem. This relies on the simplistic interpretation that because distributed systems can't avoid Partitions, they have to give up Consistency in order to offer Availability. This reasoning is flawed - CAP does allow systems to maintain both Consistency and Availability during the majority of the time when there is no Partition, and good distributed systems strive to maximize both C and A while accounting for P. You do not have to give up consistency as a whole just to gain scalability and availability.

[Read More]

Friday Sep 21, 2012

To SYNC or not to SYNC – Part 4

This is Part 4 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration. In Part 3, I talked in details regarding why Data Guard SYNC is a good thing, and the distance implications you have to keep in mind.


In this final article of the series, I will talk about how you can nicely complement Data Guard SYNC with the ability to failover in seconds.


Wait - Did I Say “Seconds”?


Did I just say that some customers do Data Guard failover in seconds? Yes, Virginia, there is a Santa Claus.


Data Guard has an automatic failover capability, aptly called Fast-Start Failover. Initially available with Oracle Database 10g Release 2 for Data Guard SYNC transport mode (and enhanced in Oracle Database 11g to support Data Guard ASYNC transport mode), this capability, managed by Data Guard Broker, lets your Data Guard configuration automatically failover to a designated standby database. Yes, this means no human intervention is required to do the failover. This process is controlled by a low footprint Data Guard Broker client called Observer, which makes sure that the primary database and the designated standby database are behaving like good kids. If something bad were to happen to the primary database, the Observer, after a configurable threshold period, tells that standby, “Your time has come, you are the chosen one!” The standby dutifully follows the Observer directives by assuming the role of the new primary database. The DBA or the Sys Admin doesn’t need to be involved.
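In Data Guard Broker terms, enabling this behavior is only a few DGMGRL commands. The sketch below is illustrative: the connect string and threshold value are placeholders, and the configuration must already include a synchronized standby for zero data loss failover.

```
DGMGRL> CONNECT sys@primary_db
DGMGRL> EDIT CONFIGURATION SET PROTECTION MODE AS MaxAvailability;
DGMGRL> EDIT CONFIGURATION SET PROPERTY FastStartFailoverThreshold = 30;
DGMGRL> ENABLE FAST_START FAILOVER;
DGMGRL> START OBSERVER;
```

The threshold is the number of seconds the Observer waits after losing contact with the primary before initiating an automatic failover.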


And - in case you are following this discussion very closely, and are wondering … “Hmmm … what if the old primary is not really dead, but just network isolated from the Observer or the standby - won’t this lead to a split-brain situation?” The answer is No - It Doesn’t. With respect to why-it-doesn’t, I am sure there are some smart DBAs in the audience who can explain the technical reasons. Otherwise - that will be the material for a future blog post.


So - this combination of SYNC and Fast-Start Failover is the nirvana of lights-out, integrated HA and DR, as practiced by some of our advanced customers. They have observed failover times (with no data loss) ranging from single-digit seconds to tens of seconds. With this, they support operations in industry verticals such as manufacturing, retail, telecom and Internet that have the most demanding availability requirements.


One of our leading customers with massive cloud deployment initiatives tells us that they learn about server failures only after Data Guard has automatically completed the failover process and the app is back up and running! Needless to say, Data Guard Broker has the integration hooks for interfaces such as JDBC and OCI, and even for custom apps, to ensure the application gets automatically rerouted to the new primary database after the database-level failover completes.
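For instance, a client-side connect descriptor can list both sites, so that new connections land on whichever database currently runs a role-based service - all host and service names below are hypothetical, for illustration only:

```
sales =
  (DESCRIPTION =
    (CONNECT_TIMEOUT = 5)(RETRY_COUNT = 3)
    (ADDRESS_LIST =
      (FAILOVER = ON)
      (ADDRESS = (PROTOCOL = TCP)(HOST = primary-host)(PORT = 1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST = standby-host)(PORT = 1521)))
    (CONNECT_DATA = (SERVICE_NAME = sales_rw)))
```

If the sales_rw service is started only on the database that is in the primary role, a connection attempt simply walks the address list and ends up at the new primary after a failover.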


Net Net?


To sum up this multi-part blog article, Data Guard with SYNC redo transport mode, plus Fast-Start Failover, gives you the ideal triple combo - that is, it gives you the assurance that for critical outages, you can fail over your Oracle databases:



  1. very fast

  2. without human intervention, and

  3. without losing any data.


In short, it takes the element of risk out of critical IT operations. It does require you to be more careful with your network and systems planning, but as far as HA is concerned, the benefits outweigh the investment costs.


So, this is what we in the MAA Development Team believe in. What do you think? How has your deployment experience been? We look forward to hearing from you!

Monday Sep 10, 2012

To SYNC or not to SYNC – Part 3

I can't believe it has been almost a year since my last blog post. I know, that's an absolute no-no in the blogosphere. And I know that "I have been busy" is not a good excuse. So - without trying to come up with an excuse - let me state this - my apologies for taking such a long time to write the next Part.


Without further ado, here goes.






This is Part 3 of a multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC). In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In Part 2, I discussed the various ways that network latency may or may not impact a Data Guard SYNC configuration.


In this article, I will talk in detail about why Data Guard SYNC is a good thing. I will also talk about the distance implications of setting up such a configuration.


So, Why Good?


Why is Data Guard SYNC a good thing? Because, at the end of the day, it gives you the assurance of zero data loss - no matter what outage may befall your primary system. Befall! Boy, that sounds theatrical. But seriously - think about this - it minimizes your data risks. That's a big deal. Whether you have an outage due to bad disks, faulty hardware components, hardware / software bugs, physical data corruptions, power failures, lightning that takes out a significant part of your data center, fire that melts your assets, water leakage from the cooling system, or human errors such as accidental deletion of online redo log files - it doesn't matter - you can have that "Om - peace" look on your face, fail over to the standby system, and not lose a single bit of data in your Oracle database. You will be a hero, as shown in this not-so-imaginary conversation:


IT Manager: Well, what’s the status?

You: John is doing the trace analysis on the storage array.

IT Manager: So? How long is that gonna take?

You: Well, he is stuck, waiting for a response from <insert your not-so-favorite storage vendor here>.

IT Manager: So, no root cause yet?

You: I told you, he is stuck. We have escalated with their Support, but you know how long these things take.

IT Manager: Darn it - the site is down!

You: Not really …

IT Manager: What do you mean?

You: John is stuck, but Sreeni has already done a failover to the Data Guard standby.

IT Manager: Whoa, whoa - wait! Failover means we lost some data, why did you do this without letting the Business group know?

You: We didn't lose any data. Remember, we had set up Data Guard with SYNC? So now, any problem on production - we just fail over. No data loss, and we are up and running in minutes. The Business guys don't need to know.

IT Manager: Wow! Are we great or what!!

You: I guess …


Ok, so you get it - SYNC is good. But as my dear friend Larry Carpenter says, "TANSTAAFL", or "There ain't no such thing as a free lunch". Yes, of course - investing in Data Guard SYNC means that you have to invest in a low-latency network, you have to monitor your applications and databases, especially under peak load conditions, and you cannot under-provision your standby systems. But all these are good and necessary things if you are supporting mission-critical apps that are supposed to be running 24x7. The peace of mind that this investment will give you is priceless, especially if you are serious about HA.


How Far Can We Go?


Someone may say at this point - well, I can't use Data Guard SYNC over my coast-to-coast deployment. Most likely - true. So how far can you go? Well, we have customers who have deployed Data Guard SYNC over 300+ miles! Does this mean that you can also deploy over similar distances? Duh - no! I am going to say something here that most IT managers don't like to hear - "It depends!" It depends on your application design, application response time / throughput requirements, network topology, etc. However, because of the optimal way we do SYNC, customers have been able to stretch Data Guard SYNC deployments over longer distances compared to traditional, storage-centric ways of doing this. The MAA Database 10.2 best practices paper Data Guard Redo Transport & Network Configuration and the Oracle Database 11.2 High Availability Best Practices manual talk about some of these SYNC-related metrics. For example, a test deployment of Data Guard SYNC over 330 miles with 10 ms latency showed an impact of less than 5% on a busy OLTP application.


Even if you can't deploy Data Guard SYNC over your WAN distance, or if you already have an ASYNC standby located thousands of miles away, here's another nifty way to boost your HA. Have a local standby, configured SYNC. How local is "local"? Again - it depends. One customer runs a local SYNC standby across the campus. Another customer runs it across 15 miles in another data center. Both of these customers are running Data Guard SYNC as their HA standard. If a localized outage affects their primary system, no problem! They have all the data available on the standby, to which they can fail over. Very fast. In seconds.
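Such a combination - a local SYNC standby for HA plus a remote ASYNC standby for DR - can be sketched with two redo transport destinations on the primary (the names near_stby and far_stby are hypothetical):

```sql
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
  'SERVICE=near_stby SYNC AFFIRM NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=near_stby';

ALTER SYSTEM SET LOG_ARCHIVE_DEST_3 =
  'SERVICE=far_stby ASYNC NOAFFIRM
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=far_stby';
```

The local SYNC destination gives you zero-data-loss failover in seconds; the remote ASYNC destination protects you from a regional disaster.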


Wait - did I say "seconds"? Yes, Virginia, there is a Santa Claus. But you have to wait till the next blog article to find out more. I assure you, though, that this time you won't have to wait for another year for it.


Tuesday Sep 20, 2011

To SYNC or not to SYNC – Part 2


It’s less than two weeks from Oracle OpenWorld! We are going to have an exciting set of sessions from the Oracle HA Development team. Needless to say, all of us are a wee bit busy these days. I think that’s just the perfect time for Part 2 of this multi-part blog article where we are discussing various aspects of setting up Data Guard synchronous redo transport (SYNC).


In Part 1 of this article, I debunked the myth that Data Guard SYNC is similar to a two-phase commit operation. In case you are wondering what the truth is, and don’t have time to read the previous article, the answer is - No, Data Guard synchronous redo transport is NOT the same as two-phase commit.


Now, let’s look into how network latency may or may not impact a Data Guard SYNC configuration.


LATEncy


The network latency issue is a valid concern. That’s a simple law of physics. We have heard of the term “lightspeed” (remember Star Wars?), but still - as you know from your high school physics days, light takes time to travel. So the acknowledgement from RFS back to NSS will take some milliseconds to traverse the network, and that is typically proportional to the network distance.


Actually - it is both network latency and disk I/O latency. Why disk I/O latency? Remember, on the standby database, RFS is writing the incoming redo blocks to the disk-resident SRLs. This is governed by the AFFIRM attribute of the log_archive_dest parameter corresponding to the standby database. We had one customer whose SYNC performance on the primary was suffering because of an improperly tuned standby storage system.
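To make this concrete - AFFIRM (together with SYNC) is set as part of the redo transport destination on the primary; a minimal sketch, with a hypothetical db_unique_name of stby:

```sql
ALTER SYSTEM SET LOG_ARCHIVE_DEST_2 =
  'SERVICE=stby SYNC AFFIRM NET_TIMEOUT=30
   VALID_FOR=(ONLINE_LOGFILES,PRIMARY_ROLE) DB_UNIQUE_NAME=stby';
```

With AFFIRM, RFS sends its acknowledgement only after the redo has been written to the SRL on disk - which is exactly why slow standby storage shows up as slow commits on the primary.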


However, for most cases, network latency is likely to be the gating factor - for example, refer to this real-time network latency chart from AT&T - http://ipnetwork.bgtmo.ip.att.net/pws/network_delay.html. At the time of writing this blog, US coast-to-coast latency (SF to NY) is shown to be around 75 ms. Trans-Atlantic latency is shown to be around 80 ms, whereas Trans-Pacific latency is shown to be around 140 ms. Of course, you can measure the latency between your own primary and standby servers using utilities such as "ping" and "traceroute".


Here is some good news - in Oracle Database 11g Release 2, the write to local online redo logs (by LGWR) and the remote write through the network layer (by NSS) happen in parallel. So we do get some efficiency through these parallel local write and network send operations.


Still - you have to make the determination whether the commit operations issued by your application can tolerate the network latency. Remember - if you are testing this out, do it under peak load conditions. Obviously latency will have minimal impact on a read-mostly application (which generates little redo). There are also two elements of application impact - your application response time, and your overall application throughput. For example, your application may have a heavy interactive mode - especially if this interaction happens programmatically (e.g. a trading application accessing an authentication application that is in turn configured with Data Guard SYNC). In such cases, measuring the impact on the application response time is critical. However, if your application has enough parallelism built in, you may notice that overall throughput doesn't degrade much with higher latencies. In the database layer, you can measure this with the redo generation rate before and after configuring synchronous redo transport (using AWR).
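As a quick sketch, the current redo generation rate can also be eyeballed outside of a full AWR report via V$SYSMETRIC (group_id = 2 picks the 60-second interval metrics):

```sql
SELECT metric_name, ROUND(value, 1) AS value_per_sec
FROM   v$sysmetric
WHERE  metric_name IN ('Redo Generated Per Sec', 'User Commits Per Sec')
AND    group_id = 2;
```

Comparing these numbers before and after enabling SYNC, under comparable load, tells you whether transport is throttling your redo generation.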


Not all Latencies are Equal


The cool thing about configuring synchronous redo transport in the database layer is just that - we do it in the database layer, and we send just the redo blocks. Imagine if you had configured it in the storage layer instead. All the usual database file structures - data files, online redo logs, archived redo logs, flashback logs, the control file - that get updated as part of normal database activity would have to be synchronously updated across the network. You would have to closely monitor the performance of database checkpointing in that case! We discuss these aspects in this OTN article.


So Why Bother?


So where are we? I stated that Data Guard synchronous redo transport does not have the overhead of two-phase-commit - so that’s good, and at the same time I stated that you have to watch out for network latency impact because of simple laws of physics - so that’s not so good - right? So, why bother, right?


This is why you have to bother - Data Guard synchronous redo transport, and hence - the zero data loss assurance, is a good thing! But to appreciate fully why this is a good thing, you have to wait for the next blog article. It’s coming soon, I promise!


For now, let me get back to my session presentation slides for Oracle OpenWorld! See you there!


Wednesday Sep 07, 2011

Key HA-related sessions at Oracle OpenWorld

At this year's Oracle OpenWorld, the Database High Availability and various related Development teams have put together several highly informative sessions, hands-on labs and demos for various components of our HA / MAA solution set. They include sessions on tips & tricks, solution and integration best practices, and span product areas such as clustering, disaster recovery, backup & recovery, replication, storage, Exadata, etc.

A comprehensive datasheet featuring information on some of these sessions / labs / demos is available as a PDF file through the OTN HA site. It includes a quick reference guide at the end - handy to print as a single page.

Customers interested in database high availability & data protection should use this datasheet as a planning guide for attending some of these sessions. As you may notice, some high-profile customers have already committed to be co-speakers for some of these sessions, featuring their implementation case studies and lessons learned. So this presents excellent learning and networking opportunities for the attendees.




Tuesday Sep 06, 2011

To SYNC or not to SYNC – Part 1

Zero Data Loss – Nervously So?

As part of our Maximum Availability Architecture (MAA) conversations with customers, one issue that is often discussed is the capability of zero data loss in the event of a disaster. Naturally, this offers the best RPO (Recovery Point Objective), as far as disaster recovery (DR) is concerned. The Oracle solution that is a must-have for this is Oracle Data Guard, configured for synchronous redo transport. However, whenever the word “synchronous” is mentioned, the nervousness barometer rises. Some objections I have heard:

  • “Well, we don’t want our application to be impacted by network hiccups.”
  • “Well, what Data Guard does is two-phase-commit, which is so expensive!”
  • “Well, our DR data center is on the other coast, so we can’t afford a synchronous network.”

And a few others.

Some of these objections are valid, some are not. In this multi-part blog series, I will address these concerns, and more. In this particular blog, which is Part 1 of this series, I will debunk the myth that Data Guard synchronous redo transport is similar to two-phase commit.

SYNC != 2 PC

Let’s be as clear as possible. Data Guard synchronous redo transport (SYNC) is NOT two-phase-commit. Unlike distributed transactions, there is no concept of a coordinator node initiating the transaction, there are no participating nodes, there are no prepare and commit phases working in tandem.

So what really happens with Data Guard SYNC? Let’s look under the covers.

Upon every commit operation in the database, the LGWR process flushes the redo buffer to the local online redo logs - this is the standard way the Oracle database operates. With Data Guard SYNC, in addition, the LGWR process tells the NSS process on the primary database to make these redo blocks durable on the standby database's disk as well. Until LGWR hears back from NSS that the redo blocks have been written successfully at the standby location, the commit operation is held up. That's what provides the zero data loss assurance. The local storage on the primary database gets damaged? No problem. The bits are available on the standby storage.

But how long should LGWR wait to hear back from NSS? Well, that is governed by the NET_TIMEOUT attribute of the log_archive_dest parameter corresponding to the standby. Once LGWR hears back from NSS that life is good, the commit operation completes.
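You can verify what the primary believes about a destination - transport mode, AFFIRM setting, NET_TIMEOUT and current status - straight from V$ARCHIVE_DEST; a sketch for destination 2:

```sql
SELECT dest_id, status, transmit_mode, affirm, net_timeout
FROM   v$archive_dest
WHERE  dest_id = 2;
```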

Now, let’s look into how the NSS process operates. Upon every commit, the NSS process on the primary database dutifully sends the committed redo blocks to the standby database, and then waits till the RFS process on the standby receives them, writes them on disk on the standby (in standby redo logs or SRLs), and then sends the acknowledgement back to the NSS process.

So - on the standby database, what’s happening is just disk I/O to write the incoming redo blocks into the SRLs. This should not be confused with two-phase-commit, and naturally this process is much faster compared to a distributed transaction involving two-phase-commit coordination.

In case you are wondering what happens to these incoming redo blocks in the SRLs - well, they get picked up - asynchronously, by the Managed Recovery Process (MRP) as part of Redo Apply, and the changes get applied to the standby data files in a highly efficient manner. But this Redo Apply process is a completely separate process from Redo Transport - and that is an important thing to remember whenever these two-phase-commit questions come up.
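On the standby, Redo Apply with real-time apply (so that MRP applies redo straight from the SRLs as it arrives) is started with a command along these lines - 11g syntax shown; later releases make real-time apply the default behavior:

```sql
ALTER DATABASE RECOVER MANAGED STANDBY DATABASE
  USING CURRENT LOGFILE DISCONNECT FROM SESSION;
```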

Now that you are convinced that Data Guard SYNC is not the same as two-phase commit, in the next blog article, I will talk about impact of network latency on Data Guard SYNC redo transport.


About

Musings on Oracle's Maximum Availability Architecture (MAA), by members of the Oracle Development team. Note that we may not have the bandwidth to answer generic questions on MAA.
