One of our customers recently asked me about cross-region and cross-AD (Availability Domain) database capabilities in Oracle Cloud. In this article, we will cover read-replicas, deploying databases across multiple Availability Domains and Regions, and some of the Hybrid Cloud deployment capabilities that Oracle customers are using. We will also touch on the built-in redundancy of the Exadata Cloud Service and how those capabilities affect the role played by local vs. remote Disaster Recovery systems. Finally, we’re going to look at the current state of automation related to these capabilities in the Oracle Cloud.
Oracle supports up to 30 read-replicas from a single primary database, as well as cascading replicas, and these full capabilities are available in the Oracle Cloud. The guidance with Data Guard in Oracle Cloud is to use the tooling where possible, or manage Data Guard directly if you need additional functionality beyond what's automated. See the end of this article for a discussion of the automation layer.
When Data Guard was first introduced, it was designed as a Disaster Recovery solution that maintained a standby database that could be activated in the event of a disaster. Later, Oracle introduced Active Data Guard, which allowed those standby databases to be opened in read-only mode, and added support for multiple read-replicas. Oracle also supports cascading replicas, where the primary copy of a database feeds one or more replicas, and those replicas feed other replicas.
The ability to configure read replicas is useful in some cases, but Exadata Cloud does not require them for scalability. Exadata Cloud Service can scale to 3,200 vCPU, which is 25X the 128 vCPU that Amazon RDS can handle for Oracle databases. The ability to support 30 read replicas (using Active Data Guard) is currently 6X the 5 that Amazon allows with RDS/Oracle. Amazon requires read replicas to address scalability issues, whereas customers can choose when to use them in Exadata Cloud Service.
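The two multipliers quoted above follow directly from the figures in the paragraph. A quick check, using only those numbers (the variable names are illustrative):

```python
# Figures quoted in the paragraph above.
exadata_vcpu = 3200   # Exadata Cloud Service maximum vCPU
rds_vcpu = 128        # Amazon RDS for Oracle maximum vCPU
adg_replicas = 30     # read replicas supported by Active Data Guard
rds_replicas = 5      # read replicas allowed with RDS/Oracle

vcpu_ratio = exadata_vcpu / rds_vcpu
replica_ratio = adg_replicas / rds_replicas

print(f"vCPU ratio: {vcpu_ratio:g}X")        # vCPU ratio: 25X
print(f"replica ratio: {replica_ratio:g}X")  # replica ratio: 6X
```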
For more information on Oracle Active Data Guard, see link here.
“Writable” Read-Replicas (DML Redirection)
We learned a long time ago that applications sometimes appear to be read-only but are really “read-mostly” and perform some small amount of updating via DML (INSERT, UPDATE, DELETE). One example is a reporting application that also includes something like user profile information. The main activity is read-only, but there is a small amount of DML against the database to maintain that user profile data. Those applications cannot use a read-only copy of data because those changes would not be allowed. Oracle solved this problem by effectively making read-replicas writable for small volumes of update activity.
The DML Redirection feature of Oracle Active Data Guard allows Data Manipulation Language (DML) operations against what is otherwise a read-only database. Any DML statements (such as INSERT, UPDATE, and DELETE) are redirected to the primary database and those changes flow back to the replica copies via the normal flow.
DML redirection is useful because it allows these read-mostly applications to work against a replica, while eliminating the need to execute a complex process to flip the database into read/write mode, then back to read-only for the apply process to catch up. Those databases can simply be left in read-only mode and the small amount of DML gets redirected back to the primary transparently.
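The flow described above can be pictured with a small toy model. This is only a conceptual sketch in Python, not Oracle code or an Oracle API: the `Primary` and `ReadReplica` classes, the key/value "rows", and the in-memory redo list are all hypothetical stand-ins chosen for illustration.

```python
class Primary:
    """Toy stand-in for the primary database: the only place rows are changed."""

    def __init__(self):
        self.rows = {}
        self.redo = []  # ordered change records that replicas apply

    def execute_dml(self, key, value):
        self.rows[key] = value
        self.redo.append((key, value))


class ReadReplica:
    """Toy stand-in for a read-only replica that redirects DML to the primary."""

    def __init__(self, primary):
        self.primary = primary
        self.rows = {}
        self.applied = 0  # number of redo records applied so far

    def apply_redo(self):
        # Normal replication flow: apply changes generated on the primary.
        for key, value in self.primary.redo[self.applied:]:
            self.rows[key] = value
        self.applied = len(self.primary.redo)

    def query(self, key):
        return self.rows.get(key)

    def execute_dml(self, key, value):
        # The replica never modifies its own data directly. It forwards the
        # change to the primary, then picks it up through the normal flow.
        self.primary.execute_dml(key, value)
        self.apply_redo()


primary = Primary()
replica = ReadReplica(primary)

# A "read-mostly" application updates a user profile while connected to the replica.
replica.execute_dml("user:42:theme", "dark")
print(primary.rows["user:42:theme"], replica.query("user:42:theme"))  # dark dark
```

The point of the sketch is that the application only ever talks to the replica, yet the change is made on the primary and arrives at the replica the same way every other change does.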
For more information about Active Data Guard DML Re-Direction, see here.
Realms, Regions, and Availability Domains
All Oracle Cloud regions are grouped into realms, including commercial, government, and an even higher security realm for Department of Defense. There is a single commercial realm, and the majority of Oracle Cloud Regions reside in that realm. As of this writing, there are 30 Oracle Cloud Regions in total. See the link here for more details on the services available in each Region. Oracle also has a number of Dedicated Regions that are only available to specific customers. Another option for remote sites is Oracle Roving Edge Infrastructure for smaller “edge” deployments or remote locations (see here for more details).
The largest Oracle Cloud Regions include multiple Availability Domains, including the Frankfurt, London, Ashburn, and Phoenix Regions. See the link here for the latest list of regions and Availability Domains within regions. Availability Domains within a region do not share infrastructure such as power, cooling, or network, so they are unlikely to fail simultaneously. Each Availability Domain is a separate data center within a given region, allowing full separation of data center infrastructure, but still being close enough for low-latency communication.
Multi-Availability Domain (AD)
A primary Oracle database can have up to 30 read-replicas using Active Data Guard, and these replicas can be placed in different regions and/or Availability Domains (as well as in different Clouds or at customer sites in a multi-Cloud or hybrid-Cloud model). For regions that have multiple Availability Domains, communication between ADs is very fast. Applications can reside within one AD and the database residing in that AD could be switched to a replica in another AD within the same region. Communication latency between ADs should be fast enough (low enough latency) to allow cross-AD operation. Deploying replica(s) of a primary within another AD in the same Region will also affect choices for synchronous vs. asynchronous propagation.
See the link here for the latest list of regions and Availability Domains within regions.
Low Latency Read Replicas
Latency is really a function of physical distance, so Multi-AD deployment of replicas is ideal for cases where low latency is required. If you are putting a replica in another region, in a separate Cloud, or between on-premises and Cloud deployments, be sure to consider the distances involved and the network performance available. The Oracle/Azure Interconnect is being built in locations where there is a short physical distance between Oracle and Azure Cloud regions, so we can achieve low latency. We can have up to 30 read-replicas for a given primary database, but the physical location is going to determine the latency.
Read Replicas can be placed within the same Availability Domain or across ADs using a Multi-AD deployment model with extremely low latency. See the link here for the latest list of regions and Availability Domains within regions.
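To get a feel for how distance sets a floor on latency, here is a rough back-of-the-envelope sketch. It assumes light in optical fiber covers roughly 200 km per millisecond (about two-thirds of the speed of light in a vacuum); the distances are made-up examples, not measurements between any actual Oracle Cloud sites, and real network paths add routing and equipment delay on top of this floor.

```python
# Assumption: signal speed in optical fiber is roughly 200 km per millisecond.
FIBER_KM_PER_MS = 200.0


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on round-trip time for a given one-way cable distance."""
    return 2 * distance_km / FIBER_KM_PER_MS


# Hypothetical distances, chosen only to show the scale of the effect.
examples = [
    ("same metro area", 50),
    ("neighboring region", 1000),
    ("intercontinental", 6000),
]

for label, km in examples:
    print(f"{label}: at least {min_round_trip_ms(km):.1f} ms round trip")
# same metro area: at least 0.5 ms round trip
# neighboring region: at least 10.0 ms round trip
# intercontinental: at least 60.0 ms round trip
```

Every synchronous commit pays at least one such round trip, which is why the placement of a replica matters more than the number of replicas.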
Sync vs. Async Propagation
Oracle Active Data Guard provides synchronous and asynchronous propagation of changes, but the need goes much deeper than that. Customers want changes to be propagated synchronously UNLESS the distance is too far, or the communication gets slowed for some reason. Customers want changes to be propagated synchronously UNLESS there is some sort of failure. Customers also need the ability to fine-tune whether the changes simply get propagated versus applied at the remote site.
Rather than a one-size-fits-all approach, Oracle Active Data Guard has different protection modes that can be selected based on the specific situation. The protection modes available are Maximum Protection, Maximum Availability, and Maximum Performance.
Maximum Availability mode also has options to balance performance vs. protection, including the sync/affirm versus sync/noaffirm options. Oracle Active Data Guard has been developed and enhanced over multiple generations to meet the wide variety of needs customers have. Data Guard also supports Far Sync, allowing for protection of databases with zero data loss at any distance. See the link here for more information on Data Guard Protection Modes.
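The trade-off between the modes can be summarized as a small decision function. This is a simplified sketch of the behavior as generally described for Data Guard's three protection modes (Maximum Protection, Maximum Availability, and Maximum Performance), not a reproduction of Oracle's implementation; the function name, the outcome strings, and the single `standby_acknowledged` flag are illustrative assumptions, and details such as affirm/noaffirm and Far Sync are left out.

```python
from enum import Enum


class Mode(Enum):
    MAX_PROTECTION = "Maximum Protection"
    MAX_AVAILABILITY = "Maximum Availability"
    MAX_PERFORMANCE = "Maximum Performance"


def commit_outcome(mode: Mode, standby_acknowledged: bool) -> str:
    """Simplified view of what happens to a commit on the primary."""
    if mode is Mode.MAX_PERFORMANCE:
        # Asynchronous: the commit never waits for the standby.
        return "commit; changes shipped asynchronously"
    if standby_acknowledged:
        # Synchronous modes, standby reachable: zero data loss.
        return "commit; changes confirmed at standby"
    if mode is Mode.MAX_AVAILABILITY:
        # Synchronous UNLESS there is a failure: keep the primary running.
        return "commit; standby out of sync until it catches up"
    # Maximum Protection: never commit without a standby confirmation.
    return "halt primary; no commit without standby confirmation"


for mode in Mode:
    for ack in (True, False):
        print(f"{mode.value:<21} ack={ack!s:<5} -> {commit_outcome(mode, ack)}")
```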
For greater separation, better protection against disasters, or for regions that lack multiple ADs, Active Data Guard can also operate across multiple Oracle Cloud Regions. With 30 regions to choose from around the world (currently), it’s difficult to give a specific number in terms of latency between any two regions. The regions involved and the needs of the application will decide which Data Guard protection mode is appropriate.
See the link here for more details on the services available in each Region.
Oracle Active Data Guard is also used by customers in Hybrid Cloud deployments, so we are not limited to propagating changes between Oracle Cloud regions. For example, we have customers running production databases (HUNDREDS of Terabytes in size) in their on-premises data centers, with Active Data Guard replicas in Oracle Cloud regions.
See this blog on deploying a Hybrid Cloud Architecture for Oracle Database.
Multiple Fault Domains
Within each Oracle Cloud Region and Availability Domain, there are also multiple fault domains. Fault Domains ensure that individual faults cannot disrupt operations of specific instances. Each Oracle Availability Domain contains 3 fault domains. Each Exadata system component leverages these separate fault domains by being connected to redundant power sources and redundant networks to ensure the built-in redundancy of Exadata operates as expected.
Fault Domains in Oracle Cloud are described at the link here.
Built-In Redundancy of Exadata Cloud Service
Exadata systems in the Oracle Cloud include built-in redundancy at all levels to provide for extremely high availability. Among this built-in redundancy are the following:
Every Exadata Cloud Service system includes a minimum of 2 database servers for redundancy, and customers can deploy even more redundant database servers if desired. All of the database servers work together against a single copy of the database, so it is not relying on multiple replicas at this layer. While there is a single copy of the database, the data within the database is stored on triple-redundant storage. Exadata Cloud Service uses HIGH redundancy, which means there are 3 copies of each block of data stored across Exadata Storage Servers. Exadata Cloud Service can sustain multiple simultaneous failures, including up to 24 disks and 8 Flash cards across any 2 Exadata Storage Servers and still continue operating.
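To see why three copies of each block tolerate the loss of any 2 storage servers, consider the following toy placement model. It is a sketch only: the round-robin placement, the server count of 5, and the function names are assumptions made for illustration and do not describe how Exadata actually lays out data.

```python
from itertools import combinations


def place_copies(num_blocks: int, num_servers: int, copies: int = 3) -> dict:
    """Put each block on `copies` distinct servers (simple round-robin layout)."""
    if num_servers < copies:
        raise ValueError("need at least as many servers as copies")
    return {
        block: {(block + i) % num_servers for i in range(copies)}
        for block in range(num_blocks)
    }


def survives(placement: dict, failed_servers: set) -> bool:
    """True if every block still has at least one copy on a healthy server."""
    return all(servers - failed_servers for servers in placement.values())


NUM_SERVERS = 5  # hypothetical number of storage servers
placement = place_copies(num_blocks=12, num_servers=NUM_SERVERS)

# Try every possible pair of failed storage servers.
any_two_ok = all(
    survives(placement, set(pair))
    for pair in combinations(range(NUM_SERVERS), 2)
)
print(any_two_ok)                        # True
print(survives(placement, {0, 1, 2}))    # False: block 0 lost all three copies
```

Because every block lives on three different servers, no pair of failures can remove all of its copies; only a third simultaneous failure on exactly the wrong servers could.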
Exadata Cloud Service also keeps redundant layers of storage within each Storage Server for additional protection. Data in Persistent Memory and Flash storage is also copied to disk for additional protection, and Exadata automatically detects and repairs blocks of data internally.
The built-in redundancy of Exadata Cloud Service is outlined in the data sheet here.
Exadata Cloud Service deploys the same underlying technology as Exadata for On-Premises.
REAL Active/Active Clusters
In the Oracle lexicon, the word “cluster” means a single database simultaneously running on multiple servers or instances in an active/active manner with Oracle Real Application Clusters (RAC). While some other vendors refer to a “cluster” as a set of replicated databases or sometimes “shared nothing” or sharded databases, only Oracle offers the ability to run a single database simultaneously across the nodes of a real compute cluster. That is, one single database being simultaneously accessed and updated by processes running on multiple compute instances. Other databases lack a feature similar to RAC and other Cloud vendors don't support RAC so they use multiple separate databases, which introduces data synchronization problems.
Oracle Real Application Clusters is a fundamental component of the Exadata Cloud Service architecture. For more information on RAC specifically, see the link here.
Database High Availability
The built-in redundancy of Exadata Cloud Service and support for Oracle Real Application Clusters enables High Availability of each database, which is a key attribute of the Oracle MAA (Maximum Availability Architecture) Silver Level and higher classifications. The MAA Outage Matrix for Silver Level outlines the expected RTO (Recovery Time Objective) and RPO (Recovery Point Objective) during both planned maintenance and unplanned outages.
|Outage Type|Event|RPO|RTO|
|Unplanned Outage|Recoverable Node or Instance Failure|Zero|Zero*|
|Planned Maintenance|Software/Hardware Updates|Zero|Zero*|
Note*: Applications must meet the requirements for Application Continuity in order to achieve an RTO of Zero.
The ability for a database to continue operating without interruption and without data loss under these conditions reduces the need for a Multi-AD (local standby) or cross-Region DR (Disaster Recovery) deployment. Oracle recommends Multi-AD and/or Cross-Region DR deployments for MAA GOLD level deployments, which can be done for the specific databases that require an extra level of protection.
The ability to provide HA for each database without relying on replication (such as Multi-AD or Cross-Region deployments) dramatically improves application availability without data loss, and without loss of service.
For more information on MAA Reference Architectures, see link here.
Handling Application Connections
What’s even more important than fault detection and repair is the handling of database connections. Simply dropping connections and forcing applications to reconnect can be extremely disruptive. While the database might be okay, the application might suffer an outage as it gets re-started and must reconnect. Failover to a Local Standby (in a Multi-AD deployment) can be done with zero data loss (RPO=Zero) but does impact application connections. Applications can operate in a cross-AD fashion (application in one AD with database in another AD), but re-establishing connections is typically an outage from the end-user perspective.
For more information on Oracle Application Continuity, see link here.
Handling Large Numbers of Reader Connections
One common use of read-replicas or “reader farms” is the ability to support large numbers of reader process connections. The fundamental limitation these deployments are trying to get around is the lack of compute scalability. While it is possible to create read-replicas of databases on Oracle Exadata Cloud Service, another option is to simply scale up Exadata Cloud Service. There is less need to create “reader farms” because you can deploy more compute against the same database. Oracle Exadata Cloud Service currently allows up to 32 Database Servers/instances with 3,200 vCPU for a SINGLE database. Yes, you can also deploy multiple read-replicas (up to 30 of them) with Oracle Exadata Cloud Service, but it’s important to know you have choices.
Fast Fault Detection (and Recovery) on Exadata
While some vendors are touting their ability to detect and recover from a fault within a minute or two, Oracle developers are dealing at the level of seconds, milliseconds, and microseconds in Exadata Cloud Service. Faults (such as loss of a database instance) can be detected on Exadata Cloud Service within as little as 3 milliseconds and there can be a slowdown of several seconds to recover from those faults. Note, we’re talking about a “slowdown”, not an outage, and not a database restart in another location.
Sharding is another way to distribute data across instances, regions, availability domains, and fault domains. Sharding was introduced in Oracle Database 12c, so it has been around for quite a while and has gotten very sophisticated. In a sharded environment, you have one database logically, but built out of multiple databases that operate independently. The largest sharded database application I personally worked on has 90 shards of around 25 TB each, or 2.25 Petabytes in total. While Oracle Database on Exadata can easily scale much higher than 25 TB, there are some significant advantages if the database can be broken down into smaller parts that operate independently.
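The sizing arithmetic above, and the basic idea of routing a row to one of many independent databases, can be sketched as follows. This is an illustration only: the simple hash-modulo routing and the `shard_for` helper are assumptions for the example and are not Oracle Sharding's actual distribution method, and the total uses decimal units (1 PB = 1,000 TB) to match the figure quoted above.

```python
import hashlib

NUM_SHARDS = 90      # shard count from the example above
SHARD_SIZE_TB = 25   # approximate size of each shard


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard number with a stable hash (illustrative)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


total_tb = NUM_SHARDS * SHARD_SIZE_TB
print(f"{total_tb} TB = {total_tb / 1000} PB")  # 2250 TB = 2.25 PB

# The same key always lands on the same shard, so each shard can serve
# its own subset of the data independently of the others.
for customer in ("customer-1001", "customer-1002", "customer-1003"):
    print(customer, "-> shard", shard_for(customer))
```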
For more information on Oracle Sharding, see the link here.
Logical Data Replication
I first started working with logical data replication using what was originally called Oracle Symmetric Replication, which was based on triggered events within the database. The state of the art these days is logical replication based on capture of changes from the database redo logs using Oracle GoldenGate. The redo log is a single place where all database changes are concentrated. It’s more efficient, faster, and more scalable than the older “trigger based” mechanism. Oracle GoldenGate is our state-of-the-art logical data replication solution, and it is heterogeneous as well, working across different database engines. Be aware that some vendors are actually advocating the old-fashioned trigger-based replication, which was outdated technology many years ago.
For more information on Oracle GoldenGate, see link here.
Storage Replication Under Databases
Many of us at Oracle have worked with customers who used storage replication technologies underneath databases and have seen the disadvantages first-hand. While replication at the storage layer is fine for some use-cases, it can be extremely problematic for databases. The first issue is that the number of changes at the storage layer is magnified by the way databases operate (this is not specific to Oracle DB!). Storage replication will dutifully replicate ALL changes to data, whether it’s needed or not. Storage replication will even replicate temporary “sort/work” areas of a database, which serves no purpose. Consider the fact that databases write block changes, undo changes, and redo changes, and you will find that storage replication will dutifully replicate as much as 3X more change than propagating changes via the database transaction log.
In addition to the simple volume of changes that need to be replicated at the storage layer, there is also no respect for transaction boundaries or for where the replication process lies in the stream of database changes. Because of this fundamental operating principle, storage replication will replicate all changes to databases including corruption, without any ability to validate whether those changes are correct or not.
Fast Start Failover
The Oracle Fast Start Failover feature allows a Read-Replica to assume the role of primary database in the event the primary database fails. The default setting for the Fast Start Failover Threshold is 30 seconds but can be set as low as 6 seconds. If the primary database runs on Exadata Cloud Service, any instance failure will normally be handled by failover within the Exadata machine itself. It would be extremely rare for an entire Exadata Cloud Service system to fail at once, so the default setting of 30 seconds is correct for most use-cases.
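The role of the threshold can be sketched as a simple timing check. This is a simplified model, not Oracle's observer logic: the function names are hypothetical, and the only facts taken from the text above are the 30-second default and the 6-second minimum.

```python
DEFAULT_THRESHOLD_S = 30  # default Fast Start Failover Threshold, per the text above
MIN_THRESHOLD_S = 6       # lowest setting mentioned in the text above


def validate_threshold(seconds: int) -> int:
    """Reject thresholds below the documented minimum."""
    if seconds < MIN_THRESHOLD_S:
        raise ValueError(f"threshold must be at least {MIN_THRESHOLD_S} seconds")
    return seconds


def should_fail_over(seconds_without_primary: float,
                     threshold_s: int = DEFAULT_THRESHOLD_S) -> bool:
    """True once the primary has been unreachable for the whole threshold."""
    return seconds_without_primary >= validate_threshold(threshold_s)


# A 10-second interruption (for example, a brief network blip) is ignored
# with the default threshold but triggers failover at the minimum setting.
print(should_fail_over(10.0))                 # False
print(should_fail_over(10.0, threshold_s=6))  # True
print(should_fail_over(30.0))                 # True
```

A lower threshold reacts faster but is more likely to fail over during a transient interruption, which is why the default tends to suit systems that already handle most failures locally.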
For more information on Fast Start Failover (FSFO), see here.
Some of the concepts discussed in this article are used by other vendors as a workaround to scalability issues with their products or services. It is important to consider the fact that Oracle Database on Exadata Cloud Service scales up to an incredible level compared to other vendor solutions. Exadata Cloud Service currently scales to the following level:
With a current limit of 32 Database Servers and 64 Storage Servers, Exadata Cloud Service scales much larger than other offerings. For example, Amazon RDS is currently limited to 128 vCPU and 64TB of storage. Exadata Cloud Service delivers up to 25X more vCPU and 49X more storage.
While it is possible to scale this large, the point is that you have options. If you would rather deploy a 100TB database with multiple Read-Replicas, or multiple 75TB shards rather than a single database, those certainly are options you have with Exadata Cloud Service.
Disaster Recovery Automation
Oracle Cloud currently automates the use of Data Guard for Disaster Recovery, including use of the Standby databases for read-only purposes (Read-Replica). Exadata Cloud Service relies on built-in redundancy (including Real Application Clusters) to deliver High Availability, and the extreme scalability of up to 3,200 vCPU per database (or 6,400 vCPU total with 1 read-only Standby) reduces the need for large numbers of read replicas. Use of Data Guard in Exadata Cloud Service is not limited to the functionality built into the automation, but the automation certainly makes it easier to use and will be expanded in the future.
For more information on using the automation of standby (Disaster Recovery) databases using Oracle Data Guard with Exadata Cloud Service, see link here.
For more information on Oracle Autonomous Data Guard, see the blog post here.
Chris Craft is Senior Director of Product Management, focused on advanced Oracle database technologies including Oracle Converged Database, Exadata, Exadata Cloud Services, Autonomous Database, and Oracle’s Zero Data Loss Recovery Appliance. Mr. Craft has served in multiple roles within Oracle including On-Site Customer Support, Service Delivery Management, Sales Consulting, and Product Management. His experience ranges from application development, data modeling, and performance benchmarking to managing customer relations, and leading teams of technical professionals. Mr. Craft has worked with over 500 Oracle customers across virtually all industries and including many of the largest customers in the world.