Part 2: Comparing Database High Availability Approaches and Solutions
In part 1 of this blog post series, I explored the fundamentals of High Availability and Disaster Recovery, along with some of the common and uncommon causes of downtime and data loss that can impact business continuity. I also defined a number of categories that need to be addressed in relation to these causes. From that perspective, one can compare different solutions and their merits with regard to addressing the various categories of downtime and data loss.
How does Oracle Database provide matchless reliability?
Oracle Database has been designed and developed with both broad causes of downtime in mind: planned and unplanned. Situations such as database patching, human error, server and hardware updates, and less common outages have each been considered from different angles. A dedicated team (the MAA database team) recreates outage events in the lab, for example by pulling cables out of running systems, to verify that demanding RTO and RPO targets for Oracle Database can be met when best practices are followed.
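To make the RPO side of those objectives concrete, here is a minimal sketch of the arithmetic for a recovery strategy that relies on backups alone: the data at risk is whatever was written between the most recent backup and the failure. The function name `data_loss_window` and the nightly schedule are assumptions made up for illustration, not part of any Oracle tooling.

```python
from datetime import datetime, timedelta


def data_loss_window(backup_times, failure_time):
    """Return the time between the most recent backup and the failure.

    With backup-only recovery, work done in this window cannot be restored.
    """
    earlier = [t for t in backup_times if t <= failure_time]
    if not earlier:
        raise ValueError("no backup was taken before the failure")
    return failure_time - max(earlier)


# Hypothetical schedule: one backup per day at midnight.
start = datetime(2024, 1, 1)
nightly = [start + timedelta(days=d) for d in range(3)]

# Hypothetical failure late in the evening of the second day.
failure = datetime(2024, 1, 2, 23, 30)
loss = data_loss_window(nightly, failure)
print(loss)
```

In this made-up schedule the failure lands shortly before the next backup, so nearly a full day of changes is exposed. The worst case for backup-only recovery is always the full interval between backups.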
Let’s take a look at some of the rather unique aspects of Oracle Database HA & DR in the context of some of the more common use cases that span across both planned application maintenance and unexpected outages:
- Human Errors & Logical Corruptions:
- Common Database Provider Approach: Human errors (e.g. accidentally deleting rows or an entire table) are extremely common and can have severe consequences. In one well-known case at GitLab, an engineer mistakenly deleted data on the primary database, believing it was the secondary. With other database solutions, the only recourse an application or database administrator often has in this case is to restore from a backup, which can result in data loss (e.g. an RPO of 12 or 24 hours, depending on backup frequency). Even worse, database solutions that rely on replication for protection copy these human errors and logical corruptions to the standby copies, with no automated mechanism to recover without even more painful downtime.
- Oracle Solution: While backup and restore is the default answer on many database platforms, Oracle addresses logical corruptions such as those caused by human error with Oracle Flashback. Flashback technology, which is built into the database, is essentially a “rewind button” for a specific transaction, a table, or the entire database, depending on the need. No complete restore from backup is required, and the same mechanism covers a range of situations, provided Flashback is properly configured in the environment. Further details on the Flashback approach can be found in the previous blog post here.
- Computer/Storage Failures & Disasters:
- Common Database Provider Approach: The traditional solution to computer and storage failures is either to restore from backup or to switch to a replicated copy of the database. This is routinely accompanied by significant downtime, spanning many minutes or even hours. It is especially problematic where the replicated database service relies on remote storage mirroring: disk and file corruption and other non-database issues on the system can easily be “replicated” into the standby environment as well, forcing a fallback to a longer restore procedure that may also result in data loss.
- Oracle Solution: While other solutions may restore from backup or provide some level of storage replication via remote mirroring, Oracle Data Guard and Active Data Guard provide a fully integrated replication solution without the downside of remote storage mirroring described above. Data Guard’s in-memory replication validates redo as part of the process, which prevents corruption from propagating and avoids carrying disk-level artifacts over into the standby database environment. Several protection modes are offered, covering both synchronous and asynchronous replication, so protection can be balanced against performance. Combined with the built-in Application Continuity feature, protection extends to in-flight transactions, which can be automatically replayed on the standby database. In these cases the failover is invisible from the application’s perspective; the only impact end users may notice is a brief performance hiccup (i.e. a brownout) while the transaction is replayed.
- Database & Infrastructure Patching:
- Common Database Provider Approach: For the majority of database solutions, downtime is required to perform database patching. Normally, both the underlying infrastructure and the database itself need time for the patch to be applied, plus the downtime needed to switch to another database instance or transition to a standby environment, if one is available. In-flight transactions are not replayed, so application processing has to be paused while the workload moves to the other instance and the database is opened read/write.
- Oracle Solution: Oracle Real Application Clusters (RAC) solves this problem with true clustering: sessions on the instance being patched are drained and, if necessary after the appropriate timeout, in-flight transactions are replayed on another instance. With the RAC Rolling Patching feature in Oracle Database, the cluster is patched one instance at a time while the remaining instances stay up and continue to service the application workload, so from the application’s perspective the RTO for patching is close to zero.
- Database Upgrades:
- Common Database Provider Approach: Database upgrades are routinely much more involved than simple patching operations and can be done in-place or out-of-place. Either way, considerable planning is involved, and it is not uncommon for a downtime window to be identified during planning, because the database is brought offline for a period of time for an in-place upgrade. Even an out-of-place upgrade requires reallocating key resources and configuration once the application is up and running, and if the new database is not kept synchronized with the original via replication, downtime may be required to keep additional workload from arriving during the upgrade. Once the database upgrade is complete, the application is brought down and restarted against the newly upgraded instance. Downtime has to be planned carefully, as there may be no simple mechanism to transition back to the previous version if needed.
- Oracle Solution: Oracle Active Data Guard provides a feature called Rolling Upgrades, which greatly simplifies database upgrades by converting a physical standby into a transient logical standby. Replication keeps the databases synchronized during the upgrade, which reduces downtime to near zero without the need to bring down the application. Once the upgrade is complete on the transient logical standby, the workload is transitioned to it (making it the new primary) and the former primary is then upgraded. Not only is downtime reduced to near zero, but the process is automated with built-in validation to help the upgrade go smoothly, while Application Continuity ensures that sessions are properly drained throughout.
- Application Upgrades:
- Common Database Provider Approach: When an application is upgraded, it is quite common for the new version to require upgraded database objects, such as stored procedures and schema changes. This requires the application to be taken offline while changes are applied to the application and the database together. As a result, production workload cannot easily be tested against the newly upgraded application infrastructure, and an emergency rollback is challenging at best, requiring even more downtime.
- Oracle Solution: Oracle Database allows multiple editions of an application’s database objects to exist simultaneously through a built-in feature called Edition-Based Redefinition (EBR). Because the original and upgraded editions can be up and running at the same time, with applications referring to “Editioning Views” and other editioned database objects rather than physical tables, application workload can be cleanly load balanced across both versions (if desired). This allows just a subset of users to test the new version of the application before the entire workload is moved over. Once application administrators are comfortable that the new version handles the production workload as intended, they can transition the entire workload, make the new edition the default, and retire the original edition. The result is zero downtime for the application, along with validation and testing to ensure a smooth upgrade to the new version.
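The rolling approach described under Database & Infrastructure Patching above can be pictured with a small simulation. This is only a toy model of the general "one instance at a time" idea, not Oracle's implementation: the `Instance` class, the `rolling_patch` function, and the instance names and version labels are all hypothetical and do not correspond to any RAC API or tool.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """One database instance in a toy cluster (hypothetical model, not an Oracle API)."""
    name: str
    version: str
    up: bool = True


def rolling_patch(cluster, new_version):
    """Patch instances one at a time.

    Returns the lowest number of instances that were serving workload
    at any point during the run.
    """
    lowest_serving = sum(inst.up for inst in cluster)
    for inst in cluster:
        others_up = sum(other.up for other in cluster if other is not inst)
        if others_up == 0:
            # Refuse to proceed rather than leave the workload with nowhere to go.
            raise RuntimeError("no other instance is up to take over the workload")
        inst.up = False              # drain sessions, then stop only this instance
        lowest_serving = min(lowest_serving, others_up)
        inst.version = new_version   # apply the patch while the others keep serving
        inst.up = True               # rejoin the cluster before moving on
    return lowest_serving


# Made-up instance names and version labels, for illustration only.
cluster = [Instance(f"inst{n}", "v1") for n in (1, 2, 3)]
lowest = rolling_patch(cluster, "v2")
print(lowest, [inst.version for inst in cluster])
```

In this three-instance run, two instances are serving at every step, which is the property that keeps the application available throughout. A single-instance "cluster" is rejected because there would be nowhere for the workload to go while the patch is applied.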
The above scenarios are just a sampling of the more common cases where database HA & DR are critical; as discussed earlier, there are many other, less common scenarios to consider as well. One other very important point is that the features above are built into Oracle Database and fully integrated with one another. Together they provide a level of reliability that is unmatched in the industry and, alongside superior performance and scalability, has been one of the main reasons Oracle has dominated the Database Management space for years. This didn’t come by accident. It was well earned, with the features above evolving naturally to meet customer demand and a dedicated team of engineers working around the clock to validate the rock-solid reliability of the associated Maximum Availability Architecture (MAA) blueprints. These blueprints are designed to provide the optimal configuration of Oracle’s HA and DR feature portfolio while maintaining flexibility across on-premises deployments, engineered systems (e.g. Exadata), and of course the Oracle Cloud database service offerings such as Exadata Cloud Service, Autonomous Database, and Exadata Cloud @ Customer, for which they serve as a foundation.
With that, I will conclude this blog post. However, stay tuned: subsequent blog posts on this subject will continue to explore the uniqueness of Oracle High Availability solutions and the evolving vision in comparison to other database platforms on the market today. This will include deeper dives into some of the bedrock features of the platform (e.g. RAC, Active Data Guard, Multitenant, Fleet Patching & Provisioning, and other feature sets), along with the design principles and architecture that went into making them pillars of the Oracle HA solution.
You can also follow us for new updates on Twitter at @OracleMAA or check out the MAA Webcast Series.