Part 1: Framework for Evaluating Database High Availability Capabilities

Historically, the Oracle Maximum Availability Architecture (MAA) blog has tended to focus on specific technologies. I suppose that isn’t a surprise given the nature of MAA, but I wanted to take a step back for a moment and look at the bigger picture of what makes Oracle High Availability (HA) so reliable that many companies quite literally depend on it to maintain business continuity. This can be a tricky subject, as terms like “matchless” and “greatness” have to be put into perspective. I think it is relatively safe to say that the Oracle Database is very well known for its “reliability”, and I tend to hear the words “Enterprise Quality” or “Enterprise Ready” when people refer to the Oracle Database.

What defines a reliable database?

The answer to this question can be broken down into two terms with which many of us are intimately familiar:

  1. Recovery Time Objective (RTO) – How much downtime can I afford for my application?
    • RTO is measured from the time the application becomes unavailable to its workload until the time it is available again. It is usually dictated by the cost of downtime that the business can sustain before it suffers a material adverse impact.
  2. Recovery Point Objective (RPO) – How much data can I afford to lose in the case of an outage? 
    • RPO is measured as an interval going backwards in time from the outage. For example, if the RPO is 24 hours, then up to 24 hours of data could be irretrievable in the case of a major outage.
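
To make the two definitions concrete, here is a minimal Python sketch that measures the downtime and the data-loss window of a single outage and compares them against RTO and RPO targets. The `Outage` class, the function names, and all timestamps and targets are hypothetical and chosen purely for illustration; they are not part of any Oracle tool or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Outage:
    """Timestamps describing one outage (all values here are hypothetical)."""
    last_recoverable_data: datetime  # newest data that survives the outage
    outage_start: datetime           # application becomes unavailable
    service_restored: datetime       # application is available again


def downtime(outage: Outage) -> timedelta:
    """Actual recovery time: from unavailability until service is restored."""
    return outage.service_restored - outage.outage_start


def data_loss_window(outage: Outage) -> timedelta:
    """Actual data loss: measured backwards in time from the outage."""
    return outage.outage_start - outage.last_recoverable_data


def meets_objectives(outage: Outage, rto: timedelta, rpo: timedelta) -> bool:
    """True only if both the RTO and the RPO targets were met."""
    return downtime(outage) <= rto and data_loss_window(outage) <= rpo


# Illustrative scenario: the last usable backup was taken at 02:00, the
# outage began at 14:30, and the application was back at 18:30.
outage = Outage(
    last_recoverable_data=datetime(2024, 1, 10, 2, 0),
    outage_start=datetime(2024, 1, 10, 14, 30),
    service_restored=datetime(2024, 1, 10, 18, 30),
)

print(downtime(outage))          # 4:00:00
print(data_loss_window(outage))  # 12:30:00
print(meets_objectives(outage, rto=timedelta(hours=1), rpo=timedelta(hours=24)))  # False
```

In this made-up scenario the 24-hour RPO is met, since only 12.5 hours of data are lost, but the 1-hour RTO is missed by three hours, so the outage as a whole fails its objectives. The two measures are independent and have to be checked separately.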

The interesting thing about the two terms above is that many in the IT world don’t become familiar with them until they have experienced them the hard way, slogging for hours or even days to get an application back up and running. This of course damages the business, in many cases through lost revenue and always through its reputation. Having lived through this type of disaster, seasoned professionals come out with a much clearer sense of what they need in terms of both RTO and RPO. They also come to understand the repercussions of not thinking about these terms in advance when building out the application and overall system architecture.

This tends to be where the Oracle Database comes in. These seasoned professionals know all too well the pain of failing to meet critical RTO and RPO goals for their front-line applications, and Oracle is very well known for helping to meet these objectives. Especially in today’s uncertain world, Maximum Availability Architecture (MAA) is an essential data management framework for IT professionals. It maps the business RTO and RPO goals to the right Oracle Database HA and DR feature sets, applied with optimized best practices, so that business continuity can be maintained.

Before I get into MAA, let’s look at what one needs to consider to ensure that RTO and RPO requirements are met, keeping in mind that there is a host of different disasters that can unfold, as well as planned maintenance to account for. So, let’s start by listing some of the common and rare events that one may want to consider when looking at RTO and RPO requirements:

 

DOWNTIME TYPE: PLANNED

Typical causes:

  • Security updates and fixes
  • System and Database updates and fixes
  • Application updates
  • Migrations

Examples:

  • Server software updates
  • Database software updates or upgrades
  • Data reorganization or changes
  • Application upgrades and optimizations
  • Application & database tier migrations
  • Hardware upgrades

DOWNTIME TYPE: UNPLANNED

Typical causes:

  • Server failures
  • Instance failures
  • Logical data disasters
  • Site disasters

Examples:

  • Disk failure
  • Storage failure
  • Human error
  • Data corruption
  • Lost writes
  • Hang or slow down
  • Network failures
  • Power failures
  • Natural disasters

As can be seen above, there are myriad ways that things can go wrong in an unplanned outage, but equally concerning are all of the routine planned maintenance activities and other events that can impact RTO. The trick is to mitigate most if not all of these causes without complicating the architecture to the point that it becomes unmanageable. Ideally, that means the technology addressing these challenges is integrated and covers both common and uncommon situations. Reliability is the key here, and rare corner-case events can be just as devastating as common ones, if not more so. This is where the Oracle Database shines in regard to High Availability and Disaster Recovery.

With the above causes in mind, how can database platforms be compared in a meaningful way? Because many of these causes call for overlapping solution approaches, one can establish broader category groups that tackle the most common and severe causes of downtime and data loss:

  • Human Errors & Logical Corruptions
  • Computer/Storage Failures & Site Disasters
  • Database & Infrastructure Patching
  • Database Upgrades
  • Application Upgrades

Using these common causes of downtime and data loss as a framework, part 2 of this blog post (coming soon) will examine how and why Oracle is unique in its integrated approach to High Availability and Disaster Recovery, in comparison to the non-integrated and non-database-specific approach that the broader database industry often uses to tackle these challenges.

For more information, please take a look at:

https://www.oracle.com/database/technologies/high-availability.html

You can also follow us for new updates on Twitter at @OracleMAA or check out the MAA Webcast Series via the link below: