When thinking about high availability (HA) for their database, people routinely think about protection from instance failures, node failures, and even more disastrous events such as data center failures. However, one category of "failures" that is equally disastrous but commonly overlooked are "logical corruptions", a.k.a. application or human errors.
Any simple but incorrect data manipulation can cause logical corruption. For example, an almost humorous but potentially disastrous situation caused by a human error may involve promoting all of a company’s employees in the HR system to Vice President by accidently leaving out the where clause in an update statement. In order to resolve the issue, the database usually needs to be stopped in order to recover and prevent any further dependent corruptions to the data which routinely results in noticeable downtime for both the database and the application.
It is known that these types of human errors happen from time to time in spite of the best precautions. That is the way of life in today’s world with its fast moving pace where application and data changes are often required with very short notice. The question then becomes quite simple, how best restore from such a failure?
This question can be answered in several ways. One way of answering this question depends on how the compromised database was protected in the first place (e.g. using backups only, replication or a combination thereof). For the sake of this discussion, we will consider both the Oracle Database and to provide some contrast, Google Spanner.
For the longest time, Google Spanner has claimed that they do not even need backups as they maintain a sufficient number of “replications” for their database so that a database as a whole and even sub-sets of data are unlikely to fail at once. However, it is still possible for Google Spanner administrators to create backups if they opt in for that capability, but Google Spanner replication is the default approach. Replication by itself cannot be seen as an ideal solution for this case and many other HA and DR situations which will be examined in a bit more detail later in this article.
From the Oracle Database perspective, digging into how to protect a database properly from the holistic HA & DR perspective is beyond the focus of this blog post (for those interested, please check out the MAA Webcast Series for a great starting point). Instead let’s focus on the challenges of resolving logical data corruptions caused by human and application errors.
If we consider the logical corruption event above, the updated data would be valid from a SQL transaction perspective. Based on that, it is quite likely that further damage could result as subsequent statements might be based on the logically corrupted data. Regardless of the level and depth of the logical corruption, no errors will be seen from the database perspective as this is a normal successful operation. However, the same cannot be said from the perspective of the application or business which will no doubt see this logical corruption as a serious issue.
Consequently, any restore and recovery operation will need to work on very granular pieces of data, likely across a set of different database objects (tables, indices perhaps). Unfortunately, many of the common can be used to recover from such an issue would not be ideal solutions in this case. Let’s look at a couple examples involving the common approach that one might use to address these types of logical corruptions from a different database technology. Specifically, we will use Google Spanner’s approach to better understand functional gaps to address logical corruptions via the common approach and to provide comparison and contrast to Oracle’s solution.
Example 1: Replication – The idea behind replication for protection is that data can always be restored from one of the remaining replicas. The Oracle Database provides replication as well via (Active) Data Guard, but with a much different approach and it is just part of a more holistic solution for high availability and disaster recovery.
For those familiar with the concept of replication, it may seem obvious why this approach won't facilitate an easy recovery in the case of a logical corruption such as the one discussed above. As a matter of fact, this approach may even complicate the problem. The reason being that once the "wrong update" (a.k.a. the logical corruption) has been submitted for commit, distributed databases using replication-based protection will only confirm the commit once the majority number of replicas have received the update. Therefore, by the time the commit has been confirmed, the logical corruption will have already expanded to impact all of the replicas that were expected to protect the data.
The only convenient way to recover the situation at that point would be to write a counter-update. However, the more significant the original logical corruption and the more time passes, the bigger on the business and a counter-update would have to be done with care to prevent further corruption.
Example 2: Backup & Recovery – Now that we have established that replicas wouldn’t really work to restore data from this type of corruption, recovery from backup & restore usually would come to mind as an alternative approach to recover from such logical corruptions. While many might think this to be a valid approach, the practicality of it is cumbersome at best. If we once again use Google Spanner as an example of a database technology in this case, we would find that only recently did Google even introduced backups for each copy of the data (including the copies in replicas). Therefore, to fully restore Google Spanner databases, one would now have to perform as many local restore and recovery operations as the number of replicas that were impacted by the logical corruption in the first place which may be quite a few thanks to the nature of their replication approach.
In addition, the restore and recovery required here is very complicated. Why? A restore and recovery operation is generally designed to restore a full set of database entities - e.g. a complete Oracle Database. However, a logical corruption as described cannot simply be recovered that way. What is needed, is a “point in time recovery” (PITR) to the state of the database prior to when the problematic update or insert was committed. This is a way more complex process, involving what is conveniently called an "auxiliary instance", a rather favorable description for starting another database instance to restore and then recover only certain "bits and pieces" from the backup set into the (at this moment) largely inoperative production database.
An example always helps, so let’s explore how we might resolve this logical corruption in an Oracle Database. Specifically, let’s walk through a scenario including the following stages and steps:
Assume the following table:
An update is executed againt the table resulting in the following “Logical Corruption”:
Although the employees would no doubt be quite happy with their promotion to Vice President (VP), this is a big mistake which could have severe consequences. Imagine when this happens in a payroll or CRM system. Even worse, if col-n is coupled to access rights, this could lead to some serious security breaches.
Even using an Oracle Database, the RMAN approach is not ideal in this case, although better than in case of other databases. When restoring, a source backup and a new target database specification is required. You cannot restore to an existing database. The restored database will have all the data and schema from the original database at the creation time of the backup. This means all the other data is lost! This implies that all operations need to be redone to get the data back.
In Oracle databases, data loss can be avoided in cases where only a single table is impacted. Since Oracle 12.1, RMAN has the RECOVER TABLE command. Which in this case, performs a point in time restore (PITR) operation for a single table. This automatically creates an auxiliary instance, performs a point-in-time restore and then provides the ability to automatically perform the import of the corrupt table from the impacted database. No need to lose any data and even better, the database remains online during this operation! Examples are provided in the documentation on how to do this in order to illustrate all the steps to get the table back.
Yes, the above process sounds somewhat complicated but don’t panic that was just another alternative solution which while better than the other databases is still not the most optimal solution. The reason for this is that the Oracle Database is the only database that can solve this quickly without the need to completely restore the database until just before the update happened which would greatly increase the risk of lost data. Remember, competing distributed databases don’t even have a point-in-time restore, so we are well beyond what can be done by alternative databases.
In Oracle we can perform a flashback table. To be able to do so, the database needs to be enabled for Flashback. Following Query helps to verify the current status of the database:
When the output from this query is no, it is simple to enable flashback with following command:
When running the query again, the database is ready to use Flashback. In the background, it will start to collect Flashback logs. Flashback uses these flashback logs to access previous versions of data blocks in combination with some data in the archived redo log files.
When flashing back a table, the flashback table command looks into the information in the flashback logs to see what the data looked like prior to the logical corruption incident. That way, data can be retrieved from the past in a simple convenient and reliable manner without the need to take the database down or to start unnecessary auxiliary instances which may lead to resource issues with very large databases.
For example, if you want to rewind the table back a minute (assuming you were fast to react but a longer timeframe would work as well) prior to when the corruption has hit the system, it is easy as invoking:
Without restoring the full database, without auxiliary instances… just retrieve the data again. Simple.
To recap, this is why Oracle Flashback is often better than a backup, particularly in cases of local corruption caused by human or application error. It is simple to implement, fully integrated into the Oracle Database and most of all; it doesn’t require database and application downtime. Yet another reason that the Oracle Database is unique in its approach to disaster recovery to ensure your data is truly protected from all types of disaster events.