Total Recall for Data Warehousing
By Jean-Pierre Dijcks-Oracle on May 12, 2009
First of all, this is not a recall notice for your data warehouse...! This is one of those cool features that come with Oracle 11g that I think is usable in a different context (e.g. Data Warehousing).
The first thing I dreamed up here, and I'm sure others have thought about this, is the usage of flashback features in ETL. Every ETL system that I know has a construct built in to remove "incorrect" data. Incorrect here is simply a load that did for some reason more harm than good. The typical solution at the highest level is to mark each record with a load identifier. That identifier will simply mark each record as being touched by this load.
The hard bit is always to figure out what happened to this record. Plain old inserts are easy of course as there is no previous record, removing the inserted version restores the situation to before the load. Updates can be hard to deal with and deletes (if you have any - unlikely) are even harder.
So I think making use of the built-in Total Recall options can make a lot of sense here.
Before I go into what I was doing I should make sure to tell everyone that most of the basic Total Recall can be found in this very nice Oracle By Example tutorial.
Here is my scenario and some of the basics from the tutorial applied to my local 11g database:
Use the CUSTOMERS table in SH as the source for a single target table in my target schema (same database) DW_TGT. I'm using Oracle Warehouse Builder as the tool to create my load mapping, just because it is already installed and I kind of know how to use it.
This is my user and tablespace setup (mostly just taken from the OBE mentioned above):
CREATE TABLESPACE dw_with_fla
SIZE 10M REUSE AUTOEXTEND ON NEXT 640K
NOLOGGING EXTENT MANAGEMENT LOCAL SEGMENT SPACE MANAGEMENT AUTO;
CREATE USER dw_tgt identified by dw_tgt
DEFAULT TABLESPACE dw_with_fla
TEMPORARY TABLESPACE temp
QUOTA 100M ON dw_with_fla
QUOTA 100M ON USERS;
GRANT RESOURCE, CONNECT to dw_tgt;
GRANT FLASHBACK ARCHIVE ADMINISTER TO dw_tgt;
As the actual target user (she who owns the data warehouse tables in my case) - DW_TGT I then run:
CREATE FLASHBACK ARCHIVE etl_fla
RETENTION 5 DAY
ALTER TABLE dw_tgt.cust_tab
FLASHBACK ARCHIVE etl_fla
The OWB mapping does a simple merge on the target, the mapping also uses parameters for me to create (artificial) inserts and updates for a second load. That second load is the bad data being thrown at my end user, and is the data I want to back out again.
The mapping is nothing special. Obviously in real life you would have many a transformation in here. The target is an INSERT/UPDATE using the PK on the table and the table is configured to land in the DW_WITH_FLA tablespace as shown here:
So far the set up to get to my initial load. Initial load runs and loads 50,000 records into the target table.
Now I run my bad data load and modify or insert data. The end result is that I touched 15534 records in my CUST_TAB. 5500 are new inserts, the rest is a random set of updates to varying columns in that table.
To revert back to the original data you can go two ways. To me the interesting one is the most radical one. Truncate the actual table and restore from flashback...
To just illustrate the effect, I created a new table and got my 50,000 records back:
create table cust_tab_orig
(select * from cust_tab
as of timestamp (systimestamp - interval '60' minute)
Obviously the question is, can I do a truncate of my tables? Couple of things come into play here:
- Referential integrity... you may need to truncate a bunch of tables or disable constraints etc.
- Performance... how long does it take to restore, how much data do you have and for how many table are you going to it.
So the obvious answer is: "it depends".
The other big thing is that you have a fixed set of variables to work with if you do not do the truncate bit... This means you can build an ETL routine that does the backing out WITHOUT relying on any source or staging data or dodgy source time stamp columns. In other words, rather than being depend on the source data and source system to remove data, you are working in a controlled environment of the data warehouse and will at all times be able to restore your data.
To do that simply create a view to show historical data from the archive, then parameterize the ETL mapping so you can pick a random moment in time.
To remove the inserts you can of course use some form of MINUS query:
select cust_id from cust_tab
select cust_id from cust_tab as of timestamp (systimestamp - interval '18' hour);
The updates are a bit more tricky since there is no obvious marker in the actual data archive. But wait a minute, there is!
Flashback version queries allow you to find versions of records, which is exactly what we need here. We need the older version of the records that are updated.
So the inserts can be found using something like this:
versions between timestamp (systimestamp - interval '20' hour)
where versions_operation = 'I'
The updates go with versions_operation = 'U'
Note that there is a gotch for user updates of an index key! You may get 'D' and 'I' records (see here)
With that we are in business and can now create a view that will allow us to find the status of all records before the load started and the status after.
BTW, if you do not get your timestamps right you may get an error along these lines: ORA-01466: unable to read data - table definition has changed. After getting my intervals right so there actually are changes and the query actually will get some returns it all worked.
In OWB you could then create a mapping that removes the 'I' records and reverses the 'U' records back to their original state. Build up the to-be-reversed list from the version and update the data back.
Well, that is my disclaimer here. I run this on a laptop with a couple of thousand of rows. The question here is, what is the overhead you get from doing this from the flashback archives? There is no difference (I think) between the actual update performance using flashback or using the source. In effect, I'm guessing that you probably need a database link to go to the source, so performance may be even better using this construction.
If you still have the staging data around you can poke into the versions or into the flashback archive using that list of items. That way it might be quicker or simpler to find the rows that require a reverse-update action.
Now the other thing I was thinking about is to use this instead of slowly changing dimensions... but I will leave that - more theoretical discussion - for a next post.