Error handling using Data Rules

This has been a topic at the shows we presented, and since it is a very powerful feature, here is a little bit on how to use data rules in ETL. Note that data rules are part of the Data Quality option to Warehouse Builder.

To create this solution we can identify a number of discrete phases, which come after the initial setup steps that are generic to creating sources and targets in Warehouse Builder:


  1. Create a data rule to document the business rule you want to enforce
  2. Apply the data rule to a specific table (or set of tables if the rule applies to more than one table)
  3. Create a mapping with the tables and set the properties for error handling

Create a Data Rule

Data rules are metadata definitions of a business rule you define. You can use some of the pre-defined rules in Warehouse Builder (as-is or customized) or create completely new rules. In this case we will create a new rule.

Steps:

Import the HR schema (you may find that a lot of the data fails the rule, but that is not the point we are making here...)


  1. In the HANDSON project locate the Data Rules node
  2. Right-click and choose New...
  3. Name the new module ERROR_HANDLERS
  4. Click OK; this launches the Data Rule Wizard
  5. Click Next in the welcome screen

  6. Call the new rule EMAIL_VERIFICATION
  7. Since we will be verifying the pattern for email addresses, on the Define Rule panel set the following:

    Type: Common Format
    Format: Email Address

    [Screenshot: Create Email Rule]


  8. In the Values section you see the default pattern (a regular expression) applied. Since this rule only recognizes 3-letter domains (.com, .org, .net, etc.), we will modify it. Highlight and copy (Ctrl-C) the text from the first line, then paste it into the second, empty line (the empty line shows that the list of patterns is extensible)

    [Screenshot: Regular Expression for Email]


  9. In the second line, find the + and then locate the {3} behind it; we will change this to recognize 2-letter domains. Change the {3} into {2} (a quick SQL check of the resulting patterns follows below)

    [Screenshot: Regular Expression Modified]


  10. Finish the wizard

Regular expressions are a very interesting language; to learn or read more, take a look at the Oracle Regular Expressions Pocket Reference by Jonathan Gennick and Peter Linsley. Nice format, great reference for playing with regexps...
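To see what the modified rule accepts, you can try the patterns directly against the HR data with REGEXP_LIKE. This is only a rough sketch: the expression below approximates a typical email pattern, and the exact text Warehouse Builder ships as its default may differ.

    -- Rough check of the two patterns (3-letter and 2-letter domains) against HR data.
    -- The expression approximates the default email pattern; the exact text
    -- shipped with Warehouse Builder may differ.
    SELECT email,
           CASE
             WHEN REGEXP_LIKE(email, '^[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{3}$')  -- original line
               OR REGEXP_LIKE(email, '^[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2}$')  -- added line with {2}
             THEN 'PASS'
             ELSE 'FAIL'
           END AS email_rule_result
      FROM hr.employees;

On the stock HR data most rows will show FAIL, since the EMAIL column holds plain login names rather than full addresses, which matches the earlier remark that a lot of the data fails the rule.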

Apply the data rule

The important thing to understand when using data rules is that they are first applied to a table (this can be a source table, but read on for that), and only then are they used in a mapping to produce ETL routines.

Because the rule is applied to the table, and because you can choose to remove errors from the main table into an error table (no, this is not DML error logging! More on that in a later post; it is coming in 10.2.0.3!), you will need to regenerate the table. The error table DDL is generated only when the main table (with the rule applied) is generated.
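For orientation, the generated error table roughly mirrors the main table plus a set of housekeeping columns describing the violation. The sketch below is only an assumption about its shape; the ERR$$$ column names are illustrative, and the authoritative DDL is whatever the generator produces for STG_EMPS.

    -- Illustrative shape of the error table only; the ERR$$$ housekeeping
    -- column names are assumptions, not the exact generated names.
    CREATE TABLE stg_emps_err (
      employee_id         NUMBER(6),
      first_name          VARCHAR2(20),
      last_name           VARCHAR2(25),
      email               VARCHAR2(25),
      -- ... the remaining STG_EMPS columns ...
      err$$$error_reason  VARCHAR2(2000),  -- which rule usage the row violated
      err$$$severity      VARCHAR2(1),
      err$$$audit_run_id  NUMBER           -- ties the row to the run that wrote it
    );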

Applying the rule to any table is done in the Data Object editor:


  1. Copy the HR.EMPLOYEES table into any target module
  2. Name it STG_EMPS
  3. Open the STG_EMPS table in the editor (double click the table)
  4. Navigate to the Data Rules tab

    [Screenshot: Apply Data Rule in the Data Object Editor]


  5. In the Applied Rules panel, click the Apply Rule button to add a new rule to the table
  6. Click Next on the welcome screen of the wizard
  7. Find the ERROR_HANDLERS module and select the EMAIL_VERIFICATION rule
  8. Click Next
  9. Change the name (of the usage) to ETL_EMAIL_VERIFICATION
  10. Click Next

    [Screenshot: Apply Rule Wizard - Binding]


  11. In the Binding, find and select EMAIL
  12. Finish the wizard
  13. The rule is now attached (applied) to the STG_EMPS table

If you now generate STG_EMPS you will see an STG_EMPS_ERR table as well. This table gets deployed when you deploy STG_EMPS (sorry, there is no other way to do this). In the configuration you can also change the error table name if desired.

Once the STG_EMPS and STG_EMPS_ERR tables are deployed and thus exist in the database, you can start creating a mapping that utilizes the data rule within STG_EMPS.
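A quick dictionary query in the target schema confirms that both tables made it into the database before you start on the mapping:

    -- Run as the owner of the target schema after deployment.
    SELECT table_name
      FROM user_tables
     WHERE table_name IN ('STG_EMPS', 'STG_EMPS_ERR')
     ORDER BY table_name;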

Data rules in a mapping

Create a new mapping and drag the STG_EMPS table into it (if the table is already in a mapping that was created before you added the rule, do an inbound synchronize so the mapping picks up the latest table definition, including the data rule). Make sure you can see the entire operator:

[Screenshot: Map Table Operator with Rule]

Notice how the operator has two groups. The top group is the regular in/out group for the table data; you can map from and to this group. The second group (ERR_GROUP above) can only be mapped from (not into) and reflects all the columns that are in the STG_EMPS_ERR table. So there is no need to join these two tables or to import the STG_EMPS_ERR table to retrieve its rows. You can map from both groups in the same mapping and even join them back together (simply think of them as two tables).
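In terms of the rows they expose, the two groups behave roughly like these two queries (the ERR$$$ column name is the illustrative one from the error table sketch above, not necessarily the generated name):

    -- Top group: the table data itself.
    SELECT employee_id, email
      FROM stg_emps;

    -- ERR_GROUP: the rows captured in the error table, available without
    -- importing or joining STG_EMPS_ERR yourself.
    SELECT employee_id, email, err$$$error_reason
      FROM stg_emps_err;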

Activating a data rule

Once a table with a data rule is in your mapping, you need to choose how you want to use the data rule. To do that, highlight the operator and look at the operator properties:

[Screenshot: Activate Data Rule in Map]

You have three choices:
The default is Ignore. This simply does not act upon the data rule, so it does not influence the generated code or the mapping logic.
MOVE TO ERROR means that the data rule is enforced when running the mapping and that all data rows that fail the rule (or any of the rules if you have multiple) are moved out of the main table STG_EMPS into STG_EMPS_ERR. Such a failing row no longer exists in the STG_EMPS table, and any selects from it will not see this row.
REPORT means that the error rows are written into the error table but are NOT removed from the STG_EMPS table; they remain in the regular select flow from the main table.
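The net effect of the two active settings can be sketched as follows. This is a simplification for illustration only: the code Warehouse Builder generates is far more elaborate and runs under the runtime audit, the ERR$$$ column is the illustrative one from the earlier sketch, and the pattern simply combines the 2- and 3-letter domain checks.

    -- REPORT: failing rows are copied into the error table but stay in STG_EMPS.
    INSERT INTO stg_emps_err (employee_id, email, err$$$error_reason)
    SELECT employee_id, email, 'ETL_EMAIL_VERIFICATION'
      FROM stg_emps
     WHERE NOT REGEXP_LIKE(email, '^[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,3}$');

    -- MOVE TO ERROR: in addition, the failing rows are removed from STG_EMPS.
    DELETE FROM stg_emps
     WHERE NOT REGEXP_LIKE(email, '^[[:alnum:]._%+-]+@[[:alnum:].-]+\.[[:alpha:]]{2,3}$');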

Multiple rules on the same table are possible and can be individually configured. If two rules are violated, the error reasons are concatenated for the record, indicating multiple violations.

[Screenshot: Error Table Properties]

A second set of properties governs the behavior of the error table. Truncate error table simply means that the table gets truncated by OWB before loading into it. You would typically do this if you process the error rows in the originating mapping, ensuring the information is captured there.

Roll up errors gives you a consolidated, aggregated set of rows, ensuring that each row is only handed to the extraction query once when selecting from the error table. Note that it does not prevent multiple rows from being present; it merely aggregates them to show a single one. If you need all violations as separate records, switch this to No.
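Reading from the error group with Roll up errors set to Yes behaves roughly like an aggregation over the error table. The exact aggregation the generated code performs may differ; the idea is simply one output row per offending record.

    -- One row per offending record, even if it was written to the error
    -- table more than once (illustrative only).
    SELECT employee_id, email,
           MIN(err$$$error_reason) AS err$$$error_reason,
           COUNT(*)                AS times_recorded
      FROM stg_emps_err
     GROUP BY employee_id, email;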

Once you have set all the properties, you can inspect what the pluggable mapping for this operator looks like, and, more surprisingly (I think), you can deploy this mapping as-is with only a single operator! So it could play the role of a validation mapping; in that case you would probably choose REPORT as the strategy.

As a last step, you can create a mapping that extracts both from the source STG_EMPS and from the STG_EMPS_ERR operators in the same mapping.

Comments:

Hi, I have observed that the application of data rules leads to the generation of very long code, which is causing performance-related issues. For example, the mapping which I developed has 2 table operators with 12 and 23 data rules respectively. When I tried to deploy this mapping, it took close to 36 hours. After the deployment was over, I viewed the generated code: it was about 198,677 lines. When I disabled all the data rules in the 2 tables, the generated code was a few hundred lines and it took less than a minute to deploy the mapping. I have another mapping which had 2 tables with 20 and 31 data rules. I started that deployment 4 days ago and it is still running. I decided to stop it and redesign the mapping without using data rules, but a great loss of time and effort was caused. I don't know if the real issue is with data rules or whether I have to enable some feature to make this work. From this I concluded that usage of a large number of data rules leads to inefficient ETL. I do not recommend using a large number of data rules in your ETL logic. Regards, Vibhuti

Posted by Vibhuti Devatraj on July 20, 2007 at 06:03 AM PDT #

First, I'm sorry, my English is very poor, but I will try to be clear. How is it possible to put into the error table the rows rejected by a primary key violation or other errors, different from the data rules? Thanks

Posted by Eduardo on July 21, 2008 at 09:37 AM PDT #
