Main

Data Quality Archives

April 25, 2007

Error handling using Data Rules

This has been a topic on the shows we presented, but it is a very powerful feature to use, so here is a little bit on how to use data rules in ETL. Note that data rules are part of the Data Quality option to Warehouse Builder.

To create this solution we identify a number of discrete phases after the initial setup steps which are generic for creating sources and targets in Warehouse Builder:


  1. Create a data rule to document the business rule you want to enforce
  2. Apply the data rule to a specific table (or set of tables is the rule applies on more tables)
  3. Create a mapping with the tables and set the properties for error handling
Create a Data Rule

Data rules are metadata definitions of a business rule you define. You can use some of the pre-defined rules in Warehouse Builder (as-is or customized) or create completely new rules. In this case we will create a new rule.

Steps:

Import the HR schema (you may find that a lot of the data fails the rule, but that is not the point we are making here...)


  1. In the HANDSON project locate the Data Rules node
  2. Right mouse click and choose New...
  3. Name the new module ERROR_HANDLERS
  4. Click OK, this launches the Data Rule Wizard
  5. Click Next in the welcome screen

  6. Call the new rule EMAIL_VERIFICATION
  7. Since we will be verifying the pattern for email addresses, on the Define Rule panel set the following:

    Type: Common Format
    Format: Email Address

    CreateEmailRule:


  8. In the Values section you see the default pattern (in Regular Expressions) applied, since this is a rule that will only recognize 3 letter domains (.com, .org, .net etc.) we will modify the rule Highlight and copy (ctrl-c) the text from the first line. Paste the text into the second line visible (the empty line shows this is extensible)

    RegularExpressionEmail:


  9. In the second line find the +. then locate the {3} behind it, we will change this to recognize 2 letter domains. Change the {3} into {2}

    RegularExpressionModified:


  10. Finish the wizard
Regular expressions are a very interesting language, to learn or read more take a look at the Oracle Regular Expressions Pocket Reference by Jonathan Gennick and Peter Linsley. Nice format, great reference for playing with regexps...

Apply the data rule

The important thing to understand when using data rules is that, first they are applied to a table (this can be a source, but read on for that), only then are they moved into the mapping to produce ETL routines.

Because the rule is applied to the table and because you can choose to remove errors from the main table into an error table (no this is not DML error logging! More on that in a later post, it is coming in 10.2.0.3!), you will need to regenerate the table. The error table DDL is generated only when the main table (after applying the rule) is generated.

Applying the rule to any table is done in the Data Object editor:


  1. Copy the HR.EMPLOYEES table into any target module
  2. Name it STG_EMPS
  3. Open the STG_EMPS table in the editor (double click the table)
  4. Navigate to the Data Rules tab

    ApplyDataRuleDOE:


  5. In the Applied Rules panel, click the Apply Rule button to add a new rule to the table
  6. Click Next on the welcome screen of the wizard
  7. Find the ERROR_HANDLERS module and select the EMAIL_VERIFICATION rule
  8. Click Next
  9. Change the name (of the usage) to ETL_EMAIL_VERIFICATION
  10. Click Next

    ApplyRuleWizardBinding:


  11. In the Binding, find and select EMAIL
  12. Finish the wizard
  13. The rule is now attached (applied) to the STG_EMPS table
If you now generate the STG_EMPS you will see an STG_EMPS_ERR table as well. This table gets deployed when you deploy STG_EMPS (sorry no other way to do this). In configuration you can change the error table name as well if so desired.

Once the STG_EMPS and STG_EMPS_ERR tables are deployed and thus in the database you can start creating a mapping that utilizes the data rule within STG_EMPS.

Data rules in a mapping

Create a new mapping (if you have the table in a mapping already and it was in there before adding the rule, do a synchronize inbound so the mapping gets the latest table definition including the data rule) and drag the STG_EMPS table into it. Make sure you can see the entire operator:

MapTableOperatorwithRule:

Notice how the operator has two groups. The top group is the regular in/out group for the table data and you can map from and to this group. The second group (ERR_GROUP above) can be mapped from (not into) and reflects all the columns that are in the STG_EMPS_ERR table. So there is no need to join these two tables or to import the STG_EMPS_ERR table to retrieve rows from. You can map from both groups in the same mapping and even join them back together (simply think of them as two tables).

Activating a data rule

Once a table with a data rule is in your mapping, you need to choose how you want to use the data rule. For that highlight the operator and find the operator properties:

ActivateDataRuleInMap:

You have three choices:
The default is Ignore. This simply does not act upon the data rule, so it does not influence the generated code or the mapping logic.
MOVE TO ERROR means that the data rule is enforced when running the mapping and that all data rows that fail the rule (or any of the rules if you have multiple) is moved out of the main table STG_EMPS into STG_EMPS_ERR. That failing row does no longer exist in the STG_EMPS table and any selects from it will not see this row.
REPORT means that the error rows are written into the error table, but are NOT removed from the STG_EMPS table and remain in the regular select flow from the main table.

Multiple rules on the same table are possible and can be individually configured. In the case that 2 rules are violated, the error reason is concatenated for the record indicating multiple violations.

ErrorTableProperties:

A second set of properties governs the behavior of the error table. Truncate the error table simply means that the table gets truncated by OWB before loading into it. You would typically do this if you process the error rows in the originating mapping ensuring the information is captured.

Roll up errors gives you a consolidated and aggregated set of rows, ensuring that each row is only given to the extraction query once when selecting from the error table. Note it does not prevent multiple rows to be present, it merely aggregates them to show a single one. If you need all violations as separate records, switch this to No.

Once you have set all properties, you can actually inspect what the pluggable mapping looks like for this operator, and more surprisingly (I think) you can deploy this mapping as is with only a single operator! So it could play the role of validation mapping. In that case you would probably choose Report as strategy.

As a last step, you can create a mapping the extracts both from the source STG_EMPS and from the STG_EMPS_ERR operators in the same mapping.

June 19, 2007

Creating Data Rules in Scripting

Here are a bunch of examples for creating data rules from OMB, this will be a useful cheat sheet for anyone generating rules from any repositories where the rules already exist. You can download a tcl file which creates the examples below from here. You can check out this post also which has useful info on applying data rules.

There are some subtle differences between how the UI creates the rules and the API exposed in scripting (specifically on unique key / referential / functional dependency). The UI poses a simple question of how many attributes are interesting for each rule type, in the API you have to add attributes to the respective groups for each type (the pre-built groups and attributes are defined in the OMB documentation in the API guide). See the examples below for more info - it is not exactly obvious!

The following examples will be covered:
   1. create a custom rule
   2. create a custom rule with multiple columns
   3. create a fixed domain list
   4. create a domain range
   5. create a no null rule
   6. create a unique key rule
   7. create a referential rule
   8. create domain pattern list for postcode
   9. customize built-in patterns
  10. create functional dependency
  11. apply a data rule to a table


1. Create a Custom Data Rule
In the example below we create a single column rule that check that the column which can be supplied is > 300 - by default the custom rule has an attribute VALUE in the group THIS:

OMBCREATE DATA_RULE 'CUSTOM_RULE' SET PROPERTIES(RULE_TYPE) VALUES('ATTR_VALUE_RULE')
OMBALTER DATA_RULE 'CUSTOM_RULE' SET PROPERTIES(BUSINESS_NAME,DESCRIPTION) VALUES('Custom Rule', 'Single column rule')
OMBALTER DATA_RULE 'CUSTOM_RULE' GROUP 'THIS' ATTRIBUTE 'VALUE' SET PROPERTIES(DATATYPE) VALUES('NUMBER')
OMBALTER DATA_RULE 'CUSTOM_RULE' SET PROPERTIES(ATTR_VALUE_CLAUSE) VALUES('"THIS"."VALUE" > 300')

2. Create a Custom Data Rule with Multiple Columns
Here we can add additional attributes to the data rule (so many columns can be supplied ie. can check salary > XYZ and job_type in ....):

OMBCREATE DATA_RULE 'CUSTOM_RULE2' SET PROPERTIES(RULE_TYPE) VALUES('ATTR_VALUE_RULE')
OMBALTER DATA_RULE 'CUSTOM_RULE2' SET PROPERTIES(BUSINESS_NAME,DESCRIPTION) VALUES('Custom Rule2', 'Multi column rule')
OMBALTER DATA_RULE 'CUSTOM_RULE2' GROUP 'THIS' ADD ATTRIBUTE 'ANOTHER_VALUE'
OMBALTER DATA_RULE 'CUSTOM_RULE2' GROUP 'THIS' ATTRIBUTE 'VALUE' SET PROPERTIES(DATATYPE) VALUES('NUMBER')
OMBALTER DATA_RULE 'CUSTOM_RULE2' GROUP 'THIS' ATTRIBUTE 'ANOTHER_VALUE' SET PROPERTIES(DATATYPE) VALUES('VARCHAR2')
OMBALTER DATA_RULE 'CUSTOM_RULE2' SET PROPERTIES(ATTR_VALUE_CLAUSE) VALUES('THIS.VALUE> 300 AND THIS.ANOTHER_VALUE IN (SELECT DISTINCT JOBTYPE FROM EMP_JOBS)')

3. Create a Fixed Domain List
Create a domain rule with a fixed set of values:

OMBCREATE DATA_RULE 'DOMAIN_RULE' SET PROPERTIES(RULE_TYPE) VALUES('DOMAIN_LIST_RULE')
OMBALTER DATA_RULE 'DOMAIN_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('Domain List Rule')
OMBALTER DATA_RULE 'DOMAIN_RULE' ADD DOMAIN_VALUE 'dd' ADD DOMAIN_VALUE 'ee'

4. Create a Domain Range
Create a domain range rule:

OMBCREATE DATA_RULE 'DOMAIN_RANGE_RULE' SET PROPERTIES(RULE_TYPE) VALUES('DOMAIN_RANGE_RULE')
OMBALTER DATA_RULE 'DOMAIN_RANGE_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('Domain Range Rule')
OMBALTER DATA_RULE 'DOMAIN_RANGE_RULE' SET PROPERTIES(MIN_VALUE) VALUES(500)
OMBALTER DATA_RULE 'DOMAIN_RANGE_RULE' SET PROPERTIES(MAX_VALUE) VALUES(20000)
OMBALTER DATA_RULE 'DOMAIN_RANGE_RULE' GROUP 'THIS' ATTRIBUTE 'VALUE' SET PROPERTIES(DATATYPE) VALUES('NUMBER')

5. Create a No Null Rule
Create a 'no null' rule:

OMBCREATE DATA_RULE 'NO_NULL_RULE' SET PROPERTIES(RULE_TYPE) VALUES('DOMAIN_NO_NULL_RULE')
OMBALTER DATA_RULE 'NO_NULL_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('No Nulls Rule')

6. Create a Unique Key Rule
Create a unique key rule with a single column:

OMBCREATE DATA_RULE 'UK_RULE' SET PROPERTIES(RULE_TYPE) VALUES('IDENTITY_RULE')
OMBALTER DATA_RULE 'UK_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('UK Rule')
OMBALTER DATA_RULE 'UK_RULE' SET PROPERTIES(IGNORE_NULLS) VALUES('true')

How to set the number of attributes (for a composite unique key rule)? After this you edit the rule in the UI you will see 'Number of Attributes: 2', the attribute names you use will appear when you apply the data rule to a table (so you will bind the attributes from the rule to columns in the table):

OMBCREATE DATA_RULE 'UK_RULE2' SET PROPERTIES(RULE_TYPE) VALUES('IDENTITY_RULE')
OMBALTER DATA_RULE 'UK_RULE2' SET PROPERTIES(BUSINESS_NAME) VALUES('UK Rule2')
OMBALTER DATA_RULE 'UK_RULE2' SET PROPERTIES(IGNORE_NULLS) VALUES('true')
OMBALTER DATA_RULE 'UK_RULE2' GROUP 'THIS' ADD ATTRIBUTE 'KEY_ATTRIBUTE_2'



7. Create a Referential Rule
Create a referential rule:

OMBCREATE DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(RULE_TYPE) VALUES('REFERENCE_RULE')
OMBALTER DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('Referential Rule')
OMBALTER DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(LOCAL_MIN_COUNT) VALUES(0)
OMBALTER DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(LOCAL_MAX_COUNT) VALUES(1)
OMBALTER DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(REMOTE_MIN_COUNT) VALUES(0)
OMBALTER DATA_RULE 'REFERENTIAL_RULE' SET PROPERTIES(REMOTE_MAX_COUNT) VALUES(1)

How to set the number of attributes - use group LOCAL/REMOTE, I use the same naming convention for attribute as OWB built-in name:

OMBCREATE DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(RULE_TYPE) VALUES('REFERENCE_RULE')
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(BUSINESS_NAME) VALUES('Referential Rule2')
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(LOCAL_MIN_COUNT) VALUES(0)
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(LOCAL_MAX_COUNT) VALUES(1)
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(REMOTE_MIN_COUNT) VALUES(0)
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' SET PROPERTIES(REMOTE_MAX_COUNT) VALUES(1)
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' GROUP 'LOCAL' ADD ATTRIBUTE 'LOCAL_KEY_ATTRIBUTE_2'
OMBALTER DATA_RULE 'REFERENTIAL_RULE2' GROUP 'REMOTE' ADD ATTRIBUTE 'REMOTE_KEY_ATTRIBUTE_2'


8. Create a Domain Pattern List for Postcode
Create a domain pattern list for British postcodes, any arbitrary regular expression can be supplied:

OMBCREATE DATA_RULE 'DOMAIN_PATTERN_RULE' SET PROPERTIES(RULE_TYPE) VALUES('DOMAIN_PATTERN_LIST_RULE')
OMBALTER DATA_RULE 'DOMAIN_PATTERN_RULE' SET PROPERTIES(BUSINESS_NAME) VALUES('Domain Pattern Rule')
OMBALTER DATA_RULE 'DOMAIN_PATTERN_RULE' SET PROPERTIES(DESCRIPTION) VALUES('British Postcode RegExp')
# I have escaped various characters with \ such as [ ] {  }
OMBALTER DATA_RULE 'DOMAIN_PATTERN_RULE' ADD DOMAIN_VALUE '(^(\[A-PR-UWYZ0-9\]\\[A-HK-Y0-9\]\\[AEHMNPRTVXY0-9\]?\[ABEHMNPRVWXY0-9\]? \{1,2\}\[0-9\]\\[ABD-HJLN-UW-Z\]\\{2\}|GIR 0AA)$)'

9. Customize Built in
Create a customized version of the built-in telephone format rule (there are common formats defined for;
  telephone      -DOMAIN_FORMAT_TELEPHONE_RULE
  IP address    -DOMAIN_FORMAT_IP_RULE
  SSN             -DOMAIN_FORMAT_SSN_RULE
  Date             -DOMAIN_FORMAT_DATE_RULE
  Number        -DOMAIN_FORMAT_NUMBER_RULE
  URL             -DOMAIN_FORMAT_URL_RULE
  Email            -DOMAIN_FORMAT_EMAIL_RULE 
Some of these are localized versions but it is fairly straightforward to create versions for other locales.

OMBCREATE DATA_RULE 'COMMON_FORMAT_TELE' SET PROPERTIES(RULE_TYPE) VALUES('DOMAIN_FORMAT_TELEPHONE_RULE')
OMBALTER DATA_RULE 'COMMON_FORMAT_TELE' SET PROPERTIES(BUSINESS_NAME) VALUES('Domain Format Telephone Rule')
OMBALTER DATA_RULE 'COMMON_FORMAT_TELE' SET PROPERTIES(DESCRIPTION) VALUES('Telephone Customized')
# I have escaped [ ] {  }
OMBALTER DATA_RULE 'COMMON_FORMAT_TELE' ADD DOMAIN_VALUE '(^\[\\[:space:\]\\]*\[0-9\]\\{3\}\[\\[:punct:\]|\[:space:\]\\]?\[0-9\]\\{4\}\[\\[:space:\]\\]*$)'

10. Create Functional Dependency
Create a functional dependency rule:

OMBCREATE DATA_RULE 'FUNC_DEP_RULE' SET PROPERTIES(RULE_TYPE) VALUES ('FUNCTIONAL_DEP_RULE')
OMBALTER DATA_RULE 'FUNC_DEP_RULE' SET PROPERTIES(FUNCTIONAL_DEP_THRESHOLD) VALUES (10)

If you want more than 1 determinant then add to the group (there is a group for DETERMINANTS and DEPENDENCY):

OMBALTER DATA_RULE 'FUNC_DEP_RULE' GROUP 'DETERMINANTS' ADD ATTRIBUTE 'DETERMINANT_2'
OMBALTER DATA_RULE 'FUNC_DEP_RULE' GROUP 'DEPENDENCY' ADD ATTRIBUTE 'DEPENDENCY_2'


11. How to Apply a Rule to a Table?
So how do you actually add a rule usage to a table.....you have to tie the table to the rule and the columns to the attributes!

The example below adds a domain rule to the COUNTRIES table binding the VALUE attribute to the COUNTRY_NAME column

OMBALTER TABLE 'COUNTRIES' ADD DATA_RULE_USAGE 'DRU' SET REF DATA_RULE '../DRS/DOMAIN_RULE' GROUP 'THIS' SET REF TABLE 'COUNTRIES' ATTRIBUTE 'VALUE' SET REF COLUMN 'COUNTRY_NAME'

That's a quick run down that I hope its useful to share. Hope this helps!

July 6, 2007

Data Profiling Performance Guide

One of the questions people usually ask about data profiling is how to make it go faster. Data profiling takes advantage of the Oracle database so understanding the kinds of operations that data profiling does will help. Attached is the unofficial data profiling performance guide that could be useful if you're trying to get more performance bang for your data profiling buck.

Doc is here

July 10, 2007

OWB 11g and beyond webcast

Just a quick reminder, the webcast is tomorrow morning, 11.45 am ET:

http://ioug.itconvergence.com/pls/apex/f?p=219:1:408683087713141

Look at the Next Webcast section. Oh and it is a good idea to visit this once in a while to look for the next webcasts... on various, but also on OWB.

August 28, 2007

Match and Merge - Complex Deduplication Rules Part 1

Often the match and merge operator is completely missed by users in OWB. Turn the light on. It is probably the most powerful mapping operator in the operator set and it is in the Core ETL of OWB. OWB has had this operator for a number of years and it has been quietly improving as time has gone by. For example some core algorithms were added in 10gR2 of the database specifically for improving performance in this area. UTL_MATCH is in the database in 10gR2, OWB had a PLSQL implementation prior to this - includes Levenshtein Distance/Jaro Winkler algorithms etc. OWB also provides double metaphone routines that can be used if desired.

There are some nifty capabilities that are not immediately obvious (what's new you cry!), here I'll lay down an example that shows cross table matching using compound custom rules. We have a number of tables, files (whatever) from various sources that should be consolidated. We have a bunch of rules that define how the rows should be matched and we have rules for how the rows should be merged.

Match rules in order:

Column
Rule description
SSN
If the SSN is not null and not equal to 999-99-9999 then use fuzzy (edit distance) matching
SAP_CUST_IDIf the SAP_CUST_ID is not null then use partial (abbreviation) matching
XYZ_CUST_IDIf the XYZ_CUST_ID is not null then use exact matching
ABC_CUST_IDIf the ABC_CUST_ID is not null then use exact matching
NAMEif first name and the last name are not null then use fuzzy (soundex) matching

So the match rules are read as in match if the rule 1 description is satisfied, or if the rule 2 description is satisfied etc..

Merge Rules:
Column
Rule description
NAME_MThe longest non-null middle name
SSN
The most common SSN
NAME_F
The longest non-null first name from Table A
A_CUST_SEQ
From the same record as the merged SSN
SAP_CUST_IDThe most common SAP_CUST_ID with 7 characters

The merge rules are descriptions of how each column should be projected through the merge.

Datastore A
   
XYZ_CUST_ID ABC_CUST_ID CUST_SEQ SAP_CUST_ID NAME_F NAME_M NAME_L SSN
9138 KI17038 John Martin Smith 915-12-1234
2271 KI17038 Jonathan R. Smith
4805 R5KI9-17038

Smith 915-21-1234
2716 R5KI9-17038N Jonathan V Smith 286-17-5289

Datastore B
XYZ_CUST_ID ABC_CUST_ID CUST_SEQ SAP_CUST_ID NAME_F NAME_M NAME_L SSN
293105


John
Smith
293105


Marianne
Smith
292617


Jon M. Smith

Datastore C
XYZ_CUST_ID ABC_CUST_ID CUST_SEQ SAP_CUST_ID NAME_F NAME_M NAME SSN

006-41803



Smith 999-99-9999

2006-41803



Smith 999-99-9999

2006-41803



Smith 915-21-1234

These 3 tables, files (or whatever) have related data that we wish to match-merge with the rules above.

So how to initially consolidate this disparate data in preparation for match-merging? Simply use a union all operator for example.

The rules when applied mean the following rows would be matched;

Match Merge 1:

In OWB using the match merge operator we create the match rules. For the table of match rules above we define the rules as below. The first rule MA_1_SSN is defined as a custom rule, the implementation
(in a very small amount of PLSQL) checks that the SSN is not null, it
is not equal to '999-99-9999' (for both 'this' and 'that' - the rows
being compared) and that the edit distance is less than 2 (this example
was based on 10gR2, so I used the database's native UTL_MATCH package,
OWB provided OWB_MATCH_MERGE for pre 10gR2 with a pure PLSQL
implementation).

Match Merge 2:

Note all rules in above example are defined as passive and only one is active. OWB does not directly execute passive match rules. It only executes passive rules when they are called through an active custom match rule. All defined match rules appear in a list of available functions in the Custom Match Rule Editor. To reorder rules, click down on the row header on the left hand side, hold down
for a couple of seconds then drag the row up or down.

The second rule is ensuring the SAP_CUST_ID column matches using partial (abbreviation) matching (the rule type is conditional and the algorithm is abbreviation):

Match Merge 3:

Note here, the rule we wanted to defined was 'If the SAP_CUST_ID is not null then use partial (abbreviation) matching'. We will do the not null check in the final custom match rule! This goes for the next few rules.

The next rule checks XYZ_CUST_ID for an exact match:

Match Merge 4:

The fourth rule checks ABC_CUST_ID for an exact match:

Match Merge 5:


Finally the active rule at the end is a composite rule that augments the earlier defined rules with further refinements (for example adds not null checks on SAP_CUST_ID rule) then finally performs SOUNDEX matching on the names:


Match Merge 6:


When you are building your custom rules there is a specific editor when you edit the rule, it has the available parameters and match functions available for building the custom rule. You can incorporate other rules including the passive rules defined (in this way they are triggered).

Match Merge 7:


So finally our match rules are defined!

Merge Rules

The merge rules define how the operator's merged attributes are determined. Let's reflect what the requirements are:

Column
Rule description
SSN
The most common SSN
NAME_M
The longest non-null middle name
NAME_F
The longest non-null first name from Table A
SAP_CUST_ID
The most common SAP_CUST_ID with 7 characters
A_CUST_SEQ
From the same record as the merged SSN

OWB provides a number of built-in merge rule types, for some of the above we need a custom rule, let's dig deeper.

The NAME_M column is populated from the longest middle name, for this attribute the Min/Max rule type can be used, the attribute selected and Longest chosen as the attribute relation (it could be mininum,maximum,longest,shortest):

Match Merge 8:

For NAME_F we will decide that table A will will be identified by all records having a CUST_SEQ value of null. We could also have identified this by adding a constant attribute for each source (or system) when we unioned the tables and used this rather than purely being dependent on the data. We can code custom merge rules to decide on which value to merge, here we loop around and return the longest non-null name - this is just a simple PLSQL loop around the collection M_MATCHES;

fName varchar2(2000) := null;
BEGIN
    -- return the longest first name from table a
    -- in table a, CUST_SEQ is not null
  FOR i IN M_MATCHES.FIRST .. M_MATCHES.LAST LOOP
      IF  M_MATCHES(i)."NAME_F" IS NOT NULL and
          M_MATCHES(i)."CUST_SEQ"  is not null THEN
        IF fName IS NULL OR LENGTH(RTRIM(M_MATCHES(i)."NAME_F")) > LENGTH(RTRIM(fName)) THEN
          fName := M_MATCHES(i)."NAME_F";
        END IF;
      END IF;
  END LOOP;
  RETURN fName;
END;

The SSN rule can be defined using a PLSQL loop that simply counts occurrences and returns the SSN with the highest occurrence count, here we see the definition of the custom rule for SSN. It is simply 2 loops one 2 get the occurrence count, the 2nd loop returns the highest occurrence count:

type info_type is table of varchar2(4000) index by binary_integer;
type infocnt_type is table of pls_integer index by binary_integer;
v_infos info_type;
v_infocnt infocnt_type;
retval varchar2(20);
retcnt pls_integer;
found boolean;
indx pls_integer := 1;
begin
  -- return the modal value of "SSN" that is not null
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and
M_MATCHES(i)."SSN" != '999-99-9999' then

        found := false;
        if v_infos.count > 0 then
          for j in  v_infos.first .. v_infos.last loop

            if found = false and v_infos(j) = M_MATCHES(i)."SSN" then
              -- found it
              v_infocnt(j) := v_infocnt(j) + 1;
              found := true;
            end if;
          end loop;
        end if; -- v_infos is not null
        if found = false then
          v_infos(indx) := M_MATCHES(i)."SSN";
          v_infocnt(indx) := 1;
          indx := indx + 1;
        end if;
      end if;
  end loop;
  -- find value with max count
  if v_infos.count > 0 then
    for i in v_infos.first .. v_infos.last loop
      if retval is null then
        retval := v_infos(i);
        retcnt := v_infocnt(i);
      else
        if (v_infocnt(i) > retcnt) then
          retcnt := v_infocnt(i);
          retval := v_infos(i);
          indx := indx + 1;
        end if;
      end if;
    end loop;
  end if;
  return retval;
end;


This might look like a lot of code, but really is pretty much 97% boilerplate. Look for SSN in the above code, there are only a few specific parts related to this pattern of finding the most common occurrence.

The CUST_SEQ merge rule which returns the CUST_SEQ from the merged SSN is a loop to find the SSN and then a return of the matched SSN, if none found then return the previously merged CUST_SEQ:

BEGIN
  -- select the cust_seq from the same record as the merged SSN.
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and
         M_MATCHES(i)."CUST_SEQ" is not null and
         M_MATCHES(i)."SSN" = M_MERGE."SSN"  then
        return  M_MATCHES(i)."CUST_SEQ" ;
      end if;
  end loop;
  return  M_MERGE."CUST_SEQ";  -- previously selected cust sequence
END;

The rule for SAP_CUST_ID is to select the most common customer id that
is 7 characters in length, we can easily do this using the rank rule
type and the expression can check for length of 7:

Match Merge 9:

I think that's enough on the rules for both matching and merging, you get the idea now that this is very powerful.

Anatomy of the Map

Let's check out the map structure, to consolidate multiple source tables you can simply union all the tables together. The great thing about OWB is that you can actually also do a lot more if needed! You can perform a selection/filter process using the OWB operators up until the data is pushed into the match merge operator. This is a technique that we've suggested in the past to improve the performance so that there is not a bunch of redundant matching going on - so ensure you have reduced the number of bins to those bins with changes.

Match Merge 10:

The binning concept is very simple; it is all to do with divide and conquer! Often misunderstood and misused. Divide the problem up into bins and match and merge these chunks. A good use of binning can dramatically reduce the number comparisons the map will perform.

Here is an excerpt from the documentation:
'When
Warehouse Builder matches the rows, it compares each row with the subsequent row
for all rows within the same grouping. Limiting the number of rows can greatly
enhance performance, because Warehouse Builder searches for matches only within
a bin and not throughout the entire data set.'


The OWB wizard would be better if there was a binning advisor that actually tied into the data and said OK, you have selected column ZIPCODE this will give you a median bin size of W, an average bin size of X, a minimum bin size of Y and a max bin size of Z, this kind of information would help you make some of the decisions. You must also remember that you cannot pick any arbitrary column as the binning since you also want to avoid separating rows that should be matched. The
attributes you select for grouping similar rows depends on your data.

The general matching loops are structed as;

OUTER Loop for each bin
   MIDDLE loop for new records (filtered input records) OR MIDDLE loop for all input records
      INNER loop for all input records

In the MIDDLE loop it is possible to use the 'Match new records only' option which will restrict the MIDDLE loop to only those records identified by the new record condition. Although entitled new records only, you can use this for any specialist filtering that you have such as matching new and updated records. This is another option that again can dramatically alter performance.

In Summary

As you can see OWB has very sophisticated deduplication and match merge capabilities essential for incorporating data quality into your information processes. This is integrated into the OWB designer and works just like the rest of the operator set, and now you can see it truly is one of the richest and powerful operators in Oracle's data quality operator set.

There is a zip file here containing the MDL for this demonstration along with the data for the sources, have a play.

In further posts we'll look at the matching of items based on context information (see an existing OBE here) such as whether the column is a name, an address or whatever. There are a number of ways to compare strings and depending on what the data means, different comparison algorithms can be used, we will look at this further.

November 6, 2007

Merge rules in scripting

So it has been a bit busy over here with OpenWorld coming up (are you all going???)... but I recently played around with the match merge case discussed earlier seeing if I can script some pieces of it and that was quite a bit of fun. But here it is...

Keep in mind that this does the entire set of MERGE rules for the operator (I've already created the rest via the UI). Also I'm using some custom rules which is placed in a variable and then loaded.

OMBCC 'MY_PROJECT/TARGET'

#Creating a match bin

OMBALTER MAPPING 'CUST_MATCH_MERGE'
MODIFY OPERATOR 'MATCHMERGE'
SET PROPERTIES (MATCH_KEYS) VALUES('INGRP1.NAME_L')

#Creating a merge rule set
#Rule 1:
#Note that the order of the last two settings has to be exactly like this! TYPE before Attribute.

OMBALTER MAPPING 'CUST_MATCH_MERGE'
    ADD MERGE_RULES 'ME_1_NAMEM'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (TYPE) VALUES ('MM_MIN_MAX')
    MODIFY MERGE_RULES 'ME_1_NAMEM'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (ATTRIBUTE_NAME) VALUES ('NAME_M')
    MODIFY MERGE_RULES 'ME_1_NAMEM'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (MIN_MAX_TYPE) VALUES ('MM_LONGEST')
    MODIFY MERGE_RULES 'ME_1_NAMEM'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (MIN_MAX_ATTRIBUTE) VALUES ('NAME_M')        

#Rule 2:
#Custom rule text is added to a variable and called from OMB.
#Note the escape characters in the PL/SQL text:
# Double quotes are escaped by a slash
# Single quotes are escaped by a single quote

set custom_rule2 "fName varchar2(2000) := null;
BEGIN
    -- return the longest first name from table a
    -- in table a, CUST_SEQ is not null
  FOR i IN M_MATCHES.FIRST .. M_MATCHES.LAST LOOP
      IF  M_MATCHES(i)."NAME_F" IS NOT NULL and
          M_MATCHES(i)."CUST_SEQ"  is not null THEN
        IF fName IS NULL OR LENGTH(RTRIM(M_MATCHES(i)."NAME_F")) > LENGTH(RTRIM(fName)) THEN
          fName := M_MATCHES(i)."NAME_F";
        END IF;
      END IF;
  END LOOP;
  RETURN fName;
END;"

OMBALTER MAPPING 'CUST_MATCH_MERGE'
    ADD MERGE_RULES 'ME_2_NAMEF'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (TYPE) VALUES ('MM_CUSTOM')
    MODIFY MERGE_RULES 'ME_2_NAMEF'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (ATTRIBUTE_NAME) VALUES ('NAME_F')
    MODIFY MERGE_RULES 'ME_2_NAMEF'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (CUSTOM_TEXT) VALUES ('$custom_rule2')

#Rule 3:
#Custom rule text is added to a variable and called from OMB.
#Note the escape characters in the PL/SQL text:
# Double quotes are escaped by a slash
# Single quotes are escaped by a single quote

set custom_rule3 "BEGIN\n 
type info_type is table of varchar2(4000) index by binary_integer;
type infocnt_type is table of pls_integer index by binary_integer;
v_infos info_type;
v_infocnt infocnt_type;
retval varchar2(20);
retcnt pls_integer;
found boolean;
indx pls_integer := 1;
begin
  -- return the modal value of "SSN" that is not null
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and M_MATCHES(i)."SSN" != ''999-99-9999''  then
        found := false;
        if v_infos.count > 0 then
          for j in  v_infos.first .. v_infos.last loop

            if found = false and v_infos(j) = M_MATCHES(i)."SSN" then
              -- found it
              v_infocnt(j) := v_infocnt(j) + 1;
              found := true;
            end if;
          end loop;
        end if; -- v_infos is not null
        if found = false then
          v_infos(indx) := M_MATCHES(i)."SSN";
          v_infocnt(indx) := 1;
          indx := indx + 1;
        end if;
      end if;
  end loop;
  -- find value with max count
  if v_infos.count > 0 then
    for i in v_infos.first .. v_infos.last loop
      if retval is null then
        retval := v_infos(i);
        retcnt := v_infocnt(i);
      else
        if (v_infocnt(i) > retcnt) then
          retcnt := v_infocnt(i);
          retval := v_infos(i);
          indx := indx + 1;
        end if;
      end if;
    end loop;
  end if;
  return retval; 
nEND;"

OMBALTER MAPPING 'CUST_MATCH_MERGE'
    ADD MERGE_RULES 'ME_3_SSN'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (TYPE) VALUES ('MM_CUSTOM')
    MODIFY MERGE_RULES 'ME_3_SSN'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (ATTRIBUTE_NAME) VALUES ('SSN')
    MODIFY MERGE_RULES 'ME_3_SSN'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (CUSTOM_TEXT) VALUES ('$custom_rule3')

#Rule 4:
#Custom rule text is added to a variable and called from OMB.
#Note the escape characters in the PL/SQL text:
# Double quotes are escaped by a slash
# Single quotes are escaped by a single quote

set custom_rule4 "BEGIN
  -- select the cust_seq from the same record as the merged SSN.
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and
         M_MATCHES(i)."CUST_SEQ" is not null and
         M_MATCHES(i)."SSN" = M_MERGE."SSN"  then
        return  M_MATCHES(i)."CUST_SEQ" ;
      end if;
  end loop;
  return  M_MERGE."CUST_SEQ";  -- previously selected cust sequence
END;"

OMBALTER MAPPING 'CUST_MATCH_MERGE'
    ADD MERGE_RULES 'ME_4_CUSTSEQ'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (TYPE) VALUES ('MM_CUSTOM')
    MODIFY MERGE_RULES 'ME_4_CUSTSEQ'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (ATTRIBUTE_NAME) VALUES ('NAME_F')
    MODIFY MERGE_RULES 'ME_4_CUSTSEQ'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (CUSTOM_TEXT) VALUES ('$custom_rule4')
       
#Rule 5:
#Custom rule text is added to a variable and called from OMB.
#Note the escape characters in the PL/SQL text:
# Double quotes are escaped by a slash
# Single quotes are escaped by a single quote

OMBALTER MAPPING 'CUST_MATCH_MERGE'
    ADD MERGE_RULES 'ME_5_SAPCUSTID'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (TYPE) VALUES ('MM_RANK')
    MODIFY MERGE_RULES 'ME_5_SAPCUSTID'
        OF OPERATOR 'MATCHMERGE' SET PROPERTIES (ATTRIBUTE_NAME) VALUES ('SAP_CUST_ID')
    ADD RANK_RULES 'SAP_RULE'
        OF MERGE_RULES 'ME_5_SAPCUSTID'
            OF OPERATOR 'MATCHMERGE'
    MODIFY RANK_RULES 'SAP_RULE'
        OF MERGE_RULES 'ME_5_SAPCUSTID'
            OF OPERATOR 'MATCHMERGE' SET PROPERTIES (EXPRESSION) VALUES ('length(INGRP1.SAP_CUST_ID) = 7')
       

March 19, 2008

Contiuous data quality improvement in a data warehouse

As part of the TDWI Webinars, here is a chance to here a very interesting presentation from Jonathan Geiger. Oracle sponsors the webinar, so you will see an additional part on Warehouse Builder data quality, but in general I think the content is very interesting. It is all about how you deal with quality improvements and what tools can help.

Here is the detailed information on it:

Webinar Abstract

The data warehouse team is expected to deliver quality data but is
often faced with sources that have quality deficiencies. This Webinar
looks at ways to gradually improve data quality issues at the source,
including data profiling, data transformation and cleansing, audits and
controls, and metadata, all of which work together to provide quality
data to the data warehouse users.


You will learn:


  • Ways to define data quality
  • The continuous improvement process steps
  • How to provide quality data in the data warehouse
Presenter: Jonathan Geiger, Executive Vice President, Intelligent Solutions, Inc.

Event Date: 3/26/08

Event Time: 9:00 am PT / 12:00 PM ET

You can register here for the webinar or if the link does not work simply copy and paste the following into your browser:

http://www.tdwi.org/display.aspx?id=8788



April 14, 2008

What features do you use and what do you want next?

Catch 22 that is what we sometimes suffer from. How do we know what to build when we don't quite know what you really use day-to-day. Sure we can look at our bugs (not that we have many ;-) ), but would it not be great if you could just tell us what is really important for your project?

So fill out the feature survey and let us know what you use in your projects, and as important here, add your feedback to the enhancement request section at the end and drive the new Warehouse Builder release with your requests...!

Please download the feature survey here and send us the results (press the big nice button).

Thanks in advance for any and all feedback!

KPIs or KQIs

We all know what a KPI is, but what on earth is KQI? Well I just dreamed that up... but hang in there. This might get fun after all...

Common data quality tooling these days has you do the data profiling. That is good! But you really don't go off and profile your data every day. So between profiling runs you are operating in the dark. First part good, second part not so.

The trick to profile data once than always have your torch is the use of something called a data auditor in OWB. If you read more on the OWB solution in DQ, you will see that the product does profiling, but that the product also allows you to create data rules. Now these are important when we want to look at how our data is doing in between profiling runs.

A data rule is like a KPI definition. It is an indicator, so you should create data rules that are important things to measure with regards your data quality. One of these "choose wisely things"...

Now rather than not knowing what is going on in your data loads, you turn these key quality indicators (KQIs) into the points you measure every day. Not by running the profile again, no by turning them into a data auditor. The data auditor does nothing other than run a select on your incoming data (or on the data after it is fixed - more later) and tell you the value. It then translates that value into a Pass or Fail grade, or in KPI terms, the traffic light goes green or red. You now can have a dashboard of KQIs next to your KPIs and run a system that not only tells you how the business is doing, but also tells you how good your data is that drives the KPIs...

A little less pie in sky, let's say you are loading a warehouse and you have this construct where if the data is bad you can roll it all back out via an ID. Would it not be nice if you would agree on SLAs with the customer community and you would measure the SLAs beforehand? That would save you the time on loading the data. It also would prevent incorrect reporting on the data that gets backed out (imagine running the sales report in the morning and paying bonusses only to learn it is not correct!).

Now back to profiling, your data auditors give you the daily value of the KQI. With these KQIs in place, you can rerun data profiling to do two things:

1) (Re-)verify your data rules or KQIs - this can be done in profiling with a simple check box and allows you to quickly assess KQIs on new data (like a new source)
2) Discover new values that you should start adding to the KQIs for either existing system and or new systems

Based on the new profiling results you can now either regenerate your rules to cater for new data situations or you can tweak the thresholds.

You can also create KQI differentials - e.g. improvement through your process. As we said before, you can profile incoming data as well as data that has been fixed. In OWB, adding a data rule to the correction process gives you data rules automatically attached to the correct objects. You run the auditor on both sides and now you can derive the differential. This is a great way to quantify the amount of money you saved people by fixing the data issues.

Back to KQIs. KPIs are well accepted, but we are still all on the fence about data quality. Now if we accept that we can calculate KQIs, how do we use them with our business users? One of the things is to simply expose them to the KQIs. Show the business users what the rules are, what the measurements are. Then discuss both and start to work on finding the real cause of spikes on KQIs or the real cause for the low results. And yes, one cause might be the incorrect rule! Accept that as well and move on to improve rules and data! The goal is after all to deliver great data to the business users.

April 17, 2008

Checking referential integrity with data rules

So you read my KQIs thingie. This one is a bit more practical in that it is real code without coding :-)

I get incoming data into a table, and what I want to do is check referential integrity without FKs, because they will generate errors while loading. Not quite cool. So here is the scenario:

The incoming data in my STG_TAB:

insert into stg_tab values ('1001', 'Product 1001 Name', '1001 description', 'nothing');
insert into stg_tab values ('1002', 'Product 1002 Name', '1002 description', 'nothing');
insert into stg_tab values ('1003', 'Product 1003 Name', '1003 description', 'nothing');
insert into stg_tab values ('1004', 'Product 1004 Name', '1004 description', 'nothing');
insert into stg_tab values ('1005', 'Product 1005 Name', '1005 description', 'nothing');

The table that has the references to check on has the following data (it is called LKP_TAB):

insert into lkp_tab values ('1001', 'IamProduct1001');
insert into lkp_tab values ('1002', 'IamProduct1002');
insert into lkp_tab values ('1003', 'IamProduct1003');

So the point is that 1004 and 1005 are not around, so they are considered not compliant. And I want to now check this with a data rule. These are steps to get this working:

1) Create a new data rule
    Referential rule type
    Use a single attribute and set both cardinality thingies to 1 : n

Create new datarule:

2) Apply the rule to the STG_TAB table in OWB
    Open the data object and apply the rule
    Bind like this

ApplyRuleToTable:

3) Create a mapping in OWB that has the STG_TAB table
    In the property inspector for the operator, find the DataRule node and open it up
    Set the rule to Move to Error

SetRuleActions:

4) On the mapping, while still highlighting the operator


Set Match By Constraint to NO_CONSTRAINTS (this is assuming the STG_TAB has no PKs or UKs that would work for the delete matching)

SetNoConstraint:

Now find the attribute in the table that is doing the lookup (e.g. the pk without being an enforced pk) and highlight that in the map
Set the Match column when Deleting row to YES

SetDeleteAttribute:

5) Validate the mapping
6) Re-deploy the staging table, you must do this because you need to get the error table deployed as well... so don't load data yet, first redeploy! (Or generate the scripts and simply run the DDL for the error table)
7) Deploy the LKP_TAB
8) Deploy the mapping
9) Insert the data from the scripts above in both tables
10) Run the mapping and you will see the 1004 and 1005 records being removed from the STG_TAB and placed in the STG_TAB_ERR

Here is the final data set in both these tables:

FinalDataCorr:

The error table (I have imported the table back from the database after deploying it via the STG_TAB deploy step - OWB does not do this automatically) looks like this, just as we expected.

FinalDataErr:

May 19, 2008

Data Watch and Repair for Master Data Management

Some happy news on the data quality front (yes I'm harping on that quite a bit these days...). We just completed the new Data Watch and Repair solution using the entire OWB technology stack. One of the things we phased when looking at our MDM products was that we need more emphasis on quality of the data. How do we empower the data steward to take charge of data quality.

So we combined forces and worked on something called data watch and repair for master data management. It is a solution using the some exising technology from OWB (the data quality option for data profiling, data rules, data corrections and data auditors) and a new piece, a connector for the MDM products (for CDH, UCM and PIM). The latter is now officially available for OWB (it is part of OWB 11.1.0.6).

Now what makes this so interesting (in my humble opinion)? DWR - don't you love those acronyms! - allows you to leverage the technology we had in house with pre-defined data rules and pre-defined data corrections. This makes your time to market a lot shorter and your data a lot better. The solution follows our best practices and uses the generated corrections to write back data for the master data applications reducing the need to hand code any routines. The DWR solution also ensures you concentrate your effort on profiling and on data rules. These data rules now drive the entire implementation of data quality enabling the previously mentioned generated data corrections.

You will see more and more information appearing. For a first quick glance at DWR take a look at this paper. You will get an idea of best practices around a data quality solution and how DWR first within that picture.

May 30, 2008

BIWA Summit call for papers

BIWAlogo:

Oracle BIWA Summit - December
2nd & 3rd

Please pass this
information to customers and partners




Oracle
BIWA Summit 2008 is a forum for business intelligence (BI), warehousing
and analytics professionals to exchange information, experiences and
best practices. With over 75 presentations and workshops and the entire
event focused on BI, Warehousing and Analytics (BIWA), you will get the
knowledge and information critical to be successful in your work. You
will hear experts present novel and interesting use cases of Oracle
Database-centric BIWA topics through keynotes, technical talks, hands
on workshops, discussion panels and more.


The summit will be
held December
2-3, 2008 at the Oracle
Conference Center in Redwood Shores, CA
and
offers outstanding
value for BI professionals who use Oracle technology.  


Call for Presentations & Technical
Workshops


You are invited to submit an abstract for a Presentation (50 minutes)
or hands-on Technical Workshop (110 minutes) to the Oracle Business
Intelligence, Warehousing, and Analytics (BIWA) Summit 2008 at the
Oracle Conference Center in Redwood Shore, CA.  See the Submission
Process at
request
for Presentation Website




Important
Dates


Last
date for submission of abstracts:  
August
15, 2008


Presentation
notification:  
September 30, 2008

BIWA
Summit: 
December 2-3, 2008




LEARN MORE

BIWA
Summit 2008 Website


Request
for Presentation Website

OracleBIWA.org

Watch for Summit Registration details coming soon.



About BIWA: 

BIWA
is the Oracle Business Intelligence, Warehousing, and Analytics Special
Interest Group, part of the Independent Oracle User Group (IOUG). BIWA
is a worldwide association of persons seeking the successful deployment
of Oracle Database-centric business intelligence, data warehousing, and
analytical products, features and Options.

September 24, 2008

On Data Quality and OWB at Oracle OpenWorld

We've all been taking our shifts at the demo booths at OpenWorld, and talking to current or future (?) customers, and one question that's come up a number of times is:

"Can we use OWB as a standalone data quality product for data profiling, data quality and data auditing without replacing our current ETL processing with Informatica/DataStage/hand-coded PL/SQL/what have you?"

The answer (as usual) is "Of course you can."

OWB's initial data profiling is basically unobtrusive, extracting data from your sources into a temporary schema where the actual profiling work is done. And OWB's data auditors can test data periodically in your sources and targets even if those are loaded by some other tool.
You can even use OWB to cleanse your dirty data in place, either in your staging area before loading your warehouse or in the ODS or master data itself. We do have customers who license the OWB DQ option for these kinds of scenarios, even when they already have other solutions for the ETL component.

Of course, you will get the maximum benefit from OWB DQ if you use the inline data quality capabilities to handle dirty data during ETL. But you can start introducing OWB DQ without doing a "big bang"-type re-implementation of your ETL in OWB.

OWB, including the data quality capabilities, is installed with any Oracle 11g database, and the standalone 10gR2 install is available on OTN as well to be installed alongside older Oracle databases. And while you do have to license the DQ option to really use it in production, of course, the surest way to know if data quality will add value is to evaluate it. You never know what's in your data until you look. Try a cursory data profiling effort on a subset of your data, just to see whether there's anything in there you aren't expecting.

Reminder: I'll be speaking on using OWB in DW and MDM-type scenarios Thursday at 9AM at Oracle OpenWorld. And my talk will contain tips and tricks on how to ease OWB into the DQ role with minimum disruption to your existing solution. OpenWorld attendees, please come to the talk or stop by the demogrounds booths for OWB to find out more.

October 17, 2008

Getting your feet wet with data profiling: freeware tool

While a lot of what we do on the OWB blog is about teaching OWB users to get more out of the tool, we also want to stimulate Oracle database customers' thinking about their data in general. (We think OWB data quality and its integration with ETL compare favorably with anything else out there, so the more people think about the problem, the more they will gravitate towards OWB as a solution.)

Dylan Jones over at dataqualitypro.com is clearly thinking about some of the same things we've been thinking about in Warehouse Builder for some time. He's introduced a free tool for doing basic pattern analysis of data in an Oracle database, which in some ways provides similar insights into your data as basic profiling with Oracle Warehouse Builde. You can find out more about his freeware tool here.

If you start getting interesting insights using a tool like this, you will find more value in trying out the OWB data profiling and data quality features against your database.

Many OWB customers actually use OWB as a standalone data quality tool, mostly ignoring the ETL and data integration features when they get started. Then, once they have insight into their data, they can create data rules to actually enforce what they discover, and introduce data cleansing and data auditing on their sources and targets with OWB to catch and resolve any bad data that's coming in.

In general, you can use OWB for data quality measurement and enforcement and even data cleansing without disturbing your existing ETL logic: data profiling and data quality auditing are non-intrusive by nature, and for data cleansing you can create OWB mappings that either cleanse your data in place or copy cleansed data to a temporary location from which you can re-load it with your existing ETL method.

Anyway, we wanted to share this thought with you, and give a tip of the hat to Dylan's community, which is growing fast and generating a lot of fresh, thought-provoking content.

October 29, 2008

Good Data Quality Read on Rittman-Mead's Blog, and validating input with data rules

There's a good read over on the Rittman-Mead Consulting blog on End-to-End data quality. OWB would be a good tool to implement parts of the solution they have in mind-- to do an initial DQ assessment, and to do cleansing during load. And DQ reporting on both the sources and the target system is certainly doable, using data auditors and your favorite BI application, to prove to the powers above that the problem isn't yours.

Input validation is where OWB's answer is a bit less obvious, though. Enforcing the OWB DQ rules during input validation is doable, I think, but you have to be creative. Here's how I'd go about it...

Continue reading "Good Data Quality Read on Rittman-Mead's Blog, and validating input with data rules" »

January 28, 2009

Data Quality is Job Number 1

Here's a good read from Enterprise Systems Journal-- a reminder of why data quality is important to all of our efforts, and why OWB has extensive data profiling and data quality features built in.

Of course, if you're reading this blog, you're already mindful of these issues, and are using OWB data profiling and data cleansing with all your data and auditing quality regularly... right? :)


Analysis: Data Quality is Job Number 1

BI tools are only as good as the quality of the data they work with. Analyst Michael Schiff is still surprised at how many BI professionals still ignore this fact.
By Mike Schiff
1/28/2009
Much attention has been focused recently on integrating data from multiple sources to populate data warehouses or data marts for analysis purposes or as part of a migration effort for new enterprise applications. For example, a recent press release from a well-established business intelligence vendor highlights the ability of its BI platform to access multiple data sources where they reside without first having to move the data into a data warehouse or a data mart.

Rather than dwelling on the tradeoffs (perhaps a topic for future analysis) among centralized data warehouses, federated databases, enterprise information integration (EII), or even the old concept of virtual data warehouses, I would like to point out a common property they share: the quality of the information obtained from any of them is directly dependent on the quality of the data they access or contain. In other words, GIGO (garbage in, garbage out) always has applied (and always will apply) to both analytical and operational systems.

We all recognize this, yet I continue to be amazed as to how often we ignore the fact. In most cases, it's not deliberate; rather it results from taking at face value preliminary assumptions about data quality rather than using techniques such as data profiling to validate the assumptions.


August 27, 2009

Matching Abbreviations and Acronyms

A quick post on how to match data using composite match rules for matching abbreviations or matching acronyms. For example we want to match the following tokens;

  • International Business Machines
  • Intl Bus Machines
  • IBM

One of these is a full name, one is an abbreviation (Intl Bus Machines) and one is an acronym (IBM). How can we match multiple rows and cherry pick information from matching rows to construct a result without using custom code? Let's have a look and see...

The map below has a source DATA1_ABBREV (its an external table on a simple flat file) which includes the 3 strings we are matching above, we pass in a constant with the value 1 for our binning strategy.

match_merge1

The mapping uses the match merge operator in OWB which lets you define rules for how to bin data, match data and merge it. There are a wide set of rules and you can add custom rules too.

Binning

Binning is basically a divide and conquer approach to speed up the processing. For example, if you are loading one million rows into a table with a million rows then you want to avoid doing a million by a million comparison. Binning allows you to subset this processing as matching is only performed within the record set for that bin.

The smaller the match bin, the faster the processing, however, rows will only be considered within the match bin so you need to make sure your binning does not exclude any possible matches (that you do not want eliminated). You can have match bins based on any fields passed into the match-merge operator and since these can contain any operator (including constants,expressions and transformations) you have lots of flexibility. In the case below a constant is passed in so there is one bin (not the typical case, but we have a small set of rows for a demo).

match_merge2

Another binning example is on country name, suppose the bin is based on the first 2 characters of the country name which would result in say 500 records. This would considerably speed up the processing but would miss records if someone has put SOTLAND instead of SCOTLAND. Setting the match bin to just the first character would be more accurate but would result in more match tests.

The Match Rules

As we mentioned earlier we have full strings, abbreviations and acronyms to match. We can add multiple match rules, below we have a match rule MA_0 which is of type conditional and in the detail table we see it uses the Abbreviation algorithm.

match_merge3

There is also a match rule MA_1 which is also of type conditional and in the detail table we see it uses the Acronym algorithm. So records will be matched if any active rule passes.

match_merge4

In our example we have no merge rules (not typical), this will randomly select ANY value for the attribute. We could get it to select the shortest or longest value for example.

match_merge6

Here we have a merge rule that uses the Min Max merge rule type and uses the attribute and picks the 'Longest' value, this will ensure we get 'International Business Machines' and not 'IBM' or an abbreviated version.

So here we've seen a way to build up matching algorithms on data based on prebuilt rules that we can use to built up useful matching functionality when cleaning and consolidating data. For more details check the OWB user's guide for the match merge data quality operator and also other posts such as the one on complex de-duplication. Interested to hear other approaches on this that people utilize.

August 29, 2009

Proving the Business Value of Data Quality

It can be tough to show the business value of data quality to stakeholders and sponsors you need to move a DQ effort forward. Dylan Jones over at Data Quality Pro offers some great advice on how to tune up your pitch so you can show decision-makers the impact of DQ across your business, and win the support you need.

There’s a lot to the post, but here’s a couple of key snippets:

Link the information chain to the service value chain. Probably the most powerful technique I have used for building the business case is to integrate a model of how the business drives value with the underlying data. By assessing the information chain that supports this service chain you can build a far more compelling story than simply measuring data quality in isolated data stores. Sponsors like to see that you are not just a data wizard but that you understand their business intimately.

Don't use PowerPoint. I don't know if this is a coincidence but since I dropped it I have had far more success in winning proposals. Eliminating PowerPoint forces you to think of far more creative and engaging mediums.

Demonstrate a link from impacts to evidence using an interactive dashboard linked to data. I typically create a hybrid business model, data model and data quality model populated with full datasets. Using a data visualization tool to present the high level impacts we can then drill-down into the detail as the inevitable barrage of questions come your way…. The sponsors can also take the dashboard away with them to speak with their respective teams which is far more powerful than a PowerPoint file.

Read the rest here at Data Quality Pro.

November 23, 2009

OWB 11gR2 – Oracle By Examples

There are a number of OBEs available for the OWB 11gR2 release which range from an overview of all the cool new usability improvements in the mapping editor (auto-mapping improvements, copy-paste expressions, operators, operator groups, group/ungroup operators, spotlighting to name a few!), through basics of using the operators in a mapping for joining, filtering, pivoting and so on, then diving into building pluggable mappings (aiding reusability and good software design practice)  and last but not least an OBE on data profiling covering profiling, deriving rules and even cleansing the data!

The entire Oracle OBE 11gR2 release OBE's are here, and the OWB set are under 'Business Intelligence and Data Warehousing' then 'Warehouse Builder':

There are some interesting examples here, worth checking out to see some of the cool new features and existing ones that illustrate the capabilities! In the 11gR2 set there are other interesting posts well worth checking out such as the High Speed Data Loading and Rolling Window Operations with Partitioning post, all essential information for leveraging the database to the max in your ETL loads.

About Data Quality

This page contains an archive of all entries posted to Oracle Warehouse Builder (OWB) Weblog in the Data Quality category. They are listed from oldest to newest.

Comments is the previous category.

ETL is the next category.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type and Oracle