Match and Merge - Complex Deduplication Rules Part 1

The match and merge operator is often completely missed by users in OWB. Turn the light on! It is probably the most powerful mapping operator in the operator set and it is in the Core ETL of OWB. OWB has had this operator for a number of years and it has been quietly improving as time has gone by: for example, some core algorithms were added in 10gR2 of the database specifically for improving performance in this area. UTL_MATCH (which includes the Levenshtein distance and Jaro-Winkler algorithms) arrived in the database in 10gR2; OWB had a pure PL/SQL implementation prior to this. OWB also provides double metaphone routines that can be used if desired.
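For readers unfamiliar with the algorithm behind edit-distance matching, here is a minimal Python sketch of the Levenshtein distance that UTL_MATCH.EDIT_DISTANCE computes (the function name and sample SSNs below are mine, purely for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance: the minimum number of
    # single-character inserts, deletes and substitutions turning a into b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

print(levenshtein("915-12-1234", "915-21-1234"))  # 2
```

Two SSNs that differ by a transposed pair of digits come out at distance 2, which is why a threshold matters when you tune a fuzzy match rule.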

There are some nifty capabilities that are not immediately obvious (what's new, you cry!). Here I'll lay out an example that shows cross-table matching using compound custom rules. We have a number of tables, files (whatever) from various sources that should be consolidated; we have a bunch of rules that define how the rows should be matched, and we have rules for how the rows should be merged.

Match rules in order:

  Column       Rule description
  SSN          If the SSN is not null and not equal to 999-99-9999 then use fuzzy (edit distance) matching
  SAP_CUST_ID  If the SAP_CUST_ID is not null then use partial (abbreviation) matching
  XYZ_CUST_ID  If the XYZ_CUST_ID is not null then use exact matching
  ABC_CUST_ID  If the ABC_CUST_ID is not null then use exact matching
  NAME         If the first name and the last name are not null then use fuzzy (soundex) matching

So the match rules are read as: two rows match if the rule 1 description is satisfied, or if the rule 2 description is satisfied, and so on.
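This "OR of rules" semantics can be sketched in Python (this is an illustration only, not OWB's rule editor; the fuzzy edit-distance, abbreviation and soundex comparisons are stood in for by exact and prefix checks to keep the sketch short):

```python
def rows_match(this: dict, that: dict) -> bool:
    """A pair of rows matches if ANY of the match rules fires."""
    # Rule 1: both SSNs present and neither is the 999-99-9999 placeholder
    # (the real rule uses edit distance < 2 rather than equality).
    if (this.get("SSN") and that.get("SSN")
            and "999-99-9999" not in (this["SSN"], that["SSN"])
            and this["SSN"] == that["SSN"]):
        return True
    # Rule 2: SAP_CUST_ID present on both sides; the real rule uses
    # abbreviation matching, sketched here as a prefix check.
    if (this.get("SAP_CUST_ID") and that.get("SAP_CUST_ID")
            and (this["SAP_CUST_ID"].startswith(that["SAP_CUST_ID"])
                 or that["SAP_CUST_ID"].startswith(this["SAP_CUST_ID"]))):
        return True
    # Rules 3 and 4: exact match on either customer id when present.
    for col in ("XYZ_CUST_ID", "ABC_CUST_ID"):
        if this.get(col) and this.get(col) == that.get(col):
            return True
    # Rule 5: both names present; the real rule uses SOUNDEX on the parts.
    if (this.get("NAME_F") and this.get("NAME_L")
            and that.get("NAME_F") and that.get("NAME_L")
            and (this["NAME_F"], this["NAME_L"]) == (that["NAME_F"], that["NAME_L"])):
        return True
    return False
```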

Merge Rules:

  Column       Rule description
  NAME_M       The longest non-null middle name
  SSN          The most common SSN
  NAME_F       The longest non-null first name from Table A
  A_CUST_SEQ   From the same record as the merged SSN
  SAP_CUST_ID  The most common SAP_CUST_ID with 7 characters

The merge rules are descriptions of how each column should be projected through the merge.

Datastore A

  XYZ_CUST_ID  ABC_CUST_ID  CUST_SEQ  SAP_CUST_ID   NAME_F    NAME_M  NAME_L  SSN
                            9138      KI17038       John      Martin  Smith   915-12-1234
                            2271      KI17038       Jonathan  R.      Smith
                            4805      R5KI9-17038                     Smith   915-21-1234
                            2716      R5KI9-17038N  Jonathan  V       Smith   286-17-5289

Datastore B

  XYZ_CUST_ID  ABC_CUST_ID  CUST_SEQ  SAP_CUST_ID  NAME_F    NAME_M  NAME_L  SSN
  293105                                           John              Smith
  293105                                           Marianne          Smith
  292617                                           Jon       M.      Smith

Datastore C

  XYZ_CUST_ID  ABC_CUST_ID  CUST_SEQ  SAP_CUST_ID  NAME_F  NAME_M  NAME_L  SSN
               006-41803                                           Smith   999-99-9999
               2006-41803                                          Smith   999-99-9999
               2006-41803                                          Smith   915-21-1234

These 3 tables, files (or whatever) have related data that we wish to match-merge with the rules above.

So how do we initially consolidate this disparate data in preparation for match-merging? Simply use a UNION ALL operator, for example.

When applied, the rules mean the following rows would be matched:

Match Merge 1:

In OWB using the match merge operator we create the match rules. For the table of match rules above we define the rules as below. The first rule, MA_1_SSN, is defined as a custom rule; the implementation (in a very small amount of PL/SQL) checks that the SSN is not null and not equal to '999-99-9999' (for both 'this' and 'that' - the rows being compared) and that the edit distance is less than 2. This example was based on 10gR2, so I used the database's native UTL_MATCH package; for pre-10gR2 OWB provided OWB_MATCH_MERGE with a pure PL/SQL implementation.

Match Merge 2:
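The logic of that custom rule can be sketched in Python (edit_distance here is an illustrative stand-in for UTL_MATCH.EDIT_DISTANCE, and the function names are mine, not OWB's):

```python
def edit_distance(a: str, b: str) -> int:
    # Stand-in for UTL_MATCH.EDIT_DISTANCE: dynamic-programming Levenshtein.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ssn_rule_matches(this_ssn, that_ssn) -> bool:
    # Both SSNs must be present, neither may be the 999-99-9999
    # placeholder, and the edit distance must be less than 2.
    if not this_ssn or not that_ssn:
        return False
    if "999-99-9999" in (this_ssn, that_ssn):
        return False
    return edit_distance(this_ssn, that_ssn) < 2
```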

Note that all rules in the example above are defined as passive and only one is active. OWB does not directly execute passive match rules; they execute only when they are called from an active custom match rule. All defined match rules appear in the list of available functions in the Custom Match Rule Editor. To reorder rules, click on the row header on the left-hand side, hold for a couple of seconds, then drag the row up or down.

The second rule ensures the SAP_CUST_ID column matches using partial (abbreviation) matching (the rule type is conditional and the algorithm is abbreviation):

Match Merge 3:

Note here that the rule we wanted to define was 'If the SAP_CUST_ID is not null then use partial (abbreviation) matching'. We will do the not-null check in the final custom match rule! The same goes for the next few rules.

The next rule checks XYZ_CUST_ID for an exact match:

Match Merge 4:

The fourth rule checks ABC_CUST_ID for an exact match:

Match Merge 5:


Finally, the active rule at the end is a composite rule that augments the earlier defined rules with further refinements (for example it adds the not-null check to the SAP_CUST_ID rule) and then performs SOUNDEX matching on the names:


Match Merge 6:


When you are building custom rules there is a specific editor for editing the rule; it lists the available parameters and match functions for building the custom rule. You can incorporate other rules, including the passive rules defined earlier (this is how they are triggered).

Match Merge 7:


So finally our match rules are defined!

Merge Rules

The merge rules define how the operator's merged attributes are determined. Let's recap the requirements:

  Column       Rule description
  SSN          The most common SSN
  NAME_M       The longest non-null middle name
  NAME_F       The longest non-null first name from Table A
  SAP_CUST_ID  The most common SAP_CUST_ID with 7 characters
  A_CUST_SEQ   From the same record as the merged SSN

OWB provides a number of built-in merge rule types, for some of the above we need a custom rule, let's dig deeper.

The NAME_M column is populated from the longest middle name. For this attribute the Min/Max rule type can be used: select the attribute and choose Longest as the attribute relation (the options are minimum, maximum, longest and shortest):

Match Merge 8:

For NAME_F we will decide that table A records are identified by having a non-null CUST_SEQ value. We could also have identified the source by adding a constant attribute for each source (or system) when we unioned the tables, rather than depending purely on the data. We can code custom merge rules to decide which value to merge; here we loop around and return the longest non-null name - this is just a simple PL/SQL loop over the collection M_MATCHES:

fName varchar2(2000) := null;
BEGIN
    -- return the longest first name from table a
    -- in table a, CUST_SEQ is not null
  FOR i IN M_MATCHES.FIRST .. M_MATCHES.LAST LOOP
      IF  M_MATCHES(i)."NAME_F" IS NOT NULL and
          M_MATCHES(i)."CUST_SEQ"  is not null THEN
        IF fName IS NULL OR LENGTH(RTRIM(M_MATCHES(i)."NAME_F")) > LENGTH(RTRIM(fName)) THEN
          fName := M_MATCHES(i)."NAME_F";
        END IF;
      END IF;
  END LOOP;
  RETURN fName;
END;

The SSN rule can be defined using a PL/SQL loop that simply counts occurrences and returns the SSN with the highest occurrence count. Here we see the definition of the custom rule for SSN; it is simply two loops, one to build the occurrence counts and a second to return the value with the highest count:

type info_type is table of varchar2(4000) index by binary_integer;
type infocnt_type is table of pls_integer index by binary_integer;
v_infos info_type;
v_infocnt infocnt_type;
retval varchar2(20);
retcnt pls_integer;
found boolean;
indx pls_integer := 1;
begin
  -- return the modal (most common) value of "SSN", ignoring nulls
  -- and the 999-99-9999 placeholder
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and
         M_MATCHES(i)."SSN" != '999-99-9999' then
        found := false;
        if v_infos.count > 0 then
          for j in  v_infos.first .. v_infos.last loop

            if found = false and v_infos(j) = M_MATCHES(i)."SSN" then
              -- found it
              v_infocnt(j) := v_infocnt(j) + 1;
              found := true;
            end if;
          end loop;
        end if; -- v_infos is not null
        if found = false then
          v_infos(indx) := M_MATCHES(i)."SSN";
          v_infocnt(indx) := 1;
          indx := indx + 1;
        end if;
      end if;
  end loop;
  -- find value with max count
  if v_infos.count > 0 then
    for i in v_infos.first .. v_infos.last loop
      if retval is null then
        retval := v_infos(i);
        retcnt := v_infocnt(i);
      else
        if (v_infocnt(i) > retcnt) then
          retcnt := v_infocnt(i);
          retval := v_infos(i);
        end if;
      end if;
    end loop;
  end if;
  return retval;
end;


This might look like a lot of code, but it really is pretty much 97% boilerplate. Look for SSN in the code above; there are only a few parts specific to this pattern of finding the most common occurrence.
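To make the underlying pattern clearer, the same most-common-occurrence logic can be expressed very compactly outside PL/SQL; here is a Python sketch using collections.Counter (function name and sample rows are illustrative):

```python
from collections import Counter

def most_common_ssn(matches):
    # Modal non-null SSN across the matched rows, skipping the
    # 999-99-9999 placeholder - the same logic as the PL/SQL rule.
    counts = Counter(r["SSN"] for r in matches
                     if r.get("SSN") and r["SSN"] != "999-99-9999")
    return counts.most_common(1)[0][0] if counts else None

rows = [{"SSN": "915-21-1234"}, {"SSN": "915-21-1234"},
        {"SSN": "286-17-5289"}, {"SSN": "999-99-9999"}, {"SSN": None}]
print(most_common_ssn(rows))  # 915-21-1234
```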

The CUST_SEQ merge rule, which returns the CUST_SEQ from the record with the merged SSN, is a loop that finds the matching SSN and returns its CUST_SEQ; if none is found it returns the previously merged CUST_SEQ:

BEGIN
  -- select the cust_seq from the same record as the merged SSN.
  for i in M_MATCHES.FIRST .. M_MATCHES.LAST loop
      if M_MATCHES(i)."SSN" is not null and
         M_MATCHES(i)."CUST_SEQ" is not null and
         M_MATCHES(i)."SSN" = M_MERGE."SSN"  then
        return  M_MATCHES(i)."CUST_SEQ" ;
      end if;
  end loop;
  return  M_MERGE."CUST_SEQ";  -- previously selected cust sequence
END;

The rule for SAP_CUST_ID is to select the most common customer id that is 7 characters in length. We can easily do this using the Rank rule type, with an expression that checks for a length of 7:

Match Merge 9:
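The effect of that Rank rule can be sketched in Python (an illustration of the rule's behaviour, not OWB syntax; the function name is mine):

```python
from collections import Counter

def merged_sap_cust_id(matches):
    # Most common SAP_CUST_ID among the values that are exactly
    # 7 characters long; None if no candidate qualifies.
    counts = Counter(r["SAP_CUST_ID"] for r in matches
                     if r.get("SAP_CUST_ID") and len(r["SAP_CUST_ID"]) == 7)
    return counts.most_common(1)[0][0] if counts else None

rows = [{"SAP_CUST_ID": "KI17038"}, {"SAP_CUST_ID": "KI17038"},
        {"SAP_CUST_ID": "R5KI9-17038"}, {"SAP_CUST_ID": None}]
print(merged_sap_cust_id(rows))  # KI17038
```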

I think that's enough on the rules for both matching and merging; you get the idea now that this is very powerful.

Anatomy of the Map

Let's check out the map structure. To consolidate multiple source tables you can simply UNION ALL the tables together. The great thing about OWB is that you can also do a lot more if needed: you can perform selection/filter processing using the other OWB operators before the data is pushed into the match merge operator. This is a technique we've suggested in the past to improve performance, so that there is not a bunch of redundant matching going on - ensure you have reduced the bins processed to those bins with changes.

Match Merge 10:

The binning concept is very simple; it is all to do with divide and conquer! It is often misunderstood and misused. Divide the problem up into bins and match and merge those chunks; a good choice of bins can dramatically reduce the number of comparisons the map will perform.

Here is an excerpt from the documentation: 'When Warehouse Builder matches the rows, it compares each row with the subsequent row for all rows within the same grouping. Limiting the number of rows can greatly enhance performance, because Warehouse Builder searches for matches only within a bin and not throughout the entire data set.'
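A quick back-of-envelope calculation shows why binning pays off: pairwise matching within a group is n*(n-1)/2 comparisons, so splitting 10,000 rows into four even bins (illustrative numbers) cuts the comparison count by roughly a factor of four:

```python
def pair_comparisons(n: int) -> int:
    # Number of pairwise comparisons among n rows: n choose 2.
    return n * (n - 1) // 2

unbinned = pair_comparisons(10_000)
binned = sum(pair_comparisons(b) for b in [2_500] * 4)
print(unbinned, binned)  # 49995000 12495000
```

Skewed bins do worse than even ones, which is why the distribution of bin sizes (median, average, min, max) matters when you pick a binning attribute.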


The OWB wizard would be better if there were a binning advisor that actually tied into the data and said: OK, you have selected column ZIPCODE; this will give you a median bin size of W, an average bin size of X, a minimum bin size of Y and a maximum bin size of Z. That kind of information would help you make some of these decisions. You must also remember that you cannot pick an arbitrary column for binning, since you also want to avoid separating rows that should be matched; the attributes you select for grouping similar rows depend on your data.

The general matching loops are structured as:

OUTER Loop for each bin
   MIDDLE loop for new records (filtered input records) OR MIDDLE loop for all input records
      INNER loop for all input records

In the MIDDLE loop it is possible to use the 'Match new records only' option, which restricts the MIDDLE loop to only those records identified by the new record condition. Although titled 'new records only', you can use this for any specialist filtering you have, such as matching new and updated records. This is another option that can dramatically alter performance.
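The loop structure above can be sketched in Python (a deliberate simplification: bin_key and is_new are illustrative stand-ins for the grouping attributes and the new record condition, and real match-merge also takes the transitive closure of matches):

```python
from collections import defaultdict

def candidate_pairs(records, bin_key, is_new=lambda r: True):
    # OUTER loop: each bin; MIDDLE loop: (optionally filtered) records;
    # INNER loop: the remaining records in the same bin.
    bins = defaultdict(list)
    for r in records:
        bins[bin_key(r)].append(r)
    pairs = []
    for members in bins.values():              # OUTER: each bin
        for i, left in enumerate(members):     # MIDDLE: filtered records
            if not is_new(left):               # 'Match new records only'
                continue
            for right in members[i + 1:]:      # INNER: rest of the bin
                pairs.append((left, right))
    return pairs

recs = [{"zip": "94065", "id": 1}, {"zip": "94065", "id": 2},
        {"zip": "10001", "id": 3}]
print(len(candidate_pairs(recs, lambda r: r["zip"])))  # 1
```

Only rows sharing a bin are ever compared, which is exactly where the performance win (and the risk of over-aggressive binning) comes from.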

In Summary

As you can see, OWB has very sophisticated deduplication and match merge capabilities, essential for incorporating data quality into your information processes. This is integrated into the OWB designer and works just like the rest of the operator set; it truly is one of the richest and most powerful operators in Oracle's data quality operator set.

There is a zip file here containing the MDL for this demonstration along with the data for the sources, have a play.

In further posts we'll look at matching items based on context information (see an existing OBE here), such as whether the column is a name, an address or whatever. There are a number of ways to compare strings, and depending on what the data means different comparison algorithms can be used; we will look at this further.
