Matching Abbreviations and Acronyms
By David Allan on Aug 27, 2009
A quick post on how to match data using composite match rules for matching abbreviations or matching acronyms. For example we want to match the following tokens;
- International Business Machines
- Intl Bus Machines
One of these is a full name, one is an abbreviation (Intl Bus Machines) and one is an acronym (IBM). How can we match multiple rows and cherry pick information from matching rows to construct a result without using custom code? Let's have a look and see...
The map below has a source DATA1_ABBREV (its an external table on a simple flat file) which includes the 3 strings we are matching above, we pass in a constant with the value 1 for our binning strategy.
The mapping uses the match merge operator in OWB which lets you define rules for how to bin data, match data and merge it. There are a wide set of rules and you can add custom rules too.
Binning is basically a divide and conquer approach to speed up the processing. For example, if you are loading one million rows into a table with a million rows then you want to avoid doing a million by a million comparison. Binning allows you to subset this processing as matching is only performed within the record set for that bin.
The smaller the match bin, the faster the processing, however, rows will only be considered within the match bin so you need to make sure your binning does not exclude any possible matches (that you do not want eliminated). You can have match bins based on any fields passed into the match-merge operator and since these can contain any operator (including constants,expressions and transformations) you have lots of flexibility. In the case below a constant is passed in so there is one bin (not the typical case, but we have a small set of rows for a demo).
Another binning example is on country name, suppose the bin is based on the first 2 characters of the country name which would result in say 500 records. This would considerably speed up the processing but would miss records if someone has put SOTLAND instead of SCOTLAND. Setting the match bin to just the first character would be more accurate but would result in more match tests.
The Match Rules
As we mentioned earlier we have full strings, abbreviations and acronyms to match. We can add multiple match rules, below we have a match rule MA_0 which is of type conditional and in the detail table we see it uses the Abbreviation algorithm.
There is also a match rule MA_1 which is also of type conditional and in the detail table we see it uses the Acronym algorithm. So records will be matched if any active rule passes.
In our example we have no merge rules (not typical), this will randomly select ANY value for the attribute. We could get it to select the shortest or longest value for example.
Here we have a merge rule that uses the Min Max merge rule type and uses the attribute and picks the 'Longest' value, this will ensure we get 'International Business Machines' and not 'IBM' or an abbreviated version.
So here we've seen a way to build up matching algorithms on data based on prebuilt rules that we can use to built up useful matching functionality when cleaning and consolidating data. For more details check the OWB user's guide for the match merge data quality operator and also other posts such as the one on complex de-duplication. Interested to hear other approaches on this that people utilize.