Continuing the data quality conversation from my previous posts on effectiveness verses accuracy and defining accuracy thresholds here is a technique for categorizing various types of data quality issues and developing a strategy for corrections.
If the effectiveness mindset is established, the next concept to follow is to realize there are various types of data quality issues, and correction alternatives are numerous. Data quality issues appear as multiple levels of problems. Let's revisit some of the levels of complexity and formulate strategies on how to correct them.
Level 1: "Fat Fingers"
This is the layer of the pure mistake. Numeric fields are blank, or have alpha values. Alpha codes are blank, or have bad values. Occasionally records are missing. Once known, these errors can be corrected and audited in the future. It is commonplace to repair these errors through ETL processing, but the best and cheapest way is to fix the problems at the operational source.
Level 2: "Bad Values"
The next layer can present some insidious issues. Numbers and values may be valid for a given field or file, but they are wrong in the context of a particular event. These errors are caused by timing problems, synchronization, etc. The classic example is a difference in total across two similar dimensions, because one is supplied by data two days after another. The timing difference creates an illusion of error. These can be corrected via ETL processes.
However, tight controls are required to ensure that altered values, once corrected, do not radically differ from the answers expected by the end users. Audit trails must be used to manage and control end-user expectations. One money saver is to use tools that audit data before it is used as a source for the data warehouse. In lieu of tools, a series of SQL queries can be developed to analyze codes and values for correctness.
Level 3: "Referential Issues"
At this level, incorrect values are not detectable without complicated scrutiny. This level is where most customer data goes awry. Addresses and dimensions can be incorrectly extracted, and there is no referential or other means to know they are wrong. The correction of this data definitely occurs after the fact, using external data services (e.g., Harte Hanks, Acxiom, Polk, and so on) and data cleansing tools (e.g., OWB, Trillium, First Logic, and so on). Correction techniques will include parsing, matching, standardization, and consolidation. More than customer data can be affected, as product and supplier information is also often in error but not detectable.
Level 4: "Invisible Information"
After level 3, the data issues are obviously more in light of data effectiveness as compared with incorrectness. For example, information critical to create a dimension in the data warehouse is buried in a literal field. Parsing must be used to extract the correct values for the dimension. Often, the desired data is buried within complex ancient codes, or worse, a combination of codes known only to a few tribal elders. Converting this data into useful DW dimensions requires complicated coding and even translation tables that must be maintained. This level of DQ contributes significantly to ETL processing and overall data warehouse cost of ownership.
Level 5: "Immaculate Conception Information"