« Evaluating I/O Performance for Large Data Warehouses | Main | New Issue of the Journal of Management Excellence is out! »

Data Quality: Data Correction Strategies

Continuing the data quality conversation from my previous posts on effectiveness verses accuracy  and defining accuracy thresholds here is a technique for categorizing various types of data quality issues and developing a strategy for corrections.

If the effectiveness mindset is established, the next concept to follow is to realize there are various types of data quality issues, and correction alternatives are numerous. Data quality issues appear as multiple levels of problems. Let's revisit some of the levels of complexity and formulate strategies on how to correct them.

Level 1: "Fat Fingers"

This is the layer of the pure mistake. Numeric fields are blank, or have alpha values. Alpha codes are blank, or have bad values. Occasionally records are missing. Once known, these errors can be corrected and audited in the future. It is commonplace to repair these errors through ETL processing, but the best and cheapest way is to fix the problems at the operational source.

Level 2: "Bad Values"

The next layer can present some insidious issues. Numbers and values may be valid for a given field or file, but they are wrong in the context of a particular event. These errors are caused by timing problems, synchronization, etc. The classic example is a difference in total across two similar dimensions, because one is supplied by data two days after another. The timing difference creates an illusion of error. These can be corrected via ETL processes.

However, tight controls are required to ensure that altered values, once corrected, do not radically differ from the answers expected by the end users. Audit trails must be used to manage and control end-user expectations. One money saver is to use tools that audit data before it is used as a source for the data warehouse. In lieu of tools, a series of SQL queries can be developed to analyze codes and values for correctness.

Level 3: "Referential Issues"

At this level, incorrect values are not detectable without complicated scrutiny. This level is where most customer data goes awry. Addresses and dimensions can be incorrectly extracted, and there is no referential or other means to know they are wrong. The correction of this data definitely occurs after the fact, using external data services (e.g., Harte Hanks, Acxiom, Polk, and so on) and data cleansing tools (e.g., OWB, Trillium, First Logic, and so on). Correction techniques will include parsing, matching, standardization, and consolidation. More than customer data can be affected, as product and supplier information is also often in error but not detectable.

Level 4: "Invisible Information"

After level 3, the data issues are obviously more in light of data effectiveness as compared with incorrectness. For example, information critical to create a dimension in the data warehouse is buried in a literal field. Parsing must be used to extract the correct values for the dimension. Often, the desired data is buried within complex ancient codes, or worse, a combination of codes known only to a few tribal elders. Converting this data into useful DW dimensions requires complicated coding and even translation tables that must be maintained. This level of DQ contributes significantly to ETL processing and overall data warehouse cost of ownership.

Level 5: "Immaculate Conception Information"

While it sounds crazy, many times users request data that does not exist. If it does exist, it is in a format that is impossible to access (e.g., a rolodex.) There is no easy way to deal with this issue. Some companies fix the problem by entering new data into the DW---a dangerous practice, as having structures that allow updates and/or queries at the same time compromises performance. Others create a staging area to enter data before the DW load, or develop an ersatz production system to create the new values. These approaches are also used when there is financial data that is issued after loading (as in financial adjustments). The preferred manner for all of these issues is to cycle the new data back through the ETL processes, or change operational systems to contain the new information.



Post a comment

(If you haven't left a comment here before, you may need to be approved by the site owner before your comment will appear. Until then, it won't appear on the entry. Thanks for waiting.)

About This Entry

This page contains a single entry from the blog posted on May 21, 2009 5:09 AM.

The previous post in this blog was Evaluating I/O Performance for Large Data Warehouses.

The next post in this blog is New Issue of the Journal of Management Excellence is out!.

Many more can be found on the main index page or by looking through the archives.

Powered by
Movable Type and Oracle