Are Data Laundromats a Waste of Quarters?
There’s been some interesting discussion around what’s next for data quality and the fascinating challenges of cleaning data for data warehouses and business intelligence applications. I am always intrigued by blogs that discuss the challenges of data management and of applying cleansing principles to complex data-centric applications across the enterprise. However, I’m dismayed by discussions that quickly jump to the conclusion that data quality should be outsourced under a software-as-a-service model. For example, David Rosenberg writes on his blog:
Look for the emergence of third party B2B integration and commerce management service providers that support data entry and validation for all trading partners. Integrated suites of direct system-to-system integration and Web portal services will be supplemented with combined e-mail and smart-form technologies solving the data quality problem associated with paper-based exchanges with small and occasional trading partners.
While it sounds good on paper, I think this is more marketing spin than a realistic use case. I believe we can meet the goal of clean, trusted, authoritative data without going off-premise. Asking companies to put their most critical asset into third-party hands will create more challenges that aren’t easily solved. Terabytes of data aren’t easily moved like sacks of dirty laundry. Here are five reasons why the business model of outsourced data quality is ahead of its time:
1) Moving data is hard – Moving terabytes of information off-premise once can be challenging enough; moving it again every time changes occur is even harder.
2) Auditing – Turning bad data into good data means lots of changes. Keeping track of these changes, with roll-back capabilities and full auditing, is critical. How can these be easily managed when the data is off-premise?
3) Customization – Every company is unique in how it approaches data, even data like address information, which would seem commonplace. Most on-premise data quality engines offer some type of customizable rules approach, whereas many third-party solutions use generic approaches.
4) Profiling – The forgotten aspect of data cleansing is first seeing and understanding the stain. Off-premise data cleansing solutions assume that the data needs cleansing, but profiling has to be applied on-premise, across the enterprise-wide data-centric applications. That data does not necessarily sit in a single source or a single data warehouse.
5) Trust – It is part psychology and part technology. Companies are likely to outsource certain aspects of their data. For example, a bank might outsource check scans, but only to validate what’s already typed into the system at the bank ATM. Companies will choose to keep most of their core data on-premise, so they’re still going to need an on-premise data quality solution to manage it.
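To make points 2 through 4 a little more concrete, here is a minimal Python sketch of what an on-premise cleansing cycle might look like: profile a field first, apply company-specific rules, log every change, and roll back if needed. Everything in it is an assumption for illustration only: the function names (profile, cleanse, rollback), the "zip" field, and the two sample rules are hypothetical and do not come from any particular data quality product.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class Change:
    """One audited modification: enough detail to undo it later."""
    row: int
    field: str
    old: object
    new: object
    rule: str


def _is_blank(value):
    return value is None or str(value).strip() == ""


def _shape(value):
    # Map letters to "A" and digits to "9" so differing formats stand out.
    return re.sub(r"[0-9]", "9", re.sub(r"[A-Za-z]", "A", str(value)))


def profile(rows, field):
    """See the stain first: blanks, distinct values, and format patterns."""
    values = [row.get(field) for row in rows]
    return {
        "count": len(values),
        "blanks": sum(1 for v in values if _is_blank(v)),
        "distinct": len(set(values)),
        "patterns": Counter(_shape(v) for v in values if not _is_blank(v)),
    }


def cleanse(rows, field, rules, audit):
    """Apply each (name, function) rule in order, recording every change."""
    for i, row in enumerate(rows):
        for name, rule in rules:
            old = row.get(field)
            new = rule(old)
            if new != old:
                audit.append(Change(i, field, old, new, name))
                row[field] = new


def rollback(rows, audit):
    """Undo audited changes in reverse order, emptying the audit log."""
    while audit:
        change = audit.pop()
        rows[change.row][change.field] = change.old


# Hypothetical company-specific rules for a US ZIP code field.
ZIP_RULES = [
    ("trim", lambda v: v.strip() if isinstance(v, str) else v),
    ("pad_zip", lambda v: v.zfill(5)
     if isinstance(v, str) and v.isdigit() and len(v) < 5 else v),
]

rows = [{"zip": " 02139"}, {"zip": "2139"}, {"zip": "02139-4307"}, {"zip": ""}]
audit = []

print(profile(rows, "zip"))          # profile before touching anything
cleanse(rows, "zip", ZIP_RULES, audit)
print(audit)                         # full trail of what changed and why
rollback(rows, audit)                # restores the original values
```

The rule list is the part each company would customize, and the audit log is what makes roll-back possible. Both are easy to keep under your own control when the data never leaves the building.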
If these data Laundromats sound utopian, it is because they are. I believe we may see some type of outsourced data quality, especially where companies need to access outside information such as DUNS or UNSPSC. But the critical core business assets that make up the bulk of their data I would first run through an on-premise cleansing cycle.