In The Emperor of All Maladies: A Biography of Cancer, author Siddhartha Mukherjee traces the history of oncology. He observes that while cancer has posed a challenge to physicians since ancient Egypt, breakthroughs came first in the study of leukemia, or cancer of the blood. Why? Because an abnormal white blood cell count shows up in a blood test, which makes the disease easy to measure.
Many of us in data science spend a great deal of our time thinking about how to get more value out of our data, but we underestimate the importance of starting with good measurements in the first place. Machine learning and computer science are powerful tools, letting us extract hidden insight from seemingly innocuous rows and columns, but no amount of fancy mathematics can reveal secrets that were never there to begin with.
Data analysis is important, but data quality is even more so. This post draws on elements of information theory, the mid-twentieth-century mathematical field that laid the foundation for telecommunications, to illustrate how measurement, as much as analysis, determines how much insight we'll be able to discover.
Information Theory Overview
To show this, we will begin by discussing the data processing inequality, a basic result of information theory.
Say you have three variables, A, B, and C, existing in what is known as a first-order Markov relationship, such that:
A → B → C
This means that A yields B, which then yields C, such that C is connected to A only through B. Then it can be shown that the mutual information (the amount one variable tells you about another) between A and C can never be more than the mutual information between A and B:
I(A; C) ≤ I(A; B)
The proof is only a few lines long, and can be found, among other places, in Elements of Information Theory, a standard introductory text by Thomas Cover and Joy Thomas. The term “data processing inequality” comes from the idea that in practice, these three variables represent specific pieces of the data analysis pipeline:
A (the world) → B (our data) → C (our analysis)
So we can think of A as representing the world: the thing that we care about, but can’t access directly. B represents our data, or a particular component of the world that we are gathering more information about. C represents the results of our analysis, or the outputs of our machine learning models. The key insight of the data processing inequality is that no amount of sophisticated data modeling can extract insights about the world that were never in the data in the first place. Put another way, the quality of your data puts an upper limit on the quality of your analysis.
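To make the inequality concrete, here is a minimal Python sketch that computes the mutual information between A and B, and between A and C, for a tiny Markov chain. The probabilities are hypothetical numbers chosen only for illustration: a two-state world, a measurement that is wrong 10% of the time, and an analysis step that loses a little more. The helper name `mutual_information` is my own, not part of any library.

```python
from math import log2

def mutual_information(joint):
    """Mutual information in bits for a joint probability table (list of rows)."""
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # a zero-probability cell contributes nothing
                total += p * log2(p / (row_marginal[i] * col_marginal[j]))
    return total

# Hypothetical distributions, for illustration only.
p_a = [0.5, 0.5]                        # the world: two equally likely states
b_given_a = [[0.9, 0.1], [0.1, 0.9]]    # measurement: 10% chance of a wrong reading
c_given_b = [[0.8, 0.2], [0.2, 0.8]]    # analysis: a further lossy transformation

# P(A, B) = P(A) * P(B | A)
joint_ab = [[p_a[a] * b_given_a[a][b] for b in range(2)] for a in range(2)]

# P(A, C) = sum over b of P(A, b) * P(C | b); valid because C depends on A only through B.
joint_ac = [
    [sum(joint_ab[a][b] * c_given_b[b][c] for b in range(2)) for c in range(2)]
    for a in range(2)
]

i_ab = mutual_information(joint_ab)
i_ac = mutual_information(joint_ac)
print(f"I(A;B) = {i_ab:.3f} bits, I(A;C) = {i_ac:.3f} bits")
```

Whatever values you substitute for `c_given_b`, the second number should never exceed the first: the analysis step can preserve what the measurement captured about the world, but it cannot add to it.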
This result has implications for data scientists on a number of fronts. First, it helps us to understand the importance of investing in better measurement. Better instrumentation, more sophisticated feedback collection, and low-noise communication channels make the downstream work of doing analysis much easier. The basic insights will be easier to find, and more nuanced relationships will be possible to discern.
Second, it makes clear that machine learning and statistics do not create information, but rather transform, or process, the existing information into a more actionable form. Consider the simple case of taking an average of ten numbers: those ten numbers collectively contain more information than the single average, but they are more unwieldy. With an average, it becomes easy to make decisions: if the average is lower or higher than expected, we can take decisive action. The principled view is that the work of data scientists is largely about transformation of information, not its creation.
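The averaging example can be quantified with a small sketch. It assumes, purely for illustration, that the ten numbers are independent fair coin flips (0 or 1), which is my simplification and not something the example above requires. Under that assumption the raw values carry 10 bits, while their average depends only on how many ones there are, so its entropy is that of a Binomial(10, 0.5) count.

```python
from math import comb, log2

n = 10  # ten independent fair binary values (an assumption made for this sketch)

# Each fair binary value carries exactly 1 bit, so the raw data holds n bits.
bits_in_raw_values = float(n)

# The average is determined by the number of ones, which is Binomial(n, 0.5).
pmf = [comb(n, k) / 2**n for k in range(n + 1)]
bits_in_average = -sum(p * log2(p) for p in pmf)

print(f"raw values: {bits_in_raw_values:.1f} bits, average: {bits_in_average:.2f} bits")
```

The average comes out at roughly 2.7 bits, well under the 10 bits in the raw values. Nothing was created by averaging; most of the detail was discarded in exchange for a single number that is easy to act on.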
Measuring Data at Foursquare
As part of Foursquare’s data science team, I work on transforming noisy phone signals into an accurate understanding of what types of places people like to go. A big part of that challenge is understanding the nature and accuracy of the phone signals.
To understand these signals, Foursquare engineers developed a data-gathering tool which allowed us to collect precise location trails. Using this app, I conducted a set of controlled experiments where I measured the GPS and Wi-Fi signals coming from multiple devices at various locations both in the Foursquare office and in the surrounding areas of New York City. By collecting data in a controlled and regular manner, I was able to detect subtle differences in the signals under different conditions, and help the team better understand in what circumstances these signals would be reliable, and when they would not. For example, I provided insight as to whether collecting location signals in high-accuracy, high-power mode provides a meaningful improvement over a lower-accuracy but more energy-efficient mode. This type of understanding is crucial in guiding data and engineering decisions, such as balancing the trade-off between gathering more data and minimizing battery drain.
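As a rough sketch of what comparing two collection modes can look like, the snippet below measures how far reported fixes fall from a known reference point and summarizes each mode by its median error. This is not Foursquare's tooling or method. The coordinates are made-up placeholder values, and the mode names `high_accuracy` and `low_power` and the helper functions are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt
from statistics import median

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    h = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(h))

def median_error_m(reference, fixes):
    """Median distance in metres from the reference point to each reported fix."""
    return median(haversine_m(*reference, lat, lon) for lat, lon in fixes)

# Placeholder values for illustration only; not real measurements.
reference = (40.72400, -73.99700)  # a point whose true position is known
fixes_by_mode = {
    "high_accuracy": [(40.72402, -73.99703), (40.72398, -73.99699), (40.72401, -73.99696)],
    "low_power": [(40.72430, -73.99750), (40.72370, -73.99660), (40.72445, -73.99690)],
}

errors = {mode: median_error_m(reference, fixes) for mode, fixes in fixes_by_mode.items()}
for mode, err in errors.items():
    print(f"{mode}: median error {err:.1f} m")
```

With real trails, the same summary could be broken out by device, location, or time of day, and weighed against the battery cost of each mode.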
Data Quality at the Enterprise Level Today
Often, at the enterprise level, there are not enough resources to prioritize thorough investigations into data quality. In the short run, this can be prudent. After all, decisions need to be made, and the basic work of measurement improvement is often long term and uncertain. In the long run, however, research into basic measurement and data quality will have a major impact on the data analysis itself, and so proper investment in quality data needs to be made.