Data preparation is the equivalent of mise en place, but for analytics projects.
Put simply, data preparation is the process of taking raw data and getting it ready for ingestion into an analytics platform. To reach that final stage, the data must be cleansed, formatted, and transformed into something digestible by analytics tools. These are broad strokes, and the actual process can include a wide range of steps, such as consolidating or separating fields and columns, changing formats, deleting unnecessary or junk data, and making corrections to data.
To get a better sense of what’s involved with data preparation, let’s take a closer look at what some of these individual steps involve:
Deleting Unnecessary Data: When juggling different datasets, it’s common for each source to include extra or unnecessary fields. To cleanly consolidate different datasets into a workable analytics platform, the fields must match up across sources, which means removing the extras and aligning what remains.
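As a minimal sketch of that alignment step, the snippet below drops any field that isn't in a shared schema before two sources are combined. The field names ("device_id", "status", and so on) are hypothetical:

```python
# Sketch: align two record sets to a shared set of fields before merging.
# All field names here are invented for illustration.

def align_fields(records, keep):
    """Drop any field not in `keep`, so every source shares one schema."""
    return [{k: v for k, v in row.items() if k in keep} for row in records]

source_a = [{"device_id": 1, "status": 0, "debug_blob": "0xFA"}]
source_b = [{"device_id": 2, "status": 3, "internal_rev": "B7"}]

keep = {"device_id", "status"}
combined = align_fields(source_a, keep) + align_fields(source_b, keep)
```

Once both sources expose the same fields, they can be appended into a single dataset without stray columns.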
Consolidating/Separating Fields: If you’ve got data coming in from several outside sources, it’s possible they format addresses differently. For example, one source might have a street address with both number and name in a single field, while another source might separate the two. Proper analysis requires consistency across fields, so the address data needs to be either consolidated or separated, depending on the format you standardize on.
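A small sketch of normalizing those two shapes into one, assuming a hypothetical single "street" field in one source and split "street_number"/"street_name" fields in the other:

```python
# Sketch: normalize address data that arrives in two different shapes.
# Field names and sample addresses are hypothetical.

def split_street(street):
    """'221B Baker Street' -> ('221B', 'Baker Street')."""
    number, _, name = street.partition(" ")
    return number, name

def normalize(row):
    if "street" in row:  # consolidated form: split it
        number, name = split_street(row["street"])
        return {"street_number": number, "street_name": name}
    # already-separated form: pass through
    return {"street_number": row["street_number"],
            "street_name": row["street_name"]}

rows = [{"street": "221B Baker Street"},
        {"street_number": "10", "street_name": "Downing Street"}]
normalized = [normalize(r) for r in rows]
```

Whether you split or merge, the point is that every row ends up with the same fields.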
Transforming Data: Let’s take the example of Internet-of-Things (IoT) devices. Perhaps you want to run an analysis of errors signaled cumulatively across devices. However, the error codes are reported as single digits, which are hard to interpret in real-world terms. Data transformation is the process of translating those error codes into descriptions and messages that can be much more easily deciphered.
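That translation can be as simple as a lookup table. The code-to-message mapping below is invented for illustration:

```python
# Sketch: translate numeric IoT error codes into readable descriptions.
# The code-to-message table is hypothetical.

ERROR_DESCRIPTIONS = {
    0: "nominal",
    1: "low battery",
    2: "sensor fault",
    3: "connection lost",
}

def describe(code):
    """Return a readable message, falling back for unmapped codes."""
    return ERROR_DESCRIPTIONS.get(code, f"unknown code {code}")

readings = [0, 2, 3, 7]
messages = [describe(c) for c in readings]
```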
Making Corrections to Data: Assume you’re consolidating datasets from three different IoT devices, all reporting with time stamps. You know that one particular dataset consistently has a time lag in all of its values. In order to accurately use that data, the lagging dataset’s timestamps must be corrected.
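A sketch of such a correction, assuming the lag is a known, fixed thirty minutes and the timestamps are Python datetime values (both assumptions for illustration):

```python
# Sketch: shift lagging timestamps back by a known, fixed offset to
# recover the true event time. The thirty-minute lag is an assumption.
from datetime import datetime, timedelta

LAG = timedelta(minutes=30)

def correct_lag(timestamps, lag=LAG):
    """Subtract the known broadcast lag from each reported timestamp."""
    return [t - lag for t in timestamps]

reported = [datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 15)]
corrected = correct_lag(reported)
```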
Changing Formats: Time stamps are a good example of having different formats across multiple sources. For example, North American dates usually use the format MM/DD/YY, while many other countries use DD/MM/YY and the ISO 8601 standard is YYYY-MM-DD. Transforming this data into one consistent format is necessary to create a single usable dataset.
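A minimal sketch of that conversion, assuming each source's input format is known up front, parsing two regional date styles into one ISO 8601 string:

```python
# Sketch: parse dates written in different regional formats into a
# single consistent ISO 8601 representation.
from datetime import datetime

def to_iso(date_str, fmt):
    """Parse `date_str` using the source's known format; emit YYYY-MM-DD."""
    return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")

north_america = to_iso("03/25/24", "%m/%d/%y")  # MM/DD/YY source
elsewhere = to_iso("25/03/24", "%d/%m/%y")      # DD/MM/YY source
```

Both calls yield the same ISO string, so the two sources can be merged on date without ambiguity.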
To get a sense of how data preparation works, let’s use an example with IoT devices similar to what was mentioned above. Let’s take an IoT monitoring device that pings out a status code of 0 for nominal function and error codes of 1 through 5 based on a particular issue. Let’s also assume there are multiple generations of devices sending pings throughout the world.
Datasets come in for generations of devices and regions. So North America has three datasets for three generations of devices, EMEA has three datasets for three generations of devices, and so on. And it’s known that the oldest generation has a broadcast lag of thirty minutes due to the inherent hardware. As a data scientist, you’ve been tasked with building data visualizations to show how often the error codes have occurred and whether they’re concentrated in any region in particular. But before you load this into an analytics platform, the data must be prepared with the following steps: correct the oldest generation’s timestamps for the known thirty-minute lag, standardize the timestamp formats used across regions, transform the numeric status codes into readable descriptions, and align the fields across datasets so they can be merged.
This allows the individual datasets to have consistent formatting and fields, thus enabling easy consolidation. The only issue, then, is the time it takes to prepare all of the individual data.
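Those steps can be sketched as one pipeline. The dataset shapes, field names, date formats, and the error-code table below are all hypothetical illustrations of the example, not a prescribed implementation:

```python
# Sketch: the preparation steps for the IoT example combined into one
# pipeline: parse the source's date format, correct any known lag,
# and translate status codes. All names and formats are assumptions.
from datetime import datetime, timedelta

CODES = {0: "nominal", 1: "overheat", 2: "low battery",
         3: "sensor fault", 4: "connection lost", 5: "firmware error"}

def prepare(dataset, date_fmt, lag=timedelta(0)):
    prepared = []
    for row in dataset:
        prepared.append({
            "region": row["region"],
            # one consistent ISO timestamp, corrected for broadcast lag
            "timestamp": (datetime.strptime(row["ts"], date_fmt) - lag)
                         .isoformat(),
            # readable description instead of a numeric code
            "status": CODES.get(row["code"], "unknown"),
        })
    return prepared

# Oldest generation (NA, MM/DD/YY dates, thirty-minute lag) plus a
# newer generation (EMEA, DD/MM/YY dates, no lag):
gen1_na = [{"region": "NA", "ts": "03/25/24 12:30", "code": 2}]
gen3_emea = [{"region": "EMEA", "ts": "25/03/24 12:00", "code": 0}]

combined = (prepare(gen1_na, "%m/%d/%y %H:%M", lag=timedelta(minutes=30))
            + prepare(gen3_emea, "%d/%m/%y %H:%M"))
```

After preparation, every row shares the same fields, timestamp format, and readable status labels, so the combined dataset can be loaded and visualized directly.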
The above example is fairly straightforward and limited in scope. Following the steps as outlined for a total of, say, 12 datasets might take up a whole day, but it’s manageable. If you expand the scope beyond that, or simply don’t have a day to lose, the good news is that machine learning can help automate data preparation. Machine learning recognizes patterns from repeated usage and can begin to clean up datasets as they arrive.
For example, the algorithm may detect that you’re constantly changing timestamps into North American format; in that case, it can apply that change automatically (or prompt you to approve it). Similarly, if you’re always supplementing coordinate information with a physical address or a city name, machine learning can integrate this as soon as the dataset arrives in the database. Because these patterns are simple and repeatable but labor-intensive due to the sheer volume of data, machine learning can save significant time and effort in the data preparation process. In addition, machine learning helps reduce accuracy issues caused by human error, such as typos or overlooked discrepancies.
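To make the idea concrete, here is a deliberately simple toy: a frequency heuristic that records which transformation a user applies to incoming datasets and suggests the most common past choice for the next one. Real products use far more sophisticated models; the class and transformation names here are invented:

```python
# Toy sketch of "learning from repeated usage": count past choices and
# suggest the most frequent one. A frequency heuristic, not real ML.
from collections import Counter

class PrepSuggester:
    def __init__(self):
        self.history = Counter()

    def record(self, transformation):
        """Remember that the user applied this transformation."""
        self.history[transformation] += 1

    def suggest(self):
        """Propose the transformation chosen most often so far."""
        if not self.history:
            return None
        return self.history.most_common(1)[0][0]

s = PrepSuggester()
for choice in ["to_na_date_format", "to_na_date_format", "drop_debug_field"]:
    s.record(choice)
suggestion = s.suggest()
```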
Simply put, data scientists really don’t like data preparation—a survey by CrowdFlower found that 76% of data scientists identified it as the least enjoyable part of the job. The good news is machine learning can significantly reduce that burden.
What does all that time and effort (either by human minds or machine learning) do for data scientists? It’s all about confidence—confidence in the data, confidence in the process, and confidence in the insights derived from the data. Data preparation ensures accuracy in the data, which leads to accurate insights. Without data preparation, it’s possible that insights will be off due to junk data, an overlooked calibration issue, or an easily fixed discrepancy between datasets.
Going back to the idea of cooking, how would chefs fare without mise en place? Perhaps they’d be left scrambling for the right tool while the meat is browning, thus causing it to burn. Or they’d discover halfway through the process that their bell peppers had gone bad, leaving the recipe ruined. Mise en place is necessary for smooth and successful cooking, and it’s the same thing when handling big data: data preparation is often the first—and most important—step of any analytics project.