Data Lakes: Examining the End-to-End Process

March 31, 2020 | 5 minute read
Michael Chen
Senior Manager

A good way to think of a data lake is as the ultimate hub for your organization's data. On the most basic level, it takes data in from various sources and makes it available for users to query. But much more goes on during the end-to-end process involving a data lake. To get a clearer understanding of how it all comes together, and a bird's-eye view of what it can do for your organization, let's look at each step in depth.


Step 1: Identify and connect sources

Unlike data warehouses, data lakes can take inputs from nearly any type of source. Structured, unstructured, and semi-structured data can all coexist in a data lake. The primary goal is to let all of the data exist in a single repository in its raw format. A data warehouse specializes in housing processed and prepared data, and while that is certainly helpful in many instances, it still leaves many types of data out of the equation. By unifying these disparate sources into a single repository, a data lake gives users access to all types of data without the logistical legwork of connecting to individual data warehouses.

Step 2: Ingest data into zones

If a data lake is set up per best practices, incoming data will not just get dumped into a single data swamp. Instead, since the data sources are known quantities, it is possible to establish landing zones for datasets from particular sources. For example, if you know that a dataset contains sensitive financial information, it can immediately go into a zone that limits access by user role and applies additional security measures. If the data comes in a set format ready for use by a certain user group (for example, the data scientists in HR), it can immediately go into a zone defined for that group. And if another dataset delivers raw data with minimal metadata to identify it on a database level (like a stream of images), it can go into its own zone of raw data, essentially setting that group aside for further processing.

Establishing this zone sorting right away allows the first broad strokes of organization to be completed without any manual intervention. There are still more steps to go to optimize discoverability and readiness, but this automates the first big step. Per our blog post 6 Ways To Improve Data Lake Security, these are the recommended zones to establish in a data lake:

Temporal: Where ephemeral data such as copies and streaming spools live prior to deletion.

Raw: Where raw data lives prior to processing. Data in this zone may also be further encrypted if it contains sensitive material.

Trusted: Where data that has been validated as trustworthy lives for easy access by data scientists, analysts, and other end users.

Refined: Where enriched and manipulated data lives, often as final outputs from tools.
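
As a rough illustration of the zone sorting described above, the sketch below routes a dataset to one of the four zones based on what is already known about its source. The SourceProfile fields and the landing_zone function are hypothetical names made up for this example, not part of any particular data lake product, and a real setup would likely use richer rules.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    TEMPORAL = "temporal"
    RAW = "raw"
    TRUSTED = "trusted"
    REFINED = "refined"


@dataclass(frozen=True)
class SourceProfile:
    """What is known in advance about a source (all fields are hypothetical)."""
    name: str
    ephemeral: bool = False    # copies and streaming spools awaiting deletion
    validated: bool = False    # arrives in a set format that is already trusted
    tool_output: bool = False  # enriched or manipulated output from a tool
    sensitive: bool = False    # for example, financial information


def landing_zone(source: SourceProfile) -> Zone:
    """Pick a landing zone from the source profile; anything unknown stays raw."""
    if source.ephemeral:
        return Zone.TEMPORAL
    if source.tool_output:
        return Zone.REFINED
    if source.validated:
        return Zone.TRUSTED
    return Zone.RAW


def needs_restricted_access(source: SourceProfile) -> bool:
    """Sensitive sources get role-limited access regardless of zone."""
    return source.sensitive


image_stream = SourceProfile(name="image_stream")
finance_feed = SourceProfile(name="finance_feed", sensitive=True)
hr_extract = SourceProfile(name="hr_extract", validated=True)

print(landing_zone(image_stream).value)  # raw
print(landing_zone(hr_extract).value)    # trusted
print(landing_zone(finance_feed).value, needs_restricted_access(finance_feed))  # raw True
```

Because the rules depend only on the source profile, the sorting can run on arrival without anyone inspecting individual files.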

Step 3: Apply security measures

Data arrives in a data lake completely raw. That means any inherent security risk in the source data comes along for the ride when it lands in the data lake. If there’s a CSV file with fields containing sensitive data, it will remain that way until security steps have been applied. If step 2 has been established as an automated process, the initial sorting will get you halfway to a secure configuration.

Other measures to consider include:

  • Clear user-based access defined by roles, needs, and organization.
  • Encryption based on a big-picture assessment of compatibility within your existing infrastructure.
  • Scrubbing the data for red flags, such as known malware issues or suspicious file names and formats (for example, an executable file living in a dataset that otherwise contains media files). Machine learning can significantly speed up this process.

Running all incoming data through a standardized security process ensures consistency among protocols and execution; if automation is involved, this also helps to maximize efficiency. The result? The highest levels of confidence that your data will go only to the users that should see it.
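
To make the red-flag scrubbing above a little more concrete, here is a minimal sketch that checks file names in a dataset against the formats that dataset is expected to contain. The flag_suspicious function and the list of executable suffixes are assumptions for illustration only. Real scanning would also inspect file contents, since a file name alone is easy to fake.

```python
from pathlib import PurePosixPath

# Illustrative, deliberately incomplete set of executable suffixes (an assumption).
EXECUTABLE_SUFFIXES = {".exe", ".dll", ".bat", ".sh", ".msi"}


def flag_suspicious(filenames, expected_suffixes):
    """Return (filename, reason) pairs for files that do not fit the dataset."""
    expected = {suffix.lower() for suffix in expected_suffixes}
    flagged = []
    for name in filenames:
        suffix = PurePosixPath(name).suffix.lower()
        if suffix in EXECUTABLE_SUFFIXES:
            flagged.append((name, "executable"))
        elif suffix not in expected:
            flagged.append((name, "unexpected format"))
    return flagged


media_dataset = ["intro.mp3", "clip.MP4", "setup.exe", "notes.txt"]
for name, reason in flag_suspicious(media_dataset, {".mp3", ".mp4"}):
    print(f"{name}: {reason}")
# setup.exe: executable
# notes.txt: unexpected format
```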

Step 4: Apply metadata

Once the data is secure, it’s safe for users to access it. But how will they find it? Discoverability is only possible when the data is properly organized and tagged with metadata. Unfortunately, since data lakes take in raw data, data can arrive with nothing but a filename, format, and time stamp. So what can you do with this?

A data catalog is a tool that works with data lakes to optimize discovery. By making it possible to apply more metadata, it lets data be organized and labeled accurately and effectively. In addition, if machine learning is used, the data catalog can begin recognizing patterns and habits and label data automatically. For example, let’s assume a data source is consistently sending MP3 files of various lengths, but the ones over twenty minutes are always given the metatag “podcast” after arriving in the data lake. Machine learning will pick up on that pattern and then start auto-tagging that group with “podcast” upon arrival.

Given that the volume of big data keeps growing, and that more and more sources of unstructured data are feeding data lakes, that type of pattern learning and automation can make a huge difference in efficiency.
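
The podcast example above can be sketched in a few lines. This is a simple stand-in for the pattern learning a data catalog might do, not how any specific catalog works: it looks at how past files were tagged by hand, finds a duration cutoff that cleanly separates tagged from untagged files, and applies that cutoff to new arrivals. The function names and the sample history are hypothetical.

```python
def learn_duration_threshold(history, tag):
    """Learn a duration cutoff (in minutes) from past manual tagging.

    history is a list of (duration_minutes, tags) pairs. Returns None when
    the tag is not cleanly separated by duration, so nothing is auto-tagged.
    """
    tagged = [duration for duration, tags in history if tag in tags]
    untagged = [duration for duration, tags in history if tag not in tags]
    if not tagged or not untagged:
        return None
    shortest_tagged = min(tagged)
    longest_untagged = max(untagged)
    if longest_untagged >= shortest_tagged:
        return None  # the pattern is not consistent enough to automate
    return (longest_untagged + shortest_tagged) / 2


def auto_tag(duration_minutes, threshold, tag):
    """Return the tags to apply to a newly arrived file."""
    if threshold is not None and duration_minutes > threshold:
        return {tag}
    return set()


# Hypothetical history of MP3 files and the tags users gave them.
history = [
    (3.5, set()), (4.0, set()), (18.0, set()),
    (22.0, {"podcast"}), (45.0, {"podcast"}), (61.5, {"podcast"}),
]

threshold = learn_duration_threshold(history, "podcast")
print(threshold)                              # 20.0
print(auto_tag(35.0, threshold, "podcast"))   # {'podcast'}
print(auto_tag(5.0, threshold, "podcast"))    # set()
```

Returning None when the history is inconsistent keeps the automation conservative: files are only auto-tagged when the manual tagging shows a clear pattern.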

Step 5: User discovery

Once data is sorted, it’s ready for users to discover. With all of those data sources consolidated into a single data lake, discovery is easier than ever before. If tools like analytics exist outside of the data lake’s infrastructure, then there’s only one export/import step that needs to take place for the data to be used. In a best-case scenario, those tools are integrated into the data lake, allowing for real-time queries against the absolute latest data, all without any manual intervention.

Why is this so important? A recent survey showed that, on average, five data sources are consulted before making a decision. Consider the inefficiency if each source has to be queried and called manually. Putting it all in a single accessible data lake and integrating tools for real-time data querying removes numerous steps so that discovery can be as easy as a few clicks.
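
As a small sketch of what single-stop discovery could look like, the example below searches one in-memory catalog by tag while respecting the role-based access from step 3. The CatalogEntry fields, the discover function, and the sample paths are all hypothetical. A real data lake would expose this through its catalog or query tooling.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One dataset in the lake, as a user-facing catalog might describe it."""
    path: str
    zone: str
    tags: set = field(default_factory=set)
    allowed_roles: set = field(default_factory=set)


def discover(catalog, tag, role):
    """Return paths of datasets carrying the tag that this role may access."""
    return [
        entry.path
        for entry in catalog
        if tag in entry.tags and role in entry.allowed_roles
    ]


# Hypothetical catalog contents.
catalog = [
    CatalogEntry("trusted/audio/episodes", "trusted", {"podcast", "audio"}, {"analyst"}),
    CatalogEntry("raw/audio/uploads", "raw", {"audio"}, {"engineer"}),
    CatalogEntry("trusted/finance/ledger", "trusted", {"finance"}, {"finance_analyst"}),
]

print(discover(catalog, "audio", "analyst"))    # ['trusted/audio/episodes']
print(discover(catalog, "finance", "analyst"))  # []
```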

The Hidden Benefits of a Data Lake

The above details break down the end-to-end process of a data lake—and the resulting benefits go beyond saving time and money. By opening up more data to users and removing numerous access and workflow hurdles, users have the flexibility to try new perspectives, experiment with data, and look for other results. All of this leads to previously impossible insights, which can drive an organization’s innovation in new and unpredictable ways.

To learn more about how to get started with data lakes, check out Oracle Big Data Service—and don't forget to subscribe to the Oracle Big Data blog to get the latest posts sent to your inbox.
