It’s no secret that big data gets bigger with each passing year; research consistently shows exponential growth in the amount of data the world generates. That growth creates a storage problem. If all those bits and bytes are being transmitted and you need access to them to analyze and derive insights via business intelligence, then the next logical step is a data lake.
But what happens once all of that data is sitting in the data lake? Finding anything specific within such a large repository can be unwieldy. As the world’s devices generate ever more data, the lake only grows wider and deeper with each passing day. Collecting data into a repository is key to using it, but the information also needs to be cataloged and accessible for it to actually be usable. The sensible solution, then, is to implement a data catalog.
Before getting into why a data catalog is so useful in this situation, it’s important to grasp the concept of a data lake. In layman’s terms, a data lake is a repository that stores data exactly the way it comes in. If it’s a structured dataset, the lake preserves that structure without adding any further indexing or metadata. If it’s unstructured data (for example, social media posts, images, or MP3 files), it lands in the lake as is, in whatever its native format might be. Data lakes can take input from multiple sources, making them a practical single collection point for an organization. To extend the lake metaphor, think of each data source as a stream or river, all of them feeding into the data lake, where raw, unfiltered datasets sit alongside curated, enterprise-certified ones.
Collecting data is only half of the equation, however. A repository is only useful if its data can be called up for analysis. In a data lake, data remains in its raw format until it is read; at that point, a schema is applied for processing (schema on read), letting analysts and data scientists pick and choose what they work with and how they work with it.
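To make schema on read concrete, here is a minimal sketch in plain Python. The records and field names are hypothetical; the point is that the raw data is stored untouched, and each reader applies only the schema it needs at read time.

```python
import json

# Raw events land in the lake exactly as they arrive -- no schema enforced on write.
raw_records = [
    '{"device": "doorbell-17", "event": "motion", "ts": "2021-06-01T08:30:00Z"}',
    '{"device": "doorbell-17", "battery": 0.84, "ts": "2021-06-01T09:00:00Z"}',
]

def read_with_schema(lines, fields):
    """Apply a schema only at read time: keep just the fields this
    analysis cares about, tolerating records that lack them."""
    for line in lines:
        record = json.loads(line)
        yield {f: record.get(f) for f in fields}

# Two analysts can read the same raw data through two different schemas.
motion_view = list(read_with_schema(raw_records, ["device", "event", "ts"]))
battery_view = list(read_with_schema(raw_records, ["device", "battery"]))
```

Note that neither view changes what is stored in the lake; the same raw lines serve both analyses.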
This call-and-response flow is simple, but one element is missing: search. A data lake requires data governance. Without organization, searching for data is a chaotic, inefficient, and time-consuming process. And if too much time passes without clear organization and governance, a data lake may collapse under the weight of its own accumulated data.
Enter the data catalog.
A data catalog is exactly what it sounds like: a catalog of all the big data in a data lake. Applying metadata to everything in the lake makes data discovery and governance far easier, and layering a hierarchical logic on incoming data gives datasets the context and trackable lineage they need to be used efficiently in workflows.
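A minimal sketch of the idea, using hypothetical names throughout: each dataset keeps its native format in the lake, and the catalog only attaches metadata (date received, source, topic tags, lineage) alongside it so datasets can be found by what they are about rather than where they sit.

```python
from dataclasses import dataclass, field

@dataclass
class CatalogEntry:
    path: str        # where the raw data sits in the lake
    fmt: str         # native format, left unchanged ("parquet", "json", "mp3", ...)
    received: str    # date received
    source: str      # sender / producing system
    tags: set = field(default_factory=set)        # general topic and other labels
    lineage: list = field(default_factory=list)   # upstream datasets it derives from

class DataCatalog:
    def __init__(self):
        self.entries = []

    def register(self, entry):
        self.entries.append(entry)

    def search(self, tag):
        """Data discovery: find datasets by topic tag instead of crawling folders."""
        return [e for e in self.entries if tag in e.tags]

catalog = DataCatalog()
catalog.register(CatalogEntry("lake/iot/2021-06.parquet", "parquet",
                              "2021-06-01", "doorbell-fleet", {"iot", "motion"}))
catalog.register(CatalogEntry("lake/social/2021-06.json", "json",
                              "2021-06-02", "marketing", {"social", "motion"}))
```

Real data catalogs add far more (access control, lineage graphs, automated profiling), but the core mechanic is the same: metadata sits beside the data, never in place of it.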
Let’s use the analogy of a researcher’s library. In this library, the researcher gets structured data in the form of books with chapters, indices, and glossaries. The researcher also gets unstructured data in the form of notebooks with no real organization or delineation at all. A data catalog would take each of these items without changing its native format and apply a logical organization to it using metadata such as date received, sender, and general topic, all of which accelerate data discovery.
Given that most data lake situations lack a universal organizational tool, a true data catalog is an essential add-on. Without the level of organization of a data catalog, a data lake becomes a data swamp—and trying to pull data from a data swamp creates a process that is inefficient at best and a bottleneck at worst.
Let’s take a look at a data scientist’s workflow from two different perspectives: without a data catalog and with a data catalog. Our hypothetical case study involves a smart doorbell that provides a stream of device data. At the same time, the company tracks social media mentions from users who’ve had packages stolen, aiming to predict more accurately when thieves tend to strike.
Without a data catalog: In this example, a data lake has datasets streaming in from Internet of Things (IoT) devices along with social media posts collected by the marketing team. A data analyst wants to examine the impact of a specific feature’s usage on social media sharing. Remember, the data in a data lake remains raw and unprocessed. Here, data scientists have to pull device datasets from the time period of the feature’s launch, then examine the individual data tables. To cross-reference against social media, they have to pull all social media posts from that period, then filter by keyword to drill down to mentions of the feature. While all of this can be done using the data lake as a single source, it requires quite a bit of manual preparation work.
With a data catalog: As datasets come into the data lake, a data catalog’s machine learning capabilities recognize the IoT data and create a universal schema from the recognized fields. Users can still apply their own metadata to enhance discoverability. When data scientists want to pull their data, a search within the data catalog brings up relevant results associated with the feature and other targeted keywords, allowing for much quicker preparation and processing.
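The contrast between the two workflows can be sketched in a few lines. Everything here is hypothetical (the paths, the feature tag, the metadata shape); the point is that without a catalog every dataset from the period comes back and must be inspected by hand, while a tag attached on ingest narrows the search immediately.

```python
# Hypothetical lake contents: (path, metadata) pairs a catalog would maintain.
datasets = [
    ("lake/iot/june/week1.parquet", {"source": "iot",    "tags": {"night-vision"}}),
    ("lake/iot/june/week2.parquet", {"source": "iot",    "tags": {"two-way-audio"}}),
    ("lake/social/june/posts.json", {"source": "social", "tags": {"night-vision"}}),
]

def manual_pull(feature):
    """Without a catalog: no metadata to search, so every dataset from the
    period comes back and must be opened and filtered by hand."""
    return [path for path, _ in datasets]

def catalog_search(feature):
    """With a catalog: the feature tag was attached on ingest, so the
    search returns only the relevant datasets."""
    return [path for path, meta in datasets if feature in meta["tags"]]
```

The manual path scales with the size of the lake; the catalog path scales with the number of relevant results.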
This example illustrates the stark difference created by a data catalog. Without it, data scientists are essentially searching through folders without context—the information sought has to be already identified through some means such as data source, time range, and file type. In a small, controlled data environment with limited sources, this is workable. However, in a large repository featuring many sources and heavy collaboration, it quickly devolves into murky chaos.
A data catalog doesn’t completely automate everything: structured data can be processed largely automatically, while unstructured data still needs human input. Even there, built-in machine learning and artificial intelligence capabilities mean that if a data scientist manually processes data with consistent patterns, the catalog can begin to learn those patterns and offer first-cut recommendations to speed things up.
The volume of data flowing into repositories is only getting bigger with each passing day. To ensure efficiency and accuracy, some form of governance is necessary to create order out of the chaos; otherwise, a data lake quickly becomes a proverbial data swamp. Fortunately, a data catalog is a straightforward tool for achieving this, and by integrating one into a repository, organizations set themselves up for success now and prepare to scale toward a bigger-than-big-data future.
Need to know more about data lakes and data catalogs? Check out Oracle’s big data management products, and don’t forget to subscribe to the Oracle Big Data Blog to get the latest posts sent to your inbox.