Introducing data lineage: Where does the data come from?

September 19, 2023 | 4 minute read
Rashmi Badan
Product Manager

Before you make an important decision based on your analytics, don’t you want to be sure of your sources?

We’re excited to announce data lineage in the latest release of Oracle Cloud Infrastructure (OCI) Data Catalog, which also includes other enhancements. OCI Data Catalog is a metadata management service that helps data professionals discover data and support data governance. With the new data lineage capability, you can now view the lineage of discovered data.

Why do you need data lineage?

Data consumers, such as data analysts, business analysts, and data scientists, usually work with data ingested from various sources and processed by various systems. They need to understand whether the data comes from a trusted source, which systems it flows through, and how it changes in the data pipelines. Knowing these details not only increases their confidence in the outcomes derived from the data but also makes data issues easier to troubleshoot, because the data is more traceable. Data engineers who create, modify, and maintain data pipelines also want to understand how changes to data affect downstream processes and applications so that they can proactively notify the right teams about upcoming changes.

Enter data lineage in OCI Data Catalog

OCI data lineage provides a graphical representation of the end-to-end journey of data from the data source through the processing systems to the final target. Data lineage also helps you troubleshoot and debug elusive data issues hidden in complex pipelines. With OCI data lineage, you can view the lineage of data processed by applications in the OCI Data Integration workspaces in your OCI tenancy.
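To make the idea of an end-to-end journey concrete, here is a minimal sketch of lineage as a directed graph, written in plain Python. It is not the Data Catalog API or its internal model: the `LineageGraph` class, its methods, and all entity and task names are hypothetical and exist only for illustration. Walking the graph backward answers "where does this data come from?", and walking it forward answers "what is affected if this changes?".

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph (hypothetical, not the OCI Data Catalog model)."""

    def __init__(self):
        # Keep both directions so upstream and downstream walks are equally cheap.
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_flow(self, source, target):
        """Record that data flows from `source` into `target`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        # Breadth-first traversal that collects every node reachable from `start`.
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)  # only relevant if the graph contains a cycle
        return seen

    def upstream(self, entity):
        """Everything `entity` is derived from (its lineage)."""
        return self._walk(entity, self._upstream)

    def downstream(self, entity):
        """Everything that depends on `entity` (its impact)."""
        return self._walk(entity, self._downstream)


# Hypothetical pipeline: two sources feed a summary table through two tasks.
graph = LineageGraph()
graph.add_flow("raw_orders.csv", "task:clean_orders")
graph.add_flow("task:clean_orders", "orders_clean")
graph.add_flow("orders_clean", "task:aggregate_sales")
graph.add_flow("customers", "task:aggregate_sales")
graph.add_flow("task:aggregate_sales", "sales_summary")

print(sorted(graph.upstream("sales_summary")))     # lineage of the final target
print(sorted(graph.downstream("raw_orders.csv")))  # impact of changing a source
```

Storing edges in both directions is a design choice for the sketch: it lets the same traversal serve both the lineage question and the impact question.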

OCI Data Catalog has the advantage of being the single pane of glass for metadata across your OCI and Oracle ecosystem. It helps data consumers understand the technical, business, and operational metadata of the various data systems, promoting data literacy within your organization. It’s the ideal place for your users to discover the right data, and with the data lineage feature, it now also lets you view the lineage and impact of that data.

Unified with OCI Data Integration

In OCI, many data lake use cases involve ingesting data into OCI Object Storage using the OCI Data Integration service, the OCI-native extract, transform, and load (ETL) offering. Data Catalog seamlessly integrates with Data Integration to provide lineage for data ingested and processed in the applications and pipelines across different Data Integration workspaces.

Alongside the Summary and Attributes tabs, a new Lineage tab shows entity-level and attribute-level lineage. High-level details of the OCI Data Integration tasks involved in data processing are also shown. You can inspect the details of each data transformation in the OCI Data Integration console by following the link.

A graphic depicting a data lineage graph.
Figure 1: Data lineage graph

To access the lineage of data entities processed by Data Integration applications, complete the following steps:

  • Configure the applications in your OCI Data Integration workspace to generate lineage data and publish it to Data Catalog.

  • Create a data asset for the OCI Data Integration workspaces in the catalog to fetch lineage data from them.

A screenshot of the create application and create data asset windows showing the configurations for data lineage.
Figure 2: Configurations for data lineage

After the setup shown in Figure 2, Data Catalog is equipped to show lineage for data processed in the configured Data Integration applications.

You can view details of other entities in the lineage graph only if you have the required permission to view the containing data asset in the catalog. Identity and Access Management (IAM) policies set for access to data assets in the catalog are also honored in the lineage graph.
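The permission rule above can be sketched as a simple filter over lineage nodes. This is an illustration only, not how Data Catalog or IAM evaluates policies: `LineageNode`, `lineage_view`, and the asset names are hypothetical. It also assumes that an entity the user cannot access still appears in the graph by name with its details withheld, which the text above implies but does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class LineageNode:
    """Hypothetical lineage node; not an OCI Data Catalog type."""
    name: str
    data_asset: str                      # the data asset that contains this entity
    details: dict = field(default_factory=dict)


def lineage_view(nodes, viewable_assets):
    """Return what a user sees for each node.

    `viewable_assets` stands in for the set of data assets the user's
    policies allow them to view. Details are exposed only for entities
    inside those assets; for all others they are withheld (assumption).
    """
    view = []
    for node in nodes:
        allowed = node.data_asset in viewable_assets
        view.append({
            "name": node.name,
            "details": node.details if allowed else None,
        })
    return view


# Hypothetical nodes from two data assets; the user may view only one of them.
nodes = [
    LineageNode("orders_clean", "lake_asset", {"columns": ["id", "amount"]}),
    LineageNode("hr_salaries", "hr_asset", {"columns": ["employee", "salary"]}),
]

for entry in lineage_view(nodes, viewable_assets={"lake_asset"}):
    print(entry)
```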

Other enhancements

Another update to the Data Catalog service is an enhanced glossary export to an Excel file that supports large glossaries, which can contain tens of thousands of terms across several categories and subcategories. A dedicated asynchronous job in the catalog now handles the large export, so users can track any errors during export in the corresponding job logs.

Summary

OCI Data Catalog gives you an easy way to view the lineage of data processed by OCI Data Integration applications. Knowing the source of data builds trust in it and helps you assess the impact of any changes to data pipelines, so that you can take preemptive action to accommodate those changes.

Try this feature in your own Oracle Cloud Infrastructure Data Catalog instance now!
