What is Oracle Cloud Infrastructure Data Catalog?

November 10, 2021 | 7 minute read
Abhiram Gujjewar
Product Leader, Cloud Data Management

Oracle Cloud Infrastructure (OCI) Data Catalog is a metadata management service that helps data consumers discover data and improves governance in the Oracle ecosystem. Data Catalog is also a data asset inventory with business context and a unified metastore for the lakehouse, and it’s included at no extra cost with an OCI subscription.

Why do you need OCI Data Catalog?

Without a data catalog, finding trusted data for analytics can pose a big challenge for data consumers. They must rely on experienced power users and incomplete community knowledge to find data across silos and understand what it means and how to use it. Data providers, on the other hand, keep answering the same questions about data from an ever-increasing population of data consumers. They could capture this knowledge in web pages or documents, but that solution isn’t scalable or reliable. They need a collaborative environment to collect, organize, and search for data about data: metadata.

OCI Data Catalog saves time and effort for data providers and data consumers alike. Data analysts, data scientists, data engineers, and data stewards have a single self-service environment to discover data available in cloud or on-premises sources. Data providers can create a data dictionary comprising technical and business metadata. Data consumers use this holistic view to easily assess the suitability of data for analytics and data science projects. They can collaborate on data issues and increase the data literacy of the whole organization.

Hybrid cloud and multicloud environments create data fragmentation, which is a big challenge for data governance. With OCI Data Catalog, you have improved visibility into the overall data estate. Business context is available in the form of a business concept glossary and user annotations, which together form the foundation of data governance.

OCI Data Catalog is also the core metadata backbone of data lakehouses on OCI. For a lakehouse to be successful, data consumers need to understand and access the content of the data lake from different tools, such as query tools and Spark-based processing engines. With a central metastore to store shareable schemas, OCI Data Catalog accelerates consumption of data lakes by Oracle Autonomous Database service users and OCI Data Flow Spark users.

How does OCI Data Catalog work?

OCI Data Catalog takes technical and business metadata from various sources and data experts and connects them meaningfully to build a searchable data asset inventory. Data consumers then search the catalog to find and understand available data. This process involves the following major steps:

  • Technical metadata harvesting: This step collects information about data objects as stored in the sources, such as table names, column names, file names, data types, PK-FK constraints, and reports. This metadata is harvested from the actual data sources.
  • Business metadata enrichment: This step captures the extra business context about data and can be different for different objects. It includes the business concepts glossary, classifications, annotations, descriptions, owners, department, region, update frequency, ratings, comments, Q&A, and free-form tags. These items are defined as custom metadata properties and contributed by data experts.
  • Search and exploration: Data consumers use technical and business terms to search, filter, explore, and understand the available data.
  • Metastore for data lakes: OCI Data Flow users use the metastore in OCI Data Catalog as the central repository to store and retrieve metadata for databases, tables, and partitions, backed by data lake files maintained in OCI Object Storage.
  • Support for Autonomous Database integration with data lakes: OCI Data Catalog intelligently harvests Object Storage files as logical entities, similar to tables. Autonomous Database service syncs the metadata from logical entities to automatically create external tables to query the data.

What are the key capabilities of OCI Data Catalog?

Let’s look at some of the key capabilities of OCI Data Catalog.

Technical metadata harvesting

You can harvest technical metadata from various data sources, such as Object Storage (CSV, ORC, Avro, Parquet, JSON, and XLSX files) and Autonomous Database. Sources supported on OCI and on-premises include Oracle Database service, MySQL, Hive, Kafka, MS SQL Server, Azure SQL Database, IBM DB2, and PostgreSQL. For the complete list of sources, which continues to grow, see the documentation. You can harvest on demand or on a schedule. Learn more from the blog post, Harvest Metadata from On-Premises and Cloud Sources with a Data Catalog.

OCI Data Catalog can scan your tenancy to discover data sources for harvesting and then create data assets. This process ensures that you don’t miss anything worth harvesting and avoids any errors in configuration of data assets. Learn more in the documentation.

Metadata harvesting for data lakes

Without meaningful metadata, data lakes can become data swamps. For data lake files, OCI Data Catalog groups multiple files into logical entities based on file name patterns. These logical entities provide a comprehensible view into these files. Learn more from the blog post, Building Meaningful Catalogs For Data Lakes.
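To make the idea of grouping files by name pattern concrete, here is a minimal Python sketch. It is not how OCI Data Catalog is implemented or configured; the pattern names, regular expressions, and file names are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical filename patterns: each regex maps matching object names
# to one logical entity. None of these names come from OCI Data Catalog.
PATTERNS = {
    "sales": re.compile(r"^sales/sales_\d{4}-\d{2}\.csv$"),
    "customers": re.compile(r"^customers/part-\d+\.parquet$"),
}


def group_into_logical_entities(object_names, patterns=PATTERNS):
    """Group object names under the first pattern each one matches.

    Returns a tuple (entities, unmatched), where entities maps a logical
    entity name to the list of files that belong to it.
    """
    entities = defaultdict(list)
    unmatched = []
    for name in object_names:
        for entity, pattern in patterns.items():
            if pattern.match(name):
                entities[entity].append(name)
                break
        else:
            # No pattern matched, so the file stays outside any entity.
            unmatched.append(name)
    return dict(entities), unmatched


files = [
    "sales/sales_2021-09.csv",
    "sales/sales_2021-10.csv",
    "customers/part-0001.parquet",
    "README.txt",
]
entities, unmatched = group_into_logical_entities(files)
```

In this sketch, the two monthly sales files collapse into a single "sales" entity, which is the kind of table-like view that makes a folder of files easier to understand.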

Providing metadata for Autonomous Database integration with data lakes

The logical entities harvested by OCI Data Catalog represent a relational tabular structure on top of Object Storage files. The metadata from logical entities syncs with Autonomous Database to automatically create the external tables required to access that data. Without this integration, Autonomous Database users need to manually define the schema on data lake files using SQL scripts. That process is error-prone and creates bottlenecks. This integration removes those bottlenecks and accelerates consumption of the data lake. To learn more, see the blog post, Autonomous Database enhances data lake analytics.
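For a sense of the manual work that this integration removes, the following Python sketch assembles the kind of DBMS_CLOUD.CREATE_EXTERNAL_TABLE call a user might otherwise write by hand for each set of files. It illustrates the manual approach only, not what the sync does internally. The table name, credential name, column definitions, and URI placeholders are hypothetical, and the function does no escaping, so treat it as an illustration rather than production code.

```python
def build_external_table_call(table_name, credential_name, file_uri,
                              columns, file_type="csv"):
    """Return a PL/SQL block that defines an external table by hand.

    columns is a list of (column_name, sql_type) pairs. Values are
    interpolated without escaping, which is acceptable only in a sketch.
    """
    column_list = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        "BEGIN\n"
        "  DBMS_CLOUD.CREATE_EXTERNAL_TABLE(\n"
        f"    table_name      => '{table_name}',\n"
        f"    credential_name => '{credential_name}',\n"
        f"    file_uri_list   => '{file_uri}',\n"
        f"    format          => json_object('type' value '{file_type}'),\n"
        f"    column_list     => '{column_list}'\n"
        "  );\n"
        "END;\n"
        "/"
    )


# All values below are hypothetical placeholders.
stmt = build_external_table_call(
    table_name="SALES_EXT",
    credential_name="MY_CRED",
    file_uri="https://objectstorage.<region>.oraclecloud.com/n/<namespace>/b/<bucket>/o/sales/*.csv",
    columns=[("ORDER_ID", "NUMBER"), ("REGION", "VARCHAR2(30)")],
)
print(stmt)
```

Every column name and type in such a script has to be kept in step with the files by hand, which is where the errors and bottlenecks described above tend to come from.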

Business metadata curation and enrichment

After the technical metadata is harvested, subject matter experts contribute business metadata as enrichments. This business context might be even more important than the technical metadata. OCI Data Catalog helps you capture business context with the following features.

Custom properties

Custom properties are user-defined metadata properties with predefined business meaning. Data experts populate these values for each object to capture business context, classifications, and other information. You can also use them in search, filter, and sort.

If done manually, metadata enrichment and glossary term creation can be a lot of effort. To ease that trouble, OCI Data Catalog also provides import and export capabilities using MS Excel for bulk updates. To learn more, see Enrich Metadata with Oracle Cloud Infrastructure Data Catalog.

Business glossaries

One of the first steps toward effective data governance is establishing a common understanding of business concepts and capturing their relationships to the data assets. OCI Data Catalog provides a managed business glossary to support this idea. The glossary allows data stewards to collaboratively define business terms in rich text form, categorize them, and build a hierarchy. They can create parent-child relationships between terms to build a taxonomy and set business term owners and approval status. Most importantly, they can also see all the linked objects in one place for each category and term. To learn more, see Managing a Business Glossary in the documentation.

You can also import or export glossary content using MS Excel files. After creation, you can then link these terms to technical assets to provide business meaning and use them for searching. These links to business glossary terms and categories tell you what business concept a technical object refers to. For example, a table called “cust_details” can link to the term, “customer.”

But manually linking one term to one object at a time can be time-consuming. To assist with this linking, OCI Data Catalog provides AI and ML-based recommendations. To learn more, see Using AI and ML to enrich metadata in Oracle Cloud Infrastructure Data Catalog.
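The recommendation models that OCI Data Catalog uses are not described here. Purely to illustrate what suggesting a glossary term for a technical object can look like, the sketch below substitutes a simple name-similarity heuristic built on Python's standard difflib module. The function name, the threshold, and the glossary terms are assumptions made for this example.

```python
from difflib import SequenceMatcher


def recommend_terms(object_name, glossary_terms, threshold=0.6):
    """Suggest glossary terms whose names resemble part of an object name.

    A naive stand-in for ML-based recommendations: the object name is
    split on underscores, and each piece is compared with each term.
    Returns (term, score) pairs sorted by descending score.
    """
    tokens = [t for t in object_name.lower().split("_") if t]
    scored = []
    for term in glossary_terms:
        # Keep the best similarity between the term and any name token.
        score = max(
            (SequenceMatcher(None, token, term.lower()).ratio()
             for token in tokens),
            default=0.0,
        )
        if score >= threshold:
            scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical glossary terms, reusing the table name from the example above.
recommendations = recommend_terms(
    "cust_details", ["Customer", "Order", "Product"]
)
```

With these inputs, only "Customer" clears the threshold, because the token "cust" is a close match for it. A real recommender would draw on much richer signals than names alone.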

Free-form tags

Free-form tags are user-defined annotations that capture business context without specific predefined meaning. These quick annotations lack the overhead of defining custom properties or glossaries. For more information, see Using Tags in the documentation.

Searchable data asset inventory

With this rich knowledge base in place, data consumers can use powerful searches to find data. They can search by technical information, tags, custom property values, or business terms. They can also browse metadata based on the technical hierarchy of data assets, entities, and attributes. When they find suitable data and its location, they can start using it in analytics, data science, and data engineering projects. For more information, see Searching and Exploring.
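As a rough mental model of how technical names, tags, glossary terms, and custom properties combine in a search, consider the small in-memory sketch below. It does not use the Data Catalog search API; the CatalogObject class, its fields, and the sample objects are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogObject:
    """A hypothetical, simplified record in a data asset inventory."""
    name: str                                        # technical name
    path: str                                        # asset/entity hierarchy
    tags: set = field(default_factory=set)           # free-form tags
    terms: set = field(default_factory=set)          # linked glossary terms
    properties: dict = field(default_factory=dict)   # custom properties


def search(inventory, text=None, tag=None, term=None, **props):
    """Return the objects that satisfy every filter that was supplied."""
    results = []
    for obj in inventory:
        if text and text.lower() not in obj.name.lower():
            continue
        if tag and tag not in obj.tags:
            continue
        if term and term not in obj.terms:
            continue
        if any(obj.properties.get(key) != value for key, value in props.items()):
            continue
        results.append(obj)
    return results


inventory = [
    CatalogObject("cust_details", "crm_db/sales", {"pii"}, {"Customer"},
                  {"region": "EMEA"}),
    CatalogObject("order_lines", "crm_db/sales", set(), {"Order"},
                  {"region": "EMEA"}),
    CatalogObject("cust_archive", "lake/raw", {"pii"}, {"Customer"},
                  {"region": "APAC"}),
]

by_term = search(inventory, term="Customer")
by_term_and_region = search(inventory, term="Customer", region="EMEA")
```

Combining a business term with a custom property value narrows the three objects down to one, which is the kind of filtering the paragraph above describes.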

Data Catalog metastore

For OCI Data Flow users, the metastore provides a central repository to store database, table, column, and partition definitions for databases backed by structured and semi-structured data lake files. You can share these definitions across different runs, different Spark applications, and different developers, which improves collaboration and reusability. This functionality accelerates insights from your data lake. Learn more in the blog post, Improving collaboration using OCI Data Flow and Data Catalog metastore.

Data Catalog APIs and SDKs

OCI Data Catalog capabilities are also available as REST APIs and software development kits (SDKs) in many programming languages. This option allows integrating data catalog capabilities in other services. For more details, see Data Catalog API in the documentation.

Conclusion

OCI Data Catalog is the underlying metadata foundation for cloud data management that you need to derive value from your data in the Oracle ecosystem. It helps you build a data dictionary of assets in Oracle Cloud and on-premises and provides a unified metastore for data lakes. You can manage a business glossary and enrich metadata with AI and ML recommendations, custom properties, and tags.

Data Catalog supports Autonomous Database integration with data lakes and serves as a unified central metastore for OCI Data Flow users. The service is offered as a secure, reliable, serverless native OCI service with REST APIs and SDKs for application integration, and it’s included at no extra cost with your Oracle Cloud Infrastructure subscription.

Use your data in new ways and more easily than you ever could before. Try Oracle Cloud Infrastructure Data Catalog today and start discovering the value of your data. Subscribe to the Big Data blog and OCI blog for the latest on Big Data and Oracle Cloud Infrastructure straight to your inbox!
