Get faster insights from data lakes with the new release of OCI Data Catalog

January 4, 2022 | 4 minute read
Rashmi Badan
Product Manager

We’re announcing a new release of Oracle Cloud Infrastructure (OCI) Data Catalog. In this release, we’ve added features that focus on helping data users understand data lakes better.

In data lake scenarios, files created by external applications, such as Spark, usually follow standard filename patterns. In an earlier release of OCI Data Catalog, we introduced logical entities, which group files that match a filename pattern into a single entity because they logically represent a single data set. Going one step further, the integration of OCI Data Catalog with Autonomous Database automatically creates external tables based on logical entities. This capability makes it easier for Autonomous Database users to consume data in the OCI Object Storage data lake using SQL queries.

Automatic generation of logical entities

Logical entities in an Object Storage bucket are derived from filename patterns, which are defined using regular expressions. Writing complex regular expressions works well for advanced users, but not for everyone. In this release, we simplify the process by eliminating the need for regular expressions. When you create a filename pattern, choose a starting folder prefix in the folder hierarchy of the Object Storage bucket. The logical entities are then generated automatically when you harvest a data asset that has this filename pattern assigned to it.


Figure 1: Creating a filename pattern with a folder prefix

Using the starting folder prefix, OCI Data Catalog scans all files within it, identifies filename patterns, and derives logical entities based on those patterns. It also automatically identifies any partition key columns and lists them in the attributes list of the logical entity. This simplification shortens the time it takes data users to understand the files in the data lake.
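As a rough mental model only, the grouping step can be pictured with the Python sketch below. It is not the service's actual algorithm, which this post does not describe. It assumes Hive-style key=value directories, and the function name derive_entities, the prefix, and the object names in the usage note are hypothetical.

```python
from collections import defaultdict


def derive_entities(object_names, prefix):
    """Group object names under `prefix` into candidate logical entities.

    Assumption: directories of the form key=value are partition folders,
    and the remaining directories identify the entity.
    """
    entities = defaultdict(lambda: {"partition_columns": [], "files": []})
    for name in object_names:
        if not name.startswith(prefix):
            continue  # outside the chosen starting folder
        segments = name[len(prefix):].strip("/").split("/")
        dirs, filename = segments[:-1], segments[-1]
        entity_dirs = [d for d in dirs if "=" not in d]
        keys = [d.split("=", 1)[0] for d in dirs if "=" in d]
        entity = entities["/".join(entity_dirs)]
        entity["files"].append(filename)
        for key in keys:
            if key not in entity["partition_columns"]:
                entity["partition_columns"].append(key)
    return dict(entities)
```

For example, calling derive_entities with the prefix "workshop.db/" on a list of object names such as "workshop.db/custsales/country=USA/month=2019-01/custsales-2019-01.csv" would group those files under a single "custsales" entity with the partition columns country and month.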

Partition keys in logical entities

While populating a data lake, most applications partition the files based on certain fields, and this partition key is embedded in the directory structure (or path) of the files. Identifying the partition key columns improves understanding of the data and structure of the files. It also helps improve performance of queries on the external tables in Autonomous Database that are based on logical entities.

In this release, OCI Data Catalog can identify partitions in Object Storage buckets. The automatic generation of logical entities identifies these partition columns for you. For more control, you can instead write a regular expression in a filename pattern with a partitionKey qualifier.

Consider files of the following format:

  • movieplex/workshop.db/custsales/country=USA/month=2019-01/custsales-2019-01.csv

  • movieplex/workshop.db/custsales/country=UK/month=2019-02/custsales-2019-01.csv

The following sample filename pattern captures the partition key columns:

{bucketName:.*}/{logicalEntity:[^/]+}.db/{logicalEntity:[^/]+}/{partitionKey:[^/]+}/{partitionKey:[^/]+}/\S+$
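To see what each qualifier in this pattern captures, the following Python sketch translates the {qualifier:regex} placeholders into ordinary capture groups and applies the result to a path. It is an illustration, not how OCI Data Catalog evaluates patterns internally. The helper names compile_pattern and parse_path are hypothetical, and splitting each partitionKey segment on "=" is an assumption based on the sample paths.

```python
import re

# Sample filename pattern copied from the article.
PATTERN = (r"{bucketName:.*}/{logicalEntity:[^/]+}.db/{logicalEntity:[^/]+}/"
           r"{partitionKey:[^/]+}/{partitionKey:[^/]+}/\S+$")

# Matches one {qualifier:regex} placeholder (assumes no nested braces).
QUALIFIER = re.compile(r"\{(\w+):([^{}]*)\}")


def compile_pattern(pattern):
    """Replace each placeholder with a capture group; remember qualifier order."""
    names = []

    def to_group(match):
        names.append(match.group(1))
        return "(" + match.group(2) + ")"

    return re.compile(QUALIFIER.sub(to_group, pattern)), names


def parse_path(path, pattern=PATTERN):
    """Return the captured values per qualifier, or None if the path does not match."""
    regex, names = compile_pattern(pattern)
    match = regex.match(path)
    if match is None:
        return None
    parsed = {"bucketName": None, "logicalEntity": [], "partitionKey": {}}
    for name, value in zip(names, match.groups()):
        if name == "bucketName":
            parsed["bucketName"] = value
        elif name == "logicalEntity":
            parsed["logicalEntity"].append(value)
        elif name == "partitionKey":
            # Assumption: partition folders look like column=value.
            column, _, key_value = value.partition("=")
            parsed["partitionKey"][column] = key_value
    return parsed
```

Applied to the first sample path, parse_path captures movieplex as the bucket name, workshop and custsales as the two logicalEntity parts, and country=USA and month=2019-01 as the partition keys. How the catalog combines the two logicalEntity captures into an entity name is not shown here.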


Figure 2: Attribute list containing partition keys

Sync Metastore with the catalog

In previous releases, we introduced Metastore capabilities offered by OCI Data Catalog. With this release, you can synchronize the Metastore content with Catalog by creating a data asset of type Metastore in the catalog and enabling a scheduled sync to harvest that metadata.


Figure 3: Enabling Metastore sync

Now, as a data engineer or data scientist, you can discover the databases and tables available in the Metastore and annotate them with more context using custom properties, tags, and glossary terms.

Other features

This release also includes the following features:

  • Support for MS-Excel-based export and import of custom properties for logical entities in an Object Storage data asset

  • Ability to define relationships between glossary terms across different glossaries

  • Ability to harvest metadata from Object Storage files compressed in zlib format

Conclusion

With this release, OCI Data Catalog provides faster and better insights from files in Oracle Cloud Infrastructure Object Storage buckets. New automated capabilities simplify the creation of logical entities and the identification of partition keys. The Metastore sync further expands the view of available data for data consumers. End to end, this release helps reduce the time-to-value of your data lakes.
