This post was previously published on the Data Integration Blog. 

Oracle Cloud Infrastructure (OCI) Data Catalog is a cloud native service used to discover, organize, enrich, and trace an organization’s technical and business data assets. For a business user, such as a data analyst or business analyst, the key value of a data catalog comes from being able to easily and quickly identify the right business data. For more information on this topic, check out What is a Data Catalog.

Organizing data assets in the catalog based on business definitions provides deeper insights than relying only on the characteristics of technical metadata, such as the names of tables, columns, files, and fields. To provide such value at scale, OCI Data Catalog continues to expand its capability to enrich knowledge of available data, now with a new recommendations engine based on artificial intelligence (AI) and machine learning (ML) techniques.

 

The problem

OCI Data Catalog provides a holistic view of data in the organization by bringing together technical and business metadata. Data experts add business context to the harvested technical metadata by enriching it with business glossary links, user-defined custom properties, and tags. This enrichment helps data consumers search and discover data, based on business terms and tags beyond technical names.

Considering the volume of data available in organizations today, manually searching for and linking all metadata to appropriate business terms defined in the glossary is an onerous and time-consuming task. It soon becomes unscalable when the volume of metadata grows over time. This problem is further magnified when business terms with similar definitions are present in the glossaries within the data catalog.

Let’s say you want to have a standard definition of “customer identifier” within your organization for all attributes that imply a form of customer identification. You navigate to the attribute CUST_ID.

Figure 1: An attribute detail page

You want to associate a business term with this attribute, so that a standard way of defining, discovering, and using it exists within the catalog. To find an appropriate business term manually, you click the Link Terms and Categories button, go through the list of available terms and categories, search and filter by name, navigate to the term detail page of each term (each possibly belonging to a different glossary) to obtain details, select the terms to link, and then click Link. This action links all selected terms to the attribute. The process can be time-consuming unless you already know the exact terms that you want to link to the attribute.

Figure 2: Overlay showing all terms and categories, for manual linking

You would need to repeat this process for the innumerable data entities and attributes in the catalog. It’s not feasible!

So, how can life be made easier to speed up the process of enriching data entities and attributes with business terms?

 

The solution

Automating and accelerating the process of linking through machine learning (ML) and natural language processing (NLP) techniques can improve the user experience and reduce time to value. Based on this premise, we have introduced a recommendation engine in OCI Data Catalog that automates the process of linking business terms and categories to technical metadata. This capability is achieved by building an inference-based knowledge graph through patterns and associations. By combining several string-matching algorithms to calculate feature sets and training an ML model on labeled data, the engine computes a best evaluation score and uses it to provide a list of recommended terms and categories.
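To make the idea of string-matching features feeding a score more concrete, here is a minimal Python sketch. It is not the OCI Data Catalog implementation: the three features, the function names, and the fixed `WEIGHTS` (which stand in for a model trained on labeled data) are all assumptions chosen for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(name):
    """Split a name on non-alphanumeric characters and lowercase it."""
    return {t for t in re.split(r"[^A-Za-z0-9]+", name.lower()) if t}


def features(attribute, term):
    """Return three illustrative string-matching features for one pair."""
    a, t = tokens(attribute), tokens(term)
    # 1. Character-level similarity between the two raw names.
    seq = SequenceMatcher(None, attribute.lower(), term.lower()).ratio()
    # 2. Jaccard overlap between the two token sets.
    jac = len(a & t) / len(a | t) if (a | t) else 0.0
    # 3. Share of attribute tokens that are a prefix of some term token,
    #    which catches abbreviations such as "cust" for "customer".
    pre = sum(any(y.startswith(x) for y in t) for x in a) / len(a) if a else 0.0
    return [seq, jac, pre]


# Hypothetical fixed weights; a real system would learn these from labeled data.
WEIGHTS = [0.3, 0.3, 0.4]


def score(attribute, term):
    """Combine the features into a single score (weighted sum)."""
    return sum(w * f for w, f in zip(WEIGHTS, features(attribute, term)))


def recommend(attribute, terms, top_n=3):
    """Return the top_n terms with the highest score for the attribute."""
    ranked = sorted(terms, key=lambda term: score(attribute, term), reverse=True)
    return ranked[:top_n]


glossary = ["Order Date", "Customer Identifier", "Product Category"]
print(recommend("CUST_ID", glossary, top_n=1))
```

With these assumed features, the abbreviation-aware prefix feature is what lets CUST_ID match “Customer Identifier” even though the two names share no whole token.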

You now have a set of pertinent terms readily available in the context of data entities and attributes. This availability eliminates the effort of browsing and searching through the glossaries for related terms or categories. You can also either accept or reject the recommendations. This action, also known as curation, is taken in as feedback to refine future recommendations.

OCI Data Catalog also provides recommendations of data entities and attributes that you can link in the context of a glossary term or category, with the same curation actions and feedback cycle. These recommendations are useful for anyone familiar with both the technical metadata available in the catalog and the business terms defined in the glossaries. With this knowledge and the recommendations from Data Catalog, you can now link entities and attributes to terms or categories in bulk.

 

Using glossary recommendations in OCI Data Catalog

Harvesting metadata from different source systems and creating business glossaries are two independent operations. Harvesting is performed by data providers, and data stewards create glossaries. After the data catalog is populated with these objects, the next step is to enrich the metadata by linking the business definitions to the underlying technical metadata.

With the introduction of recommendations, you can now see a curated list of recommended terms and categories that are most relevant for the object you are viewing and easily select the ones that are most appropriate. You can see the glossary that the terms belong to, their descriptions, and their status on the attribute detail page, making it much quicker to select the appropriate term to link. You have single-click access to the term detail page if required. Each recommended term is on a separate card with two actions, Accept or Reject. The Accept action automatically links the term to the data catalog object and removes it from the curated list. The Reject action removes the term from the recommended list, and the term is never shown for this object again. If you exhaust all recommendations using the Accept or Reject actions, refreshing the object detail page brings new recommendations if available.
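The Accept, Reject, and refresh behavior described above can be modeled in a few lines of Python. This is only a sketch of the bookkeeping, under assumptions: the `TermCuration` class, its `page_size` parameter, and the sample term names other than “Customer Identifier” are hypothetical, and the sketch records decisions without retraining anything, whereas the service also uses them as feedback for future recommendations.

```python
class TermCuration:
    """Tracks recommended terms for one catalog object (illustrative only)."""

    def __init__(self, ranked_terms, page_size=2):
        self.ranked_terms = list(ranked_terms)  # best candidates first
        self.page_size = page_size              # cards shown at once (assumed)
        self.linked = []                        # accepted terms, in order
        self.rejected = set()                   # never recommended again
        self.page = []                          # cards currently displayed
        self.refresh()

    def refresh(self):
        """Reload the page with candidates that are neither linked nor rejected."""
        hidden = set(self.linked) | self.rejected
        remaining = [t for t in self.ranked_terms if t not in hidden]
        self.page = remaining[: self.page_size]
        return self.page

    def accept(self, term):
        """Link the term and remove its card; ValueError if it is not shown."""
        self.page.remove(term)
        self.linked.append(term)

    def reject(self, term):
        """Remove the card and remember never to recommend the term again."""
        self.page.remove(term)
        self.rejected.add(term)


curation = TermCuration(
    ["Customer Identifier", "Customer Name", "Account Number"], page_size=2
)
curation.accept("Customer Identifier")
curation.reject("Customer Name")
print(curation.page)       # the page is empty until it is refreshed
print(curation.refresh())  # remaining candidates, if any
```

Keeping rejected terms in a separate set, rather than deleting them from the candidate list, is what guarantees that a later refresh cannot bring them back for the same object.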

In our example, you review the recommended terms and accept the term “customer Identifier,” which is defined in the Sample glossary, successfully linking it in less time and with fewer actions than the earlier manual option.

Figure 3: Curated list of recommendations

Similarly, linking multiple technical objects to a term is made easy by using the recommendations in the Linked Objects tab of a term or category, as shown in Figure 4. In this example, five of the recommended objects are selected to be linked to the Orders category.

Figure 4: Recommendations for technical objects

You can see from the examples how recommendations can help accelerate the enrichment of objects in your catalog. If the recommendations don’t cover the terms you want to link, you can still search for those specific terms and link them manually to the catalog objects.

 

Conclusion

The new recommendation engine in OCI Data Catalog aids in the linking of technical metadata to business metadata in a scalable manner, considerably reducing the time to value of the catalog for any organization.

Organizations are embarking on their next-generation analytics journey with data lakes, autonomous databases, advanced analytics, artificial intelligence, and machine learning in the cloud. Data Catalog helps support discovery, insights, and governance of data assets. The current capabilities of Oracle Cloud Infrastructure Data Catalog are available to Oracle Cloud customers at no extra cost, providing customers with great value-added capabilities! Try it out today!

For more information, review the Oracle Cloud Infrastructure Data Catalog documentation and associated tutorials.