  • June 8, 2018

Data Science Maturity Model - Data Awareness Dimension (Part 6)

Mark Hornick
Senior Director, Data Science and Machine Learning

In this next installment of the Data Science Maturity Model (DSMM) dimension discussion, I focus on 'data awareness':

How easily can data scientists learn about enterprise data resources?

Generally speaking, the term 'awareness' can be defined as "the state or condition of being aware; having knowledge; consciousness." For data awareness, we might refine this definition as "having knowledge of the data that exist in an enterprise and an understanding of their contents." As the image above suggests, enterprises often have many data repositories across organizations and departments. Data may reside in databases, flat files, and spreadsheets, among other places, spread across a range of hardware, operating systems, and file systems - the data landscape. Moreover, data silos form where one part of the enterprise is completely unaware that data exist in another, let alone what those data mean.

Data awareness across an enterprise gives data science players, especially data scientists, the ability to browse and understand data from a metadata perspective. Such metadata may include textual descriptions of tables and individual columns, key summary statistics, and data quality metrics, among others. Data awareness is essential not only to increase productivity, but also to inventory data assets and enable an enterprise to move toward "a single version of the truth."
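To make the idea of column-level metadata more concrete, here is a minimal Python sketch of what one catalog record might hold: a textual description, a couple of summary statistics, and a simple data quality metric. The `ColumnMetadata` class, the `profile_column` function, and the `CUSTOMERS.AGE` sample are all hypothetical names chosen for illustration; they are not taken from any particular metadata management tool.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class ColumnMetadata:
    """Hypothetical catalog record for one column (illustrative only)."""
    table: str
    column: str
    description: str               # textual description of the column
    row_count: int                 # summary statistic: number of values seen
    distinct_count: int            # summary statistic: distinct non-null values
    mean_value: Optional[float]    # summary statistic: mean, numeric columns only
    null_fraction: float           # data quality metric: share of missing values


def profile_column(table: str, column: str, description: str,
                   values: Sequence[object]) -> ColumnMetadata:
    """Compute simple metadata for a column given its values (None = missing)."""
    non_null = [v for v in values if v is not None]
    # Treat the column as numeric only if every non-null value is an int or float.
    is_numeric = bool(non_null) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
    )
    return ColumnMetadata(
        table=table,
        column=column,
        description=description,
        row_count=len(values),
        distinct_count=len(set(non_null)),
        mean_value=mean(non_null) if is_numeric else None,
        null_fraction=(len(values) - len(non_null)) / len(values) if values else 0.0,
    )


# Made-up sample data for illustration.
ages = [34, 51, None, 28, 51]
meta = profile_column("CUSTOMERS", "AGE", "Customer age in years", ages)
print(meta)
```

A real catalog would likely store many more attributes (owner, refresh date, lineage), but even a record this small lets a data scientist judge whether a column is worth a closer look before requesting access.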

The 5 maturity levels of the "data awareness" dimension are:

Level 1: Users of data have no systematic way of learning what data assets are available in the enterprise.

Enterprises at Level 1 are often in the dark when it comes to understanding the data resources that may exist across the enterprise. Data may be siloed in spreadsheets or flat files on employee machines, or stored in departmental or application-specific databases. No map of the data landscape exists to assist in finding data of interest. Moreover, the enterprise hasn't awakened to the need for one.

Level 2: Data analysts and data scientists seek additional data sources through "key people" contacts.

The Level 2 enterprise has 'awakened' to the need for and benefits of finding the right data. As data analysts and data scientists take on more analytically interesting projects, the search for data ensues on a personal level - individually contacting data owners or others 'in the know' within the enterprise to understand what data exist. A significant amount of time is lost trying to understand what data exist, how to interpret them, and their quality.

Level 3: Existing enterprise data resources are cataloged and assessed for quality and utility for solving business problems.

The Level 3 enterprise sees the need to make it easier for data science players to find data and to have greater confidence in the quality of those data for solving business problems. Ad hoc metadata catalogs begin to emerge, which make it easier to understand what data are available. However, such catalogs are non-standard, not integrated, and dispersed across the enterprise.

Level 4: Enterprise introduces metadata management tool(s).

The Level 4 enterprise builds on the progress from Level 3 by introducing metadata management tools where data scientists and others can discover data resources available to solve critical business problems. Since the enterprise is just starting to take metadata seriously, different departments or organizations within an enterprise may use different tools. While this is an improvement for data scientists, the metadata models across tools are not integrated, so multiple tools may need to be consulted.

Level 5: Enterprise standardizes on a metadata management tool and institutionalizes its use for all data assets.

The Level 5 enterprise has fully embraced the value of integrated metadata and of maintaining and organizing that metadata through effective tools. All data assets are curated for quality and utility, with full metadata descriptions to enable efficient data identification and discovery across the enterprise. Data science players' productivity and project quality increase as they can now easily find available enterprise data.
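As a rough illustration of what "data identification and discovery" could look like once metadata live in one place, the Python sketch below runs a keyword search over a small, unified catalog. Everything here is an assumption made for illustration: the `TableEntry` class, the `search_catalog` function, and the sample tables are hypothetical and do not represent the interface of any actual metadata management product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TableEntry:
    """Hypothetical catalog entry: a table with described columns."""
    name: str
    description: str
    columns: Dict[str, str] = field(default_factory=dict)  # column name -> description


def search_catalog(catalog: List[TableEntry], keyword: str) -> List[str]:
    """Return names of tables whose name, description, or columns mention keyword."""
    needle = keyword.lower()
    hits = []
    for entry in catalog:
        # Search every piece of descriptive text attached to the table.
        texts = [entry.name, entry.description,
                 *entry.columns.keys(), *entry.columns.values()]
        if any(needle in text.lower() for text in texts):
            hits.append(entry.name)
    return hits


# Made-up catalog contents for illustration.
catalog = [
    TableEntry("CUSTOMERS", "One row per customer account",
               {"AGE": "Customer age in years", "REGION": "Sales region code"}),
    TableEntry("ORDERS", "One row per order placed",
               {"ORDER_TOTAL": "Order value in USD",
                "CUST_ID": "Key of the purchasing customer"}),
    TableEntry("SENSORS", "Hourly equipment sensor readings",
               {"TEMP_C": "Temperature in degrees Celsius"}),
]

print(search_catalog(catalog, "customer"))
```

The point of the sketch is that a single search covers every asset; at Level 4, the same question would have to be asked separately in each department's tool.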

In my next post, I'll cover the 'data access' dimension of the Data Science Maturity Model.
