Serving up gourmet data in the data-driven enterprise

March 2, 2023 | 6 minute read
Mike Matthews
Senior Director, Product Management

The Sushi Principle

It has often been said of data warehousing and analytics that data adheres to the 'Sushi Principle': it is best served raw - or at least that, like many types of food, the less processed it is, the better.

It is an attractive idea. Adding processes that transform, prepare and optimize data for analysis certainly adds complexity, and it makes data governance objectives, such as providing transparency over where data originates and how it is crafted, much harder and more expensive to meet.

However, when we look in detail at many of the leading examples of companies successfully powering analytics from raw data, we often find that they benefit from one or two highly favourable conditions for success; in particular, having only a select few data suppliers, and having considerable control over data production processes. In many cases, the data is auto-generated and much less prone to error than data from applications and other human-generated sources. 

Most enterprises are not so fortunate. The data you need to use to power accurate and powerful analytics often comes from many different systems, or external data suppliers, with different quality standards and different needs from your own, so the source data itself cannot be completely trusted. Sometimes, data may be provided in formats that are not to your taste... and sometimes, the data can simply contain errors and inconsistencies that would lead to reporting inaccuracies if left 'uncooked'. 

Just as not all food can safely be consumed in raw form, there are many essential data sets that can yield some costly side effects if you attempt to use them untouched.

How to satisfy hungry customers

But let us not forget: our customers, the data consumers, have needs. They want accurate, comprehensible reports, and they want them as soon as possible. So how do we go about this?

Let's consider some basic principles of good data management:

1. Keep it simple

The main tenet of the 'Sushi Principle' is to keep things simple, as far as possible, and this is certainly an important foundational principle for a data warehousing architecture. Wherever there is a 'hop' of data between its originating source and its ultimate consumption, ask whether that hop is necessary. Wherever there is data transformation, or even simpler mapping logic, this too should be questioned. In general, the fewer components there are in the architecture, the more likely it is to succeed.

Importantly, this may involve some compromise on capabilities. For example, if there are requirements for both complex data transformation and for strong data governance (such as comprehensive data lineage from report to source, or common metadata definitions), these requirements need to be carefully evaluated for their ongoing cost versus their benefit. In particular, any architecture that requires many hours of manual intervention to define metadata, or to address data problems, is likely to incur huge costs that may not be sustainable.

Some patterns to consider under the theme of 'keeping it simple' are:

  • Reduce the number of components - for example, if a database or analytics component offers built-in data load and transformation capabilities, consider using this instead of a separate and dedicated ETL solution, even if the enterprise already has access to one. Similarly, if you have requirements for handling data in various formats (structured, semi-structured, unstructured, or relational and non-relational), look for a single solution that can manage these centrally rather than separately. A simple sketch of this pattern follows the list.
  • Test automation technology - many claims are made for dazzling automation capabilities, for example to classify metadata, or automatically assess or improve data quality. Such capabilities can be powerful to reduce the costs of manual intervention, but need careful testing for effectiveness against real business requirements for the data.
  • Go with what you know - it is wise to be wary of your teams having to learn new tools and technologies, especially if they take time to produce results. Consider the skills that are readily available to the team and whether they can be brought to bear in transforming your data in a repeatable and transparent manner.
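
As a purely illustrative sketch of the 'reduce the number of components' pattern, the Python snippet below loads a feed and transforms it inside the same database (the built-in sqlite3 module stands in for whatever database you already run), with no separate ETL tool in the path. The file, table and column names are assumptions made for the example, not a recommendation of any particular product.

```python
# A minimal sketch of the "fewer components" pattern: stage and transform data
# in one place, rather than adding a dedicated ETL tool for a simple feed.
# File, table and column names are illustrative assumptions.
import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS raw_orders (order_id TEXT, order_date TEXT, amount TEXT)"
)

# Stage the file as-is: one hop from source system to database.
with open("orders.csv", newline="") as f:
    rows = [(r["order_id"], r["order_date"], r["amount"]) for r in csv.DictReader(f)]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)

# Transform where the data already lives, using SQL the team already knows.
conn.execute(
    """CREATE TABLE IF NOT EXISTS orders AS
       SELECT order_id,
              DATE(order_date)     AS order_date,
              CAST(amount AS REAL) AS amount
       FROM raw_orders
       WHERE order_id IS NOT NULL"""
)
conn.commit()
conn.close()
```

The point is architectural rather than technical: every load and transformation step happens in one component, so there is one fewer hop to govern and one fewer tool for the team to learn.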

2. Measure data quality

While fixing data quality issues can be an expensive process, it is normally much more straightforward to implement a framework for understanding and measuring the quality of data on an ongoing basis. One of the benefits of pulling data into a data warehouse is that you can establish centralized controls and measures that are closely aligned with the analytical uses of the data. The measurement process should always be concerned with the real business needs of the data, not with identifying data issues that carry little real importance.

It is important to remember that such measures do not always need to be technology-driven. Defining and implementing rules that measure data quality is one approach, but capturing users' own experiences and quality perceptions in the form of simpler ratings and comments on data sets is also useful, and again, automation can play a part, if sufficiently tested.
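
By way of illustration, here is a minimal Python sketch of such a measurement framework, assuming a batch of customer records held as dictionaries; the field names, rules and reference values are assumptions chosen for the example, and in practice each rule should trace back to a real business need for the data.

```python
# A minimal sketch of ongoing data quality measurement over a batch of records.
# Field names, rules and reference values are illustrative assumptions.
import re
from typing import Callable

# Each rule maps a record to True (pass) or False (fail).
rules: dict[str, Callable[[dict], bool]] = {
    "email_present": lambda r: bool(r.get("email")),
    "email_wellformed": lambda r: bool(
        re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", r.get("email") or "")
    ),
    "country_known": lambda r: r.get("country") in {"GB", "US", "DE", "FR"},
}

def measure(records: list[dict]) -> dict[str, float]:
    """Return the pass rate of each rule - a simple score per quality dimension."""
    total = len(records) or 1
    return {name: sum(rule(r) for r in records) / total for name, rule in rules.items()}

sample = [
    {"email": "anna@example.com", "country": "GB"},
    {"email": "not-an-email", "country": "GB"},
    {"email": None, "country": "XX"},
]
print(measure(sample))
# Pass rate per rule, e.g. email_present 0.666..., email_wellformed 0.333..., country_known 0.666...
```

Run regularly, pass rates like these give a simple trend line per quality dimension that can sit alongside user ratings and comments on the same data sets.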

3. Fix as close to the source as possible

Whenever a data issue is identified, for example by a Data Analyst, it is often hugely tempting to fix it in place. This may be fine in some cases, but it is always useful to consider if a little more investment in the fixing process could yield better results. For example, if the owner or provider of the data can be easily identified and contacted, it is normally worthwhile to raise any encountered data quality issues with them, especially if they are universal, rather than specific to your needs. Fixing the data at source, or even better applying a preventative measure that will stop such issues occurring again, is always better than applying a tactical fix that others may not be able to use, and which compromises simplicity.

For example, if a set of data suffers from duplication, are sufficient duplicate prevention capabilities being used at the point of data capture? Or, if an important field is repeatedly unpopulated or incorrect, is there a systemic reason for this that can be identified and resolved?
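
To illustrate prevention at the point of capture, the simple Python sketch below builds a match key from a couple of normalized fields and rejects obvious duplicates before they are ever stored. The choice of fields and the normalization are assumptions made for the example; real duplicate matching is usually considerably more sophisticated.

```python
# A minimal sketch of duplicate prevention at the point of capture, rather than
# de-duplicating downstream. Matching on a normalized name + email key is an
# illustrative assumption; production matching is usually far more sophisticated.
def normalize(record: dict) -> tuple:
    """Build a simple match key: lower-case, whitespace-trimmed name and email."""
    return (
        " ".join(record.get("name", "").lower().split()),
        record.get("email", "").strip().lower(),
    )

existing_keys: set[tuple] = set()

def capture(record: dict) -> bool:
    """Accept a new record only if no existing record shares its match key."""
    key = normalize(record)
    if key in existing_keys:
        return False  # reject (or route to review) instead of storing a duplicate
    existing_keys.add(key)
    return True

print(capture({"name": "Ana  Smith", "email": "ANA@example.com"}))   # True
print(capture({"name": "ana smith",  "email": "ana@example.com "}))  # False: same person, caught at source
```

The same check, applied where the data is captured, benefits every downstream consumer rather than just your own warehouse.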

Serving Up

Keep the above guidance in mind, and you will be on the right path to transforming your organization into an empowered, data-driven enterprise ready for the challenges of the data age.

Here at Oracle, our mission is to help people see data in new ways, discover insights, unlock endless possibilities. Our investments in cloud data warehousing technology and solutions are directly aligned with this vision. This post sets out some of the thinking behind the development and roadmap of the Autonomous Data Warehouse, and the growing set of data tools that are provided with it at no extra cost. 

To get a taste for how the Autonomous Data Warehouse can help you prepare sumptuous data, try it yourself using one of our live labs:

Data Studio - Self-service tools for everyone using Autonomous Database

Integrate, Analyze and Act on all data using Autonomous Database

Mike Matthews

Senior Director, Product Management

Mike has worked for Oracle since the acquisition of Datanomic, a Cambridge-based software vendor, in 2011. He is responsible for Oracle's data quality products, and a member of the team managing the autonomous database.

