
Learn about data lakes, machine learning & more innovations

To Data Lake or Not to Data Lake

We all know the old adage: when all you’ve got is a hammer, everything looks like a nail. In the world of data management, we’re once again faced with the next hammer. Let’s visit the toolshed and see how many hammers we’ve collected.

Every one of these big data innovations has brought a breakthrough in economics, a richer resume, and an entertaining watery vocabulary! To bring you up-to-date, we started by sqooping data into reservoirs and oceans, and then rivers flowed into lakes or, when badly done, swamps. Most recently, distributed streams are weaving themselves into a fabric. Then, we dive in and wrangle the data. Yeehaw!

Mastery over data volume, variety, velocity (and a dozen other ‘v’s) is still a formidable business and technical challenge. In the big data lab, finding value in a sea of data of unknown value has become the mission of machine learning and the passion of data scientists worldwide. But that challenge is not reserved for big data alone. Arguably, these advanced techniques will also have a transformative business impact on the enterprise data that powers our organizations every day: the systems of record and systems of engagement.

In this blog, I want to share some of the key architecture principles and elements of the Oracle Data Warehouse ecosystem, in particular how you can apply advanced analytical techniques to the enterprise data in your transaction systems and data warehouses, both independent of and in coordination with data lakes. If you use Oracle today, the key analytical capabilities are already in place and ready to use.

The Business Goal: Inspirational and Aspirational Analytics

Let’s tap into the excitement of modern analytics, namely predicting the future. Imagine a company that suffered from high attrition among its best employees and was determined to understand and solve the problem. The team started by building an attrition-oriented visualization dashboard that gave them visibility into hundreds of useful facts: attrition reasons, counts, and trends; performance reviews; and even employee connectedness to other employees. This investigative experience let them drill into every piece of data to piece together their attrition story, and then infer or deduce conclusions and actions.

Figure 1:  Typical attrition analytics dashboard
While that was a fun exercise (who doesn’t like interactively exploring colorful charts?), they eventually learned about the power of predictive analytics. By combining their enterprise HCM data with machine learning, they allowed the data to speak for itself, free from intuition and unintentional bias. They were able to predict the who, when, why, and where for every employee termination in their company. Then, not only could they test whether those predictions were accurate, they could also test whether their mitigating actions were effective.

Figure 2:  Attrition predictions

Analytical Data Management Decisions

That dashboard shows off the positive business outcome, but to get there, IT architects were determined to find a sustainable and affordable architecture that was closely aligned with their business requirements.  Here were their business conditions at the time:

A restaurant chain with 25,000 stores and 240K employees.

Annually, 130% employee turnover.  That means they lose 100% of their workforce and then 30% of the replacements!

Management asked for real-time dashboards, high predictive accuracy, and 24/7 system availability.
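
To make the turnover figure concrete, here is a small back-of-the-envelope calculation. It is only a sketch: it assumes the usual definition of annual turnover (departures in a year divided by headcount) and treats the 240K headcount as constant over the year, neither of which the figures above state explicitly.

```python
# Turnover arithmetic for the figures quoted above.
# Assumption: annual turnover rate = departures in a year / headcount,
# with headcount held constant at 240,000 for the whole year.

def annual_departures(headcount: int, turnover_rate: float) -> int:
    """Return the number of employee departures implied by a turnover rate."""
    return round(headcount * turnover_rate)

HEADCOUNT = 240_000    # employees, from the business conditions above
STORES = 25_000        # stores, from the business conditions above
TURNOVER_RATE = 1.30   # 130% per year

departures = annual_departures(HEADCOUNT, TURNOVER_RATE)
replacements_lost = departures - HEADCOUNT   # departures beyond the original workforce
per_store = departures / STORES              # average departures per store per year

print(f"Departures per year: {departures:,}")
print(f"Departures among replacements: {replacements_lost:,}")
print(f"Average departures per store per year: {per_store:.2f}")
```

Under those assumptions, every store replaces a dozen or so people each year, which is why predicting departures ahead of time mattered so much to this business.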

Figure 3:  A leading restaurant chain

The HR IT Strategy

IT had learned that to be effective and responsive, they needed an integrated architecture rather than standalone components. They had spoken with a data engineer from a Fortune 500 company who lamented his shortsightedness in buying data warehouse components instead of a comprehensive architecture, “you need to use 3rd party ETL tools and 3rd party BI tools to draw insights.  Further, you also rely on a separate relationship with cloud infrastructure providers.  The cost of ownership doesn’t sink in until well into the implementation.” [1]

Looking for guidance, they contacted Forrester Research, whose analysts explained their conclusions about translytical data platforms. Streamlining the analytics pipeline by combining transaction processing with analytical capabilities and eliminating ETL processes seemed to be precisely the architecture they were looking for. Thus began their journey to create a unified and optimized information architecture.

The HR Analytics Pipeline Strategy

Their core HR applications and basic reporting already ran in a single Oracle Database on Oracle Exadata, so they felt that they had two of the components already in place.  They had even made a prior decision to avoid deploying a conventional data warehouse since they thought that the additional data transfers would only slow down their reporting.  Plus, they wanted to eliminate the cost, disruptions, and administration for managing, maintaining, and securing an additional high availability, large-scale system.  

Their key data management accomplishment so far was to move the analytics to the data.  But now, the data scientists and analysts needed data that was outside the core system.  In fact, it was in non-Oracle databases and in the Microsoft Cloud.  

So what should the analytic pipeline look like? Are additional data platforms, such as a data warehouse or a data lake, necessary to get the data in one place? And, of particular concern, what time delays would be incurred to move, transform, analyze, and ultimately publish the results?

Answering these questions and arriving at an optimal predictive analytics architecture led them to consider three crucial areas:   data sources, data access, and in-place analytics.

Analytical Data Sets

Here are their data sources, data stores, and deployment locations:


Figure 4:  Enterprise portfolio of diverse data stores and data platforms

Systems of Record

The core HCM application uses the Oracle Database.  And, to save money on Exadata storage, they automatically archive data in a cloud object-store.  However, with so much turnover, their analysis needed to include historical employee data.  The good news was that they were already using Oracle’s hybrid partitioning as an active archive.  Even though the data is physically moved, it is still transparently accessed and maintains its metadata, security controls, and audit tracking.  This makes historical data easily accessible.  Learn more here.


Figure 5:  Oracle Hybrid Partitioning
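
The active-archive idea can be illustrated with a toy model. This is not Oracle’s hybrid partitioning and it does not model metadata, security, or auditing; it is a minimal sketch in which the `HybridTable` class, the CSV files, and the temporary directory standing in for a cloud object-store are all hypothetical. It only shows the core behavior: a partition moves to cheaper storage, yet one scan still returns rows from both places.

```python
import csv
import tempfile
from pathlib import Path

class HybridTable:
    """Toy table whose partitions live in memory or in external CSV files."""

    def __init__(self):
        self.internal = {}   # year -> list of row dicts (stands in for database storage)
        self.external = {}   # year -> Path of a CSV file (stands in for an object store)

    def insert(self, year, rows):
        self.internal.setdefault(year, []).extend(rows)

    def archive(self, year, directory):
        """Move one partition out to a CSV file; it remains queryable via scan()."""
        rows = self.internal.pop(year)
        path = Path(directory) / f"employees_{year}.csv"
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["emp_id", "status"])
            writer.writeheader()
            writer.writerows(rows)
        self.external[year] = path

    def scan(self, years):
        """Yield rows for the requested years only, wherever they are stored."""
        for year in years:
            if year in self.internal:
                yield from self.internal[year]
            elif year in self.external:
                with self.external[year].open(newline="") as f:
                    for row in csv.DictReader(f):
                        # CSV stores text, so restore the integer key on read.
                        yield {"emp_id": int(row["emp_id"]), "status": row["status"]}

table = HybridTable()
table.insert(2019, [{"emp_id": 1, "status": "terminated"},
                    {"emp_id": 2, "status": "terminated"}])
table.insert(2020, [{"emp_id": 3, "status": "active"}])

with tempfile.TemporaryDirectory() as object_store:
    table.archive(2019, object_store)              # 2019 leaves "database" storage
    rows = list(table.scan([2019, 2020]))          # one query spans both locations

print(rows)
```

The caller of `scan` never needs to know which years were archived, which is the property that made historical employee data easy to include in the analysis.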

Systems of Engagement

The HCM mobile social and interaction data may reveal the extent of employee engagement with one another. That data is captured in a NoSQL database.  

The corporate survey data store captures crucial exit interview data. That corporate application publishes survey results as JSON-formatted data in object-store files in the Microsoft Azure Cloud.

Unifying all the Data

The data scientists added two new data sources for use in their analysis: mobile data in a NoSQL database and survey data stored as JSON in an object-store. With these two sources outside the database, they needed to determine if they could stay true to their data management principle of moving the analytics to the data. Here are the options they considered:


Figure 6:  Data movement alternatives to access all data for analysis

Option 1 copies the two new data subsets into the transaction system.  Option 2 copies the transactions into a data lake which also includes copying the other two data sets.  This is more complicated than it first appears since the field metadata needs to be derived and enriched, lineage captured, data quality evaluated, and security rules applied.  Option 3 is to leave the data where it originated and read it securely from the transaction system as needed.  

Due to the dire business conditions, the overriding consideration for information access was speed and simplified self-service access. So, the optimal scenario was to filter and transfer only the information needed (option 3). The intuitive business benefit was that a single data set would always be up-to-date and timely. It also meant they could avoid the time spent re-processing large data sets, along with the likely associated network and storage charges.

Figure 7:  Unifying and simplifying access across diverse data sources without data subsetting using Oracle Big Data SQL.
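
The difference between copying a source and reading it in place comes down to where the filter runs. The sketch below is a simplified illustration, not how Oracle Big Data SQL is implemented: the `SURVEYS` list stands in for the remote survey files, and both function names are made up for this example.

```python
# Toy remote source; the rows are invented for illustration.
SURVEYS = [  # stands in for JSON survey files in a cloud object-store
    {"emp_id": 1, "exit_reason": "pay"},
    {"emp_id": 2, "exit_reason": "schedule"},
    {"emp_id": 3, "exit_reason": "manager"},
    {"emp_id": 4, "exit_reason": "pay"},
]

def copy_everything(source):
    """Options 1 and 2: bulk-copy the whole source; filtering happens later."""
    return list(source)

def read_in_place(source, predicate):
    """Option 3: apply the filter at the source and transfer only the matches."""
    return [row for row in source if predicate(row)]

wanted = {2, 4}  # employees currently under analysis in the transaction system

copied = copy_everything(SURVEYS)
filtered = read_in_place(SURVEYS, lambda row: row["emp_id"] in wanted)

print(f"Rows transferred by copying: {len(copied)}")
print(f"Rows transferred by reading in place: {len(filtered)}")
```

With four rows the saving is trivial, but the same ratio applied to years of survey and interaction data is what drives the re-processing time and transfer charges mentioned above. The copy also starts aging the moment it is made, while the in-place read always sees the current source.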

The Oracle Big Data SQL and Oracle Cloud SQL query capabilities allow access to Hadoop, Kafka, and NoSQL across your data centers and to cloud object-stores in the public cloud from Oracle, Microsoft Azure, and Amazon Web Services. Additionally, the high-speed direct connection between Oracle and Microsoft data centers enables high-bandwidth, low-latency data transfers.

Learn more about Oracle Big Data SQL and Big Data Connectors.

In-Place Data Science: The Oracle Database-centric Data Lab

The most important aspect of the analytic pipeline is the analysis itself. Where should the data lab be? What tools are necessary? There is an 80-20 maxim in analytics and data science: 80% of the time is spent preparing the data and 20% doing the analysis. The company quickly realized that by keeping the data in an Oracle database, they could go a long way toward eliminating the 80% of time spent on data preparation.

The conventional first step for predictive analytics is to subset data from larger data sets. With Oracle, the data remains in the database, yet the data can be logically subsetted by users, projects, and data.  This single copy strategy keeps the data consistent and current across all subsets while also eliminating burdensome reconciliation.  Object-stores don’t have this capability.
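
A database view is one common way to get a logical subset without making a copy. The sketch below uses Python’s built-in `sqlite3` module purely as a stand-in for any SQL database; the table, view, and column names are hypothetical. It shows that a change to the single underlying copy is visible through the subset immediately, with nothing to reconcile.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, region TEXT, status TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "west", "active"), (2, "west", "terminated"), (3, "east", "active")],
)

# A logical subset for one project: no rows are copied, only a query is stored.
conn.execute(
    "CREATE VIEW west_attrition AS "
    "SELECT emp_id, status FROM employees WHERE region = 'west'"
)

count_sql = "SELECT COUNT(*) FROM west_attrition WHERE status = 'terminated'"
before = conn.execute(count_sql).fetchone()[0]

# The transaction system records a new termination in the base table...
conn.execute("UPDATE employees SET status = 'terminated' WHERE emp_id = 1")

# ...and the subset reflects it immediately.
after = conn.execute(count_sql).fetchone()[0]

print(before, after)
```

Had the subset been exported to a file instead, the second count would have required a fresh export before it matched the system of record.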

The conventional second step is to move the subset into an analytical processing environment. With Oracle, the machine learning environment is already present and integrated across many features of the database engine, such as high performance in-memory processing. This embedded analytics architecture eliminates the data movement step entirely. Data lakes also don’t have this intrinsic capability.

Lastly, the analytic tools themselves are embedded and optimized in the architecture. The results showed outstanding performance thanks to the many optimized hardware and software capabilities, such as algorithm optimization across memory, CPU, and storage; data partitioning; and automated tuning. Data lake infrastructure is limited to object storage management and does not offer any embedded analytic capabilities whatsoever.

That said, they did find machine learning and BI tools using data lakes as storage in the cloud.  But, this approach did not fit with their objectives for an integrated architecture.  To take the data lake-centric route meant that they would have to custom assemble and then maintain an entire platform.  Plus, there would be too much data movement.

It might not be a surprise by now, but their choice was to build an in-place database-centric data lab.  With an entire ‘in-place’ pipeline, they had an architecture to deliver on their goal of near real-time analytics.


Figure 8: Oracle Database architected for “in-place” analytics
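
To show what “moving the analytics to the data” looks like in practice, here is a minimal sketch that registers a scoring function inside a database and calls it from SQL, so the rows never leave the engine. It uses Python’s built-in `sqlite3` module as a stand-in and is not Oracle Machine Learning; the `attrition_score` function, its coefficients, and the table columns are invented for illustration and were not fitted to any data.

```python
import math
import sqlite3

def attrition_score(tenure_months, engagement):
    """Hypothetical logistic score; the coefficients are made up, not trained."""
    z = 1.5 - 0.05 * tenure_months - 2.0 * engagement
    return 1.0 / (1.0 + math.exp(-z))

conn = sqlite3.connect(":memory:")
# Register the scoring function so SQL statements can call it by name.
conn.create_function("attrition_score", 2, attrition_score)

conn.execute(
    "CREATE TABLE employees (emp_id INTEGER, tenure_months REAL, engagement REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 2, 0.1), (2, 36, 0.9), (3, 6, 0.4)],
)

# Scoring and filtering happen in one query against the live table.
at_risk = conn.execute(
    "SELECT emp_id, attrition_score(tenure_months, engagement) AS risk "
    "FROM employees "
    "WHERE attrition_score(tenure_months, engagement) > 0.5 "
    "ORDER BY risk DESC"
).fetchall()

for emp_id, risk in at_risk:
    print(f"employee {emp_id}: risk {risk:.2f}")
```

Because the prediction is just another column in a query, it can be joined to the originating records for reporting without an export, load, or refresh step in between.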

Zagrebačka Bank uses the Oracle Database as a multi-workload data store, consolidating transaction data from 130 branches and 850 distributed databases (ATMs), and performing in-place machine learning analytics for near real-time use by loan officers. Read their story here.

Learn more:  Oracle Machine Learning, Oracle Data Miner, Oracle Converged Database, and Oracle Analytics Cloud.

The Analytic Pipeline Architecture

In the end, they adopted the Oracle Data Warehouse ecosystem to meet their analytic pipeline needs. Their pipeline was simpler, more integrated, and minimized data movement both for the real-time reporting and the predictive analytics modeling and training.

Figure 9:  Conceptual architecture for a heterogeneous analytic pipeline based on the Oracle Data Warehouse ecosystem

With this architecture, the applications have immediate access to the data - no need to move or transform data.  Using Oracle Big Data SQL, the disparate data stores and data platforms are unified into a single secured view accessible by all users, analytics, and applications.  The predictive analytics function can be executed directly in the database and those results can be combined with the originating data sources for the necessary reporting and predictions.

Summary

Today, data lakes are a valuable economic alternative that robustly meets the needs of some aspects of IT and data science; however, they are less suitable for enterprise data unless the metadata is still controlled and secured by the database. For those systems of record, data lakes are useful as an active archive and will help save on storage costs. Otherwise, there are many not-so-hidden costs to converting structured data into unstructured or semi-structured data, because at the end of the day, if you want to use it you have to add the structure back.

Much of the analytics rationale for using a data lake can be achieved within the Oracle Database and the Oracle Data Warehouse ecosystem. Be sure to read the Forrester research ranking translytical data platforms. For many use cases, it just makes sense to do your data management and data analytics in place. Oracle data warehouses are the right solution, especially when they co-exist with your transaction systems, sharing performance, availability, and security capabilities. The Oracle Data Warehouse ecosystem enables all the best practices in data management.

One action for you: try these two “free” powerful features, multi-workload and machine learning. All of these database capabilities are available wherever you run an Oracle Database, both on premises and in the cloud.

The Oracle Data Warehouse ecosystem is more powerful than you may think.  Try it out in the Oracle Cloud using free credits. 


[1] Cloud Data Warehousing Platforms, Pique Solutions, July 2020
