Machine Learning and the Modern Data Lake

In this article, we’re going to talk about machine learning, the modern data lake, and what this means for you.

But first, let’s go back to the first Olympic games in modern times, held in Athens in April of 1896. This photograph is from the men’s 100m final. There’s only one runner in the 4-point stance, crouched down with hands on the ground, right behind the start line. That was Tom Burke, and he won—even though he was actually more of a distance runner.

[Figure: Power of Data]

Today every sprinter uses that starting stance. But back then it was new information and only a few athletes were exploiting that data.

Exploiting data is what we’re going to talk about today. But instead of the 4-point stance and gold medals, we’ll be discussing machine learning, data lakes, and how they can help you exploit data about your business, your customers, your partners, and anything else you need to get that competitive edge. Essentially, the question is: what extra information do you need to gain an edge like Tom Burke’s, and how can you make use of it?

Why Machine Learning Is More Than Just Buzz

Machine learning is trendy right now, but why should it matter to you? Research from the McKinsey Global Institute indicates that proactive machine learning adopters make more profit than their less proactive peers.

[Figure: Machine learning drives profits]

As responsible data users, we can’t prove causation from this one diagram alone. But enough supporting evidence exists to suggest that this is more than a simple correlation.

We think the cloud could be the best place for your machine learning workloads. We’ll get into this more later.

Real-Life Machine Learning Example

Here’s one example of the value that machine learning has brought to an organization. The UK’s National Health Service offers health care to all residents of the UK and holders of a valid European insurance card. Its Business Services Authority set up a Data Analytics Learning Lab with the goal of getting more value out of its existing data with machine learning. The long-term goal was to deliver ongoing savings of one billion pounds per year by:

  • Improving patient outcomes
  • Optimizing internal processes
  • Reducing fraud

For a small team, they accomplished a lot. In just a few months, they found confirmed annual savings of well over £561 million, with additional savings waiting to be confirmed and implemented.

A few machine learning best practices emerge from this project:

  1. They started with their existing data, which helped them gain quick wins and put the project on a sound business footing within the organization. Remember, there’s always time to collect new data and explore new projects after the initial successes.
  2. They moved their data into a separate lab environment. Applying unpredictable, heavy-duty analytics like machine learning would likely wreak havoc on the service levels of their production systems, so a separate lab environment enabled them to experiment without affecting normal business operations.
  3. They started this project using Oracle systems in their data center, which was appropriate at the time, but they’re now looking at moving to the cloud. If they were starting a similar project today, they would most likely start it in the cloud, where they could provision a lab, store a large amount of data, and spin up analytics workloads that vary from lightweight to very compute-intensive and from short to long duration. The cloud lets you build a lab and get results while minimizing costs, risk, and commitment.

But machine learning is about much more than healthcare fraud and improving patient outcomes, as important as those may be. Take a look at these machine learning business use cases.

Machine Learning Business Use Cases

Machine learning can help you:

  • Predict customer lifetime value
  • Predict customer churn
  • Segment your customer base for more targeted marketing
  • Find fraud
  • Make recommendations to your customers
  • Identify subtle seasonal patterns in your business

And these are just a few examples. Organizations in all industries are putting machine learning to use.
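
To make one of these use cases concrete, here is a minimal sketch of churn prediction using a nearest-centroid classifier, one of the simplest machine learning techniques. The customer records, the two features (monthly logins and support tickets), and the function names are all invented for illustration. A real project would train on far more historical data and would likely use a library such as Spark MLlib.

```python
from math import dist
from statistics import mean

# Hypothetical history: (monthly_logins, support_tickets) -> did the customer churn?
HISTORY = [
    ((25, 0), False),
    ((30, 1), False),
    ((22, 1), False),
    ((3, 4), True),
    ((5, 6), True),
    ((1, 5), True),
]


def fit_centroids(rows):
    """Return the mean feature vector (centroid) for each class label."""
    centroids = {}
    for label in {lbl for _, lbl in rows}:
        points = [features for features, lbl in rows if lbl == label]
        centroids[label] = tuple(mean(axis) for axis in zip(*points))
    return centroids


def predict_churn(centroids, features):
    """Predict the label whose centroid is closest to this customer."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


centroids = fit_centroids(HISTORY)
at_risk = predict_churn(centroids, (4, 3))   # few logins, several tickets
loyal = predict_churn(centroids, (28, 0))    # many logins, no tickets
print(at_risk, loyal)
```

With this toy data, the first customer lands closer to the churned group and the second closer to the retained group. The same train-then-predict shape applies to the other use cases in the list, with different features and models.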

How Data Lakes Help With Machine Learning

So let’s say you’re sold, and you want to start exploiting your data by using it for machine learning. What else do you need?

The answer is access to all of your data, and lots of it. Data lakes are a great place to store, manage, process, and analyze that data. People often mistake data lakes for just a place to store data, but they’re more than that.

Data lakes were originally built on premises with racks of dedicated hardware, and that has some advantages:

  • It keeps data in your data center for regulatory compliance, where applicable
  • It’s close to your enterprise data sources
  • You can install exactly the components you want, although you bear the cost and effort of installing and maintaining them

But having a data lake in the cloud offers some different advantages:

  • You can scale up or down much more easily and scale storage independently of compute
  • You can use managed services that significantly reduce administration
  • And most compelling of all (to many people), you pay only for what you use

While Oracle offers you both on-premises and cloud data lake solutions, we also offer you a third option that we call Cloud at Customer. This is a cloud service where the hardware sits in your data center. Here are the advantages:

  • It keeps data in your own data center for regulatory compliance
  • It’s close to your enterprise data sources
  • But you still get the advantages of the cloud and access to managed cloud services

With Cloud at Customer, Oracle owns and manages the hardware, but you consume the services just like you would in the public cloud. In many ways, you’re getting the best of both worlds.

To summarize, the trend has been to go from only using relational database technology, to adding big data technology, to adding specialized big data services in the cloud.

All of these technologies are important depending on the problem you’re trying to solve, and we offer all of them. But the trend we’re seeing is that the first generation of big data technologies, like Hadoop, is giving way to more modern Spark services in the cloud.

[Figure: New Cloud Data Lake]

How to Use Spark in the Cloud

Here’s an example of how you can take advantage of these specialized Spark services in the cloud.

In the cloud, object storage becomes the persistent storage repository for the data in your data lake. What is object storage? Object storage is a very simple system for storing any kind of data file with scalability and redundancy. You only pay for the amount of data that you have stored, and you can add or remove data whenever you want. In addition, object storage is very low-cost storage.

Then you spin up Spark clusters tailored to the specific processing work. One cluster can be for real-time processing, and a second can be a data lab for your data scientists and analysts. Another could be for batch jobs. Each cluster can be configured with the processing resources and local storage it needs, and each can be scaled up or down as required. The node storage can be either disk or solid state. When you’re not using a cluster, you can turn it off so that you’re not paying for it. That’s the beauty of a cloud-based data lake.
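
As a rough illustration of why turning clusters off matters, the arithmetic below compares an always-on cluster with one that runs only during working hours. The hourly rate, node count, and usage hours are made-up numbers chosen to keep the math simple. They are not Oracle pricing.

```python
# All figures below are hypothetical and for illustration only.
HOURLY_RATE_PER_NODE = 0.50  # assumed price per node-hour
NODES = 10                   # assumed cluster size
HOURS_IN_MONTH = 730         # average hours in a month
WORKING_HOURS = 8 * 22       # 8 hours a day, 22 working days


def monthly_cost(nodes, hours_running, rate=HOURLY_RATE_PER_NODE):
    """Cost of a cluster that is billed only while it is running."""
    return nodes * hours_running * rate


always_on = monthly_cost(NODES, HOURS_IN_MONTH)
on_demand = monthly_cost(NODES, WORKING_HOURS)
savings = always_on - on_demand
print(always_on, on_demand, savings)
```

Under these assumptions, the data lab cluster costs roughly a quarter of the always-on figure, while the data itself stays in object storage the whole time.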

Data Science in the Cloud

Now let’s look at how you can start using machine learning in the cloud to fulfill your data science goals. The solution pattern below is a simplified depiction of a data lab for data science, and it shows the different services that are used together.

[Figure: Data Science Lab]

First, data is uploaded to cloud storage (object storage). The data engineer or data scientist can do this with open source tools, Oracle’s free Big Data Connectors, or with the free Oracle Software Appliance that makes object storage look like a disk drive to your other systems.

The stored data is accessed for machine learning in Apache Spark, and both the raw data and the data generated in Spark can be accessed for data visualization. Oracle provides both open source and value-added machine learning capabilities that run on Spark. With a cloud-based data lab, you can start small with just a few CPUs and quickly prove value for a business case.

If you or someone in your organization would like to try this out for yourself, Oracle offers free cloud credits that you can use to run these cloud services. And we also offer step-by-step guides to walk you through the process and show you how to use some of the features.


In this article, we’ve talked a lot about how to exploit your data. But here are the three key ideas:

  • Data – Harness the unused data in your organization and combine data sources to find new value.
  • Platform – The modern cloud-based platform enables easy-to-manage and cost-effective solutions to your data management and analytics needs.
  • Machine learning – With your data and the Oracle platform combined, you can apply machine learning to make predictions, find fraud, make recommendations and more—the technology is constantly changing and getting more exciting.

Together, these are the keys to successfully exploiting your data. With this, you have the power to use your data to find a competitive edge. Don't forget to download your free ebook, "Demystifying Machine Learning." Or, if you're ready to get started, it might be time to see what a data science platform can do for you.

This article features content and writing by Wes Prichard and Peter Jeffcock.

Comments ( 2 )
  • Connor Addis Saturday, September 8, 2018
    Hi, thanks very much for the article. With regard to machine learning, some predictions need to be given in real time, so the data needs to be collected continuously.

    My question is: do real-time ML predictions work in the above example?

    Is data being sent to the cloud from clients’ data sources continuously?

    Not all ML analysis needs to be real time, so how do historic and real-time data get separated?
  • Peter Jeffcock Monday, September 17, 2018
    The short answer is yes: it’s possible, and indeed common, to use machine learning with streaming, real-time data. More often than not, developers will use an API in Apache Kafka to embed machine learning in applications or services. For example, if you are streaming sensor data from a machine, you might want to look at the most recent set(s) of values and score them against an ML model to see if the machine in question is running smoothly or has a potential maintenance issue. If you search for “Kafka machine learning” you’ll find lots of technical examples in this vein.

    To look at the difference between “historic and real-time data” let’s look at how that example above would typically be put together. First, an ML model would be built, trained and evaluated based on historical information. Weeks or months of historical information, most likely stored in a data lake, would be used here. Once the model is deemed ready, it would be integrated into the stream processing to score new events as they occur. What data would be needed to do this? Obviously you would need the most recent event. You may well want some number of preceding events; perhaps the model only flags a potential maintenance issue if the unusual vibration has been going on for 30 minutes. And you may want older data (dates of previous maintenance or similar, for example) which is likely historical information from the data lake (sometimes called “contextual” data because it provides context for the most current data in the stream).

    In summary, you can use a mixture of real-time, recently cached, or historical data to score an ML model in real time, depending on the needs of the model and the ability to access the data in a timely fashion.
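
The pattern described in this reply can be sketched in a few lines of plain Python. This is only an illustration: the vibration threshold stands in for a trained model, the seven-day "recently serviced" rule and the machine names are invented, and the maintenance lookup is a hard-coded dictionary rather than a real query against a data lake. In practice the events would arrive through a stream processor such as Kafka.

```python
from dataclasses import dataclass, field

SUSTAINED_MINUTES = 30   # from the example: flag after 30 minutes of unusual vibration
VIBRATION_LIMIT = 5.0    # hypothetical threshold standing in for a trained model
RECENT_SERVICE_DAYS = 7  # hypothetical rule: ignore machines serviced this recently

# Hypothetical contextual data, as might be loaded from the data lake.
DAYS_SINCE_SERVICE = {"pump-1": 200, "pump-2": 3}


@dataclass
class StreamScorer:
    # Recently cached state: machine -> minute its current unusual streak began.
    streak_start: dict = field(default_factory=dict)

    def score(self, machine, minute, vibration):
        """Return True if this event should raise a maintenance flag."""
        if vibration <= VIBRATION_LIMIT:
            self.streak_start.pop(machine, None)  # normal reading resets the streak
            return False
        start = self.streak_start.setdefault(machine, minute)
        sustained = minute - start >= SUSTAINED_MINUTES
        recently_serviced = DAYS_SINCE_SERVICE[machine] < RECENT_SERVICE_DAYS
        return sustained and not recently_serviced


scorer = StreamScorer()
# One unusual reading every 10 minutes for pump-1.
flags = [scorer.score("pump-1", minute, 7.2) for minute in range(0, 40, 10)]
print(flags)
```

Each call combines the three kinds of data mentioned above: the current event (the vibration reading), recently cached data (when the unusual streak began), and contextual data (days since the last service).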