Netflix’s famous algorithm challenge awarded a million dollars to the best algorithm for predicting user ratings for films. But did you know that the winning algorithm was never implemented into a functional model?
Netflix reported that the results of the algorithm just didn’t seem to justify the engineering effort needed to bring them to a production environment. That’s one of the big problems with machine learning.
At your company, you can create the most elegant machine learning model anyone has ever seen. It just won’t matter if you never deploy and operationalize it. That's no easy feat, which is why we're presenting you with seven machine learning best practices. Here's a data science introduction if you need it. Otherwise, read on!
At the most recent Data and Analytics Summit, we caught up with Charlie Berger, Senior Director of Product Management for Data Mining and Advanced Analytics to find out more. This is article is based on what he had to say.
Putting your model into practice might longer than you think. A TDWI report found that 28% of respondents took three to five months to put their model into operational use. And almost 15% needed longer than nine months.
So what can you do to start deploying your machine learning faster?
We’ve laid out our tips here:
In the following points, we’re going to give you a list of different ways to ensure your data science models are used in the best way. But we’re starting out with the most important point of all.
The truth is that at this point in machine learning, many people never get started at all. This happens for many reasons. The technology is complicated, the buy-in perhaps isn’t there, or people are just trying too hard to get everything e-x-a-c-t-l-y right. So here’s Charlie’s recommendation:
Get started, even if you know that you’ll have to rebuild the model once a month. The learning you gain from this will be invaluable.
Starting with a business problem is a common machine learning best practice. But it’s common precisely because it’s so essential and yet many people de-prioritize it.
Think about this quote, “If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.”
Now be sure that you’re applying it to your machine learning scenarios. Below, we have a list of poorly defined problem statements and examples of ways to define them in a more specific way.
Think about what your definition of profitability is. For example, we recently talked to a nation-wide chain of fast-casual restaurants that wanted to look at increasing their soft drinks sales. In that case, we had to consider carefully the implications of defining the basket. Is the transaction a single meal, or six meals for a family? This matters because it affects how you will display the results. You’ll have to think about how to approach the problem and ultimately operationalize it.
Beyond establishing success metrics, you need to establish the right ones. Metrics will help you establish progress, but does improving the metric actually improve the end user experience? For example, your traditional accuracy measures might encompass precision and square error. But if you’re trying to create a model that measures price optimization for airlines, that doesn’t matter if your cost per purchase and overall purchases isn’t going up.
The Achilles heel in predictive modeling is that it’s a 2-step process. First you build the model, generally on sample data that can run in numbers ranging from the hundreds to the millions. And then, once the predictive model is built, data scientists have to apply it. However, much of that data resides in a database somewhere.
Let’s say you want data on all of the people in the US. There are 360 million people in the US—where does that data reside? Probably in a database somewhere.
Where does your predictive model reside?
What usually happens is that people will take all of their data out of database so they can run their equations with their model. Then they’ll have to import the results back into the database to make those predictions. And that process takes hours and hours and days and days, thus reducing the efficacy of the models you’ve built.
However, growing your equations from inside the database has significant advantages. Running the equations through the kernel of the database takes a few seconds, versus the hours it would take to export your data. Then, the database can do all of your math too and build it inside the database. This means one world for the data scientist and the database administrator.
By keeping your data within your database and Hadoop or object storage, you can build models and score within the database, and use R packages with data-parallel invocations. This allows you to eliminate data duplications and separate analytical servers (by not moving data) and allows you to to score models, embed data prep, build models, and prepare data in just hours.
As James Taylor with Neil Raden wrote in Smart Enough Systems, cataloging everything you have and deciding what data is important is the wrong way to go about things. The right way is to work backward from the solution, define the problem explicitly, and map out the data needed to populate the investigation and models.
And then, it’s time for some collaboration with other teams.
Here’s where you can potentially start to get bogged down. So we will refer to point number 1, which says, “Don’t forget to actually get started.” At the same time, assembling the right data is very important to your success.
For you to figure out the right data to use to populate your investigation and models, you will want to talk to people in the three major areas of business domain, information technology, and data analysts.
Business domain—these are the people who know the business.
Information technology—the people who have access to data.
Data Analysts—people who know the business.
You need the active participation. Without it, you’ll get comments like:
You’ve heard it all before.
You may think, I have all this data already at my fingertips. What more do I need?
But creating new derived variables can help you gain much more insightful information. For example, you might be trying to predict the amount of newspapers and magazines sold the next day. Here’s the information you already have:
Sure, you can make a guess based off that information. But if you’re able to first compare the amount of the current lottery prize versus the typical prize amounts, and then compare that derived variable against the variables you already have, you’ll have a much more accurate answer.
Ideally, you should be able to A/B test with two or more models when you start out. Not only will you know how you’re doing it right, but you’ll also be able to feel more confident knowing that you’re doing it right.
But going further than thorough testing, you should also have a plan in place for when things go wrong. For example, your metrics start dropping. There are several things that will go into this. You’ll need an alert of some sort to ensure that this can be looked into ASAP. And when a VP comes into your office wanting to know what happened, you’re going to have to explain what happened to someone who likely doesn’t have an engineering background.
Then of course, there are the issues you need to plan for before launch. Complying with regulations is one of them. For example, let’s say you’re applying for an auto loan and are denied credit. Under the new regulations of GDPR, you have the right to know why. Of course, one of the problems with machine learning is that it can seem like a black box and even the engineers/data scientists can’t say why certain decisions have been made. However, certain companies will help you by ensuring your algorithms will give a prediction detail.
Once you deploy, it’s best to go beyond the data analyst or data scientist.
What we mean by that is, always, always think about how you can distribute predictions and actionable insights throughout the enterprise. It’s where the data is and when it’s available that makes it valuable; not the fact that it exists. You don’t want to be the one sitting in the ivory tower, occasionally sprinkling insights. You want to be everywhere, with everyone asking for more insights—in short, you want to make sure you’re indispensable and extremely valuable.
Given that we all only have so much time, it’s easiest if you can automate this. Create dashboards. Incorporate these insights into enterprise applications. See if you can become a part of customer touch points, like an ATM recognizing that a customer regularly withdraws $100 every Friday night and likes $500 after every payday.
Here are the core ingredients of good machine learning. You need good data, or you’re nowhere. You need to put it somewhere like a database or object storage. You need deep knowledge of the data and what to do with it, whether it’s creating new derived variables or the right algorithms to make use of them. Then you need to actually put them to work and get great insights and spread them across the information.
The hardest part of this is launching your machine learning project. We hope that by creating this article, we’ve helped you out with the steps to success. If you have any other questions or you’d like to see our machine learning software, feel free to contact us.