Data scientists spend a lot of time thinking about data, performing exploratory data analysis, and building algorithms, all to solve a specific problem. After all that work is done, however, there’s still another step: deciding what to do with the results.
In many cases, when a data scientist builds a data model, the model’s outputs end up summarized in a report that stakeholders can use to make business decisions. This may be the ideal ending for some modeling processes, but in cases where the model is intended to predict future business outcomes — rather than explain why a business outcome occurred — it becomes too tedious and inefficient for stakeholders to sift through a report and make a decision. What they need is a process that consumes the outputs of a data model and makes decisions in real time, something that can only be achieved by deploying the model.
A deployed model can be defined as any unit of code that is seamlessly integrated into a production environment, and can take in an input and return an output. It seems simple enough, but data scientists can’t always do it alone. In order to get their work into production, a data scientist typically must hand over his or her data model to engineering for implementation. And it’s during this handoff that some of the most common — and crippling — data science problems arise.
I’ll explain: Say we’ve built an app that curates a list of specially designed widgets for our customer base to choose from. Users rate widgets, which gives us a nice feedback loop. Our marketing has worked and our user base has grown significantly. We’ve also expanded the number of widgets we offer. Then, our business analysts notice that the time people spend in our app is increasing, which we interpret as a good thing, until we begin to experience a higher churn rate than is typical for us.
We’d likely bring the problem to our data science team, because it’s possible we’ve reached a point where it’s difficult for customers to navigate our app. Let’s say they suggest using collaborative filtering to build a recommendation engine that will help our customers find relevant widgets more quickly. Building it involves a whole host of exploration, analysis, and model building. At the end of this process, they announce that they have created a working recommendation engine to alleviate our churn issue (assuming we’ve correctly identified the cause of churn, but that’s a topic for another post). Hooray! Time to celebrate, right?
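For a feel of what the data science team might prototype, here’s a minimal sketch of user-based collaborative filtering in plain Python. The ratings, user names, and widget names are made up for illustration, and a real engine would involve far more data and tuning than this. The idea is simply to score widgets a user hasn’t rated by what similar users thought of them.

```python
from math import sqrt

# Hypothetical ratings: user -> {widget: rating on a 1-5 scale}
RATINGS = {
    "ana": {"gear": 5, "cog": 4, "sprocket": 1},
    "ben": {"gear": 4, "cog": 5, "lever": 2},
    "cat": {"sprocket": 5, "lever": 4, "gear": 1},
    "dan": {"gear": 5, "cog": 4},
}


def cosine_similarity(a, b):
    """Cosine similarity between two users; unrated widgets count as 0."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = sqrt(sum(r * r for r in a.values()))
    norm_b = sqrt(sum(r * r for r in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(user, ratings, top_n=3):
    """Rank widgets the user has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for widget, rating in other_ratings.items():
            if widget in ratings[user]:
                continue  # only suggest widgets the user has not rated yet
            totals[widget] = totals.get(widget, 0.0) + sim * rating
            weights[widget] = weights.get(widget, 0.0) + sim
    scores = {w: totals[w] / weights[w] for w in totals}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("dan", RATINGS))
```

With this toy data, dan’s tastes look most like ana’s and ben’s, so their ratings carry the most weight in what he is shown.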
Not so fast. Even if we give the green light and say, “Get this model into production ASAP,” our engineering team has to take a number of steps to get our newly christened model into action. Here are just a few of them:
Refactor the model code: Refactoring is a must to simplify the code and improve its readability.
Walk through the code: Determine how it slots into the engineering cycle. What will the code consume, where will it write its output, and how will it be called?
Rewrite into a production stack language or PMML (Predictive Model Markup Language): Engineers and data scientists work in different languages and tools with good reason. Python and R are great for modeling data, but languages like Java and C++ are more geared toward building scalable applications. Keep in mind that rewriting a model into another language can be a time-consuming process.
Integrate it into the tech stack: Once the model is rewritten, it can be moved into the engineering production environment.
Test performance: Extensive testing is vital to knowing how the model will perform once it is deployed.
Tweak the model based on test results: This is the point in the process where any bugs should be fixed so the model’s outputs are accurate and relevant.
Slowly roll out the model: Because it’s hard to determine how our model will perform in a live environment, it’s a good idea to strategically roll out the model and gradually increase traffic to it.
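To make that last step concrete, here’s a minimal sketch of one common way to ramp up traffic gradually: hash each user ID into one of 100 buckets and send only the lowest buckets to the new model. The function names, the `percent` parameter, and the bucket count are our own illustrative assumptions, not a description of any particular rollout tool.

```python
import hashlib


def in_rollout(user_id: str, percent: int) -> bool:
    """Return True if this user falls within the first `percent` of 100 buckets."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0-99
    return bucket < percent


def choose_model(user_id: str, percent: int) -> str:
    """Route a user to the new model or keep them on the existing experience."""
    return "new_model" if in_rollout(user_id, percent) else "current_experience"


# Start small, then raise the percentage as test results come in.
for percent in (5, 25, 100):
    print(percent, choose_model("user-123", percent))
```

Because the bucket comes from a hash of the user ID rather than a random draw, each user gets the same answer on every request, and anyone included at 5 percent is still included at 25 percent.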
Model deployment often takes a back seat to the actual model building process, so it gets the least attention. Yet, clearly, a huge amount of effort is required to get a model from the data science development environment into the engineering production environment.
But this doesn’t have to be the case. As more companies are scaling data science operations, they are also implementing tools that make model deployment easier. Not to toot our own horn, but the DataScience Cloud provides an ecosystem in which data scientists can deploy models themselves behind a REST API — eliminating the threat that they will disrupt the engineering production environment. All an engineer needs to do is integrate the API where it’s needed.
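If you haven’t seen a model behind a REST API before, here’s a rough sketch of the general pattern using only Python’s standard library. To be clear, this is not the DataScience Cloud API. The `/recommendations` path, the JSON fields, and the `predict` stand-in are all hypothetical, and a production service would also need authentication, monitoring, and a real trained model.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(payload):
    """Stand-in for a trained model: returns a fixed, hypothetical ranking."""
    catalog = ["lever", "sprocket", "gear"]
    limit = int(payload.get("limit", 2))
    return {"user_id": payload.get("user_id"), "widgets": catalog[:limit]}


class ModelHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/recommendations":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
            result = predict(payload)
        except (ValueError, TypeError):
            self.send_error(400, "Request body must be a valid JSON object")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real service would log requests


def make_server(port=8000):
    """Build (but do not start) a local HTTP server that wraps the model."""
    return HTTPServer(("127.0.0.1", port), ModelHandler)
```

Calling `make_server().serve_forever()` would start the service locally. An engineer’s application could then POST a body like `{"user_id": "dan", "limit": 1}` to `/recommendations` and get JSON back, without ever touching the model code itself.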
Why does it matter? Well, the easier it is to leverage data science work, the faster you and your company will reap the benefits of data-driven decision making. Ultimately, letting models do the heavy lifting allows you to fully realize the value of your data science team — and run a better, smarter business.