Model Building: From Data Cleaning to Deployment

June 15, 2017 | 3 minute read

Building an accurate predictive model is a multifaceted process that often requires input from business stakeholders and data scientists alike. To ensure you can scale the results of every model your data science team builds, make sure your model building process follows the seven key steps we’ll explore in this post.

1. Create a hypothesis

Even before diving into the data that will serve as the foundation of your predictive model, it’s critical for your data scientists to sit down with stakeholders in your business to define its use case. Use key performance indicators to establish what your model should predict and how. That way, everyone’s on the same page about how the results will be used to influence operations in your business from the very beginning.

2. Load and transform relevant data

The next step is to collect the data that is relevant to your hypothesis and transform it to fit the model’s framework. Making your data accessible to your data scientists will require the expertise of a data engineer to ensure it’s aligned with the technical requirements of the target database. Once that’s completed, your data scientists are ready to split the data into two groups: one for training or building the model and another for testing the accuracy of its predictions. 
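The train/test split described above can be sketched in a few lines of plain Python (in practice, libraries such as scikit-learn provide this via `train_test_split`). The dataset here is a toy stand-in for your prepared records:

```python
import random

# Toy dataset: (feature, label) pairs standing in for your prepared records.
records = [(x, 2 * x + 1) for x in range(100)]

# Shuffle before splitting so both sets reflect the same distribution.
random.seed(42)
shuffled = records[:]
random.shuffle(shuffled)

# Hold out 20% of the rows for testing the model's predictions.
split_point = int(len(shuffled) * 0.8)
train_set = shuffled[:split_point]
test_set = shuffled[split_point:]
```

The 80/20 ratio is a common convention, not a rule; the right split depends on how much data you have and how expensive labels are to collect.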

3. Identify features

Feature engineering allows data scientists to tailor the model to accurately represent key patterns in your business. It involves aggregating raw data into features at the appropriate level of granularity, such as a customer’s daily spending or the day of the week a purchase was delivered. Feature engineering is an important part of cleaning data because it greatly reduces noise and can even address problems with sparse data. That means the model will have an easier time navigating missing values and outliers to more efficiently learn relationships in your data.
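The two example features mentioned above, daily customer spending and delivery day of week, can be derived from a raw transaction log like this (a minimal sketch with illustrative data; in practice a pandas `groupby` would do the same rollup):

```python
from collections import defaultdict
from datetime import date

# Raw transaction log: (customer_id, delivery_date, amount) -- illustrative data.
transactions = [
    ("c1", date(2017, 6, 10), 25.0),   # a Saturday
    ("c1", date(2017, 6, 10), 10.0),
    ("c1", date(2017, 6, 12), 40.0),   # a Monday
    ("c2", date(2017, 6, 10), 15.0),
]

# Roll raw rows up to daily spending per customer -- one engineered feature.
daily_spend = defaultdict(float)
for customer, day, amount in transactions:
    daily_spend[(customer, day)] += amount

# Derive the day of week of each delivery -- a second engineered feature.
features = [
    {"customer": customer, "spend": spend, "weekday": day.strftime("%A")}
    for (customer, day), spend in daily_spend.items()
]
```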

4. Build your model

Next, your data scientists select a machine learning algorithm or statistical methodology that suits your data, use case, and available computational resources. Many data scientists choose to build and test multiple modeling methodologies with the intention of moving forward only with the best fit. Testing the accuracy of each model involves introducing the testing set and comparing its predictions against the actual historical results.
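The compare-and-select workflow above can be sketched with two trivial baseline "models" standing in for real algorithms (the data and error metric here are illustrative; real work would fit, say, scikit-learn estimators and compare them the same way):

```python
import statistics

# Training labels and held-out test labels -- illustrative data.
train_y = [12, 15, 14, 16, 13, 15, 14]
test_y = [14, 15, 13]

# Two candidate "models": trivial baselines standing in for real algorithms.
def mean_model(train):
    avg = statistics.mean(train)
    return lambda: avg

def median_model(train):
    med = statistics.median(train)
    return lambda: med

# Mean absolute error of a model's predictions against actual outcomes.
def mae(predict, actuals):
    return sum(abs(predict() - y) for y in actuals) / len(actuals)

# Score every candidate on the same test set and keep the best fit.
candidates = {"mean": mean_model(train_y), "median": median_model(train_y)}
scores = {name: mae(model, test_y) for name, model in candidates.items()}
best = min(scores, key=scores.get)
```

The key point is that every candidate is scored on the same held-out data, so the comparison is apples to apples.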

5. Evaluate your model

Simply testing whether your model can accurately predict outcomes within the reserved testing data is not enough to be confident that it will perform well in a real-life setting. That’s where validation comes in. This involves techniques like cross validation and receiver operating characteristic (ROC) curve analysis to verify that the model will generalize well to brand-new data. Choosing an algorithm that’s interpretable rather than a black box makes it easier to evaluate the resulting model. That’s because this gives your data scientists the power to look under the hood of the model itself and gain an understanding of how it arrives at its results. Model interpretation tools like our Python library Skater can be employed to help interpret even the most complex models.
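The mechanics of k-fold cross validation can be sketched as follows (scikit-learn's `KFold` and `cross_val_score` automate this; the per-fold "score" here is a placeholder for a real model's metric):

```python
# Minimal k-fold cross-validation sketch: each fold takes one turn as the
# validation set while the remaining folds are used for training.
data = list(range(20))  # stand-in for (features, label) rows
k = 5
fold_size = len(data) // k

fold_scores = []
for i in range(k):
    validation = data[i * fold_size:(i + 1) * fold_size]
    training = data[:i * fold_size] + data[(i + 1) * fold_size:]
    # Placeholder "score": fraction of rows used for training this fold.
    fold_scores.append(len(training) / len(data))

# Averaging across folds gives a more stable estimate than a single split.
average_score = sum(fold_scores) / len(fold_scores)
```

Because every row serves in a validation set exactly once, the averaged score is less sensitive to a lucky or unlucky single split.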

6. Deploy your model

Once your model appears to be performing satisfactorily, you can deploy it into production. This is usually done in one of two ways. Traditionally, the model is turned over to engineers to translate into a production stack language to prepare for deployment into the production environment. Alternatively, setting up infrastructure that empowers data scientists to deploy models on their own as APIs is an option that’s gaining popularity because it eliminates lags between data science and engineering teams and gets results in front of decision makers faster.
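Under either deployment route, the core pattern is the same: serialize the trained model once, then load it inside the serving process to score incoming requests. A minimal stdlib sketch, with a plain dict of learned coefficients standing in for a real fitted model (production systems typically use joblib plus a web framework such as Flask):

```python
import pickle

# A trained "model" stands in here as a plain dict of learned coefficients.
model = {"intercept": 1.0, "slope": 2.0}

# Serialize the model so the serving layer can load it without retraining.
blob = pickle.dumps(model)

# Inside the API process: deserialize once at startup, then score requests.
loaded = pickle.loads(blob)

def predict(x):
    return loaded["intercept"] + loaded["slope"] * x

print(predict(3))  # 7.0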

7. Monitor your model

The process doesn’t stop once the model is deployed — you must continue to track performance over time and make adjustments as needed. A successful model will inform improvements in your business that may introduce feedback loops in your data. For that reason, retraining may be necessary to ensure the model keeps up with these evolutions. For example, a customer churn model that outputs individual churn risk scores for each of your customers may find that customers whose orders were delivered on a Saturday are higher risk. After investigating further, stakeholders in your business realize that issues in the packaging department need to be addressed to ensure a better delivery process on weekends. In turn, the model will need to be updated to account for this change in the typical customer experience.
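A simple form of this monitoring is to compare the model's live error against the error measured at deployment time and flag it for retraining when the gap exceeds a tolerance. The numbers below are hypothetical:

```python
# Monitoring sketch: flag the model for retraining when its recent error
# drifts past a tolerance above the error measured at deployment time.
baseline_mae = 0.7                     # error on the held-out set at deploy time
recent_errors = [0.9, 1.1, 1.0, 1.2]   # hypothetical live prediction errors
tolerance = 0.25

recent_mae = sum(recent_errors) / len(recent_errors)
needs_retraining = recent_mae > baseline_mae + tolerance
```

Real monitoring would also watch the input distributions themselves, since feature drift often appears before prediction error does.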

Following these seven steps is key to building and maintaining predictive models that are truly representative of your business. Doing so is intensive and time consuming, so it’s important to ensure you have the right tools, processes, and infrastructure in place to help you scale data science work across your organization.

Nikki Castle
