Data science is powering applications around the clock, from Netflix’s powerful content recommendation engine to Amazon’s virtual assistant Alexa. But not all companies are equipped to support large-scale, always-on data science projects; doing so requires an infrastructure that can handle the large volumes of traffic these models receive while in production, as well as a process for getting a finished model into a production environment.
In their infancy, data science teams are often more concerned with creating data schemas and identifying key performance indicators than they are with operationalizing their work. But as a business grows, so too does the need for better, faster models. Companies that don’t handle this transition well incur technical debt — such as unstable or underutilized data dependencies and pipeline jungles — ultimately reducing model accuracy and driving up costs.
Performing data science at scale means bringing IT or DevOps teams into the mix to lay the right groundwork before technical debt occurs. There are many different approaches to creating a data science strategy that scales, but in agile environments, teams are embracing software engineering practices — including continuous integration, delivery, and operation — to deploy and support applications that are powered by predictive models. The result is a faster time to value on data science projects and a backend that is easier to manage.
Continuous Integration: Committing Code Regularly
In an enterprise setting where multiple data scientists could be working on a single project, the first step to doing data science work that scales is implementing version control, whether that’s GitHub, GitLab, Bitbucket, or another solution. Once your team has the ability to track code changes, the next step is to create a process in which they regularly commit their code to the master branch of your repository.
In software development, the practice of frequently pushing new code to the master branch — in some cases, multiple times a day — is called continuous integration. It's often paired with automated testing, and agile teams embrace it because it eliminates problems like bugs and merge conflicts. These problems arise when a member of the team waits too long to commit changes, and the code drifts so far from the master branch that it no longer integrates cleanly.
In a data science setting, problems like these translate into a lack of reproducibility and slower timelines for getting projects into production. Regularly merging branches and then testing that code with an open source continuous integration tool, however, makes ongoing model maintenance and improvement far easier.
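To make that concrete, here is a minimal sketch of the kind of automated check a continuous integration tool could run on every merge to the master branch. The model, data, and accuracy floor are illustrative stand-ins, not part of any specific project or tool:

```python
# Minimal sketch of an automated model check run on every merge.
# If a code change degrades the model below the floor, the CI build fails.

import random

def train_model(data):
    """Toy 'model': always predict the majority label seen in training."""
    ones = sum(label for _, label in data)
    majority = 1 if ones >= len(data) / 2 else 0
    return lambda features: majority

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    correct = sum(1 for features, label in data if model(features) == label)
    return correct / len(data)

def test_model_meets_accuracy_floor():
    random.seed(42)  # fix the seed so the CI check is reproducible
    data = ([((random.random(),), 1) for _ in range(80)] +
            [((random.random(),), 0) for _ in range(20)])
    model = train_model(data)
    assert accuracy(model, data) >= 0.75  # the agreed-upon accuracy floor

test_model_meets_accuracy_floor()
print("model check passed")
```

A real project would swap in its actual training code and a held-out dataset, but the shape is the same: a deterministic, assert-based test that the CI server runs automatically on every commit.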
Continuous Delivery: Accelerating Model Deployment
Companies that embrace continuous delivery are pushing new application features or changes into production quickly, sometimes with the click of a button. Traditionally, data science model deployment has been a multi-step process that puts the onus on engineering: Engineers would refactor, rewrite, and test a data scientist’s model before slowly rolling it out, sometimes months after it was originally built.
Even at cutting-edge companies like Uber, this was a recent reality: Until this year, Uber's data scientists couldn't train models that required more power than was available on their local machines, and the company couldn't easily deploy a model into production. Instead, engineers had to create a custom container for every model deployment, which limited the company's ability to scale data science work.
Uber overcame these hurdles by building an internal machine learning platform that automatically partitions training data, re-trains and deploys models via an API, and loads new models from disk into containers so they can start handling prediction requests. Of course, building a platform from scratch isn’t practical for most companies. Increasingly, data science platforms are filling this void with features such as the ability to deploy models as APIs or schedule code runs (the DataScience.com Platform offers both of these capabilities).
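As an illustration of what "deploy a model as an API" means at its simplest — not the actual interface of Uber's platform or any commercial product — a prediction endpoint can be sketched with nothing but Python's standard library. The linear model here is a hypothetical stand-in for whatever trained model is loaded at startup:

```python
# Hedged sketch: exposing a trained model as a JSON-over-HTTP prediction API.
# A platform or web framework would normally handle this plumbing.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in model: a fixed linear score with a 0.5 decision cutoff."""
    score = 0.4 * features[0] + 0.6 * features[1]
    return {"score": score, "label": int(score >= 0.5)}

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the JSON request body and run it through the model.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        result = predict(payload["features"])
        body = json.dumps(result).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # suppress per-request console logging

def serve(port=8000):
    """Serve prediction requests until interrupted."""
    HTTPServer(("localhost", port), PredictHandler).serve_forever()
```

Once a model sits behind an endpoint like this, applications can request predictions over HTTP, and the model can be retrained and redeployed without touching the callers.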
Continuous Operations: Achieving Zero Downtime
System downtime costs companies an estimated $700 billion annually. Continuous operations, an approach to data systems management in which certain components are expected to be up and running at all times, strives to resolve this costly problem. For IT teams managing the systems that support models in production and data science environments, the ability to monitor and add resources as data science work expands — while maintaining system availability — is essential.
Zendesk recently took up the task of ensuring that the software and hardware infrastructure supporting Answer Bot, its customer-facing answer app powered by deep learning models, was scalable, reliable, and fault tolerant. Ultimately, the team settled on using an open source model serving system and Amazon Web Services EC2 for hosting, which allowed them to scale horizontally by adding EC2 instances as needed. The result is an application that answers customer questions with links to relevant content in 20 milliseconds, on average.
But that’s just one application. For IT teams managing the resources needed for every deployed model and data science environment across an entire company, a data science platform that offers cluster management features and the ability for IT to dictate the size of the resources made available to data science teams can go a long way toward achieving continuous operations. (Our platform also offers the ability to deploy multiple versions of a model into production at the same time — something Zendesk struggled to achieve with Answer Bot.)
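The core mechanic behind zero-downtime model updates can be sketched in a few lines: requests keep being served by the current model while a new version is loaded and validated, and the switch is a single atomic swap. The class and method names below are hypothetical illustrations, not any platform's actual API:

```python
# Illustrative sketch of a zero-downtime model swap: serving never pauses,
# because the new model is loaded before the live reference is replaced.

import threading

class ModelRegistry:
    """Holds the live model; swaps are atomic, so requests are never dropped."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version

    def predict(self, features):
        with self._lock:  # take a consistent snapshot of model + version
            model, version = self._model, self._version
        return {"version": version, "prediction": model(features)}

    def deploy(self, model, version):
        # Load and validate the new model *before* taking the lock,
        # then swap the live reference in one short critical section.
        with self._lock:
            self._model, self._version = model, version

registry = ModelRegistry(lambda x: x * 2, "v1")
print(registry.predict(3))            # served by v1
registry.deploy(lambda x: x * 3, "v2")
print(registry.predict(3))            # served by v2, with no downtime
```

Production systems add health checks, gradual traffic shifting, and rollback on top of this, but the principle is the same: the old version keeps answering until the new one is ready.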
Operationalizing Data Science with Agility
It’s becoming increasingly clear that the value of data science work lies in a company’s ability to operationalize it. But for many, getting data science models into production is an uphill battle: teams don’t code collaboratively, model deployment has to wait for engineering, and the resources needed to support models in production can be difficult to manage.