
Why Is It So Hard to Put Data Science in Production?

At Blue Yonder, our team has more than eight years of experience delivering and operating data science applications for retail customers. In that time, we have learned some painful lessons — including how hard it is to bring data science applications into production.

I am sure you know what data science is, but let me share with you my personal definition:

Data science aims to build systems that support and automate data-driven operational decisions.

According to my rather restrictive definition (you might even disagree with it), the only purpose of data science is to support and automate decisions. So, what are these “operational decisions” I am talking about? They are decisions that businesses need to make in huge numbers, on a frequent and regular basis, and that have a direct impact on business KPIs. The outcomes of these decisions also need to be evaluated in short periods of time. For example, a business might need to answer questions like: “What is the best price for each single product tomorrow?” or “What is the optimal amount for each single product for the next order sent to supplier X?”

As people are frequently influenced in ways they don't even recognize (see this long list of well-documented human decision biases), automated decisions can, in most cases, outperform human operational decisions and thereby significantly improve the efficiency of business processes.

All that really means is data science brings to operational decision-making what industrial robots bring to manufacturing. Just as robots automate repetitive, manual manufacturing tasks, data science can automate repetitive operational decisions.

What is DevOps and what does it have to do with data science?

The DevOps movement is aimed at overcoming a widespread problem in traditional IT organizations: the separation of developer and operations teams. The developer teams are eager to ship new features as early as possible. At the same time, the operations teams are responsible for system stability and will block new features as long as possible, because every change carries risk.

In this conflict, both teams lose sight of the common objective of delivering value to the customer with highly stable new features. (For a more detailed explanation, I recommend reading this article.) This chasm between developer and operations teams is only one example of organizational structure gone wrong; the same arguments hold for other groups that are divided by function. At many companies, data science is emerging in a similar “functional silo.”

Data Science, the Troublemaker

Here's a fictitious, but nonetheless realistic conversation between two managers at a conference: “Are you already doing this data science stuff?” one manager asks. The other replies, “We've had a team of data scientists in place for about a year now, but the progress is very slow.”

To better understand why progress in many data science efforts is slow, we need to look at a typical data science workflow for automating business decisions. The example workflow below is focused on the retail sector, but holds for other industries with only minor modifications (a minimal code sketch follows the list):

  • Pull all the necessary data from a variety of sources:

    • Internal data sources like ERP, CRM, and POS systems, or data from an online shop
    • External data like weather or public holidays

  • Extract, transform, and load the data into a form suitable for modeling

  • Machine learning and decision-making:

    • Use historic data to train the machine learning model
    • Use current, up-to-date data to compute the decisions

  • Load the resulting decisions back into the ERP system or another data warehouse
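To make the workflow concrete, here is a minimal sketch in Python. The data sources are stubbed with toy data, and all function names are placeholders for your own connectors, not a real API:

    import pandas as pd

    def load_internal_data() -> pd.DataFrame:
        # Stub: in reality, pulled from ERP, CRM, or POS systems.
        return pd.DataFrame({
            "store_id": [1, 1, 2, 2],
            "product_id": [10, 11, 10, 11],
            "date": pd.to_datetime(["2017-05-01"] * 4),
            "units_sold": [3, 7, 5, 2],
        })

    def load_external_data() -> pd.DataFrame:
        # Stub: external enrichment such as weather or public holidays.
        return pd.DataFrame({
            "date": pd.to_datetime(["2017-05-01"]),
            "is_holiday": [False],
        })

    def etl(internal: pd.DataFrame, external: pd.DataFrame) -> pd.DataFrame:
        # Join on identifiers the whole organization has agreed on.
        return internal.merge(external, on="date", how="left")

    features = etl(load_internal_data(), load_external_data())
    # Training a model on the historic rows and predicting on the current
    # rows would follow here, with the resulting decisions written back
    # into the ERP system.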

These steps, which touch essentially all parts of the business, need to be deeply integrated into business processes in order to create an effective decision-making system. This is by far the biggest source of trouble for data science efforts. In order to successfully integrate data science, one needs to transform and modify core business processes, which is a difficult task.

Data science is greedy by nature

“The current database should be sufficiently sized for the next year,” said no data scientist ever!

It’s commonly assumed that data scientists are greedy because they seem to have an unrealistic understanding of available resources. But really, it’s data science itself that is greedy by nature.

In general, the outcome of a data science project gets better with:

  • More features (“columns”)

  • More historic data (“rows”)

  • More independent data sources (e.g., weather, financial markets, social media…)

  • More complex algorithms (e.g., deep learning)

See? It isn’t the fault of data scientists! They are, in principle, right to make such requests. Luckily, there are ways to overcome resource shortages, as I will demonstrate later on.

Another issue is underestimating the sheer number of decisions. Consider daily demand forecasts for a rather small supermarket chain with 100 stores and 5,000 products. To be of any use to the replenishment algorithm, the forecasts need to cover a horizon of 14 days. That means 7 million forecasts need to be calculated, processed, and stored every day.
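The arithmetic behind that number is trivial, but making it explicit helps when sizing storage and compute:

    stores = 100
    products = 5_000
    horizon_days = 14  # forecast horizon needed by the replenishment algorithm

    forecasts_per_day = stores * products * horizon_days
    print(f"{forecasts_per_day:,} forecasts per day")  # 7,000,000 forecasts per day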

Furthermore, because many different data sources are needed to build an effective machine learning model, new coupling and entanglement may be introduced among departments. The whole organization must come together to agree on common identifiers and data types, and formerly disconnected subdivisions need to synchronize their data flows. For example, an automated daily replenishment system might depend on promotions data from the marketing department and stock data from all stores. All the necessary data needs to be available at a fixed time of day so the system can calculate decisions and send them to the suppliers in time. This entanglement is a big problem and can lead to serious political and emotional tensions within a company.

Data scientists vs. the rest of the company

Now back to DevOps. This movement is intended to overcome the potential misalignment of developer and operations teams, a problem that will inevitably arise if you try to build an automated decision system with a data science team in a separate silo. Because of the entangled and greedy nature of data science, a siloed data science team will have a very hard time integrating a system successfully “against” other teams that operate with different incentives.

To prevent or fix these problems, it's essential to embrace the fundamental principles of the DevOps mindset:

  • Align the objectives of all teams so that they are not working “against” each other, but working together towards a common goal

  • Tear down the walls between silos and build cross-functional teams

  • Measure improvements and allocate resources and features based on the measured added value for the customer

It is about commitment

Decision-making is at the heart of any company’s success. So, when introducing data science, the entire company — all hierarchies and divisions — needs to accept and appreciate that automated decision-making using data science is a serious part of the value stream. This most likely means that you need to change established processes, reorganize teams, and rethink the company’s structure. Additionally, to succeed with these changes, you need the necessary buy-in: Everyone needs to understand why the change is happening and back the decision. Without this wholehearted commitment, automated decision-making has no chance for successful integration.

In turn, your data science effort has to focus strongly on the true added value: one needs to evaluate the costs of implementation (including technical debt, increased complexity, increased entanglement, etc.) and compare them with the projected gains from the improvements. Data science is never an end in itself.

Tear down data science silos

One of the key goals of DevOps is aligning teams around common company objectives and tearing down the walls between siloed teams. Putting data scientists into a separate team in a separate room is a sure path to failure.

Instead, embed data scientists in a cross-functional team that builds the whole decision-making system end to end, in alignment with company goals. Once every department is aligned, the data scientists are no longer working against other departments; the success of the decision-making system becomes a shared interest. Global optimization through joint efforts towards a common goal replaces local optimization towards self-centered, unaligned goals.

This cross-functional team commits to the same quality standards as all other teams. There is no room for compromise on quality, resilience, or robustness; on the contrary, because of the high risk attached to automated decision-making, even higher standards should be applied. At the same time, following the “lean thinking” methodology, create an environment where it is simultaneously cheap and safe to experiment.

Fight greediness with Occam’s razor

There is a problem-solving principle called Occam’s razor, which says: “Among competing hypotheses, the one with the fewest assumptions should be selected.” Translated into the data science domain, we can reformulate this principle to:

If the outcomes of two data science models are compatible, take the one with the smaller resource footprint.

This simple rule gives us clear instructions on how to build data science models and keeps the inherent greediness of data science in check. Without measuring the generated value and applying this principle throughout the whole implementation cycle, you will likely face exploding costs with limited returned value. Make sure data scientists are committed to this principle: admittedly, it is very hard to argue against data scientists, because they have the data and the expertise to come up with arguments that are difficult to object to. Create a culture of efficiency where models are as simple as possible, but as complex as needed.
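One hedged way to encode that rule is sketched below; the two-standard-error compatibility test and the cost inputs are illustrative assumptions, not a prescribed method:

    import math

    def select_model(score_a, se_a, cost_a, score_b, se_b, cost_b, z=2.0):
        """Occam's razor for models: if the validation scores (higher is
        better) are compatible within z standard errors, prefer the model
        with the smaller resource footprint (cost)."""
        compatible = abs(score_a - score_b) <= z * math.hypot(se_a, se_b)
        if compatible:
            return "A" if cost_a <= cost_b else "B"
        return "A" if score_a > score_b else "B"

    # A deep model that is barely better but ten times as expensive loses:
    print(select_model(score_a=0.914, se_a=0.004, cost_a=1.0,
                       score_b=0.917, se_b=0.004, cost_b=10.0))  # prints "A"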

The same holds for the use of different data sources. In the domain of data security, there is the “need to know” principle, which states that data should only be accessible to people who actually need it. Applied to data science, it means we measure the value of adding a new data source, but rigorously purge it again if the improvement is not significant enough to justify the additional data dependency.
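As a sketch of what “measure, then purge” could look like, consider the helper below; the lift threshold is an assumption you would calibrate against the cost of the extra dependency:

    def keep_data_source(score_without: float, score_with: float,
                         min_lift: float = 0.002) -> bool:
        """Keep an extra data source only if it lifts the validation
        score by at least min_lift; otherwise purge it."""
        return (score_with - score_without) >= min_lift

    # A weather feed that adds only 0.1 percentage points gets purged:
    print(keep_data_source(score_without=0.912, score_with=0.913))  # False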

Summary

Data science is about supporting and automating decision-making, and it is becoming more important than ever for most companies. Because of its role as a decision-making system, data science has to sit at the core of business processes. This brings with it a whole set of serious problems, and some of them, especially those of a cultural nature, can be catastrophic. Half-hearted attempts waste time and money at best, and nourish data science's reputation as a troublemaker.

Properly integrated data science, however, is a game changer you cannot afford to ignore. Embrace data science with a DevOps mindset. Measure important KPIs, learn from experiments, and improve your processes accordingly over and over again. This is the path to becoming a truly data-driven company.

If you have comments or questions, feel free to reach out to me on Twitter.
