Taking a machine learning (ML) experiment from a laptop or data science lab to production is a task few people have experience with. Data scientists are frequently charged with this daunting task, since they understand the machine learning algorithm and likely proposed it in the first place.
This blog describes how to start a successful production ML operational lifecycle (MLOps) by moving from a promising ML experiment to a Minimum Viable Product (MVP) of the same algorithm in a production service. MVPs are common in product development because they get a product or service to customers quickly, with just enough features to make it viable and to drive usage-based feedback for the next version. In the ML context, an MVP helps isolate the critical needs of the production ML service and deliver them with the smallest possible effort.
The steps we describe for a production machine learning MVP are digested from hundreds of use cases and experiences with data scientists and organizations over the last few years. They are illustrated in the figure below:
Figure 1: Path from Developer Experiments to Production ML
Step 1: Identify your use case: What are you trying to do?
This may seem obvious, but the first step is to understand the minimum requirements of your business application and the gaps between those requirements and your experiment. For example, if your experiment assumes more input features than the business application can actually provide, that gap can cripple production. The best way to find such gaps is to define the ML application that will support your business application.
Some questions you will need to answer:
What business problem is this ML app going to help with?
What does the ML app need to predict? What inputs will it receive?
Is there enough data to train a model and measure effectiveness? Is this data suitably clean, accessible, etc.? The data used for the experiment may have been manually cleaned. The production training data will also need to be cleaned.
Are there initial experiments (in a developer/notebook/laptop environment) that show a promising algorithmic approach to deliver the necessary predictions/quality?
How does the ML application need to integrate with the business app (REST, Batch, etc.)?
Once you have these questions answered, you have a sketch of what your ML application will need in its MVP. This sets the stage for Steps 2 and 3.
Step 2: Starting state inventory: What do you have?
Once the use case has been identified, the next step is to gather the starting state so that you can map the journey to your destination. Typical characteristics of a starting state include the following artifacts from a developer grade prototype of the desired ML application:
A software program in a data scientist environment, e.g., a Jupyter notebook, an R developer environment, Matlab, etc. This code usually performs training of the initial (promising) ML model and experiment.
This code has been run against one or more sample datasets. These datasets may live in data lakes or databases that are part of the organization’s normal datacenter infrastructure, or they may still live on laptops and need to be moved into that infrastructure.
Training code has been run in these developer environments and sometimes sample models exist.
In many ways, this starting state resembles a software prototype in other (non-ML) domains. Like other software, the prototype code may not have been written with all the connectors, scale considerations, and hardening expected of a production version. For example, if the production version needs to read data from a cloud object storage repository while your experiment reads data stored on your laptop, an object store connector would need to be added to your production pipeline code. Similarly, if your experimental code exits upon an error, that is likely not acceptable for production. For more details on how to harden code for production, see the resources section at the end of this blog.
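As an illustration of this kind of hardening, here is a minimal Python sketch of replacing an exit-on-error pattern with retries and logging. The `read_fn` argument and the `flaky_read` source are stand-ins for whatever connector your production environment actually requires (an object store client, a JDBC reader, etc.):

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

def read_with_retries(read_fn, retries=3, backoff_s=1.0):
    """Call read_fn(), retrying on transient errors instead of exiting.

    read_fn is a placeholder for the production data connector.
    """
    for attempt in range(1, retries + 1):
        try:
            return read_fn()
        except IOError as exc:
            log.warning("read attempt %d/%d failed: %s", attempt, retries, exc)
            if attempt == retries:
                raise  # surface the failure to the orchestrator instead of exiting
            time.sleep(backoff_s * attempt)

# Example: a simulated flaky source that succeeds on the second attempt.
calls = {"n": 0}
def flaky_read():
    calls["n"] += 1
    if calls["n"] < 2:
        raise IOError("transient connection reset")
    return [0.1, 0.2, 0.3]

data = read_with_retries(flaky_read, retries=3, backoff_s=0.0)
```

The key production-readiness change is that a transient failure is retried and logged, and a persistent failure is raised to the caller (so an orchestrator can act on it) rather than killing the process.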
There are also ML-specific challenges. For example:
There may be models generated here that need to be imported to bootstrap the production pipelines.
ML-specific instrumentation may need to be added to the code, for example, to report ML statistics, generate ML-specific alerts, gather instrumentation (statistics from execution) for long term analysis, etc.
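As a sketch of what such instrumentation might look like, the following Python wrapper records ML-specific statistics (prediction count, mean, spread, and latency) around a prediction call. The `model_fn` and the JSON-to-stdout reporting are illustrative stand-ins for your real model and metrics backend:

```python
import json
import statistics
import time

def instrumented_predict(model_fn, features):
    """Wrap a prediction call with ML-specific instrumentation.

    model_fn is a stand-in for the real model; in production the record
    would go to a metrics store rather than stdout.
    """
    start = time.perf_counter()
    preds = [model_fn(x) for x in features]
    record = {
        "n_predictions": len(preds),
        "prediction_mean": statistics.mean(preds),
        "prediction_stdev": statistics.pstdev(preds),
        "latency_ms": (time.perf_counter() - start) * 1000,
    }
    print(json.dumps(record))  # emit for long-term analysis
    return preds, record

preds, stats = instrumented_predict(lambda x: 2 * x, [1.0, 2.0, 3.0])
```

Statistics like these are what later enable drift detection and ML-specific alerting, which ordinary application monitoring does not provide.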
Step 3: Define your Production MVP
Now you are ready to define your MVP: the basic first service that you will use in production. To do this, you need to determine your first production location, i.e., the first place where your code will run.
This depends heavily on your environment. A short-term option may be in the data center infrastructure you already have for other applications (non-ML, etc.). You may also have a longer-term view that includes integration with other aspects of your software and/or cloud and service strategy.
In addition to the first location, you will need to address the following:
Access to data lakes, etc., can be generic unless the organization (particularly an enterprise) has placed specific data access restrictions that apply to analytics usage.
Machine learning engines have to be installed (Spark, TensorFlow, etc.). If containers are used, this can be quite generic. If analytic engines are used, it can be highly ML-specific. All dependencies (what libraries need to be accessible for your pipeline to run, etc.) need to be found and included.
Analytic engines and containers will need to be sized to ensure a reasonable performance range for the initial test-and-tune, which will be iterative.
Upgrade processes need to be defined. For example, if you decide to upgrade your pipeline code and it needs new libraries not installed before, you will need to think about how that will be handled.
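One small, concrete piece of this dependency work can be automated: verifying at startup that the libraries your pipeline needs are installed at the expected versions. The sketch below uses Python's standard `importlib.metadata`; the package names you would pin are, of course, specific to your pipeline:

```python
from importlib import metadata

def check_dependencies(pins):
    """Verify pinned libraries are installed at expected versions.

    pins maps package name -> exact version string. Returns a list of
    human-readable problems (empty if everything matches).
    """
    problems = []
    for pkg, wanted in pins.items():
        try:
            found = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            problems.append(f"{pkg}: not installed (want {wanted})")
            continue
        if found != wanted:
            problems.append(f"{pkg}: found {found}, want {wanted}")
    return problems

# Example with a package name that will not be installed:
issues = check_dependencies({"definitely-not-installed-pkg": "1.0.0"})
```

Running a check like this at pipeline startup turns a silent version mismatch into an explicit, actionable error, which also makes the upgrade process mentioned above safer.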
Step 4: Prepare your Code for Production
Now you need to consider what code from your experiment (if any) needs to be used in production. If you are not planning to retrain models in production, the code you need to consider is only for inference. In this case, an easy solution may be to deploy a predefined inference pipeline provided by a vendor.
If you are planning to retrain in production or have a custom need that is not met by a pre-built inference pipeline, you will need to prepare your experimental code for production or build any new production functionality that was not in your experimental code. As part of this, you will need to consider the following:
Hardening for production (error handling, etc.).
Modularization for reuse.
Invoking connectors for retrieving and storing data to and from the production locations identified in Step 3.
Where will you keep the code (e.g., in Git)?
What instrumentation should you add to make sure you can detect and debug production issues with your models?
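To make the modularization point concrete, here is a minimal sketch of restructuring monolithic notebook code into reusable stages (ingest, preprocess, score) composed by a single entry point. The stage names and the trivial model are illustrative, not prescriptive:

```python
def ingest(source):
    """Data-access stage: in production this would call the connector chosen in Step 3."""
    return list(source)

def preprocess(rows):
    """Cleaning stage: drop missing values (a stand-in for real cleaning logic)."""
    return [r for r in rows if r is not None]

def score(model_fn, rows):
    """Inference stage: apply the model to each cleaned row."""
    return [model_fn(r) for r in rows]

def run_pipeline(source, model_fn):
    """Compose the stages; each can be tested, reused, and swapped independently."""
    return score(model_fn, preprocess(ingest(source)))

result = run_pipeline([1, None, 3], lambda x: x + 1)
```

With this shape, swapping the laptop data source for a production connector, or the experimental model for a retrained one, touches exactly one stage.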
Step 5: Build a Machine Learning Application
Now that you have all the code pieces ready, it’s time to build your ML application. Why is this different from just building the pipeline? To execute reliably in production, you also need mechanisms to orchestrate your pipeline and to manage and version models and other outputs. This includes such things as how to update your pipeline if you generate a new model, and how to push new code into production if you have improved your pipeline code.
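To illustrate what "manage and version models" means in practice, here is a deliberately simple file-based sketch of recording each model artifact alongside version metadata. A real deployment would use a model registry; this only shows what needs to be tracked per version (the directory layout and metric names are assumptions for the example):

```python
import json
import pathlib
import tempfile

def save_model_version(registry_dir, payload, metrics):
    """Store a model artifact plus version metadata in a file-based 'registry'."""
    registry = pathlib.Path(registry_dir)
    registry.mkdir(parents=True, exist_ok=True)
    version = len(list(registry.glob("v*"))) + 1
    vdir = registry / f"v{version}"
    vdir.mkdir()
    (vdir / "model.bin").write_bytes(payload)
    (vdir / "meta.json").write_text(json.dumps({"version": version, "metrics": metrics}))
    return version

def latest_version(registry_dir):
    """Return the highest version number currently in the registry."""
    versions = [int(p.name[1:]) for p in pathlib.Path(registry_dir).glob("v*")]
    return max(versions) if versions else None

with tempfile.TemporaryDirectory() as reg:
    save_model_version(reg, b"\x00", {"auc": 0.81})
    save_model_version(reg, b"\x01", {"auc": 0.84})
    newest = latest_version(reg)
```

Even this toy version makes the operational question explicit: when a retraining run produces a new model, something must record it, compare its metrics, and tell the serving pipeline which version to load.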
If you are using a production ML runtime, you can configure the ML application within the runtime and connect it to the code and other artifacts you created in Steps 1 through 4. Figure 2 shows an example ML application generated in the MCenter runtime from ParallelM.
Figure 2: An example ML application
Step 6: Deploy the Machine Learning Application in Production
Once you have an ML application, you are ready to deploy! To deploy, you will need to launch the ML application (or its pipelines) and connect them to your business application. For example, if you are using REST, your ML application will create a REST endpoint upon launch, and your business application can call it for any predictions (see Figure 3).
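As a self-contained sketch of that REST pattern, the following Python example exposes a stand-in model behind an HTTP endpoint using only the standard library, then calls it the way a business application would. The request/response payload shape and the trivial `predict` function are assumptions for illustration; a production runtime would generate the endpoint and load a real model:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def predict(features):
    """Stand-in model: a real service would load the model produced in Step 5."""
    return sum(features)

class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers["Content-Length"]))
        features = json.loads(body)["features"]
        response = json.dumps({"prediction": predict(features)}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(response)

    def log_message(self, *args):
        pass  # silence per-request logging for the example

server = HTTPServer(("127.0.0.1", 0), PredictHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

# The business application calls the endpoint like any REST service:
req = urllib.request.Request(
    f"http://127.0.0.1:{server.server_port}/predict",
    data=json.dumps({"features": [1.0, 2.5]}).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
server.shutdown()
```

The essential contract is the one described above: the ML application owns the endpoint, and the business application only needs its URL and payload format.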
Figure 3: ML application generates a REST service for use by a Business Application
Note that deployment can be considered the “completion” of the MVP, but it is in no way the end of your journey. A successful ML service will run for months or years and will need to be managed, maintained, and monitored during that time.
Depending on your choice of solution in Step 5, deployment may be automated or manual. MLOps runtime tools offer automated deployment. If you are running your ML application without one of these, you will likely need to write scripts and other software to help you deploy and manage the pipelines. You may also need to work with your IT organization to do this.
Step 7: Making it Better
Recall that in Step 3 you may have picked a short-term location to run your production ML MVP. Once the MVP is deployed in Steps 5 and 6, you may need a further step to review its results and revisit key infrastructure decisions. Now that the code is instrumented and running on at least the first infrastructure, you can compare and contrast alternatives and decide whether to improve.
Step 8: Continuous Optimization
Note that Steps 3 through 7 repeat continuously for the lifetime of the business use case served by this ML application. The ML application itself may be redefined, retuned, ported to new infrastructure, etc. You can see how your MVP is being used, gather feedback from your business, and improve accordingly.
What else is there to MLOps?
MLOps is the overall practice of deploying and managing a model in production. The steps above show how to get started with MLOps by deploying your first model. Once you have taken the steps above, you will have at least one ML application in production which you will then need to manage during its lifecycle. You will likely then need to consider other aspects of ML lifecycle management, such as governance to manage models and comply with any regulatory needs of your business, KPIs to assess the benefits your ML models are providing to your business application, etc.
We hope you found this post useful. Look out for similar posts on the topic of MLOps in the future. To learn more about Oracle AI, please visit here.