UbiOps is a machine learning operations (MLOps) platform that simplifies the deployment and management of machine learning (ML) models. Oracle Cloud Infrastructure (OCI) is a leading provider of cloud services. In a recent collaboration, UbiOps and Oracle have worked together to make it easier to run ML workloads at scale on OCI with UbiOps.
Companies working on applications involving ML models often run into challenges when moving beyond the development phase. Challenges include the time required to convert the model into a production-grade application, scalability, and monitoring. As a result, ML projects often fail to get further than a proof of concept.
UbiOps is an MLOps solution on top of OCI that provides features out of the box to address those challenges: a single API for model inference and model training at scale, pipelines, logging and metrics collection, an extended user interface, a scheduler tailored to ML workloads, and more. This lets data scientists focus on getting AI solutions into production instead of dealing with the complexities of the underlying infrastructure.
The solution combines the UbiOps platform and the Oracle Container Engine for Kubernetes (OKE) for running inference and training workloads on OCI GPU-based instances. OKE is a fully managed Kubernetes solution from Oracle that significantly reduces the cost of setting up and maintaining Kubernetes. Using the cluster autoscaler, you can adjust the size of the cluster automatically in 5–10 minutes so that the infrastructure is right-sized to your workload. This elastic scaling offers high performance while saving cloud costs.
OCI offers a selection of shapes that you can use for ML workloads, like model training and inference. OCI Compute shapes equipped with NVIDIA GPUs are particularly well-suited for these jobs. The BM.GPU4.8 and BM.GPU.A100-v2.8 are bare metal shapes with eight NVIDIA A100 Tensor Core GPUs each, which provide the large computational power required for model training. For large workloads, you can aggregate these instances in an OCI supercluster with an RDMA bandwidth of 1.6 Tbps, which lets you scale distributed training to thousands of NVIDIA GPUs.
For smaller training workloads and for ML inference, the NVIDIA A10 Tensor Core GPU is available on OCI with bare metal and virtual machine (VM) shapes. For example, you can train and run inference on object detection models like YOLOv4 on OCI shapes with NVIDIA A10 GPUs. The result is a scalable, reliable, and cost-effective solution that can handle high volumes of ML workloads. Typically, the cold start of a model running with UbiOps on OCI takes only 5–10 minutes. Combined with an OCI cost of $2/hour (billed per second), this serverless solution can be very cost-effective for your ML workloads.
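To make the per-second billing argument concrete, the following minimal sketch compares an on-demand run with an always-on instance. It uses the $2/hour figure from the text; the workload durations, the assumption that the cold-start time is billed, and the function name `per_second_cost` are hypothetical and chosen only for illustration. Actual OCI pricing depends on the shape and region.

```python
def per_second_cost(hourly_rate_usd: float, seconds: int) -> float:
    """Return the cost of running for `seconds` at an hourly rate billed per second."""
    return hourly_rate_usd * seconds / 3600


HOURLY_RATE_USD = 2.0  # figure quoted in the text

# Hypothetical workload (assumption): a 10-minute cold start followed by
# 20 minutes of inference, after which the instance scales back down.
cold_start_seconds = 10 * 60
work_seconds = 20 * 60

# Assumption: the cold-start time is billed like any other running time.
on_demand_cost = per_second_cost(HOURLY_RATE_USD, cold_start_seconds + work_seconds)

# Baseline for comparison: the same instance kept running for a full day.
always_on_cost = per_second_cost(HOURLY_RATE_USD, 24 * 3600)

print(f"On-demand run (30 min): ${on_demand_cost:.2f}")
print(f"Always-on for 24 hours: ${always_on_cost:.2f}")
```

Under these assumptions, a workload that runs only occasionally pays for minutes of GPU time rather than a full day, which is where the savings from elastic scaling come from.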
If you’re already using OCI and interested in deploying machine learning models, contact UbiOps or Oracle for more information. The collaboration between UbiOps and Oracle Cloud Infrastructure has created a reliable, scalable, and cost-effective solution that can meet the needs of customers in various industries.
For more information, see the following resources: