Optimizing OCI AI Vision Performance with NVIDIA Triton Inference Server

March 21, 2023 | 10 minute read

For AI workloads that demand extremely low latency and/or very high throughput, there are no real replacements for a GPU.  Getting low inference latencies with modern deep learning models on a CPU can be quite challenging.  Oracle Cloud Infrastructure (OCI) customers like Twist Bioscience who have latency- or throughput-critical workloads can now take advantage of our recently launched, cost-effective OCI VM.GPU.GU compute shapes, powered by the NVIDIA A10 Tensor Core GPU.

In OCI AI Services, we strive to deliver value to customers by finding a compelling balance between inference latency and cost, without sacrificing accuracy.  To achieve this, OCI AI Services often rely on CPU to power inference workloads where latency and throughput are important, but not the foremost concerns. 

This post describes how OCI AI Services have adopted NVIDIA Triton Inference Server — available in the NVIDIA AI Enterprise software suite — as a key tool in the quest to deliver good performance on CPU. 

Background

Oracle provides turnkey services for performing a variety of AI tasks across several domains, including computer vision (e.g. object detection, image classification), document understanding (e.g. OCR, key-value extraction), natural language processing (NLP) (e.g. machine translation, named entity recognition), decision (e.g. anomaly detection, time-series forecasting), and speech (e.g. speech-to-text).  This post will focus on the AI Vision Service.

OCI Vision service’s object detection and image classification capabilities are powered by state-of-the-art (SOTA) deep learning models.  Models of this type typically rely on GPUs to provide fast inference times.  Our challenge was to find a way to host these models on CPU for our value-conscious customers who can tolerate some extra inference latency, while still providing an experience fast enough for real-time, synchronous usage.

Original serving implementation

Prior to adopting NVIDIA Triton Inference Server, our original solution for model serving used FastAPI, a popular Python web application framework, to run a custom Python inference application.  FastAPI's relative simplicity makes it easy to develop with and deploy.

We first converted our deep learning vision models into ONNX format, an open standard for model exchange, and executed them with ONNX Runtime (ORT), which takes advantage of hardware acceleration features on a variety of platforms.  We then wrote custom Python handlers to execute these models in FastAPI.
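As a rough illustration, the shape of such a handler looks like the following sketch.  The `run_onnx_model` function is a hypothetical stand-in for a real ONNX Runtime session call (e.g. `session.run`), so the example stays self-contained:

```python
import base64
import json

# Hypothetical stand-in for an ONNX Runtime session call, so the
# sketch runs without a real model.
def run_onnx_model(image_bytes: bytes) -> list:
    return [len(image_bytes) % 7]  # placeholder "scores"

def preprocess(request_body: str) -> bytes:
    # Decode the base64 image payload out of the JSON request.
    payload = json.loads(request_body)
    return base64.b64decode(payload["image"])

def postprocess(scores: list) -> dict:
    # Map raw model output to the response schema.
    return {"label_index": scores[0]}

def classify(request_body: str) -> dict:
    # The preprocess -> model -> postprocess chain that each
    # FastAPI endpoint ran per request.
    return postprocess(run_onnx_model(preprocess(request_body)))
```

In the real service, each of these steps ran inside a single FastAPI process, which is relevant to the scaling limitations discussed below.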

However, with the FastAPI solution, we encountered a few challenges:

  • Spiky latency and inconsistent CPU usage.
  • Difficulty scaling different parts of a FastAPI application independently.
  • FastAPI does not provide an out-of-the-box solution for emitting and collecting latency measurements at various points in an inference pipeline, so we had to build a custom solution for gathering metrics.
  • FastAPI has no native support for micro-batching.  Although this is less of a concern for CPU inference than it is with GPU, we wanted to structure our code to be able to run optimally on either type of processor. 

FastAPI also lacks some out-of-the-box MLOps capabilities:

  • FastAPI does not provide server metrics out of the box.  Additional packages must be installed to export standard server metrics, such as latency, throughput, etc.
  • Additional logic is needed to support custom metrics such as model level metrics.

Lastly, FastAPI is not an "opinionated" framework, which comes with both benefits and drawbacks.  On one hand, it is extremely flexible and does not make many impositions on coding style or organization.  On the other hand, this flexibility does little to encourage or promote best practices for code layout and design.  We will discuss later in this post how Triton helps with these issues.

Enter NVIDIA Triton Inference Server

We are constantly looking for ways to provide more value for our users.  Our latest partner in this pursuit is NVIDIA.  We were drawn to NVIDIA Triton Inference Server for its flexible feature set, including support for numerous deep learning and traditional machine learning frameworks, as well as platforms with both GPU and CPU. NVIDIA Triton Inference Server is included with NVIDIA AI Enterprise, a secure, end-to-end, cloud-native AI software platform that provides performance-optimized AI software and global enterprise support to streamline development and deployment of production AI.

OCI Vision performance gains

After porting our models to run on Triton, we experienced significant improvements to both latency and throughput.  The following tables show the results of performance testing via Locust on a cluster that mimics the configuration used in our production serving environment.  Throughput increased by 40-76% and latency decreased by 31-52%.  We have seen similar results in our actual production environment.

Image Classification

| Server  | Client Count | Requests Per Second (RPS) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) |
|---------|--------------|---------------------------|------------------|------------------|------------------|
| FastAPI | 1            | 2.5                       | 400              | 450              | 470              |
| Triton  | 1            | 4.4                       | 220              | 260              | 280              |
| FastAPI | 6            | 2.9                       | 2,000            | 2,700            | 2,800            |
| Triton  | 6            | 4.6                       | 1,300            | 1,300            | 1,400            |

With 1 concurrent request

  • Triton throughput is 76% higher than FastAPI’s
  • Triton latency is 42.2% lower than FastAPI’s

With 6 concurrent requests

  • Triton throughput is 58.6% higher than FastAPI’s
  • Triton latency is 51.9% lower than FastAPI’s

Object Detection

| Server  | Client Count | Requests Per Second (RPS) | Latency p50 (ms) | Latency p90 (ms) | Latency p95 (ms) |
|---------|--------------|---------------------------|------------------|------------------|------------------|
| FastAPI | 1            | 1.2                       | 869              | 1,000            | 1,100            |
| Triton  | 1            | 2.0                       | 491              | 610              | 660              |
| FastAPI | 3            | 1.5                       | 2,000            | 2,400            | 2,600            |
| Triton  | 3            | 2.1                       | 1,300            | 1,600            | 1,800            |

With 1 concurrent request

  • Triton throughput is 66.7% higher than FastAPI’s
  • Triton latency is 39% lower than FastAPI’s

With 3 concurrent requests

  • Triton throughput is 40% higher than FastAPI’s
  • Triton latency is 30.8% lower than FastAPI’s

Peeking under the hood

What is driving NVIDIA Triton Inference Server’s impressive performance gains over FastAPI for our models?  Here are a few key factors.

Implementation language

FastAPI is implemented in Python.  Beyond Python being an interpreted language, its Global Interpreter Lock (GIL) is a well-known bottleneck that prevents more than one thread from executing Python bytecode at a time.

In contrast, Triton is implemented in C++, a compiled language that supports true multi-threading.

Modular design

FastAPI provides a single executable server that can support multiple HTTP verbs and paths.  However, this simplicity means that it is also difficult to independently scale different parts of an inferencing pipeline, such as featurization, model execution, and post-processing.

Triton uses a modular serving architecture that provides different backend implementations for the execution of various AI and ML frameworks, including PyTorch, TensorFlow, ONNX Runtime, and custom Python.  These execution backends can be chained together to serve inference requests.  This allows featurization, model execution, and post-processing to be broken out into separate and independent modules.  Importantly, these modules can also be configured and scaled independently.  Does your workload have a heavyweight model inference step?  You can give it more instances or more runtime threads.  Is your post-processing super lightweight?  You can scale it down as appropriate.  This flexibility is simple to achieve thanks to Triton’s modular architecture.
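As an illustration, such a pipeline might be declared in a Triton ensemble config.pbtxt along the following lines; all model and tensor names here are invented for the sketch:

```
name: "vision_pipeline"
platform: "ensemble"
max_batch_size: 8
input [ { name: "RAW_IMAGE", data_type: TYPE_UINT8, dims: [ -1 ] } ]
output [ { name: "LABELS", data_type: TYPE_STRING, dims: [ -1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocess"   # e.g. a Python backend model
      model_version: -1
      input_map { key: "IMAGE" value: "RAW_IMAGE" }
      output_map { key: "TENSOR" value: "features" }
    },
    {
      model_name: "classifier"   # e.g. an ONNX Runtime backend model
      model_version: -1
      input_map { key: "INPUT" value: "features" }
      output_map { key: "SCORES" value: "scores" }
    },
    {
      model_name: "postprocess"
      model_version: -1
      input_map { key: "SCORES" value: "scores" }
      output_map { key: "RESULT" value: "LABELS" }
    }
  ]
}
```

Each composing model has its own configuration file, which is where per-step instance counts and resources are set independently.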

Async inferencing

In Triton, model inferencing is performed asynchronously: Triton uses separate executor pools for serving incoming requests and for performing model inferencing.  In contrast, FastAPI supports async calls only at the HTTP handler level.  Because model inferencing is usually compute intensive and often relies on synchronous libraries, even an async endpoint will block the main event loop during inference, preventing other requests from being processed.  Eventually, model inferencing and handling of incoming requests start to compete for compute resources.
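The difference can be sketched with stdlib asyncio: a synchronous model call inside an async handler blocks the event loop, while handing it off to a worker pool (loosely analogous to Triton’s separate executor pools) does not.  The model call here is a hypothetical stand-in:

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a synchronous, compute-bound model call.
def run_model(x: int) -> int:
    time.sleep(0.05)  # simulate inference work
    return x * 2

executor = ThreadPoolExecutor(max_workers=4)

async def handler_blocking(x: int) -> int:
    # A direct call inside an async endpoint still blocks the event
    # loop for the full duration of the inference.
    return run_model(x)

async def handler_offloaded(x: int) -> int:
    # Offloading to a worker pool keeps the event loop free to accept
    # and dispatch other requests while inference runs.
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, run_model, x)

async def main() -> list:
    # Four offloaded requests overlap instead of running serially.
    return await asyncio.gather(*(handler_offloaded(i) for i in range(4)))

print(asyncio.run(main()))  # [0, 2, 4, 6]
```

This is only a pattern sketch; Triton implements the separation natively in C++ rather than on top of a Python event loop.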

Other NVIDIA Triton benefits

NVIDIA Triton also supports a rich set of metrics for model serving, and the granularity is per model instead of per instance. This allows consistent and standardized monitoring across all models, out of the box.

Triton's modular architecture has a significant impact on the code structure and style.  A non-trivial benefit of using Triton is that it encourages separation of concerns.  By encouraging users to break apart steps like featurization, model execution, and post-processing into separate components, Triton can help to promote well-organized and modular code.

Triton also supports the KServe V2 Inference protocol, which helps to standardize integration across a variety of clients as well as frameworks. Additionally, Triton supports a wide variety of deep learning and traditional ML frameworks.
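For example, a KServe V2 inference request body can be assembled with nothing more than the standard library; the input name below is illustrative and would come from the model’s configuration in practice:

```python
import json

def v2_infer_request(input_name: str, data: list, datatype: str = "FP32") -> dict:
    # Request body per the KServe V2 inference protocol; it is POSTed
    # to /v2/models/<model_name>/infer on a V2-compatible server.
    return {
        "inputs": [
            {
                "name": input_name,       # must match the model's input name
                "shape": [1, len(data)],
                "datatype": datatype,
                "data": data,
            }
        ]
    }

# "input__0" is an invented example name.
body = json.dumps(v2_infer_request("input__0", [0.1, 0.2, 0.3]))
print(body)
```

Because the protocol is standardized, the same request shape works against any V2-compatible server, not just Triton.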

Suggestions for benchmarking

Here are some things we learned while doing our benchmarking and tuning.

Use a variety of test inputs

In our case, we used more than 5,000 images covering a variety of sizes, resolutions, and contents, to better reflect the diversity of real-world inputs.

Use Locust and a generic HTTP client for load testing

Locust is a popular open-source load testing tool that is widely used to test the performance of web applications and APIs. We used Locust to test the performance of NVIDIA Triton model serving. We chose to use a generic HTTP client to send test traffic instead of the Inference Client provided with Triton.  This is because the Triton Inference Client performs optimizations that would not necessarily be available to all clients of our serving backend.  Therefore, using a generic HTTP client provides us with the most realistic performance baselines.
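A minimal version of this idea, using only Python’s standard library as the generic HTTP client, might look like the following sketch (the URL and request bodies are supplied by the caller):

```python
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def percentile(samples: list, p: float) -> float:
    # Nearest-rank percentile over a list of latency samples (ms).
    ranked = sorted(samples)
    k = min(len(ranked) - 1, math.ceil(p / 100 * len(ranked)) - 1)
    return ranked[k]

def timed_request(url: str, body: bytes) -> float:
    # One POST through a plain stdlib HTTP client; returns latency in ms.
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000

def load_test(url: str, bodies: list, clients: int) -> dict:
    # Fan the request bodies out across `clients` concurrent workers
    # and summarize the observed latencies.
    with ThreadPoolExecutor(max_workers=clients) as pool:
        latencies = list(pool.map(lambda b: timed_request(url, b), bodies))
    return {p: percentile(latencies, p) for p in (50, 90, 95)}
```

Locust provides the same fan-out plus reporting and ramp-up control out of the box; the point is that the client side is deliberately generic HTTP, with none of the Triton Inference Client’s optimizations.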

Execute client and server on two different nodes, but in the same network

We ran the load test client and Triton servers in two different node pools within the same private subnet, to better isolate the computation while limiting the network overhead.

Use Triton scheduling instead of ORT Multi-threading

While configuring our ONNX Runtime (ORT) backends, we learned that Triton performs better when running multiple instances of an ONNX model, rather than a single instance of the ONNX model with multiple threads.  In other words, it appears that Triton’s scheduler can more efficiently utilize compute resources, as compared to ORT's multi-threading.
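For example, this preference can be expressed in a model’s config.pbtxt roughly as follows; the model name is illustrative, and the thread-count parameter keys are the ones documented for Triton’s ONNX Runtime backend:

```
name: "classifier"
platform: "onnxruntime_onnx"
instance_group [
  { count: 4, kind: KIND_CPU }   # four model instances for Triton to schedule
]
# Pin each ORT session to a single thread so the instances
# don't oversubscribe the available CPU cores.
parameters { key: "intra_op_thread_count" value: { string_value: "1" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
```

The right instance count depends on core count and model size, which is exactly the kind of search Model Analyzer (discussed below) can automate.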

Leverage OLive and Model Analyzer to optimize both ORT and NVIDIA Triton configuration

Both NVIDIA Triton and ORT generally give good performance out of the box.  However, both tools also provide many knobs for tuning performance: ORT lets you configure thread counts, execution mode, wait policy, and more, while Triton exposes an even broader set of configurable parameters.

It’s very tedious to try all permutations by hand to find the best configuration.  Instead, we used Microsoft OLive and NVIDIA Triton Model Analyzer to automate the exploration of runtime parameters, which saved a considerable amount of time and effort in finding an optimal or near-optimal runtime configuration.

Get started with OCI Vision today

 

Thomas Park

Thomas Park is an Architect for Oracle Cloud Infrastructure AI Services.

Simo Lin

Simo Lin is a Principal Software Engineer with Oracle Cloud Infrastructure.

