In a previous blog post, we discussed AMD Instinct MI300X accelerator performance serving Llama 2 70B, the largest and most popular generative AI (Gen AI) large language model (LLM) in the Llama family at the time. The LLM serving architectures and use cases remain the same, but Meta’s third generation of Llama brings significant enhancements to usability and accuracy. The new flagship model, Llama 3.1 405B, has 405 billion parameters and requires far more GPU memory and compute for inference. This blog post covers benchmarks for Llama 3.1 on Oracle Cloud Infrastructure (OCI) Compute with the MI300X shape (BM.GPU.MI300X.8).
Each BM.GPU.MI300X.8 shape comes with eight MI300X GPU accelerators, each with 192 GB of HBM3 memory and 5.3 TB/s of memory bandwidth, for a total of 1.53 TB of GPU memory per node; more details on the specifications can be found here. At FP8 precision, the 405 billion parameters occupy roughly 405 GB, so the model weights fit comfortably in a single node with ample headroom for the KV cache. This capacity helps when running larger LLMs and serving more concurrent users per inference request. We tested the performance of a single instance of the Llama 3.1 405B FP8 model on a single OCI bare metal instance, with all eight GPUs assigned to serving for the numbers reported below.
LLM serving performance
Latency measurements
Offline latency benchmarks were measured using the benchmark_latency.py script from vLLM. Each test used an input size of 128 tokens and an output size of 128 tokens. An example invocation follows the table.
| Batch size | Output throughput in tokens per second (TPS) | Average latency in seconds |
| --- | --- | --- |
| 1 | 36 | 3.530 |
| 2 | 71 | 3.583 |
| 4 | 137 | 3.734 |
| 8 | 258 | 3.974 |
| 16 | 468 | 4.375 |
| 32 | 798 | 5.130 |
| 64 | 1184 | 6.919 |
| 128 | 1585 | 10.340 |
| 256 | 1846 | 17.750 |
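For reference, a single latency data point can be collected roughly as follows. This is a sketch, not the exact command we ran: the model ID and script path are assumptions (substitute the AMD FP8 checkpoint and path in your container), and flag names follow vLLM 0.6.x.

```
# Sketch: one latency data point (batch size 32, 128 input and 128 output tokens)
# using vLLM's offline latency benchmark across all eight GPUs.
python benchmark_latency.py \
    --model amd/Llama-3.1-405B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 128 \
    --batch-size 32
```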

Throughput measurements
The following table shows offline benchmarks measured using benchmark_throughput.py from vLLM; an example invocation follows the table.
| Scenario | Input size in tokens | Output size in tokens | Output throughput in TPS |
| --- | --- | --- | --- |
| Typical chatbot online scenario | 128 | 128 | 1993 |
| Generation-heavy use case | 128 | 2048 | 2794 |
| Summarization use case with larger input token size | 2048 | 128 | 316 |
| Summarization- and output generation-heavy | 2048 | 2048 | 1807 |
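A throughput scenario can be reproduced along these lines. Again, this is a hedged sketch: the model ID, prompt count, and script path are assumptions, and flag names follow vLLM 0.6.x.

```
# Sketch: the chatbot scenario (128 input / 128 output tokens) with synthetic
# prompts, using vLLM's offline throughput benchmark across all eight GPUs.
python benchmark_throughput.py \
    --model amd/Llama-3.1-405B-Instruct-FP8-KV \
    --tensor-parallel-size 8 \
    --input-len 128 \
    --output-len 128 \
    --num-prompts 1000
```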
For both tests, we used the following configuration (a sketch of a matching container launch follows the list):
- Eight AMD MI300X GPUs in the US East (Ashburn) region on the BM.GPU.MI300X.8 shape
- AMD ROCm version 6.2 on the host
- Llama 3.1 405B FP8 quantized model from AMD
- Latest container and parameter recommendations from powderluv/vllm-docs: Documentation for vLLM Dev Channel releases (github.com)
- vLLM version 0.6.1
- Benchmarking scripts from vLLM for latency and throughput
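The setup can be launched roughly as follows. The container image name and model path are placeholders, and the serve invocation is only a sketch; consult the powderluv/vllm-docs recommendations above for the exact image, tags, and tuning flags.

```
# Sketch: start the ROCm vLLM container and serve the FP8 model across all
# eight GPUs. <rocm-vllm-image:tag> and the model path are placeholders.
docker run -it --rm \
    --device=/dev/kfd --device=/dev/dri \
    --group-add video --ipc=host \
    --security-opt seccomp=unconfined \
    -v /data/models:/models \
    <rocm-vllm-image:tag> \
    vllm serve /models/Llama-3.1-405B-Instruct-FP8-KV \
        --tensor-parallel-size 8
```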
Conclusion
LLM inference is the most common and widely adopted Gen AI use case within enterprises. These results underscore the efficiency of serving large LLMs in a single OCI Compute bare metal instance with eight AMD MI300X GPUs. A single instance of the 405B model can support up to 256 concurrent users while delivering over 7 TPS per user (1846 TPS / 256 users ≈ 7.2 TPS), well above the typical human reading speed of 5 TPS. You can also run multiple instances of smaller models, such as 70B or 8B variants, in a single Compute node (see the sketch below) to increase the total throughput of the node and better manage the compute costs of your Gen AI workloads.
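One hedged way to partition a node between two smaller models is to pin each vLLM instance to a subset of GPUs. The model IDs and ports are placeholders, and device selection via HIP_VISIBLE_DEVICES assumes your ROCm environment honors it; adjust to your container’s conventions.

```
# Sketch: a 70B model on four GPUs and an 8B model on one GPU, each served by
# its own vLLM instance on a separate port.
HIP_VISIBLE_DEVICES=0,1,2,3 vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 --port 8000 &
HIP_VISIBLE_DEVICES=4 vllm serve meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 --port 8001 &
```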
If you’re excited to repeat this experiment or try new scenarios, review the launch post or reach out to us. In our next post, we plan to share machine learning (ML) fine-tuning and training performance numbers with MI300X for both single-node and multi-node setups, covering the popular use case in which customers fine-tune an open-source foundation model on their own datasets. Stay tuned!
For more information, see the following references:
- AMD MI300X New OCI Shape GA Announcement
- Early LLM serving experience and performance results with AMD Instinct MI300X GPUs (oracle.com)
- AMD Instinct™ MI300X Accelerators


