With huge worldwide demand for generative AI, planning the required compute capacity is crucial. While NVIDIA A100 and H100 Tensor Core GPUs offer great performance for large-scale LLM deployments, they can be complemented with mainstream GPUs such as the T4, P100, and A10 for smaller-scale deployments.
With the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) also allows customers to bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When running bring-your-own models purely on OCI, you might need to benchmark and optimize by running the LLMs on mainstream NVIDIA GPU-accelerated OCI servers. This blog details how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can be used to run a wide range of inferencing scenarios with open source LLMs.
Benchmarking parameters
The following parameters influence the inferencing test scenarios and results:
- Generative AI model specifications: model type and size
- GPU specifications: GPU model and number of GPUs
- CPU specifications: CPU type and number of CPUs
- Maximum context window
- Performance optimizations
- Quantized and unquantized models
- Different transformer implementations, such as a baseline transformer, a transformer with KV cache optimization and PagedAttention, and a transformer with FlashAttention
- Performance, measured in tokens per second (see the sketch below)
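As an illustration of how throughput can be measured, the following sketch times a single completion with the llama-cpp-python bindings for llama.cpp and reports tokens per second. It is a minimal example, not the exact harness used for the results below; the model path, context size, and prompt are assumptions to adjust for your own setup.

```python
# Minimal throughput sketch using the llama-cpp-python bindings for llama.cpp.
# The model path, context size, and prompt are placeholders.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="fin-llama-33b.Q4_0.gguf",  # any GGUF file from the tables below
    n_gpu_layers=-1,   # offload all layers to the available GPUs
    n_ctx=2048,        # maximum context window under test
    verbose=False,
)

prompt = "Explain the difference between bare metal and virtual machine servers."

start = time.time()
result = llm(prompt, max_tokens=256)
elapsed = time.time() - start

generated_tokens = result["usage"]["completion_tokens"]
print(f"Generated {generated_tokens} tokens in {elapsed:.1f}s "
      f"({generated_tokens / elapsed:.1f} tokens/second)")
```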
Testing environment
The following server configurations are used for the benchmarking:
- OCI server types and specifications
  - GPU-accelerated bare metal
    - Intel Xeon Platinum 8358 CPU @ 2.60GHz (128 cores)
    - Four NVIDIA A10 Tensor Core GPUs, each with 24GB GDDR6 memory
    - 1TB RAM
  - GPU-accelerated VM
    - Intel Xeon Platinum 8358 CPU @ 2.60GHz (60 cores)
    - Two NVIDIA A10 GPUs, each with 24GB GDDR6 memory
    - 480GB RAM
  - GPU-accelerated Roving Edge Device (RED)
    - Intel Xeon Gold 6230T CPU @ 2.10GHz (32 cores)
    - One NVIDIA T4 GPU with 16GB GDDR6 memory
    - 512GB RAM
The following LLM models (quantized and unquantized versions) are used for this benchmarking exercise:
- Llama 2 models (7B, 13B, and 70B)
- Llama 2 HF models (7B, 13B, and 70B)
- Llama 3 models (8B and 70B)
- fin-llama-33B
Single-server, single-user inferencing tests
The following table shows the results of tests run on fin-llama models using llama.cpp on a single OCI bare metal server.
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf | | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf | | BM with 4 A10s | A10 | 4 | 28.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf | | BM with 4 A10s | A10 | 4 | 29 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf | | BM with 4 A10s | A10 | 4 | 28.4 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 30.9 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf | | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf | | BM with 4 A10s | A10 | 4 | 28.5 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf | | BM with 4 A10s | A10 | 4 | 28.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf | | BM with 4 A10s | A10 | 4 | 27.7 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf | | BM with 4 A10s | A10 | 4 | 27.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf | | BM with 4 A10s | A10 | 4 | 28 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf | | BM with 4 A10s | A10 | 4 | 25.1 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf | | BM with 4 A10s | A10 | 4 | 23.5 |
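To reproduce a run at a particular quantization level, the corresponding GGUF file can be pulled from Hugging Face before loading it with llama.cpp. The snippet below is a sketch only; the repository ID and filename follow the naming seen in the table and in the TheBloke GGUF repositories and should be verified before use.

```python
# Sketch: download one GGUF quantization variant from Hugging Face.
# Repository ID and filename are assumptions based on the table above.
from huggingface_hub import hf_hub_download

local_path = hf_hub_download(
    repo_id="TheBloke/fin-llama-33B-GGUF",   # assumed repo id, verify on huggingface.co
    filename="fin-llama-33b.Q4_0.gguf",      # quantization level under test
)
print(f"Model downloaded to {local_path}")
```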
The following table shows the results of tests run on Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED).
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-7b | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf | RED | T4 | 1 | 51.9 |
| Llama-2-13b | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | RED | T4 | 1 | 28.6 |
| Llama-2-70b | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | RED | T4 | 1 | 1.6 |
The following table shows the results of tests run on quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | | BM with 4 A10s | A10 | 4 | 11.2 |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit–1g-actorder_True (huggingface.co) | BM with 4 A10s | A10 | 4 | 10.5 |
| Llama-2-70B-Chat-AWQ | llama.cpp, AWQ | | BM with 4 A10s | A10 | 4 | 13.6 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf | | BM with 4 A10s | A10 | 4 | 17.5 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 19.2 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf | | BM with 4 A10s | A10 | 4 | 17.9 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf | | BM with 4 A10s | A10 | 4 | 16.8 |
Single-server, multiuser concurrency inferencing tests
The following table shows the results of concurrent-user tests run on Llama 2 and fin-llama models using llama.cpp on a single OCI bare metal server (a client-side load-generator sketch follows the table).
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Concurrent users | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|---|
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 5 | 10.3 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 10 | 8.7 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 5 | 22 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | | BM with 4 A10s | A10 | 4 | 10 | 10.2 |
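A concurrency test of this kind can be approximated with a simple client-side load generator. The sketch below is an illustration only, not the harness used for the numbers above: it assumes a llama.cpp HTTP server is already running locally and exposes its native /completion endpoint, and the URL, prompts, and token counts are placeholders.

```python
# Sketch of a client-side load generator for a multiuser concurrency test.
# Assumes a llama.cpp HTTP server is already running and serving /completion.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

SERVER_URL = "http://127.0.0.1:8080/completion"  # assumed llama.cpp server endpoint
CONCURRENT_USERS = 5
MAX_TOKENS = 128

def one_user(prompt: str) -> int:
    """Send one completion request; return the requested token budget."""
    resp = requests.post(
        SERVER_URL,
        json={"prompt": prompt, "n_predict": MAX_TOKENS},
        timeout=600,
    )
    resp.raise_for_status()
    # Rough approximation: assume the full budget was generated. A real run
    # should use the token counts reported by the server instead.
    return MAX_TOKENS

start = time.time()
with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    futures = [pool.submit(one_user, f"Question {i}: summarize GPU inferencing.")
               for i in range(CONCURRENT_USERS)]
    total_tokens = sum(f.result() for f in futures)
elapsed = time.time() - start

print(f"{CONCURRENT_USERS} concurrent users: "
      f"{total_tokens / elapsed:.1f} tokens/second aggregate")
```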
Distributed inferencing results on multiple servers
The following table shows the results of tests run on quantized Llama 2 models using llama.cpp across four OCI RED servers with the Message Passing Interface (MPI).
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-7b MPI run | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf | 4 REDs | T4 | 4 | 52.2 |
| Llama-2-13b MPI run | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | 4 REDs | T4 | 4 | 28.7 |
| Llama-2-70b MPI run | llama.cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | 4 REDs | T4 | 4 | 1.6 |
Memory calculation for unquantized Llama 70B models
To run an unquantized Llama transformer model on A10 GPUs, the following memory calculation is used:
- Model type: Llama
- Model size: 70B parameters
- Total memory requirement: 70B parameters × 2 bytes (16-bit weights) = 140 GB
- Memory of one A10 GPU: 24 GB
- Usable memory of eight A10 GPUs: approximately 160 GB (after allowing for GPU memory overheads on each A10 GPU)
Based on this calculation, the unquantized Llama 70B model can run on two OCI bare metal servers with eight A10 GPUs using a distributed inferencing framework, such as torchrun, Ray, or MPI.
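The same sizing arithmetic can be expressed as a small helper, shown here as a rough sketch; the per-GPU overhead value is an illustrative assumption, not a measured figure, and the estimate covers model weights only.

```python
# Rough GPU-count estimate for serving an unquantized model, mirroring the
# calculation above. The overhead_gb value is an illustrative assumption.
import math

def gpus_needed(params_billion: float,
                bytes_per_param: int = 2,      # FP16/BF16 weights
                gpu_memory_gb: int = 24,       # NVIDIA A10
                overhead_gb: float = 4.0) -> int:
    """Return a minimum GPU count based on weight memory alone
    (KV cache and activations would add to this)."""
    model_gb = params_billion * bytes_per_param      # 70 * 2 = 140 GB
    usable_gb = gpu_memory_gb - overhead_gb          # ~20 GB usable per A10
    return math.ceil(model_gb / usable_gb)

print(gpus_needed(70))   # -> 7, so two bare metal servers with 4 A10s each (8 GPUs) fit
```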
The following table shows the test results of the Llama 2 70B unquantized model using two bare metal servers with four A10s each:

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-70b | Llama 2, 70B model, torchrun | | 2 BM servers | A10 | 8 | 8.8 |
The following table shows the test results of the Llama 70B unquantized model using four VM servers with two A10s each:
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-70b | Llama 2, 70B model, torchrun | | 4 VM servers | A10 | 8 | 4 |
The following table shows the test results of Llama 2 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Llama-2-7b | vLLM/PagedAttention/Ray | | 2 BM servers | A10 | 8 | 30.1 |
| Llama-2-13b | vLLM/PagedAttention/Ray | | 2 BM servers | A10 | 8 | 27.3 |
| Llama-2-70b | vLLM/PagedAttention/Ray | | 2 BM servers | A10 | 8 | 12.9 |
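For reference, a vLLM run like those in the table above can be expressed in a few lines of Python. The sketch below is illustrative rather than the exact benchmark script: it assumes a Ray cluster already spans the two bare metal servers, and the model name, prompts, and sampling parameters are placeholders.

```python
# Sketch of multi-GPU inference with vLLM (PagedAttention), spreading the model
# across 8 A10 GPUs via tensor parallelism. For two servers, a Ray cluster
# spanning both nodes is assumed to be running already.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",  # unquantized Hugging Face checkpoint
    tensor_parallel_size=8,             # 2 servers x 4 A10s
)

prompts = ["Explain PagedAttention in one paragraph."] * 4
sampling = SamplingParams(temperature=0.7, max_tokens=256)

start = time.time()
outputs = llm.generate(prompts, sampling)
elapsed = time.time() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/second")
```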
The following table shows the results of Llama 3 runs on OCI bare metal servers with A10 GPUs, using a distributed inferencing framework (torchrun in these runs):
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Meta-Llama-3-70B | Llama 3, 70B model, torchrun | | 2 BM servers | A10 | 8 | 12.44 |
| Meta-Llama-3-70B-Instruct | Llama 3, 70B model, torchrun | | 2 BM servers | A10 | 8 | 12.24 |
| Meta-Llama-3-8B | Llama 3, 8B model, torchrun | | 1 BM server | A10 | 1 | 27.10 |
| Meta-Llama-3-8B-Instruct | Llama 3, 8B model, torchrun | | 1 BM server | A10 | 1 | 27.04 |
The following table shows the test results of Llama 3 models using vLLM (with PagedAttention) on two bare metal servers with four A10s each:
| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
|---|---|---|---|---|---|---|
| Meta-Llama-3-8B | vLLM/PagedAttention/Ray | | 2 BM servers | A10 | 8 | 24.61 |
| Meta-Llama-3-70B | vLLM/PagedAttention/Ray | | 2 BM servers | A10 | 8 | 11.23 |
The following chart summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models with the transformer reference implementation and vLLM on A10-accelerated OCI VM and bare metal servers.
Conclusion
The above benchmarking exercises show that mainstream GPU-accelerated OCI servers (such as those with A10 GPUs) can be used for inferencing with open source large language models (LLMs) of different sizes. When the best performance is needed for larger-scale deployments, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which delivers great results, as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and the scale of the solution, one can start with smaller LLMs, such as 7B and 13B models, on mainstream GPU-accelerated servers and then migrate to larger clusters with advanced GPUs (A100s, H100s, and so on) as demand and model size increase. This scaling path helps customers adopt generative AI solutions more quickly.
Acknowledgments
The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavour. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).
Disclaimer
The benchmarking exercises published in this post are for general guidance only. Individual test results can vary based on model size, testing parameters, performance techniques, and the hardware and software stack used.
References
For further information, please visit the following links:
- Oracle Generative AI solutions: https://www.oracle.com/artificial-intelligence/generative-ai/
- Oracle GPU-accelerated bare metal servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu
- Oracle GPU-accelerated VM servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu
- Oracle Roving Edge servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf
- NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/
- llama.cpp source code: https://github.com/ggerganov/llama.cpp
- Meta Llama 2: https://github.com/meta-llama/llama
- Meta Llama 3: https://github.com/meta-llama/llama3
