Practical inferencing of open source models on mainstream GPU-accelerated OCI servers

April 30, 2024 | 26 minute read
Krishna Shanmughom
Master Principal Cloud Architect, Oracle

With the huge worldwide demand for generative AI, planning for the required compute capacity is crucial. While NVIDIA A100 and H100 Tensor Core GPUs offer great performance for large-scale LLM deployments, they can be complemented with mainstream GPUs, such as the T4, P100, and A10, for smaller-scale deployments.

With the well-engineered Oracle Generative AI services, Oracle Cloud Infrastructure (OCI) also allows customers to bring their own models (open source or custom) for inferencing on highly efficient OCI servers. When running bring-your-own models purely on OCI, you might need to benchmark and optimize by running the LLMs on mainstream NVIDIA-accelerated OCI servers. This blog details how mainstream GPU-accelerated OCI servers (both bare metal and virtual machine) can be used to run a wide range of inferencing scenarios with open source LLMs.

Benchmarking parameters

The following parameters influence the inferencing test scenarios and results:

  • Generative AI model specifications: Model type and size
  • GPU specifications: Model and number of GPUs
  • CPU specifications: CPU type and number of CPUs
  • Maximum context window
  • Performance optimizations
    • Quantized and unquantized models
    • Different serving implementations, such as a plain transformer, a transformer with KV cache optimization and paged attention, and a transformer with flash attention
  • Performance measured in terms of tokens per second (see the sketch below)
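
Throughput here is simply the number of generated tokens divided by the wall-clock generation time. A minimal sketch of that calculation is shown below; the `generate_fn` callable and the dummy backend are illustrative assumptions standing in for whichever inference stack is being measured, not the exact harness used in these tests.

```python
import time

def measure_throughput(generate_fn, prompt, max_new_tokens):
    """Return tokens/second for a single generation call.

    generate_fn is any callable taking (prompt, max_new_tokens) and returning
    the number of tokens actually generated -- a stand-in for whichever
    inference stack (llama.cpp, vLLM, torchrun) is being benchmarked.
    """
    start = time.perf_counter()
    tokens_generated = generate_fn(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return tokens_generated / elapsed

if __name__ == "__main__":
    # Dummy backend: sleeps to simulate inference latency and reports 128 tokens.
    def dummy_backend(prompt, max_new_tokens):
        time.sleep(1.0)
        return max_new_tokens

    tps = measure_throughput(dummy_backend, "Hello", 128)
    print(f"{tps:.1f} tokens/second")
```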

Testing environment

The following server configurations are used for benchmarking:

  • OCI server types and specifications
    • GPU accelerated bare metal
      • Intel Xeon Platinum 8358 CPU @ 2.60GHz (128 cores)
      • Four NVIDIA A10 Tensor Core GPUs, each with 24GB GDDR6 memory
      • 1TB RAM
    • GPU accelerated VM
      • Intel Xeon Platinum 8358 CPU @ 2.60GHz (60 cores)
      • Two NVIDIA A10 GPUs, each with 24GB GDDR6 memory
      • 480GB RAM
    • GPU accelerated Roving Edge Device (RED)
      • Intel Xeon Gold 6230T CPU @ 2.10GHz (32 cores)
      • One NVIDIA T4 GPU with 16GB GDDR6 memory
      • 512GB RAM

The following LLM models (quantized and unquantized versions) are used for this benchmarking exercise:

  • Llama 2 models (7B, 13B, and 70B)
  • Llama 2 HF models (7B, 13B, and 70B)
  • Llama 3 models (8B and 70B)
  • Fin-llama-33B

Single-server, single-user inferencing tests

The following table shows the test results for fin-llama models using llama.cpp on a single OCI bare metal server.

Table 1: Single-server, single-user inferencing tests, fin-llama

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q2_K.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_L.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q3_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.4 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 30.9 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.5 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q4_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33B-GGUF.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 29.2 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 27.7 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_M.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 27.6 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q5_K_S.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 28 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q6_K.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 25.1 |
| fin-llama-33B-GGUF | llama.cpp, GGUF, fin-llama-33b.Q8_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 23.5 |
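
For readers who want to reproduce this kind of single-node run, a minimal sketch using the llama-cpp-python bindings for llama.cpp is shown below. The model path, context size, and tensor split are illustrative assumptions, and the exact invocation used for the benchmarks above is not reproduced here.

```python
# A minimal sketch, assuming the llama-cpp-python bindings (built with CUDA
# support) and a locally downloaded GGUF file; the path, context size, and
# tensor split are illustrative, not the exact benchmark settings.
import time
from llama_cpp import Llama

llm = Llama(
    model_path="./fin-llama-33b.Q4_0.gguf",  # hypothetical local path
    n_ctx=2048,                              # context window
    n_gpu_layers=-1,                         # offload all layers to the GPUs
    tensor_split=[0.25, 0.25, 0.25, 0.25],   # spread weights across four A10s
)

prompt = "Summarize the outlook for fixed-income markets in one paragraph."
start = time.perf_counter()
out = llm(prompt, max_tokens=256)
elapsed = time.perf_counter() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{n_tokens} tokens in {elapsed:.1f}s -> {n_tokens / elapsed:.1f} tokens/s")
```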

 

The following table shows the test results for Llama 2 models using llama.cpp on a single Oracle Roving Edge Device (RED).

Table 2: Single-server, single-user inferencing of Llama 2 on RED

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | RED | T4 | 1 | 51.9 |
| Llama-2-13b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | RED | T4 | 1 | 28.6 |
| Llama-2-70b | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | RED | T4 | 1 | 1.6 |

The following table shows the test results for quantized Llama 2 70B models using llama.cpp on a single OCI bare metal server.

Table 3: Single-server, single-user inferencing of quantized Llama 2 70B models on OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 11.2 |
| Llama-2-70B-Chat-GPTQ | llama.cpp, GPTQ | TheBloke/Llama-2-70B-Chat-GPTQ at gptq-3bit--1g-actorder_True (huggingface.co) | BM with 4 A10s | A10 | 4 | 10.5 |
| Llama-2-70B-Chat-AWQ | llama.cpp, AWQ | TheBloke/Llama-2-70B-Chat-AWQ · Hugging Face | BM with 4 A10s | A10 | 4 | 13.6 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q3_K_L.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 17.5 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 19.2 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_K_M.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 17.9 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q5_K_M.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 16.8 |

Single-server, multiuser concurrency inferencing tests

 

The following table shows the test results for Llama 2 and fin-llama models using llama.cpp with concurrent users on a single OCI bare metal server.

Table 4: Single-server, multiuser inferencing of Llama on OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Concurrent users | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 5 | 10.3 |
| Llama-2-70B-Chat-GGUF | llama.cpp, GGUF, llama-2-70b-chat.Q4_0.gguf | TheBloke/Llama-2-70B-Chat-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 10 | 8.7 |
| fin-llama-33b | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 5 | 22 |
| fin-llama-33b | llama.cpp, GGUF, fin-llama-33b.Q4_0.gguf | TheBloke/fin-llama-33B-GGUF at main (huggingface.co) | BM with 4 A10s | A10 | 4 | 10 | 10.2 |
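
The concurrency figures above come from overlapping requests against a single served model. The sketch below shows one way to generate that kind of client-side load; the endpoint URL and JSON payload assume an OpenAI-compatible completion server in front of the model, which is an assumption rather than the exact harness used for these tests.

```python
# A client-side sketch for simulating N concurrent users against an assumed
# OpenAI-compatible completion endpoint; adapt the URL and payload to whatever
# server actually fronts the model.
import time
import requests
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://localhost:8000/v1/completions"  # hypothetical server address
CONCURRENT_USERS = 5
PROMPT = "Explain GPU memory bandwidth in two sentences."

def one_request(_):
    payload = {"model": "llama-2-70b-chat", "prompt": PROMPT, "max_tokens": 128}
    resp = requests.post(ENDPOINT, json=payload, timeout=600)
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENT_USERS) as pool:
    token_counts = list(pool.map(one_request, range(CONCURRENT_USERS)))
elapsed = time.perf_counter() - start

print(f"{sum(token_counts) / elapsed:.1f} aggregate tokens/s "
      f"across {CONCURRENT_USERS} concurrent users")
```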

 

Distributed inferencing results on multiple servers

The following table shows the test results for quantized Llama 2 models using llama.cpp across four OCI RED servers with the Message Passing Interface (MPI).

Table 5: Distributed inferencing of quantized Llama 2 on multiple OCI RED servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b, MPI run | Llama-cpp, ggml-model-q4_0.gguf | llama-2-7b.Q4_0.gguf · TheBloke/Llama-2-7B-GGUF at main (huggingface.co) | 4 REDs | T4 | 4 | 52.2 |
| Llama-2-13b, MPI run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q4_0.gguf | 4 REDs | T4 | 4 | 28.7 |
| Llama-2-70b, MPI run | Llama-cpp, ggml-model-q4_0.gguf | https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q4_0.gguf | 4 REDs | T4 | 4 | 1.6 |
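
These multi-RED runs distribute a single llama.cpp model across machines with MPI. The sketch below only wraps the launch step from Python; it assumes llama.cpp has been built with its MPI backend enabled and that a hostfile listing the four RED nodes exists, so the binary name, paths, and flag values are illustrative.

```python
# A minimal launch wrapper, assuming an MPI-enabled llama.cpp build, a hostfile
# listing the four RED servers, and a GGUF model available on each node; all
# paths and flag values here are illustrative assumptions.
import subprocess

cmd = [
    "mpirun",
    "-hostfile", "red_hosts.txt",   # hypothetical hostfile with the 4 RED nodes
    "-n", "4",                      # one rank per RED server
    "./main",                       # llama.cpp binary built with MPI enabled
    "-m", "llama-2-70b.Q4_0.gguf",  # quantized model shared or copied to each node
    "-p", "Explain retrieval-augmented generation briefly.",
    "-n", "128",                    # number of tokens to generate
    "-ngl", "99",                   # offload as many layers as fit on each T4
]

# Stream llama.cpp output, which includes its own tokens/second summary.
subprocess.run(cmd, check=True)
```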

 

Memory calculation for unquantized Llama 70B models

For running an unquantized Llama transformer model on A10s, the following memory calculation is used:

  • Model type: Llama
  • Model size: 70B
  • Total memory requirement: 70B parameters × 2 bytes (16-bit) = 140 GB
  • Memory of one A10 GPU: 24 GB
  • Usable memory across eight A10 GPUs: approximately 160 GB (after setting aside GPU memory overhead on each A10)

Based on this calculation, the unquantized Llama 70B model can run on two OCI bare metal servers with eight A10 GPUs using a distributed inferencing framework, such as torchrun, Ray, or MPI.
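
The same arithmetic can be generalized into a small helper, sketched below. The 2-bytes-per-parameter figure assumes FP16 or BF16 weights; KV cache, activations, and framework overhead are deliberately ignored, matching the rough estimate above.

```python
import math

def weights_memory_gb(num_params_billion, bytes_per_param=2):
    """Rough memory needed just for the model weights, in GB.

    bytes_per_param=2 assumes FP16/BF16 weights; KV cache, activations, and
    framework overhead are deliberately ignored.
    """
    return num_params_billion * bytes_per_param

def min_gpus(num_params_billion, gpu_mem_gb=24.0, bytes_per_param=2):
    """Lower bound on the number of GPUs needed to hold the weights alone."""
    needed = weights_memory_gb(num_params_billion, bytes_per_param)
    return math.ceil(needed / gpu_mem_gb)

print(weights_memory_gb(70))          # 140 GB of FP16 weights for a 70B model
print(min_gpus(70, gpu_mem_gb=24.0))  # 6 A10s for weights alone; 8 leaves headroom for KV cache and overhead
```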

Table 6: Distributed inferencing of unquantized Llama 2 70B model on 2 OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama 2, 70B model, torchrun | GitHub - meta-llama/llama: Inference code for Llama models | 2 BM Servers | A10s | 8 | 8.8 |

Figure 1: Inference run of unquantized Llama 70B model on two bare metal servers with eight A10s
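
For context, the torchrun-based result in Table 6 uses the Meta Llama reference implementation. A minimal sketch of the kind of generation script torchrun launches is shown below; the checkpoint and tokenizer paths are placeholders, and the multi-node launch flags are indicated only in the comment.

```python
# A minimal sketch of a script that torchrun launches for the distributed
# Llama 2 70B run, based on the meta-llama/llama reference code; checkpoint and
# tokenizer paths are placeholders. Launched roughly as:
#   torchrun --nnodes 2 --nproc_per_node 4 --node_rank <0|1> \
#            --master_addr <head-node-ip> generate.py
from llama import Llama

generator = Llama.build(
    ckpt_dir="llama-2-70b/",            # placeholder checkpoint directory
    tokenizer_path="tokenizer.model",   # placeholder tokenizer path
    max_seq_len=512,
    max_batch_size=4,
)

results = generator.text_completion(
    ["Oracle Cloud Infrastructure offers"],
    max_gen_len=128,
    temperature=0.6,
    top_p=0.9,
)
print(results[0]["generation"])
```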

 

The following table shows the test results of the unquantized Llama 2 70B model using four VM servers with two A10s each:

Table 7: Distributed inferencing of unquantized Llama 70B model on 4 OCI VM servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-70b | Llama 2, 70B model, torchrun | GitHub - meta-llama/llama: Inference code for Llama models | 4 VM Servers | A10s | 8 | 4 |

 

The following table shows the test results for Llama 2 models using the vLLM engine (with PagedAttention) on two bare metal servers with four A10s each:

Table 8: Distributed inferencing of Llama 2 using vLLM on 2 OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-2-7b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM Servers | A10s | 8 | 30.1 |
| Llama-2-13b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM Servers | A10s | 8 | 27.3 |
| Llama-2-70b | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM Servers | A10s | 8 | 12.9 |
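
The vLLM rows above use the engine's tensor-parallel mode, with Ray handling placement across the two servers. A minimal sketch of vLLM's offline inference API is shown below; the Hugging Face model identifier and sampling settings are illustrative, and the two-server case assumes a Ray cluster spanning both bare metal machines has already been started.

```python
# A minimal sketch using vLLM's offline inference API; the model name and
# sampling parameters are illustrative. For the two-server runs, a Ray cluster
# spanning both bare metal machines must already be up (ray start --head on one
# node, ray start --address=<head-ip>:6379 on the other) so that
# tensor_parallel_size=8 can place shards on all eight A10s.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # assumed Hugging Face model ID
    tensor_parallel_size=8,                  # one shard per A10 across both servers
)

sampling = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)
outputs = llm.generate(["What is paged attention in one sentence?"], sampling)
print(outputs[0].outputs[0].text)
```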

 

The following are the results of Llama 3 runs on two bare metal servers with eight A10 GPUs, using a distributed inferencing framework such as torchrun, Ray, or MPI.

Table 9: Distributed inferencing of Llama 3 using the transformer model on multiple OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-70B | Llama 3, 70B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 2 BM Servers | A10s | 8 | 12.44 |
| Meta-Llama-3-70B-Instruct | Llama 3, 70B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 2 BM Servers | A10s | 8 | 12.24 |
| Meta-Llama-3-8B | Llama 3, 8B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 1 BM Server | A10 | 1 | 27.10 |
| Meta-Llama-3-8B-Instruct | Llama 3, 8B model, torchrun | https://github.com/meta-llama/llama3/tree/main | 1 BM Server | A10 | 1 | 27.04 |

The following table shows the test results for Llama 3 models using the vLLM engine (with PagedAttention) on two bare metal servers with four A10s each:

Table 10: Distributed inferencing of Llama 3 using vLLM on 2 OCI bare metal servers

| Model type | Transformer model, quantization | Deployment config | OCI instance type | Accelerator type | Number of GPUs | Throughput across all GPUs (tokens/second) |
| --- | --- | --- | --- | --- | --- | --- |
| Meta-Llama-3-8B | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM Servers | A10s | 8 | 24.61 |
| Meta-Llama-3-70B | vLLM/PagedAttention/Ray | GitHub - vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs | 2 BM Servers | A10s | 8 | 11.23 |

 

The following chart summarizes the inferencing performance of unquantized Llama 2 and Llama 3 models, run with both the reference transformer implementation and vLLM, on A10-accelerated OCI VM and bare metal servers.

Figure 2: Inferencing of unquantized Llama 70B model on OCI BM and VM servers.

Conclusion

The benchmarking exercises above show that mainstream GPU-accelerated OCI servers (such as those with A10s) can be used for inferencing with open source large language models (LLMs) of different sizes. When the best performance is needed for larger-scale deployments, OCI offers advanced NVIDIA GPUs deployed with NVIDIA TensorRT-LLM, which delivers great results, as shown in the recent MLPerf Inference v4.0 benchmarks. Depending on the requirements and the scale of the solution, you can start with smaller LLMs, such as 7B and 13B models, on mainstream GPU-accelerated servers and then migrate to larger clusters with advanced GPUs (A100s, H100s, and so on) as demand and model size increase. This scaling path helps customers adopt generative AI solutions more quickly.

Acknowledgments

The author wants to thank Mohan Srinivasan, Sreedhara Narayanaswamy, Ram Sivaram, and Hiten Goradia for their guidance, leadership, and support in this endeavor. The author also wants to thank James George for his expertise in setting up the MPI cluster on Oracle Roving Edge Devices (RED).

Disclaimer

The benchmarking exercises published in this post are for general guidance only. Individual test results can vary based on model size, testing parameters, performance techniques, and the hardware and software stack used.

References

For further information, visit the following links:

Oracle Generative AI Solutions: https://www.oracle.com/artificial-intelligence/generative-ai/

Oracle GPU-accelerated bare metal servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#bm-gpu

Oracle GPU-accelerated VM servers: https://docs.oracle.com/en-us/iaas/Content/Compute/References/computeshapes.htm#vm-gpu

Oracle Roving Edge servers: https://www.oracle.com/a/ocom/docs/data-sheet-roving-edge-device.pdf

NVIDIA A10 GPUs: https://www.nvidia.com/en-au/data-center/products/a10-gpu/

llama.cpp source code: https://github.com/ggerganov/llama.cpp

Meta Llama 2: https://github.com/meta-llama/llama

Meta Llama 3: https://github.com/meta-llama/llama3

 

 

Krishna Shanmughom

Master Principal Cloud Architect, Oracle

Krishna Shanmughom is a Master Principal Cloud Architect in the Innovation CoE, Cloud Engineering, JAPAC team, with more than two decades of IT experience in software development, telecom, cloud, and deep learning technologies. His interests include cloud architecture, machine learning, deep learning, open source LLMs, GPU architecture, telecom, Kubernetes, and DevOps.