Since early 2024, we’ve been on a journey to help businesses make the best use of their CPU infrastructure for generative AI (GenAI) deployment. We’ve collaborated with Ampere Computing to optimize inference on Ampere Arm64-based CPUs through enhanced model weights, more efficient thread management, and new compute kernels. Together, we’ve delivered up to a 152% performance gain over the current upstream llama.cpp open source implementation, which translates to 158 tokens per second with 32 concurrent requests. We’ve also initiated a three-way partnership with Meta and Ampere to further optimize CPU performance on small parameter Llama models, including Llama 3.2 1B, 3B, and 11B, and across the whole PyTorch ecosystem. These collaborations culminated at Oracle CloudWorld 2024, where we were joined by industry leaders Amit Sangani, director of AI Partner Engineering at Meta, and Victor Jakubiuk, head of AI at Ampere Computing, who shared their insights on ways to maximize the value of CPUs for generative AI deployments.
In this blog post, we recap the main takeaways from the panel session and showcase a retrieval-augmented generation (RAG) inference use case using the Oracle Database 23ai vector database with a Llama 3 8B model running on an Ampere A1 instance on Oracle Cloud Infrastructure (OCI).
OCW session takeaways
- Customers should evaluate their infrastructure choice on four fronts: price-performance, availability, energy efficiency, and data privacy.
- Train on GPUs, serve on CPUs: Ampere CPUs are efficient for common scenarios such as large language model (LLM) serving and inference, RAG (an extension of inference), and vectorization/embeddings, which convert text data into numerical vectors (see the sketch after this list).
- For small-batch inference workloads (fewer than 16 concurrent requests per second) using small parameter (sub-15B) LLMs, CPUs are a strong infrastructure choice given their global availability, price-performance, and energy efficiency. Customers can deploy in their own tenancy on OCI with tight control over data security.
- 80–90% of real-world enterprise use cases involve small parameter (sub-15B) LLMs. Meta is committed to open source AI and delivers advanced models, such as Llama 3.1 8B with 128K context length and multilingual support, to the community. Developers can now run state-of-the-art models on both CPU- and GPU-based infrastructure. The recent launch of Llama 3.2 1B, 3B, and 11B once again validates Meta’s commitment.
- We’ve seen up to a 152% performance gain over the current upstream llama.cpp open source implementation, which translates to 158 tokens per second with 32 concurrent requests. This throughput is well above the average human reading speed of around 5 tokens per second, making CPUs a cost-efficient choice for both offline and real-time scenarios.
- Both the PyTorch framework and llama.cpp work out of the box on Ampere CPUs with excellent inference performance.
- We plan to continue driving performance efficiency of Ampere CPUs at scale through just-in-time compilation, quantization, and thread scaling. Meta and Ampere will continue their partnership using the PyTorch framework and investing in distributed computing and inferencing.
- To facilitate ease of deployment, Ampere Computing is looking to provide ready-to-use user interfaces, such as Ollama, and end-to-end enterprise software packages, including vector databases and RAG.
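To make the vectorization point above concrete, here is a minimal sketch of turning text into embedding vectors. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, which are illustrative choices rather than part of the stack discussed in the session.

```python
# Minimal embedding sketch: text in, dense vectors out.
# Assumes: pip install sentence-transformers (illustrative choice, runs fine on CPU).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly encoder
docs = [
    "Ampere A1 is an Arm-based flexible compute shape on OCI.",
    "RAG augments an LLM prompt with retrieved context.",
]
vectors = model.encode(docs)  # one dense vector per document
print(vectors.shape)          # (2, 384) for this model
```

These vectors are what a vector database, such as Oracle Database 23ai, stores and searches during the retrieval step of RAG.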
See it in action
In the video below, we show how to get started with an OCI Marketplace image and deploy an LLM inference web portal with ease.
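If you prefer to script the step shown in the video rather than use the console, the following sketch launches an Ampere A1 Flex instance from an image with the OCI Python SDK. All OCIDs and the availability domain are placeholders; substitute the Marketplace image OCID and your own compartment and networking details.

```python
# Hedged sketch: launch an Ampere A1 Flex instance with the OCI Python SDK.
# Assumes: pip install oci and a configured ~/.oci/config; all OCIDs below are placeholders.
import oci

config = oci.config.from_file()
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    compartment_id="ocid1.compartment.oc1..example",   # placeholder
    availability_domain="Uocm:PHX-AD-1",                # placeholder
    display_name="ampere-a1-llm-demo",
    shape="VM.Standard.A1.Flex",
    shape_config=oci.core.models.LaunchInstanceShapeConfigDetails(
        ocpus=4, memory_in_gbs=24
    ),
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example"             # Marketplace image OCID
    ),
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example"           # placeholder
    ),
)

instance = compute.launch_instance(details).data
print(instance.id, instance.lifecycle_state)
```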

Next, we launch the popular Ollama runtime serving the Llama 3 8B model on OCI’s Ampere A1. As the sample prompts show, you can expect low-latency, real-time responses from a CPU like Ampere A1.
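Once the Ollama server from the video is running, a quick way to reproduce a sample prompt and check throughput is to call its HTTP API. This is a minimal sketch that assumes Ollama is listening on its default port (11434) and the llama3 model has already been pulled.

```python
# Minimal sketch: send one prompt to a local Ollama server and report throughput.
# Assumes Ollama is running on localhost:11434 and `ollama pull llama3` has been run.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Summarize RAG in one sentence.", "stream": False},
    timeout=300,
)
data = resp.json()
print(data["response"])

# Ollama returns generation stats; eval_duration is reported in nanoseconds.
tokens_per_second = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"~{tokens_per_second:.1f} tokens/s")
```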

The final video illustrates how to use popular frameworks like Streamlit to create a custom chatbot agent that augments its answers with retrieved data using RAG.
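In the same spirit as the demo, here is a minimal Streamlit chat sketch that front-ends the local Ollama server. The retrieve_context() function is a hypothetical placeholder for the Oracle Database 23ai vector search step shown in the video.

```python
# Minimal Streamlit chat sketch over a local Ollama server (run with: streamlit run app.py).
# retrieve_context() is a hypothetical placeholder for the vector-database lookup.
import requests
import streamlit as st

def retrieve_context(question: str) -> str:
    # Stand-in for a real similarity search against the vector database.
    return "Ampere A1 is an Arm-based flexible compute shape on OCI."

st.title("RAG chatbot on Ampere A1")

if "history" not in st.session_state:
    st.session_state.history = []

# Replay previous turns so the conversation persists across Streamlit reruns.
for role, text in st.session_state.history:
    st.chat_message(role).write(text)

if question := st.chat_input("Ask a question"):
    st.chat_message("user").write(question)
    prompt = f"Context:\n{retrieve_context(question)}\n\nQuestion: {question}"
    answer = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": prompt, "stream": False},
        timeout=300,
    ).json()["response"]
    st.chat_message("assistant").write(answer)
    st.session_state.history += [("user", question), ("assistant", answer)]
```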

Conclusion
As LLM models get smaller and more advanced, CPUs like Ampere’s may become a cost-effective choice for running AI inference workloads, while GPUs continue to be the choice for training and fine-tuning. At Oracle, we help customers optimize their business operations through efficient use of compute infrastructure. The partnership between Meta, Ampere Computing, and OCI pushes the boundaries of AI compute and aims to democratize GenAI, using open source frameworks and helping to drive cost efficiency for our joint end customers.
To get started with Ampere on OCI, customers can launch the free custom OS image in the Oracle Cloud Marketplace with both Oracle Linux and Ubuntu support. The image is bundled with applications including a chat UI to help you deploy and validate open source LLMs, like Llama 3 8B, on an Ampere instance. For OCI Data Science customers, Ampere A1 compute shapes are now available in AI Quick Actions. We also offer a third-party solution through Wallaroo.ai if customers want a fully managed deployment and observability service. See Wallaroo’s Enterprise Plan and Team Plan on OCI Marketplace today.
To help with your validation, until December 31, 2024, we are offering free credits of up to three months of 64 cores of Ampere A1 and 360 GB of memory to customers who want to evaluate AI inference workloads on Ampere A1 flexible shapes. To be considered for this program, please reach out to your sales representative or sign up.
Additional links
- Central resource page: LLM Inference with Ampere-based OCI A1 Product Page (amperecomputing.com)
- OCI Blog series:
- Performance script: Benchmark scripts to reproduce the performance results and public access to optimized llama.cpp Ampere containers

