As Large Language Models (LLMs) continue to evolve and drive a wave of innovation, from copilots and chatbots to document summarization and autonomous agents, one question keeps surfacing in engineering conversations:

 

 

“How do we run these models reliably, securely, and at scale?”

 

One answer: NVIDIA NIM microservices. And when deployed on Oracle Kubernetes Engine (OKE), NIM microservices become a production-ready solution for Enterprise AI workloads, combining performance, portability, and flexibility in a fully containerized, GPU-accelerated architecture.


This post explores:
●    What NVIDIA NIM microservices are and why they matter
●    Why OKE is an ideal platform for running them
●    A step-by-step guide to deploying your first LLM inference endpoint with NIM on OKE

 

What are NVIDIA NIM Microservices?

NIM microservices are prebuilt containers provided by NVIDIA that simplify the deployment of AI foundation models for real-time inference. Each NIM exposes a standardized API and comes fully optimized for GPU acceleration.
Each NIM microservice includes:
●    A pre-configured, containerized runtime environment
●    Optimized model weights and inference engine 
●    A REST API or gRPC interface for easy integration
●    Support for various foundation models (Meta Llama, Mistral, etc.)
●    Performance enhancements via NVIDIA TensorRT-LLM and other NVIDIA acceleration libraries

These are production-grade containers designed to minimize cold start times, reduce overhead, and streamline enterprise AI deployments.
 

Why Deploy NIM on Oracle Kubernetes Engine (OKE)?

OKE provides a powerful, managed Kubernetes environment designed for high-performance, scalable AI workloads. When paired with OCI’s NVIDIA accelerated compute and networking stack, it becomes a robust platform for deploying NIM at scale.

Key benefits:

  • GPU-Optimized Infrastructure
    Run NIM on OCI GPU instances (NVIDIA H100, A10, A100, V100) for increased performance and throughput.

     
  • Autoscaling and Load Balancing
    OKE integrates seamlessly with OCI Load Balancers and supports autoscaling to match inference traffic in real time (see the sketch after this list).

     
  • Native Monitoring and Logging
    Easily observe your inference metrics using OCI Logging and Monitoring or integrate Prometheus/Grafana stacks.

     
  • Security and Networking
    Run NIM in isolated VCNs, use OCI Vault for secrets, and expose endpoints securely with ingress controllers or API gateways.

     
  • Cost-Effective Multitenancy
    Deploy multiple NIM within a single cluster and scale horizontally, optimizing GPU usage while maintaining separation of concerns.
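For the autoscaling point above, the sketch below shows the simplest possible form of horizontal scaling for a NIM workload. The deployment name is a placeholder and the CPU-based trigger is only illustrative; GPU-bound inference services are usually scaled on custom metrics such as request concurrency or queue depth.

# Hypothetical example: keep between 1 and 4 replicas, scaling on CPU utilization
kubectl autoscale deployment <nim-deployment-name> -n nim --min=1 --max=4 --cpu-percent=80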


Step-by-Step: Deploying a NIM on OKE

Let’s walk through a basic setup to deploy a Llama 3.1-based NIM on OKE.

Prerequisites

Before you begin, ensure you have:

  • Access to an OCI tenancy with proper permissions
  • A running OKE cluster (v1.26 or later) with GPU nodes
  • kubectl configured for your cluster
  • Helm installed
  • An NVIDIA NGC API key (follow NVIDIA’s instructions to generate one)
  • Docker or OCI container registry access (for custom container pulls if needed)
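A quick way to confirm the tooling prerequisites are in place (illustrative checks; exact versions will vary):

kubectl version --client
helm version
oci --version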

 

1. Create an OKE Cluster with GPU Nodes

You can create an OKE cluster via the OCI Console or CLI. Use GPU-enabled shapes, such as:

  • BM.GPU.H100.8
  • BM.GPU.A10.4
  • BM.GPU.A100.8

Ensure your node pool uses a GPU shape and that the “nvidia-container-runtime” is configured. Oracle’s GPU Marketplace images already have this pre-installed.

To begin, open the OCI navigation menu and go to Developer Services > Kubernetes Clusters (OKE).

OCI Cloud Console - NIM on OKE

 

Click Create cluster.

OCI Cloud Console - NIM on OKE

 

Make sure Quick create is selected and click Submit.

OCI Cloud Console - NIM on OKE

 

For this lab, we will pick a Public endpoint, the Managed node type, and Private workers for the worker nodes.

 

OCI Cloud Console - NIM on OKE

 

Select the node shape BM.GPU.A10.4 and set the node count to 2. You can adapt the shapes according to your needs; for further reference on compute shapes, visit the OCI Compute Shapes documentation.

 

OCI Cloud Console - NIM on OKE

 

The default boot volume size is 50GB, so we need to increase it to 500GB. Click Show advanced options, change the boot volume size to 500GB, and click Next.

 

OCI Cloud Console - NIM on OKE - Boot Volume

 

Click Create cluster.

 

OCI Cloud Console - NIM on OKE - Cluster

 

Check that the node pool status is Waiting for cluster. You should see the cluster being created; this takes around 10 to 15 minutes to complete.

Once the process is complete, click Access Cluster.

 

OCI Cloud Console - NIM on OKE - Cluster Access

 

Now you can use Kubernetes commands such as kubectl to manage your cluster.

You can run these commands locally using the OCI CLI, or use Cloud Shell.
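If you prefer to run kubectl locally, the Access Cluster dialog generates a command of roughly the following form; the cluster OCID and region below are placeholders, so copy the exact command from the console:

oci ce cluster create-kubeconfig --cluster-id <your-cluster-ocid> --file $HOME/.kube/config --region <your-region> --token-version 2.0.0 --kube-endpoint PUBLIC_ENDPOINT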

Cloud Shell provides:

  • An ephemeral machine to use as a host for a Linux shell, pre-configured with the latest version of the OCI Command Line Interface (CLI) and several useful tools
  • 5GB of encrypted persistent storage for your home directory
  • A persistent frame of the Console which stays active as you navigate to different pages of the console

OCI Cloud Console - NIM on OKE - Cluster Access

 

For this example, we will use Cloud Shell because of its simplicity and easy access to OCI resources.

Click Launch Cloud Shell, then copy and paste the provided command into Cloud Shell and press Enter.

 

NIM on OKE - Cloud Shell

 

Once in the Cloud Shell, run the following commands:

kubectl get nodes
kubectl get pods --all-namespaces

 

The output should list two nodes and the Kubernetes system pods.

NIM on OKE - Cloud Shell

By now you should have access to your OKE cluster.

 Check for taints that may prevent deploying to these nodes.

Kubernetes taints are a way to mark a node so that only specific pods can be scheduled onto it, preventing other pods from using that node unless they tolerate the taint.

 

kubectl describe nodes | grep -i taints

NIM on OKE - Cloud Shell

 

If you identified taints, remove the nvidia.com/gpu:NoSchedule taint with the following command.

kubectl taint nodes --all nvidia.com/gpu:NoSchedule-

NIM on OKE - Cloud Shell

 

When the cluster is ready, create the namespace in Kubernetes.

kubectl create namespace nim

 

Replace <insert your NGC_API_KEY> with your NGC API key, then paste the updated command into the Cloud Shell and press Enter.

export NGC_API_KEY=<insert your NGC_API_KEY>

 

2. Install the GPU Operator

The NVIDIA GPU Operator manages NVIDIA GPU resources in a Kubernetes cluster and automates the tasks needed to bootstrap GPU nodes.

Add the NVIDIA Helm repository:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia \
    && helm repo update

 

Install the GPU Operator into the nim namespace:

helm install gpu-operator nvidia/gpu-operator -n nim

NIM on OKE - GPU Operator

 

Check the pods 

kubectl get pods -n nim

 

The GPU Operator installed successfully within about two minutes.

NIM on OKE - GPU Operator
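With the GPU Operator running, each GPU node should now advertise allocatable nvidia.com/gpu resources to Kubernetes. An optional sanity check (the reported count depends on your node shape):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'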

3. Install the NIM Operator

The NIM Operator is a Kubernetes operator that manages the deployment and lifecycle of NVIDIA NIM microservices.

Install the NIM Operator:

helm install nim-operator nvidia/k8s-nim-operator -n nim

 

Check the pods.

kubectl get pods -n nim
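The NIM Operator manages NIM deployments through Kubernetes CustomResourceDefinitions. This walkthrough installs the LLM with the nim-llm Helm chart instead, but you can confirm the NVIDIA CRDs were registered with a quick check:

kubectl get crds | grep -i nvidia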

4. Create Secrets

Create two secrets of different types:
•    A docker-registry secret for pulling NIM container images from the NGC registry (nvcr.io).
•    A generic secret for accessing the NGC catalog.

kubectl create secret docker-registry ngc-secret \
  --namespace nim \
  --docker-server=nvcr.io \
  --docker-username='$oauthtoken' \
  --docker-password="$NGC_API_KEY"

kubectl create secret generic ngc-api-secret \
  --namespace nim \
  --from-literal=NGC_API_KEY="$NGC_API_KEY" \
  --from-literal=NVIDIA_API_KEY="$NGC_API_KEY"

 

Check that the secrets were created successfully in the Kubernetes cluster.

kubectl get secrets -n nim

 

5. Deploy the LLM NIM

NIM microservices are intended to run on NVIDIA GPUs, with the type and number of GPUs depending on the model. With the GPU Operator installed, you can use Helm to install your LLM.
Create a Helm values file that references the model-specific container. For example, for llama-3.1-nemotron-nano-8b-v1:

 

cat <<ENDEND > oke-nim-llama.yaml 
image:
  repository: nvcr.io/nim/nvidia/llama-3.1-nemotron-nano-8b-v1
  tag: 1.8.4
  pullSecrets:
    - name: ngc-secret

model:
  name: nvidia/llama-3.1-nemotron-nano-8b-v1
  ngcAPISecret: ngc-api-secret

service:
  type: LoadBalancer
ENDEND

Download the nim-llm helm chart from the NVIDIA NGC repository.

helm fetch https://helm.ngc.nvidia.com/nim/charts/nim-llm-1.3.0.tgz --username='$oauthtoken' --password=$NGC_API_KEY
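Optionally, inspect the chart’s default values before installing, so you can see exactly which settings the oke-nim-llama.yaml file above overrides:

helm show values nim-llm-1.3.0.tgz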

 

Run the install.

helm install my-nim nim-llm-1.3.0.tgz -f oke-nim-llama.yaml -n nim

NIM on OKE - Cloud Shell

 

Check the progress of the install.

 

kubectl get pods -n nim

NIM on OKE - Cloud Shell

 

When the pod status shows Running, the installation was successful.
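On its first start, the pod pulls the container image and typically downloads the model from NGC, so it can take several minutes to reach Running. Two optional ways to follow progress (the pod name is a placeholder; take it from the kubectl get pods output):

kubectl get pods -n nim -w
kubectl logs -f <nim-pod-name> -n nim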

Check what services are running.

 

kubectl get svc -n nim

 

Look for the service named my-nim-nim-llm and take note of its EXTERNAL-IP.
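To avoid copying the address by hand, you can capture it in a shell variable and reuse it in the curl commands that follow. This small convenience sketch assumes the load balancer publishes an IP address rather than a hostname:

EXTERNAL_IP=$(kubectl get svc my-nim-nim-llm -n nim -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo $EXTERNAL_IP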

NIM on OKE - Cloud Shell

 

6. Test the Inference API

Once the NIM is live, send a test prompt via curl or Postman. Replace <EXTERNAL_IP> with the EXTERNAL-IP of the my-nim-nim-llm service.
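Before sending a chat request, you can optionally confirm that the endpoint is up and check the exact model name it serves; NIM LLM containers expose the OpenAI-compatible /v1/models route, and the returned name is what goes in the model field of the requests below:

curl http://<EXTERNAL_IP>:8000/v1/models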

 

 

curl -X POST http://<EXTERNAL_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What should I do for a 4 day vacation in Greece?"}
    ],
    "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
    "max_tokens": 200
  }'

 

You should see a response like the one below.

 

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1752132000,
  "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Day 1: Athens—Acropolis, Acropolis Museum, Plaka. Day 2: Athens—Agora, food tour, Lycabettus at sunset. Day 3: Ferry to Hydra—car-free island, coastal walk, swim, seafood taverna. Day 4: Hydra morning hike, ferry back and fly out. Tips: buy combo ticket for sites, prebook ferry, carry cash for small tavernas. Budget: €120–€180/day mid-range; ferries €35–€60 each way."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 55,
    "completion_tokens": 140,
    "total_tokens": 195
  }
}

 

 

Let’s send another prompt asking “What is Oracle Autonomous Database?”

 

curl -X POST http://<EXTERNAL_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is Oracle Autonomous Database?"}
    ],
    "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
    "max_tokens": 150
  }'

 

LLM’s response.

 

{
  "id": "cmpl-xyz987",
  "object": "chat.completion",
  "created": 1752132500,
  "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Oracle Autonomous Database is a cloud-based database service that uses machine learning to automate key management tasks such as provisioning, tuning, backups, and patching. It is designed to run without human intervention, reducing administrative overhead while improving performance, reliability, and security. Available in configurations for transaction processing, data warehousing, or mixed workloads, it scales automatically and includes built-in security and analytics capabilities."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 19,
    "completion_tokens": 102,
    "total_tokens": 150
  }
}

 

As part of the prompt settings, you can tweak the system prompt, the model, and max_tokens. Let’s try updating the system prompt and increasing max_tokens.

 

curl -X POST http://<EXTERNAL_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "Act as a Software Engineer and respond to the questions in the context of a new solution development."},
      {"role": "user", "content": "What is Oracle Autonomous Database?"}
    ],
    "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
    "max_tokens": 2048
  }'

 

LLM’s response.

{
  "id": "cmpl-abc2000",
  "object": "chat.completion",
  "created": 1752133000,
  "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Oracle Autonomous Database (ADB) is a fully managed, self-driving cloud database service offered on Oracle Cloud Infrastructure (OCI). It leverages advanced machine learning to automate the full lifecycle of a database — including provisioning, configuring, securing, tuning, scaling, patching, and repairing — without human intervention. This eliminates much of the manual database administration work while improving reliability, security, and performance.\n\n**1. Architecture**\nThe architecture is built on Oracle Database technology running in the OCI environment, enhanced with AI/ML-driven automation. Key layers include:\n- **Compute Layer**: Dedicated or shared infrastructure, optimized for database workloads.\n- **Storage Layer**: Exadata storage servers with high-bandwidth, low-latency access to persistent data.\n- **Automation Layer**: ML models monitor workloads, optimize indexes, allocate CPU/memory dynamically, and detect anomalies.\n- **Security Layer**: Always-on encryption (TDE), automatic security patching, access controls, and audit logging.\n\n**2. Deployment Options**\n- **Autonomous Transaction Processing (ATP)**: Optimized for OLTP workloads, mobile/web apps, ERP, and CRM systems.\n- **Autonomous Data Warehouse (ADW)**: Optimized for analytics, BI, data lakes, and reporting.\n- **Autonomous JSON Database**: For applications requiring a JSON-native store with SQL integration.\n- **Autonomous Database on Dedicated Exadata Infrastructure**: For isolation, compliance, and predictable performance.\n\n**3. Key Features**\n- **Self-Driving**: Automates provisioning, scaling, backups, tuning, and patching.\n- **Self-Securing**: Automatically applies security updates, encrypts all data, and protects from malicious attacks.\n- **Self-Repairing**: Detects and recovers from failures automatically with minimal downtime (99.995% SLA).\n- **Elastic Scaling**: CPU and storage scale independently without downtime.\n- **Data Integration**: Native connectors for Oracle Analytics Cloud, GoldenGate, Data Integration, and 3rd-party BI tools.\n- **Multi-Model Support**: Relational, JSON, spatial, graph, blockchain tables.\n\n**4. Benefits**\n- **Reduced Operational Cost**: Eliminates most DBA repetitive tasks.\n- **Higher Reliability**: Machine learning reduces misconfiguration risk.\n- **Performance Optimization**: Automatic indexing and query plan adjustments.\n- **Security Compliance**: Meets standards like ISO 27001, SOC, PCI-DSS, HIPAA.\n\n**5. Use Cases**\n- Real-time analytics dashboards.\n- High-transaction e-commerce platforms.\n- IoT sensor data ingestion and analysis.\n- Financial fraud detection.\n- Government and regulated industries requiring high compliance.\n\n**6. Technical Specs (OCI ADB)**\n- Storage: Up to petabytes with Exadata scale-out storage.\n- Memory: Up to hundreds of GB per instance.\n- CPUs: Elastic OCPU model (per second billing).\n- Network: RDMA over Converged Ethernet (RoCE) for ultra-low latency.\n\nIn summary, Oracle Autonomous Database combines Oracle’s database expertise with cloud-scale automation to deliver a service that is faster to deploy, easier to maintain, and more secure than traditional database models, freeing up teams to focus on innovation instead of administration."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 1020,
    "total_tokens": 1065
  }
}
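Beyond max_tokens, the OpenAI-compatible endpoint also accepts standard sampling parameters such as temperature and top_p. A minimal example, assuming the same endpoint and model as above; a lower temperature makes the answer more focused and deterministic:

curl -X POST http://<EXTERNAL_IP>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Summarize Oracle Kubernetes Engine in two sentences."}
    ],
    "model": "nvidia/llama-3.1-nemotron-nano-8b-v1",
    "max_tokens": 120,
    "temperature": 0.2,
    "top_p": 0.9
  }'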
​​​​​​​

 

Key Takeaways

  • Enterprise-Grade AI at Scale: Combines NVIDIA GPU acceleration with OKE’s secure, managed infrastructure to reliably run LLMs and other AI models in production.
  • Pre-Built, Optimized Microservices: NIM microservices package foundation models, optimized inference engines, and standard APIs in ready-to-deploy containers, with no complex manual setup required.
  • Kubernetes-Native Deployment: OKE manages the full container lifecycle (scaling, high availability, upgrades), reducing operational overhead.
  • Performance Boost & Cost Efficiency: Delivers faster inference than open-source inference engines on the same infrastructure, with even greater gains for high-concurrency workloads.
  • Flexible Workload Support: Ideal for chatbots, copilots, semantic search, document analysis, computer vision, and more, supporting multiple frameworks and models.
  • Integrated OCI Services: Natively connects to OCI Data Science, Object Storage, API Gateway, and observability tools like Grafana and Prometheus.
  • Security & Compliance by Design: Encryption, network isolation, access control, and compliance with ISO, SOC, PCI, HIPAA, and other standards.
  • Faster Time-to-Value: Go from concept to production inference endpoints in hours, not weeks, with automated deployments and preconfigured stacks.

Final Thoughts

Running NIM on OKE brings together the best of NVIDIA and Oracle Cloud: performance-optimized model microservices deployed on enterprise-grade, scalable infrastructure.
Whether you’re deploying a GenAI chatbot, a document Q&A system, or powering a RAG pipeline, this architecture gives you:
●    Full control over models and infrastructure
●    The flexibility of Kubernetes with the speed of NVIDIA inference
●    Seamless integration into your broader OCI AI and data ecosystem
Ready to start? Visit cloud.oracle.com, provision your OKE cluster, and try deploying your first NIM today.

Acknowledgements

This blog post is the result of a strategic and hands-on collaboration between Oracle Cloud Infrastructure (OCI) and NVIDIA, two industry leaders working together to accelerate enterprise AI innovation. By combining OCI’s enterprise-grade infrastructure, the NVIDIA AI Platform, and cutting-edge LLMs, we’re enabling customers to build, deploy, and scale generative AI workloads with confidence and performance at the core.


This post is part of a broader enablement initiative jointly developed with our partners at NVIDIA.
A special thanks to the engineering teams at NVIDIA and OCI, whose deep expertise and ongoing collaboration have been instrumental in supporting our customers throughout their AI journey. Your contributions continue to shape what’s possible.

  • Alejandro Casas, OCI Product Marketing
  • Dimitri Maltezakis Vathypetrou, NVIDIA Developer Relations 
  • Anurag Kuppala, NVIDIA AI Solution Architect

Resources