Applications face dynamic demands. This challenge is no less true for AI/ML workloads that leverage GPU accelerators. OCI Kubernetes Engine (OKE) allows you to address this challenge and right-size your applications by horizontally scaling the pods in your cluster using data from the Nvidia GPU Device Plugin add-on.

Autoscaling GPU-Accelerated Workloads

Imagine a customer support chatbot that handles a large number of requests during the business day and is largely untouched overnight. Automatically scaling pods that use GPU accelerators can minimize the number of idle GPUs, and the costs associated with them, when traffic to the platform is low, as well as ensure fast response times, and the customer satisfaction that comes with them, when traffic is high. The long-standing solution for utilization-based scaling is the Horizontal Pod Autoscaler (HPA), a standard API resource in Kubernetes. You can use the HPA to automatically scale the number of pods in a deployment, replication controller, replica set, or stateful set based on that resource's utilization or on other metrics. The HPA relies on a metrics source, such as the Kubernetes Metrics Server, which by default only supports scaling based on CPU and memory utilization. Other implementations of the Kubernetes Metrics API support autoscaling on multiple and custom metrics, including the metrics relevant to AI/ML workloads running in pods that use accelerators.
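For reference, a default HPA that relies only on the Metrics Server scales on resource utilization. The following is a minimal sketch; the deployment name my-app and the 70% CPU target are placeholders, not values from this walkthrough:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                   # placeholder deployment name
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out when average CPU utilization exceeds 70%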

Metric Sources

In order to scale based on metrics relevant to AI/ML workloads, you must have a source emitting those metrics. In this example, the source is the Nvidia GPU Device Plugin add-on. This add-on is a convenient way to manage the NVIDIA Device Plugin for Kubernetes, an implementation of the Kubernetes device plugin framework that exposes the number of NVIDIA GPUs on each worker node and tracks the health of those GPUs. The add-on also includes the NVIDIA Data Center GPU Manager (DCGM), a suite of tools for managing and monitoring NVIDIA data center GPUs in cluster environments (see the NVIDIA DCGM page at developer.nvidia.com). You can use one or more of the metrics DCGM emits as the trigger for autoscaling.

To do so, we need the Prometheus stack, an open source, cloud native systems monitoring and alerting toolkit, which we can deploy as part of the OCI GPU Scanner. The OCI GPU Scanner provides comprehensive active performance checks and passive health checks for both NVIDIA and AMD GPUs on OCI. It executes periodic OCI-authored GPU health checks and pushes the results to a dedicated OCI GPU Scanner Portal, Prometheus, and Grafana dashboards. Once Prometheus and the Prometheus Adapter have been installed, you can use them in place of the Kubernetes Metrics Server and leverage DCGM metrics as a source for the HPA.

Setting Up HPA with DCGM Metrics

1. Create an OKE v1.34.1 cluster and deploy a node pool with GPU shapes, for example VM.GPU.A10.2. The Nvidia GPU Device Plugin add-on will be deployed by default when you provision GPU nodes. 
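Once the GPU nodes are active, you can confirm that the device plugin is advertising GPUs to the scheduler; the node name below is a placeholder for one of your worker nodes:

$ kubectl describe node <gpu-node-name> | grep nvidia.com/gpu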

2. Install the OCI GPU Scanner, which deploys the Prometheus stack, by following the Helm deployment guide: https://github.com/oracle-quickstart/oci-gpu-scanner/blob/main/GETTING_STARTED_HELM_DEPLOY.md
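Before continuing, you can confirm that the monitoring components are running. This assumes the scanner stack is deployed to the lens namespace, which is the namespace used in the remaining steps:

$ kubectl get pods -n lens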

3. Test your access to Prometheus and DCGM metrics: 

$ kubectl port-forward svc/lens-prometheus-server 9090:9090 -n lens

4. Visit http://localhost:9090 and query for DCGM_FI_DEV_GPU_UTIL. This is the metric we will use to trigger autoscaling.
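If you want to look at utilization aggregated across GPUs rather than per device, you can use a PromQL expression such as the sketch below; the Hostname label comes from the DCGM exporter and may be named differently depending on its configuration:

avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)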

5. Install the Prometheus adapter:

$ helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

$ helm repo update

$ helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter \
    --namespace lens \
    --set prometheus.url="http://lens-prometheus-server.lens.svc"

    Output:

    Release "prometheus-adapter" does not exist. Installing it now.
    NAME: prometheus-adapter
    LAST DEPLOYED: Wed Oct 22 13:21:08 2025
    NAMESPACE: lens
    STATUS: deployed
    REVISION: 1
    TEST SUITE: None
    NOTES:
    prometheus-adapter has been deployed.
    In a few minutes you should be able to list metrics using the following command(s):

    kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1

6. Verify the required metric is available:

$ kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq | grep DCGM_FI_DEV_GPU_UTIL
    
          "name": "services/DCGM_FI_DEV_GPU_UTIL",
          "name": "nodes/DCGM_FI_DEV_GPU_UTIL",
          "name": "jobs.batch/DCGM_FI_DEV_GPU_UTIL",
          "name": "namespaces/DCGM_FI_DEV_GPU_UTIL",
          "name": "pods/DCGM_FI_DEV_GPU_UTIL",

7. Run a workload to generate load on a GPU node. Apply the following YAML to create a deployment with one pod that generates load against a single GPU (an example kubectl apply command follows the manifest). You can then check the GPU utilization using Prometheus.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: cuda-vectoradd-load
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cuda-vectoradd-load
  template:
    metadata:
      labels:
        app: cuda-vectoradd-load
    spec:
      restartPolicy: Always
      containers:
      - name: cuda-vectoradd
        image: "k8s.gcr.io/cuda-vector-add:v0.1"
        command: ["/bin/bash", "-c"]
        args: ["while true; do ./vectorAdd; done"]
        resources:
          limits:
            nvidia.com/gpu: 1
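Save the manifest to a file and apply it; the filename here is an assumption:

$ kubectl apply -f cuda-vectoradd-load.yaml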

8. Verify the pod was created:

$ kubectl get pod | grep cuda-vectoradd
    
    Output: 
    
    cuda-vectoradd-load-69cdf9669d-t8bq2                      1/1     Running   0              43m

9. Set up the Horizontal Pod Autoscaler. In this example, the HPA is configured to scale up the number of replicas if the average value of DCGM_FI_DEV_GPU_UTIL exceeds 5% (commands to apply and inspect the HPA follow the manifest).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cuda-vectoradd-load-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cuda-vectoradd-load
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: 5
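Save and apply the HPA manifest, then check that it can read the metric; the filename here is an assumption:

$ kubectl apply -f cuda-vectoradd-load-hpa.yaml

$ kubectl get hpa cuda-vectoradd-load-hpa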

10. Verify that pods have been automatically added based on the metric from DCGM:

$ kubectl get pod | grep cuda-vectoradd
    
    Output:
    
    cuda-vectoradd-load-69cdf9669d-82l9t                      0/1     Pending   0              62s
    cuda-vectoradd-load-69cdf9669d-lqz4v                      0/1     Pending   0              62s
    cuda-vectoradd-load-69cdf9669d-pffg9                      1/1     Running   0              4m48s
    cuda-vectoradd-load-69cdf9669d-t8bq2                      1/1     Running   0              16m
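Replicas beyond the number of free GPUs in the node pool remain Pending until capacity becomes available; pairing the HPA with the OKE Cluster Autoscaler, or adding GPU nodes, lets those pods schedule. You can inspect the scaling events and the observed metric value with:

$ kubectl describe hpa cuda-vectoradd-load-hpa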

Conclusion 

The ability to autoscale GPU workloads is critical to meet performance goals as well as optimize costs. This walkthrough demonstrates one way in which you can use the Nvidia GPU Device Plugin add-on from OCI Kubernetes Engine (OKE) and common open source telemetry tools to scale pods based on custom metrics relevant to AI/ML workloads.