
Manuel Boira Cuevas
Senior Solutions Architect / Strategic Alliances, Sysdig
Artificial Intelligence and Machine Learning workloads are grounded in established software engineering and infrastructure principles. While AI and ML lifecycles introduce new operational constraints, they still execute as workloads on compute, storage, and networking platforms, fitting naturally within familiar IaaS, PaaS, and SaaS delivery models (AI is still a workload somewhere).
Organizations either deploy AI systems on their own infrastructure or consume models as managed services. In this article, we focus on AI applications deployed on Kubernetes and cloud infrastructure, with particular attention to Oracle Cloud Infrastructure (OCI) and Oracle Kubernetes Engine (OKE).
OCI is increasingly adopted for security, compliance, and data sovereignty reasons. Thanks to its cost efficiency and strong alignment with enterprise and regulatory requirements, OCI provides a solid foundation for AI and high-performance computing (HPC) applications. Features such as RDMA-enabled networking (high bandwidth, ultra-low latency) are especially relevant for highly demanding parallel computing workloads (financial, automotive, aerospace, biomedical, GenAI, big data).
Emerging attack surface
The AI threat surface spans a layered stack (physical CPU and GPU, virtualization layers up through models, data, inference, agents, applications, and APIs). MLOps platforms such as Kubeflow and MLflow manage model artifacts and training pipelines tightly coupled to shared data stores.
At runtime, inference engines such as vLLM and TensorRT-LLM execute with high privilege and sustained GPU access. In Kubernetes environments, stacks such as llm-d provide distributed serving primitives around model workers, while platforms such as NVIDIA Triton Inference Server provide a production-grade inference server for multiple model backends.
Above this, agent layers built with frameworks like LlamaIndex or LangChain dynamically connect models, tools, and data before exposing functionality through application and API layers. These layers are tightly interconnected; a weakness at any point can propagate upward, resulting in model theft, data exposure, or large-scale GPU abuse.
OCI Shared responsibility model
OCI and OKE provide a solid, well-managed platform, but attackers focus on what you deploy on top of it. Under OCI’s shared responsibility model, Oracle manages the control plane, while customers are responsible for application security and most data-plane operations. Below are examples of how this model works.
Core responsibilities
- Oracle fully manages and operates the Kubernetes control plane (API server, etcd, core controllers) as a managed service, including availability and CNCF-conformant behavior.
Shared operations
- Control plane upgrades are shared: Oracle releases supported versions and performs the upgrade, but you must initiate upgrades via Console/API/CLI.
- Data plane responsibilities are shared: Oracle supplies images and core data-plane components (kubelet, kube-proxy, flannel), and you manage worker nodes and workloads using those images.
Security and patching
- For control plane vulnerabilities, Oracle patches affected clusters; for data-plane vulnerabilities, Oracle provides patched images and you must roll them out to your nodes.
Support scope
- Oracle Support covers OKE-provided components and integrations (API server connectivity, cluster operations, CCM integrations, network components like CoreDNS/kube-proxy).
- Oracle Support excludes how-to Kubernetes usage, third-party software (e.g., Istio, Helm), unsupported Kubernetes versions, upstream bugs, and alpha features.
Application and networking ownership
- You are solely responsible for cluster networking configuration, application networking (LBs, ingress, network policies), observability (logs/metrics), app health/performance, security, and all workloads running in the cluster.
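As a concrete illustration of the application-networking responsibility above, a common baseline is a default-deny ingress policy per namespace, with traffic then allowed explicitly. This is a minimal sketch; the `inference` namespace name is a placeholder:

```yaml
# Default-deny ingress for every pod in the (hypothetical) "inference"
# namespace; follow-up NetworkPolicies must explicitly allow traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: inference
spec:
  podSelector: {}     # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress         # no ingress rules listed, so all ingress is denied
```

Applying and maintaining policies like this falls entirely on the customer side of the shared responsibility model.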
AI threat proliferation
Attacks are increasing in volume, sophistication, and impact. Over the past several months, many notable incidents have highlighted how rapidly threats are evolving:
July 2025 – LangFlow Server RCE vulnerability → Unauthenticated AI pipeline takeover.
July 2025 – Nvidia Container Escape → Container-to-host GPU escape.
Nov 2025 – ShadowRay 2.0 → AI inference server exploit and cloud malware.
Nov 2025 – Keras Supply Chain Vulnerability → ML dependency supply-chain abuse.
Jan 2026 – IBM Bob duped to run malware → Trusted AI agent compromise.
For deeper analysis and additional examples, see the Sysdig Threat Research Team content.
Lessons Learned
Most of these attacks execute inside running workloads, often entering through supply-chain weaknesses or zero-day exploits and escalating via over-privileged GPU runtimes, exposed inference services, or misconfigured data and vector stores. Consequently, we must pay special attention to:
Real-time behavior
Even with a strong security posture, zero-day and supply-chain attacks can bypass preventive controls, making runtime protection essential for detecting and stopping abnormal behavior in AI and GPU workloads. In LLM-based systems, for example, prompt-based attacks can lead to resource hijacking and unintended compute abuse. A single metric is not enough: as we saw with ShadowRay 2.0, attackers kept GPU usage low to avoid triggering alerts. An effective security approach has to correlate multi-domain information in real time.
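To make the correlation idea concrete, here is a minimal sketch of cross-domain detection. The signal names, thresholds, and two-domain rule are invented for illustration and are not Sysdig's actual detection logic:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    domain: str   # e.g. "gpu", "network", "process"
    name: str
    value: float

# Hypothetical per-domain thresholds. Each is too weak to alert on alone,
# which is exactly what ShadowRay-style attackers exploit by keeping GPU
# usage low enough to stay under single-metric alarms.
THRESHOLDS = {
    ("gpu", "utilization_pct"): 30.0,
    ("network", "egress_mb_per_min"): 50.0,
    ("process", "new_binaries"): 1.0,
}

def suspicious(signals):
    """Flag only when weak indicators from several domains line up."""
    hot_domains = {
        s.domain
        for s in signals
        if THRESHOLDS.get((s.domain, s.name), float("inf")) <= s.value
    }
    return len(hot_domains) >= 2  # require cross-domain corroboration

# A low-profile cryptominer: modest GPU use plus unusual egress correlate.
events = [
    Signal("gpu", "utilization_pct", 35.0),
    Signal("network", "egress_mb_per_min", 80.0),
    Signal("process", "new_binaries", 0.0),
]
print(suspicious(events))  # True: gpu and network domains are both elevated
```

The point of the sketch is the shape of the logic, not the numbers: no single threshold fires loudly, but agreement across domains does.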
Security posture and guardrails
CI/CD security and Kubernetes security posture management (KSPM) platforms can prevent attacks early by detecting poisoned dependencies, exposed AI services, and unsafe GPU or Kubernetes configurations, while enforcing least-privilege IAM, trusted images, and hardened GPU node pools.

This Datadog chart is aligned with the attack trends we observed.
The Sysdig Approach to AI Workload Protection
Sysdig protects AI workloads by aligning its CNAPP platform around three foundational pillars.
Runtime insights provide deep, real-time visibility into AI and GPU workloads with multi-domain correlation.
Agentic AI takes precise action for detection and response, stopping threats as they execute, from inference server exploits to container escapes.

Open innovation underpins the platform, leveraging open source, transparent policies, and customer-controlled rules to build trust and keep teams in control.

Together, these pillars span the full AI lifecycle, ensuring production-grade applications remain secure without sacrificing performance or velocity.
Securing OKE Clusters with GPU Nodes
High Level Architecture

Learn more by downloading the whitepaper Operational Security for OKE GPU-Accelerated AI Applications.
AI workload protection, the right way
Defending your AI attack surface against threats requires leveraging key security best practices and capabilities.
Harden posture as early as possible
CI/CD vulnerability and risk management prevent AI attacks by blocking poisoned dependencies, exposed services, and unsafe GPU and Kubernetes configurations before deployment. Sysdig runtime insights reduce noise and support clean prioritization.
- IaC scan with drift detection
- Supply Chain, Container Images & SBOMs
- Continuous Posture & Compliance (Cloud, Containers, OS, Kubernetes cluster)
- Risk management and inventory (advanced exposure, sensitive data access)
- Runtime Insights prioritization
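One of the posture controls above, enforcing trusted images, can be sketched as an admission-style check. The registry allowlist and digest-pinning rule here are illustrative assumptions, not a specific product policy:

```python
# Hypothetical allowlist of approved registries; real values would come
# from your organization's policy (OCIR paths, internal registries, etc.).
TRUSTED_REGISTRIES = ("ocir.io/mytenancy/", "ghcr.io/myorg/")  # placeholders

def image_allowed(image: str) -> bool:
    """Allow only digest-pinned images from an approved registry."""
    pinned = "@sha256:" in image          # reject mutable tags like :latest
    trusted = image.startswith(TRUSTED_REGISTRIES)
    return pinned and trusted

print(image_allowed("ocir.io/mytenancy/llm-serving@sha256:" + "0" * 64))  # True
print(image_allowed("docker.io/random/miner:latest"))                     # False
```

In practice this kind of rule is enforced by an admission controller or a CI/CD gate rather than application code, but the decision logic is the same.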
Protect the runtime perimeter. Always on.
Zero-days and supply-chain flaws still occur, so runtime detection is critical to stop abnormal behavior in AI and GPU workloads.
- Near Real-Time Detection and Response
- AI-Powered Threat Intelligence
- Multi-Domain Correlation
- Forensic Analysis
- Network Topology
Be ready to respond at Cloud Speed
When the average cost of a cloud breach is $4.45 million, security teams need to respond to attackers fast. Sysdig redefined the detection and response benchmark with the Sysdig 555. Here’s how:
- Agentic AI security (Sysdig SAGE)
- Advanced Response Actions
- Built-in Automations Framework
Want to learn more? Explore the Sysdig Secure website.
Blueprints and Landing Zones
Security for GPU-accelerated Kubernetes clusters should not be an afterthought. Security must be addressed from the earliest design phase, which is why starting from a well-defined landing zone or blueprint is important to ensure clusters are secure by default.
Oracle addresses this need through a growing set of OCI Kubernetes blueprints for AI applications, including reference architectures for large language models. These blueprints provide validated infrastructure designs, recommended GPU and node profiles, required software components, and baseline monitoring configurations. They allow teams to move faster while avoiding ad-hoc, insecure deployments when adopting new architectures.
Sysdig and Oracle Kubernetes Engine jointly developed a Quick Start blueprint that focuses specifically on security. This blueprint enables one-click deployment of OKE clusters with Sysdig Secure integrated by default, using Terraform and aligned with OCI Quick Start standards. The goal is to make runtime security, visibility, and threat detection part of the initial cluster design, rather than something retrofitted after workloads are already running.
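Conceptually, consuming such a blueprint looks like a single Terraform module call. The sketch below is illustrative only: the module source, variable names, and example values are placeholders, not the Quick Start's actual interface, so consult the published blueprint for the real inputs:

```hcl
# Sketch only -- source path, inputs, and values are assumptions.
module "oke_with_sysdig" {
  source = "path/to/oke-sysdig-quickstart"   # placeholder module source

  compartment_ocid   = var.compartment_ocid
  kubernetes_version = "v1.29.1"             # example version
  node_shape         = "VM.GPU.A10.1"        # example OCI GPU shape
  sysdig_access_key  = var.sysdig_access_key # agent enrolls at node boot
}
```

The design point is that security tooling is a first-class input to cluster creation, not a day-two add-on.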
Security Operationalization
Modern security teams understand that tools only provide value when they are properly integrated and used in day-to-day operations. This usually means fitting new tools into an existing security stack, especially for SOC teams, which tend to have well-established views on workflows, data ownership, and response automation.

Because operational models vary widely, teams need to make deliberate choices around integrations, ownership, and response patterns. To determine how Sysdig should be deployed within your security stack, consider the following questions.
- What service levels does your company need?
Sysdig can operate as a near real time detection and enrichment layer across cloud environments, producing high quality security signals and supporting timely response actions.
- Do you need long-term retention and correlation?
Sysdig can selectively enrich and forward security events, reducing noise and limiting what needs to be retained in the SIEM. This helps lower operational effort and data ingestion costs.
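A sketch of that enrich-and-forward pattern follows. The event fields, severity scale, and filtering rule are invented for illustration; the real event schema depends on your Sysdig and SIEM configuration:

```python
# Hypothetical filter/enricher: drop low-severity noise locally and forward
# only enriched, high-value events to the SIEM to cut ingestion cost.
SEVERITY_FLOOR = 4  # keep events rated >= 4; the scale is an assumption

def enrich_and_filter(events, cluster="oke-prod"):
    forwarded = []
    for e in events:
        if e["severity"] < SEVERITY_FLOOR:
            continue  # retained for local analysis, not shipped to the SIEM
        # Add context the SIEM would otherwise have to join in itself.
        forwarded.append({**e, "cluster": cluster, "source": "sysdig"})
    return forwarded

raw = [
    {"rule": "Terminal shell in container", "severity": 5},
    {"rule": "Read below /etc", "severity": 2},
]
out = enrich_and_filter(raw)
print(len(out), out[0]["cluster"])  # 1 oke-prod
```

Only the high-severity event crosses into the SIEM, already tagged with cluster and source context, which is where the retention and ingestion savings come from.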
- Is your organization subject to regulatory requirements?
Sysdig integrates with risk, asset, and compliance management platforms to support regulated environments and ongoing compliance processes.
- How much control does your security team have over the code-to-cloud pipeline?
Sysdig integrates with SCA, SAST, and ASPM tools to provide security context across the build, deploy, and runtime stages.
- How far do you take automation?
Through APIs and integrations with SOAR platforms, Sysdig supports automated and customized security workflows.
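The core of such a SOAR-driven workflow is a playbook mapping detections to response actions. The rule and action names below are invented placeholders for whatever your platform exposes:

```python
# Hypothetical rule-to-action playbook of the kind a SOAR integration
# automates; every name here is an illustrative assumption.
PLAYBOOK = {
    "Container Escape Attempt": "kill_container",
    "Crypto Miner Detected":    "pause_container",
}
DEFAULT_ACTION = "notify_oncall"  # fall back to a human for unknown rules

def respond(rule: str) -> str:
    """Pick the automated response for a detection rule."""
    return PLAYBOOK.get(rule, DEFAULT_ACTION)

print(respond("Crypto Miner Detected"))  # pause_container
print(respond("Unknown anomaly"))        # notify_oncall
```

Keeping a human-notification default for unmapped rules is the usual way to take automation far without fully removing analysts from the loop.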
Closing Thoughts
OKE on OCI delivers a resilient and compliant foundation for GPU-accelerated AI workloads, but responsibility for securing the applications running on top ultimately rests with you. While much of the security industry focuses on analyzing outputs and adding guardrails at the prompt layer, infrastructure, supply chain, and runtime security remain essential first-class concerns. The emerging AI threat landscape and new technology stacks demand a dedicated security approach.
Sysdig provides AI workload protection capabilities to address this challenge, including near real time detection, security signal enrichment to reduce noise and lower costs, and strong integration with compliance and security operations platforms.
Read more about Sysdig and Oracle Cloud:
OCI Blog Post: Sysdig Monitoring & Security for Oracle Cloud – OKE and Oracle Linux
Sysdig and Oracle: Secure innovation on Oracle Cloud
Download the whitepaper here, or contact Sysdig.
Guest Author

Manuel Boira Cuevas
Senior Solutions Architect / Strategic Alliances, Sysdig
Manuel is Senior Solutions Architect for Strategic Alliances at Sysdig, with broad experience across the IT sector spanning software development, application architecture, startup founding, and software consulting leadership. In recent years Manuel has focused on the intersection of development, operations, and cloud security, with particular interest in containerized environments, Kubernetes, and AI workloads. He is passionate about helping organizations build secure, scalable, and resilient systems.
