
Manuel Boira Cuevas
Senior Solutions Architect / Strategic Alliances, Sysdig
Artificial Intelligence and Machine Learning workloads are grounded in established software engineering and infrastructure principles. While AI and ML lifecycles introduce new operational constraints, they still execute as workloads on compute, storage, and networking platforms, fitting naturally within familiar IaaS, PaaS, and SaaS delivery models (AI is still a workload somewhere).
Organizations either deploy AI systems on their own infrastructure or consume models as managed services. In this article, we focus on AI applications deployed on Kubernetes and cloud infrastructure, with particular attention to Oracle Cloud Infrastructure (OCI) and Oracle Kubernetes Engine (OKE).
OCI is increasingly adopted for security, compliance, and data sovereignty reasons. Thanks to its cost efficiency and strong alignment with enterprise and regulatory requirements, OCI provides a solid foundation for AI and high-performance computing (HPC) applications. Features such as RDMA-enabled networking (high bandwidth, ultra-low latency) are especially relevant for highly demanding parallel computing workloads (financial, automotive, aerospace, biomedical, GenAI, big data).
Emerging attack surface
The AI threat surface spans a layered stack (physical CPU and GPU, virtualization layers up through models, data, inference, agents, applications, and APIs). MLOps platforms such as Kubeflow and MLflow manage model artifacts and training pipelines tightly coupled to shared data stores.
At runtime, inference engines such as vLLM and TensorRT-LLM execute with high privilege and sustained GPU access. In Kubernetes environments, stacks such as llm-d provide distributed serving primitives around model workers, while platforms such as NVIDIA Triton Inference Server provide a production-grade inference server for multiple model backends.
Above this, agent layers built with frameworks like LlamaIndex or LangChain dynamically connect models, tools, and data before exposing functionality through application and API layers. These layers are tightly interconnected; a weakness at any point can propagate upward, resulting in model theft, data exposure, or large-scale GPU abuse.
OCI Shared responsibility model
OCI and OKE provide a solid, well-managed platform, but attackers focus on what you deploy on top of it. Under OCI’s shared responsibility model, Oracle manages the control plane, while customers are responsible for application security and most data-plane operations. Below are examples of how this model works.
Core responsibilities
- Oracle fully manages and operates the Kubernetes control plane (API server, etcd, core controllers) as a managed service, including availability and CNCF-conformant behavior.
Shared operations
- Control plane upgrades are shared: Oracle releases supported versions and performs the upgrade, but you must initiate upgrades via Console/API/CLI.
- Data plane responsibilities are shared: Oracle supplies images and core data-plane components (kubelet, kube-proxy, flannel), and you manage worker nodes and workloads using those images.
Security and patching
- For control plane vulnerabilities, Oracle patches affected clusters; for data-plane vulnerabilities, Oracle provides patched images and you must roll them out to your nodes.
Support scope
- Oracle Support covers OKE-provided components and integrations (API server connectivity, cluster operations, CCM integrations, network components like CoreDNS/kube-proxy).
- Oracle Support excludes how-to Kubernetes usage, third-party software (e.g., Istio, Helm), unsupported Kubernetes versions, upstream bugs, and alpha features.
Application and networking ownership
- You are solely responsible for cluster networking configuration, application networking (LBs, ingress, network policies), observability (logs/metrics), app health/performance, security, and all workloads running in the cluster.
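As a concrete illustration of the application-networking responsibility above, a common baseline is a default-deny ingress policy per namespace, with traffic then allowed explicitly. This is a minimal sketch; the `inference` namespace name is a placeholder:

```yaml
# Default-deny ingress for every pod in the (hypothetical) "inference"
# namespace; follow-up NetworkPolicies must explicitly allow traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: inference
spec:
  podSelector: {}     # empty selector matches all pods in the namespace
  policyTypes:
    - Ingress         # no ingress rules listed, so all ingress is denied
```

Applying and maintaining policies like this falls entirely on the customer side of the shared responsibility model.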
AI threat proliferation
Attacks are increasing in volume, sophistication, and impact. Over the past several months, many notable incidents have highlighted how rapidly threats are evolving:
July 2025 – LangFlow Server RCE vulnerability → Unauthenticated AI pipeline takeover.
July 2025 – Nvidia Container Escape → Container-to-host GPU escape.
Nov 2025 – ShadowRay 2.0 → AI inference server exploit and cloud malware.
Nov 2025 – Keras Supply Chain Vulnerability → ML dependency supply-chain abuse.
Jan 2026 – IBM Bob duped to run malware → Trusted AI agent compromise.
For deeper analysis and additional examples, see the Sysdig Threat Research Team content.
Lessons Learned
Most of these attacks execute inside running workloads, often entering through supply-chain weaknesses or zero-day exploits and escalating via over-privileged GPU runtimes, exposed inference services, or misconfigured data and vector stores. Consequently, we must pay special attention to:
Real-time behavior
Even with a strong security posture, zero-day and supply-chain attacks can bypass preventive controls, making runtime protection essential for detecting and stopping abnormal behavior in AI and GPU workloads. In LLM-based systems, for example, prompt-based attacks can lead to resource hijacking and unintended compute abuse. A single metric is not enough: as we saw with ShadowRay 2.0, attackers kept GPU usage low to avoid triggering alerts. An effective security approach has to correlate multi-domain information in real time.
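To make the correlation idea concrete, here is a minimal sketch of cross-domain detection. The signal names, thresholds, and two-domain rule are invented for illustration and are not Sysdig's actual detection logic:

```python
from dataclasses import dataclass

@dataclass
class Signal:
    domain: str   # e.g. "gpu", "network", "process"
    name: str
    value: float

# Hypothetical per-domain thresholds. Each is too weak to alert on alone,
# which is exactly what ShadowRay-style attackers exploit by keeping GPU
# usage low enough to stay under single-metric alarms.
THRESHOLDS = {
    ("gpu", "utilization_pct"): 30.0,
    ("network", "egress_mb_per_min"): 50.0,
    ("process", "new_binaries"): 1.0,
}

def suspicious(signals):
    """Flag only when weak indicators from several domains line up."""
    hot_domains = {
        s.domain
        for s in signals
        if THRESHOLDS.get((s.domain, s.name), float("inf")) <= s.value
    }
    return len(hot_domains) >= 2  # require cross-domain corroboration

# A low-profile cryptominer: modest GPU use plus unusual egress correlate.
events = [
    Signal("gpu", "utilization_pct", 35.0),
    Signal("network", "egress_mb_per_min", 80.0),
    Signal("process", "new_binaries", 0.0),
]
print(suspicious(events))  # True: gpu and network domains are both elevated
```

The point of the sketch is the shape of the logic, not the numbers: no single threshold fires loudly, but agreement across domains does.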
Security posture and guardrails
CI/CD security and Kubernetes security posture management (KSPM) platforms can prevent attacks early by detecting poisoned dependencies, exposed AI services, and unsafe GPU or Kubernetes configurations, while enforcing least-privilege IAM, trusted images, and hardened GPU node pools.

This Datadog chart is aligned with the attack trends we observed.
The Sysdig Approach to AI Workload Protection
Sysdig protects AI workloads by aligning its CNAPP platform around three foundational pillars.
Runtime insights provide deep, real-time visibility into AI and GPU workloads with multi-domain correlation.
Agentic AI takes precise action for detection and response, stopping threats as they execute, from inference server exploits to container escapes.

Open innovation underpins the platform, leveraging open source, transparent policies, and customer-controlled rules to build trust and keep teams in control.

Together, these pillars span the full AI lifecycle, ensuring production-grade applications remain secure without sacrificing performance or velocity.
Securing OKE Clusters with GPU Nodes
High Level Architecture

Learn more by downloading the whitepaper Operational Security for OKE GPU-Accelerated AI Applications.
AI workload protection, the right way
Defending your AI attack surface against threats requires leveraging key security best practices and capabilities.
Harden posture as early as possible
CI/CD vulnerability and risk management prevent AI attacks by blocking poisoned dependencies, exposed services, and unsafe GPU and Kubernetes configurations before deployment. Sysdig runtime insights reduce noise and support clean prioritization.
- IaC scan with drift detection
- Supply Chain, Container Images & SBOMs
- Continuous Posture & Compliance (Cloud, Containers, OS, Kubernetes cluster)
- Risk management and inventory (advanced exposure, sensitive data access)
- Runtime Insights prioritization
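One of the posture controls above, enforcing trusted images, can be sketched as an admission-style check. The registry allowlist and digest-pinning rule here are illustrative assumptions, not a specific product policy:

```python
# Hypothetical allowlist of approved registries; real values would come
# from your organization's policy (OCIR paths, internal registries, etc.).
TRUSTED_REGISTRIES = ("ocir.io/mytenancy/", "ghcr.io/myorg/")  # placeholders

def image_allowed(image: str) -> bool:
    """Allow only digest-pinned images from an approved registry."""
    pinned = "@sha256:" in image          # reject mutable tags like :latest
    trusted = image.startswith(TRUSTED_REGISTRIES)
    return pinned and trusted

print(image_allowed("ocir.io/mytenancy/llm-serving@sha256:" + "0" * 64))  # True
print(image_allowed("docker.io/random/miner:latest"))                     # False
```

In practice this kind of rule is enforced by an admission controller or a CI/CD gate rather than application code, but the decision logic is the same.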
Protect the runtime perimeter. Always on.
Zero-days and supply-chain flaws still occur, so runtime detection is critical to stop abnormal behavior in AI and GPU workloads.
- Near Real-Time Detection and Response
- AI-Powered Threat Intelligence
- Multi-Domain Correlation
- Forensic Analysis
- Network Topology
Be ready to respond at Cloud Speed
When the average cost of a cloud breach is $4.45 million, security teams need to respond to attackers fast. Sysdig redefined the detection and response benchmark with the Sysdig 555. Here’s how:
- Agentic AI security (Sysdig SAGE)
- Advanced Response Actions
- Built-in Automations Framework
Want to learn more? Explore the Sysdig Secure website.
Blueprints and Landing Zones
Security for GPU-accelerated Kubernetes clusters should not be an afterthought. Security must be addressed from the earliest design phase, which is why starting from a well-defined landing zone or blueprint is important to ensure clusters are secure by default.
Oracle addresses this need through a growing set of OCI Kubernetes blueprints for AI applications, including reference architectures for large language models. These blueprints provide validated infrastructure designs, recommended GPU and node profiles, required software components, and baseline monitoring configurations. They allow teams to move faster while avoiding ad-hoc, insecure deployments when adopting new architectures.
Sysdig and Oracle Kubernetes Engine jointly developed a Quick Start blueprint that focuses specifically on security. This blueprint enables one-click deployment of OKE clusters with Sysdig Secure integrated by default, using Terraform and aligned with OCI Quick Start standards. The goal is to make runtime security, visibility, and threat detection part of the initial cluster design, rather than something retrofitted after workloads are already running.
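Conceptually, consuming such a blueprint looks like a single Terraform module call. The sketch below is illustrative only: the module source, variable names, and example values are placeholders, not the Quick Start's actual interface, so consult the published blueprint for the real inputs:

```hcl
# Sketch only -- source path, inputs, and values are assumptions.
module "oke_with_sysdig" {
  source = "path/to/oke-sysdig-quickstart"   # placeholder module source

  compartment_ocid   = var.compartment_ocid
  kubernetes_version = "v1.29.1"             # example version
  node_shape         = "VM.GPU.A10.1"        # example OCI GPU shape
  sysdig_access_key  = var.sysdig_access_key # agent enrolls at node boot
}
```

The design point is that security tooling is a first-class input to cluster creation, not a day-two add-on.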
Security Operationalization
Modern security teams understand that tools only provide value when they are properly integrated and used in day-to-day operations. This usually means fitting new tools into an existing security stack, especially for SOC teams, which tend to have well-established views on workflows, data ownership, and response automation.

Because operational models vary widely, teams need to make deliberate choices around integrations, ownership, and response patterns. To determine how Sysdig should be deployed within your security stack, consider the following questions.
- What service levels does your company need?
Sysdig can operate as a near real time detection and enrichment layer across cloud environments, producing high quality security signals and supporting timely response actions.
- Do you need long-term retention and correlation?
Sysdig can selectively enrich and forward security events, reducing noise and limiting what needs to be retained in the SIEM. This helps lower operational effort and data ingestion costs.
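A sketch of that enrich-and-forward pattern follows. The event fields, severity scale, and filtering rule are invented for illustration; the real event schema depends on your Sysdig and SIEM configuration:

```python
# Hypothetical filter/enricher: drop low-severity noise locally and forward
# only enriched, high-value events to the SIEM to cut ingestion cost.
SEVERITY_FLOOR = 4  # keep events rated >= 4; the scale is an assumption

def enrich_and_filter(events, cluster="oke-prod"):
    forwarded = []
    for e in events:
        if e["severity"] < SEVERITY_FLOOR:
            continue  # retained for local analysis, not shipped to the SIEM
        # Add context the SIEM would otherwise have to join in itself.
        forwarded.append({**e, "cluster": cluster, "source": "sysdig"})
    return forwarded

raw = [
    {"rule": "Terminal shell in container", "severity": 5},
    {"rule": "Read below /etc", "severity": 2},
]
out = enrich_and_filter(raw)
print(len(out), out[0]["cluster"])  # 1 oke-prod
```

Only the high-severity event crosses into the SIEM, already tagged with cluster and source context, which is where the retention and ingestion savings come from.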
- Is your organization subject to regulatory requirements?
Sysdig integrates with risk, asset, and compliance management platforms to support regulated environments and ongoing compliance processes.
- How much control does your security team have over the code-to-cloud pipeline?
Sysdig integrates with SCA, SAST, and ASPM tools to provide security context across the build, deploy, and runtime stages.
- How far do you take automation?
Through APIs and integrations with SOAR platforms, Sysdig supports automated and customized security workflows.
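The core of such a SOAR-driven workflow is a playbook mapping detections to response actions. The rule and action names below are invented placeholders for whatever your platform exposes:

```python
# Hypothetical rule-to-action playbook of the kind a SOAR integration
# automates; every name here is an illustrative assumption.
PLAYBOOK = {
    "Container Escape Attempt": "kill_container",
    "Crypto Miner Detected":    "pause_container",
}
DEFAULT_ACTION = "notify_oncall"  # fall back to a human for unknown rules

def respond(rule: str) -> str:
    """Pick the automated response for a detection rule."""
    return PLAYBOOK.get(rule, DEFAULT_ACTION)

print(respond("Crypto Miner Detected"))  # pause_container
print(respond("Unknown anomaly"))        # notify_oncall
```

Keeping a human-notification default for unmapped rules is the usual way to take automation far without fully removing analysts from the loop.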
Closing Thoughts
OKE on OCI delivers a resilient and compliant foundation for GPU-accelerated AI workloads, but responsibility for securing the applications running on top ultimately rests with you. While much of the security industry focuses on analyzing outputs and adding guardrails at the prompt layer, infrastructure, supply chain, and runtime security remain essential first-class concerns. The emerging AI threat landscape and new technology stacks demand a dedicated security approach.
Sysdig provides AI workload protection capabilities to address this challenge, including near real time detection, security signal enrichment to reduce noise and lower costs, and strong integration with compliance and security operations platforms.
Read more about Sysdig and Oracle Cloud:
OCI Blog Post: Sysdig Monitoring & Security for Oracle Cloud – OKE and Oracle Linux
Sysdig and Oracle: Secure innovation on Oracle Cloud
Download the whitepaper here, or contact Sysdig.
Guest Author

Manuel Boira Cuevas
Senior Solutions Architect / Strategic Alliances, Sysdig
Manuel is Senior Solutions Architect for Strategic Alliances at Sysdig, with broad experience across the IT sector spanning software development, application architecture, startup founding, and software consulting leadership. In recent years Manuel has focused on the intersection of development, operations, and cloud security, with particular interest in containerized environments, Kubernetes, and AI workloads. He is passionate about helping organizations build secure, scalable, and resilient systems.
