Now Generally Available: The Largest, Fastest AI Supercomputer in the Cloud

November 18, 2024 | 3 minute read
Sagar Zanwar
Principal Product Manager, Compute
Akshai Parthasarathy
Product Marketing Director, Oracle

We’re excited to announce the general availability of Oracle Cloud Infrastructure (OCI) Supercluster with NVIDIA H200 Tensor Core GPUs. The largest AI supercomputer available in the cloud*, our latest Supercluster scales up to an industry-leading 65,536 GPUs. At maximum scale, it can offer up to 260 ExaFLOPS of peak FP8 performance, more than four times that of the previous generation.
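
As a rough sanity check, the 260-ExaFLOPS figure is simply the per-GPU FP8 peak multiplied by the maximum cluster size. The per-GPU number used below is NVIDIA's published H200 FP8 peak with sparsity, which this post doesn't state, so treat it as an assumption:

```python
# Back-of-envelope check of the peak FP8 figure at maximum Supercluster scale.
# The per-GPU FP8 peak (~3,958 TFLOPS with sparsity) is an assumed value from
# NVIDIA's public H200 specs, not a number stated in this post.
gpus_at_max_scale = 65_536
fp8_tflops_per_gpu = 3_958                                      # assumed H200 FP8 peak (sparsity)

peak_exaflops = gpus_at_max_scale * fp8_tflops_per_gpu / 1e6    # TFLOPS -> ExaFLOPS
print(f"~{peak_exaflops:.0f} ExaFLOPS peak FP8")                # ~259 ExaFLOPS
```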

Each OCI Compute instance within the Supercluster has 76% more high-bandwidth memory capacity and 40% more memory bandwidth than the H100 instance, improving large language model (LLM) inference performance by up to 1.9X. With double the front-end network throughput for data ingestion and retrieval (200 Gbps per instance), data transfer to and from the cluster is also dramatically improved to further accelerate AI model training and deployment.
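
The capacity and bandwidth deltas above are easy to reproduce. The H100 baseline values used below (80 GB of HBM3 at 3.35 TB/s) come from NVIDIA's public H100 SXM specifications and are assumptions rather than figures from this post:

```python
# Rough check of the quoted H200-vs-H100 memory deltas. The H100 baseline
# (80 GB HBM3, 3.35 TB/s) is assumed from NVIDIA's public SXM specs.
h200_hbm_gb, h100_hbm_gb = 141, 80
h200_bw_tbs, h100_bw_tbs = 4.8, 3.35

print(f"HBM capacity: +{(h200_hbm_gb / h100_hbm_gb - 1) * 100:.0f}%")  # ~ +76%
print(f"HBM bandwidth: {h200_bw_tbs / h100_bw_tbs:.2f}x")              # ~ 1.43x
```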

OCI Supercluster with NVIDIA H200 GPUs: Even more scalability and performance at the same great price

AI models continue to evolve and become more capable, using over a trillion parameters to improve accuracy, fluency, efficiency, multimodal capability, and other dimensions. These new models require faster GPUs arranged into very large clusters. OCI Superclusters offer the following features:

  • OCI’s bare metal GPU compute, unique among hyperscalers, removes the overhead of hypervisors and allows end users to get the most value from each instance’s CPUs and GPUs (see the launch sketch after this list).
  • OCI’s custom-designed cluster network uses RDMA over Converged Ethernet Version 2 (RoCE v2) on NVIDIA ConnectX-7 network interface cards (NICs) to deliver high throughput (400-Gbps GPU-to-GPU interconnects across racks) and ultra-low latency of 2.5–9.1 microseconds. This configuration enables faster training of LLMs across tens of thousands of GPUs.
  • An upgraded 200-Gbps front-end network allows instances in the new Supercluster to move large datasets between storage and GPUs more efficiently, enabling faster iteration and scaling.
  • Built-in hardware acceleration and efficient network processing empower OCI File Storage with high-performance mount target (HPMT), a fully managed Lustre file service (coming soon), and other AI-specific services.
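
Here is a minimal sketch of launching one of these bare metal instances with the OCI Python SDK. The shape name comes from this announcement; all OCIDs, the availability domain, the image, and the display name are placeholders you would replace with your own values, and full Supercluster deployments go through cluster networks rather than standalone launches.

```python
# Minimal sketch: launching a BM.GPU.H200.8 bare metal instance with the
# OCI Python SDK. All OCIDs, the availability domain, and the image below
# are placeholders; only the shape name comes from this announcement.
import oci

config = oci.config.from_file()                  # reads ~/.oci/config
compute = oci.core.ComputeClient(config)

details = oci.core.models.LaunchInstanceDetails(
    availability_domain="Uocm:PHX-AD-1",         # placeholder availability domain
    compartment_id="ocid1.compartment.oc1..example",
    shape="BM.GPU.H200.8",                       # bare metal, no hypervisor
    display_name="h200-training-node",
    create_vnic_details=oci.core.models.CreateVnicDetails(
        subnet_id="ocid1.subnet.oc1..example",
    ),
    source_details=oci.core.models.InstanceSourceViaImageDetails(
        image_id="ocid1.image.oc1..example",     # GPU-enabled image of your choice
    ),
)

instance = compute.launch_instance(details).data
print(instance.id, instance.lifecycle_state)
```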

Best of all, pricing remains $10 per GPU per hour, the same as the previous-generation instance with NVIDIA H100 GPUs (BM.GPU.H100.8).
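
Because each instance carries eight GPUs, the list price translates directly into a per-instance rate. The figures below are simple arithmetic at list price and don't account for any negotiated or committed-use discounts:

```python
# Simple cost arithmetic from the quoted list price of $10 per GPU per hour.
price_per_gpu_hour = 10
gpus_per_instance = 8

instance_hour = price_per_gpu_hour * gpus_per_instance     # $80 per instance-hour
month_30_days = instance_hour * 24 * 30                    # $57,600 per instance
print(f"${instance_hour}/instance-hour, ${month_30_days:,} per 30-day month")
```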

BM.GPU.H200.8 technical specifications

The NVIDIA H200 shape boasts the following specifications:

  • Instance name: BM.GPU.H200.8
  • Instance type: Bare metal (no hypervisor)
  • GPU: Eight NVIDIA H200 Tensor Core GPUs connected via NVIDIA NVLink
    • GPU memory capacity: 141 GB of HBM3e memory per GPU (76% more than NVIDIA H100)
    • GPU memory bandwidth: 4.8 TB/s (1.4 times that of NVIDIA H100)
  • CPU: Two 56-core Intel Sapphire Rapids 8480+
  • System memory: 3 TB DDR5
  • Local storage: Eight 3.84-TB NVMe SSDs
  • Cluster network: 3,200 Gbps (Eight 400-Gbps links)
  • Front-end network: 200 Gbps (Two times that of BM.GPU.H100.8)
  • OCI Supercluster scale: Up to 65,536 NVIDIA H200 GPUs (Four times the OCI Supercluster scale with NVIDIA H100 GPUs)
  • List price: $10 per GPU/hour (same as BM.GPU.H100.8)
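
A few per-instance aggregates fall out of the spec list above; this is a small sketch of that arithmetic rather than anything Oracle publishes in this form:

```python
# Per-instance aggregates derived from the BM.GPU.H200.8 spec list above.
gpus = 8
hbm_per_gpu_gb = 141
nvme_drives, nvme_tb_each = 8, 3.84
cluster_net_gbps = 3_200

print(f"Total HBM3e: {gpus * hbm_per_gpu_gb} GB")                 # 1,128 GB
print(f"Local NVMe:  {nvme_drives * nvme_tb_each:.2f} TB")        # 30.72 TB
print(f"Cluster net: {cluster_net_gbps / 8:.0f} GB/s wire rate")  # 400 GB/s
```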

Getting started

To get access to Oracle Cloud Infrastructure Supercluster with NVIDIA H200 GPUs, contact your Oracle sales team and learn more about AI Infrastructure.

 


* Scalability for CSP 1: 20,000 NVIDIA H200 GPUs; Scalability for CSP 2 and CSP 3: not available publicly

Sagar Zanwar

Principal Product Manager, Compute

Sagar Zanwar is a Lead Product Manager specializing in GPU products within the AI Infrastructure Group. With a keen focus on advancing AI infrastructure capabilities through cutting-edge GPU/OCI technology, Sagar plays a pivotal role in driving innovation and delivering high-performance solutions in the AI industry.

Akshai Parthasarathy

Product Marketing Director, Oracle

Akshai is a Director of Product Marketing for Oracle Cloud Infrastructure (OCI) focused on driving adoption of OCI’s services and solutions. He has over 15 years of experience and is a graduate of UC Berkeley and Georgia Tech.

