Generative AI workloads drive a different set of engineering tradeoffs than traditional cloud workloads, so we designed a purpose-built network tailored to the needs of the most demanding generative AI (GenAI) workloads.
 


At Oracle CloudWorld, Oracle announced the availability of Oracle Cloud Infrastructure (OCI) Supercluster with up to 131,072 NVIDIA Blackwell GPUs, delivering an unprecedented 2.4 zettaflops of peak performance. The zettascale cluster network delivers 52 Pbps of nonblocking network bandwidth at 400 Gbps per port with as low as 2 µs (microseconds) of latency. This represents five times the network bandwidth and up to five times lower network latency compared to competitors. This blog post explores some of the key engineering innovations undergirding these improvements.

The zettascale cluster network is a three-tier Clos topology that supports up to 131,072 GPUs with 400 Gbps of nonblocking connectivity to each GPU.

Fig. 1: OCI Supercluster offerings for generative AI

Not only does this network support the largest GPU clusters, but it also maintains low latency, high throughput, and high workload resilience. Achieving that goal required us to innovate in new ways. Here, we discuss the following innovations that enable this network:

  • RDMA at zettascale with ultra-high throughput
  • Ultra-low latency
  • Advanced link resilience for enhanced workload reliability
  • Advanced traffic load balancing

RDMA at Zettascale with Ultra-High Throughput

Scaling RDMA

In a previous blog post, we discussed how OCI Superclusters achieve ultra-high performance at massive scale. We rely on tried-and-true principles of distributed systems and distributed networking, one of which is loose coupling between the endpoints and the core of the network. Previous attempts at RDMA over Converged Ethernet (RoCE) relied on priority flow control (PFC) as the primary mechanism for handling congestion. PFC is known to occasionally lead to network blocking issues, something that's unacceptable in a multitenant or multiworkload environment. We took a different approach by relying on congestion control, instead of PFC, as the primary mechanism for avoiding congestion. Congestion control is a proactive congestion management mechanism that works at all layers of the cluster network. It allows the network to grow to whatever scale our customers need without the risk of network blocking.
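As an illustration of the idea, the following is a minimal sketch of an ECN-driven, DCQCN-style rate control loop of the kind commonly used for RoCE congestion control. The constants and structure are illustrative assumptions, not OCI's production algorithm.

```python
# Minimal sketch of ECN-driven rate control (DCQCN-style), illustrative only.
# The sender cuts its rate multiplicatively when the receiver reports
# ECN-marked packets, and recovers toward its previous rate when marks stop.

class RateController:
    def __init__(self, line_rate_gbps: float = 400.0):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps          # current sending rate
        self.target = line_rate_gbps        # rate to recover toward
        self.alpha = 0.0                    # smoothed congestion estimate
        self.g = 1.0 / 16                   # EWMA gain (illustrative)

    def on_feedback(self, ecn_marked: bool) -> float:
        if ecn_marked:
            # Congestion notification received: raise alpha and cut the rate.
            self.alpha = (1 - self.g) * self.alpha + self.g
            self.target = self.rate
            self.rate = max(self.rate * (1 - self.alpha / 2), 1.0)
        else:
            # No marks: decay alpha and recover halfway toward the target rate.
            self.alpha = (1 - self.g) * self.alpha
            self.rate = min((self.rate + self.target) / 2, self.line_rate)
        return self.rate
```

Because the endpoints react to congestion signals directly, the core of the network never has to pause traffic the way PFC does, which is what keeps the endpoints and the fabric loosely coupled.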

As cluster size increases, the flexibility to run multiple workloads on the cluster concurrently becomes critical. To support multiple concurrent workloads, we enable multiple classes of traffic in the network, where each class of workload, such as GenAI and high-performance computing message passing interface (HPC-MPI), gets its own class and quality of service tailored to its unique needs.
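To make this concrete, the following is a minimal sketch of how workload classes might map to packet markings and switch queues. The class names, DSCP values, and queue numbers are illustrative assumptions, not OCI's production QoS configuration.

```python
# Illustrative mapping of workload classes to DSCP markings and switch queues.
# All values are assumptions for illustration only.
TRAFFIC_CLASSES = {
    "genai-rdma": {"dscp": 26, "queue": 5, "ecn": True},   # bandwidth-heavy training traffic
    "hpc-mpi":    {"dscp": 46, "queue": 6, "ecn": True},   # latency-sensitive MPI collectives
    "management": {"dscp": 0,  "queue": 0, "ecn": False},  # best-effort control traffic
}

def classify(workload: str) -> dict:
    """Return the QoS treatment for a workload, defaulting to best effort."""
    return TRAFFIC_CLASSES.get(workload, TRAFFIC_CLASSES["management"])
```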

Ultra-High Throughput and Ultra-Low Latency

Generative AI workloads can use all the bandwidth that the network can provide. Our goal is to deliver line-rate throughput at the lowest possible latency for real-world workloads. In practice, even with a nonblocking network, the observed bandwidth can be lower because of localized link-level congestion caused by flow collisions.

We use the following techniques to deliver line-rate throughput for real-world AI workloads:

  • Intelligent GPU placement: The OCI control plane attempts to place the GPUs in a customer cluster in the closest possible proximity. This network locality simultaneously lowers latency and improves throughput: more traffic stays local to the lower levels of the network, which reduces the chances of flow collision at the higher levels.
  • Network locality service: OCI has a service that advertises network topology information to all GPUs, which helps each GPU identify how closely it is placed relative to all other GPUs. This information allows GenAI schedulers to place jobs so that bandwidth-heavy and latency-sensitive traffic stays within lower levels of the network (see the placement sketch after this list).
  • Advanced traffic load balancing techniques: We pioneered multiple new traffic load balancing techniques with industry partners, which we discuss in the section, “Advanced Traffic Load Balancing.” These load balancing techniques reduce latency and increase throughput by reducing the probability of congestion and by reducing queue depths in the network.
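The following is a minimal sketch of locality-aware placement using topology hints of the kind such a service could expose. The data model, a leaf and block label per GPU, and the function itself are illustrative assumptions, not OCI's actual API.

```python
# Minimal sketch of locality-aware job placement using topology hints.
# The leaf/block labels per GPU are an illustrative data model.
from collections import defaultdict

def place_job(gpus_needed: int, topology: dict[str, dict]) -> list[str]:
    """topology maps gpu_id -> {"leaf": <tier-0 switch>, "block": <tier-1 block>}."""
    by_leaf, by_block = defaultdict(list), defaultdict(list)
    for gpu, loc in topology.items():
        by_leaf[loc["leaf"]].append(gpu)
        by_block[loc["block"]].append(gpu)

    # Prefer a single tier-0 (leaf) domain, then a single tier-1 block,
    # so traffic stays at the lowest possible network tier.
    for group in (by_leaf, by_block):
        for members in group.values():
            if len(members) >= gpus_needed:
                return members[:gpus_needed]
    return list(topology)[:gpus_needed]   # fall back to spanning the cluster
```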

The network has an aggregate capacity of 52 Pbps of line-rate throughput at 400 Gbps for all 131,072 NVIDIA Blackwell GPUs.

Achieving Ultra-Low Latency

The following schematic illustrates the network topology of an OCI cluster network. In this three-tier Clos network, the first tier of switches serves up to 256 NVIDIA GPUs with up to 2 µs (microseconds) of unidirectional latency. The second tier of switches serves up to 2,048 NVIDIA GPUs with up to 5 µs of latency. The third tier of switches serves up to 131,072 NVIDIA GPUs with up to 8 µs of latency.

Fig. 2: Architecture of a tiered OCI cluster network

Two microseconds isn't much time to transmit a network packet, and every hardware and firmware component that handles the packet must be designed and configured to minimize latency. The following schematic illustrates a breakdown of the network latency, comprising network interface card (NIC) latency, switch latency, transceiver latency, and speed-of-light latency. The switch ASIC has a latency budget of under a microsecond, typically around 900 nanoseconds. The switch ASIC performs the following key functions (a simplified sketch of this pipeline follows the list):

  • Validating the packet
  • Performing a lookup on the packet's destination address to determine the egress port to send it out of
  • Rewriting the destination MAC (Layer-2) address to that of the next hop
  • If necessary, placing the packet in a temporary queue
  • Checking for and honoring the priority of the packet to provide the correct grade of service
  • Marking the packet to signal congestion, if necessary
  • Forwarding the packet out of the chosen egress port
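The following is a highly simplified model of that pipeline. The packet, FIB, and queue objects and the ECN threshold are illustrative assumptions and omit almost everything a real switch ASIC does in hardware.

```python
# Highly simplified model of the switch forwarding steps listed above.
# Real ASICs perform these steps in hardware pipelines; all names are illustrative.
def forward(packet, fib, arp_table, egress_queues, ecn_threshold=64):
    if not packet.checksum_ok():                 # validate the packet
        return None
    port = fib.lookup(packet.dst_ip)             # destination lookup picks the egress port
    packet.dst_mac = arp_table[port]             # rewrite the L2 header for the next hop
    queue = egress_queues[port][packet.priority] # honor the packet's class of service
    if len(queue) > ecn_threshold:
        packet.ecn_mark()                        # signal congestion to the endpoints
    queue.append(packet)                         # enqueue toward the chosen egress port
    return port
```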

Fig. 3: Types of latency in the packet flow across clusters

The speed-of-light latency is fixed: optical signals take 5 µs to travel a kilometer in optical fiber, or 5 nanoseconds per meter. To minimize latency, we specify maximum allowed cable distances for network links and design the data center layout to comply with those specifications. For example, we limit the cable distance between a GPU and the first-hop switch (Tier-0 switch) to a maximum of 40 meters.
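As a quick worked example using the numbers above, a 40-meter cable run contributes roughly 200 ns of one-way propagation delay, a meaningful slice of a 2 µs budget:

```python
# Propagation delay in optical fiber, using ~5 ns per meter as stated above.
NS_PER_METER = 5.0

def propagation_ns(cable_meters: float) -> float:
    return cable_meters * NS_PER_METER

print(propagation_ns(40))    # 200.0 ns for the maximum GPU-to-Tier-0 cable run
print(propagation_ns(1000))  # 5000.0 ns (5 µs) per kilometer
```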

The NIC, switch, and transceiver are likewise designed to minimize latency, with packets processed exclusively in silicon logic. Dynamic random-access memory (DRAM) and even high-bandwidth memory (HBM) lookups are expensive, and our goal is to eliminate such memory accesses from the packet processing path. We help ensure that the routing and switching state critically required for the NIC and switch silicon to process packets is always available in very low-latency memory components, such as SRAM or TCAM.

Advanced Link Resilience for Enhanced Workload Reliability

The performance of AI and machine learning (ML) workloads is extremely sensitive to network disruptions: even small disruptions can have an outsized impact on workload performance. The underlying RDMA transport is also sensitive to packet loss, and a small amount of packet drop can lead to many packets being retransmitted. Finally, the sheer scale of these workloads, with thousands of switches and tens of thousands of optical transceivers, means that the probability of a component failure is higher than for a typical computing workload.
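To see how scale alone raises the failure rate, consider a fleet of independent links, each with a small probability of flapping in a given day; the probability that at least one link flaps approaches certainty as the fleet grows. The numbers below are illustrative assumptions, not measured OCI values.

```python
# Probability that at least one of N independent links fails in a window,
# given a small per-link failure probability p: 1 - (1 - p)^N.
# p and N are illustrative, not measured values.
def any_failure(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

print(any_failure(1e-4, 1))       # ~0.0001 : a single link is rarely a problem
print(any_failure(1e-4, 50_000))  # ~0.99   : at fleet scale, failures become routine
```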

We have enhanced the resilience of our customer workloads with the following features:

  • Customized switch and NIC configurations aimed at mitigating momentary link interruptions.
  • Advanced monitoring and automation systems to collect and analyze network link statistics and to predict impending failures.
  • Automation systems to detect repeat offender links and automatically remediate them by moving traffic off those links and initiating repair activities, without needing humans to pore over network dashboards (a simplified detection sketch follows this list).
  • A GPU-hosted OCI cloud agent to look for host-side link or environmental anomalies and to predict impending failures.
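As a concrete illustration, the following is a minimal sketch of repeat-offender detection and automated remediation. The threshold, window, and the drain_link and open_repair_ticket callbacks are illustrative assumptions, not OCI's production tooling.

```python
# Minimal sketch of repeat-offender link detection: if a link flaps more than
# a threshold number of times within a sliding window, drain it and open a
# repair ticket. Thresholds and callbacks are illustrative assumptions.
import time
from collections import defaultdict, deque

FLAP_THRESHOLD = 3
WINDOW_SECONDS = 24 * 3600

flap_history: dict[str, deque] = defaultdict(deque)

def on_link_flap(link_id: str, drain_link, open_repair_ticket) -> None:
    now = time.time()
    history = flap_history[link_id]
    history.append(now)
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()                       # keep only flaps inside the window
    if len(history) >= FLAP_THRESHOLD:
        drain_link(link_id)                     # move traffic off the link
        open_repair_ticket(link_id)             # kick off repair without human triage
```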

Let’s take a deeper look at one of the failure vectors, called link flaps.

Defining Link Flaps

Link flaps are characterized by transitions between link-up and link-down states, often triggered by a sequence of uncorrectable bit errors within a brief time window. For example, consider a 400G Ethernet link. The 400G IEEE specification provides forward error correction (FEC) protection on these links, where no more than three sequential FEC code blocks of 5,140 bits are allowed to have one uncorrectable bit error each. Uncorrectable bit errors occur when the FEC code embedded in the Ethernet frame can't correct the bit errors; FEC can correct up to 15 bit errors in a FEC code block. In the worst case, three uncorrectable bit errors in the span of a mere 20 ns (nanoseconds) can end in a link-down event.

While the optical layer of such a link recovers within milliseconds, the real problem with such an event, however fleeting, is that it results in a drawn-out outage lasting 10–15 seconds, because multiple layers of logic are stacked on these links. These layers include the physical medium dependent (PMD), physical medium attachment (PMA), FEC, physical coding sublayer (PCS), media access control (MAC), data link, and IP layers. Each of these layers has its own independent link training, stabilization, and resilience algorithms, which add time to the bring-up of a link.

The following graphic illustrates these layers in relation to the Open Systems Interconnection (OSI) model.

Fig. 4: Layers of the OSI protocol stack

Each link flap triggers a network-wide convergence event in IP routing protocols such as Border Gateway Protocol (BGP) and Open Shortest Path First (OSPF). In fact, two events occur: one for the link going down and another for it coming back up. Reconvergence events can themselves have second-order effects, such as transient microloops.

Link flaps slow down the workload and can even interrupt GPU training sequences, requiring the ejection of hosts from the workload. In short, link flaps can significantly impact GPU workloads and increase training time. They're costly events, and we want to prevent them or minimize their impact.

Causes of Link Flaps

A link flap can occur for many reasons, including the following examples:

  • Defects in transceivers and subcomponents, such as lasers, fiber coupling, or digital signal processors (DSPs), causing a degraded signal or loss of signal
  • Poor deployment hygiene in the fiber and connectors, such as dust on the fiber, degrading the optical signal
  • Thermal variations and abrupt changes in ambient operating conditions
  • Electrostatic discharge (ESD) damage to sensitive components
  • Assembly and manufacturing defects
  • Software or firmware issues on the device

Our observation is that the majority of link flaps are caused by transient conditions that don't call for a specific repair action.

Mitigating Link Flaps

A momentary disruption of the optical signal at the PMD layer can lead to a long link flap event. Our automation systems predict links that are likely to fail, and they detect and repair repeat offender links. That still leaves a non-zero probability that some links flap on a nonrecurring basis. We want to minimize the impact of such one-time offenders, so we deploy link debounce.

Link debounce is a technique where, if a momentary, one-time interruption of the optical signal occurs (at the PMA layer on the NIC or switch), we don't take the upper layers (MAC or IP) down, which avoids stretching a short transient event into a long outage. As an added benefit, link debounce also avoids an IP routing protocol convergence event.
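The following is a minimal sketch of the debounce idea: hold off reporting a link-down event to the upper layers until the signal has stayed down longer than a short hold-down interval. The 200 ms value and the class shape are illustrative assumptions, not OCI's actual configuration.

```python
# Minimal sketch of link debounce: only tear down the upper layers if the
# physical signal stays down longer than a debounce interval (illustrative).
import threading

class DebouncedLink:
    def __init__(self, notify_down, debounce_ms: int = 200):
        self.notify_down = notify_down      # callback that takes MAC/IP layers down
        self.debounce_s = debounce_ms / 1000
        self._timer = None

    def on_signal_lost(self):
        # Start the hold-down timer instead of declaring the link down immediately.
        self._timer = threading.Timer(self.debounce_s, self.notify_down)
        self._timer.start()

    def on_signal_restored(self):
        # Signal came back within the window: cancel, and no flap is ever surfaced.
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
```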

Advanced Traffic Load Balancing

Throughput issues in a network commonly result from inefficient traffic load balancing, which causes congestion, overpacking of flows on some paths, and equal-cost multipath (ECMP) flow collisions. ECMP is a flow-level load balancing technique in which the switch distributes flows over the available parallel paths while keeping all packets of a given flow on the same path. Congestion can occur even in networks that aren't oversubscribed, because switches are unaware of the bandwidth requirements of individual flows and of the mix of flows a workload generates. In other words, switches are unaware of the flow composition of the workload they're supporting.
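The following sketch shows why ECMP collisions happen: the path is chosen by hashing each flow's 5-tuple, so two bandwidth-heavy flows can land on the same uplink while other uplinks sit idle. The hash function and addresses are illustrative.

```python
# Sketch of ECMP path selection: the uplink is chosen by hashing the flow's
# 5-tuple, so two elephant flows may hash to the same uplink by chance.
import zlib

UPLINKS = 4

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto="udp") -> int:
    key = f"{src_ip}:{dst_ip}:{src_port}:{dst_port}:{proto}".encode()
    return zlib.crc32(key) % UPLINKS

# Two large flows from different GPUs may still map to the same uplink,
# halving their usable bandwidth even though the fabric is nonblocking.
print(ecmp_path("10.0.0.1", "10.0.1.1", 4791, 4791))
print(ecmp_path("10.0.0.2", "10.0.1.2", 4791, 4791))
```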

Collectives-Aware Load Balancing

We have pioneered multiple advanced traffic load balancing techniques that can significantly reduce traffic congestion. For one of these techniques, we collaborated with one of our switch vendors. This technique relies on the switch being aware of AI and ML collectives and using that knowledge to map ML flows to paths while avoiding overpacking any single path.

Collectives-aware load balancing is an innovative traffic load balancing technique, where the switch uses its knowledge of ML collectives to optimize the mapping of traffic flows on available paths. This advanced load balancing technique helps simultaneously lower latency and increase throughput by reducing the probability of congestion. 

The following graphic illustrates collectives-aware load balancing:

Fig. 5: ECMP-based flow forwarding vs. collective-aware flow forwarding

The schematic on the left shows standard ECMP load balancing, where the switch is unaware of ML collectives and mixes the red and blue flows, which leads to flow collisions and congestion. The schematic on the right shows collective-aware load balancing, where the switch is aware of ML collectives and keeps the red and blue flows separate, avoiding flow collisions.
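The following is a minimal sketch of the idea behind collectives-aware placement: if the switch knows which flows belong to one collective, it can spread them across distinct uplinks instead of hashing each flow independently. The data shapes are illustrative assumptions, not the vendor's actual implementation.

```python
# Minimal sketch of collectives-aware flow placement: spread the flows of one
# collective across distinct uplinks so they cannot collide with each other.
def place_collective(flow_ids: list[str], uplinks: list[int]) -> dict[str, int]:
    placement = {}
    for i, flow in enumerate(sorted(flow_ids)):
        placement[flow] = uplinks[i % len(uplinks)]   # round-robin over uplinks
    return placement

# A ring all-reduce step with four flows on four uplinks gets one flow per path.
print(place_collective(["gpu0->gpu1", "gpu1->gpu2", "gpu2->gpu3", "gpu3->gpu0"],
                       uplinks=[0, 1, 2, 3]))
```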

Conclusion

Oracle Cloud Infrastructure redefines network scalability and performance for generative AI and deep learning workloads with cluster networks at zettascale. With ultra-high throughput RDMA, ultra-low latency, enhanced network resilience, and intelligent traffic load-balancing, it delivers the infrastructure needed for next-generation AI models and large-scale training.

Oracle Cloud Infrastructure Engineering handles the most demanding workloads for enterprise customers, which has pushed us to think differently about designing our cloud platform. We have more of these engineering deep dives as part of this First Principles series, hosted by Pradeep Vincent and other experienced engineers at Oracle.

For more information, see the following resources:

1. OCI AI Infrastructure

2. Oracle Delivers Sovereign AI Anywhere Using NVIDIA Accelerated Computing

3. Announcing the General Availability of NVIDIA GPU Device Plugin Add-On for OKE