First Principles: Superclusters with RDMA—Ultra-high performance at massive scale

February 14, 2023 | 3 minute read
Jag Brar
Vice President and Distinguished Engineer, OCI Networking Ops
Pradeep Vincent
Senior Vice President and Chief Technical Architect, OCI

Oracle Cloud Infrastructure (OCI) offers many unique services, including cluster network, an ultra-high performance network with support for remote direct memory access (RDMA). In our previous First Principles video blog, Building a High Performance Network in the Public Cloud, we explained how OCI's cluster network uses RDMA over Converged Ethernet (RoCE) on top of NVIDIA ConnectX RDMA NICs to support high-throughput and latency-sensitive workloads. In this blog we discuss how we have further enhanced our offering to support superclusters, which are designed to scale to tens of thousands of NVIDIA GPUs without compromising the performance that customers have come to expect from our networks. The following video highlights some of the technologies undergirding superclusters.
 

Superclusters

Figure 1 illustrates superclusters with RDMA network connectivity. Each GPU node has 8 NVIDIA A100 Tensor Core GPUs with a total of 1.6 Tbps (1,600 Gbps) of full-duplex connectivity to the network fabric. The network fabric is designed to be nonblocking and offers full bisection bandwidth to all hosts. (Bisection bandwidth is the minimum amount of bandwidth available between any two parts of the network.)
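
For a rough sense of the numbers, the sketch below divides the node's total connectivity evenly across its eight GPUs. The even split is an assumption for illustration only, not a statement about the actual NIC layout on the node.

```python
# A rough, illustrative split of one node's network bandwidth across its GPUs.
# The even split is an assumption for illustration; the actual NIC layout on
# the node is not described here.
NODE_BANDWIDTH_GBPS = 1600   # 1.6 Tbps full-duplex per node (from the text)
GPUS_PER_NODE = 8            # NVIDIA A100 GPUs per node (from the text)

per_gpu_gbps = NODE_BANDWIDTH_GBPS / GPUS_PER_NODE
print(f"Per-GPU share of node bandwidth: {per_gpu_gbps:.0f} Gbps")
# Per-GPU share of node bandwidth: 200 Gbps
```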

 

Figure 1: Superclusters with RDMA network connectivity


Figure 2 shows the topology of the network fabric. The network fabric uses a classic 3-tier Clos topology and easily scales to tens of thousands of GPUs. We explain how it scales in the above video.
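
For readers who want a feel for the scaling without watching the video, the sketch below applies the textbook fat-tree (folded Clos) formula, in which a three-tier fabric built from identical k-port switches supports k^3 / 4 hosts at full bisection bandwidth. The switch radixes are illustrative assumptions, not a description of OCI's actual hardware.

```python
# Textbook scaling of a three-tier fat-tree (folded Clos) built from identical
# k-port switches: it supports k**3 / 4 hosts at full bisection bandwidth.
# The radixes below are assumptions for illustration, not OCI's actual hardware.
def fat_tree_hosts(radix: int) -> int:
    """Maximum host count of a nonblocking 3-tier fat-tree of k-port switches."""
    return radix ** 3 // 4

for radix in (32, 64, 128):
    print(f"radix {radix:>3}: up to {fat_tree_hosts(radix):,} hosts")
# radix  32: up to 8,192 hosts
# radix  64: up to 65,536 hosts
# radix 128: up to 524,288 hosts
```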


Figure 2: Superclusters with RDMA network fabric
 

Ultra-high performance at scale

A supercluster is a massive network with multiple petabits per second of bisection bandwidth. Delivering the same ultra-high performance at this much larger scale required developing new techniques. OCI uses a multi-tier Clos topology to build a nonblocking network fabric that scales to tens of thousands of GPUs.
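
To see where "multiple petabits per second" comes from, here is a back-of-the-envelope calculation for a hypothetical cluster size. The 32,000-GPU figure is an assumption chosen only to land in the "tens of thousands" range, not a disclosed deployment size.

```python
# Back-of-the-envelope bisection bandwidth for a full-bisection fabric.
# The GPU count below is a hypothetical example; the text says only
# "tens of thousands of GPUs" and "multiple petabits per second".
NODE_BANDWIDTH_TBPS = 1.6       # per-node connectivity (from the text)
GPUS_PER_NODE = 8
example_gpu_count = 32_000      # hypothetical cluster size

nodes = example_gpu_count // GPUS_PER_NODE
# In a nonblocking fabric, half the nodes can send to the other half at line rate.
bisection_pbps = nodes / 2 * NODE_BANDWIDTH_TBPS / 1000
print(f"{nodes:,} nodes -> ~{bisection_pbps:.1f} Pbps of bisection bandwidth")
# 4,000 nodes -> ~3.2 Pbps of bisection bandwidth
```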

While a traditional GPU cluster fits within a few rows in one room of a datacenter, a large-scale supercluster can span multiple rooms (also known as data halls) within a building, or even multiple adjoining buildings in a datacenter complex. The cable distance between two GPUs can therefore be longer, which means some packets cross data halls and incur slightly higher latency. Even though the increase may be only a few microseconds, we wanted to minimize it. At OCI, we use two techniques to do so:

  1. Provide “placement hints” to customers so that workloads using these hints automatically experience lower average latency.
  2. Use “intelligent workload placement” to reduce the fiber distance between the GPUs a workload spans.

As a result of these two techniques, workloads tend to run across GPUs that are physically close to each other on the network, and customers experience the lowest possible latency.
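
The latency cost of extra fiber is easy to estimate from the standard figure of roughly 5 microseconds of one-way propagation delay per kilometer of optical fiber. The distances in the sketch below are hypothetical, chosen only to show why spanning data halls adds just a few microseconds.

```python
# Rough one-way propagation delay added by extra fiber distance, using the
# standard ~5 microseconds per kilometer figure for light in optical fiber.
# The distances are hypothetical examples, not measured OCI cable runs.
US_PER_KM = 5.0

for extra_meters in (100, 300, 1000):   # e.g., within a hall vs. across halls
    extra_us = extra_meters / 1000 * US_PER_KM
    print(f"+{extra_meters:>4} m of fiber -> ~{extra_us:.1f} us one-way")
# +100 m -> ~0.5 us, +300 m -> ~1.5 us, +1000 m -> ~5.0 us
```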

The network supports Quality of Service (QoS) and sets aside lossless queues for GPU traffic. These queues are further tuned to account for the longer cable distances, and therefore the slightly higher network latency, while still providing lossless networking.
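
Lossless Ethernet queues typically rely on Priority Flow Control (PFC), and the headroom buffer reserved for each lossless queue has to absorb everything still in flight on the cable after a pause frame is sent, which is why longer inter-hall cables matter. The sketch below is a simplified version of the standard headroom estimate with assumed link speed, MTU, and cable lengths; it is not OCI's actual configuration.

```python
# Simplified Priority Flow Control (PFC) headroom estimate for a lossless
# queue: the buffer must absorb everything still in flight after a pause
# frame is sent, which grows with cable length. Link speed, MTU, and cable
# lengths are illustrative assumptions, not OCI's actual configuration.
LINK_GBPS = 100          # per-port link speed (assumption)
MTU_BYTES = 4096         # worst-case frame size in flight (assumption)
US_PER_KM = 5.0          # one-way propagation delay in fiber

def headroom_bytes(cable_m: float) -> float:
    """Round-trip in-flight bytes plus one worst-case frame at each end."""
    rtt_us = 2 * (cable_m / 1000) * US_PER_KM
    in_flight = LINK_GBPS * 1e9 / 8 * rtt_us * 1e-6
    return in_flight + 2 * MTU_BYTES

for meters in (30, 300, 1000):
    print(f"{meters:>4} m cable -> ~{headroom_bytes(meters) / 1024:.0f} KiB headroom")
# Longer inter-hall cables need proportionally more headroom to stay lossless.
```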

Conclusion

OCI has taken great care to prioritize workload performance and to avoid cutting corners at its expense. Our goal has been, and remains, to deliver the highest performance at the largest possible scale. Customers can expect continued innovation from OCI in this space. To find similar architectural deep dives and insights, visit our First Principles homepage.

 

Related Blogs:

What’s next: training large language AI models at quintillions of operations on OCI

Oracle and NVIDIA solve the largest AI and NLP models

Running Applications on Oracle Cloud Using Cluster Networking

Large Clusters, Lowest Latency: Cluster Networking on Oracle Cloud Infrastructure

Democratizing AI infrastructure with OCI 

Jag Brar

Vice President and Distinguished Engineer, OCI Networking Ops

Jag Brar is Vice President and Distinguished Engineer at Oracle Cloud Infrastructure (OCI). He has over 25 years of experience in the field, including key technical roles at Oracle, Amazon, Arista Networks, AT&T, Telus, Electric Lightwave, and Terabeam Networks. Jag has been with OCI for almost eight years, where he has played critical roles across the Physical Networking, Virtual Networking, Compute, Database, and Datacenter Engineering fields. He is an inventor or co-inventor on over 50 patent applications.

Pradeep Vincent

Senior Vice President and Chief Technical Architect, OCI

Pradeep Vincent is the Chief Technical Architect and Senior Vice President at Oracle Cloud Infrastructure (OCI). He is a technology and software architect with more than 20 years of experience in tech companies such as Oracle, AWS, and IBM. He has a deep understanding of Cloud Infrastructure, Compute, Storage and Networking. Pradeep has been with Oracle for more than eight years leading a team of architects and software engineers building Oracle’s Public Cloud. He also leads OCI’s Architecture and Engineering Community initiatives. 

