Massively parallel GPU cluster computing for the largest AI and HPC jobs

November 14, 2022 | 5 minute read
Kevin Jorissen
Distinguished HPC Cloud Architect

Let’s look at performance measurements of NVIDIA GPU clusters in Oracle Cloud Infrastructure (OCI). Prototypical benchmarks specific to each type of AI workload exist, from computer vision to recommendation engines to natural language processing (NLP). In this blog, we focus on a more universal benchmark of cluster network throughput and run the NVIDIA Magnum IO Collective Communications Library (NCCL) benchmark, a commonly used and well-understood test.

NVIDIA Magnum IO NCCL benchmark on OCI NVIDIA GPU clusters

NCCL is a key library of multi-GPU collective communication primitives that are topology-aware and can be easily integrated into applications. It is widely used in ML applications and critical to large-scale performance. The library comes with a benchmark suite that shares its name. Each test executes an operation such as AllReduce or AllGather across the cluster and reports the observed bandwidth, both in absolute numbers and relative to the theoretical peak algorithm bandwidth. We'll show the latter ("bus bandwidth") as a good measure of the efficiency of the HPC platform.
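As a concrete illustration, the nccl-tests benchmark derives bus bandwidth from the measured algorithm bandwidth; for AllReduce the conversion factor is 2(n-1)/n for n ranks. The rank count and bandwidth figure below are hypothetical, chosen only to show the arithmetic:

```shell
# nccl-tests converts AllReduce algorithm bandwidth (algbw) to bus
# bandwidth (busbw) with busbw = algbw * 2 * (n - 1) / n for n ranks.
# Hypothetical illustration: 8 ranks (one node) at 100 GB/s algbw.
busbw=$(awk -v n=8 -v a=100 'BEGIN { printf "%.2f", a * 2 * (n - 1) / n }')
echo "$busbw GB/s"   # prints "175.00 GB/s"
```

For large rank counts the factor approaches 2, which is why bus bandwidth is the fairer cross-cluster comparison.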

We ran the benchmark on a cluster of 64 BM.GPU4.8 nodes, each of which contains eight NVIDIA A100 40-GB Tensor Core GPUs. While the benchmark tracks bandwidth for a range of message sizes, it's typically visualized as the maximum bandwidth (at large data size, here 8 GB) versus cluster size:

A bar graph comparing the NCCL performance for NVIDIA A100 40GB.
Figure 1: NCCL for NVIDIA A100 40GB on OCI, maximum bandwidth vs number of cluster nodes.  E.g., 64 nodes = 512x A100 GPUs.

We find the same results on the newer BM.GPU.GM4.8 shape, which has eight NVIDIA A100 80-GB Tensor Core GPUs per node, twice as much GPU memory per GPU as the A100 40 GB above.

What it means for Oracle’s clusters and your machine learning HPC workloads

The theoretical bandwidth limit is 1,600 Gb/sec, based on the available RDMA network: 2 x 8 x 100 Gb/sec. A strong HPC platform should allow applications to utilize the full processing power of its nodes even when running at scale. It's excellent to be so close to the theoretical limit while scaled to hundreds of GPUs. To the best of our knowledge, no cloud provider has posted better results in absolute terms (speed) or as a percentage of the theoretical maximum (scaling efficiency). In other words, OCI's GPU clusters can scale linearly to hundreds of GPUs for the largest AI/ML and HPC problems.
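The peak figure quoted above follows from simple arithmetic on the per-node RDMA fabric described in the text (2 ports x 8 NICs x 100 Gb/sec), converted to GB/sec for comparison with NCCL's reported numbers:

```shell
# Per-node RDMA capacity from the text: 2 ports x 8 NICs x 100 Gb/s.
gbits=$((2 * 8 * 100))
echo "${gbits} Gb/s"        # prints "1600 Gb/s"
# Divide by 8 to compare with NCCL's GB/s reporting:
echo "$((gbits / 8)) GB/s"  # prints "200 GB/s"
```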

OCI designed its HPC platform to “do the hard jobs well,” because we focus on mission-critical production HPC workloads of demanding enterprise customers. Our foundation is bare metal servers with OCI Cluster Network designed for RDMA. Bare metal means there's no hypervisor that can affect performance and introduce unwanted performance fluctuations (jitter), which are bad news when you want a hundred nodes to work closely in tandem.

With the recently announced Oracle-NVIDIA partnership, OCI intends to stay at the forefront of developments in large-scale GPU computing in the years ahead, from ever-increasing raw compute power to accelerated full-stack artificial intelligence and machine learning platform offerings.

How to configure NCCL for optimal benchmark performance

The NCCL algorithms are influenced by various configuration parameters, and the optimal values depend on your infrastructure. On a cluster of A100-based OCI GPU servers, NCCL works properly using default flags, but for optimal performance, we recommend the following tuning:

 if [[ "$shape" == "BM.GPU.B4.8" || "$shape" == "BM.GPU.GM4.8" ]]; then
   var_NCCL_IB_HCA="=mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_14,mlx5_15,mlx5_16,mlx5_17,mlx5_9,mlx5_10,mlx5_11,mlx5_12"
 elif [[ "$shape" == "BM.GPU4.8" ]]; then
   var_NCCL_IB_HCA="=mlx5_0,mlx5_2,mlx5_6,mlx5_8,mlx5_10,mlx5_12,mlx5_14,mlx5_16,mlx5_1,mlx5_3,mlx5_7,mlx5_9,mlx5_11,mlx5_13,mlx5_15,mlx5_17"
 fi

 mpirun --mca pml ucx \
   --bind-to numa \
   -x NCCL_IB_SL=0 \
   -x NCCL_IB_TC=41 \
   -x UCX_TLS=tcp \
   --mca coll_hcoll_enable 0 \
   -x NCCL_IB_GID_INDEX=3 \
   -x NCCL_IB_HCA="${var_NCCL_IB_HCA}" \
   --np $np --hostfile $hostfile -N 8 \
   /home/opc/nccl-tests/build/all_reduce_perf -b 1G -e 10G -i $((1024*1024*1024*9)) -n $iterations
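If you don't already have the benchmark binary on your nodes, it can be built from NVIDIA's public nccl-tests repository. This is a sketch; the install path and the MPI and CUDA locations are assumptions to adjust for your environment:

```shell
# Build the NCCL tests with MPI support (paths below are assumptions).
git clone https://github.com/NVIDIA/nccl-tests.git /home/opc/nccl-tests
cd /home/opc/nccl-tests
make MPI=1 MPI_HOME=/usr/mpi/gcc/openmpi-4.1.2 CUDA_HOME=/usr/local/cuda
# The benchmark binaries land in build/, e.g. build/all_reduce_perf.
```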

The following settings aren’t default:


    The NCCL_IB_HCA variable specifies which RDMA interfaces to use for communication. Set on the shape basis to the RDMA interfaces.

    Consult the ibdev2netdev output to translate linux to mlx5_x interface names.

    • BM.GPU4.8: "=mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_14,mlx5_15,mlx5_16,mlx5_17,mlx5_9,mlx5_10,mlx5_11,mlx5_12"

    • BM.GPU.B4.8 and BM.GPU.GM4.8: "=mlx5_0,mlx5_2,mlx5_6,mlx5_8,mlx5_10,mlx5_12,mlx5_14,mlx5_16,mlx5_1,mlx5_3,mlx5_7,mlx5_9,mlx5_11,mlx5_13,mlx5_15,mlx5_17"
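To assemble such a list on your own nodes, the ibdev2netdev output can be filtered mechanically. A minimal sketch using hypothetical sample output, since the actual device names vary per shape:

```shell
# Hypothetical ibdev2netdev output (real lines look like
# "mlx5_1 port 1 ==> ens300f0 (Up)"); adjust for your nodes.
sample='mlx5_1 port 1 ==> ens300f0 (Up)
mlx5_2 port 1 ==> ens300f1 (Up)'
# Join the device names into the comma-separated form NCCL_IB_HCA expects:
echo "$sample" | awk '{ printf "%s%s", sep, $1; sep="," }'
# prints "mlx5_1,mlx5_2"
```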

    NCCL_IB_TC: Defines the InfiniBand traffic class field.

    The default value is 0.

    OCI suggested value: 41 or 105.

    NCCL_IB_SL: Defines the InfiniBand service level.

    The default value is 0.

    OCI suggested value: 0 or 5.

    NCCL_IB_QPS_PER_CONNECTION: Number of IB queue pairs to use for each connection between two ranks. Useful on multilevel fabrics, which need multiple queue pairs for good routing entropy. Each message, regardless of its size, is split into N parts, with one part sent on each queue pair. Increasing this number can increase latency and reduce bandwidth.

    The default value is 1.

    OCI suggested value: 4.

    NCCL_IB_GID_INDEX: Defines the Global ID index used in RoCE mode. Consult the output of the InfiniBand show_gids command to determine this value.

    The default value is 0.

    OCI suggested value: 3.
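The GID index can likewise be read mechanically from show_gids output. A sketch using hypothetical sample lines; the exact column layout of show_gids on your install is an assumption (the RoCE version column typically reads v1 or v2):

```shell
# Hypothetical show_gids-style lines: device, port, index, GID,
# optional IPv4, RoCE version, netdev. Real output may differ.
sample='mlx5_1 1 2 fe80::1 v1 ens300f0
mlx5_1 1 3 ::ffff:10.0.0.5 10.0.0.5 v2 ens300f0'
# Print the index of the RoCE v2 entry for NCCL_IB_GID_INDEX:
echo "$sample" | awk '$(NF-1) == "v2" { print $3 }'
# prints "3"
```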


Bare metal servers and OCI Cluster Network are used by ML customers and are increasingly adopted for non-ML GPU workloads such as molecular dynamics, computer-aided engineering (CAE), and weather prediction.

Come see our Exhibitor’s Talk at Supercomputing ’22 this week for an overview of Oracle’s lineup of NVIDIA GPU-based shapes (the NVIDIA A100 40 GB, NVIDIA A100 80 GB, and NVIDIA A10 Tensor Core GPUs, older-generation GPUs, and, announced for 2023, the NVIDIA H100 Tensor Core GPU). We also have an overview of Aleph Alpha, whose five-language, GPT-3-like model has up to 300 billion machine learning parameters and offers visual understanding in full multimodality. Stop by our booth in the exhibition hall for a cup of coffee and a chat about large-scale computing. See you in Dallas!

