Run Lightning Fabric with NVIDIA GPUs on OCI

October 27, 2023 | 4 minute read
Dhvani Sheth
Solution Architect

Oracle Cloud Infrastructure (OCI) offers bare metal and virtual machine instances with a variety of NVIDIA GPUs across many instance types. You can use GPUs on OCI for a wide range of use cases, including graphics rendering, video editing, the most demanding artificial intelligence (AI) training and inference workloads, high-performance computing (HPC), analytics, and data science. In this blog, we show how you can run a Lightning Fabric job on OCI’s NVIDIA GPU instances.

Lightning Fabric is an open-source library that lets you easily scale models while keeping full control over your training loop. It is quick to adopt and highly flexible: you can convert existing PyTorch code to Lightning Fabric by changing only a few lines and scale up to the largest billion-parameter models.
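
To give a sense of what that conversion looks like, the following minimal sketch uses a placeholder model and random data (not part of this post's example): Fabric takes over device placement and distributed setup, and fabric.backward() replaces loss.backward(). The devices and num_nodes values match the cluster used later in this post; adjust them for your environment.

# Illustrative sketch only: a toy PyTorch loop converted to Lightning Fabric.
# The model, data, and hyperparameters are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset
from lightning.fabric import Fabric

fabric = Fabric(accelerator="gpu", devices=8, num_nodes=4)  # match your cluster; use devices=1, num_nodes=1 on a single GPU
fabric.launch()

model = torch.nn.Linear(32, 2)                             # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = fabric.setup(model, optimizer)          # Fabric handles device placement and distributed wrapping

dataset = TensorDataset(torch.randn(64, 32), torch.randn(64, 2))
dataloader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=8))  # adds the distributed sampler

for batch, target in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(batch), target)
    fabric.backward(loss)                                  # replaces loss.backward()
    optimizer.step()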

Lightning Fabric on OCI

This guide helps you create an NVIDIA A100 Tensor Core GPU cluster with Slurm from the Oracle Cloud Marketplace stack (version v2.10.2.1 or later). Slurm 23.02 introduced the TopologyParam=SwitchAsNodeRank option, which reorders nodes based on the switch layout. This option is useful when the node naming convention doesn’t naturally map to the network topology, so we recommend using Slurm version 23.02 or greater, which the Marketplace stack installs and configures for you (see the configuration excerpt below).
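
For reference, the relevant lines in slurm.conf look roughly like the following. The Marketplace stack generates this configuration (along with topology.conf) for you, so treat it as an illustration rather than something you need to write by hand.

# Excerpt from slurm.conf (illustrative; generated by the stack)
TopologyPlugin=topology/tree
TopologyParam=SwitchAsNodeRank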

You can use Anaconda to create an environment and install the required packages. We tested the following example on GPU machines with NVIDIA driver version 515.105.01 and CUDA version 11.7, using the package versions installed in the steps below.

Download and install Anaconda with the following commands:

  
wget https://repo.anaconda.com/archive/Anaconda3-2023.07-2-Linux-x86_64.sh
sudo chmod 755 Anaconda3-2023.07-2-Linux-x86_64.sh
./Anaconda3-2023.07-2-Linux-x86_64.sh

Activate and initialize Conda with the following commands, replacing the placeholders with your Anaconda install location and shell:

  
eval "$(YOUR_ANACONDA3_LOCATION_FROM_ABOVE/anaconda3/bin/conda shell.YOUR_SHELL_NAME hook)"
conda init

 

For these changes to take effect, close and reopen your shell. Then create a Conda environment and install the required packages.

  
conda create -n test-env
conda activate test-env
conda install python=3.8.10
# Make sure https://anaconda.org/anaconda/absl-py still references v1.4.0; otherwise, install version 1.4.0 explicitly (absl-py-1.4.0).
conda install -c anaconda absl-py
# On Oracle Linux:
python3 -m pip install lightning==2.0.8
# On Ubuntu:
pip install lightning==2.0.8

 

Don’t use Conda to install Lightning; use pip instead.
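
Before moving on, you can optionally sanity-check the environment. The following commands aren’t part of the original steps, just a quick verification that the expected versions are installed:

python3 -c "import lightning; print(lightning.__version__)"   # expect 2.0.8
python3 -m pip show absl-py | grep Version                    # expect 1.4.0
nvidia-smi                                                     # shows the NVIDIA driver and CUDA versions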

The following sample sbatch file demonstrates the NVIDIA NCCL parameters you need to run your GPU workload efficiently on OCI. The NCCL_IB_HCA variable specifies which RDMA interfaces NCCL uses for communication; both it and the UCX_NET_DEVICES variable are set based on the GPU shape.

lightning.sbatch

  
#!/bin/bash
#SBATCH --job-name=lightning_example
#SBATCH --nodes=4                        # This needs to match Fabric(num_nodes=…)
#SBATCH --ntasks-per-node=8      # This needs to match Fabric(devices=...)
#SBATCH --gpus-per-node=8      # Request N GPUs per machine which should also match --ntasks-per-node
#SBATCH --exclusive
export PMI_DEBUG=1

MACHINEFILE="hostfile"
scontrol show hostnames $SLURM_JOB_NODELIST > $MACHINEFILE
echo $MACHINEFILE
cat $MACHINEFILE

MPIVARS_PATH=`ls /usr/mpi/gcc/openmpi-*/bin/mpivars.sh`
if [[ "$MPIVARS_PATH" == "" ]]; then
    MPIVARS_PATH=`ls /opt/openmpi-*/bin/mpivars.sh`
fi
if [[ "$MPIVARS_PATH" == "" ]]; then
    echo "Could not find MPIPATH"; exit; fi
source $MPIVARS_PATH

# Determine the instance shape from the OCI instance metadata service. jq returns the
# shape as a quoted string, which is why the comparisons below include escaped quotes.
shape=$(curl -sH "Authorization: Bearer Oracle" -L http://169.254.169.254/opc/v2/instance/ | jq .shape)

# Based on which GPU shape you are using, choose the variables below.
if [ $shape == \"BM.GPU.B4.8\" ] || [ $shape == \"BM.GPU.A100-v2.8\" ]
then
  var_UCX_NET_DEVICES=mlx5_0:1
  var_NCCL_IB_HCA="=mlx5_5,mlx5_6,mlx5_7,mlx5_8,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_14,mlx5_15,mlx5_16,mlx5_17,mlx5_9,mlx5_10,mlx5_11,mlx5_12"
elif [ $shape == \"BM.GPU4.8\" ]
then
  var_UCX_NET_DEVICES=mlx5_4:1
  var_NCCL_IB_HCA="=mlx5_0,mlx5_2,mlx5_6,mlx5_8,mlx5_10,mlx5_12,mlx5_14,mlx5_16,mlx5_1,mlx5_3,mlx5_7,mlx5_9,mlx5_11,mlx5_13,mlx5_15,mlx5_17"
fi

export OMPI_MCA_coll=^hcoll \
  NCCL_DEBUG=INFO \
  NCCL_IB_SL=0 \
  NCCL_IB_TC=41 \
  NCCL_IB_QPS_PER_CONNECTION=4 \
  UCX_TLS=ud,self,sm \
  UCX_NET_DEVICES=${var_UCX_NET_DEVICES} \
  HCOLL_ENABLE_MCAST_ALL=0 \
  OMPI_MCA_coll_hcoll_enable=0 \
  NCCL_IB_GID_INDEX=3 \
  NCCL_IB_HCA="${var_NCCL_IB_HCA}"

srun --mpi=pmi2 python3 fabric.py

The following sample Python file launches Lightning Fabric and logs the hostname, global rank, world size, local rank, and node rank, so you can verify that the GPUs are recognized and can be used in your program.

fabric.py

  
import socket
from typing import Sequence
from absl import app, logging
from lightning import fabric as lightning_fabric

def main(argv: Sequence[str]):
    del argv  # Unused.
    fabric = lightning_fabric.Fabric(precision="bf16-mixed",
                                     accelerator="gpu",
                                     devices=8,
                                     num_nodes=4)
    fabric.launch()
    logging.info("Hostname: %s, Global Rank: %s, World Size: %s, Local Rank: %s, Node Rank: %s", socket.gethostname(), fabric.global_rank, fabric.world_size, fabric.local_rank, fabric.node_rank)
if __name__ == '__main__':
    app.run(main)
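
With lightning.sbatch and fabric.py in the same directory, and the Conda environment active on the compute nodes, you can submit and verify the job as follows (the output file name assumes Slurm’s default slurm-<jobid>.out naming):

sbatch lightning.sbatch          # submit the job
squeue --me                      # confirm the job is running on 4 nodes
grep "Node Rank" slurm-*.out     # expect one log line per rank (32 in total for this example)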

Conclusion

You can quickly get started creating a GPU cluster on Oracle Cloud Infrastructure and running Lightning Fabric, which includes support for FP8 on NVIDIA H100 Tensor Core GPUs, for your accelerated computing workloads. For more information, see the Lightning Fabric and Oracle Cloud Infrastructure documentation.


