Microsoft Azure’s high-performance computing (HPC) and AI infrastructure is designed from the ground up to support the world’s most demanding workloads. High-performance AI workloads are bandwidth-hungry and latency-sensitive. As models scale in size and complexity, the efficiency of the interconnect fabric—how CPUs, GPUs, and storage communicate—becomes a critical factor in overall system performance. Even with the fastest GPUs, poor interconnect design can lead to bottlenecks, underutilized hardware, and extended time-to-results. In this blog post, we highlight one of the key enabling features for running large-scale distributed workloads on Azure: a highly tuned HPC-class interconnect. Azure has invested years of system-level engineering in its InfiniBand interconnect and packaged that work into ready-to-use configurations, available to customers on Azure’s HB-series and N-series virtual machines (VMs).
by Hugo Affaticati (Cloud Infrastructure Engineer), Amirreza Rastegari (Senior Software Engineer), Jie Zhang (Principal Software Engineer), and Michael Ringenburg (Principal Software Engineer Manager).
Interconnects enable communication between VMs (also referred to as nodes) in a cluster and allow large workloads to execute across multiple VMs. Interconnect performance is typically characterized by two metrics: bandwidth and latency. Bandwidth measures the amount of data (or traffic) that can pass through the links of the interconnect – think the number of lanes on a freeway. Latency measures how long it takes an individual message to transit the interconnect – think the length and speed limit of a freeway. High-performance computing interconnects, like the InfiniBand interconnect available in Azure, provide both high bandwidth and low latency, and support remote direct memory access (RDMA), which improves the scalability of workloads by eliminating the need to copy data into intermediate buffers.
In this blog, we describe the industry-standard tests that you can run to understand the benefits of HPC interconnects. We first describe the NCCL benchmarks, which directly measure the performance of various communication patterns such as all-reduce collectives. We then describe some AI training workloads that illustrate the benefits of high-performance interconnects.
NCCL Benchmarks
The NVIDIA Collective Communications Library (NCCL) is a standalone library that provides common GPU communication operations such as all-reduce, all-gather, reduce, broadcast, reduce-scatter, and point-to-point send/receive. It is optimized for high bandwidth on systems with PCIe, NVLink, NVSwitch, or network interfaces like InfiniBand Verbs and TCP/IP sockets. NCCL can handle any number of GPUs within a single machine or across multiple machines and is compatible with both single-process and multi-process (e.g., MPI-based) applications.
NCCL-tests is a comprehensive benchmarking and testing suite for NCCL. The suite provides tools to validate both the performance and correctness of NCCL operations across multiple GPUs, nodes, and network configurations. To ensure optimal performance, it's critical to utilize a system-specific topology file, which helps NCCL adapt to the underlying hardware layout and communication pathways.
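On the Azure HPC/AI images, the NCCL-tests suite comes pre-built under /opt/nccl-tests. If you are working from a different image, a minimal build sketch looks like the following; the MPI and CUDA paths are illustrative and depend on your installation:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/opt/hpcx/ompi CUDA_HOME=/usr/local/cuda
# binaries such as all_reduce_perf are produced under ./build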
NCCL Topology File
Application developers can pass in the system topology by specifying the NCCL topology file while launching the job. The NCCL topology file passes in the following information to the NCCL library:
- GPU-to-NUMA mapping
- GPU-to-HCA, HCA-to-NUMA mapping
- NUMA-to-CPU-core mapping
- Speed and type of GPU-GPU interconnect.
This enables NCCL to choose the most efficient communication paths for the system topology. Azure ND-series VMs, like the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. Drawing from extensive research and experience, we have refined the Azure AI and HPC Marketplace images with the appropriate topology files for all VM series, readily available in the /opt/microsoft directory. While we recommend that customers use the pre-configured images, the topology files can also be found in the azhpc-images GitHub repository under the topology directory.
When an Azure HPC/AI VM image is used, NCCL automatically picks up the correct topology file based on the underlying VM SKU. However, if a non-HPC/AI VM image or a container is used, make sure the correct NCCL topology file is present and referenced in the running environment. A detailed blog post on how to optimize NCCL parameters for best performance for AI training workloads with topology files on Azure can be found at Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology file | Microsoft Community Hub.
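For example, when launching from a container, one option is to mount the host’s topology directory into the container and point NCCL at the right file explicitly. The following is only a sketch: the image name, command, and SKU are placeholders to adapt to your setup.
export SKU=ndv5                                        # adjust to your VM series
export NCCL_TOPO_FILE=/opt/microsoft/${SKU}-topo.xml
docker run --gpus all --net=host \
  -v /opt/microsoft:/opt/microsoft:ro \
  -e NCCL_TOPO_FILE \
  <your-training-image> <your-training-command>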
Recommended NCCL Parameters
NCCL-tests are typically parallel jobs launched using MPI. MPI libraries (OpenMPI, HPC-X, MVAPICH2, etc.) come pre-packaged in the HPC/AI VM images. To load MPI (e.g., HPC-X) into the current environment, run the following:
module load mpi/hpcx
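To confirm the module loaded and that mpirun resolves to the HPC-X build, a quick check is:
module list        # mpi/hpcx should appear among the loaded modules
which mpirun       # should point into the HPC-X installation
mpirun --version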
Once the MPI environment is loaded, NCCL-tests can be launched using mpirun commands. The following subsections highlight the recommended NCCL command line options for best performance for single/multi-node NCCL runs on different SKU types.
Recommended command lines for NDv4/NDv5
Single Node with NVLink (NDv4/NDv5)
A basic command line that will perform optimally on a single node of our production SKUs is provided in the README of the azhpc-images directory for that specific image; see here.
This is the current recommendation for a single node running the Ubuntu 22.04 image on NDv4 or NDv5:
mpirun -np 8 \
--bind-to numa --report-bindings \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=tcp \
-x UCX_NET_DEVICES=eth0 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=WARN \
-x NCCL_NVLS_ENABLE=1 \
/opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G
This can be used to sanity check NVLink health and performance on a given node.
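Independently of NCCL, the NVLink topology and per-link status of a node can be inspected with nvidia-smi, which is a useful first check if the numbers above look low:
nvidia-smi topo -m           # GPU-to-GPU connectivity (NV* entries) and NUMA affinity
nvidia-smi nvlink --status   # per-GPU NVLink state and speed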
Single Node IB only for NDv4/NDv5
mpirun -np 8 \
--bind-to numa --report-bindings \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=tcp \
-x UCX_NET_DEVICES=eth0 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=WARN \
-x NCCL_SHM_DISABLE=1 \
-x NCCL_P2P_DISABLE=1 \
-x NCCL_NVLS_ENABLE=0 \
/opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G
This can be used to sanity check all the NIC health and performance on a given node.
Multi-Node for NDv4/NDv5 on the AZHPC images
mpirun \
-np $(( NODES * 8 )) \
-hostfile $HOSTFILE \
--bind-to numa \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=rc \
-x UCX_NET_DEVICES=mlx5_ib0:1 \
-x NCCL_IB_HCA=^mlx5_an0:1 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=warn \
-x NCCL_MIN_NCHANNELS=32 \
-x NCCL_NET_GDR_LEVEL=5 \
-x NCCL_TOPO_FILE=/opt/microsoft/${SKU}-topo.xml \
-x NCCL_NVLS_ENABLE=1 \
$sharp_args \
$ALL_REDUCE_PERF -f 2 -g1 -b1K -e 4G
where sharp_args can be set to the following to enable InfiniBand SHARP:
sharp_args="-x NCCL_COLLNET_ENABLE=1 \
-x NCCL_ALGO=CollnetDirect,CollnetChain,NVLS \
-x SHARP_COLL_ENABLE_SAT=1 \
-x SHARP_COLL_LOG_LEVEL=3 \
-x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1"
Key Environment Variables
The full list of NCCL environment variables can be found here. A few that appear in the example command lines above deserve particular attention.
NCCL_P2P_DISABLE
Disables peer-to-peer communication directly between GPUs using NVLink or PCI. Documentation
NCCL_SHM_DISABLE
Disables the shared memory transports that will use host memory when peer-to-peer cannot occur. Documentation
NCCL_NVLS_ENABLE
Enables NVLink SHARP (NVLS), which offloads collective operations to the NVSwitch. Documentation
Note that we typically use MPI to launch the NCCL jobs. MPI- and UCX-related environment variables control MPI's own communication, not the communication NCCL itself performs. OpenMPI's environment variables for interacting with UCX are most clearly described in its FAQ, here and here.
UCX_TLS
Picks the transports MPI/UCX is allowed to use or is excluded from using. Documentation
Expected performance
To verify the anticipated bandwidth using NCCL-tests, we refer to Bus Bandwidth (BusBw). Bus Bandwidth adjusts the algorithm's bandwidth to reflect hardware utilization, regardless of the number of ranks or the specific collective operation, enabling direct comparison with the theoretical peak bandwidth of the underlying hardware.
Generally, BusBw should be limited by the minimum of NVLink bandwidth, including NVLink SHARP (NVLS) if supported by NVSwitch, and network bandwidth. The following section outlines the expected BusBw using NCCL Allreduce as an example, with NCCL 2.26.5 and CUDA 12.6.
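As a practical note, the nccl-tests binaries print a busbw column for every message size, as well as a summary line at the end of the run; capturing the output to a log makes it easy to pull these values out. The exact formatting can vary slightly between nccl-tests versions:
mpirun <options as above> /opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G | tee allreduce.log
grep "Avg bus bandwidth" allreduce.log    # summary line printed by nccl-tests
grep " 4294967296 " allreduce.log         # per-size row for the largest (4 GB) message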
NDv5
GPU-to-GPU NVLink bandwidth is 900GB/s bi-directional and 450GB/s uni-directional.
Each CX7 HCA provides 50GB/s uni-directional bandwidth, so the maximum network bandwidth we can achieve from a node (with 8 CX7 HCAs) is 50*8=400GB/s.
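A quick way to confirm the per-node HCA count and link rate (assuming the standard InfiniBand diagnostics such as ibstat are available, as they typically are on the Azure HPC/AI images) is:
ibstat | grep -E "CA '|Rate:"    # expect eight InfiniBand ports reporting Rate: 400
ibv_devinfo -l                   # lists the RDMA devices; the mlx5_an0 accelerated-networking NIC also appears here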
Single-node:
- NVLS: 470+GB/s. Compared to native NVLink, we observe higher BusBw due to NVLink SHARP, a hardware feature that offloads arithmetic operations onto NVSwitches, reducing data transfers.
- Loopback: 45+GB/s, constrained by the bandwidth of a single HCA.
Table 1. NCCL Allreduce BusBw with Single NDv5 at 16GB message size with NVLS and IBLoopback, respectively
Message Size (Byte) | BusBw-NVLS (GB/s) | BusBw-IBLoopback (GB/s)
16G                 | 481.42            | 48.69
Multi-node:
Since a single node with NVLS delivers 470GB/s BusBw, multi-node BusBw is now limited by the network bandwidth (400GB/s). We should expect 330GB/s for multi-node BusBw by default, considering the fact that there is no PCIe switch in NDv5 and NCCL consumes additional NVLink bandwidth to complete the ring. With improvements in the NVLSTree algorithm, BusBw is expected to reach around 360GB/s.
Table 2. NCCL Allreduce BusBw with multiple NDv5 VMs (different numbers of H100 GPUs) at 16GB message size
Msg Size (Byte) | BusBw with different numbers of H100 GPUs (GB/s)
                | 32     | 64     | 128    | 256    | 512    | 1024
16G             | 361.06 | 366.12 | 367.39 | 365.51 | 366.63 | 354.2
Figure 1. Bus bandwidth (in GB/s) of a 16GB message on 4 to 256 Azure ND_H100_v5 virtual machines.
AI Training Workloads
To run the open-source training model used in this study, begin by setting up a CycleCloud cluster with your preferred workload scheduler. In this guide, we use the SLURM scheduler. Detailed instructions for setting up a CycleCloud cluster with SLURM in Azure can be found here [Click for the Link].
This guide outlines the steps to benchmark training performance with the Grok-1 model published by xAI [Click for the Link]. Grok-1 is a 314B parameter Large Language Model (LLM) based on a Mixture of Experts (MoE) architecture. It features 8 experts, 64 layers, 48 attention heads (48 for queries, 8 for key/values), and 6144-dimensional internal embeddings, with support for 8-bit quantization and activation caching. The model, along with instructions for running on H100 and GH200 GPUs, is available here [Click for the Link].
To run the model under realistic configurations, at least 512 GPUs are required. However, by employing modified proxy configurations, the model can be scaled down to run on as few as 8 GPUs, each with 80GB of memory. Proxy configurations reduce the model's memory footprint by shrinking certain parameters. Table 3 details the hyperparameters and model configurations corresponding to different GPU setups.
Table 3. Hyperparameters and model configuration for the Grok-1 314B parameter training workload. GPUs, SeqLen, Layers, TP, PP, CP, EP, DP, VP, MBS, GBS and GA denote the total number of GPUs, Sequence Length, Model Layers, Tensor Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism, Data Parallelism, Virtual Pipeline Parallelism, Micro Batch Size, Global Batch Size and Gradient Accumulation, respectively.
GPUs | SeqLen | Layers | TP | PP | CP | EP | DP | VP  | MBS | GBS  | GA
8    | 4096   | 2      | 4  | 1  | 1  | 2  | 1  | N/A | 1   | 1024 | 128
16   | 4096   | 4      | 4  | 1  | 1  | 4  | 1  | N/A | 1   | 1024 | 128
32   | 4096   | 4      | 4  | 1  | 1  | 8  | 1  | N/A | 1   | 1024 | 128
64   | 8192   | 8      | 4  | 1  | 2  | 8  | 1  | N/A | 1   | 1024 | 128
128  | 8192   | 16     | 4  | 2  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
256  | 8192   | 32     | 4  | 4  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
512  | 8192   | 64     | 4  | 8  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
1024 | 8192   | 64     | 4  | 8  | 2  | 8  | 2  | 8   | 1   | 2048 | 128
Here we share the instructions for running this distributed training workload on Azure’s H100/H200 VMs, with FP8/BF16 precisions.
- To begin, download the run scripts into your cluster from Grok1 314B 25.01 (DGXC Benchmarking) | NVIDIA NGC.
- Set up your stage path. This is the directory in which all files are written and results are saved. The stage path needs to be on a shared file system, accessible from all the VMs in your cluster.
export STAGE_PATH="<path to your work directory>"
For your CycleCloud cluster, you can use an Azure Managed Lustre File System [Click here for the Link] or Azure NetApp Files [Click here for the Link] as your shared file system.
- Run the setup script. This script downloads the container and necessary files from NVIDIA’s NGC website.
- Before running the setup.sh script, fix the following typo in the script. In line 32, change "cp -vf configure.sh launch.sh $STAGE_PATH" to "cp -vf configure.sh launch.sh $STAGE_PATH/cfg"
- Run sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N 1 ./setup.sh
- SBATCH_ACCOUNT is set to your Slurm account on the cluster. In most basic CycleCloud + Slurm setups, this field needs to be omitted.
- SBATCH_PARTITION refers to the partition on the cluster. In most CycleCloud clusters this is "hpc".
- Consult with your system admin to find the proper settings for these variables.
- The NVIDIA NGC Grok-1 container uses a synthetic dataset for training. There is no need to download a training dataset for these tests.
- To optimize the performance of Azure’s network, add the following NCCL environment variables in your shell
Table 4. NCCL optimizations for Azure environment for large scale distributed training.
Variable                   | Value                        | Description
NCCL_TOPO_FILE             | /opt/microsoft/ndv5-topo.xml | Ensures topology-aware CPU/NIC/GPU mapping
NCCL_P2P_NET_CHUNKSIZE     | 2097152                      | Increases P2P transfer granularity
NCCL_MIN_NCHANNELS         | 32                           | Improves throughput for collectives
NCCL_IB_QPS_PER_CONNECTION | 4                            | Uses multiple QP pairs to improve routing entropy
NCCL_PXN_DISABLE           | 1                            | Enables zero-copy design for P2P communications
NCCL_IGNORE_CPU_AFFINITY   | 1                            | Ensures NCCL uses GPU affinity only
- To run the workload, do: DTYPE=<fp8/bf16> sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N ${NUM_NODES} ./launch.sh (a sketch for sweeping several scales follows the launch.sh excerpt below)
i. DTYPE represents the datatype used for the training. This can be either "fp8" or "bf16".
ii. NUM_NODES can be determined from NUM_NODES=NUMBER_OF_GPUS/8, since Azure’s H100 and H200 VMs provide 8 GPUs per node.
iii. We recommend editing the "launch.sh" script with the following:
export NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml
export NCCL_P2P_NET_CHUNKSIZE=2097152
export NCCL_MIN_NCHANNELS=32
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_PXN_DISABLE=1
export NCCL_IGNORE_CPU_AFFINITY=1
srun \
--container-image "$IMAGE" \
--container-mounts ${NCCL_TOPO_FILE},$RESULT_DIR,$INDEX_MAPPING_DIR,$STAGE_PATH/cfg:/cfg,$STAGE_PATH/configure.sh:/gsw/configure.sh \
--container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE,NCCL_MIN_NCHANNELS,NCCL_IB_QPS_PER_CONNECTION,NCCL_PXN_DISABLE,NCCL_IGNORE_CPU_AFFINITY \
--cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,fff000000000000,fff000000000000000,fff000000000000000000,fff000000000000000000000" \
--container-writable \
--no-container-mount-home \
bash -c "source /gsw/configure.sh && launch"
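As a convenience, the scale points from Table 3 can be submitted in a single sweep. The loop below is only a sketch and assumes SBATCH_ACCOUNT and SBATCH_PARTITION are set as described above; 1 to 128 nodes corresponds to 8 to 1,024 GPUs.
for DTYPE in fp8 bf16; do
  for NUM_NODES in 1 2 4 8 16 32 64 128; do
    DTYPE=${DTYPE} sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N ${NUM_NODES} ./launch.sh
  done
done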
Performance is reported in terms of seconds per iteration, which corresponds to seconds per training step. This can be obtained from the "*.out" log files generated in the "$STAGE_PATH/results/$GSW_VERSION/$DTYPE/314b/$JOB_TOTAL_GPUS" directory. To extract the performance data from the log files, "grep" for "train_step_timing" over the "*.out" files in this directory: grep train_step_timing *.out
Since there is large variability in performance at the beginning of the training job, it is recommended to use the training performance from the last training steps. Given the Sequence Length (SeqLen) and Global Batch Size (GBS) listed in Table 3, the step time can then be used to calculate the overall training throughput, in terms of tokens processed per second:
Throughput (Tokens/Second) = SeqLen x GBS / (Seconds per Training Step)
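As a back-of-the-envelope check against Table 5, the 512-GPU FP8 configuration (SeqLen 8192, GBS 1024, roughly 17.9 seconds per step) works out as follows; the grep pattern is the one described above, and the exact log format may vary by container version.
grep train_step_timing *.out | tail -n 5           # use the last few reported steps
SEQLEN=8192; GBS=1024; STEP_TIME=17.9              # values from Table 3 and Table 5
echo "${SEQLEN} * ${GBS} / ${STEP_TIME}" | bc -l   # ≈ 468637 tokens/second, matching Table 5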
Table 5 and Table 6 demonstrate the performance results obtained with Grok-1 314B parameter training model on Azure’s ND_H100_v5 VMs, using FP8 and BF16 precision, respectively. These results are the ‘best performance’ selected from a set of several trials.
Table 5. Performance results obtained on Azure ND_H100_v5 VMs, in FP8 precision.
      | Total Number of H100 GPUs         | 8      | 16     | 32     | 64     | 128    | 256    | 512    | 1024
Azure | Training Step Time (Seconds/Step) | 19.4   | 17     | 8.39   | 16.6   | 17     | 16.8   | 17.9   | 18.4
      | Throughput (Tokens/Second)        | 216201 | 246724 | 499917 | 505338 | 493448 | 499322 | 468637 | 911805
Table 6. Performance results obtained on Azure ND_H100_v5 VMs, in BF16 precision.
      | Total Number of H100 GPUs         | 8      | 16     | 32     | 64     | 128    | 256    | 512    | 1024
Azure | Training Step Time (Seconds/Step) | 24.1   | 21.9   | 10.7   | 21     | 21.2   | 20.8   | 22     | 22.4
      | Throughput (Tokens/Second)        | 174038 | 191521 | 391991 | 399458 | 395689 | 403298 | 381300 | 748983
As previously discussed, training the Grok-1 model, which comprises 314 billion parameters, requires a minimum of 512 GPUs under realistic configurations. Using the performance data obtained for setups with 512 and 1,024 H100 GPUs, we calculated the scaling efficiency of distributed training with this model. Figure 2 shows the scaling efficiency achieved with the Grok-1 model on Azure’s ND_H100_v5 VMs, demonstrating scaling efficiencies higher than 97% for both the FP8 and BF16 precisions.
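Using the throughput figures from Table 5, the FP8 efficiency going from 512 to 1,024 GPUs can be checked directly by dividing the measured throughput by the ideally scaled (doubled) 512-GPU throughput:
T512=468637; T1024=911805                       # tokens/second from Table 5
echo "${T1024} / (2 * ${T512}) * 100" | bc -l   # ≈ 97.3% scaling efficiency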
Figure 2. Scaling Efficiency of distributed training using Grok-1 314B parameter model, observed on Azure ND_H100_v5 VMs.
These results are in line with previously reported performance benchmarks for large-scale distributed training and inference on Azure, as measured by MLPerf v3.1 results [Click for the Link]. Figure 3 highlights the performance of Azure ND_H100_v5 VMs compared to the bare-metal nodes of the EOS supercomputer. At large scale, using 1,344 VMs, corresponding to 10,752 H100 GPUs, Azure VMs exhibit a performance that is only 2.3% lower than that of EOS.
This effectively quantifies the minimal overhead introduced by Azure’s virtualization layer, showcasing its ability to deliver performance and scalability on par with bare-metal supercomputers at scale. This is only made possible by Azure’s reliance on high-performance interconnects, with our backend InfiniBand network being a distinguishing factor.
Figure 3. Performance of Azure ND_H100_v5 for distributed training, as measured by the MLPerf v3.1 benchmarks, in comparison to the bare-metal nodes of NVIDIA's EOS supercomputer.
Wrapping Up
Demanding High-Performance Computing and Artificial Intelligence (HPC & AI) workloads require careful design of the underlying infrastructure to ensure optimal performance and efficiency. In this blog, we’ve examined the impact of high-performance interconnects like the InfiniBand interconnect available with Azure’s HPC & AI virtual machines. We’ve shown the benefits of the improved latency and bandwidth provided by InfiniBand for real workloads and provided instructions to allow you to reproduce these results on Azure Virtual Machines.