Microsoft Azure’s high-performance computing (HPC) and AI infrastructure is designed from the ground up to support the world’s most demanding workloads. High-performance AI workloads are bandwidth-hungry and latency-sensitive. As models scale in size and complexity, the efficiency of the interconnect fabric—how CPUs, GPUs, and storage communicate—becomes a critical factor in overall system performance. Even with the fastest GPUs, poor interconnect design can lead to bottlenecks, underutilized hardware, and extended time-to-results. In this blog post, we highlight one of the key enabling features for running large-scale distributed workloads on Azure: a highly tuned HPC-class interconnect. Azure has invested years of system-level engineering in its InfiniBand interconnect and packaged that work into ready-to-use configurations, available to customers on Azure’s HB-series and N-series virtual machines (VMs).
by Hugo Affaticati (Cloud Infrastructure Engineer), Amirreza Rastegari (Senior Software Engineer), Jie Zhang (Principal Software Engineer), and Michael Ringenburg (Principal Software Engineer Manager).
Interconnects enable communication between VMs (also referred to as nodes) in a cluster and allow large workloads to execute across multiple VMs. Interconnect performance is typically characterized by two metrics: bandwidth and latency. Bandwidth measures the amount of data (or traffic) that can pass through the links of the interconnect – think the number of lanes on a freeway. Latency measures how long it takes an individual message to transit the interconnect – think the length and speed limit of a freeway. High-performance computing interconnects, like the InfiniBand interconnect available in Azure, provide both high bandwidth and low latency, and support remote direct memory access (RDMA), which improves the scalability of workloads by eliminating the need to copy data into intermediate buffers.
In this blog, we describe the industry-standard tests that you can run to understand the benefits of HPC interconnects. We first describe the NCCL benchmarks, which directly measure the performance of various communication patterns such as all-reduce collectives. We then describe some AI training workloads that illustrate the benefits of high-performance interconnects.
NCCL Benchmarks
The NVIDIA Collective Communications Library (NCCL) is a standalone library that provides common GPU communication operations such as all-reduce, all-gather, reduce, broadcast, reduce-scatter, and point-to-point send/receive. It is optimized for high bandwidth on systems with PCIe, NVLink, NVSwitch, or network interfaces like InfiniBand Verbs and TCP/IP sockets. NCCL can handle any number of GPUs within a single machine or across multiple machines and is compatible with both single-process and multi-process (e.g., MPI-based) applications.
NCCL-tests is a comprehensive benchmarking and testing suite for NCCL. The suite provides tools to validate both the performance and correctness of NCCL operations across multiple GPUs, nodes, and network configurations. To ensure optimal performance, it's critical to utilize a system-specific topology file, which helps NCCL adapt to the underlying hardware layout and communication pathways.
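On the Azure HPC/AI images, the NCCL-tests suite comes pre-built under /opt/nccl-tests. If you are working from a different image, a minimal build sketch looks like the following; the MPI and CUDA paths are illustrative and depend on your installation:
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/opt/hpcx/ompi CUDA_HOME=/usr/local/cuda
# binaries such as all_reduce_perf are produced under ./build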
NCCL Topology File
Application developers can pass in the system topology by specifying the NCCL topology file while launching the job. The NCCL topology file passes in the following information to the NCCL library:
- GPU-to-NUMA mapping
- GPU-to-HCA, HCA-to-NUMA mapping
- NUMA-to-CPU-core mapping
- Speed and type of GPU-GPU interconnect.
This enables NCCL to choose the most efficient communication paths for the system topology. Azure ND-series VMs, like the NDv4 and NDv5 series, feature multiple GPUs. These GPUs connect to the CPUs via PCIe links and to each other using NVLink. Drawing from extensive research and experience, we have refined the Azure AI and HPC Marketplace images with the appropriate topology files for all VM series, readily available in the /opt/microsoft directory. While we recommend that customers use the pre-configured images, the topology files can also be found in the azhpc-images GitHub repository under the topology directory.
When an Azure HPC/AI VM image is used, NCCL automatically picks up the correct topology file based on the underlying VM SKU. However, if a non-HPC/AI VM image or a container is used, make sure the correct NCCL topology file is present and referenced in the running environment. A detailed blog post on how to optimize NCCL parameters for best performance for AI training workloads with topology files on Azure can be found at Optimizing AI Workloads on Azure: CPU Pinning via NCCL Topology file | Microsoft Community Hub.
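For example, when launching from a container, one option is to mount the host’s topology directory into the container and point NCCL at the right file explicitly. The following is only a sketch: the image name, command, and SKU are placeholders to adapt to your setup.
export SKU=ndv5                                        # adjust to your VM series
export NCCL_TOPO_FILE=/opt/microsoft/${SKU}-topo.xml
docker run --gpus all --net=host \
  -v /opt/microsoft:/opt/microsoft:ro \
  -e NCCL_TOPO_FILE \
  <your-training-image> <your-training-command>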
Recommended NCCL Parameters
NCCL-tests are typically parallel jobs launched using MPI. MPI libraries (OpenMPI, HPC-X, MVAPICH2, etc.) come pre-packaged in the HPC/AI VM images. To load MPI (e.g., HPC-X) into the current environment, run the following:
module load mpi/hpcx
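To confirm the module loaded and that mpirun resolves to the HPC-X build, a quick check is:
module list        # mpi/hpcx should appear among the loaded modules
which mpirun       # should point into the HPC-X installation
mpirun --version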
Once the MPI environment is loaded, NCCL-tests can be launched using mpirun commands. The following subsections highlight the recommended NCCL command line options for best performance for single/multi-node NCCL runs on different SKU types.
Recommended command lines for NDv4/NDv5
Single Node with NVLink (NDv4/NDv5)
A basic command line that will perform optimally on a single node of our production SKUs is provided in the README of the azhpc-images directory for that specific image; see here.
This is the current recommendation for a single node running the Ubuntu 22.04 image on NDv4 or NDv5:
mpirun -np 8 \
--bind-to numa --report-bindings \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=tcp \
-x UCX_NET_DEVICES=eth0 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=WARN \
-x NCCL_NVLS_ENABLE=1 \
/opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G
This can be used to sanity check NVLink health and performance on a given node.
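Independently of NCCL, the NVLink topology and per-link status of a node can be inspected with nvidia-smi, which is a useful first check if the numbers above look low:
nvidia-smi topo -m           # GPU-to-GPU connectivity (NV* entries) and NUMA affinity
nvidia-smi nvlink --status   # per-GPU NVLink state and speed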
Single Node IB only for NDv4/NDv5
mpirun -np 8 \
--bind-to numa --report-bindings \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=tcp \
-x UCX_NET_DEVICES=eth0 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=WARN \
-x NCCL_SHM_DISABLE=1 \
-x NCCL_P2P_DISABLE=1 \
-x NCCL_NVLS_ENABLE=0 \
/opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G
This can be used to sanity check all the NIC health and performance on a given node.
Multi-Node for NDv4/NDv5 on the AZHPC images
mpirun \
-np $(( NODES * 8 )) \
-hostfile $HOSTFILE \
--bind-to numa \
--map-by ppr:8:node \
-x LD_LIBRARY_PATH=$LD_LIBRARY_PATH \
-mca coll_hcoll_enable 0 \
-x UCX_TLS=rc \
-x UCX_NET_DEVICES=mlx5_ib0:1 \
-x NCCL_IB_HCA=^mlx5_an0:1 \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_DEBUG=warn \
-x NCCL_MIN_NCHANNELS=32 \
-x NCCL_NET_GDR_LEVEL=5 \
-x NCCL_TOPO_FILE=/opt/microsoft/${SKU}-topo.xml \
-x NCCL_NVLS_ENABLE=1 \
$sharp_args \
$ALL_REDUCE_PERF -f 2 -g1 -b1K -e 4G
where sharp_args can be set to the following to enable InfiniBand SHARP:
sharp_args="-x NCCL_COLLNET_ENABLE=1 \
-x NCCL_ALGO=CollnetDirect,CollnetChain,NVLS \
-x SHARP_COLL_ENABLE_SAT=1 \
-x SHARP_COLL_LOG_LEVEL=3 \
-x SHARP_COLL_ENABLE_PCI_RELAXED_ORDERING=1"
Key Environment Variables
The full list of NCCL environment variables can be found here. A few that appear in the example command lines above deserve particular attention.
NCCL_P2P_DISABLE
Disables peer-to-peer communication directly between GPUs using NVLink or PCI. Documentation
NCCL_SHM_DISABLE
Disables the shared memory transports that will use host memory when peer-to-peer cannot occur. Documentation
NCCL_NVLS_ENABLE
Enables NVLink SHARP (NVLS), which offloads collective operations to the NVSwitch. Documentation
Note that we typically use MPI to launch the NCCL jobs. MPI- and UCX-related environment variables control MPI's own communication, not the communication NCCL itself performs. OpenMPI's environment variables for interacting with UCX are most clearly described in its FAQ, here and here.
UCX_TLS
Picks the transports MPI/UCX is allowed to use or is excluded from using. Documentation
Expected performance
To verify the anticipated bandwidth using NCCL-tests, we refer to Bus Bandwidth (BusBw). Bus Bandwidth adjusts the algorithm's bandwidth to reflect hardware utilization, regardless of the number of ranks or the specific collective operation, enabling direct comparison with the theoretical peak bandwidth of the underlying hardware.
Generally, BusBw should be limited by the minimum of NVLink bandwidth, including NVLink SHARP (NVLS) if supported by NVSwitch, and network bandwidth. The following section outlines the expected BusBw using NCCL Allreduce as an example, with NCCL 2.26.5 and CUDA 12.6.
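As a practical note, the nccl-tests binaries print a busbw column for every message size, as well as a summary line at the end of the run; capturing the output to a log makes it easy to pull these values out. The exact formatting can vary slightly between nccl-tests versions:
mpirun <options as above> /opt/nccl-tests/build/all_reduce_perf -b1K -f2 -g1 -e 4G | tee allreduce.log
grep "Avg bus bandwidth" allreduce.log    # summary line printed by nccl-tests
grep " 4294967296 " allreduce.log         # per-size row for the largest (4 GB) message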
NDv5
GPU-to-GPU NVLink bandwidth is 900GB/s bi-directional and 450GB/s uni-directional.
Each CX7 HCA provides 50GB/s uni-directional bandwidth, so the maximum network bandwidth we can achieve from a node (with 8 CX7 HCAs) is 50*8=400GB/s.
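A quick way to confirm the per-node HCA count and link rate (assuming the standard InfiniBand diagnostics such as ibstat are available, as they typically are on the Azure HPC/AI images) is:
ibstat | grep -E "CA '|Rate:"    # expect eight InfiniBand ports reporting Rate: 400
ibv_devinfo -l                   # lists the RDMA devices; the mlx5_an0 accelerated-networking NIC also appears here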
Single-node:
- NVLS: 470+GB/s. Compared to native NVLink, we observe higher BusBw due to NVLink SHARP, a hardware feature that offloads arithmetic operations onto NVSwitches, reducing data transfers.
- Loopback: 45+GB/s, constrained by the bandwidth of a single HCA.
Table 1. NCCL Allreduce BusBw with Single NDv5 at 16GB message size with NVLS and IBLoopback, respectively
Message Size (Byte) | BusBw-NVLS (GB/s) | BusBw-IBLoopback (GB/s)
16G                 | 481.42            | 48.69
Multi-node:
Since a single node with NVLS delivers 470GB/s BusBw, multi-node BusBw is now limited by the network bandwidth (400GB/s). We should expect 330GB/s for multi-node BusBw by default, considering the fact that there is no PCIe switch in NDv5 and NCCL consumes additional NVLink bandwidth to complete the ring. With improvements in the NVLSTree algorithm, BusBw is expected to reach around 360GB/s.
Table 2. NCCL Allreduce BusBw with multiple NDv5 VMs (different numbers of H100 GPUs) at 16GB message size
Msg Size (Byte) | BusBw with different numbers of H100 GPUs (GB/s)
                | 32     | 64     | 128    | 256    | 512    | 1024
16G             | 361.06 | 366.12 | 367.39 | 365.51 | 366.63 | 354.2
Figure 1. Bus bandwidth (in GB/s) of a 16GB message on 4 to 256 Azure ND_H100_v5 virtual machines.
AI Training Workloads
To run the open-source training model used in this study, begin by setting up a CycleCloud cluster with your preferred workload scheduler. In this guide, we use the SLURM scheduler. Detailed instructions for setting up a CycleCloud cluster with SLURM in Azure can be found here [Click for the Link].
This guide outlines the steps to benchmark training performance with the Grok-1 model published by xAI [Click for the Link]. Grok-1 is a 314B parameter Large Language Model (LLM) based on a Mixture of Experts (MoE) architecture. It features 8 experts, 64 layers, 48 attention heads (48 for queries, 8 for key/values), and 6144-dimensional internal embeddings, with support for 8-bit quantization and activation caching. The model, along with instructions for running on H100 and GH200 GPUs, is available here [Click for the Link].
To run the model under realistic configurations, at least 512 GPUs are required. However, by employing modified proxy configurations, the model can be scaled down to run on as few as 8 GPUs, each with 80GB of memory. Proxy configurations reduce the model's memory footprint by shrinking certain parameters. Table 3 details the hyperparameters and model configurations corresponding to different GPU setups.
Table 3. Hyperparameters and model configuration for the Grok-1 314B parameter training workload. GPUs, SeqLen, Layers, TP, PP, CP, EP, DP, VP, MBS, GBS and GA denote the total number of GPUs, Sequence Length, Model Layers, Tensor Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism, Data Parallelism, Virtual Pipeline Parallelism, Micro Batch Size, Global Batch Size and Gradient Accumulation, respectively.
GPUs | SeqLen | Layers | TP | PP | CP | EP | DP | VP  | MBS | GBS  | GA
8    | 4096   | 2      | 4  | 1  | 1  | 2  | 1  | N/A | 1   | 1024 | 128
16   | 4096   | 4      | 4  | 1  | 1  | 4  | 1  | N/A | 1   | 1024 | 128
32   | 4096   | 4      | 4  | 1  | 1  | 8  | 1  | N/A | 1   | 1024 | 128
64   | 8192   | 8      | 4  | 1  | 2  | 8  | 1  | N/A | 1   | 1024 | 128
128  | 8192   | 16     | 4  | 2  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
256  | 8192   | 32     | 4  | 4  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
512  | 8192   | 64     | 4  | 8  | 2  | 8  | 1  | 8   | 1   | 1024 | 128
1024 | 8192   | 64     | 4  | 8  | 2  | 8  | 2  | 8   | 1   | 2048 | 128
Here we share the instructions for running this distributed training workload on Azure’s H100/H200 VMs, with FP8/BF16 precisions.
- To begin, download the run scripts into your cluster from Grok1 314B 25.01 (DGXC Benchmarking) | NVIDIA NGC.
- Set up your stage path. This is the directory in which all files are written and results are saved. The stage path needs to be on a shared file system, accessible from all the VMs in your cluster.
export STAGE_PATH="<path to your work directory>"
For your CycleCloud cluster, you can use an Azure Managed Lustre File System [Click here for the Link] or Azure NetApp Files [Click here for the Link] as your shared file system.
- Run the setup script. This script downloads the container and necessary files from NVIDIA’s NGC website.
- Before running the setup.sh script, fix the following typo in the script. In line 32, change "cp -vf configure.sh launch.sh $STAGE_PATH" to "cp -vf configure.sh launch.sh $STAGE_PATH/cfg"
- Run sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N 1 ./setup.sh
- SBATCH_ACCOUNT is set to your Slurm account on the cluster. In most basic CycleCloud + Slurm setups, this field needs to be omitted.
- SBATCH_PARTITION refers to the partition on the cluster. In most CycleCloud clusters this is "hpc".
- Consult with your system admin to find the proper settings for these variables.
- The NVIDIA NGC Grok-1 container uses a synthetic dataset for training. There is no need to download a training dataset for these tests.
- To optimize the performance of Azure’s network, add the following NCCL environment variables in your shell
Table 4. NCCL optimizations for Azure environment for large scale distributed training.
Variable                   | Value                        | Description
NCCL_TOPO_FILE             | /opt/microsoft/ndv5-topo.xml | Ensures topology-aware CPU/NIC/GPU mapping
NCCL_P2P_NET_CHUNKSIZE     | 2097152                      | Increases P2P transfer granularity
NCCL_MIN_NCHANNELS         | 32                           | Improves throughput for collectives
NCCL_IB_QPS_PER_CONNECTION | 4                            | Uses multiple QP pairs to improve routing entropy
NCCL_PXN_DISABLE           | 1                            | Enables zero-copy design for P2P communications
NCCL_IGNORE_CPU_AFFINITY   | 1                            | Ensures NCCL uses GPU affinity only
- To run the workload, do: DTYPE=<fp8/bf16> sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N ${NUM_NODES} ./launch.sh (a sketch for sweeping several scales follows the launch.sh excerpt below)
i. DTYPE represents the datatype used for the training. This can be either "fp8" or "bf16".
ii. NUM_NODES can be determined from NUM_NODES=NUMBER_OF_GPUS/8, since Azure’s H100 and H200 VMs provide 8 GPUs per node.
iii. We recommend editing the "launch.sh" script with the following:
export NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml
export NCCL_P2P_NET_CHUNKSIZE=2097152
export NCCL_MIN_NCHANNELS=32
export NCCL_IB_QPS_PER_CONNECTION=4
export NCCL_PXN_DISABLE=1
export NCCL_IGNORE_CPU_AFFINITY=1
srun \
--container-image "$IMAGE" \
--container-mounts ${NCCL_TOPO_FILE},$RESULT_DIR,$INDEX_MAPPING_DIR,$STAGE_PATH/cfg:/cfg,$STAGE_PATH/configure.sh:/gsw/configure.sh \
--container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE,NCCL_MIN_NCHANNELS,NCCL_IB_QPS_PER_CONNECTION,NCCL_PXN_DISABLE,NCCL_IGNORE_CPU_AFFINITY \
--cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,fff000000000000,fff000000000000000,fff000000000000000000,fff000000000000000000000" \
--container-writable \
--no-container-mount-home \
bash -c "source /gsw/configure.sh && launch"
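As a convenience, the scale points from Table 3 can be submitted in a single sweep. The loop below is only a sketch and assumes SBATCH_ACCOUNT and SBATCH_PARTITION are set as described above; 1 to 128 nodes corresponds to 8 to 1,024 GPUs.
for DTYPE in fp8 bf16; do
  for NUM_NODES in 1 2 4 8 16 32 64 128; do
    DTYPE=${DTYPE} sbatch -A ${SBATCH_ACCOUNT} -p ${SBATCH_PARTITION} -N ${NUM_NODES} ./launch.sh
  done
done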
Performance is reported in terms of seconds per iteration, which corresponds to seconds per training step. This can be obtained from the "*.out" log files generated in the "$STAGE_PATH/results/$GSW_VERSION/$DTYPE/314b/$JOB_TOTAL_GPUS" directory. To extract the performance data from the log files, "grep" for "train_step_timing" over the "*.out" files in this directory: grep train_step_timing *.out
Since there is large variability in performance at the beginning of the training job, it is recommended to use the training performance from the last training steps. Given the Sequence Length (SeqLen) and Global Batch Size (GBS) listed in Table 3, the step time can then be used to calculate the overall training throughput, in terms of tokens processed per second:
Throughput (Tokens/Second) = SeqLen x GBS / (Seconds per Training Step)
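As a back-of-the-envelope check against Table 5, the 512-GPU FP8 configuration (SeqLen 8192, GBS 1024, roughly 17.9 seconds per step) works out as follows; the grep pattern is the one described above, and the exact log format may vary by container version.
grep train_step_timing *.out | tail -n 5           # use the last few reported steps
SEQLEN=8192; GBS=1024; STEP_TIME=17.9              # values from Table 3 and Table 5
echo "${SEQLEN} * ${GBS} / ${STEP_TIME}" | bc -l   # ≈ 468637 tokens/second, matching Table 5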
Table 5 and Table 6 demonstrate the performance results obtained with Grok-1 314B parameter training model on Azure’s ND_H100_v5 VMs, using FP8 and BF16 precision, respectively. These results are the ‘best performance’ selected from a set of several trials.
Table 5. Performance results obtained on Azure ND_H100_v5 VMs, in FP8 precision.
      | Total Number of H100 GPUs         | 8      | 16     | 32     | 64     | 128    | 256    | 512    | 1024
Azure | Training Step Time (Seconds/Step) | 19.4   | 17     | 8.39   | 16.6   | 17     | 16.8   | 17.9   | 18.4
      | Throughput (Tokens/Second)        | 216201 | 246724 | 499917 | 505338 | 493448 | 499322 | 468637 | 911805
Table 6. Performance results obtained on Azure ND_H100_v5 VMs, in BF16 precision.
      | Total Number of H100 GPUs         | 8      | 16     | 32     | 64     | 128    | 256    | 512    | 1024
Azure | Training Step Time (Seconds/Step) | 24.1   | 21.9   | 10.7   | 21     | 21.2   | 20.8   | 22     | 22.4
      | Throughput (Tokens/Second)        | 174038 | 191521 | 391991 | 399458 | 395689 | 403298 | 381300 | 748983
As previously discussed, training the Grok-1 model, which comprises 314 billion parameters, requires a minimum of 512 GPUs under realistic configurations. Using the performance data obtained for setups with 512 and 1,024 H100 GPUs, we calculated the scaling efficiency of distributed training with this model. Figure 2 shows the scaling efficiency achieved with the Grok-1 model on Azure’s ND_H100_v5 VMs, demonstrating scaling efficiencies higher than 97% for both the FP8 and BF16 precisions.
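Using the throughput figures from Table 5, the FP8 efficiency going from 512 to 1,024 GPUs can be checked directly by dividing the measured throughput by the ideally scaled (doubled) 512-GPU throughput:
T512=468637; T1024=911805                       # tokens/second from Table 5
echo "${T1024} / (2 * ${T512}) * 100" | bc -l   # ≈ 97.3% scaling efficiency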
Figure 2. Scaling Efficiency of distributed training using Grok-1 314B parameter model, observed on Azure ND_H100_v5 VMs.
These results are in line with previously reported performance benchmarks for large-scale distributed training and inference on Azure, as measured by MLPerf v3.1 results [Click for the Link]. Figure 3 highlights the performance of Azure ND_H100_v5 VMs compared to the bare-metal nodes of the EOS supercomputer. At large scale, using 1,344 VMs, corresponding to 10,752 H100 GPUs, Azure VMs exhibit a performance that is only 2.3% lower than that of EOS.
This effectively quantifies the minimal overhead introduced by Azure’s virtualization layer, showcasing its ability to deliver performance and scalability on par with bare-metal supercomputers at scale. This is only made possible by Azure’s reliance on high-performance interconnects, with our backend InfiniBand network being a distinguishing factor.
Figure 3. Performance of Azure ND_H100_v5 for distributed training, as measured by the MLPerf v3.1 benchmarks, in comparison to the bare-metal nodes of NVIDIA's EOS supercomputer.
Wrapping Up
Demanding High-Performance Computing and Artificial Intelligence (HPC & AI) workloads require careful design of the underlying infrastructure to ensure optimal performance and efficiency. In this blog, we’ve examined the impact of high-performance interconnects like the InfiniBand interconnect available with Azure’s HPC & AI virtual machines. We’ve shown the benefits of the improved latency and bandwidth provided by InfiniBand for real workloads and provided instructions to allow you to reproduce these results on Azure Virtual Machines.