Today we’re excited to release the AI Infrastructure on Azure repository—a one-stop reference for teams building large-scale AI clusters on Azure.
Authors
Davide Vanzo - Senior Technical Program Manager - Azure Specialized
Jer-Ming Chia - Principal Technical Program Manager - Azure Specialized
Jesse Lopez - Senior Technical Program Manager - Azure Specialized
Jingchao Zhang - Senior Technical Program Manager - Azure Specialized
Paul Edwards - Principal Technical Program Manager - Azure Specialized
Wolfgang De Salvador - Senior Product Manager - Azure Storage
Introduction
When building a supercomputer on Azure for AI workloads, teams must stitch together orchestration, storage, and compute components. They often spend weeks fine-tuning those configurations for peak performance. This repo delivers well-tested Infrastructure-as-Code blueprints for fully integrated clusters that prioritize reliability and performance, and that can be used to reproduce our published benchmarks.
Design Considerations
Building an AI supercomputer on Azure spans many moving parts: VM family selection (e.g., ND GB200 v6 vs. ND H200 v5), deployment model (from fully containerized AKS clusters to traditional HPC), and storage strategy—you can even run training without POSIX file systems by tuning your data lifecycle, as detailed in several blog posts and sessions.
Other impactful design drivers include:
Storage & I/O
- Capacity needs (dataset, checkpoints, logs)
- Throughput & IOPS (sequential vs random access patterns)
- Filesystem interface (POSIX-compliant vs API-native cloud storage)
- Tiering strategy
Software & Orchestration
- AI framework & version (e.g., Megatron-LM, LLM Foundry, DeepSpeed)
- Container runtime (e.g., enroot + Pyxis, Singularity, Docker)
- Scheduler/orchestration integration (e.g., Slurm, Kueue, Volcano)
- OS image & driver stack (e.g., Ubuntu-based Azure HPC image, NVIDIA drivers, InfiniBand drivers)
- Node health checks (InfiniBand fabric performance/health, GPU errors, etc.)
Workflow & Automation
- Checkpoint frequency & size (impacts storage performance)
- Data staging/ingest (pre-processing on CPU nodes vs GPU nodes)
- Monitoring & logging (telemetry pipelines, DCGM, Prometheus)
Systems optimizations
- CPU configs (NUMA topology files & affinity overrides)
- NCCL tuning (topology mapping, P2P chunk size, channel count; see the sketch after this list)
- IB fabric tuning (queue-per-connection, zero-copy transfers)
- Storage tuning (mount options, I/O scheduler, parallel-FS striping)
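To make the NCCL knobs concrete, here is a minimal sketch of applying tuning parameters of this kind when launching a training process. The values, paths, and the train.py entry point are illustrative assumptions rather than settings from the repository; appropriate choices depend on the VM SKU, NCCL version, and the topology files shipped with the Azure HPC image.

```python
import os
import subprocess

# Illustrative NCCL tuning knobs corresponding to the bullets above. The values
# and paths are placeholder assumptions, not settings taken from the repository.
nccl_env = {
    "NCCL_TOPO_FILE": "/opt/microsoft/topo.xml",  # topology mapping (hypothetical path)
    "NCCL_MIN_NCHANNELS": "32",                   # channel count
    "NCCL_P2P_NET_CHUNKSIZE": "524288",           # P2P chunk size in bytes
    "NCCL_DEBUG": "WARN",                         # surface misconfigurations early
}

env = {**os.environ, **nccl_env}

# Launch the (hypothetical) training entry point with the tuned environment.
subprocess.run(["python", "train.py"], env=env, check=True)
```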
Given the breadth of these design considerations, landing on an optimal configuration can be challenging. This repo’s purpose is to centralize our battle-tested configurations and optimization guidance—so you can push the health, reliability, and performance of your Azure AI supercomputers to the limit. We’ve also published end-to-end benchmarks here, giving you clear baselines to compare your own deployments against.
It also includes recommended node- and cluster-level health checks, baseline performance benchmarks, and a sample LLM training run using this configuration.
Configuration guidance
In this initial release, the repo provides a ready-to-run template for a "canonical" Slurm-managed HPC cluster, leveraging Azure NetApp Files for networked storage and Azure Managed Lustre for parallel filesystem performance.
This section of the repository is intended to host Infrastructure-as-Code configurations for AI supercomputers on Azure that have been thoroughly tested and widely adopted.
Storage Guidance
The repository also provides guidance for choosing storage backends, for instance how to evaluate Azure Managed Lustre tiers against the capacity and performance required by a specific training job.
One of the key metrics to optimize in distributed training is checkpoint time: it directly affects GPU utilization and is strongly tied to filesystem throughput. The repository presents an example of this scenario for a GPT-3-style model (175B parameters) on Azure Managed Lustre. Similarly, it provides guidance on using BlobFuse2 with Azure Blob Storage for training jobs; in a recent Microsoft Build session, Azure Blob Storage demonstrated 25 Tbps of egress bandwidth from a single storage account.
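For a rough sense of why throughput matters here, consider a back-of-the-envelope estimate. The parameter count, bytes-per-parameter breakdown, and throughput figures below are illustrative assumptions, not numbers taken from the repository.

```python
# Back-of-the-envelope checkpoint-time estimate; all numbers are illustrative
# assumptions rather than measurements from the repository.

params = 175e9                  # GPT-3-scale parameter count
bytes_per_param = 14            # assumed: bf16 weights + fp32 master copy + Adam moments
checkpoint_bytes = params * bytes_per_param          # ~2.45 TB per full checkpoint

# Time to write one checkpoint at different aggregate filesystem throughputs.
for gb_per_s in (20, 125, 500):                      # sustained write bandwidth in GB/s
    seconds = checkpoint_bytes / (gb_per_s * 1e9)
    print(f"{gb_per_s:>4} GB/s -> ~{seconds:,.0f} s per checkpoint")
```

Even under these simplified assumptions, the same checkpoint moves from minutes to seconds as aggregate throughput grows, which is the kind of trade-off the tier-selection guidance in the repository addresses.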
Moreover, the repository is meant to host guidance on specific filesystem tunings that maximize delivered performance.
Node- and Cluster-level Health Checks
Validating cluster readiness before large-scale training runs helps catch system issues early, so you don't waste compute cycles and can hit your performance baselines. We recommend running a series of health checks at both the node and cluster level to surface hardware or software problems before they affect a job.
We recommend using AzureHPC Node Health Checks (AzNHC) to validate node-level functionality. Built on the LBNL NHC framework, AzNHC adds Azure-specific, SKU-aware hardware tests for HPC and GPU VM SKUs, including GPU availability, NVLink health, ECC memory error checks, device-to-host and host-to-device bandwidth tests, InfiniBand throughput (GDR and non-GDR), topology validation, and intra-node NCCL all-reduce benchmarks. It runs inside a Docker container and is easy to invoke on demand.
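As an illustration of how node-level checks like these could be swept across a Slurm cluster before a run, here is a minimal sketch. The health-check command path is a hypothetical placeholder; the actual AzNHC invocation is documented in the repository.

```python
import subprocess

# Placeholder for the node-level health-check command (for example, the AzNHC
# wrapper described above); the path below is hypothetical -- substitute the
# invocation documented in the repository.
HEALTH_CHECK_CMD = "/opt/healthchecks/run-node-checks.sh"

def cluster_nodes():
    """List node names known to Slurm (deduplicated across partitions)."""
    out = subprocess.run(["sinfo", "-N", "-h", "-o", "%N"],
                         capture_output=True, text=True, check=True)
    return sorted(set(out.stdout.split()))

def node_is_healthy(node):
    """Run the health check on one node via srun and report pass/fail."""
    result = subprocess.run(["srun", "-N", "1", "-w", node, HEALTH_CHECK_CMD],
                            capture_output=True, text=True)
    return result.returncode == 0

if __name__ == "__main__":
    unhealthy = [n for n in cluster_nodes() if not node_is_healthy(n)]
    print("Unhealthy nodes:", ", ".join(unhealthy) if unhealthy else "none")
```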
In parallel, at the cluster level, testing inter-node GPU communication with NCCL all-reduce benchmarks is an effective way to measure collective bandwidth across your fleet. The Azure HPC image includes the prebuilt nccl-tests suite in /opt/nccl-tests/build/, which can be launched across all nodes via MPI. The recommended NCCL settings -- CollNet/NVLS, GDR, and PCI relaxed ordering -- provide optimal collective performance and serve as the baseline.
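Below is a minimal sketch of how such a cluster-wide all-reduce run could be launched with Open MPI. The hostfile, node counts, and environment-variable values are assumptions to adapt; the repository documents the exact recommended settings per VM SKU.

```python
import subprocess

# NCCL settings mirroring the recommendations above (CollNet/NVLS, GDR, and PCI
# relaxed ordering). Treat these values as assumptions to validate against the
# repository's baseline settings for your VM SKU.
nccl_env = {
    "NCCL_COLLNET_ENABLE": "1",
    "NCCL_NVLS_ENABLE": "1",
    "NCCL_NET_GDR_LEVEL": "SYS",
    "NCCL_IB_PCI_RELAXED_ORDERING": "1",
}

num_nodes, gpus_per_node = 8, 8                 # example fleet size
cmd = ["mpirun", "-np", str(num_nodes * gpus_per_node),
       "--hostfile", "hosts.txt",               # hypothetical hostfile, one node per line
       "--map-by", f"ppr:{gpus_per_node}:node"]
for key, value in nccl_env.items():
    cmd += ["-x", f"{key}={value}"]             # forward NCCL settings to every rank

# all_reduce_perf ships prebuilt with the Azure HPC image under /opt/nccl-tests/build/.
cmd += ["/opt/nccl-tests/build/all_reduce_perf",
        "-b", "8", "-e", "8G", "-f", "2", "-g", "1"]

subprocess.run(cmd, check=True)
```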
The repo includes best practices for running these validation tests.
Benchmarks
A recently published set of benchmarks demonstrates near-linear scaling from 8 up to 1,024 H100 GPUs on ND H100 v5 VMs (Standard_ND96isr_H100_v5), delivering training performance on par with NVIDIA’s reference DGX systems and underscoring Azure’s infrastructure scalability and efficiency for large-scale AI workloads.
These benchmarks ran on the repo’s reference architecture. The recipes, deployment instructions, and full benchmark results are all available in the examples section of the repository.
Workload Examples
Equally important, the repo contains real-world examples of end-to-end AI training -- including best practices for data preparation and training-job execution. The examples section currently covers the Megatron-LM GPT 175B case as well as the LLM Foundry MPT-30B and MPT-70B cases.
The current examples focus on the Azure CycleCloud Workspace for Slurm architecture, but we plan to extend them to additional orchestration solutions in the future.
These guides let interested users configure sample distributed training jobs, drawing on configuration guidance relevant to their environment and infrastructure.
What’s next
The repository presented in this blog post will be expanded with additional scenarios, best practices, and configuration recipes. We will share periodic updates as new content is added and the catalog evolves.
We welcome contributions, and we encourage you to open requests for any new content you would find useful.
Thank you to all our readers!