Azure CTO Mark Russinovich’s annual Azure infrastructure presentations at Build are always fascinating as he explores the past, present, and future of the hardware that underpins the cloud. This year’s talk was no different, focusing on the AI platform touted throughout the rest of the event.
Over the years, it’s become clear that Azure’s hardware has grown increasingly complex. At the start, it was a prime example of utility computing, built on a single standard server design. Now Azure runs many different server types, able to support all classes of workloads. GPUs were added first, and now AI accelerators.
That last innovation, introduced in 2023, shows how much Azure’s infrastructure has evolved along with the workloads it hosts. Russinovich’s first slide showed how quickly modern AI models were growing, from 110 million parameters with GPT in 2018, to over a trillion in today’s GPT-4o. That growth has led to the development of massive distributed supercomputers to train these models, along with hardware and software to make them efficient and reliable.
Building the AI supercomputer
The scale of the systems needed to run these AI platforms is enormous. Microsoft’s first big AI-training supercomputer was detailed in May 2020. It had 10,000 Nvidia V100 GPUs and clocked in at number five in the global supercomputer rankings. Only three years later, in November 2023, the latest iteration had 14,400 H100 GPUs and ranked third.
As of June 2024, Microsoft has more than 30 similar supercomputers in data centers around the world. Russinovich talked about the open source Llama-3-70B model, which takes 6.4 million GPU hours to train. On a single GPU that would take 730 years, but on one of Microsoft’s AI supercomputers, a training run takes roughly 27 days.
Training is only part of the problem. Once a model has been built, it needs to be used, and although inference doesn’t need the supercomputer levels of compute used for training, it still needs a lot of power. As Russinovich notes, a single floating-point parameter needs two bytes of memory, so a one-billion-parameter model needs 2GB of RAM, and a 175-billion-parameter model requires 350GB. That’s before you add any necessary overhead, such as caches, which can add more than 40% to already-hefty memory requirements.
All this means Azure needs a lot of GPUs with very specific characteristics, able to push through a lot of data as quickly as possible. Models like GPT-4 require significant amounts of high-bandwidth memory. Both compute and memory need substantial amounts of power: a single Nvidia H100 GPU draws 700 watts, and with thousands in operation at any time, Azure data centers need to dump a lot of heat.
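Those figures are easy to sanity-check with a back-of-envelope calculation. In the sketch below, only the 6.4 million GPU-hour number comes from the talk; the 10,000-GPU cluster size is my assumption for illustration.

```python
# Back-of-envelope check of the training-time figures quoted in the talk.
# Only the 6.4 million GPU-hour number comes from the presentation; the
# 10,000-GPU cluster size is an assumption for illustration.

GPU_HOURS = 6.4e6            # quoted cost of a Llama-3-70B training run
HOURS_PER_YEAR = 24 * 365

single_gpu_years = GPU_HOURS / HOURS_PER_YEAR
print(f"One GPU: ~{single_gpu_years:.0f} years")             # ~730 years

CLUSTER_GPUS = 10_000        # assumed supercomputer size
cluster_days = GPU_HOURS / CLUSTER_GPUS / 24
print(f"{CLUSTER_GPUS:,} GPUs: ~{cluster_days:.0f} days")    # ~27 days
```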
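Put as arithmetic, the rule of thumb is straightforward. This minimal sketch applies it, taking the two-bytes-per-parameter and 40% overhead figures from the talk; everything else is plain multiplication.

```python
# The talk's memory rule of thumb: two bytes per parameter for 16-bit weights,
# plus roughly 40% overhead for caches and other runtime state.

def model_memory_gb(params_billions: float, overhead: float = 0.40) -> float:
    bytes_per_param = 2                            # fp16/bf16 weights
    weights_gb = params_billions * 1e9 * bytes_per_param / 1e9
    return weights_gb * (1 + overhead)

print(model_memory_gb(1, overhead=0))      # 2.0 GB for a 1B-parameter model
print(model_memory_gb(175, overhead=0))    # 350.0 GB for a 175B-parameter model
print(model_memory_gb(175))                # 490.0 GB once 40% overhead is added
```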
Beyond training, design for inference
Microsoft has developed its own inference accelerator in the shape of its Maia hardware, which is pioneering a new directed-liquid cooling system. The Maia accelerators are sheathed in a closed-loop cooling system that required a whole new rack design, with a secondary cabinet containing the cooling equipment’s heat exchangers.
Designing data centers for training has shown Microsoft how to provision for inference. Training rapidly ramps power draw up to 100% and holds it there for the duration of a run. Using the same power monitoring on an inferencing rack, it’s possible to see how power draw varies at different points across an inferencing operation.
Azure’s Project POLCA aims to use this information to increase efficiency. It allows multiple inferencing operations to run at the same time, provisioning for peak power draw with around 20% headroom. That lets Microsoft put 30% more servers in a data center, throttling both server frequency and power when necessary. The result is a more efficient and more sustainable approach to the compute, power, and thermal demands of an AI data center.
Managing the data for training models brings its own set of problems: there’s a lot of data, and it needs to be distributed across the nodes of those Azure supercomputers. Microsoft has been working on what it calls Storage Accelerator to manage this, distributing data across clusters with a cache that determines whether required data is available locally or needs to be fetched, using spare bandwidth so transfers don’t interfere with running jobs. Parallel reads allow large amounts of training data to be loaded almost twice as fast as traditional file loads.
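The underlying idea is power oversubscription: budget closer to typical draw than to worst-case peak, and throttle if the aggregate ever approaches the facility limit. The numbers and throttling hook below are illustrative assumptions, not Microsoft’s figures.

```python
# Illustration of power oversubscription in the spirit of Project POLCA.
# All numbers here are made up; the point is that budgeting for typical
# rather than peak draw fits more servers into a fixed facility budget,
# with frequency/power throttling as the safety valve.

FACILITY_BUDGET_KW = 1_000
PEAK_SERVER_KW = 10.0        # assumed worst-case draw per inference server
TYPICAL_SERVER_KW = 7.7      # assumed typical draw during inference

conservative = int(FACILITY_BUDGET_KW / PEAK_SERVER_KW)       # 100 servers
oversubscribed = int(FACILITY_BUDGET_KW / TYPICAL_SERVER_KW)  # ~129 servers, ~30% more

def power_action(current_draw_kw: float, limit_kw: float = FACILITY_BUDGET_KW) -> str:
    """If aggregate draw nears the facility limit, cap server clocks and power."""
    return "throttle" if current_draw_kw > 0.95 * limit_kw else "normal"

print(conservative, oversubscribed, power_action(990))
```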
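Conceptually, that’s a local cache check in front of remote storage, with several readers working in parallel. The sketch below illustrates the pattern; the cache path, fetch helper, and shard layout are hypothetical, not the Storage Accelerator API.

```python
# Sketch of a cache-then-fetch read path with parallel readers. The cache
# location, fetch helper, and shard layout are hypothetical; they only
# illustrate the pattern described for Storage Accelerator.

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

CACHE_DIR = Path("/local/nvme/cache")      # assumed node-local cache

def fetch_from_remote(remote_path: str) -> bytes:
    """Placeholder for a fetch over the data center network (e.g. blob storage)."""
    return b""                             # stub for illustration only

def read_shard(remote_path: str) -> bytes:
    local = CACHE_DIR / Path(remote_path).name
    if local.exists():                     # hit: data is already on this node
        return local.read_bytes()
    data = fetch_from_remote(remote_path)  # miss: pull it over spare bandwidth
    local.parent.mkdir(parents=True, exist_ok=True)
    local.write_bytes(data)                # populate the cache for the next pass
    return data

def load_training_shards(paths: list[str], workers: int = 8) -> list[bytes]:
    # Parallel reads keep the network and local disks busy at the same time.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(read_shard, paths))
```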
AI needs high-bandwidth networks
Compute and storage are important, but networking remains critical, especially for massively data-parallel workloads running across many hundreds of GPUs. Here, Microsoft has invested significantly in high-bandwidth InfiniBand: 1.2TBps of internal connectivity linking the eight GPUs inside each server, and 400Gbps between individual GPUs in separate servers.
Microsoft has invested a lot in InfiniBand, both for its OpenAI training supercomputers and for the services it offers to customers. Interestingly, Russinovich noted that “really, the only difference between the supercomputers we build for OpenAI and what we make available publicly, is the scale of the InfiniBand domain. In the case of OpenAI, the InfiniBand domain covers the entire supercomputer, which is tens of thousands of servers.” For other customers who don’t have the same training demands, the domains are smaller, but still at supercomputer scale, “1,000 to 2,000 servers in size, connecting 10,000 to 20,000 GPUs.”
All that networking infrastructure requires some surprisingly low-tech solutions, such as 3D-printed sleds used to pull large amounts of cable efficiently. They’re placed in the cable shelves above the server racks and pulled along. It’s a simple way to cut cabling times significantly, a necessity when you’re building 30 supercomputers every six months.
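To see why that bandwidth matters, consider a rough, assumption-laden estimate of gradient synchronization. A ring all-reduce moves roughly twice the gradient volume through each GPU, so even at 400Gbps, syncing a 70-billion-parameter model’s fp16 gradients takes on the order of seconds; none of the figures below beyond the link speed come from the talk.

```python
# Rough estimate, built on assumptions, of why per-GPU link bandwidth matters.
# A ring all-reduce moves roughly twice the gradient volume through each GPU,
# so even at 400Gbps a full fp16 gradient sync for a 70B-parameter model
# takes on the order of seconds.

PARAMS = 70e9
BYTES_PER_GRADIENT = 2                            # fp16 gradients (assumed)
LINK_GBPS = 400

gradient_bytes = PARAMS * BYTES_PER_GRADIENT      # ~140 GB of gradients
per_gpu_traffic = 2 * gradient_bytes              # ring all-reduce approximation
link_bytes_per_sec = LINK_GBPS / 8 * 1e9          # 400 Gbps is about 50 GB/s

print(f"~{per_gpu_traffic / link_bytes_per_sec:.1f} s per full gradient sync")
```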
Making AI reliable: Project Forge and One Pool
Hardware is only part of the Azure supercomputer story. The software stack provides the underlying platform orchestration and support tools. This is where Project Forge comes in. You can think of it as an equivalent to something like Kubernetes, a way of scheduling operations across a distributed infrastructure while providing essential resource management and spreading loads across different types of AI compute.
The Project Forge scheduler treats all the available AI accelerators in Azure as a single pool of virtual GPU capacity, something Microsoft calls One Pool. Loads have priority levels that control access to these virtual GPUs. A higher-priority load can evict a lower-priority one, moving it to a different class of accelerator or to another region altogether. The aim is to provide a consistent level of utilization across the entire Azure AI platform so Microsoft can better plan and manage its power and networking budget.
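A toy version of that priority-and-eviction logic looks something like the following. The class names, priority scheme, and eviction policy are illustrative, not Microsoft’s implementation.

```python
# Minimal sketch of priority-based eviction over a single virtual GPU pool,
# in the spirit of the One Pool description. Names, priorities, and the
# eviction policy are illustrative only.

from dataclasses import dataclass, field
import heapq

@dataclass(order=True)
class Job:
    priority: int                      # lower value = lower priority, evicted first
    name: str = field(compare=False)
    gpus: int = field(compare=False)

class OnePoolScheduler:
    def __init__(self, total_gpus: int):
        self.free = total_gpus
        self.running: list[Job] = []   # min-heap keyed on priority

    def submit(self, job: Job) -> bool:
        # Evict lower-priority work until the new job fits (or nothing is left to evict).
        while self.free < job.gpus and self.running and self.running[0].priority < job.priority:
            evicted = heapq.heappop(self.running)
            self.free += evicted.gpus  # the evicted job would migrate to another pool or region
        if self.free >= job.gpus:
            heapq.heappush(self.running, job)
            self.free -= job.gpus
            return True
        return False

pool = OnePoolScheduler(total_gpus=16)
pool.submit(Job(priority=1, name="batch-finetune", gpus=16))
pool.submit(Job(priority=5, name="production-inference", gpus=8))  # evicts the batch job
```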
Like Kubernetes, Project Forge is designed to help run a more resilient service, detecting failures, restarting jobs, and repairing the host platform. By automating these processes, Azure can avoid restarting expensive and complex jobs from scratch, treating them instead as sets of batches that can run individually, with inputs and outputs orchestrated as needed.
Consistency and security: ready for AI applications
Once an AI model has been built, it needs to be used. Again, Azure needs a way of balancing utilization across different types of models and different prompts within those models. With no orchestration (or lazy orchestration), it’s easy to get into a position where one prompt ends up blocking other operations. By taking advantage of its virtual, fractional GPUs, Azure’s Project Flywheel can guarantee performance, interleaving operations from multiple prompts across virtual GPUs to deliver constant throughput while keeping the underlying physical GPU consistently busy.
Another low-level capability is confidential computing for training custom models, running code and hosting data in trusted execution environments. Azure can now offer complete confidential VMs, including GPUs, with encrypted messages between the CPU and GPU trusted environments. You can use this for training or for securing the private data used in retrieval-augmented generation.
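In scheduling terms, that’s round-robin interleaving of small slices of work from each active prompt, so no single long generation can monopolize a physical GPU. Here’s a minimal, purely illustrative sketch; the token-slice size and data structures are assumptions, not Project Flywheel internals.

```python
# Sketch of interleaved prompt scheduling: rather than running one prompt to
# completion, small slices of work from each active prompt are round-robined
# across fractional GPU slots, so a long prompt can't block shorter ones.

from collections import deque

def interleave(prompts: dict[str, int], slice_tokens: int = 16) -> list[str]:
    """prompts maps a prompt id to the tokens it still needs; returns the execution order."""
    queue = deque(prompts.items())
    schedule = []
    while queue:
        prompt_id, remaining = queue.popleft()
        schedule.append(f"{prompt_id}: generate {min(slice_tokens, remaining)} tokens")
        if remaining > slice_tokens:
            queue.append((prompt_id, remaining - slice_tokens))  # back of the line
    return schedule

for step in interleave({"short-prompt": 20, "long-prompt": 600, "medium-prompt": 100}):
    print(step)
```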
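The flow is attestation first, keys second: verify that both the CPU and GPU trusted execution environments check out before releasing the keys that protect model weights or private data. The helpers below are hypothetical placeholders, not Azure APIs, sketched only to show that ordering.

```python
# Conceptual sketch only: the attestation and key-release helpers are
# hypothetical placeholders, not Azure APIs. The point is the ordering:
# verify both the CPU and GPU trusted execution environments before
# releasing the keys that protect model or RAG data.

def attest_cpu_tee() -> bool:
    """Placeholder: would validate the confidential VM's hardware attestation report."""
    return True

def attest_gpu_tee() -> bool:
    """Placeholder: would validate the GPU's attestation and encrypted CPU-GPU channel."""
    return True

def release_data_key() -> bytes:
    """Placeholder: a key-release service hands out the decryption key only after attestation."""
    return b"..."

if attest_cpu_tee() and attest_gpu_tee():
    key = release_data_key()
    # Decrypt private training or retrieval data inside the enclave and start the job.
else:
    raise RuntimeError("Environment failed attestation; refusing to decrypt data")
```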
From Russinovich’s presentation, it’s clear that Microsoft is investing heavily in making its AI infrastructure efficient and responsive for training and inference. The Azure infrastructure and platform teams have put a lot of work into building out hardware and software that can support training the largest models, while providing a secure and reliable place to use AI in your applications.
Running OpenAI on Azure has given those teams a lot of experience, and it’s good to see that experience paying off in providing the same tools and techniques for the rest of us, even if we don’t need our own TOP500 supercomputers.