InfiniBand Architecture: Building High-Performance Networks for AI and HPC
As computing power surges and demand for high-performance computing (HPC) and AI workloads grows, the networks linking these systems must deliver extreme speed and low latency. InfiniBand has emerged as a leading interconnect architecture in this arena, offering exceptional throughput and ultra-low latency. It is now a go-to solution for supercomputers and AI clusters – as of mid-2023, InfiniBand connects 63 of the world’s top 100 supercomputers. This article explores the InfiniBand architecture, its core components, how it differs from traditional TCP/IP networking, and why it’s ideal for building fast, scalable networks for AI and HPC.
Fundamentals of InfiniBand Architecture
InfiniBand (IB) is a high-performance communications standard that defines a switched, point-to-point input/output fabric for interconnecting servers, storage, and other infrastructure. An InfiniBand fabric can support up to ~64,000 addressable devices within a network, allowing massive scalability for large clusters (What is InfiniBand and its difference with Ethernet | FiberMall) (InfiniBand: The High-Performance Network Protocol for Today’s Computing – ATGBICS). The InfiniBand Architecture (IBA) specification, maintained by the InfiniBand Trade Association (IBTA), standardizes this framework for vendor-neutral interoperability.
Key characteristics of InfiniBand networks include ultra-low latency, very high bandwidth, and low CPU overhead. InfiniBand is designed as a true fabric architecture: each device (node) connects to the fabric via dedicated high-speed links to switches, rather than sharing a single bus or Ethernet collision domain. This direct point-to-point connectivity avoids network congestion and achieves consistently high throughput. InfiniBand’s design also supports carrying multiple types of traffic (clustering, storage, management, etc.) over one unified network. It’s an ideal interconnect for parallel computing environments with thousands of interconnected nodes, where it can handle aggregated data flows from AI training, scientific simulations, storage access, and more (InfiniBand - A low-latency, high-bandwidth interconnect).
Data Transfer and RDMA Benefits
One of InfiniBand’s core advantages is how it handles data transfers using Remote Direct Memory Access (RDMA). In a traditional network stack, the operating system kernel mediates all data movement – applications must copy data to kernel buffers and invoke the OS for every send/receive. This incurs context switches and CPU overhead that add latency. InfiniBand takes a different approach: it allows applications to exchange data directly over the network with minimal OS intervention. Through RDMA, an InfiniBand interface can read or write directly to the memory of a remote node without involving the remote CPU or OS in the data path. This application-centric, zero-copy communication is a key differentiator between InfiniBand and traditional interconnects, and it drastically reduces latency and CPU load for data transfers (InfiniBand - A low-latency, high-bandwidth interconnect) (How to Choose Between InfiniBand and RoCE | FiberMall). In practice, messages can be placed straight into an application’s buffer on the receiving side by the InfiniBand adapter hardware, bypassing the usual kernel processing. This efficient data-movement model is crucial for scaling high-performance clusters – it frees up CPU cycles and enables data to flow with only nanoseconds to a few microseconds of overhead, even under heavy I/O loads.
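To make the zero-copy model concrete, here is a minimal sketch using the Linux RDMA verbs API (libibverbs) to register a local buffer and post a one-sided RDMA WRITE toward a peer. It assumes an already-connected reliable queue pair `qp`, and that the peer’s buffer address `remote_addr` and memory key `rkey` were exchanged out of band (for example over a TCP side channel); those parameter names are placeholders for this illustration, not part of any fixed API.

```c
/* Sketch: one-sided RDMA WRITE with libibverbs (compile with -libverbs).
 * Assumes an already-connected RC queue pair and an out-of-band exchange
 * of the peer's buffer address and rkey -- both are placeholders here. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>

int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       uint64_t remote_addr, uint32_t rkey)
{
    static char buf[4096] = "payload written directly into remote memory";

    /* Register the buffer so the HCA can DMA from it without kernel copies. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, sizeof(buf), IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { perror("ibv_reg_mr"); return -1; }

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = sizeof(buf),
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr = {0}, *bad_wr = NULL;
    wr.opcode              = IBV_WR_RDMA_WRITE;   /* one-sided: remote CPU not involved */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* request a completion entry */
    wr.wr.rdma.remote_addr = remote_addr;         /* peer address learned out of band */
    wr.wr.rdma.rkey        = rkey;                /* peer memory key learned out of band */

    if (ibv_post_send(qp, &wr, &bad_wr)) { perror("ibv_post_send"); return -1; }

    /* Poll the completion queue: the HCA reports when the write has finished. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                          /* busy-poll for brevity */
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA WRITE failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    ibv_dereg_mr(mr);
    return 0;
}
```

A real application would also create and connect the queue pair and exchange the buffer address and rkey with its peer beforehand; only the data path is shown here, because that is where the kernel bypass and zero-copy behavior pay off.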
InfiniBand vs. Traditional TCP/IP Networks
While InfiniBand and Ethernet networks both use layered architectures, InfiniBand is purpose-built to overcome the bottlenecks of TCP/IP networking in HPC environments. In distributed HPC or AI applications that require low latency and high I/O concurrency, the standard TCP/IP software stack often struggles to meet performance requirements (How to Choose Between InfiniBand and RoCE | FiberMall). Traditional TCP/IP communication involves multiple kernel-mediated steps (packet routing through the OS, copying data between user and kernel space, etc.), which introduce additional latency and CPU overhead. For example, each message on a TCP/IP network may be buffered and processed by the kernel on both sending and receiving hosts, adding tens of microseconds of delay and significant CPU interruption.
InfiniBand’s design avoids these inefficiencies. As mentioned, RDMA and kernel bypass allow InfiniBand to eliminate most of the CPU involvement in data transfer. This means InfiniBand can sustain much lower latencies than Ethernet even at similar link speeds – on the order of a few microseconds end-to-end, versus tens of microseconds for traditional TCP/IP. In large-scale parallel clusters, those savings per message translate to major performance gains. Studies have shown that by bypassing the kernel protocol stack, application-level latencies can drop from ~50 μs (TCP/IP) to under 5 μs with RDMA networks (on RoCE or InfiniBand) (How to Choose Between InfiniBand and RoCE | FiberMall). In effect, InfiniBand was created to offload networking tasks to intelligent adapters and switches, so that the network can keep up with the computing horsepower of modern supernodes. This makes InfiniBand especially suited for large HPC clusters and AI supercomputers, whereas traditional Ethernet networks are often a limiting factor for those workloads. In distributed storage and database contexts, InfiniBand is commonly deployed as the high-speed cluster interconnect, while TCP/IP might still be used for general enterprise traffic – each excels in its domain.
Another difference is that InfiniBand implements end-to-end reliability and flow control in hardware, yielding a nearly lossless network. Ethernet networks typically rely on upper-layer protocols (like TCP) to detect and retransmit lost packets, whereas InfiniBand’s link-level protocol prevents packet loss altogether (as described later in this article). The result is a network with deterministic throughput and latency, critical for tightly coupled computing jobs. In summary, InfiniBand’s architecture and RDMA capabilities directly address the performance limitations of TCP/IP, enabling high-frequency, low-latency communication at scale that traditional networks cannot easily achieve (How to Choose Between InfiniBand and RoCE | FiberMall).
Layered Architecture of InfiniBand
InfiniBand architecture is organized into a set of layers, conceptually similar to the OSI or TCP/IP model, but with its own specialized roles and protocols. The major layers of InfiniBand include the Physical Layer, Link Layer, Network Layer, Transport Layer, and the Upper Layer protocols. These layers work together to deliver reliable, ordered, and extremely fast data transmission across the InfiniBand fabric. Below is an overview of each layer and its functions:
Upper Layer Protocols
At the top of the InfiniBand stack are the upper layer protocols, which interface with applications and enable InfiniBand to carry various types of traffic. InfiniBand is flexible in supporting multiple protocols for different purposes, including:
SCSI RDMA Protocol (SRP) – Allows SCSI storage commands to be transported over InfiniBand, enabling fast access to remote disks as if they were local. This is used for high-speed storage area networks.
IP over InfiniBand (IPoIB) – Encapsulates IP packets over InfiniBand, letting standard TCP/IP traffic run on an InfiniBand network. This permits integration of InfiniBand into existing IP-based infrastructures and applications.
Sockets Direct Protocol (SDP) – A protocol that accelerates traditional socket (TCP) applications by using InfiniBand underneath. SDP bypasses the TCP/IP stack, mapping socket calls to InfiniBand operations for lower latency.
Message Passing Interface (MPI) – A communication library widely used in parallel computing (HPC). While MPI itself is software, it can utilize InfiniBand’s low-level transport (via RDMA) to implement its messaging routines. InfiniBand’s low latency and high throughput significantly benefit MPI-based HPC applications (a minimal example appears below).
These are just a few examples – InfiniBand’s upper layer can carry other specialized protocols as well. The key point is that InfiniBand isn’t restricted to one type of traffic; it is designed to consolidate clustering, storage, and networking I/O onto a single fabric, which simplifies infrastructure and improves efficiency.
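As a concrete example of the MPI item above, the short program below bounces a message between two ranks. MPI itself is fabric-agnostic: the same code runs over Ethernet or InfiniBand, but when the MPI library is built against an InfiniBand-capable transport (for example UCX or verbs), these sends and receives are carried by RDMA under the hood.

```c
/* Minimal MPI ping-pong between ranks 0 and 1.
 * Build: mpicc pingpong.c -o pingpong   Run: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char msg[64] = {0};
    double t0 = MPI_Wtime();

    if (rank == 0) {
        snprintf(msg, sizeof(msg), "ping");
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got \"%s\" back in %.1f us\n", msg, (MPI_Wtime() - t0) * 1e6);
    } else if (rank == 1) {
        MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        snprintf(msg, sizeof(msg), "pong");
        MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```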
Transport Layer
The transport layer of InfiniBand provides end-to-end communication between nodes, taking on responsibilities similar to TCP/UDP in the Internet stack but with important differences. InfiniBand’s transport layer supports both reliable and unreliable transport modes and manages tasks like message sequencing, acknowledgment, and error checking to ensure data integrity. Crucially, the transport layer is where InfiniBand implements RDMA operations. Rather than establishing heavy software connections, InfiniBand sets up lightweight, hardware-managed communication contexts (queue pairs) between endpoints. This allows the network interface cards (called Host Channel Adapters in InfiniBand) to handle most of the data transfer work.
In InfiniBand, a send or RDMA write from one node can be executed such that the data is placed directly into the memory of the target node by the hardware, with the transport layer guaranteeing delivery and order. Communication is organized around channels: a queue pair at each end forms a logical path that links two processes on different nodes. Once established, these channels enable data to flow with minimal involvement from the host CPUs. The InfiniBand transport offloads communication tasks (like segmentation, reassembly, retries, etc.) to the adapters and switches. As a result, when a message arrives at the destination HCA, it is delivered straight into the application’s buffer, with the wire-protocol headers already stripped by the adapter, without needing the CPU to copy data out of a kernel socket buffer. This is how InfiniBand achieves extremely low application-level latency and frees up the processor for computation.
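The following sketch shows how such a hardware-managed context is created with the verbs API: open an HCA, allocate a protection domain, create a completion queue, and create a reliable-connected queue pair. Connecting the queue pair to its peer (moving it through the INIT, RTR, and RTS states with `ibv_modify_qp` after exchanging LIDs and QP numbers out of band) is omitted for brevity; the resource sizes chosen here are illustrative.

```c
/* Sketch: creating a reliable-connected (RC) queue pair with libibverbs. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no InfiniBand devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);               /* first HCA on this host */
    struct ibv_pd      *pd  = ibv_alloc_pd(ctx);                      /* protection domain */
    struct ibv_cq      *cq  = ibv_create_cq(ctx, 128, NULL, NULL, 0); /* completion queue */

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,          /* reliable connection: ordered, acknowledged */
        .cap = {
            .max_send_wr  = 64,         /* outstanding send work requests */
            .max_recv_wr  = 64,         /* outstanding receive work requests */
            .max_send_sge = 1,
            .max_recv_sge = 1,
        },
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);
    if (!qp) { perror("ibv_create_qp"); return 1; }

    printf("created RC queue pair, QP number 0x%x\n", qp->qp_num);

    /* A real application would now move the QP through INIT -> RTR -> RTS with
     * ibv_modify_qp() after exchanging LIDs/QP numbers with the peer out of band. */
    ibv_destroy_qp(qp);
    ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```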
By contrast, in a TCP/IP network, the transport (TCP) must interact heavily with the OS kernel (for buffering, retransmissions, etc.), which adds latency and overhead. InfiniBand’s transport layer avoids that overhead by design (How to Choose Between InfiniBand and RoCE | FiberMall). It provides features like credit-based flow control (to prevent overrunning a receiver, handled at the link layer as described below) and can also support multicast and other advanced transport capabilities. Overall, the InfiniBand transport layer is optimized to deliver data reliably with zero-copy and low CPU usage, aligning with the needs of HPC and real-time data processing.
Network Layer
InfiniBand’s network layer deals with addressing and routing of packets across the fabric. An InfiniBand fabric can be divided into multiple subnets – each subnet is an independent domain of communication, typically consisting of one or more InfiniBand switches and the end-nodes connected to them. Within a single subnet, all devices can reach each other through the switches using link-layer addresses. To scale beyond a single subnet (for very large deployments), InfiniBand uses routers to connect subnets together into a larger network. The network layer’s role is most apparent when routing between subnets.
Within a subnet, InfiniBand does not require IP addressing – it uses its own addressing scheme (Local Identifiers, discussed shortly) and a distributed routing method set up by a Subnet Manager. The Subnet Manager (SM) is a software service (often running on one of the switches or a dedicated node) that discovers the topology of the fabric, assigns addresses, and computes routing tables for the switches. InfiniBand switches forward packets based on these tables, similar to how an Ethernet switch uses MAC addresses, but here the addresses are InfiniBand-specific and the forwarding logic is managed centrally by the SM. If a network spans multiple subnets, routers are used to forward traffic between subnets; these routers operate at the InfiniBand network layer, using a global addressing scheme (Global Identifiers) to route packets to the correct subnet.
In summary, InfiniBand’s network layer ensures that any two end nodes in the fabric can communicate, either directly within one subnet or via routers across subnets. Unlike IP networks, where routing protocols dynamically route packets, InfiniBand relies on the centralized Subnet Manager to set up all routing information. This approach is well-suited to the relatively static, tightly controlled networks in HPC clusters. It guarantees that packet paths are optimized and loop-free, and it simplifies the forwarding logic in the switches (since they just follow the SM’s forwarding table). The result is very fast hardware-level routing with minimal per-packet decision-making, contributing to InfiniBand’s low latency.
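The address assignment performed by the Subnet Manager is visible to applications. The minimal verbs query below reads the attributes of the first port on the first HCA and prints the LID the SM assigned to it and the LID of the SM itself, much like the `ibstat` and `ibv_devinfo` utilities report.

```c
/* Sketch: read the SM-assigned LID of the first port on the first HCA. */
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no InfiniBand devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_port_attr port;
    if (ibv_query_port(ctx, 1, &port)) { perror("ibv_query_port"); return 1; }

    printf("device %s, port 1:\n", ibv_get_device_name(devs[0]));
    printf("  state : %s\n", ibv_port_state_str(port.state));           /* e.g. PORT_ACTIVE */
    printf("  LID   : 0x%04x (assigned by the Subnet Manager)\n", port.lid);
    printf("  SM LID: 0x%04x (where the Subnet Manager resides)\n", port.sm_lid);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```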
Link Layer
The link layer in InfiniBand is responsible for node-to-node data framing, switching within subnets, and enforcement of flow control. It roughly corresponds to Ethernet’s data link layer but with additional capabilities to support InfiniBand’s reliability and performance. Important aspects of the InfiniBand link layer include:
Local Identifiers (LIDs): Each device (end node or switch port) in an InfiniBand subnet is assigned a 16-bit Local Identifier by the Subnet Manager. The LID serves as the link-layer address. Packets within a subnet are sent to the destination’s LID. InfiniBand switches maintain forwarding tables that map destination LIDs to output ports, allowing them to switch packets toward the correct receiver. These forwarding tables are set up by the SM, which calculates the network routes. Using LIDs (which can accommodate on the order of 48k–64k nodes per subnet) enables InfiniBand to scale to large subnet sizes (Mellanox InfiniBand FAQ) without requiring IP routing internally.
Packet Switching: InfiniBand switches move packets through the fabric by looking at the destination LID and forwarding the packet out the appropriate port toward that node. This is hardware-based switching with very low port-to-port latency (on the order of 100 ns in modern InfiniBand switches). The use of a switched fabric means multiple paths can exist, and InfiniBand even supports adaptive routing – the ability to distribute traffic across multiple paths to avoid hot-spots. Within a single subnet, the switching is transparent to the end nodes, similar to an Ethernet switch but with InfiniBand-specific addressing. Because the network is managed as a fabric, it can be engineered into various topologies (fat-tree, mesh, dragonfly, etc.) to meet bandwidth and redundancy needs.
Credit-Based Flow Control: The InfiniBand link layer employs a credit token mechanism for flow control on each link. This ensures a sender cannot overwhelm a receiver’s buffer. Essentially, each receiving end of a link advertises credits to the sender, indicating how many packets or bytes it can accept. The sender only transmits data when it has credit indicating the receiver has buffer space available (How to Choose Between InfiniBand and RoCE | FiberMall). As the receiver processes and frees up buffer space, it returns credits to the sender. This backpressure system prevents buffer overflow and thus prevents packet loss due to congestion. It’s a stark contrast to Ethernet’s lack of ubiquitous link-level flow control (Ethernet typically resorts to dropping packets when congested and relies on higher layers to recover). InfiniBand’s flow control operates on virtual lanes (multiple priority channels over the same physical link) to manage traffic and maintain lossless behavior (a toy model of this credit scheme appears after this list).
Lossless Fabric: Thanks to the robust flow control, InfiniBand networks are essentially lossless under normal operation – packets are not dropped due to congestion. The sender will pause or slow down if the network is congested until the receiver signals readiness. This is crucial for applications like HPC and AI training, where dropped packets would not only hurt performance but could stall tightly synchronized computations. In InfiniBand, even when the network is busy, data is delivered reliably without needing retransmission at higher layers (Mellanox InfiniBand FAQ). The guaranteed delivery and in-order packet arrival simplify the design of parallel algorithms and storage protocols on InfiniBand, since they see a very stable, predictable network behavior. (InfiniBand does have mechanisms for handling rare events like bit errors or hardware failures, but routine congestion is handled gracefully without loss.)
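The following toy model (plain C, no InfiniBand APIs) illustrates the credit idea referenced in the list above: the sender may only transmit while it holds credits, and credits are replenished only as the receiver drains its buffer. Real InfiniBand flow control is implemented per virtual lane in switch and adapter hardware; this is just the control logic in miniature.

```c
/* Toy model of credit-based link flow control (illustration only). */
#include <stdio.h>

#define RX_BUFFER_SLOTS 4   /* receiver buffer capacity, in packets */

int main(void)
{
    int credits   = RX_BUFFER_SLOTS;  /* credits initially advertised by the receiver */
    int rx_queued = 0;                /* packets sitting in the receiver buffer */
    int to_send   = 10;               /* packets the sender wants to transmit */

    while (to_send > 0 || rx_queued > 0) {
        /* Sender side: transmit only while credits are available -- never drop. */
        while (to_send > 0 && credits > 0) {
            credits--; rx_queued++; to_send--;
            printf("sent packet (credits left: %d)\n", credits);
        }
        /* Receiver side: drain one buffered packet, then return a credit. */
        if (rx_queued > 0) {
            rx_queued--; credits++;
            printf("receiver drained a packet, credit returned (credits: %d)\n", credits);
        }
    }
    printf("all packets delivered, none dropped\n");
    return 0;
}
```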
In summary, the link layer gives InfiniBand its efficiency and reliability through hardware-based switching and flow control. Each subnet behaves like a large, lossless switching domain where every node can reach every other with minimal interference and no packet loss. This layer is one of the reasons InfiniBand can attain such high effective bandwidth utilization – virtually all the transmitted packets make it to their destination on the first attempt, rather than being retransmitted later.
Physical Layer
The physical layer of InfiniBand defines the electrical, optical, and mechanical properties of the connections in the network. This includes the cables, transceivers, connectors, and signaling rates used for InfiniBand links. InfiniBand supports both copper and fiber optic media, depending on the distance and speed requirements. For example, short links within a rack or between adjacent racks often use Direct Attach Copper (DAC) cables, while longer links use Active Optical Cables (AOC) or pluggable optical transceivers with fiber. The physical layer specification ensures interoperability between different vendors’ hardware by standardizing connector types and signaling (including features like lane configurations and encoding).
InfiniBand physical links are typically denoted by lane counts and data rates. Common configurations include 1×, 4×, or 12× links – meaning a cable with 1, 4, or 12 lanes in parallel. Each lane operates at a certain gigabit rate. For instance, a single lane at HDR (High Data Rate) runs at 50 Gb/s, so a 4× HDR link is 4 * 50 = 200 Gb/s. The latest generation NDR (Next Data Rate) uses 100 Gb/s per lane, so a 4× NDR link is 400 Gb/s. Through channel bonding, InfiniBand can achieve extremely high bandwidth per port (e.g. 12× links used in some systems). As of 2024, 200 Gb/s HDR links are widely deployed in HPC systems, and 400 Gb/s NDR technology is emerging in cutting-edge AI supercomputers (How to Choose Between InfiniBand and RoCE | FiberMall). The roadmap for InfiniBand projects even higher speeds (e.g. a future XDR generation at 800 Gb/s). Importantly, the physical layer also incorporates features like hot-swapping (you can insert/remove cables or adapters without bringing down the whole network) and robust error detection at the physical level to trigger retries if needed.
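The per-port arithmetic above is simply lanes multiplied by the per-lane rate. The small program below reproduces it for the nominal per-lane data rates commonly quoted for recent generations (EDR 25, HDR 50, NDR 100, and the planned XDR 200 Gb/s per lane); usable throughput is somewhat lower once encoding and protocol overhead are accounted for.

```c
/* Nominal InfiniBand link bandwidth: lanes x per-lane data rate. */
#include <stdio.h>

int main(void)
{
    const struct { const char *gen; int per_lane_gbps; } gens[] = {
        { "EDR",  25 },
        { "HDR",  50 },
        { "NDR", 100 },
        { "XDR", 200 },   /* roadmap generation */
    };
    const int lane_counts[] = { 1, 4, 12 };

    for (size_t g = 0; g < sizeof(gens) / sizeof(gens[0]); g++)
        for (size_t l = 0; l < sizeof(lane_counts) / sizeof(lane_counts[0]); l++)
            printf("%s %2dx: %4d Gb/s nominal\n", gens[g].gen, lane_counts[l],
                   lane_counts[l] * gens[g].per_lane_gbps);
    return 0;
}
```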
By standardizing the physical interconnect, InfiniBand makes it practical to build large fabrics – system builders can mix and match compliant switches, adapters, and cables. The end result is a reliable, high-bandwidth pipe between any two points in the network, forming the foundation upon which the higher InfiniBand layers operate.
Conclusion
InfiniBand architecture has established itself as a preferred choice for building high-performance networks in supercomputing and large-scale AI clusters, thanks to its exceptional throughput and low-latency characteristics. Its unique design – leveraging hardware-based switching, RDMA, and lossless operation – makes it ideal for handling large-scale data transfers and the communication-intensive workloads of modern HPC and AI applications. In practical terms, InfiniBand enables supercomputers to run at their full potential by minimizing network bottlenecks. As the demand for computational power continues to grow and data centers scale out to ever more nodes, InfiniBand will remain a crucial interconnect technology in the realms of science, engineering, and enterprise computing. With new generations of InfiniBand delivering even greater speeds and efficiency, it is poised to continue underpinning advances in AI model training, scientific research simulations, and other HPC endeavors for years to come.
References
InfiniBand Trade Association, “About InfiniBand,” 2023. [Online]. Available: infinibandta.org/about-infiniband/.
B. (Brian) Zhang, “What is InfiniBand Network and the Difference with Ethernet?,” FiberMall Blog, Jun. 21, 2024. [Online]. Available: fibermall.com/blog/what-is-infiniband-network-and-difference-with-ethernet.htm.
C. (Catherine) Luo, “How to Choose Between InfiniBand and RoCE,” FiberMall Blog, Jun. 23, 2024. [Online]. Available: fibermall.com/blog/how-to-choose-between-infiniband-and-roce.htm.
Mellanox Technologies (NVIDIA), “InfiniBand FAQ,” White Paper, Rev. 1.3, Dec. 2014. [Online]. Available: network.nvidia.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf.