AI Networking - Push to Standardization

Scale-Up vs. Scale-Out

Artificial Intelligence (AI) workloads, driven especially by the explosive growth of Large Language Models (LLMs), are dramatically reshaping networking requirements along two crucial dimensions: Scale-Up and Scale-Out. Effective AI networking, grounded in optimized protocols and scalable architectures, is critical to achieving high throughput, low latency, and efficient handling of vast data flows.

Scale-Up refers to enhancing performance and capacity within a single node or system. This involves ultra-low latency and high-bandwidth communication between CPUs, accelerators (GPUs, FPGAs, ASICs), and memory within a tightly integrated environment. High-speed interconnects like NVIDIA’s NVLink or AMD’s Infinity Fabric exemplify such scale-up solutions.

Scale-Out involves increasing capacity and performance across distributed systems or clusters, emphasizing seamless handling of parallelism, network congestion, and synchronization across multiple interconnected nodes. This typically involves inter-GPU communications and requires robust network infrastructures capable of managing data parallelism and collective communications effectively.

Challenges Due to Lack of a Standard Interface

The current AI networking ecosystem faces fragmentation due to multiple proprietary solutions, resulting in:

  • Interoperability Issues - Integration complexities across diverse systems.
  • Scalability Constraints - Limited flexibility in rapidly scaling infrastructure.
  • Vendor Lock-In - Dependence on specific vendors' technologies, stifling innovation.
  • Complex System Optimization - Difficulty in optimizing communication pathways efficiently, leading to potential performance degradation.

Addressing Technical Challenges: Latency, Performance, and Congestion Management

AI workloads have distinct networking requirements during their lifecycle:

  • Training Phase - Demands sustained high bandwidth with very little tolerance for packet loss, characterized by large, long-lived "elephant flows" that require uninterrupted communication channels.
  • Inference Phase - Prioritizes ultra-low latency to ensure quick response and enhanced user experience for interactive applications.
  • Congestion Management - Effective congestion control strategies are essential to maintain optimal network performance, minimize packet loss, and reduce latency variations, especially at scale.

Two emerging standards address these networking challenges: the Ultra Accelerator Link (UALink) for scale-up, and Ultra Ethernet, developed by the Ultra Ethernet Consortium (UEC), for scale-out.

Ultra Accelerator Link (UALink)

UALink is an interconnect protocol that addresses scale-up challenges by enabling efficient, ultra-low-latency data transfers between components within a node. Key specifications include sub-1-microsecond (µs) round-trip latency for request-response transactions, less than 100 nanoseconds (ns) pin-to-pin latency, and signaling rates of up to 200 gigatransfers per second (GT/s) per lane.
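To put the per-lane signaling rate in perspective, a quick back-of-the-envelope calculation converts transfer rates into raw link bandwidth. The sketch below assumes one bit per transfer per lane (typical of serial SerDes-style links) and ignores encoding and protocol overhead; the four-lane configuration is an illustrative assumption, not a UALink requirement.

```python
# Back-of-the-envelope link bandwidth from a per-lane transfer rate.
# Assumes 1 bit per transfer per lane (serial SerDes-style signaling)
# and ignores line-coding/protocol overhead -- both are simplifications.

def lane_gbps(gt_per_s: float, bits_per_transfer: int = 1) -> float:
    """Raw per-lane bandwidth in Gbit/s."""
    return gt_per_s * bits_per_transfer

def link_gbytes_per_s(gt_per_s: float, lanes: int) -> float:
    """Raw unidirectional link bandwidth in GB/s."""
    return lane_gbps(gt_per_s) * lanes / 8

# A hypothetical 4-lane link at 200 GT/s:
print(link_gbytes_per_s(200, 4))  # -> 100.0 (GB/s, raw, one direction)
```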

Ultra Ethernet Consortium (UEC)

UEC targets scale-out solutions by enhancing Ethernet standards specifically for AI and HPC workloads. It emphasizes congestion control, advanced telemetry, precise synchronization, multi-path packet spraying, and tail latency reduction. Ultra Ethernet combines traditional Ethernet’s scalability with enhanced performance features, supporting large-scale deployments with up to 1,000,000 endpoints.
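The multi-path packet spraying idea can be made concrete with a small sketch: classic per-flow ECMP hashes a flow's five-tuple to a single path, so one elephant flow congests one link, whereas per-packet spraying mixes a packet sequence number into the hash and spreads the same flow across all equal-cost paths. The path count, field values, and hash choice below are illustrative assumptions, not part of any UEC specification.

```python
import hashlib

PATHS = 8  # number of equal-cost paths; illustrative value

def ecmp_path(flow_5tuple: tuple) -> int:
    """Classic per-flow ECMP: every packet of a flow hashes to the same
    path, so a single elephant flow can saturate one link."""
    digest = hashlib.sha256(repr(flow_5tuple).encode()).digest()
    return digest[0] % PATHS

def sprayed_path(flow_5tuple: tuple, pkt_seq: int) -> int:
    """Per-packet spraying: mixing the sequence number into the hash
    spreads one flow's packets across many paths (the transport must
    then tolerate out-of-order delivery)."""
    digest = hashlib.sha256(repr((flow_5tuple, pkt_seq)).encode()).digest()
    return digest[0] % PATHS

flow = ("10.0.0.1", "10.0.0.2", 4791, 4791, "UDP")
print(ecmp_path(flow))                                    # same path every time
print(sorted({sprayed_path(flow, s) for s in range(100)}))  # many distinct paths
```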

Core Technologies in AI Networking

Several networking technologies underpin these standards and enhance AI system performance:

  • Remote Direct Memory Access (RDMA): Enables direct memory-to-memory data transfers, bypassing CPU involvement, thereby significantly minimizing latency and maximizing bandwidth.
  • Collective Communication Libraries (CCLs): Facilitate data parallelism and synchronization across multiple GPUs or nodes using optimized methods such as ring-based All-Reduce algorithms to ensure efficient bandwidth utilization.
  • In-Network Computing (INC): Supported by Ultra Ethernet, INC offloads specific computational tasks to network devices, significantly reducing data movement and latency.
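The ring-based All-Reduce mentioned above can be sketched as a small sequential simulation. Each worker's vector is split into one chunk per worker; a reduce-scatter phase circulates and accumulates chunks around the ring, then an all-gather phase circulates the finished sums, so each worker transmits roughly 2*(N-1)/N of its data regardless of ring size. This is a minimal illustration under those assumptions, not any particular CCL's implementation.

```python
# Sequential simulation of ring All-Reduce (sum) across N workers.
# Chunks travel around the ring: N-1 reduce-scatter steps accumulate
# partial sums, then N-1 all-gather steps distribute the finished chunks.

def ring_allreduce(vectors):
    n = len(vectors)
    size = len(vectors[0])
    assert size % n == 0, "illustrative code: vector length divisible by N"
    c = size // n
    data = [list(v) for v in vectors]

    def chunk(w, k):                       # chunk k of worker w's buffer
        return data[w][k * c:(k + 1) * c]

    def set_chunk(w, k, vals):
        data[w][k * c:(k + 1) * c] = vals

    # Reduce-scatter: after N-1 steps, worker i holds the full sum
    # of chunk (i+1) % n. Sends are captured first to model the
    # simultaneous exchange of a real ring step.
    for step in range(n - 1):
        sends = [(w, (w - step) % n, chunk(w, (w - step) % n)) for w in range(n)]
        for w, k, vals in sends:
            dst = (w + 1) % n
            set_chunk(dst, k, [a + b for a, b in zip(chunk(dst, k), vals)])

    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [(w, (w + 1 - step) % n, chunk(w, (w + 1 - step) % n)) for w in range(n)]
        for w, k, vals in sends:
            set_chunk((w + 1) % n, k, vals)

    return data

workers = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
result = ring_allreduce(workers)
print(result[0])  # every worker now holds the elementwise sum [28, 32, 36, 40]
```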

Congestion Management: RDMA vs. UEC

  • RDMA Congestion Management: Uses Priority Flow Control (PFC) and Explicit Congestion Notification (ECN) mechanisms to ensure lossless data transfers, ideal for environments requiring ultra-low latency. 
  • UEC Congestion Management: Incorporates sophisticated endpoint-driven congestion control strategies and adaptive multi-path packet spraying, dynamically mitigating congestion at scale and proactively maintaining network performance.
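The contrast above can be illustrated with a toy sender-side rate controller driven by ECN feedback, loosely in the spirit of the DCQCN-style schemes used with RDMA over Ethernet: multiplicative decrease scaled by a smoothed congestion estimate, additive increase when the path is clean. The class, its constants, and the feedback model are all illustrative assumptions, not taken from any specification.

```python
# Toy ECN-driven sender rate control (DCQCN-inspired AIMD sketch).
# All constants are illustrative, not from any standard.

class EcnRateController:
    def __init__(self, line_rate_gbps: float):
        self.line_rate = line_rate_gbps
        self.rate = line_rate_gbps   # current sending rate, Gbit/s
        self.alpha = 0.0             # EWMA estimate of ECN-mark fraction
        self.gain = 1 / 16           # EWMA gain (illustrative)
        self.step = 1.0              # additive-increase step, Gbit/s

    def on_feedback(self, ecn_marked: bool) -> None:
        # Smooth the congestion signal from per-packet ECN marks.
        mark = 1.0 if ecn_marked else 0.0
        self.alpha = (1 - self.gain) * self.alpha + self.gain * mark
        if ecn_marked:
            # Multiplicative decrease, scaled by congestion severity.
            self.rate *= 1 - self.alpha / 2
        else:
            # Additive increase back toward line rate on clean feedback.
            self.rate = min(self.line_rate, self.rate + self.step)

ctrl = EcnRateController(line_rate_gbps=400.0)
for _ in range(20):
    ctrl.on_feedback(ecn_marked=True)   # sustained congestion: rate falls
congested_rate = ctrl.rate
for _ in range(1000):
    ctrl.on_feedback(ecn_marked=False)  # congestion clears: rate recovers
```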

Standardization through UALink and UEC offers a promising path beyond today's fragmentation and technical limitations, improving interoperability, reducing latency, and significantly boosting performance to create scalable, efficient AI infrastructure.
