Personal Perspectives on Huawei's AI Network Architecture, UB-Mesh

NOTE: This article is adapted from one of my blog posts (https://mp.weixin.qq.com/s/SAXnLt7VwXneeV2VHriWug).

Recently, many industry peers have asked for my views on Huawei's innovative AI network architecture, UB-Mesh. To share ideas with colleagues, I've organized my insights as follows.

1. Design Philosophy: Prioritizing Network Cost Reduction

From the hardware cost structure of AI infrastructure, GPU expenditure dominates overall investment, while network equipment typically accounts for less than 15%. This cost composition dictates that the core task of designing an AI network is to build high-bandwidth communication links for GPUs, avoiding idle computing resources caused by data exchange delays. For example, DeepSeek achieves "overlapping communication with computation" through software-hardware co-design and engineering optimizations, fundamentally aiming to maximize GPU utilization and prevent computation resource waste.
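The "overlap" idea is easy to see in code. Below is a minimal sketch of the pattern (my own illustration, not DeepSeek's actual implementation), assuming an already-initialized torch.distributed process group; the bucket list and its granularity are hypothetical.

```python
import torch.distributed as dist

def allreduce_overlapped(grad_buckets):
    """Overlap gradient all-reduce with ongoing backward computation.

    `grad_buckets` is a hypothetical list of gradient tensors that become
    ready one bucket at a time during the backward pass.
    """
    pending = []
    for bucket in grad_buckets:
        # async_op=True returns immediately with a work handle; the
        # collective runs on the communication stream while the GPU
        # keeps computing gradients for the next bucket.
        work = dist.all_reduce(bucket, op=dist.ReduceOp.SUM, async_op=True)
        pending.append((work, bucket))
    for work, bucket in pending:
        work.wait()                      # block only when the optimizer needs it
        bucket /= dist.get_world_size()  # turn the sum into an average
```

The point of the pattern is exactly the one above: keep the GPUs busy so that network latency is hidden rather than paid for.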

Unlike mainstream industry solutions, UB-Mesh focuses heavily on reducing network costs: it minimizes the number of switches and optical modules and relies on cheap switches with lower switching capacity, specifically low-radix switches (LRS). Although the paper never states it explicitly, this goal forces compromises on other design objectives.

2. Topology: Adoption of nD-FullMesh Over CLOS

The industry predominantly employs CLOS or Fat-Tree topologies to construct scalable Scale-out and even Scale-up networks. CLOS architecture is the de facto standard for Scale-out networks, requiring no further elaboration. Let’s instead delve into the evolution of NVIDIA’s Scale-up network technologies.

  • Inside GPU servers, GPU interconnect technology has evolved from P2P full-mesh to NVSwitch-chip-based interconnects since the DGX-2 era.


  • SuperPoD solutions such as the GB200/GB300 NVL72 further unify intra-compute-tray and inter-compute-tray GPU interconnects through external NVSwitch implementations.


The primary advantages of switching-chip-based interconnection over P2P full-mesh interconnection include:

  • Maximizing the bandwidth available between any pair of GPUs, rather than statically splitting it equally across all interconnected GPUs.


  • Flexibly adapting to diverse communication patterns (e.g., All-Reduce, All-to-All, P2P).

For this reason, mainstream scale-up network solutions for super-PoDs have abandoned point-to-point full-mesh connections inside compute trays in favor of switch-chip-based interconnection. (Note that Huawei's CloudMatrix384 uses switch chips for its Scale-up interconnect as well.)
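A back-of-the-envelope comparison makes the first advantage concrete. The sketch below is illustrative only; the bandwidth and GPU-count figures are hypothetical, not taken from any product.

```python
def per_peer_bandwidth_gbps(total_bw_gbps: float, num_gpus: int):
    """Compare how much bandwidth one GPU can direct at a single peer."""
    # P2P full-mesh: the GPU's lanes are statically carved into N-1
    # fixed links, so one peer can never receive more than one slice.
    full_mesh = total_bw_gbps / (num_gpus - 1)
    # Switch-based: all lanes terminate on switch chips, so in the limit
    # the full bandwidth can be steered at whichever peer needs it.
    switched = total_bw_gbps
    return full_mesh, switched

fm, sw = per_peer_bandwidth_gbps(7200, 8)  # e.g., 8 GPUs, 7.2 Tbps each
print(f"full-mesh cap per peer: {fm:.0f} Gbps; switched cap: {sw:.0f} Gbps")
```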

As admitted in the paper,

“Traditional datacenters usually adopt the Clos architecture, which employs two or three tiers of switches to symmetrically connect NPUs/CPUs, offering high performance and adaptability to various traffic patterns.”

However, to reduce network costs, UB-Mesh retains traditional P2P full-mesh interconnects within single boards (compute trays) and extends this architecture from 1-D (intra-board) to 3-D (inter-rack) full connectivity.
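Concretely, in an nD-FullMesh every node is directly wired to every other member of its group in each dimension. A small sketch (the group sizes are my own illustrative choices, not the paper's exact configuration):

```python
def fullmesh_neighbors(coord, dims):
    """Direct neighbors of a node in an nD-FullMesh.

    `dims[k]` is the group size in dimension k (e.g., NPUs per board,
    boards per rack, racks per group). Two nodes are adjacent iff their
    coordinates differ in exactly one dimension.
    """
    neighbors = []
    for axis, size in enumerate(dims):
        for v in range(size):
            if v != coord[axis]:
                other = list(coord)
                other[axis] = v
                neighbors.append(tuple(other))
    return neighbors

dims = (8, 8, 4)   # hypothetical: 8 NPUs/board, 8 boards/rack, 4 racks
print(len(fullmesh_neighbors((0, 0, 0), dims)))   # (8-1)+(8-1)+(4-1) = 17
```

Every one of those direct links has fixed capacity, which is precisely where the adaptability problem below comes from.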


This design consequently introduces a key challenge: poor adaptability to dynamic and complex traffic patterns, which forces traffic access relationships to be defined in advance.

On this point, the paper concedes:

“Although UB-Mesh’s hardware architecture design matches the traffic patterns of LLM training, running workloads on such a hierarchical AI cluster may suffer from low utilization if training tasks are not effectively distributed across computing resources.”

3. Cost Reduction: Trading GPU Resources for Switch Savings

UB-Mesh reduces switch and optical module usage through nested full-mesh topologies (1-D intra-board, 2-D intra-rack, 3-D inter-rack), optimizing hardware costs by consuming GPU resources instead of switches.

As the paper notes:

“Since each NPU is also a router, and the entire system contains a large number of NPUs, to conserve NPU hardware resources, the routing system must efficiently handle routing-table lookups and forwarding operations.”

Of course, this strategy also aims to patch the inherent flaw of traditional point-to-point full-mesh topologies, namely fixed, equal bandwidth allocation across all nodes, which is exactly what makes them adapt poorly to dynamic and complex traffic patterns.


This approach closely resembles data center network (DCN) architectures such as DCell and BCube, proposed by Microsoft Research Asia years ago. BCube, for example, replaces some or all switches by using server CPUs to relay data; incidentally, BCube also employs the same source-routing concept that UB-Mesh uses for forwarding. The difference is that UB-Mesh uses far more expensive NPU resources, rather than CPUs, to stand in for relatively cheap switches as data relays.

4. Routing Strategy: Reliance on Non-Shortest-Path Algorithms

In nD Full-mesh topologies, multiple paths exist between nodes. UB-Mesh proposes using both shortest paths and detour paths to maximize bandwidth utilization.


As the paper points out, “In the nD-FullMesh topology, between two endpoints, there are several possible paths with diverse distances. The system should enable the use of non-shortest paths to maximize bandwidth utilization across the network.”

This multi-path forwarding strategy incurs the following costs:

  • The computational complexity of multi-path routing grows rapidly with topology dimensionality: the source node must run a complex load-balancing algorithm over the shortest paths plus detour paths of varying lengths and partially overlapping links across multiple dimensions (a sketch of the path explosion follows this list).


  • The transmission latency varies significantly across different paths (e.g., shortest paths vs. detour paths), exacerbating the "weakest link effect" that collective communication systems strive to avoid. Even within the same rack, communication latency between NPUs can differ dramatically depending on the paths taken.
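To see the path explosion, consider just one full-mesh group. A sketch (illustrative, not the paper's APR algorithm):

```python
def candidate_paths(src, dst, group):
    """List the direct path and all 2-hop detours inside one full-mesh group.

    In a full mesh, every pair has one 1-hop direct path plus (size - 2)
    2-hop detours through each other member; a load balancer at the
    source must weight these unequal-length, partially overlapping paths.
    """
    paths = [[src, dst]]                  # the shortest (direct) path
    for via in group:
        if via not in (src, dst):
            paths.append([src, via, dst]) # detour through a relay NPU
    return paths

group = list(range(8))                    # e.g., 8 NPUs on one board
print(len(candidate_paths(0, 1, group)))  # 1 direct + 6 detours
```

And this is a single dimension: across 2-D and 3-D groups the detour combinations multiply, while every detour hop also consumes the relay NPU's own link bandwidth.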

5. Forwarding Mechanism: Reliance on Source Routing

To fully exploit this diversity of paths with varying lengths, UB-Mesh must adopt a source-routing mechanism.

As mentioned in the paper, “To fully utilize the paths provided by APR, one practical way is leveraging the Source Routing (SR) mechanism.”


Compared with load balancing schemes like ECMP (Equal-Cost Multi-Path) or WECMP (Weighted ECMP), the source routing scheme is more complex:

  • Massive source route state maintenance: Source nodes must maintain end-to-end path state information for nearly all destination nodes, akin to the pressure on head nodes in Segment Routing domains.
  • Increased data encapsulation overhead: extra path information must be carried in every packet to support source routing, which runs counter to the industry trend toward leaner encapsulation formats, such as Broadcom's streamlined AIFH design in the SUE scheme (a toy encoding follows this list).
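A toy encoding shows both costs at once. The field names below are hypothetical; the paper does not fully specify the UB-Mesh header layout.

```python
from dataclasses import dataclass

@dataclass
class SrcRoutedPacket:
    hops: list      # the full remaining path, chosen entirely at the source
    payload: bytes

def relay(packet):
    """Each relay simply pops the next hop: no routing-table lookup, but
    the whole path rides in every packet (the encapsulation overhead)."""
    return packet.hops.pop(0) if packet.hops else None

pkt = SrcRoutedPacket(hops=["npu-03", "npu-11", "npu-45"], payload=b"chunk")
while (nxt := relay(pkt)) is not None:
    print("forward to", nxt)
```

The source must precompute and store `hops` for every destination it talks to, which is exactly the head-node state burden noted above.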

6. Cost-Effectiveness Assessment: Underestimated CapEx and OpEx

The paper seriously underestimates both the CapEx and the OpEx of the UB-Mesh scheme:

  • Capital Expenditure (CapEx): The closed, proprietary nature of the technology may lead to vendor lock-in, ultimately driving up procurement costs. The procurement premium of InfiniBand (IB) over Ethernet is clear evidence of this.
  • Operational Expenditure (OpEx): Complex topologies and routing mechanisms increase the difficulty of fault localization and performance tuning, significantly raising operational costs. The higher operational costs of IB relative to Ethernet exemplify this issue.

7. Network Reliability: Convergence Performance Degradation Caused by Source Routing

Distributed routing and forwarding mechanisms converge locally and quickly around failures, whereas UB-Mesh's source-routing mechanism requires fault information to travel back to the source nodes before path switching can even be triggered. With such slow routing convergence, a single-point failure may cause network-wide traffic fluctuations, threatening the continuity of training tasks.


As mentioned in the paper:

“In UB-Mesh, since each node has a deterministic set of communication targets, we can accelerate the routing convergence by directly notifying those nodes upon link failures.”

Leaving aside the dynamic All-to-All communication relationships of MoE (Mixture of Experts) expert parallelism, even for pre-planned, deterministic communication patterns such as All-Reduce, the fact that NPUs double as routers for data relaying poses a hard problem: when an NPU fails, or one of its direct links fails, quickly notifying all affected source-routing head nodes remains unsolved.
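The fan-out is easy to quantify once you write it down. A sketch of the notification problem (illustrative; not UB-Mesh's actual recovery protocol):

```python
def heads_to_notify(source_routes, failed_link):
    """Return every head node whose pinned path crosses `failed_link`.

    `source_routes` maps (src, dst) -> list of relay hops. With source
    routing, only the head node can reroute, so every one of these
    sources must learn about the failure before its traffic recovers.
    """
    a, b = failed_link
    affected = set()
    for (src, dst), relays in source_routes.items():
        hops = [src, *relays, dst]
        if any({u, v} == {a, b} for u, v in zip(hops, hops[1:])):
            affected.add(src)
    return affected

routes = {("A", "D"): ["B", "C"], ("E", "D"): ["C"], ("A", "F"): []}
print(heads_to_notify(routes, ("C", "D")))   # {'A', 'E'} must both reroute
```

With thousands of NPUs each pinning paths through shared relays, the set of heads to notify for a single link failure can be large, and until the last one hears about it, its traffic keeps hitting the dead link.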

8. Convenience of Deployment and Operation: How to Deploy a 4×4 Rack Matrix


As described in the paper:

“We connect four adjacent racks along two dimensions for constructing an inter-rack full-mesh because this is the optimal point considering the reach of active electrical cables. Since each rack has 64 NPUs and each pod has 16 racks, a 4D-FullMesh UB-Mesh-Pod contains 1024 NPUs in total.”

Given the 5–7-meter reach limit of 200G/400G active electrical cables (AEC), can maintenance access paths be preserved between cabinets in different columns when fully interconnecting a 4×4 cabinet matrix?
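A rough length budget illustrates the concern. All dimensions below are my assumptions (typical rack width, depth, and aisle figures), not measurements from the paper:

```python
def cable_run_m(racks_apart, pitch_m, slack_m=2.0):
    """Straight-line estimate of a cable run between racks in one line.

    `pitch_m` is the center-to-center spacing between adjacent racks in
    that direction; `slack_m` covers vertical drops and bend radius.
    Real routing follows overhead trays, so actual runs are longer.
    """
    return racks_apart * pitch_m + slack_m

# Same row: racks stand side by side (~0.6 m wide each)
print(cable_run_m(3, 0.6))          # 3.8 m: comfortably within AEC reach
# Same column: each hop crosses a rack depth (~1.2 m) plus an aisle (~1.2 m)
print(cable_run_m(3, 1.2 + 1.2))    # 9.2 m: beyond a 5-7 m AEC budget
```

Under these assumptions, keeping normal maintenance aisles pushes the longest same-column runs past the AEC reach, so either the aisles shrink or the cables change.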

9. Summary: Underlying Logic of Technical Choices

UB-Mesh's design, centered around reducing network costs, seemingly gains advantages in network hardware expenditure but introduces systemic challenges such as rigid topology, complex routing, and difficult maintenance.

For AI training networks, where computing power costs account for over 70% of total expenditure, trading GPU resources to save on network hardware costs is questionable in terms of cost-effectiveness. The core contradiction in technological evolution lies in balancing strategic direction and tactical optimization — a misstep in direction may require exponentially higher costs to rectify downstream.

As the paper itself states at the outset:

“Although UB-Mesh’s nD-FullMesh topology offers several theoretical advantages, its concrete architecture design, physical implementation, and networking system optimization present new challenges.”

In any case, technological innovation requires exploring multiple paths, especially when the mainstream technical route leaves limited room for innovation. Seeking novelty and differentiation through alternative technical approaches is common practice in academic research (including academic paper writing). The various attempts made along the way therefore still hold value.

10. Discussion

In reality, the diverse requirements and design objectives of the AI network described in the paper can be achieved within mainstream technical frameworks through simpler and more practical architectural solutions. Specifically, the Scale-out network and Scale-up network can be physically separated into two networks (although a unified technology stack should be adopted as much as possible):

  • Scale-out Network: Designed for building clusters of 100,000 or even 1,000,000 GPUs. Where a 128-card supernode is the basic building block, a two-layer CLOS topology with 128 rails built from high-radix 128×400G switches can support million-card cluster scales (see the arithmetic sketch after this list). For equivalent port-access capability, the higher the switch radix, the fewer CLOS layers are required and the fewer optical modules are needed. Since optical modules account for roughly two-thirds of network hardware costs, saving optical modules significantly reduces network hardware costs.
  • Scale-up Network: Focused on high-bandwidth communication domains within hundred-card-scale supernodes, it can be designed as a single-stage, multi-plane network to meet the hard requirements of ultra-high bandwidth and ultra-low latency. Additionally, for hundred-card Scale-up networks, active electrical cables (AEC) can fully replace active optical cables (AOC); compared to AOC, AEC cuts cost and power consumption by roughly 50% while improving reliability by several orders of magnitude.
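One way to read the million-card arithmetic (my reconstruction under assumed parameters; the construction is not spelled out above): attach GPU i of every 128-card supernode to rail i, and build each rail as an independent two-tier folded Clos of radix-128 switches.

```python
def two_tier_clos_endpoints(radix):
    """Max endpoints of a two-tier folded Clos of radix-k switches:
    each leaf splits its ports 50/50 down/up, giving k**2 / 2 endpoints."""
    return radix ** 2 // 2

radix = 128                    # 128 x 400G switch, as in the text
gpus_per_supernode = 128       # one rail per GPU position in the supernode
supernodes = two_tier_clos_endpoints(radix)     # 8192 endpoints per rail
print(supernodes * gpus_per_supernode)          # 8192 x 128 = 1,048,576
```

Two tiers of high-radix switches and rail-local traffic keep the optical-module count, and thus roughly two-thirds of the network hardware cost, in check.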


References:

  1. NVSwitch: https://developer.nvidia.com/blog/nvidia-nvlink-and-nvidia-nvswitch-supercharge-large-language-model-inference/?ncid=no-ncid
  2. DCell: https://www.sigcomm.org/sites/default/files/ccr/papers/2008/October/1402946-1402968.pdf
  3. BCube: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/comm136-guo.pdf
