Personal Perspectives on UB-Mesh, Huawei's AI Network Architecture
NOTE: This article is quoted from one of my blogs (https://mp.weixin.qq.com/s/SAXnLt7VwXneeV2VHriWug).
Recently, many industry peers have asked for my views on Huawei's innovative AI network architecture, UB-Mesh. To share ideas with colleagues, I've organized my insights as follows.
1、Design Philosophy: Prioritizing Network Cost Reduction
From the hardware cost structure of AI infrastructure, GPU expenditure dominates overall investment, while network equipment typically accounts for less than 15%. This cost composition dictates that the core task of designing an AI network is to build high-bandwidth communication links for GPUs, avoiding idle computing resources caused by data exchange delays. For example, DeepSeek achieves "overlapping communication with computation" through software-hardware co-design and engineering optimizations, fundamentally aiming to maximize GPU utilization and prevent computation resource waste.
Unlike mainstream industry solutions, UB-Mesh focuses heavily on reducing network costs: it minimizes the number of switches and optical modules, and uses low-cost switches with lower switching capacity, specifically low-radix switches (LRS). Although not explicitly stated in the paper, this goal requires compromises in other design objectives.
2、Topology: Adoption of nD-FullMesh Over CLOS
The industry predominantly employs CLOS or Fat-Tree topologies to construct scalable Scale-out and even Scale-up networks. CLOS architecture is the de facto standard for Scale-out networks, requiring no further elaboration. The evolution of NVIDIA's Scale-up technology is more instructive: NVLink began as point-to-point interconnects (e.g., the hybrid cube-mesh of DGX-1), but from DGX-2 onward NVIDIA moved to NVSwitch-based fabrics, precisely to escape the rigidity of fixed point-to-point bandwidth.
The primary advantages of switching-chip-based interconnection over P2P full-mesh interconnection include:
- bandwidth between any pair of nodes is allocated on demand through the switch fabric, rather than being fixed and equal on every point-to-point link;
- correspondingly better adaptability to dynamic and complex traffic patterns.
For this reason, mainstream industry Scale-up network solutions for super-PoDs have abandoned the point-to-point full-mesh connection architecture inside compute trays and instead adopted switch-chip-based interconnection (note that Huawei's CloudMatrix384 also uses switch chips for Scale-up interconnect).
As admitted in the paper,
“Traditional datacenters usually adopt the Clos architecture, which employs two or three tiers of switches to symmetrically connect NPUs/CPUs, offering high performance and adaptability to various traffic patterns.”
However, to reduce network costs, UB-Mesh retains traditional P2P full-mesh interconnects within single boards (compute trays) and extends this architecture from 1-D (intra-board) to 3-D (inter-rack) full connectivity.
This design consequently introduces a key challenge: poor adaptability to dynamic and complex traffic patterns, which necessitates pre-defining traffic access relationships.
Regarding this, the paper admits:
“Although UB-Mesh’s hardware architecture design matches the traffic patterns of LLM training, running workloads on such a hierarchical AI cluster may suffer from low utilization if training tasks are not effectively distributed across computing resources.”
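The nesting just described (1-D intra-board, 2-D intra-rack, 3-D inter-rack) determines how many direct links each NPU must carry. A minimal sketch, with group sizes that are illustrative assumptions rather than the paper's exact figures:

```python
def fullmesh_ports_per_node(group_sizes):
    """In a nested nD-FullMesh, dimension d is a full mesh over
    group_sizes[d] members, so each node carries (size - 1)
    direct links for that dimension."""
    return sum(size - 1 for size in group_sizes)

# Illustrative (assumed) group sizes: 8 NPUs per board (1-D),
# 8 boards per rack (2-D), 4 racks meshed per dimension (3-D).
print(fullmesh_ports_per_node([8, 8, 4]))  # 7 + 7 + 3 = 17 links per NPU
```

Every one of those links has fixed bandwidth toward one fixed peer, which is exactly the rigidity a switch fabric avoids.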
3、Cost Reduction: Trading GPU Resources for Switch Savings
UB-Mesh reduces switch and optical module usage through nested full-mesh topologies (1-D intra-board, 2-D intra-rack, 3-D inter-rack), optimizing hardware costs by consuming GPU resources instead of switches.
As mentioned in the paper:
“Since each NPU is also a router, and the entire system contains a large number of NPUs, to conserve NPU hardware resources, the routing system must efficiently handle routing-table lookups and forwarding operations.”
Of course, this strategy also aims to address the inherent flaw of traditional point-to-point full-mesh topologies: the fixed and equal bandwidth allocation across all nodes, which inherently results in poor adaptability to dynamic and complex traffic patterns.
This approach is highly similar to data center network (DCN) architectures such as DCell and BCube, proposed by Microsoft Research Asia in earlier years. For example, BCube replaces some or all of the switches by leveraging server CPUs to relay data. Incidentally, BCube also employs the source-routing concept used in UB-Mesh for data forwarding. In contrast, UB-Mesh uses far more expensive NPU resources, rather than relatively inexpensive switches, for data relaying.
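The underlying arithmetic of this tradeoff is worth making explicit: a full mesh needs a dedicated link per node pair, so its link count grows quadratically, while a switching chip needs only one uplink per node. A quick sketch:

```python
def p2p_fullmesh_links(n):
    # a full mesh dedicates one point-to-point link to every node pair
    return n * (n - 1) // 2

def switch_uplinks(n):
    # a switching chip needs only one uplink per attached node
    return n

for n in (8, 16, 64):
    print(n, p2p_fullmesh_links(n), switch_uplinks(n))
```

The O(n^2) growth is why full-mesh groups must stay small and be nested hierarchically, with relay duty falling on the NPUs themselves at group boundaries.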
4、Routing Strategy: Reliance on Non-Shortest Path Algorithms
In nD Full-mesh topologies, multiple paths exist between nodes. UB-Mesh proposes using both shortest paths and detour paths to maximize bandwidth utilization.
As pointed out in the paper, “In the nD-FullMesh topology, between two endpoints, there are several possible paths with diverse distances. The system should enable the use of non-shortest paths to maximize bandwidth utilization across the network.”
This multi-path forwarding strategy incurs the following costs:
- detour paths add extra hops, increasing latency and consuming the forwarding resources of relaying NPUs (each of which also serves as a router);
- packets traversing paths of different lengths can arrive out of order, requiring reordering at the receiver;
- path selection and load balancing become considerably more complex than shortest-path-only routing.
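The path multiplicity itself is easy to sketch. Within a single full-mesh group, the direct link is the unique shortest path, and every other group member offers a 2-hop detour; the group size below is an illustrative assumption:

```python
def fullmesh_paths(nodes, src, dst):
    """Enumerate src->dst paths of at most 2 hops in one full-mesh group:
    the direct link is the unique shortest path; every other member
    offers a 2-hop detour that borrows that member's link bandwidth
    and forwarding resources."""
    detours = [[src, mid, dst] for mid in nodes if mid not in (src, dst)]
    return [[src, dst]] + detours

group = list(range(8))          # one 8-NPU full-mesh group (assumed size)
paths = fullmesh_paths(group, 0, 1)
print(len(paths))               # 1 direct path + 6 detours = 7
```

Each detour doubles the hop count of the transfer and ties up a third NPU's links, which is the bandwidth-versus-resources tension the paper is navigating.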
5、Forwarding Mechanism: Reliance on Source Routing
To fully leverage the multi-path diversity with varying lengths, UB-Mesh has to adopt the source routing mechanism.
As mentioned in the paper, “To fully utilize the paths provided by APR, one practical way is leveraging the Source Routing (SR) mechanism.”
Compared with load-balancing schemes like ECMP (Equal-Cost Multi-Path) or WECMP (Weighted ECMP), the source-routing scheme is more complex:
- the source node must maintain or obtain path state for the whole network, instead of each hop making an independent local decision;
- packet headers must carry the full hop list, adding per-packet overhead;
- upon failures, every affected source node must be notified and must recompute its paths before traffic can recover.
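The contrast in where the decision is made can be sketched as follows; this is a toy illustration under assumed node names, not either system's actual wire format:

```python
import hashlib

def ecmp_next_hop(flow_tuple, next_hops):
    # ECMP: each hop independently hashes flow fields onto its
    # equal-cost next hops; no node needs global path state.
    h = int(hashlib.md5(repr(flow_tuple).encode()).hexdigest(), 16)
    return next_hops[h % len(next_hops)]

def build_sr_header(path):
    # Source routing: the head node pre-computes the entire hop list
    # and embeds it in the packet header.
    return list(path)

def sr_forward(header):
    # A transit node simply pops its instruction; the price is that
    # the source must know and maintain the whole path in advance.
    return header.pop(0)

hdr = build_sr_header(["npu3", "npu7", "npu12"])
print(sr_forward(hdr), hdr)
```

The forwarding step is trivial; the complexity is pushed entirely onto the source node's path computation and state maintenance.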
6、Cost-Effectiveness Assessment: Underestimated CapEx and OpEx
The CapEx and OpEx of the UB-Mesh scheme are seriously underestimated:
- CapEx: the NPU die area, ports, and forwarding logic consumed by routing duties are far more expensive than the low-radix switches they replace, and the dense point-to-point mesh still requires a large amount of cabling;
- OpEx: dense inter-rack cabling complicates deployment, maintenance access, and fault localization, while slow failure convergence raises the cost of keeping long-running training jobs stable.
7、Network Reliability: Convergence Performance Degradation Caused by Source Routing
Distributed routing forwarding mechanisms enable fast local convergence for fault switching, whereas UB-Mesh’s source routing mechanism requires fault information to be transmitted to source nodes before path switching can be triggered. With slow routing convergence, single-point failures may cause network-wide traffic fluctuations, affecting the continuity of training tasks.
As mentioned in the paper:
“In UB-Mesh, since each node has a deterministic set of communication targets, we can accelerate the routing convergence by directly notifying those nodes upon link failures.”
Leaving aside the dynamic nature of All-to-All communication relationships during MoE (Mixture of Experts) expert parallel communication, even for pre-planned deterministic communication patterns like All-Reduce, the fact that NPUs simultaneously serve as routers for data relaying poses a challenging problem: when an NPU fails or its direct link fails, how to quickly notify all affected source routing head nodes of the fault remains an unsolved issue.
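The bookkeeping such a notification scheme would require can be sketched as follows. This is an assumed data structure for illustration, not the paper's protocol: index which installed source-routed paths traverse each link, so a failure can be pushed to every affected head node:

```python
from collections import defaultdict

class SRFaultNotifier:
    """Sketch (assumed bookkeeping, not the paper's mechanism): map each
    link to the set of head nodes whose pre-computed source routes
    traverse it, so a failure can be reported to exactly those nodes."""
    def __init__(self):
        self.link_to_sources = defaultdict(set)

    def install_path(self, src, path):
        for a, b in zip(path, path[1:]):
            self.link_to_sources[frozenset((a, b))].add(src)

    def on_link_failure(self, a, b):
        # every head node returned here must recompute its paths
        # before its traffic can recover
        return sorted(self.link_to_sources[frozenset((a, b))])

n = SRFaultNotifier()
n.install_path("npu0", ["npu0", "npu5", "npu9"])
n.install_path("npu2", ["npu2", "npu5", "npu9"])
print(n.on_link_failure("npu5", "npu9"))  # ['npu0', 'npu2']
```

Even this toy version shows the difficulty: the index must stay consistent across the whole pod as paths churn, and the failed relay NPU itself may be the node expected to send the notifications.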
8、Convenience of Deployment and Operation: How to Deploy a 4×4 Rack Matrix
As described in the paper:
“We connect four adjacent racks along two dimensions for constructing an inter-rack full-mesh because this is the optimal point considering the reach of active electrical cables. Since each rack has 64 NPUs and each pod has 16 racks, a 4D-FullMesh UB-Mesh-Pod contains 1024 NPUs in total.”
Given the 5–7-meter transmission distance limit of 200G/400G active copper cables (AEC), can maintenance access paths be preserved between cabinets in different columns when implementing full interconnection in a 4×4 cabinet matrix?
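A back-of-envelope check makes the concern concrete. The rack width, rack depth, and aisle width below are assumed typical data-center values, not figures from the paper:

```python
# Back-of-envelope check of AEC reach in a 4x4 rack matrix.
# Rack width (0.6 m), rack depth (1.2 m), and aisle width (1.2 m)
# are assumed typical data-center values, not figures from the paper.
RACK_W, RACK_D, AISLE = 0.6, 1.2, 1.2

def worst_case_cable_run(cols=4, rows=4):
    # Manhattan routing from corner rack to corner rack: (cols - 1)
    # rack widths across a row, plus (rows - 1) rack-depth-plus-aisle
    # pitches between rows (vertical in-rack runs ignored).
    return (cols - 1) * RACK_W + (rows - 1) * (RACK_D + AISLE)

print(worst_case_cable_run())   # 1.8 + 7.2 = 9.0 m, beyond a 5-7 m AEC budget
```

Under these assumptions the corner-to-corner run already exceeds the AEC reach; eliminating the aisles recovers the budget, but precisely at the cost of the maintenance access this section questions.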
9、Summary: Underlying Logic of Technical Choices
UB-Mesh's design, centered around reducing network costs, seemingly gains advantages in network hardware expenditure but introduces systemic challenges such as rigid topology, complex routing, and difficult maintenance.
For AI training networks, where computing power costs account for over 70% of total expenditure, trading GPU resources to save on network hardware costs is questionable in terms of cost-effectiveness. The core contradiction in technological evolution lies in balancing strategic direction and tactical optimization — a misstep in direction may require exponentially higher costs to rectify downstream.
Just as stated at the beginning of the paper,
“Although UB-Mesh’s nD-FullMesh topology offers several theoretical advantages, its concrete architecture design, physical implementation, and networking system optimization present new challenges.”
In any case, technological innovation requires exploring various paths, especially when mainstream technical routes offer limited room for innovation. Seeking novelty and differentiation through alternative technical approaches is indeed common practice in academic research (including academic paper writing). Therefore, the various attempts made during this exploration still hold value.
10、Discussion:
In reality, the diverse requirements and design objectives of the AI network described in the paper can be achieved within mainstream technical frameworks through simpler and more practical architectural solutions. Specifically, the Scale-out network and the Scale-up network can be physically separated into two networks, although a unified technology stack should be adopted as much as possible.