Million-GPU AI Infrastructure Reignites the 'Network is Computer' Revolution
NOTE: This article is quoted from one of my blogs (https://mp.weixin.qq.com/s/me-H2q4yP7LOI8-5ASBj1w).
Recently, xAI, the AI company founded by Elon Musk, officially announced the launch of its next-generation AI supercomputer cluster, Colossus 2. This new system will greatly surpass the size and performance of its predecessor, Colossus 1, which is currently the world’s largest and most powerful supercomputer, equipped with 200,000 Nvidia H100/H200 GPUs. Colossus 2 is expected to feature an unprecedented 1 million GPUs and is estimated to cost between $25 billion and $30 billion. This major increase in computing power marks the official shift of the global AI infrastructure from the “100,000-GPU era” to the “million-GPU era.”
I. Exponential Growth in Computing Power Demand Driven by Algorithms and Data
The performance enhancement of large language models (LLMs) fundamentally relies on the synergy of algorithmic innovation, data expansion, and computing power upgrades, which together form the core triangle of technological evolution. These three elements create a positive feedback loop: larger models require increasingly complex multimodal data to support their generalization capabilities, while the fusion processing of multimodal data propels algorithmic evolution. The growth in model parameter scale, coupled with the exponential increase in multimodal data, drives a substantial surge in computing power demand, which in turn enables the evolution of still larger models and datasets.
Architectural innovations, epitomized by the Transformer model, have driven the annual growth of LLM parameter scales at a remarkable 10-fold rate, with mainstream industry models now surpassing the threshold of 10 trillion parameters. This growth in parameter scale has been a primary driver of the exponential increase in training computing power demand. Concurrently, developing generalization capabilities for large models imposes heightened requirements for cross-modal data association modeling, necessitating the integration of EB-scale (10¹⁸ bytes) multimodal corpora, which include text, images, videos, code, and other heterogeneous data.
Multimodality has emerged as the predominant direction for large models, whose training processes must handle text, images, and videos simultaneously. Compared with text-only large models, dataset scale has grown from the terabyte (TB) level to the exabyte (EB) level (10¹⁸ bytes), with single-batch data interactions reaching petabyte (PB) scale. The fusion processing of multimodal data has not only driven explosive growth in computing power demand but has also pushed the Transformer architecture to evolve for tasks such as cross-modal feature alignment and temporal dependency modeling, leading to innovations such as sparse attention and dynamic routing.
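To make the idea of sparse attention concrete, below is a minimal sketch (not drawn from any production model) of sliding-window attention in NumPy: each token attends only to keys within a local window rather than the full sequence, which avoids the quadratic cost that full self-attention incurs on long multimodal sequences. The function name, shapes, and window size are all illustrative.

```python
# Minimal sliding-window (local) sparse attention sketch in NumPy.
# Single head, unbatched, fixed window; real systems use fused GPU kernels
# and combine local windows with global or routed attention patterns.
import numpy as np

def local_sparse_attention(q, k, v, window: int):
    """Each query attends only to keys within +/- `window` positions."""
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (seq_len, seq_len) logits
    # Band mask: positions outside the local window are excluded from attention.
    idx = np.arange(seq_len)
    mask = np.abs(idx[:, None] - idx[None, :]) > window
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                  # (seq_len, d) outputs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((16, 8))
    k = rng.standard_normal((16, 8))
    v = rng.standard_normal((16, 8))
    print(local_sparse_attention(q, k, v, window=2).shape)  # (16, 8)
```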
According to the Chinchilla Scaling Laws, model parameters and training data must be expanded proportionally to achieve optimal performance during large model training. This principle drives the synchronous exponential growth of model parameter scales and dataset sizes, which in turn triggers explosive growth in computing power demand.
(Note: The Chinchilla Scaling Laws originated from DeepMind’s 2022 paper, “Training Compute-Optimal Large Language Models.” By training over 400 models with parameter scales ranging from 70 million to over 16 billion and data volumes between 5 billion and 500 billion tokens, the study reached the crucial conclusion that model parameters and training data should increase in tandem: for a fixed computing budget, parameter scale and data volume should be scaled by the same multiple (e.g., doubling parameters while doubling data) to achieve optimal performance. The underlying rationale is that parameter scale determines the upper limit of a model’s representational capability, while data volume dictates its ability to fit real-world distributions. Only by balancing these two elements can computing resource utilization be maximized, which established this law as the “golden benchmark” of the large model era.)
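As a rough illustration of how a Chinchilla-style recipe turns a compute budget into a parameter/token split, the sketch below uses the commonly cited approximation of training compute, C ≈ 6 · N · D, together with an assumed tokens-per-parameter ratio. The ratio and all printed numbers are indicative only, not figures from the paper.

```python
# Minimal sketch of Chinchilla-style compute-optimal allocation.
# Assumes the common approximation C ≈ 6 * N * D (training FLOPs ≈ 6 ×
# parameters × tokens) and the headline result that N and D should grow
# in roughly equal proportion, i.e. D/N stays near a constant ratio
# (~20 tokens per parameter is the figure usually quoted for Chinchilla).
import math

TOKENS_PER_PARAM = 20  # illustrative ratio; the exact value depends on the fit

def compute_optimal_split(flops_budget: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend the budget with D = r * N."""
    # C = 6 * N * D = 6 * r * N^2  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(flops_budget / (6 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    for budget in (1e21, 1e23, 1e25):  # FLOPs budgets spanning four orders of magnitude
        n, d = compute_optimal_split(budget)
        print(f"C={budget:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```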
II. The Scaling Law Meets the “Network is Computer” Concept: Reshaping the Paradigm of AI Infrastructure
Colossus 2 is poised to surpass mainstream industry AI training clusters, which typically operate with fewer than 500,000 GPUs, thereby strongly supporting the efficient training of ultra-large LLMs with parameter scales reaching 10 trillion or even 100 trillion (e.g., xAI’s self-developed Grok series).
(Note: The Scaling Law elucidates the relationship between model performance and scale. In 2020, OpenAI’s “Scaling Laws for Neural Language Models” illuminated the power-law relationship among model performance, parameter scale, and data volume through mathematical modeling and empirical research, thereby providing a theoretical foundation for large model development. Consequently, OpenAI’s substantial investments in AI infrastructure ultimately culminated in the emergence of ChatGPT, which has become a benchmark within the industry and has driven significant breakthroughs in generative AI technology.)
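For reference, the power laws reported in that paper take approximately the following functional form, where N is the parameter count, D the dataset size, and C_min the optimally allocated compute; the constants N_c, D_c, C_c and the exponents are empirical fits from the paper and are not restated here.

```latex
% Approximate functional form of the power laws in
% "Scaling Laws for Neural Language Models" (2020);
% N_c, D_c, C_c and the exponents are empirical fits from the paper.
\[
\begin{aligned}
L(N)        &\approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}
            &&\text{(loss vs. parameters, data not limiting)}\\
L(D)        &\approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}
            &&\text{(loss vs. dataset size, parameters not limiting)}\\
L(C_{\min}) &\approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C}
            &&\text{(loss vs. optimally allocated compute)}
\end{aligned}
\]
```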
Leveraging NVIDIA’s Spectrum-X Ethernet networking platform, xAI achieved industry-leading performance in its 200,000-GPU Colossus 1 cluster, exhibiting sub-10-microsecond end-to-end communication latency and over 95% bandwidth utilization. This performance validates the critical role of AI networks in supporting computing clusters. As Colossus 2 progresses toward the million-GPU scale, AI networks will encounter unprecedented challenges in fulfilling higher bandwidth, lower latency, and extreme reliability requirements. The performance of these networks will directly influence the efficiency of ultra-large-scale AI cluster deployment. Today, AI networks have become a core element of AI infrastructure, equally significant as computing power, with their synergy forming a fundamental foundation for AI development.
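To see why bandwidth utilization matters so much at this scale, consider a back-of-envelope estimate of gradient synchronization time using the standard ring all-reduce bound. The model size, precision, NIC speed, and group size in the sketch below are illustrative assumptions, not Colossus specifications.

```python
# Back-of-envelope estimate of per-step gradient all-reduce time using the
# standard ring all-reduce bound: time ≈ 2 * (p - 1) / p * (S / B), where S is
# the gradient payload per GPU and B the effective per-GPU network bandwidth.
# All concrete numbers below are illustrative assumptions.

def ring_allreduce_seconds(payload_bytes: float, bandwidth_bytes_per_s: float,
                           num_gpus: int, utilization: float) -> float:
    effective_bw = bandwidth_bytes_per_s * utilization
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes / effective_bw

if __name__ == "__main__":
    grad_bytes = 2 * 70e9          # e.g. a 70B-parameter model with bf16 gradients
    nic_bw = 400e9 / 8             # a 400 Gb/s NIC expressed in bytes/s
    for util in (0.6, 0.95):
        t = ring_allreduce_seconds(grad_bytes, nic_bw, num_gpus=1024, utilization=util)
        print(f"utilization={util:.0%}: ~{t:.2f} s per all-reduce")
```

Even a modest drop in effective utilization adds seconds to every synchronization, and multiplied over millions of training steps that translates directly into lost GPU hours.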
(Note: The “Network is Computer” concept traces back to the slogan “The Network is the Computer,” coined by John Gage of Sun Microsystems in 1984. At the time, Sun’s scientific workstations were constrained by single-machine computing power and had to collaborate over networks. Gage’s insight was that a network could interconnect distributed devices into an organic whole, allowing them to work on complex tasks together as if they were one giant computer. The slogan became iconic for Sun, reflecting the company’s forward-looking vision of network computing, and it went on to drive advances in distributed computing and network technologies.)
III. AI Networks for Million-GPU Clusters: The Perfect Embodiment of “Network is Computer”
AI networks for million-GPU clusters are not merely enlarged versions of those used in 10,000-GPU clusters. Instead, they require systematic design tailored to the typical traffic characteristics and communication needs of models with 10 trillion or even 100 trillion parameters, such as the high-frequency communication demands of cross-node Mixture of Experts (MoE) model-parallel training. By deeply integrating the efficient interconnection capabilities of Scale-Up (vertical expansion) networks within supernodes with the large-scale networking capabilities of Scale-Out (horizontal expansion) networks between supernodes, a new generation of AI networks for million-GPU clusters can be built.
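To give a sense of the cross-node pressure that MoE expert parallelism places on the Scale-Up and Scale-Out fabrics, the sketch below estimates the per-GPU all-to-all traffic of a single MoE layer (token dispatch plus result combine). All parameter values are illustrative assumptions, not figures from any particular model.

```python
# Rough estimate of per-layer, per-step all-to-all traffic generated by MoE
# expert parallelism: each token's hidden state is routed to its top-k experts
# and the expert outputs are routed back. All parameters below are illustrative.

def moe_all_to_all_bytes(tokens_per_gpu: int, hidden_dim: int, top_k: int,
                         bytes_per_elem: int = 2) -> int:
    """Bytes sent per GPU per MoE layer (dispatch + combine), worst case."""
    dispatch = tokens_per_gpu * top_k * hidden_dim * bytes_per_elem
    return 2 * dispatch  # the combine step returns roughly the same volume

if __name__ == "__main__":
    per_layer = moe_all_to_all_bytes(tokens_per_gpu=8192, hidden_dim=8192, top_k=2)
    print(f"~{per_layer / 1e9:.1f} GB per GPU per MoE layer per step")
```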
Reference Architecture: China Mobile Cloud’s Open and Disaggregated Supernode.
Given the stringent requirements of million-GPU clusters for computing power density and heat dissipation efficiency, liquid cooling has become standard for AI infrastructure. The open and disaggregated supernode adopted here is liquid-cooled (see: China Mobile Cloud’s AI Network Architecture Leading the New Paradigm of AI Infrastructure Development).
Each segment accommodates 128 liquid-cooled supernodes, and the cluster comprises 64 such segments. On this basis, a next-generation AI network at million-GPU scale is constructed using a 128-rail, two-tier Clos network.
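As a sanity check of the scale implied by this topology, and assuming purely for illustration (the figure is not stated above) 128 GPUs per supernode, i.e. one GPU per rail plane, the arithmetic works out to roughly one million GPUs:

```python
# Sanity check of the cluster scale implied by the topology above, assuming
# (as an illustration) 128 GPUs per liquid-cooled supernode, one per rail plane.
SUPERNODES_PER_SEGMENT = 128
SEGMENTS = 64
GPUS_PER_SUPERNODE = 128   # assumption: one GPU per rail in the 128-rail design

total_gpus = SUPERNODES_PER_SEGMENT * SEGMENTS * GPUS_PER_SUPERNODE
print(f"{total_gpus:,} GPUs")   # 1,048,576 -> roughly the "million-GPU" scale
```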
Moreover, leveraging China Mobile Cloud’s self-developed Fully Adaptive Routing Ethernet (FARE) protocol (https://datatracker.ietf.org/meeting/123/materials/slides-123-idr-sessa-1-7-fully-adaptive-routing-ethernet-using-bgp-02.pdf), the network monitors topology and bandwidth changes in real time and dynamically applies packet spraying based on weighted equal-cost multi-path (ECMP) forwarding for efficient global load balancing, keeping network bandwidth utilization stable above 95%.
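The snippet below is a toy illustration of the weighted packet spraying idea, not the FARE implementation (whose details are in the linked IETF material): packets are sprayed across all equal-cost paths in proportion to the bandwidth that telemetry reports as available on each path, so no single path saturates while others sit idle.

```python
# Toy illustration of weighted packet spraying across equal-cost paths.
# NOT the FARE implementation; just a sketch of the load-balancing idea:
# packets are distributed over candidate paths in proportion to the
# available bandwidth reported by telemetry.
import random

def spray_packets(num_packets: int, path_bandwidth_gbps: dict[str, float],
                  seed: int = 0) -> dict[str, int]:
    rng = random.Random(seed)
    paths = list(path_bandwidth_gbps)
    weights = [path_bandwidth_gbps[p] for p in paths]   # weight ∝ free bandwidth
    counts = {p: 0 for p in paths}
    for _ in range(num_packets):
        counts[rng.choices(paths, weights=weights)[0]] += 1
    return counts

if __name__ == "__main__":
    # Four equal-cost paths whose available bandwidth (per telemetry) differs.
    print(spray_packets(100_000, {"p0": 400, "p1": 400, "p2": 200, "p3": 100}))
```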
IV. Conclusion
As the global AI infrastructure transitions from the “100,000-GPU era” to the “million-GPU era,” AI networks emerge as the core technological engine for ultra-large-scale computing power deployment, propelling the implementation of the “Network is Computer” concept. The profound synergy between computing power and networks is reshaping the underlying architecture of AI infrastructure, heralding a new development paradigm for AI infrastructure.