Evolution of SoCs, chiplets and interconnects
System on a chip
For many years chips have been in the limelight, more so than the interconnects between them that make up a traditional SoC. Semiconductor design has emphasized the road to more capable and faster chips, yet the finished products would not have been possible without advances in manufacturing processes, packaging, materials science, and a myriad of tools and equipment.
An SoC, which by definition is a single chip, consists of multiple units: generally a processing unit, co-processors, accelerators, memory controllers, and communication and I/O units. It is common in today's mobile phones and tablets, in contrast to desktop PCs or servers where some units can be physically replaced or upgraded. An SoC is inherently an integration of functions onto a single substrate or package, and though the individual functions are generally not replaceable, the integration makes devices smaller and less power hungry. SoCs are not limited to phones, however; AI-driven chip designs rely on SoC concepts to create new generations of chips.
The SoC idea can also be loosely extended to multi-chip designs. While the C in SoC refers to a chip, a single piece of silicon or monolithic die, it is also possible to stitch multiple dies into a single package and achieve the same result. This is called a SiP, or system in package. Instead of the SoC coming from a single manufacturer or fab, the dies can be sourced from different vendors and packaged together at the end.
Making an SoC is also running up against Moore's law, which in simple terms calls for the number of transistors to increase with time, and which is getting harder to sustain. Two elements are in play here: newer architectures that improve processing power and compute functions, and newer processes that make chips smaller and more power efficient. Improved process nodes let us squeeze a greater number of transistors into the same area, while improved architectures put more functions (and in turn more transistors) into the chip. The two inherently drive each other: improved architecture adds more functions and logic, which drives the requirement to improve process technology.
The approach of modularizing seems a reasonable way forward for multiple reasons. Each die keeps growing in size even though improved process nodes squeeze more transistors into the same area than previous generations, and new designs are running up against the reticle limit, which is about 850 mm². Yield becomes a concern above roughly 550 mm². Cost and IP reuse are further reasons why on-package integration of multiple dies is preferable from a cost and project standpoint.
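As a rough illustration of why large dies yield poorly, here is a minimal sketch using a simple Poisson defect model; the defect density and die areas are assumptions chosen for the example, not foundry data.

```python
import math

def die_yield(area_mm2: float, d0_per_cm2: float = 0.1) -> float:
    """Poisson yield model: P(die has zero defects) = exp(-D0 * A).
    D0 = 0.1 defects/cm^2 is an illustrative assumption, not a real foundry figure."""
    return math.exp(-d0_per_cm2 * area_mm2 / 100.0)

# Compare one large near-reticle-limit die with a smaller chiplet on the same process.
print(f"800 mm^2 monolithic die yield: {die_yield(800):.2f}")  # ~0.45
print(f"200 mm^2 chiplet yield:        {die_yield(200):.2f}")  # ~0.82
```

Splitting the same functionality across smaller dies means each die has a much better chance of being defect free, which is one of the economic arguments for chiplets.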
Chiplets
This brings us to chiplets. Consider them as "IP blocks" that are integrated together with a particular packaging technology and connected by interconnects, or chiplet I/O. Chiplets also bring a novel way to build SiPs: they allow heterogeneous integration, meaning each die can implement a different function, versus a homogeneous split where each die has the same function. The heterogeneous split can also be exploited to improve results and efficiency; for example, a compute die built in process n can be packaged with I/O or memory dies built in an older process (n-1) or (n-2). This reduces overall cost from an opex and capex perspective, since not all functions have to move to the latest process at the same time, and it allows alternate "SKUs" of a product to be built quickly to target different market demands without incurring higher project cost and time.
An example of such a package would be a CPU on a 7 nm process, a GPU on 10 nm, and DRAM on 14 nm. It is not entirely clear where the chiplet idea originated; some references point to a professor at Berkeley, but it was AMD that brought the approach mainstream with its Zen-based CPUs starting in 2017, and conceptually Gordon Moore anticipated modular functions in his seminal 1965 paper.
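To illustrate why this mixing of process nodes helps, here is a back-of-the-envelope cost sketch; the per-mm² prices, die areas and packaging overhead below are invented placeholders for the example, not real foundry or packaging numbers.

```python
# Hypothetical $/mm^2 wafer cost by node; newer nodes are assumed more expensive.
cost_per_mm2 = {"n": 0.30, "n-1": 0.18, "n-2": 0.12}

def monolithic_cost(total_area_mm2: float) -> float:
    """Everything fabbed on the leading node n."""
    return total_area_mm2 * cost_per_mm2["n"]

def chiplet_cost(compute_mm2: float, io_mm2: float, mem_ctrl_mm2: float,
                 packaging_overhead: float = 1.1) -> float:
    """Only compute moves to node n; I/O and memory controller stay on older nodes.
    The 10% packaging overhead stands in for substrate and assembly cost."""
    silicon = (compute_mm2 * cost_per_mm2["n"]
               + io_mm2 * cost_per_mm2["n-1"]
               + mem_ctrl_mm2 * cost_per_mm2["n-2"])
    return silicon * packaging_overhead

print(monolithic_cost(600))          # 180.0
print(chiplet_cost(300, 200, 100))   # (90 + 36 + 12) * 1.1 = 151.8
```

Even with a packaging penalty, keeping the I/O and memory controller on older, cheaper nodes can bring the total silicon cost down in this toy model.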
Monolithic dies do have certain advantages and will continue to flourish as long as current process nodes allow the most efficient designs. A brief summary comparing the two is given below:
Electrical Interconnects
Connecting the chiplets: interconnects fall into two categories, serial or parallel. Serial interconnects use a lower number of lanes or wires from one chip to another; parallel interconnects use a high number of parallel lanes at lower frequency. Serial is relatively cost effective, since a lower number of pins per chip can be routed across a standard substrate, and no advanced packaging technology is required to connect two chiplets together. A serdes (which stands for serializer/deserializer) is a popular and common way of handling input-output, as the interconnect can be a thinner pipe; think of a typical water tank whose pipes are usually much smaller in diameter than the tank itself.
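As a toy illustration of what a serializer/deserializer does, the sketch below shifts a parallel word out one bit at a time and reassembles it at the other end; a real serdes adds line coding, equalization and clock recovery on top of this.

```python
def serialize(word: int, width: int = 8) -> list[int]:
    """Shift an 8-bit parallel word out one bit at a time, MSB first."""
    return [(word >> (width - 1 - i)) & 1 for i in range(width)]

def deserialize(bits: list[int]) -> int:
    """Reassemble the serial bit stream back into a parallel word."""
    word = 0
    for b in bits:
        word = (word << 1) | b
    return word

data = 0xA7
stream = serialize(data)          # [1, 0, 1, 0, 0, 1, 1, 1] sent over a single wire pair
assert deserialize(stream) == data
```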
The kind of serdes to use is reach specific: Long Reach (LR), Medium Reach (MR), Very Short Reach (VSR), Extra Short Reach (XSR), and Ultra Short Reach (USR) variants exist. Which one to use depends on a variety of factors; each has a different loss profile and power budget, and constrains how far apart the chiplets can be placed. A rough ballpark is given in this table:
As data gets converted from parallel to serial and vice versa in a serdes, there is some penalty (power, latency) associated with pushing data through a narrow pipe. Techniques like FEC (forward error correction) improve efficiency to some extent, and upper layers can run CRCs (cyclic redundancy checks) to know whether data was received reliably. Ethernet PHY solutions use 56G or 112G serdes interfaces, with 224G on the horizon. The higher serdes speeds needed for higher Ethernet rates (like 400G) cannot be met without newer signalling methods such as PAM4 (versus traditional NRZ), and perhaps later PAM8. To avoid clock/data skew the clock is embedded in the data stream and recovered by a CDR block at the receiver. This requires some form of encoding/decoding, which started at 8b/10b and moved to 128b/130b for PCIe 3.0 (USB 3.1 uses a similar 128b/132b scheme). Serdes will remain a cost-effective solution for moving large amounts of data point-to-point within systems.
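A quick way to see the line-coding overhead is to compute the usable data rate it leaves behind; the sketch below uses the familiar 5 GT/s 8b/10b and 8 GT/s 128b/130b (PCIe 3.0) combinations as examples.

```python
def effective_throughput_gbps(line_rate_gbps: float,
                              payload_bits: int,
                              coded_bits: int) -> float:
    """Usable data rate after line-coding overhead."""
    return line_rate_gbps * payload_bits / coded_bits

# 8b/10b carries 8 payload bits in 10 line bits (25% overhead);
# 128b/130b carries 128 in 130 (~1.5% overhead).
print(effective_throughput_gbps(5.0, 8, 10))      # 4.0 Gbps on a 5 GT/s 8b/10b lane
print(effective_throughput_gbps(8.0, 128, 130))   # ~7.88 Gbps on an 8 GT/s PCIe 3.0 lane
```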
Issues with skew at higher frequencies, as well as latency, are reduced by using a parallel interface between chiplets. Consider 4 lanes of 56G serdes achieving 224G of throughput; the same throughput on a parallel interface can be achieved at 1 GHz with roughly 200 pins. Parallel interfaces are wider with slower per-pin rates, but they require a higher pin count to get data off (or into) the die. High Bandwidth Interconnect (HBI), Advanced Interface Bus (AIB), Bunch of Wires (BoW), and more recently Universal Chiplet Interconnect Express (UCIe) are some standards that define parallel interfaces.
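The trade-off can be put into a small calculation: the same 224G of aggregate bandwidth needs either a few fast serial lanes or many slower parallel pins. The exact parallel pin count depends on clock rate and whether double-data-rate signalling is used, which is why the figure above is quoted as roughly 200.

```python
import math

def serial_lanes(target_gbps: float, lane_rate_gbps: float) -> int:
    """Serdes lanes needed to hit a target aggregate rate."""
    return math.ceil(target_gbps / lane_rate_gbps)

def parallel_pins(target_gbps: float, clock_ghz: float) -> int:
    """Data pins needed for the same rate on a wide single-data-rate bus
    (clock, control and ground/return pins come on top of this count)."""
    return math.ceil(target_gbps / clock_ghz)

print(serial_lanes(224, 56))    # 4 lanes of 56G serdes
print(parallel_pins(224, 1.0))  # 224 data pins at 1 GHz, in line with the ~200 quoted above
```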
Packaging
One of the issues with parallel interfaces is packaging the larger number of pins or wires coming off the chip; these dense interconnects are designed to traverse silicon rather than a standard organic substrate. Packaging technologies are critical to enabling this pin density, and foundries have devised various methods to solve the problem. Intel's EMIB (Embedded Multi-die Interconnect Bridge) uses a small bridge die embedded in the substrate. TSMC's CoWoS (Chip on Wafer on Substrate) uses a silicon interposer and has been popular for integrating chiplets with HBM. Another approach uses an RDL (redistribution layer) interposer instead of a silicon interposer, which makes it more cost effective. Each of these processes (collectively called advanced 2.5D packaging) lets foundries place dies at very close range and provides a wide pipe to move data between them. Stacking dies vertically, with hybrid bonding of die pads and vias, is the 3D flavour of advanced packaging.
Which one to use is application dependent: AI use cases pair GPUs with HBM, which forces them onto advanced packaging, while traditional I/O devices will likely continue to use serdes.
Optical Interconnects
Chiplet interconnects are, however, not limited to being electrical. One option is to transport data over fiber, which requires the electrical signals to be converted to optical signals. Ethernet, InfiniBand and Fibre Channel already use optics for the longer reaches of the data center. There is a cost to converting from electrical to optical and back, but in disaggregated architectures electrical interconnects become challenging as distance increases. In some scenarios routing decisions can even be made in the optical domain, saving the cost of electro-optical conversions.
In the optical domain, the module consists of a photonic integrated circuit (PIC) with an electrical interface adjacent to the ASIC. Based on where this optical engine is located, the options are pluggable optics, Near Packaged Optics (NPO), and Co-Packaged Optics (CPO). Each has its benefits and drawbacks, with pluggables (e.g. QSFP) being common in data centers and AI/ML clusters.
While co-packaged optics (CPO) shows benefits in CMOS integration and lower power compared to pluggable optics, it has its challenges, one being that manufacturing and fabrication have to adapt in both process and packaging. Stability, maturity and reliability of the optical components, thermals, the choice of single-mode versus multimode fiber, ecosystem compatibility and interoperability are other considerations. Mainstream availability and the lower cost of replaceable pluggable optics are further reasons why pluggables will be around for a while. Linear-drive pluggable optics (LPO), which removes the power-hungry DSP/CDR components while keeping the benefits of the pluggable form factor, is another alternative.
Power is a factor for optical interconnects. When Nvidia announced Blackwell in 2024, the DGX GB200 NVL72 rack system used copper instead of optics in the fabric; using optical transceivers and retimers would have cost about 20 kW for the NVLink spine, in a roughly 120 kW rack.
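A quick sanity check on those figures shows why copper won inside the rack; the numbers below simply restate the ones cited above.

```python
rack_power_kw = 120.0    # DGX GB200 NVL72 rack budget cited above
optics_power_kw = 20.0   # estimated transceiver/retimer cost for the NVLink spine

share = 100 * optics_power_kw / rack_power_kw
print(f"Optics would consume ~{share:.0f}% of the rack power budget")  # ~17%
```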
Quality of Interconnects
A modern data center is composed of core and aggregation switches arranged in a spine-leaf (Clos) network that interconnects possibly thousands of racks of servers. Each rack consists of shelves, with each shelf hosting several servers, and all of these servers connect to a Top of Rack (ToR) switch. The servers themselves are composed of compute nodes (CPUs), accelerators (GPUs, DPUs/IPUs, ASICs), and memory, DRAM or HBM, attached to the xPUs. This model has exposed limitations and inefficiencies: resources lie underutilized, leading to higher opex. Some publications indicate that around 30% of memory lies underutilized in a typical data center, with memory costs in some cases surpassing compute costs.
Disaggregated architecture is one proposed solution, where pooling of resources holds the promise of lower opex. Take the memory situation mentioned earlier: instead of one server running close to full memory usage while a neighboring server sits underutilized, a disaggregated design pools the memory and allocates it as needed, no matter where it physically resides. Memory and storage can likewise sit in a pool that is handed to compute on demand. However, it is not sufficient to build a big, capable cluster of compute or storage if the interconnect capacity becomes the bottleneck in processing data efficiently. Data needs to move in and out at the scale at which the compute, accelerators, memory, storage and I/O can consume and produce it.
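A small sketch makes the pooling argument concrete; the server counts, installed capacities and per-server demands below are made-up numbers purely for illustration.

```python
# Three servers with fixed local memory vs. one shared pool (illustrative only).
servers = [
    {"installed_gb": 512, "demand_gb": 480},
    {"installed_gb": 512, "demand_gb": 200},
    {"installed_gb": 512, "demand_gb": 610},  # spills over without pooling
]

# Per-server view: capacity above local demand sits idle (stranded),
# demand above local capacity goes unmet.
stranded_idle = sum(max(s["installed_gb"] - s["demand_gb"], 0) for s in servers)
unmet_demand  = sum(max(s["demand_gb"] - s["installed_gb"], 0) for s in servers)

# Pooled view: one shared pool serves the aggregate demand.
pool_capacity = sum(s["installed_gb"] for s in servers)
pool_demand   = sum(s["demand_gb"] for s in servers)

print(stranded_idle, unmet_demand)   # 344 GB idle, 98 GB unmet when memory is stranded
print(pool_capacity - pool_demand)   # 246 GB of genuine headroom once pooled
```

The catch, as noted above, is that the pooled memory is only useful if the interconnect can deliver it at the speed and latency the compute expects.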
Some key features to look for when considering interconnects:
As always, cost plays a role, and $/Gbps is a useful indicator when choosing between interconnect options.
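A simple normalization helper shows how that comparison works in practice; the option names and prices below are hypothetical placeholders, not market data.

```python
def dollars_per_gbps(unit_cost_usd: float, bandwidth_gbps: float) -> float:
    """Normalize interconnect options by cost per unit of bandwidth."""
    return unit_cost_usd / bandwidth_gbps

# Placeholder prices chosen only to demonstrate the metric.
options = {
    "copper DAC, 400G": dollars_per_gbps(150, 400),
    "pluggable optic, 400G": dollars_per_gbps(600, 400),
    "pluggable optic, 800G": dollars_per_gbps(900, 800),
}
for name, cost in sorted(options.items(), key=lambda kv: kv[1]):
    print(f"{name}: ${cost:.2f}/Gbps")
```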
IO, compute and switching
Finally, let's touch on input-output. I/O is the airflow for compute: processing means moving data in and out. In a traditional server model, communication moves from a compute element within a node over a local bus (such as NVLink or PCIe) and/or through a switch to another node or to memory, and/or to a networking card over the internal bus, and then over optical transceivers as Ethernet or InfiniBand between racks and pods. High-performance computing and the training and inference of AI models require linking compute nodes and making memory available both on-die and off-die; such topologies benefit from scaling up and out, both in radix and in how much data is moved between compute nodes. Co-packaging compute and I/O in a topology has benefits that are not obvious: sitting in the same package as the compute, the I/O can remove bottlenecks encountered in a traditional networking model. The optical or electrical interconnects can carry Ethernet or InfiniBand as I/O, and can expose what the compute needs through open standard protocols like CXL or proprietary protocols like NVLink and Infinity Fabric. In that sense, adaptations in software, at the protocol, transport and application layers, matter more for the evolution of networking than a new interconnect at the PHY level.
Chiplet-based interconnects packaged with compute in a fabric topology open up new avenues for modular systems, large-scale topologies, resource disaggregation, and dynamic use through pooling and parallelism.
Post by Soumen Karmakar
Disclaimer: Views and opinions do not reflect those of any organization.