1. Tutorial 3
Tutorial 3
Co-Packaged Photonics for Improved Energy
Efficiency and Performance of AI Applications
Clint Schow
December 2024
Dept. of Electrical Engineering
C. Schow, IEDM 2024
1
2. Tutorial 3
Outline
• CPO and Why Do Systems Need it?
• Why Si Photonics for CPO?
• Interfaces and First Implementations of CPO
• Factors that May Delay CPO Deployment/Market Drivers
and Outlook
• Concluding Thoughts
3. Tutorial 3
Why CPO? High Cost of Data Movement from Chip Packages
Courtesy of J. Shalf, LBNL
Huge energy cost for
transporting data off-chip
4. Tutorial 3
Electronic Packaging Limits Chip Connectivity
D. Kam et al., “Is 25 Gb/s On-Board Signaling Viable?,” IEEE Trans. Adv. Packag., 2009.
Courtesy of M. Ritter, IBM
[Figure: supported bandwidth of each packaging interface in the chip-to-PCB path (C4s, chip carrier, LGA/BGA connector, PCB), normalized to the C4 interface]
BGA/LGA chip packages:
• Poor scalability and signal integrity
• Reduced system performance and efficiency
5. Tutorial 3
Integration to Maximize BW and Energy Efficiency
Integrating photonics into the most expensive, constrained, and challenging environment in the system:
– Cost
– Reliability
– Thermal
– Power delivery
Move from bulky optics located far away (long trace on card, many cm, through a connector) to highly integrated optics close to the logic chips (short trace on carrier, mm).
[Figure: TX/RX module at the edge of the host PCB vs. TX/RX co-packaged with the IC on the chip carrier]
6. Tutorial 3
Paths to Higher Data Rates
Large jump in
complexity and power
7. Tutorial 3
Tradeoffs in Scaling Bandwidth
Three axes for scaling bandwidth:
• Symbol rate (Gbd): 28, 56, 96, 112
– Limited by electronics, photonics, and packaging
– Dictated by SerDes
• Bits/symbol: NRZ OOK: 1; QPSK or PAM4: 2; DP-QPSK: 4; DP-16QAM: 8; DP-64QAM: 12
– Large jump in complexity and power: more FEC, more DSP, linearity specs
– More complexity generally translates into higher power consumption and higher latency
– CAN’T partition into smaller pipes
• Wavelengths or fibers (or both)
– CAN partition into smaller pipes
– Challenging packaging for high fiber count: loss, tolerance, uniformity, yield
– WDM raises complexity and power: operation over temperature, wavelength stability, added losses
BW granularity determines effective switch radix. 51.2T switch example: 800G granularity → 64 ports; 200G granularity → 256 ports.
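The granularity arithmetic above can be sketched in a few lines (a 51.2T switch capacity is assumed for the switch example):

```python
# Per-lane data rate = symbol rate x bits/symbol, and
# effective switch radix = switch capacity / port granularity.

def lane_rate_gbps(symbol_rate_gbd: float, bits_per_symbol: int) -> float:
    """Data rate of one lane in Gbps."""
    return symbol_rate_gbd * bits_per_symbol

def switch_radix(capacity_gbps: float, granularity_gbps: float) -> int:
    """Number of ports a switch of given capacity offers at a given port speed."""
    return int(capacity_gbps // granularity_gbps)

# 112 Gbd with PAM4 (2 bits/symbol) -> 224 Gbps per lane
assert lane_rate_gbps(112, 2) == 224

# 51.2T switch example from the slide
assert switch_radix(51_200, 800) == 64   # 800G granularity -> 64 ports
assert switch_radix(51_200, 200) == 256  # 200G granularity -> 256 ports
```

The same capacity partitioned into finer-granularity ports yields a larger radix, which is why partitionable lanes (wavelengths/fibers) matter for switch design.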
8. Tutorial 3
High-Water Mark for Optics in HPC: IBM Sequoia BG/Q
(2012)
[Figure: machine room, ~30 m × ~10 m]
96 Blue Gene/Q racks:
• 20.13 Pflops (peak)
• 1.572M cores
• ~8 MW
• 620,000 optical links, 10 Gbps/lane
• 6400 km of MMF
• Maximum link distance = 23 m
• HPC requires technologies optimized for short reach ~30m
• Each machine is a green-field installation; backwards compatibility is never an issue
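The Sequoia figures above support a quick back-of-the-envelope check (a sketch that assumes each of the 620,000 optical links is a single 10 Gbps lane, as the slide implies; peak performance taken as the Top500 Rpeak of ~20.13 Pflops):

```python
# Aggregate optical bandwidth and the resulting bytes/flop balance for
# Sequoia, assuming one 10 Gbps lane per optical link (an assumption,
# not a machine spec).
links = 620_000
lane_gbps = 10
peak_pflops = 20.13  # Top500 Rpeak ~20,132.7 TFlop/s

aggregate_tbps = links * lane_gbps / 1000   # terabits per second
aggregate_tBps = aggregate_tbps / 8         # terabytes per second
bytes_per_flop = aggregate_tBps * 1e12 / (peak_pflops * 1e15)

print(f"{aggregate_tbps:.0f} Tbps = {aggregate_tBps:.0f} TB/s of optics")
print(f"{bytes_per_flop:.4f} B/flop")
```

Even this optics-heavy machine lands well below the 1 Byte/Flop balance point discussed on the HPC-system-balance slide that follows.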
9. Tutorial 3
Copper fights back: IBM Summit: #1 Machine in 2018
Source: D. Kuchta, IBM
10. Tutorial 3
HPC Systems Heavily Unbalanced
https://www.hpcwire.com/2016/11/07/mccalpin-traces-hpc-system-balance-trends/
• Amdahl’s Law: 1 Byte/Flop
• Computation dramatically outpacing networking
• Severely limits performance for problems where data is not localized
The first implementation of CPO restored some balance and illustrated the potential of the technology.
11. Tutorial 3
The Promise of a Machine with 2,000,000 Optical Links
• A. Benner, “Optical Interconnect Opportunities in Supercomputers and High End Computing,” OFC Tutorial 2012.
• First realization of fiber to the chip (aka CPO) – key to much higher BW
• Enabled by NRE from DARPA to develop custom optical modules
• Commercially not successful, too expensive
First CPO
12. Tutorial 3
IBM Power 775: Fiber Density Pushed to the Limit
IBM, public presentation 2010
• Need to maximize BW/fiber: wavelength division multiplexing (WDM), multi-core fibers
13. Tutorial 3
High Byte/flop: CPO Enabled by Electrical Packaging
Courtesy A. Benner, D. Kuchta, IBM
• Enabled by high-performance electrical packaging
• No longer available: technology is extinct
IBM Glass Ceramic MCM
14. Tutorial 3
Where are we now? Optics in Datacenters
• Everything in the rack is copper (<2 m)
• AOC (Active Optical Cable) for <20 m links
• WDM (Wavelength Division Multiplexing) only for long links
Courtesy of M. Filer, Microsoft
15. Tutorial 3
Game Changer?
Optical Switching Deployed at Scale by Google
• Google has deployed MEMS-based optical circuit switches in their datacenters
• Network power consumption reduced by 40%, CAPEX reduced by 30%
• Advantages in network throughput and incremental installment
• L. Poutievski et al., “Jupiter Evolving: Transforming Google’s Datacenter Network via Optical Circuit Switches and Software-Defined
Networking,” in Proceedings of the ACM SIGCOMM 2022 Conference, New York, NY, USA, Aug. 2022, pp. 66–85.
• R. Urata et al., “Mission Apollo: Landing Optical Circuit Switching at Datacenter Scale,” arXiv:2208.10041
16. Tutorial 3
Google Applies OCS to ML Supercomputer (TPU v4)
• OCS network improves Google’s AI cluster
utilization, efficiency, and training times
• N. Jouppi et al., “TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings,” ISCA ’23: Proceedings of the 50th Annual International Symposium on Computer Architecture, June 2023.
• H. Liu et al., “Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems,” Proceedings of the ACM SIGCOMM 2023 Conference, 2023.
17. Tutorial 3
From a Machine Room to the Cloud and Back
Systems for AI are similar in many ways to conventional HPC:
• small scale compared to datacenters = short optical links
• latency and energy efficiency are critical
Utah Data Center
HPC AI cluster
18. Tutorial 3
State of The Art: NVIDIA Blackwell
• Blackwell uses the largest dies possible: 208B transistors, 20 Pflops each
• 2 Blackwell chips act as single GPU, connected with 10TB/s (NV-HBI)
• 900GB/s bi-directional GPU-CPU links
• 2700W
• 72 Blackwell GPUs, 36 Grace CPUs
• Rack acts as a single GPU
• 130 TB/s GPU BW within NVL72 domain
• NVLink scales to 576 GPUs = 8 NVL72 racks
• Copper backplane
• Water cooled
• 120 kW/rack
https://resources.nvidia.com/en-us-blackwell-architecture/blackwell-architecture-technical-brief
[Figure: NVIDIA GB200 Superchip (2× Blackwell GPU + Grace CPU) and NVIDIA GB200 NVL72 rack (36× superchips, NVL backplane, networking)]
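The rack-level numbers above can be cross-checked with simple arithmetic (all inputs from the slide; attributing the remaining rack power to networking, memory, and cooling is an inference, not an NVIDIA spec):

```python
# NVL72 rack arithmetic from the slide's figures.
gpus, cpus = 72, 36
superchip_w = 2700      # one GB200 superchip: 2 Blackwell GPUs + 1 Grace CPU
rack_kw = 120
pflop_per_gpu = 20

superchips = gpus // 2                       # 36 superchips per rack
compute_kw = superchips * superchip_w / 1000
rack_pflops = gpus * pflop_per_gpu

print(f"{superchips} superchips -> {compute_kw:.1f} kW of the {rack_kw} kW rack")
print(f"{rack_pflops} Pflops per rack")
```

The superchips alone account for roughly 97 kW of the 120 kW rack budget, leaving the remainder for everything else in the rack.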
20. Tutorial 3
Outline
• CPO and Why Do Systems Need it?
• Why Si Photonics for CPO?
• Interfaces and First Implementations of CPO
• Factors that May Delay CPO Deployment/Market Drivers
and Outlook
• Concluding Thoughts
21. Tutorial 3
Why Si Photonics: Silicon Photonic Waveguides
[Figure: waveguide cross-section (0.5 µm × 0.2 µm Si core, n=3.5, in SiO2 cladding, n=1.45, on a 2 µm thick buried oxide (BOX)), color-coded with the intensity of the electric field]
• Undoped Si and SiO2 are transparent for λ = 1.2 µm – 6.5 µm
• Si surrounded by SiO2 forms a dielectric waveguide (similar to single-mode fiber, but much smaller due to the high index contrast)
• Large index contrast between core and cladding enables tight bends with low loss
• Losses (2 µm BOX): ~2 dB/cm and ~0.01 dB/bend (R ~5 µm)
• Waveguides usually support only one polarization
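The loss figures above translate directly into an on-chip link-budget calculation; a minimal sketch (the example route length and bend count are hypothetical):

```python
# On-chip waveguide loss budget from the slide's figures:
# ~2 dB/cm propagation loss and ~0.01 dB per tight bend (R ~ 5 um).
PROP_DB_PER_CM = 2.0
BEND_DB = 0.01

def path_loss_db(length_cm: float, n_bends: int) -> float:
    """Total waveguide loss for a routed path."""
    return PROP_DB_PER_CM * length_cm + BEND_DB * n_bends

# Hypothetical example: 0.5 cm route with 20 bends
loss = path_loss_db(0.5, 20)
power_fraction = 10 ** (-loss / 10)   # fraction of optical power surviving
print(f"{loss:.2f} dB loss, {power_fraction:.0%} of power survives")
```

Propagation loss dominates for cm-scale routes; the tight low-loss bends are what make dense on-chip routing practical.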
23. Tutorial 3
Fiber to PIC Coupling: Mode Expansion to Match Fiber
Bhandari, Bishal et al. “Compact and Broadband Edge Coupler Based on Multi-Stage Silicon
Nitride Tapers.” IEEE Photonics Journal, 2020.
Li, C. et al. “CMOS-compatible silicon double-etched apodized waveguide grating couplers
for high efficient coupling.” OFC, 2013.
Edge Couplers:
• Broad spectral bandwidth
• Low loss
• Relatively polarization insensitive
• Sensitive to misalignment
• Larger footprint

Grating Couplers:
• Limited spectral bandwidth
• Higher loss
• Polarization sensitive
• Less sensitive to misalignment
• Smaller footprint
24. Tutorial 3
Polarization Must be Managed
• TX output is linearly polarized and launched into fiber
• At the RX, a random polarization state is received that has both TE and TM components with an unknown distribution of power; all light must be collected to avoid losses
• Polarization-maintaining fiber can avoid this issue but is too costly
Polarization Splitter-Rotator (Edge Couplers): rotates the incoming TM component to the propagating TE mode
W. D. Sacher et al., "An O-band Polarization Splitter-Rotator in a CMOS-Integrated Silicon Photonics Platform," Frontiers in Optics, 2016.
Polarization-Splitting Grating Couplers: couple the incoming TE and TM components to TE waveguides
W. Bogaerts et al., "A polarization-diversity wavelength duplexer circuit in silicon-on-insulator photonic wires," Opt. Express, 2007.
25. Tutorial 3
GlobalFoundries: 45nm SOI CMOS Integration
Courtesy of T. Letavic, GlobalFoundries
28. Tutorial 3
OpenLight: Commercializing III-V on Silicon
https://www.onboardoptics.org/cobo-presentations
Manufactured by Tower, which also offers a complete Si-only process
31. Tutorial 3
Intel Hybrid Technology: III-V on Si Integration
Courtesy of Y. Akulova, R. Blum, Intel
• Hybrid III-V on Si integration platform: SiP for passives and routing, III-V for lasers
32. Tutorial 3
8λ CWDM
Courtesy of Y. Akulova, R. Blum, Intel
• Ability to integrate multiple, different InP epitaxial structures
33. Tutorial 3
All-In-One Functionality Simplifies Packaging and Test
Courtesy of Y. Akulova, R. Blum, Intel
• Integration to lower packaging and test costs: all components on single die
35. Tutorial 3
Multi-Wavelength RRM-Based Links
• Widely pursued by industry and academia
• Wide and slow (~25 Gbps) interfaces relying on many wavelengths
• Ring resonator modulators (RRMs) provide wavelength multiplexing and modulation at the TX, wavelength demultiplexing at the RX
• The resonant nature of RRMs makes them very efficient but also very sensitive; precise bias control and wavelength tracking are required
• Comb lasers are often favored as a low-cost, wavelength-stable source
NVIDIA: https://www.techpowerup.com/forums/threads/nvidia-is-preparing-co-packaged-photonics-for-nvlink.276139/
HPE: D. Liang et al., "Advanced Integrated Photonics For DWDM Optical Interconnects," OECC/PSC, 2022.
Ayar Labs: M. Wade et al., "TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O," IEEE Micro, 2020.
Columbia University: M. Glick et al., "PINE: Photonic Integrated Networked Energy efficient datacenters (ENLITENED Program) [Invited]," JOCN, 2020.
37. Tutorial 3
Parallel Singlemode Optics: Nubis
S. T. Le et al., "1.6-Tbps Low-Power Linear-Drive High-Density Optical Interface (HDI/O) for ML/AI," OFC, 2024.
https://www.lightwaveonline.com/home/article/14305426/nubis-communications-inc-xt1600-high-density-linear-optical-engine
• Parallel optics with singlemode fiber: 16 TX + 16 RX
• 100 Gbps interfaces
• Si photonics PIC with Mach-Zehnder modulators
• Normal-incidence fiber coupling for 2-D optical engine placement
38. Tutorial 3
Outline
• CPO and Why Do Systems Need it?
• Why Si Photonics for CPO?
• Interfaces and First Implementations of CPO
• Factors that May Delay CPO Deployment/Market Drivers
and Outlook
• Concluding Thoughts
39. Tutorial 3
Electrical and Optical Link Structure

Electrical link: Serializer → TX slice (FFE) → electrical channel (L < ~1 m of PCB, or a few meters of cable) → RX slice (DFE) → Deserializer

Optical link: Serializer → TX slice → pre-driver and VCSEL driver (OE module) → VCSEL → fiber (L: m to km) → PD → TIA and LA → output driver (OE module) → short electrical channel → RX slice → Deserializer

• Electrical I/O requires heavy equalization
• Much of the required circuitry is common to both electrical and optical links: serializer/deserializer (MUX/DEMUX) and clock generation/recovery blocks (PLL, CDR)
• Equalization is very helpful on the electrical side of the optical module (TX in, RX out)
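The TX-side FFE named above is, at its core, a short FIR filter applied to the symbol stream; a minimal sketch (the 3-tap coefficients are hypothetical illustration values; real SerDes taps are adapted per channel):

```python
# Minimal 3-tap feed-forward equalizer (FFE): each output sample is a
# weighted sum of a symbol and its neighbors, pre-distorting the
# waveform to counter the channel's low-pass response.
def ffe(symbols: list[float], taps: list[float]) -> list[float]:
    """FIR pre-emphasis: taps = [pre-cursor, main cursor, post-cursor, ...]."""
    out = []
    for i in range(len(symbols)):
        acc = 0.0
        for j, t in enumerate(taps):
            k = i - j + 1  # align so taps[1] weights the current symbol
            if 0 <= k < len(symbols):
                acc += t * symbols[k]
        out.append(acc)
    return out

# Hypothetical taps: small negative pre/post cursors emphasize transitions
taps = [-0.1, 0.8, -0.1]
nrz = [1.0, 1.0, -1.0, -1.0, 1.0]
print(ffe(nrz, taps))
```

Symbols at a transition come out with larger amplitude than symbols in the middle of a run, which is exactly the pre-emphasis that offsets channel loss at high frequencies.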
41. Tutorial 3
SerDes Power and Latency Depend on Reach
Slide courtesy of Davide Tonietto, Huawei
42. Tutorial 3
Power vs Reach for SerDes
https://www.onboardoptics.org/cobo-presentations: R. Schaevitz, “Scaling into the Next Decade”
43. Tutorial 3
CPO to Maximize Density and Minimize Power
https://docs.broadcom.com/doc/siph-chiplets-in-package-scip
46. Tutorial 3
SerDes Must Handle Multiple Interconnects
Slide courtesy of Peter Del Vecchio, Broadcom
47. Tutorial 3
Outline
• CPO and Why Do Systems Need it?
• Why Si Photonics for CPO?
• Interfaces and First Implementations of CPO
• Factors that May Delay CPO Deployment/Market Drivers
and Outlook
• Concluding Thoughts
50. Tutorial 3
LPO to Push Back CPO Deployments?
• Linear Pluggable Optics (LPO), sometimes referred to as linear/direct drive
• Eliminates DSP in the module to save power
• Hot topic at OFC 2024, current consensus seems to be retimed TX, linear RX
https://community.fs.com/article/what-is-the-lpo-transceiver.html
53. Tutorial 3
A Heterogeneous View of the Future
Slide courtesy of Davide Tonietto, Huawei
54. Tutorial 3
Outlook in 2023: CPO Uptake
Courtesy of Vlad Kozlov, Lightcounting
55. Tutorial 3
Outlook in 2023: CPO Applications
Courtesy of Vlad Kozlov, Lightcounting
56. Tutorial 3
Outlook 2023: Transceivers Still Dominate
Courtesy of Vlad Kozlov, Lightcounting
57. Tutorial 3
Outline
• CPO and Why Do Systems Need it?
• Why Si Photonics for CPO?
• Interfaces and First Implementations of CPO
• Factors that May Delay CPO Deployment/Market Drivers
and Outlook
• Concluding Thoughts
58. Tutorial 3
Back to Blackwell
[Figure: Blackwell I/O domains: memory, backplane, network]
Slide courtesy of A. Seyedi, NVIDIA
59. Tutorial 3
Link Performance vs Distance and Purpose
Slide courtesy of A. Seyedi, NVIDIA
61. Tutorial 3
Closing Thoughts
• CPO may be the lowest power solution but faces challenges
• Pushback from incumbent technologies: pluggables
• Implementation challenges:
• Yield, reliability, serviceability
• Supply chain development and definition: who does what?
• Interoperability and standards
• Si Photonics has the integration scale to support CPO
• Optical packaging still not fully solved at scale
• Better modulators available in other materials, e.g. TFLN, InP
• The future is heterogeneous
• Multiple technologies for chip I/O, electrical and optical co-existence
• Heterogeneous integration of other optical materials on silicon
62. Tutorial 3
Questions?
Thank You!
SLIDES AND TECHNICAL CONTENT
Y. Akulova (Intel, now PsiQuantum); M. Filer (Microsoft, now stealth startup); T. Letavic, K. Giewont, T. Hirokawa (GlobalFoundries); V. Kozlov (Lightcounting); P. De Dobbelaere (Luxtera/Cisco); D. Kuchta (IBM); B. Lee (IBM, now NVIDIA); M. Rakowski (IMEC, now GlobalFoundries); A. Seyedi (NVIDIA); D. Tonietto (Huawei); J. Shalf (LBNL); P. Del Vecchio (Broadcom); Q. Wang (Meta)
Schow@ece.ucsb.edu