Evaluating UCIe-based multi-die SoC to meet timing and power
Logistics of the Webinar
All attendees will be placed on mute
To ask a question, click the Cloud Chat icon and type your question. Staff are standing by to answer.
There will also be time at the end for Q&A
Agenda
Overview of UCIe™ — Universal Chiplet Interconnect Express™
Introduction to system modeling with UCIe and other Intellectual Properties
Assembling System models using UCIe protocol
Examples of SoC architectures using UCIe
Use Case
Mirabilis Design and VisualSim Architect
UCIe Background Information
Background on die-to-die Interconnect
•Packing a large number of functions running at different clock rates onto a monolithic die is not scalable
•Solution: Integrate multiple dies into a single package (chiplets)
•Chiplet challenges:
• Die-to-die communication is slow and consumes too much power
• No single standard handles routing, signaling and multiple clock domains
• Cache coherency across dies
• Support for multiple protocols
•Exploration:
• Need a mechanism to predict the expected latency and power consumption
• Test the feasibility of different configurations and assign compute resources to individual dies
• Study the impact of failures or extreme latency
• Explore different scheduling and Quality-of-Service algorithms
Universal Chiplet Interconnect Express (UCIe) is the Future
•Customizable, package-level integration of chiplets
• Combines best-in-class die-to-die interconnect and protocol connections from an interoperable, multi-vendor ecosystem
• Open industry-standard interconnect
•Offers high-bandwidth, low-latency, power-efficient and cost-effective on-package connectivity
•Compute can be implemented in an advanced process node to deliver power-efficient performance at higher cost, with memory and I/O controllers reused from an earlier design in an established (n-1 or n-2) process node
•Future designs will incorporate interaction between AI engines on different connected dies and will require deterministic latency
•Optimal design requires accurate assignment of resource pooling, resource sharing and message passing
•UCIe theoretical bandwidth is 4x the bandwidth of PCIe 6.0 (Tbps range)
•Actual bandwidth depends on the burst data available and on the sizes of the Tx buffer and the replay buffer
UCIe Specification at a Glance
How UCIe Works
Multiple layers separate the interconnect tasks
The Physical layer is responsible for electrical signaling, clocking, link training and the sideband
The Die-to-Die adapter provides link state management and parameter negotiation for the chiplets. It optionally guarantees reliable delivery of data through CRC and a link-level retry mechanism.
◦ When multiple protocols are supported, it defines the underlying arbitration mechanism.
The FLIT (flow control unit) defines the underlying transfer mechanism when the adapter is responsible for reliable transfer
UCIe Packaging
Two package types: standard and advanced.
◦ Standard has 16 lanes per module
◦ Advanced has 64 lanes per module
Multi-module configurations are supported to increase bandwidth
For 2 modules
◦ Standard has 32 lanes
◦ Advanced has 128 lanes
◦ Each module sends different bytes of data
The number of lanes increases with the module count
Multiple PHY logic blocks provide greater data transfer with better scheduling
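As a rough illustration of how package type, module count and transfer rate combine, the sketch below computes raw link bandwidth. The lane counts are the ones listed on the slide. The function name and the simplification of one bit per lane per transfer, with no FLIT, CRC or retry overhead, are assumptions made for this example.

```python
# Lanes per module, as listed on the slide.
LANES_PER_MODULE = {"standard": 16, "advanced": 64}

def raw_bandwidth_gbps(package: str, modules: int, rate_gts: float) -> float:
    """Raw bandwidth in Gb/s, assuming 1 bit per lane per transfer (illustrative)."""
    if modules < 1:
        raise ValueError("at least one module is required")
    lanes = LANES_PER_MODULE[package] * modules
    return lanes * rate_gts

# Standard package, single module, 4 GT/s
print(raw_bandwidth_gbps("standard", 1, 4))
# Advanced package, 4 modules, 32 GT/s
print(raw_bandwidth_gbps("advanced", 4, 32))
```

Delivered bandwidth will be lower than these raw figures once protocol overhead, buffer sizes and retries are taken into account.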
Transition to UCIe
[Figure: a typical SoC (monolithic approach) compared with a next-generation SoC that uses chiplets in a modular approach. Block labels: CXL 2.0, PCIe Gen 6 interface (x2), Streaming.]
Commonality with PCIe 6.0
•UCIe protocol emulates PCIe for chiplets
•UCIe transfers packets in FLITs
• PCIe 6.0 uses a fixed FLIT size of 256 bytes
• UCIe FLIT size varies with the sender and receiver protocol
•Credit-based flow control mechanism
•ACK or NAK confirms whether packets were received correctly
• Selective and Standard ACK options
•Advanced port status and error checking
• CRC checksum
•Bandwidth depends on the number of lanes
• Standard vs Advanced package
• Multi-Module option
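The interplay of credits, ACK/NAK and a replay buffer can be sketched with a toy sender model. This is not the UCIe or PCIe state machine. The class and method names are hypothetical, and for simplicity an ACK here also returns the credit, whereas real links track credit return separately.

```python
from collections import deque

class CreditLink:
    """Toy sender: credit-based flow control plus a replay buffer (illustrative only)."""

    def __init__(self, credits: int):
        self.credits = credits      # receiver buffer slots advertised to the sender
        self.replay = deque()       # FLITs sent but not yet acknowledged
        self.next_seq = 0

    def send(self, payload: bytes):
        """Send one FLIT if a credit is available; return its sequence number or None."""
        if self.credits == 0:
            return None             # stalled until the receiver frees a slot
        seq = self.next_seq
        self.next_seq += 1
        self.credits -= 1
        self.replay.append((seq, payload))
        return seq

    def ack(self, seq: int):
        """Receiver accepted everything up to and including seq."""
        while self.replay and self.replay[0][0] <= seq:
            self.replay.popleft()
            self.credits += 1       # simplification: ACK also returns the credit

    def nak(self, seq: int):
        """Receiver rejected seq (for example a CRC failure); return FLITs to resend."""
        return [flit for flit in self.replay if flit[0] >= seq]

link = CreditLink(credits=2)
link.send(b"a")
link.send(b"b")
print(link.send(b"c"))   # no credit left, so the sender stalls
print(link.nak(1))       # FLIT 1 must be replayed
```

The sketch shows why the replay buffer size matters for achievable bandwidth: the sender can only have as many FLITs in flight as it has credits and replay slots.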
System-level Architecture Analysis of UCIe-based multi-die SoC
Multi-Media Application: UCIe Template provided by Intel
[Figure: system block diagram. Block labels: CPU (high-performance cores), CPU (low-power cores), Audio/Video Encoder/Decoder, I/O Tile, MEM (x3), PCIe 6.0, CXL 3.0, UCIe Retimer (x2), Off-Package Interconnect, NVMe SSD chiplet.]
Questions raised on the diagram:
• How much should the retimer timeout be set to?
• Do we need a multi-module setup?
• What transfer rate should be set between UCIe links: 4 GT/s, 8 GT/s ... or 32 GT/s?
Start with a System Block Diagram
VisualSim Model
Create a VisualSim model using existing building blocks
Stats
Configurations compared:
• Advanced package, 4 modules, 32 GT/s
• Standard package, single module, 4 GT/s
A ~300x latency difference can be observed. However, for non-time-critical applications, the Standard UCIe package option looks attractive
Study the statistics to decide on the best configuration
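A back-of-the-envelope check can complement the simulation statistics. The sketch below estimates only the raw serialization time of a payload for a given configuration and lists the configurations that meet a latency budget. The 4096-byte payload, the 100 ns budget and the function names are hypothetical. Queuing, protocol overhead and retries are ignored, so the ratio it prints is not expected to match the ~300x simulated figure.

```python
LANES_PER_MODULE = {"standard": 16, "advanced": 64}

def serialization_ns(payload_bytes, package, modules, rate_gts):
    """Time to push the payload onto the link, assuming 1 bit per lane per transfer."""
    lanes = LANES_PER_MODULE[package] * modules
    return payload_bytes * 8 / (lanes * rate_gts)   # Gb/s equals bits per ns

def configs_meeting(budget_ns, payload_bytes, rates=(4, 8, 32), module_counts=(1, 2, 4)):
    """Configurations within the budget, least raw bandwidth first."""
    hits = []
    for package in LANES_PER_MODULE:
        for modules in module_counts:
            for rate in rates:
                t = serialization_ns(payload_bytes, package, modules, rate)
                if t <= budget_ns:
                    hits.append((package, modules, rate, t))
    return sorted(hits, key=lambda c: LANES_PER_MODULE[c[0]] * c[1] * c[2])

slow = serialization_ns(4096, "standard", 1, 4)
fast = serialization_ns(4096, "advanced", 4, 32)
print(slow, fast, slow / fast)
print(configs_meeting(100, 4096)[0])
```

Sorting by raw bandwidth is one simple proxy for cost; a real study would weigh package cost and power as well.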
Application Examples of UCIe-based multi-die SoC
Example 1: Multi-Media applications
[Figure: system block diagram. Block labels: CPU (high-performance cores), CPU (low-power cores), Audio/Video Encoder/Decoder, I/O Tile, MEM (x3), PCIe 6.0, CXL 3.0, Retimer, Off-Package Interconnect.]
Example 2: Automotive Autonomous Driving
[Figure: chiplets connected over UCIe. Block labels: AI Engine Tiles, Warp Scheduler, PE (x4), Local Mem, GPU, Analog Chiplet, ADC/DAC/PLL (x2), Processor subsystem, Core, L1, Bus, SLC.]
Example 3: Cache Coherency using UCIe
[Figure: chiplets connected over UCIe, labeled with process nodes: SERDES (32nm), GPU (7nm), RISC-V Cores (5nm), ARM Cores (10nm), DSP (10nm), SLC chiplet (22nm), LPDDR5 (28nm); four Cache blocks.]
Design Challenges in Implementing UCIe
•A huge memory transaction can block a high-priority control access
• For time-critical applications, this is not acceptable
• Example: Automotive communication system
•Multiple chiplets can be connected easily and efficiently
• Resource sizing per chiplet must be correct to maximize bandwidth usage
• Example applications: Data Center and AI Accelerators
•Migrating from a monolithic die to chiplets in smartphones is efficient
• Limited memory must be partitioned so that different dies can access it with minimal contention
• Example: Apple M1 Ultra uses chiplets to double the performance
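The blocking problem in the first bullet can be put into numbers with a short sketch. It estimates how long a newly arrived high-priority access waits behind a memory burst already on the link. Splitting the burst into segments so that the control access can go in between them is an illustrative mitigation assumed for this example, not something the slides prescribe. The burst size, segment size and link rate are hypothetical.

```python
def worst_case_wait_ns(burst_bytes, link_gbps, segment_bytes=None):
    """Worst-case wait for a high-priority access that arrives just after a burst starts.

    Without segmentation the whole burst must drain first. With segmentation
    (assumed here), only the segment currently in flight must finish.
    """
    if segment_bytes is None:
        blocking_bytes = burst_bytes
    else:
        blocking_bytes = min(burst_bytes, segment_bytes)
    return blocking_bytes * 8 / link_gbps   # Gb/s equals bits per ns

# 64 KiB memory burst on a 64 Gb/s link
print(worst_case_wait_ns(65536, 64))
# Same burst split into 256-byte segments
print(worst_case_wait_ns(65536, 64, segment_bytes=256))
```

For a time-critical automotive control path, a bound of this kind is what a scheduling or Quality-of-Service policy would need to guarantee.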
Performance challenges
•User defines CXL stacks with two protocols sharing the physical link.
•The arbiter in the Die-to-Die adapter must send FLITs alternately between the 2 protocols.
• If one of the Protocol layers has no data to transmit, “NOP” frames are inserted instead of payload. If one of the Protocol stacks is idle most of the time, bandwidth is essentially wasted on the “NOP” frames.
•Increasing the number of modules for either the standard or advanced package provides more bandwidth.
• But is that extra bandwidth needed for the application?
•What happens if multiple chiplets in your design require data stored at the same address, which resides in another chiplet?
• Consider the impact of cache coherency
•Can peak throughput be guaranteed for your application in a shared-resource environment?
• AI Engine distributed across multiple dies
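The NOP behavior described above can be sketched as a strict two-way alternation. The function below is a simplified model, not the adapter's actual arbiter: slots are assigned to protocols A and B in turn, and a slot whose owner has nothing queued carries a NOP. The function name and queue depths are hypothetical.

```python
def alternate(queue_a: int, queue_b: int, slots: int) -> dict:
    """Strictly alternate link slots between protocols A and B; an idle owner sends a NOP."""
    sent = {"A": 0, "B": 0, "NOP": 0}
    pending = {"A": queue_a, "B": queue_b}
    for slot in range(slots):
        owner = "A" if slot % 2 == 0 else "B"
        if pending[owner] > 0:
            pending[owner] -= 1
            sent[owner] += 1
        else:
            sent["NOP"] += 1
    return sent

# Protocol A is saturated, protocol B is idle: half of the slots carry NOPs.
print(alternate(queue_a=100, queue_b=0, slots=100))
```

In this model a busy protocol can never use more than half of the link, regardless of how idle the other stack is, which is the wasted bandwidth the slide warns about.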
Analyzing UCIe-based multi-die SoC using VisualSim System Model
Autonomous driving
[Figure: chiplets connected over UCIe. Block labels: AI Engine Tiles, Warp Scheduler, PE (x4), Local Mem, GPU, Memory chiplet, ADC, DDR5, Processor subsystem, Core, L1, Bus, SLC.]
Questions raised on the diagram:
• Optimal mesh size (m x n)?
• Best sample size (16 bytes vs 32 bytes, etc.)?
• Use a single protocol stack or a multi-protocol stack?
• Do we need PCIe Gen 6, or can Gen 5 still meet the application requirements?
VisualSim System Model using UCIe in ADAS SoC
Statistics for Multi-Die SoC
• Note the AI Engine latency spikes
• With multiple protocols, each protocol gets half the bandwidth.
• Older-generation protocols are mixed with PCIe 6.0.
• A smaller FLIT size increases latency.
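One plausible contributor to the FLIT-size observation is fixed per-FLIT overhead: a smaller FLIT carries less payload, so the same transfer needs more FLITs and more bytes on the wire. The sketch below illustrates that arithmetic only. The 16-byte overhead, the FLIT sizes and the function names are hypothetical, and the simulated statistics may reflect other effects as well.

```python
import math

def flits_needed(payload_bytes: int, flit_bytes: int, overhead_bytes: int) -> int:
    """Number of FLITs needed when each FLIT carries a fixed overhead (header, CRC)."""
    usable = flit_bytes - overhead_bytes
    if usable <= 0:
        raise ValueError("FLIT must be larger than its overhead")
    return math.ceil(payload_bytes / usable)

def wire_bytes(payload_bytes: int, flit_bytes: int, overhead_bytes: int) -> int:
    """Total bytes placed on the link, including per-FLIT overhead and padding."""
    return flits_needed(payload_bytes, flit_bytes, overhead_bytes) * flit_bytes

# 4096-byte transfer with an assumed 16 bytes of overhead per FLIT
for size in (64, 128, 256):
    print(size, flits_needed(4096, size, 16), wire_bytes(4096, size, 16))
```

At a fixed link rate, more bytes on the wire translate directly into a longer transfer time.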
Comparing Different Configurations using UCIe Interface
• All Die Adapters use PCIe 6.0
• Die Adapters use PCIe 6.0 and Streaming Protocols (AXI)
Lower latency when using PCIe 6.0
Mirabilis Design
VisualSim Architect
About Mirabilis Design
Engineering Solutions focused on innovation in electronics
Based in Silicon Valley, USA
Development and support centers in the US, India, Japan, China and the Czech Republic
60 large corporations, research centers and 73 universities as customers
Enabled 250 products in semiconductors, automotive, defense and space
VisualSim Architect provides system simulation and IP for hardware, software and networking
Mirabilis Design – Milestones
[Figure: milestone timeline. Years shown: 2003, 2005, 2008, 2010, 2011, 2013, 2015, 2018, 2019, 2020, 2021, 2022, 2023. Milestones shown: Company Incorporated; Hardware Modeling; Modeling Services; 1st Customer; Stochastic Modeling; Innovation Award; Integration API; 10th customer; Network Modeling; University Program; VisualSim Aerospace; Simulator of the Year; Best ESL at DAC; 2nd at Arm TechCon; VisualSim Automotive; Europe operations; Failure Analysis; Created Asia Team; Best Embedded Systems Presentation Award (DAC); SysML API; Requirements; New VisualSim; Best in Show, Embedded World; Communication System Designer; System Verilog and UPF/CPF Link.]
VisualSim Architect
• Cloud and Desktop
• Multi-simulation engine: Digital, Untimed & Continuous
• Library of Systems, Networks, Semi, FPGA & Software
• Generate statistics, documentation & traces
Algorithms, Protocol, AI Insight, Performance, Power, Functional, Stochastic, Scripting, Sim API
Performance: Latency, Throughput, Buffer occupancy
Power: Instant, Average, Cumulative, Heat, Temperature; Battery and power generation sizing
Functionality: Correctness, efficiency and Quality-of-Service
Failure Analysis and Functional Safety: Generate errors and test for compliance
Software Evaluation: Test quality of C++ and impact on system performance
System-level Modeling and Simulation Software that integrates requirements, exploration & verification
Over 500 Systems-Level IP Components
Comprehensive implementation-accurate Library
Traffic
• Distribution
• Sequence
• Trace file
• Instruction profile
Power
• State power table
• Power management
• Energy harvesters
• Battery
• RegEx operators
SoC Buses
• AMBA and Corelink
• AHB, APB, AXI, ACE, CHI, CMN600
• Network-on-Chip
• TileLink
System Bus
• PCI / PCI-X / PCIe
• Rapid IO
• AFDX
• OpenVPX
• VME
• SPI 3.0
• 1553B
ARM
• M-, R-, 7TDMI
• A8, A53, A55, A72, A76, A77, Neoverse
Custom Creator
• Script language
• 600 RegEx fn
• Task graph
• Tracer
• C/C++/Java
• Python
Stochastic
• FIFO/LIFO Queue
• Time Queue
• Quantity Queue
• System Resource
• Schedulers
• Cyber Security
Memory
• Memory Controller
• DDR DRAM 2, 3, 4, 5
• LPDDR 2, 3, 4
• HBM, HMC
• SDR, QDR, RDRAM
Networking
• Ethernet & GiE
• Audio-Video Bridging
• 802.11 and Bluetooth
• 5G
• Spacewire
• CAN-FD
• TTEthernet
• FlexRay
• TSN & IEEE802.1Q
• ARINC 664/AFDX
Interfaces
• Virtual Channel
• DMA
• Crossbar
• Serial Switch
• Bridge
Algorithms
• Signal Processing
• Analog
• Antenna
RTOS
• Template
• ARINC 653
• AUTOSAR
Storage
• Flash & NVMe
• Storage Array
• Disk and SATA
• Fibre Channel
• FireWire
Software
• GEM5
• Software code integration
• Instruction trace
• Statistical software model
• Task graph
RTL-Like
• Clock, Wire-Delay
• Registers, Latches
• Flip-flop
• ALU and FSM
• Mux, DeMux
• Lookup table
Processors
• GPU, DSP, µP and µC
• RISC-V
• SiFive u74
• Nvidia: Drive-PX
• PowerPC
• X86: Intel and AMD
• DSP: TI and ADI
• MIPS, Tensilica, SH
Reports
• Timing and Buffer
• Throughput/Util
• Ave/peak power
• Statistics
FPGA
• Xilinx: Zynq, Virtex, Kintex
• Intel: Stratix, Arria
• Microsemi: Smartfusion
• Programmable logic template
• Interface traffic generator


