Webinar:
ARM CoreLink, Arteris NoC, UCIe, Bunch of Wires, CXL and PCIe:
Designing the interconnect is not for the weak-hearted!
Host:
Deepak Shankar, Vice President Technology
Mirabilis Design Inc.
Email: dshankar@mirabilisdesign.com
Agenda
• Challenges
• VisualSim Solution
• Extending System Modeling Methodology
• Experiments
• The future
Explore and Measure using System-Level Exploration
[Figure: SoC block diagram with a NoC/UCIe interconnect, AI Engine tiles (warp scheduler, PEs, local memory), GPU, memory chiplet, ADC, DDR5 and a processor subsystem (core, L1, bus, SLC)]
Questions annotated on the diagram:
• Round-trip latency
• Which core is it: Neoverse, A720, RISC-V or Tensilica Lx8?
• Number and type of GPU and TPU cores
• What is the AI clock speed?
• Optimal mesh size
• Peak power; thermal heat and temperature management
• Number of ports and modules; interface buffer
• Interconnect speed
• Scheduling and assignment
• Throughput
• Buffer usage
• Use benchmarks, traffic, traces and workloads
Consider this SoC Architecture
Types of Experiments to be Conducted for an SoC Design
• Select the interconnect: AXI vs. NoC vs. crossbar vs. mesh?
• Assign NoC, AHB or ACE to each level of the hierarchy?
• Commercial or custom NoC development?
• Optimize MOESI coherence on a custom mesh to maximize the cache hit ratio?
• Monolithic die vs. multi-die chiplets?
• Impact of new power management on peak and total power?
• Flat vs. hierarchical topology to ensure maximum memory bandwidth?
• Integrate with SoC generation and configuration tools?
Enabling Customers with Full-Coverage Experiment
Flow: Block Diagram -> Model using System-level IP -> Parameters & constraints -> Regression Sweep -> Generate statistics and specification
BLOCK             METRICS           CONSTRAINT  CONSTRAINT VALUE  STATISTIC TYPE
Cache_I_1         A_Hit_Ratio       >=          0.7               All
Cache_d_1         A_Miss_Ratio      <           0.2               All
Cache_I_2         A_Number_Entered  >=          175               All
Cache_SLC         Buffer_Occupancy  <           6                 All
AXI_Top_Master_1  Read_Data_Bytes   >=          1.00E+07          All
CMN_XP            Buffer_Overflow   >=          10                All
Task_1            Latency           <           1.23E-06          Mean
Task_2            Latency           <           4.60E-03          Max
Task_3            Latency           <           6.00E-05          Min
Cache_SLC         Read_Hit_Ratio    >=          0.9               All
Read_MBs_per_Se
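A constraint table like the one above can be checked mechanically against measured statistics. The following is a minimal sketch of that idea, not VisualSim's own mechanism: the function name `evaluate` and the measured values are hypothetical, and only a few rows of the table are copied in.

```python
import operator

# Comparison operators that appear in the constraint table.
OPS = {">=": operator.ge, "<": operator.lt}

# (block, metric, operator, limit) copied from a few rows of the table.
CONSTRAINTS = [
    ("Cache_I_1", "A_Hit_Ratio", ">=", 0.7),
    ("Cache_d_1", "A_Miss_Ratio", "<", 0.2),
    ("Cache_SLC", "Buffer_Occupancy", "<", 6),
    ("Task_1", "Latency", "<", 1.23e-06),
]


def evaluate(constraints, measured):
    """Return {(block, metric): True/False/None}; None means no measurement."""
    results = {}
    for block, metric, op, limit in constraints:
        value = measured.get((block, metric))
        results[(block, metric)] = None if value is None else OPS[op](value, limit)
    return results


# Hypothetical measurements, invented only to exercise the function.
measured = {
    ("Cache_I_1", "A_Hit_Ratio"): 0.82,
    ("Cache_d_1", "A_Miss_Ratio"): 0.25,
    ("Task_1", "Latency"): 9.0e-07,
}

report = evaluate(CONSTRAINTS, measured)
for key, ok in report.items():
    print(key, ok)
```

Returning `None` for a missing measurement keeps "not measured" distinct from "violated", which matters when a sweep run does not exercise every block.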
Case Study: Data Center SoC
Design Challenges
1. What buffer size prevents overflow on the interconnect?
2. What memory throughput is required to meet the goals?
3. How many cores are required to meet the 33 ms response time?
4. Should power management be threshold-based, time-based, DVFS or utilization-based?
Project Goals
1. Data center SoC for neural network applications
2. Handle 30 million vertices/second
3. Power consumption < 40 W
4. ResNet-50 workload inference time < 3.2 seconds
VisualSim Solution
1. Library: ARM A720, Cache, HBM3, DMA, CMN Cyprus, Arteris NoC, UCIe, GPU, Sensor, Power State Machine
2. Custom model for AI braking and proximity test
3. Workload generator for data center and automotive applications
4. Flow control and scheduling algorithms
5. Performance and power report generators
Evaluation of Constraints
Project Outcome
1. Generated component list, clock speed, bus width, buffer size and flow control
2. Expected statistics for performance, correctness and power
3. Executable specification for customer architects to conduct trade-offs
Suggested Block Diagram
Statistics and Reports
BLOCK        METRICS                             MEASURED       STATISTIC TYPE  RESULT
AMBA-AXI     GPU_Read_Data_Bytes                 3,392,408,192  Max             TRUE
AMBA-AXI     DDR4_Bandwidth_Utilization          28%            Std Deviation   FALSE
NoC-CMN      System-Level Cache_Read_Data_Bytes  3,392,128,256  Mean            TRUE
NoC-Arteris  Read Buffer Channel Usage           32             Min             TRUE
Data Cache   Hit_Ratio                           89.148         Mean            TRUE
Data Cache   Latency                             4.79E-08       Mean            FALSE
Data Cache   GB/Second                           1.776          Min             TRUE
Processor    Context_Switch_Time                 16.83          Max             TRUE
Processor    Application Processing Delay        3.86E-06       Min             FALSE
Page_Table   Memory_Used_By_TLB                  128K           Min             FALSE
Cache        Bus Request Buffer_Occupanc         440            Min             FALSE
Processor    Processor_Utilization               50%            Max             TRUE
Thermal      Temperature                         65C            Mean            FALSE
Power        Peak Power for Chiplet Die 1        51W            Max             FALSE
Regression varying Parameters and Workloads
-Process_Node_nm 7 -Bus2_Clk_Speed 2000.0 -Core_Clk_Speed 2500.0
-Process_Node_nm 7 -Bus2_Clk_Speed 4000.0 -Core_Clk_Speed 4500.0
-Process_Node_nm 3 -Bus2_Clk_Speed 4000.0 -Core_Clk_Speed 4500.0
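Argument lines of this shape can be produced from a parameter grid. The sketch below is an assumption about how such a sweep might be scripted, not the regression tool's own interface: the parameter names come from the slide, while the value grid and the helper `sweep_lines` are illustrative.

```python
from itertools import product

# Parameter names are taken from the slide; the value grid is illustrative.
SWEEP = {
    "Process_Node_nm": [7, 3],
    "Bus2_Clk_Speed": [2000.0, 4000.0],
    "Core_Clk_Speed": [2500.0, 4500.0],
}


def sweep_lines(grid):
    """Yield one '-name value' argument string per parameter combination."""
    names = list(grid)
    for values in product(*(grid[n] for n in names)):
        yield " ".join(f"-{n} {v}" for n, v in zip(names, values))


lines = list(sweep_lines(SWEEP))
for line in lines:
    print(line)
```

A full-factorial grid over these values yields eight combinations; the slide shows three selected runs, so a real sweep would likely filter or hand-pick from the grid.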
VisualSim Solution
The Product: VisualSim with libraries
The Offerings: Quickstart Training, Modeling services, Analysis and insight, Integration
VisualSim System-Level IP Library
• Stochastic and Software: Quantity and Time Queue; System Resources; Scheduler; RTOS Builder, ARINC 653, AUTOSAR; Task Graph, Workload Builder
• Traffic: Custom Builder; Distribution- and Trace-based; Sensors, VCD, Network, Sequence
• Scripting language: RegEx; C/C++/Java/Python Wrapper
• Statistics: Latency, Throughput, Utilization, hit-ratio; Ave/peak power (instant, ave); Heat, Temp
• Systems and Networks: TSN, AVB, 10BaseT1S, Switched Ethernet, Resilient Packet Ring, RP3, WiFi 802.11, Bluetooth, PAN, Spacewire, SpaceFibre, IEEE802.1Q, Time-Triggered Ethernet, AFDX, 5G; VME, PCI/PCI-X/PCIe 6.0, CXL, SPI 3.0, 1553B, FlexRay, CAN-FD/XL, AFDX, TTEthernet, OpenVPX
• SoC Compute, Interconnect and Hardware: AMBA (AHB/APB/AXI/CHI), Tilelink, Corelink (600, 700), NoC (Generic, Arteris), Virtual Channel, DMA, Crossbar, Serial Switch, Bridge, UCIe; CPU, DSP, GPU, TPU, MCU; ARM (M0-55), R5, Cortex (A8, A72, A53, A76, A77, A65, A78, A720, Neoverse V and X), Nvidia- Pascal to Ampere, Leon, Power, X86, DSP & ADI- TI, Tensilica- Lx8, Renesas, AI; RISC-V SiFive; In-Order/Out-of-Order
• Storage and Memory: Flash, NVMe, Disk, SSD, NAS, Fibre Channel, FireWire, HBM3.0, HMC; Memory Controller, Disk, SDR DRAM 2-5, LPDDR 2-5-X, SSD, QDR, RDRAM, MPMC, Cache, Coherent cache
• FPGA: Xilinx- Versal, Zynq, Ultrascale, Kintex; Altera- Stratix, Arria; Microsemi- Smartfusion; Programmable logic generator
• Power: Power States, Allocation, Transition, Loss, Battery, Consumption, Management, Generation, Distribution and Thermal
• Communication: RF Tx/Rx, Baseband, Channels, Analog, A/D transceivers, Antenna; Signal/audio/Image algorithms
Complete SoC System Model Solution in VisualSim
• Reuse IP
• Define Hierarchy
• Parameters
• Power & Thermal
• Metrics Capture
• Custom IP Builder
• Debugging & Profiling
• Plotting
Three Levels of NoC Modeling
1. Stochastic or queuing theory-based
• Focus on overall latency and throughput without a specific implementation
2. Hybrid, which is cycle-accurate but not fully pipelined
• Vendor-specific products, but without the detailed underlying registers and algorithms
• Most often combines micro-architecture models for processors, cache and memory with a slightly more abstract interconnect
3. Micro-architecture
• Detailed implementation of a vendor-specific or custom product
• All modeling devices are functionally accurate
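To make the first, stochastic level concrete, here is a minimal sketch using the textbook M/M/1 queue, where mean time in the system is 1 / (service rate - arrival rate). This is one common queuing-theory abstraction chosen for illustration, not necessarily the model VisualSim uses; the rates below are invented.

```python
def mm1_latency(arrival_rate, service_rate):
    """Mean time in system (waiting + service) for an M/M/1 queue, in seconds."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable queue: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)


def mm1_utilization(arrival_rate, service_rate):
    """Fraction of time the server (for example, a link) is busy."""
    return arrival_rate / service_rate


# Illustrative numbers only: 10 million packets/s offered to a link
# that is assumed to serve 20 million packets/s on average.
arrival_rate = 1.0e7
service_rate = 2.0e7

latency = mm1_latency(arrival_rate, service_rate)
util = mm1_utilization(arrival_rate, service_rate)
print(f"mean latency = {latency:.3e} s, utilization = {util:.0%}")
```

A model at this level yields latency and throughput trends quickly, at the cost of ignoring arbitration, virtual channels and other implementation details that the hybrid and micro-architecture levels capture.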
Stochastic NoC- Flow Control Modeling
Hybrid NoC Modeling- Vendor-specific Arteris NoC
Micro-Architecture Modeling of the Custom NoC
Statistics
{NIU_INIU_00_Flits_Initiated = 100,
NIU_INIU_01_Request_Throughput_MBps = 800.0008,
NIU_INIU_02_Response_Throughput_MBps = 692.000,
NIU_INIU_03_Read_Request_Initiated = 100,
NIU_INIU_04_Read_Response_received = 43,
NIU_INIU_05_Write_Request_Initiated = 0,
NIU_INIU_06_Write_Response_received = 0,
NIU_INIU_07_Total_Read_Request_Bytes = 800,
NIU_INIU_08_Total_Write_Request_Bytes = 0,
NIU_INIU_09_Total_Request_Bytes = 800,
NIU_INIU_10_Total_Response_Bytes = 692,
NIU_INIU_13_Request_Buffer_overflow = 0,
NIU_INIU_14_Packets_Waiting_in_Request_Buffer = 0,
NIU_INIU_15_Packets_Waiting_in_ROB_Buffer = 0}
{NIU_TNIU_00_Flits_Responded = 178,
NIU_TNIU_01_Response_Throughput_MBps = 712.000,
NIU_TNIU_02_Request_Throughput_MBps = 1600.00,
NIU_TNIU_03_Read_Response_completed = 92.0,
NIU_TNIU_04_Read_Request_received = 200.0,
NIU_TNIU_05_Write_Response_completed = 0.0,
NIU_TNIU_06_Write_Request_received = 0.0,
NIU_TNIU_07_Total_Request_Bytes = 1600,
NIU_TNIU_08_Total_Response_Bytes = 712,
NIU_TNIU_09_Response_Buffer_overflow = 0,
NIU_TNIU_10_Packets_Waiting_in_Response_Buffer = 6,
NIU_TNIU_11_Packets_Waiting_in_ROB_Buffer = 0}
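Statistics dumps in this `Name = value` form are easy to post-process. The sketch below assumes the dump is available as plain text; `parse_stats` is a hypothetical helper, and the ratio it computes is only a snapshot of how many read requests had received responses when the statistics were taken.

```python
import re


def parse_stats(text):
    """Parse 'Name = value' pairs from a statistics dump into a dict of floats."""
    pairs = re.findall(r"(\w+)\s*=\s*([-+0-9.eE]+)", text)
    return {name: float(value) for name, value in pairs}


# A few initiator-side lines copied from the statistics above.
dump = """{NIU_INIU_03_Read_Request_Initiated = 100,
NIU_INIU_04_Read_Response_received = 43,
NIU_INIU_09_Total_Request_Bytes = 800,
NIU_INIU_10_Total_Response_Bytes = 692}"""

stats = parse_stats(dump)
completed = (stats["NIU_INIU_04_Read_Response_received"]
             / stats["NIU_INIU_03_Read_Request_Initiated"])
print(f"read responses received so far: {completed:.0%}")
```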
Debugging
Master Flow control block (Master_1) ::: adding packet to the Queue
Packet details ::: ID = 99, A_Address = 5568517352L,
Master Flow control block (Master_1) ::: sending packet out
Packet details ::: ID = 99, A_Address = 5568517352L,
Master Flow control block (Master_2) ::: adding packet to the Queue
Packet details ::: ID = 99, A_Address = 5154753492L,
Master Flow control block (Master_2) ::: sending packet out
Packet details ::: ID = 99, A_Address = 5154753492L,
Master Flow control block (Master_1) ::: adding packet to the Queue
Packet details ::: ID = 100, A_Address = 5358742764L,
Master Flow control block (Master_1) ::: sending packet out
Packet details ::: ID = 100, A_Address = 5358742764L,
Master Flow control block (Master_2) ::: adding packet to the Queue
Packet details ::: ID = 100, A_Address = 2671326672L,
Master Flow control block (Master_2) ::: sending packet out
Packet details ::: ID = 100, A_Address = 2671326672L,
Tracing the Activity
Time_Array = {4.305E-7, 4.33E-7, 4.34E-7, 4.34E-7, 4.34E-7,
4.35E-7, 4.35E-7, 4.35E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7,
4.36E-7, 4.36E-7, 4.38E-7, 4.38E-7, 4.39E-7, 4.39E-7, 4.39E-7, 4.4E-7,
4.4E-7, 4.4E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7,
9.7286E-7, 9.7286E-7, 9.86923E-7, 9.86923E-7, 9.86923E-7, 9.89E-7,
9.9E-7, 9.91E-7, 9.92E-7, 9.92E-7, 9.92E-7, 9.92E-7, 9.94E-7, 9.95E-7,
9.96E-7},
Trace_Array = {"INIU2_Request_Queue_in",
"MUX_1_Port_2_in", "MUX_1_out", "Buffer_1_in",
"Buffer_Buffer_1_in", "Buffer_Buffer_1_out", "Buffer_1_out",
"DEMUX_1_in", "DEMUX_1_out", "TNIU_Req_in",
"TNIU_ROB_Queue_in", "TNIU_ROB_Queue_out", "TNIU_Req_out",
"INIU3_Req_in", "INIU3_Request_Queue_in", "INIU3_Req_out",
"MUX_3_Port_1_in", "MUX_3_out", "Buffer_3_in",
"Buffer_Buffer_3_in", "Buffer_Buffer_3_out", "Buffer_3_out",
"DEMUX_3_in", "DEMUX_3_out", "TNIU3_Req_in",
"TNIU3_ROB_Queue_in", "TNIU3_ROB_Queue_out",
"TNIU3_Req_out", "DRAM_in", "LPDDR_Scheduler_in",
"LPDDR_Scheduler_out", "DRAM_out", "TNIU3_Resp_in",
"TNIU3_Response_Queue_in", "TNIU3_Resp_out",
"Buffer_Buffer_4_in", "Buffer_Buffer_4_out", "INIU3_Resp_in",
"INIU3_Resp_out", "TNIU_Resp_in", "TNIU_Response_Queue_in",
"TNIU_Resp_out", "Buffer_Buffer_2_in", "Buffer_Buffer_2_out"}}
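Time_Array and Trace_Array each hold 44 entries, so one way to read them is as matching pairs: the i-th timestamp is when the packet reached the i-th trace point. Under that assumption, the sketch below computes the delay between consecutive trace points for the first six entries; `hop_delays` is a hypothetical helper.

```python
# First entries of Time_Array and Trace_Array from the trace above.
time_array = [4.305e-7, 4.33e-7, 4.34e-7, 4.34e-7, 4.34e-7, 4.35e-7]
trace_array = ["INIU2_Request_Queue_in", "MUX_1_Port_2_in", "MUX_1_out",
               "Buffer_1_in", "Buffer_Buffer_1_in", "Buffer_Buffer_1_out"]


def hop_delays(times, points):
    """Return (from_point, to_point, delta_seconds) for consecutive trace points."""
    if len(times) != len(points):
        raise ValueError("time and trace arrays must have the same length")
    return [(points[i], points[i + 1], times[i + 1] - times[i])
            for i in range(len(times) - 1)]


hops = hop_delays(time_array, trace_array)
slowest = max(hops, key=lambda h: h[2])
for src, dst, dt in hops:
    print(f"{src} -> {dst}: {dt * 1e9:.2f} ns")
```

In this excerpt the largest gap is the 2.5 ns spent between `INIU2_Request_Queue_in` and `MUX_1_Port_2_in`. Running the same calculation on the full arrays would expose the much larger jump between the 4.41E-7 and 9.7286E-7 timestamps.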
ARM Socrates Tool Output
VisualSim Model- Imported using Configuration files
Model Statistics
Cache_SLC_1_A_Hit_Ratio = 0.0,
Cache_SLC_1_A_Miss_Ratio = 100.0,
Cache_SLC_1_A_Number_Entered = 1386L,
Cache_SLC_1_A_Number_Returned = 1386L,
Cache_SLC_1_A_Prefetch_Completed = 1L,
Cache_SLC_1_A_Prefetch_Issued = 1L,
Cache_SLC_1_A_Prefetch_Useful = 0L,
Cache_SLC_1_Buffer_Occupancy = 0,
Cache_SLC_1_Buffer_Overflow = 0L,
Cache_SLC_1_Latency_Avg = 1.6392367965368E-7,
Cache_SLC_1_Latency_Max = 4.3999899999999E-7,
Cache_SLC_1_Latency_Min = 5.7111999999999E-8,
Cache_SLC_1_Read_MBs_per_Second = 837.1201674240335,
Cache_SLC_1_Total_Cache_Lines_Evicted = 0L,
Cache_SLC_1_Total_Cache_Lines_Write_Backed = 0L,
Cache_SLC_1_Total_MBs = 0.177472,
Cache_SLC_1_Total_MBs_per_Second = 3549.440709888142,
Cache_SLC_1_Utilization = 2.7733338880003
MC_DRAM_DRAM_1_00_Total_Requests = 136,
MC_DRAM_DRAM_1_01_Completed_Requests = 136,
MC_DRAM_DRAM_1_02_Total_MB_per_Second = 87.1179217539336,
MC_DRAM_DRAM_1_03_Total_Bytes = 8704,
MC_DRAM_DRAM_1_04_Read_Bytes = 8704,
MC_DRAM_DRAM_1_06_Read_MB_per_Second = 87.1179217539336,
MC_DRAM_DRAM_1_10_Max_Queue_Usage = 2,
MC_DRAM_DRAM_1_12_Queue_Removal_Position = {136, 0, 0, 0},
MC_DRAM_DRAM_1_21_Total_Activates = 148,
MC_DRAM_DRAM_1_22_Total_Precharges = 147,
MC_DRAM_DRAM_1_23_Total_RRD_L_S = {{0, 0}, {12, 0}},
MC_DRAM_DRAM_1_24_Total_CCD_L_S = {{135, 135}, {0, 0}},
MC_DRAM_DRAM_1_25_Total_WTR_L_S = {{0, 0}, {0, 0}},
MC_DRAM_DRAM_1_26_Total_RTP_WR_RAS_RTW = {{135, 132}, {0, 0}, {135, 122}, {0, 0}},
MC_DRAM_DRAM_1_27_Refresh_Percent = 1.920016,
MC_DRAM_DRAM_1_28_DRAM_Delay_Min = 1.4416999999991E-8,
MC_DRAM_DRAM_1_29_DRAM_Delay_Max = 1.4417000000004E-8,
MC_DRAM_DRAM_1_30_DRAM_Delay_Mean = 1.4417E-8,
MC_DRAM_DRAM_1_31_DRAM_Delay_StDev = 6.9311872118497E-16}
PCIe_Switch_PCIe_Switch_1_Port_1_Rx_MBps = 870.2400087024001,
PCIe_Switch_PCIe_Switch_1_Port_1_Total_MBps = 1740.4800174048003,
PCIe_Switch_PCIe_Switch_1_Port_1_Tx_MBps = 870.2400087024001,
PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Max_Latency = 8.6170000000051E-9,
PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Mean_Latency = 4.8814421768711E-9,
PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Min_Latency = 4.7219999999969E-9,
{CMN600_RND_1_Max_End_to_End_Latency = 6.55668E-7,
CMN600_RND_1_Max_Network_Latency = 8.4334000000002E-8,
CMN600_RND_1_Mean_End_to_End_Latency = 3.6446105555556E-7,
CMN600_RND_1_Mean_Network_Latency = 5.6137656565657E-8,
CMN600_RND_1_Min_End_to_End_Latency = 2.37999E-7,
CMN600_RND_1_Min_Network_Latency = 4.9999999999994E-8,
CMN600_RND_2_Max_End_to_End_Latency = 4.7366499999999E-7,
CMN600_R_0_0_EAST_In_Buffer_Max_Buffer_Occupancy = 18.0,
CMN600_R_0_0_SOUTH_In_Buffer_Max_Buffer_Occupancy = 8.0,
CMN600_R_0_1_EAST_In_Buffer_Max_Buffer_Occupancy = 11.0,
CMN600_R_0_1_SOUTH_In_Buffer_Max_Buffer_Occupancy = 21.0,
CMN600_R_0_1_WEST_In_Buffer_Max_Buffer_Occupancy = 5.0,
DS_Name = "CMN600_Stats"}
{ACE_Bus_1_Master_1_Read_Data_Bytes = 7936,
ACE_Bus_1_Master_1_Read_Data_MBps = 79.3600007936,
ACE_Bus_1_Slave_1_BW_Utilization_Prct = 0.6613333399467,
ACE_Bus_1_Slave_1_Read_Data_Bytes = 7936,
ACE_Bus_1_Slave_1_Read_Data_MBps = 79.3600007936}
{ACE_Bus_1_Slave_1_Rd_Threshold_Usage = 2.0,
ACE_Bus_1_Slave_1_Rd_Transactions = 124}
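A quick sanity check on reported statistics is to divide byte counts by transaction counts. The sketch below does this for the DRAM controller and the ACE bus using values copied from the statistics above; the helper name is hypothetical.

```python
# Values copied from the statistics above.
dram_total_bytes = 8704    # MC_DRAM_DRAM_1_03_Total_Bytes
dram_requests = 136        # MC_DRAM_DRAM_1_01_Completed_Requests
ace_read_bytes = 7936      # ACE_Bus_1_Slave_1_Read_Data_Bytes
ace_transactions = 124     # ACE_Bus_1_Slave_1_Rd_Transactions


def bytes_per_transaction(total_bytes, count):
    """Average payload size per transaction."""
    if count == 0:
        raise ValueError("no transactions recorded")
    return total_bytes / count


dram_size = bytes_per_transaction(dram_total_bytes, dram_requests)
ace_size = bytes_per_transaction(ace_read_bytes, ace_transactions)
print(dram_size, ace_size)
```

Both ratios come out to 64 bytes. That would be consistent with cache-line-sized transfers, although the slides do not state the line size.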
Reference Data:
High-Performance Multi-Die Exploration
Block Diagram and VisualSim Model
Experiments
Experiment configurations (traffic rate = 1.0e-7 in all experiments)
Exp  Cores per Cluster  Addresses   AXI and L2 Cache (MHz)  DRAM and Memory Controller (MHz)  CMN Frequency (MHz)  CMN Buffer Capacity  AXI Read and Write Threshold
1    8                  Sequential  4200                    2400                              1200                 1000                 250
2    8                  Random      3200                    1200                              1200                 1000                 250
3    4                  Sequential  3200                    1200                              1200                 1000                 250
4    8                  Sequential  3200                    1200                              1200                 1000                 250
5    8                  Sequential  3200                    1200                              1200                 500                  100
Experiment results
Exp  Mean Latency  Mean RNF Latency  DRAM MBps  LLC Cache MBps  L2 Cache Buffer Overflow  Maximum UCIe MBps  Maximum Power (Watts)
1    4.105e-7      1.355e-8          728.903    3200.006        461L                      12481.600          22
2    9.8718e-7     1.72e-8           663.212    2432.004        367L                      12816.001          17
3    6.136e-7      1.250e-8          663.212    3200.006        293L                      12475.200          17
4    5.06e-7       1.249e-8          663.212    2688.005        464L                      12558.400          17
5    7.395e-7      1.28e-8           663.212    2688.005        0L                        12176.001          17
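The trade-offs in the Experiments table above can be explored by filtering on a limit and ranking by latency. The sketch below copies three result columns from the table; the `rank` helper and the chosen limits are illustrative assumptions, not project requirements.

```python
# Result columns copied from the Experiments table.
EXPERIMENTS = [
    {"exp": 1, "mean_latency": 4.105e-7, "l2_overflow": 461, "max_power_w": 22},
    {"exp": 2, "mean_latency": 9.8718e-7, "l2_overflow": 367, "max_power_w": 17},
    {"exp": 3, "mean_latency": 6.136e-7, "l2_overflow": 293, "max_power_w": 17},
    {"exp": 4, "mean_latency": 5.06e-7, "l2_overflow": 464, "max_power_w": 17},
    {"exp": 5, "mean_latency": 7.395e-7, "l2_overflow": 0, "max_power_w": 17},
]


def rank(rows, max_overflow=None, max_power_w=None):
    """Drop rows above the optional limits, then sort by mean latency (lowest first)."""
    kept = [r for r in rows
            if (max_overflow is None or r["l2_overflow"] <= max_overflow)
            and (max_power_w is None or r["max_power_w"] <= max_power_w)]
    return sorted(kept, key=lambda r: r["mean_latency"])


fastest = rank(EXPERIMENTS)[0]["exp"]
no_overflow = [r["exp"] for r in rank(EXPERIMENTS, max_overflow=0)]
low_power = [r["exp"] for r in rank(EXPERIMENTS, max_power_w=17)]
print(fastest, no_overflow, low_power)
```

Experiment 1 has the lowest mean latency but the highest power, while Experiment 5 is the only configuration that reports no L2 cache buffer overflow.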
RNF Latencies
RNF latency is reported as the maximum end-to-end latency per RNF.
Exp 1: Traffic rate = 1.0e-7, 8 cores per cluster, sequential addresses, AXI and L2 Cache = 4200 MHz, DRAM and Memory Controller = 2400 MHz, CMN frequency = 1200 MHz, CMN buffer capacity = 1000, AXI read and write threshold = 250
[ RNF_1 = 4.9998000000001E-9, RNF_2 = 6.6664000000001E-9, RNF_3 = 4.9998000000001E-9, RNF_4 = 1.04988E-8, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.49994E-8, RNF_8 = 1.33919E-8, RNF_9 = 1.6786E-8, RNF_10 = 1.33328E-8, RNF_11 = 9.9996E-9, RNF_12 = 1.40361E-8, RNF_13 = 7.36696E-8, RNF_14 = 6.95799E-8, RNF_15 = 6.94706E-8, RNF_16 = 6.76522E-8 ]
Exp 2: Traffic rate = 1.0e-7, 8 cores per cluster, random addresses, AXI and L2 Cache = 3200 MHz, DRAM and Memory Controller = 1200 MHz, CMN frequency = 1200 MHz, CMN buffer capacity = 1000, AXI read and write threshold = 250
[ RNF_1 = 4.9998000000001E-9, RNF_2 = 1.33328E-8, RNF_3 = 5.6658E-9, RNF_4 = 7.582E-9, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.49994E-8, RNF_8 = 1.39988E-8, RNF_9 = 2.12506E-8, RNF_10 = 1.33328E-8, RNF_11 = 1.83321E-8, RNF_12 = 2.97902E-8, RNF_13 = 1.249314E-7, RNF_14 = 1.366519E-7, RNF_15 = 7.03419E-8, RNF_16 = 8.58516E-8 ]
Exp 3: Traffic rate = 1.0e-7, 4 cores per cluster, sequential addresses, AXI and L2 Cache = 3200 MHz, DRAM and Memory Controller = 1200 MHz, CMN frequency = 1200 MHz, CMN buffer capacity = 1000, AXI read and write threshold = 250
[ RNF_1 = 4.9998000000001E-9, RNF_2 = 6.6664000000001E-9, RNF_3 = 4.9998000000001E-9, RNF_4 = 6.6664000000002E-9, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.55829E-8, RNF_8 = 1.33328E-8, RNF_9 = 1.70836E-8, RNF_10 = 1.33328E-8, RNF_11 = 9.9996E-9, RNF_12 = 1.41656E-8, RNF_13 = 6.80913E-8, RNF_14 = 7.32668E-8, RNF_15 = 8.03479E-8, RNF_16 = 6.83493E-8 ]
Exp 4: Traffic rate = 1.0e-7, 8 cores per cluster, sequential addresses, AXI and L2 Cache = 3200 MHz, DRAM and Memory Controller = 1200 MHz, CMN frequency = 1200 MHz, CMN buffer capacity = 1000, AXI read and write threshold = 250
[ RNF_1 = 4.9998e-09, RNF_2 = 6.6664e-09, RNF_3 = 4.9998e-09, RNF_4 = 6.6664e-09, RNF_5 = 1.21021e-08, RNF_6 = 1.07689e-08, RNF_7 = 1.49994e-08, RNF_8 = 1.41021e-08, RNF_9 = 1.70836e-08, RNF_10 = 1.36022e-08, RNF_11 = 9.9996e-09, RNF_12 = 1.33515e-08, RNF_13 = 4.20844e-08, RNF_14 = 6.99336e-08, RNF_15 = 7.03259e-08, RNF_16 = 6.84963e-08 ]
Exp 5: Traffic rate = 1.0e-7, 8 cores per cluster, sequential addresses, AXI and L2 Cache = 3200 MHz, DRAM and Memory Controller = 1200 MHz, CMN frequency = 1200 MHz, CMN buffer capacity = 500, AXI read and write threshold = 100
[ RNF_1 = 4.9998e-09, RNF_2 = 6.6664e-09, RNF_3 = 4.9998e-09, RNF_4 = 6.6664e-09, RNF_5 = 1.21021e-08, RNF_6 = 1.07689e-08, RNF_7 = 1.49994e-08, RNF_8 = 1.41021e-08, RNF_9 = 1.70836e-08, RNF_10 = 1.36022e-08, RNF_11 = 9.9996e-09, RNF_12 = 1.33515e-08, RNF_13 = 4.20844e-08, RNF_14 = 6.99336e-08, RNF_15 = 7.03259e-08, RNF_16 = 6.84963e-08 ]
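Per-RNF latencies like these can be scanned for nodes that stand out from the rest. The sketch below uses the Experiment 1 values and flags any RNF whose latency exceeds three times the median; the threshold of three is an arbitrary illustrative choice.

```python
from statistics import median

# Experiment 1 maximum end-to-end latencies, in seconds, indexed RNF_1..RNF_16.
EXP1 = [4.9998000000001e-9, 6.6664000000001e-9, 4.9998000000001e-9, 1.04988e-8,
        1.16662e-8, 9.9996e-9, 1.49994e-8, 1.33919e-8,
        1.6786e-8, 1.33328e-8, 9.9996e-9, 1.40361e-8,
        7.36696e-8, 6.95799e-8, 6.94706e-8, 6.76522e-8]


def outliers(latencies, factor=3.0):
    """Return names of RNFs whose latency exceeds factor * median."""
    limit = factor * median(latencies)
    return [f"RNF_{i}" for i, value in enumerate(latencies, start=1) if value > limit]


slow_nodes = outliers(EXP1)
worst = max(range(len(EXP1)), key=EXP1.__getitem__) + 1
print(slow_nodes, f"worst = RNF_{worst}")
```

For Experiment 1 this singles out RNF_13 to RNF_16, whose latencies are several times those of the other twelve nodes. The slides do not say why, so the cause (for example, placement in the mesh) would need to be checked in the model.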
Latency of Processor Request Per Cluster
[Plots: Latency_Cluster_1 for Experiments 1 to 4; Latency_Cluster_6 for Experiment 5]
64 Tile SoC
• The 64-tile SoC has 8 clusters, and each cluster contains 8 cores.
• Addresses are generated either randomly or sequentially. The resulting packets travel to the cache over the AXI bus.
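The two address patterns can be sketched as follows. This is an illustration of sequential versus random address generation in general, not the VisualSim traffic generator: the 64-byte stride, the base address and the address span are assumptions that the slide does not state.

```python
import random

CLUSTERS = 8
CORES_PER_CLUSTER = 8
LINE_BYTES = 64  # assumed request size; not stated on the slide


def sequential_addresses(base, count, stride=LINE_BYTES):
    """Addresses that advance by a fixed stride from the base."""
    return [base + i * stride for i in range(count)]


def random_addresses(base, span, count, rng, stride=LINE_BYTES):
    """Stride-aligned addresses drawn uniformly from [base, base + span)."""
    slots = span // stride
    return [base + rng.randrange(slots) * stride for _ in range(count)]


rng = random.Random(0)  # fixed seed so a run can be repeated
seq = sequential_addresses(0x1000, 4)
rnd = random_addresses(0x1000, 1 << 20, 4, rng)
total_cores = CLUSTERS * CORES_PER_CLUSTER
print(total_cores, [hex(a) for a in seq], [hex(a) for a in rnd])
```

Sequential streams tend to favor caches and DRAM row buffers, while random streams stress them, which is why the experiments vary this setting.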
Interconnect Architecture
• The 64-core SoC die is connected via 4 UCIe ports to two other dies.
• Dies 2 and 3 contain an SLC cache and the DRAMs.
Middle Die Power Setup
Power Exploration
[Plots: Power_DIE_1 for Experiments 1 to 5]
System-Level Verification of Automotive and Defense Systems
System Integration of SoC using Chiplet
System Overview
• Gateway: transfers messages between the different CAN and TSN networks
• CAN bus: the network that connects sensors and ECUs
• TSN switch: the TSN bus is the network that connects high-performance cameras, lidars and servers
[Figure: automotive network with Wheels 1 to 4, gateway, CAN buses, TSN bus, ECU, engine, proximity sensor, brake pedal, gyro sensor and road condition sensor]
Embedding the SoC in an Automotive Application for Testing
Evaluation of Chiplet Performance in Automotive Application
Power Modeling and Future Innovation
1. Power Generation
• 4 types of power generators in VisualSim: constant, variable, motor, solar charge
• Charge sent to battery
2. Power Storage
• Different charging schemes
• Impact of surge and shocks
• Battery lifecycle
• Battery consumption
• Statistics
3. Power Consumption
• State-based power consumption of electronics (controller, SoC) and mechanical parts (brakes, wheels)
• Average, instant and cumulative
• Power per device and application
4. Power Management
• Change in power state controlled by time, utilization, temperature and expected activity
5. Thermal Management
• Heat and temperature
• Impact of cooling strategy
• Add impact of power spikes
6. Verification and Debugging
• Optimize and test the power management algorithms
• Sizing of power generators and battery
• Estimate power consumed by the software application
7. Downstream Integration
• Integrate with physical hardware
• Generate UPF file with power domains
• Generate SystemVerilog power testbench
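State-based power consumption reduces to a simple calculation: each device draws a fixed power in each state, and energy is that power multiplied by the time spent in the state. The sketch below shows the arithmetic with an invented power table and state trace; none of the numbers come from the slides.

```python
# Hypothetical power table (watts per state); values are invented.
POWER_TABLE_W = {"Active": 2.0, "Idle": 0.5, "Off": 0.0}

# Hypothetical state trace: (state, seconds spent in that state).
trace = [("Active", 0.004), ("Idle", 0.010), ("Active", 0.002), ("Off", 0.004)]


def energy_and_power(state_trace, table):
    """Return (cumulative energy in joules, average power in watts, peak power in watts)."""
    energy = sum(table[state] * seconds for state, seconds in state_trace)
    duration = sum(seconds for _, seconds in state_trace)
    peak = max(table[state] for state, _ in state_trace)
    return energy, energy / duration, peak


energy_j, avg_w, peak_w = energy_and_power(trace, POWER_TABLE_W)
print(f"energy = {energy_j:.4f} J, average = {avg_w:.2f} W, peak = {peak_w:.1f} W")
```

A power management policy changes the trace (for example, entering Idle sooner), so comparing policies amounts to comparing these three numbers across traces.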
Generate Power and Thermal Characteristic
Behavior Task Graph
Power Table
Power management Unit
SystemVerilog Output for Power System Test
VCD Waveform for Verification
create_power_domain PD_Top -include_scope
create_power_domain -name PD_1_2.0 -elements {"CLKMUX"}
create_power_domain -name PD_1_1.0 -elements {"PLL","G2","G3"}
create_power_domain -name PD_1_3.0 -elements {"PROC"}
create_supply_port -port VDD_1.0 -direction in -domain PD_Top
create_supply_port -port VDD_2.0 -direction in -domain PD_Top
create_supply_port -port VDD_3.0 -direction in -domain PD_Top
create_supply_port -port VSS_0.0 -direction in -domain PD_Top
create_supply_net VDD_1.0 -domain PD_Top
create_supply_net VDD_2.0 -domain PD_Top
create_supply_net VDD_3.0 -domain PD_Top
create_supply_net VSS_0.0 -domain PD_Top
connect_supply_net VDD_1.0 -ports VDD_1.0
connect_supply_net VDD_2.0 -ports VDD_2.0
connect_supply_net VDD_3.0 -ports VDD_3.0
connect_supply_net VSS_0.0 -ports VSS_0.0
add_power_state PD_1_2.0 -state Active {-supply_expr (VDD_2.0 == {ON, 2.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_2.0 -state OFF {-supply_expr (VDD_2.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_1.0 -state Active {-supply_expr (VDD_1.0 == {ON, 1.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_1.0 -state OFF {-supply_expr (VDD_1.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_3.0 -state Active {-supply_expr (VDD_3.0 == {ON, 3.0}) && (VSS_0.0 == {ON, 0.0})}
add_power_state PD_1_3.0 -state OFF {-supply_expr (VDD_3.0 == {OFF, 0.0}) && (VSS_0.0 == {ON, 0.0})}
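The power-domain lines above have a regular shape, so they could be emitted from a small description of which elements belong to which domain. The sketch below reproduces only that text shape; it is not the VisualSim generator, and the output has not been validated against IEEE 1801 or any UPF tool. The function `upf_domains` is hypothetical.

```python
def upf_domains(domains, top="PD_Top"):
    """Return create_power_domain lines shaped like the listing above."""
    lines = [f"create_power_domain {top} -include_scope"]
    for name, elements in domains.items():
        quoted = ",".join(f'"{element}"' for element in elements)
        lines.append(f"create_power_domain -name {name} -elements {{{quoted}}}")
    return lines


# Domain-to-element mapping copied from the listing.
DOMAINS = {
    "PD_1_2.0": ["CLKMUX"],
    "PD_1_1.0": ["PLL", "G2", "G3"],
    "PD_1_3.0": ["PROC"],
}

lines = upf_domains(DOMAINS)
print("\n".join(lines))
```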
Power Modeling Integration
AI-based Simulation for Rapid System Exploration
• Run number 19, with the clock frequency at 1000 MHz, satisfied the performance requirements we had set.
• Because the frequency was increased from 600 MHz, total power consumption went up when running the system at 1000 MHz.
• Architects with stringent power thresholds can evaluate different processing resources: DSP vs. Xeon cores vs. Power cores.
Requirements being evaluated for each simulation run in the parameter sweep
Overall results: we can identify the simulation runs that meet the requirements and select the right configuration after considering cost vs. performance trade-offs.
System Verification
• Generate test cases and compare against RTL
• Performance, power and functionality
• Validate the product, not just the HW/SW
• Application-relevant test vectors
• Link to board, emulators and instruments
[Figure labels: Golden Reference, Comparator, Match Tag, Architecture model of IP, Verilog/C/emulation]
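One reading of the comparator in this flow is that outputs from the architecture model (the golden reference) and from the Verilog/C/emulation side are matched by tag and then compared. The sketch below illustrates that idea under this assumption; the function, the tags and the values are all hypothetical.

```python
def compare_by_tag(golden, dut):
    """Compare two {tag: value} maps and classify every tag."""
    report = {"match": [], "mismatch": [], "missing_in_dut": [], "unexpected_in_dut": []}
    for tag in sorted(golden.keys() | dut.keys()):
        if tag not in dut:
            report["missing_in_dut"].append(tag)
        elif tag not in golden:
            report["unexpected_in_dut"].append(tag)
        elif golden[tag] == dut[tag]:
            report["match"].append(tag)
        else:
            report["mismatch"].append(tag)
    return report


# Invented outputs keyed by transaction tag.
golden = {1: 0x10, 2: 0x20, 3: 0x30}
dut = {1: 0x10, 2: 0x21, 4: 0x40}

report = compare_by_tag(golden, dut)
print(report)
```

Matching by tag rather than by arrival order lets the two sides complete transactions in different orders without producing false mismatches.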
Conclusion:
Enable Better Products
• One product, one model for power, performance and functionality
• Complete library of system-level IP and IP builders for software, systems, networks and missions
• AI-based regression tool linked to requirements
• Proven to eliminate 90% of the bottlenecks prior to development
• Demonstrated over 40% schedule savings
Mirabilis Design- NoC Webinar- 15th-Oct 2024

  • 1. Webinar: ARM corelink, Arteris NoC, UCIe, Bunch-of-wires, CXL and PCIe- Designing the interconnect is not for the weak-hearted! Host: Deepak Shankar, Vice President Technology Mirabilis Design Inc. Email: dshankar@mirabilisdesign.com
  • 3. Explore and Measure using System-Level Exploration NoC/ UCIe AI Engine Tiles Warp Schedule r PE PE PE PE Local Mem GPU Memory Chiplet ADC DDR5 Processor subsystem Core L1 B u s SLC Round-Trip Latency Which one is it? Neoverse/A720/RISC-V/Tensilica Lx8 Number and type of GPU and TPU Cores What is the AI Clock Speed? Optimal Mesh size Peak power Thermal heat and temp Management Number Port & Modules Interface Buffer Interconnect Speed Scheduling and assignment Throughput Use benchmarks, traffic, traces and workloads Buffer Usage Consider this SoC Architecture
  • 4. Types of Experiments to be Conducted for an SoC Design • Select interconnect- AXI vs NoC vs Crossbar vs mesh? • Assign NoC, AHB or ACE to each level of Hierarchy? • Commercial or custom NoC development? • Optimize MEOSI coherence on a custom mesh to maximize cache hit-ratio? • Deciding on monolithic vs multi-die chiplets? • Impact of new power management on peak and total power? • Flat vs hierarchy topology to ensure maximum memory bandwidth? • Integrate with SoC generation configuration tools?
  • 5. Enabling Customers with Full-Coverage Experiment
      Steps: Block Diagram; Model using System-level IP; Parameters & constraints; Regression Sweep; Generate statistics and specification
      BLOCK             METRICS            CONSTRAINT  CONSTRAINT VALUE  STATISTIC TYPE
      Cache_I_1         A_Hit_Ratio        >=          0.7               All
      Cache_d_1         A_Miss_Ratio       <           0.2               All
      Cache_I_2         A_Number_Entered   >=          175               All
      Cache_SLC         Buffer_Occupancy   <           6                 All
      AXI_Top_Master_1  Read_Data_Bytes    >=          1.00E+07          All
      CMN_XP            Buffer_Overflow    >=          10                All
      Task_1            Latency            <           1.23E-06          Mean
      Task_2            Latency            <           4.60E-03          Max
      Task_3            Latency            <           6.00E-05          Min
      Cache_SLC         Read_Hit_Ratio     >=          0.9               All
      Read_MBs_per_Se
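The constraint table above can be read as a set of (block, metric, operator, limit) rules applied to the statistics of each run. The sketch below shows one way such a check could be written in Python. It is not VisualSim code: the `Constraint` type, the `evaluate` helper and the measured values are hypothetical, and only three rows of the table are copied in.

```python
import operator
from typing import NamedTuple

# Comparison operators that appear in the constraint table.
OPS = {">=": operator.ge, "<": operator.lt}

class Constraint(NamedTuple):
    block: str
    metric: str
    op: str
    limit: float

# Three rows copied from the slide's constraint table.
CONSTRAINTS = [
    Constraint("Cache_I_1", "A_Hit_Ratio", ">=", 0.7),
    Constraint("Cache_d_1", "A_Miss_Ratio", "<", 0.2),
    Constraint("Task_1", "Latency", "<", 1.23e-06),
]

def evaluate(constraints, measured):
    """Map (block, metric) to True/False, or None when nothing was measured."""
    results = {}
    for c in constraints:
        value = measured.get((c.block, c.metric))
        passed = None if value is None else OPS[c.op](value, c.limit)
        results[(c.block, c.metric)] = passed
    return results

# Hypothetical measurements, NOT taken from the slides.
measured = {
    ("Cache_I_1", "A_Hit_Ratio"): 0.82,
    ("Task_1", "Latency"): 2.0e-06,
}
report = evaluate(CONSTRAINTS, measured)
for key, passed in report.items():
    print(key, passed)
```

Returning `None` for a missing statistic keeps "not measured" distinct from "failed", which matters when a regression sweep disables some blocks.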
  • 6. Case Study: Data Center SoC Design
      Challenges
        1. What is the buffer size to prevent overflow on interconnect?
        2. What is the memory throughput required to meet the goals?
        3. How many Cores are required to meet 33 ms response time?
        4. Should power management be Threshold, time-based, DVFS or utilization-based?
      Project Goals
        1. Data Center SoC for Neural Network applications
        2. Handle 30 million vertices/second
        3. Power consumption < 40W
        4. Resnet 50 workload inference time < 3.2 seconds
      VisualSim Solution
        1. Library: ARM A720, Cache, HBM3, DMA, CMN Cyprus, Arteris NoC, UCIe, GPU, Sensor, Power State Machine
        2. Custom model for AI braking and proximity test
        3. Workload generator for data center and automotive applications
        4. Flow control and scheduling algorithms
        5. Performance and power report generators
      Project Outcome
        1. Generated component list, clock speed, bus width, buffer size and flow control
        2. Expected statistics for performance, correctness and power
        3. Executable specification for customer Architects to conduct trade-offs
      Figure labels: Evaluation of Constraints; Suggested Block Diagram
      Statistics and Reports
        BLOCK        METRICS                              MEASURED       STATISTIC TYPE  RESULT
        AMBA-AXI     GPU_Read_Data_Bytes                  3,392,408,192  Max             TRUE
        AMBA-AXI     DDR4_Bandwidth_Utilization           28%            Std Deviation   FALSE
        NoC-CMN      System-Level Cache_Read_Data_Bytes   3,392,128,256  Mean            TRUE
        NoC-Arteris  Read Buffer Channel Usage            32             Min             TRUE
        Data Cache   Hit_Ratio                            89.148         Mean            TRUE
        Data Cache   Latency                              4.79E-08       Mean            FALSE
        Data Cache   GB/Second                            1.776          Min             TRUE
        Processor    Context_Switch_Time                  16.83          Max             TRUE
        Processor    Application Processing Delay         3.86E-06       Min             FALSE
        Page_Table   Memory_Used_By_TLB                   128K           Min             FALSE
        Cache        Bus Request Buffer_Occupanc          440            Min             FALSE
        Processor    Processor_Utilization                50%            Max             TRUE
        Thermal      Temperature                          65C            Mean            FALSE
        Power        Peak Power for Chiplet Die 1         51W            Max             FALSE
      Regression varying Parameters and Workloads
        -Process_Node_nm 7 -Bus2_Clk_Speed 2000.0 -Core_Clk_Speed 2500.0
        -Process_Node_nm 7 -Bus2_Clk_Speed 4000.0 -Core_Clk_Speed 4500.0
        -Process_Node_nm 3 -Bus2_Clk_Speed 4000.0 -Core_Clk_Speed 4500.0
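The three regression parameter sets listed on this slide can be kept as data and rendered into the `-Name value` form shown there. The Python sketch below does only that; the `RUNS` list and the `to_args` helper are hypothetical names, and how the simulator consumes these arguments is not shown in the slides, so launching a run is deliberately left out.

```python
# Parameter sets copied from the "Regression varying Parameters and Workloads" list.
RUNS = [
    {"Process_Node_nm": 7, "Bus2_Clk_Speed": 2000.0, "Core_Clk_Speed": 2500.0},
    {"Process_Node_nm": 7, "Bus2_Clk_Speed": 4000.0, "Core_Clk_Speed": 4500.0},
    {"Process_Node_nm": 3, "Bus2_Clk_Speed": 4000.0, "Core_Clk_Speed": 4500.0},
]

def to_args(run):
    """Render one parameter set as '-Name value' tokens, in insertion order."""
    return " ".join(f"-{name} {value}" for name, value in run.items())

command_lines = [to_args(run) for run in RUNS]
for line in command_lines:
    print(line)
```

Keeping the sweep as explicit dictionaries, rather than a full Cartesian product, mirrors the slide: only three hand-picked combinations are run.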
  • 7. VisualSim Solution VisualSim with libraries Quickstart Training Modeling services Analysis and insight Integration The Product The Offerings
  • 8. VisualSim System-Level IP Library Quantity and Time Queue System Resources Scheduler RTOS Builder, ARINC 653, AUTOSAR Task Graph, Workload Builder Stochastic and Software SoC Compute, Interconnect and Hardware Systems and Networks Traffic Custom Builder Distribution- and Trace-based Sensors, VCD, Network, Sequence Scripting language RegEx C/C++/Java/Python Wrapper Statistics Latency, Throughput, Utilization, hit-ratio Ave/peak power (instant, ave) Heat, Temp TSN, AVB, 10BaseT1S, Switched Ethernet Resilient Packet Ring, RP3, WiFi 802.11 Bluetooth, PAN, Spacewire, SpaceFibre IEEE802.1Q, Time-Triggered Ethernet AFDX, 5G VME, PCI/PCI-X/PCIe 6.0, CXL, SPI 3.0, 1553B, FlexRay, CAN-FD/XL AFDX, TTEthernet, OpenVPX AMBA (AHB/APB/AXI/CHI), TileLink Corelink (600, 700), NoC (Generic, Arteris), Virtual Channel, DMA, Crossbar, Serial Switch, Bridge, UCIe CPU, DSP, GPU, TPU, MCU ARM (M0-55), R5, Cortex (A8, A72, A53, A76, A77, A65, A78, A720, Neoverse V and X), Nvidia- Pascal to Ampere, Leon, Power, X86, DSP & ADI- TI, Tensilica- Lx8, Renesas, AI RISC-V SiFive In-Order/Out-of-Order Flash, NVMe, Disk, SSD, NAS, Fibre Channel, FireWire, HBM3.0, HMC Memory Controller, Disk, SDR DRAM 2-5, LPDDR 2-5-X, SSD QDR, RDRAM, MPMC, Cache, Coherent cache Storage and Memory FPGA Xilinx- Versal, Zynq, Ultrascale, Kintex Altera-Stratix, Arria Microsemi- Smartfusion Programmable logic generator Power States, Allocation Transition, Loss, Battery Consumption, Management Generation, Distribution and Thermal Power Communication RF Tx/Rx, Baseband, Channels, Analog, A/D transceivers, Antenna Signal/audio/Image algorithms
  • 9. Complete SoC System Model Solution in VisualSim Reuse IP Define Hierarchy Parameters Power & Thermal Metrics Capture Custom IP Builder Debugging & Profiling Plotting
  • 10. Three Levels of NoC Modeling
      • Stochastic or queuing theory-based
          - Focus on overlap latency and throughput without a specific implementation
      • Hybrid, which is cycle-accurate but not fully pipelined
          - Vendor-specific products, but without the detailed underlying registers and algorithms
          - Most often combines micro-architecture models for processors, cache and memory with a slightly more abstract interconnect
      • Micro-architecture
          - Detailed implementation of a vendor-specific or custom product
          - All modeling devices are functionally accurate
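To make the first level concrete, the sketch below treats one NoC port as an M/M/1 queue and derives utilization, mean latency and mean occupancy from the arrival and service rates. The choice of M/M/1 is an assumption for illustration (the slides do not name a queue discipline), and the function name and rates are hypothetical.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Closed-form M/M/1 results for one queueing point (rates in packets/second).

    Assumes Poisson arrivals and exponential service times, which is a
    simplification and not something the slides specify.
    """
    if arrival_rate < 0 or service_rate <= 0:
        raise ValueError("rates must be non-negative, service rate positive")
    if arrival_rate >= service_rate:
        raise ValueError("queue is unstable: arrival rate must be below service rate")
    rho = arrival_rate / service_rate
    return {
        "utilization": rho,
        # Mean time a packet spends queued plus in service.
        "mean_latency_s": 1.0 / (service_rate - arrival_rate),
        # Mean number of packets in the system (Little's law).
        "mean_in_system": rho / (1.0 - rho),
    }

# Hypothetical port: 8 million packets/s offered, 10 million packets/s served.
m = mm1_metrics(arrival_rate=8.0e6, service_rate=1.0e7)
print(m)
```

A model like this answers early sizing questions in microseconds of compute time, at the cost of ignoring arbitration, burst traffic and back-pressure, which is where the hybrid and micro-architecture levels take over.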
  • 11. Stochastic NoC- Flow Control Modeling
  • 12. Hybrid NoC Modeling- Vendor-specific Arteris NoC
  • 13. Micro-Architecture Modeling of the Custom NoC
      Statistics
        {NIU_INIU_00_Flits_Initiated = 100,
         NIU_INIU_01_Request_Throughput_MBps = 800.0008,
         NIU_INIU_02_Response_Throughput_MBps = 692.000,
         NIU_INIU_03_Read_Request_Initiated = 100,
         NIU_INIU_04_Read_Response_received = 43,
         NIU_INIU_05_Write_Request_Initiated = 0,
         NIU_INIU_06_Write_Response_received = 0,
         NIU_INIU_07_Total_Read_Request_Bytes = 800,
         NIU_INIU_08_Total_Write_Request_Bytes = 0,
         NIU_INIU_09_Total_Request_Bytes = 800,
         NIU_INIU_10_Total_Response_Bytes = 692,
         NIU_INIU_13_Request_Buffer_overflow = 0,
         NIU_INIU_14_Packets_Waiting_in_Request_Buffer = 0,
         NIU_INIU_15_Packets_Waiting_in_ROB_Buffer = 0}
        {NIU_TNIU_00_Flits_Responded = 178,
         NIU_TNIU_01_Response_Throughput_MBps = 712.000,
         NIU_TNIU_02_Request_Throughput_MBps = 1600.00,
         NIU_TNIU_03_Read_Response_completed = 92.0,
         NIU_TNIU_04_Read_Request_received = 200.0,
         NIU_TNIU_05_Write_Response_completed = 0.0,
         NIU_TNIU_06_Write_Request_received = 0.0,
         NIU_TNIU_07_Total_Request_Bytes = 1600,
         NIU_TNIU_08_Total_Response_Bytes = 712,
         NIU_TNIU_09_Response_Buffer_overflow = 0,
         NIU_TNIU_10_Packets_Waiting_in_Response_Buffer = 6,
         NIU_TNIU_11_Packets_Waiting_in_ROB_Buffer = 0}
      Debugging
        Master Flow control block (Master_1) ::: adding packet to the Queue
        Packet details ::: ID = 99, A_Address = 5568517352L,
        Master Flow control block (Master_1) ::: sending packet out
        Packet details ::: ID = 99, A_Address = 5568517352L,
        Master Flow control block (Master_2) ::: adding packet to the Queue
        Packet details ::: ID = 99, A_Address = 5154753492L,
        Master Flow control block (Master_2) ::: sending packet out
        Packet details ::: ID = 99, A_Address = 5154753492L,
        Master Flow control block (Master_1) ::: adding packet to the Queue
        Packet details ::: ID = 100, A_Address = 5358742764L,
        Master Flow control block (Master_1) ::: sending packet out
        Packet details ::: ID = 100, A_Address = 5358742764L,
        Master Flow control block (Master_2) ::: adding packet to the Queue
        Packet details ::: ID = 100, A_Address = 2671326672L,
        Master Flow control block (Master_2) ::: sending packet out
        Packet details ::: ID = 100, A_Address = 2671326672L,
      Tracing the Activity
        {Time_Array = {4.305E-7, 4.33E-7, 4.34E-7, 4.34E-7, 4.34E-7, 4.35E-7, 4.35E-7, 4.35E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.36E-7, 4.38E-7, 4.38E-7, 4.39E-7, 4.39E-7, 4.39E-7, 4.4E-7, 4.4E-7, 4.4E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7, 4.41E-7, 9.7286E-7, 9.7286E-7, 9.86923E-7, 9.86923E-7, 9.86923E-7, 9.89E-7, 9.9E-7, 9.91E-7, 9.92E-7, 9.92E-7, 9.92E-7, 9.92E-7, 9.94E-7, 9.95E-7, 9.96E-7},
         Trace_Array = {"INIU2_Request_Queue_in", "MUX_1_Port_2_in", "MUX_1_out", "Buffer_1_in", "Buffer_Buffer_1_in", "Buffer_Buffer_1_out", "Buffer_1_out", "DEMUX_1_in", "DEMUX_1_out", "TNIU_Req_in", "TNIU_ROB_Queue_in", "TNIU_ROB_Queue_out", "TNIU_Req_out", "INIU3_Req_in", "INIU3_Request_Queue_in", "INIU3_Req_out", "MUX_3_Port_1_in", "MUX_3_out", "Buffer_3_in", "Buffer_Buffer_3_in", "Buffer_Buffer_3_out", "Buffer_3_out", "DEMUX_3_in", "DEMUX_3_out", "TNIU3_Req_in", "TNIU3_ROB_Queue_in", "TNIU3_ROB_Queue_out", "TNIU3_Req_out", "DRAM_in", "LPDDR_Scheduler_in", "LPDDR_Scheduler_out", "DRAM_out", "TNIU3_Resp_in", "TNIU3_Response_Queue_in", "TNIU3_Resp_out", "Buffer_Buffer_4_in", "Buffer_Buffer_4_out", "INIU3_Resp_in", "INIU3_Resp_out", "TNIU_Resp_in", "TNIU_Response_Queue_in", "TNIU_Resp_out", "Buffer_Buffer_2_in", "Buffer_Buffer_2_out"}}
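The activity trace above pairs a list of timestamps with a list of trace-point names, both 44 entries long. Assuming the two arrays line up index by index (the slide does not state this explicitly), the time spent between consecutive trace points is the difference of adjacent timestamps. The Python sketch below applies that to four consecutive samples copied from the trace; `hop_delays` is a hypothetical helper name.

```python
def hop_delays(times, points):
    """Return (from_point, to_point, delay) for each pair of consecutive samples."""
    if len(times) != len(points):
        raise ValueError("Time_Array and Trace_Array must have the same length")
    return [
        (points[i], points[i + 1], times[i + 1] - times[i])
        for i in range(len(times) - 1)
    ]

# Entries 29 to 32 of Time_Array and Trace_Array, assumed to pair by index.
times = [4.41e-7, 9.7286e-7, 9.7286e-7, 9.86923e-7]
points = ["DRAM_in", "LPDDR_Scheduler_in", "LPDDR_Scheduler_out", "DRAM_out"]

delays = hop_delays(times, points)
slowest = max(delays, key=lambda hop: hop[2])
print(slowest)
```

On this excerpt the largest gap falls between `DRAM_in` and `LPDDR_Scheduler_in`, which is the kind of observation a per-hop breakdown is meant to surface.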
  • 15. VisualSim Model- Imported using Configuration files
  • 16. Model Statistics
      System-level cache (Cache_SLC_1)
        Cache_SLC_1_A_Hit_Ratio = 0.0,
        Cache_SLC_1_A_Miss_Ratio = 100.0,
        Cache_SLC_1_A_Number_Entered = 1386L,
        Cache_SLC_1_A_Number_Returned = 1386L,
        Cache_SLC_1_A_Prefetch_Completed = 1L,
        Cache_SLC_1_A_Prefetch_Issued = 1L,
        Cache_SLC_1_A_Prefetch_Useful = 0L,
        Cache_SLC_1_Buffer_Occupancy = 0,
        Cache_SLC_1_Buffer_Overflow = 0L,
        Cache_SLC_1_Latency_Avg = 1.6392367965368E-7,
        Cache_SLC_1_Latency_Max = 4.3999899999999E-7,
        Cache_SLC_1_Latency_Min = 5.7111999999999E-8,
        Cache_SLC_1_Read_MBs_per_Second = 837.1201674240335,
        Cache_SLC_1_Total_Cache_Lines_Evicted = 0L,
        Cache_SLC_1_Total_Cache_Lines_Write_Backed = 0L,
        Cache_SLC_1_Total_MBs = 0.177472,
        Cache_SLC_1_Total_MBs_per_Second = 3549.440709888142,
        Cache_SLC_1_Utilization = 2.7733338880003
      DRAM memory controller (MC_DRAM_DRAM_1)
        MC_DRAM_DRAM_1_00_Total_Requests = 136,
        MC_DRAM_DRAM_1_01_Completed_Requests = 136,
        MC_DRAM_DRAM_1_02_Total_MB_per_Second = 87.1179217539336,
        MC_DRAM_DRAM_1_03_Total_Bytes = 8704,
        MC_DRAM_DRAM_1_04_Read_Bytes = 8704,
        MC_DRAM_DRAM_1_06_Read_MB_per_Second = 87.1179217539336,
        MC_DRAM_DRAM_1_10_Max_Queue_Usage = 2,
        MC_DRAM_DRAM_1_12_Queue_Removal_Position = {136, 0, 0, 0},
        MC_DRAM_DRAM_1_21_Total_Activates = 148,
        MC_DRAM_DRAM_1_22_Total_Precharges = 147,
        MC_DRAM_DRAM_1_23_Total_RRD_L_S = {{0, 0}, {12, 0}},
        MC_DRAM_DRAM_1_24_Total_CCD_L_S = {{135, 135}, {0, 0}},
        MC_DRAM_DRAM_1_25_Total_WTR_L_S = {{0, 0}, {0, 0}},
        MC_DRAM_DRAM_1_26_Total_RTP_WR_RAS_RTW = {{135, 132}, {0, 0}, {135, 122}, {0, 0}},
        MC_DRAM_DRAM_1_27_Refresh_Percent = 1.920016,
        MC_DRAM_DRAM_1_28_DRAM_Delay_Min = 1.4416999999991E-8,
        MC_DRAM_DRAM_1_29_DRAM_Delay_Max = 1.4417000000004E-8,
        MC_DRAM_DRAM_1_30_DRAM_Delay_Mean = 1.4417E-8,
        MC_DRAM_DRAM_1_31_DRAM_Delay_StDev = 6.9311872118497E-16
      PCIe switch (PCIe_Switch_1)
        PCIe_Switch_PCIe_Switch_1_Port_1_Rx_MBps = 870.2400087024001,
        PCIe_Switch_PCIe_Switch_1_Port_1_Total_MBps = 1740.4800174048003,
        PCIe_Switch_PCIe_Switch_1_Port_1_Tx_MBps = 870.2400087024001,
        PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Max_Latency = 8.6170000000051E-9,
        PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Mean_Latency = 4.8814421768711E-9,
        PCIe_Switch_PCIe_Switch_1_Port_1_to_Port_7_Min_Latency = 4.7219999999969E-9
      CMN600
        CMN600_RND_1_Max_End_to_End_Latency = 6.55668E-7,
        CMN600_RND_1_Max_Network_Latency = 8.4334000000002E-8,
        CMN600_RND_1_Mean_End_to_End_Latency = 3.6446105555556E-7,
        CMN600_RND_1_Mean_Network_Latency = 5.6137656565657E-8,
        CMN600_RND_1_Min_End_to_End_Latency = 2.37999E-7,
        CMN600_RND_1_Min_Network_Latency = 4.9999999999994E-8,
        CMN600_RND_2_Max_End_to_End_Latency = 4.7366499999999E-7,
        CMN600_R_0_0_EAST_In_Buffer_Max_Buffer_Occupancy = 18.0,
        CMN600_R_0_0_SOUTH_In_Buffer_Max_Buffer_Occupancy = 8.0,
        CMN600_R_0_1_EAST_In_Buffer_Max_Buffer_Occupancy = 11.0,
        CMN600_R_0_1_SOUTH_In_Buffer_Max_Buffer_Occupancy = 21.0,
        CMN600_R_0_1_WEST_In_Buffer_Max_Buffer_Occupancy = 5.0,
        DS_Name = "CMN600_Stats"
      ACE bus (ACE_Bus_1)
        ACE_Bus_1_Master_1_Read_Data_Bytes = 7936,
        ACE_Bus_1_Master_1_Read_Data_MBps = 79.3600007936,
        ACE_Bus_1_Slave_1_BW_Utilization_Prct = 0.6613333399467,
        ACE_Bus_1_Slave_1_Read_Data_Bytes = 7936,
        ACE_Bus_1_Slave_1_Read_Data_MBps = 79.3600007936,
        ACE_Bus_1_Slave_1_Rd_Threshold_Usage = 2.0,
        ACE_Bus_1_Slave_1_Rd_Transactions = 124
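Statistics dumps like the one above are easier to cross-check once they are parsed into a dictionary. The Python sketch below extracts scalar `Name = value` pairs, accepts the trailing `L` on long integers, and skips brace-delimited arrays and quoted strings. The regular expression and the `parse_stats` helper are illustrative assumptions about the text format as it appears on the slide, not a VisualSim API.

```python
import re

# Name = number, optionally followed by 'L', then a comma, closing brace or end of text.
_PAIR = re.compile(
    r"(\w+)\s*=\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)L?(?=\s*(?:,|\}|$))"
)

def parse_stats(text):
    """Return {name: int or float} for every scalar statistic found in text."""
    stats = {}
    for name, raw in _PAIR.findall(text):
        is_integer = re.fullmatch(r"[-+]?\d+", raw) is not None
        stats[name] = int(raw) if is_integer else float(raw)
    return stats

# Excerpt copied from the slide's statistics.
sample = (
    "Cache_SLC_1_A_Number_Entered = 1386L, "
    "Cache_SLC_1_Latency_Avg = 1.6392367965368E-7, "
    "MC_DRAM_DRAM_1_00_Total_Requests = 136, "
    "MC_DRAM_DRAM_1_03_Total_Bytes = 8704, "
    "MC_DRAM_DRAM_1_12_Queue_Removal_Position = {136, 0, 0, 0}"
)
stats = parse_stats(sample)

# Derived check: average bytes moved per DRAM request.
bytes_per_request = (
    stats["MC_DRAM_DRAM_1_03_Total_Bytes"] / stats["MC_DRAM_DRAM_1_00_Total_Requests"]
)
print(bytes_per_request)
```

For the values on the slide this works out to 64 bytes per DRAM request.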
  • 18. Block Diagram and VisualSim Model
  • 19. Experiments
      Configurations
        Exp 1: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 4200 MHz, DRAM and Memory Controller= 2400 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        Exp 2: Traffic rate= 1.0e-7, 8 Cores per Cluster, Random Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        Exp 3: Traffic rate= 1.0e-7, 4 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        Exp 4: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        Exp 5: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 500, AXI read and write threshold= 100
      Results
        Exp  Mean Latency  Mean RNF Latency  DRAM MBps  LLC Cache MBps  L2 Cache Buffer Overflow  Maximum UCIe MBps  Maximum Power (Watts)
        1    4.105e-7      1.355e-8          728.903    3200.006        461L                      12481.600          22
        2    9.8718e-7     1.72e-8           663.212    2432.004        367L                      12816.001          17
        3    6.136e-7      1.250e-8          663.212    3200.006        293L                      12475.200          17
        4    5.06e-7       1.249e-8          663.212    2688.005        464L                      12558.400          17
        5    7.395e-7      1.28e-8           663.212    2688.005        0L                        12176.001          17
  • 20. RNF Latencies
      Columns: Exp; Description; RNF Latency (Maximum End to End Latency)
      Exp 1: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 4200 MHz, DRAM and Memory Controller= 2400 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        [ RNF_1 = 4.9998000000001E-9, RNF_2 = 6.6664000000001E-9, RNF_3 = 4.9998000000001E-9, RNF_4 = 1.04988E-8, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.49994E-8, RNF_8 = 1.33919E-8, RNF_9 = 1.6786E-8, RNF_10 = 1.33328E-8, RNF_11 = 9.9996E-9, RNF_12 = 1.40361E-8, RNF_13 = 7.36696E-8, RNF_14 = 6.95799E-8, RNF_15 = 6.94706E-8, RNF_16 = 6.76522E-8 ]
      Exp 2: Traffic rate= 1.0e-7, 8 Cores per Cluster, Random Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        [ RNF_1 = 4.9998000000001E-9, RNF_2 = 1.33328E-8, RNF_3 = 5.6658E-9, RNF_4 = 7.582E-9, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.49994E-8, RNF_8 = 1.39988E-8, RNF_9 = 2.12506E-8, RNF_10 = 1.33328E-8, RNF_11 = 1.83321E-8, RNF_12 = 2.97902E-8, RNF_13 = 1.249314E-7, RNF_14 = 1.366519E-7, RNF_15 = 7.03419E-8, RNF_16 = 8.58516E-8 ]
      Exp 3: Traffic rate= 1.0e-7, 4 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        [ RNF_1 = 4.9998000000001E-9, RNF_2 = 6.6664000000001E-9, RNF_3 = 4.9998000000001E-9, RNF_4 = 6.6664000000002E-9, RNF_5 = 1.16662E-8, RNF_6 = 9.9996E-9, RNF_7 = 1.55829E-8, RNF_8 = 1.33328E-8, RNF_9 = 1.70836E-8, RNF_10 = 1.33328E-8, RNF_11 = 9.9996E-9, RNF_12 = 1.41656E-8, RNF_13 = 6.80913E-8, RNF_14 = 7.32668E-8, RNF_15 = 8.03479E-8, RNF_16 = 6.83493E-8 ]
      Exp 4: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 1000, AXI read and write threshold= 250
        [ RNF_1 = 4.9998e-09, RNF_2 = 6.6664e-09, RNF_3 = 4.9998e-09, RNF_4 = 6.6664e-09, RNF_5 = 1.21021e-08, RNF_6 = 1.07689e-08, RNF_7 = 1.49994e-08, RNF_8 = 1.41021e-08, RNF_9 = 1.70836e-08, RNF_10 = 1.36022e-08, RNF_11 = 9.9996e-09, RNF_12 = 1.33515e-08, RNF_13 = 4.20844e-08, RNF_14 = 6.99336e-08, RNF_15 = 7.03259e-08, RNF_16 = 6.84963e-08 ]
      Exp 5: Traffic rate= 1.0e-7, 8 Cores per Cluster, Sequential Addresses, AXI and L2 Cache= 3200 MHz, DRAM and Memory Controller= 1200 MHz, CMN frequency= 1200 MHz, CMN Buffer capacity= 500, AXI read and write threshold= 100
        [ RNF_1 = 4.9998e-09, RNF_2 = 6.6664e-09, RNF_3 = 4.9998e-09, RNF_4 = 6.6664e-09, RNF_5 = 1.21021e-08, RNF_6 = 1.07689e-08, RNF_7 = 1.49994e-08, RNF_8 = 1.41021e-08, RNF_9 = 1.70836e-08, RNF_10 = 1.36022e-08, RNF_11 = 9.9996e-09, RNF_12 = 1.33515e-08, RNF_13 = 4.20844e-08, RNF_14 = 6.99336e-08, RNF_15 = 7.03259e-08, RNF_16 = 6.84963e-08 ]
  • 21. Latency of Processor Request Per Cluster. Plot panels: Experiment 1 (Latency_Cluster_1), Experiment 2 (Latency_Cluster1), Experiment 3 (Latency_Cluster_1), Experiment 4 (Latency_Cluster1), Experiment 5 (Latency_Cluster_6)
  • 22. 64 Tile SOC • The 64-tile SoC has 8 clusters, and each cluster contains 8 cores. • Addresses are generated either randomly or sequentially, and the resulting packets travel to the cache over the AXI bus.
  • 23. Interconnect Architecture • The 64-core SoC die is connected through 4 UCIe ports to two other dies • Dies 2 and 3 contain an SLC cache and the DRAMs
  • 25. Power Exploration. Plot panels: Power_DIE_1 for Experiment 1, Experiment 2, Experiment 3, Experiment 4 and Experiment 5
  • 26. System-Level Verification of Automotive and Defense Systems: System Integration of SoC using Chiplet
  • 27. System Overview
      • Gateway: transfers messages between different CAN and TSN networks
      • CAN Bus: the network that connects sensors and ECUs
      • TSN Switch: the TSN bus is the network that connects high-performance cameras, lidars and servers
      Diagram labels: Wheel 1, Wheel 2, Wheel 3, Wheel 4, Gateway, CAN Bus, Engine, Proximity Sensor, Brake Pedal, Gyro Sensor, Road condition sensor, TSN Bus, ECU
  • 28. Embedding the SoC in an Automotive Application for Testing
  • 29. Evaluation of Chiplet Performance in Automotive Application
  • 30. Power Modeling and Future Innovation
  • 31. Generate Power and Thermal Characteristic
      Power Generation
        • 4 Types of Power Generators in VisualSim
        • Constant, variable, motor, solar charge
        • Charge sent to battery
      Power Storage
        • Different charging schemes
        • Impact of surge and shocks
        • Battery Lifecycle
        • Battery Consumption
      Power Consumption
        • State based power consumption of electronics (controller, SOC) and Mechanical (brakes, wheels)
        • Average, instant and Cumulative
        • Power per device and application
      Power Management
        • Change in power state controlled by time, utilization, temperature and expected activity
      Thermal Management
        • Statistics
        • Heat and temperature
        • Impact of cooling strategy
        • Add impact of power spikes
      Verification and Debugging
        • Optimize and test the power management algorithms
        • Sizing of power generators and battery
        • Estimate power consumed by the software application
      Downstream Integration
        • Integrate with physical hardware
        • Generate UPF file with power domains
        • Generate SystemVerilog power testbench
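Two ideas on this slide lend themselves to a small worked example: state-based power consumption, and power states that change with utilization and temperature. The Python sketch below combines them. Every number in it (state powers, thresholds, samples) and every name (`POWER_W`, `next_state`, `energy_and_average`) is hypothetical and chosen only for illustration; it is not the VisualSim power model.

```python
# Hypothetical power draw per state, in watts.
POWER_W = {"Active": 2.0, "Idle": 0.5, "Sleep": 0.05}

def next_state(utilization, temperature_c):
    """Hypothetical policy: throttle when hot, sleep when almost unused."""
    if temperature_c >= 85.0:
        return "Idle"
    if utilization < 0.05:
        return "Sleep"
    if utilization < 0.30:
        return "Idle"
    return "Active"

def energy_and_average(samples, step_s):
    """samples: list of (utilization, temperature_c), one per fixed time step."""
    if not samples:
        raise ValueError("at least one sample is required")
    states = [next_state(u, t) for u, t in samples]
    # Cumulative energy is the sum of state power times the step duration.
    energy_j = sum(POWER_W[s] * step_s for s in states)
    average_w = energy_j / (len(states) * step_s)
    return states, energy_j, average_w

samples = [(0.9, 60.0), (0.2, 62.0), (0.01, 55.0), (0.8, 90.0)]
states, energy_j, average_w = energy_and_average(samples, step_s=1e-3)
print(states, energy_j, average_w)
```

The same loop yields the three views named on the slide: the instantaneous value is `POWER_W[state]` at each step, the cumulative value is `energy_j`, and the average is `average_w`.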
  • 32. Power Modeling Integration
      Diagram labels: Behavior Task Graph; Power Table; Power management Unit; SystemVerilog Output for Power System Test; VCD Waveform for Verification
      Generated UPF:
        create_power_domain PD_Top -include_scope
        create_power_domain -name PD_1_2.0 -elements {"CLKMUX"}
        create_power_domain -name PD_1_1.0 -elements {"PLL","G2","G3"}
        create_power_domain -name PD_1_3.0 -elements {"PROC"}
        create_supply_port -port VDD_1.0 -direction in -domain PD_Top
        create_supply_port -port VDD_2.0 -direction in -domain PD_Top
        create_supply_port -port VDD_3.0 -direction in -domain PD_Top
        create_supply_port -port VSS_0.0 -direction in -domain PD_Top
        create_supply_net VDD_1.0 -domain PD_Top
        create_supply_net VDD_2.0 -domain PD_Top
        create_supply_net VDD_3.0 -domain PD_Top
        create_supply_net VSS_0.0 -domain PD_Top
        connect_supply_net VDD_1.0 -ports VDD_1.0
        connect_supply_net VDD_2.0 -ports VDD_2.0
        connect_supply_net VDD_3.0 -ports VDD_3.0
        connect_supply_net VSS_0.0 -ports VSS_0.0
        add_power_state PD_1_2.0 -state Active {-supply_expr (VDD_2.0 == {ON, 2.0}) && (VSS_0.0 =={ON,0.0})}
        add_power_state PD_1_2.0 -state OFF {-supply_expr (VDD_2.0 == {OFF, 0.0}) && (VSS_0.0 =={ON,0.0})}
        add_power_state PD_1_1.0 -state Active {-supply_expr (VDD_1.0 == {ON, 1.0}) && (VSS_0.0 =={ON,0.0})}
        add_power_state PD_1_1.0 -state OFF {-supply_expr (VDD_1.0 == {OFF, 0.0}) && (VSS_0.0 =={ON,0.0})}
        add_power_state PD_1_3.0 -state Active {-supply_expr (VDD_3.0 == {ON, 3.0}) && (VSS_0.0 =={ON,0.0})}
        add_power_state PD_1_3.0 -state OFF {-supply_expr (VDD_3.0 == {OFF, 0.0}) && (VSS_0.0 =={ON,0.0})}
  • 33. AI-based Simulation for Rapid System Exploration
      • Run number 19, with the clock frequency at 1000 MHz, satisfied the performance requirements we had set.
      • Raising the frequency from 600 MHz to 1000 MHz increased the total power consumption.
      • Architects with stringent power thresholds can evaluate different processing resources: DSP vs Xeon cores vs Power cores.
      Figure captions: Requirements being evaluated for each simulation run in the parameter sweep. Overall Results: we can identify the simulation runs which meet the requirements and select the right configuration after considering cost vs performance trade-offs.
  • 34. System Verification • Generate test cases and compare RTL • Performance, Power and Functionality • Validate product not just HW/SW • Application relevant test vectors • Link to board, emulators and instruments Golden Reference Comparator Match Tag Architecture model of IP Verilog/C/ emulation
  • 36. Enable Better Products • One Product, One Model for power, performance and functionality • Complete System-level IP library and IP Builders for software, systems, networks and missions • AI-based Regression tool linked to Requirements • Proven to eliminate 90% of the bottlenecks prior to development • Demonstrated over 40% schedule savings