Synchoricity as the basis for going
Beyond Moore
Ahmed Hemani
Professor, Dept. of Electronics, School of EECS, KTH,
Stockholm Sweden
Email: hemani@kth.se
1
Not a Typo
The 2nd R-CCS Symposium,
Future Co-design Session 13:30 to 15:30
Day 2, Feb 18, 2020, Kobe, Japan
© Ahmed Hemani
2
Going Beyond Moore !
1. Squeeze more out of CMOS
a. ASIC-like custom functional hardware
b. Delivers 2-4 orders of magnitude better energy-delay product compared
to GPUs, FPGAs and Multi-cores
2. Complement CMOS with emerging technologies
a. 2.5D and 3D Integration (DRAM)
b. Computation in memory using Memristors
c. Plasmonics
“Science makes progress, not when you find a solution,
but when you make it easy to use the solution”
-- Venki Ramakrishnan, Nobel Laureate
Solutions to go beyond Moore Make it easy to use the solution
3
Synchronicity
Time is discretized using clock ticks
D Q
clk1
D Q
clk2
Can be temporally composed If
clk1 = clk2
&
The two clocks are skew aligned
Synchoricity
Space is discretized using a virtual grid
Can be spatially composed If
the number of grid cells
in each dimension is equal
&
Their interconnect edges are abuttable
4
SiLago (Silicon Lego) Blocks
RTL & Coarse Grain Reconfigurable
4-5 orders of magnitude larger than Standard Cells
Characterized with postlayout data
Empowers Synthesis from Higher Abstractions
Inter-SiLago-block wires brought to the periphery
at the right place and right metal layer to enable
composition by abutment
A SiLago Block
SiLago Blocks are the new standard cells
5
VLSI Designs are Composed by Abutting SiLago Blocks
All Wires – functional and infrastructural (reset, clocks and power grid) are
created as a result of abutment
Cost metrics of the composite design become known with post-layout
accuracy
6
Inspiration from Construction Industry
7
An Analogy
8
We shifted to pre-fabricated wall segments
1. Productivity gain did not solely come from the
large size of the pre-fabricated wall segments
2. Productivity gain came from physical design
discipline that enables composition by abutment
3. IPs in VLSI Design lack this discipline and
composition by abutment
9
Lego Kits
Region Types – SiLago Block Types
Functional
Dense Linear Algebra
Sparse Linear Algebra
Inner Modem
Outer Modem
Graph Theory
Protocol Processing
Spectral Methods
Dynamic Programming
State Machines
Infrastructural
NOCs
Scratch Pad Memory
PLL + CGU
Memory Controller
FIFO, FIFO Controller
RISC Processors – RISC-V
DMA
Memory Consistency
Power Management
The Berkeley Dwarfs
1 Dense Linear Algebra
2 Sparse Linear Algebra
3 Spectral Methods
4 N Body Methods
5 Structured Grids
6 Unstructured Grids
7 MapReduce
8 Combinational Logic
9 Graph Traversal
10 Dynamic Programming
11 Backtrack and Branch-and-Bound
12 Graphical Models
13 Finite State Machine
The Berkeley
Dwarfs
SiLago Regions
Types
10
Hardware Centric vs. Software Centric
Accelerators vs. Flexilators
Software Centric Platform Based Design
Software: by default, functionalities are mapped as software
Accelerators (hardware): only power- and performance-critical functionalities are mapped as hardware accelerators
Hardware Centric Synchoros VLSI Design
Custom Hardware, Hyper-CISC Instructions: by default, functionalities are mapped as custom functional hardware
Flexilators (e.g. RISC-V): only flexibility-critical, dynamic and non-deterministic functionalities are mapped to SiLago Flexilators: RISC-V, FSMs, FIFOs, Arbiters, Schedulers, NOCs etc.
Lego Flexilators
HPC LIB Impl
HLS
11
Why does Synchoros VLSI Design Work ?
[Figure: log(# of solutions) at each abstraction level (Application, System, Algorithm, RTL, Boolean / Std. Cells, Physical / GDSII) for three design styles, marking function verification (FV) and constraints verification (CV) as manual or automated, and the one-time engineering of SiLago blocks]
Full Custom Mead-Conway: O(10K Gates)
Standard Cells: O(10 million Gates)
Synchoros VLSI Design Style: O(100 million Gates)
~300 MUSD
One-Time Engineering Effort
SiLago Application Level Synthesis
1. Select Optimal Solution from ML solutions
2. Global Interconnect, buffers and control
3. Floorplanning
HPC Application
L Algorithms
Sampling Rate,
Total Latency
Number and types of SiLago blocks + Mapping
SiLago Blocks
Characterization Data
Hardened Blocks
Scripts
Mask Patterns Reports
Compose Ready-to-manufacture Chip
12
1  M
HPC Lib
Implementations
HPC
LIBs
HLS
HLS tools do not automate this.
Manually refined
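The selection step on this slide can be illustrated with a small exhaustive search. This is a sketch under stated assumptions, not the SiLago synthesis tool: it reads "ML solutions" as M^L combinations (L algorithms, each with M characterized library implementations), adds latencies serially, and uses made-up latency and energy numbers. The function name `select_implementations` is hypothetical.

```python
from itertools import product

def select_implementations(options, max_latency):
    """Pick one implementation per algorithm, minimizing energy under a latency bound.

    options: one list per algorithm, each holding (name, latency, energy) tuples
             taken from (hypothetical) post-layout characterization data.
    Returns (energy, latency, [names]) or None if no combination fits.
    """
    best = None
    for combo in product(*options):              # enumerates all M**L combinations
        latency = sum(impl[1] for impl in combo)  # assumption: algorithms run in series
        energy = sum(impl[2] for impl in combo)
        if latency > max_latency:
            continue
        if best is None or energy < best[0]:
            best = (energy, latency, [impl[0] for impl in combo])
    return best

# Illustrative numbers only: L = 2 algorithms, M = 2 implementations each.
options = [
    [("fft_fast", 2, 10), ("fft_small", 5, 4)],
    [("fir_fast", 1, 6), ("fir_small", 3, 2)],
]
print(select_implementations(options, max_latency=6))
# (10, 6, ['fft_small', 'fir_fast'])
```

Because every implementation carries post-layout characterization data, the cost of each combination is known without running physical design, which is what makes a search of this kind meaningful.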
13
SiLago Design Instances =  Region Instances
System
Controller
Program
Memory
Ethernet
PLL/CGU
PMC
Data
Memory
TSVs
3D Memory
Control
Protocol Processing Region
NOCs
NOCs
Scratchpad
Interrupt Ctrl
Scratchpad
Scratchpad
DenseLinearAlgebra
Conceptual
Does Not Exist
Buffered/Pipelined
NOC SiLago Blocks
NOC Switch
SiLago Blocks
Region Specific
Network
Interface Units
DenseLinearAlgebra
Type, Number, Size and Position
of
SiLago Region Instances
Decided by Synthesis Tools
Based on
Functionality
And
Constraints
Inner
Modem
SiLago can also potentially reduce the manufacturing cost
Protocol Processing Streaming Storage
Data
Storage
System Ctrl
Program
Storage
DRAM CTRL
Flash CTRL
Ethernet
PLL/CGU
PMC
Inner
Modem
Outer
Modem
14
Outer
Modem
Flexilators
All SiLago designs are composed of a finite number of
SiLago block Types
All SiLago blocks can only have a finite number of
neighbor types
Each SiLago block's mask, depending on the neighbor
types, can be saved as a component mask
The entire design mask can be composed from such
component masks
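The component-mask argument can be sketched as follows. This is an illustration only: the floorplan, the block-type names, and the choice of a four-neighbor key (north, east, south, west) are assumptions made here, and `"EDGE"` is a hypothetical marker for the chip boundary.

```python
def component_mask_keys(floorplan):
    """Return one key per block instance: (type, north, east, south, west).

    floorplan: list of rows, each a list of block-type names on the grid.
    Instances that share a key could reuse the same stored component mask.
    """
    rows, cols = len(floorplan), len(floorplan[0])

    def at(r, c):
        inside = 0 <= r < rows and 0 <= c < cols
        return floorplan[r][c] if inside else "EDGE"

    return [
        (floorplan[r][c], at(r - 1, c), at(r, c + 1), at(r + 1, c), at(r, c - 1))
        for r in range(rows)
        for c in range(cols)
    ]

floorplan = [
    ["DPU", "DPU", "DPU", "DPU"],
    ["NOC", "NOC", "NOC", "NOC"],
]
keys = component_mask_keys(floorplan)
print(len(keys), "instances,", len(set(keys)), "distinct component masks")
# 8 instances, 6 distinct component masks
```

In this toy case the interior instances share keys, so the number of distinct masks grows more slowly than the number of instances as the floorplan widens.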
The DFT Cost can
also be factored out
The DFT can be
made much more
efficient reducing
time spent on ATE
15
What becomes possible
[Figure: energy per frame (mJ, log scale) versus operations of CNN (10 M to 10 G)]
Jetson TX1 (GPU): 600 $
Keystone II (DSP): 667 $
Parallela (Multicore+NoC): 126 $
ZC706 (FPGA): 1600 $
SiLago, 22 nm (volume of 10k): 196 $
Data on GPU, DSP, Parallela and FPGA adapted from
G. Hegde, S. Siddhartha, and N. Kapre, “CaffePresso: Accelerating convolutional networks on embedded SoCs,” ACM Transactions on Embedded Computing
Systems, vol. 17, 2017.
16
Going Beyond Moore !
1. Squeeze more out of CMOS
a. ASIC-like custom functional hardware
b. Delivers 2-4 orders of magnitude better energy-delay product compared
to GPUs, FPGAs and Multi-cores
2. Complement CMOS with emerging technologies
a. 2.5D and 3D Integration (DRAM)
b. Computation in memory using Memristors
c. Plasmonics
“Science makes progress, not when you find a solution,
but when you make it easy to use the solution”
-- Venki Ramakrishnan
Solutions to go beyond Moore Make it easy to use the solution
17
BCPNN
Bayesian Confidence Propagation Neural Network
Professor Anders Lansner
Functional Requirements: Human Scale - Realtime
BCPNN Requirements
1. Realtime simulation
2. 2 Million HCUs – non-deterministically concurrent
3. 170 TFLOP/s – BCPNN Computation
4. 50 TB – Synaptic Weight Storage
5. 200 TB/s – Bandwidth for synaptic storage
6. 250 GB/s – Spiking Bandwidth
Infrastructural Requirements
18
100 MCUs
10 000 Connections
HCU
Synaptic
Memory
(25 MB)
MCU
State Vector
MCU Row
HCU =  MCUs
The BCPNN Computation Model
Input Spike
Computation
10 000 Spikes/s
100 × 100
Spikes/s
Support
Computation
100 / s
Output Spike
Computation
Delay
Buffer
19
Human Scale Cortex Dimensions
High Level of
Temporal Locality
Column updates
are more
expensive
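The human-scale figures can be cross-checked with a few lines of arithmetic. This is a sanity check, not part of the original slides: it assumes decimal units (1 TB = 10^6 MB) and that the aggregate bandwidth and compute requirements are shared evenly across HCUs.

```python
HCUS = 2_000_000            # "2 Million HCUs"
SYNAPTIC_MEMORY_MB = 25     # synaptic memory per HCU (25 MB)

# Decimal units assumed throughout: 1 TB = 1e6 MB, 1 GB = 1e6 kB.
total_storage_tb = HCUS * SYNAPTIC_MEMORY_MB / 1e6
storage_bw_per_hcu_mb_s = 200 * 1e6 / HCUS    # 200 TB/s split over all HCUs
spike_bw_per_hcu_kb_s = 250 * 1e6 / HCUS      # 250 GB/s split over all HCUs
mflops_per_hcu = 170e12 / HCUS / 1e6          # 170 TFLOP/s split over all HCUs

print(f"synaptic storage: {total_storage_tb:.0f} TB")                  # 50 TB
print(f"storage bandwidth per HCU: {storage_bw_per_hcu_mb_s:.0f} MB/s")
print(f"spike bandwidth per HCU: {spike_bw_per_hcu_kb_s:.0f} kB/s")
print(f"compute per HCU: {mflops_per_hcu:.0f} MFLOP/s")
```

The first line reproduces the 50 TB storage requirement from 2 million HCUs at 25 MB each; the per-HCU figures are derived values and do not appear on the slides.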
20
Infrastructural Operations are Significant
Incoming Spikes Queue & Controller
iSDIN: incoming Spike Distribution Interconnect
Input Computation Controller
Output Computation Controller
Outgoing Spikes Queue & Controller
oSDIN: outgoing Spike Distribution Interconnect
Delay Buffers & Controller for fanout spikes
Scratchpad Memories
Input Computation
Input Computation
FSM
Input Computation
Unit R1 SP FPUs
HCU State Storage Memory Interface
ms Timer
Scratchpad Memories
Output Computation
Output Computation
FSM
Output Computation
Unit R2 SP FPUs
170 TFlops
Infrastructural Operations
The Silicon Lego Bricks Method Applied to BCPNN
A Structured Physical Design Scheme to enable System-level synthesis
21
[Figure: array of H-Tiles]
TSVs + Controller
FPUs
SRAMs
H-Tile
Controller
NOC
Interface
Queues +
Controller
NOC Corridor
BCU: Brain Computation Unit
1.529 × 1.729 mm²
32 H-Cubes
32 Micro Channels
22
H-Cube
8 layers of DRAM
1 Bank per layer
2 Banks / HCU
TSV Micro Channel
4 HCUs + Control
1079 µm
200 µm
Bank 0: 128 Mb
TSV Area
RIB
Column
1489 µm
250 µm
240 µm
BCPNN Implementation
HCU0 HCU1
HCU2 HCU3
In Collaboration with
Prof. Norbert Wehn and
Dr. Christian Weis
TU Kaiserslautern
23
BCPNN: ASIC vs GPUs
# of GK210 cores: 5000
Energy: 563.1 kJ
1 s Realtime: 4.69 s simulated time
Energy Delay Product: 2642 kJ·s
(b) Energy Breakdown GPUs: 1320 mJ/(HCU·s)
Memory 60%, Computation 20%, Infrastructure 20%
(a) Energy Breakdown ASIC: 1.52 mJ/(HCU·s)
DRAM 73%, SRAM 9%, Infrastructure 3%, Computation 15%
Energy Delay Product: 3.06 kJ·s
ASIC: 3.0 kW
GPUs: 2.6 MW
SpiNNaker-2 comparable to GPUs
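The energy-delay product (EDP) comparison on this slide can be reproduced approximately from the quoted numbers. This is a back-of-the-envelope check: the GPU EDP is recomputed as energy times delay, the ASIC EDP is taken as quoted, and the small difference from the slide's 2642 kJ·s is presumably rounding in the inputs.

```python
import math

gpu_energy_kj = 563.1      # quoted GPU energy
gpu_delay_s = 4.69         # quoted time for 1 s of real time
asic_edp_kj_s = 3.06       # quoted ASIC energy-delay product

gpu_edp_kj_s = gpu_energy_kj * gpu_delay_s      # energy x delay
ratio = gpu_edp_kj_s / asic_edp_kj_s
orders = math.log10(ratio)

print(f"GPU EDP  : {gpu_edp_kj_s:.0f} kJ*s")
print(f"ASIC EDP : {asic_edp_kj_s:.2f} kJ*s")
print(f"ratio    : {ratio:.0f}x (about {orders:.1f} orders of magnitude)")
```

The ratio comes out a little under three orders of magnitude, which sits inside the "2-4 orders" range claimed for custom functional hardware earlier in the talk.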
24
The Impact of Column Access Elimination + Exploiting
Temporal Locality
Baseline: 1.52 mJ/(HCU·s), 3 kW
DRAM 73%, SRAM 9%, Infrastructure 3%, Computation 15%
Column Access Eliminated: 0.94 mJ/(HCU·s), 1.88 kW
DRAM 59%, SRAM 8%, Infrastructure 6%, Computation 27%
Column Access Eliminated + Temporal Locality: 0.40 mJ/(HCU·s), 800 W
DRAM 14%, SRAM 10%, Infrastructure 13%, Computation 63%
Such optimizations can be automatically
inferred from Simulations
50 GFLOPs / Watt
28 nm bulk CMOS
Sustained – not peak
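The per-HCU energies and the system power figures on this slide are related by the HCU count. The sketch below, which assumes the 2 million HCUs from the requirements slide, converts each configuration to watts and splits it by the quoted percentages; the function name `power_breakdown_w` is a hypothetical helper.

```python
HCUS = 2_000_000   # human-scale HCU count (assumed from the requirements slide)

# (energy in mJ per HCU per second, percentage breakdown) as quoted on the slide
configs = {
    "baseline": (1.52, {"DRAM": 73, "SRAM": 9, "Infrastructure": 3, "Computation": 15}),
    "column access eliminated": (0.94, {"DRAM": 59, "SRAM": 8, "Infrastructure": 6, "Computation": 27}),
    "+ temporal locality": (0.40, {"DRAM": 14, "SRAM": 10, "Infrastructure": 13, "Computation": 63}),
}

def power_breakdown_w(mj_per_hcu_s, shares_percent):
    """Total power in watts and its split by component."""
    total_w = mj_per_hcu_s * 1e-3 * HCUS          # mJ/(HCU*s) -> W for all HCUs
    parts = {name: total_w * pct / 100 for name, pct in shares_percent.items()}
    return total_w, parts

for label, (energy, shares) in configs.items():
    total_w, parts = power_breakdown_w(energy, shares)
    print(f"{label}: {total_w:.0f} W, DRAM {parts['DRAM']:.0f} W, "
          f"computation {parts['Computation']:.0f} W")
```

Under these assumptions the totals land near the quoted 3 kW, 1.88 kW and 800 W, DRAM power falls from roughly 2.2 kW to roughly 0.1 kW, and computation power stays close to 0.5 kW in all three cases, which is consistent with the optimizations targeting memory access only.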
25
Interconnect and Storage are Expensive
6.3 pJ
3.2 pJ = 32-bit data over 1 mm ≈ 32-bit FLOP > accessing 1 bit in 3D integrated DRAM
P. Kogge and J. Shalf, “Exascale computing trends: Adjusting to the ‘new normal’ for computer architecture,” Comput. Sci. Eng., vol. 15, no. 6, 2013.
26
Computation in Memory using Memristors
DACs
ADCs
Source of Diagram Above: Chenchen Liu, Qing Yang, Bonan Yan, Xiaocong Du, Hai (Helen) Li, “A Memristor Crossbar Based Computing Engine Optimized for High Speed and Accuracy”, ISVLSI 2016
Benefits:
1. Single cycle dot product
2. Can be extended to do addition,
multiplication, element wise multiplication,
matrix inversion
3. No need to fetch, decode and execute
instructions, which addresses the wire problem
4. In some application instances, initialization of
matrix would be a one-time event
Challenges
1. Large matrices will need to be fragmented
resulting in movement of data. Need
complementary control circuitry
2. ADCs consume significant power and inject
latency
3. Accuracy
4. Experimental solutions reported. Not part of
mainstream design flow
Reminiscent of Analog Computation
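The benefit and the first two challenges above can be illustrated with an idealized model. This is a sketch under simplifying assumptions, not a device model: conductances and voltages are plain numbers, the crossbar is ideal (each column current is the sum of voltage times conductance), the ADC is a uniform quantizer, and large matrices are fragmented by rows with partial sums added digitally. All function names are hypothetical.

```python
def crossbar_mvm(conductances, voltages):
    """Ideal crossbar: column current I_j = sum_i V_i * G_ij, all columns at once."""
    cols = len(conductances[0])
    return [sum(v * row[j] for v, row in zip(voltages, conductances))
            for j in range(cols)]

def adc(value, step):
    """Uniform quantizer standing in for the ADC on each column output."""
    return round(value / step) * step

def tiled_mvm(matrix, vector, tile_rows, adc_step):
    """Fragment a tall matrix into row tiles that fit one crossbar each.

    Every tile's column currents pass through the ADC, then the partial
    results are accumulated digitally (the extra data movement and control).
    """
    total = [0.0] * len(matrix[0])
    for start in range(0, len(matrix), tile_rows):
        tile = matrix[start:start + tile_rows]
        partial = crossbar_mvm(tile, vector[start:start + tile_rows])
        total = [t + adc(p, adc_step) for t, p in zip(total, partial)]
    return total

matrix = [[1, 2], [3, 4], [5, 6], [7, 8]]
vector = [1, 0, 1, 1]
print(tiled_mvm(matrix, vector, tile_rows=2, adc_step=1))   # [13.0, 16.0]
print(tiled_mvm(matrix, vector, tile_rows=2, adc_step=5))   # [10.0, 15.0]
```

With a fine ADC step the tiled result matches the exact product; with a coarse step each tile contributes its own quantization error, which is one way the accuracy challenge shows up once matrices are fragmented.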
27
Memristor based CIM in the SiLago Framework
Region Types – SiLago Block Types
Functional
Dense Linear Algebra
Sparse Linear Algebra
Inner Modem
Outer Modem
Graph Theory
Protocol Processing
Spectral Methods
Dynamic Programming
State Machines
Infrastructural
NOCs
Scratch Pad Memory
PLL + CGU
Memory Controller
FIFO, FIFO Controller
RISC Processors – RISC-V
NVM
DRAM Vaults
Power Management
Memristor CIM
1. A Memristor CIM in a range of dimensions
2. Characterized with post-layout data and circuit
level simulations and validated with test chips
3. Exports functional matrix operations and
infrastructural operations like initializing crossbar,
NIU operations, reg file operations etc.
4. Higher abstraction synthesis tools can refine in
terms of CIM SiLago blocks and know their
performance, energy and area.
28
25 Watt Biologically Plausible Human Scale Brain
Single Precision Floating Point: ~800 Watts
Move to 16/32 bit Integer Arithmetic: ~250 Watts
1. Synaptic Storage/Access will reduce by ~50%
2. Computation Energy will reduce by ~75%
ReRAM Computation in Memory: ~25 Watts
Caveat:
Based on best effort estimates and
not on actual implementation
~ 2 TOPs/watt
28 nm bulk CMOS
29
Wave Based Computing using Plasmons
1. Logic values encoded as phase of the waves
2. Interference of waves interpreted as majority
gate computation
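The two points above can be sketched numerically. This is an idealized illustration, not a plasmonic device model: it assumes equal-amplitude, lossless waves, encodes logic 0 as phase 0 and logic 1 as phase π, and models inversion as an extra π phase shift. The full adder uses a standard three-majority-gate construction, which is an assumption here and not necessarily the circuit in the talk's figures.

```python
import cmath
import math

def wave(bit):
    """Unit-amplitude wave: logic 0 -> phase 0, logic 1 -> phase pi (assumed encoding)."""
    return cmath.exp(1j * math.pi * bit)

def majority(*bits):
    """Superpose an odd number of waves; the phase of the sum is the majority value."""
    if len(bits) % 2 == 0:
        raise ValueError("majority needs an odd number of inputs")
    total = sum(wave(b) for b in bits)      # interference of the input waves
    return 1 if total.real < 0 else 0       # phase pi wins -> 1, phase 0 wins -> 0

def invert(bit):
    """Inversion corresponds to a pi phase shift in this encoding."""
    return 1 - bit

def full_adder(a, b, cin):
    """Full adder from three majority gates plus inversions (standard construction)."""
    cout = majority(a, b, cin)
    s = majority(invert(cout), majority(a, b, invert(cin)), cin)
    return s, cout

for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            print(a, b, cin, "->", "sum", s, "cout", cout)
```

The carry output is a single majority gate, which is why majority logic is a natural fit for adders built from wave interference.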
30
Plasmonics + CMOS Computing using SiLago blocks
PlasmonExa – Plasmonics based Exa-scale computing
design.
Synthesized in terms of Silicon Lego (SiLago) blocks.
SiLago
Majority Gate
SiLago Interconnect
waveguide
Cout
A
B
Cin SUM
SiLago micro-architecture block (full adder)
[Figure: plasmonic wave-computing array with inputs 1 to N. Labels: Digital CMOS Control & Memory; Plasmon Sources; Plasmon Detector; Phase Modulators; THz CMOS Driver and Control; Plasmon Waveguides; Electrically Driven Plasmon Sources; THz Phase Modulators; Inputs; Demodulator Output; CMOS Logic Drivers; High Speed Plasmon Logic Circuit]
Impact
31
1. 1000 X Power Density
2. More Affordable
Software Centric / GPU +
Based Computing
Hardware Centric
SiLago Based Computing
1. 1000 X Power Density
2. More Affordable

