Synchoricity as the basis for going Beyond Moore
Ahmed Hemani
Professor, Dept. of Electronics, School of EECS, KTH,
Stockholm, Sweden
Email: hemani@kth.se
Note: "Synchoricity" is not a typo
The 2nd R-CCS Symposium,
Future Co-design Session 13:30 to 15:30
Day 2, Feb 18, 2020, Kobe, Japan

© Ahmed Hemani
Going Beyond Moore!
Solutions to go beyond Moore
1. Squeeze more out of CMOS
a. ASIC-like custom functional hardware
b. Delivers 2-4 orders of magnitude better energy-delay product than GPUs, FPGAs and multi-cores
2. Complement CMOS with emerging technologies
a. 2.5D and 3D integration (DRAM)
b. Computation in memory using memristors
c. Plasmonics
Make it easy to use the solution
“Science makes progress, not when you find a solution, but when you make it easy to use the solution”
-- Venki Ramakrishnan, Nobel Laureate

Synchronicity
Time is discretized using clock ticks
[Figure: two D flip-flops (D, Q), one clocked by clk1 and the other by clk2]
Can be temporally composed if
clk1 = clk2
&
the two clocks are skew aligned
Synchoricity
Space is discretized using a virtual grid
Can be spatially composed if
the number of grid cells in each dimension is equal
&
their interconnect edges are abuttable
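The spatial composition rule above can be sketched as a small check. This is only an illustration: the `GridBlock` class, its fields and the pin encoding (grid offset, metal layer) are hypothetical and are not taken from the SiLago tools.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GridBlock:
    """Hypothetical synchoros block that occupies whole cells of a virtual grid."""
    name: str
    cols: int             # width in grid cells
    rows: int             # height in grid cells
    east_edge: frozenset  # {(grid offset, metal layer)} of wires on the east edge
    west_edge: frozenset  # same encoding for the west edge


def can_abut(left: GridBlock, right: GridBlock) -> bool:
    """True if `right` can be placed directly east of `left` by abutment."""
    # Rule 1: the number of grid cells in each dimension is equal.
    same_grid = (left.cols, left.rows) == (right.cols, right.rows)
    # Rule 2: the facing edges carry wires at the same offsets and layers.
    edges_meet = left.east_edge == right.west_edge
    return same_grid and edges_meet


pins = frozenset({(0, "M4"), (3, "M6")})
dla = GridBlock("DLA", 4, 4, east_edge=pins, west_edge=pins)
noc = GridBlock("NOC", 4, 4, east_edge=pins, west_edge=pins)
odd = GridBlock("odd", 4, 5, east_edge=pins, west_edge=pins)
shifted = GridBlock("shifted", 4, 4, east_edge=pins,
                    west_edge=frozenset({(1, "M4"), (3, "M6")}))

print(can_abut(dla, noc), can_abut(dla, odd), can_abut(dla, shifted))
```

The check mirrors the temporal case: equal grid dimensions play the role of equal clocks, and matching edges play the role of skew alignment.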

SiLago (Silicon Lego) Blocks
RTL & Coarse Grain Reconfigurable
4-5 orders of magnitude larger than standard cells
Characterized with post-layout data
Empowers synthesis from higher abstractions
Inter-SiLago-block wires are brought to the periphery at the right place and on the right metal layer to enable composition by abutment
[Figure: A SiLago Block]
SiLago Blocks are the new standard cells

VLSI Designs are Composed by Abutting SiLago Blocks
All wires, functional and infrastructural (reset, clocks and power grid), are created as a result of abutment
Cost metrics of the composite design become known with post-layout accuracy

Inspiration from Construction Industry

We shifted to pre-fabricated wall segments
1. Productivity gain did not solely come from the
large size of the pre-fabricated wall segments
2. Productivity gain came from physical design
discipline that enables composition by abutment
3. IPs in VLSI Design lack this discipline and
composition by abutment

Lego Kits
Region Types – SiLago Block Types
Functional
Dense Linear Algebra
Sparse Linear Algebra
Inner Modem
Outer Modem
Graph Theory
Protocol Processing
Spectral Methods
Dynamic Programming
State Machines
Infrastructural
NOCs
Scratch Pad Memory
PLL + CGU
Memory Controller
FIFO, FIFO Controller
RISC Processors – RISC-V
DMA
Memory Consistency
Power Management
The Berkeley Dwarfs
1 Dense Linear Algebra
2 Sparse Linear Algebra
3 Spectral Methods
4 N Body Methods
5 Structured Grids
6 Unstructured Grids
7 MapReduce
8 Combinational Logic
9 Graph Traversal
10 Dynamic Programming
11 Back-track and Branch n Bound
12 Graphical Models
13 Finite State Machine
[Figure: the Berkeley Dwarfs mapped to SiLago region types]

Hardware Centric vs. Software Centric
Accelerators vs. Flexilators
Software Centric Platform Based Design
By default, functionalities are mapped as software
Only power and performance critical functionalities are mapped as hardware accelerators
Hardware Centric Synchoros VLSI Design
By default, functionalities are mapped as custom functional hardware (Hyper-CISC instructions)
Only flexibility critical, dynamic and non-deterministic functionalities are mapped to SiLago Flexilators: RISC-V, FSMs, FIFOs, Arbiters, Schedulers, NOCs etc.
[Figure labels: Software, Accelerators, Custom Hardware, Lego Flexilators (e.g. RISC-V)]

Why does Synchoros VLSI Design Work?
[Figure: log(# of solutions) at each abstraction level (Application, System, Algorithm, RTL, Boolean / Std. Cells, Physical / GDSII) for three design styles, marking function verification (FV) and constraints verification (CV) as manual, automated, or one-time engineering. Additional labels: HPC Lib Impl, HLS]
Full Custom Mead-Conway: O(10K gates)
Standard Cells: O(10 million gates)
Synchoros VLSI Design Style: O(100 million gates)
~300 MUSD
One-Time Engineering Effort
SiLago Application Level Synthesis
1. Select the optimal solution from M^L solutions
2. Global interconnect, buffers and control
3. Floorplanning
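Step 1 can be pictured as a search over implementation choices: with L algorithms and M library implementations each, there are M^L candidate combinations. The sketch below enumerates them exhaustively and keeps the lowest-energy combination that meets a latency bound. The cost tuples, the additive latency and energy model, and the function name are illustrative assumptions, not the actual SiLago synthesis algorithm.

```python
from itertools import product


def select_optimal(impls_per_algorithm, max_latency):
    """Pick one implementation per algorithm, minimizing energy under a latency bound.

    impls_per_algorithm: list of L lists, each holding M (latency, energy, name) tuples.
    Assumes the algorithms run back to back, so latency and energy simply add up.
    Returns (energy, latency, [names]) or None if no combination meets the bound.
    """
    best = None
    for combo in product(*impls_per_algorithm):   # M**L combinations
        latency = sum(impl[0] for impl in combo)
        energy = sum(impl[1] for impl in combo)
        if latency <= max_latency and (best is None or energy < best[0]):
            best = (energy, latency, [impl[2] for impl in combo])
    return best


# Hypothetical characterization data: L = 2 algorithms, M = 2 implementations each.
library = [
    [(4, 10, "A_fast"), (8, 6, "A_small")],
    [(3, 9, "B_fast"), (6, 4, "B_small")],
]
result = select_optimal(library, max_latency=12)
print(result)
```

Because every implementation is characterized in advance, each candidate can be scored by table lookup. Exhaustive enumeration is used here only for clarity; it grows as M^L and would be replaced by a smarter search in practice.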
[Figure: application level synthesis flow]
Inputs: HPC application (L algorithms), sampling rate, total latency
HPC libs: 1 to M HPC lib implementations (HLS)
SiLago blocks: characterization data, hardened blocks
Outputs: number and types of SiLago blocks + mapping; scripts, mask patterns, reports
Compose ready-to-manufacture chip
HLS tools do not automate this. Manually refined.

SiLago Design Instances =  Region Instances
[Figure: conceptual SiLago design (does not exist) with regions: System Controller, Program Memory, Data Memory, Ethernet, PLL/CGU, PMC, Interrupt Ctrl, Control, TSVs to 3D Memory, Protocol Processing Region, Dense Linear Algebra regions, Scratchpads, NOCs]
NOC components: buffered/pipelined NOC SiLago blocks, NOC switch SiLago blocks, region specific network interface units
Type, number, size and position of SiLago region instances are decided by synthesis tools based on functionality and constraints

SiLago can also potentially reduce the manufacturing cost
[Figure: example design with regions: Protocol Processing, Streaming Storage, Data Storage, Program Storage, System Ctrl, DRAM CTRL, Flash CTRL, Ethernet, PLL/CGU, PMC, Inner Modem, Outer Modem, Flexilators]
All SiLago designs are composed of a finite number of SiLago block types
Each SiLago block can have only a finite number of neighbor types
Each SiLago block's mask, which depends on its neighbor types, can be saved as a component mask
The entire design mask can be composed from such component masks
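The component-mask argument can be sketched as follows: if a block's mask depends only on its own type and the types of its four neighbors, every placed block maps to a key, and only distinct keys need their own component mask. The grid encoding and the four-neighbor assumption are illustrative choices made for this sketch.

```python
def component_mask_keys(grid):
    """Map each distinct (type, north, east, south, west) key to its placements.

    grid: list of rows, each a list of block type names. A neighbor outside
    the design is recorded as None.
    """
    rows, cols = len(grid), len(grid[0])

    def at(r, c):
        return grid[r][c] if 0 <= r < rows and 0 <= c < cols else None

    keys = {}
    for r in range(rows):
        for c in range(cols):
            key = (grid[r][c], at(r - 1, c), at(r, c + 1), at(r + 1, c), at(r, c - 1))
            keys.setdefault(key, []).append((r, c))
    return keys


# Hypothetical 2 x 4 region of identical dense linear algebra (DLA) blocks.
design = [["DLA"] * 4, ["DLA"] * 4]
keys = component_mask_keys(design)
print(len(keys), "component masks for", sum(len(v) for v in keys.values()), "blocks")
```

Here the four corners each need their own component mask, while the two top-edge blocks share one and the two bottom-edge blocks share another. In a larger regular design the number of distinct keys stays bounded while the number of placed blocks grows.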
The DFT cost can also be factored out
The DFT can be made much more efficient, reducing time spent on ATE

What becomes possible
[Figure: energy per frame (mJ, log scale) versus operations of CNN (10 M to 10 G) for five platforms]
Platform costs:
Jetson TX1 (GPU): 600 $
Keystone II (DSP): 667 $
Parallela (Multicore + NoC): 126 $
ZC706 (FPGA): 1600 $
SiLago, 22 nm (volume of 10k): 196 $
Data on GPU, DSP, Parallela and FPGA adapted from G. Hegde, S. Siddhartha, and N. Kapre, “CaffePresso: Accelerating convolutional networks on embedded SoCs,” ACM Transactions on Embedded Computing Systems, vol. 17, 2017.

Going Beyond Moore!
Solutions to go beyond Moore
1. Squeeze more out of CMOS
a. ASIC-like custom functional hardware
b. Delivers 2-4 orders of magnitude better energy-delay product than GPUs, FPGAs and multi-cores
2. Complement CMOS with emerging technologies
a. 2.5D and 3D integration (DRAM)
b. Computation in memory using memristors
c. Plasmonics
Make it easy to use the solution
“Science makes progress, not when you find a solution, but when you make it easy to use the solution”
-- Venki Ramakrishnan

BCPNN
Bayesian Confidence Propagation Neural Network
Professor Anders Lansner

Functional Requirements: Human Scale - Realtime
BCPNN Requirements
1. Realtime simulation
2. 2 million HCUs, non-deterministically concurrent
3. 170 TFlop/s for BCPNN computation
4. 50 TB of synaptic weight storage
5. 200 TB/s of bandwidth for synaptic storage
6. 250 GB/s of spiking bandwidth
Infrastructural Requirements
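Dividing the human-scale totals by the number of HCUs gives a feel for what each HCU needs. The short calculation below assumes decimal prefixes (1 TB = 10^12 bytes) and an even split across HCUs; both are assumptions made for illustration. The resulting storage share agrees with the 25 MB of synaptic memory per HCU quoted in the computation model.

```python
HCUS = 2_000_000            # 2 million HCUs

totals = {
    "flops_per_s": 170e12,        # 170 TFlop/s
    "storage_bytes": 50e12,       # 50 TB
    "storage_bw_bytes_s": 200e12, # 200 TB/s
    "spike_bw_bytes_s": 250e9,    # 250 GB/s
}

# Even split of every total across the HCUs.
per_hcu = {name: value / HCUS for name, value in totals.items()}

for name, value in per_hcu.items():
    print(f"{name}: {value:.3g}")
```

Per HCU this works out to 85 MFlop/s, 25 MB of storage, 100 MB/s of storage bandwidth and 125 kB/s of spiking bandwidth.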

The BCPNN Computation Model
Human Scale Cortex Dimensions
HCU: 100 MCUs, 10000 connections, synaptic memory (25 MB)
HCU =  MCUs
[Figure: HCU synaptic memory organized as MCU rows, with an MCU state vector]
Input spike computation: 10 000 spikes/s
100 × 100 spikes/s
Support computation: 100 / s
Output spike computation
Delay buffer
High level of temporal locality
Column updates are more expensive

Infrastructural Operations are Significant
Incoming Spikes Queue & Controller
iSDIN: incoming Spike Distribution Interconnect
Input Computation Controller
Output Computation Controller
Outgoing Spikes Queue & Controller
oSDIN: outgoing Spike Distribution Interconnect
Delay Buffers & Controller for fan-out spikes
Input Computation: Scratchpad Memories, Input Computation FSM, Input Computation Unit (R1 SP FPUs)
HCU State Storage Memory Interface
ms Timer
Output Computation: Scratchpad Memories, Output Computation FSM, Output Computation Unit (R2 SP FPUs)
170 TFlops
Infrastructural Operations

The Silicon Lego Bricks Method Applied to BCPNN
A Structured Physical Design Scheme to enable System-level synthesis
[Figure: array of H-Tiles separated by NOC corridors. Each H-Tile contains TSVs + Controller, FPUs, SRAMs, an H-Tile Controller, a NOC Interface, and Queues + Controller]

BCPNN Implementation
BCU: Brain Computation Unit
1.529 × 1.729 mm²
32 H-Cubes
32 Micro Channels
H-Cube
8 layers of DRAM
1 Bank per layer
2 Banks / HCU
TSV Micro Channel
4 HCUs (HCU0 to HCU3) + Control
[Figure: DRAM layer floorplan showing Bank 0 (128 Mb), TSV area, RIB and column, with labelled dimensions of 1079 µm, 200 µm, 1489 µm, 250 µm and 240 µm]
In collaboration with Prof. Norbert Wehn and Dr. Christian Weis, TU Kaiserslautern

BCPNN: ASIC vs GPUs
GPUs
# of GK210 cores: 5000
Energy: 563.1 kJ
1 s realtime: 4.69 s simulated time
Energy-delay product: 2642 kJ·s
(b) Energy breakdown, GPUs: Memory 60%, Computation 20%, Infrastructure 20%; 1320 mJ/(HCU·s)
Power: 2.6 MW
ASIC
(a) Energy breakdown, ASIC: DRAM 73%, SRAM 9%, Infrastructure 3%, Computation 15%; 1.52 mJ/(HCU·s)
Energy-delay product: 3.06 kJ·s
Power: 3.0 kW
SpiNNaker-2: comparable to GPUs
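The comparison figures can be cross-checked with a few lines of arithmetic. The sketch below recomputes the GPU energy-delay product from the quoted energy and delay, and scales the per-HCU energy rates to the full system using the 2 million HCUs from the BCPNN requirements. The function names are illustrative, and the small gap between the recomputed and the quoted GPU product is assumed to come from rounding of the inputs.

```python
import math

HCUS = 2_000_000  # human-scale BCPNN: 2 million HCUs


def energy_delay_product(energy_kj, delay_s):
    """Energy-delay product in kJ*s."""
    return energy_kj * delay_s


def system_power_watts(mj_per_hcu_second):
    """Scale a per-HCU energy rate in mJ/(HCU*s) to the whole system, in W."""
    return mj_per_hcu_second * 1e-3 * HCUS


gpu_edp = energy_delay_product(563.1, 4.69)  # about 2641 kJ*s (quoted: 2642)
asic_edp = 3.06                              # quoted directly
ratio = gpu_edp / asic_edp
print(f"EDP ratio GPU/ASIC: {ratio:.0f}x ({math.log10(ratio):.1f} orders of magnitude)")
print(f"GPU power:  {system_power_watts(1320) / 1e6:.2f} MW")
print(f"ASIC power: {system_power_watts(1.52) / 1e3:.2f} kW")
```

The ratio of roughly 860x falls within the claimed 2-4 orders of magnitude improvement in energy-delay product, and the scaled powers agree with the quoted 2.6 MW and 3.0 kW.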

The Impact of Column Access Elimination + Exploiting
Temporal Locality
Baseline: 3 kW; 1.52 mJ/(HCU·s); DRAM 73%, SRAM 9%, Infrastructure 3%, Computation 15%
Column Access Eliminated: 1.88 kW; 0.94 mJ/(HCU·s); DRAM 59%, SRAM 8%, Infrastructure 6%, Computation 27%
Column Access Eliminated + Temporal Locality: 800 W; 0.40 mJ/(HCU·s); DRAM 14%, SRAM 10%, Infrastructure 13%, Computation 63%
Such optimizations can be automatically inferred from simulations
50 GFLOPs / Watt
28 nm bulk CMOS
Sustained, not peak
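Converting the percentage breakdowns into absolute energy per HCU shows where the savings come from. The sketch below multiplies each share by the total energy rate of its variant and scales the totals to 2 million HCUs (the count from the BCPNN requirements). The data structure and names are illustrative.

```python
HCUS = 2_000_000

# (total energy in mJ per HCU per second, percentage share per component)
variants = {
    "baseline": (1.52, {"DRAM": 73, "SRAM": 9, "Infrastructure": 3, "Computation": 15}),
    "column access eliminated": (0.94, {"DRAM": 59, "SRAM": 8, "Infrastructure": 6, "Computation": 27}),
    "column access eliminated + temporal locality": (0.40, {"DRAM": 14, "SRAM": 10, "Infrastructure": 13, "Computation": 63}),
}


def absolute_breakdown(total_mj, shares):
    """Absolute energy per component in mJ/(HCU*s)."""
    return {part: total_mj * pct / 100 for part, pct in shares.items()}


absolute = {name: absolute_breakdown(*data) for name, data in variants.items()}
power_w = {name: data[0] * 1e-3 * HCUS for name, data in variants.items()}

for name in variants:
    dram = absolute[name]["DRAM"]
    comp = absolute[name]["Computation"]
    print(f"{name}: {power_w[name]:.0f} W, DRAM {dram:.3f}, computation {comp:.3f} mJ/(HCU*s)")
```

Under these figures the computation energy stays close to 0.25 mJ/(HCU·s) in all three variants, while the DRAM energy falls from about 1.1 to under 0.06 mJ/(HCU·s). The growing computation share therefore reflects shrinking memory energy, not more computation.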

Interconnect and Storage are Expensive
6.3 pJ
3.2 pJ = moving 32 bits of data over 1 mm ≈ one 32-bit FLOP > accessing 1 bit in 3D integrated DRAM
P. Kogge and J. Shalf, “Exascale computing trends: Adjusting to the ‘new normal’ for computer architecture,” Comput. Sci. Eng., vol. 15, no. 6, 2013.

Computation in Memory using Memristors
[Figure: memristor crossbar with DACs and ADCs]
Source of diagram: Chenchen Liu, Qing Yang, Bonan Yan, Xiaocong Du, Hai (Helen) Li, “A Memristor Crossbar Based Computing Engine Optimized for High Speed and Accuracy”, ISVLSI 2016
Benefits:
1. Single cycle dot product
2. Can be extended to do addition, multiplication, element-wise multiplication and matrix inversion
3. No need to fetch, decode and execute instructions; this addresses the wire problem
4. In some application instances, initialization of the matrix would be a one-time event
Challenges:
1. Large matrices will need to be fragmented, resulting in movement of data. Needs complementary control circuitry
2. ADCs consume significant power and inject latency
3. Accuracy
4. Experimental solutions reported. Not part of mainstream design flow
Reminiscent of Analog Computation
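The single-cycle dot product and the cost of the converters can be illustrated with an idealized numerical model. In the sketch below the input vector is quantized by DACs into voltages, each column current is the sum of conductance times voltage (Ohm's and Kirchhoff's laws), and ADCs quantize the result. The uniform quantizer, the unipolar value range and the bit widths are simplifying assumptions made here; wire resistance, noise and device variation are ignored.

```python
def quantize(x, bits, full_scale):
    """Uniform unipolar quantizer: clamp to [0, full_scale], round to 2**bits - 1 steps."""
    levels = 2 ** bits - 1
    x = min(max(x, 0.0), full_scale)
    return round(x / full_scale * levels) / levels * full_scale


def crossbar_mvm(conductances, inputs, dac_bits=8, adc_bits=8, v_max=1.0):
    """Idealized crossbar: returns one quantized current per column.

    conductances[i][j] connects input row i to output column j.
    """
    volts = [quantize(v, dac_bits, v_max) for v in inputs]           # DACs
    n_rows, n_cols = len(conductances), len(conductances[0])
    currents = [sum(conductances[i][j] * volts[i] for i in range(n_rows))
                for j in range(n_cols)]                              # analog summation
    i_max = v_max * max(sum(row[j] for row in conductances) for j in range(n_cols))
    return [quantize(i, adc_bits, i_max) for i in currents]          # ADCs


G = [[1.0, 0.5],
     [0.25, 0.75]]
x = [0.5, 1.0]
exact = [sum(G[i][j] * x[i] for i in range(2)) for j in range(2)]    # [0.75, 1.0]
high = crossbar_mvm(G, x, adc_bits=8)
low = crossbar_mvm(G, x, adc_bits=2)
print(exact, high, low)
```

With 8-bit converters the result stays within a fraction of a percent of the exact product, while a 2-bit ADC visibly degrades it. This is one way to see why ADC resolution, power and accuracy appear together in the list of challenges.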

Memristor based CIM in the SiLago Framework
Region Types – SiLago Block Types
Functional
Dense Linear Algebra
Sparse Linear Algebra
Inner Modem
Outer Modem
Graph Theory
Protocol Processing
Spectral Methods
Dynamic Programming
State Machines
Infrastructural
NOCs
Scratch Pad Memory
PLL + CGU
Memory Controller
FIFO, FIFO Controller
RISC Processors – RISC-V
NVM
DRAM Vaults
Power Management
Memristor CIM
1. A memristor CIM block is offered in a range of dimensions
2. Characterized with post-layout data and circuit level simulations, and validated with test chips
3. Exports functional matrix operations and infrastructural operations such as initializing the crossbar, NIU operations, reg file operations etc.
4. Higher abstraction synthesis tools can refine in terms of CIM SiLago blocks and know their performance, energy and area.

25 Watt Biologically Plausible Human Scale Brain
Single precision floating point: ~800 Watts
Move to 16/32 bit integer arithmetic: ~250 Watts
1. Synaptic storage/access will reduce by ~50%
2. Computation energy will reduce by ~75%
ReRAM computation in memory: ~25 Watts
~2 TOPs/watt
28 nm bulk CMOS
Caveat: based on best effort estimates and not on actual implementation

Wave Based Computing using Plasmons
1. Logic values encoded as phase of the waves
2. Interference of waves interpreted as majority
gate computation
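The two points above can be sketched numerically: each logic value becomes a unit wave with phase 0 or π, the waves are summed, and the phase of the result gives the majority. The encoding (0 as phase 0, 1 as phase π) and the full adder built from three majority gates are assumptions made for illustration; the adder uses a well-known majority-logic construction and is not necessarily the circuit used in the plasmonic design.

```python
import cmath
import math


def wave(bit):
    """Unit-amplitude wave with phase 0 for logic 0 and phase pi for logic 1."""
    return cmath.exp(1j * math.pi * bit)


def decode(w):
    """Read the logic value back from the phase of a wave."""
    return 1 if w.real < 0 else 0


def majority(a, b, c):
    """Superpose three waves; the phase of the sum is the majority value."""
    return decode(wave(a) + wave(b) + wave(c))


def invert(bit):
    """Inversion is a pi phase shift of the wave."""
    return decode(wave(bit) * cmath.exp(1j * math.pi))


def full_adder(a, b, cin):
    """Full adder from three majority gates and two inverters."""
    cout = majority(a, b, cin)
    total = majority(invert(cout), majority(a, b, invert(cin)), cin)
    return total, cout


for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            print(a, b, cin, "->", full_adder(a, b, cin))
```

With an odd number of inputs the superposed amplitude is never zero, so the majority is always well defined.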

Plasmonics + CMOS Computing using SiLago blocks
PlasmonExa: plasmonics based exa-scale computing design, synthesized in terms of Silicon Lego (SiLago) blocks.
[Figure: SiLago micro-architecture block (full adder) with inputs A, B, Cin and outputs SUM, Cout, built from SiLago majority gates and SiLago interconnect waveguides]
[Figure: plasmonic wave-computing fabric with channels 1 to N: plasmon sources, phase modulators, plasmon waveguides and plasmon detectors, alongside digital CMOS control & memory and THz CMOS driver and control]
[Figure: high speed plasmon logic circuit: electrically driven THz plasmon sources, phase modulators (inputs), demodulator (output), plasmon detector, CMOS logic drivers]

Impact
Software Centric / GPU + Based Computing
Hardware Centric SiLago Based Computing
1. 1000 X Power Density
2. More Affordable
