D15-3 Customization of a Deep Learning Accelerator
Shien-Chun Luo
Industrial Technology Research Institute
25 April 2019
Agenda
• Object Detection Demonstration
• Designing a Highly Efficient Accelerator
• Our Solutions and Some Results
Demonstration of Object Detection
• 256-MAC DLA @ 150 MHz
• ZCU102 FPGA (uses ~40% of its 600k logic cells)
• Ubuntu on ARM A53 @ 1.2 GHz
• USB camera input, DisplayPort output
• Tiny YOLO v1, 448 x 448 RGB input
• 8 CONV layers & 1 FC layer, 3.2 GOPs per inference
• Detection layer runs on the CPU
• VOC dataset, 20 categories
• Original FP32 model: mAP = 40%
• Retrained INT8 (TensorFlow): mAP = 35%
• Average 8 FPS
• Execution time: CONV ~79 ms, FC ~48 ms
FPGA Object Detection Setup
[Figure: FPGA system diagram and control flow]
• DRAM (1 GB): an OS-controlled space plus a region reserved for the DLA (64~256 MB), holding the input image, model weights, temporary activations, and output data
• Processing system: ARM CPU with DRAM controller, USB (camera input) and DisplayPort (output); the DLA resides in the FPGA fabric
• Control flow: program INIT → set parameters → load weights → capture image (YUV) → re-format to RGB → activate DLA → (DLA finished) → post-processing → display
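The setup above implies a straightforward host-side control loop. Below is a minimal sketch in C; every function is a hypothetical placeholder for the platform's real driver calls (the deck does not publish its API), and the YUV buffer size assumes 4:2:2 packing.

```c
/* A minimal host-side control loop mirroring the flow above.
 * All functions are hypothetical placeholders, not the published API. */
#include <stdint.h>
#include <stdbool.h>

extern void dla_init(void);                       /* Program INIT */
extern void dla_set_parameters(const void *cfg);  /* Set parameters */
extern void dla_load_weights(const void *w);      /* Load weights into the reserved DRAM region */
extern void dla_activate(const uint8_t *rgb);     /* Kick off inference */
extern bool dla_finished(void);                   /* "DLA finished" flag */
extern void capture_frame_yuv(uint8_t *yuv);      /* USB camera input */
extern void yuv_to_rgb(const uint8_t *yuv, uint8_t *rgb);
extern void post_process_and_display(void);       /* CPU detection layer + DisplayPort */

void inference_loop(const void *cfg, const void *weights)
{
    /* 448 x 448 RGB input; YUV size assumes 4:2:2 packing. */
    static uint8_t yuv[448 * 448 * 2], rgb[448 * 448 * 3];

    dla_init();
    dla_set_parameters(cfg);
    dla_load_weights(weights);

    for (;;) {
        capture_frame_yuv(yuv);
        yuv_to_rgb(yuv, rgb);           /* Re-format to RGB */
        dla_activate(rgb);
        while (!dla_finished())
            ;                           /* Wait for completion */
        post_process_and_display();
    }
}
```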
Designing a Highly Efficient Accelerator
3 Steps to Achieve Our Goal
1. Increase MAC PEs while keeping their utilization high
2. Increase the data supply to those PEs
3. Improve energy efficiency, adapting to the target models
[Figure: conceptual throughput vs. computation-power curves for AlexNet under various DRAM bandwidths, illustrating steps 2 and 3]
FPS/Throughput of Various Models
-- profiled using a 256-MAC, 128-KB, INT8 DLA inference model
Profiles of Classical Classification Models (1)
AlexNet (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     7.1      9.0     10.4     11.2     11.5      1.6
200 MHz (102 GOPs)   10.0     14.2     18.0     20.7     21.8      2.2
400 MHz (205 GOPs)   12.6     20.0     28.4     35.9     39.4      3.1
800 MHz (410 GOPs)   14.3     25.2     40.1     56.8     66.0      4.6
1000 MHz (512 GOPs)  14.6     26.6     43.6     64.3     76.4      5.2
Compute sensitivity   2.1      3.0      4.2      5.7      6.6
Inception v1 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     8.8      9.1      9.2      9.3      9.3      1.1
200 MHz (102 GOPs)   16.6     17.6     18.1     18.4     18.5      1.1
400 MHz (205 GOPs)   28.3     33.1     35.2     36.2     36.6      1.3
800 MHz (410 GOPs)   41.2     56.6     66.2     70.4     71.8      1.7
1000 MHz (512 GOPs)  44.7     65.2     79.6     86.8     88.9      2.0
Compute sensitivity   5.1      7.2      8.7      9.4      9.6
AlexNet prefers more memory bandwidth because of its weight-heavy FC layers.
Inception prefers more computation power because CNN computation dominates.
↑ Edge devices may limit the DRAM bandwidth budget (profiled with 256 MACs)
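Reading the tables: the two sensitivity figures are consistent with simple FPS ratios, taken along a row (bandwidth swept at fixed clock) or down a column (clock swept at fixed bandwidth):

\[
S_{\mathrm{BW}}(f)=\frac{\mathrm{FPS}(f,\,12~\mathrm{GB/s})}{\mathrm{FPS}(f,\,1~\mathrm{GB/s})},\qquad
S_{\mathrm{comp}}(b)=\frac{\mathrm{FPS}(1000~\mathrm{MHz},\,b)}{\mathrm{FPS}(100~\mathrm{MHz},\,b)}
\]

For AlexNet at 100 MHz, 11.5 / 7.1 ≈ 1.6, matching the table's BW-sensitivity entry. A large S_BW flags a bandwidth-bound model; a large S_comp flags a compute-bound one.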
Profiles of Classical Classification Models (2)
ResNet50 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     5.0      5.1      5.1      5.1      5.1      1.0
200 MHz (102 GOPs)    9.2     10.0     10.1     10.1     10.2      1.1
400 MHz (205 GOPs)   13.0     18.4     20.1     20.2     20.3      1.6
800 MHz (410 GOPs)   15.6     26.0     36.9     40.1     40.4      2.6
1000 MHz (512 GOPs)  16.1     28.0     42.3     49.5     50.4      3.1
Compute sensitivity   3.2      5.5      8.3      9.7      9.9
MobileNet v1 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)    31.1     33.0     33.5     33.6     33.7      1.1
200 MHz (102 GOPs)   51.2     62.2     66.0     66.9     67.2      1.3
400 MHz (205 GOPs)   62.9    102.5    124.3    131.9    133.4      2.1
800 MHz (410 GOPs)   64.0    125.8    204.9    248.6    259.7      4.1
1000 MHz (512 GOPs)  64.0    127.9    226.0    299.7    318.5      5.0
Compute sensitivity   2.1      3.9      6.8      8.9      9.5
ResNet prefers memory bandwidth and computation power in even balance.
MobileNet prefers more memory bandwidth because DW-CONV layers reduce computation but increase the activation read/write traffic to memory.
↑ Edge devices may limit the DRAM bandwidth budget (profiled with 256 MACs)
Our Solutions and Some Results
Let’s Use a Customizable Architecture
1. Variable CONV processing resources
• 64-MAC to 2048-MAC PE clusters for a single convolution processor
• Variable convolutional-buffer capacity
2. Configurable NN operator processors
• Options for batch normalization, PReLU, scale, bias, quantization, and element-wise operators
• Options for down-sampling (e.g., pooling) operators
• Options for nonlinear LUTs
3. Custom memories and host CPUs
• Can be driven by an MCU or a CPU
• Shared or private DRAM/SRAM
(Architecture revised from NVDLA)
DLA Features – Inherited and Our Changes
1. [Inherited] Channel-first CONV strategy
• Relaxes data dependency and shares the input feature cube
• Any kernel size (n x m) reaches ~100% utilization when channels are deep
2. [Added a tool to verify] Layer fusion to save memory access
• Fuses the popular layer stack [CONV – BN – PReLU – Pool]
• Verified: reduces activation access
3. [Added a tool to verify] Program-time hiding
• Verified: the (N+1)-th layer is programmed while the N-th layer runs (see the sketch below)
4. [Revised HW] Depth-wise CONV support
• Revised the HW from DMA to ACC
5. [Future work] DMA for fast data-dimension changes
• Adding fast up-sampling algorithms and data-dimension reordering
[Figure: channel-first vs. plane-first traversal of the input feature cube (width x height x channels)]
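Program-time hiding (item 3) amounts to double-buffering the register programming: while layer N executes, the host writes layer N+1's register group. A minimal sketch, assuming a hypothetical two-group shadow register file selected by a ping-pong bit (NVDLA-style); all names are illustrative.

```c
/* Ping-pong register programming: configure layer N+1 while layer N runs.
 * dla_write_reg_group(), dla_kickoff(), dla_wait_done() are hypothetical
 * driver helpers; 'group' selects one of two shadow register banks. */
#include <stddef.h>

struct layer_cfg;   /* register settings for one (macro) layer */

extern void dla_write_reg_group(int group, const struct layer_cfg *cfg);
extern void dla_kickoff(int group);
extern void dla_wait_done(int group);

void run_layers(const struct layer_cfg *layers, size_t n)
{
    int group = 0;

    if (n == 0) return;
    dla_write_reg_group(group, &layers[0]);
    dla_kickoff(group);

    for (size_t i = 1; i < n; i++) {
        /* Programming of layer i overlaps execution of layer i-1. */
        dla_write_reg_group(group ^ 1, &layers[i]);
        dla_wait_done(group);
        group ^= 1;
        dla_kickoff(group);
    }
    dla_wait_done(group);
}
```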
Standard Inference with ONNC on Linux Machines

Offline (compile time):
• User's framework model (graph + weights) → Framework Converter → ONNX graph plus model weights with quantize information (quantize info extracted via TensorFlow)
• Compiler (ONNC): the parser partitions the network into CPU tasks and DLA tasks, then emits Loadable files

Online (run time):
• Loadable files → User Mode Driver (UMD) → Kernel Mode Driver (KMD) → Flow Controller (MCU or CPU) → DLA HW
• API and driver run on Linux on top of the hardware
Bottom-up (Bare-metal) Verification Flow

• Model weights + model prototxt → Model Parser → Layer Fusion → Layer Partition → DLA REG CFGs → API
• In parallel, HW-aware quantize insertion (QAT or PTQ) → weight conversion & partition → quantized weights
Simple API example:
• Use "YOLO" or "RESNET-50" as a single function call when there is no breakdown into sub-tasks
• Inside the API, use only {RW REG, INTR, POLL}, which fits any general C compiler (see the sketch below)
Two packages are inserted into main():
1. Load the quantized weights
2. Call the API (NN functions)
(next slide)
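A minimal sketch of what such a bare-metal program could look like, following the {RW REG, INTR, POLL} pattern described above. The register addresses, macros, and function names are hypothetical, not the actual ITRI API; the 9-macro-layer loop matches the Tiny YOLO fusion described two slides ahead.

```c
/* Hypothetical bare-metal usage: load quantized weights, then call the
 * network as one function. Register addresses and names are illustrative. */
#include <stdint.h>

#define DLA_BASE       0x40000000u              /* assumed MMIO base address */
#define DLA_REG(off)   (*(volatile uint32_t *)(DLA_BASE + (off)))
#define DLA_CTRL       0x00u
#define DLA_STATUS     0x04u
#define DLA_START_BIT  0x1u
#define DLA_DONE_BIT   0x1u

extern const uint8_t quantized_weights[];
extern void load_quantized_weights(const uint8_t *w);
extern void program_layer_cfg(int macro_layer);  /* writes the DLA REG CFGs */

/* Inside the API: plain {RW REG, INTR, POLL}, portable C. */
static void run_macro_layer(int macro_layer)
{
    program_layer_cfg(macro_layer);
    DLA_REG(DLA_CTRL) = DLA_START_BIT;           /* RW REG: kick off */
    while (!(DLA_REG(DLA_STATUS) & DLA_DONE_BIT))
        ;                                        /* POLL (or sleep on INTR) */
}

/* "YOLO" as a single function call: 8 fused macro layers + FC9. */
void yolo_v1_tiny(void)
{
    for (int m = 1; m <= 9; m++)
        run_macro_layer(m);
}

int main(void)
{
    load_quantized_weights(quantized_weights);   /* package 1 */
    yolo_v1_tiny();                              /* package 2 */
    return 0;
}
```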
Integer Model Quantization Flows

• Graph path: native training graph (Caffe prototxt, Darknet CFG, ONNX, ...) → Network Converter → Compiler (bare-metal / ONNC)
• Weight path, PTQ: native training weights (TF/Caffe/Darknet/...) → Weight Converter → weights with quantize info (TensorFlow) → DLA driver → DLA HW
  Post-training quantization (PTQ) → more accuracy loss
• Weight path, QAT: retrain the NN graph → retrained weights (TensorFlow) → DLA HW
  Quantization-aware training (QAT) → less accuracy loss
PTQ is available even without HW or compiler results:
■ requires some test data ■ Tiny YOLO v1 mAP drops 40% → 15%
QAT is available once the basic HW inference/fusion details are known:
■ requires training and test data ■ Tiny YOLO v1 mAP drops only 40% → 35%
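As an illustration of what "weights with quantize info" means, here is a generic symmetric per-tensor INT8 PTQ sketch (scale = max|w| / 127). This is a textbook scheme and an assumption, not necessarily the exact scheme used in the ITRI/TensorFlow flow.

```c
/* Generic symmetric per-tensor INT8 post-training quantization.
 * An illustrative textbook scheme, not necessarily the deck's exact flow. */
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Returns the scale such that w_float ≈ scale * w_int8. */
float quantize_tensor_int8(const float *w, int8_t *q, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = roundf(w[i] / scale);
        if (v > 127.0f)  v = 127.0f;     /* clamp to the INT8 range */
        if (v < -128.0f) v = -128.0f;
        q[i] = (int8_t)v;
    }
    return scale;   /* stored as the layer's "quantize info" */
}
```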
Tiny YOLO v1 Inference Example
Tiny YOLO v1 (39 DNN layers) → HW inference queue (9 macro layers):
• Layers 1–30: six [CONV – BN – Scale – ReLU – Pool] stacks → macro layers 1–6
• Layers 31–38: two [CONV – BN – Scale – ReLU] stacks → macro layers 7–8
• Layer 39: FC → FC9
• Originally (8-bit data), the minimal feature-map DRAM access is 27.7 MB
• With fusion, total feature-map DRAM access drops to 6.2 MB
• Total weights remain 27 MB
Fusing 5 layers into 1 macro layer [CONV – BN – Scale – PReLU – Pool] removes the intermediate activation traffic between fused layers.
* The detection layer runs on the CPU
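To see where the fused figure comes from, a rough lower-bound sketch: with fusion, only the tensors at macro-layer boundaries travel through DRAM (one write by the producer, one read-back by the consumer). The dimensions below are taken from the RTL table on the next slide; real traffic presumably adds the network input/output and tiling/alignment overheads, so this deliberately undershoots the deck's measured 6.2 MB.

```c
/* Rough lower-bound on fused intermediate-activation DRAM traffic (INT8).
 * Counts only macro-layer boundary tensors (write + read-back); the deck's
 * measured 6.2 MB also covers input/output and tiling overheads. */
#include <stdio.h>

int main(void)
{
    /* Macro-layer input dims (w, h, c) from the RTL results table. */
    const long dims[9][3] = {
        {448, 448, 3},  {224, 224, 16}, {112, 112, 32}, {56, 56, 64},
        {28, 28, 128},  {14, 14, 256},  {7, 7, 512},    {7, 7, 1024},
        {1, 1, 12540}   /* FC9 input vector */
    };
    long fused = 0;
    for (int i = 1; i < 9; i++)   /* boundary tensors between macro layers */
        fused += 2 * dims[i][0] * dims[i][1] * dims[i][2];
    printf("fused boundary traffic >= %.1f MB\n", fused / 1e6);  /* ~3.3 MB */
    return 0;
}
```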
A Quick Glance at RTL Results
[Figure: RTL verification testbench. A layer-queue CFG and a weight generator drive the DLA RTL together with a DRAM model (HEX images); a Caffe-format calculator produces golden results, and a checker compares them against the RTL results via VPI]
Layer     Data DIM      OPs      64-MAC DLA cycles   256-MAC DLA cycles
Hybrid1   448x448x3     193 M         5.8 M                4.39 M
Hybrid2   224x224x16    472 M         4.25 M               2.23 M
Hybrid3   112x112x32    467 M         3.94 M               1.12 M
Hybrid4   56x56x64      465 M         3.82 M               1.04 M
Hybrid5   28x28x128     464 M         3.71 M               0.97 M
Hybrid6   14x14x256     463 M         3.69 M               0.95 M
Hybrid7   7x7x512       463 M         3.66 M               2.41 M
Hybrid8   7x7x1024      231 M         3.52 M               1.6 M
FC9       12540          37 M        14.19 M               9.23 M
Summary                3250 M        46.57 M              23.9 M
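One way to read this table: counting one MAC as two OPs (the convention implied by the deck's own "64 MAC, 50 GOPs @ 400 MHz"), the ideal cycle count is OPs divided by the per-cycle OP capacity, and its ratio to the measured cycles gives the utilization:

\[
U \;=\; \frac{\mathrm{OPs} / (2\,N_{\mathrm{MAC}})}{\text{measured cycles}}
\]

For the 256-MAC DLA, Hybrid6 needs at least 463 M / 512 ≈ 0.90 M cycles against 0.95 M measured (U ≈ 95%), while Hybrid1, whose input has only 3 channels, needs 193 M / 512 ≈ 0.38 M against 4.39 M (U ≈ 9%). This matches the earlier claim that the channel-first strategy approaches 100% utilization only when channels are deep.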
Equation-based Profiler Using the 64-MAC / 128-KB Configuration

Network        Total cycles   Clock rate   Run time per frame    FPS
AlexNet             61 M       400 MHz         152.20 ms          6.57
GoogLeNet           27 M       400 MHz          67.83 ms         14.74
ResNet50           111 M       400 MHz         278.65 ms          3.59
VGG16              395 M       400 MHz         987.55 ms          1.01
Tiny YOLO v1        45 M       400 MHz         112.67 ms          8.88
Tiny YOLO v2        83 M       400 MHz         208.47 ms          4.80
Tiny YOLO v3        55 M       400 MHz         136.21 ms          7.34

• Run time per frame = total cycles / clock rate; FPS is its reciprocal
• Profiler accuracy keeps improving as more models are RTL-simulated
• The same DRAM BW model (~0.5 GB/s) is used
Equation-based Profiler Using the 256-MAC / 128-KB Configuration

Network        Total cycles   Clock rate   Run time per frame    FPS
AlexNet             49 M       400 MHz         122.17 ms          8.19
GoogLeNet           11 M       400 MHz          28.40 ms         35.22
ResNet50            76 M       400 MHz         189.74 ms          5.27
VGG16              214 M       400 MHz         535.99 ms          1.87
Tiny YOLO v1        26 M       400 MHz          65.30 ms         15.31
Tiny YOLO v2        48 M       400 MHz         121.07 ms          8.26
Tiny YOLO v3        24 M       400 MHz          61.55 ms         16.25

• Profiler accuracy keeps improving as more models are RTL-simulated
• The same DRAM BW model (~0.5 GB/s) is used
Use the Profiler to Find a Design Target
For example: I want 30-FPS real-time Tiny YOLO inference; what HW spec does that require? (see the sketch below)
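The deck does not publish the profiler's equations. A common form for an equation-based profiler, and the assumption behind the sketch below, is a per-layer roofline: cycles ≈ max(compute-bound cycles, memory-bound cycles), with 2 OPs per MAC per cycle and each layer's DRAM traffic divided by the bytes available per cycle. The toy layer numbers are purely illustrative.

```c
/* Minimal roofline-style cycle model; an assumed reconstruction,
 * not the published ITRI profiler equations. */
#include <stdio.h>

struct layer { double ops; double dram_bytes; };

/* clock_hz: DLA clock; n_mac: number of MACs; bw: DRAM bandwidth (bytes/s) */
static double profile_fps(const struct layer *l, int n_layers,
                          double clock_hz, int n_mac, double bw)
{
    double total_cycles = 0.0;
    double bytes_per_cycle = bw / clock_hz;
    for (int i = 0; i < n_layers; i++) {
        double compute = l[i].ops / (2.0 * n_mac);        /* 1 MAC = 2 OPs */
        double memory  = l[i].dram_bytes / bytes_per_cycle;
        total_cycles  += (compute > memory) ? compute : memory;
    }
    return clock_hz / total_cycles;
}

int main(void)
{
    /* Hypothetical two-layer toy network, for illustration only. */
    const struct layer net[] = { { 472e6, 1.2e6 }, { 37e6, 27e6 } };

    /* Sweep the clock (MAC count and bandwidth can be swept the same way)
     * to turn a target like "30-FPS Tiny YOLO" into a minimum HW spec. */
    for (double mhz = 100; mhz <= 1000; mhz *= 2)
        printf("%4.0f MHz: %.1f FPS\n", mhz,
               profile_fps(net, 2, mhz * 1e6, 256, 0.5e9));
    return 0;
}
```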
ASIC Implementation: a USB Accelerator for Legacy Machines

[Figure: block diagram. A host running the Linux SDK + API connects through a USB bridge (GPIF, parallel bus) to the SoC or FPGA, where a RISC-V core with cache, the DLA, a DRAM interface (to DRAM), and APB peripherals share an AXI interconnect]
[Figure: FPGA prototype screenshots of a fused layer's input and output]
[Figure: ASIC layout view. TSMC 65 nm, 3.2 x 3.2 mm, 64-MAC, 128 KB, nv_small configuration]
Chip and Board
• Technology: TSMC 65 nm, core voltage 1 V
• Performance: 64 MACs, 50 GOPs @ 400 MHz
• DLA average power: 60 mW
EVA board & die photo; more information about this chip will be published later
Conclusions
• Adapt and customize HW resources if you already have candidate models
• An end-to-end edge-AI solution is presented here for your reference
• Integer DLAs require especially tight cooperation among HW, SW, and training
~ Thanks for Your Attention ~