D15-3 Customization of a Deep Learning Accelerator
Shien-Chun Luo
Industrial Technology Research Institute
25 April 2019
Agenda
• Object Detection Demonstration
• Designing a Highly Efficient Accelerator
• Our Solutions and Some Results
Demonstration of Object Detection
• 256-MAC DLA @ 150 MHz
• ZCU102 FPGA (uses ~40% of its 600k logic cells)
• Ubuntu on ARM A53 @ 1.2 GHz
• USB camera input, DisplayPort output
• Tiny YOLO v1, 448 x 448 RGB input
• 8 CONV layers & 1 FC layer, 3.2 GOPs per inference
• Detection layer runs on the CPU
• VOC dataset, 20 categories
• Original FP32 model: mAP = 40%
• Retrained INT8 (TensorFlow): mAP = 35%
• Average 8 FPS
• Execution time: CONV ~79 ms, FC ~48 ms
FPGA Object Detection Setup
[Figure: FPGA system diagram and control flow]
• DRAM (1 GB): an OS-controlled space plus a region reserved for the DLA (64~256 MB), holding the input image, model weights, temporary activations, and output data
• Processing system: ARM CPU with DRAM controller, USB (camera input) and DisplayPort (output); the DLA resides in the FPGA fabric
• Control flow: program INIT → set parameters → load weights → capture image (YUV) → re-format to RGB → activate DLA → (DLA finished) → post-processing → display
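The setup above implies a straightforward host-side control loop. Below is a minimal sketch in C; every function is a hypothetical placeholder for the platform's real driver calls (the deck does not publish its API), and the YUV buffer size assumes 4:2:2 packing.

```c
/* A minimal host-side control loop mirroring the flow above.
 * All functions are hypothetical placeholders, not the published API. */
#include <stdint.h>
#include <stdbool.h>

extern void dla_init(void);                       /* Program INIT */
extern void dla_set_parameters(const void *cfg);  /* Set parameters */
extern void dla_load_weights(const void *w);      /* Load weights into the reserved DRAM region */
extern void dla_activate(const uint8_t *rgb);     /* Kick off inference */
extern bool dla_finished(void);                   /* "DLA finished" flag */
extern void capture_frame_yuv(uint8_t *yuv);      /* USB camera input */
extern void yuv_to_rgb(const uint8_t *yuv, uint8_t *rgb);
extern void post_process_and_display(void);       /* CPU detection layer + DisplayPort */

void inference_loop(const void *cfg, const void *weights)
{
    /* 448 x 448 RGB input; YUV size assumes 4:2:2 packing. */
    static uint8_t yuv[448 * 448 * 2], rgb[448 * 448 * 3];

    dla_init();
    dla_set_parameters(cfg);
    dla_load_weights(weights);

    for (;;) {
        capture_frame_yuv(yuv);
        yuv_to_rgb(yuv, rgb);           /* Re-format to RGB */
        dla_activate(rgb);
        while (!dla_finished())
            ;                           /* Wait for completion */
        post_process_and_display();
    }
}
```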
Designing a Highly Efficient Accelerator
3 Steps to Achieve Our Goal
1. Increase MAC PEs while keeping their utilization high
2. Increase the data supply to those PEs
3. Improve energy efficiency, adapting to the target models
[Figure: conceptual throughput vs. computation-power curves for AlexNet under various DRAM bandwidths, illustrating steps 2 and 3]
FPS/Throughput of Various Models
-- profiled using a 256-MAC, 128-KB, INT8 DLA inference model
Profiles of Classical Classification Models (1)
AlexNet (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     7.1      9.0     10.4     11.2     11.5      1.6
200 MHz (102 GOPs)   10.0     14.2     18.0     20.7     21.8      2.2
400 MHz (205 GOPs)   12.6     20.0     28.4     35.9     39.4      3.1
800 MHz (410 GOPs)   14.3     25.2     40.1     56.8     66.0      4.6
1000 MHz (512 GOPs)  14.6     26.6     43.6     64.3     76.4      5.2
Compute sensitivity   2.1      3.0      4.2      5.7      6.6
Inception v1 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     8.8      9.1      9.2      9.3      9.3      1.1
200 MHz (102 GOPs)   16.6     17.6     18.1     18.4     18.5      1.1
400 MHz (205 GOPs)   28.3     33.1     35.2     36.2     36.6      1.3
800 MHz (410 GOPs)   41.2     56.6     66.2     70.4     71.8      1.7
1000 MHz (512 GOPs)  44.7     65.2     79.6     86.8     88.9      2.0
Compute sensitivity   5.1      7.2      8.7      9.4      9.6
AlexNet prefers more memory bandwidth because of its weight-heavy FC layers.
Inception prefers more computation power because CNN computation dominates.
↑ Edge devices may limit the DRAM bandwidth budget (profiled with 256 MACs)
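Reading the tables: the two sensitivity figures are consistent with simple FPS ratios, taken along a row (bandwidth swept at fixed clock) or down a column (clock swept at fixed bandwidth):

\[
S_{\mathrm{BW}}(f)=\frac{\mathrm{FPS}(f,\,12~\mathrm{GB/s})}{\mathrm{FPS}(f,\,1~\mathrm{GB/s})},\qquad
S_{\mathrm{comp}}(b)=\frac{\mathrm{FPS}(1000~\mathrm{MHz},\,b)}{\mathrm{FPS}(100~\mathrm{MHz},\,b)}
\]

For AlexNet at 100 MHz, 11.5 / 7.1 ≈ 1.6, matching the table's BW-sensitivity entry. A large S_BW flags a bandwidth-bound model; a large S_comp flags a compute-bound one.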
Profiles of Classical Classification Models (2)
ResNet50 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)     5.0      5.1      5.1      5.1      5.1      1.0
200 MHz (102 GOPs)    9.2     10.0     10.1     10.1     10.2      1.1
400 MHz (205 GOPs)   13.0     18.4     20.1     20.2     20.3      1.6
800 MHz (410 GOPs)   15.6     26.0     36.9     40.1     40.4      2.6
1000 MHz (512 GOPs)  16.1     28.0     42.3     49.5     50.4      3.1
Compute sensitivity   3.2      5.5      8.3      9.7      9.9
MobileNet v1 (224): FPS vs. DRAM bandwidth
Clock (peak)         1 GB/s   2 GB/s   4 GB/s   8 GB/s   12 GB/s   BW sensitivity
100 MHz (51 GOPs)    31.1     33.0     33.5     33.6     33.7      1.1
200 MHz (102 GOPs)   51.2     62.2     66.0     66.9     67.2      1.3
400 MHz (205 GOPs)   62.9    102.5    124.3    131.9    133.4      2.1
800 MHz (410 GOPs)   64.0    125.8    204.9    248.6    259.7      4.1
1000 MHz (512 GOPs)  64.0    127.9    226.0    299.7    318.5      5.0
Compute sensitivity   2.1      3.9      6.8      8.9      9.5
ResNet prefers memory bandwidth and computation power in even balance.
MobileNet prefers more memory bandwidth because DW-CONV layers reduce computation but increase the activation read/write traffic to memory.
↑ Edge devices may limit the DRAM bandwidth budget (profiled with 256 MACs)
Our Solutions and Some Results
Let’s Use a Customizable Architecture
1. Variable CONV processing resources
• 64-MAC to 2048-MAC PE clusters for a single convolution processor
• Variable convolutional-buffer capacity
2. Configurable NN operator processors
• Options for batch normalization, PReLU, scale, bias, quantization, and element-wise operators
• Options for down-sampling (e.g., pooling) operators
• Options for nonlinear LUTs
3. Custom memories and host CPUs
• Can be driven by an MCU or a CPU
• Shared or private DRAM/SRAM
(Architecture revised from NVDLA)
DLA Features – Inherited and Our Changes
1. [Inherited] Channel-first CONV strategy
• Relaxes data dependency and shares the input feature cube
• Any kernel size (n x m) reaches ~100% utilization when channels are deep
2. [Added a tool to verify] Layer fusion to save memory access
• Fuses the popular layer stack [CONV – BN – PReLU – Pool]
• Verified: reduces activation access
3. [Added a tool to verify] Program-time hiding
• Verified: the (N+1)-th layer is programmed while the N-th layer runs (see the sketch below)
4. [Revised HW] Depth-wise CONV support
• Revised the HW from DMA to ACC
5. [Future work] DMA for fast data-dimension changes
• Adding fast up-sampling algorithms and data-dimension reordering
[Figure: channel-first vs. plane-first traversal of the input feature cube (width x height x channels)]
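Program-time hiding (item 3) amounts to double-buffering the register programming: while layer N executes, the host writes layer N+1's register group. A minimal sketch, assuming a hypothetical two-group shadow register file selected by a ping-pong bit (NVDLA-style); all names are illustrative.

```c
/* Ping-pong register programming: configure layer N+1 while layer N runs.
 * dla_write_reg_group(), dla_kickoff(), dla_wait_done() are hypothetical
 * driver helpers; 'group' selects one of two shadow register banks. */
#include <stddef.h>

struct layer_cfg;   /* register settings for one (macro) layer */

extern void dla_write_reg_group(int group, const struct layer_cfg *cfg);
extern void dla_kickoff(int group);
extern void dla_wait_done(int group);

void run_layers(const struct layer_cfg *layers, size_t n)
{
    int group = 0;

    if (n == 0) return;
    dla_write_reg_group(group, &layers[0]);
    dla_kickoff(group);

    for (size_t i = 1; i < n; i++) {
        /* Programming of layer i overlaps execution of layer i-1. */
        dla_write_reg_group(group ^ 1, &layers[i]);
        dla_wait_done(group);
        group ^= 1;
        dla_kickoff(group);
    }
    dla_wait_done(group);
}
```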
Standard Inference with ONNC on Linux Machines

Offline (compile time):
• User's framework model (graph + weights) → Framework Converter → ONNX graph plus model weights with quantize information (quantize info extracted via TensorFlow)
• Compiler (ONNC): the parser partitions the network into CPU tasks and DLA tasks, then emits Loadable files

Online (run time):
• Loadable files → User Mode Driver (UMD) → Kernel Mode Driver (KMD) → Flow Controller (MCU or CPU) → DLA HW
• API and driver run on Linux on top of the hardware
Bottom-up (Bare-metal) Verification Flow

• Model weights + model prototxt → Model Parser → Layer Fusion → Layer Partition → DLA REG CFGs → API
• In parallel, HW-aware quantize insertion (QAT or PTQ) → weight conversion & partition → quantized weights
Simple API example:
• Use "YOLO" or "RESNET-50" as a single function call when there is no breakdown into sub-tasks
• Inside the API, use only {RW REG, INTR, POLL}, which fits any general C compiler (see the sketch below)
Two packages are inserted into main():
1. Load the quantized weights
2. Call the API (NN functions)
(next slide)
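A minimal sketch of what such a bare-metal program could look like, following the {RW REG, INTR, POLL} pattern described above. The register addresses, macros, and function names are hypothetical, not the actual ITRI API; the 9-macro-layer loop matches the Tiny YOLO fusion described two slides ahead.

```c
/* Hypothetical bare-metal usage: load quantized weights, then call the
 * network as one function. Register addresses and names are illustrative. */
#include <stdint.h>

#define DLA_BASE       0x40000000u              /* assumed MMIO base address */
#define DLA_REG(off)   (*(volatile uint32_t *)(DLA_BASE + (off)))
#define DLA_CTRL       0x00u
#define DLA_STATUS     0x04u
#define DLA_START_BIT  0x1u
#define DLA_DONE_BIT   0x1u

extern const uint8_t quantized_weights[];
extern void load_quantized_weights(const uint8_t *w);
extern void program_layer_cfg(int macro_layer);  /* writes the DLA REG CFGs */

/* Inside the API: plain {RW REG, INTR, POLL}, portable C. */
static void run_macro_layer(int macro_layer)
{
    program_layer_cfg(macro_layer);
    DLA_REG(DLA_CTRL) = DLA_START_BIT;           /* RW REG: kick off */
    while (!(DLA_REG(DLA_STATUS) & DLA_DONE_BIT))
        ;                                        /* POLL (or sleep on INTR) */
}

/* "YOLO" as a single function call: 8 fused macro layers + FC9. */
void yolo_v1_tiny(void)
{
    for (int m = 1; m <= 9; m++)
        run_macro_layer(m);
}

int main(void)
{
    load_quantized_weights(quantized_weights);   /* package 1 */
    yolo_v1_tiny();                              /* package 2 */
    return 0;
}
```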
Integer Model Quantization Flows

• Graph path: native training graph (Caffe prototxt, Darknet CFG, ONNX, ...) → Network Converter → Compiler (bare-metal / ONNC)
• Weight path, PTQ: native training weights (TF/Caffe/Darknet/...) → Weight Converter → weights with quantize info (TensorFlow) → DLA driver → DLA HW
  Post-training quantization (PTQ) → more accuracy loss
• Weight path, QAT: retrain the NN graph → retrained weights (TensorFlow) → DLA HW
  Quantization-aware training (QAT) → less accuracy loss
PTQ is available even without HW or compiler results:
■ requires some test data ■ Tiny YOLO v1 mAP drops 40% → 15%
QAT is available once the basic HW inference/fusion details are known:
■ requires training and test data ■ Tiny YOLO v1 mAP drops only 40% → 35%
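As an illustration of what "weights with quantize info" means, here is a generic symmetric per-tensor INT8 PTQ sketch (scale = max|w| / 127). This is a textbook scheme and an assumption, not necessarily the exact scheme used in the ITRI/TensorFlow flow.

```c
/* Generic symmetric per-tensor INT8 post-training quantization.
 * An illustrative textbook scheme, not necessarily the deck's exact flow. */
#include <math.h>
#include <stdint.h>
#include <stddef.h>

/* Returns the scale such that w_float ≈ scale * w_int8. */
float quantize_tensor_int8(const float *w, int8_t *q, size_t n)
{
    float max_abs = 0.0f;
    for (size_t i = 0; i < n; i++)
        if (fabsf(w[i]) > max_abs)
            max_abs = fabsf(w[i]);

    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    for (size_t i = 0; i < n; i++) {
        float v = roundf(w[i] / scale);
        if (v > 127.0f)  v = 127.0f;     /* clamp to the INT8 range */
        if (v < -128.0f) v = -128.0f;
        q[i] = (int8_t)v;
    }
    return scale;   /* stored as the layer's "quantize info" */
}
```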
Tiny YOLO v1 Inference Example
Tiny YOLO v1 (39 DNN layers) → HW inference queue (9 macro layers):
• Layers 1–30: six [CONV – BN – Scale – ReLU – Pool] stacks → macro layers 1–6
• Layers 31–38: two [CONV – BN – Scale – ReLU] stacks → macro layers 7–8
• Layer 39: FC → FC9
• Originally (8-bit data), the minimal feature-map DRAM access is 27.7 MB
• With fusion, total feature-map DRAM access drops to 6.2 MB
• Total weights remain 27 MB
Fusing 5 layers into 1 macro layer [CONV – BN – Scale – PReLU – Pool] removes the intermediate activation traffic between fused layers.
* The detection layer runs on the CPU
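To see where the fused figure comes from, a rough lower-bound sketch: with fusion, only the tensors at macro-layer boundaries travel through DRAM (one write by the producer, one read-back by the consumer). The dimensions below are taken from the RTL table on the next slide; real traffic presumably adds the network input/output and tiling/alignment overheads, so this deliberately undershoots the deck's measured 6.2 MB.

```c
/* Rough lower-bound on fused intermediate-activation DRAM traffic (INT8).
 * Counts only macro-layer boundary tensors (write + read-back); the deck's
 * measured 6.2 MB also covers input/output and tiling overheads. */
#include <stdio.h>

int main(void)
{
    /* Macro-layer input dims (w, h, c) from the RTL results table. */
    const long dims[9][3] = {
        {448, 448, 3},  {224, 224, 16}, {112, 112, 32}, {56, 56, 64},
        {28, 28, 128},  {14, 14, 256},  {7, 7, 512},    {7, 7, 1024},
        {1, 1, 12540}   /* FC9 input vector */
    };
    long fused = 0;
    for (int i = 1; i < 9; i++)   /* boundary tensors between macro layers */
        fused += 2 * dims[i][0] * dims[i][1] * dims[i][2];
    printf("fused boundary traffic >= %.1f MB\n", fused / 1e6);  /* ~3.3 MB */
    return 0;
}
```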
A Quick Glance at RTL Results
[Figure: RTL verification testbench. A layer-queue CFG and a weight generator drive the DLA RTL together with a DRAM model (HEX images); a Caffe-format calculator produces golden results, and a checker compares them against the RTL results via VPI]
Layer     Data DIM      OPs      64-MAC DLA cycles   256-MAC DLA cycles
Hybrid1   448x448x3     193 M         5.8 M                4.39 M
Hybrid2   224x224x16    472 M         4.25 M               2.23 M
Hybrid3   112x112x32    467 M         3.94 M               1.12 M
Hybrid4   56x56x64      465 M         3.82 M               1.04 M
Hybrid5   28x28x128     464 M         3.71 M               0.97 M
Hybrid6   14x14x256     463 M         3.69 M               0.95 M
Hybrid7   7x7x512       463 M         3.66 M               2.41 M
Hybrid8   7x7x1024      231 M         3.52 M               1.6 M
FC9       12540          37 M        14.19 M               9.23 M
Summary                3250 M        46.57 M              23.9 M
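One way to read this table: counting one MAC as two OPs (the convention implied by the deck's own "64 MAC, 50 GOPs @ 400 MHz"), the ideal cycle count is OPs divided by the per-cycle OP capacity, and its ratio to the measured cycles gives the utilization:

\[
U \;=\; \frac{\mathrm{OPs} / (2\,N_{\mathrm{MAC}})}{\text{measured cycles}}
\]

For the 256-MAC DLA, Hybrid6 needs at least 463 M / 512 ≈ 0.90 M cycles against 0.95 M measured (U ≈ 95%), while Hybrid1, whose input has only 3 channels, needs 193 M / 512 ≈ 0.38 M against 4.39 M (U ≈ 9%). This matches the earlier claim that the channel-first strategy approaches 100% utilization only when channels are deep.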
Equation-based Profiler Using the 64-MAC / 128-KB Configuration

Network        Total cycles   Clock rate   Run time per frame    FPS
AlexNet             61 M       400 MHz         152.20 ms          6.57
GoogLeNet           27 M       400 MHz          67.83 ms         14.74
ResNet50           111 M       400 MHz         278.65 ms          3.59
VGG16              395 M       400 MHz         987.55 ms          1.01
Tiny YOLO v1        45 M       400 MHz         112.67 ms          8.88
Tiny YOLO v2        83 M       400 MHz         208.47 ms          4.80
Tiny YOLO v3        55 M       400 MHz         136.21 ms          7.34

• Run time per frame = total cycles / clock rate; FPS is its reciprocal
• Profiler accuracy keeps improving as more models are RTL-simulated
• The same DRAM BW model (~0.5 GB/s) is used
Equation-based Profiler Using the 256-MAC / 128-KB Configuration

Network        Total cycles   Clock rate   Run time per frame    FPS
AlexNet             49 M       400 MHz         122.17 ms          8.19
GoogLeNet           11 M       400 MHz          28.40 ms         35.22
ResNet50            76 M       400 MHz         189.74 ms          5.27
VGG16              214 M       400 MHz         535.99 ms          1.87
Tiny YOLO v1        26 M       400 MHz          65.30 ms         15.31
Tiny YOLO v2        48 M       400 MHz         121.07 ms          8.26
Tiny YOLO v3        24 M       400 MHz          61.55 ms         16.25

• Profiler accuracy keeps improving as more models are RTL-simulated
• The same DRAM BW model (~0.5 GB/s) is used
Use the Profiler to Find a Design Target
For example: I want 30-FPS real-time Tiny YOLO inference; what HW spec does that require? (see the sketch below)
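The deck does not publish the profiler's equations. A common form for an equation-based profiler, and the assumption behind the sketch below, is a per-layer roofline: cycles ≈ max(compute-bound cycles, memory-bound cycles), with 2 OPs per MAC per cycle and each layer's DRAM traffic divided by the bytes available per cycle. The toy layer numbers are purely illustrative.

```c
/* Minimal roofline-style cycle model; an assumed reconstruction,
 * not the published ITRI profiler equations. */
#include <stdio.h>

struct layer { double ops; double dram_bytes; };

/* clock_hz: DLA clock; n_mac: number of MACs; bw: DRAM bandwidth (bytes/s) */
static double profile_fps(const struct layer *l, int n_layers,
                          double clock_hz, int n_mac, double bw)
{
    double total_cycles = 0.0;
    double bytes_per_cycle = bw / clock_hz;
    for (int i = 0; i < n_layers; i++) {
        double compute = l[i].ops / (2.0 * n_mac);        /* 1 MAC = 2 OPs */
        double memory  = l[i].dram_bytes / bytes_per_cycle;
        total_cycles  += (compute > memory) ? compute : memory;
    }
    return clock_hz / total_cycles;
}

int main(void)
{
    /* Hypothetical two-layer toy network, for illustration only. */
    const struct layer net[] = { { 472e6, 1.2e6 }, { 37e6, 27e6 } };

    /* Sweep the clock (MAC count and bandwidth can be swept the same way)
     * to turn a target like "30-FPS Tiny YOLO" into a minimum HW spec. */
    for (double mhz = 100; mhz <= 1000; mhz *= 2)
        printf("%4.0f MHz: %.1f FPS\n", mhz,
               profile_fps(net, 2, mhz * 1e6, 256, 0.5e9));
    return 0;
}
```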
ASIC Implementation: a USB Accelerator for Legacy Machines

[Figure: block diagram. A host running the Linux SDK + API connects through a USB bridge (GPIF, parallel bus) to the SoC or FPGA, where a RISC-V core with cache, the DLA, a DRAM interface (to DRAM), and APB peripherals share an AXI interconnect]
[Figure: FPGA prototype screenshots of a fused layer's input and output]
[Figure: ASIC layout view. TSMC 65 nm, 3.2 x 3.2 mm, 64-MAC, 128 KB, nv_small configuration]
Chip and Board
• Technology: TSMC 65 nm, core voltage 1 V
• Performance: 64 MACs, 50 GOPs @ 400 MHz
• DLA average power: 60 mW
EVA board & die photo; more information about this chip will be published later
Conclusions
• Adapt and customize HW resources if you already have candidate models
• An end-to-end edge-AI solution is presented here for your reference
• Integer DLAs require especially tight cooperation among HW, SW, and training
~ Thanks for Your Attention ~