A High-speed Low-power Deep Neural
Network on an FPGA based on the Nested
RNS: Applied to an Object Detector
Hiroki Nakahara, Tokyo Institute of Technology, Japan
Tsutomu Sasao, Meiji University, Japan
ISCAS2018
@Florence
Outline
• Background
• YOLOv2
• Convolutional Neural Network (CNN)
• Nested RNS (NRNS) for YOLOv2
• Experimental Results
• Conclusion
2
Image Classification by NN
Input Neural Network (NN) Output
3
Cat
(92%)
Improved by AlexNet (Deep Learning)
Why?
4
Big data
High-Performance
Computing
Algorithm
& Data Structure
Object Detection
5
Son
Baby
Daughter
• Detect multiple objects at a time
• High performance per power is necessary
Problem Definition
• Detecting and classifying multiple objects at the same time
• Evaluation criteria (from Pascal VOC):
6
Ground truth
annotation
Detection results:
>50% overlap of
bounding box (BBox)
with ground truth
One BBox for each
object
Confidence value
for each object
Person (50%)
Average Precision (AP): mean of the interpolated precision at recall levels r ∈ {0, 0.1, …, 1.0}
YOLOv2
(You Only Look Once version 2)
7
Input
Image
(Frame)
Feature maps
CONV+Pooling
CNN
CONV+Pooling
Class score
Bounding Box
Detection
J. Redmon and A. Farhadi, “YOLO9000: Better, Faster, Stronger," arXiv preprint arXiv:1612.08242, 2016.
• Single CNN (One-shot) object detector
• Both classification and BBox estimation for each grid cell
2D Convolutional Operation
8
Input feature map
Output feature map
Kernel
(Binary)
y = X0,0 × W0,0 + X0,1 × W0,1 + X0,2 × W0,2
  + X1,0 × W1,0 + X1,1 × W1,1 + X1,2 × W1,2
  + X2,0 × W2,0 + X2,1 × W2,1 + X2,2 × W2,2
• Computationally intensive part of YOLOv2
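A minimal Python sketch of this 3×3 multiply-accumulate (illustrative only, not the authors' implementation; the array values are made up):

```python
def conv2d_output(x_patch, w_kernel):
    """One output value of a 2D convolution: a 3x3 multiply-accumulate."""
    y = 0
    for i in range(3):
        for j in range(3):
            y += x_patch[i][j] * w_kernel[i][j]   # X[i,j] x W[i,j]
    return y

x_patch  = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]      # a 3x3 window of the input feature map
w_kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]      # a 3x3 kernel
print(conv2d_output(x_patch, w_kernel))           # 36
```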
FPGA Realization of 2D Convolutional Layer
• Requires more than a billion MACs
• Our realization:
• Time multiplexing
• Nested Residue Number System (NRNS)
9
[Figure: a time-multiplexed binary multiply-adder tree, fed from off-chip memory through BRAMs, is transformed into fully parallel per-modulus multipliers with Binary2RNS/RNS2Binary converters: full parallelization with RNS.]
Residue Number System (RNS)
• Defined by a set of L mutually prime integer constants
〈m1,m2,...,mL〉
• No pair of moduli shares a common factor
• Typically, prime numbers are used as moduli
• An arbitrary integer X can be uniquely represented by
a tuple of L integers (X1,X2,…,XL), where
Xi = X mod mi  (i = 1, …, L)
• Dynamic range: M = m1 × m2 × ⋯ × mL
10
Parallel Multiplication
Multiplication on RNS
• Moduli set〈3,4,5〉, X=8, Y=2
• Z=X×Y=16=(1,0,1)
• X=(2,0,3), Y=(2,2,2)
Z=(4 mod 3,0 mod 4,6 mod 5)
=(1,0,1)=16
11
Binary2RNS Conversion
RNS2Binary Conversion
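A minimal Python sketch of the same example on the moduli 〈3,4,5〉; a naive CRT search stands in for the RNS2Binary converter (function names are illustrative):

```python
MODULI = (3, 4, 5)                       # pairwise coprime, dynamic range M = 60

def bin2rns(x):
    return tuple(x % m for m in MODULI)  # Binary2RNS conversion

def rns_mul(xs, ys):
    return tuple((a * b) % m for a, b, m in zip(xs, ys, MODULI))  # channel-wise multiply

def rns2bin(zs):
    # RNS2Binary conversion by naive CRT search (fine for M = 60)
    return next(z for z in range(3 * 4 * 5) if bin2rns(z) == tuple(zs))

X, Y = 8, 2
Z = rns_mul(bin2rns(X), bin2rns(Y))      # (2,0,3) * (2,2,2) -> (1,0,1)
print(Z, rns2bin(Z))                     # (1, 0, 1) 16
```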
Binary2RNS Converter
12
X    X mod 2    X mod 3    X mod 4
0       0          0          0
1       1          1          1
2       0          2          2
3       1          0          3
4       0          1          0
5       1          2          1
Functional Decomposition
13
Decomposition chart of f(X1, X2): bound variables X1 = (x1, x2) index the columns, free variables X2 = (x3, x4) index the rows.

X2 \ X1   00  01  10  11
  00       0   1   0   1
  01       1   1   1   1
  10       1   0   1   0
  11       1   0   1   0

Column multiplicity = 2, so f(X1, X2) = g(h(X1), X2) with a 1-bit intermediate function h:

x1      0  0  1  1
x2      0  1  0  1
h(X1)   0  1  0  1

X2 \ h(X1)   0  1
  00         0  1
  01         1  1
  10         1  0
  11         1  0

Memory: 2^4 × 1 = 16 [bit] → 2^2 × 1 + 2^3 × 1 = 12 [bit]
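A short Python sketch (illustrative, not the authors' tool) that tabulates the decomposition chart above, counts its column multiplicity, and derives h(X1):

```python
# Truth table read off the chart above: keys are (X1, X2) as integer indices
F = {(X1, X2): v
     for X2, row in enumerate([[0, 1, 0, 1],
                               [1, 1, 1, 1],
                               [1, 0, 1, 0],
                               [1, 0, 1, 0]])
     for X1, v in enumerate(row)}

def column_multiplicity(f, n_bound, n_free):
    patterns, h = {}, {}
    for X1 in range(1 << n_bound):                        # bound variables index the columns
        col = tuple(f[(X1, X2)] for X2 in range(1 << n_free))
        h[X1] = patterns.setdefault(col, len(patterns))   # label each distinct column pattern
    return len(patterns), h

mu, h = column_multiplicity(F, n_bound=2, n_free=2)
print(mu)   # 2  -> h(X1) needs ceil(log2(2)) = 1 bit
print(h)    # {0: 0, 1: 1, 2: 0, 3: 1}, i.e. h(X1) = 0,1,0,1 as on the slide
```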
Decomposition Chart for X mod 3
14
Bound variables: X2 = (x3, x4, x5) index the columns; free variables: X1 = (x1, x2) index the rows. Each entry is X mod 3 (0, 1, 2, 0, 1, 2, …).

X1 \ X2   000 001 010 011 100 101 110 111
  00       0   1   2   0   1   2   0   1
  01       1   2   0   1   2   0   1   2
  10       2   0   1   2   0   1   2   0
  11       0   1   2   0   1   2   0   1

Only three distinct column patterns occur, so the column multiplicity is 3.
Decomposition Chart for X mod 3
15
Compressed chart: the bound variables X2 = (x3, x4, x5) are replaced by the intermediate function h(X2) ∈ {0, 1, 2}; the free variables X1 = (x1, x2) index the rows.

X1 \ h(X2)   0  1  2
  00         0  1  2
  01         1  2  0
  10         2  0  1
  11         0  1  2

x3      0  0  0  0  1  1  1  1
x4      0  0  1  1  0  0  1  1
x5      0  1  0  1  0  1  0  1
h(X2)   0  1  2  0  1  2  0  1
Binary2RNS Converter
16
LUT cascade for X mod m1
LUT cascade for X mod m2
[Each LUT cascade is implemented as a chain of BRAMs.]
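A minimal Python sketch of the cascade idea, assuming each stage consumes 4 input bits and passes the running residue to the next BRAM stage (the bit grouping and modulus are chosen for illustration):

```python
# One cell LUT maps (running residue, next K input bits) -> new residue.
M, K, N = 17, 4, 16                       # modulus, bits per stage, input width (assumed)
cell = {(r, d): ((r << K) | d) % M for r in range(M) for d in range(1 << K)}

def lut_cascade_mod(x):
    r = 0
    for pos in range(N - K, -1, -K):      # consume the input MSB-first, one stage per digit
        r = cell[(r, (x >> pos) & ((1 << K) - 1))]
    return r

# The cascade agrees with a direct mod for every 16-bit input
assert all(lut_cascade_mod(x) == x % M for x in range(1 << N))
print("LUT cascade for X mod 17 verified")
```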
RNS2Binary Converter (m=30)
17
x1  y1
0    0
1   15

x2  y2
0    0
1   10
2   20

x3  y3
0    0
1    6
2   12
3   18
4   24

[y1, y2, and y3 are summed by a chain of mod-m adders to give X = (y1 + y2 + y3) mod 30.]
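The value m = 30 suggests the moduli 〈2, 3, 5〉; under that assumption, a Python sketch of how the y1/y2/y3 tables follow from the CRT weights and how a final mod-30 sum recovers X (pow(x, -1, m) needs Python 3.8 or later):

```python
MODULI = (2, 3, 5)          # assumed factorization of m = 30
M = 30

# CRT weight for each modulus: w_i = (M/m_i) * ((M/m_i)^-1 mod m_i).
# Here every inverse happens to be 1, giving y1 = 15*x1, y2 = 10*x2, y3 = 6*x3.
tables = []
for m in MODULI:
    Mi = M // m
    w = Mi * pow(Mi, -1, m)
    tables.append({x: (w * x) % M for x in range(m)})
print(tables)   # [{0: 0, 1: 15}, {0: 0, 1: 10, 2: 20}, {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}]

def rns2bin(x1, x2, x3):
    # The per-digit table outputs are combined by modulo-M addition
    return (tables[0][x1] + tables[1][x2] + tables[2][x3]) % M

assert all(rns2bin(x % 2, x % 3, x % 5) == x for x in range(M))
```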
Problem
• The moduli set of an RNS consists of mutually prime numbers
• Hence the sizes of the per-modulus circuits are all different
• Example: <7,11,13>
18
[Circuit for <7,11,13>: the modulo-7 channel (3-bit operands) fits in a 6-input LUT, while the modulo-11 and modulo-13 channels (4-bit operands) require 8-input LUTs.]
Binary2RNS converter: realized by BRAMs
RNS2Binary converter: realized by DSP blocks and BRAMs
Nested RNS
• (Z1, Z2, …, Zi, …, ZL) → (Z1, Z2, …, (Zi1, Zi2, …, Zij), …, ZL)
• Ex: <7,11,13> × <7,11,13>
  → <7, <5,6,7>11, <5,6,7>13> × <7, <5,6,7>11, <5,6,7>13>
  (the subscripts 11 and 13 denote the original moduli)
19
1. Reuse the same moduli set
2. Decompose a large modulus into smaller ones
Example of Nested RNS
• 19 × 22 (= 418) on <7,<5,6,7>11,<5,6,7>13>
19×22
=<5,8,6>×<1,0,9>
=<5,<3,2,1>11,<1,0,6>13>×<1,<0,0,0>11,<4,3,2>13>
=<5,<0,0,0>11,<4,0,5>13>
=<5,0,2>
=418
20
(Steps in the derivation: Binary2NRNS conversion, modulo multiplication on the NRNS, RNS2Bin conversion.)
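A Python sketch reproducing this example: the mod-11 and mod-13 channels are re-encoded in 〈5,6,7〉, multiplied there, decoded back, and combined with a naive CRT (illustrative only; the inner products stay within the 〈5,6,7〉 range of 210):

```python
OUTER, INNER = (7, 11, 13), (5, 6, 7)

def to_rns(x, moduli):
    return tuple(x % m for m in moduli)

def crt(res, moduli):
    # Naive CRT by search; fine for the small dynamic ranges here (210 and 1001)
    M = 1
    for m in moduli:
        M *= m
    return next(z for z in range(M) if to_rns(z, moduli) == tuple(res))

def nested_mul(x, y):
    zs = []
    for m in OUTER:
        if m > 7:                                    # nest the larger moduli in <5,6,7>
            a, b = to_rns(x % m, INNER), to_rns(y % m, INNER)
            prod = tuple((p * q) % n for p, q, n in zip(a, b, INNER))
            zs.append(crt(prod, INNER) % m)          # decode the inner RNS, reduce mod m
        else:
            zs.append((x % m) * (y % m) % m)
    return crt(zs, OUTER)

print(nested_mul(19, 22))   # 418
```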
Realization of Nested RNS
21
[Circuit: the Binary2NRNS converter (Bin2<7,11,13>, followed by Bin2<5,6,7> for the nested channels) is realized by BRAMs; every modulo multiplication operates on 3- or 4-bit operands and fits in a 6-input LUT; the NRNS2Binary converter (<5,6,7>2Bin, then <7,11,13>2Bin) is realized by BRAMs and DSP blocks.]
Moduli Set for NRNS
• Conventional RNS (uses 9 moduli)
<3,5,7,11,13,16,17,19,23>
• Applied the NRNS to moduli that are greater than 16
<3,4,5,7,11,13,16,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23>
22
All the 30-bit MAC operations are decomposed into 4-bit ones
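As a quick sanity check (my own arithmetic, not on the slide), the conventional 9-modulus set already covers the 30-bit dynamic range used for the MAC operations:

```python
from math import prod, log2

conventional = (3, 5, 7, 11, 13, 16, 17, 19, 23)
M = prod(conventional)
print(M, log2(M))   # 1784742960, about 30.7 bits: enough for a 30-bit dynamic range
```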
DCNN Architecture using the NRNS
23
[Block diagram: a sequencer and on-chip memory (BRAMs) feed parallel Bin2NRNS converters, which drive parallel modulo-mi 2D convolutional units; tree-based NRNS2Bin converters collect the results.]
NRNS-based YOLOv2
• Framework: Chainer 1.24.0
• CNN: Tiny YOLOv2
• Benchmark: KITTI
vision benchmark
• mAP: 69.1 %
24
Implementation
• FPGA board: NetFPGA-SUME
• FPGA: Virtex7 VC690T
• LUT: 427,014 / 433,200
• 18Kb BRAM: 1,235 / 2,940
• DSP48E: 0 / 3,600
• Realized the pre-trained
NRNS-based YOLOv2
• 9-bit fixed-point precision
(dynamic range: 30 bits)
• Synthesis tool: Xilinx Vivado 2017.2
• Timing constraint: 300 MHz
• 3.84 FPS@3.5W → 1.097 FPS/W
25
Comparison
26
                     NVIDIA Pascal GTX1080Ti    NetFPGA-SUME
Speed [FPS]                  20.64                  3.84
Power [W]                    60.0                   3.5
Efficiency [FPS/W]            0.344                  1.097
Conclusion
• Realized the DCNN on the FPGA
• Time multiplexing
• Nested RNS
• MAC operation is realized by small LUTs
• Functional decomposition is used as follows:
• Bin2NRNS converter is realized by BRAMs
• NRNS2Bin converter is realized by DSP blocks
and BRAMs
• Performance per power (FPS/W)
• 3.19 times better than Pascal GPU
27