A Deep Convolutional
Neural Network Based on
Nested Residue Number System
Hiroki Nakahara1 Tsutomu Sasao2
1Ehime University, Japan
2Meiji University, Japan
1
Outline
• Background
• Deep convolutional neural network (DCNN)
• Residue number system (RNS)
• DCNN using nested RNS (NRNS)
• Experimental results
• Conclusion
2
Background
• Deep Neural Network
– Multi-layer neuron model
– Used for embedded vision systems
• FPGA realization is suitable for real-time systems
– Faster than the CPU
– Lower power consumption than the GPU
– Fixed-point representation is sufficient
• High performance per area is desired
3
Deep Convolutional
Neural Network (DCNN)
4
Artificial Neuron
[Figure: neuron diagram with inputs x0 = 1, x1, x2, …, xN, weights w0 (bias), w1, …, wN, a summation producing the internal state u, and an activation f(u) giving the output y]
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function
(Sigmoid, ReLU, etc.)
y: Output signal
5
y = f(u)
u = \sum_{i=0}^{N} w_i x_i
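As a minimal illustration of the formulas above, the sketch below evaluates one neuron in plain Python, assuming a ReLU activation; all names and values are illustrative and not part of the original slides.

    # One artificial neuron: u = sum_i w_i * x_i, y = f(u).
    # ReLU is assumed for the activation f; a sigmoid would work the same way.
    def relu(u):
        return max(0.0, u)

    def neuron(x, w):
        # x[0] is fixed to 1 so that w[0] acts as the bias w0.
        u = sum(wi * xi for wi, xi in zip(w, x))
        return relu(u)

    # Example: bias 0.1, two inputs.
    print(neuron([1.0, 0.5, -0.2], [0.1, 0.8, 0.3]))   # 0.44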
Deep Convolutional Neural Network (DCNN)
for ImageNet
• 2D convolutional layer, pooling layer, and fully connected layer
6
2D Convolutional Layer
• Consumes more than 90% of the computation time
– Multiply-accumulate (MAC) operations are performed
7
z_{ij} = y_{ij} + \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{i+m, j+n} w_{mn}
xij: Input signal
yij : Bias
wmn: Weight
K: Kernel size
zij: Output signal
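A minimal sketch of this MAC loop in plain Python (single input/output channel, no padding or stride; the per-position bias y follows the slide's formula, and the function name is illustrative):

    # z[i][j] = y[i][j] + sum_{m=0..K-1} sum_{n=0..K-1} x[i+m][j+n] * w[m][n]
    def conv2d(x, w, y):
        K = len(w)                      # kernel size (K x K)
        H, W = len(x), len(x[0])
        out_h, out_w = H - K + 1, W - K + 1
        z = [[0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                acc = y[i][j]           # bias term y_ij
                for m in range(K):
                    for n in range(K):
                        acc += x[i + m][j + n] * w[m][n]
                z[i][j] = acc
        return z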
FPGA Realization of 2D Convolutional Layer
• Requires more than a billion MAC operations!
• Our realization
– Time multiplexing
– Nested Residue Number System (NRNS)
8
[Figure: conventional realization (off-chip memory, BRAM buffers, and parallel multiplier/adder trees on the FPGA) ➔ proposed realization, in which converters are placed around the BRAMs and parallel multipliers]
Residue Number System (RNS)
9
Residue Number System (RNS)
• Defined by a set of L mutually prime integer constants 〈m1, m2, ..., mL〉
– No pair of moduli has a common factor with any other
– Typically, prime numbers are used as the moduli
• An arbitrary integer X can be uniquely
represented by a tuple of L integers
(X1, X2, …, XL), where X_i = X \bmod m_i
• Dynamic range: M = \prod_{i=1}^{L} m_i
10
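A minimal sketch of these definitions for the small moduli set 〈3,4,5〉 (chosen for illustration; the moduli sets used later in the talk are larger):

    from math import prod

    moduli = (3, 4, 5)                 # mutually prime moduli
    M = prod(moduli)                   # dynamic range M = 3*4*5 = 60

    def to_rns(x, moduli):
        # X -> (X mod m1, ..., X mod mL)
        return tuple(x % m for m in moduli)

    # Every X in [0, M) has a unique residue tuple.
    assert len({to_rns(x, moduli) for x in range(M)}) == M
    print(to_rns(8, moduli))           # (2, 0, 3)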
Parallel Multiplication
Multiplication on RNS
Moduli set〈3,4,5〉, X=8, Y=2
Z=X×Y=16=(1,0,1)
X=(2,0,3), Y=(2,2,2)
Z=(4 mod 3,0 mod 4,6 mod 5)
=(1,0,1)=16
11
[Figure: Binary2RNS conversion ➔ channel-wise modulo multiplication ➔ RNS2Binary conversion]
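A minimal sketch reproducing the example above: each residue channel is multiplied independently, so no carries cross between channels.

    # X = 8 -> (2, 0, 3), Y = 2 -> (2, 2, 2) on the moduli set <3, 4, 5>.
    moduli = (3, 4, 5)
    X, Y = (2, 0, 3), (2, 2, 2)
    Z = tuple((x * y) % m for x, y, m in zip(X, Y, moduli))
    print(Z)                           # (1, 0, 1), i.e. the RNS image of 16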
Binary2RNS Converter
12
X   mod 2   mod 3   mod 4
0     0       0       0
1     1       1       1
2     0       2       2
3     1       0       3
4     0       1       0
5     1       2       1
Functional Decomposition
13
[Figure: decomposition chart of a 4-variable function with bound variables X1 = (x1, x2) and free variables X2 = (x3, x4); column multiplicity = 2, so h(X1) can be encoded in a single bit]
Memory size: 2^4 × 1 = 16 [bit] ➔ 2^2 × 1 + 2^3 × 1 = 12 [bit]
Decomposition Chart for X mod 3
14
[Figure: decomposition chart for X mod 3 with a 5-bit input; the bound variables X2 = (x3, x4, x5) index the columns, the free variables X1 = (x1, x2) index the rows, and the entries cycle through 0, 1, 2 because X mod 3 is periodic (0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, …)]
Decomposition Chart for X mod 3
15
[Figure: reduced decomposition chart for X mod 3; the bound variables X2 = (x3, x4, x5) are first mapped by h(X2) = X2 mod 3 (column multiplicity = 3: h(X2) = 0, 1, 2, 0, 1, 2, 0, 1 for X2 = 000…111), and the result is combined with the free variables X1 = (x1, x2)]
Binary2RNS Converter
16
[Figure: one LUT cascade per modulus (X mod m1, X mod m2, …), with each cascade cell mapped to a BRAM]
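A minimal sketch of the idea behind the LUT cascade, using the 5-bit example from the decomposition charts: the bound variables X2 are first reduced by h(X2) = X2 mod 3, and a second small table combines h with the free variables X1 (here X is split as X = 4·X2 + X1, which matches the chart entries). The 5-bit width and the table names are illustrative; the actual converter cascades the same idea over much wider inputs.

    MOD = 3

    # First LUT cell: h(X2) = X2 mod 3 (column multiplicity = 3).
    h_table = [x2 % MOD for x2 in range(8)]

    # Second LUT cell: since X = 4*X2 + X1, X mod 3 depends on X2 only through h(X2).
    g_table = {(h, x1): (h * (4 % MOD) + x1) % MOD
               for h in range(MOD) for x1 in range(4)}

    def x_mod_3(x):
        x2, x1 = x >> 2, x & 0b11      # bound variables X2, free variables X1
        return g_table[(h_table[x2], x1)]

    # The two small tables reproduce X mod 3 for every 5-bit input.
    assert all(x_mod_3(x) == x % MOD for x in range(32))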
RNS2Binary Converter (m=30)
17
x1 (mod 2) → y1: 0 → 0, 1 → 15
x2 (mod 3) → y2: 0 → 0, 1 → 10, 2 → 20
x3 (mod 5) → y3: 0 → 0, 1 → 6, 2 → 12, 3 → 18, 4 → 24
[Figure: the three LUT outputs are summed by cascaded mod-m adders (with carries) to recover the binary value]
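A minimal sketch of this converter for m = 30 with moduli 〈2,3,5〉: the three small tables are those shown on the slide, and a plain mod-30 sum stands in for the tree of mod-m adders used in hardware.

    # RNS <2,3,5> -> binary, M = 30.
    y1 = {0: 0, 1: 15}                        # x1 (mod 2) -> 15*x1 mod 30
    y2 = {0: 0, 1: 10, 2: 20}                 # x2 (mod 3) -> 10*x2 mod 30
    y3 = {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}    # x3 (mod 5) -> 6*x3 mod 30

    def rns2bin(x1, x2, x3):
        return (y1[x1] + y2[x2] + y3[x3]) % 30

    # Chinese-remainder-style reconstruction recovers every X in [0, 30).
    assert all(rns2bin(x % 2, x % 3, x % 5) == x for x in range(30))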
Problem
• The moduli set of RNS consists of mutually prime numbers
– Thus, the circuit sizes for the different moduli all differ
• Example: <7,11,13>
18
[Figure: for <7,11,13>, the modulo-7 channel fits in a 6-input LUT, while the modulo-11 and modulo-13 channels each require an 8-input LUT; the Binary2RNS converter is realized by BRAMs, and the RNS2Binary converter by DSP blocks and BRAMs]
DCNN using Nested RNS
19
Nested RNS
• (Z1, Z2, …, Zi, …, ZL) ➔ (Z1, Z2, …, (Zi1, Zi2, …, Zij), …, ZL)
• Ex: <7,11,13> × <7,11,13>
➔ <7, <5,6,7>11, <5,6,7>13> × <7, <5,6,7>11, <5,6,7>13>
20
1. Reuse the same moduli set
2. Decompose a large modulus into smaller ones (the subscript denotes the original modulus)
Example of Nested RNS
• 19 × 22 (= 418) on <7,<5,6,7>11,<5,6,7>13>
19×22
=<5,8,6>×<1,0,9>
=<5,<3,2,1>11,<1,0,6>13>×<1,<0,0,0>11,<4,3,2>13>
=<5,<0,0,0>11,<4,0,5>13>
=<5,0,2>
=418
21
(Steps: Binary2NRNS conversion ➔ modulo multiplication on the nested channels ➔ NRNS2Binary conversion)
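A minimal sketch reproducing the worked example (19 × 22 = 418): the mod-11 and mod-13 residues are re-encoded in the inner set 〈5,6,7〉, multiplied channel-wise, and decoded again. The brute-force CRT decoder is only for illustration; the hardware uses the converters described on the next slide.

    from math import prod

    inner = (5, 6, 7)          # 5*6*7 = 210 > 12*12, so residue products fit

    def to_rns(x, moduli):
        return tuple(x % m for m in moduli)

    def crt(res, moduli):
        # Brute-force RNS -> binary decode (fine for these tiny ranges).
        return next(x for x in range(prod(moduli)) if to_rns(x, moduli) == tuple(res))

    def nrns_mul(a, b):
        # Outer channels: mod 7 stays flat; mod 11 and mod 13 use the inner RNS <5,6,7>.
        c7 = (a % 7) * (b % 7) % 7
        c11 = tuple((p * q) % m for p, q, m in
                    zip(to_rns(a % 11, inner), to_rns(b % 11, inner), inner))
        c13 = tuple((p * q) % m for p, q, m in
                    zip(to_rns(a % 13, inner), to_rns(b % 13, inner), inner))
        return c7, c11, c13

    c7, c11, c13 = nrns_mul(19, 22)
    print(c7, c11, c13)                           # 5 (0, 0, 0) (4, 0, 5)
    outer = (c7, crt(c11, inner) % 11, crt(c13, inner) % 13)
    print(outer, crt(outer, (7, 11, 13)))         # (5, 0, 2) 418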
Realization of Nested RNS
22
[Figure: nested RNS datapath for one channel. The Binary2NRNS stage (a Bin2<7,11,13> converter followed by Bin2<5,6,7> converters) is realized by BRAMs; the modulo multipliers on the nested <5,6,7> channels are realized by 6-input LUTs; the NRNS2Binary stage (<5,6,7>2Bin converters followed by a <7,11,13>2Bin converter) is realized by BRAMs and DSP blocks]
Moduli Set for NRNS
• Conventional RNS (uses 23 moduli)
<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,
61,67,71,73,79,83>
• Applied the NRNS to moduli that are greater than 15
<3,4,5,7,11,13,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>29,
…, <3,4,5,7,11,13,<3,4,5,7,11,13>17>83>
23
All the 48-bit MAC operations are decomposed into 4-bit ones
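A minimal sketch checking two properties behind this slide: the dynamic range of the 23-modulus set (which, presumably, must cover the accumulated 48-bit fixed-point MAC results), and the fact that the leaf moduli remaining after nesting all fit in 4 bits. The full nested set is not reproduced here.

    from math import prod, log2

    # Conventional RNS moduli set from the slide (23 mutually prime moduli).
    moduli = [3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
              41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83]
    M = prod(moduli)
    print(f"dynamic range ~= 2^{log2(M):.1f}")    # about 2^108.7, far above 48 bits

    # After nesting, only the small moduli remain at the leaves,
    # so every residue channel fits in 4 bits.
    leaves = [3, 4, 5, 7, 11, 13]
    assert max(m.bit_length() for m in leaves) == 4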
DCNN Architecture using the NRNS
24
[Figure: overall architecture. A sequencer and DDR3 controller stream data between the external DDR3 SO-DIMM and on-chip memory (BRAMs); parallel Bin2NRNS converters feed 16 parallel modulo-mi 2D convolutional units, and tree-based NRNS2Bin converters assemble the binary results]
Experimental Results
25
Implementation Setup
• FPGA board: Xilinx VC707
– FPGA: Virtex7 VX485T
– 1 GB DDR3 SO-DIMM (bus @ 800 MHz, 64-bit width)
• Realized the pre-trained ImageNet DCNN by Convnet2
– 48-bit fixed-point precision
• Synthesis tool: Xilinx Vivado 2014.1
– Timing constraint: 400 MHz
26
Comparison with Other Implementations
27
            Precision      Max. Freq. [MHz]   FPGA                Performance [GOPS]   Perf. per area [GOPS/Slice × 10^-4]
ASAP2009    16-bit fixed   115                Virtex5 LX330T      6.7                  1.3
PACT2010    --- fixed      125                Virtex5 SX240T      7.0                  1.9
FPL2009     48-bit fixed   125                Spartan3A DSP3400   5.3                  2.2
ISCA2010    48-bit fixed   200                Virtex5 SX240T      16.0                 4.3
ICCD2013    --- fixed      150                Virtex6 LX240T      17.0                 4.5
FPGA2015    32-bit float   100                Virtex7 VX485T      61.6                 8.1
Proposed    48-bit fixed   400                Virtex7 VX485T      132.2                25.2
Conclusion
• Realized the DCNN on the FPGA
– Time multiplexing
– Nested RNS
• MAC operations are realized by small LUTs
• Functional decomposition is used as follows:
– Bin2NRNS converter is realized by BRAMs
– NRNS2Bin converter is realized by DSP blocks and
BRAMs
• Performance per area (GOPS/Slice)
– 5.86 times higher than that of ISCA2010
28