A Deep Convolutional
Neural Network Based on
Nested Residue Number System
Hiroki Nakahara1 Tsutomu Sasao2
1Ehime University, Japan
2Meiji University, Japan
1
Outline
• Background
• Deep convolutional neural network (DCNN)
• Residue number system (RNS)
• DCNN using nested RNS (NRNS)
• Experimental results
• Conclusion
2
Background
• Deep Neural Network
– Multi-layer neuron model
– Used for embedded vision systems
• FPGA realization is suitable for real-time systems
– Faster than the CPU
– Lower power consumption than the GPU
– Fixed-point representation is sufficient
• High performance per area is desired
3
Deep Convolutional
Neural Network (DCNN)
4
Artificial Neuron
[Figure: neuron diagram with inputs x0 = 1, x1, x2, …, xN, weights w0 (bias), w1, …, wN, a summation producing the internal state u, and an activation f(u) giving the output y]
xi: Input signal
wi: Weight
u: Internal state
f(u): Activation function
(Sigmoid, ReLU, etc.)
y: Output signal
5
y = f(u)
u = \sum_{i=0}^{N} w_i x_i
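As a minimal illustration of the formulas above, the sketch below evaluates one neuron in plain Python, assuming a ReLU activation; all names and values are illustrative and not part of the original slides.

    # One artificial neuron: u = sum_i w_i * x_i, y = f(u).
    # ReLU is assumed for the activation f; a sigmoid would work the same way.
    def relu(u):
        return max(0.0, u)

    def neuron(x, w):
        # x[0] is fixed to 1 so that w[0] acts as the bias w0.
        u = sum(wi * xi for wi, xi in zip(w, x))
        return relu(u)

    # Example: bias 0.1, two inputs.
    print(neuron([1.0, 0.5, -0.2], [0.1, 0.8, 0.3]))   # 0.44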
Deep Convolutional Neural Network (DCNN)
for ImageNet
• 2D convolutional layer, pooling layer, and fully connected layer
6
2D Convolutional Layer
• Consumes more than 90% of the computation time
– Multiply-accumulate (MAC) operations are performed
7
z_{ij} = y_{ij} + \sum_{m=0}^{K-1} \sum_{n=0}^{K-1} x_{i+m, j+n} w_{mn}
xij: Input signal
yij : Bias
wmn: Weight
K: Kernel size
zij: Output signal
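A minimal sketch of this MAC loop in plain Python (single input/output channel, no padding or stride; the per-position bias y follows the slide's formula, and the function name is illustrative):

    # z[i][j] = y[i][j] + sum_{m=0..K-1} sum_{n=0..K-1} x[i+m][j+n] * w[m][n]
    def conv2d(x, w, y):
        K = len(w)                      # kernel size (K x K)
        H, W = len(x), len(x[0])
        out_h, out_w = H - K + 1, W - K + 1
        z = [[0] * out_w for _ in range(out_h)]
        for i in range(out_h):
            for j in range(out_w):
                acc = y[i][j]           # bias term y_ij
                for m in range(K):
                    for n in range(K):
                        acc += x[i + m][j + n] * w[m][n]
                z[i][j] = acc
        return z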
FPGA Realization of 2D Convolutional Layer
• Requires more than a billion MAC operations!
• Our realization
– Time multiplexing
– Nested Residue Number System (NRNS)
8
[Figure: conventional realization (off-chip memory, BRAM buffers, and parallel multiplier/adder trees on the FPGA) ➔ proposed realization, in which converters are placed around the BRAMs and parallel multipliers]
Residue Number System (RNS)
9
Residue Number System (RNS)
• Defined by a set of L mutually prime integer constants 〈m1, m2, ..., mL〉
– No pair of moduli has a common factor with any other
– Typically, prime numbers are used as the moduli
• An arbitrary integer X can be uniquely
represented by a tuple of L integers
(X1, X2, …, XL), where X_i = X \bmod m_i
• Dynamic range: M = \prod_{i=1}^{L} m_i
10
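A minimal sketch of these definitions for the small moduli set 〈3,4,5〉 (chosen for illustration; the moduli sets used later in the talk are larger):

    from math import prod

    moduli = (3, 4, 5)                 # mutually prime moduli
    M = prod(moduli)                   # dynamic range M = 3*4*5 = 60

    def to_rns(x, moduli):
        # X -> (X mod m1, ..., X mod mL)
        return tuple(x % m for m in moduli)

    # Every X in [0, M) has a unique residue tuple.
    assert len({to_rns(x, moduli) for x in range(M)}) == M
    print(to_rns(8, moduli))           # (2, 0, 3)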
Parallel Multiplication
Multiplication on RNS
Moduli set〈3,4,5〉, X=8, Y=2
Z=X×Y=16=(1,0,1)
X=(2,0,3), Y=(2,2,2)
Z=(4 mod 3,0 mod 4,6 mod 5)
=(1,0,1)=16
11
[Figure: Binary2RNS conversion ➔ channel-wise modulo multiplication ➔ RNS2Binary conversion]
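A minimal sketch reproducing the example above: each residue channel is multiplied independently, so no carries cross between channels.

    # X = 8 -> (2, 0, 3), Y = 2 -> (2, 2, 2) on the moduli set <3, 4, 5>.
    moduli = (3, 4, 5)
    X, Y = (2, 0, 3), (2, 2, 2)
    Z = tuple((x * y) % m for x, y, m in zip(X, Y, moduli))
    print(Z)                           # (1, 0, 1), i.e. the RNS image of 16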
Binary2RNS Converter
12
X   mod 2   mod 3   mod 4
0     0       0       0
1     1       1       1
2     0       2       2
3     1       0       3
4     0       1       0
5     1       2       1
Functional Decomposition
13
[Figure: decomposition chart of a 4-variable function with bound variables X1 = (x1, x2) and free variables X2 = (x3, x4); column multiplicity = 2, so h(X1) can be encoded in a single bit]
Memory size: 2^4 × 1 = 16 [bit] ➔ 2^2 × 1 + 2^3 × 1 = 12 [bit]
Decomposition Chart for X mod 3
14
[Figure: decomposition chart for X mod 3 with a 5-bit input; the bound variables X2 = (x3, x4, x5) index the columns, the free variables X1 = (x1, x2) index the rows, and the entries cycle through 0, 1, 2 because X mod 3 is periodic (0 mod 3 = 0, 1 mod 3 = 1, 2 mod 3 = 2, 3 mod 3 = 0, …)]
Decomposition Chart for X mod 3
15
[Figure: reduced decomposition chart for X mod 3; the bound variables X2 = (x3, x4, x5) are first mapped by h(X2) = X2 mod 3 (column multiplicity = 3: h(X2) = 0, 1, 2, 0, 1, 2, 0, 1 for X2 = 000…111), and the result is combined with the free variables X1 = (x1, x2)]
Binary2RNS Converter
16
[Figure: one LUT cascade per modulus (X mod m1, X mod m2, …), with each cascade cell mapped to a BRAM]
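A minimal sketch of the idea behind the LUT cascade, using the 5-bit example from the decomposition charts: the bound variables X2 are first reduced by h(X2) = X2 mod 3, and a second small table combines h with the free variables X1 (here X is split as X = 4·X2 + X1, which matches the chart entries). The 5-bit width and the table names are illustrative; the actual converter cascades the same idea over much wider inputs.

    MOD = 3

    # First LUT cell: h(X2) = X2 mod 3 (column multiplicity = 3).
    h_table = [x2 % MOD for x2 in range(8)]

    # Second LUT cell: since X = 4*X2 + X1, X mod 3 depends on X2 only through h(X2).
    g_table = {(h, x1): (h * (4 % MOD) + x1) % MOD
               for h in range(MOD) for x1 in range(4)}

    def x_mod_3(x):
        x2, x1 = x >> 2, x & 0b11      # bound variables X2, free variables X1
        return g_table[(h_table[x2], x1)]

    # The two small tables reproduce X mod 3 for every 5-bit input.
    assert all(x_mod_3(x) == x % MOD for x in range(32))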
RNS2Binary Converter (m=30)
17
x1 (mod 2) → y1: 0 → 0, 1 → 15
x2 (mod 3) → y2: 0 → 0, 1 → 10, 2 → 20
x3 (mod 5) → y3: 0 → 0, 1 → 6, 2 → 12, 3 → 18, 4 → 24
[Figure: the three LUT outputs are summed by cascaded mod-m adders (with carries) to recover the binary value]
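A minimal sketch of this converter for m = 30 with moduli 〈2,3,5〉: the three small tables are those shown on the slide, and a plain mod-30 sum stands in for the tree of mod-m adders used in hardware.

    # RNS <2,3,5> -> binary, M = 30.
    y1 = {0: 0, 1: 15}                        # x1 (mod 2) -> 15*x1 mod 30
    y2 = {0: 0, 1: 10, 2: 20}                 # x2 (mod 3) -> 10*x2 mod 30
    y3 = {0: 0, 1: 6, 2: 12, 3: 18, 4: 24}    # x3 (mod 5) -> 6*x3 mod 30

    def rns2bin(x1, x2, x3):
        return (y1[x1] + y2[x2] + y3[x3]) % 30

    # Chinese-remainder-style reconstruction recovers every X in [0, 30).
    assert all(rns2bin(x % 2, x % 3, x % 5) == x for x in range(30))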
Problem
• The moduli set of RNS consists of mutually prime numbers
– Thus, the circuit sizes for the different moduli all differ
• Example: <7,11,13>
18
[Figure: for <7,11,13>, the modulo-7 channel fits in a 6-input LUT, while the modulo-11 and modulo-13 channels each require an 8-input LUT; the Binary2RNS converter is realized by BRAMs, and the RNS2Binary converter by DSP blocks and BRAMs]
DCNN using Nested RNS
19
Nested RNS
• (Z1, Z2, …, Zi, …, ZL) ➔ (Z1, Z2, …, (Zi1, Zi2, …, Zij), …, ZL)
• Ex: <7,11,13> × <7,11,13>
➔ <7, <5,6,7>11, <5,6,7>13> × <7, <5,6,7>11, <5,6,7>13>
20
1. Reuse the same moduli set
2. Decompose a large modulus into smaller ones (the subscript denotes the original modulus)
Example of Nested RNS
• 19 × 22 (= 418) on <7,<5,6,7>11,<5,6,7>13>
19×22
=<5,8,6>×<1,0,9>
=<5,<3,2,1>11,<1,0,6>13>×<1,<0,0,0>11,<4,3,2>13>
=<5,<0,0,0>11,<4,0,5>13>
=<5,0,2>
=418
21
(Steps: Binary2NRNS conversion ➔ modulo multiplication on the nested channels ➔ NRNS2Binary conversion)
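A minimal sketch reproducing the worked example (19 × 22 = 418): the mod-11 and mod-13 residues are re-encoded in the inner set 〈5,6,7〉, multiplied channel-wise, and decoded again. The brute-force CRT decoder is only for illustration; the hardware uses the converters described on the next slide.

    from math import prod

    inner = (5, 6, 7)          # 5*6*7 = 210 > 12*12, so residue products fit

    def to_rns(x, moduli):
        return tuple(x % m for m in moduli)

    def crt(res, moduli):
        # Brute-force RNS -> binary decode (fine for these tiny ranges).
        return next(x for x in range(prod(moduli)) if to_rns(x, moduli) == tuple(res))

    def nrns_mul(a, b):
        # Outer channels: mod 7 stays flat; mod 11 and mod 13 use the inner RNS <5,6,7>.
        c7 = (a % 7) * (b % 7) % 7
        c11 = tuple((p * q) % m for p, q, m in
                    zip(to_rns(a % 11, inner), to_rns(b % 11, inner), inner))
        c13 = tuple((p * q) % m for p, q, m in
                    zip(to_rns(a % 13, inner), to_rns(b % 13, inner), inner))
        return c7, c11, c13

    c7, c11, c13 = nrns_mul(19, 22)
    print(c7, c11, c13)                           # 5 (0, 0, 0) (4, 0, 5)
    outer = (c7, crt(c11, inner) % 11, crt(c13, inner) % 13)
    print(outer, crt(outer, (7, 11, 13)))         # (5, 0, 2) 418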
Realization of Nested RNS
22
[Figure: nested RNS datapath for one channel. The Binary2NRNS stage (a Bin2<7,11,13> converter followed by Bin2<5,6,7> converters) is realized by BRAMs; the modulo multipliers on the nested <5,6,7> channels are realized by 6-input LUTs; the NRNS2Binary stage (<5,6,7>2Bin converters followed by a <7,11,13>2Bin converter) is realized by BRAMs and DSP blocks]
Moduli Set for NRNS
• Conventional RNS (uses 23 moduli)
<3,4,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,
61,67,71,73,79,83>
• Applied the NRNS to moduli that are greater than 15
<3,4,5,7,11,13,
<3,4,5,7,11,13>17,
<3,4,5,7,11,13>19,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>23,
<3,4,5,7,11,13,<3,4,5,7,11,13>17>29,
…, <3,4,5,7,11,13,<3,4,5,7,11,13>17>83>
23
All the 48-bit MAC operations are decomposed into 4-bit ones
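A minimal sketch checking two properties behind this slide: the dynamic range of the 23-modulus set (which, presumably, must cover the accumulated 48-bit fixed-point MAC results), and the fact that the leaf moduli remaining after nesting all fit in 4 bits. The full nested set is not reproduced here.

    from math import prod, log2

    # Conventional RNS moduli set from the slide (23 mutually prime moduli).
    moduli = [3, 4, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37,
              41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83]
    M = prod(moduli)
    print(f"dynamic range ~= 2^{log2(M):.1f}")    # about 2^108.7, far above 48 bits

    # After nesting, only the small moduli remain at the leaves,
    # so every residue channel fits in 4 bits.
    leaves = [3, 4, 5, 7, 11, 13]
    assert max(m.bit_length() for m in leaves) == 4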
DCNN Architecture using the NRNS
24
[Figure: overall architecture. A sequencer and DDR3 controller stream data between the external DDR3 SO-DIMM and on-chip memory (BRAMs); parallel Bin2NRNS converters feed 16 parallel modulo-mi 2D convolutional units, and tree-based NRNS2Bin converters assemble the binary results]
Experimental Results
25
Implementation Setup
• FPGA board: Xilinx VC707
– FPGA: Virtex7 VX485T
– 1 GB DDR3 SO-DIMM (bus @ 800 MHz, 64-bit width)
• Realized the pre-trained ImageNet DCNN by Convnet2
– 48-bit fixed-point precision
• Synthesis tool: Xilinx Vivado 2014.1
– Timing constraint: 400 MHz
26
Comparison with Other Implementations
27
            Precision      Max. Freq. [MHz]   FPGA                Performance [GOPS]   Perf. per area [GOPS/Slice × 10^-4]
ASAP2009    16-bit fixed   115                Virtex5 LX330T      6.7                  1.3
PACT2010    --- fixed      125                Virtex5 SX240T      7.0                  1.9
FPL2009     48-bit fixed   125                Spartan3A DSP3400   5.3                  2.2
ISCA2010    48-bit fixed   200                Virtex5 SX240T      16.0                 4.3
ICCD2013    --- fixed      150                Virtex6 LX240T      17.0                 4.5
FPGA2015    32-bit float   100                Virtex7 VX485T      61.6                 8.1
Proposed    48-bit fixed   400                Virtex7 VX485T      132.2                25.2
Conclusion
• Realized the DCNN on the FPGA
– Time multiplexing
– Nested RNS
• MAC operations are realized by small LUTs
• Functional decomposition is used as follows:
– Bin2NRNS converter is realized by BRAMs
– NRNS2Bin converter is realized by DSP blocks and
BRAMs
• Performance per area (GOPS/Slice)
– 5.86 times higher than that of ISCA2010
28