FPGAX2019

Nakahara
My MNIST FPS is 530,000.
But of course, at full power... (rest omitted)
(+ an introduction to noise CNNs)
FPGAX2019 @ Google office

Things I Have Built (1)
• An accelerator for a certain nationally popular game console

Research theme:
Custom Computing Machine

Object Detection
[Figure: detection result with bounding boxes labeled Person, Person, Boat]
J. Redmon and A. Farhadi, "YOLOv3: An Incremental Improvement," arXiv, 2018

Semantic Segmentation
E. Shelhamer, J. Long and T. Darrell, "Fully Convolutional Networks for Semantic Segmentation," IEEE Trans. on
Pattern Analysis and Machine Intelligence, Vol.39, No.4, 2017, pp. 640-651.

OpenPose (Pose Estimation)
Z. Cao, T. Simon, S.-E. Wei and Y. Sheikh, "Realtime Multi-Person 2D Pose Estimation
using Part Affinity Fields," CVPR, 2017.

DepthMap (Depth Estimation)
D. Eigen, C. Puhrsch and R. Fergus, "Depth Map Prediction from a Single Image using a
Multi-Scale Deep Network," arXiv:1406.2283, 2014.

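• Acceleration on the Terasic DE5a-Net board
• YOLOv2: 166 → 498 FPS (3x parallelization)
Y. Sada, M. Shimoda, S. Sato, H. Nakahara, "On an FPGA Implementation of a Tri-State YOLOv2 Using Intel OpenCL,"
IEICE Technical Committee on Reconfigurable Systems, Dec. 2018 (Hiroshima). (in Japanese)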

Demo: Avnet Ultra96
Xilinx Zynq UltraScale+ MPSoC (ZU3EG) on board,
30 FPS (YOLOv2), about 30,000 yen, controlled from PYNQ (a Python environment), runs standalone
Hiroki Nakahara, Masayuki Shimoda and Shimpei Sato, "A Tri-State Weight Convolutional Neural
Network for an FPGA: Applied to YOLOv2 Object Detector," FPT, 2018.

CNN Optimization
Source: http://www.isfpga.org/fpga2017/slides/D1_S1_InvitedTalk.pdf

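Binarized CNN
[Figure: a neuron with inputs x1, x2, ..., xn, weights w1, w2, ..., wn and bias w0; the weighted sum Y goes through the sign function fsgn(Y) to produce the output z]
Multiplication with values in {-1, +1}:
x1  x2   Y
-1  -1  +1
-1  +1  -1
+1  -1  -1
+1  +1  +1
The same operation with values in {0, 1} (XNOR):
x1  x2   Y
 0   0   1
 0   1   0
 1   0   0
 1   1   1
M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv, Y. Bengio, "Binarized neural networks: Training deep neural
networks with weights and activations constrained to +1 or -1," Computer Research Repository (CoRR), Mar.
2016, http://arxiv.org/pdf/1602.02830v3.pdf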
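
The two truth tables suggest that a product over {-1, +1} turns into an XNOR over {0, 1}. A minimal Python sketch of that idea follows, assuming the encoding -1 → 0 and +1 → 1; the helper names are hypothetical and not taken from the talk.

```python
def dot_pm1(x, w):
    """Reference dot product with values in {-1, +1}."""
    return sum(xi * wi for xi, wi in zip(x, w))

def encode(v):
    """Map -1 -> 0 and +1 -> 1 (assumed encoding)."""
    return [(s + 1) // 2 for s in v]

def dot_xnor(x_bits, w_bits):
    """Same dot product on {0, 1}: XNOR each pair, popcount, then rescale."""
    n = len(x_bits)
    matches = sum(1 for xb, wb in zip(x_bits, w_bits) if xb == wb)  # XNOR + popcount
    return 2 * matches - n

x = [+1, -1, -1, +1]
w = [+1, +1, -1, -1]
result_pm1 = dot_pm1(x, w)
result_xnor = dot_xnor(encode(x), encode(w))
```

Each matching bit pair contributes +1 and each mismatch -1, which is where the `2 * matches - n` rescaling comes from.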

Why reduce memory? → To keep everything on chip
On-chip Memory
• High bandwidth (left)
• Low power consumption (right)
J. Emer et al., "Tutorial on Hardware Architectures
for Deep Neural Networks," MICRO-49, 2016.
J. Dean, "Numbers everyone should know"
Source: https://gist.github.com/2841832

Zhen Li et al., "A survey of neural network accelerators," ACM TRET, Vol.11, No.5, 2017, pp. 746-761.

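Sparsification
• The histogram of (trained) weights follows some distribution (a t-distribution, perhaps?)
• Depending on the activation function, around 50% of the outputs are zero
• Depending on the training data and model, even more can be zero
• For hardware, weight sparsity is the easier one to exploit
[Figure: histogram of weight values from -1 to 1 (counts up to 200,000); weights near zero do not affect recognition accuracy → prune them]
Tomoya Fujii, Shimpei Sato, Hiroki Nakahara, "A Threshold Neuron Pruning for a Binarized Deep Neural Network on an FPGA," IEICE
Transactions 101-D(2): 376-386 (2018)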
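
As a rough illustration of pruning weights near zero, here is a generic magnitude-threshold sketch in Python. It is not the neuron-pruning method of the cited paper; the threshold value and function names are assumptions chosen for the example.

```python
import numpy as np

def prune_by_threshold(weights, threshold):
    """Zero out weights whose magnitude is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

def sparsity_ratio(weights):
    """Fraction of weights that are exactly zero."""
    return float(np.mean(weights == 0))

# Toy weight matrix; most values sit near zero, as in the histogram.
w = np.array([[0.02, -0.75, 0.10],
              [-0.01, 0.40, -0.05]])
pruned = prune_by_threshold(w, threshold=0.2)  # threshold is an assumed value
ratio = sparsity_ratio(pruned)
```

In practice the threshold would be tuned against validation accuracy rather than fixed by hand.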

Anatomy of CNN Image Classification
[Figure: input image → feature extraction layers (CONV+Pooling, CONV+Pooling, ..., producing feature maps) → classification layers → scores for the digits 0 to 9, with "5" selected]

Problem
• Low-precision NNs cannot solve regression problems
• Example: sin(x) regression using a 3-layer NN
[Figure: sin(x) fits for (a) float32 activations and weights, (b) float32 activations and binary weights, (c) all binarized; plus a detection example comparing a Float32 NN with a Bin NN, where the binarized one mislocalizes]

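Mixed-Precision CNN
• An essential technique for complex tasks such as object detection
• Front end: binary-precision CNN ... area and speed
• Back end: multi-bit-precision CNN ... regression (bounding-box estimation)
[Figure: input image (frame) → CNN (CONV+Pooling stages producing feature maps; binary precision) → detection (class score and bounding box; half precision)]
H. Nakahara et al., "A Lightweight YOLOv2: A Binarized CNN with A Parallel Support Vector Regression for an
FPGA," Int'l Symp. on FPGA (ISFPGA), 2018.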

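Distillation
• Transfers a trained model to another model
• Works across different models (different layers, channels, etc.)
• Training by distillation: propagate all of the teacher model's scores
→ The score distribution carries general-purpose knowledge
G. Hinton, O. Vinyals, and J. Dean, "Distilling the Knowledge in a Neural Network," NIPS 2014 Deep Learning Workshop
Class   Teacher (trained) CNN   Student CNN   Hard target (training dataset)
Car     0.82                    0.62          1.00
Cat     0.08                    0.12          0.00
Dog     0.07                    0.24          0.00
Pet     0.03                    0.02          0.00
Soft target loss: teacher output vs. student output
Hard target loss: training-dataset label vs. student output
The student is trained on the loss for both soft and hard targets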
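
The combined loss can be sketched with the scores from the slide. The code below is a simplified illustration: it uses plain cross entropy for both terms, omits the softmax temperature used in the original distillation paper, and the mixing weight `alpha` is an assumed parameter.

```python
import math

def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

# Scores for (Car, Cat, Dog, Pet) as listed on the slide.
teacher = [0.82, 0.08, 0.07, 0.03]  # soft targets
student = [0.62, 0.12, 0.24, 0.02]
hard = [1.00, 0.00, 0.00, 0.00]     # one-hot label from the training dataset

def distillation_loss(student, teacher, hard, alpha=0.5):
    """Weighted sum of soft-target and hard-target losses (alpha is assumed)."""
    soft_loss = cross_entropy(teacher, student)
    hard_loss = cross_entropy(hard, student)
    return alpha * soft_loss + (1 - alpha) * hard_loss

loss = distillation_loss(student, teacher, hard)
```

With `alpha=0` the loss reduces to ordinary supervised training on the hard labels.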

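Parameter Search with Meta Machine Learning
• Craftsmanship has its limits...
• Parameters need to be searched efficiently
• Grid search: slow
• Random walk: left to luck
• Metaheuristics (SA, GA, PSO): fairly good?
• Bayesian estimation: seems suited to simple problems whose parameters are easy to estimate
• Meta machine learning: Hyperopt, Optuna
Chainer + Optuna example:
https://github.com/pfnet/optuna/blob/master/examples/chainer_simple.py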

• GUINNESS (GUI based Neural Network SyntheSizer)
H. Nakahara et al., "GUINNESS: A GUI based Binarized Deep Neural Network Framework for Software
Programmers," IEICE Trans. on Info., (accepted).
https://github.com/HirokiNakahara/GUINNESS

Google Colaboratory
• A GPU (Tesla K80) can be used for up to 12 hours
• The necessary libraries are preinstalled
• TensorFlow is available
• Chainer can also be installed
Chainer on Google Colaboratory:
https://github.com/chainer/google-colaboratory
• There is even a guide to using GUINNESS (a Binary Neural Network design tool) on Colaboratory!
Running a Binary CNN on Google Colaboratory (MNIST):
http://shimaharu.blogspot.com/2018/11/google-colaboratorybinary-cnnmnist.html
• Some people have even installed Vivado...
• You can do it all from a smartphone!
Chainer is supported by default (checked on 2019/Jan/31)

On‐going work
• Started developing Coca-Cola DL
• Co-design and verification on Colaboratory for Deep Learning
https://github.com/HirokiNakahara/Coca-Cola-DL/

MNIST Challenge
• How fast can MNIST go?
• Accuracy must be at least 90% (... is that OK?)
• A neural network must be used
• Please no serious replies like "just use t-SNE or a random forest"
• Implement it on an FPGA!

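This Time's Target
• Ternarize a 3-layer DNN (binary + pruning)
• Binarize the input image as well (threshold it to black and white)
[Figure: 784-n-n-10 fully connected network with inputs x0 ... x783, two hidden layers z0 ... zn-1, and outputs y0 ... y9]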
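
A minimal NumPy sketch of such a network is shown below. It assumes a +1/-1 encoding for the binarized pixels, omits biases and batch normalization, and uses random weights with an arbitrary pruning threshold, so it only illustrates the data flow, not the trained model from the talk.

```python
import numpy as np

def binarize_image(image, threshold=128):
    """Threshold an 8-bit grayscale image to +1 / -1 (assumed encoding)."""
    return np.where(image >= threshold, 1, -1)

def ternarize(weights, prune_threshold):
    """Binary weights (+1 / -1) plus pruning: small weights become 0."""
    signs = np.where(weights >= 0, 1, -1)
    return np.where(np.abs(weights) < prune_threshold, 0, signs)

def sign(y):
    return np.where(y >= 0, 1, -1)

def forward(x, w1, w2, w3):
    """784 -> n -> n -> 10 network; the last layer returns raw class scores."""
    z1 = sign(w1 @ x)
    z2 = sign(w2 @ z1)
    return w3 @ z2

rng = np.random.default_rng(0)
n = 100  # hidden-layer width
w1 = ternarize(rng.normal(size=(n, 784)), prune_threshold=1.0)
w2 = ternarize(rng.normal(size=(n, n)), prune_threshold=1.0)
w3 = ternarize(rng.normal(size=(10, n)), prune_threshold=1.0)
x = binarize_image(rng.integers(0, 256, size=784))
scores = forward(x, w1, w2, w3)
predicted_digit = int(np.argmax(scores))
```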

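Choosing the Parameters
• Vary the number of hidden-layer neurons n in the 3-layer DNN
and measure the recognition accuracy and sparsity ratio
• The hyperparameters of each DNN are set with Optuna
[Figure: 784-n-n-10 fully connected network]
Sparsity ratio = fraction of zero-weight edges (reduction ratio)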

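Training Results
• More neurons do not necessarily mean higher accuracy
→ choose an appropriately sized model
• The sparsity ratio correlates with the number of neurons
• A certain amount of connectivity (→ model complexity) is required
Hidden-layer neurons   120  100   80   60   40   20   10    5
Accuracy (%)            91   91   91   91   88   86   71   55
Sparsity ratio (%)      89   88   86   83   79   72   75   73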

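FPGA Implementation
• Turn the designed DNN into a combinational circuit!
• Hidden-layer neurons: 100, accuracy 91%, sparsity ratio 88%
• Maximum inputs per neuron: 22 → 2^22 bit → 256 BRAMs
(or 67,108,864 6-input LUTs!!)
[Figure: 784-100-100-10 fully connected network]
Each neuron has at most 22 inputs and 1 output (about 10 inputs on average)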

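Functional Decomposition
• Complexity of a general combinational circuit: $2^n / n$
• If it decomposes into blocks with $n_1$ and $n_2 + 1$ inputs: $2^{n_1}/n_1 + 2^{n_2+1}/(n_2+1)$
• The reduction is by a power of two
• FPGAs realize combinational circuits with LUTs (memories)
• If the function decomposes, memory shrinks by a power-of-two factor
[Figure: block H with n1 inputs feeds block G, which also takes n2 inputs]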
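
To get a feel for the size of the saving, the two complexity expressions can be evaluated directly. The split of a 22-input function into n1 = 11 and n2 = 11 below is a hypothetical choice, and the sketch assumes the function really does decompose with a single wire between H and G.

```python
def circuit_complexity(n):
    """Rough size of a general n-input combinational circuit: 2^n / n."""
    return 2 ** n / n

def decomposed_complexity(n1, n2):
    """Size after splitting into H (n1 inputs) and G (n2 + 1 inputs)."""
    return circuit_complexity(n1) + circuit_complexity(n2 + 1)

direct = circuit_complexity(22)         # undecomposed 22-input function
split = decomposed_complexity(11, 11)   # hypothetical even split
saving = direct / split
```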

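Functional Decomposition Method
f = g(h(X1), X2)
Bound variables: X1 = (x1, x2)
Free variables: X2 = (x3, x4)
[Figure: x1, x2 feed block H; the output of H and x3, x4 feed block G, which outputs f]
Decomposition chart (columns: X1, rows: X2):
X2 \ X1   00  01  10  11
00         0   1   0   1
01         1   1   1   1
10         1   0   1   0
11         1   0   1   0
Column multiplicity = 2
Number of wires between H and G = ⌈log2(number of distinct column patterns)⌉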

Example
The decomposition chart for X1 = (x1, x2), X2 = (x3, x4) has column multiplicity 2.
Assign a code to each distinct column pattern (encoder):
x1      0  0  1  1
x2      0  1  0  1
h(X1)   0  1  0  1
g(h(X1), x3, x4):
x3,x4   h(X1)=0  h(X1)=1
00      0        1
01      1        1
10      1        0
11      1        0
Memory size: direct realization 2^4 × 1 = 16 [bit]; decomposed 2^2 × 1 + 2^3 × 1 = 12 [bit]
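
The example can be checked mechanically. The sketch below encodes the decomposition chart with one column per value of X1 (a reading of the slide's figure, so treat the table itself as an assumption), derives h and g by assigning a code to each distinct column, and compares memory sizes.

```python
from itertools import product

# CHART[(x1, x2)] is the column for bound variables X1 = (x1, x2),
# listed over the free variables (x3, x4) = 00, 01, 10, 11.
CHART = {
    (0, 0): (0, 1, 1, 1),
    (0, 1): (1, 1, 0, 0),
    (1, 0): (0, 1, 1, 1),
    (1, 1): (1, 1, 0, 0),
}

def f(x1, x2, x3, x4):
    return CHART[(x1, x2)][2 * x3 + x4]

# Encoder: each distinct column pattern gets its own code, in order of appearance.
codes = {}
for column in CHART.values():
    codes.setdefault(column, len(codes))

h = {x1x2: codes[column] for x1x2, column in CHART.items()}
g = {(code, x3, x4): column[2 * x3 + x4]
     for column, code in codes.items()
     for x3, x4 in product((0, 1), repeat=2)}

column_multiplicity = len(codes)
wires = (column_multiplicity - 1).bit_length()    # ceil(log2(multiplicity))
direct_bits = 2 ** 4 * 1                          # one 4-input, 1-output table
decomposed_bits = 2 ** 2 * wires + 2 ** (wires + 2) * 1
```

Here `h` plays the role of block H and `g` the role of block G, so `g[(h[(x1, x2)], x3, x4)]` reproduces `f(x1, x2, x3, x4)` for every input.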

Classes Where Functional Decomposition Works Well
• Sparse functions (where the number of entries is far smaller than the input space)
→ packet classification
0 0 0 0 0 3 0 0
0 1 0 0 0 0 0 0
0 0 0 0 2 0 0 0
0 0 0 0 0 0 0 0
Hiroki Nakahara, Tsutomu Sasao, Munehiro Matsuura, "A packet classifier using LUT cascades
based on EVMDDs (k)," FPL, 2013, pp.1-6.

Classes Where Functional Decomposition Works Well (cont.)
• Monotonically increasing functions
• Segment index encoder
0 4 3 2 1 0 4 3
1 0 4 3 2 1 0 4
2 1 0 4 3 2 1 0
3 2 1 0 4 3 2 1
[Figure: encoder → ROM holding (a, b) → f(x) = ax + b]
Tsutomu Sasao, Shinobu Nagayama, Jon T. Butler,
"Numerical Function Generators Using LUT Cascades,"
IEEE Trans. Computers 56(6): 826-838 (2007).

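The Binary DNN Case
• It belongs to the class of Weighted-Sum functions (WS functions) [1]
• Their column multiplicity has already been analyzed
• What about batch normalization?
(When I was writing paper [2] in summer 2016, I could not think of a good method...)
[Figure: inputs x0 = 1, x1, x2, ..., xN with weights w0 (bias), w1, w2, ..., wN → adder → Batch Norm → sign → +1 or -1; annotated "limit as of summer 2016" and "achieved in March 2017"]
[1] T. Sasao, "Analysis and Synthesis of Weighted-Sum Functions," IEEE Trans. on CAD, Vol. 25, No. 5, 2006, pp.789-796.
[2] H. Nakahara et al., "A memory-based realization of a binarized deep convolutional neural network," FPT, 2016,
pp.277-280.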

Example of a Binarized Weighted-Sum Function (n = 5)
x0 x1 x2 x3 x4   Multiply-accumulate result
0  0  0  0  0    -w0-w1-w2-w3-w4
0  0  0  0  1    -w0-w1-w2-w3+w4
0  0  0  1  0    -w0-w1-w2+w3-w4
0  0  0  1  1    -w0-w1-w2+w3+w4
0  0  1  0  0    -w0-w1+w2-w3-w4
0  0  1  0  1    -w0-w1+w2-w3+w4
0  0  1  1  0    -w0-w1+w2+w3-w4
0  0  1  1  1    -w0-w1+w2+w3+w4
0  1  0  0  0    -w0+w1-w2-w3-w4
0  1  0  0  1    -w0+w1-w2-w3+w4
0  1  0  1  0    -w0+w1-w2+w3-w4
0  1  0  1  1    -w0+w1-w2+w3+w4
...
1  1  1  1  1    +w0+w1+w2+w3+w4

Decomposition Chart of a Binarized Weighted-Sum Function
• When the output of a binarized weighted-sum function has q bits, its column multiplicity is at most $2^q$
• Every column adds the same free-variable terms, so columns differ only in the partial sum over the bound variables;
the number of distinct partial sums is the column multiplicity
• Those partial sums fit in q bits including the sign, hence at most $2^q$
Decomposition chart: columns are indexed by the bound variables (x0, x1, x2), rows by the free variables (x3, x4);
each entry is the column's partial sum plus the row's partial sum.
Column partial sums (x0 x1 x2 = 000, 001, ..., 111):
-w0-w1-w2, -w0-w1+w2, -w0+w1-w2, -w0+w1+w2, w0-w1-w2, w0-w1+w2, w0+w1-w2, w0+w1+w2
Row partial sums (x3 x4 = 00, 01, 10, 11):
-w3-w4, -w3+w4, +w3-w4, +w3+w4
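
The claim that the column multiplicity equals the number of distinct bound-variable partial sums can be checked by brute force. The weights below are hypothetical +1/-1 values, and inputs are written as -1/+1 rather than 0/1.

```python
from itertools import product

def partial_sum(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))

def column_multiplicity(weights, n_bound):
    """Count distinct columns when the first n_bound inputs are bound variables."""
    bound_w, free_w = weights[:n_bound], weights[n_bound:]
    columns = set()
    for xb in product((-1, 1), repeat=n_bound):
        column = tuple(partial_sum(bound_w, xb) + partial_sum(free_w, xf)
                       for xf in product((-1, 1), repeat=len(free_w)))
        columns.add(column)
    return len(columns)

def distinct_bound_sums(weights, n_bound):
    """Number of distinct partial sums over the bound variables."""
    return len({partial_sum(weights[:n_bound], xb)
                for xb in product((-1, 1), repeat=n_bound)})

weights = [1, -1, 1, 1, -1]            # hypothetical binarized w0..w4
mu = column_multiplicity(weights, 3)   # bound variables: x0, x1, x2
sums = distinct_bound_sums(weights, 3)
```

With three bound variables there are 8 columns, but only 4 distinct partial sums (-3, -1, 1, 3), so only 4 distinct column patterns remain.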

On Batch Normalization
In a trained binarized NN, the batch normalization operation is
equivalent to adding an integer-precision bias
[Figure: Batch Norm block (mean, variance, scaling, shift) followed by the sign function f]
H. Yonekawa and H. Nakahara, "On-Chip Memory Based Binarized Convolutional Deep Neural Network Applying Batch
Normalization Free Technique on an FPGA," IPDPS Workshops 2017, pp.98-105.

Proof
• The output of batch normalization is the input to the sign function
→ the constant factor can be ignored
• The input to batch normalization is an integer value
→ the bias can be converted to an integer
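
The two proof steps can be sketched numerically. The code assumes the usual batch normalization form gamma * (y - mean) / sqrt(var + eps) + beta with gamma > 0, and the parameter values are made up for the check; it illustrates the argument and is not code from the cited paper.

```python
import math

def bn_then_sign(y, mean, var, gamma, beta, eps=1e-5):
    """Reference: batch normalization followed by the sign function."""
    z = gamma * (y - mean) / math.sqrt(var + eps) + beta
    return 1 if z >= 0 else -1

def integer_bias(mean, var, gamma, beta, eps=1e-5):
    """Fold trained BN parameters into one integer bias (assumes gamma > 0)."""
    # Dividing by the positive factor gamma / sqrt(var + eps) keeps the sign.
    real_bias = -mean + beta * math.sqrt(var + eps) / gamma
    # y is an integer, so y + real_bias >= 0 exactly when y + floor(real_bias) >= 0.
    return math.floor(real_bias)

def bias_then_sign(y, bias):
    return 1 if y + bias >= 0 else -1

params = dict(mean=3.7, var=4.0, gamma=0.8, beta=-0.5)  # hypothetical trained values
bias = integer_bias(**params)
matches = all(bn_then_sign(y, **params) == bias_then_sign(y, bias)
              for y in range(-22, 23))
```

A negative gamma would flip the inequality, so a real implementation would need to handle that case separately.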

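Overall Architecture
• Registers on the inputs and outputs; the whole network evaluates in 1 clock
• Each neuron is realized with multiple 6-LUTs via functional decomposition
[Figure: 784-100-100-10 fully connected network with inputs x0 ... x783, hidden layers z0 ... zn-1, and outputs y0 ... y9]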
⌈log2 22⌉ = 5
1 + (22 - 6 - 1) × 5 + 1 = 77 LUTs
About 1/4,000,000 of the 67,108,864 LUTs of the direct 2^22-bit realization

Implementation Results
• FPGA board: Avnet Ultra96
• Timing constraint: 100 MHz
• LUTs: 6,154
• FFs: 772
• Timing was actually met...
→ 100 MFPS
→ 100 × 1,000,000 FPS = 100,000,000 FPS!!!!
100 million FPS!
(beyond even a burst of desperate superhuman strength)

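The Road to 1 Trillion FPS
• We are now...
• @100 MHz (one frame per clock) → 100 million FPS
• LUTs: 6,154
• FFs: 772
• But the VCU1525 has...!
• 1,182,240 LUTs → about 190 copies can run in parallel
• 2,364,480 FFs
• Pipelined implementation → 300 MHz operation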
• (In theory) 190 × 300 MHz ≈ 57 billion FPS, so humanity has reached only 0.057 trillion FPS...
cf. the fireball breathed by a certain space dinosaur is 1 trillion degrees
Oh wait, we have two VCUs...

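Custom Processor
• The dedicated circuit turned out to be pointlessly fast...
• Design to match realistic I/O rates (e.g., a 30 FPS camera)
• Realize it by reading LUTs sequentially from a ROM
→ a table-lookup processor
[Figure: the LUT1, LUT2, LUT3 network is stored in a ROM as LUT1, LUT2, LUT3 and evaluated sequentially using REG and PRG blocks]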

mis-FPGA
• Increase the number of LUT inputs
→ more memory and fewer LUTs (= faster)
Performance scales with memory size (transistor count)
• Synthesize from a LUT netlist
R. Murgai, M. Fujita, F. Hirose, "Logic synthesis for a single large look-up table,"
ICCD, 1995, pp.415-424.
Memory size ∝ performance

After all...
• Binary precision alone cannot solve complex tasks...
• Regression needs precision → mixed precision

Noise Convolutional Operation
Add noise to the input, then apply a 1×1 convolution
[Figure: input image + noise → 1×1 convolution layer → output feature map]
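
A minimal NumPy sketch of the operation: a fixed noise tensor is added to the input, and a 1×1 convolution (a per-pixel linear map across channels) follows. The noise range, tensor shapes, and function name are assumptions, and batch normalization and activation are left out.

```python
import numpy as np

def noise_conv(x, noise, weights):
    """x, noise: (C, H, W) tensors; weights: (N, C) kernels of a 1x1 convolution."""
    perturbed = x + noise                                 # add noise to the input
    return np.einsum("nc,chw->nhw", weights, perturbed)   # 1x1 convolution

rng = np.random.default_rng(0)
C, H, W, N = 3, 8, 8, 4
x = rng.normal(size=(C, H, W))
noise = rng.uniform(-0.1, 0.1, size=(C, H, W))  # fixed, not trained (assumed range)
weights = rng.normal(size=(N, C))               # the only trainable parameters here
y = noise_conv(x, noise, weights)
```

Only the N × C weights are learned, which is where the parameter saving over a k × k convolution comes from.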

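Point-wise Convolution
• Use 1×1 convolutions to reduce computation and memory
[Figure: a standard convolution applies N kernels of size k × k × C to an M × M × C input; a point-wise convolution applies N kernels of size 1 × 1 × C]
Andrew G. Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile
Vision Applications," arXiv:1704.04861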
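
The saving is easy to quantify by counting weights and multiply-accumulates. The channel counts and feature-map size below are arbitrary example values, and the count assumes stride 1 with an M × M output and no bias terms.

```python
def conv_params(k, c, n):
    """Weights in a k x k convolution with C input and N output channels."""
    return k * k * c * n

def conv_macs(k, c, n, m):
    """Multiply-accumulates for an M x M output feature map."""
    return conv_params(k, c, n) * m * m

standard = conv_params(k=3, c=64, n=128)    # 3x3 convolution
pointwise = conv_params(k=1, c=64, n=128)   # 1x1 (point-wise) convolution
reduction = standard // pointwise           # factor of k * k
```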

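Statistical Equivalence
• Expected value: [equation image]
• Variance: [equation image]
With noise that satisfies these conditions, the noise convolution and the conventional convolution
are statistically equivalent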

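Improving the Existing PNN → NCNN(k)
(Noise CNN: NCNN)
• PNN (Perturbative NN) [1] was presented at CVPR 2018, but...
→ every layer is a noise convolution
→ the input image in particular does not satisfy the assumptions on the expected value and variance
(= the equivalence does not hold and recognition accuracy drops)
NCNN(k) [2]
A hybrid of conventional convolution layers and noise convolution layers
Use conventional convolutions until the assumptions are satisfied
[Figure: k conventional 3×3 convolution layers, then n-k noise convolution layers, then fully connected layers]
[1] F. Juefei-Xu, V. N. Boddeti, and M. Savvides, "Perturbative Neural Networks," CVPR, 2018, Vol. 1.
[2] A. Munakata, S. Sato, H. Nakahara, "A Noise Convolutional Neural Network," ISMVL, 2019 (accepted).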

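Comparison with the Conventional CNN and PNN
• Making every layer a noise convolution layer degrades accuracy
• NCNN(1), which keeps the first layer as a conventional convolution, limits the accuracy loss to 0.4 points
while cutting parameters by 88%
Classification task comparison (model: AlexNet, dataset: CIFAR-100)
                            PNN [CVPR2018]   NCNN(1) (Ours)   Conventional CNN
Accuracy (%)                29.1             49.4             49.8
Weights, all layers (MB)    1.1              1.2              10.0
Weight reduction ratio      0.89             0.88             -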

Noise Convolution Circuit
[Figure: block diagram. Off-chip DDR memory and DDR controller; input register; noise generator (RND) feeding an adder; parallel point-wise convolution units, each with a buffer and a weight memory (W.Mem); bias memory; BN unit; activation unit]

Training and Inference Time per Epoch
• NCNN cuts training time by 30-40% compared with the CNN
Model: ResNet-18, dataset: CIFAR-100
GPU: Nvidia GTX 1080Ti
Conventional conv layers in NCNN   1      3      7      11     15     CNN
Training time (s)                  50.3   48.8   45.1   46.4   59.1   75.2

Comparison with Existing Implementations
Implementation (Year)   Zhao et al. (2017) [1]   FINN (2017) [2]   Prost-Boucle et al. (2017) [3]   Ours (2019)
CNN                     Binary                   Binary            Ternary                          Noise
Clock (MHz)             143                      166               250                              199
#LUTs                   46900                    42823             67300                            40911
#18Kb BRAMs             94                       270               667                              228
#DSP48Es                3                        32                0                                192
Accuracy (%)            87.73                    80.10             86.71                            92.35
Time [msec]             5.94                     2.24              2.36                             1.80
FPS [1/s]               168                      445               423                              557
Power                   4.7                      2.5               6.8                              3.5
Faster and more accurate than the binary and ternary designs, but DSP blocks are required
Evaluated on a VGG9-based CNN; the dataset is CIFAR10
[1] R. Zhao, W. Song, W. Zhang, T. Xing, J.-H. Lin, M. Srivastava, R. Gupta and Z. Zhang, "Accelerating
Binarized Convolutional Neural Networks with Software-Programmable FPGAs," ISFPGA, 2017, pp.15-24.
[2] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers,
"FINN: A Framework for Fast, Scalable Binarized Neural Network Inference," ISFPGA, 2017.
[3] A. Prost-Boucle, A. Bourge, F. Pétrot, H. Alemdar, N. Caldwell, and V. Leroy, "Scalable high-performance
architecture for convolutional ternary neural networks on FPGA," FPL, 2017, pp.1-7.

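Summary
• Noise convolution (NCNN)
• Exploits the properties of noise to limit the loss of recognition accuracy
• Implemented on an FPGA and compared with existing methods
• Although it needs DSPs, it improves recognition accuracy and runs faster
than binary and ternary CNNs
• Applicable to tasks more advanced than classification
• Future work
• Application to practical use cases → YOLOv2 is already working!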
