2016 UEC Tokyo.
Caffe2C: A Framework for Easy Implementation
of CNN-based Mobile Applications
Ryosuke Tanno and Keiji Yanai
Department of Informatics,
The University of Electro-Communications, Tokyo
1. INTRODUCTION
• Deep Learning has achieved remarkable progress
– E.g. audio recognition, natural language processing
• In image recognition in particular, deep learning gives the best performance
– Outperforms even humans on 1000-class object recognition (He+, Delving Deep into Rectifiers, 2015)
Deep Learning (DNN, DCNN, CNN)
[Chart: ILSVRC classification accuracy by year – 2010: 72% (SIFT+BoF), 2011: 75%, 2012: 85%, 2013: 88.3%, 2014: 93.3%, 2015: 96.4% (deep learning), vs. trained human: 94.9%. Deep networks now outperform humans!]
• Many deep learning frameworks have emerged
– E.g. Caffe, TensorFlow, Chainer
Deep Learning Framework
Convolutional Architecture for Fast Feature Embedding (Caffe)
An open framework with models and examples for deep learning
• Focus on computer vision
• Pure C++/CUDA architecture for deep learning
• Command line, Python, and MATLAB interfaces
• Very fast processing speed
• Caffe is among the most popular frameworks in the world
What is Caffe?
• There have been many attempts to run CNNs on mobile devices
– CNNs require high computational power and memory
Bringing CNNs to Mobile
CNNs on mobile!
High computational power and memory demands are the bottleneck!
Files
• 3 files are required for training, which outputs a model
– The 3 files: network definition, mean, and label
How to train a model with Caffe?
[Diagram: Network + Mean + Label (3 files) and the Dataset go into training, which outputs a Caffemodel; these 4 files are then used on mobile]
• Currently we need to use the OpenCV DNN module
– It is not optimized for mobile devices
– Its execution speed is relatively slow
Using the 4 Caffe Files on Mobile
[Diagram: the 4 files: Network, Mean, Label, Model]
• We created Caffe2C, which converts the CNN model definition files and the parameter files trained by Caffe into a single C language code that can run on mobile devices
• Caffe2C makes it easy to use deep learning in any C language operating environment
• Caffe2C achieves faster runtime than the existing OpenCV DNN module
Objective
[Diagram: the 4 files (Network, Mean, Label, Model) are converted by Caffe2C into a single C code]
• To demonstrate the utility of Caffe2C, we have implemented 4 kinds of mobile CNN-based image recognition apps on iOS.
Objective
1. We created Caffe2C, which converts the model definition files and the parameter files of Caffe into a single C code that can run on mobile devices
2. We explain the flow of constructing a recognition app using Caffe2C
3. We have implemented 4 kinds of mobile CNN-based image recognition apps on iOS
Contributions
2. CONSTRUCTION OF A CNN-BASED MOBILE RECOGNITION SYSTEM
• To use parameters learned by Caffe on mobile devices, it is currently necessary to use the OpenCV DNN module, which is not optimized for mobile and relatively slow
• We created Caffe2C, which converts the CNN model definition files and the parameter files trained by Caffe into a single C language code
– This lets us use parameter files trained by Caffe on mobile devices
Caffe2C
• Caffe2C achieves faster execution speed than the existing OpenCV DNN module
Caffe2C
Runtime [ms], Caffe2C vs. OpenCV DNN (AlexNet, input size 227x227):
  iPhone 7 Plus: Caffe2C 106.9, OpenCV DNN 1663.8
  iPad Pro: Caffe2C 141.5, OpenCV DNN 1900.1
  iPhone SE: Caffe2C 141.5, OpenCV DNN 2239.8
Speedup rate: about 15x
1. Caffe2C directly converts the deep neural network into a C source code
Reasons for Fast Execution
[Diagram: Caffe2C turns the 4 files (Network, Mean, Label, Model) into a single C code ahead of time, so it executes like compiled code; OpenCV DNN parses the files at runtime, so it executes like an interpreter]
2. Caffe2C performs as much pre-processing of the CNN as possible to reduce the amount of online computation
– E.g., batch normalization is computed in advance and folded into the conv weights
3. Caffe2C effectively uses NEON/BLAS with multi-threading
Reasons for Fast Execution
[Diagram: the 4 files (Network, Mean, Label, Model) are converted by Caffe2C into a single C code]
Deployment Procedure
1. Train a deep CNN model with Caffe
2. Prepare the model files
3. Generate a C source code automatically with Caffe2C
4. Implement the C code on mobile together with GUI code
[Diagram: (1) the train phase produces a Caffemodel; (2) preparation gathers Network, Mean, Label, and Model; (3) Caffe2C converts them into C code; (4) the code is implemented on mobile: CNN into mobile!]
3. IMAGE RECOGNITION SYSTEM FOR EVALUATION
• To demonstrate the utility of Caffe2C, we have implemented four kinds of mobile CNN-based image recognition apps on iOS
• We explain the image recognition engine used in the iOS applications
Image Recognition System for Evaluation
CNN Architecture
• Representative architectures are AlexNet, VGG-16, GoogLeNet, and Network-In-Network (NIN)
CNN Architecture
• The number of weights in AlexNet and VGG-16 is too large for mobile
• GoogLeNet is too complicated for efficient parallel implementation (it has many branches)
CNN Architecture
• We adopt Network-In-Network (NIN)
– No fully-connected layers (which means far fewer parameters)
– A straight flow consisting of many conv layers
– Relatively smaller than the other architectures
⇒ It is easy to implement in parallel
Efficient computation of conv layers is needed!
Network-In-Network (NIN)
Fast Computation of Conv Layers
– efficient GEMM with 4 cores and BLAS/NEON –
• Conv = im2col + GEMM (General Matrix Multiplication)
[Diagram: im2col unrolls the input feature maps into a matrix of patch columns (patch 1 to 5); the conv kernels (kernel 1 to 4) form the other matrix, and their product is the conv layer computation. The kernel rows are computed in parallel over multiple cores (Core 1: kernel 1, Core 2: kernel 2, Core 3: kernel 3, Core 4: kernel 4), and inside each core NEON or BLAS is used.]
• Speeding up conv layers → speeding up GEMM
– The computation of a conv layer is decomposed into an "im2col" operation and general matrix multiplications (GEMM)
– Multi-threading: use 2 cores on iOS and 4 cores on Android in parallel
– SIMD instructions (NEON on ARM-based processors)
• Total: iOS: 2 cores x 4 = 8 parallel calculations; Android: 4 cores x 4 = 16 parallel calculations
– BLAS library (highly optimized for iOS ⇔ not optimized for Android)
• BLAS (iOS: BLAS in the iOS Accelerate framework; Android: OpenBLAS)
Fast Implementation on Mobile
Evaluation: Processing Time
• iOS: BLAS >> NEON; Android: BLAS << NEON
– For iOS, using BLAS in the iOS Accelerate framework (highly optimized) is the best choice
– For Android, using NEON (SIMD instructions) is better than OpenBLAS
Recognition time [ms], BLAS vs. NEON:
  iOS, iPhone 7 Plus (Accelerate): NEON 181.0, BLAS 55.7
  iOS, iPad Pro (Accelerate): NEON 222.4, BLAS 66.0
  iOS, iPhone SE (Accelerate): NEON 251.8, BLAS 79.9
  Android, GALAXY Note 3 (OpenBLAS): NEON 251.0, BLAS 1652.0
Comparison to the FV-based Previous Method
Deep learning with the UEC-FOOD100 dataset
• Much improved (65.3% ⇒ 81.5% top-1)
• Even at 160x160 input, improved (65.3% ⇒ 71.5%)
[Chart: top-N classification accuracy (N = 1 to 10, 60% to 100%) for NIN 5-layer [104 ms], NIN 4-layer [67 ms], NIN 4-layer at 160x160 [33 ms], and FV (Color+HOG) [65 ms], with AlexNet for reference. Top-1: 81.5% vs. 65.3%; top-5: 96.2% vs. 86.7%; accuracy is kept almost the same at the smaller input size.]
4. MOBILE APPLICATIONS
• We have implemented 4 kinds of mobile CNN-based
image recognition apps on iOS
– Food recognition app: “DeepFoodCam”
– Bird recognition app: “DeepBirdCam”
– Dog recognition app: “DeepDogCam”
– Flower recognition app: “DeepFlowerCam”
4 iOS Applications
DeepFoodCam
• Recognizes 101 classes: 100 food classes and one non-food class
Training phase
• Fine-tuned the CNN with images of the 101 classes
– 20,000 images in total
– UECFOOD-100 plus non-food images collected from Twitter
Accuracy: Food, 101 classes: top-1 74.5%, top-5 93.5%
• Recognizes 200 bird classes
Training phase
• Fine-tuned the CNN with 6,033 images from the Caltech-UCSD Birds 200 dataset
DeepBirdCam
Accuracy: Bird, 200 classes: top-1 55.8%, top-5 80.2%
• Recognizes 100 dog classes
Training phase
• Fine-tuned the CNN with 150 or more images per class from the Stanford Dogs dataset
DeepDogCam
Accuracy: Dog, 100 classes: top-1 69.0%, top-5 91.6%
• Recognizes 102 flower classes
Training phase
• Fine-tuned the CNN with 80 or more images per class from the 102 Category Flower dataset
DeepFlowerCam
Accuracy: Flower, 102 classes: top-1 64.1%, top-5 85.8%
• We have implemented 4 kinds of mobile CNN-based
image recognition apps on iOS
– Food recognition app: “DeepFoodCam”
– Bird recognition app: “DeepBirdCam”
– Dog recognition app: “DeepDogCam”
– Flower recognition app: “DeepFlowerCam”
4 iOS Applications
If you prepare training data, you can create mobile recognition apps in a day!
1. We created Caffe2C, which converts the model definition files and the parameter files of Caffe into a single C code that can run on mobile devices
2. We explained the flow of constructing a recognition app using Caffe2C
3. We implemented 4 kinds of mobile CNN-based image recognition apps on iOS
Conclusions
• We also applied our mobile framework to real-time CNN-based mobile image processing
– such as neural style transfer
Additional work
Thank you for listening
iOS App is Available !
“DeepFoodCam“
iOS App is Available !
“RealTimeMultiStyleTransfer”
Extension of NIN: adding BN, 5 layers, multiple image sizes
• Modified models (BN, 5-layer, multi-scale)
– Added BN layers just after all the conv/cccp layers
– Replaced the 5x5 conv with two 3x3 conv layers
– Reduced the number of kernels in conv4 from 1024 to 768
– Replaced fixed average pooling with Global Average Pooling (GAP)
• Multiple image sizes (trade-off: accuracy vs. speed; 4-layer and 5-layer+BN variants)
  227x227: 55.7 ms, 78.8%
  180x180: 35.5 ms, 76.0%
  160x160: 26.3 ms, 71.5%