2016 UEC Tokyo.
Caffe2C: A Framework for Easy Implementation
of CNN-based Mobile Applications
Ryosuke Tanno and Keiji Yanai
Department of Informatics,
The University of Electro-Communications, Tokyo
1. INTRODUCTION
• Deep Learning has achieved remarkable progress
– E.g. audio recognition, natural language processing
• In image recognition in particular, deep learning gives the best performance
– Outperforms even humans on 1000-class object recognition (He+, Delving Deep into Rectifiers, 2015)
Deep Learning (DNN, DCNN, CNN)
[Chart: ILSVRC classification accuracy by year – 2010: 72% (SIFT+BoF), 2011: 75%, 2012: 85%, 2013: 88.3%, 2014: 93.3%, 2015: 96.4% (deep learning), vs. trained human: 94.9%. Deep networks now outperform humans!]
• Many deep learning frameworks have emerged
– E.g. Caffe, TensorFlow, Chainer
Deep Learning Framework
Convolutional Architecture for Fast Feature Embedding (Caffe)
An open framework with models and examples for deep learning
• Focus on computer vision
• Pure C++/CUDA architecture for deep learning
• Command line, Python, and MATLAB interfaces
• Very fast processing speed
• Caffe is among the most popular frameworks in the world
What is Caffe?
• There have been many attempts to run CNNs on mobile devices
– CNNs require high computational power and memory
Bringing CNNs to Mobile
CNNs on mobile!
High computational power and memory demands are the bottleneck!
Files
• 3 files are required for training, which outputs a model
– The 3 files: network definition, mean, and label
How to train a model with Caffe?
[Diagram: Network + Mean + Label (3 files) and the Dataset go into training, which outputs a Caffemodel; these 4 files are then used on mobile]
• Currently we need to use the OpenCV DNN module
– It is not optimized for mobile devices
– Its execution speed is relatively slow
Using the 4 Caffe Files on Mobile
[Diagram: the 4 files: Network, Mean, Label, Model]
• We created Caffe2C, which converts the CNN model definition files and the parameter files trained by Caffe into a single C language code that can run on mobile devices
• Caffe2C makes it easy to use deep learning in any C language operating environment
• Caffe2C achieves faster runtime than the existing OpenCV DNN module
Objective
[Diagram: the 4 files (Network, Mean, Label, Model) are converted by Caffe2C into a single C code]
• To demonstrate the utility of Caffe2C, we have implemented 4 kinds of mobile CNN-based image recognition apps on iOS.
Objective
1. We created Caffe2C, which converts the model definition files and the parameter files of Caffe into a single C code that can run on mobile devices
2. We explain the flow of constructing a recognition app using Caffe2C
3. We have implemented 4 kinds of mobile CNN-based image recognition apps on iOS
Contributions
2. CONSTRUCTION OF A CNN-BASED MOBILE RECOGNITION SYSTEM
• To use parameters learned by Caffe on mobile devices, it is currently necessary to use the OpenCV DNN module, which is not optimized for mobile and relatively slow
• We created Caffe2C, which converts the CNN model definition files and the parameter files trained by Caffe into a single C language code
– This lets us use parameter files trained by Caffe on mobile devices
Caffe2C
• Caffe2C achieves faster execution speed than the existing OpenCV DNN module
Caffe2C
Runtime [ms], Caffe2C vs. OpenCV DNN (AlexNet, input size 227x227):
  iPhone 7 Plus: Caffe2C 106.9, OpenCV DNN 1663.8
  iPad Pro: Caffe2C 141.5, OpenCV DNN 1900.1
  iPhone SE: Caffe2C 141.5, OpenCV DNN 2239.8
Speedup rate: about 15x
1. Caffe2C directly converts the deep neural network into a C source code
Reasons for Fast Execution
[Diagram: Caffe2C turns the 4 files (Network, Mean, Label, Model) into a single C code ahead of time, so it executes like compiled code; OpenCV DNN parses the files at runtime, so it executes like an interpreter]
2. Caffe2C performs as much pre-processing of the CNN as possible to reduce the amount of online computation
– E.g., batch normalization is computed in advance and folded into the conv weights
3. Caffe2C effectively uses NEON/BLAS with multi-threading
Reasons for Fast Execution
[Diagram: the 4 files (Network, Mean, Label, Model) are converted by Caffe2C into a single C code]
Deployment Procedure
1. Train a deep CNN model with Caffe
2. Prepare the model files
3. Generate a C source code automatically with Caffe2C
4. Implement the C code on mobile together with GUI code
[Diagram: (1) the train phase produces a Caffemodel; (2) preparation gathers Network, Mean, Label, and Model; (3) Caffe2C converts them into C code; (4) the code is implemented on mobile: CNN into mobile!]
3. IMAGE RECOGNITION SYSTEM FOR EVALUATION
• To demonstrate the utility of Caffe2C, we have implemented four kinds of mobile CNN-based image recognition apps on iOS
• We explain the image recognition engine used in the iOS applications
Image Recognition System for Evaluation
CNN Architecture
• Representative architectures are AlexNet, VGG-16, GoogLeNet, and Network-In-Network (NIN)
CNN Architecture
• The number of weights in AlexNet and VGG-16 is too large for mobile
• GoogLeNet is too complicated for efficient parallel implementation (it has many branches)
CNN Architecture
• We adopt Network-In-Network (NIN)
– No fully-connected layers (which means far fewer parameters)
– A straight flow consisting of many conv layers
– Relatively smaller than the other architectures
⇒ It is easy to implement in parallel
Efficient computation of conv layers is needed!
Network-In-Network (NIN)
Fast Computation of Conv Layers
– efficient GEMM with 4 cores and BLAS/NEON –
• Conv = im2col + GEMM (General Matrix Multiplication)
[Diagram: im2col unrolls the input feature maps into a matrix of patch columns (patch 1 to 5); the conv kernels (kernel 1 to 4) form the other matrix, and their product is the conv layer computation. The kernel rows are computed in parallel over multiple cores (Core 1: kernel 1, Core 2: kernel 2, Core 3: kernel 3, Core 4: kernel 4), and inside each core NEON or BLAS is used.]
• Speeding up conv layers → speeding up GEMM
– The computation of a conv layer is decomposed into an "im2col" operation and general matrix multiplications (GEMM)
– Multi-threading: use 2 cores on iOS and 4 cores on Android in parallel
– SIMD instructions (NEON on ARM-based processors)
• Total: iOS: 2 cores x 4 = 8 parallel calculations; Android: 4 cores x 4 = 16 parallel calculations
– BLAS library (highly optimized for iOS ⇔ not optimized for Android)
• BLAS (iOS: BLAS in the iOS Accelerate framework; Android: OpenBLAS)
Fast Implementation on Mobile
Evaluation: Processing Time
• iOS: BLAS >> NEON; Android: BLAS << NEON
– For iOS, using BLAS in the iOS Accelerate framework (highly optimized) is the best choice
– For Android, using NEON (SIMD instructions) is better than OpenBLAS
Recognition time [ms], BLAS vs. NEON:
  iOS, iPhone 7 Plus (Accelerate): NEON 181.0, BLAS 55.7
  iOS, iPad Pro (Accelerate): NEON 222.4, BLAS 66.0
  iOS, iPhone SE (Accelerate): NEON 251.8, BLAS 79.9
  Android, GALAXY Note 3 (OpenBLAS): NEON 251.0, BLAS 1652.0
Comparison to the FV-based Previous Method
Deep learning with the UEC-FOOD100 dataset
• Much improved (65.3% ⇒ 81.5% top-1)
• Even at 160x160 input, improved (65.3% ⇒ 71.5%)
[Chart: top-N classification accuracy (N = 1 to 10, 60% to 100%) for NIN 5-layer [104 ms], NIN 4-layer [67 ms], NIN 4-layer at 160x160 [33 ms], and FV (Color+HOG) [65 ms], with AlexNet for reference. Top-1: 81.5% vs. 65.3%; top-5: 96.2% vs. 86.7%; accuracy is kept almost the same at the smaller input size.]
4. MOBILE APPLICATIONS
• We have implemented 4 kinds of mobile CNN-based
image recognition apps on iOS
– Food recognition app: “DeepFoodCam”
– Bird recognition app: “DeepBirdCam”
– Dog recognition app: “DeepDogCam”
– Flower recognition app: “DeepFlowerCam”
4 iOS Applications
DeepFoodCam
• Recognizes 101 classes: 100 food classes and one non-food class
Training phase
• Fine-tuned the CNN with images of the 101 classes
– 20,000 images in total
– UECFOOD-100 plus non-food images collected from Twitter
Accuracy: Food, 101 classes: top-1 74.5%, top-5 93.5%
• Recognizes 200 bird classes
Training phase
• Fine-tuned the CNN with 6,033 images from the Caltech-UCSD Birds 200 dataset
DeepBirdCam
Accuracy: Bird, 200 classes: top-1 55.8%, top-5 80.2%
• Recognizes 100 dog classes
Training phase
• Fine-tuned the CNN with 150 or more images per class from the Stanford Dogs dataset
DeepDogCam
Accuracy: Dog, 100 classes: top-1 69.0%, top-5 91.6%
• Recognizes 102 flower classes
Training phase
• Fine-tuned the CNN with 80 or more images per class from the 102 Category Flower dataset
DeepFlowerCam
Accuracy: Flower, 102 classes: top-1 64.1%, top-5 85.8%
• We have implemented 4 kinds of mobile CNN-based
image recognition apps on iOS
– Food recognition app: “DeepFoodCam”
– Bird recognition app: “DeepBirdCam”
– Dog recognition app: “DeepDogCam”
– Flower recognition app: “DeepFlowerCam”
4 iOS Applications
If you prepare training data, you can create mobile recognition apps in a day!
1. We created Caffe2C, which converts the model definition files and the parameter files of Caffe into a single C code that can run on mobile devices
2. We explained the flow of constructing a recognition app using Caffe2C
3. We implemented 4 kinds of mobile CNN-based image recognition apps on iOS
Conclusions
• We also applied our mobile framework to real-time CNN-based mobile image processing
– such as neural style transfer
Additional work
Thank you for listening
iOS App is Available !
“DeepFoodCam“
iOS App is Available !
“RealTimeMultiStyleTransfer”
Extension of NIN: adding BN, 5 layers, multiple image sizes
• Modified models (BN, 5-layer, multi-scale)
– Added BN layers just after all the conv/cccp layers
– Replaced the 5x5 conv with two 3x3 conv layers
– Reduced the number of kernels in conv4 from 1024 to 768
– Replaced fixed average pooling with Global Average Pooling (GAP)
• Multiple image sizes (trade-off: accuracy vs. speed; 4-layer and 5-layer+BN variants)
  227x227: 55.7 ms, 78.8%
  180x180: 35.5 ms, 76.0%
  160x160: 26.3 ms, 71.5%