CNNS
FROM THE BASICS TO RECENT ADVANCES
Dmytro Mishkin
Center for Machine Perception
Czech Technical University in Prague
ducha.aiki@gmail.com
MY BACKGROUND
 PhD student at the Czech Technical University in Prague, now working fully in deep learning; recent paper “All you need is a good init” was added to the Stanford CS231n course.
 Kaggler: 9th out of 1,049 teams at the National Data Science Bowl.
 CTO of Clear Research (clear.sx); using deep learning at work since 2014.
OUTLINE
Short review of CNN design:
 Architecture progress: AlexNet → VGGNet → GoogLeNet → ResNet
 Initialization
 Design choices
IMAGENET WINNERS

CAFFENET ARCHITECTURE
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012.
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014.
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”.
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016.

CAFFENET ARCHITECTURE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
VGGNET ARCHITECTURE
All convolutions are 3x3.
Good performance, but slow.
INCEPTION (GOOGLENET): BUILDING BLOCK
Szegedy et al., Going Deeper with Convolutions. CVPR, 2015.
Image credit: https://www.udacity.com/course/deep-learning--ud730
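The block runs 1x1, 3x3, and 5x5 convolutions (plus pooling) in parallel and concatenates the results along the channel axis. A minimal sketch, assuming PyTorch; the branch widths here are illustrative, not GoogLeNet's exact numbers:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception block: parallel branches, concatenated on channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)      # 1x1 branch
        self.b3 = nn.Sequential(                           # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
            nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(                           # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(                           # pooling branch + 1x1
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]): 64+128+32+32 channels
```

The 1x1 "reduction" convolutions keep the 3x3 and 5x5 branches cheap, which is what lets the block widen without exploding the FLOP count.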
DEPTH LIMIT NOT REACHED YET IN DEEP LEARNING
VGGNet – 19 layers, 19.6 billion FLOPs. Simonyan et al., 2014.
ResNet – 152 layers, 11.3 billion FLOPs. He et al., 2015.
Stochastic ResNet – 1200 layers. Huang et al., Deep Networks with Stochastic Depth, 2016.
Slide credit: Šulc et al., Very Deep Residual Networks with MaxOut for Plant Identification in the Wild, 2016.
RESNET: RESIDUAL BLOCK
He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
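The block learns a residual F(x) and adds the input back: y = F(x) + x, so the identity path gives gradients a short route through very deep nets. A minimal sketch, assuming PyTorch (basic two-layer block with an identity shortcut, so input and output channels must match):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the input back
```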
RESIDUAL NETWORKS: HOT TOPIC
 Identity mapping https://arxiv.org/abs/1603.05027
 Wide ResNets https://arxiv.org/abs/1605.07146
 Stochastic depth https://arxiv.org/abs/1603.09382
 Residual Inception https://arxiv.org/abs/1602.07261
 ResNets + ELU http://arxiv.org/pdf/1604.04112.pdf
 ResNet in ResNet http://arxiv.org/pdf/1608.02908v1.pdf
 Densely Connected Nets (DC Nets) http://arxiv.org/abs/1608.06993
 Weighted ResNet http://arxiv.org/pdf/1605.08831v1.pdf
DATASETS USED IN PRESENTATION: IMAGENET AND CIFAR-10
CIFAR-10:
 50k training images (32x32 px)
 10k validation images
 10 classes
ImageNet:
 1.2M training images (~256x256 px)
 50k validation images
 1000 classes
Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2015.
Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.
CAFFENET ARCHITECTURE
Image credit: Hu et al., Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery, 2015.
LIST OF HYPER-PARAMETERS TESTED
REFERENCE METHODS: IMAGE SIZE SENSITIVE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
CHOICE OF NON-LINEARITY
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NON-LINEARITIES ON CAFFENET
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
WEIGHT INITIALIZATION FOR A VERY DEEP NET
 Gaussian noise with fixed variance:
 var(w_L) = 0.01 (AlexNet, Krizhevsky et al., 2012)
 var(w_L) = 1/n_inputs (Glorot et al., 2010)
 var(w_L) = 2/n_inputs (He et al., 2015)
 Orthonormal (Saxe et al., 2013): Glorot → SVD → w_L = V
 Data-dependent: LSUV (Mishkin and Matas, 2016)
Mishkin and Matas. All you need is a good init. ICLR, 2016.
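A sketch of these rules with NumPy; fan_in and fan_out are illustrative layer sizes. For the orthonormal case the Glorot scaling cancels in the SVD, so a plain Gaussian start gives the same orthonormal factor:

```python
import numpy as np

fan_in, fan_out = 1024, 512

w_alexnet = np.random.randn(fan_in, fan_out) * np.sqrt(0.01)           # var = 0.01
w_glorot  = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)   # var = 1/n_inputs
w_he      = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # var = 2/n_inputs

# Orthonormal (Saxe et al., 2013): Gaussian matrix -> SVD -> keep an
# orthonormal factor with the right shape.
g = np.random.randn(fan_in, fan_out)
u, _, vT = np.linalg.svd(g, full_matrices=False)
w_ortho = u if u.shape == g.shape else vT
```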
WEIGHT INITIALIZATION INFLUENCES ACTIVATIONS
a) Layer gain G_L < 1 → vanishing activation variance
b) Layer gain G_L > 1 → exploding activation variance
Mishkin and Matas. All you need is a good init. ICLR, 2016.

ACTIVATIONS INFLUENCE THE MAGNITUDE OF GRADIENT COMPONENTS
Mishkin and Matas. All you need is a good init. ICLR, 2016.
KEEPING THE PRODUCT OF PER-LAYER GAINS ~1:
LAYER-SEQUENTIAL UNIT-VARIANCE ORTHOGONAL INITIALIZATION

Algorithm 1. Layer-sequential unit-variance (LSUV) orthogonal initialization.
L – a convolutional or fully-connected layer, W_L – its weights, O_L – the layer output,
ε – variance tolerance, T_i – iteration number, T_max – max number of iterations.

Pre-initialize the network with orthonormal matrices as in Saxe et al. (2013)
for each convolutional and fully-connected layer L do
  repeat
    do a forward pass with a mini-batch
    calculate var(O_L)
    W_L := W_L / sqrt(var(O_L))
  until |var(O_L) − 1.0| < ε or T_i > T_max
end for

*The LSUV algorithm does not deal with biases and initializes them with zeros.
Mishkin and Matas. All you need is a good init. ICLR, 2016.
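A minimal sketch of the LSUV loop, assuming PyTorch; this is a simplified illustration, not the authors' released code (their implementations are linked two slides below):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, batch, tol=0.1, t_max=10):
    """Rescale each conv/FC layer's weights until its output variance is ~1."""
    for layer in model.modules():
        if not isinstance(layer, (nn.Conv2d, nn.Linear)):
            continue
        nn.init.orthogonal_(layer.weight)        # pre-init as in Saxe et al. (2013)
        if layer.bias is not None:
            layer.bias.zero_()                   # LSUV keeps biases at zero
        hook = layer.register_forward_hook(
            lambda mod, inp, out: setattr(mod, "_var", out.var().item()))
        for _ in range(t_max):                   # forward pass, then rescale
            model(batch)
            if abs(layer._var - 1.0) < tol:
                break
            layer.weight /= layer._var ** 0.5    # W := W / sqrt(var(O_L))
        hook.remove()
```

Called once before training with a single mini-batch of real data, e.g. `lsuv_init(net, images)`.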
COMPARISON OF THE INITIALIZATIONS FOR DIFFERENT ACTIVATIONS
 CIFAR-10 FitNet, accuracy [%]
 CIFAR-10 FitResNet, accuracy [%]
Mishkin and Matas. All you need is a good init. ICLR, 2016.
LSUV INITIALIZATION IMPLEMENTATIONS
Mishkin and Matas. All you need is a good init. ICLR, 2016
 Caffe https://github.com/ducha-aiki/LSUVinit
 Keras https://github.com/ducha-aiki/LSUV-keras
 Torch https://github.com/yobibyte/torch-lsuv
BATCH NORMALIZATION (AFTER EVERY CONVOLUTION LAYER)
Ioffe and Szegedy, ICML 2015.
BATCH NORMALIZATION: WHERE, BEFORE OR AFTER NON-LINEARITY?
Mishkin and Matas. All you need is a good init. ICLR, 2016.
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.

ImageNet, top-1 accuracy [%]:
Network              No BN   BN before ReLU   BN after ReLU
CaffeNet128-FC2048   47.1    47.8             49.9
GoogLeNet128         61.9    60.3             59.6

CIFAR-10, top-1 accuracy [%], FitNet4 network:
Non-linearity   BN before   BN after
TanH            88.1        89.2
ReLU            92.6        92.5
MaxOut          92.3        92.9

In short: better to test with your architecture and dataset :)
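The two placements compared in the tables, as a minimal PyTorch-style sketch:

```python
import torch.nn as nn

def conv_bn_relu(cin, cout):
    """BN before the non-linearity, as in Ioffe and Szegedy (2015)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

def conv_relu_bn(cin, cout):
    """BN after the non-linearity; better for CaffeNet in the table above."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(cout))
```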
BATCH NORMALIZATION SOMETIMES WORKS TOO WELL AND HIDES PROBLEMS
Case: the CNN has fewer outputs than there are classes in the dataset (just a typo): 26 vs. 28.
The batch-normalized net “learns well”; the plain CNN diverges.
NON-LINEARITIES ON CAFFENET, WITH BATCH NORMALIZATION
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NON-LINEARITIES: TAKE-AWAY MESSAGE
 Use ELU without batch normalization
 Or ReLU + BN
 Try maxout for the final layers
 Fallback solution (if something goes wrong) – ReLU
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
BUT IN A SMALL-DATA REGIME (~50K IMAGES), TRY LEAKY OR RANDOMIZED RELU

 Accuracy [%], Network in Network architecture:
            ReLU    VLReLU   RReLU   PReLU
CIFAR-10    87.55   88.80    88.81   88.20
CIFAR-100   57.10   59.60    59.80   58.40

 LogLoss (lower is better), Plankton VGG architecture:
        ReLU   VLReLU   RReLU   PReLU
KNDB    0.77   0.73     0.72    0.74

Xu et al., Empirical Evaluation of Rectified Activations in Convolutional Network, ICLR 2015.
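All four variants are available off the shelf; a sketch assuming PyTorch (VLReLU, "very leaky" ReLU, is just a leaky ReLU with a large slope such as 1/3):

```python
import torch.nn as nn

relu   = nn.ReLU()
vlrelu = nn.LeakyReLU(negative_slope=1.0 / 3)    # "very leaky" ReLU
rrelu  = nn.RReLU(lower=1.0 / 8, upper=1.0 / 3)  # slope randomized at train time
prelu  = nn.PReLU()                              # slope is a learned parameter
```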
INPUT IMAGE SIZE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
PADDING TYPES
Zero padding, stride = 2 · No padding, stride = 2 · Zero padding, stride = 1
Dumoulin and Visin. A guide to convolution arithmetic for deep learning. arXiv 2016.
PADDING
 Zero-padding:
 preserves spatial size, does not “wash out” information
 gives dropout-like augmentation with zeros
CaffeNet128:
 with conv padding: 47% top-1 accuracy
 w/o conv padding: 41% top-1 accuracy
The output-size arithmetic is sketched below.
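The effect of padding follows the standard convolution arithmetic o = floor((i + 2p − k) / s) + 1 for input size i, kernel k, padding p, stride s; a small sketch:

```python
def conv_out_size(i, k, p, s):
    """Spatial output size for input i, kernel k, padding p, stride s."""
    return (i + 2 * p - k) // s + 1

print(conv_out_size(227, 11, 0, 4))  # 55: CaffeNet conv1, no padding
print(conv_out_size(13, 3, 1, 1))    # 13: 3x3 conv with zero-padding keeps size
print(conv_out_size(13, 3, 0, 1))    # 11: without padding the map shrinks
```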
MAX POOLING: PADDING AND KERNEL
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
POOLING METHODS
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
LEARNING RATE POLICY
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
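The policies compared in the original figure are standard schedules; a sketch of two common ones (step and linear decay), with illustrative hyper-parameters; the exact policies and their ranking are in the paper:

```python
def lr_step(base_lr, epoch, step=25, gamma=0.1):
    """Step policy: drop the learning rate by 'gamma' every 'step' epochs."""
    return base_lr * gamma ** (epoch // step)

def lr_linear(base_lr, epoch, max_epoch):
    """Linear decay: anneal the learning rate to zero over training."""
    return base_lr * (1.0 - epoch / max_epoch)
```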
IMAGE PREPROCESSING
 Subtract the mean pixel (of the training set), divide by the std.
 RGB is the best (standard) colorspace for CNNs.
 Do nothing more…
 …unless you have a specific dataset: e.g., subtract the local mean pixel
(B. Graham, 2015, Kaggle Diabetic Retinopathy Competition report).
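A sketch of both options, assuming NumPy/SciPy and float images of shape HxWx3; the box blur here stands in for the Gaussian blur used in B. Graham's report:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def preprocess_global(img, mean_pixel, std_pixel):
    """Standard: subtract the training-set mean pixel, divide by its std."""
    return (img - mean_pixel) / std_pixel

def preprocess_local_mean(img, k=31):
    """Dataset-specific: subtract a local (blurred) mean, per channel."""
    local_mean = uniform_filter(img, size=(k, k, 1))  # box blur over HxW
    return img - local_mean
```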
IMAGE PREPROCESSING: WHAT DOESN’T WORK
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
IMAGE PREPROCESSING: LET’S LEARN THE COLORSPACE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
Image credit: https://www.udacity.com/course/deep-learning--ud730
DATASET QUALITY AND SIZE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NETWORK WIDTH: SATURATION AND SPEED PROBLEM
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
BATCH SIZE AND LEARNING RATE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
CLASSIFIER DESIGN
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
Ren et al., Object Detection Networks on Convolutional Feature Maps, arXiv 2016.
Take home: put the fully-connected layer just before the final layer, not earlier (sketched below).
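A sketch of this rule, assuming PyTorch: keep the body fully convolutional, pool, and place the fully-connected layer only right before the classifier:

```python
import torch.nn as nn

def make_head(feat_ch, hidden, n_classes):
    """Classifier head: pool -> FC -> final layer; no FC earlier in the net."""
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),       # collapse the spatial dimensions
        nn.Flatten(),
        nn.Linear(feat_ch, hidden),    # fully-connected just before the final layer
        nn.ReLU(inplace=True),
        nn.Linear(hidden, n_classes))  # final classifier
```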
APPLYING IT ALL TOGETHER
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
>5 pp. additional top-1 accuracy for free.
THANK YOU FOR YOUR ATTENTION
 Any questions?
 All logs, graphs, and network definitions: https://github.com/ducha-aiki/caffenet-benchmark
Feel free to add your tests :)
 The paper is here: https://arxiv.org/abs/1606.02228
ducha.aiki@gmail.com
mishkdmy@cmp.felk.cvut.cz
ARCHITECTURE
 Use filters as small as possible:
 3x3 + ReLU + 3x3 + ReLU > 5x5 + ReLU
 3x1 + 1x3 > 3x3
 2x2 + 2x2 > 3x3
 Exception: the 1st layer; using 3x3 there is too computationally inefficient.
A parameter-count sketch follows below.
Convolutional Neural Networks at Constrained Time Cost. He and Sun, CVPR 2015.
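Why stacking small filters wins: two 3x3 convolutions cover the same 5x5 receptive field with fewer parameters and one extra non-linearity. A quick count for C input and output channels (biases ignored):

```python
C = 64
params_5x5 = 5 * 5 * C * C               # one 5x5 layer: 102,400 weights
params_3x3_stack = 2 * (3 * 3 * C * C)   # two 3x3 layers: 73,728 weights (~28% fewer)
print(params_5x5, params_3x3_stack)
```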
CAFFENET TRAINING
Mishkin and Matas. All you need is a good init. ICLR, 2016.
GOOGLENET TRAINING
Mishkin and Matas. All you need is a good init. ICLR, 2016.