CNNS
FROM THE BASICS TO RECENT ADVANCES
Dmytro Mishkin
Center for Machine Perception
Czech Technical University in Prague
ducha.aiki@gmail.com
MY BACKGROUND
 PhD student at the Czech Technical University in Prague, now working fully in deep learning; recent paper “All you need is a good init” was added to the Stanford CS231n course.
 Kaggler: 9th out of 1,049 teams at the National Data Science Bowl.
 CTO of Clear Research (clear.sx); using deep learning at work since 2014.
OUTLINE
Short review of CNN design:
 Architecture progress: AlexNet → VGGNet → GoogLeNet → ResNet
 Initialization
 Design choices
IMAGENET WINNERS

CAFFENET ARCHITECTURE
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012.
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014.
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”.
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016.

CAFFENET ARCHITECTURE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
VGGNET ARCHITECTURE
All convolutions are 3x3.
Good performance, but slow.
INCEPTION (GOOGLENET): BUILDING BLOCK
Szegedy et al., Going Deeper with Convolutions. CVPR, 2015.
Image credit: https://www.udacity.com/course/deep-learning--ud730
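The block runs 1x1, 3x3, and 5x5 convolutions (plus pooling) in parallel and concatenates the results along the channel axis. A minimal sketch, assuming PyTorch; the branch widths here are illustrative, not GoogLeNet's exact numbers:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Simplified Inception block: parallel branches, concatenated on channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, kernel_size=1)      # 1x1 branch
        self.b3 = nn.Sequential(                           # 1x1 reduction, then 3x3
            nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
            nn.Conv2d(96, 128, 3, padding=1))
        self.b5 = nn.Sequential(                           # 1x1 reduction, then 5x5
            nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
            nn.Conv2d(16, 32, 5, padding=2))
        self.bp = nn.Sequential(                           # pooling branch + 1x1
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

x = torch.randn(1, 192, 28, 28)
print(InceptionBlock(192)(x).shape)  # torch.Size([1, 256, 28, 28]): 64+128+32+32 channels
```

The 1x1 "reduction" convolutions keep the 3x3 and 5x5 branches cheap, which is what lets the block widen without exploding the FLOP count.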
DEPTH LIMIT NOT REACHED YET IN DEEP LEARNING
VGGNet – 19 layers, 19.6 billion FLOPs. Simonyan et al., 2014.
ResNet – 152 layers, 11.3 billion FLOPs. He et al., 2015.
Stochastic ResNet – 1200 layers. Huang et al., Deep Networks with Stochastic Depth, 2016.
Slide credit: Šulc et al., Very Deep Residual Networks with MaxOut for Plant Identification in the Wild, 2016.
RESNET: RESIDUAL BLOCK
He et al., Deep Residual Learning for Image Recognition, CVPR 2016.
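The block learns a residual F(x) and adds the input back: y = F(x) + x, so the identity path gives gradients a short route through very deep nets. A minimal sketch, assuming PyTorch (basic two-layer block with an identity shortcut, so input and output channels must match):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: two 3x3 convolutions plus an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)   # shortcut: add the input back
```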
RESIDUAL NETWORKS: HOT TOPIC
 Identity mapping https://arxiv.org/abs/1603.05027
 Wide ResNets https://arxiv.org/abs/1605.07146
 Stochastic depth https://arxiv.org/abs/1603.09382
 Residual Inception https://arxiv.org/abs/1602.07261
 ResNets + ELU http://arxiv.org/pdf/1604.04112.pdf
 ResNet in ResNet http://arxiv.org/pdf/1608.02908v1.pdf
 Densely Connected Nets (DC Nets) http://arxiv.org/abs/1608.06993
 Weighted ResNet http://arxiv.org/pdf/1605.08831v1.pdf
DATASETS USED IN PRESENTATION: IMAGENET AND CIFAR-10
CIFAR-10:
 50k training images (32x32 px)
 10k validation images
 10 classes
ImageNet:
 1.2M training images (~256x256 px)
 50k validation images
 1000 classes
Russakovsky et al., ImageNet Large Scale Visual Recognition Challenge, 2015.
Krizhevsky, Learning Multiple Layers of Features from Tiny Images, 2009.
CAFFENET ARCHITECTURE
Image credit: Hu et al., Transferring Deep Convolutional Neural Networks for the Scene Classification of High-Resolution Remote Sensing Imagery, 2015.
LIST OF HYPER-PARAMETERS TESTED
REFERENCE METHODS: IMAGE SIZE SENSITIVE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
CHOICE OF NON-LINEARITY
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NON-LINEARITIES ON CAFFENET
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
WEIGHT INITIALIZATION FOR A VERY DEEP NET
 Gaussian noise with fixed variance:
 var(w_L) = 0.01 (AlexNet, Krizhevsky et al., 2012)
 var(w_L) = 1/n_inputs (Glorot et al., 2010)
 var(w_L) = 2/n_inputs (He et al., 2015)
 Orthonormal (Saxe et al., 2013): Glorot → SVD → w_L = V
 Data-dependent: LSUV (Mishkin and Matas, 2016)
Mishkin and Matas. All you need is a good init. ICLR, 2016.
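A sketch of these rules with NumPy; fan_in and fan_out are illustrative layer sizes. For the orthonormal case the Glorot scaling cancels in the SVD, so a plain Gaussian start gives the same orthonormal factor:

```python
import numpy as np

fan_in, fan_out = 1024, 512

w_alexnet = np.random.randn(fan_in, fan_out) * np.sqrt(0.01)           # var = 0.01
w_glorot  = np.random.randn(fan_in, fan_out) * np.sqrt(1.0 / fan_in)   # var = 1/n_inputs
w_he      = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)   # var = 2/n_inputs

# Orthonormal (Saxe et al., 2013): Gaussian matrix -> SVD -> keep an
# orthonormal factor with the right shape.
g = np.random.randn(fan_in, fan_out)
u, _, vT = np.linalg.svd(g, full_matrices=False)
w_ortho = u if u.shape == g.shape else vT
```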
WEIGHT INITIALIZATION INFLUENCES ACTIVATIONS
a) Layer gain G_L < 1 → vanishing activation variance
b) Layer gain G_L > 1 → exploding activation variance
Mishkin and Matas. All you need is a good init. ICLR, 2016.

ACTIVATIONS INFLUENCE THE MAGNITUDE OF GRADIENT COMPONENTS
Mishkin and Matas. All you need is a good init. ICLR, 2016.
KEEPING THE PRODUCT OF PER-LAYER GAINS ~1:
LAYER-SEQUENTIAL UNIT-VARIANCE ORTHOGONAL INITIALIZATION

Algorithm 1. Layer-sequential unit-variance (LSUV) orthogonal initialization.
L – a convolutional or fully-connected layer, W_L – its weights, O_L – the layer output,
ε – variance tolerance, T_i – iteration number, T_max – max number of iterations.

Pre-initialize the network with orthonormal matrices as in Saxe et al. (2013)
for each convolutional and fully-connected layer L do
  repeat
    do a forward pass with a mini-batch
    calculate var(O_L)
    W_L := W_L / sqrt(var(O_L))
  until |var(O_L) − 1.0| < ε or T_i > T_max
end for

*The LSUV algorithm does not deal with biases and initializes them with zeros.
Mishkin and Matas. All you need is a good init. ICLR, 2016.
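A minimal sketch of the LSUV loop, assuming PyTorch; this is a simplified illustration, not the authors' released code (their implementations are linked two slides below):

```python
import torch
import torch.nn as nn

@torch.no_grad()
def lsuv_init(model, batch, tol=0.1, t_max=10):
    """Rescale each conv/FC layer's weights until its output variance is ~1."""
    for layer in model.modules():
        if not isinstance(layer, (nn.Conv2d, nn.Linear)):
            continue
        nn.init.orthogonal_(layer.weight)        # pre-init as in Saxe et al. (2013)
        if layer.bias is not None:
            layer.bias.zero_()                   # LSUV keeps biases at zero
        hook = layer.register_forward_hook(
            lambda mod, inp, out: setattr(mod, "_var", out.var().item()))
        for _ in range(t_max):                   # forward pass, then rescale
            model(batch)
            if abs(layer._var - 1.0) < tol:
                break
            layer.weight /= layer._var ** 0.5    # W := W / sqrt(var(O_L))
        hook.remove()
```

Called once before training with a single mini-batch of real data, e.g. `lsuv_init(net, images)`.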
COMPARISON OF THE INITIALIZATIONS FOR DIFFERENT ACTIVATIONS
 CIFAR-10 FitNet, accuracy [%]
 CIFAR-10 FitResNet, accuracy [%]
Mishkin and Matas. All you need is a good init. ICLR, 2016.
LSUV INITIALIZATION IMPLEMENTATIONS
Mishkin and Matas. All you need is a good init. ICLR, 2016
 Caffe https://github.com/ducha-aiki/LSUVinit
 Keras https://github.com/ducha-aiki/LSUV-keras
 Torch https://github.com/yobibyte/torch-lsuv
BATCH NORMALIZATION (AFTER EVERY CONVOLUTION LAYER)
Ioffe and Szegedy, ICML 2015.
BATCH NORMALIZATION: WHERE, BEFORE OR AFTER NON-LINEARITY?
Mishkin and Matas. All you need is a good init. ICLR, 2016.
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.

ImageNet, top-1 accuracy [%]:
Network              No BN   BN before ReLU   BN after ReLU
CaffeNet128-FC2048   47.1    47.8             49.9
GoogLeNet128         61.9    60.3             59.6

CIFAR-10, top-1 accuracy [%], FitNet4 network:
Non-linearity   BN before   BN after
TanH            88.1        89.2
ReLU            92.6        92.5
MaxOut          92.3        92.9

In short: better to test with your architecture and dataset :)
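The two placements compared in the tables, as a minimal PyTorch-style sketch:

```python
import torch.nn as nn

def conv_bn_relu(cin, cout):
    """BN before the non-linearity, as in Ioffe and Szegedy (2015)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True))

def conv_relu_bn(cin, cout):
    """BN after the non-linearity; better for CaffeNet in the table above."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1, bias=False),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(cout))
```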
BATCH NORMALIZATION SOMETIMES WORKS TOO WELL AND HIDES PROBLEMS
Case: the CNN has fewer outputs than there are classes in the dataset (just a typo): 26 vs. 28.
The batch-normalized net “learns well”; the plain CNN diverges.
NON-LINEARITIES ON CAFFENET, WITH BATCH NORMALIZATION
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NON-LINEARITIES: TAKE-AWAY MESSAGE
 Use ELU without batch normalization
 Or ReLU + BN
 Try maxout for the final layers
 Fallback solution (if something goes wrong) – ReLU
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
BUT IN A SMALL-DATA REGIME (~50K IMAGES), TRY LEAKY OR RANDOMIZED RELU

 Accuracy [%], Network in Network architecture:
            ReLU    VLReLU   RReLU   PReLU
CIFAR-10    87.55   88.80    88.81   88.20
CIFAR-100   57.10   59.60    59.80   58.40

 LogLoss (lower is better), Plankton VGG architecture:
        ReLU   VLReLU   RReLU   PReLU
KNDB    0.77   0.73     0.72    0.74

Xu et al., Empirical Evaluation of Rectified Activations in Convolutional Network, ICLR 2015.
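All four variants are available off the shelf; a sketch assuming PyTorch (VLReLU, "very leaky" ReLU, is just a leaky ReLU with a large slope such as 1/3):

```python
import torch.nn as nn

relu   = nn.ReLU()
vlrelu = nn.LeakyReLU(negative_slope=1.0 / 3)    # "very leaky" ReLU
rrelu  = nn.RReLU(lower=1.0 / 8, upper=1.0 / 3)  # slope randomized at train time
prelu  = nn.PReLU()                              # slope is a learned parameter
```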
INPUT IMAGE SIZE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
PADDING TYPES
Zero padding, stride = 2 · No padding, stride = 2 · Zero padding, stride = 1
Dumoulin and Visin. A guide to convolution arithmetic for deep learning. arXiv 2016.
PADDING
 Zero-padding:
 preserves spatial size, does not “wash out” information
 gives dropout-like augmentation with zeros
CaffeNet128:
 with conv padding: 47% top-1 accuracy
 w/o conv padding: 41% top-1 accuracy
The output-size arithmetic is sketched below.
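The effect of padding follows the standard convolution arithmetic o = floor((i + 2p − k) / s) + 1 for input size i, kernel k, padding p, stride s; a small sketch:

```python
def conv_out_size(i, k, p, s):
    """Spatial output size for input i, kernel k, padding p, stride s."""
    return (i + 2 * p - k) // s + 1

print(conv_out_size(227, 11, 0, 4))  # 55: CaffeNet conv1, no padding
print(conv_out_size(13, 3, 1, 1))    # 13: 3x3 conv with zero-padding keeps size
print(conv_out_size(13, 3, 0, 1))    # 11: without padding the map shrinks
```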
MAX POOLING: PADDING AND KERNEL
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
POOLING METHODS
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
LEARNING RATE POLICY
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
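The policies compared in the original figure are standard schedules; a sketch of two common ones (step and linear decay), with illustrative hyper-parameters; the exact policies and their ranking are in the paper:

```python
def lr_step(base_lr, epoch, step=25, gamma=0.1):
    """Step policy: drop the learning rate by 'gamma' every 'step' epochs."""
    return base_lr * gamma ** (epoch // step)

def lr_linear(base_lr, epoch, max_epoch):
    """Linear decay: anneal the learning rate to zero over training."""
    return base_lr * (1.0 - epoch / max_epoch)
```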
IMAGE PREPROCESSING
 Subtract the mean pixel (of the training set), divide by the std.
 RGB is the best (standard) colorspace for CNNs.
 Do nothing more…
 …unless you have a specific dataset: e.g., subtract the local mean pixel
(B. Graham, 2015, Kaggle Diabetic Retinopathy Competition report).
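A sketch of both options, assuming NumPy/SciPy and float images of shape HxWx3; the box blur here stands in for the Gaussian blur used in B. Graham's report:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def preprocess_global(img, mean_pixel, std_pixel):
    """Standard: subtract the training-set mean pixel, divide by its std."""
    return (img - mean_pixel) / std_pixel

def preprocess_local_mean(img, k=31):
    """Dataset-specific: subtract a local (blurred) mean, per channel."""
    local_mean = uniform_filter(img, size=(k, k, 1))  # box blur over HxW
    return img - local_mean
```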
IMAGE PREPROCESSING: WHAT DOESN’T WORK
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
IMAGE PREPROCESSING: LET’S LEARN THE COLORSPACE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
Image credit: https://www.udacity.com/course/deep-learning--ud730
DATASET QUALITY AND SIZE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
NETWORK WIDTH: SATURATION AND SPEED PROBLEM
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
BATCH SIZE AND LEARNING RATE
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
CLASSIFIER DESIGN
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
Ren et al., Object Detection Networks on Convolutional Feature Maps, arXiv 2016.
Take home: put the fully-connected layer just before the final layer, not earlier (sketched below).
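A sketch of this rule, assuming PyTorch: keep the body fully convolutional, pool, and place the fully-connected layer only right before the classifier:

```python
import torch.nn as nn

def make_head(feat_ch, hidden, n_classes):
    """Classifier head: pool -> FC -> final layer; no FC earlier in the net."""
    return nn.Sequential(
        nn.AdaptiveAvgPool2d(1),       # collapse the spatial dimensions
        nn.Flatten(),
        nn.Linear(feat_ch, hidden),    # fully-connected just before the final layer
        nn.ReLU(inplace=True),
        nn.Linear(hidden, n_classes))  # final classifier
```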
APPLYING IT ALL TOGETHER
Mishkin et al., Systematic evaluation of CNN advances on the ImageNet, arXiv 2016.
>5 pp. additional top-1 accuracy for free.
THANK YOU FOR YOUR ATTENTION
 Any questions?
 All logs, graphs, and network definitions: https://github.com/ducha-aiki/caffenet-benchmark
Feel free to add your tests :)
 The paper is here: https://arxiv.org/abs/1606.02228
ducha.aiki@gmail.com
mishkdmy@cmp.felk.cvut.cz
ARCHITECTURE
 Use filters as small as possible:
 3x3 + ReLU + 3x3 + ReLU > 5x5 + ReLU
 3x1 + 1x3 > 3x3
 2x2 + 2x2 > 3x3
 Exception: the 1st layer; using 3x3 there is too computationally inefficient.
A parameter-count sketch follows below.
Convolutional Neural Networks at Constrained Time Cost. He and Sun, CVPR 2015.
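Why stacking small filters wins: two 3x3 convolutions cover the same 5x5 receptive field with fewer parameters and one extra non-linearity. A quick count for C input and output channels (biases ignored):

```python
C = 64
params_5x5 = 5 * 5 * C * C               # one 5x5 layer: 102,400 weights
params_3x3_stack = 2 * (3 * 3 * C * C)   # two 3x3 layers: 73,728 weights (~28% fewer)
print(params_5x5, params_3x3_stack)
```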
CAFFENET TRAINING
Mishkin and Matas. All you need is a good init. ICLR, 2016.
GOOGLENET TRAINING
Mishkin and Matas. All you need is a good init. ICLR, 2016.