Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 36: CNN Architecture
Concepts Covered:
 CNN
 CNN Architecture
 Convolution Layer
 Receptive Field
 Nonlinearity
 Pooling
Convolution

1 D Convolution:
$$y(n) = \sum_{p=0}^{\infty} x(p)\, h(n-p)$$

Continuous Convolution:
$$y(t) = \int_{0}^{\infty} x(\tau)\, h(t-\tau)\, d\tau$$

2 D Convolution:
$$y(m,n) = \sum_{p=0}^{\infty} \sum_{q=0}^{\infty} x(p,q)\, h(m-p,\ n-q)$$
Finite Convolution Kernel
A feature at a point is local in nature.
Convolution Kernel

1 D (kernel size $2A+1$):
$$y(n) = \sum_{p=-A}^{A} w(p)\, x(n-p)$$

2 D (kernel size $(2A+1)\times(2A+1)$):
$$y(m,n) = \sum_{p=-A}^{A} \sum_{q=-A}^{A} w(p,q)\, x(m-p,\ n-q)$$
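As an illustration of the finite 1-D convolution above, here is a minimal NumPy sketch; the signal and kernel values are made up for the example, and the kernel array is assumed to be stored as [w(-A), ..., w(0), ..., w(A)]:

```python
import numpy as np

def conv1d_finite(x, w, A):
    """y(n) = sum_{p=-A}^{A} w(p) x(n-p), with zeros assumed outside x."""
    x_pad = np.pad(x, A)                       # zero padding on both sides
    y = np.zeros(len(x))
    for n in range(len(x)):
        for p in range(-A, A + 1):
            y[n] += w[p + A] * x_pad[(n - p) + A]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])                # A = 1, a simple smoothing kernel
print(conv1d_finite(x, w, A=1))
# For this index convention the result matches np.convolve(x, w, mode='same')
```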
Finite Convolution Kernel

Input (zero-padded): 0 0 X(0) X(1) X(2) X(3) … X(n-2) X(n-1) X(n) X(n+1) X(n+2) …
Kernel: W(2) W(1) W(0) W(-1) W(-2)

The kernel slides over the padded input one sample at a time; each position produces one output value, giving Y(0), Y(1), Y(2), Y(3), …, Y(n-1), Y(n), Y(n+1), …
2 D Convolution

6 x 6 Image, 3 x 3 Kernel
Zero padding; the (flipped) kernel slides over the entire padded image, producing one output value at each position.
Stride
Number of steps by which the kernel is moved during convolution.
7 x 7 Input Image, 3 x 3 Kernel: Stride = 1 gives a 5 x 5 output; Stride = 2 gives a 3 x 3 output (see the sketch below).
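A quick sketch of how the stride affects the output size, assuming a square input of size N, a kernel of size K, stride S and no padding; the 7 x 7 / 3 x 3 numbers from the slide are used as the example:

```python
def conv_output_size(n, k, stride, padding=0):
    """Output width of a convolution: floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 7 x 7 input, 3 x 3 kernel
print(conv_output_size(7, 3, stride=1))  # 5 -> 5 x 5 feature map
print(conv_output_size(7, 3, stride=2))  # 3 -> 3 x 3 feature map
```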
CNN Architecture
Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class
Convolution Layer: 3D Convolution
• A color image has 3 dimensions: height, width and depth (depth is the color channels, i.e., RGB).
• Filters or kernels that are convolved with the RGB image can also be 3-D.
• For multiple kernels: all feature maps obtained from the distinct kernels are stacked to get the final output of that layer.
• The kernel strides over the input image.
• At each location, the weighted sum of the kernel with the underlying image patch is computed, and the results are collected in the feature map.
• The animation shows the sliding operation at 4 locations, but in reality it is performed over the entire input.
Animation:- Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
For an input image $I(m,n)$ and kernel $w(p,q)$, the feature map is
$$f(m,n) = \sum_{p} \sum_{q} w(p,q)\, I(m-p,\ n-q)$$

3D Convolution: Visualization
• The red and green boxes are two different feature maps obtained by convolving the same input with two different kernels. The feature maps are stacked along the depth dimension as shown.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
3D Convolution: Visualization
• An RGB image of size 32×32×3
• 10 kernels of size 5×5×3
• Output feature map of size 32×32×10
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
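A minimal NumPy sketch of the shape bookkeeping described above: a 32×32×3 input convolved with 10 kernels of size 5×5×3 produces a 32×32×10 stack of feature maps. Zero padding of 2 is assumed so that the spatial size is preserved (as the slide's 32×32 output implies), and, as in CNN practice, the kernel is applied without flipping; the naive loops are purely for illustration:

```python
import numpy as np

def conv3d_multi_kernel(image, kernels):
    """image: (H, W, C); kernels: (K, k, k, C). Returns feature maps of shape (H, W, K)."""
    H, W, C = image.shape
    K, k, _, _ = kernels.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, K))
    for i in range(K):                           # one feature map per kernel
        for y in range(H):
            for x in range(W):
                patch = padded[y:y + k, x:x + k, :]
                out[y, x, i] = np.sum(patch * kernels[i])
    return out

image = np.random.rand(32, 32, 3)                # RGB image
kernels = np.random.rand(10, 5, 5, 3)            # 10 kernels of size 5x5x3
print(conv3d_multi_kernel(image, kernels).shape)  # (32, 32, 10)
```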
Nonlinearity
• ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
Pooling
• Replaces the output of a node at certain locations with a summary statistic of nearby locations.
• Spatial pooling can be of different types: max, average, sum, etc.
• Max pooling reports the maximum output within a rectangular neighborhood.
• Pooling helps make the output approximately invariant to small translations.
• Pooling layers downsample each feature map independently, reducing the height and width while keeping the depth intact.
• In a pooling layer, the stride and window size need to be specified.
Pooling
• The figure below shows the result of max pooling with a 2×2 window and stride 2. Each color denotes a different window. Since both the window size and the stride are 2, the windows do not overlap.

Input:
3 2 5 6
8 9 5 3
4 4 6 8
1 1 2 1

Max pool with 2×2 window, stride = 2:
9 6
4 8
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
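A small NumPy sketch of 2×2 max pooling with stride 2, reproducing the 4×4 example above:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D feature map."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[3, 2, 5, 6],
              [8, 9, 5, 3],
              [4, 4, 6, 8],
              [1, 1, 2, 1]])
print(max_pool2d(x))
# [[9. 6.]
#  [4. 8.]]
```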
Pooling
• Pooling reduces the height and the width of the feature map, but the depth remains unchanged, as shown in the figure.
• The pooling operation is carried out independently across each depth slice.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
CNN Architecture
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 37: Popular CNN Models
Concepts Covered:
 CNN
 LeNet
 AlexNet
 VGG Net
 GoogLeNet
 etc.
CNN Architecture
Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class
CNN Architecture

MLP vs CNN
 Sparse Connectivity: Every node in the convolution layer receives input from only a small number of nodes in the previous layer (its receptive field), so fewer parameters are needed.
 Parameter Sharing: Each member of the convolution kernel is used at every position of the input, dramatically reducing the number of parameters.
 This makes a CNN much more efficient than an MLP.
Some Popular CNN Models

LeNet

LeNet-5
• Proposed by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character recognition.
• Used by many banks for recognition of handwritten numbers on cheques.
• This architecture achieves an error rate as low as 0.95% on test data.
Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", Proc. IEEE, Nov. 1998.
LeNet-5
No. of Kernels: 6
Kernel Size: 5 x 5
Stride: 1

LeNet-5
Average Pooling
Window Size: 2 x 2
Stride: 2

LeNet-5
No. of Kernels: 16
Kernel Size: 5 x 5
Stride: 1
 Break the symmetry in the network
 Keep the number of connections within reasonable bounds.

LeNet-5
Average Pooling
Window Size: 2 x 2
Stride: 2
LeNet-5: Summary
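A compact PyTorch sketch of a LeNet-5-style network, following the layer sizes listed on the previous slides (6 and 16 kernels of size 5x5 with stride 1, 2x2 average pooling with stride 2); details of the original paper such as the exact activations, the C3 connection table and the RBF output layer are simplified here:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),          # C1: 6 kernels, 5x5 -> 28x28x6 for a 32x32 input
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # S2: 2x2 average pooling -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),         # C3: 16 kernels, 5x5 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # S4: 2x2 average pooling -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```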
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
https://engmrk.com/lenet-5-a-classic-cnn-architecture/
ILSVRC
• ImageNet Large Scale Visual Recognition Challenge.
• Evaluates algorithms for object detection and image classification on a large image database.
• Helps researchers review state-of-the-art machine learning techniques for object detection across a wider variety of objects.
• Monitors the progress of computer vision for large-scale image indexing for retrieval and annotation.
• The database contains a large number of images from 1000 categories.
• More than 1000 images in every category.
ILSVRC
• Every year of the challenge, the forum also organizes a workshop at one of the premier computer vision conferences.
• The purpose of the workshop is to disseminate the new findings of the challenge.
• Contestants with the most successful and innovative techniques are invited to present their work.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 38: Popular CNN Models II
Concepts Covered:
 CNN
 LeNet
 ILSVRC
 AlexNet
 VGG Net
 GoogLeNet
 etc.
AlexNet
ILSVRC 2012 Winner
Krizhevsky, Alex, Ilya Sutskever and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems, 2012.
Sample Images from ImageNet
Dataset
https://www.learnopencv.com/understanding-alexnet/
AlexNet
ILSVRC 2012 Winner
https://www.learnopencv.com/understanding-alexnet/
AlexNet
 60 million parameters and 650,000 neurons.
 The network is split into two pipelines and was trained on two GPUs.
 Input image size: 256 x 256 RGB.
 Grey-scale images are replicated to obtain 3-channel RGB.
 Random crops of size 227 x 227 are fed to the input layer of AlexNet.
 Optimizer: Stochastic Gradient Descent with Momentum.
 Top-5 error rate: 15.3%.
Vanishing Gradient Problem
 Uses ReLU activation instead of a sigmoidal function.
 The ReLU output is unbounded, so AlexNet uses Local Response Normalization (LRN).
 LRN carries out a normalization that amplifies the excited neuron while dampening the surrounding neurons in a local neighbourhood.
 Encourages lateral inhibition: a concept in neurobiology that refers to the capacity of a neuron to reduce the activity of its neighbours.
Local Response Normalization (Inter-Channel)

$$b_{x,y}^{i} = a_{x,y}^{i} \Bigg/ \left(k + \alpha \sum_{j=\max(0,\ i-n/2)}^{\min(N-1,\ i+n/2)} \left(a_{x,y}^{j}\right)^{2}\right)^{\beta}$$
https://towardsdatascience.com/difference-between-local-
response-normalization-and-batch-normalization-
272308c034ac
Local Response Normalization
Local Response Normalization (Intra-Channel)

$$b_{x,y}^{i} = a_{x,y}^{i} \Bigg/ \left(k + \alpha \sum_{p=\max(0,\ x-n/2)}^{\min(W,\ x+n/2)} \ \sum_{q=\max(0,\ y-n/2)}^{\min(H,\ y+n/2)} \left(a_{p,q}^{i}\right)^{2}\right)^{\beta}$$
Local Response Normalization
https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac
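A minimal NumPy sketch of the inter-channel LRN formula above; `a` is assumed to have shape (N, H, W) with N feature maps, and the default hyper-parameter values (k, alpha, beta, n) are the ones commonly quoted for AlexNet:

```python
import numpy as np

def lrn_inter_channel(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """b[i] = a[i] / (k + alpha * sum over nearby channels j of a[j]**2) ** beta."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(16, 8, 8)        # 16 feature maps of size 8x8
print(lrn_inter_channel(a).shape)   # (16, 8, 8)
```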
Reducing Overfitting
 Training the network with different variants of the same image helps avoid overfitting.
 Generate additional data from existing data (augmentation); see the sketch below.
 Data augmentation by mirroring.
 Data augmentation by random crops.
 Dropout regularization.
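A small NumPy sketch of the two augmentation schemes mentioned above (mirroring and random crops); the 256 x 256 input and 227 x 227 crop size follow the AlexNet numbers quoted earlier, and the random image is just a stand-in:

```python
import numpy as np

def random_augment(image, crop=227):
    """image: (H, W, 3) array. Returns a randomly mirrored random crop."""
    h, w, _ = image.shape
    # Data augmentation by mirroring (horizontal flip with probability 0.5)
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]
    # Data augmentation by random crops
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

img = np.random.rand(256, 256, 3)      # stand-in for a 256 x 256 RGB training image
print(random_augment(img).shape)       # (227, 227, 3)
```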
Dropout
Srivastava, Nitish, et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014), 1929-1958.
 Regularization technique proposed by Srivastava et al. in 2014.
 During training, randomly selected neurons are temporarily dropped from the network (with probability 0.5).
 Their activations are not passed to the downstream neurons in the forward pass.
 In the backward pass, weight updates are not applied to these neurons.
https://www.learnopencv.com/understanding-alexnet/
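A minimal NumPy sketch of the dropout mechanics described above: during training each activation is kept with probability p = 0.5; at test time the whole network is used and the activations are scaled by p so their expected value matches training (the "inverted" variant instead rescales during training):

```python
import numpy as np

def dropout_forward(activations, p_keep=0.5, train=True):
    """Classic dropout: randomly zero activations while training, scale at test time."""
    if train:
        mask = np.random.rand(*activations.shape) < p_keep
        return activations * mask          # dropped units pass nothing downstream
    return activations * p_keep            # test: full network, scaled outputs

h = np.random.rand(4, 8)                   # a batch of hidden-layer activations
print(dropout_forward(h, train=True))      # roughly half the entries are zero
print(dropout_forward(h, train=False).shape)
```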
Dropout: How does it help?
 While training, the weights of neurons are tuned for specific features, providing some sort of specialization.
 Neighbouring neurons start relying on these specializations (co-adaptation).
 This leads to a neural network model that is too specialized to the training data.
 As neurons are randomly dropped, other neurons have to step in to compensate.
 Thus the network learns multiple independent representations.
Learned Features

How does it help?
 This makes the network less sensitive to specific weights.
 Enhances the generalization capability of the network.
 Less vulnerable to overfitting.
 The whole network is used during testing; there is no dropout.
 Dropout increases the number of iterations needed for the network to converge.
 But it helps avoid overfitting.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 39: Popular CNN Models III
Concepts Covered:
 CNN
 AlexNet
 VGG Net
 Transfer Learning
 GoogLeNet
 ResNet
 etc.
VGG 16
ILSVRC 2014 1st Runner-Up
Visual Geometry Group, Oxford University
VGG 16
Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
VGG 16
 Inputs to the architecture are color images of size 224x224.
 The image is passed through a stack of convolutional layers.
 Every convolution filter has a very small receptive field: 3×3, stride 1.
 Row and column padding is used to maintain spatial resolution after convolution.
 There are 13 convolution layers.
 There are 5 max-pool layers.
 Max pooling window size is 2x2, stride 2.
VGG 16
 Not every convolution layer is followed by a max-pool layer.
 There are 3 fully connected (FC) layers.
 The first two FC layers have 4096 channels each.
 The last FC layer has 1000 channels.
 The last layer is a softmax layer with 1000 channels, one for each category of images in the ImageNet database.
 Hidden layers use ReLU as the activation function.
VGG 16
Striking differences from AlexNet
 All convolution kernels are of size 3x3 with stride 1.
 All max-pool kernels are of size 2x2 with stride 2.
 The variable-size kernels used in AlexNet can be realised using multiple stacked 3x3 kernels.
 This realisation is in terms of the size of the receptive field covered by the kernels (see the sketch below).
 Top-5 error rate ~ 7%.
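A small sketch of the receptive-field argument above: a stack of k convolution layers of size 3x3 with stride 1 covers a receptive field of 2k + 1, so two stacked layers cover the same field as a single 5x5 kernel and three cover the same field as a 7x7 kernel:

```python
def receptive_field(num_3x3_layers):
    """Receptive field of a stack of 3x3, stride-1 convolutions."""
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2          # each 3x3 layer extends the field by (3 - 1)
    return rf

for k in (1, 2, 3):
    print(k, "x (3x3) ->", receptive_field(k), "x", receptive_field(k))
# 1 x (3x3) -> 3 x 3
# 2 x (3x3) -> 5 x 5
# 3 x (3x3) -> 7 x 7
```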
Transfer Learning
Transfer Learning
Kevin McGuinness
https://www.slideshare.net/xavigiro/transfer-learning-d2l4-insightdcu-machine-learning-workshop-2017
Transfer Learning
CNN as Fixed Feature Extractor (a sketch follows below):
 Take a pre-trained CNN architecture trained on a large dataset (like ImageNet).
 Remove the last fully connected layer of this pre-trained network.
 The remaining CNN acts as a fixed feature extractor for the new dataset.
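A hedged PyTorch/torchvision sketch of the recipe above. It assumes a torchvision installation where `models.vgg16(pretrained=True)` is available (newer versions use a `weights` argument instead), and the new dataset's class count of 10 is just an example:

```python
import torch
import torch.nn as nn
from torchvision import models

# Take a CNN pre-trained on a large dataset (ImageNet)
backbone = models.vgg16(pretrained=True)

# Freeze it so it acts as a fixed feature extractor
for param in backbone.parameters():
    param.requires_grad = False

# Replace the last fully connected layer with a new head for the target task
num_features = backbone.classifier[-1].in_features       # 4096 for VGG-16
backbone.classifier[-1] = nn.Linear(num_features, 10)    # trained from scratch

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.SGD(backbone.classifier[-1].parameters(), lr=1e-3, momentum=0.9)
print(backbone(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 10])
```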
Transfer Learning
(Figure slides; images not reproduced.)
Image Source: https://becominghuman.ai/what-exactly-does-cnn-see-4d436d8e6e52
Transfer Learning
 Lower layers generate more general features: the knowledge transfers very well to other tasks.
 Higher layers are more task-specific.
 Fine-tuning improves generalization when sufficient examples are available.
 Transfer learning and fine-tuning often lead to better performance than training from scratch on the target dataset.
 Even features transferred from distant tasks often perform better than random initial weights.
Fine-tuning
 The weights of the pre-trained CNN are fine-tuned for the new dataset by continuing the back-propagation.
 Fine-tuning can be done for all layers.
 Due to overfitting concerns, the earlier layers of the net may be kept fixed and fine-tuning done only on the higher layers (see the sketch below).
 Earlier layers can be fixed because lower layers extract features that are more generic.
 Higher layers, on the other hand, are task-specific.
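Continuing the torchvision sketch above, fine-tuning might keep the earlier (more generic) convolutional layers fixed and continue back-propagation only through the later layers. The split point chosen here (all but the last five feature layers frozen) and the 10-class head are arbitrary illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 10)  # new task head

# Fix the earlier layers (generic features); fine-tune the last conv layers + classifier
for layer in list(model.features)[:-5]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4, momentum=0.9)
print(sum(p.numel() for p in trainable), "parameters will be fine-tuned")
```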
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 40: Popular CNN Models IV
Concepts Covered:
 CNN
 AlexNet
 VGG Net
 Transfer Learning
 Challenges in Deep Learning
 GoogLeNet
 ResNet
 etc.
Deep Learning Challenges

Challenges
 Deep learning is data hungry.
 Overfitting or lack of generalization.
 Vanishing/exploding gradient problem.
 Appropriate learning rate.
 Covariate shift.
 Effective training.
Vanishing Gradient

Vanishing Gradient Problem
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
Vanishing Gradient Problem

Consider a 4-layer network with activation functions $f_1, f_2, f_3, f_4$, weights $W_1, W_2, W_3, W_4$, input $X$ and output $O$:

$$O = f_4\!\left(W_4\, f_3\!\left(W_3\, f_2\!\left(W_2\, f_1(W_1 X)\right)\right)\right)$$

Writing the pre-activations as $\theta_1 = W_1 X$, $\theta_2 = W_2 f_1(\theta_1)$, $\theta_3 = W_3 f_2(\theta_2)$, $\theta_4 = W_4 f_3(\theta_3)$, so that $O = f_4(\theta_4)$, the chain rule gives

$$\frac{\partial O}{\partial W_1}
= \frac{\partial O}{\partial \theta_4}\cdot\frac{\partial \theta_4}{\partial \theta_3}\cdot\frac{\partial \theta_3}{\partial \theta_2}\cdot\frac{\partial \theta_2}{\partial \theta_1}\cdot\frac{\partial \theta_1}{\partial W_1}
= f_4'(\theta_4)\, W_4\, f_3'(\theta_3)\, W_3\, f_2'(\theta_2)\, W_2\, f_1'(\theta_1)\, X$$

$$\frac{\partial O}{\partial W_2}
= \frac{\partial O}{\partial \theta_4}\cdot\frac{\partial \theta_4}{\partial \theta_3}\cdot\frac{\partial \theta_3}{\partial \theta_2}\cdot\frac{\partial \theta_2}{\partial W_2}
= f_4'(\theta_4)\, W_4\, f_3'(\theta_3)\, W_3\, f_2'(\theta_2)\, f_1(\theta_1)$$

The gradient reaching an early layer is therefore a product of many factors of the form $W_i\, f_i'(\theta_i)$; when these factors are small, the product shrinks rapidly with depth and the gradient vanishes (a numeric sketch follows below).
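A small numeric illustration of this product, assuming scalar weights and sigmoid activations (for which $f_i' \le 0.25$); the depth and weight scale are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
depth = 20
weights = np.random.randn(depth) * 0.5      # illustrative scalar weights

# Forward pass through a chain of scalar sigmoid layers
x = 1.0
thetas = []
a = x
for w in weights:
    theta = w * a
    thetas.append(theta)
    a = sigmoid(theta)

# Backward pass: gradient w.r.t. the first weight is a product of many small factors
grad = x * sigmoid(thetas[0]) * (1 - sigmoid(thetas[0]))       # d(theta_1)/d(W_1) * f_1'(theta_1)
for w, theta in zip(weights[1:], thetas[1:]):
    grad *= w * sigmoid(theta) * (1 - sigmoid(theta))          # each layer contributes W_i * f_i'(theta_i)

print("gradient of the output w.r.t. the first weight:", grad)
# With sigmoid, each f' <= 0.25, so after ~20 layers the product is vanishingly small.
```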
Vanishing Gradient Problem
 Choice of activation function: ReLU instead of sigmoid.
 Appropriate initialization of weights.
 Intelligent back-propagation learning algorithm.