Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 36: CNN Architecture
Concepts Covered:
 CNN
 CNN Architecture
 Convolution Layer
 Receptive Field
 Nonlinearity
 Pooling
Convolution

1 D Convolution:
$$y(n) = \sum_{p=0}^{\infty} x(p)\, h(n-p)$$

Continuous Convolution:
$$y(t) = \int_{0}^{\infty} x(\tau)\, h(t-\tau)\, d\tau$$

2 D Convolution:
$$y(m,n) = \sum_{p=0}^{\infty} \sum_{q=0}^{\infty} x(p,q)\, h(m-p,\ n-q)$$
Finite Convolution Kernel
A feature at a point is local in nature.
Convolution Kernel

1 D (kernel size $2A+1$):
$$y(n) = \sum_{p=-A}^{A} w(p)\, x(n-p)$$

2 D (kernel size $(2A+1)\times(2A+1)$):
$$y(m,n) = \sum_{p=-A}^{A} \sum_{q=-A}^{A} w(p,q)\, x(m-p,\ n-q)$$
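As an illustration of the finite 1-D convolution above, here is a minimal NumPy sketch; the signal and kernel values are made up for the example, and the kernel array is assumed to be stored as [w(-A), ..., w(0), ..., w(A)]:

```python
import numpy as np

def conv1d_finite(x, w, A):
    """y(n) = sum_{p=-A}^{A} w(p) x(n-p), with zeros assumed outside x."""
    x_pad = np.pad(x, A)                       # zero padding on both sides
    y = np.zeros(len(x))
    for n in range(len(x)):
        for p in range(-A, A + 1):
            y[n] += w[p + A] * x_pad[(n - p) + A]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.25, 0.5, 0.25])                # A = 1, a simple smoothing kernel
print(conv1d_finite(x, w, A=1))
# For this index convention the result matches np.convolve(x, w, mode='same')
```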
Finite Convolution Kernel

Input (zero-padded): 0 0 X(0) X(1) X(2) X(3) … X(n-2) X(n-1) X(n) X(n+1) X(n+2) …
Kernel: W(2) W(1) W(0) W(-1) W(-2)

The kernel slides over the padded input one sample at a time; each position produces one output value, giving Y(0), Y(1), Y(2), Y(3), …, Y(n-1), Y(n), Y(n+1), …
2 D Convolution

6 x 6 Image, 3 x 3 Kernel
Zero padding; the (flipped) kernel slides over the entire padded image, producing one output value at each position.
Stride
Number of steps by which the kernel is moved during convolution.
7 x 7 Input Image, 3 x 3 Kernel: Stride = 1 gives a 5 x 5 output; Stride = 2 gives a 3 x 3 output (see the sketch below).
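A quick sketch of how the stride affects the output size, assuming a square input of size N, a kernel of size K, stride S and no padding; the 7 x 7 / 3 x 3 numbers from the slide are used as the example:

```python
def conv_output_size(n, k, stride, padding=0):
    """Output width of a convolution: floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

# 7 x 7 input, 3 x 3 kernel
print(conv_output_size(7, 3, stride=1))  # 5 -> 5 x 5 feature map
print(conv_output_size(7, 3, stride=2))  # 3 -> 3 x 3 feature map
```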
CNN Architecture
Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class
Convolution Layer: 3D Convolution
• A color image has 3 dimensions: height, width and depth (depth is the color channels, i.e., RGB).
• Filters or kernels that are convolved with the RGB image can also be 3-D.
• For multiple kernels: all feature maps obtained from the distinct kernels are stacked to get the final output of that layer.
• The kernel strides over the input image.
• At each location, the weighted sum of the kernel with the underlying image patch is computed, and the results are collected in the feature map.
• The animation shows the sliding operation at 4 locations, but in reality it is performed over the entire input.
Animation:- Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
For an input image $I(m,n)$ and kernel $w(p,q)$, the feature map is
$$f(m,n) = \sum_{p} \sum_{q} w(p,q)\, I(m-p,\ n-q)$$

3D Convolution: Visualization
• The red and green boxes are two different feature maps obtained by convolving the same input with two different kernels. The feature maps are stacked along the depth dimension as shown.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
3D Convolution: Visualization
• An RGB image of size 32×32×3
• 10 kernels of size 5×5×3
• Output feature map of size 32×32×10
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
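A minimal NumPy sketch of the shape bookkeeping described above: a 32×32×3 input convolved with 10 kernels of size 5×5×3 produces a 32×32×10 stack of feature maps. Zero padding of 2 is assumed so that the spatial size is preserved (as the slide's 32×32 output implies), and, as in CNN practice, the kernel is applied without flipping; the naive loops are purely for illustration:

```python
import numpy as np

def conv3d_multi_kernel(image, kernels):
    """image: (H, W, C); kernels: (K, k, k, C). Returns feature maps of shape (H, W, K)."""
    H, W, C = image.shape
    K, k, _, _ = kernels.shape
    pad = k // 2
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((H, W, K))
    for i in range(K):                           # one feature map per kernel
        for y in range(H):
            for x in range(W):
                patch = padded[y:y + k, x:x + k, :]
                out[y, x, i] = np.sum(patch * kernels[i])
    return out

image = np.random.rand(32, 32, 3)                # RGB image
kernels = np.random.rand(10, 5, 5, 3)            # 10 kernels of size 5x5x3
print(conv3d_multi_kernel(image, kernels).shape)  # (32, 32, 10)
```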
Nonlinearity
• ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
Pooling
• Replaces the output of a node at certain locations with a summary statistic of nearby locations.
• Spatial pooling can be of different types: max, average, sum, etc.
• Max pooling reports the maximum output within a rectangular neighborhood.
• Pooling helps make the output approximately invariant to small translations.
• Pooling layers downsample each feature map independently, reducing the height and width while keeping the depth intact.
• In a pooling layer, the stride and window size need to be specified.
Pooling
• The figure below shows the result of max pooling with a 2×2 window and stride 2. Each color denotes a different window. Since both the window size and the stride are 2, the windows do not overlap.

Input:
3 2 5 6
8 9 5 3
4 4 6 8
1 1 2 1

Max pool with 2×2 window, stride = 2:
9 6
4 8
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
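A small NumPy sketch of 2×2 max pooling with stride 2, reproducing the 4×4 example above:

```python
import numpy as np

def max_pool2d(x, size=2, stride=2):
    """Max pooling over a 2-D feature map."""
    h_out = (x.shape[0] - size) // stride + 1
    w_out = (x.shape[1] - size) // stride + 1
    out = np.zeros((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = window.max()
    return out

x = np.array([[3, 2, 5, 6],
              [8, 9, 5, 3],
              [4, 4, 6, 8],
              [1, 1, 2, 1]])
print(max_pool2d(x))
# [[9. 6.]
#  [4. 8.]]
```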
Pooling
• Pooling reduces the height and the width of the feature map, but the depth remains unchanged, as shown in the figure.
• The pooling operation is carried out independently across each depth slice.
Figure: Arden Dertat
https://towardsdatascience.com/applied-deep-learning-
part-4-convolutional-neural-networks-584bc134c1e2
CNN Architecture
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 37: Popular CNN Models
Concepts Covered:
 CNN
 LeNet
 AlexNet
 VGG Net
 GoogLeNet
 etc.
CNN Architecture
Image → Convolution → Nonlinearity → Pooling → Convolution → Nonlinearity → Pooling → Fully Connected Layer → Class
CNN Architecture

MLP vs CNN
 Sparse Connectivity: Every node in the convolution layer receives input from only a small number of nodes in the previous layer (its receptive field), so fewer parameters are needed.
 Parameter Sharing: Each member of the convolution kernel is used at every position of the input, dramatically reducing the number of parameters.
 This makes a CNN much more efficient than an MLP.
Some Popular CNN Models

LeNet

LeNet-5
• Proposed by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner for handwritten and machine-printed character recognition.
• Used by many banks for recognition of handwritten numbers on cheques.
• This architecture achieves an error rate as low as 0.95% on test data.
Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner, "Gradient-Based Learning Applied to Document Recognition", Proc. IEEE, Nov. 1998.
LeNet-5
No. of Kernels: 6
Kernel Size: 5 x 5
Stride: 1

LeNet-5
Average Pooling
Window Size: 2 x 2
Stride: 2

LeNet-5
No. of Kernels: 16
Kernel Size: 5 x 5
Stride: 1
 Break the symmetry in the network
 Keep the number of connections within reasonable bounds.

LeNet-5
Average Pooling
Window Size: 2 x 2
Stride: 2
LeNet-5: Summary
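A compact PyTorch sketch of a LeNet-5-style network, following the layer sizes listed on the previous slides (6 and 16 kernels of size 5x5 with stride 1, 2x2 average pooling with stride 2); details of the original paper such as the exact activations, the C3 connection table and the RBF output layer are simplified here:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),          # C1: 6 kernels, 5x5 -> 28x28x6 for a 32x32 input
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # S2: 2x2 average pooling -> 14x14x6
            nn.Conv2d(6, 16, kernel_size=5),         # C3: 16 kernels, 5x5 -> 10x10x16
            nn.Tanh(),
            nn.AvgPool2d(kernel_size=2, stride=2),   # S4: 2x2 average pooling -> 5x5x16
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5()
print(model(torch.randn(1, 1, 32, 32)).shape)   # torch.Size([1, 10])
```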
ImageNet Large Scale Visual Recognition Challenge (ILSVRC)
https://engmrk.com/lenet-5-a-classic-cnn-architecture/
ILSVRC
• ImageNet Large Scale Visual Recognition Challenge.
• Evaluates algorithms for object detection and image classification on a large image database.
• Helps researchers review state-of-the-art machine learning techniques for object detection across a wider variety of objects.
• Monitors the progress of computer vision for large-scale image indexing for retrieval and annotation.
• The database contains a large number of images from 1000 categories.
• More than 1000 images in every category.
ILSVRC
• Every year of the challenge, the forum also organizes a workshop at one of the premier computer vision conferences.
• The purpose of the workshop is to disseminate the new findings of the challenge.
• Contestants with the most successful and innovative techniques are invited to present their work.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 38: Popular CNN Models II
Concepts Covered:
 CNN
 LeNet
 ILSVRC
 AlexNet
 VGG Net
 GoogLeNet
 etc.
AlexNet
ILSVRC 2012 Winner
Krizhevsky, Alex, Ilya Sutskever and Geoffrey E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks", Advances in Neural Information Processing Systems, 2012.
Sample Images from ImageNet
Dataset
https://www.learnopencv.com/understanding-alexnet/
AlexNet
ILSVRC 2012 Winner
https://www.learnopencv.com/understanding-alexnet/
AlexNet
 60 million parameters and 650,000 neurons.
 The network is split into two pipelines and was trained on two GPUs.
 Input image size: 256 x 256 RGB.
 Grey-scale images are replicated to obtain 3-channel RGB.
 Random crops of size 227 x 227 are fed to the input layer of AlexNet.
 Optimizer: Stochastic Gradient Descent with Momentum.
 Top-5 error rate: 15.3%.
Vanishing Gradient Problem
 Uses ReLU activation instead of a sigmoidal function.
 The ReLU output is unbounded, so AlexNet uses Local Response Normalization (LRN).
 LRN carries out a normalization that amplifies the excited neuron while dampening the surrounding neurons in a local neighbourhood.
 Encourages lateral inhibition: a concept in neurobiology that refers to the capacity of a neuron to reduce the activity of its neighbours.
Local Response Normalization (Inter-Channel)

$$b_{x,y}^{i} = a_{x,y}^{i} \Bigg/ \left(k + \alpha \sum_{j=\max(0,\ i-n/2)}^{\min(N-1,\ i+n/2)} \left(a_{x,y}^{j}\right)^{2}\right)^{\beta}$$
https://towardsdatascience.com/difference-between-local-
response-normalization-and-batch-normalization-
272308c034ac
Local Response Normalization
Local Response Normalization (Intra-Channel)

$$b_{x,y}^{i} = a_{x,y}^{i} \Bigg/ \left(k + \alpha \sum_{p=\max(0,\ x-n/2)}^{\min(W,\ x+n/2)} \ \sum_{q=\max(0,\ y-n/2)}^{\min(H,\ y+n/2)} \left(a_{p,q}^{i}\right)^{2}\right)^{\beta}$$
Local Response Normalization
https://towardsdatascience.com/difference-between-local-response-normalization-and-batch-normalization-272308c034ac
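A minimal NumPy sketch of the inter-channel LRN formula above; `a` is assumed to have shape (N, H, W) with N feature maps, and the default hyper-parameter values (k, alpha, beta, n) are the ones commonly quoted for AlexNet:

```python
import numpy as np

def lrn_inter_channel(a, k=2.0, alpha=1e-4, beta=0.75, n=5):
    """b[i] = a[i] / (k + alpha * sum over nearby channels j of a[j]**2) ** beta."""
    N = a.shape[0]
    b = np.empty_like(a)
    for i in range(N):
        lo = max(0, i - n // 2)
        hi = min(N - 1, i + n // 2)
        denom = (k + alpha * np.sum(a[lo:hi + 1] ** 2, axis=0)) ** beta
        b[i] = a[i] / denom
    return b

a = np.random.rand(16, 8, 8)        # 16 feature maps of size 8x8
print(lrn_inter_channel(a).shape)   # (16, 8, 8)
```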
Reducing Overfitting
 Training the network with different variants of the same image helps avoid overfitting.
 Generate additional data from existing data (augmentation); see the sketch below.
 Data augmentation by mirroring.
 Data augmentation by random crops.
 Dropout regularization.
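A small NumPy sketch of the two augmentation schemes mentioned above (mirroring and random crops); the 256 x 256 input and 227 x 227 crop size follow the AlexNet numbers quoted earlier, and the random image is just a stand-in:

```python
import numpy as np

def random_augment(image, crop=227):
    """image: (H, W, 3) array. Returns a randomly mirrored random crop."""
    h, w, _ = image.shape
    # Data augmentation by mirroring (horizontal flip with probability 0.5)
    if np.random.rand() < 0.5:
        image = image[:, ::-1, :]
    # Data augmentation by random crops
    top = np.random.randint(0, h - crop + 1)
    left = np.random.randint(0, w - crop + 1)
    return image[top:top + crop, left:left + crop, :]

img = np.random.rand(256, 256, 3)      # stand-in for a 256 x 256 RGB training image
print(random_augment(img).shape)       # (227, 227, 3)
```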
Dropout
Srivastava, Nitish, et al., "Dropout: A Simple Way to Prevent Neural Networks from Overfitting", Journal of Machine Learning Research 15 (2014), 1929-1958.
 Regularization technique proposed by Srivastava et al. in 2014.
 During training, randomly selected neurons are temporarily dropped from the network (with probability 0.5).
 Their activations are not passed to the downstream neurons in the forward pass.
 In the backward pass, weight updates are not applied to these neurons.
https://www.learnopencv.com/understanding-alexnet/
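A minimal NumPy sketch of the dropout mechanics described above: during training each activation is kept with probability p = 0.5; at test time the whole network is used and the activations are scaled by p so their expected value matches training (the "inverted" variant instead rescales during training):

```python
import numpy as np

def dropout_forward(activations, p_keep=0.5, train=True):
    """Classic dropout: randomly zero activations while training, scale at test time."""
    if train:
        mask = np.random.rand(*activations.shape) < p_keep
        return activations * mask          # dropped units pass nothing downstream
    return activations * p_keep            # test: full network, scaled outputs

h = np.random.rand(4, 8)                   # a batch of hidden-layer activations
print(dropout_forward(h, train=True))      # roughly half the entries are zero
print(dropout_forward(h, train=False).shape)
```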
Dropout: How does it help?
 While training, the weights of neurons are tuned for specific features, providing some sort of specialization.
 Neighbouring neurons start relying on these specializations (co-adaptation).
 This leads to a neural network model that is too specialized to the training data.
 As neurons are randomly dropped, other neurons have to step in to compensate.
 Thus the network learns multiple independent representations.
Learned Features

How does it help?
 This makes the network less sensitive to specific weights.
 Enhances the generalization capability of the network.
 Less vulnerable to overfitting.
 The whole network is used during testing; there is no dropout.
 Dropout increases the number of iterations needed for the network to converge.
 But it helps avoid overfitting.
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 39: Popular CNN Models III
Concepts Covered:
 CNN
 AlexNet
 VGG Net
 Transfer Learning
 GoogLeNet
 ResNet
 etc.
VGG 16
ILSVRC 2014 1st Runner-Up
Visual Geometry Group, Oxford University
VGG 16
Karen Simonyan and Andrew Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition".
VGG 16
 Inputs to the architecture are color images of size 224x224.
 The image is passed through a stack of convolutional layers.
 Every convolution filter has a very small receptive field: 3×3, stride 1.
 Row and column padding is used to maintain spatial resolution after convolution.
 There are 13 convolution layers.
 There are 5 max-pool layers.
 Max pooling window size is 2x2, stride 2.
VGG 16
 Not every convolution layer is followed by a max-pool layer.
 There are 3 fully connected (FC) layers.
 The first two FC layers have 4096 channels each.
 The last FC layer has 1000 channels.
 The last layer is a softmax layer with 1000 channels, one for each category of images in the ImageNet database.
 Hidden layers use ReLU as the activation function.
VGG 16
Striking differences from AlexNet
 All convolution kernels are of size 3x3 with stride 1.
 All max-pool kernels are of size 2x2 with stride 2.
 The variable-size kernels used in AlexNet can be realised using multiple stacked 3x3 kernels.
 This realisation is in terms of the size of the receptive field covered by the kernels (see the sketch below).
 Top-5 error rate ~ 7%.
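A small sketch of the receptive-field argument above: a stack of k convolution layers of size 3x3 with stride 1 covers a receptive field of 2k + 1, so two stacked layers cover the same field as a single 5x5 kernel and three cover the same field as a 7x7 kernel:

```python
def receptive_field(num_3x3_layers):
    """Receptive field of a stack of 3x3, stride-1 convolutions."""
    rf = 1
    for _ in range(num_3x3_layers):
        rf += 2          # each 3x3 layer extends the field by (3 - 1)
    return rf

for k in (1, 2, 3):
    print(k, "x (3x3) ->", receptive_field(k), "x", receptive_field(k))
# 1 x (3x3) -> 3 x 3
# 2 x (3x3) -> 5 x 5
# 3 x (3x3) -> 7 x 7
```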
Transfer Learning
Transfer Learning
Kevin McGuinness
https://www.slideshare.net/xavigiro/transfer-learning-d2l4-insightdcu-machine-learning-workshop-2017
Transfer Learning
CNN as Fixed Feature Extractor (a sketch follows below):
 Take a pre-trained CNN architecture trained on a large dataset (like ImageNet).
 Remove the last fully connected layer of this pre-trained network.
 The remaining CNN acts as a fixed feature extractor for the new dataset.
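A hedged PyTorch/torchvision sketch of the recipe above. It assumes a torchvision installation where `models.vgg16(pretrained=True)` is available (newer versions use a `weights` argument instead), and the new dataset's class count of 10 is just an example:

```python
import torch
import torch.nn as nn
from torchvision import models

# Take a CNN pre-trained on a large dataset (ImageNet)
backbone = models.vgg16(pretrained=True)

# Freeze it so it acts as a fixed feature extractor
for param in backbone.parameters():
    param.requires_grad = False

# Replace the last fully connected layer with a new head for the target task
num_features = backbone.classifier[-1].in_features       # 4096 for VGG-16
backbone.classifier[-1] = nn.Linear(num_features, 10)    # trained from scratch

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.SGD(backbone.classifier[-1].parameters(), lr=1e-3, momentum=0.9)
print(backbone(torch.randn(1, 3, 224, 224)).shape)       # torch.Size([1, 10])
```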
Transfer Learning
(Figure slides; images not reproduced.)
Image Source: https://becominghuman.ai/what-exactly-does-cnn-see-4d436d8e6e52
Transfer Learning
 Lower layers generate more general features: the knowledge transfers very well to other tasks.
 Higher layers are more task-specific.
 Fine-tuning improves generalization when sufficient examples are available.
 Transfer learning and fine-tuning often lead to better performance than training from scratch on the target dataset.
 Even features transferred from distant tasks often perform better than random initial weights.
Fine-tuning
 The weights of the pre-trained CNN are fine-tuned for the new dataset by continuing the back-propagation.
 Fine-tuning can be done for all layers.
 Due to overfitting concerns, the earlier layers of the net may be kept fixed and fine-tuning done only on the higher layers (see the sketch below).
 Earlier layers can be fixed because lower layers extract features that are more generic.
 Higher layers, on the other hand, are task-specific.
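Continuing the torchvision sketch above, fine-tuning might keep the earlier (more generic) convolutional layers fixed and continue back-propagation only through the later layers. The split point chosen here (all but the last five feature layers frozen) and the 10-class head are arbitrary illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.vgg16(pretrained=True)
model.classifier[-1] = nn.Linear(model.classifier[-1].in_features, 10)  # new task head

# Fix the earlier layers (generic features); fine-tune the last conv layers + classifier
for layer in list(model.features)[:-5]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-4, momentum=0.9)
print(sum(p.numel() for p in trainable), "parameters will be fine-tuned")
```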
Course Name: Deep Learning
Faculty Name: Prof. P. K. Biswas
Department : E & ECE, IIT Kharagpur
Topic
Lecture 40: Popular CNN Models IV
Concepts Covered:
 CNN
 AlexNet
 VGG Net
 Transfer Learning
 Challenges in Deep Learning
 GoogLeNet
 ResNet
 etc.
Deep Learning Challenges

Challenges
 Deep learning is data hungry.
 Overfitting or lack of generalization.
 Vanishing/exploding gradient problem.
 Appropriate learning rate.
 Covariate shift.
 Effective training.
Vanishing Gradient

Vanishing Gradient Problem
https://towardsdatascience.com/an-overview-of-resnet-and-its-variants-5281e2f56035
Vanishing Gradient Problem

Consider a 4-layer network with activation functions $f_1, f_2, f_3, f_4$, weights $W_1, W_2, W_3, W_4$, input $X$ and output $O$:

$$O = f_4\!\left(W_4\, f_3\!\left(W_3\, f_2\!\left(W_2\, f_1(W_1 X)\right)\right)\right)$$

Writing the pre-activations as $\theta_1 = W_1 X$, $\theta_2 = W_2 f_1(\theta_1)$, $\theta_3 = W_3 f_2(\theta_2)$, $\theta_4 = W_4 f_3(\theta_3)$, so that $O = f_4(\theta_4)$, the chain rule gives

$$\frac{\partial O}{\partial W_1}
= \frac{\partial O}{\partial \theta_4}\cdot\frac{\partial \theta_4}{\partial \theta_3}\cdot\frac{\partial \theta_3}{\partial \theta_2}\cdot\frac{\partial \theta_2}{\partial \theta_1}\cdot\frac{\partial \theta_1}{\partial W_1}
= f_4'(\theta_4)\, W_4\, f_3'(\theta_3)\, W_3\, f_2'(\theta_2)\, W_2\, f_1'(\theta_1)\, X$$

$$\frac{\partial O}{\partial W_2}
= \frac{\partial O}{\partial \theta_4}\cdot\frac{\partial \theta_4}{\partial \theta_3}\cdot\frac{\partial \theta_3}{\partial \theta_2}\cdot\frac{\partial \theta_2}{\partial W_2}
= f_4'(\theta_4)\, W_4\, f_3'(\theta_3)\, W_3\, f_2'(\theta_2)\, f_1(\theta_1)$$

The gradient reaching an early layer is therefore a product of many factors of the form $W_i\, f_i'(\theta_i)$; when these factors are small, the product shrinks rapidly with depth and the gradient vanishes (a numeric sketch follows below).
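A small numeric illustration of this product, assuming scalar weights and sigmoid activations (for which $f_i' \le 0.25$); the depth and weight scale are made up purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

np.random.seed(0)
depth = 20
weights = np.random.randn(depth) * 0.5      # illustrative scalar weights

# Forward pass through a chain of scalar sigmoid layers
x = 1.0
thetas = []
a = x
for w in weights:
    theta = w * a
    thetas.append(theta)
    a = sigmoid(theta)

# Backward pass: gradient w.r.t. the first weight is a product of many small factors
grad = x * sigmoid(thetas[0]) * (1 - sigmoid(thetas[0]))       # d(theta_1)/d(W_1) * f_1'(theta_1)
for w, theta in zip(weights[1:], thetas[1:]):
    grad *= w * sigmoid(theta) * (1 - sigmoid(theta))          # each layer contributes W_i * f_i'(theta_i)

print("gradient of the output w.r.t. the first weight:", grad)
# With sigmoid, each f' <= 0.25, so after ~20 layers the product is vanishingly small.
```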
Vanishing Gradient Problem
 Choice of activation function: ReLU instead of sigmoid.
 Appropriate initialization of weights.
 Intelligent back-propagation learning algorithm.