Batch Normalization
Accelerating Deep Network Training by
Reducing Internal Covariate Shift
By: Seraj Alhamidi
Instructor: Associate Prof. Mohammed Alhanjouri
June 2019
About the paper
Authors
Sergey Ioffe, Google Inc., sioffe@google.com
Christian Szegedy, Google Inc., szegedy@google.com
Presented at: The 32nd International Conference on Machine Learning (ICML 2015)
Publishers
Google AI: https://ai.google/research/pubs/pub43442
Journal of Machine Learning Research: http://jmlr.org/proceedings/papers/v37/ioffe15.pdf
Cornell University (arXiv): https://arxiv.org/abs/1502.03167
A paper with over 6000 citations (ICML 2015)
Outlines
Introduction
Issues with Training Deep Neural Networks
Batch Normalization
Ablation Study
Comparison with the State of the art Approaches
Some notes
My work
Introduction
ILSVRC Competition in 2015
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), sponsored by Google and Facebook
ImageNet is a dataset of over 15 million labelled high-resolution images in around 22,000 categories, used for classification and localization tasks
ILSVRC uses a subset of ImageNet with 1000 categories.
On 6 Feb 2015, Microsoft proposed PReLU-Net, with a 4.94% error rate, surpassing the human error rate of 5.1%
Five days later, on 11 Feb 2015, Google proposed BN-Inception, with a 4.8% error rate.
BN-Inception reaches the best accuracy of the baseline in about 7% of the training steps the baseline needs to reach the same accuracy
Issues with Training Deep Neural Networks
Vanishing Gradient
Saturating nonlinearities (like $tanh$ or $sigmoid$) cannot be used effectively for deep networks.
Consider, for example, the sigmoid function and its derivative: when the input to the sigmoid becomes very large or very small, the derivative becomes close to zero.
(Figure: the sigmoid function and its derivative.)
Backpropagation algorithm update rule:
 $w(\kappa+1) = w(\kappa) - \alpha \, \dfrac{\partial L}{\partial w}$
 $L = 0.5\,(t - a)^2$, where $t$ is the target and $a$ the output of the activation
 $a^{(l)} = \sigma(x^{(l)})$, where $\sigma$ is the activation and $x^{(l)}$ the input to the activation
 $x^{(l)} = w_{i,j}\, a^{(l-1)} + w_{i+1,j}\, a^{(l-1)} + \dots$
 $\dfrac{\partial L}{\partial w} = \dfrac{\partial L}{\partial a} \cdot \dfrac{\partial a}{\partial x} \cdot \dfrac{\partial x}{\partial w} \equiv \dfrac{\partial L}{\partial a} \cdot \sigma'(x^{(l)}) \cdot \dfrac{\partial x}{\partial w}$
If $\sigma'(x^{(l)})$ is close to zero, the weight update effectively vanishes.
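To make the effect concrete, the sketch below (not from the slides; NumPy-based, all names illustrative) evaluates the sigmoid derivative at a few inputs and shows how small the backpropagated factor becomes once the unit saturates:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid: sigma'(x) = sigma(x) * (1 - sigma(x))
    s = sigmoid(x)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for x = 0 and collapses towards zero as |x|
# grows, so each saturated layer multiplies the backpropagated gradient by
# a number close to zero.
for x in [0.0, 2.0, 5.0, 10.0]:
    print(f"x = {x:5.1f}   sigma'(x) = {sigmoid_prime(x):.6f}")

# Chain rule through n saturated sigmoid layers: the gradient shrinks roughly
# like sigma'(x)^n, e.g. 0.25**10 ~ 1e-6 even in the best (non-saturated) case.
print("upper bound after 10 layers:", 0.25 ** 10)
```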
Issues with Training Deep Neural Networks
Vanishing Gradient
(Figures: the sigmoid function with restricted inputs; rectified linear units $f(x) = x^{+} = \max(0, x)$.)
Some ways around this are to use:
 Nonlinearities like rectified linear units (ReLU), which do not saturate
 Smaller learning rates
 Careful weight initialization
 Batch normalization layers, which can also resolve the issue
Issues with Training Deep Neural Networks
Internal Covariate shift
 Covariate – the features of the input data
 Covariate shift – a change in the distribution of inputs; when it occurs at layers in the middle of a deep neural network, it goes by the technical name "internal covariate shift"
Neural networks learn efficiently when the distribution fed to the layers is roughly:
Zero-centered
Constant through time and data
In other words, the distribution of the data being fed to the layers should not vary too much across the mini-batches fed to the network.
Issues with Training Deep Neural Networks
Internal Covariate shift in deep NN
(Figure: the input distribution seen by a deep layer at iterations i, i+1, i+2.)
At every iteration the layer sees a new relation (distribution), especially in deep layers and at the beginning of training.
Issues with Training Deep Neural Networks
Take one layer from the internal layers and assume its input x has the distribution shown below; suppose the function learned by the layer is represented by the dashed line.
After a gradient update, the distribution of x changes to something different.
The loss for this mini-batch is then larger than the previous loss.
Issues with Training Deep Neural Networks
 Every update effectively forces layer (l) to fit a new perceptron,
 because it is handed a new distribution every time.
 In deep layers this can behave like a butterfly effect: even though the change in $w$ at the first layers is small, it can make the network unstable.
Batch Normalization
 Batch Normalization normalizes a batch of inputs before they are fed to a non-linear activation unit (like ReLU, sigmoid, etc.) during training,
 so that the input to the activation function across each training batch has a mean of 0 and a variance of 1.
 Applying batch normalization to the activation σ(Wx + b) results in σ(BN(Wx + b)), where $BN$ is the batch normalizing transform.
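As a rough illustration of this ordering, here is a minimal NumPy sketch (my own, with made-up shapes and parameter names) of one layer computing σ(BN(Wx + b)), with BN sitting between the affine transform and the nonlinearity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bn(z, gamma, beta, eps=1e-8):
    # normalize each pre-activation feature over the mini-batch, then scale and shift
    z_hat = (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)
    return gamma * z_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 20))            # mini-batch: 32 samples, 20 input features
W = rng.normal(size=(20, 100))           # weights of one hidden layer with 100 units
b = np.zeros(100)

# a = sigma(BN(W x + b)): BN is inserted between the affine transform and the nonlinearity
gamma, beta = np.ones(100), np.zeros(100)
a = sigmoid(bn(x @ W + b, gamma, beta))
print(a.shape)                           # (32, 100)
```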
Batch Normalization
To make each dimension unit Gaussian, we apply:
$\hat{x}^{(k)} = \dfrac{x^{(k)} - E[x^{(k)}]}{\sqrt{Var[x^{(k)}]}}$
where $E[x^{(k)}]$ and $Var[x^{(k)}]$ are respectively the mean and variance of the $k$-th feature over a batch. Then we transform $\hat{x}^{(k)}$ as:
$y^{(k)} = \gamma^{(k)} \hat{x}^{(k)} + \beta^{(k)}$
where $\gamma$ and $\beta$ are learnable parameters of the so-called batch normalization layer,
and $k$ indexes the feature dimension (one $\gamma^{(k)}, \beta^{(k)}$ pair per feature, not per sample).
Batch Normalization
Transformation of inputs
Forward Propagation through Batch Normalization layer
So far we have shown the normalization of multiple samples of just one feature.
Input: values of $x$ over a mini-batch $B = \{x_1 \dots x_m\}$; parameters to be learned: $\gamma, \beta$
Output: $\mathcal{Y}_i = BN_{\gamma,\beta}(x_i)$
Flow of computation through Batch Normalization layer
$\mu_B = \dfrac{1}{m}\sum_{i=1}^{m} x_i$  (mini-batch mean)
$\sigma_B^2 = \dfrac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$  (mini-batch variance)
$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$  (normalize)
$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$  (scale and shift)
$\epsilon$ is a small value (e.g. $1 \times 10^{-8}$) so that we do not divide by zero.
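A minimal NumPy sketch of this forward computation, assuming a mini-batch laid out as (samples, features) and one γ, β pair per feature (function and variable names are mine):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-8):
    """Forward pass of batch normalization for one mini-batch.

    x: (m, d) mini-batch of m samples and d features
    gamma, beta: (d,) learnable scale and shift, one pair per feature
    """
    mu = x.mean(axis=0)                       # mini-batch mean, per feature
    var = x.var(axis=0)                       # mini-batch variance, per feature
    x_hat = (x - mu) / np.sqrt(var + eps)     # normalize
    y = gamma * x_hat + beta                  # scale and shift
    cache = (x, x_hat, mu, var, gamma, eps)   # kept for the backward pass
    return y, cache

# Example: 4 samples, 2 features
x = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
y, _ = batchnorm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
print(y.mean(axis=0), y.var(axis=0))          # ~0 mean, ~1 variance per feature
```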
Forward Propagation through Batch Normalization layer
(Figure: the same transform applied independently to two features.)
THE MAGIC
Imagine that the network learned that the optimal way to minimize the cost is to cancel the BN effect!
Forward Propagation through Batch Normalization layer
If SGD drives the parameters to
$\beta = E[x] = \mu_B$ and $\gamma = \sqrt{Var[x]} = \sqrt{\sigma_B^2 + \epsilon}$,
then
$\hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$
$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = \sqrt{\sigma_B^2 + \epsilon} \cdot \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \mu_B = x_i$
Identity transform: since $\gamma, \beta$ are adapted by SGD, the layer can undo the normalization entirely if that is what minimizes the loss.
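A quick numerical check of this identity-transform case (an illustrative sketch, not code from the paper): if γ and β happen to equal the mini-batch statistics, the BN output reproduces the inputs exactly:

```python
import numpy as np

eps = 1e-8
x = np.random.randn(8, 3) * 4.0 + 7.0         # arbitrary mini-batch: 8 samples, 3 features

mu = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mu) / np.sqrt(var + eps)

gamma = np.sqrt(var + eps)                    # gamma = sqrt(sigma_B^2 + eps)
beta = mu                                     # beta  = mu_B
y = gamma * x_hat + beta

print(np.allclose(y, x))                      # True: BN reduces to the identity transform
```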
Backpropagation through Batch Normalization layer
$\mathcal{Y}_i = \gamma \hat{x}_i + \beta = BN_{\gamma,\beta}(x_i)$
Given the upstream gradient $\dfrac{\partial L}{\partial \mathcal{Y}_i}$, the chain rule yields $\dfrac{\partial L}{\partial \hat{x}_i}$, $\dfrac{\partial L}{\partial \gamma}$, $\dfrac{\partial L}{\partial \beta}$, $\dfrac{\partial L}{\partial \sigma_B^2}$, $\dfrac{\partial L}{\partial \mu_B}$ and finally $\dfrac{\partial L}{\partial x_i}$, so the error also flows back to the previous layer.
SGD updates:
$\beta(k+1) = \beta(k) - \alpha \, \dfrac{\partial L}{\partial \beta}$
$\gamma(k+1) = \gamma(k) - \alpha \, \dfrac{\partial L}{\partial \gamma}$
(using the forward definitions $\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i$, $\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2$, $\hat{x}_i = (x_i - \mu_B)/\sqrt{\sigma_B^2 + \epsilon}$, $\mathcal{Y}_i = \gamma \hat{x}_i + \beta$)
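For reference, a NumPy sketch of the backward pass following this chain-rule decomposition; the gradient expressions are the standard ones from the BN paper, and `cache` mirrors the forward sketch shown earlier (names are mine):

```python
import numpy as np

def batchnorm_backward(dout, cache):
    """Backward pass of batch normalization.

    dout: (m, d) upstream gradient dL/dy
    cache: values saved by the forward pass
    returns dL/dx, dL/dgamma, dL/dbeta
    """
    x, x_hat, mu, var, gamma, eps = cache
    m = x.shape[0]
    inv_std = 1.0 / np.sqrt(var + eps)

    dbeta = dout.sum(axis=0)                              # dL/dbeta
    dgamma = (dout * x_hat).sum(axis=0)                   # dL/dgamma

    dx_hat = dout * gamma                                 # dL/dx_hat
    dvar = np.sum(dx_hat * (x - mu), axis=0) * -0.5 * inv_std ** 3
    dmu = np.sum(dx_hat * -inv_std, axis=0) + dvar * np.mean(-2.0 * (x - mu), axis=0)
    dx = dx_hat * inv_std + dvar * 2.0 * (x - mu) / m + dmu / m
    return dx, dgamma, dbeta
```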
Batch Normalization during test time (inference)
At test time we use the population statistics rather than mini-batch statistics. Effectively, we process mini-batches of size $m$ during training and use their statistics to compute:
$E[x^{(k)}] = E_B[\mu_B]$
$Var[x^{(k)}] = \dfrac{m}{m-1} \, E_B[\sigma_B^2]$
Alternatively, we can use an exponential moving average to estimate the mean and variance to be used during test time; the running averages are updated as:
$\mu_{running} = \alpha \cdot \mu_{running} + (1 - \alpha) \cdot \mu_B$
$\sigma_{running}^2 = \alpha \cdot \sigma_{running}^2 + (1 - \alpha) \cdot \sigma_B^2$
where $\alpha$ is a constant smoothing factor between 0 and 1 that represents the degree of dependence on the previous observations.
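A small sketch of this bookkeeping (illustrative; the class name and the α and ε values are my own choices): running statistics are accumulated during training and reused to normalize at test time:

```python
import numpy as np

class BatchNormStats:
    """Tracks running mean/variance during training and applies them at test time."""

    def __init__(self, num_features, alpha=0.9, eps=1e-8):
        self.alpha = alpha                    # smoothing factor in [0, 1]
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def update(self, batch):
        # called once per training mini-batch
        mu_b = batch.mean(axis=0)
        var_b = batch.var(axis=0)
        self.running_mean = self.alpha * self.running_mean + (1 - self.alpha) * mu_b
        self.running_var = self.alpha * self.running_var + (1 - self.alpha) * var_b

    def normalize_at_test_time(self, x, gamma, beta):
        # at inference the stored population estimates replace mini-batch statistics
        x_hat = (x - self.running_mean) / np.sqrt(self.running_var + self.eps)
        return gamma * x_hat + beta
```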
Ablation Study
MNIST dataset
28×28 binary images as input; 3 fully connected (FC) hidden layers with 100 activations each; the last hidden layer is followed by 10 output activations, one per digit; the loss is the cross-entropy loss.
The BN network trains much more stably.
Ablation Study
ImageNet (1000 categories) on GoogLeNet/Inception (2014):
about 138 GB of training images, 6.3 GB of validation images, and 13 GB of test images
CNN architectures tested
Comparison with the State of the art Approaches
Some Notes : cross-entropy loss function
We use the cross-entropy loss function
neural network (1)
Computed | targets | correct?
------------------------------------------------
0.3 0.3 0.4 | 0 0 1 (democrat) | yes
0.3 0.4 0.3 | 0 1 0 (republican) | yes
0.1 0.2 0.7 | 1 0 0 (other) | no
neural network (2)
Computed | targets | correct?
------------------------------------------------
0.1 0.2 0.7 | 0 0 1 (democrat) | yes
0.1 0.7 0.2 | 0 1 0 (republican) | yes
0.3 0.4 0.3 | 1 0 0 (other) | no
cross-entropy error for the first training item
−( (ln(0.3) ∗ 0) + (ln(0.3) ∗ 0) + (ln(0.4) ∗ 1) ) = −ln(0.4)
average cross-entropy error (ACE) for network (1)
−(ln(0.4) + ln(0.4) + ln(0.1)) / 3 = 1.38
average cross-entropy error (ACE) for network (2)
−(ln(0.7) + ln(0.7) + ln(0.3)) / 3 = 0.64
mean squared error for the first item of network (1)
(0.3 − 0)^2 + (0.3 − 0)^2 + (0.4 − 1)^2 = 0.54
the MSE for the first neural network is
(0.54 + 0.54 + 1.34) / 3 = 0.81
The MSE for the second, better, network is
(0.14 + 0.14 + 0.74) / 3 = 0.34
Cross-entropy separates the two networks more sharply: (1.38 − 0.64 = 0.74) > (0.81 − 0.34 = 0.47)
The 𝑙𝑛() function in cross-entropy takes into account the closeness of a prediction
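The short script below (illustrative, NumPy-based) reproduces the ACE and MSE numbers above for the two example networks:

```python
import numpy as np

# computed probabilities and one-hot targets for the two example networks
net1 = np.array([[0.3, 0.3, 0.4], [0.3, 0.4, 0.3], [0.1, 0.2, 0.7]])
net2 = np.array([[0.1, 0.2, 0.7], [0.1, 0.7, 0.2], [0.3, 0.4, 0.3]])
targets = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])

def average_cross_entropy(p, t):
    # -sum(t * ln(p)) per item, averaged over the three items
    return -np.sum(t * np.log(p), axis=1).mean()

def mean_squared_error(p, t):
    return np.sum((p - t) ** 2, axis=1).mean()

print(average_cross_entropy(net1, targets))   # ~1.38
print(average_cross_entropy(net2, targets))   # ~0.64
print(mean_squared_error(net1, targets))      # ~0.81
print(mean_squared_error(net2, targets))      # ~0.34
```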
Some Notes : Convolutional Neural Network (CNN)
 CNNs are used for image classification tasks
 They were developed between 1988 and 1993 at Bell Labs
 This was the first convolutional network that could recognize handwritten digits
Some Notes : Convolutional Neural Network (CNN)
Layers: Convolution Layer (Conv Layer), Pooling Layer, ReLU Layer, Fully Connected Layer (Flatten)
Some Notes : Convolutional Neural Network (CNN)
Convolution Layer (Conv Layer)
Convolution works by sliding a window (the filter, or kernel) across the input.
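A naive NumPy sketch of that sliding-window operation (strictly speaking cross-correlation, as most deep-learning frameworks implement it); stride 1, no padding, and the kernel values are just an example:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2-D input and compute one output value per position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1         # "valid" output size, stride 1, no padding
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)   # elementwise product, then sum
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
edge_kernel = np.array([[1.0, 0.0, -1.0]] * 3)    # simple vertical-edge detector
print(conv2d(image, edge_kernel))                 # (3, 3) feature map
```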
Some Notes : Convolutional Neural Network (CNN)
2D filters
Some Notes : Convolutional Neural Network (CNN)
3D filters
Pooling Layer (Sub-sampling or Down-sampling)
 Reduces the size of the feature maps by applying a function such as the average or the maximum (hence "down-sampling")
 Makes the extracted features more robust by making them more invariant to scale and orientation changes; see the sketch below
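For example, 2×2 max pooling with stride 2 halves each spatial dimension; a small NumPy sketch (illustrative names and values):

```python
import numpy as np

def max_pool_2x2(feature_map):
    """Down-sample a (H, W) feature map with 2x2 max pooling, stride 2."""
    h, w = feature_map.shape
    h2, w2 = h // 2, w // 2
    trimmed = feature_map[:h2 * 2, :w2 * 2]        # drop odd remainder rows/cols
    blocks = trimmed.reshape(h2, 2, w2, 2)         # group into 2x2 blocks
    return blocks.max(axis=(1, 3))                 # keep the maximum of each block

fm = np.array([[1, 3, 2, 0],
               [4, 6, 1, 2],
               [7, 2, 9, 4],
               [3, 1, 5, 8]], dtype=float)
print(max_pool_2x2(fm))                            # [[6. 2.], [7. 9.]]
```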
ReLU Layer
Removes all the black elements from the feature map, keeping only those carrying a positive value (the grey and white colours):
$Output = \max(0, Input)$
Its purpose is to introduce non-linearity into our ConvNet.
Fully Connected Layer (Flatten)
http://scs.ryerson.ca/~aharley/vis/conv/flat.html
MY WORK: MNIST on Google Colab
Inputs = 28*28 = 784
Layer 1&2 = 100 nodes | Layer 3 = 10 nodes
All Activations are sigmoid
Cross-entropy loss function
The train and test sets come already split in TensorFlow.
(Figure: the distribution over time of the inputs to the sigmoid function of the first five neurons in the second layer. Batch normalization has a visible and significant effect of removing variance/noise in these inputs.)
Final accuracy: 99%
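A minimal tf.keras approximation of this setup (my sketch, not the exact notebook: 784 inputs, two hidden layers of 100 units with BN applied to the pre-activations before the sigmoid, a 10-way softmax output with cross-entropy loss; the epochs, batch size and optimizer are illustrative):

```python
import tensorflow as tf

# Load and flatten MNIST (already split into train/test by tf.keras.datasets)
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0

def hidden_block(units):
    # Batch-normalize the pre-activations, then apply the sigmoid
    return [
        tf.keras.layers.Dense(units, use_bias=False),   # BN's beta replaces the bias
        tf.keras.layers.BatchNormalization(),
        tf.keras.layers.Activation("sigmoid"),
    ]

model = tf.keras.Sequential(
    [tf.keras.Input(shape=(784,))]
    + hidden_block(100)
    + hidden_block(100)
    + [tf.keras.layers.Dense(10, activation="softmax")]
)

model.compile(optimizer="sgd",
              loss="sparse_categorical_crossentropy",   # cross-entropy on integer labels
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=5, batch_size=64,
          validation_data=(x_test, y_test))
```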
MY WORK: Caltech dataset
Three runs (final accuracies reported: 90.54%, 94.44%, 96.04%):
With BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 150, LR = $1 \times 10^{-3}$
Without BN: #epochs = 250, LR = $1 \times 10^{-3}$
We use ten (10) classes from the Caltech dataset instead of the ImageNet dataset because of its huge size.
The task is to classify the input image.
Split: 1/3 train, 1/3 validation, 1/3 test.
Thanks for listening!