Batch Normalization:
Accelerating Deep Network Training by Reducing
Internal Covariate Shift
Ho Kun Lin
2023/8/3
Outline
Introduction
Related Works
Methodology
Experimental Results
Conclusion
Introduction
Sergey Ioffe and Christian Szegedy, Google Research; ICML 2015
It's hard to train deep neural networks
SGD is simple but requires careful tuning
The inputs to each layer are affected by all preceding layers
Each layer must continuously adapt to a new input distribution
Introduction
Gradient vanishing slows down convergence
A changing input distribution is likely to move x into the saturated regime of the nonlinearity
Introduction
Use Batch Normalization to fix the means and variances of layer inputs
Reduces SGD's dependence on parameter scales and initial values
Reduces the need for Dropout
Matches the SOTA model on ImageNet using only 7% of the training steps
Related Works
Normalizing the Inputs
Covariate shift
Input distribution to a system changes
Internal Covariate Shift
Input distribution to a network changes due to training
Methodology
Fix the distribution of the layer inputs
Normalize each scalar feature independently
Since full whitening is costly
Normalization via mini-batch statistics
Since mini-batches are already used in SGD
Methodology - BN Transform
Consider a mini-batch B = {x_1, ..., x_m} of size m:
Mini-batch mean: mu_B = (1/m) * sum_i x_i
Mini-batch variance: sigma_B^2 = (1/m) * sum_i (x_i - mu_B)^2
Normalize: xhat_i = (x_i - mu_B) / sqrt(sigma_B^2 + epsilon)
Add learnable parameters gamma and beta:
Scale and shift: y_i = gamma * xhat_i + beta = BN_{gamma,beta}(x_i)
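As a minimal NumPy sketch of the transform above (not the paper's code; the shapes and the epsilon value are our choices):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """BN transform over a mini-batch.

    x: (m, d) mini-batch, one row per example.
    gamma, beta: (d,) learnable scale and shift per feature.
    """
    mu = x.mean(axis=0)                    # mini-batch mean, shape (d,)
    var = x.var(axis=0)                    # mini-batch variance, shape (d,)
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each feature
    return gamma * x_hat + beta            # scale and shift: y = BN(x)

# Toy usage: 4 examples, 3 features; with gamma=1, beta=0 the output
# has (approximately) zero mean and unit variance per feature.
x = np.random.randn(4, 3)
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(y.mean(axis=0), y.var(axis=0))
```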
Methodology - Train & Inference
Output depends on the data in the mini-batch!
Training: normalize with the mini-batch statistics mu_B and sigma_B^2
Inference: the output must depend only on the input, deterministically
Train the Batch Normalization network, then estimate the population statistics E[x] and Var[x] (averaged over many training mini-batches) and use the fixed transform xhat = (x - E[x]) / sqrt(Var[x] + epsilon) at inference
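A minimal sketch of the train/inference split. The paper estimates E[x] and Var[x] by averaging moments over many mini-batches (with an unbiased variance correction); the exponential moving average used here is a common substitute and an assumption on our part:

```python
import numpy as np

class BatchNormLayer:
    """Sketch of a BN layer that tracks running statistics for inference."""

    def __init__(self, d, momentum=0.1, eps=1e-5):
        self.gamma, self.beta = np.ones(d), np.zeros(d)
        self.running_mean, self.running_var = np.zeros(d), np.ones(d)
        self.momentum, self.eps = momentum, eps

    def forward(self, x, training=True):
        if training:
            mu, var = x.mean(axis=0), x.var(axis=0)
            # Update the population estimates used later at inference.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mu
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Inference: fixed E[x], Var[x] make the layer a deterministic
            # linear transform of its input.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta
```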
Methodology - CNN
Put BN before nonlinearities
Also include learnable gamma and beta, one pair per feature map
The effective mini-batch size becomes m times the spatial size (m * p * q for p x q feature maps), so all locations of a feature map are normalized the same way, preserving the convolutional property (see the sketch below)
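A sketch of the convolutional case, assuming NCHW tensors: statistics are pooled over the batch and all spatial locations, with one gamma/beta pair per channel:

```python
import numpy as np

def batch_norm_conv(x, gamma, beta, eps=1e-5):
    """BN for conv feature maps: x has shape (m, c, h, w).

    Statistics are taken over the mini-batch AND all spatial locations,
    i.e. an effective batch of m*h*w values per channel.
    """
    mu = x.mean(axis=(0, 2, 3), keepdims=True)   # shape (1, c, 1, 1)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

x = np.random.randn(8, 16, 28, 28)               # m=8, c=16, 28x28 maps
y = batch_norm_conv(x, np.ones(16), np.zeros(16))
```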
Methodology - Observation
With BN, backpropagation is unaffected by the scale of the parameters:
BN(Wu) = BN((aW)u)
d BN((aW)u) / du = d BN(Wu) / du
d BN((aW)u) / d(aW) = (1/a) * d BN(Wu) / dW
Larger weights lead to smaller gradients, so BN stabilizes parameter growth
BN enables higher learning rates
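A quick numerical check of the scale-invariance identity BN(Wu) = BN((aW)u); the shapes and the scale a = 7 are arbitrary choices of ours:

```python
import numpy as np

def bn(x, eps=1e-12):
    # Plain normalization; gamma and beta are omitted since they do not
    # affect the scale-invariance argument.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

rng = np.random.default_rng(0)
u = rng.standard_normal((32, 10))   # mini-batch of layer inputs
W = rng.standard_normal((10, 5))    # layer weights
a = 7.0                             # arbitrary rescaling of W

# Scaling W by a scales both the mean and the std of Wu by a, so the
# normalization cancels it: BN((aW)u) == BN(Wu).
print(np.allclose(bn(u @ W), bn(u @ (a * W))))   # True
```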
MNIST dataset
Handwritten digits dataset
28 x 28 pixel monochrome images
60K training images
10K testing images
10 labels
Used to verify internal covariate shift and its effect on training
MNIST dataset - NN Setup
28 x 28 binary image input
3 FC hidden layers with 100 sigmoid nonlinearities each
1 FC output layer with 10 activations and cross-entropy loss
Train for 50K steps, 60 examples per mini-batch
BN added to each hidden layer (see the sketch below)
W initialized to small random Gaussian values
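A sketch of this setup in PyTorch (an assumption on our part; the paper predates PyTorch and gives no code, and the learning rate is our guess):

```python
import torch
import torch.nn as nn

# 3 FC hidden layers of 100 sigmoid units, BN before each nonlinearity,
# and a 10-way output trained with cross-entropy.
model = nn.Sequential(
    nn.Flatten(),                      # 28x28 image -> 784-dim vector
    nn.Linear(784, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 100), nn.BatchNorm1d(100), nn.Sigmoid(),
    nn.Linear(100, 10),                # logits; CrossEntropyLoss adds softmax
)
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # lr not given in the paper
```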
MNIST dataset - Result
x-axis represents epochs
y-axis represents test accuracy
The NN with BN reaches higher test accuracy
MNIST dataset - Result
x-axis represents epochs
y-axis represents the activation value (input to a sigmoid in the last hidden layer)
Lines show the {15, 50, 85}th percentiles
The activation distribution in the NN without BN is unstable
The activation distribution in the NN with BN is stable
ImageNet
Train with ILSVRC 2012 dataset
1000 labels
150K test and validation images
1.2M train images
ImageNet - Inception Model
GoogLeNet is an instance of the Inception architecture
Won the 2014 ImageNet competition (ILSVRC 2014)
Used as the SOTA baseline
ImageNet - Inception Setup
Modifications to the Inception architecture:
Replace each 5x5 conv. layer with two consecutive 3x3 conv. layers (see the sketch below)
Increase the number of 28x28 Inception modules from 2 to 3
Within the modules, use average pooling or max-pooling
No across-the-board pooling layers between any two Inception modules; instead, add stride-2 conv./pooling layers before the filter concatenation in modules 3c and 4e
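A sketch of the first modification in PyTorch (channel counts are hypothetical): a 5x5 convolution and two stacked 3x3 convolutions cover the same receptive field, but the latter uses fewer weights (18*C^2 vs. 25*C^2 for C input/output channels):

```python
import torch.nn as nn

# Original: one 5x5 conv (padding 2 keeps the spatial size).
five_by_five = nn.Conv2d(64, 64, kernel_size=5, padding=2)

# Replacement: two consecutive 3x3 convs with the same receptive field.
two_three_by_three = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),  # nonlinearity between the convs is an assumption
    nn.Conv2d(64, 64, kernel_size=3, padding=1),
)
```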
ImageNet - BN Setup
Increase learning rate
Remove Dropout
Reduce the L2 weight regularization by a factor of 5
Accelerate the learning rate decay (6 times faster)
Remove Local Response Normalization
Shuffle training examples more thoroughly
Reduce the photometric distortions
ImageNet - BN Setup
BN-Baseline
Inception + BN before each nonlinearity
BN-x5 / BN-x30
BN-Baseline with the learning rate increased by a factor of 5 / 30 (0.0075 / 0.045)
BN-x30-Sigmoid
BN-x30 with sigmoid instead of ReLU
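The compared variants, summarized as a small Python table (the base Inception learning rate of 0.0015 comes from the paper; the dict layout is ours):

```python
base_lr = 0.0015  # learning rate of the original Inception baseline

configs = {
    "Inception":      dict(bn=False, lr=base_lr,      act="relu"),
    "BN-Baseline":    dict(bn=True,  lr=base_lr,      act="relu"),
    "BN-x5":          dict(bn=True,  lr=5 * base_lr,  act="relu"),    # 0.0075
    "BN-x30":         dict(bn=True,  lr=30 * base_lr, act="relu"),    # 0.045
    "BN-x30-Sigmoid": dict(bn=True,  lr=30 * base_lr, act="sigmoid"),
}
```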
ImageNet - Result
x-axis represents epochs
y-axis represents validation accuracy
Same accuracy in fewer steps with BN
Inception with sigmoid stays at chance-level accuracy (< 1/1000)
ImageNet - Result
BN-x30 trains slower initially
A higher learning rate yields higher final accuracy
ImageNet - Result
BN-Baseline reaches Inception's 72.2% accuracy in less than half the training steps
BN-x5 needs 14 times fewer steps to reach 72.2%
No need for Dropout or Local Response Normalization
ImageNet Ensemble - Setup
An ensemble of 6 BN-x30 networks
Increased initial weights in the conv. layers
Dropout with probability 5% or 10%
Non-convolutional, per-activation BN in the last hidden layer
Prediction based on the arithmetic average of the class probabilities
ImageNet Ensemble - Result
Reaches 4.9% top-5 validation error (4.8% test error), surpassing the previous SOTA
Conclusion
Reducing internal covariate shift speeds up training
Adding BN to a SOTA model yields a substantial training speedup
Preserves model expressivity
Allows higher learning rates
Reduces the need for Dropout and careful parameter initialization
Beats the SOTA model in ImageNet classification
THANK YOU
