Regularization for
Deep Learning
Lecture slides for Chapter 7 of Deep Learning
www.deeplearningbook.org
Ian Goodfellow
2016-09-27
(Goodfellow 2016)
Definition
• “Regularization is any modification we make to a
learning algorithm that is intended to reduce its
generalization error but not its training error.”
(Goodfellow 2016)
Weight Decay as Constrained
Optimization
Figure 7.1: The effect of L2 weight decay shown in (w1, w2) weight space: the regularized solution w̃ lies between the origin and the unregularized minimum w*, at the point where the penalty and the original objective reach equilibrium.
(Goodfellow 2016)
Norm Penalties
• L1: Encourages sparsity, equivalent to MAP
Bayesian estimation with Laplace prior
• Squared L2: Encourages small weights, equivalent to
MAP Bayesian estimation with Gaussian prior
(Goodfellow 2016)
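A minimal sketch of how these penalties enter a training objective; the toy model, data, and the penalty weights lambda_l1 and lambda_l2 are illustrative assumptions, not values from the slides:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)               # any parametric model works here
criterion = nn.MSELoss()
lambda_l1, lambda_l2 = 1e-4, 1e-3      # assumed penalty strengths

x, y = torch.randn(32, 10), torch.randn(32, 1)

# L1 penalty: sum of absolute parameter values, encourages sparse weights
l1_penalty = sum(p.abs().sum() for p in model.parameters())

# Squared L2 penalty (weight decay): sum of squared parameter values, encourages small weights
l2_penalty = sum(p.pow(2).sum() for p in model.parameters())

loss = criterion(model(x), y) + lambda_l1 * l1_penalty + lambda_l2 * l2_penalty
loss.backward()
```

In practice the squared L2 term is usually applied through the optimizer's weight_decay argument rather than added to the loss by hand.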
Dataset Augmentation
• Affine distortion
• Noise
• Elastic deformation
• Horizontal flip
• Random translation
• Hue shift
(Goodfellow 2016)
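A hedged sketch of applying several of these transformations on the fly with torchvision; the specific magnitudes are assumptions, and horizontal flips should only be used when they preserve the label (natural images, not digits or characters):

```python
import torchvision.transforms as T

# Each transform is sampled randomly per image, so every epoch sees
# slightly different versions of the same training examples.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                      # horizontal flip
    T.RandomAffine(degrees=10, translate=(0.1, 0.1)),   # affine distortion + random translation
    T.ColorJitter(hue=0.05),                            # small hue shift
    T.ToTensor(),
])
# e.g. torchvision.datasets.CIFAR10("data", train=True, transform=augment)
```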
Multi-Task Learning
1. Task-specific parameters (which only benefit from the examples of their task to achieve good generalization). These are the upper layers of the neural network in figure 7.2.
2. Generic parameters, shared across all the tasks (which benefit from the pooled data of all the tasks). These are the lower layers of the neural network in figure 7.2.
Figure 7.2: Multi-task learning can be cast in several ways in deep learning frameworks. Here a shared representation h(shared) of the input x feeds task-specific representations h(1), h(2), and h(3), with h(1) and h(2) producing the task outputs y(1) and y(2).
(Goodfellow 2016)
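The architecture in figure 7.2 — shared lower layers producing h(shared), with task-specific layers on top — might be sketched as follows; the layer sizes and the two-task setup are assumptions for illustration:

```python
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, shared_dim=128, out_dims=(10, 1)):
        super().__init__()
        # Generic parameters: shared across tasks, trained on the pooled data of all tasks
        self.shared = nn.Sequential(nn.Linear(in_dim, shared_dim), nn.ReLU())
        # Task-specific parameters: one head per task, producing y(1), y(2), ...
        self.heads = nn.ModuleList(nn.Linear(shared_dim, d) for d in out_dims)

    def forward(self, x):
        h_shared = self.shared(x)
        return [head(h_shared) for head in self.heads]

y1, y2 = MultiTaskNet()(torch.randn(8, 64))   # one output per task
```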
Learning Curves
Figure 7.3: Learning curves showing how the negative log-likelihood loss changes over time (epochs): the training set loss decreases steadily, while the validation set loss eventually begins to increase again.
Early stopping: terminate while validation set
performance is better
(Goodfellow 2016)
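A minimal early-stopping loop in the spirit of the slide: remember the parameters from the epoch with the lowest validation loss and stop once it has not improved for `patience` epochs. The train_epoch and validate callables are assumed placeholders for one epoch of training and a validation-loss evaluation:

```python
import copy

def fit_with_early_stopping(model, train_epoch, validate, max_epochs=250, patience=10):
    best_loss, best_state, since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        train_epoch(model)
        val_loss = validate(model)
        if val_loss < best_loss:
            best_loss, since_best = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())   # remember the best parameters so far
        else:
            since_best += 1
            if since_best >= patience:
                break                                        # validation loss stopped improving
    if best_state is not None:
        model.load_state_dict(best_state)                    # roll back to the best point seen
    return model
```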
Early Stopping and Weight
Decay
Figure 7.4: An illustration in (w1, w2) weight space of the effect of early stopping (left) and of L2 weight decay (right): both move the solution from the unregularized minimum w* to a nearby point w̃, and under a quadratic approximation of the objective the two are equivalent.
(Goodfellow 2016)
Sparse Representations
$$
\begin{bmatrix} -14 \\ 1 \\ 19 \\ 2 \\ 23 \end{bmatrix}
=
\begin{bmatrix}
3 & -1 & 2 & -5 & 4 & 1 \\
4 & 2 & -3 & -1 & 1 & 3 \\
-1 & 5 & 4 & 2 & -3 & -2 \\
3 & 1 & 2 & -3 & 0 & -3 \\
-5 & 4 & -2 & 2 & -5 & -1
\end{bmatrix}
\begin{bmatrix} 0 \\ 2 \\ 0 \\ 0 \\ -3 \\ 0 \end{bmatrix},
\qquad y \in \mathbb{R}^m,\; B \in \mathbb{R}^{m \times n},\; h \in \mathbb{R}^n
\tag{7.47}
$$
In the first expression, we have an example of a sparsely parametrized linear regression model. In the second, we have linear regression with a sparse representation h of the data x. That is, h is a function of x that, in some sense, represents the information present in x, but does so with a sparse vector.
Representational regularization is accomplished by the same sorts of mechanisms that we have used in parameter regularization.
Norm penalty regularization of representations is performed by adding a norm penalty on the representation, Ω(h), to the loss function.
(Goodfellow 2016)
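A sketch of norm-penalty regularization of a representation: an L1 penalty is placed on the hidden activations h rather than on the weights, pushing many activations to exactly zero. The encoder/decoder shapes and the penalty weight alpha are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(20, 50), nn.ReLU())
decoder = nn.Linear(50, 20)
alpha = 1e-3                              # assumed strength of the representation penalty

x = torch.randn(16, 20)
h = encoder(x)                            # h is a representation of x
x_hat = decoder(h)

# Omega(h) = ||h||_1 encourages a sparse representation of the data
loss = F.mse_loss(x_hat, x) + alpha * h.abs().sum(dim=1).mean()
loss.backward()
```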
Bagging
Figure 7.5: A cartoon depiction of how bagging works. Suppose we train an 8 detector on the dataset depicted above, containing an 8, a 6, and a 9. Suppose we make two different resampled datasets. The bagging training procedure is to construct each of these datasets by sampling with replacement. The first dataset omits the 9 and repeats the 8, so the first ensemble member learns that a loop on top of the digit corresponds to an 8; the second dataset omits the 6 and repeats the 9, so the second ensemble member learns that a loop on the bottom corresponds to an 8.
(Goodfellow 2016)
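The resampling procedure in figure 7.5, sketched generically: each ensemble member gets its own dataset drawn with replacement from the original one, and predictions are averaged at test time. The make_model and train callables, and the .predict interface, are assumed placeholders:

```python
import numpy as np

def bagging_ensemble(X, y, make_model, train, n_members=2, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        # Sample with replacement: some examples repeat, others are omitted entirely
        idx = rng.integers(0, len(X), size=len(X))
        members.append(train(make_model(), X[idx], y[idx]))
    return members

def ensemble_predict(members, X):
    # Average the members' outputs (probabilities or regression predictions)
    return np.mean([m.predict(X) for m in members], axis=0)
```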
Dropout
Figure 7.6: Dropout trains the ensemble of all subnetworks that can be formed by removing non-output units from an underlying base network. Here the base network has inputs x1, x2, hidden units h1, h2, and output y; the ensemble enumerates every subnetwork obtained by deleting subsets of these units.
(Goodfellow 2016)
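A minimal sketch of inverted dropout, the form used by most frameworks: during training each unit is kept with probability keep_prob and the survivors are scaled by 1/keep_prob, so no rescaling is needed at test time (frameworks such as PyTorch expose this as a layer, e.g. nn.Dropout):

```python
import torch

def dropout(h, keep_prob=0.8, training=True):
    """Zero each unit of h independently with probability 1 - keep_prob (inverted dropout)."""
    if not training:
        return h                                   # the full network is used at test time
    mask = (torch.rand_like(h) < keep_prob).float()
    return h * mask / keep_prob                    # scale survivors so expectations match

h = torch.randn(4, 8)
h_train = dropout(h)                               # a different random subnetwork each call
h_test = dropout(h, training=False)
```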
Adversarial Examples
x (classified "panda" with 57.7% confidence) + 0.007 × sign(∇x J(θ, x, y)) (classified "nematode" with 8.2% confidence) = x + ε sign(∇x J(θ, x, y)) (classified "gibbon" with 99.3% confidence)
Figure 7.8: A demonstration of adversarial example generation applied to GoogLeNet
(Szegedy et al., 2014a) on ImageNet. By adding an imperceptibly small vector whose
elements are equal to the sign of the elements of the gradient of the cost function with
respect to the input, we can change GoogLeNet’s classification of the image. Reproduced
with permission from Goodfellow et al. (2014b).
Linear functions are easy to optimize. Unfortunately, the value of a linear function can change very rapidly if it has numerous inputs. If we change each input by ε, then a linear function with weights w can change by as much as ε||w||1, which can be a very large amount if w is high-dimensional. Adversarial training discourages this highly sensitive locally linear behavior by encouraging the network to be locally constant in the neighborhood of the training data.
Figure 7.8
Training on adversarial examples is mostly
intended to improve security, but can sometimes
provide generic regularization.
(Goodfellow 2016)
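The perturbation shown in figure 7.8 is the fast gradient sign method, x_adv = x + ε sign(∇x J(θ, x, y)). A hedged sketch, assuming any differentiable classifier and loss; training on such x_adv with the correct label y is the adversarial training mentioned above:

```python
import torch

def fgsm(model, loss_fn, x, y, epsilon=0.007):
    """Fast gradient sign method: move x a small step in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    x_adv = x + epsilon * x.grad.sign()    # x + eps * sign(grad_x J(theta, x, y))
    return x_adv.detach()
```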
Tangent Propagation
Figure 7.9: Illustration of the main idea of the tangent prop algorithm (Simard et al., 1992) and manifold tangent classifier (Rifai et al., 2011c), which both regularize the classifier output function to change little along the tangent directions of the data manifold (the figure contrasts the normal and tangent directions at a point in (x1, x2) space).
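Tangent prop adds a penalty that makes the classifier output change little along known tangent directions of the data manifold, i.e. it penalizes (∇x f(x) · v)² for each tangent vector v. A rough sketch for a scalar-output model and one tangent vector per example; the model f and the tangent vectors are assumed to be given:

```python
import torch

def tangent_prop_penalty(f, x, tangents):
    """Penalize the squared directional derivative of f along each example's tangent vector."""
    x = x.clone().detach().requires_grad_(True)
    out = f(x)                                          # shape (batch,): one scalar per example
    grads, = torch.autograd.grad(out.sum(), x, create_graph=True)
    directional = (grads * tangents).sum(dim=1)         # grad_x f(x) . v for each example
    return directional.pow(2).mean()
```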