UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1
Lecture 3: Deeper into Deep Learning and Optimizations
Deep Learning @ UvA
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Previous lecture
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Lecture overview
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4
Deeper into
Neural Networks &
Deep Neural Nets
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5
A Neural/Deep Network in a nutshell
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6
SGD vs GD
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7
Backpropagation again
o Step 1. Compute the forward propagation for all layers recursively: 𝑎𝑙 = ℎ𝑙(𝑥𝑙) and 𝑥𝑙+1 = 𝑎𝑙
o Step 2. Once done with the forward propagation, follow the reverse path.
◦ Start from the last layer and for each new layer compute the gradients
◦ Cache computations when possible to avoid redundant operations
o Step 3. Use the gradients ∂ℒ/∂𝜃𝑙 with Stochastic Gradient Descent to train
∂ℒ/∂𝑎𝑙 = (∂𝑎𝑙+1/∂𝑥𝑙+1)^T ⋅ ∂ℒ/∂𝑎𝑙+1
∂ℒ/∂𝜃𝑙 = ∂𝑎𝑙/∂𝜃𝑙 ⋅ (∂ℒ/∂𝑎𝑙)^T
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8
o Often loss surfaces are
◦ non-quadratic
◦ highly non-convex
◦ very high-dimensional
o Datasets are typically too large to compute complete gradients over
o No real guarantee that
◦ the final solution will be good
◦ we converge fast to the final solution
◦ or that there will be convergence at all
Still, backpropagation can be slow
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9
o Stochastically sample “mini-batches” 𝐵𝑗 from the dataset 𝐷
◦ A mini-batch 𝐵𝑗 can contain even just 1 sample
o Much faster than Gradient Descent
o Results are often better
o Also suitable for datasets that change over time
o Variance of gradients increases when the batch size decreases
Stochastic Gradient Descent (SGD)
𝜃(𝑡+1) = 𝜃(𝑡) − (𝜂𝑡 / |𝐵𝑗|) Σ_{𝑖 ∈ 𝐵𝑗} 𝛻𝜃ℒ𝑖,  where 𝐵𝑗 = 𝑠𝑎𝑚𝑝𝑙𝑒(𝐷)
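A minimal sketch of the mini-batch SGD update above in Python/NumPy; the dataset X, Y, the gradient function grad_loss and the hyper-parameter values are illustrative assumptions, not part of the slides:

    import numpy as np

    def sgd(theta, X, Y, grad_loss, lr=0.01, batch_size=32, epochs=10):
        n = X.shape[0]
        for epoch in range(epochs):
            idx = np.random.permutation(n)              # reshuffle every epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]   # B_j = sample(D)
                g = grad_loss(theta, X[batch], Y[batch])  # mean gradient over the mini-batch
                theta = theta - lr * g                  # theta_(t+1) = theta_(t) - eta_t * g
        return theta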
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10
SGD is often better
[Figure: a loss surface with the current solution, the full GD gradient leading to the new GD solution, a noisy SGD gradient, and the best GD and SGD solutions]
• No guarantee that this is always what happens.
• But the noisy SGD gradients can sometimes help escape local optima.
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11
SGD is often better
o (A bit) noisy gradients act as regularization
o Gradient Descent → complete gradients
o Complete gradients fit the (arbitrary) data we have optimally, not the distribution that generates them
◦ It is as if all training samples were the “absolute representative” of the input distribution
◦ and as if test data will be no different from the training data
◦ Suitable for traditional optimization problems: “find the optimal route”
◦ But for ML we cannot make this assumption → test data are always different
o Stochastic gradients → the sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12
SGD is faster
[Figure: the gradient computed over the full dataset]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 13
SGD is faster
[Figure: the same dataset replicated 10x — what is our gradient now?]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 14
SGD is faster
[Figure repeated: the dataset replicated 10x — what is our gradient now?]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15
o Of course, in real situations the data are not exact replicas of each other
o However, after a sizeable amount of data there are clusters of data points that are similar
o Hence, the mini-batch gradient is approximately right
o Approximately right is great; in many cases it is actually even better
SGD is faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16
o Often datasets are not “rigid”
o Imagine Instagram
◦ Let’s assume 1 million new images are uploaded per week and we want to build a “cool picture” classifier
◦ Should “cool pictures” from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are
averaged by the many more “past” samples
◦ Past “over-dominates”
o A properly implemented SGD can track changes much
better and give better models
◦ [LeCun2002]
SGD for dynamically changed datasets
[Figure: example images popular in 2010, in 2014, and today]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer “hard examples”
◦ Don’t overdo it though :P, beware of outliers
o In practice, split your dataset into mini-batches (see the sketch after this slide)
◦ Each mini-batch is as class-divergent and rich as possible
◦ New epoch → to be safe, draw new batches with new, randomly shuffled examples
Shuffling examples
[Figure: the dataset shuffled into different mini-batches at epoch t and at epoch t+1]
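A minimal sketch, assuming NumPy, of re-shuffling the dataset into new mini-batches at every epoch as described above (the class-balancing of each batch is left out; this only shows the reshuffling):

    import numpy as np

    def make_minibatches(n_samples, batch_size, rng):
        """Return a new random split into mini-batches; call once per epoch."""
        idx = rng.permutation(n_samples)   # new, randomly shuffled examples
        return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

    rng = np.random.default_rng(0)
    for epoch in range(5):
        for batch_idx in make_minibatches(n_samples=1000, batch_size=64, rng=rng):
            pass  # compute the mini-batch gradient and update the parameters here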
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18
o Conditions of convergence well understood
o Acceleration techniques can be applied
◦ Second order (Hessian based) optimizations are possible
◦ Measuring not only gradients, but also curvatures of the loss surface
o Simpler theoretical analysis on weight dynamics and convergence rates
Advantages of Gradient Descent (batch) learning
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19
o SGD is preferred to Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets Gradient Descent is not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ A hyper-parameter, trial & error
◦ Usually between 32 and 256 samples
In practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20
Data preprocessing &
normalization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21
o Center the data to be roughly 0
◦ Activation functions are usually “centered” around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias in the gradient direction → might slow down learning
Data pre-processing
ReLU ✗ (not zero-centered), tanh(𝑥) ✓ (zero-centered), 𝜎(𝑥) ✗ (not zero-centered)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22
o Scale input variables to have similar diagonal covariances 𝑐𝑖 = Σ𝑗 (𝑥𝑖^(𝑗))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
Data pre-processing
𝑥 = [𝑥1, 𝑥2, 𝑥3]^T, 𝜃 = [𝜃1, 𝜃2, 𝜃3]^T, 𝑎 = tanh(𝜃^T 𝑥)
𝑥1, 𝑥2, 𝑥3 → much different covariances
Generated gradients dℒ/d𝜃 evaluated at 𝑥1, 𝑥2, 𝑥3: much different
Gradient update harder: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 [dℒ/d𝜃1, dℒ/d𝜃2, dℒ/d𝜃3]^T
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23
o Input variables should be as decorrelated as possible
◦ Input variables are “more independent”
◦ Network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case
◦ extreme correlation (linear dependency) might cause problems [CAUTION]
Data pre-processing
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24
o Assume the input variables follow a Gaussian distribution (roughly)
o In practice:
◦ compute the mean and standard deviation from the training set
◦ then subtract the mean from the training samples
◦ then divide the result by the standard deviation
Normalization: 𝑁(𝜇, 𝜎²) → 𝑁(0, 1)
[Figure: 𝑥 → 𝑥 − 𝜇 → (𝑥 − 𝜇)/𝜎]
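A minimal NumPy sketch of this per-dimension standardization; note the statistics come from the training set only and are then reused for the test set (variable names are illustrative):

    import numpy as np

    def fit_standardizer(X_train, eps=1e-8):
        mu = X_train.mean(axis=0)           # per-dimension mean from the training set
        sigma = X_train.std(axis=0) + eps   # per-dimension std; eps avoids division by zero
        return mu, sigma

    def standardize(X, mu, sigma):
        return (X - mu) / sigma             # x -> (x - mu) / sigma, now roughly N(0, 1)

    # mu, sigma = fit_standardizer(X_train)
    # X_train_n, X_test_n = standardize(X_train, mu, sigma), standardize(X_test, mu, sigma)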
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25
o Instead of “per-dimension” → over all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one (𝜇, 𝜎²) instead of one per input variable
◦ Or the per-color-channel pixel average/variance: (𝜇_red, 𝜎²_red), (𝜇_green, 𝜎²_green), (𝜇_blue, 𝜎²_blue)
𝑁(𝜇, 𝜎²) → 𝑁(0, 1) – Making things faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26
o When input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same range
o Just make sure images have mean 0 (𝜇 = 0)
Even simpler: Centering the input
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27
o If 𝐶 is the covariance matrix of your dataset, compute its eigenvalues and eigenvectors with SVD: 𝑈, Σ, 𝑉^T = svd(𝐶)
o Decorrelate (PCA) the dataset by 𝑋_rot = 𝑈^T 𝑋
◦ Keep a subset of eigenvectors 𝑈′ = [𝑢1, … , 𝑢𝑞] to reduce the data dimensions
o Scale by the square root of the eigenvalues to whiten the data: 𝑋_wht = 𝑋_rot / √Σ
o Not used much with Convolutional Neural Nets
◦ The zero-mean normalization is more important
PCA Whitening
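A minimal NumPy sketch of the PCA whitening recipe above, assuming the data X has shape (num_samples, num_dims) so the rotation is written as X @ U instead of U^T X:

    import numpy as np

    def pca_whiten(X, eps=1e-5):
        X = X - X.mean(axis=0)              # zero-center first
        C = np.cov(X, rowvar=False)         # covariance matrix of the data
        U, S, Vt = np.linalg.svd(C)         # eigenvectors U, eigenvalues S
        X_rot = X @ U                       # decorrelate (PCA rotation)
        X_wht = X_rot / np.sqrt(S + eps)    # scale by sqrt of eigenvalues to whiten
        return X_rot, X_wht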
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28
Example
Images taken from A. Karpathy’s course website: http://cs231n.github.io/neural-networks-2/
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29
Data augmentation [Krizhevsky2012]
[Examples: Original, Flip, Random crop, Contrast, Tint]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30
o Weights change → the distribution of the layer inputs changes per round
◦ (Internal) covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize 𝑥𝑙 to 𝑁(0, 1) and rescale
Batch normalization [Ioffe2015]
[Figure: the distribution of layer 𝑙 inputs 𝑥𝑙 at steps (t), (t+0.5), (t+1) during backpropagation, with and without Batch Normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31
Batch normalization - Intuitively
[Figure repeated: layer 𝑙 input distributions at (t), (t+0.5), (t+1), with and without Batch Normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32
Batch normalization – The algorithm
o 𝜇ℬ ← (1/𝑚) Σ_{𝑖=1}^{𝑚} 𝑥𝑖 [compute the mini-batch mean]
o 𝜎ℬ² ← (1/𝑚) Σ_{𝑖=1}^{𝑚} (𝑥𝑖 − 𝜇ℬ)² [compute the mini-batch variance]
o 𝑥̂𝑖 ← (𝑥𝑖 − 𝜇ℬ) / √(𝜎ℬ² + 𝜀) [normalize the input]
o 𝑦𝑖 ← 𝛾𝑥̂𝑖 + 𝛽 [scale and shift the input; 𝛾 and 𝛽 are trainable parameters]
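A minimal NumPy sketch of the batch-normalization forward pass above for a mini-batch x of shape (batch_size, num_features); gamma and beta would be trainable parameters, and the running statistics used at test time are omitted:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=0)                    # mini-batch mean, per feature
        var = x.var(axis=0)                    # mini-batch variance, per feature
        x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly N(0, 1)
        return gamma * x_hat + beta            # scale and shift (trainable gamma, beta)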
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise possibly exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near-optimal “regime”
o Better model regularization
◦ Neuron activations are not deterministic, they depend on the batch
◦ The model cannot be overconfident
Batch normalization - Benefits
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34
Regularization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually, the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + 𝜆Ω(𝜃)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
Regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36
o The most important (or most popular) regularization
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + (𝜆/2) Σ𝑙 ‖𝜃𝑙‖²
o The ℓ2-regularization can pass inside the gradient descent update rule
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡(𝛻𝜃ℒ + 𝜆𝜃𝑙) ⟹ 𝜃(𝑡+1) = (1 − 𝜆𝜂𝑡)𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
o 𝜆 is usually about 10^−1, 10^−2
ℓ2-regularization
Also called “weight decay”, because the weights get smaller
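A minimal sketch of how the ℓ2 penalty enters one SGD step as weight decay; grad_loss, lr and lam are illustrative names for the data-loss gradient, the learning rate and 𝜆:

    def sgd_step_l2(theta, grad_loss, lr=0.01, lam=0.01):
        # theta_(t+1) = (1 - lam * lr) * theta_(t) - lr * grad_loss
        return (1.0 - lam * lr) * theta - lr * grad_loss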
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37
o ℓ1-regularization is another of the most important regularization techniques
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + (𝜆/2) Σ𝑙 |𝜃𝑙|
o ℓ1-regularization also passes inside the gradient descent update rule
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜆𝜂𝑡 𝜃(𝑡)/|𝜃(𝑡)| − 𝜂𝑡𝛻𝜃ℒ
◦ 𝜃(𝑡)/|𝜃(𝑡)| is the sign function
o ℓ1-regularization → sparse weights
◦ 𝜆 ↗ → more weights become 0
ℓ1-regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38
o To tackle overfitting, another popular technique is early stopping
o Monitor performance on a separate validation set
o Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing
◦ This quite likely means the network is starting to overfit
Early stopping
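A minimal sketch of the early-stopping loop described above; train_one_epoch, validation_error and model.copy() are assumed placeholders for your own training, evaluation and parameter-snapshot code:

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  max_epochs=100, patience=5):
        best_err, best_state, bad_epochs = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            err = validation_error(model)      # monitor a separate validation set
            if err < best_err:
                best_err, best_state, bad_epochs = err, model.copy(), 0
            else:
                bad_epochs += 1                # validation error went up
                if bad_epochs >= patience:     # stop: the network likely overfits
                    break
        return best_state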
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39
o During training, set activations randomly to 0
◦ Neurons are dropped at random according to a Bernoulli distribution with 𝑝 = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by 𝑝
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No “free-rider” neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
Dropout [Srivastava2014]
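A minimal NumPy sketch of dropout as described above (keep a unit with probability p = 0.5 during training, rescale activations by p at test time); the "inverted dropout" variant used in most libraries instead rescales during training:

    import numpy as np

    def dropout(a, p=0.5, training=True, rng=np.random.default_rng()):
        if training:
            mask = rng.binomial(1, p, size=a.shape)  # Bernoulli(p): 1 = keep, 0 = drop
            return a * mask
        return a * p                                 # test time: reweight activations by p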
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
Dropout
[Figure: the original model; in each epoch a different random subset of neurons is dropped]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45
Architectural details
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46
o Straightforward sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ E.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: 𝑎 = ℎ(𝑥) = 1.7159 tanh(2𝑥/3)
o You can add a linear term to avoid flat areas: 𝑎 = ℎ(𝑥) = tanh(𝑥) + 𝛽𝑥
Sigmoid-like activation functions
[Figure: tanh(𝑥) + 0.5𝑥 vs. tanh(𝑥)]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47
o RBF: 𝑎 = ℎ(𝑥) = Σ𝑗 𝑢𝑗 exp(−𝛽𝑗 ‖𝑥 − 𝑤𝑗‖²)
o Sigmoid: 𝑎 = ℎ(𝑥) = 𝜎(𝑥) = 1 / (1 + 𝑒^−𝑥)
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train, but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower-dimensional spaces
RBFs vs “Sigmoids”
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48
o Activation function 𝑎 = ℎ(𝑥) = max(0, 𝑥)
o Gradient w.r.t. the input: ∂𝑎/∂𝑥 = 0 if 𝑥 ≤ 0, and 1 if 𝑥 > 0
o Very popular in computer vision and speech recognition
o Much faster computations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to “die”. Lower learning rates mitigate the problem
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49
ReLU convergence rate
[Figure: training convergence of ReLU vs. tanh]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50
o Soft approximation (softplus): 𝑎 = ℎ(𝑥) = ln(1 + 𝑒^𝑥)
o Noisy ReLU: 𝑎 = ℎ(𝑥) = max(0, 𝑥 + 𝜀), 𝜀 ~ 𝛮(0, 𝜎(𝑥))
o Leaky ReLU: 𝑎 = ℎ(𝑥) = 𝑥 if 𝑥 > 0, 0.01𝑥 otherwise
o Parametric ReLU: 𝑎 = ℎ(𝑥) = 𝑥 if 𝑥 > 0, 𝛽𝑥 otherwise (the parameter 𝛽 is trainable)
Other ReLUs
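A small NumPy sketch of the activations above, just to make the piecewise definitions concrete:

    import numpy as np

    def relu(x):        return np.maximum(0.0, x)
    def softplus(x):    return np.log1p(np.exp(x))           # ln(1 + e^x)
    def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)
    def prelu(x, beta): return np.where(x > 0, x, beta * x)  # beta is trainable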
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
Architectural hyper-parameters
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52
o Dataset dependent hyperparameters
o Tip: Start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important, use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
Number of neurons, number of hidden layers
[Figure: generalization vs. model complexity (number of neurons)]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53
Learning rate
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54
o The right learning rate 𝜂𝑡 is very important for fast convergence
◦ Too strong → gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ Learning rate of (shared) weights proportional to the square root of the number of connections sharing the weight
o Adaptive learning rates are also possible, based on the errors observed
◦ [Sompolinsky1995]
Learning rate
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate by a fixed factor every 𝑇 epochs
o Inverse decay: 𝜂𝑡 = 𝜂0 / (1 + 𝜀𝑡)
o Exponential decay: 𝜂𝑡 = 𝜂0 𝑒^(−𝜀𝑡)
o Often step decay is preferred
◦ simple, intuitive, works well, and only a single extra hyper-parameter 𝑇 (e.g. 𝑇 = 2 or 10)
Learning rate schedules
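A minimal sketch of these schedules as functions of the epoch t; eta0, eps, decay_factor and T are illustrative hyper-parameter names:

    import math

    def constant_lr(eta0, t):                        return eta0
    def step_decay(eta0, t, decay_factor=0.5, T=10): return eta0 * decay_factor ** (t // T)
    def inverse_decay(eta0, t, eps=0.1):             return eta0 / (1.0 + eps * t)
    def exponential_decay(eta0, t, eps=0.1):         return eta0 * math.exp(-eps * t)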
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56
o Try several log-spaced values 10^−1, 10^−2, 10^−3, … on a smaller set
◦ Then narrow it down from there, around where you get the lowest error
o You can decrease the learning rate every 10 (or some other number of) full training-set epochs
◦ Although this highly depends on your data
Learning rate in practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57
Weight initialization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58
o There are a few contradictory requirements
o Weights need to be small enough
◦ around the origin (𝟎) for symmetric functions (tanh, sigmoid)
◦ when training starts it is better to stimulate activation functions near their linear regime
◦ larger gradients → faster training
o Weights need to be large enough
◦ otherwise the signal is too weak for any serious learning
Weight initialization
[Figure: sigmoid-like activations with their linear regime around the origin, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 60
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons then operate at their full capacity
Question: Why similar input/output variance?
Answer: Because the output of one module is the input to another
o Good practice: initialize weights to be asymmetric
◦ Don’t give the same value to all weights (like all 𝟎)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ the non-linearities
◦ the data normalization
Weight initialization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62
o For tanh, initialize weights uniformly from [−√(6/(𝑑𝑙−1 + 𝑑𝑙)), √(6/(𝑑𝑙−1 + 𝑑𝑙))]
◦ 𝑑𝑙−1 is the number of input variables to the tanh layer and 𝑑𝑙 is the number of the output variables
o For a sigmoid, use [−4 ∙ √(6/(𝑑𝑙−1 + 𝑑𝑙)), 4 ∙ √(6/(𝑑𝑙−1 + 𝑑𝑙))]
One way of initializing sigmoid-like neurons
[Figure: the linear regime of the activation around the origin, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63
o For 𝑎 = 𝜃𝑥 the variance is Var(𝑎) = E[𝑥]² Var(𝜃) + E[𝜃]² Var(𝑥) + Var(𝑥) Var(𝜃)
o Since E[𝑥] = E[𝜃] = 0: Var(𝑎) = Var(𝑥) Var(𝜃) ≈ 𝑑 ⋅ Var(𝑥𝑖) Var(𝜃𝑖)
o For Var(𝑎) = Var(𝑥) ⇒ Var(𝜃𝑖) = 1/𝑑
o Draw random weights from 𝜃 ~ 𝑁(0, 1/𝑑), where 𝑑 is the number of neurons in the input
Xavier initialization [Glorot2010]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64
o Unlike sigmoids, ReLUs set the linear activations to 0 half the time
o So double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from 𝑤 ~ 𝑁(0, 2/𝑑), where 𝑑 is the number of neurons in the input
[He2015] initialization for ReLUs
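A minimal NumPy sketch of the Xavier and He initializations above for a weight matrix of shape (d_in, d_out); the d of the slides is the number of input neurons d_in, and N(0, v) means a Gaussian with variance v, so the standard deviation is √v:

    import numpy as np

    def xavier_init(d_in, d_out, rng=np.random.default_rng()):
        return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_in, d_out))  # theta ~ N(0, 1/d)

    def he_init(d_in, d_out, rng=np.random.default_rng()):
        return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))  # w ~ N(0, 2/d) for ReLU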
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65
Loss functions
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66
o Our samples contain only one class
◦ There is only one correct answer per sample (“Is it a cat? Is it a horse? …”)
o Negative log-likelihood (cross entropy) + Softmax
ℒ(𝜃; 𝑥, 𝑦) = −Σ_{𝑐=1}^{𝐶} 𝑦𝑐 log 𝑎𝐿^𝑐, for all classes 𝑐 = 1, …, 𝐶
o Hierarchical softmax when 𝐶 is very large
o Hinge loss (aka SVM loss)
ℒ(𝜃; 𝑥, 𝑦) = Σ_{𝑐=1, 𝑐≠𝑦}^{𝐶} max(0, 𝑎𝐿^𝑐 − 𝑎𝐿^𝑦 + 1)
o Squared hinge loss
Multi-class classification
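A minimal NumPy sketch of the softmax + cross-entropy loss above for a single sample; logits are the network outputs a_L before the softmax and y is the index of the correct class:

    import numpy as np

    def softmax_cross_entropy(logits, y):
        z = logits - logits.max()         # stabilize the exponentials
        p = np.exp(z) / np.exp(z).sum()   # softmax probabilities over the C classes
        return -np.log(p[y])              # negative log-likelihood of the true class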
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67
o Each sample can have many correct answers
o Hinge loss and the likes
◦ Sigmoid outputs would also work
o Each output neuron is independent
◦ “Does this contain a car, yes or no?”
◦ “Does this contain a person, yes or no?”
◦ “Does this contain a motorbike, yes or no?”
◦ “Does this contain a horse, yes or no?”
o Instead of “Is this a car, motorbike or person?”
◦ 𝑝(𝑐𝑎𝑟|𝑥) = 0.55, 𝑝(𝑚/𝑏𝑖𝑘𝑒|𝑥) = 0.25, 𝑝(𝑝𝑒𝑟𝑠𝑜𝑛|𝑥) = 0.15, 𝑝(ℎ𝑜𝑟𝑠𝑒|𝑥) = 0.05
◦ 𝑝(𝑐𝑎𝑟|𝑥) + 𝑝(𝑚/𝑏𝑖𝑘𝑒|𝑥) + 𝑝(𝑝𝑒𝑟𝑠𝑜𝑛|𝑥) + 𝑝(ℎ𝑜𝑟𝑠𝑒|𝑥) = 1.0
Multi-class, multi-label classification
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68
o The good old Euclidean loss
ℒ(𝜃; 𝑥, 𝑦) = ½ ‖𝑦 − 𝑎𝐿‖²₂
o Or RBF on top of the Euclidean loss
ℒ(𝜃; 𝑥, 𝑦) = Σ𝑗 𝑢𝑗 exp(−𝛽𝑗 (𝑦 − 𝑎𝐿)²)
o Or the ℓ1 distance
ℒ(𝜃; 𝑥, 𝑦) = Σ𝑗 |𝑦𝑗 − 𝑎𝐿^𝑗|
Regression
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69
Even better
optimizations
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70
o Don’t let the gradient direction switch all the time
o Maintain “momentum” from the previous updates
o More robust gradients and learning → faster convergence
o Nice “physics”-based interpretation
◦ Instead of updating the position of the “ball”, we update the velocity, which updates the position
Momentum
𝑢𝜃^(𝑡+1) = 𝛾 𝑢𝜃^(𝑡) − 𝜂𝑡 𝛻𝜃ℒ
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑢𝜃^(𝑡+1)
[Figure: a loss surface with the plain gradient vs. the gradient + momentum trajectory]
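A minimal sketch of the momentum update above; u is the velocity, gamma the momentum coefficient (e.g. 0.9) and grad the current gradient:

    def momentum_step(theta, u, grad, lr=0.01, gamma=0.9):
        u = gamma * u - lr * grad   # update the velocity ...
        theta = theta + u           # ... which updates the position
        return theta, u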
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71
o Use the look-ahead (future) gradient instead of the current gradient
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
Nesterov Momentum [Sutskever2013]
𝑢𝜃^(𝑡+1) = 𝛾 𝑢𝜃^(𝑡) − 𝜂𝑡 𝛻𝜃ℒ(𝜃(𝑡) + 𝛾 𝑢𝜃^(𝑡))
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑢𝜃^(𝑡+1)
[Figure: plain momentum vs. Nesterov momentum, which uses the look-ahead gradient from the next step]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72
o Normally all weights are updated with the same “aggressiveness”
◦ Often some parameters could enjoy more “teaching”
◦ While others are already about there
o Adapt the learning per parameter
𝜃(𝑡+1) = 𝜃(𝑡) − 𝐻ℒ^(−1) 𝜂𝑡 𝛻𝜃ℒ
o 𝐻ℒ is the Hessian matrix of ℒ (second-order derivatives): 𝐻ℒ^(𝑖𝑗) = ∂²ℒ / (∂𝜃𝑖 ∂𝜃𝑗)
Second order optimization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian instead, e.g. with the L-BFGS algorithm
◦ Keeps a memory of past gradients to approximate the inverse Hessian
o L-BFGS works alright with Gradient Descent. What about SGD?
o In practice, SGD with some good momentum works just fine
Second order optimization methods in practice
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
Other per-parameter adaptive optimizations
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75
o Schedule
◦ 𝑚𝑗 = Σ𝜏 (𝛻𝜃ℒ𝑗)² ⟹ 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ / (√𝑚 + 𝜀)
◦ 𝜀 is a small number to avoid division by 0
◦ The effective updates become gradually smaller and smaller
Adagrad [Duchi2011]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76
o Schedule
◦ 𝑚𝑗^(𝑡) = 𝛼 𝑚𝑗^(𝑡−1) + (1 − 𝛼) (𝛻𝜃^(𝑡)ℒ𝑗)²  (𝛼 is the decay hyper-parameter)
◦ 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ / (√𝑚 + 𝜀)
o A moving average of the squared gradients, compared to Adagrad’s full sum
o Large gradients, e.g. a too “noisy” loss surface
◦ Updates are tamed
o Small gradients, e.g. stuck in a flat loss-surface ravine
◦ Updates become more aggressive
o Square rooting boosts small values while suppressing large values
RMSprop
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77
o One of the most popular learning algorithms
𝑚^(𝑡+1) = 𝛽1 𝑚^(𝑡) + (1 − 𝛽1) 𝛻𝜃ℒ  [first moment: momentum on the gradient]
𝑣^(𝑡+1) = 𝛽2 𝑣^(𝑡) + (1 − 𝛽2) (𝛻𝜃ℒ)²  [second moment: moving average of squared gradients]
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝑚^(𝑡+1) / (√(𝑣^(𝑡+1)) + 𝜀)
o Similar to RMSprop, but with momentum
o Recommended values: 𝛽1 = 0.9, 𝛽2 = 0.999, 𝜀 = 10^−8
Adam [Kingma2014]
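A minimal NumPy sketch of the Adam update as written above (without the bias-correction terms of the full [Kingma2014] algorithm):

    import numpy as np

    def adam_step(theta, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
        v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (squared gradients)
        theta = theta - lr * m / (np.sqrt(v) + eps)  # per-parameter adaptive update
        return theta, m, v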
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78
Visual overview
Picture credit: Alec Radford
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79
o Learning to learn by gradient descent by gradient descent
◦ [Andrychowicz2016]
o 𝜃(𝑡+1) = 𝜃(𝑡) + 𝑔𝑡(𝛻𝜃ℒ, 𝜑)
o 𝑔𝑡 is an “optimizer” with its own parameters 𝜑
◦ Implemented as a recurrent network
Learning –not computing– the gradients
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80
Good practice
o Preprocess the data to at least have 0 mean
o Initialize weights based on the activation functions
◦ For ReLUs, Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81
Babysitting
Deep Nets
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82
o Always check your gradients if they are not computed automatically (see the numerical check after this slide)
o Check that in the first round you get the loss expected for random predictions
o Check the network on a few samples
◦ Turn off regularization. You should predictably overfit and reach a loss of 0
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the curves on the training and validation sets
◦ There should be a gap, but not too large
Babysitting Deep Nets
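A minimal sketch of a numerical gradient check via central differences, to compare against your analytic backpropagation gradients; loss_fn(theta) is assumed to return a scalar loss for a NumPy parameter array theta:

    import numpy as np

    def numerical_gradient(loss_fn, theta, eps=1e-5):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            # central difference: (L(theta + eps) - L(theta - eps)) / (2 * eps)
            grad.flat[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
        return grad

    # the relative error between numerical and analytic gradients should be small (e.g. < 1e-5)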
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84
o http://www.deeplearningbook.org/
◦ Part II: Chapter 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas, Learning to learn by gradient descent by gradient
descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002] LeCun, Bottou, Orr, Müller. Efficient BackProp, Neural Networks: Tricks of the Trade, Springer, 2002
Reading material & references
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85
Next lecture
o What are the Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?
More Related Content

PDF
lecture2.pdf
PDF
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
PPTX
An Introduction to Deep Learning
PPTX
Introduction to Deep Learning
PPTX
Introduction to Deep learning and H2O for beginner's
PPTX
DeepLearningLecture.pptx
PPTX
1. Introduction to deep learning.pptx
PDF
Cheatsheet deep-learning-tips-tricks
lecture2.pdf
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
An Introduction to Deep Learning
Introduction to Deep Learning
Introduction to Deep learning and H2O for beginner's
DeepLearningLecture.pptx
1. Introduction to deep learning.pptx
Cheatsheet deep-learning-tips-tricks

Similar to lecture3.pdf (20)

PDF
The Machinery behind Deep Learning
PDF
Chap 8. Optimization for training deep models
PPTX
Chapter10.pptx
PPTX
Deep learning crash course
PDF
Deep Learning: concepts and use cases (October 2018)
PPTX
1. Introduction to deep learning.pptx
PDF
Deep Learning Class #1 - Go Deep or Go Home
PDF
DL Classe 1 - Go Deep or Go Home
PDF
Separating Hype from Reality in Deep Learning with Sameer Farooqui
PDF
Dep Neural Networks introduction new.pdf
PPTX
Introduction to deep Learning Fundamentals
PPTX
AD3501 - DL Unit-1 PPT.pptx python syllabus
PPTX
Techniques in Deep Learning
PPTX
Nimrita deep learning
PPTX
Deep Neural Network Module 3A Optimization.pptx
PDF
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
PPTX
Deep Learning
PPTX
Training DNN Models - II.pptx
PPTX
Deep Learning in Recommender Systems - RecSys Summer School 2017
PPT
deep learning UNIT-1 Introduction Part-1.ppt
The Machinery behind Deep Learning
Chap 8. Optimization for training deep models
Chapter10.pptx
Deep learning crash course
Deep Learning: concepts and use cases (October 2018)
1. Introduction to deep learning.pptx
Deep Learning Class #1 - Go Deep or Go Home
DL Classe 1 - Go Deep or Go Home
Separating Hype from Reality in Deep Learning with Sameer Farooqui
Dep Neural Networks introduction new.pdf
Introduction to deep Learning Fundamentals
AD3501 - DL Unit-1 PPT.pptx python syllabus
Techniques in Deep Learning
Nimrita deep learning
Deep Neural Network Module 3A Optimization.pptx
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
Deep Learning
Training DNN Models - II.pptx
Deep Learning in Recommender Systems - RecSys Summer School 2017
deep learning UNIT-1 Introduction Part-1.ppt
Ad

More from Tigabu Yaya (20)

PDF
Deep Learning and types Convolutional Neural Network
PDF
ML_basics_lecture1_linear_regression.pdf
PDF
03. Data Exploration in Data Science.pdf
PDF
MOD_Architectural_Design_Chap6_Summary.pdf
PDF
MOD_Design_Implementation_Ch7_summary.pdf
PDF
GER_Project_Management_Ch22_summary.pdf
PDF
lecture_GPUArchCUDA02-CUDAMem.pdf
PDF
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
PDF
6_RealTimeScheduling.pdf
PPTX
Regression.pptx
PDF
lecture6.pdf
PDF
lecture5.pdf
PDF
lecture4.pdf
PPT
Chap 4.ppt
PPT
200402_RoseRealTime.ppt
PPT
matrixfactorization.ppt
PPTX
nnfl.0620.pptx
PPT
L20.ppt
PDF
The Jacobi and Gauss-Seidel Iterative Methods.pdf
PDF
C_and_C++_notes.pdf
Deep Learning and types Convolutional Neural Network
ML_basics_lecture1_linear_regression.pdf
03. Data Exploration in Data Science.pdf
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Design_Implementation_Ch7_summary.pdf
GER_Project_Management_Ch22_summary.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
6_RealTimeScheduling.pdf
Regression.pptx
lecture6.pdf
lecture5.pdf
lecture4.pdf
Chap 4.ppt
200402_RoseRealTime.ppt
matrixfactorization.ppt
nnfl.0620.pptx
L20.ppt
The Jacobi and Gauss-Seidel Iterative Methods.pdf
C_and_C++_notes.pdf
Ad

Recently uploaded (20)

PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
Institutional Correction lecture only . . .
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Basic Mud Logging Guide for educational purpose
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
RMMM.pdf make it easy to upload and study
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Institutional Correction lecture only . . .
Sports Quiz easy sports quiz sports quiz
Pharmacology of Heart Failure /Pharmacotherapy of CHF
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
STATICS OF THE RIGID BODIES Hibbelers.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
human mycosis Human fungal infections are called human mycosis..pptx
01-Introduction-to-Information-Management.pdf
Final Presentation General Medicine 03-08-2024.pptx
PPH.pptx obstetrics and gynecology in nursing
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Pre independence Education in Inndia.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Basic Mud Logging Guide for educational purpose
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
RMMM.pdf make it easy to upload and study
FourierSeries-QuestionsWithAnswers(Part-A).pdf

lecture3.pdf

  • 1. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1 Lecture 3: Deeper into Deep Learning and Optimizations Deep Learning @ UvA
  • 2. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2 o Machine learning paradigm for neural networks o Backpropagation algorithm, backbone for training neural networks o Neural network == modular architecture o Visited different modules, saw how to implement and check them Previous lecture
  • 3. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3 o How to define our model and optimize it in practice o Data preprocessing and normalization o Optimization methods o Regularizations o Architectures and architectural hyper-parameters o Learning rate o Weight initializations o Good practices Lecture overview
  • 4. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4 Deeper into Neural Networks & Deep Neural Nets
  • 5. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5 A Neural/Deep Network in a nutshell 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 6. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6 SGD vs GD 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 7. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7 Backpropagation again o Step 1. Compute forward propagations for all layers recursively 𝑎𝑙 = ℎ𝑙 𝑥𝑙 and 𝑥𝑙+1 = 𝑎𝑙 o Step 2. Once done with forward propagation, follow the reverse path. ◦ Start from the last layer and for each new layer compute the gradients ◦ Cache computations when possible to avoid redundant operations o Step 3. Use the gradients 𝜕ℒ 𝜕𝜃𝑙 with Stochastic Gradient Descend to train 𝜕ℒ 𝜕𝑎𝑙 = 𝜕𝑎𝑙+1 𝜕𝑥𝑙+1 𝑇 ⋅ 𝜕ℒ 𝜕𝑎𝑙+1 𝜕ℒ 𝜕𝜃𝑙 = 𝜕𝑎𝑙 𝜕𝜃𝑙 ⋅ 𝜕ℒ 𝜕𝑎𝑙 𝑇
  • 8. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8 o Often loss surfaces are ◦ non-quadratic ◦ highly non-convex ◦ very high-dimensional o Datasets are typically really large to compute complete gradients o No real guarantee that ◦ the final solution will be good ◦ we converge fast to final solution ◦ or that there will be convergence Still, backpropagation can be slow
  • 9. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9 o Stochastically sample “mini-batches” from dataset 𝐷 ◦ The size of 𝐵𝑗 can contain even just 1 sample o Much faster than Gradient Descend o Results are often better o Also suitable for datasets that change over time o Variance of gradients increases when batch size decreases Stochastic Gradient Descend (SGD) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 |𝐵𝑗| ෍ 𝑖 ∈ 𝐵𝑗 𝛻𝜃ℒ𝑖 𝐵𝑗 = 𝑠𝑎𝑚𝑝𝑙𝑒(𝐷)
  • 10. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10 SGD is often better Current solution Full GD gradient New GD solution Noisy SGD gradient Best GD solution Best SGD solution • No guarantee that this is what is going to always happen. • But the noisy SGC gradients can help some times escaping local optima Loss surface
  • 11. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11 SGD is often better o (A bit) Noisy gradients act as regularization o Gradient Descend  Complete gradients o Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them ◦ All training samples are the “absolute representative” of the input distribution ◦ Test data will be no different than training data ◦ Suitable for traditional optimization problems: “find optimal route” ◦ But for ML we cannot make this assumption  test data are always different o Stochastic gradients  sampled training data sample roughly representative gradients ◦ Model does not overfit to the particular training samples
  • 12. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12 SGD is faster Gradient
  • 13. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 13 SGD is faster Gradient 10x What is our gradient now?
  • 14. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 14 SGD is faster 10x What is our gradient now? Gradient
  • 15. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15 o Of course in real situations data do not replicate o However, after a sizeable amount of data there are clusters of data that are similar o Hence, the gradient is approximately alright o Approximate alright is great, is even better in many cases actually SGD is faster
  • 16. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16 o Often datasets are not “rigid” o Imagine Instagram ◦ Let’s assume 1 million of new images uploaded per week and we want to build a “cool picture” classifier ◦ Should “cool pictures” from the previous year have the same as much influence? ◦ No, the learning machine should track these changes o With GD these changes go undetected, as results are averaged by the many more “past” samples ◦ Past “over-dominates” o A properly implemented SGD can track changes much better and give better models ◦ [LeCun2002] SGD for dynamically changed datasets Popular today Popular in 2014 Popular in 2010
  • 17. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17 o Applicable only with SGD o Choose samples with maximum information content o Mini-batches should contain examples from different classes ◦ As different as possible o Prefer samples likely to generate larger errors ◦ Otherwise gradients will be small  slower learning ◦ Check the errors from previous rounds and prefer “hard examples” ◦ Don’t overdo it though :P, beware of outliers o In practice, split your dataset into mini-batches ◦ Each mini-batch is as class-divergent and rich as possible ◦ New epoch  to be safe new batches & new, randomly shuffled examples Shuffling examples Dataset Shuffling at epoch t Shuffling at epoch t+1
  • 18. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18 o Conditions of convergence well understood o Acceleration techniques can be applied ◦ Second order (Hessian based) optimizations are possible ◦ Measuring not only gradients, but also curvatures of the loss surface o Simpler theoretical analysis on weight dynamics and convergence rates Advantages of Gradient Descend batch learning
  • 19. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19 o SGD is preferred to Gradient Descend o Training is orders faster ◦ In real datasets Gradient Descend is not even realistic o Solutions generalize better ◦ More efficient  larger datasets ◦ Larger datasets  better generalization o How many samples per mini-batch? ◦ Hyper-parameter, trial & error ◦ Usually between 32-256 samples In practice
  • 20. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20 Data preprocessing & normalization 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 21. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21 o Center data to be roughly 0 ◦ Activation functions usually “centered” around 0 ◦ Convergence usually faster ◦ Otherwise bias on gradient direction  might slow down learning Data pre-processing ReLU  tanh(𝑥)  𝜎(𝑥)   
  • 22. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22 o Scale input variables to have similar diagonal covariances 𝑐𝑖 = σ𝑗(𝑥𝑖 (𝑗) )2 ◦ Similar covariances  more balanced rate of learning for different weights ◦ Rescaling to 1 is a good choice, unless some dimensions are less important Data pre-processing 𝑥1 , 𝑥2 , 𝑥3  much different covariances 𝜃1 𝜃2 𝑥 = 𝑥1 , 𝑥2 , 𝑥3 𝑇 , 𝜃 = 𝜃1 , 𝜃2 , 𝜃3 𝑇 , 𝑎 = tanh(𝜃Τ 𝑥) 𝜃3 Generated gradients ቚ dℒ 𝑑𝜃 𝑥1,𝑥2,𝑥3 : much different Gradient update harder: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝑑ℒ/𝑑θ1 𝑑ℒ/𝑑θ2 𝑑ℒ/𝑑θ3
  • 23. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23 o Input variables should be as decorrelated as possible ◦ Input variables are “more independent” ◦ Network is forced to find non-trivial correlations between inputs ◦ Decorrelated inputs  Better optimization ◦ Obviously not the case when inputs are by definition correlated (sequences) o Extreme case ◦ extreme correlation (linear dependency) might cause problems [CAUTION] Data pre-processing
  • 24. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24 o Input variables follow a Gaussian distribution (roughly) o In practice: ◦ from training set compute mean and standard deviation ◦ Then subtract the mean from training samples ◦ Then divide the result by the standard deviation Normalization: 𝑁 𝜇, 𝜎2 = 𝑁 0, 1 𝑥 𝑥 − 𝜇 𝑥 − 𝜇 𝜎
  • 25. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25 o Instead of “per-dimension”  all input dimensions simultaneously o If dimensions have similar values (e.g. pixels in natural images) ◦ Compute one 𝜇, 𝜎2 instead of as many as the input variables ◦ Or the per color channel pixel average/variance 𝜇𝑟𝑒𝑑, 𝜎𝑟𝑒𝑑 2 , 𝜇𝑔𝑟𝑒𝑒𝑛, 𝜎𝑔𝑟𝑒𝑒𝑛 2 , 𝜇𝑏𝑙𝑢𝑒, 𝜎𝑏𝑙𝑢𝑒 2 𝑁 𝜇, 𝜎2 = 𝑁 0, 1 − Making things faster
  • 26. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26 o When input dimensions have similar ranges … o … and with the right non-linearlity … o … centering might be enough ◦ e.g. in images all dimensions are pixels ◦ All pixels have more or less the same ranges o Juse make sure images have mean 0 (𝜇 = 0) Even simpler: Centering the input
  • 27. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27 o If 𝐶 the covariance matrix of your dataset, compute eigenvalues and eigenvectors with SVD 𝑈, Σ, 𝑉𝑇 = 𝑠𝑣𝑑(𝐶) o Decorrelate (PCA-ed) dataset by 𝑋𝑟𝑜𝑡 = 𝑈𝑇𝑋 ◦ Subset of eigenvectors 𝑈′ = [𝑢1, … , 𝑢𝑞] to reduce data dimensions o Scaling by square root of eigenvalues to whiten data 𝑋𝑤ℎ𝑡 = 𝑋𝑟𝑜𝑡/ Σ o Not used much with Convolutional Neural Nets ◦ The zero mean normalization is more important PCA Whitening 𝑋𝑟𝑜𝑡 = 𝑈𝑇𝑋 𝑋𝑤ℎ𝑡 = 𝑋𝑟𝑜𝑡/ Σ
  • 28. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28 Example Images taken from A. Karpathy course website: http://cs231n.github.io/neural-networks-2/
  • 29. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29 Data augmentation [Krizhevsky2012] Original Flip Random crop Contrast Tint
  • 30. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30 o Weights change  the distribution of the layer inputs changes per round ◦ Covariance shift o Normalize the layer inputs with batch normalization ◦ Roughly speaking, normalize 𝑥𝑙 to 𝑁(0, 1) and rescale Batch normalization [Ioffe2015] 𝑥𝑙 Layer l input distribution at (t) Layer l input distribution at (t+0.5) Layer l input distribution at (t+1) Backpropagation 𝑥𝑙 𝑥𝑙 Batch Normalization 𝑥𝑙 ℒ 𝑥𝑙 ℒ Batch normalization
  • 31. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31 Batch normalization - Intuitively 𝑥𝑙 Layer l input distribution at (t) Layer l input distribution at (t+0.5) Layer l input distribution at (t+1) Backpropagation 𝑥𝑙 𝑥𝑙 Batch Normalization
  • 32. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32 Batch normalization – The algorithm o 𝜇ℬ ← 1 𝑚 σ𝑖=1 𝑚 𝑥𝑖 [compute mini-batch mean] o 𝜎ℬ ← 1 𝑚 σ𝑖=1 𝑚 𝑥𝑖 − 𝜇ℬ 2 [compute mini-batch variance] o ෝ 𝑥𝑖 ← 𝑥𝑖−𝜇ℬ 𝜎ℬ 2+𝜀 [normalize input] o ෝ 𝑦𝑖 ← 𝛾𝑥𝑖 + 𝛽 [scale and shift input] Trainable parameters
  • 33. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33 o Gradients can be stronger  higher learning rates  faster training ◦ Otherwise maybe exploding or vanishing gradients or getting stuck to local minima o Neurons get activated in a near optimal “regime” o Better model regularization ◦ Neuron activations not deterministic, depend on the batch ◦ Model cannot be overconfident Batch normalization - Benefits
  • 34. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34 Regularization 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℓ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 35. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35 o Neural networks typically have thousands, if not millions of parameters ◦ Usually, the dataset size smaller than the number of parameters o Overfitting is a grave danger o Proper weight regularization is crucial to avoid overfitting θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℓ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆Ω(𝜃) o Possible regularization methods ◦ ℓ2-regularization ◦ ℓ1-regularization ◦ Dropout Regularization
  • 36. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36 o Most important (or most popular) regularization θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆 2 ෍ 𝑙 𝜃𝑙 2 o The ℓ2-regularization can pass inside the gradient descend update rule 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ + 𝜆𝜃𝑙 ⟹ 𝜃 𝑡+1 = 1 − 𝜆𝜂𝑡 𝜃 𝑡 − 𝜂𝑡𝛻𝜃ℒ o 𝜆 is usually about 10−1, 10−2 ℓ2-regularization “Weight decay”, because weights get smaller
  • 37. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37 o ℓ1-regularization is one of the most important techniques θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆 2 ෍ 𝑙 𝜃𝑙 o Also ℓ1-regularization passes inside the gradient descend update rule 𝜃 𝑡+1 = 𝜃 𝑡 − 𝜆𝜂𝑡 𝜃 𝑡 |𝜃 𝑡 | − 𝜂𝑡𝛻𝜃ℒ o ℓ1-regularization  sparse weights ◦ 𝜆 ↗  more weights become 0 ℓ1-regularization Sign function
  • 38. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38 o To tackle overfitting another popular technique is early stopping o Monitor performance on a separate validation set o Training the network will decrease training error, as well validation error (although with a slower rate usually) o Stop when validation error starts increasing ◦ This quite likely means the network starts to overfit Early stopping
  • 39. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39 Dropout [Srivastava2014] o During training, set activations randomly to 0 ◦ Neurons sampled at random from a Bernoulli distribution with $p = 0.5$ o At test time all neurons are used ◦ Neuron activations are reweighted by $p$ (see the sketch below) o Benefits ◦ Reduces complex co-adaptations or co-dependencies between neurons ◦ No “free-rider” neurons that rely on others ◦ Every neuron becomes more robust ◦ Significantly decreases overfitting ◦ Significantly improves training speed
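  A small numpy sketch of this train/test behaviour, assuming activations `a` and keep probability `p` (illustrative names, not the lecture's code):

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    """Classic dropout: during training each activation is kept with
    probability p (Bernoulli mask); at test time all activations are used
    but reweighted by p so the expected input to the next layer matches."""
    if train:
        mask = np.random.rand(*a.shape) < p   # sample which neurons stay on
        return a * mask
    return a * p                              # test time: reweight by p
```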
  • 40. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Original model
  • 41. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 41 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 1
  • 42. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 42 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 1
  • 43. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 43 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 2
  • 44. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 44 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 2
  • 45. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45 Architectural details 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 46. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46 Sigmoid-like activation functions o Straightforward sigmoids are not a very good idea o Symmetric sigmoids converge faster ◦ E.g. tanh, which returns $a(x{=}0) = 0$ ◦ Recommended sigmoid: $a = h(x) = 1.7159\,\tanh(\tfrac{2}{3}x)$ o You can add a linear term to avoid flat areas: $a = h(x) = \tanh(x) + \beta x$ [Figure: plots of $\tanh(x)$ and $\tanh(x) + 0.5x$]
  • 47. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47 RBFs vs “sigmoids” o RBF: $a = h(x) = \sum_j u_j \exp(-\beta_j \|x - w_j\|^2)$ o Sigmoid: $a = h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$ o Sigmoids can cover the full feature space o RBFs are much more local in the feature space ◦ Can be faster to train, but with a more limited range ◦ Can give a better set of basis functions ◦ Preferred in lower-dimensional spaces
  • 48. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48 Rectified Linear Unit (ReLU) module [Krizhevsky2012] o Activation function: $a = h(x) = \max(0, x)$ o Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = 0$ if $x \le 0$, and $1$ if $x > 0$ o Very popular in computer vision and speech recognition o Much faster computation of activations and gradients ◦ No vanishing or exploding problems; only comparison, addition, multiplication o People claim biological plausibility o Sparse activations o No saturation o Non-symmetric o Non-differentiable at 0 o A large gradient during training can cause a neuron to “die”; lower learning rates mitigate the problem
  • 49. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49 ReLU convergence rate [Figure: training curves showing ReLU converging faster than tanh]
  • 50. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50 Other ReLUs o Soft approximation (softplus): $a = h(x) = \ln(1 + e^x)$ o Noisy ReLU: $a = h(x) = \max(0, x + \varepsilon)$, $\varepsilon \sim N(0, \sigma(x))$ o Leaky ReLU: $a = h(x) = x$ if $x > 0$, $0.01x$ otherwise o Parametric ReLU: $a = h(x) = x$ if $x > 0$, $\beta x$ otherwise (the parameter $\beta$ is trainable)
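  For concreteness, numpy versions of these activations (a sketch; for the Parametric ReLU the trainable β would normally live inside the layer's parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # numerically stable ln(1 + e^x)
    return np.logaddexp(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu(x, beta):
    # beta is the trainable parameter of the Parametric ReLU
    return np.where(x > 0, x, beta * x)
```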
  • 51. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51 Architectural hyper-parameters o Number of hidden layers o Number of neurons in each hidden layer o Type of activation functions o Type and amount of regularization
  • 52. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52 Number of neurons, number of hidden layers o These hyper-parameters are dataset dependent o Tip: start small → increase complexity gradually ◦ e.g. start with 2-3 hidden layers ◦ Add more layers → does performance improve? ◦ Add more neurons → does performance improve? o Regularization is very important, use ℓ2 ◦ Even with a very deep or wide network ◦ With strong ℓ2-regularization we avoid overfitting [Figure: generalization vs. model complexity (number of neurons)]
  • 53. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53 Learning rate 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 54. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54 Learning rate o The right learning rate $\eta_t$ is very important for fast convergence ◦ Too strong → gradients overshoot and bounce ◦ Too weak → too small gradients → slow training o A learning rate per weight is often advantageous ◦ Some weights are near convergence, others are not o Rule of thumb ◦ The learning rate of a (shared) weight should be proportional to the square root of the number of connections sharing that weight o Adaptive learning rates are also possible, based on the errors observed ◦ [Sompolinsky1995]
  • 55. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55 Learning rate schedules o Constant ◦ The learning rate remains the same for all epochs o Step decay ◦ Decrease the rate (e.g. to $\eta_t / T$) every $T$ epochs o Inverse decay: $\eta_t = \frac{\eta_0}{1 + \varepsilon t}$ o Exponential decay: $\eta_t = \eta_0 e^{-\varepsilon t}$ o Step decay is often preferred ◦ simple, intuitive, works well, and adds only a single extra hyper-parameter $T$ (e.g. $T = 2$ or $10$); see the sketch below
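  Minimal Python versions of these schedules, assuming a base learning rate `eta0` (the helper names and default decay constants are illustrative):

```python
import math

def constant(eta0, t):
    # same rate for all epochs
    return eta0

def step_decay(eta0, epoch, drop=0.5, every=10):
    # multiply the rate by `drop` every `every` epochs
    return eta0 * (drop ** (epoch // every))

def inverse_decay(eta0, t, eps=1e-3):
    return eta0 / (1.0 + eps * t)

def exponential_decay(eta0, t, eps=1e-3):
    return eta0 * math.exp(-eps * t)
```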
  • 56. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56 Learning rate in practice o Try several log-spaced values $10^{-1}, 10^{-2}, 10^{-3}, \dots$ on a smaller set ◦ Then narrow the search down around the value that gives the lowest error o You can decrease the learning rate every 10 (or some other number of) full training-set epochs ◦ Although this highly depends on your data
  • 57. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57 Weight initialization 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 58. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58 Weight initialization o There are a few contradictory requirements o Weights need to be small enough ◦ around the origin ($\mathbf{0}$) for symmetric functions (tanh, sigmoid) ◦ When training starts, it is better to stimulate the activation functions near their linear regime ◦ larger gradients → faster training o Weights need to be large enough ◦ Otherwise the signal is too weak for any serious learning [Figure: tanh and sigmoid curves, marking the near-linear regime where gradients are largest]
  • 59. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 59 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 60. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 60 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? Answer: Because the output of one module is the input to another o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 61. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 61 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? Answer: Because the output of one module is the input to another o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 62. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62 One way of initializing sigmoid-like neurons o For tanh, initialize the weights uniformly from $\left[-\sqrt{\frac{6}{d_{l-1}+d_l}},\ \sqrt{\frac{6}{d_{l-1}+d_l}}\right]$ ◦ $d_{l-1}$ is the number of input variables to the tanh layer and $d_l$ is the number of output variables o For a sigmoid, use $\left[-4\sqrt{\frac{6}{d_{l-1}+d_l}},\ 4\sqrt{\frac{6}{d_{l-1}+d_l}}\right]$
  • 63. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63 Xavier initialization [Glorot2010] o For $a = \theta x$ the variance is $\mathrm{Var}(a) = E[x]^2\,\mathrm{Var}(\theta) + E[\theta]^2\,\mathrm{Var}(x) + \mathrm{Var}(x)\,\mathrm{Var}(\theta)$ o Since $E[x] = E[\theta] = 0$: $\mathrm{Var}(a) = \mathrm{Var}(x)\,\mathrm{Var}(\theta) \approx d \cdot \mathrm{Var}(x_i)\,\mathrm{Var}(\theta_i)$ o For $\mathrm{Var}(a) = \mathrm{Var}(x) \Rightarrow \mathrm{Var}(\theta_i) = \frac{1}{d}$ o Draw random weights from $\theta \sim N(0, 1/d)$, where $d$ is the number of neurons in the input
  • 64. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64 [He2015] initialization for ReLUs o Unlike sigmoids, ReLUs set the linear activations to 0 about half the time o Double the weight variance ◦ Compensates for the zeroed flat area ◦ Input and output maintain the same variance ◦ Very similar to Xavier initialization o Draw random weights from $w \sim N(0, 2/d)$, where $d$ is the number of neurons in the input
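  A compact numpy sketch of both recipes for a fully connected layer with `d_in` inputs and `d_out` outputs (illustrative helper names):

```python
import numpy as np

def xavier_init(d_in, d_out):
    """Xavier/Glorot: preserve activation variance for tanh/sigmoid-like units."""
    return np.random.randn(d_in, d_out) * np.sqrt(1.0 / d_in)

def he_init(d_in, d_out):
    """He et al.: double the variance to compensate for ReLU zeroing half the inputs."""
    return np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)
```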
  • 65. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65 Loss functions 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 66. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66 Multi-class classification (“Is it a cat? Is it a horse? …”) o Each sample contains only one class ◦ There is only one correct answer per sample o Negative log-likelihood (cross entropy) + softmax: $\mathcal{L}(\theta; x, y) = -\sum_{c=1}^{C} y_c \log a_L^{(c)}$ for all classes $c = 1, \dots, C$ o Hierarchical softmax when $C$ is very large o Hinge loss (aka SVM loss): $\mathcal{L}(\theta; x, y) = \sum_{c=1, c \neq y}^{C} \max(0, a_L^{(c)} - a_L^{(y)} + 1)$ o Squared hinge loss
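  A numpy sketch of both losses for a single sample, assuming `scores` holds the raw outputs $a_L$ of the last layer and `y` is the index of the correct class (illustrative names):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Negative log-likelihood of the correct class under a softmax."""
    scores = scores - scores.max()                  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[y])

def hinge_loss(scores, y, margin=1.0):
    """Multi-class hinge (SVM) loss with a margin of 1."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                                # skip the correct class
    return margins.sum()
```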
  • 67. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67 Multi-class, multi-label classification o Each sample can have many correct answers o Hinge loss and the like ◦ Sigmoid outputs would also work o Each output neuron is independent ◦ “Does this contain a car, yes or no?” ◦ “Does this contain a person, yes or no?” ◦ “Does this contain a motorbike, yes or no?” ◦ “Does this contain a horse, yes or no?” o Instead of “Is this a car, motorbike or person?”, where ◦ $p(\text{car}|x) = 0.55$, $p(\text{m/bike}|x) = 0.25$, $p(\text{person}|x) = 0.15$, $p(\text{horse}|x) = 0.05$ ◦ $p(\text{car}|x) + p(\text{m/bike}|x) + p(\text{person}|x) + p(\text{horse}|x) = 1.0$
  • 68. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68 Regression o The good old Euclidean loss: $\mathcal{L}(\theta; x, y) = \frac{1}{2}\|y - a_L\|_2^2$ o Or an RBF on top of the Euclidean loss: $\mathcal{L}(\theta; x, y) = \sum_j u_j \exp(-\beta_j (y - a_L)^2)$ o Or the ℓ1 distance: $\mathcal{L}(\theta; x, y) = \sum_j |y_j - a_L^{(j)}|$
  • 69. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69 Even better optimizations 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 70. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70 Momentum o Don’t let the update direction switch all the time o Maintain “momentum” from the previous updates o More robust gradients and learning → faster convergence o Nice “physics”-based interpretation ◦ Instead of updating the position of the “ball”, we update its velocity, which in turn updates the position: $u^{(t+1)} = \gamma u^{(t)} - \eta_t \nabla_\theta \mathcal{L}$, $\theta^{(t+1)} = \theta^{(t)} + u^{(t+1)}$ [Figure: on an elongated loss surface, the plain gradient zig-zags while gradient + momentum follows a smoother path]
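  A minimal sketch of this velocity update on numpy arrays (the names are illustrative):

```python
import numpy as np

def momentum_step(theta, u, grad, lr, gamma=0.9):
    """Classic momentum: update the velocity u, then move theta by it."""
    u = gamma * u - lr * grad
    theta = theta + u
    return theta, u
```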
  • 71. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71 Nesterov momentum [Sutskever2013] o Use the look-ahead gradient instead of the current gradient: $u^{(t+1)} = \gamma u^{(t)} - \eta_t \nabla_\theta \mathcal{L}(\theta^{(t)} + \gamma u^{(t)})$, $\theta^{(t+1)} = \theta^{(t)} + u^{(t+1)}$ o Better theoretical convergence o Generally works better with Convolutional Neural Networks [Figure: standard momentum step vs. Nesterov step, where the gradient is evaluated at the look-ahead position after the momentum jump]
  • 72. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72 Second-order optimization o Normally all weights are updated with the same “aggressiveness” ◦ Often some parameters could enjoy more “teaching” ◦ While others are already about there o Adapt the learning per parameter: $\theta^{(t+1)} = \theta^{(t)} - \eta_t H_\mathcal{L}^{-1} \nabla_\theta \mathcal{L}$ o $H_\mathcal{L}$ is the Hessian matrix of $\mathcal{L}$, i.e. the second-order derivatives: $(H_\mathcal{L})_{ij} = \frac{\partial^2 \mathcal{L}}{\partial\theta_i \partial\theta_j}$
  • 73. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73 Second-order optimization methods in practice o Computing the inverse of the Hessian is usually very expensive ◦ Too many parameters o Approximate the Hessian, e.g. with the L-BFGS algorithm ◦ Keeps a memory of past gradients to approximate the inverse Hessian o L-BFGS works alright with (full-batch) gradient descent, but what about SGD? o In practice, SGD with well-tuned momentum works just fine
  • 74. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74 o Adagrad [Duchi2011] o RMSprop o Adam [Kingma2014] Other per-parameter adaptive optimizations
  • 75. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75 Adagrad [Duchi2011] o Schedule ◦ $m_j = \sum_\tau (\nabla_\theta \mathcal{L}_j^{(\tau)})^2 \Rightarrow \theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{\sqrt{m} + \varepsilon} \nabla_\theta \mathcal{L}$ ◦ $\varepsilon$ is a small number to avoid division by 0 ◦ Because $m$ only grows, the updates become gradually smaller and smaller
  • 76. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76 RMSprop o Schedule ◦ $m^{(t)} = \alpha\, m^{(t-1)} + (1 - \alpha)(\nabla_\theta \mathcal{L}^{(t)})^2$, where $\alpha$ is a decay hyper-parameter ◦ $\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{\sqrt{m^{(t)}} + \varepsilon} \nabla_\theta \mathcal{L}$ o A moving average of the squared gradients (compared to Adagrad’s ever-growing sum) ◦ Square rooting boosts small values while it suppresses large values o Large gradients, e.g. a too “noisy” loss surface ◦ Updates are tamed o Small gradients, e.g. stuck in a flat loss-surface ravine ◦ Updates become more aggressive
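  A sketch of one RMSprop step (illustrative names; `m` holds the moving average of squared gradients):

```python
import numpy as np

def rmsprop_step(theta, m, grad, lr, alpha=0.9, eps=1e-8):
    """Keep a moving average of squared gradients and divide by its square root."""
    m = alpha * m + (1.0 - alpha) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(m) + eps)
    return theta, m
```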
  • 77. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77 Adam [Kingma2014] o One of the most popular learning algorithms: $m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\nabla_\theta \mathcal{L}$, $v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2$, $\theta^{(t+1)} = \theta^{(t)} - \eta_t \frac{m^{(t+1)}}{\sqrt{v^{(t+1)}} + \varepsilon}$ o Similar to RMSprop, but with momentum: a first-moment estimate $m$ on top of the second-moment estimate $v$ o Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$
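  A sketch of one Adam step with those recommended values; note that the published algorithm additionally bias-corrects m and v early in training, which is omitted here to stay close to the slide:

```python
import numpy as np

def adam_step(theta, m, v, grad, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-like first moment m, RMSprop-like second moment v."""
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v
```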
  • 78. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78 Visual overview Picture credit: Alec Radford
  • 79. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79 Learning – not computing – the gradients o Learning to learn by gradient descent by gradient descent ◦ [Andrychowicz2016] o $\theta^{(t+1)} = \theta^{(t)} + g_t(\nabla_\theta \mathcal{L}, \varphi)$ o $g_t$ is an “optimizer” with its own parameters $\varphi$ ◦ Implemented as a recurrent network
  • 80. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80 Good practice o Preprocess the data to at least have zero mean o Initialize the weights based on the activation functions ◦ For ReLUs, use Xavier or He [He2015] initialization o Always use ℓ2-regularization and dropout o Use batch normalization
  • 81. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81 Babysitting Deep Nets 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 82. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82 Babysitting Deep Nets o Always check your gradients if they are not computed automatically (see the sketch below) o Check that in the first rounds you get the loss expected from random predictions o Check the network with a few samples ◦ Turn off regularization: you should predictably overfit and reach a loss of 0 ◦ Turn on regularization: the loss should increase o Have a separate validation set ◦ Compare the training and validation loss curves ◦ There should be a gap, but not too large a gap
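  A minimal numerical gradient check, assuming `f(theta)` returns the scalar loss and `analytic_grad` is the gradient produced by your backpropagation code (the helper name and step size are illustrative):

```python
import numpy as np

def gradient_check(f, theta, analytic_grad, h=1e-5):
    """Compare an analytic gradient with a centered finite-difference estimate.
    Returns the maximum relative error over all parameters."""
    num_grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + h
        f_plus = f(theta)
        theta.flat[i] = old - h
        f_minus = f(theta)
        theta.flat[i] = old                    # restore the parameter
        num_grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    denom = np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return np.max(np.abs(num_grad - analytic_grad) / denom)
```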
  • 83. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83 Summary o How to define our model and optimize it in practice o Data preprocessing and normalization o Optimization methods o Regularizations o Architectures and architectural hyper-parameters o Learning rate o Weight initializations o Good practices
  • 84. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84 Reading material & references o http://www.deeplearningbook.org/ ◦ Part II: Chapters 7, 8 [Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016 [He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015 [Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015 [Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014 [Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014 [Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013 [Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012 [Krizhevsky2012] Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012 [Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011 [Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010 [LeCun2002]
  • 85. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85 Next lecture o What are the Convolutional Neural Networks? o Why are they important in Computer Vision? o Differences from standard Neural Networks o How to train a Convolutional Neural Network?