UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1
Lecture 3: Deeper into Deep Learning and Optimizations
Deep Learning @ UvA
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2
o Machine learning paradigm for neural networks
o Backpropagation algorithm, backbone for training neural networks
o Neural network == modular architecture
o Visited different modules, saw how to implement and check them
Previous lecture
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
Lecture overview
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4
Deeper into
Neural Networks &
Deep Neural Nets
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5
A Neural/Deep Network in a nutshell
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6
SGD vs GD
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7
Backpropagation again
o Step 1. Compute the forward propagation for all layers recursively: 𝑎𝑙 = ℎ𝑙(𝑥𝑙) and 𝑥𝑙+1 = 𝑎𝑙
o Step 2. Once done with the forward propagation, follow the reverse path.
◦ Start from the last layer and for each new layer compute the gradients
◦ Cache computations when possible to avoid redundant operations
o Step 3. Use the gradients ∂ℒ/∂𝜃𝑙 with Stochastic Gradient Descent to train
∂ℒ/∂𝑎𝑙 = (∂𝑎𝑙+1/∂𝑥𝑙+1)^T ⋅ ∂ℒ/∂𝑎𝑙+1
∂ℒ/∂𝜃𝑙 = ∂𝑎𝑙/∂𝜃𝑙 ⋅ (∂ℒ/∂𝑎𝑙)^T
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8
o Often loss surfaces are
◦ non-quadratic
◦ highly non-convex
◦ very high-dimensional
o Datasets are typically too large to compute complete gradients over
o No real guarantee that
◦ the final solution will be good
◦ we converge fast to the final solution
◦ or that there will be convergence at all
Still, backpropagation can be slow
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9
o Stochastically sample “mini-batches” 𝐵𝑗 from the dataset 𝐷
◦ A mini-batch 𝐵𝑗 can contain even just 1 sample
o Much faster than Gradient Descent
o Results are often better
o Also suitable for datasets that change over time
o Variance of gradients increases when the batch size decreases
Stochastic Gradient Descent (SGD)
𝜃(𝑡+1) = 𝜃(𝑡) − (𝜂𝑡 / |𝐵𝑗|) Σ_{𝑖 ∈ 𝐵𝑗} 𝛻𝜃ℒ𝑖,  where 𝐵𝑗 = 𝑠𝑎𝑚𝑝𝑙𝑒(𝐷)
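A minimal sketch of the mini-batch SGD update above in Python/NumPy; the dataset X, Y, the gradient function grad_loss and the hyper-parameter values are illustrative assumptions, not part of the slides:

    import numpy as np

    def sgd(theta, X, Y, grad_loss, lr=0.01, batch_size=32, epochs=10):
        n = X.shape[0]
        for epoch in range(epochs):
            idx = np.random.permutation(n)              # reshuffle every epoch
            for start in range(0, n, batch_size):
                batch = idx[start:start + batch_size]   # B_j = sample(D)
                g = grad_loss(theta, X[batch], Y[batch])  # mean gradient over the mini-batch
                theta = theta - lr * g                  # theta_(t+1) = theta_(t) - eta_t * g
        return theta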
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10
SGD is often better
[Figure: a loss surface with the current solution, the full GD gradient leading to the new GD solution, a noisy SGD gradient, and the best GD and SGD solutions]
• No guarantee that this is always what happens.
• But the noisy SGD gradients can sometimes help escape local optima.
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11
SGD is often better
o (A bit) noisy gradients act as regularization
o Gradient Descent → complete gradients
o Complete gradients fit the (arbitrary) data we have optimally, not the distribution that generates them
◦ It is as if all training samples were the “absolute representative” of the input distribution
◦ and as if test data will be no different from the training data
◦ Suitable for traditional optimization problems: “find the optimal route”
◦ But for ML we cannot make this assumption → test data are always different
o Stochastic gradients → the sampled training data give roughly representative gradients
◦ The model does not overfit to the particular training samples
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12
SGD is faster
[Figure: the gradient computed over the full dataset]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 13
SGD is faster
[Figure: the same dataset replicated 10x — what is our gradient now?]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 14
SGD is faster
[Figure repeated: the dataset replicated 10x — what is our gradient now?]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15
o Of course, in real situations the data are not exact replicas of each other
o However, after a sizeable amount of data there are clusters of data points that are similar
o Hence, the mini-batch gradient is approximately right
o Approximately right is great; in many cases it is actually even better
SGD is faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16
o Often datasets are not “rigid”
o Imagine Instagram
◦ Let’s assume 1 million new images are uploaded per week and we want to build a “cool picture” classifier
◦ Should “cool pictures” from the previous year have as much influence?
◦ No, the learning machine should track these changes
o With GD these changes go undetected, as results are
averaged by the many more “past” samples
◦ Past “over-dominates”
o A properly implemented SGD can track changes much
better and give better models
◦ [LeCun2002]
SGD for dynamically changed datasets
[Figure: example images popular in 2010, in 2014, and today]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17
o Applicable only with SGD
o Choose samples with maximum information content
o Mini-batches should contain examples from different classes
◦ As different as possible
o Prefer samples likely to generate larger errors
◦ Otherwise gradients will be small → slower learning
◦ Check the errors from previous rounds and prefer “hard examples”
◦ Don’t overdo it though :P, beware of outliers
o In practice, split your dataset into mini-batches (see the sketch after this slide)
◦ Each mini-batch is as class-divergent and rich as possible
◦ New epoch → to be safe, draw new batches with new, randomly shuffled examples
Shuffling examples
[Figure: the dataset shuffled into different mini-batches at epoch t and at epoch t+1]
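A minimal sketch, assuming NumPy, of re-shuffling the dataset into new mini-batches at every epoch as described above (the class-balancing of each batch is left out; this only shows the reshuffling):

    import numpy as np

    def make_minibatches(n_samples, batch_size, rng):
        """Return a new random split into mini-batches; call once per epoch."""
        idx = rng.permutation(n_samples)   # new, randomly shuffled examples
        return [idx[i:i + batch_size] for i in range(0, n_samples, batch_size)]

    rng = np.random.default_rng(0)
    for epoch in range(5):
        for batch_idx in make_minibatches(n_samples=1000, batch_size=64, rng=rng):
            pass  # compute the mini-batch gradient and update the parameters here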
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18
o Conditions of convergence well understood
o Acceleration techniques can be applied
◦ Second order (Hessian based) optimizations are possible
◦ Measuring not only gradients, but also curvatures of the loss surface
o Simpler theoretical analysis on weight dynamics and convergence rates
Advantages of Gradient Descent (batch) learning
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19
o SGD is preferred to Gradient Descent
o Training is orders of magnitude faster
◦ On real datasets Gradient Descent is not even realistic
o Solutions generalize better
◦ More efficient → larger datasets
◦ Larger datasets → better generalization
o How many samples per mini-batch?
◦ A hyper-parameter, trial & error
◦ Usually between 32 and 256 samples
In practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20
Data preprocessing &
normalization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21
o Center the data to be roughly 0
◦ Activation functions are usually “centered” around 0
◦ Convergence is usually faster
◦ Otherwise there is a bias in the gradient direction → might slow down learning
Data pre-processing
ReLU ✗ (not zero-centered), tanh(𝑥) ✓ (zero-centered), 𝜎(𝑥) ✗ (not zero-centered)
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22
o Scale input variables to have similar diagonal covariances 𝑐𝑖 = Σ𝑗 (𝑥𝑖^(𝑗))²
◦ Similar covariances → more balanced rate of learning for different weights
◦ Rescaling to 1 is a good choice, unless some dimensions are less important
Data pre-processing
𝑥 = [𝑥1, 𝑥2, 𝑥3]^T, 𝜃 = [𝜃1, 𝜃2, 𝜃3]^T, 𝑎 = tanh(𝜃^T 𝑥)
𝑥1, 𝑥2, 𝑥3 → much different covariances
Generated gradients dℒ/d𝜃 evaluated at 𝑥1, 𝑥2, 𝑥3: much different
Gradient update harder: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 [dℒ/d𝜃1, dℒ/d𝜃2, dℒ/d𝜃3]^T
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23
o Input variables should be as decorrelated as possible
◦ Input variables are “more independent”
◦ Network is forced to find non-trivial correlations between inputs
◦ Decorrelated inputs → better optimization
◦ Obviously not the case when inputs are by definition correlated (sequences)
o Extreme case
◦ extreme correlation (linear dependency) might cause problems [CAUTION]
Data pre-processing
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24
o Assume the input variables follow a Gaussian distribution (roughly)
o In practice:
◦ compute the mean and standard deviation from the training set
◦ then subtract the mean from the training samples
◦ then divide the result by the standard deviation
Normalization: 𝑁(𝜇, 𝜎²) → 𝑁(0, 1)
[Figure: 𝑥 → 𝑥 − 𝜇 → (𝑥 − 𝜇)/𝜎]
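A minimal NumPy sketch of this per-dimension standardization; note the statistics come from the training set only and are then reused for the test set (variable names are illustrative):

    import numpy as np

    def fit_standardizer(X_train, eps=1e-8):
        mu = X_train.mean(axis=0)           # per-dimension mean from the training set
        sigma = X_train.std(axis=0) + eps   # per-dimension std; eps avoids division by zero
        return mu, sigma

    def standardize(X, mu, sigma):
        return (X - mu) / sigma             # x -> (x - mu) / sigma, now roughly N(0, 1)

    # mu, sigma = fit_standardizer(X_train)
    # X_train_n, X_test_n = standardize(X_train, mu, sigma), standardize(X_test, mu, sigma)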
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25
o Instead of “per-dimension” → over all input dimensions simultaneously
o If dimensions have similar values (e.g. pixels in natural images)
◦ Compute one (𝜇, 𝜎²) instead of one per input variable
◦ Or the per-color-channel pixel average/variance: (𝜇_red, 𝜎²_red), (𝜇_green, 𝜎²_green), (𝜇_blue, 𝜎²_blue)
𝑁(𝜇, 𝜎²) → 𝑁(0, 1) – Making things faster
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26
o When input dimensions have similar ranges …
o … and with the right non-linearity …
o … centering might be enough
◦ e.g. in images all dimensions are pixels
◦ All pixels have more or less the same range
o Just make sure images have mean 0 (𝜇 = 0)
Even simpler: Centering the input
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27
o If 𝐶 is the covariance matrix of your dataset, compute its eigenvalues and eigenvectors with SVD: 𝑈, Σ, 𝑉^T = svd(𝐶)
o Decorrelate (PCA) the dataset by 𝑋_rot = 𝑈^T 𝑋
◦ Keep a subset of eigenvectors 𝑈′ = [𝑢1, … , 𝑢𝑞] to reduce the data dimensions
o Scale by the square root of the eigenvalues to whiten the data: 𝑋_wht = 𝑋_rot / √Σ
o Not used much with Convolutional Neural Nets
◦ The zero-mean normalization is more important
PCA Whitening
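A minimal NumPy sketch of the PCA whitening recipe above, assuming the data X has shape (num_samples, num_dims) so the rotation is written as X @ U instead of U^T X:

    import numpy as np

    def pca_whiten(X, eps=1e-5):
        X = X - X.mean(axis=0)              # zero-center first
        C = np.cov(X, rowvar=False)         # covariance matrix of the data
        U, S, Vt = np.linalg.svd(C)         # eigenvectors U, eigenvalues S
        X_rot = X @ U                       # decorrelate (PCA rotation)
        X_wht = X_rot / np.sqrt(S + eps)    # scale by sqrt of eigenvalues to whiten
        return X_rot, X_wht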
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28
Example
Images taken from A. Karpathy’s course website: http://cs231n.github.io/neural-networks-2/
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29
Data augmentation [Krizhevsky2012]
[Examples: Original, Flip, Random crop, Contrast, Tint]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30
o Weights change → the distribution of the layer inputs changes per round
◦ (Internal) covariate shift
o Normalize the layer inputs with batch normalization
◦ Roughly speaking, normalize 𝑥𝑙 to 𝑁(0, 1) and rescale
Batch normalization [Ioffe2015]
[Figure: the distribution of layer 𝑙 inputs 𝑥𝑙 at steps (t), (t+0.5), (t+1) during backpropagation, with and without Batch Normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31
Batch normalization - Intuitively
[Figure repeated: layer 𝑙 input distributions at (t), (t+0.5), (t+1), with and without Batch Normalization]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32
Batch normalization – The algorithm
o 𝜇ℬ ← (1/𝑚) Σ_{𝑖=1}^{𝑚} 𝑥𝑖 [compute the mini-batch mean]
o 𝜎ℬ² ← (1/𝑚) Σ_{𝑖=1}^{𝑚} (𝑥𝑖 − 𝜇ℬ)² [compute the mini-batch variance]
o 𝑥̂𝑖 ← (𝑥𝑖 − 𝜇ℬ) / √(𝜎ℬ² + 𝜀) [normalize the input]
o 𝑦𝑖 ← 𝛾𝑥̂𝑖 + 𝛽 [scale and shift the input; 𝛾 and 𝛽 are trainable parameters]
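A minimal NumPy sketch of the batch-normalization forward pass above for a mini-batch x of shape (batch_size, num_features); gamma and beta would be trainable parameters, and the running statistics used at test time are omitted:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        mu = x.mean(axis=0)                    # mini-batch mean, per feature
        var = x.var(axis=0)                    # mini-batch variance, per feature
        x_hat = (x - mu) / np.sqrt(var + eps)  # normalize to roughly N(0, 1)
        return gamma * x_hat + beta            # scale and shift (trainable gamma, beta)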
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33
o Gradients can be stronger → higher learning rates → faster training
◦ Otherwise possibly exploding or vanishing gradients, or getting stuck in local minima
o Neurons get activated in a near-optimal “regime”
o Better model regularization
◦ Neuron activations are not deterministic, they depend on the batch
◦ The model cannot be overconfident
Batch normalization - Benefits
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34
Regularization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35
o Neural networks typically have thousands, if not millions, of parameters
◦ Usually, the dataset size is smaller than the number of parameters
o Overfitting is a grave danger
o Proper weight regularization is crucial to avoid overfitting
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + 𝜆Ω(𝜃)
o Possible regularization methods
◦ ℓ2-regularization
◦ ℓ1-regularization
◦ Dropout
Regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36
o The most important (or most popular) regularization
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + (𝜆/2) Σ𝑙 ‖𝜃𝑙‖²
o The ℓ2-regularization can pass inside the gradient descent update rule
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡(𝛻𝜃ℒ + 𝜆𝜃𝑙) ⟹ 𝜃(𝑡+1) = (1 − 𝜆𝜂𝑡)𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
o 𝜆 is usually about 10^−1, 10^−2
ℓ2-regularization
Also called “weight decay”, because the weights get smaller
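A minimal sketch of how the ℓ2 penalty enters one SGD step as weight decay; grad_loss, lr and lam are illustrative names for the data-loss gradient, the learning rate and 𝜆:

    def sgd_step_l2(theta, grad_loss, lr=0.01, lam=0.01):
        # theta_(t+1) = (1 - lam * lr) * theta_(t) - lr * grad_loss
        return (1.0 - lam * lr) * theta - lr * grad_loss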
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37
o ℓ1-regularization is another of the most important regularization techniques
θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L)) + (𝜆/2) Σ𝑙 |𝜃𝑙|
o ℓ1-regularization also passes inside the gradient descent update rule
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜆𝜂𝑡 𝜃(𝑡)/|𝜃(𝑡)| − 𝜂𝑡𝛻𝜃ℒ
◦ 𝜃(𝑡)/|𝜃(𝑡)| is the sign function
o ℓ1-regularization → sparse weights
◦ 𝜆 ↗ → more weights become 0
ℓ1-regularization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38
o To tackle overfitting, another popular technique is early stopping
o Monitor performance on a separate validation set
o Training the network will decrease the training error, as well as the validation error (although usually at a slower rate)
o Stop when the validation error starts increasing
◦ This quite likely means the network is starting to overfit
Early stopping
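A minimal sketch of the early-stopping loop described above; train_one_epoch, validation_error and model.copy() are assumed placeholders for your own training, evaluation and parameter-snapshot code:

    def train_with_early_stopping(model, train_one_epoch, validation_error,
                                  max_epochs=100, patience=5):
        best_err, best_state, bad_epochs = float("inf"), None, 0
        for epoch in range(max_epochs):
            train_one_epoch(model)
            err = validation_error(model)      # monitor a separate validation set
            if err < best_err:
                best_err, best_state, bad_epochs = err, model.copy(), 0
            else:
                bad_epochs += 1                # validation error went up
                if bad_epochs >= patience:     # stop: the network likely overfits
                    break
        return best_state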
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39
o During training, set activations randomly to 0
◦ Neurons are dropped at random according to a Bernoulli distribution with 𝑝 = 0.5
o At test time all neurons are used
◦ Neuron activations are reweighted by 𝑝
o Benefits
◦ Reduces complex co-adaptations or co-dependencies between neurons
◦ No “free-rider” neurons that rely on others
◦ Every neuron becomes more robust
◦ Decreases overfitting significantly
◦ Improves training speed significantly
Dropout [Srivastava2014]
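A minimal NumPy sketch of dropout as described above (keep a unit with probability p = 0.5 during training, rescale activations by p at test time); the "inverted dropout" variant used in most libraries instead rescales during training:

    import numpy as np

    def dropout(a, p=0.5, training=True, rng=np.random.default_rng()):
        if training:
            mask = rng.binomial(1, p, size=a.shape)  # Bernoulli(p): 1 = keep, 0 = drop
            return a * mask
        return a * p                                 # test time: reweight activations by p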
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40
o Effectively, a different architecture at every training epoch
◦ Similar to model ensembles
Dropout
[Figure: the original model; in each epoch a different random subset of neurons is dropped]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45
Architectural details
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46
o Straightforward sigmoids are not a very good idea
o Symmetric sigmoids converge faster
◦ E.g. tanh, which returns a(x=0) = 0
◦ Recommended sigmoid: 𝑎 = ℎ(𝑥) = 1.7159 tanh(2𝑥/3)
o You can add a linear term to avoid flat areas: 𝑎 = ℎ(𝑥) = tanh(𝑥) + 𝛽𝑥
Sigmoid-like activation functions
[Figure: tanh(𝑥) + 0.5𝑥 vs. tanh(𝑥)]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47
o RBF: 𝑎 = ℎ(𝑥) = Σ𝑗 𝑢𝑗 exp(−𝛽𝑗 ‖𝑥 − 𝑤𝑗‖²)
o Sigmoid: 𝑎 = ℎ(𝑥) = 𝜎(𝑥) = 1 / (1 + 𝑒^−𝑥)
o Sigmoids can cover the full feature space
o RBFs are much more local in the feature space
◦ Can be faster to train, but with a more limited range
◦ Can give a better set of basis functions
◦ Preferred in lower-dimensional spaces
RBFs vs “Sigmoids”
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48
o Activation function 𝑎 = ℎ(𝑥) = max(0, 𝑥)
o Gradient w.r.t. the input: ∂𝑎/∂𝑥 = 0 if 𝑥 ≤ 0, and 1 if 𝑥 > 0
o Very popular in computer vision and speech recognition
o Much faster computations and gradients
◦ No vanishing or exploding problems; only comparison, addition, multiplication
o People claim biological plausibility
o Sparse activations
o No saturation
o Non-symmetric
o Non-differentiable at 0
o A large gradient during training can cause a neuron to “die”. Lower learning rates mitigate the problem
Rectified Linear Unit (ReLU) module [Krizhevsky2012]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49
ReLU convergence rate
[Figure: training convergence of ReLU vs. tanh]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50
o Soft approximation (softplus): 𝑎 = ℎ(𝑥) = ln(1 + 𝑒^𝑥)
o Noisy ReLU: 𝑎 = ℎ(𝑥) = max(0, 𝑥 + 𝜀), 𝜀 ~ 𝛮(0, 𝜎(𝑥))
o Leaky ReLU: 𝑎 = ℎ(𝑥) = 𝑥 if 𝑥 > 0, 0.01𝑥 otherwise
o Parametric ReLU: 𝑎 = ℎ(𝑥) = 𝑥 if 𝑥 > 0, 𝛽𝑥 otherwise (the parameter 𝛽 is trainable)
Other ReLUs
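A small NumPy sketch of the activations above, just to make the piecewise definitions concrete:

    import numpy as np

    def relu(x):        return np.maximum(0.0, x)
    def softplus(x):    return np.log1p(np.exp(x))           # ln(1 + e^x)
    def leaky_relu(x):  return np.where(x > 0, x, 0.01 * x)
    def prelu(x, beta): return np.where(x > 0, x, beta * x)  # beta is trainable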
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51
o Number of hidden layers
o Number of neurons in each hidden layer
o Type of activation functions
o Type and amount of regularization
Architectural hyper-parameters
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52
o Dataset dependent hyperparameters
o Tip: Start small → increase complexity gradually
◦ e.g. start with 2-3 hidden layers
◦ Add more layers → does performance improve?
◦ Add more neurons → does performance improve?
o Regularization is very important, use ℓ2
◦ Even with a very deep or wide network
◦ With strong ℓ2-regularization we avoid overfitting
Number of neurons, number of hidden layers
[Figure: generalization vs. model complexity (number of neurons)]
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53
Learning rate
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54
o The right learning rate 𝜂𝑡 is very important for fast convergence
◦ Too strong → gradients overshoot and bounce
◦ Too weak → too small gradients → slow training
o A learning rate per weight is often advantageous
◦ Some weights are near convergence, others are not
o Rule of thumb
◦ Learning rate of (shared) weights proportional to the square root of the number of connections sharing the weight
o Adaptive learning rates are also possible, based on the errors observed
◦ [Sompolinsky1995]
Learning rate
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55
o Constant
◦ The learning rate remains the same for all epochs
o Step decay
◦ Decrease the learning rate by a fixed factor every 𝑇 epochs
o Inverse decay: 𝜂𝑡 = 𝜂0 / (1 + 𝜀𝑡)
o Exponential decay: 𝜂𝑡 = 𝜂0 𝑒^(−𝜀𝑡)
o Often step decay is preferred
◦ simple, intuitive, works well, and only a single extra hyper-parameter 𝑇 (e.g. 𝑇 = 2 or 10)
Learning rate schedules
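A minimal sketch of these schedules as functions of the epoch t; eta0, eps, decay_factor and T are illustrative hyper-parameter names:

    import math

    def constant_lr(eta0, t):                        return eta0
    def step_decay(eta0, t, decay_factor=0.5, T=10): return eta0 * decay_factor ** (t // T)
    def inverse_decay(eta0, t, eps=0.1):             return eta0 / (1.0 + eps * t)
    def exponential_decay(eta0, t, eps=0.1):         return eta0 * math.exp(-eps * t)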
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56
o Try several log-spaced values 10^−1, 10^−2, 10^−3, … on a smaller set
◦ Then narrow it down from there, around where you get the lowest error
o You can decrease the learning rate every 10 (or some other number of) full training-set epochs
◦ Although this highly depends on your data
Learning rate in practice
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57
Weight initialization
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58
o There are a few contradictory requirements
o Weights need to be small enough
◦ around the origin (𝟎) for symmetric functions (tanh, sigmoid)
◦ when training starts it is better to stimulate activation functions near their linear regime
◦ larger gradients → faster training
o Weights need to be large enough
◦ otherwise the signal is too weak for any serious learning
Weight initialization
[Figure: sigmoid-like activations with their linear regime around the origin, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 60
o Weights must be initialized to preserve the variance of the activations during the forward and backward computations
◦ Especially for deep learning
◦ All neurons then operate at their full capacity
Question: Why similar input/output variance?
Answer: Because the output of one module is the input to another
o Good practice: initialize weights to be asymmetric
◦ Don’t give the same value to all weights (like all 𝟎)
◦ In that case all neurons generate the same gradient → no learning
o Generally speaking, initialization depends on
◦ the non-linearities
◦ the data normalization
Weight initialization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62
o For tanh, initialize weights uniformly from [−√(6/(𝑑𝑙−1 + 𝑑𝑙)), √(6/(𝑑𝑙−1 + 𝑑𝑙))]
◦ 𝑑𝑙−1 is the number of input variables to the tanh layer and 𝑑𝑙 is the number of the output variables
o For a sigmoid, use [−4 ∙ √(6/(𝑑𝑙−1 + 𝑑𝑙)), 4 ∙ √(6/(𝑑𝑙−1 + 𝑑𝑙))]
One way of initializing sigmoid-like neurons
[Figure: the linear regime of the activation around the origin, where gradients are large]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63
o For 𝑎 = 𝜃𝑥 the variance is Var(𝑎) = E[𝑥]² Var(𝜃) + E[𝜃]² Var(𝑥) + Var(𝑥) Var(𝜃)
o Since E[𝑥] = E[𝜃] = 0: Var(𝑎) = Var(𝑥) Var(𝜃) ≈ 𝑑 ⋅ Var(𝑥𝑖) Var(𝜃𝑖)
o For Var(𝑎) = Var(𝑥) ⇒ Var(𝜃𝑖) = 1/𝑑
o Draw random weights from 𝜃 ~ 𝑁(0, 1/𝑑), where 𝑑 is the number of neurons in the input
Xavier initialization [Glorot2010]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64
o Unlike sigmoids, ReLUs set the linear activations to 0 half the time
o So double the weight variance
◦ Compensates for the zero flat-area
◦ Input and output maintain the same variance
◦ Very similar to Xavier initialization
o Draw random weights from 𝑤 ~ 𝑁(0, 2/𝑑), where 𝑑 is the number of neurons in the input
[He2015] initialization for ReLUs
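A minimal NumPy sketch of the Xavier and He initializations above for a weight matrix of shape (d_in, d_out); the d of the slides is the number of input neurons d_in, and N(0, v) means a Gaussian with variance v, so the standard deviation is √v:

    import numpy as np

    def xavier_init(d_in, d_out, rng=np.random.default_rng()):
        return rng.normal(0.0, np.sqrt(1.0 / d_in), size=(d_in, d_out))  # theta ~ N(0, 1/d)

    def he_init(d_in, d_out, rng=np.random.default_rng()):
        return rng.normal(0.0, np.sqrt(2.0 / d_in), size=(d_in, d_out))  # w ~ N(0, 2/d) for ReLU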
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65
Loss functions
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66
o Our samples contain only one class
◦ There is only one correct answer per sample (“Is it a cat? Is it a horse? …”)
o Negative log-likelihood (cross entropy) + Softmax
ℒ(𝜃; 𝑥, 𝑦) = −Σ_{𝑐=1}^{𝐶} 𝑦𝑐 log 𝑎𝐿^𝑐, for all classes 𝑐 = 1, …, 𝐶
o Hierarchical softmax when 𝐶 is very large
o Hinge loss (aka SVM loss)
ℒ(𝜃; 𝑥, 𝑦) = Σ_{𝑐=1, 𝑐≠𝑦}^{𝐶} max(0, 𝑎𝐿^𝑐 − 𝑎𝐿^𝑦 + 1)
o Squared hinge loss
Multi-class classification
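A minimal NumPy sketch of the softmax + cross-entropy loss above for a single sample; logits are the network outputs a_L before the softmax and y is the index of the correct class:

    import numpy as np

    def softmax_cross_entropy(logits, y):
        z = logits - logits.max()         # stabilize the exponentials
        p = np.exp(z) / np.exp(z).sum()   # softmax probabilities over the C classes
        return -np.log(p[y])              # negative log-likelihood of the true class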
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67
o Each sample can have many correct answers
o Hinge loss and the likes
◦ Sigmoid outputs would also work
o Each output neuron is independent
◦ “Does this contain a car, yes or no?”
◦ “Does this contain a person, yes or no?”
◦ “Does this contain a motorbike, yes or no?”
◦ “Does this contain a horse, yes or no?”
o Instead of “Is this a car, motorbike or person?”
◦ 𝑝(𝑐𝑎𝑟|𝑥) = 0.55, 𝑝(𝑚/𝑏𝑖𝑘𝑒|𝑥) = 0.25, 𝑝(𝑝𝑒𝑟𝑠𝑜𝑛|𝑥) = 0.15, 𝑝(ℎ𝑜𝑟𝑠𝑒|𝑥) = 0.05
◦ 𝑝(𝑐𝑎𝑟|𝑥) + 𝑝(𝑚/𝑏𝑖𝑘𝑒|𝑥) + 𝑝(𝑝𝑒𝑟𝑠𝑜𝑛|𝑥) + 𝑝(ℎ𝑜𝑟𝑠𝑒|𝑥) = 1.0
Multi-class, multi-label classification
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68
o The good old Euclidean loss
ℒ(𝜃; 𝑥, 𝑦) = ½ ‖𝑦 − 𝑎𝐿‖²₂
o Or RBF on top of the Euclidean loss
ℒ(𝜃; 𝑥, 𝑦) = Σ𝑗 𝑢𝑗 exp(−𝛽𝑗 (𝑦 − 𝑎𝐿)²)
o Or the ℓ1 distance
ℒ(𝜃; 𝑥, 𝑦) = Σ𝑗 |𝑦𝑗 − 𝑎𝐿^𝑗|
Regression
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69
Even better
optimizations
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70
o Don’t let the gradient direction switch all the time
o Maintain “momentum” from the previous updates
o More robust gradients and learning → faster convergence
o Nice “physics”-based interpretation
◦ Instead of updating the position of the “ball”, we update the velocity, which updates the position
Momentum
𝑢𝜃^(𝑡+1) = 𝛾 𝑢𝜃^(𝑡) − 𝜂𝑡 𝛻𝜃ℒ
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑢𝜃^(𝑡+1)
[Figure: a loss surface with the plain gradient vs. the gradient + momentum trajectory]
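A minimal sketch of the momentum update above; u is the velocity, gamma the momentum coefficient (e.g. 0.9) and grad the current gradient:

    def momentum_step(theta, u, grad, lr=0.01, gamma=0.9):
        u = gamma * u - lr * grad   # update the velocity ...
        theta = theta + u           # ... which updates the position
        return theta, u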
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71
o Use the look-ahead (future) gradient instead of the current gradient
o Better theoretical convergence
o Generally works better with Convolutional Neural Networks
Nesterov Momentum [Sutskever2013]
𝑢𝜃^(𝑡+1) = 𝛾 𝑢𝜃^(𝑡) − 𝜂𝑡 𝛻𝜃ℒ(𝜃(𝑡) + 𝛾 𝑢𝜃^(𝑡))
𝜃(𝑡+1) = 𝜃(𝑡) + 𝑢𝜃^(𝑡+1)
[Figure: plain momentum vs. Nesterov momentum, which uses the look-ahead gradient from the next step]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72
o Normally all weights are updated with the same “aggressiveness”
◦ Often some parameters could enjoy more “teaching”
◦ While others are already about there
o Adapt the learning per parameter
𝜃(𝑡+1) = 𝜃(𝑡) − 𝐻ℒ^(−1) 𝜂𝑡 𝛻𝜃ℒ
o 𝐻ℒ is the Hessian matrix of ℒ (second-order derivatives): 𝐻ℒ^(𝑖𝑗) = ∂²ℒ / (∂𝜃𝑖 ∂𝜃𝑗)
Second order optimization
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73
o The inverse of the Hessian is usually very expensive
◦ Too many parameters
o Approximate the Hessian instead, e.g. with the L-BFGS algorithm
◦ Keeps a memory of past gradients to approximate the inverse Hessian
o L-BFGS works alright with Gradient Descent. What about SGD?
o In practice, SGD with some good momentum works just fine
Second order optimization methods in practice
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74
o Adagrad [Duchi2011]
o RMSprop
o Adam [Kingma2014]
Other per-parameter adaptive optimizations
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75
o Schedule
◦ 𝑚𝑗 = Σ𝜏 (𝛻𝜃ℒ𝑗)² ⟹ 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ / (√𝑚 + 𝜀)
◦ 𝜀 is a small number to avoid division by 0
◦ The effective updates become gradually smaller and smaller
Adagrad [Duchi2011]
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76
o Schedule
◦ 𝑚𝑗^(𝑡) = 𝛼 𝑚𝑗^(𝑡−1) + (1 − 𝛼) (𝛻𝜃^(𝑡)ℒ𝑗)²  (𝛼 is the decay hyper-parameter)
◦ 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ / (√𝑚 + 𝜀)
o A moving average of the squared gradients, compared to Adagrad’s full sum
o Large gradients, e.g. a too “noisy” loss surface
◦ Updates are tamed
o Small gradients, e.g. stuck in a flat loss-surface ravine
◦ Updates become more aggressive
o Square rooting boosts small values while suppressing large values
RMSprop
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77
o One of the most popular learning algorithms
𝑚^(𝑡+1) = 𝛽1 𝑚^(𝑡) + (1 − 𝛽1) 𝛻𝜃ℒ  [first moment: momentum on the gradient]
𝑣^(𝑡+1) = 𝛽2 𝑣^(𝑡) + (1 − 𝛽2) (𝛻𝜃ℒ)²  [second moment: moving average of squared gradients]
𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝑚^(𝑡+1) / (√(𝑣^(𝑡+1)) + 𝜀)
o Similar to RMSprop, but with momentum
o Recommended values: 𝛽1 = 0.9, 𝛽2 = 0.999, 𝜀 = 10^−8
Adam [Kingma2014]
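A minimal NumPy sketch of the Adam update as written above (without the bias-correction terms of the full [Kingma2014] algorithm):

    import numpy as np

    def adam_step(theta, m, v, grad, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
        v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (squared gradients)
        theta = theta - lr * m / (np.sqrt(v) + eps)  # per-parameter adaptive update
        return theta, m, v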
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78
Visual overview
Picture credit: Alec Radford
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79
o Learning to learn by gradient descent by gradient descent
◦ [Andrychowicz2016]
o 𝜃(𝑡+1) = 𝜃(𝑡) + 𝑔𝑡(𝛻𝜃ℒ, 𝜑)
o 𝑔𝑡 is an “optimizer” with its own parameters 𝜑
◦ Implemented as a recurrent network
Learning –not computing– the gradients
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80
Good practice
o Preprocess the data to at least have 0 mean
o Initialize weights based on the activation functions
◦ For ReLUs, Xavier or He [He2015] initialization
o Always use ℓ2-regularization and dropout
o Use batch normalization
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81
Babysitting
Deep Nets
1. The Neural Network: 𝑎𝐿(𝑥; 𝜃1,…,L) = ℎ𝐿(ℎ𝐿−1(… ℎ1(𝑥, θ1) …, θ𝐿−1), θ𝐿)
2. Learning by minimizing the empirical error: θ∗ ← arg min𝜃 Σ_{(𝑥,𝑦)⊆(𝑋,𝑌)} ℒ(𝑦, 𝑎𝐿(𝑥; 𝜃1,…,L))
3. Optimizing with Gradient Descent-based methods: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82
o Always check your gradients if they are not computed automatically (see the numerical check after this slide)
o Check that in the first round you get the loss expected for random predictions
o Check the network on a few samples
◦ Turn off regularization. You should predictably overfit and reach a loss of 0
◦ Turn on regularization. The loss should increase
o Have a separate validation set
◦ Compare the curves on the training and validation sets
◦ There should be a gap, but not too large
Babysitting Deep Nets
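A minimal sketch of a numerical gradient check via central differences, to compare against your analytic backpropagation gradients; loss_fn(theta) is assumed to return a scalar loss for a NumPy parameter array theta:

    import numpy as np

    def numerical_gradient(loss_fn, theta, eps=1e-5):
        grad = np.zeros_like(theta)
        for i in range(theta.size):
            e = np.zeros_like(theta)
            e.flat[i] = eps
            # central difference: (L(theta + eps) - L(theta - eps)) / (2 * eps)
            grad.flat[i] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
        return grad

    # the relative error between numerical and analytic gradients should be small (e.g. < 1e-5)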
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83
Summary
o How to define our model and optimize it in practice
o Data preprocessing and normalization
o Optimization methods
o Regularizations
o Architectures and architectural hyper-parameters
o Learning rate
o Weight initializations
o Good practices
UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84
o http://www.deeplearningbook.org/
◦ Part II: Chapter 7, 8
[Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas, Learning to learn by gradient descent by gradient
descent, arXiv, 2016
[He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015
[Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015
[Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014
[Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from
Overfitting, JMLR, 2014
[Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013
[Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012
[Krizhevsky2012] Krizhevsky, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
[Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011
[Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010
[LeCun2002] LeCun, Bottou, Orr, Müller. Efficient BackProp, Neural Networks: Tricks of the Trade, Springer, 2002
Reading material & references
UVA DEEP LEARNING COURSE
EFSTRATIOS GAVVES & MAX WELLING
DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85
Next lecture
o What are the Convolutional Neural Networks?
o Why are they important in Computer Vision?
o Differences from standard Neural Networks
o How to train a Convolutional Neural Network?
More Related Content

PDF
lecture2.pdf
PDF
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
PPTX
An Introduction to Deep Learning
PPTX
Introduction to Deep Learning
PPTX
Introduction to Deep learning and H2O for beginner's
PPTX
DeepLearningLecture.pptx
PPTX
1. Introduction to deep learning.pptx
PDF
Cheatsheet deep-learning-tips-tricks
lecture2.pdf
Training Deep Networks with Backprop (D1L4 Insight@DCU Machine Learning Works...
An Introduction to Deep Learning
Introduction to Deep Learning
Introduction to Deep learning and H2O for beginner's
DeepLearningLecture.pptx
1. Introduction to deep learning.pptx
Cheatsheet deep-learning-tips-tricks

Similar to lecture3.pdf (20)

PDF
The Machinery behind Deep Learning
PDF
Chap 8. Optimization for training deep models
PPTX
Chapter10.pptx
PPTX
Deep learning crash course
PDF
Deep Learning: concepts and use cases (October 2018)
PPTX
1. Introduction to deep learning.pptx
PDF
Deep Learning Class #1 - Go Deep or Go Home
PDF
DL Classe 1 - Go Deep or Go Home
PDF
Separating Hype from Reality in Deep Learning with Sameer Farooqui
PDF
Dep Neural Networks introduction new.pdf
PPTX
Introduction to deep Learning Fundamentals
PPTX
AD3501 - DL Unit-1 PPT.pptx python syllabus
PPTX
Techniques in Deep Learning
PPTX
Nimrita deep learning
PPTX
Deep Neural Network Module 3A Optimization.pptx
PDF
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
PPTX
Deep Learning
PPTX
Training DNN Models - II.pptx
PPTX
Deep Learning in Recommender Systems - RecSys Summer School 2017
PPT
deep learning UNIT-1 Introduction Part-1.ppt
The Machinery behind Deep Learning
Chap 8. Optimization for training deep models
Chapter10.pptx
Deep learning crash course
Deep Learning: concepts and use cases (October 2018)
1. Introduction to deep learning.pptx
Deep Learning Class #1 - Go Deep or Go Home
DL Classe 1 - Go Deep or Go Home
Separating Hype from Reality in Deep Learning with Sameer Farooqui
Dep Neural Networks introduction new.pdf
Introduction to deep Learning Fundamentals
AD3501 - DL Unit-1 PPT.pptx python syllabus
Techniques in Deep Learning
Nimrita deep learning
Deep Neural Network Module 3A Optimization.pptx
Optimization for Neural Network Training - Veronica Vilaplana - UPC Barcelona...
Deep Learning
Training DNN Models - II.pptx
Deep Learning in Recommender Systems - RecSys Summer School 2017
deep learning UNIT-1 Introduction Part-1.ppt
Ad

More from Tigabu Yaya (20)

PDF
Deep Learning and types Convolutional Neural Network
PDF
ML_basics_lecture1_linear_regression.pdf
PDF
03. Data Exploration in Data Science.pdf
PDF
MOD_Architectural_Design_Chap6_Summary.pdf
PDF
MOD_Design_Implementation_Ch7_summary.pdf
PDF
GER_Project_Management_Ch22_summary.pdf
PDF
lecture_GPUArchCUDA02-CUDAMem.pdf
PDF
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
PDF
6_RealTimeScheduling.pdf
PPTX
Regression.pptx
PDF
lecture6.pdf
PDF
lecture5.pdf
PDF
lecture4.pdf
PPT
Chap 4.ppt
PPT
200402_RoseRealTime.ppt
PPT
matrixfactorization.ppt
PPTX
nnfl.0620.pptx
PPT
L20.ppt
PDF
The Jacobi and Gauss-Seidel Iterative Methods.pdf
PDF
C_and_C++_notes.pdf
Deep Learning and types Convolutional Neural Network
ML_basics_lecture1_linear_regression.pdf
03. Data Exploration in Data Science.pdf
MOD_Architectural_Design_Chap6_Summary.pdf
MOD_Design_Implementation_Ch7_summary.pdf
GER_Project_Management_Ch22_summary.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA04-OpenMPHOMP.pdf
6_RealTimeScheduling.pdf
Regression.pptx
lecture6.pdf
lecture5.pdf
lecture4.pdf
Chap 4.ppt
200402_RoseRealTime.ppt
matrixfactorization.ppt
nnfl.0620.pptx
L20.ppt
The Jacobi and Gauss-Seidel Iterative Methods.pdf
C_and_C++_notes.pdf
Ad

Recently uploaded (20)

PPTX
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
PPTX
Institutional Correction lecture only . . .
PDF
Sports Quiz easy sports quiz sports quiz
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Module 4: Burden of Disease Tutorial Slides S2 2025
PPTX
human mycosis Human fungal infections are called human mycosis..pptx
PDF
01-Introduction-to-Information-Management.pdf
PPTX
Final Presentation General Medicine 03-08-2024.pptx
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
PDF
Pre independence Education in Inndia.pdf
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
Basic Mud Logging Guide for educational purpose
PDF
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PDF
RMMM.pdf make it easy to upload and study
PDF
FourierSeries-QuestionsWithAnswers(Part-A).pdf
Introduction_to_Human_Anatomy_and_Physiology_for_B.Pharm.pptx
Institutional Correction lecture only . . .
Sports Quiz easy sports quiz sports quiz
Pharmacology of Heart Failure /Pharmacotherapy of CHF
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
STATICS OF THE RIGID BODIES Hibbelers.pdf
Module 4: Burden of Disease Tutorial Slides S2 2025
human mycosis Human fungal infections are called human mycosis..pptx
01-Introduction-to-Information-Management.pdf
Final Presentation General Medicine 03-08-2024.pptx
PPH.pptx obstetrics and gynecology in nursing
3rd Neelam Sanjeevareddy Memorial Lecture.pdf
Pre independence Education in Inndia.pdf
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
Basic Mud Logging Guide for educational purpose
Saundersa Comprehensive Review for the NCLEX-RN Examination.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
RMMM.pdf make it easy to upload and study
FourierSeries-QuestionsWithAnswers(Part-A).pdf

lecture3.pdf

  • 1. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 1 Lecture 3: Deeper into Deep Learning and Optimizations Deep Learning @ UvA
  • 2. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 2 o Machine learning paradigm for neural networks o Backpropagation algorithm, backbone for training neural networks o Neural network == modular architecture o Visited different modules, saw how to implement and check them Previous lecture
  • 3. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 3 o How to define our model and optimize it in practice o Data preprocessing and normalization o Optimization methods o Regularizations o Architectures and architectural hyper-parameters o Learning rate o Weight initializations o Good practices Lecture overview
  • 4. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 4 Deeper into Neural Networks & Deep Neural Nets
  • 5. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 5 A Neural/Deep Network in a nutshell 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 6. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 6 SGD vs GD 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 7. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 7 Backpropagation again o Step 1. Compute forward propagations for all layers recursively 𝑎𝑙 = ℎ𝑙 𝑥𝑙 and 𝑥𝑙+1 = 𝑎𝑙 o Step 2. Once done with forward propagation, follow the reverse path. ◦ Start from the last layer and for each new layer compute the gradients ◦ Cache computations when possible to avoid redundant operations o Step 3. Use the gradients 𝜕ℒ 𝜕𝜃𝑙 with Stochastic Gradient Descend to train 𝜕ℒ 𝜕𝑎𝑙 = 𝜕𝑎𝑙+1 𝜕𝑥𝑙+1 𝑇 ⋅ 𝜕ℒ 𝜕𝑎𝑙+1 𝜕ℒ 𝜕𝜃𝑙 = 𝜕𝑎𝑙 𝜕𝜃𝑙 ⋅ 𝜕ℒ 𝜕𝑎𝑙 𝑇
  • 8. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 8 o Often loss surfaces are ◦ non-quadratic ◦ highly non-convex ◦ very high-dimensional o Datasets are typically really large to compute complete gradients o No real guarantee that ◦ the final solution will be good ◦ we converge fast to final solution ◦ or that there will be convergence Still, backpropagation can be slow
  • 9. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 9 o Stochastically sample “mini-batches” from dataset 𝐷 ◦ The size of 𝐵𝑗 can contain even just 1 sample o Much faster than Gradient Descend o Results are often better o Also suitable for datasets that change over time o Variance of gradients increases when batch size decreases Stochastic Gradient Descend (SGD) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 |𝐵𝑗| ෍ 𝑖 ∈ 𝐵𝑗 𝛻𝜃ℒ𝑖 𝐵𝑗 = 𝑠𝑎𝑚𝑝𝑙𝑒(𝐷)
  • 10. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 10 SGD is often better Current solution Full GD gradient New GD solution Noisy SGD gradient Best GD solution Best SGD solution • No guarantee that this is what is going to always happen. • But the noisy SGC gradients can help some times escaping local optima Loss surface
  • 11. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 11 SGD is often better o (A bit) Noisy gradients act as regularization o Gradient Descend  Complete gradients o Complete gradients fit optimally the (arbitrary) data we have, not the distribution that generates them ◦ All training samples are the “absolute representative” of the input distribution ◦ Test data will be no different than training data ◦ Suitable for traditional optimization problems: “find optimal route” ◦ But for ML we cannot make this assumption  test data are always different o Stochastic gradients  sampled training data sample roughly representative gradients ◦ Model does not overfit to the particular training samples
  • 12. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 12 SGD is faster Gradient
  • 13. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 13 SGD is faster Gradient 10x What is our gradient now?
  • 14. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 14 SGD is faster 10x What is our gradient now? Gradient
  • 15. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 15 o Of course in real situations data do not replicate o However, after a sizeable amount of data there are clusters of data that are similar o Hence, the gradient is approximately alright o Approximate alright is great, is even better in many cases actually SGD is faster
  • 16. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 16 o Often datasets are not “rigid” o Imagine Instagram ◦ Let’s assume 1 million of new images uploaded per week and we want to build a “cool picture” classifier ◦ Should “cool pictures” from the previous year have the same as much influence? ◦ No, the learning machine should track these changes o With GD these changes go undetected, as results are averaged by the many more “past” samples ◦ Past “over-dominates” o A properly implemented SGD can track changes much better and give better models ◦ [LeCun2002] SGD for dynamically changed datasets Popular today Popular in 2014 Popular in 2010
  • 17. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 17 o Applicable only with SGD o Choose samples with maximum information content o Mini-batches should contain examples from different classes ◦ As different as possible o Prefer samples likely to generate larger errors ◦ Otherwise gradients will be small  slower learning ◦ Check the errors from previous rounds and prefer “hard examples” ◦ Don’t overdo it though :P, beware of outliers o In practice, split your dataset into mini-batches ◦ Each mini-batch is as class-divergent and rich as possible ◦ New epoch  to be safe new batches & new, randomly shuffled examples Shuffling examples Dataset Shuffling at epoch t Shuffling at epoch t+1
  • 18. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 18 o Conditions of convergence well understood o Acceleration techniques can be applied ◦ Second order (Hessian based) optimizations are possible ◦ Measuring not only gradients, but also curvatures of the loss surface o Simpler theoretical analysis on weight dynamics and convergence rates Advantages of Gradient Descend batch learning
  • 19. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 19 o SGD is preferred to Gradient Descend o Training is orders faster ◦ In real datasets Gradient Descend is not even realistic o Solutions generalize better ◦ More efficient  larger datasets ◦ Larger datasets  better generalization o How many samples per mini-batch? ◦ Hyper-parameter, trial & error ◦ Usually between 32-256 samples In practice
  • 20. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 20 Data preprocessing & normalization 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 21. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 21 o Center data to be roughly 0 ◦ Activation functions usually “centered” around 0 ◦ Convergence usually faster ◦ Otherwise bias on gradient direction  might slow down learning Data pre-processing ReLU  tanh(𝑥)  𝜎(𝑥)   
  • 22. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 22 o Scale input variables to have similar diagonal covariances 𝑐𝑖 = σ𝑗(𝑥𝑖 (𝑗) )2 ◦ Similar covariances  more balanced rate of learning for different weights ◦ Rescaling to 1 is a good choice, unless some dimensions are less important Data pre-processing 𝑥1 , 𝑥2 , 𝑥3  much different covariances 𝜃1 𝜃2 𝑥 = 𝑥1 , 𝑥2 , 𝑥3 𝑇 , 𝜃 = 𝜃1 , 𝜃2 , 𝜃3 𝑇 , 𝑎 = tanh(𝜃Τ 𝑥) 𝜃3 Generated gradients ቚ dℒ 𝑑𝜃 𝑥1,𝑥2,𝑥3 : much different Gradient update harder: 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝑑ℒ/𝑑θ1 𝑑ℒ/𝑑θ2 𝑑ℒ/𝑑θ3
  • 23. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 23 o Input variables should be as decorrelated as possible ◦ Input variables are “more independent” ◦ Network is forced to find non-trivial correlations between inputs ◦ Decorrelated inputs  Better optimization ◦ Obviously not the case when inputs are by definition correlated (sequences) o Extreme case ◦ extreme correlation (linear dependency) might cause problems [CAUTION] Data pre-processing
  • 24. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 24 o Input variables follow a Gaussian distribution (roughly) o In practice: ◦ from training set compute mean and standard deviation ◦ Then subtract the mean from training samples ◦ Then divide the result by the standard deviation Normalization: 𝑁 𝜇, 𝜎2 = 𝑁 0, 1 𝑥 𝑥 − 𝜇 𝑥 − 𝜇 𝜎
  • 25. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 25 o Instead of “per-dimension”  all input dimensions simultaneously o If dimensions have similar values (e.g. pixels in natural images) ◦ Compute one 𝜇, 𝜎2 instead of as many as the input variables ◦ Or the per color channel pixel average/variance 𝜇𝑟𝑒𝑑, 𝜎𝑟𝑒𝑑 2 , 𝜇𝑔𝑟𝑒𝑒𝑛, 𝜎𝑔𝑟𝑒𝑒𝑛 2 , 𝜇𝑏𝑙𝑢𝑒, 𝜎𝑏𝑙𝑢𝑒 2 𝑁 𝜇, 𝜎2 = 𝑁 0, 1 − Making things faster
  • 26. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 26 o When input dimensions have similar ranges … o … and with the right non-linearlity … o … centering might be enough ◦ e.g. in images all dimensions are pixels ◦ All pixels have more or less the same ranges o Juse make sure images have mean 0 (𝜇 = 0) Even simpler: Centering the input
  • 27. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 27 o If 𝐶 the covariance matrix of your dataset, compute eigenvalues and eigenvectors with SVD 𝑈, Σ, 𝑉𝑇 = 𝑠𝑣𝑑(𝐶) o Decorrelate (PCA-ed) dataset by 𝑋𝑟𝑜𝑡 = 𝑈𝑇𝑋 ◦ Subset of eigenvectors 𝑈′ = [𝑢1, … , 𝑢𝑞] to reduce data dimensions o Scaling by square root of eigenvalues to whiten data 𝑋𝑤ℎ𝑡 = 𝑋𝑟𝑜𝑡/ Σ o Not used much with Convolutional Neural Nets ◦ The zero mean normalization is more important PCA Whitening 𝑋𝑟𝑜𝑡 = 𝑈𝑇𝑋 𝑋𝑤ℎ𝑡 = 𝑋𝑟𝑜𝑡/ Σ
  • 28. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 28 Example Images taken from A. Karpathy course website: http://cs231n.github.io/neural-networks-2/
  • 29. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 29 Data augmentation [Krizhevsky2012] Original Flip Random crop Contrast Tint
  • 30. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 30 o Weights change  the distribution of the layer inputs changes per round ◦ Covariance shift o Normalize the layer inputs with batch normalization ◦ Roughly speaking, normalize 𝑥𝑙 to 𝑁(0, 1) and rescale Batch normalization [Ioffe2015] 𝑥𝑙 Layer l input distribution at (t) Layer l input distribution at (t+0.5) Layer l input distribution at (t+1) Backpropagation 𝑥𝑙 𝑥𝑙 Batch Normalization 𝑥𝑙 ℒ 𝑥𝑙 ℒ Batch normalization
  • 31. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 31 Batch normalization - Intuitively 𝑥𝑙 Layer l input distribution at (t) Layer l input distribution at (t+0.5) Layer l input distribution at (t+1) Backpropagation 𝑥𝑙 𝑥𝑙 Batch Normalization
  • 32. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 32 Batch normalization – The algorithm o 𝜇ℬ ← 1 𝑚 σ𝑖=1 𝑚 𝑥𝑖 [compute mini-batch mean] o 𝜎ℬ ← 1 𝑚 σ𝑖=1 𝑚 𝑥𝑖 − 𝜇ℬ 2 [compute mini-batch variance] o ෝ 𝑥𝑖 ← 𝑥𝑖−𝜇ℬ 𝜎ℬ 2+𝜀 [normalize input] o ෝ 𝑦𝑖 ← 𝛾𝑥𝑖 + 𝛽 [scale and shift input] Trainable parameters
  • 33. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 33 o Gradients can be stronger  higher learning rates  faster training ◦ Otherwise maybe exploding or vanishing gradients or getting stuck to local minima o Neurons get activated in a near optimal “regime” o Better model regularization ◦ Neuron activations not deterministic, depend on the batch ◦ Model cannot be overconfident Batch normalization - Benefits
  • 34. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 34 Regularization 𝑎𝐿 𝑥; 𝜃1,…,L = ℎ𝐿 (ℎ𝐿−1 … ℎ1 𝑥, θ1 , θ𝐿−1 , θ𝐿) θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℓ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡𝛻𝜃ℒ 1. The Neural Network 2. Learning by minimizing empirical error 3. Optimizing with Gradient Descend based methods
  • 35. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 35 o Neural networks typically have thousands, if not millions of parameters ◦ Usually, the dataset size smaller than the number of parameters o Overfitting is a grave danger o Proper weight regularization is crucial to avoid overfitting θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℓ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆Ω(𝜃) o Possible regularization methods ◦ ℓ2-regularization ◦ ℓ1-regularization ◦ Dropout Regularization
  • 36. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 36 o Most important (or most popular) regularization θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆 2 ෍ 𝑙 𝜃𝑙 2 o The ℓ2-regularization can pass inside the gradient descend update rule 𝜃(𝑡+1) = 𝜃(𝑡) − 𝜂𝑡 𝛻𝜃ℒ + 𝜆𝜃𝑙 ⟹ 𝜃 𝑡+1 = 1 − 𝜆𝜂𝑡 𝜃 𝑡 − 𝜂𝑡𝛻𝜃ℒ o 𝜆 is usually about 10−1, 10−2 ℓ2-regularization “Weight decay”, because weights get smaller
  • 37. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 37 o ℓ1-regularization is one of the most important techniques θ∗ ← arg min𝜃 ෍ (𝑥,𝑦)⊆(𝑋,𝑌) ℒ(𝑦, 𝑎𝐿 𝑥; 𝜃1,…,L ) + 𝜆 2 ෍ 𝑙 𝜃𝑙 o Also ℓ1-regularization passes inside the gradient descend update rule 𝜃 𝑡+1 = 𝜃 𝑡 − 𝜆𝜂𝑡 𝜃 𝑡 |𝜃 𝑡 | − 𝜂𝑡𝛻𝜃ℒ o ℓ1-regularization  sparse weights ◦ 𝜆 ↗  more weights become 0 ℓ1-regularization Sign function
  • 38. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 38 o To tackle overfitting another popular technique is early stopping o Monitor performance on a separate validation set o Training the network will decrease training error, as well validation error (although with a slower rate usually) o Stop when validation error starts increasing ◦ This quite likely means the network starts to overfit Early stopping
  • 39. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 39 Dropout [Srivastava2014] o During training, set activations randomly to 0 ◦ Neurons sampled at random from a Bernoulli distribution with $p = 0.5$ o At test time all neurons are used ◦ Neuron activations are reweighted by $p$ (see the sketch below) o Benefits ◦ Reduces complex co-adaptations or co-dependencies between neurons ◦ No “free-rider” neurons that rely on others ◦ Every neuron becomes more robust ◦ Significantly decreases overfitting ◦ Significantly improves training speed
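  A small numpy sketch of this train/test behaviour, assuming activations `a` and keep probability `p` (illustrative names, not the lecture's code):

```python
import numpy as np

def dropout_forward(a, p=0.5, train=True):
    """Classic dropout: during training each activation is kept with
    probability p (Bernoulli mask); at test time all activations are used
    but reweighted by p so the expected input to the next layer matches."""
    if train:
        mask = np.random.rand(*a.shape) < p   # sample which neurons stay on
        return a * mask
    return a * p                              # test time: reweight by p
```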
  • 40. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 40 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Original model
  • 41. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 41 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 1
  • 42. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 42 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 1
  • 43. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 43 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 2
  • 44. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 44 o Effectively, a different architecture at every training epoch ◦ Similar to model ensembles Dropout Epoch 2
  • 45. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 45 Architectural details 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 46. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 46 Sigmoid-like activation functions o Straightforward sigmoids are not a very good idea o Symmetric sigmoids converge faster ◦ E.g. tanh, which returns $a(x{=}0) = 0$ ◦ Recommended sigmoid: $a = h(x) = 1.7159\,\tanh(\tfrac{2}{3}x)$ o You can add a linear term to avoid flat areas: $a = h(x) = \tanh(x) + \beta x$ [Figure: plots of $\tanh(x)$ and $\tanh(x) + 0.5x$]
  • 47. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 47 RBFs vs “sigmoids” o RBF: $a = h(x) = \sum_j u_j \exp(-\beta_j \|x - w_j\|^2)$ o Sigmoid: $a = h(x) = \sigma(x) = \frac{1}{1 + e^{-x}}$ o Sigmoids can cover the full feature space o RBFs are much more local in the feature space ◦ Can be faster to train, but with a more limited range ◦ Can give a better set of basis functions ◦ Preferred in lower-dimensional spaces
  • 48. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 48 Rectified Linear Unit (ReLU) module [Krizhevsky2012] o Activation function: $a = h(x) = \max(0, x)$ o Gradient w.r.t. the input: $\frac{\partial a}{\partial x} = 0$ if $x \le 0$, and $1$ if $x > 0$ o Very popular in computer vision and speech recognition o Much faster computation of activations and gradients ◦ No vanishing or exploding problems; only comparison, addition, multiplication o People claim biological plausibility o Sparse activations o No saturation o Non-symmetric o Non-differentiable at 0 o A large gradient during training can cause a neuron to “die”; lower learning rates mitigate the problem
  • 49. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 49 ReLU convergence rate [Figure: training curves showing ReLU converging faster than tanh]
  • 50. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 50 Other ReLUs o Soft approximation (softplus): $a = h(x) = \ln(1 + e^x)$ o Noisy ReLU: $a = h(x) = \max(0, x + \varepsilon)$, $\varepsilon \sim N(0, \sigma(x))$ o Leaky ReLU: $a = h(x) = x$ if $x > 0$, $0.01x$ otherwise o Parametric ReLU: $a = h(x) = x$ if $x > 0$, $\beta x$ otherwise (the parameter $\beta$ is trainable)
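  For concreteness, numpy versions of these activations (a sketch; for the Parametric ReLU the trainable β would normally live inside the layer's parameters):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def softplus(x):
    # numerically stable ln(1 + e^x)
    return np.logaddexp(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def prelu(x, beta):
    # beta is the trainable parameter of the Parametric ReLU
    return np.where(x > 0, x, beta * x)
```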
  • 51. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 51 Architectural hyper-parameters o Number of hidden layers o Number of neurons in each hidden layer o Type of activation functions o Type and amount of regularization
  • 52. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 52 Number of neurons, number of hidden layers o These hyper-parameters are dataset dependent o Tip: start small → increase complexity gradually ◦ e.g. start with 2-3 hidden layers ◦ Add more layers → does performance improve? ◦ Add more neurons → does performance improve? o Regularization is very important, use ℓ2 ◦ Even with a very deep or wide network ◦ With strong ℓ2-regularization we avoid overfitting [Figure: generalization vs. model complexity (number of neurons)]
  • 53. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 53 Learning rate 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 54. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 54 Learning rate o The right learning rate $\eta_t$ is very important for fast convergence ◦ Too strong → gradients overshoot and bounce ◦ Too weak → too small gradients → slow training o A learning rate per weight is often advantageous ◦ Some weights are near convergence, others are not o Rule of thumb ◦ The learning rate of a (shared) weight should be proportional to the square root of the number of connections sharing that weight o Adaptive learning rates are also possible, based on the errors observed ◦ [Sompolinsky1995]
  • 55. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 55 Learning rate schedules o Constant ◦ The learning rate remains the same for all epochs o Step decay ◦ Decrease the rate (e.g. to $\eta_t / T$) every $T$ epochs o Inverse decay: $\eta_t = \frac{\eta_0}{1 + \varepsilon t}$ o Exponential decay: $\eta_t = \eta_0 e^{-\varepsilon t}$ o Step decay is often preferred ◦ simple, intuitive, works well, and adds only a single extra hyper-parameter $T$ (e.g. $T = 2$ or $10$); see the sketch below
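  Minimal Python versions of these schedules, assuming a base learning rate `eta0` (the helper names and default decay constants are illustrative):

```python
import math

def constant(eta0, t):
    # same rate for all epochs
    return eta0

def step_decay(eta0, epoch, drop=0.5, every=10):
    # multiply the rate by `drop` every `every` epochs
    return eta0 * (drop ** (epoch // every))

def inverse_decay(eta0, t, eps=1e-3):
    return eta0 / (1.0 + eps * t)

def exponential_decay(eta0, t, eps=1e-3):
    return eta0 * math.exp(-eps * t)
```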
  • 56. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 56 Learning rate in practice o Try several log-spaced values $10^{-1}, 10^{-2}, 10^{-3}, \dots$ on a smaller set ◦ Then narrow the search down around the value that gives the lowest error o You can decrease the learning rate every 10 (or some other number of) full training-set epochs ◦ Although this highly depends on your data
  • 57. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 57 Weight initialization 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 58. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 58 Weight initialization o There are a few contradictory requirements o Weights need to be small enough ◦ around the origin ($\mathbf{0}$) for symmetric functions (tanh, sigmoid) ◦ When training starts, it is better to stimulate the activation functions near their linear regime ◦ larger gradients → faster training o Weights need to be large enough ◦ Otherwise the signal is too weak for any serious learning [Figure: tanh and sigmoid curves, marking the near-linear regime where gradients are largest]
  • 59. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 59 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 60. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 60 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? Answer: Because the output of one module is the input to another o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 61. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 61 Weight initialization o Weights must be initialized to preserve the variance of the activations during the forward and backward computations ◦ Especially for deep learning ◦ All neurons operate in their full capacity Question: Why similar input/output variance? Answer: Because the output of one module is the input to another o Good practice: initialize weights to be asymmetric ◦ Don’t give the same value to all weights (like all $\mathbf{0}$) ◦ In that case all neurons generate the same gradient → no learning o Generally speaking, initialization depends on ◦ non-linearities ◦ data normalization
  • 62. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 62 One way of initializing sigmoid-like neurons o For tanh, initialize the weights uniformly from $\left[-\sqrt{\frac{6}{d_{l-1}+d_l}},\ \sqrt{\frac{6}{d_{l-1}+d_l}}\right]$ ◦ $d_{l-1}$ is the number of input variables to the tanh layer and $d_l$ is the number of output variables o For a sigmoid, use $\left[-4\sqrt{\frac{6}{d_{l-1}+d_l}},\ 4\sqrt{\frac{6}{d_{l-1}+d_l}}\right]$
  • 63. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 63 Xavier initialization [Glorot2010] o For $a = \theta x$ the variance is $\mathrm{Var}(a) = E[x]^2\,\mathrm{Var}(\theta) + E[\theta]^2\,\mathrm{Var}(x) + \mathrm{Var}(x)\,\mathrm{Var}(\theta)$ o Since $E[x] = E[\theta] = 0$: $\mathrm{Var}(a) = \mathrm{Var}(x)\,\mathrm{Var}(\theta) \approx d \cdot \mathrm{Var}(x_i)\,\mathrm{Var}(\theta_i)$ o For $\mathrm{Var}(a) = \mathrm{Var}(x) \Rightarrow \mathrm{Var}(\theta_i) = \frac{1}{d}$ o Draw random weights from $\theta \sim N(0, 1/d)$, where $d$ is the number of neurons in the input
  • 64. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 64 [He2015] initialization for ReLUs o Unlike sigmoids, ReLUs set the linear activations to 0 about half the time o Double the weight variance ◦ Compensates for the zeroed flat area ◦ Input and output maintain the same variance ◦ Very similar to Xavier initialization o Draw random weights from $w \sim N(0, 2/d)$, where $d$ is the number of neurons in the input
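  A compact numpy sketch of both recipes for a fully connected layer with `d_in` inputs and `d_out` outputs (illustrative helper names):

```python
import numpy as np

def xavier_init(d_in, d_out):
    """Xavier/Glorot: preserve activation variance for tanh/sigmoid-like units."""
    return np.random.randn(d_in, d_out) * np.sqrt(1.0 / d_in)

def he_init(d_in, d_out):
    """He et al.: double the variance to compensate for ReLU zeroing half the inputs."""
    return np.random.randn(d_in, d_out) * np.sqrt(2.0 / d_in)
```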
  • 65. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 65 Loss functions 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 66. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 66 Multi-class classification (“Is it a cat? Is it a horse? …”) o Each sample contains only one class ◦ There is only one correct answer per sample o Negative log-likelihood (cross entropy) + softmax: $\mathcal{L}(\theta; x, y) = -\sum_{c=1}^{C} y_c \log a_L^{(c)}$ for all classes $c = 1, \dots, C$ o Hierarchical softmax when $C$ is very large o Hinge loss (aka SVM loss): $\mathcal{L}(\theta; x, y) = \sum_{c=1, c \neq y}^{C} \max(0, a_L^{(c)} - a_L^{(y)} + 1)$ o Squared hinge loss
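  A numpy sketch of both losses for a single sample, assuming `scores` holds the raw outputs $a_L$ of the last layer and `y` is the index of the correct class (illustrative names):

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    """Negative log-likelihood of the correct class under a softmax."""
    scores = scores - scores.max()                  # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return -np.log(probs[y])

def hinge_loss(scores, y, margin=1.0):
    """Multi-class hinge (SVM) loss with a margin of 1."""
    margins = np.maximum(0.0, scores - scores[y] + margin)
    margins[y] = 0.0                                # skip the correct class
    return margins.sum()
```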
  • 67. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 67 Multi-class, multi-label classification o Each sample can have many correct answers o Hinge loss and the like ◦ Sigmoid outputs would also work o Each output neuron is independent ◦ “Does this contain a car, yes or no?” ◦ “Does this contain a person, yes or no?” ◦ “Does this contain a motorbike, yes or no?” ◦ “Does this contain a horse, yes or no?” o Instead of “Is this a car, motorbike or person?”, where ◦ $p(\text{car}|x) = 0.55$, $p(\text{m/bike}|x) = 0.25$, $p(\text{person}|x) = 0.15$, $p(\text{horse}|x) = 0.05$ ◦ $p(\text{car}|x) + p(\text{m/bike}|x) + p(\text{person}|x) + p(\text{horse}|x) = 1.0$
  • 68. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 68 Regression o The good old Euclidean loss: $\mathcal{L}(\theta; x, y) = \frac{1}{2}\|y - a_L\|_2^2$ o Or an RBF on top of the Euclidean loss: $\mathcal{L}(\theta; x, y) = \sum_j u_j \exp(-\beta_j (y - a_L)^2)$ o Or the ℓ1 distance: $\mathcal{L}(\theta; x, y) = \sum_j |y_j - a_L^{(j)}|$
  • 69. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 69 Even better optimizations 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 70. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 70 Momentum o Don’t let the update direction switch all the time o Maintain “momentum” from the previous updates o More robust gradients and learning → faster convergence o Nice “physics”-based interpretation ◦ Instead of updating the position of the “ball”, we update its velocity, which in turn updates the position: $u^{(t+1)} = \gamma u^{(t)} - \eta_t \nabla_\theta \mathcal{L}$, $\theta^{(t+1)} = \theta^{(t)} + u^{(t+1)}$ [Figure: on an elongated loss surface, the plain gradient zig-zags while gradient + momentum follows a smoother path]
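  A minimal sketch of this velocity update on numpy arrays (the names are illustrative):

```python
import numpy as np

def momentum_step(theta, u, grad, lr, gamma=0.9):
    """Classic momentum: update the velocity u, then move theta by it."""
    u = gamma * u - lr * grad
    theta = theta + u
    return theta, u
```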
  • 71. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 71 Nesterov momentum [Sutskever2013] o Use the look-ahead gradient instead of the current gradient: $u^{(t+1)} = \gamma u^{(t)} - \eta_t \nabla_\theta \mathcal{L}(\theta^{(t)} + \gamma u^{(t)})$, $\theta^{(t+1)} = \theta^{(t)} + u^{(t+1)}$ o Better theoretical convergence o Generally works better with Convolutional Neural Networks [Figure: standard momentum step vs. Nesterov step, where the gradient is evaluated at the look-ahead position after the momentum jump]
  • 72. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 72 Second-order optimization o Normally all weights are updated with the same “aggressiveness” ◦ Often some parameters could enjoy more “teaching” ◦ While others are already about there o Adapt the learning per parameter: $\theta^{(t+1)} = \theta^{(t)} - \eta_t H_\mathcal{L}^{-1} \nabla_\theta \mathcal{L}$ o $H_\mathcal{L}$ is the Hessian matrix of $\mathcal{L}$, i.e. the second-order derivatives: $(H_\mathcal{L})_{ij} = \frac{\partial^2 \mathcal{L}}{\partial\theta_i \partial\theta_j}$
  • 73. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 73 Second-order optimization methods in practice o Computing the inverse of the Hessian is usually very expensive ◦ Too many parameters o Approximate the Hessian, e.g. with the L-BFGS algorithm ◦ Keeps a memory of past gradients to approximate the inverse Hessian o L-BFGS works alright with (full-batch) gradient descent, but what about SGD? o In practice, SGD with well-tuned momentum works just fine
  • 74. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 74 o Adagrad [Duchi2011] o RMSprop o Adam [Kingma2014] Other per-parameter adaptive optimizations
  • 75. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 75 Adagrad [Duchi2011] o Schedule ◦ $m_j = \sum_\tau (\nabla_\theta \mathcal{L}_j^{(\tau)})^2 \Rightarrow \theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{\sqrt{m} + \varepsilon} \nabla_\theta \mathcal{L}$ ◦ $\varepsilon$ is a small number to avoid division by 0 ◦ Because $m$ only grows, the updates become gradually smaller and smaller
  • 76. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 76 RMSprop o Schedule ◦ $m^{(t)} = \alpha\, m^{(t-1)} + (1 - \alpha)(\nabla_\theta \mathcal{L}^{(t)})^2$, where $\alpha$ is a decay hyper-parameter ◦ $\theta^{(t+1)} = \theta^{(t)} - \frac{\eta_t}{\sqrt{m^{(t)}} + \varepsilon} \nabla_\theta \mathcal{L}$ o A moving average of the squared gradients (compared to Adagrad’s ever-growing sum) ◦ Square rooting boosts small values while it suppresses large values o Large gradients, e.g. a too “noisy” loss surface ◦ Updates are tamed o Small gradients, e.g. stuck in a flat loss-surface ravine ◦ Updates become more aggressive
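  A sketch of one RMSprop step (illustrative names; `m` holds the moving average of squared gradients):

```python
import numpy as np

def rmsprop_step(theta, m, grad, lr, alpha=0.9, eps=1e-8):
    """Keep a moving average of squared gradients and divide by its square root."""
    m = alpha * m + (1.0 - alpha) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(m) + eps)
    return theta, m
```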
  • 77. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 77 Adam [Kingma2014] o One of the most popular learning algorithms: $m^{(t+1)} = \beta_1 m^{(t)} + (1 - \beta_1)\nabla_\theta \mathcal{L}$, $v^{(t+1)} = \beta_2 v^{(t)} + (1 - \beta_2)(\nabla_\theta \mathcal{L})^2$, $\theta^{(t+1)} = \theta^{(t)} - \eta_t \frac{m^{(t+1)}}{\sqrt{v^{(t+1)}} + \varepsilon}$ o Similar to RMSprop, but with momentum: a first-moment estimate $m$ on top of the second-moment estimate $v$ o Recommended values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$
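  A sketch of one Adam step with those recommended values; note that the published algorithm additionally bias-corrects m and v early in training, which is omitted here to stay close to the slide:

```python
import numpy as np

def adam_step(theta, m, v, grad, lr, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum-like first moment m, RMSprop-like second moment v."""
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad ** 2
    theta = theta - lr * m / (np.sqrt(v) + eps)
    return theta, m, v
```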
  • 78. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 78 Visual overview Picture credit: Alec Radford
  • 79. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 79 Learning – not computing – the gradients o Learning to learn by gradient descent by gradient descent ◦ [Andrychowicz2016] o $\theta^{(t+1)} = \theta^{(t)} + g_t(\nabla_\theta \mathcal{L}, \varphi)$ o $g_t$ is an “optimizer” with its own parameters $\varphi$ ◦ Implemented as a recurrent network
  • 80. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 80 Good practice o Preprocess the data to at least have zero mean o Initialize the weights based on the activation functions ◦ For ReLUs, use Xavier or He [He2015] initialization o Always use ℓ2-regularization and dropout o Use batch normalization
  • 81. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 81 Babysitting Deep Nets 1. The Neural Network: $a_L(x; \theta_{1,\dots,L}) = h_L(h_{L-1}(\dots h_1(x, \theta_1), \theta_{L-1}), \theta_L)$ 2. Learning by minimizing empirical error: $\theta^* \leftarrow \arg\min_\theta \sum_{(x,y)\subseteq(X,Y)} \mathcal{L}(y, a_L(x; \theta_{1,\dots,L}))$ 3. Optimizing with gradient descent based methods: $\theta^{(t+1)} = \theta^{(t)} - \eta_t \nabla_\theta \mathcal{L}$
  • 82. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 82 Babysitting Deep Nets o Always check your gradients if they are not computed automatically (see the sketch below) o Check that in the first rounds you get the loss expected from random predictions o Check the network with a few samples ◦ Turn off regularization: you should predictably overfit and reach a loss of 0 ◦ Turn on regularization: the loss should increase o Have a separate validation set ◦ Compare the training and validation loss curves ◦ There should be a gap, but not too large a gap
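  A minimal numerical gradient check, assuming `f(theta)` returns the scalar loss and `analytic_grad` is the gradient produced by your backpropagation code (the helper name and step size are illustrative):

```python
import numpy as np

def gradient_check(f, theta, analytic_grad, h=1e-5):
    """Compare an analytic gradient with a centered finite-difference estimate.
    Returns the maximum relative error over all parameters."""
    num_grad = np.zeros_like(theta, dtype=float)
    for i in range(theta.size):
        old = theta.flat[i]
        theta.flat[i] = old + h
        f_plus = f(theta)
        theta.flat[i] = old - h
        f_minus = f(theta)
        theta.flat[i] = old                    # restore the parameter
        num_grad.flat[i] = (f_plus - f_minus) / (2.0 * h)
    denom = np.maximum(1e-8, np.abs(num_grad) + np.abs(analytic_grad))
    return np.max(np.abs(num_grad - analytic_grad) / denom)
```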
  • 83. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 83 Summary o How to define our model and optimize it in practice o Data preprocessing and normalization o Optimization methods o Regularizations o Architectures and architectural hyper-parameters o Learning rate o Weight initializations o Good practices
  • 84. UVA DEEP LEARNING COURSE – EFSTRATIOS GAVVES & MAX WELLING - DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 84 Reading material & references o http://www.deeplearningbook.org/ ◦ Part II: Chapters 7, 8 [Andrychowicz2016] Andrychowicz, Denil, Gomez, Hoffman, Pfau, Schaul, de Freitas. Learning to learn by gradient descent by gradient descent, arXiv, 2016 [He2015] He, Zhang, Ren, Sun. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, ICCV, 2015 [Ioffe2015] Ioffe, Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv, 2015 [Kingma2014] Kingma, Ba. Adam: A Method for Stochastic Optimization, arXiv, 2014 [Srivastava2014] Srivastava, Hinton, Krizhevsky, Sutskever, Salakhutdinov. Dropout: A Simple Way to Prevent Neural Networks from Overfitting, JMLR, 2014 [Sutskever2013] Sutskever, Martens, Dahl, Hinton. On the importance of initialization and momentum in deep learning, JMLR, 2013 [Bengio2012] Bengio. Practical recommendations for gradient-based training of deep architectures, arXiv, 2012 [Krizhevsky2012] Krizhevsky, Sutskever, Hinton. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012 [Duchi2011] Duchi, Hazan, Singer. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, JMLR, 2011 [Glorot2010] Glorot, Bengio. Understanding the difficulty of training deep feedforward neural networks, JMLR, 2010 [LeCun2002]
  • 85. UVA DEEP LEARNING COURSE EFSTRATIOS GAVVES & MAX WELLING DEEPER INTO DEEP LEARNING AND OPTIMIZATIONS - 85 Next lecture o What are the Convolutional Neural Networks? o Why are they important in Computer Vision? o Differences from standard Neural Networks o How to train a Convolutional Neural Network?