The Machinery behind Deep Learning
Stefan Kühn
Join me on XING
Minds Mastering Machines - Cologne - April 26th, 2018
Stefan Kühn (XING) Deep Optimization 26.04.2018 1 / 35
Contents
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Stefan Kühn (XING) Deep Optimization 26.04.2018 2 / 35
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Stefan Kühn (XING) Deep Optimization 26.04.2018 3 / 35
Deep Learning
Neural Networks - Universal Approximation Theorem
A feed-forward neural net with a single hidden layer and a finite number of
neurons can approximate any continuous function on compact subsets of R^n
Questions:
Why do we need deep learning at all?
a theoretical result only; it requires very wide nets
approximation by piecewise constant functions (not what you might
want for classification/regression)
deep nets can replicate the capacity of wide shallow nets, with performance
and stability improvements
Why are deep nets harder to train than shallow nets?
More parameters to be learned by training?
More hyperparameters to be set before training?
Numerical issues?
disclaimer — ideas stolen from Martens, Sutskever, Bengio et al. and many more —
Stefan Kühn (XING) Deep Optimization 26.04.2018 4 / 35
Example: RNNs
Recurrent Neural Nets
Extremely powerful for modeling sequential data, e.g. time series, but
extremely hard to train (somewhat less so for LSTMs/GRUs)
Main Advantages:
Qualitatively: Flexible and rich model class
Practically: Gradients easily computed by Backpropagation (BPTT)
Main Problems:
Qualitatively: Learning long-term dependencies
Practically: Gradient-based methods struggle when the separation between
input and target output is large
Stefan Kühn (XING) Deep Optimization 26.04.2018 5 / 35
Example: RNNs
Recurrent Neural Nets
Highly volatile relationship between parameters and hidden states
Indicators
Vanishing/exploding gradients
Internal Covariate Shift
Remedies
ReLU
’Careful’ initialization
Small stepsizes
(Recurrent) Batch Normalization
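To make the vanishing/exploding-gradient indicator above concrete, here is a minimal numpy sketch (my own illustration, not from the talk): backpropagation through time multiplies the error signal repeatedly by the recurrent matrix W, so a spectral radius slightly below or above one makes the gradient vanish or explode.

```python
import numpy as np

rng = np.random.default_rng(0)

def bptt_gradient_norm(scale, steps=50, n=20):
    """Norm of an error signal after `steps` backward passes through
    a random recurrent matrix W with spectral radius ~ `scale`."""
    W = scale * rng.standard_normal((n, n)) / np.sqrt(n)
    g = rng.standard_normal(n)
    for _ in range(steps):
        g = W.T @ g              # one step of backpropagation through time
    return np.linalg.norm(g)

print(bptt_gradient_norm(0.5))   # tiny: vanishing gradient
print(bptt_gradient_norm(1.5))   # huge: exploding gradient
```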
Stefan Kühn (XING) Deep Optimization 26.04.2018 6 / 35
Example: RNNs
Recurrent Neural Nets and LSTM
Hochreiter/Schmidhuber proposed a change of RNN architecture: adding
Long Short-Term Memory (LSTM) units
Vanishing/exploding gradients?
fixed linear dynamics, so no longer problematic
Any open questions?
Gradient-based training works better with LSTMs
LSTMs compensate for one deficiency of gradient-based learning, but
is it the only one?
Most problems are related to specific numerical issues.
Stefan Kühn (XING) Deep Optimization 26.04.2018 7 / 35
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Stefan Kühn (XING) Deep Optimization 26.04.2018 8 / 35
Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Vidal et al, Mathematics of Deep Learning
Stefan Kühn (XING) Deep Optimization 26.04.2018 9 / 35
Notions of Optimality
Mathematical Optimization
Minimize a given loss function by a certain optimization method or strategy
until convergence.
Local Optimum: Minimum in local neighborhood (global minimum
might not even exist)
Global Optimum: Point with lowest function value (if existing)
Critical points: Candidates for local/global optima, or saddle points
Iterative Minimization: Step-by-step approach to find minima
Descent direction: Direction in which the function value decreases, at
least for sufficiently small steps
Gradient: For differentiable functions, the negative gradient is always a
descent direction, and it vanishes at critical points
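The last two bullets can be checked numerically in a few lines (a small numpy sketch of my own): a short step along the negative gradient decreases the function value.

```python
import numpy as np

f = lambda x: np.sum(x**2) + np.sin(x[0])              # smooth test function
grad = lambda x: 2 * x + np.array([np.cos(x[0]), 0.0])

x = np.array([1.0, -2.0])
d = -grad(x)                    # steepest-descent direction
assert f(x + 1e-3 * d) < f(x)   # function value decreases for a small step
```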
Stefan Kühn (XING) Deep Optimization 26.04.2018 10 / 35
Optimality and Deep Neural Nets
Some surprisingly strong theoretical results for this nonlinear+nonconvex
optimization problem - and practical evidence as well!
Saddle points: In high-dimensional non-convex problems, most critical
points are saddle points rather than bad local minima -> matching what is
observed for Deep Nets
Local and global optima: Deep Nets seem to have the property that
local optima are located near the global optimum
Optimal representation: Deep Nets can represent data optimally under
certain conditions (minimal sufficient statistic)
Information Theory: Deep Nets and entropy are becoming best friends;
strong relations to optimal control theory (optimization in infinite
dimensions)
Global optimality for positively homogeneous networks:
self-explanatory
Stefan Kühn (XING) Deep Optimization 26.04.2018 11 / 35
Notions of Error
Decomposition of the Error
Even the best possible prediction - the optimal prediction via the so-called
Bayes predictor - comes with an error.
Error Components
Bayes Error: Theoretically optimal error
Approximation Error: Error introduced by the model class
Estimation Error: Error introduced by parameter estimation / model
training / optimization method
Stefan Kühn (XING) Deep Optimization 26.04.2018 12 / 35
Notions of Error
Example
Bayes Error: Even the optimal predictor for house prices using only zip
codes makes an error -> a property of the data / features
Approximation Error: Linear models cannot resolve non-linear
relationships between the features, irrespective of the training method
(but possibly could with different features, e.g. polynomial
regression)
Estimation Error: Did we select the right model from the model class
based on the available data? -> depends on model class, data and
training / optimization method
But what about the Generalization Error?
Stefan Kühn (XING) Deep Optimization 26.04.2018 13 / 35
Notions of Learning
Learning
A core objective of a learner is to generalize from its experience.
But why do we use Mathematical Optimization for Learning?
What would be an alternative? Biology?
Stefan Kühn (XING) Deep Optimization 26.04.2018 14 / 35
Trade-offs between Optimization and Learning
Computational complexity becomes the limiting factor when one envisions
large amounts of training data. [Bottou, Bousquet]
Underlying Idea
Approximate optimization algorithms might be sufficient for learning
purposes. [Bottou, Bousquet]
Implications:
Small-scale: Trade-off between approximation error and estimation
error
Large-scale: Computational complexity dominates
Long story short:
The best optimization methods might not be the best learning
methods!
Stefan Kühn (XING) Deep Optimization 26.04.2018 15 / 35
Empirical results
Empirical evidence that SGD is a better learner than it is an optimizer.
RCV1, text classification, see e.g. Bottou, Stochastic Gradient Descent Tricks
Stefan Kühn (XING) Deep Optimization 26.04.2018 16 / 35
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Stefan Kühn (XING) Deep Optimization 26.04.2018 17 / 35
Advanced Concepts in Mathematical Optimization
Stepsize rules: Dynamically adjust step lengths to speed up
convergence
Preconditioning: Helps with ill-conditioned problems -> pathological
curvature
Damping: A strategy for making ill-posed problems regular -> helps
make local methods (Newton) work globally
Trust region: Determine step length - or radius of trust - first and then
look for good/best descent directions
Relaxation: Relax constraints for better tractability
Combine simple and complex methods: the Levenberg-Marquardt
algorithm combines Gradient Descent and Newton's method, ensuring
global convergence plus fast local convergence (see the sketch below)
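As an illustration of the last point, a minimal sketch of a Levenberg-Marquardt-style step (assuming gradient and Hessian are given; function names are mine): the damping parameter lam blends between a pure Newton step and a short gradient-descent-like step.

```python
import numpy as np

def damped_newton_step(g, H, lam):
    """Solve (H + lam*I) d = -g. lam -> 0 recovers Newton's method
    (fast local convergence); large lam gives a damped gradient step
    (global safety)."""
    return np.linalg.solve(H + lam * np.eye(len(g)), -g)

# usage on an ill-conditioned quadratic f(x) = 0.5 * x' H x
H = np.diag([1.0, 100.0])
x = np.array([1.0, 1.0])
print(damped_newton_step(H @ x, H, lam=0.0))   # full Newton step: -x
print(damped_newton_step(H @ x, H, lam=1e3))   # roughly -grad/1000
```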
Stefan Kühn (XING) Deep Optimization 26.04.2018 18 / 35
Gradient Descent
Minimize a given objective function f:
min f(x), x ∈ R^n
Direction of Steepest Descent, the negative gradient:
d = −∇f(x)
Update in step k:
x_{k+1} = x_k − α ∇f(x_k)
Properties:
always a descent direction, no test needed
locally optimal, globally convergent
works with inexact line search, e.g. Armijo’s rule
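Putting the slide together, here is a compact numpy sketch of gradient descent with Armijo backtracking (my own reference implementation, not code from the talk):

```python
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, beta=0.5, c=1e-4,
                     tol=1e-8, max_iter=1000):
    """Steepest descent with Armijo backtracking line search."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:    # gradient vanished: critical point
            break
        alpha = alpha0
        # Armijo's rule: shrink the step until sufficient decrease holds
        while f(x - alpha * g) > f(x) - c * alpha * (g @ g):
            alpha *= beta
        x = x - alpha * g
    return x

# usage on a simple quadratic: converges to the minimum at the origin
print(gradient_descent(lambda x: 0.5 * x @ x, lambda x: x,
                       np.array([3.0, -4.0])))
```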
Stefan Kühn (XING) Deep Optimization 26.04.2018 19 / 35
Stochastic Gradient Descent
Setting
x model parameters
f(x) := Σ_i f_i(x), the loss function is a sum of individual losses
∇f(x) := Σ_i ∇f_i(x), i = 1, . . . , m (number of training examples)
Choose i and update in step k
x_{k+1} = x_k − α ∇f_i(x_k)
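A minimal numpy sketch of this update (helper names and the least-squares usage are my own):

```python
import numpy as np

def sgd(grad_i, x0, m, alpha=0.01, epochs=10, seed=0):
    """Plain SGD: visit the examples in random order and follow
    the negative gradient of each individual loss f_i."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(epochs):
        for i in rng.permutation(m):
            x -= alpha * grad_i(x, i)
    return x

# usage: least squares, f_i(x) = 0.5 * (a_i @ x - b_i)**2
rng = np.random.default_rng(1)
A = rng.standard_normal((100, 3))
b = A @ np.array([1.0, -2.0, 0.5])
print(sgd(lambda x, i: (A[i] @ x - b[i]) * A[i], np.zeros(3),
          m=100, alpha=0.05, epochs=50))   # recovers [1, -2, 0.5]
```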
Stefan Kühn (XING) Deep Optimization 26.04.2018 20 / 35
Shortcomings of Gradient Descent
local: only local information used
especially: no curvature information used
greedy: prefers high curvature directions
scale invariance: no
James Martens, Deep learning via Hessian-free optimization
Stefan Kühn (XING) Deep Optimization 26.04.2018 21 / 35
Momentum
Update in step k
z_{k+1} = β z_k + ∇f(x_k)
x_{k+1} = x_k − α z_{k+1}
Properties for a quadratic convex objective:
the effective condition number κ improves to its square root
stepsizes can be twice as long
rate of convergence (√κ − 1)/(√κ + 1) instead of (κ − 1)/(κ + 1)
can diverge if β is not properly chosen/adapted
Gabriel Goh, Why momentum really works
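The update rule above fits in a few lines of numpy (a sketch of my own, mirroring the two equations on this slide):

```python
import numpy as np

def momentum(grad, x0, alpha=0.1, beta=0.9, iters=100):
    """Gradient descent with momentum:
       z_{k+1} = beta * z_k + grad(x_k)
       x_{k+1} = x_k - alpha * z_{k+1}"""
    x = np.asarray(x0, dtype=float)
    z = np.zeros_like(x)
    for _ in range(iters):
        z = beta * z + grad(x)
        x = x - alpha * z
    return x

# on a convex quadratic the iterates spiral into the minimum at 0
print(momentum(lambda x: x, np.array([3.0, -4.0])))
```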
Stefan Kühn (XING) Deep Optimization 26.04.2018 22 / 35
Momentum
D E M O
https://distill.pub/2017/momentum/
Stefan Kühn (XING) Deep Optimization 26.04.2018 23 / 35
Adam
Properties:
combines several clever tricks (from Momentum, RMSprop, AdaGrad)
has some similarities to Trust Region methods
empirically proven - best in class (personal opinion)
Kingma, Ba Adam: A method for stochastic optimization
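For reference, the full update from the cited paper in plain numpy (default hyperparameters as published; the helper name is mine):

```python
import numpy as np

def adam(grad, x0, alpha=0.001, beta1=0.9, beta2=0.999,
         eps=1e-8, iters=1000):
    """Adam: per-coordinate step sizes from bias-corrected running
    averages of the gradient (Momentum) and its square (RMSprop)."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)               # first moment estimate
    v = np.zeros_like(x)               # second moment estimate
    for t in range(1, iters + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1**t)     # bias correction
        v_hat = v / (1 - beta2**t)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x
```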
Stefan Kühn (XING) Deep Optimization 26.04.2018 24 / 35
SGD, Momentum and more
D E M O
Visualization of algorithms - by Sebastian Ruder
Stefan Kühn (XING) Deep Optimization 26.04.2018 25 / 35
Beyond Adam
Adam has problems (and it’s not Eve)
Parameters are coupled
Some results indicate that Adam does not have the best generalization
properties
It's a heuristic -> convergence guarantee?
And Adam has friends!
New variants that decouple parameters
Combine Adam - better at early training stages - and SGD - better
generalization properties
This also helps with convergence!
Wilson et al The Marginal Value of Adaptive Gradient Methods in Machine Learning
Keskar, Socher Improving Generalization Performance by Switching from Adam to SGD
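A hedged PyTorch sketch of the Adam-to-SGD switch from Keskar/Socher, simplified to a fixed switch epoch instead of their automatic switching criterion; model, data and hyperparameters are made up for illustration:

```python
import torch
from torch import nn, optim

model = nn.Linear(10, 1)
loss_fn = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

optimizer = optim.Adam(model.parameters(), lr=1e-3)   # fast early progress
switch_epoch = 50             # simplification: fixed switch point

for epoch in range(100):
    if epoch == switch_epoch:
        # switch to SGD with momentum for better generalization
        optimizer = optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
```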
Stefan Kühn (XING) Deep Optimization 26.04.2018 26 / 35
Higher-Order Methods
Second-Order Methods
Require the Hessian to exist and use it to scale gradients accordingly;
very successful but computationally expensive
Classical Newton Method: fast local convergence, no global
convergence
Relaxed Newton Methods: help with global convergence
Damped Newton Methods: help with global convergence
Modified Newton Methods: help with computational complexity
Quasi-Newton Methods: help with computational complexity
Nonlinear Conjugate Gradient Methods: iteratively build an approximation
to the Hessian
But there is a lot more to explore, e.g. the basin-hopping algorithm - a strategy for finding global optima - or
derivative-free methods like Nelder-Mead (downhill simplex), Particle Swarm Optimization (PSO and its variants)
Stefan Kühn (XING) Deep Optimization 26.04.2018 27 / 35
L-BFGS and Nonlinear CG
Observations so far:
The better the method, the more parameters to tune.
All better methods try to incorporate curvature information.
Why not do so directly?
L-BFGS
Quasi-Newton method, builds an approximation of the (inverse) Hessian
and scales gradient accordingly.
Nonlinear CG
Informally speaking, Nonlinear CG tries to solve a quadratic approximation
of the function.
No surprise: They also work with minibatches.
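Both methods are one flag away in scipy (a small sketch; the Rosenbrock test problem is my choice, not from the talk):

```python
import numpy as np
from scipy.optimize import minimize

# Rosenbrock's function: a classic ill-conditioned test problem
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([
    -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
    200 * (x[1] - x[0]**2),
])

res = minimize(f, x0=np.array([-1.2, 1.0]), jac=grad, method="L-BFGS-B")
print(res.x)   # -> approximately [1, 1]
# method="CG" runs nonlinear conjugate gradients instead
```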
Stefan Kühn (XING) Deep Optimization 26.04.2018 28 / 35
Empirical results
Empirical evidence for better optimizers being better learners.
MNIST, handwritten digit recognition, from Ng et al., On Optimization Methods for Deep Learning
Stefan Kühn (XING) Deep Optimization 26.04.2018 29 / 35
Truncated Newton: Hessian-Free Optimization
Main ideas:
Approximate not the Hessian H itself, but the matrix-vector product Hd.
Use finite differences instead of the exact Hessian.
Use damping.
Use the linear CG method to solve the quadratic approximation.
Use a clever mini-batch strategy for large datasets.
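A compact numpy sketch of the first four ideas (my own simplification: full-batch gradients, fixed damping, a fixed CG budget): the Hessian-vector product comes from finite differences of the gradient, so the Hessian is never formed, and linear CG approximately solves the damped quadratic model.

```python
import numpy as np

def hvp(grad, x, v, eps=1e-6):
    """Hessian-vector product via finite differences:
    H v ~ (grad(x + eps*v) - grad(x)) / eps."""
    return (grad(x + eps * v) - grad(x)) / eps

def truncated_newton_step(grad, x, lam=1e-3, cg_iters=10):
    """Approximately solve (H + lam*I) d = -grad(x) with linear CG."""
    g = grad(x)
    d = np.zeros_like(x)
    r = -g                       # residual of the linear system at d = 0
    p = r.copy()
    for _ in range(cg_iters):
        Ap = hvp(grad, x, p) + lam * p    # damping: (H + lam*I) p
        alpha = (r @ r) / (p @ Ap)
        d = d + alpha * p
        r_new = r - alpha * Ap
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    return d
```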
Stefan Kühn (XING) Deep Optimization 26.04.2018 30 / 35
Empirical test on pathological problems
Main results:
The addition problem is known to be effectively impossible for
gradient descent; HF solved it.
Basic RNN cells are used, no specialized architectures (LSTMs etc.).
(Martens/Sutskever 2011; Hochreiter/Schmidhuber 1997)
Stefan Kühn (XING) Deep Optimization 26.04.2018 31 / 35
1 Training Deep Networks
2 Training and Learning
3 The Toolbox of Optimization Methods
4 Takeaways
Stefan Kühn (XING) Deep Optimization 26.04.2018 32 / 35
Summary
In the long run, the biggest bottleneck will be the sequential parts of an
algorithm. That is why the number of iterations needs to be small. SGD and
its successors tend to need many more iterations, and they cannot benefit
as much from higher parallelism (GPUs).
But whatever you do/prefer/choose:
At least try out the successors of SGD: Momentum, Adam etc.
Look for generic approaches instead of ever more specialized,
manually fine-tuned solutions.
Key aspects:
Initialization
Adaptive choice of stepsizes/momentum/. . .
Scaling of the gradient
Stefan Kühn (XING) Deep Optimization 26.04.2018 33 / 35
Resources
Overview of Gradient Descent methods
Why momentum really works
Adam - A Method for Stochastic Optimization
Mathematics of Deep Learning
The Marginal Value of Adaptive Gradient Methods in Machine
Learning
Andrew Ng et al. about L-BFGS and CG outperforming SGD
Lecture Slides Neural Networks for Machine Learning - Hinton et al.
On the importance of initialization and momentum in deep learning
Data-Science-Blog: Summary article in preparation (Stefan Kühn)
The Neural Network Zoo
Stefan Kühn (XING) Deep Optimization 26.04.2018 34 / 35
Thank you!
Stefan Kühn (XING) Deep Optimization 26.04.2018 35 / 35