The Foundations of (Machine) Learning
Stefan Kühn
codecentric AG
CSMLS Meetup Hamburg - February 26th, 2015
Contents
1 Supervised Machine Learning
2 Optimization Theory
3 Concrete Optimization Methods
1 Supervised Machine Learning
Setting
Supervised Learning Approach
Use labeled training data to fit a given model, i.e. to learn from the given data.
Typical Problems:
Classification - discrete output
Logistic Regression
Neural Networks
Support Vector Machines
Regression - continuous output
Linear Regression
Support Vector Regression
Generalized Linear/Additive Models
Training and Learning
Ingredients:
Training Data Set
Model, e.g. Logistic Regression
Error Measure, e.g. Mean Squared Error
Learning Procedure:
Derive objective function from Model and Error Measure
Initialize Model parameters
Find a good fit!
Iterate with other initial parameters
What is Learning in this context?
Learning is nothing but the application of an algorithm for unconstrained
optimization to the given objective function.
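To make this concrete, here is a minimal sketch in Python of how a model and an error measure combine into an objective function that an optimizer can then minimize; the linear model, the MSE choice, and the helper names (`predict`, `objective`, `gradient`) are illustrative assumptions, not part of the slides:

```python
import numpy as np

def predict(theta, X):
    """Hypothetical linear model: predictions are X @ theta."""
    return X @ theta

def objective(theta, X, y):
    """Error measure: mean squared error between predictions and labels."""
    residuals = predict(theta, X) - y
    return np.mean(residuals ** 2)

def gradient(theta, X, y):
    """Gradient of the MSE objective with respect to theta."""
    residuals = predict(theta, X) - y
    return 2.0 / len(y) * X.T @ residuals

# "Learning" then means: choose initial parameters and hand
# objective/gradient to an unconstrained optimization algorithm.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
theta0 = np.zeros(3)
print(objective(theta0, X, y))
```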
2 Optimization Theory
Unconstrained Optimization
Higher-order methods
Newton’s method (fast local convergence)
Gradient-based methods
Gradient Descent / Steepest Descent (globally convergent)
Conjugate Gradient (globally convergent)
Gauß-Newton, Levenberg-Marquardt, Quasi-Newton
Krylov subspace methods
Derivative-free methods, direct search
Secant method (locally convergent)
Regula Falsi and successors (global convergence, typically slow)
Nelder-Mead / Downhill-Simplex
unconventional method, creates a moving simplex
driven by reflection/contraction/expansion of the corner points
convergence guarantees only under restrictive assumptions; may fail even for smooth functions f ∈ C^1
General Iterative Algorithmic Scheme
Goal: Minimize a given function f:
min f(x), x ∈ R^n
Iterative Algorithms
Starting from a given point, an iterative algorithm tries to minimize the objective function step by step.
Preparation: k = 0
Initialization: Choose initial points and parameters
Iterate until convergence: k = 1, 2, 3, . . .
Termination criterion: Check optimality of the current iterate
Descent Direction: Find reasonable search direction
Stepsize: Determine length of the step in the given direction
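The scheme translates into a generic loop; a minimal sketch, where `search_direction`, `step_size`, and `converged` are hypothetical callables standing in for the concrete choices discussed on the following slides:

```python
import numpy as np

def minimize(f, grad, x0, search_direction, step_size, converged, maxiter=1000):
    """Generic iterative minimization scheme:
    check termination, pick a descent direction, pick a stepsize, update."""
    x = np.asarray(x0, dtype=float)
    for k in range(maxiter):
        g = grad(x)
        if converged(x, g):              # termination criterion
            break
        d = search_direction(x, g)       # descent direction
        t = step_size(f, grad, x, d)     # stepsize (line search)
        x = x + t * d                    # update the iterate
    return x
```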
Termination criteria
Critical points x∗:
∇f(x∗) = 0
Gradient: Should converge to zero
‖∇f(x_k)‖ < tol
Iterates: Distance between x_k and x_{k+1} should converge to zero
‖x_k − x_{k+1}‖ < tol
Function Values: Difference between f(x_k) and f(x_{k+1}) should converge to zero
|f(x_k) − f(x_{k+1})| < tol
Number of iterations: Terminate after maxiter iterations
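One possible way to combine these criteria in code; the common tolerance and the or-combination are assumptions made for illustration:

```python
import numpy as np

def converged(x_old, x_new, f_old, f_new, g_new, tol=1e-6):
    """Return True if any of the standard termination criteria fires."""
    small_gradient = np.linalg.norm(g_new) < tol          # ||grad f(x_k)|| < tol
    small_step     = np.linalg.norm(x_new - x_old) < tol  # ||x_k - x_{k+1}|| < tol
    small_decrease = abs(f_old - f_new) < tol             # |f(x_k) - f(x_{k+1})| < tol
    return small_gradient or small_step or small_decrease
```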
Descent direction
Geometric interpretation
d is a descent direction if and only if the angle α between the gradient ∇f(x) and d is in a certain range:
π/2 = 90° < α < 270° = 3π/2
Algebraic equivalent
The sign of the scalar product between two vectors a and b is determined by the cosine of the angle α between a and b:
⟨a, b⟩ = a^T b = ‖a‖ ‖b‖ cos α(a, b)
d is a descent direction if and only if:
d^T ∇f(x) < 0
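The algebraic characterization is directly checkable in code; a small sketch:

```python
import numpy as np

def is_descent_direction(grad_x, d):
    """d is a descent direction at x iff d^T grad f(x) < 0."""
    return float(np.dot(d, grad_x)) < 0.0

# The negative gradient always passes the test (as long as grad f(x) != 0):
g = np.array([1.0, -2.0])
print(is_descent_direction(g, -g))  # True
```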
Stepsize
Armijo’s rule
Takes two parameters 0 < σ < 1 and 0 < ρ < 0.5.
For ℓ = 0, 1, 2, . . . test the Armijo condition:
f(p + σ^ℓ d) < f(p) + ρ σ^ℓ d^T ∇f(p)
Accepted stepsize
The first ℓ that passes this test determines the accepted stepsize:
t = σ^ℓ
Standard Armijo implies that the accepted stepsize always satisfies t ≤ 1, so it is only semi-efficient.
Technical detail: Widening
Test whether some t > 1 satisfies the Armijo condition, i.e. check ℓ = −1, −2, . . . as well; this ensures efficiency.
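A minimal backtracking sketch of Armijo's rule including the widening step; the parameter defaults, the iteration caps, and the helper name `armijo_stepsize` are assumptions made for illustration:

```python
import numpy as np

def armijo_stepsize(f, grad, p, d, sigma=0.5, rho=1e-2, max_tries=50):
    """Backtracking: return t = sigma**l for the first l = 0, 1, 2, ...
    satisfying f(p + t*d) < f(p) + rho * t * d^T grad f(p)."""
    fp = f(p)
    slope = rho * np.dot(d, grad(p))   # rho * d^T grad f(p), negative for a descent d
    for l in range(max_tries):
        t = sigma ** l
        if f(p + t * d) < fp + t * slope:
            if l == 0:
                # Widening: the full step t = 1 passed, so also try
                # l = -1, -2, ... (i.e. t = 1/sigma, 1/sigma**2, ... > 1)
                # as long as the Armijo condition still holds.
                while t < 1e6 and f(p + (t / sigma) * d) < fp + (t / sigma) * slope:
                    t = t / sigma
            return t
    return sigma ** max_tries  # fallback: accept a very small step
```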
3 Concrete Optimization Methods
Gradient Descent
Descent direction
Direction of Steepest Descent, the negative gradient:
d = −∇f(x)
Motivation:
corresponds to α = 180° = π
obvious choice, always a descent direction, no test needed
guarantees the quickest win locally
works with inexact line search, e.g. Armijo’s rule
works for functions f ∈ C^1
always solves the auxiliary optimization problem
min s^T ∇f(x), s ∈ R^n, ‖s‖ = 1
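Putting the pieces together, a sketch of steepest descent with an Armijo line search, assuming the hypothetical `armijo_stepsize` helper from the earlier sketch is in scope:

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-6, maxiter=10_000):
    """Steepest descent: d = -grad f(x), stepsize from Armijo backtracking."""
    x = np.asarray(x0, dtype=float)
    for _ in range(maxiter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient-based termination
            break
        d = -g                           # direction of steepest descent
        t = armijo_stepsize(f, grad, x, d)
        x = x + t * d
    return x

# Example: minimize the simple quadratic f(x) = ||x||^2.
f = lambda x: float(np.dot(x, x))
grad = lambda x: 2.0 * x
print(gradient_descent(f, grad, np.array([3.0, -4.0])))
```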
Conjugate Gradient
Motivation: Quadratic Model Problem, minimize
f(x) = ‖Ax − b‖^2
Optimality condition:
∇f(x∗) = 2A^T (Ax∗ − b) = 0
Obvious approach: solve the system of linear equations
A^T A x = A^T b
Descent direction
Consecutive directions d_i, . . . , d_{i+k} satisfy certain orthogonality or conjugacy conditions, with M = A^T A symmetric positive definite:
d_i^T M d_j = 0, i ≠ j
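For this quadratic model problem, the classical (linear) conjugate gradient method can be applied to the normal equations without ever forming M = A^T A explicitly; a minimal sketch under that assumption:

```python
import numpy as np

def cg_normal_equations(A, b, tol=1e-10, maxiter=None):
    """Conjugate gradient on the normal equations A^T A x = A^T b,
    i.e. the minimizer of f(x) = ||Ax - b||^2."""
    m, n = A.shape
    maxiter = maxiter or n
    x = np.zeros(n)
    r = A.T @ b - A.T @ (A @ x)   # residual of the normal equations
    d = r.copy()                  # first search direction
    rs_old = r @ r
    for _ in range(maxiter):
        Md = A.T @ (A @ d)        # M d with M = A^T A, applied without forming M
        alpha = rs_old / (d @ Md) # exact stepsize for the quadratic
        x += alpha * d
        r -= alpha * Md
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d   # new direction, conjugate w.r.t. M
        rs_old = rs_new
    return x

# Example: least-squares fit, compared with numpy's reference solver.
rng = np.random.default_rng(1)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
print(np.allclose(cg_normal_equations(A, b), np.linalg.lstsq(A, b, rcond=None)[0]))
```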
Nonlinear Conjugate Gradient
Initial Steps:
start at point x_0 with d_0 = −∇f(x_0)
perform exact line search, find t_0 = arg min_{t>0} f(x_0 + t d_0)
set x_1 = x_0 + t_0 d_0
Iteration:
set ∆_k = −∇f(x_k)
compute β_k via one of the available formulas (next slide)
update conjugate search direction d_k = ∆_k + β_k d_{k−1}
perform exact line search, find t_k = arg min_{t>0} f(x_k + t d_k)
set x_{k+1} = x_k + t_k d_k
Nonlinear Conjugate Gradient
Formulas for β_k:
Fletcher-Reeves: β_k^FR = (∆_k^T ∆_k) / (∆_{k−1}^T ∆_{k−1})
Polak-Ribière: β_k^PR = (∆_k^T (∆_k − ∆_{k−1})) / (∆_{k−1}^T ∆_{k−1})
Hestenes-Stiefel: β_k^HS = −(∆_k^T (∆_k − ∆_{k−1})) / (s_{k−1}^T (∆_k − ∆_{k−1}))
Dai-Yuan: β_k^DY = −(∆_k^T ∆_k) / (s_{k−1}^T (∆_k − ∆_{k−1}))
Reasonable choice with automatic direction reset:
β = max{0, β_k^PR}
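A sketch of the full nonlinear CG iteration using the Polak-Ribière formula with the max(0, ·) reset; using scipy's one-dimensional minimizer as a stand-in for the exact line search, and the bound on t, are my own assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def beta_pr(delta_k, delta_km1):
    """Polak-Ribière formula with automatic direction reset: max(0, beta^PR)."""
    return max(0.0, delta_k @ (delta_k - delta_km1) / (delta_km1 @ delta_km1))

def nonlinear_cg(f, grad, x0, tol=1e-6, maxiter=1000):
    """Nonlinear conjugate gradient with Polak-Ribière(+) updates and an
    approximate 'exact' line search via bounded one-dimensional minimization."""
    x = np.asarray(x0, dtype=float)
    delta = -grad(x)                 # Delta_0 = -grad f(x_0)
    d = delta.copy()
    for _ in range(maxiter):
        if np.linalg.norm(delta) < tol:
            break
        # line search: t_k = argmin_{t>0} f(x_k + t d_k)
        t = minimize_scalar(lambda t: f(x + t * d),
                            bounds=(0.0, 10.0), method="bounded").x
        x = x + t * d
        delta_new = -grad(x)         # Delta_{k+1} = -grad f(x_{k+1})
        beta = beta_pr(delta_new, delta)
        d = delta_new + beta * d     # conjugate search direction update
        delta = delta_new
    return x

# Example: the Rosenbrock function.
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                           200*(x[1] - x[0]**2)])
print(nonlinear_cg(f, grad, np.array([-1.2, 1.0])))
```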