Optimization for ML
CS771: Intro to ML
Functions and their optima
 Many ML problems require us to optimize a function of some variable(s)
 For simplicity, assume f(x) is a scalar-valued function of a scalar x
 Any function has one or more optima (maxima, minima), and possibly saddle points
[Figure: a curve f(x) plotted against x, with its global maximum, two local maxima, its global minimum, and three local minima labelled. Will see what saddle points are later.]
Usually we are interested in the global optima, but often want to find local optima too.
Here f(x) is the objective function of the ML problem we are solving (e.g., squared loss for regression).
Assume x is unconstrained for now, i.e., just a real-valued number/vector.
For deep learning models, often the local optima are what we can find (and they usually suffice) – more on this later.
Derivatives
 The magnitude of the derivative at a point is the rate of change of the function at that point
 The derivative becomes zero at stationary points (optima or saddle points)
$$f'(x) = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x}$$
[Figure: f(x) plotted against x, showing a small change Δx in x and the resulting change Δf(x) in the function's value.]
Sign is also important: a positive derivative means f is increasing at x if we increase the value of x by a very small amount; a negative derivative means it is decreasing.
Understanding how f changes its value as we change x is helpful for understanding optimization (minimization/maximization) algorithms.
Will sometimes use f'(x) to denote the derivative.
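To make this concrete, here is a small numerical sketch (not from the slides; the example function, evaluation points, and step size Δx are arbitrary choices): it approximates the derivative by the ratio Δf(x)/Δx and uses its sign to say whether f is increasing or decreasing at a point.

```python
import numpy as np

def f(x):
    # Arbitrary example function (an assumption for illustration)
    return x**3 - 3*x

def derivative(f, x, dx=1e-6):
    # Finite-difference approximation: Delta f(x) / Delta x
    return (f(x + dx) - f(x)) / dx

for x0 in [-2.0, 0.0, 2.0]:
    d = derivative(f, x0)
    trend = "increasing" if d > 0 else "decreasing"
    print(f"f'({x0}) ~ {d:.3f} -> f is {trend} at x = {x0}")
```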
Rules of Derivatives
Some basic rules of taking derivatives:
 Sum Rule: $\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$
 Scaling Rule: $\frac{d}{dx}[a \cdot f(x)] = a \cdot f'(x)$, if $a$ is not a function of $x$
 Product Rule: $\frac{d}{dx}[f(x)\,g(x)] = f'(x)\,g(x) + f(x)\,g'(x)$
 Quotient Rule: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$
 Chain Rule: $\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x)$
We already used some of these (sum,
scaling and chain) when calculating the
derivative for the linear regression
model
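These rules can also be verified mechanically with a symbolic package. A minimal SymPy sketch (the particular f and g are arbitrary examples, not from the lecture):

```python
import sympy as sp

x = sp.symbols('x')
f, g = sp.sin(x), sp.exp(2 * x)

# Sum rule: (f + g)' = f' + g'
assert sp.simplify(sp.diff(f + g, x) - (sp.diff(f, x) + sp.diff(g, x))) == 0

# Product rule: (f g)' = f' g + f g'
assert sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + f * sp.diff(g, x))) == 0

# Chain rule: d/dx f(g(x)) = f'(g(x)) g'(x)
composed = f.subs(x, g)  # sin(exp(2x))
assert sp.simplify(sp.diff(composed, x) - sp.diff(f, x).subs(x, g) * sp.diff(g, x)) == 0

print("sum, product and chain rules hold for this example")
```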
Derivatives
 How the derivative itself changes tells us about the function's optima
 The second derivative can provide this information
First-derivative (sign change) test:
 f'(x) = 0 at x, f'(x) > 0 just before x, f'(x) < 0 just after x: x is a maxima
 f'(x) = 0 at x, f'(x) < 0 just before x, f'(x) > 0 just after x: x is a minima
 f'(x) = 0 at x, and f'(x) does not change sign across x: x may be a saddle point
Second-derivative test:
 f'(x) = 0 and f''(x) < 0: x is a maxima
 f'(x) = 0 and f''(x) > 0: x is a minima
 f'(x) = 0 and f''(x) = 0: x may be a saddle point; may need higher derivatives
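As a quick illustration of these tests, here is a sketch (the example function and finite-difference steps are my own choices) that approximates f' and f'' at candidate points and applies the second-derivative test:

```python
import numpy as np

def f(x):
    return x**4 - 2 * x**2     # stationary points at x = -1, 0, 1

def d1(f, x, h=1e-4):
    # Central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    # Central-difference estimate of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

for x0 in [-1.0, 0.0, 1.0]:
    g, H = d1(f, x0), d2(f, x0)
    if abs(g) < 1e-6:                       # stationary point
        if H > 0:
            kind = "a minima"
        elif H < 0:
            kind = "a maxima"
        else:
            kind = "possibly a saddle (need higher derivatives)"
        print(f"x = {x0}: f' ~ 0, f'' ~ {H:.2f} -> {kind}")
```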
Saddle Points
 Points where the derivative is zero but which are neither minima nor maxima
 Saddle points are very common for loss functions of deep learning models
 Need to be handled carefully during optimization
 Second or higher derivatives may help identify whether a stationary point is a saddle
A saddle is a point of inflection where the derivative is also zero.
[Figure: a function with a saddle point marked.]
Multivariate Functions
 Most functions that we see in ML are multivariate functions
 Example: the loss function in linear regression was a multivariate function of the D-dimensional weight vector
 Here is an illustration of a function of 2 variables (4 maxima and 5 minima)
[Figure: surface plot of the function, with its two-dim contour plot (i.e., what it looks like from above).]
courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html
Derivatives of Multivariate Functions
 Can define the derivative for a multivariate function as well, via the gradient
 The gradient of a function is a vector of partial derivatives
 Optima and saddle points are defined similarly to the one-dim case
 The required properties that we saw for the one-dim case must be satisfied along all directions
 The second derivative in this case is known as the Hessian
$$\nabla f(\boldsymbol{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_D} \right)$$
Each element in this gradient vector tells us how much f will change if we move a little along the corresponding coordinate (akin to the one-dim case).
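A sketch of forming the gradient numerically, one partial derivative per coordinate (the two-variable function here is an arbitrary example I chose, not the one from the earlier illustration):

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + 3*x1*x2 + 2*x2^2 (arbitrary example)
    return x[0]**2 + 3 * x[0] * x[1] + 2 * x[1]**2

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for d in range(len(x)):
        e = np.zeros_like(x)
        e[d] = h                                 # perturb only coordinate d
        grad[d] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))  # analytic gradient (2*x1 + 3*x2, 3*x1 + 4*x2) = (8, 11)
```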
The Hessian
 For a multivariate scalar-valued function f(x), the Hessian is a D × D matrix
 The Hessian matrix can be used to assess the optima/saddle points
 If the gradient is 0 and the Hessian is a positive semi-definite (PSD) matrix, then x is a minima
 If the gradient is 0 and the Hessian is a negative semi-definite (NSD) matrix, then x is a maxima
$$\nabla^2 f(\boldsymbol{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2} \end{bmatrix}$$
Note: if the function itself is vector-valued (say with K output dimensions), then we will have K such Hessian matrices, one for each output dimension of the function.
The Hessian gives information about the curvature of the function at the point x.
A square, symmetric matrix M is PSD if v⊤Mv ≥ 0 for every vector v (and NSD if v⊤Mv ≤ 0 for every v). Equivalently, M is PSD if all of its eigenvalues are non-negative.
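A sketch that estimates the Hessian of a simple quadratic numerically and checks the eigenvalue condition (the function, point, and tolerance are my own choices):

```python
import numpy as np

def f(x):
    # Convex quadratic: x1^2 + x1*x2 + 2*x2^2, whose Hessian is [[2, 1], [1, 4]]
    return x[0]**2 + x[0] * x[1] + 2 * x[1]**2

def numerical_hessian(f, x, h=1e-4):
    D = len(x)
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            ei, ej = np.zeros(D), np.zeros(D)
            ei[i], ej[j] = h, h
            # Mixed second partial derivative via central differences
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([0.0, 0.0]))
eigvals = np.linalg.eigvalsh(H)               # works since the Hessian is symmetric
print("eigenvalues:", eigvals)
print("PSD, so the stationary point at (0, 0) is a minima:", bool(np.all(eigvals >= -1e-8)))
```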
Convex and Non-Convex Functions
 A function being optimized can be either convex or non-convex
 Here are a couple of examples of convex functions
 Here are a couple of examples of non-convex functions
[Figures: example plots of convex and non-convex functions.]
Convex functions are bowl-shaped: any local minimum is also the global minimum.
The negative of a convex function is called a concave function; for a concave function, any local maximum is also the global maximum.
Non-convex functions can have multiple local minima. They are usually harder to optimize than convex functions.
Loss functions of most
deep learning models
are non-convex
Convex Sets
 A set S of points is a convex set if, for any two points x, y ∈ S and any 0 ≤ α ≤ 1,
$$z = \alpha x + (1 - \alpha) y \in S$$
 The above means that all points on the line segment between x and y lie within S
 The domain of a convex function needs to be a convex set
z is also called a "convex combination" of the two points x and y.
A convex combination of N points can be defined likewise as $\sum_{i=1}^{N} \alpha_i x_i$ with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$.
Convex Functions
 Informally, f is convex if all of its chords lie above the function everywhere
 Formally (assuming a differentiable function), some tests for convexity:
 First-order convexity: f(y) ≥ f(x) + ∇f(x)⊤(y − x) for all x, y (the graph of f must lie above all of its tangents)
Exercise: show that the ridge regression objective is convex.
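As a sanity check related to the exercise (not a proof), here is a sketch that numerically tests the first-order convexity condition for the ridge objective at random pairs of points; the data X, y and the regularization weight lam are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 0.1                      # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def ridge_loss(w):
    return np.sum((y - X @ w)**2) + lam * np.sum(w**2)

def ridge_grad(w):
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

# First-order convexity: L(v) >= L(w) + grad L(w)^T (v - w) for all w, v
for _ in range(1000):
    w, v = rng.normal(size=D), rng.normal(size=D)
    assert ridge_loss(v) >= ridge_loss(w) + ridge_grad(w) @ (v - w) - 1e-8

print("first-order convexity condition held at all sampled pairs")
```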
Optimization Using First-Order Optimality
 Very simple. We already used this approach for linear and ridge regression
 First-order optimality: the gradient must be equal to zero at the optima, ∇f(w) = 0
 Sometimes, setting ∇f(w) = 0 and solving for w gives a closed-form solution
 If a closed-form solution is not available, the gradient can still be used in iterative methods such as gradient descent (next)
This approach works only for very simple problems where the objective is convex and there are no constraints on the values w can take.
Called “first order” since only
gradient is used and gradient
provides the first order info about
the function being optimized
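For instance, for ridge regression the first-order optimality condition can be solved in closed form. A sketch (with assumed random data and regularization weight) that sets the gradient to zero, solves for w, and verifies the gradient indeed vanishes there:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 50, 4, 0.5                      # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def grad(w):
    # Gradient of the ridge objective ||y - Xw||^2 + lam * ||w||^2
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

# Setting grad(w) = 0 gives (X^T X + lam I) w = X^T y, a linear system we can solve
w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print("gradient norm at w*:", np.linalg.norm(grad(w_star)))  # ~0 up to numerical precision
```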
Optimization via Gradient Descent
 Initialize w to w^(0)
 For iteration t = 0, 1, 2, … (or until convergence):
 Calculate the gradient g^(t) = ∇L(w^(t)) using the current iterate
 Set the learning rate η_t
 Move in the opposite direction of the gradient:
Gradient Descent update:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \, \mathbf{g}^{(t)}$$
Can I use this approach to solve maximization problems? For maximization problems we can use gradient ascent, which moves in the direction of the gradient instead.
The method is iterative since it requires several steps/iterations to find the optimal solution.
For convex functions, GD will converge to the global minima; for non-convex functions, good initialization is needed.
The learning rate η_t is very important and should be set carefully (fixed or chosen adaptively). Will discuss some strategies later.
Sometimes it may be tricky to assess convergence. Will see some methods later.
Fact: the gradient gives the direction of steepest change in the function's value (will see the justification shortly).
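A minimal gradient-descent sketch on the same kind of ridge objective (the data, the fixed learning rate, and the iteration budget are my own choices; in practice the learning rate may be chosen adaptively, as noted above):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 50, 4, 0.5                       # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def grad(w):
    # Gradient of ||y - Xw||^2 + lam * ||w||^2
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

w = np.zeros(D)                              # initialization w^(0)
eta = 0.005                                  # fixed learning rate (could also decay with t)
for t in range(500):
    g = grad(w)                              # gradient at the current iterate
    if np.linalg.norm(g) < 1e-8:             # simple convergence check
        break
    w = w - eta * g                          # move opposite to the gradient

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print("GD matches the closed-form solution:", np.allclose(w, w_closed, atol=1e-4))
```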
Gradient Descent: An Illustration
[Figure: two plots of a loss L(w) against w, showing iterates w^(0), w^(1), w^(2), w^(3) approaching an optimum w*. Where the gradient is negative we move in the positive direction; where it is positive we move in the negative direction. In the second plot, a poor start leaves GD stuck at a local minima.]
The learning rate is very important; good initialization is also very important.
Reference
• CS771: Introduction to Machine Learning by Prof. Nisheeth – nice reference for today's material.
• For those interested in a deeper dive into the math, see Ch. 3 in this book.