Optimization for ML
CS771: Intro to ML
Functions and their optima
 Many ML problems require us to optimize a function of some variable(s)
 For simplicity, assume f(x) is a scalar-valued function of a scalar x
 Any function has one or more optima (maxima, minima), and possibly saddle points
[Figure: a curve f(x) plotted against x, with its global maximum, two local maxima, its global minimum, and three local minima labelled. Will see what saddle points are later.]
Usually we are interested in the global optima, but often want to find local optima too.
Here f(x) is the objective function of the ML problem we are solving (e.g., squared loss for regression).
Assume x is unconstrained for now, i.e., just a real-valued number/vector.
For deep learning models, often the local optima are what we can find (and they usually suffice) – more on this later.
Derivatives
 The magnitude of the derivative at a point is the rate of change of the function at that point
 The derivative becomes zero at stationary points (optima or saddle points)
$$f'(x) = \lim_{\Delta x \to 0} \frac{\Delta f(x)}{\Delta x}$$
[Figure: f(x) plotted against x, showing a small change Δx in x and the resulting change Δf(x) in the function's value.]
Sign is also important: a positive derivative means f is increasing at x if we increase the value of x by a very small amount; a negative derivative means it is decreasing.
Understanding how f changes its value as we change x is helpful for understanding optimization (minimization/maximization) algorithms.
Will sometimes use f'(x) to denote the derivative.
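To make this concrete, here is a small numerical sketch (not from the slides; the example function, evaluation points, and step size Δx are arbitrary choices): it approximates the derivative by the ratio Δf(x)/Δx and uses its sign to say whether f is increasing or decreasing at a point.

```python
import numpy as np

def f(x):
    # Arbitrary example function (an assumption for illustration)
    return x**3 - 3*x

def derivative(f, x, dx=1e-6):
    # Finite-difference approximation: Delta f(x) / Delta x
    return (f(x + dx) - f(x)) / dx

for x0 in [-2.0, 0.0, 2.0]:
    d = derivative(f, x0)
    trend = "increasing" if d > 0 else "decreasing"
    print(f"f'({x0}) ~ {d:.3f} -> f is {trend} at x = {x0}")
```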
Rules of Derivatives
Some basic rules of taking derivatives:
 Sum Rule: $\frac{d}{dx}[f(x) + g(x)] = f'(x) + g'(x)$
 Scaling Rule: $\frac{d}{dx}[a \cdot f(x)] = a \cdot f'(x)$, if $a$ is not a function of $x$
 Product Rule: $\frac{d}{dx}[f(x)\,g(x)] = f'(x)\,g(x) + f(x)\,g'(x)$
 Quotient Rule: $\frac{d}{dx}\left[\frac{f(x)}{g(x)}\right] = \frac{f'(x)\,g(x) - f(x)\,g'(x)}{g(x)^2}$
 Chain Rule: $\frac{d}{dx} f(g(x)) = f'(g(x))\, g'(x)$
We already used some of these (sum,
scaling and chain) when calculating the
derivative for the linear regression
model
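These rules can also be verified mechanically with a symbolic package. A minimal SymPy sketch (the particular f and g are arbitrary examples, not from the lecture):

```python
import sympy as sp

x = sp.symbols('x')
f, g = sp.sin(x), sp.exp(2 * x)

# Sum rule: (f + g)' = f' + g'
assert sp.simplify(sp.diff(f + g, x) - (sp.diff(f, x) + sp.diff(g, x))) == 0

# Product rule: (f g)' = f' g + f g'
assert sp.simplify(sp.diff(f * g, x) - (sp.diff(f, x) * g + f * sp.diff(g, x))) == 0

# Chain rule: d/dx f(g(x)) = f'(g(x)) g'(x)
composed = f.subs(x, g)  # sin(exp(2x))
assert sp.simplify(sp.diff(composed, x) - sp.diff(f, x).subs(x, g) * sp.diff(g, x)) == 0

print("sum, product and chain rules hold for this example")
```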
Derivatives
 How the derivative itself changes tells us about the function's optima
 The second derivative can provide this information
First-derivative (sign change) test:
 f'(x) = 0 at x, f'(x) > 0 just before x, f'(x) < 0 just after x: x is a maxima
 f'(x) = 0 at x, f'(x) < 0 just before x, f'(x) > 0 just after x: x is a minima
 f'(x) = 0 at x, and f'(x) does not change sign across x: x may be a saddle point
Second-derivative test:
 f'(x) = 0 and f''(x) < 0: x is a maxima
 f'(x) = 0 and f''(x) > 0: x is a minima
 f'(x) = 0 and f''(x) = 0: x may be a saddle point; may need higher derivatives
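As a quick illustration of these tests, here is a sketch (the example function and finite-difference steps are my own choices) that approximates f' and f'' at candidate points and applies the second-derivative test:

```python
import numpy as np

def f(x):
    return x**4 - 2 * x**2     # stationary points at x = -1, 0, 1

def d1(f, x, h=1e-4):
    # Central-difference estimate of f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

def d2(f, x, h=1e-4):
    # Central-difference estimate of f''(x)
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

for x0 in [-1.0, 0.0, 1.0]:
    g, H = d1(f, x0), d2(f, x0)
    if abs(g) < 1e-6:                       # stationary point
        if H > 0:
            kind = "a minima"
        elif H < 0:
            kind = "a maxima"
        else:
            kind = "possibly a saddle (need higher derivatives)"
        print(f"x = {x0}: f' ~ 0, f'' ~ {H:.2f} -> {kind}")
```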
Saddle Points
 Points where the derivative is zero but which are neither minima nor maxima
 Saddle points are very common for loss functions of deep learning models
 Need to be handled carefully during optimization
 Second or higher derivatives may help identify whether a stationary point is a saddle
A saddle is a point of inflection where the derivative is also zero.
[Figure: a function with a saddle point marked.]
Multivariate Functions
 Most functions that we see in ML are multivariate functions
 Example: the loss function in linear regression was a multivariate function of the D-dimensional weight vector
 Here is an illustration of a function of 2 variables (4 maxima and 5 minima)
[Figure: surface plot of the function, with its two-dim contour plot (i.e., what it looks like from above).]
courtesy: http://benchmarkfcns.xyz/benchmarkfcns/griewankfcn.html
Derivatives of Multivariate Functions
 Can define the derivative for a multivariate function as well, via the gradient
 The gradient of a function is a vector of partial derivatives
 Optima and saddle points are defined similarly to the one-dim case
 The required properties that we saw for the one-dim case must be satisfied along all directions
 The second derivative in this case is known as the Hessian
$$\nabla f(\boldsymbol{x}) = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_D} \right)$$
Each element in this gradient vector tells us how much f will change if we move a little along the corresponding coordinate (akin to the one-dim case).
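A sketch of forming the gradient numerically, one partial derivative per coordinate (the two-variable function here is an arbitrary example I chose, not the one from the earlier illustration):

```python
import numpy as np

def f(x):
    # f(x1, x2) = x1^2 + 3*x1*x2 + 2*x2^2 (arbitrary example)
    return x[0]**2 + 3 * x[0] * x[1] + 2 * x[1]**2

def numerical_gradient(f, x, h=1e-6):
    grad = np.zeros_like(x)
    for d in range(len(x)):
        e = np.zeros_like(x)
        e[d] = h                                 # perturb only coordinate d
        grad[d] = (f(x + e) - f(x - e)) / (2 * h)
    return grad

x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))  # analytic gradient (2*x1 + 3*x2, 3*x1 + 4*x2) = (8, 11)
```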
The Hessian
 For a multivariate scalar-valued function f(x), the Hessian is a D × D matrix
 The Hessian matrix can be used to assess the optima/saddle points
 If the gradient is 0 and the Hessian is a positive semi-definite (PSD) matrix, then x is a minima
 If the gradient is 0 and the Hessian is a negative semi-definite (NSD) matrix, then x is a maxima
$$\nabla^2 f(\boldsymbol{x}) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_D} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_D \partial x_1} & \frac{\partial^2 f}{\partial x_D \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_D^2} \end{bmatrix}$$
Note: if the function itself is vector-valued (say with K output dimensions), then we will have K such Hessian matrices, one for each output dimension of the function.
The Hessian gives information about the curvature of the function at the point x.
A square, symmetric matrix M is PSD if v⊤Mv ≥ 0 for every vector v (and NSD if v⊤Mv ≤ 0 for every v). Equivalently, M is PSD if all of its eigenvalues are non-negative.
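A sketch that estimates the Hessian of a simple quadratic numerically and checks the eigenvalue condition (the function, point, and tolerance are my own choices):

```python
import numpy as np

def f(x):
    # Convex quadratic: x1^2 + x1*x2 + 2*x2^2, whose Hessian is [[2, 1], [1, 4]]
    return x[0]**2 + x[0] * x[1] + 2 * x[1]**2

def numerical_hessian(f, x, h=1e-4):
    D = len(x)
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            ei, ej = np.zeros(D), np.zeros(D)
            ei[i], ej[j] = h, h
            # Mixed second partial derivative via central differences
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4 * h**2)
    return H

H = numerical_hessian(f, np.array([0.0, 0.0]))
eigvals = np.linalg.eigvalsh(H)               # works since the Hessian is symmetric
print("eigenvalues:", eigvals)
print("PSD, so the stationary point at (0, 0) is a minima:", bool(np.all(eigvals >= -1e-8)))
```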
Convex and Non-Convex Functions
 A function being optimized can be either convex or non-convex
 Here are a couple of examples of convex functions
 Here are a couple of examples of non-convex functions
[Figures: example plots of convex and non-convex functions.]
Convex functions are bowl-shaped: any local minimum is also the global minimum.
The negative of a convex function is called a concave function; for a concave function, any local maximum is also the global maximum.
Non-convex functions can have multiple local minima. They are usually harder to optimize than convex functions.
Loss functions of most
deep learning models
are non-convex
Convex Sets
 A set S of points is a convex set if, for any two points x, y ∈ S and any 0 ≤ α ≤ 1,
$$z = \alpha x + (1 - \alpha) y \in S$$
 The above means that all points on the line segment between x and y lie within S
 The domain of a convex function needs to be a convex set
z is also called a "convex combination" of the two points x and y.
A convex combination of N points can be defined likewise as $\sum_{i=1}^{N} \alpha_i x_i$ with $\alpha_i \ge 0$ and $\sum_{i=1}^{N} \alpha_i = 1$.
Convex Functions
 Informally, f is convex if all of its chords lie above the function everywhere
 Formally (assuming a differentiable function), some tests for convexity:
 First-order convexity: f(y) ≥ f(x) + ∇f(x)⊤(y − x) for all x, y (the graph of f must lie above all of its tangents)
Exercise: show that the ridge regression objective is convex.
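As a sanity check related to the exercise (not a proof), here is a sketch that numerically tests the first-order convexity condition for the ridge objective at random pairs of points; the data X, y and the regularization weight lam are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 20, 5, 0.1                      # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def ridge_loss(w):
    return np.sum((y - X @ w)**2) + lam * np.sum(w**2)

def ridge_grad(w):
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

# First-order convexity: L(v) >= L(w) + grad L(w)^T (v - w) for all w, v
for _ in range(1000):
    w, v = rng.normal(size=D), rng.normal(size=D)
    assert ridge_loss(v) >= ridge_loss(w) + ridge_grad(w) @ (v - w) - 1e-8

print("first-order convexity condition held at all sampled pairs")
```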
Optimization Using First-Order Optimality
 Very simple. We already used this approach for linear and ridge regression
 First-order optimality: the gradient must be equal to zero at the optima, ∇f(w) = 0
 Sometimes, setting ∇f(w) = 0 and solving for w gives a closed-form solution
 If a closed-form solution is not available, the gradient can still be used in iterative methods such as gradient descent (next)
This approach works only for very simple problems where the objective is convex and there are no constraints on the values w can take.
Called “first order” since only
gradient is used and gradient
provides the first order info about
the function being optimized
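For instance, for ridge regression the first-order optimality condition can be solved in closed form. A sketch (with assumed random data and regularization weight) that sets the gradient to zero, solves for w, and verifies the gradient indeed vanishes there:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, lam = 50, 4, 0.5                      # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def grad(w):
    # Gradient of the ridge objective ||y - Xw||^2 + lam * ||w||^2
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

# Setting grad(w) = 0 gives (X^T X + lam I) w = X^T y, a linear system we can solve
w_star = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

print("gradient norm at w*:", np.linalg.norm(grad(w_star)))  # ~0 up to numerical precision
```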
Optimization via Gradient Descent
 Initialize w to w^(0)
 For iteration t = 0, 1, 2, … (or until convergence):
 Calculate the gradient g^(t) = ∇L(w^(t)) using the current iterate
 Set the learning rate η_t
 Move in the opposite direction of the gradient:
Gradient Descent update:
$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta_t \, \mathbf{g}^{(t)}$$
Can I use this approach to solve maximization problems? For maximization problems we can use gradient ascent, which moves in the direction of the gradient instead.
The method is iterative since it requires several steps/iterations to find the optimal solution.
For convex functions, GD will converge to the global minima; for non-convex functions, good initialization is needed.
The learning rate η_t is very important and should be set carefully (fixed or chosen adaptively). Will discuss some strategies later.
Sometimes it may be tricky to assess convergence. Will see some methods later.
Fact: the gradient gives the direction of steepest change in the function's value (will see the justification shortly).
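A minimal gradient-descent sketch on the same kind of ridge objective (the data, the fixed learning rate, and the iteration budget are my own choices; in practice the learning rate may be chosen adaptively, as noted above):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, lam = 50, 4, 0.5                       # assumed sizes and regularization weight
X, y = rng.normal(size=(N, D)), rng.normal(size=N)

def grad(w):
    # Gradient of ||y - Xw||^2 + lam * ||w||^2
    return -2 * X.T @ (y - X @ w) + 2 * lam * w

w = np.zeros(D)                              # initialization w^(0)
eta = 0.005                                  # fixed learning rate (could also decay with t)
for t in range(500):
    g = grad(w)                              # gradient at the current iterate
    if np.linalg.norm(g) < 1e-8:             # simple convergence check
        break
    w = w - eta * g                          # move opposite to the gradient

w_closed = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
print("GD matches the closed-form solution:", np.allclose(w, w_closed, atol=1e-4))
```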
Gradient Descent: An Illustration
[Figure: two plots of a loss L(w) against w, showing iterates w^(0), w^(1), w^(2), w^(3) approaching an optimum w*. Where the gradient is negative we move in the positive direction; where it is positive we move in the negative direction. In the second plot, a poor start leaves GD stuck at a local minima.]
The learning rate is very important; good initialization is also very important.
Reference
• CS771: Introduction to Machine Learning by Prof. Nisheeth – nice reference for today's material.
• For those interested in a deeper dive into the math, see Ch. 3 in this book.