Behavior of Limited Memory BFGS on
Nonsmooth Convex Functions
Azam Asl Michael L. Overton
March 21, 2018
Courant Institute of Mathematical Sciences, New York University
1
BFGS
M.J.D. Powell established convergence of BFGS with an inexact
Armijo-Wolfe line search for a general class of smooth convex
functions.
Pathological counterexamples to convergence in the smooth,
nonconvex case are known to exist (Y.-H. Dai, 2002, 2013;
W. Mascarenhas 2004), but it is widely accepted that the method
works well in practice in the smooth, nonconvex case.
2
The BFGS Method (“Full” Version)
Initialize iterate x and positive-definite symmetric matrix H (which is supposed
to approximate the inverse Hessian of f )
Repeat
• Set d = −H∇f(x).
• Obtain t from an Armijo-Wolfe line search.
• Set s = td, y = ∇f(x + td) − ∇f(x).
• Replace x by x + td.
• Replace H by VHV^T + (1/(s^T y)) ss^T, where V = I − (1/(s^T y)) sy^T.
Note that the updated H can be computed in O(n^2) operations since V is a
rank-one perturbation of the identity.
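As an illustrative sketch (not the authors' code), the inverse-Hessian update above can be expanded so that it costs O(n^2) rather than forming V explicitly:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS inverse-Hessian update H <- V H V^T + (1/s^T y) s s^T,
    with V = I - (1/s^T y) s y^T, expanded to O(n^2) outer products.
    Assumes H is symmetric and s^T y > 0 (guaranteed by the Wolfe condition)."""
    rho = 1.0 / (s @ y)
    Hy = H @ y
    return (H
            - rho * np.outer(s, Hy)   # -rho * s (y^T H)
            - rho * np.outer(Hy, s)   # -rho * (H y) s^T
            + (rho * rho * (y @ Hy) + rho) * np.outer(s, s))
```

A quick sanity check of the secant property: the updated H maps y to s.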
3
Armijo-Wolfe line search
Armijo condition: f(x + td) ≤ f(x) + c1 t ∇f(x)^T d
Wolfe condition: f is differentiable at x + td and
∇f(x + td)^T d ≥ c2 ∇f(x)^T d
where 0 < c1 < c2 < 1.
4
BFGS for Nonsmooth Optimization
A.S. Lewis and M.L. Overton (Math. Prog., 2013): observed that BFGS is very
effective for nonsmooth optimization. Their convergence results are limited to
special cases.
In the nonsmooth case, BFGS builds a very ill-conditioned inverse “Hessian”
approximation, with some tiny eigenvalues converging to zero, corresponding to
“infinitely large” curvature in the directions defined by the associated
eigenvectors.
We have never seen convergence to non-stationary points that cannot be
explained by numerical difficulties.
Convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth
case.
5
LM-BFGS-m
“Full” BFGS requires storing an n × n matrix and doing matrix-vector
multiplies, which is impractical when n is large.
In the 1980s, J. Nocedal and others developed a “limited memory” version of
BFGS, with O(n) space and time requirements, which is very widely used for
minimizing smooth functions of many variables. It works by saving only the
most recent m rank-two updates to an initial inverse Hessian approximation.
There are two variants: with and without “scaling”. With scaling, at
every iteration we set H0_k = γk I, where
γk = (s_{k−1}^T y_{k−1}) / (y_{k−1}^T y_{k−1})
is an approximation to one of the eigenvalues of (∇^2 f(x_{k−1}))^{−1}.
Use of scaling is usually recommended.
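The standard two-loop recursion for the limited-memory search direction, with the scaling H0_k = γk I described above, can be sketched as follows (an illustrative implementation, not the code used for the experiments):

```python
import numpy as np

def lbfgs_direction(g, S, Y, scaling=True):
    """Two-loop recursion: compute d = -H g from the m most recent
    pairs (s_i, y_i), stored oldest-to-newest in the lists S and Y.
    With scaling=True the initial matrix is gamma_k * I."""
    q = g.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(S, Y)]
    alphas = []
    for s, y, rho in reversed(list(zip(S, Y, rhos))):   # newest to oldest
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if scaling and S:
        gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])       # gamma_k
    else:
        gamma = 1.0
    r = gamma * q                                       # r = H0_k q
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r
```

With empty memory the direction reduces to steepest descent, and with one stored pair the recursion satisfies the secant condition H y = s.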
Both variants (with and without scaling) have convergence results for smooth
convex problems, but the rate of convergence is linear, not superlinear like full
BFGS.
Question: how effective is it on nonsmooth problems?
6
A Simple Nonsmooth Convex Function, Unbounded Below
f(x) = a|x(1)| + Σ_{i=2}^n x(i), where x ∈ R^n.
[Figure: surface plot of f(u, v) = 5|u| + v over [−10, 10]^2.]
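For reference, the function and its gradient (defined wherever x(1) ≠ 0) are easy to code; a minimal sketch:

```python
import numpy as np

def f(x, a):
    """f(x) = a*|x(1)| + sum_{i=2}^n x(i): nonsmooth on the hyperplane
    x(1) = 0, convex, and unbounded below (decrease any x(i), i >= 2)."""
    return a * abs(x[0]) + x[1:].sum()

def grad_f(x, a):
    """Gradient where it exists, i.e. for x(1) != 0."""
    g = np.ones_like(x)
    g[0] = a * np.sign(x[0])
    return g
```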
7
Motivation
Red: path of LM-BFGS-1 with scaling, converges to non-stationary point.
Blue: path of the gradient method with same Armijo-Wolfe line search,
generates f (x) ↓ −∞.
[Figure: iterate paths for f(u, v) = 3|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.056; LBFGS-1 (red) and gradient method (blue).]
8
Convergence of the LM-BFGS-1 search direction
Theorem
Let dk be the search direction generated by LM-BFGS-1 with
scaling applied to f(x) = a|x(1)| + Σ_{i=2}^n x(i), using an
Armijo-Wolfe line search. Suppose 4(n − 1) ≤ a^2. Then
dk/||dk|| converges to some constant direction d.
9
Convergence of the LM-BFGS-1 search direction
Corollary
Suppose 4(n − 1) ≤ a^2. Let c1 be the Armijo parameter. If
a(a + √(a^2 − 3(n − 1))) > (1/c1 − 1)(n − 1),
then xk converges to a non-minimal point.
In practice we observe that 3(n − 1) ≤ a^2 suffices for convergence
to a non-minimal point.
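The corollary's condition is straightforward to evaluate numerically; a small helper (an illustrative sketch, using the a^2 form of the hypotheses as written above):

```python
import math

def corollary_predicts_failure(a, n, c1):
    """True if the corollary's sufficient condition predicts that
    LM-BFGS-1 with scaling converges to a non-minimal point of
    f(x) = a|x(1)| + sum_{i=2}^n x(i)."""
    if a * a < 4 * (n - 1):            # hypothesis of the theorem
        return False
    lhs = a * (a + math.sqrt(a * a - 3 * (n - 1)))
    return lhs > (1.0 / c1 - 1.0) * (n - 1)
```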
10
Experiments
In practice we observe that 3(n − 1) ≤ a^2 suffices for the method to fail.
Below, with n = 2 and a = √3, the method still fails:
[Figure: iterate paths for f(u, v) = 1.7321|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.27; LBFGS-1 and gradient method.]
11
Experiments
But if we set a = √3 − 0.001, it succeeds.
[Figure: iterate paths for f(u, v) = 1.7311|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.27; LBFGS-1 and gradient method.]
12
Experiments: Top, scaling on. Bottom, scaling off
N = 30, f(x) = a|x(1)| + Σ_{i=2}^N x(i), c1 = 0.05, √(3(N − 1)) = 9.33, nrand = 5000.
[Figure: failure rate vs. a over [9.315, 9.34], with scaling on (top)
and scaling off (bottom).]
13
Some questions
• Why does scaling of LM-BFGS, which is recommended for
smooth problems, seem to be a bad idea for nonsmooth
problems?
• Can we prove convergence on this particular example when
scaling is off?
• Can this “failure” result be extended to a broader problem
class?
• Our analysis was just for LM-BFGS-1: can it be extended to
LM-BFGS-m?
Thank you!
14
QMC: Operator Splitting Workshop, Behavior of BFGS on Nonsmooth Convex Functions - Azam Asl, Mar 21, 2018