Behavior of Limited Memory BFGS on
Nonsmooth Convex Functions
Azam Asl Michael L. Overton
March 21, 2018
Courant Institute of Mathematical Sciences, New York University
1
BFGS
M.J.D. Powell established convergence of BFGS with an inexact
Armijo-Wolfe line search for a general class of smooth convex
functions.
Pathological counterexamples to convergence in the smooth,
nonconvex case are known to exist (Y.-H. Dai, 2002, 2013;
W. Mascarenhas 2004), but it is widely accepted that the method
works well in practice in the smooth, nonconvex case.
2
The BFGS Method (“Full” Version)
Initialize iterate x and positive-definite symmetric matrix H (which is supposed
to approximate the inverse Hessian of f )
Repeat
• Set d = −H∇f(x).
• Obtain t from an Armijo-Wolfe line search.
• Set s = td, y = ∇f(x + td) − ∇f(x).
• Replace x by x + td.
• Replace H by VHV^T + (1/(s^T y)) ss^T, where V = I − (1/(s^T y)) sy^T.
Note that the updated H can be computed in O(n^2) operations since V is a
rank-one perturbation of the identity.
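As an illustrative sketch (not the authors' code), the inverse-Hessian update above can be expanded so that it costs O(n^2) rather than forming V explicitly:

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS inverse-Hessian update H <- V H V^T + (1/s^T y) s s^T,
    with V = I - (1/s^T y) s y^T, expanded to O(n^2) outer products.
    Assumes H is symmetric and s^T y > 0 (guaranteed by the Wolfe condition)."""
    rho = 1.0 / (s @ y)
    Hy = H @ y
    return (H
            - rho * np.outer(s, Hy)   # -rho * s (y^T H)
            - rho * np.outer(Hy, s)   # -rho * (H y) s^T
            + (rho * rho * (y @ Hy) + rho) * np.outer(s, s))
```

A quick sanity check of the secant property: the updated H maps y to s.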
3
Armijo-Wolfe line search
Armijo condition: f(x + td) ≤ f(x) + c1 t ∇f(x)^T d
Wolfe condition: f is differentiable at x + td and
∇f(x + td)^T d ≥ c2 ∇f(x)^T d
where 0 < c1 < c2 < 1.
4
BFGS for Nonsmooth Optimization
A.S. Lewis and M.L. Overton (Math. Prog., 2013): observed that BFGS is very
effective for nonsmooth optimization. Their convergence results are limited to
special cases.
In the nonsmooth case, BFGS builds a very ill-conditioned inverse “Hessian”
approximation, with some tiny eigenvalues converging to zero, corresponding to
“infinitely large” curvature in the directions defined by the associated
eigenvectors.
We have never seen convergence to non-stationary points that cannot be
explained by numerical difficulties.
Convergence rate of BFGS is typically linear (not superlinear) in the nonsmooth
case.
5
LM-BFGS-m
“Full” BFGS requires storing an n × n matrix and doing matrix-vector
multiplies, which is impractical when n is large.
In the 1980s, J. Nocedal and others developed a “limited memory” version of
BFGS, with O(n) space and time requirements, which is very widely used for
minimizing smooth functions of many variables. It works by saving only the
most recent m rank-two updates to an initial inverse Hessian approximation.
There are two variants: with and without “scaling”. With scaling, at
every iteration we set H0_k = γk I, where
γk = (s_{k−1}^T y_{k−1}) / (y_{k−1}^T y_{k−1})
is an approximation to one of the eigenvalues of (∇^2 f(x_{k−1}))^{−1}.
Use of scaling is usually recommended.
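The standard two-loop recursion for the limited-memory search direction, with the scaling H0_k = γk I described above, can be sketched as follows (an illustrative implementation, not the code used for the experiments):

```python
import numpy as np

def lbfgs_direction(g, S, Y, scaling=True):
    """Two-loop recursion: compute d = -H g from the m most recent
    pairs (s_i, y_i), stored oldest-to-newest in the lists S and Y.
    With scaling=True the initial matrix is gamma_k * I."""
    q = g.copy()
    rhos = [1.0 / (s @ y) for s, y in zip(S, Y)]
    alphas = []
    for s, y, rho in reversed(list(zip(S, Y, rhos))):   # newest to oldest
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if scaling and S:
        gamma = (S[-1] @ Y[-1]) / (Y[-1] @ Y[-1])       # gamma_k
    else:
        gamma = 1.0
    r = gamma * q                                       # r = H0_k q
    for (s, y, rho), a in zip(zip(S, Y, rhos), reversed(alphas)):
        b = rho * (y @ r)
        r += (a - b) * s
    return -r
```

With empty memory the direction reduces to steepest descent, and with one stored pair the recursion satisfies the secant condition H y = s.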
Both variants (with and without scaling) have convergence results for smooth
convex problems, but the rate of convergence is linear, not superlinear like full
BFGS.
Question: how effective is it on nonsmooth problems?
6
A Simple Nonsmooth Convex Function, Unbounded Below
f(x) = a|x(1)| + Σ_{i=2}^n x(i), where x ∈ R^n.
[Figure: surface plot of f(u, v) = 5|u| + v over [−10, 10]^2.]
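For reference, the function and its gradient (defined wherever x(1) ≠ 0) are easy to code; a minimal sketch:

```python
import numpy as np

def f(x, a):
    """f(x) = a*|x(1)| + sum_{i=2}^n x(i): nonsmooth on the hyperplane
    x(1) = 0, convex, and unbounded below (decrease any x(i), i >= 2)."""
    return a * abs(x[0]) + x[1:].sum()

def grad_f(x, a):
    """Gradient where it exists, i.e. for x(1) != 0."""
    g = np.ones_like(x)
    g[0] = a * np.sign(x[0])
    return g
```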
7
Motivation
Red: path of LM-BFGS-1 with scaling, converges to non-stationary point.
Blue: path of the gradient method with same Armijo-Wolfe line search,
generates f (x) ↓ −∞.
[Figure: iterate paths for f(u, v) = 3|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.056; LBFGS-1 (red) and gradient method (blue).]
8
Convergence of the LM-BFGS-1 search direction
Theorem
Let dk be the search direction generated by LM-BFGS-1 with
scaling applied to f(x) = a|x(1)| + Σ_{i=2}^n x(i), using an
Armijo-Wolfe line search. Suppose 4(n − 1) ≤ a^2. Then
dk/||dk|| converges to some constant direction d.
9
Convergence of the LM-BFGS-1 search direction
Corollary
Suppose 4(n − 1) ≤ a^2. Let c1 be the Armijo parameter. If
a(a + √(a^2 − 3(n − 1))) > (1/c1 − 1)(n − 1),
then xk converges to a non-minimal point.
In practice we observe that 3(n − 1) ≤ a^2 suffices for convergence
to a non-minimal point.
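The corollary's condition is straightforward to evaluate numerically; a small helper (an illustrative sketch, using the a^2 form of the hypotheses as written above):

```python
import math

def corollary_predicts_failure(a, n, c1):
    """True if the corollary's sufficient condition predicts that
    LM-BFGS-1 with scaling converges to a non-minimal point of
    f(x) = a|x(1)| + sum_{i=2}^n x(i)."""
    if a * a < 4 * (n - 1):            # hypothesis of the theorem
        return False
    lhs = a * (a + math.sqrt(a * a - 3 * (n - 1)))
    return lhs > (1.0 / c1 - 1.0) * (n - 1)
```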
10
Experiments
In practice we observe that 3(n − 1) ≤ a^2 suffices for the method to fail.
Below, with n = 2 and a = √3, the method still fails:
[Figure: iterate paths for f(u, v) = 1.7321|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.27; LBFGS-1 and gradient method.]
11
Experiments
But if we set a = √3 − 0.001, it succeeds.
[Figure: iterate paths for f(u, v) = 1.7311|u| + v, x0 = (8.284; 2.177),
c1 = 0.05, τ = −0.27; LBFGS-1 and gradient method.]
12
Experiments: Top, scaling on. Bottom, scaling off
N = 30, f(x) = a|x(1)| + Σ_{i=2}^N x(i), c1 = 0.05, √(3(N − 1)) = 9.33, nrand = 5000.
[Figure: failure rate vs. a over [9.315, 9.34], with scaling on (top)
and scaling off (bottom).]
13
Some questions
• Why does scaling of LM-BFGS, which is recommended for
smooth problems, seem to be a bad idea for nonsmooth
problems?
• Can we prove convergence on this particular example when
scaling is off?
• Can this “failure” result be extended to a broader problem
class?
• Our analysis was just for LM-BFGS-1: can it be extended to
LM-BFGS-m?
Thank you!
14
QMC: Operator Splitting Workshop, Behavior of BFGS on Nonsmooth Convex Functions - Azam Asl, Mar 21, 2018