Trajectory Alignment: Understanding the Edge of
Stability Phenomenon via Bifurcation Theory
Minhak Song Chulhee Yun
KAIST
NeurIPS 2023
Poster: Session 3, Wed 13 Dec
Edge of Stability (EoS)
Full-batch GD: Θt+1 = Θt − η∇L(Θt)
Descent Lemma: If η < 2/λmax(∇²L), then the loss decreases at every iteration.
EoS: the sharpness (= λmax(∇²L)) increases along the GD trajectory and then saturates at 2/η [Cohen et al., 2021].
[Figure: sharpness vs. iteration for fully-connected and CNN networks with ELU, ReLU, and tanh activations; in each case the sharpness rises and then saturates near 2/η.]
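The 2/η stability threshold in the descent lemma can be checked directly on a toy problem. The sketch below is not from the slides and assumes a hand-picked quadratic loss whose sharpness is known exactly; it only illustrates that GD decreases the loss when η < 2/λmax and blows up once η exceeds that threshold.

```python
import numpy as np

# Minimal sketch (not from the slides): full-batch GD on a quadratic loss
# L(w) = 0.5 * w^T H w, whose sharpness lambda_max(H) is known exactly.
H = np.diag([16.0, 1.0])                 # Hessian; sharpness = 16
lam_max = np.max(np.linalg.eigvalsh(H))

def run_gd(eta, steps=50):
    w = np.array([1.0, 1.0])
    losses = []
    for _ in range(steps):
        w = w - eta * (H @ w)            # gradient of 0.5 * w^T H w is H w
        losses.append(0.5 * w @ H @ w)
    return losses

stable = run_gd(eta=1.9 / lam_max)       # eta < 2/lambda_max: monotone decrease
unstable = run_gd(eta=2.1 / lam_max)     # eta > 2/lambda_max: loss diverges
print(f"stable run:   first={stable[0]:.3e}, last={stable[-1]:.3e}")
print(f"unstable run: first={unstable[0]:.3e}, last={unstable[-1]:.3e}")
```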
Toy model
Objective function: L(x, y) = log(cosh(xy)), step size: η = 2/16
[Figure: GD trajectories from several initializations in the (x, y) plane, together with sharpness vs. iterations; the 2/η level is marked.]
▶ Different GD trajectories align on the same curve:
Trajectory Alignment occurs!
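A minimal sketch of this toy experiment follows; the initializations below are arbitrary choices of ours, since the slides do not list the exact starting points.

```python
import numpy as np

# Toy model from the slide: L(x, y) = log(cosh(x * y)), step size eta = 2/16.
# The initializations are arbitrary (the slides' exact starting points are not given).
eta = 2 / 16

def grad(x, y):
    t = np.tanh(x * y)                   # dL/d(xy) = tanh(xy)
    return t * y, t * x

def sharpness(x, y):
    # Closed-form 2x2 Hessian of log(cosh(xy)); sharpness = its largest eigenvalue.
    s = 1.0 / np.cosh(x * y) ** 2        # sech^2(xy)
    t = np.tanh(x * y)
    H = np.array([[s * y * y, s * x * y + t],
                  [s * x * y + t, s * x * x]])
    return np.max(np.linalg.eigvalsh(H))

for x0, y0 in [(0.4, 6.5), (-0.3, 5.5), (0.8, 4.5)]:
    x, y = x0, y0
    for _ in range(200):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    print(f"init=({x0:+.1f}, {y0:.1f})  final sharpness={sharpness(x, y):.2f}  (2/eta={2/eta:.0f})")
```

With these settings, each run should end with a sharpness close to 2/η = 16, which is the saturation behaviour the slide highlights.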
Q. Trajectory alignment of GD in general setting?
Canonical reparameterization
(p, q) ≜ (Residual, EoS threshold / NTK sharpness)
Objective function: L(Θ) = (1/n) Σᵢ ℓ(f(xᵢ; Θ) − yᵢ), summing over the n training points
Assumption: ℓ is a convex, Lipschitz loss with ℓ′(0) = 0 and ℓ″(0) = 1.
(p, q) = ( (1/n) Σᵢ (f(xᵢ; Θ) − yᵢ), (2/η) / λmax(NTK) )
       = ( (1/n) Σᵢ (f(xᵢ; Θ) − yᵢ), 2n / (η ∥Σᵢ (∇Θ f(xᵢ; Θ))⊗²∥₂) )
▶ Minimum with sharpness 2/η corresponds to (p, q) = (0, 1)
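To make the reparameterization concrete, here is a sketch that computes (p, q) for a plain linear model f(x; Θ) = Θᵀx, an illustrative choice of ours (the slides do not prescribe a model here), for which ∇Θ f(xᵢ; Θ) = xᵢ.

```python
import numpy as np

# Minimal sketch (assumed setup, not the slides' experiment): the canonical
# reparameterization (p, q) for a linear model f(x; theta) = theta @ x.
rng = np.random.default_rng(0)
n, d = 8, 5
X = rng.normal(size=(n, d))        # training inputs x_1, ..., x_n
y = rng.normal(size=n)             # targets y_1, ..., y_n
theta = rng.normal(size=d)
eta = 0.1

preds = X @ theta                  # f(x_i; theta)
p = np.mean(preds - y)             # average residual

# NTK sharpness: lambda_max of (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
# The d x d matrix (1/n) X^T X has the same top eigenvalue as the n x n NTK Gram matrix.
ntk = (X.T @ X) / n
ntk_sharpness = np.max(np.linalg.eigvalsh(ntk))
q = (2 / eta) / ntk_sharpness      # EoS threshold divided by NTK sharpness

print(f"p = {p:.3f}, q = {q:.3f}   (a minimum with sharpness 2/eta sits at (p, q) = (0, 1))")
```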
Experiment: single training point
Setting: log-cosh loss ℓ(p) = log(cosh(p)), 3-layer FC networks
Observation: GD trajectories align on the curve q = ℓ′(p)/p
[Figure: (a) tanh FC network, (b) ELU FC network, (c) linear FC network]
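For the log-cosh loss used above, this alignment curve has a simple closed form (a short computation of ours from the stated definitions):

```latex
\[
\ell(p) = \log\cosh(p)
  \;\Rightarrow\; \ell'(p) = \tanh(p)
  \;\Rightarrow\; q = \frac{\ell'(p)}{p} = \frac{\tanh(p)}{p},
\qquad
\lim_{p \to 0} \frac{\tanh(p)}{p} = 1 .
\]
```

So the curve passes through (p, q) = (0, 1), the point identified above as a minimum with sharpness exactly 2/η.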
Experiment: multiple training points (CIFAR-10)
Setting: log-cosh loss ℓ(p) = log(cosh(p)), CIFAR-10 2-class subset
Observation: GD trajectories align on the curve, independent of initialization
[Figure: (a) 3-layer MLP, (b) 3-layer CNN]
Theory: Trajectory Alignment phenomenon provably occurs
Setting: training a two-layer linear network on a single data point
Theorems 4.2 and 4.3 (informal, EoS regime)
If q0 < 1, then there exists ta = O(log(η⁻¹)) such that for any t ≥ ta,
qt / r(pt) = 1 + h(pt)η² + O(η⁴),
where h(p) ≜ −(1/2)( p·r(p)³/r′(p) + p²/r(p)² ) for p ≠ 0, and h(0) ≜ −1/(2r″(0)).
Moreover, (pt, qt) converges to the point (0, q∗) such that
q∗ = 1 − η²/(2r″(0)) + O(η⁴),
and the limiting sharpness is
2/η − η/|r″(0)| + O(η³).
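To get a feel for these formulas, here is a back-of-the-envelope instantiation. It rests on an assumption of ours that is not stated on this slide: that the curve r coincides with the alignment curve q = ℓ′(p)/p from the single-data-point experiment, with the log-cosh loss.

```latex
\[
r(p) = \frac{\tanh(p)}{p} = 1 - \frac{p^2}{3} + O(p^4)
  \;\Rightarrow\; r''(0) = -\tfrac{2}{3},
\qquad
q^* \approx 1 + \tfrac{3}{4}\eta^2,
\qquad
\text{limiting sharpness} \approx \frac{2}{\eta} - \frac{3}{2}\eta .
\]
```

With η = 2/16 as in the toy model, this gives a limiting sharpness of about 15.81, slightly below 2/η = 16, consistent with the toy-model plots where the sharpness saturates at the 2/η line.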
Summary
▶ We empirically demonstrate and provably establish the trajectory alignment phenomenon of GD in the EoS regime.
▶ This sheds light on the training dynamics of high-dimensional, non-convex NN optimization using GD with a large step size.
▶ For more details, join our poster session (Session 3, Wed 13 Dec) or check our paper!
(openreview link)
References I
Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet
Talwalkar. Gradient descent on neural networks typically occurs at the
edge of stability. In International Conference on Learning
Representations, 2021. URL
https://openreview.net/forum?id=jh-rTtvkGeM.