Trajectory Alignment: Understanding the Edge of
Stability Phenomenon via Bifurcation Theory
Minhak Song Chulhee Yun
KAIST
NeurIPS 2023
Poster: Session 3, Wed 13 Dec
Edge of Stability (EoS)
Full-batch GD: Θt+1 = Θt − η∇L(Θt)
Descent Lemma: If η < 2/λmax(∇²L), then the loss decreases at every iteration.
EoS: the sharpness (= λmax(∇²L)) increases along the GD trajectory and then saturates at 2/η [Cohen et al., 2021].
[Figure: sharpness vs. iteration for fully-connected and CNN networks with ELU, ReLU, and tanh activations; in each case the sharpness rises and then saturates near 2/η.]
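The 2/η stability threshold in the descent lemma can be checked directly on a toy problem. The sketch below is not from the slides and assumes a hand-picked quadratic loss whose sharpness is known exactly; it only illustrates that GD decreases the loss when η < 2/λmax and blows up once η exceeds that threshold.

```python
import numpy as np

# Minimal sketch (not from the slides): full-batch GD on a quadratic loss
# L(w) = 0.5 * w^T H w, whose sharpness lambda_max(H) is known exactly.
H = np.diag([16.0, 1.0])                 # Hessian; sharpness = 16
lam_max = np.max(np.linalg.eigvalsh(H))

def run_gd(eta, steps=50):
    w = np.array([1.0, 1.0])
    losses = []
    for _ in range(steps):
        w = w - eta * (H @ w)            # gradient of 0.5 * w^T H w is H w
        losses.append(0.5 * w @ H @ w)
    return losses

stable = run_gd(eta=1.9 / lam_max)       # eta < 2/lambda_max: monotone decrease
unstable = run_gd(eta=2.1 / lam_max)     # eta > 2/lambda_max: loss diverges
print(f"stable run:   first={stable[0]:.3e}, last={stable[-1]:.3e}")
print(f"unstable run: first={unstable[0]:.3e}, last={unstable[-1]:.3e}")
```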
Toy model
Objective function: L(x, y) = log(cosh(xy)), step size: η = 2/16
[Figure: GD trajectories from several initializations in the (x, y) plane, together with sharpness vs. iterations; the 2/η level is marked.]
▶ Different GD trajectories align on the same curve:
Trajectory Alignment occurs!
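A minimal sketch of this toy experiment follows; the initializations below are arbitrary choices of ours, since the slides do not list the exact starting points.

```python
import numpy as np

# Toy model from the slide: L(x, y) = log(cosh(x * y)), step size eta = 2/16.
# The initializations are arbitrary (the slides' exact starting points are not given).
eta = 2 / 16

def grad(x, y):
    t = np.tanh(x * y)                   # dL/d(xy) = tanh(xy)
    return t * y, t * x

def sharpness(x, y):
    # Closed-form 2x2 Hessian of log(cosh(xy)); sharpness = its largest eigenvalue.
    s = 1.0 / np.cosh(x * y) ** 2        # sech^2(xy)
    t = np.tanh(x * y)
    H = np.array([[s * y * y, s * x * y + t],
                  [s * x * y + t, s * x * x]])
    return np.max(np.linalg.eigvalsh(H))

for x0, y0 in [(0.4, 6.5), (-0.3, 5.5), (0.8, 4.5)]:
    x, y = x0, y0
    for _ in range(200):
        gx, gy = grad(x, y)
        x, y = x - eta * gx, y - eta * gy
    print(f"init=({x0:+.1f}, {y0:.1f})  final sharpness={sharpness(x, y):.2f}  (2/eta={2/eta:.0f})")
```

With these settings, each run should end with a sharpness close to 2/η = 16, which is the saturation behaviour the slide highlights.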
Q. Trajectory alignment of GD in general setting?
Canonical reparameterization
(p, q) ≜ (Residual, EoS threshold / NTK sharpness)
Objective function: L(Θ) = (1/n) Σᵢ ℓ(f(xᵢ; Θ) − yᵢ), summing over the n training points
Assumption: ℓ is a convex, Lipschitz loss with ℓ′(0) = 0 and ℓ″(0) = 1.
(p, q) = ( (1/n) Σᵢ (f(xᵢ; Θ) − yᵢ), (2/η) / λmax(NTK) )
       = ( (1/n) Σᵢ (f(xᵢ; Θ) − yᵢ), 2n / (η ∥Σᵢ (∇Θ f(xᵢ; Θ))⊗²∥₂) )
▶ Minimum with sharpness 2/η corresponds to (p, q) = (0, 1)
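To make the reparameterization concrete, here is a sketch that computes (p, q) for a plain linear model f(x; Θ) = Θᵀx, an illustrative choice of ours (the slides do not prescribe a model here), for which ∇Θ f(xᵢ; Θ) = xᵢ.

```python
import numpy as np

# Minimal sketch (assumed setup, not the slides' experiment): the canonical
# reparameterization (p, q) for a linear model f(x; theta) = theta @ x.
rng = np.random.default_rng(0)
n, d = 8, 5
X = rng.normal(size=(n, d))        # training inputs x_1, ..., x_n
y = rng.normal(size=n)             # targets y_1, ..., y_n
theta = rng.normal(size=d)
eta = 0.1

preds = X @ theta                  # f(x_i; theta)
p = np.mean(preds - y)             # average residual

# NTK sharpness: lambda_max of (1/n) * sum_i grad f(x_i) grad f(x_i)^T.
# The d x d matrix (1/n) X^T X has the same top eigenvalue as the n x n NTK Gram matrix.
ntk = (X.T @ X) / n
ntk_sharpness = np.max(np.linalg.eigvalsh(ntk))
q = (2 / eta) / ntk_sharpness      # EoS threshold divided by NTK sharpness

print(f"p = {p:.3f}, q = {q:.3f}   (a minimum with sharpness 2/eta sits at (p, q) = (0, 1))")
```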
Experiment: single training point
Setting: log-cosh loss ℓ(p) = log(cosh(p)), 3-layer FC networks
Observation: GD trajectories align on the curve q = ℓ′(p)/p
[Figure: (a) tanh FC network, (b) ELU FC network, (c) linear FC network]
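For the log-cosh loss used above, this alignment curve has a simple closed form (a short computation of ours from the stated definitions):

```latex
\[
\ell(p) = \log\cosh(p)
  \;\Rightarrow\; \ell'(p) = \tanh(p)
  \;\Rightarrow\; q = \frac{\ell'(p)}{p} = \frac{\tanh(p)}{p},
\qquad
\lim_{p \to 0} \frac{\tanh(p)}{p} = 1 .
\]
```

So the curve passes through (p, q) = (0, 1), the point identified above as a minimum with sharpness exactly 2/η.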
Experiment: multiple training points (CIFAR-10)
Setting: log-cosh loss ℓ(p) = log(cosh(p)), CIFAR-10 2-class subset
Observation: GD trajectories align on the curve, independent of initialization
[Figure: (a) 3-layer MLP, (b) 3-layer CNN]
Theory: Trajectory Alignment phenomenon provably occurs
Setting: training a two-layer linear network on a single data point
Theorems 4.2 and 4.3 (informal, EoS regime)
If q0 < 1, then there exists ta = O(log(η⁻¹)) such that for any t ≥ ta,
qt / r(pt) = 1 + h(pt)η² + O(η⁴),
where h(p) ≜ −(1/2)( p·r(p)³/r′(p) + p²/r(p)² ) for p ≠ 0, and h(0) ≜ −1/(2r″(0)).
Moreover, (pt, qt) converges to the point (0, q∗) such that
q∗ = 1 − η²/(2r″(0)) + O(η⁴),
and the limiting sharpness is
2/η − η/|r″(0)| + O(η³).
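To get a feel for these formulas, here is a back-of-the-envelope instantiation. It rests on an assumption of ours that is not stated on this slide: that the curve r coincides with the alignment curve q = ℓ′(p)/p from the single-data-point experiment, with the log-cosh loss.

```latex
\[
r(p) = \frac{\tanh(p)}{p} = 1 - \frac{p^2}{3} + O(p^4)
  \;\Rightarrow\; r''(0) = -\tfrac{2}{3},
\qquad
q^* \approx 1 + \tfrac{3}{4}\eta^2,
\qquad
\text{limiting sharpness} \approx \frac{2}{\eta} - \frac{3}{2}\eta .
\]
```

With η = 2/16 as in the toy model, this gives a limiting sharpness of about 15.81, slightly below 2/η = 16, consistent with the toy-model plots where the sharpness saturates at the 2/η line.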
Summary
▶ We empirically demonstrate and provably establish the trajectory alignment phenomenon of GD in the EoS regime.
▶ This sheds light on the training dynamics of high-dimensional, non-convex NN optimization using GD with a large step size.
▶ For more details, join our poster session (Session 3, Wed 13 Dec) or check our paper!
(openreview link)
References I
Jeremy Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet
Talwalkar. Gradient descent on neural networks typically occurs at the
edge of stability. In International Conference on Learning
Representations, 2021. URL
https://openreview.net/forum?id=jh-rTtvkGeM.