Stochastic Gradient Descent with
Exponential Convergence Rates of
Expected Classification Errors
Atsushi Nitanda and Taiji Suzuki
AISTATS
April 18th, 2019
Naha, Okinawa
RIKEN AIP
Overview
• Topic
Convergence analysis of (averaged) SGD for binary classification
problems.
• Key assumption
Strongest version of low noise condition (margin condition) on the
conditional label probability.
• Result
Exponential convergence rates of expected classification errors
2
Background
• Stochastic Gradient Descent (SGD)
Simple and effective method for training machine learning models.
Significantly faster than vanilla gradient descent.
• Convergence Rates
Expected risk: sublinear convergence O(1/n^α), α ∈ [1/2, 1].
Expected classification error: How fast does it converge?
SGD vs. GD:
SGD: g_{t+1} ← g_t − η G_λ(g_t, Z_t), Z_t ∼ ρ,
GD : g_{t+1} ← g_t − η 𝔼_{Z∼ρ}[ G_λ(g_t, Z) ].
Cost per iteration:
1 example (SGD) vs. #data examples (GD); see the sketch below.
3
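For intuition on the per-iteration cost line above, here is a minimal sketch (a plain linear model with regularized logistic loss, not the paper's RKHS setting): an SGD step touches one randomly drawn example, while a GD step averages the gradient over all n examples. The data generator, the values of γ and λ, and the function names are illustrative assumptions, not part of the slides.

```python
import numpy as np

def grad_reg_logistic(w, x, y, lam):
    """Gradient at one example (x, y), y in {-1,+1}, of the regularized
    logistic loss log(1 + exp(-y w.x)) + (lam/2)||w||^2."""
    return -y * x / (1.0 + np.exp(y * (x @ w))) + lam * w

def sgd_step(w, X, Y, lam, eta, rng):
    """One SGD step: touches a single randomly drawn example."""
    i = rng.integers(len(Y))
    return w - eta * grad_reg_logistic(w, X[i], Y[i], lam)

def gd_step(w, X, Y, lam, eta):
    """One GD step: averages the gradient over all n examples."""
    g = np.mean([grad_reg_logistic(w, x, y, lam) for x, y in zip(X, Y)], axis=0)
    return w - eta * g

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
Y = np.where(X[:, 0] + 0.1 * rng.normal(size=1000) > 0, 1, -1)
lam, gamma = 0.1, 50.0
w = np.zeros(2)
for t in range(1, 201):                               # 200 SGD steps, one example each
    w = sgd_step(w, X, Y, lam, eta=2.0 / (lam * (gamma + t)), rng=rng)
w_gd = gd_step(np.zeros(2), X, Y, lam, eta=0.5)       # a single GD step already costs a full pass
print("SGD training error:", np.mean(np.sign(X @ w) != Y))
```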
Background
Common way to bound classification error.
• Classification error bound via consistency of loss functions:
[T. Zhang(2004), P. Bartlett+(2006)]
ℙ( sgn(g(X)) ≠ Y ) − ℙ( sgn(2ρ(1|X) − 1) ≠ Y ) ≲ ( ℒ(g) − ℒ* )^p,
g: predictor, ℒ*: Bayes optimal for ℒ,
ρ(1|X): conditional probability of label Y = 1,
p = 1/2 for logistic, exponential, and squared losses.
(Left-hand side: excess classification error; right-hand side: excess risk.)
• Sublinear convergence O(1/n^{αp}) of excess classification error.
4
Background
Faster convergence rates of excess classification error.
• Low noise condition on ρ(Y = 1 | X)
[A.B. Tsybakov (2004), P. Bartlett+ (2006)]
improves the consistency property,
resulting in faster rates: O(1/n) (still sublinear convergence).
• Low noise condition (strongest version)
[V. Koltchinskii & O. Beznosova (2005), J-Y. Audibert & A.B. Tsybakov (2007)]
accelerates the rates for ERM to linear rates O(exp(−n)).
5
Background
Faster convergence rates of excess classification error for SGD.
• Linear convergence rate
[L. Pillaud-Vivien, A. Rudi, & F. Bach (2018)]
has been shown for the squared loss function under the strong low
noise condition.
• This work
shows the linear convergence for more suitable loss functions (e.g.,
logistic loss) under the strong low noise condition.
6
Outline
• Problem Settings and Assumptions
• (Averaged) Stochastic Gradient Descent
• Main Results: Linear Convergence Rates of SGD and ASGD
• Proof Idea
• Toy Experiment
7
Problem Setting
• Regularized expected risk minimization problems
min_{g ∈ H_k}  ℒ_λ(g) := 𝔼_{(X,Y)}[ l(g(X), Y) ] + (λ/2)‖g‖²_{H_k},
(H_k, ⟨·,·⟩_{H_k}): reproducing kernel Hilbert space,
l: differentiable loss,
(X, Y): random variables on the feature space and the label set {−1, 1},
λ: regularization parameter.
8
Loss Function
Example There exists a convex φ: ℝ → ℝ_{≥0} s.t. l(ζ, y) = φ(yζ), e.g.,
φ(v) = log(1 + exp(−v))   (logistic loss),
φ(v) = exp(−v)            (exponential loss),
φ(v) = (1 − v)²           (squared loss).
9
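As a small sketch of these margin losses, evaluated at a few margins v = yζ (the squared loss is taken in its margin form (1 − v)², an assumption since that term is partly garbled in the extraction above):

```python
import numpy as np

# Margin losses phi(v), applied to the margin v = y * g(x).
losses = {
    "logistic":    lambda v: np.log1p(np.exp(-v)),
    "exponential": lambda v: np.exp(-v),
    "squared":     lambda v: (1.0 - v) ** 2,
}

for v in [-1.0, 0.0, 1.0, 3.0]:
    print(f"margin v = {v:+.1f}:",
          {name: round(float(phi(v)), 3) for name, phi in losses.items()})
```

All three penalize negative margins (misclassifications) heavily and small positive margins mildly.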
Assumption
- sup_{x ∈ 𝒳} k(x, x) ≤ R²,
- ∃M > 0: |∂_ζ l(ζ, y)| ≤ M,
- ∃L > 0, ∀g, h ∈ H_k: ℒ(g + h) − ℒ(g) − ⟨∇ℒ(g), h⟩_{H_k} ≤ (L/2)‖h‖²_{H_k},
- ρ(Y = 1 | X) ∈ (0, 1) a.e.,
- h*: increasing function on (0, 1),
- sgn(μ − 0.5) = sgn(h*(μ)),
- g* := arg min_{g: measurable} ℒ(g), and g* ∈ H_k.
Remark The logistic loss satisfies these assumptions.
The other loss functions also satisfy them by restricting the hypothesis space.
10
Link function:
h*(μ) = arg min_{h ∈ ℝ} { μφ(h) + (1 − μ)φ(−h) }.
Strongest Low Noise Condition
Assumption ∃δ ∈ (0, 1/2], for X a.e. w.r.t. ρ_𝒳,
|ρ(Y = 1 | X) − 0.5| > δ.
11
[Figure: ρ(Y = 1 | x) over the feature space 𝒳, staying at least δ away from 0.5; where it is below 0.5 the Bayes label is Y = −1, and where it is above, Y = +1.]
Strongest Low Noise Condition
Assumption ∃δ ∈ (0, 1/2], for X a.e. w.r.t. ρ_𝒳,
|ρ(Y = 1 | X) − 0.5| > δ.
Example [Figures: MNIST digits and the toy data used in the experiment.]
12
(Averaged) Stochastic Gradient Descent
13
• Stochastic gradient in the RKHS:
G_λ(g, X, Y) = ∂_ζ l(g(X), Y) k(X, ·) + λg.
• Step sizes and averaging weights:
η_t = 2 / (λ(γ + t)),   α_t = 2(γ + t − 1) / ((2γ + T)(T + 1)).
SGD iterates g_{t+1} ← g_t − η_t G_λ(g_t, Z_t); ASGD returns the weighted average ḡ_{T+1} = Σ_{t=1}^{T+1} α_t g_t.
Note: the averaging can be updated iteratively (see the sketch below).
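To make these pieces concrete, here is a minimal sketch of kernel (A)SGD: the iterate is stored as kernel coefficients on the examples drawn so far, the step sizes follow η_t = 2/(λ(γ + t)), and the weighted average is accumulated iteratively with weights proportional to γ + t − 1. The kernel width, λ, γ, the single-pass sampling, and all names are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def gauss_kernel(x, z, width=1.0):
    """Gaussian kernel k(x, z) = exp(-||x - z||^2 / (2 * width^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / (2.0 * width ** 2))

def dlogistic(zeta, y):
    """Derivative of the logistic loss log(1 + exp(-y*zeta)) w.r.t. zeta."""
    return -y / (1.0 + np.exp(y * zeta))

def kernel_asgd(X, Y, lam=1e-3, gamma=100.0, seed=0):
    """Kernel SGD with eta_t = 2/(lam*(gamma+t)) and averaging weights
    proportional to gamma + t - 1; g_t is stored as coefficients on the
    examples drawn so far (g_1 = 0)."""
    rng = np.random.default_rng(seed)
    T = len(Y)
    centers, coefs, avg = [], [], []
    for t in range(1, T + 1):
        i = rng.integers(T)                       # Z_t ~ rho (draw with replacement)
        x, y = X[i], Y[i]
        eta = 2.0 / (lam * (gamma + t))
        g_x = sum(c * gauss_kernel(x, z) for c, z in zip(coefs, centers))
        # g_{t+1} = (1 - eta*lam) g_t - eta * dl(g_t(x), y) * k(x, .)
        coefs = [(1.0 - eta * lam) * c for c in coefs] + [-eta * dlogistic(g_x, y)]
        centers.append(x)
        # accumulate the averaged iterate: g_{t+1} gets weight gamma + t
        # (g_1 = 0 carries weight gamma but contributes nothing)
        avg.append(0.0)
        avg = [a + (gamma + t) * c for a, c in zip(avg, coefs)]
    norm = (2.0 * gamma + T) * (T + 1) / 2.0      # total weight sum
    return centers, coefs, [a / norm for a in avg]

def predict(x, centers, coefs):
    return np.sign(sum(c * gauss_kernel(x, z) for c, z in zip(coefs, centers)))
```

A prediction is sgn(g(x)), using either the last coefficients (SGD) or the averaged ones (ASGD).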
Convergence Analyses
• For simplicity, we focus on the following case:
g_1 = 0,
k: Gaussian kernel,
φ(v) = log(1 + exp(−v)): logistic loss.
• We analyze convergence rates of the excess classification error:
ℛ(g) − ℛ* := ℙ( sgn(g(X)) ≠ Y ) − ℙ( sgn(g*(X)) ≠ Y ).
14
Main Result 1: Linear Convergence of SGD
Theorem There exists λ > 0 s.t. the following holds.
Assume η_1 ≤ min{ 1/(L + λ), 1/λ } and 𝕍[ ∂_ζ l(g(X), Y) k(X, ·) ] ≤ σ².
Set ν := max{ (L + λ)σ²/λ, (1 + γ)( ℒ_λ(g_1) − ℒ_λ(g_λ) ) }.
Then, for T ≥ (ν/λ) log^{−1}( (1 + 2δ)/(1 − 2δ) ) − γ, we have
𝔼[ℛ(g_{T+1})] − ℛ* ≤ 2 exp( −c λ(γ + T) ν^{−1} log²( (1 + 2δ)/(1 − 2δ) ) )
for a universal constant c.
#iterations for an ε-solution: O( (1/λ²) log(1/ε) log^{−2}( (1 + 2δ)/(1 − 2δ) ) ).
15
Main Result 2: Linear Convergence of ASGD
Theorem There exists λ > 0 s.t. the following holds.
Assume η_1 ≤ min{ 1/L, 1/λ }. Then, if
max{ σ² / (λ²(γ + T)), γ(γ − 1)‖g_λ‖²_{H_k} / ((γ + T)(T + 1)) } ≤ 32 log( (1 + 2δ)/(1 − 2δ) ),
we have
𝔼[ℛ(ḡ_{T+1})] − ℛ* ≤ 2 exp( −c λ(2γ + T) log²( (1 + 2δ)/(1 − 2δ) ) )
for a universal constant c.
#iterations for an ε-solution: O( (1/λ²) log(1/ε) log^{−2}( (1 + 2δ)/(1 − 2δ) ) ).
Remark The condition on T is much improved:
its dominant term can be satisfied even for somewhat small ε.
16
Toy Experiment
• 2-dim toy dataset.
• δ ∈ {0.1, 0.25, 0.4}.
• Linearly separable.
• Logistic loss.
• 𝜆 was determined by validation.
Right Figure
Generated samples for 𝛿 = 0.4.
x_1 = 1 is the Bayes optimal decision boundary.
17
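The slides do not show the data-generating mechanism, so the following is only a minimal sketch consistent with the description: 2-D points, Bayes boundary at x_1 = 1, and a conditional label probability that stays exactly δ away from 1/2, so the strong low noise condition holds with margin δ. The sampling range and n are arbitrary assumptions.

```python
import numpy as np

def make_toy_data(n, delta, seed=0):
    """2-D points whose conditional label probability stays exactly delta away
    from 1/2: rho(Y=1|x) = 0.5 + delta if x_1 > 1, else 0.5 - delta,
    so sgn(x_1 - 1) is the Bayes optimal classifier."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(low=0.0, high=2.0, size=(n, 2))
    p_pos = np.where(X[:, 0] > 1.0, 0.5 + delta, 0.5 - delta)
    Y = np.where(rng.uniform(size=n) < p_pos, 1, -1)
    return X, Y

X, Y = make_toy_data(n=1000, delta=0.4)
bayes_error = np.mean(np.sign(X[:, 0] - 1.0) != Y)   # close to 0.5 - delta = 0.1
print("empirical error of the Bayes rule:", bayes_error)
```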
18
[Results figure] From top to bottom:
1. Risk value
2. Classification error
3. Excess classification error / excess risk value
Purple line: SGD
Blue line: ASGD
ASGD is much faster, especially when δ = 0.4.
Summary
• We explained that convergence rates of expected classification
errors for (A)SGD are sublinear O(1/n^α) in general.
• We showed that these rates can be accelerated to linear rates
O(exp(−n)) under the strong low noise condition.
Future Work
• Faster convergence by making additional assumptions.
• Variants of SGD (acceleration, variance reduction).
• Non-convex models such as deep neural networks.
• Random Fourier features (ongoing work with collaborators).
19
References
- T. Zhang. Statistical behavior and consistency of classification methods based on convex risk
minimization. The Annals of Statistics, 2004.
- P. Bartlett, M. Jordan, & J. McAuliffe. Convexity, classification, and risk bounds. Journal of the
American Statistical Association, 2006.
- A.B. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 2004.
- V. Koltchinskii & O. Beznosova. Exponential convergence rates in classification. In International
Conference on Computational Learning Theory, 2005.
- J-Y. Audibert & A.B. Tsybakov. Fast learning rates for plug-in classifiers. The Annals of Statistics, 2007.
- L. Bottou & O. Bousquet. The tradeoffs of large scale learning. Advances in Neural Information
Processing Systems, 2008.
- L. Pillaud-Vivien, A. Rudi, & F. Bach. Exponential convergence of testing error for stochastic
gradient methods. In International Conference on Computational Learning Theory, 2018.
20
Appendix
21
Link Function
Definition (Link function) h*: (0, 1) → ℝ,
h*(μ) = arg min_{h ∈ ℝ} { μφ(h) + (1 − μ)φ(−h) }.
h* connects the conditional probability of the label to the model output.
Example (Logistic loss)
h*(μ) = log( μ / (1 − μ) ),   (h*)^{−1}(a) = 1 / (1 + exp(−a)).
22
[Figure: the expected risk defined by the conditional probability μ, as a function of the output h, minimized at h*(μ).]
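A quick numerical sketch of this definition for the logistic loss, comparing the minimizer found by scipy against the closed form log(μ/(1 − μ)); the grid of μ values and the optimization bounds are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def phi(v):
    """Logistic margin loss."""
    return np.log1p(np.exp(-v))

def link_numeric(mu):
    """h*(mu) = argmin_h  mu*phi(h) + (1 - mu)*phi(-h), found numerically."""
    obj = lambda h: mu * phi(h) + (1.0 - mu) * phi(-h)
    return minimize_scalar(obj, bounds=(-20.0, 20.0), method="bounded").x

for mu in [0.1, 0.25, 0.5, 0.75, 0.9]:
    print(mu, link_numeric(mu), np.log(mu / (1.0 - mu)))  # the two values should agree
```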
Proof Idea
Set m(δ) := max{ |h*(0.5 + δ)|, |h*(0.5 − δ)| }.
Example (logistic loss) m(δ) = log( (1 + 2δ)/(1 − 2δ) ).
Through h*, the noise condition is converted to: |g*(X)| ≥ m(δ).
Set g_λ := arg min_{g ∈ H_k} ℒ_λ(g).
When λ is sufficiently small, g_λ is close to g*. Moreover,
Proposition
There exists λ s.t. ‖g − g_λ‖_{H_k} ≤ m(δ)/(2R)  ⇒  ℛ(g) = ℛ*.
23
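For the δ values used later in the toy experiment, this margin evaluates to m(0.1) = log(1.2/0.8) ≈ 0.405, m(0.25) = log(1.5/0.5) ≈ 1.099, and m(0.4) = log(1.8/0.2) ≈ 2.197; a larger δ thus gives a larger margin m(δ), which enters the exponential rates above.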
Proof Idea
24
Analyze the convergence speed and the probability of landing in the ball in the RKHS.
[Figure: the space of conditional probabilities ρ(1|X) contains a small ball which provides the Bayes rule; through h* it is mapped into a small ball around g* and g_λ in the RKHS (predictor space), which the SGD iterates must reach.]
Recall h*(μ) = arg min_{h ∈ ℝ} { μφ(h) + (1 − μ)φ(−h) }.
Proof Sketch
1. Let Z_1, …, Z_T ∼ ρ be i.i.d. random variables and
D_t := 𝔼[ḡ_{T+1} | Z_1, …, Z_t] − 𝔼[ḡ_{T+1} | Z_1, …, Z_{t−1}],
so that ḡ_{T+1} = 𝔼[ḡ_{T+1}] + Σ_{t=1}^{T} D_t.
2. Convergence of 𝔼[ḡ_{T+1}] can be analyzed by
‖𝔼[ḡ_{T+1}] − g_λ‖²_{H_k} ≲ 𝔼[ℒ_λ(ḡ_{T+1})] − ℒ_λ(g_λ).
3. Bound Σ_{t=1}^{T} D_t by a martingale inequality: for c_T s.t. Σ_{t=1}^{T} ‖D_t‖²_{H_k} ≤ c_T²,
ℙ( ‖Σ_{t=1}^{T} D_t‖_{H_k} ≥ ε ) ≤ 2 exp( −ε²/(2c_T²) ).
4. Bound c_T by the stability of (A)SGD.
5. Combining 1 and 2, the probability of obtaining the Bayes rule can be bounded.
6. Finally, 𝔼[ℛ(ḡ_{T+1})] − ℛ* ≤ ℙ( ḡ_{T+1} is not Bayes optimal ).
25