INTRODUCTION TO STATISTICAL LEARNING THEORY
J. Saketha Nath (IIT Bombay)
“The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms” – [Bousquet et al., 04]
What is STL?
Supervised Learning Setting
■ Given:
– Training data: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
– Model: set $\mathcal{G}$ of candidate predictors of the form $g: \mathcal{X} \mapsto \mathcal{Y}$
– Loss function: $l: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}_+$
■ Goal: informally, pick a candidate that does well on new data; formally,
$g^* = \arg\min_{g \in \mathcal{G}} E\left[l(Y, g(X))\right]$
i.e., minimize the expected loss (a.k.a. risk $R_l[g]$ minimization)
■ Assumptions:
– There exists $F_{XY}$ that generates $D$ as well as the “new data” (stochastic framework)
– iid samples and a bounded, Lipschitz loss
■ The goal is well-defined, but un-realizable as stated ($F_{XY}$ is unknown). How well can we approximate it? A minimal sketch of these objects follows below.
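To make these objects concrete, here is a minimal, illustrative Python sketch of the setting: a toy $F_{XY}$, a small finite class $\mathcal{G}$ of slopes, a bounded (clipped) squared loss, and the risk-minimization goal approximated with a very large sample. The toy distribution, the candidate slopes, and all function names are assumptions for illustration, not part of the deck.

import numpy as np

rng = np.random.default_rng(0)

def sample_FXY(m):
    # Toy F_XY (an assumption): X uniform on [-1, 1], Y = 0.5*X + Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

# A small finite model G: predictors g_w(x) = w*x for a few fixed slopes w.
G = [0.0, 0.25, 0.5, 0.75, 1.0]

def loss(y, yhat):
    # Clipped squared loss, so the loss is bounded (Delta_l = 1) as assumed.
    return np.minimum((y - yhat) ** 2, 1.0)

def empirical_risk(w, x, y):
    return loss(y, w * x).mean()

# The true risk E[l(Y, g(X))] is unknown in practice; a very large sample
# stands in for it here, purely for illustration.
x_big, y_big = sample_FXY(1_000_000)
risks = {w: empirical_risk(w, x_big, y_big) for w in G}
print("approximate risk of each candidate:", risks)
print("(approximate) risk minimizer g*:", min(risks, key=risks.get))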
Skyline?
■ Case of $|\mathcal{G}| = 1$ (estimate the error rate)
– Law of large numbers: $\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) \xrightarrow{p} E\left[l(Y, g(X))\right]$ as $m \to \infty$
– With high probability, the average loss (a.k.a. empirical risk) on a (large) training set is a good approximation of the risk; see the numerical sketch below.
– For given (but any) $F_{XY}$, $\delta > 0$, $\epsilon > 0$: there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ such that
$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) - E\left[l(Y, g(X))\right]\right| > \epsilon\right) \le \delta$ for all $m \ge m_0(\delta, \epsilon)$.
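The concentration statement can be watched numerically. This self-contained sketch fixes one predictor $g$, repeatedly draws training sets of size $m$ from a toy $F_{XY}$, and estimates $P(|R_m[g] - R[g]| > \epsilon)$; the toy distribution, the clipped loss, and the chosen $\epsilon$ are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_FXY(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

def emp_risk(x, y):
    # Empirical risk of the fixed predictor g(x) = 0.75*x under a clipped squared loss.
    return np.minimum((y - 0.75 * x) ** 2, 1.0).mean()

R = emp_risk(*sample_FXY(1_000_000))      # large-sample proxy for the true risk
eps = 0.02
for m in [10, 100, 1_000, 10_000]:
    bad = [abs(emp_risk(*sample_FXY(m)) - R) > eps for _ in range(500)]
    print(f"m={m:6d}  estimated P(|R_m - R| > eps) = {np.mean(bad):.3f}")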
Some Definitions
■ A problem $(\mathcal{G}, l)$ is learnable iff there exists an algorithm that selects $g_m \in \mathcal{G}$ such that for any $F_{XY}$, $\delta > 0$, $\epsilon > 0$, there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ such that
$P\left(R_l[g_m] - R_l[g^*] > \epsilon\right) \le \delta$ for all $m \ge m_0(\delta, \epsilon)$.
– $g^*$ is the (true) risk minimizer
■ Such an algorithm is called universally consistent ($m_0(\delta, \epsilon)$ may depend on $F_{XY}$)
■ The (smallest) $m_0$ is called the sample complexity of the problem
– Analogously, the sample complexity of an algorithm
Some Algorithms

SAMPLE AVERAGE APPROXIMATION (a.k.a. Empirical Risk Minimization) [Vapnik, 92]
1. $\min_{g \in \mathcal{G}} E\left[l(Y, g(X))\right] \approx \min_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} l(y_i, g(x_i))$ (consistent estimator approximation), i.e., minimize the error on the training set, $R_m[g]$
2. Bounds based on concentration of the mean
3. Indirect bounds (choice of optimization alg.)
(see https://www.coursera.org/course/ml)

SAMPLE APPROXIMATION (a.k.a. Stochastic Gradient Descent) [Robbins & Monro, 51]
1. Update $g^{(k)}$ using $l(y_k, x_k)$ and output $\bar{g} \equiv \frac{1}{m}\sum_{k=1}^{m} g^{(k)}$ (weak estimator approximation)
2. Online learning literature
3. Direct bounds on the risk

A small sketch contrasting the two algorithms follows below.
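A hedged sketch contrasting the two columns on a one-parameter least-squares problem: SAA/ERM minimizes the training average (here in closed form), while SA/SGD makes a single pass, updating after each example and returning the average of its iterates. The toy data and the step size $0.5/\sqrt{k}$ are assumptions.

import numpy as np

rng = np.random.default_rng(1)
m = 10_000
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

# SAA / ERM: the closed-form minimizer of (1/m) * sum_i (y_i - w*x_i)^2.
w_erm = (x @ y) / (x @ x)

# SA / SGD (Robbins & Monro style): one pass over the data, one update per
# example, and the average of the iterates w^(1), ..., w^(m) is returned.
w, w_sum = 0.0, 0.0
for k in range(1, m + 1):
    grad = -2.0 * (y[k - 1] - w * x[k - 1]) * x[k - 1]  # gradient of the k-th loss
    w -= 0.5 / np.sqrt(k) * grad                        # assumed step size 0.5/sqrt(k)
    w_sum += w
w_sgd = w_sum / m

print(f"ERM estimate of the slope:        {w_erm:.4f}")
print(f"SGD (iterate-averaged) estimate:  {w_sgd:.4f}   (true slope = 0.5)")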
ERM consistency: Sufficient conditions
■ $0 \le R[g_m] - R[g^*] = \left(R[g_m] - R_m[g_m]\right) + \left(R_m[g_m] - R_m[g^*]\right) + \left(R_m[g^*] - R[g^*]\right)$
■ $\le \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) + \left(R_m[g^*] - R[g^*]\right)$, since the middle term is $\le 0$ by the definition of the ERM solution $g_m$; the last term $\xrightarrow{p} 0$ (∵ LLN)
■ Hence one-sided uniform convergence is a sufficient condition for ERM consistency
– i.e., $\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) \xrightarrow{p} 0$ as $m \to \infty$
– Vapnik proved this is also necessary for “non-trivial” consistency (of ERM)
A numerical peek at this uniform-deviation term follows below.
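The uniform-deviation term $\max_{g\in\mathcal{G}}(R[g] - R_m[g])$ can be estimated by Monte Carlo when $\mathcal{G}$ is a small finite class, and it visibly shrinks with $m$. The class of 21 slopes and the toy distribution below are assumptions.

import numpy as np

rng = np.random.default_rng(2)
G = np.linspace(-1.0, 1.0, 21)             # a small finite class of slopes (assumption)

def sample(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

def risks(x, y):
    # Clipped squared loss of every candidate at once -> one risk per slope.
    err = (y[None, :] - G[:, None] * x[None, :]) ** 2
    return np.minimum(err, 1.0).mean(axis=1)

R = risks(*sample(200_000))                # large-sample proxy for the true risks R[g]
for m in [10, 100, 1_000, 10_000]:
    sup_dev = np.mean([np.max(R - risks(*sample(m))) for _ in range(200)])
    print(f"m={m:6d}  estimated E[max_g (R[g] - R_m[g])] = {sup_dev:.4f}")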
Story so far …
■ Two algorithms: Sample Average Approx., Sample Approx.
■ One-sided uniform convergence of the mean is sufficient for SAA consistency.
■ Defined Rademacher Complexity.
■ Pending:
– Concentration around the mean for the max. term.
– $\left\{\mathcal{R}_m(\mathcal{G})\right\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
Candidate for Problem Complexity
■ Symmetrize with an independent (“ghost”) sample whose empirical risk is $R'_m$:
$E\left[\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)\right] \le E\left[\max_{g \in \mathcal{G}}\left(R'_m[g] - R_m[g]\right)\right]$
■ The right-hand side is the MAXIMUM DISCREPANCY, our candidate for a problem complexity. Two things remain:
1. Ensure it (asymptotically) goes to zero.
2. Show concentration around the mean for the max. deviation.
Towards Rademacher Complexity
■ Swapping the $i$-th pair between the two samples does not change the distribution, so the maximum discrepancy can be rewritten as
$E_\sigma E\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \left(l(Y'_i, g(X'_i)) - l(Y_i, g(X_i))\right)\right]$
where the $\sigma_i$ are iid Rademacher random variables: $P(\sigma_i = 1) = 0.5$, $P(\sigma_i = -1) = 0.5$.
Rademacher Complexity
■ Splitting the maximum over the two terms (and using that $-\sigma_i$ has the same distribution as $\sigma_i$):
$\le 2\, E\, E_\sigma\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, l(Y_i, g(X_i))\right]$
– the inner $E_\sigma[\cdot]$ is the empirical term; together with the outer $E[\cdot]$ it is the distribution-dependent term
■ Writing $f(Z_i) \equiv l(Y_i, g(X_i))$ with $f \in \mathcal{F}$ (the induced loss class), this equals
$2\, E\, E_\sigma\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right] = 2\, \mathcal{R}_m(\mathcal{F})$
■ $\mathcal{R}_m(\mathcal{F})$ is the Rademacher Complexity; $\hat{\mathcal{R}}_m(\mathcal{F}) \equiv E_\sigma\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right]$ is the empirical Rademacher Complexity, so $\mathcal{R}_m(\mathcal{F}) = E\left[\hat{\mathcal{R}}_m(\mathcal{F})\right]$. A Monte Carlo sketch of $\hat{\mathcal{R}}_m(\mathcal{F})$ follows below.
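Because $\sigma$ is just a vector of random signs, $\hat{\mathcal{R}}_m(\mathcal{F})$ can be estimated by plain Monte Carlo. The sketch below does this for the loss class induced by a small finite set of slopes; the toy data, the class, and the number of $\sigma$ draws are assumptions.

import numpy as np

rng = np.random.default_rng(3)
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

G = np.linspace(-1.0, 1.0, 21)                      # finite predictor class (assumption)
# f(Z_i) = l(y_i, g(x_i)) for every g in G: one row per function f in F.
F = np.minimum((y[None, :] - G[:, None] * x[None, :]) ** 2, 1.0)

vals = []
for _ in range(2000):                               # Monte Carlo over sigma
    sigma = rng.choice([-1.0, 1.0], size=m)         # iid Rademacher signs
    vals.append(np.max(F @ sigma) / m)              # max_f (1/m) sum_i sigma_i f(Z_i)
print(f"estimated empirical Rademacher complexity: {np.mean(vals):.4f}")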
Closer look at $\mathcal{R}_m(\mathcal{F}) = E\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right]$
■ High if $\mathcal{F}$ correlates with random noise
– Classification problems: high if $\mathcal{F}$ can assign arbitrary labels
■ Higher $\mathcal{R}_m(\mathcal{F})$: lower confidence in the predictions
■ $\mathcal{F}_1 \subseteq \mathcal{F}_2 \Rightarrow \mathcal{R}_m(\mathcal{F}_1) \le \mathcal{R}_m(\mathcal{F}_2)$
■ Lower $\mathcal{R}_m(\mathcal{F})$: higher chance we miss the Bayes optimal
■ Choose a model with the right trade-off using domain knowledge.
Relation with classical measures
■ Growth Function: $\Pi_m(\mathcal{F}) \equiv \max_{\{x_1, \ldots, x_m\} \subset \mathcal{X}} \left|\left\{\left(f(x_1), \ldots, f(x_m)\right) \mid f \in \mathcal{F}\right\}\right|$
– Classification case: $\Pi_m(\mathcal{F})$ is the max. no. of distinct classifiers induced on $m$ points
– Massart’s Lemma: $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\frac{2 \log \Pi_m(\mathcal{F})}{m}}$
■ VC-Dimension: $VCdim(\mathcal{F}) \equiv \max\left\{m : \Pi_m(\mathcal{F}) = 2^m\right\}$
– Sauer’s Lemma: $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\frac{2 d \log\frac{em}{d}}{m}}$, where $d = VCdim(\mathcal{F})$
A small sketch comparing these bounds for threshold classifiers follows below.
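For intuition, take threshold classifiers on the line, $f_t(x) = 1[x \ge t]$: they induce exactly $\Pi_m = m + 1$ labelings of $m$ distinct points and have VC dimension $d = 1$. The sketch below compares a Monte Carlo estimate of the empirical Rademacher complexity with the Massart and Sauer bounds; the uniform sample and $\{0,1\}$-valued outputs are assumptions.

import numpy as np

rng = np.random.default_rng(4)
m = 500
x = np.sort(rng.uniform(0.0, 1.0, size=m))

# All labelings induced by f_t(x) = 1[x >= t]: exactly m + 1 distinct rows.
thresholds = np.concatenate(([-np.inf], (x[:-1] + x[1:]) / 2, [np.inf]))
F = (x[None, :] >= thresholds[:, None]).astype(float)       # shape (m + 1, m)

vals = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    vals.append(np.max(F @ sigma) / m)
rad_hat = np.mean(vals)

massart = np.sqrt(2 * np.log(m + 1) / m)                    # via the growth function
sauer = np.sqrt(2 * 1 * np.log(np.e * m / 1) / m)           # via VC dimension d = 1
print(f"Monte Carlo estimate: {rad_hat:.4f}")
print(f"Massart bound: {massart:.4f}   Sauer bound: {sauer:.4f}")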
Mean concentration: Observation
■ Define $h\left((X_1, Y_1), \ldots, (X_m, Y_m)\right) \equiv \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)$
■ $h$ is a function:
– of iid random variables
– satisfying the bounded difference property:
■ $\Delta h$ when one $(X_i, Y_i)$ changes is $\le \frac{\Delta l}{m}$ (∵ bounded loss)
– Concentration around the mean: McDiarmid’s inequality
McDiarmid’s Inequality
Let $X_1, \ldots, X_m \in \mathcal{X}^m$ be iid rvs and $h: \mathcal{X}^m \mapsto \mathbb{R}$ satisfy
$\left|h(x_1, \ldots, x_i, \ldots, x_m) - h(x_1, \ldots, x'_i, \ldots, x_m)\right| \le c_i.$
Then the following hold for any $\epsilon > 0$:
$P\left(h - E[h] \ge \epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}$, $\quad P\left(h - E[h] \le -\epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}$.
For our $h$, $c_i = \frac{\Delta l}{m}$, so the bound becomes $e^{-2 m \epsilon^2 / \Delta l^2} \to 0$; combined with $\mathcal{R}_m(\mathcal{G}) \to 0$ this yields a learnable problem. A small numerical sketch follows below.
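Solving $e^{-2m\epsilon^2/\Delta l^2} \le \delta$ for $m$ gives an explicit sample size, which the tiny sketch below computes; $\Delta l = 1$ (a clipped loss) and the $(\epsilon, \delta)$ pairs are assumptions.

import math

delta_l = 1.0                                   # assumed loss range (clipped loss)

def tail_bound(m, eps):
    # McDiarmid tail with c_i = delta_l / m.
    return math.exp(-2.0 * m * eps ** 2 / delta_l ** 2)

def m0(delta, eps):
    # Smallest m with exp(-2*m*eps^2 / delta_l^2) <= delta.
    return math.ceil(delta_l ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.01, 0.01)]:
    m = m0(delta, eps)
    print(f"eps={eps:5.2f}  delta={delta:5.2f}  ->  m_0={m:7d},  bound at m_0 = {tail_bound(m, eps):.4f}")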
Learning Bounds
■ Let $\delta \equiv e^{-2 m \epsilon^2 / \Delta l^2}$, i.e., $\epsilon = \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
■ $P\left(h - E[h] \ge \epsilon\right) \le \delta$ is the same as: with probability at least $1 - \delta$,
$R[g] \le R_m[g] + 2\,\mathcal{R}_m(\mathcal{F}) + \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \quad \forall\, g \in \mathcal{G}$
– everything here is computable from the training set except the $\mathcal{R}_m(\mathcal{F})$ term
■ Applying McDiarmid again, to $\hat{\mathcal{R}}_m(\mathcal{F})$: with probability at least $1 - \delta$,
$R[g] \le R_m[g] + 2\,\hat{\mathcal{R}}_m(\mathcal{F}) + 3\,\Delta l \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \quad \forall\, g \in \mathcal{G}$
A sketch that evaluates this last, fully empirical bound follows below.
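Every term in the last bound is available from the training sample, so it can be evaluated directly. The sketch below computes the right-hand side for each candidate in a toy finite class and checks it against a large-sample proxy for the true risk; the data, the class, $\delta = 0.05$, and $\Delta l = 1$ are assumptions.

import numpy as np

rng = np.random.default_rng(5)
m, delta, delta_l = 500, 0.05, 1.0
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

G = np.linspace(-1.0, 1.0, 21)                                      # finite class (assumption)
F = np.minimum((y[None, :] - G[:, None] * x[None, :]) ** 2, 1.0)    # losses f(Z_i)
R_m = F.mean(axis=1)                                                # training risks R_m[g]

rad_hat = np.mean([np.max(F @ rng.choice([-1.0, 1.0], size=m)) / m
                   for _ in range(2000)])                           # empirical Rademacher complexity
slack = 3 * delta_l * np.sqrt(np.log(2 / delta) / (2 * m))
bound = R_m + 2 * rad_hat + slack                                   # holds for all g simultaneously

# Large-sample proxy for the true risks, used only to check the bound.
xb = rng.uniform(-1.0, 1.0, size=200_000)
yb = 0.5 * xb + 0.1 * rng.normal(size=200_000)
R_true = np.minimum((yb[None, :] - G[:, None] * xb[None, :]) ** 2, 1.0).mean(axis=1)

print(f"2 * hat(R)_m(F) = {2 * rad_hat:.4f},  confidence slack = {slack:.4f}")
print("bound holds for every candidate g:", bool(np.all(R_true <= bound)))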
Story so far …
■ Two algorithms: Sample Average Approx., Sample Approx.
■ One-sided uniform convergence of the mean is sufficient for SAA consistency.
■ Defined Rademacher Complexity.
■ Concentration around the mean for the max. term.
■ $\left\{\mathcal{R}_m(\mathcal{G})\right\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
■ Examples of usable learnable problems
– Shows the sufficiency condition is not loose
Linear model with Lipschitz loss
■ Consider $\mathcal{G} \equiv \left\{g \mid \exists\, w \ni g(x) = \langle w, \phi(x)\rangle,\ \|w\| \le W\right\}$, $\phi: \mathcal{X} \mapsto \mathcal{H}$ (linear model)
■ Contraction Lemma: $\mathcal{R}_m(\mathcal{F}) \le \mathcal{R}_m(\mathcal{G})$ (up to the Lipschitz constant of $l$)
■ $\mathcal{R}_m(\mathcal{G}) = E_\sigma\left[\max_{\|w\| \le W} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \langle w, \phi(x_i)\rangle\right]$
– $= E_\sigma\left[\max_{\|w\| \le W} \left\langle w, \frac{1}{m}\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\rangle\right]$
– $= \frac{W}{m}\, E_\sigma\left[\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|\right]$
– $\le \frac{W}{m} \sqrt{E_\sigma\left[\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|^2\right]}$ (∵ Jensen’s Inequality)
– $= \frac{W}{m} \sqrt{\sum_{i=1}^{m} \left\|\phi(x_i)\right\|^2} \le \frac{W R}{\sqrt{m}} \to 0$ (if $\|\phi(x)\| \le R$)
■ Hence a Learnable Problem [Shai Shalev-Shwartz et al., 2009]
A quick numerical check of the $\frac{WR}{\sqrt{m}}$ bound follows below.
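The $\frac{WR}{\sqrt{m}}$ calculation can be checked numerically: for the norm-bounded linear class, $\hat{\mathcal{R}}_m(\mathcal{G}) = \frac{W}{m} E_\sigma\left\|\sum_i \sigma_i \phi(x_i)\right\|$, which a few thousand $\sigma$ draws estimate directly. The dimension, $W$, $R$, and $m$ below are assumptions for the sketch.

import numpy as np

rng = np.random.default_rng(6)
d, W, R, m = 10, 2.0, 1.0, 400                          # assumed dimension, norm bounds, sample size

x = rng.normal(size=(m, d))
x = R * x / np.linalg.norm(x, axis=1, keepdims=True)    # features with ||phi(x_i)|| = R exactly

norms = []
for _ in range(5000):                                   # Monte Carlo over the Rademacher signs
    sigma = rng.choice([-1.0, 1.0], size=m)
    norms.append(np.linalg.norm(sigma @ x))             # || sum_i sigma_i phi(x_i) ||
rad_hat = W * np.mean(norms) / m                        # (W/m) E_sigma || sum_i sigma_i phi(x_i) ||

print(f"Monte Carlo estimate of hat(R)_m(G): {rad_hat:.4f}")
print(f"W*R/sqrt(m) bound:                   {W * R / np.sqrt(m):.4f}")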
THANK YOU
