INTRODUCTION TO STATISTICAL LEARNING THEORY
J. Saketha Nath (IIT Bombay)
“The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms” – [Bousquet et al., 04]
What is STL?
Supervised Learning Setting
■ Given:
– Training data: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
– Model: set $\mathcal{G}$ of candidate predictors of the form $g: \mathcal{X} \mapsto \mathcal{Y}$
– Loss function: $l: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}_+$
■ Goal: informally, pick a candidate that does well on new data; formally,
$g^* = \arg\min_{g \in \mathcal{G}} E\left[l(Y, g(X))\right]$
i.e., minimize the expected loss (a.k.a. risk $R_l[g]$ minimization)
■ Assumptions:
– There exists $F_{XY}$ that generates $D$ as well as the “new data” (stochastic framework)
– iid samples and a bounded, Lipschitz loss
■ The goal is well-defined, but un-realizable as stated ($F_{XY}$ is unknown). How well can we approximate it? A minimal sketch of these objects follows below.
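To make these objects concrete, here is a minimal, illustrative Python sketch of the setting: a toy $F_{XY}$, a small finite class $\mathcal{G}$ of slopes, a bounded (clipped) squared loss, and the risk-minimization goal approximated with a very large sample. The toy distribution, the candidate slopes, and all function names are assumptions for illustration, not part of the deck.

import numpy as np

rng = np.random.default_rng(0)

def sample_FXY(m):
    # Toy F_XY (an assumption): X uniform on [-1, 1], Y = 0.5*X + Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

# A small finite model G: predictors g_w(x) = w*x for a few fixed slopes w.
G = [0.0, 0.25, 0.5, 0.75, 1.0]

def loss(y, yhat):
    # Clipped squared loss, so the loss is bounded (Delta_l = 1) as assumed.
    return np.minimum((y - yhat) ** 2, 1.0)

def empirical_risk(w, x, y):
    return loss(y, w * x).mean()

# The true risk E[l(Y, g(X))] is unknown in practice; a very large sample
# stands in for it here, purely for illustration.
x_big, y_big = sample_FXY(1_000_000)
risks = {w: empirical_risk(w, x_big, y_big) for w in G}
print("approximate risk of each candidate:", risks)
print("(approximate) risk minimizer g*:", min(risks, key=risks.get))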
Skyline?
■ Case of $|\mathcal{G}| = 1$ (estimate the error rate)
– Law of large numbers: $\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) \xrightarrow{p} E\left[l(Y, g(X))\right]$ as $m \to \infty$
– With high probability, the average loss (a.k.a. empirical risk) on a (large) training set is a good approximation of the risk; see the numerical sketch below.
– For given (but any) $F_{XY}$, $\delta > 0$, $\epsilon > 0$: there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ such that
$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) - E\left[l(Y, g(X))\right]\right| > \epsilon\right) \le \delta$ for all $m \ge m_0(\delta, \epsilon)$.
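The concentration statement can be watched numerically. This self-contained sketch fixes one predictor $g$, repeatedly draws training sets of size $m$ from a toy $F_{XY}$, and estimates $P(|R_m[g] - R[g]| > \epsilon)$; the toy distribution, the clipped loss, and the chosen $\epsilon$ are assumptions.

import numpy as np

rng = np.random.default_rng(0)

def sample_FXY(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

def emp_risk(x, y):
    # Empirical risk of the fixed predictor g(x) = 0.75*x under a clipped squared loss.
    return np.minimum((y - 0.75 * x) ** 2, 1.0).mean()

R = emp_risk(*sample_FXY(1_000_000))      # large-sample proxy for the true risk
eps = 0.02
for m in [10, 100, 1_000, 10_000]:
    bad = [abs(emp_risk(*sample_FXY(m)) - R) > eps for _ in range(500)]
    print(f"m={m:6d}  estimated P(|R_m - R| > eps) = {np.mean(bad):.3f}")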
Some Definitions
■ A problem $(\mathcal{G}, l)$ is learnable iff there exists an algorithm that selects $g_m \in \mathcal{G}$ such that for any $F_{XY}$, $\delta > 0$, $\epsilon > 0$, there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ such that
$P\left(R_l[g_m] - R_l[g^*] > \epsilon\right) \le \delta$ for all $m \ge m_0(\delta, \epsilon)$.
– $g^*$ is the (true) risk minimizer
■ Such an algorithm is called universally consistent ($m_0(\delta, \epsilon)$ may depend on $F_{XY}$)
■ The (smallest) $m_0$ is called the sample complexity of the problem
– Analogously, the sample complexity of an algorithm
Some Algorithms

SAMPLE AVERAGE APPROXIMATION (a.k.a. Empirical Risk Minimization) [Vapnik, 92]
1. $\min_{g \in \mathcal{G}} E\left[l(Y, g(X))\right] \approx \min_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} l(y_i, g(x_i))$ (consistent estimator approximation), i.e., minimize the error on the training set, $R_m[g]$
2. Bounds based on concentration of the mean
3. Indirect bounds (choice of optimization alg.)
(see https://www.coursera.org/course/ml)

SAMPLE APPROXIMATION (a.k.a. Stochastic Gradient Descent) [Robbins & Monro, 51]
1. Update $g^{(k)}$ using $l(y_k, x_k)$ and output $\bar{g} \equiv \frac{1}{m}\sum_{k=1}^{m} g^{(k)}$ (weak estimator approximation)
2. Online learning literature
3. Direct bounds on the risk

A small sketch contrasting the two algorithms follows below.
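A hedged sketch contrasting the two columns on a one-parameter least-squares problem: SAA/ERM minimizes the training average (here in closed form), while SA/SGD makes a single pass, updating after each example and returning the average of its iterates. The toy data and the step size $0.5/\sqrt{k}$ are assumptions.

import numpy as np

rng = np.random.default_rng(1)
m = 10_000
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

# SAA / ERM: the closed-form minimizer of (1/m) * sum_i (y_i - w*x_i)^2.
w_erm = (x @ y) / (x @ x)

# SA / SGD (Robbins & Monro style): one pass over the data, one update per
# example, and the average of the iterates w^(1), ..., w^(m) is returned.
w, w_sum = 0.0, 0.0
for k in range(1, m + 1):
    grad = -2.0 * (y[k - 1] - w * x[k - 1]) * x[k - 1]  # gradient of the k-th loss
    w -= 0.5 / np.sqrt(k) * grad                        # assumed step size 0.5/sqrt(k)
    w_sum += w
w_sgd = w_sum / m

print(f"ERM estimate of the slope:        {w_erm:.4f}")
print(f"SGD (iterate-averaged) estimate:  {w_sgd:.4f}   (true slope = 0.5)")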
ERM consistency: Sufficient conditions
■ $0 \le R[g_m] - R[g^*] = \left(R[g_m] - R_m[g_m]\right) + \left(R_m[g_m] - R_m[g^*]\right) + \left(R_m[g^*] - R[g^*]\right)$
■ $\le \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) + \left(R_m[g^*] - R[g^*]\right)$, since the middle term is $\le 0$ by the definition of the ERM solution $g_m$; the last term $\xrightarrow{p} 0$ (∵ LLN)
■ Hence one-sided uniform convergence is a sufficient condition for ERM consistency
– i.e., $\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) \xrightarrow{p} 0$ as $m \to \infty$
– Vapnik proved this is also necessary for “non-trivial” consistency (of ERM)
A numerical peek at this uniform-deviation term follows below.
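The uniform-deviation term $\max_{g\in\mathcal{G}}(R[g] - R_m[g])$ can be estimated by Monte Carlo when $\mathcal{G}$ is a small finite class, and it visibly shrinks with $m$. The class of 21 slopes and the toy distribution below are assumptions.

import numpy as np

rng = np.random.default_rng(2)
G = np.linspace(-1.0, 1.0, 21)             # a small finite class of slopes (assumption)

def sample(m):
    x = rng.uniform(-1.0, 1.0, size=m)
    return x, 0.5 * x + 0.1 * rng.normal(size=m)

def risks(x, y):
    # Clipped squared loss of every candidate at once -> one risk per slope.
    err = (y[None, :] - G[:, None] * x[None, :]) ** 2
    return np.minimum(err, 1.0).mean(axis=1)

R = risks(*sample(200_000))                # large-sample proxy for the true risks R[g]
for m in [10, 100, 1_000, 10_000]:
    sup_dev = np.mean([np.max(R - risks(*sample(m))) for _ in range(200)])
    print(f"m={m:6d}  estimated E[max_g (R[g] - R_m[g])] = {sup_dev:.4f}")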
Story so far …
■ Two algorithms: Sample Average Approx., Sample Approx.
■ One-sided uniform convergence of the mean is sufficient for SAA consistency.
■ Defined Rademacher Complexity.
■ Pending:
– Concentration around the mean for the max. term.
– $\left\{\mathcal{R}_m(\mathcal{G})\right\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
Candidate for Problem Complexity
■ Symmetrize with an independent (“ghost”) sample whose empirical risk is $R'_m$:
$E\left[\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)\right] \le E\left[\max_{g \in \mathcal{G}}\left(R'_m[g] - R_m[g]\right)\right]$
■ The right-hand side is the MAXIMUM DISCREPANCY, our candidate for a problem complexity. Two things remain:
1. Ensure it (asymptotically) goes to zero.
2. Show concentration around the mean for the max. deviation.
Towards Rademacher Complexity
■ Swapping the $i$-th pair between the two samples does not change the distribution, so the maximum discrepancy can be rewritten as
$E_\sigma E\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \left(l(Y'_i, g(X'_i)) - l(Y_i, g(X_i))\right)\right]$
where the $\sigma_i$ are iid Rademacher random variables: $P(\sigma_i = 1) = 0.5$, $P(\sigma_i = -1) = 0.5$.
Rademacher Complexity
■ Splitting the maximum over the two terms (and using that $-\sigma_i$ has the same distribution as $\sigma_i$):
$\le 2\, E\, E_\sigma\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, l(Y_i, g(X_i))\right]$
– the inner $E_\sigma[\cdot]$ is the empirical term; together with the outer $E[\cdot]$ it is the distribution-dependent term
■ Writing $f(Z_i) \equiv l(Y_i, g(X_i))$ with $f \in \mathcal{F}$ (the induced loss class), this equals
$2\, E\, E_\sigma\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right] = 2\, \mathcal{R}_m(\mathcal{F})$
■ $\mathcal{R}_m(\mathcal{F})$ is the Rademacher Complexity; $\hat{\mathcal{R}}_m(\mathcal{F}) \equiv E_\sigma\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right]$ is the empirical Rademacher Complexity, so $\mathcal{R}_m(\mathcal{F}) = E\left[\hat{\mathcal{R}}_m(\mathcal{F})\right]$. A Monte Carlo sketch of $\hat{\mathcal{R}}_m(\mathcal{F})$ follows below.
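Because $\sigma$ is just a vector of random signs, $\hat{\mathcal{R}}_m(\mathcal{F})$ can be estimated by plain Monte Carlo. The sketch below does this for the loss class induced by a small finite set of slopes; the toy data, the class, and the number of $\sigma$ draws are assumptions.

import numpy as np

rng = np.random.default_rng(3)
m = 200
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

G = np.linspace(-1.0, 1.0, 21)                      # finite predictor class (assumption)
# f(Z_i) = l(y_i, g(x_i)) for every g in G: one row per function f in F.
F = np.minimum((y[None, :] - G[:, None] * x[None, :]) ** 2, 1.0)

vals = []
for _ in range(2000):                               # Monte Carlo over sigma
    sigma = rng.choice([-1.0, 1.0], size=m)         # iid Rademacher signs
    vals.append(np.max(F @ sigma) / m)              # max_f (1/m) sum_i sigma_i f(Z_i)
print(f"estimated empirical Rademacher complexity: {np.mean(vals):.4f}")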
Closer look at $\mathcal{R}_m(\mathcal{F}) = E\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right]$
■ High if $\mathcal{F}$ correlates with random noise
– Classification problems: high if $\mathcal{F}$ can assign arbitrary labels
■ Higher $\mathcal{R}_m(\mathcal{F})$: lower confidence in the predictions
■ $\mathcal{F}_1 \subseteq \mathcal{F}_2 \Rightarrow \mathcal{R}_m(\mathcal{F}_1) \le \mathcal{R}_m(\mathcal{F}_2)$
■ Lower $\mathcal{R}_m(\mathcal{F})$: higher chance we miss the Bayes optimal
■ Choose a model with the right trade-off using domain knowledge.
Relation with classical measures
■ Growth Function: $\Pi_m(\mathcal{F}) \equiv \max_{\{x_1, \ldots, x_m\} \subset \mathcal{X}} \left|\left\{\left(f(x_1), \ldots, f(x_m)\right) \mid f \in \mathcal{F}\right\}\right|$
– Classification case: $\Pi_m(\mathcal{F})$ is the max. no. of distinct classifiers induced on $m$ points
– Massart’s Lemma: $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\frac{2 \log \Pi_m(\mathcal{F})}{m}}$
■ VC-Dimension: $VCdim(\mathcal{F}) \equiv \max\left\{m : \Pi_m(\mathcal{F}) = 2^m\right\}$
– Sauer’s Lemma: $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\frac{2 d \log\frac{em}{d}}{m}}$, where $d = VCdim(\mathcal{F})$
A small sketch comparing these bounds for threshold classifiers follows below.
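For intuition, take threshold classifiers on the line, $f_t(x) = 1[x \ge t]$: they induce exactly $\Pi_m = m + 1$ labelings of $m$ distinct points and have VC dimension $d = 1$. The sketch below compares a Monte Carlo estimate of the empirical Rademacher complexity with the Massart and Sauer bounds; the uniform sample and $\{0,1\}$-valued outputs are assumptions.

import numpy as np

rng = np.random.default_rng(4)
m = 500
x = np.sort(rng.uniform(0.0, 1.0, size=m))

# All labelings induced by f_t(x) = 1[x >= t]: exactly m + 1 distinct rows.
thresholds = np.concatenate(([-np.inf], (x[:-1] + x[1:]) / 2, [np.inf]))
F = (x[None, :] >= thresholds[:, None]).astype(float)       # shape (m + 1, m)

vals = []
for _ in range(2000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    vals.append(np.max(F @ sigma) / m)
rad_hat = np.mean(vals)

massart = np.sqrt(2 * np.log(m + 1) / m)                    # via the growth function
sauer = np.sqrt(2 * 1 * np.log(np.e * m / 1) / m)           # via VC dimension d = 1
print(f"Monte Carlo estimate: {rad_hat:.4f}")
print(f"Massart bound: {massart:.4f}   Sauer bound: {sauer:.4f}")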
Mean concentration: Observation
■ Define $h\left((X_1, Y_1), \ldots, (X_m, Y_m)\right) \equiv \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)$
■ $h$ is a function:
– of iid random variables
– satisfying the bounded difference property:
■ $\Delta h$ when one $(X_i, Y_i)$ changes is $\le \frac{\Delta l}{m}$ (∵ bounded loss)
– Concentration around the mean: McDiarmid’s inequality
McDiarmid’s Inequality
Let $X_1, \ldots, X_m \in \mathcal{X}^m$ be iid rvs and $h: \mathcal{X}^m \mapsto \mathbb{R}$ satisfy
$\left|h(x_1, \ldots, x_i, \ldots, x_m) - h(x_1, \ldots, x'_i, \ldots, x_m)\right| \le c_i.$
Then the following hold for any $\epsilon > 0$:
$P\left(h - E[h] \ge \epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}$, $\quad P\left(h - E[h] \le -\epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}$.
For our $h$, $c_i = \frac{\Delta l}{m}$, so the bound becomes $e^{-2 m \epsilon^2 / \Delta l^2} \to 0$; combined with $\mathcal{R}_m(\mathcal{G}) \to 0$ this yields a learnable problem. A small numerical sketch follows below.
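Solving $e^{-2m\epsilon^2/\Delta l^2} \le \delta$ for $m$ gives an explicit sample size, which the tiny sketch below computes; $\Delta l = 1$ (a clipped loss) and the $(\epsilon, \delta)$ pairs are assumptions.

import math

delta_l = 1.0                                   # assumed loss range (clipped loss)

def tail_bound(m, eps):
    # McDiarmid tail with c_i = delta_l / m.
    return math.exp(-2.0 * m * eps ** 2 / delta_l ** 2)

def m0(delta, eps):
    # Smallest m with exp(-2*m*eps^2 / delta_l^2) <= delta.
    return math.ceil(delta_l ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))

for eps, delta in [(0.1, 0.05), (0.05, 0.05), (0.01, 0.01)]:
    m = m0(delta, eps)
    print(f"eps={eps:5.2f}  delta={delta:5.2f}  ->  m_0={m:7d},  bound at m_0 = {tail_bound(m, eps):.4f}")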
Learning Bounds
■ Let $\delta \equiv e^{-2 m \epsilon^2 / \Delta l^2}$, i.e., $\epsilon = \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$
■ $P\left(h - E[h] \ge \epsilon\right) \le \delta$ is the same as: with probability at least $1 - \delta$,
$R[g] \le R_m[g] + 2\,\mathcal{R}_m(\mathcal{F}) + \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \quad \forall\, g \in \mathcal{G}$
– everything here is computable from the training set except the $\mathcal{R}_m(\mathcal{F})$ term
■ Applying McDiarmid again, to $\hat{\mathcal{R}}_m(\mathcal{F})$: with probability at least $1 - \delta$,
$R[g] \le R_m[g] + 2\,\hat{\mathcal{R}}_m(\mathcal{F}) + 3\,\Delta l \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \quad \forall\, g \in \mathcal{G}$
A sketch that evaluates this last, fully empirical bound follows below.
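Every term in the last bound is available from the training sample, so it can be evaluated directly. The sketch below computes the right-hand side for each candidate in a toy finite class and checks it against a large-sample proxy for the true risk; the data, the class, $\delta = 0.05$, and $\Delta l = 1$ are assumptions.

import numpy as np

rng = np.random.default_rng(5)
m, delta, delta_l = 500, 0.05, 1.0
x = rng.uniform(-1.0, 1.0, size=m)
y = 0.5 * x + 0.1 * rng.normal(size=m)

G = np.linspace(-1.0, 1.0, 21)                                      # finite class (assumption)
F = np.minimum((y[None, :] - G[:, None] * x[None, :]) ** 2, 1.0)    # losses f(Z_i)
R_m = F.mean(axis=1)                                                # training risks R_m[g]

rad_hat = np.mean([np.max(F @ rng.choice([-1.0, 1.0], size=m)) / m
                   for _ in range(2000)])                           # empirical Rademacher complexity
slack = 3 * delta_l * np.sqrt(np.log(2 / delta) / (2 * m))
bound = R_m + 2 * rad_hat + slack                                   # holds for all g simultaneously

# Large-sample proxy for the true risks, used only to check the bound.
xb = rng.uniform(-1.0, 1.0, size=200_000)
yb = 0.5 * xb + 0.1 * rng.normal(size=200_000)
R_true = np.minimum((yb[None, :] - G[:, None] * xb[None, :]) ** 2, 1.0).mean(axis=1)

print(f"2 * hat(R)_m(F) = {2 * rad_hat:.4f},  confidence slack = {slack:.4f}")
print("bound holds for every candidate g:", bool(np.all(R_true <= bound)))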
Story so far …
■ Two algorithms: Sample Average Approx., Sample Approx.
■ One-sided uniform convergence of the mean is sufficient for SAA consistency.
■ Defined Rademacher Complexity.
■ Concentration around the mean for the max. term.
■ $\left\{\mathcal{R}_m(\mathcal{G})\right\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
■ Examples of usable learnable problems
– Shows the sufficiency condition is not loose
Linear model with Lipschitz loss
■ Consider $\mathcal{G} \equiv \left\{g \mid \exists\, w \ni g(x) = \langle w, \phi(x)\rangle,\ \|w\| \le W\right\}$, $\phi: \mathcal{X} \mapsto \mathcal{H}$ (linear model)
■ Contraction Lemma: $\mathcal{R}_m(\mathcal{F}) \le \mathcal{R}_m(\mathcal{G})$ (up to the Lipschitz constant of $l$)
■ $\mathcal{R}_m(\mathcal{G}) = E_\sigma\left[\max_{\|w\| \le W} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \langle w, \phi(x_i)\rangle\right]$
– $= E_\sigma\left[\max_{\|w\| \le W} \left\langle w, \frac{1}{m}\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\rangle\right]$
– $= \frac{W}{m}\, E_\sigma\left[\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|\right]$
– $\le \frac{W}{m} \sqrt{E_\sigma\left[\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|^2\right]}$ (∵ Jensen’s Inequality)
– $= \frac{W}{m} \sqrt{\sum_{i=1}^{m} \left\|\phi(x_i)\right\|^2} \le \frac{W R}{\sqrt{m}} \to 0$ (if $\|\phi(x)\| \le R$)
■ Hence a Learnable Problem [Shai Shalev-Shwartz et al., 2009]
A quick numerical check of the $\frac{WR}{\sqrt{m}}$ bound follows below.
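The $\frac{WR}{\sqrt{m}}$ calculation can be checked numerically: for the norm-bounded linear class, $\hat{\mathcal{R}}_m(\mathcal{G}) = \frac{W}{m} E_\sigma\left\|\sum_i \sigma_i \phi(x_i)\right\|$, which a few thousand $\sigma$ draws estimate directly. The dimension, $W$, $R$, and $m$ below are assumptions for the sketch.

import numpy as np

rng = np.random.default_rng(6)
d, W, R, m = 10, 2.0, 1.0, 400                          # assumed dimension, norm bounds, sample size

x = rng.normal(size=(m, d))
x = R * x / np.linalg.norm(x, axis=1, keepdims=True)    # features with ||phi(x_i)|| = R exactly

norms = []
for _ in range(5000):                                   # Monte Carlo over the Rademacher signs
    sigma = rng.choice([-1.0, 1.0], size=m)
    norms.append(np.linalg.norm(sigma @ x))             # || sum_i sigma_i phi(x_i) ||
rad_hat = W * np.mean(norms) / m                        # (W/m) E_sigma || sum_i sigma_i phi(x_i) ||

print(f"Monte Carlo estimate of hat(R)_m(G): {rad_hat:.4f}")
print(f"W*R/sqrt(m) bound:                   {W * R / np.sqrt(m):.4f}")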
THANK YOU
