Lecture notes
Selected theoretical aspects of machine learning and deep learning
François Bachoc
University Paul Sabatier
March 3, 2021
Contents
1 Generalities on regression, classification and neural networks
1.1 Regression
1.2 Classification
1.3 Neural networks
2 Approximation with neural networks with one hidden layer
2.1 Statement of the theorem
2.2 Sketch of the proof
2.3 Complete proof
3 Complements on the generalization error in classification and VC-dimension
3.1 Shattering coefficients
3.2 Bounding the generalization error from the shattering coefficients
3.3 VC-dimension
3.4 Bounding the shattering coefficients from the VC-dimension
4 VC-dimension of neural networks with several hidden layers
4.1 Neural networks as directed acyclic graphs
4.2 Bounding the VC-dimension
4.3 Proof of the theorem
Acknowledgements
Sections 2 and 4 are taken from lecture notes by Sébastien Gerchinovitz that are available at https://www.math.univ-toulouse.fr/~fmalgouy/enseignement/mva.html. Parts of Section 3 benefited from parts of the book [Gir14].
Introduction
By deep learning, we mean neural networks with many hidden layers. Deep learning
methods are currently very popular for tasks such as the following.
• regression: predicting y ∈ R.
• classification: predicting y ∈ {0, 1} or y ∈ {0, . . . , K}.
• generative modeling: generating vectors x ∈ Rd following an unknown target distribution.
Typical applications are:
• For regression: any type of input x ∈ Rd with a corresponding output y ∈ R to predict. For
instance, y can be the delay of a flight and x can gather characteristics of this flight, such as the
day, the position of the airport and the duration.
• For classification: x ∈ Rd can be an image (vector of color levels for each pixel) and y can
give the type of image, for instance cat/dog, or the value of a digit.
• For generative modeling: generating images (e.g. faces) or musical pieces.
Goals of the lecture notes. The goal is to study some theoretical aspects of deep learning, and
in some cases of machine learning more broadly. There are many recent contributions and only a few
of them will be covered.
1 Generalities on regression, classification and neural networks
1.1 Regression
We consider a law $\mathcal{L}$ on $[0,1]^d \times \mathbb{R}$. We aim at finding a function $f : [0,1]^d \to \mathbb{R}$ such that, for $(X, Y) \sim \mathcal{L}$,
$$\mathbb{E}\left[(f(X) - Y)^2\right]$$
is small.
The optimal function f is then the conditional expectation, as shown in the following proposition.
Proposition 1 Let $f^\star : [0,1]^d \to \mathbb{R}$ be defined by
$$f^\star(x) = \mathbb{E}\left[Y \mid X = x\right],$$
for $x \in [0,1]^d$. Then, for any $f : [0,1]^d \to \mathbb{R}$,
$$\mathbb{E}\left[(f(X) - Y)^2\right] = \mathbb{E}\left[(f^\star(X) - Y)^2\right] + \mathbb{E}\left[(f^\star(X) - f(X))^2\right].$$
From the previous proposition, $f^\star$ minimizes the mean square error among all possible functions,
and the closer a function $f$ is to $f^\star$, the smaller its mean square error.
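As a quick numerical sanity check (not part of the original notes), the identity of Proposition 1 can be verified by Monte Carlo on a toy model; the sketch below assumes the hypothetical model $Y = \sin(2\pi X) + \varepsilon$ with Gaussian noise, so that $f^\star(x) = \sin(2\pi x)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical toy model on [0, 1]: Y = sin(2*pi*X) + noise, so f*(x) = sin(2*pi*x).
X = rng.uniform(0.0, 1.0, size=n)
Y = np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=n)

f_star = lambda x: np.sin(2 * np.pi * x)   # conditional expectation E[Y | X = x]
f = lambda x: 2.0 * x - 1.0                # an arbitrary competitor function

lhs = np.mean((f(X) - Y) ** 2)
rhs = np.mean((f_star(X) - Y) ** 2) + np.mean((f_star(X) - f(X)) ** 2)
print(lhs, rhs)  # the two Monte Carlo estimates should be close (Proposition 1)
```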
Proof of Proposition 1 Let us use the law of total expectation:
$$\mathbb{E}\left[(f(X) - Y)^2\right] = \mathbb{E}\left[\mathbb{E}\left[(f(X) - Y)^2 \,\middle|\, X\right]\right].$$
Conditionally on $X$, we can use the identity
$$\mathbb{E}\left[(Z - a)^2\right] = \mathrm{Var}(Z) + (\mathbb{E}[Z] - a)^2$$
for a random variable $Z$ and a deterministic number $a$ (bias-variance decomposition). This gives
$$\begin{aligned}
\mathbb{E}\left[(f(X) - Y)^2\right] &= \mathbb{E}\left[(\mathbb{E}[Y|X] - f(X))^2 + \mathrm{Var}(Y|X)\right] \\
&= \mathbb{E}\left[(f^\star(X) - f(X))^2\right] + \mathbb{E}\left[\mathrm{Var}(Y|X)\right] \\
&= \mathbb{E}\left[(f^\star(X) - f(X))^2\right] + \mathbb{E}\left[\mathbb{E}\left[(Y - \mathbb{E}[Y|X])^2 \,\middle|\, X\right]\right] \\
\text{(law of total expectation:)} &= \mathbb{E}\left[(f^\star(X) - f(X))^2\right] + \mathbb{E}\left[(Y - \mathbb{E}[Y|X])^2\right] \\
&= \mathbb{E}\left[(f^\star(X) - f(X))^2\right] + \mathbb{E}\left[(Y - f^\star(X))^2\right]. \qquad \square
\end{aligned}$$
We now consider a data set $(X_1, Y_1), \ldots, (X_n, Y_n)$, independent and of law $\mathcal{L}$. We
consider a function learned by empirical risk minimization. We let $\mathcal{F}$ be a set of functions from $[0,1]^d$
to $\mathbb{R}$. We consider
$$\hat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (f(X_i) - Y_i)^2.$$
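As an illustration of empirical risk minimization (a minimal sketch in which $\mathcal{F}$ is taken, purely for concreteness, to be the set of polynomials of degree at most $k$), the empirical minimizer then has a closed form given by ordinary least squares:

```python
import numpy as np

def erm_polynomial(X, Y, degree):
    """Empirical risk minimizer of the average squared error over
    polynomials of degree <= `degree` (ordinary least squares)."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    A = np.vander(X, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return lambda x: np.vander(np.atleast_1d(x), N=degree + 1, increasing=True) @ coef

# Example usage on data simulated from the toy model above.
rng = np.random.default_rng(1)
Xn = rng.uniform(0.0, 1.0, size=500)
Yn = np.sin(2 * np.pi * Xn) + rng.normal(scale=0.3, size=500)
f_hat = erm_polynomial(Xn, Yn, degree=5)
print(np.mean((f_hat(Xn) - Yn) ** 2))  # empirical risk of the fitted function
```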
The next proposition makes it possible to bound the mean square error of $\hat{f}_n$.
Proposition 2 Let $(X, Y) \sim \mathcal{L}$, independently from $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then we have
$$\mathbb{E}\left[\left(\hat{f}_n(X) - Y\right)^2\right] - \mathbb{E}\left[(f^\star(X) - Y)^2\right] \leq 2\,\mathbb{E}\left(\sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[(f(X) - Y)^2\right] - \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2 \right|\right) + \inf_{f \in \mathcal{F}} \mathbb{E}\left[(f(X) - f^\star(X))^2\right].$$
Remarks
• In the term
$$\mathbb{E}\left[\left(\hat{f}_n(X) - Y\right)^2\right]$$
the expectation is taken with respect to both $(X_1, Y_1), \ldots, (X_n, Y_n)$ and $(X, Y)$.
• We bound
$$\mathbb{E}\left[\left(\hat{f}_n(X) - Y\right)^2\right] - \mathbb{E}\left[(f^\star(X) - Y)^2\right],$$
which is always non-negative and is called the excess risk.
• The first component of the bound is
$$2\,\mathbb{E}\left(\sup_{f \in \mathcal{F}} \left| \mathbb{E}\left[(f(X) - Y)^2\right] - \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2 \right|\right),$$
which is called the generalization error. The larger the set $\mathcal{F}$ is, the larger this error is, because
the supremum is taken over a larger set.
• The second component of the bound is
$$\inf_{f \in \mathcal{F}} \mathbb{E}\left[(f(X) - f^\star(X))^2\right],$$
which is called the approximation error. The smaller $\mathcal{F}$ is, the larger this error is, because the
infimum is taken over fewer functions.
• Hence, we see that $\mathcal{F}$ should be neither too small nor too large, which can be interpreted as a
bias-variance trade-off. This trade-off can be observed empirically, as in the sketch below.
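The sketch (again using the hypothetical polynomial classes and toy model of the previous snippets, purely for illustration) estimates the risk of $\hat{f}_n$ on a large independent sample as the class $\mathcal{F}$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(n):
    X = rng.uniform(0.0, 1.0, size=n)
    return X, np.sin(2 * np.pi * X) + rng.normal(scale=0.3, size=n)

Xn, Yn = simulate(200)        # training sample
Xt, Yt = simulate(100_000)    # large independent sample to estimate the true risk

for degree in [0, 1, 3, 5, 10, 15]:
    A = np.vander(Xn, N=degree + 1, increasing=True)
    coef, *_ = np.linalg.lstsq(A, Yn, rcond=None)
    risk = np.mean((np.vander(Xt, N=degree + 1, increasing=True) @ coef - Yt) ** 2)
    # Small classes suffer from the approximation error, large ones from the
    # generalization error, so the estimated risk typically decreases then increases.
    print(degree, risk)
```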
Proof of Proposition 2
We let, for $f \in \mathcal{F}$,
$$R(f) = \mathbb{E}\left[(Y - f(X))^2\right]$$
and
$$R_n(f) = \frac{1}{n}\sum_{i=1}^{n}(f(X_i) - Y_i)^2.$$
We let, for $\epsilon > 0$, $f_\epsilon \in \mathcal{F}$ be such that
$$R(f_\epsilon) \leq \inf_{f \in \mathcal{F}} R(f) + \epsilon.$$
Then we have, by the law of total expectation,
$$\mathbb{E}\left[\left(\hat{f}_n(X) - Y\right)^2\right] = \mathbb{E}\left[\mathbb{E}\left[\left(\hat{f}_n(X) - Y\right)^2 \,\middle|\, X_1, Y_1, \ldots, X_n, Y_n\right]\right] = \mathbb{E}\left[R(\hat{f}_n)\right],$$
since $(X, Y)$ is independent from $X_1, Y_1, \ldots, X_n, Y_n$ and, in $R(\hat{f}_n)$, the function $\hat{f}_n$ is fixed, as the
expectation is taken only with respect to $X$ and $Y$. Then we have
$$\begin{aligned}
\mathbb{E}\left[R(\hat{f}_n)\right] - R(f^\star) ={}& \mathbb{E}\left[R(\hat{f}_n) - R_n(\hat{f}_n)\right] + \mathbb{E}\left[R_n(\hat{f}_n) - R_n(f_\epsilon)\right] + \mathbb{E}\left[R_n(f_\epsilon) - R(f_\epsilon)\right] \\
&+ \left(R(f_\epsilon) - \inf_{f \in \mathcal{F}} R(f)\right) + \left(\inf_{f \in \mathcal{F}} R(f) - R(f^\star)\right) \\
\leq{}& \mathbb{E}\left(\sup_{f \in \mathcal{F}} |R(f) - R_n(f)|\right) + 0 + \mathbb{E}\left(\sup_{f \in \mathcal{F}} |R(f) - R_n(f)|\right) + \epsilon + \inf_{f \in \mathcal{F}} \left(R(f) - R(f^\star)\right) \\
\text{(Proposition 1:)} ={}& 2\,\mathbb{E}\left(\sup_{f \in \mathcal{F}} |R(f) - R_n(f)|\right) + \epsilon + \inf_{f \in \mathcal{F}} \mathbb{E}\left[(f(X) - f^\star(X))^2\right].
\end{aligned}$$
Here the second term of the decomposition is bounded by $0$ because $\hat{f}_n$ minimizes $R_n$ over $\mathcal{F}$, and the fourth term is bounded by $\epsilon$ by the choice of $f_\epsilon$.
Since this inequality holds for any $\epsilon > 0$, we also obtain the inequality with $\epsilon = 0$, which concludes the
proof. $\square$
1.2 Classification
The general principle is quite similar to regression. We consider a law $\mathcal{L}$ on $[0,1]^d \times \{0, 1\}$. We are
looking for a function $f : [0,1]^d \to \{0, 1\}$ (a classifier) such that, with $(X, Y) \sim \mathcal{L}$,
$$\mathbb{P}\left(f(X) \neq Y\right)$$
is small. The next proposition provides the optimal function $f$ for this.
Proposition 3 Let $p^\star : [0,1]^d \to [0, 1]$ be defined by
$$p^\star(x) = \mathbb{P}\left(Y = 1 \mid X = x\right)$$
for $x \in [0,1]^d$. We let
$$T^\star(x) = \mathbf{1}_{p^\star(x) \geq \frac{1}{2}}$$
for $x \in [0,1]^d$. Then, for any $f : [0,1]^d \to \{0, 1\}$,
$$\mathbb{P}\left(f(X) \neq Y\right) = \mathbb{P}\left(T^\star(X) \neq Y\right) + \mathbb{E}\left[\mathbf{1}_{T^\star(X) \neq f(X)} \left(\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))\right)\right].$$
Hence, we see that a prediction error (that is, predicting $f(X)$ with $f(X) \neq T^\star(X)$) is more
harmful when
$$\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))$$
is large. This is natural, because when
$$\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X)) = 0,$$
we have $p^\star(X) = 1/2$, thus $\mathbb{P}(Y = 1|X) = 1/2$. In this case, $\mathbb{P}(f(X) \neq Y | X) = 1/2$, regardless of the
value of $f(X)$.
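As an illustration (a hypothetical toy example, not from the notes), take $X$ uniform on $[0,1]$ and $p^\star(x) = x$, so that $T^\star(x) = \mathbf{1}_{x \geq 1/2}$; the excess probability of error of any classifier can then be estimated either directly or through the identity of Proposition 3, and the two estimates should agree.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000

# Hypothetical toy model: X uniform on [0, 1], P(Y = 1 | X = x) = p*(x) = x.
X = rng.uniform(0.0, 1.0, size=n)
Y = (rng.uniform(size=n) < X).astype(int)

p_star = X
T_star = (p_star >= 0.5).astype(int)   # Bayes classifier T*(x) = 1_{p*(x) >= 1/2}
f = (X >= 0.8).astype(int)             # an arbitrary competitor classifier

direct = np.mean(f != Y) - np.mean(T_star != Y)
identity = np.mean((T_star != f)
                   * (np.maximum(p_star, 1 - p_star) - np.minimum(p_star, 1 - p_star)))
print(direct, identity)  # both estimate the same excess probability of error
```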
Proof of Proposition 3 Using the law of total expectation, we have
$$\mathbb{P}\left(f(X) \neq Y\right) - \mathbb{P}\left(T^\star(X) \neq Y\right) = \mathbb{E}\left[\mathbb{E}\left[\mathbf{1}_{f(X) \neq Y} - \mathbf{1}_{T^\star(X) \neq Y} \,\middle|\, X\right]\right] := \mathbb{E}\left[\mathbb{E}\left[e(X, Y) \mid X\right]\right].$$
Conditionally on $X$, we have the following.
• If $T^\star(X) = 1$, then
– if $f(X) = 1$, then $e(X, Y) = 0$,
– if $f(X) = 0$, then
∗ $e(X, Y) = 1$ with probability $\mathbb{P}(Y = 1|X) = p^\star(X) = \max(p^\star(X), 1 - p^\star(X))$,
∗ $e(X, Y) = -1$ with probability $\mathbb{P}(Y = 0|X) = 1 - p^\star(X) = \min(p^\star(X), 1 - p^\star(X))$,
and thus
$$\mathbb{E}\left[e(X, Y) \mid X\right] = \left(\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))\right)\mathbf{1}_{f(X) \neq T^\star(X)}.$$
• If $T^\star(X) = 0$, then
– if $f(X) = 0$, then $e(X, Y) = 0$,
– if $f(X) = 1$, then
∗ $e(X, Y) = 1$ with probability $\mathbb{P}(Y = 0|X) = 1 - p^\star(X) = \max(p^\star(X), 1 - p^\star(X))$,
∗ $e(X, Y) = -1$ with probability $\mathbb{P}(Y = 1|X) = p^\star(X) = \min(p^\star(X), 1 - p^\star(X))$,
and thus
$$\mathbb{E}\left[e(X, Y) \mid X\right] = \left(\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))\right)\mathbf{1}_{f(X) \neq T^\star(X)}.$$
Hence, eventually,
$$\mathbb{P}\left(f(X) \neq Y\right) - \mathbb{P}\left(T^\star(X) \neq Y\right) = \mathbb{E}\left[\mathbf{1}_{T^\star(X) \neq f(X)} \left(\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))\right)\right]. \qquad \square$$
We now consider a data set $(X_1, Y_1), \ldots, (X_n, Y_n)$, independent and of law $\mathcal{L}$. We
consider a function that is learned by empirical risk minimization. We consider a set $\mathcal{F}$ of functions
from $[0,1]^d$ to $\{0, 1\}$ and
$$\hat{f}_n \in \operatorname*{argmin}_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_{Y_i \neq f(X_i)}.$$
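For concreteness, here is a minimal sketch of this empirical risk minimization in the special case $d = 1$ with the class of threshold classifiers $\{x \mapsto \mathbf{1}_{x \geq a}\}$ (an assumption made only for illustration); since the empirical error is piecewise constant in $a$, it suffices to test one threshold per data point.

```python
import numpy as np

def erm_threshold(X, Y):
    """Empirical risk minimizer of the 0-1 loss over the class
    {x -> 1_{x >= a} : a in R}, in dimension d = 1."""
    # The empirical error only changes at data points, so it suffices to try
    # one threshold below all points, one above all points, and each X_i.
    candidates = np.concatenate(([X.min() - 1.0, X.max() + 1.0], X))
    errors = [np.mean((X >= a).astype(int) != Y) for a in candidates]
    return candidates[int(np.argmin(errors))]

# Example usage: labels generated from a noisy threshold at 0.5.
rng = np.random.default_rng(4)
Xn = rng.uniform(0.0, 1.0, size=300)
flip = rng.uniform(size=300) < 0.1
Yn = np.where(flip, 1 - (Xn >= 0.5).astype(int), (Xn >= 0.5).astype(int))
a_hat = erm_threshold(Xn, Yn)
print(a_hat, np.mean((Xn >= a_hat).astype(int) != Yn))
```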
The next proposition makes it possible to bound the probability of error of $\hat{f}_n$.
Proposition 4 Let $(X, Y) \sim \mathcal{L}$, independently from $(X_1, Y_1), \ldots, (X_n, Y_n)$. Then we have
$$\mathbb{P}\left(\hat{f}_n(X) \neq Y\right) - \mathbb{P}\left(T^\star(X) \neq Y\right) \leq 2\,\mathbb{E}\left(\sup_{f \in \mathcal{F}} \left| \mathbb{P}\left(f(X) \neq Y\right) - \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}_{f(X_i) \neq Y_i} \right|\right) + \inf_{f \in \mathcal{F}} \mathbb{E}\left[\mathbf{1}_{T^\star(X) \neq f(X)} \left(\max(p^\star(X), 1 - p^\star(X)) - \min(p^\star(X), 1 - p^\star(X))\right)\right].$$
The proof and the interpretation are the same as for regression.
  • 42. P (f(X) 6= Y ) − 1 n n X i=1 1f(Xi)6=Yi
  • 47. ! + inf f∈F E 1T?(X)6=f(X) (max(p? (X), 1 − p? (X)) − min(p? (X), 1 − p? (X))) . The proof and the interpretation are the same as for regression. 1.3 Neural networks Neural networks define a set of functions from [0, 1]d to R. Feed-forward neural networks with one hidden layer This is the simplest example. These networks are represented as in Figure 1. 5
  • 48. Inputs Neurons of the hidden layer Output neuron Figure 1: Representation of a feed-forward neural network with one hidden layer. In Figure 1, the interpretation is the following. • The arrows mean that there is a multiplication by a scalar or that a function from R to R is applied and (possibly) a scalar is added. • The function σ : R → R is called the activation function. • A circle (a neuron) sums all the values that are pointed to it by the arrows. • The column with w1, . . . , wN is called the hidden layer. The function corresponding to Figure 1 is x ∈ [0, 1]d 7→ N X i=1 viσ (hwi, xi + bi) , with h·, ·i the standard inner product on Rd. The neural network function is parametrized by 6
  • 49. • σ : R → R, the activation function, • v1, . . . , vN ∈ R, the output weights, • w1, . . . , wN ∈ Rd, the weights (of the neurons of the hidden layer), • b1, . . . , bN ∈ R, the biases. Examples of activation functions are • linear σ(x) = x, • threshold σ(x) = 1x≥0, • sigmoid σ(x) = ex/(1 + ex), • ReLU σ(x) = max(0, x). For instance, the network of Figure 2 encodes the absolute value function. Figure 2: Representation of the absolute value function as a neural network. Feed-forward neural networks with several hidden layers. This is the same type of repre- sentation but with several layers of activation functions. These networks are represented as in Figure 3. 7
  • 50. Inputs Hidden layer 1 Hidden layer c Output neuron Figure 3: Representation of a feed-forward neural network with several hidden layer. The neural network function corresponding to Figure 3 is defined by x ∈ [0, 1]d 7→ fv ◦ gc ◦ gc−1 ◦ · · · ◦ g1(x), where fv : RNc → R u → Nc X i=1 uivi and for i = 1, . . . , c, with N0 = d, gi : RNi−1 → RNi is defined by, for u ∈ RNi−1 and j = 1, . . . , Ni, (gi(u))j = σ hw (i) j , ui + b (i) j . The neural network function is parametrized by 8
  • 51. • σ : R → R, the activation function, • v ∈ RNc , the output weights, • b (c) 1 , . . . , b (c) Nc ∈ R, the biases of the hidden layer c, • w (c) 1 , . . . , w (c) Nc ∈ RNc−1, the weights of the hidden layer c, • . . . • b (2) 1 , . . . , b (2) N2 ∈ R, the biases of the hidden layer 2, • w (2) 1 , . . . , w (2) N2 ∈ RN1 , the weights of the hidden layer 2, • b (1) 1 , . . . , b (1) N1 ∈ R, the biases of the hidden layer 1, • w (1) 1 , . . . , w (1) N1 ∈ Rd, the weights of the hidden layer 1. Classes of functions To come back to regression, the class of functions F corresponding to neural networks is given by • c, number of hidden layers, • σ, activation function, • N1, . . . , Nc, numbers of neurons in the hidden layers. These parameters are called architecture parameters. Then F is a parametric set of functions F = n neural networks parametrized by v, b (c) 1 , . . . , b (c) Nc , w (c) 1 , . . . , w (c) Nc , . . . , b (1) 1 , . . . , b (1) N1 , w (1) 1 , . . . , w (1) N1 o . For classification, for g ∈ F, we take f(x) = ( 1 if g(x) ≥ 0 0 if g(x) 0 to have a parametric set of classifiers. 2 Approximation with neural networks with one hidden layer 2.1 Statement of the theorem Several theorems tackle the universality of feed-forward neural networks with one hidden layer of the form x ∈ [0, 1]d 7→ N X i=1 viσ(hwi, xi + bi) with v1, . . . , vN ∈ R, b1, . . . , bN ∈ R, w1, . . . , wN ∈ Rd and σ : R → R. We will study the first theorem of the literature, from [Cyb89]. Theorem 5 ([Cyb89]) Let σ : R → R be a continuous function such that    σ(t) → t→−∞ 0 σ(t) → t→+∞ 1. 9
  • 52. Then the set N1 of functions of the form x ∈ [0, 1]d 7→ N X i=1 viσ(hwi, xi + bi); N ∈ N, v1, . . . , vN ∈ R, b1, . . . , bN ∈ R, w1, . . . , wN ∈ Rd is dense in the set C([0, 1]d, R) of the real- valued continuous functions on [0, 1]d, endowed with the supremum norm, ||f||∞ = supx∈[0,1]d |f(x)| for f ∈ C([0, 1]d, R). This theorem means the following. • We have N1 ⊂ C([0, 1]d, R), which means that neural network functions are continuous. • For all f ∈ C([0, 1]d, R), for all 0, there exist N ∈ N, v1, . . . , vN ∈ R, b1, . . . , bN ∈ R and w1, . . . , wN ∈ Rd such that sup x∈[0,1]d
  • 62. ≤ . • Equivalently, for all f ∈ C([0, 1]d, R), for all 0, there exists g ∈ N1 such that ||f − g||∞ ≤ . • Equivalently, for all f ∈ C([0, 1]d, R), there exists a sequence (gn)n∈N such that gn ∈ N1 for n ∈ N and ||f − gn||∞ → 0 as n → ∞. • This theorem is comforting for the approximation error inf f∈F E (f(X) − f? (X))2 in regression with F = N1. Indeed, this term is equal to zero if x 7→ f?(x) (conditional expectation in regression) is a continuous function on [0, 1]d. • Furthermore, if we let, for N ∈ N, N1,N = ( x ∈ [0, 1]d 7→ N X i=1 viσ(hwi, xi + bi); v1, . . . , vN ∈ R, b1, . . . , bN ∈ R, w1, . . . , wN ∈ Rd ) (set of neural networks with N neurons), then we remark that N1,k ⊂ N1,k+1 for k ∈ N. The proof of this inclusion is left as an exercize, one can for instance construct a neural network with k + 1 neurons and vk+1 = 0 to obtain the function of a neural network with k neurons. Hence, we have that inff∈N1,N ||f−f?||∞ is decreasing with N. Hence, from Theorem 5, inff∈N1,N ||f−f?||∞ → 0 as N → ∞ (left as an exercize). Hence, since E(g(X)2) ≤ ||g||2 ∞ for g : [0, 1]d → R, we obtain inf f∈N1,N E (f(X) − f? (X))2 → N→∞ 0. Hence if we minimize the empirical risk with neural networks with N neurons (N large), the approximation error will be small. 2.2 Sketch of the proof The proof is by contradiction (it is non constructive). For f ∈ C([0, 1]d, R) we will not exhibit a neural network that is close to f. Step 1 We assume that there exists f0 ∈ C([0, 1]d, R)N1. Here we write N1 for the closure of N1, which means that f ∈ N1 ⇐⇒ there exists (gN )N∈N with gN ∈ N1 for N ∈ N such that ||gN − f0||∞ → N→∞ 0. 10
  • 63. Step 2 We apply the Hahn-Banach theorem to construct a continuous linear map L : C([0, 1]d , R) → C such that L(f0) = 1 and L(f) = 0 for all f ∈ N1. • Linear means that for g1, g2 ∈ C([0, 1]d, R) and for α1, α2 ∈ R, we have L(α1g1 + α2g2) = α1L(g1) + α2L(g2). • Continuous means that for g ∈ C([0, 1]d, R) and for a sequence (gn)n∈N with gn ∈ C([0, 1]d, R) for n ∈ N and such that ||gn − g||∞ → 0 as n → ∞, we have L(gn) → L(g) as n → ∞. Step 3 We then use the Riesz representation theorem. There exists a complex-valued Borel measure µ on [0, 1]d such that L(f) = Z [0,1]d fdµ for f ∈ C([0, 1]d, R), where the above integral is a Lebesgue integral. That µ is a complex-valued Borel measure on [0, 1]d means that, with B the Borel sigma algebra (the measurable subsets of [0, 1]d), we have µ : B → C. Furthermore, for E ∈ B such that E = ∪∞ i=1Ei with E1, E2, . . . ∈ B, with Ei ∩ Ej = ∅ for i 6= j, we have µ(E) = ∞ X i=1 µ(Ei), where µ(Ei) ∈ C and P∞ i=1 |µ(Ei)| ∞. Step 4 We show that R [0,1]d fdµ for all f ∈ N1 implies that µ = 0, which is a contradiction to L(f0) = R [0,1]d f0dµ = 1 and concludes the proof. Remark 6 The steps 1, 2 and 3 could be carried out with N1 replaced by other function spaces F. These steps are actually classical in approximation theory. The step 4 is on the contrary specific to neural networks with one hidden layer. 2.3 Complete proof Let f ∈ N1. Then there exist N ∈ N, v1, . . . , vN ∈ R, b1, . . . , bN ∈ R, w1, . . . , wN ∈ Rd such that f : x ∈ [0, 1]d 7→ N X i=1 viσ(hwi, xi + bi). Since σ is continuous, f is continuous as a sum and composition of continuous functions. Hence N1 ⊂ C([0, 1]d, R). Let us now assume that N1 6= C([0, 1]d, R). Thus, let f0 ∈ C([0, 1]d, R)N1. We then apply a version of the Hahn-Banach theorem. Theorem 7 There exists a continuous linear map L : C([0, 1]d , R) → C such that L(f0) = 1 and L(f) = 0 for all f ∈ N1. 11
  • 64. The above theorem holds because f0 6∈ N1, see for instance [Rud98][Chapters 3 and 6]. We then apply a version of the Riesz representation theorem. Theorem 8 There exists a complex-valued Borel measure µ on [0, 1]d such that L(f) = Z [0,1]d fdµ for f ∈ C([0, 1]d, R). We have seen that µ : B → C where, for B ∈ B, we have B ⊂ [0, 1]d. Furthermore, we can defined the total variation measure |µ| defined by |µ|(E) = sup ∞ X i=1 |µ(Ei)|, E ∈ B where the supremum is over the set of all the (Ei)i∈N, with Ei ∈ B for i ∈ N and Ei ∩ Ej = ∅ for i 6= j and E = ∪∞ i=1Ei. Then |µ| : B → [0, ∞) and |µ| has finite mass, |µ|([0, 1]d) ∞. Finally, there exists h : [0, 1]d → C, measurable, such that h(x) = 1 for x ∈ [0, 1]d and dµ = hd|µ| which means that for B ∈ B, µ(B) = Z B hd|µ| = Z B h(x)d|µ|(x) and we have a more classical Lebesgue integral with a function h that corresponds to a density. The above theorem is also given in [Rud98]. We now want to show that µ = 0, which means that µ(B) = 0 for B ∈ B. Since L(f) = 0 for all f ∈ C([0, 1]d, R), we have, for all N ∈ N, v1, . . . , vN ∈ R, b1, . . . , bN ∈ R, w1, . . . , wN ∈ Rd, with f = N X i=1 vifi, where for i = 1, . . . , N, fi : [0, 1]d → R is defined by, for x ∈ [0, 1]d, fi(x) = σ(hwi, xi + bi), we have L(f) = 0. Hence, since L is linear N X i=1 viL(fi) = 0. Specifically, we can choose v1 = 1 and v2 = · · · = vN = 0 to obtain, for all w ∈ Rd, for all b ∈ R, L(f) = 0, with f : [0, 1]d → R defined by, for x ∈ [0, 1]d, f(x) = σ(hw, xi + b). This gives, for all w ∈ Rd, for all b ∈ R, L(f) = Z [0,1]d f(x)dµ(x) = 0 and thus Z [0,1]d σ(hw, xi + b)h(x)d|µ|(x) = 0. 12
  • 65. Let w ∈ Rd and b, φ ∈ R, λ 0. Let x ∈ Rd. We let σλ,φ(x) = σ (λ (hw, xi + b) + φ) . Then if hw, xi + b 0, since σ(t) → 1 as t → +∞, we have σ (λ (hw, xi + b) + φ) → λ→+∞ 1. If hw, xi + b 0, since σ(t) → 0 as t → −∞, we have σ (λ (hw, xi + b) + φ) → λ→+∞ 0. If hw, xi + b = 0, then we have σ (λ (hw, xi + b) + φ) = σ(φ). Hence, we have shown that σλ,φ → λ→+∞ γ(x) :=      1 if hw, xi + b 0 0 if hw, xi + b 0 σ(φ) if hw, xi + b = 0 . Furthermore, for x ∈ [0, 1]d, σλ,φ(x) = σ (λ (hw, xi + b) + φ) = σ (hλw, xi + λb + φ) and thus σλ,φ ∈ N1 (it is a neural network function). Hence Z [0,1]d σλ,φ(x)h(x)d|µ|(x) = 0. Furthermore, sup λ0 sup x∈[0,1]d |σλ,φ(x)| ≤ sup t∈R σ(t) = ||σ||∞ ∞, as σ is continuous and has finite limits at ±∞. We recall that |h(x)| = 1 for all x ∈ [0, 1]d and thus Z [0,1]d sup λ0 |σλ,φ(x)||h(x)|d|µ|(x) ≤ sup t∈R σ(t) Z [0,1]d d|µ|(x) = sup t∈R σ(t) |µ|([0, 1]d ) ∞. Hence we can apply the dominated convergence theorem, 0 = Z [0,1]d σλ,φ(x)h(x)d|µ|(x) → λ→+∞ Z [0,1]d γ(x)h(x)d|µ|(x) = Z [0,1]d 1hw,xi+b0 + σ(φ)1hw,xi+b=0 h(x)d|µ|(x). We let Πw,b = n x ∈ [0, 1]d : hw, xi + b = 0 o and Hw,b = n x ∈ [0, 1]d : hw, xi + b 0 o for w ∈ Rd and b ∈ R. We then obtain Z Hw,b h(x)d|µ|(x) + σ(φ) Z Πw,b h(x)d|µ|(x) = 0 and thus µ(Hw,b) + σ(φ)µ(Πw,b) = 0. 13
  • 66. Since σ is not constant, we can take φ1, φ2 ∈ R with σ(φ1) 6= σ(φ2) and thus 1 σ(φ1) 1 σ(φ2) µ(Hw,b) µ(Πw,b) = 0 0 and the determinant of the above matrix is σ(φ1) − σ(φ2) 6= 0. Hence, for all w ∈ Rd and b ∈ R, µ( Πw,b |{z} hyperplane ) = µ( Hw,b |{z} half space ) = 0. Let w ∈ Rd. We write ||w||1 = Pd i=1 |wi|. For a bounded g : [−||w||1, ||w||1] → C (not necessarily continuous), we let ψ(g) = Z [0,1]d g(hw, xi)dµ(x). We remark that |hw, xi| =
  • 76. ≤ d X i=1 |wi| = ||w||1. We observe that ψ is linear, for any bounded g1, g2 : [−||w||1, ||w||1] → C, for any α1, α2 ∈ R, we have ψ(α1g1 + α2g2) = Z [0,1]d (α1g1(hw, xi) + α2g2(hw, xi)) dµ(x) = α1 Z [0,1]d g1(hw, xi)dµ(x) + α2 Z [0,1]d g2(hw, xi)dµ(x) = α1ψ(g1) + α2ψ(g2). Furthermore, we have a continuity property of ψ of the form: |ψ(g1) − ψ(g2)| =
  • 86. =
  • 91. Z [0,1]d (g1 − g2)(hw, xi)h(x)d|µ|(x)
  • 96. ≤ Z [0,1]d |(g1 − g2)(hw, xi)| |h(x)|d|µ|(x) ≤ ||g1 − g2||∞ Z [0,1]d |h(x)|d|µ|(x), with ||g1 − g2||∞ = supt∈[−||w||1,||w||1] |g1(t) − g2(t)|. Hence we have |ψ(g1) − ψ(g2)| ≤ ||g1 − g2||∞ Z [0,1]d d|µ|(x) = ||g1 − g2||∞ |µ|([0, 1]d ) | {z } ∞ , which is a Lipschitz property (stronger than continuity). The, for θ ∈ R and g : [−||w||1, ||w||1] → R defined by g(t) = 1t∈[θ,+∞) 14
  • 97. for t ∈ [−||w||1, ||w||1], we have ψ(g) = Z [0,1]d 1hw,xi∈[θ,+∞)dµ(x) = Z [0,1]d 1hw,xi−θ≥0dµ(x) = Z [0,1]d 1hw,xi−θ0dµ(x) + Z [0,1]d 1hw,xi−θ=0dµ(x) = Z Hw,−θ dµ(x) + Z Πw,−θ dµ(x) = µ(Hw,−θ) + µ(Πw,−θ) = 0, from what we have seen before. For g defined on [−||w||1, ||w||1] valued in C, defined by g(t) = 1t∈(θ,+∞) for t ∈ [−||w||1, ||w||1], we also have ψ(g) = Z [0,1]d 1hw,xi−θ0dµ(x) = µ(Hw,−θ) = 0. Hence, • with 1[θ1,θ2] : [−||w||1, ||w||1] → R defined by, for t ∈ [−||w||1, ||w||1], 1[θ1,θ2](t) = 1t∈[θ1,θ2] = 1θ1≤t≤θ2 , • with 1[θ1,θ2) : [−||w||1, ||w||1] → R defined by, for t ∈ [−||w||1, ||w||1], 1[θ1,θ2)(t) = 1t∈[θ1,θ2) = 1θ1≤tθ2 , • with 1(θ1,θ2] : [−||w||1, ||w||1] → R defined by, for t ∈ [−||w||1, ||w||1], 1(θ1,θ2](t) = 1t∈(θ1,θ2] = 1θ1t≤θ2 , we have ψ(1[θ1,θ2]) = ψ(1[θ1,+∞) − 1(θ2,+∞)), with 1[θ1,+∞)(t) = 1t≥θ1 and 1(θ2,+∞)(t) = 1tθ2 (for t ∈ [−||w||1, ||w||1]). Hence ψ(1[θ1,θ2]) = ψ(1[θ1,+∞)) − ψ(1(θ2,+∞)) = 0 − 0 = 0, from what we have seen before. Also ψ(1[θ1,θ2)) = ψ(1[θ1,+∞)) − ψ(1[θ2,+∞)) = 0 − 0 = 0 15
  • 98. and ψ(1(θ1,θ2]) = ψ(1(θ1,+∞)) − ψ(1(θ2,+∞)) = 0 − 0 = 0. Now let us write r : [−||w||1, ||w||1] → C defined by r(t) = eit = cos(t) + i sin(t), with i2 = −1 and for t ∈ [−||w||1, ||w||1]. Let us also write, for k ∈ N and t ∈ [−||w||1, ||w||1], rk(t) = 1t=−||w||1 r(−||w||1) + k−1 X j=−k 1( j||w||1 k , (j+1)||w||1 k ] (t)r j||w||1 k . Then sup t∈[−||w||1,||w||1] |rk(t) − r(t)| ≤ sup x,y∈[−||w||1,||w||1] |x−y|≤ ||w||1 k |r(x) − r(y)| → k→∞ 0, since r is uniformly continuous (or even Lipschitz) on [−||w||1, ||w||1]. Hence, with the continuity property that we have seen, ψ(r) = lim k→∞ ψ(rk) = 0, since ψ(rk) = 0 for k ∈ N from what we have seen before. Hence, we have shown that for any w ∈ Rd, Z [0,1]d eihw,xi dµ(x) = 0. We see the Fourier transform of the measure µ. This implies that µ is the zero measure (which can be shown by technical arguments which are not specific to neural networks). Hence L(f0) = Z [0,1]d f0(x)dµ(x) = 0 which is a contradiction with L(f0) = 1 and concludes the proof of Theorem 5. There are two main take home messages. • The density result N1 = C([0, 1]d, R). • The non-constructive proof technique, by contradiction. The use of the Hahn-Banach theorem to prove a density result is standard. 3 Complements on the generalization error in classification and VC- dimension 3.1 Shattering coefficients We consider a pair of random variables (X, Y ) on [0, 1]d × {0, 1}. We consider (X1, Y1), . . . , (Xn, Yn) independent, with the same distribution as (X, Y ) and independent of (X, Y ). We consider a set F of functions from [0, 1]d to {0, 1}. Then, we have seen in Section 1.2 that the generalization error is E sup f∈F
  • 103. P (f(X) 6= Y ) − 1 n n X i=1 1f(Xi)6=Yi
  • 108. ! . We have seen that, intuitively, the larger F is, the larger this generalization error is. A measure of the “size” or “complexity” of F is given by the following definition. 16
  • 109. Definition 9 We call shattering coefficient of F (at n for n ∈ N) the quantity ΠF (n) = max x1,...,xn∈[0,1]d card {(f(x1), . . . , f(xn)) ; f ∈ F} . We observe that ΠF (n) is increasing with respect to F and n, • if F1 ⊂ F2, then ΠF1 (n) ≤ ΠF2 (n), • ΠF (n) ≤ ΠF (n + 1). Example Let d = 1 and F = {x ∈ [0, 1] 7→ 1x≥a; a ∈ R} . Then for any 0 ≤ x1 ≤ · · · ≤ xn ≤ 1 and for f ∈ F, we have (f(x1), . . . , f(xn)) = (0, . . . , 0, 1, . . . , 1) , where • if a xn then there are only 0s, • if a ≤ x1 then here are only 1s, • if x1 a ≤ xn then the first 1 is at position i ∈ {2, . . . , n} with xi−1 0 and xi ≥ a. Hence the vectors that we can obtain are (0, . . . , 0), (0, . . . , 1), (0, . . . , 1, 1), . . . , (1, . . . , 1). Hence there are n + 1 possibilities. Hence card {(f(x1), . . . , f(xn)) ; f ∈ F} ≤ n + 1 and thus ΠF (n) ≤ n + 1. Furthermore, with x1 = 0, x2 = 1/n, . . . , xn = (n − 1)/n, • with f given by x 7→ 1x≥2 we have (f(x1), . . . , f(xn)) = (0, . . . , 0), • with f given by x 7→ 1x≥−1 we have (f(x1), . . . , f(xn)) = (1, . . . , 1), • for i ∈ {1, . . . , n − 1}, with f given by x 7→ 1x≥(xi+xi+1)/2 we have (f(x1), . . . , f(xn)) = (0, . . . , 0, 1, . . . , 1) with i 0s and n − i 1s. Hence with x1 = 0, x2 = 1/n, . . . , xn = (n − 1)/n we have card {(f(x1), . . . , f(xn)) ; f ∈ F} ≥ n + 1. Hence finally ΠF (n) ≥ n + 1 and thus ΠF (n) = n + 1. Example Let d = 2 and F = x ∈ [0, 1]2 7→ 1hw,xi≥a; a ∈ R; w ∈ R2 . These are affine classifiers as in Figure 4. 17
  • 110. Figure 4: An example of an affine classifier. Then for n = 3 and for x1, x2, x3 ∈ [0, 1]2 that are not contained in a line, we can obtain the 8 possible classification vectors, as shown in Figure 5. 18
  • 111. Figure 5: Obtaining the 8 possible classification vectors with 3 points and affine classifiers. Hence card {(f(x1), f(x2), f(x3)) ; f ∈ F} ≥ 8. Also card {(f(x1), f(x2), f(x3)) ; f ∈ F} ≤ card (i1, i2, i3) ∈ {0, 1}3 = 23 = 8. Hence we have ΠF (3) = 8. Remark 10 We always have ΠF (n) ≤ card {(i1, . . . , in) ∈ {0, 1}n } = 2n . Remark 11 If F is a finite set, ΠF (n) ≤ card(F). 3.2 Bounding the generalization error from the shattering coefficients The next proposition enables to bound the generalization error from ΠF (n). 19
  • 112. Proposition 12 For any set F of functions from [0, 1]d to {0, 1}, we have E sup f∈F
  • 117. P (f(X) 6= Y ) − 1 n n X i=1 1f(Xi)6=Yi
  • 122. ! ≤ 2 r 2 log(2ΠF(n)) n . Remarks • The notation log stands for the Neper base e logarithm. • We see a dependence in 1/ √ n, which is classical when empirical means are compared with expectations. • If card(F) = 1 with F = {f} then E sup f∈F
  • 127. P (f(X) 6= Y ) − 1 n n X i=1 1f(Xi)6=Yi
  • 132. ! = E
  • 142. ! ≤ v u u u tE   E 1f(X)6=Y − 1 n n X i=1 1f(Xi)6=Yi !2   = v u u tVar 1 n n X i=1 1f(Xi)6=Yi ! = v u u t 1 n2 Var n X i=1 1f(Xi)6=Yi ! = r 1 n2 nVar 1f(X)6=Y = 1 √ n Var 1f(X)6=Y ≤ 1 √ n . In the second inequality above, we have used Jensen’s inequality which implies that E(|W|) ≤ p Var(W) for a random variable W. In the second equality above we have used that (X1, Y1), . . . , (Xn, Yn) are independent and distributed as (X, Y ). The first inequality above holds because Var 1f(X)6=Y ≤ E 1f(X)6=Y 2 ≤ E(1) = 1. On the other hand the upper bound of Proposition 12 is 2 r 2 log(2) n = 2 p 2 log(2) | {z } ≈2.35 1 √ n . We obtain the same order of magnitude 1/ √ n. • In all cases, we have ΠF (n) ≤ 2n and thus the bound of Proposition 12 is smaller than 2 r 2 log(2 × 2n) n = 2 r 2 log(2n+1) n = 2 r 2(n + 1) log(2) n = 2 s 2 log(2) 1 + 1 n → n→∞ 2 p 2 log(2). 20
  • 143. This bound based on ΠF (n) ≤ 2n is not informative because we already know that E       sup f∈F
  • 153. P (f(X) 6= Y ) | {z } ∈[0,1] − 1 n n X i=1 1f(Xi)6=Yi | {z } ∈[0,1]
  • 163.       ≤ E sup f∈F 1 ! = 1. To summarize • The bound of Proposition 12 agrees in terms of order of magnitudes with the two extreme cases card(F) = 1 (then ΠF (n) = 1) and ΠF (n) ≤ 2n. • This bound will be particularly useful when ΠF (n) is in between these two cases. Proof of Proposition 12 The proof is based on a classical argument that is called symmetrization. Without loss of generality, we can assume that Y ∈ {−1, 1} and F is composed of functions from [0, 1]d to {−1, 1} (the choice of 0 and 1 to define the two classes is arbitrary in classification, here −1 and 1 will be more convenient). We let (X̃1, Ỹ1), . . . , (X̃n, Ỹn) be pairs of random variables such that (X1, Y1), . . . , (Xn, Yn), (X̃1, Ỹ1), . . . , (X̃n, Ỹn) are independent and with the same distribution as (X, Y ). Then P (f(X) 6= Y ) = Ẽ 1 n n X i=1 1f(X̃i)6=Ỹi ! , writing Ẽ to indicate that the expectation is taken with respect to (X̃1, Ỹ1), . . . , (X̃n, Ỹn). We let ∆n = E sup f∈F
  • 168. P (f(X) 6= Y ) − 1 n n X i=1 1f(Xi)6=Yi
  • 173. ! . We have ∆n = E sup f∈F
  • 213. ! . Let now σ1, . . . , σn be independent random variables, independent from (Xi, Yi, X̃i, Ỹi)i=1,...,n and such that Pσ(σi = 1) = Pσ(σi = −1) = 1 2 for i = 1, . . . , n and by writing Eσ and Pσ the expectations and probabilities with respect to σ1, . . . , σn. We let for i = 1, . . . , n (X̄i, Ȳi) = ( (Xi, Yi) if σi = 1 (X̃i, Ỹi) if σi = −1 and ( ¯ X̄i, ¯ Ȳi) = ( (X̃i, Ỹi) if σi = 1 (Xi, Yi) if σi = −1 . 21
  • 214. Then (X̄i, Ȳi)i=1,...,n, ( ¯ X̄i, ¯ Ȳi)i=1,...,n are independent and have the same distribution as (X, Y ). Let us show this. For any bounded measurable functions g1, . . . , g2n from [0, 1]d × {0, 1} to R, we have, using the law of total expectation, E n Y i=1 gi(X̄i, Ȳi) ! n Y i=1 gi( ¯ X̄i, ¯ Ȳi) !! = E E n Y i=1 gi(X̄i, Ȳi) ! n Y i=1 gi( ¯ X̄i, ¯ Ȳi) !
  • 219. σ1, . . . , σn !! = E    E        n Y i=1 σi=1 gi(Xi, Yi)       n Y i=1 σi=−1 gi(X̃i, Ỹi)        n Y j=1 σj=1 gn+j(X̃j, Ỹj)         n Y j=1 σj=−1 gn+j(Xj, Yj)    
  • 227. σ1, . . . , σn         . In the above conditional expectation, the 2n variables are independent since each of the (Xi, Yi)i=1,...,n, (Xi, Yi)i=1,..., appears exactly once. Their common distribution is that of (X, Y ). Furthermore, the 2n functions g1, . . . , g2n appear once each. Hence we have E n Y i=1 gi(X̄i, Ȳi) ! n Y i=1 gi( ¯ X̄i, ¯ Ȳi) !! = E 2n Y i=1 E (gi(X, Y )) ! = 2n Y i=1 E (gi(X, Y )) . Hence, indeed, (X̄i, Ȳi)i=1,...,n, ( ¯ X̄i, ¯ Ȳi)i=1,...,n are independent and have the same distribution as (X, Y ). Hence, we have ∆n ≤ EẼEσ sup f∈F
  • 237. ! , where Eσ means that only σ1, . . . σn are random. We observe that 1f(X̄i)6=Ȳi − 1( ¯ X̄i)6= ¯ Ȳi = σi 1f(Xi)6=Yi − 1f(X̃i)6=Ỹi because • if σi = 1, (X̄i, Ȳi, ¯ X̄i, ¯ Ȳi) = (Xi, Yi, X̃i, Ỹi), • if σi = −1, (X̄i, Ȳi, ¯ X̄i, ¯ Ȳi) = (X̃i, Ỹi, Xi, Yi). Hence ∆n ≤ EẼEσ sup f∈F
  • 287. ! . For any y1, . . . , yn ∈ {−1, 1} and x1, . . . , xn ∈ [0, 1]d we define the set VF (x, y) = 1yi6=f(xi), . . . , 1yn6=f(xn) ; f ∈ F . Then we have ∆n ≤ 2 n max y1,...,yn∈{−1,1} max x1,...,xn∈[0,1]d Eσ sup v∈VF (x,y) |hσ, vi| ! , with σ = (σ1, . . . , σn). 22
  • 288. We observe that there is a bijection between VF (x, y) and {(f(x1), . . . , f(xn)); f ∈ F}. Hence max y1,...,yn∈{−1,1} max x1,...,xn∈[0,1]d card(VF (x, y)) ≤ ΠF (n). Assume that we show For any set V ⊂ {−1, 0, 1}n , Eσ sup v∈V |hσ, vi| ≤ p 2n log(2card(V )). (1) Then we would have ∆n ≤ 2 n p 2n log(2ΠF (n)) = 2 r 2 n log(2ΠF (n)) which would conclude the proof. Let us now show (1). Let us write −V = {−v; v ∈ V } and V # = V ∪ −V . We have, for any s 0, Eσ sup v∈V |hσ, vi| ≤ Eσ sup v∈V # |hσ, vi| ! = Eσ 1 s log es supv∈V # |hσ,vi| . We now apply Jensen inequality to the concave function (1/s) log. This gives Eσ sup v∈V |hσ, vi| ≤ 1 s log Eσ es supv∈V # |hσ,vi| = 1 s log Eσ sup v∈V # es|hσ,vi| !! ≤ 1 s log  Eσ   X v∈V # es|hσ,vi|     = 1 s log   X v∈V # Eσ es|hσ,vi|   = 1 s log   X v∈V # n Y i=1 Eσ (esσivi )   = 1 s log   X v∈V # n Y i=1 1 2 esvi + e−svi   . We can show simply that for x ≥ 0, ex + e−x ≤ 2ex2/2. This gives, using also that v2 i ≤ 1 for i = 1, . . . , n and v ∈ V , Eσ sup v∈V |hσ, vi| ≤ 1 s log   X v∈V # n Y i=1 e s2v2 i 2   ≤ 1 s log   X v∈V # e ns2 2   ≤ 1 s log card(V # )e ns2 2 = log(card(V #)) s + ns 2 . We let s = r 2 log(card(V #)) n 23
  • 289. which gives Eσ sup v∈V |hσ, vi| ≤ 1 √ 2 q n log(card(V #)) + 1 √ 2 q n log(card(V #)) = 2 r n 2 log(card(V #)) = q 2n log(card(V #)) ≤ p 2n log(2card(V )). Hence (1) is proved and thus the proof of Proposition 12 is concluded. 3.3 VC-dimension From the previous proposition, the shattering coefficient ΠF (n) is important and we would like to quantify its growth as n grows. A tool for this is the Vapnik-Cherbonenkis dimension, that we will call VC-dimension. Definition 13 For a set F of functions from [0, 1]d to {0, 1}, we write VCdim(F) and call VC- dimension the quantity VCdim(F) = sup {m ∈ N; ΠF (m) = 2m } with the convention ΠF (0) = 1 so that VCdim(F) ≥ 0. It is possible that VCdim(F) = +∞. Interpretation The quantity VCdim(F) is the largest number of input points that can be “shattered”, meaning that they can be classified in all possible ways by varying the classifier in F. Examples • When F = {all the functions from [0, 1]d to {0, 1}} then VCdim(F) = +∞. Indeed, for any n ∈ N, by considering x1, . . . , xn two-by-two distinct, we have ΠF (n) = 2n. • When F is finite with card(F) ≤ 2m0 then VCdim(F) ≤ m0. Indeed, for m m0, we have seen that ΠF (m) ≤ card(F) ≤ 2m0 ≤ 2m. Hence m 6∈ {m ∈ N; ΠF (m) = 2m } and thus sup {m ∈ N; ΠF (m) = 2m } ≤ m0. Remark 14 If VCdim(F) = V ∞ then for i = 1, . . . , V , ΠF (i) = 2i. Proof of Remark 14 Since ΠF (V ) = 2V , there exist x1, . . . , xV ∈ [0, 1]d such that card {(f(x1), . . . , f(xn)) ; f ∈ F} = 2V . This means that we obtain all the possible vectors with components in {0, 1} and thus we obtain all the possible subvectors for the i first coefficients for i = 1, . . . , V . Hence card {(f(x1), . . . , f(xi)) ; f ∈ F} = 2i . and thus ΠF (i) = 2i. Similarly if for i0 ∈ N, ΠF (i0) = 2i0 then for all i = 1, . . . , i0, ΠF (i) = 2i. We can compute the VC-dimension in the case of linear and affine classifiers. 24
  • 290. Proposition 15 Let d ∈ N. Let Fd,l = n x ∈ [0, 1]d 7→ 1hw,xi≥0; w ∈ Rd o and Fd,a = n x ∈ [0, 1]d 7→ 1hw,xi+a≥0; w ∈ Rd , a ∈ R o . Then VCdim(Fd,l) = d and VCdim(Fd,a) = d + 1. Remark 16 The VC-dimension coincides here with the number of free parameters and thus with the usual notion of dimension. Proof of Proposition 15 Write x1 =      1 0 . . . 0      , x2 =        0 1 0 . . . 0        , . . . , xd =      0 . . . 0 1      in Rd. Then for any y1, . . . , yd ∈ {0, 1} write zi = ( 1 if yi = 1 −1 if yi = 0 . Consider x 7→ 1hx, Pd j=1 zjxji≥0. Then for k = 1, . . . , d, 1hxk, Pd j=1 zjxji≥0 = 1hxk,zkxki≥0 = 1zk≥0 = yk. Hence we reach all the elements of {0, 1}d. Hence ΠFd,l (d) = 2d and thus VCdim(Fd,l) ≥ d. Assume that VCdim(Fd,l) ≥ d + 1. Then, from Remark 14, ΠFd,l (d+1) = 2d+1. Hence, there exists x1, . . . , xd+1 ∈ [0, 1]d and w1, . . . , w2d+1 ∈ Rd such that    w i x1 . . . w i xd+1    , for i = 1, . . . , 2d+1 take all possible sign vectors ( 0 or ≥ 0). We write X =    x 1 . . . x d+1    of dimension (d + 1) × d and W = w1 . . . w2d+1 25
  • 291. of dimension d × 2d+1. Then XW =    x 1 w1 . . . x 1 w2d+1 . . . . . . . . . x d+1w1 . . . x d+1w2d+1    is of dimension (d + 1) × 2d+1. Let us show that the d + 1 rows of XW are linearly independent. Let a of size (d + 1) × 1, non-zero such that a XW = 0, where the above display is a linear combination of the rows of XW. Then, for k ∈ {1, . . . , 2d+1}, (a XW)k = a Xwk = a    x 1 wk . . . x d+1wk    . Let k such that for i = 1, . . . , d + 1, ai ≥ 0 if and only if x i wk ≥ 0 (k exists since we reach all the possible sign vectors). Then (a XW)k = d+1 X i=1 ai(x i wk) | {z } same signs = d+1 X i=1 |ai||x i wk|. Since a is non-zero we can assume that there is a j such that aj 0 (up to replacing a by −a at the beginning). Then (a XW)k ≥ |aj|(x j wk) 0, since x j wk 0 and aj 0. This is a contradiction. Hence there does no exist a of size (d + 1) × 1, non-zero such that aXW = 0. Hence the d + 1 lines of XW are linearly independent. Hence the rank of XW is larger or equal to d + 1. But the rank of XW is smaller or equal to d because X is of dimension (d + 1) × d. Hence we have reached a contradiction and thus VCdim(Fd,l) d + 1. Hence VCdim(Fd,l) = d. Let us now consider Fd,a. Let x1 =      1 0 . . . 0      , x2 =        0 1 0 . . . 0        , . . . , xd =      0 . . . 0 1      and xd+1 =    0 . . . 0    , in Rd. Then, for any y1, . . . , yn ∈ {0, 1}, write for i = 1, . . . , d + 1, zi = ( 1 if yi = 1 −1 if yi = 0. . Consider the function x ∈ [0, 1]d 7→ 1hx, Pd j=1(zj−zd+1)xji≥−zd+1 . 26
  • 292. Then for k = 1, . . . , d, 1hxk, Pd j=1(zj−zd+1)xji≥−zd+1 = 1hxk,(zk−zd+1)xki≥−zd+1 = 1zk−zd+1≥−zd+1 = 1zk≥0 = yk. and 1hxd+1, Pd j=1(zj−zd+1)xji≥−zd+1 = 10≥−zd+1 = 1zd+1≥0 = yd+1. Hence we reach the 2d+1 possible vectors and thus VCdim(Fd,a) ≥ d + 1. Assume now that VCdim(Fd,a) ≥ d + 2. Then, as seen previously, ΠFd,a (d + 2) = 2d+2 . Hence there exists x1, . . . , xd+2 ∈ Rd such that for all y1, . . . , yd+2 ∈ {0, 1}, there exists w ∈ Rd and b ∈ R such that, for k = 1, . . . , d + 2, 1hw,xki+b≥0 = yk. We write x̄i = xi 1 of size (d + 1) × 1 for i = 1, . . . , d + 2 and w̄ = w b of size (d + 1) × 1. Then, for k = 1, . . . , d + 2, 1hw̄,x̄ki≥0 = 1hw,xki+b≥0 = yk. Hence in Rd+2 we have shattered d + 2 vectors x̄1, . . . , x̄d+2 (we have obtained all the possible sign vectors) with linear classifiers. This implies VCdim(Fd+1,l) ≥ d + 2 which is false since we have shown above that VCdim(Fd+1,l) = d + 1. Hence we have VCdim(Fd,a) d + 2. Hence VCdim(Fd,a) = d + 1. 3.4 Bounding the shattering coefficients from the VC-dimension From the next lemma, we can bound the shattering coefficients from bounds on the VC-dimension. Lemma 17 (Sauer lemma) Let F be a non-empty set of functions from [0, 1]d to {0, 1}. Assume that VCdim(F) ∞. Then we have, for n ∈ N, ΠF (n) ≤ VCdim(F) X i=0 n i ≤ (n + 1)VCdim(F) , with n i = ( n! i!(n−i)! if i ∈ {0, . . . , n} 0 if i n . 27
  • 293. Proof of Lemma 17 For any set A, with H a non-empty set of functions from A to 0, 1, we can define ΠH(n) and VCdim(H) in the same way as when A = [0, 1]d. Let us show For any set A, for any set H of functions from A to R: (2) ΠH(k) ≤ vH X i=0 k i , for k = 1, . . . , n with VH = VCdim(H). We will show (2) by induction on k. Let us show it for k = 1. If VH = 0 then ΠH(1) 21 = 2. Hence ΠH(1) ≤ 1 = 1 0 = 0 X i=0 k i . Hence (2) is proved for k = 1 and VH = 0. If VH ≥ 1 we have ΠH(1) = 21 = 2 = 1 0 + 1 1 ≤ VH X i=0 k i . Hence eventually (2) is true for k = 1. Assume now that (2) is true for any k from 1 to n − 1. If VH = 0 then there does not exist any x ∈ A and h1, h2 ∈ H such that h1(x) = 0 and h2(x) = 1 because for all x ∈ A, card{h(x); h ∈ H} 21. Hence for all x1, . . . , xn ∈ A, card {(h(x1), . . . , h(xn)); h ∈ H} = 1. Hence ΠH(n) = 1 = 0 X i=0 n i . It thus remains to address the case VH ≥ 1. For x1, . . . , xn ∈ A, define H(x1, . . . , xn) = {(h(x1), . . . , h(xn)); h ∈ H}. There exist x1, . . . , xn ∈ A such that card(H(x1, . . . , xn)) = ΠH(n). The set H(x1, . . . , xn) only depend on the values of the functions in H on {x1, . . . , xn}. Hence, replacing • A by A = {x1, . . . , xn}, • H by H0 = h0 : {x1, . . . , xn} → {0, 1}; there exists h ∈ H such that h0 (xi) = h(xi) for i = 1, . . . , n , we have ΠH(n) = ΠH0 (n). Hence in the sequel we assume that A = {x1, . . . , xn} and H is a set of functions from {x1, . . . , xn} to {0, 1}, without loss of generality. Let us consider the set H0 = h ∈ H; h(xn) = 1 and h0 = h − 1{xn} ∈ H , composed of the functions that are equal to 1 at xn and that stay in H if their value at xn is replaced by 0. Notice that we have written 1{xn} : {x1, . . . , xn} → {0, 1} defined by 1{xn}(xi) = 1xn=xi for i = 1, . . . , n. 28
  • 294. We use the notation, for a set G of functions from {x1, . . . , xn} to {0, 1}, and {xi1 , . . . , xiq } ⊂ {x1, . . . , xn}, G(xi1 , . . . , xiq ) = {(g(xi1 ), . . . , g(xiq )); g ∈ G}. We have H(x1, . . . , xn) = H0 (x1, . . . , xn) ∪ (HH0 )(x1, . . . , xn) and thus cardH(x1, . . . , xn) ≤ cardH0 (x1, . . . , xn) + card(HH0 )(x1, . . . , xn). Step 1: bounding cardH0(x1, . . . , xn) We observe that cardH0 (x1, . . . , xn) = cardH0 (x1, . . . , xn−1) because h(xn) = 1 for h ∈ H0. If q ∈ N is such that there exists {xi1 , . . . , xiq } ⊂ {x1, . . . , xn} with card{xi1 , . . . , xiq } = 2q then xn 6∈ {xi1 , . . . , xiq } (because h(xn) = 1 for h ∈ H0). Also, we have card{xi1 , . . . , xiq , xn} = 2q+1 because 2q+1 = card ({0, 1}q × {0, 1}) = card (h(xi1 , . . . , xiq , h(xn))); h ∈ H by definition of H0. Hence VH ≥ q + 1 and thus VH ≥ VH0 + 1 (since q can be taken as VH0 ). Hence VH0 ≤ VH − 1. Hence, we have, applying (2) with k = n − 1, cardH0 (x1, . . . , xn) = cardH0 (x1, . . . , xn−1) ≤ ΠH0 (n − 1) ≤ VH0 X i=0 n − 1 i ≤ VH −1 X i=0 n − 1 i . Step 2: bounding card(HH0)(x1, . . . , xn) If h, h0 ∈ HH0 satisfy h(xi) = h0(xi) for i = 1, . . . , n − 1, then we can not have h(xn) 6= h0(xn) (otherwise h or h0 takes value 1 at xn and thus belongs to H0). Hence we have card(HH0 )(x1, . . . , xn) = card(HH0 )(x1, . . . , xn−1). Also VHH0 ≤ VH because HH0 ⊂ H. Hence, using (2) with k = n − 1 we have card(HH0 )(x1, . . . , xn−1) ≤ ΠHH0 (n − 1) ≤ VHH0 X i=0 n − 1 i ≤ VH X i=0 n − 1 i . Combining the two steps, we obtain cardH(x1, . . . , xn) ≤ VH −1 X i=0 n − 1 i + VH X i=0 n − 1 i ≤ VH X i=1 n − 1 i − 1 + VH X i=0 n − 1 i =1 + VH X i=1 n − 1 i − 1 + n − 1 i =1 + VH X i=1 n i = VH X i=0 n i . We recall that cardH(x1, . . . , xn) = ΠH(n) and that we had started with any A ⊂ [0, 1]d and any set of functions from A to {0, 1}. Hence (2) is shown by induction. Hence we have ΠF (n) ≤ VCdim(F) X i=0 n i 29
which gives the first inequality of the lemma. For the second inequality, we have

\sum_{i=0}^{VCdim(F)} \binom{n}{i} = \sum_{i=0}^{\min(VCdim(F), n)} \binom{n}{i} ≤ \sum_{i=0}^{\min(VCdim(F), n)} \frac{n^i}{i!} ≤ \sum_{i=0}^{\min(VCdim(F), n)} n^i \binom{VCdim(F)}{i} ≤ \sum_{i=0}^{VCdim(F)} n^i \binom{VCdim(F)}{i} = (n + 1)^{VCdim(F)},

which shows the second inequality of the lemma.

Finally, using Proposition 12 and Lemma 17, we obtain, for a set of functions F from [0, 1]^d to {0, 1},

E[ \sup_{f ∈ F} ( P(f(X) ≠ Y) − \frac{1}{n} \sum_{i=1}^{n} 1_{f(X_i) ≠ Y_i} ) ] ≤ 2 \sqrt{ \frac{2 \log(2 Π_F(n))}{n} } ≤ 2 \sqrt{ \frac{2 \log(2 (n+1)^{VCdim(F)})}{n} } = 2 \sqrt{ \frac{2 \log(2) + 2 VCdim(F) \log(n+1)}{n} }.

When VCdim(F) < ∞, the bound goes to zero at rate almost 1/\sqrt{n}. If we use a set of functions F_n that depends on n (more complex if there are more observations), then the rate of convergence is almost \sqrt{VCdim(F_n)} / \sqrt{n}.
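To get a feel for this bound, here is a minimal numerical sketch in Python (the function name and the chosen values of VCdim(F) and n are mine, for illustration only) that evaluates the right-hand side above.

```python
import math

def vc_generalization_bound(vc_dim, n):
    """Evaluate 2 * sqrt((2*log(2) + 2*VCdim*log(n + 1)) / n), the last bound above."""
    return 2.0 * math.sqrt((2.0 * math.log(2.0) + 2.0 * vc_dim * math.log(n + 1.0)) / n)

# The bound decays roughly like sqrt(VCdim * log(n) / n):
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(vc_generalization_bound(vc_dim=10, n=n), 4))
```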
4 VC-dimension of neural networks with several hidden layers

This section is based on [BHLM19].

4.1 Neural networks as directed acyclic graphs

We will use graphs. A directed graph is of the form (V, E), where V stands for vertices and E stands for edges. The set V is a finite set, for instance V = {v_1, . . . , v_n} or V = {1, . . . , n}. The set E is a subset of V × V that does not contain any element of the form (v, v), v ∈ V. If (v_1, v_2) ∈ E, we say that there is an edge from v_1 to v_2. We say that v_1 is a predecessor of v_2 and that v_2 is a successor of v_1. A simple example is given in Figure 6.
Figure 6: A directed graph defined by V = {1, 2, 3, 4} and E = {(1, 2), (1, 3), (3, 4)}, with 4 vertices and 3 edges.

We say that the directed graph (V, E) is acyclic if there do not exist n ∈ N and v_1, . . . , v_n ∈ V such that
• v_n = v_1,
• (v_1, v_2), . . . , (v_{n−1}, v_n) ∈ E.
A simple example is given in Figure 7.

Figure 7: The graph on the left is acyclic and the graph on the right is cyclic (not acyclic).

A directed graph which is acyclic is called a DAG (directed acyclic graph). We call path a vector (v_1, . . . , v_n) with v_1, . . . , v_n ∈ V and (v_1, v_2), . . . , (v_{n−1}, v_n) ∈ E. For a DAG (V, E) and v ∈ V, we call indegree of v the quantity card{(v′, v); v′ ∈ V, (v′, v) ∈ E}. We call outdegree of v the quantity card{(v, v′); v′ ∈ V, (v, v′) ∈ E}. A simple example is given in Figure 8.
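These notions translate directly into code. The following minimal Python sketch (the function names are mine) represents a directed graph by its vertex list and edge set, computes indegrees and outdegrees, and checks acyclicity by repeatedly removing vertices of indegree 0 (Kahn's algorithm, a standard technique not discussed in these notes).

```python
from collections import defaultdict

def degrees(vertices, edges):
    """Return (indegree, outdegree) dictionaries of a directed graph (V, E)."""
    indeg = {v: 0 for v in vertices}
    outdeg = {v: 0 for v in vertices}
    for (u, v) in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg

def is_acyclic(vertices, edges):
    """Check acyclicity by repeatedly removing vertices of indegree 0."""
    indeg, _ = degrees(vertices, edges)
    successors = defaultdict(list)
    for (u, v) in edges:
        successors[u].append(v)
    stack = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in successors[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed == len(vertices)  # all vertices get removed iff there is no cycle

# The graph of Figure 6: V = {1, 2, 3, 4}, E = {(1, 2), (1, 3), (3, 4)}.
V = [1, 2, 3, 4]
E = [(1, 2), (1, 3), (3, 4)]
print(degrees(V, E))     # vertex 1 has indegree 0; vertex 4 has outdegree 0
print(is_acyclic(V, E))  # True
```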
Figure 8: The vertex 1 has indegree 0 and outdegree 2. The vertex 4 has indegree 2 and outdegree 0.

Definition 18 (general feed-forward neural network) A feed-forward neural network is defined by the following.
• An activation function σ : R → R.
• A DAG G = (V, E) such that G has d ≥ 1 vertices with indegree 0 and 1 vertex with outdegree 0. We write the d vertices with indegree 0 as s^{(0)}_1, . . . , s^{(0)}_d.
• A vector of weights (w_a; a ∈ V′ ∪ E), where V′ is the set of vertices with non-zero indegree (there is one weight per vertex [except the d vertices with indegree 0] and one weight per edge).

We write L for the maximal length (number of edges) of a path of G. We have L ≤ card(V) − 1. We define the layers 0, 1, . . . , L by induction as follows.
• Layer 0 is the set of vertices with indegree 0.
• For ℓ = 1, . . . , L,
layer ℓ = {vertices that have a predecessor in layer ℓ − 1, possibly other predecessors in layers 0, 1, . . . , ℓ − 2, and no other predecessors}.

Proposition 19 In the context of Definition 18, we have the following.
• The layers 0, 1, . . . , L are non-empty.
• The layers 0, 1, . . . , L are disjoint sets.
• Layer L is a singleton, composed of the unique vertex with outdegree 0.
• The edges are only of the form (v, v′), with v ∈ layer i and v′ ∈ layer j, with i < j.

Proof of Proposition 19 We call the elements of layer 0 the roots. For a vertex v, we call inverse path from v to the roots a vector (v, v_1, . . . , v_k) with v_k a root and (v_1, v), (v_2, v_1), . . . , (v_k, v_{k−1}) ∈ E (hence (v_k, v_{k−1}, . . . , v_1, v) is a path).
The length of such an inverse path is k (there are k edges in the path). By convention, if v is a root, we say that v has an inverse path of length 0 to the roots. Let us show by induction that, for ℓ = 0, . . . , L,

layer ℓ = {vertices whose longest inverse path to the roots has length ℓ}.   (3)

The property is true for layer 0 (with our convention). If the property is true for layers 0, . . . , ℓ, then any vertex of layer ℓ + 1 has an edge that comes from layer ℓ, so it has an inverse path to the roots of length ℓ + 1. There is no longer inverse path because the vertex has no predecessors outside of layers 0, 1, . . . , ℓ. Conversely, consider a vertex v whose longest inverse path to the roots has length ℓ + 1. The first predecessor of v in this path belongs to layer ℓ, by (3) at step ℓ. The only predecessors of v are in layers 0 to ℓ, because if v had another predecessor, the longest inverse path from v to the roots would have length > ℓ + 1 (this other predecessor would not belong to layers 0 to ℓ, and we use (3)). Hence, finally, v belongs to layer ℓ + 1. Hence we have shown (3) by induction.

We remark that any vertex v has an inverse path to the roots. Indeed, let V̄ be the subset of V composed of the vertices that have an inverse path to the roots. If V̄ ≠ V, then V \ V̄ induces a DAG with no vertex of indegree 0 (a vertex of V \ V̄ with no predecessor in V \ V̄ would either be a root or have all its predecessors in V̄, and in both cases it would have an inverse path to the roots), which is impossible: in a DAG, there always exists a vertex of indegree 0, otherwise we could construct arbitrarily long inverse paths and thus a cycle.

Hence, from (3), an element of layer L has outdegree 0 (otherwise there would be a path of length L + 1). Hence layer L is empty or is a singleton. By construction of the layers, the edges only go from a layer i to a layer j with i < j. Hence, the only possible paths of length L go through each of the layers 0, 1, . . . , L. Since such a path exists, the layers 0, 1, . . . , L are non-empty. Hence we have proved everything: the layers are non-empty and disjoint, any vertex belongs to one of the layers, and the edges go from a layer to a layer of strictly larger index.

An example is given in Figure 9.

Figure 9: An example of the DAG of a neural network, representing a neural network classifier from [0, 1]^3 to {0, 1}. Layer 0 has 3 vertices; these vertices have indegree 0. Layer 1 has 4 vertices (neurons). Layer 2 has 3 vertices (neurons). Layer L = 3 has one (final) vertex with outdegree 0 (the output of the neural network function). Layers 1 and 2 correspond to the hidden layers.
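The characterization (3) gives a direct way to compute the layers of a DAG. Below is a minimal Python sketch (the names and the small example graph are mine, for illustration): it assigns to each vertex the length of its longest inverse path to the roots and groups vertices by that length; the recursion terminates because the graph is acyclic.

```python
def layers(vertices, edges):
    """Layers of a DAG, following (3): layer l is the set of vertices whose
    longest inverse path to the roots (vertices of indegree 0) has length l."""
    preds = {v: [] for v in vertices}
    for (u, v) in edges:
        preds[v].append(u)
    depth = {}
    def longest(v):  # length of the longest inverse path from v to the roots
        if v not in depth:
            depth[v] = 0 if not preds[v] else 1 + max(longest(u) for u in preds[v])
        return depth[v]
    for v in vertices:
        longest(v)
    L = max(depth.values())
    return [[v for v in vertices if depth[v] == l] for l in range(L + 1)]

# A small hypothetical network DAG with a skip edge from an input to the output:
V = ["x1", "x2", "h1", "h2", "out"]
E = [("x1", "h1"), ("x2", "h1"), ("x1", "h2"), ("x2", "h2"),
     ("h1", "out"), ("h2", "out"), ("x1", "out")]
print(layers(V, E))  # [['x1', 'x2'], ['h1', 'h2'], ['out']]
```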
Remark Compared to Section 1.3,
• we do not require all the possible edges between layers i and i + 1,
• we allow for edges between layers i and i + k with k ≥ 2.

Formal definition of a general feed-forward neural network function based on a DAG Following Definition 18 and Proposition 19, it is a function characterized by (w_a; a ∈ V′ ∪ E), where V′ contains the layers 1 to L. The input space is [0, 1]^d. Consider an input x = (x_1, . . . , x_d) ∈ [0, 1]^d. We define, by induction on the layers 0 to L, the outputs associated to the neurons of each layer. For layer 0, the output of the vertex s^{(0)}_i is x_i. For layer ℓ + 1, ℓ = 0, . . . , L − 2, the output of a vertex v is

σ( \sum_{i=1}^{m} w_i S_i + b ),

where
• m is the indegree of v,
• v′_1, . . . , v′_m are the predecessors of v: (v′_1, v), . . . , (v′_m, v) ∈ E,
• (w_1, . . . , w_m) = (w_{(v′_1, v)}, . . . , w_{(v′_m, v)}) are the weights associated to the edges pointing to v,
• S_1, . . . , S_m are the outputs of v′_1, . . . , v′_m, which are vertices of the layers 0, . . . , ℓ, so these outputs are indeed already defined,
• b = w_v is the weight associated to the vertex v.
For the layer L, with the same notation, the output is 1_{\sum_{i=1}^{m} w_i S_i + b ≥ 0}.

4.2 Bounding the VC-dimension

We will bound the VC-dimension of these neural network functions based on the following quantities.
• L: number of layers minus 1 (length of the longest path).
• U: number of neurons; U = card(V′), where V′ is the set of vertices of the layers 1 to L.
• W: number of weights; W = U + card(E).

We assume that σ is piecewise polynomial: there exist p + 1 pieces I_1, . . . , I_{p+1} (p ≥ 1), where I_1, . . . , I_{p+1} are intervals of R, that is, of the form (−∞, a), (−∞, a], (a, b), [a, b), (a, b], [a, b], (a, +∞), [a, +∞), such that I_i ∩ I_j = ∅ for i ≠ j, with R = ∪_{i=1}^{p+1} I_i, and such that σ is polynomial on I_i for i = 1, . . . , p + 1, with a polynomial function of degree smaller than or equal to D ∈ N.

Examples
• Threshold function σ(x) = 1_{x ≥ 0}, with p = 1, I_1 = (−∞, 0), I_2 = [0, +∞) and D = 0. The polynomials are x ↦ 0 on I_1 and x ↦ 1 on I_2.
• ReLU function σ(x) = max(0, x), with p = 1, I_1 = (−∞, 0), I_2 = [0, +∞) and D = 1. The polynomials are x ↦ 0 on I_1 and x ↦ x on I_2.
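As a concrete illustration of this forward pass, here is a minimal Python sketch (the data-structure choices and names are mine, under the assumption that the DAG is given through its layers and predecessor lists): hidden neurons apply σ to a weighted sum of their predecessors' outputs plus the weight of the vertex, and the final vertex thresholds its weighted sum at 0.

```python
def forward(x, layer_list, preds, edge_w, vertex_w, sigma=lambda t: max(0.0, t)):
    """Forward pass of a DAG feed-forward network (Definition 18).
    x: dict mapping each layer-0 vertex to its input coordinate in [0, 1].
    layer_list: list of layers, layer_list[0] = inputs, layer_list[-1] = [output vertex].
    preds: dict vertex -> list of its predecessors.
    edge_w: dict (u, v) -> weight of edge (u, v); vertex_w: dict v -> weight of vertex v.
    sigma: activation function (ReLU by default)."""
    out = dict(x)  # outputs of layer 0 are the input coordinates
    for layer in layer_list[1:-1]:  # hidden layers 1, ..., L - 1
        for v in layer:
            s = vertex_w[v] + sum(edge_w[(u, v)] * out[u] for u in preds[v])
            out[v] = sigma(s)
    v_out = layer_list[-1][0]  # the unique vertex with outdegree 0
    s = vertex_w[v_out] + sum(edge_w[(u, v_out)] * out[u] for u in preds[v_out])
    return 1 if s >= 0 else 0  # output in {0, 1}
```

The `layers` function from the previous sketch can be used to produce `layer_list`; skip edges between non-consecutive layers are handled automatically through the predecessor lists.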
Theorem 20 ([BHLM19]) Let L ≥ 1, U ≥ 3, d ≥ 1, p ≥ 1 and W ≥ U ≥ L. We consider a DAG G = (V, E) whose longest path has length L, with d vertices with indegree 0 and one vertex with outdegree 0. We assume that card(V′) = U, where V′ is the set of vertices in the layers 1 to L. We assume that U + card(E) = W. We consider a function σ : R → R, piecewise polynomial on p + 1 disjoint intervals, with degrees smaller than or equal to D. We define the following, for i ∈ {1, . . . , L}.
• If D = 0, W_i is the number of parameters (weights and biases) needed to compute all the neurons of layer i. We have
W_i = (number of edges pointing to layer i) + (number of vertices in layer i).
• If D ≥ 1, W_i is the number of parameters (weights and biases) needed to compute all the neurons of layers 1 to i. We have
W_i = (number of edges pointing to a layer j, j ≤ i) + (number of vertices in layers 1 to i).
We write

L̄ = \frac{1}{W} \sum_{i=1}^{L} W_i ∈ [1, L];

• this is equal to 1 if D = 0,
• this can be close to L if D ≥ 1 and if the neurons are concentrated on the first layers.
We define, for i = 1, . . . , L, k_i as the number of vertices of layer i (k_L = 1). We write

R = \sum_{i=1}^{L} k_i (1 + (i − 1) D^{i−1}),

which satisfies R ≤ U L D^{L−1} if D ≥ 1 and R = U if D = 0. We define F as the set of all the feed-forward neural network functions defined by G = (V, E), with one weight per vertex of the layers 1 to L and one weight per edge (the structure of the network is fixed and the weights vary). Then, for m ≥ W, with e = exp(1),

Π_F(m) ≤ \prod_{i=1}^{L} 2 ( \frac{2 e m k_i p (1 + (i − 1) D^{i−1})}{W_i} )^{W_i}   (4)
       ≤ ( 4 e m p (1 + (L − 1) D^{L−1}) )^{\sum_{i=1}^{L} W_i}.   (5)

Furthermore,

VCdim(F) ≤ L + L̄ W \log_2( 4 e p R \log_2(2 e p R) ).   (6)

In particular, we have the following.
• If D = 0, VCdim(F) ≤ L + W \log_2(4 e p U \log_2(2 e p U)) has W as a dominating term (neglecting logarithms). This is the number of parameters of the neural network functions.
• If D ≥ 1, VCdim(F) has L̄ W as a dominating term (neglecting logarithms). This is more than the number of parameters of the neural network functions. We can interpret this by the fact that depth can increase L̄ (recall that L̄ ∈ [1, L]) and thus make the family of neural network functions more complex.
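As a sketch of how the bound (6) can be evaluated in practice (a hedged illustration: the helper below and its argument names are mine, and it simply plugs the quantities of Theorem 20 into (6) without verifying all of the theorem's conditions, such as W ≥ U ≥ L and U ≥ 3):

```python
import math

def vcdim_bound(k, edges_into_layer, p, D):
    """Evaluate the right-hand side of (6) in Theorem 20.
    k: [k_1, ..., k_L], number of neurons per layer (k[-1] must be 1).
    edges_into_layer: [E_1, ..., E_L], number of edges pointing into each layer.
    p: number of break points of sigma (p + 1 polynomial pieces).
    D: maximal degree of the polynomial pieces."""
    L = len(k)
    U = sum(k)
    W = U + sum(edges_into_layer)
    if D == 0:
        W_i = [edges_into_layer[i] + k[i] for i in range(L)]
        R = U
    else:
        W_i = [sum(edges_into_layer[:i + 1]) + sum(k[:i + 1]) for i in range(L)]
        # R = sum_i k_i (1 + (i - 1) D^{i-1}); the Python index j equals i - 1.
        R = sum(k[j] * (1 + j * D**j) for j in range(L))
    L_bar = sum(W_i) / W
    e = math.e
    return L + L_bar * W * math.log2(4 * e * p * R * math.log2(2 * e * p * R))

# Fully connected ReLU network from [0,1]^3 to {0,1} with hidden layers of sizes 4 and 3
# (the architecture of Figure 9): edges into layers 1, 2, 3 are 3*4, 4*3 and 3*1.
print(vcdim_bound(k=[4, 3, 1], edges_into_layer=[12, 12, 3], p=1, D=1))
```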
4.3 Proof of the theorem

Let us prove Theorem 20. The proof relies on the following result from algebraic geometry.

Lemma 21 Let P_1, . . . , P_m be polynomial functions of n ≤ m variables, of degree smaller than or equal to D ≥ 1. We write

K = card{ (sign(P_1(x)), . . . , sign(P_m(x))); x ∈ R^n },

with sign(t) = 1_{t ≥ 0}. Note that K is the number of possible sign vectors. Then

K ≤ 2 ( \frac{2 e m D}{n} )^n.

The proof of Lemma 21 can be found in [AB09].

Let us write f(x, a) for the output of the last linear combination of the network (just before the final threshold) for the input x ∈ [0, 1]^d and the vector of parameters a ∈ R^W, so that the network output is sign(f(x, a)) = 1_{f(x, a) ≥ 0}. Let x_1, . . . , x_m ∈ [0, 1]^d. In order to bound Π_F(m), let us bound

card{ (sign(f(x_1, a)), . . . , sign(f(x_m, a))); a ∈ R^W } ≤ \sum_{i=1}^{N} card{ (sign(f(x_1, a)), . . . , sign(f(x_m, a))); a ∈ P_i },

where P_1, . . . , P_N form a partition of R^W, which will be chosen such that the m functions a ↦ f(x_j, a), j = 1, . . . , m, are polynomial on each cell P_i. We can then apply Lemma 21. The main difficulty is to construct a good partition.

We will construct by induction partitions C_0, . . . , C_{L−1}, where C_{L−1} will be the final partition P_1, . . . , P_N. The partitions C_0, . . . , C_{L−1} are partitions of R^W: for i ∈ {0, . . . , L − 1}, C_i = {A_1, . . . , A_q} with A_1 ∪ · · · ∪ A_q = R^W and A_r ∩ A_{r′} = ∅ for r ≠ r′. We will have the following.
(a) The partitions are nested: any C ∈ C_i is a union of one or several C′ ∈ C_{i+1} (0 ≤ i ≤ L − 2).
(b) We have card(C_0) = 1 (C_0 = {R^W}) and, for i ∈ {1, . . . , L − 1},

\frac{card(C_i)}{card(C_{i−1})} ≤ 2 ( \frac{2 e m k_i p (1 + (i − 1) D^{i−1})}{W_i} )^{W_i}.

(c) For i ∈ {0, . . . , L − 1}, for E ∈ C_i, for j ∈ {1, . . . , m}, the output of a neuron of layer i (for the input x_j) is a polynomial function of W_i variables of a ∈ E, with degree smaller than or equal to i D^i.

Induction When i = 0, we have C_0 = {R^W}. The output of a neuron of layer 0 is constant with respect to a ∈ R^W and thus property (c) holds.

Let 1 ≤ i ≤ L − 1. Assume that we have constructed nested partitions C_0, . . . , C_{i−1} satisfying (b) and (c). Let us construct C_i. We write P_{h, x_j, E}(a) for the input (just before σ) of the neuron h (h = 1, . . . , k_i) of layer i, for the input x_j, as a function of a ∈ E, with E ∈ C_{i−1}. From the induction hypothesis (c), since P_{h, x_j, E}(a) is of the form

\sum_{k} w_k (output of neuron k) + b

and since the partitions are nested, P_{h, x_j, E}(a) is polynomial on E of degree smaller than or equal to 1 + (i − 1) D^{i−1}, and depends on at most W_i variables (we can check that this also holds when D = 0). Because of σ, the output of the neuron h is piecewise polynomial on E. We will divide E into subcells such that the output is polynomial on each of the subcells, for all neurons h and all inputs x_j.
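Before continuing the construction, here is a quick numerical illustration of Lemma 21, which drives the cardinality estimate (b) (a sketch; the chosen values of m, n and D are mine): the bound is polynomial in the number m of polynomials, whereas the trivial bound on the number of sign vectors is 2^m.

```python
import math

def sign_pattern_bound(m, n, D):
    """Lemma 21 bound on the number of sign vectors of m polynomials
    in n <= m variables of degree at most D >= 1: 2 * (2*e*m*D / n)**n."""
    return 2.0 * (2.0 * math.e * m * D / n) ** n

# For large m, the Lemma 21 bound is far smaller than the trivial bound 2**m:
for m in (10, 100, 1000):
    print(m, sign_pattern_bound(m, n=3, D=2), 2**m)
```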
Figure 10 illustrates the current state of the proof.

Figure 10: Illustration of the construction of the partitions.

We write t_1 < t_2 < · · · < t_p for the cuts of the pieces I_1, . . . , I_{p+1}, as illustrated in Figure 11.

Figure 11: Illustration of the cuts of the intervals for σ.

We consider the polynomials

( ± (P_{h, x_j, E}(a) − t_r) )_{h ∈ {1, . . . , k_i}, j ∈ {1, . . . , m}, r ∈ {1, . . . , p}},
where in the above display there is a + if I_{r+1} is closed at t_r and a − if I_{r+1} is open at t_r. With this choice,

1_{± (P_{h, x_j, E}(a) − t_r) ≥ 0} = sign( ± (P_{h, x_j, E}(a) − t_r) )

is constant when P_{h, x_j, E}(a) ranges over I_{r+1}. From Lemma 21, this set of polynomials on R^W reaches at most

Π = 2 ( \frac{2 e (k_i m p) (1 + (i − 1) D^{i−1})}{W_i} )^{W_i}

distinct vectors of signs (sign(± (P_{h, x_j, E}(a) − t_r)))_{h, j, r} when a ranges over R^W, and thus when a ranges over E. Indeed,
• k_i m p is the number of polynomials,
• 1 + (i − 1) D^{i−1} is the degree bound,
• W_i is the number of variables.
We can thus partition E into at most Π subcells such that, on each of these subcells, the P_{h, x_j, E}(a) stay in the same interval where σ is polynomial as a varies in the subcell. We remark that these Π subcells of E are the same for all the neurons h and all the inputs x_j (this is important for the sequel). Hence we obtain a new partition C_i of cardinality at most Π card(C_{i−1}). This enables us to satisfy property (b).

Let us now address property (c). For all E′ ∈ C_i, the output of the neuron h ∈ {1, . . . , k_i},

a ∈ E′ ↦ σ( P_{h, x_j, E}(a) ),

is a polynomial function of W_i variables with degree smaller than or equal to

D (1 + (i − 1) D^{i−1}) ≤ i D^i,

where the factor D comes from the application of the polynomial corresponding to σ. Hence property (c) holds. This completes the induction and we have the nested partitions C_0, . . . , C_{L−1} satisfying (b) and (c).

Use of the partition to conclude the proof In particular, C_{L−1} is a partition of R^W such that the output of each neuron of the layers 0, . . . , L − 1 is polynomial of degree smaller than or equal to (L − 1) D^{L−1} on each E ∈ C_{L−1} (since the partitions are nested), for all inputs x_1, . . . , x_m. Hence, for each cell E ∈ C_{L−1} and each input x_j, the function a ∈ E ↦ f(x_j, a) at the end of the network is polynomial with degree smaller than or equal to 1 + (L − 1) D^{L−1}, where the 1 comes from the final linear combination. Hence, from Lemma 21,

card{ (sign(f(x_1, a)), . . . , sign(f(x_m, a))); a ∈ E } ≤ 2 ( \frac{2 e m (1 + (L − 1) D^{L−1})}{W_L} )^{W_L}

and thus

card{ (sign(f(x_1, a)), . . . , sign(f(x_m, a))); a ∈ R^W } ≤ \sum_{E ∈ C_{L−1}} card{ (sign(f(x_1, a)), . . . , sign(f(x_m, a))); a ∈ E } ≤ card(C_{L−1}) · 2 ( \frac{2 e m (1 + (L − 1) D^{L−1})}{W_L} )^{W_L}.   (7)
Then, from property (b),

card(C_{L−1}) ≤ \prod_{i=1}^{L−1} 2 ( \frac{2 e m k_i p (1 + (i − 1) D^{i−1})}{W_i} )^{W_i}

and thus, since (7) holds for any x_1, . . . , x_m ∈ [0, 1]^d,

Π_F(m) ≤ \prod_{i=1}^{L} 2 ( \frac{2 e m k_i p (1 + (i − 1) D^{i−1})}{W_i} )^{W_i}

(for the factor i = L, note that k_L = 1 and p ≥ 1), and thus (4) is proved. For the sequel, we use the inequality between arithmetic and geometric means: for y_1, . . . , y_k > 0 and a_1, . . . , a_k ≥ 0 such that \sum_{i=1}^{k} a_i > 0,

\prod_{i=1}^{k} y_i^{a_i} ≤ ( \frac{\sum_{i=1}^{k} a_i y_i}{\sum_{i=1}^{k} a_i} )^{\sum_{i=1}^{k} a_i}.

Then we have

Π_F(m) ≤ 2^L ( \frac{2 e m p \sum_{i=1}^{L} k_i (1 + (i − 1) D^{i−1})}{\sum_{i=1}^{L} W_i} )^{\sum_{i=1}^{L} W_i}

(by definition of R)
= 2^L ( \frac{2 e m p R}{\sum_{i=1}^{L} W_i} )^{\sum_{i=1}^{L} W_i}   (8)

(since L ≤ \sum_{i=1}^{L} W_i and 1 + (i − 1) D^{i−1} ≤ 1 + (L − 1) D^{L−1})
≤ ( \frac{4 e m p (1 + (L − 1) D^{L−1}) \sum_{i=1}^{L} k_i}{\sum_{i=1}^{L} W_i} )^{\sum_{i=1}^{L} W_i}

(since \sum_{i=1}^{L} k_i ≤ \sum_{i=1}^{L} W_i)
≤ ( 4 e m p (1 + (L − 1) D^{L−1}) )^{\sum_{i=1}^{L} W_i}.

Hence (5) is proved. To prove the bound (6) on VCdim(F), we will combine (8) and the next lemma (that we do not prove).

Lemma 22 Let r ≥ 16 and w ≥ t > 0. Then, for any m ≥ t + w \log_2(2 r \log_2(r)) =: x_0, we have

2^m > 2^t ( \frac{m r}{w} )^w.

Hence, from (8) and by the definition of the VC-dimension, Lemma 22 with t = L, w = \sum_{i=1}^{L} W_i and r = 2 e p R ≥ 2 e U ≥ 16 yields

VCdim(F) ≤ L + ( \sum_{i=1}^{L} W_i ) \log_2( 4 e p R \log_2(2 e p R) ),

which proves (6), since \sum_{i=1}^{L} W_i = L̄ W.

References

[AB09] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[BHLM19] Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[Cyb89] George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4):303–314, 1989.

[Gir14] Christophe Giraud. Introduction to High-Dimensional Statistics, volume 138. CRC Press, 2014.

[Rud98] Walter Rudin. Real and Complex Analysis. McGraw-Hill, 1998.