IDS Lab
Universal Approximation Theorem
why do deep neural networks work?

basic maths for deep learning

Jamie Seol
Motivation
• Basic logistic regression, or a perceptron, can’t classify data like this:

• because the decision boundary will be just a line!
Motivation
• Unless we put additional features like x3 = x1², x4 = x2²

• but this is feature engineering!

• we need to do it by hand!!

• we don’t want to do this!

• this is one of the main reasons why we’re studying deep learning!

• we can’t always do this!

• what if the input were a 1000-dimensional vector?

• we know that deep learning automatically finds these additional
features

• but why and how?
Toy experiment
• We know that an SVM would probably work well on the previous problem

• How about an MLP (multilayer perceptron)?

• It works, with 100% accuracy

• https://github.com/theeluwin/kata/blob/master/tasks/Universal%20Approximation%20Theorem/uat.py
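• below is a minimal sketch of such an experiment (our own illustration, not the contents of uat.py), assuming scikit-learn is available: a small MLP separates points inside a circle from points outside it, with no hand-made features

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(-1.0, 1.0, size=(1000, 2))            # raw 2D points, no extra features
    y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(int)   # label: inside the circle or not

    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))                                # typically very close to 1.0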
Toy experiment
• MLP actually learned a non-linear decision boundary without any
feature engineering!
[figure: the data (left) and the learned decision boundary (right)]
Universal Approximation Theorem
• The theorem states that, long story short,

an MLP can approximate ANY given (nice) function

• Formally proved by G. Cybenko in 1989 for sigmoidal activation functions

• K. Hornik (with Stinchcombe and White) showed the sigmoid isn’t special: any squashing activation function works too

• therefore, theoretically, MLP can learn ANYTHING
• “it’s not working in practice” ← it will, if you have zillion neurons
with zillion layers, and infinite data

• now we can say deep learning is a universal learning algorithm

• then why do we need things like CNN and RNN? because we
don’t have infinite data
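• as a rough numeric illustration of the claim (ours, not the actual proof), assuming numpy: fix random sigmoid hidden units and solve only for the output weights 𝛽; the sup error on a compact interval shrinks as the hidden width grows

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    x = np.linspace(-3.0, 3.0, 500)
    target = np.sin(2 * x) + 0.3 * x                    # some "nice" target function

    width = 50
    w = rng.normal(scale=3.0, size=width)               # random hidden weights
    b = rng.normal(scale=3.0, size=width)               # random hidden biases
    H = sigmoid(np.outer(x, w) + b)                     # hidden activations, shape (500, width)
    beta, *_ = np.linalg.lstsq(H, target, rcond=None)   # fit the output weights only
    print(np.max(np.abs(H @ beta - target)))            # sup-norm error on [-3, 3]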
Universal Approximation Theorem
• We’ll prove this theorem

• We can’t skip maths forever!

• let’s use Times New Roman font
Analysis 101
• In Euclidean space ℝn, we call the following set an open ball
• B(x, ε) = {y ∈ ℝn | d(x, y) < ε}
• for given x ∈ ℝn and ε ∈ ℝ+
• function d can be any metric, but usually we use
• d(x, y) = l2(x - y) = ||x - y||2
• also known as Euclidean distance
• We call a set A ⊂ ℝn open if for ∀x ∈ A, ∃ε ∈ ℝ+ such that B(x, ε) ⊂ A
• We generalize this by taking the collection of open sets itself as the primitive notion, which we call a topology
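• a quick worked example (ours): the interval (0, 1) ⊂ ℝ is open, since for any x ∈ (0, 1) we can take ε = min(x, 1 - x) > 0 and then B(x, ε) ⊂ (0, 1); the interval [0, 1] is not open, since no open ball around 0 stays inside it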
Analysis 101
[figures: an open ball; an open set; an open ball when d(x, y) = l1(x - y)]
Topology 101
• For some set X, we say 𝒯 is a topology on X if:
• 𝒯 is a family of subsets of X
• that is, 𝒯 ⊂ 𝒫(X)
• ∅, X ∈ 𝒯
• any union of elements of 𝒯 is an element of 𝒯
• any intersection of finitely many elements of 𝒯 is an element of 𝒯
• Members of 𝒯 are often called open sets
• for example, with X = ℝn, we can give topology 𝒯 by collecting all
open sets defined in previous slide
• note that this is a typical non-constructive definition!
• we can’t explicitly list all members of 𝒯, we just know that they exist
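• a tiny worked example (ours): on X = {1, 2}, the family 𝒯 = {∅, {1}, X} is a topology (check the conditions above), while {∅, {1}, {2}} is not, since the union {1}∪{2} = X is missing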
Topology 101
• Note that topology is motivated by generalization of open set!
• for example, in analysis, we say a sequence {an} converges to a if for
∀ε > 0, ∃N s.t. for ∀n > N, |an - a| < ε holds
• in topology, we say a sequence {an} converges to a if for any open set
O ∈ 𝒯 containing a, ∃N s.t. for n > N, an ∈ O holds
• x ∈ X is called a limit point of A (for some given A ⊂ X) if for any open set O ∈ 𝒯 containing x, O∩(A \ {x}) ≠ ∅ holds
• we denote by A’ the set of limit points of A
• Ā is called the closure of A, defined by Ā = A∪A’
• we say A is dense in X if Ā = X
• THIS is the true meaning of dense!
• dense dance ~
Fourier Analysis 101
• Easy example: any irrational number is a limit point of ℚ (why?), so
closure of ℚ is equal to ℝ, or, ℚ is dense in ℝ
• One of the most useful applications of the concept of denseness is the Fourier series/transform:
• trigonometric polynomials (a.k.a. Fourier series) are dense in L2
• this is non-trivial when support is not compact
• how about Lp?
• Fourier transform can be defined in Lp
• this is also non-trivial
• can be proved by the following steps:
• show that the Schwartz class is dense in Lp
• extend the transform to all of Lp by completion
Topology 101
• We say a topological space X is compact if every open cover has a finite subcover
• an open cover of X is just a collection of open sets whose union contains X
• Actually, talking about compactness requires a lot of time
• For Euclidean space, we can think of compact as bounded and closed
• this is not trivial: see Heine-Borel theorem
• bounded means that the set can be contained in some big open ball
• closed is just the opposite of open; A is a closed set when X \ A is open
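• a quick worked example (ours): [0, 1] is compact (closed and bounded), while (0, 1) is not: the open cover {(1/n, 1) : n ≥ 2} has no finite subcover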
Linear Algebra 101
• For a vector space V over field F, we call || • ||: V → ℝ a norm if
• for all a ∈ F and all u, v ∈ V,
• ||av|| = |a| ||v||
• ||u + v|| ≤ ||u|| + ||v||
• ||u|| = 0 implies u = 0
• Examples
• lp-norm: lp(x) = (∑i |xi|^p)^(1/p)
• Lp-norm: Lp(f) = (∫ |f(x)|^p dx)^(1/p)
• Note that the set of real-valued functions is a vector space! functions are vectors!
• (3f + g) is a function(vector) defined by (3f + g)(x) = 3f(x) + g(x)
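• a small numeric sketch of these norms (ours, assuming numpy), treating a sampled function as a long vector:

    import numpy as np

    v = np.array([3.0, -4.0])
    print(np.sum(np.abs(v) ** 2) ** 0.5)            # l2-norm of a vector: 5.0

    t = np.linspace(0.0, 1.0, 10001)
    f = t ** 2                                      # the function f(t) = t^2 on [0, 1]
    dt = t[1] - t[0]
    print((np.sum(np.abs(f) ** 2) * dt) ** 0.5)     # ≈ L2-norm of f, i.e. 1/√5 ≈ 0.447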
Linear Algebra 101
• For a vector space V over field F (ℝ or ℂ), we call <•, •>: V × V → F an inner product of two vectors if
• for all a ∈ F and all x, y, z ∈ V,
• <x, y> = the conjugate of <y, x> (plain symmetry when F = ℝ)
• <ax + y, z> = a<x, z> + <y, z>
• <x, x> ≥ 0, and <x, x> = 0 implies x = 0
• Note that we can induce a canonical (trivial) norm from an inner product by ||x|| = √<x, x>
• Example: inner product of two functions, <f, g> = ∫ f(x)g(x) dx
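• a small numeric sketch (ours, assuming numpy) of this inner product on [0, 1] and the norm it induces:

    import numpy as np

    x = np.linspace(0.0, 1.0, 10001)
    dx = x[1] - x[0]
    f = np.sin(np.pi * x)
    g = x ** 2
    inner_fg = np.sum(f * g) * dx           # <f, g> = ∫ f(x)g(x) dx ≈ 0.189
    norm_f = (np.sum(f * f) * dx) ** 0.5    # ||f|| = √<f, f> ≈ 0.707
    print(inner_fg, norm_f)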
Linear Algebra 101
• Function space - just a vector space
• which means, its elements, or vectors, are functions
• mostly we’ll deal with continuous and real-valued function
• Typical examples of function spaces are:
• linear functionals (the dual space)
• Schwartz class
• Banach space, Hilbert space
• Reproducing Kernel Hilbert Space (RKHS)
• this is really really really important concept! mandatory for
understanding regularization and transfer learning
• we’ll cover it someday…
• The domain of the functions (vectors) is often called the support in these slides (strictly speaking, the support of f is the closure of the set where f ≠ 0)
Analysis 101
• A set M is called a metric space if a metric d is given, satisfying
• d: M × M → ℝ, and for all x, y, z in M,
• d(x, y) ≥ 0
• d(x, y) = 0 if and only if x = y
• d(x, y) = d(y, x)
• d(x, z) ≤ d(x, y) + d(y, z)
• Obviously, if some vector space is a normed space, then it has a canonical (trivial) metric d induced by d(x, y) = ||x - y||
• Another obvious one: a metric space has a canonical (trivial) topology induced by its open balls
• For example, the Kullback-Leibler divergence is not a metric (it is not symmetric and fails the triangle inequality)
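• a tiny numeric check (ours, assuming numpy) that KL fails the symmetry axiom:

    import numpy as np

    def kl(p, q):
        return np.sum(p * np.log(p / q))    # KL(p || q) for discrete distributions

    p = np.array([0.9, 0.1])
    q = np.array([0.5, 0.5])
    print(kl(p, q), kl(q, p))               # ≈ 0.368 vs ≈ 0.511, so d(x, y) = d(y, x) fails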
Analysis 101
• A sequence {an} in a metric space M is said to be Cauchy if for ∀ε > 0, ∃N s.t. for ∀n, m > N, d(an, am) < ε holds
• A metric space M is said to be complete if every Cauchy sequence converges (to a point of M)
• very, very typical complete space: ℝ
• but this is not trivial
• for example, a rational sequence {an} satisfying an ∈ [π - 1/n, π + 1/n] is Cauchy, but it never converges in ℚ (its limit π is not rational), so ℚ is not complete!
Analysis 101
• In function space with functions having a metric codomain, we say a
sequence {fn} converges uniformly to f if for ∀ε > 0, ∃N s.t. for ∀n > N,
d(fn(x), f(x)) < ε holds for all x in the support
• why uniform? because N was chosen independently of x
• which means, N is a function of ε only
• if N may depend on both ε and x, then we call it point-wise convergence
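• a classic worked example (ours): fn(x) = x^n on [0, 1] converges point-wise to f (f(x) = 0 for x < 1, f(1) = 1), but not uniformly, since supx |fn(x) - f(x)| = 1 for every n; the required N blows up as x approaches 1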
Real Analysis 101
• We say a function f is measurable if f−1(A) is a measurable set for any open set A
• maybe we should stop here…
• if someone meets a non-measurable function by chance, it’ll be one of the most unlucky days ever
• one old mathematician said, “If someone built an airplane using
non-measurable function, then I’ll not ride the plane”
• Anyway, a measure is something that measures area of a set
• for example, µ([a, b]) = b - a will hold as intuitively in ℝ with
Lebesgue measure µ
• we say some property p holds almost everywhere in A if the measure
of {x ∈ A | ¬p(x)} is 0
• note that a probability is a measure of an event!
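• a quick example (ours): ℚ has Lebesgue measure 0 in ℝ, so the property “x is irrational” holds almost everywhere on ℝ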
All together
• Want to show: “the set of MLPs is uniformly dense in the measurable normed function space on every compact support”
• at last! now we can read this statement
• Let’s prove this theorem
Notations
• For some natural number r, let Ar be the set of all affine functions from ℝr to ℝ, where an affine function has the form A(x) = wTx + b
• For any measurable function G from ℝ to ℝ, let ∑r(G) be the class of functions {f: ℝr → ℝ, f(x) = ∑𝛽jG(Aj(x)), x ∈ ℝr, 𝛽j ∈ ℝ, Aj ∈ Ar, finite sum}
• note that ∑r(G) is the family of single-hidden-layer feedforward neural networks with one output, having G as the activation function
• actually, we’re talking about Borel measurable functions here, not just any notion of measurability
• For any measurable function G from ℝ to ℝ, let ∑∏r(G) be the class of functions {f: ℝr → ℝ, f(x) = ∑𝛽j∏G(Ajk(x)), x ∈ ℝr, 𝛽j ∈ ℝ, Ajk ∈ Ar, finite sum and finite product}
• somewhat complex but more general form of MLP
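• a concrete example (ours): with r = 2, f(x) = 2·G(A1(x))·G(A2(x)) - G(A3(x)) belongs to ∑∏2(G) (take 𝛽1 = 2, 𝛽2 = -1); dropping the products leaves an element of ∑2(G)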
Notations
• A function 𝛹: ℝ → [0, 1] is a squashing function if
• non-decreasing
• limit to +∞ gives 1, -∞ gives 0
• examples: positive indicator function, standard sigmoid function
• Cr = the set of continuous functions from ℝr to ℝ
• if G is continuous, both ∑r(G) and ∑∏r(G) belong to Cr
• Mr = the set of measurable functions from ℝr to ℝ
• if G is measurable, both ∑r(G) and ∑∏r(G) belong to Mr
• note that Cr is a subset of Mr
Notations
• A subset S of Cr is said to be uniformly dense on compacta in Cr if for every compact subset K ⊂ ℝr, S is dK-dense in Cr
• dK(f, g) = supx∈K |f(x) - g(x)|
• uniform convergence should hold
• Given a measure µ on (ℝr, Br), we define metric
• dµ(f, g) = inf{ε > 0: µ{x: |f(x) - g(x)| > ε} < ε}
• so dµ(f, g) = 0 if and only if f and g agree almost everywhere
• Br is the Borel 𝜎-field of ℝr (we omit the detailed explanation)
• note that this measure µ is some kind of input space environment (think
as a probability of an event)
Stone-Weierstrass Theorem
• A family A of real functions defined on E is an algebra if A is closed
under addition, multiplication, and scalar multiplication
• A family A separates points on E if for every x, y in E with x ≠ y, ∃f ∈ A s.t. f(x) ≠ f(y)
• A family A vanishes at no point of E if for each x in E, ∃f ∈ A s.t. f(x) ≠ 0
• statement if A is an algebra of continuous functions on a compact metric space K that separates points on K and vanishes at no point of K, then A is dK-dense in C(K)
• proof too long to give here
• see the reference for full proof
Sketch of the proof
1. Apply Stone-Weierstrass theorem to ∑∏r(G)
2. Extend “uniformly dense on compacta” to dense in the sense of measure (dµ)
3. For squashing functions, approximate them by continuous squashing functions
4. For continuous squashing functions, approximate them by Fourier series on compacta
5. Remove ∏ using the product-to-sum cosine identity
Theorem 1
• statement let G be any continuous non-constant function from ℝ to ℝ,
then ∑∏r(G) is uniformly dense on compacta in Cr
• proof let K ⊂ ℝr be any compact set; then for any such G, ∑∏r(G) is an algebra on K which separates points and vanishes at no point (we collected all affine functions of the form A(x) = wTx + b, and G is non-constant), so we can apply the Stone-Weierstrass theorem
Lemma 1
• statement if {fn} is a sequence of functions in Mr that converges
uniformly on compacta to the function f, then dµ(fn, f) → 0
• by this lemma, extending Mr from Cr becomes really easy!
• note that our activation functions are not always continuous
• but surely measurable, because we don’t want to build any non-
measurable MLP - it would be a terror!
• proof pick a big compact set (so that its measure is almost as large as that of the whole space), find a large N from uniform convergence, then apply Chebyshev’s inequality (spelled out below)
• fn will be similar to f in the compact set, and the rest part of support is
small, so the integral on Chebyshev’s inequality becomes tiny
• we’re working on ℝr, which is a locally compact metric space with
regular measure, so many problems are relatively simple
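• the inequality used here, in one usable form (our phrasing): on the chosen compact set K, µ{x ∈ K : |fn(x) - f(x)| > ε} ≤ (1/ε) ∫K |fn - f| dµ, and the right-hand side goes to 0 by uniform convergence on K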
Theorem 2
• statement let G be any continuous non-constant function from ℝ to ℝ,
then ∑∏r(G) is dµ-dense in Mr for any finite measure µ
• proof immediately follows from Theorem 1 and Lemma 1 if we know
that for any finite measure µ, Cr is dµ-dense in Mr
• this can be proved by similar way that we did in Lemma 1
• So the activation function doesn’t really have to be continuous! we can use the positive indicator (step) function as the activation function
Theorem 3
• statement let 𝛹 be any squashing function, then ∑∏r(𝛹) is uniformly
dense on compacta in Cr and dµ-dense in Mr for any probability measure µ
• proof show that ∑∏r(𝛹) is uniformly dense on compacta in ∑∏r(F), where F is a continuous squashing function
• any continuous squashing function F can be uniformly approximated by some element H of ∑1(𝛹), in the sense of supx∈ℝ |F(x) - H(x)| < ε for ∀ε > 0
• if ∏l(F) can be uniformly approximated by members of ∑∏r(𝛹), then
we’re done
• these proofs can be done by analytic (rather dirty) work
• the rest part will be completed by Lemma 1 and Theorem 2
Theorem 4
• statement let 𝛹 be any squashing function, then ∑r(𝛹) is uniformly
dense on compacta in Cr and dµ-dense in Mr for any probability measure µ
• proof use the fact that a continuous squashing function can be uniformly approximated by a Fourier series on each compact support, and note that cos A cos B = (1/2)[cos(A + B) + cos(A - B)]
• wow! that’s really all!
• we removed the ∏ part by uniformly approximating 𝛹 with some continuous squashing function (using Theorem 3), which can also be uniformly approximated by its Fourier series in each compact set (for more details, see the reference book), plus the product-to-sum cosine identity
Corollaries
• The rest is just some natural corollaries, like:
• it can be extended to general Lp space
• it can be extended to multi-output MLP
• it can be extended to multi-layer MLP
• hmm? actually, Theorem 4 already claims that a single-hidden-layer NN is enough of a universal approximator
Conclusion
• We now know that MLP actually works theoretically!
• MLPs are universal approximators
• Taylor series and Fourier series are also good universal approximators, but those two require detailed information about the original function (or a computable oracle), while an MLP can be learned using only the training data
• me?
• Jamie Seol
• email - jamie@europa.snu.ac.kr
• twitter - @theeluwin
• website - theeluwin.kr
• blog - blog.theeluwin.kr
References
• Hornik, Kurt, Maxwell Stinchcombe, and Halbert White. "Multilayer
feedforward networks are universal approximators." Neural networks 2.5
(1989): 359-366.
• Cybenko, George. "Approximation by superpositions of a sigmoidal
function." Mathematics of control, signals and systems 2.4 (1989): 303-314.
• Rudin, Walter. Principles of mathematical analysis. Vol. 3. New York:
McGraw-Hill, 1964.
• Rudin, Walter. Real and complex analysis. Tata McGraw-Hill Education,
1987.
• Munkres, James R. Topology. Prentice Hall, 2000.
• Gockenbach, Mark S. Finite-dimensional linear algebra. CRC Press, 2011.
• Young, Matt. "The Stone-Weierstrass Theorem." (2006).
• Stein, Elias M., and Rami Shakarchi. "Fourier Analysis, Princeton Lectures in Analysis I." (2003).
