Pattern Recognition and Machine Learning
Eunho Lee
2017-01-03(Tue)
Outline
1 Introduction
Polynomial Curve Fitting
Probability Theory
The Curse of Dimensionality
Decision Theory
Information Theory
2 Appendix C - Properties of Matrices
Determinants
Matrix Derivatives
Eigenvector Equation
1. Introduction
Input data set
x ≡ (x1, · · · , xN)ᵀ
Target data set
t ≡ (t1, · · · , tN)ᵀ
Training set
Input data set + Target data set
1. Introduction
Data Set
⇓
Probability Theory
⇓
Decision Theory
⇓
Pattern Recognition
* Probability theory provides a framework for expressing uncertainty
* Decision theory allows us to exploit the probabilistic representation in order to make predictions that are optimal
1.1. Polynomial Curve Fitting
y(x, w) = w0 + w1 x + w2 x² + · · · + wM x^M = Σ_{j=0}^{M} wj x^j
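A minimal NumPy sketch of this model (the helper name `poly` and the example weights are illustrative, not from the slides):

```python
import numpy as np

def poly(x, w):
    """Evaluate y(x, w) = sum_{j=0}^{M} w_j * x**j for weights w = (w0, ..., wM)."""
    return np.sum(w * x ** np.arange(len(w)))

w = np.array([0.1, -0.4, 2.0, 1.5])   # an arbitrary M = 3 example
print(poly(0.5, w))                   # same value as np.polyval(w[::-1], 0.5)
```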
1.1. Polynomial Curve Fitting
1.1.1. Error Functions
Error Function
E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}²   (1)
Root-mean-square (RMS) error¹
E_RMS = √(2 E(w*) / N)   (2)
* allows us to compare data sets of different sizes
* is measured on the same scale as the target variable
¹ w* is the unique solution that minimizes the error function
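A short sketch of the two error measures (1) and (2) for a polynomial model with weights ordered (w0, ..., wM); the function names are illustrative:

```python
import numpy as np

def sum_of_squares_error(w, x, t):
    """E(w) = (1/2) sum_n {y(xn, w) - tn}^2, eq. (1)."""
    y = np.polyval(w[::-1], x)                  # w is ordered (w0, ..., wM)
    return 0.5 * np.sum((y - t) ** 2)

def rms_error(w, x, t):
    """E_RMS = sqrt(2 E(w) / N), eq. (2): comparable across data-set sizes."""
    return np.sqrt(2.0 * sum_of_squares_error(w, x, t) / len(x))
```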
1.1. Polynomial Curve Fitting
1.1.2. Over-Fitting
Over-fitting: a high-order polynomial can fit the training points exactly yet give a very poor representation of the underlying function
1.1. Polynomial Curve Fitting
1.1.3. Modified Error Function
Modified Error Function
E(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ‖w‖²   (3)
The regularization term prevents the coefficients w from becoming large
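A sketch of the regularized error (3) and its closed-form minimizer; here the penalty is applied to all coefficients including w0, and the names `regularized_error` / `fit_ridge` are illustrative:

```python
import numpy as np

def regularized_error(w, x, t, lam):
    """E(w) = (1/2) sum_n {y(xn, w) - tn}^2 + (lam/2) * ||w||^2, eq. (3)."""
    y = np.polyval(w[::-1], x)
    return 0.5 * np.sum((y - t) ** 2) + 0.5 * lam * np.dot(w, w)

def fit_ridge(x, t, M, lam):
    """Minimizer of (3): solve (Phi^T Phi + lam I) w = Phi^T t with Phi[n, j] = xn**j."""
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
```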
1.2. Probability Theory
Sum Rule
p(X) = Σ_Y p(X, Y)
Product Rule
p(X, Y ) = p(Y | X)p(X)
Bayes’ Theorem
p(Y | X) = p(X | Y) p(Y) / p(X)
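The three rules can be checked on a small discrete joint distribution; the numbers below are an arbitrary illustration:

```python
import numpy as np

p_xy = np.array([[0.10, 0.30],      # joint p(X, Y): rows index X, columns index Y
                 [0.20, 0.40]])

p_x = p_xy.sum(axis=1)                    # sum rule:     p(X) = sum_Y p(X, Y)
p_y = p_xy.sum(axis=0)
p_y_given_x = p_xy / p_x[:, None]         # product rule: p(Y | X) = p(X, Y) / p(X)
p_x_given_y = p_xy / p_y[None, :]

# Bayes' theorem: p(Y | X) = p(X | Y) p(Y) / p(X)
print(np.allclose(p_x_given_y * p_y / p_x[:, None], p_y_given_x))   # True
```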
1.2. Probability Theory
Expectation
E[f] = Σ_x p(x) f(x) ≈ (1/N) Σ_{n=1}^{N} f(xn)
Variance
var[f] = E[(f(x) − E[f(x)])²]
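A quick Monte Carlo sketch of these two quantities (the choice of p(x) and of f is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(1.0, 2.0, size=100_000)   # samples x_n drawn from some p(x)
f = x ** 2                               # values f(x_n) for the function of interest

E_f = np.mean(f)                         # E[f]   ~ (1/N) sum_n f(x_n)
var_f = np.mean((f - E_f) ** 2)          # var[f] = E[(f(x) - E[f(x)])^2]
```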
1.2. Probability Theory
1.2.1. Two Different Probabilities
1. Frequentist probability (valid only for large data sets)
p(x) = occurrence(x) / N
2. Bayesian probability²
p(w | D) = p(D | w) p(w) / p(D)
² p(w | D): posterior, p(w): prior, p(D | w): likelihood function, p(D): evidence
1.2. Probability Theory
1.2.2. The Gaussian Distribution
The Gaussian distribution (single real-valued variable x)
N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp{−(1/(2σ²)) (x − µ)²}
The Gaussian distribution (D-dimensional vector x)
N(x | µ, Σ) = (1 / ((2π)^{D/2} |Σ|^{1/2})) exp{−(1/2) (x − µ)ᵀ Σ⁻¹ (x − µ)}
where µ is the mean, σ² is the variance, and Σ is the D × D covariance matrix
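Both densities written out directly in NumPy (function names are illustrative; `scipy.stats.norm` / `multivariate_normal` would give the same values):

```python
import numpy as np

def gauss_1d(x, mu, sigma2):
    """N(x | mu, sigma^2) for a single real-valued x."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2.0 * np.pi * sigma2)

def gauss_nd(x, mu, Sigma):
    """N(x | mu, Sigma) for a D-dimensional vector x."""
    D = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```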
1.2. Probability Theory
1.2.2. The Gaussian Distribution
In the pattern recognition problem, suppose that the observations are
drawn independently from a Gaussian distribution.
Because our data set is i.i.d, we obtain the likelihood function
p(x | µ, σ²) = Π_{n=1}^{N} N(xn | µ, σ²)   (4)
1.2. Probability Theory
1.2.2. The Gaussian Distribution
Our goal is to find µML and σ²ML, which maximize the likelihood function (4).
Taking the log for convenience,
ln p(x | µ, σ²) = −(1/(2σ²)) Σ_{n=1}^{N} (xn − µ)² − (N/2) ln σ² − (N/2) ln(2π)   (5)
Maximizing (5) with respect to µ gives
µML = (1/N) Σ_{n=1}^{N} xn   (the sample mean)
Similarly, maximizing with respect to σ² gives
σ²ML = (1/N) Σ_{n=1}^{N} (xn − µML)²   (the sample variance)
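The two closed-form estimators in code (the helper name is illustrative):

```python
import numpy as np

def gaussian_mle(x):
    """Closed-form ML estimates: sample mean and (biased) sample variance."""
    mu_ml = np.mean(x)                        # (1/N) sum_n x_n
    sigma2_ml = np.mean((x - mu_ml) ** 2)     # (1/N) sum_n (x_n - mu_ML)^2
    return mu_ml, sigma2_ml
```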
1.2. Probability Theory
1.2.2. The Gaussian Distribution
Therefore,
E[µML] = µ
E[σ²ML] = ((N − 1)/N) σ²
The figure shows how the maximum-likelihood estimate of the variance is biased
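A short simulation, assuming a true N(0, 4) distribution and N = 5 points per data set, that reproduces the bias factor (N − 1)/N:

```python
import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma2 = 5, 200_000, 4.0

x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, N))
sigma2_ml = np.mean((x - x.mean(axis=1, keepdims=True)) ** 2, axis=1)

print(sigma2_ml.mean())        # close to 3.2
print((N - 1) / N * sigma2)    # 3.2, i.e. E[sigma2_ML] = ((N - 1)/N) * sigma^2
```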
1.2. Probability Theory
1.2.3. Curve Fitting (re-visited)
Assume that the target value t has a Gaussian distribution:
p(t | x, w, β) = N(t | y(x, w), β⁻¹)   (6)
1.2. Probability Theory
1.2.3. Curve Fitting (re-visited)
Because the data are drawn independently, the likelihood is given by
p(t | x, w, β) = Π_{n=1}^{N} N(tn | y(xn, w), β⁻¹)   (7)
1. Maximum Likelihood Method
Our goal is to maximize the likelihood function (7).
Taking the log for convenience,
ln p(t | x, w, β) = −(β/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (N/2) ln β − (N/2) ln(2π)   (8)
1.2. Probability Theory
1.2.3. Curve Fitting (re-visited)
Maximizing (8) with respect to w is equivalent to minimizing the sum-of-squares error function (1).
Maximizing (8) with respect to β gives
1/βML = (1/N) Σ_{n=1}^{N} {y(xn, wML) − tn}²
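A sketch of both quantities for a polynomial basis (names are illustrative): w_ML from ordinary least squares and 1/β_ML from the mean squared residual.

```python
import numpy as np

def fit_ml(x, t, M):
    """w_ML minimizes the sum-of-squares error; 1/beta_ML is the mean squared residual."""
    Phi = np.vander(x, M + 1, increasing=True)          # Phi[n, j] = xn**j
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    inv_beta_ml = np.mean((Phi @ w_ml - t) ** 2)        # (1/N) sum_n {y(xn, w_ML) - tn}^2
    return w_ml, inv_beta_ml
```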
1.2. Probability Theory
1.2.3. Curve Fitting (re-visited)
2. Maximum A Posteriori (MAP) Method
Using Bayes’ theorem, the posterior is
p(w | x, t, α, β) ∝ p(t | x, w, β)p(w | α) (9)
Let’s introduce a prior
p(w | α) = N(w | 0, α⁻¹ I) = (α / (2π))^{(M+1)/2} exp{−(α/2) wᵀw}   (10)
Our goal is to maximize the posterior (9)
1.2. Probability Theory
1.2.3. Curve Fitting (re-visited)
Maximizing the posterior with respect to w
⇔ Maximizing p(t | x, w, β) p(w | α), by (9)
⇔ Minimizing −ln(p(t | x, w, β) p(w | α))
⇔ Minimizing (β/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (α/2) wᵀw − (N/2) ln β + (N/2) ln(2π) − ln(α/(2π))^{(M+1)/2}
⇔ Minimizing (β/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (α/2) wᵀw
which is equivalent to minimizing the modified error function (3) with λ = α/β
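A numerical check of this equivalence on synthetic data (the values of α, β and the data below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 10)
M, alpha, beta = 9, 5e-3, 11.1

Phi = np.vander(x, M + 1, increasing=True)

# MAP: minimize (beta/2)*||Phi w - t||^2 + (alpha/2)*||w||^2
w_map = np.linalg.solve(beta * Phi.T @ Phi + alpha * np.eye(M + 1), beta * Phi.T @ t)

# Modified error function (3) with lambda = alpha / beta
lam = alpha / beta
w_reg = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)

print(np.allclose(w_map, w_reg))   # True: the two objectives have the same minimizer
```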
1.2. Probability Theory
1.2.4. Bayesian Curve Fitting
Our goal is to predict t, so we evaluate the predictive distribution
p(t | x, x, t) = ∫ p(t | x, w) p(w | x, t) dw   (11)
The right-hand side of (11) can be evaluated analytically, giving
p(t | x, x, t) = N(t | m(x), s²(x))
where the mean and variance are given by
m(x) = β φ(x)ᵀ S Σ_{n=1}^{N} φ(xn) tn
s²(x) = β⁻¹ + φ(x)ᵀ S φ(x)
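A sketch of m(x) and s²(x) for a polynomial basis φ(x) = (1, x, ..., x^M)ᵀ. The matrix S is not defined on this slide; the code assumes the standard PRML form S⁻¹ = αI + β Σₙ φ(xn)φ(xn)ᵀ.

```python
import numpy as np

def predictive(x_new, x, t, M, alpha, beta):
    """Bayesian predictive mean m(x) and variance s^2(x) for a polynomial basis.
    Assumes S^{-1} = alpha*I + beta * sum_n phi(x_n) phi(x_n)^T (PRML eq. 1.72)."""
    Phi = np.vander(x, M + 1, increasing=True)               # row n is phi(x_n)^T
    S_inv = alpha * np.eye(M + 1) + beta * Phi.T @ Phi
    phi_new = x_new ** np.arange(M + 1)                      # phi(x) for the new input
    m = beta * phi_new @ np.linalg.solve(S_inv, Phi.T @ t)   # beta * phi^T S sum_n phi(x_n) t_n
    s2 = 1.0 / beta + phi_new @ np.linalg.solve(S_inv, phi_new)
    return m, s2
```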
1.3. The Curse of Dimensionality
The severe difficulties that can arise in spaces of many dimensions
How much data do I need in order to classify reliably?
1.4. Decision Theory
1.4.1. Minimizing the misclassification rate
Rk denotes the decision regions and Ck the k-th class
The boundaries between decision regions are called decision boundaries
Each decision region need not be contiguous
Assign x to the class Ck that maximizes p(x, Ck)
1.4. Decision Theory
1.4.1. Minimizing the misclassification rate
p(mistake) = ∫_{R1} p(x, C2) dx + ∫_{R2} p(x, C1) dx
p(x, Ck) = p(Ck | x)p(x)
1.4. Decision Theory
1.4.2. Minimizing the expected loss
Different decisions can incur very different losses
Minimize the expected loss by using the loss matrix L
E[L] = Σ_k Σ_j ∫_{Rj} Lkj p(x, Ck) dx
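For a single input, the rule is: assign x to the class j that minimizes Σ_k Lkj p(Ck | x). A small sketch with an illustrative asymmetric loss matrix:

```python
import numpy as np

# L[k, j] = loss for deciding class j when the true class is C_k (illustrative values).
L = np.array([[0.0,   1.0],
              [100.0, 0.0]])

def decide(posteriors, L):
    """Choose the class j minimizing the expected loss sum_k L[k, j] * p(C_k | x)."""
    return int(np.argmin(posteriors @ L))

print(decide(np.array([0.9, 0.1]), L))   # 1: the heavy loss in row 1 overrides the posterior
```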
1.4. Decision Theory
1.4.3. The Reject Option
If θ < 1/K, then there is no reject region (K is the number of classes)
If θ = 1, then every input is rejected
1.5. Information Theory
Entropy of the event x (the quantity of information)
h(x) = −log₂ p(x)
Entropy of the random variable x
H[x] = −Σ_x p(x) log₂ p(x)
For a given variance, the distribution that maximizes the differential entropy is the Gaussian
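A minimal sketch of the discrete entropy in bits (the helper name is illustrative):

```python
import numpy as np

def entropy_bits(p):
    """H[x] = -sum_x p(x) log2 p(x); states with p(x) = 0 contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

print(entropy_bits([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits (uniform over 4 states)
print(entropy_bits([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
```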
2. Appendix C - Properties of Matrices
Determinants
Matrix Derivatives
Eigenvector Equation
2.1. Determinants
If A and B are matrices of size N × M, then
|I_N + ABᵀ| = |I_M + AᵀB|
A useful special case is
|I_N + abᵀ| = 1 + aᵀb
where a and b are N-dimensional column vectors
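Both identities are easy to check numerically with random matrices:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 5, 3
A = rng.normal(size=(N, M))
B = rng.normal(size=(N, M))

lhs = np.linalg.det(np.eye(N) + A @ B.T)
rhs = np.linalg.det(np.eye(M) + A.T @ B)
print(np.isclose(lhs, rhs))                                               # True

a, b = rng.normal(size=N), rng.normal(size=N)
print(np.isclose(np.linalg.det(np.eye(N) + np.outer(a, b)), 1 + a @ b))   # True
```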
2.2. Matrix Derivatives
1. ∂/∂x (xᵀa) = ∂/∂x (aᵀx) = a
2. ∂/∂x (AB) = (∂A/∂x) B + A (∂B/∂x)
3. ∂/∂x (A⁻¹) = −A⁻¹ (∂A/∂x) A⁻¹
4. ∂/∂A Tr(AB) = Bᵀ
5. ∂/∂A Tr(A) = I
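Identity 4 can be verified with a simple finite-difference check (the others can be checked the same way):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 3))
eps = 1e-6

grad = np.zeros_like(A)          # finite-difference gradient of f(A) = Tr(AB)
for i in range(3):
    for j in range(3):
        dA = np.zeros_like(A)
        dA[i, j] = eps
        grad[i, j] = (np.trace((A + dA) @ B) - np.trace(A @ B)) / eps

print(np.allclose(grad, B.T, atol=1e-4))   # True: d/dA Tr(AB) = B^T
```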
2.3. Eigenvector Equation
For a square matrix A, the eigenvector equation is defined by
Aui = λi ui
where ui is an eigenvector and λi is the corresponding eigenvalue
2.3. Eigenvector Equation
If A is symmetric, it can be diagonalized as
A = U Λ Uᵀ
By using that equation, we obtain
1. |A| = Π_{i=1}^{M} λi
2. Tr(A) = Σ_{i=1}^{M} λi
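A quick NumPy check of both relations (and of the orthonormality used on the next slide) for a random symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(4, 4))
A = X + X.T                      # a real symmetric matrix

lam, U = np.linalg.eigh(A)       # A = U diag(lam) U^T with orthonormal columns in U

print(np.isclose(np.linalg.det(A), np.prod(lam)))   # |A|   = prod_i lambda_i
print(np.isclose(np.trace(A), np.sum(lam)))         # Tr(A) = sum_i lambda_i
print(np.allclose(U.T @ U, np.eye(4)))              # u_i^T u_j = I_ij
```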
2.3. Eigenvector Equation
1. The eigenvalues of symmetric matrices are real
2. The eigenvectors ui satisfy uiᵀ uj = Iij