Chapter 12
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 12. Continuous Latent Variables
2
What are we doing?
This chapter should feel familiar to us.
Unlike the discrete latent variables we covered in chapters 9 and 10, here we work with 'continuous' latent variables.
We have already met such variables in our multivariate statistical analysis class, in the form of PCA.
First, let's think about what a 'continuous latent variable' is.
Consider the figure of the digit 3 above.
The images are all 3s; they differ only by 'rotation', 'vertical translation', and 'horizontal translation'.
We can say these changes correspond to three degrees of freedom and can be referred to as an $\mathbb{R}^3$ latent space!
Note that the changes are not perfectly observed. That is, they come with 'noise'. Thus, it is important to account for the noise and errors!
Now, let’s take a deeper look at some methods.
Chapter 12.1. Principal Component Analysis
3
What is PCA?
Originally, PCA was proposed to overcome the problem of 'multicollinearity'.
That is, we re-organize the features into uncorrelated vectors!
Here, let's look at it only from the perspective of 'dimensionality reduction'.
Goal of PCA.
PCA1. We find the projections which maximize the variance.
PCA2. We find the projections which minimize the error.
The aims of the two approaches are different.
However, we can see that the results are the same!!
Let's take a look at how these approaches differ.
PCA is a method which projects each data point onto the eigenvectors of the covariance matrix.
The reason why 'eigenvectors of the covariance matrix' are used will be covered soon.
If we use all the eigenvectors, we are simply changing the entire feature vectors into the same number of uncorrelated vectors.
If we select only some of them, we are performing dimensionality reduction!
Chapter 12.1. Principal Component Analysis
4
Variance maximization
Let's begin with projection onto a one-dimensional space ($M = 1$).
We project each data point onto a vector $u_1$ of unit length ($u_1^T u_1 = 1$).
Here, the variance of the projected data can be written as
As mentioned, our primary goal is to find the $u_1$ which maximizes the variance of the resulting projections. This can be obtained by constrained optimization!
By introducing a Lagrange multiplier and setting the derivative to zero, we get
That is, the resulting $u_1$ is an eigenvector
of the covariance matrix $S$.
Here, the variance can be calculated as
What does this mean? It indicates that the resulting variance equals the eigenvalue of
the corresponding eigenvector. Thus, we should pick eigenvectors starting from the one with
the largest eigenvalue, in descending order of eigenvalue!
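To make the variance-maximization view concrete, here is a minimal numpy sketch (my own illustration, not from the slides; names such as X and n_components are mine): center the data, eigendecompose the sample covariance $S$, and keep the directions with the largest eigenvalues.

```python
import numpy as np

def pca_max_variance(X, n_components):
    """Project X (N x D) onto the top-M eigenvectors of the sample covariance S."""
    X_centered = X - X.mean(axis=0)            # center the data
    S = np.cov(X_centered, rowvar=False)       # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # eigenvalues in descending order
    U = eigvecs[:, order[:n_components]]       # u_1, ..., u_M (unit length, u^T u = 1)
    Z = X_centered @ U                         # projected coordinates
    return Z, U, eigvals[order]

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U, lam = pca_max_variance(X, n_components=2)
print(Z.var(axis=0, ddof=1))   # variances of the projections ...
print(lam[:2])                 # ... match the two largest eigenvalues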
Chapter 12.1. Principal Component Analysis
5
Error minimization
Here, we arrive at the same eigenvector formulation.
However, this time we try to minimize the error which arises from reducing the dimensionality!
** Then, why does such an error occur?
Consider data with 7 features. If we reduce it to 5 new features, there will obviously be a difference!
If we assume the basis is complete, every data point can be represented as a linear combination of the basis vectors.
That is, if the complete set of basis vectors is given,
we can set $\alpha$ in the obvious way, namely $\alpha_{nj} = x_n^T u_j$. Then we can write
Now suppose we perform dimensionality reduction. The approximation can be written as
Note that $z$ is defined separately for each
data point, but $b_i$ is a universal constant which is
applied equally to every data point!
Chapter 12.1. Principal Component Analysis
6
Error minimization
Thus, the overall error can be defined as
By setting the derivative to zero, we can re-write $z$ and $b$ as follows…
By applying this new notation, we achieve… Thus, the error we are trying to minimize is…
Again, we can see this has the same form as the previous variance-maximization case, so we can again use a Lagrange multiplier.
Chapter 12.1. Principal Component Analysis
7
Interpretation & Application
Here, we can gain some interesting insights.
1. First, we tried to maximize the variance of the transformed data.
2. From a different perspective, we tried to minimize the reconstruction error.
3. The results were the same! Furthermore, we can interpret the eigenvalues of the discarded eigenvectors as the loss of information.
4. That is why we can interpret the chosen eigenvalues as the explained proportion of variance!
We can use PCA for…
- Dimensionality reduction - Data transformation (to uncorrelated features) - Data visualization
This can be applied to multicollinearity issues!
We covered this in regression analysis!
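As a quick numerical check of points 3 and 4 (my own sketch, with made-up data): the sum of the discarded eigenvalues equals the average squared reconstruction error, and the kept eigenvalues give the explained proportion.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # descending order

M = 3
U = eigvecs[:, :M]                                      # keep the top-M eigenvectors
X_hat = Xc @ U @ U.T                                    # rank-M reconstruction

explained = eigvals[:M].sum() / eigvals.sum()           # proportion of variance explained
recon_err = ((Xc - X_hat) ** 2).sum() / (len(Xc) - 1)   # match np.cov's 1/(N-1) scaling
print(explained)
print(recon_err, eigvals[M:].sum())                     # error == sum of discarded eigenvalues
```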
Chapter 12.2. Probabilistic PCA
8
Idea of probabilistic PCA
In fact, the PCA described so far is not really 'statistics'.
Rather, it is pure linear algebra, done with plain mathematics.
Now, let's view PCA from the perspective of statistics and probability!
We consider a Gaussian latent variable with prior $p(z) = N(z \mid 0, I)$.
Similarly, we define the conditional distribution of the data as $p(x \mid z) = N(x \mid Wz + \mu, \sigma^2 I)$.
Note that the data is a linear combination of the latent variables, plus noise!
Thus, we can define our data by…
$x = Wz + \mu + \epsilon$
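A small sketch of this generative story (my own illustration; $W$, $\mu$, $\sigma^2$ below are made up): draw $z \sim N(0, I)$, then $x = Wz + \mu + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$. The sample covariance of $x$ should then approach $C = WW^T + \sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, N = 4, 2, 50_000
W = rng.normal(size=(D, M))            # hypothetical loading matrix
mu = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical mean
sigma2 = 0.3                           # hypothetical noise variance

Z = rng.normal(size=(N, M))                           # z ~ N(0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))  # eps ~ N(0, sigma^2 I)
X = Z @ W.T + mu + eps                                # x = W z + mu + eps

C = W @ W.T + sigma2 * np.eye(D)       # implied marginal covariance
print(np.abs(np.cov(X, rowvar=False) - C).max())  # small: empirical covariance matches C
```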
Chapter 12.2. Probabilistic PCA
9
Estimating parameters
Here, we have to estimate $W$, $\mu$, and $\sigma^2$.
The likelihood function can be written as
** Detailed calculations are skipped!
This can be derived simply by…
Here, an orthogonal transformation $R$ of $W$ does not affect the
resulting distribution.
Thus, this estimation is independent of $R$.
Chapter 12.2. Probabilistic PCA
10
Estimating parameters
For the evaluation of the predictive distribution and the posterior, we need the inverse of $C$. By using the fact that $C^{-1} = \sigma^{-2} I - \sigma^{-2} W M^{-1} W^T$, where $M = W^T W + \sigma^2 I$,
the resulting posterior would be…
Here, note that the only observed variable is $x_n$.
The rest are not observed; they exist only as random variables or unknown constants!
Note that this is quite similar to Bayesian regression, but here $Wz + \mu$ is entirely unobserved,
so the estimation process is a bit different!
Chapter 12.2. Probabilistic PCA
11
Maximum likelihood solution
From the marginal distribution of $x$, we can estimate $W$, $\mu$, and $\sigma^2$.
Here, the closed-form solution for $\mu$ is the sample mean $\bar{x}$.
This is really intuitive!
By substituting $\bar{x}$ for $\mu$, we get the following
equations for the density. Note that $S$
corresponds to the sample covariance matrix of $X$.
Direct estimation of $W$ is rather involved, but the maximum likelihood solution is known to be
- Here, $U_M$ is the matrix whose columns are eigenvectors of $S$.
- $L_M$ is a diagonal matrix of the corresponding eigenvalues.
- $R$ is an arbitrary orthogonal matrix. (Honestly, I cannot see where this
arbitrary $R$ matrix comes from…)
Chapter 12.2. Probabilistic PCA
12
Maximum likelihood solution
For the variance term, it can be computed as…
There is one notable fact.
Whatever orthogonal $R$ we choose when computing the estimate of $W$, the product $WW^T$ is invariant to $R$ (since $R$ is orthogonal, $RR^T = I$).
Note that if we set $R$ to the identity matrix, $W$ reduces to the basic PCA solution, up to scaling!
That is, each direction is preserved as an eigenvector; only its length is scaled by $\sqrt{\lambda_i - \sigma^2}$.
Let's revisit the form of the covariance matrix once again.
As we have seen, $C = WW^T + \sigma^2 I$. Consider how a unit vector $v$ ($v^T v = 1$) interacts with the covariance matrix $C$.
If $v$ is orthogonal to the retained eigenvectors, it yields $v^T C v = \sigma^2$, since the first term vanishes. These directions carry only noise.
If $v$ is a principal direction, $v = u_i$, then the variance becomes $(\lambda_i - \sigma^2) + \sigma^2 = \lambda_i$, which means the model captures the variance exactly!
Then what does this mean??
In my opinion, this is about how well the latent variable $z$ is estimated.
If $z$ is estimated correctly, meaning it is aligned with an eigenvector of the covariance matrix, then the covariance of the reconstructed values ($x$) will be
similar to that of the original data.
On the other hand, if $z$ is estimated wrongly (meaning it is orthogonal to the retained PCs), then the resulting variance is only $\sigma^2$, which is just
the noise term.
$x = Wz + \mu + \epsilon$
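To sanity-check these claims numerically, here is a sketch (my own code on synthetic data): build $W = U_M (L_M - \sigma^2 I)^{1/2} R$ from the eigendecomposition of $S$, with $\sigma^2$ set to the average of the discarded eigenvalues, and verify that $C = WW^T + \sigma^2 I$ does not depend on the choice of the orthogonal matrix $R$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

D, M = S.shape[0], 2
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]              # eigenvalues in descending order

sigma2 = lam[M:].mean()                     # noise variance: average of discarded eigenvalues
U_M, L_M = U[:, :M], np.diag(lam[:M])

def W_of(R):
    return U_M @ np.sqrt(L_M - sigma2 * np.eye(M)) @ R

R = np.linalg.qr(rng.normal(size=(M, M)))[0]   # an arbitrary orthogonal matrix R
C_identity = W_of(np.eye(M)) @ W_of(np.eye(M)).T + sigma2 * np.eye(D)
C_rotated  = W_of(R) @ W_of(R).T + sigma2 * np.eye(D)
print(np.allclose(C_identity, C_rotated))      # True: C is invariant to R
```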
Chapter 12.2. Probabilistic PCA
13
Intuitive understanding
Let's see whether the resulting parameters fit our intuition.
First, what if we use a full latent vector, i.e. the dimension of the latent vector equals that of the original vector? Then…
This exactly recovers the covariance of the original data!
Secondly, unlike original PCA, which projects the data onto a reduced-dimensional space, probabilistic PCA maps the 'latent vectors' into the data space.
Which means,
On the other hand, if we let $\sigma^2 \to 0$, the posterior mean becomes…
which means we are projecting the data onto the estimated $W$ space!
Chapter 12.2. Probabilistic PCA
14
EM algorithm for PCA
You may wonder why we apply EM to PCA despite knowing the closed form of each parameter.
It's because EM is sometimes computationally more efficient!
Here, we treat $z$ as the latent variable, and $W$, $\mu$, and $\sigma^2$
as our parameters!
The parameters can be computed as
Complete log-likelihood (to be maximized)
E-Step
M-Step
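A compact sketch of the E and M steps above (my own implementation of the standard probabilistic-PCA EM updates; variable names are mine). The E-step uses $M = W^T W + \sigma^2 I$ to compute $E[z_n]$ and $E[z_n z_n^T]$; the M-step re-estimates $W$ and $\sigma^2$.

```python
import numpy as np

def ppca_em(X, M, n_iter=200, seed=0):
    """EM for probabilistic PCA on X (N x D); returns W (D x M) and sigma^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)                 # mu is estimated by the sample mean
    W = rng.normal(size=(D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                                  # N x M, rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
        # M-step: re-estimate W and sigma^2
        W_new = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# toy usage: data near a 2-D subspace with small isotropic noise
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
W, sigma2 = ppca_em(X, M=2)
print(sigma2)   # roughly the true noise variance 0.1**2
```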
Chapter 12.2. Probabilistic PCA
15
Bayesian PCA
Naturally, we can also view PCA from a Bayesian perspective.
This may help in deciding the appropriate dimension of the latent vectors, with the help of the evidence!
However, to obtain a Bayesian model, we need to marginalize the distribution over the model parameters.
To make such computations tractable, we use ARD (automatic relevance determination). / Detailed calculations have been skipped!
That is, a separate prior (with precision $\alpha_i$) is defined for each column of the $W$ matrix,
where $A$ is the diagonal matrix filled with the $\alpha_i$.
Then, as usual, we find appropriate $\alpha$ values after integrating $W$ out of the likelihood. (This can also be computed using Gibbs sampling!)
Recall that $p(x \mid z) = N(x \mid Wz + \mu, \sigma^2 I)$ also contains $W$, so we still need an estimate of it.
So we obtain the estimates in a sequential way, like EM!
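As a rough sketch of that sequential procedure (hedged: this shows only the standard ARD re-estimation $\alpha_i = D / (w_i^T w_i)$, which in practice is interleaved with EM updates of $W$; the matrix below is a hypothetical current estimate, not a fitted one):

```python
import numpy as np

def ard_alpha_update(W):
    """Re-estimate the ARD precisions alpha_i = D / (w_i^T w_i) for each column w_i of W.
    Columns whose alpha_i blows up are effectively switched off (their w_i -> 0)."""
    D = W.shape[0]
    return D / np.sum(W ** 2, axis=0)

# hypothetical current estimate of W (D=5, M=3); the third column is nearly zero,
# so its alpha becomes very large and that latent dimension is pruned away
W = np.array([[1.0, 0.2, 1e-4],
              [0.8, 0.1, 1e-4],
              [1.2, 0.3, 1e-4],
              [0.9, 0.2, 1e-4],
              [1.1, 0.1, 1e-4]])
print(ard_alpha_update(W))   # small alphas for useful columns, a huge alpha for the dead column
```

This is how the effective dimensionality of the latent space is determined automatically.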
Chapter 12.2. Probabilistic PCA
16
Factor Analysis
We have already studied this topic in detail in multivariate statistics!
It can be expressed by…
Likewise, we assume the data is formed as a linear combination of some
unobserved factors plus an error term, and we back-track those factors
from the data!
This is really similar to probabilistic PCA.
(Diagram: unobserved factors such as 'Intelligence' and 'Language skill' generate observed data such as 'IQ-test', 'Soo-Neung', and 'GPA' scores.)
Here, $\Psi$ is a $D \times D$ diagonal matrix called the 'uniqueness', representing the independent noise of each observed variable.
Furthermore, $W$ is called the 'factor loading' matrix.
Similarly, we can find that $p(x) = N(x \mid \mu, WW^T + \Psi)$.
Then, we can find…
Chapter 12.3. Kernel PCA
17
Kernel-based approach of PCA
We studied kernels, especially valid kernels (kernels for which we do not need to compute $\phi(x)$ explicitly), in chapters 6 and 7.
Kernels can also be applied to PCA, which means we project the data points into a 'non-linear' feature space!
As you can see, it is natural to express $x_n x_n^T$ in kernel form!
Chapter 12.3. Kernel PCA
18
Calculation
We map the data through a non-linear feature mapping $\phi(x_n)$.
As we have studied, explicitly computing the feature map is not a good idea, since some kernels (like the Gaussian) send the feature vector to an infinite-dimensional space
and compute an inner product there.
First, let's assume the mapped data have zero mean, $\sum_n \phi(x_n) = 0$. (This is unrealistic, but we will relax this condition later.)
As we have seen, the resulting covariance matrix and eigenvalue equation can be written as
The last equation can be re-written as shown. Note that the eigenvector is a linear
combination of the $\phi(x_n)$,
and the bracketed term is a scalar.
However, there still remains a bare $\phi(x_n)$ term.
We have to bring everything into kernel form!
We do this by multiplying both sides by $\phi(x_l)^T$ (taking inner products)…
Chapter 12.3. Kernel PCA
19
Calculation
In matrix notation, this can be written as…
There is a factor of $K$ on both sides of the equation.
We can remove it by multiplying both sides by $K^{-1}$.
Assuming the kernel (Gram) matrix is positive definite ($\lambda_i > 0$), its inverse always exists!
** Note that $\det K \neq 0$ when $K$ is positive definite.
Thus, the overall equation can be expressed as…
Normalization condition for the $a_i$.
Computing the projection of a data point onto the kernel principal components.
Chapter 12.3. Kernel PCA
20
Non-centered data vector
We have assumed the $\phi(x_n)$ are all centered.
However, real-world data is not centered in most cases. Thus, we have to center it in order to compute the covariance matrix $C$ easily.
Note that here, again, we have to avoid computing $\phi(x_n)$ explicitly!
On the left: the original form of the centered features. Let's re-write the Gram matrix $K$ in terms of these tilde values!
We can write the left-side equations in this matrix notation,
where $1_N$ denotes the $N \times N$ matrix in which
every element is $1/N$.
Note that computing such a Gram matrix is not always feasible,
since it is an $N \times N$ matrix.
Under big-data conditions it may be intractable,
so we may resort to approximation methods.
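A minimal numpy sketch of the centering and eigendecomposition steps above (my own code, here with a Gaussian kernel and a made-up bandwidth). The matrix `one_N` is the $1_N$ just described, and the eigenvectors $a_i$ of the centered Gram matrix are rescaled so that the corresponding feature-space eigenvectors have unit length.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Kernel PCA with an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                             # N x N Gram matrix

    one_N = np.full((N, N), 1.0 / N)                          # the "1_N" matrix (every entry 1/N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centered Gram matrix

    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order
    # rescale a_i so the feature-space eigenvector sum_n a_in * phi(x_n) has unit norm
    A = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return K_tilde @ A                                        # projections of the training points

# toy usage: two concentric circles
theta = np.linspace(0, 2 * np.pi, 100)
X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],
               np.c_[3 * np.cos(theta), 3 * np.sin(theta)]])
Z = kernel_pca(X, n_components=2, gamma=0.5)
print(Z[:3], Z[-3:])   # projections for points from the inner and outer circle
```

With a suitable bandwidth, the inner and outer circles should separate in the leading kernel components, even though no linear projection of the raw coordinates can do so.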
Chapter 12.3. Kernel PCA
21
Example of kernel PCA
In this example, we used a Gaussian
kernel of the form shown.
The lines (a bit blurred; please look at the
figure in the textbook) are the PCs.
Check that the PC lines capture the
original data distribution well!
Let's see a more practical example.
Chapter 12.3. Kernel PCA
22
Example of kernel PCA
This example is from the Wikipedia article on kernel PCA:
https://en.wikipedia.org/wiki/Kernel_principal_component_analysis
That is, we can make such data separable with just a single dimension (the x-axis in the figure)!
Chapter 12.4. Nonlinear Latent Variable Models
23
Non-Gaussian?
Restricting the model and distributions to be linear and Gaussian may limit practical applications.
Thus, let's cover some more general latent variable models, as we shall see shortly.
Independent component analysis
The overall discussion begins from
That is, the latent variables are assumed to be independent!
Note that previously we only imposed independence in the sense of linear decorrelation (orthogonality);
here we assume full probabilistic independence!
The power of this method is that we do not assume a Gaussian structure for the model.
Here, we can assume the data distribution to be
There are not many examples in the
book, so readers who are interested
may want to do some extra study.. ;(
Chapter 12.4. Nonlinear Latent Variable Models
24
Auto-associative neural networks
It is a neural-network model for dimensionality reduction.
The overall form of this model is similar to that of the autoencoder we are familiar with!
In order to train this network, we need a specific error function,
which can be
If this network does not contain any non-linear activation, it ends up very close to
basic PCA. However, there is no orthogonality constraint on the hidden units.
For non-linearity, we can use the model shown on the left.
This model can be interpreted as follows…
Please note that the transformation $F_2$
can be a non-linear embedding, since there
are non-linear units between the
network layers!
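A tiny numpy sketch of the linear case described above (my own toy example, not from the slides): a one-hidden-layer autoencoder with no non-linearity, trained by gradient descent on the squared reconstruction error. With data lying in a 2-D subspace, the reconstruction error goes toward zero, so the learned subspace matches the PCA subspace, even though the hidden units are not constrained to be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))  # data lying in a 2-D subspace of R^6
Xc = X - X.mean(axis=0)
Xc = Xc / Xc.std()                                       # rescale for stable step sizes
N, D, M = Xc.shape[0], Xc.shape[1], 2

W_enc = 0.1 * rng.normal(size=(D, M))   # encoder: x -> hidden z (no non-linearity)
W_dec = 0.1 * rng.normal(size=(M, D))   # decoder: z -> reconstruction x_hat

def mse():
    return np.mean((Xc @ W_enc @ W_dec - Xc) ** 2)

print("before training:", mse())
lr = 0.02
for _ in range(5000):
    Z = Xc @ W_enc
    err = Z @ W_dec - Xc                        # derivative of the squared-error criterion
    grad_dec = (Z.T @ err) / N
    grad_enc = (Xc.T @ (err @ W_dec.T)) / N
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
print("after training: ", mse())                # close to zero: the 2-D subspace is recovered
```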