Chapter 12
Reviewer : Sunwoo Kim
Christopher M. Bishop
Pattern Recognition and Machine Learning
Yonsei University
Department of Applied Statistics
Chapter 12. Continuous Latent Variables
2
What are we doing?
This chapter should feel familiar to us.
Unlike the discrete latent variables we covered in chapters 9 and 10, here we work with 'continuous' latent variables.
We have already met such variables in our multivariate statistical analysis class, in the form of PCA.
First, let's think about what a 'continuous latent variable' is.
Consider the figure of the digit 3 above.
The images are all 3s; they differ only by 'rotation', 'vertical translation', and 'horizontal translation'.
We can say these changes correspond to three degrees of freedom and can be referred to as an $\mathbb{R}^3$ latent space!
Note that the changes are not perfectly observed. That is, they come with 'noise'. Thus, it is important to account for the noise and errors!
Now, let’s take a deeper look at some methods.
Chapter 12.1. Principal Component Analysis
3
What is PCA?
Originally, PCA was proposed to overcome the problem of 'multicollinearity'.
That is, we re-organize the features into uncorrelated vectors!
Here, let's look at it only from the perspective of 'dimensionality reduction'.
Goal of PCA.
PCA1. We find the projections which maximize the variance.
PCA2. We find the projections which minimize the error.
The aims of the two approaches are different.
However, we can see that the results are the same!!
Let's take a look at how these approaches differ.
PCA is a method which projects each data point onto the eigenvectors of the covariance matrix.
The reason why 'eigenvectors of the covariance matrix' are used will be covered soon.
If we use all the eigenvectors, we are simply changing the entire feature vectors into the same number of uncorrelated vectors.
If we select only some of them, we are performing dimensionality reduction!
Chapter 12.1. Principal Component Analysis
4
Variance maximization
Let's begin with projection onto a one-dimensional space ($M = 1$).
We project each data point onto a vector $u_1$ of unit length ($u_1^T u_1 = 1$).
Here, the variance of the projected data can be written as
As mentioned, our primary goal is to find the $u_1$ which maximizes the variance of the resulting projections. This can be obtained by constrained optimization!
By introducing a Lagrange multiplier and setting the derivative to zero, we get
That is, the resulting $u_1$ is an eigenvector
of the covariance matrix $S$.
Here, the variance can be calculated as
What does this mean? It indicates that the resulting variance equals the eigenvalue of
the corresponding eigenvector. Thus, we should pick eigenvectors starting from the one with
the largest eigenvalue, in descending order of eigenvalue!
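To make the variance-maximization view concrete, here is a minimal numpy sketch (my own illustration, not from the slides; names such as X and n_components are mine): center the data, eigendecompose the sample covariance $S$, and keep the directions with the largest eigenvalues.

```python
import numpy as np

def pca_max_variance(X, n_components):
    """Project X (N x D) onto the top-M eigenvectors of the sample covariance S."""
    X_centered = X - X.mean(axis=0)            # center the data
    S = np.cov(X_centered, rowvar=False)       # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)       # eigh: S is symmetric
    order = np.argsort(eigvals)[::-1]          # eigenvalues in descending order
    U = eigvecs[:, order[:n_components]]       # u_1, ..., u_M (unit length, u^T u = 1)
    Z = X_centered @ U                         # projected coordinates
    return Z, U, eigvals[order]

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
Z, U, lam = pca_max_variance(X, n_components=2)
print(Z.var(axis=0, ddof=1))   # variances of the projections ...
print(lam[:2])                 # ... match the two largest eigenvalues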
Chapter 12.1. Principal Component Analysis
5
Error minimization
Here, we arrive at the same eigenvector formulation.
However, this time we try to minimize the error which arises from reducing the dimensionality!
** Then, why does such an error occur?
Consider data with 7 features. If we reduce it to 5 new features, there will obviously be a difference!
If we assume the basis is complete, every data point can be represented as a linear combination of the basis vectors.
That is, if the complete set of basis vectors is given,
we can set $\alpha$ in the obvious way, namely $\alpha_{nj} = x_n^T u_j$. Then we can write
Now suppose we perform dimensionality reduction. The approximation can be written as
Note that $z$ is defined separately for each
data point, but $b_i$ is a universal constant which is
applied equally to every data point!
Chapter 12.1. Principal Component Analysis
6
Error minimization
Thus, the overall error can be defined as
By setting the derivative to zero, we can re-write $z$ and $b$ as follows…
By applying this new notation, we achieve… Thus, the error we are trying to minimize is…
Again, we can see this has the same form as the previous variance-maximization case, so we can again use a Lagrange multiplier.
Chapter 12.1. Principal Component Analysis
7
Interpretation & Application
Here, we can gain some interesting insights.
1. First, we tried to maximize the variance of the transformed data.
2. From a different perspective, we tried to minimize the reconstruction error.
3. The results were the same! Furthermore, we can interpret the eigenvalues of the discarded eigenvectors as the loss of information.
4. That is why we can interpret the chosen eigenvalues as the explained proportion of variance!
We can use PCA for…
- Dimensionality reduction - Data transformation (to uncorrelated features) - Data visualization
This can be applied to multicollinearity issues!
We covered this in regression analysis!
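As a quick numerical check of points 3 and 4 (my own sketch, with made-up data): the sum of the discarded eigenvalues equals the average squared reconstruction error, and the kept eigenvalues give the explained proportion.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]      # descending order

M = 3
U = eigvecs[:, :M]                                      # keep the top-M eigenvectors
X_hat = Xc @ U @ U.T                                    # rank-M reconstruction

explained = eigvals[:M].sum() / eigvals.sum()           # proportion of variance explained
recon_err = ((Xc - X_hat) ** 2).sum() / (len(Xc) - 1)   # match np.cov's 1/(N-1) scaling
print(explained)
print(recon_err, eigvals[M:].sum())                     # error == sum of discarded eigenvalues
```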
Chapter 12.2. Probabilistic PCA
8
Idea of probabilistic PCA
In fact, the PCA described so far is not really 'statistics'.
Rather, it is pure linear algebra, done with plain mathematics.
Now, let's view PCA from the perspective of statistics and probability!
We consider a Gaussian latent variable with prior $p(z) = N(z \mid 0, I)$.
Similarly, we define the conditional distribution of the data as $p(x \mid z) = N(x \mid Wz + \mu, \sigma^2 I)$.
Note that the data is a linear combination of the latent variables, plus noise!
Thus, we can define our data by…
$x = Wz + \mu + \epsilon$
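A small sketch of this generative story (my own illustration; $W$, $\mu$, $\sigma^2$ below are made up): draw $z \sim N(0, I)$, then $x = Wz + \mu + \epsilon$ with $\epsilon \sim N(0, \sigma^2 I)$. The sample covariance of $x$ should then approach $C = WW^T + \sigma^2 I$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, M, N = 4, 2, 50_000
W = rng.normal(size=(D, M))            # hypothetical loading matrix
mu = np.array([1.0, -2.0, 0.5, 3.0])   # hypothetical mean
sigma2 = 0.3                           # hypothetical noise variance

Z = rng.normal(size=(N, M))                           # z ~ N(0, I)
eps = rng.normal(scale=np.sqrt(sigma2), size=(N, D))  # eps ~ N(0, sigma^2 I)
X = Z @ W.T + mu + eps                                # x = W z + mu + eps

C = W @ W.T + sigma2 * np.eye(D)       # implied marginal covariance
print(np.abs(np.cov(X, rowvar=False) - C).max())  # small: empirical covariance matches C
```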
Chapter 12.2. Probabilistic PCA
9
Estimating parameters
Here, we have to estimate $W$, $\mu$, and $\sigma^2$.
The likelihood function can be written as
** Detailed calculations are skipped!
This can be derived simply by…
Here, an orthogonal transformation $R$ of $W$ does not affect the
resulting distribution.
Thus, this estimation is independent of $R$.
Chapter 12.2. Probabilistic PCA
10
Estimating parameters
For the evaluation of the predictive distribution and the posterior, we need the inverse of $C$. By using the fact that $C^{-1} = \sigma^{-2} I - \sigma^{-2} W M^{-1} W^T$, where $M = W^T W + \sigma^2 I$,
the resulting posterior would be…
Here, note that the only observed variable is $x_n$.
The rest are not observed; they exist only as random variables or unknown constants!
Note that this is quite similar to Bayesian regression, but here $Wz + \mu$ is entirely unobserved,
so the estimation process is a bit different!
Chapter 12.2. Probabilistic PCA
11
Maximum likelihood solution
From the marginal distribution of $x$, we can estimate $W$, $\mu$, and $\sigma^2$.
Here, the closed-form solution for $\mu$ is the sample mean $\bar{x}$.
This is really intuitive!
By substituting $\bar{x}$ for $\mu$, we get the following
equations for the density. Note that $S$
corresponds to the sample covariance matrix of $X$.
Direct estimation of $W$ is rather involved, but the maximum likelihood solution is known to be
- Here, $U_M$ is the matrix whose columns are eigenvectors of $S$.
- $L_M$ is a diagonal matrix of the corresponding eigenvalues.
- $R$ is an arbitrary orthogonal matrix. (Honestly, I cannot see where this
arbitrary $R$ matrix comes from…)
Chapter 12.2. Probabilistic PCA
12
Maximum likelihood solution
For the variance term, it can be computed as…
There is one notable fact.
Whatever orthogonal $R$ we choose when computing the estimate of $W$, the product $WW^T$ is invariant to $R$ (since $R$ is orthogonal, $RR^T = I$).
Note that if we set $R$ to the identity matrix, $W$ reduces to the basic PCA solution, up to scaling!
That is, each direction is preserved as an eigenvector; only its length is scaled by $\sqrt{\lambda_i - \sigma^2}$.
Let's revisit the form of the covariance matrix once again.
As we have seen, $C = WW^T + \sigma^2 I$. Consider how a unit vector $v$ ($v^T v = 1$) interacts with the covariance matrix $C$.
If $v$ is orthogonal to the retained eigenvectors, it yields $v^T C v = \sigma^2$, since the first term vanishes. These directions carry only noise.
If $v$ is a principal direction, $v = u_i$, then the variance becomes $(\lambda_i - \sigma^2) + \sigma^2 = \lambda_i$, which means the model captures the variance exactly!
Then what does this mean??
In my opinion, this is about how well the latent variable $z$ is estimated.
If $z$ is estimated correctly, meaning it is aligned with an eigenvector of the covariance matrix, then the covariance of the reconstructed values ($x$) will be
similar to that of the original data.
On the other hand, if $z$ is estimated wrongly (meaning it is orthogonal to the retained PCs), then the resulting variance is only $\sigma^2$, which is just
the noise term.
$x = Wz + \mu + \epsilon$
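To sanity-check these claims numerically, here is a sketch (my own code on synthetic data): build $W = U_M (L_M - \sigma^2 I)^{1/2} R$ from the eigendecomposition of $S$, with $\sigma^2$ set to the average of the discarded eigenvalues, and verify that $C = WW^T + \sigma^2 I$ does not depend on the choice of the orthogonal matrix $R$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
Xc = X - X.mean(axis=0)
S = np.cov(Xc, rowvar=False)

D, M = S.shape[0], 2
lam, U = np.linalg.eigh(S)
lam, U = lam[::-1], U[:, ::-1]              # eigenvalues in descending order

sigma2 = lam[M:].mean()                     # noise variance: average of discarded eigenvalues
U_M, L_M = U[:, :M], np.diag(lam[:M])

def W_of(R):
    return U_M @ np.sqrt(L_M - sigma2 * np.eye(M)) @ R

R = np.linalg.qr(rng.normal(size=(M, M)))[0]   # an arbitrary orthogonal matrix R
C_identity = W_of(np.eye(M)) @ W_of(np.eye(M)).T + sigma2 * np.eye(D)
C_rotated  = W_of(R) @ W_of(R).T + sigma2 * np.eye(D)
print(np.allclose(C_identity, C_rotated))      # True: C is invariant to R
```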
Chapter 12.2. Probabilistic PCA
13
Intuitive understanding
Let's see whether the resulting parameters fit our intuition.
First, what if we use a full latent vector, i.e. the dimension of the latent vector equals that of the original vector? Then…
This exactly recovers the covariance of the original data!
Secondly, unlike original PCA, which projects the data onto a reduced-dimensional space, probabilistic PCA maps the 'latent vectors' into the data space.
Which means,
On the other hand, if we let $\sigma^2 \to 0$, the posterior mean becomes…
which means we are projecting the data onto the estimated $W$ space!
Chapter 12.2. Probabilistic PCA
14
EM algorithm for PCA
You may wonder why we apply EM to PCA despite knowing the closed form of each parameter.
It's because EM is sometimes computationally more efficient!
Here, we treat $z$ as the latent variable, and $W$, $\mu$, and $\sigma^2$
as our parameters!
The parameters can be computed as
Complete log-likelihood (to be maximized)
E-Step
M-Step
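A compact sketch of the E and M steps above (my own implementation of the standard probabilistic-PCA EM updates; variable names are mine). The E-step uses $M = W^T W + \sigma^2 I$ to compute $E[z_n]$ and $E[z_n z_n^T]$; the M-step re-estimates $W$ and $\sigma^2$.

```python
import numpy as np

def ppca_em(X, M, n_iter=200, seed=0):
    """EM for probabilistic PCA on X (N x D); returns W (D x M) and sigma^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    Xc = X - X.mean(axis=0)                 # mu is estimated by the sample mean
    W = rng.normal(size=(D, M))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E-step: posterior moments of the latent variables
        Minv = np.linalg.inv(W.T @ W + sigma2 * np.eye(M))
        Ez = Xc @ W @ Minv                                  # N x M, rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez                 # sum_n E[z_n z_n^T]
        # M-step: re-estimate W and sigma^2
        W_new = Xc.T @ Ez @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2 * np.sum(Ez * (Xc @ W_new))
                  + np.trace(Ezz @ W_new.T @ W_new)) / (N * D)
        W = W_new
    return W, sigma2

# toy usage: data near a 2-D subspace with small isotropic noise
rng = np.random.default_rng(4)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(500, 6))
W, sigma2 = ppca_em(X, M=2)
print(sigma2)   # roughly the true noise variance 0.1**2
```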
Chapter 12.2. Probabilistic PCA
15
Bayesian PCA
Naturally, we can also view PCA from a Bayesian perspective.
This may help in deciding the appropriate dimension of the latent vectors, with the help of the evidence!
However, to obtain a Bayesian model, we need to marginalize the distribution over the model parameters.
To make such computations tractable, we use ARD (automatic relevance determination). / Detailed calculations have been skipped!
That is, a separate prior (with precision $\alpha_i$) is defined for each column of the $W$ matrix,
where $A$ is the diagonal matrix filled with the $\alpha_i$.
Then, as usual, we find appropriate $\alpha$ values after integrating $W$ out of the likelihood. (This can also be computed using Gibbs sampling!)
Recall that $p(x \mid z) = N(x \mid Wz + \mu, \sigma^2 I)$ also contains $W$, so we still need an estimate of it.
So we obtain the estimates in a sequential way, like EM!
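As a rough sketch of that sequential procedure (hedged: this shows only the standard ARD re-estimation $\alpha_i = D / (w_i^T w_i)$, which in practice is interleaved with EM updates of $W$; the matrix below is a hypothetical current estimate, not a fitted one):

```python
import numpy as np

def ard_alpha_update(W):
    """Re-estimate the ARD precisions alpha_i = D / (w_i^T w_i) for each column w_i of W.
    Columns whose alpha_i blows up are effectively switched off (their w_i -> 0)."""
    D = W.shape[0]
    return D / np.sum(W ** 2, axis=0)

# hypothetical current estimate of W (D=5, M=3); the third column is nearly zero,
# so its alpha becomes very large and that latent dimension is pruned away
W = np.array([[1.0, 0.2, 1e-4],
              [0.8, 0.1, 1e-4],
              [1.2, 0.3, 1e-4],
              [0.9, 0.2, 1e-4],
              [1.1, 0.1, 1e-4]])
print(ard_alpha_update(W))   # small alphas for useful columns, a huge alpha for the dead column
```

This is how the effective dimensionality of the latent space is determined automatically.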
Chapter 12.2. Probabilistic PCA
16
Factor Analysis
We have already studied this topic in detail in multivariate statistics!
It can be expressed by…
Likewise, we assume the data is formed as a linear combination of some
unobserved factors plus an error term, and we back-track those factors
from the data!
This is really similar to probabilistic PCA.
(Diagram: unobserved factors such as 'Intelligence' and 'Language skill' generate observed data such as 'IQ-test', 'Soo-Neung', and 'GPA' scores.)
Here, $\Psi$ is a $D \times D$ diagonal matrix called the 'uniqueness', representing the independent noise of each observed variable.
Furthermore, $W$ is called the 'factor loading' matrix.
Similarly, we can find that $p(x) = N(x \mid \mu, WW^T + \Psi)$.
Then, we can find…
Chapter 12.3. Kernel PCA
17
Kernel-based approach of PCA
We studied kernels, especially valid kernels (kernels for which we do not need to compute $\phi(x)$ explicitly), in chapters 6 and 7.
Kernels can also be applied to PCA, which means we project the data points into a 'non-linear' feature space!
As you can see, it is natural to express $x_n x_n^T$ in kernel form!
Chapter 12.3. Kernel PCA
18
Calculation
We map the data through a non-linear feature mapping $\phi(x_n)$.
As we have studied, explicitly computing the feature map is not a good idea, since some kernels (like the Gaussian) send the feature vector to an infinite-dimensional space
and compute an inner product there.
First, let's assume the mapped data have zero mean, $\sum_n \phi(x_n) = 0$. (This is unrealistic, but we will relax this condition later.)
As we have seen, the resulting covariance matrix and eigenvalue equation can be written as
The last equation can be re-written as shown. Note that the eigenvector is a linear
combination of the $\phi(x_n)$,
and the bracketed term is a scalar.
However, there still remains a bare $\phi(x_n)$ term.
We have to bring everything into kernel form!
We do this by multiplying both sides by $\phi(x_l)^T$ (taking inner products)…
Chapter 12.3. Kernel PCA
19
Calculation
In matrix notation, this can be written as…
There is a factor of $K$ on both sides of the equation.
We can remove it by multiplying both sides by $K^{-1}$.
Assuming the kernel (Gram) matrix is positive definite ($\lambda_i > 0$), its inverse always exists!
** Note that $\det K \neq 0$ when $K$ is positive definite.
Thus, the overall equation can be expressed as…
Normalization condition for the $a_i$.
Computing the projection of a data point onto the kernel principal components.
Chapter 12.3. Kernel PCA
20
Non-centered data vector
We have assumed the $\phi(x_n)$ are all centered.
However, real-world data is not centered in most cases. Thus, we have to center it in order to compute the covariance matrix $C$ easily.
Note that here, again, we have to avoid computing $\phi(x_n)$ explicitly!
On the left: the original form of the centered features. Let's re-write the Gram matrix $K$ in terms of these tilde values!
We can write the left-side equations in this matrix notation,
where $1_N$ denotes the $N \times N$ matrix in which
every element is $1/N$.
Note that computing such a Gram matrix is not always feasible,
since it is an $N \times N$ matrix.
Under big-data conditions it may be intractable,
so we may resort to approximation methods.
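A minimal numpy sketch of the centering and eigendecomposition steps above (my own code, here with a Gaussian kernel and a made-up bandwidth). The matrix `one_N` is the $1_N$ just described, and the eigenvectors $a_i$ of the centered Gram matrix are rescaled so that the corresponding feature-space eigenvectors have unit length.

```python
import numpy as np

def kernel_pca(X, n_components, gamma=1.0):
    """Kernel PCA with an RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    N = X.shape[0]
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)                             # N x N Gram matrix

    one_N = np.full((N, N), 1.0 / N)                          # the "1_N" matrix (every entry 1/N)
    K_tilde = K - one_N @ K - K @ one_N + one_N @ K @ one_N   # centered Gram matrix

    eigvals, eigvecs = np.linalg.eigh(K_tilde)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]        # descending order
    # rescale a_i so the feature-space eigenvector sum_n a_in * phi(x_n) has unit norm
    A = eigvecs[:, :n_components] / np.sqrt(eigvals[:n_components])
    return K_tilde @ A                                        # projections of the training points

# toy usage: two concentric circles
theta = np.linspace(0, 2 * np.pi, 100)
X = np.vstack([np.c_[np.cos(theta), np.sin(theta)],
               np.c_[3 * np.cos(theta), 3 * np.sin(theta)]])
Z = kernel_pca(X, n_components=2, gamma=0.5)
print(Z[:3], Z[-3:])   # projections for points from the inner and outer circle
```

With a suitable bandwidth, the inner and outer circles should separate in the leading kernel components, even though no linear projection of the raw coordinates can do so.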
Chapter 12.3. Kernel PCA
21
Example of kernel PCA
In this example, we used a Gaussian
kernel of the form shown.
The lines (a bit blurred; please look at the
figure in the textbook) are the PCs.
Check that the PC lines capture the
original data distribution well!
Let's see a more practical example.
Chapter 12.3. Kernel PCA
22
Example of kernel PCA
This example is from the Wikipedia article on kernel PCA:
https://en.wikipedia.org/wiki/Kernel_principal_component_analysis
That is, we can make such data separable with just a single dimension (the x-axis in the figure)!
Chapter 12.4. Nonlinear Latent Variable Models
23
Non-Gaussian?
Restricting the model and distributions to be linear and Gaussian may limit practical applications.
Thus, let's cover some more general latent variable models, as we shall see shortly.
Independent component analysis
The overall discussion begins from
That is, the latent variables are assumed to be independent!
Note that previously we only imposed independence in the sense of linear decorrelation (orthogonality);
here we assume full probabilistic independence!
The power of this method is that we do not assume a Gaussian structure for the model.
Here, we can assume the data distribution to be
There are not many examples in the
book, so readers who are interested
may want to do some extra study.. ;(
Chapter 12.4. Nonlinear Latent Variable Models
24
Auto-associative neural networks
It is a neural-network model for dimensionality reduction.
The overall form of this model is similar to that of the autoencoder we are familiar with!
In order to train this network, we need a specific error function,
which can be
If this network does not contain any non-linear activation, it ends up very close to
basic PCA. However, there is no orthogonality constraint on the hidden units.
For non-linearity, we can use the model shown on the left.
This model can be interpreted as follows…
Please note that the transformation $F_2$
can be a non-linear embedding, since there
are non-linear units between the
network layers!
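A tiny numpy sketch of the linear case described above (my own toy example, not from the slides): a one-hidden-layer autoencoder with no non-linearity, trained by gradient descent on the squared reconstruction error. With data lying in a 2-D subspace, the reconstruction error goes toward zero, so the learned subspace matches the PCA subspace, even though the hidden units are not constrained to be orthogonal.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 2)) @ rng.normal(size=(2, 6))  # data lying in a 2-D subspace of R^6
Xc = X - X.mean(axis=0)
Xc = Xc / Xc.std()                                       # rescale for stable step sizes
N, D, M = Xc.shape[0], Xc.shape[1], 2

W_enc = 0.1 * rng.normal(size=(D, M))   # encoder: x -> hidden z (no non-linearity)
W_dec = 0.1 * rng.normal(size=(M, D))   # decoder: z -> reconstruction x_hat

def mse():
    return np.mean((Xc @ W_enc @ W_dec - Xc) ** 2)

print("before training:", mse())
lr = 0.02
for _ in range(5000):
    Z = Xc @ W_enc
    err = Z @ W_dec - Xc                        # derivative of the squared-error criterion
    grad_dec = (Z.T @ err) / N
    grad_enc = (Xc.T @ (err @ W_dec.T)) / N
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
print("after training: ", mse())                # close to zero: the 2-D subspace is recovered
```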