Probabilistic Models with Latent Variables
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Density Estimation Problem
• Learning from unlabeled data 𝑥1, 𝑥2, … , 𝑥𝑁
• Unsupervised learning, density estimation
• Empirical distribution typically has multiple modes
Density Estimation Problem
[Figures omitted. From http://courses.ee.sun.ac.za/Pattern_Recognition_813 and http://yulearning.blogspot.co.uk]
Density Estimation Problem
• Convex combination of unimodal pdfs gives a multimodal pdf (sketched in code below)
f(x) = \sum_k w_k f_k(x), \quad \text{where } \sum_k w_k = 1
• Physical interpretation
• Sub populations
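A minimal sketch (not part of the slides) of evaluating such a convex combination; the two-component weights, means, and standard deviations below are assumed values for illustration:

import numpy as np
from scipy.stats import norm

# Illustrative 1-D mixture: f(x) = sum_k w_k f_k(x), with sum_k w_k = 1
w = np.array([0.3, 0.7])        # mixing weights (assumed)
mu = np.array([-2.0, 3.0])      # component means (assumed)
sigma = np.array([1.0, 1.5])    # component standard deviations (assumed)

def mixture_pdf(x):
    # Convex combination of unimodal (Gaussian) pdfs -> multimodal pdf
    return sum(wk * norm.pdf(x, loc=m, scale=s) for wk, m, s in zip(w, mu, sigma))

xs = np.linspace(-6.0, 8.0, 5)
print(mixture_pdf(xs))          # pointwise values of the mixture density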
Latent Variables
• Introduce new variable 𝑍𝑖 for each 𝑋𝑖
• Latent / hidden: not observed in the data
• Probabilistic interpretation
• Mixing weights: 𝑤𝑘 ≡ 𝑝(𝑧𝑖 = 𝑘)
• Mixture densities: 𝑓𝑘(𝑥) ≡ 𝑝(𝑥|𝑧𝑖 = 𝑘)
Generative Mixture Model
For i = 1, \dots, N:
  Z_i \sim_{iid} Mult
  X_i \sim_{iid} p(x | z_i)
• P(x_i, z_i) = p(z_i)\, p(x_i | z_i)
• P(x_i) = \sum_k p(x_i, z_i = k) recovers the mixture distribution
[Plate notation diagram: Z_i → X_i, repeated inside a plate of size N]
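A short sketch of this generative process by ancestral sampling (draw z_i from the multinomial, then x_i from the chosen component); the parameter values and the Gaussian components are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
N = 1000
pi = np.array([0.3, 0.7])                  # p(z_i = k), assumed mixing weights
mu = np.array([[-2.0, 0.0], [3.0, 1.0]])   # assumed component means

# For i = 1..N:  Z_i ~ Mult(pi),  X_i | Z_i = k ~ p(x | z_i = k)
z = rng.choice(len(pi), size=N, p=pi)
x = rng.normal(loc=mu[z], scale=1.0)       # unit-variance Gaussian components

# Marginalizing z recovers the mixture: P(x_i) = sum_k p(z_i = k) p(x_i | z_i = k)
print(np.bincount(z) / N)                  # empirical mixing weights, close to pi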
Tasks in a Mixture Model
• Inference
P(z | x) = \prod_i P(z_i | x_i)
• Parameter Estimation
• Find parameters that, e.g., maximize the likelihood
• The likelihood does not decouple according to classes (unlike the labeled case in a generative classifier)
• Non-convex; many local optima
Example: Gaussian Mixture Model
• Model
For 𝑖 = 1, … , 𝑁
𝑍𝑖~𝑖𝑖𝑑 𝑀𝑢𝑙𝑡(𝜋)
𝑋𝑖 | 𝑍𝑖 = 𝑘~𝑖𝑖𝑑 𝑁(𝑥|𝜇𝑘, Σ)
• Inference
𝑃(𝑧𝑖 = 𝑘|𝑥𝑖; 𝜇, Σ)
• Soft-max function
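A sketch of this inference step under the shared-covariance model above: p(z_i = k | x_i; μ, Σ) is a soft-max over log π_k + log N(x_i | μ_k, Σ). The use of scipy and the example parameter values are assumptions:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax

def gmm_posterior(X, pi, mu, Sigma):
    # log p(z_i = k) + log N(x_i | mu_k, Sigma), arranged as an (N, K) array
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma)
         for k in range(len(pi))], axis=1)
    # Soft-max over k gives p(z_i = k | x_i; mu, Sigma)
    return softmax(log_joint, axis=1)

# Example with assumed parameters
X = np.random.default_rng(1).normal(size=(5, 2))
print(gmm_posterior(X, np.array([0.5, 0.5]),
                    np.array([[0.0, 0.0], [2.0, 2.0]]), np.eye(2)))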
Example: Gaussian Mixture Model
• Loglikelihood
• Which training instance comes from which component?
l(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(z_i = k)\, p(x_i | z_i = k)
• No closed form solution for maximizing 𝑙 𝜃
• Possibility 1: Gradient descent etc
• Possibility 2: Expectation Maximization
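A sketch of evaluating l(θ) as written above, using logsumexp over k for numerical stability; the GMM parameterization mirrors the previous sketch and is an assumption:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglik(X, pi, mu, Sigma):
    # l(theta) = sum_i log sum_k p(z_i = k) p(x_i | z_i = k)
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma)
         for k in range(len(pi))], axis=1)
    return logsumexp(log_joint, axis=1).sum()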
Expectation Maximization Algorithm
• Observation: Know values of 𝑍𝑖 ⇒ easy to maximize
• Key idea: iterative updates
• Given parameter estimates, “infer” all 𝑍𝑖 variables
• Given inferred 𝑍𝑖 variables, maximize wrt parameters
• Questions
• Does this converge?
• What does this maximize?
Expectation Maximization Algorithm
• Complete loglikelihood
l_c(\theta) = \sum_i \log p(x_i, z_i) = \sum_i \log [\, p(z_i)\, p(x_i | z_i) \,]
• Problem: 𝑧𝑖 not known
• Possible solution: Replace w/ conditional expectation
• Expected complete loglikelihood
Q(\theta, \theta_{old}) = E\big[\, \sum_i \log p(x_i, z_i) \,\big]
where the expectation is taken w.r.t. p(z | x, \theta_{old}) and \theta_{old} are the current parameters
Expectation Maximization Algorithm
Q(\theta, \theta_{old}) = E\big[\, \sum_i \log p(x_i, z_i) \,\big]
= \sum_i \sum_k E[\, I(z_i = k) \,] \log [\, \pi_k\, p(x_i | \theta_k) \,]
= \sum_i \sum_k p(z_i = k | x_i, \theta_{old}) \log [\, \pi_k\, p(x_i | \theta_k) \,]
= \sum_i \sum_k \gamma_{ik} \log \pi_k + \sum_i \sum_k \gamma_{ik} \log p(x_i | \theta_k)
where \gamma_{ik} = p(z_i = k | x_i, \theta_{old})
• Compare with likelihood for generative classifier
Expectation Maximization Algorithm
• Expectation Step
• Update 𝛾𝑖𝑘 based on current parameters
\gamma_{ik} = \frac{ \pi_k\, p(x_i | \theta_{old,k}) }{ \sum_{k'} \pi_{k'}\, p(x_i | \theta_{old,k'}) }
• Maximization Step
• Maximize 𝑄 𝜃, 𝜃𝑜𝑙𝑑 wrt parameters
• Overall algorithm
• Initialize all latent variables
• Iterate until convergence
• M Step
• E Step
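A sketch of this overall loop for a generic mixture model, following the ordering on the slide (initialize the latent variables, then alternate M and E steps); m_step and e_step are placeholders for the model-specific updates (GMM versions appear on the next slide):

import numpy as np

def em(X, K, m_step, e_step, n_iter=100, seed=0):
    # Initialize all latent variables: random responsibilities gamma_ik
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(K), size=len(X))
    for _ in range(n_iter):           # iterate until convergence (fixed budget here)
        theta = m_step(X, gamma)      # M step: maximize Q(theta, theta_old) wrt theta
        gamma = e_step(X, theta)      # E step: gamma_ik = p(z_i = k | x_i, theta)
    return theta, gamma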
Example: EM for GMM
• E Step: remains the same for all mixture models
• M Step (sketched in code below)
• \pi_k = \frac{ \sum_i \gamma_{ik} }{ N } = \frac{ \gamma_k }{ N }
• \mu_k = \frac{ \sum_i \gamma_{ik}\, x_i }{ \gamma_k }
• \Sigma = ?
• Compare with generative classifier
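A sketch of these M-step updates for the GMM; the shared-covariance update (the Σ = ? left open above) is filled in with the standard maximum-likelihood form, stated here as an assumption rather than as part of the slides:

import numpy as np

def gmm_m_step(X, gamma):
    # gamma: (N, K) responsibilities; gamma_k = sum_i gamma_ik
    N, D = X.shape
    gamma_k = gamma.sum(axis=0)                 # shape (K,)
    pi = gamma_k / N                            # pi_k = gamma_k / N
    mu = (gamma.T @ X) / gamma_k[:, None]       # mu_k = sum_i gamma_ik x_i / gamma_k
    # One standard choice for the shared covariance (the "Sigma = ?" on the slide)
    diff = X[:, None, :] - mu[None, :, :]       # shape (N, K, D)
    Sigma = np.einsum('nk,nkd,nke->de', gamma, diff, diff) / N
    return pi, mu, Sigma

Paired with gmm_posterior from the inference sketch as the E step (with θ = (π, μ, Σ)), this plugs directly into the em loop sketched on the previous slide.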
Analysis of EM Algorithm
• Expected complete LL is a lower bound on LL
• EM iteratively maximizes this lower bound
• Converges to a local maximum of the loglikelihood
Bayesian / MAP Estimation
• EM overfits
• Possible to perform MAP instead of MLE in M-step
• EM is partially Bayesian
• Posterior distribution over latent variables
• Point estimate over parameters
• Fully Bayesian approach is called Variational Bayes
(Lloyd’s) K Means Algorithm
• Hard EM for Gaussian Mixture Model
• Point estimate of parameters (as usual)
• Point estimate of latent variables
• Spherical Gaussian mixture components
z_i^* = \arg\max_k p(z_i = k | x_i, \theta) = \arg\min_k \| x_i - \mu_k \|_2^2
where \mu_k = \frac{ \sum_{i: z_i = k} x_i }{ N_k } and N_k is the number of points currently assigned to cluster k
• Most popular “hard” clustering algorithm
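A minimal sketch of Lloyd's algorithm viewed as hard EM: hard assignments replace the responsibilities, and each mean is the average of the points currently assigned to it. Initializing the means at randomly chosen data points is an assumption:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init means at random data points
    for _ in range(n_iter):
        # Hard E step: z_i* = argmin_k ||x_i - mu_k||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # Hard M step: mu_k = mean of the points assigned to cluster k
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return mu, z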
K Means Problem
• Given \{x_i\}, find K "means" (\mu_1^*, \dots, \mu_K^*) and data assignments (z_1^*, \dots, z_N^*) such that
(\mu^*, z^*) = \arg\min_{\mu, z} \sum_i \| x_i - \mu_{z_i} \|_2^2
• Note: z_i is a K-dimensional binary (one-hot) vector (see the sketch below)
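To make the note about z_i concrete, a small sketch of the objective with a one-hot membership matrix Z; this is the matrix form reused in the PCA slides below. The row-major data layout (rows of X are the x_i) is an assumption:

import numpy as np

def kmeans_objective(X, mu, z):
    # J = sum_i || x_i - mu_{z_i} ||_2^2, with z_i encoded as a one-hot row of Z
    K = len(mu)
    Z = np.eye(K)[z]                      # (N, K) binary membership matrix
    # With X stored as N x D (the slides use D x N), this equals ||X - W Z^T||_F^2 for W = mu^T
    return np.sum((X - Z @ mu) ** 2)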
Model selection: Choosing K for GMM
• Cross validation
• Plot likelihood on training set and validation set for
increasing values of k
• Likelihood on training set keeps improving
• Likelihood on validation set drops after “optimal” k
• Does not work for k-means! Why?
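A sketch of the cross-validation recipe for a GMM: fit each candidate K on a training split and compare held-out log-likelihood. fit_gmm and gmm_loglik stand in for the EM routine and likelihood evaluation sketched earlier; their exact signatures are assumptions:

def choose_k(X_train, X_val, candidate_ks, fit_gmm, gmm_loglik):
    scores = {}
    for K in candidate_ks:
        params = fit_gmm(X_train, K)             # EM fit on the training split
        scores[K] = gmm_loglik(X_val, *params)   # held-out likelihood drops past the "optimal" K
    return max(scores, key=scores.get), scores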
Principal Component Analysis: Motivation
• Dimensionality reduction
• Reduces #parameters to estimate
• Data often resides in much lower dimension, e.g., on a line
in a 3D space
• Provides “understanding”
• Mixture models very restricted
• Latent variables restricted to small discrete set
• Can we “relax” the latent variable?
Classical PCA: Motivation
• Revisit K-means
\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2
• W: matrix containing means
• Z: matrix containing cluster membership vectors
• How can we relax Z and W?
Classical PCA: Problem
\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2
• X : 𝐷 × 𝑁
• Arbitrary Z of size 𝑁 × 𝐿,
• Orthonormal W of size 𝐷 × 𝐿
Classical PCA: Optimal Solution
• Empirical covariance matrix \Sigma = \frac{1}{N} \sum_i x_i x_i^T
• Scaled and centered data
• W = V_L, where V_L contains the L eigenvectors corresponding to the L largest eigenvalues of \Sigma
• 𝑧𝑖 = 𝑊𝑇𝑥𝑖
• Alternative solution via Singular Value
Decomposition (SVD)
• W contains the “principal components” that capture
the largest variance in the data
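A sketch of this recipe on centered data: an eigendecomposition of the empirical covariance (or, equivalently, an SVD of the data matrix) gives W = V_L and scores z_i = W^T x_i. The D × N data layout follows the slide:

import numpy as np

def classical_pca(X, L):
    # X: D x N, assumed already scaled and centered
    D, N = X.shape
    Sigma = (X @ X.T) / N                      # empirical covariance, D x D
    evals, evecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :L]                  # V_L: eigenvectors of the L largest eigenvalues
    Z = W.T @ X                                # columns are z_i = W^T x_i
    return W, Z

# Equivalent route via SVD of the centered data matrix:
# U, s, Vt = np.linalg.svd(X, full_matrices=False); W = U[:, :L]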
Probabilistic PCA
• Generative model
P(z_i) = N(z_i | \mu_0, \Sigma_0)
P(x_i | z_i) = N(x_i | W z_i + \mu, \Psi)
\Psi forced to be diagonal
• Latent linear models
• Factor Analysis
• Special case (probabilistic PCA): \Psi = \sigma^2 I
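A sketch of this generative process (factor analysis with diagonal Ψ; setting Ψ = σ²I gives probabilistic PCA). Standard-normal latents (μ₀ = 0, Σ₀ = I) and the remaining parameter values are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 1000
W = rng.normal(size=(D, L))           # factor loadings (assumed)
mu = np.zeros(D)
psi = np.full(D, 0.1)                 # diagonal of Psi; a constant diagonal gives PPCA

z = rng.normal(size=(N, L))                                  # z_i ~ N(0, I)
x = z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)    # x_i | z_i ~ N(W z_i + mu, Psi)

# The marginal covariance approaches W W^T + Psi (next slide) as N grows
print(np.abs(np.cov(x.T, bias=True) - (W @ W.T + np.diag(psi))).max())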
Visualization of Generative Process
From Bishop, PRML
Relationship with Gaussian Density
• Cov[x] = W W^T + \Psi
• Why does Ψ need to be restricted?
• Intermediate low rank parameterization of Gaussian
covariance matrix between full rank and diagonal
• Compare #parameters
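A small worked comparison of raw parameter counts for a D-dimensional Gaussian covariance: a full covariance has D(D+1)/2 free entries, a diagonal one has D, and the low-rank-plus-diagonal form W W^T + Ψ has D·L + D. The sizes D = 100, L = 10 are assumed for illustration:

D, L = 100, 10
full_rank = D * (D + 1) // 2        # 5050
diagonal = D                        # 100
low_rank_plus_diag = D * L + D      # 1100: W is D x L, Psi contributes D diagonal entries
print(full_rank, diagonal, low_rank_plus_diag)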
EM for PCA: Rod and Springs
From Bishop, PRML
Advantages of EM
• Simpler than gradient methods w/ constraints
• Handles missing data
• Easy path for handling more complex models
• Not always the fastest method
Summary of Latent Variable Models
• Learning from unlabeled data
• Latent variables
• Discrete: Clustering / Mixture models ; GMM
• Continuous: Dimensionality reduction ; PCA
• Summary / “Understanding” of data
• Expectation Maximization Algorithm