Probabilistic Models with Latent Variables
Foundations of Algorithms and Machine Learning (CS60020), IIT KGP, 2017: Indrajit Bhattacharya
Density Estimation Problem
• Learning from unlabeled data 𝑥1, 𝑥2, … , 𝑥𝑁
• Unsupervised learning, density estimation
• Empirical distribution typically has multiple modes
Density Estimation Problem
[Figures omitted. From http://courses.ee.sun.ac.za/Pattern_Recognition_813 and http://yulearning.blogspot.co.uk]
Density Estimation Problem
• Convex combination of unimodal pdfs gives a multimodal pdf (sketched in code below)
f(x) = \sum_k w_k f_k(x), \quad \text{where } \sum_k w_k = 1
• Physical interpretation
• Sub populations
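A minimal sketch (not part of the slides) of evaluating such a convex combination; the two-component weights, means, and standard deviations below are assumed values for illustration:

import numpy as np
from scipy.stats import norm

# Illustrative 1-D mixture: f(x) = sum_k w_k f_k(x), with sum_k w_k = 1
w = np.array([0.3, 0.7])        # mixing weights (assumed)
mu = np.array([-2.0, 3.0])      # component means (assumed)
sigma = np.array([1.0, 1.5])    # component standard deviations (assumed)

def mixture_pdf(x):
    # Convex combination of unimodal (Gaussian) pdfs -> multimodal pdf
    return sum(wk * norm.pdf(x, loc=m, scale=s) for wk, m, s in zip(w, mu, sigma))

xs = np.linspace(-6.0, 8.0, 5)
print(mixture_pdf(xs))          # pointwise values of the mixture density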
Latent Variables
• Introduce new variable 𝑍𝑖 for each 𝑋𝑖
• Latent / hidden: not observed in the data
• Probabilistic interpretation
• Mixing weights: 𝑤𝑘 ≡ 𝑝(𝑧𝑖 = 𝑘)
• Mixture densities: 𝑓𝑘(𝑥) ≡ 𝑝(𝑥|𝑧𝑖 = 𝑘)
Generative Mixture Model
For i = 1, \dots, N:
  Z_i \sim_{iid} Mult
  X_i \sim_{iid} p(x | z_i)
• P(x_i, z_i) = p(z_i)\, p(x_i | z_i)
• P(x_i) = \sum_k p(x_i, z_i = k) recovers the mixture distribution
[Plate notation diagram: Z_i → X_i, repeated inside a plate of size N]
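A short sketch of this generative process by ancestral sampling (draw z_i from the multinomial, then x_i from the chosen component); the parameter values and the Gaussian components are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
N = 1000
pi = np.array([0.3, 0.7])                  # p(z_i = k), assumed mixing weights
mu = np.array([[-2.0, 0.0], [3.0, 1.0]])   # assumed component means

# For i = 1..N:  Z_i ~ Mult(pi),  X_i | Z_i = k ~ p(x | z_i = k)
z = rng.choice(len(pi), size=N, p=pi)
x = rng.normal(loc=mu[z], scale=1.0)       # unit-variance Gaussian components

# Marginalizing z recovers the mixture: P(x_i) = sum_k p(z_i = k) p(x_i | z_i = k)
print(np.bincount(z) / N)                  # empirical mixing weights, close to pi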
Tasks in a Mixture Model
• Inference
P(z | x) = \prod_i P(z_i | x_i)
• Parameter Estimation
• Find parameters that, e.g., maximize the likelihood
• The likelihood does not decouple according to classes (unlike the labeled case in a generative classifier)
• Non-convex; many local optima
Example: Gaussian Mixture Model
• Model
For 𝑖 = 1, … , 𝑁
𝑍𝑖~𝑖𝑖𝑑 𝑀𝑢𝑙𝑡(𝜋)
𝑋𝑖 | 𝑍𝑖 = 𝑘~𝑖𝑖𝑑 𝑁(𝑥|𝜇𝑘, Σ)
• Inference
𝑃(𝑧𝑖 = 𝑘|𝑥𝑖; 𝜇, Σ)
• Soft-max function
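A sketch of this inference step under the shared-covariance model above: p(z_i = k | x_i; μ, Σ) is a soft-max over log π_k + log N(x_i | μ_k, Σ). The use of scipy and the example parameter values are assumptions:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import softmax

def gmm_posterior(X, pi, mu, Sigma):
    # log p(z_i = k) + log N(x_i | mu_k, Sigma), arranged as an (N, K) array
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma)
         for k in range(len(pi))], axis=1)
    # Soft-max over k gives p(z_i = k | x_i; mu, Sigma)
    return softmax(log_joint, axis=1)

# Example with assumed parameters
X = np.random.default_rng(1).normal(size=(5, 2))
print(gmm_posterior(X, np.array([0.5, 0.5]),
                    np.array([[0.0, 0.0], [2.0, 2.0]]), np.eye(2)))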
Example: Gaussian Mixture Model
• Loglikelihood
• Which training instance comes from which component?
l(\theta) = \sum_i \log p(x_i) = \sum_i \log \sum_k p(z_i = k)\, p(x_i | z_i = k)
• No closed form solution for maximizing 𝑙 𝜃
• Possibility 1: Gradient descent etc
• Possibility 2: Expectation Maximization
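A sketch of evaluating l(θ) as written above, using logsumexp over k for numerical stability; the GMM parameterization mirrors the previous sketch and is an assumption:

import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_loglik(X, pi, mu, Sigma):
    # l(theta) = sum_i log sum_k p(z_i = k) p(x_i | z_i = k)
    log_joint = np.stack(
        [np.log(pi[k]) + multivariate_normal.logpdf(X, mean=mu[k], cov=Sigma)
         for k in range(len(pi))], axis=1)
    return logsumexp(log_joint, axis=1).sum()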
Expectation Maximization Algorithm
• Observation: Know values of 𝑍𝑖 ⇒ easy to maximize
• Key idea: iterative updates
• Given parameter estimates, “infer” all 𝑍𝑖 variables
• Given inferred 𝑍𝑖 variables, maximize wrt parameters
• Questions
• Does this converge?
• What does this maximize?
Expectation Maximization Algorithm
• Complete loglikelihood
l_c(\theta) = \sum_i \log p(x_i, z_i) = \sum_i \log [\, p(z_i)\, p(x_i | z_i) \,]
• Problem: 𝑧𝑖 not known
• Possible solution: Replace w/ conditional expectation
• Expected complete loglikelihood
Q(\theta, \theta_{old}) = E\big[\, \sum_i \log p(x_i, z_i) \,\big]
where the expectation is taken w.r.t. p(z | x, \theta_{old}) and \theta_{old} are the current parameters
Expectation Maximization Algorithm
Q(\theta, \theta_{old}) = E\big[\, \sum_i \log p(x_i, z_i) \,\big]
= \sum_i \sum_k E[\, I(z_i = k) \,] \log [\, \pi_k\, p(x_i | \theta_k) \,]
= \sum_i \sum_k p(z_i = k | x_i, \theta_{old}) \log [\, \pi_k\, p(x_i | \theta_k) \,]
= \sum_i \sum_k \gamma_{ik} \log \pi_k + \sum_i \sum_k \gamma_{ik} \log p(x_i | \theta_k)
where \gamma_{ik} = p(z_i = k | x_i, \theta_{old})
• Compare with likelihood for generative classifier
Expectation Maximization Algorithm
• Expectation Step
• Update 𝛾𝑖𝑘 based on current parameters
\gamma_{ik} = \frac{ \pi_k\, p(x_i | \theta_{old,k}) }{ \sum_{k'} \pi_{k'}\, p(x_i | \theta_{old,k'}) }
• Maximization Step
• Maximize 𝑄 𝜃, 𝜃𝑜𝑙𝑑 wrt parameters
• Overall algorithm
• Initialize all latent variables
• Iterate until convergence
• M Step
• E Step
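A sketch of this overall loop for a generic mixture model, following the ordering on the slide (initialize the latent variables, then alternate M and E steps); m_step and e_step are placeholders for the model-specific updates (GMM versions appear on the next slide):

import numpy as np

def em(X, K, m_step, e_step, n_iter=100, seed=0):
    # Initialize all latent variables: random responsibilities gamma_ik
    rng = np.random.default_rng(seed)
    gamma = rng.dirichlet(np.ones(K), size=len(X))
    for _ in range(n_iter):           # iterate until convergence (fixed budget here)
        theta = m_step(X, gamma)      # M step: maximize Q(theta, theta_old) wrt theta
        gamma = e_step(X, theta)      # E step: gamma_ik = p(z_i = k | x_i, theta)
    return theta, gamma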
Example: EM for GMM
• E Step: remains the same for all mixture models
• M Step (sketched in code below)
• \pi_k = \frac{ \sum_i \gamma_{ik} }{ N } = \frac{ \gamma_k }{ N }
• \mu_k = \frac{ \sum_i \gamma_{ik}\, x_i }{ \gamma_k }
• \Sigma = ?
• Compare with generative classifier
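A sketch of these M-step updates for the GMM; the shared-covariance update (the Σ = ? left open above) is filled in with the standard maximum-likelihood form, stated here as an assumption rather than as part of the slides:

import numpy as np

def gmm_m_step(X, gamma):
    # gamma: (N, K) responsibilities; gamma_k = sum_i gamma_ik
    N, D = X.shape
    gamma_k = gamma.sum(axis=0)                 # shape (K,)
    pi = gamma_k / N                            # pi_k = gamma_k / N
    mu = (gamma.T @ X) / gamma_k[:, None]       # mu_k = sum_i gamma_ik x_i / gamma_k
    # One standard choice for the shared covariance (the "Sigma = ?" on the slide)
    diff = X[:, None, :] - mu[None, :, :]       # shape (N, K, D)
    Sigma = np.einsum('nk,nkd,nke->de', gamma, diff, diff) / N
    return pi, mu, Sigma

Paired with gmm_posterior from the inference sketch as the E step (with θ = (π, μ, Σ)), this plugs directly into the em loop sketched on the previous slide.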
Analysis of EM Algorithm
• Expected complete LL is a lower bound on LL
• EM iteratively maximizes this lower bound
• Converges to a local maximum of the loglikelihood
Bayesian / MAP Estimation
• EM overfits
• Possible to perform MAP instead of MLE in M-step
• EM is partially Bayesian
• Posterior distribution over latent variables
• Point estimate over parameters
• Fully Bayesian approach is called Variational Bayes
(Lloyd’s) K Means Algorithm
• Hard EM for Gaussian Mixture Model
• Point estimate of parameters (as usual)
• Point estimate of latent variables
• Spherical Gaussian mixture components
z_i^* = \arg\max_k p(z_i = k | x_i, \theta) = \arg\min_k \| x_i - \mu_k \|_2^2
where \mu_k = \frac{ \sum_{i: z_i = k} x_i }{ N_k } and N_k is the number of points currently assigned to cluster k
• Most popular “hard” clustering algorithm
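A minimal sketch of Lloyd's algorithm viewed as hard EM: hard assignments replace the responsibilities, and each mean is the average of the points currently assigned to it. Initializing the means at randomly chosen data points is an assumption:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # init means at random data points
    for _ in range(n_iter):
        # Hard E step: z_i* = argmin_k ||x_i - mu_k||^2
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # Hard M step: mu_k = mean of the points assigned to cluster k
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return mu, z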
K Means Problem
• Given \{x_i\}, find K "means" (\mu_1^*, \dots, \mu_K^*) and data assignments (z_1^*, \dots, z_N^*) such that
(\mu^*, z^*) = \arg\min_{\mu, z} \sum_i \| x_i - \mu_{z_i} \|_2^2
• Note: z_i is a K-dimensional binary (one-hot) vector (see the sketch below)
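To make the note about z_i concrete, a small sketch of the objective with a one-hot membership matrix Z; this is the matrix form reused in the PCA slides below. The row-major data layout (rows of X are the x_i) is an assumption:

import numpy as np

def kmeans_objective(X, mu, z):
    # J = sum_i || x_i - mu_{z_i} ||_2^2, with z_i encoded as a one-hot row of Z
    K = len(mu)
    Z = np.eye(K)[z]                      # (N, K) binary membership matrix
    # With X stored as N x D (the slides use D x N), this equals ||X - W Z^T||_F^2 for W = mu^T
    return np.sum((X - Z @ mu) ** 2)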
Model selection: Choosing K for GMM
• Cross validation
• Plot likelihood on training set and validation set for
increasing values of k
• Likelihood on training set keeps improving
• Likelihood on validation set drops after “optimal” k
• Does not work for k-means! Why?
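A sketch of the cross-validation recipe for a GMM: fit each candidate K on a training split and compare held-out log-likelihood. fit_gmm and gmm_loglik stand in for the EM routine and likelihood evaluation sketched earlier; their exact signatures are assumptions:

def choose_k(X_train, X_val, candidate_ks, fit_gmm, gmm_loglik):
    scores = {}
    for K in candidate_ks:
        params = fit_gmm(X_train, K)             # EM fit on the training split
        scores[K] = gmm_loglik(X_val, *params)   # held-out likelihood drops past the "optimal" K
    return max(scores, key=scores.get), scores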
Principal Component Analysis: Motivation
• Dimensionality reduction
• Reduces #parameters to estimate
• Data often resides in much lower dimension, e.g., on a line
in a 3D space
• Provides “understanding”
• Mixture models very restricted
• Latent variables restricted to small discrete set
• Can we “relax” the latent variable?
Classical PCA: Motivation
• Revisit K-means
\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2
• W: matrix containing means
• Z: matrix containing cluster membership vectors
• How can we relax Z and W?
Classical PCA: Problem
\min_{W, Z} J(W, Z) = \| X - W Z^T \|_F^2
• X : 𝐷 × 𝑁
• Arbitrary Z of size 𝑁 × 𝐿,
• Orthonormal W of size 𝐷 × 𝐿
Classical PCA: Optimal Solution
• Empirical covariance matrix \Sigma = \frac{1}{N} \sum_i x_i x_i^T
• Scaled and centered data
• W = V_L, where V_L contains the L eigenvectors corresponding to the L largest eigenvalues of \Sigma
• 𝑧𝑖 = 𝑊𝑇𝑥𝑖
• Alternative solution via Singular Value
Decomposition (SVD)
• W contains the “principal components” that capture
the largest variance in the data
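A sketch of this recipe on centered data: an eigendecomposition of the empirical covariance (or, equivalently, an SVD of the data matrix) gives W = V_L and scores z_i = W^T x_i. The D × N data layout follows the slide:

import numpy as np

def classical_pca(X, L):
    # X: D x N, assumed already scaled and centered
    D, N = X.shape
    Sigma = (X @ X.T) / N                      # empirical covariance, D x D
    evals, evecs = np.linalg.eigh(Sigma)       # eigenvalues in ascending order
    W = evecs[:, ::-1][:, :L]                  # V_L: eigenvectors of the L largest eigenvalues
    Z = W.T @ X                                # columns are z_i = W^T x_i
    return W, Z

# Equivalent route via SVD of the centered data matrix:
# U, s, Vt = np.linalg.svd(X, full_matrices=False); W = U[:, :L]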
Probabilistic PCA
• Generative model
P(z_i) = N(z_i | \mu_0, \Sigma_0)
P(x_i | z_i) = N(x_i | W z_i + \mu, \Psi)
\Psi forced to be diagonal
• Latent linear models
• Factor Analysis
• Special case (probabilistic PCA): \Psi = \sigma^2 I
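A sketch of this generative process (factor analysis with diagonal Ψ; setting Ψ = σ²I gives probabilistic PCA). Standard-normal latents (μ₀ = 0, Σ₀ = I) and the remaining parameter values are assumptions for illustration:

import numpy as np

rng = np.random.default_rng(0)
D, L, N = 5, 2, 1000
W = rng.normal(size=(D, L))           # factor loadings (assumed)
mu = np.zeros(D)
psi = np.full(D, 0.1)                 # diagonal of Psi; a constant diagonal gives PPCA

z = rng.normal(size=(N, L))                                  # z_i ~ N(0, I)
x = z @ W.T + mu + rng.normal(size=(N, D)) * np.sqrt(psi)    # x_i | z_i ~ N(W z_i + mu, Psi)

# The marginal covariance approaches W W^T + Psi (next slide) as N grows
print(np.abs(np.cov(x.T, bias=True) - (W @ W.T + np.diag(psi))).max())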
Visualization of Generative Process
From Bishop, PRML
Relationship with Gaussian Density
• Cov[x] = W W^T + \Psi
• Why does Ψ need to be restricted?
• Intermediate low rank parameterization of Gaussian
covariance matrix between full rank and diagonal
• Compare #parameters
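A small worked comparison of raw parameter counts for a D-dimensional Gaussian covariance: a full covariance has D(D+1)/2 free entries, a diagonal one has D, and the low-rank-plus-diagonal form W W^T + Ψ has D·L + D. The sizes D = 100, L = 10 are assumed for illustration:

D, L = 100, 10
full_rank = D * (D + 1) // 2        # 5050
diagonal = D                        # 100
low_rank_plus_diag = D * L + D      # 1100: W is D x L, Psi contributes D diagonal entries
print(full_rank, diagonal, low_rank_plus_diag)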
EM for PCA: Rod and Springs
From Bishop, PRML
Advantages of EM
• Simpler than gradient methods w/ constraints
• Handles missing data
• Easy path for handling more complex models
• Not always the fastest method
Summary of Latent Variable Models
• Learning from unlabeled data
• Latent variables
• Discrete: Clustering / Mixture models ; GMM
• Continuous: Dimensionality reduction ; PCA
• Summary / “Understanding” of data
• Expectation Maximization Algorithm