Part 2: Unsupervised Learning
Machine Learning Techniques for Computer Vision
Christopher M. Bishop, Microsoft Research Cambridge
ECCV 2004, Prague
Overview of Part 2: mixture models; EM; variational inference; Bayesian model complexity; continuous latent variables.
The Gaussian Distribution. Multivariate Gaussian: N(x | μ, Σ) = (2π)^{-D/2} |Σ|^{-1/2} exp{ -(1/2)(x - μ)ᵀ Σ⁻¹ (x - μ) }. Maximum likelihood solutions: mean μ_ML = (1/N) Σ_n x_n; covariance Σ_ML = (1/N) Σ_n (x_n - μ_ML)(x_n - μ_ML)ᵀ.
Gaussian Mixtures. Linear superposition of Gaussians: p(x) = Σ_{k=1..K} π_k N(x | μ_k, Σ_k). Normalization and positivity require Σ_k π_k = 1 and 0 ≤ π_k ≤ 1.
Example: Mixture of 3 Gaussians
Maximum Likelihood for the GMM. Log likelihood function: ln p(X | π, μ, Σ) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) }. The sum over components appears inside the log, so there is no closed-form ML solution.
EM Algorithm  –  Informal Derivation
EM Algorithm – Informal Derivation. M step equations: μ_k = (1/N_k) Σ_n γ_nk x_n; Σ_k = (1/N_k) Σ_n γ_nk (x_n - μ_k)(x_n - μ_k)ᵀ; π_k = N_k / N, where N_k = Σ_n γ_nk.
EM Algorithm – Informal Derivation. E step equation: γ_nk = π_k N(x_n | μ_k, Σ_k) / Σ_j π_j N(x_n | μ_j, Σ_j).
EM Algorithm – Informal Derivation. The mixing coefficients π_k can be interpreted as prior probabilities; the corresponding posterior probabilities p(k | x_n) are the responsibilities γ_nk (a code sketch of the full EM loop follows).
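As a concrete illustration of the E and M steps above, here is a minimal numpy sketch of EM for a Gaussian mixture; the function name, initialization strategy, and small regularization term are illustrative choices, not part of the original slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """Minimal EM for a Gaussian mixture with full covariances."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialise: random data points as means, data covariance for each component, uniform weights
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E step: responsibilities gamma[n, k] = p(component k | x_n)
        dens = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)          # shape (N, K)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)                                 # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
    return pi, mu, Sigma, gamma
```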
Old Faithful Data Set (scatter plot: duration of eruption in minutes vs. time between eruptions in minutes), followed by a sequence of figure-only slides.
Latent Variable View of EM. To sample from a Gaussian mixture: first pick one of the components k with probability π_k, then draw a sample x from that component N(x | μ_k, Σ_k); repeat these two steps for each new data point (see the sketch below).
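The two-step ancestral sampling procedure maps directly onto code; a short sketch assuming numpy, with an illustrative function name:

```python
import numpy as np

def sample_gmm(pi, mu, Sigma, n_samples, seed=0):
    """Ancestral sampling from a Gaussian mixture: pick component k with
    probability pi[k], then draw from N(mu[k], Sigma[k])."""
    rng = np.random.default_rng(seed)
    K = len(pi)
    z = rng.choice(K, size=n_samples, p=pi)                        # latent component labels
    X = np.array([rng.multivariate_normal(mu[k], Sigma[k]) for k in z])
    return X, z
```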
Latent Variable View of EM. Goal: given a data set, find the mixture parameters {π_k, μ_k, Σ_k}. If we knew the colours (which component generated each point), maximum likelihood would simply involve fitting each component to the corresponding cluster. Problem: the colours are latent (hidden) variables.
Incomplete and Complete Data: the complete data set includes the component label for each point; the incomplete data set contains only the observations.
Latent Variable Viewpoint
Latent Variable Viewpoint. Introduce binary latent variables z (one-of-K coded) describing which component generated each data point, together with a conditional distribution of the observed variable given z and a prior distribution over z. Marginalizing over the latent variables recovers the Gaussian mixture, as written out below.
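Written out explicitly, in the standard notation for the one-of-K latent variable z:

```latex
p(\mathbf{z}) = \prod_{k=1}^{K} \pi_k^{z_k}, \qquad
p(\mathbf{x}\mid\mathbf{z}) = \prod_{k=1}^{K} \mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)^{z_k}, \qquad
p(\mathbf{x}) = \sum_{\mathbf{z}} p(\mathbf{z})\,p(\mathbf{x}\mid\mathbf{z})
             = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k).
```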
Graphical Representation of GMM
Latent Variable View of EM. If we knew the values of the latent variables, maximizing the complete-data log likelihood would have a trivial closed-form solution: fit each component to the corresponding set of data points. We don't know the values of the latent variables, but for given parameter values we can compute their expected values.
Posterior Probabilities (colour coded)
Over-fitting in Gaussian Mixture Models. Infinities in the likelihood function when a component 'collapses' onto a data point: with μ_k = x_n, the term π_k N(x_n | μ_k, Σ_k) grows without bound as the component variance shrinks to zero. Also, maximum likelihood cannot determine the number K of components.
Cross Validation. Model complexity can be selected using an independent validation data set. If data is scarce, use cross-validation: partition the data into S subsets, train on S - 1 subsets, test on the remainder, then repeat and average (see the sketch below). Disadvantages: computationally expensive, and can only determine one or two complexity parameters.
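As a concrete (and anachronistic, modern-library) illustration of this procedure, one might choose the number of mixture components K by S-fold cross-validation on held-out log-likelihood; the helper name cv_select_K and the use of scikit-learn are illustrative assumptions, not from the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import KFold

def cv_select_K(X, candidate_K=(1, 2, 3, 4, 5), S=5, seed=0):
    """Pick K by average held-out log-likelihood over S folds."""
    kf = KFold(n_splits=S, shuffle=True, random_state=seed)
    scores = {}
    for K in candidate_K:
        fold_scores = []
        for train_idx, test_idx in kf.split(X):
            gmm = GaussianMixture(n_components=K, random_state=seed)
            gmm.fit(X[train_idx])
            fold_scores.append(gmm.score(X[test_idx]))   # mean log-likelihood per point
        scores[K] = np.mean(fold_scores)
    best_K = max(scores, key=scores.get)
    return best_K, scores
```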
Bayesian Mixture of Gaussians. Parameters and latent variables appear on an equal footing. Conjugate priors: a Dirichlet prior over the mixing coefficients and a Gaussian-Wishart prior over the component means and precisions.
Data Set Size. Problem 1: learn a function from 100 (slightly) noisy examples; the data set is computationally small but statistically large. Problem 2: learn to recognize 1,000 everyday objects from 5,000,000 natural images; the data set is computationally large but statistically small. Bayesian inference is computationally more demanding than ML or MAP (but see the discussion of Gaussian mixtures later), with significant benefit for statistically small data sets.
Variational Inference. Exact Bayesian inference is intractable. Markov chain Monte Carlo is computationally expensive and raises issues of convergence. Variational inference is a broadly applicable deterministic approximation: let Z denote all latent variables and parameters, approximate the true posterior p(Z | X) by a simpler distribution q(Z), and minimize the Kullback-Leibler divergence KL(q || p).
General View of Variational Inference. For an arbitrary distribution q(Z), ln p(X) = L(q) + KL(q || p), where L(q) is a lower bound on the log marginal likelihood. Maximizing over all possible q would give the true posterior, but this is intractable by definition.
Variational Lower Bound
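The decomposition referred to on the previous two slides can be written in the standard form (as in Bishop's treatment):

```latex
\ln p(X) \;=\; \mathcal{L}(q) \;+\; \mathrm{KL}\big(q \,\|\, p\big),
\qquad
\mathcal{L}(q) = \int q(Z)\,\ln\frac{p(X, Z)}{q(Z)}\,dZ,
\qquad
\mathrm{KL}\big(q \,\|\, p\big) = -\int q(Z)\,\ln\frac{p(Z \mid X)}{q(Z)}\,dZ .
```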
Factorized Approximation. Goal: choose a family of q distributions that is sufficiently flexible to give a good approximation yet sufficiently simple to remain tractable. Here we consider factorized distributions q(Z) = Π_i q_i(Z_i); no further assumptions are required. The optimal solution for one factor, keeping the remainder fixed, is ln q_j*(Z_j) = E_{i≠j}[ ln p(X, Z) ] + const. The solutions are coupled, so initialize and then cyclically update the factors; this has a message-passing interpretation (Winn and Bishop, 2004).
 
Lower Bound. The bound L(q) can also be evaluated explicitly. This is useful for verifying the maths and the code (the bound should never decrease), and also useful for model comparison.
Illustration: Univariate Gaussian. Likelihood function: x_n ~ N(μ, τ⁻¹). Conjugate prior: Gaussian-gamma, p(μ | τ) = N(μ | μ₀, (λ₀ τ)⁻¹) and p(τ) = Gam(τ | a₀, b₀). Factorized variational distribution: q(μ, τ) = q(μ) q(τ).
Initial Configuration
After Updating
After Updating
Converged Solution
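A minimal sketch of the resulting coordinate updates, following the standard factorized treatment of a univariate Gaussian with unknown mean and precision under a Gaussian-gamma prior (as in Bishop, Pattern Recognition and Machine Learning, section 10.1); the prior settings and function name are illustrative.

```python
import numpy as np

def vb_univariate_gaussian(x, mu0=0.0, lam0=1.0, a0=1e-3, b0=1e-3, n_iter=50):
    """Cyclic updates for q(mu) = N(mu_N, 1/lam_N) and q(tau) = Gamma(a_N, b_N)."""
    N = len(x)
    xbar = np.mean(x)
    E_tau = a0 / b0                                   # initial guess for E[tau]
    for _ in range(n_iter):
        # Update q(mu) given the current E[tau]
        mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
        lam_N = (lam0 + N) * E_tau
        # Update q(tau) given the current q(mu)
        a_N = a0 + (N + 1) / 2.0
        E_sq = np.sum((x - mu_N) ** 2) + N / lam_N    # E[sum_n (x_n - mu)^2]
        E_sq0 = (mu_N - mu0) ** 2 + 1.0 / lam_N       # E[(mu - mu0)^2]
        b_N = b0 + 0.5 * (E_sq + lam0 * E_sq0)
        E_tau = a_N / b_N
    return mu_N, lam_N, a_N, b_N
```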
Variational Mixture of Gaussians. Assume a factorized posterior distribution over latent variables and parameters, q(Z, π, μ, Λ) = q(Z) q(π, μ, Λ); no other approximations are needed!
Variational Equations for GMM
Lower Bound for GMM
VIBES Bishop, Spiegelhalter and Winn (2002)
ML Limit. If instead we choose delta-function (point-estimate) distributions for the parameters, we recover the maximum likelihood EM algorithm.
Bound vs. K for Old Faithful Data
Bayesian Model Complexity
Sparse Bayes for Gaussian Mixture (Corduneanu and Bishop, 2001). Start with a large value of K, treat the mixing coefficients as parameters, and maximize the marginal likelihood: excess components are pruned out.
 
 
Summary: Variational Gaussian Mixtures. Simple modification of maximum likelihood EM code; small computational overhead compared to EM; no singularities; automatic model order selection.
Continuous Latent Variables. Conventional PCA: form the data covariance matrix and take its eigenvector decomposition; the leading L eigenvectors minimize the sum-of-squares projection error (see the sketch below). But this is not a probabilistic model, and how should we choose L?
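A minimal numpy sketch of conventional PCA as just described; the function name is illustrative.

```python
import numpy as np

def pca(X, L):
    """Conventional PCA via eigendecomposition of the data covariance matrix.
    Returns the top-L eigenvectors (columns of W), the projections, and the eigenvalues."""
    Xc = X - X.mean(axis=0)                    # centre the data
    S = Xc.T @ Xc / len(X)                     # data covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:L]      # indices of the L largest eigenvalues
    W = eigvecs[:, order]                      # principal directions
    Z = Xc @ W                                 # L-dimensional projections
    return W, Z, eigvals[order]
```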
Probabilistic PCA (Tipping and Bishop, 1998). An L-dimensional continuous latent space, p(z) = N(z | 0, I), mapped into the D-dimensional data space, p(x | z) = N(x | W z + μ, σ² I). Isotropic noise relates the model to PCA; a diagonal noise covariance gives factor analysis.
Probabilistic PCA. Marginal distribution: p(x) = N(x | μ, C) with C = W Wᵀ + σ² I. Advantages: exact ML solution; computationally efficient EM algorithm; captures the dominant correlations with few parameters; extends to mixtures of PPCA and to Bayesian PCA; building block for more complex models.
EM for PCA (sequence of figure slides; a code sketch follows).
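A compact numpy sketch of the EM updates for probabilistic PCA under the model given above, p(z) = N(0, I) and p(x | z) = N(Wz + μ, σ²I); the function name, initialization, and iteration count are illustrative.

```python
import numpy as np

def em_ppca(X, L, n_iter=100, seed=0):
    """EM for probabilistic PCA: x = W z + mu + noise, z ~ N(0, I_L), noise ~ N(0, sigma2 I_D)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X.mean(axis=0)
    Xc = X - mu
    W = rng.normal(size=(D, L))
    sigma2 = 1.0
    for _ in range(n_iter):
        # E step: posterior moments of the latent variables
        M = W.T @ W + sigma2 * np.eye(L)                  # (L, L)
        Minv = np.linalg.inv(M)
        Ez = Xc @ W @ Minv                                # rows are E[z_n]
        Ezz = N * sigma2 * Minv + Ez.T @ Ez               # sum_n E[z_n z_n^T]
        # M step: re-estimate W and the noise variance
        W = (Xc.T @ Ez) @ np.linalg.inv(Ezz)
        sigma2 = (np.sum(Xc ** 2)
                  - 2.0 * np.sum(Ez * (Xc @ W))
                  + np.trace(Ezz @ W.T @ W)) / (N * D)
    return W, mu, sigma2
```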
Bayesian PCA (Bishop, 1998). Gaussian prior over the columns of W with automatic relevance determination (ARD): unneeded latent dimensions are switched off, giving an automatic choice of the effective dimensionality (figures contrast ML PCA with Bayesian PCA).
Non-linear Manifolds Example: images of a rigid object
Bayesian Mixture of BPCA Models
 
Flexible Sprites (Jojic and Frey, 2001). Automatic decomposition of a video sequence into a background model, an ordered set of masks (one per object per frame), and a foreground model (one per object per frame).
 
Transformed Component Analysis. A generative model that now includes transformations (translations), extended to L layers. Inference is intractable, so a variational framework is used.
 
Bayesian Constellation Model (Li, Fergus and Perona, 2003). Object recognition from small training sets; variational treatment of a fully Bayesian model.
Bayesian Constellation Model
Summary of Part 2. Discrete and continuous latent variables; the EM algorithm. Build complex models from simple components, represented graphically and incorporating prior knowledge. Variational inference; Bayesian model comparison.
