Network Intelligence and Analysis Lab
Clustering Methods via the EM Algorithm
2014.07.10
Sanghyuk Chun
Machine Learning and Unsupervised Learning
• Machine Learning
  • Training data
  • Learning model
• Unsupervised Learning
  • Training data without labels
  • Input data: $D = \{x_1, x_2, \ldots, x_N\}$
  • Most unsupervised learning problems try to find hidden structure in unlabeled data
  • Examples: clustering, dimensionality reduction (PCA, LDA), ...
Unsupervised Learning and Clustering
• Clustering
  • Grouping objects in such a way that objects in the same group are more similar to each other than to objects in other groups
  • Input: a set of objects (or data) without group information
  • Output: a cluster index for each object
  • Usage: customer segmentation, image segmentation, ...
• (Figure: input data → clustering algorithm → clustered output)
K-means Clustering
• Introduction
• Optimization
K-means Clustering
• Intuition: data points in the same cluster are closer to each other than to data points in other clusters
• Goal: minimize the distance between data points in the same cluster
• Objective function:
  • $J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
  • where N is the number of data points and K is the number of clusters
  • $r_{nk} \in \{0, 1\}$ is an indicator variable describing which of the K clusters the data point $\mathbf{x}_n$ is assigned to
  • $\boldsymbol{\mu}_k$ is a prototype associated with the k-th cluster
  • Eventually $\boldsymbol{\mu}_k$ coincides with the center (mean) of cluster k
K-means Clustering – Optimization
• Objective function:
  • $\arg\min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• This problem can be solved through an iterative procedure
  • Step 1: minimize J with respect to $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
  • Step 2: minimize J with respect to $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
  • Repeat Steps 1 and 2 until convergence
• Does it always converge?
Optional – Biconvex Optimization
• Biconvex optimization is a generalization of convex optimization in which the objective function and the constraint set may be biconvex
• $f(x, y)$ is biconvex if, for fixed x, $f_x(y) = f(x, y)$ is convex over Y and, for fixed y, $f_y(x) = f(x, y)$ is convex over X
• One way to solve a biconvex optimization problem is to iteratively solve the corresponding convex subproblems
  • This does not guarantee the global optimum
  • But it always converges to some local optimum
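As a concrete illustration (my addition, not from the slides), $f(x, y) = x^2 y^2$ is biconvex but not jointly convex:

```latex
% For fixed y, f(x, y) = (y^2) x^2 is convex in x; by symmetry it is convex in y for fixed x.
% Joint convexity fails, as the Hessian shows:
\[
  f(x, y) = x^2 y^2, \qquad
  \nabla^2 f(x, y) =
  \begin{pmatrix}
    2y^2 & 4xy \\
    4xy  & 2x^2
  \end{pmatrix},
  \qquad
  \det \nabla^2 f = 4x^2 y^2 - 16x^2 y^2 = -12\, x^2 y^2 \le 0 .
\]
% The determinant is negative whenever xy != 0, so f is not jointly convex,
% even though each partial minimization (the convex subproblem above) is convex.
```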
K-means Clustering – Optimization
• $\arg\min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• Step 1: minimize J with respect to $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
  • $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• Step 2: minimize J with respect to $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
  • Setting the derivative with respect to $\boldsymbol{\mu}_k$ to zero gives
  • $2 \sum_n r_{nk} (\mathbf{x}_n - \boldsymbol{\mu}_k) = 0$
  • $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}$
  • $\boldsymbol{\mu}_k$ is equal to the mean of all the data assigned to cluster k (a code sketch of this procedure follows below)
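A minimal NumPy sketch of the two-step procedure above (illustrative only; the function name, initialization scheme, and stopping rule are my own assumptions, not from the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate the two minimization steps of the K-means objective J."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    mu = X[rng.choice(N, size=K, replace=False)]   # initialize prototypes from the data

    for _ in range(n_iters):
        # Step 1: assign each point to its nearest prototype (minimize J w.r.t. r_nk)
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # (N, K)
        assign = dists.argmin(axis=1)

        # Step 2: move each prototype to the mean of its assigned points (minimize J w.r.t. mu_k)
        new_mu = np.array([X[assign == k].mean(axis=0) if np.any(assign == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):   # converged: the means no longer change
            break
        mu = new_mu
    return assign, mu
```

For example, `assign, mu = kmeans(np.random.randn(500, 2), K=3)` partitions 500 random 2-D points into three clusters.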
K-means Clustering – Conclusion
• Advantages of K-means clustering
  • Easy to implement (kmeans in Matlab, kcluster in Python)
  • In practice, it works well
• Disadvantages of K-means clustering
  • It can converge to a local optimum
  • Computing the Euclidean distance to every point is expensive
    • Solution: batch K-means
  • The Euclidean distance is not robust to outliers
    • Solution: K-medoids algorithms (use a different metric)
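For reference, a short usage sketch with an off-the-shelf implementation (assuming scikit-learn is installed; the slides themselves only mention Matlab's kmeans and Python's kcluster):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.randn(500, 2)  # toy data
# n_init restarts mitigate (but do not remove) the local-optimum issue noted above
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```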
Mixture of Gaussians
• Mixture Model
• EM Algorithm
• EM for Gaussian Mixtures
Mixture of Gaussians
• Assumption: there are k components $\{c_i\}_{i=1}^{k}$
• Component $c_i$ has an associated mean vector $\mu_i$
• Each component generates data from a Gaussian with mean $\mu_i$ and covariance matrix $\Sigma_i$
• (Figure: five Gaussian components with means $\mu_1, \ldots, \mu_5$)
Gaussian Mixture Model
• Represent the model as a linear combination of Gaussians
• Probability density function of a GMM:
  • $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
  • $\mathcal{N}(x \mid \mu_k, \Sigma_k) = \dfrac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k) \right\}$
• This model is called a mixture of Gaussians, or Gaussian Mixture Model (GMM)
• Each Gaussian density is called a component of the mixture and has its own mean $\mu_k$ and covariance $\Sigma_k$
• The parameters $\pi_k$ are called mixing coefficients ($\sum_k \pi_k = 1$)
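A small sketch of evaluating this density for a toy two-component mixture (the parameters are my own illustrative choices; assumes SciPy is available):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative two-component mixture in 2D (parameters chosen arbitrarily)
pis   = np.array([0.4, 0.6])                               # mixing coefficients, sum to 1
means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs  = [np.eye(2), np.array([[1.0, 0.3], [0.3, 1.0]])]

def gmm_pdf(x):
    """p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)."""
    return sum(pi * multivariate_normal.pdf(x, mean=m, cov=S)
               for pi, m, S in zip(pis, means, covs))

print(gmm_pdf(np.array([1.0, 1.0])))
```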
Clustering using a Mixture Model
• $p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$, where $\sum_k \pi_k = 1$
• Input:
  • The training set: $\{x_i\}_{i=1}^{N}$
  • Number of clusters: k
• Goal: model this data using a mixture of Gaussians
  • Mixing coefficients $\pi_1, \pi_2, \ldots, \pi_k$
  • Means and covariances: $\mu_1, \mu_2, \ldots, \mu_k; \; \Sigma_1, \Sigma_2, \ldots, \Sigma_k$
Maximum Likelihood of GMM
• $p(x \mid G) = p(x \mid \pi_1, \mu_1, \ldots) = \sum_i p(x \mid c_i)\, p(c_i) = \sum_i \pi_i \, \mathcal{N}(x \mid \mu_i, \Sigma_i)$
• $p(x_1, x_2, \ldots, x_N \mid G) = \prod_i p(x_i \mid G)$
• The log-likelihood function is given by
  • $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Goal: find the parameters that maximize the log-likelihood
• Problem: the maximum likelihood solution is hard to compute directly
• Solution: use the EM algorithm
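A sketch of evaluating this log-likelihood numerically (it reuses hypothetical parameters pis/means/covs as in the earlier density sketch; the use of logsumexp for numerical stability is my addition, not discussed in the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, pis, means, covs):
    """ln p(X | pi, mu, Sigma) = sum_n ln sum_k pi_k * N(x_n | mu_k, Sigma_k)."""
    # log_probs[n, k] = ln pi_k + ln N(x_n | mu_k, Sigma_k)
    log_probs = np.column_stack([
        np.log(pi) + multivariate_normal.logpdf(X, mean=m, cov=S)
        for pi, m, S in zip(pis, means, covs)
    ])
    return logsumexp(log_probs, axis=1).sum()
```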
EM (Expectation-Maximization) Algorithm
• The EM algorithm is an iterative procedure for finding the MLE
  • An expectation (E) step creates a function for the expectation of the log-likelihood, evaluated using the current estimate of the parameters
  • A maximization (M) step computes parameters maximizing the expected log-likelihood found in the E step
  • These parameter estimates are then used to determine the distribution of the latent variables in the next E step
• EM always converges to a local optimum
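Written out formally (a standard formulation added here for reference, where $\theta$ denotes all model parameters and $Z$ the latent variables):

```latex
% E step: form the expected complete-data log-likelihood under the current posterior over Z
\[
  Q(\theta \mid \theta^{\text{old}})
  = \mathbb{E}_{Z \mid X, \theta^{\text{old}}}\!\left[ \ln p(X, Z \mid \theta) \right]
\]
% M step: maximize it with respect to the parameters
\[
  \theta^{\text{new}} = \arg\max_{\theta} \; Q(\theta \mid \theta^{\text{old}})
\]
```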
K-means Revisited: EM and K-means
• $\arg\min_{\{r_{nk}, \boldsymbol{\mu}_k\}} \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2$
• E-step: minimize J with respect to $r_{nk}$, keeping $\boldsymbol{\mu}_k$ fixed
  • $r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \|\mathbf{x}_n - \boldsymbol{\mu}_j\|^2 \\ 0 & \text{otherwise} \end{cases}$
• M-step: minimize J with respect to $\boldsymbol{\mu}_k$, keeping $r_{nk}$ fixed
  • $\boldsymbol{\mu}_k = \dfrac{\sum_n r_{nk} \mathbf{x}_n}{\sum_n r_{nk}}$
Latent Variable for GMM
• Let $z_k$ be a Bernoulli random variable with probability $\pi_k$
  • $p(z_k = 1) = \pi_k$, where $\sum_k z_k = 1$ and $\sum_k \pi_k = 1$
• Because z uses a 1-of-K representation, this distribution can be written in the form
  • $p(z) = \prod_{k=1}^{K} \pi_k^{z_k}$
• Similarly, the conditional distribution of x given a particular value of z is a Gaussian
  • $p(x \mid z) = \prod_{k=1}^{K} \mathcal{N}(x \mid \mu_k, \Sigma_k)^{z_k}$
Latent Variable for GMM
• The joint distribution is given by $p(x, z) = p(z)\, p(x \mid z)$
  • $p(x) = \sum_z p(z)\, p(x \mid z) = \sum_k \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$
• Thus the marginal distribution of x is a Gaussian mixture of the above form
• Now we are able to work with the joint distribution instead of the marginal distribution
• Graphical representation of a GMM for a set of N i.i.d. data points $\{x_n\}$ with corresponding latent variables $\{z_n\}$, where n = 1, ..., N
  • (Figure: plate diagram with latent variable $\mathbf{z}_n$ and observation $\mathbf{x}_n$ inside a plate of size N, and parameters $\boldsymbol{\pi}$, $\boldsymbol{\mu}$, $\boldsymbol{\Sigma}$ outside)
EM for Gaussian Mixtures (E-step)
• Conditional probability of z given x
• From Bayes' theorem,
  • $\gamma(z_k) \equiv p(z_k = 1 \mid \mathbf{x}) = \dfrac{p(z_k = 1)\, p(\mathbf{x} \mid z_k = 1)}{\sum_{j=1}^{K} p(z_j = 1)\, p(\mathbf{x} \mid z_j = 1)} = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• $\gamma(z_k)$ can also be viewed as the responsibility that component k takes for 'explaining' the observation x
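A sketch of the E-step as a function, computing responsibilities for all data points at once (the helper name is my own; assumes SciPy):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, means, covs):
    """Return the (N, K) responsibility matrix gamma(z_nk) via Bayes' theorem."""
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=m, cov=S)   # pi_k * N(x_n | mu_k, Sigma_k)
        for pi, m, S in zip(pis, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)  # normalize over components
```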
EM for Gaussian Mixtures (M-step)
• Likelihood function for the GMM:
  • $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sum_{n=1}^{N} \ln \sum_{k=1}^{K} \pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$
• Setting the derivative of the log-likelihood with respect to the means $\mu_k$ of the Gaussian components to zero, we obtain
  • $\mu_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
EM for Gaussian Mixtures (M-step)
• Setting the derivative of the log-likelihood with respect to $\Sigma_k$ to zero, we obtain
  • $\boldsymbol{\Sigma}_k = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \mu_k)(\mathbf{x}_n - \mu_k)^\top$
• Maximizing the likelihood with respect to the mixing coefficients $\pi_k$ using a Lagrange multiplier, we obtain
  • $\ln p(\mathbf{X} \mid \boldsymbol{\pi}, \boldsymbol{\mu}, \boldsymbol{\Sigma}) + \lambda \left( \sum_{k=1}^{K} \pi_k - 1 \right)$
  • $\pi_k = \dfrac{N_k}{N}$
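A matching sketch of the M-step updates (same assumptions as the E-step sketch above; gamma is the (N, K) responsibility matrix):

```python
import numpy as np

def m_step(X, gamma):
    """Re-estimate (pis, means, covs) from the responsibilities gamma."""
    N, K = gamma.shape
    Nk = gamma.sum(axis=0)                     # effective number of points per component
    pis = Nk / N                               # pi_k = N_k / N
    means = (gamma.T @ X) / Nk[:, None]        # mu_k = (1/N_k) sum_n gamma_nk x_n
    covs = []
    for k in range(K):
        diff = X - means[k]                    # (N, d)
        covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k])   # Sigma_k
    return pis, means, covs
```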
EM for Gaussian Mixtures
• $\mu_k$, $\Sigma_k$, and $\pi_k$ do not constitute a closed-form solution for the parameters of the mixture model, because the responsibilities $\gamma(z_{nk})$ depend on those parameters in a complex way
  • $\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• In the EM algorithm for a GMM, $\gamma(z_{nk})$ and the parameters are optimized iteratively
  • In the E step, the responsibilities (posterior probabilities) are evaluated using the current values of the parameters
  • In the M step, the means, covariances, and mixing coefficients are re-estimated using the E-step results
EM for Gaussian Mixtures
• Initialize the means $\mu_k$, covariances $\Sigma_k$, and mixing coefficients $\pi_k$, and evaluate the initial value of the log-likelihood
• E step: evaluate the responsibilities using the current parameters
  • $\gamma(z_{nk}) = \dfrac{\pi_k \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(\mathbf{x}_n \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}$
• M step: re-estimate the parameters using the current responsibilities
  • $\mu_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, \mathbf{x}_n$
  • $\boldsymbol{\Sigma}_k^{\text{new}} = \dfrac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk}) (\mathbf{x}_n - \mu_k^{\text{new}})(\mathbf{x}_n - \mu_k^{\text{new}})^\top$
  • $\pi_k^{\text{new}} = \dfrac{N_k}{N}$, where $N_k = \sum_{n=1}^{N} \gamma(z_{nk})$
• Repeat the E step and M step until convergence
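Putting the steps together, a compact end-to-end sketch (initialization from random data points and a fixed iteration count are my own simplifications; a log-likelihood-based stopping rule, as on the slide, could replace them):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Fit a K-component GMM to X with EM; returns (pis, means, covs, gamma)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    means = X[rng.choice(N, size=K, replace=False)]            # initialize means from the data
    covs = [np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)]  # small jitter keeps covariances invertible

    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk)
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=means[k], cov=covs[k])
            for k in range(K)
        ])
        gamma = weighted / weighted.sum(axis=1, keepdims=True)

        # M step: re-estimate parameters from the responsibilities
        Nk = gamma.sum(axis=0)
        pis = Nk / N
        means = (gamma.T @ X) / Nk[:, None]
        covs = []
        for k in range(K):
            diff = X - means[k]
            covs.append((gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d))
    return pis, means, covs, gamma
```

Cluster assignments can then be read off as `gamma.argmax(axis=1)`; in practice one would also monitor the log-likelihood (as in the earlier sketch) to detect convergence.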
Relationship between the K-means Algorithm and GMM
• We can derive the K-means algorithm as a particular limit of EM for the Gaussian Mixture Model
• Consider a Gaussian mixture model whose covariance matrices are given by $\varepsilon I$, where $\varepsilon$ is a variance parameter and $I$ is the identity matrix
• If we consider the limit $\varepsilon \to 0$, the expected complete-data log-likelihood of the GMM becomes
  • $\mathbb{E}_z\!\left[\ln p(X, Z \mid \mu, \Sigma, \pi)\right] \;\to\; -\dfrac{1}{2} \sum_n \sum_k r_{nk} \, \|\mathbf{x}_n - \boldsymbol{\mu}_k\|^2 + C$
• Thus we see that, in this limit, maximizing the expected complete-data log-likelihood is equivalent to the K-means algorithm
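The key intermediate step (the standard argument, e.g. Bishop, PRML §9.3.2; added here for completeness): with $\Sigma_k = \varepsilon I$, the responsibilities collapse to the hard K-means assignments as $\varepsilon \to 0$.

```latex
\[
  \gamma(z_{nk})
  = \frac{\pi_k \exp\!\left( -\lVert \mathbf{x}_n - \boldsymbol{\mu}_k \rVert^2 / 2\varepsilon \right)}
         {\sum_{j} \pi_j \exp\!\left( -\lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 / 2\varepsilon \right)}
  \;\xrightarrow[\;\varepsilon \to 0\;]{}\;
  r_{nk} =
  \begin{cases}
    1 & \text{if } k = \arg\min_j \lVert \mathbf{x}_n - \boldsymbol{\mu}_j \rVert^2 \\
    0 & \text{otherwise,}
  \end{cases}
\]
% because the term with the smallest squared distance decays most slowly and dominates the sum.
```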
