Jinliang Xu
jlxu@bupt.edu.cn
Department of Computer Science
Institute of Network Technology . BUPT
May 20, 2016
K-means, E.M. and Mixture Models
Remind: Two Main Problems in ML
• Two main problems in ML:
– Regression: Linear Regression, Neural net...
– Classification: Decision Tree, kNN, Bayesian Classifier...
• Today, we will learn:
– K-means: a simple unsupervised clustering algorithm.
– Expectation Maximization: a general algorithm for density estimation.
∗ We will see how to use EM in general cases and in the specific case of GMM.
– GMM: a tool for modelling Data-in-the-Wild (density estimator)
∗ We will also learn how to use GMM in a Bayesian Classifier
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
Unsupervised Learning
• Supervised learning:
– Label of each sample is included in the training set
Sample   Label
x1       y1
...      ...
xn       yk
• Unsupervised learning:
– Training set contains the samples only
Sample
x1
...
xn
Unsupervised Learning
[Figure 1: Unsupervised vs. Supervised Learning. (a) Supervised learning; (b) Unsupervised learning.]
What is unsupervised learning useful for?
• Collecting and labeling a large training set can be very expensive.
• It can help find features that are useful for categorization.
• Gain insight into the natural structure of the data.
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
K-means clustering
• Clustering algorithms aim to find
groups of “similar” data points among
the input data.
• K-means is an effective algorithm to ex-
tract a given number of clusters from a
training set.
• Once done, the cluster locations can
be used to classify data into distinct
classes.
K-means clustering
• Given:
– The dataset: {x1, x2, ..., xN}
– Number of clusters: K (K < N)
• Goal: find a partition S = {S1, S2, ..., SK} so that it minimizes the objective function

J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2    (1)

where rnk = 1 if xn is assigned to cluster Sk, and rnj = 0 for j ≠ k.
i.e. Find values for the {rnk} and the {µk} to minimize (1).
K-means clustering
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Select some initial values for the µk.
• Expectation: keep the µk fixed, minimize J with respect to the rnk.
• Maximization: keep the rnk fixed, minimize J with respect to the µk.
• Loop until there is no change in the partitions (or the maximum number of iterations is exceeded).
K-means clustering
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Expectation: J is a linear function of the rnk

r_{nk} = \begin{cases} 1 & \text{if } k = \arg\min_j \| x_n - \mu_j \|^2 \\ 0 & \text{otherwise} \end{cases}

• Maximization: setting the derivative of J with respect to µk to zero gives:

\mu_k = \frac{\sum_n r_{nk} x_n}{\sum_n r_{nk}}

Convergence of K-means: assured [why?], but it may lead to a local minimum of J [8].
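To make these two update steps concrete, here is a minimal NumPy sketch of the K-means loop. It is not the slides' code; the synthetic blobs, K = 3 and the iteration cap are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Minimal K-means: alternate the E-step (assignments r_nk) and
    the M-step (cluster means mu_k)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # initial means
    for _ in range(max_iter):
        # E-step: assign each x_n to the closest mean.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        labels = d2.argmin(axis=1)
        # M-step: recompute each mean as the average of its assigned points.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):                     # partition no longer changes
            break
        mu = new_mu
    return labels, mu

# Toy data: three 2-D blobs (an illustrative assumption).
X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [8, 8], [0, 8])])
labels, centers = kmeans(X, K=3)
print(centers)
```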
K-means clustering: How to understand?
J = \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \, \| x_n - \mu_k \|^2

• Expectation: minimize J with respect to the rnk
– For each xn, find the “closest” cluster mean µk and put xn into cluster Sk.
• Maximization: minimize J with respect to the µk
– For each cluster Sk, re-estimate the cluster mean µk as the average value of all samples in Sk.
• Loop until there is no change in the partitions (or the maximum number of iterations is exceeded).
Initialize with random clusters
Assign each point to nearest center
Recompute optimum centers (means)
Repeat: Assign points to nearest center
Repeat: Recompute centers
Repeat...
Repeat... until the clustering does not change.
The total error is reduced at every step, so the algorithm is guaranteed to converge.
K-means clustering: some variations
• Initial cluster centroids:
– Randomly selected
– Iterative procedure: k-means++ [2] (see the scikit-learn sketch after this slide)
• Number of clusters K:
– Empirically/experimentally: 2 ∼ √n
– Learning [6]
• Objective function:
– General dissimilarity measure: k-medoids algorithm.
• Speeding up:
– kd-trees for pre-processing [7]
– Triangle inequality for distance calculation [4]
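If scikit-learn is available, several of these variations are already built in. A minimal usage sketch follows; the library choice, the toy data and the parameter values are assumptions, not part of the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.vstack([np.random.randn(50, 2) + c for c in ([0, 0], [8, 8], [0, 8])])  # toy blobs

# k-means++ seeding [2] with several restarts; Elkan's triangle-inequality
# acceleration [4] can be selected with algorithm="elkan".
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)
print(km.inertia_)  # value of the objective J in Eq. (1)
```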
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
Expectation Maximization
E.M.
Expectation Maximization
• A general-purpose algorithm for MLE in a wide range of situations.
• First formally stated by Dempster, Laird and Rubin in 1977 [1]
• An excellent way to solve our unsupervised learning problem, as we will see
– EM is also used widely in other domains.
EM: a solution for MLE
• Given a statistical model with:
– a set X of observed data,
– a set Z of unobserved latent data,
– a vector of unknown parameters θ,
– a likelihood function L (θ; X, Z) = p (X, Z | θ)
• Roughly speaking, the aim of MLE is to determine θ = arg maxθ p (X | θ), i.e. to maximize the likelihood of the observed data with the latent Z marginalized out
– We know the old trick: partial derivatives of the log likelihood...
– But it is not always tractable [e.g.]
– Other solutions are available.
EM: General Case
L (θ; X, Z) = p (X, Z | θ)
• EM is just an iterative procedure for finding the MLE
• Expectation step: keep the current estimate θ(t) fixed and compute the expected value of the complete-data log likelihood, where the expectation is taken over Z given X and θ(t):

Q(\theta \mid \theta^{(t)}) = \mathbb{E}_{Z \mid X, \theta^{(t)}} [\log L(\theta; X, Z)] = \mathbb{E}_{Z \mid X, \theta^{(t)}} [\log p(X, Z \mid \theta)]

• Maximization step: find the parameters that maximize this quantity

\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})
EM: Motivation
• If we know the value of the parameters θ, we can find the value of latent variables
Z by maximizing the log likelihood over all possible values of Z
– Searching on the value space of Z.
• If we know Z, we can find an estimate of θ
– Typically by grouping the observed data points according to the value of asso-
ciated latent variable,
– then averaging the values (or some functions of the values) of the points in
each group.
To understand this motivation, let’s take K-means as a trivial example...
EM: informal description
Since both θ and Z are unknown, EM proceeds as an iterative algorithm:
1. Initialize the parameters θ to some random values.
2. Compute the best values of Z given these parameter values.
3. Use the just-computed values of Z to find better estimates for θ.
4. Iterate until convergence.
EM Convergence
• E.M. Convergence: Yes
– After each iteration, p (X | θ) must increase or remain the same [NOT OBVIOUS]
– But it cannot exceed 1 [OBVIOUS]
– Hence it must converge [OBVIOUS]
• Bad news: E.M. converges to a local optimum.
– Whether the algorithm converges to the global optimum depends on the ini-
tialization.
• Let’s take K-means as an example, again...
• Details can be found in [9].
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
Remind: Bayes Classifier
p(y = i \mid x) = \frac{p(x \mid y = i)\, p(y = i)}{p(x)}
Remind: Bayes Classifier
In case of a Gaussian Bayes Classifier:

p(y = i \mid x) = \frac{\frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)\right] p_i}{p(x)}
How can we deal with the denominator p (x)?
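One standard answer: p (x) is just the sum of the class numerators, so it cancels when we normalize the posteriors. A SciPy sketch of this; the two-class means, covariances and priors below are made-up numbers, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2-class Gaussian Bayes classifier in 2-D.
means  = [np.array([0.0, 0.0]), np.array([5.0, 5.0])]
covs   = [np.eye(2), 2.0 * np.eye(2)]
priors = [0.4, 0.6]                                    # p(y = i)

x = np.array([1.0, 1.5])
# Numerator of Bayes' rule for each class: p(x | y = i) p(y = i).
num = np.array([multivariate_normal.pdf(x, mean=m, cov=C) * p
                for m, C, p in zip(means, covs, priors)])
posterior = num / num.sum()                            # dividing by p(x), the sum of the numerators
print(posterior, posterior.argmax())
```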
Remind: The Single Gaussian Distribution
• Multivariate Gaussian

\mathcal{N}(x; \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x - \mu)^{T} \Sigma^{-1} (x - \mu)\right]

• For maximum likelihood

0 = \frac{\partial \ln \mathcal{N}(x_1, x_2, ..., x_N; \mu, \Sigma)}{\partial \mu}

• and the solution is

\mu_{ML} = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad \Sigma_{ML} = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu_{ML})(x_i - \mu_{ML})^{T}
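A quick numerical check of these ML estimates with NumPy; the synthetic data below is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data from a known 2-D Gaussian (illustrative assumption).
X = rng.multivariate_normal(mean=[2.0, -1.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=1000)

mu_ml = X.mean(axis=0)                  # (1/N) sum_i x_i
diff = X - mu_ml
sigma_ml = diff.T @ diff / len(X)       # (1/N) sum_i (x_i - mu)(x_i - mu)^T
print(mu_ml)
print(sigma_ml)                         # approaches the true covariance as N grows
```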
The GMM assumption
• There are k components: c1, c2, ..., ck
• Component ci has an associated mean vector µi
[Illustration: component means µ1, µ2, µ3]
The GMM assumption
• There are k components: c1, c2, ..., ck
• Component ci has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
• Each sample is generated according to the following guidelines:
[Illustration: component means µ1, µ2, µ3]
The GMM assumption
• There are k components: c1, c2, ..., ck
• Component ci has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
• Each sample is generated according to the following guidelines:
– Randomly select component ci with probability P (ci) = wi, s.t. w1 + w2 + ... + wk = 1
[Illustration: the selected component’s mean µ2]
The GMM assumption
• There are k components: c1, c2, ..., ck
• Component ci has an associated mean vector µi
• Each component generates data from a Gaussian with mean µi and covariance matrix Σi
• Each sample is generated according to the following guidelines:
– Randomly select component ci with probability P (ci) = wi, s.t. w1 + w2 + ... + wk = 1
– Draw the sample x ∼ N (µi, Σi)
[Illustration: a sample x drawn from the selected component around µ2]
Probability density function of GMM
“Linear combination” of Gaussians:
f(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \text{where } \sum_{i=1}^{k} w_i = 1

(a) The pdf of a 1D GMM with 3 components: the weighted components w1 N (µ1, σ1²), w2 N (µ2, σ2²), w3 N (µ3, σ3²) and their sum f (x). (b) The pdf of a 2D GMM with 3 components.
Figure 2: Probability density function of some GMMs.
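As an illustration of Figure 2(a), the following SciPy sketch evaluates such a 1-D mixture density on a grid; the particular weights, means and variances are invented for the example.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D GMM with 3 components.
w  = np.array([0.3, 0.5, 0.2])      # mixture weights, must sum to 1
mu = np.array([60.0, 120.0, 180.0])
sd = np.array([15.0, 25.0, 10.0])

x = np.linspace(0, 250, 1000)
# Each row is one weighted component w_i * N(x; mu_i, sigma_i^2).
components = w[:, None] * norm.pdf(x[None, :], loc=mu[:, None], scale=sd[:, None])
f = components.sum(axis=0)          # f(x) = sum_i w_i N(x; mu_i, sigma_i^2)
print(f.max())
print((f * (x[1] - x[0])).sum())    # Riemann sum: the density integrates to about 1
```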
GMM: Problem definition
f(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \text{where } \sum_{i=1}^{k} w_i = 1

Given a training set, how do we model these data points using a GMM?
• Given:
– The training set: {x1, x2, ..., xN}
– Number of clusters: k
• Goal: model this data using a mixture of Gaussians
– Weights: w1, w2, ..., wk
– Means and covariances: µ1, µ2, ..., µk; Σ1, Σ2, ..., Σk
Computing likelihoods in unsupervised case
f(x) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i), \qquad \text{where } \sum_{i=1}^{k} w_i = 1

• Given a mixture of Gaussians, denoted by G, we can define the likelihood of any x:

P(x \mid G) = P(x \mid w_1, \mu_1, \Sigma_1, ..., w_k, \mu_k, \Sigma_k) = \sum_{i=1}^{k} P(x \mid c_i)\, P(c_i) = \sum_{i=1}^{k} w_i \, \mathcal{N}(x; \mu_i, \Sigma_i)

• So we can define the likelihood for the whole training set [Why?]

P(x_1, x_2, ..., x_N \mid G) = \prod_{i=1}^{N} P(x_i \mid G) = \prod_{i=1}^{N} \sum_{j=1}^{k} w_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j)
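In practice this product is evaluated in log space to avoid underflow. A small SciPy sketch of the training-set log-likelihood; the placeholder mixture parameters and toy data are assumptions, not from the slides.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, covs):
    """log P(x_1, ..., x_N | G) = sum_i log sum_j w_j N(x_i; mu_j, Sigma_j)."""
    # Log of each weighted component density, shape (N, k).
    log_terms = np.column_stack([np.log(w) + multivariate_normal.logpdf(X, mean=m, cov=C)
                                 for w, m, C in zip(weights, means, covs)])
    return logsumexp(log_terms, axis=1).sum()

# Placeholder 2-component mixture and toy data (illustrative assumptions).
X = np.random.randn(100, 2)
ll = gmm_log_likelihood(X,
                        weights=[0.5, 0.5],
                        means=[np.zeros(2), 3.0 * np.ones(2)],
                        covs=[np.eye(2), np.eye(2)])
print(ll)
```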
Estimating GMM parameters
• We know this: Maximum Likelihood Estimation

\ln P(X \mid G) = \sum_{i=1}^{N} \ln \left[ \sum_{j=1}^{k} w_j \, \mathcal{N}(x_i; \mu_j, \Sigma_j) \right]

– For the maximum likelihood:

0 = \frac{\partial \ln P(X \mid G)}{\partial \mu_j}
– This leads to non-linear non-analytically-solvable equations!
• Use gradient descent
– Slow but doable
• A much cuter and recently popular method...
E.M. for GMM
• Remember:
– We have the training set {x1, x2, ..., xN} and the number of components k.
– Assume we know p (c1) = w1, p (c2) = w2, ..., p (ck) = wk, and a common known variance σ².
– We don’t know µ1, µ2, ..., µk
The likelihood:

p(\text{data} \mid \mu_1, ..., \mu_k) = p(x_1, x_2, ..., x_N \mid \mu_1, ..., \mu_k)
  = \prod_{i=1}^{N} p(x_i \mid \mu_1, ..., \mu_k)
  = \prod_{i=1}^{N} \sum_{j=1}^{k} p(x_i \mid c_j, \mu_1, ..., \mu_k)\, p(c_j)
  = \prod_{i=1}^{N} \sum_{j=1}^{k} K \exp\left(-\frac{1}{2\sigma^2}(x_i - \mu_j)^2\right) w_j

(where K is the Gaussian normalizing constant)
E.M. for GMM
• For Max. Likelihood, we know \frac{\partial}{\partial \mu_j} \log p(\text{data} \mid \mu_1, \mu_2, ..., \mu_k) = 0
• Some wild algebra turns this into: for Maximum Likelihood, for each j:

\mu_j = \frac{\sum_{i=1}^{N} p(c_j \mid x_i, \mu_1, \mu_2, ..., \mu_k)\, x_i}{\sum_{i=1}^{N} p(c_j \mid x_i, \mu_1, \mu_2, ..., \mu_k)}

This is a set of k coupled non-linear equations in the µj’s.
• So:
– If, for each xi, we know p (cj | xi, µ1, µ2, ..., µk), then we could easily compute
µj,
– If we know each µj, we could compute p (cj | xi, µ1, µ2, ..., µk) for each xi
and cj.
E.M. for GMM
• E.M. is coming: on the t’th iteration, let our estimates be
λt = {µ1 (t) , µ2 (t) , ..., µk (t)}
• E-step: compute the expected classes of all data points for each class
p(c_j \mid x_i, \lambda_t) = \frac{p(x_i \mid c_j, \lambda_t)\, p(c_j \mid \lambda_t)}{p(x_i \mid \lambda_t)} = \frac{p(x_i \mid c_j, \mu_j(t), \sigma_j I)\, p(c_j)}{\sum_{m=1}^{k} p(x_i \mid c_m, \mu_m(t), \sigma_m I)\, p(c_m)}

• M-step: compute µ given our data’s class membership distributions

\mu_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)\, x_i}{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)}
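A compact NumPy sketch of exactly this simplified E.M.: equal known weights, a fixed spherical variance, and only the means updated. The toy data, σ and iteration count are illustrative assumptions.

```python
import numpy as np

def em_means_only(X, k, sigma=1.0, n_iter=50, seed=0):
    """EM for a GMM with known equal weights w_j = 1/k and a fixed spherical
    covariance sigma^2 I; only the means mu_j(t) are updated."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]       # initial means
    w = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities p(c_j | x_i, lambda_t), shape (N, k).
        # With a common sigma the Gaussian normalizing constants cancel.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        log_r = np.log(w) - d2 / (2.0 * sigma ** 2)
        r = np.exp(log_r - log_r.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M-step: mu_j(t+1) = sum_i p(c_j | x_i) x_i / sum_i p(c_j | x_i).
        mu = (r.T @ X) / r.sum(axis=0)[:, None]
    return mu

# Toy data: three 2-D blobs (illustrative assumption).
X = np.vstack([np.random.randn(100, 2) + c for c in ([0, 0], [6, 0], [0, 6])])
print(em_means_only(X, k=3))
```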
E.M. for General GMM: E-step
• On the t’th iteration, let our estimates be
λt = {µ1 (t) , µ2 (t) , ..., µk (t) , Σ1 (t) , Σ2 (t) , ..., Σk (t) , w1 (t) , w2 (t) , ..., wk (t)}
• E-step: compute the expected classes of all data points for each class
\tau_{ij}(t) \equiv p(c_j \mid x_i, \lambda_t) = \frac{p(x_i \mid c_j, \lambda_t)\, p(c_j \mid \lambda_t)}{p(x_i \mid \lambda_t)} = \frac{p(x_i \mid c_j, \mu_j(t), \Sigma_j(t))\, w_j(t)}{\sum_{m=1}^{k} p(x_i \mid c_m, \mu_m(t), \Sigma_m(t))\, w_m(t)}
E.M. for General GMM: M-step
• M-step: re-estimate the weights, means and covariances given our data’s class membership distributions

w_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)}{N} = \frac{1}{N} \sum_{i=1}^{N} \tau_{ij}(t)

\mu_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)\, x_i}{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)} = \frac{1}{N w_j(t+1)} \sum_{i=1}^{N} \tau_{ij}(t)\, x_i

\Sigma_j(t+1) = \frac{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)\, [x_i - \mu_j(t+1)][x_i - \mu_j(t+1)]^{T}}{\sum_{i=1}^{N} p(c_j \mid x_i, \lambda_t)} = \frac{1}{N w_j(t+1)} \sum_{i=1}^{N} \tau_{ij}(t)\, [x_i - \mu_j(t+1)][x_i - \mu_j(t+1)]^{T}
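Putting the E-step and M-step together, here is a self-contained NumPy sketch of full-covariance GMM EM. The initialization, iteration count, small covariance regularizer and toy data are assumptions; a production implementation would also add a log-likelihood stopping test.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=100, seed=0):
    """Full-covariance GMM fitted with EM, following the E/M updates above."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    w = np.full(k, 1.0 / k)                           # w_j = 1/k
    mu = X[rng.choice(N, size=k, replace=False)]      # random data points as means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(k)])
    for _ in range(n_iter):
        # E-step: tau_ij proportional to w_j * N(x_i; mu_j, Sigma_j).
        dens = np.column_stack([w[j] * multivariate_normal.pdf(X, mean=mu[j], cov=cov[j])
                                for j in range(k)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: closed-form updates from the slide above.
        Nj = tau.sum(axis=0)                          # sum_i tau_ij = N * w_j(t+1)
        w = Nj / N
        mu = (tau.T @ X) / Nj[:, None]
        for j in range(k):
            diff = X - mu[j]
            cov[j] = (tau[:, j][:, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return w, mu, cov

# Toy data: three 2-D blobs (an assumption for illustration).
X = np.vstack([np.random.randn(150, 2) * [1.0, 0.5] + c
               for c in ([0, 0], [5, 5], [0, 6])])
weights, means, covs = gmm_em(X, k=3)
print(weights)
print(means)
```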
E.M. for General GMM: Initialization
• wj = 1/k, j = 1, 2, ..., k
• Each µj is set to a randomly selected point
– Or use K-means for this initialization.
• Each Σj is computed using the equation on the previous slide...
Gaussian Mixture Example: Start
After first iteration
After 2nd iteration
After 3rd iteration
After 4th iteration
After 5th iteration
After 6th iteration
After 20th iteration
Local optimum solution
• E.M. is guaranteed to find a locally optimal solution by monotonically increasing the log-likelihood
• Whether it converges to the globally optimal solution depends on the initialization
GMM: Selecting the number of components
• We can run the E.M. algorithm with different numbers of components.
– We need a criterion for selecting the “best” number of components (e.g. an information criterion such as BIC; see the sketch after this slide)
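A short scikit-learn sketch of this model-selection loop using BIC; the library choice and the toy data are assumptions, not part of the slides.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: three 2-D blobs (illustrative assumption).
X = np.vstack([np.random.randn(200, 2) + c for c in ([0, 0], [6, 0], [0, 6])])

# Fit GMMs with different numbers of components and keep the lowest BIC.
bics = {}
for k in range(1, 7):
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0).fit(X)
    bics[k] = gmm.bic(X)   # lower BIC = better trade-off between fit and complexity
best_k = min(bics, key=bics.get)
print(bics)
print("selected k =", best_k)
```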
Contents
• Unsupervised Learning
• K-means clustering
• Expectation Maximization (E.M.)
– Regularized EM
– Model Selection
• Gaussian mixtures as a Density Estimator
– Gaussian mixtures
– EM for mixtures
• Gaussian mixtures for classification
Gaussian mixtures for classification
p(y = i \mid x) = \frac{p(x \mid y = i)\, p(y = i)}{p(x)}
• To build a Bayesian classifier based on GMM, we can use GMM to model data in
each class
– So each class is modeled by one k-component GMM.
• For example:
Class 0: p (y = 0) , p (x | θ0), (a 3-component mixture)
Class 1: p (y = 1) , p (x | θ1), (a 3-component mixture)
Class 2: p (y = 2) , p (x | θ2), (a 3-component mixture)
...
GMM for Classification
• As before, each class is modeled by a k-component GMM.
• A new test sample x is classified according to

c = \arg\max_i \; p(y = i)\, p(x \mid \theta_i), \qquad \text{where } p(x \mid \theta_i) = \sum_{j=1}^{k} w_j^{(i)} \, \mathcal{N}(x; \mu_j^{(i)}, \Sigma_j^{(i)})

(the superscript (i) indexes the mixture fitted to class i)
• Simple, quick (and is actually used!)
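A sketch of such a classifier with scikit-learn: one GMM per class plus the class prior, scoring each test point with log p (x | θi) + log p (y = i). The library, the toy data and k = 3 are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_gmm_bayes(X, y, k=3):
    """One k-component GMM per class plus the class priors p(y = i)."""
    classes = np.unique(y)
    models = {c: GaussianMixture(n_components=k, random_state=0).fit(X[y == c])
              for c in classes}
    priors = {c: float(np.mean(y == c)) for c in classes}
    return classes, models, priors

def predict(X_test, classes, models, priors):
    # score_samples gives log p(x | theta_i); add the log prior and take the argmax.
    scores = np.column_stack([np.log(priors[c]) + models[c].score_samples(X_test)
                              for c in classes])
    return classes[scores.argmax(axis=1)]

# Toy two-class problem (an assumption for illustration).
X0 = np.random.randn(200, 2)
X1 = np.random.randn(200, 2) + [4.0, 4.0]
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)
classes, models, priors = fit_gmm_bayes(X, y, k=3)
print(predict(X[:5], classes, models, priors))
```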
Q & A
References
[1] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1–38, 1977.
[2] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035, 2007.
[3] C. Yang, R. Duraiswami, N. Gumerov, and L. Davis. Improved fast Gauss transform and efficient kernel density estimation. In IEEE International Conference on Computer Vision, pages 464–471, 2003.
[4] Charles Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML), 2003.
[5] Haifeng Li, Keshu Zhang, and Tao Jiang. The regularized EM algorithm. In Proceedings of the 20th National Conference on Artificial Intelligence, pages 807–812, Pittsburgh, PA, 2005.
[6] Greg Hamerly and Charles Elkan. Learning the k in k-means. In Neural Information Processing Systems. MIT Press, 2003.
[7] Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(7):881–892, July 2002.
[8] J. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297. University of California Press, 1967.
[9] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.
