Clustering
Shadi Albarqouni, M.Sc.
Graduate Research Assistant | PhD Candidate
shadi.albarqouni@tum.de
Computer Aided Medical Procedures (CAMP) | TU München (TUM)
Machine Learning in Medical Imaging (MLMI)
Winter Semester 15/16
BioMedical Computing (BMC) Master Program
Outline
1 Introduction
2 Parametric, cost-based clustering
1. K-Means
2. K-Medoids
3. Kernel K-Means
4. Spectral Clustering
5. Extensions
6. Comparison
3 Parametric, model-based clustering
1. Mixture Models
4 Non-parametric, model-based clustering
1. Mean-shift
What is clustering?
Definition (Clustering)
Given n unlabelled data points, separate them into K clusters.
Dilemma! [6]
• What is a Cluster?
(Compact vs. Connected)
• How many clusters K?
(Parametric vs. Non-parametric)
• Soft vs. Hard clustering.
(Model vs. Cost based)
• Data representation.
(Vector vs. Similarities)
• Classification vs. Clustering.
• Stability [7].
Applications
• Image Retrieval
• Image Compression
• Image Segmentation
• Pattern Recognition
Notation
• D = {x_1, x_2, ..., x_n}, x_i \in \mathbb{R}^m, is the data set (arranged as an m \times n matrix).
• m is the feature dimension of x_i.
• n is the number of instances.
• K is the number of clusters.
• \Pi = {C_1, C_2, ..., C_K} is the partition of D into clusters C_k.
• c(x_i) is the label/cluster of instance x_i.
• r_{nk} \in {0, 1} indicates whether instance n is assigned to cluster k.
Objective
Find the partition \Pi minimizing the cost function L(\Pi).
Parametric, cost-based clustering
Parametric: K is defined in advance.
Cost-based: hard clustering driven by a cost function.
Selected Algorithms:
• K-Means [8].
• K-Medoids [11].
• Kernel K-Means [12].
• Spectral Clustering [10].
K-Means
• K-Means algorithm:
1. Initialize: Pick K random samples from the dataset D as the cluster
centroids {\mu_1, \mu_2, ..., \mu_K}.
2. Assign points to the clusters: Partition the data points D into K
clusters \Pi = {C_1, C_2, ..., C_K} based on the Euclidean distance
between the points and the centroids (searching for the closest centroid).
3. Centroid update: Based on the points assigned to each cluster, a
new centroid \mu_k is computed.
4. Repeat: Do steps 2 and 3 until convergence.
5. Convergence: when the cluster centroids barely change, or when we have
compact and/or isolated clusters. Mathematically, when the cost
(distortion) function
L(\Pi) = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|^2
reaches a minimum.
• Practical issues:
a) The initialization. b) Pre-processing.
K-Means – Algorithm
input : Data points D = {x1, x2, ..., xn}, number of clusters K
output: Clusters \Pi = {C_1, C_2, ..., C_K}
Pick K random samples as the cluster centroids µk.
repeat
for i = 1 to n do
c(x_i) = \arg\min_k \|x_i - \mu_k\|_2^2 %Assign points to clusters
end
for k = 1 to K do
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i %Update the cluster centroid
end
until convergence;
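A minimal NumPy sketch of this loop (the function kmeans and its defaults are my own, not from the slides); it mirrors the assignment and update steps above:

import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Lloyd's algorithm. X: (n, m) data matrix; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]  # K random samples as centroids
    for _ in range(n_iter):
        # Assignment step: closest centroid in squared Euclidean distance
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (n, K)
        c = d2.argmin(axis=1)
        # Update step: centroid = mean of the points assigned to it
        new_mu = np.stack([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):  # centroids barely change -> converged
            break
        mu = new_mu
    return c, mu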
K-Medoids (1)
• K-Medoids algorithm:
1. Initialize: Pick K random samples from the dataset D as the medoids
{\mu_1, \mu_2, ..., \mu_K}.
2. Assign points to the clusters: Partition the data points D into K
clusters \Pi = {C_1, C_2, ..., C_K} based on the (Manhattan) dissimilarity
between the points and the medoids (searching for the minimum
dissimilarity).
3. Medoid update: Based on the points assigned to each cluster, swap
the medoid with a new data point and compute the cost (undo the
swap if the cost increases).
4. Repeat: Do steps 2 and 3 until convergence.
5. Convergence: when the cluster medoids barely change. Mathematically,
when the cost (distortion) function
L(\Pi) = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|_1
reaches a minimum.
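A naive PAM-style sketch of the swap loop under these assumptions (Manhattan dissimilarities, greedy swaps; the helper kmedoids is mine, not from the slides):

import numpy as np

def kmedoids(X, K, n_iter=50, seed=0):
    """Naive K-Medoids with greedy swaps. X: (n, m) data matrix."""
    rng = np.random.default_rng(seed)
    D = np.abs(X[:, None, :] - X[None, :, :]).sum(-1)  # Manhattan dissimilarities (n, n)
    med = rng.choice(len(X), size=K, replace=False)    # K random samples as medoids
    for _ in range(n_iter):
        cost = D[:, med].min(axis=1).sum()
        improved = False
        for k in range(K):
            for j in range(len(X)):                    # try swapping medoid k with point j
                cand = med.copy()
                cand[k] = j
                new_cost = D[:, cand].min(axis=1).sum()
                if new_cost < cost:                    # undo the swap unless the cost drops
                    med, cost, improved = cand, new_cost, True
        if not improved:                               # medoids barely change -> stop
            break
    return D[:, med].argmin(axis=1), med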
K-Medoids (2)
Figure: K-Means vs. K-Medoids
Kernel K-Means (1)
Definition
It is a generalization of the
standard K-Means algorithm.
• What happens if the clusters
are not linearly separable?
• Euclidean distance vs. Geodesic
distance.
Figure: Spiral and Jain datasets
Kernel K-Means (2)
• K-Means cannot be applied right away.
• Map the data points x_i \in D to a high-dimensional feature space
M using a nonlinear function \phi(x_i).
• Assume the clusters in the high-dimensional feature space (an RKHS)
are linearly separable; then K-Means can be applied.
• The cost function becomes
L_K(\Pi) = \sum_{k=1}^{K} \sum_{i \in C_k} \|\phi(x_i) - \phi(\mu_k)\|^2,
where
\|\phi(x_i) - \phi(\mu_k)\|^2 = \phi(x_i)^T \phi(x_i) - 2 \phi(x_i)^T \phi(\mu_k) + \phi(\mu_k)^T \phi(\mu_k).
Kernel K-Means (3)
• Using the kernel trick, K_{ij} = \phi(x_i)^T \phi(x_j), the Euclidean distance
in L_K(\Pi) can be computed easily using any kernel function K_{ij},
without explicitly knowing the nonlinear transformation \phi(x_i).
• Examples of (positive semidefinite) kernel functions:
1. Homogeneous polynomial kernel: K_{ij} = (x_i^T x_j)^\delta
2. Inhomogeneous polynomial kernel: K_{ij} = (x_i^T x_j + \gamma)^\delta
3. Gaussian kernel: K_{ij} = e^{-\|x_i - x_j\|^2 / (2\sigma^2)}
4. Laplacian kernel: K_{ij} = e^{-\|x_i - x_j\| / \sigma}
5. Sigmoid kernel: K_{ij} = \tanh(\gamma (x_i^T x_j) + \theta)
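Since \phi is never computed explicitly, the distance to a cluster's feature-space mean expands entirely into entries of the kernel matrix; a sketch of that expansion (the helper kernel_dist2 is hypothetical, not from the slides):

import numpy as np

def kernel_dist2(Kmat, i, members):
    """||phi(x_i) - (1/|C_k|) sum_{j in C_k} phi(x_j)||^2 from the kernel matrix.
    Kmat: (n, n) kernel matrix; members: indices of the points in C_k."""
    m = len(members)
    return (Kmat[i, i]
            - 2.0 * Kmat[i, members].sum() / m
            + Kmat[np.ix_(members, members)].sum() / m ** 2)

Note that the feature-space centroid \phi(\mu_k) is taken here as the mean of \phi over the cluster, which is how kernel K-Means sidesteps ever evaluating \phi.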
Kernel K-Means – Algorithm
input : Data points D = {x1, x2, ..., xn}, Kernel matrix Kij, number
of clusters K
output: Clusters \Pi = {C_1, C_2, ..., C_K}
Pick K random samples as the cluster centroids \mu_k.
repeat
for i = 1 to n do
for k = 1 to K do
Compute \|\phi(x_i) - \phi(\mu_k)\|^2 using K_{ij}
end
c(x_i) = \arg\min_k \|\phi(x_i) - \phi(\mu_k)\|^2
end
for k = 1 to K do
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i %Update the centroid (held implicitly, in feature space)
end
until convergence;
Spectral Clustering
Graph - Overview
• Fully connected, undirected, and weighted graph with N vertices
• The graph is represented by G = {\nu, \varepsilon, \omega}, where \nu is the set of
N vertices, \varepsilon the set of edges, and \omega the set of edge weights,
assigned using a heat kernel to build the adjacency matrix W:
W_{ij} = \begin{cases} e^{-\|x_i - x_j\|_2^2 / \sigma^2} & e_{ij} \in \varepsilon \\ 0 & \text{else} \end{cases}
• The degree matrix D is diagonal with entries D_{ii} = \sum_j W_{ij}
• Compute the normalized graph Laplacian matrix
\tilde{L} = I - D^{-1/2} W D^{-1/2}
Spectral Clustering – Algorithm
input : Normalized graph Laplacian matrix \tilde{L}, number of clusters K
output: Clusters \Pi = {C_1, C_2, ..., C_K}
Compute the first K eigenvectors U = {u_1, u_2, ..., u_K} \in \mathbb{R}^{n \times K} of \tilde{L}.
Compute \tilde{U} by normalizing the rows of U to norm 1.
Run K-Means on \tilde{U} \in \mathbb{R}^{n \times K}, treating the rows as K-dimensional
data points, or simply: D \leftarrow \tilde{U}^T.
Pick K random samples as the cluster centroids µk.
repeat
for i = 1 to n do
c(x_i) = \arg\min_k \|x_i - \mu_k\|_2^2 %Assign points to clusters
end
for k = 1 to K do
\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x_i %Update the cluster centroid
end
until convergence;
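Putting the graph construction and the spectral embedding together, a compact sketch (it reuses the kmeans sketch from above; sigma is a free parameter one has to tune):

import numpy as np

def spectral_clustering(X, K, sigma=1.0):
    """Normalized spectral clustering on a fully connected heat-kernel graph."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / sigma ** 2)                 # adjacency from the heat kernel
    np.fill_diagonal(W, 0.0)
    Dm12 = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - (Dm12[:, None] * W) * Dm12[None, :]  # I - D^{-1/2} W D^{-1/2}
    _, vecs = np.linalg.eigh(L)
    U = vecs[:, :K]                              # eigenvectors of the K smallest eigenvalues
    U = U / np.linalg.norm(U, axis=1, keepdims=True)  # normalize rows to norm 1
    return kmeans(U, K)[0]                       # K-Means on the embedded rows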
Extensions (1)
• Alternative cost (distortion) functions follow from the decomposition
\sum_{i=1}^{n} \sum_{j=1}^{n} \|x_i - x_j\|^2 = \underbrace{\sum_{k=1}^{K} \sum_{i,j \in C_k} \|x_i - x_j\|^2}_{\text{intracluster distance}} + \underbrace{\sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j \notin C_k} \|x_i - x_j\|^2}_{\text{intercluster distance}}
Since the left-hand side does not depend on the partition, minimizing the
first term is equivalent to maximizing the second:
1. Intracluster distance:
L(\Pi) = \sum_{k=1}^{K} \sum_{i,j \in C_k} \|x_i - x_j\|^2 + \text{constant}
2. Intercluster distance:
L(\Pi) = -\sum_{k=1}^{K} \sum_{i \in C_k} \sum_{j \notin C_k} \|x_i - x_j\|^2 + \text{constant}
Extensions (2)
3. K-Median:
L(\Pi) = \sum_{k=1}^{K} \sum_{i \in C_k} \|x_i - \mu_k\|
• Alternative Initialization:
1. K-Means++ [1]
2. Global Kernel K-Means [13]
• On selecting K¹:
1. Rule of thumb: K \approx \sqrt{n/2}
2. Elbow Method (see the sketch below)
3. Silhouette
• Soft clustering: Fuzzy C-Means [2]
• Variant: Spectral Clustering [14]
• Hierarchical Clustering
¹https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
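A sketch of the Elbow Method referenced above, again reusing the kmeans sketch: compute the distortion for a range of K and pick the K where the curve stops dropping sharply.

def elbow_curve(X, K_max=10):
    """Distortion L(Pi) for K = 1..K_max; the 'elbow' suggests a good K."""
    curve = []
    for K in range(1, K_max + 1):
        c, mu = kmeans(X, K)
        curve.append(((X - mu[c]) ** 2).sum())
    return curve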
Comparison
Algorithm            Data Rep.    Comp.   Out.   Cent.²
K-Means              Vectors      Low     No     ∉ D
K-Medians            Vectors      High    No     ∉ D
K-Medoids            Similarity   High    Yes    ∈ D
Kernel K-Means       Kernel       High    N/A    ∉ D
Spectral Clustering  Similarity   High    N/A    ∉ D

²Data Rep.: Data Representation; Comp.: Computational cost; Out.:
Handling outliers; Cent.: Centroids.
Parametric, model-based clustering
Parametric: K and the density function are defined (e.g. Gaussian).
Model-based: soft clustering based on the mixture density f(x):
f(x) = \sum_{k=1}^{K} \pi_k f_k(x), \quad \text{s.t. } \pi_k \geq 0, \; \sum_{k=1}^{K} \pi_k = 1,
where f_k(x) is a component of the mixture. f(x) is a Gaussian Mixture
Model (GMM) when f_k(x) = \mathcal{N}(x; \mu_k, \sigma_k^2).
Degree of membership:
\gamma_{ki} = P[x_i \in C_k] = \frac{\pi_k f_k(x_i)}{f(x_i)}
GMM parameters: \theta = \{\pi_{1:K}, \mu_{1:K}, \sigma_{1:K}\}.
Selected algorithm to estimate the parameters:
• EM-Algorithm [5].
Expectation-Maximization (EM) Algorithm
• Given data points D sampled i.i.d. from an unknown distribution f
• We model the distribution using the Maximum Likelihood (ML)
principle (log-likelihood):
l(\theta) = \ln f_\theta(D) = \sum_{i=1}^{n} \ln f_\theta(x_i) = \sum_{i=1}^{n} \ln \sum_{k=1}^{K} \pi_k f_k(x_i)
The objective: \theta_{ML} = \arg\max_\theta l(\theta)
Figure: GMM Clustering
EM – Algorithm
input : data points D, number of clusters K
output: Parameters \theta_{ML} = \{\pi_{1:K}, \mu_{1:K}, \sigma_{1:K}\}
Initialize the parameters θ at random.
repeat
for i = 1 to n do
for k = 1 to K do
\gamma_{ik} = \frac{\pi_k f_k(x_i)}{f(x_i)} %E-Step
end
end
for k = 1 to K do
\pi_k = \frac{1}{n} \sum_{i=1}^{n} \gamma_{ik} %M-Step
\mu_k = \frac{1}{n \pi_k} \sum_{i=1}^{n} \gamma_{ik} x_i
\sigma_k = \frac{1}{n \pi_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T
end
until convergence;
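A sketch of these E- and M-steps for a GMM with full covariances (scipy's multivariate_normal supplies f_k; the initialization and the small regularizer are my own choices):

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM for a Gaussian mixture. X: (n, m); returns (pi, mu, Sigma, gamma)."""
    n, m = X.shape
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(n, size=K, replace=False)]
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(m)] * K)
    for _ in range(n_iter):
        # E-step: responsibilities gamma_ik = pi_k f_k(x_i) / f(x_i)
        g = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        g /= g.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters from the responsibilities
        Nk = g.sum(axis=0)
        pi = Nk / n
        mu = (g.T @ X) / Nk[:, None]
        for k in range(K):
            d = X - mu[k]
            Sigma[k] = (g[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(m)
    return pi, mu, Sigma, g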
Non-parametric, model-based clustering
Idea: group the points by the peaks of the data density.
Parameters: the shape and the number of clusters K are determined by the
algorithm; however, you must define:
1. the smoothness of the density estimate (h)³
2. what counts as a peak
Selected Algorithm:
• Mean-shift [4].
³Rule of thumb: h = \left( \frac{4 \hat{\sigma}^5}{3n} \right)^{1/5}, where \hat{\sigma} is the standard deviation of the samples.
Mean-shift Algorithm
• Given data points D sampled i.i.d. from an unknown density f
• We estimate the shape of the density using the Kernel Density
Estimation (KDE) principle:
f_h(x) = \frac{1}{n h^m} \sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right),
where K(\cdot) is a kernel function that must be positive, symmetric,
and differentiable, e.g. the Gaussian kernel
K(z) = \frac{1}{(2\pi)^{m/2}} e^{-\|z\|^2 / 2}
• The objective: find the peaks of f_h(x) by setting \nabla f_h(x) = 0
• For the Gaussian kernel, that yields the fixed-point (mean-shift) update
x = \frac{\sum_{i=1}^{n} x_i K\left( \frac{x - x_i}{h} \right)}{\sum_{i=1}^{n} K\left( \frac{x - x_i}{h} \right)}
Figure: Mean-shift Clustering
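A sketch of the mean-shift iteration with a Gaussian kernel (the final mode-grouping step is a simple heuristic of mine, not from the slides):

import numpy as np

def mean_shift(X, h, n_iter=100, tol=1e-5):
    """Shift every point to a density peak, then group points by their peak."""
    modes = X.copy()
    for _ in range(n_iter):
        d2 = ((modes[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2 * h ** 2))                # Gaussian kernel weights K((x - x_i)/h)
        new = (w @ X) / w.sum(axis=1, keepdims=True)  # mean-shift update
        if np.abs(new - modes).max() < tol:
            break
        modes = new
    # group points whose modes coincide up to the bandwidth h
    _, labels = np.unique(np.round(modes / h).astype(int), axis=0, return_inverse=True)
    return labels, modes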
Summary
Acknowledgment
This tutorial was prepared with the help of
• Bishop's book [3],
• Meila's slides at MLSS 2011 [9], and
• Lichao's slides from the previous semester (SS15)
References
[1] David Arthur and Sergei Vassilvitskii. k-means++: The advantages of careful seeding. In Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1027–1035. Society for Industrial and Applied Mathematics, 2007.
[2] James C. Bezdek. Pattern Recognition with Fuzzy Objective Function Algorithms. Springer Science & Business Media, 2013.
[3] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[4] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.
[5] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 1–38, 1977.
[6] Anil K. Jain and Martin H. C. Law. Data clustering: A user's dilemma. In Pattern Recognition and Machine Intelligence, pages 1–10. Springer, 2005.
[7] Tilman Lange, Volker Roth, Mikio L. Braun, and Joachim M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
[8] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.
[9] Marina Meila. Classic and modern data clustering. Machine Learning Summer School (MLSS), 2011.
[10] Andrew Y. Ng, Michael I. Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems, 2:849–856, 2002.
[11] Hae-Sang Park and Chi-Hyuck Jun. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications, 36(2):3336–3341, 2009.
[12] Bernhard Schölkopf, Alexander Smola, and Klaus-Robert Müller. Nonlinear component analysis as a kernel eigenvalue problem. Neural Computation, 10(5):1299–1319, 1998.
[13] Grigorios F. Tzortzis and Aristidis C. Likas. The global kernel k-means algorithm for clustering in feature space. IEEE Transactions on Neural Networks, 20(7):1181–1194, 2009.
[14] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.