Machine Learning for Data Mining
Hierarchical Clustering
Andres Mendez-Vazquez
July 27, 2015
Outline
1 Hierarchical Clustering
   Definition
   Basic Ideas
2 Agglomerative Algorithms
   Introduction
   Problems with Agglomerative Algorithms
   Two Categories of Agglomerative Algorithms
      Matrix Based Algorithms
      Graph Based Algorithms
3 Divisive Algorithms
   Introduction
4 Algorithms for Large Data Sets
   Introduction
   Clustering Using REpresentatives (CURE)
Concepts
Hierarchical Clustering Algorithms
They are quite different from the clustering algorithms seen so far:
rather than a single partition, they produce a hierarchy of clusterings.
Dendrogram: Hierarchical Clustering
Hierarchical Clustering
A clustering is obtained by cutting the dendrogram at a desired level:
each connected component below the cut forms a cluster.
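As a quick illustration (not part of the original slides), the cut can be
done with SciPy; the toy data and the threshold 2.0 are placeholders:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(10, 2))  # toy data (placeholder)
Z = linkage(X, method="single")                    # build the dendrogram
labels = fcluster(Z, t=2.0, criterion="distance")  # cut at level t = 2.0
# Points sharing a label form one connected component below the cut.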
Example
Dendrogram (figure omitted in this extraction).
Basic Ideas
At each step t
A new clustering is obtained from the clustering produced at the
previous step t − 1.
Two Main Types
1 Agglomerative Algorithms
   1 Start with each item as a single cluster.
   2 Eventually all items belong to the same cluster.
2 Divisive Algorithms
   1 Start with all items in a single cluster.
   2 Eventually each item forms a cluster on its own.
Therefore
Given the previous ideas
It is necessary to define the concept of nesting!
After all, both the agglomerative and the divisive procedures build each
clustering from the previous one.
Nested Clustering
Definition
1 A clustering R_i containing k clusters is said to be nested in the
clustering R_{i+1}, which contains r < k clusters, if each cluster in
R_i is a subset of a set in R_{i+1}.
2 In addition, at least one cluster of R_i is a proper subset of a set
in R_{i+1}.
This is written as

R_i ⊏ R_{i+1}   (1)
Example
We have
The following set {x1, x2, x3, x4, x5}.
With the following structures
R_1 = {{x1, x3}, {x4}, {x2, x5}}
R_2 = {{x1, x3, x4}, {x2, x5}}
so that R_1 ⊏ R_2.
Again
Hierarchical Clustering produces a hierarchy of clusterings!
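A minimal sketch (not from the slides) that checks the nesting relation of
the definition, with clusters represented as Python sets:

def is_nested(R1, R2):
    # R1 ⊏ R2: every cluster of R1 is a subset of some cluster of R2,
    # and at least one containment is proper.
    covered = all(any(c1 <= c2 for c2 in R2) for c1 in R1)
    proper = any(any(c1 < c2 for c2 in R2) for c1 in R1)
    return covered and proper

R1 = [{"x1", "x3"}, {"x4"}, {"x2", "x5"}]
R2 = [{"x1", "x3", "x4"}, {"x2", "x5"}]
print(is_nested(R1, R2))  # True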
Agglomerative Algorithms
Initial State
You have N clusters, each containing one element of the data set X.
At each step t, you have a clustering R_t with N − t clusters.
Then, a new clustering structure R_{t+1} is generated.
In that way...
We have
At each step, each cluster of R_t is a subset of a cluster in R_{t+1},
and at least one is a proper subset, i.e.

R_t ⊏ R_{t+1}   (2)
The Basic Algorithm for Agglomerative Clustering
For this
We have a function g(Ci, Cj), defined on all pairs of clusters, that
measures similarity or dissimilarity.
t denotes the current level of the hierarchy.
Algorithm
Initialization
   Choose R_0 = {Ci = {xi}, i = 1, ..., N}
   t = 0
Repeat
   t = t + 1
   Find the pair of clusters (Ci, Cj) in R_{t−1} such that g(Ci, Cj) is
      the maximum of a similarity function (or the minimum of a
      dissimilarity function) over all pairs
   Define Cq = Ci ∪ Cj and R_t = (R_{t−1} − {Ci, Cj}) ∪ {Cq}
Until all vectors lie in a single cluster
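A minimal Python sketch of this generic scheme (my own illustration: it
uses single-link dissimilarity as g and a naive search over all pairs):

import numpy as np

def single_link(C1, C2, D):
    # g(Ci, Cj): smallest pairwise distance between the two clusters.
    return min(D[i, j] for i in C1 for j in C2)

def agglomerative(X, g=single_link):
    N = len(X)
    # Pairwise Euclidean distance matrix of the data.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    R = [frozenset([i]) for i in range(N)]  # R_0: one cluster per point
    hierarchy = [list(R)]
    while len(R) > 1:
        # Find the pair (Ci, Cj) minimizing the dissimilarity g.
        a, b = min(((a, b) for a in range(len(R))
                    for b in range(a + 1, len(R))),
                   key=lambda p: g(R[p[0]], R[p[1]], D))
        Cq = R[a] | R[b]  # Cq = Ci ∪ Cj
        R = [C for k, C in enumerate(R) if k not in (a, b)] + [Cq]
        hierarchy.append(list(R))  # R_t = (R_{t-1} − {Ci, Cj}) ∪ {Cq}
    return hierarchy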
Enforcing Nesting
Note the following
“If two vectors come together into a single cluster at level t of the
hierarchy, they will remain in the same cluster for all subsequent
clusterings.”
Thus

R_0 ⊏ R_1 ⊏ R_2 ⊏ ... ⊏ R_{N−1}   (3)

Hurrah!
The nesting property is enforced!
Problems with Agglomerative Algorithms
First - Related to the Nesting Property
There is no way to recover from a “poor” clustering that may have
occurred at an earlier level of the hierarchy.
Second
At each level t, there are N − t clusters.
Thus, at level t + 1 the number of pairs compared is

C(N − t, 2) = (N − t)(N − t − 1) / 2   (4)

The total number of pairs compared over all levels is

Σ_{t=0}^{N−1} C(N − t, 2)   (5)
Thus
We have that

Σ_{t=0}^{N−1} C(N − t, 2) = Σ_{k=1}^{N} C(k, 2) = (N − 1) N (N + 1) / 6   (6)

Thus
The complexity of this scheme is O(N³).
However
You still depend on the nature of g.
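A two-line numeric check of identity (6), in case the closed form looks
surprising:

from math import comb

N = 100
total = sum(comb(N - t, 2) for t in range(N))  # levels t = 0, ..., N-1
assert total == (N - 1) * N * (N + 1) // 6     # matches equation (6)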
Two Categories of Agglomerative Algorithms
There are two
1 Based on matrix theory.
2 Based on graph theory concepts.
Matrix Theory Based
As the name says, they are based on the N × N dissimilarity matrix
P_0 = P(X).
At each merging step the matrix is reduced by one ⇒ P_t becomes an
(N − t) × (N − t) matrix.
Matrix Based Algorithm
Matrix Updating Algorithmic Scheme (MUAS)
Initialization
   Choose R_0 = {Ci = {xi}, i = 1, ..., N}
   P_0 = P(X)
   t = 0
Repeat
   t = t + 1
   Find the pair of clusters (Ci, Cj) in R_{t−1} such that
      d(Ci, Cj) = min over r, s = 1, ..., N with r ≠ s of d(Cr, Cs)
   Define Cq = Ci ∪ Cj and R_t = (R_{t−1} − {Ci, Cj}) ∪ {Cq}
   Create P_t by the strategy below
Until all vectors lie in a single cluster
Matrix Based Algorithm
Strategy for creating P_t
1 Delete the two rows and columns that correspond to the merged
clusters.
2 Add a new row and a new column containing the distances between the
newly formed cluster and the old (unaffected at this level) clusters.
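A sketch of this update in NumPy (my own illustration; new_dists is a
hypothetical vector holding d(Cq, Ck) for every surviving cluster Ck,
computed for instance with the general formula on the next slide):

import numpy as np

def update_matrix(P, i, j, new_dists):
    # Build P_t from P_{t-1} after merging clusters i and j into Cq.
    keep = [k for k in range(len(P)) if k not in (i, j)]
    Q = P[np.ix_(keep, keep)]        # delete the two rows and columns
    Q = np.pad(Q, ((0, 1), (0, 1)))  # add a new row and a new column
    Q[-1, :-1] = new_dists           # new row: distances from Cq
    Q[:-1, -1] = new_dists           # new column (symmetric)
    return Q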
Distance Used in These Schemes
It has been pointed out that there is a single general distance update
for these algorithms

d(Cq, Cs) = a_i d(Ci, Cs) + a_j d(Cj, Cs) + b d(Ci, Cj)
               + c |d(Ci, Cs) − d(Cj, Cs)|

where different values of a_i, a_j, b and c correspond to different
choices of the dissimilarity measure.
Using this distance it is possible to generate several algorithms
1 The single link algorithm.
2 The complete link algorithm.
3 The weighted pair group method average.
4 The unweighted pair group method centroid.
5 Etc.
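In code the update is a one-liner (a sketch; the parameter values in the
comments are the standard choices for single and complete link):

def general_update(d_is, d_js, d_ij, a_i, a_j, b, c):
    # d(Cq, Cs) for the merged cluster Cq = Ci ∪ Cj.
    return a_i * d_is + a_j * d_js + b * d_ij + c * abs(d_is - d_js)

# Single link:   a_i = a_j = 1/2, b = 0, c = -1/2  ->  min{d_is, d_js}
# Complete link: a_i = a_j = 1/2, b = 0, c = +1/2  ->  max{d_is, d_js}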
For example
The single link algorithm
This is obtained if we set a_i = 1/2, a_j = 1/2, b = 0, c = −1/2.
Thus, we have

d(Cq, Cs) = min {d(Ci, Cs), d(Cj, Cs)}   (7)

Please look at the example in the Dropbox
It is an interesting example.
Agglomerative Algorithms Based on Graph Theory
Consider the following
1 Each node in the graph G corresponds to a vector.
2 Clusters are formed by connecting nodes.
3 A certain property, h(k), needs to be respected.
Common Properties: Node Connectivity
The node connectivity of a connected subgraph is the largest integer k
such that all pairs of nodes are joined by at least k paths having no
nodes in common.
Agglomerative Algorithms Based on Graph Theory
Common Properties: Edge Connectivity
The edge connectivity of a connected subgraph is the largest integer k
such that all pairs of nodes are joined by at least k paths having no edges
in common.
Common Properties: Node Degree
The degree of a connected subgraph is the largest integer k such that
each node has at least k incident edges.
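These three properties are easy to test with NetworkX (a sketch, not part
of the slides; G is assumed to be a connected nx.Graph):

import networkx as nx

def satisfies_h(G, k, prop="node"):
    # Check property h(k) on a connected subgraph G.
    if prop == "node":
        return nx.node_connectivity(G) >= k
    if prop == "edge":
        return nx.edge_connectivity(G) >= k
    if prop == "degree":
        return min(d for _, d in G.degree()) >= k
    raise ValueError(prop)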
Basically, We Use the Same Scheme, But...
The function

g_{h(k)}(Cr, Cs) = min_{x ∈ Cr, y ∈ Cs} {d(x, y) : the property below holds}   (8)

Property
The subgraph of G defined by Cr ∪ Cs
1 is connected, and either
   1 it has the property h(k), or
   2 it is complete.
Examples
1 Single Link Algorithm.
2 Complete Link Algorithm.
There is another style of clustering
Clustering Algorithms Based on the Minimum Spanning Tree.
Divisive Algorithms
Reverse Strategy
Start with a single cluster containing all the data and split it
iteratively.
Generalized Divisive Scheme
Algorithm (PROBLEM: what is wrong with it?)
Initialization
   Choose R_0 = {X}
   P_0 = P(X)
   t = 0
Repeat
   t = t + 1
   For i = 1 to t
      Given the cluster C_{t−1,i}, generate all possible pairs of
         sub-clusters
   next i
   Find the pair (C¹_{t−1,j}, C²_{t−1,j}) that maximizes g
   Create R_t = (R_{t−1} − {C_{t−1,j}}) ∪ {C¹_{t−1,j}, C²_{t−1,j}}
Until all vectors lie in a single cluster
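One concrete issue (my own note, not an answer given in the slides) is the
cost of “generate all possible pairs of sub-clusters”: a cluster with n
points admits 2^(n−1) − 1 two-set splits, so the inner step is exponential
in the cluster size.

def num_bipartitions(n):
    # Ways to split an n-point cluster into two non-empty sub-clusters.
    return 2 ** (n - 1) - 1

print(num_bipartitions(20))  # 524287 candidate splits for only 20 points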
Algorithms for Large Data Sets
There are several
1 The CURE Algorithm
2 The ROCK Algorithm
3 The Chameleon Algorithm
4 The BIRCH Algorithm
Clustering Using REpresentatives (CURE)
Basic Idea
Each cluster Ci has a set of representatives

R_{Ci} = {x_1^(i), x_2^(i), ..., x_K^(i)} with K > 1.

What is happening
By using multiple representatives for each cluster, the CURE algorithm
tries to “capture” the shape of each one.
However
In order to avoid taking into account irregularities (for example,
outliers) at the border of the cluster,
the initially chosen representatives are “pushed” toward the mean of the
cluster.
Therefore
This action is known as “shrinking”
In the sense that the volume of space “defined” by the representatives
is shrunk toward the mean of the cluster.
Shrinking Process
Given a cluster C
Select the point x ∈ C with the maximum distance from the mean of C and
set R_C = {x} (the set of representatives).
Then
1 For i = 2 to min {K, n_C}
2    Determine y ∈ C − R_C that lies farthest from the points in R_C
3    R_C = R_C ∪ {y}
Shrinking Process
Do the Shrinking
Shrink the points x ∈ R_C toward the mean m_C of C by a factor α.
Actually

x = (1 − α) x + α m_C, ∀x ∈ R_C   (9)
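Both steps fit in a short NumPy sketch (my own illustration; C is an
(n_C, d) array and K and alpha are the user-chosen parameters):

import numpy as np

def cure_representatives(C, K=10, alpha=0.2):
    m = C.mean(axis=0)  # cluster mean m_C
    # Start with the point farthest from the mean.
    R = [C[np.argmax(np.linalg.norm(C - m, axis=1))]]
    for _ in range(1, min(K, len(C))):
        # Farthest-point step: maximize the distance to the chosen set R_C.
        dists = np.min(np.linalg.norm(C[:, None, :] - np.array(R)[None, :, :],
                                      axis=-1), axis=1)
        R.append(C[np.argmax(dists)])
    # Shrink toward the mean: x <- (1 - alpha) x + alpha m_C.
    return (1 - alpha) * np.array(R) + alpha * m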
Resulting Set R_C
Thus
The resulting set R_C is the set of representatives of C.
The distance between two clusters is then defined as

d(Ci, Cj) = min_{x ∈ R_{Ci}, y ∈ R_{Cj}} d(x, y)   (10)
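Equation (10) as code (a sketch; each argument is an iterable of NumPy
points):

import numpy as np

def cure_distance(R_Ci, R_Cj):
    # Smallest distance between any two representatives of the clusters.
    return min(np.linalg.norm(x - y) for x in R_Ci for y in R_Cj)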
Clustering Using REpresentatives (CURE)
Basic Algorithm
Input: a set of points X = {x1, x2, ..., xN}
Output: C clusters
1 For every cluster Ci = {xi}, store Ci.mC = xi and Ci.RC = {xi}.
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a k-d tree T.
4 Insert each cluster into the heap Q (clusters are arranged in
increasing order of the distance between Ci and Ci.closest).
5 While size(Q) > C:
6    Remove the top element Ci of Q and merge it with Cj = Ci.closest.
7    Then compute the new representative points for the merged cluster
     Ck = Ci ∪ Cj.
8    Also remove Ci and Cj from T and Q.
9    Also, for all clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10   Insert Ck into Q.
Complexity of CURE
Too prohibitive

O(N² log₂ N)   (11)
Possible Solution
CURE does the following
The technique adopted by the CURE algorithm, in order to reduce the
computational complexity, is random sampling.
Actually
That is, a sample set X′ is created from X by randomly choosing N′ out
of the N points of X.
However, one has to ensure that the probability of missing a cluster of
X due to this sampling is low
This can be guaranteed if the number of sample points N′ is sufficiently
large.
Then
Having estimated N′
CURE forms p = N/N′ sample data sets by successive random samples.
In other words
X is partitioned randomly into p subsets.
For this a parameter q is selected
The points in each partition are clustered until N′/q clusters are
formed, or until the distance between the closest pair of clusters to be
merged in the next iteration step exceeds a user-defined threshold.
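The random partitioning step is straightforward (a sketch with a fixed
seed for reproducibility; X is a NumPy array):

import numpy as np

def random_partitions(X, p, seed=0):
    # Randomly split the data set X into p sample subsets.
    idx = np.random.default_rng(seed).permutation(len(X))
    return [X[chunk] for chunk in np.array_split(idx, p)]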
Once This Has Been Finished
A second clustering pass is done
On the at most p · N′/q = N/q clusters obtained from all the subsets.
The Goal
To apply the merging procedure described previously to the (at most)
N/q clusters, so that we end up with the required final number, m, of
clusters.
Finally
Each point x in the data set X that is not used as a representative in
any one of the m clusters is assigned to one of them according to the
following strategy.
Finally
First
A random sample of representative points from each of the m clusters is
chosen.
Then
Based on these representatives, the point x is assigned to the cluster
that contains the representative closest to it.
Experiments reported by Guha et al. show that CURE
is sensitive to parameter selection:
K must be large enough to capture the geometry of each cluster.
In addition, N′ must be larger than a certain percentage of N
(approximately 2.5%).
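The assignment strategy in code (a sketch; reps_per_cluster[m] is assumed
to hold the sampled representatives of cluster m):

import numpy as np

def assign_points(points, reps_per_cluster):
    labels = []
    for x in points:
        # Pick the cluster whose closest sampled representative is nearest.
        best = min(range(len(reps_per_cluster)),
                   key=lambda m: min(np.linalg.norm(x - r)
                                     for r in reps_per_cluster[m]))
        labels.append(best)
    return labels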
Not Only That
The value of α also affects CURE
For small values of α, CURE behaves like an MST-based clustering
algorithm.
For large values of α, CURE resembles an algorithm with a single
representative per cluster.
Worst Case Complexity

O(N′² log₂ N′)   (12)
More Related Content

PDF
24 Machine Learning Combining Models - Ada Boost
PDF
31 Machine Learning Unsupervised Cluster Validity
PDF
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
PDF
27 Machine Learning Unsupervised Measure Properties
PDF
23 Machine Learning Feature Generation
PDF
11 Machine Learning Important Issues in Machine Learning
PDF
Introduction to logistic regression
PDF
17 Machine Learning Radial Basis Functions
24 Machine Learning Combining Models - Ada Boost
31 Machine Learning Unsupervised Cluster Validity
28 Dealing with the NP Poblems: Exponential Search and Approximation Algorithms
27 Machine Learning Unsupervised Measure Properties
23 Machine Learning Feature Generation
11 Machine Learning Important Issues in Machine Learning
Introduction to logistic regression
17 Machine Learning Radial Basis Functions

What's hot (20)

PDF
18.1 combining models
PDF
18 Machine Learning Radial Basis Function Networks Forward Heuristics
PDF
Machine learning in science and industry — day 1
PDF
Machine learning in science and industry — day 4
PDF
Machine learning in science and industry — day 2
PDF
Machine learning in science and industry — day 3
PDF
06 Machine Learning - Naive Bayes
PDF
Iclr2016 vaeまとめ
PDF
Vc dimension in Machine Learning
PDF
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
PDF
The Kernel Trick
PDF
Tree models with Scikit-Learn: Great models with little assumptions
PDF
Neural Networks: Support Vector machines
PDF
"Deep Learning" Chap.6 Convolutional Neural Net
PDF
(DL hacks輪読) Variational Inference with Rényi Divergence
PDF
Understanding Random Forests: From Theory to Practice
PDF
Probabilistic PCA, EM, and more
PDF
Ridge regression, lasso and elastic net
PDF
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
PDF
Scaling Multinomial Logistic Regression via Hybrid Parallelism
18.1 combining models
18 Machine Learning Radial Basis Function Networks Forward Heuristics
Machine learning in science and industry — day 1
Machine learning in science and industry — day 4
Machine learning in science and industry — day 2
Machine learning in science and industry — day 3
06 Machine Learning - Naive Bayes
Iclr2016 vaeまとめ
Vc dimension in Machine Learning
(DL hacks輪読) How to Train Deep Variational Autoencoders and Probabilistic Lad...
The Kernel Trick
Tree models with Scikit-Learn: Great models with little assumptions
Neural Networks: Support Vector machines
"Deep Learning" Chap.6 Convolutional Neural Net
(DL hacks輪読) Variational Inference with Rényi Divergence
Understanding Random Forests: From Theory to Practice
Probabilistic PCA, EM, and more
Ridge regression, lasso and elastic net
(DL輪読)Variational Dropout Sparsifies Deep Neural Networks
Scaling Multinomial Logistic Regression via Hybrid Parallelism
Ad

Viewers also liked (15)

PPT
3.3 hierarchical methods
PPTX
Cluster analysis
PDF
Cluster Analysis for Dummies
PPTX
Introduction to Machine Learning
PPTX
Db scan multiview
PDF
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
PDF
Machine Learning and Data Mining: 08 Clustering: Hierarchical
PPTX
Hierarchical clustering in Python and beyond
PPTX
An Introduction to Agglomeration
PDF
Clustering: A Survey
PDF
K means Clustering
PPT
Basics of Machine Learning
PPTX
Introduction to Machine Learning
3.3 hierarchical methods
Cluster analysis
Cluster Analysis for Dummies
Introduction to Machine Learning
Db scan multiview
Machine Learning and Data Mining: 05 Advanced Association Rule Mining
Machine Learning and Data Mining: 08 Clustering: Hierarchical
Hierarchical clustering in Python and beyond
An Introduction to Agglomeration
Clustering: A Survey
K means Clustering
Basics of Machine Learning
Introduction to Machine Learning
Ad

Similar to 28 Machine Learning Unsupervised Hierarchical Clustering (20)

PDF
Clustering Algorithms.pdf
PDF
PPT s10-machine vision-s2
PPT
Hierarchical (2)l ppt for data and analytics
PPTX
Data mining and warehousing
PDF
An Analysis On Clustering Algorithms In Data Mining
PDF
Clustering Approach Recommendation System using Agglomerative Algorithm
PPTX
05 Clustering in Data Mining
PPTX
Algorithms used in AIML and the need for aiml basic use cases
PDF
3MLChapter3ClusteringSlides23EN UC Coimbra PT
PDF
Enhanced Clustering Algorithm for Processing Online Data
PDF
A0310112
PDF
Multilevel techniques for the clustering problem
PPTX
Unsupervised Learning-Clustering Algorithms.pptx
PDF
Similarity distance measures
PPTX
log6kntt4i4dgwfwbpxw-signature-75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a...
PDF
Paper id 26201478
PDF
4.Unit 4 ML Q&A.pdf machine learning qb
PPT
multiarmed bandit.ppt
DOCX
Agglomerative Clustering Onvertically Partitioned Data–Distributed Database M...
Clustering Algorithms.pdf
PPT s10-machine vision-s2
Hierarchical (2)l ppt for data and analytics
Data mining and warehousing
An Analysis On Clustering Algorithms In Data Mining
Clustering Approach Recommendation System using Agglomerative Algorithm
05 Clustering in Data Mining
28 Machine Learning Unsupervised Hierarchical Clustering

  • 1. Machine Learning for Data Mining Hierarchical Clustering Andres Mendez-Vazquez July 27, 2015 1 / 46
  • 2. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 2 / 46
  • 3. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 3 / 46
  • 4. Images/cinvestav- Concepts Hierarchical Clustering Algorithms They are quite different from the previous clustering algorithms. Actually They produce a hierarchy of clusterings. 4 / 46
  • 6. Images/cinvestav- Dendrogram: Hierarchical Clustering Hierarchical Clustering The clustering is obtained by cutting the dendrogram at a desired level: Each connected component forms a cluster. 5 / 46
  • 8. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 7 / 46
  • 9. Images/cinvestav- Basic Ideas At each step t A new clustering is obtained based on the clustering produced at the previous step t − 1 Two Main Types 1 Agglomerative Algorithms. 1 Start with each item being a single cluster. 2 Eventually all items belong to the same cluster. 2 Divisive Algorithms 1 Start with all items belong to the same cluster. 2 Eventually each item forms a cluster on its own. 8 / 46
  • 16. Images/cinvestav- Therefore Given the previous ideas, it is necessary to define the concept of nesting!!! After all, both the divisive and the agglomerative procedures generate a sequence of clusterings. 9 / 46
  • 18. Images/cinvestav- Nested Clustering Definition 1 A clustering ℜi containing k clusters is said to be nested in the clustering ℜi+1, which contains r < k clusters, if each cluster in ℜi is a subset of a cluster in ℜi+1. 2 At least one cluster of ℜi is a proper subset of a cluster in ℜi+1. This is written as ℜi ⊏ ℜi+1 (1) 10 / 46
  • 21. Images/cinvestav- Example We have The following set {x1, x2, x3, x4, x5}. With the following structures ℜ1 = {{x1, x3} , {x4} , {x2, x5}} ℜ2 = {{x1, x3, x4} , {x2, x5}} so that ℜ1 ⊏ ℜ2. Again Hierarchical Clustering produces a hierarchy of clusterings!!! 11 / 46
  • 25. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 12 / 46
  • 26. Images/cinvestav- Agglomerative Algorithms. Initial State You have N clusters, each containing one element of the data X. At each step t, you have a clustering ℜt with N − t clusters. Then, a new clustering structure ℜt+1 is generated by merging two of them. 13 / 46
  • 30. Images/cinvestav- In that way... We have At each step, each cluster in ℜt is a subset of a cluster in ℜt+1, and at least one is a proper subset, i.e. ℜt ⊏ ℜt+1 (2) 14 / 46
  • 31. Images/cinvestav- The Basic Algorithm for Agglomerative For this We have a function g (Ci, Cj), defined for all pairs of clusters, that measures similarity or dissimilarity. t denotes the current level of the hierarchy. Algorithm
Initialization
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    t = 0
Repeat
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that g(Ci, Cj) is
        the maximum of a similarity (or the minimum of a dissimilarity)
        function over all pairs
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
Until all vectors are in a single cluster
15 / 46
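A minimal sketch of this generic agglomerative scheme, assuming Euclidean vectors and a single-link g; the slides do not prescribe an implementation, so all names here are illustrative:

```python
import numpy as np

def single_link(Ci, Cj, X):
    """g(Ci, Cj): the smallest pairwise distance between two clusters."""
    return min(np.linalg.norm(X[a] - X[b]) for a in Ci for b in Cj)

def agglomerate(X, g=single_link):
    """Return the hierarchy R0, R1, ..., R(N-1) as lists of index sets."""
    clusters = [{i} for i in range(len(X))]        # R0: one cluster per point
    hierarchy = [list(clusters)]
    while len(clusters) > 1:
        # Find the pair (Ci, Cj) minimizing the dissimilarity g
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: g(clusters[p[0]], clusters[p[1]], X))
        Cq = clusters[i] | clusters[j]             # Cq = Ci ∪ Cj
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(Cq)                        # Rt = (Rt-1 − {Ci, Cj}) ∪ {Cq}
        hierarchy.append(list(clusters))
    return hierarchy

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [9.0, 0.0]])
for t, R in enumerate(agglomerate(X)):
    print(t, R)
```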
  • 34. Images/cinvestav- Enforcing Nesting Note the following “We can say that if two vectors come together into a single cluster at level t of the hierarchy, they will remain in the same cluster for all subsequent clusterings.” Thus ℜ0 ⊏ ℜ1 ⊏ ℜ2 ⊏ ... ⊏ ℜN−1 (3) Hurrah!!! Enforcing the nesting property!!! 16 / 46
  • 37. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 17 / 46
  • 38. Images/cinvestav- Problems with Agglomerative Algorithms First - Related to the Nesting Property There is no way to recover from a “poor” clustering that may have occurred at an earlier level of the hierarchy. Second At each level t, there are N − t clusters. Thus, at level t + 1 the number of pairs compared is $\binom{N-t}{2} = \frac{(N-t)(N-t-1)}{2}$ (4) The total number of pairs compared is $\sum_{t=0}^{N-1} \binom{N-t}{2}$ (5) 18 / 46
  • 43. Images/cinvestav- Thus We have that $\sum_{t=0}^{N-1} \binom{N-t}{2} = \sum_{k=1}^{N} \binom{k}{2} = \frac{(N-1)N(N+1)}{6}$ (6) Thus The complexity of this schema is $O(N^3)$ However You still depend on the nature of g. 19 / 46
  • 46. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 20 / 46
  • 47. Images/cinvestav- Two Categories of Agglomerative Algorithms There are two 1 Matrix Theory Based. 2 Graph Theory Based. Matrix Theory Based As the name says, they are based on the N × N dissimilarity matrix P0 = P (X). At each merging step the matrix is reduced by one ⇒ Pt becomes an (N − t) × (N − t) matrix. 21 / 46
  • 51. Images/cinvestav- Matrix Based Algorithm Matrix Updating Algorithmic Scheme (MUAS)
Initialization
    Choose ℜ0 = {Ci = {xi} , i = 1, ..., N}
    P0 = P (X)
    t = 0
Repeat
    t = t + 1
    Find the pair of clusters (Ci, Cj) in ℜt−1 such that
        d(Ci, Cj) = min over all pairs (Cr, Cs), r ≠ s, of d(Cr, Cs)
    Define Cq = Ci ∪ Cj and ℜt = (ℜt−1 − {Ci, Cj}) ∪ {Cq}
    Create Pt by the strategy below
Until all vectors are in a single cluster
22 / 46
  • 52. Images/cinvestav- Matrix Based Algorithm Strategy 1 Delete the two rows and columns that correspond to the merged clusters. 2 Add a new row and a new column that contain the distances between the newly formed cluster and the old (unaffected at this level) clusters. 23 / 46
  • 54. Images/cinvestav- Distance Used in These Schemes It has been pointed out that there is a single general distance-update formula for these algorithms d (Cq, Cs) = ai d (Ci, Cs) + aj d (Cj, Cs) + b d (Ci, Cj) + c |d (Ci, Cs) − d (Cj, Cs)| Where different values of ai, aj, b and c correspond to different choices of the dissimilarity measure. Using this formula it is possible to generate several algorithms 1 The single link algorithm. 2 The complete link algorithm. 3 The weighted pair group method average. 4 The unweighted pair group method centroid. 5 Etc... 24 / 46
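As a concrete sketch, the function below applies the row/column strategy and the general update formula above (widely known as the Lance–Williams recurrence) in one step on a small dissimilarity matrix; the default coefficients (single link, so the new row is the elementwise minimum of the two old ones) are standard choices assumed here, not taken verbatim from the slides:

```python
import numpy as np

def merge_step(P, i, j, ai=0.5, aj=0.5, b=0.0, c=-0.5):
    """One MUAS step on dissimilarity matrix P: merge clusters i and j."""
    # d(Cq, Cs) for every cluster Cs, via the general update formula
    d_new = ai * P[i] + aj * P[j] + b * P[i, j] + c * np.abs(P[i] - P[j])
    d_new = np.delete(d_new, [i, j])
    # Strategy: delete the two merged rows/columns, append one new row/column
    P = np.delete(np.delete(P, [i, j], axis=0), [i, j], axis=1)
    P = np.vstack([P, d_new])
    return np.hstack([P, np.append(d_new, 0.0)[:, None]])

P0 = np.array([[0., 1., 4., 5.],
               [1., 0., 3., 6.],
               [4., 3., 0., 2.],
               [5., 6., 2., 0.]])
print(merge_step(P0, 0, 1))   # with c = -1/2 the new row is the min of rows 0, 1
```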
  • 60. Images/cinvestav- For example The single link algorithm This is obtained if we set ai = 1/2, aj = 1/2, b = 0, c = −1/2. Thus, we have d (Cq, Cs) = min {d (Ci, Cs) , d (Cj, Cs)} (7) Please look at the example in the Dropbox It is an interesting example. 25 / 46
  • 63. Images/cinvestav- Agglomerative Algorithms Based on Graph Theory Consider the following 1 Each node in the graph G corresponds to a vector. 2 Clusters are formed by connecting nodes. 3 A certain property, h (k), needs to be respected. Common Properties: Node Connectivity The node connectivity of a connected subgraph is the largest integer k such that all pairs of nodes are joined by at least k paths having no nodes in common. 26 / 46
  • 67. Images/cinvestav- Agglomerative Algorithms Based on Graph Theory Common Properties: Edge Connectivity The edge connectivity of a connected subgraph is the largest integer k such that all pairs of nodes are joined by at least k paths having no edges in common. Common Properties: Node Degree The degree of a connected subgraph is the largest integer k such that each node has at least k incident edges. 27 / 46
  • 69. Images/cinvestav- Basically, We use the Same Scheme, But... The function $g_{h(k)}(C_r, C_s) = \min_{x \in C_r,\, y \in C_s} d(x, y)$ (8) subject to the following Property The subgraph of G defined by Cr ∪ Cs is connected and either 1 it has the property h(k) or 2 it is complete 28 / 46
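A hedged sketch of the property test behind g_h(k): check that the subgraph induced by Cr ∪ Cs is connected and either complete or satisfies h(k). It uses networkx, whose node_connectivity and edge_connectivity helpers match the definitions above; how G itself is built (e.g., by thresholding a dissimilarity matrix) is left out as an assumption:

```python
import networkx as nx

def satisfies_property(G, Cr, Cs, k, h="node"):
    """True if the subgraph on Cr ∪ Cs is connected and complete or has h(k)."""
    H = G.subgraph(Cr | Cs)
    n = H.number_of_nodes()
    if not nx.is_connected(H):
        return False
    if H.number_of_edges() == n * (n - 1) // 2:     # complete subgraph
        return True
    if h == "node":
        return nx.node_connectivity(H) >= k         # k node-disjoint paths
    if h == "edge":
        return nx.edge_connectivity(H) >= k         # k edge-disjoint paths
    return min(d for _, d in H.degree()) >= k       # node-degree property

# Toy graph: two triangles joined by a single bridge edge
G = nx.Graph([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
print(satisfies_property(G, {0, 1, 2}, {3, 4, 5}, k=2))   # False: bridge (2, 3)
```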
  • 74. Images/cinvestav- Examples Examples 1 Single Link Algorithm 2 Complete Link Algorithm There is another style of clustering Clustering Algorithms Based on the Minimum Spanning Tree 29 / 46
  • 77. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 30 / 46
  • 78. Images/cinvestav- Divisive Algorithms Reverse Strategy Start with a single cluster and split it iteratively. 31 / 46
  • 79. Images/cinvestav- Generalized Divisive Scheme Algorithm (Problem: what is wrong with it?)
Initialization
    Choose ℜ0 = {X}, P0 = P (X)
    t = 0
Repeat
    t = t + 1
    For i = 1 to t
        Given the cluster Ct−1,i of ℜt−1,
        generate all possible pairs of subclusters
    Next i
    Find the pair (C¹t−1,j , C²t−1,j) that maximizes g
    Create ℜt = (ℜt−1 − {Ct−1,j}) ∪ {C¹t−1,j , C²t−1,j}
Until each vector lies in a single (distinct) cluster
32 / 46
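The catch the slide hints at: a cluster of n points admits 2^(n−1) − 1 distinct two-way splits, so generating "all possible" subcluster pairs explodes combinatorially. The toy enumeration below (an illustration, not from the slides) makes the count concrete:

```python
from itertools import combinations

def all_two_way_splits(cluster):
    """Yield every unordered split of a set into two nonempty parts."""
    items = sorted(cluster)
    first, rest = items[0], items[1:]   # pin one element to avoid duplicates
    for r in range(len(rest) + 1):
        for picked in combinations(rest, r):
            A = {first, *picked}
            B = set(items) - A
            if B:                       # skip the trivial split A = cluster
                yield A, B

print(len(list(all_two_way_splits({1, 2, 3, 4, 5}))))   # 2**(5-1) - 1 = 15
```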
  • 80. Images/cinvestav- Outline 1 Hierarchical Clustering Definition Basic Ideas 2 Agglomerative Algorithms Introduction Problems with Agglomerative Algorithms Two Categories of Agglomerative Algorithms Matrix Based Algorithms Graph Based Algorithms 3 Divisive Algorithms Introduction 4 Algorithms for Large Data Sets Introduction Clustering Using REpresentatives (CURE) 33 / 46
  • 81. Images/cinvestav- Algorithms for Large Data Sets There are several 1 The CURE Algorithm 2 The ROCK Algorithm 3 The Chameleon Algorithm 4 The BIRCH Algorithm 34 / 46
  • 85. Images/cinvestav- Clustering Using REpresentatives (CURE) Basic Idea Each cluster Ci has a set of representatives $R_{C_i} = \{x_1^{(i)}, x_2^{(i)}, \ldots, x_K^{(i)}\}$ with K > 1. What is happening By using multiple representatives for each cluster, the CURE algorithm tries to “capture” the shape of each one. However In order to avoid taking into account irregularities (for example, outliers) on the border of the cluster, the initially chosen representatives are “pushed” toward the mean of the cluster. 35 / 46
  • 89. Images/cinvestav- Therefore This action is known As “shrinking”, in the sense that the volume of space “defined” by the representatives is shrunk toward the mean of the cluster. 36 / 46
  • 90. Images/cinvestav- Shrinking Process Given a cluster C Select the point x ∈ C with the maximum distance from the mean of C and set RC = {x} (the set of representatives). Then 1 For i = 2 to min {K, nC } 2 Determine y ∈ C − RC that lies farthest from the points in RC 3 RC = RC ∪ {y} 37 / 46
  • 95. Images/cinvestav- Shrinking Process Do the Shrinking Shrink the points x ∈ RC toward the mean mC of C by a factor α. Actually x = (1 − α) x + α mC , ∀x ∈ RC (9) 38 / 46
  • 97. Images/cinvestav- Resulting set RC Thus The resulting set RC is the set of representatives of C. Thus the distance between two clusters is defined as $d(C_i, C_j) = \min_{x \in R_{C_i},\, y \in R_{C_j}} d(x, y)$ (10) 39 / 46
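A sketch of the whole representative machinery: farthest-point selection, the shrinking step (9), and the representative-based cluster distance (10). The selection order and the Euclidean distances follow the slides; the function names and the defaults K = 4, α = 0.3 are assumptions:

```python
import numpy as np

def cure_representatives(C, K=4, alpha=0.3):
    """C: (n, d) array of points in one cluster -> shrunk representatives."""
    m = C.mean(axis=0)                                   # cluster mean mC
    # Start with the point farthest from the mean
    reps = [C[np.argmax(np.linalg.norm(C - m, axis=1))]]
    while len(reps) < min(K, len(C)):
        # Pick the point farthest from the representatives chosen so far
        d = np.min([np.linalg.norm(C - r, axis=1) for r in reps], axis=0)
        reps.append(C[np.argmax(d)])
    return (1 - alpha) * np.array(reps) + alpha * m      # shrink toward mC

def cure_distance(Ri, Rj):
    """d(Ci, Cj): minimum distance between the two representative sets."""
    return min(np.linalg.norm(x - y) for x in Ri for y in Rj)

rng = np.random.default_rng(0)
A = rng.normal(0, 1, (50, 2)); B = rng.normal(6, 1, (40, 2))
print(cure_distance(cure_representatives(A), cure_representatives(B)))
```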
  • 99. Images/cinvestav- Clustering Using REpresentatives (CURE) Basic Algorithm
Input: A set of points X = {x1, x2, ..., xN }
Output: C clusters
1 For every cluster Ci = {xi} store Ci.mC = {xi} and Ci.RC = {xi}.
2 Ci.closest stores the cluster closest to Ci.
3 All the input points are inserted into a k-d tree T.
4 Insert each cluster into the heap Q (clusters are arranged in increasing order of the distance between Ci and Ci.closest).
5 While size(Q) > C
6     Remove the top element Ci of Q and merge it with Cj = Ci.closest.
7     Then compute the new representative points for the merged cluster Ck = Ci ∪ Cj.
8     Also remove Ci and Cj from T and Q.
9     Also, for all clusters Ch ∈ Q, update Ch.closest and relocate Ch.
10    Insert Ck into Q.
40 / 46
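A deliberately simplified sketch of the merge loop above: it keeps the multiple-representatives idea but replaces the heap Q and the k-d tree T with a brute-force closest-pair search, so it shows the logic rather than the efficiency (it reuses cure_representatives and cure_distance from the earlier sketch):

```python
import numpy as np

def cure_merge(X, n_clusters, K=4, alpha=0.3):
    clusters = [X[i:i + 1] for i in range(len(X))]   # one point per cluster
    reps = [c.copy() for c in clusters]              # RC = {xi} initially
    while len(clusters) > n_clusters:
        # Brute-force stand-in for the heap: closest representative sets
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: cure_distance(reps[p[0]], reps[p[1]]))
        merged = np.vstack([clusters[i], clusters[j]])   # Ck = Ci ∪ Cj
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        reps = [r for t, r in enumerate(reps) if t not in (i, j)]
        clusters.append(merged)
        reps.append(cure_representatives(merged, K, alpha))
    return clusters
```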
  • 112. Images/cinvestav- Complexity of CURE Too Prohibitive $O(N^2 \log_2 N)$ (11) 41 / 46
  • 113. Images/cinvestav- Possible Solution CURE does the following The technique adopted by the CURE algorithm, in order to reduce the computational complexity, is random sampling. Actually That is, a sample set X′ is created from X by choosing randomly N′ out of the N points of X. However, one has to ensure that the probability of missing a cluster of X due to this sampling is small. This can be guaranteed if the number of points N′ is sufficiently large. 42 / 46
  • 116. Images/cinvestav- Then Having estimated N′ CURE forms a number of p = N/N′ sample data sets by successive random samples. In other words, X is partitioned randomly into p subsets. For this a parameter q > 1 is selected Then, the points in each partition are clustered until N′/q clusters are formed, or until the distance between the closest pair of clusters to be merged in the next iteration step exceeds a user-defined threshold. 43 / 46
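A sketch of this partition-and-precluster bookkeeping: X is split into p = N/N′ random subsets of roughly N′ points each, and each is clustered down to about N′/q clusters. Here cluster_partition is a hypothetical stand-in for the merge loop sketched earlier:

```python
import numpy as np

def cure_precluster(X, n_prime, q, cluster_partition):
    """Return the at most p * (N'/q) = N/q pre-clusters for the second pass."""
    rng = np.random.default_rng(0)
    shuffled = X[rng.permutation(len(X))]       # a random partition of X
    p = max(1, len(X) // n_prime)               # p = N / N'
    partitions = np.array_split(shuffled, p)    # p subsets of ~N' points
    target = max(1, n_prime // q)               # stop at N'/q clusters each
    return [c for part in partitions
            for c in cluster_partition(part, target)]
```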
  • 119. Images/cinvestav- Once this has been finished A second clustering pass is done On the at most p (N′/q) = N/q clusters from all the subsets. The Goal To apply the merging procedure described previously to the (at most) N/q clusters so that we end up with the required final number, m, of clusters. Finally Each point x in the data set X that is not used as a representative in any one of the m clusters is assigned to one of them according to the following strategy. 44 / 46
  • 122. Images/cinvestav- Finally First A random sample of representative points from each of the m clusters is chosen. Then Then, based on the previous representatives, the point x is assigned to the cluster that contains the representative closest to it. Experiments reported by Guha et al. show that CURE It is sensitive to parameter selection. Specifically K must be large enough to capture the geometry of each cluster. In addition, N′ must be higher than a certain percentage of N (≈ 2.5%). 45 / 46
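A small sketch of this final assignment strategy: each remaining point goes to the cluster owning the closest sampled representative (the names and the toy data are illustrative):

```python
import numpy as np

def assign_points(X, cluster_reps):
    """cluster_reps: one (ni, d) array of sampled representatives per cluster."""
    labels = np.empty(len(X), dtype=int)
    for idx, x in enumerate(X):
        # Distance from x to the nearest representative of each cluster
        d = [np.min(np.linalg.norm(R - x, axis=1)) for R in cluster_reps]
        labels[idx] = int(np.argmin(d))
    return labels

reps = [np.array([[0.0, 0.0], [1.0, 0.0]]), np.array([[6.0, 6.0]])]
X = np.array([[0.5, 0.2], [5.5, 6.1], [2.0, 0.0]])
print(assign_points(X, reps))   # [0 1 0]
```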
  • 127. Images/cinvestav- Not only that The value of α also affects CURE For small values, CURE behaves like an MST-based clustering. For large values, CURE resembles an algorithm with a single representative per cluster. Worst Case Complexity $O(N'^2 \log_2 N')$ (12) 46 / 46