Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Disclaimer: Content taken from Han & Kamber slides, Data mining textbooks and Internet
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering Houses
(figure omitted) The same set of houses can be clustered by size or by geographic distance.
Clustering: Rich Applications and
Multidisciplinary Efforts
 Pattern Recognition
 Clustering methods group similar patterns into clusters whose
members are more similar to each other than to members of other clusters
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 clustering is used in image segmentation for separating image objects which are
analyzed further
 Economic Science (especially market research)
 cluster analysis is used to classify objects into relatively homogeneous groups based on a
set of variables such as demographics, psychographics, buying behaviours,
attitudes, and preferences
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures the
“goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Data Structures
 Data matrix (two modes): n objects × p variables

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode): n × n table of pairwise dissimilarities d(i, j)

$$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$$
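As a concrete illustration (a sketch with assumed toy data; pdist and squareform are standard SciPy calls), both structures can be built directly:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects (rows) x p variables (columns) -- "two modes"
X = np.array([[185.0, 72.0],
              [170.0, 56.0],
              [168.0, 60.0]])

# Dissimilarity matrix: n x n pairwise distances -- "one mode"
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))  # symmetric, zero diagonal: D[i, j] = d(i, j)
```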
Type of data in clustering analysis
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters; the aim is to find the best
fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance
between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min d(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max d(tip, tjq)
 Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg d(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)
 Medoid: one chosen, centrally located object in the cluster
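A brief sketch of these alternatives (the two toy clusters and Euclidean point distance are assumptions for illustration):

```python
import numpy as np

def pairwise(Ki, Kj):
    """All point-to-point distances d(t_ip, t_jq) between two clusters."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

Ki = np.array([[1.0, 1.0], [2.0, 1.0]])
Kj = np.array([[5.0, 4.0], [6.0, 5.0]])

d = pairwise(Ki, Kj)
print("single  :", d.min())   # smallest pairwise distance
print("complete:", d.max())   # largest pairwise distance
print("average :", d.mean())  # mean over all pairs
print("centroid:", np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))
```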
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point of the
cluster to its centroid
 Diameter: square root of average mean squared distance between
all pairs of points in the cluster
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} \qquad R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} \qquad D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
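A small numerical sketch of the three quantities (the cluster points are assumed for illustration):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
N = len(cluster)

centroid = cluster.mean(axis=0)                          # C_m
radius = np.sqrt(((cluster - centroid) ** 2).sum() / N)  # R_m

# Diameter: RMS distance over all ordered pairs i != j
sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=-1)
diameter = np.sqrt(sq.sum() / (N * (N - 1)))             # D_m

print(centroid, round(radius, 3), round(diameter, 3))
```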
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters that minimizes the sum of squared distances (the criterion E below)
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
$$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
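As a sketch (the toy data and the sse helper are our own, for illustration), the criterion E can be computed directly:

```python
import numpy as np

def sse(X, labels, centroids):
    """E = sum over clusters m of sum over t in K_m of ||C_m - t||^2."""
    return sum(((X[labels == m] - c) ** 2).sum()
               for m, c in enumerate(centroids))

X = np.array([[1.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [11.0]])
print(sse(X, labels, centroids))  # 0.5 + 2.0 = 2.5
```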
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in
four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when no more new
assignment
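A minimal sketch of these four steps (assumptions: Euclidean distance, initial seeds drawn at random from the data, and no cluster ever becoming empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k objects as the initial seed points (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Step 2 again: recompute centroids of the current partition
        new = np.array([X[labels == m].mean(axis=0) for m in range(k)])
        if np.allclose(new, centroids):  # stop: no more new assignment
            break
        centroids = new
    return labels, centroids
```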
The K-Means Clustering Method
 Example (scatter plots omitted): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update again; repeat until no object changes cluster.
K-Means example
 Cluster the following items in 2 clusters: {2, 4, 10, 12, 3, 20, 30, 11,
25}
 d(Ci, tj) = √((Ci − tj)²) = |Ci − tj|
 Assign tj to the cluster whose mean Ci minimizes d(Ci, tj)
M1    M2    K1                      K2
2     4     {2, 3}                  {4, 10, 12, 20, 30, 11, 25}
2.5   16    {2, 3, 4}               {10, 12, 20, 30, 11, 25}
3     18    {2, 3, 4, 10}           {12, 20, 30, 11, 25}
4.75  19.6  {2, 3, 4, 10, 12, 11}   {20, 30, 25}
7     25    {2, 3, 4, 10, 12, 11}   {20, 30, 25}
Stopping criteria:
• No new assignment
• No change in cluster means
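The trace above can be reproduced with a short sketch (assumption: a tie in distance keeps the item in the first cluster, matching the opening step of the table):

```python
import numpy as np

items = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
m1, m2 = 2.0, 4.0                       # initial means M1, M2
while True:
    in_k1 = np.abs(items - m1) <= np.abs(items - m2)
    k1, k2 = items[in_k1], items[~in_k1]
    print(f"M1={m1:g} M2={m2:g} K1={sorted(k1)} K2={sorted(k2)}")
    new1, new2 = k1.mean(), k2.mean()
    if (new1, new2) == (m1, m2):        # no change in cluster means
        break
    m1, m2 = new1, new2
```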
K-means 2D example
 Apply k-means for the following dataset to make 2 clusters:
X Y
185 72
170 56
168 60
179 68
182 72
188 77
Step 1: Assume Initial Centroids: C1 = (185, 72), C2 = (170, 56)
Step 2: Calculate the Euclidean distance to each centroid:
d[(x, y), (a, b)] = √((x − a)² + (y − b)²)
For t1 = (168, 60):
d[(185, 72), (168, 60)] = √((185 − 168)² + (72 − 60)²) = 20.808
d[(170, 56), (168, 60)] = √((170 − 168)² + (56 − 60)²) = 4.472
Since d(C2, t1) < d(C1, t1), assign t1 to C2
Step 3: For t2 = (179, 68):
d[(185, 72), (179, 68)] = √((185 − 179)² + (72 − 68)²) = 7.211
d[(170, 56), (179, 68)] = √((170 − 179)² + (56 − 68)²) = 15
Since d(C1, t2) < d(C2, t2), assign t2 to C1
Step 4: For t3 = (182, 72):
d[(185, 72), (182, 72)] = √((185 − 182)² + (72 − 72)²) = 3
d[(170, 56), (182, 72)] = √((170 − 182)² + (56 − 72)²) = 20
Since d(C1, t3) < d(C2, t3), assign t3 to C1
K-means 2D example
 Apply k-means for the following dataset to make 2 clusters:
X Y
185 72
170 56
168 60
179 68
182 72
188 77
Step 5: For t4 = (188, 77):
d[(185, 72), (188, 77)] = √((185 − 188)² + (72 − 77)²) = 5.83
d[(170, 56), (188, 77)] = √((170 − 188)² + (56 − 77)²) = 27.66
Since d(C1, t4) < d(C2, t4), assign t4 to C1
Step 6: Clusters after 1 iteration:
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}
Step 7: New cluster centroids: C1 = (183.5, 72.25), C2 = (169, 58)
Repeat the above steps for all samples until convergence
Final Clusters
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}
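For a quick cross-check, scikit-learn's KMeans reproduces this result; seeding it with the slide's initial centroids via the init array is our assumption to mirror the example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
init = np.array([[185, 72], [170, 56]], dtype=float)

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 1 1 0 0 0]
print(km.cluster_centers_)  # approx [[183.5, 72.25], [169., 58.]]
```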
Comments on the K-Means Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
 A few variants of the k-means which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially
distort the distribution of the data.
 K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used: the most
centrally located object in the cluster.
The K-Medoids Clustering Method
 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale
well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
(figure omitted) With K = 2: arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Randomly select a non-medoid object Orandom and compute the total cost of swapping it with a medoid O; if quality is improved, swap O and Orandom (the illustrated swap has total cost = 26). Loop until no change.
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real object to represent the cluster
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
3. For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object
4. repeat steps 2-3 until there is no change
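A compact sketch of steps 1-4 (assumption: objects are described by a precomputed dissimilarity matrix D, as in the example that follows):

```python
import numpy as np

def pam(D, k, medoids=None):
    n = len(D)
    medoids = list(medoids) if medoids else list(range(k))  # step 1
    while True:
        cost = D[:, medoids].min(axis=1).sum()   # current total cost
        best, best_cost = None, cost
        for i in medoids:                        # step 2: every swap (i, h)
            for h in set(range(n)) - set(medoids):
                trial = [h if m == i else m for m in medoids]
                c = D[:, trial].min(axis=1).sum()
                if c < best_cost:                # i.e. TC_ih < 0
                    best, best_cost = trial, c
        if best is None:                         # step 4: no change -> stop
            return medoids, D[:, medoids].argmin(axis=1)
        medoids = best                           # step 3: i replaced by h
```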
PAM example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: Let A and B be the medoids. Obtain clusters by assigning each element to
the closer of the two medoids: {A, C, D} & {B, E}
Step 2: Now examine the three non-medoids {C, D, E} to determine if they can
replace the existing medoids,
i.e., A replaced by C, D, or E, and B replaced by C, D, or E.
We have 6 costs to determine:
TCAC, TCAD, TCAE, TCBC, TCBD, TCBE,
where Cjih = cost change for item tj caused by swapping medoid ti with non-medoid th, and TCih = Σj Cjih
Let us replace A with C. Cost of replacing A with C; new clusters: {A, B, E} & {C, D}
TCAC = CAAC + CBAC + CCAC + CDAC + CEAC = 1 + 0 + (-2) + (-1) + 0 = -2
The overall cost is reduced by 2
Step 3: Let us replace A with D. Cost of replacing A with D; new clusters: {A, B, E} & {C, D}
TCAD = CAAD + CBAD + CCAD + CDAD + CEAD = 1 + 0 + (-1) + (-2) + 0 = -2
The overall cost is reduced by 2
Step 4: Let us replace A with E. Cost of replacing A with E; new clusters: {A, B, C} & {E, D}
TCAE = CAAE + CBAE + CCAE + CDAE + CEAE = 1 + 0 + 0 + 1 + (-3) = -1
The overall cost is reduced by 1
Step 5: Let us replace B with C. Cost of replacing B with C; new clusters: {A, B, E} & {C, D}
TCBC = CABC + CBBC + CCBC + CDBC + CEBC = 0 + 1 + (-2) + (-1) + 0 = -2
The overall cost is reduced by 2
Step 6: Let us replace B with D. Cost of replacing B with D; new clusters: {A, B, E} & {C, D}
TCBD = CABD + CBBD + CCBD + CDBD + CEBD = 0 + 1 + (-1) + (-2) + 0 = -2
The overall cost is reduced by 2
Step 7: Let us replace B with E. Cost of replacing B with E; new clusters: {A, B, C, D} & {E}
TCBE = CABE + CBBE + CCBE + CDBE + CEBE = 0 + 1 + 0 + 0 + (-3) = -2
The overall cost is reduced by 2
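All six costs can be verified directly from the distance matrix; a sketch (the total_cost helper is our own, computing TCih as the change in Σj d(tj, nearest medoid)):

```python
import numpy as np

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
names = "ABCDE"

def total_cost(medoids):
    # Each object contributes its distance to the nearest medoid
    return D[:, medoids].min(axis=1).sum()

base = total_cost([0, 1])                  # medoids A and B
for i in (0, 1):                           # medoid being replaced
    for h in (2, 3, 4):                    # candidate non-medoid
        trial = [h if m == i else m for m in (0, 1)]
        print(f"TC{names[i]}{names[h]} = {total_cost(trial) - base:+.0f}")
# Output: TCAC=-2, TCAD=-2, TCAE=-1, TCBC=-2, TCBD=-2, TCBE=-2
```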
PAM Clustering: total swapping cost TCih = Σj Cjih
(figures omitted) Four cases for a non-selected object j when medoid i is swapped with non-medoid h (t denotes some other current medoid):
 j belongs to another medoid t and stays with t: Cjih = 0
 j belongs to i and is reassigned to h: Cjih = d(j, h) - d(j, i)
 j belongs to i and is reassigned to another medoid t: Cjih = d(j, t) - d(j, i)
 j belongs to another medoid t and is reassigned to h: Cjih = d(j, h) - d(j, t)
What Is the Problem with PAM?
 PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n−k)²) for each iteration,
where n is # of data points and k is # of clusters
 Sampling-based method: CLARA (Clustering LARge Applications)
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(figure omitted) Agglomerative clustering (AGNES) works bottom-up: step 0 starts from singletons a, b, c, d, e; successive steps merge them, e.g., {a, b} and {d, e}, then {c, d, e}, until one cluster {a, b, c, d, e} remains at step 4. Divisive clustering (DIANA) traverses the same hierarchy top-down, in the inverse order.
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component
then forms a cluster.
Hierarchical Clustering: Single-Link Example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At Level 1: Min_dist = 1.
Find the distance between each pair.
If min_dist{ti, tj} <= 1, then merge clusters:
{A, B}, {C, D}, {E}
Step 3: At Level 2: Min_dist = 2.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence min_dist({A,B}, {C, D}) = 2
A->E = 3 B->E = 3 min_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 min_dist({C, D}, {E}) = 3
Since the threshold is 2, we merge to get {A, B, C, D}, {E}
Step 4: At Level 3: Min_dist = 3.
Find the distance between the clusters formed in Step 3:
A->E = 3 B->E = 3 C->E = 5 D->E = 3
min_dist({A, B, C, D}, {E}) = 3
Since the threshold is 3, we merge both clusters to get {A, B, C, D, E}
Hierarchical Clustering: Complete-Link Example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At Level 1: Max_dist = 1.
Find the distance between each pair.
If max_dist{ti, tj} <= 1, then merge clusters:
{A, B}, {C, D}, {E}
Step 3: At Level 2: Max_dist = 2.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence max_dist({A,B}, {C, D}) = 4
A->E = 3 B->E = 3 max_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 max_dist({C, D}, {E}) = 5
Since the threshold is 2, no merge at this level
Step 4: At Level 3: Max_dist = 3.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence max_dist({A,B}, {C, D}) = 4
A->E = 3 B->E = 3 max_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 max_dist({C, D}, {E}) = 5
Since the threshold is 3 and max_dist({A,B}, {E}) = 3, we merge them to get {A, B, E}, {C, D}
Step 5: At Level 4: Max_dist = 4.
Find the distance between the clusters formed in Step 4:
A->C = 2 B->C = 2 A->D = 2 B->D = 4
C->E = 5 D->E = 3
max_dist({C, D}, {A, B, E}) = 5
Since the threshold is 4, no merge
Step 6: At Level 5: Merge both clusters to get {A, B, C, D, E}
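Both walkthroughs can be cross-checked with SciPy's hierarchical clustering (linkage and squareform are standard calls; the matrix is the A-E table above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
condensed = squareform(D)  # linkage expects the condensed distance form

print(linkage(condensed, method="single"))    # merge heights 1, 1, 2, 3
print(linkage(condensed, method="complete"))  # merge heights 1, 1, 3, 5
```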
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
Recent Hierarchical Clustering Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n²),
where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 ROCK (1999): clustering categorical data by neighbor
and link analysis
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling