Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Disclaimer: Content taken from Han & Kamber slides, Data mining textbooks and Internet
What is Cluster Analysis?
 Cluster: a collection of data objects
 Similar to one another within the same cluster
 Dissimilar to the objects in other clusters
 Cluster analysis
 Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
 Unsupervised learning: no predefined classes
 Typical applications
 As a stand-alone tool to get insight into data distribution
 As a preprocessing step for other algorithms
Clustering Houses
(figure omitted) The same set of houses can be clustered by size or by geographic distance.
Clustering: Rich Applications and
Multidisciplinary Efforts
 Pattern Recognition
 Clustering methods group similar patterns into clusters whose
members are more similar to each other than to members of other clusters
 Spatial Data Analysis
 Create thematic maps in GIS by clustering feature spaces
 Detect spatial clusters or for other spatial mining tasks
 Image Processing
 clustering is used in image segmentation for separating image objects which are
analyzed further
 Economic Science (especially market research)
 cluster analysis is used to classify objects into relatively homogeneous groups based on a
set of variables such as demographics, psychographics, buying behaviours,
attitudes, and preferences
 WWW
 Document classification
 Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications
 Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
 Land use: Identification of areas of similar land use in an earth
observation database
 Insurance: Identifying groups of motor insurance policy holders with
a high average claim cost
 City-planning: Identifying groups of houses according to their house
type, value, and geographical location
 Earth-quake studies: Observed earth quake epicenters should be
clustered along continent faults
Quality: What Is Good Clustering?
 A good clustering method will produce high quality
clusters with
 high intra-class similarity
 low inter-class similarity
 The quality of a clustering result depends on both the
similarity measure used by the method and its
implementation
 The quality of a clustering method is also measured by its
ability to discover some or all of the hidden patterns
Measure the Quality of Clustering
 Dissimilarity/Similarity metric: Similarity is expressed in
terms of a distance function, typically metric: d(i, j)
 There is a separate “quality” function that measures the
“goodness” of a cluster.
 The definitions of distance functions are usually very
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables.
 Weights should be associated with different variables
based on applications and data semantics.
 It is hard to define “similar enough” or “good enough”
 the answer is typically highly subjective.
Requirements of Clustering in Data Mining
 Scalability
 Ability to deal with different types of attributes
 Ability to handle dynamic data
 Discovery of clusters with arbitrary shape
 Minimal requirements for domain knowledge to
determine input parameters
 Able to deal with noise and outliers
 Insensitive to order of input records
 High dimensionality
 Incorporation of user-specified constraints
 Interpretability and usability
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Data Structures
 Data matrix (two modes): n objects × p variables

$$\begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}$$

 Dissimilarity matrix (one mode): n × n table of pairwise dissimilarities d(i, j)

$$\begin{bmatrix} 0 & & & \\ d(2,1) & 0 & & \\ d(3,1) & d(3,2) & 0 & \\ \vdots & \vdots & \vdots & \ddots \\ d(n,1) & d(n,2) & \cdots & 0 \end{bmatrix}$$
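As a concrete illustration (a sketch with assumed toy data; pdist and squareform are standard SciPy calls), both structures can be built directly:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects (rows) x p variables (columns) -- "two modes"
X = np.array([[185.0, 72.0],
              [170.0, 56.0],
              [168.0, 60.0]])

# Dissimilarity matrix: n x n pairwise distances -- "one mode"
D = squareform(pdist(X, metric="euclidean"))
print(np.round(D, 3))  # symmetric, zero diagonal: D[i, j] = d(i, j)
```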
Type of data in clustering analysis
 Interval-scaled variables
 Binary variables
 Nominal, ordinal, and ratio variables
 Variables of mixed types
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Major Clustering Approaches (I)
 Partitioning approach:
 Construct various partitions and then evaluate them by some criterion,
e.g., minimizing the sum of square errors
 Typical methods: k-means, k-medoids, CLARANS
 Hierarchical approach:
 Create a hierarchical decomposition of the set of data (or objects) using
some criterion
 Typical methods: DIANA, AGNES, BIRCH, ROCK, CHAMELEON
 Density-based approach:
 Based on connectivity and density functions
 Typical methods: DBSCAN, OPTICS, DenClue
Major Clustering Approaches (II)
 Grid-based approach:
 based on a multiple-level granularity structure
 Typical methods: STING, WaveCluster, CLIQUE
 Model-based:
 A model is hypothesized for each of the clusters; the aim is to find the best
fit of the data to the given model
 Typical methods: EM, SOM, COBWEB
 Frequent pattern-based:
 Based on the analysis of frequent patterns
 Typical methods: pCluster
 User-guided or constraint-based:
 Clustering by considering user-specified or application-specific constraints
 Typical methods: COD (obstacles), constrained clustering
Typical Alternatives to Calculate the Distance
between Clusters
 Single link: smallest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = min d(tip, tjq)
 Complete link: largest distance between an element in one cluster
and an element in the other, i.e., dis(Ki, Kj) = max d(tip, tjq)
 Average: average distance between an element in one cluster and an
element in the other, i.e., dis(Ki, Kj) = avg d(tip, tjq)
 Centroid: distance between the centroids of two clusters, i.e.,
dis(Ki, Kj) = dis(Ci, Cj)
 Medoid: distance between the medoids of two clusters, i.e., dis(Ki,
Kj) = dis(Mi, Mj)
 Medoid: one chosen, centrally located object in the cluster
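A brief sketch of these alternatives (the two toy clusters and Euclidean point distance are assumptions for illustration):

```python
import numpy as np

def pairwise(Ki, Kj):
    """All point-to-point distances d(t_ip, t_jq) between two clusters."""
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=-1)

Ki = np.array([[1.0, 1.0], [2.0, 1.0]])
Kj = np.array([[5.0, 4.0], [6.0, 5.0]])

d = pairwise(Ki, Kj)
print("single  :", d.min())   # smallest pairwise distance
print("complete:", d.max())   # largest pairwise distance
print("average :", d.mean())  # mean over all pairs
print("centroid:", np.linalg.norm(Ki.mean(axis=0) - Kj.mean(axis=0)))
```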
Centroid, Radius and Diameter of a
Cluster (for numerical data sets)
 Centroid: the “middle” of a cluster
 Radius: square root of average distance from any point of the
cluster to its centroid
 Diameter: square root of average mean squared distance between
all pairs of points in the cluster
$$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N} \qquad R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}} \qquad D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$$
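A small numerical sketch of the three quantities (the cluster points are assumed for illustration):

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
N = len(cluster)

centroid = cluster.mean(axis=0)                          # C_m
radius = np.sqrt(((cluster - centroid) ** 2).sum() / N)  # R_m

# Diameter: RMS distance over all ordered pairs i != j
sq = ((cluster[:, None, :] - cluster[None, :, :]) ** 2).sum(axis=-1)
diameter = np.sqrt(sq.sum() / (N * (N - 1)))             # D_m

print(centroid, round(radius, 3), round(diameter, 3))
```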
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Partitioning Algorithms: Basic Concept
 Partitioning method: Construct a partition of a database D of n objects
into a set of k clusters that minimizes the sum of squared distances (the criterion E below)
 Given a k, find a partition of k clusters that optimizes the chosen
partitioning criterion
 Global optimal: exhaustively enumerate all partitions
 Heuristic methods: k-means and k-medoids algorithms
 k-means (MacQueen’67): Each cluster is represented by the center
of the cluster
 k-medoids or PAM (Partition around medoids) (Kaufman &
Rousseeuw’87): Each cluster is represented by one of the objects
in the cluster
$$E = \sum_{m=1}^{k} \sum_{t_{mi} \in K_m} (C_m - t_{mi})^2$$
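As a sketch (the toy data and the sse helper are our own, for illustration), the criterion E can be computed directly:

```python
import numpy as np

def sse(X, labels, centroids):
    """E = sum over clusters m of sum over t in K_m of ||C_m - t||^2."""
    return sum(((X[labels == m] - c) ** 2).sum()
               for m, c in enumerate(centroids))

X = np.array([[1.0], [2.0], [10.0], [12.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [11.0]])
print(sse(X, labels, centroids))  # 0.5 + 2.0 = 2.5
```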
The K-Means Clustering Method
 Given k, the k-means algorithm is implemented in
four steps:
 Partition objects into k nonempty subsets
 Compute seed points as the centroids of the
clusters of the current partition (the centroid is the
center, i.e., mean point, of the cluster)
 Assign each object to the cluster with the nearest
seed point
 Go back to Step 2, stop when no more new
assignment
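A minimal sketch of these four steps (assumptions: Euclidean distance, initial seeds drawn at random from the data, and no cluster ever becoming empty):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k objects as the initial seed points (centroids)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Step 2 again: recompute centroids of the current partition
        new = np.array([X[labels == m].mean(axis=0) for m in range(k)])
        if np.allclose(new, centroids):  # stop: no more new assignment
            break
        centroids = new
    return labels, centroids
```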
The K-Means Clustering Method
 Example (scatter plots omitted): with K = 2, arbitrarily choose K objects as the initial cluster centers; assign each object to the most similar center; update the cluster means; reassign and update again; repeat until no object changes cluster.
K-Means example
 Cluster the following items in 2 clusters: {2, 4, 10, 12, 3, 20, 30, 11,
25}
 d(Ci, tj) = √((Ci − tj)²) = |Ci − tj|
 Assign tj to the cluster whose mean Ci minimizes d(Ci, tj)
M1    M2    K1                      K2
2     4     {2, 3}                  {4, 10, 12, 20, 30, 11, 25}
2.5   16    {2, 3, 4}               {10, 12, 20, 30, 11, 25}
3     18    {2, 3, 4, 10}           {12, 20, 30, 11, 25}
4.75  19.6  {2, 3, 4, 10, 12, 11}   {20, 30, 25}
7     25    {2, 3, 4, 10, 12, 11}   {20, 30, 25}
Stopping criteria:
• No new assignment
• No change in cluster means
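The trace above can be reproduced with a short sketch (assumption: a tie in distance keeps the item in the first cluster, matching the opening step of the table):

```python
import numpy as np

items = np.array([2, 4, 10, 12, 3, 20, 30, 11, 25], dtype=float)
m1, m2 = 2.0, 4.0                       # initial means M1, M2
while True:
    in_k1 = np.abs(items - m1) <= np.abs(items - m2)
    k1, k2 = items[in_k1], items[~in_k1]
    print(f"M1={m1:g} M2={m2:g} K1={sorted(k1)} K2={sorted(k2)}")
    new1, new2 = k1.mean(), k2.mean()
    if (new1, new2) == (m1, m2):        # no change in cluster means
        break
    m1, m2 = new1, new2
```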
K-means 2D example
 Apply k-means for the following dataset to make 2 clusters:
X Y
185 72
170 56
168 60
179 68
182 72
188 77
Step 1: Assume Initial Centroids: C1 = (185, 72), C2 = (170, 56)
Step 2: Calculate the Euclidean distance to each centroid:
d[(x, y), (a, b)] = √((x − a)² + (y − b)²)
For t1 = (168, 60):
d[(185, 72), (168, 60)] = √((185 − 168)² + (72 − 60)²) = 20.808
d[(170, 56), (168, 60)] = √((170 − 168)² + (56 − 60)²) = 4.472
Since d(C2, t1) < d(C1, t1), assign t1 to C2
Step 3: For t2 = (179, 68):
d[(185, 72), (179, 68)] = √((185 − 179)² + (72 − 68)²) = 7.211
d[(170, 56), (179, 68)] = √((170 − 179)² + (56 − 68)²) = 15
Since d(C1, t2) < d(C2, t2), assign t2 to C1
Step 4: For t3 = (182, 72):
d[(185, 72), (182, 72)] = √((185 − 182)² + (72 − 72)²) = 3
d[(170, 56), (182, 72)] = √((170 − 182)² + (56 − 72)²) = 20
Since d(C1, t3) < d(C2, t3), assign t3 to C1
K-means 2D example
 Apply k-means for the following dataset to make 2 clusters:
X Y
185 72
170 56
168 60
179 68
182 72
188 77
Step 5: For t4 = (188, 77):
d[(185, 72), (188, 77)] = √((185 − 188)² + (72 − 77)²) = 5.83
d[(170, 56), (188, 77)] = √((170 − 188)² + (56 − 77)²) = 27.66
Since d(C1, t4) < d(C2, t4), assign t4 to C1
Step 6: Clusters after 1 iteration:
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}
Step 7: New cluster centroids: C1 = (183.5, 72.25), C2 = (169, 58)
Repeat the above steps for all samples until convergence
Final Clusters
D1 = {(185, 72), (179, 68), (182, 72), (188, 77)}
D2 = {(170, 56), (168, 60)}
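For a quick cross-check, scikit-learn's KMeans reproduces this result; seeding it with the slide's initial centroids via the init array is our assumption to mirror the example:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[185, 72], [170, 56], [168, 60],
              [179, 68], [182, 72], [188, 77]], dtype=float)
init = np.array([[185, 72], [170, 56]], dtype=float)

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 1 1 0 0 0]
print(km.cluster_centers_)  # approx [[183.5, 72.25], [169., 58.]]
```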
Comments on the K-Means Method
 Strength: Relatively efficient: O(tkn), where n is # objects, k is #
clusters, and t is # iterations. Normally, k, t << n.
 Comparing: PAM: O(k(n−k)²), CLARA: O(ks² + k(n−k))
 Comment: Often terminates at a local optimum. The global optimum
may be found using techniques such as: deterministic annealing and
genetic algorithms
 Weakness
 Applicable only when mean is defined, then what about categorical
data?
 Need to specify k, the number of clusters, in advance
 Unable to handle noisy data and outliers
 Not suitable to discover clusters with non-convex shapes
Variations of the K-Means Method
 A few variants of the k-means which differ in
 Selection of the initial k means
 Dissimilarity calculations
 Strategies to calculate cluster means
 Handling categorical data: k-modes (Huang’98)
 Replacing means of clusters with modes
 Using new dissimilarity measures to deal with categorical objects
 Using a frequency-based method to update modes of clusters
 A mixture of categorical and numerical data: k-prototype method
What Is the Problem of the K-Means Method?
 The k-means algorithm is sensitive to outliers !
 Since an object with an extremely large value may substantially
distort the distribution of the data.
 K-Medoids: Instead of taking the mean value of the objects in a
cluster as a reference point, a medoid can be used: the most
centrally located object in the cluster.
The K-Medoids Clustering Method
 Find representative objects, called medoids, in clusters
 PAM (Partitioning Around Medoids, 1987)
 starts from an initial set of medoids and iteratively replaces one
of the medoids by one of the non-medoids if it improves the
total distance of the resulting clustering
 PAM works effectively for small data sets, but does not scale
well for large data sets
 CLARA (Kaufmann & Rousseeuw, 1990)
 CLARANS (Ng & Han, 1994): Randomized sampling
 Focusing + spatial data structure (Ester et al., 1995)
A Typical K-Medoids Algorithm (PAM)
(figure omitted) With K = 2: arbitrarily choose k objects as the initial medoids and assign each remaining object to the nearest medoid (total cost = 20). Randomly select a non-medoid object Orandom and compute the total cost of swapping it with a medoid O; if quality is improved, swap O and Orandom (the illustrated swap has total cost = 26). Loop until no change.
PAM (Partitioning Around Medoids) (1987)
 PAM (Kaufman and Rousseeuw, 1987), built in Splus
 Use real object to represent the cluster
1. Select k representative objects arbitrarily
2. For each pair of non-selected object h and selected
object i, calculate the total swapping cost TCih
3. For each pair of i and h,
 If TCih < 0, i is replaced by h
 Then assign each non-selected object to the most
similar representative object
4. repeat steps 2-3 until there is no change
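A compact sketch of steps 1-4 (assumption: objects are described by a precomputed dissimilarity matrix D, as in the example that follows):

```python
import numpy as np

def pam(D, k, medoids=None):
    n = len(D)
    medoids = list(medoids) if medoids else list(range(k))  # step 1
    while True:
        cost = D[:, medoids].min(axis=1).sum()   # current total cost
        best, best_cost = None, cost
        for i in medoids:                        # step 2: every swap (i, h)
            for h in set(range(n)) - set(medoids):
                trial = [h if m == i else m for m in medoids]
                c = D[:, trial].min(axis=1).sum()
                if c < best_cost:                # i.e. TC_ih < 0
                    best, best_cost = trial, c
        if best is None:                         # step 4: no change -> stop
            return medoids, D[:, medoids].argmin(axis=1)
        medoids = best                           # step 3: i replaced by h
```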
PAM example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: Let A and B be the medoids. Obtain clusters by assigning each element to
the closer of the two medoids: {A, C, D} & {B, E}
Step 2: Now examine the three non-medoids {C, D, E} to determine if they can
replace the existing medoids,
i.e., A replaced by C, D, or E, and B replaced by C, D, or E.
We have 6 costs to determine:
TCAC, TCAD, TCAE, TCBC, TCBD, TCBE,
where Cjih = cost change for item tj caused by swapping medoid ti with non-medoid th, and TCih = Σj Cjih
Let us replace A with C. Cost of replacing A with C; new clusters: {A, B, E} & {C, D}
TCAC = CAAC + CBAC + CCAC + CDAC + CEAC = 1 + 0 + (-2) + (-1) + 0 = -2
The overall cost is reduced by 2
Step 3: Let us replace A with D. Cost of replacing A with D; new clusters: {A, B, E} & {C, D}
TCAD = CAAD + CBAD + CCAD + CDAD + CEAD = 1 + 0 + (-1) + (-2) + 0 = -2
The overall cost is reduced by 2
Step 4: Let us replace A with E. Cost of replacing A with E; new clusters: {A, B, C} & {E, D}
TCAE = CAAE + CBAE + CCAE + CDAE + CEAE = 1 + 0 + 0 + 1 + (-3) = -1
The overall cost is reduced by 1
Step 5: Let us replace B with C. Cost of replacing B with C; new clusters: {A, B, E} & {C, D}
TCBC = CABC + CBBC + CCBC + CDBC + CEBC = 0 + 1 + (-2) + (-1) + 0 = -2
The overall cost is reduced by 2
Step 6: Let us replace B with D. Cost of replacing B with D; new clusters: {A, B, E} & {C, D}
TCBD = CABD + CBBD + CCBD + CDBD + CEBD = 0 + 1 + (-1) + (-2) + 0 = -2
The overall cost is reduced by 2
Step 7: Let us replace B with E. Cost of replacing B with E; new clusters: {A, B, C, D} & {E}
TCBE = CABE + CBBE + CCBE + CDBE + CEBE = 0 + 1 + 0 + 0 + (-3) = -2
The overall cost is reduced by 2
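All six costs can be verified directly from the distance matrix; a sketch (the total_cost helper is our own, computing TCih as the change in Σj d(tj, nearest medoid)):

```python
import numpy as np

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
names = "ABCDE"

def total_cost(medoids):
    # Each object contributes its distance to the nearest medoid
    return D[:, medoids].min(axis=1).sum()

base = total_cost([0, 1])                  # medoids A and B
for i in (0, 1):                           # medoid being replaced
    for h in (2, 3, 4):                    # candidate non-medoid
        trial = [h if m == i else m for m in (0, 1)]
        print(f"TC{names[i]}{names[h]} = {total_cost(trial) - base:+.0f}")
# Output: TCAC=-2, TCAD=-2, TCAE=-1, TCBC=-2, TCBD=-2, TCBE=-2
```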
PAM Clustering: total swapping cost TCih = Σj Cjih
(figures omitted) Four cases for a non-selected object j when medoid i is swapped with non-medoid h (t denotes some other current medoid):
 j belongs to another medoid t and stays with t: Cjih = 0
 j belongs to i and is reassigned to h: Cjih = d(j, h) - d(j, i)
 j belongs to i and is reassigned to another medoid t: Cjih = d(j, t) - d(j, i)
 j belongs to another medoid t and is reassigned to h: Cjih = d(j, h) - d(j, t)
What Is the Problem with PAM?
 PAM is more robust than k-means in the presence of
noise and outliers because a medoid is less influenced by
outliers or other extreme values than a mean
 PAM works efficiently for small data sets but does not
scale well for large data sets.
 O(k(n−k)²) for each iteration,
where n is # of data points and k is # of clusters
 Sampling-based method: CLARA (Clustering LARge Applications)
Chapter 7. Cluster Analysis
1. What is Cluster Analysis?
2. Types of Data in Cluster Analysis
3. A Categorization of Major Clustering Methods
4. Partitioning Methods
5. Hierarchical Methods
Hierarchical Clustering
 Use distance matrix as clustering criteria. This method
does not require the number of clusters k as an input,
but needs a termination condition
(figure omitted) Agglomerative clustering (AGNES) works bottom-up: step 0 starts from singletons a, b, c, d, e; successive steps merge them, e.g., {a, b} and {d, e}, then {c, d, e}, until one cluster {a, b, c, d, e} remains at step 4. Divisive clustering (DIANA) traverses the same hierarchy top-down, in the inverse order.
AGNES (Agglomerative Nesting)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Use the Single-Link method and the dissimilarity matrix.
 Merge nodes that have the least dissimilarity
 Go on in a non-descending fashion
 Eventually all nodes belong to the same cluster
Dendrogram: Shows How the Clusters are Merged
Decompose data objects into several levels of nested
partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the
dendrogram at the desired level; each connected component
then forms a cluster.
Hierarchical Clustering: Single-Link Example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At Level 1: Min_dist = 1.
Find the distance between each pair.
If min_dist{ti, tj} <= 1, then merge clusters:
{A, B}, {C, D}, {E}
Step 3: At Level 2: Min_dist = 2.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence min_dist({A,B}, {C, D}) = 2
A->E = 3 B->E = 3 min_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 min_dist({C, D}, {E}) = 3
Since the threshold is 2, we merge to get {A, B, C, D}, {E}
Step 4: At Level 3: Min_dist = 3.
Find the distance between the clusters formed in Step 3:
A->E = 3 B->E = 3 C->E = 5 D->E = 3
min_dist({A, B, C, D}, {E}) = 3
Since the threshold is 3, we merge both clusters to get {A, B, C, D, E}
Hierarchical Clustering: Complete-Link Example
A B C D E
A 0 1 2 2 3
B 1 0 2 4 3
C 2 2 0 1 5
D 2 4 1 0 3
E 3 3 5 3 0
Step 1: At level 0, 5 clusters: {A}, {B}, {C}, {D}, {E}
Step 2: At Level 1: Max_dist = 1.
Find the distance between each pair.
If max_dist{ti, tj} <= 1, then merge clusters:
{A, B}, {C, D}, {E}
Step 3: At Level 2: Max_dist = 2.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence max_dist({A,B}, {C, D}) = 4
A->E = 3 B->E = 3 max_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 max_dist({C, D}, {E}) = 5
Since the threshold is 2, no merge at this level
Step 4: At Level 3: Max_dist = 3.
Find the distance between the clusters formed in Step 2:
A->C = 2 B->C = 2
A->D = 2 B->D = 4
Hence max_dist({A,B}, {C, D}) = 4
A->E = 3 B->E = 3 max_dist({A,B}, {E}) = 3
C->E = 5 D->E = 3 max_dist({C, D}, {E}) = 5
Since the threshold is 3 and max_dist({A,B}, {E}) = 3, we merge them to get {A, B, E}, {C, D}
Step 5: At Level 4: Max_dist = 4.
Find the distance between the clusters formed in Step 4:
A->C = 2 B->C = 2 A->D = 2 B->D = 4
C->E = 5 D->E = 3
max_dist({C, D}, {A, B, E}) = 5
Since the threshold is 4, no merge
Step 6: At Level 5: Merge both clusters to get {A, B, C, D, E}
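Both walkthroughs can be cross-checked with SciPy's hierarchical clustering (linkage and squareform are standard calls; the matrix is the A-E table above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)
condensed = squareform(D)  # linkage expects the condensed distance form

print(linkage(condensed, method="single"))    # merge heights 1, 1, 2, 3
print(linkage(condensed, method="complete"))  # merge heights 1, 1, 3, 5
```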
DIANA (Divisive Analysis)
 Introduced in Kaufmann and Rousseeuw (1990)
 Implemented in statistical analysis packages, e.g., Splus
 Inverse order of AGNES
 Eventually each node forms a cluster on its own
Recent Hierarchical Clustering Methods
 Major weakness of agglomerative clustering methods
 do not scale well: time complexity of at least O(n²),
where n is the number of total objects
 can never undo what was done previously
 Integration of hierarchical with distance-based clustering
 BIRCH (1996): uses CF-tree and incrementally adjusts
the quality of sub-clusters
 ROCK (1999): clustering categorical data by neighbor
and link analysis
 CHAMELEON (1999): hierarchical clustering using
dynamic modeling