Clustering
Ms. Rashmi Bhat
What is Clustering??
▪ Grouping of objects
How will you group these together??
What is Clustering??
Option 1: By Type Option 2: By Color
What is Clustering??
Option 3: By Shape
What is Cluster Analysis??
▪ A cluster is a collection of data objects that are similar to one another
within the same cluster and are dissimilar to the objects in other clusters.
▪ Cluster analysis has focused mainly on distance-based methods.
The process of grouping a set of physical or abstract objects into classes of
similar objects is called Clustering.
What is Cluster Analysis??
▪ How clustering differs from classification???
What is Cluster Analysis??
▪ Clustering is also called data segmentation
▪ Clustering is finding borders between groups,
▪ Segmenting is using borders to form groups
▪ Clustering is the method of creating segments.
▪ Clustering can also be used for outlier detection
What is Cluster Analysis??
▪ Classification: Supervised Learning
▪ Classes are predetermined
▪ Based on training data set
▪ Used to classify future observations
▪ Clustering : Unsupervised Learning
▪ Classes are not known in advance
▪ No prior knowledge
▪ Used to explore (understand) the data
▪ Clustering is a form of learning by observation, rather than learning by
examples.
Applications of Clustering
▪ Marketing:
▪ Segmentation of the customer based on behavior
▪ Banking:
▪ ATM Fraud detection (outlier detection)
▪ Gene analysis:
▪ Identifying genes responsible for a disease
▪ Image processing:
▪ Identifying objects on an image (face detection)
▪ Houses:
▪ Identifying groups of houses according to their house type, value, and geographical location
Requirements of Clustering Analysis
▪ The following are typical requirements of clustering in data mining:
▪ Scalability
▪ Dealing with different types of attributes
▪ Discovering clusters with arbitrary shapes
▪ Ability to deal with noisy data
▪ Minimal requirements for domain knowledge to determine input parameters
▪ Incremental clustering
▪ High dimensionality
▪ Constraint-based clustering
▪ Interpretability and usability
Distance Measures
▪ Cluster analysis has focused mainly on distance-based methods.
▪ Distance is a quantitative measure of how far apart two objects are.
▪ A similarity measure quantifies how much alike two data objects are.
▪ If the distance is small, the objects have a high degree of similarity,
whereas a large distance indicates a low degree of similarity.
▪ Generally, similarity is measured in the range [0, 1].
▪ Similarity = 1 if X = Y (where X and Y are two objects)
▪ Similarity = 0 if X ≠ Y
Distance Measures
• Euclidean Distance
• Manhattan Distance
• Minkowski Distance
• Cosine Similarity
• Jaccard Similarity
Distance Measures: Euclidean Distance

D(X, Y) = √((x₂ − x₁)² + (y₂ − y₁)²)

• The Euclidean distance between two points is the length of the
path connecting them.
• The Pythagorean theorem gives this distance between two points.
Distance Measures: Manhattan Distance

D(A, B) = |x₂ − x₁| + |y₂ − y₁|

• Manhattan distance is a metric in which the distance between
two points is calculated as the sum of the absolute differences
of their Cartesian coordinates.
• It is the total sum of the absolute differences between the
x-coordinates and the y-coordinates.
Distance Measures: Minkowski Distance

D(X, Y) = ( Σ_{i=1}^{n} |xᵢ − yᵢ|ᵖ )^(1/p)

• It is the generalized form of the Euclidean and Manhattan distance
measures: p = 1 gives the Manhattan distance and p = 2 gives the Euclidean distance.
Distance Measures: Cosine Similarity

• The cosine similarity metric finds the normalized dot
product of the two attribute vectors:
cos(θ) = (A · B) / (‖A‖ ‖B‖)
• By determining the cosine similarity, we effectively
find the cosine of the angle between the two objects.
• The cosine of 0° is 1, and it is less than 1 for any other
angle.
Distance Measures: Jaccard Similarity

• For Jaccard similarity, the objects are treated as sets.

Jaccard Similarity J(A, B) = |A ∩ B| / |A ∪ B|

Example: if |A ∪ B| = 7 and |A ∩ B| = 2, then J(A, B) = 2/7 = 0.286
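To make these measures concrete, here is a small Python sketch of all five (a rough illustration with my own function names, not a library implementation):

```python
# Rough sketches of the distance and similarity measures discussed above.
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norms = sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y))
    return dot / norms

def jaccard_similarity(a, b):                      # a and b are sets
    return len(a & b) / len(a | b)

print(euclidean((2, 5), (2, 1)))                   # 4.0
print(manhattan((2, 5), (2, 1)))                   # 4
print(minkowski((2, 5), (2, 1), p=2))              # p = 2 reproduces Euclidean distance
print(cosine_similarity((1, 2, 3), (2, 4, 6)))     # parallel vectors give 1.0
print(jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6, 7}))   # 2/7 ≈ 0.286
```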
Clustering Techniques
▪ Clustering techniques are categorized into the following categories:
Partitioning Methods
Hierarchical Methods
Density-based Methods
Grid-based Methods
Model-based Methods
Partitioning Method
▪ Construct a partition of a database 𝑫 of 𝒏 objects into 𝒌 clusters
▪ each cluster contains at least one object
▪ each object belongs to exactly one cluster
▪ Given 𝒌, find a partition of 𝒌 clusters that optimizes the chosen
partitioning criterion (e.g., minimum distance from cluster centers)
▪ Global optimum: exhaustively enumerating all partitions requires examining
Stirling(n, k) candidates (S(10,3) = 9,330; S(20,3) = 580,606,446; …)
▪ Heuristic methods: the k-means and k-medoids algorithms
▪ k-means: Each cluster is represented by the center (mean) of the cluster.
▪ k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of
the objects in the cluster.
𝑘-means Clustering
Input:
𝒌, the number of clusters, and a database 𝑫 of 𝒏 objects.
Output:
A set of 𝒌 clusters that minimizes the squared-error function.
Algorithm:
1. Arbitrarily choose 𝒌 objects from 𝑫 as the initial cluster centers;
2. Repeat
1. (Re)assign each object to the cluster to which the object is the most similar, based on
the mean value of the objects in the cluster;
2. Update the cluster means, i.e., calculate the mean value of the objects for each cluster;
3. Until no change;
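As a rough Python sketch of this loop (the function and variable names are mine, not from the slides), the algorithm might look like this:

```python
# Minimal k-means sketch: assign points to the nearest mean, then recompute the means.
from math import dist
import random

def k_means(points, k, seed=0):
    random.seed(seed)
    centers = random.sample(points, k)                    # arbitrary initial centers
    while True:
        # Assignment step: each point goes to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: dist(p, centers[c]))
            clusters[j].append(p)
        # Update step: each center becomes the mean of its cluster.
        new_centers = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                        # stop when nothing changes
            return centers, clusters
        centers = new_centers

points = [(2, 5), (2, 1), (7, 1), (3, 5), (4, 4), (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)]
centers, clusters = k_means(points, k=3)
print(centers)
```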
𝑘-means Clustering
Example: Cluster the following data example into 3 clusters using k-means clustering and Euclidean
distance
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
𝑘-means Clustering
1. Arbitrarily choose 3 points as the initial cluster centers
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
𝑘-means Clustering
2. Assign each point to its closest cluster center. Calculate the distance of the point from each
cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Euclidean distance: D = √((x₂ − x₁)² + (y₂ − y₁)²)

Cluster centers: C1 = (2,1), C2 = (4,4), C3 = (2,3)

For P1 = (2,5):
D(P1, C1) = √((2 − 2)² + (1 − 5)²) = √16 = 4
D(P1, C2) = √((4 − 2)² + (4 − 5)²) = √5 = 2.236
D(P1, C3) = √((2 − 2)² + (3 − 5)²) = √4 = 2
P1 is closest to C3, so: Cluster1 = { }, Cluster2 = { }, Cluster3 = {(2,5)}

For P2 = (2,1):
D(P2, C1) = √((2 − 2)² + (1 − 1)²) = 0
D(P2, C2) = √((4 − 2)² + (4 − 1)²) = √13 = 3.605
D(P2, C3) = √((2 − 2)² + (3 − 1)²) = √4 = 2
P2 is closest to C1, so: Cluster1 = {(2,1)}, Cluster2 = { }, Cluster3 = {(2,5)}

Similarly, assign the other points to the appropriate clusters (a short sketch automating this assignment step follows below).
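For instance, a small sketch like the following (my own helper code, shown only to mirror the hand calculation) reproduces the assignment step for the initial centers; ties go to the first, lowest-numbered center:

```python
# Assignment step for the first iteration, using the initial centers from the example.
from math import dist

points  = [(2, 5), (2, 1), (7, 1), (3, 5), (4, 4), (6, 2), (1, 2), (6, 1), (3, 4), (2, 3)]
centers = [(2, 1), (4, 4), (2, 3)]     # C1, C2, C3

for i, p in enumerate(points, start=1):
    distances = [dist(p, c) for c in centers]          # Euclidean distance to each center
    nearest = distances.index(min(distances)) + 1      # index of the closest center
    print(f"P{i} {p}: {[round(d, 3) for d in distances]} -> Cluster{nearest}")
```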
𝑘-means Clustering
2. Assign each point to its closest cluster center. Calculate the distance of the point from each
cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Assigning the points one by one (intermediate states):
Cluster1 = { }, Cluster2 = { }, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = { }, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = {(7,1)}, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = {(4,4), (7,1)}, Cluster3 = {(2,5)}
Cluster1 = {(2,1)}, Cluster2 = {(4,4), (7,1), (3,5)}, Cluster3 = {(2,5)}
Final assignment:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4), (7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3), (2,5)}
𝑘-means Clustering
3. Update the cluster means
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (2,1)
C2 = (4,4)
C3 = (2,3)
Clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(4,4), (7,1), (3,5), (6,2), (6,1), (3,4)}
Cluster3 = {(2,3), (2,5)}
Calculate the mean of the points in each cluster:
mean1 = ((2 + 1)/2, (1 + 2)/2) = (1.5, 1.5)
mean2 = ((4 + 7 + 3 + 6 + 6 + 3)/6, (4 + 1 + 5 + 2 + 1 + 4)/6) = (4.83, 2.83)
mean3 = ((2 + 2)/2, (3 + 5)/2) = (2, 4)
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
𝑘-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate the distance of the point from
each cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Updated Cluster Centers: C1 = (1.5, 1.5), C2 = (4.83, 2.83), C3 = (2, 4)

For P1 = (2,5):
D(P1, C1) = √((1.5 − 2)² + (1.5 − 5)²) = 3.535
D(P1, C2) = √((4.83 − 2)² + (2.83 − 5)²) = 3.566
D(P1, C3) = √((2 − 2)² + (4 − 5)²) = 1
P1 is closest to C3, so: Cluster1 = { }, Cluster2 = { }, Cluster3 = {(2,5)}

For P2 = (2,1):
D(P2, C1) = √((1.5 − 2)² + (1.5 − 1)²) = 0.707
D(P2, C2) = √((4.83 − 2)² + (2.83 − 1)²) = 3.370
D(P2, C3) = √((2 − 2)² + (4 − 1)²) = 3
P2 is closest to C1, so: Cluster1 = {(2,1)}, Cluster2 = { }, Cluster3 = {(2,5)}

Similarly, assign the other points to the appropriate clusters.
𝑘-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate the distance of the point from
each cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Updated Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters:
Cluster1 = {(2,1), (1,2)}
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
𝑘-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate the distance of the point from
each cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (4.83, 2.83)
C3 = (2, 4)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (4,4), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (3,4), (2,3)}
3. Update the cluster centers. Repeat the process until there is no
change in the clusters.
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
𝑘-means Clustering
2. Repeat: Assign each point to its closest cluster center. Calculate the distance of the point from
each cluster center and choose the closest one.
Point X Y
P1 2 5
P2 2 1
P3 7 1
P4 3 5
P5 4 4
P6 6 2
P7 1 2
P8 6 1
P9 3 4
P10 2 3
Old Cluster Centers:
C1 = (1.5, 1.5)
C2 = (5.75, 2)
C3 = (2.5, 4.25)
Updated Clusters
Cluster1 = {(2,1), (1,2) }
Cluster2 = {(7,1), (6,2), (6,1)}
Cluster3 = {(3,5), (2,5), (4,4), (3,4), (2,3)}
3. Update the cluster centers. Repeat the process until there is no
change in the clusters.
New Cluster Centers:
C1 = (1.5, 1.5)
C2 = (6.33, 1.33)
C3 = (2.8, 4.2)
𝑘-means Clustering
Apply the k-means algorithm to the following data set, forming two
clusters.
D={15, 16, 19, 20, 20, 21, 22, 28, 35, 40, 41, 42, 43, 44, 60, 61, 65}
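One way to check your hand-worked answer to this exercise is to run an off-the-shelf implementation; a sketch using scikit-learn (assuming it is installed) could look like this:

```python
# Sketch: running k-means (k = 2) on the 1-D exercise data with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

data = np.array([15, 16, 19, 20, 20, 21, 22, 28, 35, 40,
                 41, 42, 43, 44, 60, 61, 65], dtype=float).reshape(-1, 1)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
for label in range(2):
    members = data[km.labels_ == label].ravel()
    print(f"Cluster {label + 1}: {members.tolist()}  (mean = {members.mean():.2f})")
```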
𝑘-means Clustering
▪ Advantages:
▪ Relatively scalable and efficient in processing large data sets
▪ The computational complexity of the algorithm is O(nkt)
▪ where 𝑛 is the total number of objects, 𝑘 is the number of clusters, and 𝑡 is the number of iterations
▪ This method terminates at a local optimum.
▪ Disadvantages:
▪ Can be applied only when the mean of a cluster is defined
▪ The necessity for users to specify 𝑘, the number of clusters, in advance.
▪ Sensitive to noise and outlier data points
𝑘-means Clustering
▪ How to cluster categorical data?
▪ A variant of 𝑘-means, the 𝑘-modes method, is used for clustering categorical data
▪ Replace the mean of a cluster with the mode of its data
▪ A new dissimilarity measure is used to deal with categorical objects
▪ A frequency-based method is used to update the modes of clusters (see the sketch below).
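As a rough Python illustration of these ideas (the dissimilarity used here, simple matching, i.e. the count of differing attributes, is an assumption on my part, as is all naming and the toy data), a k-modes style loop might look like this:

```python
# Sketch of the k-modes idea: matching dissimilarity + per-attribute modes.
from collections import Counter
import random

def matching_dissimilarity(a, b):
    """Number of attributes on which two categorical objects differ."""
    return sum(x != y for x, y in zip(a, b))

def mode_of(cluster):
    """Per-attribute mode of a list of categorical objects (tuples)."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*cluster))

def k_modes(objects, k, max_iter=100, seed=0):
    random.seed(seed)
    modes = random.sample(objects, k)           # initial modes: k random objects
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:                     # assign each object to the nearest mode
            j = min(range(k), key=lambda c: matching_dissimilarity(obj, modes[c]))
            clusters[j].append(obj)
        new_modes = [mode_of(c) if c else modes[j] for j, c in enumerate(clusters)]
        if new_modes == modes:                  # stop when the modes no longer change
            break
        modes = new_modes
    return modes, clusters

# Hypothetical categorical data: (color, shape)
data = [("red", "circle"), ("red", "square"), ("blue", "circle"),
        ("blue", "triangle"), ("blue", "square"), ("red", "triangle")]
modes, clusters = k_modes(data, k=2)
print(modes, clusters)
```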
𝑘-Medoids Clustering
▪ Picks actual objects to represent the clusters, using one representative object per
cluster
▪ Each remaining object is clustered with the representative object to which it is the
most similar.
▪ The partitioning is then performed based on the principle of minimizing the
sum of the dissimilarities between each object and its corresponding reference
point
▪ The Absolute Error criterion is used:

E = Σ_{j=1}^{k} Σ_{p∈Cj} dist(p, Oj)   . . . . sum of absolute error

where
• p is the point in space representing a given object in cluster Cj
• Oj is the representative object of cluster Cj
𝑘-Medoids Clustering
▪ The iterative process of replacing representative objects by nonrepresentative objects
continues as long as the quality of the resulting clustering is improved.
▪ Quality is measured by a cost function that measures the average dissimilarity between an
object and the representative object of its cluster.
▪ Four cases are examined for each of the nonrepresentative objects, 𝑝.
▪ Suppose object 𝒑 is currently assigned to a cluster represented by medoid 𝑶𝒋
Fig. The four cases of the cost function for k-medoids clustering: reassignment of 𝒑 among 𝑶𝒊, 𝑶𝒋, and 𝑶𝒓𝒂𝒏𝒅𝒐𝒎 (Case 1 to Case 4, before and after swapping)
𝑘-Medoids Clustering
▪ Each time a reassignment occurs, a difference in absolute error, 𝐸, is
contributed to the cost function.
▪ Therefore, the cost function calculates the difference in absolute-error value if
a current representative object is replaced by a nonrepresentative object.
▪ The total cost of swapping is the sum of costs incurred by all nonrepresentative
objects.
▪ If the total cost is negative, then 𝑂𝑗 is replaced or swapped with 𝑂𝑟𝑎𝑛𝑑𝑜𝑚
▪ If the total cost is positive, the current representative object, 𝑂𝑗, is considered acceptable, and
nothing is changed.
▪ PAM (Partitioning Around Medoids) was one of the first k-medoids algorithms
𝑘-Medoids Clustering
Input: 𝑘 number of clusters, 𝑛 data objects from data set 𝐷
Output: a set of 𝑘 clusters
Algorithm:
1. Arbitrarily select 𝑘 objects as the representative objects or seeds
2. Repeat
1. Assign each remaining object to the cluster with the nearest representative object
2. Randomly select a non-representative object 𝑂𝑟𝑎𝑛𝑑𝑜𝑚
3. Compute the total cost 𝑆 of swapping a representative object 𝑂𝑗 with 𝑂𝑟𝑎𝑛𝑑𝑜𝑚
4. If 𝑆 < 0, then swap 𝑂𝑗 with 𝑂𝑟𝑎𝑛𝑑𝑜𝑚 to form the new set of 𝑘 representative objects
3. Until no change
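Below is a simplified k-medoids sketch in Python; it uses Manhattan distance (as in the worked example that follows) and greedily tries every possible swap instead of the single random swap per iteration described above, so treat it as an illustration of the idea rather than the exact PAM procedure. All names are my own.

```python
# Simplified PAM-style k-medoids sketch (Manhattan distance, greedy swap search).
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(objects, medoids):
    """Sum of distances from each object to its nearest medoid (absolute error E)."""
    return sum(min(manhattan(o, m) for m in medoids) for o in objects)

def k_medoids(objects, k, max_iter=100, seed=0):
    random.seed(seed)
    medoids = random.sample(objects, k)                 # arbitrary initial medoids
    best = total_cost(objects, medoids)
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for o in objects:                           # try swapping medoid i with a non-medoid o
                if o in medoids:
                    continue
                candidate = medoids[:i] + [o] + medoids[i + 1:]
                cost = total_cost(objects, candidate)
                if cost < best:                         # S = cost - best < 0 -> accept the swap
                    medoids, best = candidate, cost
                    improved = True
        if not improved:
            break
    clusters = {m: [] for m in medoids}
    for o in objects:
        nearest = min(medoids, key=lambda m: manhattan(o, m))
        clusters[nearest].append(o)
    return medoids, clusters

# The ten objects from the example that follows:
data = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2), (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]
medoids, clusters = k_medoids(data, k=2)
print("Medoids:", medoids, "E =", total_cost(data, medoids))
```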
𝑘-Medoids Clustering
X Y
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Data Objects
Aim: Create two Clusters
Step 1:
Choose randomly two medoids
(representative objects)
𝑂3 = (3, 8)
𝑂8 = (7,4)
𝑘-Medoids Clustering
X Y Cluster
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Data Objects
Aim: Create two Clusters
Step 2:
Assign each object to the closest
representative object
Using Euclidean distance, we
form the following clusters
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 2:
Assign each object to the closest
representative object
Using Euclidean distance, we
form the following clusters
C1={O1, O2, O3, O4}
C2={O5, O6, O7, O8, O9, O10}
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 3:
Compute the absolute error (for the set of representative objects 𝑂3 and 𝑂8)
E = Σ_{j=1}^{k} Σ_{p∈Cj} |p − Oj|

E = |O1 − O3| + |O2 − O3| + |O3 − O3| + |O4 − O3|
+ |O5 − O8| + |O6 − O8| + |O7 − O8| + |O8 − O8| + |O9 − O8| + |O10 − O8|

where |O1 − O3| = |x1 − x3| + |y1 − y3| . . . . Manhattan Distance
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 3:
Compute the absolute error (for the set of representative objects 𝑂3 and 𝑂8)
E = Σ_{j=1}^{k} Σ_{p∈Cj} |p − Oj|

E = |O1 − O3| + |O2 − O3| + |O3 − O3| + |O4 − O3|
+ |O5 − O8| + |O6 − O8| + |O7 − O8| + |O8 − O8| + |O9 − O8| + |O10 − O8|

E = (3 + 4 + 0 + 2) + (3 + 1 + 1 + 0 + 2 + 2)

E = 18
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
Step 4:
Choose a random non-representative object, O9
Consider swapping O8 with O9
Compute the absolute error (for
the set of representative objects
O3 and O9)
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
E = |O1 − O3| + |O2 − O3| + |O3 − O3| + |O4 − O3|
+ |O5 − O9| + |O6 − O9| + |O7 − O9| + |O8 − O9| + |O9 − O9| + |O10 − O9|

E = (3 + 4 + 0 + 2) + (5 + 3 + 3 + 2 + 0 + 2)

E = 24
Step 5:
Compute the cost of the swap
S = (Absolute Error for O3, O9) − (Absolute Error for O3, O8)
S = 24 − 18 = 6
Since S > 0, swapping O8 with O9 would increase the total error, so the swap is rejected and O8 is kept as a medoid (a short check of these numbers follows below).
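As a quick check of these numbers, a few lines of Python (my own helper code) recompute the two absolute errors with Manhattan distance:

```python
# Recomputing the absolute error E for the two candidate medoid sets (Manhattan distance).
objects = [(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
           (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)]   # O1 .. O10

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def absolute_error(medoids):
    return sum(min(manhattan(o, m) for m in medoids) for o in objects)

O3, O8, O9 = objects[2], objects[7], objects[8]
e_before = absolute_error([O3, O8])   # 18
e_after  = absolute_error([O3, O9])   # 24
S = e_after - e_before                # cost of swapping O8 with O9
print(e_before, e_after, S)           # 18 24 6 -> S > 0, so the swap is rejected
```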
𝑘-Medoids Clustering
Data Objects
Aim: Create two Clusters
X Y Cluster
O1 2 6
O2 3 4
O3 3 8
O4 4 7
O5 6 2
O6 6 4
O7 7 3
O8 7 4
O9 8 5
O10 7 6
Step 6:
Since the swap was rejected, the medoids remain O3 and O8.
Repeat from Step 2 with another randomly selected
non-representative object: assign each object to the
closest representative object, and stop when no swap
reduces the total cost.
X Y Cluster
O1 2 6 C1
O2 3 4 C1
O3 3 8 C1
O4 4 7 C1
O5 6 2 C2
O6 6 4 C2
O7 7 3 C2
O8 7 4 C2
O9 8 5 C2
O10 7 6 C2
𝑘-Medoids Clustering
▪ Which method is more robust, 𝑘-Means or 𝑘-Medoids?
▪ The k-medoids method is more robust than k-means in the presence of noise and outliers,
because a medoid is less influenced by outliers or other extreme values than a mean.
▪ The processing of 𝑘-Medoids is more costly than the k-means method.
Hierarchical Clustering
▪ Groups data objects into a tree of clusters.
▪ Hierarchical clustering methods are of two types: Agglomerative and Divisive.
Hierarchical Clustering
▪ Agglomerative Hierarchical Clustering
▪ Starts by placing each object in its own cluster
▪ Merges these atomic clusters into larger and larger clusters
▪ It will halt when all of the objects are in a single cluster or until certain termination
conditions are satisfied.
▪ Bottom-Up Strategy.
▪ The user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
Application of Agglomerative NESting (AGNES) hierarchical clustering to objects {A, B, C, D, E, F, G}:
Step 0: each object is its own cluster: {A}, {B}, {C}, {D}, {E}, {F}, {G}
Step 1: {A,B} and {C,D} are formed
Step 2: {A,B,F} and {C,D,E} are formed
Step 3: {C,D,E,G} is formed
Step 4: {A,B,F,C,D,E,G}: all objects end up in a single cluster
Hierarchical Clustering
▪ Divisive Hierarchical Clustering Method
▪ Starting with all objects in one cluster.
▪ Subdivides the cluster into smaller and smaller pieces.
▪ It will halt when each object forms a cluster on its own or until it satisfies certain termination
conditions
▪ Top-Down Strategy
▪ The user can specify the desired number of clusters as a termination condition.
Hierarchical Clustering
Application of DIvisive ANAlysis (DIANA) hierarchical clustering to the same objects {A, B, C, D, E, F, G}:
Step 0: all objects start in a single cluster {A,B,F,C,D,E,G}
Steps 1 to 3: the cluster is repeatedly split (first into {A,B,F} and {C,D,E,G}, then further)
Step 4: each object ends up in its own cluster
(the splits mirror the AGNES merges above, read in reverse)
Hierarchical Clustering
▪ A tree structure called a dendrogram is used to represent the process of
hierarchical clustering.
Fig. Dendrogram representation for hierarchical clustering of data objects {a, b, c, d, e}
Hierarchical Clustering
▪ Four widely used measures for distance between clusters
▪ |p − p′| is the distance between two objects p and p′.
▪ mᵢ is the mean of cluster Cᵢ
▪ nᵢ is the number of objects in cluster Cᵢ.
Minimum distance:  d_min(Cᵢ, Cⱼ) = min_{p∈Cᵢ, p′∈Cⱼ} |p − p′|
Maximum distance:  d_max(Cᵢ, Cⱼ) = max_{p∈Cᵢ, p′∈Cⱼ} |p − p′|
Mean distance:     d_mean(Cᵢ, Cⱼ) = |mᵢ − mⱼ|
Average distance:  d_avg(Cᵢ, Cⱼ) = (1 / (nᵢ nⱼ)) Σ_{p∈Cᵢ} Σ_{p′∈Cⱼ} |p − p′|
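A small Python sketch of these four inter-cluster measures, for 2-D points and with my own helper names, might look like this (Euclidean distance is assumed for |p − p′|):

```python
# Sketch: the four inter-cluster distance measures for 2-D points.
from math import dist          # Euclidean distance between two points (Python 3.8+)
from itertools import product

def d_min(ci, cj):
    return min(dist(p, q) for p, q in product(ci, cj))

def d_max(ci, cj):
    return max(dist(p, q) for p, q in product(ci, cj))

def d_mean(ci, cj):
    mi = tuple(sum(x) / len(ci) for x in zip(*ci))   # mean of cluster Ci
    mj = tuple(sum(x) / len(cj) for x in zip(*cj))   # mean of cluster Cj
    return dist(mi, mj)

def d_avg(ci, cj):
    return sum(dist(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))

Ci = [(2, 5), (3, 5), (3, 4)]     # hypothetical clusters
Cj = [(6, 1), (7, 1), (6, 2)]
print(d_min(Ci, Cj), d_max(Ci, Cj), d_mean(Ci, Cj), d_avg(Ci, Cj))
```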
Hierarchical Clustering
▪ If an algorithm uses minimum distance measure, an algorithm is called a
nearest-neighbor clustering algorithm.
▪If the clustering process is terminated when the minimum distance between
nearest clusters exceeds an arbitrary threshold, it is called a single-linkage
algorithm.
▪ If an algorithm uses maximum distance measure, an algorithm is called a
farthest-neighbor clustering algorithm.
▪ If the clustering process is terminated when the maximum distance between
nearest clusters exceeds an arbitrary threshold, it is called a complete-
linkage algorithm.
▪ An agglomerative hierarchical clustering algorithm that uses the minimum
distance measure is also called a minimal spanning tree algorithm.
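To see single linkage (minimum distance) and complete linkage (maximum distance) in practice, a short sketch using SciPy (assuming it is installed; the three-cluster cut is an arbitrary choice for illustration) could be:

```python
# Sketch: agglomerative clustering with single vs. complete linkage using SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.array([[2, 5], [2, 1], [7, 1], [3, 5], [4, 4],
                   [6, 2], [1, 2], [6, 1], [3, 4], [2, 3]], dtype=float)

# 'single' uses the minimum (nearest-neighbor) distance between clusters,
# 'complete' uses the maximum (farthest-neighbor) distance.
for method in ("single", "complete"):
    Z = linkage(points, method=method)                  # the merge history (dendrogram data)
    labels = fcluster(Z, t=3, criterion="maxclust")     # cut the tree into 3 clusters
    print(method, labels)
```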