Clustering
• Clustering, or cluster analysis, is a machine learning technique that groups an unlabelled dataset.
• It can be defined as "a way of grouping the data points into different clusters consisting of similar data points. The objects with possible similarities remain in a group that has few or no similarities with another group."
• It finds similar patterns in the unlabelled dataset, such as shape, size, colour, and behaviour, and divides the data according to the presence or absence of those patterns.
• It is an unsupervised learning method: no supervision is provided to the algorithm, and it deals with an unlabelled dataset.
• After the clustering technique is applied, each cluster or group is given a cluster ID. An ML system can use this ID to simplify the processing of large and complex datasets.
• The clustering technique is commonly used for statistical data analysis.
• Note: Clustering is somewhat similar to classification, but the difference is the type of dataset being used. In classification, we work with a labelled dataset, whereas in clustering, we work with an unlabelled dataset.
4. ⢠The below diagram explains the working of the clustering
algorithm. We can see the different fruits are divided into
several groups with similar properties.
April 30, 2025 SIT1305 Machine Learning 4
Applications of Clustering
• Identification of cancer cells: Clustering algorithms are widely used to identify cancerous cells, dividing cancerous and non-cancerous data points into different groups.
• Search engines: Search engines also work on the clustering technique. Search results are returned based on the objects closest to the search query, which is done by grouping similar data objects into one group that is far from the dissimilar objects. The accuracy of a query's results depends on the quality of the clustering algorithm used.
• Customer segmentation: Clustering is used in market research to segment customers based on their choices and preferences.
• Biology: It is used to classify different species of plants and animals with the help of image recognition techniques.
• Land use: Clustering is used to identify areas of similar land use in a GIS database. This is very useful for determining the purpose for which a particular parcel of land is best suited.
7. ⢠The clustering technique can be widely used in various tasks.
Some most common uses of this technique are:
ā Market Segmentation
ā Statistical data analysis
ā Social network analysis
ā Image segmentation
ā Anomaly detection, etc.
⢠Apart from these general usages, it is used by the Amazon in
its recommendation system to provide the recommendations
as per the past search of products.
⢠Netflix also uses this technique to recommend the movies
and web-series to its users as per the watch history.
April 30, 2025 SIT1305 Machine Learning 7
Unsupervised learning: no predefined classes
• A good clustering method will produce high-quality clusters with
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
– the similarity measure used by the method,
– its implementation, and
– its ability to discover some or all of the hidden patterns.
• Clustering is a form of learning by observation rather than learning by examples.
The main objectives of clustering are:
• Intra-cluster distance is minimized.
• Inter-cluster distance is maximized.
Data Matrix and Dissimilarity Matrix
• A data matrix stores the n objects as rows and their p attributes as columns; a dissimilarity matrix is an n × n table whose entry (i, j) holds the distance d(i, j) between objects i and j.
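The two representations can be built in a few lines of Python (a minimal sketch; the sample values are illustrative, and SciPy's pdist/squareform produce the dissimilarity matrix):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n objects (rows) x p attributes (columns)
data = np.array([[22, 1, 42, 10],
                 [20, 0, 36,  8],
                 [ 1, 0,  2,  5]], dtype=float)

# Dissimilarity matrix: symmetric, zero diagonal, entry (i, j) = d(i, j)
diss = squareform(pdist(data, metric='euclidean'))
print(np.round(diss, 2))
```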
Similarity and Dissimilarity
• Distances are normally used to measure the similarity or dissimilarity between two data objects.
• Some popular distances are based on the Minkowski distance (Lp or Lh norm):
d(i, j) = (|xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q)^(1/q), where q ≥ 1
Special cases of Minkowski Distance
• q = 1: Manhattan (city block, L1) distance
• q = 2: Euclidean (L2) distance
• q → ∞: supremum (Chebyshev, Lmax) distance, the maximum difference over all attributes
Problem 1
• Given two objects represented by the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
1. Compute the Euclidean distance between the two objects.
2. Compute the Manhattan distance between the two objects.
3. Compute the Minkowski distance between the two objects using q = 3.
1. Euclidean distance:
d(i, j) = √(|22 − 20|² + |1 − 0|² + |42 − 36|² + |10 − 8|²) = √(4 + 1 + 36 + 4) = √45 ≈ 6.71
2. Manhattan distance:
d(i, j) = |22 − 20| + |1 − 0| + |42 − 36| + |10 − 8| = 2 + 1 + 6 + 2 = 11
3. Minkowski distance with q = 3:
d(i, j) = (|22 − 20|³ + |1 − 0|³ + |42 − 36|³ + |10 − 8|³)^(1/3) = (8 + 1 + 216 + 8)^(1/3) = 233^(1/3) ≈ 6.15
Problem 2
• Given the 5-dimensional numeric samples A = (1, 0, 2, 5, 3) and B = (2, 1, 0, 3, −1):
1. Compute the Euclidean distance between the two objects.
2. Compute the Manhattan distance between the two objects.
3. Compute the supremum distance.
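These distances are easy to compute programmatically. Below is a minimal Python sketch (the function names are our own, assuming NumPy is available) that reproduces the Problem 1 answers and can be applied to Problem 2:

```python
import numpy as np

def minkowski(x, y, q):
    # Minkowski (Lq) distance between two numeric vectors
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

def supremum(x, y):
    # limiting case q -> infinity: largest absolute attribute difference
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.abs(x - y).max()

# Problem 1: q = 2 (Euclidean), q = 1 (Manhattan), q = 3
i, j = (22, 1, 42, 10), (20, 0, 36, 8)
print(minkowski(i, j, 2), minkowski(i, j, 1), minkowski(i, j, 3))
# -> 6.708..., 11.0, 6.153...

# Problem 2
A, B = (1, 0, 2, 5, 3), (2, 1, 0, 3, -1)
print(minkowski(A, B, 2), minkowski(A, B, 1), supremum(A, B))
# -> 5.099..., 10.0, 4.0
```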
Types of Clustering Methods
• Clustering methods are broadly divided into hard clustering (each data point belongs to only one group) and soft clustering (a data point can also belong to other groups).
• Various other approaches to clustering also exist. The main clustering methods used in machine learning are:
– Partitioning Clustering
– Density-Based Clustering
– Distribution Model-Based Clustering
– Hierarchical Clustering
– Fuzzy Clustering
Partitioning Clustering
• This is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method. The most common example of partitioning clustering is the K-means clustering algorithm.
• In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centres are created in such a way that each data point is closer to the centroid of its own cluster than to the centroid of any other cluster.
Density-Based Clustering
• The density-based clustering method connects highly dense areas into clusters, and arbitrarily shaped distributions are formed as long as the dense regions can be connected.
• The algorithm does this by identifying different clusters in the dataset and connecting the areas of high density into clusters. The dense areas in the data space are separated from each other by sparser areas.
• These algorithms can have difficulty clustering the data points if the dataset has varying densities and high dimensionality.
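As a concrete illustration, here is a minimal sketch using scikit-learn's DBSCAN, one common density-based algorithm (the data and the eps/min_samples settings are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 0.9], [0.9, 1.1],
              [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],
              [9.0, 1.0]])

# Points within eps of each other are connected into one dense region;
# the isolated point cannot be connected and is labelled noise (-1)
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)  # e.g. [ 0  0  0  1  1  1 -1]
```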
Hierarchical Clustering
• Hierarchical clustering can be used as an alternative to partitioning clustering, as there is no requirement to pre-specify the number of clusters to be created.
• In this technique, the dataset is divided into clusters to create a tree-like structure, also called a dendrogram.
• Any desired number of clusters can then be selected by cutting the tree at the appropriate level. The most common example of this method is the agglomerative hierarchical algorithm.
Distribution Model-Based Clustering
• In the distribution model-based clustering method, the data is divided based on the probability that a data point belongs to a particular distribution. The grouping is done by assuming some distributions, most commonly the Gaussian distribution.
• An example of this type is the Expectation-Maximization clustering algorithm, which uses Gaussian Mixture Models (GMM).
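A minimal sketch with scikit-learn's GaussianMixture (the synthetic two-component data is an illustrative assumption) shows both the hard assignments and the underlying membership probabilities:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data drawn around two assumed Gaussian components
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(6, 1, size=(50, 2))])

# EM fits the mixture; points are grouped by component probability
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X[:3]))        # hard cluster assignments
print(gmm.predict_proba(X[:3]))  # probability of belonging to each component
```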
Fuzzy Clustering
• Fuzzy clustering is a type of soft method in which a data object may belong to more than one group or cluster.
• Each data point has a set of membership coefficients that indicate its degree of membership in each cluster.
• The fuzzy c-means algorithm is an example of this type of clustering; it is sometimes also known as the fuzzy k-means algorithm.
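A brief sketch of how such membership coefficients can be computed, using the standard fuzzy c-means membership formula with fuzzifier m (the function name and sample data are our own):

```python
import numpy as np

def fcm_memberships(X, centers, m=2.0):
    # Distance of every point to every cluster centre: shape (n, k)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.fmax(d, 1e-12)          # guard against division by zero
    inv = d ** (-2.0 / (m - 1.0))  # standard FCM membership weights
    return inv / inv.sum(axis=1, keepdims=True)  # rows sum to 1

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
centers = np.array([[0.0, 0.0], [3.0, 4.0]])
print(fcm_memberships(X, centers))  # each row: degrees of membership
```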
Major Clustering Methods
1. Partitioning Clustering Method
• Given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k ≤ n.
• k is the number of groups after the classification of the objects. Two requirements must be satisfied by the partitioning clustering method:
– Each group must contain at least one object.
– Each object must belong to exactly one group.
• There is a technique called iterative relocation, in which objects are moved from one group to another to improve the partitioning.
26. ⢠The general criterion of a good partitioning is that object in
the same clusters are ācloseā or related to each other ,
whereas objects of different clusters are āfar apartā or very
different.
⢠Example:
ā K-means, K--Mediods ,CLARANS
April 30, 2025 SIT1305 Machine Learning 26
2. Hierarchical Clustering Methods
• In the hierarchical clustering method, the given set of data objects is organised into a hierarchical decomposition.
• How the hierarchical decomposition is formed determines the purpose of the classification.
• Hierarchical clustering algorithms are of two types:
– i) the agglomerative hierarchical clustering algorithm, or AGNES (agglomerative nesting), and
– ii) the divisive hierarchical clustering algorithm, or DIANA (divisive analysis).
– The two algorithms are exact opposites of each other.
• Examples: BIRCH, CHAMELEON
28. ⢠Hierarchical clustering is an alternative approach to k-means
clustering for identifying groups in a data set.
⢠In contrast to k-means, hierarchical clustering will create a
hierarchy of clusters and therefore does not require us to pre-
specify the number of clusters.
⢠Hierarchical clustering has an added advantage over k-means
clustering - results can be easily visualized using an attractive
tree-based representation called a dendrogram.
April 30, 2025 SIT1305 Machine Learning 28
29. ⢠Divisive approach is a top-down approach.
⢠Start with one,all-inclusive cluster.
⢠Smaller clusters are created by splitting the group by using the
continuous iteration.
⢠Split until each cluster contains a point.
ā Cannot undo after the group is split or merged, and that is
why this method is not so flexible.
April 30, 2025 SIT1305 Machine Learning 29
Divisive Approach
Agglomerative Approach
• This approach is also known as the bottom-up approach.
• Start with each object forming a separate group.
• Keep merging the objects or groups that are closest to one another.
• Keep doing so until all of the groups are merged into one, or until the termination condition holds.
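A minimal sketch of the bottom-up approach with SciPy (linkage records the merge history, i.e. the dendrogram, and fcluster cuts it into a chosen number of clusters; the four sample points are reused from the K-means example later in this section):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Each point starts as its own cluster; the closest pair is merged repeatedly
X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)

Z = linkage(X, method='single')  # merge history (the dendrogram)
print(fcluster(Z, t=2, criterion='maxclust'))  # cut into 2 clusters, e.g. [1 1 1 2]
```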
K-means Clustering Method
• K-means clustering is an unsupervised iterative clustering technique.
• It partitions the given dataset into k predefined distinct clusters.
• It partitions the dataset such that
– each data point belongs to the cluster with the nearest mean;
– data points belonging to one cluster have a high degree of similarity;
– data points belonging to different clusters have a high degree of dissimilarity.
• If k is given, the K-means algorithm can be executed in the following steps (see the sketch after this list):
– Partition the objects into k non-empty subsets.
– Identify the cluster centroids (mean points) of the current partition.
– Assign each point to a specific cluster.
– Compute the distance from each point to each centroid, and allot the point to the cluster whose centroid is nearest.
– After re-allotting the points, find the centroids of the newly formed clusters.
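A compact NumPy sketch of this loop (a from-scratch illustration rather than a production implementation; it assumes no cluster becomes empty during the iterations):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the nearest centroid
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # converged: no more movement
            break
        centroids = new_centroids
    return labels, centroids

labels, centroids = kmeans(np.array([[1, 0], [0, 1], [2, 1], [3, 3]], float), k=2)
```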
The step-by-step process: [diagram not reproduced]
36. ⢠The most commonly used partitioning-clustering strategy is
based on the square error criterion.
⢠The general objective is to obtain the partition that ,for a fixed
number of clusters, minimizes the total square error.
⢠Suppose that the given dataset of N samples in an n-
dimensional space has been partitioned into k-clusters {c1 , c2 ,...
ck }.
⢠Each ck has nk samples and each sample has exactly one cluster,
so that
⢠The mean vector MK of cluster Ck is defined as the centroid of
the cluster
where Xik is the ith
sample belonging to cluster Ck
April 30, 2025 SIT1305 Machine Learning 36
k
k
where
N
nk ...,
2
,
1
ļ½
ļ½
ļ„
ļ„
ļ½
ļ·
ļ·
ļø
ļ¶
ļ§
ļ§
ļØ
ļ¦
ļ½
k
n
i
ik
k
K X
n
M
1
1
37. ⢠The square error for cluster Ck is the sum of the squared
Euclidean distance between each sample in Ck and its
centroid. This error is also called the within-cluster variation.
⢠The square-error for the entire clustering space containing k
clusters is the sum of the within-cluster variations:
April 30, 2025 SIT1305 Machine Learning 37
2
1
2
)
(
ļ„
ļ½
ļ
ļ½
k
n
i
k
ik
K M
X
e
ļ„
ļ½
ļ½
k
K
k
K e
E
1
2
2
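These two formulas translate directly into a few lines of Python (a sketch; the helper name is our own):

```python
import numpy as np

def total_square_error(clusters):
    # clusters: list of (n_k, d) arrays, one array per cluster
    errors = []
    for pts in clusters:
        centroid = pts.mean(axis=0)                          # M_k
        errors.append(float(((pts - centroid) ** 2).sum()))  # e_k^2
    return errors, sum(errors)                               # e_k^2 values and E^2

# Initial partition from the example below: C1={X1,X3}, C2={X2,X4}
C1 = np.array([[1, 0], [2, 1]], dtype=float)
C2 = np.array([[0, 1], [3, 3]], dtype=float)
print(total_square_error([C1, C2]))  # ([1.0, 6.5], 7.5)
```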
Example
Consider the data points X1 = {1, 0}, X2 = {0, 1}, X3 = {2, 1}, X4 = {3, 3}
and the clusters C1 = {X1, X3}, C2 = {X2, X4}.
a. Apply one iteration of the K-means partitioning clustering algorithm.
b. What is the change in total square error?
c. Apply a second iteration of the K-means partitioning clustering algorithm.
39. ⢠Step 1: The centroid for the clusters C1 and C2 are:
April 30, 2025 SIT1305 Machine Learning 39
ļ„
ļ½
ļ·
ļ·
ļø
ļ¶
ļ§
ļ§
ļØ
ļ¦
ļ½
k
n
i
ik
k
K X
n
M
1
1
ļ» ļ½
ļ» ļ½
2
,
5
.
1
2
3
1
,
2
3
0
5
.
0
,
5
.
1
2
1
0
,
2
2
1
2
1
ļ½
ļ¾
ļ½
ļ¼
ļ®
ļ
ļ¬ ļ«
ļ«
ļ½
ļ½
ļ¾
ļ½
ļ¼
ļ®
ļ
ļ¬ ļ«
ļ«
ļ½
M
M
X1={1,0} X2={0,1} X3={2,1} X4={3,3}
Clusters: C1={X1 ,X3} C2={X2 ,X4}
40. ⢠Step 2: Within cluster variation after initial random
distribution of samples:
April 30, 2025 SIT1305 Machine Learning 40
2
1
2
)
(
ļ„
ļ½
ļ
ļ½
k
n
i
k
ik
K M
X
e
1
]
25
.
0
25
.
0
25
.
0
25
.
0
[
]
)
5
.
0
1
(
)
5
.
1
2
(
)
5
.
0
0
(
)
5
.
1
1
[( 2
2
2
2
2
1
ļ½
ļ«
ļ«
ļ«
ļ½
ļ
ļ«
ļ
ļ«
ļ
ļ«
ļ
ļ½
e
5
.
6
]
1
25
.
2
1
25
.
2
[
]
)
2
3
(
)
5
.
1
3
(
)
2
1
(
)
5
.
1
0
[( 2
2
2
2
2
2
ļ½
ļ«
ļ«
ļ«
ļ½
ļ
ļ«
ļ
ļ«
ļ
ļ«
ļ
ļ½
e
41. ⢠Step 3: Total square error
⢠Reassign all samples depending on minimum distance from
centroid M1 and M2 ,the new redistribution of samples inside
clusters will be:
1. X1={1,0}
April 30, 2025 SIT1305 Machine Learning 41
ļ„
ļ½
ļ½
k
K
k
K e
E
1
2
2
5
.
7
5
.
6
1
2
2
2
1
2
ļ½
ļ«
ļ½
ļ«
ļ½ e
e
E
ļ» ļ½
ļ» ļ½
2
,
5
.
1
5
.
0
,
5
.
1
2
1
ļ½
ļ½
M
M
062
.
2
)
2
0
(
)
5
.
1
1
(
)
,
(
707
.
0
)
5
.
0
0
(
)
5
.
1
1
(
)
,
(
2
2
1
2
2
2
1
1
ļ½
ļ
ļ«
ļ
ļ½
ļ½
ļ
ļ«
ļ
ļ½
X
M
d
X
M
d
2. X2 = {0, 1}:
d(M1, X2) = √((0 − 1.5)² + (1 − 0.5)²) = 1.581
d(M2, X2) = √((0 − 1.5)² + (1 − 2)²) = 1.803 → X2 is assigned to C1
3. X3 = {2, 1}:
d(M1, X3) = √((2 − 1.5)² + (1 − 0.5)²) = 0.707
d(M2, X3) = √((2 − 1.5)² + (1 − 2)²) = 1.118 → X3 is assigned to C1
4. X4 = {3, 3}:
d(M1, X4) = √((3 − 1.5)² + (3 − 0.5)²) = 2.915
d(M2, X4) = √((3 − 1.5)² + (3 − 2)²) = 1.803 → X4 is assigned to C2
The new clusters are therefore C1 = {X1, X2, X3} and C2 = {X4}.
44. ⢠Total square error
⢠After first iteration, the total square error is significantly
reduced from the value 7.5 to 2.668.
April 30, 2025 SIT1305 Machine Learning 44
668
.
2
0
668
.
2
2
2
2
1
2
ļ½
ļ«
ļ½
ļ«
ļ½ e
e
E
45. ⢠New centroids:
1. X1={1,0}
2. X2={0,1}
April 30, 2025 SIT1305 Machine Learning 45
ļ» ļ½
ļ» ļ½
3
,
3
66
.
0
,
1
2
1
ļ½
ļ½
M
M
46
.
3
9
4
)
3
0
(
)
3
1
(
)
,
(
66
.
0
)
66
.
0
0
(
)
1
1
(
)
,
(
2
2
1
2
2
2
1
1
ļ½
ļ«
ļ½
ļ
ļ«
ļ
ļ½
ļ½
ļ
ļ«
ļ
ļ½
X
M
d
X
M
d
46
.
3
4
9
)
3
1
(
)
3
0
(
)
,
(
056
.
1
1156
.
0
1
)
66
.
0
1
(
)
1
0
(
)
,
(
2
2
2
2
2
2
2
1
ļ½
ļ«
ļ½
ļ
ļ«
ļ
ļ½
ļ½
ļ«
ļ½
ļ
ļ«
ļ
ļ½
X
M
d
X
M
d
3. X3 = {2, 1}:
d(M1, X3) = √((2 − 1)² + (1 − 0.66)²) = √(1 + 0.1156) ≈ 1.056
d(M2, X3) = √((2 − 3)² + (1 − 3)²) = √(1 + 4) ≈ 2.24 → X3 stays in C1
4. X4 = {3, 3}:
d(M1, X4) = √((3 − 1)² + (3 − 0.66)²) = √(4 + 2.34²) ≈ 3.078
d(M2, X4) = √((3 − 3)² + (3 − 3)²) = 0 → X4 stays in C2
Clusters: C1 = {X1, X2, X3}, C2 = {X4}
There is no reassignment, and therefore the algorithm halts.
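The whole worked example can be verified with a short script (a sketch that starts from the example's initial partition and repeats the assign/update steps):

```python
import numpy as np

X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)
labels = np.array([0, 1, 0, 1])  # initial partition: C1={X1,X3}, C2={X2,X4}

for it in range(10):
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])
    E = sum(((X[labels == j] - centroids[j]) ** 2).sum() for j in range(2))
    print(f"iteration {it}: E^2 = {E:.3f}")  # 7.500, then 2.667
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = d.argmin(axis=1)
    if np.array_equal(new_labels, labels):
        break  # no reassignment: the algorithm halts
    labels = new_labels
```

(The script prints 2.667 rather than the slide's 2.668 only because the slide rounds the centroid coordinate 2/3 to 0.66.)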
Advantages:
– With a large number of variables, K-means may be computationally faster than hierarchical clustering (if k is small).
– K-means may produce tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages:
– It is difficult to compare the quality of the clusters produced.
– It is applicable only when the mean is defined.
– The number of clusters, k, must be specified in advance.
– It is unable to handle noisy data and outliers.