SIT1305 Machine Learning
Unit-III
Clustering
Course In-Charges:
Dr. A. Mary Posonia
Dr. B. Ankayarkanni
Clustering
• Clustering, or cluster analysis, is a machine learning technique
that groups an unlabelled dataset.
• It can be defined as "a way of grouping the data points into
different clusters consisting of similar data points; objects
with possible similarities remain in a group that has few or no
similarities with another group."
• It finds similar patterns in the unlabelled dataset, such as
shape, size, color, or behavior, and divides the data points
according to the presence or absence of those patterns.
Clustering
• It is an unsupervised learning method; no supervision is
provided to the algorithm, and it deals with an unlabelled
dataset.
• After applying this clustering technique, each cluster or group
is given a cluster ID, which the ML system can use to simplify
the processing of large and complex datasets.
• The clustering technique is commonly used for statistical data
analysis.
• Note: Clustering is somewhat similar to classification, but the
difference is the type of dataset used. In classification, we work
with a labelled dataset, whereas in clustering, we work with an
unlabelled dataset.
• The diagram on this slide (not reproduced here) illustrates the
working of the clustering algorithm: different fruits are divided
into several groups with similar properties.
Applications of Clustering
• In Identification of Cancer Cells: Clustering algorithms are
widely used for the identification of cancerous cells, dividing
cancerous and non-cancerous data into different groups.
• In Search Engines: Search engines also work on the clustering
technique. Search results appear based on the objects closest
to the search query; similar data objects are grouped together,
far from dissimilar objects. The accuracy of a query's results
depends on the quality of the clustering algorithm used.
Applications of Clustering
• Customer Segmentation: Clustering is used in market research to
segment customers based on their choices and preferences.
• In Biology: It is used in the biology stream to classify different
species of plants and animals using image recognition
techniques.
• In Land Use: The clustering technique is used to identify areas
of similar land use in a GIS database. This can be very useful
for determining the purpose for which a particular piece of land
is most suitable.
• The clustering technique can be used in a wide variety of tasks.
Some of the most common uses of this technique are:
– Market Segmentation
– Statistical data analysis
– Social network analysis
– Image segmentation
– Anomaly detection, etc.
• Apart from these general uses, Amazon uses clustering in its
recommendation system to provide recommendations based on
past product searches.
• Netflix also uses this technique to recommend movies and web
series to its users based on their watch history.
Unsupervised learning: no predefined classes
• A good clustering method will produce high quality clusters
– high intra-class similarity: cohesive within clusters
– low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
– the similarity measure used by the method
– its implementation, and
– its ability to discover some or all of the hidden patterns.
• Clustering is a form of learning by observation rather than
learning by examples.
Main objectives of clustering are:
• Intra-cluster distance is minimized.
• Inter-cluster distance is maximized.
Data Matrix and Dissimilarity Matrix
• A data matrix stores n objects, each described by p attributes,
as an n x p matrix (one row per object).
• A dissimilarity matrix stores the pairwise distances d(i, j)
between the n objects as an n x n matrix with a zero diagonal.
Similarity and Dissimilarity
• Distances are normally used to measure the similarity or
dissimilarity between two data objects.
• Some popular distances are based on the Minkowski distance
($L_p$ or $L_h$ norm):
$d(i,j) = \left(|x_{i1}-x_{j1}|^q + |x_{i2}-x_{j2}|^q + \dots + |x_{ip}-x_{jp}|^q\right)^{1/q}$
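As a hedged illustration (not from the slides; it assumes NumPy and SciPy are available), the dissimilarity matrix of a small data matrix can be computed with SciPy's pdist and squareform:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# data matrix: one row per object, one column per attribute
X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)

# dissimilarity matrix of pairwise Minkowski distances (q=2 gives Euclidean)
D = squareform(pdist(X, metric='minkowski', p=2))
print(np.round(D, 3))  # symmetric n x n matrix with a zero diagonal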
Special cases of Minkowski Distance
• q = 1: Manhattan (city-block, $L_1$) distance:
$d(i,j) = |x_{i1}-x_{j1}| + |x_{i2}-x_{j2}| + \dots + |x_{ip}-x_{jp}|$
• q = 2: Euclidean ($L_2$) distance:
$d(i,j) = \sqrt{(x_{i1}-x_{j1})^2 + (x_{i2}-x_{j2})^2 + \dots + (x_{ip}-x_{jp})^2}$
• q → ∞: supremum ($L_\infty$, Chebyshev) distance:
$d(i,j) = \max_f |x_{if}-x_{jf}|$
Problem 1
• Given two objects represented by the tuples (22, 1, 42, 10)
and (20, 0, 36, 8):
1. Compute the Euclidean distance between the two
objects.
2. Compute the Manhattan distance between the two
objects.
3. Compute the Minkowski distance between the two
objects using q=3.
1. Compute the Euclidean distance between the two objects.
For the tuples (22, 1, 42, 10) and (20, 0, 36, 8):
$d(i,j) = \sqrt{(22-20)^2 + (1-0)^2 + (42-36)^2 + (10-8)^2}$
$d(i,j) = \sqrt{2^2 + 1^2 + 6^2 + 2^2} = \sqrt{4 + 1 + 36 + 4} = \sqrt{45} \approx 6.708$
2. Compute the Manhattan distance between the two objects.
$d(i,j) = |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11$
3. Compute the Minkowski distance between the two objects
using q=3.
$d(i,j) = \left(|22-20|^3 + |1-0|^3 + |42-36|^3 + |10-8|^3\right)^{1/3}$
$d(i,j) = (2^3 + 1^3 + 6^3 + 2^3)^{1/3} = (8 + 1 + 216 + 8)^{1/3} = 233^{1/3} \approx 6.15$
Problem 2
• Given 5-dimensional numeric samples A=(1,0,2,5,3) and
B=(2,1,0,3,-1).
1. Compute the Euclidean distance between the two
objects.
2. Compute the Manhattan distance between the two
objects.
3. Compute the Supremum distance.
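A minimal NumPy sketch (not from the slides) that reproduces the Problem 1 answers and can be used to check Problem 2; the supremum distance is the q → ∞ limit of the Minkowski distance:

```python
import numpy as np

def minkowski(x, y, q):
    """Minkowski (L_q) distance between two equal-length vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    if np.isinf(q):
        return np.abs(x - y).max()  # supremum (L_inf) distance
    return (np.abs(x - y) ** q).sum() ** (1.0 / q)

i, j = (22, 1, 42, 10), (20, 0, 36, 8)
print(minkowski(i, j, 2))        # Euclidean -> 6.708
print(minkowski(i, j, 1))        # Manhattan -> 11.0
print(minkowski(i, j, 3))        # q = 3     -> 6.153

A, B = (1, 0, 2, 5, 3), (2, 1, 0, 3, -1)
print(minkowski(A, B, 2))        # Euclidean -> 5.099
print(minkowski(A, B, 1))        # Manhattan -> 10.0
print(minkowski(A, B, np.inf))   # supremum  -> 4.0
```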
Types of Clustering Methods
• Clustering methods are broadly divided into hard clustering
(each data point belongs to only one group) and soft clustering
(a data point can belong to more than one group).
• Various other approaches to clustering also exist. Below are the
main clustering methods used in machine learning:
– Partitioning Clustering
– Density-Based Clustering
– Distribution Model-Based Clustering
– Hierarchical Clustering
– Fuzzy Clustering
Partitioning Clustering
• It is a type of clustering that divides the data into non-
hierarchical groups. It is also known as the centroid-based
method. The most common example of partitioning clustering
is the K-Means Clustering algorithm.
• In this type, the dataset is divided into a set of k groups,
where k defines the number of pre-defined groups. The cluster
centers are created in such a way that each data point is closer
to its own cluster centroid than to the centroid of any other
cluster.
Density-Based Clustering
• The density-based clustering method connects the highly-
dense areas into clusters, and the arbitrarily shaped
distributions are formed as long as the dense region can be
connected.
• This algorithm does so by identifying different clusters in the
dataset and connecting the areas of high density into clusters.
The dense areas in data space are separated from each other by
sparser areas.
• These algorithms can have difficulty clustering the data points
if the dataset has varying densities and high dimensionality.
Hierarchical Clustering
• Hierarchical clustering can be used as an alternative to
partitioning clustering, as there is no requirement to pre-specify
the number of clusters to be created.
• In this technique, the dataset is divided into clusters to create
a tree-like structure, which is also called a dendrogram.
• The observations or any number of clusters can be selected by
cutting the tree at the correct level. The most common
example of this method is the Agglomerative Hierarchical
algorithm.
Distribution Model-Based Clustering
• In the distribution model-based clustering method, the data is
divided based on the probability that it belongs to a particular
distribution. The grouping is done by assuming some
distribution, most commonly the Gaussian distribution.
• The example of this type is the Expectation-Maximization
Clustering algorithm that uses Gaussian Mixture Models
(GMM).
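A hedged sketch of this idea (not from the slides; it assumes scikit-learn is available and uses placeholder synthetic data): expectation-maximization clustering with a Gaussian Mixture Model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# placeholder data: two synthetic Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

# EM fits a mixture of 2 Gaussians to the data
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X[:5]))        # hard cluster assignments
print(gm.predict_proba(X[:3]))  # soft (probabilistic) memberships
print(gm.means_)                # fitted Gaussian means
```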
Fuzzy Clustering
• Fuzzy clustering is a type of soft method in which a data
object may belong to more than one group or cluster.
• Each data point has a set of membership coefficients, which
depend on its degree of membership in each cluster.
• The Fuzzy C-means algorithm is an example of this type of
clustering; it is sometimes also known as the Fuzzy K-means
algorithm.
Major Clustering Methods
1. Partitioning Clustering Method
• Given a database of n objects or data tuples, a partitioning
method constructs k partitions of the data, where each
partition represents a cluster and k <= n.
• k is the number of groups after the classification of objects.
Some requirements must be satisfied by the partitioning
clustering method:
– Each group must contain at least one object
– Each object must belong to exactly one group.
• There is one technique called iterative relocation, which means
the object will be moved from one group to another to improve
the partitioning.
• The general criterion of a good partitioning is that objects in
the same cluster are "close" or related to each other,
whereas objects in different clusters are "far apart" or very
different.
• Example:
– K-means, K-Medoids, CLARANS
2. Hierarchical Clustering Methods
• In this hierarchical clustering method, the given set of data
objects is organized into a kind of hierarchical decomposition.
• How the hierarchical decomposition is formed determines the
purpose of the classification.
• The hierarchical clustering algorithm is of two types:
– i) Agglomerative hierarchical clustering, or AGNES
(agglomerative nesting), and
– ii) Divisive hierarchical clustering, or DIANA
(divisive analysis).
– These two algorithms are exact opposites of each other.
• Example: BIRCH, CHAMELEON
• Hierarchical clustering is an alternative approach to k-means
clustering for identifying groups in a data set.
• In contrast to k-means, hierarchical clustering will create a
hierarchy of clusters and therefore does not require us to pre-
specify the number of clusters.
• Hierarchical clustering has an added advantage over k-means
clustering - results can be easily visualized using an attractive
tree-based representation called a dendrogram.
Divisive Approach
• The divisive approach is a top-down approach.
• Start with one, all-inclusive cluster.
• Smaller clusters are created by splitting the group through
continued iteration.
• Split until each cluster contains a single point.
– A split or merge cannot be undone, which is why this
method is not so flexible.
Agglomerative Approach
• This approach is also known as the bottom-up approach.
• Start with each object forming a separate group.
• It keeps merging the objects or groups that are close to
one another.
• It keeps doing so until all of the groups are merged into one
or until the termination condition holds.
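A minimal SciPy sketch of the agglomerative approach (not from the slides; the data points and the single-linkage method are placeholder choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# placeholder data: four 2-D points
X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)

Z = linkage(X, method='single')  # bottom-up merges of the closest groups
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)
# dendrogram(Z)  # with matplotlib, draws the tree-based representation
```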
K-means Clustering
K-means Clustering Method
• K-Means clustering is an unsupervised iterative clustering
technique.
• It partitions the given data set into k predefined distinct clusters.
• It partitions the data set such that-
– Each data point belongs to a cluster with the nearest mean.
– Data points belonging to one cluster have high degree of
similarity.
– Data points belonging to different clusters have high degree of
dissimilarity.
K-means Clustering Method
• If k is given, the K-means algorithm can be executed in the
following steps:
– Partition the objects into k non-empty subsets
– Identifying the cluster centroids (mean point) of the
current partition.
– Assigning each point to a specific cluster
– Compute the distances from each point and allot points to
the cluster where the distance from the centroid is
minimum.
– After re-allotting the points, find the centroid of the new
cluster formed.
The step-by-step process is illustrated as a flowchart on the
original slides (not reproduced here).
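As a hedged illustration of those steps (a sketch, not the slides' code; the example data matches the worked example below, and the sketch assumes no cluster becomes empty during iteration), a minimal NumPy implementation of the K-means loop:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means: assign each point to the nearest centroid,
    then recompute centroids, until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # initial means
    for _ in range(n_iter):
        # distances of every point to every centroid, shape (n, k)
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)  # nearest-mean assignment
        new = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new, centroids):  # converged: no centroid movement
            break
        centroids = new
    return labels, centroids

X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)
print(kmeans(X, k=2))
```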
• The most commonly used partitioning-clustering strategy is
based on the square error criterion.
• The general objective is to obtain the partition that, for a fixed
number of clusters, minimizes the total square error.
• Suppose that the given dataset of N samples in an n-dimensional
space has been partitioned into K clusters $\{C_1, C_2, \dots, C_K\}$.
• Each $C_k$ has $n_k$ samples and each sample belongs to exactly one
cluster, so that $\sum_{k=1}^{K} n_k = N$.
• The mean vector $M_k$ of cluster $C_k$ is defined as the centroid of
the cluster:
$M_k = \frac{1}{n_k} \sum_{i=1}^{n_k} X_{ik}$
where $X_{ik}$ is the $i$-th sample belonging to cluster $C_k$.
• The square error for cluster $C_k$ is the sum of the squared
Euclidean distances between each sample in $C_k$ and its
centroid. This error is also called the within-cluster variation:
$e_k^2 = \sum_{i=1}^{n_k} (X_{ik} - M_k)^2$
• The square error for the entire clustering space containing K
clusters is the sum of the within-cluster variations:
$E_K^2 = \sum_{k=1}^{K} e_k^2$
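These two formulas translate directly into code; a minimal sketch (not from the slides), applied to the initial clusters of the example that follows:

```python
import numpy as np

def square_error(clusters):
    """Total square error E^2: the sum of within-cluster variations e_k^2."""
    total = 0.0
    for pts in clusters:                 # one array of samples per cluster
        pts = np.asarray(pts, dtype=float)
        M = pts.mean(axis=0)             # centroid M_k
        total += ((pts - M) ** 2).sum()  # e_k^2: squared distances to M_k
    return total

# initial clusters C1={X1,X3}, C2={X2,X4} from the example below
print(square_error([[[1, 0], [2, 1]], [[0, 1], [3, 3]]]))  # -> 7.5
```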
Example
Consider the data points X1 = {1,0}, X2 = {0,1}, X3 = {2,1}, X4 = {3,3}
Clusters: C1 = {X1, X3}, C2 = {X2, X4}
a. Apply one iteration of the K-means partitioning clustering
algorithm.
b. What is the change in total square error?
c. Apply a second iteration of the K-means partitioning clustering
algorithm.
• Step 1: The centroids of clusters C1 = {X1, X3} and C2 = {X2, X4}
are:
$M_1 = \left(\frac{1+2}{2}, \frac{0+1}{2}\right) = (1.5, 0.5)$
$M_2 = \left(\frac{0+3}{2}, \frac{1+3}{2}\right) = (1.5, 2)$
• Step 2: Within-cluster variation after the initial random
distribution of samples:
$e_1^2 = (1-1.5)^2 + (0-0.5)^2 + (2-1.5)^2 + (1-0.5)^2 = 0.25 + 0.25 + 0.25 + 0.25 = 1$
$e_2^2 = (0-1.5)^2 + (1-2)^2 + (3-1.5)^2 + (3-2)^2 = 2.25 + 1 + 2.25 + 1 = 6.5$
• Step 3: Total square error:
$E^2 = e_1^2 + e_2^2 = 1 + 6.5 = 7.5$
• Reassign all samples depending on their minimum distance from
the centroids $M_1 = (1.5, 0.5)$ and $M_2 = (1.5, 2)$; the new
redistribution of samples inside the clusters will be:
1. X1 = {1,0}:
$d(M_1, X_1) = \sqrt{(1-1.5)^2 + (0-0.5)^2} = 0.707$
$d(M_2, X_1) = \sqrt{(1-1.5)^2 + (0-2)^2} = 2.062$
2. X2 = {0,1}:
$d(M_1, X_2) = \sqrt{(0-1.5)^2 + (1-0.5)^2} = 1.581$
$d(M_2, X_2) = \sqrt{(0-1.5)^2 + (1-2)^2} = 1.803$
3. X3 = {2,1}:
$d(M_1, X_3) = \sqrt{(2-1.5)^2 + (1-0.5)^2} = 0.707$
$d(M_2, X_3) = \sqrt{(2-1.5)^2 + (1-2)^2} = 1.118$
4. X4 = {3,3}:
$d(M_1, X_4) = \sqrt{(3-1.5)^2 + (3-0.5)^2} = 2.915$
$d(M_2, X_4) = \sqrt{(3-1.5)^2 + (3-2)^2} = 1.803$
• New clusters: C1 = {X1, X2, X3}, C2 = {X4}, with centroids:
$M_1 = \left(\frac{1+0+2}{3}, \frac{0+1+1}{3}\right) = (1, 0.66)$
$M_2 = (3, 3)$
$e_1^2 = [(1-1)^2 + (0-0.66)^2] + [(0-1)^2 + (1-0.66)^2] + [(2-1)^2 + (1-0.66)^2]$
$\quad = 0.4356 + 1.1156 + 1.1156 = 2.668$
$e_2^2 = (3-3)^2 + (3-3)^2 = 0$
• Total square error:
$E^2 = e_1^2 + e_2^2 = 2.668 + 0 = 2.668$
• After the first iteration, the total square error is significantly
reduced, from 7.5 to 2.668.
• New centroids: $M_1 = (1, 0.66)$ and $M_2 = (3, 3)$. Distances for
the second iteration:
1. X1 = {1,0}:
$d(M_1, X_1) = \sqrt{(1-1)^2 + (0-0.66)^2} = 0.66$
$d(M_2, X_1) = \sqrt{(1-3)^2 + (0-3)^2} = \sqrt{4 + 9} \approx 3.61$
2. X2 = {0,1}:
$d(M_1, X_2) = \sqrt{(0-1)^2 + (1-0.66)^2} = \sqrt{1 + 0.1156} = 1.056$
$d(M_2, X_2) = \sqrt{(0-3)^2 + (1-3)^2} = \sqrt{9 + 4} \approx 3.61$
3. X3 = {2,1}:
$d(M_1, X_3) = \sqrt{(2-1)^2 + (1-0.66)^2} = \sqrt{1 + 0.1156} = 1.056$
$d(M_2, X_3) = \sqrt{(2-3)^2 + (1-3)^2} = \sqrt{1 + 4} \approx 2.24$
4. X4 = {3,3}:
$d(M_1, X_4) = \sqrt{(3-1)^2 + (3-0.66)^2} = \sqrt{2^2 + 2.34^2} = 3.078$
$d(M_2, X_4) = \sqrt{(3-3)^2 + (3-3)^2} = 0$
Clusters: C1 = {X1, X2, X3}, C2 = {X4}.
There is no reassignment, and therefore the algorithm halts.
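As a quick cross-check of the worked example (a sketch assuming scikit-learn is available; the initial centroids are those of the example's starting partition):

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 0], [0, 1], [2, 1], [3, 3]], dtype=float)
init = np.array([[1.5, 0.5], [1.5, 2.0]])  # M1, M2 from Step 1

km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
print(km.labels_)           # [0 0 0 1] -> C1={X1,X2,X3}, C2={X4}
print(km.cluster_centers_)  # [[1. 0.667], [3. 3.]]
print(km.inertia_)          # total square error ~ 2.667
```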
Advantages:
– With a large number of variables, K-means may be
computationally faster than hierarchical clustering (if k is
small).
– K-means may produce tighter clusters than hierarchical
clustering, especially if the clusters are globular.
Disadvantages:
– Difficult to compare the quality of the clusters produced.
– Applicable only when the mean is defined.
– Need to specify k, the number of clusters, in advance.
– Unable to handle noisy data and outliers.