Clustering & Classification
Institute of Engineering & Management (IEM), Kolkata (INDIA)
BCA (A) – 3rd year
 Koyel Agarwal
 Madhurima Dey
 Mainak Sen Choudhary
 Md. Jamshed Khan
 Md. Masud Parvez
Contents:
• What is Clustering & Classification?
• Data Mining & Its Applications
• Types of Data Mining Functions
• How Does Classification Work?
• How Does Clustering Work?
• Types of Clustering
• K-Means Algorithm
• Conclusion
What is Clustering & Classification?
• Clustering and Classification are concepts from Data Mining and Machine Learning.
• They are techniques used to extract knowledge from data.
• Classification and Clustering are also known as Supervised Learning and Unsupervised Learning, respectively.
Data Mining:
Data Mining is defined as extracting information from
huge sets of data. In other words, we can say that data
mining is the procedure of mining knowledge from data.
Data Mining Applications:
Data mining is highly useful in the following domains:
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
There are two categories of functions involved in Data Mining:
 Descriptive
 Classification and Prediction
Descriptive Functions:
The descriptive function deals with the general properties of data in
the database. List of descriptive functions −
 Class/Concept Description
 Mining of frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Classification:
Classification is a supervised learning technique used to assign a pre-defined label to an instance on the basis of its features. A classification algorithm therefore requires training data: a classification model is created from the training data, and that model is then used to classify new instances.
Clustering:
Clustering is an unsupervised learning technique used to group similar instances on the basis of their features. Clustering does not require training data, and it does not assign a pre-defined label to each group.
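The contrast between the two definitions can be sketched with a toy example. The data and the 1-nearest-neighbour rule below are illustrative assumptions, not part of the slides: classification needs labelled training pairs, while clustering would receive the bare values only.

```python
# Supervised: training data carries pre-defined labels.
labeled_training_data = [
    (1.0, "small"), (1.2, "small"), (8.0, "large"), (8.5, "large"),
]

# Unsupervised: the same kind of values, but with no labels attached.
unlabeled_data = [1.1, 7.9, 8.2, 0.9]

def classify(x):
    """1-nearest-neighbour classifier: copy the label of the closest training point."""
    return min(labeled_training_data, key=lambda pair: abs(pair[0] - x))[1]

print(classify(1.1))   # predicted from the labelled training data -> "small"
print(classify(7.9))   # -> "large"
```

A clustering algorithm, by contrast, would have to group `unlabeled_data` purely by similarity, without ever seeing the words "small" or "large".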
Examples of Classification
Following are examples of cases where the data analysis task is Classification −
 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze a customer with a given profile to predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is
constructed to predict the categorical labels. These labels are
risky or safe for loan application data and yes or no for
marketing data.
How Does Classification Work?
The Data Classification process includes two steps −
 Building the Classifier or Model
 Using Classifier for Classification
Building the Classifier or Model
 This step is the learning step or the learning phase.
 In this step the classification algorithms build the
classifier.
 The classifier is built from the training set made up of
database tuples and their associated class labels.
 Each tuple in the training set belongs to a pre-defined category or class. These tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be
applied to the new data tuples if the accuracy is considered acceptable.
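The two-step process above can be sketched end to end. Everything here is a hypothetical illustration: the "model" is a trivial nearest-value rule and the loan amounts and labels are made up, but the shape of the process — build from a training set, then estimate accuracy on test data — matches the slides.

```python
def build_classifier(training_set):
    """Step 1 (learning phase): build a classifier from labelled training tuples.
    This toy 'model' just memorises the training set and copies the label of
    the nearest training tuple."""
    def classifier(x):
        nearest = min(training_set, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return classifier

# Hypothetical (amount, class label) tuples.
training = [(200, "risky"), (250, "risky"), (900, "safe"), (1000, "safe")]
test     = [(220, "risky"), (950, "safe"), (300, "risky")]

model = build_classifier(training)

# Step 2: use the classifier on held-out test data to estimate accuracy.
correct = sum(model(x) == label for x, label in test)
accuracy = correct / len(test)
print(accuracy)
```

If the estimated accuracy is acceptable, the same `model` function would then be applied to genuinely new tuples.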
How Does Clustering Work?
 Cluster analysis finds clusters of data objects that are similar in some sense to one another.
 The members of a cluster are more like each other than they are like
members of other clusters.
 The goal of clustering analysis is to find high-quality clusters such
that the inter-cluster similarity is low and the intra-cluster similarity
is high.
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
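The "high intra-cluster similarity, low inter-cluster similarity" goal can be checked numerically. The 2-D points below are hypothetical; the sketch compares the largest distance within a cluster against the smallest distance between clusters.

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Two hypothetical, well-separated groups of points.
cluster_a = [(0, 0), (1, 0), (0, 1)]
cluster_b = [(10, 10), (11, 10), (10, 11)]

# Largest distance inside cluster_a vs. smallest distance across clusters.
intra = max(dist(p, q) for p in cluster_a for q in cluster_a)
inter = min(dist(p, q) for p in cluster_a for q in cluster_b)
print(intra < inter)  # True: members are closer to each other than to the other cluster
```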
Major Existing Clustering Methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Distance-Based Method
• Here we can easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance measure. This is called distance-based clustering.
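Two common choices for that distance measure (also used later by K-Means) can be sketched as follows; the points here are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan / city-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 triangle)
print(manhattan((0, 0), (3, 4)))  # 7
```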
Hierarchical Clustering
Agglomerative (bottom up)
1. Start with 1 point (singleton)
2. Recursively add two or more appropriate clusters
3. Stop when k number of clusters is achieved.
Divisive (top down)
1. Start with a big cluster
2. Recursively divide into smaller clusters
3. Stop when k number of clusters is achieved.
Hierarchical algorithms
• Agglomerative algorithms begin with each
element as a separate cluster and merge them
into successively larger clusters.
• Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller
clusters.
Hierarchical Agglomerative General Algorithm
1. Find the 2 closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2
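The agglomerative procedure above can be sketched in a few lines. This is a minimal illustration, assuming 1-D points and single-linkage (closest-member) distance between clusters; real implementations support other linkages and dimensions.

```python
def closest_pair(clusters):
    """Return the indices of the two clusters whose nearest members are closest
    (single-linkage distance between clusters of 1-D points)."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

def agglomerate(points, k):
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > k:           # stop when k clusters remain
        i, j = closest_pair(clusters)  # find and merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1, 2, 9, 10, 30], k=3))  # [[1, 2], [9, 10], [30]]
```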
Partitioning clustering
1. Divide the data into proper subsets
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical methods)
Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the parameters of a cluster.
3. Each point has a single cluster membership.
K-Means Algorithm
1. It accepts the number of clusters to group data
into, and the dataset to cluster as input values.
2. It then creates the first K initial clusters (K=
number of clusters needed) from the dataset by
choosing K rows of data randomly from the dataset.
For Example, if there are 10,000 rows of data in
the dataset and 3 clusters need to be formed, then
the first K=3 initial clusters will be created by
selecting 3 records randomly from the dataset as
the initial clusters. Each of the 3 initial clusters
formed will have just one row of data.
3. The K-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record. The Arithmetic Mean of a cluster with one record is the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented by a Height, Weight and Age measurement, then P = {Age, Height, Weight}. A record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm and Weight = 80 pounds. Since there is only one record in each initial cluster, the Arithmetic Mean of a cluster with only the record for John as a member = {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean Distance Measure or the Manhattan/City-Block Distance Measure.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2 and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster = {25, 165, 100}. This new arithmetic mean becomes the center of the cluster. Following the same procedure, new cluster centers are computed for all the existing clusters.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is complete. Stable clusters are formed when new iterations of the K-Means algorithm do not change the clusters, i.e. the cluster center or Arithmetic Mean of each cluster is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means procedure is complete.
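The whole procedure (steps 1–7) can be sketched end to end. This is a minimal illustration, assuming Euclidean distance and a fixed choice of the first K records as initial clusters (rather than the random choice in step 2) so that the run is reproducible; the four {Age, Height, Weight}-style records are made up.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(records):
    """Arithmetic mean of a cluster: coordinate-wise mean of its records."""
    n = len(records)
    return tuple(sum(r[i] for r in records) / n for i in range(len(records[0])))

def k_means(data, k, max_iter=100):
    centers = [data[i] for i in range(k)]   # step 2: K initial one-record clusters
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for record in data:                 # steps 4/6: assign to nearest center
            nearest = min(range(k), key=lambda i: euclidean(record, centers[i]))
            clusters[nearest].append(record)
        # step 5: re-compute the arithmetic mean of every (non-empty) cluster
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:          # step 7: centers unchanged -> stable
            break
        centers = new_centers
    return centers, clusters

data = [(20, 170, 80), (30, 160, 120), (21, 172, 82), (29, 158, 118)]
centers, clusters = k_means(data, k=2)
print(centers)  # [(20.5, 171.0, 81.0), (29.5, 159.0, 119.0)]
```

On this toy data the algorithm converges after one re-computation, grouping the two younger/taller records into one cluster and the two older/heavier records into the other.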
K-Means Algorithm: Example
Output
Thank you !!!
