Clustering & Classification
Institute of Engineering & Management (IEM), Kolkata (INDIA)
BCA (A) – 3rd year
 Koyel Agarwal
 Madhurima Dey
 Mainak Sen Choudhary
 Md. Jamshed Khan
 Md. Masud Parvez
Contents:
• What is Clustering & Classification?
• Data Mining & Its Applications
• Types of Data Mining Functions
• How Does Classification Work?
• How Does Clustering Work?
• Types of Clustering
• K-Means Algorithm
• Conclusion
What is Clustering & Classification?
• Clustering and Classification are concepts from Data Mining and Machine Learning.
• They are techniques used to extract knowledge from data.
• Classification and Clustering are also known as Supervised Learning and Unsupervised Learning, respectively.
Data Mining:
Data Mining is defined as extracting information from
huge sets of data. In other words, we can say that data
mining is the procedure of mining knowledge from data.
Data Mining Applications:
Data mining is highly useful in the following domains:
 Market Analysis and Management
 Corporate Analysis & Risk Management
 Fraud Detection
There are two categories of functions involved in Data Mining:
 Descriptive
 Classification and Prediction
Descriptive Functions:
The descriptive function deals with the general properties of data in
the database. List of descriptive functions −
 Class/Concept Description
 Mining of frequent Patterns
 Mining of Associations
 Mining of Correlations
 Mining of Clusters
Classification:
Classification is a supervised learning technique used to assign a pre-defined label to an instance on the basis of its features. A classification algorithm therefore requires training data: a classification model is created from the training data, and that model is then used to classify new instances.
Clustering:
Clustering is an unsupervised learning technique used to group similar instances on the basis of their features. Clustering does not require training data, and it does not assign a pre-defined label to each group.
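The contrast between the two definitions can be sketched with a toy example. The data and the 1-nearest-neighbour rule below are illustrative assumptions, not part of the slides: classification needs labelled training pairs, while clustering would receive the bare values only.

```python
# Supervised: training data carries pre-defined labels.
labeled_training_data = [
    (1.0, "small"), (1.2, "small"), (8.0, "large"), (8.5, "large"),
]

# Unsupervised: the same kind of values, but with no labels attached.
unlabeled_data = [1.1, 7.9, 8.2, 0.9]

def classify(x):
    """1-nearest-neighbour classifier: copy the label of the closest training point."""
    return min(labeled_training_data, key=lambda pair: abs(pair[0] - x))[1]

print(classify(1.1))   # predicted from the labelled training data -> "small"
print(classify(7.9))   # -> "large"
```

A clustering algorithm, by contrast, would have to group `unlabeled_data` purely by similarity, without ever seeing the words "small" or "large".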
Examples of Classification
Following are examples of cases where the data analysis task is Classification −
 A bank loan officer wants to analyze the data in order to know which customers (loan applicants) are risky and which are safe.
 A marketing manager at a company needs to analyze a customer with a given profile to predict whether that customer will buy a new computer.
In both of the above examples, a model or classifier is
constructed to predict the categorical labels. These labels are
risky or safe for loan application data and yes or no for
marketing data.
How Does Classification Work?
The Data Classification process includes two steps −
 Building the Classifier or Model
 Using Classifier for Classification
Building the Classifier or Model
 This step is the learning step or the learning phase.
 In this step the classification algorithms build the
classifier.
 The classifier is built from the training set made up of
database tuples and their associated class labels.
 Each tuple in the training set belongs to a pre-defined category or class. These tuples can also be referred to as samples, objects or data points.
Using Classifier for Classification
In this step, the classifier is used for classification. Here the test data is used to
estimate the accuracy of classification rules. The classification rules can be
applied to the new data tuples if the accuracy is considered acceptable.
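The two-step process above can be sketched end to end. Everything here is a hypothetical illustration: the "model" is a trivial nearest-value rule and the loan amounts and labels are made up, but the shape of the process — build from a training set, then estimate accuracy on test data — matches the slides.

```python
def build_classifier(training_set):
    """Step 1 (learning phase): build a classifier from labelled training tuples.
    This toy 'model' just memorises the training set and copies the label of
    the nearest training tuple."""
    def classifier(x):
        nearest = min(training_set, key=lambda row: abs(row[0] - x))
        return nearest[1]
    return classifier

# Hypothetical (amount, class label) tuples.
training = [(200, "risky"), (250, "risky"), (900, "safe"), (1000, "safe")]
test     = [(220, "risky"), (950, "safe"), (300, "risky")]

model = build_classifier(training)

# Step 2: use the classifier on held-out test data to estimate accuracy.
correct = sum(model(x) == label for x, label in test)
accuracy = correct / len(test)
print(accuracy)
```

If the estimated accuracy is acceptable, the same `model` function would then be applied to genuinely new tuples.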
How Does Clustering Work?
 Cluster analysis finds clusters of data objects that are similar in some sense to one another.
 The members of a cluster are more like each other than they are like
members of other clusters.
 The goal of clustering analysis is to find high-quality clusters such
that the inter-cluster similarity is low and the intra-cluster similarity
is high.
 A cluster of data objects can be treated as one group.
 While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
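The "high intra-cluster similarity, low inter-cluster similarity" goal can be checked numerically. The 2-D points below are hypothetical; the sketch compares the largest distance within a cluster against the smallest distance between clusters.

```python
import math

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Two hypothetical, well-separated groups of points.
cluster_a = [(0, 0), (1, 0), (0, 1)]
cluster_b = [(10, 10), (11, 10), (10, 11)]

# Largest distance inside cluster_a vs. smallest distance across clusters.
intra = max(dist(p, q) for p in cluster_a for q in cluster_a)
inter = min(dist(p, q) for p in cluster_a for q in cluster_b)
print(intra < inter)  # True: members are closer to each other than to the other cluster
```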
Major Existing Clustering Methods
• Distance-based
• Hierarchical
• Partitioning
• Probabilistic
Distance-Based Method
• Here we can easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close” according to a given distance measure. This is called distance-based clustering.
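Two common choices for that distance measure (also used later by K-Means) can be sketched as follows; the points here are illustrative.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Manhattan / city-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 triangle)
print(manhattan((0, 0), (3, 4)))  # 7
```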
Hierarchical Clustering
Agglomerative (bottom up)
1. Start with 1 point (singleton)
2. Recursively add two or more appropriate clusters
3. Stop when k number of clusters is achieved.
Divisive (top down)
1. Start with a big cluster
2. Recursively divide into smaller clusters
3. Stop when k number of clusters is achieved.
Hierarchical algorithms
• Agglomerative algorithms begin with each
element as a separate cluster and merge them
into successively larger clusters.
• Divisive algorithms begin with the whole set and
proceed to divide it into successively smaller
clusters.
Hierarchical Agglomerative General Algorithm
1. Find the 2 closest objects and merge them into a cluster
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2
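The agglomerative procedure above can be sketched in a few lines. This is a minimal illustration, assuming 1-D points and single-linkage (closest-member) distance between clusters; real implementations support other linkages and dimensions.

```python
def closest_pair(clusters):
    """Return the indices of the two clusters whose nearest members are closest
    (single-linkage distance between clusters of 1-D points)."""
    best = None
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

def agglomerate(points, k):
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > k:           # stop when k clusters remain
        i, j = closest_pair(clusters)  # find and merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters

print(agglomerate([1, 2, 9, 10, 30], k=3))  # [[1, 2], [9, 10], [30]]
```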
Partitioning clustering
1. Divide the data into proper subsets
2. Recursively go through each subset and relocate points between clusters (as opposed to the visit-once approach of hierarchical methods)
Probabilistic clustering
1. Data are assumed to be drawn from a mixture of probability distributions.
2. The mean and variance of each distribution serve as the parameters of a cluster.
3. Each point has a single cluster membership.
K-Means Algorithm
1. It accepts the number of clusters to group data
into, and the dataset to cluster as input values.
2. It then creates the first K initial clusters (K=
number of clusters needed) from the dataset by
choosing K rows of data randomly from the dataset.
For Example, if there are 10,000 rows of data in
the dataset and 3 clusters need to be formed, then
the first K=3 initial clusters will be created by
selecting 3 records randomly from the dataset as
the initial clusters. Each of the 3 initial clusters
formed will have just one row of data.
3. The K-Means algorithm calculates the Arithmetic Mean of each cluster formed in the dataset. The Arithmetic Mean of a cluster is the mean of all the individual records in the cluster. In each of the first K initial clusters, there is only one record. The Arithmetic Mean of a cluster with one record is the set of values that make up that record. For example, if the dataset we are discussing is a set of Height, Weight and Age measurements for students in a university, where a record P in the dataset S is represented by a Height, Weight and Age measurement, then P = {Age, Height, Weight}. A record containing the measurements of a student John would be represented as John = {20, 170, 80}, where John's Age = 20 years, Height = 170 cm and Weight = 80 pounds. Since there is only one record in each initial cluster, the Arithmetic Mean of a cluster with only the record for John as a member = {20, 170, 80}.
4. Next, K-Means assigns each record in the dataset to only one of the initial clusters. Each record is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity such as the Euclidean Distance Measure or the Manhattan/City-Block Distance Measure.
5. K-Means re-assigns each record in the dataset to the most similar cluster and re-calculates the arithmetic mean of all the clusters in the dataset. The arithmetic mean of a cluster is the arithmetic mean of all the records in that cluster. For example, if a cluster contains two records, John = {20, 170, 80} and Henry = {30, 160, 120}, then the arithmetic mean P_mean = {Age_mean, Height_mean, Weight_mean}, where Age_mean = (20 + 30)/2, Height_mean = (170 + 160)/2 and Weight_mean = (80 + 120)/2. The arithmetic mean of this cluster = {25, 165, 100}. This new arithmetic mean becomes the center of the cluster. Following the same procedure, new cluster centers are computed for all the existing clusters.
6. K-Means re-assigns each record in the dataset to only one of the new clusters formed. A record or data point is assigned to the nearest cluster (the cluster it is most similar to) using a measure of distance or similarity.
7. The preceding steps are repeated until stable clusters are formed and the K-Means clustering procedure is complete. Stable clusters are formed when new iterations of the K-Means algorithm do not change the clusters, i.e. the cluster center or Arithmetic Mean of each cluster is the same as the old cluster center. There are different techniques for determining when a stable cluster is formed or when the K-Means procedure is complete.
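The whole procedure (steps 1–7) can be sketched end to end. This is a minimal illustration, assuming Euclidean distance and a fixed choice of the first K records as initial clusters (rather than the random choice in step 2) so that the run is reproducible; the four {Age, Height, Weight}-style records are made up.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean(records):
    """Arithmetic mean of a cluster: coordinate-wise mean of its records."""
    n = len(records)
    return tuple(sum(r[i] for r in records) / n for i in range(len(records[0])))

def k_means(data, k, max_iter=100):
    centers = [data[i] for i in range(k)]   # step 2: K initial one-record clusters
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for record in data:                 # steps 4/6: assign to nearest center
            nearest = min(range(k), key=lambda i: euclidean(record, centers[i]))
            clusters[nearest].append(record)
        # step 5: re-compute the arithmetic mean of every (non-empty) cluster
        new_centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
        if new_centers == centers:          # step 7: centers unchanged -> stable
            break
        centers = new_centers
    return centers, clusters

data = [(20, 170, 80), (30, 160, 120), (21, 172, 82), (29, 158, 118)]
centers, clusters = k_means(data, k=2)
print(centers)  # [(20.5, 171.0, 81.0), (29.5, 159.0, 119.0)]
```

On this toy data the algorithm converges after one re-computation, grouping the two younger/taller records into one cluster and the two older/heavier records into the other.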
K-Means Algorithm: Example
Output
Thank you !!!
