WHAT IS CLUSTERING?
• CLUSTERING IS THE PROCESS OF GROUPING A SET OF
ABSTRACT OBJECTS INTO CLASSES OF SIMILAR
OBJECTS.
• IMPORTANT POINTS
• A CLUSTER OF DATA OBJECTS CAN BE TREATED AS
ONE GROUP.
• WHILE DOING CLUSTER ANALYSIS, WE FIRST
PARTITION THE SET OF DATA INTO GROUPS BASED
ON DATA SIMILARITY AND THEN ASSIGN THE LABELS
TO THE GROUPS.
WHAT IS FACTOR ANALYSIS?
• FACTOR ANALYSIS IS A TECHNIQUE USED TO REDUCE A
LARGE NUMBER OF VARIABLES TO A SMALLER NUMBER OF
FACTORS. THE TECHNIQUE EXTRACTS THE MAXIMUM COMMON
VARIANCE FROM ALL VARIABLES AND PUTS IT INTO A COMMON
SCORE. AS AN INDEX OF ALL THE VARIABLES, THIS SCORE CAN BE USED
FOR FURTHER ANALYSIS.
• CORRELATION AMONG THE VARIABLES IS USED TO GROUP THEM INTO FACTORS
DIFFERENCES BETWEEN CLUSTERING AND FACTOR
ANALYSIS
• [COMPARISON TABLE: FACTOR ANALYSIS vs. CLUSTERING]
WHAT IS DATA CLASSIFICATION ?
• DATA CLASSIFICATION IS THE PROCESS OF SORTING AND
CATEGORIZING DATA INTO VARIOUS TYPES, FORMS OR ANY OTHER
DISTINCT CLASS. DATA CLASSIFICATION ENABLES THE SEPARATION
AND CLASSIFICATION OF DATA ACCORDING TO DATA SET
REQUIREMENTS FOR VARIOUS BUSINESS OR PERSONAL OBJECTIVES.
IT IS MAINLY A DATA MANAGEMENT PROCESS.
• EXAMPLES:-
• SEPARATING CUSTOMER DATA BASED ON GENDER
• SORTING DATA BASED ON CONTENT/FILE TYPE, SIZE AND TIME
• SORTING FOR SECURITY REASONS BY CLASSIFYING DATA INTO
RESTRICTED, PUBLIC OR PRIVATE DATA TYPES
DIFFERENCES BETWEEN CLASSIFICATION AND
CLUSTERING
DIAGRAMMATIC REPRESENTATION OF THE
DIFFERENCE BETWEEN CLASSIFICATION AND
CLUSTERING
TYPES OF CLUSTERING
• THERE ARE MAINLY THREE TYPES OF CLUSTERING:-
• HIERARCHICAL CLUSTERING-
• THIS METHOD CREATES A HIERARCHICAL DECOMPOSITION OF THE GIVEN SET OF DATA
OBJECTS. WE CAN CLASSIFY HIERARCHICAL METHODS ON THE BASIS OF HOW THE
HIERARCHICAL DECOMPOSITION IS FORMED. THERE ARE TWO APPROACHES HERE −
• AGGLOMERATIVE APPROACH
• DIVISIVE APPROACH
• K-MEANS CLUSTERING- THE NUMBER OF CLUSTERS IS PREDETERMINED. IT IS
TYPICALLY USED WHEN THE SAMPLE SIZE IS VERY LARGE (A MINIMAL SKETCH FOLLOWS THIS LIST).
• TWO-STAGE CLUSTERING- A HYBRID OF K-MEANS AND HIERARCHICAL
CLUSTERING.
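A minimal k-means sketch (not part of the original slides), assuming scikit-learn is available; the six 2-D points reuse the cases A–F from the hierarchical clustering example later in these slides.

```python
# Minimal k-means sketch (assumes scikit-learn is installed).
# The six 2-D points are the cases A-F from the hierarchical clustering
# example later in these slides; the number of clusters k must be chosen in advance.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment of each case
print(kmeans.cluster_centers_)  # the two cluster mean vectors
```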
• STIRLING NUMBER OF THE SECOND KIND
• Using this we find the number of ways of sorting n objects into k nonempty groups
• $S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^{n}$
• Adding these values over k = 1, 2, …, n, we obtain the total number of ways to partition the n objects into
nonempty groups.
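As a quick check of the formula, a short Python sketch (standard library only) that computes S(n, k) and sums it over k:

```python
# Sketch: Stirling numbers of the second kind via the formula above.
from math import comb, factorial

def stirling2(n: int, k: int) -> int:
    """Number of ways to partition n objects into k nonempty groups."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)

# Example: 4 objects into 2 nonempty groups -> 7 ways
print(stirling2(4, 2))                          # 7
# Summing over k gives the total number of partitions of 4 objects
print(sum(stirling2(4, k) for k in range(5)))   # 15
```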
Similarity and Dis-similarity measure
• Distance or similarity measures are essential to solve many pattern
recognition problems such as classification and clustering.
• Similarity Measure
• Numerical measure of how alike two data objects are.
• Often falls between 0 (no similarity) and 1 (complete similarity).
• Dissimilarity Measure
• Numerical measure of how different two data objects are.
• Often ranges from 0 (objects are alike) to 1 (objects are completely different).
• When items (units or cases) are clustered, proximity is usually indicated by some sort of
distance. Variables, on the other hand, are usually grouped on the basis of correlation
coefficients or similar measures of association.
Similarity and Dis-similarity measure (cont.)
• Measures of distance
• 1) Euclidean Distance
• The distance between two p-dimensional observations (items) x’ = [x1, x2, x3, …, xp] and y’ = [y1, y2, y3, …, yp] is
• $d(x, y) = \sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2 + \cdots + (x_p - y_p)^2}$
• 2) Minkowski Metric
• $d(x, y) = \left( \sum_{i=1}^{p} \lvert x_i - y_i \rvert^{m} \right)^{1/m}$
• m = 1: it becomes the city-block distance
• m = 2: it becomes the Euclidean distance
3) Canberra metric :-
$d(x, y) = \sum_{i=1}^{p} \frac{\lvert x_i - y_i \rvert}{(x_i + y_i)}$
4) Czekanowski Coefficient
$d(x, y) = 1 - \frac{2 \sum_{i=1}^{p} \min(x_i, y_i)}{\sum_{i=1}^{p} (x_i + y_i)}$
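The four distance measures above can be sketched in a few lines of Python; NumPy is assumed, and the Canberra and Czekanowski functions assume nonnegative coordinates so that the denominators are positive.

```python
# Sketch implementations of the four distance measures above (NumPy assumed).
import numpy as np

def euclidean(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2))

def minkowski(x, y, m=2):
    # m = 1 gives the city-block distance, m = 2 the Euclidean distance
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) ** m) ** (1.0 / m)

def canberra(x, y):
    # assumes nonnegative coordinates so that x_i + y_i > 0
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sum(np.abs(x - y) / (x + y))

def czekanowski(x, y):
    # assumes nonnegative coordinates
    x, y = np.asarray(x, float), np.asarray(y, float)
    return 1.0 - 2.0 * np.sum(np.minimum(x, y)) / np.sum(x + y)

x, y = [1.0, 2.0, 3.0], [2.0, 2.0, 5.0]
print(euclidean(x, y), minkowski(x, y, 1), canberra(x, y), czekanowski(x, y))
```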
Similarity and Dis-similarity measure (cont.)
• Measures of distance
Properties:
• d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 if and only if p = q,
• d(p, q) = d(q,p) for all p and q,
• d(p, r) ≤ d(p, q) + d(q, r) for all p, q, and r
• The above similarity or distance measures are appropriate for continuous variables. However, for binary
variables a different approach is necessary.
• Simple Matching and Jaccard Coefficients
• Simple matching coefficient = (n1,1 + n0,0) / (n1,1 + n1,0 + n0,1 + n0,0).
• Jaccard coefficient = n1,1 / (n1,1 + n1,0 + n0,1).
• Here n1,1 is the number of variables on which both objects score 1, n1,0 the number on which the first object scores 1 and the second 0, and so on.
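A small sketch of both coefficients for two binary vectors, using the scores of individuals 1 and 2 from the example that follows.

```python
# Sketch: simple matching and Jaccard coefficients for two binary vectors.
def binary_counts(x, y):
    n11 = sum(a == 1 and b == 1 for a, b in zip(x, y))
    n10 = sum(a == 1 and b == 0 for a, b in zip(x, y))
    n01 = sum(a == 0 and b == 1 for a, b in zip(x, y))
    n00 = sum(a == 0 and b == 0 for a, b in zip(x, y))
    return n11, n10, n01, n00

def simple_matching(x, y):
    n11, n10, n01, n00 = binary_counts(x, y)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(x, y):
    n11, n10, n01, n00 = binary_counts(x, y)
    return n11 / (n11 + n10 + n01)

# Binary scores of individuals 1 and 2 from the example on the next slides
x1 = [0, 0, 0, 1, 1, 1]
x2 = [1, 1, 1, 0, 1, 0]
print(simple_matching(x1, x2))  # 1/6
print(jaccard(x1, x2))          # 1/6
```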
Similarity and Dis-similarity measure (Ex.)
Suppose five individuals possess the following characteristics:
• Define six binary variables X1, X2, X3, X4, X5, X6:
• X1 = 1 if height ≥ 71 in., 0 if height < 71 in.        X2 = 1 if weight ≥ 150 lb, 0 if weight < 150 lb
• X3 = 1 if brown eyes, 0 otherwise                       X4 = 1 if blond hair, 0 if not blond hair
• X5 = 1 if right-handed, 0 if left-handed                X6 = 1 if female, 0 if male
              Height   Weight   Eye Color   Hair Color   Handedness   Gender
Individual 1   68 in   140 lb   green       blond        right        female
Individual 2   73 in   185 lb   brown       brown        right        male
Individual 3   67 in   165 lb   blue        blond        right        male
Individual 4   64 in   120 lb   brown       brown        right        female
Individual 5   76 in   210 lb   brown       brown        left         male
Similarity and Dis-similarity measure (Ex.)
The scores for individuals 1 and 2 on the p = 6 binary variables, together with the corresponding 2 × 2 contingency table, are shown below.
The simple matching coefficient is (1 + 0)/6 = 1/6.
Doing the same for the other pairs of individuals gives the similarity matrix on the next slide.
               X1   X2   X3   X4   X5   X6
Individual 1    0    0    0    1    1    1
Individual 2    1    1    1    0    1    0

                         Individual 2
                         1     0     Total
Individual 1    1        1     2       3
                0        3     0       3
              Total      4     2       6
Similarity and Dis-similarity measure (Ex.)
From this we find that individuals 2 and 5 are most similar and individuals 1 and 5 are least similar; the other
pairs fall between these extremes. If we were to divide the individuals into two subgroups, we might form the
subgroups (2, 5) and (1, 3, 4).
                           Individual
                    1      2      3      4      5
               1    1
               2   1/6     1
Individual     3   4/6    3/6     1
               4   4/6    3/6    2/6     1
               5    0     5/6    2/6    2/6     1
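A sketch that reproduces this similarity matrix; the 0/1 score vectors below are derived from the variable definitions and the data table on the earlier slide (an interpretation for illustration, not part of the original slides).

```python
# Sketch: building the 5 x 5 simple-matching similarity matrix from the
# binary scores implied by X1-X6 (the vectors below are derived, not given).
import numpy as np

scores = np.array([
    [0, 0, 0, 1, 1, 1],   # individual 1
    [1, 1, 1, 0, 1, 0],   # individual 2
    [0, 1, 0, 1, 1, 0],   # individual 3
    [0, 0, 1, 0, 1, 1],   # individual 4
    [1, 1, 1, 0, 0, 0],   # individual 5
])

n = scores.shape[0]
similarity = np.eye(n)
for i in range(n):
    for j in range(i):
        # simple matching coefficient: fraction of the 6 variables that agree
        similarity[i, j] = similarity[j, i] = np.mean(scores[i] == scores[j])

print(np.round(similarity, 2))
```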
Similarity and Dis-similarity measure (Ex.)
For similarity measures of variables, we can use correlation coefficients. When the variables are binary, the
data can again be arranged in the form of a contingency table.
For each pair of variables, the n items are cross-classified in a contingency table. With the usual 0 and 1 coding,
the table takes the following form.
The usual product moment correlation formula applied to the binary variables in the contingency table is-
$r = \frac{ad - bc}{\{(a + b)(c + d)(a + c)(b + d)\}^{1/2}}$
This number can be taken as a measure of the similarity between the two variables.
                         Variable k
                      1        0        Total
Variable i     1      a        b        a + b
               0      c        d        c + d
             Total   a + c    b + d     n = a + b + c + d
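A one-function sketch of this correlation (often called the phi coefficient); the counts in the example call are hypothetical.

```python
# Sketch: product-moment correlation (phi coefficient) from a 2 x 2 table.
from math import sqrt

def binary_correlation(a, b, c, d):
    """r = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d)) for the table above."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

# Illustrative (hypothetical) counts: a = 20, b = 5, c = 10, d = 15
print(binary_correlation(20, 5, 10, 15))   # about 0.41
```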
Hierarchical Clustering
 It follows a series of successive mergers or series of successive divisions.
 Agglomerative hierarchical methods start with the individual objects.
 Initially there are as many clusters as objects. The most similar objects are grouped first, and these initial
groups are then merged according to their similarities. Eventually, as the similarity decreases, all subgroups are
fused into a single cluster.
 Divisive hierarchical methods work in the opposite direction.
 Initially a single group of objects is divided into two subgroups such that the objects in one subgroup
are ‘far from’ the objects in the other. These subgroups are then further divided into dissimilar
subgroups; the process continues until there are as many subgroups as objects, i.e., until each object forms
its own group.
 Both methods can be displayed using a 2-D structure known as a dendrogram.
Hierarchical Clustering
 Methods-
 Single Linkage Method- In single linkage, we define the distance between two clusters to be
the minimum distance between any single data point in the first cluster and any single data point in
the second cluster. On the basis of this definition of distance between clusters, at each stage of the
process we combine the two clusters that have the smallest single linkage distance.
 Complete Linkage: In complete linkage, we define the distance between two clusters to be
the maximum distance between any single data point in the first cluster and any single data point in
the second cluster. On the basis of this definition of distance between clusters, at each stage of the
process we combine the two clusters that have the smallest complete linkage distance.
 Average Linkage: In average linkage, we define the distance between two clusters to be
the average distance between data points in the first cluster and data points in the second cluster.
On the basis of this definition of distance between clusters, at each stage of the process we combine
the two clusters that have the smallest average linkage distance.
 Centroid Method: In centroid method, the distance between two clusters is the distance between
the two mean vectors of the clusters. At each stage of the process we combine the two clusters that
have the smallest centroid distance.
 Ward’s Method- The distance between two clusters is the sum of squares between the two clusters
across all the clustering variables. At each stage, the combination that results in the smallest increase in
the error sum of squares (ESS) is merged (a comparison sketch using these linkage methods follows this list).
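A comparison sketch of these linkage methods, assuming SciPy is installed; it reuses the six cases A–F from the example on the later slides and cuts each tree into two clusters.

```python
# Sketch: comparing the linkage methods above with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# The six cases (A-F) from the hierarchical clustering example
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

for method in ("single", "complete", "average", "centroid", "ward"):
    Z = linkage(X, method=method, metric="euclidean")    # merge history
    labels = fcluster(Z, t=2, criterion="maxclust")      # cut into 2 clusters
    print(method, labels)
```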
Hierarchical Clustering
• The following are the steps in the agglomerative hierarchical clustering
algorithm for grouping N objects (items or variables); a from-scratch sketch follows these steps.
1) Start with N clusters, each containing a single entity, and an N × N symmetric matrix of
distances (or similarities) D = {dik}.
2) Search the distance matrix for the nearest (most similar) pair of clusters. Let the
distance between the most similar clusters U and V be dUV.
3) Merge clusters U and V and label the newly formed cluster (UV). Update the entries in the
distance matrix by
• deleting the rows and columns corresponding to clusters U and V, and
• adding a row and column giving the distances between cluster (UV) and the remaining clusters.
4) Repeat steps 2 and 3 a total of N − 1 times (all objects will be in a single cluster after the
algorithm terminates). Record the identity of the clusters that are merged and the levels
(distances or similarities) at which the merges take place.
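A from-scratch sketch of these four steps, using single linkage for the update in step 3 (a teaching illustration rather than an efficient implementation; the 3 × 3 distance matrix at the end is hypothetical).

```python
# Sketch of the agglomerative steps above, using single linkage for the update.
import numpy as np

def agglomerate_single_linkage(D, labels):
    """D: N x N symmetric distance matrix; labels: list of item names."""
    clusters = [(name,) for name in labels]
    D = np.array(D, dtype=float)
    np.fill_diagonal(D, np.inf)               # ignore self-distances
    while len(clusters) > 1:
        # Step 2: find the nearest pair of clusters (u, v)
        u, v = np.unravel_index(np.argmin(D), D.shape)
        d_uv = D[u, v]
        # Step 3: merge them and update the matrix (single linkage = minimum)
        merged = clusters[u] + clusters[v]
        new_row = np.minimum(D[u], D[v])
        keep = [k for k in range(len(clusters)) if k not in (u, v)]
        clusters = [clusters[k] for k in keep] + [merged]
        D = D[np.ix_(keep, keep)]
        D = np.pad(D, ((0, 1), (0, 1)), constant_values=np.inf)
        D[-1, :-1] = D[:-1, -1] = new_row[keep]
        # Step 4: record which clusters merged and at what level
        print(f"merged {merged} at distance {d_uv:.2f}")
    return clusters

# Toy 3 x 3 distance matrix (hypothetical values)
D = [[0, 2, 6], [2, 0, 5], [6, 5, 0]]
agglomerate_single_linkage(D, ["A", "B", "C"])
```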
Hierarchical Clustering(example)
Suppose we have 6 cases (A, B, C, D, E, F) and two features (X1, X2).
We now have to compute the distance matrix.
We can compute the distances using the Euclidean formula (a sketch follows the table).
     X1    X2
A    1     1
B    1.5   1.5
C    5     5
D    3     4
E    4     4
F    3     3.5
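A sketch that computes the Euclidean distance matrix for these six cases and draws the corresponding single-linkage dendrogram, assuming SciPy and Matplotlib are installed.

```python
# Sketch: Euclidean distance matrix and dendrogram for the six cases above.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

labels = ["A", "B", "C", "D", "E", "F"]
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0],
              [3.0, 4.0], [4.0, 4.0], [3.0, 3.5]])

D = squareform(pdist(X, metric="euclidean"))   # 6 x 6 distance matrix
print(np.round(D, 2))

Z = linkage(pdist(X), method="single")         # single-linkage merge history
dendrogram(Z, labels=labels)
plt.show()
```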
Hierarchical Clustering(example)
 A dendrogram is a tree diagram.
 Agglomerative hierarchical method.
 Divisive hierarchical method.
 The results of both agglomerative
and divisive methods may be
displayed in the form of a two-
dimensional diagram known as a
dendrogram.
 A square matrix in which the entry in cell (j, k) is some
measure of the similarity (or distance) between the items to
which row j and column k correspond.
 Proximity matrices form the data for multidimensional
scaling.
 It is a matrix formed from the distances between objects, for example the Euclidean distance.
[Figure: (a) set of six 2-dimensional points, (b) x–y coordinates of the six points, (c) proximity matrix]
Cluster analysis
 Single linkage. Also
referred to as nearest
neighbor or minimum
method.
 This measure defines the
distance between two
clusters as the minimum
distance found between
one case from the first
cluster and one case from
the second cluster.
 Complete linkage. Also
referred to as furthest
neighbour or maximum
method.
 This measure is similar to
the single linkage measure
described above, but
instead of searching for the
minimum distance between
pairs of cases, it considers
the furthest distance
between pairs of cases.
 Average linkage.
 Also referred to as the
Unweighted Pair-Group
Method using
Arithmetic averages
(UPGMA). It is intended
to overcome the
limitations of single and
complete linkage.