CLUSTER VALIDITY
INDICES
WHY EVALUATE THE “GOODNESS” OF
THE RESULTING CLUSTERS?
 To avoid finding patterns in noise
 To compare clustering algorithms
 To compare two sets of clusters
 To compare two clusters
DIFFERENT ASPECTS OF CLUSTER
VALIDATION
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether
non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to
externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference
to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.
FRAMEWORK FOR CLUSTER VALIDITY
 Need a framework to interpret any measure.
 For example, if our measure of evaluation has the value 10, is that good,
fair, or poor?
 Statistics provide a framework for cluster validity
 The more “atypical” a clustering result is, the more likely it represents valid
structure in the data
 Compare the values of an index computed on random data or random
clusterings to the value computed on the actual clustering result (a sketch follows this list).
 If the value of the index is unlikely under that reference distribution, the cluster results are likely valid.
 These approaches are more complicated and harder to understand.
 For comparing the results of two different sets of cluster analyses, a
framework is less necessary.
 However, there is the question of whether the difference between two index
values is significant
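As a hedged illustration of the random-baseline idea, the sketch below compares the k-means SSE on a dataset against the SSE obtained on data drawn uniformly over the same bounding box; the dataset, the choice of k, and the number of random repetitions are all illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Index value on the actual data: k-means SSE (inertia), lower is better.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)
sse_data = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).inertia_

# Reference distribution: the same index on data with no structure
# (points drawn uniformly over the bounding box of X).
rng = np.random.default_rng(0)
sse_random = []
for _ in range(50):
    R = rng.uniform(X.min(axis=0), X.max(axis=0), size=X.shape)
    sse_random.append(KMeans(n_clusters=3, n_init=10, random_state=0).fit(R).inertia_)

# If sse_data is far below the random reference values, the clustering is
# unlikely to be an artifact of noise.
print(sse_data, np.mean(sse_random), np.min(sse_random))
```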
MEASURES OF CLUSTER VALIDITY
 Numerical measures that are applied to judge various aspects of
cluster validity, are classified into the following three types.
 External Index: Used to measure the extent to which cluster labels match
externally supplied class labels.
 Entropy
 Internal Index: Used to measure the goodness of a clustering structure
without respect to external information.
 Sum of Squared Error (SSE)
 Relative Index: Used to compare two different clusterings or clusters.
 Often an external or internal index is used for this function, e.g., SSE or entropy
 Sometimes these are referred to as criteria instead of indices
 However, sometimes criterion is the general strategy and index is the numerical
measure that implements the criterion.
EXTERNAL MEASURES
 The correct or ground-truth clustering is known a priori.
 Given a clustering partition C and ground truth partitioning T, we
redefine TP, TN, FP, FN in the context of clustering.
 Given the number of point pairs N (for n points, N = n(n − 1)/2):
N = TP + FP + FN + TN
EXTERNAL MEASURES …
 True Positives (TP): Xi and Xj are a true positive pair if they belong to the
same partition in T, and they are also in the same cluster in C. TP is
defined as the number of true positive pairs.
 False Negatives (FN): Xi and Xj are a false negative pair if they belong to
the same partition in T, but they do not belong to the same cluster in C.
FN is defined as the number of false negative pairs.
 False Positives (FP): Xi and Xj are a false positive pair if they do not
belong to the same partition in T, but belong to the same cluster in C.
FP is the number of false positive pairs.
 True Negatives (TN): Xi and Xj are a true negative pair if they belong
neither to the same partition in T nor to the same cluster in C. TN is the
number of true negative pairs.
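A minimal sketch of computing these four pair counts from two labelings; the helper name pair_counts and the 5-point labelings are illustrative.

```python
from itertools import combinations

def pair_counts(truth, clusters):
    """Count TP, FN, FP, TN over all unordered point pairs."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_c = clusters[i] == clusters[j]
        if same_t and same_c:
            tp += 1
        elif same_t and not same_c:
            fn += 1
        elif not same_t and same_c:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn

# Illustrative labelings for 5 points: N = 5 * 4 / 2 = 10 pairs.
T = [0, 0, 0, 1, 1]   # ground truth
C = [0, 0, 1, 1, 1]   # clustering
print(pair_counts(T, C))  # (2, 2, 2, 4); 2 + 2 + 2 + 4 = 10 = N
```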
JACCARD COEFFICIENT
 Measures the fraction of true positive point pairs, ignoring the true
negatives:
Jaccard = TP / (TP + FP + FN)
 For a perfect clustering C the coefficient is one, that is, there are no
false positives or false negatives.
 Note that the Jaccard coefficient is asymmetric in that it ignores the
true negatives
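For the illustrative 5-point labelings in the earlier sketch (TP = 2, FP = 2, FN = 2), Jaccard = 2 / (2 + 2 + 2) ≈ 0.33.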
RAND STATISTIC
 Measures the fraction of true positives and true negatives over all pairs
as
Rand = (TP + TN)/ N
 The Rand statistic measures the fraction of point pairs where both the
clustering C and the ground truth T agree.
 A perfect clustering has a value of 1 for the statistic.
 The adjusted Rand index is the extension of the Rand statistic corrected
for chance.
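Continuing the illustrative 5-point example, Rand = (2 + 4) / 10 = 0.6. The chance-corrected version is available in scikit-learn as adjusted_rand_score; a minimal sketch with the same illustrative labelings:

```python
from sklearn.metrics import adjusted_rand_score

T = [0, 0, 0, 1, 1]   # ground truth (illustrative)
C = [0, 0, 1, 1, 1]   # clustering (illustrative)

# Plain Rand statistic from the pair counts derived earlier: (TP + TN) / N.
rand = (2 + 4) / 10                      # 0.6

# Adjusted Rand index: corrected for chance, 0 in expectation for random labelings.
ari = adjusted_rand_score(T, C)
print(rand, ari)
```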
FOWLKES-MALLOWS MEASURE
 Define precision and recall analogously to classification:
Prec = TP / (TP + FP) and Recall = TP / (TP + FN)
 The Fowlkes-Mallows (FM) measure is defined as the geometric mean
of the pairwise precision and recall:
FM = √(Prec · Recall)
 FM is also asymmetric in terms of the true positives and negatives
because it ignores the true negatives. Its highest value is also 1,
achieved when there are no false positives or negatives.
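For the same illustrative example, Prec = 2/4 = 0.5, Recall = 2/4 = 0.5, and FM = √(0.5 · 0.5) = 0.5. A sketch, also using scikit-learn's fowlkes_mallows_score:

```python
from math import sqrt
from sklearn.metrics import fowlkes_mallows_score

T = [0, 0, 0, 1, 1]   # ground truth (illustrative)
C = [0, 0, 1, 1, 1]   # clustering (illustrative)

TP, FP, FN = 2, 2, 2                        # pair counts from the earlier sketch
fm = sqrt(TP / (TP + FP) * TP / (TP + FN))  # geometric mean of pairwise precision and recall
print(fm, fowlkes_mallows_score(T, C))      # both 0.5
```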
INTERNAL MEASURES: COHESION
AND SEPARATION
 Cluster Cohesion (Compactness): Measures how closely related the
objects in a cluster are.
 Cluster Separation: Measures how distinct or well-separated a cluster
is from other clusters.
 Example: Squared Error
 Cohesion is measured by the within-cluster sum of squares (WSS, i.e. the SSE):
WSS = Σi Σx∈Ci (x − mi)²
 Separation is measured by the between-cluster sum of squares:
BSS = Σi |Ci| (m − mi)²
where |Ci| is the size of cluster i, mi is its centroid, and m is the overall mean.
INTERNAL MEASURES: COHESION AND
SEPARATION
Example: data points 1, 2, 4, 5 on a line; the overall mean is m = 3, and the
two-cluster centroids are m1 = 1.5 (for {1, 2}) and m2 = 4.5 (for {4, 5}).

K=1 cluster:
WSS = (1 − 3)² + (2 − 3)² + (4 − 3)² + (5 − 3)² = 10
BSS = 4 × (3 − 3)² = 0
Total = WSS + BSS = 10

K=2 clusters:
WSS = (1 − 1.5)² + (2 − 1.5)² + (4 − 4.5)² + (5 − 4.5)² = 1
BSS = 2 × (3 − 1.5)² + 2 × (4.5 − 3)² = 9
Total = WSS + BSS = 10
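A short sketch reproducing these numbers with NumPy; the helper name wss_bss is illustrative.

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 5.0])
m = x.mean()                                # overall mean = 3

def wss_bss(clusters):
    """Within- and between-cluster sums of squares for a list of 1-D clusters."""
    wss = sum(((c - c.mean()) ** 2).sum() for c in clusters)
    bss = sum(len(c) * (m - c.mean()) ** 2 for c in clusters)
    return wss, bss

print(wss_bss([x]))             # K=1: (10.0, 0.0)
print(wss_bss([x[:2], x[2:]]))  # K=2: (1.0, 9.0)
# WSS + BSS is constant (10.0) for any partition of the same data.
```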
CALINSKI-HARABASZ INDEX:
SILHOUETTE COEFFICIENT
DUNN’S INDEX:
DAVIES–BOULDIN INDEX:
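A minimal sketch computing three of these indices with scikit-learn's built-in scorers (silhouette_score, calinski_harabasz_score, davies_bouldin_score); the synthetic data and the choice of k = 3 are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # higher is better, in [-1, 1]
print(calinski_harabasz_score(X, labels))   # higher is better (between/within dispersion ratio)
print(davies_bouldin_score(X, labels))      # lower is better
```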
XIE-BENI INDEX:
 In the definition of the XB index, the numerator indicates the compactness
of the obtained clusters, while the denominator indicates the strength of
the separation between clusters.
 The objective is to minimize the XB index in order to achieve a proper
clustering.
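A minimal sketch of the hard-partition form of the XB index, assuming compactness is the total within-cluster squared distance and separation is the minimum squared distance between centroids; the original Xie-Beni definition additionally weights each point by its fuzzy membership. The function name and the toy data are illustrative.

```python
import numpy as np

def xie_beni(X, labels, centers):
    """Hard-partition Xie-Beni index: compactness / (n * minimum centroid separation)."""
    n = len(X)
    compact = sum(((X[labels == k] - c) ** 2).sum() for k, c in enumerate(centers))
    sep = min(((centers[i] - centers[j]) ** 2).sum()
              for i in range(len(centers)) for j in range(i + 1, len(centers)))
    return compact / (n * sep)

# Illustrative 1-D data split into two clusters, as in the WSS/BSS example.
X = np.array([[1.0], [2.0], [4.0], [5.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[1.5], [4.5]])
print(xie_beni(X, labels, centers))   # smaller values indicate better clustering
```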
PS INDEX:
I-INDEX:
CS-INDEX:
 Used for tackling clusters of different densities and/or sizes.
In its definition, mi, i = 1, . . ., K, denote the cluster centers.