2. WHY EVALUATE THE “GOODNESS” OF THE RESULTING CLUSTERS?
To avoid finding patterns in noise
To compare clustering algorithms
To compare two sets of clusters
To compare two clusters
3. DIFFERENT ASPECTS OF CLUSTER
VALIDATION
1. Determining the clustering tendency of a set of data, i.e., distinguishing whether
non-random structure actually exists in the data.
2. Comparing the results of a cluster analysis to externally known results, e.g., to
externally given class labels.
3. Evaluating how well the results of a cluster analysis fit the data without reference
to external information.
- Use only the data
4. Comparing the results of two different sets of cluster analyses to determine
which is better.
5. Determining the ‘correct’ number of clusters.
For 2, 3, and 4, we can further distinguish whether we want to evaluate the entire
clustering or just individual clusters.
4. FRAMEWORK FOR CLUSTER VALIDITY
Need a framework to interpret any measure.
For example, if our measure of evaluation has the value 10, is that good,
fair, or poor?
Statistics provide a framework for cluster validity
The more “atypical” a clustering result is, the more likely it represents valid
structure in the data
Compare the value of an index obtained on the actual clustering to the
values it takes on random data or random clusterings.
If the observed value of the index is unlikely under this random baseline, then the cluster results are likely valid
These approaches are more complicated and harder to understand.
For comparing the results of two different sets of cluster analyses, a
framework is less necessary.
However, there is the question of whether the difference between two index
values is significant
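The random-baseline idea above can be sketched in code. The following is a minimal, illustrative example (the function names, the median-split "clustering", and the uniform noise model are all assumptions made for the sketch, not part of the original material): compute an index (here SSE) on the actual data, then estimate how often random data produces an index value at least as good.

```python
import random
import statistics

def sse_of_median_split(points):
    """SSE of the 1-D clustering that splits the sorted points in half:
    a deliberately simple stand-in for a real clustering algorithm."""
    pts = sorted(points)
    half = len(pts) // 2
    sse = 0.0
    for cluster in (pts[:half], pts[half:]):
        m = statistics.fmean(cluster)
        sse += sum((x - m) ** 2 for x in cluster)
    return sse

def empirical_p_value(observed_sse, n_points, low, high, trials=200, seed=0):
    """Fraction of random-data runs whose SSE is at most the observed one.
    A small value means the observed SSE is atypically low, i.e. the
    structure is unlikely to be an artifact of noise."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        noise = [rng.uniform(low, high) for _ in range(n_points)]
        if sse_of_median_split(noise) <= observed_sse:
            hits += 1
    return hits / trials

# Two well-separated 1-D groups vs. uniform noise on the same interval.
data = [0.1, 0.15, 0.2, 0.25, 0.8, 0.85, 0.9, 0.95]
observed = sse_of_median_split(data)
p = empirical_p_value(observed, len(data), 0.0, 1.0)
```

The small empirical p-value for the well-separated data illustrates the "atypical" argument: uniform noise almost never yields an SSE that low.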
5. MEASURES OF CLUSTER VALIDITY
Numerical measures that are applied to judge various aspects of
cluster validity are classified into the following three types.
External Index: Used to measure the extent to which cluster labels match
externally supplied class labels.
Entropy
Internal Index: Used to measure the goodness of a clustering structure
without respect to external information.
Sum of Squared Error (SSE)
Relative Index: Used to compare two different clusterings or clusters.
Often an external or internal index is used for this function, e.g., SSE or entropy
Sometimes these are referred to as criteria instead of indices
However, sometimes criterion is the general strategy and index is the numerical
measure that implements the criterion.
6. EXTERNAL MEASURES
The correct or ground truth clustering is known a priori.
Given a clustering partition C and ground truth partitioning T, we
redefine TP, TN, FP, FN in the context of clustering.
For n points, the total number of pairs is N = n(n−1)/2, and
N = TP + FP + FN + TN
7. EXTERNAL MEASURES …
True Positives (TP): Xi and Xj are a true positive pair if they belong to the
same partition in T, and they are also in the same cluster in C. TP is
defined as the number of true positive pairs.
False Negatives (FN): Xi and Xj are a false negative pair if they belong to
the same partition in T, but they do not belong to the same cluster in C.
FN is defined as the number of false negative pairs.
False Positives (FP): Xi and Xj are a false positive pair if they do not
belong to the same partition in T, but belong to the same cluster in C.
FP is the number of false positive pairs.
True Negatives (TN): Xi and Xj are a true negative pair if they do not
belong to the same partition in T, nor to the same cluster in C. TN is the
number of true negative pairs.
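The four pair counts above can be computed directly from two label assignments. The following is a small sketch (the function name and the example labelings are illustrative, not from the original): iterate over all point pairs and classify each pair by whether T and C agree on it.

```python
from itertools import combinations

def pair_confusion(truth, clustering):
    """Count (TP, FN, FP, TN) over all point pairs, given ground-truth
    labels T and cluster labels C for the same points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_t = truth[i] == truth[j]
        same_c = clustering[i] == clustering[j]
        if same_t and same_c:
            tp += 1          # together in both T and C
        elif same_t:
            fn += 1          # together in T, split in C
        elif same_c:
            fp += 1          # split in T, together in C
        else:
            tn += 1          # split in both
    return tp, fn, fp, tn

T = [0, 0, 0, 1, 1, 1]   # ground truth: two partitions of three points
C = [0, 0, 1, 1, 1, 1]   # a clustering that misplaces one point
tp, fn, fp, tn = pair_confusion(T, C)
```

For these six points there are N = 15 pairs, and the four counts always sum to N.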
8. JACCARD COEFFICIENT
Measures the fraction of true positive point pairs, ignoring the
true negatives:
Jaccard = TP / (TP + FP + FN)
For a perfect clustering C, the coefficient is one, that is, there are no
false positives or false negatives.
Note that the Jaccard coefficient is asymmetric in that it ignores the
true negatives
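The Jaccard formula above is a one-liner given the pair counts. A minimal sketch (the example counts are illustrative assumptions):

```python
def jaccard(tp, fp, fn):
    """Jaccard coefficient over point pairs: TN is deliberately ignored."""
    return tp / (tp + fp + fn)

# Illustrative pair counts, e.g. from a small labeled example.
score = jaccard(tp=4, fp=3, fn=2)   # 4 / 9
```

With no false positives and no false negatives the coefficient is exactly 1, matching the perfect-clustering case described above.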
9. RAND STATISTIC
Measures the fraction of true positives and true negatives over all pairs
as
Rand = (TP + TN)/ N
The Rand statistic measures the fraction of point pairs where both the
clustering C and the ground truth T agree.
A perfect clustering has a value of 1 for the statistic.
The adjusted Rand index is the extension of the Rand statistic corrected
for chance.
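The Rand statistic is likewise direct to compute from the pair counts. A minimal sketch (function name and example values are illustrative):

```python
def rand_statistic(tp, tn, n_pairs):
    """Fraction of point pairs on which clustering C and ground truth T
    agree (pairs together in both, or apart in both)."""
    return (tp + tn) / n_pairs

# Illustrative counts over N = 15 pairs.
score = rand_statistic(tp=4, tn=6, n_pairs=15)   # 10 / 15
```

For a chance-corrected version, scikit-learn's `sklearn.metrics.adjusted_rand_score` implements the adjusted Rand index mentioned above.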
10. FOWLKES-MALLOWS MEASURE
Define precision and recall analogously to how they are defined for classification:
Prec = TP / (TP + FP) and Recall = TP / (TP + FN)
The Fowlkes–Mallows (FM) measure is defined as the geometric mean
of the pairwise precision and recall:
FM = √(Prec ∙ Recall)
FM is also asymmetric in terms of the true positives and negatives
because it ignores the true negatives. Its highest value is also 1,
achieved when there are no false positives or negatives.
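The geometric-mean definition can be sketched as follows (function name and example counts are illustrative):

```python
import math

def fowlkes_mallows(tp, fp, fn):
    """FM = geometric mean of pairwise precision and recall.
    Like Jaccard, it ignores the true negatives."""
    prec = tp / (tp + fp)
    recall = tp / (tp + fn)
    return math.sqrt(prec * recall)

# Illustrative pair counts.
score = fowlkes_mallows(tp=4, fp=3, fn=2)
```

When FP = FN = 0, precision and recall are both 1 and FM reaches its maximum of 1, as stated above.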
11. INTERNAL MEASURES: COHESION
AND SEPARATION
Cluster Cohesion (Compactness): Measures how closely related the
objects in a cluster are.
Cluster Separation: Measures how distinct or well-
separated a cluster is from other clusters.
Example: Squared Error
Cohesion is measured by the within-cluster sum of squares (SSE):
WSS = Σ_i Σ_{x ∈ C_i} (x − m_i)²
Separation is measured by the between-cluster sum of squares:
BSS = Σ_i |C_i| ∙ (m − m_i)²
where |C_i| is the size of cluster i, m_i is the centroid (mean) of
cluster i, and m is the overall mean of the data.
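The WSS/BSS decomposition can be sketched for 1-D data as follows (function names and the example clusters are illustrative). A useful check is that WSS + BSS equals the total sum of squares about the grand mean, so a lower WSS necessarily means a higher BSS for the same data.

```python
def mean(xs):
    return sum(xs) / len(xs)

def cohesion_separation(clusters):
    """WSS and BSS for 1-D clusters (lists of floats).
    WSS + BSS always equals the total sum of squares about the grand mean."""
    everything = [x for c in clusters for x in c]
    m = mean(everything)                      # overall mean
    wss = 0.0
    bss = 0.0
    for c in clusters:
        mi = mean(c)                          # cluster centroid
        wss += sum((x - mi) ** 2 for x in c)  # within-cluster spread
        bss += len(c) * (m - mi) ** 2         # weighted centroid distance
    return wss, bss

clusters = [[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]]
wss, bss = cohesion_separation(clusters)
tss = sum((x - 5.5) ** 2 for x in [1, 2, 3, 8, 9, 10])  # grand mean is 5.5
```

Here the two tight, well-separated groups give a small WSS (4.0) and a large BSS, and the two add up to the total sum of squares.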
18. XIE-BENI INDEX
In the definition of the XB-index, the numerator indicates the compactness
of the obtained clusters, while the denominator indicates the strength of
the separation between clusters.
The objective is to minimize the XB-index in order to achieve a proper
clustering.
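As a sketch, the crisp (hard-clustering) variant of the Xie-Beni index divides the total within-cluster squared error by the number of points times the minimum squared distance between centroids; this is an assumption about the exact form intended here, since the original XB-index was defined for fuzzy memberships. Function names and example clusters are illustrative.

```python
def xie_beni(clusters):
    """Crisp Xie-Beni index for 1-D clusters (lists of floats).
    Numerator: total within-cluster squared error (compactness).
    Denominator: n * minimum squared centroid separation.
    Lower values indicate a better clustering."""
    def mean(xs):
        return sum(xs) / len(xs)
    centroids = [mean(c) for c in clusters]
    n = sum(len(c) for c in clusters)
    compactness = sum(sum((x - m) ** 2 for x in c)
                      for c, m in zip(clusters, centroids))
    min_sep = min((a - b) ** 2
                  for i, a in enumerate(centroids)
                  for b in centroids[i + 1:])
    return compactness / (n * min_sep)

good = xie_beni([[1.0, 2.0, 3.0], [8.0, 9.0, 10.0]])   # compact, far apart
bad = xie_beni([[1.0, 2.0, 8.0], [3.0, 9.0, 10.0]])    # mixed-up clusters
```

The well-separated clustering yields a much smaller index than the mixed-up one, consistent with the minimization objective stated above.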