UNIT-1_Measuring similarity and dissmilarity.ppt

Measures of Similarity & Dissimilarity
• They are important because they are used by
a number of data mining techniques, such as
clustering, nearest neighbor classification and
anomaly detection.
• The initial data set is not needed in many cases
once these similarities and dissimilarities have
been computed.

Similarity:
• Similarity between two objects is a
numerical measure of the degree
to which two objects are alike.
• Similarities are higher for pairs of
objects that are more alike.
• Similarities are usually non-negative
and often lie between 0 and 1:
• 0 (no similarity)
• 1 (complete similarity)

Dissimilarity:
• Dissimilarity between two objects is a numerical
measure of the degree to which two objects are
different.
• Dissimilarity is lower for more similar pairs of objects.
• The term distance is often used as a synonym for dissimilarity.
• It often falls in the interval [0,1], but it is also common
for it to range from 0 to infinity.

Data matrix:
• In a data matrix, the objects and their attribute values are simply
stored in the matrix (one row per object, one column per attribute).
• The entries do not represent any relationship between objects.

Dissimilarity matrix:
• In a dissimilarity matrix, each element of the
matrix represents the relationship (dissimilarity)
between two entities.
• Note: Dissimilarity = 1 - similarity (when similarity lies in [0,1]).
In the example, zero (0) means that
both users are identical.
The farther a value is from zero,
the more dissimilar the two users
are.
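
The note "Dissimilarity = 1 - similarity" can be sketched in a few lines of Python. The similarity values and the three-user setup below are made up purely for illustration; they are not taken from the slides.

```python
# Hypothetical pairwise similarities between three users (all in [0, 1]).
# These numbers are illustrative assumptions, not data from the slides.
similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.4],
    [0.1, 0.4, 1.0],
]

def to_dissimilarity(sim):
    """Apply d = 1 - s to every entry of a similarity matrix."""
    # round() only hides floating-point noise such as 0.19999999999999996
    return [[round(1.0 - s, 10) for s in row] for row in sim]

dissimilarity = to_dissimilarity(similarity)
for row in dissimilarity:
    print(row)
```

The diagonal becomes 0 (each user compared with itself), and larger entries mark more dissimilar pairs.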

Similarity/Dissimilarity for objects with a
single attribute:
x & y are the attribute values of the two data objects

Example: Suppose we have an array of fruits:
Fruits = [“Apple”, “Banana”, “Orange”, “Apple”, “Grapes”]
Dissimilarity and Similarity Calculations:
1) Nominal Attribute:
Dissimilarity (d):
d=0, if x=y
d=1, if x≠y
Compare the 1st & 4th elements (“Apple” & “Apple”):
“Apple”=“Apple”, so d=0.
Similarity (s):
s=1, if x=y
s=0, if x≠y
Compare the 1st & 4th elements: “Apple”=“Apple”, so s=1.
Compare the 1st & 2nd elements: “Apple”≠“Banana”, so s=0.
The other pairs of elements are compared in the same way.
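
The nominal rule can be written as a short Python sketch. The function names are illustrative choices; the fruit list is the one from the example.

```python
fruits = ["Apple", "Banana", "Orange", "Apple", "Grapes"]

def nominal_dissimilarity(x, y):
    """d = 0 if the two values are equal, otherwise d = 1."""
    return 0 if x == y else 1

def nominal_similarity(x, y):
    """s = 1 if the two values are equal, otherwise s = 0."""
    return 1 if x == y else 0

# 1st vs 4th element: "Apple" and "Apple"
print(nominal_dissimilarity(fruits[0], fruits[3]))  # 0
print(nominal_similarity(fruits[0], fruits[3]))     # 1
# 1st vs 2nd element: "Apple" and "Banana"
print(nominal_similarity(fruits[0], fruits[1]))     # 0
```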

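2) Ordinal Attribute: values arranged in rank order
d = |p-q| / (n-1), where p and q are the ranks (0 to n-1) of the
two values and n is the number of values
S = 1-d
Ex 1: Dataset: [10,15,20,25,30] -> ranks [0,1,2,3,4], n = 5
Dissimilarity(d):
Compare 1st & 4th elements (10 & 25, ranks 0 & 3)
d = |0-3| / (5-1)
= 0.75
Similarity:
S = 1-0.75
= 0.25
(Using the raw values instead of ranks gives |10-25| / (5-1) = 3.75
and S = -2.75, which falls outside [0,1]; mapping to ranks first
avoids this.)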
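Ex 2: Dataset: [1,3,2,4,1]
d = |p-q| / (n-1)
Compare 1st and 4th elements (1 & 4), with n = 5
d = |1-4| / (5-1)
= 0.75
S = 1-d
= 1-0.75
= 0.25
(If n is instead taken as the number of distinct levels, 4, then
d = |1-4| / (4-1) = 1 and S = 0.)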
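
The ordinal formula d = |p-q| / (n-1) can be sketched as below. The helper `to_ranks` and the choice to pass n explicitly are assumptions made for illustration; n is set to the dataset length, as in the examples.

```python
def ordinal_dissimilarity(p, q, n):
    """d = |p - q| / (n - 1) for two ordinal positions p and q."""
    return abs(p - q) / (n - 1)

def to_ranks(values):
    """Map each distinct value to its rank 0..k-1 in sorted order."""
    return {v: i for i, v in enumerate(sorted(set(values)))}

data1 = [10, 15, 20, 25, 30]
ranks = to_ranks(data1)  # {10: 0, 15: 1, 20: 2, 25: 3, 30: 4}

# Ex 1 with ranks: stays inside [0, 1]
d_rank = ordinal_dissimilarity(ranks[10], ranks[25], n=len(data1))
# Ex 1 with raw values: leaves [0, 1], so s = 1 - d turns negative
d_raw = ordinal_dissimilarity(10, 25, n=len(data1))

# Ex 2: the values already act as ranks
data2 = [1, 3, 2, 4, 1]
d2 = ordinal_dissimilarity(data2[0], data2[3], n=len(data2))

print(d_rank, 1 - d_rank)  # 0.75 0.25
print(d_raw, 1 - d_raw)    # 3.75 -2.75
print(d2, 1 - d2)          # 0.75 0.25
```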

3) Interval Attribute:
Data set: [10,15,20,25,30]
Dissimilarity: d=|x-y|
Pair 1 (10,15) = |10-15| = 5
Pair 2 (10,20) = |10-20| = 10
Pair 3 (10,25) = |10-25| = 15
Pair 4 (10,30) = |10-30| = 20
.........
Pair 10 (25,30) = |25-30| = 5
Similarity (common transformations of d):
1) s = -d
Pair 1: s = -5
Pair 2: s = -10
Pair 3: s = -15
Pair 4: s = -20
............
Pair 10: s = -5
2) s = 1/(1+d)
Pair 1: s = 1/(1+5) = 1/6
Pair 2: s = 1/(1+10) = 1/11
Pair 3: s = 1/(1+15) = 1/16
..........
Pair 10: s = 1/(1+5) = 1/6
3) s = e^(-d)
Pair 1: s = e^(-5)
Pair 2: s = e^(-10)
Pair 3: s = e^(-15)
....................
Pair 10: s = e^(-5)
4) s = 1 - (d - min_d) / (max_d - min_d), with min_d = 5 and max_d = 20
Pair 1: s = 1 - (5-5)/(20-5) = 1
Pair 2: s = 1 - (10-5)/(20-5) = 2/3
..........
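
The interval-attribute calculations can be reproduced with a short Python sketch. The function names are illustrative; the pairs are generated with `itertools.combinations`, which gives the same ordering as the list above, from Pair 1 (10,15) to Pair 10 (25,30).

```python
import math
from itertools import combinations

data = [10, 15, 20, 25, 30]
pairs = list(combinations(data, 2))       # 10 unordered pairs
dists = [abs(x - y) for x, y in pairs]    # d = |x - y|
min_d, max_d = min(dists), max(dists)     # 5 and 20

def sim_negate(d):
    return -d

def sim_inverse(d):
    return 1 / (1 + d)

def sim_exp(d):
    return math.exp(-d)

def sim_minmax(d, lo, hi):
    """s = 1 - (d - min_d) / (max_d - min_d)."""
    return 1 - (d - lo) / (hi - lo)

for (x, y), d in zip(pairs, dists):
    print((x, y), d, sim_negate(d), sim_inverse(d),
          sim_exp(d), sim_minmax(d, min_d, max_d))
```

The min-max form is the only one of the four that is guaranteed to land in [0, 1] for this data set.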

Dissimilarities b/w data objects with multiple numeric
attributes
1) Euclidean Distance:
Example:

a) Manhattan distance if r=1
dist = |x1-x2|+|y1-y2|
Dist b/w p1 and p1 = 0
p1 and p2 = |0-2|+|2-0| = 2+2 = 4
p1 and p3 = |0-3|+|2-1| = 3+1 = 4
.............
p4 and p4 = 0
Distance Matrix

b) Euclidean distance if r=2
Distance Matrix

c) Minkowski distance if r=3
Dist = (|XA-XB|^3 + |YA-YB|^3)^(1/3)
L3    P1    P2    P3    P4
P1    0     2.52  3.04  5.01
P2    2.52  0     1.26  3.04
P3    3.04  1.26  0     2
P4    5.01  3.04  2     0
Distance Matrix
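
All three cases (r=1, r=2, r=3) come from the same Minkowski formula, which the sketch below illustrates. The coordinates of p1, p2 and p3 follow from the Manhattan calculations above; p4 = (5, 1) is an assumption inferred from the P1-P4 and P3-P4 entries of the distance matrix, since the point table itself is not reproduced in the text.

```python
# p1..p3 follow from the worked Manhattan distances; p4 is an assumption.
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def minkowski(a, b, r):
    """(sum_k |a_k - b_k|^r)^(1/r); r=1 Manhattan, r=2 Euclidean."""
    return sum(abs(u - v) ** r for u, v in zip(a, b)) ** (1 / r)

def distance_matrix(pts, r):
    """Pairwise Minkowski distances, rounded to 2 decimals."""
    names = list(pts)
    return [[round(minkowski(pts[i], pts[j], r), 2) for j in names]
            for i in names]

for r in (1, 2, 3):
    print(f"r = {r}")
    for name, row in zip(points, distance_matrix(points, r)):
        print(name, row)
```

Every matrix is symmetric with zeros on the diagonal, because the distance from a point to itself is 0 and dist(a, b) = dist(b, a).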

Similarities b/w data objects
1) Simple Matching Coefficient
2) Jaccard Coefficient
3) Cosine Coefficient
4) Correlation
1) Simple Matching Coefficient:
SMC = no. of matching attribute values / total no. of attributes
SMC = (P4 + P1) / (P4 + P1 + P3 + P2)
= (P11 + P00) / (P11 + P00 + P01 + P10)
X Y
P1 0 0
P2 0 1
P3 1 0
P4 1 1
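
A minimal Python sketch of the Simple Matching Coefficient follows. The two binary vectors are hypothetical examples, not data from the slides; the counts p00, p01, p10 and p11 correspond to the rows P1 to P4 of the table.

```python
def pattern_counts(x, y):
    """Count positions by (x, y) pattern; keys are '00', '01', '10', '11'."""
    counts = {"00": 0, "01": 0, "10": 0, "11": 0}
    for a, b in zip(x, y):
        counts[f"{a}{b}"] += 1
    return counts

def smc(x, y):
    """SMC = (p11 + p00) / (p11 + p00 + p01 + p10)."""
    c = pattern_counts(x, y)
    total = sum(c.values())
    return (c["11"] + c["00"]) / total

# Hypothetical binary attribute vectors for two objects
x = [1, 0, 0, 1, 1, 0]
y = [1, 1, 0, 0, 1, 0]

print(pattern_counts(x, y))  # {'00': 2, '01': 1, '10': 1, '11': 2}
print(smc(x, y))             # 4 matches out of 6 attributes
```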

2) Jaccard Coefficient
Jaccard Coefficient = No. of items bought by both groups /
Total no. of items bought by either group
Jaccard Coefficient = P11 / (P11 + P10 + P01)
X Y
P1 0 0
P2 0 1
P3 1 0
P4 1 1
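
The Jaccard coefficient differs from the simple matching coefficient only in ignoring the 0-0 matches (P00). A minimal Python sketch, with hypothetical purchase vectors (1 = item bought, 0 = not bought) chosen for illustration:

```python
def jaccard(x, y):
    """J = p11 / (p11 + p10 + p01); 0-0 matches are ignored."""
    p11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    p10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    p01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    denom = p11 + p10 + p01
    if denom == 0:
        # Neither group bought anything; returning 0.0 here is a convention
        # chosen for this sketch, not something stated in the slides.
        return 0.0
    return p11 / denom

# Hypothetical purchases of six items by two groups
group_x = [1, 0, 0, 1, 1, 0]
group_y = [1, 1, 0, 0, 1, 0]

print(jaccard(group_x, group_y))  # 2 shared items out of 4 bought by either
```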

3) Cosine Coefficient
cos(A, B) = (A⋅B) / (∥A∥ ∥B∥)
• A⋅B denotes the dot product of vectors A and B.
• ∥A∥ and ∥B∥ represent the magnitudes (or norms)
of vectors A and B respectively.
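
The cosine coefficient, the dot product divided by the product of the two norms, can be sketched in plain Python. The example vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (||A|| * ||B||)."""
    dot = sum(u * v for u, v in zip(a, b))
    norm_a = math.sqrt(sum(u * u for u in a))
    norm_b = math.sqrt(sum(v * v for v in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical vectors: b points in the same direction as a, c is
# orthogonal to d.
a, b = [1, 2, 3], [2, 4, 6]
c, d = [1, 0], [0, 1]

print(cosine_similarity(a, b))  # close to 1: same direction
print(cosine_similarity(c, d))  # 0.0: orthogonal vectors
```

Because only the angle matters, scaling a vector (as with a and b here) does not change the result.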
