2. Measures of Similarity & Dissimilarity
• They are important because they are used by
a number of data mining techniques, such as
clustering, nearest neighbor classification, and
anomaly detection.
• The initial data set is not needed in many cases
once these similarities and dissimilarities have
been computed.
3. Similarity:
• Similarity between two objects is a
numerical measure of the degree
to which two objects are alike.
• Similarities are higher for pairs of
objects that are more alike.
• Similarities are usually non-negative
and often lie between 0 and 1:
• 0 (no Similarity)
• 1 (complete Similarity)
4. Dissimilarity:
• Dissimilarity between two objects is a numerical
measure of the degree to which two objects are
different.
• It is lower for more similar pairs of objects.
• The term distance is often used as a synonym for dissimilarity.
• Dissimilarities sometimes fall in the interval [0, 1], but it is
also common for them to range from 0 to infinity.
5. Data matrix:
• In a data matrix, the objects (with their attribute values)
are simply stored in the matrix.
• The entries do not represent any relationship
between objects.
6. Dissimilarity matrix:
• In a dissimilarity matrix, each element of the
matrix represents the relationship (dissimilarity)
between two entities.
• Note: Dissimilarity = 1 - similarity (when similarity lies in [0, 1]).
In the above example, zero (0) means that the two users are similar.
The farther a value is from zero, the more dissimilar they are.
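As a rough illustration of how a dissimilarity matrix can be derived from a data matrix, here is a minimal Python sketch. The data values, the function names, and the choice of dissimilarity (the fraction of attributes on which two objects differ) are assumptions made for illustration; they are not taken from the slides.

```python
def mismatch_dissimilarity(a, b):
    """Fraction of attributes on which two objects differ (0 = identical)."""
    differing = sum(1 for x, y in zip(a, b) if x != y)
    return differing / len(a)


def dissimilarity_matrix(data):
    """Return the n x n matrix of pairwise dissimilarities between rows."""
    n = len(data)
    return [[mismatch_dissimilarity(data[i], data[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical data matrix: one row per user, one column per attribute.
data = [
    ["A", "x", "low"],
    ["A", "y", "low"],
    ["B", "y", "high"],
]

matrix = dissimilarity_matrix(data)
for row in matrix:
    print([round(d, 2) for d in row])
```

The diagonal is always 0 (every object is identical to itself) and the matrix is symmetric, so in practice only the lower triangle needs to be stored.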
9. Example: Suppose we have an array of fruits:
Fruits = [“Apple”, “Banana”, “Orange”, “Apple”, “Grapes”]
Dissimilarity and Similarity Calculations:
1) Nominal Attribute:
Dissimilarity (d): Compare the 1st & 4th elements (“Apple” & “Apple”)
d = 0, if x = y
d = 1, if x ≠ y
“Apple” = “Apple”, so d = 0.
Similarity (s): Compare the 1st & 4th elements, then the 1st & 2nd elements
s = 1, if x = y
s = 0, if x ≠ y
“Apple” = “Apple”, so s = 1
“Apple” ≠ “Banana”, so s = 0
Likewise compare the other elements also.
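The nominal comparison can be sketched in Python as follows. The function names are hypothetical; the list is the fruit array from the example.

```python
fruits = ["Apple", "Banana", "Orange", "Apple", "Grapes"]


def nominal_dissimilarity(x, y):
    """d = 0 if the two values match, 1 otherwise."""
    return 0 if x == y else 1


def nominal_similarity(x, y):
    """s = 1 if the two values match, 0 otherwise (i.e. s = 1 - d)."""
    return 1 - nominal_dissimilarity(x, y)


# 1st vs 4th element: "Apple" vs "Apple"
print(nominal_dissimilarity(fruits[0], fruits[3]))
print(nominal_similarity(fruits[0], fruits[3]))

# 1st vs 2nd element: "Apple" vs "Banana"
print(nominal_similarity(fruits[0], fruits[1]))
```

Python lists are 0-indexed, so the "1st" element is `fruits[0]` and the "4th" is `fruits[3]`.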
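10. 2) Ordinal Attribute: numerical values arranged in rank order
Ex 1: Dataset: [10, 15, 20, 25, 30]
Dissimilarity (d):
Compare the 1st & 4th elements (10 & 25)
d = |p - q| / (n - 1) = |10 - 25| / (5 - 1)
= 3.75
Similarity (s):
Compare 10 & 25
s = 1 - d
= 1 - 3.75
= -2.75
(Here the dissimilarity is > 1, so the similarity becomes negative.)
Note: the formula is usually applied to rank positions rather than raw
values. 10 and 25 have ranks 1 and 4, which gives
d = |1 - 4| / (5 - 1) = 0.75 and keeps d within [0, 1].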
11. Ex 2: Dataset: [1, 3, 2, 4, 1]
d = |p - q| / (n - 1)
Compare the 1st and 4th elements (1 & 4)
d = |1 - 4| / (5 - 1)
= 0.75
s = 1 - d
= 1 - 0.75
= 0.25
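Both ordinal examples can be reproduced with a short Python sketch. The function name is hypothetical, and the rank-based variant at the end is an added assumption (it maps raw values to rank positions first), not something stated in the slides.

```python
def ordinal_dissimilarity(p, q, n):
    """d = |p - q| / (n - 1), with n taken as the number of values."""
    return abs(p - q) / (n - 1)


ex1 = [10, 15, 20, 25, 30]
ex2 = [1, 3, 2, 4, 1]

# Ex 1: 1st vs 4th element (10 and 25), using the raw values.
d1 = ordinal_dissimilarity(ex1[0], ex1[3], len(ex1))
s1 = 1 - d1
print(d1, s1)  # 3.75 -2.75

# Ex 2: 1st vs 4th element (1 and 4).
d2 = ordinal_dissimilarity(ex2[0], ex2[3], len(ex2))
s2 = 1 - d2
print(d2, s2)  # 0.75 0.25

# Assumed variant: replace each raw value in Ex 1 by its rank (1..n) first.
ranks = {value: i + 1 for i, value in enumerate(sorted(set(ex1)))}
d1_ranked = ordinal_dissimilarity(ranks[10], ranks[25], len(ranks))
print(d1_ranked)  # 0.75
```

Using ranks keeps the dissimilarity in [0, 1], so s = 1 - d also stays in [0, 1].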
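22. 2) Jaccard Coefficient
Jaccard Coefficient = No. of items bought by both groups /
Total no. of items bought by either group
Jaccard Coefficient = P11 / (P11 + P10 + P01)
where P11 = items bought by both, P10 = items bought by X only,
P01 = items bought by Y only.
      X   Y
P1    0   0
P2    0   1
P3    1   0
P4    1   1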
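A minimal Python sketch of the Jaccard coefficient for the table above. It assumes that rows P1 to P4 are items and that columns X and Y are the two groups' binary purchase vectors (1 = bought); that reading of the table, and the function name, are assumptions.

```python
def jaccard_coefficient(x, y):
    """P11 / (P11 + P10 + P01) for two equal-length binary vectors."""
    p11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)  # bought by both
    p10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)  # X only
    p01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)  # Y only
    total = p11 + p10 + p01
    # If neither group bought anything, the ratio is undefined; return 0.0 here.
    return p11 / total if total else 0.0


# Columns of the table, read top to bottom (P1..P4).
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]

print(jaccard_coefficient(x, y))
```

For this table P11 = 1 (P4), P10 = 1 (P3) and P01 = 1 (P2), so the coefficient is 1/3. Items bought by neither group (P1) are ignored.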
23. 3) Cosine Coefficient
cos(A, B) = (A⋅B) / (∥A∥ ∥B∥)
• A⋅B denotes the dot product of vectors A and B.
• ∥A∥ and ∥B∥ represent the magnitudes (or norms)
of vectors A and B respectively.
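Cosine similarity is the dot product of the two vectors divided by the product of their norms. A minimal Python sketch follows; the function name and the sample vectors are assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical vectors.
a = [1, 0, 1]
b = [1, 1, 0]

print(round(cosine_similarity(a, b), 4))  # 0.5
```

Vectors pointing in the same direction give a value of 1 regardless of their length, and orthogonal vectors give 0.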