2. Measures of Similarity and Dissimilarity
● Similarity and dissimilarity are important because they are used by
a number of data mining techniques, such as
○ clustering,
○ nearest neighbor classification, and
○ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity. Two cases are considered:
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.
3. Measures of Similarity and Dissimilarity
● Similarity between two objects is a numerical measure of the
degree to which the two objects are alike.
○ Similarity is higher for pairs of objects that are more alike.
○ Similarities are usually non-negative,
○ and often fall between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the
degree to which the two objects are different.
○ Dissimilarity is lower for pairs of objects that are more similar.
○ Distance is frequently used as a synonym for dissimilarity.
4. Measures of Similarity and Dissimilarity
Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity, or
○ transform a proximity measure to fall within a particular range, such as [0, 1].
● Example
○ Suppose similarities between objects range from 1 (not at all similar) to 10 (completely similar).
○ We can make them fall within the range [0, 1] by using the transformation
■ s′ = (s − 1)/9
■ s: original similarity value
■ s′: new similarity value
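The transformation above can be sketched in Python as follows. The function names are illustrative, and the conversion d = 1 − s is only one common way to turn a similarity in [0, 1] into a dissimilarity; it is an assumption here, not something the example prescribes.

```python
def rescale_similarity(s, s_min=1.0, s_max=10.0):
    """Map a similarity from the range [s_min, s_max] onto [0, 1]."""
    return (s - s_min) / (s_max - s_min)


def similarity_to_dissimilarity(s):
    """Assumed conversion: turn a similarity in [0, 1] into a dissimilarity in [0, 1]."""
    return 1.0 - s


for s in (1, 5.5, 10):
    s_new = rescale_similarity(s)
    print(s, s_new, similarity_to_dissimilarity(s_new))
```

With the default range this is exactly s′ = (s − 1)/9, so s = 1 maps to 0 and s = 10 maps to 1.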
6. Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Euclidean Distance
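For reference, the Euclidean distance between two points x and y with n attributes is the square root of the sum of the squared attribute differences, d(x, y) = sqrt(Σ (x_k − y_k)²) for k = 1..n. A minimal sketch, with an illustrative function name and made-up points:

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points given as equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


p1 = (0, 2)
p2 = (3, 6)
print(euclidean_distance(p1, p2))  # 5.0, since 3**2 + 4**2 = 25
```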
7. Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Note: Measures that satisfy all three properties are known as metrics.
12. Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Set Differences
If A = {1, 2, 3, 4} and B = {2, 3, 4},
then A − B = {1} and
B − A = ∅, the empty set.
If d(A, B) = size(A − B), then d does not satisfy the second part of the
positivity property, the symmetry property, or the triangle inequality.
The modified measure d(A, B) = size(A − B) + size(B − A) satisfies all three properties.
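The difference between the two set-based measures can be checked with Python sets, using the A and B from the example. The function names are illustrative.

```python
def one_sided_difference(a, b):
    """d(A, B) = size(A - B); not a metric."""
    return len(a - b)


def symmetric_difference_size(a, b):
    """d(A, B) = size(A - B) + size(B - A); the modified measure."""
    return len(a - b) + len(b - a)


A = {1, 2, 3, 4}
B = {2, 3, 4}

print(one_sided_difference(A, B))  # 1
print(one_sided_difference(B, A))  # 0, although A != B (positivity and symmetry fail)
print(symmetric_difference_size(A, B), symmetric_difference_size(B, A))  # 1 1
```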
13. Measures of Similarity and Dissimilarity
Dissimilarities between Data Objects
Non-metric Dissimilarities: Time
A dissimilarity measure that is not a metric, but is still useful:
d(1PM, 2PM) = 1 hour
d(2PM, 1PM) = 23 hours
● Example: this measure answers the question, “If an event occurs at 1PM
every day, and it is now 2PM, how long do I have to wait for that event to
occur again?”
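One way to sketch this time measure is with modular arithmetic on a 24-hour clock. The function name and the integer-hour representation (13 for 1PM, 14 for 2PM) are assumptions made for illustration.

```python
def hours_until(now_hour, event_hour):
    """Hours to wait from now_hour until the next daily occurrence of event_hour.

    Both arguments are whole hours on a 24-hour clock (0-23).
    """
    return (event_hour - now_hour) % 24


print(hours_until(13, 14))  # 1  -> d(1PM, 2PM) = 1 hour
print(hours_until(14, 13))  # 23 -> d(2PM, 1PM) = 23 hours
```

The two results differ, so the measure is not symmetric and therefore not a metric.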
15. Measures of Similarity and Dissimilarity
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y. (0 ≤ s ≤ 1)
○ 2. s(x, y) = s(y, x) for all x and y. (Symmetry)
● A Non-symmetric Similarity Measure
○ Consider an experiment in which people classify a small set of characters flashed on a screen.
○ The confusion matrix records how often each character is classified as itself,
and how often each is classified as another character.
○ “0” appeared 200 times and was classified as
■ “0” 160 times,
■ “o” 40 times.
○ “o” appeared 200 times and was classified as
■ “o” 170 times,
■ “0” only 30 times.
● A similarity measure can be made symmetric by setting
○ s′(x, y) = s′(y, x) = (s(x, y) + s(y, x))/2, where s′ is the new similarity measure.
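The symmetrization can be sketched with the counts from the example. Taking s(x, y) to be the fraction of appearances of x that were classified as y is an assumption made here for illustration; the names `counts`, `appearances`, `s`, and `s_sym` are hypothetical.

```python
# Confusion counts from the example: (shown character, classified as) -> count
counts = {
    ("0", "0"): 160, ("0", "o"): 40,
    ("o", "o"): 170, ("o", "0"): 30,
}
appearances = {"0": 200, "o": 200}


def s(x, y):
    """Assumed similarity: fraction of times x was classified as y."""
    return counts[(x, y)] / appearances[x]


def s_sym(x, y):
    """Symmetric similarity: average of s(x, y) and s(y, x)."""
    return (s(x, y) + s(y, x)) / 2


print(s("0", "o"), s("o", "0"))  # 0.2 0.15 (not symmetric)
print(s_sym("0", "o") == s_sym("o", "0"))  # True
```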
16. Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary
attributes are called similarity coefficients.
○ Let x and y be two objects that consist of n binary attributes.
○ Comparing two such objects (i.e., two binary vectors) leads to
the following four quantities (frequencies):
f00 = the number of attributes where x is 0 and y is 0
f01 = the number of attributes where x is 0 and y is 1
f10 = the number of attributes where x is 1 and y is 0
f11 = the number of attributes where x is 1 and y is 1
17. Measures of Similarity and Dissimilarity
Examples of proximity measures
● Similarity Measures for Binary Data
Simple Matching Coefficient (SMC)
SMC = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient
J = f11 / (f01 + f10 + f11)
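A minimal sketch of both coefficients, computed from the four frequencies using the standard definitions. The function names and the example vectors are illustrative, and returning 0.0 for the Jaccard coefficient when both vectors are all zeros is an assumed convention (the ratio is undefined in that case).

```python
def binary_frequencies(x, y):
    """Count f00, f01, f10, f11 for two equal-length binary vectors."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    f = {"f00": 0, "f01": 0, "f10": 0, "f11": 0}
    for xk, yk in zip(x, y):
        f[f"f{xk}{yk}"] += 1
    return f


def smc(x, y):
    """Simple matching coefficient: matching attributes / all attributes."""
    f = binary_frequencies(x, y)
    return (f["f11"] + f["f00"]) / len(x)


def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches / attributes that are not 0-0 matches."""
    f = binary_frequencies(x, y)
    denom = f["f01"] + f["f10"] + f["f11"]
    return f["f11"] / denom if denom else 0.0  # assumed convention for all-zero vectors


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_frequencies(x, y))  # {'f00': 7, 'f01': 2, 'f10': 1, 'f11': 0}
print(smc(x, y), jaccard(x, y))  # 0.7 0.0
```

The two vectors share many 0-0 matches but no 1-1 matches, so SMC is high while the Jaccard coefficient is 0. This is why the Jaccard coefficient is often preferred for sparse, asymmetric binary attributes.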
19. Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
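If x and y are two document vectors, then cos(x, y) = (x · y) / (‖x‖ ‖y‖),
where x · y is the dot product of x and y, and ‖x‖ is the length of x.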
21. Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
# import required libraries
import numpy as np
from numpy.linalg import norm
# define two arrays
A = np.array([2,1,2,3,2,9])
B = np.array([3,4,2,4,5,5])
print("A:", A)
print("B:", B)
# compute cosine similarity
cosine = np.dot(A,B)/(norm(A)*norm(B))
print("Cosine Similarity:", cosine)
22. Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
● Cosine similarity is a measure of the (cosine of the) angle between x and y.
● Cosine similarity = 1: the angle is 0°, and x and y are the same except for magnitude (length).
● Cosine similarity = 0: the angle is 90°, and x and y do not share any terms (words).
23. Measures of Similarity and Dissimilarity
Examples of proximity measures
Cosine similarity (Document similarity)
Note:
Dividing x and y by their lengths normalizes them to have a length of 1, which means that
magnitude is not considered when computing cosine similarity.
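The effect of normalization can be illustrated with a short pure-Python sketch. The helper names and the example vectors are made up for illustration, and non-zero vectors are assumed (a zero vector has no direction and would cause a division by zero).

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xk * yk for xk, yk in zip(x, y))


def normalize(x):
    """Divide a non-zero vector by its length so that it has length 1."""
    n = math.sqrt(dot(x, x))
    return [xk / n for xk in x]


def cosine_similarity(x, y):
    """Cosine similarity as the dot product of the normalized vectors."""
    return dot(normalize(x), normalize(y))


x = [3, 2, 0, 5]
y = [1, 0, 0, 2]
y_scaled = [10 * yk for yk in y]  # same direction as y, ten times the magnitude

print(cosine_similarity(x, y))
print(cosine_similarity(x, y_scaled))  # same value, up to floating-point rounding
print(cosine_similarity([1, 0], [0, 1]))  # 0.0, no shared terms
```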
24. Measures of Similarity and Dissimilarity
Examples of proximity measures
Extended Jaccard Coefficient (Tanimoto Coefficient)
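The extended Jaccard (Tanimoto) coefficient is commonly defined as EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y). A minimal sketch under that definition, with illustrative names and data, and assuming that x and y are not both zero vectors:

```python
def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(xk * yk for xk, yk in zip(x, y))


def extended_jaccard(x, y):
    """Extended Jaccard (Tanimoto) coefficient; undefined if x and y are both all zeros."""
    xy = dot(x, y)
    return xy / (dot(x, x) + dot(y, y) - xy)


x = [1, 0, 1, 1]
y = [1, 1, 0, 1]
print(extended_jaccard(x, y))  # 0.5
print(extended_jaccard(x, x))  # 1.0
```

For binary vectors this reduces to the ordinary Jaccard coefficient: here f11 = 2, f01 = 1, and f10 = 1, which also gives 2/4 = 0.5.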
25. Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
26. Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation
● The more tightly linear the relationship between two variables X and Y,
the closer Pearson's correlation coefficient (PCC) is to −1 or +1.
○ PCC = −1 if the relationship is perfectly linear and negative:
■ an increase in the value of one variable decreases the value of the other.
○ PCC = +1 if the relationship is perfectly linear and positive:
■ an increase in the value of one variable increases the value of the other.
○ PCC = 0 if the variables are linearly uncorrelated.
28. Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation (automatic, using scipy.stats.pearsonr())
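A short sketch of the library call, assuming SciPy is installed. The data are made up so that y is an exact negative multiple of x, so the coefficient should be approximately −1.

```python
from scipy.stats import pearsonr

# Illustrative data: y = -x / 3, a perfectly linear negative relationship
x = [-3, 6, 0, 3, -6]
y = [1, -2, 0, -1, 2]

# pearsonr returns the correlation coefficient and a p-value
r, p_value = pearsonr(x, y)
print("Pearson correlation:", r)
```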
29. Measures of Similarity and Dissimilarity
Examples of proximity measures
Pearson’s correlation (manual, in Python)
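A minimal manual sketch using only the standard library. It uses the definition corr(x, y) = covariance(x, y) / (std(x) · std(y)); the 1/(n − 1) factors cancel, so only the sums of products of deviations are needed. The function name and data are illustrative, and neither variable may be constant (that would give a zero denominator).

```python
import math


def pearson_correlation(x, y):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means
    sxy = sum((xk - mean_x) * (yk - mean_y) for xk, yk in zip(x, y))
    sxx = sum((xk - mean_x) ** 2 for xk in x)
    syy = sum((yk - mean_y) ** 2 for yk in y)
    return sxy / math.sqrt(sxx * syy)


print(pearson_correlation([-3, 6, 0, 3, -6], [1, -2, 0, -1, 2]))  # -1.0
print(pearson_correlation([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```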