Measures of Similarity and Dissimilarity
Unit - II
Data Mining
● Similarity and dissimilarity are important because they are used by a number of data mining techniques, such as
○ clustering,
○ nearest neighbor classification, and
○ anomaly detection.
● Proximity is used to refer to either similarity or dissimilarity. Two cases are considered:
○ proximity between objects having only one simple attribute, and
○ proximity measures for objects with multiple attributes.
● Similarity between two objects is a numerical measure of the degree to which the two objects are alike.
○ Similarity is higher for pairs of objects that are more alike.
○ Similarities are usually non-negative.
○ They often fall between 0 (no similarity) and 1 (complete similarity).
● Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different.
○ Dissimilarity is lower for pairs of objects that are more similar.
○ The term distance is often used as a synonym for dissimilarity.
Transformations
● Transformations are often applied to
○ convert a similarity to a dissimilarity,
○ convert a dissimilarity to a similarity, or
○ transform a proximity measure to fall within a particular range, such as [0, 1].
● Example
○ Suppose similarities between objects range from 1 (not at all similar) to 10 (completely similar).
○ We can make them fall within the range [0, 1] by using the transformation
■ s’ = (s − 1)/9
■ s: original similarity value
■ s’: new similarity value
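The rescaling in the example can be sketched in Python as follows. The function names are illustrative, and the conversion d = 1 − s is only one common choice for turning a similarity in [0, 1] into a dissimilarity; it is an assumption here, not something the slide specifies.

```python
def rescale_similarity(s, s_min=1.0, s_max=10.0):
    """Map a similarity from [s_min, s_max] onto [0, 1] (min-max scaling)."""
    return (s - s_min) / (s_max - s_min)


def similarity_to_dissimilarity(s):
    """Assumed conversion for a similarity already in [0, 1]: d = 1 - s."""
    return 1.0 - s


for s in (1, 5.5, 10):
    s_new = rescale_similarity(s)
    print(s, "->", s_new, "| dissimilarity:", similarity_to_dissimilarity(s_new))
```

With the default bounds this is exactly s’ = (s − 1)/9; other ranges only need different s_min and s_max values.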
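Dissimilarities between Data Objects
Euclidean Distance
● d(x, y) = sqrt( Σ_{k=1..n} (x_k − y_k)² ), where n is the number of attributes (dimensions) and x_k and y_k are the k-th attributes of x and y.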
If d(x, y) is the distance between two points, x and y, then the following properties hold.
1. Positivity
(a) d(x, y) ≥ 0 for all x and y,
(b) d(x, y) = 0 only if x = y.
2. Symmetry
d(x, y) = d(y, x) for all x and y.
3. Triangle Inequality
d(x, z) ≤ d(x, y) + d(y, z) for all points x, y, and z.
Note: Measures that satisfy all three properties are known as metrics.
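As a rough illustration, the three properties can be checked numerically on a handful of points. The sample points and the helper name is_metric_on are made up for this sketch, and passing the check on a finite sample does not prove that a measure is a metric; it can only expose violations.

```python
import itertools
import math

# Hypothetical sample points in two dimensions.
points = [(0, 2), (2, 0), (3, 1), (5, 1)]


def d(x, y):
    """Euclidean distance between two points."""
    return math.dist(x, y)


def is_metric_on(points, d, tol=1e-12):
    """Check positivity, symmetry and the triangle inequality on the given points."""
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < 0:                      # 1(a) positivity
            return False
        if (d(x, y) == 0) != (x == y):       # 1(b) zero distance only for identical points
            return False
        if abs(d(x, y) - d(y, x)) > tol:     # 2 symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # 3 triangle inequality
            return False
    return True


print(is_metric_on(points, d))
```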
Non-metric Dissimilarities: Set Differences
If A = {1, 2, 3, 4} and B = {2, 3, 4}, then A − B = {1} and B − A = ∅, the empty set.
If d(A, B) = size(A − B), then d does not satisfy the second part of the positivity property, the symmetry property, or the triangle inequality.
The modified measure d(A, B) = size(A − B) + size(B − A) satisfies all of these properties.
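The two set-based measures can be tried out directly with Python's built-in sets. The function names below are illustrative only.

```python
A = {1, 2, 3, 4}
B = {2, 3, 4}


def d_one_sided(a, b):
    """size(a - b): not symmetric, and can be 0 for two different sets."""
    return len(a - b)


def d_symmetric(a, b):
    """size(a - b) + size(b - a): the modified, symmetric measure."""
    return len(a - b) + len(b - a)


print(d_one_sided(A, B), d_one_sided(B, A))  # 1 0
print(d_symmetric(A, B), d_symmetric(B, A))  # 1 1
```

The one-sided version gives d(B, A) = 0 even though A and B differ, which is the positivity violation described above.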
Non-metric Dissimilarities: Time
A dissimilarity measure that is not a metric, but is still useful:
d(1PM, 2PM) = 1 hour
d(2PM, 1PM) = 23 hours
● Example: this definition makes sense when answering the question, “If an event occurs at 1PM every day, and it is now 2PM, how long do I have to wait for that event to occur again?”
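One way to sketch this time-of-day measure is with modular arithmetic on 24-hour clock values. The function name d_time and the use of whole hours (0 to 23) are assumptions made for the example.

```python
def d_time(t1, t2):
    """Hours elapsed going forward on the clock from hour t1 to hour t2 (0-23)."""
    return (t2 - t1) % 24


print(d_time(13, 14))  # 1PM -> 2PM: 1
print(d_time(14, 13))  # 2PM -> 1PM: 23
```

Because d_time(13, 14) differs from d_time(14, 13), the measure is not symmetric and therefore not a metric.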
Distance in Python
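The original slide content is not available here, so the following is only a minimal sketch of how Euclidean distance might be computed in Python. It implements d(x, y) = sqrt(Σ (x_k − y_k)²) by hand and compares the result with the standard-library function math.dist (Python 3.8+); the sample points are made up.

```python
import math


def euclidean_distance(x, y):
    """Square root of the sum of squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


p1 = (0, 2)
p2 = (2, 0)
print(euclidean_distance(p1, p2))  # manual computation
print(math.dist(p1, p2))           # standard-library equivalent
```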
Similarities between Data Objects
● Typical properties of similarities are the following:
○ 1. s(x, y) = 1 only if x = y (where 0 ≤ s ≤ 1).
○ 2. s(x, y) = s(y, x) for all x and y (symmetry).
● A Non-symmetric Similarity Measure
○ Task: classify a small set of characters that are flashed on a screen.
○ The confusion matrix records how often each character is classified as itself, and how often each is classified as another character.
○ “0” appeared 200 times and was classified as
■ “0” 160 times,
■ “o” 40 times.
○ “o” appeared 200 times and was classified as
■ “o” 170 times,
■ “0” only 30 times.
● A similarity measure can be made symmetric by setting
○ s’(x, y) = s’(y, x) = (s(x, y) + s(y, x))/2, where s’ is the new similarity measure.
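The symmetrization can be sketched with the counts from the example. Treating the fraction of times x was classified as y as the similarity s(x, y) is an assumption made for this illustration; the slide gives only the raw counts.

```python
# counts[shown][classified_as], taken from the example above
counts = {
    "0": {"0": 160, "o": 40},
    "o": {"o": 170, "0": 30},
}


def s(x, y):
    """Fraction of times character x was classified as y (not symmetric)."""
    return counts[x][y] / sum(counts[x].values())


def s_sym(x, y):
    """Symmetrized similarity: the average of s(x, y) and s(y, x)."""
    return (s(x, y) + s(y, x)) / 2


print(s("0", "o"), s("o", "0"))          # the two directions differ
print(s_sym("0", "o"), s_sym("o", "0"))  # identical after averaging
```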
Examples of proximity measures
● Similarity Measures for Binary Data
○ Similarity measures between objects that contain only binary attributes are called similarity coefficients.
○ Let x and y be two objects that consist of n binary attributes.
○ The comparison of two such objects (that is, two binary vectors) leads to the following four quantities (frequencies):
■ f00 = the number of attributes where x is 0 and y is 0
■ f01 = the number of attributes where x is 0 and y is 1
■ f10 = the number of attributes where x is 1 and y is 0
■ f11 = the number of attributes where x is 1 and y is 1
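A short sketch of how the four frequencies might be counted for two binary vectors. The vectors and the helper name binary_counts are illustrative.

```python
from collections import Counter


def binary_counts(x, y):
    """Return (f00, f01, f10, f11) for two equal-length sequences of 0/1 values."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of attributes")
    pairs = Counter(zip(x, y))
    return pairs[(0, 0)], pairs[(0, 1)], pairs[(1, 0)], pairs[(1, 1)]


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print(binary_counts(x, y))  # (7, 2, 1, 0)
```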
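Simple Matching Coefficient (SMC)
○ SMC = (number of matching attribute values) / (number of attributes) = (f11 + f00) / (f01 + f10 + f11 + f00)
Jaccard Coefficient
○ J = (number of 1-1 matches) / (number of attributes not involved in 0-0 matches) = f11 / (f01 + f10 + f11)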
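The two coefficients can be sketched as below, using the standard definitions SMC = (f11 + f00) / (f00 + f01 + f10 + f11) and J = f11 / (f01 + f10 + f11). The sample vectors are made up, and returning 0.0 when both vectors are all zeros is a convention assumed for this sketch, since the Jaccard coefficient is undefined in that case.

```python
def smc(x, y):
    """Simple matching coefficient: fraction of attributes on which x and y agree."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Jaccard coefficient: 1-1 matches divided by attributes that are not 0-0 matches."""
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    not_00 = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    return f11 / not_00 if not_00 else 0.0  # assumed convention for all-zero vectors


x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]
print("SMC:", smc(x, y))          # 0.7
print("Jaccard:", jaccard(x, y))  # 0.0
```

The gap between the two results shows why Jaccard is often preferred for sparse, asymmetric binary data: the many 0-0 matches inflate SMC but are ignored by Jaccard.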
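Cosine similarity (Document similarity)
If x and y are two document vectors, then
cos(x, y) = (x · y) / (‖x‖ ‖y‖),
where x · y = Σ_k x_k y_k is the dot product and ‖x‖ = sqrt(x · x) is the length of x.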
# import required libraries
import numpy as np
from numpy.linalg import norm

# define two vectors as NumPy arrays
A = np.array([2, 1, 2, 3, 2, 9])
B = np.array([3, 4, 2, 4, 5, 5])
print("A:", A)
print("B:", B)

# compute cosine similarity: dot product divided by the product of the vector lengths
cosine = np.dot(A, B) / (norm(A) * norm(B))
print("Cosine Similarity:", cosine)
● Cosine similarity is a measure of the (cosine of the) angle between x and y.
● Cosine similarity = 1: the angle between x and y is 0°, and x and y are the same except for magnitude (length).
● Cosine similarity = 0: the angle between x and y is 90°, and x and y do not share any terms (words).
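The two extreme cases can be illustrated with small term-count vectors. This sketch uses only the standard library, and the vectors are made up for the example; a zero vector would cause a division by zero and is not handled here.

```python
import math


def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


doc = [3, 2, 0, 5]        # hypothetical term counts
scaled = [6, 4, 0, 10]    # same direction, twice the magnitude
disjoint = [0, 0, 7, 0]   # shares no terms with doc

print(cosine_similarity(doc, scaled))    # approximately 1 (angle of 0 degrees)
print(cosine_similarity(doc, disjoint))  # 0.0 (angle of 90 degrees)
```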
Note: Dividing x and y by their lengths normalizes them to have a length of 1, which means that cosine similarity does not take the magnitude of the two vectors into account.
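Extended Jaccard Coefficient (Tanimoto Coefficient)
EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y)
This measure can be used for document data and reduces to the Jaccard coefficient for binary attributes.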
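A minimal sketch of the extended Jaccard (Tanimoto) coefficient, using the standard definition EJ(x, y) = (x · y) / (‖x‖² + ‖y‖² − x · y). The function name and sample vectors are illustrative, and the case where both vectors are all zeros (a zero denominator) is not handled.

```python
def extended_jaccard(x, y):
    """Tanimoto coefficient: dot product over (squared lengths minus dot product)."""
    dot = sum(a * b for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return dot / (sq_x + sq_y - dot)


# For binary vectors the result equals the ordinary Jaccard coefficient:
# here f11 = 2 and f01 + f10 + f11 = 3.
print(extended_jaccard([1, 1, 0, 1], [1, 0, 0, 1]))

# Identical vectors give 1.
print(extended_jaccard([1, 2, 3], [1, 2, 3]))
```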
Pearson’s correlation
corr(x, y) = covariance(x, y) / (standard_deviation(x) × standard_deviation(y))
● The more tightly linear the relationship between two variables X and Y is, the closer Pearson's correlation coefficient (PCC) is to −1 or +1.
○ PCC = +1 if the relationship is perfectly linear and positive:
■ an increase in the value of one variable increases the value of the other variable.
○ PCC = −1 if the relationship is perfectly linear and negative:
■ an increase in the value of one variable decreases the value of the other variable.
○ PCC = 0 if the variables are linearly uncorrelated (there is no linear relationship between them).
Pearson’s correlation (automatic, using scipy.stats.pearsonr())
Pearson’s correlation (manual, in Python)
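The slide's own code is not available here, so the following is only one possible manual implementation, using the definition corr(x, y) = covariance(x, y) / (std(x) · std(y)). The 1/(n − 1) factors in the covariance and the standard deviations cancel, so they are left out. The sample data are made up, and a constant input (zero variance) would cause a division by zero.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


x = [1, 2, 3, 4, 5]
print(pearson(x, [2, 4, 6, 8, 10]))   # perfectly linear, positive
print(pearson(x, [10, 8, 6, 4, 2]))   # perfectly linear, negative

# A non-linear relationship (y = x squared) can still have zero correlation.
u = [-3, -2, -1, 0, 1, 2, 3]
v = [a * a for a in u]
print(pearson(u, v))
```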
