A KARUNASRI
CSE Dept.
Measures of Similarity & Dissimilarity
• They are important because they are used by a number of data mining techniques, such as clustering, nearest neighbor classification and anomaly detection.
• In many cases, the initial data set is no longer needed once these similarities and dissimilarities have been computed.
Similarity:
• Similarity between two objects is a numerical measure of the degree to which the two objects are alike.
• Similarities are higher for pairs of objects that are more alike.
• Similarities are usually non-negative and often lie between 0 and 1:
• 0 (no similarity)
• 1 (complete similarity)
Dissimilarity:
• Dissimilarity between two objects is a numerical measure of the degree to which the two objects are different.
• Dissimilarities are lower for pairs of objects that are more similar.
• The term distance is often used as a synonym for dissimilarity.
• Dissimilarities sometimes fall in the interval [0, 1], but it is also common for them to range from 0 to infinity.
Data matrix:
• In a data matrix, the objects are simply stored in the matrix, one row per object and one column per attribute.
• The entries do not represent any relationship between objects.
Dissimilarity matrix:
• In a dissimilarity matrix, each element represents a relationship (the dissimilarity) between two entities.
• Note: Dissimilarity = 1 - similarity (when the similarity lies in [0, 1]).
In the above example, zero (0) indicates that the two users are completely similar. The farther a value is from zero, the more dissimilar the two users are.
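The relationship between a data matrix and a dissimilarity matrix can be sketched in a few lines of Python. This is only an illustration: the three data rows are hypothetical, and the sum of absolute attribute differences is used as the dissimilarity purely as an assumption to keep the example short.

```python
def dissimilarity_matrix(objects, d):
    """Return the n x n matrix whose entry [i][j] is d(objects[i], objects[j])."""
    return [[d(a, b) for b in objects] for a in objects]


def abs_difference(a, b):
    """Sum of absolute attribute differences (assumed dissimilarity measure)."""
    return float(sum(abs(u - v) for u, v in zip(a, b)))


# Hypothetical data matrix: one row per object, one column per attribute.
data = [(1, 2), (1, 2), (4, 6)]

matrix = dissimilarity_matrix(data, abs_difference)
for row in matrix:
    print(row)
```

The diagonal is zero because every object is compared with itself, and the matrix is symmetric because the chosen measure does not depend on the order of its arguments.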
UNIT-1_Measuring similarity and dissmilarity.ppt
Similarity/Dissimilarity for objects with a single attribute:
x and y are the attribute values of the two data objects.
Example: Suppose we have an array of fruits:
Fruits = [“Apple”, “Banana”, “Orange”, “Apple”, “Grapes”]
Dissimilarity and Similarity Calculations:
1) Nominal Attribute:
Dissimilarity (d):
d = 0, if x = y
d = 1, if x ≠ y
Compare the 1st & 4th elements (“Apple” & “Apple”): “Apple” = “Apple”, so d = 0.
Similarity (s):
s = 1, if x = y
s = 0, if x ≠ y
Compare the 1st and 4th elements: Apple = Apple, so s = 1.
Compare the 1st and 2nd elements: Apple ≠ Banana, so s = 0.
Compare the other elements likewise.
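The nominal rule is a plain equality test. A minimal Python sketch follows; the function names are illustrative and not taken from the slides.

```python
def nominal_dissimilarity(x, y):
    """d = 0 if the two nominal values are equal, otherwise 1."""
    return 0 if x == y else 1


def nominal_similarity(x, y):
    """s = 1 if the two nominal values are equal, otherwise 0."""
    return 1 - nominal_dissimilarity(x, y)


fruits = ["Apple", "Banana", "Orange", "Apple", "Grapes"]

# 1st vs 4th element: both are "Apple".
print(nominal_dissimilarity(fruits[0], fruits[3]))  # 0
print(nominal_similarity(fruits[0], fruits[3]))     # 1

# 1st vs 2nd element: "Apple" vs "Banana".
print(nominal_similarity(fruits[0], fruits[1]))     # 0
```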
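2) Ordinal Attribute: values that have a meaningful order (rank)
The n ordered values are first mapped to ranks 0 to n-1; p and q are the ranks of the two values being compared.
d = |p - q| / (n - 1)
s = 1 - d
Ex 1: Dataset: [10, 15, 20, 25, 30], so n = 5 and the ranks are 0, 1, 2, 3, 4
Dissimilarity (d):
Compare the 1st & 4th elements (10 & 25, ranks 0 & 3)
d = |0 - 3| / (5 - 1)
= 0.75
Similarity:
s = 1 - 0.75
= 0.25
(If the raw values were used instead of ranks, d = |10 - 25| / (5 - 1) = 3.75 would be > 1 and s = 1 - d = -2.75 would be negative; using ranks keeps d in [0, 1].)
Ex 2: Dataset: [1, 3, 2, 4, 1], taking n = 5 (assuming the values are ratings on a five-level scale, 1 to 5)
Compare the 1st and 4th elements (1 & 4, ranks 0 & 3)
d = |0 - 3| / (5 - 1)
= 0.75
s = 1 - d
= 1 - 0.75
= 0.25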
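A minimal Python sketch of the ordinal calculation, assuming the usual convention that the ordered values are first mapped to integer ranks 0 to n-1 before the difference is taken. The `levels` argument (the full ordered scale) is an assumption of this sketch, as is the five-level scale used for the second data set.

```python
def ordinal_dissimilarity(x, y, levels):
    """d = |rank(x) - rank(y)| / (n - 1), where levels lists the scale in order."""
    rank = {value: i for i, value in enumerate(levels)}
    n = len(levels)
    return abs(rank[x] - rank[y]) / (n - 1)


def ordinal_similarity(x, y, levels):
    """s = 1 - d."""
    return 1 - ordinal_dissimilarity(x, y, levels)


# Ex 1: the five observed values form the ordered scale.
levels_1 = [10, 15, 20, 25, 30]
print(ordinal_dissimilarity(10, 25, levels_1))  # ranks 0 and 3 -> 0.75
print(ordinal_similarity(10, 25, levels_1))     # 0.25

# Ex 2: values assumed to be ratings on a five-level scale 1..5.
levels_2 = [1, 2, 3, 4, 5]
print(ordinal_dissimilarity(1, 4, levels_2))    # 0.75
```

Because ranks never differ by more than n - 1, d always stays in [0, 1] and s = 1 - d never becomes negative.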
3) Interval Attribute:
Dataset: [10, 15, 20, 25, 30]
d = |x - y|
Pair 1 (10, 15): d = |10 - 15| = 5
Pair 2 (10, 20): d = |10 - 20| = 10
Pair 3 (10, 25): d = |10 - 25| = 15
Pair 4 (10, 30): d = |10 - 30| = 20
.........
Pair 10 (25, 30): d = |25 - 30| = 5
The dissimilarity d can be converted into a similarity s in several ways:
1) s = -d
Pair 1: s = -5
Pair 2: s = -10
Pair 3: s = -15
Pair 4: s = -20
.........
Pair 10: s = -5
2) s = 1 / (1 + d)
Pair 1: s = 1 / (1 + 5) = 1/6
Pair 2: s = 1 / (1 + 10) = 1/11
Pair 3: s = 1 / (1 + 15) = 1/16
.........
Pair 10: s = 1 / (1 + 5) = 1/6
3) s = e^(-d)
Pair 1: s = e^(-5)
Pair 2: s = e^(-10)
Pair 3: s = e^(-15)
.........
Pair 10: s = e^(-5)
4) s = 1 - (d - min_d) / (max_d - min_d), with min_d = 5 and max_d = 20 over all pairs
Pair 1: s = 1 - (5 - 5) / (20 - 5) = 1
Pair 2: s = 1 - (10 - 5) / (20 - 5) = 2/3
.........
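The four conversions from dissimilarity to similarity can be compared side by side with a short Python sketch. It assumes the pairs are enumerated in the order produced by `itertools.combinations`, which gives (10, 15) first and (25, 30) last, and that the fourth conversion is the min-max form s = 1 - (d - min_d) / (max_d - min_d).

```python
import math
from itertools import combinations

data = [10, 15, 20, 25, 30]

# All 10 unordered pairs, from (10, 15) to (25, 30).
pairs = list(combinations(data, 2))
d = [abs(x - y) for x, y in pairs]
d_min, d_max = min(d), max(d)

s_negated = [-v for v in d]                                    # 1) s = -d
s_reciprocal = [1 / (1 + v) for v in d]                        # 2) s = 1 / (1 + d)
s_exponential = [math.exp(-v) for v in d]                      # 3) s = e^(-d)
s_min_max = [1 - (v - d_min) / (d_max - d_min) for v in d]     # 4) min-max scaling

for pair, dist, s in zip(pairs, d, s_min_max):
    print(pair, dist, round(s, 3))
```

Only the second, third and fourth conversions keep s within [0, 1]; the first simply reverses the ordering of the pairs.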
Dissimilarities between data objects with multiple numeric attributes
1) Euclidean Distance:
d(a, b) = sqrt((a1 - b1)^2 + (a2 - b2)^2 + ... + (an - bn)^2), where a and b are two objects with n numeric attributes.
Minkowski Distance
dist(a, b) = (|a1 - b1|^r + |a2 - b2|^r + ... + |an - bn|^r)^(1/r), where a and b are two objects with n numeric attributes and r is a parameter.
a) Manhattan distance if r = 1
For two points (x1, y1) and (x2, y2): dist = |x1 - x2| + |y1 - y2|
Dist between p1 and p1 = 0
p1 and p2 = |0 - 2| + |2 - 0| = 2 + 2 = 4
p1 and p3 = |0 - 3| + |2 - 1| = 3 + 1 = 4
.............
p4 and p4 = 0
Distance Matrix
b) Euclidean distance if r = 2
For two points (x1, y1) and (x2, y2): dist = sqrt((x1 - x2)^2 + (y1 - y2)^2)
Distance Matrix
c) Minkowski distance if r = 3
dist = (|xA - xB|^3 + |yA - yB|^3)^(1/3)
L3   P1    P2    P3    P4
P1   0     2.52  3.04  5.01
P2   2.52  0     1.26  3.04
P3   3.04  1.26  0     2
P4   5.01  3.04  2     0
Distance Matrix
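The three cases r = 1, 2, 3 differ only in the exponent, so one function covers all of them. The Python sketch below takes p1 = (0, 2), p2 = (2, 0) and p3 = (3, 1) from the worked Manhattan calculations; the coordinates p4 = (5, 1) are an assumption, inferred from the p1-p4 and p3-p4 entries of the L3 matrix.

```python
def minkowski(p, q, r):
    """Minkowski distance of order r between two points of equal dimension."""
    return sum(abs(a - b) ** r for a, b in zip(p, q)) ** (1 / r)


# p1..p3 come from the worked example; p4 is an assumed value (see lead-in).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}


def distance_matrix(r):
    """Pairwise Minkowski distances of order r, rounded to 2 decimals."""
    names = list(points)
    return [
        [round(minkowski(points[a], points[b], r), 2) for b in names]
        for a in names
    ]


for r, label in [(1, "Manhattan (r=1)"), (2, "Euclidean (r=2)"), (3, "Minkowski (r=3)")]:
    print(label)
    for row in distance_matrix(r):
        print(row)
```

For r = 1 the first row reproduces the hand calculation (p1 to p2 is 4, p1 to p3 is 4). Larger r gives more weight to the attribute with the largest difference.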
Similarities between data objects
1) Simple Matching Coefficient
2) Jaccard Coefficient
3) Cosine Coefficient
4) Correlation
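1) Simple Matching Coefficient:
SMC = no. of matching attribute values / total no. of attributes
SMC = (P4 + P1) / (P4 + P1 + P3 + P2)
= (P11 + P00) / (P11 + P00 + P01 + P10)
where each P counts the binary attributes with the given combination of values in objects X and Y:
           X  Y
P1 (P00)   0  0
P2 (P01)   0  1
P3 (P10)   1  0
P4 (P11)   1  1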
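A minimal Python sketch of the simple matching coefficient for two binary vectors. The vectors `x` and `y` are hypothetical, and the helper name `binary_counts` is illustrative.

```python
from collections import Counter


def binary_counts(x, y):
    """Return (P11, P00, P01, P10) for two equal-length binary vectors."""
    c = Counter(zip(x, y))
    return c[(1, 1)], c[(0, 0)], c[(0, 1)], c[(1, 0)]


def smc(x, y):
    """Simple matching coefficient: matching attributes / all attributes."""
    p11, p00, p01, p10 = binary_counts(x, y)
    return (p11 + p00) / (p11 + p00 + p01 + p10)


# Hypothetical binary objects with six attributes each.
x = [1, 0, 0, 1, 1, 0]
y = [1, 1, 0, 0, 1, 0]

print(binary_counts(x, y))  # (2, 2, 1, 1)
print(smc(x, y))            # (2 + 2) / 6
```

SMC counts 0-0 matches as agreement, which is appropriate when both values of the attribute carry equal information.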
2) Jaccard Coefficient
Jaccard Coefficient = no. of items bought by both groups / total no. of items bought by either group
Jaccard Coefficient = P11 / (P11 + P10 + P01)
The 0-0 matches (P00) are ignored:
           X  Y
P1 (P00)   0  0
P2 (P01)   0  1
P3 (P10)   1  0
P4 (P11)   1  1
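The Jaccard coefficient can be sketched in Python as follows. The item list and the two purchase vectors are hypothetical, and returning 0.0 when neither group bought anything is a convention assumed here, not something stated in the slides.

```python
def jaccard(x, y):
    """Jaccard coefficient P11 / (P11 + P10 + P01) for two binary vectors."""
    p11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    mismatches = sum(1 for a, b in zip(x, y) if a != b)  # P10 + P01
    denominator = p11 + mismatches
    # Assumed convention: two all-zero vectors get similarity 0.0.
    return p11 / denominator if denominator else 0.0


# Hypothetical purchases: 1 means the group bought the item.
items = ["milk", "bread", "eggs", "tea", "rice"]
group_x = [1, 1, 1, 0, 0]
group_y = [0, 1, 1, 1, 0]

print(jaccard(group_x, group_y))  # 2 shared items / 4 items bought by either group
```

Unlike the simple matching coefficient, the item that neither group bought ("rice") does not raise the score, which suits sparse data such as market baskets.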
3) Cosine Coefficient
cos(A, B) = (A⋅B) / (∥A∥ ∥B∥)
• A⋅B denotes the dot product of vectors A and B.
• ∥A∥ and ∥B∥ represent the magnitudes (or norms) of vectors A and B respectively.
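A minimal Python sketch of the cosine coefficient, built directly from the dot product and the two norms. The example vectors are hypothetical; the function does not handle zero vectors, for which the measure is undefined.

```python
import math


def cosine_similarity(a, b):
    """cos(A, B) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Raises ZeroDivisionError if either vector has zero length.
    return dot / (norm_a * norm_b)


# Hypothetical vectors: dot product 24, both norms 5.
a = [3, 4]
b = [4, 3]
print(cosine_similarity(a, b))             # 24 / 25

# Orthogonal vectors have no similarity.
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
```

Cosine similarity depends only on the angle between the vectors, so scaling either vector by a positive constant leaves the result unchanged.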