Distance Measures
• Remember that K-Nearest Neighbors are determined on the
basis of some kind of “distance” between points.
• Two major classes of distance measure:
1. Euclidean: based on the position of points in some k-dimensional space.
2. Non-Euclidean: not related to position or space.
Scales of Measurement
• Applying a distance measure largely depends on the type of
input data
• Major scales of measurement:
1. Nominal Data (aka Nominal Scale Variables)
• Typically classification data, e.g. m/f
• no ordering, e.g. it makes no sense to state that M > F
• Binary variables are a special case of Nominal scale variables.
2. Ordinal Data (aka Ordinal Scale)
• ordered but differences between values are not important
• e.g., political parties on left to right spectrum given labels 0, 1, 2
• e.g., Likert scales, rank on a scale of 1..5 your degree of satisfaction
• e.g., restaurant ratings
Scales of Measurement
• Applying a distance function largely depends on the type of
input data
• Major scales of measurement:
3. Numeric type Data (aka interval scaled)
• Ordered and equal intervals. Measured on a linear scale.
• Differences make sense
• e.g., temperature (C,F), height, weight, age, date
Scales of Measurement
• Only certain operations can be performed on certain scales of measurement.
Nominal Scale: 1. Equality, 2. Count
Ordinal Scale: 1. Equality, 2. Count, 3. Rank (cannot quantify the difference)
Interval Scale: 1. Equality, 2. Count, 3. Rank, 4. Quantify the difference
Axioms of a Distance Measure
• d is a distance measure if it is a function from pairs of points to the reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 if x ≠ y.
Some Euclidean Distances
• L2 norm (also called the Euclidean distance):
– The most common notion of “distance.”
– d(i, j) = √( |x_i1 − x_j1|² + |x_i2 − x_j2|² + … + |x_ip − x_jp|² )
• L1 norm (also called the Manhattan distance):
– The distance if you had to travel along coordinates only.
– d(i, j) = |x_i1 − x_j1| + |x_i2 − x_j2| + … + |x_ip − x_jp|
Examples L1 and L2 norms
x = (5,5)
y = (9,8)
L2 norm: dist(x, y) = √(4² + 3²) = 5
L1 norm: dist(x, y) = 4 + 3 = 7
(The coordinate differences are 4 and 3, so the L1 path has length 7 and the straight-line distance is 5.)
Another Euclidean Distance
• L∞ norm : d(x,y) = the maximum of the
differences between x and y in any dimension.
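As a quick illustration (not part of the original slides), here is a minimal Python sketch of the three norms applied to the example points above; the function names are my own:

```python
import math

def l1(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf(x, y):
    """L-infinity norm: the largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2(x, y))    # 5.0
print(l1(x, y))    # 7
print(linf(x, y))  # 4
```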
Non-Euclidean Distances
• Jaccard measure for binary vectors
• Cosine measure = angle between vectors
from the origin to the points in question.
• Edit distance = number of inserts and
deletes to change one string into another.
Jaccard Measure
• A note about Binary variables first
– Symmetric binary variable
• If both states are equally valuable and carry the same weight, that
is, there is no preference on which outcome should be coded as 0
or 1.
• Like “gender” having the states male and female
– Asymmetric binary variable:
• If the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the
other by 0 (HIV negative).
– Given two asymmetric binary variables, the agreement of
two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
Jaccard Measure
• A contingency table for binary data
          j = 1   j = 0   sum
  i = 1     a       b     a + b
  i = 0     c       d     c + d
  sum     a + c   b + d     p

• Simple matching coefficient (invariant if the binary variable is symmetric):
d(i, j) = (b + c) / (a + b + c + d)
• Jaccard coefficient (non-invariant if the binary variable is asymmetric):
d(i, j) = (b + c) / (a + b + c)
Jaccard Measure Example
• Example
– All attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0
d(i, j) = (b + c) / (a + b + c)
Name Fever Cough Test-1 Test-2 Test-3 Test-4
Jack Y N P N N N
Mary Y N P N P N
Jim Y P N N N N
d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
d(jack, jim) = (1 + 1) / (1 + 1 + 1) = 0.67
d(jim, mary) = (1 + 2) / (1 + 1 + 2) = 0.75
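A small Python sketch of this computation (the encoding of Y/P as 1 and N as 0, and the function name, follow the slide's example rather than any standard library):

```python
def jaccard_dissimilarity(v1, v2):
    """d(i, j) = (b + c) / (a + b + c): negative matches (both 0) are ignored."""
    a = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 1)  # positive matches
    b = sum(1 for p, q in zip(v1, v2) if p == 1 and q == 0)
    c = sum(1 for p, q in zip(v1, v2) if p == 0 and q == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1 ... Test-4 with Y/P -> 1 and N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```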
Cosine Measure
• Think of a point as a vector from the origin
(0,0,…,0) to its location.
• Two points’ vectors make an angle, whose
cosine is the normalized dot-product of the
vectors.
– Example:
– p1.p2 = 2; |p1| = |p2| = √3.
– cos(θ) = 2/3; θ is about 48 degrees.
dist(p1, p2) = θ = arccos( p1·p2 / (|p1| |p2|) )
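A brief Python sketch of the cosine measure; the helper name is my own, and the final line simply re-checks the slide's numbers:

```python
import math

def cosine_distance(p1, p2):
    """Angle (in degrees) between the vectors from the origin to p1 and p2."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / (norm1 * norm2)))

# Re-checking the slide's example: cos(theta) = 2 / (sqrt(3) * sqrt(3)) = 2/3
print(round(math.degrees(math.acos(2 / 3)), 1))  # 48.2
```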
Edit Distance
• The edit distance of two strings is the
number of inserts and deletes of characters
needed to turn one into the other.
• Equivalently, d(x,y) = |x| + |y| -2|LCS(x,y)|.
– LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
Example
• x = abcde ; y = bcduve.
• LCS(x,y) = bcde.
• D(x,y) = |x| + |y| - 2|LCS(x,y)| = 5 + 6 –2*4 =
3.
• What is left?
• Normalize it to the range [0, 1]; we will study normalization formulas in the next lecture.
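A minimal Python sketch of this insert/delete-only edit distance via the LCS formula (a standard dynamic-programming LCS; the names are my own):

```python
def lcs_length(x, y):
    """Length of the longest common subsequence of x and y (dynamic programming)."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def edit_distance(x, y):
    """Insert/delete-only edit distance: d(x, y) = |x| + |y| - 2 * |LCS(x, y)|."""
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "bcduve"))  # 3  (the LCS is "bcde")
```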
Back to k-Nearest Neighbor (Pseudo-code)
• Missing-value imputation using k-NN.
• Input: dataset D, neighbourhood size K
• for each record x in D with at least one missing value:
– for each data object y in D:
• compute the distance d(x, y)
• save the distance and y in a similarity array S
– sort S in ascending order of distance
– pick the top K data objects from S
• Impute the missing attribute value(s) of x on the basis of the known
values of the K neighbours (use mean/median or mode).
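A runnable Python sketch of this pseudocode, under the simplifying assumptions that all attributes are numeric, missing values are represented by None, and the imputed value is the neighbours' mean (swap in median or mode as needed):

```python
import math
from statistics import mean

def distance(x, y, known):
    """Euclidean distance computed only over the attributes known in record x."""
    return math.sqrt(sum((x[i] - y[i]) ** 2 for i in known))

def knn_impute(dataset, k):
    """Fill missing values (None) in each record from its k nearest complete neighbours."""
    for x in dataset:
        missing = [i for i, v in enumerate(x) if v is None]
        if not missing:
            continue
        known = [i for i, v in enumerate(x) if v is not None]
        # Candidate neighbours: other records with no missing values in the needed columns.
        candidates = [y for y in dataset if y is not x
                      and all(y[i] is not None for i in known + missing)]
        candidates.sort(key=lambda y: distance(x, y, known))  # ascending distance
        neighbours = candidates[:k]
        for i in missing:
            x[i] = mean(y[i] for y in neighbours)
    return dataset

data = [[1.0, 2.0, None], [1.1, 2.1, 3.0], [0.9, 1.9, 2.8], [5.0, 5.0, 9.0]]
print(knn_impute(data, k=2))  # the None is filled with the mean of 3.0 and 2.8 (≈ 2.9)
```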
K-Nearest Neighbor Drawbacks
• The major drawbacks of this approach are:
– Choosing an appropriate distance function.
– Considering all attributes when attempting to retrieve similar examples.
– Searching through the whole dataset to find similar instances.
– Algorithm cost: ?
Noisy Data
• Noise: random error; the data are present but not correct.
– Data transmission errors
– Data entry problems
• Removing noise:
– Data smoothing (rounding, averaging within a window).
– Clustering/merging and detecting outliers.
• Data smoothing:
– First sort the data and partition it into (equi-depth) bins.
– Then smooth the values in each bin using bin means, bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
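A short Python sketch reproducing the binning example above (function names are my own; the bin means are rounded to whole dollars, as on the slide):

```python
def equi_depth_bins(values, depth):
    """Sort the data and split it into equal-frequency (equi-depth) bins."""
    values = sorted(values)
    return [values[i:i + depth] for i in range(0, len(values), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, depth=4)
print(smooth_by_means(bins))       # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins))  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```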
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
• Values that fall outside the set of clusters may be considered outliers.
Data Discretization
• The task of attribute (feature) discretization techniques is to discretize the values of continuous features into a small number of intervals, where each interval is mapped to a discrete symbol.
• Advantages:
– Simplified data description; easy-to-understand data and final data-mining results.
– Only a small set of interesting rules is mined.
– End-result processing time is decreased.
– End-result accuracy is improved.
Effect of Continuous Data on Results Accuracy
Records to classify:
  age    income   age   buys_computer
  <=30   medium    9    ?
  <=30   medium   11    ?
  <=30   medium   13    ?

Training data:
  age    income   age   buys_computer
  <=30   medium    9    no
  <=30   medium   10    no
  <=30   medium   11    no
  <=30   medium   12    no
Data Mining
• If ‘age <= 30’ and income = ‘medium’ and age =
‘9’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age =
‘10’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age =
‘11’ then buys_computer = ‘no’
• If ‘age <= 30’ and income = ‘medium’ and age =
‘12’ then buys_computer = ‘no’
Discover only those rules whose support (frequency) is >= 1.
Because age = 13 never appears in the training data, no mined rule covers that record, so the prediction accuracy drops to 66.7%.
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2), where Ent(Sk) = − Σi pi · log2(pi)
• Here pi is the probability of class i in interval Sk, determined by dividing the number of samples of class i in Sk by the total number of samples in Sk.
Example 1
ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
• Let Grade be the class attribute. Use entropy-based
discretization to divide the range of ages into different discrete
intervals.
• There are 6 possible boundaries. They are 21.5, 23, 24.5, 26,
31, and 38.
• Let us consider the boundary at T = 21.5 first.
Let S1 = {21}
Let S2 = {22, 24, 25, 27, 27, 27, 35, 41}
(Each boundary is the midpoint between consecutive distinct ages, e.g. (21 + 22) / 2 = 21.5 and (22 + 24) / 2 = 23.)
Example 1 (cont’)
• The number of elements in S1 and S2 are:
|S1| = 1
|S2| = 8
• The entropy of S1 is
Ent(S1) = −P(Grade=F) · log2 P(Grade=F) − P(Grade=P) · log2 P(Grade=P)
        = −1 · log2(1) − 0 · log2(0) = 0
• The entropy of S2 is
Ent(S2) = −P(Grade=F) · log2 P(Grade=F) − P(Grade=P) · log2 P(Grade=P)
        = −(2/8) · log2(2/8) − (6/8) · log2(6/8) ≈ 0.811

ID:    1   2   3   4   5   6   7   8   9
Age:   21  22  24  25  27  27  27  35  41
Grade: F   F   P   F   P   P   P   P   P
Example 1 (cont’)
• Hence, the entropy after partitioning at T =
21.5 is
E(S, 21.5) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
           = (1/9) · 0 + (8/9) · 0.811
           ≈ 0.721
Example 1 (cont’)
• The entropies after partitioning for all the boundaries are:
T = 21.5 → E(S, 21.5)
T = 23   → E(S, 23)
…
T = 38   → E(S, 38)
Select the boundary with the smallest entropy
Suppose best is T = 23
ID 1 2 3 4 5 6 7 8 9
Age 21 22 24 25 27 27 27 35 41
Grade F F P F P P P P P
Now recursively apply entropy
discretization upon both partitions
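The worked example can be checked with a short Python sketch (a straightforward implementation under my own naming; the last line simply reports whichever candidate boundary minimizes E(S, T) on this data):

```python
import math

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def partition_entropy(pairs, t):
    """Weighted entropy E(S, T) after splitting the (age, grade) pairs at boundary T."""
    left = [g for a, g in pairs if a <= t]
    right = [g for a, g in pairs if a > t]
    n = len(pairs)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

ages = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ['F', 'F', 'P', 'F', 'P', 'P', 'P', 'P', 'P']
pairs = list(zip(ages, grades))

distinct = sorted(set(ages))
boundaries = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]  # 21.5, 23, ..., 38

print(round(partition_entropy(pairs, 21.5), 3))  # 0.721, as in the worked example
print(min(boundaries, key=lambda t: partition_entropy(pairs, t)))  # boundary with smallest E(S, T)
```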
References
– G. Batista and M. Monard, “A Study of K-Nearest Neighbour as an Imputation Method”, 2002. (I will place it in the course folder.)
– “CS345 Lecture Notes”, Jeffrey D. Ullman, Stanford University. http://www-db.stanford.edu/~ullman/cs345-notes.html
– Vipin Kumar’s data-mining course offered at the University of Minnesota.
– Official textbook slides of Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, August 2000.