Classification Approaches
Dr G.Shyama Chandra Prasad
Associate Professor
Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Types of Data
• Types of data typically occurring in applications
• Recognizing and understanding the different data types is an important component of proper data use and interpretation
• As in programming languages, types define the operations allowed on data
Data and Variables
Data are often discussed in terms of variables, where a variable is:
Any characteristic that varies from one member of a population to another.
A simple example is height in centimeters, which varies from person to person.
Types of Variables
There are two basic types of variables: numerical and categorical.
Numerical Variables: variables to which a number is assigned as a quantitative value.
Categorical Variables: variables defined by the classes or categories into which an individual member falls.
Types of Numerical Variables
• Discrete: Reflects a number obtained by counting—
no decimal.
• Continuous: Reflects a measurement; the number
of decimal places depends on the precision of the
measuring device.
• Ratio scale: Order and distance implied. Differences can
be compared; has a true zero. Ratios can be compared.
Examples: Height, weight, blood pressure
• Interval scale: Order and distance implied. Differences
can be compared; no true zero. Ratios cannot be
compared.
Example: Temperature in Celsius.
Categorical Variables
Defined by the classes or categories into which an individual member falls.
• Nominal Scale: Name only -- Gender, hair color, ethnicity
• Ordinal Scale: Nominal categories with an implied order -- Low, medium, high.
NOMINAL SCALE (example questionnaire item)
b. Appearance of plasma:
1. Clear
2. Turbid
9. Not done
ORDINAL SCALE (example questionnaire item)
81. Urine protein (dipstick reading):
1. Negative
2. Trace
3. 30 mg% or +
4. 100 mg% or ++
5. 300 mg% or +++
6. 1000 mg% or ++++
If urine protein is 3+ or above, be sure the subject gets a 24-hour urine collection container and instructions.
Likert Scale
Question: Compared to others, what is your satisfaction rating of the National Practitioner Data Bank?
1 - Very Satisfied
2 - Somewhat Satisfied
3 - Neutral
4 - Somewhat Dissatisfied
5 - Very Dissatisfied
Datasets and Data Tables
Dataset: Data for a group of variables for a collection of persons.
Data Table: A dataset organized into a table, with one column for each variable and one row for each person.
Data Terminology
• Data Matrix arrangement
• Columns for attributes (also called features)
• Rows for objects
Typical Data Table
OBS AGE BMI FFNUM TEMP(°F) GENDER EXERCISE LEVEL QUESTION
1 26 23.2 0 61.0 0 1 1
2 30 30.2 9 65.5 1 3 2
3 32 28.9 17 59.6 1 3 4
4 37 22.4 1 68.4 1 2 3
5 33 25.5 7 64.5 0 3 5
6 29 22.3 1 70.2 0 2 2
7 32 23.0 0 67.3 0 1 1
8 33 26.3 1 72.8 0 3 1
9 32 22.2 3 71.5 0 1 4
10 33 29.1 5 63.2 1 1 4
11 26 20.8 2 69.1 0 1 3
12 34 20.9 4 73.6 0 2 3
13 31 36.3 1 66.3 0 2 5
14 31 36.4 0 66.9 1 1 5
15 27 28.6 2 70.2 1 2 2
16 36 27.5 2 68.5 1 3 3
17 35 25.6 143 67.8 1 3 4
18 31 21.2 11 70.7 1 1 2
19 36 22.7 8 69.8 0 2 1
20 33 28.1 3 67.8 0 2 1
Definitions for Variables
• AGE: Age in years
• BMI: Body mass index, weight/height² in kg/m²
• FFNUM: The average number of times eating “fast food”
in a week
• TEMP: High temperature for the day
• GENDER: 1- Female 0- Male
• EXERCISE LEVEL: 1- Low 2- Medium 3- High
• QUESTION: Compared to others, what is your satisfaction
rating of the National Practitioner Data Bank?
1- Very Satisfied 2- Somewhat Satisfied 3- Neutral
4- Somewhat Dissatisfied 5- Very Dissatisfied
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Euclidean Distance
• Euclidean Distance:
  dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance
[Figure: scatter plot of the four points in the x-y plane]
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
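As a minimal sketch (not part of the slides), the distance matrix above can be reproduced in Python; the point coordinates are taken from the slide:

```python
import math

# Points from the slide: p1(0,2), p2(2,0), p3(3,1), p4(5,1)
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # dist(p, q) = sqrt(sum over k of (p_k - q_k)^2)
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Reproduce the distance matrix, rounded to 3 decimals
for a in points:
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
```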
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance:
  dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}
  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this is the Hamming distance, which is just the
number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. "supremum" (L_max norm, L∞ norm) distance.
• This is the maximum difference between any component of the vectors
• Example: L_infinity of (1, 0, 2) and (6, 0, 3) = max(5, 0, 1) = 5
• Do not confuse r with n, i.e., all these distances are defined
for all numbers of dimensions.
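The three special cases can be sketched in a few lines of Python (an illustration, not from the slides; the function names `minkowski` and `supremum` are ours):

```python
def minkowski(p, q, r):
    # dist(p, q) = (sum over k of |p_k - q_k|^r)^(1/r)
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def supremum(p, q):
    # The limit as r -> infinity: the maximum component difference
    return max(abs(pk - qk) for pk, qk in zip(p, q))

p, q = (1, 0, 2), (6, 0, 3)
print(minkowski(p, q, 1))  # city block: 5 + 0 + 1 = 6.0
print(minkowski(p, q, 2))  # Euclidean: sqrt(26)
print(supremum(p, q))      # L_infinity: max(5, 0, 1) = 5
```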
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrices
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
Mahalanobis Distance
mahalanobis(p, q) = (p - q) \Sigma^{-1} (p - q)^T
where \Sigma is the covariance matrix of the input data X:
\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)
[Figure: scatter of correlated data with points P, A, and B. For the two red points shown, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.]
When the covariance matrix is the identity matrix, the Mahalanobis distance is the same as the Euclidean distance.
Useful for detecting outliers.
Q: What is the shape of the data when the covariance matrix is the identity?
Q: Is A closer to P or to B?
Mahalanobis Distance
Covariance Matrix:
\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
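The two values above can be checked with a small pure-Python sketch (assumption: `Mahal` on the slide is the quadratic form (p − q) Σ⁻¹ (p − q)ᵀ without a square root, matching the formula on the previous slide):

```python
# Covariance matrix from the slide; 2x2 inverse in closed form
S = [[0.3, 0.2], [0.2, 0.3]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]      # 0.09 - 0.04 = 0.05
S_inv = [[ S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det,  S[0][0] / det]]       # [[6, -4], [-4, 6]]

def mahalanobis(p, q):
    # (p - q) S^{-1} (p - q)^T
    d = [p[0] - q[0], p[1] - q[1]]
    return (d[0] * (S_inv[0][0] * d[0] + S_inv[0][1] * d[1])
          + d[1] * (S_inv[1][0] * d[0] + S_inv[1][1] * d[1]))

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis(A, B), 6))  # 5.0
print(round(mahalanobis(A, C), 6))  # 4.0
```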
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
• A distance that satisfies these properties is a
metric, and a space is called a metric space
Common Properties of a Similarity
• Similarities, also have some well known
properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points
(data objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Distance/Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of value-1-to-value-1 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
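The worked example can be verified directly in Python (a sketch; the variable names are ours):

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four attribute-wise cases
m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # (0 + 7) / 10 = 0.7
jaccard = m11 / (m01 + m10 + m11)            # 0 / 3 = 0.0
print(smc, jaccard)
```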
Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d.
• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
  cos(d1, d2) = 0.3150; distance = 1 − cos(d1, d2)
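The same computation in a short Python sketch (illustration only; names are ours):

```python
import math

# Document vectors from the slide
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 3 + 2 = 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6) ≈ 2.449

cosine = dot / (norm1 * norm2)
print(round(cosine, 4))  # ≈ 0.315
```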
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
Prediction Problems: Classification vs. Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
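As an illustration of the two steps (not from the slides), here is a toy 1-nearest-neighbor classifier: "model construction" amounts to storing the labeled training tuples, and accuracy is estimated on an independent test set. The data and names are hypothetical:

```python
import math

# Hypothetical labeled tuples: (attributes, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((4.5, 3.8), "B")]
test_set = [((0.9, 1.1), "A"), ((4.2, 4.0), "B")]

def classify(x):
    # Model usage: label x by its nearest training example
    nearest = min(training, key=lambda t: math.dist(x, t[0]))
    return nearest[1]

# Accuracy rate: fraction of test samples whose known label
# matches the classified result
correct = sum(classify(x) == label for x, label in test_set)
print(correct / len(test_set))  # 1.0
```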
Classification and regression
• What is classification? What is regression?
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Rule based
• K-NN
• SVM
• Bagging/Boosting
Issues: Evaluating Classification Methods
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from the actual known value
• Loss function: measures the error between y_i and the predicted value y_i'
  • Absolute error: |y_i - y_i'|
  • Squared error: (y_i - y_i')^2
• Test error (generalization error): the average loss over the test set
  • Mean absolute error: \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|
  • Mean squared error: \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2
  • Relative absolute error: \sum_{i=1}^{d} |y_i - y_i'| \,/\, \sum_{i=1}^{d} |y_i - \bar{y}|
  • Relative squared error: \sum_{i=1}^{d} (y_i - y_i')^2 \,/\, \sum_{i=1}^{d} (y_i - \bar{y})^2
• The mean squared error exaggerates the presence of outliers
• Popularly used: the (square) root mean squared error; similarly, the root relative squared error
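A quick Python sketch of the measures above on a hypothetical set of predictions (the numbers are invented for illustration):

```python
# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.5, 5.0, 3.0, 7.0]
d = len(y_true)
y_bar = sum(y_true) / d  # mean of the actual values

mae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / d
mse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / d
rmse = mse ** 0.5        # root mean squared error
rae = (sum(abs(y - yp) for y, yp in zip(y_true, y_pred))
       / sum(abs(y - y_bar) for y in y_true))
rse = (sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
       / sum((y - y_bar) ** 2 for y in y_true))
print(mae, mse, rmse, rae, rse)
```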