Classification Approaches
Dr G.Shyama Chandra Prasad
Associate Professor
Classification: Basic Concepts
• Classification: Basic Concepts
• Decision Tree Induction
• Bayes Classification Methods
• Rule-Based Classification
• Model Evaluation and Selection
• Techniques to Improve Classification Accuracy: Ensemble Methods
• Summary
Types of Data
• Types of data typically occurring in applications
• Recognizing and understanding the different data types is an important component of proper data use and interpretation
• As in programming languages, types define the operations allowed on data
Data and Variables
Data are often discussed in terms of variables, where a variable is:
Any characteristic that varies from one member of a population to another.
A simple example is height in centimeters, which varies from person to person.
Types of Variables
There are two basic types of variables: numerical and categorical.
Numerical Variables: variables to which a number is assigned as a quantitative value.
Categorical Variables: variables defined by the classes or categories into which an individual member falls.
Types of Numerical Variables
• Discrete: Reflects a number obtained by counting—
no decimal.
• Continuous: Reflects a measurement; the number
of decimal places depends on the precision of the
measuring device.
• Ratio scale: Order and distance implied. Differences can
be compared; has a true zero. Ratios can be compared.
Examples: Height, weight, blood pressure
• Interval scale: Order and distance implied. Differences
can be compared; no true zero. Ratios cannot be
compared.
Example: Temperature in Celsius.
Categorical Variables
Defined by the classes or categories into which an individual member falls.
• Nominal Scale: Name only -- Gender, hair color, ethnicity
• Ordinal Scale: Nominal categories with an implied order -- Low, medium, high.
NOMINAL SCALE (example questionnaire item)
b. Appearance of plasma:
1. Clear
2. Turbid
9. Not done
ORDINAL SCALE (example questionnaire item)
81. Urine protein (dipstick reading):
1. Negative
2. Trace
3. 30 mg% or +
4. 100 mg% or ++
5. 300 mg% or +++
6. 1000 mg% or ++++
If urine protein is 3+ or above, be sure the subject gets a 24-hour urine collection container and instructions.
Likert Scale
Question: Compared to others, what is your satisfaction rating of the National Practitioner Data Bank?
1 - Very Satisfied
2 - Somewhat Satisfied
3 - Neutral
4 - Somewhat Dissatisfied
5 - Very Dissatisfied
Datasets and Data Tables
Dataset: Data for a group of variables for a collection of persons.
Data Table: A dataset organized into a table, with one column for each variable and one row for each person.
Data Terminology
• Data Matrix arrangement
• Columns for attributes (also called features)
• Rows for objects
Typical Data Table
OBS AGE BMI FFNUM TEMP(°F) GENDER EXERCISE LEVEL QUESTION
1 26 23.2 0 61.0 0 1 1
2 30 30.2 9 65.5 1 3 2
3 32 28.9 17 59.6 1 3 4
4 37 22.4 1 68.4 1 2 3
5 33 25.5 7 64.5 0 3 5
6 29 22.3 1 70.2 0 2 2
7 32 23.0 0 67.3 0 1 1
8 33 26.3 1 72.8 0 3 1
9 32 22.2 3 71.5 0 1 4
10 33 29.1 5 63.2 1 1 4
11 26 20.8 2 69.1 0 1 3
12 34 20.9 4 73.6 0 2 3
13 31 36.3 1 66.3 0 2 5
14 31 36.4 0 66.9 1 1 5
15 27 28.6 2 70.2 1 2 2
16 36 27.5 2 68.5 1 3 3
17 35 25.6 143 67.8 1 3 4
18 31 21.2 11 70.7 1 1 2
19 36 22.7 8 69.8 0 2 1
20 33 28.1 3 67.8 0 2 1
Definitions for Variables
• AGE: Age in years
• BMI: Body mass index, weight/height² in kg/m²
• FFNUM: The average number of times eating “fast food”
in a week
• TEMP: High temperature for the day
• GENDER: 1- Female 0- Male
• EXERCISE LEVEL: 1- Low 2- Medium 3- High
• QUESTION: Compared to others, what is your satisfaction
rating of the National Practitioner Data Bank?
1- Very Satisfied 2- Somewhat Satisfied 3- Neutral
4- Somewhat Dissatisfied 5- Very Dissatisfied
Similarity and Dissimilarity
• Similarity
• Numerical measure of how alike two data objects are.
• Is higher when objects are more alike.
• Often falls in the range [0,1]
• Dissimilarity
• Numerical measure of how different two data objects are
• Lower when objects are more alike
• Minimum dissimilarity is often 0
• Upper limit varies
• Proximity refers to a similarity or dissimilarity
Euclidean Distance
• Euclidean Distance:
  dist(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}
  where n is the number of dimensions (attributes) and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
• Standardization is necessary if scales differ.
Euclidean Distance
[Figure: scatter plot of the four points in the x-y plane]
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrix
       p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
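As a minimal sketch (not part of the slides), the distance matrix above can be reproduced in Python; the point coordinates are taken from the slide:

```python
import math

# Points from the slide: p1(0,2), p2(2,0), p3(3,1), p4(5,1)
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def euclidean(p, q):
    # dist(p, q) = sqrt(sum over k of (p_k - q_k)^2)
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Reproduce the distance matrix, rounded to 3 decimals
for a in points:
    print(a, [round(euclidean(points[a], points[b]), 3) for b in points])
```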
Minkowski Distance
• Minkowski Distance is a generalization of Euclidean Distance:
  dist(p, q) = \left( \sum_{k=1}^{n} |p_k - q_k|^r \right)^{1/r}
  where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.
Minkowski Distance: Examples
• r = 1. City block (Manhattan, taxicab, L1 norm) distance.
• A common example of this is the Hamming distance, which is just the
number of bits that are different between two binary vectors
• r = 2. Euclidean distance
• r → ∞. "supremum" (L_max norm, L∞ norm) distance.
• This is the maximum difference between any component of the vectors
• Example: L_infinity of (1, 0, 2) and (6, 0, 3) = max(5, 0, 1) = 5
• Do not confuse r with n, i.e., all these distances are defined
for all numbers of dimensions.
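The three special cases can be sketched in a few lines of Python (an illustration, not from the slides; the function names `minkowski` and `supremum` are ours):

```python
def minkowski(p, q, r):
    # dist(p, q) = (sum over k of |p_k - q_k|^r)^(1/r)
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def supremum(p, q):
    # The limit as r -> infinity: the maximum component difference
    return max(abs(pk - qk) for pk, qk in zip(p, q))

p, q = (1, 0, 2), (6, 0, 3)
print(minkowski(p, q, 1))  # city block: 5 + 0 + 1 = 6.0
print(minkowski(p, q, 2))  # Euclidean: sqrt(26)
print(supremum(p, q))      # L_infinity: max(5, 0, 1) = 5
```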
Minkowski Distance
point  x  y
p1     0  2
p2     2  0
p3     3  1
p4     5  1
Distance Matrices
L1     p1  p2  p3  p4
p1     0   4   4   6
p2     4   0   2   4
p3     4   2   0   2
p4     6   4   2   0
L2     p1     p2     p3     p4
p1     0      2.828  3.162  5.099
p2     2.828  0      1.414  3.162
p3     3.162  1.414  0      2
p4     5.099  3.162  2      0
L∞     p1  p2  p3  p4
p1     0   2   3   5
p2     2   0   1   3
p3     3   1   0   2
p4     5   3   2   0
Mahalanobis Distance
mahalanobis(p, q) = (p - q) \Sigma^{-1} (p - q)^T
where \Sigma is the covariance matrix of the input data X:
\Sigma_{j,k} = \frac{1}{n-1} \sum_{i=1}^{n} (X_{ij} - \bar{X}_j)(X_{ik} - \bar{X}_k)
[Figure: scatter of correlated data with points P, A, and B. For the two red points shown, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.]
When the covariance matrix is the identity matrix, the Mahalanobis distance is the same as the Euclidean distance.
Useful for detecting outliers.
Q: What is the shape of the data when the covariance matrix is the identity?
Q: Is A closer to P or to B?
Mahalanobis Distance
Covariance Matrix:
\Sigma = \begin{pmatrix} 0.3 & 0.2 \\ 0.2 & 0.3 \end{pmatrix}
A: (0.5, 0.5)
B: (0, 1)
C: (1.5, 1.5)
Mahal(A, B) = 5
Mahal(A, C) = 4
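The two values above can be checked with a small pure-Python sketch (assumption: `Mahal` on the slide is the quadratic form (p − q) Σ⁻¹ (p − q)ᵀ without a square root, matching the formula on the previous slide):

```python
# Covariance matrix from the slide; 2x2 inverse in closed form
S = [[0.3, 0.2], [0.2, 0.3]]
det = S[0][0] * S[1][1] - S[0][1] * S[1][0]      # 0.09 - 0.04 = 0.05
S_inv = [[ S[1][1] / det, -S[0][1] / det],
         [-S[1][0] / det,  S[0][0] / det]]       # [[6, -4], [-4, 6]]

def mahalanobis(p, q):
    # (p - q) S^{-1} (p - q)^T
    d = [p[0] - q[0], p[1] - q[1]]
    return (d[0] * (S_inv[0][0] * d[0] + S_inv[0][1] * d[1])
          + d[1] * (S_inv[1][0] * d[0] + S_inv[1][1] * d[1]))

A, B, C = (0.5, 0.5), (0, 1), (1.5, 1.5)
print(round(mahalanobis(A, B), 6))  # 5.0
print(round(mahalanobis(A, C), 6))  # 4.0
```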
Common Properties of a Distance
• Distances, such as the Euclidean distance, have
some well known properties.
1. d(p, q) ≥ 0 for all p and q, and d(p, q) = 0 only if
p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) ≤ d(p, q) + d(q, r) for all points p, q, and r.
(Triangle Inequality)
where d(p, q) is the distance (dissimilarity) between
points (data objects), p and q.
• A distance that satisfies these properties is a
metric, and a space is called a metric space
Common Properties of a Similarity
• Similarities, also have some well known
properties.
1. s(p, q) = 1 (or maximum similarity) only if p = q.
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
where s(p, q) is the similarity between points
(data objects), p and q.
Similarity Between Binary Vectors
• Common situation is that objects, p and q, have only
binary attributes
• Compute similarities using the following quantities
M01 = the number of attributes where p was 0 and q was 1
M10 = the number of attributes where p was 1 and q was 0
M00 = the number of attributes where p was 0 and q was 0
M11 = the number of attributes where p was 1 and q was 1
• Simple Matching and Jaccard Distance/Coefficients
SMC = number of matches / number of attributes
= (M11 + M00) / (M01 + M10 + M11 + M00)
J = number of value-1-to-value-1 matches / number of not-both-zero attribute values
= (M11) / (M01 + M10 + M11)
SMC versus Jaccard: Example
p = 1 0 0 0 0 0 0 0 0 0
q = 0 0 0 0 0 0 1 0 0 1
M01 = 2 (the number of attributes where p was 0 and q was 1)
M10 = 1 (the number of attributes where p was 1 and q was 0)
M00 = 7 (the number of attributes where p was 0 and q was 0)
M11 = 0 (the number of attributes where p was 1 and q was 1)
SMC = (M11 + M00)/(M01 + M10 + M11 + M00) = (0+7) / (2+1+0+7) = 0.7
J = (M11) / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
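The worked example can be verified directly in Python (a sketch; the variable names are ours):

```python
p = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
q = [0, 0, 0, 0, 0, 0, 1, 0, 0, 1]

# Count the four attribute-wise cases
m01 = sum(a == 0 and b == 1 for a, b in zip(p, q))
m10 = sum(a == 1 and b == 0 for a, b in zip(p, q))
m00 = sum(a == 0 and b == 0 for a, b in zip(p, q))
m11 = sum(a == 1 and b == 1 for a, b in zip(p, q))

smc = (m11 + m00) / (m01 + m10 + m11 + m00)  # (0 + 7) / 10 = 0.7
jaccard = m11 / (m01 + m10 + m11)            # 0 / 3 = 0.0
print(smc, jaccard)
```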
Cosine Similarity
• If d1 and d2 are two document vectors, then
  cos(d1, d2) = (d1 · d2) / (||d1|| ||d2||),
  where · indicates the vector dot product and ||d|| is the length of vector d.
• Example:
  d1 = 3 2 0 5 0 0 0 2 0 0
  d2 = 1 0 0 0 0 0 0 1 0 2
  d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
  ||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481
  ||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449
  cos(d1, d2) = 0.3150; distance = 1 − cos(d1, d2)
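The same computation in a short Python sketch (illustration only; names are ours):

```python
import math

# Document vectors from the slide
d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

dot = sum(a * b for a, b in zip(d1, d2))   # 3 + 2 = 5
norm1 = math.sqrt(sum(a * a for a in d1))  # sqrt(42) ≈ 6.481
norm2 = math.sqrt(sum(b * b for b in d2))  # sqrt(6) ≈ 2.449

cosine = dot / (norm1 * norm2)
print(round(cosine, 4))  # ≈ 0.315
```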
Supervised vs. Unsupervised Learning
• Supervised learning (classification)
• Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
• New data is classified based on the training set
• Unsupervised learning (clustering)
• The class labels of the training data are unknown
• Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in the
data
Prediction Problems: Classification vs. Numeric Prediction
• Classification
• predicts categorical class labels (discrete or nominal)
• classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it to classify new data
• Numeric Prediction
• models continuous-valued functions, i.e., predicts unknown or missing values
• Typical applications
• Credit/loan approval
• Medical diagnosis: whether a tumor is cancerous or benign
• Fraud detection: whether a transaction is fraudulent
• Web page categorization: which category a page belongs to
Classification—A Two-Step Process
• Model construction: describing a set of predetermined classes
• Each tuple/sample is assumed to belong to a predefined class, as
determined by the class label attribute
• The set of tuples used for model construction is training set
• The model is represented as classification rules, decision trees, or
mathematical formulae
• Model usage: for classifying future or unknown objects
• Estimate accuracy of the model
• The known label of test sample is compared with the classified
result from the model
• Accuracy rate is the percentage of test set samples that are
correctly classified by the model
• Test set is independent of training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data
• Note: If the test set is used to select models, it is called validation (test) set
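As an illustration of the two steps (not from the slides), here is a toy 1-nearest-neighbor classifier: "model construction" amounts to storing the labeled training tuples, and accuracy is estimated on an independent test set. The data and names are hypothetical:

```python
import math

# Hypothetical labeled tuples: (attributes, class label)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"),
            ((4.0, 4.2), "B"), ((4.5, 3.8), "B")]
test_set = [((0.9, 1.1), "A"), ((4.2, 4.0), "B")]

def classify(x):
    # Model usage: label x by its nearest training example
    nearest = min(training, key=lambda t: math.dist(x, t[0]))
    return nearest[1]

# Accuracy rate: fraction of test samples whose known label
# matches the classified result
correct = sum(classify(x) == label for x, label in test_set)
print(correct / len(test_set))  # 1.0
```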
Classification and regression
• What is classification? What is regression?
• Classification by decision tree induction
• Bayesian Classification
• Other Classification Methods
• Rule based
• K-NN
• SVM
• Bagging/Boosting
Issues: Evaluating Classification Methods
• Accuracy
• classifier accuracy: predicting class label
• predictor accuracy: guessing value of predicted attributes
• Speed
• time to construct the model (training time)
• time to use the model (classification/prediction time)
• Robustness: handling noise and missing values
• Scalability: efficiency in disk-resident databases
• Interpretability
• understanding and insight provided by the model
• Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
Predictor Error Measures
• Measure predictor accuracy: measure how far off the predicted value is from the actual known value
• Loss function: measures the error between y_i and the predicted value y_i'
  • Absolute error: |y_i - y_i'|
  • Squared error: (y_i - y_i')^2
• Test error (generalization error): the average loss over the test set
  • Mean absolute error: \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|
  • Mean squared error: \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2
  • Relative absolute error: \sum_{i=1}^{d} |y_i - y_i'| \,/\, \sum_{i=1}^{d} |y_i - \bar{y}|
  • Relative squared error: \sum_{i=1}^{d} (y_i - y_i')^2 \,/\, \sum_{i=1}^{d} (y_i - \bar{y})^2
• The mean squared error exaggerates the presence of outliers
• Popularly used: the (square) root mean squared error; similarly, the root relative squared error
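A quick Python sketch of the measures above on a hypothetical set of predictions (the numbers are invented for illustration):

```python
# Hypothetical actual and predicted values
y_true = [3.0, 5.0, 2.0, 8.0]
y_pred = [2.5, 5.0, 3.0, 7.0]
d = len(y_true)
y_bar = sum(y_true) / d  # mean of the actual values

mae = sum(abs(y - yp) for y, yp in zip(y_true, y_pred)) / d
mse = sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred)) / d
rmse = mse ** 0.5        # root mean squared error
rae = (sum(abs(y - yp) for y, yp in zip(y_true, y_pred))
       / sum(abs(y - y_bar) for y in y_true))
rse = (sum((y - yp) ** 2 for y, yp in zip(y_true, y_pred))
       / sum((y - y_bar) ** 2 for y in y_true))
print(mae, mse, rmse, rae, rse)
```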