DAT630

Classification
Alternative Techniques
Darío Garigliotti | University of Stavanger
09/10/2017
Introduction to Data Mining, Chapter 5
Recall
Attribute set (x) → Classification Model → Class label (y)
Outline
- Alternative classification techniques

- Rule-based
- Nearest neighbors
- Naive Bayes
- Ensemble methods
- Class imbalance problem

- Multiclass problem
Rule-based classifier
Rule-based Classifier
- Classifying records using a set of "if… then…"
rules

- Example

- R is known as the rule set
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Classification Rules
- Each classification rule can be expressed in the following way:
  ri: (Conditioni) → yi
  where Conditioni is the rule antecedent (or precondition) and yi is the rule consequent
Classification Rules
- A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Which rules cover the "hawk" and the "grizzly bear"?
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Classification Rules
- A rule r covers an instance x if the attributes of
the instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
Rule Coverage and
Accuracy
- Coverage of a rule

- Fraction of records that
satisfy the antecedent of a
rule
- Accuracy of a rule

- Fraction of records that
satisfy both the antecedent
and consequent of a rule
Tid  Refund  Marital Status  Taxable Income  Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
(Status=Single) → No
Coverage = 40%, Accuracy = 50%
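As an illustration (not part of the slides), a minimal Python sketch of how coverage and accuracy could be computed for the rule (Status=Single) → No over the table above; the field names and record encoding are assumptions made for this example.

```python
# Minimal sketch: coverage and accuracy of the rule (Status=Single) -> No
# over the ten records above. Field names are chosen for this example only.
records = [
    {"Refund": "Yes", "Status": "Single",   "Income": 125, "Class": "No"},
    {"Refund": "No",  "Status": "Married",  "Income": 100, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 70,  "Class": "No"},
    {"Refund": "Yes", "Status": "Married",  "Income": 120, "Class": "No"},
    {"Refund": "No",  "Status": "Divorced", "Income": 95,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 60,  "Class": "No"},
    {"Refund": "Yes", "Status": "Divorced", "Income": 220, "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 85,  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married",  "Income": 75,  "Class": "No"},
    {"Refund": "No",  "Status": "Single",   "Income": 90,  "Class": "Yes"},
]

def antecedent(r):
    return r["Status"] == "Single"    # rule condition

consequent = "No"                     # rule consequent

covered = [r for r in records if antecedent(r)]
coverage = len(covered) / len(records)                                     # 4/10 = 0.4
accuracy = sum(r["Class"] == consequent for r in covered) / len(covered)   # 2/4 = 0.5
print(coverage, accuracy)
```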
How does it work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Properties of the Rule Set
- Mutually exclusive rules

- Classifier contains mutually exclusive rules if the
rules are independent of each other
- Every record is covered by at most one rule
- Exhaustive rules

- Classifier has exhaustive coverage if it accounts for
every possible combination of attribute values
- Each record is covered by at least one rule
- These two properties ensure that every record
is covered by exactly one rule
When these Properties are
not Satisfied
- Rules are not mutually exclusive

- A record may trigger more than one rule
- Solution?
- Ordered rule set
- Unordered rule set – use voting schemes
- Rules are not exhaustive

- A record may not trigger any rules
- Solution?
- Use a default class (assign the majority class from the
training records)
Ordered Rule Set
- Rules are rank ordered according to their priority

- An ordered rule set is known as a decision list
- When a test record is presented to the classifier 

- It is assigned to the class label of the highest ranked
rule it has triggered
- If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
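A small sketch of how such a decision list could be applied: rules are tried in rank order, the first one that covers the record fires, and a default class is used otherwise. The rule encoding and the default label below are illustrative assumptions.

```python
# Sketch of an ordered rule set (decision list): rules are tried in rank order;
# the first rule that covers the record fires; if none fires, a default class
# (e.g., the majority class of the training records) is returned.
rules = [
    (lambda r: r["Give Birth"] == "no"  and r["Can Fly"] == "yes",       "Birds"),       # R1
    (lambda r: r["Give Birth"] == "no"  and r["Live in Water"] == "yes", "Fishes"),      # R2
    (lambda r: r["Give Birth"] == "yes" and r["Blood Type"] == "warm",   "Mammals"),     # R3
    (lambda r: r["Give Birth"] == "no"  and r["Can Fly"] == "no",        "Reptiles"),    # R4
    (lambda r: r["Live in Water"] == "sometimes",                        "Amphibians"),  # R5
]

def classify(record, rules, default_class="Mammals"):   # default class is hypothetical here
    for condition, label in rules:
        if condition(record):
            return label          # highest-ranked rule that covers the record
    return default_class          # no rule fired

turtle = {"Blood Type": "cold", "Give Birth": "no", "Can Fly": "no", "Live in Water": "sometimes"}
print(classify(turtle, rules))    # R4 fires before R5 -> "Reptiles"
```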
Rule Ordering Schemes
- Rule-based ordering

- Individual rules are ranked based on some quality
measure (e.g., accuracy, coverage)
- Class-based ordering

- Rules that belong to the same class appear together
- Rules are sorted on the basis of their class
information (e.g., total description length)
- The relative order of rules within a class does not
matter
Rule Ordering Schemes
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
How to Build a Rule-based
Classifier?
- Direct Method

- Extract rules directly from data
- Indirect Method

- Extract rules from other classification models (e.g.,
decision trees, neural networks)
From Decision Trees To
Rules
[Decision tree: root split on Refund (Yes → NO); Refund=No splits on Marital Status ({Married} → NO; {Single, Divorced} splits on Taxable Income: < 80K → NO, > 80K → YES)]
Classification Rules
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Rules are mutually exclusive and exhaustive
Rule set contains as much information as the tree
Rules Can Be Simplified
[Decision tree: root split on Refund (Yes → NO); Refund=No splits on Marital Status ({Married} → NO; {Single, Divorced} splits on Taxable Income: < 80K → NO, > 80K → YES)]
Tid  Refund  Marital Status  Taxable Income  Cheat
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Initial Rule: (Refund=No) ∧ (Status=Married) → No
Simplified Rule: (Status=Married) → No
Summary
- Expressiveness is almost equivalent to that of
a decision tree

- Generally used to produce descriptive models
that are easy to interpret, but give comparable
performance to decision tree classifiers

- The class-based ordering approach is well
suited for handling data sets with imbalanced
class distributions
Exercise
Nearest Neighbors
So far
- Eager learners
- Decision trees, rule-based classifiers
- Learn a model as soon as the training data becomes
available
[Figure: the general eager-learning framework seen earlier: a learning algorithm induces a model from the training set (induction), and the model is then applied to the test set (deduction)]
Opposite strategy
- Lazy learners
- Delay the process of modeling the data until it is
needed to classify the test examples
[Figure: the lazy-learning variant of the framework: the training set is simply stored, and modeling and prediction are deferred until the test set has to be classified]
Instance-Based Classifiers
• Store the training records
• Use training records to predict the class label of unseen cases
[Figure: a set of stored cases (Atr1 … AtrN, Class) and an unseen case whose class label has to be predicted]
Instance Based Classifiers
- Rote-learner

- Memorizes entire training data and performs
classification only if attributes of record match one of
the training examples exactly
- Nearest neighbors

- Uses k “closest” points (nearest neighbors) for
performing classification
Nearest neighbors
- Basic idea

- "If it walks like a duck, quacks like a duck, then it’s
probably a duck"
[Figure: compute the distance from the test record to the training records, then choose the k "nearest" ones]
Nearest-Neighbor
Classifiers
- Requires three things

- The set of stored records
- Distance Metric to compute distance between
records
- The value of k, the number of nearest neighbors to
retrieve
Nearest-Neighbor
Classifiers
- To classify an unknown record

- Compute distance to other
training records
- Identify k-nearest neighbors
- Use class labels of nearest
neighbors to determine the class
label of unknown record (e.g., by
taking majority vote)
Definition of Nearest
Neighbor
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor]
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x
Choices to make
- Compute distance between two points

- E.g., Euclidean distance
- See Chapter 2
- Determine the class from nearest neighbor list

- Take the majority vote of class labels among the
k nearest neighbors
- Weigh the vote according to distance
- Choose the value of k
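Putting these choices together, a minimal k-NN sketch (Euclidean distance, majority or distance-weighted vote) might look as follows; the toy training points are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two numeric attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, test_point, k=3, weighted=False):
    """Majority (optionally distance-weighted) vote among the k nearest training
    records; `train` is a list of (attribute_vector, class_label) pairs."""
    neighbors = sorted(train, key=lambda xy: euclidean(xy[0], test_point))[:k]
    votes = Counter()
    for x, label in neighbors:
        votes[label] += 1.0 / (euclidean(x, test_point) + 1e-9) if weighted else 1.0
    return votes.most_common(1)[0][0]

# Toy usage (made-up points): two classes in the plane
train = [((1, 1), "A"), ((2, 1), "A"), ((8, 9), "B"), ((9, 8), "B")]
print(knn_classify(train, (2, 2), k=3))   # -> "A"
```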
Choosing the value of k
- If k is too small, sensitive to noise points

- If k is too large, neighborhood may include
points from other classes
Summary
- Part of a more general technique called
instance-based learning

- Use specific training instances to make predictions
without having to maintain an abstraction (model)
derived from data
- Because there is no model building, classifying
a test example can be quite expensive

- Nearest-neighbors make their predictions
based on local information

- Susceptible to noise
Bayes Classifier
Bayes Classifier
- In many applications the relationship between
the attribute set and the class variable is 

non-deterministic

- The label of the test record cannot be predicted with
certainty even if it was seen previously during training
- A probabilistic framework for solving
classification problems

- Treat X and Y as random variables and capture their
relationship probabilistically using P(Y|X)
Example
- Football game between teams A and B

- Team A won 65% of the time, Team B won 35% of the time
- Among the games Team A won, 30% were hosted by B
- Among the games Team B won, 75% were played at B's home
- Which team is more likely to win if the game is
hosted by Team B?
Probability Basics
- Conditional probability
  P(X, Y) = P(X|Y) P(Y) = P(Y|X) P(X)
- Bayes’ theorem
  P(Y|X) = P(X|Y) P(Y) / P(X)
Example
- Probability Team A wins: P(win=A) = 0.65

- Probability Team B wins: P(win=B) = 0.35

- Probability Team A wins when B hosts: 

P(hosted=B|win=A) = 0.3

- Probability Team B wins when playing at home:
P(hosted=B|win=B) = 0.75

- Who wins the next game that is hosted by B?
P(win=B|hosted=B) = ?

P(win=A|hosted=B) = ?
Solution
- Using Bayes’ theorem: P(Y|X) = P(X|Y) P(Y) / P(X)
- P(win=B|hosted=B) = 0.5738
- P(win=A|hosted=B) = 0.4262
- See book page 229
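To spell out the calculation, the evidence P(hosted=B) can be expanded over the two possible winners:
P(win=B|hosted=B) = P(hosted=B|win=B) P(win=B) / [P(hosted=B|win=B) P(win=B) + P(hosted=B|win=A) P(win=A)]
                  = (0.75 × 0.35) / (0.75 × 0.35 + 0.3 × 0.65) = 0.2625 / 0.4575 ≈ 0.5738
and P(win=A|hosted=B) = 1 − 0.5738 ≈ 0.4262, so Team B is the more likely winner of a game it hosts.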
Bayes’ Theorem for Classification
P(Y|X) = P(X|Y) P(Y) / P(X)
- P(Y|X): posterior probability
- P(X|Y): class-conditional probability
- P(Y): prior probability
- P(X): the evidence
Bayes’ Theorem for Classification
P(Y|X) = P(X|Y) P(Y) / P(X)
- The evidence P(X) is a constant (the same for all classes), so it can be ignored
Bayes’ Theorem for Classification
P(Y|X) = P(X|Y) P(Y) / P(X)
- The prior probability P(Y) can be computed from the training data (fraction of records that belong to each class)
Bayes’ Theorem for Classification
P(Y|X) = P(X|Y) P(Y) / P(X)
- The class-conditional probability P(X|Y) has to be estimated; two methods: Naive Bayes, Bayesian belief network
Naive Bayes
Estimation
- Mind that X is a vector: X = {X1, …, Xn}
- Class-conditional probability: P(X|Y) = P(X1, …, Xn|Y)
- "Naive" assumption: attributes are independent, so
  P(X|Y) = ∏_{i=1}^{n} P(Xi|Y)
Naive Bayes Classifier
- Probability that X belongs to class Y:
  P(Y|X) ∝ P(Y) ∏_{i=1}^{n} P(Xi|Y)
- Target label for record X:
  y = arg max_{yj} P(Y=yj) ∏_{i=1}^{n} P(Xi|Y=yj)
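A minimal count-based sketch of this decision rule for categorical attributes, without any smoothing (smoothing is introduced a few slides later); this is an illustration, not the book's code, and all names and the toy records are made up.

```python
from collections import Counter, defaultdict

def train_nb(records, class_attr):
    """Count-based estimates of P(Y) and of P(Xi = xi | Y) for categorical attributes."""
    class_counts = Counter(r[class_attr] for r in records)
    value_counts = defaultdict(Counter)          # (class, attribute) -> Counter over values
    for r in records:
        y = r[class_attr]
        for attr, val in r.items():
            if attr != class_attr:
                value_counts[(y, attr)][val] += 1
    return class_counts, value_counts

def predict_nb(x, class_counts, value_counts):
    """Return arg max_y P(Y=y) * prod_i P(Xi = xi | Y=y)."""
    n = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for y, n_y in class_counts.items():
        score = n_y / n                                       # prior P(Y=y)
        for attr, val in x.items():
            score *= value_counts[(y, attr)][val] / n_y       # P(Xi = xi | Y=y)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy usage on a few of the tax records (Refund and Marital Status only)
records = [
    {"Refund": "Yes", "Status": "Single",  "Class": "No"},
    {"Refund": "No",  "Status": "Married", "Class": "No"},
    {"Refund": "No",  "Status": "Single",  "Class": "Yes"},
    {"Refund": "No",  "Status": "Married", "Class": "No"},
]
print(predict_nb({"Refund": "No", "Status": "Married"}, *train_nb(records, "Class")))  # -> "No"
```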
Estimating class-conditional probabilities
- Categorical attributes
- The fraction of training instances in class Y that have
a particular attribute value xi
- Continuous attributes
- Discretizing the range into bins
- Assuming a certain probability distribution
P(Xi = xi|Y = y) = nc / n
where nc = number of training instances with Xi = xi and Y = y, and n = number of training instances with Y = y
Conditional probabilities
for categorical attributes
- The fraction of training
instances in class Y that
have a particular
attribute value Xi

- P(Status=Married|No)=?

- P(Refund=Yes|Yes)=?
Tid  Refund  Marital Status  Taxable Income  Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
(Refund and Marital Status are categorical attributes, Taxable Income is continuous, Evade is the class label)
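Working these out from the table above: 4 of the 7 records with Evade=No have Status=Married, so P(Status=Married|No) = 4/7; none of the 3 records with Evade=Yes has Refund=Yes, so P(Refund=Yes|Yes) = 0/3 = 0. These counts reappear in the example that follows.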
Conditional probabilities
for continuous attributes
- Discretize the range into bins, or

- Assume a certain form of probability distribution

- Gaussian (normal) distribution is often used
- The parameters of the distribution are estimated from
the training data (from instances that belong to class yj)
- sample mean and variance
P(Xi = xi|Y = yj) = 1/√(2π σij²) · exp(−(xi − µij)² / (2 σij²))
where µij and σij² are the sample mean and variance of Xi over the training instances of class yj
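As a sketch of this estimate (using Python's statistics module for the sample mean and variance), evaluated on the taxable-income values from the example that follows; it reproduces the 0.0072 and 1.2×10⁻⁹ likelihoods used there.

```python
import math
from statistics import mean, variance

def gaussian_class_conditional(values, x):
    """P(Xi = x | Y = yj) under a normal assumption, with the mean and variance
    estimated from the training values of Xi for class yj (sample statistics)."""
    mu, var = mean(values), variance(values)   # sample mean and variance
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Taxable incomes (in K) of the training records with class No and class Yes (from the table)
income_no  = [125, 100, 70, 120, 60, 220, 75]   # mean 110, variance 2975
income_yes = [95, 85, 90]                        # mean 90,  variance 25
print(gaussian_class_conditional(income_no, 120))    # ~0.0072
print(gaussian_class_conditional(income_yes, 120))   # ~1.2e-09
```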
Example
Tid  Refund  Marital Status  Taxable Income  Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Example
Tid  Refund  Marital Status  Taxable Income  Evade
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
X = {Refund=No, Marital st.=Married, Income=120K}
class=No:  P(C)=7/10; Refund: No=4/7, Yes=3/7; Marital: Single=2/7, Divorced=1/7, Married=4/7; annual income: mean=110, var=2975
class=Yes: P(C)=3/10; Refund: No=3/3, Yes=0/3; Marital: Single=2/3, Divorced=1/3, Married=0/3; annual income: mean=90, var=25
Example: classifying a new instance
X={Refund=No, Marital st.=Married, Income=120K}
class=No:  P(C)=7/10; Refund: No=4/7, Yes=3/7; Marital: Single=2/7, Divorced=1/7, Married=4/7; annual income: mean=110, var=2975
class=Yes: P(C)=3/10; Refund: No=3/3, Yes=0/3; Marital: Single=2/3, Divorced=1/3, Married=0/3; annual income: mean=90, var=25
P(Class=No|X) ∝ P(Class=No) × P(Refund=No|Class=No) × P(Marital=Married|Class=No) × P(Income=120K|Class=No)
             = 7/10 × 4/7 × 4/7 × 0.0072
Example: classifying a new instance (cont.)
X={Refund=No, Marital st.=Married, Income=120K}
class=No:  P(C)=7/10; Refund: No=4/7, Yes=3/7; Marital: Single=2/7, Divorced=1/7, Married=4/7; annual income: mean=110, var=2975
class=Yes: P(C)=3/10; Refund: No=3/3, Yes=0/3; Marital: Single=2/3, Divorced=1/3, Married=0/3; annual income: mean=90, var=25
P(Class=Yes|X) ∝ P(Class=Yes) × P(Refund=No|Class=Yes) × P(Marital=Married|Class=Yes) × P(Income=120K|Class=Yes)
              = 3/10 × 3/3 × 0/3 × 1.2×10⁻⁹
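To make the two scores concrete (a worked expansion of the factors listed above), the Gaussian estimates for taxable income give
P(Income=120K|No)  = 1/√(2π·2975) · exp(−(120−110)²/(2·2975)) ≈ 0.0072
P(Income=120K|Yes) = 1/√(2π·25) · exp(−(120−90)²/(2·25)) ≈ 1.2×10⁻⁹
so P(Class=No|X) ∝ 7/10 × 4/7 × 4/7 × 0.0072 ≈ 0.0016, while P(Class=Yes|X) ∝ 3/10 × 1 × 0 × 1.2×10⁻⁹ = 0, and the record is classified as No. The zero factor P(Marital=Married|Yes) = 0/3 is exactly the issue raised on the next slide.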
Can anything go wrong?
P(Y|X) ∝ P(Y) ∏_{i=1}^{n} P(Xi|Y)
What if this probability is zero?
- If one of the conditional probabilities is zero, then the
entire expression becomes zero!
Probability estimation
- Original
  P(Xi = xi|Y = y) = nc / n
  (nc = number of training instances with Xi = xi and Y = y; n = number of training instances with Y = y)
- Laplace smoothing
  P(Xi = xi|Y = y) = (nc + 1) / (n + c)
  where c is the number of classes
Probability estimation (2)
- M-estimate
  P(Xi = xi|Y = y) = (nc + m·p) / (n + m)
- p can be regarded as the prior probability
- m is called the equivalent sample size; it determines the trade-off between the observed probability nc/n and the prior probability p
- E.g., p = 1/3 and m = 3
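A tiny sketch of both smoothed estimators side by side, plugged into the zero count P(Marital=Married|Yes) = 0/3 from the earlier example; the choice c=2 (classes Yes/No) and the m, p values follow the slide.

```python
def laplace(nc, n, c):
    """Laplace smoothing: P(Xi = xi | Y = y) = (nc + 1) / (n + c), c = number of classes."""
    return (nc + 1) / (n + c)

def m_estimate(nc, n, m, p):
    """M-estimate: P(Xi = xi | Y = y) = (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

# P(Marital=Married | Yes) from the example: nc = 0, n = 3
print(0 / 3)                           # original estimate: 0.0 (kills the whole product)
print(laplace(0, 3, c=2))              # 1/5 = 0.2
print(m_estimate(0, 3, m=3, p=1/3))    # 1/6 ~= 0.167
```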
Summary
- Robust to isolated noise points

- Handles missing values by ignoring the
instance during probability estimate
calculations

- Robust to irrelevant attributes

- Independence assumption may not hold for
some attributes
Exercise
Ensemble Methods
Ensemble Methods
- Construct a set of classifiers from the training
data

- Predict class label of previously unseen
records by aggregating predictions made by
multiple classifiers
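The slides do not fix a particular ensemble method here; as one common instance (bagging with majority voting), a rough sketch could look as follows. The learn(sample) callback and all names are illustrative assumptions.

```python
from collections import Counter
import random

def bagging_ensemble(train, learn, n_models=25, seed=0):
    """Train several classifiers, each on a bootstrap sample of the training data;
    `learn(sample)` must return a callable classifier."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]   # sample with replacement
        models.append(learn(sample))
    return models

def ensemble_predict(models, x):
    """Aggregate the individual predictions by majority vote."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy usage: each base "classifier" just predicts the majority class of its sample
def majority_learner(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda record: label

data = [((0,), "A"), ((1,), "A"), ((2,), "B")]
models = bagging_ensemble(data, majority_learner, n_models=5)
print(ensemble_predict(models, (1,)))   # most models predict "A"
```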
General Idea
Random Forests
Class Imbalance Problem
Class Imbalance Problem
- Data sets with imbalanced class distributions
are quite common in real-world applications

- E.g., credit card fraud detection
- Correct classification of the rare class often has
greater value than a correct classification
of the majority class

- The accuracy measure is not well suited for
imbalanced data sets

- We need alternative measures
Confusion Matrix
                  Predicted Positive     Predicted Negative
Actual Positive   True Positives (TP)    False Negatives (FN)
Actual Negative   False Positives (FP)   True Negatives (TN)
Additional Measures
- True positive rate (or sensitivity)

- Fraction of positive examples predicted correctly
- True negative rate (or specificity)

- Fraction of negative examples predicted correctly
TPR = TP / (TP + FN)
TNR = TN / (TN + FP)
Additional Measures
- False positive rate

- Fraction of negative examples predicted as positive
- False negative rate

- Fraction of positive examples predicted as negative
FPR = FP / (TN + FP)
FNR = FN / (TP + FN)
Additional Measures
- Precision

- Fraction of positive records among those that are
classified as positive
- Recall

- Fraction of positive examples correctly predicted
(same as the true positive rate)
P = TP / (TP + FP)
R = TP / (TP + FN)
Additional Measures
- F1-measure
- Summarizing precision and recall into a single
number
- Harmonic mean between precision and recall
F1 = 2RP / (R + P)
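These definitions translate directly into code; a small helper sketch (the counts in the usage line are hypothetical).

```python
def classification_measures(tp, fn, fp, tn):
    """Measures derived from the confusion matrix (as defined above)."""
    tpr = tp / (tp + fn)          # true positive rate / sensitivity / recall
    tnr = tn / (tn + fp)          # true negative rate / specificity
    fpr = fp / (tn + fp)          # false positive rate
    fnr = fn / (tp + fn)          # false negative rate
    precision = tp / (tp + fp)
    recall = tpr
    f1 = 2 * recall * precision / (recall + precision)
    return {"TPR": tpr, "TNR": tnr, "FPR": fpr, "FNR": fnr,
            "P": precision, "R": recall, "F1": f1}

# Hypothetical imbalanced example: 990 negatives, 10 positives
print(classification_measures(tp=5, fn=5, fp=10, tn=980))
```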
Multiclass Problem
Multiclass Classification
- Many of the approaches are originally
designed for binary classification problems

- Many real-world problems require data to be
divided into more than two categories

- Two approaches

- One-against-rest (1-r)
- One-against-one (1-1)
- Predictions need to be combined in both cases
One-against-rest
- Y={y1, y2, … yK} classes

- For each class yi

- Instances that belong to yi are positive examples
- All other instances are negative examples
- Combining predictions

- If an instance is classified positive, the positive class
gets a vote
- If an instance is classified negative, all classes
except for the positive class receive a vote
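A compact sketch of this voting scheme; the classifier dictionary in the usage is a stub that reproduces the example on the next slide.

```python
from collections import Counter

def one_against_rest_predict(binary_classifiers, x):
    """`binary_classifiers` maps each class yi to a classifier returning '+' if x
    looks like yi and '-' otherwise; votes are combined as described above."""
    votes = Counter()
    for yi, clf in binary_classifiers.items():
        if clf(x) == "+":
            votes[yi] += 1                     # the positive class gets a vote
        else:
            for yj in binary_classifiers:
                if yj != yi:
                    votes[yj] += 1             # every class except yi gets a vote
    return votes.most_common(1)[0][0]

# Toy usage: stub classifiers reproducing the example on the next slide
clfs = {"y1": lambda x: "+", "y2": lambda x: "-", "y3": lambda x: "-", "y4": lambda x: "-"}
print(one_against_rest_predict(clfs, x=None))   # -> "y1" (4 votes)
```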
Example
- 4 classes, Y={y1, y2, y3, y4}
- Classifying a given test instance with the four binary classifiers:
  (y1 +; y2, y3, y4 -) → class +
  (y3 +; y1, y2, y4 -) → class -
  (y2 +; y1, y3, y4 -) → class -
  (y4 +; y1, y2, y3 -) → class -
- Total votes: y1 = 4, y2 = 2, y3 = 2, y4 = 2 → target class: y1
One-against-one
- Y={y1, y2, … yK} classes

- Construct a binary classifier for each pair of
classes (yi, yj)

- K(K-1)/2 binary classifiers in total
- Combining predictions

- The positive class receives a vote in each pairwise
comparison
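A corresponding sketch for pairwise voting; the outcomes dictionary in the usage is a stub matching the example on the next slide.

```python
from collections import Counter
from itertools import combinations

def one_against_one_predict(pairwise_winner, classes, x):
    """`pairwise_winner(yi, yj, x)` returns which of yi, yj the (yi, yj)-classifier
    predicts for x; the predicted class receives one vote in each pairwise comparison."""
    votes = Counter()
    for yi, yj in combinations(classes, 2):
        votes[pairwise_winner(yi, yj, x)] += 1
    return votes.most_common(1)[0][0]

# Toy usage mirroring the example on the next slide
outcomes = {("y1", "y2"): "y1", ("y1", "y3"): "y1", ("y1", "y4"): "y4",
            ("y2", "y3"): "y2", ("y2", "y4"): "y4", ("y3", "y4"): "y3"}
print(one_against_one_predict(lambda yi, yj, x: outcomes[(yi, yj)],
                              ["y1", "y2", "y3", "y4"], x=None))
# y1 and y4 both end with 2 votes; most_common returns the first one inserted (y1),
# so ties would need explicit handling in practice
```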
Example
- 4 classes, Y={y1, y2, y3, y4}
- Classifying a given test instance with the six pairwise classifiers:
  (y1 +, y2 -) → class +    (y1 +, y3 -) → class +    (y1 +, y4 -) → class -
  (y2 +, y3 -) → class +    (y2 +, y4 -) → class -    (y3 +, y4 -) → class +
- Total votes: y1 = 2, y2 = 1, y3 = 1, y4 = 2 → tie between y1 and y4 (can be broken, e.g., at random or by the classifiers' confidence scores)
