Classification: Alternative Techniques
Kurdistan Regional Government – Iraq
Ministry of Higher Education and Scientific Research,
Akre University For Applied Sciences
Technical College of Informatics-Akre
Information Technology
MSc in Computer Sciences
Prepared by
Aqeel H. Younus, 2023-2024
Supervised by
Prof. Dr. Eng. Adnan Mohsin Abdulazeez
Rule-based Classifier
Rule-Based Classifier
• Classify records by using a collection of “if…then…” rules
• Rule: (Condition) → y
• where
• Condition is a conjunction of tests on attributes
• y is the class label
• Examples of classification rules:
• (Blood Type = Warm) ∧ (Lay Eggs = Yes) → Birds
• (Taxable Income < 50K) ∧ (Refund = Yes) → Evade = No
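To make the representation concrete, here is a minimal sketch (not from the slides; the rule and attribute names mirror the examples above) of how such if-then rules can be stored and applied in Python:

```python
# A rule is (condition, class); a condition is a dict of
# attribute -> required value, read as a conjunction of tests.
rules = [
    ({"Blood Type": "Warm", "Lay Eggs": "Yes"}, "Birds"),
]

def covers(condition, record):
    # A rule covers a record if every test in its condition is satisfied.
    return all(record.get(attr) == value for attr, value in condition.items())

def classify(record, rules, default=None):
    # Return the class of the first rule whose condition the record satisfies.
    for condition, label in rules:
        if covers(condition, record):
            return label
    return default

print(classify({"Blood Type": "Warm", "Lay Eggs": "Yes"}, rules))  # Birds
```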
Rule-based Classifier (Example)
Name Blood Type Give Birth Can Fly Live in Water Class
human warm yes no no mammals
python cold no no no reptiles
salmon cold no no yes fishes
whale warm yes no yes mammals
frog cold no no sometimes amphibians
komodo cold no no no reptiles
bat warm yes yes no mammals
pigeon warm no yes no birds
cat warm yes no no mammals
leopard shark cold yes no yes fishes
turtle cold no no sometimes reptiles
penguin warm no no sometimes birds
porcupine warm yes no no mammals
eel cold no no yes fishes
salamander cold no no sometimes amphibians
gila monster cold no no no reptiles
platypus warm no no no mammals
owl warm no yes no birds
dolphin warm yes no yes mammals
eagle warm no yes no birds
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Application of Rule-Based Classifier
• A rule r covers an instance x if the attributes of the
instance satisfy the condition of the rule
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
hawk warm no yes no ?
grizzly bear warm yes no no ?
The rule R1 covers a hawk => Bird
The rule R3 covers the grizzly bear => Mammal
Rule Coverage and Accuracy
• Coverage of a rule:
• Fraction of records that satisfy
the antecedent of a rule
• Coverage(R) = n_covers / |D|
• where
• n_covers: number of tuples covered by R
• |D|: number of tuples in the data set
• Accuracy of a rule:
• Fraction of records that satisfy
the antecedent that also satisfy
the consequent of a rule
• Accuracy(R) = n_correct / n_covers
• n_correct: number of tuples correctly classified by R
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
(Status = Single) → No
Coverage = 40%, Accuracy = 50%
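As a check on these numbers, here is a small sketch (assuming the ten-record tax table above) that computes the coverage and accuracy of (Status = Single) → No:

```python
# (Refund, Marital Status, Taxable Income in K, Class) for Tids 1-10.
records = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Rule: (Status = Single) -> No
covered = [r for r in records if r[1] == "Single"]   # n_covers = 4
correct = [r for r in covered if r[3] == "No"]       # n_correct = 2

print(len(covered) / len(records))   # coverage = 4/10 = 0.4
print(len(correct) / len(covered))   # accuracy = 2/4  = 0.5
```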
How does a Rule-based Classifier Work?
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
A lemur triggers rule R3, so it is classified as a mammal
A turtle triggers both R4 and R5
A dogfish shark triggers none of the rules
Name Blood Type Give Birth Can Fly Live in Water Class
lemur warm yes no no ?
turtle cold no no sometimes ?
dogfish shark cold yes no yes ?
Characteristics of Rule Sets: Strategy 1
• Mutually exclusive rules
• Classifier contains mutually exclusive rules if the rules are independent of
each other
• Every record is covered by at most one rule
• Exhaustive rules
• Classifier has exhaustive coverage if it accounts for every possible
combination of attribute values
• Each record is covered by at least one rule
Characteristics of Rule Sets: Strategy 2
• Rules are not mutually exclusive
• A record may trigger more than one rule
• Solution?
• Ordered rule set
• Unordered rule set – use voting schemes
• Rules are not exhaustive
• A record may not trigger any rules
• Solution?
• Use a default class
Ordered Rule Set
• Rules are rank ordered according to their priority
• An ordered rule set is known as a decision list
• When a test record is presented to the classifier
• It is assigned to the class label of the highest ranked rule it has triggered
• If none of the rules fired, it is assigned to the default class
R1: (Give Birth = no) ∧ (Can Fly = yes) → Birds
R2: (Give Birth = no) ∧ (Live in Water = yes) → Fishes
R3: (Give Birth = yes) ∧ (Blood Type = warm) → Mammals
R4: (Give Birth = no) ∧ (Can Fly = no) → Reptiles
R5: (Live in Water = sometimes) → Amphibians
Name Blood Type Give Birth Can Fly Live in Water Class
turtle cold no no sometimes ?
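A minimal decision-list sketch (an illustration, not the lecture's code): rules are tried in rank order, the first rule triggered assigns the class, and a default class handles everything else. Under the ordering R1-R5, the turtle fires R4 before R5 and is classified as a reptile:

```python
ordered_rules = [
    ({"Give Birth": "no", "Can Fly": "yes"}, "Birds"),         # R1
    ({"Give Birth": "no", "Live in Water": "yes"}, "Fishes"),  # R2
    ({"Give Birth": "yes", "Blood Type": "warm"}, "Mammals"),  # R3
    ({"Give Birth": "no", "Can Fly": "no"}, "Reptiles"),       # R4
    ({"Live in Water": "sometimes"}, "Amphibians"),            # R5
]

def decision_list_classify(record, rules, default="Unknown"):
    # Assign the class of the highest-ranked rule that fires.
    for condition, label in rules:
        if all(record.get(a) == v for a, v in condition.items()):
            return label
    return default  # no rule fired

turtle = {"Blood Type": "cold", "Give Birth": "no",
          "Can Fly": "no", "Live in Water": "sometimes"}
print(decision_list_classify(turtle, ordered_rules))  # Reptiles (R4 fires first)
```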
Rule Ordering Schemes
• Rule-based ordering
• Individual rules are ranked based on their quality
• Class-based ordering
• Rules that belong to the same class appear together
Rule-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
(Refund=No, Marital Status={Married}) ==> No
Class-based Ordering
(Refund=Yes) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income<80K) ==> No
(Refund=No, Marital Status={Married}) ==> No
(Refund=No, Marital Status={Single,Divorced},
Taxable Income>80K) ==> Yes
Building Classification Rules
Direct Method:
Extract rules directly from data.
Examples: RIPPER, CN2, Holte’s 1R, 0R, Sequential Covering.
Indirect Method:
Extract rules from other classification models (e.g., decision trees, neural networks).
Examples: C4.5rules
Direct Method: Sequential Covering
1. Start from an empty rule
2. Grow a rule using the Learn-One-Rule function
3. Remove training records covered by the rule
4. Repeat Steps (2) and (3) until the stopping criterion is met
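A compact sketch of this loop (an assumption about the interfaces: learn_one_rule stands in for the Learn-One-Rule function, returning a grown rule plus the records it covers, or None when no acceptable rule remains):

```python
def sequential_covering(records, target_class, learn_one_rule):
    rules = []                      # 1. start from an empty rule list
    remaining = list(records)
    while remaining:
        grown = learn_one_rule(remaining, target_class)  # 2. grow a rule
        if grown is None:           # 4. stopping criterion met
            break
        rule, covered = grown
        rules.append(rule)
        # 3. remove the training records covered by the new rule
        remaining = [r for r in remaining if r not in covered]
    return rules
```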
Example of Sequential Covering
(Figure: (i) Original Data; (ii) Step 1; (iii) Step 2 — rule R1 is learned; (iv) Step 3 — the records R1 covers are removed and rule R2 is learned.)
Rule Growing
• Two common strategies
(a) General-to-specific: start from the empty rule { } (Yes: 3, No: 4) and evaluate candidate conjuncts such as Refund = No, Status = Single, Status = Divorced, Status = Married, and Income > 80K by the class counts of the records each one covers.
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Rule Growing (continued)
(b) Specific-to-general: two maximally specific rules,
(Refund = No, Status = Single, Income = 85K) → (Class = Yes) and
(Refund = No, Status = Single, Income = 90K) → (Class = Yes),
are generalized to (Refund = No, Status = Single) → (Class = Yes).
Tid Refund Marital Status Taxable Income Class
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
Rule Evaluation
• FOIL’s Information Gain
• R0: { } => class (initial rule)
• R1: {A} => class (rule after adding conjunct)
• p0: number of positive instances covered by R0
• n0: number of negative instances covered by R0
• p1: number of positive instances covered by R1
• n1: number of negative instances covered by R1
• FOIL (First Order Inductive Learner) is an early rule-based learning algorithm.

Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]
Example
Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]
Coverage(R) = n_covers / |D|    Accuracy(R) = n_correct / n_covers

R0: { } → Mammals   (p0 = 5, n0 = 10)
Coverage(R0) = 15 / 15 = 1
Accuracy(R0) = 5 / 15 = 0.333

Candidate rule                      p1   n1   Accuracy          Info Gain
{Skin Cover = hair} → Mammals        3    0   3/3 = 1 (100%)     4.755
{Body Temp = warm} → Mammals         5    2   5/7 = 0.714        5.498
{Has Legs = no} → Mammals            1    4   1/5 = 0.2         −0.737
Gain(R0, R1) = p1 × [ log2( p1 / (p1 + n1) ) − log2( p0 / (p0 + n0) ) ]

Gain(R0, R1) = 3 × [ log2( 3 / (3 + 0) ) − log2( 5 / (5 + 10) ) ] = 4.755
Gain(R0, R1) = 5 × [ log2( 5 / (5 + 2) ) − log2( 5 / (5 + 10) ) ] = 5.498
Gain(R0, R1) = 1 × [ log2( 1 / (1 + 4) ) − log2( 5 / (5 + 10) ) ] = −0.737
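The same computations as a small Python function, reproducing the three gains above (a sketch; the counts come from the worked example):

```python
from math import log2

def foil_gain(p0, n0, p1, n1):
    # FOIL's information gain for extending rule R0 (p0, n0) to R1 (p1, n1).
    return p1 * (log2(p1 / (p1 + n1)) - log2(p0 / (p0 + n0)))

print(round(foil_gain(5, 10, 3, 0), 3))  #  4.755  {Skin Cover = hair}
print(round(foil_gain(5, 10, 5, 2), 3))  #  5.498  {Body Temp = warm}
print(round(foil_gain(5, 10, 1, 4), 3))  # -0.737  {Has Legs = no}
```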
Example of Direct Method: oneR (1R) algorithm
Outlook Temp Humidity Windy Play
Overcast 3 81 False Long
Sunny 12 80 False Long
Sunny 15 70 True Long
Overcast 1 85 True Medium
Overcast 2 96 False Medium
Rainy 12 95 False Medium
Overcast 0 96 True Short
Rainy 5 95 False Short
Rainy 9 92 True Short
1. Apply the 1R algorithm to determine a single rule that predicts the value of
the attribute (Play), assuming a minimum bucket size of 2.
2. Determine the classification of the following instance based on the results of
the previous step:
Outlook Temp Humidity Windy Play
Sunny 5 75 true ?
Converting numeric data into nominal data
Temp (sorted), with the class of each instance below it (s = Short, m = Medium, L = Long), split into bins A, B, C:
 0  1  2 |  3  5  9 | 12 12 15
 s  m  m |  L  s  s |  m  L  L
    A    |    B     |     C
Humidity (sorted), split into bins D, E, F:
70 80 81 | 85 92 95 | 95 96 96
 L  L  L |  m  s  s |  m  m  s
    D    |    E     |     F
Akre
University
for
Applied
Sciences
Attribute   Rules                 Errors   Total errors
Outlook     Overcast → Medium     2/4
            Sunny → Long          0/2
            Rainy → Short         1/3      (2 + 0 + 1)/9 = 3/9
Temp        A → Medium            1/3
            B → Short             1/3
            C → Long              1/3      3/9
Humidity    D → Long              0/3
            E → Short             1/3
            F → Medium            1/3      2/9
Windy       True → Short          2/4
            False* → Long         3/5      5/9
We take the attribute with the smallest total error, so the accepted rule is Humidity’s (total error 2/9): D → Long, E → Short, F → Medium.
Outlook Temp Humidity Windy Play
Sunny 5 75 true ?
For the second requirement: Humidity = 75 falls in bin D, so the instance is
classified as Play = Long according to the rule.
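A minimal 1R sketch (an assumption, not the lecture's code; it expects nominal attributes, so the binned Temp and Humidity values would be passed in as the letters A-F):

```python
from collections import Counter

def one_r(records, target):
    # records: list of dicts; target: name of the class attribute.
    # For each attribute, build a value -> majority-class rule and count
    # its errors; return the attribute whose rule makes the fewest errors.
    best = None
    for attr in (a for a in records[0] if a != target):
        rule, errors = {}, 0
        for value in {r[attr] for r in records}:
            classes = Counter(r[target] for r in records if r[attr] == value)
            majority, count = classes.most_common(1)[0]
            rule[value] = majority
            errors += sum(classes.values()) - count
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best  # (attribute, rule dict, total errors)
```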
Direct Method: RIPPER
• It stands for Repeated Incremental Pruning to Produce Error Reduction
• For a 2-class problem, choose one of the classes as the positive class and the
other as the negative class
• Learn rules for positive class
• Negative class will be default class
• For a multi-class problem
• Order the classes according to increasing class prevalence (fraction of instances
that belong to a particular class)
• Learn the rule set for smallest class first, treat the rest as negative class
• Repeat with next smallest class as positive class
Direct Method: RIPPER
• Growing a rule:
• Start from empty rule
• Add conjuncts as long as they improve FOIL’s information gain
• Stop when rule no longer covers negative examples
• Prune the rule immediately using incremental reduced error pruning
• Measure for pruning: v = (p-n)/(p+n)
• p: number of positive examples covered by the rule in
the validation set
• n: number of negative examples covered by the rule in
the validation set
• Pruning method: delete any final sequence of conditions that maximizes v
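A sketch of this pruning step under stated assumptions (covers and is_positive are hypothetical helpers): every way of deleting a final sequence of conditions leaves a prefix, each prefix is scored with v on the validation set, and the best one is kept.

```python
def prune_rule(conditions, validation, covers, is_positive):
    # Score a condition list by v = (p - n) / (p + n) on the validation set.
    def v(conds):
        covered = [r for r in validation if covers(conds, r)]
        p = sum(1 for r in covered if is_positive(r))
        n = len(covered) - p
        return (p - n) / (p + n) if covered else float("-inf")

    # Deleting a final sequence of conditions leaves a prefix; keep the
    # prefix that maximizes v.
    prefixes = [conditions[:k] for k in range(1, len(conditions) + 1)]
    return max(prefixes, key=v)
```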
Direct Method: RIPPER
• Building a Rule Set:
• Use sequential covering algorithm
• Finds the best rule that covers the current set of positive examples
• Eliminate both positive and negative examples covered by the rule
• Each time a rule is added to the rule set, compute the new description length
• Stop adding new rules when the new description length is d bits longer than the
smallest description length obtained so far
Example of Indirect Method: Rule Set
r1: (P=No,Q=No) ==> -
r2: (P=No,Q=Yes) ==> +
r3: (P=Yes,R=No) ==> +
r4: (P=Yes,R=Yes,Q=No) ==> -
r5: (P=Yes,R=Yes,Q=Yes) ==> +
(Decision tree: root P. P = No → Q: Q = No → −, Q = Yes → +. P = Yes → R: R = No → +, R = Yes → Q: Q = No → −, Q = Yes → +.)
Indirect Method: C4.5rules
• Extract rules from an unpruned decision tree
• For each rule, r: A → y,
• consider an alternative rule r′: A′ → y, where A′ is obtained by removing one
of the conjuncts in A
• Compare the pessimistic error rate of r against all alternatives r′
• Prune if one of the alternative rules has a lower pessimistic error rate
• Repeat until we can no longer improve generalization error
Indirect Method: C4.5rules
• Instead of ordering the rules, order subsets of rules (class ordering)
• Each subset is a collection of rules with the same rule consequent (class)
• Compute description length of each subset
• Description length = L(error) + g × L(model)
• g is a parameter that takes into account the presence of redundant attributes in a rule set (default value = 0.5)
Example
Name Give Birth Lay Eggs Can Fly Live in Water Have Legs Class
human yes no no no yes mammals
python no yes no no no reptiles
salmon no yes no yes no fishes
whale yes no no yes no mammals
frog no yes no sometimes yes amphibians
komodo no yes no no yes reptiles
bat yes no yes no yes mammals
pigeon no yes yes no yes birds
cat yes no no no yes mammals
leopard shark yes no no yes no fishes
turtle no yes no sometimes yes reptiles
penguin no yes no sometimes yes birds
porcupine yes no no no yes mammals
eel no yes no yes no fishes
salamander no yes no sometimes yes amphibians
gila monster no yes no no yes reptiles
platypus no yes no no yes mammals
owl no yes yes no yes birds
dolphin yes no no yes no mammals
eagle no yes yes no yes birds
C4.5 versus C4.5rules versus RIPPER
C4.5rules:
(Give Birth = No, Can Fly = Yes) → Birds
(Give Birth = No, Live in Water = Yes) → Fishes
(Give Birth = Yes) → Mammals
(Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
( ) → Amphibians
(C4.5 decision tree: Give Birth? Yes → Mammals; No → Live in Water? Yes → Fishes, Sometimes → Amphibians, No → Can Fly? Yes → Birds, No → Reptiles.)
RIPPER:
(Live in Water = Yes) → Fishes
(Have Legs = No) → Reptiles
(Give Birth = No, Can Fly = No, Live in Water = No) → Reptiles
(Can Fly = Yes, Give Birth = No) → Birds
( ) → Mammals
C4.5 versus C4.5rules versus RIPPER
RIPPER:
                 PREDICTED CLASS
ACTUAL CLASS     Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians           0         0        0       0       2
Fishes               0         3        0       0       0
Reptiles             0         0        3       0       1
Birds                0         0        1       2       1
Mammals              0         2        1       0       4

C4.5 and C4.5rules:
                 PREDICTED CLASS
ACTUAL CLASS     Amphibians  Fishes  Reptiles  Birds  Mammals
Amphibians           2         0        0       0       0
Fishes               0         2        0       0       1
Reptiles             1         0        3       0       0
Birds                1         0        0       3       0
Mammals              0         0        1       0       6
Advantages of Rule-Based Data Mining Classifiers
1. Highly expressive.
2. Easy to interpret.
3. Easy to generate.
4. Can classify new records rapidly.
5. Performance is comparable to other classifiers.
Nearest Neighbor Classifiers
Nearest Neighbor Classifiers
• Basic idea:
• If it walks like a duck, quacks like a duck, then it’s probably a
duck
(Figure: compute the distance between the test record and the training records, then choose the k “nearest” records.)
Nearest-Neighbor Classifiers
• Requires the following:
– A set of labeled records
– A proximity metric to compute the distance/similarity between a pair of records (e.g., Euclidean distance)
– The value of k, the number of nearest neighbors to retrieve
– A method for using the class labels of the k nearest neighbors to determine the class label of an unknown record (e.g., by taking a majority vote)
How to Determine the class label of a Test Sample?
• Take the majority vote of class labels among the k nearest neighbors
• Weight the vote according to distance
• weight factor, w = 1/d²
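A small sketch (an illustration) of distance-weighted voting, where each of the k nearest neighbors votes with weight w = 1/d²:

```python
from collections import defaultdict

def weighted_vote(neighbors):
    # neighbors: list of (distance, class_label) for the k nearest records.
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += 1.0 / (d * d)   # w = 1/d^2 (assumes d > 0)
    return max(votes, key=votes.get)

# One close "A" outvotes two farther "B"s: 1/0.25 = 4 vs 1/1 + 1/4 = 1.25.
print(weighted_vote([(0.5, "A"), (1.0, "B"), (2.0, "B")]))  # A
```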
Nearest Neighbor Classification…
• Data preprocessing is often required
• Attributes may have to be scaled to prevent distance measures from being
dominated by one of the attributes
• Example:
• height of a person may vary from 1.5m to 1.8m
• weight of a person may vary from 90lb to 300lb
• income of a person may vary from $10K to $1M
• Time series are often standardized to have mean 0 and standard deviation 1
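A minimal sketch of z-score standardization, one common way to put attributes like these on comparable scales before computing distances (an illustration; the values are made up):

```python
import statistics

def standardize(column):
    # Rescale a numeric column to mean 0 and standard deviation 1.
    mu = statistics.mean(column)
    sigma = statistics.pstdev(column)
    return [(x - mu) / sigma for x in column]

heights = [1.5, 1.6, 1.7, 1.8]          # metres: tiny range
incomes = [10_000, 80_000, 1_000_000]   # dollars: would dominate raw distances
print(standardize(heights))
print(standardize(incomes))
```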
Nearest Neighbor Classification…
• Choosing the value of k:
• If k is too small, sensitive to noise points
• If k is too large, neighborhood may include points from
other classes
Nearest-neighbor classifiers
• The 1-NN decision boundary is a Voronoi diagram
• Nearest neighbor classifiers are local classifiers
• They can produce decision boundaries of arbitrary shapes.
Nearest Neighbor Classification…
• How to handle missing values in training and test sets?
• Proximity computations normally require the presence of all attributes
• Some approaches use the subset of attributes present in two instances
• This may not produce good results since it effectively uses different proximity measures
for each pair of instances
• Thus, proximities are not comparable
K-NN Classifiers…
Handling Irrelevant and Redundant Attributes
• Irrelevant attributes add noise to the proximity measure
• Redundant attributes bias the proximity measure towards certain attributes
K-NN Classifiers: Handling attributes that are interacting
How does K-NN work?
KNN has the following basic steps:
1. Select a value of k.
2. Determine which distance function is to be used (e.g., Euclidean).
3. Sort the distances obtained and take the k nearest data samples.
4. Assign the test instance to the class with the majority vote among
its k neighbors.
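An end-to-end sketch of these four steps (an illustration; the two-feature training points are made up to resemble the Iris example that follows):

```python
from collections import Counter
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn_classify(train, query, k):
    # train: list of ((sepal_length, sepal_width), species); query: a point.
    # Steps 2-3: compute distances, sort, keep the k nearest samples.
    nearest = sorted(train, key=lambda rec: dist(rec[0], query))[:k]
    # Step 4: majority vote among the k neighbors.
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((5.3, 3.7), "Setosa"), ((5.1, 3.8), "Setosa"),
         ((7.2, 3.0), "Virginica"), ((5.4, 3.4), "Setosa"),
         ((5.1, 3.3), "Setosa"), ((6.4, 2.8), "Virginica")]
print(knn_classify(train, (5.2, 3.1), k=5))  # Setosa
```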
Example:
This dataset is about the Iris flower. It has three attributes: sepal length,
sepal width, and species. The species attribute has three target values
(Setosa, Virginica, and Versicolor), and our goal is to find which of the
three species the new flower belongs to using k-Nearest Neighbors.
Target: a new flower has been found and needs to be classified (“Unlabeled”).
Features of the new unlabeled flower:
Solution: Step 1: Find the Distance
Our first step is to find the Euclidean distance between the Actual and
Observed sepal length and sepal width. For the first instance of the dataset:
X = Observed sepal length = 5.2
Y = Observed sepal width = 3.1
The Actual values given in the dataset are:
A = Actual sepal length = 5.3
B = Actual sepal width = 3.7
Distance formula:
Euclidean distance = √((X − A)² + (Y − B)²)
For the first instance: √((5.2 − 5.3)² + (3.1 − 3.7)²) = √(0.01 + 0.36) ≈ 0.61
This is the distance for the first instance; find the distances for all
remaining instances similarly, as shown in the table below.
Step 2: Find the Rank:
In this step, we find the rank after finding the distance. The rank numbers
the instances in ascending order of distance, as shown in the table below.
Instance number 5 has the minimum distance, 0.22, so it is given rank 1, as
in the table below.
Similarly, find the rank for all other instances, as shown in the table below.
Step 3: Find the Nearest Neighbor:
In the last step, we find the nearest neighbors on the basis of distance and
rank, and classify the unknown flower by the species of those neighbors.
According to the rank, find the k nearest neighbors:
For k = 1:
The nearest neighbor’s species is Setosa, so the classification for k = 1 is Setosa.
For k = 2:
Both of the two nearest neighbors are Setosa (no other species appears among
them), so the classification for k = 2 is Setosa.
For k = 5:
The majority vote is Setosa = 3, Virginica = 1, and Versicolor = 1, so on the
basis of the highest vote, the KNN classification for k = 5 is Setosa.
Improving KNN Efficiency
• Avoid having to compute distance to all objects in the training set
• Multi-dimensional access methods (k-d trees)
• Fast approximate similarity search
• Locality Sensitive Hashing (LSH)
• Condensing
• Determine a smaller set of objects that give the same performance
• Editing
• Remove objects to improve efficiency
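For example, a k-d tree (here via SciPy, assuming it is installed) answers nearest-neighbor queries without computing the distance to every training object:

```python
import numpy as np
from scipy.spatial import KDTree

points = np.random.default_rng(0).random((10_000, 2))  # training objects
tree = KDTree(points)                                  # build once

# Distances and indices of the 5 nearest neighbors, without a full scan.
distances, indices = tree.query([0.5, 0.5], k=5)
print(indices)
```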