Data Mining
Classification: Basic Concepts and
Techniques
Lecture Notes for Chapter 3
Introduction to Data Mining, 2nd Edition
by
Tan, Steinbach, Karpatne, Kumar
6/11/2024 Introduction to Data Mining, 2nd Edition 1
Classification: Definition
Given a collection of records (training set )
– Each record is characterized by a tuple
(x,y), where x is the attribute set and y is the
class label
 x: attribute, predictor, independent variable, input
 y: class, response, dependent variable, output
Task:
– Learn a model that maps each attribute set x
into one of the predefined class labels y
6/11/2024 Introduction to Data Mining, 2nd Edition 2
Examples of Classification Task
Task | Attribute set, x | Class label, y
Categorizing email messages | Features extracted from email message header and content | spam or non-spam
Identifying tumor cells | Features extracted from MRI scans | malignant or benign cells
Cataloging galaxies | Features extracted from telescope images | elliptical, spiral, or irregular-shaped galaxies
6/11/2024 Introduction to Data Mining, 2nd Edition 3
General Approach for Building
Classification Model
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Learning
algorithm
Training Set
6/11/2024 Introduction to Data Mining, 2nd Edition 4
Classification Techniques
Base Classifiers
– Decision Tree based Methods
– Rule-based Methods
– Nearest-neighbor
– Neural Networks
– Deep Learning
– Naïve Bayes and Bayesian Belief Networks
– Support Vector Machines
Ensemble Classifiers
– Boosting, Bagging, Random Forests
6/11/2024 Introduction to Data Mining, 2nd Edition 5
Example of a Decision Tree
ID
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
Home
Owner
MarSt
Income
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Splitting Attributes
Training Data Model: Decision Tree
6/11/2024 Introduction to Data Mining, 2nd Edition 6
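The tree above was drawn by hand from the training data. As a rough illustration (not part of the original slides), the sketch below induces a comparable tree automatically with scikit-learn; the DataFrame column names and the one-hot encoding step are illustrative assumptions, and the induced tree need not match the hand-drawn one exactly.

```python
# Minimal sketch: inducing a decision tree from the 10-record training set above.
# Assumes pandas and scikit-learn are available; column names are illustrative.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

train = pd.DataFrame({
    "HomeOwner":     ["Yes","No","No","Yes","No","No","Yes","No","No","No"],
    "MaritalStatus": ["Single","Married","Single","Married","Divorced",
                      "Married","Divorced","Single","Married","Single"],
    "AnnualIncome":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
    "Defaulted":     ["No","No","No","No","Yes","No","No","Yes","No","Yes"],
})

X = pd.get_dummies(train.drop(columns="Defaulted"))   # one-hot encode categoricals
y = train["Defaulted"]

model = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(model, feature_names=list(X.columns)))
```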
Another Example of Decision Tree
MarSt
Home
Owner
Income
YES
NO
NO
NO
Yes No
Married
Single,
Divorced
< 80K > 80K
There could be more than one tree that
fits the same data!
ID
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
6/11/2024 Introduction to Data Mining, 2nd Edition 7
Apply Model to Test Data
Home
Owner
MarSt
Income
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
No Married 80K ?
10
Test Data
Start from the root of tree.
6/11/2024 Introduction to Data Mining, 2nd Edition 8
Apply Model to Test Data
MarSt
Income
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
No Married 80K ?
10
Test Data
Home
Owner
6/11/2024 Introduction to Data Mining, 2nd Edition 9
Apply Model to Test Data
MarSt
Income
YES
NO
NO
NO
Yes No
Married
Single, Divorced
< 80K > 80K
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
No Married 80K ?
10
Test Data
Assign Defaulted to
“No”
Home
Owner
6/11/2024 Introduction to Data Mining, 2nd Edition 13
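The walk through slides 8–13 can be summarized in a few lines of code. Below is a minimal sketch, assuming the tree is stored as nested dictionaries (an illustrative encoding, not from the slides); it classifies the same test record (No, Married, 80K) by starting from the root and following the test outcomes to a leaf.

```python
# Sketch: apply the decision tree above to a test record by walking from the root.
tree = {
    "attr": "HomeOwner",
    "Yes": "NO",                                   # leaf: Defaulted = No
    "No": {
        "attr": "MaritalStatus",
        "Married": "NO",                           # leaf: Defaulted = No
        "Single":   {"attr": "AnnualIncome<80K", True: "NO", False: "YES"},
        "Divorced": {"attr": "AnnualIncome<80K", True: "NO", False: "YES"},
    },
}

def classify(record, node):
    """Follow test outcomes until a leaf (a plain label string) is reached."""
    while isinstance(node, dict):
        attr = node["attr"]
        if attr == "AnnualIncome<80K":
            node = node[record["AnnualIncome"] < 80]
        else:
            node = node[record[attr]]
    return node

test = {"HomeOwner": "No", "MaritalStatus": "Married", "AnnualIncome": 80}
print(classify(test, tree))    # -> "NO": assign Defaulted = "No"
```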
Decision Tree Classification Task
Apply
Model
Induction
Deduction
Learn
Model
Model
Tid Attrib1 Attrib2 Attrib3 Class
1 Yes Large 125K No
2 No Medium 100K No
3 No Small 70K No
4 Yes Medium 120K No
5 No Large 95K Yes
6 No Medium 60K No
7 Yes Large 220K No
8 No Small 85K Yes
9 No Medium 75K No
10 No Small 90K Yes
10
Tid Attrib1 Attrib2 Attrib3 Class
11 No Small 55K ?
12 Yes Medium 80K ?
13 Yes Large 110K ?
14 No Small 95K ?
15 No Large 67K ?
10
Test Set
Tree
Induction
algorithm
Training Set
Decision
Tree
6/11/2024 Introduction to Data Mining, 2nd Edition 14
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ,SPRINT
6/11/2024 Introduction to Data Mining, 2nd Edition 15
General Structure of Hunt’s Algorithm
Let Dt be the set of training
records that reach a node t
General Procedure:
– If Dt contains records that
belong to the same class yt,
then t is a leaf node
labeled as yt
– If Dt contains records that
belong to more than one
class, use an attribute test
to split the data into smaller
subsets. Recursively apply
the procedure to each
subset.
Dt
?
ID
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
6/11/2024 Introduction to Data Mining, 2nd Edition 16
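A minimal, runnable sketch of the recursive procedure above. The choose_test helper here is a deliberately naive placeholder (it just picks any attribute that still varies); how the best attribute test is actually chosen is the subject of the following slides.

```python
# Sketch of Hunt's algorithm: grow the tree by recursively splitting D_t.
from collections import Counter

def choose_test(records):
    # Placeholder: any attribute with more than one distinct value in D_t.
    for attr in records[0]:
        if len({r[attr] for r in records}) > 1:
            return attr
    return None

def partition(records, labels, attr):
    groups = {}
    for r, y in zip(records, labels):
        groups.setdefault(r[attr], ([], []))
        groups[r[attr]][0].append(r)
        groups[r[attr]][1].append(y)
    return groups

def hunt(records, labels):
    counts = Counter(labels)
    if len(counts) == 1:                    # D_t is pure: leaf labeled y_t
        return labels[0]
    attr = choose_test(records)
    if attr is None:                        # identical attribute values: majority-class leaf
        return counts.most_common(1)[0][0]
    return {attr: {v: hunt(rs, ys)          # recursively apply the procedure to each subset
                   for v, (rs, ys) in partition(records, labels, attr).items()}}
```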
Hunt’s Algorithm
(a) (b)
(c)
Defaulted = No
Home
Owner
Yes No
Defaulted = No Defaulted = No
Yes No
Marital
Status
Single,
Divorced
Married
(d)
Yes No
Marital
Status
Single,
Divorced
Married
Annual
Income
< 80K >= 80K
Home
Owner
Defaulted = No
Defaulted = No
Defaulted = Yes
Home
Owner
Defaulted = No
Defaulted = No
Defaulted = No
Defaulted = Yes
(3,0) (4,3)
(3,0)
(1,3) (3,0)
(3,0)
(1,0) (0,3)
(3,0)
(7,3)
ID
Home
Owner
Marital
Status
Annual
Income
Defaulted
Borrower
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
6/11/2024 Introduction to Data Mining, 2nd Edition 17
Design Issues of Decision Tree Induction
How should training records be split?
– Method for specifying test condition
 depending on attribute types
– Measure for evaluating the goodness of a test
condition
How should the splitting procedure stop?
– Stop splitting if all the records belong to the
same class or have identical attribute values
– Early termination
6/11/2024 Introduction to Data Mining, 2nd Edition 21
Methods for Expressing Test Conditions
Depends on attribute types
– Binary
– Nominal
– Ordinal
– Continuous
Depends on number of ways to split
– 2-way split
– Multi-way split
6/11/2024 Introduction to Data Mining, 2nd Edition 22
Test Condition for Nominal Attributes
Multi-way split:
– Use as many partitions as
distinct values.
Binary split:
– Divides values into two subsets
Marital
Status
Single Divorced Married
{Single} {Married,
Divorced}
Marital
Status
{Married} {Single,
Divorced}
Marital
Status
OR OR
{Single,
Married}
Marital
Status
{Divorced}
6/11/2024 Introduction to Data Mining, 2nd Edition 23
Test Condition for Ordinal Attributes
Multi-way split:
– Use as many partitions
as distinct values
Binary split:
– Divides values into two
subsets
– Preserve order
property among
attribute values
Large
Shirt
Size
Medium Extra Large
Small
{Medium, Large,
Extra Large}
Shirt
Size
{Small}
{Large,
Extra Large}
Shirt
Size
{Small,
Medium}
{Medium,
Extra Large}
Shirt
Size
{Small,
Large}
This grouping
violates order
property
6/11/2024 Introduction to Data Mining, 2nd Edition 24
Test Condition for Continuous Attributes
Annual
Income
> 80K?
Yes No
Annual
Income?
(i) Binary split (ii) Multi-way split
< 10K
[10K,25K) [25K,50K) [50K,80K)
> 80K
6/11/2024 Introduction to Data Mining, 2nd Edition 25
Splitting Based on Continuous Attributes
Different ways of handling
– Discretization to form an ordinal categorical
attribute
Ranges can be found by equal interval bucketing,
equal frequency bucketing (percentiles), or
clustering.
 Static – discretize once at the beginning
 Dynamic – repeat at each node
– Binary Decision: (A < v) or (A ≥ v)
 consider all possible splits and find the best cut
 can be more compute intensive
6/11/2024 Introduction to Data Mining, 2nd Edition 26
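As a small illustration of the two static discretization strategies mentioned above, the sketch below buckets the annual-income values from the running example by equal intervals and by equal frequency (percentiles); numpy is an assumed dependency and the number of buckets is arbitrary.

```python
# Sketch: two static ways to discretize a continuous attribute into ordinal bins.
import numpy as np

income = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])   # in thousands

# Equal-interval bucketing: bin edges evenly spaced over the value range (4 buckets).
equal_width_edges = np.linspace(income.min(), income.max(), num=5)

# Equal-frequency bucketing: bin edges at percentiles, so buckets hold ~equal counts.
equal_freq_edges = np.percentile(income, [0, 25, 50, 75, 100])

print(np.digitize(income, equal_width_edges[1:-1]))   # bucket index per record
print(np.digitize(income, equal_freq_edges[1:-1]))
```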
How to determine the Best Split
Gender
C0: 6
C1: 4
C0: 4
C1: 6
C0: 1
C1: 3
C0: 8
C1: 0
C0: 1
C1: 7
Car
Type
C0: 1
C1: 0
C0: 1
C1: 0
C0: 0
C1: 1
Customer
ID
...
Yes No Family
Sports
Luxury c1
c10
c20
C0: 0
C1: 1
...
c11
Before Splitting: 10 records of class 0,
10 records of class 1
Which test condition is the best?
6/11/2024 Introduction to Data Mining, 2nd Edition 27
How to determine the Best Split
Greedy approach:
– Nodes with purer class distribution are
preferred
Need a measure of node impurity:
C0: 5
C1: 5
C0: 9
C1: 1
High degree of impurity Low degree of impurity
6/11/2024 Introduction to Data Mining, 2nd Edition 28
Measures of Node Impurity
Gini Index: $GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
Entropy: $Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$
Misclassification error: $Error(t) = 1 - \max_i P(i \mid t)$
6/11/2024 Introduction to Data Mining, 2nd Edition 29
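The three measures can be computed directly from the per-class counts at a node. A minimal sketch follows; the printed values reproduce the single-node examples worked out on later slides.

```python
# Sketch: the three node-impurity measures, computed from per-class counts at node t.
from math import log2

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)   # 0 * log 0 := 0

def classification_error(counts):
    return 1.0 - max(counts) / sum(counts)

# Single-node examples used on later slides (class counts [C1, C2]):
print(gini([0, 6]), gini([1, 5]), gini([2, 4]))                    # 0.0, 0.278, 0.444
print(entropy([0, 6]), entropy([1, 5]), entropy([2, 4]))           # 0.0, 0.65, 0.92
print(classification_error([1, 5]), classification_error([2, 4]))  # 1/6, 1/3
```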
Finding the Best Split
1. Compute impurity measure (P) before splitting
2. Compute impurity measure (M) after splitting
Compute impurity measure of each child node
M is the weighted impurity of children
3. Choose the attribute test condition that
produces the highest gain
or equivalently, lowest impurity measure after
splitting (M)
Gain = P – M
6/11/2024 Introduction to Data Mining, 2nd Edition 30
Finding the Best Split
B?
Yes No
Node N3 Node N4
A?
Yes No
Node N1 Node N2
Before Splitting:
C0 N10
C1 N11
C0 N20
C1 N21
C0 N30
C1 N31
C0 N40
C1 N41
C0 N00
C1 N01
P
M11 M12 M21 M22
M1 M2
Gain = P – M1 vs P – M2
6/11/2024 Introduction to Data Mining, 2nd Edition 31
Measure of Impurity: GINI
Gini Index for a given node t:
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p(j|t) is the relative frequency of class j at node t.)
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0.0) when all records belong to one class, implying most interesting information
6/11/2024 Introduction to Data Mining, 2nd Edition 32
Measure of Impurity: GINI
Gini Index for a given node t:
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$
(NOTE: p(j|t) is the relative frequency of class j at node t.)

C1: 0, C2: 6 → Gini = 0.000
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
C1: 1, C2: 5 → Gini = 0.278
6/11/2024 Introduction to Data Mining, 2nd Edition 33
Computing Gini Index of a Single Node
$GINI(t) = 1 - \sum_j [p(j \mid t)]^2$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Gini = 1 – P(C1)² – P(C2)² = 1 – 0 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Gini = 1 – (1/6)² – (5/6)² = 0.278

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Gini = 1 – (2/6)² – (4/6)² = 0.444
6/11/2024 Introduction to Data Mining, 2nd Edition 34
Computing Gini Index for a Collection of
Nodes
When a node p is split into k partitions (children), the weighted Gini index of the split is
$GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$
where n_i = number of records at child i, and n = number of records at parent node p.
Choose the attribute that minimizes the weighted average Gini index of the children.
The Gini index is used in decision tree algorithms such as CART, SLIQ, and SPRINT.
6/11/2024 Introduction to Data Mining, 2nd Edition 35
Binary Attributes: Computing GINI Index
Splits into two partitions
Effect of weighting partitions:
– Larger and purer partitions are sought.
B?
Yes No
Node N1 Node N2
Parent
C1 7
C2 5
Gini = 0.486
N1 N2
C1 5 2
C2 1 4
Gini=0.361
Gini(N1) = 1 – (5/6)² – (1/6)² = 0.278
Gini(N2) = 1 – (2/6)² – (4/6)² = 0.444
Weighted Gini of N1, N2 = 6/12 × 0.278 + 6/12 × 0.444 = 0.361
Gain = 0.486 – 0.361 = 0.125
6/11/2024 Introduction to Data Mining, 2nd Edition 36
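A short sketch that reproduces the numbers in this example: the parent Gini, the weighted Gini of the children (the GINI_split quantity from the previous slide), and the resulting gain.

```python
# Sketch: weighted (split) Gini of the children and the gain of split B.
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)   # weight each child by n_i / n

parent   = [7, 5]                      # C1 = 7, C2 = 5  ->  Gini = 0.486
children = [[5, 1], [2, 4]]            # N1: (5,1) -> 0.278, N2: (2,4) -> 0.444
print(gini(parent))                               # 0.486
print(gini_split(children))                       # 6/12 * 0.278 + 6/12 * 0.444 = 0.361
print(gini(parent) - gini_split(children))        # Gain = 0.486 - 0.361 = 0.125
```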
Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in
the dataset
Use the count matrix to make decisions
CarType
{Sports,
Luxury}
{Family}
C1 9 1
C2 7 3
Gini 0.468
CarType
{Sports}
{Family,
Luxury}
C1 8 2
C2 0 10
Gini 0.167
CarType
Family Sports Luxury
C1 1 8 1
C2 3 0 7
Gini 0.163
Multi-way split Two-way split
(find best partition of values)
Which of these is the best?
6/11/2024 Introduction to Data Mining, 2nd Edition 37
Continuous Attributes: Computing Gini Index
Use Binary Decisions based on one
value
Several Choices for the splitting value
– Number of possible splitting values
= Number of distinct values
Each splitting value has a count matrix
associated with it
– Class counts in each of the partitions, A < v and A ≥ v
Simple method to choose best v
– For each v, scan the database to
gather count matrix and compute
its Gini index
– Computationally Inefficient!
Repetition of work.
ID
Home
Owner
Marital
Status
Annual
Income
Defaulted
1 Yes Single 125K No
2 No Married 100K No
3 No Single 70K No
4 Yes Married 120K No
5 No Divorced 95K Yes
6 No Married 60K No
7 Yes Divorced 220K No
8 No Single 85K Yes
9 No Married 75K No
10 No Single 90K Yes
10
≤ 80 > 80
Defaulted Yes 0 3
Defaulted No 3 4
Annual Income ?
6/11/2024 Introduction to Data Mining, 2nd Edition 38
Sorted Values (Annual Income):  60   70   75   85   90   95   100  120  125  220
Class (Cheat):                  No   No   No   Yes  Yes  Yes  No   No   No   No
Candidate Split Positions:   55   65   72   80   87   92   97   110  122  172  230
Class counts (≤ split | > split) and Gini index at each candidate position:
Yes:  0|3  0|3  0|3  0|3  1|2  2|1  3|0  3|0  3|0  3|0  3|0
No:   0|7  1|6  2|5  3|4  3|4  3|4  3|4  4|3  5|2  6|1  7|0
Gini: 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420
Continuous Attributes: Computing Gini Index...
For efficient computation: for each attribute,
– Sort the attribute on values
– Linearly scan these values, each time updating the count matrix
and computing gini index
– Choose the split position that has the least gini index
6/11/2024 Introduction to Data Mining, 2nd Edition 39
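A sketch of the procedure just described: sort the records on the attribute once, then linearly scan the candidate split positions (midpoints between adjacent distinct values) while updating the class counts, keeping the position with the least weighted Gini. On the running example it recovers the split the table labels 97 (Gini 0.300).

```python
# Sketch: find the best binary split A <= v for a continuous attribute by sorting
# the values once and linearly scanning the candidate positions.
def gini(counts):
    n = sum(counts)
    return (1.0 - sum((c / n) ** 2 for c in counts)) if n else 0.0

def best_split(values, labels):
    pairs = sorted(zip(values, labels))               # sort the attribute on values
    classes = sorted(set(labels))
    left = {c: 0 for c in classes}                    # counts for A <= v
    right = {c: labels.count(c) for c in classes}     # counts for A >  v
    best_v, best_g = None, float("inf")
    for i in range(len(pairs) - 1):
        _, y = pairs[i]
        left[y] += 1                                  # move one record across the boundary
        right[y] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                  # cannot split between equal values
        mid = (pairs[i][0] + pairs[i + 1][0]) / 2
        n_l, n_r = i + 1, len(pairs) - i - 1
        g = (n_l * gini(list(left.values())) + n_r * gini(list(right.values()))) / len(pairs)
        if g < best_g:
            best_v, best_g = mid, g
    return best_v, best_g

income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(best_split(income, cheat))   # (97.5, 0.30): the position labeled 97 in the table
```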
Measure of Impurity: Entropy
Entropy at a given node t:
$Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$
(NOTE: p(j|t) is the relative frequency of class j at node t.)
Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying least information
Minimum (0.0) when all records belong to one class, implying most information
– Entropy-based computations are quite similar to the GINI index computations
6/11/2024 Introduction to Data Mining, 2nd Edition 44
Computing Entropy of a Single Node
$Entropy(t) = -\sum_j p(j \mid t)\, \log_2 p(j \mid t)$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = – 0 log2 0 – 1 log2 1 = – 0 – 0 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Entropy = – (1/6) log2 (1/6) – (5/6) log2 (5/6) = 0.65

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Entropy = – (2/6) log2 (2/6) – (4/6) log2 (4/6) = 0.92
6/11/2024 Introduction to Data Mining, 2nd Edition 45
Computing Information Gain After Splitting
Information Gain:
$GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$
Parent node p is split into k partitions; n_i is the number of records in partition i.
– Choose the split that achieves the most reduction (maximizes GAIN)
– Used in the ID3 and C4.5 decision tree algorithms
6/11/2024 Introduction to Data Mining, 2nd Edition 46
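A minimal sketch of the information gain defined above, i.e. the quantity that ID3 and C4.5 maximize when choosing a split; the example class counts are illustrative.

```python
# Sketch: information gain = Entropy(parent) - weighted entropy of the children.
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def information_gain(parent_counts, children_counts):
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

# Illustrative 2-class example: a parent with 10/10 records split into two children.
print(information_gain([10, 10], [[8, 2], [2, 8]]))   # ~0.278 bits
```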
Problem with large number of partitions
Node impurity measures tend to prefer splits that result in a large number of partitions, each being small but pure
– Customer ID has highest information gain
because entropy for all the children is zero
Gender
C0: 6
C1: 4
C0: 4
C1: 6
C0: 1
C1: 3
C0: 8
C1: 0
C0: 1
C1: 7
Car
Type
C0: 1
C1: 0
C0: 1
C1: 0
C0: 0
C1: 1
Customer
ID
...
Yes No Family
Sports
Luxury c1
c10
c20
C0: 0
C1: 1
...
c11
6/11/2024 Introduction to Data Mining, 2nd Edition 47
Gain Ratio
Gain Ratio:
$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n}\, \log_2 \frac{n_i}{n}$
Parent node p is split into k partitions; n_i is the number of records in partition i.
– Adjusts Information Gain by the entropy of the partitioning (SplitINFO).
 Higher-entropy partitioning (a large number of small partitions) is penalized!
– Used in the C4.5 algorithm
– Designed to overcome the disadvantage of Information Gain
6/11/2024 Introduction to Data Mining, 2nd Edition 48
Gain Ratio
Gain Ratio:
$GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}$, where $SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n}\, \log_2 \frac{n_i}{n}$
Parent node p is split into k partitions; n_i is the number of records in partition i.
CarType
{Sports,
Luxury}
{Family}
C1 9 1
C2 7 3
Gini 0.468
CarType
{Sports}
{Family,
Luxury}
C1 8 2
C2 0 10
Gini 0.167
CarType
Family Sports Luxury
C1 1 8 1
C2 3 0 7
Gini 0.163
SplitINFO = 1.52 SplitINFO = 0.72 SplitINFO = 0.97
6/11/2024 Introduction to Data Mining, 2nd Edition 49
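A small sketch computing SplitINFO (and the gain ratio) for the three CarType partitionings above; the partition sizes are taken from the count tables (e.g. 4, 8 and 8 records in the three-way split), and which printed value corresponds to which partitioning is inferred from those sizes.

```python
# Sketch: SplitINFO and gain ratio, using the CarType partition sizes above.
from math import log2

def split_info(partition_sizes):
    n = sum(partition_sizes)
    return -sum((ni / n) * log2(ni / n) for ni in partition_sizes if ni > 0)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

print(split_info([16, 4]))     # {Sports,Luxury} vs {Family}       -> 0.72
print(split_info([8, 12]))     # {Sports} vs {Family,Luxury}       -> 0.97
print(split_info([4, 8, 8]))   # Family / Sports / Luxury (3-way)  -> 1.52
```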
Measure of Impurity: Classification Error
Classification error at a node t:
$Error(t) = 1 - \max_i P(i \mid t)$
– Maximum (1 − 1/n_c) when records are equally distributed among all classes, implying least interesting information
– Minimum (0) when all records belong to one class, implying most interesting information
6/11/2024 Introduction to Data Mining, 2nd Edition 50
Computing Error of a Single Node
$Error(t) = 1 - \max_i P(i \mid t)$

C1: 0, C2: 6
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Error = 1 – max (0, 1) = 1 – 1 = 0

C1: 1, C2: 5
P(C1) = 1/6, P(C2) = 5/6
Error = 1 – max (1/6, 5/6) = 1 – 5/6 = 1/6

C1: 2, C2: 4
P(C1) = 2/6, P(C2) = 4/6
Error = 1 – max (2/6, 4/6) = 1 – 4/6 = 1/3
6/11/2024 Introduction to Data Mining, 2nd Edition 51
Comparison among Impurity Measures
For a 2-class problem:
6/11/2024 Introduction to Data Mining, 2nd Edition 52
Misclassification Error vs Gini Index
A?
Yes No
Node N1 Node N2
Parent
C1 7
C2 3
Gini = 0.42
N1 N2
C1 3 4
C2 0 3
Gini=0.342
Gini(N1) = 1 – (3/3)² – (0/3)² = 0
Gini(N2) = 1 – (4/7)² – (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Gini improves but
error remains the
same!!
6/11/2024 Introduction to Data Mining, 2nd Edition 53
Misclassification Error vs Gini Index
A?
Yes No
Node N1 Node N2
Parent
C1 7
C2 3
Gini = 0.42
N1 N2
C1 3 4
C2 0 3
Gini=0.342
N1 N2
C1 3 4
C2 1 2
Gini=0.416
Misclassification error for all three cases = 0.3 !
6/11/2024 Introduction to Data Mining, 2nd Edition 54
Decision Tree Based Classification
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Robust to noise (especially when methods to avoid
overfitting are employed)
– Can easily handle redundant or irrelevant attributes (unless
the attributes are interacting)
Disadvantages:
– Space of possible decision trees is exponentially large.
Greedy approaches are often unable to find the best tree.
– Does not take into account interactions between attributes
– Each decision boundary involves only a single attribute
6/11/2024 Introduction to Data Mining, 2nd Edition 55
Confusion Matrix
A confusion matrix is a tabular way of visualizing the performance of a prediction model.
Each entry in the matrix counts the predictions for which the model classified a class correctly or incorrectly.
Confusion Matrix for Binary Classification
A binary classification problem has only two classes, typically a positive and a negative class.
Introduction to Data Mining, 2nd Edition 56
6/11/2024
Confusion Matrix …
True Positive (TP): It refers to the number of predictions
where the classifier correctly predicts the positive class as
positive.
True Negative (TN): It refers to the number of predictions
where the classifier correctly predicts the negative class as
negative.
False Positive (FP): It refers to the number of predictions
where the classifier incorrectly predicts the negative class
as positive.
False Negative (FN): It refers to the number of predictions
where the classifier incorrectly predicts the positive class as
negative. Introduction to Data Mining, 2nd Edition 57
6/11/2024
Confusion Matrix for Multi Class
Introduction to Data Mining, 2nd Edition 58
6/11/2024
Performance measures for confusion matrix.
The confusion matrix is a convenient basis for evaluating a machine learning model: it yields several simple yet informative performance measures. Here are some of the most common measures derived from the confusion matrix.
Accuracy: It gives you the overall accuracy of the
model, meaning the fraction of the total samples that
were correctly classified by the classifier.
accuracy = (TP+TN)/(TP+TN+FP+FN).
Introduction to Data Mining, 2nd Edition 59
6/11/2024
Performance measures for confusion matrix.
Precision: It tells you what fraction of predictions as a
positive class were actually positive.
 Precision = TP/(TP+FP).
Recall: It tells you what fraction of all positive
samples were correctly predicted as positive by the
classifier. It is also known as True Positive Rate (TPR),
Sensitivity, Probability of Detection.
 Recall = TP/(TP+FN).
Introduction to Data Mining, 2nd Edition 60
6/11/2024
Performance measures for confusion matrix.
F1-score: It combines precision and recall into a single measure. Mathematically it is the harmonic mean of precision and recall:
F1 = 2 × (Precision × Recall) / (Precision + Recall).
Specificity: It tells you what fraction of all negative
samples are correctly predicted as negative by the
classifier. It is also known as True Negative Rate
(TNR). To calculate specificity, use the following
formula: TN/(TN+FP)
Introduction to Data Mining, 2nd Edition 61
6/11/2024
Performance measures for confusion matrix.
Misclassification Rate: It tells you what fraction of
predictions were incorrect. It is also known as
Classification Error. You can calculate it using
(FP+FN)/(TP+TN+FP+FN)
or (1-Accuracy).
Introduction to Data Mining, 2nd Edition 62
6/11/2024
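Tying the definitions above together, here is a minimal sketch that tallies TP, TN, FP and FN from true and predicted labels and then computes the listed measures; the example labels are illustrative.

```python
# Sketch: confusion-matrix counts and the performance measures defined above.
def binary_confusion(y_true, y_pred, positive="Yes"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

y_true = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "No"]   # illustrative labels
y_pred = ["Yes", "No", "No",  "No", "Yes", "Yes", "No", "No"]

tp, tn, fp, fn = binary_confusion(y_true, y_pred)
accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)                       # a.k.a. TPR / sensitivity
f1          = 2 * precision * recall / (precision + recall)
specificity = tn / (tn + fp)                       # a.k.a. TNR
misclassification_rate = 1 - accuracy
print(accuracy, precision, recall, f1, specificity, misclassification_rate)
```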