1. Data Mining and Data
Warehousing
CSE-4107
Md. Manowarul Islam
Associate Professor, Dept. of CSE
Jagannath University
2. Md. Manowarul Islam, Dept. Of CSE, JnU
What is classification?
🞐 Classification is the task of learning a target
function f that maps attribute set x to one of the
predefined class labels y
🞐 The target function f is known as a classification
model
3. Md. Manowarul Islam, Dept. Of CSE, JnU
What is classification?
🞐 One of the attributes is
the class attribute
🞐 In this case: Cheat
🞐 Two class labels (or
classes): Yes (1), No (0)
(Figure: training data table with two categorical attributes, one continuous attribute, and the class attribute Cheat.)
4. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Classification
■predicts categorical class labels (discrete or
nominal)
■classifies data (constructs a model) based on
the training set and the values (class labels) in
a classifying attribute and uses it in classifying
new data
🞐 Prediction
■models continuous-valued functions,
■predicts unknown or missing values
Classification vs. Prediction
5. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Descriptive modeling: Explanatory tool to
distinguish between objects of different classes
(e.g., understand why people cheat on their
taxes)
🞐 Predictive modeling: Predict a class of a
previously unseen record
Classification vs. Prediction
7. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Credit approval
■ A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
■ The history of past customers is used to train the
classifier
■ The classifier provides rules, which identify potentially
reliable future customers
■ Classification rule:
🞐 If age = “31...40” and income = high then credit_rating = excellent
■ Future customers
🞐 Paul: age = 35, income = high ⇒ excellent credit rating
🞐 John: age = 20, income = medium ⇒ fair credit rating
Why Classification?
8. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Model construction: describing a set of
predetermined classes
■Each tuple/sample is assumed to belong to a
predefined class, as determined by the class
label attribute
■The set of tuples used for model construction:
training set
■The model is represented as classification
rules, decision trees, or mathematical
formulae
Classification—A Two-Step Process
9. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Model usage: for classifying future or unknown
objects
■Estimate accuracy of the model
🞐The known label of test samples is
compared with the classified result from the
model
🞐Accuracy rate is the percentage of test set
samples that are correctly classified by the
model
🞐Test set is independent of training set,
otherwise over-fitting will occur
Classification—A Two-Step Process
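The two-step process can be sketched end to end in a few lines; this is a minimal illustrative sketch assuming scikit-learn is available, with the built-in iris data standing in for a labeled training set.

# Step 1: build a classifier from a training set.
# Step 2: estimate its accuracy on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)      # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))    # model usage / evaluation
print(f"test accuracy: {accuracy:.2f}")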
10. Md. Manowarul Islam, Dept. Of CSE, JnU
Training Data → Classification Algorithm → Classifier (Model)
Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Construction
11. Md. Manowarul Islam, Dept. Of CSE, JnU
Classifier (Model) → applied first to Testing Data, then to Unseen Data
Example unseen record: (Jeff, Professor, 4) → Tenured?
Use the Model in Prediction
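As a tiny sketch (hypothetical Python, not from the slides), the rule learned in the construction step can be written as a function and applied to the unseen record:

# The classification rule learned during model construction, as a function.
def tenured(rank: str, years: int) -> str:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

# Use the model on the unseen record (Jeff, Professor, 4).
print(tenured("professor", 4))   # -> yes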
13. Md. Manowarul Islam, Dept. Of CSE, JnU
Decision Tree Classification Task
(Figure: a Decision Tree model is induced from the training set and then applied to the test set.)
14. Md. Manowarul Islam, Dept. Of CSE, JnU
Supervised vs. Unsupervised Learning
🞐 Supervised learning (classification)
■ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
■ New data is classified based on the training set
🞐 Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
15. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
🞐 Relevance analysis (feature selection)
■ Remove the irrelevant or redundant attributes
🞐 Data transformation
■ Generalize and/or normalize data
🞐 numerical attribute income ⇒ categorical {low, medium, high}
🞐 normalize all numerical attributes to [0, 1]
Classification and Prediction: Data Preparation
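A small sketch of the two transformations above, assuming pandas and an illustrative income column (the cut points and values are made up):

import pandas as pd

df = pd.DataFrame({"income": [18_000, 42_000, 55_000, 91_000, 130_000]})

# Generalize: numerical income -> categorical {low, medium, high}
df["income_cat"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Normalize: rescale the numerical attribute to [0, 1]
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
print(df)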
16. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Predictive accuracy
🞐 Speed
■ time to construct the model
■ time to use the model
🞐 Robustness
■ handling noise and missing values
🞐 Scalability
■ efficiency in disk-resident databases
🞐 Interpretability:
■ understanding and insight provided by the model
🞐 Goodness of rules (quality)
■ decision tree size
■ compactness of classification rules
Evaluating Classification Methods
17. Md. Manowarul Islam, Dept. Of CSE, JnU
Evaluation of classification models
🞐 Counts of test records that are correctly (or
incorrectly) predicted by the classification model
🞐 Confusion matrix
                          Predicted Class
                          Class = 1    Class = 0
Actual    Class = 1         f11          f10
Class     Class = 0         f01          f00
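From these four counts, accuracy is the fraction of test records on the diagonal of the matrix; a minimal sketch with made-up counts:

# f11 and f00 are correct predictions; f10 and f01 are errors (illustrative counts).
f11, f10, f01, f00 = 40, 10, 5, 45

accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
error_rate = 1 - accuracy
print(accuracy, error_rate)   # 0.85 0.15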
18. Md. Manowarul Islam, Dept. Of CSE, JnU
Classification Techniques
🞐Decision Tree based Methods
🞐Rule-based Methods
🞐Memory based reasoning
🞐Neural Networks
🞐Naïve Bayes and Bayesian Belief Networks
🞐Support Vector Machines
19. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐Decision tree
■A flow-chart-like tree structure
■Internal node denotes a test on an attribute
■Branch represents an outcome of the test
■Leaf nodes represent class labels or class
distribution
Decision Trees
20. Md. Manowarul Islam, Dept. Of CSE, JnU
(Training data: two categorical attributes, one continuous attribute, and the class attribute.)
Model: Decision Tree (internal nodes are splitting attributes with their test outcomes; leaves carry the class labels):
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      >= 80K → YES
Example of a Decision Tree
21. Md. Manowarul Islam, Dept. Of CSE, JnU
Another Example of Decision Tree
(Same training data as before.)
MarSt?
  Married → NO
  Single, Divorced → Refund?
    Yes → NO
    No → TaxInc?
      < 80K → NO
      >= 80K → YES
There could be more than one tree that fits
the same data!
22. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      >= 80K → YES
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree.
23.–26. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
(Same tree and test record as above; the record is routed node by node: Refund = No → MarSt = Married.)
27. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
(Following Refund = No and then MarSt = Married leads to the leaf labeled NO.)
Assign Cheat to “No”
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K → Cheat = No
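The walk above can be sketched with the tree stored as nested dicts (an illustrative representation, not the slides' notation); the Single and Divorced branches share the TaxInc subtree.

# Leaves are class-label strings; internal nodes hold an attribute test.
taxinc = {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}}
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attr": "MarSt",
            "branches": {"Married": "NO", "Single": taxinc, "Divorced": taxinc},
        },
    },
}

def classify(node, record):
    """Start from the root and follow test outcomes until a leaf is reached."""
    if isinstance(node, str):                 # leaf: class label
        return node
    if node["attr"] == "TaxInc<80K":          # continuous-valued test
        outcome = record["TaxInc"] < 80
    else:                                     # categorical test
        outcome = record[node["attr"]]
    return classify(node["branches"][outcome], record)

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))                 # -> NO, i.e. assign Cheat = "No"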
28. Md. Manowarul Islam, Dept. Of CSE, JnU
General Structure of Hunt’s Algorithm
🞐 Let Dt be the set of training records that
reach a node t
🞐 General Procedure:
■ If Dt contains records that belong to the
same class yt, then t is a leaf node
labeled as yt
■ If Dt contains records with the same
attribute values, then t is a leaf node
labeled with the majority class yt
■ If Dt is an empty set, then t is a leaf
node labeled by the default class, yd
■ If Dt contains records that belong to
more than one class, use an attribute
test to split the data into smaller
subsets.
🞐 Recursively apply the procedure to each
subset.
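A runnable skeleton of this procedure, as a sketch under simplifying assumptions: records are dicts with the label stored under the key "class", and the split attribute is picked arbitrarily rather than by a real selection measure.

from collections import Counter, defaultdict

def hunt(records, attributes, default_class="Don't Cheat"):
    if not records:                                    # empty D_t: leaf with default class y_d
        return default_class
    counts = Counter(r["class"] for r in records)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1 or not attributes:             # pure node, or nothing left to test on
        return majority                                # leaf labeled y_t
    attr = sorted(attributes)[0]                       # placeholder choice; real algorithms pick
                                                       # the test by a measure such as information gain
    partitions = defaultdict(list)
    for r in records:                                  # split D_t into smaller subsets on attr
        partitions[r[attr]].append(r)
    return {"attr": attr,
            "branches": {v: hunt(rs, attributes - {attr}, majority)
                         for v, rs in partitions.items()}}

data = [
    {"Refund": "Yes", "MarSt": "Single",  "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Married", "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Single",  "class": "Cheat"},
]
print(hunt(data, {"Refund", "MarSt"}))                 # builds a tiny tree on MarSt, then Refund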
30. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Step 1: start with a single leaf labeled Don’t Cheat (the majority class).
Step 2: split on Refund: Yes → Don’t Cheat; No → Don’t Cheat.
31. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Step 2 (repeated): Refund: Yes → Don’t Cheat; No → Don’t Cheat.
Step 3: Refund: Yes → Don’t Cheat; No → Marital Status:
  Single, Divorced → Cheat
  Married → Don’t Cheat
32. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Steps 2 and 3 (repeated), then the Single, Divorced branch is refined further.
Step 4: Refund: Yes → Don’t Cheat; No → Marital Status:
  Married → Don’t Cheat
  Single, Divorced → Taxable Income:
    < 80K → Don’t Cheat
    >= 80K → Cheat
33. Md. Manowarul Islam, Dept. Of CSE, JnU
Tree Induction
🞐Finding the best decision tree is NP-hard
🞐Greedy strategy:
■Split the records based on an attribute test
that optimizes a certain criterion.
🞐Many Algorithms:
■Hunt’s Algorithm (one of the earliest)
■CART
■ID3, C4.5
■SLIQ, SPRINT
34. Md. Manowarul Islam, Dept. Of CSE, JnU
Classification by Decision Tree Induction
🞐 Decision tree
■ A flow-chart-like tree structure
■ Internal node denotes a test on an attribute
■ Branch represents an outcome of the test
■ Leaf nodes represent class labels or class distribution
🞐 Decision tree generation consists of two phases
■ Tree construction
🞐 At start, all the training examples are at the root
🞐 Partition examples recursively based on selected attributes
■ Tree pruning
🞐 Identify and remove branches that reflect noise or outliers
🞐 Use of decision tree: Classifying an unknown sample
■ Test the attribute values of the sample against the decision
tree
36. Md. Manowarul Islam, Dept. Of CSE, JnU
Output: A Decision Tree for “buys_computer”
age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
37. Md. Manowarul Islam, Dept. Of CSE, JnU
Algorithm for Decision Tree Induction
🞐 Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer
manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are
discretized in advance)
■ Samples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
🞐 Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
■ There are no samples left
38. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐 Information Gain (ID3/C4.5)
🞐 Select the attribute with the highest information gain
(The decision tree for “buys_computer”, rooted at age, as shown on the earlier slide.)
39. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐 Let D, the data partition, be a training set of
class-labeled tuples.
🞐 m distinct classes, Ci (for i = 1, …, m)
🞐 Let Ci,D be the set of tuples in D that belong to class Ci
🞐 |Ci,D| and |D| denote the number of tuples in Ci,D and in D, respectively
40. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐Let pi be the probability that an arbitrary tuple
in D belongs to class Ci, estimated by
■ pi = |Ci, D|/|D|
🞐Expected information (entropy) needed to
classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
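A minimal sketch of this formula in Python, applied to the class counts used on the following slides (9 yes, 5 no):

from math import log2

def info(class_counts):
    """Info(D) = -sum_i p_i * log2(p_i), with p_i = |C_i,D| / |D|."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(round(info([9, 5]), 3))   # -> 0.94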
41. Training Dataset
🞐 The class label attribute, buys_computer,
has two distinct values (yes, no);
🞐 There are two distinct classes
(that is, m = 2).
🞐 Let class C1 correspond to yes
and class C2 correspond to no.
🞐 There are nine tuples of class
yes and five tuples of class no.
42. 🞐 Class C1: buys_computer = “yes”
🞐 Class C2: buys_computer = “no”
Attribute Selection: Information Gain
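With 9 tuples of class yes and 5 of class no, the expected information works out to:

Info(D) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) ≈ 0.940 bits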
43. ■ Suppose we want to partition the tuples in D on some
attribute A having v distinct values, {a1, a2, …, av}
■ Attribute A can be used to split D into v partitions or
subsets, {D1, D2, …, Dv}, where Dj contains those tuples
in D that have outcome aj of A
■ Information needed (after using A to split D into v
partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \times Info(D_j)
■ Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
44. 🞐 Class C1: buys_computer = “yes”
🞐 Class C2: buys_computer = “no”
Age      Tuples (of 14)   C1 (yes)   C2 (no)
<=30     5                2          3
31…40    4                4          0
>40      5                3          2
Attribute Selection: Information Gain
45. Age      Tuples (of 14)   C1 (yes)   C2 (no)
<=30     5                2          3
31…40    4                4          0
>40      5                3          2
Attribute Selection: Information Gain
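Using the counts in this table (and Info(D) ≈ 0.940 from before), the age split evaluates to:

Info_age(D) = (5/14)·Info(2,3) + (4/14)·Info(4,0) + (5/14)·Info(3,2)
            = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.694 bits
Gain(age)   = Info(D) - Info_age(D) ≈ 0.940 - 0.694 = 0.246 bits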
47. Splitting the samples using age
age? splits the samples into three partitions: <=30, 31..40, and >40.
The 31..40 partition contains only “yes” tuples, so it becomes a leaf labeled yes.
48. Md. Manowarul Islam, Dept. Of CSE, JnU
Output: A Decision Tree for “buys_computer”
(The same tree as before: age? with <=30 → student? (no → no, yes → yes), 31..40 → yes, >40 → credit rating? (excellent → no, fair → yes).)
49. Md. Manowarul Islam, Dept. Of CSE, JnU
Gain Ratio for Attribute Selection (C4.5)
🞐 The information gain measure is biased toward
tests with many outcomes
🞐 Consider an attribute that acts as a unique
identifier, such as product_ID
🞐 A split on product_ID would result in a large
number of partitions, each containing just one tuple
🞐 Info_product_ID(D) = 0, so the information gained by
partitioning on this attribute is maximal
🞐 Such a partitioning is useless for classification
50. Md. Manowarul Islam, Dept. Of CSE, JnU
Gain Ratio for Attribute Selection (C4.5)
🞐 Information gain measure is biased towards
attributes with a large number of values
🞐 C4.5 (a successor of ID3) uses gain ratio to
overcome the problem (normalization to
information gain)
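The normalization C4.5 applies is the standard split information; filling in the formula the slide refers to:

SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \log_2(|D_j| / |D|)
GainRatio(A)   = Gain(A) / SplitInfo_A(D)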
52. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Ex. SplitInfo_income(D) = -(4/14)\log_2(4/14) - (6/14)\log_2(6/14) - (4/14)\log_2(4/14) = 1.557,
so gain_ratio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 ≈ 0.019
🞐 The attribute with the maximum gain ratio is
selected as the splitting attribute
Income    Tuples (of 14)
low       4
medium    6
high      4
Gain Ratio for Attribute Selection (C4.5)
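A short sketch checking the income example above (partition counts 4, 6, 4 of 14; Gain(income) = 0.029 as given on the slide):

from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) over the sizes of the partitions induced by A."""
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes)

si = split_info([4, 6, 4])
print(round(si, 3), round(0.029 / si, 3))   # -> 1.557 0.019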