1. Data Mining and Data
Warehousing
CSE-4107
Md. Manowarul Islam
Associate Professor, Dept. of CSE
Jagannath University
2. Md. Manowarul Islam, Dept. Of CSE, JnU
What is classification?
🞐 Classification is the task of learning a target
function f that maps attribute set x to one of the
predefined class labels y
🞐 The target function f is known as a classification
model
3. Md. Manowarul Islam, Dept. Of CSE, JnU
What is classification?
🞐 One of the attributes is
the class attribute
🞐 In this case: Cheat
🞐 Two class labels (or
classes): Yes (1), No (0)
(Figure: training data table with two categorical attributes, one continuous attribute, and the class attribute Cheat.)
4. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Classification
■predicts categorical class labels (discrete or
nominal)
■classifies data (constructs a model) based on
the training set and the values (class labels) in
a classifying attribute and uses it in classifying
new data
🞐 Prediction
■models continuous-valued functions,
■predicts unknown or missing values
Classification vs. Prediction
5. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Descriptive modeling: Explanatory tool to
distinguish between objects of different classes
(e.g., understand why people cheat on their
taxes)
🞐 Predictive modeling: Predict a class of a
previously unseen record
Classification vs. Prediction
7. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Credit approval
■ A bank wants to classify its customers based on whether
they are expected to pay back their approved loans
■ The history of past customers is used to train the
classifier
■ The classifier provides rules, which identify potentially
reliable future customers
■ Classification rule:
🞐 If age = “31...40” and income = high then credit_rating = excellent
■ Future customers
🞐 Paul: age = 35, income = high ⇒ excellent credit rating
🞐 John: age = 20, income = medium ⇒ fair credit rating
Why Classification?
8. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Model construction: describing a set of
predetermined classes
■Each tuple/sample is assumed to belong to a
predefined class, as determined by the class
label attribute
■The set of tuples used for model construction:
training set
■The model is represented as classification
rules, decision trees, or mathematical
formulae
Classification—A Two-Step Process
9. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Model usage: for classifying future or unknown
objects
■Estimate accuracy of the model
🞐The known label of test samples is
compared with the classified result from the
model
🞐Accuracy rate is the percentage of test set
samples that are correctly classified by the
model
🞐Test set is independent of training set,
otherwise over-fitting will occur
Classification—A Two-Step Process
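The two-step process can be sketched end to end in a few lines; this is a minimal illustrative sketch assuming scikit-learn is available, with the built-in iris data standing in for a labeled training set.

# Step 1: build a classifier from a training set.
# Step 2: estimate its accuracy on an independent test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)      # model construction
accuracy = accuracy_score(y_test, model.predict(X_test))    # model usage / evaluation
print(f"test accuracy: {accuracy:.2f}")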
10. Md. Manowarul Islam, Dept. Of CSE, JnU
Training Data → Classification Algorithm → Classifier (Model)
Example of a learned rule: IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Model Construction
11. Md. Manowarul Islam, Dept. Of CSE, JnU
Classifier (Model) → applied first to Testing Data, then to Unseen Data
Example unseen record: (Jeff, Professor, 4) → Tenured?
Use the Model in Prediction
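As a tiny sketch (hypothetical Python, not from the slides), the rule learned in the construction step can be written as a function and applied to the unseen record:

# The classification rule learned during model construction, as a function.
def tenured(rank: str, years: int) -> str:
    # IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank == "professor" or years > 6 else "no"

# Use the model on the unseen record (Jeff, Professor, 4).
print(tenured("professor", 4))   # -> yes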
13. Md. Manowarul Islam, Dept. Of CSE, JnU
Decision Tree Classification Task
(Figure: a Decision Tree model is induced from the training set and then applied to the test set.)
14. Md. Manowarul Islam, Dept. Of CSE, JnU
Supervised vs. Unsupervised Learning
🞐 Supervised learning (classification)
■ Supervision: The training data (observations,
measurements, etc.) are accompanied by labels
indicating the class of the observations
■ New data is classified based on the training set
🞐 Unsupervised learning (clustering)
■ The class labels of the training data are unknown
■ Given a set of measurements, observations, etc. with
the aim of establishing the existence of classes or
clusters in the data
15. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Data cleaning
■ Preprocess data in order to reduce noise and handle
missing values
🞐 Relevance analysis (feature selection)
■ Remove the irrelevant or redundant attributes
🞐 Data transformation
■ Generalize and/or normalize data
🞐 numerical attribute income ⇒ categorical {low, medium, high}
🞐 normalize all numerical attributes to [0, 1]
Classification and Prediction: Data Preparation
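A small sketch of the two transformations above, assuming pandas and an illustrative income column (the cut points and values are made up):

import pandas as pd

df = pd.DataFrame({"income": [18_000, 42_000, 55_000, 91_000, 130_000]})

# Generalize: numerical income -> categorical {low, medium, high}
df["income_cat"] = pd.cut(df["income"], bins=3, labels=["low", "medium", "high"])

# Normalize: rescale the numerical attribute to [0, 1]
lo, hi = df["income"].min(), df["income"].max()
df["income_norm"] = (df["income"] - lo) / (hi - lo)
print(df)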
16. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Predictive accuracy
🞐 Speed
■ time to construct the model
■ time to use the model
🞐 Robustness
■ handling noise and missing values
🞐 Scalability
■ efficiency in disk-resident databases
🞐 Interpretability:
■ understanding and insight provided by the model
🞐 Goodness of rules (quality)
■ decision tree size
■ compactness of classification rules
Evaluating Classification Methods
17. Md. Manowarul Islam, Dept. Of CSE, JnU
Evaluation of classification models
🞐 Counts of test records that are correctly (or
incorrectly) predicted by the classification model
🞐 Confusion matrix
                          Predicted Class
                          Class = 1    Class = 0
Actual    Class = 1         f11          f10
Class     Class = 0         f01          f00
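From these four counts, accuracy is the fraction of test records on the diagonal of the matrix; a minimal sketch with made-up counts:

# f11 and f00 are correct predictions; f10 and f01 are errors (illustrative counts).
f11, f10, f01, f00 = 40, 10, 5, 45

accuracy = (f11 + f00) / (f11 + f10 + f01 + f00)
error_rate = 1 - accuracy
print(accuracy, error_rate)   # 0.85 0.15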
18. Md. Manowarul Islam, Dept. Of CSE, JnU
Classification Techniques
🞐Decision Tree based Methods
🞐Rule-based Methods
🞐Memory based reasoning
🞐Neural Networks
🞐Naïve Bayes and Bayesian Belief Networks
🞐Support Vector Machines
19. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐Decision tree
■A flow-chart-like tree structure
■Internal node denotes a test on an attribute
■Branch represents an outcome of the test
■Leaf nodes represent class labels or class
distribution
Decision Trees
20. Md. Manowarul Islam, Dept. Of CSE, JnU
(Training data: two categorical attributes, one continuous attribute, and the class attribute.)
Model: Decision Tree (internal nodes are splitting attributes with their test outcomes; leaves carry the class labels):
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      >= 80K → YES
Example of a Decision Tree
21. Md. Manowarul Islam, Dept. Of CSE, JnU
Another Example of Decision Tree
(Same training data as before.)
MarSt?
  Married → NO
  Single, Divorced → Refund?
    Yes → NO
    No → TaxInc?
      < 80K → NO
      >= 80K → YES
There could be more than one tree that fits
the same data!
22. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
Refund?
  Yes → NO
  No → MarSt?
    Married → NO
    Single, Divorced → TaxInc?
      < 80K → NO
      >= 80K → YES
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree.
23.–26. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
(Same tree and test record as above; the record is routed node by node: Refund = No → MarSt = Married.)
27. Md. Manowarul Islam, Dept. Of CSE, JnU
Apply Model to Test Data
(Following Refund = No and then MarSt = Married leads to the leaf labeled NO.)
Assign Cheat to “No”
Test Data: Refund = No, Marital Status = Married, Taxable Income = 80K → Cheat = No
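The walk above can be sketched with the tree stored as nested dicts (an illustrative representation, not the slides' notation); the Single and Divorced branches share the TaxInc subtree.

# Leaves are class-label strings; internal nodes hold an attribute test.
taxinc = {"attr": "TaxInc<80K", "branches": {True: "NO", False: "YES"}}
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "NO",
        "No": {
            "attr": "MarSt",
            "branches": {"Married": "NO", "Single": taxinc, "Divorced": taxinc},
        },
    },
}

def classify(node, record):
    """Start from the root and follow test outcomes until a leaf is reached."""
    if isinstance(node, str):                 # leaf: class label
        return node
    if node["attr"] == "TaxInc<80K":          # continuous-valued test
        outcome = record["TaxInc"] < 80
    else:                                     # categorical test
        outcome = record[node["attr"]]
    return classify(node["branches"][outcome], record)

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))                 # -> NO, i.e. assign Cheat = "No"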
28. Md. Manowarul Islam, Dept. Of CSE, JnU
General Structure of Hunt’s Algorithm
🞐 Let Dt be the set of training records that
reach a node t
🞐 General Procedure:
■ If Dt contains records that belong to the
same class yt, then t is a leaf node
labeled as yt
■ If Dt contains records with the same
attribute values, then t is a leaf node
labeled with the majority class yt
■ If Dt is an empty set, then t is a leaf
node labeled by the default class, yd
■ If Dt contains records that belong to
more than one class, use an attribute
test to split the data into smaller
subsets.
🞐 Recursively apply the procedure to each
subset.
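A runnable skeleton of this procedure, as a sketch under simplifying assumptions: records are dicts with the label stored under the key "class", and the split attribute is picked arbitrarily rather than by a real selection measure.

from collections import Counter, defaultdict

def hunt(records, attributes, default_class="Don't Cheat"):
    if not records:                                    # empty D_t: leaf with default class y_d
        return default_class
    counts = Counter(r["class"] for r in records)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1 or not attributes:             # pure node, or nothing left to test on
        return majority                                # leaf labeled y_t
    attr = sorted(attributes)[0]                       # placeholder choice; real algorithms pick
                                                       # the test by a measure such as information gain
    partitions = defaultdict(list)
    for r in records:                                  # split D_t into smaller subsets on attr
        partitions[r[attr]].append(r)
    return {"attr": attr,
            "branches": {v: hunt(rs, attributes - {attr}, majority)
                         for v, rs in partitions.items()}}

data = [
    {"Refund": "Yes", "MarSt": "Single",  "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Married", "class": "Don't Cheat"},
    {"Refund": "No",  "MarSt": "Single",  "class": "Cheat"},
]
print(hunt(data, {"Refund", "MarSt"}))                 # builds a tiny tree on MarSt, then Refund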
30. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Step 1: start with a single leaf labeled Don’t Cheat (the majority class).
Step 2: split on Refund: Yes → Don’t Cheat; No → Don’t Cheat.
31. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Step 2 (repeated): Refund: Yes → Don’t Cheat; No → Don’t Cheat.
Step 3: Refund: Yes → Don’t Cheat; No → Marital Status:
  Single, Divorced → Cheat
  Married → Don’t Cheat
32. Md. Manowarul Islam, Dept. Of CSE, JnU
Hunt’s Algorithm
Steps 2 and 3 (repeated), then the Single, Divorced branch is refined further.
Step 4: Refund: Yes → Don’t Cheat; No → Marital Status:
  Married → Don’t Cheat
  Single, Divorced → Taxable Income:
    < 80K → Don’t Cheat
    >= 80K → Cheat
33. Md. Manowarul Islam, Dept. Of CSE, JnU
Tree Induction
🞐Finding the best decision tree is NP-hard
🞐Greedy strategy:
■Split the records based on an attribute test
that optimizes a certain criterion.
🞐Many Algorithms:
■Hunt’s Algorithm (one of the earliest)
■CART
■ID3, C4.5
■SLIQ, SPRINT
34. Md. Manowarul Islam, Dept. Of CSE, JnU
Classification by Decision Tree Induction
🞐 Decision tree
■ A flow-chart-like tree structure
■ Internal node denotes a test on an attribute
■ Branch represents an outcome of the test
■ Leaf nodes represent class labels or class distribution
🞐 Decision tree generation consists of two phases
■ Tree construction
🞐 At start, all the training examples are at the root
🞐 Partition examples recursively based on selected attributes
■ Tree pruning
🞐 Identify and remove branches that reflect noise or outliers
🞐 Use of decision tree: Classifying an unknown sample
■ Test the attribute values of the sample against the decision
tree
36. Md. Manowarul Islam, Dept. Of CSE, JnU
Output: A Decision Tree for “buys_computer”
age?
  <=30 → student?
    no → no
    yes → yes
  31..40 → yes
  >40 → credit rating?
    excellent → no
    fair → yes
37. Md. Manowarul Islam, Dept. Of CSE, JnU
Algorithm for Decision Tree Induction
🞐 Basic algorithm (a greedy algorithm)
■ Tree is constructed in a top-down recursive divide-and-conquer
manner
■ At start, all the training examples are at the root
■ Attributes are categorical (if continuous-valued, they are
discretized in advance)
■ Samples are partitioned recursively based on selected attributes
■ Test attributes are selected on the basis of a heuristic or
statistical measure (e.g., information gain)
🞐 Conditions for stopping partitioning
■ All samples for a given node belong to the same class
■ There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
■ There are no samples left
38. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐 Information Gain (ID3/C4.5)
🞐 Select the attribute with the highest information gain
(The decision tree for “buys_computer”, rooted at age, as shown on the earlier slide.)
39. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐 Let D, the data partition, be a training set of
class-labeled tuples.
🞐 m distinct classes, Ci (for i = 1, …, m)
🞐 Let Ci,D be the set of tuples in D that belong to class Ci
🞐 |Ci,D| and |D| denote the number of tuples in Ci,D and in D, respectively
40. Md. Manowarul Islam, Dept. Of CSE, JnU
Attribute Selection Measure:
🞐Let pi be the probability that an arbitrary tuple
in D belongs to class Ci, estimated by
■ pi = |Ci, D|/|D|
🞐Expected information (entropy) needed to
classify a tuple in D:
Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)
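A minimal sketch of this formula in Python, applied to the class counts used on the following slides (9 yes, 5 no):

from math import log2

def info(class_counts):
    """Info(D) = -sum_i p_i * log2(p_i), with p_i = |C_i,D| / |D|."""
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)

print(round(info([9, 5]), 3))   # -> 0.94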
41. Training Dataset
🞐 The class label attribute, buys_computer,
has two distinct values (yes, no);
🞐 There are two distinct classes
(that is, m = 2).
🞐 Let class C1 correspond to yes
and class C2 correspond to no.
🞐 There are nine tuples of class
yes and five tuples of class no.
42. 🞐 Class C1: buys_computer = “yes”
🞐 Class C2: buys_computer = “no”
Attribute Selection: Information Gain
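With 9 tuples of class yes and 5 of class no, the expected information works out to:

Info(D) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) ≈ 0.940 bits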
43. ■ Suppose we want to partition the tuples in D on some
attribute A having v distinct values, {a1, a2, …, av}
■ Attribute A can be used to split D into v partitions or
subsets, {D1, D2, …, Dv}, where Dj contains those tuples
in D that have outcome aj of A
■ Information needed (after using A to split D into v
partitions) to classify D:
Info_A(D) = \sum_{j=1}^{v} (|D_j| / |D|) \times Info(D_j)
■ Information gained by branching on attribute A:
Gain(A) = Info(D) - Info_A(D)
Attribute Selection: Information Gain
44. 🞐 Class C1: buys_computer = “yes”
🞐 Class C2: buys_computer = “no”
Age      Tuples (of 14)   C1 (yes)   C2 (no)
<=30     5                2          3
31…40    4                4          0
>40      5                3          2
Attribute Selection: Information Gain
45. Age      Tuples (of 14)   C1 (yes)   C2 (no)
<=30     5                2          3
31…40    4                4          0
>40      5                3          2
Attribute Selection: Information Gain
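Using the counts in this table (and Info(D) ≈ 0.940 from before), the age split evaluates to:

Info_age(D) = (5/14)·Info(2,3) + (4/14)·Info(4,0) + (5/14)·Info(3,2)
            = (5/14)(0.971) + (4/14)(0) + (5/14)(0.971) ≈ 0.694 bits
Gain(age)   = Info(D) - Info_age(D) ≈ 0.940 - 0.694 = 0.246 bits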
47. Splitting the samples using age
age? splits the samples into three partitions: <=30, 31..40, and >40.
The 31..40 partition contains only “yes” tuples, so it becomes a leaf labeled yes.
48. Md. Manowarul Islam, Dept. Of CSE, JnU
Output: A Decision Tree for “buys_computer”
(The same tree as before: age? with <=30 → student? (no → no, yes → yes), 31..40 → yes, >40 → credit rating? (excellent → no, fair → yes).)
49. Md. Manowarul Islam, Dept. Of CSE, JnU
Gain Ratio for Attribute Selection (C4.5)
🞐 The information gain measure is biased toward
tests with many outcomes
🞐 Consider an attribute that acts as a unique
identifier, such as product_ID
🞐 A split on product_ID would result in a large
number of partitions, each containing just one tuple
🞐 Info_product_ID(D) = 0, so the information gained by
partitioning on this attribute is maximal
🞐 Such a partitioning is useless for classification
50. Md. Manowarul Islam, Dept. Of CSE, JnU
Gain Ratio for Attribute Selection (C4.5)
🞐 Information gain measure is biased towards
attributes with a large number of values
🞐 C4.5 (a successor of ID3) uses gain ratio to
overcome the problem (normalization to
information gain)
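The normalization C4.5 applies is the standard split information; filling in the formula the slide refers to:

SplitInfo_A(D) = -\sum_{j=1}^{v} (|D_j| / |D|) \log_2(|D_j| / |D|)
GainRatio(A)   = Gain(A) / SplitInfo_A(D)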
52. Md. Manowarul Islam, Dept. Of CSE, JnU
🞐 Ex. SplitInfo_income(D) = -(4/14)\log_2(4/14) - (6/14)\log_2(6/14) - (4/14)\log_2(4/14) = 1.557,
so gain_ratio(income) = Gain(income) / SplitInfo_income(D) = 0.029 / 1.557 ≈ 0.019
🞐 The attribute with the maximum gain ratio is
selected as the splitting attribute
Income    Tuples (of 14)
low       4
medium    6
high      4
Gain Ratio for Attribute Selection (C4.5)
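A short sketch checking the income example above (partition counts 4, 6, 4 of 14; Gain(income) = 0.029 as given on the slide):

from math import log2

def split_info(partition_sizes):
    """SplitInfo_A(D) over the sizes of the partitions induced by A."""
    total = sum(partition_sizes)
    return -sum((n / total) * log2(n / total) for n in partition_sizes)

si = split_info([4, 6, 4])
print(round(si, 3), round(0.029 / si, 3))   # -> 1.557 0.019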