Classification and Prediction
-Sahil Kumar Singh
Classification and Prediction
 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Back Propagation
 Classification:
 is a form of data analysis that extracts models (classifiers)
describing important data classes
 classifies the data into two or more categories based on
class labels (Yes/No, Positive/Negative, etc.)
 E.g. bank loan approval to customers based on the
customer’s age
 Prediction:
 applies the trained model to data whose outcome is not yet known to the model
 models continuous-valued functions, i.e., predicts
unknown or missing values
 E.g. fraud detection, medical diagnosis
Classification vs. Prediction
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: using the model to classify future data tuples whose class label is unknown
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set to judge the model’s
accuracy, otherwise over-fitting will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
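A minimal Python sketch of the two steps, using the tenure example shown on the next two slides (scikit-learn is assumed to be available; the numeric encoding of rank is ours, not part of the original slides):

# Sketch of the two-step process on the tenure example from the next two slides.
# rank is encoded as 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction from the training set
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # (rank, years)
y_train = ['no', 'yes', 'yes', 'yes', 'no', 'no']             # tenured
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage - estimate accuracy on an independent test set,
# then classify an unseen tuple such as (Jeff, Professor, 4 years)
X_test = [[0, 2], [1, 7], [2, 5], [0, 7]]
y_test = ['no', 'no', 'yes', 'yes']
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Tenured?", model.predict([[2, 4]]))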
Classification Process (1): Model
Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
The training data is fed to a classification algorithm (e.g., decision tree, Bayes’, backpropagation), which produces a classifier (model), such as:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the
Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Issues Regarding Classification and Prediction
(1): Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature/attribute selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
Issues regarding classification and prediction
(2): Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
 time to construct the model
 time to use the model
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability:
 understanding and insight provided by the model
 Goodness of rules
 decision tree size
 compactness of classification rules
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
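The same greedy procedure can be sketched in a few lines of Python. This is illustrative pseudocode rather than the textbook's exact algorithm: samples are assumed to be (attribute-dict, class-label) pairs, and select_best is assumed to implement an attribute selection measure such as information gain.

# Illustrative sketch of greedy, top-down decision tree induction (not a full implementation).
from collections import Counter

def build_tree(samples, attributes, select_best):
    """samples: list of (attribute_dict, class_label); select_best: attribute selection measure."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                 # all samples at this node share one class
        return labels[0]
    if not attributes:                        # no attributes left: majority voting at the leaf
        return Counter(labels).most_common(1)[0][0]
    best = select_best(samples, attributes)   # e.g. attribute with the highest information gain
    subtree = {}
    for value in {attrs[best] for attrs, _ in samples}:   # partition on the chosen attribute
        subset = [(a, c) for a, c in samples if a[best] == value]
        subtree[value] = build_tree(subset, [a for a in attributes if a != best], select_best)
    return (best, subtree)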
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 S contains Si tuples of class Ci for i = {1, …, m}
 information measures info required to classify any arbitrary tuple:
I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
 entropy of attribute A with values {a1, a2, …, av}:
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
 information gained by branching on attribute A:
Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
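These three quantities are straightforward to compute; a small Python sketch (class counts passed as plain lists, base-2 logarithms):

# Expected information (entropy) and information gain, with class counts as lists.
from math import log2

def info(counts):
    """I(s1, ..., sm) = -sum_i (si/s) log2(si/s)."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def gain(total_counts, branch_counts):
    """Gain(A) = I(s1,...,sm) - E(A); branch_counts lists the class counts in each branch of A."""
    s = sum(total_counts)
    e_a = sum(sum(b) / s * info(b) for b in branch_counts)
    return info(total_counts) - e_a

# Example: the 14-tuple buys_computer data (9 yes, 5 no) split on age
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247, the gain of age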
Classification by decision tree induction:
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This
follows an
example
from
Quinlan’s
ID3
Attribute Selection by Information Gain
Computation
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Here, buys_computer is the class label used to build the decision tree.
Class P: buys_computer = “yes” → 9 tuples
Class N: buys_computer = “no” → 5 tuples
Now, compute the expected information needed to classify the entire table:
I(p, n) = I(9, 5) = 0.940
Attribute Selection by Information
Gain Computation
 Compute the entropy for age:
E(age) = 5/14 x 0.970 + 4/14 x 0 + 5/14 x 0.970 = 0.692
 Now, compute the gain for age w.r.t. T:
Gain(T, age) = 0.940 - 0.692 = 0.248
Similarly,
Gain(T, income) = 0.029
Gain(T, student) = 0.151
Gain(T, credit_rating) = 0.048
Age has the highest gain, so it is chosen as the root of the tree.
age Pi ni I(Pi,ni)
<=30 2 3 0.970
30…40 4 0 0
>40 3 2 0.970
[Partial decision tree: root node age? with branches <=30, 31…40, and >40]
Attribute Selection by Information Gain
Computation
Entropy of income:
E(income) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0.4
Now, compute the gain for income w.r.t. T<=30:
Gain(T<=30, income) = 0.970 - 0.4 = 0.57
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
income Pi ni I(Pi,ni)
high 0 2 0
medium 1 1 1
low 1 0 0
P:2, n:3
I(T<=30)=0.970
Attribute Selection by Information Gain
Computation
Entropy of student:
E(student) = 3/5 x 0 + 2/5 x 0 = 0
Now, compute the gain for student w.r.t. T<=30:
Gain(T<=30, student) = 0.970 - 0 = 0.970
student Pi ni I(Pi,ni)
no 0 3 0
yes 2 0 0
P:2, n:3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
I(T<=30)=0.970
Attribute Selection by Information
Gain Computation
Entropy of credit_rating:
E(credit_rating) = 3/5 x 0.918 + 2/5 x 1 = 0.951
Now, compute the gain for credit_rating w.r.t. T<=30:
Gain(T<=30, credit_rating) = 0.970 - 0.951 ≈ 0.02
credit_rating Pi ni I(Pi,ni)
fair 1 2 0.918
excellent 1 1 1
P:2, n:3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
I(T<=30)=0.970
Attribute Selection by
Information Gain
Gain(T<=30, income) = 0.57
Gain(T<=30, student) = 0.970
Gain(T<=30, credit_rating) = 0.02
Student has the highest gain, so the student attribute becomes the test node under the <=30 branch. For the 31…40 partition, every tuple has buys_computer = ‘yes’, so that branch becomes a leaf labelled ‘yes’.
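The three gains for the age <= 30 partition can be checked directly with a few lines of Python (a self-contained re-computation of the entropies above):

# Self-contained check of the three gains for the age <= 30 partition (2 yes, 3 no).
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

i_t = info([2, 3])                                                      # 0.971
e_income = 2/5 * info([0, 2]) + 2/5 * info([1, 1]) + 1/5 * info([1, 0])
e_student = 3/5 * info([0, 3]) + 2/5 * info([2, 0])
e_credit = 3/5 * info([1, 2]) + 2/5 * info([1, 1])
print(round(i_t - e_income, 3))    # income        ~0.571
print(round(i_t - e_student, 3))   # student       ~0.971
print(round(i_t - e_credit, 3))    # credit_rating ~0.020 -> student wins under <= 30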
age income student credit_rating buys_computer
31…40 high no fair yes
31…40 low yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
Attribute Selection by Information
Gain Computation
Entropy of credit_rating:
E(credit_rating) = 3/5 x 0 + 2/5 x 0 = 0
Now, compute the gain for credit_rating w.r.t. T>40:
Gain(T>40, credit_rating) = 0.970 - 0 = 0.970
credit_rating Pi ni I(Pi,ni)
fair 3 0 0
excellent 0 2 0
P:3, n:2
age income student credit_rating buys_computer
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
>40 medium yes fair yes
>40 medium no excellent no
I(T>40)=0.970
Attribute Selection by Information Gain
Computation
Entropy of income:
E(income) = 0 + 3/5 x 0.918 + 2/5 x 1 = 0.951
Now, compute the gain for income w.r.t. T>40:
Gain(T>40, income) = 0.970 - 0.951 = 0.019
income Pi ni I(Pi,ni)
high 0 0 0
medium 2 1 0.918
low 1 1 1
P:3, n:2
age income student credit_rating buys_computer
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
>40 medium yes fair yes
>40 medium no excellent no
I(T>40)=0.970
Output: A Decision Tree for “buys_computer”
[Final decision tree: root node age?; the <=30 branch leads to student? (no → no, yes → yes); the 31…40 branch is a leaf labelled yes; the >40 branch leads to credit_rating? (excellent → no, fair → yes)]
Gain(T>40, credit_rating)=0.970
Gain(T>40, income)=0.019
So, we put the credit_rating attribute in our tree under the >40 branch. This gives the final decision tree for buys_computer.
Other Attribute Selection Measures
 Gini index (CART, IBM IntelligentMiner)
 All attributes are assumed continuous-valued
 Assume there exist several possible split values for
each attribute
 May need other tools, such as clustering, to get the
possible split values
 Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner)
 If a data set T contains examples from n classes, gini index,
gini(T) is defined as
where pj is the relative frequency of class j in T.
 If a data set T is split into two subsets T1 and T2 with sizes
N1 and N2 respectively, the gini index of the split data
contains examples from n classes, the gini index gini(T) is
defined as
 The attribute that provides the smallest ginisplit(T) is chosen to
split the node (need to enumerate all possible splitting
points for each attribute).
gini(T) = 1 - \sum_{j=1}^{n} p_j^2
gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)
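A brief Python sketch of both quantities (class counts passed as lists; the example split is ours, chosen only to exercise the formulas):

# Gini index of a node and of a binary split.
def gini(counts):
    """gini(T) = 1 - sum(pj^2), given the class counts in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini index of splitting T into subsets T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Example: splitting the 14-tuple buys_computer data (9 yes, 5 no)
# into age <= 30 (2 yes, 3 no) and the rest (7 yes, 2 no)
print(round(gini_split([2, 3], [7, 2]), 3))   # ~0.394, vs gini([9, 5]) ~0.459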
Extracting Classification Rules from Trees
 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
 Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Avoid Overfitting in Classification
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due
to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
 Use a set of data different from the training data to
decide which is the “best pruned tree”
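As one concrete way to realize postpruning, the sketch below uses scikit-learn's cost-complexity pruning on synthetic data and picks the pruned tree that scores best on a held-out validation set. The dataset, sizes, and parameters are illustrative assumptions, not part of the original slides.

# Illustrative postpruning via cost-complexity pruning (scikit-learn);
# a held-out validation set picks the "best pruned tree".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree, then evaluate a sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("validation accuracy of best pruned tree:", best.score(X_val, y_val))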
Why to Use Decision Tree Induction for
Classification?
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification
methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
Bayesian Classification: Why?
 Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
 Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayesian Theorem: Basics
 Let X be a data sample whose class label is unknown
 Let H be a hypothesis that X belongs to class C
 For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
 P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
 P(X): probability that sample data is observed
 P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
 Given training data X, the posterior probability of a hypothesis
H, P(H|X), follows Bayes’ theorem
 Informally, this can be written as
posterior = likelihood x prior / evidence
 MAP (maximum posteriori) hypothesis
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\, P(h)
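A toy numeric illustration of the rule; the prior and likelihoods below are made-up numbers, used only to show the arithmetic:

# Toy Bayes rule computation with made-up probabilities.
p_h = 0.3                   # prior P(H)
p_x_given_h = 0.8           # likelihood P(X|H)
p_x_given_not_h = 0.2       # P(X | not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # evidence P(X), by total probability
posterior = p_x_given_h * p_h / p_x                     # P(H|X) = P(X|H) P(H) / P(X)
print(round(posterior, 3))                              # 0.632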
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent:
 The joint probability of, say, two attribute values x1 and x2,
given that the current class is C, is the product of the
probabilities of each value taken separately for that class:
P([x1, x2] | C) = P(x1 | C) * P(x2 | C)
 No dependence relation between attributes
 Greatly reduces the computation cost, only count the class
distribution.
 Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
Training dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)
Naïve Bayesian Classifier: Example
 Compute P(X|Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci): P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
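The same arithmetic in a few lines of Python, with all counts read off the training table above:

# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the class-conditional counts from the 14-tuple training table.
p_yes, p_no = 9 / 14, 5 / 14                              # class priors P(Ci)
px_yes = (2/9) * (4/9) * (6/9) * (6/9)                    # P(X | buys_computer = yes) ~ 0.044
px_no = (3/5) * (2/5) * (1/5) * (2/5)                     # P(X | buys_computer = no)  ~ 0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))   # 0.028 0.007 -> predict "yes"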
Naïve Bayesian Classifier
 Advantages :
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence , therefore loss of
accuracy
 Practically, dependencies exist among variables
 E.g., in hospital data, a patient’s profile (age, family history, etc.),
symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
are typically dependent
 Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
Bayesian Networks
 Bayesian belief network allows a subset of the variables to be
conditionally independent
 A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
[Figure: a four-node belief network with edges X→Z, Y→Z, and Y→P]
Nodes: random variables
Links: dependency
X,Y are the parents of Z, and Y is the
parent of P
No dependency between Z and P
Has no loops or cycles
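Concretely, the joint distribution factorizes over the graph: assuming X and Y themselves have no parents (as drawn), P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y).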
 Backpropagation: a neural network learning algorithm; the approach was
started by psychologists and neurobiologists to develop and test
computational analogues of neurons
 A neural network: A set of connected input/output units where each
connection has a weight associated with it
 During the learning phase, the network learns by adjusting the weights
so as to be able to predict the correct class label of the input tuples
 Also referred to as connectionist learning due to the connections
between units
 Analogy to Biological Systems (Indeed a great example of a good
learning system)
 The first learning algorithm came in 1959 (Rosenblatt), who suggested
that if a target output value is provided for a single neuron with fixed
inputs, one can incrementally change the weights so that the neuron learns
to produce these outputs, using the perceptron learning rule
Classification by Backpropagation
Advantages
• prediction accuracy is generally high
• robust, works when training examples contain errors
• output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
• fast evaluation of the learned target function
Criticism
• long training time
• difficult to understand the learned function (weights)
• not easy to incorporate domain knowledge
Neural Network as a Classifier
A Neuron (= a perceptron)
 The n-dimensional input vector x is
mapped into variable y by means
of the scalar product and a
nonlinear function mapping
[Figure: a single neuron (perceptron): input vector x = (x0, x1, …, xn) with weight vector w = (w0, w1, …, wn) feeds a weighted sum with bias \mu_k, followed by a nonlinear activation function f that produces the output y]
For example: y = \mathrm{sign}\left( \sum_{i=0}^{n} w_i x_i - \mu_k \right)
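A minimal Python sketch of such a unit, following the sign activation and bias mu_k in the formula above (the example numbers are arbitrary):

# A single perceptron-style unit: weighted sum of the inputs, then a sign activation.
def perceptron(x, w, mu_k):
    """y = sign(sum_i w_i * x_i - mu_k)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if s >= 0 else -1

print(perceptron(x=[1.0, 0.5, -0.2], w=[0.4, 0.3, 0.9], mu_k=0.2))   # 1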
Multi-Layer Perceptron
[Figure: a multi-layer perceptron with input nodes (input vector xi), hidden nodes, and output nodes producing the output vector; wij is the weight on the connection from unit i to unit j]
I_j = \sum_i w_{ij} O_i + \theta_j
O_j = \frac{1}{1 + e^{-I_j}}
Err_j = O_j (1 - O_j)(T_j - O_j) \quad \text{(output unit)}
Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk} \quad \text{(hidden unit)}
w_{ij} = w_{ij} + (l)\, Err_j\, O_i
\theta_j = \theta_j + (l)\, Err_j
Network Training
 The ultimate objective of training
 obtain a set of weights that makes almost all the
tuples in the training data classified correctly
 Steps
 Initialize weights with random values
 Feed the input tuples into the network one by one
 For each unit
 Compute the net input to the unit as a linear combination
of all the inputs to the unit
 Compute the output value using the activation function
 Compute the error
 Update the weights and the bias
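A compact NumPy sketch of one such training pass for a single input tuple and one hidden layer, following the I_j, O_j, Err_j, and update equations given earlier; the 2-3-1 network size, random initial weights, and learning rate l are illustrative assumptions:

# One backpropagation pass over a single training tuple, for a small 2-3-1 network.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, target = np.array([0.5, 0.2]), np.array([1.0])    # one input tuple and its target label
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)        # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)        # hidden -> output weights and biases
l = 0.1                                              # learning rate

# Forward pass: I_j = sum_i w_ij O_i + theta_j, then O_j = 1 / (1 + e^(-I_j))
O1 = sigmoid(x @ W1 + b1)
O2 = sigmoid(O1 @ W2 + b2)

# Backward pass: Err_j = O_j (1 - O_j)(T_j - O_j) at the output unit,
# Err_j = O_j (1 - O_j) sum_k Err_k w_jk at each hidden unit
err2 = O2 * (1 - O2) * (target - O2)
err1 = O1 * (1 - O1) * (W2 @ err2)

# Updates: w_ij = w_ij + l * Err_j * O_i, theta_j = theta_j + l * Err_j
W2 += l * np.outer(O1, err2); b2 += l * err2
W1 += l * np.outer(x, err1);  b1 += l * err1
print("output:", O2, "output error:", err2)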
  • 38. Network Training  The ultimate objective of training  obtain a set of weights that makes almost all the tuples in the training data classified correctly  Steps  Initialize weights with random values  Feed the input tuples into the network one by one  For each unit  Compute the net input to the unit as a linear combination of all the inputs to the unit  Compute the output value using the activation function  Compute the error  Update the weights and the bias