Classification and Prediction
-Sahil Kumar Singh
Classification and Prediction
 What is classification? What is prediction?
 Issues regarding classification and prediction
 Classification by decision tree induction
 Bayesian Classification
 Classification by Back Propagation
 Classification:
 is a form of data analysis that extracts models (classifiers)
describing important data classes
 classifies the data into two or more categories based on
class labels (Yes/No, Positive/Negative, etc.)
 E.g. bank loan approval to customers based on the
customer’s age
 Prediction:
 applies the trained model to data whose outcome is not yet known to the model
 models continuous-valued functions, i.e., predicts
unknown or missing values
 E.g. fraud detection, medical diagnosis
Classification vs. Prediction
Classification—A Two-Step Process
 Model construction: describing a set of predetermined classes
 Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
 The set of tuples used for model construction is training set
 The model is represented as classification rules, decision trees,
or mathematical formulae
 Model usage: using the model to classify future data tuples whose class label is unknown
 Estimate accuracy of the model
 The known label of test sample is compared with the
classified result from the model
 Accuracy rate is the percentage of test set samples that are
correctly classified by the model
 Test set is independent of training set to judge the model’s
accuracy, otherwise over-fitting will occur
 If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
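A minimal Python sketch of the two steps, using the tenure example shown on the next two slides (scikit-learn is assumed to be available; the numeric encoding of rank is ours, not part of the original slides):

# Sketch of the two-step process on the tenure example from the next two slides.
# rank is encoded as 0 = Assistant Prof, 1 = Associate Prof, 2 = Professor.
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction from the training set
X_train = [[0, 3], [0, 7], [2, 2], [1, 7], [0, 6], [1, 3]]   # (rank, years)
y_train = ['no', 'yes', 'yes', 'yes', 'no', 'no']             # tenured
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Step 2: model usage - estimate accuracy on an independent test set,
# then classify an unseen tuple such as (Jeff, Professor, 4 years)
X_test = [[0, 2], [1, 7], [2, 5], [0, 7]]
y_test = ['no', 'no', 'yes', 'yes']
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
print("Tenured?", model.predict([[2, 4]]))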
Classification Process (1): Model
Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
The training data is fed to a classification algorithm (e.g., decision tree, Bayes’, backpropagation), which produces a classifier (model), such as:
IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’
Classification Process (2): Use the
Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Issues Regarding Classification and Prediction
(1): Data Preparation
 Data cleaning
 Preprocess data in order to reduce noise and handle
missing values
 Relevance analysis (feature/attribute selection)
 Remove the irrelevant or redundant attributes
 Data transformation
 Generalize and/or normalize data
Issues regarding classification and prediction
(2): Evaluating Classification Methods
 Predictive accuracy
 Speed and scalability
 time to construct the model
 time to use the model
 Robustness
 handling noise and missing values
 Scalability
 efficiency in disk-resident databases
 Interpretability:
 understanding and insight provided by the model
 Goodness of rules
 decision tree size
 compactness of classification rules
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-conquer
manner
 At start, all the training examples are at the root
 Attributes are categorical (if continuous-valued, they are
discretized in advance)
 Examples are partitioned recursively based on selected attributes
 Test attributes are selected on the basis of a heuristic or statistical
measure (e.g., information gain)
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
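The same greedy procedure can be sketched in a few lines of Python. This is illustrative pseudocode rather than the textbook's exact algorithm: samples are assumed to be (attribute-dict, class-label) pairs, and select_best is assumed to implement an attribute selection measure such as information gain.

# Illustrative sketch of greedy, top-down decision tree induction (not a full implementation).
from collections import Counter

def build_tree(samples, attributes, select_best):
    """samples: list of (attribute_dict, class_label); select_best: attribute selection measure."""
    labels = [label for _, label in samples]
    if len(set(labels)) == 1:                 # all samples at this node share one class
        return labels[0]
    if not attributes:                        # no attributes left: majority voting at the leaf
        return Counter(labels).most_common(1)[0][0]
    best = select_best(samples, attributes)   # e.g. attribute with the highest information gain
    subtree = {}
    for value in {attrs[best] for attrs, _ in samples}:   # partition on the chosen attribute
        subset = [(a, c) for a, c in samples if a[best] == value]
        subtree[value] = build_tree(subset, [a for a in attributes if a != best], select_best)
    return (best, subtree)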
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 S contains Si tuples of class Ci for i = {1, …, m}
 information measures info required to classify any arbitrary tuple:
I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}
 entropy of attribute A with values {a1, a2, …, av}:
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})
 information gained by branching on attribute A:
Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A)
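These three quantities are straightforward to compute; a small Python sketch (class counts passed as plain lists, base-2 logarithms):

# Expected information (entropy) and information gain, with class counts as lists.
from math import log2

def info(counts):
    """I(s1, ..., sm) = -sum_i (si/s) log2(si/s)."""
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

def gain(total_counts, branch_counts):
    """Gain(A) = I(s1,...,sm) - E(A); branch_counts lists the class counts in each branch of A."""
    s = sum(total_counts)
    e_a = sum(sum(b) / s * info(b) for b in branch_counts)
    return info(total_counts) - e_a

# Example: the 14-tuple buys_computer data (9 yes, 5 no) split on age
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247, the gain of age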
Classification by decision tree induction:
Training Dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
This
follows an
example
from
Quinlan’s
ID3
Attribute Selection by Information Gain
Computation
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Here, buys_computer is the class label used to build the decision tree.
Class P: buys_computer = “yes” → 9 tuples
Class N: buys_computer = “no” → 5 tuples
Now, compute the expected information needed to classify the entire table:
I(p, n) = I(9, 5) = 0.940
Attribute Selection by Information
Gain Computation
 Compute the entropy for age:
E(age) = 5/14 x 0.970 + 4/14 x 0 + 5/14 x 0.970 = 0.692
 Now, compute the gain for age w.r.t. T:
Gain(T, age) = 0.940 - 0.692 = 0.248
Similarly,
Gain(T, income) = 0.029
Gain(T, student) = 0.151
Gain(T, credit_rating) = 0.048
Age has the highest gain, so it is chosen as the root of the tree.
age Pi ni I(Pi,ni)
<=30 2 3 0.970
30…40 4 0 0
>40 3 2 0.970
[Partial decision tree: root node age? with branches <=30, 31…40, and >40]
Attribute Selection by Information Gain
Computation
Entropy of income:
E(income) = 2/5 x 0 + 2/5 x 1 + 1/5 x 0 = 0.4
Now, compute the gain for income w.r.t. T<=30:
Gain(T<=30, income) = 0.970 - 0.4 = 0.57
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
income Pi ni I(Pi,ni)
high 0 2 0
medium 1 1 1
low 1 0 0
P:2, n:3
I(T<=30)=0.970
Attribute Selection by Information Gain
Computation
Entropy of student:
E(student) = 3/5 x 0 + 2/5 x 0 = 0
Now, compute the gain for student w.r.t. T<=30:
Gain(T<=30, student) = 0.970 - 0 = 0.970
student Pi ni I(Pi,ni)
no 0 3 0
yes 2 0 0
P:2, n:3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
I(T<=30)=0.970
Attribute Selection by Information
Gain Computation
Entropy of credit_rating:
E(credit_rating) = 3/5 x 0.918 + 2/5 x 1 = 0.951
Now, compute the gain for credit_rating w.r.t. T<=30:
Gain(T<=30, credit_rating) = 0.970 - 0.951 ≈ 0.02
credit_rating Pi ni I(Pi,ni)
fair 1 2 0.918
excellent 1 1 1
P:2, n:3
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
<=30 medium no fair no
<=30 low yes fair yes
<=30 medium yes excellent yes
I(T<=30)=0.970
Attribute Selection by
Information Gain
Gain(T<=30, income) = 0.57
Gain(T<=30, student) = 0.970
Gain(T<=30, credit_rating) = 0.02
Student has the highest gain, so the student attribute becomes the test node under the <=30 branch. For the 31…40 partition, every tuple has buys_computer = ‘yes’, so that branch becomes a leaf labelled ‘yes’.
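The three gains for the age <= 30 partition can be checked directly with a few lines of Python (a self-contained re-computation of the entropies above):

# Self-contained check of the three gains for the age <= 30 partition (2 yes, 3 no).
from math import log2

def info(counts):
    s = sum(counts)
    return -sum(c / s * log2(c / s) for c in counts if c)

i_t = info([2, 3])                                                      # 0.971
e_income = 2/5 * info([0, 2]) + 2/5 * info([1, 1]) + 1/5 * info([1, 0])
e_student = 3/5 * info([0, 3]) + 2/5 * info([2, 0])
e_credit = 3/5 * info([1, 2]) + 2/5 * info([1, 1])
print(round(i_t - e_income, 3))    # income        ~0.571
print(round(i_t - e_student, 3))   # student       ~0.971
print(round(i_t - e_credit, 3))    # credit_rating ~0.020 -> student wins under <= 30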
age income student credit_rating buys_computer
31…40 high no fair yes
31…40 low yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
Attribute Selection by Information
Gain Computation
Entropy of credit_rating:
E(credit_rating) = 3/5 x 0 + 2/5 x 0 = 0
Now, compute the gain for credit_rating w.r.t. T>40:
Gain(T>40, credit_rating) = 0.970 - 0 = 0.970
credit_rating Pi ni I(Pi,ni)
fair 3 0 0
excellent 0 2 0
P:3, n:2
age income student credit_rating buys_computer
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
>40 medium yes fair yes
>40 medium no excellent no
I(T>40)=0.970
Attribute Selection by Information Gain
Computation
Entropy of income:
E(income) = 0 + 3/5 x 0.918 + 2/5 x 1 = 0.951
Now, compute the gain for income w.r.t. T>40:
Gain(T>40, income) = 0.970 - 0.951 = 0.019
income Pi ni I(Pi,ni)
high 0 0 0
medium 2 1 0.918
low 1 1 1
P:3, n:2
age income student credit_rating buys_computer
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
>40 medium yes fair yes
>40 medium no excellent no
I(T>40)=0.970
Output: A Decision Tree for “buys_computer”
[Final decision tree: root node age?; the <=30 branch leads to student? (no → no, yes → yes); the 31…40 branch is a leaf labelled yes; the >40 branch leads to credit_rating? (excellent → no, fair → yes)]
Gain(T>40, credit_rating)=0.970
Gain(T>40, income)=0.019
So, we put the credit_rating attribute in our tree under the >40 branch. This gives the final decision tree for buys_computer.
Other Attribute Selection Measures
 Gini index (CART, IBM IntelligentMiner)
 All attributes are assumed continuous-valued
 Assume there exist several possible split values for
each attribute
 May need other tools, such as clustering, to get the
possible split values
 Can be modified for categorical attributes
Gini Index (IBM IntelligentMiner)
 If a data set T contains examples from n classes, gini index,
gini(T) is defined as
where pj is the relative frequency of class j in T.
 If a data set T is split into two subsets T1 and T2 with sizes
N1 and N2 respectively, the gini index of the split data
contains examples from n classes, the gini index gini(T) is
defined as
 The attribute that provides the smallest ginisplit(T) is chosen to
split the node (need to enumerate all possible splitting
points for each attribute).
gini(T) = 1 - \sum_{j=1}^{n} p_j^2
gini_{split}(T) = \frac{N_1}{N}\, gini(T_1) + \frac{N_2}{N}\, gini(T_2)
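A brief Python sketch of both quantities (class counts passed as lists; the example split is ours, chosen only to exercise the formulas):

# Gini index of a node and of a binary split.
def gini(counts):
    """gini(T) = 1 - sum(pj^2), given the class counts in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(counts1, counts2):
    """Weighted Gini index of splitting T into subsets T1 and T2."""
    n1, n2 = sum(counts1), sum(counts2)
    n = n1 + n2
    return n1 / n * gini(counts1) + n2 / n * gini(counts2)

# Example: splitting the 14-tuple buys_computer data (9 yes, 5 no)
# into age <= 30 (2 yes, 3 no) and the rest (7 yes, 2 no)
print(round(gini_split([2, 3], [7, 2]), 3))   # ~0.394, vs gini([9, 5]) ~0.459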
Extracting Classification Rules from Trees
 Represent the knowledge in the form of IF-THEN rules
 One rule is created for each path from the root to a leaf
 Each attribute-value pair along a path forms a conjunction
 The leaf node holds the class prediction
 Rules are easier for humans to understand
 Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “yes”
Avoid Overfitting in Classification
 Overfitting: An induced tree may overfit the training data
 Too many branches, some may reflect anomalies due
to noise or outliers
 Poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early—do not split a
node if this would result in the goodness measure
falling below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown”
tree—get a sequence of progressively pruned trees
 Use a set of data different from the training data to
decide which is the “best pruned tree”
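As one concrete way to realize postpruning, the sketch below uses scikit-learn's cost-complexity pruning on synthetic data and picks the pruned tree that scores best on a held-out validation set. The dataset, sizes, and parameters are illustrative assumptions, not part of the original slides.

# Illustrative postpruning via cost-complexity pruning (scikit-learn);
# a held-out validation set picks the "best pruned tree".
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Grow a full tree, then evaluate a sequence of progressively pruned trees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("validation accuracy of best pruned tree:", best.score(X_val, y_val))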
Why to Use Decision Tree Induction for
Classification?
 Why decision tree induction in data mining?
 relatively faster learning speed (than other classification
methods)
 convertible to simple and easy to understand
classification rules
 can use SQL queries for accessing databases
 comparable classification accuracy with other methods
Bayesian Classification: Why?
 Probabilistic learning: Calculate explicit probabilities for
hypothesis, among the most practical approaches to
certain types of learning problems
 Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is
correct. Prior knowledge can be combined with observed
data.
 Probabilistic prediction: Predict multiple hypotheses,
weighted by their probabilities
 Standard: Even when Bayesian methods are
computationally intractable, they can provide a standard of
optimal decision making against which other methods can
be measured
Bayesian Theorem: Basics
 Let X be a data sample whose class label is unknown
 Let H be a hypothesis that X belongs to class C
 For classification problems, determine P(H|X): the
probability that the hypothesis holds given the observed
data sample X
 P(H): prior probability of hypothesis H (i.e. the initial
probability before we observe any data, reflects the
background knowledge)
 P(X): probability that sample data is observed
 P(X|H) : probability of observing the sample X, given that
the hypothesis holds
Bayesian Theorem
 Given training data X, the posterior probability of a hypothesis
H, P(H|X), follows Bayes’ theorem
 Informally, this can be written as
posterior = likelihood x prior / evidence
 MAP (maximum posteriori) hypothesis
 Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
P(H|X) = \frac{P(X|H)\, P(H)}{P(X)}
h_{MAP} = \arg\max_{h \in H} P(h|D) = \arg\max_{h \in H} P(D|h)\, P(h)
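A toy numeric illustration of the rule; the prior and likelihoods below are made-up numbers, used only to show the arithmetic:

# Toy Bayes rule computation with made-up probabilities.
p_h = 0.3                   # prior P(H)
p_x_given_h = 0.8           # likelihood P(X|H)
p_x_given_not_h = 0.2       # P(X | not H)
p_x = p_x_given_h * p_h + p_x_given_not_h * (1 - p_h)   # evidence P(X), by total probability
posterior = p_x_given_h * p_h / p_x                     # P(H|X) = P(X|H) P(H) / P(X)
print(round(posterior, 3))                              # 0.632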
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally
independent:
 The joint probability of, say, two attribute values x1 and x2,
given that the current class is C, is the product of the
probabilities of each value taken separately for that class:
P([x1, x2] | C) = P(x1 | C) * P(x2 | C)
 No dependence relation between attributes
 Greatly reduces the computation cost, only count the class
distribution.
 Once the probability P(X|Ci) is known, assign X to the
class with maximum P(X|Ci)*P(Ci)
P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)
Training dataset
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’
Data sample:
X = (age<=30, income=medium, student=yes, credit_rating=fair)
Naïve Bayesian Classifier: Example
 Compute P(X|Ci) for each class
P(age=“<=30” | buys_computer=“yes”) = 2/9 = 0.222
P(age=“<=30” | buys_computer=“no”) = 3/5 = 0.6
P(income=“medium” | buys_computer=“yes”) = 4/9 = 0.444
P(income=“medium” | buys_computer=“no”) = 2/5 = 0.4
P(student=“yes” | buys_computer=“yes”) = 6/9 = 0.667
P(student=“yes” | buys_computer=“no”) = 1/5 = 0.2
P(credit_rating=“fair” | buys_computer=“yes”) = 6/9 = 0.667
P(credit_rating=“fair” | buys_computer=“no”) = 2/5 = 0.4
X = (age<=30, income=medium, student=yes, credit_rating=fair)
P(X|Ci): P(X|buys_computer=“yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer=“no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci): P(X|buys_computer=“yes”) * P(buys_computer=“yes”) = 0.028
P(X|buys_computer=“no”) * P(buys_computer=“no”) = 0.007
X belongs to class “buys_computer=yes”
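The same arithmetic in a few lines of Python, with all counts read off the training table above:

# Naive Bayes decision for X = (age<=30, income=medium, student=yes, credit_rating=fair),
# using the class-conditional counts from the 14-tuple training table.
p_yes, p_no = 9 / 14, 5 / 14                              # class priors P(Ci)
px_yes = (2/9) * (4/9) * (6/9) * (6/9)                    # P(X | buys_computer = yes) ~ 0.044
px_no = (3/5) * (2/5) * (1/5) * (2/5)                     # P(X | buys_computer = no)  ~ 0.019
print(round(px_yes * p_yes, 3), round(px_no * p_no, 3))   # 0.028 0.007 -> predict "yes"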
Naïve Bayesian Classifier
 Advantages :
 Easy to implement
 Good results obtained in most of the cases
 Disadvantages
 Assumption: class conditional independence , therefore loss of
accuracy
 Practically, dependencies exist among variables
 E.g., in hospital data, a patient’s profile (age, family history, etc.),
symptoms (fever, cough, etc.), and diseases (lung cancer, diabetes, etc.)
are typically dependent
 Dependencies among these cannot be modeled by Naïve Bayesian
Classifier
 How to deal with these dependencies?
 Bayesian Belief Networks
Bayesian Networks
 Bayesian belief network allows a subset of the variables to be
conditionally independent
 A graphical model of causal relationships
 Represents dependency among the variables
 Gives a specification of joint probability distribution
[Figure: a four-node belief network with edges X→Z, Y→Z, and Y→P]
Nodes: random variables
Links: dependency
X,Y are the parents of Z, and Y is the
parent of P
No dependency between Z and P
Has no loops or cycles
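Concretely, the joint distribution factorizes over the graph: assuming X and Y themselves have no parents (as drawn), P(X, Y, Z, P) = P(X) P(Y) P(Z | X, Y) P(P | Y).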
 Backpropagation: a neural network learning algorithm; the approach was
started by psychologists and neurobiologists to develop and test
computational analogues of neurons
 A neural network: A set of connected input/output units where each
connection has a weight associated with it
 During the learning phase, the network learns by adjusting the weights
so as to be able to predict the correct class label of the input tuples
 Also referred to as connectionist learning due to the connections
between units
 Analogy to Biological Systems (Indeed a great example of a good
learning system)
 The first learning algorithm came in 1959 (Rosenblatt), who suggested
that if a target output value is provided for a single neuron with fixed
inputs, one can incrementally change the weights so that the neuron learns
to produce these outputs, using the perceptron learning rule
Classification by Backpropagation
Advantages
• prediction accuracy is generally high
• robust, works when training examples contain errors
• output may be discrete, real-valued, or a vector of several
discrete or real-valued attributes
• fast evaluation of the learned target function
Criticism
• long training time
• difficult to understand the learned function (weights)
• not easy to incorporate domain knowledge
Neural Network as a Classifier
A Neuron (= a perceptron)
 The n-dimensional input vector x is
mapped into variable y by means
of the scalar product and a
nonlinear function mapping
[Figure: a single neuron (perceptron): input vector x = (x0, x1, …, xn) with weight vector w = (w0, w1, …, wn) feeds a weighted sum with bias \mu_k, followed by a nonlinear activation function f that produces the output y]
For example: y = \mathrm{sign}\left( \sum_{i=0}^{n} w_i x_i - \mu_k \right)
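A minimal Python sketch of such a unit, following the sign activation and bias mu_k in the formula above (the example numbers are arbitrary):

# A single perceptron-style unit: weighted sum of the inputs, then a sign activation.
def perceptron(x, w, mu_k):
    """y = sign(sum_i w_i * x_i - mu_k)."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - mu_k
    return 1 if s >= 0 else -1

print(perceptron(x=[1.0, 0.5, -0.2], w=[0.4, 0.3, 0.9], mu_k=0.2))   # 1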
Multi-Layer Perceptron
[Figure: a multi-layer perceptron with input nodes (input vector xi), hidden nodes, and output nodes producing the output vector; wij is the weight on the connection from unit i to unit j]
I_j = \sum_i w_{ij} O_i + \theta_j
O_j = \frac{1}{1 + e^{-I_j}}
Err_j = O_j (1 - O_j)(T_j - O_j) \quad \text{(output unit)}
Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk} \quad \text{(hidden unit)}
w_{ij} = w_{ij} + (l)\, Err_j\, O_i
\theta_j = \theta_j + (l)\, Err_j
Network Training
 The ultimate objective of training
 obtain a set of weights that makes almost all the
tuples in the training data classified correctly
 Steps
 Initialize weights with random values
 Feed the input tuples into the network one by one
 For each unit
 Compute the net input to the unit as a linear combination
of all the inputs to the unit
 Compute the output value using the activation function
 Compute the error
 Update the weights and the bias
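A compact NumPy sketch of one such training pass for a single input tuple and one hidden layer, following the I_j, O_j, Err_j, and update equations given earlier; the 2-3-1 network size, random initial weights, and learning rate l are illustrative assumptions:

# One backpropagation pass over a single training tuple, for a small 2-3-1 network.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

x, target = np.array([0.5, 0.2]), np.array([1.0])    # one input tuple and its target label
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)        # input -> hidden weights and biases
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)        # hidden -> output weights and biases
l = 0.1                                              # learning rate

# Forward pass: I_j = sum_i w_ij O_i + theta_j, then O_j = 1 / (1 + e^(-I_j))
O1 = sigmoid(x @ W1 + b1)
O2 = sigmoid(O1 @ W2 + b2)

# Backward pass: Err_j = O_j (1 - O_j)(T_j - O_j) at the output unit,
# Err_j = O_j (1 - O_j) sum_k Err_k w_jk at each hidden unit
err2 = O2 * (1 - O2) * (target - O2)
err1 = O1 * (1 - O1) * (W2 @ err2)

# Updates: w_ij = w_ij + l * Err_j * O_i, theta_j = theta_j + l * Err_j
W2 += l * np.outer(O1, err2); b2 += l * err2
W1 += l * np.outer(x, err1);  b1 += l * err1
print("output:", O2, "output error:", err2)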
  • 38. Network Training  The ultimate objective of training  obtain a set of weights that makes almost all the tuples in the training data classified correctly  Steps  Initialize weights with random values  Feed the input tuples into the network one by one  For each unit  Compute the net input to the unit as a linear combination of all the inputs to the unit  Compute the output value using the activation function  Compute the error  Update the weights and the bias