1
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
2
Supervised vs. Unsupervised Learning
 Supervised learning (classification) or Predictive Mining
 Supervision: The training data (observations, past
experience, etc.) has labels indicating the class of the
observations
 New data is classified based on the training set
 Unsupervised learning (clustering) or Descriptive Mining
 Class labels are not assigned to training data instances
 Given a set of measurements, observations, etc. discovers
patterns with the aim of grouping similar instances to form
clusters
3
 Classification
 Classification aims to predict categorical class labels
 constructs a predictive model based on the descriptions of
training instances and their class labels and uses it for
classifying new data
 Regression
 Predicts numeric values of a dependent attribute in terms
of one or more independent predictor attributes
 models continuous-valued functions
 Used for estimating unknown or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent or genuine
 Cost assessment of properties
Prediction Problems: Classification vs. Regression
4
Classification—A Two-Step Process
 Model construction:
 Each tuple/sample is assumed to belong to a predefined class, as specified by
the class label attribute
 The set of labeled tuples is partitioned into training and test sets
 The training tuples are used for model construction and refinement
 The model is represented as classification rules, decision trees, SVM, ANN, etc
 Model usage: After validating the model, it can be used for classifying future or
unknown objects
 Estimate accuracy of the model
 The known (true) label of test sample is compared with the predicted label
given by the classification model
 Accuracy rate is the percentage of test set samples that are correctly
classified by the model
 Test-set accuracy reflects the model's generalization performance on unseen data
 If the accuracy is acceptable, use the model to classify new data
 Part of the training set, called the validation set, is used to select model
hyperparameters during refinement to achieve good generalization performance.
5
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
6
Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
7
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
8
Decision Tree Induction: An Example
age?
  <=30   → student?   no → no;  yes → yes
  31..40 → yes
  >40    → credit rating?   excellent → no;  fair → yes
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
 Training data set: Buys_computer
 Attribute ‘age’ is discretized
 Quinlan’s ID3 algorithm learns the
Decision tree model
 Resulting tree:
9
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 If attributes are continuous-valued, they are discretized in
advance so that all attributes are categorical
 Attributes are selected for splitting the data on the basis of a
heuristic or statistical measure (e.g., information gain)
 Examples are partitioned recursively based on selected
attributes
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Brief Review of Entropy
[Figure: entropy curve for the two-class case, m = 2]
10
11
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci|/|D|
 Expected information (entropy) needed to classify a tuple in D:
 Information needed after using A to split D into v partitions
(conditional entropy with A) to classify D:
 Information gained by splitting D on attribute A
Info(D) = − Σ_{i=1..m} p_i log2(p_i)
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
Gain(A) = Info(D) − Info_A(D)
12
Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
The term (5/14) I(2,3) means "age <=30" has 5 out of 14 samples,
with 2 yes's and 3 no's; similarly for the other terms.
age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) − Info_age(D) = 0.246
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Similarly the information gain for the other attributes will be estimated as
Gain(income)=0.029; Gain(student)=0.151; Gain(credit_rating)=0.048;
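The following is a minimal sketch of how Info(D) and Gain(A) can be computed for categorical attributes, assuming each tuple is a dict keyed by attribute name (the row representation and function names are illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    labels = [r[target] for r in rows]
    info_d = entropy(labels)
    info_a = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        info_a += (len(subset) / len(rows)) * entropy(subset)
    return info_d - info_a
```

On the buys_computer table above, this reproduces Info(D) ≈ 0.940, Info_age(D) ≈ 0.694 and Gain(age) ≈ 0.246.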
13
Determining Best Split Point for
Numerical Attributes
 Let attribute A be a continuous-valued attribute whose range is split
into two for partitioning the data during decision tree construction
 Binary Split: D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
 Steps to determine the best split point for A
 Sort the instances based on the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values whose
class labels differ is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum (conditional entropy) expected
information requirement is selected as the best split-point for A
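A short sketch of this split-point search, reusing the entropy() helper sketched earlier; values is assumed to be a list of (numeric_value, class_label) pairs, an illustrative representation:

```python
def best_split_point(values):
    """Return the midpoint with minimum expected information (conditional entropy)."""
    values = sorted(values)                       # sort by attribute value
    labels = [lbl for _, lbl in values]
    best_point, best_info = None, float("inf")
    for i in range(len(values) - 1):
        if labels[i] == labels[i + 1]:            # candidates lie where labels change
            continue
        midpoint = (values[i][0] + values[i + 1][0]) / 2.0
        left, right = labels[: i + 1], labels[i + 1:]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if info < best_info:
            best_point, best_info = midpoint, info
    return best_point
```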
14
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of splits
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.
 gain_ratio(income) = 0.029/1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the
splitting attribute
SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
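A corresponding sketch of the gain-ratio computation, again assuming the dict-of-attributes row representation and the info_gain() helper from the earlier sketch:

```python
from collections import Counter
from math import log2

def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    total = len(rows)
    return -sum((n / total) * log2(n / total)
                for n in Counter(r[attr] for r in rows).values())

def gain_ratio(rows, attr, target):
    si = split_info(rows, attr)
    return info_gain(rows, attr, target) / si if si > 0 else 0.0
```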
15
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Complex models are developed with many branches often
representing specific noisy instances with no significance for
generalization
 Results in poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a node if its
entropy / uncertainty measure falls below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a (pruning) set of data different from the training data to
decide which is the “best pruned tree”
16
Merits of Classification with Decision Trees
 can use SQL queries for accessing databases
 relatively faster learning speed on memory-
resident training sets.
 convertible to simple and easy to understand
classification rules
 Achieves classification accuracy comparable
with other methods
 AVC-sets (Attribute-Value, Class label) are
maintained for each attribute at each tree node
during splitting, so the method can adapt to the
available memory and scale to very large
training sets.
17
Presentation of Classification Results
18
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
19
Prediction Based on Bayes’ Theorem
 Given training data X, posteriori probability of a hypothesis H,
denoted by P(H|X), follows the Bayes’ theorem
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts the class label of X to be Ci iff the probability P(Ci|X) is the
highest among all P(Ck|X) for all k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant data collection and computational
costs.
P(H|X) = P(X|H) × P(H) / P(X)
20
Classification Problem Is to Identify the Class
with Maximum Posteriori Prob
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to identify the class with maximum posteriori,
i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
 Since P(X) is constant for all classes, for the purpose of
classification, it is enough to identify the class that has
maximum value of the numerator
 The class label of X is given by
P(Ci|X) = P(X|Ci) × P(Ci) / P(X)
Class label of X = argmax_i { P(X|Ci) × P(Ci) }
21
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent
(i.e., no dependence relation between attributes):
 This greatly reduces the information requirement and
computation cost
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for
Ak divided by |Ci,D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with mean μ and standard deviation σ
and P(xk|Ci) is estimated at Ak= xk in terms of µ and σ for Ci as
given below:
P(X|Ci) = Π_{k=1..n} P(x_k|Ci) = P(x_1|Ci) × P(x_2|Ci) × … × P(x_n|Ci)

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
22
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Priors, P(Ci):
P(buys_computer = “yes”)
= 9/14 = 0.643
P(buys_computer = “no”)
= 5/14= 0.357
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
23
Naïve Bayes Classifier: An Example
Priors, P(Ci):
P(buys_computer = “yes”) = 0.643
P(buys_computer = “no”) = 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
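The same computation can be written compactly in code. The sketch below assumes the training tuples are dicts of categorical attribute values (an illustrative representation) and estimates the priors and conditional probabilities by relative frequencies, as in the hand calculation above:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Estimate P(Ci) and the per-class value counts needed for P(xk|Ci)."""
    class_counts = Counter(r[target] for r in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[class][attr][value] = count
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond[r[target]][attr][value] += 1
    return priors, cond, class_counts

def classify(x, priors, cond, class_counts):
    """Return argmax_c P(c) * prod_k P(x_k | c)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for attr, value in x.items():
            # relative-frequency estimate; an unseen value gives probability 0
            # (see the Laplacian correction later)
            p *= cond[c][attr][value] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get), scores
```

For X = (age <= 30, income = medium, student = yes, credit_rating = fair) the scores come out at roughly 0.028 for "yes" and 0.007 for "no", matching the figures above.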
24
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise the likelihood estimate, being a product of
conditional probabilities, will be zero
 Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
P(X|Ci) = Π_{k=1..n} P(x_k|Ci)
25
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Accurate results obtained in most of the cases
 Disadvantages
 Relies on class conditional independence assumption:
Practically, dependencies exist among variables
 E.g., Symptoms: fever, cough, cold, body aches, etc.,
 Dependencies among these cannot be modeled by Naïve
Bayes Classifier.
 If the features are not independent, predictions are less
accurate
26
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
27
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 A rule R covers a tuple T if the precondition (antecedent) of R is satisfied by T
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| where D is the training data set
accuracy(R) = ncorrect / ncovers
 When classifying with rules, if multiple rules are applicable, conflict resolution
is called for.
 Size ordering: assign the highest priority to the rule that has the "toughest"
requirement (i.e., with the most attribute tests); more specific rules are preferred
 Class-based ordering: decreasing order of prevalence or misclassification cost
 Rule-based ordering (decision list): rules are organized into one long priority
list, according to some measure of rule quality or by experts
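A brief sketch of coverage and accuracy for a single rule, assuming the antecedent is represented as a dict of attribute = value tests (an illustrative encoding):

```python
def covers(antecedent, tuple_):
    """True if every attribute test of the rule holds for the tuple."""
    return all(tuple_.get(attr) == value for attr, value in antecedent.items())

def rule_coverage_accuracy(antecedent, consequent_class, data, target):
    covered = [t for t in data if covers(antecedent, t)]
    correct = [t for t in covered if t[target] == consequent_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy
```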
28
age?
  <=30   → student?   no → no;  yes → yes
  31..40 → yes
  >40    → credit rating?   excellent → no;  fair → yes
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Extraction from a Decision Tree
 Rules are easier to understand than large
trees
 One rule is created for each path from the
root to a leaf
 Each attribute-value pair along a path forms
a conjunction: the leaf holds the class
prediction
 Rules are mutually exclusive and exhaustive
29
Rule Induction: Sequential Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compared to decision-tree induction that learns a set of rules
simultaneously, sequential covering alg learns rules one-by-one.
30
Sequential Covering Algorithm
while (enough target tuples left)
generate a rule
remove positive target tuples satisfying this rule
[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
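A skeleton of this loop is sketched below. learn_one_rule() stands in for the greedy rule-growing step (e.g., guided by FOIL-gain) and is left unspecified; covers() is the helper sketched earlier:

```python
def sequential_covering(data, target, class_value, learn_one_rule):
    """Learn a rule list for one class, removing covered positives after each rule."""
    rules, remaining = [], list(data)
    while any(t[target] == class_value for t in remaining):
        rule = learn_one_rule(remaining, target, class_value)   # hypothetical helper
        if rule is None:                                        # no acceptable rule left
            break
        antecedent, _ = rule
        kept = [t for t in remaining
                if not (covers(antecedent, t) and t[target] == class_value)]
        if len(kept) == len(remaining):                         # rule covers no positives
            break
        rules.append(rule)
        remaining = kept
    return rules
```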
31
Rule Generation
 To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: rule growth from general to specific, e.g., A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, progressively separating the positive from the negative examples]
How to Learn-One-Rule?
32
33
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Learning by Backpropagation
 K-Nearest Neighbours classification
Classifier Evaluation Metrics: Confusion
Matrix (C1 as Positive class)
Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
 Given m classes, an entry, CMi,j in a confusion matrix indicates
# of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Confusion Matrix:
Actual class \ Predicted class   C1                      ¬ C1
C1                               True Positives (TP)     False Negatives (FN)
¬ C1                             False Positives (FP)    True Negatives (TN)
34
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
35
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
 Recall: completeness – what % of positive tuples did the
classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and
recall,
 Fß: weighted measure of precision and recall
 assigns ß times as much weight to recall as to precision
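A minimal sketch of these metrics from the confusion-matrix counts, using the usual (1 + β²)·P·R / (β²·P + R) form of the weighted F measure:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if (precision + recall) else 0.0)
    return precision, recall, f

# Cancer example on the next slide: precision_recall_f(90, 140, 210)
# -> precision ≈ 0.3913, recall = 0.30, F1 ≈ 0.3396
```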
36
Classifier Evaluation Metrics: Example
37
 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
Model Evaluation and Selection
 Use validation (test) set of class-labeled tuples instead of
training set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random sub-sampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
38
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sub-sampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the accuracies
obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
 At i-th iteration, use Di as test set and others as training set
 Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
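A short sketch of k-fold cross-validation; train_fn(train_rows) is assumed to return a model callable as model(tuple) -> predicted label (hypothetical interfaces):

```python
import random

def k_fold_accuracy(data, target, train_fn, k=10, seed=0):
    """Average accuracy over k mutually exclusive, roughly equal-sized folds."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)
        accuracies.append(sum(1 for t in test if model(t) == t[target]) / len(test))
    return sum(accuracies) / k
```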
39
Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is placed back into the pool and is
equally likely to be selected again
 A commonly used bootstrap method is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set form the test set. About 63.2% of the original tuples end up
in the bootstrap training set, and the remaining 36.8% form the test set
(since (1 − 1/d)^d ≈ e^−1 = 0.368)
 Repeat the sampling procedure k times, estimate the overall accuracy of the
model:
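A sketch of this estimate under the assumptions of the earlier sketches (train_fn hypothetical); the 0.632/0.368 weighting of the test-set and training-set accuracy follows the usual .632 bootstrap formulation:

```python
import random

def bootstrap_632_accuracy(data, target, train_fn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(data)

    def acc(model, rows):
        return sum(1 for t in rows if model(t) == t[target]) / len(rows)

    estimates = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]              # sample d times with replacement
        train = [data[i] for i in idx]
        chosen = set(idx)
        test = [data[i] for i in range(d) if i not in chosen]   # ~36.8% left out
        model = train_fn(train)
        estimates.append(0.632 * acc(model, test) + 0.368 * acc(model, train))
    return sum(estimates) / k
```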
40
Model Selection: ROC Curves
 ROC (Receiver Operating Characteristics)
curves: for visual comparison of probabilistic
classification models
 The plot shows the trade-off between the
true positive rate and the false positive rate
 Vertical axis represents the true positive rate,
TPR=TP/P, sensitivity or recall
 Horizontal axis rep. the false positive
rate,FPR=FP/N= (1-TNR) or (1-specificity)
 The area under the ROC curve is a measure
of the accuracy of the model
 The diagonal line corresponds to the
random guessing of class labels in balanced
datasets
 The model whose ROC is closer to the
diagonal line (i.e., the closer the area is to
0.5), is less accurate.
 A model with perfect accuracy will have an
area of 1.0
41
How to draw ROC
42
 Apply the model on test data to predict the probability of being
positive
 Rank the test tuples in the decreasing order of their probability of
being positive.
 Starting from the highest probability to accept the rank-1 tuple,
gradually reduce the threshold to accept more and more tuples as
positive and estimate TPR and FPR at each stage to plot the ROC.
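A sketch of this construction, assuming scored is a list of (probability_of_positive, true_label) pairs produced by the model on the test data:

```python
def roc_points(scored, positive_label):
    """Sweep the threshold from high to low and emit (FPR, TPR) points."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)   # decreasing probability
    p = sum(1 for _, y in ranked if y == positive_label)
    n = len(ranked) - p
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:                 # accept one more tuple as positive at each step
        if y == positive_label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points
```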
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
43
44
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
Ensemble Methods: Increasing the Accuracy
 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of
classifiers
 Boosting: weighted vote with a collection of classifiers
 Random Forest: majority voting among collection of base
classifiers built through randomly sampled attribute set 45
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority opinion / vote
 Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: to classify an unknown sample X
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the
majority votes to X
 Regression: can be applied to predict continuous-valued variables by taking the
average of the individual predictions for a given test tuple
 Accuracy
 Often significantly improves the accuracy of prediction compared to a single
classifier derived from D
 Noisy data: more robust to noise, since the final prediction is made by majority vote
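A minimal bagging sketch under the same hypothetical train_fn interface used earlier; the bagged classifier predicts by majority vote over the k models:

```python
import random
from collections import Counter

def bagging(data, train_fn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(data)
    models = []
    for _ in range(k):
        sample = [data[rng.randrange(d)] for _ in range(d)]   # bootstrap sample Di
        models.append(train_fn(sample))                       # model Mi
    def bagged_classifier(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]                     # majority vote
    return bagged_classifier
```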
46
Boosting
 Analogy: Consult several doctors, and decide based on a
combination of weighted diagnoses—weight assigned based on
the previous diagnosis accuracy
 How boosting works?
 Weights are assigned to each training tuple
 A series of k classifiers (that complement each other) is
iteratively learned
 After a classifier Mi is learned, the weights are updated to
build the subsequent classifier, Mi+1, paying more attention to
the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 Boosting algorithm can be extended for numeric prediction
 Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified or noisy data
47
48
Adaboost (Freund and Schapire, 1997)
 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all tuples are assigned the same weight which is equal to 1/d
 Generate k classifiers in k rounds. At round i,
 Tuples from D are sampled (with replacement) to form a training set
Di of the same size
 Each tuple’s chance of being selected is based on its weight
 A classification model Mi is derived from Di
 Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased, else it is decreased
 Error rate: err(Xj) is the misclassification error of a specific tuple Xj. The
error rate of classifier Mi is the weighted sum over the misclassified tuples:
error(Mi) = Σ_{j=1..d} w_j × err(Xj)
 The weight of classifier Mi's vote is
log( (1 − error(Mi)) / error(Mi) )
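A sketch of one AdaBoost round following these steps (train_fn hypothetical; the error is clamped to avoid division by zero, and the tuple weights are renormalized after the update):

```python
import math
import random

def adaboost_round(data, weights, target, train_fn, rng=None):
    rng = rng or random.Random(0)
    d = len(data)
    sample = rng.choices(data, weights=weights, k=d)        # weight-proportional sampling
    model = train_fn(sample)
    miss = [1.0 if model(t) != t[target] else 0.0 for t in data]
    error = sum(w * m for w, m in zip(weights, miss))       # weighted error of Mi
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = math.log((1 - error) / error)                   # weight of Mi's vote
    new_w = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]  # boost misclassified
    total = sum(new_w)
    return model, alpha, [w / total for w in new_w]
```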
Random Forest (Breiman 2001)
 Random Forest:
 Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
 During classification, each tree votes and the most popular class is
returned
 Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to full size
 Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes
(reduces the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to noise and outliers
 Faster than bagging or boosting
49
50
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
51
Classification by Backpropagation
 A neural network: A set of connected input/output units called
neurons where each connection has a weight associated with it.
 Extensively used for classification and regression tasks.
 During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of
training tuples
 Back propagation: A supervised algorithm for learning weights of
edges in a Feed Forward Neural Network.
 These weight adjustments are made in the backward direction
from output layer through each hidden layer down to the first
hidden layer and hence named as ‘Back propagation’.
52
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward network; the input vector X enters the input layer, passes through a hidden layer, and the output layer emits the output vector; each connection carries a weight w_ij]
Weight update: w_ij^(k+1) = w_ij^(k) + η (y_j − ŷ_j) x_i
53
Activity of a Neuron at Hidden/Output Layer
 An n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
 Hidden and output units receive inputs from the units in the previous layer. The
activations of the feeding units are propagated along the weighted edges and
summed; the bias associated with the unit is added, and a nonlinear activation
function is applied to determine the output of the unit.
[Figure: a single neuron; inputs x_0, x_1, …, x_n with weights w_0, w_1, …, w_n feed a weighted sum, a bias is added, and a nonlinear activation function f produces the output y]
Example: y = sign( Σ_{i=0..n} w_i x_i + bias )
54
How A Multi-Layer Neural Network Works
 The inputs to the network correspond to the attributes measured for each
training tuple
 Inputs are fed simultaneously into the units making up the input layer
 They are weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one
 The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
 Each neuron in the hidden layer as well as output layer has its own bias
which is learnt along with the weights of edges during training phase.
 The network is feed-forward: none of the weights cycles back to an input
unit or to a hidden unit of a previous layer
 From a statistical point of view, ANNs perform nonlinear regression or
probabilistic classification: Given enough hidden units and enough training
samples, they can closely approximate any function
55
Defining a Network Topology for
Classification / Regression
 Decide the network topology: Specify # of units in the input layer, #
of hidden layers, # of units in each hidden layer, and # of units in the
output layer
 One input unit for each descriptive feature
 One output unit for binary classification; for multi-class classification, the
number of output units equals the number of classes
 Experimentally select an appropriate number of hidden neurons depending on the
problem complexity
 Normalize the input values for each attribute in the training tuples
to [0.0—1.0] range
 Train the network applying a learning algorithm like Back
propagation to adjust the weights of edges for classification
 Once a network has been trained and if its accuracy is unacceptable,
repeat the training process with a different network topology or a
different set of initial weights
56
Backpropagation
 Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
 For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
 Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
 Steps
 Initialize weights to small random numbers, associated with biases
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error and update weights and biases
 Terminating condition (when error is very small, etc.)
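A compact sketch of this loop for a network with one hidden layer, using NumPy, sigmoid activations, and squared error; the architecture, learning rate, and epoch count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, n_hidden=4, lr=0.1, epochs=1000, seed=0):
    """X: (n_samples, n_inputs) scaled to [0, 1]; y: (n_samples, 1) with 0/1 labels."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, 1));          b2 = np.zeros(1)
    for _ in range(epochs):
        # propagate the inputs forward
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backpropagate the error (gradient of the squared error)
        err_out = (out - y) * out * (1 - out)
        err_hid = (err_out @ W2.T) * h * (1 - h)
        # update weights and biases, output layer first
        W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
        W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)
    return W1, b1, W2, b2
```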
57
Gradient Descent
[Figure: gradient descent on the error surface; the weights move in the direction of the negative gradient]
58
Strengths of ANN for classification
 Strength
 High tolerance to noisy data
 Ability to classify unknown instances accurately
 Well-suited for continuous-valued inputs and outputs
 Successful on real-world data, e.g., hand-written letters
 Algorithms are inherently parallel since the neurons in a
layer work independently
 Techniques have recently been developed for the
extraction of rules from trained neural networks
 Capability of ANNs is further extended through Deep
Learning
59
60
Drawbacks of Neural Network as a
Classifier
 Weakness
 Long training time
 Require a number of parameters typically
best determined empirically, e.g., the
network topology or “structure.”
 Poor interpretability: Difficult to interpret the
symbolic meaning behind the learned weights
and of “hidden units” in the network
61
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
62
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
 Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
accepting new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis space since
it uses many local linear functions to form an implicit global
approximation to the target function
 Eager: must commit to a single hypothesis that covers the
entire instance space
63
The k-Nearest Neighbor Algorithm
 All instances are preprocessed and mapped onto points in n-D space
 The nearest neighbors to a query point, Xq are identified from the
training instances sorted in the ascending order of their Euclidean
distances to Xq, dist(Xi, Xq)
 The target function may be discrete-valued (for classification tasks) or
real-valued (for regression tasks)
 For Classification tasks, k-NN returns the most common value (mode)
among the labels of the k training examples nearest to xq.
 Voronoi diagram: the decision surface induced by 1-NN for a typical
set of training examples
[Figure: a query point xq surrounded by positive (+) and negative (−) training examples in the feature space]
64
k-NN Algorithm for Regression
 k-NN for real-valued prediction for a given unknown tuple, Xq
 Returns the mean of the 𝑦𝑖 values of k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weigh the contribution of each of the k neighbors according
to their distance to the query xq
 Gives greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: the distance between neighbors can be
dominated by irrelevant attributes
 To overcome it, eliminate the least relevant attributes
Distance weight: w_i = 1 / d(x_q, x_i)²
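A minimal k-NN sketch covering both uses, assuming each training example is a (feature_vector, value) pair with numeric features:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, xq, k=3):
    """train: list of (feature_vector, label). Majority label of the k nearest."""
    nearest = sorted(train, key=lambda t: euclidean(t[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_regress(train, xq, k=3, weighted=False):
    """train: list of (feature_vector, y). Plain or distance-weighted mean of the k nearest."""
    nearest = sorted(train, key=lambda t: euclidean(t[0], xq))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    ws = [1.0 / (euclidean(x, xq) ** 2 + 1e-12) for x, _ in nearest]   # w_i = 1/d^2
    return sum(w * y for w, (_, y) in zip(ws, nearest)) / sum(ws)
```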

More Related Content

PPTX
unit classification.pptx
PPT
Unit 3classification
PPTX
Dataming-chapter-7-Classification-Basic.pptx
PPT
08 classbasic
PPT
08 classbasic
PDF
08 classbasic
PPT
Classification (ML).ppt
PPT
1791kjkljkljlkkljlkjkljlkkljlkjkjl9164.ppt
unit classification.pptx
Unit 3classification
Dataming-chapter-7-Classification-Basic.pptx
08 classbasic
08 classbasic
08 classbasic
Classification (ML).ppt
1791kjkljkljlkkljlkjkljlkkljlkjkjl9164.ppt

Similar to classification in data mining and data warehousing.pdf (20)

PPT
Chapter 08 ClassBasic.ppt file used for help
PPT
Chapter 8. Classification Basic Concepts.ppt
PPT
Data Mining Concepts and Techniques.ppt
PPT
Data Mining Concepts and Techniques.ppt
PPT
Data Mining
PPT
ClassificationOfMachineLearninginCSE.ppt
PPT
Chapter 08 Class_Basic.ppt DataMinning
PPT
Unit-4 classification
PPTX
Unit 4 Classification of data and more info on it
PPT
4_22865_IS465_2019_1__2_1_08ClassBasic.ppt
PDF
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
PPT
08ClassBasic.ppt
PPT
Basics of Classification.ppt
PPT
Cs501 classification prediction
PPT
Data Mining and Warehousing Concept and Techniques
PPT
Basic Concept of Classification - Data Mining
PPT
Classification Algorighms in Data Warehousing and Data Mininbg
PPT
08ClassBasic - Cosdfsdfadgádfádffádgádpy.ppt
PPTX
Machine learning Chapter three (16).pptx
PPT
classification in data warehouse and mining
Chapter 08 ClassBasic.ppt file used for help
Chapter 8. Classification Basic Concepts.ppt
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Data Mining
ClassificationOfMachineLearninginCSE.ppt
Chapter 08 Class_Basic.ppt DataMinning
Unit-4 classification
Unit 4 Classification of data and more info on it
4_22865_IS465_2019_1__2_1_08ClassBasic.ppt
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
08ClassBasic.ppt
Basics of Classification.ppt
Cs501 classification prediction
Data Mining and Warehousing Concept and Techniques
Basic Concept of Classification - Data Mining
Classification Algorighms in Data Warehousing and Data Mininbg
08ClassBasic - Cosdfsdfadgádfádffádgádpy.ppt
Machine learning Chapter three (16).pptx
classification in data warehouse and mining
Ad

Recently uploaded (20)

PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPT
Project quality management in manufacturing
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PPT on Performance Review to get promotions
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Digital Logic Computer Design lecture notes
DOCX
573137875-Attendance-Management-System-original
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPT
Mechanical Engineering MATERIALS Selection
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Project quality management in manufacturing
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT on Performance Review to get promotions
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
UNIT-1 - COAL BASED THERMAL POWER PLANTS
OOP with Java - Java Introduction (Basics)
Digital Logic Computer Design lecture notes
573137875-Attendance-Management-System-original
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Internet of Things (IOT) - A guide to understanding
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Mechanical Engineering MATERIALS Selection
Ad

classification in data mining and data warehousing.pdf

  • 1. 1 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 2. 2 Supervised vs. Unsupervised Learning  Supervised learning (classification) or Predictive Mining  Supervision: The training data (observations, past experience, etc.) has labels indicating the class of the observations  New data is classified based on the training set  Unsupervised learning (clustering) or Descriptive Mining  Class labels are not assigned to training data instances  Given a set of measurements, observations, etc. discovers patterns with the aim of grouping similar instances to form clusters
  • 3. 3  Classification  Classification aims to predict categorical class labels  constructs a predictive model based on the descriptions of training instances and their class labels and uses it for classifying new data  Regression  Predicts numeric values of a dependent attribute in terms of one or more independent predictor attributes  models continuous-valued functions  Used for estimating unknown or missing values  Typical applications  Credit/loan approval:  Medical diagnosis: if a tumor is cancerous or benign  Fraud detection: if a transaction is fraudulent or genuine  Cost assessment of properties Prediction Problems: Classification vs. Regression
  • 4. 4 Classification—A Two-Step Process  Model construction:  Each tuple/sample is assumed to belong to a predefined class, as specified by the class label attribute  Set of labeled tuples are partitioned into training and test sets  The training set of tuples are used for model construction and refinement  The model is represented as classification rules, decision trees, SVM, ANN, etc  Model usage: After validating the model, it can be used for classifying future or unknown objects  Estimate accuracy of the model  The known (true) label of test sample is compared with the predicted label given by the classification model  Accuracy rate is the percentage of test set samples that are correctly classified by the model  Test set Accuracy reflects its generalization performance on unknown data  If the accuracy is acceptable, use the model to classify new data  Part of the training set called validation set is used to select model hyper parameters during refinement to achieve optimal generalization performance..
  • 5. 5 Process (1): Model Construction Training Data NAME RANK YEARS TENURED Mike Assistant Prof 3 no Mary Assistant Prof 7 yes Bill Professor 2 yes Jim Associate Prof 7 yes Dave Assistant Prof 6 no Anne Associate Prof 3 no Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Classifier (Model)
  • 6. 6 Process (2): Using the Model in Prediction Classifier Testing Data NAME RANK YEARS TENURED Tom Assistant Prof 2 no Merlisa Associate Prof 7 no George Professor 5 yes Joseph Assistant Prof 7 yes Unseen Data (Jeff, Professor, 4) Tenured?
  • 7. 7 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 8. 8 Decision Tree Induction: An Example age? overcast student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no  Training data set: Buys_computer  Attribute ‘age’ is discretized  Quinlan’s ID3 algorithm learns the Decision tree model  Resulting tree:
  • 9. 9 Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm)  Tree is constructed in a top-down recursive divide-and- conquer manner  At start, all the training examples are at the root  If attributes are continuous-valued, they are discretized in advance to have all attributes are of categorical type  Attributes are selected for splitting the data on the basis of a heuristic or statistical measure (e.g., information gain)  Examples are partitioned recursively based on selected attributes  Conditions for stopping partitioning  All samples for a given node belong to the same class  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf  There are no samples left
  • 10.  Brief Review of Entropy 10 m = 2
  • 11. 11 Attribute Selection Measure: Information Gain (ID3/C4.5)  Select the attribute with the highest information gain  Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci|/|D|  Expected information (entropy) needed to classify a tuple in D:  Information needed after using A to split D into v partitions (conditional entropy with A) to classify D:  Information gained by splitting D on attribute A ) ( log ) ( 2 1 i m i i p p D Info     ) ( | | | | ) ( 1 j v j j A D Info D D D Info     (D) Info Info(D) Gain(A) A  
  • 12. 12 Attribute Selection: Information Gain  Class P: buys_computer = “yes”  Class N: buys_computer = “no” means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Similarly the others. age pi ni I(pi, ni) <=30 2 3 0.971 31…40 4 0 0 >40 3 2 0.971 694 . 0 ) 2 , 3 ( 14 5 ) 0 , 4 ( 14 4 ) 3 , 2 ( 14 5 ) (     I I I D Infoage 246 . 0 ) ( ) ( ) (    D Info D Info age Gain age age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no ) 3 , 2 ( 14 5 I 940 . 0 ) 14 5 ( log 14 5 ) 14 9 ( log 14 9 ) 5 , 9 ( ) ( 2 2      I D Info Similarly the information gain for the other attributes will be estimated as Gain(income)=0.029; Gain(student)=0.151; Gain(credit_rating)=0.048;
  • 13. 13 Determining Best Split Point for Numerical Attributes  Let attribute A be a continuous-valued attribute whose range is split into two for partitioning the data during decision tree construction  Binary Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point  Steps to determine the best split point for A  Sort the instances based on the values of A in increasing order  Typically, the midpoint between each pair of adjacent values with altered class labels is considered as a possible split point  (ai+ai+1)/2 is the midpoint between the values of ai and ai+1  The point with the minimum (conditional entropy) expected information requirement is selected as the best split-point for A
  • 14. 14 Gain Ratio for Attribute Selection (C4.5)  Information gain measure is biased towards attributes with a large number of splits  C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)  GainRatio(A) = Gain(A)/SplitInfo(A)  Ex.  gain_ratio(income) = 0.029/1.557 = 0.019  The attribute with the maximum gain ratio is selected as the splitting attribute ) | | | | ( log | | | | ) ( 2 1 D D D D D SplitInfo j v j j A     
  • 15. 15 Overfitting and Tree Pruning  Overfitting: An induced tree may overfit the training data  Complex models are developed with many branches often representing specific noisy instances with no significance for generalization  Results in poor accuracy for unseen samples  Two approaches to avoid overfitting  Prepruning: Halt tree construction early ̵ do not split a node if its entropy / uncertainity measure falls below a threshold  Difficult to choose an appropriate threshold  Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees  Use a (pruning) set of data different from the training data to decide which is the “best pruned tree”
  • 16. 16 Merits of Classification with Decision Trees  can use SQL queries for accessing databases  relatively faster learning speed on memory- resident training sets.  convertible to simple and easy to understand classification rules  Achieves classification accuracy comparable with other methods  AVC-sets (Attribute-Value, Classlabel) are maintained for each attribute at each tree node splitting to adopt to the available memory for gaining scalability to handle very large training sets.
  • 17. March 15, 2024 Data Mining: Concepts and Techniques 17 Presentation of Classification Results
  • 18. 18 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 19. 19 Prediction Based on Bayes’ Theorem  Given training data X, posteriori probability of a hypothesis H, denoted by P(H|X), follows the Bayes’ theorem  Informally, this can be viewed as posteriori = likelihood x prior/evidence  Predicts the class label of X to be Ci iff the probability P(Ci|X) is the highest among all P(Ck|X) for all k classes  Practical difficulty: It requires initial knowledge of many probabilities, involving significant data collection and computational costs. ) ( ) ( ) | ( ) | ( X X X P H P H P H P 
  • 20. 20 Classification Problem Is to Identify the Class with Maximum Posteriori Prob  Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)  Suppose there are m classes C1, C2, …, Cm.  Classification is to identify the class with maximum posteriori, i.e., the maximal P(Ci|X)  This can be derived from Bayes’ theorem  Since P(X) is constant for all classes, for the purpose of classification, it is enough to identify the class that has maximum value of the numerator  The class label of X is given by ) ( ) ( ) | ( ) | ( X X X P i C P i C P i C P  )} ( ) | ( { max arg i C P i C P i X
  • 21. 21 Naïve Bayes Classifier  A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):  This greatly reduces the information requirement and computation cost  If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, | (# of tuples of Ci in D)  If Ak is continuous-valued, P(xk|Ci) is usually computed based on Gaussian distribution with mean μ and standard deviation σ and P(xk|Ci) is estimated at Ak= xk in terms of µ and σ for Ci as given below: ) | ( ... ) | ( ) | ( 1 ) | ( ) | ( 2 1 Ci x P Ci x P Ci x P n k Ci x P Ci P n k        X 2 2 2 ) ( 2 1 ) , , (          x e x g ) , , ( ) | ( i i C C k x g Ci P    X
  • 22. 22 Naïve Bayes Classifier: Training Dataset Class: C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’ Priors, P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357 Data to be classified: X = (age <=30, Income = medium, Student = yes Credit_rating = Fair) age income student credit_rating buys_compu <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 23. 23 Naïve Bayes Classifier: An Example Priors, P(Ci): P(buys_computer = “yes”) = 0.643 P(buys_computer = “no”) = 0.357  Compute P(X|Ci) for each class P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4  X = (age <= 30 , income = medium, student = yes, credit_rating = fair) P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028 P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007 Therefore, X belongs to class (“buys_computer = yes”) age income student credit_rating buys_comp <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 24. 24 Avoiding the Zero-Probability Problem  Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the likelihood estimates being the product of multiple cond. prob will be zero  Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and income = high (10)  Use Laplacian correction (or Laplacian estimator)  Adding 1 to each case Prob(income = low) = 1/1003 Prob(income = medium) = 991/1003 Prob(income = high) = 11/1003  The “corrected” prob. estimates are close to their “uncorrected” counterparts    n k Ci xk P Ci X P 1 ) | ( ) | (
  • 25. 25 Naïve Bayes Classifier: Comments  Advantages  Easy to implement  Accurate results obtained in most of the cases  Disadvantages  Relies on class conditional independence assumption: Practically, dependencies exist among variables  E.g., Symptoms: fever, cough, cold, body aches, etc.,  Dependencies among these cannot be modeled by Naïve Bayes Classifier.  If the features are not independent, predictions are less accurate
  • 26. 26 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 27. 27 Using IF-THEN Rules for Classification  Represent the knowledge in the form of IF-THEN rules R: IF age = youth AND student = yes THEN buys_computer = yes  Rule antecedent/precondition vs. rule consequent  Assessment of a rule: coverage and accuracy  A rule,R covers a tuple,T, if the precondition of R is true in T  ncovers = # of tuples covered by R  ncorrect = # of tuples correctly classified by R coverage(R) = ncovers /|D| where D is the training data set accuracy(R) = ncorrect / ncovers  While classification using Rules, if multiple rules are applicable, conflict resolution is called for.  Size ordering: assign the highest priority to the rules that has the “toughest” requirement (i.e., with the most attribute tests) or specific rules preferred  Class-based ordering: decreasing order of prevalence or misclassification cost  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
  • 28. 28 age? student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no  Example: Rule extraction from our buys_computer decision-tree IF age = young AND student = no THEN buys_computer = no IF age = young AND student = yes THEN buys_computer = yes IF age = mid-age THEN buys_computer = yes IF age = old AND credit_rating = excellent THEN buys_computer = no IF age = old AND credit_rating = fair THEN buys_computer = yes Rule Extraction from a Decision Tree  Rules are easier to understand than large trees  One rule is created for each path from the root to a leaf  Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction  Rules are mutually exclusive and exhaustive
  • 29. 29 Rule Induction: Sequential Covering Method  Sequential covering algorithm: Extracts rules directly from training data  Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER  Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes  Steps:  Rules are learned one at a time  Each time a rule is learned, the tuples covered by the rules are removed  Repeat the process on the remaining tuples until termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold  Compared to decision-tree induction that learns a set of rules simultaneously, sequential covering alg learns rules one-by-one.
  • 30. 30 Sequential Covering Algorithm while (enough target tuples left) generate a rule remove positive target tuples satisfying this rule Examples covered by Rule 3 Examples covered by Rule 2 Examples covered by Rule 1 Positive examples
  • 31. 31 Rule Generation  To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break
  [Figure: the rule is grown by adding one predicate at a time, e.g., A3=1, then A3=1 AND A1=2, then A3=1 AND A1=2 AND A8=5, separating the positive examples from the negative examples]
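A compact sketch of sequential covering in Python (illustrative only): it follows the loop above, but for brevity it scores candidate predicates by rule precision rather than FOIL gain, and the helper names are made up:

    def covers(antecedent, x):
        """True if tuple x satisfies every attribute test in the antecedent."""
        return all(x.get(a) == v for a, v in antecedent.items())

    def precision(antecedent, data, target):
        cov = [(x, y) for x, y in data if covers(antecedent, x)]
        return sum(y == target for _, y in cov) / len(cov) if cov else 0.0

    def learn_one_rule(data, target, attributes):
        """Greedily add the attribute test that most improves precision."""
        antecedent = {}
        while True:
            best, best_prec = None, precision(antecedent, data, target)
            for a in attributes:
                if a in antecedent:
                    continue
                for v in {x[a] for x, _ in data if a in x}:
                    cand = dict(antecedent, **{a: v})
                    p = precision(cand, data, target)
                    if p > best_prec:
                        best, best_prec = cand, p
            if best is None:            # no predicate improves the rule: stop growing
                return antecedent
            antecedent = best

    def sequential_covering(data, target, attributes):
        rules, remaining = [], list(data)
        while any(y == target for _, y in remaining):          # enough target tuples left
            r = learn_one_rule(remaining, target, attributes)  # generate a rule
            if not r:
                break
            rules.append(r)
            remaining = [(x, y) for x, y in remaining          # remove covered positives
                         if not (y == target and covers(r, x))]
        return rules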
  • 33. 33 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 34. 34 Classifier Evaluation Metrics: Confusion Matrix (C1 as Positive class)  Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j  May have extra rows/columns to provide totals
  Confusion Matrix:
      Actual class \ Predicted class |          C1          |          ¬C1
      C1                             | True Positives (TP)  | False Negatives (FN)
      ¬C1                            | False Positives (FP) | True Negatives (TN)
  Example of Confusion Matrix:
      Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
      buy_computer = yes             |        6954        |         46        |  7000
      buy_computer = no              |         412        |       2588        |  3000
      Total                          |        7366        |       2634        | 10000
  • 35. 35 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity  Accuracy = (TP + TN) / All: percentage of test set tuples that are correctly classified  Error rate = (FP + FN) / All = 1 − accuracy  Sensitivity (true positive rate, recall) = TP / P, where P is the number of positive tuples  Specificity (true negative rate) = TN / N, where N is the number of negative tuples  When the classes are imbalanced (as in the cancer example that follows), sensitivity and specificity are more informative than accuracy alone
  • 36. 36 Classifier Evaluation Metrics: Precision and Recall, and F-measures  Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive? Precision = TP / (TP + FP)  Recall (completeness): what % of positive tuples did the classifier label as positive? Recall = TP / (TP + FN)  A perfect score is 1.0  There is an inverse relationship between precision and recall  F measure (F1 or F-score): harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall)  Fβ: weighted measure of precision and recall, Fβ = (1 + β²) × precision × recall / (β² × precision + recall); it assigns β times as much weight to recall as to precision
  • 37. 37 Classifier Evaluation Metrics: Example
      Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
      cancer = yes                   |      90      |     210     |   300 | 30.00 (sensitivity)
      cancer = no                    |     140      |    9560     |  9700 | 98.56 (specificity)
      Total                          |     230      |    9770     | 10000 | 96.50 (accuracy)
   Precision = 90/230 = 39.13%  Recall = 90/300 = 30.00%
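The figures in this example are easy to verify in a few lines; a minimal sketch using the TP/FN/FP/TN counts from the cancer table above:

    # Confusion-matrix counts from the slide (cancer = yes is the positive class)
    TP, FN = 90, 210      # actual yes, predicted yes / no
    FP, TN = 140, 9560    # actual no,  predicted yes / no
    P, N = TP + FN, FP + TN

    accuracy    = (TP + TN) / (P + N)   # 0.9650
    error_rate  = (FP + FN) / (P + N)   # 0.0350
    sensitivity = TP / P                # recall = 0.3000
    specificity = TN / N                # ~0.9856
    precision   = TP / (TP + FP)        # ~0.3913
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ~0.34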
  • 38. Model Evaluation and Selection  Use validation (test) set of class-labeled tuples instead of training set when assessing accuracy  Methods for estimating a classifier’s accuracy:  Holdout method, random sub-sampling  Cross-validation  Bootstrap  Comparing classifiers:  Confidence intervals  Cost-benefit analysis and ROC Curves 38
  • 39. 39 Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods  Holdout method  Given data is randomly partitioned into two independent sets  Training set (e.g., 2/3) for model construction  Test set (e.g., 1/3) for accuracy estimation  Random sub-sampling: a variation of holdout  Repeat holdout k times; accuracy = avg. of the accuracies obtained  Cross-validation (k-fold, where k = 10 is most popular)  Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size  At the i-th iteration, use Di as the test set and the remaining folds as the training set  Leave-one-out: k folds where k = # of tuples, for small-sized data  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
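A minimal sketch of k-fold cross-validation; train_fn (builds a model from a training list) and accuracy_fn (scores a model on a test list) are caller-supplied placeholders, not functions from the source:

    import random

    def k_fold_indices(n, k=10, seed=0):
        """Randomly partition indices 0..n-1 into k folds of roughly equal size."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    def cross_validate(data, k, train_fn, accuracy_fn):
        """Average test accuracy over k folds."""
        folds = k_fold_indices(len(data), k)
        accs = []
        for i in range(k):
            test = [data[j] for j in folds[i]]
            train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
            model = train_fn(train)                 # build the classifier on k-1 folds
            accs.append(accuracy_fn(model, test))   # evaluate on the held-out fold
        return sum(accs) / k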
  • 40. 40 Evaluating Classifier Accuracy: Bootstrap  Bootstrap  Works well with small data sets  Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected it is re-added to the pool and is equally likely to be selected again  A commonly used bootstrap method is the .632 bootstrap  A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original tuples end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368)  Repeat the sampling procedure k times and estimate the overall accuracy of the model by combining, for each sample i, the accuracy on its test set (weight 0.632) and on its training set (weight 0.368): Acc(M) = Σ_{i=1..k} (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
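A small sketch of bootstrap sampling (illustrative only; training and evaluating a model on the resulting sets is left to the caller):

    import random

    def bootstrap_sample(data, seed=None):
        """Sample len(data) tuples with replacement; untouched tuples form the test set."""
        rng = random.Random(seed)
        d = len(data)
        chosen = [rng.randrange(d) for _ in range(d)]
        chosen_set = set(chosen)
        train = [data[i] for i in chosen]
        test = [data[i] for i in range(d) if i not in chosen_set]
        return train, test

    # With large d, roughly 63.2% of the distinct tuples land in the training sample
    data = list(range(10000))
    train, test = bootstrap_sample(data, seed=1)
    print(len(set(train)) / len(data))   # close to 0.632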
  • 41. 41 Model Selection: ROC Curves  ROC (Receiver Operating Characteristic) curves: for visual comparison of probabilistic classification models  The plot shows the trade-off between the true positive rate and the false positive rate  The vertical axis represents the true positive rate, TPR = TP/P (sensitivity or recall)  The horizontal axis represents the false positive rate, FPR = FP/N = 1 − TNR (i.e., 1 − specificity)  The area under the ROC curve (AUC) is a measure of the accuracy of the model  The diagonal line corresponds to random guessing of class labels in balanced datasets  A model whose ROC curve is closer to the diagonal line (i.e., whose area is closer to 0.5) is less accurate  A model with perfect accuracy has an area of 1.0  [Figure: ROC curves; each curve runs from the point where all tuples are predicted negative (0, 0) to the point where all tuples are predicted positive (1, 1)]
  • 42. 42 How to Draw an ROC Curve  Apply the model to the test data to predict the probability of each tuple being positive  Rank the test tuples in decreasing order of their probability of being positive  Starting with a threshold high enough to accept only the rank-1 tuple, gradually lower the threshold so that more and more tuples are accepted as positive, and estimate TPR and FPR at each stage to plot the ROC curve  [Figure: the curve traced from the all-negative corner (0, 0) to the all-positive corner (1, 1)]
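A sketch of this thresholding procedure (the scores and labels are invented); the result is the list of (FPR, TPR) points that would be plotted:

    def roc_points(labels, scores):
        """labels: 1 = positive, 0 = negative; scores: predicted probability of positive."""
        ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
        P = sum(labels)
        N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                # threshold above every score: all predicted negative
        for _, y in ranked:                  # lower the threshold one tuple at a time
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))  # (FPR, TPR) at this threshold
        return points                        # ends at (1, 1): all predicted positive

    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.40]
    print(roc_points(labels, scores))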
  • 43. Issues Affecting Model Selection  Accuracy  classifier accuracy: predicting class label  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 43
  • 44. 44 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 45. 45 Ensemble Methods: Increasing the Accuracy  Ensemble methods  Use a combination of models to increase accuracy  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*  Popular ensemble methods  Bagging: averaging the predictions over a collection of classifiers  Boosting: weighted vote with a collection of classifiers  Random Forest: majority voting among a collection of base classifiers (decision trees), each built with a randomly sampled attribute set
  • 46. 46 Bagging: Bootstrap Aggregation  Analogy: diagnosis based on multiple doctors' majority opinion / vote  Training  Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)  A classifier model Mi is learned for each training set Di  Classification: to classify an unknown sample X  Each classifier Mi returns its class prediction  The bagged classifier M* counts the votes and assigns the class with the majority of votes to X  Regression: can be applied to predict continuous-valued variables by taking the average of the predictions for a given test tuple  Accuracy  Often significantly improves the accuracy of prediction compared to a single classifier derived from D  Noisy data: more robust to noise, since it goes by the majority of predictions
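A compact bagging sketch, assuming a caller-supplied train_fn(sample) that returns a model exposing predict(x) (a placeholder interface, not from the source):

    import random
    from collections import Counter

    def bagging_fit(data, train_fn, k=10, seed=0):
        """Train k models, each on a bootstrap sample of the training data."""
        rng = random.Random(seed)
        d = len(data)
        models = []
        for _ in range(k):
            sample = [data[rng.randrange(d)] for _ in range(d)]   # sample with replacement
            models.append(train_fn(sample))
        return models

    def bagging_predict(models, x):
        """Each model votes; return the majority class."""
        votes = Counter(m.predict(x) for m in models)
        return votes.most_common(1)[0][0]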
  • 47. 47 Boosting  Analogy: consult several doctors, and decide based on a combination of weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy  How does boosting work?  Weights are assigned to each training tuple  A series of k classifiers (that complement each other) is iteratively learned  After a classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training tuples that were misclassified by Mi  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy  The boosting algorithm can be extended for numeric prediction  Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified or noisy data
  • 48. 48 Adaboost (Freund and Schapire, 1997)  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all tuples are assigned the same weight, 1/d  Generate k classifiers in k rounds. At round i:  Tuples from D are sampled (with replacement) to form a training set Di of the same size  Each tuple's chance of being selected is based on its weight  A classification model Mi is derived from Di  Its error rate is calculated using Di as a test set  If a tuple is misclassified, its weight is increased; otherwise it is decreased  Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The error rate of classifier Mi is the weighted sum over the misclassified tuples: error(Mi) = Σ_{j=1..d} wj × err(Xj)  The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
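A simplified AdaBoost training loop following the steps above (a sketch, not the exact textbook pseudocode); train_fn is a caller-supplied weak learner returning a model with predict(x):

    import math
    import random

    def adaboost_fit(data, train_fn, k=5, seed=0):
        """data: list of (x, y); returns a list of (model, vote_weight) pairs."""
        rng = random.Random(seed)
        d = len(data)
        w = [1.0 / d] * d                      # all tuples start with equal weight 1/d
        ensemble = []
        for _ in range(k):
            # sample a training set of size d according to the current tuple weights
            sample = rng.choices(data, weights=w, k=d)
            model = train_fn(sample)
            miss = [int(model.predict(x) != y) for x, y in data]
            error = sum(wi * mi for wi, mi in zip(w, miss))   # weighted error of Mi
            if error == 0 or error >= 0.5:
                continue                        # degenerate round: skip it
            ensemble.append((model, math.log((1 - error) / error)))   # vote weight
            # decrease the weights of correctly classified tuples, then renormalize
            w = [wi * (error / (1 - error)) if mi == 0 else wi
                 for wi, mi in zip(w, miss)]
            total = sum(w)
            w = [wi / total for wi in w]
        return ensemble

    def adaboost_predict(ensemble, x, classes):
        score = {c: 0.0 for c in classes}
        for model, alpha in ensemble:
            score[model.predict(x)] += alpha    # each vote weighted by log((1-err)/err)
        return max(score, key=score.get)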
  • 49. Random Forest (Breiman 2001)  Random Forest:  Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split  During classification, each tree votes and the most popular class is returned  Two Methods to construct Random Forest:  Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to full size  Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)  Comparable in accuracy to Adaboost, but more robust to noise and outliers  Faster than bagging or boosting 49
  • 50. 50 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 51. 51 Classification by Backpropagation  A neural network: a set of connected input/output units called neurons, where each connection has a weight associated with it  Extensively used for classification and regression tasks  During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the training tuples  Backpropagation: a supervised algorithm for learning the edge weights of a feed-forward neural network  The weight adjustments are made in the backward direction, from the output layer through each hidden layer down to the first hidden layer, hence the name 'backpropagation'
  • 52. 52 A Multi-Layer Feed-Forward Neural Network  [Figure: the input vector X feeds the input layer, which is connected through weights wij to a hidden layer and then to the output layer, which produces the output vector]  Weights are adjusted to reduce the prediction error, e.g., wij^(k+1) = wij^(k) + λ (yj − ŷj^(k)) xi, where λ is the learning rate
  • 53. 53 Activity of a Neuron at a Hidden/Output Layer  An n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping  Hidden/output units receive inputs from the units in the previous layer. The activations generated at the feeder units are propagated along the weighted edges and summed; the bias associated with the unit is added to this weighted sum, and finally a nonlinear activation function f is applied to determine the output of the unit  [Figure: inputs x0, x1, …, xn with weight vector w = (w0, w1, …, wn) feed a weighted sum plus bias θ, followed by the activation function f, producing the output y]  Example: y = sign(Σ_{i=0..n} wi xi + bias)
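The computation at a single unit can be written in a couple of lines; this sketch uses tanh as the nonlinear activation purely for illustration (the slide's example uses sign):

    import math

    def neuron_output(x, w, bias, activation=math.tanh):
        """Weighted sum of the inputs plus the bias, passed through a nonlinear activation."""
        weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
        return activation(weighted_sum)

    # Example: a unit with three inputs
    print(neuron_output([0.5, -1.0, 0.25], [0.4, 0.1, -0.7], bias=0.2))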
  • 54. 54 How a Multi-Layer Neural Network Works  The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one is used  The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction  Each neuron in the hidden and output layers has its own bias, which is learned along with the edge weights during the training phase  The network is feed-forward: none of the weights cycles back to an input unit or to a hidden unit of a previous layer  From a statistical point of view, ANNs perform nonlinear regression or probabilistic classification: given enough hidden units and enough training samples, they can closely approximate any function
  • 55. 55 Defining a Network Topology for Classification / Regression  Decide the network topology: specify the # of units in the input layer, the # of hidden layers, the # of units in each hidden layer, and the # of units in the output layer  One input unit for each descriptive feature  One output unit for binary classification; for multi-class classification, the number of output units equals the number of classes  Experimentally select an appropriate number of hidden neurons depending on the problem complexity  Normalize the input values of each attribute in the training tuples to the [0.0, 1.0] range  Train the network with a learning algorithm such as backpropagation to adjust the edge weights  Once the network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
  • 56. 56 Backpropagation  Iteratively process a set of training tuples & compare the network's prediction with the actual known target value  For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value  Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”  Steps  Initialize weights to small random numbers, associated with biases  Propagate the inputs forward (by applying activation function)  Backpropagate the error and update weights and biases  Terminating condition (when error is very small, etc.)
  • 57. 57 Gradient Descent
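The slide's figure is not recoverable here, but the idea behind backpropagation training is gradient descent on the squared error. A minimal sketch for a single linear unit (a deliberate simplification of the multi-layer case, with invented toy data):

    # Stochastic gradient descent with the delta rule: w_i <- w_i + eta * (y - y_hat) * x_i
    data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]   # (inputs, target)
    w, bias, eta = [0.0, 0.0], 0.0, 0.05

    for epoch in range(200):
        for x, y in data:
            y_hat = sum(wi * xi for wi, xi in zip(w, x)) + bias   # forward pass
            error = y - y_hat
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]   # move weights downhill
            bias += eta * error
    print(w, bias)   # approaches weights that fit the toy data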
  • 59. 59 Strengths of ANN for Classification  Strength  High tolerance to noisy data  Ability to classify unknown instances accurately  Well-suited for continuous-valued inputs and outputs  Successful on real-world data, e.g., hand-written letters  Algorithms are inherently parallel, since the neurons in a layer work independently  Techniques have recently been developed for the extraction of rules from trained neural networks  The capability of ANNs is further extended through deep learning
  • 60. 60 Drawbacks of Neural Network as a Classifier  Weakness  Long training time  Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”  Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
  • 61. 61 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 62. 62 Lazy vs. Eager Learning  Lazy vs. eager learning  Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple  Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before accepting new (e.g., test) data to classify  Lazy: less time in training but more time in predicting  Accuracy  Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function  Eager: must commit to a single hypothesis that covers the entire instance space
  • 63. 63 The k-Nearest Neighbor Algorithm  All instances are preprocessed and mapped onto points in n-D space  The nearest neighbors to a query point xq are identified from the training instances sorted in ascending order of their Euclidean distance to xq, dist(Xi, xq)  The target function may be discrete-valued or real-valued, for classification or regression tasks respectively  For classification tasks, k-NN returns the most common value (mode) among the labels of the k training examples nearest to xq  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples  [Figure: positive and negative training points in the plane around the query point xq, and the Voronoi cells induced by 1-NN]
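A minimal k-NN classifier following this description (the toy points and labels are made up):

    import math
    from collections import Counter

    def knn_classify(query, data, k=3):
        """data: list of (point, label); return the majority label of the k nearest points."""
        neighbors = sorted(data, key=lambda t: math.dist(query, t[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.3), "+"),
             ((4.0, 4.2), "-"), ((3.8, 4.0), "-")]
    print(knn_classify((1.1, 1.0), train, k=3))   # "+"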
  • 64. 64 k-NN Algorithm for Regression  k-NN for real-valued prediction for a given unknown tuple xq  Returns the mean of the yi values of the k nearest neighbors  Distance-weighted nearest neighbor algorithm  Weigh the contribution of each of the k neighbors according to its distance to the query xq, e.g., wi = 1 / d(xq, xi)²  Gives greater weight to closer neighbors  Robust to noisy data through averaging over the k nearest neighbors  Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes  To overcome this, eliminate the least relevant attributes
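A sketch of distance-weighted k-NN regression with the wi = 1/d(xq, xi)² weighting above; the toy data is invented, and a small epsilon is added to avoid division by zero when the query coincides with a training point (an implementation detail, not part of the slide):

    import math

    def knn_regress(query, data, k=3, eps=1e-9):
        """data: list of (point, y); distance-weighted average of the k nearest y values."""
        neighbors = sorted(data, key=lambda t: math.dist(query, t[0]))[:k]
        weights = [1.0 / (math.dist(query, p) ** 2 + eps) for p, _ in neighbors]
        return sum(wi * y for wi, (_, y) in zip(weights, neighbors)) / sum(weights)

    train = [((1.0,), 2.0), ((2.0,), 4.1), ((3.0,), 6.0), ((10.0,), 20.0)]
    print(knn_regress((2.5,), train, k=3))   # close to 5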