1
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
2
Supervised vs. Unsupervised Learning
 Supervised learning (classification) or Predictive Mining
 Supervision: The training data (observations, past
experience, etc.) has labels indicating the class of the
observations
 New data is classified based on the training set
 Unsupervised learning (clustering) or Descriptive Mining
 Class labels are not assigned to training data instances
 Given a set of measurements, observations, etc. discovers
patterns with the aim of grouping similar instances to form
clusters
3
 Classification
 Classification aims to predict categorical class labels
 constructs a predictive model based on the descriptions of
training instances and their class labels and uses it for
classifying new data
 Regression
 Predicts numeric values of a dependent attribute in terms
of one or more independent predictor attributes
 models continuous-valued functions
 Used for estimating unknown or missing values
 Typical applications
 Credit/loan approval:
 Medical diagnosis: if a tumor is cancerous or benign
 Fraud detection: if a transaction is fraudulent or genuine
 Cost assessment of properties
Prediction Problems: Classification vs. Regression
4
Classification—A Two-Step Process
 Model construction:
 Each tuple/sample is assumed to belong to a predefined class, as specified by
the class label attribute
 The set of labeled tuples is partitioned into training and test sets
 The training tuples are used for model construction and refinement
 The model is represented as classification rules, decision trees, SVM, ANN, etc
 Model usage: After validating the model, it can be used for classifying future or
unknown objects
 Estimate accuracy of the model
 The known (true) label of test sample is compared with the predicted label
given by the classification model
 Accuracy rate is the percentage of test set samples that are correctly
classified by the model
 Test-set accuracy reflects the model's generalization performance on unseen data
 If the accuracy is acceptable, use the model to classify new data
 Part of the training set, called the validation set, is used to select model
hyperparameters during refinement to achieve good generalization performance.
5
Process (1): Model Construction
Training
Data
NAME RANK YEARS TENURED
Mike Assistant Prof 3 no
Mary Assistant Prof 7 yes
Bill Professor 2 yes
Jim Associate Prof 7 yes
Dave Assistant Prof 6 no
Anne Associate Prof 3 no
Classification
Algorithms
IF rank = ‘professor’
OR years > 6
THEN tenured = ‘yes’
Classifier
(Model)
6
Process (2): Using the Model in Prediction
Classifier
Testing
Data
NAME RANK YEARS TENURED
Tom Assistant Prof 2 no
Merlisa Associate Prof 7 no
George Professor 5 yes
Joseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
7
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
8
Decision Tree Induction: An Example
age?
  <=30   → student?   no → no;  yes → yes
  31..40 → yes
  >40    → credit rating?   excellent → no;  fair → yes
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
 Training data set: Buys_computer
 Attribute ‘age’ is discretized
 Quinlan’s ID3 algorithm learns the
Decision tree model
 Resulting tree:
9
Algorithm for Decision Tree Induction
 Basic algorithm (a greedy algorithm)
 Tree is constructed in a top-down recursive divide-and-
conquer manner
 At start, all the training examples are at the root
 If attributes are continuous-valued, they are discretized in
advance so that all attributes are categorical
 Attributes are selected for splitting the data on the basis of a
heuristic or statistical measure (e.g., information gain)
 Examples are partitioned recursively based on selected
attributes
 Conditions for stopping partitioning
 All samples for a given node belong to the same class
 There are no remaining attributes for further partitioning –
majority voting is employed for classifying the leaf
 There are no samples left
Brief Review of Entropy
[Figure: entropy curve for the two-class case, m = 2]
10
11
Attribute Selection Measure:
Information Gain (ID3/C4.5)
 Select the attribute with the highest information gain
 Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci|/|D|
 Expected information (entropy) needed to classify a tuple in D:
 Information needed after using A to split D into v partitions
(conditional entropy with A) to classify D:
 Information gained by splitting D on attribute A
Info(D) = − Σ_{i=1..m} p_i log2(p_i)
Info_A(D) = Σ_{j=1..v} (|D_j| / |D|) × Info(D_j)
Gain(A) = Info(D) − Info_A(D)
12
Attribute Selection: Information Gain
 Class P: buys_computer = “yes”
 Class N: buys_computer = “no”
The term (5/14) I(2,3) means "age <=30" has 5 out of 14 samples,
with 2 yes's and 3 no's; similarly for the other terms.
age pi ni I(pi, ni)
<=30 2 3 0.971
31…40 4 0 0
>40 3 2 0.971
Info_age(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
Gain(age) = Info(D) − Info_age(D) = 0.246
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Info(D) = I(9,5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
Similarly the information gain for the other attributes will be estimated as
Gain(income)=0.029; Gain(student)=0.151; Gain(credit_rating)=0.048;
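The following is a minimal sketch of how Info(D) and Gain(A) can be computed for categorical attributes, assuming each tuple is a dict keyed by attribute name (the row representation and function names are illustrative, not from the slides):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Info(D) = -sum_i p_i * log2(p_i) over the class distribution of D."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attr, target):
    """Gain(A) = Info(D) - Info_A(D) for a categorical attribute A."""
    labels = [r[target] for r in rows]
    info_d = entropy(labels)
    info_a = 0.0
    for value in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == value]
        info_a += (len(subset) / len(rows)) * entropy(subset)
    return info_d - info_a
```

On the buys_computer table above, this reproduces Info(D) ≈ 0.940, Info_age(D) ≈ 0.694 and Gain(age) ≈ 0.246.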
13
Determining Best Split Point for
Numerical Attributes
 Let attribute A be a continuous-valued attribute whose range is split
into two for partitioning the data during decision tree construction
 Binary Split: D1 is the set of tuples in D satisfying A ≤ split-point, and
D2 is the set of tuples in D satisfying A > split-point
 Steps to determine the best split point for A
 Sort the instances based on the values of A in increasing order
 Typically, the midpoint between each pair of adjacent values whose
class labels differ is considered as a possible split point
 (ai+ai+1)/2 is the midpoint between the values of ai and ai+1
 The point with the minimum (conditional entropy) expected
information requirement is selected as the best split-point for A
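A short sketch of this split-point search, reusing the entropy() helper sketched earlier; values is assumed to be a list of (numeric_value, class_label) pairs, an illustrative representation:

```python
def best_split_point(values):
    """Return the midpoint with minimum expected information (conditional entropy)."""
    values = sorted(values)                       # sort by attribute value
    labels = [lbl for _, lbl in values]
    best_point, best_info = None, float("inf")
    for i in range(len(values) - 1):
        if labels[i] == labels[i + 1]:            # candidates lie where labels change
            continue
        midpoint = (values[i][0] + values[i + 1][0]) / 2.0
        left, right = labels[: i + 1], labels[i + 1:]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if info < best_info:
            best_point, best_info = midpoint, info
    return best_point
```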
14
Gain Ratio for Attribute Selection (C4.5)
 Information gain measure is biased towards attributes with a
large number of splits
 C4.5 (a successor of ID3) uses gain ratio to overcome the
problem (normalization to information gain)
 GainRatio(A) = Gain(A)/SplitInfo(A)
 Ex.
 gain_ratio(income) = 0.029/1.557 = 0.019
 The attribute with the maximum gain ratio is selected as the
splitting attribute
SplitInfo_A(D) = − Σ_{j=1..v} (|D_j| / |D|) × log2(|D_j| / |D|)
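A corresponding sketch of the gain-ratio computation, again assuming the dict-of-attributes row representation and the info_gain() helper from the earlier sketch:

```python
from collections import Counter
from math import log2

def split_info(rows, attr):
    """SplitInfo_A(D) = -sum_j (|D_j|/|D|) * log2(|D_j|/|D|)."""
    total = len(rows)
    return -sum((n / total) * log2(n / total)
                for n in Counter(r[attr] for r in rows).values())

def gain_ratio(rows, attr, target):
    si = split_info(rows, attr)
    return info_gain(rows, attr, target) / si if si > 0 else 0.0
```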
15
Overfitting and Tree Pruning
 Overfitting: An induced tree may overfit the training data
 Complex models are developed with many branches often
representing specific noisy instances with no significance for
generalization
 Results in poor accuracy for unseen samples
 Two approaches to avoid overfitting
 Prepruning: Halt tree construction early; do not split a node if its
entropy / uncertainty measure falls below a threshold
 Difficult to choose an appropriate threshold
 Postpruning: Remove branches from a “fully grown” tree—get a
sequence of progressively pruned trees
 Use a (pruning) set of data different from the training data to
decide which is the “best pruned tree”
16
Merits of Classification with Decision Trees
 can use SQL queries for accessing databases
 relatively faster learning speed on memory-
resident training sets.
 convertible to simple and easy to understand
classification rules
 Achieves classification accuracy comparable
with other methods
 AVC-sets (Attribute-Value, Class label) are
maintained for each attribute at each tree node
during splitting, so the method can adapt to the
available memory and scale to very large
training sets.
17
Presentation of Classification Results
18
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
19
Prediction Based on Bayes’ Theorem
 Given training data X, posteriori probability of a hypothesis H,
denoted by P(H|X), follows the Bayes’ theorem
 Informally, this can be viewed as
posteriori = likelihood x prior/evidence
 Predicts the class label of X to be Ci iff the probability P(Ci|X) is the
highest among all P(Ck|X) for all k classes
 Practical difficulty: It requires initial knowledge of many
probabilities, involving significant data collection and computational
costs.
P(H|X) = P(X|H) × P(H) / P(X)
20
Classification Problem Is to Identify the Class
with Maximum Posteriori Prob
 Let D be a training set of tuples and their associated class
labels, and each tuple is represented by an n-D attribute vector
X = (x1, x2, …, xn)
 Suppose there are m classes C1, C2, …, Cm.
 Classification is to identify the class with maximum posteriori,
i.e., the maximal P(Ci|X)
 This can be derived from Bayes’ theorem
 Since P(X) is constant for all classes, for the purpose of
classification, it is enough to identify the class that has
maximum value of the numerator
 The class label of X is given by
P(Ci|X) = P(X|Ci) × P(Ci) / P(X)
Class label of X = argmax_i { P(X|Ci) × P(Ci) }
21
Naïve Bayes Classifier
 A simplified assumption: attributes are conditionally independent
(i.e., no dependence relation between attributes):
 This greatly reduces the information requirement and
computation cost
 If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for
Ak divided by |Ci,D| (# of tuples of Ci in D)
 If Ak is continuous-valued, P(xk|Ci) is usually computed based on
Gaussian distribution with mean μ and standard deviation σ
and P(xk|Ci) is estimated at Ak= xk in terms of µ and σ for Ci as
given below:
P(X|Ci) = Π_{k=1..n} P(x_k|Ci) = P(x_1|Ci) × P(x_2|Ci) × … × P(x_n|Ci)

g(x, μ, σ) = (1 / (√(2π) σ)) exp( −(x − μ)² / (2σ²) )

P(x_k|Ci) = g(x_k, μ_Ci, σ_Ci)
22
Naïve Bayes Classifier: Training Dataset
Class:
C1:buys_computer = ‘yes’
C2:buys_computer = ‘no’
Priors, P(Ci):
P(buys_computer = “yes”)
= 9/14 = 0.643
P(buys_computer = “no”)
= 5/14= 0.357
Data to be classified:
X = (age <=30,
Income = medium,
Student = yes,
Credit_rating = Fair)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
23
Naïve Bayes Classifier: An Example
Priors, P(Ci):
P(buys_computer = “yes”) = 0.643
P(buys_computer = “no”) = 0.357
 Compute P(X|Ci) for each class
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4
 X = (age <= 30 , income = medium, student = yes, credit_rating = fair)
P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
age income student credit_rating buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31…40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
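The same computation can be written compactly in code. The sketch below assumes the training tuples are dicts of categorical attribute values (an illustrative representation) and estimates the priors and conditional probabilities by relative frequencies, as in the hand calculation above:

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, target):
    """Estimate P(Ci) and the per-class value counts needed for P(xk|Ci)."""
    class_counts = Counter(r[target] for r in rows)
    priors = {c: n / len(rows) for c, n in class_counts.items()}
    cond = defaultdict(lambda: defaultdict(Counter))   # cond[class][attr][value] = count
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond[r[target]][attr][value] += 1
    return priors, cond, class_counts

def classify(x, priors, cond, class_counts):
    """Return argmax_c P(c) * prod_k P(x_k | c)."""
    scores = {}
    for c, prior in priors.items():
        p = prior
        for attr, value in x.items():
            # relative-frequency estimate; an unseen value gives probability 0
            # (see the Laplacian correction later)
            p *= cond[c][attr][value] / class_counts[c]
        scores[c] = p
    return max(scores, key=scores.get), scores
```

For X = (age <= 30, income = medium, student = yes, credit_rating = fair) the scores come out at roughly 0.028 for "yes" and 0.007 for "no", matching the figures above.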
24
Avoiding the Zero-Probability Problem
 Naïve Bayesian prediction requires each conditional probability to be
non-zero; otherwise the likelihood estimate, being a product of
conditional probabilities, will be zero
 Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
 Use Laplacian correction (or Laplacian estimator)
 Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
 The “corrected” prob. estimates are close to their
“uncorrected” counterparts
P(X|Ci) = Π_{k=1..n} P(x_k|Ci)
25
Naïve Bayes Classifier: Comments
 Advantages
 Easy to implement
 Accurate results obtained in most of the cases
 Disadvantages
 Relies on class conditional independence assumption:
Practically, dependencies exist among variables
 E.g., Symptoms: fever, cough, cold, body aches, etc.,
 Dependencies among these cannot be modeled by Naïve
Bayes Classifier.
 If the features are not independent, predictions are less
accurate
26
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
27
Using IF-THEN Rules for Classification
 Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
 Rule antecedent/precondition vs. rule consequent
 Assessment of a rule: coverage and accuracy
 A rule R covers a tuple T if the precondition (antecedent) of R is satisfied by T
 ncovers = # of tuples covered by R
 ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| where D is the training data set
accuracy(R) = ncorrect / ncovers
 When classifying with rules, if multiple rules are applicable, conflict resolution
is called for.
 Size ordering: assign the highest priority to the rule that has the "toughest"
requirement (i.e., with the most attribute tests); more specific rules are preferred
 Class-based ordering: decreasing order of prevalence or misclassification cost
 Rule-based ordering (decision list): rules are organized into one long priority
list, according to some measure of rule quality or by experts
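A brief sketch of coverage and accuracy for a single rule, assuming the antecedent is represented as a dict of attribute = value tests (an illustrative encoding):

```python
def covers(antecedent, tuple_):
    """True if every attribute test of the rule holds for the tuple."""
    return all(tuple_.get(attr) == value for attr, value in antecedent.items())

def rule_coverage_accuracy(antecedent, consequent_class, data, target):
    covered = [t for t in data if covers(antecedent, t)]
    correct = [t for t in covered if t[target] == consequent_class]
    coverage = len(covered) / len(data)
    accuracy = len(correct) / len(covered) if covered else 0.0
    return coverage, accuracy
```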
28
age?
  <=30   → student?   no → no;  yes → yes
  31..40 → yes
  >40    → credit rating?   excellent → no;  fair → yes
 Example: Rule extraction from our buys_computer decision-tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
Rule Extraction from a Decision Tree
 Rules are easier to understand than large
trees
 One rule is created for each path from the
root to a leaf
 Each attribute-value pair along a path forms
a conjunction: the leaf holds the class
prediction
 Rules are mutually exclusive and exhaustive
29
Rule Induction: Sequential Covering Method
 Sequential covering algorithm: Extracts rules directly from training
data
 Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
 Rules are learned sequentially, each for a given class Ci will cover
many tuples of Ci but none (or few) of the tuples of other classes
 Steps:
 Rules are learned one at a time
 Each time a rule is learned, the tuples covered by the rules are
removed
 Repeat the process on the remaining tuples until termination
condition, e.g., when no more training examples or when the
quality of a rule returned is below a user-specified threshold
 Compared to decision-tree induction that learns a set of rules
simultaneously, sequential covering alg learns rules one-by-one.
30
Sequential Covering Algorithm
while (enough target tuples left)
generate a rule
remove positive target tuples satisfying this rule
[Figure: positive examples progressively covered by Rule 1, Rule 2, and Rule 3]
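A skeleton of this loop is sketched below. learn_one_rule() stands in for the greedy rule-growing step (e.g., guided by FOIL-gain) and is left unspecified; covers() is the helper sketched earlier:

```python
def sequential_covering(data, target, class_value, learn_one_rule):
    """Learn a rule list for one class, removing covered positives after each rule."""
    rules, remaining = [], list(data)
    while any(t[target] == class_value for t in remaining):
        rule = learn_one_rule(remaining, target, class_value)   # hypothetical helper
        if rule is None:                                        # no acceptable rule left
            break
        antecedent, _ = rule
        kept = [t for t in remaining
                if not (covers(antecedent, t) and t[target] == class_value)]
        if len(kept) == len(remaining):                         # rule covers no positives
            break
        rules.append(rule)
        remaining = kept
    return rules
```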
31
Rule Generation
 To generate a rule
while(true)
find the best predicate p
if foil-gain(p) > threshold then add p to current rule
else break
[Figure: rule growth from general to specific, e.g., A3=1, then A3=1 && A1=2, then A3=1 && A1=2 && A8=5, progressively separating the positive from the negative examples]
How to Learn-One-Rule?
32
33
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Learning by Backpropagation
 K-Nearest Neighbours classification
Classifier Evaluation Metrics: Confusion
Matrix (C1 as Positive class)
Example of Confusion Matrix:
Actual class \ Predicted class   buy_computer = yes   buy_computer = no   Total
buy_computer = yes               6954                 46                  7000
buy_computer = no                412                  2588                3000
Total                            7366                 2634                10000
 Given m classes, an entry, CMi,j in a confusion matrix indicates
# of tuples in class i that were labeled by the classifier as class j
 May have extra rows/columns to provide totals
Confusion Matrix:
Actual class \ Predicted class   C1                      ¬ C1
C1                               True Positives (TP)     False Negatives (FN)
¬ C1                             False Positives (FP)    True Negatives (TN)
34
Classifier Evaluation Metrics:
Accuracy, Error Rate, Sensitivity and Specificity
35
Classifier Evaluation Metrics:
Precision and Recall, and F-measures
 Precision: exactness – what % of tuples that the classifier
labeled as positive are actually positive
 Recall: completeness – what % of positive tuples did the
classifier label as positive?
 Perfect score is 1.0
 Inverse relationship between precision & recall
 F measure (F1 or F-score): harmonic mean of precision and
recall,
 Fß: weighted measure of precision and recall
 assigns ß times as much weight to recall as to precision
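A minimal sketch of these metrics from the confusion-matrix counts, using the usual (1 + β²)·P·R / (β²·P + R) form of the weighted F measure:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta * beta
    f = ((1 + b2) * precision * recall / (b2 * precision + recall)
         if (precision + recall) else 0.0)
    return precision, recall, f

# Cancer example on the next slide: precision_recall_f(90, 140, 210)
# -> precision ≈ 0.3913, recall = 0.30, F1 ≈ 0.3396
```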
36
Classifier Evaluation Metrics: Example
37
 Precision = 90/230 = 39.13% Recall = 90/300 = 30.00%
Actual class \ Predicted class   cancer = yes   cancer = no   Total   Recognition (%)
cancer = yes                     90             210           300     30.00 (sensitivity)
cancer = no                      140            9560          9700    98.56 (specificity)
Total                            230            9770          10000   96.50 (accuracy)
Model Evaluation and Selection
 Use validation (test) set of class-labeled tuples instead of
training set when assessing accuracy
 Methods for estimating a classifier’s accuracy:
 Holdout method, random sub-sampling
 Cross-validation
 Bootstrap
 Comparing classifiers:
 Confidence intervals
 Cost-benefit analysis and ROC Curves
38
Evaluating Classifier Accuracy:
Holdout & Cross-Validation Methods
 Holdout method
 Given data is randomly partitioned into two independent sets
 Training set (e.g., 2/3) for model construction
 Test set (e.g., 1/3) for accuracy estimation
 Random sub-sampling: a variation of holdout
 Repeat holdout k times, accuracy = avg. of the accuracies
obtained
 Cross-validation (k-fold, where k = 10 is most popular)
 Randomly partition the data into k mutually exclusive subsets,
each approximately equal size
 At i-th iteration, use Di as test set and others as training set
 Leave-one-out: k folds where k = # of tuples, for small sized
data
 *Stratified cross-validation*: folds are stratified so that class
dist. in each fold is approx. the same as that in the initial data
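A short sketch of k-fold cross-validation; train_fn(train_rows) is assumed to return a model callable as model(tuple) -> predicted label (hypothetical interfaces):

```python
import random

def k_fold_accuracy(data, target, train_fn, k=10, seed=0):
    """Average accuracy over k mutually exclusive, roughly equal-sized folds."""
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)
        accuracies.append(sum(1 for t in test if model(t) == t[target]) / len(test))
    return sum(accuracies) / k
```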
39
Evaluating Classifier Accuracy: Bootstrap
 Bootstrap
 Works well with small data sets
 Samples the given training tuples uniformly with replacement
 i.e., each time a tuple is selected, it is placed back into the pool and is
equally likely to be selected again
 A commonly used bootstrap method is the .632 bootstrap
 A data set with d tuples is sampled d times, with replacement, resulting in a
training set of d samples. The data tuples that did not make it into the
training set form the test set. About 63.2% of the original tuples end up
in the bootstrap training set, and the remaining 36.8% form the test set
(since (1 − 1/d)^d ≈ e^−1 = 0.368)
 Repeat the sampling procedure k times, estimate the overall accuracy of the
model:
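A sketch of this estimate under the assumptions of the earlier sketches (train_fn hypothetical); the 0.632/0.368 weighting of the test-set and training-set accuracy follows the usual .632 bootstrap formulation:

```python
import random

def bootstrap_632_accuracy(data, target, train_fn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(data)

    def acc(model, rows):
        return sum(1 for t in rows if model(t) == t[target]) / len(rows)

    estimates = []
    for _ in range(k):
        idx = [rng.randrange(d) for _ in range(d)]              # sample d times with replacement
        train = [data[i] for i in idx]
        chosen = set(idx)
        test = [data[i] for i in range(d) if i not in chosen]   # ~36.8% left out
        model = train_fn(train)
        estimates.append(0.632 * acc(model, test) + 0.368 * acc(model, train))
    return sum(estimates) / k
```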
40
Model Selection: ROC Curves
 ROC (Receiver Operating Characteristics)
curves: for visual comparison of probabilistic
classification models
 The plot shows the trade-off between the
true positive rate and the false positive rate
 Vertical axis represents the true positive rate,
TPR=TP/P, sensitivity or recall
 Horizontal axis rep. the false positive
rate,FPR=FP/N= (1-TNR) or (1-specificity)
 The area under the ROC curve is a measure
of the accuracy of the model
 The diagonal line corresponds to the
random guessing of class labels in balanced
datasets
 The model whose ROC is closer to the
diagonal line (i.e., the closer the area is to
0.5), is less accurate.
 A model with perfect accuracy will have an
area of 1.0
41
How to draw ROC
42
 Apply the model on test data to predict the probability of being
positive
 Rank the test tuples in the decreasing order of their probability of
being positive.
 Starting from the highest probability to accept the rank-1 tuple,
gradually reduce the threshold to accept more and more tuples as
positive and estimate TPR and FPR at each stage to plot the ROC.
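A sketch of this construction, assuming scored is a list of (probability_of_positive, true_label) pairs produced by the model on the test data:

```python
def roc_points(scored, positive_label):
    """Sweep the threshold from high to low and emit (FPR, TPR) points."""
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)   # decreasing probability
    p = sum(1 for _, y in ranked if y == positive_label)
    n = len(ranked) - p
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:                 # accept one more tuple as positive at each step
        if y == positive_label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points
```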
Issues Affecting Model Selection
 Accuracy
 classifier accuracy: predicting class label
 Speed
 time to construct the model (training time)
 time to use the model (classification/prediction time)
 Robustness: handling noise and missing values
 Scalability: efficiency in disk-resident databases
 Interpretability
 understanding and insight provided by the model
 Other measures, e.g., goodness of rules, such as decision tree
size or compactness of classification rules
43
44
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
Ensemble Methods: Increasing the Accuracy
 Ensemble methods
 Use a combination of models to increase accuracy
 Combine a series of k learned models, M1, M2, …, Mk, with
the aim of creating an improved model M*
 Popular ensemble methods
 Bagging: averaging the prediction over a collection of
classifiers
 Boosting: weighted vote with a collection of classifiers
 Random Forest: majority voting among collection of base
classifiers built through randomly sampled attribute set 45
Bagging: Bootstrap Aggregation
 Analogy: Diagnosis based on multiple doctors’ majority opinion / vote
 Training
 Given a set D of d tuples, at each iteration i, a training set Di of d tuples is
sampled with replacement from D (i.e., bootstrap)
 A classifier model Mi is learned for each training set Di
 Classification: to classify an unknown sample X
 Each classifier Mi returns its class prediction
 The bagged classifier M* counts the votes and assigns the class with the
majority votes to X
 Regression: can be applied to predict continuous-valued variables by taking the
average of the individual predictions for a given test tuple
 Accuracy
 Often significantly improves the accuracy of prediction compared to a single
classifier derived from D
 Noisy data: more robust to noise, since the final prediction is made by majority vote
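A minimal bagging sketch under the same hypothetical train_fn interface used earlier; the bagged classifier predicts by majority vote over the k models:

```python
import random
from collections import Counter

def bagging(data, train_fn, k=10, seed=0):
    rng = random.Random(seed)
    d = len(data)
    models = []
    for _ in range(k):
        sample = [data[rng.randrange(d)] for _ in range(d)]   # bootstrap sample Di
        models.append(train_fn(sample))                       # model Mi
    def bagged_classifier(x):
        votes = Counter(m(x) for m in models)
        return votes.most_common(1)[0][0]                     # majority vote
    return bagged_classifier
```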
46
Boosting
 Analogy: Consult several doctors, and decide based on a
combination of weighted diagnoses—weight assigned based on
the previous diagnosis accuracy
 How boosting works?
 Weights are assigned to each training tuple
 A series of k classifiers (that complement each other) is
iteratively learned
 After a classifier Mi is learned, the weights are updated to
build the subsequent classifier, Mi+1, paying more attention to
the training tuples that were misclassified by Mi
 The final M* combines the votes of each individual classifier,
where the weight of each classifier's vote is a function of its
accuracy
 Boosting algorithm can be extended for numeric prediction
 Comparing with bagging: Boosting tends to have greater accuracy,
but it also risks overfitting the model to misclassified or noisy data
47
48
Adaboost (Freund and Schapire, 1997)
 Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)
 Initially, all tuples are assigned the same weight which is equal to 1/d
 Generate k classifiers in k rounds. At round i,
 Tuples from D are sampled (with replacement) to form a training set
Di of the same size
 Each tuple’s chance of being selected is based on its weight
 A classification model Mi is derived from Di
 Its error rate is calculated using Di as a test set
 If a tuple is misclassified, its weight is increased, else it is decreased
 Error rate: err(Xj) is the misclassification error of a specific tuple Xj. The
error rate of classifier Mi is the weighted sum over the misclassified tuples:
error(Mi) = Σ_{j=1..d} w_j × err(Xj)
 The weight of classifier Mi's vote is
log( (1 − error(Mi)) / error(Mi) )
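A sketch of one AdaBoost round following these steps (train_fn hypothetical; the error is clamped to avoid division by zero, and the tuple weights are renormalized after the update):

```python
import math
import random

def adaboost_round(data, weights, target, train_fn, rng=None):
    rng = rng or random.Random(0)
    d = len(data)
    sample = rng.choices(data, weights=weights, k=d)        # weight-proportional sampling
    model = train_fn(sample)
    miss = [1.0 if model(t) != t[target] else 0.0 for t in data]
    error = sum(w * m for w, m in zip(weights, miss))       # weighted error of Mi
    error = min(max(error, 1e-10), 1 - 1e-10)
    alpha = math.log((1 - error) / error)                   # weight of Mi's vote
    new_w = [w * math.exp(alpha * m) for w, m in zip(weights, miss)]  # boost misclassified
    total = sum(new_w)
    return model, alpha, [w / total for w in new_w]
```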
Random Forest (Breiman 2001)
 Random Forest:
 Each classifier in the ensemble is a decision tree classifier and is
generated using a random selection of attributes at each node to
determine the split
 During classification, each tree votes and the most popular class is
returned
 Two Methods to construct Random Forest:
 Forest-RI (random input selection): Randomly select, at each node, F
attributes as candidates for the split at the node. The CART methodology
is used to grow the trees to full size
 Forest-RC (random linear combinations): Creates new attributes (or
features) that are a linear combination of the existing attributes
(reduces the correlation between individual classifiers)
 Comparable in accuracy to Adaboost, but more robust to noise and outliers
 Faster than bagging or boosting
49
50
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
51
Classification by Backpropagation
 A neural network: A set of connected input/output units called
neurons where each connection has a weight associated with it.
 Extensively used for classification and regression tasks.
 During the learning phase, the network learns by adjusting the
weights so as to be able to predict the correct class label of
training tuples
 Back propagation: A supervised algorithm for learning weights of
edges in a Feed Forward Neural Network.
 These weight adjustments are made in the backward direction
from output layer through each hidden layer down to the first
hidden layer and hence named as ‘Back propagation’.
52
A Multi-Layer Feed-Forward Neural Network
[Figure: a multi-layer feed-forward network; the input vector X enters the input layer, passes through a hidden layer, and the output layer emits the output vector; each connection carries a weight w_ij]
Weight update: w_ij^(k+1) = w_ij^(k) + η (y_j − ŷ_j) x_i
53
Activity of a Neuron at Hidden/Output Layer
 An n-dimensional input vector x is mapped into variable y by means of the scalar
product and a nonlinear function mapping
 Hidden and output units receive inputs from the units in the previous layer. The
activations of the feeding units are propagated along the weighted edges and
summed; the bias associated with the unit is added, and a nonlinear activation
function is applied to determine the output of the unit.
[Figure: a single neuron; inputs x_0, x_1, …, x_n with weights w_0, w_1, …, w_n feed a weighted sum, a bias is added, and a nonlinear activation function f produces the output y]
Example: y = sign( Σ_{i=0..n} w_i x_i + bias )
54
How A Multi-Layer Neural Network Works
 The inputs to the network correspond to the attributes measured for each
training tuple
 Inputs are fed simultaneously into the units making up the input layer
 They are weighted and fed simultaneously to a hidden layer
 The number of hidden layers is arbitrary, although usually only one
 The weighted outputs of the last hidden layer are input to units making up
the output layer, which emits the network's prediction
 Each neuron in the hidden layer as well as output layer has its own bias
which is learnt along with the weights of edges during training phase.
 The network is feed-forward: none of the weights cycles back to an input
unit or to a hidden unit of a previous layer
 From a statistical point of view, ANNs perform nonlinear regression or
probabilistic classification: Given enough hidden units and enough training
samples, they can closely approximate any function
55
Defining a Network Topology for
Classification / Regression
 Decide the network topology: Specify # of units in the input layer, #
of hidden layers, # of units in each hidden layer, and # of units in the
output layer
 One input unit for each descriptive feature
 One output unit for binary classification; for multi-class classification, the
number of output units equals the number of classes
 Experimentally select an appropriate number of hidden neurons depending on the
problem complexity
 Normalize the input values for each attribute in the training tuples
to [0.0—1.0] range
 Train the network applying a learning algorithm like Back
propagation to adjust the weights of edges for classification
 Once a network has been trained and if its accuracy is unacceptable,
repeat the training process with a different network topology or a
different set of initial weights
56
Backpropagation
 Iteratively process a set of training tuples & compare the network's prediction
with the actual known target value
 For each training tuple, the weights are modified to minimize the mean
squared error between the network's prediction and the actual target value
 Modifications are made in the “backwards” direction: from the output layer,
through each hidden layer down to the first hidden layer, hence
“backpropagation”
 Steps
 Initialize weights to small random numbers, associated with biases
 Propagate the inputs forward (by applying activation function)
 Backpropagate the error and update weights and biases
 Terminating condition (when error is very small, etc.)
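A compact sketch of this loop for a network with one hidden layer, using NumPy, sigmoid activations, and squared error; the architecture, learning rate, and epoch count are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_backprop(X, y, n_hidden=4, lr=0.1, epochs=1000, seed=0):
    """X: (n_samples, n_inputs) scaled to [0, 1]; y: (n_samples, 1) with 0/1 labels."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0, 0.1, (X.shape[1], n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, 1));          b2 = np.zeros(1)
    for _ in range(epochs):
        # propagate the inputs forward
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        # backpropagate the error (gradient of the squared error)
        err_out = (out - y) * out * (1 - out)
        err_hid = (err_out @ W2.T) * h * (1 - h)
        # update weights and biases, output layer first
        W2 -= lr * h.T @ err_out;  b2 -= lr * err_out.sum(axis=0)
        W1 -= lr * X.T @ err_hid;  b1 -= lr * err_hid.sum(axis=0)
    return W1, b1, W2, b2
```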
57
Gradient Descent
[Figure: gradient descent on the error surface; the weights move in the direction of the negative gradient]
58
Strengths of ANN for classification
 Strength
 High tolerance to noisy data
 Ability to classify unknown instances accurately
 Well-suited for continuous-valued inputs and outputs
 Successful on real-world data, e.g., hand-written letters
 Algorithms are inherently parallel since the neurons in a
layer work independently
 Techniques have recently been developed for the
extraction of rules from trained neural networks
 Capability of ANNs is further extended through Deep
Learning
59
60
Drawbacks of Neural Network as a
Classifier
 Weakness
 Long training time
 Require a number of parameters typically
best determined empirically, e.g., the
network topology or “structure.”
 Poor interpretability: Difficult to interpret the
symbolic meaning behind the learned weights
and of “hidden units” in the network
61
Chapter 8. Classification: Basic Concepts
 Classification: Basic Concepts
 Decision Tree Induction
 Bayes Classification Methods
 Rule-Based Classification
 Model Evaluation and Selection
 Techniques to Improve Classification Accuracy:
Ensemble Methods
 Classification by Backpropagation
 Lazy Learners (K-Nearest Neighbors Classification)
62
Lazy vs. Eager Learning
 Lazy vs. eager learning
 Lazy learning (e.g., instance-based learning): Simply stores
training data (or only minor processing) and waits until it is
given a test tuple
 Eager learning (the above discussed methods): Given a set of
training tuples, constructs a classification model before
accepting new (e.g., test) data to classify
 Lazy: less time in training but more time in predicting
 Accuracy
 Lazy method effectively uses a richer hypothesis space since
it uses many local linear functions to form an implicit global
approximation to the target function
 Eager: must commit to a single hypothesis that covers the
entire instance space
63
The k-Nearest Neighbor Algorithm
 All instances are preprocessed and mapped onto points in n-D space
 The nearest neighbors to a query point, Xq are identified from the
training instances sorted in the ascending order of their Euclidean
distances to Xq, dist(Xi, Xq)
 The target function may be discrete-valued (for classification tasks) or
real-valued (for regression tasks)
 For Classification tasks, k-NN returns the most common value (mode)
among the labels of the k training examples nearest to xq.
 Voronoi diagram: the decision surface induced by 1-NN for a typical
set of training examples
[Figure: a query point xq surrounded by positive (+) and negative (−) training examples in the feature space]
64
k-NN Algorithm for Regression
 k-NN for real-valued prediction for a given unknown tuple, Xq
 Returns the mean of the 𝑦𝑖 values of k nearest neighbors
 Distance-weighted nearest neighbor algorithm
 Weigh the contribution of each of the k neighbors according
to their distance to the query xq
 Gives greater weight to closer neighbors
 Robust to noisy data by averaging k-nearest neighbors
 Curse of dimensionality: the distance between neighbors can be
dominated by irrelevant attributes
 To overcome it, eliminate the least relevant attributes
Distance weight: w_i = 1 / d(x_q, x_i)²
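A minimal k-NN sketch covering both uses, assuming each training example is a (feature_vector, value) pair with numeric features:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, xq, k=3):
    """train: list of (feature_vector, label). Majority label of the k nearest."""
    nearest = sorted(train, key=lambda t: euclidean(t[0], xq))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def knn_regress(train, xq, k=3, weighted=False):
    """train: list of (feature_vector, y). Plain or distance-weighted mean of the k nearest."""
    nearest = sorted(train, key=lambda t: euclidean(t[0], xq))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / len(nearest)
    ws = [1.0 / (euclidean(x, xq) ** 2 + 1e-12) for x, _ in nearest]   # w_i = 1/d^2
    return sum(w * y for w, (_, y) in zip(ws, nearest)) / sum(ws)
```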

More Related Content

PPTX
unit classification.pptx
PPT
Unit 3classification
PPTX
Dataming-chapter-7-Classification-Basic.pptx
PPT
08 classbasic
PPT
08 classbasic
PDF
08 classbasic
PPT
Classification (ML).ppt
PPT
1791kjkljkljlkkljlkjkljlkkljlkjkjl9164.ppt
unit classification.pptx
Unit 3classification
Dataming-chapter-7-Classification-Basic.pptx
08 classbasic
08 classbasic
08 classbasic
Classification (ML).ppt
1791kjkljkljlkkljlkjkljlkkljlkjkjl9164.ppt

Similar to classification in data mining and data warehousing.pdf (20)

PPT
Chapter 08 ClassBasic.ppt file used for help
PPT
Chapter 8. Classification Basic Concepts.ppt
PPT
Data Mining Concepts and Techniques.ppt
PPT
Data Mining Concepts and Techniques.ppt
PPT
Data Mining
PPT
ClassificationOfMachineLearninginCSE.ppt
PPT
Chapter 08 Class_Basic.ppt DataMinning
PPT
Unit-4 classification
PPTX
Unit 4 Classification of data and more info on it
PPT
4_22865_IS465_2019_1__2_1_08ClassBasic.ppt
PDF
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
PPT
08ClassBasic.ppt
PPT
Basics of Classification.ppt
PPT
Cs501 classification prediction
PPT
Data Mining and Warehousing Concept and Techniques
PPT
Basic Concept of Classification - Data Mining
PPT
Classification Algorighms in Data Warehousing and Data Mininbg
PPT
08ClassBasic - Cosdfsdfadgádfádffádgádpy.ppt
PPTX
Machine learning Chapter three (16).pptx
PPT
classification in data warehouse and mining
Chapter 08 ClassBasic.ppt file used for help
Chapter 8. Classification Basic Concepts.ppt
Data Mining Concepts and Techniques.ppt
Data Mining Concepts and Techniques.ppt
Data Mining
ClassificationOfMachineLearninginCSE.ppt
Chapter 08 Class_Basic.ppt DataMinning
Unit-4 classification
Unit 4 Classification of data and more info on it
4_22865_IS465_2019_1__2_1_08ClassBasic.ppt
Classification, Attribute Selection, Classifiers- Decision Tree, ID3,C4.5,Nav...
08ClassBasic.ppt
Basics of Classification.ppt
Cs501 classification prediction
Data Mining and Warehousing Concept and Techniques
Basic Concept of Classification - Data Mining
Classification Algorighms in Data Warehousing and Data Mininbg
08ClassBasic - Cosdfsdfadgádfádffádgádpy.ppt
Machine learning Chapter three (16).pptx
classification in data warehouse and mining
Ad

Recently uploaded (20)

PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PPT
Project quality management in manufacturing
PPTX
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
PPTX
UNIT 4 Total Quality Management .pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
PPT on Performance Review to get promotions
PPTX
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
OOP with Java - Java Introduction (Basics)
PDF
Digital Logic Computer Design lecture notes
DOCX
573137875-Attendance-Management-System-original
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Internet of Things (IOT) - A guide to understanding
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPT
Mechanical Engineering MATERIALS Selection
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
Project quality management in manufacturing
KTU 2019 -S7-MCN 401 MODULE 2-VINAY.pptx
UNIT 4 Total Quality Management .pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT on Performance Review to get promotions
Recipes for Real Time Voice AI WebRTC, SLMs and Open Source Software.pptx
Embodied AI: Ushering in the Next Era of Intelligent Systems
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
UNIT-1 - COAL BASED THERMAL POWER PLANTS
OOP with Java - Java Introduction (Basics)
Digital Logic Computer Design lecture notes
573137875-Attendance-Management-System-original
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Internet of Things (IOT) - A guide to understanding
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
Mechanical Engineering MATERIALS Selection
Ad

classification in data mining and data warehousing.pdf

  • 1. 1 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 2. 2 Supervised vs. Unsupervised Learning  Supervised learning (classification) or Predictive Mining  Supervision: The training data (observations, past experience, etc.) has labels indicating the class of the observations  New data is classified based on the training set  Unsupervised learning (clustering) or Descriptive Mining  Class labels are not assigned to training data instances  Given a set of measurements, observations, etc. discovers patterns with the aim of grouping similar instances to form clusters
  • 3. 3  Classification  Classification aims to predict categorical class labels  constructs a predictive model based on the descriptions of training instances and their class labels and uses it for classifying new data  Regression  Predicts numeric values of a dependent attribute in terms of one or more independent predictor attributes  models continuous-valued functions  Used for estimating unknown or missing values  Typical applications  Credit/loan approval:  Medical diagnosis: if a tumor is cancerous or benign  Fraud detection: if a transaction is fraudulent or genuine  Cost assessment of properties Prediction Problems: Classification vs. Regression
  • 4. 4 Classification—A Two-Step Process  Model construction:  Each tuple/sample is assumed to belong to a predefined class, as specified by the class label attribute  Set of labeled tuples are partitioned into training and test sets  The training set of tuples are used for model construction and refinement  The model is represented as classification rules, decision trees, SVM, ANN, etc  Model usage: After validating the model, it can be used for classifying future or unknown objects  Estimate accuracy of the model  The known (true) label of test sample is compared with the predicted label given by the classification model  Accuracy rate is the percentage of test set samples that are correctly classified by the model  Test set Accuracy reflects its generalization performance on unknown data  If the accuracy is acceptable, use the model to classify new data  Part of the training set called validation set is used to select model hyper parameters during refinement to achieve optimal generalization performance..
  • 5. 5 Process (1): Model Construction Training Data NAME RANK YEARS TENURED Mike Assistant Prof 3 no Mary Assistant Prof 7 yes Bill Professor 2 yes Jim Associate Prof 7 yes Dave Assistant Prof 6 no Anne Associate Prof 3 no Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Classifier (Model)
  • 6. 6 Process (2): Using the Model in Prediction Classifier Testing Data NAME RANK YEARS TENURED Tom Assistant Prof 2 no Merlisa Associate Prof 7 no George Professor 5 yes Joseph Assistant Prof 7 yes Unseen Data (Jeff, Professor, 4) Tenured?
  • 7. 7 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 8. 8 Decision Tree Induction: An Example age? overcast student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no  Training data set: Buys_computer  Attribute ‘age’ is discretized  Quinlan’s ID3 algorithm learns the Decision tree model  Resulting tree:
  • 9. 9 Algorithm for Decision Tree Induction  Basic algorithm (a greedy algorithm)  Tree is constructed in a top-down recursive divide-and- conquer manner  At start, all the training examples are at the root  If attributes are continuous-valued, they are discretized in advance to have all attributes are of categorical type  Attributes are selected for splitting the data on the basis of a heuristic or statistical measure (e.g., information gain)  Examples are partitioned recursively based on selected attributes  Conditions for stopping partitioning  All samples for a given node belong to the same class  There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf  There are no samples left
  • 10.  Brief Review of Entropy 10 m = 2
  • 11. 11 Attribute Selection Measure: Information Gain (ID3/C4.5)  Select the attribute with the highest information gain  Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci|/|D|  Expected information (entropy) needed to classify a tuple in D:  Information needed after using A to split D into v partitions (conditional entropy with A) to classify D:  Information gained by splitting D on attribute A ) ( log ) ( 2 1 i m i i p p D Info     ) ( | | | | ) ( 1 j v j j A D Info D D D Info     (D) Info Info(D) Gain(A) A  
  • 12. 12 Attribute Selection: Information Gain  Class P: buys_computer = “yes”  Class N: buys_computer = “no” means “age <=30” has 5 out of 14 samples, with 2 yes’es and 3 no’s. Similarly the others. age pi ni I(pi, ni) <=30 2 3 0.971 31…40 4 0 0 >40 3 2 0.971 694 . 0 ) 2 , 3 ( 14 5 ) 0 , 4 ( 14 4 ) 3 , 2 ( 14 5 ) (     I I I D Infoage 246 . 0 ) ( ) ( ) (    D Info D Info age Gain age age income student credit_rating buys_computer <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no ) 3 , 2 ( 14 5 I 940 . 0 ) 14 5 ( log 14 5 ) 14 9 ( log 14 9 ) 5 , 9 ( ) ( 2 2      I D Info Similarly the information gain for the other attributes will be estimated as Gain(income)=0.029; Gain(student)=0.151; Gain(credit_rating)=0.048;
  • 13. 13 Determining Best Split Point for Numerical Attributes  Let attribute A be a continuous-valued attribute whose range is split into two for partitioning the data during decision tree construction  Binary Split: D1 is the set of tuples in D satisfying A ≤ split-point, and D2 is the set of tuples in D satisfying A > split-point  Steps to determine the best split point for A  Sort the instances based on the values of A in increasing order  Typically, the midpoint between each pair of adjacent values with altered class labels is considered as a possible split point  (ai+ai+1)/2 is the midpoint between the values of ai and ai+1  The point with the minimum (conditional entropy) expected information requirement is selected as the best split-point for A
  • 14. 14 Gain Ratio for Attribute Selection (C4.5)  Information gain measure is biased towards attributes with a large number of splits  C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)  GainRatio(A) = Gain(A)/SplitInfo(A)  Ex.  gain_ratio(income) = 0.029/1.557 = 0.019  The attribute with the maximum gain ratio is selected as the splitting attribute ) | | | | ( log | | | | ) ( 2 1 D D D D D SplitInfo j v j j A     
  • 15. 15 Overfitting and Tree Pruning  Overfitting: An induced tree may overfit the training data  Complex models are developed with many branches often representing specific noisy instances with no significance for generalization  Results in poor accuracy for unseen samples  Two approaches to avoid overfitting  Prepruning: Halt tree construction early ̵ do not split a node if its entropy / uncertainity measure falls below a threshold  Difficult to choose an appropriate threshold  Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees  Use a (pruning) set of data different from the training data to decide which is the “best pruned tree”
  • 16. 16 Merits of Classification with Decision Trees  can use SQL queries for accessing databases  relatively faster learning speed on memory- resident training sets.  convertible to simple and easy to understand classification rules  Achieves classification accuracy comparable with other methods  AVC-sets (Attribute-Value, Classlabel) are maintained for each attribute at each tree node splitting to adopt to the available memory for gaining scalability to handle very large training sets.
  • 17. March 15, 2024 Data Mining: Concepts and Techniques 17 Presentation of Classification Results
  • 18. 18 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 19. 19 Prediction Based on Bayes’ Theorem  Given training data X, posteriori probability of a hypothesis H, denoted by P(H|X), follows the Bayes’ theorem  Informally, this can be viewed as posteriori = likelihood x prior/evidence  Predicts the class label of X to be Ci iff the probability P(Ci|X) is the highest among all P(Ck|X) for all k classes  Practical difficulty: It requires initial knowledge of many probabilities, involving significant data collection and computational costs. ) ( ) ( ) | ( ) | ( X X X P H P H P H P 
  • 20. 20 Classification Problem Is to Identify the Class with Maximum Posteriori Prob  Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, …, xn)  Suppose there are m classes C1, C2, …, Cm.  Classification is to identify the class with maximum posteriori, i.e., the maximal P(Ci|X)  This can be derived from Bayes’ theorem  Since P(X) is constant for all classes, for the purpose of classification, it is enough to identify the class that has maximum value of the numerator  The class label of X is given by ) ( ) ( ) | ( ) | ( X X X P i C P i C P i C P  )} ( ) | ( { max arg i C P i C P i X
  • 21. 21 Naïve Bayes Classifier  A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):  This greatly reduces the information requirement and computation cost  If Ak is categorical, P(xk|Ci) is the # of tuples in Ci having value xk for Ak divided by |Ci, | (# of tuples of Ci in D)  If Ak is continuous-valued, P(xk|Ci) is usually computed based on Gaussian distribution with mean μ and standard deviation σ and P(xk|Ci) is estimated at Ak= xk in terms of µ and σ for Ci as given below: ) | ( ... ) | ( ) | ( 1 ) | ( ) | ( 2 1 Ci x P Ci x P Ci x P n k Ci x P Ci P n k        X 2 2 2 ) ( 2 1 ) , , (          x e x g ) , , ( ) | ( i i C C k x g Ci P    X
  • 22. 22 Naïve Bayes Classifier: Training Dataset Class: C1:buys_computer = ‘yes’ C2:buys_computer = ‘no’ Priors, P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 P(buys_computer = “no”) = 5/14= 0.357 Data to be classified: X = (age <=30, Income = medium, Student = yes Credit_rating = Fair) age income student credit_rating buys_compu <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 23. 23 Naïve Bayes Classifier: An Example Priors, P(Ci): P(buys_computer = “yes”) = 0.643 P(buys_computer = “no”) = 0.357  Compute P(X|Ci) for each class P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222 P(age = “<= 30” | buys_computer = “no”) = 3/5 = 0.6 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4 P(student = “yes” | buys_computer = “yes) = 6/9 = 0.667 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4  X = (age <= 30 , income = medium, student = yes, credit_rating = fair) P(X|Ci) : P(X|buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044 P(X|buys_computer = “no”) = 0.6 x 0.4 x 0.2 x 0.4 = 0.019 P(X|Ci)*P(Ci) : P(X|buys_computer = “yes”) * P(buys_computer = “yes”) = 0.028 P(X|buys_computer = “no”) * P(buys_computer = “no”) = 0.007 Therefore, X belongs to class (“buys_computer = yes”) age income student credit_rating buys_comp <=30 high no fair no <=30 high no excellent no 31…40 high no fair yes >40 medium no fair yes >40 low yes fair yes >40 low yes excellent no 31…40 low yes excellent yes <=30 medium no fair no <=30 low yes fair yes >40 medium yes fair yes <=30 medium yes excellent yes 31…40 medium no excellent yes 31…40 high yes fair yes >40 medium no excellent no
  • 24. 24 Avoiding the Zero-Probability Problem  Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the likelihood estimates being the product of multiple cond. prob will be zero  Ex. Suppose a dataset with 1000 tuples, income=low (0), income= medium (990), and income = high (10)  Use Laplacian correction (or Laplacian estimator)  Adding 1 to each case Prob(income = low) = 1/1003 Prob(income = medium) = 991/1003 Prob(income = high) = 11/1003  The “corrected” prob. estimates are close to their “uncorrected” counterparts    n k Ci xk P Ci X P 1 ) | ( ) | (
  • 25. 25 Naïve Bayes Classifier: Comments  Advantages  Easy to implement  Accurate results obtained in most of the cases  Disadvantages  Relies on class conditional independence assumption: Practically, dependencies exist among variables  E.g., Symptoms: fever, cough, cold, body aches, etc.,  Dependencies among these cannot be modeled by Naïve Bayes Classifier.  If the features are not independent, predictions are less accurate
  • 26. 26 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 27. 27 Using IF-THEN Rules for Classification  Represent the knowledge in the form of IF-THEN rules R: IF age = youth AND student = yes THEN buys_computer = yes  Rule antecedent/precondition vs. rule consequent  Assessment of a rule: coverage and accuracy  A rule,R covers a tuple,T, if the precondition of R is true in T  ncovers = # of tuples covered by R  ncorrect = # of tuples correctly classified by R coverage(R) = ncovers /|D| where D is the training data set accuracy(R) = ncorrect / ncovers  While classification using Rules, if multiple rules are applicable, conflict resolution is called for.  Size ordering: assign the highest priority to the rules that has the “toughest” requirement (i.e., with the most attribute tests) or specific rules preferred  Class-based ordering: decreasing order of prevalence or misclassification cost  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
  • 28. 28 age? student? credit rating? <=30 >40 no yes yes yes 31..40 fair excellent yes no  Example: Rule extraction from our buys_computer decision-tree IF age = young AND student = no THEN buys_computer = no IF age = young AND student = yes THEN buys_computer = yes IF age = mid-age THEN buys_computer = yes IF age = old AND credit_rating = excellent THEN buys_computer = no IF age = old AND credit_rating = fair THEN buys_computer = yes Rule Extraction from a Decision Tree  Rules are easier to understand than large trees  One rule is created for each path from the root to a leaf  Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction  Rules are mutually exclusive and exhaustive
  • 29. 29 Rule Induction: Sequential Covering Method  Sequential covering algorithm: Extracts rules directly from training data  Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER  Rules are learned sequentially, each for a given class Ci will cover many tuples of Ci but none (or few) of the tuples of other classes  Steps:  Rules are learned one at a time  Each time a rule is learned, the tuples covered by the rules are removed  Repeat the process on the remaining tuples until termination condition, e.g., when no more training examples or when the quality of a rule returned is below a user-specified threshold  Compared to decision-tree induction that learns a set of rules simultaneously, sequential covering alg learns rules one-by-one.
  • 30. 30 Sequential Covering Algorithm while (enough target tuples left) generate a rule remove positive target tuples satisfying this rule Examples covered by Rule 3 Examples covered by Rule 2 Examples covered by Rule 1 Positive examples
  • 31. 31 Rule Generation  To generate a rule:
    while (true)
        find the best predicate p
        if foil-gain(p) > threshold then add p to the current rule
        else break
  [Figure: the rule is grown by adding one predicate at a time, e.g., A3=1, then A3=1 AND A1=2, then A3=1 AND A1=2 AND A8=5, separating the positive examples from the negative examples]
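A compact sketch of sequential covering in Python (illustrative only): it follows the loop above, but for brevity it scores candidate predicates by rule precision rather than FOIL gain, and the helper names are made up:

    def covers(antecedent, x):
        """True if tuple x satisfies every attribute test in the antecedent."""
        return all(x.get(a) == v for a, v in antecedent.items())

    def precision(antecedent, data, target):
        cov = [(x, y) for x, y in data if covers(antecedent, x)]
        return sum(y == target for _, y in cov) / len(cov) if cov else 0.0

    def learn_one_rule(data, target, attributes):
        """Greedily add the attribute test that most improves precision."""
        antecedent = {}
        while True:
            best, best_prec = None, precision(antecedent, data, target)
            for a in attributes:
                if a in antecedent:
                    continue
                for v in {x[a] for x, _ in data if a in x}:
                    cand = dict(antecedent, **{a: v})
                    p = precision(cand, data, target)
                    if p > best_prec:
                        best, best_prec = cand, p
            if best is None:            # no predicate improves the rule: stop growing
                return antecedent
            antecedent = best

    def sequential_covering(data, target, attributes):
        rules, remaining = [], list(data)
        while any(y == target for _, y in remaining):          # enough target tuples left
            r = learn_one_rule(remaining, target, attributes)  # generate a rule
            if not r:
                break
            rules.append(r)
            remaining = [(x, y) for x, y in remaining          # remove covered positives
                         if not (y == target and covers(r, x))]
        return rules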
  • 33. 33 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 34. 34 Classifier Evaluation Metrics: Confusion Matrix (C1 as Positive class)  Given m classes, an entry CMi,j in a confusion matrix indicates the # of tuples in class i that were labeled by the classifier as class j  May have extra rows/columns to provide totals
  Confusion Matrix:
      Actual class \ Predicted class |          C1          |          ¬C1
      C1                             | True Positives (TP)  | False Negatives (FN)
      ¬C1                            | False Positives (FP) | True Negatives (TN)
  Example of Confusion Matrix:
      Actual class \ Predicted class | buy_computer = yes | buy_computer = no | Total
      buy_computer = yes             |        6954        |         46        |  7000
      buy_computer = no              |         412        |       2588        |  3000
      Total                          |        7366        |       2634        | 10000
  • 35. 35 Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity  Accuracy = (TP + TN) / All: percentage of test set tuples that are correctly classified  Error rate = (FP + FN) / All = 1 − accuracy  Sensitivity (true positive rate, recall) = TP / P, where P is the number of positive tuples  Specificity (true negative rate) = TN / N, where N is the number of negative tuples  When the classes are imbalanced (as in the cancer example that follows), sensitivity and specificity are more informative than accuracy alone
  • 36. 36 Classifier Evaluation Metrics: Precision and Recall, and F-measures  Precision (exactness): what % of tuples that the classifier labeled as positive are actually positive? Precision = TP / (TP + FP)  Recall (completeness): what % of positive tuples did the classifier label as positive? Recall = TP / (TP + FN)  A perfect score is 1.0  There is an inverse relationship between precision and recall  F measure (F1 or F-score): harmonic mean of precision and recall, F1 = 2 × precision × recall / (precision + recall)  Fβ: weighted measure of precision and recall, Fβ = (1 + β²) × precision × recall / (β² × precision + recall); it assigns β times as much weight to recall as to precision
  • 37. 37 Classifier Evaluation Metrics: Example
      Actual class \ Predicted class | cancer = yes | cancer = no | Total | Recognition (%)
      cancer = yes                   |      90      |     210     |   300 | 30.00 (sensitivity)
      cancer = no                    |     140      |    9560     |  9700 | 98.56 (specificity)
      Total                          |     230      |    9770     | 10000 | 96.50 (accuracy)
   Precision = 90/230 = 39.13%  Recall = 90/300 = 30.00%
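The figures in this example are easy to verify in a few lines; a minimal sketch using the TP/FN/FP/TN counts from the cancer table above:

    # Confusion-matrix counts from the slide (cancer = yes is the positive class)
    TP, FN = 90, 210      # actual yes, predicted yes / no
    FP, TN = 140, 9560    # actual no,  predicted yes / no
    P, N = TP + FN, FP + TN

    accuracy    = (TP + TN) / (P + N)   # 0.9650
    error_rate  = (FP + FN) / (P + N)   # 0.0350
    sensitivity = TP / P                # recall = 0.3000
    specificity = TN / N                # ~0.9856
    precision   = TP / (TP + FP)        # ~0.3913
    f1 = 2 * precision * sensitivity / (precision + sensitivity)   # ~0.34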
  • 38. Model Evaluation and Selection  Use validation (test) set of class-labeled tuples instead of training set when assessing accuracy  Methods for estimating a classifier’s accuracy:  Holdout method, random sub-sampling  Cross-validation  Bootstrap  Comparing classifiers:  Confidence intervals  Cost-benefit analysis and ROC Curves 38
  • 39. 39 Evaluating Classifier Accuracy: Holdout & Cross-Validation Methods  Holdout method  Given data is randomly partitioned into two independent sets  Training set (e.g., 2/3) for model construction  Test set (e.g., 1/3) for accuracy estimation  Random sub-sampling: a variation of holdout  Repeat holdout k times; accuracy = avg. of the accuracies obtained  Cross-validation (k-fold, where k = 10 is most popular)  Randomly partition the data into k mutually exclusive subsets D1, …, Dk, each of approximately equal size  At the i-th iteration, use Di as the test set and the remaining folds as the training set  Leave-one-out: k folds where k = # of tuples, for small-sized data  Stratified cross-validation: folds are stratified so that the class distribution in each fold is approximately the same as that in the initial data
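A minimal sketch of k-fold cross-validation; train_fn (builds a model from a training list) and accuracy_fn (scores a model on a test list) are caller-supplied placeholders, not functions from the source:

    import random

    def k_fold_indices(n, k=10, seed=0):
        """Randomly partition indices 0..n-1 into k folds of roughly equal size."""
        idx = list(range(n))
        random.Random(seed).shuffle(idx)
        return [idx[i::k] for i in range(k)]

    def cross_validate(data, k, train_fn, accuracy_fn):
        """Average test accuracy over k folds."""
        folds = k_fold_indices(len(data), k)
        accs = []
        for i in range(k):
            test = [data[j] for j in folds[i]]
            train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
            model = train_fn(train)                 # build the classifier on k-1 folds
            accs.append(accuracy_fn(model, test))   # evaluate on the held-out fold
        return sum(accs) / k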
  • 40. 40 Evaluating Classifier Accuracy: Bootstrap  Bootstrap  Works well with small data sets  Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected it is re-added to the pool and is equally likely to be selected again  A commonly used bootstrap method is the .632 bootstrap  A data set with d tuples is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original tuples end up in the bootstrap sample, and the remaining 36.8% form the test set (since (1 − 1/d)^d ≈ e^(−1) ≈ 0.368)  Repeat the sampling procedure k times and estimate the overall accuracy of the model by combining, for each sample i, the accuracy on its test set (weight 0.632) and on its training set (weight 0.368): Acc(M) = Σ_{i=1..k} (0.632 × Acc(Mi)_test_set + 0.368 × Acc(Mi)_train_set)
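A small sketch of bootstrap sampling (illustrative only; training and evaluating a model on the resulting sets is left to the caller):

    import random

    def bootstrap_sample(data, seed=None):
        """Sample len(data) tuples with replacement; untouched tuples form the test set."""
        rng = random.Random(seed)
        d = len(data)
        chosen = [rng.randrange(d) for _ in range(d)]
        chosen_set = set(chosen)
        train = [data[i] for i in chosen]
        test = [data[i] for i in range(d) if i not in chosen_set]
        return train, test

    # With large d, roughly 63.2% of the distinct tuples land in the training sample
    data = list(range(10000))
    train, test = bootstrap_sample(data, seed=1)
    print(len(set(train)) / len(data))   # close to 0.632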
  • 41. 41 Model Selection: ROC Curves  ROC (Receiver Operating Characteristic) curves: for visual comparison of probabilistic classification models  The plot shows the trade-off between the true positive rate and the false positive rate  The vertical axis represents the true positive rate, TPR = TP/P (sensitivity or recall)  The horizontal axis represents the false positive rate, FPR = FP/N = 1 − TNR (i.e., 1 − specificity)  The area under the ROC curve (AUC) is a measure of the accuracy of the model  The diagonal line corresponds to random guessing of class labels in balanced datasets  A model whose ROC curve is closer to the diagonal line (i.e., whose area is closer to 0.5) is less accurate  A model with perfect accuracy has an area of 1.0  [Figure: ROC curves; each curve runs from the point where all tuples are predicted negative (0, 0) to the point where all tuples are predicted positive (1, 1)]
  • 42. 42 How to Draw an ROC Curve  Apply the model to the test data to predict the probability of each tuple being positive  Rank the test tuples in decreasing order of their probability of being positive  Starting with a threshold high enough to accept only the rank-1 tuple, gradually lower the threshold so that more and more tuples are accepted as positive, and estimate TPR and FPR at each stage to plot the ROC curve  [Figure: the curve traced from the all-negative corner (0, 0) to the all-positive corner (1, 1)]
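A sketch of this thresholding procedure (the scores and labels are invented); the result is the list of (FPR, TPR) points that would be plotted:

    def roc_points(labels, scores):
        """labels: 1 = positive, 0 = negative; scores: predicted probability of positive."""
        ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
        P = sum(labels)
        N = len(labels) - P
        tp = fp = 0
        points = [(0.0, 0.0)]                # threshold above every score: all predicted negative
        for _, y in ranked:                  # lower the threshold one tuple at a time
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / N, tp / P))  # (FPR, TPR) at this threshold
        return points                        # ends at (1, 1): all predicted positive

    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.95, 0.85, 0.78, 0.66, 0.60, 0.55, 0.53, 0.40]
    print(roc_points(labels, scores))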
  • 43. Issues Affecting Model Selection  Accuracy  classifier accuracy: predicting class label  Speed  time to construct the model (training time)  time to use the model (classification/prediction time)  Robustness: handling noise and missing values  Scalability: efficiency in disk-resident databases  Interpretability  understanding and insight provided by the model  Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules 43
  • 44. 44 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 45. 45 Ensemble Methods: Increasing the Accuracy  Ensemble methods  Use a combination of models to increase accuracy  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*  Popular ensemble methods  Bagging: averaging the predictions over a collection of classifiers  Boosting: weighted vote with a collection of classifiers  Random Forest: majority voting among a collection of base classifiers (decision trees), each built with a randomly sampled attribute set
  • 46. 46 Bagging: Bootstrap Aggregation  Analogy: diagnosis based on multiple doctors' majority opinion / vote  Training  Given a set D of d tuples, at each iteration i a training set Di of d tuples is sampled with replacement from D (i.e., a bootstrap sample)  A classifier model Mi is learned for each training set Di  Classification: to classify an unknown sample X  Each classifier Mi returns its class prediction  The bagged classifier M* counts the votes and assigns the class with the majority of votes to X  Regression: can be applied to predict continuous-valued variables by taking the average of the predictions for a given test tuple  Accuracy  Often significantly improves the accuracy of prediction compared to a single classifier derived from D  Noisy data: more robust to noise, since it goes by the majority of predictions
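A compact bagging sketch, assuming a caller-supplied train_fn(sample) that returns a model exposing predict(x) (a placeholder interface, not from the source):

    import random
    from collections import Counter

    def bagging_fit(data, train_fn, k=10, seed=0):
        """Train k models, each on a bootstrap sample of the training data."""
        rng = random.Random(seed)
        d = len(data)
        models = []
        for _ in range(k):
            sample = [data[rng.randrange(d)] for _ in range(d)]   # sample with replacement
            models.append(train_fn(sample))
        return models

    def bagging_predict(models, x):
        """Each model votes; return the majority class."""
        votes = Counter(m.predict(x) for m in models)
        return votes.most_common(1)[0][0]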
  • 47. 47 Boosting  Analogy: consult several doctors, and decide based on a combination of weighted diagnoses, where each weight is assigned based on the previous diagnosis accuracy  How does boosting work?  Weights are assigned to each training tuple  A series of k classifiers (that complement each other) is iteratively learned  After a classifier Mi is learned, the weights are updated so that the subsequent classifier, Mi+1, pays more attention to the training tuples that were misclassified by Mi  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy  The boosting algorithm can be extended for numeric prediction  Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified or noisy data
  • 48. 48 Adaboost (Freund and Schapire, 1997)  Given a set of d class-labeled tuples, (X1, y1), …, (Xd, yd)  Initially, all tuples are assigned the same weight, 1/d  Generate k classifiers in k rounds. At round i:  Tuples from D are sampled (with replacement) to form a training set Di of the same size  Each tuple's chance of being selected is based on its weight  A classification model Mi is derived from Di  Its error rate is calculated using Di as a test set  If a tuple is misclassified, its weight is increased; otherwise it is decreased  Error rate: err(Xj) is the misclassification error of tuple Xj (1 if misclassified, 0 otherwise). The error rate of classifier Mi is the weighted sum over the misclassified tuples: error(Mi) = Σ_{j=1..d} wj × err(Xj)  The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
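A simplified AdaBoost training loop following the steps above (a sketch, not the exact textbook pseudocode); train_fn is a caller-supplied weak learner returning a model with predict(x):

    import math
    import random

    def adaboost_fit(data, train_fn, k=5, seed=0):
        """data: list of (x, y); returns a list of (model, vote_weight) pairs."""
        rng = random.Random(seed)
        d = len(data)
        w = [1.0 / d] * d                      # all tuples start with equal weight 1/d
        ensemble = []
        for _ in range(k):
            # sample a training set of size d according to the current tuple weights
            sample = rng.choices(data, weights=w, k=d)
            model = train_fn(sample)
            miss = [int(model.predict(x) != y) for x, y in data]
            error = sum(wi * mi for wi, mi in zip(w, miss))   # weighted error of Mi
            if error == 0 or error >= 0.5:
                continue                        # degenerate round: skip it
            ensemble.append((model, math.log((1 - error) / error)))   # vote weight
            # decrease the weights of correctly classified tuples, then renormalize
            w = [wi * (error / (1 - error)) if mi == 0 else wi
                 for wi, mi in zip(w, miss)]
            total = sum(w)
            w = [wi / total for wi in w]
        return ensemble

    def adaboost_predict(ensemble, x, classes):
        score = {c: 0.0 for c in classes}
        for model, alpha in ensemble:
            score[model.predict(x)] += alpha    # each vote weighted by log((1-err)/err)
        return max(score, key=score.get)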
  • 49. Random Forest (Breiman 2001)  Random Forest:  Each classifier in the ensemble is a decision tree classifier and is generated using a random selection of attributes at each node to determine the split  During classification, each tree votes and the most popular class is returned  Two Methods to construct Random Forest:  Forest-RI (random input selection): Randomly select, at each node, F attributes as candidates for the split at the node. The CART methodology is used to grow the trees to full size  Forest-RC (random linear combinations): Creates new attributes (or features) that are a linear combination of the existing attributes (reduces the correlation between individual classifiers)  Comparable in accuracy to Adaboost, but more robust to noise and outliers  Faster than bagging or boosting 49
  • 50. 50 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 51. 51 Classification by Backpropagation  A neural network: a set of connected input/output units called neurons, where each connection has a weight associated with it  Extensively used for classification and regression tasks  During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the training tuples  Backpropagation: a supervised algorithm for learning the edge weights of a feed-forward neural network  The weight adjustments are made in the backward direction, from the output layer through each hidden layer down to the first hidden layer, hence the name 'backpropagation'
  • 52. 52 A Multi-Layer Feed-Forward Neural Network  [Figure: the input vector X feeds the input layer, which is connected through weights wij to a hidden layer and then to the output layer, which produces the output vector]  Weights are adjusted to reduce the prediction error, e.g., wij^(k+1) = wij^(k) + λ (yj − ŷj^(k)) xi, where λ is the learning rate
  • 53. 53 Activity of a Neuron at a Hidden/Output Layer  An n-dimensional input vector x is mapped into the variable y by means of a scalar product and a nonlinear function mapping  Hidden/output units receive inputs from the units in the previous layer. The activations generated at the feeder units are propagated along the weighted edges and summed; the bias associated with the unit is added to this weighted sum, and finally a nonlinear activation function f is applied to determine the output of the unit  [Figure: inputs x0, x1, …, xn with weight vector w = (w0, w1, …, wn) feed a weighted sum plus bias θ, followed by the activation function f, producing the output y]  Example: y = sign(Σ_{i=0..n} wi xi + bias)
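The computation at a single unit can be written in a couple of lines; this sketch uses tanh as the nonlinear activation purely for illustration (the slide's example uses sign):

    import math

    def neuron_output(x, w, bias, activation=math.tanh):
        """Weighted sum of the inputs plus the bias, passed through a nonlinear activation."""
        weighted_sum = sum(wi * xi for wi, xi in zip(w, x)) + bias
        return activation(weighted_sum)

    # Example: a unit with three inputs
    print(neuron_output([0.5, -1.0, 0.25], [0.4, 0.1, -0.7], bias=0.2))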
  • 54. 54 How a Multi-Layer Neural Network Works  The inputs to the network correspond to the attributes measured for each training tuple  Inputs are fed simultaneously into the units making up the input layer  They are weighted and fed simultaneously to a hidden layer  The number of hidden layers is arbitrary, although usually only one is used  The weighted outputs of the last hidden layer are input to the units making up the output layer, which emits the network's prediction  Each neuron in the hidden and output layers has its own bias, which is learned along with the edge weights during the training phase  The network is feed-forward: none of the weights cycles back to an input unit or to a hidden unit of a previous layer  From a statistical point of view, ANNs perform nonlinear regression or probabilistic classification: given enough hidden units and enough training samples, they can closely approximate any function
  • 55. 55 Defining a Network Topology for Classification / Regression  Decide the network topology: specify the # of units in the input layer, the # of hidden layers, the # of units in each hidden layer, and the # of units in the output layer  One input unit for each descriptive feature  One output unit for binary classification; for multi-class classification, the number of output units equals the number of classes  Experimentally select an appropriate number of hidden neurons depending on the problem complexity  Normalize the input values of each attribute in the training tuples to the [0.0, 1.0] range  Train the network with a learning algorithm such as backpropagation to adjust the edge weights  Once the network has been trained, if its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
  • 56. 56 Backpropagation  Iteratively process a set of training tuples & compare the network's prediction with the actual known target value  For each training tuple, the weights are modified to minimize the mean squared error between the network's prediction and the actual target value  Modifications are made in the “backwards” direction: from the output layer, through each hidden layer down to the first hidden layer, hence “backpropagation”  Steps  Initialize weights to small random numbers, associated with biases  Propagate the inputs forward (by applying activation function)  Backpropagate the error and update weights and biases  Terminating condition (when error is very small, etc.)
  • 57. 57 Gradient Descent
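The slide's figure is not recoverable here, but the idea behind backpropagation training is gradient descent on the squared error. A minimal sketch for a single linear unit (a deliberate simplification of the multi-layer case, with invented toy data):

    # Stochastic gradient descent with the delta rule: w_i <- w_i + eta * (y - y_hat) * x_i
    data = [([1.0, 2.0], 5.0), ([2.0, 1.0], 4.0), ([3.0, 3.0], 9.0)]   # (inputs, target)
    w, bias, eta = [0.0, 0.0], 0.0, 0.05

    for epoch in range(200):
        for x, y in data:
            y_hat = sum(wi * xi for wi, xi in zip(w, x)) + bias   # forward pass
            error = y - y_hat
            w = [wi + eta * error * xi for wi, xi in zip(w, x)]   # move weights downhill
            bias += eta * error
    print(w, bias)   # approaches weights that fit the toy data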
  • 59. 59 Strengths of ANN for Classification  Strength  High tolerance to noisy data  Ability to classify unknown instances accurately  Well-suited for continuous-valued inputs and outputs  Successful on real-world data, e.g., hand-written letters  Algorithms are inherently parallel, since the neurons in a layer work independently  Techniques have recently been developed for the extraction of rules from trained neural networks  The capability of ANNs is further extended through deep learning
  • 60. 60 Drawbacks of Neural Network as a Classifier  Weakness  Long training time  Require a number of parameters typically best determined empirically, e.g., the network topology or “structure.”  Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of “hidden units” in the network
  • 61. 61 Chapter 8. Classification: Basic Concepts  Classification: Basic Concepts  Decision Tree Induction  Bayes Classification Methods  Rule-Based Classification  Model Evaluation and Selection  Techniques to Improve Classification Accuracy: Ensemble Methods  Classification by Backpropagation  Lazy Learners (K-Nearest Neighbors Classification)
  • 62. 62 Lazy vs. Eager Learning  Lazy vs. eager learning  Lazy learning (e.g., instance-based learning): Simply stores training data (or only minor processing) and waits until it is given a test tuple  Eager learning (the above discussed methods): Given a set of training tuples, constructs a classification model before accepting new (e.g., test) data to classify  Lazy: less time in training but more time in predicting  Accuracy  Lazy method effectively uses a richer hypothesis space since it uses many local linear functions to form an implicit global approximation to the target function  Eager: must commit to a single hypothesis that covers the entire instance space
  • 63. 63 The k-Nearest Neighbor Algorithm  All instances are preprocessed and mapped onto points in n-D space  The nearest neighbors to a query point xq are identified from the training instances sorted in ascending order of their Euclidean distance to xq, dist(Xi, xq)  The target function may be discrete-valued or real-valued, for classification or regression tasks respectively  For classification tasks, k-NN returns the most common value (mode) among the labels of the k training examples nearest to xq  Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples  [Figure: positive and negative training points in the plane around the query point xq, and the Voronoi cells induced by 1-NN]
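A minimal k-NN classifier following this description (the toy points and labels are made up):

    import math
    from collections import Counter

    def knn_classify(query, data, k=3):
        """data: list of (point, label); return the majority label of the k nearest points."""
        neighbors = sorted(data, key=lambda t: math.dist(query, t[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]

    train = [((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.3), "+"),
             ((4.0, 4.2), "-"), ((3.8, 4.0), "-")]
    print(knn_classify((1.1, 1.0), train, k=3))   # "+"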
  • 64. 64 k-NN Algorithm for Regression  k-NN for real-valued prediction for a given unknown tuple xq  Returns the mean of the yi values of the k nearest neighbors  Distance-weighted nearest neighbor algorithm  Weigh the contribution of each of the k neighbors according to its distance to the query xq, e.g., wi = 1 / d(xq, xi)²  Gives greater weight to closer neighbors  Robust to noisy data through averaging over the k nearest neighbors  Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes  To overcome this, eliminate the least relevant attributes
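A sketch of distance-weighted k-NN regression with the wi = 1/d(xq, xi)² weighting above; the toy data is invented, and a small epsilon is added to avoid division by zero when the query coincides with a training point (an implementation detail, not part of the slide):

    import math

    def knn_regress(query, data, k=3, eps=1e-9):
        """data: list of (point, y); distance-weighted average of the k nearest y values."""
        neighbors = sorted(data, key=lambda t: math.dist(query, t[0]))[:k]
        weights = [1.0 / (math.dist(query, p) ** 2 + eps) for p, _ in neighbors]
        return sum(wi * y for wi, (_, y) in zip(weights, neighbors)) / sum(weights)

    train = [((1.0,), 2.0), ((2.0,), 4.1), ((3.0,), 6.0), ((10.0,), 20.0)]
    print(knn_regress((2.5,), train, k=3))   # close to 5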