MODULE 4
Rule-based classification: 1R. Neural networks: backpropagation. Support vector machines. Lazy learners: k-nearest-neighbor classifier. Accuracy and error measures, evaluation. Prediction: linear regression and non-linear regression.
Rule-Based Classification: 1R
• Rules are a good way of representing information or bits of
knowledge.
• A rule-based classifier uses a set of IF-THEN rules for
classification.
• An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example
• R1: IF age = youth AND student = yes THEN buys computer =
yes
• The “IF” part (or left side) of a rule is known as
the rule antecedent or precondition.
• The “THEN” part (or right side) is the rule
consequent.
• In the rule antecedent, the condition consists of
one or more attribute tests (e.g., age = youth and
student = yes) that are logically ANDed.
• The rule’s consequent contains a class prediction
(in this case, we are predicting whether a
customer will buy a computer).
• R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
• If the condition (i.e., all the attribute tests) in a
rule antecedent holds true for a given tuple,
we say that the rule antecedent is satisfied (or
simply, that the rule is satisfied) and that the
rule covers the tuple.
The following rules determine the classification of grades:
• If 90 ≤ grade, then class=A
• If 80 ≤ grade and grade < 90, then class = B
• If 70 ≤ grade and grade < 80, then class = C
• If 60 ≤ grade and grade < 70, then class =D
• If grade < 60, then class = F
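As an illustration, these grade rules translate directly into an IF-THEN classifier. The Python sketch below is illustrative only (the function name and test value are not part of the original slides):

```python
def classify_grade(grade):
    """Apply the IF-THEN grade rules in order and return the class label."""
    if grade >= 90:
        return "A"
    elif grade >= 80:   # 80 <= grade < 90
        return "B"
    elif grade >= 70:   # 70 <= grade < 80
        return "C"
    elif grade >= 60:   # 60 <= grade < 70
        return "D"
    else:               # grade < 60
        return "F"

print(classify_grade(85))  # prints "B"
```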
• A classification rule, r = (a, c), consists of the antecedent (the if part), a, and the consequent (the then part), c.
• These rules relate directly to the corresponding decision tree (DT) that could be created. A DT can always be used to generate rules.
There are differences between rules and trees :
• The tree has an implied order in which the splitting is
performed. Rules have no order.
• A tree is created based on looking at all classes. When
generating rules, only one class must be examined at a
time.
• If a rule is satisfied by X, the rule is said to be triggered.
• If more than one rule is triggered, we need a conflict resolution strategy to decide which rule gets to fire and assign its class prediction to X (if no rule is satisfied by X, a default rule can be used).
• Two common strategies are size ordering and rule ordering.
• The size ordering scheme assigns the highest priority to the triggering rule with the most attribute tests; that rule is fired.
• The rule ordering scheme prioritizes the rules beforehand.
The ordering may be class based or rule-based.
• With class-based ordering, the classes are sorted in order of
decreasing “importance”.
• With rule-based ordering, the rules are organized into one
long priority list, according to some measure of rule quality
such as accuracy, coverage, or size
Rule Extraction from a Decision Tree
• To extract rules from a decision tree, one rule is
created for each path from the root
to a leaf node.
• Each splitting criterion along a given path is
logically ANDed to form the
rule antecedent (“IF” part).
• The leaf node holds the class prediction, forming
the rule consequent (“THEN” part).
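A minimal Python sketch of this extraction, assuming the decision tree is stored as nested dictionaries; the tree structure, attribute names, and labels below are hypothetical examples, not the tree from the slides:

```python
# Hypothetical decision tree: an internal node maps an attribute to
# {value: subtree}, and a leaf is simply a class-label string.
tree = {
    "age": {
        "youth": {"student": {"yes": "buys_computer = yes",
                              "no": "buys_computer = no"}},
        "middle_aged": "buys_computer = yes",
        "senior": {"credit_rating": {"fair": "buys_computer = no",
                                     "excellent": "buys_computer = yes"}},
    }
}

def extract_rules(node, conditions=()):
    """One rule per root-to-leaf path: the ANDed splits along the path form
    the antecedent, and the leaf label forms the consequent."""
    if isinstance(node, str):                      # leaf node -> emit a rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {node}"]
    rules = []
    (attribute, branches), = node.items()          # the single splitting attribute
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
```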
Rule Induction Using a Sequential Covering Algorithm
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class.
• Tree algorithms work in a top down divide and
conquer approach, but this need not be the
case for covering algorithms.
• Usually the "best" attribute-value pair is chosen, as opposed to the best attribute, as in the tree-based algorithms.
• Suppose that we wished to generate a rule to
classify persons as tall. The basic format for
the rule is then
If ? then class = tall
• The objective for the covering algorithms is to
replace the "?" in this statement with
predicates that can be used to obtain the
"best" probability of being tall.
1R Classification
• One simple approach is called 1R because it
generates a simple set of rules .
• The basic idea is to choose the best attribute
to perform the classification based on the
training data.
• "Best" is defined here by counting the number
of errors.
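A minimal 1R sketch in Python along these lines; the toy data, attribute names, and function interface are illustrative, and only categorical attributes are handled:

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """For each attribute, build one rule per value (predict the majority
    class for that value), then keep the attribute with the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rules, errors = {}, 0
        for value, counts in by_value.items():
            label, correct = counts.most_common(1)[0]
            rules[value] = label                    # value -> majority class
            errors += sum(counts.values()) - correct
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (chosen attribute, value -> class rules, error count)

data = [{"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "no"},
        {"outlook": "overcast", "play": "yes"},
        {"outlook": "rain", "play": "yes"}]
print(one_r(data, ["outlook"], "play"))
```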
• Algorithm: Backpropagation. Neural network
learning for classification or numeric
prediction, using the backpropagation
algorithm.
• Input:
• D, a data set consisting of the training tuples
and their associated target values;
• l, the learning rate;
• network, a multilayer feed-forward network.
• Output: A trained neural network.
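A compact NumPy sketch of the backpropagation procedure for a one-hidden-layer feed-forward network with sigmoid units. The layer sizes, learning rate, and XOR-style toy data are illustrative choices, not part of the original algorithm listing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)              # target values
l = 0.5                                                      # learning rate

# Small random initial weights and zero biases for hidden and output layers.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

for epoch in range(5000):
    # Forward pass: propagate the inputs through the network.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer backward.
    err_out = (y - out) * out * (1 - out)          # delta at output units
    err_hid = (err_out @ W2.T) * h * (1 - h)       # delta at hidden units
    # Update weights and biases in the direction that reduces the error.
    W2 += l * h.T @ err_out;  b2 += l * err_out.sum(axis=0)
    W1 += l * X.T @ err_hid;  b1 += l * err_hid.sum(axis=0)

print(out.round(2).ravel())  # predictions should approach [0, 1, 1, 0]
```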
Lazy Learners
• The classification methods discussed so far (tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support vector machines, and classification based on association rule mining) are examples of eager learners.
• Eager learners, when given a set of training tuples, will construct a
generalization (i.e., classification) model before receiving new (e.g., test)
tuples to classify.
• In the lazy approach, the learner instead waits until the last minute before doing any model construction to classify a given test tuple.
• That is, when given a training tuple, a lazy learner simply stores it (or does
only a little minor processing) and waits until it is given a test tuple.
• Only when it sees the test tuple does it perform generalization to classify
the tuple based on its similarity to the stored training tuples.
• Lazy learners do less work when a training tuple is presented and more work when making a classification or numeric prediction.
• Lazy learners can be computationally expensive.
• They require efficient storage techniques and are well
suited to implementation on parallel hardware.
• They offer little explanation or insight into the data’s
structure.
• Lazy learners, however, naturally support incremental
learning.
k-Nearest-Neighbor Classifiers
• Nearest-neighbor classifiers are based on learning by analogy,
that is, by comparing a given test tuple with training tuples
that are similar to it.
• When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k
“nearest neighbors” of the unknown tuple.
• “Closeness” is defined in terms of a distance metric, such as
Euclidean distance. The Euclidean distance between two
points or tuples, say, X1 = (x11, x12,..., x1n) and X2 = (x21,
x22,..., x2n), is
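Written out, this distance is:

\[
\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}
\]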
• The unknown tuple is assigned the most common class among
its k-nearest neighbors. When k = 1, the unknown tuple is
assigned the class of the training tuple that is closest to it in
pattern space.
• Nearest-neighbor classifiers can also be used for numeric
prediction, that is, to return a real-valued prediction for a given
unknown tuple.
• In this case, the classifier returns the average value of the real-
valued labels associated with the k-nearest neighbors of the
unknown tuple
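A minimal k-nearest-neighbor sketch in Python covering both uses (majority vote for classification, averaging for numeric prediction); the function names and sample data are illustrative:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def k_nearest(train, x, k):
    """Return the k (tuple, label) training pairs closest to x."""
    return sorted(train, key=lambda tl: euclidean(tl[0], x))[:k]

def knn_classify(train, x, k=3):
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]      # most common class

def knn_predict(train, x, k=3):
    values = [value for _, value in k_nearest(train, x, k)]
    return sum(values) / len(values)                 # average of real values

train_cls = [((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"), ((5.0, 5.1), "no")]
print(knn_classify(train_cls, (1.1, 1.0), k=3))      # prints "yes"
```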
• For nominal attributes, a simple method is to
compare the corresponding value of the
attribute in tuple X1 with that in tuple X2.
• If the two are identical (e.g., tuples X1 and X2
both have the color blue), then the difference
between the two is taken as 0.
• If the two are different (e.g., tuple X1 is blue
but tuple X2 is red), then the difference is
considered to be 1.
• In general, if the value of a given attribute A is
missing in tuple X1 and/or in tuple X2, assume
the maximum possible difference.
• For nominal attributes, take the difference value
to be 1 if either one or both of the corresponding
values of A are missing.
• If A is numeric and missing from both tuples X1
and X2, then the difference is also taken to be 1.
• If only one value is missing and the other is present and normalized, then we can take the difference to be either |1 − v′| or |0 − v′| (i.e., 1 − v′ or v′), whichever is greater, where v′ is the value that is present.
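A sketch of a per-attribute difference function that follows these conventions, assuming missing values are encoded as None and numeric attributes are already normalized to [0, 1]; the function name is illustrative:

```python
def attribute_difference(v1, v2, nominal):
    """Difference between two attribute values under the kNN conventions above.
    Missing values are None; numeric values are assumed normalized to [0, 1]."""
    if nominal:
        if v1 is None or v2 is None:        # one or both nominal values missing
            return 1.0
        return 0.0 if v1 == v2 else 1.0     # identical -> 0, different -> 1
    if v1 is None and v2 is None:           # numeric, both missing
        return 1.0
    if v1 is None or v2 is None:            # numeric, exactly one missing
        v = v1 if v1 is not None else v2
        return max(abs(1 - v), abs(0 - v))  # i.e., max(1 - v, v)
    return abs(v1 - v2)                     # both present: plain difference

print(attribute_difference(0.2, None, nominal=False))  # prints 0.8
```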
• The k value that gives the minimum error rate
may be selected.
• Nearest-neighbor classifiers can use other distance metrics, such as the Manhattan distance, in place of the Euclidean distance to improve accuracy.
• Other techniques to speed up classification
time include the use of partial distance
calculations and editing the stored tuples.
Prediction
• The prediction of continuous values can be modeled by statistical techniques
of regression.
• Linear and multiple regression
• Linear regression is the simplest form of regression.
• In linear regression, data are modeled using a straight
line.
• Linear regression models a random variable, Y (called the response variable), as a linear function of another random variable, X (called the predictor variable), i.e.,
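In symbols, using w0 and w1 for the regression coefficients (the Y-intercept and the slope):

\[
Y = w_0 + w_1 X
\]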
• These coefficients can be solved for by the method of least squares.
• Given s samples or data points of the form (x1, y1), (x2, y2), ..., (xs, ys), the regression coefficients can be estimated using this method.
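Using this notation, with x̄ and ȳ denoting the means of the x and y values, the standard least-squares estimates are:

\[
w_1 = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2},
\qquad
w_0 = \bar{y} - w_1 \bar{x}
\]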
Thus, the equation of the least squares line is estimated by Y = 21.7 + 3.7X
• Multiple regression is an extension of linear
regression involving more than one predictor
variable
• Nonlinear regression
• The given response variable and predictor variables may have a relationship that can be modeled by a polynomial function.
• Polynomial regression can be modeled by adding
polynomial terms to the basic linear model.
• By applying transformations to the variables, we
can convert the nonlinear model into a linear one
that can then be solved by the method of least
squares.
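For example, a cubic polynomial model

\[
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
\]

becomes linear under the substitutions x1 = x, x2 = x^2, x3 = x^3:

\[
y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3,
\]

which can then be solved by the method of least squares.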
Accuracy and Error Measures Evaluation: Classifier Accuracy Measures
• Accuracy is measured on a test set consisting of
class-labeled tuples that were not used to train the
model.
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly
classified by the classifier.
• This is also referred to as the overall recognition rate
of the classifier, that is, it reflects how well the
classifier recognizes tuples of the various classes.
• The error rate or misclassification rate of a classifier, M, is simply 1 − Acc(M), where Acc(M) is the accuracy of M.
• The confusion matrix is a useful tool for
accuracy measurement.
• Given m classes, a confusion matrix is a table
of at least size m by m.
• An entry, CMi,j, in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j.
• For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with the rest of the entries being close to zero.
• The table may have additional rows or columns to
provide totals or recognition rates per class.
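A small Python sketch that builds such a confusion matrix from actual and predicted labels and reads the overall accuracy off its diagonal; the labels and helper name are illustrative:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """cm[(i, j)] = number of tuples of class i labeled by the classifier as j."""
    counts = Counter(zip(actual, predicted))
    return {(i, j): counts[(i, j)] for i in classes for j in classes}

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
classes   = ["yes", "no"]

cm = confusion_matrix(actual, predicted, classes)
accuracy = sum(cm[(c, c)] for c in classes) / len(actual)  # diagonal / total
print(cm, accuracy)  # accuracy = 0.6
```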
Given two classes
• positive tuples (tuples of the main class of interest, e.g.,
buys computer = yes)
• negative tuples (e.g., buys computer = no)
• True positives refer to the positive tuples that were
correctly labeled by the classifier,
• True negatives are the negative tuples that were correctly
labeled by the classifier.
• False positives are the negative tuples that were incorrectly labeled (e.g., tuples of class buys computer = no for which the classifier predicted buys computer = yes).
• False negatives are the positive tuples that were incorrectly labeled (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer = no).
• The sensitivity and specificity measures can
also be used as accuracy measures.
• Sensitivity is also referred to as the true
positive (recognition) rate (that is, the
proportion of positive tuples that are correctly
identified)
• Specificity is the true negative rate (that is,
the proportion of negative tuples that are
correctly identified).
• Precision is used to assess the percentage of tuples labeled as "yes" that actually are "yes" tuples.
Here, t_pos is the number of true positives, pos is the number of positive tuples, t_neg is the number of true negatives, neg is the number of negative tuples, and f_pos is the number of false positives.
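With this notation, these measures take their standard forms:

\[
\text{sensitivity} = \frac{t_{pos}}{pos},
\qquad
\text{specificity} = \frac{t_{neg}}{neg},
\qquad
\text{precision} = \frac{t_{pos}}{t_{pos} + f_{pos}},
\]

and accuracy can be expressed in terms of sensitivity and specificity as

\[
\text{accuracy} = \text{sensitivity} \cdot \frac{pos}{pos + neg} + \text{specificity} \cdot \frac{neg}{pos + neg}.
\]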
Predictor Error Measures
The mean squared error exaggerates the presence of outliers, while the mean absolute
error does not.
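For a test set of d tuples with actual values y_i and predicted values y_i', these two loss-based measures are defined as:

\[
\text{mean absolute error} = \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|,
\qquad
\text{mean squared error} = \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2.
\]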
Evaluating the Accuracy of a Classifier or Predictor
• Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data.
• The use of such techniques to estimate
accuracy increases the overall computation
time, yet is useful for model selection.
The holdout method
• In this method, the given data are randomly partitioned into
two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model, whose accuracy
is estimated with the test set
Random subsampling
• Random subsampling is a variation of the
holdout method in which the holdout method
is repeated k times.
• The overall accuracy estimate is taken as the
average of the accuracies obtained from each
iteration.
• (For prediction, we can take the average of the predictor error rates.)
Cross-validation
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds," D1, D2, ..., Dk, each of approximately equal size.
• Training and testing is performed k times.
• In iteration i, partition Di is reserved as the test
set, and the remaining partitions are collectively
used to train the model.
• That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set in order to obtain a first model, which is tested on D1;
• the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.
• Each sample is used the same number of times for training and once for testing.
• For classification, the accuracy estimate is the
overall number of correct classifications from
the k iterations, divided by the total number of
tuples in the initial data.
• For prediction, the error estimate can be
computed as the total loss from the k
iterations, divided by the total number of
initial tuples.
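A minimal k-fold cross-validation sketch in Python; the train_fn/classify_fn interface and the (attributes, label) data format are assumptions for illustration:

```python
import random

def k_fold_cross_validation(data, k, train_fn, classify_fn, seed=0):
    """Estimate accuracy: each tuple is tested exactly once and used for
    training in the remaining k - 1 iterations."""
    tuples = list(data)
    random.Random(seed).shuffle(tuples)
    folds = [tuples[i::k] for i in range(k)]          # k roughly equal folds
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)                       # train on the other folds
        correct += sum(classify_fn(model, x) == label for x, label in test)
    return correct / len(tuples)                      # overall accuracy estimate
```

Here each data element is an (attributes, label) pair, train_fn builds a model from the k - 1 training folds, and classify_fn applies that model to a single test tuple; the returned value is the accuracy estimate described above.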
• Leave-one-out is a special case of k-fold cross-
validation where k is set to the number of initial
tuples.
• That is, only one sample is “left out” at a time for
the test set.
• In stratified cross-validation, the folds are stratified
so that the class distribution of the tuples in each
fold is approximately the same as that in the initial
data.
• In general, stratified 10-fold cross-validation is
recommended for estimating accuracy (even if
computation power allows using more folds) due
to its relatively low bias and variance.
Bootstrap
• The bootstrap method samples the given
training tuples uniformly with replacement.
• That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• In sampling with replacement, the machine is allowed to select the same tuple more than once.
• A commonly used one is the .632 bootstrap,
which works as follows. Suppose we are given a
data set of d tuples.
• The data set is sampled d times, with
replacement, resulting in a bootstrap sample or
training set of d samples.
• Some of the original data tuples will occur more than once in this sample.
• The data tuples that did not make it into the training set end up forming the test set.
• On average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
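The 63.2% figure comes from a short probability argument: in d draws with replacement, a particular tuple is missed on each draw with probability 1 − 1/d, so

\[
P(\text{tuple never selected}) = \left(1 - \frac{1}{d}\right)^{d} \approx e^{-1} \approx 0.368
\]

for reasonably large d, leaving about 36.8% of the tuples for the test set and about 63.2% in the bootstrap training sample.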