MODULE 4
Rule-based classification: 1R. Neural networks: backpropagation. Support vector machines. Lazy learners: k-nearest-neighbor classifier. Accuracy and error measures, evaluation. Prediction: linear regression and non-linear regression.
Rule-Based Classification: 1R
• Rules are a good way of representing information or bits of
knowledge.
• A rule-based classifier uses a set of IF-THEN rules for
classification.
• An IF-THEN rule is an expression of the form
IF condition THEN conclusion.
An example
• R1: IF age = youth AND student = yes THEN buys computer =
yes
• The “IF” part (or left side) of a rule is known as
the rule antecedent or precondition.
• The “THEN” part (or right side) is the rule
consequent.
• In the rule antecedent, the condition consists of
one or more attribute tests (e.g., age = youth and
student = yes) that are logically ANDed.
• The rule’s consequent contains a class prediction
(in this case, we are predicting whether a
customer will buy a computer).
• R1 can also be written as
R1: (age = youth) ∧ (student = yes) ⇒ (buys computer = yes).
• If the condition (i.e., all the attribute tests) in a
rule antecedent holds true for a given tuple,
we say that the rule antecedent is satisfied (or
simply, that the rule is satisfied) and that the
rule covers the tuple.
The following rules determine the classification of grades:
• If 90 ≤ grade, then class=A
• If 80 ≤ grade and grade < 90, then class = B
• If 70 ≤ grade and grade < 80, then class = C
• If 60 ≤ grade and grade < 70, then class =D
• If grade < 60, then class = F
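As an illustration, these grade rules translate directly into an IF-THEN classifier. The Python sketch below is illustrative only (the function name and test value are not part of the original slides):

```python
def classify_grade(grade):
    """Apply the IF-THEN grade rules in order and return the class label."""
    if grade >= 90:
        return "A"
    elif grade >= 80:   # 80 <= grade < 90
        return "B"
    elif grade >= 70:   # 70 <= grade < 80
        return "C"
    elif grade >= 60:   # 60 <= grade < 70
        return "D"
    else:               # grade < 60
        return "F"

print(classify_grade(85))  # prints "B"
```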
• A classification rule, r = (a, c), consists of the antecedent (the if part), a, and the consequent (the then part), c.
• These rules relate directly to the corresponding decision tree (DT) that could be created. A DT can always be used to generate rules.
There are differences between rules and trees :
• The tree has an implied order in which the splitting is
performed. Rules have no order.
• A tree is created based on looking at all classes. When
generating rules, only one class must be examined at a
time.
• If a rule is satisfied by X, the rule is said to be triggered.
• If more than one rule is triggered, we need a conflict resolution strategy to decide which rule gets to fire and assign its class prediction to X (if no rule is satisfied by X, a default rule can be used).
• Two common strategies are size ordering and rule ordering.
• The size ordering scheme assigns the highest priority to the triggering rule with the most attribute tests; that rule is fired.
• The rule ordering scheme prioritizes the rules beforehand.
The ordering may be class based or rule-based.
• With class-based ordering, the classes are sorted in order of
decreasing “importance”.
• With rule-based ordering, the rules are organized into one
long priority list, according to some measure of rule quality
such as accuracy, coverage, or size
Rule Extraction from a Decision Tree
• To extract rules from a decision tree, one rule is
created for each path from the root
to a leaf node.
• Each splitting criterion along a given path is
logically ANDed to form the
rule antecedent (“IF” part).
• The leaf node holds the class prediction, forming
the rule consequent (“THEN” part).
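A minimal Python sketch of this extraction, assuming the decision tree is stored as nested dictionaries; the tree structure, attribute names, and labels below are hypothetical examples, not the tree from the slides:

```python
# Hypothetical decision tree: an internal node maps an attribute to
# {value: subtree}, and a leaf is simply a class-label string.
tree = {
    "age": {
        "youth": {"student": {"yes": "buys_computer = yes",
                              "no": "buys_computer = no"}},
        "middle_aged": "buys_computer = yes",
        "senior": {"credit_rating": {"fair": "buys_computer = no",
                                     "excellent": "buys_computer = yes"}},
    }
}

def extract_rules(node, conditions=()):
    """One rule per root-to-leaf path: the ANDed splits along the path form
    the antecedent, and the leaf label forms the consequent."""
    if isinstance(node, str):                      # leaf node -> emit a rule
        antecedent = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {antecedent} THEN {node}"]
    rules = []
    (attribute, branches), = node.items()          # the single splitting attribute
    for value, subtree in branches.items():
        rules += extract_rules(subtree, conditions + ((attribute, value),))
    return rules

for rule in extract_rules(tree):
    print(rule)
```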
Rule Induction Using a Sequential Covering Algorithm
• These techniques are sometimes called covering algorithms because they attempt to generate rules that exactly cover a specific class.
• Tree algorithms work in a top down divide and
conquer approach, but this need not be the
case for covering algorithms.
• Usually the "best" attribute-value pair is chosen, as opposed to the best attribute, as in the tree-based algorithms.
• Suppose that we wished to generate a rule to
classify persons as tall. The basic format for
the rule is then
If ? then class = tall
• The objective for the covering algorithms is to
replace the "?" in this statement with
predicates that can be used to obtain the
"best" probability of being tall.
1R Classification
• One simple approach is called 1R because it
generates a simple set of rules .
• The basic idea is to choose the best attribute
to perform the classification based on the
training data.
• "Best" is defined here by counting the number
of errors.
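A minimal 1R sketch in Python along these lines; the toy data, attribute names, and function interface are illustrative, and only categorical attributes are handled:

```python
from collections import Counter, defaultdict

def one_r(rows, attributes, target):
    """For each attribute, build one rule per value (predict the majority
    class for that value), then keep the attribute with the fewest errors."""
    best = None
    for attr in attributes:
        by_value = defaultdict(Counter)
        for row in rows:
            by_value[row[attr]][row[target]] += 1
        rules, errors = {}, 0
        for value, counts in by_value.items():
            label, correct = counts.most_common(1)[0]
            rules[value] = label                    # value -> majority class
            errors += sum(counts.values()) - correct
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best  # (chosen attribute, value -> class rules, error count)

data = [{"outlook": "sunny", "play": "no"},
        {"outlook": "sunny", "play": "no"},
        {"outlook": "overcast", "play": "yes"},
        {"outlook": "rain", "play": "yes"}]
print(one_r(data, ["outlook"], "play"))
```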
• Algorithm: Backpropagation. Neural network
learning for classification or numeric
prediction, using the backpropagation
algorithm.
• Input:
• D, a data set consisting of the training tuples
and their associated target values;
• l, the learning rate;
• network, a multilayer feed-forward network.
• Output: A trained neural network.
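A compact NumPy sketch of the backpropagation procedure for a one-hidden-layer feed-forward network with sigmoid units. The layer sizes, learning rate, and XOR-style toy data are illustrative choices, not part of the original algorithm listing:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # training tuples
y = np.array([[0], [1], [1], [0]], dtype=float)              # target values
l = 0.5                                                      # learning rate

# Small random initial weights and zero biases for hidden and output layers.
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

for epoch in range(5000):
    # Forward pass: propagate the inputs through the network.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error from the output layer backward.
    err_out = (y - out) * out * (1 - out)          # delta at output units
    err_hid = (err_out @ W2.T) * h * (1 - h)       # delta at hidden units
    # Update weights and biases in the direction that reduces the error.
    W2 += l * h.T @ err_out;  b2 += l * err_out.sum(axis=0)
    W1 += l * X.T @ err_hid;  b1 += l * err_hid.sum(axis=0)

print(out.round(2).ravel())  # predictions should approach [0, 1, 1, 0]
```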
Lazy Learners
• The classification methods discussed so far (tree induction, Bayesian classification, rule-based classification, classification by backpropagation, support vector machines, and classification based on association rule mining) are examples of eager learners.
• Eager learners, when given a set of training tuples, will construct a
generalization (i.e., classification) model before receiving new (e.g., test)
tuples to classify.
• In the lazy approach, the learner instead waits until the last minute before doing any model construction to classify a given test tuple.
• That is, when given a training tuple, a lazy learner simply stores it (or does
only a little minor processing) and waits until it is given a test tuple.
• Only when it sees the test tuple does it perform generalization to classify
the tuple based on its similarity to the stored training tuples.
• Lazy learners do less work when a training tuple is presented and more work when making a classification or numeric prediction.
• Lazy learners can be computationally expensive.
• They require efficient storage techniques and are well
suited to implementation on parallel hardware.
• They offer little explanation or insight into the data’s
structure.
• Lazy learners, however, naturally support incremental
learning.
k-Nearest-Neighbor Classifiers
• Nearest-neighbor classifiers are based on learning by analogy,
that is, by comparing a given test tuple with training tuples
that are similar to it.
• When given an unknown tuple, a k-nearest-neighbor classifier
searches the pattern space for the k training tuples that are
closest to the unknown tuple. These k training tuples are the k
“nearest neighbors” of the unknown tuple.
• “Closeness” is defined in terms of a distance metric, such as
Euclidean distance. The Euclidean distance between two
points or tuples, say, X1 = (x11, x12,..., x1n) and X2 = (x21,
x22,..., x2n), is
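Written out, this distance is:

\[
\mathrm{dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}
\]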
• The unknown tuple is assigned the most common class among
its k-nearest neighbors. When k = 1, the unknown tuple is
assigned the class of the training tuple that is closest to it in
pattern space.
• Nearest-neighbor classifiers can also be used for numeric
prediction, that is, to return a real-valued prediction for a given
unknown tuple.
• In this case, the classifier returns the average value of the real-
valued labels associated with the k-nearest neighbors of the
unknown tuple
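A minimal k-nearest-neighbor sketch in Python covering both uses (majority vote for classification, averaging for numeric prediction); the function names and sample data are illustrative:

```python
import math
from collections import Counter

def euclidean(x1, x2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

def k_nearest(train, x, k):
    """Return the k (tuple, label) training pairs closest to x."""
    return sorted(train, key=lambda tl: euclidean(tl[0], x))[:k]

def knn_classify(train, x, k=3):
    labels = [label for _, label in k_nearest(train, x, k)]
    return Counter(labels).most_common(1)[0][0]      # most common class

def knn_predict(train, x, k=3):
    values = [value for _, value in k_nearest(train, x, k)]
    return sum(values) / len(values)                 # average of real values

train_cls = [((1.0, 1.0), "yes"), ((1.2, 0.9), "yes"), ((5.0, 5.1), "no")]
print(knn_classify(train_cls, (1.1, 1.0), k=3))      # prints "yes"
```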
• For nominal attributes, a simple method is to
compare the corresponding value of the
attribute in tuple X1 with that in tuple X2.
• If the two are identical (e.g., tuples X1 and X2
both have the color blue), then the difference
between the two is taken as 0.
• If the two are different (e.g., tuple X1 is blue
but tuple X2 is red), then the difference is
considered to be 1.
• In general, if the value of a given attribute A is
missing in tuple X1 and/or in tuple X2, assume
the maximum possible difference.
• For nominal attributes, take the difference value
to be 1 if either one or both of the corresponding
values of A are missing.
• If A is numeric and missing from both tuples X1
and X2, then the difference is also taken to be 1.
• If only one value is missing and the other is present and normalized, then we can take the difference to be either |1 − v′| or |0 − v′| (i.e., 1 − v′ or v′), whichever is greater, where v′ is the value that is present.
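A sketch of a per-attribute difference function that follows these conventions, assuming missing values are encoded as None and numeric attributes are already normalized to [0, 1]; the function name is illustrative:

```python
def attribute_difference(v1, v2, nominal):
    """Difference between two attribute values under the kNN conventions above.
    Missing values are None; numeric values are assumed normalized to [0, 1]."""
    if nominal:
        if v1 is None or v2 is None:        # one or both nominal values missing
            return 1.0
        return 0.0 if v1 == v2 else 1.0     # identical -> 0, different -> 1
    if v1 is None and v2 is None:           # numeric, both missing
        return 1.0
    if v1 is None or v2 is None:            # numeric, exactly one missing
        v = v1 if v1 is not None else v2
        return max(abs(1 - v), abs(0 - v))  # i.e., max(1 - v, v)
    return abs(v1 - v2)                     # both present: plain difference

print(attribute_difference(0.2, None, nominal=False))  # prints 0.8
```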
• The k value that gives the minimum error rate
may be selected.
• Nearest-neighbor classifiers can use other distance metrics, such as the Manhattan distance, in place of the Euclidean distance to improve accuracy.
• Other techniques to speed up classification
time include the use of partial distance
calculations and editing the stored tuples.
Prediction
• The prediction of continuous values can be modeled by statistical techniques
of regression.
• Linear and multiple regression
• Linear regression is the simplest form of regression.
• In linear regression, data are modeled using a straight
line.
• Linear regression models a random variable, Y (called the response variable), as a linear function of another random variable, X (called the predictor variable), i.e.,
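In symbols, using w0 and w1 for the regression coefficients (the Y-intercept and the slope):

\[
Y = w_0 + w_1 X
\]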
• These coefficients can be solved for by the method of least squares.
• Given s samples or data points of the form (x1, y1), (x2, y2), ..., (xs, ys), the regression coefficients can be estimated using this method.
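Using this notation, with x̄ and ȳ denoting the means of the x and y values, the standard least-squares estimates are:

\[
w_1 = \frac{\sum_{i=1}^{s} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{s} (x_i - \bar{x})^2},
\qquad
w_0 = \bar{y} - w_1 \bar{x}
\]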
Thus, the equation of the least squares line is estimated by Y = 21.7 + 3.7X
• Multiple regression is an extension of linear
regression involving more than one predictor
variable
• Nonlinear regression
• The given response variable and predictor variables may have a relationship that can be modeled by a polynomial function.
• Polynomial regression can be modeled by adding
polynomial terms to the basic linear model.
• By applying transformations to the variables, we
can convert the nonlinear model into a linear one
that can then be solved by the method of least
squares.
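For example, a cubic polynomial model

\[
y = w_0 + w_1 x + w_2 x^2 + w_3 x^3
\]

becomes linear under the substitutions x1 = x, x2 = x^2, x3 = x^3:

\[
y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3,
\]

which can then be solved by the method of least squares.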
Accuracy and Error Measures Evaluation: Classifier Accuracy Measures
• Accuracy is measured on a test set consisting of
class-labeled tuples that were not used to train the
model.
• The accuracy of a classifier on a given test set is the
percentage of test set tuples that are correctly
classified by the classifier.
• This is also referred to as the overall recognition rate
of the classifier, that is, it reflects how well the
classifier recognizes tuples of the various classes.
• The error rate or misclassification rate of a classifier, M, is simply 1 − Acc(M), where Acc(M) is the accuracy of M.
• The confusion matrix is a useful tool for
accuracy measurement.
• Given m classes, a confusion matrix is a table
of at least size m by m.
• An entry, CMi,j, in the first m rows and m columns indicates the number of tuples of class i that were labeled by the classifier as class j.
• For a classifier to have good accuracy, ideally most of the tuples would be represented along the diagonal of the confusion matrix, from entry CM1,1 to entry CMm,m, with the rest of the entries being close to zero.
• The table may have additional rows or columns to
provide totals or recognition rates per class.
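A small Python sketch that builds such a confusion matrix from actual and predicted labels and reads the overall accuracy off its diagonal; the labels and helper name are illustrative:

```python
from collections import Counter

def confusion_matrix(actual, predicted, classes):
    """cm[(i, j)] = number of tuples of class i labeled by the classifier as j."""
    counts = Counter(zip(actual, predicted))
    return {(i, j): counts[(i, j)] for i in classes for j in classes}

actual    = ["yes", "yes", "no", "no", "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
classes   = ["yes", "no"]

cm = confusion_matrix(actual, predicted, classes)
accuracy = sum(cm[(c, c)] for c in classes) / len(actual)  # diagonal / total
print(cm, accuracy)  # accuracy = 0.6
```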
Given two classes
• positive tuples (tuples of the main class of interest, e.g.,
buys computer = yes)
• negative tuples (e.g., buys computer = no)
• True positives refer to the positive tuples that were
correctly labeled by the classifier,
• True negatives are the negative tuples that were correctly
labeled by the classifier.
• False positives are the negative tuples that were incorrectly labeled (e.g., tuples of class buys computer = no for which the classifier predicted buys computer = yes).
• False negatives are the positive tuples that were incorrectly labeled (e.g., tuples of class buys computer = yes for which the classifier predicted buys computer = no).
• The sensitivity and specificity measures can
also be used as accuracy measures.
• Sensitivity is also referred to as the true
positive (recognition) rate (that is, the
proportion of positive tuples that are correctly
identified)
• Specificity is the true negative rate (that is,
the proportion of negative tuples that are
correctly identified).
• Precision is used to assess the percentage of tuples labeled as "yes" that actually are "yes" tuples.
Here, t_pos is the number of true positives, pos is the number of positive tuples, t_neg is the number of true negatives, neg is the number of negative tuples, and f_pos is the number of false positives.
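With this notation, these measures take their standard forms:

\[
\text{sensitivity} = \frac{t_{pos}}{pos},
\qquad
\text{specificity} = \frac{t_{neg}}{neg},
\qquad
\text{precision} = \frac{t_{pos}}{t_{pos} + f_{pos}},
\]

and accuracy can be expressed in terms of sensitivity and specificity as

\[
\text{accuracy} = \text{sensitivity} \cdot \frac{pos}{pos + neg} + \text{specificity} \cdot \frac{neg}{pos + neg}.
\]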
Predictor Error Measures
The mean squared error exaggerates the presence of outliers, while the mean absolute
error does not.
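For a test set of d tuples with actual values y_i and predicted values y_i', these two loss-based measures are defined as:

\[
\text{mean absolute error} = \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|,
\qquad
\text{mean squared error} = \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2.
\]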
Evaluating the Accuracy of a Classifier or Predictor
• Holdout, random subsampling, cross-validation, and the bootstrap are common techniques for assessing accuracy based on randomly sampled partitions of the given data.
• The use of such techniques to estimate
accuracy increases the overall computation
time, yet is useful for model selection.
The holdout method
• In this method, the given data are randomly partitioned into
two independent sets, a training set and a test set.
• Typically, two-thirds of the data are allocated to the training
set, and the remaining one-third is allocated to the test set.
• The training set is used to derive the model, whose accuracy
is estimated with the test set
Random subsampling
• Random subsampling is a variation of the
holdout method in which the holdout method
is repeated k times.
• The overall accuracy estimate is taken as the
average of the accuracies obtained from each
iteration.
• (For prediction, we can take the average of the predictor error rates.)
Cross-validation
• In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive subsets or "folds," D1, D2, ..., Dk, each of approximately equal size.
• Training and testing is performed k times.
• In iteration i, partition Di is reserved as the test
set, and the remaining partitions are collectively
used to train the model.
• That is, in the first iteration, subsets D2, ..., Dk collectively serve as the training set in order to obtain a first model, which is tested on D1;
• the second iteration is trained on subsets D1, D3, ..., Dk and tested on D2; and so on.
• Each sample is used the same number of times for training and once for testing.
• For classification, the accuracy estimate is the
overall number of correct classifications from
the k iterations, divided by the total number of
tuples in the initial data.
• For prediction, the error estimate can be
computed as the total loss from the k
iterations, divided by the total number of
initial tuples.
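A minimal k-fold cross-validation sketch in Python; the train_fn/classify_fn interface and the (attributes, label) data format are assumptions for illustration:

```python
import random

def k_fold_cross_validation(data, k, train_fn, classify_fn, seed=0):
    """Estimate accuracy: each tuple is tested exactly once and used for
    training in the remaining k - 1 iterations."""
    tuples = list(data)
    random.Random(seed).shuffle(tuples)
    folds = [tuples[i::k] for i in range(k)]          # k roughly equal folds
    correct = 0
    for i in range(k):
        test = folds[i]
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        model = train_fn(train)                       # train on the other folds
        correct += sum(classify_fn(model, x) == label for x, label in test)
    return correct / len(tuples)                      # overall accuracy estimate
```

Here each data element is an (attributes, label) pair, train_fn builds a model from the k - 1 training folds, and classify_fn applies that model to a single test tuple; the returned value is the accuracy estimate described above.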
• Leave-one-out is a special case of k-fold cross-
validation where k is set to the number of initial
tuples.
• That is, only one sample is “left out” at a time for
the test set.
• In stratified cross-validation, the folds are stratified
so that the class distribution of the tuples in each
fold is approximately the same as that in the initial
data.
• In general, stratified 10-fold cross-validation is
recommended for estimating accuracy (even if
computation power allows using more folds) due
to its relatively low bias and variance.
Bootstrap
• The bootstrap method samples the given
training tuples uniformly with replacement.
• That is, each time a tuple is selected, it is equally likely to be selected again and re-added to the training set.
• In sampling with replacement, the machine is allowed to select the same tuple more than once.
• A commonly used one is the .632 bootstrap,
which works as follows. Suppose we are given a
data set of d tuples.
• The data set is sampled d times, with
replacement, resulting in a bootstrap sample or
training set of d samples.
• Some of the original data tuples will occur more than once in this sample.
• The data tuples that did not make it into the training set end up forming the test set.
• On average, 63.2% of the original data tuples will end up in the bootstrap sample, and the remaining 36.8% will form the test set (hence the name, .632 bootstrap).
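The 63.2% figure comes from a short probability argument: in d draws with replacement, a particular tuple is missed on each draw with probability 1 − 1/d, so

\[
P(\text{tuple never selected}) = \left(1 - \frac{1}{d}\right)^{d} \approx e^{-1} \approx 0.368
\]

for reasonably large d, leaving about 36.8% of the tuples for the test set and about 63.2% in the bootstrap training sample.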