Chapter 6
Three Simple Classification Methods
The Naïve Rule
Naïve Bayes
k-Nearest Neighbor
Introduction
• Naïve Rule used to set up Naïve Bayes & k-NN
• Naïve Bayes & k-NN used in practice
• Data driven methods
• Naïve Bayes uses categorical predictors
• k-NN may be used with continuous predictors
• Illustrate with three examples:
– Example 1: Predicting Fraudulent Financial Reporting
• Uses categorical predictors
– Example 2: Predicting Delayed Flights
• Uses categorical predictors
– Example 3: Riding Mowers
• Uses continuous predictors
Predicting Fraudulent Financial Reporting
• To avoid being involved in any legal charges against it, an auditing firm wants to
detect whether a customer company submitted a fraudulent financial report.
• In this case each company (customer) is a record, and the response of
interest, Y = {fraudulent; truthful}, has two classes that a company can be
classified into: C1=fraudulent and C2=truthful.
• The only other piece of information that the auditing firm has on its
customers is whether or not legal charges were filed against them.
• The firm would like to use this information to improve its estimates of
fraud.
• Thus “X = legal charges” is a single (categorical) predictor with two
categories: whether legal charges were filed (1) or not (0).
• 1500 companies
• Partition into 1000 training set & 500 validation set
• Counts from training below
Predicting Delayed Flights
• The outcome of interest is whether the flight is delayed or not (delayed means
arriving more than 15 minutes late).
• Our data consist of all flights from the Washington, DC area into the New York City
area during January 2004.
• The percentage of delayed flights among these 2346 flights is 18%.
• Six predictors are listed below.
• Predict whether a new flight will be delayed (two classes):
• 1 = “Delayed” and 0 = “On Time”
The Naive Rule
• Classify everything as belonging to the most prevalent class
• To classify a record into one of m classes, the naive rule ignores all predictor
information (X1, X2, …, Xp) that we may have and assigns the record to the
majority class.
• In the auditing example the naive rule would classify all customers as being
truthful, because 90% of the investigated companies in the training set were
found to be truthful.
• Similarly, all flights would be classified as being on-time, because the
majority of the flights in the dataset (82%) were not delayed.
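The naive rule can be sketched in a few lines of Python. This is a minimal illustration, not code from the chapter: the function name `naive_rule` is hypothetical, and the label list is built from the figures on the slides (1000 training companies, 90% truthful).

```python
from collections import Counter

def naive_rule(training_labels):
    """Return the majority class; predictor values are ignored entirely."""
    return Counter(training_labels).most_common(1)[0][0]

# Auditing example: 90% of the 1000 training companies were truthful.
labels = ["truthful"] * 900 + ["fraudulent"] * 100

majority = naive_rule(labels)
# Every record is assigned to the majority class, so the error rate
# equals the share of the minority class.
error_rate = sum(label != majority for label in labels) / len(labels)
print(majority, error_rate)
```

Under these counts every company is classified as truthful, and the rule is wrong on exactly the 10% of companies that are fraudulent.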
Naive Bayes
• More sophisticated method than the naive rule.
• The main idea is to integrate the information given in a set of predictors
into the Naive Rule to obtain more accurate classifications.
• The probability of a record belonging to a certain class is now evaluated
– Based on the prevalence of that class
– And on the additional information available for that record in terms of its
predictor (X) values.
• Naive Bayes works only with predictors that are categorical.
– Numerical predictors must be binned and converted to categorical variables
before the Naive Bayes classifier can use them.
• The Naive Bayes method is very useful when very large datasets are
available.
– For instance, web-search companies like Google use naive Bayes classifiers to correct
misspellings that users type in. When you type a phrase that includes a misspelled word
into Google, it suggests a spelling correction for the phrase. The suggestions are based
not only on the frequencies of similarly spelled words typed by millions of other users,
but also on the other words in the phrase.
Conditional Probabilities
• Classification Task
– Estimate the probability of membership in each class given
a certain set of predictor variables
• This type of probability is called “conditional
probability”
• A conditional probability of event A given event B
(denoted by P(A|B)) represents the chances of event
A occurring only under the scenario that event B
occurs.
• In the auditing example we are interested in
– P(fraudulent financial report | legal charges)
• To classify a record, we compute its chance of belonging to
each of the classes by computing P(Ci | X1, …, Xp) for each class
i. We then assign the record to the class that has the
highest probability.
• Since conditioning on an event means that we have
additional information (e.g., we know that legal charges
were filed against the company), uncertainty is reduced.
• In the auditing example, the predictor categories are the column headings of the
table of training counts
– Each column sum is the sample size used to compute the probabilities for that category
– P(fraudulent financial report | legal charges)
• 50/230
– P(fraudulent financial report | no legal charges)
• 50/770
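The conditional probabilities above are simple within-column proportions. The sketch below recomputes them in Python. The two counts of 50 and the column total of 770 are taken from the slide; the column total for companies with legal charges is an assumption here, derived as 1000 training records minus 770. The names `fraud_counts`, `column_totals` and `p_fraud_given` are hypothetical.

```python
# Training counts for the auditing example.
# Assumption: the "legal charges" column total is 1000 - 770, since the
# training set has 1000 records and 770 of them had no legal charges.
fraud_counts = {"legal charges": 50, "no legal charges": 50}
column_totals = {"legal charges": 1000 - 770, "no legal charges": 770}

def p_fraud_given(x):
    """Estimate P(fraudulent | X = x) as a proportion within column x."""
    return fraud_counts[x] / column_totals[x]

# Unconditional prevalence of fraud, which is all the naive rule uses.
p_fraud = sum(fraud_counts.values()) / sum(column_totals.values())

for x in column_totals:
    print(f"P(fraudulent | {x}) = {p_fraud_given(x):.3f}")
print(f"P(fraudulent) = {p_fraud:.3f}")
```

With these counts, knowing that legal charges were filed raises the estimated probability of fraud from 10% overall to about 22%, while the absence of charges lowers it to about 6%.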
A Practical Difficulty
• With N predictors and M classes, the training set
may need to be very large.
• To “fill in” the M × N table so that we can
compute the conditional probabilities, we would
need a large number of cases to avoid entries
of zero (cells with no training cases).
• Apples & Oranges Example
A Solution: Naive Bayes
• A solution that has been widely used is based on making the
simplifying assumption of predictor independence.
• If it is reasonable to assume that the predictors are all mutually
independent within each class, the expression simplifies and becomes useful in practice.
• Independence of the predictors within each class gives us the
following simplification
– It follows from the product rule for probabilities of independent events
(the probability of occurrence of multiple events is the product of the
probabilities of the individual event occurrences):
– P(X1, X2, …, Xp | Ci) = P(X1 | Ci) P(X2 | Ci) P(X3 | Ci) … P(Xp | Ci)
• The terms on the right are estimated from frequency counts in the
training data, with the estimate of P(Xj|Ci) being equal to the
number of occurrences of the value xj in class Ci divided by the total
number of records in that class
• The example on pp. 97-98 (pp. 91-92 in the earlier edition) demonstrates that
– the naive Bayes estimate, which is proportional to P(Ci) P(X1 | Ci) … P(Xp | Ci), approximates P(Ci | X1, …, Xp)
– a classification cutoff of 0.5 is assumed
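The frequency-count estimates and the product rule can be put together as follows. This is a minimal sketch under stated assumptions: the ten records are a hypothetical toy dataset (they are not from the slides), the function names `fit` and `posterior` are illustrative, and the class scores are normalized so they sum to 1 before the 0.5 cutoff is applied.

```python
from collections import Counter, defaultdict

def fit(records, labels):
    """Tally class counts and, within each class, counts of every predictor value."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # key: (class, predictor index)
    for record, label in zip(records, labels):
        for j, value in enumerate(record):
            value_counts[(label, j)][value] += 1
    return class_counts, value_counts

def posterior(record, class_counts, value_counts):
    """Naive Bayes estimate of P(Ci | x1, ..., xp) for every class Ci."""
    n = sum(class_counts.values())
    scores = {}
    for label, n_class in class_counts.items():
        score = n_class / n  # prevalence P(Ci), as used by the naive rule
        for j, value in enumerate(record):
            score *= value_counts[(label, j)][value] / n_class  # P(xj | Ci)
        scores[label] = score
    total = sum(scores.values())
    return {label: (s / total if total else 0.0) for label, s in scores.items()}

# Hypothetical toy records: (legal charges filed?, company size)
records = [("y", "small"), ("n", "small"), ("n", "large"), ("n", "large"),
           ("n", "small"), ("n", "small"), ("y", "small"), ("y", "large"),
           ("n", "large"), ("y", "large")]
labels = ["truthful"] * 6 + ["fraudulent"] * 4

probs = posterior(("y", "small"), *fit(records, labels))
predicted = "fraudulent" if probs["fraudulent"] > 0.5 else "truthful"
print(probs, predicted)
```

For a small company with legal charges, the fraudulent score is 0.4 × 3/4 × 1/4 and the truthful score is 0.6 × 1/6 × 4/6, which normalize to 9/17 (about 0.53) for fraudulent, so the record is classified as fraudulent at a cutoff of 0.5.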
Evaluation of the Model
• To evaluate the performance of the naive Bayes classifier for our
data, we use
– the classification matrix,
– lift charts,
– and the measures described in Chapter 4.
• The classification matrices for the training and validation sets are
shown.
• The overall error level is around 18% for both the training and
validation data.
• A naive rule, which would classify all 880 flights in the validation set
as on-time, would have missed the 172 delayed flights, resulting in a 20% error level.
• Naive Bayes is therefore only slightly more accurate than the naive rule.
• The lift chart shows the strength of the Naive Bayes in capturing
the delayed flights well.
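The naive-rule benchmark on this slide is a one-line calculation, shown here as a quick check. The flight counts come from the slide; the variable names are illustrative.

```python
validation_flights = 880
delayed_flights = 172

# The naive rule labels every flight "on-time", so it misclassifies
# exactly the delayed flights.
naive_rule_error = delayed_flights / validation_flights
print(f"{naive_rule_error:.1%}")  # 19.5%, reported on the slide as 20%
```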
Evaluation of Naive Bayes Classifier
• The Naive Bayes classifier's advantages are in its
– simplicity, computational efficiency, and its good classification
performance.
– it often outperforms more sophisticated classifiers even when the
underlying assumption of independent predictors is far from true.
– This advantage is especially pronounced when the number of
predictors is very large.
• There are three main issues that should be kept in mind however.
– First, the Naive Bayes classifier requires a very large number of records to
obtain good results.
– Second, where a predictor category is not present in the training data, Naive
Bayes assumes that a new record with that category of the predictor has zero
probability.
• This can be a problem if this rare predictor value is important.
• For example, assume the target variable is “bought high value life insurance” and a
predictor category is “owns yacht”. If the training data have no records with “owns
yacht” = 1, then for any new record where “owns yacht” = 1, Naive Bayes will assign a
probability of 0 to the target variable “bought high value life insurance”.
• With no training records with “owns yacht” = 1, of course, no data mining technique
will be able to incorporate this potentially important variable into the classification
model; it will be ignored.
• With Naive Bayes, however, the absence of this predictor value actively “outvotes” any
other information in the record to assign a 0 to the target value (when, in this case,
it has a relatively good chance of being a 1).
• The presence of a large training set (and judicious binning of continuous variables,
if required) helps mitigate this effect.
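The “outvoting” effect follows directly from the product form of the naive Bayes score: one zero factor makes the whole product zero. The sketch below illustrates this with made-up numbers. The prior and the two non-zero conditional probabilities are hypothetical, as are the predictor names other than “owns yacht”.

```python
# Hypothetical estimates for the class "bought high value life insurance" = 1.
# "owns yacht" = 1 never occurred in the training data, so its estimated
# conditional probability is 0.
prior_bought = 0.30
p_given_bought = {"high income": 0.90, "has dependents": 0.80, "owns yacht": 0.0}

score = prior_bought
for predictor, p in p_given_bought.items():
    score *= p  # a single zero factor forces the whole product to zero

print(score)  # 0.0, regardless of how favorable the other predictors are
```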
• Finally, good performance is obtained when the goal is classification
or ranking of records according to their probability of belonging to a
certain class.
• However, when the goal is to actually estimate the probability of class
membership, this method provides very biased results.
– For this reason the Naive Bayes method is rarely used in credit scoring.
k-Nearest Neighbors (k-NN)
• The idea in k-Nearest Neighbor methods is to identify k
observations in the training dataset that are similar to a new
record that we wish to classify.
• We then use these similar (neighboring) records to classify the
new record into a class, assigning the new record to the
predominant class among these neighbors.
• Denote by (x1, x2,…,xp) the values of the predictors for this new
record.
• We look for records in our training data that are similar or “near"
to the record to be classified in the predictor space, i.e., records
that have values close to x1, x2,…,xp.
• Then, based on the classes to which those proximate records
belong, we assign a class to the record that we want to classify.
k-Nearest Neighbors (k-NN)
• The k-Nearest Neighbor algorithm is a classification
method that does not make assumptions about the
form of the relationship between the class
membership (Y ) and the predictors x1, x2,…,xp.
• This is a non-parametric method because it does not
involve estimation of parameters in an assumed
functional form such as the linear form that we
encountered in linear regression.
• This method draws information from similarities
between the predictor values of the records in the data
set.
k-Nearest Neighbors (k-NN)
• The central issue here is how to measure the distance between records
based on their predictor values.
• The most popular measure of distance is the Euclidean distance.
• The Euclidean distance between two records (x1, x2, …, xp) and (u1, u2, …, up) is
sqrt((x1 - u1)^2 + (x2 - u2)^2 + … + (xp - up)^2).
• For simplicity, we continue here only with the Euclidean distance, but you
will find a host of other distance metrics in Chapters 12 (Cluster Analysis)
and 10 (Discriminant Analysis) for both numerical and categorical
variables.
• In most cases predictors should first be standardized before computing
Euclidean distance, to equalize the scales that the different predictors
may have.
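The effect of standardizing before measuring distance can be seen in a small sketch. The income and lot-size values below are hypothetical illustration data, not the riding-mower dataset, and z-score standardization with the sample standard deviation is one common choice assumed here.

```python
import math
from statistics import mean, stdev

def standardize(column):
    """Convert one predictor column to z-scores (mean 0, standard deviation 1)."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]

def euclidean(x, u):
    """Euclidean distance between two records with the same predictors."""
    return math.sqrt(sum((xj - uj) ** 2 for xj, uj in zip(x, u)))

# Hypothetical records: income in $000s and lot size in 000s of square feet.
income = [50.0, 70.0, 90.0]
lot_size = [18.0, 20.0, 22.0]

raw = list(zip(income, lot_size))
scaled = list(zip(standardize(income), standardize(lot_size)))

print(euclidean(raw[0], raw[1]))        # dominated by the income difference
print(euclidean(scaled[0], scaled[1]))  # both predictors contribute equally
```

On the raw scale the distance between the first two records is driven almost entirely by income, simply because income has larger units. After standardization each predictor differs by one standard deviation, so both contribute equally.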
k-Nearest Neighbors (k-NN)
• After computing the distances between the record to be classified and
existing records, we need a rule to assign a class to the record to be
classified, based on the classes of its neighbors.
• The simplest case is k = 1 where we look for the record that is closest (the
nearest neighbor) to classify the new record as belonging to the same
class as its closest neighbor.
• This intuitive idea of using a single nearest neighbor to classify records can
be very powerful when we have a large number of records in our training
set.
• It is possible to prove that, as the training set grows very large, the
misclassification rate of the 1-Nearest Neighbor scheme is no more than twice
the error rate we would achieve if we knew exactly the probability density
functions for each class.
k-Nearest Neighbors (k-NN)
• The idea of the 1-Nearest Neighbor can be extended
to k > 1 neighbors as follows:
– 1. Find the nearest k neighbors to the record to be
classified
– 2. Use a majority decision rule to classify the record,
where the record is classified as a member of the majority
class of the k neighbors.
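The two steps above translate almost directly into code. This is a minimal sketch: the function name `knn_classify` and the five training points are hypothetical, the predictors are assumed to be already standardized, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, new_x, k):
    """Classify new_x by majority vote among its k nearest training records."""
    # Step 1: find the k nearest neighbors by Euclidean distance.
    by_distance = sorted(zip(train_x, train_y),
                         key=lambda pair: math.dist(pair[0], new_x))
    # Step 2: majority decision rule over the neighbors' classes.
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical standardized training records.
train_x = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_y = ["nonowner", "nonowner", "owner", "owner", "owner"]
new_record = (0.3, 0.3)

for k in (1, 3, 5):
    print(k, knn_classify(train_x, train_y, new_record, k))
```

With k = 1 or k = 3 the new record is classified as a nonowner because its closest neighbors are nonowners. With k = 5, which is every training record, the vote simply returns the overall majority class, owner.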
Riding Mowers
• A riding-mower manufacturer would like to find a way of classifying
families in a city into those likely to purchase a riding mower and those
not likely to buy one.
• A pilot random sample of 12 owners and 12 non-owners in the city is
undertaken. The data are shown in the table and plot on the next
slide.
• We first partition the data into training data (18 households) and
validation data (6 households).
• Obviously this dataset is too small for partitioning, but we continue with
this for illustration purposes.
Riding Mowers
• Consider a new household with $60,000 income and lot size 20,000 ft².
The training set is shown on the next slide.
• Among the households in the training set, the closest one to the new
household (in Euclidean distance after normalizing income and lot size) is
household #4, with $61,500 income and lot size 20,800 ft².
• If we use a 1-NN classifier, we would classify the new household as an
owner, like household #4.
• If we use k = 3, then the three nearest households are #4, #9, and #14.
• The first two are owners of riding mowers, and the last is a non-owner.
• The majority vote is therefore “owner”, and the new household would be
classified as an owner.
Choosing k
• The advantage of choosing k > 1 is that higher values of k provide
smoothing that reduces the risk of overfitting due to noise in the training
data.
• Generally speaking, if k is too low, we may be fitting to the noise in the
data.
• However, if k is too high, we will miss out on the method's ability to
capture the local structure in the data, one of its main advantages.
• In the extreme, k = n = the number of records in the training dataset.
– In that case we simply assign all records to the majority class in the training
data irrespective of the values of (x1, x2,…,xp), which coincides with the Naive
Rule!
Choosing k
• k = n is clearly a case of over-smoothing in the absence of useful
information in the predictors about the class membership.
• In other words, we want to balance between overfitting to the predictor
information and ignoring this information completely.
• A balanced choice depends on the nature of the data.
• The more complex and irregular the structure of the data, the lower the
optimum value of k.
• Typically, values of k fall in the range between 1 and 20.
• Often an odd number is chosen, to avoid ties.
Choosing k
• So how is k chosen?
– Answer: we choose that k which has the best classification performance.
• We use the training data to classify the records in the validation data,
then compute error rates for various choices of k.
• For our example, if we choose k = 1 we will classify in a way that is very
sensitive to the local characteristics of the training data.
• If we choose a large value of k such as k = 18 we would simply predict the
most frequent class in the dataset in all cases.
• This is a very stable prediction but it completely ignores the information in
the predictors.
• To find a balance we examine the misclassification rate (of the validation
set) that results for different choices of k from 1 to 18.
• This is shown on a previous slide. We would choose k = 8, which
minimizes the misclassification rate in the validation set.
• Because the validation set has now been used to choose k, it effectively
becomes part of the training process and no longer serves as a true “hold-out” set.
• We need a third test set to evaluate the performance of the method on
data that it did not see.
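The search over k can be sketched as a loop that classifies the validation records from the training records and records the error rate for each k. The data below are hypothetical (nine training and four validation records, not the riding-mower data), and the function names are illustrative.

```python
import math
from collections import Counter

def knn_classify(train, new_x, k):
    """Majority vote among the k training records nearest to new_x."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], new_x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def validation_error(train, valid, k):
    """Share of validation records misclassified when using k neighbors."""
    wrong = sum(knn_classify(train, x, k) != y for x, y in valid)
    return wrong / len(valid)

# Hypothetical standardized records: ((predictor 1, predictor 2), class)
train = [((0.10, 0.20), "nonowner"), ((0.30, 0.10), "nonowner"),
         ((0.20, 0.40), "nonowner"), ((0.40, 0.30), "nonowner"),
         ((0.15, 0.05), "nonowner"), ((0.90, 0.80), "owner"),
         ((0.80, 1.00), "owner"), ((1.00, 0.90), "owner"),
         ((0.32, 0.33), "owner")]  # the last record is a noisy owner
valid = [((0.30, 0.30), "nonowner"), ((0.25, 0.20), "nonowner"),
         ((0.90, 0.90), "owner"), ((0.85, 0.85), "owner")]

errors = {k: validation_error(train, valid, k) for k in range(1, len(train) + 1)}
best_k = min(errors, key=errors.get)  # smallest k among those with the lowest error
for k, err in errors.items():
    print(k, err)
print("chosen k:", best_k)
```

In this toy data k = 1 is fooled by the single noisy record, while k = n reduces to the naive rule and misclassifies every validation owner. The error reported for the chosen k is optimistic, because the validation records were used to pick it.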
k-NN for a Quantitative Response
(Continuous Response Variable)
• The idea of k-NN can be readily extended to predicting a continuous value
• Instead of taking a majority vote of the neighbors to determine class, we
take the average response value of the k nearest neighbors to determine
the prediction.
• Often this average is a weighted average with the weight decreasing with
increasing distance from the point at which the prediction is required.
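A k-NN prediction for a continuous response can be sketched as follows. The slides do not specify a weighting scheme, so the inverse-distance weights used here are an assumption, as are the function name `knn_predict` and the toy data.

```python
import math

def knn_predict(train_x, train_y, new_x, k, weighted=False):
    """Predict a continuous response from the k nearest training records."""
    nearest = sorted((math.dist(x, new_x), y) for x, y in zip(train_x, train_y))[:k]
    if not weighted:
        return sum(y for _, y in nearest) / k  # plain average of the neighbors
    # Assumed scheme: weight each neighbor by 1 / distance, so closer
    # neighbors count more. The small constant avoids division by zero.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

# Hypothetical one-predictor training data.
train_x = [(1.0,), (2.0,), (4.0,)]
train_y = [10.0, 20.0, 40.0]
new_record = (2.2,)

print(knn_predict(train_x, train_y, new_record, k=2))
print(knn_predict(train_x, train_y, new_record, k=2, weighted=True))
```

The two nearest neighbors have responses 20 and 10, so the plain average is 15. The weighted average is pulled toward 20 because that neighbor is much closer to the new record.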
Evaluation of k-NN Algorithms
• The main advantage of k-NN methods is their simplicity and lack of
parametric assumptions.
• In the presence of a large enough training set, these methods perform
surprisingly well, especially when each class is characterized by multiple
combinations of predictor values.
• For instance, in the flight delays example there are likely to be multiple
combinations of carrier-destination-arrival-time etc. that characterize
delayed flights vs. on-time flights.
Evaluation of k-NN Algorithms
• While there is no time required to estimate parameters from the training
data (as would be the case for parametric models such as regression), the
time to find the nearest neighbors in a large training set can be
prohibitive.
• A number of ideas have been implemented to overcome this difficulty.
• The main ideas are:
– Reduce the time taken to compute distances by working in a reduced
dimension using dimension reduction techniques such as principal
components analysis (Chapter 3).
– Use sophisticated data structures such as search trees to speed up
identification of the nearest neighbor. This approach often settles for an
“almost nearest" neighbor to improve speed.
– Edit the training data to remove redundant or “almost redundant" points to
speed up the search for the nearest neighbor.
• An example is to remove records in the training set that have no effect on the classification
because they are surrounded by records that all belong to the same class.
Evaluation of k-NN Algorithms
• The number of records required in the training set to qualify as large
increases exponentially with the number of predictors p.
• This is because the expected distance to the nearest neighbor goes up
dramatically with p unless the size of the training set increases
exponentially with p.
– This phenomenon is known as “the curse of dimensionality”.
– The curse of dimensionality is a fundamental issue pertinent to all
classification, prediction and clustering techniques.
• We often seek to reduce the dimensionality of the space of predictor
variables, either
– by selecting subsets of the predictors for our model, or
– by combining them using methods such as principal components
analysis, singular value decomposition, and factor analysis.
• In the artificial intelligence literature dimension reduction is often
referred to as factor selection or feature extraction.
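The growth in nearest-neighbor distance with the number of predictors p can be illustrated with a small simulation. This is a sketch under stated assumptions: records are drawn uniformly at random from the unit hypercube, the training set size is held fixed at 100, and the function name `mean_nn_distance` is hypothetical.

```python
import math
import random

def mean_nn_distance(n, p, trials=200, seed=0):
    """Average distance from a random query point to its nearest neighbor
    among n random training points with p predictors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        points = [[rng.random() for _ in range(p)] for _ in range(n)]
        query = [rng.random() for _ in range(p)]
        total += min(math.dist(query, pt) for pt in points)
    return total / trials

# Training set size stays fixed while the number of predictors grows.
for p in (1, 2, 5, 10, 20):
    print(p, round(mean_nn_distance(100, p), 3))
```

With the training set size fixed, the "nearest" neighbor moves steadily farther away as p grows, which is why the number of records needed grows so quickly with the number of predictors.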
Problems
• Personal Loan Acceptance
• Automobile Accidents
39

More Related Content

PDF
VaR Approximation Methods
PPTX
Maximizing a churn campaigns profitability with cost sensitive machine learning
PDF
2012 predictive clusters
PDF
POSSIBILISTIC SHARPE RATIO BASED NOVICE PORTFOLIO SELECTION MODELS
PDF
Efficient Online Evaluation of Big Data Stream Classifiers
PDF
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
PPTX
Estimating default risk in fund structures
PDF
PhD Defense - Example-Dependent Cost-Sensitive Classification
VaR Approximation Methods
Maximizing a churn campaigns profitability with cost sensitive machine learning
2012 predictive clusters
POSSIBILISTIC SHARPE RATIO BASED NOVICE PORTFOLIO SELECTION MODELS
Efficient Online Evaluation of Big Data Stream Classifiers
Frequent Pattern Mining - Krishna Sridhar, Feb 2016
Estimating default risk in fund structures
PhD Defense - Example-Dependent Cost-Sensitive Classification

Viewers also liked (14)

PDF
Brochure du PAOPA (Programme d'appui aux Organisations Paysannes Africaines)
PDF
Curso: Cómo elaborar un requerimiento eficiente
PDF
PDF
Curso: SIGA
DOCX
linkedinspeciale
PDF
Fiji Water Environmental Nightmare
PDF
PDF
Energy Drinks Report
PDF
Norma 1926.104
PDF
02 wsdot c1710 areas descanso
PDF
51 50-1-pb
PPTX
Analysis%20 of%20a%20film%20opening%20sequence
PDF
Curso: Gestión del Presupuesto Público 2017
PDF
L’attieke produit a partir de l a pâte de manioc, une innovation creatrice d...
Brochure du PAOPA (Programme d'appui aux Organisations Paysannes Africaines)
Curso: Cómo elaborar un requerimiento eficiente
Curso: SIGA
linkedinspeciale
Fiji Water Environmental Nightmare
Energy Drinks Report
Norma 1926.104
02 wsdot c1710 areas descanso
51 50-1-pb
Analysis%20 of%20a%20film%20opening%20sequence
Curso: Gestión del Presupuesto Público 2017
L’attieke produit a partir de l a pâte de manioc, une innovation creatrice d...
Ad

Similar to Text classification methods (20)

PPT
UNIT2_NaiveBayes algorithms used in machine learning
PDF
19BayesTheoremClassification19BayesTheoremClassification.ppt
PDF
Machine learning naive bayes and svm.pdf
PPT
natural language processing by Christopher
PDF
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
PPT
NaiveBayesfcctcvtyvyuyuvuygygygiughuobiubivvyjnh
PPT
cas_washington_nov2010_web
PPTX
UNIT 3: Data Warehousing and Data Mining
PPTX
Machine learning ( Part 2 )
PDF
Introduction to machine learning
PPTX
Lecture 3 ml
PDF
Machine Learning - Implementation with Python - 2
PPT
PPT
3 DM Classification HFCS kilometres .ppt
DOCX
Naive bayes classifier
PPT
Cluster2
PPT
Probablistic information retrieval
PPT
4646150.ppt
PPTX
MACHINE LEARNING Unit -2 Algorithm.pptx
PDF
Subscription fraud analytics using classification
UNIT2_NaiveBayes algorithms used in machine learning
19BayesTheoremClassification19BayesTheoremClassification.ppt
Machine learning naive bayes and svm.pdf
natural language processing by Christopher
Supply Chain Analytics, Supply Chain Management, Supply Chain Data Analytics
NaiveBayesfcctcvtyvyuyuvuygygygiughuobiubivvyjnh
cas_washington_nov2010_web
UNIT 3: Data Warehousing and Data Mining
Machine learning ( Part 2 )
Introduction to machine learning
Lecture 3 ml
Machine Learning - Implementation with Python - 2
3 DM Classification HFCS kilometres .ppt
Naive bayes classifier
Cluster2
Probablistic information retrieval
4646150.ppt
MACHINE LEARNING Unit -2 Algorithm.pptx
Subscription fraud analytics using classification
Ad

More from David Hoen (20)

PPT
Computer security
PPT
Introduction to prolog
PPT
Database introduction
PPTX
Building a-database
PPTX
Decision tree
PPT
Database constraints
PPT
Prolog programming
PPT
Hash crypto
PPTX
Introduction to security_and_crypto
PPTX
Key exchange in crypto
PPTX
Nlp naive bayes
PPT
Prolog resume
PPT
Access data connection
PPT
Basic dns-mod
PPT
Database concepts
PPTX
Hashfunction
PPTX
Datamining with nb
PDF
Text categorization as a graph
PPT
Xml schema
PPT
Text classification
Computer security
Introduction to prolog
Database introduction
Building a-database
Decision tree
Database constraints
Prolog programming
Hash crypto
Introduction to security_and_crypto
Key exchange in crypto
Nlp naive bayes
Prolog resume
Access data connection
Basic dns-mod
Database concepts
Hashfunction
Datamining with nb
Text categorization as a graph
Xml schema
Text classification

Recently uploaded (20)

PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PPTX
Spectroscopy.pptx food analysis technology
PDF
Electronic commerce courselecture one. Pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Machine learning based COVID-19 study performance prediction
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
cuic standard and advanced reporting.pdf
PDF
Reach Out and Touch Someone: Haptics and Empathic Computing
PPTX
Big Data Technologies - Introduction.pptx
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PPTX
Cloud computing and distributed systems.
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
20250228 LYD VKU AI Blended-Learning.pptx
Spectroscopy.pptx food analysis technology
Electronic commerce courselecture one. Pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
Per capita expenditure prediction using model stacking based on satellite ima...
Machine learning based COVID-19 study performance prediction
Diabetes mellitus diagnosis method based random forest with bat algorithm
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Review of recent advances in non-invasive hemoglobin estimation
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
cuic standard and advanced reporting.pdf
Reach Out and Touch Someone: Haptics and Empathic Computing
Big Data Technologies - Introduction.pptx
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Mobile App Security Testing_ A Comprehensive Guide.pdf
Chapter 3 Spatial Domain Image Processing.pdf
Cloud computing and distributed systems.
Digital-Transformation-Roadmap-for-Companies.pptx

Text classification methods

  • 1. Chapter 6 Three Simple Classification Methods The Naïve Rule Naïve Bayes k-Nearest Neighbor 1
  • 2. Introduction • Naïve Rule used to set up Naïve Bayes & k-NN • Naïve Bayes & k-NN used in practice • Data driven methods • Naïve Bayes uses categorical predictors • k-NN may be used with continuous predictors • Illustrate with three examples: – Example 1: Predicting Fraudulent Financial Reporting • Uses Categorical predictors – Example 2: Predicting Delayed Flights • Uses Categorical Predictors – Example 3: Riding Mowers • Uses Continuous Predictors 2
  • 3. Predicting Fraudulent Financial Reporting • To avoid being involved in any legal charges against it, the firm wants to detect whether a company submitted a fraudulent financial report . • In this case each company (customer) is a record, and the response of interest, Y = {fraudulent; truthful}, has two classes that a company can be classified into: C1=fraudulent and C2=truthful. • The only other piece of information that the auditing firm has on its customers is whether or not legal charges were filed against them. • The firm would like to use this information to improve its estimates of fraud. • Thus “X=legal charges" is a single (categorical) predictor with two categories: whether legal charges were filed (1) or not (0). 3
  • 4. • 1500 companies • Partition into 1000 training set & 500 validation set • Counts from training below Predicting Fraudulent Financial Reporting 4
  • 5. Predicting Delayed Flights • The outcome of interest is whether the flight is delayed or not (delayed means arrive more than 15 minutes late). • Our data consist of all flights from the Washington, DC area into the New York City area during January 2004. • The percent of delayed flights among these 2346 flights is 18% • Six predictors listed below • Predict if a new flight will be delayed – two classes • 1 = “Delayed” and 0 = “On Time” 5
  • 6. The Naive Rule • Classify everything as belonging to the most prevalent class • Classifying a record into one of m classes, ignoring all predictor information (X1,X2,…,Xp) that we may have, is to classify the record as a member of the majority class. • In the auditing example the naive rule would classify all customers as being truthful, because 90% of the investigated companies in the training set were found to be truthful. • Similarly, all flights would be classified as being on-time, because the majority of the flights in the dataset (82%) were not delayed. 6
  • 7. Naive Bayes • More sophisticated method than the naive rule. • The main idea is to integrate the information given in a set of predictors into the Naive Rule to obtain more accurate classifications. • The probability of a record belonging to a certain class is now evaluated – Based on the prevalence of that class – And on the additional information that is given on that record in term of its X information. • Naive Bayes works only with predictors that are categorical. – Numerical predictors must be binned and converted to categorical variables before the Naive Bayes classifier can use them. • The Naive Bayes method is very useful when very large datasets are available. – For instance, web-search companies like Google use naive Bayes classifiers to correct misspellings that users type in. When you type a phrase that includes a misspelled word into Google it suggests a spelling correction for the phrase. The suggestion(s) are based on information not only on the frequencies of similarly-spelled words that were typed by millions of other users, but also on the other words in the phrase. 7
  • 8. Conditional Probabilities • Classification Task – Estimate the probability of membership in each class given a certain set of predictor variables • This type of probability is called “conditional probability” • A conditional probability of event A given event B (denoted by P(A|B)) represents the chances of event A occurring only under the scenario that event B occurs. • In the auditing example we are interested in – P(fraudulent financial report | legal charges) 8
  • 9. • To classify a record, we compute its chance of belonging to each of the classes by computing P(Ci|X1,…,Xp) for each class i. We then classify the record to the class that has the highest probability • Since conditioning on an event means that we have additional information (e.g., we know that legal charges were filed against them), uncertainty is reduced • In auditing example column headings are used as predictors for classification probabilities – Column sums are sample size used to compute probabilities – P(fraudulent financial report | legal charges) • 50/232 – P(fraudulent financial report | no legal charges) • 50/770 9 Conditional Probabilities
  • 10. A Practical Difficulty • For N predictors and M classes training set may need to be very large. • To “fill in” the M X N table so that we can compute the conditional probabilities would require a large number cases to avoid entries of zero (no instances of cases of in the table) • Apples & Oranges Example 10
  • 11. A Solution: Naive Bayes • A solution that has been widely used is based on making the simplifying assumption of predictor independence. If it is reasonable to assume that the predictors are all mutually independent within each class, • Simplify the expression making it useful in practice • Independence of the predictors within each class gives us the following simplification – follows from the product rule for probabilities of independent events (the probability of occurrence of multiple events is the product of the probabilities of the individual event occurrences): – P(X1,X2, …,Xm|Ci) = P(X1|Ci)P(X2|Ci)P(X3|Ci) ,…,P(Xm|Ci) • The terms on the right are estimated from frequency counts in the training data, with the estimate of P(Xj|Ci) being equal to the number of occurrences of the value xj in class Ci divided by the total number of records in that class • Example pgs. 97-98 (or pgs. 91-92 earlier edition) demonstrate – P(X1,X2, …,Xm|Ci) approximates P(Ci|X1,…,Xp) – Assuming a classification cutoff of 0.5 11
  • 12. 12
  • 13. 13
  • 14. Evaluation of the Model • To evaluate the performance of the naive Bayes classifier for our data, we use – the classification matrix, – lift charts, – And measures described in Chapter 4. • The classification matrices for the training and validation sets are shown • The overall error level is around 18% for both the training and validation data • A naive rule which would classify all 880 flights in the validation set as on-time • missed the 172 delayed flights resulting in a 20% error level. • The Naive Bayes is only slightly less accurate. • The lift chart shows the strength of the Naive Bayes in capturing the delayed flights well. 14
  • 15. 15
  • 16. 16
  • 17. 17
  • 18. Evaluation of Naive Bayes Classifier • The Naive Bayes classifier's advantages are in its – simplicity, computational efficiency, and its good classification performance. – it often outperforms more sophisticated classifiers even when the underlying assumption of independent predictors is far from true. – This advantage is especially pronounced when the number of predictors is very large. 18
  • 19. Evaluation of Naive Bayes Classifier • There are three main issues that should be kept in mind, however. – First, the Naive Bayes classifier requires a very large number of records to obtain good results. – Second, where a predictor category is not present in the training data, Naive Bayes assumes that a new record with that category of the predictor has zero probability. • This can be a problem if this rare predictor value is important. • For example, assume the target variable is “bought high value life insurance” and a predictor category is “owns yacht”. If the training data have no records with “owns yacht”=1, then for any new record where “owns yacht”=1, Naive Bayes will assign a probability of 0 to the target variable “bought high value life insurance”. • With no training records with “owns yacht”=1, of course, no data mining technique will be able to incorporate this potentially important variable into the classification model; it will be ignored. • With Naive Bayes, however, the absence of this predictor value actively “outvotes” any other information in the record to assign a 0 to the target value (when, in this case, it has a relatively good chance of being a 1). • The presence of a large training set (and judicious binning of continuous variables, if required) helps mitigate this effect.
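The “outvoting” effect can be seen by multiplying out the frequency estimates for a single class. The sketch below uses seven made-up training records with two hypothetical predictors (owns_yacht, high_income). None of them has owns_yacht = 1, which mirrors the scenario on the slide.

```python
# Hypothetical training records: ((owns_yacht, high_income), bought_insurance).
# No record has owns_yacht = 1.
train = [
    ((0, 1), 1), ((0, 1), 1), ((0, 0), 1),
    ((0, 0), 0), ((0, 0), 0), ((0, 1), 0), ((0, 0), 0),
]

def conditional(j, value, target):
    """Frequency estimate of P(Xj = value | class = target)."""
    in_class = [x for x, y in train if y == target]
    return sum(1 for x in in_class if x[j] == value) / len(in_class)

prior_buy = sum(1 for _, y in train if y == 1) / len(train)
p_yacht = conditional(0, 1, target=1)     # 0 of 3 buyers own a yacht in the training data
p_income = conditional(1, 1, target=1)    # 2 of 3 buyers have high income
score_buy = prior_buy * p_yacht * p_income

print(p_yacht, p_income, score_buy)
```

Although high income favours buying (2/3 of buyers have it), the single zero factor for the yacht predictor forces the whole product for the “bought” class to zero.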
  • 20. Evaluation of Naive Bayes Classifier • Finally, good performance is obtained when the goal is classification or ranking of records according to their probability of belonging to a certain class. • However, when the goal is to actually estimate the probability of class membership, this method provides very biased results. – For this reason the Naive Bayes method is rarely used in credit scoring.
  • 22. k-Nearest Neighbors (k-NN) • The idea in k-Nearest Neighbor methods is to identify k observations in the training dataset that are similar to a new record that we wish to classify. • We then use these similar (neighboring) records to classify the new record into a class, assigning the new record to the predominant class among these neighbors. • Denote by (x1, x2,…,xp) the values of the predictors for this new record. • We look for records in our training data that are similar or “near" to the record to be classified in the predictor space, i.e., records that have values close to x1, x2,…,xp. • Then, based on the classes to which those proximate records belong, we assign a class to the record that we want to classify. 22
  • 23. k-Nearest Neighbors (k-NN) • The k-Nearest Neighbor algorithm is a classification method that does not make assumptions about the form of the relationship between the class membership (Y) and the predictors x1, x2,…,xp. • This is a non-parametric method because it does not involve estimation of parameters in an assumed functional form such as the linear form that we encountered in linear regression. • This method draws information from similarities between the predictor values of the records in the data set.
  • 24. k-Nearest Neighbors (k-NN) • The central issue here is how to measure the distance between records based on their predictor values. • The most popular measure of distance is the Euclidean distance. • The Euclidean distance between two records (x1, x2,…,xp) and (u1, u2,…,up) is √[(x1 − u1)² + (x2 − u2)² + … + (xp − up)²] • For simplicity, we continue here only with the Euclidean distance, but you will find a host of other distance metrics in Chapters 12 (Cluster Analysis) and 10 (Discriminant Analysis) for both numerical and categorical variables. • In most cases predictors should be standardized before computing Euclidean distance, to equalize the scales that the different predictors may have.
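A short sketch of the distance computation, with and without standardization. The helper names `standardize` and `euclidean` and the four records are hypothetical, and the sample standard deviation is assumed as the scaling factor (other choices are possible).

```python
import math
from statistics import mean, stdev

def standardize(rows):
    """Rescale each predictor (column) to zero mean and unit sample standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [stdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]

def euclidean(x, u):
    """Euclidean distance between two records with the same predictors."""
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))

# Hypothetical records: (income in dollars, lot size in square feet)
raw = [(60000.0, 18400.0), (85500.0, 16800.0), (64800.0, 21600.0), (61500.0, 20800.0)]
scaled = standardize(raw)

raw_distance = euclidean(raw[0], raw[3])          # depends on the units of each predictor
scaled_distance = euclidean(scaled[0], scaled[3]) # both predictors on a comparable scale
print(raw_distance, scaled_distance)
```

On the raw values, whichever predictor has the larger numeric spread dominates the distance. After standardization each predictor contributes on a comparable scale.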
  • 25. k-Nearest Neighbors (k-NN) • After computing the distances between the record to be classified and existing records, we need a rule to assign a class to the record to be classified, based on the classes of its neighbors. • The simplest case is k = 1, where we look for the record that is closest (the nearest neighbor) and classify the new record as belonging to the same class as its closest neighbor. • This intuitive idea of using a single nearest neighbor to classify records can be very powerful when we have a large number of records in our training set. • It is possible to prove that the misclassification rate of the 1-Nearest Neighbor scheme is no more than twice the error rate obtained when we know exactly the probability density functions for each class.
  • 26. k-Nearest Neighbors (k-NN) • The idea of the 1-Nearest Neighbor can be extended to k > 1 neighbors as follows: – 1. Find the nearest k neighbors to the record to be classified – 2. Use a majority decision rule to classify the record, where the record is classified as a member of the majority class of the k neighbors. 26
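The two steps above can be sketched as follows. This is a minimal brute-force version on made-up, already-standardized points. The function name `knn_classify` and the data are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, new_x, k):
    """Classify new_x by majority vote among its k nearest training records."""
    # Step 1: find the k nearest neighbors (Euclidean distance).
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: math.dist(pair[0], new_x))[:k]
    # Step 2: majority decision rule over the neighbors' classes.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical training records (two standardized predictors) and their classes
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.0), (3.2, 2.9), (2.8, 3.1)]
train_y = ["owner", "owner", "owner", "non-owner", "non-owner", "non-owner"]

print(knn_classify(train_x, train_y, (1.1, 1.0), k=1))
print(knn_classify(train_x, train_y, (2.9, 3.0), k=3))
```

With two classes, an odd k guarantees that the vote cannot tie. With an even k or more than two classes, a tie-breaking rule would be needed, and this sketch simply takes whichever class `Counter` lists first.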
  • 27. Riding Mowers • A riding-mower manufacturer would like to find a way of classifying families in a city into those likely to purchase a riding mower and those not likely to buy one. • A pilot random sample of 12 owners and 12 non-owners in the city is undertaken. • We first partition the data into training data (18 households) and validation data (6 households). • Obviously this dataset is too small for partitioning, but we continue with this for illustration purposes. • The data set is shown on the next slide.
  • 28. 28
  • 29. Riding Mowers • Consider a new household with $60,000 income and lot size 20,000 ft². The training set is shown on the next slide. • Among the households in the training set, the closest one to the new household (in Euclidean distance after normalizing income and lot size) is household #4, with $61,500 income and lot size 20,800 ft². • If we use a 1-NN classifier, we would classify the new household as an owner, like household #4. • If we use k = 3, then the three nearest households are #4, #9, and #14. • The first two are owners of riding mowers, and the last is a non-owner. • The majority vote is therefore “owner”, and the new household would be classified as an owner.
  • 30. 30
  • 31. Choosing k • The advantage of choosing k > 1 is that higher values of k provide smoothing that reduces the risk of overfitting due to noise in the training data. • Generally speaking, if k is too low, we may be fitting to the noise in the data. • However, if k is too high, we will miss out on the method's ability to capture the local structure in the data, one of its main advantages. • In the extreme, k = n = the number of records in the training dataset. – In that case we simply assign all records to the majority class in the training data irrespective of the values of (x1, x2,…,xp), which coincides with the Naive Rule! 31
  • 32. Choosing k • k = n is clearly a case of over-smoothing in the absence of useful information in the predictors about the class membership. • In other words, we want to balance between overfitting to the predictor information and ignoring this information completely. • A balanced choice depends on the nature of the data. • The more complex and irregular the structure of the data, the lower the optimum value of k. • Typically, values of k fall in the range between 1 and 20. • Often an odd number is chosen, to avoid ties.
  • 33. Choosing k • So how is k chosen? – Answer: we choose that k which has the best classification performance. • We use the training data to classify the records in the validation data, then compute error rates for various choices of k. • For our example, if we choose k = 1 we will classify in a way that is very sensitive to the local characteristics of the training data. • If we choose a large value of k such as k = 18 we would simply predict the most frequent class in the dataset in all cases. • This is a very stable prediction but it completely ignores the information in the predictors. 33
  • 34. Choosing k • To find a balance we examine the misclassification rate (of the validation set) that results for different choices of k between 1 and 18. • This is shown on a previous slide. We would choose k = 8, which minimizes the misclassification rate in the validation set. • Because the validation set has now been used to choose k, it effectively acts as an addition to the training set and no longer serves as a “hold-out” set as before. • We need a third test set to evaluate the performance of the method on data that it did not see.
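The procedure for choosing k can be sketched as a loop over candidate values, scoring each on the validation set. The data below are made up (11 training points, one of them deliberately mislabeled as noise, and 4 validation points) and are not the riding-mower data. The function names are hypothetical.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, new_x, k):
    """Majority vote among the k nearest training records (Euclidean distance)."""
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: math.dist(pair[0], new_x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def validation_error(train_x, train_y, valid_x, valid_y, k):
    """Fraction of validation records misclassified for a given k."""
    wrong = sum(knn_classify(train_x, train_y, x, k) != y
                for x, y in zip(valid_x, valid_y))
    return wrong / len(valid_y)

# Hypothetical training data: the last "b" point is noise inside the "a" cluster
train_x = [(1, 1), (1, 2), (2, 1), (2, 2), (3, 3), (1.5, 0.5),
           (6, 6), (6, 7), (7, 6), (7, 7), (1.6, 1.6)]
train_y = ["a"] * 6 + ["b"] * 5
valid_x = [(1.5, 1.5), (6.5, 6.5), (2.5, 2.5), (5.5, 6.0)]
valid_y = ["a", "b", "a", "b"]

candidates = [1, 3, 5, len(train_x)]
errors = {k: validation_error(train_x, train_y, valid_x, valid_y, k)
          for k in candidates}
best_k = min(candidates, key=errors.get)   # smallest k among those with lowest error
print(errors, best_k)
```

In this toy setting k = 1 is misled by the noisy point, intermediate values of k classify all validation records correctly, and k = n (here 11) predicts the training majority class for every record, which reproduces the naive rule. Because `best_k` is selected using the validation set, a separate test set would still be needed for an unbiased error estimate.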
  • 35. k-NN for a Quantitative Response (Continuous Response Variable) • The idea of k-NN can be readily extended to predicting a continuous value • Instead of taking a majority vote of the neighbors to determine class, we take the average response value of the k nearest neighbors to determine the prediction. • Often this average is a weighted average with the weight decreasing with increasing distance from the point at which the prediction is required. 35
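A sketch of k-NN prediction for a continuous response, with an optional distance-weighted average. Inverse-distance weighting is one common choice and is an assumption here, since the slide does not specify the weighting scheme. The function name `knn_predict` and the data are hypothetical.

```python
import math

def knn_predict(train_x, train_y, new_x, k, weighted=False):
    """Predict a numeric response as the (optionally weighted) mean of k neighbors."""
    neighbors = sorted(zip(train_x, train_y),
                       key=lambda pair: math.dist(pair[0], new_x))[:k]
    if not weighted:
        return sum(y for _, y in neighbors) / k
    # Assumed weighting: inverse distance, with a small constant to avoid division by zero
    weights = [1.0 / (math.dist(x, new_x) + 1e-9) for x, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Hypothetical one-predictor training data
train_x = [(1.0,), (2.0,), (3.0,), (10.0,)]
train_y = [10.0, 20.0, 30.0, 100.0]

plain = knn_predict(train_x, train_y, (2.2,), k=2)
weighted = knn_predict(train_x, train_y, (2.2,), k=2, weighted=True)
print(plain, weighted)
```

For the new point 2.2 with k = 2, the neighbors are 2.0 and 3.0. The plain average of their responses is 25, while the weighted average is pulled toward the closer neighbor and comes out at roughly 22.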
  • 36. Evaluation of k-NN Algorithms • The main advantage of k-NN methods is their simplicity and lack of parametric assumptions. • In the presence of a large enough training set, these methods perform surprisingly well, especially when each class is characterized by multiple combinations of predictor values. • For instance, in the flight delays example there are likely to be multiple combinations of carrier-destination-arrival-time etc. that characterize delayed flights vs. on-time flights. 36
  • 37. Evaluation of k-NN Algorithms • While there is no time required to estimate parameters from the training data (as would be the case for parametric models such as regression), the time to find the nearest neighbors in a large training set can be prohibitive. • A number of ideas have been implemented to overcome this difficulty. • The main ideas are: – Reduce the time taken to compute distances by working in a reduced dimension using dimension reduction techniques such as principal components analysis (Chapter 3). – Use sophisticated data structures such as search trees to speed up identification of the nearest neighbor. This approach often settles for an “almost nearest" neighbor to improve speed. – Edit the training data to remove redundant or “almost redundant" points to speed up the search for the nearest neighbor. • An example is to remove records in the training set that have no effect on the classification because they are surrounded by records that all belong to the same class. 37
  • 38. Evaluation of k-NN Algorithms • The number of records required in the training set to qualify as large increases exponentially with the number of predictors p. • This is because the expected distance to the nearest neighbor goes up dramatically with p unless the size of the training set increases exponentially with p. – This phenomenon is known as “the curse of dimensionality”. – The curse of dimensionality is a fundamental issue pertinent to all classification, prediction and clustering techniques. • We often seek to reduce the dimensionality of the space of predictor variables through methods – such as selecting subsets of the predictors for our model or – by combining them using methods such as principal components analysis, singular value decomposition, and factor analysis. • In the artificial intelligence literature dimension reduction is often referred to as factor selection or feature extraction.
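The growth of the nearest-neighbor distance with p can be illustrated with a small simulation. The setup is an assumption made for illustration: 200 points drawn uniformly from the unit hypercube, with a fixed seed so that the run is repeatable. The function name is hypothetical.

```python
import math
import random

def mean_nearest_neighbor_distance(n, p, seed=0):
    """Average distance from each of n uniform random points in [0, 1]^p to its nearest neighbor."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(p)] for _ in range(n)]
    total = 0.0
    for i, x in enumerate(points):
        total += min(math.dist(x, u) for j, u in enumerate(points) if j != i)
    return total / n

# Same number of records, increasing number of predictors
distances = {p: mean_nearest_neighbor_distance(n=200, p=p) for p in (1, 2, 5, 10, 20)}
for p, d in distances.items():
    print(f"p = {p:2d}: mean nearest-neighbor distance = {d:.3f}")
```

With the number of records held fixed, the "nearest" neighbor moves steadily farther away as predictors are added, so it becomes less and less representative of the record being classified.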
  • 39. Problems • Personal Loan Acceptance • Automobile Accidents 39