Text classification methods

Chapter 6
Three Simple Classification Methods
The Naïve Rule
Naïve Bayes
k-Nearest Neighbor

Introduction
• Naïve Rule used to set up Naïve Bayes & k-NN
• Naïve Bayes & k-NN used in practice
• Data driven methods
• Naïve Bayes uses categorical predictors
• k-NN may be used with continuous predictors
• Illustrate with three examples:
– Example 1: Predicting Fraudulent Financial Reporting
• Uses Categorical predictors
– Example 2: Predicting Delayed Flights
• Uses Categorical Predictors
– Example 3: Riding Mowers
• Uses Continuous Predictors

Predicting Fraudulent Financial Reporting
• An auditing firm wants to detect whether a company submitted a fraudulent
financial report, to avoid being involved in any legal charges against it.
• In this case each company (customer) is a record, and the response of
interest, Y = {fraudulent; truthful}, has two classes that a company can be
classified into: C1=fraudulent and C2=truthful.
• The only other piece of information that the auditing firm has on its
customers is whether or not legal charges were filed against them.
• The firm would like to use this information to improve its estimates of
fraud.
• Thus “X = legal charges” is a single (categorical) predictor with two
categories: whether legal charges were filed (1) or not (0).

• 1500 companies
• Partition into 1000 training set & 500 validation set
• Counts from training below

Predicting Delayed Flights
• The outcome of interest is whether the flight is delayed or not (delayed means
arriving more than 15 minutes late).
• Our data consist of all flights from the Washington, DC area into the New York City
area during January 2004.
• The percent of delayed flights among these 2346 flights is 18%.
• Six predictors listed below
• Predict if a new flight will be delayed – two classes
• 1 = “Delayed” and 0 = “On Time”

The Naive Rule
• Classify everything as belonging to the most prevalent class
• To classify a record into one of m classes while ignoring all predictor information
(X1, X2, …, Xp) that we may have, we simply assign the record to the
majority class.
• In the auditing example the naive rule would classify all customers as being
truthful, because 90% of the investigated companies in the training set were
found to be truthful.
• Similarly, all flights would be classified as being on-time, because the
majority of the flights in the dataset (82%) were not delayed.
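The naive rule can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the function name `naive_rule` and the `new_records` list are hypothetical, and the label counts are chosen only to match the 90% truthful share quoted for the 1000-record training set.

```python
from collections import Counter

def naive_rule(train_labels):
    """Return the most prevalent class among the training labels."""
    return Counter(train_labels).most_common(1)[0][0]

# Labels chosen to match the auditing example: 90% truthful out of 1000.
train_labels = ["truthful"] * 900 + ["fraudulent"] * 100
majority = naive_rule(train_labels)

# Every new record gets the same prediction, whatever its predictor values are.
new_records = [{"legal_charges": 1}, {"legal_charges": 0}]  # hypothetical records
predictions = [majority for _ in new_records]
print(majority, predictions)
```

Because the predictors are never consulted, the rule serves mainly as a baseline against which the other two methods are compared.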

Naive Bayes
• More sophisticated method than the naive rule.
• The main idea is to integrate the information given in a set of predictors
into the Naive Rule to obtain more accurate classifications.
• The probability of a record belonging to a certain class is now evaluated
– Based on the prevalence of that class
– And on the additional information that is given on that record in terms of its X
information.
• Naive Bayes works only with predictors that are categorical.
– Numerical predictors must be binned and converted to categorical variables
before the Naive Bayes classifier can use them.
• The Naive Bayes method is very useful when very large datasets are
available.
– For instance, web-search companies like Google use naive Bayes classifiers to correct
misspellings that users type in. When you type a phrase that includes a misspelled word
into Google it suggests a spelling correction for the phrase. The suggestion(s) are based
on information not only on the frequencies of similarly-spelled words that were typed
by millions of other users, but also on the other words in the phrase.

Conditional Probabilities
• Classification Task
– Estimate the probability of membership in each class given
a certain set of predictor variables
• This type of probability is called “conditional
probability”
• A conditional probability of event A given event B
(denoted by P(A|B)) represents the chances of event
A occurring only under the scenario that event B
occurs.
• In the auditing example we are interested in
– P(fraudulent financial report | legal charges)

• To classify a record, we compute its chance of belonging to
each of the classes by computing P(Ci|X1,…,Xp) for each class
i. We then assign the record to the class that has the
highest probability.
• Since conditioning on an event means that we have
additional information (e.g., we know that legal charges
were filed against a company), uncertainty is reduced.
• In the auditing example the column headings are used as
predictors for the classification probabilities
– Column sums are sample size used to compute probabilities
– P(fraudulent financial report | legal charges)
• 50/230
– P(fraudulent financial report | no legal charges)
• 50/770
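The column-wise calculation can be sketched as follows. The counts table is an assumption for illustration: the two fraudulent counts (50 and 50) and the no-charges total (770) come from the fractions above, while the truthful counts (180 and 720) are filled in so that the table sums to the 1000 training records with 90% truthful. The helper name `p_class_given_x` is hypothetical.

```python
# Assumed training counts: rows are classes, columns are predictor values.
counts = {
    "fraudulent": {"charges": 50, "no_charges": 50},
    "truthful": {"charges": 180, "no_charges": 720},
}

def p_class_given_x(counts, cls, x):
    """Estimate P(class | X = x) as the cell count divided by its column sum."""
    column_total = sum(row[x] for row in counts.values())
    return counts[cls][x] / column_total

p_fraud_charges = p_class_given_x(counts, "fraudulent", "charges")
p_fraud_no_charges = p_class_given_x(counts, "fraudulent", "no_charges")
print(round(p_fraud_charges, 3), round(p_fraud_no_charges, 3))
```

Under these assumed counts, knowing that legal charges were filed raises the estimated chance of fraud well above the 10% base rate, while knowing that none were filed lowers it.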

A Practical Difficulty
• For N predictors and M classes, the training set
may need to be very large.
• To “fill in” the M × N table so that we can
compute the conditional probabilities would
require a large number of cases to avoid entries
of zero (cells with no cases in the table)
• Apples & Oranges Example

A Solution: Naive Bayes
• A solution that has been widely used is based on making the
simplifying assumption of predictor independence.
• If it is reasonable to assume that the predictors are all mutually
independent within each class, we can simplify the expression
considerably, making it useful in practice
• Independence of the predictors within each class gives us the
following simplification
– follows from the product rule for probabilities of independent events
(the probability of occurrence of multiple events is the product of the
probabilities of the individual event occurrences):
– P(X1, X2, …, Xp | Ci) = P(X1|Ci) × P(X2|Ci) × … × P(Xp|Ci)
• The terms on the right are estimated from frequency counts in the
training data, with the estimate of P(Xj|Ci) being equal to the
number of occurrences of the value xj in class Ci divided by the total
number of records in that class
• Example pgs. 97-98 (or pgs. 91-92 earlier edition) demonstrate
– P(Ci) × P(X1, X2, …, Xp | Ci), rescaled so that the values sum to 1 over the classes, approximates P(Ci | X1, …, Xp)
– Assuming a classification cutoff of 0.5
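The frequency-count estimates and the product rule above can be sketched in Python. This is a minimal illustration under stated assumptions: the ten training records (legal charges, company size) are hypothetical toy data, and the function names `fit_naive_bayes` and `posterior` are made up for this sketch rather than taken from any library.

```python
from collections import Counter, defaultdict

def fit_naive_bayes(records, labels):
    """Count class frequencies and, per class, the frequency of each predictor value."""
    class_counts = Counter(labels)
    # value_counts[cls][j][value] = number of class-cls records with X_j == value
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, cls in zip(records, labels):
        for j, value in enumerate(x):
            value_counts[cls][j][value] += 1
    return class_counts, value_counts

def posterior(x, class_counts, value_counts):
    """Multiply prior and per-predictor conditionals, then rescale to sum to 1."""
    n = sum(class_counts.values())
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / n  # prior P(Ci)
        for j, value in enumerate(x):
            score *= value_counts[cls][j][value] / n_cls  # estimate of P(Xj | Ci)
        scores[cls] = score
    total = sum(scores.values())
    return {cls: (s / total if total > 0 else 0.0) for cls, s in scores.items()}

# Hypothetical toy data: (legal charges filed?, company size)
records = [("y", "small"), ("n", "small"), ("n", "large"), ("n", "large"),
           ("n", "small"), ("n", "small"), ("y", "small"), ("y", "large"),
           ("n", "large"), ("y", "large")]
labels = ["truthful"] * 6 + ["fraudulent"] * 4

class_counts, value_counts = fit_naive_bayes(records, labels)
post = posterior(("y", "small"), class_counts, value_counts)
predicted = "fraudulent" if post["fraudulent"] > 0.5 else "truthful"  # cutoff of 0.5
print(post, predicted)
```

For the record ("y", "small") the rescaled score for the fraudulent class is 9/17, just above the 0.5 cutoff, so this toy record would be classified as fraudulent.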

Evaluation of the Model
• To evaluate the performance of the naive Bayes classifier for our
data, we use
– the classification matrix,
– lift charts,
– And measures described in Chapter 4.
• The classification matrices for the training and validation sets are
shown
• The overall error level is around 18% for both the training and
validation data.
• By comparison, a naive rule that classifies all 880 flights in the validation set
as on-time would miss the 172 delayed flights, resulting in a 20% error level.
• Naive Bayes is therefore only slightly more accurate than the naive rule.
• However, the lift chart shows the strength of Naive Bayes in capturing
the delayed flights well.
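The 20% figure for the naive rule follows directly from the counts quoted above. A short check, where the variable names are illustrative and the 18% Naive Bayes figure is the approximate value given on the slide:

```python
n_validation = 880   # flights in the validation set
n_delayed = 172      # delayed flights, all misclassified by the naive rule

naive_rule_error = n_delayed / n_validation
naive_bayes_error = 0.18  # approximate overall error quoted for Naive Bayes

print(f"naive rule error: {naive_rule_error:.1%}")
print(f"difference: {naive_rule_error - naive_bayes_error:.1%}")
```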

Evaluation of Naive Bayes Classifier
• The Naive Bayes classifier's advantages are in its
– simplicity, computational efficiency, and its good classification
performance.
– it often outperforms more sophisticated classifiers even when the
underlying assumption of independent predictors is far from true.
– This advantage is especially pronounced when the number of
predictors is very large.

• There are three main issues that should be kept in mind however.
– First, the Naive Bayes classifier requires a very large number of records to
obtain good results.
– Second, where a predictor category is not present in the training data, Naive
Bayes assumes that a new record with that category of the predictor has zero
probability.
• This can be a problem if this rare predictor value is important.
• For example, assume the target variable is “bought high value life insurance” and a
predictor category is “owns yacht”. If the training data have no records with “owns
yacht” = 1, then for any new record where “owns yacht” = 1, Naive Bayes will assign a
probability of 0 to the target variable “bought high value life insurance”.
• With no training records with “owns yacht” = 1, of course, no data mining technique
will be able to incorporate this potentially important variable into the classification
model; it will be ignored.
• With Naive Bayes, however, the absence of this predictor value actively “outvotes” any
other information in the record to assign a 0 to the target value (when, in this case,
it has a relatively good chance of being a 1).
• The presence of a large training set (and judicious binning of continuous variables,
if required) helps mitigate this effect.

• Finally, good performance is obtained when the goal is classification
or ranking of records according to their probability of belonging to a
certain class.
• However, when the goal is to actually estimate the probability of class
membership, this method provides very biased results.
– For this reason the Naive Bayes method is rarely used in credit scoring.
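The second issue above, a predictor value that never appears in the training data, can be seen directly in the product form of the Naive Bayes score. The sketch below uses made-up probabilities (the prior of 0.3 and the conditional estimates are hypothetical) and a hypothetical helper `class_score`; it only shows that one zero factor wipes out all other evidence.

```python
import math

def class_score(prior, conditional_probs):
    """Naive Bayes score: prior times the product of per-predictor conditionals."""
    return prior * math.prod(conditional_probs)

# Hypothetical estimates that strongly favor the class "bought insurance".
strong_evidence = [0.9, 0.8, 0.95]

score_without_yacht = class_score(0.3, strong_evidence)
# Add the factor for "owns yacht" = 1, estimated as 0 because no training
# record had that value.
score_with_yacht = class_score(0.3, strong_evidence + [0.0])

print(score_without_yacht, score_with_yacht)
```

However favorable the other predictors are, the score drops to exactly zero once the unseen value enters the product.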

k-Nearest Neighbors (k-NN)
• The idea in k-Nearest Neighbor methods is to identify k
observations in the training dataset that are similar to a new
record that we wish to classify.
• We then use these similar (neighboring) records to classify the
new record into a class, assigning the new record to the
predominant class among these neighbors.
• Denote by (x1, x2,…,xp) the values of the predictors for this new
record.
• We look for records in our training data that are similar or “near”
to the record to be classified in the predictor space, i.e., records
that have values close to x1, x2,…,xp.
• Then, based on the classes to which those proximate records
belong, we assign a class to the record that we want to classify.

• The k-Nearest Neighbor algorithm is a classification
method that does not make assumptions about the
form of the relationship between the class
membership (Y ) and the predictors x1, x2,…,xp.
• This is a non-parametric method because it does not
involve estimation of parameters in an assumed
function form such as the linear form that we
encountered in linear regression.
• This method draws information from similarities
between the predictor values of the records in the data
set.

• The central issue here is how to measure the distance between records
based on their predictor values.
• The most popular measure of distance is the Euclidean distance.
• The Euclidean distance between two records (x1, x2, …, xp) and (u1, u2, …, up) is
sqrt((x1 − u1)² + (x2 − u2)² + … + (xp − up)²)
• For simplicity, we continue here only with the Euclidean distance, but you
will find a host of other distance metrics in Chapters 12 (Cluster Analysis)
and 10 (Discriminant Analysis) for both numerical and categorical
variables.
• In most cases predictors should first be standardized before computing
Euclidean distance, to equalize the scales that the different predictors
may have.
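Standardizing and then measuring Euclidean distance can be sketched as below. The household values (income in $000s, lot size in 000s of square feet) are hypothetical, the function names are made up for this sketch, and using the population standard deviation is one assumed choice among several reasonable ones.

```python
import math
from statistics import mean, pstdev

def standardize(records):
    """Rescale each predictor column to mean 0 and standard deviation 1."""
    columns = list(zip(*records))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    scaled = [tuple((v - m) / s for v, m, s in zip(rec, means, stds))
              for rec in records]
    return scaled, means, stds

def euclidean(x, u):
    """Square root of the summed squared differences between two records."""
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))

# Hypothetical households: (income in $000s, lot size in 000s of square feet)
households = [(60.0, 18.4), (85.5, 16.8), (64.8, 21.6), (61.5, 20.8)]
scaled, means, stds = standardize(households)

raw_distance = euclidean(households[0], households[1])
scaled_distance = euclidean(scaled[0], scaled[1])
print(raw_distance, scaled_distance)
```

On the raw values the income column, with its larger numeric range, dominates the distance; after standardizing, both predictors contribute on a comparable scale.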

• After computing the distances between the record to be classified and
existing records, we need a rule to assign a class to the record to be
classified, based on the classes of its neighbors.
• The simplest case is k = 1 where we look for the record that is closest (the
nearest neighbor) to classify the new record as belonging to the same
class as its closest neighbor.
• This intuitive idea of using a single nearest neighbor to classify records can
be very powerful when we have a large number of records in our training
set.
• It is possible to prove that the misclassification rate of the 1-Nearest
Neighbor scheme is no more than twice the error rate that would be
achieved if we knew exactly the probability density functions for
each class.

• The idea of the 1-Nearest Neighbor can be extended
to k > 1 neighbors as follows:
– 1. Find the nearest k neighbors to the record to be
classified
– 2. Use a majority decision rule to classify the record,
where the record is classified as a member of the majority
class of the k neighbors.
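The two steps above can be sketched in Python. The function `knn_classify` and the six training points are hypothetical; the points are assumed to be already standardized.

```python
import math
from collections import Counter

def euclidean(x, u):
    return math.sqrt(sum((xi - ui) ** 2 for xi, ui in zip(x, u)))

def knn_classify(new_record, train_records, train_labels, k):
    """Step 1: find the k nearest training records. Step 2: take a majority vote."""
    distances = sorted(
        (euclidean(new_record, rec), label)
        for rec, label in zip(train_records, train_labels)
    )
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Hypothetical, already-standardized training data
train_records = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                 (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
train_labels = ["non-owner"] * 3 + ["owner"] * 3

new_record = (0.8, 0.9)
class_k1 = knn_classify(new_record, train_records, train_labels, k=1)
class_k3 = knn_classify(new_record, train_records, train_labels, k=3)
print(class_k1, class_k3)
```

This brute-force version sorts every training record by distance, which is fine for small data but is the step that becomes expensive on large training sets.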

Riding Mowers
• A riding-mower manufacturer would like to find a way of classifying
families in a city into those likely to purchase a riding mower and those
not likely to buy one.
• A pilot random sample of 12 owners and 12 non-owners in the city is
undertaken. The data are shown and plotted in the table on the next
slide.
• We first partition the data into training data (18 households) and
validation data (6 households).
• Obviously this dataset is too small for partitioning, but we continue with
this for illustration purposes.
• The data set is shown on the next slide.

Riding Mowers
• Consider a new household with $60,000 income and lot size 20,000 ft².
The training set is shown on the next slide.
• Among the households in the training set, the closest one to the new
household (in Euclidean distance after normalizing income and lot size) is
household #4, with $61,500 income and lot size 20,800 ft².
• If we use a 1-NN classifier, we would classify the new household as an
owner, like household #4.
• If we use k = 3, then the three nearest households are #4, #9, and #14.
• The first two are owners of riding mowers, and the last is a non-owner.
• The majority vote is therefore “owner”, and the new household would be
classified as an owner.

Choosing k
• The advantage of choosing k > 1 is that higher values of k provide
smoothing that reduces the risk of overfitting due to noise in the training
data.
• Generally speaking, if k is too low, we may be fitting to the noise in the
data.
• However, if k is too high, we will miss out on the method's ability to
capture the local structure in the data, one of its main advantages.
• In the extreme, k = n = the number of records in the training dataset.
– In that case we simply assign all records to the majority class in the training
data irrespective of the values of (x1, x2,…,xp), which coincides with the Naive
Rule!

Choosing k
• Setting k = n is clearly a case of over-smoothing, unless the predictors
carry no useful information about class membership.
• In other words, we want to balance between overfitting to the predictor
information and ignoring this information completely.
• A balanced choice depends on the nature of the data.
• The more complex and irregular the structure of the data, the lower the
optimum value of k.
• Typically, values of k fall in the range between 1 and 20.
• Often an odd number is chosen, to avoid ties.

Choosing k
• So how is k chosen?
– Answer: we choose that k which has the best classification performance.
• We use the training data to classify the records in the validation data,
then compute error rates for various choices of k.
• For our example, if we choose k = 1 we will classify in a way that is very
sensitive to the local characteristics of the training data.
• If we choose a large value of k such as k = 18 we would simply predict the
most frequent class in the dataset in all cases.
• This is a very stable prediction but it completely ignores the information in
the predictors.

• To find a balance we examine the misclassification rate (of the validation
set) that results for different choices of k between 1 and 18.
• This is shown on a previous slide. We would choose k = 8, which
minimizes the misclassification rate in the validation set.
• Note that the validation set has now been used to choose k, so it acts as an
extension of the training set and no longer serves as a “hold-out” set as before.
• We need a third test set to evaluate the performance of the method on
data that it did not see.
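The search over k can be sketched as a loop that classifies the validation records for each candidate k and records the error rate. Everything here is illustrative: the one-predictor toy data, the choice to try only odd values of k, and the function names are assumptions, not the riding-mower data or the slide's results.

```python
import math
from collections import Counter

def knn_classify(new_record, train_records, train_labels, k):
    """Majority vote among the k training records nearest to new_record."""
    distances = sorted(
        (math.sqrt(sum((a - b) ** 2 for a, b in zip(new_record, rec))), label)
        for rec, label in zip(train_records, train_labels)
    )
    return Counter(label for _, label in distances[:k]).most_common(1)[0][0]

def validation_error(k, train_X, train_y, valid_X, valid_y):
    """Share of validation records misclassified when using k neighbors."""
    wrong = sum(knn_classify(x, train_X, train_y, k) != y
                for x, y in zip(valid_X, valid_y))
    return wrong / len(valid_y)

# Hypothetical toy data with one noisy training record at 2.5
train_X = [(1.0,), (1.5,), (2.0,), (2.5,), (3.0,), (6.0,), (6.5,), (7.0,), (7.5,)]
train_y = ["A", "A", "A", "B", "A", "B", "B", "B", "B"]
valid_X = [(2.4,), (2.8,), (6.2,), (1.2,)]
valid_y = ["A", "A", "B", "A"]

# Odd k only, to avoid ties in the vote
error_rates = {k: validation_error(k, train_X, train_y, valid_X, valid_y)
               for k in range(1, len(train_X) + 1, 2)}
best_k = min(error_rates, key=error_rates.get)  # smallest k among the best
print(error_rates, best_k)
```

In this toy run k = 1 is misled by the noisy record, while k = 9 (all training records) reproduces the naive rule; the chosen k sits in between. Since the validation data drove the choice, its error rate is an optimistic estimate, which is why a separate test set is needed.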

k-NN for a Quantitative Response
(Continuous Response Variable)
• The idea of k-NN can be readily extended to predicting a continuous value
• Instead of taking a majority vote of the neighbors to determine class, we
take the average response value of the k nearest neighbors to determine
the prediction.
• Often this average is a weighted average with the weight decreasing with
increasing distance from the point at which the prediction is required.
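A minimal sketch of k-NN prediction for a continuous response follows. The inverse-distance weighting is one assumed weighting scheme (the slide does not specify one), and the function name `knn_predict`, the small constant added to the distance, and the toy data are all hypothetical.

```python
import math

def knn_predict(new_record, train_records, train_values, k, weighted=False):
    """Average the responses of the k nearest neighbors, optionally weighting
    each neighbor by the inverse of its distance (an assumed scheme)."""
    neighbors = sorted(
        (math.dist(new_record, rec), value)
        for rec, value in zip(train_records, train_values)
    )[:k]
    if not weighted:
        return sum(value for _, value in neighbors) / k
    # The small constant avoids division by zero for an exact match.
    weights = [1.0 / (d + 1e-9) for d, _ in neighbors]
    return sum(w * v for w, (_, v) in zip(weights, neighbors)) / sum(weights)

# Hypothetical toy data: one predictor, continuous response
train_records = [(1.0,), (2.0,), (4.0,)]
train_values = [10.0, 20.0, 40.0]

plain = knn_predict((1.25,), train_records, train_values, k=2)
weighted = knn_predict((1.25,), train_records, train_values, k=2, weighted=True)
print(plain, weighted)
```

The plain average treats both neighbors equally, whereas the weighted version pulls the prediction toward the response of the closer neighbor.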

Evaluation of k-NN Algorithms
• The main advantage of k-NN methods is their simplicity and lack of
parametric assumptions.
• In the presence of a large enough training set, these methods perform
surprisingly well, especially when each class is characterized by multiple
combinations of predictor values.
• For instance, in the flight delays example there are likely to be multiple
combinations of carrier-destination-arrival-time etc. that characterize
delayed flights vs. on-time flights.

• While there is no time required to estimate parameters from the training
data (as would be the case for parametric models such as regression), the
time to find the nearest neighbors in a large training set can be
prohibitive.
• A number of ideas have been implemented to overcome this difficulty.
• The main ideas are:
– Reduce the time taken to compute distances by working in a reduced
dimension using dimension reduction techniques such as principal
components analysis (Chapter 3).
– Use sophisticated data structures such as search trees to speed up
identification of the nearest neighbor. This approach often settles for an
“almost nearest” neighbor to improve speed.
– Edit the training data to remove redundant or “almost redundant” points to
speed up the search for the nearest neighbor.
• An example is to remove records in the training set that have no effect on the classification
because they are surrounded by records that all belong to the same class.

• The number of records required in the training set to qualify as large
increases exponentially with the number of predictors p.
• This is because the expected distance to the nearest neighbor goes up
dramatically with p unless the size of the training set increases
exponentially with p.
– This phenomenon is known as “the curse of dimensionality”.
– The curse of dimensionality is a fundamental issue pertinent to all
classification, prediction and clustering techniques.
• We often seek to reduce the dimensionality of the space of predictor
variables through methods
– such as selecting subsets of the predictors for our model or
– by combining them using methods such as principal components
analysis, singular value decomposition, and factor analysis.
• In the artificial intelligence literature dimension reduction is often
referred to as factor selection or feature extraction.
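The growth of the nearest-neighbor distance with the number of predictors p can be illustrated with a small simulation. This is a sketch under stated assumptions: predictors are drawn uniformly on [0, 1], the training set size is held fixed at 500, and the function name and seed are arbitrary.

```python
import math
import random

def mean_nn_distance(n_records, p, n_queries=50, seed=0):
    """Average distance from random query points to their nearest training
    record, with all predictors drawn uniformly from [0, 1]."""
    rng = random.Random(seed)
    train = [[rng.random() for _ in range(p)] for _ in range(n_records)]
    total = 0.0
    for _ in range(n_queries):
        query = [rng.random() for _ in range(p)]
        total += min(math.dist(query, rec) for rec in train)
    return total / n_queries

# Same number of training records, increasing number of predictors
distances = {p: mean_nn_distance(500, p) for p in (1, 2, 10, 50)}
for p, d in distances.items():
    print(p, round(d, 3))
```

With the training set size held fixed, the "nearest" neighbor moves steadily farther away as p grows, so it becomes less representative of the record being classified.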

Problems
• Personal Loan Acceptance
• Automobile Accidents
