Name: ID:
Practical Data Mining
COMP-321B
Tutorial 5: Document Classification
Eibe Frank
July 17, 2014
© 2008-2014 University of Waikato
1 Introduction
A popular application area of machine learning is text classification. You may have actually encountered machine-learning-based text classification before: so-called “Bayesian” personalized spam filters are essentially document classifiers that classify email messages (i.e., short documents) into two groups, namely junk and not junk. They are trained on messages that have been manually labeled—perhaps by putting them into appropriate folders (e.g., “ham” vs. “spam”). In this tutorial, we look at how to perform document classification using tools in WEKA. Please submit this tutorial to the teaching staff at the end of the lab.
2 Data with string attributes
The raw data in document classification is text, but most machine learning algorithms expect examples that are described by a fixed set of nominal or numeric attributes. Hence we need to convert the text data into a form suitable for learning. This is usually done by creating a dictionary of terms from all the documents in the training corpus and making a numeric attribute for each of the terms. Then, for a particular document, the value of an attribute is based on the frequency of the corresponding term in the document—or it simply indicates the presence or absence of the term in the document.
The resulting data is usually very high-dimensional because the dictionary consists of hundreds or thousands of terms. Each document is thus represented as a vector in a high-dimensional space. There is obviously also the class attribute, which has the label of each document, and which we want to learn to predict based on the contents of the document.
The filter weka.filters.unsupervised.attribute.StringToWordVector can be used to perform the conversion of raw text into term-frequency-based attributes for us. The filter assumes that the text of the documents is stored in an attribute of type string, a nominal attribute without a pre-specified set of values. In the filtered data, this string attribute is replaced by a fixed set of numeric attributes, and the class attribute is made the first attribute in the resulting set. The numeric attributes have values 0 and 1 by default, indicating whether a term is absent or present in a document.
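
If you want to reproduce this conversion outside the Explorer, a minimal Java sketch along the following lines will do. The file name documents.arff is a hypothetical placeholder, and the API calls are those of the standard WEKA distribution (names may vary slightly between versions):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ConvertText {
      public static void main(String[] args) throws Exception {
        // Load the raw documents (string attribute plus nominal class).
        Instances raw = DataSource.read("documents.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Replace the string attribute with binary word-presence attributes.
        StringToWordVector filter = new StringToWordVector();
        // filter.setMinTermFreq(2);  // keep only terms occurring at least twice
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " attributes generated");
      }
    }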
To perform document classification in WEKA, we first need to create an appropriate ARFF file with a string attribute that holds the documents’ text—declared in the header of the ARFF file using @attribute document string, where document is the name of the attribute. We also need a nominal attribute that holds the documents’ classifications.
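
Such a file might begin as follows, using the first two mini-documents from Table 1 below; the relation name is our choice, and single quotes protect string values that contain spaces:

    @relation training_documents

    @attribute document string
    @attribute class {yes, no}

    @data
    'The price of crude oil has increased significantly', yes
    'Demand of crude oil outstrips supply', yes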
To get some feeling for how this works, make an ARFF file from the labeled mini-“documents” in Table 1 and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to two. What are the attributes that are generated?
Document text Classification
The price of crude oil has increased significantly yes
Demand of crude oil outstrips supply yes
Some people do not like the flavor of olive oil no
The food was very oily no
Crude oil is in short supply yes
Use a bit of cooking oil in the frying pan no
Table 1: Training “documents”.
Answer:
Build a J48 decision tree from the last version of the data you generated. State the tree that is output, in textual form.
Answer:
Usually, we want to build a classifier to classify new documents. Let us assume we want to classify the documents given in Table 2, once they have been converted into ARFF format, using question marks for the missing class labels. Classification should be done based on the decision tree classifier generated from the documents in Table 1.
To do this, we need to use the FilteredClassifier in conjunction with the StringToWordVector filter and the base classifier that we want to apply (i.e., J48). Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify the data from Table 2 as the test set. (Ignore the fact that the Explorer states that the data has an unknown number of instances: it loads the data incrementally when the classifier is tested.) The data from Table 1 should be the training data (in unfiltered, raw form, so that its format is consistent with that of the test set), so make sure that data has been loaded in the Preprocess panel.

Document text Classification
Oil platforms extract crude oil Unknown
Canola oil is supposed to be healthy Unknown
Iraq has significant oil reserves Unknown
There are different types of cooking oil Unknown
Table 2: Test “documents”.
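
If you prefer to script this step, a sketch along the following lines builds the FilteredClassifier on the raw training documents and prints a predicted label for each test document. The file names train.arff and test.arff are hypothetical placeholders for the ARFF versions of Tables 1 and 2:

    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ClassifyDocuments {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // Table 1 data, raw form
        Instances test = DataSource.read("test.arff");    // Table 2 data, '?' labels
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The filter is applied inside the classifier, so the raw string
        // attribute is converted consistently for training and test data.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());
        fc.buildClassifier(train);

        for (int i = 0; i < test.numInstances(); i++) {
          double label = fc.classifyInstance(test.instance(i));
          System.out.println(test.classAttribute().value((int) label));
        }
      }
    }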
We want to look at the classifications for the test data, so make sure that you select Output predictions under More options... in the Classify panel. After you have hit Start, take a look at the model and the predictions generated. What are the predictions (in the order in which the documents are listed in Table 2)? Please explain how they come about.
Answer:
3 Classifying actual short text documents
A well-known dataset for evaluating document classifiers is a corpus consisting of a collection of Reuters newswire articles. In C:\Program Files\Weka-3-6\data, you can find some training data derived from this benchmark corpus, more specifically ReutersCorn-train.arff and ReutersGrain-train.arff. There are also corresponding test datasets in *-test.arff. We will use these to experiment with two learning algorithms and the StringToWordVector filter.
The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles that talk about corn-related issues are labeled with a class value of 1, and other articles are labeled with the value 0. In the second dataset, the same kind of labeling is applied, but with respect to grain-related issues. In the first case, we want to build a classification model that can be used to identify articles that talk about corn; in the second case, we want to do the same for grain.
Let us now consider document classification using decision trees vs. document classification using a naive Bayes classifier. To this end, build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. (NaiveBayesMultinomial implements the multinomial naive Bayes model discussed in class that is specifically designed for document classification.)
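
As a sketch, one of the four scenarios (corn data with J48) can be evaluated programmatically as follows; swapping in the grain files or NaiveBayesMultinomial covers the remaining three. The evaluation calls are standard WEKA API:

    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ReutersEval {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("ReutersCorn-train.arff");
        Instances test = DataSource.read("ReutersCorn-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());  // or new NaiveBayesMultinomial()
        fc.buildClassifier(train);

        // Evaluate on the held-out test documents.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.pctCorrect() + "% correct");
      }
    }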
What is the percentage of correct classifications obtained for the four scenarios? Which classifier would you deploy based on your results?
Answer:
Percent correct is actually not the only evaluation metric that is commonly used for document classification problems. WEKA lists several other “per-class” evaluation statistics that are often used for evaluating information retrieval systems like search engines, under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN) in the test data. A true positive is an example in the test data that is classified correctly as an example of the target class concerned; a false positive is a (negative) example that is incorrectly assigned to the target class. FN and TN are defined analogously. (In the information retrieval setting, true positives would be documents or web sites that are returned by the search engine and that are actually relevant, false positives would be irrelevant documents that are returned by the search engine, and so on.)
It is clear that these numbers can only be computed on a per-class basis, where the corresponding class is considered the positive class, and all other classes are considered joined together as the negative class. Based on this approach, the statistics output by WEKA under Detailed Accuracy By Class are computed as follows (ROC Area, another statistic that is output by WEKA, is discussed further below):
• TP Rate: TP / (TP + FN)
• FP Rate: FP / (FP + TN)
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-Measure: the harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall)
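
For reference, the same per-class statistics can be read from a WEKA Evaluation object such as the one in the earlier sketch; the index of the class treated as positive is an assumption here:

    // Per-class statistics from the Evaluation object used above.
    int pos = 1;  // assumed index of the positive class
    System.out.println("TP Rate:   " + eval.truePositiveRate(pos));
    System.out.println("FP Rate:   " + eval.falsePositiveRate(pos));
    System.out.println("Precision: " + eval.precision(pos));
    System.out.println("Recall:    " + eval.recall(pos));
    System.out.println("F-Measure: " + eval.fMeasure(pos));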
Based on the formulas, what are the best possible values for each of the statistics in this list, considering an arbitrary dataset and classifier? Describe in English when these values are attained, assuming we treat newswire articles from the above corn category as positive examples and other articles as negative examples.
Answer:
In the Classifier Output, there is also the ROC Area, which differs from these other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Note that most classifiers in WEKA can produce class probabilities in addition to discrete classifications.) The ROC Area (also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.
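
Programmatically, this statistic is available from the same Evaluation object as before; a one-line sketch, reusing the pos index from the earlier snippet:

    // AUC with respect to the class treated as positive.
    System.out.println("ROC Area: " + eval.areaUnderROC(pos));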
The best outcome is that all positive examples are ranked above all negative examples, in which case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC that is at least 0.5; otherwise our classifier has not learned anything from the training data. Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
Answer:
Interestingly, there is a close relationship between ROC Area and TP Rate/FP Rate. Rather than just obtaining a single pair of values for the latter statistics, we can obtain a whole range of value pairs by imposing different classification thresholds on the class probabilities estimated by the classifier: in the ranked list of instances, ranked according to their estimated probability of being positive, we simply count how many truly positive and negative instances exhibit a probability above the specified threshold.
By default, an instance is classified as “positive” if the estimated probability for the positive class exceeds 0.5; otherwise it is classified as negative (i.e., the most likely class is predicted). We can change this threshold from 0.5 to some other value between 0 and 1, and recompute the TP Rate/FP Rate. By repeating this with all possible different thresholds, we can compute what is called an ROC curve. You can graph it in WEKA by right-clicking on the Result list entry corresponding to the classification model you are evaluating, and selecting Visualize threshold curve.
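
The numbers behind this curve can also be computed directly with WEKA's ThresholdCurve class, which turns the evaluation's stored predictions into one row of statistics per threshold. A sketch, reusing eval and pos from the earlier snippets (exact signatures vary slightly between WEKA versions):

    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instances;

    // One row per candidate threshold: TP/FP counts and rates, precision, recall, ...
    ThresholdCurve tc = new ThresholdCurve();
    Instances curve = tc.getCurve(eval.predictions(), pos);

    // Area under the piecewise-linear curve: this reproduces the ROC Area statistic.
    System.out.println("AUC: " + ThresholdCurve.getROCArea(curve));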
When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from (Area Under the Curve).

What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch is sufficient)? (Note that this corresponds to the case where all positive instances are ranked above all negative ones when sorted according to the estimated probability of being positive.)
Answer:
Using the threshold curve GUI, you can also plot other types of curves, e.g., a precision/recall curve (with Recall on the x axis and Precision on the y axis). This plots precision/recall values for each probability threshold evaluated. Change the axes to obtain a precision/recall curve.
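
The same ThresholdCurve output covers this plot as well; a sketch that prints the pairs without the GUI, assuming the curve Instances from the previous snippet and that the attribute names match those produced by ThresholdCurve:

    // Print one recall/precision pair per threshold from the curve data.
    int rec = curve.attribute("Recall").index();
    int prec = curve.attribute("Precision").index();
    for (int i = 0; i < curve.numInstances(); i++) {
      System.out.println(curve.instance(i).value(rec) + "\t"
          + curve.instance(i).value(prec));
    }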
What shape does the ideal precision/recall curve corresponding to perfect performance have (a rough sketch is again sufficient)?
Answer:
4 Exploring the StringToWordVector filter
By default, the StringToWordVector filter simply outputs binary attributes for all raw single-word terms. These attributes indicate presence and absence of a term as the attribute values in the transformed dataset. However, there are many options that can be changed (a configuration sketch follows the list), e.g.:
• outputWordCounts: if set to true, actual word frequencies are output, rather than just binary values.

• IDFTransform and TFTransform: when both are set to true, raw term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.

• stemmer: allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.

• useStopList: allows you to determine whether stop words are deleted. Stop words are words that are usually not informative (e.g., articles).

• tokenizer: allows you to choose a different tokenizer for generating terms, e.g., one that produces word n-grams instead of single words.
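
As promised above, here is a sketch of one possible configuration combining several of these options; the particular combination is only an example, not a recommendation, and some option names (notably useStopList) changed between WEKA versions:

    import weka.core.stemmers.LovinsStemmer;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector swv = new StringToWordVector();
    swv.setOutputWordCounts(true);   // word frequencies instead of 0/1
    swv.setTFTransform(true);        // together with IDF: TF x IDF weights
    swv.setIDFTransform(true);
    swv.setStemmer(new LovinsStemmer());
    swv.setUseStoplist(true);        // drop common stop words (WEKA 3.6 name)
    NGramTokenizer tok = new NGramTokenizer();
    tok.setNGramMinSize(1);
    tok.setNGramMaxSize(2);          // single words and word bigrams
    swv.setTokenizer(tok);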
However, there are several other useful options as well. Check the descriptions in the GenericObjectEditor for more information by clicking on More. Experiment with the options that are available. What options give you the best AUC for the two datasets from above when NaiveBayesMultinomial is applied as the classifier? (Note that an exhaustive search is not required.)
Answer:
Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of a document. We can use WEKA's AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try to eliminate attributes that are not so useful. As before, we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier.
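
A sketch of this nested set-up; the value passed to setNumToSelect is an arbitrary starting point for the experiment:

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Inner classifier: rank terms by information gain, keep the top ones.
    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(new InfoGainAttributeEval());
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(100);  // arbitrary example value; vary this
    asc.setSearch(ranker);
    asc.setClassifier(new NaiveBayesMultinomial());

    // Outer classifier: convert strings to word vectors first.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(asc);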
Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker method. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets from above? What AUC values are the best ones you manage to obtain? (An exhaustive search is again not required.)
Answer:
