Name: ID:
Practical Data Mining
COMP-321B
Tutorial 5: Document Classification
Eibe Frank
July 17, 2014
© 2008-2014 University of Waikato
1 Introduction
A popular application area of machine learning is text classification. You may have actually encountered machine-learning-based text classification before: so-called “Bayesian” personalized spam filters are essentially document classifiers that classify email messages (i.e., short documents) into two groups, namely junk and not junk. They are trained on messages that have been manually labeled—perhaps by putting them into appropriate folders (e.g., “ham” vs. “spam”). In this tutorial, we look at how to perform document classification using tools in WEKA. Please submit this tutorial to the teaching staff at the end of the lab.
2 Data with string attributes
The raw data in document classification is text, but most machine learning algorithms expect examples that are described by a fixed set of nominal or numeric attributes. Hence we need to convert the text data into a form suitable for learning. This is usually done by creating a dictionary of terms from all the documents in the training corpus and making a numeric attribute for each of the terms. Then, for a particular document, the value of an attribute is based on the frequency of the corresponding term in the document—or it simply indicates the presence or absence of the term in the document.
The resulting data is usually very high-dimensional because the dictionary consists of hundreds or thousands of terms. Each document is thus represented as a vector in a high-dimensional space. There is obviously also the class attribute, which has the label of each document, and which we want to learn to predict based on the contents of the document.
The filter weka.filters.unsupervised.attribute.StringToWordVector can be used to perform the conversion of raw text into term-frequency-based attributes for us. The filter assumes that the text of the documents is stored in an attribute of type string, a nominal attribute without a pre-specified set of values. In the filtered data, this string attribute is replaced by a fixed set of numeric attributes, and the class attribute is made the first attribute in the resulting set. The numeric attributes have values 0 and 1 by default, indicating whether a term is absent or present in a document.
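
If you want to reproduce this conversion outside the Explorer, a minimal Java sketch along the following lines will do. The file name documents.arff is a hypothetical placeholder, and the API calls are those of the standard WEKA distribution (names may vary slightly between versions):

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ConvertText {
      public static void main(String[] args) throws Exception {
        // Load the raw documents (string attribute plus nominal class).
        Instances raw = DataSource.read("documents.arff");
        raw.setClassIndex(raw.numAttributes() - 1);

        // Replace the string attribute with binary word-presence attributes.
        StringToWordVector filter = new StringToWordVector();
        // filter.setMinTermFreq(2);  // keep only terms occurring at least twice
        filter.setInputFormat(raw);
        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " attributes generated");
      }
    }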
To perform document classification in WEKA, we first need to create an appropriate ARFF file with a string attribute that holds the documents’ text—declared in the header of the ARFF file using @attribute document string, where document is the name of the attribute. We also need a nominal attribute that holds the documents’ classifications.
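
Such a file might begin as follows, using the first two mini-documents from Table 1 below; the relation name is our choice, and single quotes protect string values that contain spaces:

    @relation training_documents

    @attribute document string
    @attribute class {yes, no}

    @data
    'The price of crude oil has increased significantly', yes
    'Demand of crude oil outstrips supply', yes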
To get some feeling for how this works, make an ARFF file from the labeled mini-“documents” in Table 1 and run StringToWordVector with default options on this data. How many attributes are generated? Now change the value of the option minTermFreq to two. What are the attributes that are generated?
Document text Classification
The price of crude oil has increased significantly yes
Demand of crude oil outstrips supply yes
Some people do not like the flavor of olive oil no
The food was very oily no
Crude oil is in short supply yes
Use a bit of cooking oil in the frying pan no
Table 1: Training “documents”.
Answer:
Build a J48 decision tree from the last version of the data you generated. State the tree that is output, in textual form.
Answer:
Usually, we want to build a classifier to classify new documents. Let us assume we want to classify the documents given in Table 2, once they have been converted into ARFF format, using question marks for the missing class labels. Classification should be done based on the decision tree classifier generated from the documents in Table 1.
To do this, we need to use the FilteredClassifier in conjunction with the StringToWordVector filter and the base classifier that we want to apply (i.e., J48). Configure the FilteredClassifier using default options for StringToWordVector and J48, and specify the data from Table 2 as the test set. (Ignore the fact that the Explorer states that the data has an unknown number of instances: it loads the data incrementally when the classifier is tested.) The data from Table 1 should be the training data (in unfiltered, raw form, so that its format is consistent with that of the test set), so make sure that data has been loaded in the Preprocess panel.

Document text Classification
Oil platforms extract crude oil Unknown
Canola oil is supposed to be healthy Unknown
Iraq has significant oil reserves Unknown
There are different types of cooking oil Unknown
Table 2: Test “documents”.
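
If you prefer to script this step, a sketch along the following lines builds the FilteredClassifier on the raw training documents and prints a predicted label for each test document. The file names train.arff and test.arff are hypothetical placeholders for the ARFF versions of Tables 1 and 2:

    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ClassifyDocuments {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("train.arff");  // Table 1 data, raw form
        Instances test = DataSource.read("test.arff");    // Table 2 data, '?' labels
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // The filter is applied inside the classifier, so the raw string
        // attribute is converted consistently for training and test data.
        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());
        fc.buildClassifier(train);

        for (int i = 0; i < test.numInstances(); i++) {
          double label = fc.classifyInstance(test.instance(i));
          System.out.println(test.classAttribute().value((int) label));
        }
      }
    }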
We want to look at the classifications for the test data, so make sure that you select Output predictions under More options... in the Classify panel. After you have hit Start, take a look at the model and the predictions generated. What are the predictions (in the order in which the documents are listed in Table 2)? Please explain how they come about.
Answer:
3 Classifying actual short text documents
A well-known dataset for evaluating document classifiers is a corpus consisting of a collection of Reuters newswire articles. In C:\Program Files\Weka-3-6\data, you can find some training data derived from this benchmark corpus, more specifically ReutersCorn-train.arff and ReutersGrain-train.arff. There are also corresponding test datasets in *-test.arff. We will use these to experiment with two learning algorithms and the StringToWordVector filter.
The actual documents in the corn and grain data are the same; only the labels differ. In the first dataset, articles that talk about corn-related issues are labeled with a class value of 1, and other articles are labeled with the value 0. In the second dataset, the same kind of labeling is applied, but with respect to grain-related issues. In the first case, we want to build a classification model that can be used to identify articles that talk about corn; in the second case, we want to do the same for grain.
Let us now consider document classification using decision trees vs. document classification using a naive Bayes classifier. To this end, build document classifiers for the two training sets by applying the FilteredClassifier with StringToWordVector using (a) J48 and (b) NaiveBayesMultinomial, in each case evaluating them on the corresponding test set. (NaiveBayesMultinomial implements the multinomial naive Bayes model discussed in class that is specifically designed for document classification.)
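
As a sketch, one of the four scenarios (corn data with J48) can be evaluated programmatically as follows; swapping in the grain files or NaiveBayesMultinomial covers the remaining three. The evaluation calls are standard WEKA API:

    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    public class ReutersEval {
      public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("ReutersCorn-train.arff");
        Instances test = DataSource.read("ReutersCorn-test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        FilteredClassifier fc = new FilteredClassifier();
        fc.setFilter(new StringToWordVector());
        fc.setClassifier(new J48());  // or new NaiveBayesMultinomial()
        fc.buildClassifier(train);

        // Evaluate on the held-out test documents.
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(fc, test);
        System.out.println(eval.pctCorrect() + "% correct");
      }
    }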
What is the percentage of correct classifications obtained for the four scenarios? Which classifier would you deploy based on your results?
Answer:
Percent correct is actually not the only evaluation metric that is commonly used for document classification problems. WEKA lists several other “per-class” evaluation statistics that are often used for evaluating information retrieval systems like search engines, under Detailed Accuracy By Class in the Classifier output text area. They are based on the number of true positives (TP), the number of false positives (FP), the number of true negatives (TN), and the number of false negatives (FN) in the test data. A true positive is an example in the test data that is classified correctly as an example of the target class concerned; a false positive is a (negative) example that is incorrectly assigned to the target class. FN and TN are defined analogously. (In the information retrieval setting, true positives would be documents or web sites that are returned by the search engine and that are actually relevant, false positives would be irrelevant documents that are returned by the search engine, and so on.)
It is clear that these numbers can only be computed on a per-class basis, where the corresponding class is considered the positive class, and all other classes are considered joined together as the negative class. Based on this approach, the statistics output by WEKA under Detailed Accuracy By Class are computed as follows (ROC Area, another statistic that is output by WEKA, is discussed further below):
• TP Rate: TP / (TP + FN)
• FP Rate: FP / (FP + TN)
• Precision: TP / (TP + FP)
• Recall: TP / (TP + FN)
• F-Measure: the harmonic mean of precision and recall, i.e., 2 × Precision × Recall / (Precision + Recall)
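
For reference, the same per-class statistics can be read from a WEKA Evaluation object such as the one in the earlier sketch; the index of the class treated as positive is an assumption here:

    // Per-class statistics from the Evaluation object used above.
    int pos = 1;  // assumed index of the positive class
    System.out.println("TP Rate:   " + eval.truePositiveRate(pos));
    System.out.println("FP Rate:   " + eval.falsePositiveRate(pos));
    System.out.println("Precision: " + eval.precision(pos));
    System.out.println("Recall:    " + eval.recall(pos));
    System.out.println("F-Measure: " + eval.fMeasure(pos));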
Based on the formulas, what are the best possible values for each of the statistics in this list, considering an arbitrary dataset and classifier? Describe in English when these values are attained, assuming we treat newswire articles from the above corn category as positive examples and other articles as negative examples.
Answer:
In the Classifier Output, there is also the ROC Area, which differs from these other statistics because it is based on ranking the examples in the test data according to how likely they are to belong to the positive class. The likelihood is given by the class probability that the classifier predicts. (Note that most classifiers in WEKA can produce class probabilities in addition to discrete classifications.) The ROC Area (also known as AUC) is the probability that a randomly chosen positive instance in the test data is ranked above a randomly chosen negative instance, based on the ranking produced by the classifier.
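
Programmatically, this statistic is available from the same Evaluation object as before; a one-line sketch, reusing the pos index from the earlier snippet:

    // AUC with respect to the class treated as positive.
    System.out.println("ROC Area: " + eval.areaUnderROC(pos));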
The best outcome is that all positive examples are ranked above all negative examples, in which case the AUC is one. In the worst case it is zero. In the case where the ranking is essentially random, the AUC is 0.5. Hence we want an AUC that is at least 0.5; otherwise our classifier has not learned anything from the training data. Which of the two classifiers used above produces the best AUC for the two Reuters datasets? Compare this to the outcome for percent correct. What do the different outcomes mean?
Answer:
Interestingly, there is a close relationship between ROC Area and TP Rate/FP Rate. Rather than just obtaining a single pair of values for the latter statistics, we can obtain a whole range of value pairs by imposing different classification thresholds on the class probabilities estimated by the classifier: in the ranked list of instances, ranked according to their estimated probability of being positive, we simply count how many truly positive and negative instances exhibit a probability above the specified threshold.
By default, an instance is classified as “positive” if the estimated probability for the positive class exceeds 0.5; otherwise it is classified as negative (i.e., the most likely class is predicted). We can change this threshold from 0.5 to some other value between 0 and 1, and recompute the TP Rate/FP Rate. By repeating this with all possible different thresholds, we can compute what is called an ROC curve. You can graph it in WEKA by right-clicking on the Result list entry corresponding to the classification model you are evaluating, and selecting Visualize threshold curve.
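
The numbers behind this curve can also be computed directly with WEKA's ThresholdCurve class, which turns the evaluation's stored predictions into one row of statistics per threshold. A sketch, reusing eval and pos from the earlier snippets (exact signatures vary slightly between WEKA versions):

    import weka.classifiers.evaluation.ThresholdCurve;
    import weka.core.Instances;

    // One row per candidate threshold: TP/FP counts and rates, precision, recall, ...
    ThresholdCurve tc = new ThresholdCurve();
    Instances curve = tc.getCurve(eval.predictions(), pos);

    // Area under the piecewise-linear curve: this reproduces the ROC Area statistic.
    System.out.println("AUC: " + ThresholdCurve.getROCArea(curve));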
When you do this, you get a plot with FP Rate on the x axis and TP Rate on the y axis. Depending on the classifier you use, this plot can be quite smooth, or it can be fairly discrete. The interesting thing is that if you connect the dots shown in the plot by lines, and you compute the area under the resulting curve, you get the ROC Area discussed above! That is where the acronym AUC for the ROC Area comes from (Area Under the Curve).

What does the ideal ROC curve corresponding to perfect performance look like (a rough sketch is sufficient)? (Note that this corresponds to the case where all positive instances are ranked above all negative ones when sorted according to the estimated probability of being positive.)
Answer:
Using the threshold curve GUI, you can also plot other types of curves, e.g., a precision/recall curve (with Recall on the x axis and Precision on the y axis). This plots precision/recall values for each probability threshold evaluated. Change the axes to obtain a precision/recall curve.
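
The same ThresholdCurve output covers this plot as well; a sketch that prints the pairs without the GUI, assuming the curve Instances from the previous snippet and that the attribute names match those produced by ThresholdCurve:

    // Print one recall/precision pair per threshold from the curve data.
    int rec = curve.attribute("Recall").index();
    int prec = curve.attribute("Precision").index();
    for (int i = 0; i < curve.numInstances(); i++) {
      System.out.println(curve.instance(i).value(rec) + "\t"
          + curve.instance(i).value(prec));
    }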
What shape does the ideal precision/recall curve corresponding to perfect performance have (a rough sketch is again sufficient)?
Answer:
4 Exploring the StringToWordVector filter
By default, the StringToWordVector filter simply outputs binary attributes for all raw single-word terms. These attributes indicate presence and absence of a term as the attribute values in the transformed dataset. However, there are many options that can be changed (a configuration sketch follows the list), e.g.:
• outputWordCounts: if set to true, actual word frequencies are output, rather than just binary values.

• IDFTransform and TFTransform: when both are set to true, raw term frequencies are transformed into so-called TF × IDF values that are popular for representing documents in information retrieval applications.

• stemmer: allows you to choose from different word stemming algorithms that attempt to reduce words to their stems.

• useStopList: allows you to determine whether stop words are deleted. Stop words are words that are usually not informative (e.g., articles).

• tokenizer: allows you to choose a different tokenizer for generating terms, e.g., one that produces word n-grams instead of single words.
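
As promised above, here is a sketch of one possible configuration combining several of these options; the particular combination is only an example, not a recommendation, and some option names (notably useStopList) changed between WEKA versions:

    import weka.core.stemmers.LovinsStemmer;
    import weka.core.tokenizers.NGramTokenizer;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    StringToWordVector swv = new StringToWordVector();
    swv.setOutputWordCounts(true);   // word frequencies instead of 0/1
    swv.setTFTransform(true);        // together with IDF: TF x IDF weights
    swv.setIDFTransform(true);
    swv.setStemmer(new LovinsStemmer());
    swv.setUseStoplist(true);        // drop common stop words (WEKA 3.6 name)
    NGramTokenizer tok = new NGramTokenizer();
    tok.setNGramMinSize(1);
    tok.setNGramMaxSize(2);          // single words and word bigrams
    swv.setTokenizer(tok);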
However, there are several other useful options as well. Check the descriptions in the GenericObjectEditor for more information by clicking on More. Experiment with the options that are available. What options give you the best AUC for the two datasets from above when NaiveBayesMultinomial is applied as the classifier? (Note that an exhaustive search is not required.)
Answer:
Often, not all attributes (i.e., terms) are important when classifying documents, because many words may be irrelevant for determining the topic of a document. We can use WEKA's AttributeSelectedClassifier, using ranking with InfoGainAttributeEval and the Ranker search, to try to eliminate attributes that are not so useful. As before, we need to use the FilteredClassifier to transform the data before it is passed to the AttributeSelectedClassifier.
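
A sketch of this nested set-up; the value passed to setNumToSelect is an arbitrary starting point for the experiment:

    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.classifiers.bayes.NaiveBayesMultinomial;
    import weka.classifiers.meta.AttributeSelectedClassifier;
    import weka.classifiers.meta.FilteredClassifier;
    import weka.filters.unsupervised.attribute.StringToWordVector;

    // Inner classifier: rank terms by information gain, keep the top ones.
    AttributeSelectedClassifier asc = new AttributeSelectedClassifier();
    asc.setEvaluator(new InfoGainAttributeEval());
    Ranker ranker = new Ranker();
    ranker.setNumToSelect(100);  // arbitrary example value; vary this
    asc.setSearch(ranker);
    asc.setClassifier(new NaiveBayesMultinomial());

    // Outer classifier: convert strings to word vectors first.
    FilteredClassifier fc = new FilteredClassifier();
    fc.setFilter(new StringToWordVector());
    fc.setClassifier(asc);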
Experiment with this set-up, using default options for StringToWordVector and NaiveBayesMultinomial as the classifier. Vary the number of most informative attributes that are selected from the info-gain-based ranking by changing the value of the numToSelect field in the Ranker method. Record the AUC values you obtain. What number of attributes gives you the best AUC for the two datasets from above? What AUC values are the best ones you manage to obtain? (An exhaustive search is again not required.)
Answer:
