Data science and visualization MODULE 3 FG&FS
Data Wrangling in R
1. dplyr - the fundamental data-munging R package and a powerful tool for data-frame manipulation. Particularly useful for operating on grouped categories of data.
2. purrr - good for working with lists and for applying functions with error handling.
3. splitstackshape - an oldie but a goldie. Great for reshaping complex data sets and making visualization easier.
4. jsonlite - a simple and fast JSON parser.
5. magrittr - supplies the pipe operator; good for chaining scattered wrangling steps into a more cohesive pipeline.
8. FEATURE GENERATION
8.1 INTRODUCTION
The 2004 Text Retrieval Conference (TREC) Genomics Track was divided into two main tasks:
categorization and ad hoc retrieval. The categorization task consisted of a document triage subtask
and an annotation subtask to detect the presence of evidence in the document for each of the three
main Gene Ontology (GO) code hierarchies. Our work focused on the document triage subtask. We
also participated in the ad hoc retrieval task.
8.2 BACKGROUND
The classification of documents is a common problem in biomedicine. Training a support vector
machine (SVM) on vectors generated from stemmed and/or stopped document word counts has
proven to be a simple and generally effective method (Yeh et al., 2003).
However, we recognized that the triage problem posed here had some distinctive features that would
require modifying the standard approach. First, the proportion of true positives in both the training
and the test set was known to be low, about 6-7%. Second, the utility function chosen as the scoring
metric was heavily weighted to reward recall over precision.
This weighting was based on a review of the existing working procedures of the annotators at Mouse
Genome Informatics (MGI) and an estimate of how they actually value false negative and false positive
classifications. The official utility function weights a false negative as 20 times more costly than a
false positive. By this metric, the existing MGI work procedure, which is to read all the documents
in the test set, has a utility of 0.25. Furthermore, the training and test samples were not randomly
drawn from the same pool, but were taken from documents published in two consecutive years.
While this is a more realistic simulation of how the system would be applied at MGI, it raises
the question of how well features derived from one year of literature reflect the literature of
subsequent years. In response to these issues, our approach included a rich collection of features,
statistically based feature selection, multiple classifiers, and an analysis of how well the
features extracted from the 2002 corpus reflected the documents in the 2003 corpus.
8.3 SYSTEM AND METHODS
We tackled the triage problem in four stages: feature generation, feature selection, classifier
selection and training, and, finally, classification of the test documents. Only the training corpus
was used for the first three stages. The final stage was run on the test corpus to produce the
submitted results. During system development, we used ten-fold cross-validation on the training
set to compare approaches and set parameters. This involved running the first two stages on the
entire training set. Then 90% of the training data was used to train the classifiers, which were
then applied to the remaining 10% of the training data. This was repeated for each of the ten folds,
so that all the training data was classified exactly once. The results were then aggregated to
compute cross-validation metrics for the training corpus. Figure 17 displays this process
diagrammatically.
Figure 17: Step-wise approach to test classification
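A minimal Python sketch of this cross-validation loop (illustrative only, not the authors' code), assuming binary document feature vectors X and triage labels y have already been built:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

def cross_validate(X, y, n_splits=10):
    # Stratified folds preserve the ~6-7% positive rate in each split.
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    predictions = np.empty_like(y)
    for train_idx, test_idx in kf.split(X, y):
        clf = BernoulliNB()                  # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])  # train on 90% of the data
        predictions[test_idx] = clf.predict(X[test_idx])  # classify held-out 10%
    return predictions  # every training document classified exactly once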
1. Feature generation
The full text corpus with SGML mark-up offered an opportunity to explore the use of several types
of features. While many text classification methods view text as a "bag-of-words," we have opted to
use the information contained in the SGML mark-up to generate unique section type features. Since
we merged features that could occur several times in a single document with features that could only
occur once, after some initial testing, we decided to view each feature as binary, that is, each feature
was either present in a document or absent. One type of function we created consisted of pairs of
section names and stemmed words using the Porter stemming algorithm. Upon applying a stop list of
the 300 most common English words, the individual parts of the text collected were coded to include
abstract sections, body paragraphs, captions and section titles. We have created a similar hybrid
section, stammedword features, using the stopped and stammed section title in conjunction with the
stopped and stammed words in the named section. In addition, we have downloaded the related
MEDLINE documents from PubMed. For each post, the corresponding MeSH headings have been
extracted. We included MeSH-based features based on the full MeSH headings, the MeSH main
headings and the MeSH subheadings. Finally, we included features based on details in the reference
section of each text. The key author of each reference was taken as a form of attribute. We also
included a long form of each reference as a feature type, combining the first author, journal name,
volume, year, and page number. Running the feature generation process on the full set of 5837 training
documents produced over 100,000 potentially useful features, along with a count of the number of
documents containing each feature.
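A short Python sketch of the section-name/stemmed-word feature idea (the helper and toy stop list are hypothetical stand-ins for the 300-word list; NLTK's PorterStemmer is assumed to be available):

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "of", "and", "a", "in", "to"}  # stand-in for the 300-word stop list
stemmer = PorterStemmer()

def document_features(sections):
    """sections: dict mapping a section name (e.g. 'abstract') to its text."""
    features = set()
    for name, text in sections.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            # Binary feature: the (section, stem) pair is present or absent.
            features.add((name, stemmer.stem(word)))
    return features

doc = {"abstract": "gene expression in mutant mice",
       "caption": "expression levels of the target gene"}
print(document_features(doc))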
2. Feature selection
We opted to use the Chi-square selection method to pick the features that best differentiated
between positive and negative documents in the training corpus. The 2x2 Chi-square table is
constructed as shown in Table 1, using the document counts obtained in the previous stage.
During parameter tuning, an alpha value of 0.025 was found to produce the best results. Using this
value as a cut-off, 1885 features were selected as the most significant. The number and type of each
feature found significant and used in the following steps are shown in Table 2.
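A sketch of the per-feature 2x2 Chi-square test in Python (illustrative; the counts dictionary and function name are hypothetical), using the 0.025 alpha cut-off described above:

from scipy.stats import chi2_contingency

def select_features(counts, n_pos, n_neg, alpha=0.025):
    """counts: dict mapping feature -> (positive docs containing it,
    negative docs containing it)."""
    selected = []
    for feature, (pos_with, neg_with) in counts.items():
        # 2x2 table: documents with/without the feature, by class.
        table = [[pos_with, n_pos - pos_with],
                 [neg_with, n_neg - neg_with]]
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:  # the feature discriminates between classes
            selected.append(feature)
    return selected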
3. Classifier selection and training
Three different classifiers were applied to the problem: Naive Bayes, SVM, and voting perceptron.
Although it is widely held that the best classifiers are based on Vapnik's SVM method (Vapnik,
2000), the distinctive aspects of the classification problem discussed above prompted us to try
three different classifiers. Using the same feature set with each classifier allowed us to compare
how well each algorithm fit the particular requirements of the triage task.
Neither Naive Bayes nor the SVM implementation we used, SVMLight (Joachims, 2004),
offered an adequate means of accounting for the low frequency of positive samples and the high
reward for true positives relative to the penalty for false positives. We used our own
implementation of Naive Bayes. Naive Bayes provides a classification probability threshold that can
be used to trade precision against recall. However, this is an indirect form of compensation, and in
practice, for this classification task, we found that raising the probability threshold did not have
a meaningful impact.
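A brief Python sketch of this kind of threshold adjustment (synthetic data, illustrative threshold; not the authors' implementation):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))  # toy binary feature vectors
y = rng.integers(0, 2, size=200)        # toy labels

clf = BernoulliNB().fit(X, y)
p_pos = clf.predict_proba(X)[:, 1]      # P(positive | document)
threshold = 0.25                        # below 0.5 favours recall over precision
predictions = (p_pos >= threshold).astype(int)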
We fully expected SVMLight to perform better than Naive Bayes, as it provides a cost factor
parameter that can be adjusted to impose unequal penalties on false positives and false negatives.
Nevertheless, we found that the effect of this parameter was limited and insufficient to
account for the 20-fold cost difference between false negatives and false positives. Since neither
Naive Bayes nor one of the most common SVM implementations met our requirements, something
else was needed.
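For comparison, a rough scikit-learn analogue of such a cost factor is the class_weight parameter, sketched here on synthetic data (illustrative only; SVMLight's own cost-factor option behaves somewhat differently):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (rng.random(200) < 0.07).astype(int)  # ~7% positives, as in the task

# Errors on the rare positive class are penalized 20x more heavily
# than errors on the negative class.
clf = SVC(kernel="linear", class_weight={0: 1, 1: 20}).fit(X, y)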
A review of the classification literature shows considerable progress in adapting Rosenblatt's
classic perceptron algorithm (Rosenblatt, 1958) to achieve performance at or near that of SVMs on
several problems. One algorithm in particular, the voting perceptron (Freund and Schapire, 1999),
performs quite well and is fast and easy to implement. Although the algorithm as published
does not provide a way to account for asymmetric false positive and false negative penalties, we
made a modification to the algorithm that does. A perceptron is essentially an equation for a linear
combination of the values of the feature set.
For every element in the feature set, there is one term in the perceptron, plus an optional bias term.
A document is classified by taking the dot product of the document's feature vector with the
perceptron weights and adding the bias term. If the result is greater than zero, the document is
classified as positive; if it is less than or equal to zero, the document is classified as negative.
Rosenblatt's original algorithm trained the perceptron by applying it to each sample in the
training data.
If a sample was misclassified, the perceptron was updated by adding or subtracting the sample
back into the perceptron: adding when the sample was actually positive, and subtracting when it
was actually negative. Over a large number of training samples, the perceptron converges on a
solution that approximates the boundary between positive and negative documents in the
training set. Freund and Schapire improved the performance of the perceptron by modifying the
algorithm to produce a series of perceptrons, each of which makes a prediction about the class of
each document and receives a number of "votes" based on how many training documents it
classified correctly.
The class with the most votes is the class assigned to the document. Our extension to this algorithm
is a targeted modification of the perceptron learning rate for false negatives and false
positives. Whereas in the typical implementation misclassified samples are added or subtracted
back into the perceptron directly, we first multiply the sample by a factor known as the
learning rate, and we use separate learning rates for false positives and false negatives.
Given the form of the utility function, we predicted that the optimal learning rate for false
negatives would be around 20 times that for false positives.
In practice, that is what we observed during training: we used a learning rate of 20.0 for false
negatives and 1.0 for false positives. Each of the three classifiers was trained on the training
corpus, and ten-fold cross-validation was used to optimize all free parameters. The Naive Bayes
classifier had one free parameter, the classification probability threshold, which was left at the
default value of 0.50. The selected SVMLight settings used a linear kernel and a cost factor of
20.0. The voting perceptron classifier used a linear kernel and the learning rates given above.
For each of the three approaches, a trained classifier model was produced.
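A compact Python sketch of a voted perceptron with the asymmetric learning-rate modification described above (a simplified illustration under stated assumptions: labels in {-1, +1}, no bias term, no kernel):

import numpy as np

def train_voted_perceptron(X, y, lr_fn=20.0, lr_fp=1.0, epochs=5):
    """X: feature vectors; y: labels in {-1, +1}.
    Returns a list of (weight vector, vote count) pairs."""
    w = np.zeros(X.shape[1])
    perceptrons = []
    votes = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:  # misclassified sample
                perceptrons.append((w.copy(), votes))
                # Asymmetric rates: a false negative (missed positive)
                # corrects the weights 20x more strongly than a false positive.
                rate = lr_fn if label > 0 else lr_fp
                w = w + rate * label * x
                votes = 0
            else:
                votes += 1  # this perceptron survived one more sample
    perceptrons.append((w.copy(), votes))
    return perceptrons

def classify(perceptrons, x):
    # Each intermediate perceptron votes, weighted by its survival count.
    score = sum(v * np.sign(w @ x) for w, v in perceptrons)
    return 1 if score > 0 else -1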
4. Classification of test documents
Finally, the test corpus was run through the models produced by the Naive Bayes, SVM and voting
perceptron classifiers. This was done in two steps. The documents in the test set were first examined
for the presence or absence of the significant features identified during feature selection,
generating a feature vector for each test document. The documents were then classified by applying
each of the three trained classifiers.
5. Evaluation of concept drift
One critical problem in applying text classification systems to documents of interest to curators and
annotators is how well the available training data reflect the documents to be classified. When
classifying biomedical text, the available training documents must have been written before the
text to be classified. However, by its very nature, the field of science shifts over time, as does
the vocabulary used to describe it.
How quickly the written scientific literature changes has a direct effect on the design of biomedical
text classification systems: how features are generated and selected, how often the systems need to
be re-trained, how much training data is required, and the overall performance that can be expected
from such systems. With the biomedical literature as our setting, we decided to begin investigating
this important issue of concept drift.
To determine how well the features chosen from the training collection reflected the information
that was relevant for classifying the documents in the test collection, we took the additional step
of running feature generation and feature selection on the test collection. The exact same method
and parameters were used for the test collection as for the training collection. We then measured
how well the training-collection feature set reflected the test-collection feature set by computing
similarity metrics between the two sets (Dunham, 2003).
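Set-similarity metrics of this kind are straightforward to compute; a small Python sketch with toy feature sets (the specific metrics the authors used are not given here, so Jaccard and Dice are shown as common examples):

def jaccard(a, b):
    # Overlap relative to the union of the two feature sets.
    return len(a & b) / len(a | b)

def dice(a, b):
    # Overlap relative to the average set size.
    return 2 * len(a & b) / (len(a) + len(b))

train_features = {("abstract", "gene"), ("caption", "express"), ("mesh", "Mice")}
test_features = {("abstract", "gene"), ("mesh", "Mice"), ("body", "mutant")}
print(jaccard(train_features, test_features))
print(dice(train_features, test_features))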
9. FEATURE SELECTION ALGORITHMS
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of attributes in your data (such as columns in tabular data) that are
most relevant to the predictive modeling problem you are working on.
"Feature selection ... is the process of selecting a subset of relevant features for use in model
construction."
Feature selection is different from dimensionality reduction. Both methods seek to
reduce the number of attributes in the dataset, but dimensionality reduction does so by
creating new combinations of attributes, whereas feature selection methods include and exclude
attributes present in the data without changing them.
Examples of dimensionality reduction methods include Principal Component Analysis, Singular
Value Decomposition and Sammon's Mapping.
“Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful
in addition to your existing features”.
9.1 The Problem That Feature Selection Solves
Feature selection methods help you build an effective predictive model for your task. They do so
by choosing features that will give you as good or better accuracy while requiring less data.
Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant
attributes from the data that do not contribute to the accuracy of the predictive model, or that may
even reduce its accuracy.
Fewer attributes are desirable because they reduce the complexity of the model, and a simpler
model is easier to understand and explain.
The goal of variable selection is threefold:
1. to improve the prediction performance of the predictors,
2. to provide faster and more cost-effective predictors,
3. and to provide a better understanding of the underlying process that generated the data.
9.2 Feature Selection Algorithms
There are three general classes of feature selection algorithms:
1. Filter methods,
2. Wrapper methods,
3. Embedded methods.
Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The
features are ranked by score and either selected to be kept or removed from the dataset. These
methods are often univariate and consider each feature independently, or with regard to the
dependent variable.
Examples of filter methods include the Chi-squared test, information gain and correlation
coefficient scores.
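A minimal filter-method sketch in Python using scikit-learn (synthetic data; mutual information is one of several possible scoring functions):

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = rng.integers(0, 2, size=100)

# Score every feature against the target, keep the 10 best.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (100, 10)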
Wrapper Methods
Wrapper methods treat the selection of a set of features as a search problem, where different
combinations are prepared, evaluated and compared to other combinations. A predictive model is
used to evaluate each combination of features and assign a score based on model accuracy.
The search process may be methodical, such as a best-first search; stochastic, such as a random
hill-climbing algorithm; or heuristic, such as forward and backward passes that add and remove
features.
An example of a wrapper method is the recursive feature elimination (RFE) algorithm, sketched below.
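A short RFE sketch with scikit-learn (synthetic data; the wrapped estimator is an arbitrary illustrative choice):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# Repeatedly fit the model and drop the weakest features until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the retained features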
Embedded Methods
Embedded methods learn which features contribute most to the accuracy of the model while the
model is being built. The most common type of embedded feature selection method is
regularization. Regularization methods, also called penalization methods, introduce
additional constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (smaller coefficients). Examples of
regularization algorithms include LASSO, Elastic Net and Ridge Regression.
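A brief illustration of the embedded idea with LASSO in scikit-learn (synthetic regression data; the alpha value is arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

# The L1 penalty drives the coefficients of unhelpful features to exactly
# zero during training, so selection happens as a side effect of fitting.
model = Lasso(alpha=1.0).fit(X, y)
print(np.flatnonzero(model.coef_))  # indices of the surviving features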
9.3 How to Choose a Feature Selection Method for Machine Learning
Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables both to lower the
computational cost of modeling and, in some cases, to improve the performance of the model.
Statistical filter-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics, and selecting those input variables that
have the strongest relationship with the target. These methods can be fast and effective, although
the choice of statistical measure depends on the data types of both the input and output variables.
As such, it can be challenging for a machine learning practitioner to select an appropriate
statistical measure for a dataset when performing filter-based feature selection.
1. Feature Selection Methods
Feature selection methods are intended to reduce the number of input variables to those believed
to be most useful to a model for predicting the target variable.
Some predictive modeling problems have a large number of variables, which can slow the
development and training of models and require a large amount of system memory. In addition, the
performance of some models can degrade when input variables that are irrelevant to the target
variable are included.
There are two major types of feature selection algorithms: the wrapper method and the filter method.
1. Wrapper Feature Selection Methods.
2. Filter Feature Selection Methods.
1. Wrapper feature selection methods create many models with different subsets of the input
features and select those features that produce the best-performing model according to a
performance metric. These methods are unconcerned with the variable types, although they can be
computationally expensive. Recursive feature elimination (RFE) is a good example of a wrapper
feature selection method. Wrapper methods evaluate multiple models using procedures that add
and/or remove predictors to find the combination that maximizes model performance.
2. Filter feature selection methods use statistical techniques to evaluate the relationship between
each input variable and the target variable, and these scores are used as the basis for choosing
(filtering) the input variables that will be used in the model. Filter methods evaluate the
relevance of the predictors outside of the predictive models, and subsequently model only the
predictors that pass some criterion. Correlation-type statistical measures between input and
output variables are widely used as the basis for filter feature selection. As such, the choice of
statistical measure is highly dependent on the variable data types. Common data types include
numerical (such as height) and categorical (such as a label), although each can be further
subdivided, for example into integer and floating point for numerical variables, and boolean,
ordinal, or nominal for categorical variables. A small helper matching variable types to common
statistical measures is sketched below.
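An illustrative Python helper capturing this mapping as a rule of thumb (the function name and the exact mapping are hypothetical simplifications, not an exhaustive guide):

from sklearn.feature_selection import (chi2, f_classif, f_regression,
                                       mutual_info_classif,
                                       mutual_info_regression)

def pick_score_func(input_type, output_type):
    # Map (input, output) variable types to a common filter statistic.
    if input_type == "numerical" and output_type == "categorical":
        return f_classif       # e.g. ANOVA F-test
    if input_type == "categorical" and output_type == "categorical":
        return chi2            # e.g. Chi-squared test (needs non-negative inputs)
    if input_type == "numerical" and output_type == "numerical":
        return f_regression    # correlation-based F-test
    # Fall back to mutual information, which handles mixed cases.
    return (mutual_info_classif if output_type == "categorical"
            else mutual_info_regression)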