Data science and visualization MODULE 3 FG&FS
Data Wrangling in R
1. dplyr - the fundamental data-munging R package and a powerful tool for data-frame manipulation. Particularly useful for operating on grouped categories of data.
2. purrr - good for working with lists and for applying functions with error handling.
3. splitstackshape - an oldie but a goldie. Great for reshaping complex data sets and making visualization easier.
4. jsonlite - a simple and fast JSON parser.
5. magrittr - supplies the pipe operator; good for chaining scattered wrangling steps into a more cohesive pipeline.
8. FEATURE GENERATION
8.1 INTRODUCTION
The 2004 Text Retrieval Conference (TREC) Genomics Track was divided into two main tasks:
categorization and ad hoc retrieval. The categorization task consisted of a document triage subtask
and an annotation subtask to detect the presence of evidence in the document for each of the three
main Gene Ontology (GO) code hierarchies. Our work focused on the document triage subtask. We
also participated in the ad hoc retrieval task.
8.2 BACKGROUND
The classification of documents is a common problem in biomedicine. Training a support vector
machine (SVM) on vectors generated from stemmed and/or stopped document word counts has
proven to be a simple and generally effective method (Yeh et al., 2003).
However, we recognized that the triage problem posed here had some distinctive features that would
require modifying the standard approach. First, the proportion of true positives in both the training
and the test set was known to be low, about 6-7%. Second, the utility function chosen as the scoring
metric was heavily weighted to reward recall over precision.
This weighting was based on a review of the existing working procedures of the annotators at Mouse
Genome Informatics (MGI) and an estimate of how they actually value false negative and false positive
classifications. The official utility function weights a false negative as 20 times more costly than a
false positive. By this metric, the existing MGI work procedure, which is to read all the documents
in the test set, has a utility of 0.25. Furthermore, the training and test samples were not randomly
drawn from the same pool, but were taken from documents published in two consecutive years.
While this is a more realistic simulation of how the system would be applied at MGI, it raises
the question of how well features derived from one year of literature reflect the literature of
subsequent years. In response to these issues, our approach included a rich collection of features,
statistically based feature selection, multiple classifiers, and an analysis of how well the
features extracted from the 2002 corpus reflected the documents in the 2003 corpus.
8.3 SYSTEM AND METHODS
We tackled the triage problem in four stages: feature generation, feature selection, classifier
selection and training, and, finally, classification of the test documents. Only the training corpus
was used for the first three stages. The final stage was run on the test corpus to produce the
submitted results. During system development, we used ten-fold cross-validation on the training
set to compare approaches and set parameters. This involved running the first two stages on the
entire training set. Then 90% of the training data was used to train the classifiers, which were
then applied to the remaining 10% of the training data. This was repeated for each of the ten folds,
so that all the training data was classified exactly once. The results were then aggregated to
compute cross-validation metrics for the training corpus. Figure 17 displays this process
diagrammatically.
Figure 17: Step-wise approach to test classification
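A minimal Python sketch of this cross-validation loop (illustrative only, not the authors' code), assuming binary document feature vectors X and triage labels y have already been built:

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

def cross_validate(X, y, n_splits=10):
    # Stratified folds preserve the ~6-7% positive rate in each split.
    kf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    predictions = np.empty_like(y)
    for train_idx, test_idx in kf.split(X, y):
        clf = BernoulliNB()                  # stand-in classifier
        clf.fit(X[train_idx], y[train_idx])  # train on 90% of the data
        predictions[test_idx] = clf.predict(X[test_idx])  # classify held-out 10%
    return predictions  # every training document classified exactly once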
1. Feature generation
The full text corpus with SGML mark-up offered an opportunity to explore the use of several types
of features. While many text classification methods view text as a "bag-of-words," we have opted to
use the information contained in the SGML mark-up to generate unique section type features. Since
we merged features that could occur several times in a single document with features that could only
occur once, after some initial testing, we decided to view each feature as binary, that is, each feature
was either present in a document or absent. One type of function we created consisted of pairs of
section names and stemmed words using the Porter stemming algorithm. Upon applying a stop list of
the 300 most common English words, the individual parts of the text collected were coded to include
abstract sections, body paragraphs, captions and section titles. We have created a similar hybrid
section, stammedword features, using the stopped and stammed section title in conjunction with the
stopped and stammed words in the named section. In addition, we have downloaded the related
MEDLINE documents from PubMed. For each post, the corresponding MeSH headings have been
extracted. We included MeSH-based features based on the full MeSH headings, the MeSH main
headings and the MeSH subheadings. Finally, we included features based on details in the reference
section of each text. The key author of each reference was taken as a form of attribute. We also
included a long form of each reference as a feature type, combining the first author, journal name,
volume, year, and page number. Running the feature generation process on the full set of 5837 training
documents produced over 100,000 potentially useful features, along with a count of the number of
documents containing each feature.
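A short Python sketch of the section-name/stemmed-word feature idea (the helper and toy stop list are hypothetical stand-ins for the 300-word list; NLTK's PorterStemmer is assumed to be available):

from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "of", "and", "a", "in", "to"}  # stand-in for the 300-word stop list
stemmer = PorterStemmer()

def document_features(sections):
    """sections: dict mapping a section name (e.g. 'abstract') to its text."""
    features = set()
    for name, text in sections.items():
        for word in text.lower().split():
            if word in STOP_WORDS:
                continue
            # Binary feature: the (section, stem) pair is present or absent.
            features.add((name, stemmer.stem(word)))
    return features

doc = {"abstract": "gene expression in mutant mice",
       "caption": "expression levels of the target gene"}
print(document_features(doc))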
2. Feature selection
We opted to use the Chi-square selection method to pick the features that best differentiated
between positive and negative documents in the training corpus. The 2x2 Chi-square table is
constructed as shown in Table 1, using the document counts obtained in the previous stage.
During parameter tuning, an alpha value of 0.025 was found to produce the best results. Using this
value as a cut-off, 1885 features were selected as the most significant. The number and type of each
feature found significant and used in the following steps are shown in Table 2.
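A sketch of the per-feature 2x2 Chi-square test in Python (illustrative; the counts dictionary and function name are hypothetical), using the 0.025 alpha cut-off described above:

from scipy.stats import chi2_contingency

def select_features(counts, n_pos, n_neg, alpha=0.025):
    """counts: dict mapping feature -> (positive docs containing it,
    negative docs containing it)."""
    selected = []
    for feature, (pos_with, neg_with) in counts.items():
        # 2x2 table: documents with/without the feature, by class.
        table = [[pos_with, n_pos - pos_with],
                 [neg_with, n_neg - neg_with]]
        _, p_value, _, _ = chi2_contingency(table)
        if p_value < alpha:  # the feature discriminates between classes
            selected.append(feature)
    return selected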
3. Classifier selection and training
Three different classifiers were applied to the problem: Naive Bayes, SVM, and voting perceptron.
Although it is widely held that the best classifiers are based on Vapnik's SVM method (Vapnik,
2000), the distinctive aspects of the classification problem discussed above prompted us to try
three different classifiers. Using the same feature set with each classifier allowed us to compare
how well each algorithm fit the particular requirements of the triage task.
Neither Naive Bayes nor the SVM implementation we used, SVMLight (Joachims, 2004),
offered an adequate means of accounting for the low frequency of positive samples and the high
reward for true positives relative to the penalty for false positives. We used our own
implementation of Naive Bayes. Naive Bayes provides a classification probability threshold that can
be used to trade precision against recall. However, this is an indirect form of compensation, and in
practice, for this classification task, we found that raising the probability threshold did not have
a meaningful impact.
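A brief Python sketch of this kind of threshold adjustment (synthetic data, illustrative threshold; not the authors' implementation):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))  # toy binary feature vectors
y = rng.integers(0, 2, size=200)        # toy labels

clf = BernoulliNB().fit(X, y)
p_pos = clf.predict_proba(X)[:, 1]      # P(positive | document)
threshold = 0.25                        # below 0.5 favours recall over precision
predictions = (p_pos >= threshold).astype(int)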
We fully expected SVMLight to perform better than Naive Bayes, as it provides a cost factor
parameter that can be adjusted to impose unequal penalties on false positives and false negatives.
Nevertheless, we found that the effect of this parameter was limited and insufficient to
account for the 20-fold cost difference between false negatives and false positives. Since neither
Naive Bayes nor one of the most common SVM implementations met our requirements, something
else was needed.
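For comparison, a rough scikit-learn analogue of such a cost factor is the class_weight parameter, sketched here on synthetic data (illustrative only; SVMLight's own cost-factor option behaves somewhat differently):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = (rng.random(200) < 0.07).astype(int)  # ~7% positives, as in the task

# Errors on the rare positive class are penalized 20x more heavily
# than errors on the negative class.
clf = SVC(kernel="linear", class_weight={0: 1, 1: 20}).fit(X, y)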
A review of the classification literature shows considerable progress in adapting Rosenblatt's
classic perceptron algorithm (Rosenblatt, 1958) to achieve performance at or near that of SVMs on
several problems. One algorithm in particular, the voting perceptron (Freund and Schapire, 1999),
performs quite well and is fast and easy to implement. Although the algorithm as published
does not provide a way to account for asymmetric false positive and false negative penalties, we
made a modification to the algorithm that does. A perceptron is essentially an equation for a linear
combination of the values of the feature set.
For every element in the feature set, there is one term in the perceptron, plus an optional bias term.
A document is classified by taking the dot product of the document's feature vector with the
perceptron weights and adding the bias term. If the result is greater than zero, the document is
classified as positive; if it is less than or equal to zero, the document is classified as negative.
Rosenblatt's original algorithm trained the perceptron by applying it to each sample in the
training data.
If a sample was misclassified, the perceptron was updated by adding or subtracting the sample
back into the perceptron: adding when the sample was actually positive, and subtracting when it
was actually negative. Over a large number of training samples, the perceptron converges on a
solution that approximates the boundary between positive and negative documents in the
training set. Freund and Schapire improved the performance of the perceptron by modifying the
algorithm to produce a series of perceptrons, each of which makes a prediction about the class of
each document and receives a number of "votes" based on how many training documents it
classified correctly.
The class with the most votes is the class assigned to the document. Our extension to this algorithm
is a targeted modification of the perceptron learning rate for false negatives and false
positives. Whereas in the typical implementation misclassified samples are added or subtracted
back into the perceptron directly, we first multiply the sample by a factor known as the
learning rate, and we use separate learning rates for false positives and false negatives.
Given the form of the utility function, we predicted that the optimal learning rate for false
negatives would be around 20 times that for false positives.
In practice, that is what we observed during training: we used a learning rate of 20.0 for false
negatives and 1.0 for false positives. Each of the three classifiers was trained on the training
corpus, and ten-fold cross-validation was used to optimize all free parameters. The Naive Bayes
classifier had one free parameter, the classification probability threshold, which was left at the
default value of 0.50. The selected SVMLight settings used a linear kernel and a cost factor of
20.0. The voting perceptron classifier used a linear kernel and the learning rates given above.
For each of the three approaches, a trained classifier model was produced.
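A compact Python sketch of a voted perceptron with the asymmetric learning-rate modification described above (a simplified illustration under stated assumptions: labels in {-1, +1}, no bias term, no kernel):

import numpy as np

def train_voted_perceptron(X, y, lr_fn=20.0, lr_fp=1.0, epochs=5):
    """X: feature vectors; y: labels in {-1, +1}.
    Returns a list of (weight vector, vote count) pairs."""
    w = np.zeros(X.shape[1])
    perceptrons = []
    votes = 0
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:  # misclassified sample
                perceptrons.append((w.copy(), votes))
                # Asymmetric rates: a false negative (missed positive)
                # corrects the weights 20x more strongly than a false positive.
                rate = lr_fn if label > 0 else lr_fp
                w = w + rate * label * x
                votes = 0
            else:
                votes += 1  # this perceptron survived one more sample
    perceptrons.append((w.copy(), votes))
    return perceptrons

def classify(perceptrons, x):
    # Each intermediate perceptron votes, weighted by its survival count.
    score = sum(v * np.sign(w @ x) for w, v in perceptrons)
    return 1 if score > 0 else -1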
4. Classification of test documents
Finally, the test corpus was run through the models produced by the Naive Bayes, SVM and voting
perceptron classifiers. This was done in two steps. The documents in the test set were first examined
for the presence or absence of the significant features identified during feature selection,
generating a feature vector for each test document. The documents were then classified by applying
each of the three trained classifiers.
5. Evaluation of concept drift
One critical problem in applying text classification systems to documents of interest to curators and
annotators is how well the available training data reflect the documents to be classified. When
classifying biomedical text, the available training documents must have been written before the
text to be classified. However, by its very nature, the field of science shifts over time, as does
the vocabulary used to describe it.
How quickly the written scientific literature changes has a direct effect on the design of biomedical
text classification systems: how features are generated and selected, how often the systems need to
be re-trained, how much training data is required, and the overall performance that can be expected
from such systems. With the biomedical literature as our setting, we decided to begin investigating
this important issue of concept drift.
To determine how well the features chosen from the training collection reflected the information
that was relevant for classifying the documents in the test collection, we took the additional step
of running feature generation and feature selection on the test collection. The exact same method
and parameters were used for the test collection as for the training collection. We then measured
how well the training-collection feature set reflected the test-collection feature set by computing
similarity metrics between the two sets (Dunham, 2003).
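Set-similarity metrics of this kind are straightforward to compute; a small Python sketch with toy feature sets (the specific metrics the authors used are not given here, so Jaccard and Dice are shown as common examples):

def jaccard(a, b):
    # Overlap relative to the union of the two feature sets.
    return len(a & b) / len(a | b)

def dice(a, b):
    # Overlap relative to the average set size.
    return 2 * len(a & b) / (len(a) + len(b))

train_features = {("abstract", "gene"), ("caption", "express"), ("mesh", "Mice")}
test_features = {("abstract", "gene"), ("mesh", "Mice"), ("body", "mutant")}
print(jaccard(train_features, test_features))
print(dice(train_features, test_features))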
9. FEATURE SELECTION ALGORITHMS
Feature selection is also called variable selection or attribute selection.
It is the automatic selection of attributes in your data (such as columns in tabular data) that are
most relevant to the predictive modeling problem you are working on.
"Feature selection ... is the process of selecting a subset of relevant features for use in model
construction."
Feature selection is different from dimensionality reduction. Both methods seek to
reduce the number of attributes in the dataset, but dimensionality reduction does so by
creating new combinations of attributes, whereas feature selection methods include and exclude
attributes present in the data without changing them.
Examples of dimensionality reduction methods include Principal Component Analysis, Singular
Value Decomposition and Sammon's Mapping.
“Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful
in addition to your existing features”.
9.1 The Problem That Feature Selection Solves
Feature selection methods help you build an effective predictive model for your task. They do so
by choosing features that will give you as good or better accuracy while requiring less data.
Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant
attributes from the data that do not contribute to the accuracy of the predictive model, or that may
even reduce its accuracy.
Fewer attributes are desirable because they reduce the complexity of the model, and a simpler
model is easier to understand and explain.
The goal of variable selection is threefold:
1. to improve the prediction performance of the predictors,
2. to provide faster and more cost-effective predictors,
3. and to provide a better understanding of the underlying process that generated the data.
9.2 Feature Selection Algorithms
There are three general classes of feature selection algorithms:
1. Filter methods,
2. Wrapper methods,
3. Embedded methods.
Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The
features are ranked by score and either selected to be kept or removed from the dataset. These
methods are often univariate and consider each feature independently, or with regard to the
dependent variable.
Examples of filter methods include the Chi-squared test, information gain and correlation
coefficient scores.
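A minimal filter-method sketch in Python using scikit-learn (synthetic data; mutual information is one of several possible scoring functions):

import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 30))
y = rng.integers(0, 2, size=100)

# Score every feature against the target, keep the 10 best.
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)  # (100, 10)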
Wrapper Methods
Wrapper methods treat the selection of a set of features as a search problem, where different
combinations are prepared, evaluated and compared to other combinations. A predictive model is
used to evaluate each combination of features and assign a score based on model accuracy.
The search process may be methodical, such as a best-first search; stochastic, such as a random
hill-climbing algorithm; or heuristic, such as forward and backward passes that add and remove
features.
An example of a wrapper method is the recursive feature elimination (RFE) algorithm, sketched below.
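A short RFE sketch with scikit-learn (synthetic data; the wrapped estimator is an arbitrary illustrative choice):

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=25, n_informative=5,
                           random_state=0)

# Repeatedly fit the model and drop the weakest features until 5 remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)  # boolean mask of the retained features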
Embedded Methods
Embedded methods learn which features contribute most to the accuracy of the model while the
model is being built. The most common type of embedded feature selection method is
regularization. Regularization methods, also called penalization methods, introduce
additional constraints into the optimization of a predictive algorithm (such as a regression
algorithm) that bias the model toward lower complexity (smaller coefficients). Examples of
regularization algorithms include LASSO, Elastic Net and Ridge Regression.
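A brief illustration of the embedded idea with LASSO in scikit-learn (synthetic regression data; the alpha value is arbitrary):

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=4,
                       noise=5.0, random_state=0)

# The L1 penalty drives the coefficients of unhelpful features to exactly
# zero during training, so selection happens as a side effect of fitting.
model = Lasso(alpha=1.0).fit(X, y)
print(np.flatnonzero(model.coef_))  # indices of the surviving features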
9.3 How to Choose a Feature Selection Method for Machine Learning
Feature selection is the process of reducing the number of input variables when developing a
predictive model. It is desirable to reduce the number of input variables both to lower the
computational cost of modeling and, in some cases, to improve the performance of the model.
Statistical filter-based feature selection methods involve evaluating the relationship between each
input variable and the target variable using statistics, and selecting those input variables that
have the strongest relationship with the target. These methods can be fast and effective, although
the choice of statistical measure depends on the data types of both the input and output variables.
As such, it can be challenging for a machine learning practitioner to select an appropriate
statistical measure for a dataset when performing filter-based feature selection.
1. Feature Selection Methods
Feature selection methods are intended to reduce the number of input variables to those believed
to be most useful to a model for predicting the target variable.
Some predictive modeling problems have a large number of variables, which can slow the
development and training of models and require a large amount of system memory. In addition, the
performance of some models can degrade when input variables that are irrelevant to the target
variable are included.
There are two major types of feature selection algorithms: the wrapper method and the filter method.
1. Wrapper Feature Selection Methods.
2. Filter Feature Selection Methods.
1. Wrapper feature selection methods create many models with different subsets of the input
features and select those features that produce the best-performing model according to a
performance metric. These methods are unconcerned with the variable types, although they can be
computationally expensive. Recursive feature elimination (RFE) is a good example of a wrapper
feature selection method. Wrapper methods evaluate multiple models using procedures that add
and/or remove predictors to find the combination that maximizes model performance.
2. Filter feature selection methods use statistical techniques to evaluate the relationship between
each input variable and the target variable, and these scores are used as the basis for choosing
(filtering) the input variables that will be used in the model. Filter methods evaluate the
relevance of the predictors outside of the predictive models, and subsequently model only the
predictors that pass some criterion. Correlation-type statistical measures between input and
output variables are widely used as the basis for filter feature selection. As such, the choice of
statistical measure is highly dependent on the variable data types. Common data types include
numerical (such as height) and categorical (such as a label), although each can be further
subdivided, for example into integer and floating point for numerical variables, and boolean,
ordinal, or nominal for categorical variables. A small helper matching variable types to common
statistical measures is sketched below.
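An illustrative Python helper capturing this mapping as a rule of thumb (the function name and the exact mapping are hypothetical simplifications, not an exhaustive guide):

from sklearn.feature_selection import (chi2, f_classif, f_regression,
                                       mutual_info_classif,
                                       mutual_info_regression)

def pick_score_func(input_type, output_type):
    # Map (input, output) variable types to a common filter statistic.
    if input_type == "numerical" and output_type == "categorical":
        return f_classif       # e.g. ANOVA F-test
    if input_type == "categorical" and output_type == "categorical":
        return chi2            # e.g. Chi-squared test (needs non-negative inputs)
    if input_type == "numerical" and output_type == "numerical":
        return f_regression    # correlation-based F-test
    # Fall back to mutual information, which handles mixed cases.
    return (mutual_info_classif if output_type == "categorical"
            else mutual_info_regression)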