IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
Volume: 05 Issue: 02 | Feb-2016, Available @ http://www.ijret.org
COMPARATIVE STUDY OF CLASSIFICATION ALGORITHM FOR
TEXT BASED CATEGORIZATION
Omkar Ardhapure1, Gayatri Patil2, Disha Udani3, Kamlesh Jetha4
1Student, Department of Computer Engineering, ABMSP's APCOER Pune, Maharashtra, India
2Student, Department of Computer Engineering, ABMSP's APCOER Pune, Maharashtra, India
3Student, Department of Computer Engineering, ABMSP's APCOER Pune, Maharashtra, India
4Assistant Professor, Department of Computer Engineering, ABMSP's APCOER Pune, Maharashtra, India
Abstract
Text categorization is a process in data mining which assigns predefined categories to free-text documents using machine
learning techniques. Any document in the form of text, image, music, etc. can be classified using some categorization techniques.
It provides conceptual views of the collected documents and has important applications in the real world. Text based
categorization is used for document classification together with pattern recognition and machine learning. The advantages of a number of
classification algorithms for classifying documents, such as the Naive Bayes algorithm, K-Nearest Neighbor, and Decision Tree, have been
studied in this paper, and a comparative study of the advantages and disadvantages of these classification algorithms is presented.
Keywords: Data Mining, Text Mining, Text Categorization, Machine Learning, Pattern Analysis, Naive Bayes’, KNN,
Decision Tree.
-------------------------------------------------------------------***----------------------------------------------------------------------
1. INTRODUCTION
Text classification is an integral part of text mining and relies on machine learning techniques. With the rapid growth of
digital information, text categorization is widely used to handle and organize text data. Its main goal is to divide free-text
documents into different categories and to automatically assign documents to predefined categories. These methods help to find
regularities in data that are potentially useful, hidden, and non-trivial. The aim of text categorization is to allow users to
extract information from textual resources and to support operations such as retrieval, classification, and clustering; it brings
data mining, natural language pre-processing, and machine learning methods together to classify different patterns.
Text categorization can be stated as follows. Let a = (a1, a2, ..., ag) be the set of documents to be categorized,
where aj is the jth document, and let b = (b1, b2, ..., bh) be the set of predefined categories, where bk is the category
to which text document aj may be mapped by a function f. Here g denotes the total number of documents to be categorized
and h denotes the total number of predefined categories. The mapping is represented as [1]:
f: aj → bk
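As a concrete illustration of this mapping, the following toy sketch treats f as a keyword-based function from documents to category labels; the documents, labels, and keyword lists are invented for this example and are not taken from the paper.

```python
# Toy illustration of the mapping f: a_j -> b_k; keywords and documents are invented.
documents = ["stock prices rise after earnings",   # a1
             "home team wins the cup final",       # a2
             "new vaccine enters clinical trial"]  # a3
categories = ["business", "sports", "health"]      # b1, b2, b3

def f(doc: str) -> str:
    """Stand-in for a trained classifier mapping a document to one category."""
    keywords = {"business": ["stock", "earnings"],
                "sports": ["team", "final"],
                "health": ["vaccine", "clinical"]}
    for label, words in keywords.items():
        if any(w in doc for w in words):
            return label
    return "unknown"

for d in documents:
    print(d, "->", f(d))
```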
With the rapid growth and development of technology, research on text categorization has evolved into a new stage in which
machine learning techniques, for example K-Nearest Neighbor, Naive Bayesian classification, Support Vector Machines, and
Decision Trees, have become the dominant approach.
2. ASSOCIATION RULE
Let S = {i1, i2, i3, ..., ik} be a set of attributes (items).
Let T = {t1, t2, t3, ..., tn} be a set of transactions (the database). Each transaction is assigned a unique
transaction ID and contains a subset of the items in S.
A rule is defined as an implication of the form:
P → Q
where P, Q ⊆ S and P ∩ Q = ∅.
Each rule involves two different sets of items, also known as itemsets, P and Q, where P is called the antecedent
and Q the consequent. Interesting rules are selected by applying constraints on various measures of significance
and interest. [2]
2.1 Support
The support value of P with respect to T is the proportion of transactions in the database that contain the itemset P:
supp(P) = |{t ∈ T : P ⊆ t}| / |T|
2.2 Confidence
The confidence value of a rule P → Q, with respect to a set of transactions T, is the proportion of the transactions
containing P that also contain Q. Confidence is defined as:
conf(P → Q) = supp(P ∪ Q) / supp(P)
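A minimal sketch of these two measures, assuming a small invented transaction database; the items and itemsets below are illustrative only.

```python
# Support and confidence over a toy transaction database (items are invented).
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"milk", "bread", "butter"},
]

def support(itemset: set, transactions: list) -> float:
    """supp(P): fraction of transactions that contain every item in P."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(P: set, Q: set, transactions: list) -> float:
    """conf(P -> Q) = supp(P ∪ Q) / supp(P)."""
    return support(P | Q, transactions) / support(P, transactions)

P, Q = {"milk"}, {"bread"}
print("supp(P) =", support(P, transactions))              # 4/5 = 0.8
print("conf(P -> Q) =", confidence(P, Q, transactions))   # 0.6/0.8 = 0.75
```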
3. TEXT CLASSIFICATION
Given that most of the text in the modern era is in digital
format, text classification plays an important role to manage
and process such data. The goal of text categorization is the
classification of documents into defined categories. The
categories are nothing but just symbolic labels with no
additional knowledge of their meanings. Fig 1 shows the
different stages of text classification such as collection of
documents, pre-processing, feature indexing, and feature
filtering, different classification algorithm and performance
measure.
Fig -1: The Text Classification Process [3]
The first step is the collection of documents, which can be present in various formats such as PDF, HTML, DOC, PHP, etc.
The collected data then goes through a series of pre-processing steps. Removing stop words: stop words are the common
words that appear in text and carry little meaning; they serve only a syntactic function and do not indicate subject
matter, e.g. “the”, “a”, “and”, “that” are useless as indexing terms. A stop-word list is prepared and, after scanning
the text word by word, these words are removed, as sketched below.
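A minimal stop-word removal sketch; the stop-word list here is a small invented sample rather than a standard list.

```python
# Drop common words that carry little meaning before indexing (sample list only).
STOP_WORDS = {"the", "a", "an", "and", "that", "is", "of", "to", "in"}

def remove_stop_words(text: str) -> list:
    """Scan the text word by word and keep only non-stop-words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

print(remove_stop_words("The cat sat on the mat and that was that"))
# -> ['cat', 'sat', 'on', 'mat', 'was']
```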
The second step is indexing of the documents. Each full-text document is converted into a document vector using
the vector space model, in which documents are represented as vectors of words.
In feature selection, features irrelevant to the classification task are removed and a reduced vector space is
constructed, which improves the scalability, efficiency, and accuracy of the text classifier; a combined sketch
of indexing and feature selection follows.
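One possible way to realize indexing plus feature selection, assuming scikit-learn is available; the corpus, labels, and the choice of a chi-square criterion keeping k = 5 terms are assumptions made for illustration, not prescribed by the paper.

```python
# Indexing (bag-of-words document vectors) followed by feature selection (chi-square).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["stock markets fell sharply today",
          "the team won the championship final",
          "central bank raises interest rates",
          "star player injured before the final"]
labels = ["business", "sports", "business", "sports"]

vectorizer = CountVectorizer(stop_words="english")   # document vectors of word counts
X = vectorizer.fit_transform(corpus)

selector = SelectKBest(chi2, k=5)                    # keep the 5 most label-relevant terms
X_reduced = selector.fit_transform(X, labels)

kept = [t for t, keep in zip(vectorizer.get_feature_names_out(), selector.get_support()) if keep]
print("vocabulary size:", X.shape[1])
print("kept features:", kept)
```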
Models describing the important data are then extracted using classification algorithms. Different techniques
such as K-Nearest Neighbor, Artificial Neural Networks, the Naive Bayes classifier, and Decision Trees are used
to build these classification models.
The last step of text classification is performance measurement. Performance evaluation of a text classifier is
done by calculating precision and recall. Precision is the percentage of the retrieved documents that are relevant
to the query, i.e. that are “correct” responses to the fired query. It is defined as
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall is the percentage of the documents relevant to the query that were, in fact, retrieved. It is formally
defined as [7]
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
The F-measure is the harmonic mean of precision and recall [7]:
F-measure = 2 × Precision × Recall / (Precision + Recall)
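These three measures can be computed directly from sets of document identifiers, as in the following sketch; the relevant and retrieved sets are invented for illustration.

```python
# Precision, recall and F-measure from relevant/retrieved document-ID sets.
relevant  = {"d1", "d2", "d3", "d5"}
retrieved = {"d2", "d3", "d4", "d6"}

hits = len(relevant & retrieved)          # |Relevant ∩ Retrieved|
precision = hits / len(retrieved)         # 2 / 4 = 0.5
recall    = hits / len(relevant)          # 2 / 4 = 0.5
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, round(f_measure, 3))   # 0.5 0.5 0.5
```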
4. CLASSIFICATION ALGORITHM
4.1 Naive Bayesian Classification
Naive Bayes is a simple and easy technique for classification: the Naive Bayes classifier combines prior
probabilities and likelihoods to compute the posterior probability of each class. Bayes’ theorem is:
P(M|N) = P(N|M) · P(M) / P(N)
where
M – some hypothesis, such as that data tuple N belongs to a specified class C
N – some evidence, described by measurements on a set of attributes
P(M|N) – the posterior probability that the hypothesis M holds given the evidence N
P(M) – the prior probability of M, independent of N
P(N|M) – the probability of N conditioned on M.
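A minimal sketch of naive Bayes text classification, assuming scikit-learn is available; the training texts, labels, and the multinomial word-count model are assumptions made for this illustration.

```python
# Naive Bayes over word counts: posteriors P(class | document) from priors and likelihoods.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts  = ["cheap loans apply now", "meeting rescheduled to monday",
                "win a free prize today", "project report attached"]
train_labels = ["spam", "ham", "spam", "ham"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["free prize for you"]))        # likely ['spam']
print(model.predict_proba(["free prize for you"]))  # posterior probabilities per class
```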
Advantages:
• This technique works well on numeric as well as textual data.
• The classifier is easy to implement and its computations are simple compared with other algorithms.
• It can be applied to large data sets, since no complicated iterative parameter estimation schemes are needed.
• The knowledge representation is easy to interpret.
• It performs well and is robust.
Disadvantages:
• It does not consider the frequency of word occurrences.
• Theoretically, the naive Bayes classifier has the minimum error rate compared with other classifiers, but in
  practice this is not always true because of its assumption of class-conditional independence.
4.2 K-Nearest Neighbor
The k-nearest neighbors’ technique is based upon the
principle that the samples which are similar to each other
will lie in close proximity. Given an unlabeled sample, K-
nearest neighbor classifier will search the pattern space for
k-objects that are closest to it and will delegate the class by
identifying the class label which is frequently used. For
value of k=1 the samples which are closest to unknown
samples are given pattern space. [4][5]
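A minimal k-NN text classification sketch, assuming scikit-learn is available; the documents, labels, k = 3, and cosine distance are illustrative assumptions.

```python
# k-nearest neighbors: the query takes the majority label of its k closest documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

docs   = ["goal scored in the last minute", "parliament passes new budget",
          "striker signs for rival club", "minister announces tax reform"]
labels = ["sports", "politics", "sports", "politics"]

knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=3, metric="cosine"))
knn.fit(docs, labels)

print(knn.predict(["club signs a new striker"]))   # likely ['sports']
```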
Advantages:
• Performs well on applications in which a sample can have many class labels.
• The classifier is robust to noisy training data.
• The classifier is effective if the training data set is not small.
• Very little information is needed to make it work.
• Learning is simple.
Disadvantages:
• Slower than other classification algorithms.
• Nearest neighbor classifiers assign equal weight to each attribute. This may cause confusion when there are
  many irrelevant attributes in the data and thus results in poor accuracy.
• There is no principled way to choose the value of k, except through cross-validation, which makes finding the
  optimal value of k difficult. [4]
4.3 Decision Tree
In decision tree “root” is the main nodewhich has no
incoming edges from any other node. All the other nodes in
the tree have exactly one incoming edge. Other nodes are
called as internal nodes have only outgoing edges. Each
nodes splits into two or more nodes.These nodes are called
as terminal nodes.
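A minimal decision-tree sketch on word-presence features, assuming scikit-learn is available; the documents, labels, and depth limit are invented for illustration.

```python
# A small decision tree: internal nodes test one term; leaves carry the category.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

docs   = ["discount sale offer", "quarterly earnings call",
          "limited time offer", "earnings beat estimates"]
labels = ["promo", "finance", "promo", "finance"]

vec = CountVectorizer(binary=True)          # word-presence indicators
X = vec.fit_transform(docs)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X, labels)

print(export_text(tree, feature_names=list(vec.get_feature_names_out())))
print(tree.predict(vec.transform(["special offer today"])))   # likely ['promo']
```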
Advantages:
• Decision tree has excellent speed of learning and speed
of classification.
• Supports transparency of knowledge/classification.
• Supports multi-classification.
Disadvantages:
• Small variations in the data can produce very different-looking trees.
• Construction of a decision tree can be badly affected by irrelevant attributes.
5. COMPARATIVE STUDY OF
CLASSIFICATION ALGORITHMS
Table -1: Comparison of Classification Techniques
Criterion | Naive Bayes | K-Nearest Neighbor | Decision Trees
Accuracy in general | Average | Good | Good
Speed of learning | Excellent | Excellent | V. Good
Speed of classification | Excellent | Average | Excellent
Tolerance to missing values | Excellent | Average | V. Good
Tolerance to irrelevant attributes | Good | Good | V. Good
Tolerance to noise | V. Good | Average | Good
Attempts for incremental learning | Excellent | Excellent | Good
Explanation ability / transparency of knowledge/classification | Excellent | Good | Excellent
Support for multi-classification | Naturally extended | Excellent | Excellent
Based on the above comparison, one can choose any one of the above techniques for text-based categorization
according to need and performance requirements.
6. CONCLUSIONS
Text classification has a vital role in managing and processing data. In this paper, a detailed comparison has
been made among classification algorithms such as Decision Trees, naive Bayes, and K-Nearest Neighbor. The aim
of this study was to find an appropriate classification technique for text-based categorization. The comparative
study has shown that each algorithm has its own set of advantages and disadvantages and its own area of
application. No single classification algorithm can satisfy all the criteria. Two or more classifiers can be
integrated and built according to the implementation and performance needed, as sketched below.
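One possible way to combine classifiers is a simple voting ensemble, assuming scikit-learn is available; the component classifiers and data below are illustrative assumptions, not a method prescribed in this paper.

```python
# Hard-voting ensemble of naive Bayes and k-NN over TF-IDF features (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import make_pipeline

docs   = ["rain expected tomorrow", "stocks rally on earnings",
          "storm warning issued", "markets close higher"]
labels = ["weather", "finance", "weather", "finance"]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier([("nb", MultinomialNB()),
                      ("knn", KNeighborsClassifier(n_neighbors=1))],
                     voting="hard"),
)
ensemble.fit(docs, labels)
print(ensemble.predict(["heavy rain and storm"]))   # likely ['weather']
```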
ACKNOWLEDGEMENT
We express our gratitude to our guide, our professors, and our principal, who have always been sincere and
helpful in making us understand the varied concepts. We also hope that this paper will be of use to those
interested in this subject and that they find it comprehensible.
REFERENCES
[1]. S. Ramasundaram and S.P. Victor, “Algorithms for Text Categorization: A Comparative Study”, World Applied
Sciences Journal, 22 (9): 1232-1240, 2013.
[2]. Tan, Steinbach, Kumar, “Data Mining Association
Analysis: Basic Concepts and Algorithms”, 2004.
[3]. The Wikipedia website. [Online].
https://en.wikipedia.org/wiki/Association_rule_learning
[4]. Anuradha Patra, Divakar Singh, “A Survey Report on Text
Classification with Different Term Weighting Methods and
Comparison between Classification Algorithms”,
International Journal of Computer Applications (0975 –
8887) Volume 75–No.7, August 2013
[5]. Amit Ganatra, Hetal Bhavsar, “A Comparative Study of
Training Algorithms for Supervised Machine Learning”,
International Journal of Soft Computing and Engineering
(IJSCE) ISSN: 2231-2307, Volume-2, Issue-4, September
2012.
[6]. Cover, T., Hart, “Nearest Neighbor Pattern
Classification”, IEEE Transactions on Information Theory,
vol. 13, no. 1, pp. 21-27, 1967.
[7]. Hetal Bhavsar, Amit Ganatra, “A Comparative Study of
Training Algorithms for Supervised Machine Learning”,
International Journal of Soft Computing and Engineering
(IJSCE) ISSN: 2231-2307, Volume-2, Issue-4, September
2012
[8]. Anuradha Patra, Divakar Singh, “A Survey Report on
Text Classification with Different Term Weighing Methods
and Comparison between Classification Algorithms”,
International Journal of Computer Applications (0975 –
8887) Volume 75– No.7, August 2013.
BIOGRAPHIES
Omkar Vijay Ardhapure
Student at ABMSP’S Anantrao Pawar
College of Engineering & Research Pune.
Gayatri Patil
Student at ABMSP’S Anantrao Pawar
College of Engineering & Research Pune.
Disha Udani
Student at ABMSP’S Anantrao Pawar
College of Engineering & Research Pune.
Kamlesh Jetha
Assistant Professor at ABMSP’S Anantrao
Pawar College of Engineering & Research
Pune.