IOSR Journal of Computer Engineering (IOSR-JCE)
e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. VI (May – Jun. 2015), PP 76-80
www.iosrjournals.org
DOI: 10.9790/0661-17367680
A survey of Stemming Algorithms for Information Retrieval
Brajendra Singh Rajput1, Dr. Nilay Khare2
1,2(Computer Science & Engineering, Maulana Azad National Institute of Technology, India)
Abstract: Nowadays the volume of text documents on the internet, in e-mails and on web pages is growing
rapidly. As the use of the internet grows exponentially, the need for massive data storage increases. Many
documents contain morphological variants of the same word, so stemming, a preprocessing technique, maps the
different morphological variants of a word to a common base form called the stem. Stemming is used in
information retrieval to improve retrieval performance, based on the assumption that terms with the same stem
usually have similar meanings. Stemming a large collection normally requires more computation time and power,
which must be balanced against the need to search for a particular word in the data. In this paper, various
stemming algorithms are analyzed, together with the benefits and limitations of recent stemming techniques.
Keywords - Information Retrieval, NLP, stemming technique, Decision based method, Statistical method.
I. Introduction
In Information Retrieval systems the main goal is to improve recall while keeping good precision.
Stemming is a recall-enhancing method that is useful even for the simplest Boolean retrieval systems. A
searcher who is looking for texts containing, say, dogs is probably also interested in texts that contain the term
dog [6]. The size of search databases has increased in the last few years, so faster NLP algorithms are required
to meet the challenge of real-time search. Natural language texts typically contain many different variants of a
word; for example, corrected, correcting, correction, correctly, correctness, correctively, correctional,
corrective, correctable (adjective) and corrector (noun) are all derived from the root word correct [1].
The conventional approach to answering a user query is to search the documents in the corpus word by word
for the query terms. This approach is very time-consuming, and it may miss relevant documents that contain
only a variant of a query term. To avoid these problems, stemming has been used extensively in
Information Retrieval systems to increase retrieval accuracy [4].
All documents that contain a word with the same stem as the query term are treated as relevant. Stemming also
cuts down the size of the feature set. In text mining, stemming can be viewed as a form of clustering or, in
pattern-recognition terms, feature reduction. In rule based reasoning, the main purpose is to choose the most
representative features or dimensions based on a similarity measurement [13].
For example, the derived words present, presented, presentation and presenting are all converted to the
root word present, which not only improves retrieval performance but can also reduce storage in some specific
applications.
II. Approaches To Conflation
To perform stemming, we have to conflate a word with its different variants. The conflation
approaches used in stemming algorithms are shown in Figure 1. Conflation of words, or stemming, can be
carried out in two ways: manually, using some kind of regular expression, or automatically. Automatic
techniques can be divided into four types, namely affix removal, successor variety, table lookup and n-gram.
Affix removal can be further divided into longest match and simple removal [8].
Fig1: Conflation Approach.
2.1 Affix Removal
Affix removal algorithms remove prefixes or suffixes from a word in order to reduce it to a
common base form. Most stemmers use this approach to conflation. These algorithms rely on two
principles. The first is iteration, which removes strings in each order class one at a time, starting at the end of
the word and working towards its beginning. No more than one match is allowed within a single order class.
Iteration is based on the fact that suffixes are attached to a word in a certain order, that is, there exist order
classes of suffixes. The second is longest match: within any given class of endings, if more than one ending
matches, the longest one should be removed [1].
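The longest-match principle can be sketched in a few lines of Python. This is a minimal illustration and not any published stemmer: the suffix list `SUFFIXES`, the function name `strip_longest_suffix` and the minimum stem length of 2 are assumptions chosen only for the example.

```python
# Hypothetical, tiny list of endings; a real stemmer uses a much larger one.
SUFFIXES = ["ational", "ations", "ation", "ions", "ion", "ing", "ed", "s"]


def strip_longest_suffix(word, suffixes=SUFFIXES, min_stem=2):
    """Remove the longest matching ending, keeping at least min_stem letters."""
    # Try longer endings first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word  # no ending matched


for w in ["corrections", "correcting", "corrected", "sing"]:
    print(w, "->", strip_longest_suffix(w))
```

With this toy list, "corrections" matches both "s" and "ions", and the longer ending "ions" is the one removed. The word "sing" is left unchanged because removing "ing" would leave a stem shorter than the assumed minimum.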
2.2 Successor Variety
The successor variety method [12] uses the frequencies of letter sequences in a body of text as the basis
of stemming. The successor variety of a string is the number of different characters that follow it in the words
of some body of text. Consider, for example, a body of text that consists of the terms match, mean, mood,
miasm and mobile. To determine the successor varieties (SV) for the word "machine", the following approach
is used. The first letter of machine is 'm', which is followed in the text body by a, i, o and e, so the successor
variety of "m" is 4. For the next SV we check which letters follow "ma" in the text body; only t does (in
match), so the successor variety of "ma" is 1. When this process is applied to a large body of text, the
successor variety of the substrings of a term decreases as more characters are added until a segment boundary
is reached, at which point it rises sharply. This information is used to identify the stem.
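The worked example above can be reproduced with a short Python sketch. The function names `successor_variety` and `sv_profile` are hypothetical, and the five-word corpus is the one used in the text; a real application would use a large body of text.

```python
def successor_variety(prefix, corpus):
    """Number of distinct letters that follow `prefix` in the corpus words."""
    return len({
        word[len(prefix)]
        for word in corpus
        if word.startswith(prefix) and len(word) > len(prefix)
    })


def sv_profile(word, corpus):
    """Successor variety of every prefix of `word`, shortest prefix first."""
    return [successor_variety(word[:i], corpus) for i in range(1, len(word) + 1)]


corpus = ["match", "mean", "mood", "miasm", "mobile"]
print(successor_variety("m", corpus))   # letters a, e, o, i follow "m"
print(successor_variety("ma", corpus))  # only t follows "ma"
print(sv_profile("machine", corpus))
```

On a corpus this small the profile drops to zero after "ma"; on a large corpus, a peak in the profile is taken as a hint of a segment boundary.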
2.3 Table Lookup Method
The table lookup method stores terms and their corresponding stems in a table. Terms from queries and
indexes can then be stemmed by looking them up in the table [6]. With a B-tree or a hash table such lookups
are fast, but the table incurs a storage overhead.
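A hash-table lookup of this kind can be sketched with a Python dictionary. The table contents and the name `lookup_stem` are assumptions for illustration; in practice the table would have to cover the whole vocabulary, which is the storage overhead mentioned above.

```python
# Hypothetical term-to-stem table; a dict is a hash table in Python.
STEM_TABLE = {
    "correct": "correct",
    "corrected": "correct",
    "correcting": "correct",
    "present": "present",
    "presented": "present",
    "presenting": "present",
}


def lookup_stem(term, table=STEM_TABLE):
    """Return the stored stem, or the lower-cased term itself if it is not listed."""
    key = term.lower()
    return table.get(key, key)


print(lookup_stem("Presented"))
print(lookup_stem("machine"))  # not in the table, returned unchanged
```

Returning unknown terms unchanged is one possible design choice; a system could equally fall back to an affix removal stemmer for words missing from the table.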
2.4 N-Gram Method
Another method of conflating terms, called the shared digram method, was proposed in 1974 by Adamson
and Boreham [9]. A digram is a pair of consecutive letters. Besides digrams we can also use trigrams, and
hence the approach is called the n-gram method [10]. With this approach, pairs of words are associated on the
basis of the unique digrams they share. This association is measured with Dice's coefficient [8]. For example,
the terms Correction and Corrective can be broken into digrams and trigrams as follows.
WORD        | DIGRAMS                          | TRIGRAMS
Correction  | *C,CO,OR,RR,RE,EC,CT,TI,IO,ON,N* | **C,*CO,COR,ORR,RRE,REC,ECT,CTI,TIO,ION,ON*,N**
Corrective  | *C,CO,OR,RR,RE,EC,CT,TI,IV,VE,E* | **C,*CO,COR,ORR,RRE,REC,ECT,CTI,TIV,IVE,VE*,E**
A           | 11                               | 12
B           | 11                               | 12
C           | 8                                | 8
Dice-Coeff. | 0.727                            | 0.667
Table 1: N-grams (* denotes a padding space)
Thus "Correction" has eleven digrams and twelve trigrams, all of which are unique, and "Corrective"
also has eleven digrams and twelve trigrams, all of which are unique. The two words share eight unique digrams
and eight unique trigrams.
Once the unique digrams and trigrams for the pair have been identified and counted, a similarity
measure based on them can be calculated. The similarity measure used is Dice's coefficient, which is given as:
S = (2C)/ (A + B)
where A is the number of unique n-grams in the first word, B is the number of unique n-grams in the
second word and C is the number of unique n-grams shared by the two words. For the example above, Dice's
coefficient is (2 * 8) / (11 + 11) = 0.727 for digrams and (2 * 8) / (12 + 12) = 0.667 for trigrams. Such
similarity measures are computed for all pairs of terms in the database, and the terms are then clustered into
groups on the basis of these similarities. The value of the Dice coefficient hints that the stem for this pair of
words lies in the first 8 unique n-grams.
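The figures in Table 1 can be checked with a small Python sketch. The function names `ngrams` and `dice` are hypothetical; the padding convention (n - 1 asterisks on each side) is inferred from the table and is an assumption of this example.

```python
def ngrams(word, n):
    """Set of unique n-grams of `word`, padded with n - 1 '*' on each side."""
    padded = "*" * (n - 1) + word.upper() + "*" * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice(word1, word2, n):
    """Dice's coefficient S = 2C / (A + B) over unique n-grams."""
    a, b = ngrams(word1, n), ngrams(word2, n)
    return 2 * len(a & b) / (len(a) + len(b))


for n, name in [(2, "digrams"), (3, "trigrams")]:
    a, b = ngrams("correction", n), ngrams("corrective", n)
    print(name, len(a), len(b), len(a & b), round(dice("correction", "corrective", n), 3))
```

Using sets keeps only unique n-grams, which matches the definitions of A, B and C; the set intersection gives the shared n-grams directly.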
III. Classification Of Stemmer
Stemming algorithms can be broadly classified into two types, rule based and statistical. Each type
has its own way of finding the stem. A rule based stemmer encodes language specific rules, whereas a statistical
stemmer uses statistical information from a large corpus of a given language to learn the morphology.
Fig2: Stemmer Classification.
3.1 Rule Based Stemmer
In a rule based approach, language specific rules are designed and stemming is performed according to
these rules. Various conditions are specified for converting a word to its stem, a list of all legitimate stems
is given, and some special rules handle the exceptional cases.
3.1.1 Porter Stemmer:
The standard Porter stemmer has five steps and sixty conditions. Many modifications of the standard
algorithm exist, and it is used for English document processing. The general rule for removing a suffix is given as:
(condition) S1 -> S2
Whenever the condition is fulfilled, suffix S1 is replaced by suffix S2. Conditions are usually expressed
through the measure m of the stem: a word is written as a sequence of consonant runs (C) and vowel runs (V)
in the form [C](VC)^m[V], and m counts the vowel-consonant (VC) pairs. Many rules are applied only when m
exceeds a threshold, for example m > 0 or m > 1 [5].
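The measure and the rule format can be illustrated with a minimal Python sketch. It assumes Porter's definitions, in which a word has the form [C](VC)^m[V] and the letter y counts as a vowel when it follows a consonant. The helper names `is_consonant`, `measure` and `apply_rule` are hypothetical, and only the single rule (m > 0) EED -> EE is shown, not the full five-step algorithm.

```python
def is_consonant(word, i):
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel after a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem):
    """Count the VC pairs (m) when the stem is written as [C](VC)^m[V]."""
    m = 0
    prev_is_vowel = False
    for i in range(len(stem)):
        cons = is_consonant(stem, i)
        if cons and prev_is_vowel:
            m += 1  # a vowel run has just been closed by a consonant
        prev_is_vowel = not cons
    return m


def apply_rule(word, s1, s2, min_measure):
    """Apply '(m > min_measure) S1 -> S2' to a lower-case word."""
    if word.endswith(s1):
        stem = word[: -len(s1)]
        if measure(stem) > min_measure:
            return stem + s2
    return word


print(measure("tree"), measure("trouble"), measure("troubles"))
print(apply_rule("agreed", "eed", "ee", 0), apply_rule("feed", "eed", "ee", 0))
```

The condition is tested on the part of the word before S1, which is why "feed" is left alone: its stem "f" has measure 0.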
3.1.2 Lovins Stemmer
The Lovins stemmer has 29 conditions and 35 transformation rules, and it performs a lookup on a table
of 294 endings. Stemming consists of two phases [7]. In the first phase, the algorithm obtains the stem of a
word by removing its longest possible ending, found by matching the word against the stored list of endings.
In the second phase, spelling exceptions are handled. For example, stripping the ending from "absorption"
gives the stem "absorpt", while "absorbing" gives the stem "absorb". The spelling exception problem arises
here when we try to match the two stems "absorpt" and "absorb". Such exceptions are handled by recoding
and partial matching techniques, which are applied in the stemmer as post-stemming procedures.
Rule based stemmers are fast, that is, the computation time needed to find a stem is small. Retrieval
results for English with a rule based stemmer are reasonable, but the drawback of rule based stemmers is that
extensive language expertise is needed to build them.
3.2 Statistical Stemmer
A statistical stemmer is a good alternative to a rule based stemmer and does not require language
expertise. Such stemmers use statistical information from a large corpus of a given language to learn the
morphology. Statistical language processing has been used successfully to improve the performance of
information retrieval systems for languages that lack extensive linguistic resources.
3.2.1 Yet Another Suffix Stripper (YASS)
Yet Another Suffix Stripper (YASS) is a statistical, language independent stemmer whose performance
is comparable with that of rule based stemmers in terms of average precision. The method uses a set of string
distance measures. A string distance measure checks the similarity between two words by calculating the
distance between the two strings. The distance function maps a pair of strings a and b to a real number r,
where a smaller value of r indicates greater similarity between a and b. The distance measures are designed
to reward a long matching prefix [4].
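To make the idea of a prefix-rewarding distance concrete, the Python sketch below weights a mismatch at position i by 1/2^i, so that early mismatches cost far more than late ones. It is only an illustration of the general idea: the function name `prefix_weighted_distance` and the exact weighting are assumptions of this example and are not claimed to be one of the measures defined for YASS.

```python
from itertools import zip_longest


def prefix_weighted_distance(a, b):
    """Sum of 1/2**i over positions i where a and b differ.

    The shorter string is padded with None, so every extra character of the
    longer string counts as a mismatch. Identical strings have distance 0.
    """
    return sum(
        1.0 / (2 ** i)
        for i, (x, y) in enumerate(zip_longest(a, b))
        if x != y
    )


print(prefix_weighted_distance("astronomer", "astronomically"))
print(prefix_weighted_distance("astronomer", "astonish"))
```

The first pair shares the prefix "astronom" and gets a much smaller distance than the second pair, which diverges after "ast". Words could then be clustered by such distances so that each cluster is treated as one stem class.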
3.2.2 Graph Based Stemmer (GRAS)
GRAS is a graph based, language independent stemmer for information retrieval. Its features are
retrieval effectiveness, simplicity and low computational cost. In GRAS [10], we first look for long common
prefixes among the word pairs in the document set. Consider a word pair W1 = P*S1 and W2 = P*S2, where
P is the longest common prefix of W1 and W2. The suffix pair (S1, S2) is considered valid if other word pairs
also have a common initial part followed by these suffixes, such that W'1 = P'*S1 and W'2 = P'*S2. (S1, S2)
is then a candidate suffix pair if a large number of word pairs are of this form. Two words are regarded as
morphologically related if they share a non-empty common prefix and their suffix pair is a valid candidate
suffix pair. Word relationships are modeled as a graph in which nodes represent the words and edges connect
related words. A pivot in GRAS is a node that is connected by edges to many other nodes. In the last step, a
word connected to a pivot is put in the same class as the pivot if it shares common neighbors with the pivot.
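The first stages described above (finding frequent suffix pairs and linking related words) can be sketched in Python as follows. This is a simplified illustration, not the published GRAS implementation: the function names, the thresholds `min_prefix` and `min_freq`, and the toy vocabulary are assumptions, and the final pivot-based class formation is omitted.

```python
from collections import Counter
from itertools import combinations


def common_prefix(a, b):
    """Longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


def candidate_suffix_pairs(words, min_prefix=3, min_freq=2):
    """Return (frequent suffix pairs, edges between morphologically related words)."""
    counts = Counter()
    pair_of = {}
    for w1, w2 in combinations(sorted(set(words)), 2):
        p = common_prefix(w1, w2)
        if len(p) < min_prefix:
            continue  # prefix too short to suggest a relation
        suffixes = tuple(sorted((w1[len(p):], w2[len(p):])))
        counts[suffixes] += 1
        pair_of[(w1, w2)] = suffixes
    frequent = {s for s, c in counts.items() if c >= min_freq}
    edges = {pair for pair, s in pair_of.items() if s in frequent}
    return frequent, edges


vocab = ["correct", "corrected", "correcting", "present", "presented",
         "presenting", "walk", "walked", "walking", "wall"]
frequent, edges = candidate_suffix_pairs(vocab)
print(sorted(frequent))
print(sorted(edges))
```

In this toy vocabulary the suffix pairs ("", "ed"), ("", "ing") and ("ed", "ing") each occur with three different prefixes and become candidates, while "wall" gets no edge because pairs such as ("k", "l") occur only once.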
IV. Stemming Error
There are fundamentally two kinds of error in stemming algorithms: over stemming and under
stemming [3]. Over stemming occurs when two words with different roots are reduced to the same stem; it is
also known as a false positive. Under stemming occurs when two words with the same root are not reduced to
the same stem; it is also known as a false negative. Paice [11] has shown that a light stemmer decreases over
stemming errors but increases under stemming errors. A heavy stemmer, on the other hand, reduces under
stemming errors while increasing over stemming errors.
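The two error types can be counted on a small hand-grouped word list, as in the Python sketch below. It is a deliberately simplified pair count, not Paice's evaluation measures; the word groups and the two toy stemmers `heavy` (truncate to 7 letters) and `light` (strip a final s) are assumptions made for illustration.

```python
from itertools import combinations


def stemming_errors(groups, stem):
    """Return (over, under): wrongly merged pairs and wrongly separated pairs."""
    labelled = [(w, g) for g, words in enumerate(groups) for w in words]
    over = under = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if same_stem and g1 != g2:
            over += 1   # different roots, same stem: false positive
        elif not same_stem and g1 == g2:
            under += 1  # same root, different stems: false negative
    return over, under


def heavy(word):
    return word[:7]  # aggressive toy stemmer


def light(word):
    return word[:-1] if word.endswith("s") else word  # cautious toy stemmer


groups = [["universe", "universal"], ["university", "universities"]]
print("heavy:", stemming_errors(groups, heavy))
print("light:", stemming_errors(groups, light))
```

On this example the heavy stemmer merges all four words and makes only over stemming errors, while the light stemmer keeps all four apart and makes only under stemming errors, which mirrors the trade-off described above.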
V. Conclusion
We studied a variety of stemming methods and found that stemming appreciably improves retrieval
results with both rule based and statistical approaches. It is also useful for reducing the size of index files and
of the feature set, because the words to be indexed are reduced to common forms called stems. Statistical
stemmers perform better than some well-known rule based stemmers, but they are more time consuming. A
rule based stemmer such as the Porter stemmer is a good choice for English document processing, but it is
language dependent.
References
[1]. Sandeep R. Sirsat, Dr. Vinay Chavan and Dr. Hemant S. Mahalle, Strength and Accuracy Analysis of Affix Removal Stemming
Algorithms, International Journal of Computer Science and Information Technologies, Vol. 4 (2), 2013, 265-269.
[2]. S.P. Ruba Rani, B. Ramesh, M. Anusha and Dr. J.G.R. Sathiaseelan, Evaluation of Stemming Techniques for Text Classification,
International Journal of Computer Science and Mobile Computing, Vol. 4, Issue 3, March 2015, 165-171.
[3]. Ms. Anjali Ganesh Jivani, A Comparative Study of Stemming Algorithms, International Journal Comp. Tech. Appl. (IJCTA), 2011,
Vol. 2 (6), 1930-1938, ISSN: 2229-6093.
[4]. Deepika Sharma, Stemming Algorithms: A Comparative Study and their Analysis, International Journal of Applied Information
Systems (IJAIS), Foundation of Computer Science, FCS, New York, USA, September 2012, ISSN: 2249-0868, Volume 4, No. 3.
[5]. M. F. Porter, An algorithm for suffix stripping, Program, 14(3), 1980, 130-137.
[6]. Wessel Kraaij and Renée Pohlmann, Porter's stemming algorithm for Dutch, UPLIFT (Utrecht Project: Linguistic Information for
Free Text retrieval) is sponsored by the NBBI, Philips Research, the Foundation for Language Technology.
[7]. J. B. Lovins, Development of a stemming algorithm, Mechanical Translation and Computational Linguistics, 11, 1968, 22-31.
[8]. W. B. Frakes, "Stemming Algorithms", in "Information Retrieval: Data Structures and Algorithms", Chapter 8, 1992, 132-139.
[9]. G. Adamson and J. Boreham, "The Use of an Association Measure Based on Character Structure to Identify Semantically
Related Pairs of Words and Document Titles", Information Storage and Retrieval, 10, 1974, 253-60.
[10]. J. H. Paik, Mandar Mitra, Swapan K. Parui and Kalervo Jarvelin, "GRAS: An effective and efficient stemming algorithm for
information retrieval", ACM Transactions on Information Systems, Volume 29, Issue 4, December 2011, Chapter 19, page 20-24.
[11]. Chris D. Paice, Another stemmer, ACM SIGIR Forum, Volume 24, No. 3, 1990, 56-61.
[12]. M. Hafer and S. Weiss, "Word Segmentation by Letter Successor Varieties", Information Storage and Retrieval, 10, 1974, 371-85.
[13]. Narayan L. Bhamidipati and Sankar K. Pal, Stemming via distribution based word segregation for classification and retrieval,
IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 37, No. 2, April 2007.

More Related Content

PDF
Fast and Accurate Spelling Correction Using Trie and Damerau-levenshtein Dist...
PDF
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
PDF
Designing of an efficient algorithm for identifying Abbreviation definitions ...
PDF
I6 mala3 sowmya
PPTX
PDF
Different Similarity Measures for Text Classification Using Knn
PDF
International Journal of Engineering Research and Development (IJERD)
PDF
Seeds Affinity Propagation Based on Text Clustering
Fast and Accurate Spelling Correction Using Trie and Damerau-levenshtein Dist...
A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING
Designing of an efficient algorithm for identifying Abbreviation definitions ...
I6 mala3 sowmya
Different Similarity Measures for Text Classification Using Knn
International Journal of Engineering Research and Development (IJERD)
Seeds Affinity Propagation Based on Text Clustering

What's hot (19)

PDF
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
PPTX
Text clustering
PDF
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
PDF
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
PDF
Compressing the dependent elements of multiset
PDF
poster
PDF
Farthest Neighbor Approach for Finding Initial Centroids in K- Means
PPTX
Boolean,vector space retrieval Models
DOCX
DOC
Discovering Novel Information with sentence Level clustering From Multi-docu...
PPTX
Basic Local Alignment Search Tool (BLAST)
PDF
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...
PDF
Information Retrieval using Semantic Similarity
PDF
Supervised WSD Using Master- Slave Voting Technique
PPTX
Summary distributed representations_words_phrases
PDF
2-IJCSE-00536
PDF
Text summarization
PDF
Optimizing Near-Synonym System
PPTX
Bioinformatics t5-database searching-v2013_wim_vancriekinge
AN ALGORITHM FOR OPTIMIZED SEARCHING USING NON-OVERLAPPING ITERATIVE NEIGHBOR...
Text clustering
Algorithm of Dynamic Programming for Paper-Reviewer Assignment Problem
A SURVEY ON SIMILARITY MEASURES IN TEXT MINING
Compressing the dependent elements of multiset
poster
Farthest Neighbor Approach for Finding Initial Centroids in K- Means
Boolean,vector space retrieval Models
Discovering Novel Information with sentence Level clustering From Multi-docu...
Basic Local Alignment Search Tool (BLAST)
COMPARISON OF HIERARCHICAL AGGLOMERATIVE ALGORITHMS FOR CLUSTERING MEDICAL DO...
Information Retrieval using Semantic Similarity
Supervised WSD Using Master- Slave Voting Technique
Summary distributed representations_words_phrases
2-IJCSE-00536
Text summarization
Optimizing Near-Synonym System
Bioinformatics t5-database searching-v2013_wim_vancriekinge
Ad

Similar to A survey of Stemming Algorithms for Information Retrieval (20)

PPT
Stemming is one of several text normalization techniques that converts raw te...
PPT
unit-4.ppt
PPT
unit 4.ppt
PDF
An Application of Pattern matching for Motif Identification
PDF
Proposed Method for String Transformation using Probablistic Approach
PDF
50120130405011
PDF
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
PDF
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
PDF
Visualizing stemming techniques on online news articles text analytics
PDF
Extractive Document Summarization - An Unsupervised Approach
PDF
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON
PDF
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
PDF
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
PDF
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
PDF
F017243241
PDF
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
PDF
Improvement of Text Summarization using Fuzzy Logic Based Method
PDF
Cohesive Software Design
PDF
Measuring word alignment_quality_for_statistical_machine_translation_tcm17-29663
PDF
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
Stemming is one of several text normalization techniques that converts raw te...
unit-4.ppt
unit 4.ppt
An Application of Pattern matching for Motif Identification
Proposed Method for String Transformation using Probablistic Approach
50120130405011
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
A COMPARISON OF DOCUMENT SIMILARITY ALGORITHMS
Visualizing stemming techniques on online news articles text analytics
Extractive Document Summarization - An Unsupervised Approach
SIMILAR THESAURUS BASED ON ARABIC DOCUMENT: AN OVERVIEW AND COMPARISON
A COMPARATIVE STUDY OF ROOT-BASED AND STEM-BASED APPROACHES FOR MEASURING THE...
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
F017243241
An Enhanced Suffix Tree Approach to Measure Semantic Similarity between Multi...
Improvement of Text Summarization using Fuzzy Logic Based Method
Cohesive Software Design
Measuring word alignment_quality_for_statistical_machine_translation_tcm17-29663
AN EFFICIENT APPROACH TO IMPROVE ARABIC DOCUMENTS CLUSTERING BASED ON A NEW K...
Ad

More from iosrjce (20)

PDF
An Examination of Effectuation Dimension as Financing Practice of Small and M...
PDF
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
PDF
Childhood Factors that influence success in later life
PDF
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
PDF
Customer’s Acceptance of Internet Banking in Dubai
PDF
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
PDF
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
PDF
Student`S Approach towards Social Network Sites
PDF
Broadcast Management in Nigeria: The systems approach as an imperative
PDF
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
PDF
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
PDF
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
PDF
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
PDF
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
PDF
Media Innovations and its Impact on Brand awareness & Consideration
PDF
Customer experience in supermarkets and hypermarkets – A comparative study
PDF
Social Media and Small Businesses: A Combinational Strategic Approach under t...
PDF
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
PDF
Implementation of Quality Management principles at Zimbabwe Open University (...
PDF
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
An Examination of Effectuation Dimension as Financing Practice of Small and M...
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Childhood Factors that influence success in later life
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Customer’s Acceptance of Internet Banking in Dubai
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Student`S Approach towards Social Network Sites
Broadcast Management in Nigeria: The systems approach as an imperative
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Media Innovations and its Impact on Brand awareness & Consideration
Customer experience in supermarkets and hypermarkets – A comparative study
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Implementation of Quality Management principles at Zimbabwe Open University (...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...

Recently uploaded (20)

PPT
Mechanical Engineering MATERIALS Selection
PPTX
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Sustainable Sites - Green Building Construction
PPTX
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
DOCX
573137875-Attendance-Management-System-original
PDF
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
PDF
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
PPTX
Geodesy 1.pptx...............................................
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
PPTX
web development for engineering and engineering
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPTX
bas. eng. economics group 4 presentation 1.pptx
PPTX
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
PPTX
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PPT
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
PDF
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...
Mechanical Engineering MATERIALS Selection
Engineering Ethics, Safety and Environment [Autosaved] (1).pptx
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Sustainable Sites - Green Building Construction
CARTOGRAPHY AND GEOINFORMATION VISUALIZATION chapter1 NPTE (2).pptx
573137875-Attendance-Management-System-original
PRIZ Academy - 9 Windows Thinking Where to Invest Today to Win Tomorrow.pdf
SM_6th-Sem__Cse_Internet-of-Things.pdf IOT
Geodesy 1.pptx...............................................
Embodied AI: Ushering in the Next Era of Intelligent Systems
web development for engineering and engineering
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
bas. eng. economics group 4 presentation 1.pptx
IOT PPTs Week 10 Lecture Material.pptx of NPTEL Smart Cities contd
MCN 401 KTU-2019-PPE KITS-MODULE 2.pptx
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
CRASH COURSE IN ALTERNATIVE PLUMBING CLASS
Foundation to blockchain - A guide to Blockchain Tech
TFEC-4-2020-Design-Guide-for-Timber-Roof-Trusses.pdf
July 2025 - Top 10 Read Articles in International Journal of Software Enginee...

A survey of Stemming Algorithms for Information Retrieval

  • 1. IOSR Journal of Computer Engineering (IOSR-JCE) e-ISSN: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 3, Ver. VI (May – Jun. 2015), PP 76-80 www.iosrjournals.org DOI: 10.9790/0661-17367680 www.iosrjournals.org 76 | Page A survey of Stemming Algorithms for Information Retrieval Brajendra Singh Rajput1 , Dr. Nilay Khare2 1,2 (Computer Science &Engineering, Maulana Azad National Institute of Technology, India) Abstract:Now a day’s text documents is advancing over internet, e-mails and web pages. As the use of internet is exponentially growing, the need of massive data storage is increasing. Normally many of the documents contain morphological variables, so stemming which is a preprocessing technique gives a mapping of different morphological variants of words into their base word called the stem. Stemming process is used in information retrieval as a way to improve retrieval performance based on the assumption that terms with the same stem usually have similar meaning. To do stemming operation on large data, we require normally more computation time and power, to cope up with the need to search for a particular word in the data. In this paper, various stemming algorithms are analyzed with the benefits and limitation of the recent stemming technique. Keywords - Information Retrieval, NLP, stemming technique, Decision based method, Statistical method. I. Introduction In Information Retrieval systems the main thing is to improve recall while keeping a good precision. A recall increasing method which can be useful for even the simplest Boolean retrieval systems is stemming. Information finder who is looking for texts say dogs is probably interested in the texts which consist of the term dog [6]. The capacity of the search database has increased in the last few years, so in order to meet the challenge of real time search NLP algorithms speed up required. 
Natural language texts typically consist of many different syntactic variants for example corrected, correct, correcting, correction, correctly, correctness, correctively, correctional, corrective, correctable (adjective), corrector (noun) all are derived word of root word correct [1]. The conventional approach used to extract data for some user query is to search the documents present in the corpus word by word for the given query. This approach is very time taking and it may leave some of the equivalent documents of equal nature. Thus to avoid these situations, Stemming has been extensively used in various Information Retrieval Systems to increase the extracting accuracy [4]. All documents which contain word with same stem as the query term are relevant, Stemming cut down the size of the feature set. In text mining, stemming can be viewed as clustering in pattern recognition, feature reducibility. In rule based reasoning, the main purpose is to choose maximum representative feature, dimensions base on similarity measurement [13]. The derived words present, presented, presentation and presenting are converted to root word present, through which not only retrieval performance improve but also storage can be optimized in some specific applications. II. Approaches Of Conflations In order to perform stemming operation, we have to conflate a word to its different variants. Conflations approaches which are used in stemming algorithms are shown in figure 1. The conflation of words or Stemming can be executed in two ways, either manually using the kinds of regular expression or automatic. Automatic technique can be divided into four types namely affix removal, successor, table lookup and n gram. Affix removal can be further divided into two ways one is longest match and another is a simple removal [8].
  • 2. A survey on Stemming Algorithms for Information Retrieval DOI: 10.9790/0661-17367680 www.iosrjournals.org 77 | Page Fig1: Conflation Approach. 2.1 Affix Removal The affix removal algorithms eliminate prefix or suffix from word in order to reduce word into common base. Most of stemmer used this type of approach for conflation. These algorithms depend on two principles one is iteration, which removes strings in each order class one at a time, starting at the end of a word and going towards its beginning. Not more than one match is allowed in a single order class. The suffix is added to a word in any random order, that is, there exist order classes of suffix. The longest match is second type in which within any given class of endings, if more than one ending gives a match then longest match should be eliminated [1]. 2.2 Successor Variety In successor variety method [12], frequencies of letter sequences in a body of text as the basis of stemming. The successor variety of a string is the number of different characters that follows it in word in some body of text. Consider text pattern which consists of the following terms for example, match, mean, mood, miasm, mobile .For estimating the successor variety (SV) for “machine" suppose, the following approach is used. The earliest letter of machine is 'm' which is accompany by a, i, o, e so successor variety of m is 4,for the next SV of machine we have to check that “ma” in machine is followed by which terms in the text body, so next SV of machine is 1 because t come next in match for machine. When this process is applied on a large body of text the successor variety of the substring of term will reduces as more character are added until a segment boundary is reached. So this idea is used to get the stem. 2.3 Table Lookup Method Table lookup method is done by looking at the table where the term stems and their Corresponding stored. 
Terms from queries and indexes can then be stemmed by looking them up in the table [6]. With a B-tree or hash table, such lookups are fast, but the table incurs a storage overhead.

2.4 N-Gram Method
Another method of conflating terms is the shared digram method, proposed in 1974 by Adamson and Boreham [9]. A digram is a pair of consecutive letters. Besides digrams, we can also use trigrams, and hence the approach is called the n-gram method [10]. With this approach, pairs of words are associated on the basis of the unique digrams they share. This association is measured using Dice's coefficient [8]. For example, the terms Correction and Corrective can be broken into digrams and trigrams as follows.

WORD         DIGRAMS                             TRIGRAMS
Correction   *C,CO,OR,RR,RE,EC,CT,TI,IO,ON,N*    **C,*CO,COR,ORR,RRE,REC,ECT,CTI,TIO,ION,ON*,N**
Corrective   *C,CO,OR,RR,RE,EC,CT,TI,IV,VE,E*    **C,*CO,COR,ORR,RRE,REC,ECT,CTI,TIV,IVE,VE*,E**

             DIGRAMS   TRIGRAMS
A            11        12
B            11        12
C            8         8
Dice-Coeff.  0.727     0.667

Table 1: N-grams (* denotes padding space)
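The counts in Table 1 can be reproduced with a short Python sketch. It assumes that a word is padded with n-1 padding characters on each side, which matches the padded entries in the table, and it computes Dice's coefficient as 2C / (A + B), where A and B are the numbers of unique n-grams in the two words and C is the number they share. The function names unique_ngrams and dice_coefficient are illustrative and do not come from the cited papers.

```python
def unique_ngrams(word, n, pad="*"):
    """Return the set of unique character n-grams of `word`.

    The word is upper-cased and padded with n-1 pad characters on each side,
    which is the padding convention assumed from Table 1.
    """
    padded = pad * (n - 1) + word.upper() + pad * (n - 1)
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def dice_coefficient(first, second, n):
    """Dice's coefficient S = 2C / (A + B) over unique n-grams."""
    a = unique_ngrams(first, n)
    b = unique_ngrams(second, n)
    shared = a & b  # n-grams present in both words
    return 2 * len(shared) / (len(a) + len(b))


for n, label in ((2, "digrams"), (3, "trigrams")):
    score = dice_coefficient("Correction", "Corrective", n)
    print(f"{label}: {score:.3f}")
# digrams: 0.727
# trigrams: 0.667
```

Using sets keeps only unique n-grams, so repeated n-grams inside a word are counted once, as the definition of A, B and C requires.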
Thus "Correction" has eleven digrams and twelve trigrams, all of which are unique, and "Corrective" also has eleven digrams and twelve trigrams, all of which are unique. The two words share eight unique digrams and eight unique trigrams. Once the unique digrams and trigrams for the pair have been identified and counted, a similarity measure based on them can be calculated. The similarity measure used is Dice's coefficient, defined as

S = 2C / (A + B)

where A is the number of unique n-grams in the first word, B is the number of unique n-grams in the second word and C is the number of unique n-grams shared by the two words. For the example above, Dice's coefficient equals (2 * 8) / (11 + 11) = 0.727 for digrams and (2 * 8) / (12 + 12) = 0.667 for trigrams. Such similarity measures are computed for all pairs of terms in the database, and the terms are then clustered into groups. The value of the Dice coefficient suggests that the stem for this pair of words lies within the first eight unique n-grams.

III. Classification Of Stemmer
Stemming algorithms can be broadly classified into two types, rule based and statistical. Each type has its own way of finding the stem. A rule based stemmer encodes language specific rules, whereas a statistical stemmer uses statistical information from a large corpus of a given language to learn the morphology.

Fig2: Stemmer Classification.

3.1 Rule Based Stemmer
In a rule based approach, language specific rules are designed and stemming is performed according to them. Various conditions are specified for converting a word to its stem, a list of all legitimate stems is given, and special rules handle the exceptional cases.
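As a rough illustration of the rule based idea, the following Python sketch combines a small suffix rule table with an exception list. The rules, the minimum stem lengths and the exception entries are all assumptions made up for this example; they are not taken from any published stemmer, and a real rule set would be far larger.

```python
# Hypothetical rule table: (suffix, replacement, minimum length of remaining stem).
RULES = [
    ("ational", "ate", 2),
    ("ing", "", 3),
    ("ed", "", 3),
    ("ly", "", 3),
    ("s", "", 3),
]

# Hypothetical exception list for forms the rules would mishandle.
EXCEPTIONS = {"went": "go", "news": "news"}


def rule_based_stem(word):
    """Apply the first applicable rule, trying longer suffixes first."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement, min_stem_len in sorted(RULES, key=lambda r: -len(r[0])):
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # The condition guards against stripping very short words.
            if len(stem) >= min_stem_len:
                return stem + replacement
    return word


print([rule_based_stem(w) for w in ("presenting", "presented", "relational", "went", "is")])
# ['present', 'present', 'relate', 'go', 'is']
```

Checking the exception list before the rules keeps irregular forms out of the general path, which is one simple way to realise the special rules for exceptional cases mentioned above.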
3.1.1 Porter Stemmer
The standard Porter stemmer has five steps and sixty conditions. There are many modifications of the standard algorithm, and it is used for English document processing. The general rule for removing a suffix is written as

(condition) S1 -> S2

Whenever the condition is fulfilled, suffix S1 is replaced by suffix S2. The Porter stemmer views a word as a sequence of consonant runs (C) and vowel runs (V), and the measure m of a word counts the number of vowel-consonant (VC) sequences in it. Many rules are applied only when the measure of the remaining stem exceeds a given threshold, for example m > 1 [5].

3.1.2 Lovins Stemmer
The Lovins stemmer has 29 conditions and 35 transformation rules, and it performs a lookup on a table of 294 endings. Stemming consists of two phases [7]. In the first phase, the algorithm obtains the stem of a word by removing its longest possible ending, found by matching against the stored list of endings. In the second phase, spelling exceptions are handled. For example, the word "absorption" yields the stem "absorpt", while "absorbing" yields the stem "absorb". The spelling exception problem arises when we try to match the two stems "absorpt" and "absorb". Such exceptions are handled by applying recoding and partial matching techniques as post stemming procedures.

Rule based stemmers are fast, that is, the computation time needed to find a stem is small. The retrieval results obtained for English with a rule based stemmer are reasonable, but building such a stemmer requires extensive language expertise.

3.2 Statistical Stemmer
Statistical stemmers are a good alternative to rule based stemmers and do not require language expertise. They use statistical information from a large corpus of a given language to learn its morphology.
Statistical language processing has been used successfully to improve the performance of information retrieval systems for languages that lack extensive linguistic resources.

3.2.1 Yet Another Suffix Stripper (YASS)
YASS is a statistical, language independent stemmer whose performance is comparable with that of rule based stemmers in terms of average precision. The method uses a set of string distance measures. A string distance measure checks the similarity between two words by computing the distance between the two strings. The distance function maps a pair of strings a and b to a real number r, where a smaller value of r indicates greater similarity between a and b. The main reason for computing this distance is to find the longest matching prefix [4].

3.2.2 Graph Based Stemmer (GRAS)
GRAS is a graph based, language independent stemmer for information retrieval. Its features are retrieval effectiveness, simplicity and low computational cost. GRAS [10] first looks for long common prefixes among the word pairs in the document collection. Consider two words W1 = P*S1 and W2 = P*S2, where P is the longest common prefix of W1 and W2. The suffix pair (S1, S2) is considered valid if other word pairs also have a common initial part followed by these suffixes, such that W'1 = P'*S1 and W'2 = P'*S2. If a large number of word pairs are of this form, then (S1, S2) is a candidate suffix pair. Two words are then regarded as morphologically related if they share a non-empty common prefix and their suffix pair is a valid candidate suffix pair. Word relationships are modelled as a graph in which nodes represent words and edges connect related words. In GRAS, a pivot is a node that is connected by edges to many other nodes. In the last step, a word connected to a pivot is put in the same class as the pivot if it shares common neighbours with the pivot.

IV. Stemming Error
There are two fundamental kinds of error in stemming algorithms: over stemming and under stemming [3]. Over stemming occurs when two words with different roots are reduced to the same stem; this is also known as a false positive. Under stemming occurs when two words with the same root are not reduced to the same stem; this is also known as a false negative. Paice [11] has shown that light stemmers reduce over stemming errors but increase under stemming errors, whereas heavy stemmers reduce under stemming errors while increasing over stemming errors.
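The two error types can be made concrete by counting word pairs. The Python sketch below assumes a hand-made grouping of words by root and two toy stemmers, heavy and light, invented purely for illustration; it is not the evaluation procedure of any cited paper. A pair from different groups that receives the same stem counts as over stemming, and a pair from the same group that receives different stems counts as under stemming.

```python
from itertools import combinations


def count_stemming_errors(groups, stem):
    """Return (over_stemmed_pairs, under_stemmed_pairs).

    `groups` is a list of word lists; words in one list share a root.
    `stem` is any callable mapping a word to its stem.
    """
    labelled = [(word, gid) for gid, words in enumerate(groups) for word in words]
    over = under = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 != g2 and same_stem:
            over += 1   # false positive: unrelated words conflated
        elif g1 == g2 and not same_stem:
            under += 1  # false negative: related words kept apart
    return over, under


def heavy(word):
    """Toy heavy stemmer (assumption): keep only the first four letters."""
    return word[:4]


def light(word):
    """Toy light stemmer (assumption): strip a single trailing 's'."""
    return word[:-1] if word.endswith("s") else word


groups = [
    ["general", "generally", "generalize"],
    ["generate", "generates", "generation"],
]
print(count_stemming_errors(groups, heavy))  # (9, 0)
print(count_stemming_errors(groups, light))  # (0, 5)
```

On this toy data the heavy stemmer produces only over stemming errors and the light stemmer only under stemming errors, which mirrors the trade-off described above.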
V. Conclusion
We studied a variety of stemming methods and found that stemming appreciably improves retrieval results for both rule based and statistical approaches. It is also useful in reducing the size of index files and of the feature set, since the words to be indexed are reduced to common forms called stems. Statistical stemmers perform far better than some well-known rule based stemmers, but they are time consuming. A rule based stemmer such as the Porter stemmer is a good choice for English document processing, but it is language dependent.

References
[1]. Sandeep R. Sirsat, Dr. Vinay Chavan and Dr. Hemant S. Mahalle, Strength and Accuracy Analysis of Affix Removal Stemming Algorithms, International Journal of Computer Science and Information Technologies, Vol. 4 (2), 2013, 265-269.
[2]. S.P. Ruba Rani, B. Ramesh, M. Anusha and Dr. J.G.R. Sathiaseelan, Evaluation of Stemming Techniques for Text Classification, International Journal of Computer Science and Mobile Computing, Vol. 4, Issue 3, March 2015, 165-171.
[3]. Ms. Anjali Ganesh Jivani, A Comparative Study of Stemming Algorithms, International Journal Comp. Tech. Appl. (IJCTA), Vol. 2 (6), 2011, 1930-1938, ISSN: 2229-6093.
[4]. Deepika Sharma, Stemming Algorithms: A Comparative Study and their Analysis, International Journal of Applied Information Systems (IJAIS), Foundation of Computer Science, FCS, New York, USA, Volume 4, No. 3, September 2012, ISSN: 2249-0868.
[5]. Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14(3):130-137.
[6]. Wessel Kraaij and Renée Pohlmann, Porter's stemming algorithm for Dutch, UPLIFT (Utrecht Project: Linguistic Information for Free Text retrieval), sponsored by the NBBI, Philips Research, the Foundation for Language Technology.
[7]. Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11:22-31.
[8]. W.B. Frakes, 1992, "Stemming Algorithm", in "Information Retrieval Data Structures and Algorithm", Chapter 8, pages 132-139.
[9]. G. Adamson and J. Boreham, 1974. "The Use of an Association Measure Based on Character Structure to Identify Semantically Related Pairs of Words and Document Titles," Information Storage and Retrieval, 10, 253-60.
[10]. J.H. Paik, Mandar Mitra, Swapan K. Parui and Kalervo Jarvelin, "GRAS, An effective and efficient stemming algorithm for information retrieval", ACM Transaction on Information System, Volume 29, Issue 4, December 2011, Chapter 19, pages 20-24.
[11]. Paice, Chris D., Another stemmer, ACM SIGIR Forum, Volume 24, No. 3, 1990, 56-61.
[12]. M. Hafer and S. Weiss, 1974. "Word Segmentation by Letter Successor Varieties," Information Storage and Retrieval, 10, 371-85.
[13]. Narayan L. Bhamidipati and Sankar K. Pal, Stemming via distribution based word segregation for classification and retrieval, IEEE Transaction on System, Man, and Cybernetics, Part B: Cybernetics, Vol. 37, No. 2, April 2007.