International Journal on Natural Language Computing (IJNLC) Vol. 3, No.1, February 2014
DOI: 10.5121/ijnlc.2014.3102
HPS: A HIERARCHICAL PERSIAN STEMMING METHOD
Ayshe Rashidi and Mina Zolfy Lighvan
Department of Electrical and Computer Engineering, Tabriz University, Tabriz, Iran
ABSTRACT
In this paper, a novel hierarchical Persian stemming approach based on the part of speech (POS) of the word in a sentence is presented. The implemented stemmer uses hash tables and several deterministic finite automata (DFAs) at different levels of its hierarchy to remove prefixes and suffixes from words. We use hash tables for two reasons. First, the DFAs do not handle some special words, and the hash tables partly solve this problem. Second, words found in a hash table skip the DFAs, which speeds up the stemmer. Because of its hierarchical organization, the method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and security news from the ICTna.ir site show that the method has an average accuracy of 95.37%, which improves further when the method is applied to a test set with a common topic.
KEYWORDS
Stemming, morphology, DFA machine, hash table, POS tags, hierarchical
1. INTRODUCTION
Nowadays, people are surrounded by a huge amount of information, especially with the development of the internet. Hence, over the years many techniques have been developed to help people manage and process the information they need. Many research themes in artificial intelligence have emerged in this environment, for example information retrieval (IR), information extraction, information filtering, machine translation and question answering. Unfortunately, the words that appear in documents and in queries often have many morphological variants. In most cases, morphological variants of a word have similar semantic interpretations and can be considered equivalent for IR applications. However, pairs of terms such as "connect" and "connection" will not be recognized as equivalent without some form of natural language processing (NLP). So, before retrieval, stemming is applied to the target data set as an essential step. Stemming reduces the size of the data set, and a smaller data set or dictionary saves storage space and processing time, which improves the performance of the IR system. There are several types of stemming algorithms, which differ in performance and accuracy. In this paper, we briefly describe some of them and then present our proposed method.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives a brief background of the Persian language. Section 4 describes our stemming method. Section 5 presents the experimental results of our method, and Section 6 gives our conclusions and suggestions.
2. RELATED WORKS
The most frequently used stemming methods are affix removal, look-up tables and statistical methods [1]. Affix removal depends on the morphological structure of the language: stemming is done by removing morphemes from a word. The Porter algorithm [2] is an example of this category. It is composed of five steps, during which the more common affixes are removed using special rules. Another example is the Krovetz stemmer [3], which uses a stemming procedure based on both inflectional and derivational suffixes, in which the suffix stripping process is controlled by an English dictionary.
In the look-up table method, each word and its stem are stored in look-up tables, so the stem of every stored word can be found directly. This method needs a large storage space, and its tables must be updated manually for each new word.
In the statistical methods, rules are formulated from the arrangement of words, using a process based on sets. n-grams [4], link analysis [5] and Hidden Markov Models [6] are examples of models that have been used in statistical stemming methods.
In general, many studies of stemming performance have been reported for English in different fields, but far fewer for less widely studied languages. For French, for example, Savoy [7] proposes a suffixing algorithm based on grammatical categories, and Savoy [8] presents another stemming procedure based on only a few general morphological rules. The latter approach corresponds to the English "S stemmer", which conflates singular and plural word forms [9]. Tomlinson [10] evaluated the differences between Porter's stemming strategy [2] and lexical stemmers (based on a dictionary of the corresponding language) for various European languages. For Finnish and German, the lexical stemmer tends to produce statistically better results, while for seven other languages the performance differences were insignificant [11].
Two major stemming algorithms have been proposed for the Persian language. The first was proposed by Kazem Taghva, Russell Beckley and Mohammad Sadeh in 2005 [12]. This method is inspired by the Porter algorithm for English [2] and removes suffixes and prefixes using Persian morphology. It is implemented as a DFA with 40 states that removes the suffixes and prefixes from words. Its problems include a limited number of suffixes and low speed. The second algorithm, designed by GholamReza Ghasem Sani and Reza Hesamifard [13], is based on a database (dictionary) of all the stems of the language. The input word is first searched in the database. If it is found, the stem is returned; otherwise, the suffixes and prefixes are removed and the result is searched in the database again. The disadvantages of this method are that the database must be updated frequently and requires a large storage space.
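The dictionary-based procedure described for [13] can be sketched as follows. This is only an illustration of the "look up, otherwise strip affixes and look up again" idea, not the algorithm of [13]: the stem set, the affix lists and the function name `dictionary_stem` are hypothetical, and words are given in Latin transliteration.

```python
# Toy stem dictionary in Latin transliteration (hypothetical data).
STEMS = {"deraxt", "dast", "xub"}
PREFIXES = ("mi", "na", "be")            # illustrative only
SUFFIXES = ("hã", "ãn", "tar", "am")     # illustrative only


def dictionary_stem(word):
    """Return a stem found in STEMS, or None if no candidate matches."""
    # Step 1: the word itself may already be a stem.
    if word in STEMS:
        return word
    # Step 2: candidates with an optional prefix removed (the word included).
    heads = [word] + [word[len(p):] for p in PREFIXES if word.startswith(p)]
    for head in heads:
        if head in STEMS:
            return head
        # Step 3: remove one suffix and search the dictionary again.
        for suffix in SUFFIXES:
            if head.endswith(suffix) and head[: -len(suffix)] in STEMS:
                return head[: -len(suffix)]
    return None


print(dictionary_stem("deraxtãn"))   # deraxt
print(dictionary_stem("ketãb"))      # None: the stem is not in the dictionary
```

The second call shows the weakness noted above: a word whose stem is missing from the dictionary cannot be stemmed until the dictionary is updated.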
3. PERSIAN LANGUAGE
The Persian language belongs to the Indo-European languages and is spoken and written primarily in Iran, Afghanistan and a part of Tajikistan. It is a right-to-left language written in a modified Arabic script, which contains the 28 Arabic letters and four more characters (گ چ پ ژ) that express sounds not present in Classical Arabic. In Persian, verbs encode tense, person, mood and form (negative or positive). For example, the verb “می‌سازم” (mi-sazam: I make) is a present tense verb consisting of three morphemes: “م” (am) is a suffix denoting the first person singular, “ساز” (saz) is the present tense root of the verb, and “می” (mi) is a prefix that expresses continuity.
The negative form of a verb is produced by adding “ن” (ne) to its beginning. For example, “نمی‌سازم” (ne-mi-saz-am: I don’t make) is the negative form of the verb “می‌سازم” (misazam: I make). There are certain rules for forming verbs in Persian. A subset of these rules is shown in Table 1.
Table 1 Some morphological rules for verbs in Persian Language
| Tense | Type | Structure | Example |
|---|---|---|---|
| Past | Simple | past root + past person identifier | Neveştam = neveşt + am (نوشتم) |
| Past | Continuous | mi + past root + past person identifier | Mineveştam = mi + neveşt + am (می‌نوشتم) |
| Past | Present perfect | past root + ‘h’ + present perfect person identifier | Neveşteam = neveşt + e + am (نوشته‌ام) |
| Past | Unlikely | past root + ‘h’ + bud + past person identifier | Neveşte budam = neveşt + e + bud + am (نوشته بودم) |
| Past | Implicit | past root + ‘h’ + baş + present person identifier | Neveşte başam = neveşt + e + baş + am (نوشته باشم) |
| Future | -- | xãh + present person identifier + past root | Xãham neveşt = xãh + am + neveşt (خواهم نوشت) |
| Present | Simple | present root + present person identifier | Nevisam = nevis + am (نویسم) |
| Present | Declarative | mi + present root + present person identifier | Minevisam = mi + nevis + am (می‌نویسم) |
| Present | Implicit | ‘b’ + present root + present person identifier | Benevisam = be + nevis + am (بنویسم) |
| Present | Imperative | ‘b’ + present root | Benevis = be + nevis (بنویس) |
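The rules of Table 1 can be read as simple concatenation patterns. The following sketch builds the first person singular forms of Table 1 in Latin transliteration from a past root and a present root. The function name `table1_forms` and the dictionary keys are our own, and the sketch assumes the first person singular identifier "am" for every tense, which matches the examples in the table but would not hold for the other persons.

```python
def table1_forms(past_root, present_root, person="am"):
    """First person singular forms built from the patterns of Table 1."""
    participle = past_root + "e"   # past root + 'h' (transliterated as e)
    return {
        ("past", "simple"): past_root + person,
        ("past", "continuous"): "mi" + past_root + person,
        ("past", "present perfect"): participle + person,
        ("past", "unlikely"): participle + " bud" + person,
        ("past", "implicit"): participle + " baş" + person,
        ("future", "future"): "xãh" + person + " " + past_root,
        ("present", "simple"): present_root + person,
        ("present", "declarative"): "mi" + present_root + person,
        ("present", "implicit"): "be" + present_root + person,
        ("present", "imperative"): "be" + present_root,
    }


forms = table1_forms("neveşt", "nevis")
for (tense, kind), form in forms.items():
    print(f"{tense:8} {kind:16} {form}")
```

A stemmer has to undo these patterns, which is why a verb can carry both a prefix (mi, be) and one or more suffixes at the same time.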
Persian has many rules for forming nouns. In general, the plural forms of nouns are formed by adding the suffixes (ها، ان، ات، ین، ون). “ها” (hã) is used for all words. “ان” (ãn) is used for humans, animals and everything that is alive. “ات، ون، ین” (ãt, un, in) are used for some words borrowed from Arabic and for some Persian words. There is another kind of plural in Persian, called Mokassar, which is a derivational plural form (the irregular plurals of Persian); many of these plurals are borrowed from Arabic. Some examples of plural forms are shown in Table 2.
Table 2 Some Morphological Rules for Nouns in Persian Language
| Type | Suffixes | Word structure: Word = Word Stem + suffixes | Persian example |
|---|---|---|---|
| Plural | ان (ãn) | Deraxtãn = deraxt + ãn | درختان (trees) = درخت + ان |
| Plural | ها (hã) | Dasthã = dast + hã | دست‌ها (hands) = دست + ها |
| Plural | ات (ãt) | Nabãtãt = nabãt + ãt | نباتات (plants) = نبات + ات |
| Plural | ین (in), ون (un) | Mo’alemin = mo’alem + in | معلمین (teachers) = معلم + ین |
| Possession | ت (at), م (am), ش (aş) | Dastam = dast + am | دستم (my hand) = دست + م |
| Possession | مان (mãn), تان (tãn), شان (şãn) | Dastemãn = dast + mãn | دستمان (our hand) = دست + مان |
| Others | ی (i), ه (h), ک (k) | Xubi = xub + i | خوبی (goodness) = خوب + ی |
| Others | یت (yat), چه (ĉe), چی (ĉi) | Jam’yat = jam’ + yat | جمعیت (population) = جمع + یت |
| Others | بان (bãn), دان (dãn), زار (zãr) | Bãghbãn = bãgh + bãn | باغبان (gardener) = باغ + بان |
| Others | واره (wãre) | Guşwãre = guş + wãre | گوشواره (earring) = گوش + واره |
Some orthographic rules govern what happens when affixes are joined to certain words. For example, consider a plural word consisting of two parts A and B. If the last letter of A and the first letter of B are both “ا” (ã), the letter “ی” (y) is inserted between them. If A is “دانا” (dãnã: wise) and B is “ان” (ãn), the result of joining them is “دانایان” (dãnã-yãn: wise people).
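The joining rule above, and the inverse step a stemmer needs, can be sketched in a few lines. The sketch works on Latin transliteration and covers only the single case described in the text (two adjacent ã letters); the function names `join_plural` and `split_plural` are hypothetical.

```python
def join_plural(stem, suffix="ãn"):
    """Join stem and suffix, inserting 'y' between two adjacent 'ã' letters."""
    if stem.endswith("ã") and suffix.startswith("ã"):
        return stem + "y" + suffix
    return stem + suffix


def split_plural(word, suffix="ãn"):
    """Inverse step: remove the suffix, and the linking 'y' if it was inserted."""
    if word.endswith("ãy" + suffix):
        return word[: -(len(suffix) + 1)]
    if word.endswith(suffix):
        return word[: -len(suffix)]
    return word


print(join_plural("dãnã"))        # dãnãyãn
print(split_plural("dãnãyãn"))    # dãnã
print(split_plural("deraxtãn"))   # deraxt
```

Without the special case in `split_plural`, removing only "ãn" from "dãnãyãn" would leave the linking letter attached to the stem.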
An adjective is a word or group of words that appears before or after a noun and describes a feature or concept of it. Adjectives have different types, such as simple, nominative, participle, relative and merit. Here, we categorize them by the number of letters in their suffixes, because our method is based on morphology. Some common types of adjectives are presented in Table 3.
Table 3 Some Morphological Rules for Adjectives in Persian Language
| Adjective suffixes | Word structure: Word = Word Stem + suffixes | Persian example |
|---|---|---|
| ا (ã), ی (i), ه (h) | Dãrã = dãr + ã | دارا (wealthy) = دار + ا |
| تر (tar), گر (gar), ین (in), ار (ãr) | Xubtar = xub + tar | خوبتر (better) = خوب + تر |
| انه (ãne), مند (mand), ناک (nãk), وار (wãr) | Mahramãne = mahram + ãne | محرمانه (confidential) = محرم + انه |
| ترین (tarin), گانه (gãne) | Xubtarin = xub + tarin | خوبترین (best) = خوب + ترین |
As with nouns, there are some orthographic rules for adjectives in Persian. For example, to make a relative adjective (by adding ‘ی’ (i) to the end of a word) from a word whose last letter is ‘ه’ (h), such as “بانه” (Baneh: a city name), the letter “ا” (a) must be added between them, so the relative adjective for “بانه” is “بانه‌ای” (Baneai: from Baneh).
4. HPS METHOD
4.1. Description of our HPS method
To stem a textual document or a sentence, an effective stemming method should focus mainly on nouns, adjectives and verbs, because these words carry most of the meaning of a sentence or a document. Therefore, in this paper we do not stem the other components of a sentence. Persian, like English, has affixational morphology: to use a word in different ways or to create a new meaning, prefixes and suffixes are attached to the beginning and end of the word. Persian nouns, like English nouns, take plural and possessive suffixes. Persian verbs vary with tense, person, negation and mood, and show more variety than English verbs. Persian also has many adjective suffixes.
The HPS (Hierarchical Persian Stemmer) method employs a hierarchical process based on morphology and POS tags. It has three distinct parts for stemming the suffixes of nouns, adjectives and verbs. In addition, HPS uses hash tables to stem some exceptions that other stemmers cannot handle.
In HPS the stemming task is spread over several hierarchical levels. Figure 1 shows a block diagram of the levels of the HPS method. The first level of HPS is PreStemmer-DFA, which is responsible for removing prefixes from words. The next level, named SufStemmer, removes suffixes and is composed of three distinct parts based on the POS tags (N for nouns, V for verbs and A for adjectives). Each of these parts consists of two levels: a hash table and a DFA. For example, in the part that belongs to nouns, N_Hash is the hash table that forms the first level and SufStemmer_NDFA is the DFA-based stemmer of the second level.
The HPS method stores some particular words, such as high-frequency words, Mokassar plurals borrowed from Arabic, irregular plurals, and words like “سازمان” (sãzemãn: organization), in three distinct small hash tables (N_Hash for nouns, A_Hash for adjectives and V_Hash for verbs). In the diagram of the method, NFile, AFile and VFile are the files that contain the nouns, adjectives and verbs, respectively, that are stored in the corresponding hash tables.
Our stemmer uses a lower bound on stem length (equal to three here), and it also follows some rules on the last letter of words and the first letters of suffixes. HPS first identifies prefixes and removes them according to the sequences defined by the existing paths in the PreStemmer-DFA.
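The levels described above can be sketched as a single pass through the hierarchy: prefix removal, then the hash table for the word's POS tag, then the suffix stage for that tag. This is a minimal sketch under stated assumptions, not the authors' implementation: plain prefix and suffix matching stands in for the DFAs, the word lists are toy data in Latin transliteration, and the names `strip_prefix`, `strip_suffix` and `hps_single_pass` are hypothetical.

```python
MIN_STEM_LEN = 3  # lower bound on stem length stated in the text

# Hypothetical toy data, in Latin transliteration.
PREFIXES = ("na", "mi", "be")
HASH_TABLES = {
    "N": {"sãzemãn": "sãzemãn"},   # ends like stem + "mãn" but is itself a stem
    "A": {},
    "V": {},
}
SUFFIXES = {"N": ("mãn", "hã", "ãn"), "A": ("tarin", "tar"), "V": ("im", "am")}


def strip_prefix(word):
    """Level 1 (stand-in for PreStemmer-DFA): remove at most one prefix."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            return word[len(prefix):]
    return word


def strip_suffix(word, pos):
    """Level 2b (stand-in for the per-POS suffix DFA): remove one suffix."""
    for suffix in SUFFIXES[pos]:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def hps_single_pass(word, pos):
    word = strip_prefix(word)
    if word in HASH_TABLES[pos]:          # level 2a: per-POS hash table
        return HASH_TABLES[pos][word]
    return strip_suffix(word, pos)


print(hps_single_pass("naneveştim", "V"))   # neveşt
print(hps_single_pass("sãzemãn", "N"))      # sãzemãn (protected by N_Hash)
print(strip_suffix("sãzemãn", "N"))         # sãze (what the suffix stage alone gives)
```

The last two calls illustrate the role of the hash table: a word that merely ends in a suffix-like string is returned unchanged instead of being cut.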
We have grouped suffixes into three main groups: verb suffixes (VL1, VL2, VL4, VL5, VL6, VL7), noun suffixes (Pl2, Plo3, Po1, Po2, Po3, Ot1, Ot2, Ot3, Ot4) and adjective suffixes (AL1, AL2, AL3, AL4). Each main group has subgroups based on the number of suffix letters (and, for noun suffixes, on the type of suffix). The subgroup therefore indicates how many letters are cut from the word. For example, for the word “ننوشتیم” (naneveştim: we did not write), the stemmer first identifies “ن” (n) as a prefix, then identifies the suffix “یم” (im) and removes it to produce the stem “نوشت” (neveşt: wrote).
Noun suffixes are stacked according to this pattern (reading from right-to-left):
(Possessive) + (Plural) + (Other) < Stem >
For example, the stemmer first finds the possessive noun suffix “یمان” (yemãn) in the word “نوشته‌هایمان” (neveştehãyemãn: our writings), then it finds the plural noun suffix “ها” (hã) and, finally, it finds the other-noun-suffix “ه” (h) to reach the stem “نوشت” (neveşt: wrote). Hence the stemmer removes up to three suffixes from nouns.
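The stacking pattern can be sketched as peeling at most one suffix per slot, starting from the outermost (possessive) slot. The suffix lists below are a small illustrative subset in Latin transliteration, where "e" stands for the final letter that the text writes as (h); the names `NOUN_SLOTS` and `strip_noun_suffixes` are hypothetical.

```python
MIN_STEM_LEN = 3  # lower bound on stem length used by HPS

# Illustrative suffix lists, longest first within each slot.
NOUN_SLOTS = [
    ("possessive", ("yemãn", "emãn", "mãn", "am")),
    ("plural", ("hã", "ãn", "ãt")),
    ("other", ("e", "i")),
]


def strip_noun_suffixes(word):
    """Peel at most one suffix per slot, outermost slot first."""
    removed = []
    for slot, suffixes in NOUN_SLOTS:
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
                word = word[: -len(suffix)]
                removed.append((slot, suffix))
                break
    return word, removed


stem, removed = strip_noun_suffixes("neveştehãyemãn")
print(stem)      # neveşt
print(removed)   # [('possessive', 'yemãn'), ('plural', 'hã'), ('other', 'e')]
```

Fixing the slot order keeps the search small: each slot is tried once, so a noun never needs more than three removals.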
Figure 1 Diagram of HPS Proposed Method
4.2. Implementation
We implemented the proposed HPS method as a composition of three hash tables and four DFAs (deterministic finite automata). The hash tables belong to the three major parts of the word stemmer described above. One of the four DFAs acts as the prefix stemmer, and the other three remove the suffixes from words based on their POS tags (noun: N, adjective: A or verb: V).
The prefix DFA stemmer runs on the input word and, if it detects a prefix pattern, removes it. Depending on the POS tag of the word, the corresponding hash table is then searched. If the word is found in the hash table, the related stem is returned; otherwise the corresponding suffix DFA stemmer is run to remove the suffixes as it moves through the states of the DFA. If the generated word is a stem, the process is complete; otherwise the word is passed to the hash table again. Note that a word may have multiple suffixes, so to remove all of them the output is given back to the suffix stemmer as a new word. This process is repeated until no more suffixes are found or the returned word contains fewer than three letters.
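The repeat-until-stable control flow described above can be sketched independently of how a single suffix is removed. In this sketch `strip_one_suffix` is any callable that removes one suffix (it stands in for a suffix DFA), `toy_strip` is a hypothetical example of such a callable in Latin transliteration, and a cut that would leave fewer than three letters is refused, which is one possible reading of the stopping rule in the text.

```python
MIN_STEM_LEN = 3


def stem_until_stable(word, hash_table, strip_one_suffix):
    """Alternate hash lookup and single-suffix removal until nothing changes."""
    while True:
        if word in hash_table:                 # exception or known stem
            return hash_table[word]
        shorter = strip_one_suffix(word)
        if shorter == word or len(shorter) < MIN_STEM_LEN:
            return word                        # no suffix left, or stem too short
        word = shorter                         # feed the output back in


def toy_strip(word, suffixes=("yemãn", "hã", "e")):
    """Hypothetical single-suffix remover used only for this illustration."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


print(stem_until_stable("neveştehãyemãn", {}, toy_strip))                       # neveşt
print(stem_until_stable("neveştehãyemãn", {"neveşte": "neveşte"}, toy_strip))   # neveşte
```

The second call shows why the hash table is consulted on every round: an intermediate form that is stored in the table stops the stripping early.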
Depending on the POS of the input word, a small array is used to store the suffix groups. Every state in the DFAs is labeled: in the suffix DFA stemmers a state is labeled “NIL” or with one of the suffix groups, and in the prefix DFA stemmer a state is labeled “NIL” or “PRE”. If the final state is “NIL”, nothing is removed and the input word is returned as its own stem; otherwise, the suffix that corresponds to the group of the final state is removed. Figure 2 shows a simple DFA that removes two subsets of noun suffixes: Plo3 = {“مان”, “تان”, “شان”} and Pl2 = {“ان”, “ها”, “ات”, “ون”, “ین”}. The three groups of states of this DFA are shown in Table 4. For example, consider the input word “کیفشان” (kifeşãn: their bag). The DFA reads the word from left to right, which means that the last letter of the word (‘ن’) is the first letter the DFA reads. Running the DFA on this word terminates in state 9, which belongs to group “Plo3”. Thus the three letters of the suffix “شان” (şãn) are cut from the end of the input word and “کیف” (kif) is returned as the stem.
Table 4 An example for grouping of the final state
| Final States | Suffix group |
|---|---|
| 1, 2, 3, 4 | NIL |
| 5, 6, 7, 11, 12 | Pl2 |
| 8, 9, 10 | Plo3 |
Figure 2 An example of a Small DFA machine
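A DFA of this kind can be sketched as a transition table built over the reversed suffixes, so that the last letter of the word is consumed first and the label of the deepest labeled state reached decides how many letters are cut. This is a sketch under our own assumptions: the state numbers produced by `build_dfa` are arbitrary and do not correspond to the state numbers of Figure 2 and Table 4, and the function names are hypothetical.

```python
SUFFIX_GROUPS = {
    "Plo3": ["مان", "تان", "شان"],
    "Pl2": ["ان", "ها", "ات", "ون", "ین"],
}
MIN_STEM_LEN = 3


def build_dfa(groups):
    """Build transitions over reversed suffixes; state 0 is the start state."""
    transitions = [{}]   # transitions[state][letter] -> next state
    labels = ["NIL"]     # labels[state] -> "NIL" or a suffix group name
    for group, suffixes in groups.items():
        for suffix in suffixes:
            state = 0
            for letter in reversed(suffix):
                if letter not in transitions[state]:
                    transitions.append({})
                    labels.append("NIL")
                    transitions[state][letter] = len(transitions) - 1
                state = transitions[state][letter]
            labels[state] = group
    return transitions, labels


def run_dfa(word, transitions, labels):
    """Feed letters starting from the end of the word; return (stem, group)."""
    state, depth, best = 0, 0, (0, "NIL")
    for letter in reversed(word):
        if letter not in transitions[state]:
            break
        state = transitions[state][letter]
        depth += 1
        if labels[state] != "NIL":
            best = (depth, labels[state])   # longest suffix seen so far
    cut, group = best
    if group == "NIL" or len(word) - cut < MIN_STEM_LEN:
        return word, "NIL"
    return word[: len(word) - cut], group


TRANSITIONS, LABELS = build_dfa(SUFFIX_GROUPS)
print(run_dfa("کیفشان", TRANSITIONS, LABELS))   # ('کیف', 'Plo3')
print(run_dfa("مردان", TRANSITIONS, LABELS))    # ('مرد', 'Pl2')
print(run_dfa("نان", TRANSITIONS, LABELS))      # ('نان', 'NIL'): stem would be too short
```

In this sketch the longest match wins, so a word such as “درختان” (deraxtãn), whose last three letters happen to spell the Plo3 suffix “تان”, would be cut too far. Words of this kind are natural candidates for a hash table of exceptions.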
5. EXPERIMENTAL RESULTS
To evaluate the proposed HPS method on the Persian language, we used the Hamshahri Collection (which covers various topics) and security news from the ICTna.ir site (which covers the single topic of security). We created several test sets of different sizes and tested the HPS algorithm on each of them. The test sets were created as follows. First, we selected test documents of different lengths (small to large) from the two corpora and gave them to a POS (part of speech) tagger such as [14] to detect the POS of every word in the documents. Then we kept only the words tagged as noun, adjective or verb, and stored these words and their POS tags in two distinct files as the inputs of our system. We assumed that nouns, adjectives and verbs are the most meaningful parts of the sentences of a text; therefore the remaining components, such as adverbs, conjunctions, determiners, numbers, prepositions, pronouns and punctuation, were ignored. The results shown in Table 5 and Table 6 have relatively good accuracy. Most of the incorrect results are related to compound words, because many of them do not follow specified morphological rules.
Table 5 Test of HPS method on the Hamshahri Collection
| Test set No. | Topic | Words (noun, adjective and verb) | Correct Results | Wrong Results | Accuracy (%) |
|---|---|---|---|---|---|
| 1 | Literature & Art | 24 | 23 | 1 | 95.8 |
| 2 | Literature & Art | 48 | 45 | 3 | 93.7 |
| 3 | Literature & Art | 72 | 67 | 5 | 93.1 |
| 4 | Literature & Art | 99 | 92 | 7 | 93 |
| 5 | Literature & Art | 150 | 140 | 10 | 93.3 |
| 6 | Literature & Art | 247 | 234 | 13 | 94.7 |
| 7 | social | 117 | 113 | 4 | 96.5 |
| 8 | social | 324 | 314 | 10 | 96.9 |
| 9 | science & culture | 131 | 127 | 4 | 96.9 |
| 10 | science & culture | 246 | 240 | 6 | 97.5 |
| 11 | science & culture | 394 | 385 | 9 | 97.7 |
Average of Accuracy = 95.37
Table 5 shows the experimental results of applying HPS to test sets composed of texts on different topics from the Hamshahri Collection. The Correct Results column gives the number of words stemmed correctly, and the Wrong Results column gives the number of incorrectly stemmed words plus the words that were not stemmed. The Accuracy is the percentage of correct results among all words. The average accuracy of 95.37% is the mean of the per-set accuracies and indicates the performance of the HPS method.
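The accuracy figures can be recomputed from the counts in Table 5. The short script below is only a check of the arithmetic: `macro` is the mean of the accuracy column as printed (the figure reported as the average), while `micro` pools all words across test sets, a second summary that we compute here and that is not reported in the paper.

```python
# (words, correct) pairs copied from Table 5.
TABLE5 = [(24, 23), (48, 45), (72, 67), (99, 92), (150, 140), (247, 234),
          (117, 113), (324, 314), (131, 127), (246, 240), (394, 385)]
# Accuracy column as printed in Table 5.
REPORTED = [95.8, 93.7, 93.1, 93, 93.3, 94.7, 96.5, 96.9, 96.9, 97.5, 97.7]

# Accuracy of each test set: correct words as a percentage of all words.
per_set = [100 * correct / words for words, correct in TABLE5]

# Mean of the per-set accuracies (each test set weighted equally).
macro = sum(REPORTED) / len(REPORTED)

# Pooled accuracy (each word weighted equally).
micro = 100 * sum(c for _, c in TABLE5) / sum(w for w, _ in TABLE5)

print(round(macro, 2))   # 95.37
print(round(micro, 2))   # 96.11
```

The pooled figure is higher than the per-set mean because the largest test sets are also the ones with the highest accuracy.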
Another experiment was carried out on a test set composed of texts on the common topic of security; the results are shown in Table 6. In this table the stemming results obtained with hash tables are compared with the results obtained without them. The results show that hash tables have a remarkable influence on stemming accuracy, increasing the average accuracy by about 4 percentage points.
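Table 6 Test of HPS method on Security News from ICTna.ir (with hash table and without hash table)
| Text No. | Word No. | With Hash Table: Correct | With Hash Table: Wrong | With Hash Table: Accuracy (%) | Without Hash Table: Correct | Without Hash Table: Wrong | Without Hash Table: Accuracy (%) |
|---|---|---|---|---|---|---|---|
| 1 | 72 | 68 | 4 | 94.44 | 62 | 10 | 86.11 |
| 2 | 94 | 91 | 3 | 96.80 | 89 | 5 | 94.68 |
| 3 | 188 | 182 | 6 | 96.80 | 176 | 12 | 93.61 |
| 4 | 211 | 199 | 12 | 94.31 | 190 | 36 | 90.04 |
| 5 | 214 | 203 | 11 | 94.85 | 196 | 34 | 91.58 |
| 6 | 215 | 210 | 5 | 97.67 | 199 | 21 | 92.55 |
| 7 | 349 | 331 | 18 | 94.84 | 320 | 29 | 91.69 |
| 8 | 179 | 170 | 9 | 94.97 | 164 | 14 | 91.62 |
Average of Accuracy = 95.58 (with hash table); Average of Accuracy = 91.45 (without hash table)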
6. CONCLUSIONS
In this paper the HPS method for Persian stemming is presented. The novelty of this method lies in its hierarchical structure, which is composed of different levels based on DFAs and hash tables. Using DFAs and hash tables together combines the advantages of both.
In HPS the words are categorized by their POS tags, which reduces the probability of wrong results. The structured design of HPS makes the method dynamic and extensible. Using a separate DFA for the words of each POS tag increases the speed of stemming and also makes the method more extensible.
The main goal in introducing HPS was stemming texts on special topics; therefore we used small hash tables of words on those topics. This idea increases the accuracy of stemming and also increases its speed, because searching a small hash table is fast and the words found in the hash tables do not go through the DFAs.
The experimental results show an average accuracy of 95.37%, which improves further when the method is applied to a test set with a common topic. Comparing the results with similar works such as [12, 13, 15] shows the advantages of the HPS method.
REFERENCES
[1] Bento, C., A. Cardoso, and G. Dias (2005), Progress in Artificial Intelligence: 12th Portuguese Conference on Artificial Intelligence, EPIA 2005, Covilha, Portugal, December 5-8, 2005.
[2] Porter, M.F. (1980), "An algorithm for suffix stripping". Program: electronic library and information systems, 14(3): p. 130-137.
[3] Krovetz, R. (1993), "Viewing morphology as an inference process", in Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval: ACM.
[4] Mayfield, J. and P. McNamee (2003), "Single n-gram stemming", in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval: ACM.
[5] Bacchin, M., N. Ferro, and M. Melucci (2002), "Experiments to evaluate a statistical stemming algorithm", University of Padua at CLEF, in Proceedings of CLEF: Citeseer.
[6] Melucci, M. and N. Orio (2003), "A novel method for stemmer generation based on hidden markov models", in Proceedings of the twelfth international conference on Information and knowledge management: ACM.
[7] Savoy, J. (1993), "Stemming of French words based on grammatical categories". Journal of the American Society for Information Science, 44(1): p. 1-9.
[8] Savoy, J. (1999), "A stemming procedure and stopword list for general French corpora". JASIS, 50(10): p. 944-952.
[9] Harman, D. (1991), "How effective is suffixing?" JASIS, 42(1): p. 7-15.
[10] Tomlinson, S. (2004), "Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServerTM", at CLEF, in Comparative evaluation of multilingual information access systems. Springer. p. 286-300.
[11] Dolamic, L. and J. Savoy (2009), "Persian Language, is Stemming Efficient?", in Database and Expert Systems Application. DEXA'09. 20th International Workshop on: IEEE.
[12] Taghva, K., R. Beckley, and M. Sadeh (2005), "A stemming algorithm for the farsi language", in International Conference on ITCC: Information Technology: Coding and Computing, IEEE.
[13] Fard, R.H. and G.G. Sani (2006), "Stemmer Algorithm Design for Persian Language", in 11th International CSI Computer Conference (CSICC’2006), School of Computer Science, IPM.
[14] Mohseni, M. and B. Minaei-Bidgoli (2010), "A Persian Part-Of-Speech Tagger Based on Morphological Analysis", in LREC.
[15] Estahbanati, S. and J. Reza (2011), "A New Multi-Phase Algorithm for Stemming in Farsi Language Based on Morphology". International Journal of Computer Theory and Engineering (IJCTE), 3(5).
Ayshe Rashidi received the B.Sc. degree in Computer Engineering (Hardware) from the Technical and Engineering faculty, Shahed University, Tehran, Iran in 2011. She is currently an M.Sc. student in Computer Engineering (Artificial Intelligence) at the Electrical and Computer Engineering faculty of Tabriz University, Iran. Her research interests include Algorithm Design, Data Mining, Text Processing, NLP, Intrusion Detection Systems, and Information Extraction and Retrieval.
Mina Zolfy Lighvan received the B.Sc. degree in Computer Engineering (Hardware) and the M.Sc. degree in Computer Engineering (Computer Architecture) from the ECE faculty, University of Tehran, Iran in 1999 and 2002, respectively. She received the Ph.D. degree in Electronic Engineering (Digital Electronics) from the Electrical and Computer Engineering faculty of Tabriz University, Iran. She is currently an assistant professor and works as a lecturer at Tabriz University. She has published more than 20 papers in national and international conferences and journals. Dr. Zolfy's major research interests include Text Retrieval, Object-Oriented Programming & Design, Algorithm Analysis, HDL Simulation, HDL Verification, HDL Fault Simulation, HDL Test Tools, VHDL, Verilog, hardware test, CAD Tools, synthesis, and Digital circuit design & simulation.

TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
Shreyas Phanse Resume: Experienced Backend Engineer | Java • Spring Boot • Ka...
20250228 LYD VKU AI Blended-Learning.pptx
The Rise and Fall of 3GPP – Time for a Sabbatical?
Digital-Transformation-Roadmap-for-Companies.pptx
Agricultural_Statistics_at_a_Glance_2022_0.pdf
“AI and Expert System Decision Support & Business Intelligence Systems”
Network Security Unit 5.pdf for BCA BBA.
The AUB Centre for AI in Media Proposal.docx
Electronic commerce courselecture one. Pdf
PA Analog/Digital System: The Backbone of Modern Surveillance and Communication
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Chapter 3 Spatial Domain Image Processing.pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
A Presentation on Artificial Intelligence
Building Integrated photovoltaic BIPV_UPV.pdf
Architecting across the Boundaries of two Complex Domains - Healthcare & Tech...
Spectral efficient network and resource selection model in 5G networks

Hps a hierarchical persian stemming method

International Journal on Natural Language Computing (IJNLC) Vol. 3, No.1, February 2014
DOI : 10.5121/ijnlc.2014.3102
HPS: A HIERARCHICAL PERSIAN STEMMING METHOD
Ayshe Rashidi1 and Mina Zolfy Lighvan2
1 Department of Electrical and Computer Engineering, Tabriz University, Tabriz, Iran
2 Department of Electrical and Computer Engineering, Tabriz University, Tabriz, Iran
ABSTRACT
In this paper, a novel hierarchical Persian stemming approach based on the Part-Of-Speech (POS) of the word in a sentence is presented. The implemented stemmer uses hash tables and several deterministic finite automata (DFA) at the different levels of its hierarchy to remove the prefixes and suffixes of words. We had two intentions in using hash tables. First, the DFAs do not handle some special words, and the hash tables partly solve this problem. Second, the hash tables speed up the stemmer, because a word found in a hash table skips the time the DFAs would need. Because of its hierarchical organization, the method is fast and flexible. Our experiments on test sets from the Hamshahri Collection and from Security News on the ICTna.ir site show that the method has an average accuracy of 95.37%, which improves further on a test set with a common topic.
KEYWORDS
Stemming, morphology, DFA machine, hash table, POS tags & hierarchical
1. INTRODUCTION
Nowadays, people are surrounded by a huge amount of information, especially with the development of the internet. Hence, over the years many techniques have been developed to help people manage and process the information they need. Many research themes in the field of artificial intelligence are emerging in this environment, for example information retrieval (IR), information extraction, information filtering, machine translation and question answering. Unfortunately, the words that appear in documents and in queries often have many morphological variants.
In most cases, morphological variants of words have similar semantic interpretations and can be considered equivalent for IR applications. However, pairs of terms such as "connect" and "connection" will not be recognized as equivalent without some form of natural language processing (NLP). So, before information is retrieved from the documents, stemming is applied to the target data set as an essential step. Stemming reduces the size of the data set or dictionary, which saves storage space and processing time and so improves the performance of the IR system. There are several types of stemming algorithms, which differ in performance and accuracy. In this paper, we describe some of them briefly and then present our proposed method.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 gives a brief background on the Persian language. Section 4 describes our stemming method. Section 5 presents the experimental results of our method, and Section 6 discusses our conclusions and suggestions.
2. RELATED WORKS
The most frequently used stemming methods are affix removal, look-up tables and statistical methods [1]. Affix removal depends on the morphological structure of the language: stemming is done by removing morphemes from a word. The Porter algorithm [2] is an example of this category. It is composed of 5 steps, during which the more common affixes are removed using special rules. Another example is the Krovetz stemmer [3], whose procedure handles both inflectional and derivational suffixes and in which the suffix-stripping process is controlled by an English dictionary. In the look-up table method, each word and its stem are stored in look-up tables, so the stem of every stored word can be found directly. This method needs a large storage space, and its tables must be updated manually for each new word. In the statistical methods, rules are formulated according to the arrangement of words, using a process based on sets. n-grams [4], link analysis [5] and Hidden Markov Models [6] are examples of models that have been used in statistical stemming.
In general, many works on stemming performance have been reported for the English language, but far fewer for other, less widely studied languages. For the French language, for example, Savoy [7] proposes a suffixing algorithm based on grammatical categories, and Savoy [8] presents another stemming procedure based on only a few general morphological rules. The latter approach corresponds to the English "S stemmer", which conflates singular and plural word forms [9]. Tomlinson [10] evaluated the differences between Porter's stemming strategy [2] and lexical stemmers (based on a dictionary of the corresponding language) for various European languages.
For Finnish and German, lexical stemmers tend to produce statistically better results, while for seven other languages the performance differences were insignificant [11].
Two major stemming algorithms have been presented for the Persian language. The first was proposed by Kazem Taghva, Russell Beckley and Mohammad Sadeh in 2005 [12]. It is inspired by the Porter algorithm for English [2] and removes suffixes and prefixes according to Persian morphology. The implementation uses a DFA machine with 40 states to remove suffixes and prefixes from words. This method has some problems, such as a limited number of suffixes and low speed. The second algorithm was designed by GholamReza Ghasem Sani and Reza Hesamifard [13] and is based on a database or dictionary of all the stems of the language. The input word is first searched in the database. If it is found, the stem is returned. Otherwise, the suffixes and prefixes are removed and the result is searched in the database again. The disadvantages of this method are that the database must be updated frequently and that it needs a large storage space.
3. PERSIAN LANGUAGE
The Persian language belongs to the Indo-European family. It is spoken and written primarily in Iran, Afghanistan and a part of Tajikistan. It is written from right to left in a modified Arabic script containing the 28 Arabic letters and four more characters (گ چ پ ژ) that express sounds not present in Classical Arabic. In Persian, verbs encode tense, person, mood and form (negative or positive). For example, the verb “می‌سازم” (mi-sazam: I make) is a present-tense verb consisting of three morphemes: “م” (am) is a suffix denoting the first person singular, “ساز” (saz) is the present-tense root of the verb, and “می” (mi) is a prefix that expresses continuity.
The negative form of a verb is produced by adding “ن” (ne) to its beginning. For example, “نمی‌سازم” (ne-mi-saz-am: I don’t make) is the negative form of the verb “می‌سازم” (misazam: I make). There are certain rules for making verbs in the Farsi language. A subset of these rules is shown in Table 1.
Table 1 Some morphological rules for verbs in Persian Language
(rules and examples are given in transliteration, read left to right; ‘h’ denotes the letter “ه”, pronounced e)
Past tense
  Simple:           past root + past person identifier
                    Neveştam = neveşt + am (نوشتم)
  Continuous:       mi + past root + past person identifier
                    Mineveştam = mi + neveşt + am (می‌نوشتم)
  Present perfect:  past root + ‘h’ + past person identifier
                    Neveşteam = neveşt + e + am (نوشته‌ام)
  Unlikely:         past root + ‘h’ + bud + past person identifier
                    Neveşte budam = neveşt + e + bud + am (نوشته بودم)
  Implicit:         past root + ‘h’ + baş + present person identifier
                    Neveşte başam = neveşt + e + baş + am (نوشته باشم)
Future tense
  Future:           xãh + present person identifier + past root
                    Xãham neveşt = xãh + am + neveşt (خواهم نوشت)
Present tense
  Simple:           present root + present person identifier
                    Nevisam = nevis + am (نویسم)
  Declarative:      mi + present root + present person identifier
                    Minevisam = mi + nevis + am (می‌نویسم)
  Implicit:         b + present root + present person identifier
                    Benevisam = be + nevis + am (بنویسم)
  Imperative:       b + present root
                    Benevis = be + nevis (بنویس)
Persian also has many rules for making nouns. In general, the plural forms of nouns are formed by adding one of the suffixes (ها، ان، ات، ین، ون). “ها” (hã) is used for all words. “ان” (ãn) is used for humans, animals and everything that is alive. “ات، ون، ین” (ãt, un, in) are used for some words borrowed from Arabic and for some Persian words.
There is another kind of plural in Persian, called Mokassar, which is a derivational (irregular) plural form. Many Mokassar plurals are borrowed from Arabic. Some examples of plural forms are shown in Table 2.
Table 2 Some Morphological Rules for Nouns in Persian Language
(word structure: word = word stem + suffixes)
Plural
  ان (ãn):                        Deraxtãn = deraxt + ãn (trees): درختان = درخت + ان
  ها (hã):                        Dasthã = dast + hã (hands): دست‌ها = دست + ها
  ات (ãt):                        Nabãtãt = nabãt + ãt (plants): نباتات = نبات + ات
  ین (in), ون (un):               Mo’alemin = mo’alem + in (teachers): معلمین = معلم + ین
Possession
  ت (at), م (am), ش (aş):         Dastam = dast + am (my hand): دستم = دست + م
  مان (mãn), تان (tãn), شان (şãn): Dastemãn = dast + mãn (our hand): دستمان = دست + مان
Others
  ی (i), ه (h), ک (k):            Xubi = xub + i (goodness): خوبی = خوب + ی
  یت (yat), چه (ĉe), چی (ĉi):     Jam’yat = jam’ + yat (population): جمعیت = جمع + یت
  بان (bãn), دان (dãn), زار (zãr): Bãghbãn = bãgh + bãn (gardener): باغبان = باغ + بان
  واره (wãre):                    Guşwãre = guş + wãre (earring): گوشواره = گوش + واره
There are some orthographic rules for the effects of joining affixes to words. For example, consider a plural word consisting of two parts A and B. If the last letter of A and the first letter of B are both “ا” (ã), a letter “ی” (y) is added between them. If A is “دانا” (dãnã: wise) and B is “ان” (ãn), the joined result is “دانایان” (dãnã-yãn: wise ones).
An adjective is a word or group of words that appears before or after a noun and describes a feature or concept of it. Adjectives have different types, such as simple, nominative, participle, relative and merit. Here, we categorize them by the number of letters in their suffixes, because our method is based on morphology. Some common types of adjectives are presented in Table 3.
Table 3 Some Morphological Rules for Adjectives in Persian Language
(word structure: word = word stem + suffixes)
Adjective suffixes
  ا (ã), ی (i), ه (h):                          Dãrã = dãr + ã (wealthy): دارا = دار + ا
  تر (tar), گر (gar), ین (in), ار (ãr):         Xubtar = xub + tar (better): خوبتر = خوب + تر
  انه (ãne), مند (mand), ناک (nãk), وار (wãr):  Mahramãneh = mahram + ãne (confidential): محرمانه = محرم + انه
  ترین (tarin), گانه (gãne):                    Xubtarin = xub + tarin (best): خوبترین = خوب + ترین
As with nouns, there are some orthographic rules for adjectives in Persian. For example, to make a relative adjective by adding “ی” (i) to the end of a word whose last letter is “ه” (h), such as “بانه” (Baneh: a city name), an “ا” (a) must be inserted between them. The relative adjective for “بانه” is therefore “بانه‌ای” (Baneai: from Baneh).
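As an illustration of the verb rules in Table 1, the following minimal Python sketch composes each form from a past root, a present root and the person identifiers, using the transliterations given in the table. The function name `verb_forms` and the dictionary keys are hypothetical labels chosen for this illustration. The sketch works on transliterated strings only and ignores the orthographic joining rules of the Persian script.

```python
def verb_forms(past_root, present_root, past_id, present_id):
    """Compose the verb forms of Table 1 from transliterated morphemes."""
    return {
        # Past tense rows of Table 1
        "past simple": past_root + past_id,
        "past continuous": "mi" + past_root + past_id,
        "present perfect": past_root + "e" + past_id,
        "past unlikely": past_root + "e" + " bud" + past_id,
        "past implicit": past_root + "e" + " baş" + present_id,
        # Future tense row: xãh + present person identifier + past root
        "future": "xãh" + present_id + " " + past_root,
        # Present tense rows of Table 1
        "present simple": present_root + present_id,
        "present declarative": "mi" + present_root + present_id,
        "present implicit": "be" + present_root + present_id,
        "imperative": "be" + present_root,
    }


# First person singular of "to write": both person identifiers are "am".
forms = verb_forms("neveşt", "nevis", "am", "am")
for name, form in forms.items():
    print(f"{name}: {form}")
```

A stemmer has to undo exactly these compositions, which is why the verb rules translate into prefix groups (such as mi and be) and suffix groups (such as the person identifiers).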
4. HPS METHOD
4.1. Description of our HPS method
To stem a textual document or a sentence, an effective stemming method should focus mainly on nouns, adjectives and verbs, because these words carry the major meaning of a sentence or a document. Therefore, in this paper we ignore the stemming of the other components of a sentence. Persian, like English, has affixation morphology: to create a new meaning or use of a word, prefixes and suffixes are attached to the beginning and end of the word. Persian nouns, like English nouns, take plural and possessive suffixes. Persian verbs vary with tense, person, negation and mood, and show more variety than English verbs. Persian also has many adjective suffixes.
The HPS (Hierarchical Persian Stemmer) method employs a hierarchical process based on morphology and POS tags. It has three distinct parts for stemming the suffixes of nouns, adjectives and verbs. In addition, HPS uses hash tables to stem some exceptions that other stemmers cannot handle. In HPS the stemming task is spread over several hierarchical levels. Figure 1 shows a block diagram of the levels of the HPS method. The first level of HPS is PreStemmer-DFA, which is responsible for removing prefixes from words. The next level, named SufStemmer, removes suffixes and is composed of three distinct parts based on the POS tags (N for nouns, V for verbs and A for adjectives). Each of these parts consists of two levels, a hash table and a DFA. For example, in the first part, which belongs to nouns, N_Hash is the hash table that forms the first level and SufStemmer_NDFA is the DFA-based stemmer of the corresponding second level.
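The hierarchy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the DFAs are replaced by simple prefix and suffix lists, the words are transliterated, and all names (`strip_first`, `strip_last`, `pre_stemmer`, `HASH_TABLES`, `SUFFIX_STAGES`, `hps_stem`) and the tiny affix lists are hypothetical. Only the order of the levels follows the text: prefix removal first, then the POS-specific hash table, then the POS-specific suffix stage.

```python
def strip_first(prefixes):
    """Build a stage that removes the first matching prefix, if any."""
    def stage(word):
        for prefix in prefixes:
            if word.startswith(prefix):
                return word[len(prefix):]
        return word
    return stage


def strip_last(suffixes):
    """Build a stage that removes the first matching suffix, if any."""
    def stage(word):
        for suffix in suffixes:
            if word.endswith(suffix):
                return word[:-len(suffix)]
        return word
    return stage


# Level 1: stand-in for PreStemmer-DFA (toy prefix list, an assumption).
pre_stemmer = strip_first(["mi"])

# Level 2: SufStemmer, one hash table and one suffix stage per POS tag.
# "sãzemãn" (organization) is a word the paper keeps in a hash table.
HASH_TABLES = {"N": {"sãzemãn": "sãzemãn"}, "A": {}, "V": {}}
SUFFIX_STAGES = {
    "N": strip_last(["hã", "ãn"]),      # stand-in for SufStemmer_NDFA
    "A": strip_last(["tarin", "tar"]),  # stand-in for the adjective DFA
    "V": strip_last(["am"]),            # stand-in for the verb DFA
}


def hps_stem(word, pos):
    """Run one word through the levels: prefix stage, hash table, suffix stage."""
    word = pre_stemmer(word)
    table = HASH_TABLES[pos]
    if word in table:
        return table[word]          # hash hit: the suffix stage is skipped
    return SUFFIX_STAGES[pos](word)


print(hps_stem("dasthã", "N"))
print(hps_stem("sãzemãn", "N"))
```

In this sketch the hash table protects "sãzemãn" from losing its final "ãn", which the noun suffix stage alone would strip, and the lookup happens before any suffix work is done.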
The HPS method stores some particular words, such as high-frequency words, Mokassar plurals borrowed from Arabic, irregular plurals and words like “سازمان” (sãzemãn: organization), in three distinct small hash tables (N_Hash for nouns, A_Hash for adjectives and V_Hash for verbs). In the diagram of the method, NFile, AFile and VFile are the files containing the nouns, adjectives and verbs, respectively, that are stored in the corresponding hash tables. Our stemmer uses a lower bound on stem length (equal to three here), and it also follows some rules on the last letter of words and the first letters of suffixes. HPS first identifies prefixes and removes them according to the sequences defined by the paths of the PreStemmer-DFA.
We have grouped suffixes into three main groups: verb suffixes (VL1, VL2, VL4, VL5, VL6, VL7), noun suffixes (Pl2, Plo3, Po1, Po2, Po3, Ot1, Ot2, Ot3, Ot4) and adjective suffixes (AL1, AL2, AL3, AL4). Each main group has subgroups based on the number of suffix letters (and, for noun suffixes, on the type of suffix). This grouping indicates the number of suffix letters that will be cut from the word. For example, the stemmer first identifies “ن” (n) as a prefix in the word “ننوشتیم” (naneveştim: we did not write), then identifies the suffix “یم” (im) and removes it to produce the stem “نوشت” (neveşt: wrote). Noun suffixes are stacked according to this pattern (reading from right to left): (Possessive) + (Plural) + (Other) < Stem >
For example, the stemmer first finds the possessive noun suffix “یمان” (yemãn) in the word “نوشته‌هایمان” (neveştehãyemãn: our writings), then it finds the plural noun suffix “ها” (hã) and,
finally, it finds the other-noun suffix “ه” (h) to reach the stem “نوشت” (neveşt: wrote). Hence the stemmer removes up to three suffixes from nouns.
Figure 1 Diagram of HPS Proposed Method
4.2. Implementation
We implemented the proposed HPS method as a composition of three hash tables and four DFA (deterministic finite automata) machines. The hash tables belong to the three major parts of the word stemmer described above. One of the four DFA machines acts as the prefix stemmer, and the other three remove suffixes from words according to their POS tags (noun: N, adjective: A or verb: V). The prefix DFA stemmer runs on the input word and, if it detects a prefix pattern, removes it. Depending on the POS tag of the word, the corresponding hash table is then searched. If the word is found in the hash table, the related stem is returned. Otherwise, the corresponding suffix DFA stemmer is run to remove the suffixes as it moves through the states of the DFA. If the generated word is a stem, the process is complete. Otherwise, the word is returned to the hash table again. Note that a word may have multiple suffixes, so to remove all of them the output is given back to the suffix stemmer as a new word. This process is repeated until no more suffixes can be found or the returned word contains fewer than three letters. Depending on the POS of the input word, a small array is used for storing the suffix groups. We have labelled every state in the DFAs: as “NIL” or one of the suffix groups in the suffix DFA stemmers and
“NIL” or “PRE” in the prefix DFA stemmer. If the final state is “NIL”, nothing is removed and the input word is returned as its own stem. Otherwise, the suffix that corresponds to the suffix group of the final state is removed. Figure 2 shows a simple DFA machine that removes two subsets of noun suffixes: Plo3 = {“مان”, “تان”, “شان”} and Pl2 = {“ان”, “ها”, “ات”, “ون”, “ین”}. The three groups of states of this DFA are shown in Table 4. For example, consider “کیفشان” (kifeşãn: their bag) as an input word. The DFA reads the word from left to right, which means that the last letter of the word (‘ن’) is the first one the DFA receives. Applying the example word (“کیفشان”) therefore terminates in state 9, which is grouped as “Plo3”. Thus the three letters of the suffix “شان” (şãn) are cut from the end of the input word and “کیف” (kif) is returned as the stem.
Table 4 An example for grouping of the final state
Final states        Suffix group
1, 2, 3, 4          NIL
5, 6, 7, 11, 12     Pl2
8, 9, 10            Plo3
Figure 2 An example of a Small DFA machine
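The suffix DFA idea can be illustrated with a short Python sketch. The transition diagram of Figure 2 is not reproduced in the text, so the automaton below is an assumption: a trie-shaped DFA built from the reversed suffixes of Pl2 and Plo3, with every state labelled "NIL" or with a suffix group. The names `build_dfa`, `classify` and `strip_noun_suffix` are hypothetical, and the word “مردان” (mardãn: men) is an illustrative example that does not come from the paper. The state numbering differs from Table 4, although the automaton also ends up with four NIL states, five Pl2 states and three Plo3 states.

```python
# Suffix groups from the text; the digit in each group name is the cut length.
SUFFIX_GROUPS = {
    "Pl2": ["ان", "ها", "ات", "ون", "ین"],
    "Plo3": ["مان", "تان", "شان"],
}
CUT_LENGTH = {"NIL": 0, "Pl2": 2, "Plo3": 3}
MIN_STEM_LENGTH = 3  # lower bound on stem length stated in the paper


def build_dfa(groups):
    """Build transitions over reversed suffixes and label every state."""
    transitions = {}          # (state, letter) -> next state
    state_group = {0: "NIL"}  # state -> "NIL" or a suffix group
    next_state = 1
    for group, suffixes in groups.items():
        for suffix in suffixes:
            state = 0
            for letter in reversed(suffix):
                key = (state, letter)
                if key not in transitions:
                    transitions[key] = next_state
                    state_group[next_state] = "NIL"
                    next_state += 1
                state = transitions[key]
            state_group[state] = group
    return transitions, state_group


TRANSITIONS, STATE_GROUP = build_dfa(SUFFIX_GROUPS)


def classify(word):
    """Feed the word's letters, last letter first, and return the suffix group."""
    state, group = 0, "NIL"
    for letter in reversed(word):
        state = TRANSITIONS.get((state, letter))
        if state is None:      # no transition: stop reading
            break
        if STATE_GROUP[state] != "NIL":
            group = STATE_GROUP[state]
    return group


def strip_noun_suffix(word):
    """Cut as many letters as the suffix group says, keeping stems of 3+ letters."""
    cut = CUT_LENGTH[classify(word)]
    if cut == 0 or len(word) - cut < MIN_STEM_LENGTH:
        return word
    return word[:-cut]


for example in ["کیفشان", "نباتات", "مردان", "کیف"]:
    print(example, classify(example), strip_noun_suffix(example))
```

Because this sketch always takes the longest matching suffix, a word such as “درختان” (deraxtãn: trees), whose last three letters happen to spell “تان”, is cut too far here. Cases of this kind are one reason to check exception words before running the suffix stage.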
5. EXPERIMENTAL RESULTS
To evaluate the proposed HPS method on the Persian language, we used the Hamshahri Collection (with various topics) and Security News from the ICTna.ir site (with the special topic of security). We created several test sets of different sizes and tested the HPS algorithm on each of them. The test sets were created as follows. First, we selected test documents of different lengths (small to large) from the two corpora and gave them to a POS (Part Of Speech) tagger such as [14] to detect the POS of all words in the documents. Then, we kept only the words with Noun, Adjective or Verb POS tags and stored these words and their POS tags in two distinct files as the inputs of our system. We assumed that nouns, adjectives and verbs are the most meaningful parts of sentences, so the remaining components, such as adverbs, conjunctions, determiners, numbers, prepositions, pronouns and punctuation, are ignored. The results shown in Table 5 and Table 6 have relatively good accuracy. Most of the incorrect results are related to compound words, because many of them do not follow specified morphological rules.
Table 5 Test of HPS method on the Hamshahri Collection
Test set No.   Topic               Words (noun, adjective and verb)   Correct results   Wrong results   Accuracy (%)
1              Literature & Art    24                                 23                1               95.8
2              Literature & Art    48                                 45                3               93.7
3              Literature & Art    72                                 67                5               93.1
4              Literature & Art    99                                 92                7               93
5              Literature & Art    150                                140               10              93.3
6              Literature & Art    247                                234               13              94.7
7              social              117                                113               4               96.5
8              social              324                                314               10              96.9
9              science & culture   131                                127               4               96.9
10             science & culture   246                                240               6               97.5
11             science & culture   394                                385               9               97.7
Average of Accuracy = 95.37
Table 5 shows the experimental results of applying HPS to test sets composed of texts on different topics from the Hamshahri Collection.
The Correct Results column gives the number of words stemmed correctly, and the Wrong Results column gives the number of incorrectly stemmed words plus the words that were not stemmed. The Accuracy is the percentage of correct results among all words. The average accuracy of 95.37% is a reasonable result that shows the performance of the HPS method. Another experiment was done on a test set composed of texts with the common topic of security, and the results are shown in Table 6. In this table the stemming results obtained with hash tables are compared to the results obtained without them. The results show that the hash tables have a remarkable influence on stemming accuracy, increasing it by about 4 percentage points.
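The accuracy measure defined above is simple to reproduce. The following Python sketch recomputes it from the word counts of Table 5. The names `accuracy`, `macro_average` and `micro_average` are hypothetical, and the word-weighted (micro) average is an additional figure that the paper does not report.

```python
# (words, correct results) for test sets 1 to 11, copied from Table 5.
TABLE_5 = [
    (24, 23), (48, 45), (72, 67), (99, 92), (150, 140), (247, 234),
    (117, 113), (324, 314), (131, 127), (246, 240), (394, 385),
]
# Accuracy column as printed in Table 5.
REPORTED = [95.8, 93.7, 93.1, 93, 93.3, 94.7, 96.5, 96.9, 96.9, 97.5, 97.7]


def accuracy(words, correct):
    """Percentage of correctly stemmed words among all words of a test set."""
    return 100.0 * correct / words


def macro_average(values):
    """Unweighted mean over test sets (the averaging used for 95.37)."""
    return sum(values) / len(values)


def micro_average(rows):
    """Mean weighted by test-set size: all correct words over all words."""
    return 100.0 * sum(c for _, c in rows) / sum(w for w, _ in rows)


per_set = [accuracy(w, c) for w, c in TABLE_5]
print(round(macro_average(REPORTED), 2))  # 95.37, the average given in Table 5
print(round(macro_average(per_set), 2))
print(round(micro_average(TABLE_5), 2))
```

The average of the printed accuracy column reproduces the reported 95.37. Recomputing the per-set accuracies from the counts gives slightly higher values, because the printed figures appear to be truncated rather than rounded (for example, 45/48 is 93.75%, printed as 93.7).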
Table 6 Test of HPS method on Security News from ICTna.ir (with hash table and without hash table)
                       With hash table                     Without hash table
Text No.   Word No.    Correct   Wrong   Accuracy (%)      Correct   Wrong   Accuracy (%)
1          72          68        4       94.44             62        10      86.11
2          94          91        3       96.80             89        5       94.68
3          188         182       6       96.80             176       12      93.61
4          211         199       12      94.31             190       36      90.04
5          214         203       11      94.85             196       34      91.58
6          215         210       5       97.67             199       21      92.55
7          349         331       18      94.84             320       29      91.69
8          179         170       9       94.97             164       14      91.62
Average of Accuracy = 95.58 (with hash table)
Average of Accuracy = 91.45 (without hash table)
6. CONCLUSIONS
In this paper the HPS method for Persian stemming is presented. The novelty of the method lies in its hierarchical structure, which is composed of different levels based on DFAs and hash tables. Using DFAs and hash tables together combines the advantages of both. In HPS the words are categorized by their POS tags, which reduces the probability of mistaken results. The structured design of HPS makes the method dynamic and extensible. Using separate DFAs for words with different POS tags increases the speed of stemming and also makes the method more extensible. The main goal in introducing HPS was stemming texts on special topics, so we used small hash tables of words on those topics. This idea increases the accuracy of stemming and also its speed, because searching a small hash table is fast and the words found in the hash tables do not go through the DFAs. The experimental results show an average accuracy of 95.37%, which improves further on a test set with a common topic. Comparing the results with similar works such as [12, 13, 15] shows the advantages of the HPS method.
REFERENCES
[1] Bento, C., A. Cardoso, and G. Dias (2005), Progress in Artificial Intelligence: 12th Portuguese Conference on Artificial Intelligence, EPIA 2005, Covilha, Portugal, December 5-8, 2005.
[2] Porter, M.F. (1980), "An algorithm for suffix stripping". Program: electronic library and information systems, 14(3): p. 130-137.
[3] Krovetz, R. (1993), "Viewing morphology as an inference process". in Proceedings of the 16th annual international ACM SIGIR conference on Research and development in information retrieval: ACM.
[4] Mayfield, J. and P. McNamee (2003), "Single n-gram stemming". in Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval: ACM.
[5] Bacchin, M., N. Ferro, and M. Melucci (2002), "Experiments to evaluate a statistical stemming algorithm". University of Padua at CLEF. in Proceedings of CLEF: Citeseer.
[6] Melucci, M. and N. Orio (2003), "A novel method for stemmer generation based on hidden markov models". in Proceedings of the twelfth international conference on Information and knowledge management: ACM.
[7] Savoy, J. (1993), "Stemming of French words based on grammatical categories". Journal of the American Society for Information Science, 44(1): p. 1-9.
[8] Savoy, J. (1999), "A stemming procedure and stopword list for general French corpora". JASIS, 50(10): p. 944-952.
[9] Harman, D. (1991), "How effective is suffixing?". JASIS, 42(1): p. 7-15.
[10] Tomlinson, S. (2004), "Lexical and algorithmic stemming compared for 9 European languages with Hummingbird SearchServer™ at CLEF". in Comparative evaluation of multilingual information access systems. Springer. p. 286-300.
[11] Dolamic, L. and J. Savoy (2009), "Persian Language, is Stemming Efficient?". in Database and Expert Systems Application. DEXA'09. 20th International Workshop on: IEEE.
[12] Taghva, K., R. Beckley, and M. Sadeh (2005), "A stemming algorithm for the farsi language". in International Conference on ITCC: Information Technology: Coding and Computing, IEEE.
[13] Fard, R.H. and G.G. Sani (2006), "Stemmer Algorithm Design for Persian Language". in 11th International CSI Computer Conference (CSICC’2006), School of Computer Science, IPM.
[14] Mohseni, M. and B. Minaei-Bidgoli (2010), "A Persian Part-Of-Speech Tagger Based on Morphological Analysis". in LREC.
[15] Estahbanati, S. and J. Reza (2011), "A New Multi-Phase Algorithm for Stemming in Farsi Language Based on Morphology". International Journal of Computer Theory and Engineering (IJCTE), 3(5).
Ayshe Rashidi received the B.Sc. degree in Computer Engineering (Hardware) from the Technical and Engineering faculty, Shahed University, Tehran, Iran in 2011. She is currently an M.Sc. student in Computer Engineering (Artificial Intelligence) at the Electrical and Computer Engineering faculty of Tabriz University, Iran. Her research interests include Algorithm Design, Data Mining, Text Processing, NLP, Intrusion Detection Systems, and Information Extraction and Retrieval.
Mina Zolfy Lighvan received the B.Sc. degree in Computer Engineering (Hardware) and the M.Sc. degree in Computer Engineering (Computer Architecture) from the ECE faculty, University of Tehran, Iran in 1999 and 2002, respectively. She received the Ph.D. degree in Electronic Engineering (Digital Electronics) from the Electrical and Computer Engineering faculty of Tabriz University, Iran. She is currently an assistant professor and works as a lecturer at Tabriz University. She has published more than 20 papers in national and international conferences and journals. Dr. Zolfy's major research interests include Text Retrieval, Object Oriented Programming & Design, Algorithms Analysis, HDL Simulation, HDL Verification, HDL Fault Simulation, HDL Test Tools (VHDL, Verilog), hardware test, CAD Tools, synthesis, and Digital circuit design & simulation.