International Journal on Natural Language Computing (IJNLC) Vol. 4, No.4, August 2015
DOI: 10.5121/ijnlc.2015.4401

Contextual Analysis for Middle Eastern
Languages with Hidden Markov Models
Kazem Taghva
Department of Computer Science
University of Nevada, Las Vegas
Las Vegas, NV
Abstract
Displaying a document in Middle Eastern languages requires contextual analysis due to different
presentational forms for each character of the alphabet. The words of the document will be formed by the
joining of the correct positional glyphs representing corresponding presentational forms of the
characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to
language and are subject to interpretation by the software developers.
In this paper, we propose a machine learning approach for contextual analysis based on the first order
Hidden Markov Model. We will design and build a model for the Farsi language to exhibit this
technology. The Farsi model achieves 94% accuracy with the training based on a short list of 89 Farsi
vocabularies consisting of 2780 Farsi characters.
The experiment can be easily extended to many languages including Arabic, Urdu, and Sindhi.
Furthermore, the advantage of this approach is that the same software can be used to perform contextual
analysis without coding complex rules for each specific language. Of particular interest is that the
languages with fewer speakers can have greater representation on the web, since they are typically
ignored by software developers due to lack of financial incentives.
Index Terms
Unicode, Contextual Analysis, Hidden Markov Models, Big Data, Middle Eastern Languages, Farsi,
Arabic, data science, machine learning, artificial intelligence
1. Introduction
One of the main objectives of Unicode is to provide a setting in which non-English documents
can be easily created and displayed on modern electronic devices such as laptops and cellular
phones. Consequently, this encoding has led to the development of many software tools for text
editing, font design, storage, and management of data in foreign languages. For commercial
reasons, the languages with large speaking populations and large economies have enjoyed much
more rapid advancement in Unicode based technologies. On the other hand, less spoken
languages such as Pushtu are barely given attention. According to [11], approximately 40 to 60
million people speak Pushtu worldwide.
Many Unicode based technologies are based on proprietary and patented methods and thus are
not available to the general open source developer community. For example, BIT [9]
does not reveal its contextual analysis algorithm for Farsi [10]. Many software engineers must
develop new methods to implement tools that mimic these commercial technologies. The new
contextual analysis for Farsi developed by Moshfeghi at the Iran Telecommunication Research
Center is an example of these kinds of efforts [10].
The Unicode also introduces a challenge for the internationalization of any software regardless
of being commercial or open source. Tim Bray [2] writes:
Whether you're doing business or academic research or public service, you have to deal with
people, and these days, it's quite likely that some of the people you want to deal with come from
somewhere else, and you'll sometimes want to deal with them in their own language. And if your
software is unable to collect, store, and display a name, an address, or a part description in
Chinese, Bengali, or Greek, there's a good chance that this could become very painful very
quickly.
There are a few organizations that as a matter of principle operate in one language only (the US
Department of Defense, the Académie Française) but as a proportion of the world, they shrink
every year.
This internationalization is a costly effort and is subject to the availability of resources. As
mentioned above, languages with large speaking populations, such as Mandarin, attract most of the effort. The
availability of data in Unicode represents an opportunity to employ machine learning techniques
to advance software internationalization and foreign text manipulation. The language translation
technologies heavily use Hidden Markov Models (HMM) to improve translation accuracy [3][1].
In this paper, we propose the use of HMM for contextual analysis. In particular, we design and
build a generic HMM for Farsi that can be easily adapted to other Middle Eastern languages.
In section 2, we provide some background and related work on contextual analysis. Section 3
will provide a brief introduction to first order HMM. In section 4, we describe the design and
implementation of our HMM for Farsi contextual analysis. The training and testing of HMM
will be explained in section 5. Finally, section 6 describes our conclusion and proposes future
work.
2. Background
In 2002, the Center for Intelligent Information Retrieval at the University of Massachusetts,
Amherst, held a workshop on Challenges in Information Retrieval and Language Modeling [7]. The
premise of this workshop was to promote the use of language model technology for various
natural languages. The aim was to use the same software for indexing and retrieval regardless of
the language. It was pointed out that, by using training materials such as document collections,
we can automatically build retrieval engines for all languages. This report was one of the reasons
that we decided to start a couple of projects on Farsi and Arabic [16][19].
Consequently, these projects led to developments of the two widely used Farsi and Arabic
Stemmers [15][17]. One of the difficulties we had was the lack of technologies for input and
display of Farsi and Arabic documents [19]. For example, we needed an input/display method
that would allow us to enter Farsi query words in a Latin-based operating system without any
special software or hardware. It was further necessary to have a standard character encoding for
text representation and searching. At the time, we developed a system that provides the
following capabilities:
- a web-browser based keyboard applet for input
- if the web browser has the ability to process and display Unicode content, it is used
- if the browser cannot display Unicode content, an auxiliary process is invoked to render the
  Unicode content into a portable bitmap image with associated HTML to display the image in
  the browser
Another area of difficulty that we encountered is that the presence of white space used to
separate words in the document is dependent on the display geometry of the glyphs. Since Farsi
and Arabic are written using a cursive form, each character can have up to four different display
glyphs. These glyphs represent the four different presentation forms:
isolated: the standalone character
initial: the character at the beginning of a word
medial: the character in the middle of a word
final: the character at the end of a word
We found that depending on the amount of trailing white space following a final form glyph, a
space character may or may not be found in the text. This situation came to light when our
subject matter experts were developing our test queries. We found that since the glyphs used to
display the final form of characters had very little trailing space, they were manually adding
space characters to improve the look of the displayed queries.
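In Unicode, these positional glyphs have their own code points in the Arabic Presentation Forms blocks, separate from the base character. As a minimal sketch for the Arabic letter beh (the code points below are from the Presentation Forms-B block, used here as an illustrative assumption):

```ruby
# Positional glyph code points for the Arabic letter beh (U+0628).
# Contextual analysis selects one of these display forms for each
# occurrence of the base character in a word.
BEH_FORMS = {
  isolated: "\u{FE8F}",  # standalone character
  initial:  "\u{FE91}",  # beginning of a word
  medial:   "\u{FE92}",  # middle of a word
  final:    "\u{FE90}"   # end of a word
}
# The base (logical) character is what is stored in the text;
# the four glyphs above are only its display forms.
BASE = "\u{0628}"
```

The text of a document stores only the base character; rendering replaces it with one of the four glyphs, which is precisely the decision the contextual analysis must make.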
2.1 Keyboard Applet
The keyboard applet was written in JavaScript. The applet displays a Farsi keyboard image with
the ability to enter characters from both the keyboard and the mouse. The applet also handles
character display conversion and joining of the input data.
The keyboard layout is based on the ISIRI 2901:1994 standard layout as documented in an email
by Pournader [12]. Figure 1 shows the keyboard applet being used to define our test queries for
search and retrieval.
Display of the input data is normally performed using preloaded glyph images. However,
if a character has not been preloaded, it can be generated on the fly. Most of the time, these
generated characters are ``compound'' characters. Farsi (and other Arabic-script languages) may
use compound characters, which are a combination of two or more separate characters. For
example, the rightmost character of ‫خرما‬, the Farsi word for ``date'' (that is, the fruit), is a
combination of a ‫ﺥ‬ with a damma.
The complications associated with our work on Farsi and Arabic convinced us that we need to
develop generic machine learning tools if we want to develop display and search technologies
for most of the Middle Eastern languages. In the next few sections, we will offer a solution to
contextual analysis to display the correct presentational forms of characters.
3. Hidden Markov Model
An HMM is a finite state automaton with probabilistic transitions and symbol emissions
[13][14]. An HMM consists of:

- A set of states S = {s1, s2, ..., sn}.
- An emission vocabulary V = {w1, w2, ..., wm}.
- Probability distributions over emission symbols, where the probability that a state s emits
  symbol w is given by P(w|s). These probabilities are denoted by the matrix B.
- Probability distributions over the set of possible outgoing transitions, where the probability
  of moving from state si to sj is given by P(sj|si). These probabilities are denoted by the
  matrix A.
- A subset of the states that are considered start states, and to each of these is associated an
  initial probability that the state will be a start state. These probabilities are denoted by Π.
Figure 1. Example Use Of The Keyboard Applet
As an example, consider the widely used HMM [21] that decodes weather states based on a
friend's activities. Assume there are only two states of weather: Sunny, Rainy. Also assume
there are only three activities: Walking, Shopping, Cleaning.
You regularly call your friend who lives in another city to find out about his activities and the
weather. He may respond by saying ``I am cleaning and it is rainy'', or ``I am shopping
and it is sunny''. If you collect a good number of these weather states and activities, you can
then summarize your data as the HMM shown in Figure 2.

Figure 2. An HMM For Activities And Weather
This HMM states that on rainy days, your friend walks 10% of the days, while on sunny days he
walks 60%. The statistics associated with this HMM are obtained by simply counting the
activities on rainy and sunny days.

You also notice arrows between states that keep track of weather changes. For example,
our HMM reflects the fact that on a rainy day, there is a 70% chance of rain the next day and a
30% chance of sunshine.
In addition, one can keep track of how many days in the data are sunny or rainy. These counts
give the initial probabilities. Formally, these statistics are calculated as Maximum Likelihood
Estimates (MLE). The transition probabilities are estimated as:

P(sj|si) = (number of transitions from si to sj) / (total number of transitions out of si)
The emission probabilities are estimated with Maximum Likelihood supplemented by
smoothing. Smoothing is required because Maximum Likelihood Estimation will sometimes
assign a zero probability to unseen emission-state combinations.
Prior to smoothing, emission probabilities are estimated by:
P(w|s)_ML = (number of times w is emitted by s) / (total number of symbols emitted by s)
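Both estimates reduce to counting over tagged training pairs. A minimal Ruby sketch (function and variable names are ours; add-one smoothing on emissions is an assumption, since the text does not name a specific smoothing scheme):

```ruby
# MLE estimation of transition (A) and emission (B) probabilities from
# training pairs of the form [[w1, ..., wn], [s1, ..., sn]].
def estimate(pairs, states, vocab)
  trans = Hash.new { |h, k| h[k] = Hash.new(0) }
  emit  = Hash.new { |h, k| h[k] = Hash.new(0) }
  pairs.each do |words, tags|
    tags.each_cons(2) { |a, b| trans[a][b] += 1 }  # count si -> sj
    words.zip(tags)   { |w, s| emit[s][w] += 1 }   # count s emits w
  end
  # Transition probabilities: plain relative frequencies.
  a = states.map { |s|
    total = trans[s].values.sum.to_f
    [s, states.map { |t| [t, total.zero? ? 0.0 : trans[s][t] / total] }.to_h]
  }.to_h
  # Emission probabilities with add-one (Laplace) smoothing, so that
  # unseen state/symbol combinations never get zero probability.
  b = states.map { |s|
    total = emit[s].values.sum.to_f + vocab.size
    [s, vocab.map { |w| [w, (emit[s][w] + 1) / total] }.to_h]
  }.to_h
  [a, b]
end
```

With a single training pair ([w1 w2], [s1 s2]), for example, all transitions out of s1 go to s2, so A[s1][s2] = 1.0, while every emission keeps nonzero probability thanks to the smoothing term.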
The most interesting part of an HMM is the decoding aspect. We may be told that our friend's
activities for the last four days were cleaning, cleaning, shopping, cleaning, and we want to
know what the weather patterns were for those four days. This essentially translates to finding
the sequence of four states s1 s2 s3 s4 that maximizes the probability:

P(s1 s2 s3 s4 | cleaning cleaning shopping cleaning)

This amounts to choosing the highest probability among the 16 possible choices for s1 s2 s3 s4.
This brute-force approach becomes computationally very expensive as the number of states and
symbols increases. The solution is given by the Viterbi algorithm, which finds an optimal path
using dynamic programming [14]. Algorithm 1 is a modification of the pseudocode from [21].
In the next section we will describe the design and implementation of an HMM for Farsi
contextual analysis.
4. Farsi Hidden Markov Model
The Farsi HMM is very similar to the example HMM described in the previous section. The
HMM has a state for each presentation form of the Farsi alphabet, and a vocabulary of size 32,
one symbol for each character of the Farsi alphabet. A simple calculation suggests that the Farsi
HMM should have 128 states (32 characters times four forms) over this 32-symbol vocabulary.
In practice the HMM has fewer than 128 states, since some characters do not have all four
presentational forms. For example, there are only two states for the character ‫ا‬ (alef), as there
are no medial or initial forms for this character.
Algorithm 1: Viterbi Algorithm

Data: Given K states and M vocabularies, and a sequence of vocabularies Y = w1 w2 ... wn
Result: The most likely state sequence R = r1 r2 ... rn that maximizes the above probability

Function Viterbi(V, S, Π, Y, A, B) : R
    for each state si do
        T1[i, 1] = Π_i * B[i, w1];
        T2[i, 1] = 0;
    end
    for i = 2, 3, ..., n do
        for each state sj do
            T1[j, i] = max_k ( T1[k, i-1] * A[k, j] * B[j, wi] );
            T2[j, i] = argmax_k ( T1[k, i-1] * A[k, j] * B[j, wi] );
        end
    end
    z_n = argmax_k ( T1[k, n] );
    r_n = s_{z_n};
    for i = n, n-1, ..., 2 do
        z_{i-1} = T2[z_i, i];
        r_{i-1} = s_{z_{i-1}};
    end
    Return R
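For concreteness, the algorithm can be sketched in Ruby (the language of our training script) on the weather HMM of Section 3. The rainy-day walking percentage and the rain-to-rain transition follow the figures in the text; the remaining probabilities are assumed for illustration:

```ruby
# Weather HMM parameters. Rainy->Rainy = 0.7 and the walking
# percentages (10% rainy, 60% sunny) follow the text; the other
# values are illustrative assumptions.
STATES = %w[Rainy Sunny]
PI = { "Rainy" => 0.6, "Sunny" => 0.4 }
A  = { "Rainy" => { "Rainy" => 0.7, "Sunny" => 0.3 },
       "Sunny" => { "Rainy" => 0.4, "Sunny" => 0.6 } }
B  = { "Rainy" => { "walking" => 0.1, "shopping" => 0.4, "cleaning" => 0.5 },
       "Sunny" => { "walking" => 0.6, "shopping" => 0.3, "cleaning" => 0.1 } }

def viterbi(observations)
  # trellis[t][s] = [best probability of any path ending in s at time t,
  #                  predecessor state on that path] -- the T1/T2 tables.
  trellis = [STATES.map { |s| [s, [PI[s] * B[s][observations.first], nil]] }.to_h]
  observations.drop(1).each do |w|
    prev = trellis.last
    trellis << STATES.map { |s|
      best = STATES.max_by { |k| prev[k][0] * A[k][s] }
      [s, [prev[best][0] * A[best][s] * B[s][w], best]]
    }.to_h
  end
  # Backtrack from the most probable final state.
  path = [trellis.last.max_by { |_, (p, _)| p }.first]
  (trellis.size - 1).downto(1) { |i| path.unshift(trellis[i][path.first][1]) }
  path
end
```

Under these numbers, viterbi(%w[cleaning cleaning shopping cleaning]) decodes all four days as Rainy, since cleaning is far more likely on rainy days and rainy weather tends to persist.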
As an example, suppose we want to type the word ‫شغال‬ (jackal in English). On the keyboard, we
type four isolated characters ‫ﺵ‬ , ‫ﻍ‬ , ‫ا‬ , and ‫ﻝ‬ . The HMM should decode these four characters as
initial, medial, final, and isolated, respectively. In other words, the sequence of the four isolated
characters (or vocabularies, in HMM terminology) should be decoded into the four states as
shown in Figure 3.

Figure 3. The Four Isolated Characters On The Left Are Vocabularies While The Four Characters
On The Right Are The States Of The HMM
The part of the HMM displayed in Figure 4 shows how the Viterbi algorithm takes the path that
decodes the correct form of each character by choosing the appropriate states. As we observe,
there are four states for the character ‫ﻍ‬ , representing the four shapes of this character. We also
observe that there are only two states for the character ‫ا‬ , as there are no medial or initial forms
for this character.
A typical implementation of an HMM adds states and vocabularies as it is trained [20]. The
training is done by providing pairs of the form ([w1 w2 ... wn], [s1 s2 ... sn]), similar to the
[vocabularies, states] sequences shown in Figure 3.
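For illustration, gold state labels for such pairs can be derived from a simplified joining rule: a character takes its initial or medial form only if it joins to the following character, and its final or medial form only if the preceding character joins forward. The right-joining set below is an assumption covering the common Farsi cases; the actual joining rules are more involved:

```ruby
# Simplified labeller producing presentation-form states for a word.
# RIGHT_ONLY lists letters that join only to the preceding letter
# (an assumed, simplified set; real Farsi rules have more cases).
RIGHT_ONLY = %w[ا آ د ذ ر ز ژ و]

def forms(word)
  chars = word.chars
  chars.each_index.map do |i|
    # Does this character connect to its neighbours?
    joins_prev = i > 0 && !RIGHT_ONLY.include?(chars[i - 1])
    joins_next = i < chars.size - 1 && !RIGHT_ONLY.include?(chars[i])
    if    joins_prev && joins_next then :medial
    elsif joins_prev               then :final
    elsif joins_next               then :initial
    else                                :isolated
    end
  end
end
```

On the example above, forms("شغال") yields [:initial, :medial, :final, :isolated], matching the decoding described for that word.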
Figure 4. Glyphs Chosen By The HMM
5. Training and Testing of HMM
We trained the HMM with 89 words (2780 characters) chosen from the list of frequent
words from the Kayhan newspaper published in 2005 [4]. There are over 10,000 words in this
collection. We limited the training to this short list to save time. The list of these words is
shown in Figure 5.

The test data is a small number of words selected randomly from a small dictionary and shown
in Figure 6. This list contains 32 words (350 characters). The training file contains pairs of
words separated by a vertical bar. The first word is the isolated form and the second word is the
correct presentational form of the word. We read the file one line at a time and submit the two
words for training, as seen in the following Ruby code:
f = File.open("./training-data")
farsi.train([" "], [" "])
f.each do |line|
  # Each line holds the isolated form and the presentational form,
  # separated by a vertical bar.
  seq1, seq2 = line.chomp.split(/\s*\|\s*/)
  farsi.train(seq1.split(" "), seq2.split(" "))
end
As seen above, we have added a blank vocabulary and state to our HMM. The HMM adds
vocabularies and states as part of the training. After training, the HMM has 32 vocabularies and
74 states. It is anticipated that the HMM will acquire more states as the size of the training data
increases.
The test correctly decoded 94% of all the characters. Most of the mistakes are due to the fact that
the HMM has not seen enough examples of character combinations. For example, in the word
‫آتش‬ , the initial form of ‫ﺕ‬ was not decoded correctly. A closer examination of the training data
reveals that there is no occurrence of ‫تش‬ in the set. Similarly, there are other errors of this form,
such as the initial form of ‫ﻥ‬ in the word ‫ترانه‬ . There are also a few errors attributed to the
double combination of the character ‫ﯼ‬ as in ‫طالیی‬ . We believe most of these errors will be
corrected with larger training sets.
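The 94% figure is a character-level accuracy. A sketch of how such a figure can be computed from parallel predicted and reference form sequences (the function and its data layout are ours):

```ruby
# Character accuracy: fraction of characters whose decoded presentation
# form matches the reference form, pooled over all test words.
# `predicted` and `reference` are parallel arrays of per-word form lists.
def char_accuracy(predicted, reference)
  pairs = predicted.zip(reference).flat_map { |p, r| p.zip(r) }
  pairs.count { |p, r| p == r } / pairs.size.to_f
end
```

Pooling over characters rather than words means a single wrong form in a long word only costs one character, which matches how the error examples above are counted.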
6. Conclusion and Future Work
In this paper, we have presented a machine learning approach to the contextual analysis of script
languages. It is shown that an ergodic HMM can be easily trained to automatically decode
presentational forms of the script languages.

Although the paper is developed based on Farsi, the approach can be easily extended to other
Middle Eastern languages. Further training and research in this area can improve the character
accuracy.
A successful program for contextual analysis may have to include a list of exceptional words
that do not follow the normal combination rules of the characters. It is also important to note
that most Arabic and Farsi typesetting technologies, such as ArabTeX [8] or FarsiTeX [5], have
problems with contextual analysis. This is mainly because it is practically impossible to devise
an algorithm that has 100% accuracy for tasks associated with natural languages.
Figure 5. Top 89 Frequent Words From Kayhan 2005
Finally, a higher order HMM may also improve the contextual analysis. For example, it has been
shown that a second order HMM improves handwritten character recognition [6]. It may also be
worth mentioning that the second order HMM does not improve error detection and correction
for post processing of printed documents [18].
Figure 6. Test Data Chosen Randomly
REFERENCES
[1] Jan A. Botha and Phil Blunsom. Compositional morphology for word representations and language
modeling. In Proceedings of the 31st International Conference on Machine Learning (ICML),
Beijing, China, June 2014. Award for best application paper.
[2] Tim Bray. Element sets: A minimal basis for an XML query engine. In QL, 1998.
[3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The
mathematics of statistical machine translation: Parameter estimation. Computational Linguistics,
19(2):263-311, June 1993.
[4] Jon Dehdari. Top frequent words in Farsi, 2005. http://www.ling.ohio-state.edu/~jonsafari/.
[5] Mohammad Ghodsi and Behdad Esfahbod, 1992. http://www.farsitex.org/.
[6] Y. H. Y. He. Extended Viterbi algorithm for second order hidden Markov process. In Proceedings of
the 9th International Conference on Pattern Recognition, pages 718-720, 1988.
[7] James Allan. Challenges in information retrieval and language modeling: Report of a workshop held
at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September
2002. SIGIR Forum, 37(1):31-47, April 2003.
[8] Klaus Lagally, 2006. http://www2.informatik.uni-stuttgart.de/ivi/bs/research/Arabic.htm.
[9] Fallah Moshfeghi and Kourosh Shadsari. Design and implementation of bilingual information
entrance and edit environment. Technical report, Iran Telecommunication Research Center, winter
1999.
[10] Kourosh Fallah Moshfeghi. A new algorithm for contextual analysis of Farsi characters and its
implementation in Java. In 17th International Unicode Conference, 2000.
[11] Herbert Penzl and Ismail Sloan. A Grammar of Pashto: A Descriptive Study of the Dialect of
Kandahar, Afghanistan. Ishi Press International, 2009.
[12] Roozbeh Pournader. National Iranian standard ISIRI 6219, information technology: Persian
information interchange and display mechanism, using Unicode. Technical report, 2005.
[13] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. Proceedings of the IEEE, pages 257-286, 1989.
[14] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech
recognition. In Readings in Speech Recognition, pages 267-296. Morgan Kaufmann Publishers Inc.,
San Francisco, CA, USA, 1990.
[15] Kazem Taghva, Russell Beckley, and Mohammad Sadeh. A stemming algorithm for the Farsi
language. In International Symposium on Information Technology: Coding and Computing (ITCC
2005), Volume 1, 4-6 April 2005, Las Vegas, Nevada, USA, pages 158-162, 2005.
[16] Kazem Taghva, Jeffrey S. Coombs, Ray Pereda, and Thomas A. Nartker. Language model-based
retrieval for Farsi documents. In International Conference on Information Technology: Coding and
Computing (ITCC'04), Volume 2, April 5-7, 2004, Las Vegas, Nevada, USA, pages 13-17, 2004.
[17] Kazem Taghva, Rania Elkhoury, and Jeffrey S. Coombs. Arabic stemming without a root dictionary.
In International Symposium on Information Technology: Coding and Computing (ITCC 2005),
Volume 1, 4-6 April 2005, Las Vegas, Nevada, USA, pages 152-157, 2005.
[18] Kazem Taghva, Srijana Poudel, and Spandana Malreddy. Post processing with first- and second-order
hidden Markov models. In DRR, 2013.
[19] Kazem Taghva, Ron Young, Jeffrey Coombs, Russell Beckley, and Mohammad Sadeh. Farsi
searching and display technologies. In SDIUT, 2003.
[20] David Tresner-Kirsch. HMM Ruby gem, 2009. https://github.com/dtkirsch/hmm/.
[21] Xing M. Wang. Probability bracket notation: Markov state chain projector, hidden Markov models
and dynamic Bayesian networks. CoRR, abs/1212.3817, 2012.
More Related Content

ODT
A tutorial on Machine Translation
PDF
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
PDF
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
PDF
Language Identifier for Languages of Pakistan Including Arabic and Persian
PDF
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
PPTX
PPTX
NLP pipeline in machine translation
PDF
Script to Sentiment : on future of Language TechnologyMysore latest
A tutorial on Machine Translation
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
Language Identifier for Languages of Pakistan Including Arabic and Persian
A ROBUST THREE-STAGE HYBRID FRAMEWORK FOR ENGLISH TO BANGLA TRANSLITERATION
NLP pipeline in machine translation
Script to Sentiment : on future of Language TechnologyMysore latest

What's hot (17)

PPTX
Machine translation with statistical approach
PDF
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
PDF
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
PDF
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
PDF
2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?
PDF
Native Language Identification - Brief review to the state of the art
PDF
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
PDF
Error Analysis of Rule-based Machine Translation Outputs
DOCX
Division_3_Fianna_O'Brien
PDF
Summer Research Project (Anusaaraka) Report
PDF
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
PDF
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
PDF
MOLTO Annual Report 2011
PDF
Hybrid approaches for automatic vowelization of arabic texts
PDF
Hybrid part of-speech tagger for non-vocalized arabic text
PPTX
Machine translator Introduction
PPTX
Machine translation from English to Hindi
Machine translation with statistical approach
DICTIONARY BASED AMHARIC-ARABIC CROSS LANGUAGE INFORMATION RETRIEVAL
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
EXTRACTING LINGUISTIC SPEECH PATTERNS OF JAPANESE FICTIONAL CHARACTERS USING ...
2013 ALC Boston: Your Trained Moses SMT System doesn't work. What can you do?
Native Language Identification - Brief review to the state of the art
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
Error Analysis of Rule-based Machine Translation Outputs
Division_3_Fianna_O'Brien
Summer Research Project (Anusaaraka) Report
BOOTSTRAPPING METHOD FOR DEVELOPING PART-OF-SPEECH TAGGED CORPUS IN LOW RESOU...
Grapheme-To-Phoneme Tools for the Marathi Speech Synthesis
MOLTO Annual Report 2011
Hybrid approaches for automatic vowelization of arabic texts
Hybrid part of-speech tagger for non-vocalized arabic text
Machine translator Introduction
Machine translation from English to Hindi
Ad

Viewers also liked (20)

PDF
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
PDF
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
PDF
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts
PDF
A systematic study of text mining techniques
PDF
A SIGNATURE BASED DRAVIDIAN SIGN LANGUAGE RECOGNITION BY SPARSE REPRESENTATION
PDF
G2 pil a grapheme to-phoneme conversion tool for the italian language
PDF
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
PDF
ALGORITHM FOR TEXT TO GRAPH CONVERSION
PDF
S ENTIMENT A NALYSIS F OR M ODERN S TANDARD A RABIC A ND C OLLOQUIAl
PDF
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
PDF
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACH
PDF
CBAS: CONTEXT BASED ARABIC STEMMER
PDF
T URN S EGMENTATION I NTO U TTERANCES F OR A RABIC S PONTANEOUS D IALOGUES ...
PDF
M ACHINE T RANSLATION D EVELOPMENT F OR I NDIAN L ANGUAGE S A ND I TS A PPROA...
PDF
C ONSTRUCTION O F R ESOURCES U SING J APANESE - S PANISH M EDICAL D ATA
PDF
Identification of prosodic features of punjabi for enhancing the pronunciatio...
PDF
A COMPARISON OF TEXT CATEGORIZATION METHODS
PDF
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
PDF
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
PDF
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
A N H YBRID A PPROACH TO W ORD S ENSE D ISAMBIGUATION W ITH A ND W ITH...
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
A Novel Approach for Recognizing Text in Arabic Ancient Manuscripts
A systematic study of text mining techniques
A SIGNATURE BASED DRAVIDIAN SIGN LANGUAGE RECOGNITION BY SPARSE REPRESENTATION
G2 pil a grapheme to-phoneme conversion tool for the italian language
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
ALGORITHM FOR TEXT TO GRAPH CONVERSION
S ENTIMENT A NALYSIS F OR M ODERN S TANDARD A RABIC A ND C OLLOQUIAl
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
K AMBA P ART O F S PEECH T AGGER U SING M EMORY B ASED A PPROACH
CBAS: CONTEXT BASED ARABIC STEMMER
T URN S EGMENTATION I NTO U TTERANCES F OR A RABIC S PONTANEOUS D IALOGUES ...
M ACHINE T RANSLATION D EVELOPMENT F OR I NDIAN L ANGUAGE S A ND I TS A PPROA...
C ONSTRUCTION O F R ESOURCES U SING J APANESE - S PANISH M EDICAL D ATA
Identification of prosodic features of punjabi for enhancing the pronunciatio...
A COMPARISON OF TEXT CATEGORIZATION METHODS
STRUCTURED AND QUANTITATIVE PROPERTIES OF ARABIC SMS-BASED CLASSIFIED ADS SUB...
ISSUES AND CHALLENGES IN MARATHI NAMED ENTITY RECOGNITION
A COMPREHENSIVE ANALYSIS OF STEMMERS AVAILABLE FOR INDIC LANGUAGES
Ad

Similar to Contextual Analysis for Middle Eastern Languages with Hidden Markov Models (20)

PDF
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
PDF
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
PDF
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
PDF
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
PDF
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...
PDF
Improving the role of language model in statistical machine translation (Indo...
PDF
Javanese part-of-speech tagging using cross-lingual transfer learning
PDF
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
PDF
ADVERSARIAL GRAMMATICAL ERROR GENERATION: APPLICATION TO PERSIAN LANGUAGE
PDF
ADVERSARIAL GRAMMATICAL ERROR GENERATION: APPLICATION TO PERSIAN LANGUAGE
PDF
An Extensible Multilingual Open Source Lemmatizer
PDF
Jq3616701679
PPTX
Computational linguistics
DOC
B tech project_report
PDF
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
PDF
Integration of Phonotactic Features for Language Identification on Code-Switc...
PDF
INTEGRATION OF PHONOTACTIC FEATURES FOR LANGUAGE IDENTIFICATION ON CODE-SWITC...
PDF
BERT-based models for classifying multi-dialect Arabic texts
PDF
XMODEL: An XML-based Morphological Analyzer for Arabic Language
PDF
AMAZIGH PART-OF-SPEECH TAGGING USING MARKOV MODELS AND DECISION TREES
Construction of Amharic-arabic Parallel Text Corpus for Neural Machine Transl...
CONSTRUCTION OF AMHARIC-ARABIC PARALLEL TEXT CORPUS FOR NEURAL MACHINE TRANSL...
IRJET- Tamil Speech to Indian Sign Language using CMUSphinx Language Models
Ara--CANINE: Character-Based Pre-Trained Language Model for Arabic Language U...
Different valuable tools for Arabic sentiment analysis: a comparative evaluat...

International Journal on Natural Language Computing (IJNLC) Vol. 4, No.4, August 2015
DOI: 10.5121/ijnlc.2015.4401

Contextual Analysis for Middle Eastern Languages with Hidden Markov Models

Kazem Taghva
Department of Computer Science
University of Nevada, Las Vegas
Las Vegas, NV

Abstract

Displaying a document in Middle Eastern languages requires contextual analysis due to the different presentational forms for each character of the alphabet. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by software developers.

In this paper, we propose a machine learning approach to contextual analysis based on the first order Hidden Markov Model. We design and build a model for the Farsi language to exhibit this technology. The Farsi model achieves 94% accuracy when trained on a short list of 89 Farsi vocabularies consisting of 2780 Farsi characters.

The experiment can be easily extended to many languages, including Arabic, Urdu, and Sindhi. Furthermore, the advantage of this approach is that the same software can perform contextual analysis without coding complex rules for each specific language. Of particular interest is that languages with fewer speakers can have greater representation on the web, since they are typically ignored by software developers due to lack of financial incentives.

Index Terms

Unicode, Contextual Analysis, Hidden Markov Models, Big Data, Middle Eastern Languages, Farsi, Arabic, data science, machine learning, artificial intelligence

1. Introduction

One of the main objectives of Unicode is to provide a setting in which non-English documents can be easily created and displayed on modern electronic devices such as laptops and cellular phones.
Consequently, this encoding has led to the development of many software tools for text editing, font design, storage, and management of data in foreign languages. For commercial reasons, languages with large speaking populations and large economies have enjoyed much more rapid advancement in Unicode-based technologies. On the other hand, less widely spoken languages such as Pushtu are barely given attention. According to [11], approximately 40 to 60 million people speak Pushtu worldwide.

Many Unicode-based technologies rely on proprietary and patented methods and thus are not available to the general open source software developers' communities. For example, BIT [9] does not reveal its contextual analysis algorithm for Farsi [10]. Many software engineers need to develop new methods to implement tools that mimic these commercial technologies. The new contextual analysis for Farsi developed by Moshfeghi at the Iran Telecommunication Research Center is an example of these kinds of efforts [10].

Unicode also introduces a challenge for the internationalization of any software, commercial or open source. Tim Bray [2] writes:

``Whether you're doing business or academic research or public service, you have to deal with people, and these days, it's quite likely that some of the people you want to deal with come from somewhere else, and you'll sometimes want to deal with them in their own language. And if your software is unable to collect, store, and display a name, an address, or a part description in Chinese, Bengali, or Greek, there's a good chance that this could become very painful very quickly. There are a few organizations that as a matter of principle operate in one language only (The US Department of Defense, the Académie Française) but as a proportion of the world, they shrink every year.''

This internationalization is a costly effort and subject to the availability of resources. As mentioned above, languages with large speaking populations such as Mandarin attract most of the effort. The availability of data in Unicode represents an opportunity to employ machine learning techniques to advance software internationalization and foreign text manipulation.
Language translation technologies make heavy use of Hidden Markov Models (HMM) to improve translation accuracy [3][1]. In this paper, we propose the use of an HMM for contextual analysis. In particular, we design and build a generic HMM for Farsi that can be easily adapted to other Middle Eastern languages.

In section 2, we provide some background and related work on contextual analysis. Section 3 provides a brief introduction to the first order HMM. In section 4, we describe the design and implementation of our HMM for Farsi contextual analysis. The training and testing of the HMM are explained in section 5. Finally, section 6 describes our conclusion and proposes future work.

2. Background

In 2002, the Center for Intelligent Information Retrieval at the University of Massachusetts, Amherst, held a workshop on Challenges in Information Retrieval and Language Model [7]. The premise of this workshop was to promote the use of Language Model technology for various natural languages. The aim is to use the same software for indexing and retrieval regardless of the language. It was pointed out that, by using training materials such as document collections,
we can automatically build retrieval engines for all languages. This report was one of the reasons that we decided to start a couple of projects on Farsi and Arabic [16][19]. Consequently, these projects led to the development of two widely used Farsi and Arabic stemmers [15][17].

One of the difficulties we had was the lack of technologies for input and display of Farsi and Arabic documents [19]. For example, we needed an input/display method that would allow us to enter Farsi query words in a Latin-based operating system without any special software or hardware. It was further necessary to have a standard character encoding for text representation and searching. At the time, we developed a system that provides the following capabilities:

- a web-browser based keyboard applet for input
- if the web-browser has the ability to process and display Unicode content, it will be used
- if the browser cannot display Unicode content, an auxiliary process will be invoked to render the Unicode content into a portable bitmap image with associated HTML to display the image in the browser.

Another area of difficulty that we encountered is that the presence of white space used to separate words in a document depends on the display geometry of the glyphs. Since Farsi and Arabic are written in a cursive form, each character can have up to four different display glyphs. These glyphs represent the four different presentation forms:

- isolated: the standalone character
- initial: the character at the beginning of a word
- medial: the character in the middle of a word
- final: the character at the end of a word

We found that depending on the amount of trailing white space following a final form glyph, a space character may or may not be found in the text. This situation came to light when our subject matter experts were developing our test queries.
We found that since the glyphs used to display the final form of characters had very little trailing space, they were manually adding space characters to improve the look of the displayed queries.

2.1 Keyboard Applet

The keyboard applet was written in JavaScript. The applet displays a Farsi keyboard image with the ability to enter characters from both the keyboard and mouse. The applet also handles character display conversion and joining of the input data. The keyboard layout is based on the ISIRI 2901:1994 standard layout as documented in an email by Pournader [12]. Figure 1 shows the keyboard applet being used to define our test queries for search and retrieval.
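The simplified positional rule behind the four presentation forms can be sketched as follows. This is an illustrative sketch (the function name and interface are ours, not from the paper), and it deliberately ignores non-joining characters such as alef, which is exactly the kind of exception that motivates the HMM approach of the later sections:

```ruby
# Simplified positional rule for choosing a presentation form.
# Real contextual analysis must also handle non-joining characters
# (e.g. alef has no initial or medial form), which this sketch ignores.
def presentation_form(index, length)
  return :isolated if length == 1          # standalone character
  return :initial  if index == 0           # beginning of a word
  return :final    if index == length - 1  # end of a word
  :medial                                  # middle of a word
end

# Forms for each character position of a four-letter word:
forms = (0...4).map { |i| presentation_form(i, 4) }
puts forms.inspect   # [:initial, :medial, :medial, :final]
```

A rule table per language would have to encode every exception to this naive scheme; the HMM instead learns the exceptions from labeled examples.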
Display of the input data is normally performed by using preloaded glyph images. However, if a character has not been preloaded, it can be generated on the fly. Most of the time, these generated characters are ``compound'' characters. Farsi (and other Arabic script languages) may use compound characters, which are a combination of two or more separate characters. For example, the rightmost character of ‫خرما‬ , the Farsi word for ``date'' (that is, the fruit), is a combination of a ‫ﺥ‬ with a damma.

The complications associated with our work on Farsi and Arabic convinced us that we need to develop generic machine learning tools if we want to develop display and search technologies for most of the Middle Eastern languages. In the next few sections, we offer a solution to contextual analysis that displays the correct presentational forms of characters.

Figure 1. Example Use Of The Keyboard Applet

3. Hidden Markov Model

An HMM is a finite state automaton with probabilistic transitions and symbol emissions [13][14]. An HMM consists of:

- A set of states S = {s_1, s_2, ..., s_n}.
- An emission vocabulary V = {w_1, w_2, ..., w_m}.
- Probability distributions over emission symbols, where the probability that a state s emits symbol w is given by P(w|s). This is denoted by matrix B.
- Probability distributions over the set of possible outgoing transitions. The probability of moving from state s_i to s_j is given by P(s_j|s_i). This is denoted by matrix A.
- A subset of the states that are considered start states; to each of these is associated an initial probability that the state will be a start state. This is denoted by Π.

As an example, consider the widely used HMM [21] that decodes weather states based on a friend's activities. Assume there are only two states of weather: Sunny and Rainy. Also assume there are only three activities: Walking, Shopping, and Cleaning.
You regularly call your friend, who lives in another city, to find out about his activity and the weather. He may respond by saying ``I am cleaning and it is rainy'', or ``I am shopping and it is sunny''. If you collect a good number of these weather states and activities, you can then summarize your data as the HMM shown in Figure 2.

Figure 2. An HMM For Activities And Weather

This HMM states that on rainy days your friend walks 10% of the days, while on sunny days he walks 60%. The statistics associated with this HMM are obtained by simply counting the activities on rainy and sunny days. You also notice arrows from state to state that keep track of weather changes. For example, our HMM reflects the fact that on a rainy day there is a 70% chance of rain the next day and a 30% chance of sunshine. In addition, one can keep track of how many days in the data are sunny or rainy. These counts give the initial probabilities.

Formally, these statistics are calculated by Maximum Likelihood Estimates (MLE). Transition probabilities are estimated as:

P(s_j|s_i) = (Number of transitions from s_i to s_j) / (Total number of transitions out of s_i)

The emission probabilities are estimated by Maximum Likelihood supplemented by smoothing. Smoothing is required because Maximum Likelihood Estimation will sometimes assign a zero probability to unseen emission-state combinations. Prior to smoothing, emission probabilities are estimated by:

P(w|s)_ml = (Number of times w is emitted by s) / (Total number of symbols emitted by s)
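The counting procedure above can be sketched in Ruby. This is an illustrative sketch (the helper name and data layout are ours): it estimates the transition matrix A and emission matrix B from one labeled sequence of [state, emission] pairs, and omits the smoothing step described above:

```ruby
# Maximum Likelihood Estimates from a labeled sequence of
# [state, emission] pairs, as in the weather/activity example.
# Smoothing (needed in practice for unseen pairs) is omitted here.
def mle_estimate(observations)
  trans = Hash.new { |h, k| h[k] = Hash.new(0.0) }  # matrix A counts
  emit  = Hash.new { |h, k| h[k] = Hash.new(0.0) }  # matrix B counts

  # Count transitions between consecutive states, and emissions per state.
  observations.each_cons(2) { |(s1, _), (s2, _)| trans[s1][s2] += 1 }
  observations.each { |s, w| emit[s][w] += 1 }

  # Normalize counts into probability distributions.
  [trans, emit].each do |table|
    table.each_value do |row|
      total = row.values.sum
      row.each_key { |k| row[k] /= total }
    end
  end
  [trans, emit]
end

data = [[:rainy, :clean], [:rainy, :clean], [:sunny, :walk], [:sunny, :shop]]
a, b = mle_estimate(data)
puts a[:rainy][:rainy]  # 0.5 : one of the two transitions out of rainy stays rainy
puts b[:sunny][:walk]   # 0.5 : sunny emits walk on one of its two days
```

The HMM used in this paper is trained from exactly this kind of counting, with the [characters, presentation forms] pairs of section 5 playing the role of the [state, emission] pairs here.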
The most interesting part of an HMM is the decoding aspect. We may be told that our friend's activities for the last four days were cleaning, cleaning, shopping, cleaning, and we want to know what the weather patterns were for those four days. This essentially translates to finding the sequence of four states s_1 s_2 s_3 s_4 that maximizes the probability:

P(s_1 s_2 s_3 s_4 | cleaning cleaning shopping cleaning)

This amounts to choosing the highest probability among the 16 choices for s_1 s_2 s_3 s_4, which becomes computationally very expensive as the number of states and symbols increases. The solution is given by the Viterbi algorithm, which finds an optimal path using dynamic programming [14]. Algorithm 1 is a modification of the pseudo code from [21]. In the next section we describe the design and implementation of an HMM for Farsi contextual analysis.

4. Farsi Hidden Markov Model

The Farsi HMM is very similar to the example HMM described in the previous section. The HMM has a state for each presentation form of the Farsi alphabet, and a vocabulary of size 32, one symbol for each character of the Farsi alphabet. A simple calculation suggests that the Farsi HMM should have 128 states and 32 vocabulary symbols. In practice the HMM has fewer than 128 states, since some characters do not have four presentational forms. For example, there are only two states for the character ‫ا‬ (alef), as there are no medial or initial forms for this character.

Algorithm 1: Viterbi Algorithm
Data: Given K states and M vocabularies, and a sequence of vocabularies Y = w_1 w_2 ... w_n
Result: The most likely state sequence R = r_1 r_2 ... r_n that maximizes the above probability
Function Viterbi(V, S, Π, Y, A, B):
    for each state s_i do
        T1[i, 1] = Π_i * B_{i, w_1}
        T2[i, 1] = 0
    end
    for i = 2, 3, ..., n do
        for each state s_j do
            T1[j, i] = max_k (T1[k, i-1] * A_{k, j} * B_{j, w_i})
            T2[j, i] = argmax_k (T1[k, i-1] * A_{k, j} * B_{j, w_i})
        end
    end
    z_n = argmax_k (T1[k, n])
    r_n = s_{z_n}
    for i = n, n-1, ..., 2 do
        z_{i-1} = T2[z_i, i]
        r_{i-1} = s_{z_{i-1}}
    end
    Return R

As an example, suppose we want to type the word ‫شغال‬ , in English jackal. On the keyboard, we type the four isolated characters ‫ﺵ‬ , ‫ﻍ‬ , ‫ا‬ , and ‫ﻝ‬ . The HMM should decode these four characters as initial, medial, final, and isolated, respectively. In other words, the sequence of the four isolated characters (the vocabulary, in HMM terminology) should be decoded into the four states as shown in Figure 3.

Figure 3. The Four Isolated Characters On The Left Are Vocabularies While The Four Characters On The Right Are The States Of The HMM

The part of the HMM displayed in Figure 4 shows how the Viterbi algorithm takes the path that decodes the correct forms of the characters by choosing the appropriate states. As we observe, there are four states for the character ‫ﻍ‬ , representing the four shapes of this character. We also observe that there are only two states for the character ‫ا‬ , as there are no medial or initial forms for this character.

A typical implementation of an HMM adds states and vocabularies as it is trained [20]. The training is done by providing pairs of the form ([w_1 w_2 ... w_n], [s_1 s_2 ... s_n]), similar to the [vocabularies, states] sequences shown in Figure 3.
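Algorithm 1 can be sketched in Ruby on the weather example of section 3. The transition and emission entries quoted in the text (walk on 10% of rainy days and 60% of sunny days; a 70%/30% split after a rainy day) are used directly; the remaining matrix entries and the initial probabilities are our assumptions, filled in only to make the example complete and runnable:

```ruby
# Viterbi decoding (Algorithm 1) on the weather HMM of section 3.
# Entries not quoted in the text are assumed values for illustration.
STATES = [:rainy, :sunny]
PI = { rainy: 0.6, sunny: 0.4 }                    # assumed initial probabilities
A  = { rainy: { rainy: 0.7, sunny: 0.3 },          # 70%/30% after a rainy day
       sunny: { rainy: 0.4, sunny: 0.6 } }         # assumed
B  = { rainy: { walk: 0.1, shop: 0.4, clean: 0.5 },  # walk 10% of rainy days
       sunny: { walk: 0.6, shop: 0.3, clean: 0.1 } } # walk 60% of sunny days

def viterbi(obs, states, pi, a, b)
  # t1[s] holds the best path probability ending in state s;
  # back[i][s] remembers the predecessor state that achieved it.
  t1 = states.to_h { |s| [s, pi[s] * b[s][obs.first]] }
  back = []
  obs.drop(1).each do |w|
    back << {}
    t1 = states.to_h do |s|
      prev = states.max_by { |k| t1[k] * a[k][s] }
      back.last[s] = prev
      [s, t1[prev] * a[prev][s] * b[s][w]]
    end
  end
  # Backtrack from the most probable final state.
  path = [t1.max_by { |_, prob| prob }.first]
  back.reverse_each { |bp| path.unshift(bp[path.first]) }
  path
end

p viterbi([:clean, :clean, :shop, :clean], STATES, PI, A, B)
# => [:rainy, :rainy, :rainy, :rainy]
```

With these numbers, four days of cleaning, cleaning, shopping, cleaning decode to four rainy days; the Farsi HMM of this section performs the same computation with presentation-form states and isolated-character emissions.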
Figure 4. Glyphs Chosen By HMM

5. Training and Testing of HMM

We trained the HMM with 89 words (2780 characters) chosen from the list of frequent words from the Kayhan newspaper published in 2005 [4]. There are over 10,000 words in this collection; we limited the training to this short list to save time. The list of these words is shown in Figure 5. The test data is a small number of words selected randomly from a small dictionary and shown in Figure 6. This list contains 32 words (350 characters).

The training file contains pairs of words separated by a vertical bar. The first word is the isolated form and the second word is the correct presentational form of the word. We read the file one line at a time and submit the two words for training, as seen in the following Ruby code:

    f = File.open("./training-data")
    farsi.train([" "], [" "])
    f.each do |line|
      seq1, seq2 = line.chomp.split(/\s*\|\s*/)
      farsi.train(seq1.split(" "), seq2.split(" "))
    end
As can be seen, we have added a blank vocabulary symbol and state to our HMM. The HMM adds vocabulary symbols and states as a part of the training. The trained HMM has 32 vocabulary symbols and 74 states. It is anticipated that the HMM will gain more states as the size of the training data increases.

The test correctly decoded 94% of all the characters. Most of the mistakes are due to the fact that the HMM has not seen enough combination examples of characters. For example, in the word ‫آتش‬ , the initial form of ‫ﺕ‬ was not decoded correctly. A closer examination of the training data reveals that there is no occurrence of ‫تش‬ in the set. Similarly, there are other errors of this form, such as the initial form of ‫ﻥ‬ in the word ‫ترانه‬ . There are also a few errors attributed to the double combination of the character ‫ﯼ‬ as in ‫طالیی‬ . We believe most of these errors will be corrected with a larger training set.

6. Conclusion and Future Work

In this paper, we have presented a machine learning approach to the contextual analysis of script languages. It is shown that an ergodic HMM can be easily trained to automatically decode presentational forms of the script languages. Although the paper is developed based on Farsi, the approach can be easily extended to other Middle Eastern languages. Further training and research in this area can improve the character accuracy. A successful program for contextual analysis may have to include a list of exceptional words that do not fall into the normal combination of the characters. It is also important to notice that most of the Arabic and Farsi typesetting technologies, such as ArabTeX [8] or FarsiTeX [5], have problems with contextual analysis. This is mainly due to the fact that it is practically impossible to devise an algorithm that has 100% accuracy for tasks associated with natural languages.

Figure 5. Top 89 Frequent Words From Kayhan 2005
Finally, a higher order HMM may also improve the contextual analysis. For example, it has been shown that a second order HMM improves handwritten character recognition [6]. It may also be worth mentioning that the second order HMM does not improve error detection and correction for post processing of printed documents [18].

Figure 6. Test Data Chosen Randomly

REFERENCES

[1] Jan A. Botha and Phil Blunsom. Compositional Morphology for Word Representations and Language Modeling. In Proceedings of the 31st International Conference on Machine Learning (ICML), Beijing, China, June 2014. *Award for best application paper*.
[2] Tim Bray. Element sets: A minimal basis for an XML query engine. In QL, 1998.
[3] Peter F. Brown, Vincent J. Della Pietra, Stephen A. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Comput. Linguist., 19(2):263-311, June 1993.
[4] Jon Dehdari. Top frequent words in Farsi, 2005. http://www.ling.ohio-state.edu/~jonsafari/.
[5] Mohammad Ghodsi and Behdad Esfahbod, 1992. http://www.farsitex.org/.
[6] Y. H. Y. He. Extended Viterbi algorithm for second order hidden Markov process. In Proceedings 9th International Conference on Pattern Recognition, pages 718-720, 1988.
[7] James Allan. Challenges in information retrieval and language modeling: Report of a workshop held at the Center for Intelligent Information Retrieval, University of Massachusetts Amherst, September 2002. SIGIR Forum, 37(1):31-47, April 2003.
[8] Klaus Lagally, 2006. http://www2.informatik.uni-stuttgart.de/ivi/bs/research/Arabic.htm.
[9] Fallah Moshfeghi and Kourosh Shadsari. Design and implementation of bilingual information entrance and edit environment. Technical report, Iran Telecommunication Research Center, winter 1999.
[10] Kourosh Fallah Moshfeghi. A new algorithm for contextual analysis of Farsi characters and its implementation in Java. In 17th International Unicode Conference, 2000.
[11] Herbert Penzl and Ismail Sloan. A Grammar of Pashto: a Descriptive Study of the Dialect of Kandahar, Afghanistan. Ishi Press International, 2009.
[12] Roozbeh Pournader. National Iranian standard ISIRI 6219, information technology - Persian information interchange and display mechanism, using Unicode. Technical Report, 2005.
[13] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, pages 257-286, 1989.
[14] Lawrence R. Rabiner. Readings in speech recognition. Chapter: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, pages 267-296. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1990.
[15] Kazem Taghva, Russell Beckley, and Mohammad Sadeh. A stemming algorithm for the Farsi language. In International Symposium on Information Technology: Coding and Computing (ITCC 2005), Volume 1, 4-6 April 2005, Las Vegas, Nevada, USA, pages 158-162, 2005.
[16] Kazem Taghva, Jeffrey S. Coombs, Ray Pereda, and Thomas A. Nartker. Language model-based retrieval for Farsi documents. In International Conference on Information Technology: Coding and Computing (ITCC'04), Volume 2, April 5-7, 2004, Las Vegas, Nevada, USA, pages 13-17, 2004.
[17] Kazem Taghva, Rania Elkhoury, and Jeffrey S. Coombs. Arabic stemming without a root dictionary. In International Symposium on Information Technology: Coding and Computing (ITCC 2005), Volume 1, 4-6 April 2005, Las Vegas, Nevada, USA, pages 152-157, 2005.
[18] Kazem Taghva, Srijana Poudel, and Spandana Malreddy. Post processing with first- and second-order hidden Markov models. In DRR, 2013.
[19] Kazem Taghva, Ron Young, Jeffrey Coombs, Russell Beckley, and Mohammad Sadeh. Farsi searching and display technologies. In SDIUT, 2003.
[20] David Tresner-Kirsch. HMM Ruby gem, 2009. https://github.com/dtkirsch/hmm/.
[21] Xing M. Wang. Probability bracket notation: Markov state chain projector, hidden Markov models and dynamic Bayesian networks. CoRR, abs/1212.3817, 2012.