International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 2, April 2020, pp. 2023~2030
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i2.pp2023-2030
Journal homepage: http://ijece.iaescore.com/index.php/IJECE
Improving accuracy of part-of-speech (POS) tagging using
hidden markov model and morphological analysis
for Myanmar Language
Dim Lam Cing, Khin Mar Soe
Natural Language Processing Lab, University of Computer Studies, Myanmar
Article Info ABSTRACT
Article history:
Received Sep 23, 2019
Revised Oct 25, 2019
Accepted Nov 2, 2019
In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are
fundamental tasks. POS information is also necessary in preprocessing for NLP applications such as
machine translation (MT) and information retrieval (IR). Currently, many research efforts develop
word segmentation and POS tagging separately, with different methods, to obtain high performance
and accuracy. For Myanmar language, there are also separate word segmenters and POS taggers based
on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However,
because of the complex morphological structure of the Myanmar language, the out-of-vocabulary (OOV)
problem still exists. To avoid errors and improve segmentation by utilizing POS information,
segmentation and tagging should be performed at the same time. The main goal of developing a POS
tagger for any language is to improve tagging accuracy and remove the ambiguity in sentences that
arises from language structure. This paper focuses on developing a word segmenter and POS tagger
for Myanmar language. It presents a comparison of separate word segmentation and POS tagging with
joint word segmentation and POS tagging.
Keywords:
Natural language processing
Hidden Markov model
Morphological analysis
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Dim Lam Cing,
Natural Language Processing Lab,
University of Computer Studies,
No.4, Main Road, ShwePyiThar Township, Yangon, Myanmar.
Email: dimlamcing@ucsy.edu.mm
1. INTRODUCTION
In many applications of natural language processing (NLP), Part-of-Speech (POS) tagging is an
essential task for every language, so a high-accuracy tagger is important for NLP applications.
Handling ambiguous and unknown words is the main challenge of POS tagging [1, 2]. NLP applications
such as machine translation, information extraction, speech recognition, grammar checking and word
sense disambiguation all need word segmentation and POS tagging as a fundamental processing step.
There are many methods for developing POS taggers. The most widely used techniques are rule-based,
statistical and neural-network-based methods. In the rule-based approach, rules are developed
according to the nature of the language to define precisely how and where to assign the various POS
tags [3-5]. This approach has already been used to build a POS tagger for Myanmar language. In the
statistical approach, statistical language models are built, refined and used to POS tag the input
text automatically. The most commonly used statistical approaches are based on Hidden Markov
Models, Support Vector Machines, Conditional Random Fields and Maximum Entropy [6, 7].
This paper describes Hidden Markov Models (HMMs) and the proposed system for word
segmentation and part-of-speech tagging for Myanmar language. Myanmar language is morphologically
rich, complex, and agglutinative in nature, and its words carry numerous linguistic features.
POS tagging [8], i.e., the capability of a computer to automatically assign POS tags to a given
sentence, is an important problem in NLP and one of the fundamental processing steps for any
language. Normally, the first step of processing is to divide the input text into units called
tokens, where each token is either a word or something else such as a number. The main clue used in
space-delimited languages like English is white space. In major East and Southeast Asian languages
such as Japanese, Chinese, Thai and Myanmar, there are no spaces between words: the Myanmar writing
style does not use any delimiter between words.
In word segmentation and POS tagging, the morphological structure of words is the main source of
information for tagging correctly. By using the morphological structure of a word, irrelevant tags
can be eliminated and the suitable tag for the word can be found [9-11]. Morphological analysis is
therefore an important part of language engineering applications, especially for a morphologically
rich and complex language like Myanmar.
Very little research has been conducted on language processing tasks, including morphological
analysis, for Myanmar language compared to English, French, Chinese, Indian languages and Thai.
High-level language processing tasks such as POS tagging, machine translation, semantic analysis,
syntactic analysis, sentiment analysis, information retrieval, classification and clustering all
operate on the smallest language unit, the word. A systematic linguistic study of the morphology of
the language is important in order to reveal words that are significant to users such as historians
and linguists.
Most current research on Myanmar language uses a lexicon, dictionary or corpus that lists all
word forms for word segmentation as an initial stage of processing. To get correct segmentation, an
exhaustive lexicon or corpus is needed. Myanmar language [12-16] has been classified by linguists
as a monosyllabic or isolating language with agglutinative features. Its writing style does not use
any delimiter between words, so there is no way of knowing whether a group of syllables forms a
word or is just a group of separate monosyllabic words. Every syllable has a meaning of its own.
The Myanmar language has complex morphotactic structures and ambiguous word segmentation.
Therefore, segmenting a sentence into lexically and semantically meaningful word sequences is a
challenging task. This paper aims to address this shortcoming by proposing a language model that
performs joint word segmentation and POS tagging. The rest of this paper is organized as follows.
Section 2 reviews the literature. Section 3 describes aspects of the Myanmar language. Section 4
presents the design of the proposed system. Section 5 provides the evaluation. Finally, Section 6
concludes the paper.
2. LITERATURE REVIEW
A Part-of-Speech tagger for Myanmar language using a supervised learning approach is presented
in [17]. To disambiguate POS tags, an HMM is used with the Baum-Welch algorithm for training and
the Viterbi algorithm for decoding. To tag a word, a Myanmar lexicon listing all its possible tags
is used. The experimental results show that the method achieved high accuracy (over 90%) for
various test inputs. Myanmar word segmentation in [18] used a hybrid approach in which sentences
are segmented into syllables and matched against the longest words. In the longest matching
method, words known from a dictionary are segmented first and unknown words are guessed from an
n-gram model [19]. The major issue of this technique comes from ambiguity in the longest matching
procedure, since words can appear in many forms.
The approach proposed by Y. Zhang and S. Clark [20] obtained a lower error rate than a two-stage
baseline system. The large combined search space is a challenge for this method and makes decoding
very hard. A single linear model is used for simultaneous word segmentation and POS tagging, with
the generalized perceptron algorithm for joint training and beam search for decoding. The joint
model gives an error reduction of 14.6% in segmentation accuracy and of 12.2% in tagging accuracy
compared with the conventional pipeline approach. In a Persian POS tagger [11], Persian sentences
are tagged with a combination of statistical and rule-based methods. To tag unknown words, a
probabilistic morphological analysis method is used. A second result of that research is a
knowledge base of Persian morphological rules whose probabilities are computed from a corpus.
Experimental results show that the approach increases tagging performance and accuracy [11].
3. ASPECT OF MYANMAR LANGUAGE
Myanmar language is highly agglutinative and is morphologically rich and complex. Moreover,
the Myanmar writing style does not use spaces to separate words, so there is no way of knowing
whether a group of syllables forms a word or is only a group of separate monosyllabic words.
Every syllable has its own meaning. Myanmar words consist of one or more syllables, which are
compounded in different ways. Depending on how words are formed from syllables, they can be
classified into three types: simple words, complex words and reduplicative words [21, 22]. For example,
ေ ပါင်း (steam) + အ်း(pot) =>ေ ပါင်းအ်း (rice cooker), မ်း(fire) + ပူ (hot) => မ်းပူ (iron), ပန်း(flower) +
ခ (carry) => ပန်းခ (painting), all have their referential meaning and each monosyllable within words also
has its own meaning. Myanmar morphological processes include inflection, derivation, and compounding.
3.1. Inflection morphology
Myanmar inflectional morphology of nouns, verbs and adjectives is mostly achieved by suffixation.
Inflection keeps the POS tag of the original word: adding the inflectional morphemes -တ ို့ or -မ ်း
forms the plural of nouns, and the inflectional morpheme -ခို့ forms the past tense of verbs.
For example: ေ က င်းသ ်းမ ်း (students) -> ေ က င်းသ ်း (student) + မ ်း; သ ်းခို့ (went) ->
သ ်း (go) + ခို့.
3.2. Derivation morphology
Myanmar derivational morphology operates by prefixation or suffixation. Derivation can
change the POS tag of a word form. Derivation of nouns, verbs and adjectives is mostly achieved by
suffixation, but circumfixes also occur in the Myanmar language. For example: အလပ (work) -> အ (Prefix) + လပ (do);
ေ ြ ပ်း ြ ခင်း (running) -> ေ ြ ပ်း (run) + ြ ခင်း (Suffix). However, အ- is not a bound prefix morpheme in some
nouns and verbs, and such words cannot be split; for example, if the word ေအမ(mother) is split, the parts have no
meaning.
3.3. Compounding
Myanmar contains many compound words: noun compounds, verb compounds, adjective compounds,
and compounds that mix nouns, verbs and adjectives. For example: compound
noun: ေ ဈ်းနှုန်း (price) -> ေ ဈ်း(market) + နှုန်း(rate); compound verb: ြ ဖတပင်း (voucher) -> ြ ဖတ(cut) +
ပင်း(divide); compound adjective: ခငမ (firm) -> ခင(firm) + မ (rigid); compound of noun, verb and adjective:
လူန တငက ်း(ambulance) -> လူ(human) + န (painful) + တင(placed) + က ်း(car). After compounding,
some words keep the POS tag of the original words and some receive a new POS tag.
4. DESIGN OF PROPOSED SYSTEM
The structure of the proposed framework is shown in Figure 1. There are two modules: a training
module and a testing module. In the training phase, a collection of segmented and tagged sentences
is used to build the proposed HMM model. This model is used in the testing phase. In the testing
phase, the input Myanmar text is split into individual sentences using the sentence end marker
called pote-ma ‘။’. After that, word segmentation and POS tagging are performed.
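The sentence-splitting step of the testing phase can be illustrated with a minimal Python sketch. It assumes the input is already normalized Unicode text and that pote-ma is encoded as U+104B; the function name `split_sentences` is hypothetical and not taken from the paper.

```python
# Minimal sketch: split raw text into sentences on the pote-ma marker.
# Assumption: pote-ma is the Unicode character U+104B (MYANMAR SIGN SECTION).
POTE_MA = "\u104B"


def split_sentences(text: str) -> list[str]:
    """Return the non-empty sentences of `text`, split at each pote-ma."""
    pieces = (piece.strip() for piece in text.split(POTE_MA))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    # Placeholder Latin strings stand in for Myanmar sentences.
    sample = "sentence one" + POTE_MA + " sentence two" + POTE_MA
    print(split_sentences(sample))
```

Each returned sentence would then be passed on to syllable identification, word segmentation and tagging.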
4.1. Corpus creation
Part-of-Speech tagged corpora are one of the essential resources for developing a state-of-the-art
POS tagger for Myanmar. Creating a tagged corpus takes the following steps:
- Collecting raw text
- Hand-annotating and preparing training data
We collect and normalize raw text from online journals, newspapers and e-books. Since the documents
use various Myanmar font styles, they are converted to standard Unicode format and cleaned, for
example by spell checking. We assign tags to the un-annotated text manually and thereby obtain the
training data for the statistical method. If the number of tags is large, complexity increases and
performance decreases. According to Myanmar grammar books and dictionaries [12-16],
there are nine Part-of-Speech tags in Myanmar language. We annotated every word with the appropriate
basic POS tag and created a POS-tagged corpus. Moreover, we added another three POS tags, Number,
Symbol and Abbreviation, in our research. The tagset is described in Table 1.
Figure 1. Framework of the proposed system
Table 1. Tagset
No. Tag Description Example
1. NN Noun ပန်း(flower)
2. PN Pronoun ကျွနမ(I)၊ သင(you)
3. V Verb ဝယ(buy)၊ စ ်း(eat)
4. Adj Adjective ပူ(hot)
5. Adv Adverb ေ လ်းစ ်းစ (respectfully)
6. PPM Postpositional Marker က၊ က
7. Conj Conjunction ထအခါ၊ ၍
8. Part Particles မ ်း၊ ခို့
9. Interj Interjection အ၊ အမယေလ်း
10. Number Number ၁၊၂ ၊ ၂၀
11. Symbol Symbol ( ) / % + - = ၊ ။
12. Abbrev Abbreviation အထက၊ဖဆပလ၊ေ အဘအမ
4.1.1. Corpus statistics
For our experiments, the corpus consists of sentences from Myanmar grammar books, Myanmar
textbooks, Myanmar history books and websites. The distribution of POS tags in the corpus is
described in Table 2. The text of the corpus is encoded in Unicode. There are 39716 sentences in
total, covering 690258 words, so each sentence has an average of about 17 words
(690258 / 39716 ≈ 17.4). The vocabulary size is 27043 words.
Table 2. Distribution of POS tags
POS tag Percentage of words
NN 25%
PN 4%
V 15%
Adj 2%
Adv 2%
PPM 17%
Conj 5%
Part 22%
Interj 0.03%
Number 1%
Symbol 7%
Abbrev 0.09%
4.2. Training hidden markov model
To train the model, we compute probabilities for each tag in the tagged corpus. The training
phase produces two results: transition probabilities and emission probabilities.
4.2.1. Estimating probabilities
In POS tagging with an HMM, the probabilities are estimated directly from a tagged training corpus
instead of using the full power of HMM learning. The tag transition probability P(t_i | t_{i-1}) is
the probability of a tag given the previous tag. It is estimated by counting, in the tagged corpus,
how often the first tag is followed by the second, relative to how often the first tag occurs:
$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$
The emission probability P(w_i | t_i) is the probability that a given tag is associated with a
given word [23]. The emission probability is
$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$
where C(.) denotes a count in the tagged training corpus.
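The relative-frequency estimation described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the sentence boundary symbols `<s>` and `</s>`, the function name `estimate_hmm` and the romanized toy words are all assumptions made for the example.

```python
from collections import Counter


def estimate_hmm(tagged_sentences):
    """Estimate P(tag | previous tag) and P(word | tag) by relative frequency.

    `tagged_sentences` is a list of sentences, each a list of (word, tag) pairs.
    """
    context_count = Counter()  # C(t_{i-1}): how often a tag precedes another symbol
    bigram_count = Counter()   # C(t_{i-1}, t_i)
    tag_count = Counter()      # C(t_i)
    emit_count = Counter()     # C(t_i, w_i)

    for sentence in tagged_sentences:
        prev = "<s>"  # hypothetical start-of-sentence symbol
        for word, tag in sentence:
            context_count[prev] += 1
            bigram_count[(prev, tag)] += 1
            tag_count[tag] += 1
            emit_count[(tag, word)] += 1
            prev = tag
        # Close the sentence so every tag occurrence is counted once as a context.
        context_count[prev] += 1
        bigram_count[(prev, "</s>")] += 1

    transition = {(p, t): c / context_count[p] for (p, t), c in bigram_count.items()}
    emission = {(t, w): c / tag_count[t] for (t, w), c in emit_count.items()}
    return transition, emission


if __name__ == "__main__":
    # Toy corpus with romanized placeholder words (not real corpus data).
    corpus = [
        [("kyarpan", "NN"), ("thi", "PPM"), ("pauk", "V"), ("thi", "PPM")],
        [("yay", "NN"), ("pauk", "V"), ("thi", "Part")],
    ]
    transition, emission = estimate_hmm(corpus)
    print(transition[("NN", "V")], emission[("NN", "yay")])
```

The resulting dictionaries play the role of the transition and emission probability files produced by the training phase.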
4.3. Joint Myanmar word segmentation and POS tagging
The input text is first split into sentences at the pote-ma “။”. The words in each sentence are
segmented and assigned POS tags from the proposed tagset in Table 1 by using HMM probabilistic
models. In Myanmar language, a word can consist of one or more syllables and a syllable can consist
of more than one character, so syllable identification must be done before word-level segmentation
[24]. For example, consider the following input sentence, whose N-grams are listed in Table 3:
ြ ကာပန်ေ်းသည ်ေရထဲတွည ်ေပါက်ေသ ်ေ။ (Lotus grows in water.)
After syllable identification, the output is as follows:
ြ ကာ|ပန်ေ်း|သ ်ေ|ေ ရ|ထဲ|တွ ်ေ|ေ ပါက်ေ|သ ်ေ
Table 3. N-gram word segmentation for input sentence
N-gram (N=1,2,3,4,5) Word Segmentation
Unigram ြ ကာ|ပန်ေ်း|သ ်ေ|ေ ရ|ထဲ|တွ ်ေ|ေ ပါက်ေ|သ ်ေ
Bigrams ြ ကာပန်ေ်း၊ပန်ေ်းသ ်ေ၊သည ်ေရ၊ည ရထဲ၊ထဲတွ ်ေ၊တွည ်ေပါက်ေ၊ည ပါက်ေသ ်ေ
Trigrams ြ ကာပန်ေ်းသ ်ေ၊ပန်ေ်းသည ်ေရ၊သည ်ေရထဲ၊ည ရထဲတွ ်ေ၊ထဲတွည ်ေပါက်ေ၊တွည ်ေပါက်ေသ ်ေ
4-grams ြ ကာပန်ေ်းသည ်ေရ၊ပန်ေ်းသည ်ေရထဲ၊သည ်ေရထဲတွ ်ေ၊ည ရထဲတွည ်ေပါက်ေ၊ထဲတွည ်ေပါက်ေသ ်ေ
5-grams ြ ကာပန်ေ်းသည ်ေရထဲ၊ပန်ေ်းသည ်ေရထဲတွ ်ေ၊သည ်ေရထဲတွည ်ေပါက်ေ၊ည ရထဲတွည ်ေပါက်ေသ ်ေ
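The N-gram candidates listed in Table 3 can be generated from the syllable sequence with a sliding window. The sketch below is an illustration only; the romanized syllables are placeholders for the eight syllables of the example sentence, and the function name `syllable_ngrams` is hypothetical.

```python
def syllable_ngrams(syllables, n):
    """Return every run of `n` consecutive syllables, joined into one string."""
    return ["".join(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


if __name__ == "__main__":
    # Romanized placeholders for the eight syllables of the example sentence.
    syllables = ["kyar", "pan", "thi", "yay", "hte", "twin", "pauk", "thi"]
    for n in range(1, 6):
        print(n, syllable_ngrams(syllables, n))
```

For a sentence of eight syllables this yields 8 unigrams, 7 bigrams, 6 trigrams, 5 4-grams and 4 5-grams, which matches the number of entries in each row of Table 3.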
A common strategy for performing word segmentation and POS tagging simultaneously is to use
N-grams, starting from 5-grams: the input sentence is scanned from left to right, and each word is
retrieved together with all its possible tags and their probabilities from the emission probability
file. If a 5-gram is not contained in the emission probability file, the system falls back to
4-grams, trigrams, bigrams and unigrams. The word segmentation of the input sentence according to
the longest N-gram technique is:
ြ ကာပန်ေ်း၊သ ်ေ၊ေ ရ၊ထဲတွ ်ေ၊ေ ပါက်ေ၊သ ်ေ
Word probabilities and language model probabilities are calculated by relative frequency counts.
If a word has more than one POS option, the system selects the POS option with the highest word
probability, as described in Table 4.
Table 4. All possible word, tag and probability
Word Segmentation POS Language Model Probability Selected POS
ြ ကာပန်ေ်း (Lotus) NN 1 NN
သ ်ေ (null) PPM 0.4 PPM
Part 0.3
PN 0.2
Adj 0.1
ည ရ (water) NN 0.6 NN
V 0.2
Part 0.2
ထဲတွ ်ေ (in) PPM 1 PPM
ည ပါက်ေ(grow) Part 0.2
V 0.7 V
NN 0.1
သ ်ေ(null) PPM 0.4 PPM
Part 0.3
PN 0.2
Adj 0.1
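The longest N-gram matching and tag selection illustrated in Table 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the lexicon is a plain dictionary keyed by romanized placeholder words, its probabilities are copied from Table 4, and the `UNK` tag for unmatched syllables is a hypothetical convention.

```python
def segment_and_tag(syllables, lexicon, max_n=5):
    """Greedy left-to-right longest-match segmentation with per-word tag choice.

    `lexicon` maps a word to a dict of {tag: probability}.
    """
    result = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first (5-gram), then fall back to shorter ones.
        for n in range(min(max_n, len(syllables) - i), 0, -1):
            candidate = "".join(syllables[i:i + n])
            if candidate in lexicon:
                tags = lexicon[candidate]
                best_tag = max(tags, key=tags.get)  # highest word probability
                result.append((candidate, best_tag))
                i += n
                break
        else:
            # No known word starts here: emit the syllable with a hypothetical UNK tag.
            result.append((syllables[i], "UNK"))
            i += 1
    return result


if __name__ == "__main__":
    # Probabilities taken from Table 4; keys are romanized placeholders.
    lexicon = {
        "kyarpan": {"NN": 1.0},
        "thi": {"PPM": 0.4, "Part": 0.3, "PN": 0.2, "Adj": 0.1},
        "yay": {"NN": 0.6, "V": 0.2, "Part": 0.2},
        "htetwin": {"PPM": 1.0},
        "pauk": {"Part": 0.2, "V": 0.7, "NN": 0.1},
    }
    syllables = ["kyar", "pan", "thi", "yay", "hte", "twin", "pauk", "thi"]
    print(segment_and_tag(syllables, lexicon))
```

With the values of Table 4, the sketch selects NN, PPM, NN, PPM, V and PPM for the six segmented words, the same tags as the "Selected POS" column. Words left with the `UNK` tag are the natural candidates for a rule-based fallback.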
4.4. Morphological rules approach
The internal structures of words are defined by using morphological rules [11]. These rules
consist of three parts: prefix (အ), stem and suffix (မ ်း). The general syntax is as follows:
prefix + stem + suffix -> POS tag
In this syntax, a string sometimes contains both a prefix and a suffix; in other cases, either the
prefix or the suffix is the empty string. There are three types of morphological rules for Myanmar
language: inflectional, derivational and compounding rules. In this system, 68 morphological rules
are defined [25] and used. The rules are drawn from Myanmar grammar books [12-16]. The uses of
inflection, derivation and compounding are described in Section 3.
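The rule syntax prefix + stem + suffix -> POS tag can be illustrated with a small Python sketch. The paper does not list its 68 rules, so the three rules below are hypothetical examples modeled on the patterns of Section 3 (plural suffix on a noun, nominalizing prefix on a verb, nominalizing suffix on a verb), written with romanized placeholder affixes; the names `Rule` and `apply_rules` are also assumptions.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    prefix: str      # may be the empty string
    suffix: str      # may be the empty string
    stem_tag: str    # POS tag the stem must have
    result_tag: str  # POS tag assigned to the whole word


def apply_rules(word: str, stem_lexicon: dict, rules: list) -> Optional[str]:
    """Return the tag of the first rule whose affixes and stem tag match, else None."""
    for rule in rules:
        if not (word.startswith(rule.prefix) and word.endswith(rule.suffix)):
            continue
        stem = word[len(rule.prefix):len(word) - len(rule.suffix)]
        if stem and stem_lexicon.get(stem) == rule.stem_tag:
            return rule.result_tag
    return None


if __name__ == "__main__":
    # Hypothetical rules and stems with romanized placeholders.
    rules = [
        Rule("", "myar", "NN", "NN"),  # inflection: noun + plural suffix stays a noun
        Rule("a", "", "V", "NN"),      # derivation: prefix + verb becomes a noun
        Rule("", "chin", "V", "NN"),   # derivation: verb + suffix becomes a noun
    ]
    stem_lexicon = {"kyaungthar": "NN", "loke": "V", "pyay": "V"}
    for word in ["kyaungtharmyar", "aloke", "pyaychin", "amay"]:
        print(word, apply_rules(word, stem_lexicon, rules))
```

The last word shows why the stem check matters: a word that merely begins with the prefix, but whose remainder is not a known stem, is left untagged instead of being split.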
5. EVALUATION
To evaluate the test results for POS tagging, the system uses the measures Recall,
Precision and F-score. These measures are defined as follows:
$$\text{Recall},\ R = \frac{\text{Number of correct POS tags assigned by the system}}{\text{Number of words in the test set}}$$
$$\text{Precision},\ P = \frac{\text{Number of correct POS tags assigned by the system}}{\text{Number of POS tags assigned by the system}}$$
$$\text{F-score},\ F = \frac{2PR}{P + R}$$
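The three measures can be computed as in the following Python sketch. The function names are hypothetical; the precision and recall pairs in the consistency check are the values reported in Table 6, and the check only confirms that each reported F-score follows from its precision and recall.

```python
def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, F = 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(correct: int, assigned: int, total_words: int):
    """Compute (P, R, F) from raw counts.

    correct:     number of correct POS tags assigned by the system
    assigned:    number of POS tags assigned by the system
    total_words: number of words in the test set
    """
    precision = correct / assigned
    recall = correct / total_words
    return precision, recall, f_score(precision, recall)


if __name__ == "__main__":
    # (precision, recall, reported F-score in percent) from Table 6.
    table6 = [
        (0.68, 0.67, 67), (0.78, 0.76, 77), (0.90, 0.88, 89),
        (0.77, 0.75, 76), (0.85, 0.83, 84), (0.94, 0.92, 93),
    ]
    for p, r, reported in table6:
        print(p, r, round(100 * f_score(p, r)), reported)
```

Precision and recall differ only when the number of tags the system assigns differs from the number of words in the test set, which happens when segmentation produces more or fewer words than the reference.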
5.1. Experimental setup
For testing the proposed model, we divided our corpus into two corpora, as shown in Table 5.
We collected 500 new sentences for open testing. In our experiments, we compare separate
word segmentation and POS tagging using HMM, joint word segmentation and POS tagging using HMM,
and joint word segmentation and POS tagging using HMM with morphological rules; the results are
given in Table 6. For comparison, we used the Bigram Part-of-Speech Tagger for Myanmar Language
[17] as the baseline system. The proposed system and the baseline system used the same training
corpus and test data.
Table 5. Statistics of the dataset
Data No. of sentences No. of words
Corpus 1 29680 547969
Corpus 2 39716 690258
Table 6. Accuracy of the system on different test cases using HMM and morphological rules
Corpus size (sentences) | Method | Precision | Recall | F-score
29680 | Separate word segmentation and POS tagging | 68% | 67% | 67%
29680 | Joint word segmentation and POS tagging | 78% | 76% | 77%
29680 | Joint word segmentation and POS tagging + morphological rules | 90% | 88% | 89%
39716 | Separate word segmentation and POS tagging | 77% | 75% | 76%
39716 | Joint word segmentation and POS tagging | 85% | 83% | 84%
39716 | Joint word segmentation and POS tagging + morphological rules | 94% | 92% | 93%
5.2. Results and discussion
Table 6 shows the experimental results for Myanmar word segmentation and POS tagging with
different training data sizes. According to the table, the proposed method improves on the
comparison baseline. Accuracy increases when the number of training sentences is increased, and
using morphological rules also gives a clear increase over the corresponding baselines. The
accuracy of the tagger is evaluated on test data that contains different kinds of words. For the
tagger, test words can be classified as known words, unknown words and ambiguous words. “Known
words” are words contained in the training corpus, and “unknown words” are words not contained in
the training corpus. “Ambiguous words” are known words that are tagged wrongly because of
segmentation errors; they require disambiguation to determine which tag is correct.
In the proposed system, most unknown words are proper nouns (names of persons and locations), and
different positions of particles and postpositional markers in the segmentation can cause ambiguity
in POS tagging. No training data can cover all proper nouns. Ambiguous words and unknown words
decrease the performance of the tagger. Morphological rules are used to disambiguate ambiguous
words. By using morphological rules, the system reduced the ambiguity of particles and
postpositional markers.
6. CONCLUSION
This paper presents joint word segmentation and POS tagging for Myanmar using HMM and
morphological rules. In our experiments, we compared separate word segmentation and POS tagging
with our proposed joint word segmentation and POS tagging using HMM. We found a significant
improvement for joint word segmentation and POS tagging using HMM with morphological rules. We also
described the distribution of words in the corpus. Unknown words still remain in our experiments.
Future work will be to improve the accuracy of word segmentation and POS tagging, which also
requires a larger training corpus. With a larger training corpus and morphological rules, the
assignment of POS tags will be more accurate, and the number of unknown words, incorrect tags and
ambiguous words will be reduced. The paper has shown that word segmentation and POS tagging for
Myanmar can be improved by using a larger training corpus combined with morphological analysis of
the Myanmar language.
REFERENCES
[1] T. Mikolov, A. Deoras, D. Povey, L. Burget, J. H. Cernocky, "Strategies for training large scale neural network
language models," IEEE Automatic Speech Recognition and Understanding Workshop, pp. 196-201, 2011.
[2] A.J.P.M.P. Jayaweera, N. G. J. Dias, "Hidden markov model based part of speech tagger for sinhala language,"
International Journal on Natural Language Computing (IJNLC), vol. 3(3), 2014.
[3] Sirajuddin Y. Hala, Sagar H. Virani, "Improve accuracy of parts of speech tagger for Gujarati language,"
International Journal of Advance Engineering and Research Development, vol. 2(5), 2015.
[4] P.M Bhatt, A. Ganatra, "Analyzing & enhancing accuracy of part of speech tagger with the usage of mixed
approaches for Gujarati," International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878,
vol. 8(1), 2019.
[5] K. Mohnot, N. Bansal, S.P. Singh, A. Kumar, "Hybrid approach for part of speech tagger for Hindi language,"
International Journal of Computer Technology and Electronics Engineering (IJCTEE), vol. 4(1), 2014.
[6] S. AlGahtani, J. McNaught, "Joint Arabic Segmentation and Part-of-Speech Tagging," Proceedings of the Second
Workshop on Arabic Natural Language Processing, Association for Computational Linguistics,
pp. 108-117, 2015.
[7] A. F. Wicaksono, A. Purwarianti, "HMM based part-of-speech tagger for Bahasa Indonesia," Proceedings of
4th International MALINDO (Malay and Indonesian Language) Workshop, 2010.
[8] S.-H. Na, "Conditional random fields for Korean morpheme segmentation and POS tagging," ACM
Transactions on Asian Language Information Processing, vol. 14(3), 2015.
[9] Z. H. Pozveh, A. Monadjemi, A. Ahmadi, "Persian texts part of speech tagging using artificial neural networks,"
Journal of Computing and Security, vol. 3(4), pp. 233-241, 2016.
[10] C. Lyu, Y. Zhang, D. Ji, "Joint word segmentation, POS-tagging and syntactic chunking," Proceedings of
the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
[11] H. Fadaei, M. Shamsfard, "Persian POS tagging using probabilistic morphological analysis," Int. J. Computer
Applications in Technology, vol. 38(4), pp. 264-273, 2010.
[12] P. Hopple, "The structure of nominalization in Burmese," Ph. D Dissertation. University of Texas, Arlington, 2003.
[13] Department of the Myanmar Language Commission, "Myanmar grammar," Ministry of Education, Myanmar, 2006.
[14] "Myanmar-English dictionary," Ministry of Education, Myanmar.
[15] Grammar. Burmese language. http://en.wikipedia.org/wiki/Burmese_Language
[16] Department of the Myanmar Language Commission, "Myanmar grammar," Ministry of Education, Myanmar, 2016.
[17] P. H. Myint, T. M. Htwe, N. L. Thein, "Bigram part-of-speech tagger for Myanmar language," 2011 International
Conference on Information Communication and Management, IPCSIT, vol. 16, 2011.
[18] W. P. Pa, N. L. Thein, "Myanmar word segmentation using hybrid approach," Proceedings of 6th International
Conference on Computer Applications, 2008.
[19] W. P. Pa, Y. K. Thu, A. Finch, E. Sumita, "Word boundary identification for Myanmar text using conditional
random fields," Genetic and Evolutionary Computing, Springer International Publishing Switzerland, p. 447, 2016.
[20] Y. Zhang, S. Clark, "Joint word segmentation and POS tagging using a single perceptron," Proceedings of ACL-08:
HLT, pp. 888-896, 2008.
[21] T. M. Htwe, D. L. Cing, "A neural probabilistic language model for joint morphological segmentation and POS
tagging," The Seventh International Conference on Science and Engineering (ICSE), pp. 9-10, 2016.
[22] T. T. Zin, K. M. Soe, N. L. Thein, "Myanmar phrases translation model with morphological analysis for statistical
Myanmar to English translation system," 25th Pacific Asia Conference on Language, Information and
Computation, pp. 130-139, 2011.
[23] D. Jurafsky, James H. Martin, "Speech and language processing: An introduction to natural language processing,
computational linguistics, and speech recognition," Copyright 2006, Draft of June 25, 2007.
[24] https://github.com/ye-kyaw-thu/sylbreak
[25] D. L. Cing, K. M. Soe, "Joint word segmentation and part-of-speech (POS) tagging for Myanmar language," 17th
International Conference on Computer Application, 2019.
BIOGRAPHIES OF AUTHORS
Dim Lam Cing received her M.C.Sc in Computer Science from Computer University (Kalay) in 2010.
She is a PhD candidate at the University of Computer Studies, Yangon (UCSY). Her research interests
include Natural Language Processing and Machine Learning.
Khin Mar Soe received her M.C.Sc and Ph.D degrees in Information Technology from the University of
Computer Studies, Yangon (UCSY) in 2000 and 2005 respectively. She is currently a full
professor in the Natural Language Processing (NLP) Lab at UCSY. Her main research interests
include Natural Language Processing and Artificial Intelligence.

PDF
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
PDF
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
PDF
Adaptive synchronous sliding control for a robot manipulator based on neural ...
PDF
Remote field-programmable gate array laboratory for signal acquisition and de...
PDF
Detecting and resolving feature envy through automated machine learning and m...
PDF
Smart monitoring technique for solar cell systems using internet of things ba...
PDF
An efficient security framework for intrusion detection and prevention in int...
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...
Embedded machine learning-based road conditions and driving behavior monitoring
Advanced control scheme of doubly fed induction generator for wind turbine us...
Neural network optimizer of proportional-integral-differential controller par...
An improved modulation technique suitable for a three level flying capacitor ...
A review on features and methods of potential fishing zone
Electrical signal interference minimization using appropriate core material f...
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...
Bibliometric analysis highlighting the role of women in addressing climate ch...
Voltage and frequency control of microgrid in presence of micro-turbine inter...
Enhancing battery system identification: nonlinear autoregressive modeling fo...
Smart grid deployment: from a bibliometric analysis to a survey
Use of analytical hierarchy process for selecting and prioritizing islanding ...
Enhancing of single-stage grid-connected photovoltaic system using fuzzy logi...
Enhancing photovoltaic system maximum power point tracking with fuzzy logic-b...
Adaptive synchronous sliding control for a robot manipulator based on neural ...
Remote field-programmable gate array laboratory for signal acquisition and de...
Detecting and resolving feature envy through automated machine learning and m...
Smart monitoring technique for solar cell systems using internet of things ba...
An efficient security framework for intrusion detection and prevention in int...

Recently uploaded (20)

PPT
introduction to datamining and warehousing
PPTX
UNIT-1 - COAL BASED THERMAL POWER PLANTS
PPTX
Safety Seminar civil to be ensured for safe working.
PPTX
Construction Project Organization Group 2.pptx
PPTX
OOP with Java - Java Introduction (Basics)
PPTX
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
PDF
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
PPT
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
PPT
Project quality management in manufacturing
PDF
Digital Logic Computer Design lecture notes
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
CYBER-CRIMES AND SECURITY A guide to understanding
PPTX
Artificial Intelligence
PPTX
web development for engineering and engineering
PPTX
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
PPTX
Foundation to blockchain - A guide to Blockchain Tech
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
DOCX
573137875-Attendance-Management-System-original
DOCX
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
PDF
Embodied AI: Ushering in the Next Era of Intelligent Systems
introduction to datamining and warehousing
UNIT-1 - COAL BASED THERMAL POWER PLANTS
Safety Seminar civil to be ensured for safe working.
Construction Project Organization Group 2.pptx
OOP with Java - Java Introduction (Basics)
MET 305 2019 SCHEME MODULE 2 COMPLETE.pptx
Mohammad Mahdi Farshadian CV - Prospective PhD Student 2026
Introduction, IoT Design Methodology, Case Study on IoT System for Weather Mo...
Project quality management in manufacturing
Digital Logic Computer Design lecture notes
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
CYBER-CRIMES AND SECURITY A guide to understanding
Artificial Intelligence
web development for engineering and engineering
M Tech Sem 1 Civil Engineering Environmental Sciences.pptx
Foundation to blockchain - A guide to Blockchain Tech
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
573137875-Attendance-Management-System-original
ASol_English-Language-Literature-Set-1-27-02-2023-converted.docx
Embodied AI: Ushering in the Next Era of Intelligent Systems

Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language

International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 2, April 2020, pp. 2023~2030
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i2.pp2023-2030
Journal homepage: http://ijece.iaescore.com/index.php/IJECE

Dim Lam Cing, Khin Mar Soe
Natural Language Processing Lab, University of Computer Studies, Myanmar

Article history: Received Sep 23, 2019; Revised Oct 25, 2019; Accepted Nov 2, 2019

ABSTRACT
In Natural Language Processing (NLP), word segmentation and Part-of-Speech (POS) tagging are fundamental tasks. POS information is also needed as a preprocessing step in NLP applications such as machine translation (MT) and information retrieval (IR). Many research efforts have developed word segmentation and POS tagging separately, using different methods to achieve high performance and accuracy. For the Myanmar language, there are likewise separate word segmenters and POS taggers based on statistical approaches such as Neural Networks (NN) and Hidden Markov Models (HMMs). However, because of the complex morphological structure of the Myanmar language, the out-of-vocabulary (OOV) problem still exists. To avoid errors and to improve segmentation by using POS information, segmentation and tagging should be performed at the same time. The main goal of developing a POS tagger for any language is to improve tagging accuracy and to remove the ambiguity in sentences that arises from language structure. This paper focuses on developing a word segmenter and POS tagger for the Myanmar language and compares separate word segmentation and POS tagging with joint word segmentation and POS tagging.

Keywords: Natural language processing, Hidden Markov model, Morphological analysis

Copyright © 2020 Institute of Advanced Engineering and Science. All rights reserved.
Corresponding Author:
Dim Lam Cing, Natural Language Processing Lab, University of Computer Studies, No.4, Main Road, ShwePyiThar Township, Yangon, Myanmar.
Email: dimlamcing@ucsy.edu.mm

1. INTRODUCTION
In many applications of natural language processing, Part-of-Speech (POS) tagging is an essential task for every language, so a high-accuracy tagger is important for NLP applications. Handling ambiguous and unknown words is the main challenge of POS tagging [1, 2]. NLP applications such as machine translation, information extraction, speech recognition, grammar checking and word sense disambiguation all need word segmentation and POS tagging as a fundamental processing step. There are many methods for developing POS taggers. The most widely used techniques are rule-based, statistical and neural-network-based methods. In the rule-based approach, rules are developed according to the nature of the language to define precisely how and where to assign the various POS tags [3-5]. This approach has already been used to develop a POS tagger for the Myanmar language. In the statistical approach, statistical language models are built, refined and used to POS tag the input text automatically. The most commonly used statistical approaches are based on Hidden Markov Models, Support Vector Machines, Conditional Random Fields and Maximum Entropy [6, 7]. This paper describes Hidden Markov Models (HMMs) and the proposed system for word segmentation and part-of-speech tagging for the Myanmar language. Myanmar Language is morphologically
rich, complex and agglutinative in nature, and its words carry numerous linguistic features. POS tagging [8], i.e., the ability of a computer to automatically assign POS tags to a given sentence, is a significant problem in the field of NLP and one of the fundamental processing steps for any language. Normally, the first step of processing is to divide the input text into units called tokens, where each token is either a word or something else, such as a number. The main clue used in space-delimited languages like English is white space. In major East and Southeast Asian languages such as Japanese, Chinese, Thai and Myanmar, there are no spaces between words; the Myanmar writing style does not use any delimiter between words. In word segmentation and POS tagging, the morphological structure of words is the main source of information for correct tagging. By using the morphological structure of a word, irrelevant tags can be eliminated and a suitable tag can be found for the word [9-11]. Morphological analysis is therefore an important part of language engineering applications, especially for a morphologically rich and complex language like Myanmar. Compared with English, French, Chinese, Indian languages and Thai, very little research has been conducted on language processing tasks for Myanmar, including morphological analysis. High-level language processing tasks such as POS tagging, machine translation, semantic analysis, syntactic analysis, sentiment analysis, information retrieval, classification and clustering all operate on the smallest language unit, the word. Studying the morphology of the language through systematic linguistic analysis is therefore important in order to reveal words that are significant to users such as historians and linguists.
Most current research on the Myanmar language uses a lexicon, dictionary or corpus that lists all word forms for word segmentation as an initial stage of processing. Correct segmentation therefore requires an exhaustive lexicon or corpus. The Myanmar language [12-16] has been classified by linguists as a monosyllabic or isolating language with agglutinative features. Its writing style does not use any delimiter between words, so there is no way of knowing whether a group of syllables forms a word or is just a group of separate monosyllabic words. Every syllable has a meaning of its own. The Myanmar language has complex morphotactic structures and ambiguous word segmentation. Segmenting a sentence into lexically and semantically meaningful word sequences is therefore a challenging task. This paper aims to address this shortcoming by proposing a language model that performs joint word segmentation and POS tagging. The rest of this paper is organized as follows. Section 2 reviews the literature. Section 3 describes aspects of the Myanmar language. Section 4 presents the design of the proposed system. Section 5 provides the evaluation. Finally, Section 6 concludes the paper.

2. LITERATURE REVIEW
A Part-of-Speech tagger for the Myanmar language that uses a supervised learning approach is presented in [17]. To disambiguate POS tags, an HMM is used with the Baum-Welch algorithm for training and the Viterbi algorithm for decoding. To tag a word, a Myanmar lexicon listing all its possible tags is used. The experimental results show that the method achieves high accuracy (over 90%) on various test inputs. Myanmar word segmentation in [18] uses a hybrid approach in which sentences are segmented into syllables and matched against the longest words. In the longest matching method, words known from a dictionary are segmented first, and unknown words are guessed from an n-gram model [19].
The major issue with this technique comes from ambiguity in the longest matching procedure, since words can appear in many forms. The joint approach proposed by Y. Zhang and S. Clark [20] achieved a lower error rate than a two-stage baseline system. The large combined search space is a challenge for this method and makes decoding very hard. For simultaneous word segmentation and POS tagging, a single linear model is used, with the generalized perceptron algorithm for joint training and beam search for decoding. Compared with the conventional pipeline approach, the joint model gives an error reduction of 14.6% in segmentation accuracy and of 12.2% in tagging accuracy. In a Persian POS tagger [11], Persian sentences are tagged using a combination of statistical and rule-based methods. To tag unknown words, a probabilistic morphological analysis method is used. A second result of that research is a knowledge base of Persian morphological rules whose probabilities are estimated from a corpus. The experimental results show that the approach increases tagging performance and accuracy [11].

3. ASPECT OF MYANMAR LANGUAGE
The Myanmar language is highly agglutinative and morphologically rich and complex. Moreover, the Myanmar writing style does not use spaces to separate words, so there is no way of knowing whether a group of syllables forms a word or is only a group of separate monosyllabic words.
Every syllable has its own meaning. Myanmar words consist of one or more syllables, which are combined in different ways. Depending on how words are built from syllables, they can be classified into three types: simple words, complex words and reduplicative words [21, 22]. For example, ေ ပါင်း (steam) + အ်း(pot) => ေ ပါင်းအ်း (rice cooker), မ်း(fire) + ပူ (hot) => မ်းပူ (iron), ပန်း(flower) + ခ (carry) => ပန်းခ (painting): each word has its own referential meaning, and each monosyllable within it also has its own meaning. Myanmar morphological processes include inflection, derivation and compounding.

3.1. Inflection morphology
Myanmar inflectional morphology of nouns, verbs and adjectives is mostly achieved by suffixation. An inflected word keeps the POS tag of the original word: adding the inflectional morpheme -တ ို့ or -မ ်း forms the plural of nouns, and the inflectional morpheme -ခို့ forms the past tense of verbs. For example: ေ က င်းသ ်းမ ်း (students) -> ေ က င်းသ ်း (student) + မ ်း; သ ်းခို့ (went) -> သ ်း (go) + ခို့.

3.2. Derivation morphology
Myanmar derivation occurs by means of prefixation or suffixation and can change the POS tag of a word form. Derivation of nouns, verbs and adjectives is mostly achieved by suffixation, but circumfixes also occur in the Myanmar language. For example: အလပ (work) -> အ (prefix) + လပ (do); ေ ြ ပ်း ြ ခင်း (running) -> ေ ြ ပ်း (run) + ြ ခင်း (suffix). However, အ- is not a bound prefix morpheme in some nouns and verbs, and such words cannot be split; for example, if the word ေအမ(mother) is split, the parts have no meaning.

3.3. Compounding
The Myanmar language contains many compound words: noun compounds, verb compounds, adjective compounds, and compounds that mix nouns, verbs and adjectives.
For example: compound noun: ေ ဈ်းနှုန်း (price) -> ေ ဈ်း(market) + နှုန်း(rate); compound verb: ြ ဖတပင်း (voucher) -> ြ ဖတ(cut) + ပင်း(divide); compound adjective: ခငမ (firm) -> ခင(firm) + မ (rigid); compound of noun, verb and adjective: လူန တငက ်း(ambulance) -> လူ(human) + န (painful) + တင(placed) + က ်း(car). After compounding, some words keep the POS tag of the original word and some receive a new POS tag.

4. DESIGN OF PROPOSED SYSTEM
The framework of the proposed system is shown in Figure 1. There are two modules: training and testing. In the training phase, a collection of segmented and tagged sentences is used to build the proposed HMM. This model is then used in the testing phase. In the testing phase, the input Myanmar text is split into sentences using the sentence-end marker called pote-ma ‘။’. After that, word segmentation and POS tagging are performed.

4.1. Corpus creation
POS-tagged corpora are one of the essential resources for developing a state-of-the-art POS tagger for Myanmar. Building a tagged corpus involves the following steps:
- Collecting raw text
- Hand-annotating and preparing training data
We collect and normalize raw text from online journals, newspapers and e-books. Since the documents use various Myanmar font styles, they are converted to standard Unicode format and cleaned, for example by spell checking. We then assign tags to the un-annotated text manually to obtain the training data for the statistical method. If the number of tags is large, complexity increases and performance decreases. According to Myanmar grammar books and dictionaries [12-16], there are nine Part-of-Speech tags in the Myanmar language. We annotated every word with the appropriate basic POS tag and created a POS-tagged corpus. In addition, we added three further POS tags in our research: Number, Symbol and Abbreviation.
The tagset is described in Table 1.
Figure 1. Framework of the proposed system

Table 1. Tagset
No. | Tag | Description | Example
1 | NN | Noun | ပန်း(flower)
2 | PN | Pronoun | ကျွနမ(I)၊ သင(you)
3 | V | Verb | ဝယ(buy)၊ စ ်း(eat)
4 | Adj | Adjective | ပူ(hot)
5 | Adv | Adverb | ေ လ်းစ ်းစ (respectfully)
6 | PPM | Postpositional Marker | က၊ က
7 | Conj | Conjunction | ထအခါ၊ ၍
8 | Part | Particles | မ ်း၊ ခို့
9 | Interj | Interjection | အ၊ အမယေလ်း
10 | Number | Number | ၁၊၂ ၊ ၂၀
11 | Symbol | Symbol | ( ) / % + - = ၊ ။
12 | Abbrev | Abbreviation | အထက၊ဖဆပလ၊ေ အဘအမ

4.1.1. Corpus statistics
For our experiments, the corpus consists of sentences from Myanmar grammar books, Myanmar textbooks, Myanmar history texts and websites. All text is encoded in Unicode. In total there are 39716 sentences covering 690258 words, and each sentence has an average of 18 words. The vocabulary size is 27043 words. The distribution of POS tags in the corpus is described in Table 2.

Table 2. Distribution of POS tags
POS tag | Share of words
NN | 25%
PN | 4%
V | 15%
Adj | 2%
Adv | 2%
PPM | 17%
Conj | 5%
Part | 22%
Interj | 0.03%
Number | 1%
Symbol | 7%
Abbrev | 0.09%
4.2. Training hidden markov model
To train the model, we compute probabilities for each tag in the tagged corpus. The training phase produces two results: transition probabilities and emission probabilities.

4.2.1. Estimating probabilities
For POS tagging with an HMM, the probabilities are estimated directly from a tagged training corpus instead of using the full power of HMM learning. The tag transition probability $P(t_i \mid t_{i-1})$ is the probability of a tag given the previous tag. It is estimated by counting, in the tagged corpus, how often the first tag is followed by the second, relative to how often the first tag occurs:

$$P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}$$

The emission probability $P(w_i \mid t_i)$ is the probability that a given tag is associated with a given word [23]. The emission probability is

$$P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}$$

where $C(\cdot)$ denotes a count in the tagged training corpus.

4.3. Joint Myanmar word segmentation and POS tagging
The input text is first split into sentences at the pote-ma “။”. The words in each sentence are then segmented and assigned POS tags from the tagset in Table 1 by using the HMM probabilistic model. In the Myanmar language, a word consists of one or more syllables and a syllable consists of one or more characters, so syllable identification must be done before word-level segmentation [24]. For example, consider the following input sentence:
ြ ကာပန်ေ်းသည ်ေရထဲတွည ်ေပါက်ေသ ်ေ။ (Lotus grows in water.)
After syllable identification, the correct output is:
ြ ကာ|ပန်ေ်း|သ ်ေ|ေ ရ|ထဲ|တွ ်ေ|ေ ပါက်ေ|သ ်ေ
The N-gram candidates for this sentence are listed in Table 3.

Table 3. N-gram word segmentation for input sentence
N-gram (N=1,2,3,4,5): Word segmentation
Unigram: ြ ကာ|ပန်ေ်း|သ ်ေ|ေ ရ|ထဲ|တွ ်ေ|ေ ပါက်ေ|သ ်ေ
Bigrams: ြ ကာပန်ေ်း၊ပန်ေ်းသ ်ေ၊သည ်ေရ၊ည ရထဲ၊ထဲတွ ်ေ၊တွည ်ေပါက်ေ၊ည ပါက်ေသ ်ေ
Trigrams: ြ ကာပန်ေ်းသ ်ေ၊ပန်ေ်းသည ်ေရ၊သည ်ေရထဲ၊ည ရထဲတွ ်ေ၊ထဲတွည ်ေပါက်ေ၊တွည ်ေပါက်ေသ ်ေ
4-grams: ြ ကာပန်ေ်းသည ်ေရ၊ပန်ေ်းသည ်ေရထဲ၊သည ်ေရထဲတွ ်ေ၊ည ရထဲတွည ်ေပါက်ေ၊ထဲတွည ်ေပါက်ေသ ်ေ
5-grams: ြ ကာပန်ေ်းသည ်ေရထဲ၊ပန်ေ်းသည ်ေရထဲတွ ်ေ၊သည ်ေရထဲတွည ်ေပါက်ေ၊ည ရထဲတွည ်ေပါက်ေသ ်ေ

A common strategy for performing word segmentation and POS tagging simultaneously is to use N-grams, starting with 5-grams: the system scans the input sentence from left to right and retrieves each candidate word, together with all its possible tags and their probabilities, from the emission probability file. If a 5-gram is not contained in the emission probability file, the system falls back to 4-grams, trigrams, bigrams and finally unigrams. The word segmentation of the input sentence according to the longest N-gram technique is:
ြ ကာပန်ေ်း၊သ ်ေ၊ေ ရ၊ထဲတွ ်ေ၊ေ ပါက်ေ၊သ ်ေ
Word probabilities and language model probabilities are calculated using relative frequency counts. If a word has more than one possible POS tag, the system selects the tag with the highest word probability, as described in Table 4.
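The training and decoding steps of Sections 4.2 and 4.3 can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the romanized syllables (`kya`, `pan`, and so on), the function names `train` and `segment_and_tag`, and the `UNK` tag are all hypothetical. It assumes that the "word probability" used for tag selection is the relative frequency of each tag for a word, which is what the values in Table 4 suggest. The transition table is estimated but not used by the greedy decoder shown here.

```python
from collections import Counter, defaultdict

# Toy tagged corpus with hypothetical romanized syllables (not real corpus data).
# Each sentence is a list of (word, tag); a word is a tuple of syllables.
TAGGED = [
    [(("kya", "pan"), "NN"), (("thi",), "PPM"), (("yay",), "NN"),
     (("hte", "twin"), "PPM"), (("pauk",), "V"), (("thi",), "PPM")],
    [(("yay",), "NN"), (("thi",), "Part"), (("pauk",), "V"), (("thi",), "PPM")],
]


def normalize(table):
    """Turn nested counts into relative frequencies."""
    return {key: {t: c / sum(cnt.values()) for t, c in cnt.items()}
            for key, cnt in table.items()}


def train(tagged_sentences):
    """Estimate P(t_i | t_{i-1}) and per-word tag probabilities by counting."""
    trans_counts = defaultdict(Counter)     # previous tag -> next tag counts
    word_tag_counts = defaultdict(Counter)  # word -> tag counts
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag (an assumption of this sketch)
        for word, tag in sentence:
            trans_counts[prev][tag] += 1
            word_tag_counts[word][tag] += 1
            prev = tag
    return normalize(trans_counts), normalize(word_tag_counts)


def segment_and_tag(syllables, word_tags, max_n=5):
    """Greedy left-to-right longest match, backing off from max_n-grams to unigrams."""
    output, i = [], 0
    while i < len(syllables):
        for n in range(min(max_n, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if candidate in word_tags:
                tags = word_tags[candidate]
                best = max(tags, key=tags.get)  # tag with the highest probability
                output.append((candidate, best))
                i += n
                break
        else:
            # No known word starts here: emit one syllable as an unknown word.
            output.append(((syllables[i],), "UNK"))
            i += 1
    return output


transitions, word_tags = train(TAGGED)
result = segment_and_tag(
    ["kya", "pan", "thi", "yay", "hte", "twin", "pauk", "thi"], word_tags)
for word, tag in result:
    print("".join(word), tag)
```

A full HMM decoder would combine transition and emission probabilities, for example with the Viterbi algorithm; the greedy selection above only mirrors the simpler per-word choice illustrated in Table 4.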
Table 4. All possible words, tags and probabilities
Word segmentation | POS | Language model probability | Selected POS
ြ ကာပန်ေ်း (Lotus) | NN | 1 | NN
သ ်ေ (null) | PPM | 0.4 | PPM
 | Part | 0.3 |
 | PN | 0.2 |
 | Adj | 0.1 |
ည ရ (water) | NN | 0.6 | NN
 | V | 0.2 |
 | Part | 0.2 |
ထဲတွ ်ေ (in) | PPM | 1 | PPM
ည ပါက်ေ(grow) | Part | 0.2 | V
 | V | 0.7 |
 | NN | 0.1 |
သ ်ေ(null) | PPM | 0.4 | PPM
 | Part | 0.3 |
 | PN | 0.2 |
 | Adj | 0.1 |

4.4. Morphological rules approach
The internal structures of words are defined by using morphological rules [11]. These rules consist of three parts: prefix (အ), stem and suffix (မ ်း). The common syntax is as follows:
prefix + stem + suffix → POS tag
In this syntax, a word sometimes contains both a prefix and a suffix; in other cases, either the prefix or the suffix is the empty string. There are three types of morphological rules for the Myanmar language: inflectional, derivational and compounding rules. In this system, 68 morphological rules are defined [25] and used. The rules are drawn from Myanmar grammar books [12-16]. The use of inflection, derivation and compounding is described in Section 3.

5. EVALUATION
To evaluate the test results for POS tagging, the system uses Recall, Precision and F-score, defined as follows:

$$\text{Recall},\ R = \frac{\text{Number of correct POS tags assigned by the system}}{\text{Number of words in the test set}}$$

$$\text{Precision},\ P = \frac{\text{Number of correct POS tags assigned by the system}}{\text{Number of POS tags assigned by the system}}$$

$$\text{F-score},\ F = \frac{2PR}{P + R}$$

5.1. Experimental setup
For testing the proposed model, we prepared two corpora of different sizes, as shown in Table 5. We collected 500 new sentences for open testing. In our experiments, we compare three configurations in Table 6: separate word segmentation and POS tagging using HMM, joint word segmentation and POS tagging using HMM, and joint word segmentation and POS tagging using HMM with morphological rules.
For comparison, we used the Bigram Part-of-Speech Tagger for Myanmar Language [17] as the baseline system. The proposed system and the baseline system used the same training corpus and test data.

Table 5. Statistics of the dataset
Data | No. of sentences | No. of words
Corpus 1 | 29680 | 547969
Corpus 2 | 39716 | 690258
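The Recall, Precision and F-score definitions of Section 5 can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation script: words are represented as hypothetical `(start, end, tag)` syllable spans, and a tag counts as correct only when both the word boundaries and the tag match the gold annotation. Because a joint system may output a different number of words than the gold standard, precision and recall can differ.

```python
def precision_recall_f(gold, predicted):
    """Compute P, R and F over (start, end, tag) spans.

    gold: spans from the manually annotated test set.
    predicted: spans produced by the system.
    """
    gold_set, pred_set = set(gold), set(predicted)
    correct = len(gold_set & pred_set)  # same boundaries and same tag
    recall = correct / len(gold_set) if gold_set else 0.0
    precision = correct / len(pred_set) if pred_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Hypothetical example: 6 gold words; the system splits one word in two,
# producing 7 words of which 5 match the gold annotation exactly.
gold = [(0, 2, "NN"), (2, 3, "PPM"), (3, 4, "NN"),
        (4, 6, "PPM"), (6, 7, "V"), (7, 8, "PPM")]
predicted = [(0, 2, "NN"), (2, 3, "PPM"), (3, 4, "NN"),
             (4, 5, "NN"), (5, 6, "PPM"), (6, 7, "V"), (7, 8, "PPM")]

p, r, f = precision_recall_f(gold, predicted)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

In this toy case the single segmentation error lowers both scores, since the two wrongly split words count against precision and the missed gold word counts against recall.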
Table 6. Accuracy of system on different test cases using HMM and morphological rules
Corpus size (sentences) | Method | Precision | Recall | F-score
29680 | Separate word segmentation and POS tag | 68% | 67% | 67%
29680 | Joint word segmentation and POS tag | 78% | 76% | 77%
29680 | Joint word segmentation and POS tag + morphological rules | 90% | 88% | 89%
39716 | Separate word segmentation and POS tag | 77% | 75% | 76%
39716 | Joint word segmentation and POS tag | 85% | 83% | 84%
39716 | Joint word segmentation and POS tag + morphological rules | 94% | 92% | 93%

5.2. Results and discussion
Table 6 shows the experimental results for Myanmar word segmentation and POS tagging with different training data sizes. According to the table, the proposed technique improves on the comparison baseline. Accuracy increases both when the number of training sentences is increased and when morphological rules are used: on the 39716-sentence corpus, the F-score rises from 76% for separate segmentation and tagging to 84% for the joint model and to 93% with morphological rules. The accuracy of the tagger is assessed using test data that contains different kinds of words. Test words can be classified as known, unknown or ambiguous words for the tagger. “Known words” are words contained in the training corpus, and “unknown words” are words not contained in the training corpus. “Ambiguous words” are known words that are tagged wrongly because of segmentation errors; for these words the system must decide which tag is correct. In the proposed system, most unknown words are proper nouns (names of persons and locations), and different positions of Particles and Postpositional Markers in the segmentation can cause ambiguity in POS tagging. No training data can cover all proper nouns. Ambiguous and unknown words decrease the performance of the tagger. Morphological rules are used to resolve the ambiguity of ambiguous words.
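How a `prefix + stem + suffix → POS tag` rule from Section 4.4 might assign a tag to a word that is missing from the training data can be sketched as below. The 68 rules used in the paper are not listed, so everything here is hypothetical: the romanized affixes (`a`, `myar`, `khe`, `chin`), the small lexicon and the function name `tag_by_rules` are illustrative stand-ins for the inflection and derivation patterns described in Section 3.

```python
# Each rule is (prefix, suffix, required stem tag, resulting tag).
# Hypothetical romanized affixes, loosely following Section 3:
RULES = [
    ("", "myar", "NN", "NN"),  # plural suffix: noun stays a noun (inflection)
    ("", "khe", "V", "V"),     # past-tense suffix: verb stays a verb (inflection)
    ("a", "", "V", "NN"),      # prefix + verb -> noun (derivation)
    ("", "chin", "V", "NN"),   # verb + nominalizing suffix -> noun (derivation)
]

# Hypothetical lexicon of known stems; "amay" stands for a word that starts
# with "a" but must not be split.
LEXICON = {"kyaungthar": "NN", "thwar": "V", "lote": "V",
           "pyay": "V", "amay": "NN"}


def tag_by_rules(word, lexicon=LEXICON, rules=RULES):
    """Return a POS tag for word, or None if no rule applies."""
    if word in lexicon:  # known words are never split
        return lexicon[word]
    for prefix, suffix, stem_tag, result_tag in rules:
        if not (word.startswith(prefix) and word.endswith(suffix)):
            continue
        if len(word) <= len(prefix) + len(suffix):
            continue  # nothing left to act as the stem
        stem = word[len(prefix):len(word) - len(suffix)]
        if lexicon.get(stem) == stem_tag:
            return result_tag
    return None


for w in ["kyaungtharmyar", "thwarkhe", "alote", "pyaychin", "amay", "xyz"]:
    print(w, tag_by_rules(w))
```

Checking the lexicon first reflects the observation in Section 3.2 that some words beginning with the prefix-like syllable cannot be split.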
By using morphological rules, the system reduced ambiguity in Particles and Postpositional Markers.

6. CONCLUSION
This paper presents joint word segmentation and POS tagging for Myanmar using an HMM and morphological rules. In our experiments, we compared separate word segmentation and POS tagging with our proposed joint word segmentation and POS tagging using HMM, and found a significant improvement for joint word segmentation and POS tagging using HMM with morphological rules. We also described the distribution of words in the corpus. Unknown words still remain in our experiments. Future work will improve the accuracy of word segmentation and POS tagging, which also requires a larger training corpus. With a larger training corpus and morphological rules, POS tag assignment will become more accurate, and the number of unknown words, incorrect tags and ambiguous words will be reduced. The paper has shown that word segmentation and POS tagging for Myanmar can be improved by using a larger training corpus and incorporating morphological analysis of the Myanmar language.

REFERENCES
[1] T. Mikolov, A. Deoras, D. Povey, L. Burget, J. H. Cernocky, "Strategies for training large scale neural network language models," IEEE Automatic Speech Recognition and Understanding Workshop, pp. 196-201, 2011.
[2] A. J. P. M. P. Jayaweera, N. G. J. Dias, "Hidden markov model based part of speech tagger for Sinhala language," International Journal on Natural Language Computing (IJNLC), vol. 3(3), 2014.
[3] S. Y. Hala, S. H. Virani, "Improve accuracy of parts of speech tagger for Gujarati language," International Journal of Advance Engineering and Research Development, vol. 2(5), 2015.
[4] P. M. Bhatt, A. Ganatra, "Analyzing & enhancing accuracy of part of speech tagger with the usage of mixed approaches for Gujarati," International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, vol. 8(1), 2019.
[5] K. Mohnot, N. Bansal, S. P. Singh, A. Kumar, "Hybrid approach for part of speech tagger for Hindi language," International Journal of Computer Technology and Electronics Engineering (IJCTEE), vol. 4(1), 2014.
[6] S. AlGahtani, J. McNaught, "Joint Arabic segmentation and part-of-speech tagging," Proceedings of the Second Workshop on Arabic Natural Language Processing, Association for Computational Linguistics, pp. 108-117, 2015.
[7] A. F. Wicaksono, A. Purwarianti, "HMM based part-of-speech tagger for Bahasa Indonesia," Proceedings of 4th International MALINDO (Malay and Indonesian Language) Workshop, 2010.
[8] S.-H. Na, "Conditional random fields for Korean morpheme segmentation and POS tagging," ACM Transactions on Asian Language Information Processing, vol. 14(3), 2015.
[9] Z. H. Pozveh, A. Monadjemi, A. Ahmadi, "Persian texts part of speech tagging using artificial neural networks," Journal of Computing and Security, vol. 3(4), pp. 233-241, 2016.
[10] C. Lyu, Y. Zhang, D. Ji, "Joint word segmentation, POS-tagging and syntactic chunking," Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), 2016.
[11] H. Fadaei, M. Shamsfard, "Persian POS tagging using probabilistic morphological analysis," Int. J. Computer Applications in Technology, vol. 38(4), pp. 264-273, 2010.
[12] P. Hopple, "The structure of nominalization in Burmese," Ph.D. Dissertation, University of Texas, Arlington, 2003.
[13] Department of the Myanmar Language Commission, "Myanmar grammar," Ministry of Education, Myanmar, 2006.
[14] "Myanmar-English dictionary," Ministry of Education, Myanmar.
[15] Grammar. Burmese language. http://en.wikipedia.org/wiki/Burmese_Language
[16] Department of the Myanmar Language Commission, "Myanmar grammar," Ministry of Education, Myanmar, 2016.
[17] P. H. Myint, T. M. Htwe, N. L. Thein, "Bigram part-of-speech tagger for Myanmar language," 2011 International Conference on Information Communication and Management, IPCSIT, vol. 16, 2011.
[18] W. P. Pa, N. L. Thein, "Myanmar word segmentation using hybrid approach," Proceedings of 6th International Conference on Computer Applications, 2008.
[19] W. P. Pa, Y. K. Thu, A. Finch, E. Sumita, "Word boundary identification for Myanmar text using conditional random fields," Genetic and Evolutionary Computing, Springer International Publishing Switzerland, p. 447, 2016.
[20] Y. Zhang, S. Clark, "Joint word segmentation and POS tagging using a single perceptron," Proceedings of ACL-08: HLT, pp. 888-896, 2008.
[21] T. M. Htwe, D. L. Cing, "A neural probabilistic language model for joint morphological segmentation and POS tagging," The Seventh International Conference on Science and Engineering (ICSE), pp. 9-10, 2016.
[22] T. T. Zin, K. M. Soe, N. L. Thein, "Myanmar phrases translation model with morphological analysis for statistical Myanmar to English translation system," 25th Pacific Asia Conference on Language, Information and Computation, pp. 130-139, 2011.
[23] D. Jurafsky, J. H. Martin, "Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition," Copyright 2006, Draft of June 25, 2007.
[24] https://github.com/ye-kyaw-thu/sylbreak
[25] D. L. Cing, K. M. Soe, "Joint word segmentation and part-of-speech (POS) tagging for Myanmar language," 17th International Conference on Computer Application, 2019.

BIOGRAPHIES OF AUTHORS
Dim Lam Cing received an M.C.Sc in Computer Science from Computer University (Kalay) in 2010. She is a PhD candidate at the University of Computer Studies, Yangon (UCSY). Her research interests include Natural Language Processing and Machine Learning.
Khin Mar Soe received M.C.Sc and Ph.D. degrees in Information Technology from the University of Computer Studies, Yangon (UCSY) in 2000 and 2005 respectively. She is currently a full professor at the Natural Language Processing (NLP) Lab in UCSY. Her main research interests include Natural Language Processing and Artificial Intelligence.