International Journal of Trend in Scientific Research and Development (IJTSRD)
Volume 3 Issue 5, August 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470
@ IJTSRD | Unique Paper ID – IJTSRD26520 | Volume – 3 | Issue – 5 | July - August 2019 Page 911
Morpheme Based Myanmar Word Segmenter
Sin Thi Yar Myint, Hanni Htun, Myat Myo Nwe Wai
Faculty of Computer Science, Myanmar Institute of Information Technology, Mandalay, Myanmar
How to cite this paper: Sin Thi Yar Myint | Hanni Htun | Myat Myo Nwe Wai, "Morpheme Based Myanmar Word Segmenter", published in International Journal of Trend in Scientific Research and Development (IJTSRD), ISSN: 2456-6470, Volume-3 | Issue-5, August 2019, pp. 911-914, https://doi.org/10.31142/ijtsrd26520
Copyright © 2019 by author(s) and International Journal of Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0) (http://creativecommons.org/licenses/by/4.0)
ABSTRACT
Myanmar script has no fixed delimiters between words or syllables, so
obtaining meaningful and correctly segmented words from text is a
challenging task. This paper proposes a morpheme-based Myanmar word
tokenizer that combines rule-based syllable breaking with dictionary-lookup
syllable merging using the longest string matching approach. The
proposed approach is tested with a monolingual dictionary that contains useful
information for word segmentation. The dictionary contains more than 32,581
words, including headwords, stop words and essential words, in the Myanmar3 font.
These words were collected from the Myanmar and Essential Words dictionaries.
According to the experimental results, the approach provides promising
segmentation accuracy on Myanmar text.
KEYWORDS: Syllable breaking; Morpheme; style; styling
INTRODUCTION
Word segmentation is a prerequisite for Myanmar language processing tasks such
as part-of-speech (POS) tagging, search engines, translation, information
retrieval, word sense disambiguation and many other Natural Language
Processing (NLP) activities. Currently, there is no Myanmar word
segmentation approach based on the morphemes of the words in Myanmar text
using a dictionary approach. A morpheme represents the root of a specific word.
Given the nature of the Myanmar language, morphemes play a vital role in the
machine translation of Myanmar text. By exploiting morpheme-based words,
translation of Myanmar text becomes easier.
For the Myanmar language, there are no statistical corpus
resources or training data with which to test a word segmentation
algorithm.
In this paper, we propose a word segmentation approach
that does not rely on statistical methods or a corpus.
This approach is useful when no linguistic resources
such as corpora are available for the Myanmar language.
We simply build a monolingual lexicon of morpheme-based
Myanmar words collected from the Myanmar and
Essential Words dictionaries. Syllable tokenization is
performed as a preprocessing step: syllable segmentation of
the input sentence is done using rules based on the syllable
structure of Myanmar script. To determine the word boundaries
among the segmented syllables, the proposed approach applies
forward longest matching against the dictionary. The system
segments the input sentence into morpheme-based Myanmar words
by comparing the input string, character by character,
with the monolingual dictionary. The approach is very simple,
but it proved to be practical when applicable linguistic
resources are not available.
RELATED WORK
In this section, previous work on Myanmar word
segmentation is reviewed. Win Pa Pa and NiLar Thein [9]
investigated disambiguation in Myanmar word
segmentation. The authors addressed word ambiguity
by combining Forward Maximum Matching,
Backward Maximum Matching and Joint Entropy, and then
tried to resolve the remaining ambiguity using a statistical
approach with a corpus. They reported a word segmentation
precision of 92% and a recall of 94%.
Tun Thura Thet, Jin-Cheon Na and Wunna Ko Ko [7]
proposed word segmentation for the Myanmar language.
They applied rule-based syllable segmentation and
dictionary-based statistical syllable merging to handle word
ambiguity, combined with Mutual Information computed
as collocation strength from a corpus. They
reported a precision of 98.94%, a recall of 99.05% and an
F-measure of 98.99%.
“Myanmar Word Segmentation using Syllable level Longest
Matching” was proposed by Hla Hla Htay and Kavi Narayana
Murthy [10]. They used a word list of more than 800,000 words,
including inflected forms. The authors applied stop word
removal first and used an N-gram approach for syllable
matching. They achieved a recall of 98.81%, a precision of 99.11%
and an F-measure of 98.95%, tested at the sentence level on
sentences collected from web documents, grammar books and stories.
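The F-measure figures quoted above can be cross-checked from the reported precision and recall. The following is a minimal sketch, assuming the cited works use the conventional balanced F-measure (the harmonic mean of precision and recall); the function name `f_measure` is illustrative and does not come from any of the cited papers.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as quoted in this section (in percent).
print(round(f_measure(98.94, 99.05), 2))
print(round(f_measure(99.11, 98.81), 2))
```

Under this assumption the first pair reproduces the reported 98.99, and the second pair gives a value within 0.01 of the reported 98.95.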
MYANMAR LANGUAGE
The Myanmar language is the official language of the Union of the
Republic of Myanmar and is more than one thousand years
old. Texts in the Myanmar language use the Myanmar script,
which is descended from the Brahmi script of ancient South
India.
A. Myanmar Script
A Myanmar text is a string of characters without explicit
word boundary markup, written in sequence from left to
right.
Myanmar script contains 33 consonants, 8 vowels (free-standing
and attached), 2 diacritics, 11 medials, a vowel killer
or ASAT, 10 digits and 2 punctuation marks [4].
B. Syllable Breaking in Myanmar Text
Syllable breaking is the process of identifying syllable
boundaries in a text. The syllable is the smallest unit of
language. In Myanmar text, a syllable can start with a
consonant, which may be followed by a medial consonant. After the
vowel, a syllable may end with nasalization of the vowel or
an unreleased glottal stop. At the end of a syllable, a final
consonant usually has an ‘asat’ sign above it to show that
there is no inherent vowel. In multisyllabic words derived
from an Indian language such as Pali, where two consonants
occur internally with no intervening vowel, the consonants
tend to be stacked vertically, and the asat sign is not used.
There is a set of Myanmar numerals, which are used just
like Latin digits [2]. First, syllable segmentation is done
using rules based on the syllable structure of the Myanmar
script. The syllable breaking rules cover combining
consonant and vowel, devowelizing and kinzi, contractions,
syllable chaining, distinct letters, single characters and loan
words. In the syllable breaking stage, the proposed system
determines a syllable boundary by comparing pairs of
characters to decide whether a break is possible
between them. The accuracy results of
segmentation are described in Table I and Table II.
1. Combining consonant and vowel
2. Devowelizing and Kinzi
3. Syllable Chaining
4. Single Character
5. Contraction
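The pairwise boundary test described above can be sketched as follows. This is a simplified illustration, not the authors' full rule set: it assumes Unicode-encoded input, treats U+1000 to U+1021 as consonants, and only covers the consonant, asat, stacking (virama) and kinzi cases. Contractions and loan words are not handled, and all names (`break_syllables`, `is_break_before`) are hypothetical.

```python
# Assumed Unicode code points for Myanmar script (not taken from the paper).
CONSONANTS = {chr(c) for c in range(0x1000, 0x1022)}   # U+1000..U+1021
ASAT = "\u103a"        # vowel killer
VIRAMA = "\u1039"      # invisible sign used for stacked consonants
DOT_BELOW = "\u1037"   # tone mark that may precede asat in stored order
# Independent vowels, great SA, digits and punctuation start their own unit.
STANDALONE = (
    {chr(c) for c in range(0x1023, 0x102B)}
    | {"\u103f"}
    | {chr(c) for c in range(0x1040, 0x1050)}
)


def is_break_before(text, i):
    """Decide whether a syllable boundary falls between text[i-1] and text[i]."""
    ch = text[i]
    if ch in CONSONANTS:
        if text[i - 1] == VIRAMA:          # lower half of a stacked pair
            return False
        nxt = text[i + 1:i + 3]
        if nxt[:1] in (ASAT, VIRAMA):      # final consonant or upper stacked one
            return False
        if nxt == DOT_BELOW + ASAT:        # final consonant with tone mark first
            return False
        return True
    # Non-Myanmar characters (spaces, Latin letters) are kept as separate units.
    return ch in STANDALONE or not ("\u1000" <= ch <= "\u109f")


def break_syllables(text):
    """Split text into syllable-like units using the pairwise test."""
    syllables, start = [], 0
    for i in range(1, len(text)):
        if is_break_before(text, i):
            syllables.append(text[start:i])
            start = i
    if text:
        syllables.append(text[start:])
    return syllables


# "Myanmar-sa" (Myanmar writing): expected to split into three syllables.
print(break_syllables("\u1019\u103c\u1014\u103a\u1019\u102c\u1005\u102c"))
```

A kinzi sequence (consonant, asat, virama, consonant) stays inside one unit under these rules, because the consonant after the virama is never treated as a new syllable start.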
MYANMAR WORD SEGMENTATION
Word segmentation is the process of parsing concatenated
text (i.e. text that contains no spaces or other word
separators) to infer where word breaks exist. Myanmar
script does not require white space between words or
syllables. Modern writing style places spaces after each
clause to enhance readability [5]. Generally, a word
is a basic unit of language that carries meaning and can be
spoken or written. A Myanmar word can consist of one or
more morphemes that are linked more or less tightly
together: it consists of a root or stem and
zero or more affixes.
Myanmar words can be combined to form
phrases, clauses and sentences.
In addition, a word consisting of two or more stems joined
together is known as a compound word [3].
The next step is to merge the segmented
syllables into meaningful words.
Syllable merging is done using the longest matching
approach against the lexicon. The algorithm starts
from the beginning of a sentence, finds the longest
matching word in the lexicon, and
repeats the process until it reaches the end of the
sentence. The system segments the input sentence into
morpheme-based words by comparing the input string,
character by character, with the monolingual
dictionary. The process of word segmentation is shown in
Figure 1. The system is tested on simple and
complex sentence types of Myanmar text containing one or
more clauses and phrases. The accuracy results are
given in Table II. Syllable merging in the proposed system
has some limitations. Because of the
longest matching approach, it cannot give the correct
segmentation for all words in an input sentence;
segmentation conflicts can arise for some words in a sentence.
With the longest matching approach, this sentence is
segmented to the wrong word into
Fig. 1. Process of Word Segmentation
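The forward longest matching step, and the kind of conflict it can produce, can be illustrated with a short sketch. It assumes the input has already been split into syllables and that the lexicon is a plain set of word strings. The function name `segment_words`, the `max_syllables` limit and the single-syllable fallback for unknown material are assumptions made for illustration, and Latin letters stand in for Myanmar syllables.

```python
def segment_words(syllables, lexicon, max_syllables=6):
    """Greedy forward longest matching over a list of syllables.

    At each position the longest run of syllables found in the lexicon is
    taken as a word; if nothing matches, the single syllable is emitted.
    """
    words = []
    i = 0
    while i < len(syllables):
        longest = min(len(syllables), i + max_syllables)
        for j in range(longest, i, -1):          # try the longest candidate first
            candidate = "".join(syllables[i:j])
            if candidate in lexicon or j == i + 1:
                words.append(candidate)
                i = j
                break
    return words


# Toy data: each letter plays the role of one syllable.
print(segment_words(list("abcd"), {"abc", "ab", "d"}))   # ['abc', 'd']

# A conflict: if the intended reading is "a" + "bc", the greedy match on
# "ab" hides it and leaves "c" stranded.
print(segment_words(list("abc"), {"ab", "bc", "a"}))     # ['ab', 'c']
```

The second call shows why a purely greedy strategy cannot be correct for every sentence: the first match is chosen without looking at what it leaves behind.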
The structure of a sentence in the Myanmar language may be
simple, compound or complex. Generally, a sentence is
subdivided into phrases, a phrase into words, and a word
into syllables. A syllable is the
smallest unit of the language [3]. Either a simple
or a compound sentence can contain one or more
phrases and one or more clauses. A group of words that
makes sense, but not complete sense, is called a phrase. It is
a group of related words without a subject and a verb.
Examples: in the east, on a wall, with blue trimming, on the
bridge, with red hair [2]. A clause is a group of words that
contains both a subject and a predicate but cannot always be
considered a full grammatical sentence. Clauses can be
either independent clauses (also called main clauses) or
dependent clauses (also called subordinate clauses) [2]. Like
an English sentence, a Myanmar sentence is composed of
one or more clauses and phrases.
1. Examples of adding adjective and adverb phrases in a simple sentence
2. Examples of adding phrases in a simple sentence
3. Examples of adding time phrases in a simple sentence
4. Examples of adding accusative phrases in a simple sentence
5. Examples of a compound sentence with a dependent clause and an independent clause
6. Examples of a compound sentence with three clauses
7. Examples of a sentence with a hidden object
8. Examples of a sentence with the positions of subject and reason changed
EXPERIMENT RESULTS
Table I and Table II show the experimental results of the word
segmentation system for syllable breaking and for syllable
merging, respectively. Syllable breaking is 100% accurate
on the test data.
TABLE I. Accuracy Results on Syllable Segmentation
Syllable Type     NCseg   NTseg   Accuracy
Unique Syllable   1903    1903    100%
Tokens            7069    7069    100%
Sentence          1226    1226    100%
Accuracy = NCseg / NTseg * 100
NCseg = the number of syllables correctly segmented by the program on the input
NTseg = the total number of segmented syllables verified manually
TABLE II. Accuracy Results on Word Segmentation
Syllable Type     NCmg    NTmg    Accuracy
Unique Syllable   6769    7069    95.77%
Tokens            926     1226    75.53%
Accuracy = NCmg / NTmg * 100
NCmg = the number of syllables correctly merged by the program on the input
NTmg = the total number of merged syllables verified manually
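As a worked check of the accuracy formula, the sketch below recomputes the percentages from the counts in Table I and Table II, reading the Table II rows as 6769 correct out of 7069 and 926 correct out of 1226. The function name `accuracy` and the row labels are illustrative only.

```python
def accuracy(n_correct, n_total):
    """Accuracy = N_correct / N_total * 100, as defined under the tables."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total * 100


# Counts taken from Table I and Table II.
rows = {
    "syllable breaking, tokens": (7069, 7069),
    "syllable merging, 6769 of 7069": (6769, 7069),
    "syllable merging, 926 of 1226": (926, 1226),
}
for name, (n_correct, n_total) in rows.items():
    print(f"{name}: {accuracy(n_correct, n_total):.2f}%")
```

The second row evaluates to 95.76%, marginally below the 95.77% printed in Table II, and the third matches the reported 75.53%.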
The tested dictionary contains 32,581 tokens. The test sentences
cover all sentence types, namely simple, compound and complex,
including complex sentences with one clause, two clauses and three
clauses.
CONCLUSION
This paper has proposed an approach to Myanmar word
segmentation using rule-based syllable breaking and
dictionary-lookup syllable merging. In the syllable
breaking stage, the proposed system determines a syllable
boundary by comparing pairs of characters to decide whether a
break is possible between them. It then merges
the segmented syllables into meaningful words using
dictionary lookup with the longest string matching
algorithm. The proposed system can produce
correct morpheme-based Myanmar words from the input
sentence. It can also segment words in written Myanmar
sentences containing one or more phrases and clauses,
including sentences with one or more dependent and independent
clauses, across simple and compound sentence types.
It can therefore benefit Myanmar-to-English
translation systems and further NLP tasks such as
information retrieval, noun phrase identification, verb
phrase identification, named entity recognition and word sense
disambiguation.
References
[1] C. D. Manning, H. Schütze, “Foundations of Statistical
Natural Language Processing”, The MIT Press,
Cambridge, Massachusetts; London, England, 2000.
[2] https:// Myanmar script notes.htm, https:// what-isclause.html,
https:// what-is-phrase.html.
[3] Myanmar Grammar, First Edition, Myanmar Language
Commission, memorable for 30th anniversary, June
2005.
[4] Myanmar Orthography, Second Edition, Myanmar
Language Commission, June 2003.
[5] M. T. Win et al., “Burmese Phrase Segmentation”,
Proceedings of Conference on Human Language
Technology for Development, Egypt, May 2011.
[6] Lexique Pro_ Myanmar lexicon (Version-2), July, 2011.
[7] T. T. Thet, J. C. Na, W. K. Ko, “Word segmentation for
the Myanmar Language”, Journal of Information
Science, 2007, pp. 1-17.
[8] T. H. Hlaing, “Manually Constructed Context-Free
Grammar For Myanmar Syllable Structure”, Nagaoka
University of Technology, Nagaoka, Japan, 2011.
[9] W. P. Pa, N. L. Thein, “Disambiguation in Myanmar
Word Segmentation”, Proceedings of the Seventh
International Conference on Computer Applications,
Yangon, Myanmar, 2009, pp. 1-4.
[10] H. H. Htay, K. N. Murthy, “Myanmar Word Segmentation
using Syllable Level Longest Matching”, Proceedings of
the IJCNLP-08 Workshop on NLP for Less Privileged
Languages, Hyderabad, India, January 2008.
[11] PyinNya Kyaw, Essential Words Dictionary and
Myanmar Dictionary, First Edition, February 2010,
Yangon.

Morpheme Based Myanmar Word Segmenter

  • 1. International Journal of Trend in Scientific Research and Development (IJTSRD) Volume 3 Issue 5, August 2019 Available Online: www.ijtsrd.com e-ISSN: 2456 – 6470 @ IJTSRD | Unique Paper ID – IJTSRD26520 | Volume – 3 | Issue – 5 | July - August 2019 Page 911 Morpheme Based Myanmar Word Segmenter Sin Thi Yar Myint, Hanni Htun, Myat Myo Nwe Wai Faculty of Computer Science, Myanmar Institute of Information Technology, Mandalay, Myanmar How to cite this paper: Sin Thi Yar Myint | Hanni Htun | Myat Myo Nwe Wai "Morpheme Based Myanmar Word Segmenter" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456- 6470, Volume-3 | Issue-5, August 2019, pp.911-914, https://doi.org/10.31142/ijtsrd26520 Copyright © 2019 by author(s) and International Journalof Trend in Scientific Research and Development Journal. This is an Open Access article distributed under the terms of the Creative CommonsAttribution License (CC BY 4.0) (http://creativecommons.org/licenses/by /4.0) ABSTRACT Myanmar script has no fixed delimitersbetween wordsor syllables. Therefore, to achieve meaningful and correct segmented words from the text is a challenging task. This paper has proposed a morpheme-based Myanmarword tokenizer which combines rule-based syllable breakinganddictionarylookup syllable merging methods with longest string matching approach. The proposed approach is tested on a Monolingual dictionary that contains useful information for the word segmentation. It also contains above 32,581 words including headwords, stop words and essential words with Myanmar3 font. These words are collected from Myanmar and Essential Words dictionaries. According to the experimental results, it can provide the promising segmentation accuracy of Myanmar text. 
KEYWORDS: Syllable breaking; Morpheme; style; styling INTRODUCTION Word segmentation is prerequisite for any Myanmar language processingsuch as part of speech (POS) tagging, search engine, translation, information retrieval, and word sense disambiguation and many more of Natural Language Processing (NLP) activities. Currently, there has no Myanmar word segmentation approach based on the morpheme of the word in Myanmar text using a dictionary approach. Morpheme represents the root of a specific word. According to the Myanmar language nature, a morpheme is a vital role for the machine translation of Myanmar text. By exploiting the power of morpheme word, it can achieve the easy way of translation of Myanmar text. In the Myanmar language, there is no statistical corpus resources and training data to test the word segmentation algorithm for Myanmar language. In this paper, we proposed the word segmentationapproach which is not applied to statistical methods with the corpus. This approach is very useful when there is no linguistic resource such as corpus and copra for Myanmar language. We simply build the monolingual lexicon which is inspired by morpheme Myanmar words collectedfromMyanmarand Essential Words dictionaries. Syllables tokenization is defined as preprocessing. Syllable segmentation is done by using the rules on the syllable structure of Myanmar script for the input sentence. To determine word boundariesofthe segmented syllables, the proposed approach is applied forward longest matching dictionary. This system can segment into morpheme-based Myanmar words from the input sentence of text by comparing one by one character from the input string with the monolingual dictionary. This approach is very simple but it proved that this is a practical approach which is not available the applicable linguistic resources. RELATED WORK In this section, previous works on Myanmar word segmentation are reviewed. 
Win Pa Pa and NiLar Thein experimented Disambiguation in Myanmar Word Segmentation. The authors solved the word ambiguity problems by combining Forward Maximum Matching, Backward Maximum Matching and Joint Entropy. And then, they tried to solve the ambiguity problem using a statistical approach with the corpus.TheauthorsdescribedPrecisionof word segmentation for this approach was 92% and recall is 94%. Tun Thura Thet, Jin-Cheon Na, Wunna Ko Ko, was a proposed word segmentation for the Myanmar language. They applied rule-based syllablesegmentation andalsoused dictionary-based statistical syllable merging, for the word ambiguity. The authors combined with Mutual Information by calculating collocation strength with the corpus. They showed that Precision 98.94%, Recall 99.05%, FMeasure98.99. “Myanmar Word Segmentation using Syllable level Longest Matching” was proved by Hla Hla Htay, Kavi Narayana Murthy. They used word Listabove 800,000 wordsincluding inflected forms. The authors also applied to stop word removal first and also used the Ngram approach for syllable matching. They achieved Recall 98.81%, Precision 99.11%, F_measure 98.95%, also tested on the sentencelevelwhichis collected from web documents, grammar books and stories. MYANMAR LANGUAGE Myanmar language is the official languageof theUnionof the Republic of Myanmar and is more than one thousand years old. Texts in the Myanmar language use the Myanmar script, which is descended from the Brahmi script of ancient South India. A. Myanmar Script A Myanmar text is a string of characters without explicit word boundary markup, written in sequence from left to right. IJTSRD26520
  • 2. International Journal of Trend in Scientific Research and Development (IJTSRD) @ www.ijtsrd.com eISSN: 2456-6470 @ IJTSRD | Unique Paper ID – IJTSRD26520 | Volume – 3 | Issue – 5 | July - August 2019 Page 912 Myanmar script contains 33 consonants, 8 vowels (free- standing and attached, 2 diacritics, 11 medials,a vowelkiller or ASAT, 10 digits and 2 punctuation marks [4]. B. Syllable Breaking in Myanmar Text Syllable breaking is the process of identifying syllable boundaries in a text. The syllable is the smallest unit of language. In Myanmar text, a syllable can start with a consonant may be followed by a medial consonant. After the vowel, a syllable may end with nasalization of the vowel or an unreleased glottal stop. At the end of syallable, a final consonant usually has an ‘asat’ sign above it, to show that there is no inherent vowel. In multisyllabic words derived from an Indian language such as Pali, where two consonants occur internally with no intervening vowel, the consonants tend to be stacked vertically, and the asat sign is not used. There are a set of Myanmar numerals, which are used just like Latin digits [2]. Firstly, syllable segmentation is done by using the rules on the syllable structure of the Myanmar script. Syllable breaking rules are based on combining consonant and vowel, devocalizing and kinzi, contractions, syllable chaining, distinct letter, single character and loan words. In syllable breaking stage, the proposed system determines a syllable boundary by comparing pairs of characters to find whether a break is possible or not between them Moreover, the accuracy results of syllable segmentation are described in Table I and Table II. 1. Combining consonant and vowel 2. Devowelizing and Kinzi Devowelising and Kinzi 3. Syllable Changing 4. Single Character 5. Contraction MYANMAR WORD SEGMENTATION Word segmentation is the process of parsing concatenated text (i.e. 
text that contains no spaces or other word separators) to infer where word breaks exist. Myanmar script does not require white space between words or syllables. Modern writing style places a space after each clause in order to enhance readability [5].

Generally, a word is a basic unit of language that carries meaning and can be spoken or written. A Myanmar word can consist of one or more morphemes that are linked more or less tightly together: a root or stem and zero or more affixes. Myanmar words can in turn be combined to form phrases, clauses and sentences. In addition, a word consisting of two or more stems joined together is known as a compound word [3].

After syllable breaking, the next step is to merge the segmented syllables of the input sentence into meaningful words. Syllable merging is done with the longest matching approach against the lexicon. The algorithm starts from the beginning of a sentence, finds the longest word that matches an entry in the lexicon, and repeats the process until it reaches the end of the sentence. In this way the system segments the input sentence into morpheme-based words by comparing the input string, character by character, with the monolingual dictionary. The process of word segmentation is shown in Figure 1. The system is tested on simple and complex sentence types of Myanmar text, including sentences with one or more clauses and phrases. The accuracy results are given in Table II.

Syllable merging in the proposed system has some limitations. Because it relies on the longest matching approach, it cannot give the correct segmentation for every word in the input sentence. Segmentation conflicts can arise for some words in a sentence, in which case the longest matching approach segments the sentence into the wrong words.

Fig. 1. Process of Word Segmentation
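The syllable merging step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `merge_syllables`, the `max_len` cap on word length and the fallback that emits an unmatched syllable as a single-syllable word are all assumptions introduced here, and the tiny lexicon is hypothetical.

```python
from typing import List, Set


def merge_syllables(syllables: List[str], lexicon: Set[str], max_len: int = 6) -> List[str]:
    """Greedy left-to-right longest matching over a list of syllables.

    max_len (assumed) caps how many syllables a candidate word may span.
    """
    words: List[str] = []
    i = 0
    while i < len(syllables):
        longest = min(max_len, len(syllables) - i)
        # Try the longest candidate first, then shrink one syllable at a time.
        for n in range(longest, 0, -1):
            candidate = "".join(syllables[i:i + n])
            # Assumed fallback: a lone syllable is emitted even if it is
            # not in the lexicon, so the loop always advances.
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical lexicon and pre-broken syllables, for illustration only.
lexicon = {"ကျောင်း", "သား", "ကျောင်းသား"}
print(merge_syllables(["ကျောင်း", "သား"], lexicon))
```

The same sketch also shows the kind of segmentation conflict mentioned above. With placeholder syllables `a, b, c, d` and a lexicon containing `ab`, `abc` and `cd`, the greedy match takes `abc` first and leaves `d` on its own, even though `ab` + `cd` would use only lexicon entries.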
The structure of a sentence in the Myanmar language may be simple, compound or complex. Generally, a sentence is subdivided into phrases, a phrase is subdivided into words, and a word is subdivided into syllables. A syllable is the smallest unit of the language [3]. Either a simple or a compound sentence can contain one or more phrases and one or more clauses.

A group of words that makes sense, but not complete sense, is called a phrase. It is a group of related words without a subject and a verb. Examples: in the east, on a wall, with blue trimming, on the bridge, with red hair [2].

A clause is a group of words that contains both a subject and a predicate but cannot always be considered a full grammatical sentence. Clauses can be either independent clauses (also called main clauses) or dependent clauses (also called subordinate clauses) [2]. Like an English sentence, a Myanmar sentence is composed of one or more clauses and phrases.

1. Examples for adding adjective and adverb phrases in a simple sentence
2. Examples for adding phrases in a simple sentence
3. Examples for adding time phrases in a simple sentence
4. Examples for adding accusative phrases in a simple sentence
5. Examples of a compound sentence with a dependent clause and an independent clause
6. Examples of a compound sentence with three clauses
7. Examples of sentences with a hidden object
8. Examples of sentences changing the position of subject and reason

EXPERIMENT RESULTS
Table I and Table II show the experimental results of the word segmentation system for syllable breaking and syllable merging.
The accuracy of syllable breaking is 100%.

TABLE I. Accuracy Results on Syllable Segmentation
Syllable Type      NCseg   NTseg   Accuracy
Unique Syllable    1903    1903    100%
Tokens             7069    7069    100%
Sentence           1226    1226    100%

Accuracy = NCseg / NTseg * 100
NCseg = the number of syllables correctly segmented by the program on the input
NTseg = the total number of segmented syllables verified manually
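As a quick check of the accuracy formula, the Table I rows can be recomputed in a few lines. This is only an illustrative sketch; the helper name `accuracy` and the dictionary layout are assumptions, and the counts are copied from Table I.

```python
def accuracy(n_correct: int, n_total: int) -> float:
    """Accuracy = NCseg / NTseg * 100, as defined under Table I."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return n_correct / n_total * 100


# (NCseg, NTseg) pairs copied from Table I.
table_i = {
    "Unique Syllable": (1903, 1903),
    "Tokens": (7069, 7069),
    "Sentence": (1226, 1226),
}

for row, (nc, nt) in table_i.items():
    print(f"{row}: {accuracy(nc, nt):.2f}%")
```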
TABLE II. Accuracy Results on Word Segmentation
Syllable Type      NCmg    NTmg    Accuracy
Unique Syllable    6769    7069    95.77%
Tokens             926     1226    75.53%

Accuracy = NCmg / NTmg * 100
NCmg = the number of syllables correctly merged by the program on the input
NTmg = the total number of merged syllables verified manually

The test dictionary contains 32,581 tokens. The test sentences cover all sentence types (simple, compound and complex), including complex sentences with one, two and three clauses.

CONCLUSION
This paper has proposed an approach to Myanmar word segmentation that uses rule-based syllable breaking and dictionary lookup syllable merging. In the syllable breaking stage, the proposed system determines a syllable boundary by comparing pairs of characters to find whether a break is possible between them. It then merges the segmented syllables into meaningful words by using the dictionary lookup approach with the longest string matching algorithm. The proposed system can produce correct morpheme-based Myanmar words from the input sentence. It can also segment the words of written Myanmar sentences that contain one or more phrases and clauses, including dependent and independent clauses, across simple and compound sentence types. It can therefore benefit Myanmar-to-English translation systems and further NLP tasks such as information retrieval, noun phrase identification, verb phrase identification, named entity recognition and word sense disambiguation.

References
[1] C. D. Manning, H.
Schütze, "Foundations of Statistical Natural Language Processing", The MIT Press, Cambridge, Massachusetts; London, England, 2000.
[2] https:// Myanmar script notes.htm, https:// what- isclause.html, https:// what-is-phrase.html.
[3] Myanmar Grammar, First Edition, Myanmar Language Commission, memorable for 30th anniversary, June 2005.
[4] Myanmar Orthography, Second Edition, Myanmar Language Commission, June 2003.
[5] M. T. Win et al., "Burmese Phrase Segmentation", Proceedings of Conference on Human Language Technology for Development, Egypt, May 2011.
[6] Lexique Pro_ Myanmar lexicon (Version-2), July 2011.
[7] T. T. Thet, J. C. Na, W. K. Ko, "Word segmentation for the Myanmar Language", Journal of Information Science, 2007, pp. 1-17.
[8] T. H. Hlaing, "Manually Constructed Context-Free Grammar For Myanmar Syllable Structure", Nagaoka University of Technology, Nagaoka, Japan, 2011.
[9] W. P. Pa, N. L. Thein, "Disambiguation in Myanmar Word Segmentation", Proceedings of the Seventh International Conference on Computer Applications, Yangon, Myanmar, 2009, pp. 1-4.
[10] H. H. Htay, K. N. Murthy, "Myanmar Word Segmentation using Syllable Level Longest Matching", Proceedings of the IJCNLP-08 Workshop on NLP for Less Privileged Languages, Hyderabad, India, January 2008.
[11] PyinNya Kyaw, Essential Words Dictionary and Myanmar Dictionary, First Edition, February 2010, Yangon.