IOSR Journal of VLSI and Signal Processing (IOSR-JVSP)
Volume 5, Issue 6, Ver. II (Nov -Dec. 2015), PP 25-30
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
www.iosrjournals.org
DOI: 10.9790/4200-05622530 www.iosrjournals.org 25 | Page
Automatic Generation of Compound Word Lexicon for Marathi
Speech Synthesis
Sangramsing N. Kayte1, Monica Mundada1, Dr. Charansing N. Kayte2, Dr. Bharti Gawali*
1,3Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad
2Department of Digital and Cyber Forensic, Aurangabad, Maharashtra
Abstract: This research paper addresses the problem of Marathi compound word splitting and its relevance to
developing a good quality phonetizer for Marathi speech synthesis. The constituents of a Marathi compound
word are not separated by a space or hyphen. Hence, most of the existing compound splitting algorithms cannot
be applied to Marathi. We propose a new technique for automatic extraction of compound words from a Marathi
corpus. Preliminary tests conducted on the algorithm have shown a split rate of 92 to 96% of the input
compound words. Of these splits, around 83 to 87% are correct splits. A few modifications have been
suggested, which should improve the accuracy of the splits. Finally, we observe an improvement of 1.6% in
Marathi Grapheme-to-Phoneme conversion as a result of using a phonetized compound word lexicon created
by the above technique.
I. Introduction
Compound words are formed when two or more words are concatenated into a single word.
Compounding is a common and highly productive word formation technique in Marathi. Table 1 gives
examples of a few Marathi compound words.
Compound Word    Constituents
Rāsāyanika       RāsāyanikA
sanyuga          SanyugA
āvāra            Āvāra
Table 1: Examples of Marathi Compound Words
Compound word lexicons play an important role in various domains of language technology, e.g.
speech recognition and machine translation [1] [5]. Identifying compounds is also necessary for assigning
stress patterns correctly, as stress patterns for compounds differ greatly from those of equivalent
single-word units [2].
In a Text-To-Speech (TTS) synthesis system, the Grapheme-to-Phoneme (G2P) module converts normalized
input orthographic text into its underlying phonetic representation. Accurate phonetic transcription is
highly desirable for natural-sounding speech synthesis. Compound word splitting plays a crucial role in
Marathi G2P conversion, specifically for solving the schwa deletion problem [3] [4].
II. Schwa Deletion
Schwa deletion is a unique problem encountered in Marathi G2P conversion. Each consonant in
Devanagari, the script used to write Marathi, is associated with an inherent schwa which is not represented in
orthography. In some cases, this associated schwa is deleted depending on certain morpho-phonological factors
and in others, it is retained. The written text, however, does not provide a direct clue about the deletion or
retention of the schwa, thereby making it a challenging problem to address. The Marathi word aḍacaṇa “अडचण”,
for example, is represented in orthography using the vowel character a and the consonantal characters for ḍ, c
and ṇ. The schwas (a) after the consonants are inserted by the speaker while speaking out the word. Vowels
other than schwa are explicitly represented in orthography.
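The inherent schwa can be made concrete with a small sketch. The code below is a toy romanizer, not part of the system described in this paper: the character tables are a tiny illustrative assumption, and the function name `naive_romanize` is hypothetical. It writes out a schwa after every consonant that is not followed by a vowel sign or a virama, which is exactly the naive reading that a schwa deletion step would later have to correct.

```python
# Toy romanizer illustrating the inherent schwa. The tables below cover only
# a handful of characters and are an illustrative assumption.
CONSONANTS = {"ड": "ḍ", "च": "c", "ण": "ṇ", "क": "k", "म": "m"}
INDEPENDENT_VOWELS = {"अ": "a"}
VOWEL_SIGNS = {"ा": "ā", "ि": "i"}
VIRAMA = "्"  # explicitly suppresses the inherent schwa


def naive_romanize(word):
    """Romanize a word, writing the inherent schwa after every bare consonant."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # No vowel sign and no virama follows: the schwa is implicit.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            raise ValueError(f"character not in the toy table: {ch!r}")
    return "".join(out)


print(naive_romanize("अडचण"))  # every consonant receives a schwa
```

Which of these written-out schwas survive in pronunciation is not decided by the script, which is why the deletion step needs additional information such as morpheme boundaries.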
A set of rules for schwa deletion in Marathi has been reported. The rules are based on the morpheme
boundaries present in words. Word-internal morpheme boundaries can be detected using a morphological
analyzer. Unfortunately, high quality Marathi morphological analyzers are currently not available, and
their unavailability restricts the applicability of these rules.
Another set of rules was implemented in the Dhvani speech synthesis system [6] [7]. These rules work
well for simple words but fail in the case of compound words. For example, the Marathi compound word
RāsāyanikA “lower house of parliament” is obtained by joining two words: Rasa “people” and
yanikA “gathering”. In orthography, the word is represented using consonant and vowel forms of r, a, s, a, ya, ni,
kA. Applying the schwa deletion rules to this word would produce [r a s a y a n i k A], which is incorrect.
The correct phonetic transcription [r a s a y a n i k A] is obtained if the two words are analysed separately.
Hence, a lexicon of compound words along with their correct phonetic representation is very important for
accurate phonetic conversion of input text.
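The role of such a lexicon can be sketched as follows. This is a minimal illustration under stated assumptions, not the G2P converter used in this work: `toy_rule_g2p` is a hypothetical stand-in that only deletes a word-final schwa (written `a`), and the lexicon is assumed to map a romanized compound to its constituents. The point is only that applying the same rules per constituent can give a different result from applying them to the whole word.

```python
def toy_rule_g2p(word):
    """Hypothetical stand-in rule set: delete a word-final schwa ('a')."""
    return word[:-1] if word.endswith("a") else word


def phonetize(word, compound_lexicon, g2p=toy_rule_g2p):
    """Apply the rules per constituent when the word is a known compound."""
    parts = compound_lexicon.get(word, (word,))
    return "".join(g2p(part) for part in parts)


# Assumed lexicon entry, using the romanized example from the text.
lexicon = {"rasayanika": ("rasa", "yanika")}

print(phonetize("rasayanika", {}))       # rules see one long word
print(phonetize("rasayanika", lexicon))  # rules see each constituent
```

With the toy rule, the whole-word pass only touches the final schwa, while the per-constituent pass also treats the schwa at the internal word boundary.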
III. Previous Work
A corpus-driven method for compound splitting using a parallel corpus is reported in [6] [8]. Compound
splitting and recombination for reducing out-of-vocabulary (OOV) words in large vocabulary speech
recognition systems is reported in [1][9][10]. An approach to learning compound splitting rules from
monolingual and parallel corpora, and its impact on statistical machine translation systems, is reported in [11].
Statistical compound extraction techniques are reported in [6]. These methods require the constituents
of a compound word to be space separated. Such methods are not applicable to Marathi compounds, since the
constituents of a Marathi compound are not separated by spaces. To overcome this problem, a compound
splitting algorithm has been developed [14][15][16].
IV. Compound Splitting Algorithm
The compound extraction algorithm takes a text corpus as input and generates a lexicon in which each
compound is split into its constituent parts. The algorithm starts with the assumption that independently
occurring words are valid atomic words.
A trie-like structure is used to store and efficiently match the words. Without loss of generality,
the following description considers a compound word to be made up of two constituents. A potential compound
word is detected if the currently processed word is part of a word already present in the trie, or if a word in
the trie is a substring of the current word. For compounds with more than two components, the algorithm can be
applied iteratively to generate all constituent components. For each word (W), there are several possibilities:
Case 1: W is not present in the trie and is also not a constituent of any of the words currently present in the trie.
Case 2: W is already present in the trie as an independent word.
Case 3: W is the initial constituent of a compound word currently present in the trie as an independent
word.
Case 4: W is the second constituent of a compound word currently present in the trie as an independent
word.
In Case 1, the algorithm inserts W into the trie. In Case 2, no change is needed since W is already present in
the trie as an independent word. In Case 3, the algorithm generates the second constituent (C2) for each of the
potential compound words of which W is the first constituent. After the C2's have been generated, the end node
corresponding to W in the trie becomes a leaf node. For each generated C2, the algorithm checks for its
presence in the trie. If C2 is present in the trie, the algorithm marks the combination (W, C2) as a compound
word. If C2 is not present in the trie, the combination (W, C2) is inserted into the suspicion list. The
algorithm in its present form does not detect Case 4, since the processing is performed left to right.
Before processing each W, the algorithm checks for its presence in the suspicion list. If W is present in the
suspicion list as the second constituent of an entry (C1, W), that entry is removed from the suspicion list and
is marked as a compound word. After this, W is matched against the trie contents and actions are taken as per
the case (i.e. Case 1, 2, 3 or 4) to which W belongs.
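The procedure above can be sketched in Python. This is one possible reading of the description, not the authors' implementation: the class and method names are hypothetical, only two-constituent compounds are handled, and the treatment of a word that begins with an already stored word (splitting at the shortest stored prefix) is an assumption made to cover the "word in the trie is a substring of the current word" condition.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_word = False


class CompoundSplitter:
    """Sketch of the trie plus suspicion-list procedure (names are hypothetical)."""

    def __init__(self):
        self.root = TrieNode()
        self.compounds = {}  # compound -> (first constituent, second constituent)
        self.suspicion = {}  # unseen second constituent -> set of first constituents

    def _find(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def _insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _suffixes(self, node, prefix=""):
        """Yield the remainder of every stored word that continues below node."""
        for ch, child in node.children.items():
            if child.is_word:
                yield prefix + ch
            yield from self._suffixes(child, prefix + ch)

    def _record(self, first, second):
        if self._contains(second):
            self.compounds[first + second] = (first, second)
        else:
            self.suspicion.setdefault(second, set()).add(first)

    def process(self, w):
        # Suspicion-list check: w may be the missing second constituent.
        for first in self.suspicion.pop(w, ()):
            self.compounds[first + w] = (first, w)
        if self._contains(w):  # Case 2: nothing to do
            return
        # Assumption: if a stored word is a proper prefix of w, split w there.
        node = self.root
        for i, ch in enumerate(w[:-1], start=1):
            node = node.children.get(ch)
            if node is None:
                break
            if node.is_word:
                self._record(w[:i], w[i:])
                return
        node = self._find(w)
        if node is not None:  # Case 3: w starts one or more stored words
            for c2 in list(self._suffixes(node)):
                self._record(w, c2)
            node.children = {}  # the end node of w becomes a leaf
        self._insert(w)  # Case 1, and the final step of Case 3


splitter = CompoundSplitter()
for word in ["rasayanika", "yanika", "rasa"]:
    splitter.process(word)
print(splitter.compounds)
```

In this sketch the same split is recorded whether the compound is read before or after its constituents, since a missing second constituent is parked in the suspicion list until it appears.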
Figure 1: Before processing the word rasay
Figure 2: After processing the word rasa
Figure 3: After processing the word yanika
To increase the number of potential compounds in the suspicion list, the algorithm can be
run with reversed words. The forward pass (i.e. left-to-right processing of a word in its original form) is
enough to split a compound word if all its constituents are present in the corpus as independent words.
To illustrate the algorithm, let us consider a sample lexicon with words in the sequence
rasayanikA “folk tales”, Rāsāyanika “lower house of parliament”, yanika “gathering”, rasay “people”,
anika “tale”. If both constituents of a compound are present in the corpus, then the order of reading the
compound and its constituent parts is irrelevant. Fig. 1 represents the trie contents after the words rasayanikA,
Rāsāyanika and yanika have been read in. Suppose the word rasa is input after this. The algorithm first checks
for it in the suspicion list. After not finding it there, the word is matched against the contents of the trie,
generating the constituent forms yanika and yanikA. yanika is already present in the trie but yanikA has not
yet been read in. Hence, rasa yanika is marked as a compound word. But since yanikA is not present in the trie,
rasa yanikA is inserted into the suspicion list. This state is shown in Fig. 2.
Suppose the algorithm comes across the word yanikA later. yanikA is present in the suspicion list as the
second constituent. Hence, rasa yanikA is removed from the suspicion list and is marked as a compound word.
Then the algorithm matches yanikA against the trie and, since it is not present, inserts it into the trie.
Hence, at the end of the algorithm, the best possible atomic word forms are present in the trie with
compounds decomposed into their constituent parts.
4.1. Marathi Post-processing
4.1.1. Stray Characters & Affixes
The algorithm described above is vulnerable to stray characters and affixes present in the corpus. Such
characters can trigger wrong splits and increase the false positive rate. For example, pra is a prefix in
Marathi. Ideally, pra cannot occur as an independent word. However, due to typographical errors or for other
reasons, the presence of such a character combination in a text segment cannot be ruled out. If pra is present,
the algorithm wrongly splits pranAm into pra and nAm “name” (assuming nAm is also present in the text). In
fact, pranAm is not a compound word. Hence, special care should be taken so that the algorithm does not
consider affixes as valid words. A possible solution to this problem is the use of a list of affixes. The
algorithm can check the affix lexicon for the presence of each constituent and mark the word as a compound
word only if none of the constituents is present in the affix list.
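The affix check amounts to a simple filter on candidate splits. The sketch below assumes a hypothetical affix list containing only the prefix from the example above; a real list would have to be compiled for the language.

```python
# Hypothetical affix list; only the prefix discussed in the text is included.
AFFIXES = frozenset({"pra"})


def accept_split(constituents, affixes=AFFIXES):
    """Accept a candidate split only if no constituent is a known affix."""
    return not any(part in affixes for part in constituents)


print(accept_split(("pra", "nAm")))    # rejected: pra is an affix
print(accept_split(("nav", "varsh")))  # accepted
```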
4.1.2. Length Based Heuristics
Length-based heuristics can be used to keep stray characters from being considered as valid
words by the algorithm. But this strategy does not work very well for Marathi. In Marathi, words with very few
characters have the potential to form compound words. For example, the two-character word nav “new” can
form an array of compound words such as navvarsh “new year” and navjivan “new life”. In the current system,
only single-character words are treated as invalid.
4.1.3. Consonant-Vowel (CV) Pairs
Consonant-vowel (CV) combinations can occur independently in Marathi text but they cannot be a
constituent of a compound word. Hence, the algorithm should check that none of the generated constituents is a
CV pair. Examples of valid Marathi CV pairs are ne and ki.
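The length rule and the CV-pair rule can be combined into one validity check on a candidate constituent. The sketch below is an illustration under stated assumptions: it works on Devanagari text, counts characters as Unicode code points, and treats a CV pair as one consonant (U+0915 to U+0939) followed by one dependent vowel sign (U+093E to U+094C). The function names are hypothetical.

```python
def is_consonant(ch):
    """Devanagari consonant letters occupy U+0915 (ka) to U+0939 (ha)."""
    return "\u0915" <= ch <= "\u0939"


def is_vowel_sign(ch):
    """Dependent vowel signs occupy U+093E to U+094C."""
    return "\u093e" <= ch <= "\u094c"


def is_cv_pair(token):
    return len(token) == 2 and is_consonant(token[0]) and is_vowel_sign(token[1])


def valid_constituent(token):
    """Reject single-character tokens and bare consonant-vowel pairs."""
    return len(token) > 1 and not is_cv_pair(token)


print(valid_constituent("नव"))  # two consonant letters (nav): allowed
print(valid_constituent("ने"))  # CV pair (ne): not allowed as a constituent
print(valid_constituent("न"))   # single character: not allowed
```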
4.1.4. Problems Related to Word-forms
Another problem associated with Marathi compound splitting is related to a root word and its different
forms. Let us take the example of the Marathi root sany “wind” and one of its word-forms sany “windy”, which is
a constituent of the compound word sanyuga “aeroplane”. The constituents of the compound
sanyuga are sany and uga. Let us consider the case when the algorithm reads the words in the sequence sanyuga,
sany, uga and uga. After reading the first two words, the algorithm will split the compound into the parts sany
and uga and insert the combination into the suspicion list, since the existence of the second constituent is not
yet known. uga is not a valid Marathi word. Hence, even though both the constituents are present, the algorithm
will fail to split the compound sanyuga into its correct constituent parts. A possible solution to this problem
is to detect the different candidate splitting points and then select the best split based on a probabilistic
measure. This feature, however, is not incorporated into the current system.
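Since this feature is not part of the current system, the following is only a sketch of what such a selection step could look like. It assumes a unigram frequency table of independently occurring words and scores each split point by the product of the relative frequencies of the two parts; the scoring function, the `min_len` parameter and the counts in the example are all illustrative assumptions.

```python
from collections import Counter


def best_split(word, counts, min_len=2):
    """Return the highest-scoring two-way split of word, or None."""
    total = sum(counts.values())
    if total == 0:
        return None
    best, best_score = None, 0.0
    for i in range(min_len, len(word) - min_len + 1):
        first, second = word[:i], word[i:]
        # A Counter returns 0 for unseen words, so such splits score 0.
        score = (counts[first] / total) * (counts[second] / total)
        if score > best_score:
            best, best_score = (first, second), score
    return best


# Made-up counts, for illustration only.
counts = Counter({"nav": 12, "varsh": 7, "na": 30})
print(best_split("navvarsh", counts))
```

Any split with an unseen part scores zero here; a smoothed estimate would be needed if partially attested splits should still compete.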
V. Experimental Results
In the first experiment, the compound extraction algorithm was used to generate a compound word
lexicon. The experiment was carried out to observe the number of compound words extracted by the algorithm
from a given text segment.
Table 2: Performance of our split algorithm on a text segment
Total Words in Text Segment                    2400329
Total Unique Words                             246300
Total Unique Words Marked as Compound          48420
Proportion of Compound Words (as detected)     19.66 %
A second experiment was carried out to study compound splitting accuracy. A compound word lexicon
was generated using the algorithm described above. The final trie contents were also dumped, since these words
are the atomic word forms. Each word in the compound word lexicon forms one of the constituents of a compound
word. The contents of the trie were merged with the compound word lexicon, generating a combined lexicon.
This experiment, in a way, tested the quality of the combined lexicon in terms of its coverage of independent
word forms. Two sets of compound words (50 compound words each) were manually prepared. These lists
were prepared without any knowledge of the words present in the generated compound word lexicon. A variant
of the algorithm described above was used in this case. The algorithm first loaded all the words in the combined
word lexicon. After this, words from the test files were matched against the loaded words and were split
accordingly. The algorithm was allowed to cause splits only in the test words. The results are shown in Table 3.
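The variant used in this experiment is not specified in detail, so the following is only a sketch of the general idea: the lexicon is loaded first and stays fixed, and each test word is split against it. A split is labelled "high" when both parts are in the lexicon and "low" when only one part is. The function name and the confidence labels are assumptions made for illustration.

```python
def split_with_lexicon(word, lexicon):
    """Return (first, second, confidence), or None if no split is found."""
    fallback = None
    for i in range(1, len(word)):
        first, second = word[:i], word[i:]
        if first in lexicon and second in lexicon:
            return first, second, "high"
        # Remember the first split where only one part is attested.
        if fallback is None and (first in lexicon or second in lexicon):
            fallback = (first, second, "low")
    return fallback


lexicon = {"nav", "varsh"}  # stands in for the loaded combined lexicon
print(split_with_lexicon("navvarsh", lexicon))
print(split_with_lexicon("navjivan", lexicon))
```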
Compound words for which all the constituents satisfied the different criteria for independent words are
included in the high confidence list. Compound words in the low confidence list are the words for which the
algorithm could not find one of the constituents in the corpus. These are the words which are finally present in
the suspicion list. In Table 3, compound split precision and compound extraction rates are calculated based on
the words included in both the high and low confidence lists. For example, in Set 1, the algorithm marked 48 of
the total 50 words as compound words, achieving a compound extraction rate of 96%. Out of these 48 marked
compound words (42 in the high confidence list and 6 in the low confidence list), 40 are correctly split. Hence,
the split precision rate is 83%.
Table 3: Accuracy of compound word split using our algorithm
                                   Set 1    Set 2
Total Words                        50       50
Number of Confirmed Compounds      42       43
Number of Probable Compounds       6        3
Compounds Correctly Split          40       40
Correct Split Rate                 83%      87%
Compound Extraction Rate           96%      92%
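The two rates can be recomputed directly from the counts in Table 3. The short sketch below assumes the definitions given in the text: the extraction rate is the share of test words marked as compounds (confirmed plus probable), and the split precision is the share of marked compounds that are correctly split. The function name is hypothetical.

```python
def rates(total, confirmed, probable, correct):
    """Return (compound extraction rate, split precision) as fractions."""
    marked = confirmed + probable
    return marked / total, correct / marked


# Counts per test set: (total, confirmed, probable, correctly split).
table3 = {"Set 1": (50, 42, 6, 40), "Set 2": (50, 43, 3, 40)}
for name, row in table3.items():
    extraction, precision = rates(*row)
    print(f"{name}: extraction {extraction:.0%}, split precision {precision:.0%}")
```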
The third experiment was carried out to study the improvement in Marathi Grapheme-to-Phoneme
conversion resulting from the incorporation of the phonetic compound word lexicon into the Marathi G2P
converter [4] [12]. A section of the EMILLE corpus was randomly selected [13]. The selected text segment was
phonetized using the Marathi G2P converter developed as part of the LLSTI initiative [6][12]. Words which were
present in the phonetic compound lexicon were also analysed using rules, and the two phonetic transcripts were
manually compared. The results are shown in Table 4.
Table 4: Result of Marathi G2P Conversion
Total Words Analysed                             3497
Words Phonetised Using Compound Word Lexicon     252
Lexicon Correct but Rule Incorrect               65
Rule Correct but Lexicon Incorrect               10
Lexicon and Rule Both Correct                    173
Lexicon and Rule Both Incorrect                  4
Effectively, the phonetization of 55 words (65 improved, 10 worsened) improved after incorporating the
phonetic compound word lexicon into the Marathi G2P. Hence, a net improvement of about 1.6% (55 of 3497
words) in Marathi G2P conversion is observed as a result of using the compound word lexicon generated by the
algorithm presented in this paper. Moreover, among the 252 words phonetised using the lexicon, an
improvement of 21.8% is observed.
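The figures in this paragraph follow from the counts in Table 4. The sketch below spells out the arithmetic, assuming that the net gain is the number of words fixed by the lexicon minus the number of words it made worse; the dictionary keys are hypothetical names for the table rows.

```python
table4 = {
    "total_words": 3497,
    "lexicon_words": 252,
    "lexicon_correct_rule_incorrect": 65,
    "rule_correct_lexicon_incorrect": 10,
    "both_correct": 173,
    "both_incorrect": 4,
}

# Words fixed by the lexicon minus words it made worse.
net_gain = (table4["lexicon_correct_rule_incorrect"]
            - table4["rule_correct_lexicon_incorrect"])
overall_gain = net_gain / table4["total_words"]      # over all analysed words
lexicon_gain = net_gain / table4["lexicon_words"]    # over lexicon words only

print(f"net gain: {net_gain} words")
print(f"overall: {overall_gain:.1%}, within lexicon words: {lexicon_gain:.1%}")
```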
VI. Conclusion
An effective algorithm has been proposed for splitting compound words in Marathi. The algorithm has
been tested and found to be effective in splitting above 92% of the input compound words. Of these splits,
around 83 to 87% are found to be correct. One possible approach to increasing the accuracy of the split is to
allow for multiple splits (at different points in the same word) of every word, by not removing any suspect
compound word from the trie. To get more potential compound words, the same algorithm can be applied a
second time, after reversing each word, so that the second constituent of each compound word can be identified
first. A near-exhaustive list of the affixes of the language can be deployed to minimize or altogether eliminate
wrong splits on account of prefixes and suffixes.
References
[1]. Sangramsing N. Kayte "Marathi Isolated-Word Automatic Speech Recognition System based on Vector Quantization (VQ)
approach" 101st Indian Science Congress, Jammu University, 03 Feb to 07 Feb 2014.
[2]. Monica Mundada, Bharti Gawali, Sangramsing Kayte "Recognition and classification of speech and its related fluency disorders"
International Journal of Computer Science and Information Technologies (IJCSIT)
[3]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "Di-phone-Based Concatenative Speech Synthesis Systems for
Marathi Language" IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 5, Ver. I (Sep -Oct. 2015), PP 76-81,
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
[4]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "Di-phone-Based Concatenative Speech Synthesis System for Hindi"
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015
[5]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali "Classification of Fluent and Dysfluent Speech Using KNN Classifier"
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014
[6]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "A Corpus-Based Concatenative Speech Synthesis System for
Marathi" IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 6, Ver. I (Nov -Dec. 2015), PP 20-26,
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
[7]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "A Marathi Hidden-Markov Model Based Speech Synthesis System"
IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 6, Ver. I (Nov -Dec. 2015), PP 34-39,
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
[8]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "Implementation of Marathi Language Speech Databases for Large
Dictionary" IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 6, Ver. I (Nov -Dec. 2015), PP 40-45,
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
[9]. Sangramsing Kayte, Monica Mundada, Santosh Gaikwad, Bharti Gawali "Performance Evaluation of Speech Synthesis
Techniques for English Language" International Congress on Information and Communication Technology, 9-10 October 2015
[10]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "Performance Calculation of Speech Synthesis Methods for Hindi
language" IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 6, Ver. I (Nov -Dec. 2015), PP 13-19,
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
[11]. Sangramsing Kayte, Monica Mundada "Study of Marathi Phones for Synthesis of Marathi Speech from Text" International Journal
of Emerging Research in Management & Technology, ISSN: 2278-9359 (Volume-4, Issue-10), October 2015
[12]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte "Di-phone-Based Concatenative Speech Synthesis System for Hindi"
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015
[13]. Emille, 2003. The EMILLE (Enabling Minority Language Engineering) Project. (http://www.emille.lancs.ac.uk).
[14]. Sangramsing Kayte, Dr. Bharti Gawali "Marathi Speech Synthesis: A review" International Journal on Recent and Innovation
Trends in Computing and Communication, ISSN: 2321-8169, Volume: 3, Issue: 6, 3708 – 3711
[15]. Monica Mundada, Sangramsing Kayte "Classification of speech and its related fluency disorders Using KNN" ISSN 2231-0096,
Volume-4, Number-3, Sept 2014
[16]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali "Classification of Fluent and Dysfluent Speech Using KNN Classifier"
International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014.

More Related Content

PDF
BanglaDocAnalyzer
PDF
Paper id 25201466
PDF
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
PDF
Comparative performance analysis of two anaphora resolution systems
PDF
Identifying the semantic relations on
PDF
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
PDF
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
BanglaDocAnalyzer
Paper id 25201466
MORPHOLOGICAL ANALYZER USING THE BILSTM MODEL ONLY FOR JAPANESE HIRAGANA SENT...
Comparative performance analysis of two anaphora resolution systems
Identifying the semantic relations on
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING
DETECTION OF JARGON WORDS IN A TEXT USING SEMI-SUPERVISED LEARNING

What's hot (19)

PDF
MYANMAR WORDS SORTING
PDF
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
PDF
Ijarcet vol-3-issue-3-623-625 (1)
PDF
Anaphora resolution in hindi language using gazetteer method
PDF
Hps a hierarchical persian stemming method
PDF
Chinese Word Segmentation in MSR-NLP
PDF
Design and Development of a Malayalam to English Translator- A Transfer Based...
PDF
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS
PDF
Tamil-English Document Translation Using Statistical Machine Translation Appr...
PDF
The recognition system of sentential
PPT
Tamil Morphological Analysis
PDF
Named Entity Recognition System for Hindi Language: A Hybrid Approach
PDF
Anandkumar novel approach
PDF
Pronominal anaphora resolution in
DOC
amta-decision-trees.doc Word document
PDF
International Journal of Engineering and Science Invention (IJESI)
PDF
D017422528
PDF
C7 agramakirshnan2
PDF
D2 anandkumar
MYANMAR WORDS SORTING
EFFECTIVE ARABIC STEMMER BASED HYBRID APPROACH FOR ARABIC TEXT CATEGORIZATION
Ijarcet vol-3-issue-3-623-625 (1)
Anaphora resolution in hindi language using gazetteer method
Hps a hierarchical persian stemming method
Chinese Word Segmentation in MSR-NLP
Design and Development of a Malayalam to English Translator- A Transfer Based...
COMPARATIVE ANALYSIS OF ARABIC STEMMING ALGORITHMS
Tamil-English Document Translation Using Statistical Machine Translation Appr...
The recognition system of sentential
Tamil Morphological Analysis
Named Entity Recognition System for Hindi Language: A Hybrid Approach
Anandkumar novel approach
Pronominal anaphora resolution in
amta-decision-trees.doc Word document
International Journal of Engineering and Science Invention (IJESI)
D017422528
C7 agramakirshnan2
D2 anandkumar
Ad

Viewers also liked (12)

PPTX
Camelot 13-nl-service-deloitte fiscale aspecten transformatie kantoorpanden
PPTX
Presentase farmokologi
PDF
Presentatie Mo,O A 4 15.12.2008
PPTX
«2012 2013 ուստարվա կիսամյակային հաշվետվություն»
PDF
Presentación trabajo james - farnery - argemiro
PDF
Resume_Varun Shetty_vs2567
PPTX
Condicional simple en español
PDF
James díaz descripción de un cultivo zona donde vivo
PDF
Move It to Lose It!
PPT
Профілактика суїциду
PDF
อจท. แผน 1 2 สุขศึกษาฯ ป.5 edit
PPTX
Presentación del proyecto Piscicola en la vereda San Jose
Camelot 13-nl-service-deloitte fiscale aspecten transformatie kantoorpanden
Presentase farmokologi
Presentatie Mo,O A 4 15.12.2008
«2012 2013 ուստարվա կիսամյակային հաշվետվություն»
Presentación trabajo james - farnery - argemiro
Resume_Varun Shetty_vs2567
Condicional simple en español
James díaz descripción de un cultivo zona donde vivo
Move It to Lose It!
Профілактика суїциду
อจท. แผน 1 2 สุขศึกษาฯ ป.5 edit
Presentación del proyecto Piscicola en la vereda San Jose
Ad

Similar to Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis (10)

PDF
A Corpus-Based Concatenative Speech Synthesis System for Marathi
PPTX
Expanding Vocabulary through Word Structure Analysis.pptx
PDF
Implementation of Marathi Language Speech Databases for Large Dictionary
PDF
Fsmnlp presentation 02
PDF
Difficulties in processing malayalam verbs
PPTX
Sanskrit in Natural Language Processing
PDF
Stemming algorithms
PDF
Aw32322326
PPT
Xavier Blanco
PPTX
One of the great lecture about lexiscology
A Corpus-Based Concatenative Speech Synthesis System for Marathi
Expanding Vocabulary through Word Structure Analysis.pptx
Implementation of Marathi Language Speech Databases for Large Dictionary
Fsmnlp presentation 02
Difficulties in processing malayalam verbs
Sanskrit in Natural Language Processing
Stemming algorithms
Aw32322326
Xavier Blanco
One of the great lecture about lexiscology

More from iosrjce (20)

PDF
An Examination of Effectuation Dimension as Financing Practice of Small and M...
PDF
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
PDF
Childhood Factors that influence success in later life
PDF
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
PDF
Customer’s Acceptance of Internet Banking in Dubai
PDF
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
PDF
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
PDF
Student`S Approach towards Social Network Sites
PDF
Broadcast Management in Nigeria: The systems approach as an imperative
PDF
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
PDF
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
PDF
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
PDF
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
PDF
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
PDF
Media Innovations and its Impact on Brand awareness & Consideration
PDF
Customer experience in supermarkets and hypermarkets – A comparative study
PDF
Social Media and Small Businesses: A Combinational Strategic Approach under t...
PDF
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
PDF
Implementation of Quality Management principles at Zimbabwe Open University (...
PDF
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...
An Examination of Effectuation Dimension as Financing Practice of Small and M...
Does Goods and Services Tax (GST) Leads to Indian Economic Development?
Childhood Factors that influence success in later life
Emotional Intelligence and Work Performance Relationship: A Study on Sales Pe...
Customer’s Acceptance of Internet Banking in Dubai
A Study of Employee Satisfaction relating to Job Security & Working Hours amo...
Consumer Perspectives on Brand Preference: A Choice Based Model Approach
Student`S Approach towards Social Network Sites
Broadcast Management in Nigeria: The systems approach as an imperative
A Study on Retailer’s Perception on Soya Products with Special Reference to T...
A Study Factors Influence on Organisation Citizenship Behaviour in Corporate ...
Consumers’ Behaviour on Sony Xperia: A Case Study on Bangladesh
Design of a Balanced Scorecard on Nonprofit Organizations (Study on Yayasan P...
Public Sector Reforms and Outsourcing Services in Nigeria: An Empirical Evalu...
Media Innovations and its Impact on Brand awareness & Consideration
Customer experience in supermarkets and hypermarkets – A comparative study
Social Media and Small Businesses: A Combinational Strategic Approach under t...
Secretarial Performance and the Gender Question (A Study of Selected Tertiary...
Implementation of Quality Management principles at Zimbabwe Open University (...
Organizational Conflicts Management In Selected Organizaions In Lagos State, ...

Recently uploaded (20)

PDF
SOUND-NOTE-ARCHITECT-MOHIUDDIN AKHAND SMUCT
PDF
Trends That Shape Graphic Design Services
PDF
321 LIBRARY DESIGN.pdf43354445t6556t5656
PPT
robotS AND ROBOTICSOF HUMANS AND MACHINES
PDF
2025CategoryRanking of technology university
PPT
aksharma-dfs.pptgfgfgdfgdgdfgdfgdgdrgdgdgdgdgdgadgdgd
PPTX
22CDH01-V3-UNIT-I INTRODUCITON TO EXTENDED REALITY
PPTX
Drafting equipment and its care for interior design
PDF
The Basics of Presentation Design eBook by VerdanaBold
PPTX
ACL English Introductionadsfsfadf 20200612.pptx
PPT
EthicsNotesSTUDENTCOPYfghhnmncssssx sjsjsj
PDF
IARG - ICTC ANALOG RESEARCH GROUP - GROUP 1 - CHAPTER 2.pdf
PDF
Chalkpiece Annual Report from 2019 To 2025
PPTX
Presentation1.pptxnmnmnmnjhjhkjkjkkjkjjk
PPTX
Project_Presentation Bitcoin Price Prediction
PPTX
Drawing as Communication for interior design
PPTX
22CDH01-V3-UNIT III-UX-UI for Immersive Design
PDF
Social Media USAGE .............................................................
PDF
Timeless Interiors by PEE VEE INTERIORS
PPTX
Evolution_of_Computing_Presentation (1).pptx
SOUND-NOTE-ARCHITECT-MOHIUDDIN AKHAND SMUCT
Trends That Shape Graphic Design Services
321 LIBRARY DESIGN.pdf43354445t6556t5656
robotS AND ROBOTICSOF HUMANS AND MACHINES
2025CategoryRanking of technology university
aksharma-dfs.pptgfgfgdfgdgdfgdfgdgdrgdgdgdgdgdgadgdgd
22CDH01-V3-UNIT-I INTRODUCITON TO EXTENDED REALITY
Drafting equipment and its care for interior design
The Basics of Presentation Design eBook by VerdanaBold
ACL English Introductionadsfsfadf 20200612.pptx
EthicsNotesSTUDENTCOPYfghhnmncssssx sjsjsj
IARG - ICTC ANALOG RESEARCH GROUP - GROUP 1 - CHAPTER 2.pdf
Chalkpiece Annual Report from 2019 To 2025
Presentation1.pptxnmnmnmnjhjhkjkjkkjkjjk
Project_Presentation Bitcoin Price Prediction
Drawing as Communication for interior design
22CDH01-V3-UNIT III-UX-UI for Immersive Design
Social Media USAGE .............................................................
Timeless Interiors by PEE VEE INTERIORS
Evolution_of_Computing_Presentation (1).pptx

Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis

  • 1. IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 5, Issue 6, Ver. II (Nov -Dec. 2015), PP 25-30 e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197 www.iosrjournals.org DOI: 10.9790/4200-05622530 www.iosrjournals.org 25 | Page Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis Sangramsing N. Kayte1 , Monica Mundada1 , Dr. Charansing N. Kayte2 , Dr.BhartiGawali* 1,3Department of Computer Science and Information Technology Dr. BabasahebAmbedkarMarathwada University, Aurangabad 2Department of Digital and Cyber Forensic, Aurangabad, Maharashtra Abstract: This research paper addresses the problem of Marathi compound word splitting and its relevance to developing a good quality phonetizer for Marathi Speech Synthesis. The constituents of a Marathi compound word are not separated by space or hyphen. Hence, most of the existing compound splitting algorithms cannot be applied to Marathi. We propose a new technique for automatic extraction of compound words from Marathi corpus. Preliminary tests conducted on the algorithm have shown a split rate of 92 to 96% of the input compound words. Of these splits, around 83 to 87% are correct splits. A few modifications have been suggested, which will improve the accuracy of the splits. Finally, we observe an improvement of 1.6% in Marathi Grapheme-to-Phoneme conversion as a result of using a phonetizedcompound word lexicon, created by the above technique. I. Introduction Compound words are formed when two or more words are concatenated into a single word. Compounding is a highly productive word formation technique in Marathi. Compounding is also a common phenomenon in Marathi. Table 1 gives examples of a few Marathi compound words. Compound Word Constituents Rāsāyanika RāsāyanikA sanyuga SanyugA āvāra Āvāra Table 1: Examples of Marathi Compound Words Compound word lexicons play an important role in various domains of language technologies e.g. speech recognition machine translation [1] [5]. 
Identifying compounds is also necessary for assigning stress patterns correctly, as stress patterns for compounds differ greatly from those of equivalent single-word units [2]. In a Text-To-Speech (TTS) synthesis system, the G2P module converts normalized input orthographic text into the underlying phonetic representation. Accurate phonetic transcription is highly desirable for natural sounding speech synthesis. Compound word splitting plays a crucial role in Marathi G2P conversion, specifically for solving the schwa deletion problem [3] [4].
II. Schwa Deletion
Schwa deletion is a characteristic problem encountered in Marathi G2P conversion. Each consonant in Devanagari, the script used to write Marathi, is associated with an inherent schwa which is not represented in orthography. In some cases this associated schwa is deleted, depending on certain morpho-phonological factors, and in others it is retained. The written text, however, does not provide a direct clue about the deletion or retention of the schwa, which makes the problem challenging to address. The Marathi word aḍacaṇa “अडचण”, for example, is represented in orthography using the vowel character a and the consonantal characters for ḍ, c and ṇ. The schwas (a) after the consonants are inserted by the speaker while speaking out the word. Vowels other than schwa are explicitly represented in orthography. A set of rules for schwa deletion in Marathi has been reported. The rules are based on the morpheme boundaries present in words. Word-internal morpheme boundaries can be detected using a morphological analyzer. Unfortunately, high quality Marathi morphological analyzers are currently not available, which restricts the applicability of these rules. Another set of rules was implemented in the Dhvani speech synthesis system [6] [7]. These rules work well for simple words but fail in the case of compound words. For example, the Marathi compound word
  • 2. RāsāyanikA “lower house of parliament” is obtained by joining two words: Rasa “people” and yanikA “gathering”. In orthography, the word is represented using consonant and vowel forms of r, a, s, a, ya, ni, kA. Application of schwa deletion rules on this word would produce [r a s a y a n i k A], which is incorrect. The correct phonetic transcription [r a s a y a n i k A] is obtained if the two words are analysed separately. Hence, a lexicon of compound words along with their correct phonetic representation is very important for accurate phonetic conversion of input text.
III. Previous Work
A corpus-driven method for compound splitting using a parallel corpus is reported in [6] [8]. Compound splitting and recombination for reducing out-of-vocabulary (OOV) words in large vocabulary speech recognition systems is reported in [1] [9] [10]. An approach to learning compound splitting rules from monolingual and parallel corpora, and its impact on statistical machine translation systems, is reported in [11]. Statistical compound extraction techniques are reported in [6]. These methods require the constituents of a compound word to be space separated. Such methods are not applicable to Marathi compounds, since the constituents of a Marathi compound are not separated by a space. To overcome this problem, a compound splitting algorithm has been developed [14] [15] [16].
IV. Compound Splitting Algorithm
The compound extraction algorithm takes a text corpus as input and generates a lexicon with the compounds split into their constituent parts. The algorithm starts with the assumption that independently occurring words are valid atomic words. A trie-like structure is used to store and efficiently match the words. Without loss of generality, the following description considers a compound word to be made up of two constituents.
A potential compound word is detected if the currently processed word is part of a word already present in the trie, or if a word in the trie is a substring of the current word. For compounds with more than two components, the algorithm can be applied iteratively to generate all constituent components. For each word (W), there are several possibilities:
Case 1: W is not present in the trie and is also not a constituent of any of the words currently present in the trie.
Case 2: W is already present in the trie as an independent word.
Case 3: W is actually the initial constituent of a compound word currently present in the trie as an independent word.
Case 4: W is actually the second constituent of a compound word currently present in the trie as an independent word.
In Case 1, the algorithm inserts W into the trie. In Case 2, no change is needed since W is already present in the trie as an independent word. In Case 3, the algorithm generates the second constituent (W2) for each of the potential compound words in which W is the first constituent. After the generation of the W2's, the end node corresponding to W in the trie becomes a leaf node. For each of the generated W2's, the algorithm checks for its presence in the trie. If W2 is present in the trie, the algorithm marks the combination (W, W2) as a compound word. If W2 is not present in the trie, the combination (W, W2) is inserted into the suspicion list. The algorithm in its present form does not locate Case 4, since the processing is performed left to right.
Before processing each W, the algorithm checks for its presence in the suspicion list. If W is present in the suspicion list as the second constituent, the corresponding entry is removed from the suspicion list and is marked as a compound word. After this, W is matched against the trie contents and actions are taken according to the case (i.e. Case 1, 2, 3 or 4) to which W belongs.
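The forward pass described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Trie` class, the `extract_compounds` function and the handling of a stored word that starts the current word are one possible reading of the description, and the sample words reuse nav, navvarsh and navjivan from Section 4.1.2 while assuming (hypothetically) that varsh and jivan also occur as independent words.

```python
END = ""  # sentinel key marking the end of a stored word (never a real character)


class Trie:
    """Minimal dict-based trie, an illustrative stand-in for the trie-like structure."""

    def __init__(self):
        self.root = {}

    def _walk(self, text):
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return None
        return node

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def contains(self, word):
        node = self._walk(word)
        return node is not None and END in node

    def pop_suffixes(self, prefix):
        """Remove every longer word starting with `prefix`; return their remainders."""
        node = self._walk(prefix)
        if node is None:
            return []
        suffixes = []
        stack = [(child, key) for key, child in node.items() if key != END]
        while stack:
            cur, text = stack.pop()
            for key, child in cur.items():
                if key == END:
                    suffixes.append(text)
                else:
                    stack.append((child, text + key))
        for key in [k for k in node if k != END]:
            del node[key]  # the end node of `prefix` becomes a leaf
        return suffixes

    def stored_prefixes(self, word):
        """Stored words that are proper prefixes of `word`."""
        node, found = self.root, []
        for i, ch in enumerate(word[:-1]):
            node = node.get(ch)
            if node is None:
                break
            if END in node:
                found.append(word[:i + 1])
        return found


def extract_compounds(words):
    """Return (confirmed compounds, suspicion list) as sets of (first, second) pairs."""
    trie, compounds, suspicion = Trie(), set(), set()

    def record(first, second):
        # Confirmed if the second constituent is already known, otherwise suspected.
        (compounds if trie.contains(second) else suspicion).add((first, second))

    for w in words:
        # Promote suspicion-list entries whose second constituent is w.
        resolved = {pair for pair in suspicion if pair[1] == w}
        suspicion -= resolved
        compounds |= resolved
        if trie.contains(w):                  # Case 2: nothing to do
            continue
        for rest in trie.pop_suffixes(w):     # Case 3: w starts a stored word
            record(w, rest)
        heads = trie.stored_prefixes(w)       # a stored word starts w (assumed handling)
        for head in heads:
            record(head, w[len(head):])
        if not heads:                         # Case 1, or w is an atomic first constituent
            trie.insert(w)
    return compounds, suspicion


compounds, suspicion = extract_compounds(["navvarsh", "navjivan", "varsh", "nav", "jivan"])
print(sorted(compounds), sorted(suspicion))
```

In this sketch a word that is only suspected of being a compound is still removed from the trie, which mirrors the behaviour described above; keeping such words would be a different design choice.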
  • 3. Figure 1: Before processing the word rasa
Figure 2: After processing the word rasa
Figure 3: After processing the word yanikA
To increase the number of potential compounds in the suspicion list, the algorithm can be run with reversed words. The forward pass (i.e. left to right processing of a word in its original form) is enough to split a compound word if all its constituents are present in the corpus as independent words. To illustrate the algorithm, let us consider a sample lexicon with words in the sequence rasayanikA “folk tales”, Rāsāyanika “lower house of parliament”, yanika “gathering”, rasa “people”, yanikA “tale”. If both constituents of a compound are present in the corpus, then the order of reading the compound and its constituent parts is irrelevant. Fig. 1 represents the trie contents after the words rasayanikA,
  • 4. Rāsāyanika and yanika have been read in. Suppose the word rasa is input after this. The algorithm first checks for it in the suspicion list. After not finding it there, the word is matched against the contents of the trie, generating the constituent forms yanika and yanikA. yanika is already present in the trie but yanikA is not yet read in. Hence, rasa yanika is marked as a compound word. But since yanikA is not present in the trie, rasa yanikA is inserted into the suspicion list. This state is shown in Fig. 2. Suppose the algorithm comes across the word yanikA later. yanikA is present in the suspicion list as the second constituent. Hence, rasa yanikA is removed from the suspicion list and is marked as a compound word. Then the algorithm matches yanikA in the trie and, since it is not present, it is inserted into the trie. Hence, at the end of the algorithm, the best possible atomic word forms are present in the trie, with compounds decomposed into their constituent parts.
4.1. Marathi Post-processing
4.1.1. Stray Characters & Affixes
The algorithm described above is vulnerable to stray characters and affixes present in the corpus. Such characters can trigger wrong splits and also increase the false positive rate. For example, pra is a prefix in Marathi. Ideally, pra cannot occur as an independent word. However, due to typographical errors or for other reasons, the presence of such a character combination in a text segment cannot be ruled out. If pra is present, the algorithm wrongly splits pranAm into pra and nAm “name” (assuming nAm is also present in the text), although pranAm is not a compound word. Hence, special care should be taken so that the algorithm does not consider affixes as valid words. A possible solution to this problem is the use of a list of affixes.
The algorithm can check the affix lexicon for the presence of each constituent and mark the word as a compound word only if none of the constituents is present in the affix list.
4.1.2. Length Based Heuristics
Length based heuristics can be used to keep stray characters from being considered as valid words by the algorithm. But this strategy does not work very well for Marathi, where words with very few characters have the potential to form compound words. For example, the two-character word nav “new” can form an array of compound words such as navvarsh “new year” and navjivan “new life”. In the current system, only single-character words are treated as invalid.
4.1.3. Consonant-Vowel (CV) Pairs
Consonant-vowel (CV) combinations can occur independently in Marathi text but they cannot be a constituent of a compound word. Hence, the algorithm should check that none of the generated constituents is a CV pair. Examples of valid Marathi CV pairs are ne and ki.
4.1.4. Problems Related to Word-forms
Another problem associated with Marathi compound splitting is related to a root word and its different forms. Let us take the example of the Marathi root sany “wind” and one of its word-forms, sany “windy”, which is a constituent of the compound word sanyuga “aeroplane”. The constituents of the compound sanyuga are sany and uga. Let us consider the case when the algorithm reads the words in the sequence sanyuga, sany, uga and uga. After reading the first two words, the algorithm will split the compound into the parts sany and uga and insert the combination into the suspicion list, since the existence of the second constituent is not yet known. uga is not a valid Marathi word. Hence, even though both the constituents are present, the algorithm will fail in splitting the compound sanyuga into its correct constituent parts. A possible solution to this problem is to detect different splitting points and subsequently select the best split based on a probabilistic measure.
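The affix, length and CV-pair checks of Sections 4.1.1 to 4.1.3 can be sketched as a single validity test on the constituents of a candidate split. This is only an illustration under stated assumptions: the function names are hypothetical, the affix list contains just the one prefix mentioned in the text, and the CV test assumes romanised input in which a CV pair is one consonant letter followed by one vowel letter.

```python
# Hypothetical affix list; only "pra" is taken from the text.
AFFIXES = {"pra"}

# Assumption: romanised text where each vowel is written with a single letter.
VOWELS = set("aeiouAEIOU")


def is_cv_pair(token):
    """True for a two-letter consonant + vowel token such as 'ne' or 'ki'."""
    return len(token) == 2 and token[0] not in VOWELS and token[1] in VOWELS


def is_valid_constituent(token, affixes=AFFIXES):
    """Apply the single-character, affix and CV-pair checks to one constituent."""
    if len(token) <= 1:        # only single-character words are treated as invalid
        return False
    if token in affixes:       # affixes must not count as independent words
        return False
    if is_cv_pair(token):      # CV pairs cannot be constituents of a compound
        return False
    return True


def accept_split(first, second, affixes=AFFIXES):
    """Accept a candidate split only if both constituents pass every check."""
    return is_valid_constituent(first, affixes) and is_valid_constituent(second, affixes)


print(accept_split("pra", "nAm"))    # rejected: "pra" is an affix
print(accept_split("nav", "varsh"))  # accepted: both parts pass the checks
```

The sketch covers only the affix, single-character and CV-pair checks; choosing among several candidate split points by a probabilistic measure is a separate step.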
This feature, however, is not incorporated into the current system.
V. Experimental Results
In the first experiment, the compound extraction algorithm was used to generate a compound word lexicon. The experiment was carried out to observe the number of compound words extracted by the algorithm from a given text segment.
Table 2: Performance of our split algorithm on a text segment
Total Words in Text Segment                    2400329
Total Unique Words                             246300
Total Unique Words Marked as Compound          48420
Proportion of Compound Words (as detected)     19.66%
  • 5. A second experiment was carried out to study compound splitting accuracy. A compound word lexicon was generated using the algorithm described above. The final trie contents were also dumped, since these words are the atomic word forms. Each word in the compound word lexicon forms one of the constituents of a compound word. The contents of the trie were merged with the compound word lexicon, generating a combined lexicon. This experiment, in a way, tested the quality of the combined lexicon in terms of its coverage of independent word forms. Two sets of compound words (50 compound words each) were manually prepared. These lists were prepared without any knowledge of the words present in the generated compound word lexicon. A variant of the algorithm described above was used in this case. The algorithm first loaded all the words in the combined word lexicon. After this, words from the test files were matched against the loaded words and were split accordingly. The algorithm was allowed to cause splits only in the test words. The results are shown in Table 3. Compound words for which all the constituents satisfied the different criteria for independent words are included in the high confidence list. Compound words in the low confidence list are the words for which the algorithm could not find one of the constituents in the corpus. These are the words which are finally present in the suspicion list. In Table 3, the compound split precision and compound extraction rates are calculated based on the words included in both the high and low confidence lists. For example, in Set 1, the algorithm marked 48 of the total 50 words as compound words, achieving a compound extraction rate of 96%. Out of these 48 marked compound words (42 in the high confidence list and 6 in the low confidence list), 40 are correctly split. Hence, the split precision rate is 83%.
Table 3: Accuracy of compound word split using our algorithm
                                   Set 1   Set 2
Total Words                        50      50
Number of Confirmed Compounds      42      43
Number of Probable Compounds       6       3
Compounds Correctly Split          40      40
Correct Split Rate                 83%     87%
Compound Extraction Rate           96%     92%
The third experiment was carried out to study the improvement in Marathi Grapheme-to-Phoneme conversion resulting from the incorporation of the phonetic compound word lexicon into the Marathi G2P converter [4] [12]. A section of the Emille corpus was randomly selected [13]. The selected text segment was phonetized using the Marathi G2P converter developed as part of the LLSTI initiative [6] [12]. Words which were present in the phonetic compound lexicon were also analysed using rules, and the two phonetic transcripts were manually compared. The results are shown in Table 4.
Table 4: Result of Marathi G2P Conversion
Total Words Analysed                             3497
Words Phonetised Using Compound Word Lexicon     252
Lexicon Correct but Rule Incorrect               65
Rule Correct but Lexicon Incorrect               10
Lexicon and Rule Both Correct                    173
Lexicon and Rule Both Incorrect                  4
Effectively, the phonetization of 55 words (65 improved minus 10 degraded) improved after incorporating the phonetic compound word lexicon into the Marathi G2P. Hence, a net improvement of 1.6% in Marathi G2P conversion is observed as a result of using the compound word lexicon generated by the algorithm presented in this paper. Moreover, out of the total 252 words phonetised using the lexicon, an improvement of 21.8% is observed.
VI. Conclusion
An effective algorithm has been proposed for splitting compound words in Marathi. The algorithm has been tested and found to be effective in splitting above 92% of the input compound words. Of these splits, around 83 to 87% are found to be correct. One of the possible approaches to increase the accuracy of the split is to allow for multiple splits (at different points in the same word) of every word, by not removing any suspect compound word from the trie.
To get more potential compound words, the same algorithm can be applied a second time, after reversing each word, so that the second constituent of each compound word can be identified first. A near-exhaustive list of affix words of the language can be deployed to minimize or altogether eliminate wrong splits on account of prefixes and suffixes.
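The derived percentages in Section V can be recomputed from the raw counts in Tables 2 to 4. The short sketch below shows one reading of those definitions: extraction rate as marked compounds over test words, split precision as correct splits over marked compounds, and net G2P gain as lexicon-only-correct minus rule-only-correct words. The function names are illustrative and not part of the original work.

```python
def percent(part, whole):
    """Return part / whole expressed as a percentage."""
    return 100.0 * part / whole


def split_metrics(total, confirmed, probable, correct):
    """Return (compound extraction rate, split precision) for one test set."""
    marked = confirmed + probable          # high + low confidence lists
    return percent(marked, total), percent(correct, marked)


def g2p_gain(total_words, lexicon_words, lexicon_only_correct, rule_only_correct):
    """Return (net improved words, gain over all words, gain over lexicon words)."""
    net = lexicon_only_correct - rule_only_correct
    return net, percent(net, total_words), percent(net, lexicon_words)


# Table 2: proportion of unique words marked as compound.
print(round(percent(48420, 246300), 2))

# Table 3: Set 1 and Set 2.
print([round(v, 1) for v in split_metrics(50, 42, 6, 40)])
print([round(v, 1) for v in split_metrics(50, 43, 3, 40)])

# Table 4: net G2P improvement.
net, overall, within_lexicon = g2p_gain(3497, 252, 65, 10)
print(net, round(overall, 1), round(within_lexicon, 1))
```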
  • 6. References
[1]. Sangramsing N. Kayte, "Marathi Isolated-Word Automatic Speech Recognition System based on Vector Quantization (VQ) approach", 101st Indian Science Congress, Jammu University, 03 Feb to 07 Feb 2014.
[2]. Monica Mundada, Bharti Gawali, Sangramsing Kayte, "Recognition and classification of speech and its related fluency disorders", International Journal of Computer Science and Information Technologies (IJCSIT).
[3]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis Systems for Marathi Language", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 5, Ver. I (Sep-Oct. 2015), PP 76-81, e-ISSN: 2319-4200, p-ISSN No.: 2319-4197.
[4]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis System for Hindi", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015.
[5]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali, "Classification of Fluent and Dysfluent Speech Using KNN Classifier", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014.
[6]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "A Corpus-Based Concatenative Speech Synthesis System for Marathi", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 20-26, e-ISSN: 2319-4200, p-ISSN No.: 2319-4197.
[7]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "A Marathi Hidden-Markov Model Based Speech Synthesis System", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 34-39, e-ISSN: 2319-4200, p-ISSN No.: 2319-4197.
[8]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Implementation of Marathi Language Speech Databases for Large Dictionary", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 40-45, e-ISSN: 2319-4200, p-ISSN No.: 2319-4197.
[9]. Sangramsing Kayte, Monica Mundada, Santosh Gaikwad, Bharti Gawali, "Performance Evaluation Of Speech Synthesis Techniques For English Language", International Congress on Information and Communication Technology, 9-10 October 2015.
[10]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Performance Calculation of Speech Synthesis Methods for Hindi language", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 13-19, e-ISSN: 2319-4200, p-ISSN No.: 2319-4197.
[11]. Sangramsing Kayte, Monica Mundada, "Study of Marathi Phones for Synthesis of Marathi Speech from Text", International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 4, Issue 10, October 2015.
[12]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis System for Hindi", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015.
[13]. Emille, 2003. The EMILLE (Enabling Minority Language Engineering) Project. (http://www.emille.lancs.ac.uk).
[14]. Sangramsing Kayte, Dr. Bharti Gawali, "Marathi Speech Synthesis: A review", International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume 3, Issue 6, 3708-3711.
[15]. Monica Mundada, Sangramsing Kayte, "Classification of speech and its related fluency disorders Using KNN", ISSN 2231-0096, Volume 4, Number 3, Sept 2014.
[16]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali, "Classification of Fluent and Dysfluent Speech Using KNN Classifier", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014.