Automatic Generation of Compound Word Lexicon for Marathi Speech Synthesis

IOSR Journal of VLSI and Signal Processing (IOSR-JVSP)
Volume 5, Issue 6, Ver. II (Nov -Dec. 2015), PP 25-30
e-ISSN: 2319 – 4200, p-ISSN No. : 2319 – 4197
www.iosrjournals.org
DOI: 10.9790/4200-05622530
Sangramsing N. Kayte1, Monica Mundada1, Dr. Charansing N. Kayte2, Dr. Bharti Gawali*
1,3Department of Computer Science and Information Technology, Dr. Babasaheb Ambedkar Marathwada
University, Aurangabad
2Department of Digital and Cyber Forensic, Aurangabad, Maharashtra
Abstract: This research paper addresses the problem of Marathi compound word splitting and its relevance to
developing a good quality phonetizer for Marathi Speech Synthesis. The constituents of a Marathi compound
word are not separated by space or hyphen. Hence, most of the existing compound splitting algorithms cannot
be applied to Marathi. We propose a new technique for automatic extraction of compound words from Marathi
corpus. Preliminary tests conducted on the algorithm have shown a split rate of 92 to 96% of the input
compound words. Of these splits, around 83 to 87% are correct splits. A few modifications have been
suggested, which will improve the accuracy of the splits. Finally, we observe an improvement of 1.6% in
Marathi Grapheme-to-Phoneme conversion as a result of using a phonetized compound word lexicon, created
by the above technique.
I. Introduction
Compound words are formed when two or more words are concatenated into a single word.
Compounding is a common and highly productive word formation technique in Marathi.
Table 1 gives examples of a few Marathi compound words.
Compound Word    Constituents
Rāsāyanika       RāsāyanikA
sanyuga          SanyugA
āvāra            Āvāra
Table 1: Examples of Marathi Compound Words
Compound word lexicons play an important role in various domains of language technologies, e.g.
speech recognition and machine translation [1] [5]. Identifying compounds is also necessary for assigning stress
patterns correctly as stress patterns for compounds differ greatly from equivalent single word units [2].
In a Text-To-Speech (TTS) synthesis system, the Grapheme-to-Phoneme (G2P) module converts normalized input
orthographic text into its underlying phonetic representation. Accurate phonetic transcription is highly desired for natural
sounding speech synthesis. Compound word splitting plays a crucial role in Marathi G2P conversion,
specifically for solving the schwa deletion problem [3] [4].
II. Schwa Deletion
Schwa deletion is a unique problem encountered in Marathi G2P conversion. Each consonant in
Devanagari, the script used to write Marathi, is associated with an inherent schwa which is not represented in
orthography. In some cases, this associated schwa is deleted depending on certain morpho-phonological factors
and in others, it is retained. The written text, however, does not provide a direct clue about the deletion or
retention of the schwa, thereby making it a challenging problem to address. The Marathi word aḍacaṇa “अडचण”, for
example, is represented in orthography using the initial vowel a and the consonantal characters ḍ, c, ṇ. The schwas (a) are
inserted by the speaker while speaking out the word. Vowels other than schwa are explicitly represented in
orthography.
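As an illustration, the inherent-schwa convention can be sketched in Python. The character tables below are a small hypothetical subset chosen for this example, and the function name is illustrative; the sketch only inserts the inherent schwa and makes no attempt at schwa deletion.

```python
# Toy tables (an assumption for illustration); a real system needs the full Devanagari inventory.
INDEPENDENT_VOWELS = {"\u0905": "a", "\u0906": "ā"}          # अ, आ
CONSONANTS = {
    "\u0921": "ḍ",  # ड
    "\u091a": "c",  # च
    "\u0923": "ṇ",  # ण
    "\u0928": "n",  # न
    "\u092e": "m",  # म
}
VOWEL_SIGNS = {"\u093e": "ā", "\u093f": "i", "\u0947": "e"}  # dependent vowel signs (matras)
VIRAMA = "\u094d"  # explicitly suppresses the inherent vowel


def transliterate_with_schwa(word: str) -> str:
    """Romanize a word, writing out every schwa that the script leaves implicit."""
    out = []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in INDEPENDENT_VOWELS:
            out.append(INDEPENDENT_VOWELS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            # The schwa is implied only when no vowel sign or virama follows.
            if nxt not in VOWEL_SIGNS and nxt != VIRAMA:
                out.append("a")
        elif ch in VOWEL_SIGNS:
            out.append(VOWEL_SIGNS[ch])
        elif ch == VIRAMA:
            continue
        else:
            raise ValueError(f"character not in the toy tables: {ch!r}")
    return "".join(out)


print(transliterate_with_schwa("\u0905\u0921\u091a\u0923"))  # अडचण
```

Deciding which of the inserted schwas are actually pronounced is the schwa deletion problem discussed in this section.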
A set of rules for schwa deletion in Marathi has been reported in earlier work. The rules are based on the morpheme
boundaries present in words. Word internal morpheme boundaries can be detected using a morphological
analyzer. Unfortunately, high quality Marathi morphological analyzers are currently not available.
The unavailability of such analyzers restricts the applicability of these rules.
Another set of rules was implemented in the Dhvani speech synthesis system [6] [7]. These rules work
well for simple words but fail in the case of compound words. For example, Marathi compound word
RāsāyanikA “lower house of parliament” is obtained by joining two words: Rasa “people” and
yanikA “gathering”. In orthography, the word is represented using consonant and vowel forms of r, a, s, a, ya, ni,
kA. Applying the schwa deletion rules to the whole word produces an incorrect transcription. The
correct phonetic transcription [r a s a y a n i k A] is obtained only if the two words are analysed separately. Hence, a
lexicon of compound words along with their correct phonetic representation is very important for accurate
phonetic conversion of input text.
III. Previous Work
A corpus-driven method for compound splitting using a parallel corpus is reported in [6] [8]. Compound
splitting and recombination for reducing out-of-vocabulary (OOV) words in large vocabulary speech
recognition systems is reported in [1][9][10]. An approach to learn compound splitting rules from monolingual
and parallel corpora and its impact on statistical machine translation systems is reported in [11].
Statistical compound extraction techniques are reported in [6]. These methods require the constituents
of a compound word to be space separated. Such methods are not applicable for Marathi compounds since
constituents in a Marathi compound are not separated by space. To overcome this problem, a compound
splitting algorithm has been developed [14][15][16].
IV. Compound Splitting Algorithm
The compound extraction algorithm takes as input a text corpus and generates a lexicon with the
compounds split into their constituent parts. The algorithm starts with the assumption that independently
occurring words are valid atomic words.
A trie-like structure is used to store and efficiently match the words. Without loss of generality,
the following description considers a compound word to be made up of two constituents. A potential compound
word is detected if the currently processed word is part of a word already present in the trie or a word in the trie
is a substring of the current word. For compounds with more than two components, the algorithm can be
applied iteratively to generate all constituent components. For each word (W), there are several possibilities:
Case 1: W is not present in the trie and is also not a constituent of any of the words currently present in the trie.
Case 2: W is already present in the trie as an independent word.
Case 3: W is actually the initial constituent of a compound word currently present in the trie as an independent
word.
Case 4: W is actually the second constituent of a compound word currently present in the trie as an independent
word.
In Case 1, the algorithm inserts W into the trie. In Case 2, no change is needed since W is already present in
the trie as an independent word. In Case 3, the algorithm generates the second constituent (C2) for each of the
potential compound words in which W is the first constituent. After the C2's have been generated, the end node
corresponding to W in the trie becomes a leaf node. For each generated C2, the algorithm checks for its presence
in the trie. If C2 is present in the trie, the algorithm marks the combination (W C2) as a compound word. If C2 is
not present in the trie, the combination (W C2) is inserted into the suspicion list. The
algorithm in its present form does not detect Case 4, since the processing is performed left to right.
Before processing each W, the algorithm checks for its presence in the suspicion list. If W is present in the
suspicion list as the second constituent of an entry (C1 W), that entry is removed from the suspicion list and is
marked as a compound word. After this, W is matched against the trie contents and actions are taken as per the
case (i.e. Case 1, 2, 3, 4) to which W belongs.
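The forward pass described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names are hypothetical, words are romanized without diacritics, only two-constituent compounds are handled, and the situation in which a stored word is a prefix of the incoming word is left out for brevity. Case 4 is not detected, as in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    children: dict = field(default_factory=dict)
    is_word: bool = False


class CompoundSplitter:
    def __init__(self):
        self.root = Node()
        self.compounds = set()  # confirmed (first, second) pairs
        self.suspicion = set()  # (first, second) pairs awaiting the second constituent

    def _find(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def _contains(self, word):
        node = self._find(word)
        return node is not None and node.is_word

    def _insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.is_word = True

    def _suffixes_below(self, node, acc=""):
        # Yield every stored continuation below `node` (candidate second constituents).
        for ch, child in node.children.items():
            if child.is_word:
                yield acc + ch
            yield from self._suffixes_below(child, acc + ch)

    def process(self, word):
        # Suspicion-list check: `word` may be an awaited second constituent.
        confirmed = {pair for pair in self.suspicion if pair[1] == word}
        self.suspicion -= confirmed
        self.compounds |= confirmed

        node = self._find(word)
        if node is None:
            # Case 1: not stored and not a prefix of any stored word.
            self._insert(word)
        elif node.children:
            # Case 3: `word` is the initial part of longer stored words.
            for second in list(self._suffixes_below(node)):
                if self._contains(second):
                    self.compounds.add((word, second))
                else:
                    self.suspicion.add((word, second))
            node.children.clear()  # the end node of `word` becomes a leaf
            node.is_word = True
        # Case 2 (leaf node already marked as a word): nothing to do.


splitter = CompoundSplitter()
for w in ["rasayanikA", "rasayanika", "yanika", "rasa", "yanikA"]:
    splitter.process(w)
print(sorted(splitter.compounds), sorted(splitter.suspicion))
```

Storing the suspicion list as a set of pairs keeps the sketch short; an index keyed by the second constituent would avoid the linear scan on each word.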

Figure 1: Before processing the word rasa
Figure 2: After processing the word rasa
Figure 3: After processing the word yanikA
To increase the number of potential compounds in the suspicion list, the algorithm can be
run with reversed words. The forward pass (i.e. left to right processing of a word in its original form) is enough to
split a compound word if all its constituents are present in the corpus as independent words.
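The effect of running on reversed words can be illustrated with a small Python sketch: once every word is reversed, a second constituent becomes a prefix of the reversed compound, so the same prefix matching applies. The function name and the plain list used as a lexicon are assumptions made for illustration; words are romanized without diacritics.

```python
def second_constituent_splits(word, lexicon):
    """Return (first, second) pairs in which `word` ends a longer lexicon entry."""
    reversed_word = word[::-1]
    splits = []
    for entry in lexicon:
        # A suffix match on the original strings is a prefix match on the reversed ones.
        if len(entry) > len(word) and entry[::-1].startswith(reversed_word):
            splits.append((entry[: len(entry) - len(word)], word))
    return splits


print(second_constituent_splits("yanika", ["rasayanika", "rasa", "yanika"]))
```

In a full implementation the reversed words would be fed to the same trie-based procedure rather than compared pairwise as done here.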
To illustrate the algorithm, let us consider a sample lexicon with words in the sequence
rasayanikA “folk tales”, Rāsāyanika “lower house of parliament”, yanika “gathering”, rasa “people”,
yanikA “tale”. If both constituents of a compound are present in the corpus, then the order of reading the
compound and its constituent parts is irrelevant. Fig. 1 represents the trie contents after the words rasayanikA,
Rāsāyanika and yanika have been read in. Suppose the word rasa is input after this. The algorithm first checks for
it in the suspicion list. After not finding it there, the word is matched against the contents of the trie, generating
the constituent forms yanika and yanikA. yanika is already present in the trie but yanikA is not yet read in.
Hence, rasa yanika is marked as a compound word. But since yanikA is not present in the trie, rasa yanikA is
inserted into the suspicion list. This state is shown in Fig. 2.
Suppose the algorithm comes across the word yanikA later. yanikA is present in the suspicion list as the
second constituent. Hence, rasa yanikA is removed from the suspicion list and is marked as a compound word.
Then the algorithm matches yanikA in the trie and since it is not present, it is inserted into the trie.
Hence, at the end of the algorithm, the best possible atomic word forms are present in the trie with
compounds decomposed into their constituent parts.
4.1. Marathi Post-processing
4.1.1. Stray Characters & Affixes
The algorithm described above is vulnerable to stray characters and affixes present in the corpus. Such
characters can trigger wrong splits and also increase the false positive rate. For example, pra is a prefix in Marathi.
Ideally, pra cannot occur as an independent word. However, due to typographical errors or for other reasons,
the presence of such a character combination in a text segment cannot be ruled out. If pra is present, the algorithm
wrongly splits pranAm into pra and nAm “name” (assuming nAm is also present in the text). Actually, pranAm is
not a compound word. Hence, special care should be taken so that the algorithm doesn’t consider affixes as
valid words. A possible solution to this problem can be the use of a list of affixes. The algorithm can check the
affix lexicon for the presence of each constituent and mark the word as a compound word only if none of the
constituents is present in the affix list.
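A minimal Python sketch of this affix check is given below. The affix list is hypothetical and contains only the prefix pra mentioned in the text; the function name is likewise illustrative.

```python
# Hypothetical affix list; only "pra" is taken from the text.
AFFIXES = {"pra"}


def accept_split(constituents, affixes=AFFIXES):
    """Accept a split only if none of its constituents is a known affix."""
    return all(part not in affixes for part in constituents)


print(accept_split(("pra", "nAm")))      # rejected: "pra" is an affix
print(accept_split(("rasa", "yanika")))  # accepted
```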
4.1.2. Length Based Heuristics
Length-based heuristics can be used to keep stray characters from being considered as valid
words by the algorithm. But this strategy does not work very well for Marathi. In Marathi, words with very few
characters have the potential to form compound words. For example, the two character word nav “new” can
form an array of compound words such as navvarsh “new year” and navjivan “new life”. In the current system,
only single character words are treated as invalid.
4.1.3. Consonant-Vowel (CV) Pairs
Consonant-vowel (CV) combinations can occur independently in Marathi text but they cannot be a
constituent of a compound word. Hence, the algorithm should check that none of the generated constituents is a
CV pair. Examples of valid Marathi CV pairs are ne and ki.
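The length and CV-pair checks can be combined into one validity test, sketched here in Python for Devanagari input. It assumes that "character" means a Unicode code point and that a CV pair is one consonant (U+0915 to U+0939) followed by one dependent vowel sign (U+093E to U+094C); both are assumptions made for this illustration, as are the function names.

```python
def is_consonant(ch: str) -> bool:
    return "\u0915" <= ch <= "\u0939"  # क .. ह


def is_vowel_sign(ch: str) -> bool:
    return "\u093e" <= ch <= "\u094c"  # dependent vowel signs (matras)


def is_cv_pair(word: str) -> bool:
    return len(word) == 2 and is_consonant(word[0]) and is_vowel_sign(word[1])


def is_valid_constituent(word: str) -> bool:
    """Reject single characters and bare consonant-vowel pairs."""
    return len(word) > 1 and not is_cv_pair(word)


print(is_valid_constituent("\u0928\u0935"))  # नव (nav): valid
print(is_valid_constituent("\u0928\u0947"))  # ने (ne): CV pair, rejected
print(is_valid_constituent("\u0928"))        # न: single character, rejected
```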
4.1.4. Problems Related to Word-forms
Another problem associated with Marathi compound splitting is related to a root word and its different
forms. Let us take the example of the Marathi root sany “wind” and one of its word-forms sany “windy”, which is
a constituent of the compound word sanyuga “aeroplane”. The constituents of the compound
sanyuga are sany and uga. Let us consider the case when the algorithm reads the words in the sequence sanyuga,
sany, uga and uga. After reading the first two words, the algorithm will split it into the parts sany and uga and
insert the combination into the suspicion list since the existence of the second constituent is not yet known.
uga is not a valid Marathi word. Hence, even though both the constituents are present, the algorithm will fail in
splitting the compound sanyuga into its correct constituent parts. A possible solution to this problem can be the
detection of different splitting points and subsequently selecting the best split based on a probabilistic measure.
This feature, however, is not incorporated into the current system.
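One way such a selection could work is sketched below in Python. The scoring rule (product of the relative corpus frequencies of the two parts) is an assumption chosen for illustration, since the text does not specify the probabilistic measure; the words and counts are placeholders, not corpus data.

```python
def candidate_splits(word, freq):
    """All two-way splits whose parts both occur as independent words."""
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if word[:i] in freq and word[i:] in freq
    ]


def best_split(word, freq):
    """Pick the candidate whose parts are jointly most frequent (assumed measure)."""
    total = sum(freq.values())

    def score(pair):
        first, second = pair
        return (freq[first] / total) * (freq[second] / total)

    candidates = candidate_splits(word, freq)
    return max(candidates, key=score) if candidates else None


# Placeholder counts for illustration only.
counts = {"ab": 5, "cde": 3, "abc": 1, "de": 1}
print(candidate_splits("abcde", counts))
print(best_split("abcde", counts))
```

With these placeholder counts the split at the more frequent pair of parts wins over the alternative splitting point.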
V. Experimental Results
In the first experiment, the compound extraction algorithm was used to generate a compound word
lexicon. The experiment was carried out to observe the number of compound words extracted by the algorithm
from a given text segment.
Table 2: Performance of our split algorithm on a text segment
Total Words in Text Segment                    2400329
Total Unique Words                             246300
Total Unique Words Marked as Compound          48420
Proportion of Compound Words (as detected)     19.66 %

A second experiment was carried out to study compound splitting accuracy. A compound word lexicon
was generated using the algorithm described above. Final trie contents were also dumped since these words are
the atomic word forms. Each word in the compound word lexicon forms one of the constituents of a compound
word. The contents of the trie were merged with the compound word lexicon generating a combined lexicon.
This experiment, in a way, tested the quality of the combined lexicon in terms of its coverage of independent
word forms. Two sets of compound words (50 compound words each) were manually prepared. These lists
were prepared without any knowledge of the words present in the generated compound word lexicon. A variant
of the algorithm described above was used in this case. The algorithm first loaded all the words in the combined
word lexicon. After this, words from the test files were matched against the loaded words and were split
accordingly. The algorithm was allowed to cause splits only in the test words. The results are shown in Table 3.
Compound words for which all the constituents satisfied the different criteria for independent words are
included in the high confidence list. Compound words in the low confidence list are the words for which the
algorithm could not find one of the constituents in the corpus. These are the words which are finally present in
the suspicion list. In Table 3, compound split precision and compound extraction rates are calculated based on
the words included in both the high and low confidence lists. For example, in Set 1, the algorithm marked 48 of
the total 50 words as compound words, achieving a compound extraction rate of 96%. Out of these 48 marked
compound words (42 in the high confidence list and 6 in the low confidence list), 40 are correctly split. Hence,
the split precision rate is 83%.
Table 3: Accuracy of compound word split using our algorithm
                                  Set 1    Set 2
Total Words                       50       50
Number of Confirmed Compounds     42       43
Number of Probable Compounds      6        3
Compounds Correctly Split         40       40
Correct Split Rate                83%      87%
Compound Extraction Rate          96%      92%
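The two rates follow directly from the counts in Table 3: the extraction rate is the share of test words marked as compounds (confirmed plus probable), and the split rate is the share of marked words that were split correctly. A short Python sketch of the calculation, in which the function name is illustrative:

```python
def rates(total, confirmed, probable, correct):
    """Return (extraction rate, correct split rate) from the Table 3 counts."""
    marked = confirmed + probable
    return marked / total, correct / marked


table3 = {"Set 1": (50, 42, 6, 40), "Set 2": (50, 43, 3, 40)}
for name, row in table3.items():
    extraction, precision = rates(*row)
    print(f"{name}: extraction {extraction:.0%}, correct splits {precision:.0%}")
```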
The third experiment was carried out to study the improvement in Marathi Grapheme-to-Phoneme
conversion resulting from the incorporation of the phonetic compound word lexicon into the Marathi G2P
converter [4] [12]. A section of the Emille corpus was randomly selected [13]. The selected text segment was
phonetized using Marathi G2P converter developed as part of the LLSTI initiative [6][12]. Words which were
present in the phonetic compound lexicon were also analysed using rules and the two phonetic transcripts were
manually compared. The results are shown in Table 4.
Table 4: Result of Marathi G2P Conversion
Total Words Analysed                             3497
Words Phonetised Using Compound Word Lexicon     252
Lexicon Correct but Rule Incorrect               65
Rule Correct but Lexicon Incorrect               10
Lexicon and Rule Both Correct                    173
Lexicon and Rule Both Incorrect                  4
Effectively, phonetization of 55 words improved after incorporating the phonetic compound word
lexicon into the Marathi G2P converter (65 words corrected by the lexicon, less 10 words for which only the rules
were correct). Hence, a net improvement of about 1.6% (55 of 3497 words) in Marathi G2P conversion is observed
as a result of using the compound word lexicon generated by the algorithm presented in this paper. Moreover, for
the 252 words phonetised using the lexicon, an improvement of 21.8% is observed.
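The net gain can be recomputed from the counts in Table 4 with a few lines of Python; the function and variable names are illustrative.

```python
def net_improvement(total_words, via_lexicon, lexicon_only_correct, rule_only_correct):
    """Net words improved, and that gain as a share of all words and of lexicon words."""
    gain = lexicon_only_correct - rule_only_correct
    return gain, gain / total_words, gain / via_lexicon


gain, overall, within_lexicon = net_improvement(3497, 252, 65, 10)
print(f"net words improved: {gain}")
print(f"share of all words: {overall:.1%}")
print(f"share of lexicon words: {within_lexicon:.1%}")
```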
VI. Conclusion
An effective algorithm has been proposed for splitting compound words in Marathi. The algorithm has
been tested and found to be effective in splitting 92 to 96% of the input compound words. Of these splits,
around 83 to 87% are found to be correct. One of the possible approaches to increase the accuracy of the split is to
allow for multiple splits (at different points in the same word) of every word, by not removing any suspect
compound word from the trie. To get more potential compound words, the same algorithm can be applied a
second time, after reversing each word, so that the second constituent of each compound word can be identified
first. A near-exhaustive list of affix words of the language can be deployed to minimize or altogether eliminate
wrong splits on account of prefixes and suffixes.

References
[1]. Sangramsing N. Kayte, "Marathi Isolated-Word Automatic Speech Recognition System based on Vector Quantization (VQ) approach", 101st Indian Science Congress, Jammu University, 3 Feb to 7 Feb 2014.
[2]. Monica Mundada, Bharti Gawali, Sangramsing Kayte, "Recognition and classification of speech and its related fluency disorders", International Journal of Computer Science and Information Technologies (IJCSIT).
[3]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis Systems for Marathi Language", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 5, Ver. I (Sep-Oct. 2015), PP 76-81, e-ISSN: 2319-4200, p-ISSN: 2319-4197.
[4]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis System for Hindi", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015.
[5]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali, "Classification of Fluent and Dysfluent Speech Using KNN Classifier", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014.
[6]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "A Corpus-Based Concatenative Speech Synthesis System for Marathi", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 20-26, e-ISSN: 2319-4200, p-ISSN: 2319-4197.
[7]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "A Marathi Hidden-Markov Model Based Speech Synthesis System", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 34-39, e-ISSN: 2319-4200, p-ISSN: 2319-4197.
[8]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Implementation of Marathi Language Speech Databases for Large Dictionary", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 40-45, e-ISSN: 2319-4200, p-ISSN: 2319-4197.
[9]. Sangramsing Kayte, Monica Mundada, Santosh Gaikwad, Bharti Gawali, "Performance Evaluation Of Speech Synthesis Techniques For English Language", International Congress on Information and Communication Technology, 9-10 October 2015.
[10]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Performance Calculation of Speech Synthesis Methods for Hindi language", IOSR Journal of VLSI and Signal Processing (IOSR-JVSP), Volume 5, Issue 6, Ver. I (Nov-Dec. 2015), PP 13-19, e-ISSN: 2319-4200, p-ISSN: 2319-4197.
[11]. Sangramsing Kayte, Monica Mundada, "Study of Marathi Phones for Synthesis of Marathi Speech from Text", International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Volume 4, Issue 10, October 2015.
[12]. Sangramsing Kayte, Monica Mundada, Dr. Charansing Kayte, "Di-phone-Based Concatenative Speech Synthesis System for Hindi", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 5, Issue 10, October 2015.
[13]. Emille, 2003. The EMILLE (Enabling Minority Language Engineering) Project. (http://www.emille.lancs.ac.uk).
[14]. Sangramsing Kayte, Dr. Bharti Gawali, "Marathi Speech Synthesis: A review", International Journal on Recent and Innovation Trends in Computing and Communication, ISSN: 2321-8169, Volume 3, Issue 6, PP 3708-3711.
[15]. Monica Mundada, Sangramsing Kayte, "Classification of speech and its related fluency disorders Using KNN", ISSN 2231-0096, Volume 4, Number 3, Sept 2014.
[16]. Monica Mundada, Sangramsing Kayte, Dr. Bharti Gawali, "Classification of Fluent and Dysfluent Speech Using KNN Classifier", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 4, Issue 9, September 2014.
