The International Journal of Multimedia & Its Applications (IJMA) Vol.12, No. 5, October 2020
DOI: 10.5121/ijma.2020.12501
QUALITATIVE ANALYSIS OF PLP IN LSTM FOR BANGLA SPEECH RECOGNITION

Nahyan Al Mahmud¹ and Shahfida Amjad Munni²

¹ Department of Electrical and Electronic Engineering, Ahsanullah University of Science and Technology, Dhaka, Bangladesh
² Cygnus Innovation Limited, Dhaka, Bangladesh
ABSTRACT
The performance of various acoustic feature extraction methods has been compared in this work using a Long Short-Term Memory (LSTM) neural network in a Bangla speech recognition system. Acoustic features are a series of vectors that represent the speech signal; they can be classified into either words or sub-word units such as phonemes. In this work, linear predictive coding (LPC) is used first as the acoustic vector extraction technique, chosen for its widespread popularity. Other vector extraction techniques, Mel frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP), have then been applied; these two methods more closely resemble the human auditory system. The feature vectors are used to train the LSTM neural network, and the resulting models of different phonemes are compared using two statistical tools, namely the Bhattacharyya distance and the Mahalanobis distance, to investigate the nature of those acoustic features.
KEYWORDS
LSTM, Perceptual linear prediction, Mel frequency cepstral coefficients, Bhattacharyya Distance,
Mahalanobis Distance.
1. INTRODUCTION
Speech is the most effective way of communication among people and the most natural way of conveying information. Therefore, studies are being conducted in a number of fields to enable vocal interaction with computers. The objectives are to simulate the human ability to talk, to have computers carry out simple tasks through human-machine interaction, and to turn speech into text through Automatic Speech Recognition (ASR) systems. Recently, signal processing and detection have been applied in a variety of fields such as human activity tracking and recognition [1,2], computer engineering, physical sciences, health-related applications [3], and natural science and industrial applications [4].
In recent years, neural network technology has significantly enhanced the accuracy of ASR systems, and many applications are being developed for smartphones and intelligent personal assistants. Much research on end-to-end speech recognition is being conducted to replace the hidden Markov model (HMM) based techniques that have been used for many years. End-to-end neural models include connectionist temporal classification (CTC)-trained recurrent neural networks (RNN), encoder-decoder architectures [5], and RNN transducers [6,7]. Although HMM-based algorithms can be considered computationally efficient, they require many irregular memory accesses and a large memory footprint. In contrast, RNN-based ASR has the advantage of a low memory footprint; however, it demands many computationally expensive arithmetic operations for real-time inference.
A great deal of work has been done in ASR for the major languages of the world. Regrettably, only a handful of studies have been carried out for Bangla, which is among the most widely spoken languages in the world by number of speakers. Some of these efforts can be found in [8]. However, the majority of these studies focused on simple word-level detection and worked on very small databases. These works also did not account for the various dialects spoken in different parts of the country. The main reason for this is the lack of a proper speech database.
The primary target of this study is to examine the efficiency of various acoustic vectors for Bangla speech detection using an LSTM neural network, and to assess their performance based on different statistical parameters.
2. DATA PREPARATION
2.1. Bangla Speech Database
At present, the real difficulty in experimenting with Bangla ASR is the absence of a proper Bangla speech database. The existing databases are either not large enough or do not consider the colloquial variations among speakers. Moreover, the majority of them are not publicly available, which is a massive hindrance to conducting research. Another difficulty is that a database must be properly segmented and labelled before it can be used for supervised learning. To overcome this problem, we had to come up with our own solution [9].
The samples were first divided into two groups: training and testing. The training corpus was prepared by 50 male and 50 female speakers, and a separate set of 20 male and female speakers voiced the test samples. 200 common Bangla sentences were chosen for this database.
A medium-sized room with styrofoam damping was used to record the samples. A dynamic unidirectional microphone was used to collect the training samples, while a moderate-grade smartphone was used for the testing samples in order to mimic real-world conditions.
2.2. Acoustic Feature Vectors
Linear predictive coding (LPC) was first introduced in 1966 and is widely used in audio signal processing, speech processing and telephony. LPC embodies the spectral envelope of speech in compressed form, using a linear predictive methodology. “These techniques can thus serve as a tool for modifying the acoustic properties of a given speech signal without degrading the speech quality. Some other potential applications of these techniques are in the areas of efficient storage and transmission of speech, automatic formant and pitch extraction, and speaker and speech recognition” [10].
The most efficient acoustic feature extraction methods are mainly based on models of the human hearing system. The human auditory system has some characteristic properties, such as a nonlinear frequency scale, spectral amplitude compression, and reduced sensitivity at lower frequencies. Mel Frequency Cepstrum Coefficients (MFCC) and Perceptual Linear Prediction (PLP) adopt the behaviour of the human auditory system to some extent. MFCC was first introduced in [11]. PLP, introduced in [12], is based on ideas similar to the MFCCs, but its computational steps are somewhat more elaborate and thus produce a marginally better result. The two pipelines are illustrated in Figure 1.
Figure 1. The steps of PLP and MFCC feature vector extraction.
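As a concrete illustration of these front-ends, the following is a minimal sketch of LPC and MFCC extraction using the librosa library. The file name, frame sizes and coefficient orders are illustrative assumptions rather than the settings used in this work, and PLP is omitted because librosa does not provide it; a third-party package or a custom implementation of the steps in Figure 1 would be needed.

```python
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=16000)   # hypothetical recording

# LPC: one coefficient vector per 25 ms frame with a 10 ms hop (order 12
# yields 13 coefficients; all-silent frames may be ill-conditioned).
frames = librosa.util.frame(y, frame_length=400, hop_length=160)
lpc = np.stack([librosa.lpc(f, order=12) for f in frames.T])

# MFCC: 13 cepstral coefficients per frame over a mel filter bank.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160).T

print(lpc.shape, mfcc.shape)   # one 13-dim vector per frame for each method
```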
3. EXPERIMENTAL SETUP
3.1. LSTM Neural Network Structure
LSTM, and more generally RNN-based, ASR systems [13,14,15] trained with CTC [16] have recently been shown to work extremely well when there is an abundance of training data, matching and exceeding the performance of hybrid DNN systems [15]. In a conventional RNN, when the time sequence is long, the error signal that must be looped back through the network shrinks at every step, and this vanishing gradient inherently breaks down the network-updating scheme [17]. LSTM is naturally immune to this problem, as it can employ its comparatively long-term memory structure [5].
LSTM and RNN are very prominent in sequence prediction and labelling; at the same time, LSTM has proved to perform better than plain RNN and DNN in this context [18]. Recently, following the successful application of DNNs to acoustic modelling [19,20,21] and speech enhancement [22], a CTC output layer has been imposed on the acoustic model [23,24] in conjunction with a deep LSTM network. Figure 2 shows the LSTM speech recognition block.
Figure 2. LSTM speech detection block
A standard LSTM neural network is comprised of an input layer, a recurrent LSTM layer and an output layer. In this study, the input layer is a Time Delay Neural Network (TDNN) layer, which is connected to the hidden layer.
The LSTM structure proposed in [23] has been used in this work. As this method is based on CTC training, it yields a significant performance gain in speech detection tasks; the technique has also been adopted by Google. The typical LSTM cell structure is illustrated in Figure 3. The LSTM cell consists of three gates: input, forget, and output, which control, respectively, what fraction of the input is passed to the "memory" cell, what fraction of the stored cell memory is retained, and what fraction of the cell memory is output.
Figure 3. LSTM cell model
Equations (1)-(5) are the vector formulas that describe the LSTM cell:

$i_t = \sigma(W_{ix} x_t + W_{ir} r_{t-1} + b_i)$   (1)
$f_t = \sigma(W_{fx} x_t + W_{fr} r_{t-1} + b_f)$   (2)
$c_t = f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} r_{t-1} + b_c)$   (3)
$o_t = \sigma(W_{ox} x_t + W_{or} r_{t-1} + W_{oc} c_t + b_o)$   (4)
$h_t = o_t \odot \Phi(c_t)$   (5)

where $x_t$ is the input vector, $r_{t-1}$ is the recurrent output of the previous step, $\sigma$ is the logistic sigmoid, and $g$ and $\Phi$ are the cell input and output activation functions.
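To make the notation concrete, the following is a minimal NumPy sketch of one cell step following Eqs. (1)-(5). The dictionary layout of the weights is an assumption, and $g$ and $\Phi$ are taken to be tanh, as is conventional.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, r_prev, c_prev, W, b):
    """One LSTM time step following Eqs. (1)-(5); W and b hold the
    input (x) and recurrent (r) weight matrices and biases per gate."""
    i = sigmoid(W["ix"] @ x_t + W["ir"] @ r_prev + b["i"])                   # (1)
    f = sigmoid(W["fx"] @ x_t + W["fr"] @ r_prev + b["f"])                   # (2)
    c = f * c_prev + i * np.tanh(W["cx"] @ x_t + W["cr"] @ r_prev + b["c"])  # (3)
    o = sigmoid(W["ox"] @ x_t + W["or"] @ r_prev + W["oc"] @ c + b["o"])     # (4)
    h = o * np.tanh(c)                                                       # (5)
    return h, c
```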
Typically, LSTM parameters are initialized to small random values in an interval. For the forget gate, however, this is a suboptimal choice: small weights effectively close the gate, preventing the cell memory and its gradients from flowing through time. In this work this issue has been addressed by initializing the forget gate bias to a large value.
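A hedged sketch of such a network in Keras is given below. The layer sizes, the 40-dimensional feature input and the phoneme inventory size are assumptions; the Conv1D layer stands in for the TDNN input layer, and unit_forget_bias=True makes Keras initialize the forget-gate bias to one, corresponding to the large-bias initialization described above.

```python
import tensorflow as tf

NUM_PHONES = 48                    # assumed phoneme inventory size

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, 40)),              # (frames, features)
    tf.keras.layers.Conv1D(256, kernel_size=5, padding="same",
                           activation="relu"),            # TDNN-style layer
    tf.keras.layers.LSTM(512, return_sequences=True,
                         unit_forget_bias=True),          # forget bias -> 1
    tf.keras.layers.Dense(NUM_PHONES + 1, activation="softmax"),  # +1 CTC blank
])
model.summary()
```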
3.2. Bhattacharyya Distance
The Bhattacharyya distance [25,26] measures the similarity of two discrete or continuous probability distributions. For two multivariate normal distributions it is given by

$D_B = \frac{1}{8}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2) + \frac{1}{2}\ln\left(\frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}}\right)$   (6)

where $\mu_i$ and $\Sigma_i$ are the means and covariances of the two distributions, respectively, and

$\Sigma = \frac{\Sigma_1 + \Sigma_2}{2}$   (7)
3.3. Mahalanobis Distance
The Mahalanobis distance [26], introduced by P. C. Mahalanobis in 1936, is a measure of the distance between a point and a distribution:

$D_M(x) = \sqrt{(x - \mu)^T \Sigma^{-1} (x - \mu)}$   (8)

where $\mu$ and $\Sigma$ are the mean and covariance of the distribution, respectively, and $x$ is the point whose distance is to be measured. In this work, the Mahalanobis distance of each frame is measured for different phonemes.
4. PERFORMANCE ANALYSIS
The next step is training the LSTM model using the recorded training set of Bangla speech. Three popular feature extraction methods, namely LPC, MFCC and PLP, have been tried, training with 5000 male samples and testing the model with 1000 different male samples. This combination has been chosen carefully as it performed reasonably well in [9].
Statistical distance quantifies the distance between two statistical objects, which can be two random variables, two probability distributions, or two samples. The distance can also be between an individual sample point and a population or a wider sample of points.
In this work, both the Bhattacharyya distance and the Mahalanobis distance of various phoneme coefficients have been computed for LPC, MFCC and PLP.
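A minimal NumPy sketch of Eq. (6) is given below, assuming each phoneme is summarised by the Gaussian statistics (mean and covariance) of its feature frames.

```python
import numpy as np

def bhattacharyya(mu1, S1, mu2, S2):
    """Eq. (6): Bhattacharyya distance between N(mu1, S1) and N(mu2, S2)."""
    S = 0.5 * (S1 + S2)                                    # Eq. (7)
    d = mu1 - mu2
    quad = 0.125 * d @ np.linalg.solve(S, d)               # quadratic term
    logdet = 0.5 * np.log(np.linalg.det(S) /
                          np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return quad + logdet
```

The measure is symmetric by construction, which is why the duplicated pairs in Table 1 (e.g. d/dh and dh/d) share the same value.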
Table 1. Bhattacharyya distance between similar sounding phonemes
Phonemes to be compared LPC MFCC PLP
'"r"' '"er"' 1.268892 1.609877 1.935885
'"b"' '"v"' 0.602648 1.019706 1.176067
'"b"' '"bh"' 0.674364 1.99346 2.228276
'"d"' '"dh"' 0.091483 0.732092 0.630423
'"aa"' '"ax"' 1.398714 0.257079 0.278304
'"ch"' '"chh"' 0.477755 0.288733 0.318336
'"ey"' '"y"' 6.73841 1.88483 2.006171
'"n"' '"ng"' 1.090365 2.900765 2.830959
'"uh"' '"ih"' 1.399962 3.502966 3.659523
'"uh"' '"eh"' 9.00648 3.199748 3.829925
'"uh"' '"u"' 7.451423 1.100306 1.12853
'"dh"' '"d"' 0.091483 0.732092 0.630423
'"s"' '"sh"' 5.384669 0.149458 0.150676
'"v"' '"bh"' 0.513366 1.427321 1.206649
'"oy"' '"ay"' 3.930717 1.342863 1.604389
'"hh"' '"h"' 1.363757 2.696534 2.499212
'"ah"' '"ax"' 8.392236 0.288995 0.300063
'"ay"' '"oy"' 3.930717 1.342863 1.604389
'"z"' '"jh"' 3.993217 0.83873 0.956689
'"z"' '"j"' 7.540232 0.810661 0.860811
'"zh"' '"jh"' 0.934331 1.320837 1.200627
'"j"' '"z"' 7.540232 0.810661 0.860811
'"j"' '"zh"' 1.583787 1.722148 1.625498
'"j"' '"jh"' 3.002899 0.511258 0.524364
Figure 4. Mahalanobis distance between the phoneme ‘sil’ and the word “bideshi bonduk amdani”
Table 2. Timeframe of the word “bideshi” in Figure 4.
Frame start Frame end Phoneme
0 53 sil
53 63 bi
63 69 d
69 74 ey
74 78 eh
78 89 sh
89 94 ih
94 100 sp
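A minimal sketch of how such a per-frame trace can be computed from Eq. (8) follows, assuming the phoneme 'sil' is modelled by the Gaussian statistics (µ, Σ) of its training frames.

```python
import numpy as np

def mahalanobis_trace(frames, mu, S):
    """Eq. (8) per frame: distance of each feature frame from a
    phoneme model, as plotted against 'sil' in Figure 4."""
    S_inv = np.linalg.inv(S)
    diffs = frames - mu                        # (n_frames, dim)
    return np.sqrt(np.einsum("fd,dk,fk->f", diffs, S_inv, diffs))

# A sharp rise in the trace near frame 53 would mark the end of the
# leading silence and the onset of "bideshi", as in Table 2.
```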
5. DISCUSSION
From Table 1 it is evident that both PLP and LPC show some benefit over MFCC for similar-sounding phonemes. However, in many cases LPC produces abnormally large distances that should not be so large, given that these phonemes sound similar.
If the timeframe of the word “bideshi” in Table 2 is matched with Figure 4, both PLP and MFCC manage to distinguish the 53rd frame from the phoneme ‘sil’, while LPC only detects something at the onset of the phoneme ‘sh’ at frame 78.
6. CONCLUSION
From Table 1, Table 2 and Figure 4 it is clear that PLP and MFCC show significantly better results than LPC in terms of statistical distance. Combined with the findings in [9], it can be said that PLP performs better than MFCC. So for a large data set, or for large sets of training and testing, PLP seems to be the better alternative. However, RNN and LSTM rely mainly on sequential processing, which makes them inherently slow. The performance of PLP could shine through by combining it with a more advanced neural network such as the transformer network [27]. As the transformer network can process its input in parallel and thus faster, PLP should perform even better within this type of framework.
REFERENCES
[1] Uddin, M.T.; Uddiny, M.A. “Human activity recognition from wearable sensors using extremely
randomized trees”. In Proceedings of the International Conference on Electrical Engineering and
Information Communication Technology (ICEEICT), Dhaka, Bangladesh, 13–15 September 2015;
pp. 1–6.
[2] Jalal, A. “Human activity recognition using the labelled depth body parts information of depth
silhouettes”. In Proceedings of the 6th International Symposium on Sustainable Healthy Buildings,
Seoul, Korea, 27 February 2012; pp. 1–8.
[3] Ahad, M.A.R.; Kobashi, S.; Tavares, J.M.R. “Advancements of image processing and vision” in
healthcare. J. Healthcare Eng. 2018.
[4] Jalal, A.; Quaid, M.A.K.; Hasan, A.S. “Wearable sensor-based human behaviour understanding and
recognition in daily life for smart environments”. In Proceedings of the International Conference on
Frontiers of Information Technology (FIT), Islamabad, Pakistan, 18–20 December 2017.
[5] C. Chiu, T. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. Weiss, K. Rao, E. Gonina, et al. “State-of-the-art speech recognition with sequence-to-sequence models”. In Acoustics, Speech and Signal Processing (ICASSP), 2018 IEEE International Conference, pages 4774–4778. IEEE, 2018.
[6] Kanishka Rao, Haşim Sak, and Rohit Prabhavalkar. “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer”. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pages 193–199. IEEE, 2017.
[7] Eric Battenberg, Jitong Chen, Rewon Child, Adam Coates, Yashesh Gaur, Yi Li, Hairong Liu, Sanjeev Satheesh, Anuroop Sriram, and Zhenyao Zhu. “Exploring neural transducers for end-to-end speech recognition”. In Automatic Speech Recognition and Understanding Workshop (ASRU), 2017 IEEE, pages 206–213. IEEE, 2017.
[8] Kishore, S., Black, A., Kumar, R., and Sangal, R. “Experiments with unit selection speech databases
for Indian languages,” National Seminar on Language Technology Tools: Implementation of Telugu,
Hyderabad, India, October 2003.
[9] Nahyan A. M. “Performance analysis of different acoustic features based on LSTM for Bangla speech recognition.” The International Journal of Multimedia & Its Applications (IJMA) Vol. 12, No. 1/2/3/4, August 2020. DOI: 10.5121/ijma.2020.12402
[10] Rafal J., Wojciech Z., and Ilya S. “An empirical exploration of recurrent network architectures”. In
Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France,
6-11 July 2015, pages 2342–2350, 2015.
[11] Atal, B. S., “Speech analysis and synthesis by linear prediction of the speech wave.” The Journal of
The Acoustical Society of America 47 (1970) 65.
[12] Davis, S. B and Mermelstein, P., “Comparison of parametric representations for monosyllabic word
recognition in continuously spoken sentences,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. ASSP-28, no. 4, pp. 357 – 366, August 1980.
[13] Alex Graves and Navdeep Jaitly. “Towards end-to-end speech recognition with recurrent neural
networks”. In Proceedings of the 31th International Conference on Machine Learning, ICML 2014,
Beijing, China, 21-26 June 2014, pages 1764–1772, 2014.
[14] Awni Y. Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan
Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. “Deep speech:
scaling up end-to-end speech recognition”. CoRR, abs/1412.5567, 2014.
[15] Hagen Soltau, Hank Liao, and Hasim Sak. “Neural speech recognizer: acoustic-to-word LSTM model
for large vocabulary speech recognition”. CoRR, abs/1610.09975, 2016.
[16] Alex Graves, Santiago Fernández, Faustino J. Gomez, and Jürgen Schmidhuber. “Connectionist
temporal classification: labelling unsegmented sequence data with recurrent neural networks”. In
Machine Learning, Proceedings of the Twenty-Third International Conference (ICML 2006),
Pittsburgh, Pennsylvania, USA, June 25-29, 2006, pages 369–376, 2006.
[17] Sepp Hochreiter and Jürgen Schmidhuber, “Long short-term memory”, Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[18] Mike Schuster and Kuldip K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[19] Abdel-rahman Mohamed, George E. Dahl, and Geoffrey E. Hinton, “Acoustic modeling using deep belief networks,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 14–22, 2012.
[20] George E. Dahl, Dong Yu, Li Deng, and Alex Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Transactions on Audio, Speech & Language Processing, vol. 20, no. 1, pp. 30–42, Jan. 2012.
[21] Navdeep Jaitly, Patrick Nguyen, Andrew Senior, and Vincent Vanhoucke, “Application of pre-trained deep neural networks to large vocabulary speech recognition,” in Proceedings of INTERSPEECH, 2012.
[22] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, “An experimental study on speech enhancement based on deep neural networks”, IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, Jan. 2014.
[23] Haşim Sak, Andrew Senior, and Françoise Beaufays, “Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition”, arXiv e-prints, Feb. 2014.
[24] Z. Chen, Y. Zhuang, Y. Qian, K. Yu, et al. “Phone synchronous speech recognition with CTC
lattices” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 25(1): 90–
101, 2017
[25] S. Dubuisson. “The computation of the Bhattacharyya distance between histograms without
histograms” 2nd International Conference on Image Processing Theory Tools and Applications
(IPTA'10), Jul 2010, Paris, France. pp.373-378,
[26] W. F. Basener and M. Flynn "Microscene evaluation using the Bhattacharyya distance", Proc. SPIE
10780, Multispectral, Hyperspectral, and Ultraspectral Remote Sensing Technology, Techniques and
Applications VII, 107800R (23 October 2018); https://doi.org/10.1117/12.2327004
[27] Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omer; Bowman, Samuel (2018).
"GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding".
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural
Networks for NLP. Stroudsburg, PA, USA: Association for Computational Linguistics: 353–355.
AUTHORS
Nahyan Al Mahmud
Mr. Mahmud graduated from the Department of Electrical and Electronic Engineering of Ahsanullah University of Science and Technology (AUST), Dhaka, in 2008. He completed the MSc program (EEE) at Bangladesh University of Engineering & Technology (BUET), Dhaka. Currently he is working as an Assistant Professor in the EEE Department at AUST. His research interests include system and signal processing, analysis and design.
Shahfida Amjad Munni
Shahfida Amjad Munni completed her Master of Engineering degree at the Institute of Information and Communication Technology (IICT) of Bangladesh University of Engineering and Technology (BUET), Bangladesh, in March 2018. Currently she is working as a software engineer at Cygnus Innovation Limited, Bangladesh. Her research interests include optical wireless communication, data science, wireless communication, OFDM modulation and LiFi.