Speech Recognition with Neural Networks
Pruthvij Thakar | Gurtej Khanooja
{thakar.p, khanooja.g} @husky.neu.edu
CSYE 7245, Fall 2017, Northeastern University
Introduction:
Speech recognition is an interesting topic in deep learning because of its difficulty
and its wide range of applications: Amazon Echo, Apple Siri, dictation software, and
so on. The most common architectures used in speech recognition are recurrent
neural networks (RNNs) and their variants. We implemented several RNN-based
network architectures, reported and compared their results, and explain our
reasoning for choosing the best architecture.
The general outline for building end-to-end speech recognition is shown below:
• STEP 1 is a pre-processing step that converts raw audio to one of two
feature representations commonly used for ASR.
• STEP 2 is an acoustic model which accepts audio features as input and
returns a probability distribution over all potential transcriptions. After
reviewing the basic types of neural networks that are often used for
acoustic modeling, we designed our own acoustic model.
• STEP 3 in the pipeline takes the output from the acoustic model and
returns a predicted transcription (a minimal decoding sketch follows this list).
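As a concrete illustration of STEP 3, below is a minimal greedy-decoding sketch. It assumes a CTC-style acoustic model that emits, at every time step, a probability distribution over characters plus a blank symbol; the character inventory, blank index, and function name are illustrative assumptions, since the report does not spell out the decoding scheme.

import numpy as np

# Hypothetical character inventory (26 letters, apostrophe, space); the last
# index is treated as the CTC blank symbol.
ALPHABET = list("abcdefghijklmnopqrstuvwxyz' ")
BLANK = len(ALPHABET)

def greedy_decode(probs):
    """probs: array of shape (time_steps, num_symbols) from the acoustic model."""
    best = np.argmax(probs, axis=1)             # most likely symbol per frame
    collapsed = [s for i, s in enumerate(best)  # collapse repeated symbols
                 if i == 0 or s != best[i - 1]]
    return "".join(ALPHABET[s] for s in collapsed if s != BLANK)

# Usage (hypothetical): transcript = greedy_decode(acoustic_model.predict(features)[0])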
The workflow being followed is as follows:
• Data Collection
• Data Preprocessing
• Acoustic Features for Speech Recognition
• Deep Neural Networks for Acoustic Modeling
▪ RNN
▪ RNN + TimeDistributed Dense
▪ CNN + RNN
▪ Deeper RNN + TimeDistributed Dense
▪ Bidirectional RNN + TimeDistributed Dense
▪ Our own model architecture
• Compare Models
• Final Model
The Data:
We begin by investigating the dataset that will be used to train and evaluate our
pipeline. VoxForge provides a dataset of read English speech, designed for training
and evaluating models for automatic speech recognition (ASR). The dataset
contains 130 hours of speech. We work with a small subset in this project, since
training on the larger-scale data would take a long time.
In the code cells of the project, we use the vis_train_features module to visualize a
training example. It returns:
• the transcribed text (label) for the training example
• the raw audio waveform for the training example
• the Mel-frequency cepstral coefficients (MFCCs) for the training example
• the spectrogram for the training example
• the file path to the training example
There are 3126 total training examples.
The following code visualizes the audio waveform for a chosen example,
along with the corresponding transcript.
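A minimal sketch of this visualization step is given below; the import path and the return signature of vis_train_features are assumptions based on the five items listed above, not the project's exact code.

import matplotlib.pyplot as plt
from data_generator import vis_train_features  # assumed location of the helper

(vis_text, vis_raw_audio, vis_mfcc_feature,
 vis_spectrogram_feature, vis_audio_path) = vis_train_features()

plt.figure(figsize=(12, 3))
plt.plot(vis_raw_audio)            # raw audio waveform of the chosen example
plt.title("Audio waveform")
plt.xlabel("Sample index")
plt.show()
print("Transcript:", vis_text)
print("File path:", vis_audio_path)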
Acoustic Features for Speech Recognition:
Raw audio is generally not a good input representation for neural networks.
Acoustic features extracted from the audio yield better results. We explored two
acoustic features for this project: spectrograms and MFCCs.
Spectrograms:
The first option for an audio feature representation is the spectrogram. The
extraction code returns the spectrogram as a 2D tensor, where the first (vertical)
dimension indexes time and the second (horizontal) dimension indexes frequency.
To speed the convergence of the algorithm, we also normalize the spectrogram
(this can be seen in the visualization by noting that the mean value hovers around
zero and most entries in the tensor assume values close to zero).
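A minimal sketch of normalized log-spectrogram extraction is shown below; it uses SciPy's STFT, and the window size, hop size, and helper name are illustrative assumptions rather than the project's exact routine.

import numpy as np
import soundfile as sf              # assumed audio-loading library
from scipy.signal import stft

def spectrogram_from_file(path, window_ms=20, hop_ms=10, eps=1e-14):
    audio, rate = sf.read(path)
    nperseg = int(rate * window_ms / 1000)
    noverlap = nperseg - int(rate * hop_ms / 1000)
    _, _, spec = stft(audio, fs=rate, nperseg=nperseg, noverlap=noverlap)
    feats = np.log(np.abs(spec).T ** 2 + eps)   # (time, frequency)
    # Normalize so the mean hovers around zero, as noted above.
    return (feats - feats.mean()) / (feats.std() + eps)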
Mel-Frequency Cepstral Coefficients (MFCCs):
The second option for an audio feature representation is MFCCs. The main idea
behind MFCC features is the same as for spectrogram features: at each time window,
the MFCC feature yields a vector that characterizes the sound within the window.
Note that the MFCC feature is much lower-dimensional than the spectrogram
feature, which can help an acoustic model avoid overfitting to the training dataset.
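A minimal sketch of MFCC extraction is shown below, using the python_speech_features package as one common choice; the report does not name the library actually used, so the call and its parameters are assumptions.

import soundfile as sf
from python_speech_features import mfcc

def mfcc_from_file(path, numcep=13, eps=1e-14):
    audio, rate = sf.read(path)
    feats = mfcc(audio, samplerate=rate, numcep=numcep)   # shape (time, numcep)
    # Same normalization as for the spectrogram features.
    return (feats - feats.mean()) / (feats.std() + eps)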
Deep Neural Networks for Acoustic Modeling
Model 0 (RNN + TimeDistributed Dense)
Model 0: fully connected layer after the RNN. This model is a small upgrade on a
plain RNN: in a plain RNN the recurrent outputs are used directly to compute the
loss and the output probabilities for the spoken characters, which does not work
well in practice. The idea of this model is to add a fully connected (TimeDistributed
Dense) layer between the RNN and the output probabilities, giving the network
more capacity to turn the recurrent states into better predictions. This add-on
really helped in decreasing the loss. As the training graph shows, the model
immediately started with a loss of about 300, but it did not improve over time in
either training or validation loss.
Implementation:
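A minimal Keras sketch of this architecture is given below; the GRU cell, unit count, and a 29-symbol output inventory (characters plus a blank) are illustrative assumptions, not the exact values used in the project.

from keras.models import Model
from keras.layers import Input, GRU, TimeDistributed, Dense, Activation

def rnn_td_model(input_dim, units=200, output_dim=29):
    inputs = Input(shape=(None, input_dim), name="the_input")
    rnn = GRU(units, return_sequences=True, name="rnn")(inputs)       # recurrent layer
    dense = TimeDistributed(Dense(output_dim), name="td_dense")(rnn)  # per-time-step dense
    probs = Activation("softmax", name="softmax")(dense)              # output probabilities
    return Model(inputs=inputs, outputs=probs)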
Model 1 (CNN + RNN)
Model 1: CNN model. In this model we added a CNN layer at the beginning (before
the RNN layer), so the spectrogram or MFCC features are first analyzed with
convolutional kernels. With this strategy the network extracts only the important
features and passes them to the RNN part. This strategy showed the best results of
all the models tested: training loss kept decreasing over time, and validation loss
also decreased substantially. These results helped us decide which layers and
architecture to use in the final model.
Implementation
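A minimal Keras sketch of the CNN + RNN architecture follows; the filter count, kernel size, stride, and unit count are illustrative assumptions.

from keras.models import Model
from keras.layers import Input, Conv1D, GRU, TimeDistributed, Dense, Activation

def cnn_rnn_model(input_dim, filters=200, kernel_size=11, stride=2,
                  units=200, output_dim=29):
    inputs = Input(shape=(None, input_dim), name="the_input")
    conv = Conv1D(filters, kernel_size, strides=stride, padding="valid",
                  activation="relu", name="conv1d")(inputs)   # convolve over time frames
    rnn = GRU(units, return_sequences=True, name="rnn")(conv)
    dense = TimeDistributed(Dense(output_dim), name="td_dense")(rnn)
    probs = Activation("softmax", name="softmax")(dense)
    return Model(inputs=inputs, outputs=probs)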
Model 2 (Deep RNN + TimeDistributed Dense)
Model 2: deep RNN network. This model stacks multiple RNN layers; in the test
above we used 2 RNN layers plus a fully connected layer after the last RNN layer.
This model showed training behavior similar to Model 0: at the beginning of
training the loss decreased, but after about 2.5 epochs it plateaued and stayed at
that level until the end of training. Validation loss stayed constant throughout the
whole training process.
Implementation:
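A minimal Keras sketch with two stacked GRU layers, matching the description above; unit counts are illustrative assumptions.

from keras.models import Model
from keras.layers import Input, GRU, TimeDistributed, Dense, Activation

def deep_rnn_model(input_dim, units=200, recur_layers=2, output_dim=29):
    inputs = Input(shape=(None, input_dim), name="the_input")
    x = inputs
    for i in range(recur_layers):                  # stack multiple RNN layers
        x = GRU(units, return_sequences=True, name="rnn_%d" % i)(x)
    dense = TimeDistributed(Dense(output_dim), name="td_dense")(x)
    probs = Activation("softmax", name="softmax")(dense)
    return Model(inputs=inputs, outputs=probs)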
Model 3 (Deeper RNN + TimeDistributed Dense)
Model 3: bidirectional RNN layers. This model uses bidirectional RNN layers, which
is a logical choice for this use case: with bidirectional RNNs we can analyze the
input signal both from the beginning to the end and from the end to the beginning.
This model showed results similar to Models 1 and 2.
Implementation
Model 4: Bidirectional RNN + Time Distributed Dense
One shortcoming of conventional RNNs is that they are only able to make use of previous
context. In speech recognition, where whole utterances are transcribed at once, there is no
reason not to exploit future context as well. Bidirectional RNNs (BRNNs) do this by
processing the data in both directions with two separate hidden layers which are then fed
forwards to the same output layer.
Implementation:
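A minimal Keras sketch of a bidirectional RNN followed by a TimeDistributed Dense softmax; the GRU cell and unit count are illustrative assumptions.

from keras.models import Model
from keras.layers import Input, GRU, Bidirectional, TimeDistributed, Dense, Activation

def bidirectional_rnn_model(input_dim, units=200, output_dim=29):
    inputs = Input(shape=(None, input_dim), name="the_input")
    # Forward and backward passes over the sequence, merged before the output layer.
    birnn = Bidirectional(GRU(units, return_sequences=True), name="bi_rnn")(inputs)
    dense = TimeDistributed(Dense(output_dim), name="td_dense")(birnn)
    probs = Activation("softmax", name="softmax")(dense)
    return Model(inputs=inputs, outputs=probs)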
Note: The initial scope was to implement the LAS (Listen, Attend and Spell) model
from one of the papers published by Google [10], but the main aim of the project
was speech recognition, so we reduced the scope to the implementation of the five
models described above, whose architectures were inspired by a project that is
part of Udacity's Artificial Intelligence Nanodegree program. We were not quite
satisfied with the performance of the models implemented above, so to improve
further we did some hyperparameter tuning and, using the lessons learned from
the models above, tried to create our own version of the model. The
implementation of the final accepted model and the reasoning behind it are
described below. All models were implemented in Keras.
Model 5: Final Model (Own Architecture):
Implementation:
Model Description and Reasoning:
While creating the final model we tested many different architectures. First, we
started with a deep bidirectional RNN, using 2 and then 3 bidirectional layers, but
this architecture did not work well for us: the loss stayed around 500 through the
whole training. Our second attempt was to add a CNN to that architecture, which
improved the model a bit; it got to a loss of about 230 (both training and validation).
Our third attempt was to add a dilation parameter to the convolutional layer. This
did not improve the network either.
The next step was a plain deep RNN with 5 GRU layers and 40% dropout. This
network was really hard to train, and it improved only by a small amount (from a
loss of 760 to 710 in 20 epochs).
Then we changed the SGD optimizer to the RMSprop optimizer with a learning rate
of 0.02. This did not show any overall improvement.
Finally, we looked at the results of the tested architectures (Models 0-4 above) and
combined a few parts to arrive at the architecture we use now: the convolutional
layer from the CNN + RNN model (Model 1) and the structure of the deep RNN
model. We tried the same model with and without dropout; it seemed to work
better without dropout, so we trained it without dropout.
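Based on the description above, a minimal Keras sketch of the final architecture (convolutional front end, stacked GRU layers, TimeDistributed Dense softmax, no dropout) might look as follows; all layer sizes and counts are illustrative assumptions rather than the exact configuration used.

from keras.models import Model
from keras.layers import Input, Conv1D, GRU, TimeDistributed, Dense, Activation

def final_model(input_dim, filters=200, kernel_size=11, stride=2,
                units=200, recur_layers=2, output_dim=29):
    inputs = Input(shape=(None, input_dim), name="the_input")
    x = Conv1D(filters, kernel_size, strides=stride, padding="valid",
               activation="relu", name="conv1d")(inputs)      # CNN front end (from Model 1)
    for i in range(recur_layers):                              # deep RNN structure, no dropout
        x = GRU(units, return_sequences=True, name="rnn_%d" % i)(x)
    dense = TimeDistributed(Dense(output_dim), name="td_dense")(x)
    probs = Activation("softmax", name="softmax")(dense)
    return Model(inputs=inputs, outputs=probs)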
Final Model Analysis: The model trained well, with training and validation loss both
decreasing over time. One thing to note: the validation loss eventually stopped
decreasing and started to increase, which means the network had started to
overfit the training set.
There are a few things we can do to prevent this. With early stopping, we could
have stopped training the network after about 12 epochs (a callback sketch is
given below). Regularization techniques such as dropout would also help, but
dropout with the same parameter setup did not show good results, so other
parameters would also need tweaking if dropout were used.
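A minimal sketch of early stopping with a Keras callback, as suggested above; the patience value and the commented-out training call are illustrative.

from keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor="val_loss", patience=2, verbose=1)
# model.fit_generator(..., epochs=20, callbacks=[early_stop])
# Training halts once validation loss stops improving, e.g. around epoch 12 as observed above.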
Predictions (Final Model)
Following are snippets of the predictions that we got from our final accepted
model:
References:
[1] A. Graves, S. Fernández, and J. Schmidhuber, "Bidirectional LSTM Networks for
Improved Phoneme Classification and Recognition."
[2] K. Choi, G. Fazekas, M. Sandler, and K. Cho, "Convolutional recurrent neural
networks for music classification," in IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), 2017.
[3] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural
Computation, 9(8):1735-1780, 1997.
[4] A. J. Robinson, "An application of recurrent nets to phone probability
estimation," IEEE Transactions on Neural Networks, 5(2):298-305, March 1994.
[5] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," IEEE
Transactions on Signal Processing, 45:2673-2681, November 1997.
[6] T. N. Sainath, O. Vinyals, A. Senior, and H. Sak, "Convolutional, long short-term
memory, fully connected deep neural networks," in IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[7] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H.
Schwenk, and Y. Bengio, "Learning phrase representations using RNN
encoder-decoder for statistical machine translation," arXiv preprint
arXiv:1406.1078, 2014.
[8] D. Bahdanau, K. Cho, and Y. Bengio, "Neural Machine Translation by Jointly
Learning to Align and Translate," in ICLR, 2015.
[9] J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, "Attention-Based
Models for Speech Recognition," in NIPS, 2015.
[10] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, "Listen, Attend and Spell: A Neural
Network for Large Vocabulary Conversational Speech Recognition," in ICASSP,
2016.
[11] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end
Attention-based Large Vocabulary Speech Recognition," in ICASSP, 2016.
