Deep Learning for Speech Recognition
Vikrant Tomar
Founder, Fluent.ai
vt@fluent.ai
We are hiring!
Outline
- Introduction
- General overview of speech recognition framework
- Conventional GMM-HMM based systems
- Deep neural networks in speech
- ConvNets
- RNNs/LSTMs and End-to-end learning
- New interesting stuff
Intro 1: What is speech recognition?
- Dream: A machine should be able to develop a functional equivalent of the
speaker’s intended message as effortlessly as humans can
- In other words: The goal is to find the most likely sequence of symbols such as
words or sub-word speech units from a stream of acoustic data.
Intro 2: How is deep learning for speech different from
deep learning for images?
- Speech is a temporal signal: the information is in the sequence
- One-dimensional signal carrying many kinds of information:
- Speaker
- Accent and language
- Age and health
- Environment
- Issues:
- Noise and background conditions
- Accents
- Recording devices
Overview: Statistical Framework for speech recognition
- Formally, an ASR system maps the sequence of observation vectors, X, to the optimum sequence of words, Ŵ.
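This mapping is usually written as a maximum a posteriori search. The form below is the standard textbook formulation, given here as an assumption about what is meant; the symbols p(X | W) (acoustic model) and P(W) (language model) are not named in the text above.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)}
        = \arg\max_{W} p(X \mid W)\, P(W)
```

The denominator p(X) does not depend on W, so it can be dropped from the search.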
Overview 2: System Architecture
System Architecture : Feature extraction & spectrogram
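A minimal sketch of the spectrogram step, assuming 16 kHz audio, 25 ms Hamming windows and a 10 ms hop. These are common choices, not values taken from the slide, and `log_spectrogram` and its parameters are illustrative names. It requires NumPy.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Return a (num_frames, n_fft // 2 + 1) log-power spectrogram."""
    frame_len = int(sample_rate * frame_ms / 1000)  # 400 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # 160 samples at 16 kHz
    num_frames = 1 + (len(signal) - frame_len) // hop_len
    window = np.hamming(frame_len)
    # Slice the signal into overlapping frames and taper each one
    frames = np.stack([
        signal[i * hop_len : i * hop_len + frame_len] * window
        for i in range(num_frames)
    ])
    # Power spectrum per frame (frames are zero-padded to n_fft)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return np.log(power + 1e-10)  # small constant avoids log(0)

# One second of a 440 Hz tone as a stand-in for real speech
t = np.arange(16000) / 16000.0
spec = log_spectrogram(np.sin(2 * np.pi * 440.0 * t))
print(spec.shape)  # (98, 257)
```

Each row is one frame and each column one frequency bin. Practical front ends usually go on to apply a mel filterbank (and often a DCT for MFCCs) on top of this.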
GMM-HMM based systems
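In a GMM-HMM system, each HMM state scores a frame with a GMM, and decoding searches for the most likely state path. Below is a minimal Viterbi sketch, assuming the per-frame log-likelihoods have already been computed. The toy two-state model and all numbers are made up for illustration.

```python
import math

NEG_INF = float("-inf")

def viterbi(log_init, log_trans, log_emit):
    """Most likely HMM state path.

    log_init[s]     : log P(state_0 = s)
    log_trans[r][s] : log P(state_t = s | state_{t-1} = r)
    log_emit[t][s]  : log p(x_t | state s), e.g. from a per-state GMM
    """
    num_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(num_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        prev, pointers, score = score, [], []
        for s in range(num_states):
            # Best predecessor state for landing in s at time t
            best_r = max(range(num_states), key=lambda r: prev[r] + log_trans[r][s])
            pointers.append(best_r)
            score.append(prev[best_r] + log_trans[best_r][s] + log_emit[t][s])
        backptr.append(pointers)
    # Trace back from the best final state
    state = max(range(num_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical left-to-right model: start in state 0, never go back from 1 to 0
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [[math.log(p0), math.log(p1)]
            for p0, p1 in [(0.9, 0.1), (0.8, 0.2), (0.2, 0.8), (0.1, 0.9)]]
path = viterbi(log_init, log_trans, log_emit)
print(path)  # [0, 0, 1, 1]
```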
Deep neural networks in speech
- A few different approaches:
- Tandem
- Hybrid
- End-to-end
- Old but new
Tandem DNN: DNN -- GMM -- HMM
Hybrid DNN - HMM
- Good source: Hinton et al., “Deep Neural Networks for Acoustic Modeling in Speech Recognition,” 2012.
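In the hybrid setup, the DNN outputs state posteriors P(s | x_t), while the HMM decoder expects likelihoods p(x_t | s). By Bayes' rule the two differ by the state prior, up to a per-frame constant. Below is a minimal sketch of that conversion; the function name and the toy numbers are assumptions for illustration.

```python
import math

def scaled_log_likelihoods(posteriors, priors, floor=1e-10):
    """Per frame and state: log P(s | x_t) - log P(s).

    This equals log p(x_t | s) up to a constant that is shared by all
    states within a frame, so it does not change the decoding result.
    """
    return [
        [math.log(max(p, floor)) - math.log(max(q, floor))
         for p, q in zip(frame, priors)]
        for frame in posteriors
    ]

# Toy example: two states, two frames; priors as estimated from state counts
posteriors = [[0.7, 0.3], [0.4, 0.6]]
priors = [0.8, 0.2]
scaled = scaled_log_likelihoods(posteriors, priors)
```

In the first frame the posterior favors state 0, but after dividing by the priors the rarer state 1 scores higher. This is why the prior correction matters when states are unevenly represented in training data.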
Hybrid CNN - HMM
- Good source: Abdel-Hamid et al., “Convolutional Neural Networks for Speech Recognition,” 2014.
Hybrid CNN - HMM -- Partial weight sharing
Some benchmarks
RNNs and End to end models
- RNNs:
- A good fit because they are sequence models
- However, plain RNNs struggle to capture long-term dependencies because of vanishing gradients
- Solutions: LSTMs and GRUs
- End-to-end models simplify the overall architecture
- CTC: Connectionist temporal classification
A. Graves et al., “Towards End-to-End Speech Recognition with Recurrent Neural Networks,” 2014
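A CTC-trained network emits one distribution per frame over the output symbols plus a special blank. The simplest way to read out a transcript is best-path decoding: take the argmax per frame, merge repeated labels, then drop blanks. Below is a minimal sketch; the alphabet and probabilities are made up, and practical systems often use beam search with a language model instead.

```python
from itertools import groupby

def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    collapsed = [label for label, _ in groupby(best)]
    return "".join(alphabet[label] for label in collapsed if label != blank)

alphabet = ["_", "a", "b"]  # index 0 is the blank symbol
frame_probs = [
    [0.10, 0.80, 0.10],  # a
    [0.20, 0.70, 0.10],  # a (repeat, merged with the previous frame)
    [0.90, 0.05, 0.05],  # blank separates the two a's
    [0.10, 0.60, 0.30],  # a
    [0.10, 0.20, 0.70],  # b
    [0.20, 0.10, 0.70],  # b (repeat, merged)
]
text = ctc_greedy_decode(frame_probs, alphabet)
print(text)  # aab
```

The blank is what lets the model output a doubled letter: without the blank frame in the middle, the two runs of "a" would collapse into one.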
New interesting stuff
- Baidu Deep Speech: uses bi-directional RNNs to map directly to characters
- IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG net
- CLDNN: Conv + LSTMs + Fully Connected
Baidu Lab, Deep Speech, 2014, and Deep Speech 2, 2015
Sainath et al., “Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks,” 2015
Xiong et al., “The Microsoft 2016 Conversational Speech Recognition System,” 2016
Saon et al., “The IBM 2015/16 English Conversational Telephone Speech Recognition System,” 2015/16
Conclusion and resources
- Lots of exciting work; most concepts carry over from other deep learning communities
- Good starting point: http://www.recognize-speech.com
- You can use any toolbox you like to start:
- TensorFlow, Torch, Theano etc.
- Kaldi, Currennt
- Older stuff: CMU-Sphinx, RWTH-ASR, HTK
- Free(-ish) datasets: http://www.openslr.org/resources.php
- Contact: vt@fluent.ai (Hiring Scientists)
