Introduction to Automatic
Speech Recognition
Outline
Define the problem
What is speech?
Feature Selection
Models
 Early methods
 Modern statistical models
Current State of ASR
Future Work
The ASR Problem
There is no single ASR problem
The problem depends on many factors
 Microphone: Close-mic, throat-mic, microphone array, audio-visual
 Sources: band-limited, background noise, reverberation
 Speaker: speaker dependent, speaker independent
 Language: open/closed vocabulary, vocabulary size, read/spontaneous
speech
 Output: Transcription, speaker id, keywords
Performance Evaluation
Accuracy
 Percentage of tokens correctly recognized
Error Rate
 Complement of accuracy (see the word error rate formula below)
Token Type
 Phones
 Words*
 Sentences
 Semantics?
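For word tokens, the standard error metric is word error rate (WER). The usual formula, standard in the field though not spelled out on the slide, counts substitutions S, deletions D, and insertions I against the N words in the reference transcript:

```latex
\mathrm{WER} = \frac{S + D + I}{N}
```

Because insertions are counted, WER can exceed 100%, so it is only loosely the complement of accuracy.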
What is Speech?
Analog signal produced by humans
The speech signal can be modeled as a source passed through a filter (the source-filter model)
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Acoustic Model
 For each frame of data, we need some way of describing the
likelihood of it belonging to any of our classes
 Two methods are commonly used
 Multilayer perceptron (MLP) gives the posterior probability of a class given the data
 Gaussian Mixture Model (GMM) gives the likelihood of the data given a class
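Concretely, per-frame GMM scoring looks like the following minimal sketch, assuming diagonal covariances; all parameters here are made-up toy values, not from the slides:

```python
import numpy as np

def gmm_log_likelihood(frame, weights, means, variances):
    """Log p(frame | class) under a diagonal-covariance Gaussian mixture.

    frame:     (D,) feature vector (e.g., MFCCs for one frame)
    weights:   (M,) mixture weights, summing to 1
    means:     (M, D) per-component means
    variances: (M, D) per-component diagonal variances
    """
    D = frame.shape[0]
    # Per-component log density of a diagonal Gaussian
    log_norm = -0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(variances), axis=1))
    log_exp = -0.5 * np.sum((frame - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_exp
    # Log-sum-exp over components for numerical stability
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

# Toy example: one 13-dim feature frame scored against a 2-component GMM
rng = np.random.default_rng(0)
frame = rng.normal(size=13)
weights = np.array([0.6, 0.4])
means = rng.normal(size=(2, 13))
variances = np.ones((2, 13))
print(gmm_log_likelihood(frame, weights, means, variances))
```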
Gaussian Distribution
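For reference, the density underlying the GMM components above is the multivariate Gaussian in its standard D-dimensional form:

```latex
\mathcal{N}(\mathbf{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma})
= \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{\Sigma}|^{1/2}}
\exp\!\left(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\top}\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)
```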
Pronunciation Model
 While the pronunciation model can be very complex, it is typically
just a dictionary
 The dictionary contains the valid pronunciations for each word
 Examples:
 Cat: k ae t
 Dog: d ao g
 Fox: f aa k s
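In code, a dictionary-style pronunciation model is literally a mapping from words to phone sequences; a toy sketch, with illustrative entries in loose ARPAbet:

```python
# A toy pronunciation dictionary: each word maps to a list of
# valid pronunciations, each pronunciation a list of ARPAbet phones.
PRONUNCIATIONS = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "fox": [["f", "aa", "k", "s"]],
    # Words can carry multiple variants, e.g. "the" before vowels/consonants
    "the": [["dh", "ah"], ["dh", "iy"]],
}

def phones_for(word):
    """Return all valid pronunciations for a word (empty list if OOV)."""
    return PRONUNCIATIONS.get(word.lower(), [])

print(phones_for("fox"))   # [['f', 'aa', 'k', 's']]
```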
Language Model
 Now we need some way of representing the likelihood of any given
word sequence
 Many methods exist, but n-grams are the most common
 N-gram models are trained by simply counting the occurrences of word sequences in a training set
N-grams
 A unigram is the probability of a word in isolation
 A bigram is the probability of a word given the previous word
 Higher-order n-grams continue in a similar fashion
 A backoff probability is used for any unseen data
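A minimal sketch of that counting procedure, with a deliberately crude unigram fallback standing in for real backoff (proper backoff schemes such as Katz or Kneser-Ney discount seen mass first):

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over a training set of tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, total):
    """P(w | w_prev), falling back to the unigram estimate for unseen pairs."""
    if (w_prev, w) in bigrams:
        return bigrams[(w_prev, w)] / unigrams[w_prev]
    # Crude fallback: unseen bigram -> unigram probability.
    return unigrams.get(w, 0) / total

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram(corpus)
total = sum(uni.values())
print(bigram_prob("the", "cat", uni, bi, total))  # 0.5
print(bigram_prob("cat", "dog", uni, bi, total))  # backs off to unigram
```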
How do we put it together?
 We now have models to represent the three parts of our equation
 We need a framework to join these models together
 The standard framework used is the Hidden Markov Model (HMM)
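Written out, the "three parts" are the pieces of the standard noisy-channel decomposition; the equation itself does not appear in the surviving slide text, but this is the usual formulation:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} P(O \mid W)\, P(W),
\qquad
P(O \mid W) \approx \max_{Q} P(O \mid Q)\, P(Q \mid W)
```

Here O is the observed acoustic sequence and Q a phone (state) sequence, so P(O|Q) is the acoustic model, P(Q|W) the pronunciation model, and P(W) the language model.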
Markov Model
 A state model built on the Markov property
 The Markov property states that the future depends only on the present state
 Models the likelihood of transitions between the states of the model
 Given the model, we can determine the likelihood of any sequence of states
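A toy chain makes the property concrete; the numbers below are purely hypothetical:

```python
import numpy as np

# A toy two-state Markov chain (states: 0 = "rain", 1 = "sun").
initial = np.array([0.5, 0.5])          # P(first state)
transition = np.array([[0.7, 0.3],      # P(next | rain)
                       [0.2, 0.8]])     # P(next | sun)

def sequence_likelihood(states):
    """Likelihood of a state sequence under the chain: the Markov property
    means each step depends only on the immediately preceding state."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev, cur]
    return p

print(sequence_likelihood([0, 0, 1, 1]))  # 0.5 * 0.7 * 0.3 * 0.8 = 0.084
```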
Hidden Markov Model
 Similar to a Markov model, except the states are hidden
 We now have observations tied to the individual states
 We no longer know the exact state sequence given the data
 Allows for the modeling of an underlying unobservable process
HMMs for ASR
 First we build an HMM for each phone
 Next we combine the phone models based on the pronunciation model
to create word level models
 Finally, the word level models are combined based on the language
model
 We now have a giant network with potentially thousands or even
millions of states
Decoding
 Decoding searches this network for the most likely state sequence, frame by frame
 For each time frame we need to maintain two pieces of information
 The likelihood of being at any state
 The previous state for every state
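Those two pieces of bookkeeping are the trellis scores and backpointers of the Viterbi algorithm; a minimal sketch, assuming per-frame log-likelihoods already computed by the acoustic model (all arrays here are toy values):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence through an HMM.

    log_init:  (S,)   log P(state at t=0)
    log_trans: (S, S) log P(next state | state)
    log_obs:   (T, S) log P(frame_t | state), e.g. from a GMM or MLP
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]          # likelihood of being at any state
    backptr = np.zeros((T, S), dtype=int)  # previous state for every state
    for t in range(1, T):
        cand = score[:, None] + log_trans  # (prev, cur) candidate scores
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0) + log_obs[t]
    # Trace back from the best final state.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]

# Toy 2-state, 4-frame example with hypothetical log-probabilities.
log_init = np.log([0.6, 0.4])
log_trans = np.log([[0.7, 0.3], [0.4, 0.6]])
log_obs = np.log([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
print(viterbi(log_init, log_trans, log_obs))  # [0, 0, 1, 1]
```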
State of the Art
 What works well
 Constrained vocabulary systems
 Systems adapted to a given speaker
 Systems in anechoic environments without background noise
 Systems expecting read speech
 What doesn't work
 Large unconstrained vocabulary
 Noisy environments
 Conversational speech
Future Work
 Better representations of audio based on human auditory perception
 Better representation of acoustic elements based on articulatory
phonology
 Segmental models that do not rely on the simple frame-based
approach
Resources
 Hidden Markov Model Toolkit (HTK)
 http://htk.eng.cam.ac.uk/
 CHiME (a freely available dataset)
 http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
 Machine Learning Lectures
 http://www.stanford.edu/class/cs229/
 http://www.youtube.com/watch?v=UzxYlbK2c7E