Teaching Machines to Listen
An Introduction to Automatic Speech Recognition
Outline
● Introduction
● Considerations for Working with Speech Data
● Overview of Automatic Speech Recognition (ASR) Systems
● Modeling Approaches
Introduction
whoami
• Why should you listen to me tell you about ASR?
Expectations
What we will cover
• Main components of speech recognition systems
• Unique challenges working with speech data
• Some common modeling approaches
What we will not cover
• Deep dive into specific model architectures
• Comprehensive coverage of all aspects of working with speech
• Code
Consider this an orientation to the field of speech recognition for those familiar with other types of Data Science!
Introduction
• Speech recognition is the task of recognizing speech within audio and converting it into text
• An active field since the 1950s!
• A very active research field with substantial recent advances, driven largely by novel neural network architectures
(Selected) History
• 1950s and 1960s: Focus on limited use cases; digits, phonemes, single speakers
• 1980s and 1990s: Shift to focus on statistical approaches (HMMs, etc.)
• 2000s and 2010s: Wider availability of speech recognition toolkits
• Late 2010s: End-to-end modeling approaches
Specific Challenges for Speech Recognition
• Data volume
• Data quality & characteristics
• Annotation
Considerations for Working with Speech Data
Preparing Data for Machine Learning
• Machine learning algorithms want to deal with vectors (or tensors) of floating point numbers
• Different data types have different preprocessing requirements, as well as different vectorization strategies
Common Characteristics of Tabular Data
• Tabular data is often a mix of different field types, likely captured from various systems of reference in a wide variety of ways
• “Messiness” often comes from interpreting business context for individual fields in a given dataset
Preprocessing and Vectorizing Tabular Data
• Preprocessing often includes categorical/numerical standardization and identifying and addressing relationships between columns
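As a hedged illustration of this kind of tabular preprocessing, here is a minimal scikit-learn sketch (the column names and values are hypothetical) that standardizes numeric fields and one-hot encodes a categorical field:

```python
# Minimal sketch of tabular preprocessing with scikit-learn (columns/values are hypothetical).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [34, 51, 29],
    "income": [52000, 87000, 43000],
    "region": ["east", "west", "east"],
})

# Standardize numeric columns, one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

X = preprocess.fit_transform(df)   # floating point features, ready for a model
print(X.shape)
```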
Common Characteristics of Text Data
• Text carries its own set of unique preprocessing challenges and data characteristics
• Characteristics
  • Language
  • Text encoding
  • Typos and word-level errors
Preprocessing and Vectorizing Text Data
• Preprocessing
  • Case and punctuation normalization
  • Frequency analysis
  • Usually map to a fixed lexicon
• Vectorization
  • Tokenization: word / subword / character
  • Vectorization: count-based vs. embedding
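To make the count-based vectorization path concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the example sentences are arbitrary):

```python
# Minimal sketch of count-based text vectorization with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Automatic speech recognition is the coolest area of machine learning.",
    "Speech data has unique preprocessing challenges.",
]

# Lowercasing and simple word-level tokenization are handled by the vectorizer.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # the learned lexicon
print(X.toarray())                          # one count vector per document
```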
Common Characteristics of Image Data
• Technical Characteristics
  • Size and resolution
  • Color space
  • Format / compression
Preprocessing and Vectorizing Image Data
• Preprocessing
  • Type conversion
  • Resampling
  • Re-scaling (centering)
  • Cropping
• Vectorization
  • Preprocessing produces matrices and tensors of machine-readable data!
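As a rough illustration of these steps, the following sketch (assuming Pillow and NumPy; the file path is hypothetical) resizes, re-scales, and centers an image into a float tensor:

```python
# Minimal sketch of image preprocessing (assumes Pillow and NumPy; file path is hypothetical).
import numpy as np
from PIL import Image

img = Image.open("example.jpg").convert("RGB")   # normalize color space
img = img.resize((224, 224))                     # resample to a fixed size

arr = np.asarray(img, dtype=np.float32) / 255.0  # re-scale pixel values to [0, 1]
arr = arr - arr.mean(axis=(0, 1))                # rough per-channel centering
print(arr.shape)                                 # (224, 224, 3) tensor ready for a model
```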
Common Characteristics of Audio Data
• Technical Characteristics
  • Sample rate
  • Bit rate
  • Compression / encoding
  • Audio channels
• Content Characteristics
  • Number / demographics of speakers
  • Acoustic environment
  • Accent
  • Continuous vs. discrete speech
  • Language(s)
  • Dialect and vocabulary
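For a concrete look at the technical characteristics listed above, the following sketch inspects a WAV file using only Python's standard-library wave module (the file path is hypothetical):

```python
# Minimal sketch: inspect the technical characteristics of a WAV file with the standard library.
import wave

with wave.open("example.wav", "rb") as wav:
    sample_rate = wav.getframerate()      # samples per second, e.g. 16000
    channels = wav.getnchannels()         # 1 = mono, 2 = stereo
    sample_width = wav.getsampwidth()     # bytes per sample, e.g. 2 for 16-bit audio
    n_frames = wav.getnframes()

duration_seconds = n_frames / sample_rate
print(sample_rate, channels, sample_width * 8, duration_seconds)
```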
Overview of Automatic Speech Recognition (ASR) Systems
Shape of Automatic Speech Recognition Task
• Input is raw audio waveform
• Output is discrete text tokens, e.g. "automatic speech recognition is the coolest area of machine learning"
• Framed as a sequence-to-sequence machine learning task
10,000 ft. View of ASR System Components
Raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model → "automatic speech recognition is the coolest area of machine learning"
Preprocessing and Feature Extraction
(First stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model.)
Preprocessing Audio Data
• Resampling
• Normalization
• Splitting / chunking
• Noise detection / cleaning
• Language identification / cleaning
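As a hedged sketch of a few of these steps (assuming librosa and NumPy; the file path, target rate, and chunk length are illustrative), the following resamples, peak-normalizes, and chunks an audio file:

```python
# Minimal sketch of basic audio preprocessing (assumes librosa and NumPy).
import numpy as np
import librosa

# Load as mono and resample to a common ASR rate (16 kHz) in one step.
audio, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# Peak-normalize the waveform to the [-1, 1] range.
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak

# Split into fixed-length chunks (e.g. 30 s) for downstream processing.
chunk_samples = 30 * sr
chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
print(sr, len(audio), len(chunks))
```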
Preprocessing Audio Text Data
• The target text data (transcripts) also needs to be preprocessed and normalized, e.g. a raw transcript like "Automatic speach recognition is the coolest area of making learning!!!" versus the cleaned target "automatic speech recognition is the coolest area of machine learning"
• Many speech recognition systems map to a fixed output vocabulary
• Using this vocabulary, we can define the output lexicon for our downstream modeling tasks
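A minimal sketch of transcript normalization (the exact rules vary by system; this version keeps only lowercase letters, apostrophes, and spaces):

```python
# Minimal sketch of normalizing a raw transcript to a fixed character vocabulary.
import re

raw = "Automatic speach recognition is the coolest area of making learning!!!"

def normalize(text: str) -> str:
    text = text.lower()                       # case normalization
    text = re.sub(r"[^a-z' ]", " ", text)     # keep only characters in the output lexicon
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize(raw))
# -> "automatic speach recognition is the coolest area of making learning"
# Note: normalization fixes case and punctuation, but transcription errors like
# "speach" still have to be addressed during annotation and quality control.
```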
Feature Extraction and Vectorization
• Much of the useful information for downstream tasks isn't directly accessible from the raw audio signal
• Most modern frameworks leverage features derived from the frequency representation of the audio signal
• The human ear is more sensitive to particular frequency ranges than others; that domain knowledge is leveraged to compute the features most relevant for speech processing
• A common approach for extracting relevant features that incorporates this domain knowledge is the computation of Mel Frequency Cepstral Coefficients, or MFCCs
• The result is a series of discrete feature vectors which can then be fed as input to the next step in the process
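A minimal MFCC-extraction sketch, assuming librosa; the window and hop sizes shown are common choices but by no means the only ones:

```python
# Minimal sketch of MFCC feature extraction (assumes librosa; the file path is hypothetical).
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# 13 MFCCs per frame, computed over ~25 ms windows with a 10 ms hop.
mfccs = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=400,        # 25 ms at 16 kHz
    hop_length=160,   # 10 ms at 16 kHz
)

print(mfccs.shape)  # (13, n_frames): one feature vector per frame
```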
Acoustic Model
(Second stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model.)
• An acoustic model provides a mapping from the feature space to an intermediate acoustic representation of discrete phonemes
• Given a discrete input of feature vectors, the acoustic model provides probabilities that a given (sequence of) features corresponds to a particular output phoneme
• As we're ultimately interested in word output and not sequences of phonemes, an acoustic model alone isn't enough
• A lexicon is leveraged by several methods to directly map sequences of phonemes to a word token hypothesis
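The following toy sketch (not a real decoder; the phoneme inventory, probabilities, and lexicon are invented for illustration) shows the general idea: pick the most likely phoneme per frame, collapse repeats, and look the sequence up in a lexicon:

```python
# Toy sketch (not a real decoder): greedy phoneme decoding followed by a lexicon lookup.
import numpy as np

phonemes = ["K", "AE", "T", "D", "AO", "G"]
lexicon = {("K", "AE", "T"): "cat", ("D", "AO", "G"): "dog"}   # phoneme sequence -> word

# Pretend acoustic-model output: one probability distribution over phonemes per frame,
# peaked at K, K, AE, AE, AE, T, T.
peaks = [0, 0, 1, 1, 1, 2, 2]
frame_probs = np.full((len(peaks), len(phonemes)), 0.05)
for t, p in enumerate(peaks):
    frame_probs[t, p] = 0.75

# Greedy decoding: most likely phoneme per frame, then collapse consecutive repeats.
best = [phonemes[i] for i in frame_probs.argmax(axis=1)]
collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]

print(collapsed)                               # ['K', 'AE', 'T']
print(lexicon.get(tuple(collapsed), "<unk>"))  # 'cat'
```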
Language Model
(Final stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model → text.)
• Output tokens from an acoustic model are largely unconstrained in the sense of logical ordering
• A language model attempts to introduce probabilistic constraints on the raw output of an acoustic model to provide more likely sequences of words
Modeling Approaches
Acoustic Models as Hidden Markov Models
• Historically, Hidden Markov Models (HMMs) have found great success as acoustic models for ASR systems
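As a hedged sketch of the idea (using the hmmlearn package, with random features standing in for real MFCC frames; in practice one HMM is typically trained per phoneme), a Gaussian HMM can be fit and then scored against new feature sequences:

```python
# Minimal sketch of fitting a Gaussian HMM to MFCC-like feature frames (assumes hmmlearn).
import numpy as np
from hmmlearn import hmm

# Stand-in for MFCC frames from one phoneme's training examples: (n_frames, n_features).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 13))
lengths = [100, 100]                       # two training utterances of 100 frames each

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(features, lengths)

states = model.predict(features[:100])     # most likely hidden-state sequence
score = model.score(features[:100])        # log-likelihood under this phoneme's model
print(states[:10], score)
```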
N-Gram Language Models
• N-gram language models model the probability of the next token in a sequence by computing statistical transition probabilities between n-grams over a large, representative corpus
• Even complex, modern approaches still rely on n-gram language models to constrain the output text to a given domain!
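A toy bigram (n = 2) illustration of the idea, estimated from a single sentence rather than a large corpus:

```python
# Toy sketch of a bigram (n = 2) language model estimated from a tiny corpus.
from collections import Counter

corpus = "speech recognition is the coolest area of machine learning".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word: str, word: str) -> float:
    """P(word | prev_word) by maximum likelihood (no smoothing)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

# The language model prefers word orderings seen in the corpus.
print(bigram_prob("machine", "learning"))   # high
print(bigram_prob("learning", "machine"))   # zero (never observed)
```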
Time-delay Neural Network
• Time-delay neural networks seek to directly incorporate the surrounding temporal context features into the classification of each frame into a corresponding phoneme
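One rough way to sketch the TDNN idea is with dilated 1-D convolutions in PyTorch, where each output frame depends on feature frames at fixed temporal offsets (the feature and phoneme-set sizes are hypothetical, and this is an illustration of the concept rather than a full TDNN architecture):

```python
# Rough sketch of the TDNN idea using dilated 1-D convolutions in PyTorch.
import torch
import torch.nn as nn

n_mfcc, n_phonemes = 13, 40               # hypothetical feature and phoneme-set sizes
frames = torch.randn(1, n_mfcc, 100)      # (batch, features, time)

tdnn_layer = nn.Sequential(
    # kernel_size=3 with dilation=2 looks at frames {t-2, t, t+2}.
    nn.Conv1d(n_mfcc, 64, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(),
    nn.Conv1d(64, n_phonemes, kernel_size=1),   # per-frame phoneme scores
)

logits = tdnn_layer(frames)               # (1, n_phonemes, 100)
print(logits.shape)
```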
End-to-End Approaches
• Much research over the past decade has focused on approaching speech recognition as a single end-to-end task
• Many approaches borrow ideas or architectural designs from the speech and signal processing domain to maximize the useful information signal flowing through the model
• Deep Speech (2014) was a popular architecture that directly modeled recurrent relationships in the input signal
• wav2vec / wav2vec 2.0 (2019/2020) leverage product quantization to encode discrete speech representations
• This strategy, along with large-scale unsupervised model pretraining, allowed for markedly performant models when fine-tuned on relatively small datasets
• Conformer (2020) leverages both convolutional layers and attention-based transformer layers to model both localized and longer-term dependencies in input speech data
• More recently, the Whisper (2022) model from OpenAI (of ChatGPT fame) made a splash by providing a very robust model trained on over 680,000 hours of audio from captioned video
• This model uses an encoder-decoder architecture paired with a multitask training framework
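As a hedged example of using such a pretrained end-to-end model, the Hugging Face transformers pipeline can run a Whisper checkpoint on an audio file (the model name and file path shown are illustrative, and audio decoding typically requires ffmpeg to be installed):

```python
# Minimal sketch of running a pretrained end-to-end ASR model (assumes transformers).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("utterance.wav")
print(result["text"])
```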
10,000 ft. View of ASR System Components (Revisited)
Raw audio → The newest awesome end-to-end model → "automatic speech recognition is the coolest area of machine learning"
