Teaching Machines to Listen
An Introduction to Automatic Speech Recognition
Outline
● Introduction
● Considerations for Working with Speech Data
● Overview of Automatic Speech Recognition (ASR) Systems
● Modeling Approaches
Introduction
whoami
• Why should you listen to me tell you about ASR?
Expectations
What we will cover
• Main components of speech recognition systems
• Unique challenges working with speech data
• Some common modeling approaches
What we will not cover
• Deep dive into specific model architectures
• Comprehensive coverage of all aspects of working with speech
• Code
Consider this an orientation to the field of speech recognition for those familiar with other types of Data Science!
Introduction
• Speech recognition is the task of recognizing speech within audio and converting it into text
• An active field since the 1950s!
• A very active research field with substantial recent advances, driven largely by novel neural network architectures
(Selected) History
• 1950s and 1960s: Focus on limited use cases; digits, phonemes, single speakers
• 1980s and 1990s: Shift to focus on statistical approaches (HMMs, etc.)
• 2000s and 2010s: Wider availability of speech recognition toolkits
• Late 2010s: End-to-end modeling approaches
Specific Challenges for Speech Recognition
• Data volume
• Data quality & characteristics
• Annotation
Considerations for Working with Speech Data
Preparing Data for Machine Learning
• Machine learning algorithms want to deal with vectors (or tensors) of floating point numbers
• Different data types have different preprocessing requirements, as well as different vectorization strategies
Common Characteristics of Tabular Data
• Tabular data is often a mix of different field types, likely captured from various systems of reference in a wide variety of ways
• “Messiness” often comes from interpreting business context for individual fields in a given dataset
Preprocessing and Vectorizing Tabular Data
• Preprocessing often includes categorical/numerical standardization and identifying and addressing relationships between columns
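As a hedged illustration of this kind of tabular preprocessing, here is a minimal scikit-learn sketch (the column names and values are hypothetical) that standardizes numeric fields and one-hot encodes a categorical field:

```python
# Minimal sketch of tabular preprocessing with scikit-learn (columns/values are hypothetical).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df = pd.DataFrame({
    "age": [34, 51, 29],
    "income": [52000, 87000, 43000],
    "region": ["east", "west", "east"],
})

# Standardize numeric columns, one-hot encode the categorical column.
preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["age", "income"]),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["region"]),
])

X = preprocess.fit_transform(df)   # floating point features, ready for a model
print(X.shape)
```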
Common Characteristics of Text Data
• Text carries its own set of unique preprocessing challenges and data characteristics
• Characteristics
  • Language
  • Text encoding
  • Typos and word-level errors
Preprocessing and Vectorizing Text Data
• Preprocessing
  • Case and punctuation normalization
  • Frequency analysis
  • Usually map to a fixed lexicon
• Vectorization
  • Tokenization: word / subword / character
  • Vectorization: count-based vs. embedding
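To make the count-based vectorization path concrete, here is a minimal sketch using scikit-learn's CountVectorizer (the example sentences are arbitrary):

```python
# Minimal sketch of count-based text vectorization with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Automatic speech recognition is the coolest area of machine learning.",
    "Speech data has unique preprocessing challenges.",
]

# Lowercasing and simple word-level tokenization are handled by the vectorizer.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # sparse document-term count matrix

print(vectorizer.get_feature_names_out())   # the learned lexicon
print(X.toarray())                          # one count vector per document
```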
Common Characteristics of Image Data
• Technical Characteristics
  • Size and resolution
  • Color space
  • Format / compression
Preprocessing and Vectorizing Image Data
• Preprocessing
  • Type conversion
  • Resampling
  • Re-scaling (centering)
  • Cropping
• Vectorization
  • Preprocessing produces matrices and tensors of machine-readable data!
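As a rough illustration of these steps, the following sketch (assuming Pillow and NumPy; the file path is hypothetical) resizes, re-scales, and centers an image into a float tensor:

```python
# Minimal sketch of image preprocessing (assumes Pillow and NumPy; file path is hypothetical).
import numpy as np
from PIL import Image

img = Image.open("example.jpg").convert("RGB")   # normalize color space
img = img.resize((224, 224))                     # resample to a fixed size

arr = np.asarray(img, dtype=np.float32) / 255.0  # re-scale pixel values to [0, 1]
arr = arr - arr.mean(axis=(0, 1))                # rough per-channel centering
print(arr.shape)                                 # (224, 224, 3) tensor ready for a model
```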
Common Characteristics of Audio Data
• Technical Characteristics
  • Sample rate
  • Bit rate
  • Compression / encoding
  • Audio channels
• Content Characteristics
  • Number / demographics of speakers
  • Acoustic environment
  • Accent
  • Continuous vs. discrete speech
  • Language(s)
  • Dialect and vocabulary
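For a concrete look at the technical characteristics listed above, the following sketch inspects a WAV file using only Python's standard-library wave module (the file path is hypothetical):

```python
# Minimal sketch: inspect the technical characteristics of a WAV file with the standard library.
import wave

with wave.open("example.wav", "rb") as wav:
    sample_rate = wav.getframerate()      # samples per second, e.g. 16000
    channels = wav.getnchannels()         # 1 = mono, 2 = stereo
    sample_width = wav.getsampwidth()     # bytes per sample, e.g. 2 for 16-bit audio
    n_frames = wav.getnframes()

duration_seconds = n_frames / sample_rate
print(sample_rate, channels, sample_width * 8, duration_seconds)
```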
Overview of Automatic Speech Recognition (ASR) Systems
Shape of Automatic Speech Recognition Task
• Input is raw audio waveform
• Output is discrete text tokens, e.g. "automatic speech recognition is the coolest area of machine learning"
• Framed as a sequence-to-sequence machine learning task
10,000 ft. View of ASR System Components
Raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model → "automatic speech recognition is the coolest area of machine learning"
Preprocessing and Feature Extraction
(First stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model.)
Preprocessing Audio Data
• Resampling
• Normalization
• Splitting / chunking
• Noise detection / cleaning
• Language identification / cleaning
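As a hedged sketch of a few of these steps (assuming librosa and NumPy; the file path, target rate, and chunk length are illustrative), the following resamples, peak-normalizes, and chunks an audio file:

```python
# Minimal sketch of basic audio preprocessing (assumes librosa and NumPy).
import numpy as np
import librosa

# Load as mono and resample to a common ASR rate (16 kHz) in one step.
audio, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# Peak-normalize the waveform to the [-1, 1] range.
peak = np.max(np.abs(audio))
if peak > 0:
    audio = audio / peak

# Split into fixed-length chunks (e.g. 30 s) for downstream processing.
chunk_samples = 30 * sr
chunks = [audio[i:i + chunk_samples] for i in range(0, len(audio), chunk_samples)]
print(sr, len(audio), len(chunks))
```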
Preprocessing Audio Text Data
• The target text data (transcripts) also needs to be preprocessed and normalized, e.g. a raw transcript like "Automatic speach recognition is the coolest area of making learning!!!" versus the cleaned target "automatic speech recognition is the coolest area of machine learning"
• Many speech recognition systems map to a fixed output vocabulary
• Using this vocabulary, we can define the output lexicon for our downstream modeling tasks
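A minimal sketch of transcript normalization (the exact rules vary by system; this version keeps only lowercase letters, apostrophes, and spaces):

```python
# Minimal sketch of normalizing a raw transcript to a fixed character vocabulary.
import re

raw = "Automatic speach recognition is the coolest area of making learning!!!"

def normalize(text: str) -> str:
    text = text.lower()                       # case normalization
    text = re.sub(r"[^a-z' ]", " ", text)     # keep only characters in the output lexicon
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

print(normalize(raw))
# -> "automatic speach recognition is the coolest area of making learning"
# Note: normalization fixes case and punctuation, but transcription errors like
# "speach" still have to be addressed during annotation and quality control.
```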
Feature Extraction and Vectorization
• Much of the useful information for downstream tasks isn't directly accessible from the raw audio signal
• Most modern frameworks leverage features derived from the frequency representation of the audio signal
• The human ear is more sensitive to particular frequency ranges than others; that domain knowledge is leveraged to compute the features most relevant for speech processing
• A common approach for extracting relevant features that incorporates this domain knowledge is the computation of Mel Frequency Cepstral Coefficients, or MFCCs
• The result is a series of discrete feature vectors which can then be fed as input to the next step in the process
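A minimal MFCC-extraction sketch, assuming librosa; the window and hop sizes shown are common choices but by no means the only ones:

```python
# Minimal sketch of MFCC feature extraction (assumes librosa; the file path is hypothetical).
import librosa

audio, sr = librosa.load("utterance.wav", sr=16000, mono=True)

# 13 MFCCs per frame, computed over ~25 ms windows with a 10 ms hop.
mfccs = librosa.feature.mfcc(
    y=audio,
    sr=sr,
    n_mfcc=13,
    n_fft=400,        # 25 ms at 16 kHz
    hop_length=160,   # 10 ms at 16 kHz
)

print(mfccs.shape)  # (13, n_frames): one feature vector per frame
```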
Acoustic Model
(Second stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model.)
• An acoustic model provides a mapping from the feature space to an intermediate acoustic representation of discrete phonemes
• Given a discrete input of feature vectors, the acoustic model provides probabilities that a given (sequence of) features corresponds to a particular output phoneme
• As we're ultimately interested in word output and not sequences of phonemes, an acoustic model alone isn't enough
• A lexicon is leveraged by several methods to directly map sequences of phonemes to a word token hypothesis
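The following toy sketch (not a real decoder; the phoneme inventory, probabilities, and lexicon are invented for illustration) shows the general idea: pick the most likely phoneme per frame, collapse repeats, and look the sequence up in a lexicon:

```python
# Toy sketch (not a real decoder): greedy phoneme decoding followed by a lexicon lookup.
import numpy as np

phonemes = ["K", "AE", "T", "D", "AO", "G"]
lexicon = {("K", "AE", "T"): "cat", ("D", "AO", "G"): "dog"}   # phoneme sequence -> word

# Pretend acoustic-model output: one probability distribution over phonemes per frame,
# peaked at K, K, AE, AE, AE, T, T.
peaks = [0, 0, 1, 1, 1, 2, 2]
frame_probs = np.full((len(peaks), len(phonemes)), 0.05)
for t, p in enumerate(peaks):
    frame_probs[t, p] = 0.75

# Greedy decoding: most likely phoneme per frame, then collapse consecutive repeats.
best = [phonemes[i] for i in frame_probs.argmax(axis=1)]
collapsed = [p for i, p in enumerate(best) if i == 0 or p != best[i - 1]]

print(collapsed)                               # ['K', 'AE', 'T']
print(lexicon.get(tuple(collapsed), "<unk>"))  # 'cat'
```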
Language Model
(Final stage of the pipeline: raw audio → Preprocessing/Feature Extraction → Acoustic Model → Language Model → text.)
• Output tokens from an acoustic model are largely unconstrained in the sense of logical ordering
• A language model attempts to introduce probabilistic constraints on the raw output of an acoustic model to provide more likely sequences of words
Modeling Approaches
Acoustic Models as Hidden Markov Models
• Historically, Hidden Markov Models (HMMs) have found great success as acoustic models for ASR systems
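As a hedged sketch of the idea (using the hmmlearn package, with random features standing in for real MFCC frames; in practice one HMM is typically trained per phoneme), a Gaussian HMM can be fit and then scored against new feature sequences:

```python
# Minimal sketch of fitting a Gaussian HMM to MFCC-like feature frames (assumes hmmlearn).
import numpy as np
from hmmlearn import hmm

# Stand-in for MFCC frames from one phoneme's training examples: (n_frames, n_features).
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 13))
lengths = [100, 100]                       # two training utterances of 100 frames each

model = hmm.GaussianHMM(n_components=3, covariance_type="diag", n_iter=20)
model.fit(features, lengths)

states = model.predict(features[:100])     # most likely hidden-state sequence
score = model.score(features[:100])        # log-likelihood under this phoneme's model
print(states[:10], score)
```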
N-Gram Language Models
• N-gram language models model the probability of the next token in a sequence by computing statistical transition probabilities between n-grams over a large, representative corpus
• Even complex, modern approaches still rely on n-gram language models to constrain the output text to a given domain!
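A toy bigram (n = 2) illustration of the idea, estimated from a single sentence rather than a large corpus:

```python
# Toy sketch of a bigram (n = 2) language model estimated from a tiny corpus.
from collections import Counter

corpus = "speech recognition is the coolest area of machine learning".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word: str, word: str) -> float:
    """P(word | prev_word) by maximum likelihood (no smoothing)."""
    if unigram_counts[prev_word] == 0:
        return 0.0
    return bigram_counts[(prev_word, word)] / unigram_counts[prev_word]

# The language model prefers word orderings seen in the corpus.
print(bigram_prob("machine", "learning"))   # high
print(bigram_prob("learning", "machine"))   # zero (never observed)
```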
Time-delay Neural Network
• Time-delay neural networks seek to directly incorporate the surrounding temporal context features into the classification of each frame into a corresponding phoneme
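One rough way to sketch the TDNN idea is with dilated 1-D convolutions in PyTorch, where each output frame depends on feature frames at fixed temporal offsets (the feature and phoneme-set sizes are hypothetical, and this is an illustration of the concept rather than a full TDNN architecture):

```python
# Rough sketch of the TDNN idea using dilated 1-D convolutions in PyTorch.
import torch
import torch.nn as nn

n_mfcc, n_phonemes = 13, 40               # hypothetical feature and phoneme-set sizes
frames = torch.randn(1, n_mfcc, 100)      # (batch, features, time)

tdnn_layer = nn.Sequential(
    # kernel_size=3 with dilation=2 looks at frames {t-2, t, t+2}.
    nn.Conv1d(n_mfcc, 64, kernel_size=3, dilation=2, padding=2),
    nn.ReLU(),
    nn.Conv1d(64, n_phonemes, kernel_size=1),   # per-frame phoneme scores
)

logits = tdnn_layer(frames)               # (1, n_phonemes, 100)
print(logits.shape)
```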
End-to-End Approaches
• Much research over the past decade has focused on approaching speech recognition as a single end-to-end task
• Many approaches borrow ideas or architectural designs from the speech and signal processing domain to maximize the useful information signal flowing through the model
• Deep Speech (2014) was a popular architecture that directly modeled recurrent relationships in the input signal
• wav2vec / wav2vec 2.0 (2019/2020) leverage product quantization to encode discrete speech representations
• This strategy, along with large-scale unsupervised model pretraining, allowed for markedly performant models when fine-tuned on relatively small datasets
• Conformer (2020) leverages both convolutional layers and attention-based transformer layers to model both localized and longer-term dependencies in input speech data
• More recently, the Whisper (2022) model from OpenAI (of ChatGPT fame) made a splash by providing a very robust model trained on over 680,000 hours of audio from captioned video
• This model uses an encoder-decoder architecture paired with a multitask training framework
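As a hedged example of using such a pretrained end-to-end model, the Hugging Face transformers pipeline can run a Whisper checkpoint on an audio file (the model name and file path shown are illustrative, and audio decoding typically requires ffmpeg to be installed):

```python
# Minimal sketch of running a pretrained end-to-end ASR model (assumes transformers).
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

result = asr("utterance.wav")
print(result["text"])
```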
10,000 ft. View of ASR System Components (Revisited)
Raw audio → The newest awesome end-to-end model → "automatic speech recognition is the coolest area of machine learning"
