Automatic speech recognition

Automatic speech recognition
• What is the task?
• What are the main difficulties?
• How is it approached?
• How good is it?
• How much better could it be?

2/34

What is the task?
• Getting a computer to understand spoken
language
• By “understand” we might mean
– React appropriately
– Convert the input speech into another
medium, e.g. text
• Several variables impinge on this (see
later)
3/34

How do humans do it?

• Articulation produces
• sound waves which
• the ear conveys to the brain
• for processing
4/34

How might computers do it?

[Figure: acoustic waveform → acoustic signal → speech recognition]

• Digitization
• Acoustic analysis of the speech signal
• Linguistic interpretation
5/34

What’s hard about that?
• Digitization
– Converting analogue signal into digital representation
• Signal processing
– Separating speech from background noise
• Phonetics
– Variability in human speech
• Phonology
– Recognizing individual sound distinctions (similar phonemes)
• Lexicology and syntax
– Disambiguating homophones
– Features of continuous speech
• Syntax and pragmatics
– Interpreting prosodic features
• Pragmatics
– Filtering of performance errors (disfluencies)
6/34

Digitization
• Analogue to digital conversion
• Sampling and quantizing
• Use filters to measure energy levels for various
points on the frequency spectrum
• Knowing the relative importance of different
frequency bands (for speech) makes this
process more efficient
• E.g. high frequency sounds are less informative,
so can be sampled using a broader bandwidth
(log scale)
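
A minimal sketch of the sampling and quantizing step, assuming a hypothetical 440 Hz test tone, an 8 kHz sample rate and 8-bit signed samples (none of these values come from the slides):

```python
import math

def sample_and_quantize(signal, duration_s, sample_rate_hz, bits):
    """Sample a continuous-time signal and quantize each sample to a signed integer."""
    levels = 2 ** (bits - 1) - 1              # e.g. 127 for 8-bit signed samples
    n_samples = int(duration_s * sample_rate_hz)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate_hz                # sampling: evaluate at discrete times
        x = max(-1.0, min(1.0, signal(t)))    # clip to the expected range [-1, 1]
        samples.append(round(x * levels))     # quantizing: snap to nearest integer level
    return samples

def tone(t):
    # Hypothetical "analogue" input: a 440 Hz sine wave
    return math.sin(2 * math.pi * 440 * t)

pcm = sample_and_quantize(tone, duration_s=0.01, sample_rate_hz=8000, bits=8)
```

More bits give finer amplitude resolution, and a higher sample rate captures higher frequencies, at the cost of more data to process.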
7/34

Separating speech from
background noise
• Noise cancelling microphones
– Two mics, one facing speaker, the other facing away
– Ambient noise is roughly same for both mics
• Knowing which bits of the signal relate to speech
– Spectrograph analysis
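
The two-microphone idea can be sketched as a simple subtraction, under the idealised assumption that both mics pick up exactly the same ambient noise. The function name and the sample values are illustrative only; practical systems typically need adaptive filtering rather than a plain difference.

```python
def cancel_ambient(primary, reference):
    """Subtract the away-facing mic (mostly noise) from the speaker-facing mic."""
    return [p - r for p, r in zip(primary, reference)]

# Hypothetical sample values: speech plus noise reaches the primary mic,
# noise alone reaches the reference mic.
speech = [0.0, 0.5, -0.3, 0.8]
noise = [0.1, -0.2, 0.05, 0.0]
primary = [s + n for s, n in zip(speech, noise)]
reference = list(noise)

cleaned = cancel_ambient(primary, reference)
```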

8/34

Variability in individuals’ speech
• Variation among speakers due to
– Vocal range (f0, and pitch range – see later)
– Voice quality (growl, whisper, physiological elements
such as nasality, adenoidality, etc)
– ACCENT !!! (especially vowel systems, but also
consonants, allophones, etc.)
• Variation within speakers due to
– Health, emotional state
– Ambient conditions
• Speech style: formal read vs spontaneous
9/34

Speaker-(in)dependent systems
• Speaker-dependent systems
– Require “training” to “teach” the system your individual
idiosyncrasies
• The more the merrier, but typically nowadays 5 or 10 minutes is
enough
• User asked to pronounce some key words which allow computer to
infer details of the user’s accent and voice
• Fortunately, languages are generally systematic
– More robust
– But less convenient
– And obviously less portable
• Speaker-independent systems
– Language coverage is reduced to compensate for the need to be
flexible in phoneme identification
– Clever compromise is to learn on the fly
10/34

Identifying phonemes
• Differences between some phonemes are
sometimes very small
– May be reflected in speech signal (e.g. vowels
have more or less distinctive f1 and f2)
– Often show up in coarticulation effects
(transition to next sound)
• e.g. aspiration of voiceless stops in English
– Allophonic variation

11/34

Disambiguating homophones
• Humans mostly resolve these differences from
context and the need to make sense
It’s hard to wreck a nice beach
What dime’s a neck’s drain to stop port?
• Systems can only recognize words that are in
their lexicon, so limiting the lexicon is an obvious
ploy
• Some ASR systems include a grammar which
can help disambiguation

12/34

(Dis)continuous speech
• Discontinuous speech much easier to
recognize
– Single words tend to be pronounced more
clearly
• Continuous speech involves contextual
coarticulation effects
– Weak forms
– Assimilation
– Contractions

13/34

Interpreting prosodic features
• Pitch, length and loudness are used to
indicate “stress”
• All of these are relative
– On a speaker-by-speaker basis
– And in relation to context
• Pitch and length are phonemic in some
languages

14/34

Pitch
• Pitch contour can be extracted from
speech signal
– But pitch differences are relative
– One man’s high is another (wo)man’s low
– Pitch range is variable
• Pitch contributes to intonation
– But has other functions in tone languages
• Intonation can convey meaning
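
One common way to extract pitch is autocorrelation; the sketch below uses it as an illustration and is not necessarily the method the slides have in mind. The search range (50 to 400 Hz) and the helper names are assumptions. The second function expresses pitch relative to a speaker's own reference value, which is one way to handle the fact that pitch differences are relative.

```python
import math

def estimate_f0(samples, sample_rate_hz, f0_min=50, f0_max=400):
    """Estimate fundamental frequency by finding the lag with the highest autocorrelation."""
    min_lag = sample_rate_hz // f0_max
    max_lag = sample_rate_hz // f0_min
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag] for i in range(len(samples) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate_hz / best_lag

def relative_pitch_semitones(f0_hz, speaker_reference_hz):
    """Express pitch in semitones above or below the speaker's own reference pitch."""
    return 12 * math.log2(f0_hz / speaker_reference_hz)

# Hypothetical input: 0.1 s of a pure 200 Hz tone sampled at 8 kHz
rate = 8000
signal = [math.sin(2 * math.pi * 200 * n / rate) for n in range(800)]
f0 = estimate_f0(signal, rate)
```

Real speech is far less regular than a pure tone, so practical pitch trackers add voicing detection and smoothing on top of this basic idea.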
15/34

Length
• Length is easy to measure but difficult to
interpret
• Again, length is relative
• It is phonemic in many languages
• Speech rate is not constant – slows down at the
end of a sentence

16/34

Loudness
• Loudness is easy to measure but difficult
to interpret
• Again, loudness is relative

17/34

Performance errors
• Performance “errors” include
– Non-speech sounds
– Hesitations
– False starts, repetitions
• Filtering implies handling at syntactic level
or above
• Some disfluencies are deliberate and
have pragmatic effect – this is not
something we can handle in the near
future
18/34

Approaches to ASR
• Template matching
• Knowledge-based (or rule-based)
approach
• Statistical approach:
– Noisy channel model + machine learning

19/34

Template-based approach
• Store examples of units (words,
phonemes), then find the example that
most closely fits the input
• Extract features from speech signal, then
it’s “just” a complex similarity matching
problem, using solutions developed for all
sorts of applications
• OK for discrete utterances, and a single
user
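
A minimal sketch of template matching, assuming each utterance has already been reduced to a sequence of one-dimensional feature values. Dynamic time warping (DTW) is used here as the similarity measure because it tolerates differences in speaking rate; the slides do not name a specific algorithm, and the templates and feature values are invented for illustration.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(features, templates):
    """Return the label of the stored template closest to the input."""
    return min(templates, key=lambda word: dtw_distance(features, templates[word]))

# Hypothetical stored templates and a slower rendition of "yes"
templates = {"yes": [1.0, 3.0, 2.0], "no": [4.0, 4.0, 1.0]}
utterance = [1.0, 1.0, 3.0, 3.0, 2.0]
result = recognize(utterance, templates)
```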
20/34

Template-based approach
• Hard to distinguish very similar templates
• And quickly degrades when input differs
from templates
• Therefore needs techniques to mitigate
this degradation:
– More subtle matching techniques
– Multiple templates which are aggregated
• Taken together, these suggested …
21/34

Rule-based approach
• Use knowledge of phonetics and
linguistics to guide search process
• Templates are replaced by rules
expressing everything (anything) that
might help to decode:
– Phonetics, phonology, phonotactics
– Syntax
– Pragmatics

22/34

Rule-based approach
• Typical approach is based on “blackboard”
architecture:
– At each decision point, lay out the possibilities
– Apply rules to determine which sequences are
permitted
[Figure: blackboard lattice of candidate phones: s, k, i:, ʃ, h, tʃ, iə, p, t, ɪ]
• Poor performance due to
– Difficulty of expressing rules
– Difficulty of making rules interact
– Difficulty of knowing how to improve the system
23/34

• Identify individual phonemes
• Identify words
• Identify sentence structure and/or meaning
• Interpret prosodic features (pitch, loudness, length)
24/34

Statistics-based approach
• Can be seen as an extension of the template-based
approach, using more powerful
mathematical and statistical tools
• Sometimes seen as “anti-linguistic”
approach
– Fred Jelinek (IBM, 1988): “Every time I fire a
linguist my system improves”

25/34

Statistics-based approach
• Collect a large corpus of transcribed
speech recordings
• Train the computer to learn the
correspondences (“machine learning”)
• At run time, apply statistical processes to
search through the space of all possible
solutions, and pick the statistically most
likely one
26/34

Machine learning
• Acoustic and Lexical Models
– Analyse training data in terms of relevant
features
– Learn from large amount of data different
possibilities
• different phone sequences for a given word
• different combinations of elements of the speech
signal for a given phone/phoneme
– Combine these into a Hidden Markov Model
expressing the probabilities
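
Once such a Hidden Markov Model exists, the most likely state (phone) sequence for an observation sequence is commonly found with the Viterbi algorithm. The sketch below is a toy illustration: the two-phone model, the observation symbols and all probabilities are invented, not learned from data.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the observations."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max((best[-1][p][0] * trans_p[p][s], p) for p in states)
            column[s] = (prob * emit_p[s][obs], prev)
        best.append(column)
    # Trace back from the most probable final state
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    path.reverse()
    return path

# Hypothetical left-to-right model for the word "two": phone "t" then phone "uw"
states = ["t", "uw"]
start_p = {"t": 0.9, "uw": 0.1}
trans_p = {"t": {"t": 0.5, "uw": 0.5}, "uw": {"t": 0.0, "uw": 1.0}}
emit_p = {"t": {"burst": 0.8, "vowel": 0.2}, "uw": {"burst": 0.1, "vowel": 0.9}}

path = viterbi(["burst", "burst", "vowel", "vowel"], states, start_p, trans_p, emit_p)
```

Real systems work with log probabilities to avoid numerical underflow on long utterances.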

27/34

HMMs for some words

28/34

Language model
• Models likelihood of word given previous
word(s)
• n-gram models:
– Build the model by calculating bigram or
trigram probabilities from text training corpus
– Smoothing issues
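
A minimal sketch of a bigram model, assuming a tiny invented training corpus and add-one (Laplace) smoothing as one simple answer to the smoothing issue; real systems use far larger corpora and more refined smoothing.

```python
from collections import Counter

def train_bigram_model(sentences):
    """Return a function prob(word, previous) estimated from the training sentences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split()
        vocab.update(words)
        histories.update(words[:-1])              # words that are followed by something
        bigrams.update(zip(words, words[1:]))
    vocab_size = len(vocab)

    def prob(word, previous):
        # Add-one smoothing: unseen bigrams get a small non-zero probability
        return (bigrams[(previous, word)] + 1) / (histories[previous] + vocab_size)

    return prob

def sentence_score(prob, sentence):
    """Multiply the bigram probabilities along the sentence."""
    words = ["<s>"] + sentence.lower().split()
    score = 1.0
    for previous, word in zip(words, words[1:]):
        score *= prob(word, previous)
    return score

# Hypothetical training corpus
corpus = [
    "it is hard to recognize speech",
    "speech is hard to recognize",
    "we recognize speech",
]
prob = train_bigram_model(corpus)
likely = sentence_score(prob, "it is hard to recognize speech")
unlikely = sentence_score(prob, "it is hard to wreck a nice beach")
```

With this corpus the model prefers the first reading over the acoustically similar second one, which is how a language model helps with homophones.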

29/34

The Noisy Channel Model

• Search through space of all possible
sentences
• Pick the one that is most probable given
the waveform
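
The search criterion is usually written as follows, where the symbols are assumptions introduced here: $W$ is a candidate word sequence and $O$ is the observed acoustic signal. Bayes' rule splits the problem into an acoustic model $P(O \mid W)$ and a language model $P(W)$; the denominator $P(O)$ is the same for every candidate, so it can be dropped.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{P(O \mid W)\, P(W)}{P(O)}
        = \arg\max_{W} P(O \mid W)\, P(W)
```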
30/34

The Noisy Channel Model
• Use the acoustic model to give a set of
likely phone sequences
• Use the lexical and language models to
judge which of these are likely to result in
probable word sequences
• The trick is having sophisticated
algorithms to juggle the statistics
• A bit like the rule-based approach except
that it is all learned automatically from
data
31/34

Evaluation
• Funders have been very keen on
competitive quantitative evaluation
• Subjective evaluations are informative, but
not cost-effective
• For transcription tasks, word-error rate is
popular (though can be misleading: not all
words are equally important)
• For task-based dialogues, other measures
of understanding are needed
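
Word-error rate is conventionally computed as the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the system output, divided by the number of reference words. A minimal sketch, with invented example sentences:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = minimum edits needed to turn ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[-1][-1] / len(ref)

wer = word_error_rate(
    "it is hard to recognize speech",
    "it is hard to wreck a nice beach",
)
```

Every word counts equally in this measure, which is why it can mislead: an error on a content word costs the same as an error on "a" or "the". Because of insertions, the rate can also exceed 1.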
32/34

Comparing ASR systems
• Factors include
– Speaking mode: isolated words vs continuous speech
– Speaking style: read vs spontaneous
– “Enrollment”: speaker (in)dependent
– Vocabulary size (small < 20 … large > 20,000)
– Equipment: good quality noise-cancelling mic …
telephone
– Size of training set (if appropriate) or rule set
– Recognition method

33/34

Remaining problems
• Robustness – graceful degradation, not catastrophic failure
• Portability – independence of computing platform
• Adaptability – to changing conditions (different mic, background
noise, new speaker, new task domain, new language even)
• Language Modelling – is there a role for linguistics in improving the
language models?
• Confidence Measures – better methods to evaluate the absolute
correctness of hypotheses.
• Out-of-Vocabulary (OOV) Words – Systems must have some
method of detecting OOV words, and dealing with them in a
sensible way.
• Spontaneous Speech – disfluencies (filled pauses, false starts,
hesitations, ungrammatical constructions etc) remain a problem.
• Prosody – stress, intonation, and rhythm convey important
information for word recognition and the user's intentions (e.g.,
sarcasm, anger)
• Accent, dialect and mixed language – non-native speech is a
huge problem, especially where code-switching is commonplace
34/34
