August 21, 2013
SEARCH USING VOICE AND IMAGE RECOGNITION
Speech Recognition
CONTENTS
01 What is Voice?
02 Components of Sound
03 Why Voices are Different
04 Classification of Speech Sounds
05 Process of Speech Production
06 What is Voice Recognition?
07 ASR (Automatic Speech Recognition)
08 Types of ASR
09 Approaches to ASR
10 Process of Speech Recognition
11 How Speech Recognition Works
12 Approaches to Speech Recognition
13 Applications of Speech Processing
1. What is Voice?
The voice consists of sound made by a human being using the vocal folds for talking,
singing, laughing, crying, screaming, etc. The human voice is specifically that part of human
sound production in which the vocal folds (vocal cords) are the primary sound source. Generally
speaking, the mechanism for generating the human voice can be subdivided into three parts: the
lungs, the vocal folds within the larynx, and the articulators. The lungs (the pump) must produce
adequate airflow and air pressure to vibrate the vocal folds (this air pressure is the fuel of the voice).
The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into
audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length
and tension of the vocal folds to ‘fine tune’ pitch and tone. The articulators (the parts of the vocal
tract above the larynx consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound
emanating from the larynx and to some degree can interact with the laryngeal airflow to
strengthen it or weaken it as a sound source.
The vocal folds, in combination with the articulators, are capable of producing highly
intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger,
surprise, or happiness. Singers use the human voice as an instrument for creating music.
2. Components of Sound
There are nine components of sound, given below:
1. Music components
 Pitch
 Timbre
 Harmonics
2. Loudness
 Rhythm
3. Sound envelope components
 Attack
 Sustain
 Decay
4. Record and playback component
 Speed
Different Terms
1. Compressions, in which particles are crowded together, appear as upward curves in the line.
2. Rarefactions, in which particles are spread apart, appear as downward curves in the line.
Three characteristics are used to describe a sound wave: wavelength, frequency, and amplitude.
3. Wavelength: the distance from the crest of one wave to the crest of the next.
4. Frequency: the number of waves that pass a point each second.
5. Amplitude: the measure of the amount of energy in a sound wave.
6. Pitch: how high or low a sound seems. A bird makes a high-pitched sound; a lion makes a low-pitched one.
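The quantities above can be illustrated with a short sketch: given samples of a pure tone, the amplitude is the largest excursion from zero, and the frequency can be estimated from the zero-crossing count (each full cycle crosses zero twice). The sampling rate, tone, and amplitude below are invented for illustration.

```python
import math

# Sample one second of a 440 Hz sine wave (values are invented).
RATE = 8000          # samples per second (assumed)
FREQ = 440           # true frequency in Hz
AMP = 0.5            # true peak amplitude

samples = [AMP * math.sin(2 * math.pi * FREQ * i / RATE) for i in range(RATE)]

# Amplitude: the largest excursion from zero.
measured_amp = max(abs(s) for s in samples)

# Frequency: each cycle crosses zero twice, so over one second
# frequency ~= zero crossings / 2.
crossings = sum(1 for a, b in zip(samples, samples[1:])
                if a < 0 <= b or b < 0 <= a)
measured_freq = crossings / 2
```

Both recovered values land close to the true ones, which is the basic idea behind measuring amplitude and pitch from a recorded waveform.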
Sounds also are different in how loud and how soft they are.
The more energy the sound wave has the louder the sound seems. The intensity of a
sound is the amount of energy it has. You hear intensity as loudness.
Remember the amplitude, or height of a sound wave is a measure of the amount of
energy in the wave. So the greater the intensity of a sound, the greater the amplitude.
Pitch and loudness are two ways that sounds differ. Another way is quality.
Some sounds are pleasant and some are noise. Compare the two waves on the right: a pleasant
sound has a regular wave pattern, repeated over and over, while the waves of noise
are irregular, with no repeated pattern.
3. Why Voices are Different?
Voices differ because of
 INTENSITY (depends on amplitude)
 PITCH (depends on frequency)
 TONE (pleasant or unpleasant)
1. Amplitude is a measure of energy. The more energy a wave has, the higher its amplitude.
As amplitude increases, intensity also increases.
August21,
2013
SEARCH USING VOICEANDIMAGE RFECOGNITION
2. Intensity is the amount of energy a sound has over an area. The same sound is more
intense if you hear it in a smaller area. In general, we call sounds with a higher intensity
louder.
3. Pitch depends on the frequency of a sound wave. Frequency is the number of
wavelengths that fit into one unit of time.
4. Classification of Speech Sounds
One can make broad divisions such as voiced and unvoiced sounds, or become more
specific, such as front vowels, back vowels, semivowels, and so on.
The difference between voiced and unvoiced sounds becomes clear in these samples.
The first two blocks demonstrate a dominant low frequency sound wave, which is not present in
the third block. This frequency is produced by the vibration of the larynx, or voice box.
Although the exact frequency differs for each speaker (females tend to have a higher frequency),
the dominant presence of a low frequency sound wave is a surefire indicator of a voiced sound.
1. Voiced Sound
The vocal cords play an active role in the production of the sound, e.g. /a/, /e/, /i/. Voiced
sounds show a dominant low-frequency component.
2. Unvoiced Sound
When the vocal cords are inactive, the sound is called unvoiced, e.g. /s/, /f/. It is built up by
air pressure.
5. Process of Speech Production
6. What is Voice Recognition?
Voice recognition is the process of taking the spoken word as an input to a computer
program. It is the process of converting voice into electric signals. Signals transform into
CODING PATTERN.
Voice recognition is "the technology by which sounds, words or phrases spoken by
humans are converted into electrical signals, and these signals are transformed into coding
patterns to which meaning has been assigned". More generally, the concept could be called
"sound recognition".
In speech recognition, voice recognition is the ability of a computer, computer software
program, or hardware device to decode the human voice into digitized speech that can be
interpreted by the computer or hardware device. Voice recognition is commonly used to operate
a device, perform commands, or write without having to operate a keyboard or mouse, or press any
buttons.
7. ASR (Automatic Speech Recognition)
ASR is the process of converting an acoustic signal, captured by a microphone or telephone, into a set of
words. The recognized words can be final results, as in applications such as command and control,
data entry, and document preparation. They can also serve as input to further linguistic processing
in order to achieve speech understanding.
The first ASR device was used in 1952 and recognized single digits spoken by a user (it was
not computer driven). Today, ASR programs are used in many industries, including Healthcare,
Military (e.g. jets and helicopters), Telecommunications, and Personal computing (e.g. hands-free
computing).
Components of ASR
Acoustic Model
An acoustic model is created by taking audio recordings of speech, and their text
transcriptions, and using software to create statistical representations of the sounds that make up
each word. It is used by a speech recognition engine to recognize speech.
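As a toy illustration of those statistical representations (not a real acoustic model), the sketch below fits each phone a one-dimensional Gaussian from invented "training" feature values, then scores a new frame against every phone and picks the most likely one:

```python
import math

# Invented training data: one acoustic feature value per frame,
# grouped by the phone that produced it.
training = {
    "s": [0.82, 0.78, 0.85, 0.80],   # e.g. high-frequency energy ratio
    "a": [0.12, 0.15, 0.10, 0.14],
}

def gaussian_params(xs):
    # Mean and variance of the training frames for one phone.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-6
    return mean, var

def log_likelihood(x, mean, var):
    # Log of the Gaussian density at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

models = {phone: gaussian_params(xs) for phone, xs in training.items()}

frame = 0.79  # feature value of a new, unlabeled frame
best = max(models, key=lambda p: log_likelihood(frame, *models[p]))
```

Here `best` comes out as `"s"`, since 0.79 sits squarely in that phone's training range; real acoustic models use many-dimensional features and mixtures of Gaussians, but the scoring idea is the same.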
Language Model
Language modeling is used in many natural language processing applications, including
speech recognition. A language model tries to capture the properties of a language and to predict the next word in
a speech sequence.
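A minimal sketch of such a model is a bigram counter: count word pairs in a corpus, then predict the most likely successor of a word. The tiny corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# Invented corpus of spoken commands.
corpus = "call the office please call the house call the office".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    # Most frequent successor of `word` in the corpus.
    return bigrams[word].most_common(1)[0][0]

def probability(prev, nxt):
    # Maximum-likelihood estimate P(nxt | prev).
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total
```

On this corpus, "call" is always followed by "the", and "the" is followed by "office" two times out of three, so the model predicts "office" with probability 2/3.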
8. Basic Types of Speech Recognition Systems
1. Speaker-dependent
The user must provide samples of his/her speech before using the system. The voice recognition
software must be trained before it can be used. This often requires that a user read a series of words and
phrases so the computer can learn the user's voice.
Speaker–dependent software works by learning the unique characteristics of a
single person's voice, in a way similar to voice recognition. New users must first "train" the
software by speaking to it, so the computer can analyze how the person talks. This often means
users have to read a few pages of text to the computer before they can use the speech recognition
software.
2. Speaker-independent
No speaker enrollment is necessary. The voice recognition software recognizes most users'
voices with no training.
Speaker–independent software is designed to recognize anyone's voice, so no training is
involved. This means it is the only real option for applications such as interactive voice response
systems — where businesses can't ask callers to read pages of text before using the system. The
downside is that speaker–independent software is generally less accurate than speaker–dependent
software.
Other types
1. Discrete speech recognition - The user must pause between words so
that the recognizer can identify each separate word.
2. Continuous speech recognition - The recognizer can understand a
normal rate of speaking.
3. Natural language - The recognizer not only understands the voice
but can also return answers to the questions or other queries being asked.
9. Approaches to ASR
 Template matching
 Knowledge-based (or rule-based) approach
 Statistical approach:
 Noisy channel model + machine learning
1. Template matching
Template matching is speaker-dependent: it matches the voice against already saved templates, so
the system must be trained first, and the user must speak the same words that are available as
templates. Recognition accuracy can be about 98 percent. The idea is to store examples of units
(words, phonemes) and then find the example that most closely fits the input: extract features from
the speech signal, and the problem becomes a complex similarity match, using solutions developed
for all sorts of applications. This works well for discrete utterances and a single user, but it is hard
to distinguish very similar templates, and performance quickly degrades when the input differs from
the templates. It therefore needs techniques to mitigate this degradation: more subtle matching
techniques, and multiple templates which are aggregated.
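The core of template matching can be sketched in a few lines: store one feature vector per word, then assign the input to whichever template is closest. The feature values below are invented; real systems use longer feature sequences and more subtle distance measures.

```python
import math

# Invented templates: one feature vector per vocabulary word.
templates = {
    "yes": [0.9, 0.2, 0.7],
    "no":  [0.1, 0.8, 0.3],
}

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features):
    # Return the word whose template is closest to the input.
    return min(templates, key=lambda w: distance(templates[w], features))

word = recognize([0.85, 0.25, 0.65])   # a noisy utterance of "yes"
```

The input differs slightly from the stored "yes" template but is still far closer to it than to "no", which is exactly the regime where template matching works; inputs far from every template are where it degrades.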
2. Rule-based approach
The rule-based approach is speaker-independent. It first processes the given voice as input, using
LPC (Linear Predictive Coding), and attempts to find similarities between the expected input and
the digitized input. Recognition accuracy for speaker-independent systems is somewhat less than
for speaker-dependent systems, usually between 90 and 95 percent. The approach uses knowledge
of phonetics and linguistics to guide the search process: templates are replaced by rules expressing
everything (anything) that might help to decode:
 Phonetics, phonology, phonotactics
 Syntax
 Pragmatics
A typical approach is based on a "blackboard" architecture:
 At each decision point, lay out the possibilities
 Apply rules to determine which sequences are permitted
Performance is poor due to the
 Difficulty of expressing rules
 Difficulty of making rules interact
 Difficulty of knowing how to improve the system
3. Statistical Approach
The statistical approach can be seen as an extension of the template-based approach, using more
powerful mathematical and statistical tools. It is sometimes seen as an "anti-linguistic" approach;
Fred Jelinek (IBM, 1988) famously said, "Every time I fire a linguist my system improves." The
idea is to collect a large corpus of transcribed speech recordings and train the computer to learn
the correspondences ("machine learning").
At run time, statistical processes are applied to search through the space of all possible
solutions, and the statistically most likely one is picked.
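In the noisy-channel formulation, that search amounts to picking the word maximizing log P(acoustics | word) + log P(word). The sketch below uses invented stand-in scores for an acoustic model and a language model:

```python
# Invented log-probabilities for three homophones of "to".
acoustic_logp = {"two": -1.2, "too": -1.0, "to": -1.1}   # log P(A | W)
language_logp = {"two": -2.5, "too": -3.5, "to": -0.5}   # log P(W)

# Noisy channel: choose the word with the best combined score.
best_word = max(acoustic_logp,
                key=lambda w: acoustic_logp[w] + language_logp[w])
```

Acoustically the three candidates are nearly indistinguishable; it is the language model's strong preference for "to" that decides the outcome, which is why statistical recognizers combine both scores.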
10. Process of Speech Recognition
Vocal Tract
The vocal tract consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal cavity, and nasal
pharynx.
Spectrum Analysis
MFCCs (mel-frequency cepstral coefficients) are used to produce the voice features; DTW
(dynamic time warping) selects the pattern that best matches the database (implemented in MATLAB).
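DTW can be sketched directly: it aligns two feature sequences of possibly different lengths, allowing stretches and compressions, and returns the minimal cumulative distance. A pure-Python version under those assumptions:

```python
def dtw(a, b):
    # Dynamic time warping distance between sequences a and b.
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])        # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the repeated 2 is absorbed by the alignment, which is exactly why DTW tolerates words spoken at different speeds.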
11. How Speech Recognition Works
 Divide the sound wave into evenly spaced blocks, transforming the PCM digital
audio into a better acoustic representation.
 Process each block for important characteristics, such as strength across various
frequency ranges, number of zero crossings, and total energy. Apply a "grammar"
so the speech recognizer knows what phonemes to expect; a grammar could be
anything from a context-free grammar to a full-blown language.
 Using this characteristic vector, attempt to associate each block with a phone,
the most basic unit of speech, producing a string of phones; that is, figure out
which phonemes were spoken.
 Find the word whose model is the most likely match to the string of phones
produced; that is, convert the phonemes into words.
1. Speech Detection
The first task is to identify the presence of a speech signal. This task is easy if the signal
is clean; however, the signal frequently contains background noise, resulting from a noisy
microphone, a fan running in the room, etc. The signals obtained were in fact found to contain
some noise. I used two criteria to identify the presence of a spoken word: first, the total energy
is measured, and second, the number of zero crossings is counted. Both of these were found to
be necessary, as voiced sounds tend to have a high volume (and thus a high total energy) but a
low overall frequency (and thus a low number of zero crossings), while unvoiced sounds were
found to have a high frequency but a low volume. Only background noise was found to have
both low energy and low frequency. The method successfully detected the beginning
and end of the several words tested. Note that this is not sufficient for the general case, as fluent
speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between
the 'c' and 'q').
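The two criteria above can be sketched as follows: a block counts as speech if it has high energy (voiced) or a high zero-crossing count (unvoiced); low on both counts is taken as background noise. The thresholds are invented for illustration and would need tuning against real signals.

```python
# Invented thresholds for illustration.
ENERGY_THRESHOLD = 0.5
ZCR_THRESHOLD = 10

def energy(block):
    # Total energy of the block (sum of squared samples).
    return sum(s * s for s in block)

def zero_crossings(block):
    # Number of sign changes between consecutive samples.
    return sum(1 for a, b in zip(block, block[1:]) if a * b < 0)

def is_speech(block):
    # Voiced sounds trip the energy test; unvoiced sounds trip
    # the zero-crossing test; noise trips neither.
    return (energy(block) > ENERGY_THRESHOLD
            or zero_crossings(block) > ZCR_THRESHOLD)
```

A loud, slowly oscillating block passes on energy alone, a quiet but rapidly alternating block passes on zero crossings, and a block that is both quiet and slow is rejected as noise.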
2. Blocking
The second task is blocking. Older speech recognition systems first attempted to detect where
the phones would start and finish, and then block the signal by placing one phone in each block.
However, phones can blend together in many circumstances, and this method generally could not
reliably detect the correct boundaries. Most modern systems simply separate the signal into
blocks of a fixed length. These blocks tend to overlap, so that phones which cross block
boundaries will not be missed. This project uses blocks which are 30 msec in length (containing
600 samples), and which shift by 10 msec increments.
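Using the figures quoted above (30 ms blocks of 600 samples, which implies a 20 kHz sampling rate, shifted in 10 ms increments), the blocking step can be sketched as:

```python
BLOCK = 600   # 30 ms at an assumed 20 kHz sampling rate
SHIFT = 200   # 10 ms shift, so consecutive blocks overlap by 20 ms

def make_blocks(signal):
    # Slice the signal into fixed-length, overlapping blocks.
    return [signal[i:i + BLOCK]
            for i in range(0, len(signal) - BLOCK + 1, SHIFT)]
```

A 1000-sample signal yields three blocks starting at samples 0, 200, and 400; the overlap ensures a phone straddling a block boundary still appears whole in some block.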
The next important step in the processing of the signal is to obtain a frequency spectrum
of each block. The information in the frequency spectrum is often enough to identify the phone.
The purpose of the frequency spectrum is to identify the formants, which are the peaks in the
frequency spectrum. Vowels are often uniquely identified by their first two formants. This
experiment has shown that the identification of formants is not a trivial task. One method to
obtain a frequency spectrum is to apply an FFT to each block. The resulting information can be
examined manually to find the peaks, but it is quite noisy, which makes it difficult for a
computer to identify the peaks. Very useful data can still be obtained; this is often done by
measuring the strength across various frequency ranges.
Consider the frequency spectrum of a different speaker saying the 's' in 'yes'. The important
feature to note is the presence of a peak in the 100-150 range (which scales to 3600-5400 Hz).
This peak is a feature of the letter 's'. Each spectrum has a peak there, although it is at a different
strength in each one (any data in the 0-10 range is likely to be noise). In many cases,
the overall strength in that range is quite low compared with the strength of the lower
frequencies.
This is a feature of the voiced sounds, although the exact frequencies vary with the
speaker. The important features visible in this spectrum are the existence of a formant in the
80-100 range while the 'y' is spoken, and then later the existence of formants at both ~70 and ~50
simultaneously while the 'e' is spoken.
This is the frequency spectrum produced by another speaker, while saying the 'ye' of yes.
Notice here that the 'y' and 'e' overlap substantially. Often, consonants take on the
frequencies of the vowels which follow them and must be identified by characteristics other than
their frequencies alone. Here, the 'y' may be identified by the transition from its higher
frequency into the frequency of the vowel which follows.
Another method used to obtain a frequency spectrum is Linear
Predictive Coding (LPC). This is the most successful method in widespread use today. The idea
behind LPC is that the values of the signal can be expressed as a linear combination of the
preceding values. That is, if s(i) is the amplitude at time i,
s(i) = a1*s(i-1) + a2*s(i-2) + ... + ap*s(i-p)
When the input data is filled in, this becomes a system of linear equations which can be
solved to determine the values of a1 through ap. These values then produce a very noise free
signal, which clearly identifies the formants.
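One common way to solve for those coefficients is the Levinson-Durbin recursion on the signal's autocorrelation sequence; a minimal pure-Python sketch of that route (not the only way to solve the system) looks like this:

```python
def autocorr(x, lag):
    # Autocorrelation of the signal at the given lag.
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def lpc(signal, order):
    # Levinson-Durbin recursion: returns [a1, ..., ap] such that
    # s(i) ~= a1*s(i-1) + ... + ap*s(i-p), as in the equation above.
    r = [autocorr(signal, k) for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1 - k * k)                # remaining prediction error
    return a
```

Fed a decaying exponential s(i) = 0.9^i, an order-1 fit recovers a1 ≈ 0.9, as expected, since each sample is 0.9 times the previous one.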
3. Other Features
Plosives (b, p, d, t, g, k) can generally be identified by a pause followed by a sudden
increase in energy of short duration. Nasals (n, m, ng), are often characterized by a single
formant of low frequency, and if followed by a vowel, their formants tend to have a wide
spectrum. The 'h' is characterized by a building unvoiced sound followed by a sudden sustained
increase in energy at the formants of the vowel which follows. Unvoiced fricatives (th, s, sh, f),
are characterized by a low energy, wide band, high frequency spectrum. Their voiced
counterparts (dh, z, zh, v) have an additional formant in the low frequency spectrum.
Affricates (j, ch) are often described as a plosive which turns into a fricative (d - zh and t - sh
respectively). Glides, or semivowels (w, l, r, y), may be the most difficult to characterize,
because they are highly situation dependent. They are followed by vowels, unless they appear at
the end of a word, and behave much like a transition from another vowel into the vowel which
follows it. In this project we noted how the 'y' transitions from its characteristic frequency into
the frequencies of the 'e' which follows it. There may be no clear distinction where one ends and
the other begins.
4. Word Identification
Although this project used a very simple identification method to differentiate between
two words, real word identification has many obstacles to overcome. Because we chose to
divide our signal into blocks of a set duration, we do not know how many blocks a given phone
may occupy. Some phones may only be recognized as the transition from one phone to another.
Some phones may be missing or improperly identified. All of these notions are captured by a
model known as the Hidden Markov Model (HMM). An HMM is essentially a finite automaton in
which each transition has a probability associated with it.
A given vocabulary word has an HMM which is designed to model the many possible
strings of phones which may be produced by the utterance of the word. Each expected phone is
generally represented by a state in the HMM, while each possible phone at every stage has an
arc. This means that a 'y' may be represented by a 'y' or an 'i' arc, while both lead to the 'y' state. Self
loops account for the possibility of a phone stretching over several blocks. Missed phones are
also allowed, as an arc may jump over a state. Each arc is then assigned a probability to complete
the HMM.
Then on an input signal, a dynamic programming algorithm, called the Viterbi
algorithm, is applied to identify which HMM is the most likely match for the input signal.
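The Viterbi search can be sketched as follows for a toy two-state model of the word "ye" ('y' then 'e'); all the transition, start, and emission probabilities below are invented for illustration:

```python
import math

states = ["y", "e"]
log = math.log
# Transition log-probabilities, with a self-loop on each state so a
# phone can stretch over several blocks.
trans = {("y", "y"): log(0.6), ("y", "e"): log(0.4),
         ("e", "y"): log(0.01), ("e", "e"): log(0.99)}
start = {"y": log(0.9), "e": log(0.1)}
# Per-block emission log-probabilities (one dict per observed block).
emits = [{"y": log(0.8), "e": log(0.2)},
         {"y": log(0.7), "e": log(0.3)},
         {"y": log(0.2), "e": log(0.8)},
         {"y": log(0.1), "e": log(0.9)}]

def viterbi():
    # Dynamic programming: best score of any path ending in each state.
    best = {s: start[s] + emits[0][s] for s in states}
    back = []
    for e in emits[1:]:
        prev_best = best
        best, ptr = {}, {}
        for s in states:
            p, score = max(((q, prev_best[q] + trans[(q, s)])
                            for q in states), key=lambda t: t[1])
            best[s] = score + e[s]
            ptr[s] = p
        back.append(ptr)
    # Backtrace from the best final state.
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

On these four blocks the algorithm assigns the first two to 'y' and the last two to 'e', which is the most likely alignment given the emission scores.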
12. Approaches to Speech Recognition
 Acoustic-Phonetic Approach
 Pattern Recognition Approach (HMM)
 Artificial Intelligence Approach (Neural Networks)
1. Pattern Recognition Approach
"A pattern is the opposite of a chaos; it is an entity vaguely defined, that could be given a
name."
A pattern is an object, process, or event. A class (or category) is a set of patterns that share
common attributes (features), usually from the same information source. During recognition (or
classification), classes are assigned to the objects. A classifier is a machine that performs this task.
2. Neural Network Approach
The classifier is represented as a network of cells modeling the neurons of the human brain
(the connectionist approach).
3. Language Model
13. Applications of Speech Processing
 Medical Transcription
 Military
 Telephony and other domains
 Serving the disabled
 Home automation
 Automobile audio systems
 Telematics

More Related Content

PDF
speech processing and recognition basic in data mining
PPT
Speech Recognition
PPT
Automatic speech recognition
PPT
Speech Recognition in Artificail Inteligence
PPTX
Speech Recognition
PPTX
Speech Recognition
speech processing and recognition basic in data mining
Speech Recognition
Automatic speech recognition
Speech Recognition in Artificail Inteligence
Speech Recognition
Speech Recognition

What's hot (20)

PPTX
Speech recognition system seminar
PPT
Voice Recognition
PPT
Hidden Markov Models with applications to speech recognition
PPT
Speech recognition
PPTX
Speech Signal Processing
DOCX
Speech Recognition
PPTX
Speaker recognition using MFCC
PPTX
Speech recognition An overview
PPTX
SPEECH RECOGNITION USING NEURAL NETWORK
PPTX
speech processing basics
PPTX
Speech Recognition Technology
PPT
Abstract of speech recognition
PPTX
Digital speech processing lecture1
PPTX
Speech Recognition Technology
PPT
Automatic speech recognition
PPSX
Speech recognition an overview
DOCX
speech enhancement
PDF
SPEECH CODING
PPTX
Automatic speech recognition system
PPTX
Speech recognition final presentation
Speech recognition system seminar
Voice Recognition
Hidden Markov Models with applications to speech recognition
Speech recognition
Speech Signal Processing
Speech Recognition
Speaker recognition using MFCC
Speech recognition An overview
SPEECH RECOGNITION USING NEURAL NETWORK
speech processing basics
Speech Recognition Technology
Abstract of speech recognition
Digital speech processing lecture1
Speech Recognition Technology
Automatic speech recognition
Speech recognition an overview
speech enhancement
SPEECH CODING
Automatic speech recognition system
Speech recognition final presentation
Ad

Similar to Automatic Speech Recognition (20)

PDF
ACHIEVING SECURITY VIA SPEECH RECOGNITION
PDF
Isolated English Word Recognition System: Appropriate for Bengali-accented En...
DOCX
A seminar report on speech recognition technology
PDF
An Introduction To Speech Sciences (Acoustic Analysis Of Speech)
PDF
Animal Voice Morphing System
PDF
An Introduction to Various Features of Speech SignalSpeech features
DOCX
PPTX
PPTX
Speech and Language Processing
PDF
Silent Sound Technology
PDF
High Level Speaker Specific Features as an Efficiency Enhancing Parameters in...
PPTX
Speech Analysis
PPTX
Automatic Speech Recognion
DOCX
Magazine Article
PDF
Ece speech-recognition-report
PPTX
DOCX
Final article
PPT
Phonetics
ACHIEVING SECURITY VIA SPEECH RECOGNITION
Isolated English Word Recognition System: Appropriate for Bengali-accented En...
A seminar report on speech recognition technology
An Introduction To Speech Sciences (Acoustic Analysis Of Speech)
Animal Voice Morphing System
An Introduction to Various Features of Speech SignalSpeech features
Speech and Language Processing
Silent Sound Technology
High Level Speaker Specific Features as an Efficiency Enhancing Parameters in...
Speech Analysis
Automatic Speech Recognion
Magazine Article
Ece speech-recognition-report
Final article
Phonetics
Ad

More from International Islamic University (20)

Recently uploaded (20)

PDF
Insiders guide to clinical Medicine.pdf
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Classroom Observation Tools for Teachers
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Business Ethics Teaching Materials for college
PDF
Complications of Minimal Access Surgery at WLH
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
Cell Structure & Organelles in detailed.
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
Pharma ospi slides which help in ospi learning
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
O7-L3 Supply Chain Operations - ICLT Program
Insiders guide to clinical Medicine.pdf
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Week 4 Term 3 Study Techniques revisited.pptx
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Classroom Observation Tools for Teachers
PPH.pptx obstetrics and gynecology in nursing
STATICS OF THE RIGID BODIES Hibbelers.pdf
Business Ethics Teaching Materials for college
Complications of Minimal Access Surgery at WLH
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
TR - Agricultural Crops Production NC III.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Cell Structure & Organelles in detailed.
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Pharma ospi slides which help in ospi learning
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
O7-L3 Supply Chain Operations - ICLT Program

Automatic Speech Recognition

  • 1. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Speech Recognition
  • 2. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION CONTENTS 01 What is Voice 02 Component of Sound 03 Why Voices are Different 04 Classification of Speech Sound 05 Process of Speech Production 06 What is Voice Recognition 07 ASR(Automatic Speech Recognition) 08 Types of ASR 09 Approachesto Speech Recognition 10 Process of Speech Recognition 11 How Speech Recognition Works 12 Approachesof Speech recognition 13 Application of Speech Processing
  • 3. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 1.What is Voice? The voice consists of sound made by a human being using the vocal folds for talking, singing, laughing, crying, screaming, etc. The human voice is specifically that part of human sound production in which the vocal folds (vocal cords) are the primary sound source. Generally speaking, the mechanism for generating the human voice can be subdivided into three parts; the lungs, the vocal folds within the larynx, and the articulators. The lung (the pump) must produce adequate airflow and air pressure to vibrate vocal folds (this air pressure is the fuel of the voice). The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the vocal folds to ‘fine tune’ pitch and tone. The articulators (the parts of the vocal tract above the larynx consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound emanating from the larynx and to some degree can interact with the laryngeal airflow to strengthen it or weaken it as a sound source. The vocal folds, in combination with the articulators, are capable of producing highly intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger, surprise, or happiness. Singers use the human voice as an instrument for creating music. 2.Componentsof Sound There are NINE(09) components of sound given below 1. Music components  Pitch  Timbre  Harmonics 2. Loudness  Rhythm 3. Sound envelope components  Attack  Sustain  Decay 4. Record and playback component  Speed
  • 4. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Different Terms 1. Compressions, in which particles are crowded together, appear as upward curves in the line. 2. Rarefactions, in which particles are spread apart, appear as downward curves in the line. Three characteristics are used to describe a sound wave. These are wavelength, frequency, and amplitude. 3. Wavelength; this is the distance from the crest of one wave to the crest of the next. 4. Frequency; this is the number of waves that pass a point in each second. 5. Amplitude; this is the measure of the amount of energy in a sound wave. 6. Pitch This is how high or low a sound seems. A bird makes a high pitch. A lion makes a low pitch.
  • 5. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Sounds also are different in how loud and how soft they are. The more energy the sound wave has the louder the sound seems. The intensity of a sound is the amount of energy it has. You hear intensity as loudness. Remember the amplitude, or height of a sound wave is a measure of the amount of energy in the wave. So the greater the intensity of a sound, the greater the amplitude. Pitch and loudness are two ways that sounds are different. Another way is in quality. Some sounds are pleasant and some are a noise. Compare the two waves on the right. A pleasant sound has a regular wave pattern. The pattern is repeated over and over. But the waves of noise are irregular. They do not have a repeated pattern. 7.Why Voices are Different? Voices are different caused by  INTENSITY(depend on amplitude)  PITCH(frequency)  TONE(pleasant or unpleasant). 1. Amplitude is a measure of energy. The more energy a wave has, the higher its amplitude. As amplitude increases, intensity also increases.
  • 6. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 2. Intensity is the amount of energy a sound has over an area. The same sound is more intense if you hear it in a smaller area. In general, we call sounds with a higher intensity louder. 3. Pitch depends on the frequency of a sound wave. Frequency is the number of wavelengths that fit into one unit of time. Sounds also are different in how loud and how soft they are. The more energy the sound wave has the louder the sound seems. The intensity of a sound is the amount of energy it has. You hear intensity as loudness. Remember the amplitude, or height of a sound wave is a measure of the amount of energy in the wave. so the greater the intensity of a sound, the greater the amplitude. 4.Classification of Speech Sound One can make broad divisions such as voiced and unvoiced sound, or become more speci_c, such as front vowels, back vowels, semivowels, and so on. The difference between voiced and unvoiced sounds becomes clear in these samples. The first two blocks demonstrate a dominant low frequency sound wave, which is not present in the third block. This frequency is produced by the vibration of the larynx, or voice box. Although the exact frequency differs for each speaker (females tend to have a higher frequency), the dominant presence of a low frequency sound wave is a surefire indicator of a voiced sound. 1. Voiced Sound Vocal Chord play active role in the production of SOUND e.g. a/e/I. It has high frequency 2. Un Voiced Sound When Vocal Chord is Inactive Called UN VOICED SOUND e.g. s/f. It build up by pressure
  • 7. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 5.Process of Speech Production 6.What is voice recognition? Voice recognition is the process of taking the spoken word as an input to a computer program. It is the process of converting voice into electric signals. Signals transform into CODING PATTERN. Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals and these signals are transformed into coding patterns to which meaning has been assigned". While the concept could more generally is called "sound recognition". speech recognition, voice recognition is an ability of a computer, computer software program, or hardware device to decode the human voice into digitized speech that can be interpreted by the computer or hardware device. Voice recognition is commonly used to operate a device, perform commands, or write without having to operate a keyboard, mouse, or press any buttons 7.ASR (Automatic Speech Recognition) Process of converting acoustic signal captured by microphone or telephone to a set of words. Recognized words can be final results, as for applications such as commands and control, data entry and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding. First ASR device was used in 1952 and recognized single digits spoken by a user (it was not computer driven). Today, ASR programs are used in many industries, including Healthcare,
the military (e.g. jets and helicopters), telecommunications, and personal computing (e.g. hands-free computing).

Evaluation of ASR

Acoustic Model

An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech.
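As a rough illustration of what "statistical representations of the sounds" means, the sketch below fits a one-dimensional Gaussian per phone from labelled training frames and scores new frames against it. The phone labels and the single energy-like feature are illustrative assumptions; real acoustic models use Gaussian mixtures (or neural networks) over MFCC vectors.

```python
import numpy as np

def train_acoustic_model(labelled_frames):
    """labelled_frames: (phone, feature) pairs -> {phone: (mean, std)}."""
    model = {}
    for phone in {p for p, _ in labelled_frames}:
        feats = np.array([f for p, f in labelled_frames if p == phone])
        model[phone] = (float(feats.mean()), float(feats.std()) + 1e-6)
    return model

def log_likelihood(model, phone, feature):
    """Log-probability of the feature under the phone's Gaussian."""
    mean, std = model[phone]
    return -0.5 * np.log(2 * np.pi * std ** 2) - (feature - mean) ** 2 / (2 * std ** 2)

# Toy training data: a low-energy fricative 's' and a high-energy vowel 'a'.
frames = [("s", 0.9), ("s", 1.1), ("a", 4.8), ("a", 5.2)]
model = train_acoustic_model(frames)
# A high-energy frame scores far better under the vowel's model.
print(log_likelihood(model, "a", 5.0) > log_likelihood(model, "s", 5.0))  # → True
```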
Language Model

Language modeling is used in many natural language processing applications; in speech recognition, it tries to capture the properties of a language and to predict the next word in a speech sequence.

8. Basic Types of Speech Recognition Systems

1. Speaker-dependent: the user must provide samples of his or her speech before using the system; the recognizer must be trained before it can be used. This often requires the user to read a series of words and phrases so the computer can learn the user's voice. Speaker-dependent software works by learning the unique characteristics of a single person's voice. New users must first "train" the software by speaking to it so the computer can analyze how the person talks, which often means reading a few pages of text to the computer before the speech recognition software can be used.

2. Speaker-independent: no speaker enrollment is necessary; the software recognizes most users' voices with no training. Speaker-independent software is designed to recognize anyone's voice, so it is the only real option for applications such as interactive voice response systems, where businesses cannot ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.

Other types
1. Discrete speech recognition: the user must pause between words so that the recognizer can identify each separate word.
2. Continuous speech recognition: the recognizer can understand a normal rate of speaking.
3. Natural language: the recognizer not only understands the voice but can also return answers to the questions or other queries that are asked.

9. Approaches to ASR
• Template matching
• Knowledge-based (or rule-based) approach
• Statistical approach:
• Noisy channel model + machine learning

1. Template Matching

Template matching is speaker-dependent: the system matches the incoming voice against templates saved earlier, so it must be trained before use, and the user speaks the same words that are available in the template store. Recognition accuracy can be about 98 percent. The idea is to store examples of units (words, phonemes) and then find the example that most closely fits the input. Once features are extracted from the speech signal, recognition becomes "just" a complex similarity-matching problem, using solutions developed for all sorts of applications. This works well for discrete utterances and a single user, but it is hard to distinguish very similar templates, and accuracy quickly degrades when the input differs from the templates. Mitigating this degradation therefore requires:
• more subtle matching techniques
• multiple templates, which are aggregated

2. Rule-Based Approach

The rule-based (knowledge-based) approach is speaker-independent. It first processes the given voice input using LPC (Linear Predictive Coding), then attempts to find similarities between the expected input and the digitized input. Recognition accuracy for speaker-independent systems is somewhat lower than for speaker-dependent systems, usually between 90 and 95 percent. The approach uses knowledge of phonetics and linguistics to guide the search process; templates are replaced by rules expressing everything (anything) that might help to decode the signal:
• phonetics, phonology, phonotactics
• syntax
• pragmatics
A typical approach is based on a "blackboard" architecture: at each decision point, lay out the possibilities, then apply rules to determine which sequences are permitted. Performance is poor because of:
• the difficulty of expressing the rules
• the difficulty of making the rules interact
• the difficulty of knowing how to improve the system

3. Statistical Approach

The statistical approach can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools.
It is sometimes seen as an "anti-linguistic" approach; as Fred Jelinek (IBM, 1988) put it, "Every time I fire a linguist my system improves." The method is to collect a large corpus of transcribed speech recordings and train the computer to learn the correspondences ("machine learning"). At run time, statistical processes are applied to search through the space of all possible solutions and pick the statistically most likely one.
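The statistical search can be written as choosing the transcription W that maximizes P(A | W) · P(W), where P(A | W) comes from the acoustic model and P(W) from the language model. The sketch below illustrates this noisy-channel argmax; the candidate strings and all probabilities are made-up toy numbers, not outputs of real models.

```python
import math

# Toy stand-ins for the acoustic model P(A|W) and language model P(W).
acoustic = {"recognise speech": 0.40, "wreck a nice beach": 0.42}
language = {"recognise speech": 0.010, "wreck a nice beach": 0.0001}

def decode(candidates):
    """argmax over candidates of log P(A|W) + log P(W)."""
    return max(candidates,
               key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

# The acoustically slightly better candidate loses to the one the
# language model strongly prefers.
print(decode(acoustic))  # → recognise speech
```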
10. Process of Speech Recognition

Vocal tract: consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal cavity, and nasal pharynx.
Spectrum analysis: MFCCs (mel-frequency cepstral coefficients) are used to produce the voice features, and DTW (dynamic time warping) is used to select the pattern that best matches the database (implemented in MATLAB).

11. How Speech Recognition Works
• Divide the sound wave into evenly spaced blocks, transforming the PCM digital audio into a better acoustic representation.
• Process each block for important characteristics, such as strength across various frequency ranges, number of zero crossings, and total energy. Apply a "grammar" so the speech recognizer knows what phonemes to expect; a grammar could be anything from a context-free grammar to a full-blown language.
• Using this characteristic vector, attempt to associate each block with a phone, the most basic unit of speech, producing a string of phones; in other words, figure out which phonemes were spoken.
• Find the word whose model is the most likely match to the string of phones that was produced; in other words, convert the phonemes into words.

1. Speech Detection

The first task is to identify the presence of a speech signal. This is easy if the signal is clean; however, the signal frequently contains background noise from a noisy microphone, a fan running in the room, and so on, and the signals obtained in this project were in fact found to contain some noise. Two criteria were used to identify the presence of a spoken word: first, the total energy is measured, and second, the number of zero crossings is counted. Both were found to be necessary, as voiced sounds tend to have a high volume (and thus a high total energy) but a low overall frequency (and thus a low number of zero crossings), while unvoiced sounds were found to have a high frequency but a low volume. Only background noise was found to have both low energy and low frequency. The method successfully detected the beginning and end of the several words tested. Note that this is not sufficient for the general case, as fluent speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between the 'c' and 'q').

2. Blocking

The second task is blocking. Older speech recognition systems first attempted to detect where the phones would start and finish, and then blocked the signal by placing one phone in each block. However, phones can blend together in many circumstances, and this method generally could not detect the correct boundaries reliably. Most modern systems simply separate the signal into blocks of a fixed length. These blocks tend to overlap, so that phones which cross block boundaries are not missed. This project uses blocks which are 30 ms in length (containing 600 samples) and which shift in 10 ms increments.
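The blocking and two-criterion detection steps above can be sketched as follows. The 20 kHz sample rate follows from 600 samples per 30 ms block, but the thresholds and test signals are illustrative assumptions, not values from the original project.

```python
import numpy as np

RATE, BLOCK_LEN, BLOCK_SHIFT = 20000, 600, 200  # 30 ms blocks, 10 ms shift

def blocks(signal):
    """Overlapping fixed-length blocks, as in the blocking step."""
    return [signal[i:i + BLOCK_LEN]
            for i in range(0, len(signal) - BLOCK_LEN + 1, BLOCK_SHIFT)]

def is_speech(block, energy_thresh=5.0, zc_thresh=50):
    """Speech if high energy (voiced) OR many zero crossings (unvoiced)."""
    energy = float(np.sum(block ** 2))
    zero_crossings = int(np.sum(np.diff(np.signbit(block).astype(int)) != 0))
    return energy > energy_thresh or zero_crossings > zc_thresh

t = np.arange(BLOCK_LEN) / RATE
voiced   = 1.00 * np.sin(2 * np.pi * 120 * t)    # loud, low frequency
unvoiced = 0.05 * np.sin(2 * np.pi * 6000 * t)   # quiet, high frequency
hum      = 0.01 * np.sin(2 * np.pi * 60 * t)     # background noise: quiet and low
print(is_speech(voiced), is_speech(unvoiced), is_speech(hum))  # → True True False
```

Note that only the background noise fails both tests, matching the observation above that voiced sounds pass on energy, unvoiced sounds pass on zero crossings, and noise passes on neither.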
3. Spectrum Analysis

The next important step in processing the signal is to obtain a frequency spectrum of each block. The information in the frequency spectrum is often enough to identify the phone; its purpose is to reveal the formants, which are the peaks in the frequency spectrum. Vowels are often uniquely identified by their first two formants. This experiment showed that the identification of formants is not a trivial task.

One method of obtaining a frequency spectrum is to apply an FFT to each block. The resulting information can be examined manually to find the peaks, but it is quite noisy, which makes the task of identifying the peaks difficult for a computer. Very useful data can still be obtained, often by measuring the strength across various frequency ranges. In the frequency spectrum of one speaker saying the 's' in 'yes', the important feature is the presence of a peak in the 100-150 bin range (which scales to 3600-5400 Hz); this peak is a feature of the letter 's'. Each spectrum has a peak there, although it is at a different strength in each one (any data in the 0-10 bin range is likely to be noise). In many cases, the overall strength in that range is quite low compared with the strength of the lower frequencies; this is a feature of the voiced sounds, although the exact frequencies vary with the speaker. The important features visible in this spectrum are the existence of a formant in the 80-
100 range while the 'y' is spoken, and then the existence of formants at both ~70 and ~50 simultaneously while the 'e' is spoken. In the frequency spectrum produced by another speaker saying the 'ye' of 'yes', the 'y' and 'e' overlap substantially. Consonants will often take on the frequencies of the vowels which follow them, and must be identified by characteristics other than their frequencies alone; here, the 'y' may be identified by the transition from the higher frequency into the frequency of the vowel which follows.

Another method used to obtain a frequency spectrum is Linear Predictive Coding (LPC), the most successful method in widespread use today. The idea behind LPC is that the values of the signal can be expressed as a linear combination of the preceding values. That is, if s(i) is the amplitude at time i,

s(i) = a1*s(i-1) + a2*s(i-2) + ... + ap*s(i-p)

When the input data is filled in, this becomes a system of linear equations which can be solved to determine the values of a1 through ap. These values then produce a very noise-free spectrum, which clearly identifies the formants.

3. Other Features

Plosives (b, p, d, t, g, k) can generally be identified by a pause followed by a sudden increase in energy of short duration. Nasals (n, m, ng) are often characterized by a single formant of low frequency, and if followed by a vowel, their formants tend to have a wide spectrum. The 'h' is characterized by a building unvoiced sound followed by a sudden sustained increase in energy at the formants of the vowel which follows. Unvoiced fricatives (th, s, sh, f) are characterized by a low-energy, wide-band, high-frequency spectrum. Their voiced counterparts (dh, z, zh, v) have an additional formant in the low-frequency spectrum.
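The LPC recurrence above, s(i) = a1*s(i-1) + ... + ap*s(i-p), can be estimated by solving the resulting linear equations in the least-squares sense. This is a minimal sketch, not a production LPC routine (which would use windowing and the autocorrelation/Levinson-Durbin method); it uses a synthetic sinusoid, which exactly satisfies a two-coefficient recurrence s(i) = 2cos(w)*s(i-1) - s(i-2).

```python
import numpy as np

def lpc_coefficients(signal, p):
    """Least-squares estimate of a1..ap with s(i) ≈ sum_k a_k * s(i-k)."""
    rows = np.array([signal[i - p:i][::-1] for i in range(p, len(signal))])
    targets = signal[p:]
    coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return coeffs

# A pure sinusoid satisfies s(i) = 2*cos(w)*s(i-1) - s(i-2) exactly,
# so LPC with p = 2 should recover those two coefficients.
w = 2.0 * np.pi * 0.05
s = np.sin(w * np.arange(200))
a = lpc_coefficients(s, 2)
print(np.allclose(a, [2.0 * np.cos(w), -1.0]))  # → True
```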
Affricatives (j, ch) are often described as a plosive which turns into a fricative (d-zh and t-sh respectively). Glides, or semivowels (w, l, r, y), may be the most difficult to characterize, because they are highly situation-dependent. They are followed by vowels unless they appear at the end of a word, and behave much like a transition from another vowel into the vowel which follows. This project noted how the 'y' transitions from its characteristic frequency into the frequencies of the 'e' which follows it; there may be no clear distinction where one ends and the other begins.

4. Word Identification

Although this project used a very simple identification method to differentiate between two words, real word identification has many obstacles to overcome. Because the signal is divided into blocks of a set duration, we do not know how many blocks a given phone may occupy. Some phones may only be recognized as the transition from one phone to another. Some phones may be missing or improperly identified. All of these notions are captured by a model known as the Hidden Markov Model (HMM). An HMM is basically a finite automaton in which each transition has a probability associated with it. A given vocabulary word has an HMM which is designed to model the many possible strings of phones which may be produced by the utterance of the word. Each expected phone is
generally represented by a state in the HMM, while each possible phone at every stage has an arc. This means that a 'y' may be represented by either a 'y' arc or an 'i' arc, while both lead to the 'y' state. Self-loops account for the possibility of a phone stretching over several blocks. Missed phones are also allowed, as an arc may jump over a state. Each arc is then assigned a probability to complete the HMM. On an input signal, a dynamic programming algorithm called the Viterbi algorithm is applied to identify which HMM is the most likely match for the input signal.

12. Approaches to Speech Recognition
• Acoustic-phonetic approach
• Pattern recognition approach (HMM)
• Artificial intelligence approach (neural networks)

1. Pattern Recognition Approach

"A pattern is the opposite of a chaos; it is an entity, vaguely defined, that could be given a name." A pattern is an object, process, or event. A class (or category) is a set of patterns that share common attributes (features), usually from the same information source. During recognition (or classification), classes are assigned to the objects; a classifier is a machine that performs this task.

2. Neural Network Approach

The classifier is represented as a network of cells modeling the neurons of the human brain (the connectionist approach).
3. Language Model

13. Applications of Speech Processing
• Medical transcription
• Military
• Telephony and other domains
• Serving the disabled
• Home automation
• Automobile audio systems
• Telematics