August 21, 2013
SEARCH USING VOICE AND IMAGE RECOGNITION
Speech Recognition
CONTENTS
01 What is Voice?
02 Components of Sound
03 Why Voices are Different
04 Classification of Speech Sounds
05 Process of Speech Production
06 What is Voice Recognition?
07 ASR (Automatic Speech Recognition)
08 Types of ASR
09 Approaches to ASR
10 Process of Speech Recognition
11 How Speech Recognition Works
12 Approaches to Speech Recognition
13 Applications of Speech Processing
1. What is Voice?
The voice consists of sound made by a human being using the vocal folds for talking,
singing, laughing, crying, screaming, etc. The human voice is specifically that part of human
sound production in which the vocal folds (vocal cords) are the primary sound source. Generally
speaking, the mechanism for generating the human voice can be subdivided into three parts: the
lungs, the vocal folds within the larynx, and the articulators. The lungs (the pump) must produce
adequate airflow and air pressure to vibrate the vocal folds (this air pressure is the fuel of the voice).
The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into
audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length
and tension of the vocal folds to ‘fine tune’ pitch and tone. The articulators (the parts of the vocal
tract above the larynx consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound
emanating from the larynx and to some degree can interact with the laryngeal airflow to
strengthen it or weaken it as a sound source.
The vocal folds, in combination with the articulators, are capable of producing highly
intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger,
surprise, or happiness. Singers use the human voice as an instrument for creating music.
2. Components of Sound
There are nine components of sound, given below:
1. Music components
 Pitch
 Timbre
 Harmonics
2. Loudness
 Rhythm
3. Sound envelope components
 Attack
 Sustain
 Decay
4. Record and playback component
 Speed
Different Terms
1. Compressions, in which particles are crowded together, appear as upward curves in the line.
2. Rarefactions, in which particles are spread apart, appear as downward curves in the line.
Three characteristics are used to describe a sound wave: wavelength, frequency, and amplitude.
3. Wavelength: the distance from the crest of one wave to the crest of the next.
4. Frequency: the number of waves that pass a point each second.
5. Amplitude: the measure of the amount of energy in a sound wave.
6. Pitch: how high or low a sound seems. A bird makes a high-pitched sound; a lion makes a low-pitched one.
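The quantities above can be illustrated with a short sketch: given samples of a pure tone, the amplitude is the largest excursion from zero, and the frequency can be estimated from the zero-crossing count (each full cycle crosses zero twice). The sampling rate, tone, and amplitude below are invented for illustration.

```python
import math

# Sample one second of a 440 Hz sine wave (values are invented).
RATE = 8000          # samples per second (assumed)
FREQ = 440           # true frequency in Hz
AMP = 0.5            # true peak amplitude

samples = [AMP * math.sin(2 * math.pi * FREQ * i / RATE) for i in range(RATE)]

# Amplitude: the largest excursion from zero.
measured_amp = max(abs(s) for s in samples)

# Frequency: each cycle crosses zero twice, so over one second
# frequency ~= zero crossings / 2.
crossings = sum(1 for a, b in zip(samples, samples[1:])
                if a < 0 <= b or b < 0 <= a)
measured_freq = crossings / 2
```

Both recovered values land close to the true ones, which is the basic idea behind measuring amplitude and pitch from a recorded waveform.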
Sounds also are different in how loud and how soft they are.
The more energy the sound wave has the louder the sound seems. The intensity of a
sound is the amount of energy it has. You hear intensity as loudness.
Remember the amplitude, or height of a sound wave is a measure of the amount of
energy in the wave. So the greater the intensity of a sound, the greater the amplitude.
Pitch and loudness are two ways that sounds differ. Another way is quality.
Some sounds are pleasant and some are noise. Compare the two waves on the right: a pleasant
sound has a regular wave pattern, repeated over and over, while the waves of noise
are irregular, with no repeated pattern.
3. Why Voices are Different?
Voices differ because of
 INTENSITY (depends on amplitude)
 PITCH (depends on frequency)
 TONE (pleasant or unpleasant)
1. Amplitude is a measure of energy. The more energy a wave has, the higher its amplitude.
As amplitude increases, intensity also increases.
August21,
2013
SEARCH USING VOICEANDIMAGE RFECOGNITION
2. Intensity is the amount of energy a sound has over an area. The same sound is more
intense if you hear it in a smaller area. In general, we call sounds with a higher intensity
louder.
3. Pitch depends on the frequency of a sound wave. Frequency is the number of
wavelengths that fit into one unit of time.
4. Classification of Speech Sounds
One can make broad divisions such as voiced and unvoiced sounds, or become more
specific, such as front vowels, back vowels, semivowels, and so on.
The difference between voiced and unvoiced sounds becomes clear in these samples.
The first two blocks demonstrate a dominant low frequency sound wave, which is not present in
the third block. This frequency is produced by the vibration of the larynx, or voice box.
Although the exact frequency differs for each speaker (females tend to have a higher frequency),
the dominant presence of a low frequency sound wave is a surefire indicator of a voiced sound.
1. Voiced Sound
The vocal cords play an active role in the production of the sound, e.g. /a/, /e/, /i/. Voiced
sounds show a dominant low-frequency component.
2. Unvoiced Sound
When the vocal cords are inactive, the sound is called unvoiced, e.g. /s/, /f/. It is built up by
air pressure.
5. Process of Speech Production
6. What is Voice Recognition?
Voice recognition is the process of taking the spoken word as an input to a computer
program. It is the process of converting voice into electric signals. Signals transform into
CODING PATTERN.
Voice recognition is "the technology by which sounds, words or phrases spoken by
humans are converted into electrical signals, and these signals are transformed into coding
patterns to which meaning has been assigned". More generally, the concept could be called
"sound recognition".
In speech recognition, voice recognition is the ability of a computer, computer software
program, or hardware device to decode the human voice into digitized speech that can be
interpreted by the computer or hardware device. Voice recognition is commonly used to operate
a device, perform commands, or write without having to operate a keyboard or mouse, or press any
buttons.
7. ASR (Automatic Speech Recognition)
ASR is the process of converting an acoustic signal, captured by a microphone or telephone, into a set of
words. The recognized words can be final results, as in applications such as command and control,
data entry, and document preparation. They can also serve as input to further linguistic processing
in order to achieve speech understanding.
The first ASR device was used in 1952 and recognized single digits spoken by a user (it was
not computer driven). Today, ASR programs are used in many industries, including Healthcare,
Military (e.g. jets and helicopters), Telecommunications, and Personal computing (e.g. hands-free
computing).
Components of ASR
Acoustic Model
An acoustic model is created by taking audio recordings of speech, and their text
transcriptions, and using software to create statistical representations of the sounds that make up
each word. It is used by a speech recognition engine to recognize speech.
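As a toy illustration of those statistical representations (not a real acoustic model), the sketch below fits each phone a one-dimensional Gaussian from invented "training" feature values, then scores a new frame against every phone and picks the most likely one:

```python
import math

# Invented training data: one acoustic feature value per frame,
# grouped by the phone that produced it.
training = {
    "s": [0.82, 0.78, 0.85, 0.80],   # e.g. high-frequency energy ratio
    "a": [0.12, 0.15, 0.10, 0.14],
}

def gaussian_params(xs):
    # Mean and variance of the training frames for one phone.
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-6
    return mean, var

def log_likelihood(x, mean, var):
    # Log of the Gaussian density at x.
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

models = {phone: gaussian_params(xs) for phone, xs in training.items()}

frame = 0.79  # feature value of a new, unlabeled frame
best = max(models, key=lambda p: log_likelihood(frame, *models[p]))
```

Here `best` comes out as `"s"`, since 0.79 sits squarely in that phone's training range; real acoustic models use many-dimensional features and mixtures of Gaussians, but the scoring idea is the same.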
Language Model
Language modeling is used in many natural language processing applications, including
speech recognition. A language model tries to capture the properties of a language and to predict the next word in
a speech sequence.
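A minimal sketch of such a model is a bigram counter: count word pairs in a corpus, then predict the most likely successor of a word. The tiny corpus below is invented for illustration.

```python
from collections import Counter, defaultdict

# Invented corpus of spoken commands.
corpus = "call the office please call the house call the office".split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def predict_next(word):
    # Most frequent successor of `word` in the corpus.
    return bigrams[word].most_common(1)[0][0]

def probability(prev, nxt):
    # Maximum-likelihood estimate P(nxt | prev).
    total = sum(bigrams[prev].values())
    return bigrams[prev][nxt] / total
```

On this corpus, "call" is always followed by "the", and "the" is followed by "office" two times out of three, so the model predicts "office" with probability 2/3.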
8. Basic Types of Speech Recognition Systems
1. Speaker-dependent
The user must provide samples of his/her speech before using the system. The voice recognition
software must be trained before it can be used. This often requires that a user read a series of words and
phrases so the computer can learn the user's voice.
Speaker–dependent software works by learning the unique characteristics of a
single person's voice, in a way similar to voice recognition. New users must first "train" the
software by speaking to it, so the computer can analyze how the person talks. This often means
users have to read a few pages of text to the computer before they can use the speech recognition
software.
2. Speaker-independent
No speaker enrollment is necessary. The voice recognition software recognizes most users'
voices with no training.
Speaker–independent software is designed to recognize anyone's voice, so no training is
involved. This means it is the only real option for applications such as interactive voice response
systems — where businesses can't ask callers to read pages of text before using the system. The
downside is that speaker–independent software is generally less accurate than speaker–dependent
software.
Other types
1. Discrete speech recognition - The user must pause between words so
that the recognizer can identify each separate word.
2. Continuous speech recognition - The recognizer can understand a
normal rate of speaking.
3. Natural language - The recognizer not only understands the voice
but can also return answers to the questions or other queries being asked.
9. Approaches to ASR
 Template matching
 Knowledge-based (or rule-based) approach
 Statistical approach:
 Noisy channel model + machine learning
1. Template matching
Template matching is speaker-dependent: it matches the voice against already saved templates, so
the system must be trained first, and the user must speak the same words that are available as
templates. Recognition accuracy can be about 98 percent. The idea is to store examples of units
(words, phonemes) and then find the example that most closely fits the input: extract features from
the speech signal, and the problem becomes a complex similarity match, using solutions developed
for all sorts of applications. This works well for discrete utterances and a single user, but it is hard
to distinguish very similar templates, and performance quickly degrades when the input differs from
the templates. It therefore needs techniques to mitigate this degradation: more subtle matching
techniques, and multiple templates which are aggregated.
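The core of template matching can be sketched in a few lines: store one feature vector per word, then assign the input to whichever template is closest. The feature values below are invented; real systems use longer feature sequences and more subtle distance measures.

```python
import math

# Invented templates: one feature vector per vocabulary word.
templates = {
    "yes": [0.9, 0.2, 0.7],
    "no":  [0.1, 0.8, 0.3],
}

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features):
    # Return the word whose template is closest to the input.
    return min(templates, key=lambda w: distance(templates[w], features))

word = recognize([0.85, 0.25, 0.65])   # a noisy utterance of "yes"
```

The input differs slightly from the stored "yes" template but is still far closer to it than to "no", which is exactly the regime where template matching works; inputs far from every template are where it degrades.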
2. Rule-based approach
The rule-based approach is speaker-independent. It first processes the given voice as input, using
LPC (Linear Predictive Coding), and attempts to find similarities between the expected input and
the digitized input. Recognition accuracy for speaker-independent systems is somewhat less than
for speaker-dependent systems, usually between 90 and 95 percent. The approach uses knowledge
of phonetics and linguistics to guide the search process: templates are replaced by rules expressing
everything (anything) that might help to decode:
 Phonetics, phonology, phonotactics
 Syntax
 Pragmatics
A typical approach is based on a "blackboard" architecture:
 At each decision point, lay out the possibilities
 Apply rules to determine which sequences are permitted
Performance is poor due to the
 Difficulty of expressing rules
 Difficulty of making rules interact
 Difficulty of knowing how to improve the system
3. Statistical Approach
The statistical approach can be seen as an extension of the template-based approach, using more
powerful mathematical and statistical tools. It is sometimes seen as an "anti-linguistic" approach;
Fred Jelinek (IBM, 1988) famously said, "Every time I fire a linguist my system improves." The
idea is to collect a large corpus of transcribed speech recordings and train the computer to learn
the correspondences ("machine learning").
At run time, statistical processes are applied to search through the space of all possible
solutions, and the statistically most likely one is picked.
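In the noisy-channel formulation, that search amounts to picking the word maximizing log P(acoustics | word) + log P(word). The sketch below uses invented stand-in scores for an acoustic model and a language model:

```python
# Invented log-probabilities for three homophones of "to".
acoustic_logp = {"two": -1.2, "too": -1.0, "to": -1.1}   # log P(A | W)
language_logp = {"two": -2.5, "too": -3.5, "to": -0.5}   # log P(W)

# Noisy channel: choose the word with the best combined score.
best_word = max(acoustic_logp,
                key=lambda w: acoustic_logp[w] + language_logp[w])
```

Acoustically the three candidates are nearly indistinguishable; it is the language model's strong preference for "to" that decides the outcome, which is why statistical recognizers combine both scores.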
10. Process of Speech Recognition
Vocal Tract
The vocal tract consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal cavity, and nasal
pharynx.
Spectrum Analysis
MFCCs (mel-frequency cepstral coefficients) are used to produce the voice features; DTW
(dynamic time warping) selects the pattern that best matches the database (implemented in MATLAB).
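DTW can be sketched directly: it aligns two feature sequences of possibly different lengths, allowing stretches and compressions, and returns the minimal cumulative distance. A pure-Python version under those assumptions:

```python
def dtw(a, b):
    # Dynamic time warping distance between sequences a and b.
    INF = float("inf")
    n, m = len(a), len(b)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])        # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch a
                                 cost[i][j - 1],      # stretch b
                                 cost[i - 1][j - 1])  # step both
    return cost[n][m]
```

For example, `dtw([1, 2, 3], [1, 2, 2, 3])` is 0: the repeated 2 is absorbed by the alignment, which is exactly why DTW tolerates words spoken at different speeds.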
11. How Speech Recognition Works
 Divide the sound wave into evenly spaced blocks, transforming the PCM digital
audio into a better acoustic representation.
 Process each block for important characteristics, such as strength across various
frequency ranges, number of zero crossings, and total energy. Apply a "grammar"
so the speech recognizer knows what phonemes to expect; a grammar could be
anything from a context-free grammar to a full-blown language.
 Using this characteristic vector, attempt to associate each block with a phone,
the most basic unit of speech, producing a string of phones; that is, figure out
which phonemes were spoken.
 Find the word whose model is the most likely match to the string of phones
produced; that is, convert the phonemes into words.
1. Speech Detection
The first task is to identify the presence of a speech signal. This task is easy if the signal
is clean; however, the signal frequently contains background noise, resulting from a noisy
microphone, a fan running in the room, etc. The signals obtained were in fact found to contain
some noise. I used two criteria to identify the presence of a spoken word: first, the total energy
is measured, and second, the number of zero crossings is counted. Both of these were found to
be necessary, as voiced sounds tend to have a high volume (and thus a high total energy) but a
low overall frequency (and thus a low number of zero crossings), while unvoiced sounds were
found to have a high frequency but a low volume. Only background noise was found to have
both low energy and low frequency. The method successfully detected the beginning
and end of the several words tested. Note that this is not sufficient for the general case, as fluent
speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between
the 'c' and 'q').
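The two criteria above can be sketched as follows: a block counts as speech if it has high energy (voiced) or a high zero-crossing count (unvoiced); low on both counts is taken as background noise. The thresholds are invented for illustration and would need tuning against real signals.

```python
# Invented thresholds for illustration.
ENERGY_THRESHOLD = 0.5
ZCR_THRESHOLD = 10

def energy(block):
    # Total energy of the block (sum of squared samples).
    return sum(s * s for s in block)

def zero_crossings(block):
    # Number of sign changes between consecutive samples.
    return sum(1 for a, b in zip(block, block[1:]) if a * b < 0)

def is_speech(block):
    # Voiced sounds trip the energy test; unvoiced sounds trip
    # the zero-crossing test; noise trips neither.
    return (energy(block) > ENERGY_THRESHOLD
            or zero_crossings(block) > ZCR_THRESHOLD)
```

A loud, slowly oscillating block passes on energy alone, a quiet but rapidly alternating block passes on zero crossings, and a block that is both quiet and slow is rejected as noise.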
2. Blocking
The second task is blocking. Older speech recognition systems first attempted to detect where
the phones would start and finish, and then block the signal by placing one phone in each block.
However, phones can blend together in many circumstances, and this method generally could not
reliably detect the correct boundaries. Most modern systems simply separate the signal into
blocks of a fixed length. These blocks tend to overlap, so that phones which cross block
boundaries will not be missed. This project uses blocks which are 30 msec in length (containing
600 samples), and which shift by 10 msec increments.
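Using the figures quoted above (30 ms blocks of 600 samples, which implies a 20 kHz sampling rate, shifted in 10 ms increments), the blocking step can be sketched as:

```python
BLOCK = 600   # 30 ms at an assumed 20 kHz sampling rate
SHIFT = 200   # 10 ms shift, so consecutive blocks overlap by 20 ms

def make_blocks(signal):
    # Slice the signal into fixed-length, overlapping blocks.
    return [signal[i:i + BLOCK]
            for i in range(0, len(signal) - BLOCK + 1, SHIFT)]
```

A 1000-sample signal yields three blocks starting at samples 0, 200, and 400; the overlap ensures a phone straddling a block boundary still appears whole in some block.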
The next important step in the processing of the signal is to obtain a frequency spectrum
of each block. The information in the frequency spectrum is often enough to identify the phone.
The purpose of the frequency spectrum is to identify the formants, which are the peaks in the
frequency spectrum. Vowels are often uniquely identified by their first two formants. This
experiment has shown that the identification of formants is not a trivial task. One method to
obtain a frequency spectrum is to apply an FFT to each block. The resulting information can be
examined manually to find the peaks, but it is quite noisy, which makes it difficult for a
computer to identify the peaks. Very useful data can still be obtained; this is often done by
measuring the strength across various frequency ranges.
Consider the frequency spectrum of a different speaker saying the 's' in 'yes'. The important
feature to note is the presence of a peak in the 100-150 range (which scales to 3600-5400 Hz).
This peak is a feature of the letter 's'. Each spectrum has a peak there, although it is at a different
strength in each one (any data in the 0-10 range is likely to be noise). In many cases,
the overall strength in that range is quite low compared with the strength of the lower
frequencies.
This is a feature of the voiced sounds, although the exact frequencies vary with the
speaker. The important features visible in this spectrum are the existence of a formant in the
80-100 range while the 'y' is spoken, and then later the existence of formants at both ~70 and ~50
simultaneously while the 'e' is spoken.
This is the frequency spectrum produced by another speaker, while saying the 'ye' of yes.
Notice here that the 'y' and 'e' overlap substantially. Often, consonants take on the
frequencies of the vowels which follow them and must be identified by characteristics other than
their frequencies alone. Here, the 'y' may be identified by the transition from its higher
frequency into the frequency of the vowel which follows.
Another method used to obtain a frequency spectrum is Linear
Predictive Coding (LPC). This is the most successful method in widespread use today. The idea
behind LPC is that the values of the signal can be expressed as a linear combination of the
preceding values. That is, if s(i) is the amplitude at time i,
s(i) = a1*s(i-1) + a2*s(i-2) + ... + ap*s(i-p)
When the input data is filled in, this becomes a system of linear equations which can be
solved to determine the values of a1 through ap. These values then produce a very noise free
signal, which clearly identifies the formants.
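One common way to solve for those coefficients is the Levinson-Durbin recursion on the signal's autocorrelation sequence; a minimal pure-Python sketch of that route (not the only way to solve the system) looks like this:

```python
def autocorr(x, lag):
    # Autocorrelation of the signal at the given lag.
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def lpc(signal, order):
    # Levinson-Durbin recursion: returns [a1, ..., ap] such that
    # s(i) ~= a1*s(i-1) + ... + ap*s(i-p), as in the equation above.
    r = [autocorr(signal, k) for k in range(order + 1)]
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err                     # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(i):
            new_a[j] = a[j] - k * a[i - 1 - j]
        a = new_a
        err *= (1 - k * k)                # remaining prediction error
    return a
```

Fed a decaying exponential s(i) = 0.9^i, an order-1 fit recovers a1 ≈ 0.9, as expected, since each sample is 0.9 times the previous one.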
3. Other Features
Plosives (b, p, d, t, g, k) can generally be identified by a pause followed by a sudden
increase in energy of short duration. Nasals (n, m, ng), are often characterized by a single
formant of low frequency, and if followed by a vowel, their formants tend to have a wide
spectrum. The 'h' is characterized by a building unvoiced sound followed by a sudden sustained
increase in energy at the formants of the vowel which follows. Unvoiced fricatives (th, s, sh, f),
are characterized by a low energy, wide band, high frequency spectrum. Their voiced
counterparts (dh, z, zh, v) have an additional formant in the low frequency spectrum.
Affricates (j, ch) are often described as a plosive which turns into a fricative (d - zh and t - sh
respectively). Glides, or semivowels (w, l, r, y), may be the most difficult to characterize,
because they are highly situation dependent. They are followed by vowels, unless they appear at
the end of a word, and behave much like a transition from another vowel into the vowel which
follows it. In this project we noted how the 'y' transitions from its characteristic frequency into
the frequencies of the 'e' which follows it. There may be no clear distinction where one ends and
the other begins.
4. Word Identification
Although this project used a very simple identification method to differentiate between
two words, real word identification has many obstacles to overcome. Because we chose to
divide our signal into blocks of a set duration, we do not know how many blocks a given phone
may occupy. Some phones may only be recognized as the transition from one phone to another.
Some phones may be missing or improperly identified. All of these notions are captured by a
model known as the Hidden Markov Model (HMM). An HMM is essentially a finite automaton in
which each transition has a probability associated with it.
A given vocabulary word has an HMM which is designed to model the many possible
strings of phones which may be produced by the utterance of the word. Each expected phone is
generally represented by a state in the HMM, while each possible phone at every stage has an
arc. This means that a 'y' may be represented by a 'y' or an 'i' arc, while both lead to the 'y' state. Self
loops account for the possibility of a phone stretching over several blocks. Missed phones are
also allowed, as an arc may jump over a state. Each arc is then assigned a probability to complete
the HMM.
Then on an input signal, a dynamic programming algorithm, called the Viterbi
algorithm, is applied to identify which HMM is the most likely match for the input signal.
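The Viterbi search can be sketched as follows for a toy two-state model of the word "ye" ('y' then 'e'); all the transition, start, and emission probabilities below are invented for illustration:

```python
import math

states = ["y", "e"]
log = math.log
# Transition log-probabilities, with a self-loop on each state so a
# phone can stretch over several blocks.
trans = {("y", "y"): log(0.6), ("y", "e"): log(0.4),
         ("e", "y"): log(0.01), ("e", "e"): log(0.99)}
start = {"y": log(0.9), "e": log(0.1)}
# Per-block emission log-probabilities (one dict per observed block).
emits = [{"y": log(0.8), "e": log(0.2)},
         {"y": log(0.7), "e": log(0.3)},
         {"y": log(0.2), "e": log(0.8)},
         {"y": log(0.1), "e": log(0.9)}]

def viterbi():
    # Dynamic programming: best score of any path ending in each state.
    best = {s: start[s] + emits[0][s] for s in states}
    back = []
    for e in emits[1:]:
        prev_best = best
        best, ptr = {}, {}
        for s in states:
            p, score = max(((q, prev_best[q] + trans[(q, s)])
                            for q in states), key=lambda t: t[1])
            best[s] = score + e[s]
            ptr[s] = p
        back.append(ptr)
    # Backtrace from the best final state.
    path = [max(best, key=best.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

On these four blocks the algorithm assigns the first two to 'y' and the last two to 'e', which is the most likely alignment given the emission scores.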
12. Approaches to Speech Recognition
 Acoustic-Phonetic Approach
 Pattern Recognition Approach (HMM)
 Artificial Intelligence Approach (Neural Networks)
1. Pattern Recognition Approach
"A pattern is the opposite of a chaos; it is an entity vaguely defined, that could be given a
name."
A pattern is an object, process, or event. A class (or category) is a set of patterns that share
common attributes (features), usually from the same information source. During recognition (or
classification), classes are assigned to the objects. A classifier is a machine that performs this task.
2. Neural Network Approach
The classifier is represented as a network of cells modeling the neurons of the human brain
(the connectionist approach).
3. Language Model
13. Applications of Speech Processing
 Medical Transcription
 Military
 Telephony and other domains
 Serving the disabled
 Home automation
 Automobile audio systems
 Telematics

More Related Content

PDF
speech processing and recognition basic in data mining
PPT
Speech Recognition
PPT
Automatic speech recognition
PPT
Speech Recognition in Artificail Inteligence
PPTX
Speech Recognition
PPTX
Speech Recognition
speech processing and recognition basic in data mining
Speech Recognition
Automatic speech recognition
Speech Recognition in Artificail Inteligence
Speech Recognition
Speech Recognition

What's hot (20)

PPTX
Speech recognition system seminar
PPT
Voice Recognition
PPT
Hidden Markov Models with applications to speech recognition
PPT
Speech recognition
PPTX
Speech Signal Processing
DOCX
Speech Recognition
PPTX
Speaker recognition using MFCC
PPTX
Speech recognition An overview
PPTX
SPEECH RECOGNITION USING NEURAL NETWORK
PPTX
speech processing basics
PPTX
Speech Recognition Technology
PPT
Abstract of speech recognition
PPTX
Digital speech processing lecture1
PPTX
Speech Recognition Technology
PPT
Automatic speech recognition
PPSX
Speech recognition an overview
DOCX
speech enhancement
PDF
SPEECH CODING
PPTX
Automatic speech recognition system
PPTX
Speech recognition final presentation
Speech recognition system seminar
Voice Recognition
Hidden Markov Models with applications to speech recognition
Speech recognition
Speech Signal Processing
Speech Recognition
Speaker recognition using MFCC
Speech recognition An overview
SPEECH RECOGNITION USING NEURAL NETWORK
speech processing basics
Speech Recognition Technology
Abstract of speech recognition
Digital speech processing lecture1
Speech Recognition Technology
Automatic speech recognition
Speech recognition an overview
speech enhancement
SPEECH CODING
Automatic speech recognition system
Speech recognition final presentation
Ad

Similar to Automatic Speech Recognition (20)

PDF
ACHIEVING SECURITY VIA SPEECH RECOGNITION
PDF
Isolated English Word Recognition System: Appropriate for Bengali-accented En...
DOCX
A seminar report on speech recognition technology
PDF
An Introduction To Speech Sciences (Acoustic Analysis Of Speech)
PDF
Animal Voice Morphing System
PDF
An Introduction to Various Features of Speech SignalSpeech features
DOCX
PPTX
PPTX
Speech and Language Processing
PDF
Silent Sound Technology
PDF
High Level Speaker Specific Features as an Efficiency Enhancing Parameters in...
PPTX
Speech Analysis
PPTX
Automatic Speech Recognion
DOCX
Magazine Article
PDF
Ece speech-recognition-report
PPTX
DOCX
Final article
PPT
Phonetics
ACHIEVING SECURITY VIA SPEECH RECOGNITION
Isolated English Word Recognition System: Appropriate for Bengali-accented En...
A seminar report on speech recognition technology
An Introduction To Speech Sciences (Acoustic Analysis Of Speech)
Animal Voice Morphing System
An Introduction to Various Features of Speech SignalSpeech features
Speech and Language Processing
Silent Sound Technology
High Level Speaker Specific Features as an Efficiency Enhancing Parameters in...
Speech Analysis
Automatic Speech Recognion
Magazine Article
Ece speech-recognition-report
Final article
Phonetics
Ad

More from International Islamic University (20)

Recently uploaded (20)

PDF
Insiders guide to clinical Medicine.pdf
PPTX
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
PPTX
Pharmacology of Heart Failure /Pharmacotherapy of CHF
PPTX
Week 4 Term 3 Study Techniques revisited.pptx
PDF
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
PPTX
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
PDF
Classroom Observation Tools for Teachers
PPTX
PPH.pptx obstetrics and gynecology in nursing
PDF
STATICS OF THE RIGID BODIES Hibbelers.pdf
PDF
Business Ethics Teaching Materials for college
PDF
Complications of Minimal Access Surgery at WLH
PDF
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
PDF
TR - Agricultural Crops Production NC III.pdf
PPTX
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
PPTX
Cell Structure & Organelles in detailed.
PDF
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
PPTX
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
PPTX
Pharma ospi slides which help in ospi learning
PDF
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
PDF
O7-L3 Supply Chain Operations - ICLT Program
Insiders guide to clinical Medicine.pdf
Introduction to Child Health Nursing – Unit I | Child Health Nursing I | B.Sc...
Pharmacology of Heart Failure /Pharmacotherapy of CHF
Week 4 Term 3 Study Techniques revisited.pptx
Physiotherapy_for_Respiratory_and_Cardiac_Problems WEBBER.pdf
IMMUNITY IMMUNITY refers to protection against infection, and the immune syst...
Classroom Observation Tools for Teachers
PPH.pptx obstetrics and gynecology in nursing
STATICS OF THE RIGID BODIES Hibbelers.pdf
Business Ethics Teaching Materials for college
Complications of Minimal Access Surgery at WLH
Chapter 2 Heredity, Prenatal Development, and Birth.pdf
TR - Agricultural Crops Production NC III.pdf
PPT- ENG7_QUARTER1_LESSON1_WEEK1. IMAGERY -DESCRIPTIONS pptx.pptx
Cell Structure & Organelles in detailed.
grade 11-chemistry_fetena_net_5883.pdf teacher guide for all student
BOWEL ELIMINATION FACTORS AFFECTING AND TYPES
Pharma ospi slides which help in ospi learning
ANTIBIOTICS.pptx.pdf………………… xxxxxxxxxxxxx
O7-L3 Supply Chain Operations - ICLT Program

Automatic Speech Recognition

  • 1. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Speech Recognition
  • 2. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION CONTENTS 01 What is Voice 02 Component of Sound 03 Why Voices are Different 04 Classification of Speech Sound 05 Process of Speech Production 06 What is Voice Recognition 07 ASR(Automatic Speech Recognition) 08 Types of ASR 09 Approachesto Speech Recognition 10 Process of Speech Recognition 11 How Speech Recognition Works 12 Approachesof Speech recognition 13 Application of Speech Processing
  • 3. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 1.What is Voice? The voice consists of sound made by a human being using the vocal folds for talking, singing, laughing, crying, screaming, etc. The human voice is specifically that part of human sound production in which the vocal folds (vocal cords) are the primary sound source. Generally speaking, the mechanism for generating the human voice can be subdivided into three parts; the lungs, the vocal folds within the larynx, and the articulators. The lung (the pump) must produce adequate airflow and air pressure to vibrate vocal folds (this air pressure is the fuel of the voice). The vocal folds (vocal cords) are a vibrating valve that chops up the airflow from the lungs into audible pulses that form the laryngeal sound source. The muscles of the larynx adjust the length and tension of the vocal folds to ‘fine tune’ pitch and tone. The articulators (the parts of the vocal tract above the larynx consisting of tongue, palate, cheek, lips, etc.) articulate and filter the sound emanating from the larynx and to some degree can interact with the laryngeal airflow to strengthen it or weaken it as a sound source. The vocal folds, in combination with the articulators, are capable of producing highly intricate arrays of sound. The tone of voice may be modulated to suggest emotions such as anger, surprise, or happiness. Singers use the human voice as an instrument for creating music. 2.Componentsof Sound There are NINE(09) components of sound given below 1. Music components  Pitch  Timbre  Harmonics 2. Loudness  Rhythm 3. Sound envelope components  Attack  Sustain  Decay 4. Record and playback component  Speed
  • 4. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Different Terms 1. Compressions, in which particles are crowded together, appear as upward curves in the line. 2. Rarefactions, in which particles are spread apart, appear as downward curves in the line. Three characteristics are used to describe a sound wave. These are wavelength, frequency, and amplitude. 3. Wavelength; this is the distance from the crest of one wave to the crest of the next. 4. Frequency; this is the number of waves that pass a point in each second. 5. Amplitude; this is the measure of the amount of energy in a sound wave. 6. Pitch This is how high or low a sound seems. A bird makes a high pitch. A lion makes a low pitch.
  • 5. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION Sounds also are different in how loud and how soft they are. The more energy the sound wave has the louder the sound seems. The intensity of a sound is the amount of energy it has. You hear intensity as loudness. Remember the amplitude, or height of a sound wave is a measure of the amount of energy in the wave. So the greater the intensity of a sound, the greater the amplitude. Pitch and loudness are two ways that sounds are different. Another way is in quality. Some sounds are pleasant and some are a noise. Compare the two waves on the right. A pleasant sound has a regular wave pattern. The pattern is repeated over and over. But the waves of noise are irregular. They do not have a repeated pattern. 7.Why Voices are Different? Voices are different caused by  INTENSITY(depend on amplitude)  PITCH(frequency)  TONE(pleasant or unpleasant). 1. Amplitude is a measure of energy. The more energy a wave has, the higher its amplitude. As amplitude increases, intensity also increases.
  • 6. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 2. Intensity is the amount of energy a sound has over an area. The same sound is more intense if you hear it in a smaller area. In general, we call sounds with a higher intensity louder. 3. Pitch depends on the frequency of a sound wave. Frequency is the number of wavelengths that fit into one unit of time. Sounds also are different in how loud and how soft they are. The more energy the sound wave has the louder the sound seems. The intensity of a sound is the amount of energy it has. You hear intensity as loudness. Remember the amplitude, or height of a sound wave is a measure of the amount of energy in the wave. so the greater the intensity of a sound, the greater the amplitude. 4.Classification of Speech Sound One can make broad divisions such as voiced and unvoiced sound, or become more speci_c, such as front vowels, back vowels, semivowels, and so on. The difference between voiced and unvoiced sounds becomes clear in these samples. The first two blocks demonstrate a dominant low frequency sound wave, which is not present in the third block. This frequency is produced by the vibration of the larynx, or voice box. Although the exact frequency differs for each speaker (females tend to have a higher frequency), the dominant presence of a low frequency sound wave is a surefire indicator of a voiced sound. 1. Voiced Sound Vocal Chord play active role in the production of SOUND e.g. a/e/I. It has high frequency 2. Un Voiced Sound When Vocal Chord is Inactive Called UN VOICED SOUND e.g. s/f. It build up by pressure
  • 7. August21, 2013 SEARCH USING VOICEANDIMAGE RFECOGNITION 5.Process of Speech Production 6.What is voice recognition? Voice recognition is the process of taking the spoken word as an input to a computer program. It is the process of converting voice into electric signals. Signals transform into CODING PATTERN. Voice recognition is "the technology by which sounds, words or phrases spoken by humans are converted into electrical signals and these signals are transformed into coding patterns to which meaning has been assigned". While the concept could more generally is called "sound recognition". speech recognition, voice recognition is an ability of a computer, computer software program, or hardware device to decode the human voice into digitized speech that can be interpreted by the computer or hardware device. Voice recognition is commonly used to operate a device, perform commands, or write without having to operate a keyboard, mouse, or press any buttons 7.ASR (Automatic Speech Recognition) Process of converting acoustic signal captured by microphone or telephone to a set of words. Recognized words can be final results, as for applications such as commands and control, data entry and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding. First ASR device was used in 1952 and recognized single digits spoken by a user (it was not computer driven). Today, ASR programs are used in many industries, including Healthcare,
the military (e.g. jets and helicopters), telecommunications, and personal computing (e.g. hands-free computing).

Evaluation of ASR

Acoustic Model

An acoustic model is created by taking audio recordings of speech and their text transcriptions, and using software to create statistical representations of the sounds that make up each word. It is used by a speech recognition engine to recognize speech.
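As a rough illustration of what "statistical representations of the sounds" means, the sketch below fits a one-dimensional Gaussian per phone from labelled training frames and scores new frames against it. The phone labels and the single energy-like feature are illustrative assumptions; real acoustic models use Gaussian mixtures (or neural networks) over MFCC vectors.

```python
import numpy as np

def train_acoustic_model(labelled_frames):
    """labelled_frames: (phone, feature) pairs -> {phone: (mean, std)}."""
    model = {}
    for phone in {p for p, _ in labelled_frames}:
        feats = np.array([f for p, f in labelled_frames if p == phone])
        model[phone] = (float(feats.mean()), float(feats.std()) + 1e-6)
    return model

def log_likelihood(model, phone, feature):
    """Log-probability of the feature under the phone's Gaussian."""
    mean, std = model[phone]
    return -0.5 * np.log(2 * np.pi * std ** 2) - (feature - mean) ** 2 / (2 * std ** 2)

# Toy training data: a low-energy fricative 's' and a high-energy vowel 'a'.
frames = [("s", 0.9), ("s", 1.1), ("a", 4.8), ("a", 5.2)]
model = train_acoustic_model(frames)
# A high-energy frame scores far better under the vowel's model.
print(log_likelihood(model, "a", 5.0) > log_likelihood(model, "s", 5.0))  # → True
```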
Language Model

Language modeling is used in many natural language processing applications; in speech recognition, it tries to capture the properties of a language and to predict the next word in a speech sequence.

8. Basic Types of Speech Recognition Systems

1. Speaker-dependent: the user must provide samples of his or her speech before using the system; the recognizer must be trained before it can be used. This often requires the user to read a series of words and phrases so the computer can learn the user's voice. Speaker-dependent software works by learning the unique characteristics of a single person's voice. New users must first "train" the software by speaking to it so the computer can analyze how the person talks, which often means reading a few pages of text to the computer before the speech recognition software can be used.

2. Speaker-independent: no speaker enrollment is necessary; the software recognizes most users' voices with no training. Speaker-independent software is designed to recognize anyone's voice, so it is the only real option for applications such as interactive voice response systems, where businesses cannot ask callers to read pages of text before using the system. The downside is that speaker-independent software is generally less accurate than speaker-dependent software.

Other types
1. Discrete speech recognition: the user must pause between words so that the recognizer can identify each separate word.
2. Continuous speech recognition: the recognizer can understand a normal rate of speaking.
3. Natural language: the recognizer not only understands the voice but can also return answers to the questions or other queries that are asked.

9. Approaches to ASR
• Template matching
• Knowledge-based (or rule-based) approach
• Statistical approach:
• Noisy channel model + machine learning

1. Template Matching

Template matching is speaker-dependent: the system matches the incoming voice against templates saved earlier, so it must be trained before use, and the user speaks the same words that are available in the template store. Recognition accuracy can be about 98 percent. The idea is to store examples of units (words, phonemes) and then find the example that most closely fits the input. Once features are extracted from the speech signal, recognition becomes "just" a complex similarity-matching problem, using solutions developed for all sorts of applications. This works well for discrete utterances and a single user, but it is hard to distinguish very similar templates, and accuracy quickly degrades when the input differs from the templates. Mitigating this degradation therefore requires:
• more subtle matching techniques
• multiple templates, which are aggregated

2. Rule-Based Approach

The rule-based (knowledge-based) approach is speaker-independent. It first processes the given voice input using LPC (Linear Predictive Coding), then attempts to find similarities between the expected input and the digitized input. Recognition accuracy for speaker-independent systems is somewhat lower than for speaker-dependent systems, usually between 90 and 95 percent. The approach uses knowledge of phonetics and linguistics to guide the search process; templates are replaced by rules expressing everything (anything) that might help to decode the signal:
• phonetics, phonology, phonotactics
• syntax
• pragmatics
A typical approach is based on a "blackboard" architecture: at each decision point, lay out the possibilities, then apply rules to determine which sequences are permitted. Performance is poor because of:
• the difficulty of expressing the rules
• the difficulty of making the rules interact
• the difficulty of knowing how to improve the system

3. Statistical Approach

The statistical approach can be seen as an extension of the template-based approach, using more powerful mathematical and statistical tools.
It is sometimes seen as an "anti-linguistic" approach; as Fred Jelinek (IBM, 1988) put it, "Every time I fire a linguist my system improves." The method is to collect a large corpus of transcribed speech recordings and train the computer to learn the correspondences ("machine learning"). At run time, statistical processes are applied to search through the space of all possible solutions and pick the statistically most likely one.
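The statistical search can be written as choosing the transcription W that maximizes P(A | W) · P(W), where P(A | W) comes from the acoustic model and P(W) from the language model. The sketch below illustrates this noisy-channel argmax; the candidate strings and all probabilities are made-up toy numbers, not outputs of real models.

```python
import math

# Toy stand-ins for the acoustic model P(A|W) and language model P(W).
acoustic = {"recognise speech": 0.40, "wreck a nice beach": 0.42}
language = {"recognise speech": 0.010, "wreck a nice beach": 0.0001}

def decode(candidates):
    """argmax over candidates of log P(A|W) + log P(W)."""
    return max(candidates,
               key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

# The acoustically slightly better candidate loses to the one the
# language model strongly prefers.
print(decode(acoustic))  # → recognise speech
```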
10. Process of Speech Recognition

Vocal tract: consists of the laryngeal pharynx, oral pharynx, oral cavity, nasal cavity, and nasal pharynx.
Spectrum analysis: MFCCs (mel-frequency cepstral coefficients) are used to produce the voice features, and DTW (dynamic time warping) is used to select the pattern that best matches the database (implemented in MATLAB).

11. How Speech Recognition Works
• Divide the sound wave into evenly spaced blocks, transforming the PCM digital audio into a better acoustic representation.
• Process each block for important characteristics, such as strength across various frequency ranges, number of zero crossings, and total energy. Apply a "grammar" so the speech recognizer knows what phonemes to expect; a grammar could be anything from a context-free grammar to a full-blown language.
• Using this characteristic vector, attempt to associate each block with a phone, the most basic unit of speech, producing a string of phones; in other words, figure out which phonemes were spoken.
• Find the word whose model is the most likely match to the string of phones that was produced; in other words, convert the phonemes into words.

1. Speech Detection

The first task is to identify the presence of a speech signal. This is easy if the signal is clean; however, the signal frequently contains background noise from a noisy microphone, a fan running in the room, and so on, and the signals obtained in this project were in fact found to contain some noise. Two criteria were used to identify the presence of a spoken word: first, the total energy is measured, and second, the number of zero crossings is counted. Both were found to be necessary, as voiced sounds tend to have a high volume (and thus a high total energy) but a low overall frequency (and thus a low number of zero crossings), while unvoiced sounds were found to have a high frequency but a low volume. Only background noise was found to have both low energy and low frequency. The method successfully detected the beginning and end of the several words tested. Note that this is not sufficient for the general case, as fluent speech tends to have pauses, even in the middle of words (such as in the word 'acquire', between the 'c' and 'q').

2. Blocking

The second task is blocking. Older speech recognition systems first attempted to detect where the phones would start and finish, and then blocked the signal by placing one phone in each block. However, phones can blend together in many circumstances, and this method generally could not detect the correct boundaries reliably. Most modern systems simply separate the signal into blocks of a fixed length. These blocks tend to overlap, so that phones which cross block boundaries are not missed. This project uses blocks which are 30 ms in length (containing 600 samples) and which shift in 10 ms increments.
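The blocking and two-criterion detection steps above can be sketched as follows. The 20 kHz sample rate follows from 600 samples per 30 ms block, but the thresholds and test signals are illustrative assumptions, not values from the original project.

```python
import numpy as np

RATE, BLOCK_LEN, BLOCK_SHIFT = 20000, 600, 200  # 30 ms blocks, 10 ms shift

def blocks(signal):
    """Overlapping fixed-length blocks, as in the blocking step."""
    return [signal[i:i + BLOCK_LEN]
            for i in range(0, len(signal) - BLOCK_LEN + 1, BLOCK_SHIFT)]

def is_speech(block, energy_thresh=5.0, zc_thresh=50):
    """Speech if high energy (voiced) OR many zero crossings (unvoiced)."""
    energy = float(np.sum(block ** 2))
    zero_crossings = int(np.sum(np.diff(np.signbit(block).astype(int)) != 0))
    return energy > energy_thresh or zero_crossings > zc_thresh

t = np.arange(BLOCK_LEN) / RATE
voiced   = 1.00 * np.sin(2 * np.pi * 120 * t)    # loud, low frequency
unvoiced = 0.05 * np.sin(2 * np.pi * 6000 * t)   # quiet, high frequency
hum      = 0.01 * np.sin(2 * np.pi * 60 * t)     # background noise: quiet and low
print(is_speech(voiced), is_speech(unvoiced), is_speech(hum))  # → True True False
```

Note that only the background noise fails both tests, matching the observation above that voiced sounds pass on energy, unvoiced sounds pass on zero crossings, and noise passes on neither.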
3. Spectrum Analysis

The next important step in processing the signal is to obtain a frequency spectrum of each block. The information in the frequency spectrum is often enough to identify the phone; its purpose is to reveal the formants, which are the peaks in the frequency spectrum. Vowels are often uniquely identified by their first two formants. This experiment showed that the identification of formants is not a trivial task.

One method of obtaining a frequency spectrum is to apply an FFT to each block. The resulting information can be examined manually to find the peaks, but it is quite noisy, which makes the task of identifying the peaks difficult for a computer. Very useful data can still be obtained, often by measuring the strength across various frequency ranges. In the frequency spectrum of one speaker saying the 's' in 'yes', the important feature is the presence of a peak in the 100-150 bin range (which scales to 3600-5400 Hz); this peak is a feature of the letter 's'. Each spectrum has a peak there, although it is at a different strength in each one (any data in the 0-10 bin range is likely to be noise). In many cases, the overall strength in that range is quite low compared with the strength of the lower frequencies; this is a feature of the voiced sounds, although the exact frequencies vary with the speaker. The important features visible in this spectrum are the existence of a formant in the 80-
100 range while the 'y' is spoken, and then the existence of formants at both ~70 and ~50 simultaneously while the 'e' is spoken. In the frequency spectrum produced by another speaker saying the 'ye' of 'yes', the 'y' and 'e' overlap substantially. Consonants will often take on the frequencies of the vowels which follow them, and must be identified by characteristics other than their frequencies alone; here, the 'y' may be identified by the transition from the higher frequency into the frequency of the vowel which follows.

Another method used to obtain a frequency spectrum is Linear Predictive Coding (LPC), the most successful method in widespread use today. The idea behind LPC is that the values of the signal can be expressed as a linear combination of the preceding values. That is, if s(i) is the amplitude at time i,

s(i) = a1*s(i-1) + a2*s(i-2) + ... + ap*s(i-p)

When the input data is filled in, this becomes a system of linear equations which can be solved to determine the values of a1 through ap. These values then produce a very noise-free spectrum, which clearly identifies the formants.

3. Other Features

Plosives (b, p, d, t, g, k) can generally be identified by a pause followed by a sudden increase in energy of short duration. Nasals (n, m, ng) are often characterized by a single formant of low frequency, and if followed by a vowel, their formants tend to have a wide spectrum. The 'h' is characterized by a building unvoiced sound followed by a sudden sustained increase in energy at the formants of the vowel which follows. Unvoiced fricatives (th, s, sh, f) are characterized by a low-energy, wide-band, high-frequency spectrum. Their voiced counterparts (dh, z, zh, v) have an additional formant in the low-frequency spectrum.
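The LPC recurrence above, s(i) = a1*s(i-1) + ... + ap*s(i-p), can be estimated by solving the resulting linear equations in the least-squares sense. This is a minimal sketch, not a production LPC routine (which would use windowing and the autocorrelation/Levinson-Durbin method); it uses a synthetic sinusoid, which exactly satisfies a two-coefficient recurrence s(i) = 2cos(w)*s(i-1) - s(i-2).

```python
import numpy as np

def lpc_coefficients(signal, p):
    """Least-squares estimate of a1..ap with s(i) ≈ sum_k a_k * s(i-k)."""
    rows = np.array([signal[i - p:i][::-1] for i in range(p, len(signal))])
    targets = signal[p:]
    coeffs, *_ = np.linalg.lstsq(rows, targets, rcond=None)
    return coeffs

# A pure sinusoid satisfies s(i) = 2*cos(w)*s(i-1) - s(i-2) exactly,
# so LPC with p = 2 should recover those two coefficients.
w = 2.0 * np.pi * 0.05
s = np.sin(w * np.arange(200))
a = lpc_coefficients(s, 2)
print(np.allclose(a, [2.0 * np.cos(w), -1.0]))  # → True
```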
Affricatives (j, ch) are often described as a plosive which turns into a fricative (d-zh and t-sh respectively). Glides, or semivowels (w, l, r, y), may be the most difficult to characterize, because they are highly situation-dependent. They are followed by vowels unless they appear at the end of a word, and behave much like a transition from another vowel into the vowel which follows. This project noted how the 'y' transitions from its characteristic frequency into the frequencies of the 'e' which follows it; there may be no clear distinction where one ends and the other begins.

4. Word Identification

Although this project used a very simple identification method to differentiate between two words, real word identification has many obstacles to overcome. Because the signal is divided into blocks of a set duration, we do not know how many blocks a given phone may occupy. Some phones may only be recognized as the transition from one phone to another. Some phones may be missing or improperly identified. All of these notions are captured by a model known as the Hidden Markov Model (HMM). An HMM is basically a finite automaton in which each transition has a probability associated with it. A given vocabulary word has an HMM which is designed to model the many possible strings of phones which may be produced by the utterance of the word. Each expected phone is
generally represented by a state in the HMM, while each possible phone at every stage has an arc. This means that a 'y' may be represented by either a 'y' arc or an 'i' arc, while both lead to the 'y' state. Self-loops account for the possibility of a phone stretching over several blocks. Missed phones are also allowed, as an arc may jump over a state. Each arc is then assigned a probability to complete the HMM. On an input signal, a dynamic programming algorithm called the Viterbi algorithm is applied to identify which HMM is the most likely match for the input signal.

12. Approaches to Speech Recognition
• Acoustic-phonetic approach
• Pattern recognition approach (HMM)
• Artificial intelligence approach (neural networks)

1. Pattern Recognition Approach

"A pattern is the opposite of a chaos; it is an entity, vaguely defined, that could be given a name." A pattern is an object, process, or event. A class (or category) is a set of patterns that share common attributes (features), usually from the same information source. During recognition (or classification), classes are assigned to the objects; a classifier is a machine that performs this task.

2. Neural Network Approach

The classifier is represented as a network of cells modeling the neurons of the human brain (the connectionist approach).
3. Language Model

13. Applications of Speech Processing
• Medical transcription
• Military
• Telephony and other domains
• Serving the disabled
• Home automation
• Automobile audio systems
• Telematics