How Speech Recognition Works
By:
Taqi Shah
taqi.shajee@gmail.com
Speech Recognition
• Speech recognition is a very active research topic
today. Many full-scale speech recognition applications
are already being deployed to increase work
efficiency.
• Speech recognition uses several techniques to
"recognize" the human voice. It functions as a
pipeline that converts digital audio signals coming
from the sound card to recognized speech.
• These signals pass through several stages, where
various mathematical and statistical methods are
applied to figure out what is actually being said.
How It Works
• The voice input to the microphone goes to the
sound card. The output from the sound card,
digital audio, is processed using the FFT (Fast Fourier
Transform) and then refined using Hidden Markov
Models (HMMs) and other techniques.
• A built-in database is used to analyze what has
been spoken. At the final stage, results are fed back
to the database so that the system can adapt to
the speaker. The final recognized output then goes
back to the CPU.
Sounds Simple
• The user gives a voice command over the microphone, which is
passed to the sound card in the system. This analog signal is
sampled 16,000 times a second (16 kHz) and converted into digital
form using a technique called Pulse Code Modulation (PCM). The
resulting digital waveform is a stream of amplitude values that
looks like a wavy line when plotted.
• The speech recognition software can't figure out anything from
this raw stream. It first has to translate it into something it can
easily recognize, so it converts the signal into a set of discrete
frequency bands using a technique called the Windowed Fast Fourier
Transform (FFT).
• For this, a short window (frame) of the audio signal is taken every
1/100th of a second, and each frame is converted into the energy it
contains in each frequency band. The incoming stream is now a
sequence of frequency-band measurements, a form that can be used
by the speech recognizer.
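The sampling and windowed FFT steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 16 kHz rate and the 1/100th-of-a-second step come from the slides, while the 256-sample frame length, the Hann window, the synthetic test tone, and all function names are choices made for this sketch and not part of any particular recognizer.

```python
import cmath
import math

SAMPLE_RATE = 16_000          # samples per second, as stated in the slides
HOP = SAMPLE_RATE // 100      # a new frame every 1/100th of a second (160 samples)
FRAME = 256                   # assumed frame length; a power of two for the FFT

def pcm_sine(freq_hz, seconds):
    """Simulate 16-bit PCM samples of a pure tone (stand-in for microphone input)."""
    n = int(SAMPLE_RATE * seconds)
    return [round(32767 * math.sin(2 * math.pi * freq_hz * t / SAMPLE_RATE))
            for t in range(n)]

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def spectrogram(samples):
    """Return one magnitude spectrum (list of band energies) per frame."""
    hann = [0.5 - 0.5 * math.cos(2 * math.pi * i / (FRAME - 1)) for i in range(FRAME)]
    frames = []
    for start in range(0, len(samples) - FRAME + 1, HOP):
        windowed = [s * w for s, w in zip(samples[start:start + FRAME], hann)]
        spectrum = fft(windowed)
        # Keep only the first half: for real input the second half mirrors it.
        frames.append([abs(c) for c in spectrum[:FRAME // 2]])
    return frames

def peak_frequency(magnitudes):
    """Frequency in Hz of the strongest band in one frame."""
    k = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return k * SAMPLE_RATE / FRAME
```

For example, `spectrogram(pcm_sine(1000, 0.1))` yields one spectrum per 10 ms step, and `peak_frequency` applied to any of those frames reports the 1000 Hz band as the strongest.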
Sounds Simple
• The next stage involves recognizing these bands of frequencies.
For this, the speech recognition software has a database
containing thousands of reference frequency patterns, each
associated with a speech sound, or "phoneme".
• A phoneme is the smallest unit of sound that distinguishes one
word from another in a language or dialect. If one phoneme
replaces another in a word, the word takes on a different
meaning. For example, if the "b" in "bat" were replaced
by the phoneme "r", the word would change to "rat".
• The phoneme database is used to match the audio frequency
bands that were sampled. So, for example, if the incoming
frequency pattern sounds like a "t", the software will try to match
it to the corresponding phoneme in the database. Each entry is
tagged with a feature number, which is then assigned to the
incoming signal.
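The matching step can be illustrated with a nearest-neighbour lookup. This is a minimal sketch, assuming each database entry stores a short list of band energies; the table contents, the four-band layout, and the names `PHONEME_DB` and `closest_feature` are invented for illustration and do not come from a real recognizer.

```python
import math

# Hypothetical database: feature number -> (phoneme label, reference band energies).
# The numbers are made up; a real system would learn them from recorded speech.
PHONEME_DB = {
    0: ("sil", [0.0, 0.0, 0.0, 0.0]),
    1: ("t",   [0.1, 0.3, 0.8, 0.4]),
    2: ("s",   [0.0, 0.1, 0.4, 1.0]),
    3: ("aa",  [0.9, 0.7, 0.2, 0.1]),
}

def closest_feature(bands):
    """Return the feature number whose reference pattern is nearest to `bands`."""
    return min(PHONEME_DB, key=lambda n: math.dist(bands, PHONEME_DB[n][1]))

def label_frames(frames):
    """Assign a feature number to every incoming frame."""
    return [closest_feature(bands) for bands in frames]
```

Calling `label_frames` on a sequence of frames turns the stream of frequency bands into a stream of feature numbers, which is the form the later stages work on.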
Figuring Out The Right Sound
• There can be so many variations in sound due to how
words are spoken that it’s almost impossible to exactly
match an incoming sound to an entry in the database.
• For example, the "t" in "table" sounds different from the
"t" in, say, "stop". Not only that, but different people
pronounce the same word differently. To make
matters worse, the environment also adds its own
share of noise.
• Therefore, the software has to use complex techniques
to approximate the incoming sound and figure out
which phonemes are being used.
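One common way to approximate rather than match exactly is to score each candidate phoneme by how probable the incoming sound is under a statistical model of that phoneme. The sketch below assumes a simple model with one Gaussian per frequency band; the means, variances, and function names are hypothetical, and the slides do not say which technique a given recognizer uses.

```python
import math

# Hypothetical per-phoneme models: mean and variance of each frequency band.
MODELS = {
    "t": {"mean": [0.1, 0.3, 0.8, 0.4], "var": [0.02, 0.02, 0.05, 0.05]},
    "s": {"mean": [0.0, 0.1, 0.4, 1.0], "var": [0.01, 0.02, 0.05, 0.05]},
}

def log_likelihood(bands, model):
    """Log-probability of the observed bands under independent Gaussians."""
    total = 0.0
    for x, mu, var in zip(bands, model["mean"], model["var"]):
        total += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
    return total

def rank_phonemes(bands):
    """Return (phoneme, score) pairs, most likely first."""
    scores = {p: log_likelihood(bands, m) for p, m in MODELS.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because every phoneme gets a score instead of a yes/no answer, a noisy or unusually pronounced sound still produces a ranked list of candidates for later stages to choose from.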
Other Techniques
• There are many other complexities involved in
recognizing sound.
• For example, the software has to be able to judge when one
phoneme ends and the next one begins. For this, it uses
Hidden Markov Models (HMMs), a statistical technique that
estimates the most likely sequence of phonemes behind the
observed feature numbers. To figure out when speech starts
and stops, a speech recognizer has silence phonemes, which
are also assigned feature numbers.
• Some phonemes also depend on what comes before or after
them. For example, consider the two words "see" and "saw".
Here the vowels "ee" and "aw" intrude into the phoneme "s",
and you hear the vowels for a longer period than the "s".
To solve this problem, speech recognition software uses
triphones: phonemes modeled together with the phonemes
that surround them.
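Both ideas above can be sketched briefly. The first function uses the Viterbi algorithm, the standard way to find the most likely state sequence in an HMM, to decide where each phoneme begins and ends. The second builds triphone labels. All probabilities here are invented for illustration, the three-phoneme model is a toy, and the "left-center+right" label format is one common convention assumed for this sketch.

```python
import math

STATES = ["sil", "s", "ee"]   # "sil" is the silence phoneme
START = {"sil": 0.8, "s": 0.1, "ee": 0.1}
# Hypothetical probabilities of moving from one phoneme to the next.
TRANS = {
    "sil": {"sil": 0.6, "s": 0.3, "ee": 0.1},
    "s":   {"sil": 0.1, "s": 0.6, "ee": 0.3},
    "ee":  {"sil": 0.3, "s": 0.1, "ee": 0.6},
}
# Hypothetical P(feature number | phoneme): 0 is silence-like, 1 "s"-like, 2 "ee"-like.
EMIT = {
    "sil": {0: 0.8, 1: 0.1, 2: 0.1},
    "s":   {0: 0.1, 1: 0.7, 2: 0.2},
    "ee":  {0: 0.1, 1: 0.2, 2: 0.7},
}

def viterbi(observations):
    """Return the most likely phoneme for each observed feature number."""
    # best[state] = (log score of the best path ending in state, that path)
    best = {s: (math.log(START[s]) + math.log(EMIT[s][observations[0]]), [s])
            for s in STATES}
    for obs in observations[1:]:
        nxt = {}
        for s in STATES:
            score, path = max(
                ((best[p][0] + math.log(TRANS[p][s]), best[p][1]) for p in STATES),
                key=lambda sp: sp[0],
            )
            nxt[s] = (score + math.log(EMIT[s][obs]), path + [s])
        best = nxt
    return max(best.values(), key=lambda sp: sp[0])[1]

def to_triphones(phonemes):
    """Label each phoneme with its left and right neighbours (silence at the edges)."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]
```

For the feature sequence `[0, 0, 1, 1, 2, 2, 2, 0]`, `viterbi` returns two frames of silence, two of "s", three of "ee", and a final frame of silence, so the phoneme boundaries fall where the labels change. `to_triphones(["s", "ee"])` returns `["sil-s+ee", "s-ee+sil"]`, which lets the "s" in "see" be modeled separately from the "s" in "saw".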
Other Techniques
• In another technique called pruning, the software
generates several hypotheses about what could have
been spoken in a given utterance. It scores each
hypothesis, discards ("prunes") the ones with the
lower scores, and takes the one with the highest
score as the result.
• This is the essence of how speech recognition
works, though there are many other complexities
involved. The technology holds great promise for
the future.
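The pruning step described above is often implemented as a beam search, where only the best few hypotheses survive after each step. The following is a minimal sketch under that assumption; the candidate words, their log scores, and the function name `beam_search` are made up for illustration.

```python
import heapq

def beam_search(steps, beam_width=2):
    """Build hypotheses word by word, pruning to `beam_width` after each step.

    `steps` is a list with one dict per position in the utterance,
    mapping each candidate word to its (hypothetical) log score.
    Returns the surviving (words, total score) pairs, best first.
    """
    beam = [((), 0.0)]  # start with one empty hypothesis
    for candidates in steps:
        expanded = [(words + (word,), score + word_score)
                    for words, score in beam
                    for word, word_score in candidates.items()]
        # Pruning: keep only the highest-scoring hypotheses.
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[1])
    return beam

# Hypothetical scores for two positions in an utterance.
steps = [
    {"recognize": -0.4, "wreck a nice": -1.2},
    {"speech": -0.3, "beach": -1.5},
]
best_words, best_score = beam_search(steps)[0]
```

Here `best_words` is `("recognize", "speech")`. Keeping only a small beam trades a little accuracy for a large saving in work, since low-scoring hypotheses are never extended.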
