Speech recognition system seminar

SPEECH RECOGNITION
SYSTEMS

TWINKLE SAHU
CSE 6TH SEM

INTRODUCTION
• Speech recognition is the process by which a computer
takes a speech signal (recorded using a microphone)
and converts it into words in real time. The software
that carries out the steps involved is known as a
‘Speech Recognition (SR) System’.
• SR systems are usually implemented as dictation
software and intelligent assistants on personal
computers, smartphones, web browsers and many
other devices.

DESIGN OF A SR
SYSTEM
SR systems have to deal with a large number of challenges,
such as:
• The speaker’s voice is often accompanied by
surrounding noise, which makes accurate
recognition difficult.
• A speaker may use a large number of different words, and
all of these words have to be accurately recognized.
• Accent varies from person to person, which is
a major challenge.
• A speaker may speak very quickly, and each of
the words spoken still has to be recognized
individually and accurately.

TYPES OF SR SYSTEMS
• Speaker-dependent SR systems: Work by learning
the unique characteristics of a single person’s voice,
so they have to be trained by that speaker.

• Speaker-independent SR systems: Designed to
recognize anyone’s voice, so no training by the individual user is required.

BASIC PRINCIPLES OF
SPEECH RECOGNITION
• The smallest unit of sound in a spoken language is known as a
phoneme.
• The English language contains approximately 44
phonemes, covering all the vowel and
consonant sounds that we use in speech.
• For example, a typical word such as
moon can be broken down into three
phonemes: m, ue, n.

• To interpret speech we must have a way of
identifying the components of spoken words;
phonemes act as identifying markers within speech.
• An algorithm is then needed to interpret the
speech further. The Hidden Markov Model (HMM) is a
mathematical model commonly used for
this.
• To create a speech recognition engine, a large
database of models is created so that each
phoneme can be matched.
• When a comparison is performed, the stored model
that most likely matches the spoken
phoneme is selected, and further
computations are performed on the result.

COMPONENTS OF SPEECH
RECOGNITION
• Corpus Collection :
A database of speech data that is built from
multiple speech samples.

• Corpus collection construction for a speaker-dependent SR system.

• Corpus collection construction for a speaker-independent SR system.

• Signal Analyzer :
Analyses the speech signal and removes the background
noise, thus focusing only on the speaker’s speech.

• Acoustic Model : Identifies phonemes from the speech
sample using a probability-based mathematical model.

• Language Model : Identifies words and thus
sentences uttered by the speaker from the
phonemes by making use of a dictionary file and
grammar file.

PROCESS OF SPEECH
RECOGNITION
• Speech input: “PAIN……”
• Speech Analyzer: cleans the speech signal and passes it on.
• Acoustic Model (trained Hidden Markov Model): identifies the
correct phoneme sequence /p/--/ae/--/n/.
• Language Model: matches /p/--/ae/--/n/ against the dictionary file
and the grammar file, giving the word “pain”.
• Text output: pain
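
The flow above can be sketched in Python as a chain of three stages. This is only an illustrative sketch: the function names and the tiny DICTIONARY table are assumptions made for this example, and the analyzer and acoustic model are stubs rather than real signal processing or a trained HMM. The phoneme symbols are the ones used on the slides.

```python
# Hypothetical pronunciation dictionary: phoneme sequence -> word.
# A real dictionary file would hold many thousands of entries.
DICTIONARY = {
    ("p", "ae", "n"): "pain",
    ("m", "ue", "n"): "moon",
}


def speech_analyzer(signal):
    """Stub for noise removal: the input is passed through unchanged."""
    return signal


def acoustic_model(clean_signal):
    """Stub for a trained HMM: the 'signal' is already a list of phoneme labels."""
    return tuple(clean_signal)


def language_model(phonemes):
    """Look the phoneme sequence up in the dictionary; None if it is unknown."""
    return DICTIONARY.get(phonemes)


def recognize(signal):
    """Run the three stages in order and return the recognized word."""
    return language_model(acoustic_model(speech_analyzer(signal)))


print(recognize(["p", "ae", "n"]))  # pain
```

Keeping each stage as a separate function mirrors the component boundaries on the slides, so any one stub could be swapped for a real implementation without touching the others.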

HIDDEN MARKOV MODEL
• Markov models are an excellent way of abstracting
simple concepts into a relatively easily computable
form.
• They are used in applications ranging from data compression to sound recognition.

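From this graph we can create sequences
such as:
N1 N2 N3
N1 N2 N2 N2 N3 N3 N3 N3 N3
N1 N1 N2 N2 N3

The probability of each sequence is the product of the
transition probabilities along its path, with one factor per state visited:

N1 N2 N3 = 0.4 x 0.8 x 0.5 = 0.16

N1 N2 N2 N2 N3 N3 N3 N3 N3 = 0.4 x 0.2 x 0.2 x 0.8 x
0.5 x 0.5 x 0.5 x 0.5 x 0.5
= 0.0004

N1 N1 N2 N2 N3 = 0.6 x 0.4 x 0.2 x 0.8 x 0.5
= 0.0192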
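
A calculation of this kind can be done mechanically. The sketch below is a minimal illustration, assuming the transition probabilities that the products above imply (N1 stays with 0.6 or moves on with 0.4, N2 stays with 0.2 or moves on with 0.8, N3 stays or leaves with 0.5 each). The TRANSITIONS table, the "END" marker and the function name are assumptions for this example, since the graph itself is not reproduced in the text.

```python
# Transition probabilities inferred from the worked products in the text.
# "END" is a hypothetical marker for leaving the final state.
TRANSITIONS = {
    ("N1", "N1"): 0.6, ("N1", "N2"): 0.4,
    ("N2", "N2"): 0.2, ("N2", "N3"): 0.8,
    ("N3", "N3"): 0.5, ("N3", "END"): 0.5,
}


def sequence_probability(states):
    """Multiply the probability of every step, including the final exit."""
    path = list(states) + ["END"]
    prob = 1.0
    for current, following in zip(path, path[1:]):
        # A step that is not in the graph has probability zero.
        prob *= TRANSITIONS.get((current, following), 0.0)
    return prob


for seq in (["N1", "N2", "N3"], ["N1", "N1", "N2", "N2", "N3"]):
    print(" ".join(seq), "=", round(sequence_probability(seq), 4))
```

Each sequence contributes exactly one factor per state visited, which is a quick way to check a hand calculation.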

This accommodates pronunciations such as:
t ow m aa t ow - British English
t ah m ey t ow - American English
t ah m ey t a - Possible pronunciation when speaking quickly

With sentences such as:
I like apple juice - Very probable
I like tomato juice - Very improbable!
I hate apple juice - Relatively improbable
I hate tomato juice - Relatively probable
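
One way to see how a Markov model over words ranks sentences like these is a bigram sketch, where each word depends only on the word before it. The probabilities in BIGRAMS are invented purely for illustration (they are not measured from any corpus), and the function name is an assumption for this example.

```python
# Hypothetical conditional probabilities P(next word | current word).
# The numbers are made up for illustration only.
BIGRAMS = {
    ("I", "like"): 0.7, ("I", "hate"): 0.3,
    ("like", "apple"): 0.9, ("like", "tomato"): 0.1,
    ("hate", "apple"): 0.3, ("hate", "tomato"): 0.7,
    ("apple", "juice"): 1.0, ("tomato", "juice"): 1.0,
}

SENTENCES = [
    "I like apple juice",
    "I like tomato juice",
    "I hate apple juice",
    "I hate tomato juice",
]


def sentence_probability(sentence):
    """Multiply the bigram probability of each adjacent word pair."""
    words = sentence.split()
    prob = 1.0
    for previous, following in zip(words, words[1:]):
        prob *= BIGRAMS.get((previous, following), 0.0)
    return prob


# Most probable sentence first.
ranked = sorted(SENTENCES, key=sentence_probability, reverse=True)
for sentence in ranked:
    print(f"{sentence}: {sentence_probability(sentence):.2f}")
```

With these made-up numbers, "I like apple juice" scores highest and "I like tomato juice" lowest, with the two "hate" sentences in between.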

• The Markov model makes speech recognition
systems more intelligent, i.e. they can accurately
differentiate between similar-sounding phrases, as in
the case:
James's school...
James is cool
• In simpler Markov models, the state is directly visible
to the observer.
• In a hidden Markov model, the state is not directly
visible, but the output, which depends on the state, is
visible.
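
Recovering the hidden states from the visible output is commonly done with the Viterbi algorithm. The following is a minimal sketch under toy assumptions: the two phoneme states, the "loud"/"quiet" observations and every probability in START, TRANS and EMIT are hypothetical values chosen for illustration, not parameters of a trained model.

```python
# Toy HMM: hidden states are phonemes, visible outputs are coarse frame labels.
# All probabilities below are hypothetical.
STATES = ["ae", "n"]
START = {"ae": 0.6, "n": 0.4}
TRANS = {"ae": {"ae": 0.7, "n": 0.3}, "n": {"ae": 0.4, "n": 0.6}}
EMIT = {"ae": {"loud": 0.8, "quiet": 0.2}, "n": {"loud": 0.3, "quiet": 0.7}}


def viterbi(observations, states, start, trans, emit):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] holds the most probable path that ends in state s so far.
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            # Pick the predecessor that maximizes the probability of reaching s.
            prob, path = max(
                ((best[p][0] * trans[p][s], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            step[s] = (prob * emit[s][obs], path + [s])
        best = step
    return max(best.values(), key=lambda item: item[0])


probability, path = viterbi(["loud", "loud", "quiet"], STATES, START, TRANS, EMIT)
print(path, round(probability, 6))  # ['ae', 'ae', 'n'] 0.056448
```

Only the observation labels are passed in; the state sequence is inferred, which is exactly the sense in which the states are hidden.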

PERFORMANCE OF A SR
SYSTEM
• Accuracy is usually rated with word error rate (WER),
whereas speed is measured with the real-time
factor.
• Other measures of accuracy include Single Word
Error Rate (SWER) and Command Success Rate
(CSR).
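
As a rough sketch of how the two main measures are computed: WER is the number of word substitutions, deletions and insertions needed to turn the reference transcript into the recognized one, divided by the number of words in the reference, and the real-time factor is processing time divided by audio duration. The function names below are assumptions for this example, and splitting on whitespace is a simplification of real transcript normalization.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edits needed to turn the first i reference words
    # into the first j hypothesis words (word-level Levenshtein distance).
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


def real_time_factor(processing_seconds, audio_seconds):
    """Values below 1.0 mean recognition runs faster than the audio plays."""
    return processing_seconds / audio_seconds


print(word_error_rate("I like apple juice", "I like tomato juice"))  # 0.25
print(real_time_factor(2.0, 4.0))  # 0.5
```

Because insertions count as errors, WER can exceed 1.0 when the recognizer outputs many more words than were spoken.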

Factors affecting the accuracy of an SR system:
• Vocabulary size and confusability
• Speaker dependence vs. independence
• Isolated, discontinuous, or continuous speech
• Task and language constraints
• Read vs. spontaneous speech
• Adverse conditions

APPLICATIONS
• Health Care
• Military - High-Performance Aircraft
- Air Traffic Control Systems
• Telephony - Smartphones
- Customer Helpline Services
• Personal Computers

SIRI AND GOOGLE
NOW

Siri is an intelligent personal assistant
developed by Apple.

Google Now is an intelligent
personal assistant developed by
Google.

Both use a combination of speaker-dependent
and speaker-independent SR systems.

CONCLUSION
• Speech recognition systems are an indispensable
part of the ever-advancing field of human-computer interaction.
• More research is needed to tackle the various
challenges.

Thank You!
