PRINCIPLES OF SPEECH RECOGNITION
Speaker recognition can be classified into identification and verification. Speaker identification
is the process of determining which registered speaker provides a given utterance. Speaker
verification, on the other hand, is the process of accepting or rejecting the identity claim of a
speaker. Figure 1.1 shows the basic structures of speaker identification and verification systems.
The two technologies, identification and verification, each have their own advantages and
disadvantages and may require different treatment and techniques; the choice between them is
application-specific. At the highest level, all speaker recognition systems contain two main
modules (refer to Figure 1.1): feature extraction and feature matching. Feature extraction is the
process that extracts a small amount of data from the voice signal that can later be used to
represent each speaker. Feature matching is the procedure that identifies the unknown speaker by
comparing the features extracted from his or her voice input with those from a set of known
speakers. We will discuss each module in detail in later sections.
(a) Speaker identification
[Figure 1.1(a): input speech → feature extraction → similarity against each reference model (Speaker #1 … Speaker #N) → maximum selection → identification result (Speaker ID)]
(b) Speaker verification
Figure 1.1 Basic structures of a speaker recognition system
All speaker recognition systems operate in two distinct phases. The first is referred to as
the enrolment or training phase, while the second is referred to as the operational or testing
phase. In the training phase, each registered speaker has to provide samples of their speech so that
the system can build or train a reference model for that speaker. In the case of speaker verification
systems, a speaker-specific threshold is also computed from the training samples. In the
testing phase, the input speech is matched against the stored reference model(s) and a recognition
decision is made.
Speech signals recorded in the training and testing sessions can differ greatly for many
reasons: a person's voice changes over time, with health (e.g. when the speaker has a cold),
with speaking rate, and so on. There are also factors beyond speaker variability that
present a challenge to speaker recognition technology, such as acoustic noise and
variations in the recording environment (e.g. the speaker using different telephone
handsets).
[Figure 1.1(b): input speech → feature extraction → similarity against the reference model of the claimed speaker ID (Speaker #M) → threshold decision → verification result (Accept/Reject)]
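The two phases described above can be sketched in code. This is a minimal illustration, not the project's implementation: the `enroll`/`verify` functions, the mean-vector "model", and the margin-based threshold are all simplifying assumptions (a real system would train the VQ codebooks described later), but the flow — build a model and threshold from training samples, then match test input against the claimed speaker's model — is the same.

```python
import numpy as np

# Speaker ID -> (reference model, speaker-specific decision threshold).
models = {}

def enroll(speaker_id, training_features, margin=1.5):
    """Training phase: build a reference model and a threshold for one speaker.

    The "model" here is just the mean feature vector, a stand-in for a real
    reference model; the threshold is the mean training distance times an
    assumed safety margin.
    """
    feats = np.asarray(training_features, dtype=float)
    model = feats.mean(axis=0)
    dists = np.linalg.norm(feats - model, axis=1)
    models[speaker_id] = (model, dists.mean() * margin)

def verify(speaker_id, test_features):
    """Testing phase: accept the identity claim if the average distance
    from the test features to the claimed speaker's model is below the
    speaker-specific threshold."""
    model, threshold = models[speaker_id]
    feats = np.asarray(test_features, dtype=float)
    return np.linalg.norm(feats - model, axis=1).mean() <= threshold
```

For identification rather than verification, the same matching step would instead be run against every registered speaker's model and the best match selected.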
SPEECH ANALYSIS
OVERVIEW
Phonetics is a branch of the linguistic sciences. It is concerned with the sounds produced by the
human vocal organs and, more specifically, the sounds used in human speech. One
important aspect of phonetics research is the instrumental analysis of speech, often referred
to as experimental or instrumental phonetics.
The Speech Analysis Series is a series of articles examining different aspects of
presentation analysis. It covers how to study a speech and how to deliver an effective
speech evaluation; later articles examine Toastmasters evaluation contests and speech
evaluation forms and resources.
This document describes how to build a simple yet complete and representative
speech recognition system. Such a system has potential in many security
applications. For example, users might have to speak a PIN (Personal Identification Number)
to gain access to a laboratory door, or speak their credit card number
over the telephone line to verify their identity. By checking the voice characteristics of the
input utterance with a speech recognition system similar to the one we will describe,
the system can add an extra level of security.
NOTE: The sounds in this demo are downsampled to 8-bit, 8000 Hz. You may want to widen
your browser's window to view the images.
FEATURE EXTRACTION
OVERVIEW
The purpose of this module is to convert the speech waveform to some type of
parametric representation (at a considerably lower information rate) for further analysis and
processing. This is often referred to as the signal-processing front end. The speech signal is a
slowly time-varying signal (it is called quasi-stationary). An example of a speech signal is
shown in Figure 2. When examined over a sufficiently short period of time (between 5 and
100 ms), its characteristics are fairly stationary. However, over longer periods of time (on
the order of 1/5 of a second or more) the signal characteristics change to reflect the different
speech sounds being spoken. Therefore, short-time spectral analysis is the most common
way to characterize the speech signal.
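Short-time analysis can be sketched as follows. This is a minimal illustration assuming NumPy and a synthetic tone in place of real speech; the 25 ms frame length and 10 ms hop are typical but assumed values, not ones specified by this document.

```python
import numpy as np

def frame_signal(signal, fs, frame_ms=25.0, hop_ms=10.0):
    """Slice a signal into overlapping short-time frames.

    Within each 25 ms frame the signal is treated as stationary; the 10 ms
    hop gives consecutive frames substantial overlap.
    """
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop : i * hop + frame_len]
                     for i in range(n_frames)])

def short_time_spectrum(frames):
    """Hamming-window each frame and take its magnitude spectrum."""
    window = np.hamming(frames.shape[1])
    return np.abs(np.fft.rfft(frames * window, axis=1))

fs = 8000                              # 8 kHz, as in the demo sounds
t = np.arange(fs) / fs                 # one second of signal
signal = np.sin(2 * np.pi * 440 * t)   # a 440 Hz tone as a stand-in for speech
frames = frame_signal(signal, fs)      # each row is one 25 ms frame
spectra = short_time_spectrum(frames)  # one magnitude spectrum per frame
```

Stacking the per-frame spectra over time gives the familiar spectrogram view of the signal, which is the starting point for the cepstral features used in systems like this one.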
FEATURE MATCHING
OVERVIEW
The problem of speech recognition belongs to a much broader topic in science and
engineering known as pattern recognition. The goal of pattern recognition is to classify
objects of interest into one of a number of categories or classes. The objects of interest are
generically called patterns, and in our case they are sequences of acoustic vectors extracted
from an input speech signal using the techniques described in the previous section. The classes here
refer to individual speakers. Since the classification procedure in our case is applied to
extracted features, it can also be referred to as feature matching.
Furthermore, if there exists a set of patterns whose individual classes are
already known, then one has a problem in supervised pattern recognition. These patterns
comprise the training set and are used to derive a classification algorithm. The remaining
patterns are then used to test the classification algorithm; these patterns are collectively
referred to as the test set. If the correct classes of the individual patterns in the test set are
also known, then one can evaluate the performance of the algorithm.
The state-of-the-art feature matching techniques used in speaker recognition include
Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector
Quantization (VQ). In this project, the VQ approach will be used, due to its ease of
implementation and high accuracy. VQ is a process of mapping vectors from a large vector
space to a finite number of regions in that space. Each region is called a cluster and can be
represented by its center, called a codeword. The collection of all codewords is called a
codebook.
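A codebook can be trained by binary-splitting (LBG-style) clustering: start from the global centroid, repeatedly split each codeword into two perturbed copies, and refine with k-means (Lloyd) iterations. The sketch below is a simplified stand-in for the clustering algorithm the document assigns to Section 4.2; the perturbation factor `eps` and the iteration count are assumed values.

```python
import numpy as np

def train_codebook(vectors, codebook_size=4, iters=20, eps=0.01):
    """Grow a VQ codebook by binary splitting, LBG-style.

    codebook_size is assumed to be a power of two, since each pass doubles
    the codebook. The multiplicative (1 +/- eps) perturbation is one common
    choice; additive jitter works as well.
    """
    vectors = np.asarray(vectors, dtype=float)
    codebook = vectors.mean(axis=0, keepdims=True)  # start: one global centroid
    while len(codebook) < codebook_size:
        # Split every codeword into two slightly perturbed copies.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):
            # Assign each training vector to its nearest codeword ...
            d = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
            nearest = d.argmin(axis=1)
            # ... then move each codeword to the centroid of its cluster.
            for k in range(len(codebook)):
                members = vectors[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook
```

Run per speaker on that speaker's training acoustic vectors, this yields the speaker-specific codebook described in the text.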
Figure 4.1 shows a conceptual diagram illustrating this recognition process. In the
figure, only two speakers and two dimensions of the acoustic space are shown. The circles
are the acoustic vectors from speaker 1, while the triangles are from speaker 2. In
the training phase, a speaker-specific VQ codebook is generated for each known speaker by
clustering his or her training acoustic vectors with the clustering algorithm described in
Section 4.2. The resulting codewords (centroids) are shown in the figure by black circles and
black triangles for speakers 1 and 2, respectively. The distance from a vector to the closest
codeword of a codebook is called the VQ distortion. In the recognition phase, an input
utterance of an unknown voice is "vector-quantized" using each trained codebook and the
total VQ distortion is computed. The speaker corresponding to the VQ codebook with the
smallest total distortion is identified as the speaker of the input utterance.
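The recognition step above reduces to a few lines: quantize the input vectors with each speaker's codebook, measure the distortion, and pick the minimum. The function names and the toy two-codeword codebooks below are illustrative, not taken from the document.

```python
import numpy as np

def vq_distortion(features, codebook):
    """Average distance from each input vector to its nearest codeword."""
    features = np.asarray(features, dtype=float)
    d = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).mean()

def identify(features, codebooks):
    """Return the speaker whose trained codebook yields the smallest
    total (here, average) VQ distortion for the input utterance."""
    return min(codebooks, key=lambda spk: vq_distortion(features, codebooks[spk]))

# Toy codebooks for two speakers, standing in for trained ones.
codebooks = {
    "speaker1": np.array([[0.0, 0.0], [1.0, 1.0]]),
    "speaker2": np.array([[10.0, 10.0], [11.0, 11.0]]),
}
```

Using the average rather than the summed distortion makes the score independent of utterance length; either works for picking the minimum over speakers on a fixed utterance.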