Fusion Approach for Robust Speaker
Identification system
El bachir Tazi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
elbachirtazi@yahoo.fr
Noureddine El makhfi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
n.elmakhfi@gmail.com
Abstract— The performance of speaker identification systems decreases significantly under noisy conditions, especially when there is a mismatch between the recognition and learning sessions. To improve robustness, in a previous study we proposed auditory features and a robust speaker recognition system using a front-end based on the combination of the MFCC and RASTA-PLP methods. In this paper, we further study auditory features by exploring the combination of GFCC and RASTA-PLP. We find that this method performs substantially better than all previously studied methods. Furthermore, our current identification system achieves a significant performance improvement of 5.92% across a wide range of signal-to-noise conditions compared with the previously studied front-end based on MFCC combined with RASTA-PLP. Experimental results show an average accuracy improvement of 10.11% for GFCC combined with RASTA-PLP over the baseline MFCC technique across various SNRs. This fusion approach yields an appreciable enhancement under highly noisy conditions.
Keywords—Robust Speaker Identification; Gammatone
Frequency Cepstral Coefficients GFCC; Relative Spectral
Transform Perceptual Linear Prediction RASTA-PLP; Gaussian
Mixture Model GMM.
I. INTRODUCTION
One of the most accepted forms of biometric identification for humans is the speech signal. Speaker recognition based on the speech signal is regarded as one of the most exciting technologies for human recognition [1,2,3]. Audio signal features can be classified as either perceptual or physical. In previous work, we studied perceptual features for speaker identification tasks [4,5]. In the current work we further study the perceptual mode based on auditory techniques, mainly GFCC and RASTA-PLP. Most published works in the area of speaker recognition focus on speech in noiseless environments, and few focus on speech under noisy conditions [6,7,8,9]. In this study we added white Gaussian noise at different levels to our test signals to simulate the real environments in which these systems are used. Learning systems for speaker identification that employ hybrid strategies can potentially offer significant advantages over single-strategy systems. In the proposed system, a hybrid algorithm based on GFCC combined with RASTA-PLP is used to improve the performance of text-independent speaker identification in noisy environments. Our system is implemented and simulated in the MATLAB environment using toolboxes such as the Signal Processing Toolbox, Voicebox and the HMM Toolbox.
The speaker identification task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate speaker models. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Figure 1 shows the block diagram of the standard structure of an automatic speaker recognition system.
Fig.1 Block diagram of an ASR system
The rest of this paper is organized as follows: Section 2 describes the GFCC and RASTA-PLP feature extraction techniques used, followed by a description of Gaussian mixture modeling and expectation-maximization classification in Section 3. In Section 4 we give the experimental results, and finally a conclusion and perspectives for future work are given in Section 5.
II. FEATURE EXTRACTION METHODS
Front-end analysis, or feature extraction, is the first step in an automatic speaker recognition task. It aims to extract features from the speech waveform that are compact and efficient to
represent the speaker’s voice imprint. Since speech is a non-stationary signal, the feature parameters should be estimated over short-term intervals of 16 ms to 32 ms, in which speech is considered to be stationary.
International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 8, August 2017, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
front-end processing techniques used in the field of ASR are:
Linear Predictive Coding (LPC), Perceptual Linear Prediction
(PLP), Mel Frequency cepstral coefficients (MFCC),
Gammatone Frequency cepstral coefficients (GFCC) and
RASTA-PLP. In this study we examine RASTA-PLP combined with the auditory GFCC method to characterize speakers' voices. The conventional MFCC method is usually used as a baseline system for comparison and evaluation against other feature extraction methods. Many studies show that MFCC gives the best results in quiet environments, but its performance is surpassed in very noisy environments [10,11,12,13].
A. GFCC METHOD
GFCC (Gammatone Frequency Cepstral Coefficients) is a feature extraction method based on a gammatone filterbank. The filters in the bank are designed to simulate the auditory processing of the human ear [14,15] and are formulated as follows:

$g(t) = a\, t^{\,n-1}\, e^{-2\pi b t} \cos(2\pi f_c t + \varphi)$   (1)

where a is a constant, generally equal to 1; n is the filter order, set less than or equal to 4; φ is the phase shift between filters; and f_c and b are respectively the center frequency and the bandwidth of the filter in Hz, related by:

$b = 1.019 \cdot \mathrm{ERB} = 1.019 \cdot 24.7 \left( 4.37 \frac{f_c}{1000} + 1 \right)$   (2)
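As a concrete illustration, equations (1) and (2) can be sketched in a few lines of NumPy. This is a hedged Python sketch (the paper's own implementation is in MATLAB), and the function names here are our own:

```python
import numpy as np

def erb_bandwidth(fc):
    """Filter bandwidth b in Hz from the centre frequency fc, Eq. (2)."""
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, a=1.0, phi=0.0, duration=0.025):
    """Impulse response g(t) of one gammatone filter, Eq. (1)."""
    t = np.arange(int(duration * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

# e.g. a 1 kHz channel at the paper's 16 kHz sampling rate
g = gammatone_ir(1000.0, 16000)
```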
The extraction of the best parametric representation of
acoustic signals is an important task to produce a better
identification performance. The efficiency of this phase is
important for the next phase since it affects its behavior. The overall process of the GFCC algorithm is shown in the block diagram in figure 2.
Fig.2 Block diagram of GFCC method
The GFCC algorithm is another FFT-based feature extraction technique in the speaker recognition field. The technique is based on the Gammatone Filter Bank (GTFB), which attempts to model the human auditory system as a series of overlapping band-pass filters [16,17]. The following figure 3 shows the shape of the gammatone filterbank used, with a 16 kHz sampling frequency.
Fig.3 Impulse response of a set of 10 Gammatone filters
As in conventional MFCC, the robust GFCC features are calculated from the spectra of a series of windowed speech frames of 32 ms, overlapping by 16 ms. First, the spectrum of a speech frame is obtained by applying a 512-point Fast Fourier Transform (FFT). Then the speech spectrum is passed through a 20-channel gammatone filterbank (GTFB). Equal-loudness weighting is applied to each filter output according to the centre frequency of the filter. After that, the logarithm of each filter output is taken. Finally, the Discrete Cosine Transform (DCT) is applied to the gammatone filterbank outputs in order to move from the spectral domain to the cepstral domain. Figures 4 and 5 show, respectively, an example of the behavior of 10 gammatone filterbank outputs and a gammatonegram of a 32 ms speech signal frame applied at the input.
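The pipeline just described (512-point FFT, 20-channel gammatone filterbank, logarithm, DCT) can be sketched as follows. This is a simplified Python/NumPy illustration under our own assumptions: the equal-loudness weighting step is omitted, the channel centre-frequency spacing is a simple rule of our choosing, and each filter's magnitude response is obtained from the FFT of its impulse response:

```python
import numpy as np

def gfcc(frame, fs=16000, nfft=512, nfilt=20, nceps=12):
    """Sketch of the GFCC pipeline: window -> FFT -> gammatone filterbank -> log -> DCT."""
    # power spectrum of one 32 ms windowed frame (512 samples at 16 kHz)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    # gammatone filterbank magnitude responses sampled on the FFT bins
    centers = np.linspace(100.0, fs / 2 - 500.0, nfilt)   # assumed channel spacing
    t = np.arange(nfft) / fs
    bank = []
    for fc in centers:
        b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)     # ERB bandwidth, Eq. (2)
        ir = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        resp = np.abs(np.fft.rfft(ir, nfft))
        bank.append(resp / resp.max())
    log_energies = np.log(np.dot(bank, spec) + 1e-10)
    # DCT-II moves the log filterbank energies to the cepstral domain
    k = np.arange(nceps)[:, None]
    n = np.arange(nfilt)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * nfilt))
    return dct @ log_energies
```

Applied frame by frame, this yields the 12 GFCC coefficients per frame listed in Table I.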
Fig.4 Example of a 10 gammatone filterbank outputs behavior
(Fig. 3 axes: frequency 0–8000 Hz versus amplitude in dB)
Fig. 5 Example of a gammatonegram of a speech signal
According to this gammatonegram, it can be seen that most of the energy of the speech signal is concentrated at the outputs of the first five gammatone filters.
B. RASTA-PLP Method
RASTA-PLP (RelAtive SpecTrAl PLP) analysis is a variant of PLP analysis designed to suppress spectral components that vary more slowly or more quickly than typical speech, such as those introduced by noise and channel effects. It is inspired by the fact that human perception responds to relative rather than absolute values [18,19,20]. This method, based on the classical PLP analysis, consists first of performing the short-term discrete Fourier transform and then calculating the amplitude spectrum in critical bands. Then the logarithm is applied to extract the spectral envelope of the speech signal. Band-pass filtering is then carried out in order to eliminate any offset or slowly varying components of the signal. After this, the amplitude is compressed by applying a cubic root in order to simulate the power law of the human ear. Finally, the coefficients are calculated according to the classical LPC method. The following figure 6 shows the overall process of the RASTA-PLP analysis [21,22,23].
Fig.6 Block diagram of RASTA-PLP method
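The band-pass filtering step at the heart of RASTA can be sketched as below. This is a hedged Python illustration: the coefficients follow the widely cited RASTA transfer function H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.94z⁻¹), but implementations vary in the exact pole value and in how the filter start-up is handled:

```python
import numpy as np

def rasta_filter(log_spec):
    """Band-pass filter each band's log-energy trajectory across frames.

    Implements y[n] = 0.1*(2x[n] + x[n-1] - x[n-3] - 2x[n-4]) + 0.94*y[n-1],
    i.e. H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.94 z^-1).
    log_spec has shape (bands, frames)."""
    num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    pole = 0.94
    out = np.zeros_like(log_spec, dtype=float)
    for band in range(log_spec.shape[0]):
        x = log_spec[band]
        y = 0.0
        for n in range(len(x)):
            acc = sum(c * x[n - k] for k, c in enumerate(num) if n - k >= 0)
            y = acc + pole * y
            out[band, n] = y
    return out
```

Because the numerator coefficients sum to zero, any constant (slowly varying) offset in a band's log-energy trajectory decays away, which is exactly the convolutive-noise removal RASTA is designed for.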
C. Hybrid Front-End Feature Extraction
In order to design a new robust feature extraction technique, we use a hybrid algorithm based on a combination of the previously described feature extraction methods, GFCC and RASTA-PLP. Each of these methods is first applied separately, and the resulting vectors are then concatenated to obtain a new feature representation vector. The block diagram in figure 7 shows the principle of the proposed hybrid front-end extractor.
Fig.7 Structure of the proposed feature extraction method
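The fusion itself is a simple frame-wise concatenation of the two 12-coefficient vectors into one 24-coefficient vector (per Table I). A minimal sketch, assuming each extractor returns a (frames × coefficients) matrix:

```python
import numpy as np

def fuse_features(gfcc_feats, rasta_feats):
    """Concatenate per-frame GFCC and RASTA-PLP vectors (12 + 12 -> 24)."""
    if gfcc_feats.shape[0] != rasta_feats.shape[0]:
        raise ValueError("the two extractors must produce the same number of frames")
    return np.hstack([gfcc_feats, rasta_feats])

fused = fuse_features(np.zeros((100, 12)), np.ones((100, 12)))
```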
III. PATTERN RECOGNITION
A. GMM Method
There are many approaches to pattern recognition in the field of speaker recognition: template approaches, statistical approaches, neural network approaches and multiple hybrid models. In this study we use the probabilistic GMM method, which is the state of the art in this field [24]. The Gaussian mixture model assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The probability density functions (PDFs) of many random processes, such as speech, are non-Gaussian. A non-Gaussian PDF may be approximated by a weighted mixture of Gaussian densities with appropriate mean vectors and covariance matrices, according to (3):

$f_i(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_i)'\, \Sigma_i^{-1} (x - \mu_i) \right]$   (3)

where f_i(x) is the probability density function; μ_i and Σ_i, i ∈ {1,…,M}, are respectively the mean vector and covariance matrix of each component; and d is the dimension of the vectors.
GMM is a conventional method for speaker recognition, known for its effectiveness and scalability in this field [25]. GMM is simple and fast to compute, in both the training and testing phases. A limitation of GMM is that it requires a sufficient amount of training data to ensure good performance, which increases the training time. GMM works well in terms of accuracy compared to other classification and recognition methods. We use GMM in our hybrid system because we need to combine fast and accurate approaches in order to keep the computational time of the system derived from our fusion approach acceptable. The decision on the identity of the speaker is based on a similarity measure between the test speaker model and all of the speaker models in the reference database. For the similarity measure we use maximum likelihood estimation (MLE). To maximize the classification process for a given
set of feature vectors, the Expectation Maximization (EM) algorithm is used.
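The maximum-likelihood decision described above (score the test frames against every enrolled speaker's GMM and pick the best model) can be sketched as follows. This Python sketch assumes diagonal covariance matrices, a common simplification, and the model format (weights, means, variances) is our own:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (N x d) under a diagonal-covariance GMM."""
    log_probs = []
    d = X.shape[1]
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        quad = np.sum(diff ** 2 / var, axis=1)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_probs.append(np.log(w) + log_norm - 0.5 * quad)
    log_probs = np.stack(log_probs)          # shape (M, N)
    m = log_probs.max(axis=0)                # log-sum-exp over the M components
    return float(np.sum(m + np.log(np.sum(np.exp(log_probs - m), axis=0))))

def identify(X, speaker_models):
    """Maximum-likelihood decision: pick the speaker whose GMM best explains X."""
    scores = {name: gmm_loglik(X, *params) for name, params in speaker_models.items()}
    return max(scores, key=scores.get)
```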
B. The Expectation Maximization Algorithm
In this study, the maximum likelihood (ML) method using a Gaussian mixture model is used to design our speaker identification system. The EM algorithm is an efficient iterative procedure for computing the maximum likelihood estimate in the presence of missing data. In ML estimation, we seek the parameters of the model for which the observed data are most likely. Each EM iteration consists of two steps: expectation and maximization [26]. In the expectation step, the missing data are estimated based on the observed data and the current parameter estimates. In the maximization step, the likelihood function is optimized by assuming that the missing data are known, using the estimates from the expectation step in place of the actual missing data. The convergence of the EM algorithm, illustrated in figure 8, follows from the fact that the likelihood never decreases from one iteration to the next [27].
Fig.8 Expectation Maximization algorithm
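A minimal, self-contained EM loop for a diagonal-covariance GMM can be sketched as below. This is a hedged Python illustration (the paper's system is in MATLAB and uses NG = 4 mixtures per Table I); the deterministic initialization is our own choice:

```python
import numpy as np

def em_fit(X, M=4, iters=50):
    """Fit a diagonal-covariance GMM to frames X (N x d) with EM."""
    N, d = X.shape
    # deterministic initialization: spread the means along the first dimension
    order = np.argsort(X[:, 0])
    means = X[order[np.linspace(0, N - 1, M).astype(int)]].astype(float)
    covs = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    weights = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        logp = np.empty((M, N))
        for m in range(M):
            diff = X - means[m]
            logp[m] = (np.log(weights[m])
                       - 0.5 * (d * np.log(2 * np.pi)
                                + np.sum(np.log(covs[m]))
                                + np.sum(diff ** 2 / covs[m], axis=1)))
        logp -= logp.max(axis=0)             # for numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=0)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=1) + 1e-10
        weights = nk / N
        means = (resp @ X) / nk[:, None]
        for m in range(M):
            diff = X - means[m]
            covs[m] = (resp[m] @ diff ** 2) / nk[m] + 1e-6
    return weights, means, covs
```

The returned (weights, means, variances) triple is exactly the per-speaker model scored at identification time.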
IV. EXPERIMENTATION AND RESULTS
A. Effect of the pre-emphasis stage
The pre-processing of the speech signal is a very important stage in the design of any speech/speaker recognition system, because it has a direct influence on the system's robustness and efficiency. Pre-emphasis of the speech signal is intended to give more energy to the high frequencies. This is obtained by using a high-pass filter with the following transfer function:

$H(z) = 1 - \alpha z^{-1}$   (4)

In this study we fixed α = 0.95. Figures 9a and 9b show the benefit of the pre-emphasis preprocessing, which makes the distribution of energy more uniform over all the frequencies of the speech signal. This new frequency distribution consequently leads to an improvement in the characterization of the speech signal and an increase in the accuracy of the speaker recognition system. The obtained result justifies the use of this technique.
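Equation (4) amounts to the difference equation y[n] = x[n] − α·x[n−1]. A minimal sketch with the paper's α = 0.95:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """Apply the high-pass pre-emphasis filter H(z) = 1 - alpha*z^-1 of Eq. (4)."""
    y = np.empty(len(x), dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

On a constant (DC) input, every output sample after the first is attenuated to (1 − α) of the input, which is how the filter shifts energy toward the high frequencies.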
(a)
(b)
Fig.9 The spectrogram (a) before and (b) after the pre-emphasis
The pre-emphasis processing also contributes to reducing the edge effect between successive frames, as illustrated in figure 10.
(a)
(b)
Fig.10 Edge effect interframes (a) before and (b) after pre-emphasis
B. Experimental conditions
In this study, we recorded a database of 51 speakers (35 male and 16 female). For each speaker we acquired two records: one of about 20 seconds for the training phase, the other of about 10 seconds, which serves later for the recognition phase. All speech signals were acquired in .wav format with a sampling frequency of 16 kHz and 16-bit monophonic quantization, using the Wavesurfer software tool [28]. Gaussian white noise of zero mean and unit variance at various levels was added only to the test signals, in order to evaluate the robustness of our system under conditions similar to reality. The noise was added only to the test records because learning speech signals are generally recorded in a controlled environment, whereas in most practical cases the test speech signals are captured in noisy, uncontrolled environments. The feature extractors considered in this study are MFCC, GFCC, RASTA-PLP and the combination of GFCC and RASTA-PLP. The entire system is implemented in the MATLAB environment. Table I gives a detailed description of the experimental conditions of our study.
TABLE I
EXPERIMENTAL CONDITIONS OF THE STUDY
Task system: text-independent automatic speaker identification
Feature set: MFCC, GFCC, RASTA-PLP, GFCC+RASTA-PLP
Back-end: Gaussian mixture model with NG = 4 mixtures
Coefficients per feature vector: 12 MFCC, 12 GFCC, 12 RASTA-PLP; 24 for GFCC+RASTA-PLP
Window size: 32 ms
Step size: 16 ms
Sampling rate: 16 kHz
Training set: 51 speakers (one record of 20 s for each speaker)
Test set: 51 speakers (one record of 10 s for each speaker)
Noise type: white Gaussian noise (zero mean, unit variance)
SNR range: from 40 dB to 0 dB with a step of 5 dB
Platform: HP EliteBook Core i5, 2.4 GHz
Prog. environment: MATLAB® 7
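The noisy test condition can be reproduced by scaling white Gaussian noise to a target SNR before adding it to the clean test signal. A hedged Python sketch (the paper adds unit-variance noise at various levels; here we instead scale the noise power to hit a requested SNR in dB, which has the same effect):

```python
import numpy as np

def add_white_noise(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise scaled to a target SNR in dB."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise
```

Sweeping snr_db from 40 down to 0 in 5 dB steps reproduces the test conditions of Table I.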
C. Performance measure
In this section we present the performance results of the recognition experiments conducted on four feature extractor sets: MFCC, GFCC, RASTA-PLP and GFCC combined with RASTA-PLP. All tests were conducted in the presence of additive white Gaussian noise with a signal-to-noise ratio varying from 40 dB to 0 dB. The overall recognition accuracies are presented in Table II and figure 11. The results show that the performance obtained by the fusion of the GFCC and RASTA-PLP feature parameters is better than that obtained by using the MFCC, GFCC or RASTA-PLP feature parameters separately.
TABLE II
THE RECOGNITION ACCURACY OF THE STUDIED METHODS
Fig.11 The performance results of the studied methods
V. CONCLUSION
In this paper a fusion approach based on the combination of two auditory models, GFCC and RASTA-PLP, has been explored for the feature extraction process. We chose these methods to build our front-end because they simulate the spectral and temporal aspects of the peripheral auditory system. The GMM classifier is used as a back-end for speaker modeling. The main objective of this study is to explore the new feature representation and evaluate the robustness of the implemented speaker identification system. It is therefore necessary to analyze the strengths and weaknesses of each set of parameters and to compare their performance in terms of identification rate (accuracy) under noisy conditions, but also in terms of response time (efficiency). The ultimate objective of all these analyses is to identify the strengths of each method in order to exploit them when they are combined or used separately, according to the requirements of the intended application. Based on the results of tests carried out on our database, we can conclude that the MFCC, GFCC and RASTA-PLP methods used separately all give a good accuracy of 100% when the speech is recorded in quiet environments and when the recognition test and learning phases share similar conditions. But when a high level of white Gaussian noise is added to the test speech, the performance decreases considerably. However, the hybrid method RASTA-
(Fig. 11: accuracy in % versus SNR in dB for MFCC, GFCC, RASTA-PLP and GFCC+RASTA-PLP)
SNR (dB)   MFCC     GFCC     RASTA-PLP   GFCC+RASTA-PLP
40         100      100      100         100
35         100      98.37    98.37       100
30         91.18    98.37    88.67       98.37
25         82.37    94.15    80.37       94.15
20         58.75    80.37    60.12       80.37
15         25.32    58.75    27.65       60.12
10         20.91    25.32    20.91       25.32
5          16.45    20.91    16.45       20.91
0          10.85    10.85    10.85       16.45
Av. Acc.   56.09%   65.23%   55.95%      66.20%
PLP combined with GFCC gives the best performance, with a relative improvement in average accuracy of 10% compared to the conventional MFCC method as the SNR varies from 40 dB to 0 dB.
Finally, multiple combinations of features may be deployed to implement a robust and efficient parametric representation for a speaker identification system. However, although these different features are complementary and can be combined to enhance accuracy, the computational load of the implemented system may increase with the number of parameters to be processed, which may be unacceptable in terms of system efficiency. Future work will extend the study to other scenarios to improve the noise robustness, but also the efficiency, of the system.
REFERENCES
[1] Dakshina Ranjan Kisku, Phalguni Gupta, Jamuna Kanta Sing, "Advances in Biometrics for Secure Human Authentication and Recognition", CRC Press, Taylor and Francis Group, US, 2014
[2] R. Parashar and S. Joshi, "Proportional study of human recognition methods", International Journal of Advanced Research in Computer Science and Software Engineering, 2012
[3] K. Delac and M. Grgic, "A survey of biometric recognition methods", 46th International Symposium Electronics in Marine, 2004
[4] El bachir Tazi, A. Benabbou, and M. Harti, “Efficient Text Independent
Speaker Identification Based on GFCC and CMN Methods" International
Conference on Multimedia Computing and Systems IEEE conference-
ICMCS’12, Tangier, Morocco, 10-12 may 2012.
[5] El Bachir Tazi, “A robust Speaker Identification System based on the
combination of GFCC and MFCC methods” , 5th International
Conference on Multimedia Computing and Systems – IEEE Conference
ICMCS’16, Marrakech, Morocco, 29 September – 1 October 2016
[6] X. Zhao, Y. Shao and D.L.Wang,"CASA-Based Robust Speaker
Identification,"IEEE Trans. Audio, Speech and Language Processing ,
vol.20, no.5, pp.1608 -1616, 2012.
[7] Y. Shao, Z. Jin, D. Wang, and S. Srinivasan, “An auditory-based feature
for robust speech recognition,” in Proc. I-CASSP’09, 2009, pp. 4625–
4628
[8] J. Ming, T.J. Hazen, J.R. Glass and D.A. Reynolds, "Robust speaker
recognition in noisy conditions," IEEE Trans. Audio, Speech, and
Language Processing , vol. 15(5), pp. 1711-1723, 2007
[9] Douglas A. Reynolds and Richard C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995
[10] Rashidul Hasan et al., "Speaker identification using Mel frequency cepstral coefficients", 3rd International Conference on Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh
[11] Davis S. B., Mermelstein P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. ASSP, Aug. 1980
[12] X. Zhou et al., "Linear versus Mel-frequency cepstral coefficients for speaker recognition", in IEEE Workshop on ASRU 2011, pp. 559-564
[13] Khan S. A. et al., "A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network", in Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, 2015
[14] J. Qi, D. Wang, Y. Jiang, and R. Liu, “Auditory feature based on
gammatone filters for robust speech recognition,” in the IEEE
International Symposium on Circuits and Systems (ISCAS), 2013
[15] J. Qi, D. Wang, Y. Jiang, and R. Liu, “Auditory feature based on
gammatone filters for robust speech recognition,” in the IEEE
International Symposium on Circuits and Systems (ISCAS), 2013
[16] Jun Qi et al., "Bottleneck features based on gammatone frequency cepstral coefficients", Interspeech 2013, 25-29 August 2013, Lyon, France
[17] He Xu et al., "A New Algorithm for Auditory Feature Extraction", CSNT 2012, pp. 229-232
[18] El Bachir Tazi, Noureddine El Makhfi, “An Hybrid Front-End for
Robust Speaker Identification under Noisy Conditions" Intelligent
Systems Conference, IEEE conference- IntelliSys’17, London, UK,
7-8 September 2017.
[19] H. Hermansky, "Perceptual linear predictive (PLP) analysis of
speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[20] H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE
Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, Oct.
1994.
[21] Zwicker E. Subdivision of the audible frequency range into critical
bands.J. Acoust. Soc. Am., Feb., 1961, 33.
[22] Prithvi et al., "Comparative Analysis of MFCC, LFCC, RASTA-PLP", International Journal of Scientific Engineering and Research (IJSER), Volume 4, Issue 5, May 2016
[23] Hynek Hermansky, "RASTA Processing of Speech", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, October 1994
[24] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker verification using
adapted Gaussian mixture models", Digital Signal Processing, vol. 10,
pp. 19-41, January 2000.
[25] D.A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, pp. 91-108, 1995
[26] Dempster, A. P., Laird, N. M., and Rubin, D. B., "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977
[27] L. Xu and M.I. Jordan. On convergence properties of the EM algorithm
for Gaussian mixtures. Neural computation , 8:129–151, 1996
[28] http://www.speech.kth.se/wavesurfer/
Authors’ information
Prof. El bachir Tazi graduated in Electronic
Engineering from ENSET Mohammedia, in 1992.
He received his postgraduate diplomas DEA and
DES in Automatic and Signal Processing and PhD
in Computer Science from Sidi Mohammed Ben
Abdellah University, Faculty of Sciences Fez,
Morocco respectively in 1995, 1999 and 2012. He
is now a member of the research team “Physics, Computer Science and Process Modeling” and a professor at the Higher School of Technology Khenifra, Moulay Ismail
University, Morocco. His areas of interest include automatic speaker
recognition, signal processing, pattern recognition, artificial intelligence and
real time processing using embedded systems.
Dr Noureddine El makhfi received his MSc and
PhD in Computer Science from Sidi Mohamed Ben
Abdallah University, Faculty of Science and
Technology in Fez, Morocco. He is now a member of the research team “Physics, Computer Science and Process Modeling” at the Higher School of Technology Khenifra, Morocco. His main research interests include the processing and recognition of historical documents,
cognitive science, image processing, computer
vision, pattern recognition, document image analysis OCR, artificial
intelligence, Web document analysis, industrial data systems and embedded
processors.
Diabetes mellitus diagnosis method based random forest with bat algorithm
Machine learning based COVID-19 study performance prediction
Building Integrated photovoltaic BIPV_UPV.pdf

Fusion Approach for Robust Speaker Identification System

Fusion Approach for Robust Speaker Identification System

El bachir Tazi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
elbachirtazi@yahoo.fr

Noureddine El makhfi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
n.elmakhfi@gmail.com

Abstract— The performance of speaker identification systems decreases significantly under noisy conditions, especially when the recognition and learning sessions differ. To improve robustness, we proposed in a previous study auditory features and a robust speaker recognition system using a front-end based on the combination of the MFCC and RASTA-PLP methods. In this paper, we study auditory features further by exploring the combination of GFCC and RASTA-PLP. We find that this method performs substantially better than all previously studied methods. Furthermore, our current identification system achieves a significant performance improvement of 5.92% over a wide range of signal-to-noise conditions compared with the previously studied front-end based on MFCC combined with RASTA-PLP. Experimental results show an average accuracy improvement of 10.11% for GFCC combined with RASTA-PLP over the baseline MFCC technique across various SNRs. This fusion approach yields a high and appreciable enhancement under the noisiest conditions.

Keywords—Robust Speaker Identification; Gammatone Frequency Cepstral Coefficients (GFCC); Relative Spectral Transform Perceptual Linear Prediction (RASTA-PLP); Gaussian Mixture Model (GMM).

I. INTRODUCTION

The most widely accepted form of biometric identification for a human is the speech signal. Speaker recognition based on a speech signal is treated as one of the most exciting technologies of human recognition [1,2,3]. Audio signal features can be classified either in the perceptual mode or in the physical mode.
In previous work, we studied perceptual features for speaker identification tasks [4,5]. In the current work we further study the perceptual mode based on auditory techniques, mainly GFCC and RASTA-PLP. Most published work in speaker recognition focuses on speech in noiseless environments, and few published works address speech under noisy conditions [6,7,8,9]. In this study we added white Gaussian noise at different levels to our test signals in order to simulate the real operating environment of these systems.

Learning systems for speaker identification that employ hybrid strategies can offer significant advantages over single-strategy systems. In the proposed system, a hybrid algorithm based on GFCC combined with RASTA-PLP is used to improve the performance of a text-independent speaker identification system in noisy environments. Our system is implemented and simulated in the MATLAB environment using toolboxes such as the Signal Processing Toolbox, Voicebox and the HMM Toolbox.

The speaker identification task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech; these features are used to generate speaker models. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Figure 1 shows the block diagram of the standard structure of an automatic speaker recognition system.

Fig.1 Block diagram of an ASR system

The rest of this paper is organized as follows: Section 2 describes the GFCC and RASTA-PLP feature extraction techniques, followed by a description of Gaussian mixture modeling and expectation-maximization classification in Section 3. Section 4 gives the experimental results, and Section 5 closes with a conclusion and perspectives for future work.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 8, August 2017, 264. https://sites.google.com/site/ijcsis/ ISSN 1947-5500

II. FEATURE EXTRACTION METHODS

Front-end analysis, or feature extraction, is the first step in an automatic speaker recognition task. It aims to extract features from the speech waveform that are compact and efficient at representing the speaker's voice imprint. Since speech is a non-stationary signal, the feature parameters should be estimated over short-term intervals of 16 ms to 32 ms, in which speech can be considered stationary. The major types of front-end processing techniques used in the field of ASR are: Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC) and RASTA-PLP. In this study we use RASTA-PLP combined with the auditory GFCC method to characterize the speakers' voices. The conventional MFCC method is usually used as a baseline system for comparison and evaluation against other feature extraction methods. Many studies show that MFCC gives the best results in quiet environments but is outperformed in very noisy environments [10,11,12,13].

A. GFCC Method

GFCC (Gammatone Frequency Cepstral Coefficients) is a feature extraction method based on a gammatone filterbank. The filters in the bank are designed to simulate the auditory processing of the human ear [14,15] and are formulated as follows:

g(t) = a * t^(n-1) * e^(-2*pi*b*t) * cos(2*pi*fc*t + phi)    (1)

where a is a constant, generally equal to 1; n is the filter order, usually set to at most 4; phi is the phase shift between filters; and fc and b are, respectively, the center frequency and the bandwidth of the filter in Hz, related by:

b = 1.019 * ERB(fc) = 1.019 * 24.7 * (4.37 * fc / 1000 + 1)    (2)

Extracting the best parametric representation of the acoustic signal is an important task for good identification performance. The efficiency of this phase matters for the next phase, since it affects its behavior. The overall process of the GFCC algorithm is shown in the block diagram of figure 2.

Fig.2 Block diagram of the GFCC method

The GFCC algorithm is another FFT-based feature extraction technique in the speaker recognition field.
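As an illustration, equations (1) and (2) can be written down in a few lines. This is a sketch for illustration only (the paper's implementation is in MATLAB, and the function names here are our own):

```python
import numpy as np

def erb_bandwidth(fc):
    # Eq. (2): b = 1.019 * ERB(fc) = 1.019 * 24.7 * (4.37 * fc / 1000 + 1)
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs=16000, n=4, a=1.0, phi=0.0, dur=0.025):
    # Eq. (1): g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi),
    # sampled at fs over a short support of `dur` seconds.
    t = np.arange(int(dur * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
```

The exponential envelope makes each impulse response decay quickly, so a support of a few tens of milliseconds is sufficient in practice.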
The technique is based on the Gammatone Filterbank (GTFB), which models the human auditory system as a series of overlapping band-pass filters [16,17]. Figure 3 shows the shape of the gammatone filterbank used, with a 16 kHz sampling frequency.

Fig.3 Impulse response of a set of 10 gammatone filters

As in conventional MFCC, the robust GFCC features are calculated from the spectra of a series of windowed speech frames of 32 ms, overlapping by 16 ms. First, the spectrum of a speech frame is obtained by applying a 512-point Fast Fourier Transform (FFT). The speech spectrum is then passed through a 20-filter gammatone filterbank (GTFB). Equal-loudness weighting is applied to each filter output, according to the centre frequency of the filter. After that, the logarithm of each filter output is taken. Finally, we apply the Discrete Cosine Transform (DCT) to the gammatone filterbank outputs in order to move from the spectral domain to the cepstral domain. Figures 4 and 5 show, respectively, an example of the outputs of a 10-filter gammatone filterbank and a gammatonegram of a 32 ms speech frame applied at the input.

Fig.4 Example of the outputs of a 10-filter gammatone filterbank
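The steps above (windowing, 512-point FFT, 20-channel gammatone filterbank, log compression, cosine transform) can be sketched as follows. This is a simplified Python illustration, not the authors' MATLAB code: it approximates each gammatone channel by the magnitude response of a 4th-order filter sampled on the FFT bin grid, and it omits the equal-loudness weighting:

```python
import numpy as np

def erb_bandwidth(fc):
    # Eq. (2): 1.019 * ERB at centre frequency fc (Hz)
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_weights(n_filters=20, n_fft=512, fs=16000, fmin=50.0):
    # Spectral weights of the filterbank: centre frequencies spaced on the
    # ERB-rate scale, each channel weighted by an approximate 4th-order
    # gammatone magnitude response |G(f)|^2 ~ (1 + ((f - fc)/b)^2)^(-4).
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    erb_lo = 21.4 * np.log10(4.37e-3 * fmin + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * (fs / 2.0) + 1.0)
    fc = (10.0 ** (np.linspace(erb_lo, erb_hi, n_filters) / 21.4) - 1.0) / 4.37e-3
    b = erb_bandwidth(fc)
    return (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** -4

def gfcc(frame, fs=16000, n_filters=20, n_fft=512, n_ceps=12):
    # Windowed frame -> FFT magnitude -> gammatone band energies -> log -> DCT
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    energies = gammatone_weights(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return dct @ log_e  # 12 cepstral coefficients per 32 ms frame
```

With the parameters used in this paper (32 ms frames at 16 kHz, 20 filters, 12 coefficients), each frame yields a 12-dimensional GFCC vector.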
Fig.5 Example of a gammatonegram of a speech signal

From this gammatonegram, it can be seen that most of the energy of the speech signal is concentrated at the outputs of the first five gammatone filters only.

B. RASTA-PLP Method

RASTA-PLP (RelAtive SpecTrAl PLP) analysis is a variant of PLP analysis designed to suppress spectral components that vary more slowly or more quickly than speech, which are typically due to noise. It is inspired by the fact that human perception responds to relative rather than absolute values [18,19,20]. This method, based on the well-known PLP analysis, first performs the short-term discrete Fourier transform and then computes the amplitude spectrum in critical bands. The logarithm is then applied to extract the spectral envelope of the speech signal. Band-pass filtering is then carried out in order to eliminate any offset or slowly varying components of the signal. After this, the amplitude is compressed by applying a cube root in order to simulate the power law of the human ear. Finally, the coefficients are calculated according to the classical LPC method. Figure 6 shows the overall process of the RASTA-PLP analysis [21,22,23].

Fig.6 Block diagram of the RASTA-PLP method

C. Hybrid Front-End Feature Extraction

In order to design a new robust feature extraction technique, we use a hybrid algorithm based on a combination of the feature extraction methods described above, GFCC and RASTA-PLP. Each of these methods is first applied separately, and the resulting vectors are then concatenated to obtain a new feature representation vector. The block diagram in figure 7 shows the principle of the proposed hybrid front-end extractor.

Fig.7 Structure of the proposed feature extraction method

III. PATTERN RECOGNITION

A. GMM Method

There are many pattern recognition methods in the field of speaker recognition: template approaches, statistical approaches, neural network approaches and various hybrid models. In this study we use the probabilistic GMM method, which is the state of the art in this field [24]. The Gaussian mixture model assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The probability density functions (PDFs) of many random processes, such as speech, are non-Gaussian. A non-Gaussian PDF can be approximated by a weighted mixture of Gaussian densities with appropriate mean vectors and covariance matrices, each component having the form of (3):

f_i(x) = (1 / ((2*pi)^(d/2) * |Sigma_i|^(1/2))) * exp[ -(1/2) * (x - mu_i)' * Sigma_i^(-1) * (x - mu_i) ]    (3)

where f_i(x) is the probability density function of component i; mu_i and Sigma_i, i in {1...M}, are respectively the mean vector and covariance matrix of each component; and d is the dimension of the vectors.

GMM is a conventional method for speaker recognition, known for its effectiveness and scalability in this field [25]. GMM models are simple and fast to compute, in both the training and testing phases. A limitation of GMM is that it requires a sufficient amount of training data to ensure good performance, which increases the training time. GMM works well in terms of accuracy compared to other classification and recognition methods. We use GMM in our hybrid system because we need to combine fast and accurate approaches in order to keep the computational time of the system derived from our fusion approach acceptable. The decision on the identity of the speaker is based on a similarity measure between the test speaker model and all of the speaker models in the reference database. As the similarity measure we use maximum likelihood estimation (MLE).
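Given trained per-speaker GMMs, this maximum-likelihood decision can be sketched as below. This is an illustrative Python sketch assuming diagonal covariances; the parameter layout and function names are our own, not the paper's MATLAB code:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    # Total log-likelihood of frames X (T x d) under a diagonal-covariance
    # GMM, i.e. the log of the weighted sum of Gaussians of Eq. (3),
    # summed over all frames.
    T, d = X.shape
    comp = np.zeros((T, len(weights)))
    for i, (w, mu, var) in enumerate(zip(weights, means, variances)):
        diff = X - mu
        comp[:, i] = (np.log(w)
                      - 0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)))
                      - 0.5 * np.sum(diff ** 2 / var, axis=1))
    m = comp.max(axis=1, keepdims=True)  # log-sum-exp over components
    return float(np.sum(m[:, 0] + np.log(np.sum(np.exp(comp - m), axis=1))))

def identify(X, speaker_models):
    # ML decision: pick the speaker whose model best explains the test frames
    scores = {spk: gmm_log_likelihood(X, *params)
              for spk, params in speaker_models.items()}
    return max(scores, key=scores.get)
```

Working in the log domain with a log-sum-exp over components avoids the numerical underflow that a direct product of frame likelihoods would cause.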
To maximize the classification process for a given
set of feature vectors, the Expectation-Maximization (EM) algorithm is used.

B. The Expectation-Maximization Algorithm

In this study, the maximum likelihood (ML) method with Gaussian mixture models is used to design our speaker identification system. The EM algorithm is an efficient iterative procedure for computing the maximum likelihood estimate in the presence of missing data. In ML estimation, we seek the model parameters for which the observed data are most likely. Each EM iteration consists of two steps: expectation and maximization [26]. In the expectation step, the missing data are estimated from the observed data and the current parameter estimates. In the maximization step, the likelihood function is maximized under the assumption that the missing data are known, with the estimates from the expectation step used in place of the actual missing data. Convergence of the EM algorithm, illustrated in figure 8, is assured by the fact that the likelihood increases at almost every iteration [27].

Fig.8 Expectation-Maximization algorithm

IV. EXPERIMENTATION AND RESULTS

A. Effect of the pre-emphasis stage

Pre-processing of the speech signal is a very important stage in the design of any speech or speaker recognition system, because it has a direct influence on robustness and efficiency. Pre-emphasis of the speech signal aims to give more energy to the high frequencies. This is obtained with a high-pass filter having the following transfer function:

H(z) = 1 - alpha * z^(-1)    (4)

In this study we set alpha = 0.95.

Figures 9a and 9b show the benefit of the pre-emphasis pre-processing, which spreads the energy more uniformly over all the frequencies of the speech signal. This new frequency distribution consequently leads to a better characterization of the speech signal and an increase in the accuracy of the speaker recognition system.
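Equation (4) corresponds to a first-order difference in the time domain. A minimal Python sketch of the filter with alpha = 0.95 (the paper's implementation is in MATLAB):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # Eq. (4): H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1];
    # the first sample is passed through unchanged.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])
```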
The obtained results justify the use of this technique.

Fig.9 The spectrogram (a) before and (b) after pre-emphasis

The pre-emphasis processing also contributes to reducing the edge effect between successive frames, as illustrated in figure 10.

Fig.10 Inter-frame edge effect (a) before and (b) after pre-emphasis
B. Experimental conditions

In this study, we recorded a database of 51 speakers (35 male and 16 female). For each speaker we acquired two records: one of about 20 seconds for the training phase, and another of about 10 seconds for the recognition phase. All speech signals were acquired in .wav format with a sampling frequency of 16 kHz and 16-bit monophonic quantization, using the WaveSurfer software tool [28]. Gaussian white noise of zero mean and unit variance was added, at various levels, only to the test signals, in order to evaluate the robustness of our system under conditions similar to reality. The noise was added only to the test records because learning speech signals are generally recorded in a controlled environment, whereas in most practical cases the test speech signals are taken in noisy, uncontrolled environments. The feature extractors considered in this study are MFCC, GFCC, RASTA-PLP and the combination of GFCC and RASTA-PLP. The entire system is implemented in the MATLAB environment. Table I gives a detailed description of the experimental conditions of our study.

TABLE I. EXPERIMENTAL CONDITIONS OF THE STUDY

Task:                     text-independent automatic speaker identification
Feature sets:             MFCC, GFCC, RASTA-PLP, GFCC+RASTA-PLP
Back-end:                 Gaussian mixture model with NG = 4 mixtures
Coefficients per vector:  12 MFCC, 12 GFCC, 12 RASTA-PLP; 24 for GFCC+RASTA-PLP
Window size:              32 ms
Step size:                16 ms
Sampling rate:            16 kHz
Training set:             51 speakers (one 20 s record each)
Test set:                 51 speakers (one 10 s record each)
Noise type:               white Gaussian noise (zero mean, unit variance)
SNR range:                40 dB to 0 dB in steps of 5 dB
Platform:                 HP EliteBook Core i5, 2.4 GHz
Prog. environment:        MATLAB 7
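The contamination of the test records can be reproduced by scaling unit-variance white Gaussian noise to a target SNR. An illustrative sketch (the function name and seeding convention are our own):

```python
import numpy as np

def add_white_noise(clean, snr_db, seed=0):
    # Scale zero-mean, unit-variance white Gaussian noise so that the
    # signal-to-noise ratio of the mixture equals snr_db (in dB).
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(clean))
    p_signal = np.mean(np.asarray(clean, dtype=float) ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sweeping snr_db from 40 down to 0 in steps of 5 reproduces the test conditions of Table I.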
C. Performance measure

In this section we present the performance results of the recognition experiments conducted on four feature extractors: MFCC, GFCC, RASTA-PLP, and GFCC combined with RASTA-PLP. All tests were conducted in the presence of additive white Gaussian noise with a signal-to-noise ratio varying from 40 dB to 0 dB. The overall recognition accuracies are presented in Table II and figure 11. The results show that the performance obtained by fusing the GFCC and RASTA-PLP feature parameters is better than that obtained by using the MFCC, GFCC or RASTA-PLP feature parameters separately.

TABLE II. RECOGNITION ACCURACY (%) OF THE STUDIED METHODS

SNR (dB)   MFCC     GFCC     RASTA-PLP   GFCC+RASTA-PLP
40         100      100      100         100
35         100      98.37    98.37       100
30         91.18    98.37    88.67       98.37
25         82.37    94.15    80.37       94.15
20         58.75    80.37    60.12       80.37
15         25.32    58.75    27.65       60.12
10         20.91    25.32    20.91       25.32
5          16.45    20.91    16.45       20.91
0          10.85    10.85    10.85       16.45
Av. Acc.   56.09    65.23    55.95       66.20

Fig.11 The performance results of the studied methods

V. CONCLUSION

In this paper, a fusion approach based on the combination of two auditory models, GFCC and RASTA-PLP, has been explored for the feature extraction process. We chose these methods to build our front-end because they simulate the spectral and temporal aspects of the peripheral auditory system. A GMM classifier is used as the back-end for speaker modeling. The main objective of this study is to explore the new feature representation and evaluate the robustness of the implemented speaker identification system. It is therefore necessary to analyze the strengths and weaknesses of each set of parameters and to compare their performance in terms of identification rate (accuracy) under noisy conditions, but also in terms of response time (efficiency). The ultimate objective of all these analyses is to identify the strengths of each method in order to exploit them, combined or used separately, according to the requirements of the intended application.

Based on the results of the tests carried out on our database, we can conclude that the methods MFCC, GFCC and RASTA-PLP used separately all give a good accuracy of 100% when the speech is recorded in a quiet environment and the recognition test and learning phases take place under similar conditions. But when a high level of white Gaussian noise is added to the test speech, performance decreases considerably. However, the hybrid method RASTA-PLP combined with GFCC gives the best performance, with a relative improvement in average accuracy of about 10% compared to the conventional MFCC method as the SNR varies from 40 dB to 0 dB. Finally, a multiple combination of features may be deployed to implement a robust and efficient parametric representation for a speaker identification system. Different features may be complementary and can be combined to enhance accuracy. However, although such complementary features can be combined to improve accuracy, the computational load of the implemented system grows with the number of parameters to be processed, which may be unacceptable in terms of system efficiency. Future work will extend the study to other scenarios to improve not only the noise robustness but also the efficiency of the system.

REFERENCES
[1] D. R. Kisku, P. Gupta and J. K. Sing, "Advances in Biometrics for Secure Human Authentication and Recognition", CRC Press, Taylor and Francis Group, US, 2014.
[2] R. Parashar and S. Joshi, "Proportional study of human recognition methods", International Journal of Advanced Research in Computer Science and Software Engineering, 2012.
[3] K. Delac and M. Grgic, "A survey of biometric recognition methods", 46th International Symposium Electronics in Marine, 2004.
[4] E. Tazi, A. Benabbou and M. Harti, "Efficient text-independent speaker identification based on GFCC and CMN methods", International Conference on Multimedia Computing and Systems (IEEE ICMCS'12), Tangier, Morocco, 10-12 May 2012.
[5] E. Tazi, "A robust speaker identification system based on the combination of GFCC and MFCC methods", 5th International Conference on Multimedia Computing and Systems (IEEE ICMCS'16), Marrakech, Morocco, 29 September - 1 October 2016.
[6] X. Zhao, Y. Shao and D. L. Wang, "CASA-based robust speaker identification", IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1608-1616, 2012.
[7] Y. Shao, Z. Jin, D. Wang and S. Srinivasan, "An auditory-based feature for robust speech recognition", in Proc. ICASSP'09, 2009, pp. 4625-4628.
[8] J. Ming, T. J. Hazen, J. R. Glass and D. A. Reynolds, "Robust speaker recognition in noisy conditions", IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1711-1723, 2007.
[9] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
[10] M. R. Hasan et al., "Speaker identification using Mel frequency cepstral coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), Dhaka, Bangladesh, 28-30 December 2004.
[11] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. ASSP, August 1980.
[12] X. Zhou et al., "Linear versus Mel-frequency cepstral coefficients for speaker recognition", IEEE Workshop on ASRU, 2011, pp. 559-564.
[13] S. A. Khan et al., "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network", Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, 2015.
[14] J. Qi, D. Wang, Y. Jiang and R. Liu, "Auditory features based on gammatone filters for robust speech recognition", IEEE International Symposium on Circuits and Systems (ISCAS), 2013.
[15] J. Qi, D. Wang, Y. Jiang and R. Liu, "Auditory features based on gammatone filters for robust speech recognition", IEEE International Symposium on Circuits and Systems (ISCAS), 2013.
[16] J. Qi et al., "Bottleneck features based on gammatone frequency cepstral coefficients", Interspeech 2013, Lyon, France, 25-29 August 2013.
[17] H. Xu et al., "A new algorithm for auditory feature extraction", CSNT 2012, pp. 229-232.
[18] E. Tazi and N. El Makhfi, "A hybrid front-end for robust speaker identification under noisy conditions", Intelligent Systems Conference (IEEE IntelliSys'17), London, UK, 7-8 September 2017.
[19] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, April 1990.
[20] H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, October 1994.
[21] E. Zwicker, "Subdivision of the audible frequency range into critical bands", J. Acoust. Soc. Am., vol. 33, February 1961.
[22] Prithvi et al., "Comparative analysis of MFCC, LFCC, RASTA-PLP", International Journal of Scientific Engineering and Research (IJSER), vol. 4, issue 5, May 2016.
[23] H. Hermansky, "RASTA processing of speech", IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, October 1994.
[24] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, January 2000.
[25] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, pp. 91-108, 1995.
[26] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[27] L. Xu and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation, vol. 8, pp. 129-151, 1996.
[28] http://www.speech.kth.se/wavesurfer/

Authors' information

Prof. El bachir Tazi graduated in Electronic Engineering from ENSET Mohammedia in 1992. He received his postgraduate diplomas (DEA and DES) in Automatics and Signal Processing, and his PhD in Computer Science, from Sidi Mohammed Ben Abdellah University, Faculty of Sciences, Fez, Morocco, in 1995, 1999 and 2012 respectively. He is now a member of the research team "Physics, Computer Science and Process Modeling" and a professor at the Higher School of Technology, Khenifra, Moulay Ismail University, Morocco. His areas of interest include automatic speaker recognition, signal processing, pattern recognition, artificial intelligence and real-time processing using embedded systems.

Dr. Noureddine El makhfi received his MSc and PhD in Computer Science from Sidi Mohamed Ben Abdallah University, Faculty of Science and Technology, Fez, Morocco. He is now a member of the research team "Physics, Computer Science and Process Modeling" at the Higher School of Technology, Khenifra, Morocco. His main research interests include the processing and recognition of historical documents, cognitive science, image processing, computer vision, pattern recognition, document image analysis and OCR, artificial intelligence, Web document analysis, industrial data systems and embedded processors.