Fusion Approach for Robust Speaker
Identification system
El bachir Tazi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
elbachirtazi@yahoo.fr
Noureddine El makhfi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
n.elmakhfi@gmail.com
Abstract— The performance of speaker identification systems decreases significantly under noisy conditions, especially when there is a mismatch between the recognition and learning sessions. To improve robustness, in a previous study we proposed auditory features and a robust speaker recognition system using a front-end based on the combination of the MFCC and RASTA-PLP methods. In this paper, we further study auditory features by exploring the combination of GFCC and RASTA-PLP. We find that this method performs substantially better than all previously studied methods. Furthermore, our current identification system achieves a significant performance improvement of 5.92% across a wide range of signal-to-noise conditions compared with the previously studied front-end based on MFCC combined with RASTA-PLP. Experimental results show an average accuracy improvement of 10.11% for GFCC combined with RASTA-PLP over the baseline MFCC technique across various SNRs. This fusion approach yields an appreciable enhancement under highly noisy conditions.
Keywords—Robust Speaker Identification; Gammatone
Frequency Cepstral Coefficients GFCC; Relative Spectral
Transform Perceptual Linear Prediction RASTA-PLP; Gaussian
Mixture Model GMM.
I. INTRODUCTION
One of the most accepted forms of biometric identification for humans is the speech signal. Speaker recognition based on the speech signal is regarded as one of the most exciting technologies for human recognition [1,2,3]. Audio signal features can be classified as either perceptual or physical. In previous work, we studied perceptual features for speaker identification tasks [4,5]. In the current work we further study the perceptual mode based on auditory techniques, mainly GFCC and RASTA-PLP. Most published works in the area of speaker recognition focus on speech in noiseless environments, and few focus on speech under noisy conditions [6,7,8,9]. In this study we added white Gaussian noise at different levels to our test signals to simulate the real environments in which these systems are used. Learning systems for speaker identification that employ hybrid strategies can potentially offer significant advantages over single-strategy systems. In the proposed system, a hybrid algorithm based on GFCC combined with RASTA-PLP is used to improve the performance of text-independent speaker identification in noisy environments. Our system is implemented and simulated in the MATLAB environment using toolboxes such as the Signal Processing Toolbox, Voicebox and the HMM Toolbox.
The speaker identification task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech. The features are used to generate speaker models. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Figure 1 shows the block diagram of the standard structure of an automatic speaker recognition system.
Fig.1 Block diagram of an ASR system
The rest of this paper is organized as follows: Section 2 describes the GFCC and RASTA-PLP feature extraction techniques used, followed by a description of Gaussian mixture modeling and expectation-maximization classification in Section 3. In Section 4 we give the experimental results, and finally a conclusion and perspectives for future work are given in Section 5.
II. FEATURE EXTRACTION METHODS
Front-end analysis, or feature extraction, is the first step in an automatic speaker recognition task. It aims to extract features from the speech waveform that are compact and efficient to
represent the speaker’s voice imprint. Since speech is a non-stationary signal, the feature parameters should be estimated over short-term intervals of 16 ms to 32 ms, in which speech is considered to be stationary.
International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 8, August 2017, https://sites.google.com/site/ijcsis/, ISSN 1947-5500
front-end processing techniques used in the field of ASR are:
Linear Predictive Coding (LPC), Perceptual Linear Prediction
(PLP), Mel Frequency cepstral coefficients (MFCC),
Gammatone Frequency cepstral coefficients (GFCC) and
RASTA-PLP. In this study we examine RASTA-PLP combined with the auditory GFCC method to characterize speakers' voices. The conventional MFCC method is usually used as a baseline system for comparison and evaluation against other feature extraction methods. Many studies show that MFCC gives the best results in quiet environments, but its performance is surpassed in very noisy environments [10,11,12,13].
A. GFCC METHOD
GFCC (Gammatone Frequency Cepstral Coefficients) is a feature extraction method based on a gammatone filterbank. The filters in the bank are designed to simulate the auditory processing of the human ear [14,15] and are formulated as follows:

$g(t) = a\, t^{\,n-1}\, e^{-2\pi b t} \cos(2\pi f_c t + \varphi)$   (1)

where a is a constant, generally equal to 1; n is the filter order, set less than or equal to 4; φ is the phase shift between filters; and f_c and b are respectively the center frequency and the bandwidth of the filter in Hz, related by:

$b = 1.019 \cdot \mathrm{ERB} = 1.019 \cdot 24.7 \left( 4.37 \frac{f_c}{1000} + 1 \right)$   (2)
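As a concrete illustration, equations (1) and (2) can be sketched in a few lines of NumPy. This is a hedged Python sketch (the paper's own implementation is in MATLAB), and the function names here are our own:

```python
import numpy as np

def erb_bandwidth(fc):
    """Filter bandwidth b in Hz from the centre frequency fc, Eq. (2)."""
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, n=4, a=1.0, phi=0.0, duration=0.025):
    """Impulse response g(t) of one gammatone filter, Eq. (1)."""
    t = np.arange(int(duration * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)

# e.g. a 1 kHz channel at the paper's 16 kHz sampling rate
g = gammatone_ir(1000.0, 16000)
```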
The extraction of the best parametric representation of
acoustic signals is an important task to produce a better
identification performance. The efficiency of this phase is
important for the next phase since it affects its behavior. The overall process of the GFCC algorithm is shown in the block diagram in figure 2.
Fig.2 Block diagram of GFCC method
The GFCC algorithm is another FFT-based feature extraction technique in the speaker recognition field. The technique is based on the Gammatone Filter Bank (GTFB), which attempts to model the human auditory system as a series of overlapping band-pass filters [16,17]. The following figure 3 shows the shape of the gammatone filterbank used, with a 16 kHz sampling frequency.
Fig.3 Impulse response of a set of 10 Gammatone filters
As in conventional MFCC, the robust GFCC features are calculated from the spectra of a series of windowed speech frames of 32 ms, overlapping by 16 ms. First, the spectrum of a speech frame is obtained by applying a 512-point Fast Fourier Transform (FFT). Then the speech spectrum is passed through a 20-channel gammatone filterbank (GTFB). Equal-loudness weighting is applied to each filter output according to the centre frequency of the filter. After that, the logarithm of each filter output is taken. Finally, the Discrete Cosine Transform (DCT) is applied to the gammatone filterbank outputs in order to move from the spectral domain to the cepstral domain. Figures 4 and 5 show, respectively, an example of the behavior of 10 gammatone filterbank outputs and a gammatonegram of a 32 ms speech signal frame applied at the input.
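The pipeline just described (512-point FFT, 20-channel gammatone filterbank, logarithm, DCT) can be sketched as follows. This is a simplified Python/NumPy illustration under our own assumptions: the equal-loudness weighting step is omitted, the channel centre-frequency spacing is a simple rule of our choosing, and each filter's magnitude response is obtained from the FFT of its impulse response:

```python
import numpy as np

def gfcc(frame, fs=16000, nfft=512, nfilt=20, nceps=12):
    """Sketch of the GFCC pipeline: window -> FFT -> gammatone filterbank -> log -> DCT."""
    # power spectrum of one 32 ms windowed frame (512 samples at 16 kHz)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft)) ** 2
    # gammatone filterbank magnitude responses sampled on the FFT bins
    centers = np.linspace(100.0, fs / 2 - 500.0, nfilt)   # assumed channel spacing
    t = np.arange(nfft) / fs
    bank = []
    for fc in centers:
        b = 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)     # ERB bandwidth, Eq. (2)
        ir = t ** 3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
        resp = np.abs(np.fft.rfft(ir, nfft))
        bank.append(resp / resp.max())
    log_energies = np.log(np.dot(bank, spec) + 1e-10)
    # DCT-II moves the log filterbank energies to the cepstral domain
    k = np.arange(nceps)[:, None]
    n = np.arange(nfilt)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * nfilt))
    return dct @ log_energies
```

Applied frame by frame, this yields the 12 GFCC coefficients per frame listed in Table I.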
Fig.4 Example of a 10 gammatone filterbank outputs behavior
(Fig. 3 axes: frequency 0–8000 Hz versus amplitude in dB)
Fig. 5 Example of a gammatonegram of a speech signal
According to this gammatonegram, it can be seen that most of the energy of the speech signal is concentrated at the outputs of the first five gammatone filters.
B. RASTA-PLP Method
RASTA-PLP (RelAtive SpecTrAl PLP) analysis is a variant of PLP analysis designed to suppress spectral components that vary more slowly or more quickly than typical speech, such as those introduced by noise and channel effects. It is inspired by the fact that human perception responds to relative rather than absolute values [18,19,20]. This method, based on the classical PLP analysis, consists first of performing the short-term discrete Fourier transform and then calculating the amplitude spectrum in critical bands. Then the logarithm is applied to extract the spectral envelope of the speech signal. Band-pass filtering is then carried out in order to eliminate any offset or slowly varying components of the signal. After this, the amplitude is compressed by applying a cubic root in order to simulate the power law of the human ear. Finally, the coefficients are calculated according to the classical LPC method. The following figure 6 shows the overall process of the RASTA-PLP analysis [21,22,23].
Fig.6 Block diagram of RASTA-PLP method
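The band-pass filtering step at the heart of RASTA can be sketched as below. This is a hedged Python illustration: the coefficients follow the widely cited RASTA transfer function H(z) = 0.1·(2 + z⁻¹ − z⁻³ − 2z⁻⁴)/(1 − 0.94z⁻¹), but implementations vary in the exact pole value and in how the filter start-up is handled:

```python
import numpy as np

def rasta_filter(log_spec):
    """Band-pass filter each band's log-energy trajectory across frames.

    Implements y[n] = 0.1*(2x[n] + x[n-1] - x[n-3] - 2x[n-4]) + 0.94*y[n-1],
    i.e. H(z) = 0.1*(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.94 z^-1).
    log_spec has shape (bands, frames)."""
    num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    pole = 0.94
    out = np.zeros_like(log_spec, dtype=float)
    for band in range(log_spec.shape[0]):
        x = log_spec[band]
        y = 0.0
        for n in range(len(x)):
            acc = sum(c * x[n - k] for k, c in enumerate(num) if n - k >= 0)
            y = acc + pole * y
            out[band, n] = y
    return out
```

Because the numerator coefficients sum to zero, any constant (slowly varying) offset in a band's log-energy trajectory decays away, which is exactly the convolutive-noise removal RASTA is designed for.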
C. Hybrid Front-End Feature Extraction
In order to design a new robust feature extraction technique, we use a hybrid algorithm based on a combination of the previously described feature extraction methods, GFCC and RASTA-PLP. Each of these methods is first applied separately, and the resulting vectors are then concatenated to obtain a new feature representation vector. The block diagram in figure 7 shows the principle of the proposed hybrid front-end extractor.
Fig.7 Structure of the proposed feature extraction method
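The fusion itself is a simple frame-wise concatenation of the two 12-coefficient vectors into one 24-coefficient vector (per Table I). A minimal sketch, assuming each extractor returns a (frames × coefficients) matrix:

```python
import numpy as np

def fuse_features(gfcc_feats, rasta_feats):
    """Concatenate per-frame GFCC and RASTA-PLP vectors (12 + 12 -> 24)."""
    if gfcc_feats.shape[0] != rasta_feats.shape[0]:
        raise ValueError("the two extractors must produce the same number of frames")
    return np.hstack([gfcc_feats, rasta_feats])

fused = fuse_features(np.zeros((100, 12)), np.ones((100, 12)))
```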
III. PATTERN RECOGNITION
A. GMM Method
There are many approaches to pattern recognition in the field of speaker recognition: template approaches, statistical approaches, neural network approaches and multiple hybrid models. In this study we use the probabilistic GMM method, which is the state of the art in this field [24]. The Gaussian mixture model assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The probability density functions (PDFs) of many random processes, such as speech, are non-Gaussian. A non-Gaussian PDF may be approximated by a weighted mixture of Gaussian densities with appropriate mean vectors and covariance matrices, according to (3):

$f_i(x) = \frac{1}{(2\pi)^{d/2}\, |\Sigma_i|^{1/2}} \exp\!\left[ -\frac{1}{2} (x - \mu_i)'\, \Sigma_i^{-1} (x - \mu_i) \right]$   (3)

where f_i(x) is the probability density function; μ_i and Σ_i, i ∈ {1,…,M}, are respectively the mean vector and covariance matrix of each component; and d is the dimension of the vectors.
GMM is a conventional method for speaker recognition, known for its effectiveness and scalability in this field [25]. GMM is simple and fast to compute, in both the training and testing phases. A limitation of GMM is that it requires a sufficient amount of training data to ensure good performance, which increases the training time. GMM works well in terms of accuracy compared to other classification and recognition methods. We use GMM in our hybrid system because we need to combine fast and accurate approaches in order to keep the computational time of the system derived from our fusion approach acceptable. The decision on the identity of the speaker is based on a similarity measure between the test speaker model and all of the speaker models in the reference database. For the similarity measure we use maximum likelihood estimation (MLE). To maximize the classification process for a given
set of feature vectors, the Expectation Maximization (EM) algorithm is used.
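The maximum-likelihood decision described above (score the test frames against every enrolled speaker's GMM and pick the best model) can be sketched as follows. This Python sketch assumes diagonal covariance matrices, a common simplification, and the model format (weights, means, variances) is our own:

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of frames X (N x d) under a diagonal-covariance GMM."""
    log_probs = []
    d = X.shape[1]
    for w, mu, var in zip(weights, means, variances):
        diff = X - mu
        quad = np.sum(diff ** 2 / var, axis=1)
        log_norm = -0.5 * (d * np.log(2 * np.pi) + np.sum(np.log(var)))
        log_probs.append(np.log(w) + log_norm - 0.5 * quad)
    log_probs = np.stack(log_probs)          # shape (M, N)
    m = log_probs.max(axis=0)                # log-sum-exp over the M components
    return float(np.sum(m + np.log(np.sum(np.exp(log_probs - m), axis=0))))

def identify(X, speaker_models):
    """Maximum-likelihood decision: pick the speaker whose GMM best explains X."""
    scores = {name: gmm_loglik(X, *params) for name, params in speaker_models.items()}
    return max(scores, key=scores.get)
```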
B. The Expectation Maximization Algorithm
In this study, the maximum likelihood (ML) method using a Gaussian mixture model is used to design our speaker identification system. The EM algorithm is an efficient iterative procedure for computing the maximum likelihood estimate in the presence of missing data. In ML estimation, we seek the parameters of the model for which the observed data are most likely. Each EM iteration consists of two steps: expectation and maximization [26]. In the expectation step, the missing data are estimated based on the observed data and the current parameter estimates. In the maximization step, the likelihood function is optimized by assuming that the missing data are known, using the estimates from the expectation step in place of the actual missing data. The convergence of the EM algorithm, illustrated in figure 8, follows from the fact that the likelihood never decreases from one iteration to the next [27].
Fig.8 Expectation Maximization algorithm
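A minimal, self-contained EM loop for a diagonal-covariance GMM can be sketched as below. This is a hedged Python illustration (the paper's system is in MATLAB and uses NG = 4 mixtures per Table I); the deterministic initialization is our own choice:

```python
import numpy as np

def em_fit(X, M=4, iters=50):
    """Fit a diagonal-covariance GMM to frames X (N x d) with EM."""
    N, d = X.shape
    # deterministic initialization: spread the means along the first dimension
    order = np.argsort(X[:, 0])
    means = X[order[np.linspace(0, N - 1, M).astype(int)]].astype(float)
    covs = np.tile(X.var(axis=0) + 1e-6, (M, 1))
    weights = np.full(M, 1.0 / M)
    for _ in range(iters):
        # E-step: responsibility of each component for each frame
        logp = np.empty((M, N))
        for m in range(M):
            diff = X - means[m]
            logp[m] = (np.log(weights[m])
                       - 0.5 * (d * np.log(2 * np.pi)
                                + np.sum(np.log(covs[m]))
                                + np.sum(diff ** 2 / covs[m], axis=1)))
        logp -= logp.max(axis=0)             # for numerical stability
        resp = np.exp(logp)
        resp /= resp.sum(axis=0)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=1) + 1e-10
        weights = nk / N
        means = (resp @ X) / nk[:, None]
        for m in range(M):
            diff = X - means[m]
            covs[m] = (resp[m] @ diff ** 2) / nk[m] + 1e-6
    return weights, means, covs
```

The returned (weights, means, variances) triple is exactly the per-speaker model scored at identification time.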
IV. EXPERIMENTATION AND RESULTS
A. Effect of the pre-emphasis stage
The pre-processing of the speech signal is a very important stage in the design of any speech/speaker recognition system, because it has a direct influence on the system's robustness and efficiency. Pre-emphasis of the speech signal is intended to give more energy to the high frequencies. This is obtained by using a high-pass filter with the following transfer function:

$H(z) = 1 - \alpha z^{-1}$   (4)

In this study we fixed α = 0.95. Figures 9a and 9b show the benefit of the pre-emphasis preprocessing, which makes the distribution of energy more uniform over all the frequencies of the speech signal. This new frequency distribution consequently leads to an improvement in the characterization of the speech signal and an increase in the accuracy of the speaker recognition system. The obtained result justifies the use of this technique.
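Equation (4) amounts to the difference equation y[n] = x[n] − α·x[n−1]. A minimal sketch with the paper's α = 0.95:

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    """Apply the high-pass pre-emphasis filter H(z) = 1 - alpha*z^-1 of Eq. (4)."""
    y = np.empty(len(x), dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```

On a constant (DC) input, every output sample after the first is attenuated to (1 − α) of the input, which is how the filter shifts energy toward the high frequencies.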
(a)
(b)
Fig.9 The spectrogram (a) before and (b) after the pre-emphasis
The pre-emphasis processing also contributes to reducing the edge effect between successive frames, as illustrated in figure 10.
(a)
(b)
Fig.10 Edge effect interframes (a) before and (b) after pre-emphasis
B. Experimental conditions
In this study, we recorded a database of 51 speakers (35 male and 16 female). For each speaker we acquired two records: one of about 20 seconds for the training phase, the other of about 10 seconds, which serves later for the recognition phase. All speech signals were acquired in .wav format with a sampling frequency of 16 kHz and 16-bit monophonic quantization, using the Wavesurfer software tool [28]. Gaussian white noise of zero mean and unit variance at various levels was added only to the test signals, in order to evaluate the robustness of our system under conditions similar to reality. The noise was added only to the test records because learning speech signals are generally recorded in a controlled environment, whereas in most practical cases the test speech signals are captured in noisy, uncontrolled environments. The feature extractors considered in this study are MFCC, GFCC, RASTA-PLP and the combination of GFCC and RASTA-PLP. The entire system is implemented in the MATLAB environment. Table I gives a detailed description of the experimental conditions of our study.
TABLE I
EXPERIMENTAL CONDITIONS OF THE STUDY
Task system: text-independent automatic speaker identification
Feature set: MFCC, GFCC, RASTA-PLP, GFCC+RASTA-PLP
Back-end: Gaussian mixture model with NG = 4 mixtures
Coefficients per feature vector: 12 MFCC, 12 GFCC, 12 RASTA-PLP; 24 for GFCC+RASTA-PLP
Window size: 32 ms
Step size: 16 ms
Sampling rate: 16 kHz
Training set: 51 speakers (one record of 20 s for each speaker)
Test set: 51 speakers (one record of 10 s for each speaker)
Noise type: white Gaussian noise (zero mean, unit variance)
SNR range: from 40 dB to 0 dB with a step of 5 dB
Platform: HP EliteBook Core i5, 2.4 GHz
Prog. environment: MATLAB® 7
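The noisy test condition can be reproduced by scaling white Gaussian noise to a target SNR before adding it to the clean test signal. A hedged Python sketch (the paper adds unit-variance noise at various levels; here we instead scale the noise power to hit a requested SNR in dB, which has the same effect):

```python
import numpy as np

def add_white_noise(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise scaled to a target SNR in dB."""
    rng = np.random.default_rng(seed)
    p_signal = np.mean(signal ** 2)
    p_noise = p_signal / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(p_noise), size=signal.shape)
    return signal + noise
```

Sweeping snr_db from 40 down to 0 in 5 dB steps reproduces the test conditions of Table I.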
C. Performance measure
In this section we present the performance results of the recognition experiments conducted on four feature extractor sets: MFCC, GFCC, RASTA-PLP and GFCC combined with RASTA-PLP. All tests were conducted in the presence of additive white Gaussian noise with a signal-to-noise ratio varying from 40 dB to 0 dB. The overall recognition accuracies are presented in Table II and figure 11. The results show that the performance obtained by the fusion of the GFCC and RASTA-PLP feature parameters is better than that obtained by using the MFCC, GFCC or RASTA-PLP feature parameters separately.
TABLE II
THE RECOGNITION ACCURACY OF THE STUDIED METHODS
Fig.11 The performance results of the studied methods
V. CONCLUSION
In this paper a fusion approach based on the combination of two auditory models, GFCC and RASTA-PLP, has been explored for the feature extraction process. We chose these methods to build our front-end because they simulate the spectral and temporal aspects of the peripheral auditory system. The GMM classifier is used as a back-end for speaker modeling. The main objective of this study is to explore the new feature representation and evaluate the robustness of the implemented speaker identification system. It is therefore necessary to analyze the strengths and weaknesses of each set of parameters and to compare their performance in terms of identification rate (accuracy) under noisy conditions, but also in terms of response time (efficiency). The ultimate objective of all these analyses is to identify the strengths of each method in order to exploit them when they are combined or used separately, according to the requirements of the intended application. Based on the results of tests carried out on our database, we can conclude that the MFCC, GFCC and RASTA-PLP methods used separately all give a good accuracy of 100% when the speech is recorded in quiet environments and when the recognition test and learning phases share similar conditions. But when a high level of white Gaussian noise is added to the test speech, the performance decreases considerably. However, the hybrid method RASTA-
(Fig. 11: accuracy in % versus SNR in dB for MFCC, GFCC, RASTA-PLP and GFCC+RASTA-PLP)
SNR (dB)   MFCC     GFCC     RASTA-PLP   GFCC+RASTA-PLP
40         100      100      100         100
35         100      98.37    98.37       100
30         91.18    98.37    88.67       98.37
25         82.37    94.15    80.37       94.15
20         58.75    80.37    60.12       80.37
15         25.32    58.75    27.65       60.12
10         20.91    25.32    20.91       25.32
5          16.45    20.91    16.45       20.91
0          10.85    10.85    10.85       16.45
Av. Acc.   56.09%   65.23%   55.95%      66.20%
PLP combined with GFCC gives the best performance, with a relative improvement in average accuracy of 10% compared to the conventional MFCC method as the SNR varies from 40 dB to 0 dB.
Finally, multiple combinations of features may be deployed to implement a robust and efficient parametric representation for a speaker identification system. However, although these different features are complementary and can be combined to enhance accuracy, the computational load of the implemented system may increase with the number of parameters to be processed, which may be unacceptable in terms of system efficiency. Future work will extend the study to other scenarios to improve the noise robustness, but also the efficiency, of the system.
REFERENCES
[1] Dakshina Ranjan Kisku, Phalguni Gupta, Jamuna Kanta Sing, "Advances in Biometrics for Secure Human Authentication and Recognition", CRC Press, Taylor and Francis Group, US, 2014
[2] R. Parashar and S. Joshi, "Proportional study of human recognition methods", International Journal of Advanced Research in Computer Science and Software Engineering, 2012
[3] K. Delac and M. Grgic, "A survey of biometric recognition methods", 46th International Symposium Electronics in Marine, 2004
[4] El bachir Tazi, A. Benabbou, and M. Harti, “Efficient Text Independent
Speaker Identification Based on GFCC and CMN Methods" International
Conference on Multimedia Computing and Systems IEEE conference-
ICMCS’12, Tangier, Morocco, 10-12 may 2012.
[5] El Bachir Tazi, “A robust Speaker Identification System based on the
combination of GFCC and MFCC methods” , 5th International
Conference on Multimedia Computing and Systems – IEEE Conference
ICMCS’16, Marrakech, Morocco, 29 September – 1 October 2016
[6] X. Zhao, Y. Shao and D.L.Wang,"CASA-Based Robust Speaker
Identification,"IEEE Trans. Audio, Speech and Language Processing ,
vol.20, no.5, pp.1608 -1616, 2012.
[7] Y. Shao, Z. Jin, D. Wang, and S. Srinivasan, “An auditory-based feature
for robust speech recognition,” in Proc. I-CASSP’09, 2009, pp. 4625–
4628
[8] J. Ming, T.J. Hazen, J.R. Glass and D.A. Reynolds, "Robust speaker
recognition in noisy conditions," IEEE Trans. Audio, Speech, and
Language Processing , vol. 15(5), pp. 1711-1723, 2007
[9] Douglas A. Reynolds and Richard C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995
[10] Rashidul Hasan et al., "Speaker identification using Mel frequency cepstral coefficients", 3rd International Conference on Electrical & Computer Engineering ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh
[11] Davis S. B., Mermelstein P., "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. ASSP, Aug. 1980
[12] X. Zhou et al., "Linear versus Mel-frequency cepstral coefficients for speaker recognition", in IEEE Workshop on ASRU 2011, pp. 559-564
[13] Khan S. A. et al., "A unique approach in text independent speaker recognition using MFCC feature sets and probabilistic neural network", in Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, 2015
[14] J. Qi, D. Wang, Y. Jiang, and R. Liu, “Auditory feature based on
gammatone filters for robust speech recognition,” in the IEEE
International Symposium on Circuits and Systems (ISCAS), 2013
[15] J. Qi, D. Wang, Y. Jiang, and R. Liu, “Auditory feature based on
gammatone filters for robust speech recognition,” in the IEEE
International Symposium on Circuits and Systems (ISCAS), 2013
[16] Jun Qi et al., "Bottleneck features based on gammatone frequency cepstral coefficients", Interspeech 2013, 25-29 August 2013, Lyon, France
[17] He Xu et al., "A New Algorithm for Auditory Feature Extraction", CSNT 2012, pp. 229-232
[18] El Bachir Tazi, Noureddine El Makhfi, “An Hybrid Front-End for
Robust Speaker Identification under Noisy Conditions" Intelligent
Systems Conference, IEEE conference- IntelliSys’17, London, UK,
7-8 September 2017.
[19] H. Hermansky, "Perceptual linear predictive (PLP) analysis of
speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, Apr. 1990.
[20] H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE
Trans. on Speech and Audio Proc., vol. 2, no. 4, pp. 578-589, Oct.
1994.
[21] Zwicker E. Subdivision of the audible frequency range into critical
bands.J. Acoust. Soc. Am., Feb., 1961, 33.
[22] Prithvi et al., "Comparative Analysis of MFCC, LFCC, RASTA-PLP", International Journal of Scientific Engineering and Research (IJSER), Volume 4, Issue 5, May 2016
[23] Hynek Hermansky, "RASTA Processing of Speech", IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, October 1994
[24] D.A. Reynolds, T.F. Quatieri, R.B. Dunn, "Speaker verification using
adapted Gaussian mixture models", Digital Signal Processing, vol. 10,
pp. 19-41, January 2000.
[25] D.A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, pp. 91-108, 1995
[26] Dempster, A. P., Laird, N. M., and Rubin, D. B., "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977
[27] L. Xu and M.I. Jordan. On convergence properties of the EM algorithm
for Gaussian mixtures. Neural computation , 8:129–151, 1996
[28] http://www.speech.kth.se/wavesurfer/
Authors’ information
Prof. El bachir Tazi graduated in Electronic
Engineering from ENSET Mohammedia, in 1992.
He received his postgraduate diplomas DEA and
DES in Automatic and Signal Processing and PhD
in Computer Science from Sidi Mohammed Ben
Abdellah University, Faculty of Sciences Fez,
Morocco respectively in 1995, 1999 and 2012. He
is now a member of the research team “Physics, Computer Science and Process Modeling” and a professor at the Higher School of Technology Khenifra, Moulay Ismail
University, Morocco. His areas of interest include automatic speaker
recognition, signal processing, pattern recognition, artificial intelligence and
real time processing using embedded systems.
Dr Noureddine El makhfi received his MSc and
PhD in Computer Science from Sidi Mohamed Ben
Abdallah University, Faculty of Science and
Technology in Fez, Morocco. He is now a member of the research team “Physics, Computer Science and Process Modeling” at the Higher School of Technology Khenifra, Morocco. His main research interests include the processing and recognition of historical documents,
cognitive science, image processing, computer
vision, pattern recognition, document image analysis OCR, artificial
intelligence, Web document analysis, industrial data systems and embedded
processors.
Diabetes mellitus diagnosis method based random forest with bat algorithm
Machine learning based COVID-19 study performance prediction
Building Integrated photovoltaic BIPV_UPV.pdf

Fusion Approach for Robust Speaker Identification System

Fusion Approach for Robust Speaker Identification System

El bachir Tazi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
elbachirtazi@yahoo.fr

Noureddine El makhfi
R.T.: Physics, Computer Science and Process Modeling
Moulay Ismail University, ESTK
Khenifra, Morocco
n.elmakhfi@gmail.com

Abstract— The performance of speaker identification systems decreases significantly under noisy conditions, especially when the recognition and learning sessions differ. To improve robustness, we proposed in a previous study auditory features and a robust speaker recognition system using a front-end based on the combination of the MFCC and RASTA-PLP methods. In this paper, we study auditory features further by exploring the combination of GFCC and RASTA-PLP. We find that this method performs substantially better than all previously studied methods. Furthermore, our current identification system achieves a significant performance improvement of 5.92% over a wide range of signal-to-noise conditions compared with the previously studied front-end based on MFCC combined with RASTA-PLP. Experimental results show an average accuracy improvement of 10.11% for GFCC combined with RASTA-PLP over the baseline MFCC technique across various SNRs. This fusion approach yields a high and appreciable enhancement under the noisiest conditions.

Keywords—Robust Speaker Identification; Gammatone Frequency Cepstral Coefficients (GFCC); Relative Spectral Transform Perceptual Linear Prediction (RASTA-PLP); Gaussian Mixture Model (GMM).

I. INTRODUCTION

The most widely accepted form of biometric identification for a human is the speech signal. Speaker recognition based on a speech signal is treated as one of the most exciting technologies of human recognition [1,2,3]. Audio signal features can be classified either in the perceptual mode or in the physical mode.
In previous work, we studied perceptual features for speaker identification tasks [4,5]. In the current work we further study the perceptual mode based on auditory techniques, mainly GFCC and RASTA-PLP. Most published work in speaker recognition focuses on speech in noiseless environments, and few published works address speech under noisy conditions [6,7,8,9]. In this study we added white Gaussian noise at different levels to our test signals in order to simulate the real operating environment of these systems.

Learning systems for speaker identification that employ hybrid strategies can offer significant advantages over single-strategy systems. In the proposed system, a hybrid algorithm based on GFCC combined with RASTA-PLP is used to improve the performance of a text-independent speaker identification system in noisy environments. Our system is implemented and simulated in the MATLAB environment using toolboxes such as the Signal Processing Toolbox, Voicebox and the HMM Toolbox.

The speaker identification task is typically achieved by two-stage signal processing: training and testing. The training process calculates speaker-specific feature parameters from the speech; these features are used to generate speaker models. In the testing phase, speech samples from unknown speakers are compared with the models and classified. Figure 1 shows the block diagram of the standard structure of an automatic speaker recognition system.

Fig.1 Block diagram of an ASR system

The rest of this paper is organized as follows: Section 2 describes the GFCC and RASTA-PLP feature extraction techniques, followed by a description of Gaussian mixture modeling and expectation-maximization classification in Section 3. Section 4 gives the experimental results, and Section 5 closes with a conclusion and perspectives for future work.

International Journal of Computer Science and Information Security (IJCSIS), Vol. 15, No. 8, August 2017, 264. https://sites.google.com/site/ijcsis/ ISSN 1947-5500

II. FEATURE EXTRACTION METHODS

Front-end analysis, or feature extraction, is the first step in an automatic speaker recognition task. It aims to extract features from the speech waveform that are compact and efficient at representing the speaker's voice imprint. Since speech is a non-stationary signal, the feature parameters should be estimated over short-term intervals of 16 ms to 32 ms, in which speech can be considered stationary. The major types of front-end processing techniques used in the field of ASR are: Linear Predictive Coding (LPC), Perceptual Linear Prediction (PLP), Mel Frequency Cepstral Coefficients (MFCC), Gammatone Frequency Cepstral Coefficients (GFCC) and RASTA-PLP. In this study we use RASTA-PLP combined with the auditory GFCC method to characterize the speakers' voices. The conventional MFCC method is usually used as a baseline system for comparison and evaluation against other feature extraction methods. Many studies show that MFCC gives the best results in quiet environments but is outperformed in very noisy environments [10,11,12,13].

A. GFCC Method

GFCC (Gammatone Frequency Cepstral Coefficients) is a feature extraction method based on a gammatone filterbank. The filters in the bank are designed to simulate the auditory processing of the human ear [14,15] and are formulated as follows:

g(t) = a * t^(n-1) * e^(-2*pi*b*t) * cos(2*pi*fc*t + phi)    (1)

where a is a constant, generally equal to 1; n is the filter order, usually set to at most 4; phi is the phase shift between filters; and fc and b are, respectively, the center frequency and the bandwidth of the filter in Hz, related by:

b = 1.019 * ERB(fc) = 1.019 * 24.7 * (4.37 * fc / 1000 + 1)    (2)

Extracting the best parametric representation of the acoustic signal is an important task for good identification performance. The efficiency of this phase matters for the next phase, since it affects its behavior. The overall process of the GFCC algorithm is shown in the block diagram of figure 2.

Fig.2 Block diagram of the GFCC method

The GFCC algorithm is another FFT-based feature extraction technique in the speaker recognition field.
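As an illustration, equations (1) and (2) can be written down in a few lines. This is a sketch for illustration only (the paper's implementation is in MATLAB, and the function names here are our own):

```python
import numpy as np

def erb_bandwidth(fc):
    # Eq. (2): b = 1.019 * ERB(fc) = 1.019 * 24.7 * (4.37 * fc / 1000 + 1)
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs=16000, n=4, a=1.0, phi=0.0, dur=0.025):
    # Eq. (1): g(t) = a * t^(n-1) * exp(-2*pi*b*t) * cos(2*pi*fc*t + phi),
    # sampled at fs over a short support of `dur` seconds.
    t = np.arange(int(dur * fs)) / fs
    b = erb_bandwidth(fc)
    return a * t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phi)
```

The exponential envelope makes each impulse response decay quickly, so a support of a few tens of milliseconds is sufficient in practice.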
The technique is based on the Gammatone Filterbank (GTFB), which models the human auditory system as a series of overlapping band-pass filters [16,17]. Figure 3 shows the shape of the gammatone filterbank used, with a 16 kHz sampling frequency.

Fig.3 Impulse response of a set of 10 gammatone filters

As in conventional MFCC, the robust GFCC features are calculated from the spectra of a series of windowed speech frames of 32 ms, overlapping by 16 ms. First, the spectrum of a speech frame is obtained by applying a 512-point Fast Fourier Transform (FFT). The speech spectrum is then passed through a 20-filter gammatone filterbank (GTFB). Equal-loudness weighting is applied to each filter output, according to the centre frequency of the filter. After that, the logarithm of each filter output is taken. Finally, we apply the Discrete Cosine Transform (DCT) to the gammatone filterbank outputs in order to move from the spectral domain to the cepstral domain. Figures 4 and 5 show, respectively, an example of the outputs of a 10-filter gammatone filterbank and a gammatonegram of a 32 ms speech frame applied at the input.

Fig.4 Example of the outputs of a 10-filter gammatone filterbank
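The steps above (windowing, 512-point FFT, 20-channel gammatone filterbank, log compression, cosine transform) can be sketched as follows. This is a simplified Python illustration, not the authors' MATLAB code: it approximates each gammatone channel by the magnitude response of a 4th-order filter sampled on the FFT bin grid, and it omits the equal-loudness weighting:

```python
import numpy as np

def erb_bandwidth(fc):
    # Eq. (2): 1.019 * ERB at centre frequency fc (Hz)
    return 1.019 * 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_weights(n_filters=20, n_fft=512, fs=16000, fmin=50.0):
    # Spectral weights of the filterbank: centre frequencies spaced on the
    # ERB-rate scale, each channel weighted by an approximate 4th-order
    # gammatone magnitude response |G(f)|^2 ~ (1 + ((f - fc)/b)^2)^(-4).
    freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)
    erb_lo = 21.4 * np.log10(4.37e-3 * fmin + 1.0)
    erb_hi = 21.4 * np.log10(4.37e-3 * (fs / 2.0) + 1.0)
    fc = (10.0 ** (np.linspace(erb_lo, erb_hi, n_filters) / 21.4) - 1.0) / 4.37e-3
    b = erb_bandwidth(fc)
    return (1.0 + ((freqs[None, :] - fc[:, None]) / b[:, None]) ** 2) ** -4

def gfcc(frame, fs=16000, n_filters=20, n_fft=512, n_ceps=12):
    # Windowed frame -> FFT magnitude -> gammatone band energies -> log -> DCT
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    energies = gammatone_weights(n_filters, n_fft, fs) @ spectrum
    log_e = np.log(energies + 1e-10)
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * k + 1) / (2 * n_filters))
    return dct @ log_e  # 12 cepstral coefficients per 32 ms frame
```

With the parameters used in this paper (32 ms frames at 16 kHz, 20 filters, 12 coefficients), each frame yields a 12-dimensional GFCC vector.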
Fig.5 Example of a gammatonegram of a speech signal

From this gammatonegram, it can be seen that most of the energy of the speech signal is concentrated at the outputs of the first five gammatone filters only.

B. RASTA-PLP Method

RASTA-PLP (RelAtive SpecTrAl PLP) analysis is a variant of PLP analysis designed to suppress spectral components that vary more slowly or more quickly than speech, which are typically due to noise. It is inspired by the fact that human perception responds to relative rather than absolute values [18,19,20]. This method, based on the well-known PLP analysis, first performs the short-term discrete Fourier transform and then computes the amplitude spectrum in critical bands. The logarithm is then applied to extract the spectral envelope of the speech signal. Band-pass filtering is then carried out in order to eliminate any offset or slowly varying components of the signal. After this, the amplitude is compressed by applying a cube root in order to simulate the power law of the human ear. Finally, the coefficients are calculated according to the classical LPC method. Figure 6 shows the overall process of the RASTA-PLP analysis [21,22,23].

Fig.6 Block diagram of the RASTA-PLP method

C. Hybrid Front-End Feature Extraction

In order to design a new robust feature extraction technique, we use a hybrid algorithm based on a combination of the feature extraction methods described above, GFCC and RASTA-PLP. Each of these methods is first applied separately, and the resulting vectors are then concatenated to obtain a new feature representation vector. The block diagram in figure 7 shows the principle of the proposed hybrid front-end extractor.

Fig.7 Structure of the proposed feature extraction method

III. PATTERN RECOGNITION

A. GMM Method

There are many pattern recognition methods in the field of speaker recognition: template approaches, statistical approaches, neural network approaches and various hybrid models. In this study we use the probabilistic GMM method, which is the state of the art in this field [24]. The Gaussian mixture model assumes that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. The probability density functions (PDFs) of many random processes, such as speech, are non-Gaussian. A non-Gaussian PDF can be approximated by a weighted mixture of Gaussian densities with appropriate mean vectors and covariance matrices, each component having the form of (3):

f_i(x) = (1 / ((2*pi)^(d/2) * |Sigma_i|^(1/2))) * exp[ -(1/2) * (x - mu_i)' * Sigma_i^(-1) * (x - mu_i) ]    (3)

where f_i(x) is the probability density function of component i; mu_i and Sigma_i, i in {1...M}, are respectively the mean vector and covariance matrix of each component; and d is the dimension of the vectors.

GMM is a conventional method for speaker recognition, known for its effectiveness and scalability in this field [25]. GMM models are simple and fast to compute, in both the training and testing phases. A limitation of GMM is that it requires a sufficient amount of training data to ensure good performance, which increases the training time. GMM works well in terms of accuracy compared to other classification and recognition methods. We use GMM in our hybrid system because we need to combine fast and accurate approaches in order to keep the computational time of the system derived from our fusion approach acceptable. The decision on the identity of the speaker is based on a similarity measure between the test speaker model and all of the speaker models in the reference database. As the similarity measure we use maximum likelihood estimation (MLE).
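Given trained per-speaker GMMs, this maximum-likelihood decision can be sketched as below. This is an illustrative Python sketch assuming diagonal covariances; the parameter layout and function names are our own, not the paper's MATLAB code:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    # Total log-likelihood of frames X (T x d) under a diagonal-covariance
    # GMM, i.e. the log of the weighted sum of Gaussians of Eq. (3),
    # summed over all frames.
    T, d = X.shape
    comp = np.zeros((T, len(weights)))
    for i, (w, mu, var) in enumerate(zip(weights, means, variances)):
        diff = X - mu
        comp[:, i] = (np.log(w)
                      - 0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)))
                      - 0.5 * np.sum(diff ** 2 / var, axis=1))
    m = comp.max(axis=1, keepdims=True)  # log-sum-exp over components
    return float(np.sum(m[:, 0] + np.log(np.sum(np.exp(comp - m), axis=1))))

def identify(X, speaker_models):
    # ML decision: pick the speaker whose model best explains the test frames
    scores = {spk: gmm_log_likelihood(X, *params)
              for spk, params in speaker_models.items()}
    return max(scores, key=scores.get)
```

Working in the log domain with a log-sum-exp over components avoids the numerical underflow that a direct product of frame likelihoods would cause.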
To maximize the classification process for a given
set of feature vectors, the Expectation-Maximization (EM) algorithm is used.

B. The Expectation-Maximization Algorithm

In this study, the maximum likelihood (ML) method with Gaussian mixture models is used to design our speaker identification system. The EM algorithm is an efficient iterative procedure for computing the maximum likelihood estimate in the presence of missing data. In ML estimation, we seek the model parameters for which the observed data are most likely. Each EM iteration consists of two steps: expectation and maximization [26]. In the expectation step, the missing data are estimated from the observed data and the current parameter estimates. In the maximization step, the likelihood function is maximized under the assumption that the missing data are known, with the estimates from the expectation step used in place of the actual missing data. Convergence of the EM algorithm, illustrated in figure 8, is assured by the fact that the likelihood increases at almost every iteration [27].

Fig.8 Expectation-Maximization algorithm

IV. EXPERIMENTATION AND RESULTS

A. Effect of the pre-emphasis stage

Pre-processing of the speech signal is a very important stage in the design of any speech or speaker recognition system, because it has a direct influence on robustness and efficiency. Pre-emphasis of the speech signal aims to give more energy to the high frequencies. This is obtained with a high-pass filter having the following transfer function:

H(z) = 1 - alpha * z^(-1)    (4)

In this study we set alpha = 0.95.

Figures 9a and 9b show the benefit of the pre-emphasis pre-processing, which spreads the energy more uniformly over all the frequencies of the speech signal. This new frequency distribution consequently leads to a better characterization of the speech signal and an increase in the accuracy of the speaker recognition system.
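Equation (4) corresponds to a first-order difference in the time domain. A minimal Python sketch of the filter with alpha = 0.95 (the paper's implementation is in MATLAB):

```python
import numpy as np

def pre_emphasis(x, alpha=0.95):
    # Eq. (4): H(z) = 1 - alpha * z^-1, i.e. y[n] = x[n] - alpha * x[n-1];
    # the first sample is passed through unchanged.
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])
```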
The obtained results justify the use of this technique.

Fig.9 The spectrogram (a) before and (b) after pre-emphasis

The pre-emphasis processing also contributes to reducing the edge effect between successive frames, as illustrated in figure 10.

Fig.10 Inter-frame edge effect (a) before and (b) after pre-emphasis
B. Experimental conditions

In this study, we recorded a database of 51 speakers (35 male and 16 female). For each speaker we acquired two records: one of about 20 seconds for the training phase, and another of about 10 seconds for the recognition phase. All speech signals were acquired in .wav format with a sampling frequency of 16 kHz and 16-bit monophonic quantization, using the WaveSurfer software tool [28]. Gaussian white noise of zero mean and unit variance was added, at various levels, only to the test signals, in order to evaluate the robustness of our system under conditions similar to reality. The noise was added only to the test records because learning speech signals are generally recorded in a controlled environment, whereas in most practical cases the test speech signals are taken in noisy, uncontrolled environments. The feature extractors considered in this study are MFCC, GFCC, RASTA-PLP and the combination of GFCC and RASTA-PLP. The entire system is implemented in the MATLAB environment. Table I gives a detailed description of the experimental conditions of our study.

TABLE I. EXPERIMENTAL CONDITIONS OF THE STUDY

Task:                     text-independent automatic speaker identification
Feature sets:             MFCC, GFCC, RASTA-PLP, GFCC+RASTA-PLP
Back-end:                 Gaussian mixture model with NG = 4 mixtures
Coefficients per vector:  12 MFCC, 12 GFCC, 12 RASTA-PLP; 24 for GFCC+RASTA-PLP
Window size:              32 ms
Step size:                16 ms
Sampling rate:            16 kHz
Training set:             51 speakers (one 20 s record each)
Test set:                 51 speakers (one 10 s record each)
Noise type:               white Gaussian noise (zero mean, unit variance)
SNR range:                40 dB to 0 dB in steps of 5 dB
Platform:                 HP EliteBook Core i5, 2.4 GHz
Prog. environment:        MATLAB 7
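The contamination of the test records can be reproduced by scaling unit-variance white Gaussian noise to a target SNR. An illustrative sketch (the function name and seeding convention are our own):

```python
import numpy as np

def add_white_noise(clean, snr_db, seed=0):
    # Scale zero-mean, unit-variance white Gaussian noise so that the
    # signal-to-noise ratio of the mixture equals snr_db (in dB).
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(clean))
    p_signal = np.mean(np.asarray(clean, dtype=float) ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10.0 ** (snr_db / 10.0)))
    return clean + scale * noise
```

Sweeping snr_db from 40 down to 0 in steps of 5 reproduces the test conditions of Table I.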
C. Performance measure

In this section we present the performance results of the recognition experiments conducted on four feature extractors: MFCC, GFCC, RASTA-PLP, and GFCC combined with RASTA-PLP. All tests were conducted in the presence of additive white Gaussian noise with a signal-to-noise ratio varying from 40 dB to 0 dB. The overall recognition accuracies are presented in Table II and figure 11. The results show that the performance obtained by fusing the GFCC and RASTA-PLP feature parameters is better than that obtained by using the MFCC, GFCC or RASTA-PLP feature parameters separately.

TABLE II. RECOGNITION ACCURACY (%) OF THE STUDIED METHODS

SNR (dB)   MFCC     GFCC     RASTA-PLP   GFCC+RASTA-PLP
40         100      100      100         100
35         100      98.37    98.37       100
30         91.18    98.37    88.67       98.37
25         82.37    94.15    80.37       94.15
20         58.75    80.37    60.12       80.37
15         25.32    58.75    27.65       60.12
10         20.91    25.32    20.91       25.32
5          16.45    20.91    16.45       20.91
0          10.85    10.85    10.85       16.45
Av. Acc.   56.09    65.23    55.95       66.20

Fig.11 The performance results of the studied methods

V. CONCLUSION

In this paper, a fusion approach based on the combination of two auditory models, GFCC and RASTA-PLP, has been explored for the feature extraction process. We chose these methods to build our front-end because they simulate the spectral and temporal aspects of the peripheral auditory system. A GMM classifier is used as the back-end for speaker modeling. The main objective of this study is to explore the new feature representation and evaluate the robustness of the implemented speaker identification system. It is therefore necessary to analyze the strengths and weaknesses of each set of parameters and to compare their performance in terms of identification rate (accuracy) under noisy conditions, but also in terms of response time (efficiency). The ultimate objective of all these analyses is to identify the strengths of each method in order to exploit them, combined or used separately, according to the requirements of the intended application.

Based on the results of the tests carried out on our database, we can conclude that the methods MFCC, GFCC and RASTA-PLP used separately all give a good accuracy of 100% when the speech is recorded in a quiet environment and the recognition test and learning phases take place under similar conditions. But when a high level of white Gaussian noise is added to the test speech, performance decreases considerably. However, the hybrid method RASTA-PLP combined with GFCC gives the best performance, with a relative improvement in average accuracy of about 10% compared to the conventional MFCC method as the SNR varies from 40 dB to 0 dB. Finally, a multiple combination of features may be deployed to implement a robust and efficient parametric representation for a speaker identification system. Different features may be complementary and can be combined to enhance accuracy. However, although such complementary features can be combined to improve accuracy, the computational load of the implemented system grows with the number of parameters to be processed, which may be unacceptable in terms of system efficiency. Future work will extend the study to other scenarios to improve not only the noise robustness but also the efficiency of the system.

REFERENCES
[1] D. R. Kisku, P. Gupta and J. K. Sing, "Advances in Biometrics for Secure Human Authentication and Recognition", CRC Press, Taylor and Francis Group, US, 2014.
[2] R. Parashar and S. Joshi, "Proportional study of human recognition methods", International Journal of Advanced Research in Computer Science and Software Engineering, 2012.
[3] K. Delac and M. Grgic, "A survey of biometric recognition methods", 46th International Symposium Electronics in Marine, 2004.
[4] E. Tazi, A. Benabbou and M. Harti, "Efficient text-independent speaker identification based on GFCC and CMN methods", International Conference on Multimedia Computing and Systems (IEEE ICMCS'12), Tangier, Morocco, 10-12 May 2012.
[5] E. Tazi, "A robust speaker identification system based on the combination of GFCC and MFCC methods", 5th International Conference on Multimedia Computing and Systems (IEEE ICMCS'16), Marrakech, Morocco, 29 September - 1 October 2016.
[6] X. Zhao, Y. Shao and D. L. Wang, "CASA-based robust speaker identification", IEEE Trans. Audio, Speech, and Language Processing, vol. 20, no. 5, pp. 1608-1616, 2012.
[7] Y. Shao, Z. Jin, D. Wang and S. Srinivasan, "An auditory-based feature for robust speech recognition", in Proc. ICASSP'09, 2009, pp. 4625-4628.
[8] J. Ming, T. J. Hazen, J. R. Glass and D. A. Reynolds, "Robust speaker recognition in noisy conditions", IEEE Trans. Audio, Speech, and Language Processing, vol. 15, no. 5, pp. 1711-1723, 2007.
[9] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models", IEEE Trans. Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
[10] M. R. Hasan et al., "Speaker identification using Mel frequency cepstral coefficients", 3rd International Conference on Electrical & Computer Engineering (ICECE 2004), Dhaka, Bangladesh, 28-30 December 2004.
[11] S. B. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences", IEEE Trans. ASSP, August 1980.
[12] X. Zhou et al., "Linear versus Mel-frequency cepstral coefficients for speaker recognition", IEEE Workshop on ASRU, 2011, pp. 559-564.
[13] S. A. Khan et al., "A unique approach in text-independent speaker recognition using MFCC feature sets and probabilistic neural network", Eighth International Conference on Advances in Pattern Recognition (ICAPR), Kolkata, 2015.
[14] J. Qi, D. Wang, Y. Jiang and R. Liu, "Auditory features based on gammatone filters for robust speech recognition", IEEE International Symposium on Circuits and Systems (ISCAS), 2013.
[15] J. Qi, D. Wang, Y. Jiang and R. Liu, "Auditory features based on gammatone filters for robust speech recognition", IEEE International Symposium on Circuits and Systems (ISCAS), 2013.
[16] J. Qi et al., "Bottleneck features based on gammatone frequency cepstral coefficients", Interspeech 2013, Lyon, France, 25-29 August 2013.
[17] H. Xu et al., "A new algorithm for auditory feature extraction", CSNT 2012, pp. 229-232.
[18] E. Tazi and N. El Makhfi, "A hybrid front-end for robust speaker identification under noisy conditions", Intelligent Systems Conference (IEEE IntelliSys'17), London, UK, 7-8 September 2017.
[19] H. Hermansky, "Perceptual linear predictive (PLP) analysis of speech", J. Acoust. Soc. Am., vol. 87, no. 4, pp. 1738-1752, April 1990.
[20] H. Hermansky and N. Morgan, "RASTA processing of speech", IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, pp. 578-589, October 1994.
[21] E. Zwicker, "Subdivision of the audible frequency range into critical bands", J. Acoust. Soc. Am., vol. 33, February 1961.
[22] Prithvi et al., "Comparative analysis of MFCC, LFCC, RASTA-PLP", International Journal of Scientific Engineering and Research (IJSER), vol. 4, issue 5, May 2016.
[23] H. Hermansky, "RASTA processing of speech", IEEE Trans. Speech and Audio Processing, vol. 2, no. 4, October 1994.
[24] D. A. Reynolds, T. F. Quatieri and R. B. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, pp. 19-41, January 2000.
[25] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models", Speech Communication, vol. 17, pp. 91-108, 1995.
[26] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm", Journal of the Royal Statistical Society, Series B, vol. 39, pp. 1-38, 1977.
[27] L. Xu and M. I. Jordan, "On convergence properties of the EM algorithm for Gaussian mixtures", Neural Computation, vol. 8, pp. 129-151, 1996.
[28] http://www.speech.kth.se/wavesurfer/

Authors' information

Prof. El bachir Tazi graduated in Electronic Engineering from ENSET Mohammedia in 1992. He received his postgraduate diplomas (DEA and DES) in Automatics and Signal Processing, and his PhD in Computer Science, from Sidi Mohammed Ben Abdellah University, Faculty of Sciences, Fez, Morocco, in 1995, 1999 and 2012 respectively. He is now a member of the research team "Physics, Computer Science and Process Modeling" and a professor at the Higher School of Technology, Khenifra, Moulay Ismail University, Morocco. His areas of interest include automatic speaker recognition, signal processing, pattern recognition, artificial intelligence and real-time processing using embedded systems.

Dr. Noureddine El makhfi received his MSc and PhD in Computer Science from Sidi Mohamed Ben Abdallah University, Faculty of Science and Technology, Fez, Morocco. He is now a member of the research team "Physics, Computer Science and Process Modeling" at the Higher School of Technology, Khenifra, Morocco. His main research interests include the processing and recognition of historical documents, cognitive science, image processing, computer vision, pattern recognition, document image analysis and OCR, artificial intelligence, Web document analysis, industrial data systems and embedded processors.