TELKOMNIKA, Vol. 16, No. 2, February 2018, pp. 250~258
ISSN: 1693-6930, accredited A by DIKTI, Decree No: 58/DIKTI/Kep/2013
DOI: 10.12928/TELKOMNIKA.v16i1.7559
Received March 5, 2017; Revised September 18, 2017; Accepted September 30, 2017
Probabilistic Self-Organizing Maps for Text-Independent Speaker Identification

Ayoub Bouziane*¹, Jamal Kharroubi², Arsalane Zarghili³
Intelligent Systems and Applications Laboratory, Sidi Mohamed Ben Abdellah University
P.B: 2202 Immouzer road, Fez, Morocco
*Corresponding author, e-mail: ayoub.bouziane@usmba.ac.ma¹, jamal.kharroubi@usmba.ac.ma², arsalane.zarghili@usmba.ac.ma³
Abstract
The present paper introduces a novel speaker modeling technique for text-independent speaker identification using probabilistic self-organizing maps (PbSOMs). The basic motivation behind the introduced technique is to combine the self-organizing quality of self-organizing maps with the generative power of Gaussian mixture models. Experimental results show that the introduced modeling technique significantly outperforms the traditional technique based on classical GMMs trained with the EM algorithm or its deterministic annealing variant. More precisely, a relative accuracy improvement of roughly 39% is obtained, and the introduced technique exhibits much lower sensitivity to the initialization of the model parameters.
Keywords: speaker identification system, Gaussian mixture model (GMM), probabilistic self-organizing maps, EM algorithm, deterministic annealing EM algorithm, SOEM algorithm
Copyright © 2018 Universitas Ahmad Dahlan. All rights reserved.
1. Introduction
Gaussian mixture models (GMMs) [1], [2] are the simplest and most traditional speaker modeling technique in speaker recognition systems, as well as the basis of the most successful approaches that have emerged in the last decade.
Each speaker is modeled in the system as a mixture of Gaussian densities, which may
reflect the specific acoustical classes of the speaker. Generally, the parameters of the Gaussian
mixture models (GMMs) are estimated using the widely used and well-known EM algorithm.
Besides its advantages, such as its conceptual and computational simplicity, the EM algorithm suffers from some general drawbacks, notably its sensitivity to the initial model parameters, especially in a multivariate context, and its tendency to become trapped in local optima. To overcome these problems, various techniques have been proposed and used in the speaker recognition state of the art, such as the deterministic annealing EM algorithm proposed by Ueda and Nakano [3], split-and-merge algorithms, and heuristics for finding appropriate initial points for the EM algorithm.
In the same perspective, the probabilistic self-organizing maps method [4]–[6], which combines the strengths of self-organizing maps and mixture models, was proposed and yielded good results in several image processing applications. In the present study, the probabilistic self-organizing maps method is introduced and assessed for speaker modeling in speaker recognition applications. The results obtained using the probabilistic self-organizing maps are compared with the classical training of Gaussian mixture models using the EM algorithm and its deterministic variant.
The remainder of this paper is organized as follows. Section 2 briefly highlights the general operating structure of speaker identification systems. Sections 3 and 4 deal with the speaker modeling process: Section 3 gives a brief description of Gaussian mixture models and outlines the principle of the EM algorithm and its deterministic annealing variant, while Section 4 introduces the probabilistic self-organizing maps for speaker modeling in speaker recognition systems. Next, the experimental results are provided in Section 5. Finally, conclusions and future directions are drawn in Section 6.
2. The General Operating Structure of Speaker Identification Systems
The basic structure of automatic speaker identification systems, as shown in Figure 1,
consists of two distinct phases: the training phase and the testing phase.
Figure 1. The basic framework and components of speaker recognition systems
During the training phase, speech samples are gathered from new client speakers; individual feature vectors that reflect the characteristics of their vocal tracts are extracted and used to train a reference model for each client speaker. During the testing phase, the speech signal of the unknown speaker is acquired, the corresponding feature vectors are extracted and scored against the previously enrolled reference models, and the similarity scores computed from this comparison are used to make a decision about the identity of the speaker.
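As an illustration, the closed-set identification decision can be sketched as follows. This is a minimal sketch, not the authors' code; the per-model scoring interface (`speaker_models` mapping each speaker to a log-likelihood function) is a hypothetical assumption:

```python
import numpy as np

def identify_speaker(test_features, speaker_models):
    """Closed-set identification: pick the enrolled model that maximizes
    the average frame log-likelihood of the test utterance.

    test_features  : (T, D) array of feature vectors
    speaker_models : dict mapping speaker id -> callable returning the
                     per-frame log-likelihoods (hypothetical interface)
    """
    scores = {spk: np.mean(loglik(test_features))
              for spk, loglik in speaker_models.items()}
    return max(scores, key=scores.get)
```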
3. Speaker Modeling Using the Traditional Gaussian Mixture Models
Gaussian mixture models were first introduced to the speaker recognition community in 1995 [7]–[9]. Since then, they have become the predominant approach for speaker modeling in text-independent speaker recognition systems, and the basis of the most successful approaches that have emerged in the last decade. The basic idea underlying the GMM approach consists in modeling the distribution of the speaker's features as a Gaussian mixture density. The Gaussian mixture density is defined by a weighted sum of M Gaussian densities, as depicted in Figure 2, and is given by the following equation:
$$p(x_t \mid \lambda) = \sum_{i=1}^{M} w_i\, b_i(x_t) \qquad (1)$$

where $x_t$ is a D-dimensional feature vector, $b_i(x_t) = g(x_t \mid \mu_i, \Sigma_i),\ i = 1, 2, \dots, M$ are the Gaussian densities, and $w_i,\ i = 1, 2, \dots, M$ are the mixture weights. Each density component is a D-variate Gaussian function of the following form:

$$b_i(x_t) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x_t - \mu_i)^{\top} \Sigma_i^{-1} (x_t - \mu_i) \right\} \qquad (2)$$
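For concreteness, a minimal NumPy sketch of equations (1) and (2) follows. It is illustrative only, not the authors' implementation, and assumes diagonal covariances for simplicity:

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, covars):
    """Log-likelihood of frames X under a diagonal-covariance GMM.

    X       : (T, D) feature vectors
    weights : (M,)   mixture weights, summing to 1
    means   : (M, D) component means
    covars  : (M, D) diagonal covariances
    """
    T, D = X.shape
    # log b_i(x_t) for every frame/component pair, shape (T, M)
    diff = X[:, None, :] - means[None, :, :]            # (T, M, D)
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(covars), axis=1)
                    + np.sum(diff**2 / covars, axis=2))
    # log p(x_t | lambda) = logsumexp_i [log w_i + log b_i(x_t)]
    log_p = np.logaddexp.reduce(np.log(weights) + log_b, axis=1)
    return np.sum(log_p)
```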
Figure 2. Gaussian Mixture Density
The Gaussian mixture model is parameterized by the collection of mean vectors, covariance matrices and mixture weights of the Gaussian densities, $\lambda = \{w_i, \mu_i, \Sigma_i\},\ i = 1, 2, \dots, M$. The mixture weights $w_i$ furthermore satisfy the constraint $\sum_{i=1}^{M} w_i = 1$. The motivation behind the use of Gaussian mixture models for speaker modeling lies in the assumption that the Gaussian densities model a set of hidden acoustic classes that reflect the characteristics of the speaker-dependent vocal tract.

The model parameters $\lambda = \{w_i, \mu_i, \Sigma_i\},\ i = 1, 2, \dots, M$ are determined such that they best fit the distribution of the training feature vectors $X = \{x_1, \dots, x_T\}$; in other words, such that they maximize the log-likelihood of the GMM, $\log p(X \mid \lambda)$. The traditional and commonly used method in this context is maximum likelihood estimation (MLE) via the expectation-maximization (EM) algorithm.
3.1. Gaussian mixture models using the EM algorithm
The basic idea of the EM algorithm, as reported in Algorithm 1, consists in starting with an initial model $\lambda$ and estimating a new model $\bar{\lambda}$ such that $p(X \mid \bar{\lambda}) \geq p(X \mid \lambda)$. Next, the new estimated model becomes the initial model for the following iteration, and the process is repeated until the increase in the log-likelihood of the data under the current model falls below some convergence threshold.
Algorithm 1. The EM algorithm
Input: Training feature vectors $X = \{x_1, \dots, x_T\}$
Output: GMM of M components $\lambda = \{w_i, \mu_i, \Sigma_i\},\ i = 1, \dots, M$
1: Randomly initialize the model parameters $\lambda = \{w_i, \mu_i, \Sigma_i\}$.
2: Compute the a posteriori probability $P(i \mid x_t, \lambda)$:

$$P(i \mid x_t, \lambda) = \frac{w_i\, b_i(x_t)}{\sum_{k=1}^{M} w_k\, b_k(x_t)} \qquad (3)$$

3: Re-estimate the new model parameters, i.e. the mixture weights, mean and variance vectors, using the following equations:

$$\bar{w}_i = \frac{1}{T} \sum_{t=1}^{T} P(i \mid x_t, \lambda), \qquad
\bar{\mu}_i = \frac{\sum_{t=1}^{T} P(i \mid x_t, \lambda)\, x_t}{\sum_{t=1}^{T} P(i \mid x_t, \lambda)}, \qquad
\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} P(i \mid x_t, \lambda)\, x_t^2}{\sum_{t=1}^{T} P(i \mid x_t, \lambda)} - \bar{\mu}_i^2 \qquad (4)$$

4: Repeat steps 2-3 until convergence.
5: Return the model parameters $\lambda = \{w_i, \mu_i, \Sigma_i\}$.
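A compact NumPy sketch of one EM iteration for a diagonal-covariance GMM, following the conventions of the earlier snippet, might look as follows (an illustration of equations (3)-(4), not the authors' implementation):

```python
import numpy as np

def em_step(X, weights, means, covars, eps=1e-8):
    """One EM iteration for a diagonal-covariance GMM (equations (3)-(4))."""
    T, D = X.shape
    diff = X[:, None, :] - means[None, :, :]
    log_b = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(covars), axis=1)
                    + np.sum(diff**2 / covars, axis=2))          # (T, M)
    # E-step: responsibilities P(i | x_t, lambda), eq. (3)
    log_w = np.log(weights) + log_b
    resp = np.exp(log_w - np.logaddexp.reduce(log_w, axis=1, keepdims=True))
    # M-step: re-estimate weights, means and diagonal variances, eq. (4)
    Nk = resp.sum(axis=0) + eps                                  # (M,)
    new_weights = Nk / T
    new_means = (resp.T @ X) / Nk[:, None]
    new_covars = (resp.T @ (X**2)) / Nk[:, None] - new_means**2 + eps
    return new_weights, new_means, new_covars
```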
3.2. Gaussian mixture models using the DAEM algorithm
The deterministic annealing EM (DAEM) algorithm [3] is an EM variant based on the deterministic annealing concept. Its key idea consists in reformulating the problem of maximizing the log-likelihood in the classical EM algorithm as the problem of minimizing the thermodynamic free energy, defined through the maximum entropy principle and an analogy with statistical mechanics.

Similarly to the EM algorithm, the DAEM algorithm is an iterative procedure based on expectation and maximization steps. In the expectation step, a temperature-parameterized posterior distribution is introduced as follows:

$$P_\beta(i \mid x_t, \lambda) = \frac{\left( w_i\, b_i(x_t) \right)^{\beta}}{\sum_{k=1}^{M} \left( w_k\, b_k(x_t) \right)^{\beta}} \qquad (5)$$

where the temperature $1/\beta$ is gradually decreased during training, and the posterior distribution is optimized at each temperature. The temperature must be decreased as slowly as possible, particularly in the early stages of training. In the maximization step, the model parameters are estimated using the temperature-parameterized posterior distribution $P_\beta(i \mid x_t, \lambda)$ in exactly the same way as in the classical EM algorithm. See Figure 3.
Figure 3. Flowchart of the DAEM algorithm
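The tempered E-step of equation (5) differs from the standard one only by the exponent β; a minimal sketch under the same assumptions as the earlier GMM snippets:

```python
import numpy as np

def daem_responsibilities(log_w_b, beta):
    """Temperature-parameterized posteriors of equation (5).

    log_w_b : (T, M) array of log(w_i * b_i(x_t))
    beta    : inverse temperature in (0, 1], raised toward 1 following
              an annealing schedule such as beta(i) = sqrt(i/I)
    """
    scaled = beta * log_w_b               # (w_i b_i)^beta in the log domain
    return np.exp(scaled - np.logaddexp.reduce(scaled, axis=1, keepdims=True))
```

At β = 1 this reduces exactly to the standard EM posterior of equation (3); small β flattens the posteriors, which is what mitigates poor initializations.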
4. Speaker Modeling Using Probabilistic Self-Organizing Maps (PbSOM)
The self-organizing map (SOM), also commonly known as the Kohonen network [10], is the most popular unsupervised neural network for data clustering and visualization. The SOM approach was inspired by the self-organizing nature of the human cerebral cortex. Indeed, it is based on competition and neighborhood-update concepts, which preserve the topological relationships between classes in the network [11].

A self-organizing map, as shown in Figure 4, consists of two layers of neurons, an input layer and an output layer. The input layer is composed of N input neurons corresponding to the N input vectors $\{x_n,\ 1 \leq n \leq N\}$ to be classified, while the output layer (the so-called competitive layer) is composed of M output neurons $\{r_m,\ 1 \leq m \leq M\}$ corresponding to the M clusters $\{C_m,\ 1 \leq m \leq M\}$ to be determined. The input neurons are fully connected to the output neurons, which are connected to each other by a neighborhood relation $h_{ij},\ 1 \leq i, j \leq M$, dictating the structure of the layer. The layer structure is often specified by the following factors: the local lattice structure (hexagonal, rectangular, ...) and the dimension or global map shape (sheet, cylinder, ...). Self-organizing maps are trained iteratively in two steps: a competitive step and a cooperative step. In the first step, the output neurons compete with each other to determine the "winner" neuron(s) that best match(es) the input vector(s). In the second step, i.e. the cooperative step, the weights of the winner neuron(s) and of the neurons close to them in the SOM lattice are adjusted towards the input vector(s).
Therefore, the output neurons self-organize into an ordered map, in such a way that output neurons with similar weights are placed nearby after training.
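To make the competitive and cooperative steps concrete, here is a minimal sketch of one classical (non-probabilistic) online SOM update, assuming the usual Gaussian lattice kernel; it is illustrative only:

```python
import numpy as np

def som_update(weights, x, lattice_pos, lr=0.1, sigma=1.0):
    """One online SOM update: competition, then cooperative adjustment.

    weights     : (M, D) weight vector of each output neuron
    x           : (D,)   input vector
    lattice_pos : (M, q) coordinates of the neurons on the lattice
    """
    # Competitive step: the winner is the neuron closest to x
    winner = np.argmin(np.linalg.norm(weights - x, axis=1))
    # Cooperative step: the winner and its lattice neighbors move toward x,
    # weighted by a Gaussian neighborhood kernel on the lattice
    d2 = np.sum((lattice_pos - lattice_pos[winner])**2, axis=1)
    h = np.exp(-d2 / (2 * sigma**2))                      # (M,)
    weights += lr * h[:, None] * (x - weights)
    return weights
```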
Since the original SOM idea was proposed and proved successful in several clustering applications, numerous variations and improvements of it have been proposed in the literature. Among these are the probabilistic self-organizing maps.
Figure 4. Structure of Self-Organizing Map
Probabilistic self-organizing maps are a probabilistic variant of the traditional self-organizing maps, in which the response $n_k$ of each neuron $k$ to an input vector $x_i$ is modeled by a multivariate Gaussian $\theta_k = \{w_k, \mu_k, \Sigma_k\}$, as follows:

$$n_k(x_i; \theta_k) = \frac{1}{(2\pi)^{D/2}\, |\Sigma_k|^{1/2}} \exp\left\{ -\frac{1}{2} (x_i - \mu_k)^{\top} \Sigma_k^{-1} (x_i - \mu_k) \right\} \qquad (6)$$
In the literature, several formulations and algorithms have been proposed for training probabilistic self-organizing maps. Among the most widely studied and applied is the coupling-likelihood mixture model formulation together with the SOEM algorithm [4], [5].

The coupling-likelihood mixture model formulation was principally inspired by the work of Sum et al. [12], who interpreted Kohonen's sequential SOM learning algorithm as maximizing the local correlations (coupling energies) between the output neurons and their neighborhoods over the input training data.
Consider a SOM network ℵ of M output neurons, where each neuron $n_k$ is parameterized by a reference Gaussian $\theta_k = \{w_k, \mu_k, \Sigma_k\}$. The coupling energy between each neuron $n_k$ and its neighborhood, expressed as a probabilistic likelihood, is defined as follows [5]:

$$p_s(x_i \mid k, \lambda, h) = n_k(x_i; \theta_k) \prod_{l \neq k} n_l(x_i; \theta_l)^{h_{kl}} \qquad (7)$$

Here, $\lambda = \{\theta_1, \theta_2, \dots, \theta_M\}$ is the reference model of the whole SOM network ℵ, $h_{kl}$ denotes the neighborhood function that defines the strength of lateral interaction between neurons $k$ and $l \in \{1, 2, \dots, M\}$, and the term $\prod_{l \neq k} n_l(x_i; \theta_l)^{h_{kl}}$ represents the neighborhood response of the neuron $n_k$. Accordingly, the coupling likelihood (the coupling energy) of an input vector $x_i$ over the network ℵ can be depicted as shown in Figure 5 and defined by the following mixture likelihood:
$$p_s(x_i; \lambda, h) = \sum_{k=1}^{M} w_s(k)\, p_s(x_i \mid k, \lambda, h) \qquad (8)$$

Compared to the traditional GMM formulation, the coupling-likelihood mixture model formulation embeds a coupling-likelihood layer between the Gaussian-likelihood layer and the mixture-likelihood layer, in order to take into account the coupling between the neurons and their neighborhoods; see Figure 5.
Figure 5. The coupling likelihood of $x_i$ over the network ℵ
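A small log-domain sketch of equations (6)-(8) follows; it is illustrative, assuming diagonal covariances and unit self-coupling $h_{kk} = 1$:

```python
import numpy as np

def coupling_log_likelihood(x, ws, means, covars, H):
    """Coupling likelihood of one input x over the network (eqs. (6)-(8)).

    x      : (D,)   input vector
    ws     : (M,)   mixture weights w_s(k)
    means  : (M, D) neuron means
    covars : (M, D) diagonal covariances
    H      : (M, M) neighborhood matrix h_kl, with H[k, k] = 1
    """
    D = x.shape[0]
    diff = x[None, :] - means
    log_n = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(covars), axis=1)
                    + np.sum(diff**2 / covars, axis=1))   # (M,) log n_l(x)
    # log p_s(x | k) = sum_l h_kl * log n_l(x), the coupling energy of eq. (7)
    log_ps_k = H @ log_n                                   # (M,)
    # log p_s(x) = logsumexp_k [log w_s(k) + log p_s(x | k)], eq. (8)
    return np.logaddexp.reduce(np.log(ws) + log_ps_k)
```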
Algorithm 2. The SOEM algorithm
Input: Training feature vectors $X = \{x_1, \dots, x_N\}$
Output: Optimized Gaussian mixture model parameters $\lambda = \{\theta_k\},\ \theta_k = \{w_k, \mu_k, \Sigma_k\},\ k = 1, \dots, M$
1: Randomly initialize the model parameters $\lambda$.
2: Initialize the radius of the neighborhood function at a high value.
3: Repeat the following steps until convergence:
- Expectation step: compute the posterior probability of the Gaussian components representing the network neurons for each $x_i$:

$$P(k \mid x_i, \lambda, h) = \frac{w_s(k) \exp\left( \sum_{l} h_{kl} \log n_l(x_i; \theta_l) \right)}{\sum_{m=1}^{M} w_s(m) \exp\left( \sum_{l} h_{ml} \log n_l(x_i; \theta_l) \right)} \qquad (9)$$

- Maximization step: re-estimate the network parameters, i.e. the mean and variance vectors, using the following equations:

$$\bar{\mu}_k = \frac{\sum_{i=1}^{N} \left( \sum_{m=1}^{M} P(m \mid x_i, \lambda, h)\, h_{mk} \right) x_i}{\sum_{i=1}^{N} \left( \sum_{m=1}^{M} P(m \mid x_i, \lambda, h)\, h_{mk} \right)} \qquad (10)$$

$$\bar{\Sigma}_k = \frac{\sum_{i=1}^{N} \left( \sum_{m=1}^{M} P(m \mid x_i, \lambda, h)\, h_{mk} \right) (x_i - \bar{\mu}_k)(x_i - \bar{\mu}_k)^{\top}}{\sum_{i=1}^{N} \left( \sum_{m=1}^{M} P(m \mid x_i, \lambda, h)\, h_{mk} \right)} \qquad (11)$$
4: Decrease the radius of the neighborhood function.
5: Repeat steps 3-4 until the radius reaches a predefined minimum value.
6: Return the model parameters $\lambda = \{w_k, \mu_k, \Sigma_k\},\ k = 1, \dots, M$.
The neighborhood function is traditionally taken as a Gaussian kernel of the following form:

$$h_{kl} = \exp\left( -\frac{\| r_k - r_l \|^2}{2\sigma^2} \right) \qquad (12)$$

where $\| r_k - r_l \|$ is the Euclidean distance between the two neurons $r_k$ and $r_l$, and $\sigma$ is the radius of the neighborhood function. The network parameters, i.e. the reference model $\lambda$, are determined using the SOEM algorithm, which aims to maximize the following objective log-likelihood function:

$$L_s(\lambda) = \log\left( \prod_{i=1}^{N} p_s(x_i; \lambda, h) \right) = \sum_{i=1}^{N} \log\left( p_s(x_i; \lambda, h) \right) \qquad (13)$$

The SOEM algorithm is a modified EM algorithm that iteratively refines the network parameters by alternating between modified expectation and maximization steps until convergence. The specifics of the SOEM algorithm are reported in Algorithm 2 and depicted as a flowchart in Figure 6.
Figure 6. Flowchart of the SOEM algorithm
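A minimal sketch of one SOEM iteration under the same assumptions as the earlier snippets (diagonal covariances, Gaussian lattice kernel of equation (12)); this illustrates equations (9)-(11) and is not the reference implementation of [4], [5]:

```python
import numpy as np

def gaussian_neighborhood(lattice_pos, sigma):
    """Neighborhood matrix h_kl of equation (12) on a given lattice."""
    d2 = np.sum((lattice_pos[:, None, :] - lattice_pos[None, :, :])**2, axis=2)
    return np.exp(-d2 / (2 * sigma**2))

def soem_step(X, ws, means, covars, H, eps=1e-8):
    """One SOEM expectation/maximization step (equations (9)-(11))."""
    N, D = X.shape
    diff = X[:, None, :] - means[None, :, :]
    log_n = -0.5 * (D * np.log(2 * np.pi)
                    + np.sum(np.log(covars), axis=1)
                    + np.sum(diff**2 / covars, axis=2))       # (N, M)
    # E-step: posteriors over coupling likelihoods, eq. (9)
    log_ps = np.log(ws) + log_n @ H.T                          # (N, M)
    post = np.exp(log_ps - np.logaddexp.reduce(log_ps, axis=1, keepdims=True))
    # Neighborhood-smoothed responsibilities: sum_m P(m|x_i) h_mk
    r = post @ H                                               # (N, M)
    Nk = r.sum(axis=0) + eps
    # M-step: eqs. (10)-(11), diagonal-covariance case
    new_means = (r.T @ X) / Nk[:, None]
    new_covars = (r.T @ (X**2)) / Nk[:, None] - new_means**2 + eps
    # Weight update done EM-style (an assumption; the paper's M-step
    # lists only the mean and variance re-estimation)
    return post.mean(axis=0), new_means, new_covars
```

Between rounds the radius σ is decreased, as in steps 4-5 of Algorithm 2, so that the map first organizes globally and then fine-tunes locally.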
5. Experiments, Results and Discussion
The aim of the experiments performed in this study is to assess the performance of the introduced speaker modeling technique using probabilistic self-organizing maps, compared to the traditional technique using the EM algorithm or its deterministic variant.
5.1. Experimental Protocol
The experiments in this study were conducted on a speech corpus of 40 Moroccan speakers (17 female and 23 male) aged from 18 to 30 years. Each speaker was recorded in at least two recording sessions separated by around two to three weeks. The recorded speech includes free monologue in Moroccan dialect and read text in Arabic, French and English. The recordings were gathered from volunteer speakers over the internet as voice messages via Skype. In order to cover a wide range of real-life acoustic environments, the speakers were asked to make calls from many different places, e.g., home, office, etc. Furthermore, different kinds of equipment were used for recording (laptops, tablets, smartphones, ...). The voice messages were digitized at
16 kHz with a resolution of 16 bits (mono, PCM) and stored in the commonly used "wav" format.
The feature vectors of the speakers' utterances were extracted using the mel-frequency cepstral coefficients (MFCCs) [13]. Each frame was parameterized by a vector of 19 coefficients. The features are computed as follows. A pre-emphasis step is first performed using a simple first-order filter with transfer function $H(z) = 1 - 0.95\,z^{-1}$. Next, the emphasized speech signal is blocked into Hamming-windowed frames of 25 ms (400 samples) in length, with a shift of 10 ms (160 samples) between any two adjacent frames [13].
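A sketch of this preprocessing in NumPy, using the parameter values stated above (illustrative, not the authors' code):

```python
import numpy as np

def preemphasize_and_frame(signal, alpha=0.95, frame_len=400, frame_shift=160):
    """Pre-emphasis H(z) = 1 - 0.95 z^-1, then 25 ms Hamming-windowed
    frames with a 10 ms shift (400 and 160 samples at 16 kHz)."""
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    n_frames = 1 + (len(emphasized) - frame_len) // frame_shift
    idx = (np.arange(frame_len)[None, :]
           + frame_shift * np.arange(n_frames)[:, None])
    return emphasized[idx] * np.hamming(frame_len)
```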
During the training phase, one minute of active speech per speaker is used for building the speaker's model, whereas in the testing phase the evaluation data comprises 400 identification tests of 8 seconds each (i.e., ten tests per speaker, each 8 s in duration).
The temperature of the DAEM algorithm was updated as follows: $\beta(i) = \sqrt{i/I},\ i = 1, 2, \dots, I$, where $\beta(i)$ is the value of $\beta$ at the i-th temperature update step and I is the total number of temperature update steps (empirically chosen as I = 10). Regarding the SOEM algorithm, the probabilistic self-organizing maps were trained on rectangular lattices using the Gaussian kernel $h_{kl}$ as the neighborhood function. The neighborhood width was initially fixed at σ = 1 and gradually reduced to 0 during training.
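For illustration, the stated annealing schedule can be written as the following driver loop (an assumed outer loop, not the authors' code):

```python
import numpy as np

I = 10  # total number of temperature update steps (empirical choice above)
for i in range(1, I + 1):
    beta = np.sqrt(i / I)   # beta(i) = sqrt(i/I); 1/beta is the temperature
    # ... run DAEM E/M iterations at this beta until convergence ...
```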
5.2. Results and Discussion
The identification performances of the introduced PbSOM-based modeling technique and of the traditional GMM-based modeling techniques using the EM and DAEM algorithms are summarized in Figure 7. As can be seen, the performance evaluation was carried out at various model sizes (i.e. numbers of Gaussian components used for speaker modeling). Moreover, each experiment was repeated three times using the same experimental protocol and the same model size, in order to evaluate the techniques' sensitivity to the initial parameters.
Figure 7. The performance of the introduced PbSOM-based modeling technique using the
SOEM algorithm compared to the traditional GMM-based modeling techniques using the EM
and the DAEM algorithms
The obtained results clearly confirm the superiority of the introduced technique using the SOEM algorithm over the traditional technique using classical GMMs with the EM algorithm or its deterministic variant. Indeed, across the various model sizes, the DAEM algorithm outperforms the EM algorithm, and the SOEM algorithm significantly outperforms both the EM and DAEM algorithms. By way of illustration, the identification performance of the DAEM-based system with a model size of 128 Gaussians shows a relative accuracy improvement of roughly 11% over the system using the EM algorithm. Likewise, the identification performance of the SOEM-based system with the same model size (i.e. 128) shows relative accuracy improvements of approximately 39% and 32% over the systems using the EM and DAEM algorithms, respectively.
(Figure 7 plots the identification rate (%), on a scale from 94.50 to 99.00, against model size (32G, 64G, 128G, 256G and 512G) for the GMM/EM, GMM/DAEM and PbSOM/SOEM systems.)
Concerning the algorithms' sensitivity to parameter initialization, we can observe that the performance of the system using the EM algorithm is severely unstable when the same experiment is repeated with the same model size and experimental protocol; the EM algorithm thus appears to be strongly dependent on the initialization of the model parameters. We can also note that the DAEM algorithm is less sensitive to parameter initialization than the EM algorithm, and that the SOEM algorithm is much less sensitive to parameter initialization than both the EM and DAEM algorithms. Seemingly, the self-organizing quality of the SOEM algorithm makes it less sensitive to parameter initialization.
6. Conclusion
In this paper, a novel speaker modeling technique using probabilistic self-organizing maps (PbSOMs) has been introduced for text-independent speaker identification. The basic motivation behind the introduced technique was to combine the strengths of the traditional self-organizing maps and the Gaussian mixture models. Experimental results demonstrated that the introduced modeling technique using probabilistic self-organizing maps outperforms the traditional technique using classical GMMs with the EM algorithm or its deterministic variant.
References
[1] D. Reynolds, "Gaussian Mixture Models," in Encyclopedia of Biometrics, S. Z. Li and A. K. Jain, Eds. Boston, MA: Springer US, 2015, pp. 827-832.
[2] T. R. J. Kumari and H. S. Jayanna, "Limited Data Speaker Verification: Fusion of Features," International Journal of Electrical and Computer Engineering (IJECE), vol. 7, no. 6, pp. 3344-3357, Dec. 2017.
[3] N. Ueda and R. Nakano, "Deterministic annealing EM algorithm," Neural Networks, vol. 11, no. 2, pp. 271-282, Mar. 1998.
[4] S.-S. Cheng, H.-C. Fu, and H. Wang, "CEM, EM, and DAEM Algorithms for Learning Self-Organizing Maps," in 2007 IEEE Workshop on Machine Learning for Signal Processing, 2007, pp. 378-383.
[5] S.-S. Cheng, H.-C. Fu, and H. Wang, "Model-Based Clustering by Probabilistic Self-Organizing Maps," IEEE Transactions on Neural Networks, vol. 20, no. 5, pp. 805-826, May 2009.
[6] L. J. Lin Chang, "Skin detection using a modified Self-Organizing Mixture Network," 2013, pp. 1-6.
[7] D. A. Reynolds and R. C. Rose, "Robust text-independent speaker identification using Gaussian mixture speaker models," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, Jan. 1995.
[8] D. A. Reynolds, "Automatic speaker recognition using Gaussian mixture speaker models," The Lincoln Laboratory Journal, pp. 173-192, 1995.
[9] D. A. Reynolds, "Speaker identification and verification using Gaussian mixture speaker models," Speech Communication, vol. 17, no. 1, pp. 91-108, Aug. 1995.
[10] T. Kohonen, Self-Organizing Maps. Springer, 2001.
[11] T. Heskes, "Self-organizing Maps, Vector Quantization, and Mixture Modeling," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1299-1305, Nov. 2001.
[12] J. Sum, C. Leung, L. Chan, and L. Xu, "Yet Another Algorithm Which Can Generate Topography Map," IEEE Transactions on Neural Networks, vol. 8, pp. 1204-1207, 1997.
[13] B. Ayoub, K. Jamal, and Z. Arsalane, "An analysis and comparative evaluation of MFCC variants for speaker identification over VoIP networks," in 2015 World Congress on Information Technology and Computer Applications Congress (WCITCA), 2015, pp. 1-6.