International Journal of Electrical and Computer Engineering (IJECE)
Vol. 10, No. 2, April 2020, pp. 1859~1867
ISSN: 2088-8708, DOI: 10.11591/ijece.v10i2.pp1859-1867  1859
Journal homepage: http://ijece.iaescore.com/index.php/IJECE
High level speaker specific features modeling in automatic
speaker recognition system
Satyanand Singh1, Pragya Singh2
1School of Electrical and Electronics Engineering, Fiji National University, Fiji Island
2School of Public Health and Primary Care, Fiji National University, Fiji Island
Article Info ABSTRACT
Article history:
Received Apr 19, 2019
Revised Oct 29, 2019
Accepted Nov 6, 2019
Spoken words convey several levels of information. At the primary level,
speech conveys words or spoken messages, but at the secondary level,
it also reveals information about the speaker. This work is based on
high-level speaker-specific features and on statistical speaker modeling
techniques that express the characteristic sound of the human voice. Using
Hidden Markov Model (HMM), Gaussian Mixture Model (GMM), and Linear
Discriminant Analysis (LDA) models, we build an Automatic Speaker Recognition
(ASR) system that is computationally inexpensive and can recognize speakers
regardless of what is said. The performance of the ASR system is evaluated
on speech ranging from clear to a wide range of quality using the standard TIMIT
speech corpus. The ASR efficiency of the HMM-, GMM-, and LDA-based modeling
techniques is 98.8%, 99.1%, and 98.6%, and the Equal Error Rate (EER) is 4.5%,
4.4%, and 4.55%, respectively. The EER improvement of the GMM-based
ASR system compared with HMM and LDA is 4.25% and
8.51%, respectively.
Keywords:
Automatic speaker recognition (ASR)
Extreme learning machine (ELM)
Gaussian mixture model (GMM)
Hidden Markov model (HMM)
Linear discriminant analysis (LDA)
Support vector machines (SVM)
Universal background model (UBM)
Copyright © 2020 Institute of Advanced Engineering and Science.
All rights reserved.
Corresponding Author:
Satyanand Singh,
School of Electrical and Electronics Engineering,
Fiji National University, Fiji Island.
Email: satyanand.singh@fnu.ac.fj
1. INTRODUCTION
Most ASR modeling techniques make various mathematical assumptions about
speaker-specific features. If the voice data does not satisfy these assumptions, imperfections are introduced at
the ASR modeling stage. The mathematical model is nevertheless forced to fit the features, and recognition
scores are derived from these models and the test speech data. In ASR, modeling starts after the audio segments
have been converted into feature parameters. Modeling is the process of categorizing all speakers based on their
characteristics. The model should also provide a means of comparison with utterances from unknown speakers.
ASR modeling is called robust when its characterization of speaker-specific features is not significantly
affected by unwanted distortions. Ideally, the features would be designed so that inter-speaker discrimination
is maximal and no intra-speaker variation exists, in which case simple modeling
methods would be sufficient. In short, the non-ideal properties of the speaker-specific feature extraction
phase require compensation techniques during the ASR modeling phase so that the effect of
the nuisance variation present in the speech signal can be reduced during the testing of the speaker
recognition process. Most ASR modeling techniques make different mathematical assumptions about
the speaker-specific features. If the assumed properties are not met by the speech data, flaws are
introduced already during the ASR modeling phase.
The normalization of speaker-specific features can reduce these problems to some extent, but not
completely. As a result, the mathematical models are forced to fit the features, and speaker
recognition scores are obtained from these models and the test speech data. Artifacts are thus introduced
into the detection scores, and a family of score normalization techniques has been proposed
to compensate for this final-stage mismatch [1]. In essence, degradation of the acoustic signal affects
the speaker-specific features, models, and scores. Therefore, it is important to improve the robustness of ASR
systems in all three domains. It has been observed recently that, as speaker modeling techniques improve,
score normalization techniques become less effective [2].
Probabilistic modeling techniques such as GMM and HMM are widely used for speaker, language,
emotion, and speech recognition. In a probabilistic model, each speaker, language, or emotion is modeled as
a probability source with an unknown but fixed probability density function. The training phase estimates
the parameters of this probability density function from a sufficient number of training samples. For
recognition, the likelihood of the test utterance given each model is calculated. A GMM is a linear combination of
multivariate Gaussian distributions that models $P(X \mid C)$. A GMM can be converted into a posterior classifier using
Bayes' rule [3]. GMMs have further advantages, such as the ability to train the model on a large amount of
speech data and to adapt it to new data. When a GMM is used for an ASR application,
a speaker-independent Universal Background Model (UBM) is first trained on voice data. The UBM
represents the speaker-independent distribution of the feature vectors. When a new speaker is enrolled in
the ASR system, the parameters of the background model are adapted to the feature distribution of the new
speaker. The adapted model is then used as that speaker's model.
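The adaptation step described above can be sketched as follows. The text does not state which adaptation rule is used, so this minimal example assumes relevance-MAP adaptation of the means only, with diagonal covariances; the function name `map_adapt_means`, the relevance factor of 16, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_vars, ubm_weights, frames, relevance=16.0):
    """Relevance-MAP adaptation of UBM means to one speaker's frames.

    Assumes diagonal covariances; weights and variances stay unchanged.
    """
    # Log-density of every frame under every diagonal Gaussian: shape (T, K).
    diff = frames[:, None, :] - ubm_means[None, :, :]
    log_gauss = -0.5 * (np.sum(diff ** 2 / ubm_vars, axis=2)
                        + np.sum(np.log(2.0 * np.pi * ubm_vars), axis=1))
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)      # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)              # component posteriors

    n_k = post.sum(axis=0)                               # soft counts per component
    first_moment = post.T @ frames / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]           # data-dependent mixing factor
    return alpha * first_moment + (1.0 - alpha) * ubm_means

# Synthetic two-component UBM and enrollment frames near (1, 1).
rng = np.random.default_rng(0)
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))
ubm_weights = np.array([0.5, 0.5])
frames = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(200, 2))
speaker_means = map_adapt_means(ubm_means, ubm_vars, ubm_weights, frames)
```

Components that receive many frames move toward the speaker data, while components that receive almost none stay close to the UBM.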
Statistical Language Modeling (LM) is the science of building models that estimate the prior
probability of word strings. Language models have also been used successfully to model the prosody of speakers and languages.
The fundamental frequency $F_0$ and energy contours are labeled as discrete classes and then modeled using
bigrams or trigrams [4]. A hidden-event LM contains special words that appear in the model's N-grams but not in the
observed word stream. Instead, they correspond to states of the HMM and can be used to model linguistic events such as
unmarked sentence boundaries. Optionally, these events may be associated with non-lexical likelihoods to condition
the LM on other knowledge sources (e.g., prosody). A special type of hidden-event LM can model disfluent
speech by letting the hidden events modify the word history [5].
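As an illustration of modeling discretized prosody with N-grams, the sketch below trains an add-one smoothed bigram model per speaker and scores a test sequence. The label set (R, F, L for rising, falling, and level $F_0$), the training sequences, and the smoothing choice are hypothetical and not taken from the paper.

```python
from collections import Counter
import math

def train_bigram(sequences, vocab):
    """Add-one smoothed bigram probabilities over discrete prosodic labels."""
    unigram, bigram = Counter(), Counter()
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    size = len(vocab)
    return {(p, c): (bigram[(p, c)] + 1) / (unigram[p] + size)
            for p in vocab for c in vocab}

def log_likelihood(seq, model):
    """Log-probability of a label sequence under the bigram model."""
    return sum(math.log(model[(p, c)]) for p, c in zip(seq, seq[1:]))

# Hypothetical labels: R = rising F0, F = falling F0, L = level F0.
vocab = ["R", "F", "L"]
speaker_a = train_bigram([list("RFRFRFRL"), list("RFRFLRF")], vocab)
speaker_b = train_bigram([list("LLLRLLLF"), list("LLFLLLR")], vocab)
test = list("RFRFRF")
score_a = log_likelihood(test, speaker_a)
score_b = log_likelihood(test, speaker_b)
```

The test sequence alternates rises and falls, so it scores higher under the model of speaker A, whose training data shows the same pattern.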
Decision trees are also used successfully in prosodic modeling for ASR applications [6].
A decision tree proceeds by asking one question about the sample at a time. The feature used in
each question and the threshold in the question (e.g., normalized pitch greater than
a threshold value) are chosen so that they best separate the classes at that node of the tree. In the test phase, the decision tree estimates
the posterior probability of each class C for each sample X, resulting in $P(C \mid X)$ [7]. One of the main drawbacks
of decision trees is the greedy build process: at each step, the algorithm selects the single best variable and
the best split point, whereas a multi-step lookahead over combinations of variables might give a better result. Another
disadvantage is that continuous variables are implicitly discretized by the partitioning process, so
information is lost along the way. The advantage of decision trees over other machine learning methods is that
they are not black-box models but can easily be expressed as rules. In many applications, this advantage is
more important than the disadvantages, so these models are widely used in ASR applications.
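The greedy step described above can be sketched as a search for the single best (feature, threshold) pair at one node. This is a minimal illustration using Gini impurity as the split criterion, which is an assumed choice; the feature values and speaker labels are made up.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(samples, labels):
    """Greedy search for the single (feature, threshold) minimizing impurity."""
    best = (None, None, float("inf"))
    for f in range(len(samples[0])):
        for threshold in sorted({s[f] for s in samples}):
            left = [y for s, y in zip(samples, labels) if s[f] <= threshold]
            right = [y for s, y in zip(samples, labels) if s[f] > threshold]
            if not left or not right:
                continue  # a split must send samples to both children
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best[2]:
                best = (f, threshold, score)
    return best

# Hypothetical features per sample: (normalized pitch, normalized energy).
samples = [(0.2, 0.9), (0.3, 0.1), (0.4, 0.8), (0.7, 0.2), (0.8, 0.7), (0.9, 0.3)]
labels = ["spk1", "spk1", "spk1", "spk2", "spk2", "spk2"]
feature, threshold, impurity = best_split(samples, labels)
```

Only one variable and one threshold are examined at a time, which is exactly why combinations of variables that would separate the classes better can be missed.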
Discriminative models such as Artificial Neural Networks (ANN) [8] and Support Vector Machines
(SVM) are also used for prosodic modeling [9]. Deep Neural Networks (DNN) [10], Extreme Learning Machines
(ELM), and DNN-ELM combinations have proved useful for prosody-based speaker recognition [11]. The SVM is
an algorithmic implementation of ideas from statistical learning theory [12], which focuses on the problem
of constructing consistent estimators from data: how can the performance of a model on an unknown data set
be estimated when only the characteristics of the model and its performance on a training set are given?
Algorithmically, the support vector machine establishes an optimal separating boundary between data sets by
solving a constrained quadratic optimization problem [13]. By using different kernel functions, different
degrees of nonlinearity and flexibility can be included in the model. Because support vector machines are derived from
advanced statistical ideas and bounds on their generalization error can be calculated, they have attracted
considerable research interest over the past few years. Performance equal to or better than that of
other machine learning algorithms has been reported in the medical literature.
A disadvantage of the support vector machine is that the classification result is purely dichotomous and
no probability of class membership is given [14].
2. MODELING BASED ON PROSODY IN AUTOMATIC SPEAKER RECOGNITION SYSTEM
Prosody-based approaches obtain global statistics of the speaker's fundamental
frequency $F_0$ values for the speaker recognition task. The dynamics of the $F_0$ contour, which reflect
a person's speaking style, have been shown to help the speaker recognition task. The $F_0$ movement of
the speaker is modeled by fitting a piecewise linear model to the $F_0$ track to obtain a stylized $F_0$ contour.
Each linear $F_0$ segment is represented by its median $F_0$, slope, and duration. These features are modeled
by a log-normal distribution, a normal distribution, and a shifted exponential distribution, respectively. To
investigate the possibility of speaker recognition using prosody and idiolect, NIST introduced the extended data task,
based on the Switchboard corpus of conversational telephone speech. Unlike traditional speaker recognition tasks, the extended data task
provides multiple complete conversation sides (4/8/16 sides) for speaker training and testing of the ASR system.
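The per-segment features named above (median $F_0$, slope, duration) can be sketched as follows. The segment boundaries are assumed to be given; finding them (the piecewise linear fit itself) is outside this example, and the contour is synthetic.

```python
import numpy as np

def stylize_f0(times, f0, boundaries):
    """Return (median F0, slope, duration) for each linear segment.

    `boundaries` holds hypothetical, precomputed segment edges as frame indices.
    """
    features = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        t, y = times[start:end], f0[start:end]
        slope, _intercept = np.polyfit(t, y, 1)          # least-squares line fit
        features.append((float(np.median(y)), float(slope), float(t[-1] - t[0])))
    return features

times = np.arange(20) * 0.01                             # 20 frames, 10 ms apart
f0 = np.concatenate([np.linspace(100.0, 120.0, 10),      # rising segment (Hz)
                     np.linspace(120.0, 110.0, 10)])     # falling segment (Hz)
segments = stylize_f0(times, f0, [0, 10, 20])
```

Each tuple is one observation for the three distributions mentioned in the text: the median for the log-normal, the slope for the normal, and the duration for the shifted exponential.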
In [15], the focus is on investigating various prosodic features based on fundamental frequency,
segment duration, and pause duration. Duration features such as word features, phone durations, and duration
sequences have been used to model duration. In [16], duration, pitch, and energy features are calculated
for each estimated syllable region, with the syllable boundaries obtained from the ASR system. These features are
quantized and used to form N-grams, called N-gram based syllable non-uniform extraction region features.
In [17], continuous prosodic features were modeled using Joint Factor Analysis (JFA) for speaker
recognition. The prosodic features used are the pitch and energy contours over syllable-like units, represented
using a basis of Legendre polynomials. A standard GMM is used for modeling. In addition, the effects of
speaker and session variability are modeled in the same way as in conventional JFA. The Legendre polynomial
coefficients of pitch and energy, together with the length of the segment, constitute a 13-dimensional prosodic
feature set for GMM and factor analysis modeling [17].
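A sketch of such a feature vector is shown below. The split of the 13 dimensions is an assumption made here for illustration: six Legendre coefficients (order 5) for pitch, six for energy, and one duration value. The contours are synthetic.

```python
import numpy as np
from numpy.polynomial import legendre

def contour_coefficients(contour, order=5):
    """Least-squares Legendre coefficients of a contour mapped onto [-1, 1]."""
    x = np.linspace(-1.0, 1.0, len(contour))
    return legendre.legfit(x, contour, order)

def prosody_vector(pitch, energy, duration, order=5):
    """Concatenate pitch and energy coefficients with the segment duration."""
    return np.concatenate([contour_coefficients(pitch, order),
                           contour_coefficients(energy, order),
                           [duration]])

frames = np.linspace(-1.0, 1.0, 30)
pitch = 120.0 + 15.0 * frames            # synthetic, linearly rising pitch (Hz)
energy = 60.0 - 5.0 * frames ** 2        # synthetic, arch-shaped energy (dB)
vector = prosody_vector(pitch, energy, duration=0.3)
```

For the linear pitch contour, the first two coefficients carry the mean level and the slope, and the higher-order coefficients are close to zero.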
2.1. Eigenvoice consideration in hidden Markov models
In the standard eigenvoice approach, voice data is collected from a number of speakers under
diverse conditions. With each HMM state modeled as a mixture of Gaussian distributions, a set of
speaker-dependent HMMs is formed for each speaker. The speaker's voice is represented by a supervector
composed of the concatenation of the mean vectors of all Gaussian distributions of the HMMs. Therefore,
the i-th speaker supervector is composed of R components, one per Gaussian distribution, and is expressed as
$x_i = [x_{i1}', x_{i2}', \ldots, x_{iR}']' \in \mathbb{R}^{d_2}$. The similarity between any two speaker supervectors $x_i$ and $x_j$ is measured by
their dot product as follows.
$$x_i' x_j = \sum_{r=1}^{R} x_{ir}' x_{jr} \tag{1}$$
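Equation (1) can be checked numerically: the dot product of two concatenated supervectors equals the sum of the dot products of their per-Gaussian mean vectors. The sizes and random means below are hypothetical.

```python
import numpy as np

def supervector(gaussian_means):
    """Concatenate the R Gaussian mean vectors of one speaker's HMM set."""
    return np.concatenate(gaussian_means)

rng = np.random.default_rng(0)
R, d = 4, 3                                      # hypothetical sizes
means_i = [rng.normal(size=d) for _ in range(R)]
means_j = [rng.normal(size=d) for _ in range(R)]

x_i, x_j = supervector(means_i), supervector(means_j)
similarity = float(x_i @ x_j)                                     # left side of (1)
blockwise = sum(float(a @ b) for a, b in zip(means_i, means_j))   # right side of (1)
```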
Principal component analysis (PCA) is then performed on the training speaker supervectors, and
the resulting eigenvectors are referred to as eigenvoices. To adapt to a new speaker, his or her supervector
is expressed as a linear combination of the top $M$ eigenvoices, $s = s^{(ev)} = \sum_{m=1}^{M} w_m v_m$, where
$w = [w_1, w_2, \ldots, w_M]'$ is the eigenvoice weight vector.
Usually, fewer than ten eigenvoices are used, so that only a few seconds of adaptation speech
are required. The eighteen values computed for the eigenvoices are: 0.180696, 0.168936, 0.082378,
0.065117, 0.058677, 0.027971, 0.020124, 0.017375, 0.016086, 0.008081, 0.007063, 0.004332, 0.003474,
0.003072, 0.002031, 0.001976, 0.00112, and 0.001062. The adaptation data $o_t$, $t = 1, \ldots, T$, are used to estimate
the eigenvoice weights by maximizing the likelihood of $o_t$. Mathematically, $w$ is found by
maximizing the $Q$ function as follows:
$$Q(w) = \sum_{r=1}^{R} \gamma_1(r) \log(\pi_r) + \sum_{p,r=1}^{R} \sum_{t=1}^{T-1} \xi_t(p, r) \log(a_{pr}) + \sum_{r=1}^{R} \sum_{t=1}^{T} \gamma_t(r) \log(b_r(o_t, w)) \tag{2}$$
Here $\pi_r$ is the initial probability of state $r$, and $\gamma_t(r)$ is the posterior probability of being in state $r$
at time $t$ given the observation sequence. $\xi_t(p, r)$ is the posterior probability of being in state $p$
at time $t$ and in state $r$ at time $t + 1$, $a_{pr}$ is the transition probability from state $p$ to state $r$, and
$b_r$ is the $r$-th Gaussian probability density function.
Further, $Q_b(w) = \sum_{r=1}^{R} \sum_{t=1}^{T} \gamma_t(r) \log(b_r(o_t, w))$ is related to the new speaker supervector $s$ as follows:
$$Q_b(w) = -0.5 \sum_{r=1}^{R} \sum_{t=1}^{T} \gamma_t(r) \left[ d_1 \log(2\pi) + \log|C_r| + \| o_t - s_r(w) \|_{C_r}^{2} \right] \tag{3}$$
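Equation (3) can be evaluated directly for a candidate weight vector. The sketch below assumes diagonal covariance matrices and interprets $\| o_t - s_r(w) \|_{C_r}^{2}$ as the Mahalanobis distance $(o_t - s_r(w))' C_r^{-1} (o_t - s_r(w))$; the array layout and the synthetic data are illustrative assumptions.

```python
import numpy as np

def q_b(weights, eigenvoices, frames, gamma, variances):
    """Evaluate Q_b(w) of eq. (3) for diagonal covariances.

    eigenvoices: (M, R, d) eigenvoice supervectors split per Gaussian
    frames:      (T, d) adaptation observations o_t
    gamma:       (T, R) state posteriors gamma_t(r)
    variances:   (R, d) diagonals of C_r
    """
    s = np.tensordot(weights, eigenvoices, axes=1)          # s_r(w), shape (R, d)
    d = frames.shape[1]
    diff = frames[:, None, :] - s[None, :, :]               # (T, R, d)
    mahalanobis = np.sum(diff ** 2 / variances, axis=2)     # squared distance per (t, r)
    log_det = np.sum(np.log(variances), axis=1)             # log|C_r|
    return -0.5 * np.sum(gamma * (d * np.log(2.0 * np.pi) + log_det + mahalanobis))

rng = np.random.default_rng(1)
M, R, d, T = 2, 3, 2, 50
eigenvoices = rng.normal(size=(M, R, d))
true_w = np.array([0.8, -0.3])
# Frames scattered around the first Gaussian mean of the "true" supervector.
frames = np.tensordot(true_w, eigenvoices, axes=1)[0] + 0.1 * rng.normal(size=(T, d))
gamma = np.zeros((T, R))
gamma[:, 0] = 1.0                                           # all frames assigned to state 0
variances = np.ones((R, d))
q_true = q_b(true_w, eigenvoices, frames, gamma, variances)
q_zero = q_b(np.zeros(M), eigenvoices, frames, gamma, variances)
```

Weights that generated the data give a larger $Q_b$ than an unrelated weight vector, which is the property the maximization exploits.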
Here $C_r$ is the covariance matrix of the Gaussian at state $r$ in eqn. (3). The estimation of
eigenvoices is generalized by performing kernel PCA in place of linear PCA. Let $k(\cdot, \cdot)$ be
a kernel with a corresponding mapping $\varphi$, which maps a pattern $x$ in the speaker supervector space $\chi$
to $\varphi(x)$ in the speaker-specific feature space $\mathcal{F}$. Given a set of $N$ speaker supervectors
$(x_1, x_2, \ldots, x_{N-1}, x_N)$, denote the mean of the $\varphi$-mapped feature vectors by
$\bar{\varphi} = \frac{1}{N} \sum_{i=1}^{N} \varphi(x_i)$ and
the centered map by $\tilde{\varphi}(x) = \varphi(x) - \bar{\varphi}$. Next, eigendecomposition is performed on the centered kernel matrix $\tilde{K}$, where
$K = [k(x_i, x_j)]_{i,j}$. The $m$-th orthogonal eigenvector $v_m$ of the covariance matrix in the feature
space is represented as $v_m = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\lambda_m}} \tilde{\varphi}(x_i)$, using the decomposition
$\tilde{K} = U \Lambda U'$ of the $N \times N$ centered kernel matrix, where $U = [\alpha_1, \ldots, \alpha_{N-1}, \alpha_N]$ with
$\alpha_i = [\alpha_{i1}, \ldots, \alpha_{i(N-1)}, \alpha_{iN}]'$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{N-1}, \lambda_N)$. A computer-generated 8×8 orthogonal
eigenvector $v_m$ is presented in Table 1. A two-dimensional representation of utterances from the TIMIT database,
evaluated using KPCA with a linear solution and a non-linear SVM, is shown in Figure 1.
Table 1. A computer-generated 8×8 orthogonal eigenvector $v_m$
C1 C2 C3 C4 C5 C6 C7 C8
R1 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R2 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R3 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R4 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R5 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R6 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R7 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
R8 -1.0000 -0.8571 -0.7143 -0.5714 -0.4286 -0.2857 -0.1429 0.0000
Figure 1. Two-dimensional representation of utterances from the TIMIT database, evaluated using KPCA with a linear
solution and a non-linear SVM
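The kernel PCA steps of this subsection (kernel matrix, centering, eigendecomposition, projection) can be sketched as below. The RBF kernel and its width are assumed choices, since the text does not name the kernel, and the supervectors are random stand-ins.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.5):
    """K[i, j] = exp(-gamma * ||x_i - x_j||^2); the RBF kernel is an assumed choice."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(dist2, 0.0))

def kernel_pca(K, n_components):
    """Eigendecompose the centered kernel matrix and project the training points."""
    N = K.shape[0]
    one = np.full((N, N), 1.0 / N)
    K_centered = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(K_centered)            # ascending order
    order = np.argsort(eigvals)[::-1][:n_components]         # keep the largest
    lam, alpha = eigvals[order], eigvecs[:, order]
    # Projection of training point j onto v_m equals sqrt(lambda_m) * alpha_m[j].
    return K_centered, lam, alpha * np.sqrt(lam)

rng = np.random.default_rng(0)
supervectors = rng.normal(size=(6, 4))                       # hypothetical N = 6 speakers
K_centered, lam, projections = kernel_pca(rbf_kernel_matrix(supervectors), 2)
```

The projection formula follows from the expression for $v_m$: the inner product of $\tilde{\varphi}(x_j)$ with $v_m$ reduces to a row of the centered kernel matrix times $\alpha_m / \sqrt{\lambda_m}$.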
2.2. Gaussian mixture model (GMM) based high-level feature modeling
The GMM has become the dominant generative statistical model in state-of-the-art ASR systems. It
is an attractive statistical model because it can represent a wide variety of probability density functions when
a sufficient number of parameters is estimated. In general, a GMM contains a set of $N$ multivariate Gaussian density
functions indexed by $k$. The resulting probability density function for a particular speaker model
$i$ is a convex combination of all density functions. The GMM is built from standard multivariate Gaussian
densities but introduces the component index $k$ as a latent variable with discrete probability $p(k \mid i)$. The weights
are $w_k^i = p(k \mid i)$; they characterize the prior contributions of the corresponding components and satisfy
$\sum_{k=1}^{N} w_k^i = 1$. Each Gaussian density represents
a conditional density function $p(x_t \mid k, i)$. According to Bayes' theorem, the joint probability density
function $p(x_t, k \mid i)$ is given by the product of the two. Summing over all components yields the
multimodal probability density of the GMM as follows:
$$p(x_t \mid \Theta_i) = \sum_{k=1}^{N} p(k \mid \Theta_i) \cdot p(x_t \mid k, \Theta_i) = \sum_{k=1}^{N} w_k^i \cdot \mathcal{N}(x_t \mid \mu_k^i, \Sigma_k^i) \tag{4}$$
where $\mu_k^i$ is the mean vector and $\Sigma_k^i$ is the covariance matrix. Each component density is completely determined
by its mean vector and covariance matrix. The parameter set
$\Theta_i = \{w_1^i, w_2^i, \ldots, w_N^i, \mu_1^i, \mu_2^i, \ldots, \mu_N^i, \Sigma_1^i, \Sigma_2^i, \ldots, \Sigma_N^i\}$
contains the weighting factors, mean vectors, and covariance matrices of the specific speaker model $i$.
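Equation (4) can be evaluated in the log domain as sketched below. Diagonal covariance matrices and the two-component parameters are assumptions made for this illustration.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """log p(x_t | Theta_i) of eq. (4), assuming diagonal covariance matrices."""
    diff = x[:, None, :] - means[None, :, :]                        # (T, N, d)
    log_gauss = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                        + np.sum(np.log(2.0 * np.pi * variances), axis=1))
    weighted = np.log(weights) + log_gauss                          # log(w_k * N(...))
    peak = weighted.max(axis=1, keepdims=True)                      # log-sum-exp trick
    return (peak + np.log(np.sum(np.exp(weighted - peak), axis=1, keepdims=True)))[:, 0]

weights = np.array([0.4, 0.6])
means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
frames = np.array([[0.0, 0.0], [3.0, 3.0]])
log_p = gmm_log_density(frames, weights, means, variances)
```

Averaging `log_p` over all frames of a test utterance gives a per-model score; comparing the scores of different speaker models is one common way to decide on the speaker.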
Figure 2 illustrates the likelihood function of a GMM comprising seven Gaussian densities with
two-dimensional mean vectors and covariance matrices; $x_1$ and $x_2$ denote the elements of
the feature vector. The computer-generated log-likelihood values obtained while training the model of speaker 1
are -6.067379, -4.288333, -4.253459, -4.241043, -4.230592, -4.218451, -4.203952, -4.188224, -4.173566,
-4.161955, -4.153866, -4.148612, -4.145268, -4.143124, -4.141712, -4.140738. Computer-generated 8×8
training feature vectors of a speaker, modeled by a GMM, are presented in Table 2, and Table 3
presents testing feature vectors of the same speaker for a different text.
Figure 2. A likelihood function for a GMM with seven Gaussian densities
Table 2. A computer-generated 8×8 training feature vectors of a speaker by Gaussian mixture models
C1 C2 C3 C4 C5 C6 C7 C8
R1 4.0646 2.7960 3.3696 2.5665 1.4115 1.4582 1.3393 0.7637
R2 4.8317 3.5756 3.3678 2.8608 0.9304 0.8075 0.9295 1.1848
R3 3.7562 3.4273 3.8380 2.7522 1.3471 0.9934 1.4731 1.6576
R4 5.0021 3.3969 3.4032 2.2354 0.4914 0.8931 2.0563 1.4244
R5 4.1528 3.3462 3.8148 3.4006 1.8268 1.0450 1.5436 1.1512
R6 3.8352 3.1605 4.3616 2.8652 1.7510 1.0464 1.6336 1.3007
R7 4.1610 3.3430 4.4114 1.7857 1.1003 1.5388 1.3885 1.6549
R8 3.5921 3.7265 4.1634 2.5118 1.8623 1.5231 1.5569 1.4148
Table 3. 8X8 testing feature vectors of a speaker by Gaussian mixture models
C1 C2 C3 C4 C5 C6 C7 C8
R1 3.2927 2.0086 4.7630 3.1760 1.4675 0.9331 1.7318 1.3194
R2 3.6418 2.6172 5.1925 2.5124 0.5417 1.2929 1.9916 0.9756
R3 2.9897 1.6382 5.2565 4.0006 1.3647 1.8824 1.9576 1.0245
R4 3.4203 2.3760 4.4596 2.5434 1.0803 1.4107 1.8440 1.3208
R5 3.4864 2.9604 3.9410 3.2120 1.5138 1.5098 2.2160 1.2051
R6 4.0004 2.2980 4.2781 3.0504 1.8364 1.0121 1.2600 1.1491
R7 3.0806 2.0417 4.0331 3.6395 1.9743 1.8195 1.3774 1.0800
R8 2.9109 2.3116 4.6019 3.5167 2.3270 1.1858 2.6674 1.3994
2.3. Linear discriminant analysis (LDA) based high-level feature modeling
LDA is a commonly employed technique in statistical pattern recognition that aims at finding linear
combinations of feature coefficients that facilitate discrimination between multiple classes. It finds the orthogonal
directions in the feature space that are most effective for class discrimination. Projecting the original features onto
these directions improves classification accuracy. Let $D$ denote the set of all development utterances,
$w_{s,i}$ the utterance feature obtained from the $i$-th utterance of speaker $s$,
$n_s$ the total number of utterances belonging to $s$, and $S$ the total number of speakers in $D$.
The between-class and within-class covariance matrices, $S_b$ and $S_w$, are given by
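$$S_b = \frac{1}{S} \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^{T} \tag{5}$$
$$S_w = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_{s,i} - \bar{w}_s)(w_{s,i} - \bar{w}_s)^{T} \tag{6}$$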
where the speaker-dependent mean vector is given by $\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$ and the speaker-independent mean
vector is given by $\bar{w} = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$. The LDA optimization therefore maximizes
the between-class variance while minimizing the within-class variance. The projection is obtained from
this optimization by solving the generalized eigenvalue problem:
$$S_b v = \Lambda S_w v \tag{7}$$
where $\Lambda$ is the diagonal matrix containing the eigenvalues. If the matrix $S_w$ in eqn. (6) is invertible,
the solution can easily be found from the eigendecomposition of $S_w^{-1} S_b$. The matrix $A_{LDA}$ of dimension $R \times k$ is as follows:
$$A_{LDA} = [v_1 \; \ldots \; v_k] \tag{8}$$
where the $k$ eigenvectors $v_1, \ldots, v_k$ are obtained by solving eqn. (7). The LDA transformation of the utterance feature $w$ is
thus obtained as:
$$\Phi_{LDA}(w) = A_{LDA}^{T} w \tag{9}$$
A computer-generated 8×8 $\Phi_{LDA}(w)$ matrix of dimension $R \times k$ obtained with the LDA model is presented in Table 4.
Table 4. A computer-generated 8×8 $\Phi_{LDA}(w)$ matrix of dimension $R \times k$
C1 C2 C3 C4 C5 C6 C7 C8
R1 -0.5302 -0.6328 -0.6402 -0.5861 -0.5306 -0.5137 -0.5403 -0.5678
R2 -0.6601 -0.7932 -0.8189 -0.7774 -0.7347 -0.7332 -0.7773 -0.8138
R3 -0.6949 -0.8420 0.8846 -0.8622 -0.8389 -0.8565 -0.9219 -0.9783
R4 -0.6594 -0.8031 -0.8484 -0.8308 -0.8124 -0.8399 -0.9289 -1.0271
R5 -0.6314 -0.7653 -0.7968 -0.7584 -0.7169 -0.7325 -0.8374 -0.9885
R6 -0.6698 -0.8029 -0.8170 -0.7446 -0.6615 -0.6450 -0.7462 -0.9332
R7 -0.7548 -0.8985 -0.9072 -0.8157 -0.7044 -0.6588 -0.7423 -0.9333
R8 -0.7876 -0.9328 -0.9467 -0.8688 -0.7722 -0.7314 -0.8065 -0.9806
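Equations (5) to (9) can be sketched as follows. The example assumes $S_w$ is invertible and solves the generalized eigenproblem through $S_w^{-1} S_b$; the three synthetic speakers, separated along the first feature dimension, are made up for illustration.

```python
import numpy as np

def lda_projection(features_by_speaker, k):
    """Return A_LDA (R x k) built from eqs. (5)-(8); assumes S_w is invertible."""
    S = len(features_by_speaker)
    speaker_means = [w.mean(axis=0) for w in features_by_speaker]      # w-bar_s
    global_mean = np.mean(speaker_means, axis=0)                       # w-bar
    R = global_mean.shape[0]
    S_b, S_w = np.zeros((R, R)), np.zeros((R, R))
    for w, mean_s in zip(features_by_speaker, speaker_means):
        delta = (mean_s - global_mean)[:, None]
        S_b += delta @ delta.T / S                                     # eq. (5)
        centered = w - mean_s
        S_w += centered.T @ centered / (len(w) * S)                    # eq. (6)
    # Generalized eigenproblem of eq. (7), solved via S_w^{-1} S_b.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:k]                         # largest first
    return eigvecs[:, order].real                                      # eq. (8)

rng = np.random.default_rng(0)
speakers = [rng.normal(loc=center, scale=0.3, size=(40, 3))
            for center in ([0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [4.0, 0.0, 0.0])]
A_lda = lda_projection(speakers, k=1)
projected = [w @ A_lda for w in speakers]          # eq. (9) applied to each row
```

Because the speakers differ only along the first dimension, the leading eigenvector is dominated by that dimension and the projected speaker means stay well separated.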
LDA assumes normally distributed data, statistically independent features, and identical
covariance matrices for all classes. However, these assumptions matter only for LDA as a classifier. When they are violated,
LDA used for dimensionality reduction can still work reasonably well. Even for classification tasks, LDA appears to be robust
enough to the distribution of the data to be used in ASR applications. The speaker feature modeling histogram with
a normal fit to the eigenvector obtained from LDA is illustrated in Figure 3.
Figure 3. The speaker feature modeling histograms with normal fit eigenvector with LDA (x-axis: Length of Feature Vectors, from -0.5 to 0.5; y-axis: Counts, from 0 to 10)
3. ACOUSTIC DATA FEATURE EXTRACTION
The speaker-specific features refer to parameters extracted from short speech segments within
a 20-25 ms frame. The most common short-term acoustic features are Mel Frequency Cepstral Coefficients
(MFCC) and Linear Predictive Coding (LPC) based features [18,19,20]. To obtain these coefficients
from a speech recording, the speech samples are first divided into short overlapping segments. The signal
in each of these segments (frames) is then multiplied by a window function (e.g. Hamming or Hanning), and
the Fourier power spectrum is obtained. In the next step, the logarithm of the spectrum is calculated and
a filter bank analysis with filters spaced non-linearly on the mel scale is performed. The logarithm expands
the range of the coefficients and turns multiplicative components into additive ones [21]. In the filter
bank analysis, a spectral energy (also called a filter bank energy coefficient) is generated for each channel to
represent a different frequency band.
Filter banks, like the human auditory system, are designed to be more sensitive to frequency changes
at the lower end of the spectrum. Finally, the MFCCs are obtained by performing a discrete cosine transform (DCT)
on the filter bank energy parameters and retaining a number of the leading coefficients [22, 23]. The DCT has two important
properties: (i) it compacts the energy of the signal into a few coefficients, and (ii) its coefficients are highly
decorrelated. For these reasons, removing some dimensions using the DCT improves
the efficiency of the model and reduces some nuisance components [24]. Furthermore, the decorrelation
property of the DCT supports models that assume the feature coefficients are uncorrelated. In summary,
the sequence of operations (power spectrum, logarithm, DCT) produces the well-known
cepstral representation of the signal [25].
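The extraction chain of this section can be sketched with NumPy alone. All parameter values (16 kHz sampling, 25 ms frames, 10 ms hop, 512-point FFT, 26 filters, 13 coefficients) are assumptions chosen for illustration, and this sketch takes the logarithm after the mel filter bank, which is the common convention.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                     # rising edge
            bank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):                    # falling edge
            bank[m - 1, k] = (right - k) / (right - center)
    return bank

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_filters=26, n_ceps=13):
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_fft = 512
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)          # overlapping, windowed frames
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft   # Fourier power spectrum
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(np.maximum(energies, 1e-10))    # log filter bank energies
    # DCT-II written out explicitly; only the leading n_ceps coefficients are kept.
    n = np.arange(n_filters)
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ basis.T

t = np.arange(16000) / 16000.0
signal = np.sin(2.0 * np.pi * 440.0 * t)                  # 1 s synthetic tone
features = mfcc(signal)
```

Each row of `features` is the cepstral vector of one frame; keeping only the leading DCT coefficients is the dimension reduction discussed above.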
4. EXPERIMENTAL SETUP
The experiment uses the TIMIT database. The proposed algorithm was implemented in MATLAB,
and results were compared among the eigenvoice-based HMM, GMM, and LDA models. A total of 1000
utterances from the TIMIT database, with 6 sec, 4 sec, and 2 sec of voice, were used to train and test the ASR system.
For these cases, the ASR recognition efficiency was calculated as Efficiency = (number of utterances
correctly identified) / (total number of utterances under test). Table 5 shows the efficiency of the ASR system
for HMM, GMM, and LDA. It can be observed from this table that GMM has the highest
efficiency of the three modeling techniques. Figure 4 shows the equal error rate (EER) of the HMM-, GMM-,
and LDA-based modeling techniques. The ASR efficiency of the HMM-, GMM-, and LDA-based modeling techniques
is 98.8%, 99.1%, and 98.6%, and the EER is 4.5%, 4.4%, and 4.55%, respectively. The EER improvement of
the GMM-based ASR system compared with HMM and LDA is 4.25% and 8.51%,
respectively.
Figure 4. Equal Error Rate of ASR system of HMM, GMM and
LDA based modeling technique for 2 sec of voice data (x-axis: False Alarm probability in %; y-axis: Miss probability in %; curves: HMM, GMM, LDA)
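The EER is the operating point at which the miss probability equals the false alarm probability. A minimal sketch of estimating it from verification scores is given below; the threshold sweep and the scores are illustrative assumptions, not the evaluation code used for the figure.

```python
def equal_error_rate(genuine_scores, impostor_scores):
    """Sweep thresholds and return the error rate where miss and false alarm meet."""
    best_gap, eer = float("inf"), None
    for threshold in sorted(set(genuine_scores) | set(impostor_scores)):
        miss = sum(s < threshold for s in genuine_scores) / len(genuine_scores)
        false_alarm = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(miss - false_alarm)
        if gap < best_gap:
            best_gap, eer = gap, (miss + false_alarm) / 2.0
    return eer

# Hypothetical verification scores (higher means "same speaker").
genuine = [0.9, 0.8, 0.75, 0.7, 0.4]
impostor = [0.6, 0.5, 0.3, 0.2, 0.1]
eer = equal_error_rate(genuine, impostor)    # 0.2 for these scores
```

At the threshold 0.6, one of five genuine scores is rejected and one of five impostor scores is accepted, so both error rates equal 20%.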
Table 5. Efficiency of the ASR system for HMM, GMM and LDA respectively
Voice length   HMM Efficiency (%)   HMM EER (%)   GMM Efficiency (%)   GMM EER (%)   LDA Efficiency (%)   LDA EER (%)
6 sec          99.6                 4.9           99.9                 4.7           99.1                 5.1
4 sec          98.8                 4.9           99.5                 4.7           98.2                 5.1
2 sec          98.8                 4.9           99.1                 4.7           98.6                 5.1
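The efficiency formula and a relative EER comparison can be worked through as below. Expressing the improvement relative to the GMM EER is an assumption; with the EER values of Table 5 it gives approximately 4.255% and 8.511%, close to the reported 4.25% and 8.51%. The count of 991 correct utterances is likewise a hypothetical value consistent with 99.1% of 1000.

```python
def efficiency(correct, total):
    """Recognition efficiency in percent."""
    return 100.0 * correct / total

def relative_improvement(eer_other, eer_gmm):
    """EER gap expressed as a percentage of the GMM EER (assumed definition)."""
    return 100.0 * (eer_other - eer_gmm) / eer_gmm

eer_table = {"HMM": 4.9, "GMM": 4.7, "LDA": 5.1}      # EER columns of Table 5
vs_hmm = relative_improvement(eer_table["HMM"], eer_table["GMM"])
vs_lda = relative_improvement(eer_table["LDA"], eer_table["GMM"])
gmm_2s = efficiency(991, 1000)                        # hypothetical count giving 99.1
```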
5. CONCLUSION
This paper presented the research, development, and evaluation of an ASR system based on HMM, GMM,
and LDA modeling techniques. GMM models provide a simple but effective representation that offers
computationally inexpensive, high recognition accuracy for a wide range of speaker recognition tasks. The performance
of the speaker recognition system was evaluated experimentally on the publicly available TIMIT
database. For the 1000 voice samples of the TIMIT database, speaker recognition accuracies of 99.1%, 98.8%, and
98.6% were obtained for GMM, HMM, and LDA, respectively, for 2 sec of voice. The EER improvement of the GMM-based
ASR system compared with HMM and LDA is 4.25% and 8.51%, respectively.
The experimental results show that speaker recognition performance is at practically usable levels
for specific applications such as access control authentication. The main limiting factor in less controlled
situations is the lack of robustness to transmission impairments such as noise and microphone variability. Efforts
are underway to address these limitations, including understanding and modeling the impact of
impairments on spectral characteristics, applying more sophisticated channel compensation techniques, and
exploring features that are less sensitive to channel degradation.
REFERENCES
[1] S. Singh, “Forensic and Automatic Speaker Recognition System,” International Journal of Applied Engineering
Research, Vol. 8, No. 5, pp. 2804-2811, 2018.
[2] S. Singh and Ajeet Singh “Accuracy Comparison using Different Modeling Techniques under Limited Speech Data
of Speaker Recognition Systems,” Global Journal of Science Frontier Research: F Mathematics and Decision
Sciences, vol 16(2), pp.1-17, 2016.
[3] S. Singh. “Bayesian distance metric learning and its application in automatic speaker recognition systems”
International Journal of Electrical and Computer Engineering, Vol, 9, No. 4, 2019.
[4] S. Singh. “The Role of Speech Technology in Biometrics, Forensics and Man-Machine Interface” International
Journal of Electrical and Computer Engineering, Vol. 9, No. 1, pp.281-288, 2019.
[5] S. Singh. “High Level Speaker Specific Features as an Efficiency Enhancing Parameters in Speaker Recognition
System,” International Journal of Electrical and Computer Engineering, Vol, 9, No. 4, 2019.
[6] S. Singh, Abhay Kumar, David Raju Kolluri, “Efficient Modelling Technique based Speaker Recognition under
Limited Speech Data,” International Journal of Image, Graphics and Signal Processing(IJIGSP), Vol.8, No.11,
pp.41-48, 2016.
[7] Shriberg, E., & Stolcke, A., “Direct modeling of prosody: An overview of applications in automatic speech processing,”
In Speech Prosody, Nara, Japan, 2004.
[8] Mary, L., & Yegnanarayana, B, “Prosodic features for speaker verification,” In Proceedings of Interspeech,
Pittsburgh, Pennsylvania, pp. 917- 920, 2006.
[9] Ferrer, L., Shriberg, E., Kajarekar, S., & Sonmez, K, “Parameterization of prosodic feature distributions for SVM
modeling in speaker recognition,” In Proceedings of International Conference on Acoustics, Speech and Signal
Processing, Vol. 4, pp. 233-236, 2007.
[10] Han, K., Dong, Y., & Tashev, I, “Speech emotion recognition using deep neural network and extreme learning
machine,” In Proceedings of Interspeech, pp. 223-227, 2014.
[11] Wang, Z. Q., & Tashev, I, “Learning utterance-level representations for speech emotion and age/gender recognition
using deep neural networks,” In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2017.
[12] Vapnik V. “An Overview of Statistical Learning Theory,” IEEE Transaction on Neural Networks, Vol. 10, No. 5,
pp. 988-999,1999.
[13] S.Singh, “Support Vector Machine Based Approaches For Real Time Automatic Speaker Recognition System,”
International Journal of Applied Engineering Research, Vol. 13, No. 10, pp. 8561-8567, 2018.
[14] Scholkopf B, Smola A, “Learning with kernels: support vector machines, regularization, optimization, and beyond,”
Cambridge, MA: MIT Press; 2002
[15] Peskin, B., Navratil, J., Abramson, J., Jones, D., Klusacek, D., Reynolds, D., et al., “Using prosodic and
conversational features for high-performance speaker recognition,” Report from JHU WS’02, In Proceedings of
ICASSP, Hong Kong, China, Vol. 4, pp. 792-795, 2003.
[16] S.Singh, Mansour H. Assaf, Sunil R.Das, Emil M. Petriu, and Voicu Groza, “Short Duration Voice Data Speaker
Recognition System Using Novel Fuzzy Vector Quantization Algorithms,” 2016 IEEE International Instrumentation
and Measurement Technology Conference, May 23-26, Taipei, Taiwan, 2016.
[17] Najim, D., Dumouchel, P., & Kenny, P, “Modeling prosodic features with joint factor analysis for speaker
verification,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 7, pp. 2095-2103, 2007.
[18] S. Singh, “Speaker Recognition by Gaussian Filter Based Feature Extraction and Proposed Fuzzy Vector Quantization
Modeling Technique,” International Journal of Applied Engineering Research, Vol. 13, No. 16, pp. 12798-12804,
2018.
[19] S. Singh, “High Level Speaker Specific Features Modeling in Automatic Speaker Recognition System,” International
Journal of Electrical and Computer Engineering, Vol. 10, No. 2, 2018, pp. 2804-2811, 2020.
[20] S. Singh, “Speaker Recognition System for Limited Speech Data Using High-Level Speaker Specific Features and
Support Vector Machines,” International Journal of Applied Engineering Research, Vol. 12, No. 9, 2018,
pp. 8026-8033, 2017.
[21] S. Singh, M. H. Assaf, and Abhay Kumar, “A Novel Algorithm of Sparse Representations for Speech
Compression/Enhancement and Its Application in Speaker Recognition System,” International Journal of
Computational and Applied Mathematics, Vol. 11, No. 1, pp. 89-104, 2016.
[22] S. Singh, “Evaluation of Sparsification algorithm and Its Application in Speaker Recognition System,” International
Journal of Applied Engineering Research, Vol. 13, No. 17, pp. 13015-13021, 2018.
[23] S. Singh and Mansour H. Assaf, “A Perfect Balance of Sparsity and Acoustic hole in Speech Signal and Its Application
in Speaker Recognition System,” Middle-East Journal of Scientific Research, Vol. 24, No. 11, pp. 3527-3541, 2016.
[24] S. Singh and Dr. E.G. Rajan, “MFCC VQ based Speaker Recognition and Its Accuracy Affecting Factors,”
International Journal of Engineering Research & Technology, International Journal of Computer Applications, Vol.
21, No. 6, pp. 1-6, 2011.
[25] S. Singh and Dr. E.G. Rajan, “Application of Different Filters In Mel Frequency Cepstral Coefficients Feature
Extraction And Fuzzy Vector Quantization Approach In Speaker Recognition,” International Journal of Engineering
Research & Technology, Vol. 2, No. 6, pp. 3171-3182, 2013.

Corresponding Author:
Satyanand Singh,
School of Electrical and Electronics Engineering, Fiji National University, Fiji Island.
Email: satyanand.singh@fnu.ac.fj

1. INTRODUCTION
Most modeling techniques used in ASR applications make mathematical assumptions about the speaker-specific features. If the voice data do not satisfy these assumptions, imperfections are introduced at the ASR modeling stage: the mathematical model is forced to fit the features anyway, and the recognition scores are derived from these models and the test speech data. Modeling begins once the audio segments have been converted into feature parameters. In ASR, modeling is the process of characterizing each speaker on the basis of his or her features. The model should also provide a means of comparison with utterances from unknown speakers. ASR modeling is called robust when its characterization of the speaker-specific features is not significantly affected by unwanted distortions. Ideally, the features would be designed so that inter-speaker discrimination is maximal and no intra-speaker variation exists, in which case simple modeling methods would be sufficient. In short, the non-ideal properties of the speaker-specific feature extraction stage require compensation techniques during the ASR modeling phase, so that the effect of the nuisance variation present in the speech signal is reduced when the speaker recognition system is tested.

Most ASR modeling techniques make different mathematical assumptions about the speaker-specific features. If the speech data do not have the assumed properties, flaws are introduced already during the ASR modeling phase. Normalization of the speaker-specific features can reduce these problems to some extent, but not completely. As a result, the mathematical models are forced to fit the features, and the speaker recognition scores are obtained from these models and the test speech data. This process introduces artifacts into the detection scores, and a family of score normalization techniques has been proposed to compensate for this final-stage mismatch [1]. In essence, degradation of the acoustic signal affects the speaker-specific features, the models, and the scores. It is therefore important to improve the robustness of ASR systems in all three domains. It has been reported recently that speaker modeling techniques have improved and that score normalization techniques are no longer very effective [2].

Probabilistic modeling techniques such as the GMM and the HMM are widely used for speaker, language, emotion, and speech recognition. In a probabilistic model, each speaker, language, or emotion is modeled as a probability source with an unknown but fixed probability density function. The training phase estimates the parameters of this probability density function from a sufficient number of training samples. For recognition, the likelihood of the test utterance under each model is calculated. A GMM is a linear combination of multivariate Gaussian distributions that models $P(X \mid C)$. A GMM can be converted into a posterior classifier using Bayes' rule [3]. It has further advantages, such as the ability to train the model on a large amount of speech data and to adapt it to new data.
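To make the conversion to a posterior classifier concrete, the following is a sketch of Bayes' rule applied to the class-conditional likelihoods. The number of enrolled classes $K$, the class labels $C_1, \ldots, C_K$, and the priors $P(C_j)$ are symbols assumed here for illustration only.

```latex
P(C_j \mid X) = \frac{P(X \mid C_j)\, P(C_j)}{\sum_{l=1}^{K} P(X \mid C_l)\, P(C_l)},
\qquad
\hat{C} = \arg\max_{j} \; P(X \mid C_j)\, P(C_j)
```

With equal priors, the decision reduces to choosing the model that gives the test utterance the highest likelihood.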
When a model such as the GMM is used for ASR, a speaker-independent Universal Background Model (UBM) is first trained on speech data. The UBM represents the speaker-independent distribution of the feature vectors. When a new speaker is enrolled in the ASR system, the parameters of the background model are adapted to the feature distribution of the new speaker, and the adapted model is then used as that speaker's model.

Statistical language modeling (LM) is the science of building models that estimate the prior probability of word strings. Language models have been used successfully to model the prosody of speakers and languages: the fundamental frequency $F_0$ and energy contours are labeled as discrete classes and then modeled using bigrams or trigrams [4]. Hidden-event LMs contain special words that appear in the model's N-grams but not in the observed word sequence. Instead, they correspond to states of the HMM and can be used to model linguistic events such as unmarked sentence boundaries. Optionally, these events can be associated with non-lexical likelihoods in order to condition the LM on other knowledge sources (e.g., prosody). A special type of hidden-event LM can model disfluent speech by letting the hidden events modify the word history [5].

Decision trees have also been used successfully in prosodic modeling for ASR applications [6]. A decision tree proceeds by asking one question at a time. The feature tested in each question, and the threshold used in it (e.g., whether the normalized pitch exceeds a threshold value), are chosen to best separate the classes at that node of the tree. In the test phase, the decision tree estimates the posterior probability of each class $C$ for each sample $X$, that is, $P(C \mid X)$ [7]. One of the main drawbacks of decision trees is the greedy construction process: at each step the algorithm selects the single best variable and the best split point, whereas a multi-step look-ahead that considers combinations of variables might give a better result.
Another disadvantage is that continuous variables are implicitly discretized by the partitioning process, so information is lost along the way. The advantage of decision trees over other machine learning methods is that they are not black-box models and can easily be expressed as rules. In many applications this advantage outweighs the disadvantages, so these models are widely used in ASR.

Discriminative models such as Artificial Neural Networks (ANN) [8] and Support Vector Machines (SVM) are also used for prosodic modeling [9]. Deep Neural Networks (DNN) [10], the Extreme Learning Machine (ELM), and DNN-ELM have proved useful for prosody-based speaker recognition [11]. The SVM is an algorithmic implementation of ideas from statistical learning theory [12], which addresses the problem of constructing consistent estimators from data: how can the performance of a model on an unseen data set be estimated when only the characteristics of the model and its performance on a training set are given? Algorithmically, the support vector machine establishes an optimal separating boundary between data sets by solving a constrained quadratic optimization problem [13]. By using different kernel functions, different degrees of nonlinearity and flexibility can be included in the model. Because support vector machines are derived from advanced statistical ideas, and bounds on their generalization error can be calculated, they have attracted considerable research interest over the past few years. Performance equal to or better than that of other machine learning algorithms has been reported in the medical literature. A disadvantage of the support vector machine is that the classification result is purely dichotomous and no probability of class membership is given [14].
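The adaptation of the UBM to a newly enrolled speaker, described earlier in this section, can be sketched as follows. The text does not say which adaptation rule the authors use; this sketch assumes mean-only relevance MAP adaptation of a diagonal-covariance GMM, which is one common choice. The function name `map_adapt_means`, the relevance factor of 16, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def map_adapt_means(frames, weights, means, variances, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance UBM to one speaker."""
    frames = np.asarray(frames, dtype=float)
    # log of w_k * N(x_t | mu_k, diag(var_k)) for every frame t and component k
    diff = frames[:, None, :] - means[None, :, :]
    log_prob = (np.log(weights)
                - 0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
                - 0.5 * np.sum(diff**2 / variances, axis=2))
    # posterior responsibility of each component for each frame
    post = np.exp(log_prob - np.logaddexp.reduce(log_prob, axis=1, keepdims=True))
    n_k = post.sum(axis=0)                                    # soft frame counts
    first_moment = post.T @ frames / np.maximum(n_k, 1e-10)[:, None]
    alpha = (n_k / (n_k + relevance))[:, None]                # data-dependent mixing
    return alpha * first_moment + (1.0 - alpha) * means


# Hypothetical two-component UBM and enrollment frames for one speaker.
ubm_weights = np.array([0.5, 0.5])
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_vars = np.ones((2, 2))
rng = np.random.default_rng(0)
speaker_frames = rng.normal(loc=[1.0, 1.0], scale=0.5, size=(200, 2))

speaker_means = map_adapt_means(speaker_frames, ubm_weights, ubm_means, ubm_vars)
print(speaker_means)
```

Components that see many enrollment frames move toward the speaker's data, while components that see almost none stay close to the UBM means.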
2. MODELING BASED ON PROSODY IN AUTOMATIC SPEAKER RECOGNITION SYSTEM
Prosodic modeling typically starts from global statistics of the speaker's fundamental frequency ($F_0$) values, which the ASR system uses for the recognition task. The dynamics of the $F_0$ contour, which reflect a person's speaking style, have been shown to help the speaker recognition task. The $F_0$ movement of the speaker is modeled by fitting a piecewise linear model to the $F_0$ track to obtain a stylized $F_0$ contour. Each linear $F_0$ segment is represented by its median $F_0$, slope, and duration. These features are modeled by a log-normal distribution, a normal distribution, and a shifted exponential distribution, respectively.

To investigate speaker recognition using prosody and idiolect, NIST introduced the extended data task, based on conversational telephone speech from the Switchboard corpus. Unlike traditional speaker recognition tasks, the extended data task provides multiple complete conversation sides (4/8/16 sides) for training and testing the ASR system. The work in [15] investigates a variety of prosodic features based on fundamental frequency, segment duration, and pause duration. Duration has been modeled using word-level features, phone durations, and duration sequences. In [16], duration, pitch, and energy features are calculated for each estimated syllable region, with the syllable boundaries obtained from the ASR system. These features are quantized and used to form N-grams, referred to as N-gram based syllable non-uniform extraction region features. In [17], continuous prosodic features were modeled using Joint Factor Analysis (JFA) for speaker recognition. The prosodic features used are the pitch and energy contours over syllable-like units, represented in a basis of Legendre polynomials. A standard GMM is used for modeling.
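As a rough illustration of this contour representation, the sketch below fits a low-order Legendre expansion to the pitch and energy contours of one syllable-like unit and appends the segment length. The function name `legendre_contour_features`, the polynomial degree of 5, and the synthetic contours are assumptions made for illustration, not details taken from [17].

```python
import numpy as np
from numpy.polynomial import legendre


def legendre_contour_features(contour, degree=5):
    """Least-squares Legendre coefficients of a 1-D contour (e.g. pitch).

    The frame index is rescaled to [-1, 1], the interval on which the
    Legendre polynomials are orthogonal.
    """
    contour = np.asarray(contour, dtype=float)
    t = np.linspace(-1.0, 1.0, contour.size)
    return legendre.legfit(t, contour, degree)


# Hypothetical 30-frame syllable: a falling pitch contour and an energy contour.
frames = np.linspace(-1.0, 1.0, 30)
pitch = 120.0 - 15.0 * frames + 4.0 * frames**2
energy = 60.0 - 5.0 * frames**2

features = np.concatenate([
    legendre_contour_features(pitch),    # 6 pitch coefficients
    legendre_contour_features(energy),   # 6 energy coefficients
    [pitch.size],                        # segment length in frames
])
print(features.shape)  # (13,)
```

With degree 5 for both contours, the six pitch coefficients, six energy coefficients, and the segment length give 13 values per unit.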
In addition, the effects of speaker and session variability are modeled in the same way as in conventional JFA. The Legendre polynomial coefficients of pitch and energy, together with the length of the segment, constitute a 13-dimensional prosodic feature set for GMM and factor analysis modeling [17].

2.1. Eigenvoice consideration in hidden Markov models
In the standard eigenvoice approach, voice data are collected from a number of speakers in diverse scenarios. With each HMM state modeled as a mixture of Gaussian distributions, a set of speaker-dependent HMMs is formed for each speaker. The speaker's voice is represented by a supervector formed by concatenating the mean vectors of all the Gaussian distributions of the HMMs. The supervector of the $i$-th speaker is therefore composed of $R$ components, one per Gaussian distribution, and is expressed as $x_i = [x_{i1}', x_{i2}', \ldots, x_{iR}']' \in \mathbb{R}^{d_2}$. The similarity between any two speaker supervectors $x_i$ and $x_j$ is measured by their dot product:

$$x_i' x_j = \sum_{r=1}^{R} x_{ir}' x_{jr} \tag{1}$$

Principal component analysis (PCA) is then performed on the training speaker supervectors, and the resulting eigenvectors are referred to as eigenvoices. To adapt to a new speaker, his or her supervector is expressed as a linear combination of the top $M$ eigenvoices, $s = s^{(ev)} = \sum_{m=1}^{M} w_m v_m$, where $w = [w_1, w_2, \ldots, w_M]'$ is the eigenvoice weight vector and $v_m$ is the $m$-th eigenvoice. Usually fewer than ten eigenvoices are used, so that only a few seconds of adaptation speech are required. The eighteen values computed for the eigenvoices are: 0.180696, 0.168936, 0.082378, 0.065117, 0.058677, 0.027971, 0.020124, 0.017375, 0.016086, 0.008081, 0.007063, 0.004332, 0.003474, 0.003072, 0.002031, 0.001976, 0.00112, and 0.001062. The adaptation data $o_t$, $t = 1, \ldots, T$, are used to estimate the eigenvoice weights by maximizing the likelihood of $o_t$.
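The eigenvoice construction described above can be sketched in a few lines. The sketch assumes a matrix of already-built speaker supervectors (random numbers here, purely as a stand-in for real data) and obtains the eigenvoices from an SVD of the centered supervectors. For brevity the weights of the new speaker are found by orthogonal projection, which is an assumption of this sketch and not the maximum-likelihood estimate used in the text. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training set: N speakers, each with a d2-dimensional supervector
# (the concatenated Gaussian mean vectors of that speaker's HMMs).
N, d2, M = 20, 60, 5
supervectors = rng.normal(size=(N, d2))

# Linear PCA: center the supervectors and take the right singular vectors.
mean_sv = supervectors.mean(axis=0)
centered = supervectors - mean_sv
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

eigenvoices = vt[:M]                               # rows are v_1 ... v_M
explained = singular_values**2 / np.sum(singular_values**2)

# New speaker: s = sum_m w_m v_m, with w obtained here by projecting the
# centered supervector onto the eigenvoices (a simplification, see lead-in).
new_sv = rng.normal(size=d2)
w = eigenvoices @ (new_sv - mean_sv)
s = w @ eigenvoices

print(explained[:M].round(3), w.shape, s.shape)
```

The normalized squared singular values play the role of per-eigenvoice variance shares, which is one way to decide how many eigenvoices to keep.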
Mathematically, the weight vector $w$ is found by maximizing the $Q$ function:

$$Q(w) = \sum_{r=1}^{R} \gamma_1(r)\log(\pi_r) + \sum_{p,r=1}^{R}\sum_{t=1}^{T-1} \xi_t(p,r)\log(a_{pr}) + \sum_{r=1}^{R}\sum_{t=1}^{T} \gamma_t(r)\log\big(b_r(o_t, w)\big) \tag{2}$$

Here $\pi_r$ is the initial probability of state $r$, and $\gamma_t(r)$ is the posterior probability of the observation sequence being in state $r$ at time $t$. $\xi_t(p, r)$ is the posterior probability of the observation sequence being in state $p$ at time $t$ and in state $r$ at time $t+1$, $a_{pr}$ is the transition probability from state $p$ to state $r$, and $b_r$ is the Gaussian probability density function of the $r$-th state. Furthermore, $Q_b(w) = \sum_{r=1}^{R}\sum_{t=1}^{T} \gamma_t(r)\log\big(b_r(o_t, w)\big)$ is related to the new speaker supervector $s$ as follows:

$$Q_b(w) = -\frac{1}{2}\sum_{r=1}^{R}\sum_{t=1}^{T} \gamma_t(r)\Big[d_1\log(2\pi) + \log|C_r| + \|o_t - s_r(w)\|^2_{C_r}\Big] \tag{3}$$

where $C_r$ is the covariance matrix of the Gaussian at state $r$.

The estimation of eigenvoices is generalized here by performing kernel PCA in place of linear PCA. Let $k(\cdot,\cdot)$ be a kernel with a corresponding mapping $\varphi$, which maps a pattern $x$ in the speaker supervector space $\chi$ to $\varphi(x)$ in the speaker-specific feature space $\mathcal{F}$. Given a set of $N$ speaker supervectors $(x_1, x_2, \ldots, x_{N-1}, x_N)$, denote the mean of the $\varphi$-mapped feature vectors by $\bar{\varphi} = \frac{1}{N}\sum_{i=1}^{N}\varphi(x_i)$ and the centered map by $\tilde{\varphi}(x) = \varphi(x) - \bar{\varphi}$. Eigendecomposition is then performed on $\tilde{K}$, the centered version of the kernel matrix $K = [k(x_i, x_j)]_{i,j}$. Writing $\tilde{K} = U \Lambda U'$, where $U = [\alpha_1, \ldots, \alpha_{N-1}, \alpha_N]$ with $\alpha_i = [\alpha_{i1}, \ldots, \alpha_{i(N-1)}, \alpha_{iN}]'$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_{N-1}, \lambda_N)$, the $m$-th orthogonal eigenvector of the $N \times N$ covariance matrix in the feature space is

$$v_m = \sum_{i=1}^{N} \frac{\alpha_{mi}}{\sqrt{\lambda_m}}\, \tilde{\varphi}(x_i)$$

A computer generated $8 \times 8$ orthogonal eigenvector $v_m$ is given in Table 1. A two-dimensional representation of utterances from the TIMIT database, evaluated using KPCA with a linear solution and a non-linear SVM, is shown in Figure 1.

Table 1. A computer generated 8×8 orthogonal eigenvector $v_m$
       C1       C2       C3       C4       C5       C6       C7       C8
R1  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R2  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R3  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R4  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R5  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R6  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R7  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000
R8  -1.0000  -0.8571  -0.7143  -0.5714  -0.4286  -0.2857  -0.1429   0.0000

Figure 1. Two-dimensional representation of utterances from the TIMIT database, evaluated using KPCA with a linear solution and a non-linear SVM

2.2. Gaussian mixture model (GMM) based high level feature modeling
The GMM has become the dominant generative statistical model in state-of-the-art ASR systems. It is an attractive statistical model because it can represent a wide variety of probability density functions, provided that a sufficient number of parameters is estimated. In general, a GMM consists of a set of $N$ multivariate Gaussian density functions indexed by $k$. The resulting probability density function for a particular speaker model $i$ is a convex combination of all the density functions.
A GMM is built from standard multivariate Gaussian densities, but introduces the component index $k$ as a latent variable with discrete probability $p(k \mid i)$. The weights $w_k^i = p(k \mid i)$ characterize the prior contribution of the corresponding component to the GMM density function and satisfy the condition $\sum_{k=1}^{N} w_k^i = 1$. Each Gaussian density represents a conditional density function $p(x_t \mid k, i)$. According to Bayes' theorem, the joint probability density function $p(x_t, k \mid i)$ is given by the product of the two. Summing over all densities gives the multi-modal probability density of the GMM:

$$p(x_t \mid \Theta_i) = \sum_{k=1}^{N} p(k \mid \Theta_i)\, p(x_t \mid k, \Theta_i) = \sum_{k=1}^{N} w_k^i\, \mathcal{N}\{x_t \mid \mu_k^i, \Sigma_k^i\} \tag{4}$$

where $\mu_k$ is the mean vector and $\Sigma_k$ is the covariance matrix. Each component density is completely determined by $\mu_k$ and $\Sigma_k$. The parameter set $\Theta_i = \{w_1^i, w_2^i, \ldots, w_N^i, \mu_1^i, \mu_2^i, \ldots, \mu_N^i, \Sigma_1^i, \Sigma_2^i, \ldots, \Sigma_N^i\}$ contains the weighting factors, mean vectors, and covariance matrices of the specific speaker model $i$.
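The mixture density in (4) can be evaluated directly. The sketch below computes $\log p(x_t \mid \Theta_i)$ for a batch of feature vectors with full covariance matrices, working in the log domain for numerical stability. The function name `gmm_log_likelihood` and the two-component parameter values are hypothetical and chosen only to make the example runnable.

```python
import numpy as np


def gmm_log_likelihood(x, weights, means, covariances):
    """log p(x_t | Theta_i) for each row x_t of x, following the mixture in (4)."""
    x = np.atleast_2d(x)
    d = x.shape[1]
    log_terms = []
    for w_k, mu_k, sigma_k in zip(weights, means, covariances):
        diff = x - mu_k
        _, logdet = np.linalg.slogdet(sigma_k)
        # squared Mahalanobis distance of every frame to component k
        maha = np.einsum("td,td->t", diff @ np.linalg.inv(sigma_k), diff)
        log_terms.append(np.log(w_k) - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
    # log-sum-exp over the N components
    return np.logaddexp.reduce(np.stack(log_terms), axis=0)


# Hypothetical two-component, two-dimensional speaker model.
weights = np.array([0.4, 0.6])                       # must sum to 1
means = np.array([[0.0, 0.0], [3.0, 3.0]])
covariances = np.array([np.eye(2), 2.0 * np.eye(2)])
frames = np.array([[0.1, -0.2], [2.9, 3.3]])

scores = gmm_log_likelihood(frames, weights, means, covariances)
print(scores.mean())   # average per-frame log-likelihood of the utterance
```

Averaging the per-frame log-likelihoods gives a single utterance-level score that can be compared across speaker models.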
Figure 2 illustrates the likelihood function of a GMM comprising seven Gaussian distributions with their covariance matrices. Two-dimensional mean and feature vectors are chosen; $x_1$ and $x_2$ denote the elements of the feature vector. The computer generated log-likelihood values for the completed training of the speaker 1 model are: -6.067379, -4.288333, -4.253459, -4.241043, -4.230592, -4.218451, -4.203952, -4.188224, -4.173566, -4.161955, -4.153866, -4.148612, -4.145268, -4.143124, -4.141712, -4.140738. Computer generated $8 \times 8$ training feature vectors of a speaker modeled by Gaussian mixture models are given in Table 2, and Table 3 gives the testing feature vectors of the same speaker with different text.

Figure 2. A likelihood function for a GMM with seven Gaussian densities (axes: $x_1$, $x_2$, likelihood)

Table 2. A computer generated 8×8 training feature vectors of a speaker by Gaussian mixture models
      C1      C2      C3      C4      C5      C6      C7      C8
R1  4.0646  2.7960  3.3696  2.5665  1.4115  1.4582  1.3393  0.7637
R2  4.8317  3.5756  3.3678  2.8608  0.9304  0.8075  0.9295  1.1848
R3  3.7562  3.4273  3.8380  2.7522  1.3471  0.9934  1.4731  1.6576
R4  5.0021  3.3969  3.4032  2.2354  0.4914  0.8931  2.0563  1.4244
R5  4.1528  3.3462  3.8148  3.4006  1.8268  1.0450  1.5436  1.1512
R6  3.8352  3.1605  4.3616  2.8652  1.7510  1.0464  1.6336  1.3007
R7  4.1610  3.3430  4.4114  1.7857  1.1003  1.5388  1.3885  1.6549
R8  3.5921  3.7265  4.1634  2.5118  1.8623  1.5231  1.5569  1.4148

Table 3. 8×8 testing feature vectors of a speaker by Gaussian mixture models
      C1      C2      C3      C4      C5      C6      C7      C8
R1  3.2927  2.0086  4.7630  3.1760  1.4675  0.9331  1.7318  1.3194
R2  3.6418  2.6172  5.1925  2.5124  0.5417  1.2929  1.9916  0.9756
R3  2.9897  1.6382  5.2565  4.0006  1.3647  1.8824  1.9576  1.0245
R4  3.4203  2.3760  4.4596  2.5434  1.0803  1.4107  1.8440  1.3208
R5  3.4864  2.9604  3.9410  3.2120  1.5138  1.5098  2.2160  1.2051
R6  4.0004  2.2980  4.2781  3.0504  1.8364  1.0121  1.2600  1.1491
R7  3.0806  2.0417  4.0331  3.6395  1.9743  1.8195  1.3774  1.0800
R8  2.9109  2.3116  4.6019  3.5167  2.3270  1.1858  2.6674  1.3994

2.3. Linear discriminant analysis (LDA) based high level feature modeling
LDA is a commonly employed technique in statistical pattern recognition that aims at finding linear combinations of feature coefficients to facilitate discrimination of multiple classes. It finds orthogonal directions in the feature space that are most effective for class discrimination. Projecting the original features onto these directions improves the classification accuracy. Let $D$ denote the set of all development utterances, $w_{s,i}$ the utterance feature obtained from the $i$-th utterance of speaker $s$, $n_s$ the total number of utterances belonging to speaker $s$, and $S$ the total number of speakers in $D$. The between-class covariance matrix $S_b$ and the within-class covariance matrix $S_w$ are given by
$$S_b = \frac{1}{S} \sum_{s=1}^{S} (\bar{w}_s - \bar{w})(\bar{w}_s - \bar{w})^T \tag{5}$$

$$S_w = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} (w_{s,i} - \bar{w}_s)(w_{s,i} - \bar{w}_s)^T \tag{6}$$

where the speaker-dependent mean vector is $\bar{w}_s = \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$ and the speaker-independent mean vector is $\bar{w} = \frac{1}{S} \sum_{s=1}^{S} \frac{1}{n_s} \sum_{i=1}^{n_s} w_{s,i}$. The LDA optimization therefore maximizes the between-class variance while minimizing the within-class variance. The solution is obtained by solving the generalized eigenvalue problem

$$S_b V = S_w V \Lambda \tag{7}$$

where the columns of $V$ are the eigenvectors and $\Lambda$ is the diagonal matrix containing the eigenvalues. If the matrix $S_w$ in eqn. (6) is invertible, the solution is readily found from the eigendecomposition of $S_w^{-1} S_b$. The LDA projection matrix $A_{LDA}$ of dimension $R \times k$ is

$$A_{LDA} = [v_1 \; \ldots \; v_k] \tag{8}$$

where $v_1, \ldots, v_k$ are $k$ eigenvectors obtained by solving eqn. (7). The LDA transformation of the utterance feature $w$ is then

$$\Phi_{LDA}(w) = A_{LDA}^T w \tag{9}$$

A computer-generated 8×8 $\Phi_{LDA}(w)$ matrix of dimension $R \times k$ obtained with the LDA model is shown in Table 4.

Table 4. A computer-generated 8×8 $\Phi_{LDA}(w)$ matrix of dimension $R \times k$
      C1       C2       C3       C4       C5       C6       C7       C8
R1    -0.5302  -0.6328  -0.6402  -0.5861  -0.5306  -0.5137  -0.5403  -0.5678
R2    -0.6601  -0.7932  -0.8189  -0.7774  -0.7347  -0.7332  -0.7773  -0.8138
R3    -0.6949  -0.8420   0.8846  -0.8622  -0.8389  -0.8565  -0.9219  -0.9783
R4    -0.6594  -0.8031  -0.8484  -0.8308  -0.8124  -0.8399  -0.9289  -1.0271
R5    -0.6314  -0.7653  -0.7968  -0.7584  -0.7169  -0.7325  -0.8374  -0.9885
R6    -0.6698  -0.8029  -0.8170  -0.7446  -0.6615  -0.6450  -0.7462  -0.9332
R7    -0.7548  -0.8985  -0.9072  -0.8157  -0.7044  -0.6588  -0.7423  -0.9333
R8    -0.7876  -0.9328  -0.9467  -0.8688  -0.7722  -0.7314  -0.8065  -0.9806

LDA assumes normally distributed data, statistically independent features, and identical covariance matrices for all classes. However, these assumptions apply only when LDA is used as a classifier. When LDA is used for dimensionality reduction, it can still work reasonably well even if these assumptions are violated.
Even for classification tasks, LDA appears to be robust enough to the distribution of the data to be used in ASR applications. The histograms of the speaker features modeled with the eigenvectors obtained from LDA, together with a normal fit, are illustrated in Figure 3.

Figure 3. The speaker feature modeling histograms with normal fit eigenvector with LDA (x-axis: length of feature vectors; y-axis: counts)
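The LDA steps in eqns. (5) to (9) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `lda_projection`, the synthetic data, and the choice of keeping the eigenvectors with the largest eigenvalues (the usual convention, which the text does not state explicitly) are all assumptions, and $S_w$ is assumed to be invertible.

```python
import numpy as np

def lda_projection(features, labels, k):
    """Hypothetical helper: build A_LDA (R x k) and project the features."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    speakers = np.unique(labels)
    S = len(speakers)
    R = features.shape[1]

    # Speaker-dependent means (one row per speaker) and the
    # speaker-independent mean, defined as the mean of the speaker means.
    means = np.array([features[labels == s].mean(axis=0) for s in speakers])
    w_bar = means.mean(axis=0)

    # Eqn. (5): between-class covariance.
    diff = means - w_bar
    S_b = diff.T @ diff / S

    # Eqn. (6): within-class covariance, averaged per speaker, then over speakers.
    S_w = np.zeros((R, R))
    for s, mean_s in zip(speakers, means):
        d = features[labels == s] - mean_s
        S_w += d.T @ d / len(d)
    S_w /= S

    # Eqn. (7): eigendecomposition of S_w^{-1} S_b (assumes S_w is invertible).
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))

    # Eqn. (8): keep k eigenvectors; taking the largest eigenvalues is an assumption.
    order = np.argsort(eigvals.real)[::-1][:k]
    A_lda = eigvecs[:, order].real

    # Eqn. (9): Phi(w) = A_LDA^T w, applied to every row of `features`.
    return A_lda, features @ A_lda

# Synthetic example: 3 speakers, 20 utterances each, 4-dimensional features.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
features = rng.normal(size=(60, 4)) + 2.0 * labels[:, None]
A_lda, projected = lda_projection(features, labels, k=2)
print(A_lda.shape, projected.shape)
```

With $S$ speakers the matrix $S_b$ has rank at most $S - 1$, so in this sketch at most two directions carry discriminative information for three speakers.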
3. ACOUSTIC DATA FEATURE EXTRACTION
Short-term speaker-specific features are parameters extracted from speech segments (frames) of 20-25 ms. The most common short-term acoustic features are Mel Frequency Cepstrum Coefficients (MFCC) and Linear Predictive Coding (LPC) based features [18, 19, 20]. To obtain these coefficients from a speech recording, the speech samples are first divided into short overlapping segments. The signal in each segment (frame) is then multiplied by a window function (e.g. Hamming or Hanning), and its Fourier power spectrum is computed. In the next step, the logarithm of the spectrum is calculated and a filter bank analysis with non-linearly (mel) spaced filters is performed. The logarithm compresses the dynamic range of the coefficients and turns multiplicative components into additive ones [21]. In the filter bank analysis, a spectral energy (also called a filter bank energy coefficient) is computed for each channel, each channel representing a different frequency band. The filter banks, like the human auditory system, are designed to be more sensitive to frequency changes at the lower end of the spectrum. Finally, the MFCCs are obtained by applying a discrete cosine transform (DCT) to the filter bank energies and retaining a number of the leading coefficients [22, 23]. The DCT has two important properties: (i) it compacts the energy of the signal into a few coefficients, and (ii) its output coefficients are highly decorrelated. For these reasons, discarding some dimensions after the DCT improves the efficiency of the model and suppresses some harmful components [24]. Furthermore, the decorrelating property of the DCT supports the modeling assumption that the feature coefficients are uncorrelated.
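The MFCC pipeline described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the 16 kHz sampling rate, the 25 ms frame with 10 ms hop, the 512-point FFT, the 26 filters, and the 13 retained coefficients are all assumed values, and the sketch applies the mel filter bank before the logarithm, which is the common ordering.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced uniformly on the mel scale (denser at low Hz)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / (center - left)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the last axis, keeping the first n_out coefficients."""
    n = x.shape[-1]
    k = np.arange(n_out)[:, None]
    basis = np.cos(np.pi * k * (2 * np.arange(n) + 1) / (2 * n))
    return x @ basis.T

def mfcc(signal, sample_rate=16000, frame_ms=25, hop_ms=10,
         n_fft=512, n_filters=26, n_ceps=13):
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // hop

    # 1. Split into short overlapping frames and apply a Hamming window.
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)

    # 2. Fourier power spectrum of each frame (zero-padded to n_fft).
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft

    # 3. Mel filter bank energies, then logarithm (small offset avoids log(0)).
    energies = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_energies = np.log(energies + 1e-10)

    # 4. DCT and retention of the leading coefficients.
    return dct_ii(log_energies, n_ceps)

# One second of a synthetic 440 Hz tone stands in for a speech recording.
t = np.arange(16000) / 16000.0
coeffs = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(coeffs.shape)
```

Each row of `coeffs` is the cepstral feature vector of one frame; a real system would run this on recorded speech rather than a synthetic tone.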
In summary, the sequence of operations (power spectrum, logarithm, DCT) produces the well-known cepstral representation of the signal [25].

4. EXPERIMENTAL SETUP
The experiments use the TIMIT speech database. The proposed algorithm was implemented in MATLAB, and the results were compared with those of the eigenvoice approach for HMM, GMM, and LDA. A total of 1000 utterances of the TIMIT database, with 6 sec, 4 sec, and 2 sec of voice, were used to train and test the ASR system. For these cases, the ASR recognition efficiency was calculated as

$$\text{Efficiency} = \frac{\text{Number of utterances correctly identified}}{\text{Total number of utterances under test}}$$

Table 5 shows the efficiency of the ASR system for HMM, GMM, and LDA. It can be observed from this table that GMM has the highest efficiency of the three modeling techniques. Figure 4 shows the equal error rate (EER) of the HMM, GMM, and LDA based modeling techniques. The ASR efficiencies of the HMM, GMM, and LDA based modeling techniques are 98.8%, 99.1%, and 98.6%, and the EERs are 4.5%, 4.4%, and 4.55%, respectively. The EER improvement of the GMM based ASR system compared with HMM and LDA is 4.25% and 8.51%, respectively.

Figure 4. Equal Error Rate of ASR system of HMM, GMM and LDA based modeling technique for 2 sec of voice data (x-axis: false alarm probability in %; y-axis: miss probability in %; curves: HMM, GMM, LDA)
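The efficiency measure and the relative EER improvement can be checked with a short calculation. This is a sketch under an assumption: the text does not state how the improvement figures were derived, and the helper names below are hypothetical. One reading that approximately reproduces the reported 4.25% and 8.51% takes the EER values listed in Table 5 (4.9%, 4.7%, and 5.1% for HMM, GMM, and LDA) and expresses the difference relative to the GMM EER.

```python
def efficiency(n_correct, n_total):
    """Recognition efficiency in percent: correctly identified / total under test."""
    return 100.0 * n_correct / n_total

def relative_eer_improvement(eer_other, eer_ref):
    """Percent by which eer_other exceeds eer_ref, relative to eer_ref (assumed definition)."""
    return 100.0 * (eer_other - eer_ref) / eer_ref

# EER values in percent as listed in Table 5.
eer = {"HMM": 4.9, "GMM": 4.7, "LDA": 5.1}

for name in ("HMM", "LDA"):
    gain = relative_eer_improvement(eer[name], eer["GMM"])
    print(f"GMM vs {name}: {gain:.2f}%")

# 991 correct identifications out of 1000 test utterances give 99.1% efficiency.
print(f"{efficiency(991, 1000):.1f}%")
```

Under this reading the two ratios are 0.2/4.7 and 0.4/4.7, that is, about 4.26% and 8.51%.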
Table 5. Efficiency and EER of the ASR system for HMM, GMM and LDA
         HMM                       GMM                       LDA
         Efficiency (%)  EER (%)   Efficiency (%)  EER (%)   Efficiency (%)  EER (%)
6 sec    99.6            4.9       99.9            4.7       99.1            5.1
4 sec    98.8            4.9       99.5            4.7       98.2            5.1
2 sec    98.8            4.9       99.1            4.7       98.6            5.1

5. CONCLUSION
This paper presented the research, development, and evaluation of an ASR system based on HMM, GMM, and LDA modeling techniques. GMM models provide a simple but effective representation that is computationally inexpensive and offers high recognition accuracy for a wide range of speaker recognition tasks. The performance of the speaker recognition system was evaluated experimentally on the publicly available TIMIT database. For the 1000 voice samples of the TIMIT database, speaker recognition accuracies of 99.1%, 98.8%, and 98.6% were obtained for GMM, HMM, and LDA, respectively, with 2 sec of voice data. The EER improvement of the GMM based ASR system compared with HMM and LDA is 4.25% and 8.51%, respectively. The experimental results show that speaker recognition performance is at practically usable levels for specific applications such as access control authentication. The main limiting factor in less controlled situations is the lack of robustness to transmission impairments such as noise and microphone variability. Efforts are underway to address these limitations, including understanding and modeling the impact of impairments on spectral characteristics, applying more sophisticated channel compensation techniques, and exploring features that are less sensitive to channel degradation.

REFERENCES
[1] S. Singh, “Forensic and Automatic Speaker Recognition System,” International Journal of Applied Engineering Research, Vol. 8, No. 5, pp. 2804-2811, 2018.
[2] S. Singh and Ajeet Singh, “Accuracy Comparison using Different Modeling Techniques under Limited Speech Data of Speaker Recognition Systems,” Global Journal of Science Frontier Research: F Mathematics and Decision Sciences, Vol. 16, No. 2, pp. 1-17, 2016.
[3] S. Singh, “Bayesian distance metric learning and its application in automatic speaker recognition systems,” International Journal of Electrical and Computer Engineering, Vol. 9, No. 4, 2019.
[4] S. Singh, “The Role of Speech Technology in Biometrics, Forensics and Man-Machine Interface,” International Journal of Electrical and Computer Engineering, Vol. 9, No. 1, pp. 281-288, 2019.
[5] S. Singh, “High Level Speaker Specific Features as an Efficiency Enhancing Parameters in Speaker Recognition System,” International Journal of Electrical and Computer Engineering, Vol. 9, No. 4, 2019.
[6] S. Singh, Abhay Kumar, David Raju Kolluri, “Efficient Modelling Technique based Speaker Recognition under Limited Speech Data,” International Journal of Image, Graphics and Signal Processing (IJIGSP), Vol. 8, No. 11, pp. 41-48, 2016.
[7] Shriberg, E., & Stolcke, A., “Direct modeling of prosody: An overview of applications in automatic speech processing,” In Speech Prosody, Nara, Japan, 2004.
[8] Mary, L., & Yegnanarayana, B., “Prosodic features for speaker verification,” In Proceedings of Interspeech, Pittsburgh, Pennsylvania, pp. 917-920, 2006.
[9] Ferrer, L., Shriberg, E., Kajarekar, S., & Sonmez, K., “Parameterization of prosodic feature distributions for SVM modeling in speaker recognition,” In Proceedings of International Conference on Acoustics, Speech and Signal Processing, Vol. 4, pp. 233-236, 2007.
[10] Han, K., Dong, Y., & Tashev, I., “Speech emotion recognition using deep neural network and extreme learning machine,” In Proceedings of Interspeech, pp. 223-227, 2014.
[11] Wang, Z. Q., & Tashev, I., “Learning utterance-level representations for speech emotion and age/gender recognition using deep neural networks,” In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[12] Vapnik, V., “An Overview of Statistical Learning Theory,” IEEE Transaction on Neural Networks, Vol. 10, No. 5, pp. 988-999, 1999.
[13] S. Singh, “Support Vector Machine Based Approaches For Real Time Automatic Speaker Recognition System,” International Journal of Applied Engineering Research, Vol. 13, No. 10, pp. 8561-8567, 2018.
[14] Scholkopf, B., Smola, A., “Learning with kernels: support vector machines, regularization, optimization, and beyond,” Cambridge, MA: MIT Press, 2002.
[15] Peskin, B., Navratil, J., Abramson, J., Jones, D., Klusacek, D., Reynolds, D., et al., “Using prosodic and conversational features for high-performance speaker recognition,” Report from JHU WS’02, In Proceedings of ICASSP, Hong Kong, China, Vol. 4, pp. 792-795, 2003.
[16] S. Singh, Mansour H. Assaf, Sunil R. Das, Emil M. Petriu, and Voicu Groza, “Short Duration Voice Data Speaker Recognition System Using Novel Fuzzy Vector Quantization Algorithms,” 2016 IEEE International Instrumentation and Measurement Technology Conference, May 23-26, Taipei, Taiwan, 2016.
[17] Najim, D., Dumouchel, P., & Kenny, P., “Modeling prosodic features with joint factor analysis for speaker verification,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 7, pp. 2095-2103, 2007.
[18] S. Singh, “Speaker Recognition by Gaussian Filter Based Feature Extraction and Proposed Fuzzy Vector Quantization Modeling Technique,” International Journal of Applied Engineering Research, Vol. 13, No. 16, pp. 12798-12804, 2018.
[19] S. Singh, “High Level Speaker Specific Features Modeling in Automatic Speaker Recognition System,” International Journal of Electrical and Computer Engineering, Vol. 10, No. 2, 2018, pp. 2804-2811, 2020.
[20] S. Singh, “Speaker Recognition System for Limited Speech Data Using High-Level Speaker Specific Features and Support Vector Machines,” International Journal of Applied Engineering Research, Vol. 12, No. 9, 2018, pp. 8026-8033, 2017.
[21] S. Singh, MH Assaf and Abhay Kumar, “A Novel Algorithm of Sparse Representations for Speech Compression/Enhancement and Its Application in Speaker Recognition System,” International Journal of Computational and Applied Mathematics, Vol. 11, No. 1, pp. 89-104, 2016.
[22] S. Singh, “Evaluation of Sparsification algorithm and Its Application in Speaker Recognition System,” International Journal of Applied Engineering Research, Vol. 13, No. 17, pp. 13015-13021, 2018.
[23] S. Singh and Mansour H. Assaf, “A Perfect Balance of Sparsity and Acoustic hole in Speech Signal and Its Application in Speaker Recognition System,” Middle-East Journal of Scientific Research, Vol. 24, No. 11, pp. 3527-3541, 2016.
[24] S. Singh and Dr. E.G. Rajan, “MFCC VQ based Speaker Recognition and Its Accuracy Affecting Factors,” International Journal of Engineering Research & Technology, International Journal of Computer Applications, Vol. 21, No. 6, pp. 1-6, 2011.
[25] S. Singh and Dr. E.G. Rajan, “Application of Different Filters In Mel Frequency Cepstral Coefficients Feature Extraction And Fuzzy Vector Quantization Approach In Speaker Recognition,” International Journal of Engineering Research & Technology, Vol. 2, Issue 6, pp. 3171-3182, 2013.