A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model

ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011

Feng-Long Huang
Computer Science and Information Engineering, National United University
No. 1, Lienda, Miaoli, Taiwan, 36003
flhuang@nuu.edu.tw

Abstract: In this paper, we address speaker-independent recognition of the spoken Chinese digits 0~9 based on the hidden Markov model (HMM). Our former results for inside and outside testing were 92.5% and 76.79%, respectively. To further improve performance, two important speech features, the number of MFCC coefficients and the number of clusters used in vector quantization (VQ), are combined and evaluated over a range of values. The best performance, 96.2% and 83.1%, is achieved with MFCC number = 20 and VQ cluster number = 64.
Keywords: Speech Recognition, Hidden Markov Model, LBG Algorithm, Mel-frequency cepstral coefficients, Viterbi Algorithm.
I. INTRODUCTION
In speech processing, automatic speech recognition (ASR) converts spoken human input into text output over vocabularies of various sizes. ASR is used in a wide range of applications, such as human interface design, speech information retrieval (SIR) [11,12], and language translation. Several commercial ASR systems exist, for example IBM's ViaVoice, the Golden Mandarin (III) Mandarin dictation system of NTU in Taiwan, voice portals on the Internet, and 104 on-line speech query systems. Modern ASR technologies merge signal processing, pattern recognition, networking, and telecommunication into a unified framework. Such an architecture can be extended to broad service domains, such as e-commerce and wireless speech systems over WiMAX.
The approaches adopted for ASR can be categorized as: 1) hidden Markov models (HMM) [1,2,3,4], 2) neural networks [5,6,7], and 3) wavelet-based and spectral coefficients of speech [15,16]; a further approach combines the first two [8,9]. The HMM results from the attempt to model speech generation statistically, and thus belongs to the first category above. During the past several years it has become the most successful speech model used in ASR. The main reason for this success is its powerful ability to characterize the speech signal in a mathematically tractable way.
In a typical HMM-based ASR system, the HMM stage is preceded by parameter extraction. Thus the input to the HMM is a discrete-time sequence of parameter vectors.
The remainder of the paper is organized as follows: the processing of speech is introduced in Section II, and the acoustic model of recognition is described in Section III. The initial results of our former approach are presented in Section IV. The improvement methods are described in Section V.
II. PROCESSES OF SPEECH
In this section, we describe the pre-processing procedures.
A. Processing Speech
The analog voice signal is recorded through a microphone. It must then be digitized and quantized. The sampling process can be described as follows:
$x_p(t) = x_a(t)\,p(t)$ (1)
where $x_p(t)$ and $x_a(t)$ denote the processed and analog signals, respectively, and $p(t)$ is the impulse signal.
Each signal is segmented into several short frames of speech, each containing a time series of samples. The features of each frame are extracted for further processing.
B. Pre-emphasis
The purpose of pre-emphasis is to increase the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies, in order to improve the overall signal-to-noise ratio (SNR) by minimizing the adverse effects of phenomena such as attenuation distortion.
C. Frame Blocking
When analyzing audio signals, we usually adopt short-term analysis because most audio signals are relatively stable within a short period of time.
© 2011 ACEEE
DOI: 01.IJSIP.02.01.218
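As an illustration of Sections II.B and II.C, the following minimal Python sketch applies a first-order pre-emphasis filter and then cuts the signal into short overlapping frames. The filter form y(n) = x(n) - a x(n-1), the coefficient a = 0.97, the function names, and the frame and hop lengths are assumptions made for illustration; the paper does not state which values were used.

```python
def pre_emphasis(signal, coeff=0.97):
    """First-order high-pass filter: y[n] = x[n] - coeff * x[n-1].

    coeff = 0.97 is a commonly used value and an assumption here.
    """
    if not signal:
        return []
    out = [signal[0]]  # the first sample has no predecessor
    for n in range(1, len(signal)):
        out.append(signal[n] - coeff * signal[n - 1])
    return out


def frame_blocking(signal, frame_len, hop_len):
    """Split a signal into frames of frame_len samples, advancing by hop_len.

    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop_len
    return frames


if __name__ == "__main__":
    samples = [float(v) for v in range(10)]
    emphasized = pre_emphasis(samples)
    for frame in frame_blocking(emphasized, frame_len=4, hop_len=2):
        print(frame)
```

Frame and hop lengths are given in samples. For example, a 20 ms frame at a 44100 Hz sampling rate corresponds to frame_len = 882.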


The signal is typically segmented into time frames of, say, 15 to 30 ms.
D. Hamming Window
In signal processing, a window function is a function that is zero-valued outside some chosen interval. The Hamming window is a weighted moving-average transformation used to smooth the periodogram values.
Suppose the original signal $s(n)$ is as follows:
$s(n), \quad n = 0, \ldots, N-1$ (2)
Multiplying the original signal $s(n)$ by the Hamming window $w(n)$ gives $s(n)\,w(n)$, where $w(n)$ is defined as follows:
$w(n) = (1 - \alpha) - \alpha \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$ (3)
where $N$ denotes the number of samples in a window.
E. Mel-frequency cepstral coefficients
Mel-frequency cepstral coefficients (MFCC) are among the most effective feature parameters in speech recognition. For speech representation, it is well known that MFCC parameters tend to be more effective than power-spectrum-based features. MFCCs are based on the non-linear frequency characteristic of the human ear and achieve a high recognition rate in practical applications:
o at lower frequencies, human hearing is more acute;
o at higher frequencies, human hearing is less acute.
As shown in Fig. 7, the mel scale is given by:
$\mathrm{mel}(f) = 1125 \ln(1 + f/700)$ (4)
III. ACOUSTIC MODEL OF RECOGNITION
A. Vector Quantization
The foundational vector quantization (VQ) method was proposed by Y. Linde, A. Buzo, and R. Gray in 1980 and is known as the LBG algorithm. LBG is based on k-means clustering [2,5]: given a codebook size G, the training vectors are categorized into G groups. The centroid $C_i$ of each group $G_i$ serves as the representative codeword for the vectors in that group. In principle, the categorization is tree-structured.
B. Hidden Markov Model
A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. The challenge is to find the appropriate hidden parameters from the observations. An HMM can be considered the simplest dynamic Bayesian network.
In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, however, the state is not directly visible (hence "hidden"), while the variables influenced by the state are visible. Each state has a probability distribution over the possible outputs. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states.
A complete HMM can be defined as follows:
$\lambda = (\pi, A, B)$ (5)
The components of $(\pi, A, B)$ are:
1. $\pi$ (initial state probability):
$\pi = \{\pi_i = \mathrm{prob}(q_1 = S_i)\}, \quad 1 \le i \le N$ (6)
2. $A$ (state transition probability):
$A = \{a_{ij} = \mathrm{prob}(q_{t+1} = S_j \mid q_t = S_i)\}, \quad 1 \le i, j \le N$ (7)
3. $B$ (observation symbol probability):
$B = \{b_j(O_t) = \mathrm{prob}(O_t \mid q_t = S_j)\}, \quad 1 \le j \le N$ (8)
where $O = \{O_1, O_2, \ldots, O_T\}$ is the observation sequence, $S = \{S_1, S_2, \ldots, S_N\}$ is the set of states, $q = \{q_1, q_2, \ldots, q_T\}$ is the state sequence, $T$ denotes the length of the observation sequence, and $N$ is the number of states.
C. System Models
The recognition system is composed of two main functions: 1) extracting the speech features, including frame blocking, VQ, and so on, and 2) constructing the models and performing recognition based on the HMM, VQ, and the Viterbi algorithm.
It is apparent that short speech signals vary sharply and rapidly, whereas longer signals vary slowly. Therefore, we use dynamic frame blocking rather than fixed frames in different experiments.
IV. INITIAL EXPERIMENTS
A. Recognition System Based on HMM
In this paper, we focus on speaker-independent recognition of the spoken Chinese digits 0~9. All samples are recorded at 44100 Hz/16 bits by three native male adult speakers. A total of 560 samples are divided into two parts, 280 for training and 280 for testing. All samples then go through the pre-processing steps, such as pre-emphasis, frame blocking, and VQ.
B. Comparison for Fixed and Dynamic Frame Size
According to our empirical results comparing fixed and dynamic frame sizes, the recognition rate of the fixed



frame size reaches 76.79% in outside testing, which is superior to the 75.71% of the dynamic frame size, as shown in Table 1.
Table 1: Comparison of frame sizes (SymbolNum = 64)
Frame size  Test  Wave num  MFCC time  VQ time  HMM training  Symbol num  Rate (%)
fixed       I     280       32.9       5.77     3.44          64          90.36
fixed       O     280                                                     76.79*
dynamic     I     280       32.0       3.31     2.42          64          92.50*
dynamic     O     280                                                     75.71
PS. I and O denote inside and outside testing, respectively. The timing and symbol-number values are shared by the I and O rows of each frame size.
V. FURTHER IMPROVEMENT
A. Improving the Samples of Speech
According to our empirical results, the recognition rate is best when the cluster number is 64: inside and outside testing reach 92.5% and 76.79%, respectively.
To improve the performance, we analyzed all the speech waveforms. Many samples are affected by boost noise arising from the speaker or the environment, as shown in Fig. 1. In such a situation, the end points of the affected speech usually cannot be detected correctly, which degrades the performance of the system.
End-point detection is usually based on the zero-crossing rate (ZCR) and the energy of the speech, as shown in Fig. 1. However, extra features are clearly needed for detection in noisy situations. Based on experimental results and observation, the improvement rules are summarized as follows:
Input: X(n), n = 1 to j
Output: Y(m), 1 <= m <= j
1. Segment the speech X(n): framedY = framed(X(n)).
2. Calculate the ZCR and energy of each frame.
3. Smooth the curves of both ZCR and energy.
4. Calculate the average of the first 10 frames and multiply it by 1.2. The resulting value is used as the threshold for the detection process.
5. ZCR is valid only if framedY is larger than 100, as shown in Fig. 2.
6. The speech is effective only if its size is larger than 3 ms.
7. The starting energy of the speech should be larger than the threshold.
8. The energy of 5 consecutive frames of speech should increase progressively.
With this improvement, the spoken digit 8 (ㄅㄚ) with boost noise can be detected, as shown in Fig. 2. Improved detection leads to better results in the subsequent recognition process.
Fig. 1: before improvement, Chinese number 8 (ㄅㄚ).
Fig. 2: after improvement, Chinese number 8 (ㄅㄚ).
B. Better Combination of Various Features
To further improve performance, two spectral features of speech, the MFCC degree and the VQ cluster number, are combined and evaluated. The MFCC degree varies from 8 to 36 in steps of 4, and the cluster number varies from 32 to 256 in steps of 32. We evaluated all combinations of these two features. The processing times needed for computation are shown in Table 2. The best results are achieved with MFCC number = 20 and VQ cluster number = 64: the inside and outside recognition rates reach 96.2% and 83.1%, as shown in Fig. 3, and the net improvements for inside and outside testing are 3.7% and 6.3%, respectively. We list only the results with VQ = 64 in this paper.
Table 2: Processing time with VQ = 64.
MFCC degree  8     12    16    20    24    28    32    36
MFCC         15.8  16.9  18.6  23.5  25.3  27.2  28.5  29.9
VQ           1.0   2.6   3.3   3.4   3.8   4.9   5.3   6.6
HMM          1.7   1.7   1.8   1.8   1.8   1.8   1.9   1.9
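The exhaustive evaluation of MFCC degrees and VQ cluster numbers described in Section V.B can be sketched as a simple grid search. This is a minimal sketch, not the authors' code: `evaluate` is a hypothetical callback that would train and test the recognizer for one (MFCC degree, cluster number) pair and return a recognition rate, and `toy_score` is an arbitrary stand-in used only to make the example runnable. Only the two parameter ranges are taken from the text.

```python
def grid_search(evaluate,
                mfcc_degrees=range(8, 37, 4),        # 8, 12, ..., 36
                cluster_numbers=range(32, 257, 32)):  # 32, 64, ..., 256
    """Return (mfcc_degree, cluster_number, rate) with the highest rate.

    `evaluate(mfcc_degree, cluster_number)` is a hypothetical callback;
    ties are resolved in favour of the first combination tried.
    """
    best = None
    for degree in mfcc_degrees:
        for clusters in cluster_numbers:
            rate = evaluate(degree, clusters)
            if best is None or rate > best[2]:
                best = (degree, clusters, rate)
    return best


if __name__ == "__main__":
    def toy_score(degree, clusters):
        # Arbitrary placeholder, NOT the paper's recognizer or its results.
        return -abs(degree - 12) - abs(clusters - 128) / 32

    print(grid_search(toy_score))
```

With eight values per parameter, the search calls `evaluate` 64 times, so the cost of one training and testing run dominates the total time.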

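The energy-based part of the end-point rules in Section V.A (the energy side of rules 2 to 4, plus rules 7 and 8) can be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the smoothing is a 3-point moving average, the end point is taken as the last frame whose smoothed energy exceeds the threshold, and the ZCR rule 5 and duration rule 6 are omitted because the text does not specify them precisely. All function names are illustrative.

```python
def frame_energy(frame):
    """Short-term energy: sum of squared samples in the frame."""
    return sum(x * x for x in frame)


def smooth(values, width=3):
    """Moving average; the window is truncated at both ends (assumption)."""
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half):i + half + 1]
        out.append(sum(window) / len(window))
    return out


def detect_endpoints(frames, lead_frames=10, factor=1.2, rise_frames=5):
    """Return (start, end) frame indices of the speech, or None.

    The threshold is 1.2 times the mean smoothed energy of the first 10
    frames (rule 4). The start is the first later frame whose energy exceeds
    the threshold (rule 7) and rises over 5 consecutive frames (rule 8).
    """
    energy = smooth([frame_energy(f) for f in frames])
    if len(energy) <= lead_frames:
        return None
    threshold = factor * sum(energy[:lead_frames]) / lead_frames
    start = None
    for i in range(lead_frames, len(energy) - rise_frames + 1):
        run = energy[i:i + rise_frames]
        rising = all(a < b for a, b in zip(run, run[1:]))
        if run[0] > threshold and rising:
            start = i
            break
    if start is None:
        return None
    end = start
    for i in range(start, len(energy)):
        if energy[i] > threshold:
            end = i  # assumption: last frame above the threshold ends the speech
    return start, end
```

The sketch assumes that the first 10 frames contain only background noise, which is what makes their average usable as a threshold.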

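Section III names the Viterbi algorithm for recognition but does not spell it out. As a hedged illustration, the sketch below finds the most likely state sequence of a discrete HMM $\lambda = (\pi, A, B)$ as defined in Eqs. (5) to (8), given a sequence of VQ codeword indices. Working in the log domain, the function name, and the toy two-state parameters in the usage example are assumptions for illustration and are not taken from the paper.

```python
import math


def viterbi(pi, A, B, observations):
    """Return (log probability, state path) of the best path.

    pi[i]   : initial probability of state i
    A[i][j] : transition probability from state i to state j
    B[j][k] : probability of emitting symbol k in state j
    observations : non-empty list of symbol indices (e.g. VQ codewords)
    """
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    n_states = len(pi)
    delta = [log(pi[i]) + log(B[i][observations[0]]) for i in range(n_states)]
    backptr = []
    for symbol in observations[1:]:
        new_delta, ptr = [], []
        for j in range(n_states):
            # Best predecessor of state j at this time step.
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log(A[i][j]))
            new_delta.append(delta[best_i] + log(A[best_i][j])
                             + log(B[j][symbol]))
            ptr.append(best_i)
        delta = new_delta
        backptr.append(ptr)
    # Trace back from the best final state.
    last = max(range(n_states), key=lambda i: delta[i])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path


if __name__ == "__main__":
    # Toy left-to-right model with 2 states and 2 symbols (made-up numbers).
    pi = [1.0, 0.0]
    A = [[0.5, 0.5], [0.0, 1.0]]
    B = [[0.9, 0.1], [0.1, 0.9]]
    print(viterbi(pi, A, B, [0, 0, 1, 1]))
```

In an isolated-digit recognizer, one such model would typically be trained per digit, and the digit whose model gives the highest score would be chosen; whether the paper's system scores models in exactly this way is not stated.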
Fig. 3: performance with VQ = 64, MFCC degrees varied between 8 and 36. [Plot of performance (%) against MFCC degree for the inside test and the outside test.]
VI. CONCLUSION
In this paper, we addressed speaker-independent recognition of spoken Chinese digits based on HMM. An algorithm for our approach is proposed for the speech recognition. 480 speech samples are recorded and pre-processed. The preliminary result of outside testing is 76.79%.
To further improve performance, two speech features, the MFCC number and the VQ cluster number, are evaluated. We then find the combination of these two features that achieves the best results: MFCC number = 20 and VQ cluster number = 64. The final inside and outside recognition rates reach 96.2% and 83.1%. This shows that the proposed approach can be employed for speaker-independent speech recognition.
Future work will address the following:
1) Combining other effective methods with the proposed method to enhance performance.
2) Applying the method to isolated Chinese speech recognition.
3) Improving the precision rates.
ACKNOWLEDGEMENT
This paper is supported by the Project of Lein-Ho Foundation, Taiwan.
REFERENCES
[1] Keng-Yu Lin, 2006, Extended Discrete Hidden Markov Model and Its Application to Chinese Syllable Recognition, Master thesis of NCHU, Taiwan.
[2] Keng-Yu Lin, 2006, Extended Discrete Hidden Markov Model and Its Application to Chinese Syllable Recognition, Master thesis of NCHU.
[3] X. Li, M. Parizeau and R. Plamondon, April 2000, Training Hidden Markov Models with Multiple Observations--A Combinatorial Method, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 4.
[4] A. Sperduti and A. Starita, May 1997, Supervised Neural Networks for Classification of Structures, IEEE Transactions on Neural Networks, 8(3), pp. 714-735.
[6] E. Behrman, L. Nash, J. Steck, V. Chandrashekar, and S. Skinner, October 2000, Simulations of Quantum Neural Networks, Information Sciences, 128(3-4), pp. 257-269.
[7] Hsien-Leing Tsai, 2004, Automatic Construction Algorithms for Supervised Neural Networks and Applications, PhD thesis of NSYSU, Taiwan.
[8] Li-Yi Lu, 2003, The Research of Neural Network and Hidden Markov Model Applied on Personal Digital Assistant, Master thesis of CYU, Taiwan.
[10] L. R. Rabiner, 1989, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286.
[11] Manfred R. Schroeder, H. Quast, H. W. Strube, 2004, Computer Speech: Recognition, Compression, Synthesis, Springer.
[12] M. Wald, 2006, Learning Through Multimedia: Automatic Speech Recognition Enabling Accessibility and Interaction, Proceedings of ED-MEDIA 2006: World Conference on Educational Multimedia, Hypermedia & Telecommunications, pp. 2965-2976.
[13] A. Revathi, R. Ganapathy and Y. Venkataramani, Nov. 2009, Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach, International Journal of Computer Science & Information Technology (IJCSIT), Vol. 1, No. 2, pp. 30-42.
[14] Haamid M. Gazi, Omar Farooq, Yusuf U. Khan, Sekharjit Datta, 2008, Wavelet-based, speaker-independent isolated Hindi digit recognition, International Journal of Information and Communication Technology, Vol. 1, Issue 2, pp. 185-198.
[15] P. Chakraborty, et al., 2008, An Automatic Speaker Recognition System, Neural Information Processing, Lecture Notes in Computer Science (LNCS), Springer Berlin / Heidelberg, pp. 517-526.
[16] Kun-Ching Wang, 2009, Wavelet-Based Speech Enhancement Using Time-Frequency Adaptation, EURASIP Journal on Advances in Signal Processing, Volume 2009 (2009), Article ID 924135.