ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011




        A Novel Method for Speaker Independent
      Recognition Based on Hidden Markov Model
                                                        Feng-Long Huang
                           Computer Science and Information Engineering, National United University
                                         No. 1, Lienda, Miaoli, Taiwan, 36003
                                                     flhuang@nuu.edu.tw



© 2011 ACEEE
DOI: 01.IJSIP.02.01.218

Abstract: In this paper, we address speaker-independent recognition of the spoken Chinese digits 0~9 based on the hidden Markov model (HMM). Our former results for inside and outside testing were 92.5% and 76.79%, respectively. To improve the performance further, two important features of speech, the MFCC number and the cluster number of vector quantization, are combined and evaluated over various values. The best performance, 96.2% for inside testing and 83.1% for outside testing, is achieved with MFCC number = 20 and VQ cluster number = 64.

Keywords: Speech Recognition, Hidden Markov Model, LBG Algorithm, Mel-frequency cepstral coefficients, Viterbi Algorithm.

I. INTRODUCTION

In speech processing, automatic speech recognition (ASR) automatically converts human speech input into text output over various vocabularies. ASR can be applied in a wide range of applications, such as human interface design, speech information retrieval (SIR) [11,12], language translation, and so on. In the real world, there are several commercial ASR systems, for example, IBM's ViaVoice, the Mandarin dictation system Golden Mandarin (III) of NTU in Taiwan, voice portals on the Internet, and 104 on-line speech query systems. Modern ASR technologies merge signal processing, pattern recognition, networking and telecommunication into a unified framework. Such an architecture can be expanded into broad domains of services, such as e-commerce and wireless speech systems over WiMAX.

The approaches adopted in ASR can be categorized as: 1) Hidden Markov Model (HMM) [1,2,3,4], 2) neural networks [5,6,7], 3) wavelet-based and spectrum coefficients of speech [15,16]. Another method combines the first two approaches [8,9]. The hidden Markov model results from the attempt to model speech generation statistically, and thus belongs to the first category above. During the past several years it has become the most successful speech model used in ASR. The main reason for this success is its powerful ability to characterize the speech signal in a mathematically tractable way.

In a typical ASR system based on HMM, the HMM stage is preceded by parameter extraction. Thus the input to the HMM is a discrete-time sequence of parameter vectors.

The remainder of the paper is organized as follows: the processing of speech is introduced in Section II and the acoustic model of recognition is described in Section III. The initial results of the former approaches are presented in Section IV. The improvement methods are described in Section V.

II. PROCESSES OF SPEECH

In this section, we describe all the pre-processing procedures.

A. Processing Speech

The analog voice signals are recorded through a microphone and must be digitized and quantized. The digital signal process can be described as follows:

$$x_p(t) = x_a(t)\,p(t) \tag{1}$$

where $x_p(t)$ and $x_a(t)$ denote the processed and analog signals, and $p(t)$ is the impulse signal.

Each signal is segmented into several short frames of speech, each containing a time series of samples. The features of each frame are extracted for further processing.

B. Pre-emphasis

Basically, the purpose of pre-emphasis is to increase the magnitude of some (usually higher) frequencies with respect to the magnitude of other (usually lower) frequencies in order to improve the overall signal-to-noise ratio (SNR) by minimizing the adverse effects of such phenomena as attenuation distortion.

C. Frame Blocking

When analyzing audio signals, we usually adopt short-term analysis because most audio signals are relatively stable within a short period of time. Usually,
the signal is segmented into time frames of, say, 15 ~ 30 ms.

D. Hamming Window

In signal processing, a window function is a function that is zero-valued outside of some chosen interval. The Hamming window is a weighted moving average transformation used to smooth the periodogram values.

Suppose that the original signal s(n) is as follows:

$$s(n), \quad n = 0, \ldots, N-1 \tag{2}$$

Multiplying the original signal s(n) by the Hamming window w(n) gives s(n) w(n), where w(n) is defined as follows:

$$w(n) = (1-\alpha) - \alpha\cos\!\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1 \tag{3}$$

where N denotes the number of samples in a window.

E. Mel-frequency cepstral coefficients

The Mel-frequency cepstral coefficient (MFCC) is one of the most effective feature parameters in speech recognition. For speech representation, it is well known that MFCC parameters appear to be more effective than power-spectrum-based features. MFCCs are based on the non-linear frequency characteristic of the human ear and achieve a high recognition rate in practical applications:
   o at lower frequencies, human hearing is more acute;
   o at higher frequencies, human hearing is less acute.
As shown in Fig. 7, the mel scale is given by:

$$\mathrm{mel}(f) = 1125\,\ln\!\left(1 + \frac{f}{700}\right) \tag{4}$$

III. ACOUSTIC MODEL OF RECOGNITION

A. Vector Quantization

The foundational vector quantization (VQ) method was proposed by Y. Linde, A. Buzo, and R. Gray in 1980 and is known as the LBG algorithm. LBG is based on k-means clustering [2,5]: according to the codebook size G, the training vectors are categorized into G groups. The centroid Ci of each group Gi is the codeword that represents the vectors of that group. In principle, the categorization has a tree-based structure.

B. Hidden Markov Model

A hidden Markov model (HMM) is a statistical model in which the system being modeled is assumed to be a Markov process with unknown parameters. The challenge is to find all the appropriate hidden parameters from the observations. An HMM can be considered the simplest dynamic Bayesian network.

In a regular Markov model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a hidden Markov model, however, the state is not directly visible (hence "hidden"), while the variables influenced by the state are visible. Each state has a probability distribution over the output. Therefore, the sequence of tokens generated by an HMM gives some information about the sequence of states.

A complete HMM can be defined as follows:

$$\lambda = (\pi, A, B) \tag{5}$$

The three components of $(\pi, A, B)$ are:

1. $\pi$ (initial state probability):

$$\pi = \{\pi_i = P(q_1 = S_i)\}, \quad 1 \le i \le N \tag{6}$$

2. A (state transition probability):

$$A = \{a_{ij} = P(q_{t+1} = S_j \mid q_t = S_i)\}, \quad 1 \le i, j \le N \tag{7}$$

3. B (observation symbol probability):

$$B = \{b_j(O_t) = P(O_t \mid q_t = S_j)\}, \quad 1 \le j \le N \tag{8}$$

where $O = \{O_1, O_2, \ldots, O_T\}$ is the observation sequence, $S = \{S_1, S_2, \ldots, S_N\}$ is the set of states, $q = \{q_1, q_2, \ldots, q_T\}$ is the state sequence, T denotes the length of the observation sequence, and N is the number of states.

C. System Models

The recognition system is composed of two main functions: 1) extracting the speech features, including frame blocking, VQ, and so on; 2) constructing the model and performing recognition based on the HMM, VQ and the Viterbi algorithm.

It is apparent that short speech signals vary sharply and rapidly, whereas longer signals vary slowly. Therefore, we use dynamic frame blocking rather than a fixed frame size in different experiments.

IV. INITIAL EXPERIMENTS

A. Recognition System Based on HMM

In this paper, we focus on speaker-independent recognition of the spoken Chinese digits 0~9. All the samples, at 44100 Hz / 16 bits, were recorded by three native male adults. A total of 560 samples are divided into two parts, 280 for training and 280 for testing. The samples then go through the pre-processing steps, such as pre-emphasis, frame blocking and VQ.

B. Comparison of Fixed and Dynamic Frame Size

According to our empirical results, comparing the fixed and dynamic frame size, the recognition rate of the fixed

frame size achieves 76.79% in outside testing and is superior to the dynamic frame size with 75.71%, as shown in Table 1.

Table 1: Comparison of fixed and dynamic frame size (SymbolNum = 64)

Frame size  Test  WaveNum  MFCC time  VQ time  HMM training  SymbolNum  Rate (%)
fixed       I     280      32.9       5.77     3.44          64         90.36
fixed       O     280      32.9       5.77     3.44          64         76.79*
dynamic     I     280      32.0       3.31     2.42          64         92.50*
dynamic     O     280      32.0       3.31     2.42          64         75.71

PS. I and O denote the inside and outside testing, respectively.

V. FURTHER IMPROVEMENT

A. Improving the Samples of Speech

According to our empirical results, the recognition rate is better when the cluster number = 64. Inside and outside testing achieve 92.5% and 76.79%, respectively.

To improve the performance, we analyze all the speech waveforms. Many samples are affected by boost noise derived from human speaking or the environment, as shown in Fig. 1. In such a situation, the end points of the boosted speech usually cannot be detected correctly, which degrades the performance of the system.

Usually, end points are detected from the zero-crossing rate (ZCR) and the energy of the speech, as shown in Fig. 1. However, extra features are needed for detection in noisy situations. Based on experimental results and observation, the improvement rules are summarized as follows:
    Input: X(n), n = 1 to j
    Output: Y(m), 1 <= m <= j
    1. Segment the speech X(n): framedY = framed(X(n)).
    2. Calculate the ZCR and energy for each frame.
    3. Smooth the curves for both ZCR and energy.
    4. Calculate the average of the first 10 frames and multiply it by 1.2. The resulting value is used as the threshold for the detection process.
    5. ZCR is valid only if framedY is larger than 100, as shown in Fig. 2.
    6. The speech is effective only if its size is larger than 3 ms.
    7. The starting energy of the speech should be larger than the threshold.
    8. The energy of 5 continuous frames of speech should increase progressively.

With this improvement, the speech for number 8 (ㄅㄚ) with boost noise can be detected, as shown in Fig. 2. The improved detection leads to better results in the following recognition process.

B. Better Combination of Various Features

To improve the performance further, two spectrum features of speech, the MFCC degree and the VQ cluster number, are combined and evaluated. The MFCC degree varies from 8 to 36 with an interval of 4, and the cluster number varies from 32 to 256 with an interval of 32. We evaluated all combinations of these two features. The processing times needed for computation are shown in Table 2. The best results are achieved with MFCC number = 20 and VQ cluster number = 64: the inside and outside testing of recognition achieve 96.2% and 83.1%, as shown in Fig. 3, and the net improvements for inside and outside testing are 3.7% and 6.3%, respectively. We list only the results with VQ = 64 in the paper.

Table 2: Processing time with VQ = 64.

MFCC degree   8      12     16     20     24     28     32     36
MFCC          15.8   16.9   18.6   23.5   25.3   27.2   28.5   29.9
VQ            1.0    2.6    3.3    3.4    3.8    4.9    5.3    6.6
HMM           1.7    1.7    1.8    1.8    1.8    1.8    1.9    1.9

Fig. 1: Before improvement, Chinese number 8 (ㄅㄚ).

Fig. 2: After improvement, Chinese number 8 (ㄅㄚ).
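
The energy-based part of the endpoint-detection rules in Section V.A (rules 1 to 4 and 6 to 8) can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the author's implementation: the function names, the frame length, the hop size, the smoothing width and the synthetic test signal are all hypothetical. Rule 5 is left out of the decision because the source does not say how the value 100 is applied to ZCR, so ZCR is only computed and smoothed here.

```python
import numpy as np


def frame_signal(x, frame_len, hop):
    """Rule 1: split a 1-D signal into overlapping frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])


def smooth(curve, width=5):
    """Rule 3: moving average with edge padding (output keeps the input length)."""
    padded = np.pad(curve, width // 2, mode="edge")
    return np.convolve(padded, np.ones(width) / width, mode="valid")


def frame_features(x, frame_len, hop):
    """Rules 2-3: smoothed short-time energy and zero-crossing rate per frame."""
    frames = frame_signal(np.asarray(x, dtype=float), frame_len, hop)
    energy = np.sum(frames ** 2, axis=1)
    signs = np.sign(frames)
    zcr = np.mean(signs[:, 1:] != signs[:, :-1], axis=1)
    return smooth(energy), smooth(zcr)


def detect_endpoints(x, fs, frame_len=512, hop=256,
                     lead_frames=10, factor=1.2, min_ms=3.0, rise_frames=5):
    """Return (start_frame, end_frame) of the detected speech, or None."""
    energy, _zcr = frame_features(x, frame_len, hop)
    # Rule 4: threshold = 1.2 x average of the first 10 frames,
    # which are assumed (hypothetically) to contain only background.
    threshold = factor * np.mean(energy[:lead_frames])
    active = np.flatnonzero(energy > threshold)  # rule 7
    for start in active:
        window = energy[start:start + rise_frames]
        # Rule 8: energy must rise over rise_frames consecutive frames.
        if len(window) == rise_frames and np.all(np.diff(window) > 0):
            end = int(active[-1])
            duration_ms = 1000.0 * ((end - start) * hop + frame_len) / fs
            # Rule 6: reject segments that are not longer than min_ms.
            return (int(start), end) if duration_ms > min_ms else None
    return None


# Synthetic example (an assumption, not data from the paper):
# a low-level deterministic background with a 440 Hz tone in the middle.
fs = 16000
n = np.arange(20000)
background = 0.01 * np.where(n % 2 == 0, 1.0, -1.0)
x = background.copy()
x[10000:15000] += np.sin(2 * np.pi * 440 * n[10000:15000] / fs)
start, end = detect_endpoints(x, fs)
print(start, end)
```

The returned values are frame indices; multiplying by the hop size gives sample positions. Padding the edges before smoothing keeps the first frames from being biased low, which matters because rule 4 derives the threshold from exactly those frames.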
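
As a small worked check of the numbers reported in Section V.B, the sketch below enumerates the feature grid described in the text and recomputes the net improvements from the reported recognition rates. The variable names are illustrative assumptions; only the numeric values come from the paper.

```python
# Feature grid from Section V.B: MFCC degree 8..36 (step 4), VQ clusters 32..256 (step 32).
mfcc_degrees = list(range(8, 37, 4))
vq_clusters = list(range(32, 257, 32))
grid = [(m, c) for m in mfcc_degrees for c in vq_clusters]

# Recognition rates (%) reported in the paper.
baseline = {"inside": 92.5, "outside": 76.79}  # former results (Section IV)
best = {"inside": 96.2, "outside": 83.1}       # MFCC number = 20, VQ cluster number = 64

# Net improvement in percentage points.
gain = {k: round(best[k] - baseline[k], 2) for k in best}

print(len(grid), gain)
```

The grid has 8 x 8 = 64 combinations, and the differences come to about 3.7 and 6.3 percentage points, which matches the net improvements stated in the text.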



Fig. 3: Performance with VQ = 64, MFCC degrees varied between 8 and 36 (x-axis: MFCC degree; y-axis: performance (%), 60 to 100; series: inside test and outside test).

VI. CONCLUSION

In this paper, we address speaker-independent recognition of spoken Chinese digits based on HMM. An algorithm for our novel approach is proposed for the speech recognition. 480 speech samples are recorded and pre-processed. The preliminary result of outside testing is 76.79%.

To improve the performance further, two features of speech, the MFCC number and the VQ cluster number, are evaluated. We then find the combination of the two spectrum features that achieves the best results. The best performance is achieved with MFCC number = 20 and VQ cluster number = 64. The final inside and outside testing of recognition achieve 96.2% and 83.1%. This indicates that the proposed approach can be employed for speaker-independent speech recognition.

Future work will address the following:
  1) Employing other effective methods and merging them with the novel method to enhance the performance.
  2) Applying the method to isolated Chinese speech recognition.
  3) Improving the precision rates.

ACKNOWLEDGEMENT

The paper is supported under the Project of Lein-Ho Foundation, Taiwan.

REFERENCES

[1] Keng-Yu Lin, 2006, Extended Discrete Hidden Markov Model and Its Application to Chinese Syllable Recognition, Master thesis of NCHU, Taiwan.
[2] Keng-Yu Lin, 2006, Extended Discrete Hidden Markov Model and Its Application to Chinese Syllable Recognition, Master thesis of NCHU.
[3] X. Li, M. Parizeau and R. Plamondon, April 2000, Training Hidden Markov Models with Multiple Observations--A Combinatorial Method, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 22, No. 4.
[4] A. Sperduti and A. Starita, May 1997, Supervised Neural Networks for Classification of Structures, IEEE Transactions on Neural Networks, 8(3): pp. 714-735.
[6] E. Behrman, L. Nash, J. Steck, V. Chandrashekar, and S. Skinner, October 2000, Simulations of Quantum Neural Networks, Information Sciences, 128(3-4): pp. 257-269.
[7] Hsien-Leing Tsai, 2004, Automatic Construction Algorithms for Supervised Neural Networks and Applications, PhD thesis of NSYSU, Taiwan.
[8] Li-Yi Lu, 2003, The Research of Neural Network and Hidden Markov Model Applied on Personal Digital Assistant, Master thesis of CYU, Taiwan.
[10] Rabiner, L. R., 1989, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceedings of the IEEE, Vol. 77, No. 2, pp. 257-286.
[11] Manfred R. Schroeder, H. Quast, H.W. Strube, Computer Speech: Recognition, Compression, Synthesis, Springer, 2004.
[12] Wald, M., 2006, Learning Through Multimedia: Automatic Speech Recognition Enabling Accessibility and Interaction, Proceedings of ED-MEDIA 2006: World Conference on Educational Multimedia, Hypermedia & Telecommunications, pp. 2965-2976.
[13] A. Revathi, R. Ganapathy and Y. Venkataramani, Nov. 2009, Text Independent Speaker Recognition and Speaker Independent Speech Recognition Using Iterative Clustering Approach, International Journal of Computer Science & Information Technology (IJCSIT), Vol. 1, No. 2, pp. 30-42.
[14] Haamid M. Gazi, Omar Farooq, Yusuf U. Khan, Sekharjit Datta, 2008, Wavelet-based, speaker-independent isolated Hindi digit recognition, International Journal of Information and Communication Technology, Vol. 1, Issue 2, pp. 185-198.
[15] Chakraborty P., et al., 2008, An Automatic Speaker Recognition System, Neural Information Processing, Lecture Notes in Computer Science (LNCS), Springer Berlin / Heidelberg, pp. 517-526.
[16] Kun-Ching Wang, 2009, Wavelet-Based Speech Enhancement Using Time-Frequency Adaptation, EURASIP Journal on Advances in Signal Processing, Volume 2009 (2009), Article ID 924135.
                                                                       30
 © 2011 ACEEE
 DOI: 01.IJSIP.02.01.218

More Related Content

PDF
44 i9 advanced-speaker-recognition
PDF
Realization and design of a pilot assist decision making system based on spee...
PDF
Et25897899
DOCX
Voice biometric recognition
PDF
Speaker identification using mel frequency
PDF
A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-...
PDF
Text-Independent Speaker Verification Report
PPT
Automatic Speaker Recognition system using MFCC and VQ approach
44 i9 advanced-speaker-recognition
Realization and design of a pilot assist decision making system based on spee...
Et25897899
Voice biometric recognition
Speaker identification using mel frequency
A New Method for Pitch Tracking and Voicing Decision Based on Spectral Multi-...
Text-Independent Speaker Verification Report
Automatic Speaker Recognition system using MFCC and VQ approach

What's hot (17)

PDF
A017410108
DOCX
speech enhancement
PDF
Fb24958960
PDF
Speaker Recognition System using MFCC and Vector Quantization Approach
PDF
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
DOC
Speaker recognition.
PDF
20080502 software verification_sharygina_lecture03
PPTX
Speaker recognition systems
PPTX
Text-Independent Speaker Verification
PDF
Speaker and Speech Recognition for Secured Smart Home Applications
PDF
Ber performance analysis of mimo systems using equalization
PDF
Performance analysis of image compression using fuzzy logic algorithm
PDF
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
PDF
HGS-Assisted Detection Algorithm for 4G and Beyond Wireless Mobile Communicat...
PDF
Kf2517971799
PPT
Environmental Sound detection Using MFCC technique
PPTX
Text independent speaker recognition system
A017410108
speech enhancement
Fb24958960
Speaker Recognition System using MFCC and Vector Quantization Approach
Designing an Efficient Multimodal Biometric System using Palmprint and Speech...
Speaker recognition.
20080502 software verification_sharygina_lecture03
Speaker recognition systems
Text-Independent Speaker Verification
Speaker and Speech Recognition for Secured Smart Home Applications
Ber performance analysis of mimo systems using equalization
Performance analysis of image compression using fuzzy logic algorithm
CURVELET BASED SPEECH RECOGNITION SYSTEM IN NOISY ENVIRONMENT: A STATISTICAL ...
HGS-Assisted Detection Algorithm for 4G and Beyond Wireless Mobile Communicat...
Kf2517971799
Environmental Sound detection Using MFCC technique
Text independent speaker recognition system
Ad

Viewers also liked (9)

PDF
A Dynamic MAC Protocol for WCDMA Wireless Multimedia Networks
PDF
A Robust & Fast Face Detection System
PDF
A Quality of Service Strategy to Optimize Bandwidth Utilization in Mobile Net...
PDF
Towards a Software Framework for Automatic Business Process Redesign
PDF
Different Attacks on Selective Encryption in RSA based Singular Cubic Curve w...
PDF
Detection of Carotid Artery from Pre-Processed Magnetic Resonance Angiogram
PDF
Using PageRank Algorithm to Improve Coupling Metrics
PDF
Modified Epc Global Network Architecture of Internet of Things for High Load ...
PDF
Power System State Estimation - A Review
A Dynamic MAC Protocol for WCDMA Wireless Multimedia Networks
A Robust & Fast Face Detection System
A Quality of Service Strategy to Optimize Bandwidth Utilization in Mobile Net...
Towards a Software Framework for Automatic Business Process Redesign
Different Attacks on Selective Encryption in RSA based Singular Cubic Curve w...
Detection of Carotid Artery from Pre-Processed Magnetic Resonance Angiogram
Using PageRank Algorithm to Improve Coupling Metrics
Modified Epc Global Network Architecture of Internet of Things for High Load ...
Power System State Estimation - A Review
Ad

Similar to A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model (20)

PDF
An Effective Approach for Chinese Speech Recognition on Small Size of Vocabulary
PDF
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
PDF
Emotion Recognition Based On Audio Speech
PDF
IRJET- Emotion recognition using Speech Signal: A Review
PDF
P141omfccu
PDF
Iberspeech2012
DOCX
EBDSS Max Research Report - Final
PDF
Iy2617051711
PDF
Real Time Speech Enhancement in the Waveform Domain
PDF
A comparison of different support vector machine kernels for artificial speec...
DOC
Speaker recognition on matlab
PDF
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
PDF
A Text-Independent Speaker Identification System based on The Zak Transform
PDF
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
PDF
Design and implementation of different audio restoration techniques for audio...
PDF
Blind, Non-stationary Source Separation Using Variational Mode Decomposition ...
PDF
Bz25454457
PPT
Asr
PPT
Speech Recognition System By Matlab
PDF
Comparative Analysis of Distortive and Non-Distortive Techniques for PAPR Red...
An Effective Approach for Chinese Speech Recognition on Small Size of Vocabulary
A Novel, Robust, Hierarchical, Text-Independent Speaker Recognition Technique
Emotion Recognition Based On Audio Speech
IRJET- Emotion recognition using Speech Signal: A Review
P141omfccu
Iberspeech2012
EBDSS Max Research Report - Final
Iy2617051711
Real Time Speech Enhancement in the Waveform Domain
A comparison of different support vector machine kernels for artificial speec...
Speaker recognition on matlab
A Novel Uncertainty Parameter SR ( Signal to Residual Spectrum Ratio ) Evalua...
A Text-Independent Speaker Identification System based on The Zak Transform
A Combined Voice Activity Detector Based On Singular Value Decomposition and ...
Design and implementation of different audio restoration techniques for audio...
Blind, Non-stationary Source Separation Using Variational Mode Decomposition ...
Bz25454457
Asr
Speech Recognition System By Matlab
Comparative Analysis of Distortive and Non-Distortive Techniques for PAPR Red...

More from IDES Editor (20)

PDF
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
PDF
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
PDF
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
PDF
Line Losses in the 14-Bus Power System Network using UPFC
PDF
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
PDF
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
PDF
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
PDF
Selfish Node Isolation & Incentivation using Progressive Thresholds
PDF
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
PDF
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
PDF
Cloud Security and Data Integrity with Client Accountability Framework
PDF
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
PDF
Enhancing Data Storage Security in Cloud Computing Through Steganography
PDF
Low Energy Routing for WSN’s
PDF
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
PDF
Rotman Lens Performance Analysis
PDF
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
PDF
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
PDF
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
PDF
Mental Stress Evaluation using an Adaptive Model
Artificial Intelligence Technique based Reactive Power Planning Incorporating...
Design and Performance Analysis of Genetic based PID-PSS with SVC in a Multi-...
Optimal Placement of DG for Loss Reduction and Voltage Sag Mitigation in Radi...
Line Losses in the 14-Bus Power System Network using UPFC
Study of Structural Behaviour of Gravity Dam with Various Features of Gallery...
Assessing Uncertainty of Pushover Analysis to Geometric Modeling
Secure Multi-Party Negotiation: An Analysis for Electronic Payments in Mobile...
Selfish Node Isolation & Incentivation using Progressive Thresholds
Various OSI Layer Attacks and Countermeasure to Enhance the Performance of WS...
Responsive Parameter based an AntiWorm Approach to Prevent Wormhole Attack in...
Cloud Security and Data Integrity with Client Accountability Framework
Genetic Algorithm based Layered Detection and Defense of HTTP Botnet
Enhancing Data Storage Security in Cloud Computing Through Steganography
Low Energy Routing for WSN’s
Permutation of Pixels within the Shares of Visual Cryptography using KBRP for...
Rotman Lens Performance Analysis
Band Clustering for the Lossless Compression of AVIRIS Hyperspectral Images
Microelectronic Circuit Analogous to Hydrogen Bonding Network in Active Site ...
Texture Unit based Monocular Real-world Scene Classification using SOM and KN...
Mental Stress Evaluation using an Adaptive Model

A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model

  • 1. ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011 A Novel Method for Speaker Independent Recognition Based on Hidden Markov Model Feng-Long Huang Computer Science and Information Engineering, National United University No. 1, Lienda, Miaoli, Taiwan, 36003 flhuang@nuu.edu.tw Abstract: In this paper, we address the speaker independent for this success is the powerful ability to characterize the recognition of Chinese number speeches 0~9 based on HMM. speech signal in a mathematically tractable way. Our former results of inside and outside testing achieved In a typical ASR system based on HMM, the HMM 92.5% and 76.79% respectively. To improve further the stage is proceeded by the parameter extraction. Thus the performance, two important features of speech; MFCC and input to the HMM is a discrete time sequence of parameter cluster number of vector quantification, are unified together vectors, which will be supplied to the HMM. and evaluated on various values. The best performance In the paper, the following sections are organized as achieve 96.2% and 83.1% on MFCC Number = 20 and VQ follow: the process of speeches is introduced in Section 2 clustering number = 64. and the acoustic model of recognition will be described in Keywords: Speech Recognition, Hidden Markov Model, Section 3. The initial results for former approaches are LBG Algorithm, Mel-frequency cepstral coefficients, Viterbi presented in Section 4. The improvement metods are Algorithm. furthermore described in Section 5 I. INTRODUCTION II. PROCESSES OF SPEECH In Speech processing, automatic speech recognition In this section, we will describe all the procedures for (ASR) is capable automatically of understanding the input pre-processes. of human speech for the text output with various A. Processing Speech vocabularies. ASR can be applied in a wide range of The analog voice signals are recorded thru applications, such as: human interface design, speech microphone. 
It should be digitalized and quantified. The Information Retrieval (SIR) [11,12], language translation, digital signal process can be described as follows: and so on. In real world, there are several commercial x p (t ) = x a (t ) p (t ) ASR systems, for example, IBM’s Via Voice, Mandarin (1) Dictation System–the Golden Mandarin (III) of NTU in where xp(t) and xa(t) denote the processed and analog Taiwan, Voice Portal on Internet and 104 on-line speech signal. p(t) is the impulse signal. queries systems. Modern ASR technologies merged the Each signal should be segmented into several short signal process, pattern recognition, network and frames of speech which contain a time series signal. The telecommunication into a unified framework. Such features of each frame are extracted for further processes. architecture can be expanded into broad domains of B. Pre-emphasis services, such as e-commerce and wireless speech system Basically, the purpose of pre-emphasis is to increase, of WiMAX. the magnitude of some (usually higher) frequencies with The approaches adopted on ASR can be categorized as: respect to the magnitude of other (usually lower) 1)Hidden Markov Model (HMM) [1,2,3,4], 2)Neural frequencies in order to improve the overall signal-to-noise Networks [5,6,7], 3)Wavelet-based and spectrum coefficients ratio (SNR) by minimizing the adverse effects of such of speech [15,16], other method is the combination of first phenomena as attenuation distortion. two approaches above [8,9]. The Hidden Markov Model is C. Frame Blocking a result of the attempt to model the speech generation While analyzing audio signals, we usually adopt the statistically, and thus belongs to the first category above. method of short-term analysis because most audio signals During the past several years it has become the most are relatively stable within a short period of time. Usually, successful speech model used in ASR. The main reason 27 © 2011 ACEEE DOI: 01.IJSIP.02.01.218
  • 2. ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011 the signal will be segmented into time frame, say 15 ~ 30 In a regular Markov model, the state is directly visible ms. to the observer, and therefore the state transition D. Hamming Window probabilities are the only parameters. However, in a In signal processing, the window function is hidden Markov model, the state is not directly visible (so- a function that is zero-valued outside of some called hidden), while the variables influenced by the state chosen interval. The Hamming window is a weighted are visible. Each state has a probability distribution over moving average transformation used to smooth the the output. Therefore, the sequence of tokens generated by periodogram values. Supposed that original signal s(n) is as follows: an HMM gives some information about the sequence of s(n), n = 0,…N-1 (2) states. The original signal s(n) is multiplied by hamming A complete HMM can be defined as follows: window w(n), we will obtain s(n)* w(n), w(n) can be λ = ( π , A, B) (5) defined as follows: HMM model can be defined as ( π , A, B) : 1. Π (Initial state probability): w(n) = (1 - α) – α*cos(2πn/(N-1)), 0≦ n≦ N-1 (3) π = { π i = prob(q = S i )} 1≤ i ≤ N (6) 1 where N denotes the sample number in a window. 2. A (State transition probability): E. Mel-frequency cepstral coefficients A = {a ij = prob(q t+1 = S j |q t = S i )} (7) Mel Frequency Cepstral Coefficient (MFCC) is one of 1 ≤ i ≤ N the most effective feature parameter in speech recognition. 3. B (Observation symbol probability): B = {b j (O t ) = prob(Ot | q t = S j )} 1 ≤ i ≤ N (8) For speech representation, it is well known that MFCC parameters appear to be more effective than power where O = {O 1 , O 2 ,.... , O T } is the observation. spectrum based features. MFCCs are based on the human S = {S1 , S 2 , S 3 ,..... , S N } is state symbols and ears' non-linear frequency characteristic and perform a q = {q 1 , q 2 , q 3 ,..... 
, q T } is observation states and high recognition rate in practical application. T denote the length of observation, N is the number of o lower frequency, human hear more acute. states. o higher frequency, human hear less acute. C. System Models As shown in Fig. 7, MFCC are presented as: The recognition system is composed of two main mel(f)=1125*ln(1+f/700) (4) functions: 1) extracting the speech features, including frame blocking, VQ, and so on, 2) constructing the model III. ACOUSTIC MODEL OF RECOGNITION and recognition based on the HMM, VQ and Viterbi Algorithm. A. Vector Quantification It is apparent that short speech signal varied sharply Foundational vector quantifications (VQ) were and rapidly, whereas longer signal varied slowly. proposed by Y. Linde, A. Buzo, and R. Gray in 1980, So- Therefore, we use the dynamic frame blocking rather than called LBG algorithm. LBG is based on k-means fixed frame for different experiments. clustering [2,5], referring to the size of codebook G, training vectors will be categorized into G groups. The IV. INITIAL EXPERIMENTS centroid Ci of each Gi will be the representative for such A. Recognition System Based on HMM vector of codeword. In principal, the category is tree In the paper, we focus on speaker independent based structure. speech recognition of Chinese number speeches 0~9. All B. Hidden Markov Model the samples with 44100 Hz/16 bits are recorded by three native male adults. Total 560 samples are divided into two A Hidden Markov Model (HMM) is a statistical model parts, 280 for training and 280 for testing. After complete in which is assumed to be a Markov process with the pre-process, such as preemphasis, frame boloking, VQ. unknown parameters. The challenge is to find all the appropriate hidden parameters from the observable states. B. Comparison for fixed and Dynamic Frame Size HMM can be considered as the simplest dynamic According to our empirical results, comparing the Bayesian network. 
fixed and dynamic frame size, recognition rate of fixed 28 © 2011 ACEEE DOI: 01.IJSIP.02.01.218
  • 3. ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011 frame size achieves 76.79%, and superior to the other B. Better Combination of Various Features with75.71%, as shown in Table 1. To improve furthermore the performance, two spectrum Table 1: comparing the frame size, (SymbolNum=64) features, MFCC and cluster number, of speeches are wave Mfcc VQ HMM Symbol rate(%) unified and evaluated. MFCC degree varied from 8 to 36 Num time time training Num with interval 4 and cluster number varied on 32 to 256 I 280 90.36 with interval 32. We evaluated all the combination for fixed 32.9 5.77 3.44 64 these two features with various numbers. The process O 280 76.79* times needed for computation are shown in Table 2. The I 280 92.50* best results can achieve on MFCC Number= 20 and VQ dynamic 32.0 3.31 2.42 64 O 280 75.71 clustering number = 64. The inside and outside testing of PS. I and O denote the inside and outside testing, respectively recognition achieve 96.2% and 83.1% shown in Fig. 3 and net results for inside and outside testing are 3.7% and V. FURTHER IMPROVEMENT 6.3% respectively. We just list the results with VQ = 64 in A. Improving the Samples of Speech the paper. According to our empirical results, recognition rate Table 2: processed time with VQ = 64. achieve better results while cluster number=64. Inside and MFCC outside testing are 92.5% and 76.79%, respectively. degree 8 12 16 20 24 28 32 36 To improve the performance, we analyze all the MFCC 15.8 16.9 18.6 23.5 25.3 27.2 28.5 29.9 speech wavelet. There are many samples affected by boost noise derived from human speaking or environment, as VQ 1.0 2.6 3.3 3.4 3.8 4.9 5.3 6.6 shown in Fig. 1. In such a situation, the end points of HMM 1.7 1.7 1.8 1.8 1.8 1.8 1.9 1.9 boosted speech cannot be usually detected correctly. It will lead to degrade the performance of system. Usually, detecting end points judged on ZCR and energy of speech, as shown in Fig. 1. 
However, it is significant that we need extra features to detect for noise situation. Based on experimental results and observation, the improvement rules are summarized as follows: Input: X(n) , n = 1 to j Output: Y(m),1 <= m <= j 1. segment the speech X(n): framedY = framed (X(n)) 2. calculate the ZCR and energy for each frame. 3. smooth the curves for both ZCR and energy 4. calculate the average of first 10 frames, and Fig. 1: before improvement, Chinese number 8 (ㄅㄚ) multiplying 1.2. The average value will be used as the threshold for detecting process. 5. ZCR is valid only if framedY is larger than 100, as shown in Fig. 2. 6. the speech will be effective only if the size is larger than 3ms. 7. the starting energy of speech should be larger than threshold. 8. the energy for continuous 5 frames of speech . should be increased progressively. Referring to the improvement, the speeches number 8 (ㄅㄚ) with boost noise can be detected, as shown in Fig. 2. The improvement of detection will leads to better Fig. 2: after improvement, Chinese number 8 (ㄅㄚ). results for following recognition process. 29 © 2011 ACEEE DOI: 01.IJSIP.02.01.218
  • 4. ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011 100 Observations--A Combinatorial Method, IEEE 95 Transactions on Pattern Analysis and Machine performance(%) 90 Intelligence (PAMI), Vol. 22, No. 4. 85 [4] A. Sperduti and A. Starita, May 1997, Supervised 80 75 Neural Networks for Classification of Structures. IEEE 70 Inside Test(%) Transactions on Neural Networks, 8(3): pp.714-735. 65 Outside test(%) [6] E. Behrman, L. Nash, J. Steck, V. Chandrashekar, and 60 8 12 16 20 24 28 32 36 MFC C de gre e S. Skinner, October 2000, Simulations of Quantum Neural Networks, Information Sciences, 128(3-4): pp. 257-269. Fig. 3: performance with VQ = 64, MFCC degrees varied between 8 and [7] Hsien-Leing Tsai, 2004, Automatic Construction 36. Algorithms for Supervised Neural Networks and VI. CONCLUSION Applications, PhD thesis of NSYSU, Taiwan. [8] Li-Yi Lu, 2003, The Research of Neural Network and In this paper, we address the speaker independent Hidden Markov Model Applied on Personal Digital speech recognition of Chinese number speeches based on Assistant, Master thesis of CYU, Taiwan. HMM. The algorithm for our novel approach is proposed [10] Rabiner, L. R., 1989, A Tutorial on Hidden Markov for the speech recognition. 480 speech samples are Models and Selected Applications in Speech recorded and pre-processed. The preliminary results of Recognition, Proceedings of the IEEE, Vol.77, No.22, outside testing achieve 76.79%. pp.257-286. To improve furthermore the performance, two [11] Manfred R. Schroeder, H. Quast, H.W. Strube, features of speeches; MFCC and VQ cluster number, are Computer Speech: Recognition, Compression, evaluated. We then find the combination of two spectrum Synthesis , Springer, 2004. features to achieve best results. The best performance will [12] Wald, M., 2006, Learning Through Multimedia: be achieved on MFCC, Number = 20 and VQ clustering Automatic Speech Recognition Enabling Accessibility number = 64. 
The final inside and outside testing of and Interaction. Proceedings of ED-MEDIA 2006: recognition achieve 96.2% and 83.1%. It proves that the World Conference on Educational Multimedia, proposed approach can be employed to recognize the Hypermedia & Telecommunications. pp. 2965-2976. speaker independent speeches. [13]A. Revathi, R. Ganapathy and Y. Venkataramani, Nov. Future works will be studied in the following: 2009, Text Independent Speaker Recognition and 1) Employing other effective methods to merging novel Speaker Independent Speech Recognition Using method to enhance the performance. Iterative Clustering Approach, International Journal of 2) Applying the method into isolated Chinese speech Computer science & Information Technology (IJCSIT), recognition. Vol. 1, No 2, pp.30-42. 3) Improving the precision rates. [14]Haamid M. Gazi, Omar Farooq, Yusuf U. Khan, Sekharjit Datta, 2008, Wavelet-based, speaker- ACKNOWLEDGEMENT independent isolated Hindi digit recognition The paper is supported under the Project of Lein-Ho International Journal of Information and Foundation, Taiwan. Communication Technology, Vol. 1 , Issue 2 pp. 185-198 REFERENCES [15]Chakraborty P., et at., 2008, An Automatic Speaker [1] Keng-Yu Lin, 2006, Extended Discrete Hidden Recognition System, Neural Information Processing, Markov Model and Its Application to Chinese Syllable Lecture Notes in Computer Science (LNCS), Springer Recognition, Master thesis of NCHU, Taiwan. Berlin / Heidelberg, pp. 517-526. [2] Keng-Yu Lin, 2006, Extended Discrete Hidden [16] Kun-Ching Wang, 2009, Wavelet-Based Speech Markov Model and Its Application to Chinese Syllable Enhancement Using Time-Frequency Adaptation, Recognition, Master thesis of NCHU. EURASIP Journal on Advances in Signal Processing, [3] X. Li, M. Parizeau and R. Plamondon, April 2000, Volume 2009 (2009), Article ID 924135. Training Hidden Markov Models with Multiple 30 © 2011 ACEEE DOI: 01.IJSIP.02.01.218