Speech recognition final

Speech Recognition
Guide:
Prof. Tanish Zaveri
Archit Vora(09bec101)
Shrey Patel(09bec066)

Algorithms
Pre-emphasis
LPC
VAD
MFCC
GMM
LBG algorithm for VQ
K-means for VQ
HMM
Log distance
DTW
Euclidean Distance

Physiological Model
Nasal Voice
Voiced Speech
Unvoiced Speech
Pitch (about 100 Hz)
– Depends on the frequency of the glottal pulses
First formant frequency (about 500 Hz)
– Depends on the length of the vocal tract
Velocity of Sound = 340 m/s
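The 500 Hz figure is consistent with treating the vocal tract as a uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c/(4L). The sketch below assumes a tract length of 17 cm, a typical textbook value that is not stated on the slide.

```python
SPEED_OF_SOUND = 340.0  # m/s, as given on the slide


def formant_hz(k, tract_length_m, c=SPEED_OF_SOUND):
    """k-th resonance (k = 1, 2, ...) of a uniform tube closed at one end."""
    return (2 * k - 1) * c / (4.0 * tract_length_m)


# Assumed tract length of 0.17 m (illustrative value, not from the slides).
first_three = [formant_hz(k, 0.17) for k in (1, 2, 3)]
print(first_three)  # roughly 500, 1500 and 2500 Hz
```

A shorter tract moves every resonance up, which is why the formants depend on vocal tract length.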

Representation of Speech
Time Domain
Spectrogram

Pre-emphasis
The spectrum of voiced segments has more
energy at lower frequencies than at higher
frequencies.
– Figure: formant frequencies before and after pre-emphasis

Pre-emphasis
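Transfer function: H(z) = 1 - 0.98 z^-1
– Filter coefficients: H=[1 -0.98];
– Zero at z = 0.98
– Boosts higher frequencies
– First-order FIR filter with nearly linear phase
(exactly linear only for taps [1 -1])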
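In the time domain the filter above is the difference equation y[n] = x[n] - 0.98 x[n-1]. A minimal Python sketch follows; the function name and the choice to treat the sample before the first one as zero are assumptions made for illustration.

```python
def pre_emphasis(x, alpha=0.98):
    """Apply y[n] = x[n] - alpha * x[n-1], taking the sample before x[0] as 0."""
    y = []
    prev = 0.0
    for sample in x:
        y.append(sample - alpha * prev)
        prev = sample
    return y


dc = pre_emphasis([1.0] * 5)          # constant input (0 Hz)
alt = pre_emphasis([1.0, -1.0] * 3)   # alternating input (half the sampling rate)
print(dc[-1], alt[-1])
```

After the first sample the constant input is cut to about 0.02 while the alternating input grows to a magnitude of about 1.98, which is the high-frequency boost described on the slide.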

Algorithms
Pre-emphasis
MFCC
GMM

MFCC
Windowing
DFT
Mel filter bank
Log of square
DCT (used in place of the IDFT)

Windowing
Speech is not a stationary signal; we want
information about a small enough region that
the spectral information is a useful cue.
We used a Hamming window of 256
samples
Frames:
– Frame size: typically, 10-25ms
– Frame shift: the length of time between
successive frames, typically, 5-10ms
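Framing and windowing can be sketched as below. The 256-sample frame length is from the slide; the 128-sample frame shift and the helper names are assumptions chosen for illustration, and the sampling rate is left open because the slides do not state it.

```python
import math


def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_signal(x, frame_len=256, hop=128):
    """Split x into overlapping frames and multiply each by a Hamming window."""
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frames.append([x[start + n] * window[n] for n in range(frame_len)])
    return frames


signal = [1.0] * 1024            # placeholder signal, not real speech
frames = frame_signal(signal)
print(len(frames), len(frames[0]))
```

Each frame is tapered toward its edges, so the abrupt cuts at the frame boundaries contribute less spectral leakage to the DFT that follows.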

DFT
Needed because the mel filter bank takes its
input in the frequency domain
Filtering by multiplication in the frequency domain
takes fewer operations than convolution in time

Mel filter bank
Filters are spaced linearly at low frequencies
and logarithmically at higher frequencies

Why mel scale?
Human hearing is not equally sensitive to all
frequency bands
It is less sensitive at higher frequencies, roughly
above 1000 Hz
A mel is a unit of pitch
– Definition:
• Pairs of sounds perceptually equidistant in pitch are
separated by an equal number of mels
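One widely used approximation of the mel scale is mel = 2595 log10(1 + f/700). The slides do not say which formula the project used, so the sketch below is an assumption; the filter count and band edges are illustrative values as well.

```python
import math


def hz_to_mel(f_hz):
    """Common mel approximation (assumed here, not confirmed by the slides)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low_hz, f_high_hz):
    """Filter centre frequencies equally spaced in mels between the band edges."""
    lo, hi = hz_to_mel(f_low_hz), hz_to_mel(f_high_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(20, 0.0, 4000.0)  # hypothetical settings
print([round(c) for c in centers])
```

Equal steps in mels give centre frequencies that sit close together at the low end and spread out toward the high end, matching the filter bank spacing described earlier.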

Log of square
Why?
– Phase information is of little use in speech recognition
– Makes frequency estimates less sensitive to
slight variations in input (power variation due
to the speaker’s mouth moving closer to the microphone)
– Helps in separating the source and filter
log(s*f) = log(s) + log(f)
– s → source, f → filter

Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency scale
Separates the source and filter
The IDFT (computed as a DCT) de-correlates the features
– Makes the diagonal-covariance assumption in HMM modeling more accurate
Alternatives
– PLP(Perceptual Linear Prediction)
– LPC based

GMM
GMM: Gaussian mixture model
Uses an 8-component GMM per digit to train on and recognize an
individual user’s voice
An 8-Gaussian model means 8x39 mean values for a word such as ‘one’
– 39 is the number of cepstral coefficients per frame
– In the initial stage we have 12 coefficients
Matlab Functions
– gmdistribution.fit
– posterior

GMM implementation
A gmm object is created during training for
each dictionary entry, in this case the digits 0-9,
using the function call gmdistribution.fit
‘posterior’ accepts a gmm object/model as its
input, along with an input data set, and
returns (as its second output) the negative log-likelihood,
which measures how well the data set matches the model
Pattern comparison: log-likelihood
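The scoring step can be sketched outside MATLAB as follows. This is not the gmdistribution API; it is a plain Python illustration that assumes diagonal covariances, and the two tiny 2-D models and the test frames are invented for the example.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log( sum_k w_k * N(x | mean_k, var_k) )."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        peak = max(comps)  # log-sum-exp keeps the sum numerically stable
        total += peak + math.log(sum(math.exp(c - peak) for c in comps))
    return total


def recognise(frames, models):
    """Return the name of the model with the highest log-likelihood."""
    return max(models, key=lambda name: gmm_log_likelihood(frames, *models[name]))


# Hypothetical models: (weights, means, variances) per word.
models = {
    "one": ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]),
    "two": ([0.5, 0.5], [[5.0, 5.0], [6.0, 6.0]], [[1.0, 1.0], [1.0, 1.0]]),
}
test_frames = [[0.2, -0.1], [0.9, 1.1], [0.4, 0.6]]
best = recognise(test_frames, models)
print(best)
```

Picking the largest log-likelihood is equivalent to picking the smallest negative log-likelihood, so the same comparison applies to the value returned by ‘posterior’.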

HMM
Evaluation problem :
– Simple formula
– Forward Algorithm
Decoding problem :
– Trellis Algorithm
– Viterbi Algorithm
Learning problem :
– Baum-Welch algorithm
– EM (expectation maximization) method
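For a discrete HMM, where each observation is a symbol index such as a VQ codeword, the evaluation and decoding problems can be sketched as below. The two-state model and its probabilities are made-up illustration values, not parameters from the project.

```python
def forward(obs, start, trans, emit):
    """Evaluation problem: P(obs | model) by the forward algorithm."""
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Decoding problem: most likely state sequence and its probability."""
    n = len(start)
    delta = [start[i] * emit[i][obs[0]] for i in range(n)]
    back = []
    for o in obs[1:]:
        # Best predecessor of each state j, using the previous step's delta.
        prev = [max(range(n), key=lambda i: delta[i] * trans[i][j])
                for j in range(n)]
        delta = [delta[prev[j]] * trans[prev[j]][j] * emit[j][o]
                 for j in range(n)]
        back.append(prev)
    state = max(range(n), key=lambda i: delta[i])
    best_prob = delta[state]
    path = [state]
    for prev in reversed(back):   # trace the back-pointers to the start
        state = prev[state]
        path.append(state)
    return path[::-1], best_prob


# Hypothetical 2-state model with 3 observation symbols.
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]
obs = [0, 1, 2]

likelihood = forward(obs, start, trans, emit)
path, path_prob = viterbi(obs, start, trans, emit)
print(likelihood, path, path_prob)
```

The forward pass sums over all state paths while Viterbi keeps only the best one, so the path probability can never exceed the total likelihood.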

Vector Quantization
LBG algorithm
– Linde–Buzo–Gray algorithm
39*60 (approx.) → 39*16
Saves memory
Saves processing
Simplifies comparisons

Vector Quantization-LBG
1. Determine the number of codewords, N, or the
size of the codebook
2. Select N codewords at random, and let that be
the initial codebook
3. Using the Euclidean distance measure, cluster
the vectors around each codeword
4. Compute the new set of codewords as the
centroid of each cluster
5. Repeat steps 3 and 4 until either the
codewords don't change or the change in the
codewords is small.
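The five steps above can be sketched in Python as follows. This follows the random-initialisation procedure as listed on the slide; the classic LBG formulation instead grows the codebook by splitting codewords. The function names, tolerance, seed and toy 2-D data are assumptions for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def train_codebook(vectors, n_codewords, tol=1e-9, max_iter=100, seed=0):
    rng = random.Random(seed)
    codebook = [list(v) for v in rng.sample(vectors, n_codewords)]   # step 2
    for _ in range(max_iter):
        clusters = [[] for _ in codebook]
        for v in vectors:                                            # step 3
            clusters[nearest(v, codebook)].append(v)
        new_codebook = []
        for old, members in zip(codebook, clusters):                 # step 4
            if members:
                new_codebook.append([sum(m[d] for m in members) / len(members)
                                     for d in range(len(old))])
            else:
                new_codebook.append(old)   # empty cluster: keep old codeword
        shift = max(sq_dist(o, n) for o, n in zip(codebook, new_codebook))
        codebook = new_codebook
        if shift <= tol:                                             # step 5
            break
    return codebook


# Toy data: two well-separated groups of 2-D vectors.
vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
           [10.0, 10.0], [10.0, 11.0], [11.0, 10.0], [11.0, 11.0]]
codebook = sorted(train_codebook(vectors, 2))
print(codebook)
```

In the recognizer each vector would be a 39-dimensional feature frame and the codebook would hold 16 codewords, which is the 39*60 to 39*16 reduction mentioned earlier.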

Vector Quantization-k means
In k-means, each centroid is moved closer to the data points
Otherwise the same as LBG

Pattern Comparisons
Log distance
DTW
Euclidean Distance
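DTW with a Euclidean local distance can be sketched as below. The step pattern (match, insertion, deletion with equal weight) and the one-dimensional toy sequences are assumptions; real input would be sequences of feature vectors.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Minimum accumulated distance over all monotonic alignments."""
    inf = float("inf")
    n, m = len(seq_a), len(seq_b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # seq_b frame repeated
                                 cost[i][j - 1],      # seq_a frame repeated
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


template = [[0.0], [1.0], [2.0], [1.0], [0.0]]
slow = [[0.0], [0.0], [1.0], [1.0], [2.0], [2.0], [1.0], [0.0]]  # stretched copy
other = [[2.0], [2.0], [0.0], [2.0], [2.0]]

print(dtw_distance(template, slow), dtw_distance(template, other))
```

The stretched copy aligns with the template at zero cost even though the lengths differ, which is why DTW suits words spoken at different speeds.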

Database
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Nirma
9. Linux

Result
Full efficiency with the VQ approach
It has opened a clear path to HMM
– HMM can be applied with Gaussian mixtures or
vector quantization
– But Gaussian mixtures are mathematically difficult
and also require more computation

Conclusion
Whole process in a nutshell
Multidisciplinary area
– DSP: heart of the process
Applications
– Machine interaction
– Phone dialing (especially while driving)
– Voice-activated routing
We need HMM!

Thank you !
Tanish Sir
Ruchi madam
Nirma University
Lawrence Rabiner
Thomas Quatieri
Mike Brookes (Voicebox)
Open Source Community
Kevin Murphy (HMM toolbox)
