Speech Recognition
Guide:
Prof. Tanish Zaveri
Archit Vora(09bec101)
Shrey Patel(09bec066)
Algorithms
– Pre-emphasis
– LPC
– VAD
– MFCC
– GMM
– LBG algorithm for VQ
– K-means for VQ
– HMM
– Log distance
– DTW
– Euclidean distance
Physiological Model
Nasal Voice
Voiced Speech
Unvoiced Speech
Pitch (100 Hz)
– Depends on frequency of glottal pulses
Formant frequency (500 Hz)
– Depends on length of vocal tract
Velocity of Sound = 340 m/s
Representation of Speech
Time Domain
Spectrogram
Pre-emphasis
The spectrum for voiced segments has more
energy at lower frequencies than higher
frequencies.
– Figure: formant frequencies before and after pre-emphasis
Pre-emphasis
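Transfer function:
– H(z) = 1 - 0.98 z^-1 (MATLAB coefficients: H=[1 -0.98];)
– Zero at z = 0.98
– Boosts higher frequencies
– First-order FIR filter; the phase is only approximately linear, because the coefficients are not exactly antisymmetric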
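The filter coefficients above correspond to the difference equation y[n] = x[n] - 0.98 x[n-1]. A minimal Python sketch of that equation follows (the deck itself names MATLAB functions; the function name `pre_emphasis` and the choice to pass the first sample through unchanged are assumptions made for illustration).

```python
def pre_emphasis(signal, alpha=0.98):
    """Apply y[n] = x[n] - alpha * x[n-1], i.e. the FIR filter H = [1, -alpha]."""
    if not signal:
        return []
    # The first sample has no predecessor, so it passes through unchanged
    # (equivalent to assuming x[-1] = 0).
    out = [signal[0]]
    for n in range(1, len(signal)):
        out.append(signal[n] - alpha * signal[n - 1])
    return out


# A constant (zero-frequency) input is strongly attenuated after the first sample,
# while a sample-to-sample alternating input (highest frequency) is nearly doubled.
dc = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

Here `dc` settles to about 0.02 per sample and `alternating` to about ±1.98, which is the high-frequency boost the slide describes.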
Pre-emphasis
Algorithms
Pre-emphasis
MFCC
GMM
MFCC
Windowing
DFT
Mel filter bank
Log of square
IDCT
MFCC
Windowing
Speech is not a stationary signal; we want
information about a small enough region that
the spectral information is a useful cue.
We used a Hamming window of 256
samples
Frames:
– Frame size: typically, 10-25ms
– Frame shift: the length of time between
successive frames, typically, 5-10ms
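Framing and windowing can be sketched as below. The 256-sample frame length comes from the slide; the 128-sample frame shift and the function names are assumptions chosen only for illustration, and no sampling rate is implied.

```python
import math


def hamming(n_samples):
    """Hamming window: w[n] = 0.54 - 0.46 * cos(2*pi*n / (N - 1))."""
    if n_samples == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]


def frame_signal(signal, frame_len=256, frame_shift=128):
    """Split a signal into overlapping frames and window each one.

    frame_shift=128 is an illustrative assumption, not a value from the deck.
    Trailing samples that do not fill a whole frame are dropped.
    """
    window = hamming(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, frame_shift):
        frame = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames
```

The window tapers each frame toward its edges, which reduces the spectral leakage that an abrupt cut would cause.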
Windowing
DFT
The mel filter bank requires its input in the
frequency domain
Multiplication in the frequency domain is cheaper to compute than
convolution in the time domain
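To make the frequency-domain step concrete, here is a deliberately naive power-spectrum sketch. It evaluates the DFT sum directly in O(N^2) time for readability; a real system would use an FFT routine. The function name is hypothetical.

```python
import cmath
import math


def power_spectrum(frame):
    """Return |X[k]|^2 for k = 0 .. N//2 using the direct DFT sum."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


# An 8-sample cosine with exactly 2 cycles per frame puts its energy in bin 2.
tone = [math.cos(2 * math.pi * 2 * t / 8) for t in range(8)]
bins = power_spectrum(tone)
```

Only the squared magnitude is kept, so the phase of each bin is discarded at this point.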
Mel filter bank
Filter spacing is linear at low frequencies,
then logarithmic at higher frequencies
Why the mel scale?
Human hearing is not equally sensitive to all
frequency bands
Less sensitive at higher frequencies, roughly >
1000 Hz
A mel is a unit of pitch
– Definition:
• Pairs of sounds perceptually equidistant in pitch are
separated by an equal number of mels
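The slides do not state a formula for the mel scale. One widely used approximation, assumed here, is mel = 2595 log10(1 + f/700). The sketch below uses it to place filter center frequencies uniformly in mels; all function names are hypothetical.

```python
import math


def hz_to_mel(f_hz):
    """Common mel approximation (an assumption; the deck gives no formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_low, f_high):
    """Center frequencies in Hz of n_filters spaced uniformly on the mel scale."""
    lo, hi = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + step * (i + 1)) for i in range(n_filters)]


centers = mel_center_frequencies(10, 0.0, 4000.0)
gaps = [b - a for a, b in zip(centers, centers[1:])]
```

Equal steps in mels become progressively wider steps in Hz, so `gaps` grows from one filter to the next: fine resolution at low frequencies and coarse resolution at high ones.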
Log of square
Why?
– Phase information is not very useful in speech, so only the squared magnitude is kept
– Makes frequency estimates less sensitive to
slight variations in input (power variation due
to the speaker’s mouth moving closer to the mic)
– Helps in separating the source and filter:
log(s*f) = log(s) + log(f)
– s: source, f: filter
MFCC
Why is MFCC so popular?
Efficient to compute
Incorporates a perceptual Mel frequency scale
Separates the source and filter
The IDFT (implemented as a DCT) decorrelates the features
– Supports the diagonal-covariance assumption in HMM modeling
Alternative
– PLP(Perceptual Linear Prediction)
– LPC based
GMM
GMM: Gaussian mixture model
Uses an 8-component GMM per digit to train on and recognize an
individual user's voice
An 8-Gaussian model for the word ‘one’ has 8x39 mean values
– 39 is the number of cepstral coefficients
– In the initial stage we have 12 coefficients
Matlab Functions
– gmdistribution.fit
– posterior
GMM implementation
A gmm object is created during training for
each dictionary entry, in this case digits 0-9,
using the function call gmdistribution.fit
‘posterior’ accepts a gmm object/model as its
input, along with an input data set, and
can return, as its second output, the negative log-likelihood, a number that
represents how well the data set matches the model
Pattern comparison: log-likelihood
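The scoring step can be sketched in plain Python as follows. This is not the MATLAB implementation: it assumes diagonal covariances and already-trained parameters (training by EM is not shown), and `recognize` and the tuple layout of `models` are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(frames, weights, means, variances):
    """Sum over frames of log(sum_k w_k * N(x | mean_k, var_k))."""
    total = 0.0
    for x in frames:
        logs = [math.log(w) + log_gaussian_diag(x, m, v)
                for w, m, v in zip(weights, means, variances)]
        peak = max(logs)  # log-sum-exp trick for numerical stability
        total += peak + math.log(sum(math.exp(l - peak) for l in logs))
    return total


def recognize(frames, models):
    """Return the label whose GMM gives the highest log-likelihood.

    models maps label -> (weights, means, variances).
    """
    return max(models, key=lambda label: gmm_log_likelihood(frames, *models[label]))


# Toy one-dimensional models, invented purely for illustration.
toy_models = {
    "one": ([0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]]),
    "two": ([0.5, 0.5], [[5.0], [6.0]], [[1.0], [1.0]]),
}
best = recognize([[0.2], [0.8]], toy_models)
```

Each dictionary entry is scored against the same feature frames, and the entry with the highest log-likelihood is reported as the recognized word.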
HMM
Evaluation problem :
– Simple formula
– Forward Algorithm
Decoding problem :
– Trellis Algorithm
– Viterbi Algorithm
Learning problem :
– Baum-Welch algorithm
– EM (expectation maximization) method
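The evaluation and decoding problems can be sketched for a small discrete-observation HMM as below. Discrete symbols are an assumption (they would correspond to codebook indices when vector quantization is used), and the toy probabilities are invented for illustration. Baum-Welch training is not shown.

```python
def forward(obs, start, trans, emit):
    """Evaluation problem: P(obs | model) via the forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
                 for s in range(n_states)]
    return sum(alpha)


def viterbi(obs, start, trans, emit):
    """Decoding problem: most likely state sequence for obs."""
    n_states = len(start)
    delta = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for o in obs[1:]:
        prev = [max(range(n_states), key=lambda p: delta[p] * trans[p][s])
                for s in range(n_states)]
        delta = [delta[prev[s]] * trans[prev[s]][s] * emit[s][o]
                 for s in range(n_states)]
        back.append(prev)
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for prev in reversed(back):
        state = prev[state]
        path.append(state)
    return path[::-1]


# Toy 2-state, 2-symbol model (values are illustrative assumptions).
start = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
likelihood = forward([0, 0, 1], start, trans, emit)
best_path = viterbi([0, 0, 1], start, trans, emit)
```

For long utterances the products above underflow, so practical implementations work with log probabilities or rescale `alpha` at every step.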
HMM
Vector Quantization
LBG algorithm
– Linde–Buzo–Gray algorithm
39x60 (approx.) → 39x16
Saves memory
Saves processing
Simplifies comparisons
Vector Quantization-LBG
1. Determine the number of codewords, N, i.e. the
size of the codebook
2. Select N codewords at random, and let them be
the initial codebook
3. Using the Euclidean distance measure, cluster
the vectors around each codeword
4. Compute the new set of codewords as the centroid of each cluster
5. Repeat steps 3 and 4 until either the
codewords don't change or the change in the
codewords is small.
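The procedure listed above can be sketched as follows. It follows the slide's random initialization (the LBG algorithm as usually described initializes by splitting codewords instead), repeats the clustering and centroid-update steps until the codebook stops moving, and uses hypothetical names plus a fixed seed for reproducibility.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, codebook):
    """Index of the codeword closest to vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(vector, codebook[i]))


def train_codebook(vectors, n_codewords, max_iters=100, tol=1e-9, seed=0):
    """Train a VQ codebook by alternating clustering and centroid updates."""
    rng = random.Random(seed)
    # Step 2: pick N training vectors at random as the initial codebook.
    codebook = [list(v) for v in rng.sample(vectors, n_codewords)]
    for _ in range(max_iters):
        # Step 3: assign every vector to its nearest codeword.
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        # Step 4: each new codeword is the centroid of its cluster.
        new_codebook = []
        for old, members in zip(codebook, clusters):
            if members:
                new_codebook.append([sum(m[d] for m in members) / len(members)
                                     for d in range(len(old))])
            else:
                new_codebook.append(old)  # keep the codeword of an empty cluster
        # Step 5: stop once no codeword moved by more than the tolerance.
        shift = max(squared_distance(a, b) for a, b in zip(codebook, new_codebook))
        codebook = new_codebook
        if shift <= tol:
            break
    return codebook


points = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
codebook = sorted(train_codebook(points, 2))
```

After training, each feature vector is represented by the index returned by `nearest`, which is what reduces the roughly 60 vectors of an utterance to a 16-entry codebook in the deck's setup.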
Vector Quantization-k means
In k-means, move each centroid closer to the data points
Otherwise the same as LBG
Pattern Comparisons
Log distance
DTW
Euclidean Distance
DTW
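As an illustration of dynamic time warping, the sketch below aligns two sequences of feature vectors of different lengths and returns the accumulated Euclidean cost along the best alignment. The step pattern (insertion, deletion, match with equal weights) is a common textbook choice and an assumption here; the deck does not specify one.

```python
import math


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two sequences of feature vectors."""
    def dist(u, v):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best accumulated cost aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(seq_a[i - 1], seq_b[j - 1]) + best_prev
    return cost[n][m]


template = [[0.0], [1.0], [2.0]]
slower = [[0.0], [1.0], [1.0], [2.0]]   # same shape, spoken more slowly
different = [[0.0], [2.0], [2.0]]
```

A time-stretched copy of the template costs nothing to align, while a sequence with a different shape accumulates a positive cost, so the template with the lowest cost is taken as the match.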
Database
1. One
2. Two
3. Three
4. Four
5. Five
6. Six
7. Seven
8. Nirma
9. Linux
Implementation
Result
Full efficiency with VQ approach
This opens the way to HMMs
– HMMs can be applied with Gaussian mixtures or
vector quantization
– But Gaussian mixtures are mathematically more difficult
and also require more computation
Conclusion
Whole process in a nutshell
Multidisciplinary area
– DSP: heart of the process
Application
– Machine interaction
– Phone dialing (especially while driving)
– Voice-activated routing
We need HMM!
Thank you !
Tanish Sir
Ruchi madam
Nirma University
Lawrence Rabiner
Thomas Quatieri
Mike Brookes (Voicebox)
Open Source Community
Kevin Murphy (HMM toolbox)