How to choose ASR
AI & BIG DATA DAY
Hamolia Vladyslav
skype: vhamolya
ELEKS
Agenda
● History of ASR
● ASR challenges
● General overview of ASR processing
● Speech representation
● Implement ASR using HMM and DNN
● Open Source tools
● Q&A
History
Timeline figure (time axis: 1962, 1967, 1972, 1977, 1982, 1987, 1998, 2005+):
● Tasks: isolated words, continuous speech, connected digits, connected words, spoken dialog
● Techniques: filter-bank analysis; pattern recognition; HMM, stochastic language model; statistical learning, multi-layer perceptron; concatenative synthesis; machine learning, LSTM
ASR Challenges
● Large vocabulary
● Background noise
● Regional and social dialects
● Spoken language vs written language
● Spontaneous vs read speech
● Ambiguity
General Flow
Speech Representation
Fast Fourier transform (FFT):
- Converts the signal from the time domain to the frequency domain
- Shows the energy in different frequency bands
- Complex spectra: all information is preserved
- Supports sound source separation (sources overlap less in the time-frequency domain than in the time domain)
- Invertible: the complex spectrum can be transformed back into the time-domain signal
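The FFT properties listed above can be checked on a synthetic signal. This is a minimal sketch using NumPy; the 16 kHz sample rate and the two test tones (440 Hz and 1000 Hz) are assumptions chosen for illustration, not values from the slides.

```python
import numpy as np

sample_rate = 16000                        # Hz (assumed for illustration)
t = np.arange(sample_rate) / sample_rate   # one second of sample times
# Two sine tones; the 440 Hz tone is twice as loud as the 1000 Hz tone.
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1000 * t)

# Time domain -> frequency domain (complex spectrum of a real signal).
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)

# Energy per frequency bin.
energy = np.abs(spectrum) ** 2
peak_hz = freqs[np.argmax(energy)]
top_two_hz = sorted(freqs[np.argsort(energy)[-2:]])

# The complex spectrum keeps all information, so the transform is invertible.
reconstructed = np.fft.irfft(spectrum, n=len(signal))
```

Here `peak_hz` picks out the louder tone, and `reconstructed` matches `signal` up to floating-point error.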
Spectrum Estimation
The spectrum of an audio signal is typically estimated in short consecutive segments called frames:
- Real audio signals are not stationary; they vary over time
- Framewise processing assumes the signal is stationary within each frame
- Frame length for audio applications: between 10 ms and 100 ms
- Frame length for ASR: 25 ms
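Framewise spectrum estimation can be sketched as follows. The 25 ms frame length comes from the slide; the 10 ms hop, the Hamming window, and the 16 kHz sample rate are assumptions (a 10 ms hop corresponds to the 100 Hz frame rate mentioned later in the deck). `frame_signal` is a hypothetical helper name.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a 1-D signal into overlapping, Hamming-windowed frames."""
    frame_len = int(round(sample_rate * frame_ms / 1000))
    hop_len = int(round(sample_rate * hop_ms / 1000))
    if len(signal) < frame_len:
        return np.empty((0, frame_len))
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    # Row i holds the sample indices of frame i.
    idx = np.arange(frame_len)[None, :] + hop_len * np.arange(n_frames)[:, None]
    return signal[idx] * np.hamming(frame_len)

sample_rate = 16000                                              # assumed
signal = np.random.default_rng(0).standard_normal(sample_rate)   # 1 s of noise
frames = frame_signal(signal, sample_rate)

# One power spectrum per frame; each frame is treated as stationary.
power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
```

With these settings each frame holds 400 samples and consecutive frames start 160 samples apart.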
MFCC
Common path:
- Take the FFT of a signal frame
- Map the powers of the resulting spectrum onto the mel scale
- Take the logs of the mel powers, then apply the discrete cosine transform (DCT)
- The MFCCs are the amplitudes of the resulting spectrum
Bhiksha Raj, Rita Singh: Techniques for Noise Robust Automatic Speech Recognition
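The MFCC steps above can be sketched with NumPy alone. The filter count (26), FFT size (512), number of coefficients (13), the mel formula constants, and all function names are assumptions taken from common practice, not from the slides; production toolkits add details such as pre-emphasis and liftering that are omitted here.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):          # rising edge
            fbank[i, k] = (k - left) / (center - left)
        for k in range(center, right):         # falling edge
            fbank[i, k] = (right - k) / (right - center)
    return fbank

def dct_ii(x, n_coeffs):
    """Unnormalized DCT-II along the last axis, keeping n_coeffs coefficients."""
    n = x.shape[-1]
    k = np.arange(n_coeffs)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * m + 1) / (2 * n))
    return x @ basis.T

def mfcc(frames, sample_rate, n_fft=512, n_filters=26, n_coeffs=13):
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2        # FFT powers
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)                              # log mel powers
    return dct_ii(log_mel, n_coeffs)                                  # cosine transform

# Stand-in input: 98 windowed frames of 400 samples (25 ms at an assumed 16 kHz).
rng = np.random.default_rng(0)
frames = rng.standard_normal((98, 400)) * np.hamming(400)
features = mfcc(frames, sample_rate=16000)
```

Each row of `features` is one frame's MFCC vector; the small constant inside the logarithm only guards against taking the log of zero.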
Acoustic Model
- Phonemes are the fundamental units
- “cat” -> /k/, /æ/, /t/
- Each phoneme is split into 3 HMM states
- ~10% advantage from modeling phonemes rather than whole words
- Training a separate model for each word requires a lot of data
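A small sketch of why phoneme units need less data than whole-word models: words that share phonemes share states. The two-word lexicon, the ARPAbet-style symbols, the state naming scheme, and the `word_to_states` helper are all hypothetical.

```python
# Hypothetical pronunciation lexicon (ARPAbet-style symbols).
lexicon = {
    "cat": ["k", "ae", "t"],
    "cap": ["k", "ae", "p"],
}

def word_to_states(word, states_per_phoneme=3):
    """Expand a word into its sequence of per-phoneme state names."""
    return [
        f"{phoneme}_{i}"
        for phoneme in lexicon[word]
        for i in range(1, states_per_phoneme + 1)
    ]

cat_states = word_to_states("cat")
cap_states = word_to_states("cap")

# States for /k/ and /ae/ are shared, so training data for either word
# helps to estimate the models used by both.
shared = sorted(set(cat_states) & set(cap_states))
```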
Phonemes
Pronunciation variants of four words:
- probably: p r aa b iy; p r ay b i; p r aw i uh; p r aa i iy; p r aa b uw; p ow ih; p aa iy; p aa b uh b l iy; p aa ah iy
- sense: s eh n t s; s ih t s
- everybody: eh v r ax b ax d iy; eh v er ax d iy; eh ux b ax iy; eh r uw ay; eh b ah iy
- don’t: d ow n; d ow; d ow n t; d ow t; d ah n; ow; n ax; d ax n; ax; n uw
Formant space of vowels
HMM-based Recogniser
Optimization problem:
- O: observation (features)
- P(O | S): acoustic model
- P(S | W): pronunciation model
- P(W): language model
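The formula on this slide did not survive extraction. The standard formulation that combines the three models listed above is sketched here, assuming that W is a word sequence, S is a sequence of sub-word (phonetic) states, and O depends on W only through S:

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} P(O \mid W)\, P(W)
        = \arg\max_{W} \sum_{S} P(O \mid S)\, P(S \mid W)\, P(W)
```

In practice the sum over S is often approximated by its largest term, which is what Viterbi decoding computes.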
HMM-based Recogniser
- For each training example, use the current HMM models to assign feature vectors to HMM states
- Use the Viterbi algorithm to find the most likely path through the composite HMM model
- Group the feature vectors assigned to each HMM state
- Fit a GMM to each group to compute P(O|S) (the acoustic model)
Viterbi Algorithm
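The slide's figure is not preserved, so here is a minimal sketch of Viterbi decoding in the log domain. The two-state left-to-right HMM and all probabilities are toy values made up for illustration; `viterbi` is a hypothetical helper, not code from any toolkit.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path and its log probability.

    log_init:  (S,)   log P(state at t=0)
    log_trans: (S, S) log P(next state j | current state i), indexed [i, j]
    log_emit:  (T, S) log P(observation at t | state)
    """
    n_frames, n_states = log_emit.shape
    score = np.empty((n_frames, n_states))
    backptr = np.zeros((n_frames, n_states), dtype=int)
    score[0] = log_init + log_emit[0]
    for t in range(1, n_frames):
        cand = score[t - 1][:, None] + log_trans      # cand[i, j]: arrive in j from i
        backptr[t] = np.argmax(cand, axis=0)          # best predecessor of each state
        score[t] = cand[backptr[t], np.arange(n_states)] + log_emit[t]
    # Trace back from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(n_frames - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(np.max(score[-1]))

# Toy left-to-right HMM: start in state 0, never move back from state 1.
with np.errstate(divide="ignore"):                    # log(0) = -inf is intended
    log_init = np.log(np.array([1.0, 0.0]))
    log_trans = np.log(np.array([[0.7, 0.3],
                                 [0.0, 1.0]]))
    log_emit = np.log(np.array([[0.9, 0.1],
                                [0.8, 0.2],
                                [0.2, 0.8],
                                [0.1, 0.9]]))

best_path, best_logp = viterbi(log_init, log_trans, log_emit)
```

In Viterbi training, `best_path` is the alignment used to assign each feature vector to an HMM state.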
Language Model
- Word sequence
- Bigram approximation
- N-gram approximation
- ... and LSTMs
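The bigram approximation replaces the probability of each word given its full history with the probability given only the previous word. Below is a minimal maximum-likelihood sketch on a hypothetical two-sentence corpus; it uses no smoothing, so any unseen bigram gets probability zero. The corpus, the `<s>` and `</s>` boundary markers, and the function names are assumptions.

```python
from collections import Counter

# Hypothetical toy training corpus.
corpus = [
    "drop five forms in the box",
    "drop the forms in the box",
]

history_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split() + ["</s>"]
    history_counts.update(words[:-1])                  # every word that has a successor
    bigram_counts.update(zip(words[:-1], words[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood P(word | prev); 0.0 if prev was never seen."""
    if history_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / history_counts[prev]

def sentence_prob(sentence):
    """P(W) under the bigram approximation: product of P(w_i | w_{i-1})."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words[:-1], words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

p_seen = sentence_prob("drop five forms in the box")
p_unseen = sentence_prob("drop wave forms in the box")
```

A language model like this is what lets a recognizer prefer "five forms" over an acoustically similar but unseen word sequence.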
DNN
● Two ways of using a DNN for the ASR task:
○ Extracting nonlinear features (then modeling them with a GMM)
○ Estimating phonetic probabilities
● Train the network as a classifier with a softmax across the phonetic units
● The output will converge to the posterior across phonetic states
● Architectures
○ Fully connected
○ Convolutional networks (CNNs)
○ Recurrent (LSTMs, GRUs)
● Dependencies are not long at speech frame rates (100 Hz)
Diagram (input to output): Log Mel -> Conv -> LSTM -> LSTM -> LSTM -> DNN
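The "classifier with a softmax across the phonetic units" can be sketched as a forward pass plus the cross-entropy loss that training would minimize. This is an untrained one-hidden-layer network with random weights; the layer sizes, the number of states, and the random targets are illustrative assumptions, and no training loop is shown.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_frames, n_features, n_hidden, n_states = 5, 40, 64, 9   # illustrative sizes

features = rng.standard_normal((n_frames, n_features))     # stand-in for log-mel frames
w1 = rng.standard_normal((n_features, n_hidden)) * 0.1     # random, untrained weights
w2 = rng.standard_normal((n_hidden, n_states)) * 0.1

hidden = np.maximum(features @ w1, 0.0)                     # fully connected layer + ReLU
posteriors = softmax(hidden @ w2)                           # one distribution per frame

# Frame-level cross-entropy against (here random) phonetic state labels.
targets = rng.integers(0, n_states, size=n_frames)
cross_entropy = -np.mean(np.log(posteriors[np.arange(n_frames), targets]))
```

Minimizing `cross_entropy` over labeled frames is what drives each row of `posteriors` toward the posterior over phonetic states.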
Open Source ASR
● Offline tools
○ CMUSphinx
○ Kaldi
○ Julius
● Libs
○ Time tools
○ Automatic Speech Recognition
○ KerasDeepSpeech
Results
Implementation details:
- Lib: Automatic_Speech_Recognition
- Dataset: TIMIT
- Architecture: BiLSTM
- Speakers: 2
Target: Elderly people are often excluded.
Predicted: Early people are often excluded.
Target: Drop five forms in the box before you go out.
Predicted: Drop wave forms in the box before you got it.
Target: one who writes of such an era labours under a troublesome disadvantage
Predicted: one how rights of such an er a labours onder a troubles hom disadvantage
Target: Don't ask me to carry an oily rag like that.
Predicted: Don't ask me to carefully rog like that.
Target: Calcium makes bones and teeth strong.
Predicted: Calcium makes bones and tea strong.
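The slides do not report a metric for these transcripts. One common way to score such target/predicted pairs is word error rate (WER), sketched here as an assumption on two of the pairs above. `word_errors` is a hypothetical helper based on word-level edit distance.

```python
def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))           # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                            # deletion
                curr[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),   # substitution or match
            ))
        prev = curr
    return prev[-1], len(ref)

pairs = [
    ("Elderly people are often excluded.", "Early people are often excluded."),
    ("Calcium makes bones and teeth strong.", "Calcium makes bones and tea strong."),
]

total_errors = sum(word_errors(target, predicted)[0] for target, predicted in pairs)
total_words = sum(word_errors(target, predicted)[1] for target, predicted in pairs)
wer = total_errors / total_words
```

Each of these two pairs differs by a single substituted word, so `wer` is 2 errors over 11 reference words.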
Q&A
Links
1. TIMIT dataset
2. http://www.speech.sri.com/projects/srilm/
3. Human parity in speech recognition
4. Comparison of cloud solutions and open source

Vladyslav Hamolia "How to choose ASR (automatic speech recognition) system"
