Jinyu Li
Microsoft
Deep Learning for Speech Recognition in Cortana at AI NEXT Conference
} Review the deep learning trends for automatic
speech recognition (ASR) in industry
◦ Deep Neural Network (DNN)
◦ Long Short-Term Memory (LSTM)
◦ Connectionist Temporal Classification (CTC)
} Describe selected key technologies that make
deep learning models more effective in
production environments
[Figure: ASR system block diagram. Input speech s(n), W → Feature Analysis (Spectral Analysis) → X_n → Pattern Classification (Decoding, Search), supported by the Acoustic Model (HMM), Word Lexicon, and Language Model → Ŵ → Confidence Scoring → "Hey Cortana" (0.9) (0.8)]
} Word sequence: Hey Cortana
} Phone sequence: hh ey k ao r t ae n ax
} Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r
ao-r+t r-t+ae t-ae+n ae-n+ax n-ax+sil
} Every triphone is then modeled by a three-state HMM:
sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ......,
n-ax+sil[3]. The key problem is how to evaluate the state
likelihood given the speech signal.
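The expansion from phones to triphones to HMM states can be sketched in a few lines of Python. This is only an illustration of the naming scheme used above; the helper names `to_triphones` and `to_hmm_states` are hypothetical and not part of any toolkit mentioned in the talk.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right triphone names."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def to_hmm_states(triphones, states_per_model=3):
    """Attach state indices [1..states_per_model] to every triphone."""
    return [
        f"{tri}[{s}]"
        for tri in triphones
        for s in range(1, states_per_model + 1)
    ]


phones = "hh ey k ao r t ae n ax".split()
triphones = to_triphones(phones)
states = to_hmm_states(triphones)

print(" ".join(triphones))
print(len(states), "HMM states, first:", states[0], "last:", states[-1])
```

Each of the resulting state names is one target whose likelihood the acoustic model has to score for every speech frame.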
[Figure: HMM state labels sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ..., n-ax+sil[3]]
} ZH-CN is improved by 32% within one year!
[Figure: ZH-CN relative improvement (CERR), vertical axis 0 to 35, for GMM, MFCC CE DNN, LFB CE DNN, and LFB SE DNN]
CE: Cross Entropy training
SE: Sequence training
DNNs process speech frames independently
$h_t = \sigma(W_{hx} x_t + b)$
RNN considers temporal relation over speech frames.
Vulnerable to vanishing and exploding gradients
$h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b)$
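The difference between the two update rules can be made concrete with a small NumPy sketch. The layer sizes, random weights, and function names below are illustrative assumptions, not values from the talk; the only point is that the DNN layer sees one frame at a time, while the RNN layer also receives the previous hidden state.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def dnn_layer(x_t, W_hx, b):
    """h_t = sigma(W_hx x_t + b): depends on the current frame only."""
    return sigmoid(W_hx @ x_t + b)


def rnn_layer(x_t, h_prev, W_hx, W_hh, b):
    """h_t = sigma(W_hx x_t + W_hh h_{t-1} + b): also uses the history."""
    return sigmoid(W_hx @ x_t + W_hh @ h_prev + b)


# Toy dimensions and random parameters (assumptions for illustration).
rng = np.random.default_rng(0)
input_dim, hidden_dim, num_frames = 4, 3, 5
W_hx = rng.normal(size=(hidden_dim, input_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
frames = rng.normal(size=(num_frames, input_dim))

# DNN: every frame is processed independently.
dnn_out = np.stack([dnn_layer(x, W_hx, b) for x in frames])

# RNN: the hidden state is carried from frame to frame.
h = np.zeros(hidden_dim)
rnn_states = []
for x in frames:
    h = rnn_layer(x, h, W_hx, W_hh, b)
    rnn_states.append(h)
rnn_out = np.stack(rnn_states)

print(dnn_out.shape, rnn_out.shape)
```

With a zero initial state the two layers agree on the first frame and diverge afterwards, because only the RNN output depends on earlier frames.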
Memory cells store the history information
Various gates control the information flow inside the LSTM
Advantageous for learning long- and short-term temporal dependencies
[Figure: Relative WER reduction (WERR) of LSTM from DNN, vertical axis 0.00 to 20.00, on test sets SMD2015, VS2015, MobileC, Mobile, Win10C]
The HMM/GMM or HMM/DNN pipeline is highly complex
Multiple training stages: CI phone, CD senones, …
Various resources: lexicon, decision tree questions, …
Many hyper-parameters: number of senones, number of
Gaussians, …
[Figure: training stages: CI Phone, CD Senone, DNN/LSTM; systems: GMM, Hybrid]
LM building also requires large amounts of data and a complicated process
Writing an efficient decoder requires experts with years of experience
[Figure: End-to-End Model producing "Hey Cortana"]
} ASR is a sequence-to-sequence learning problem.
} A simpler paradigm with a single model (and training
stage) is desired.
} CTC is a sequence-to-sequence learning method used to map speech
waveforms directly to characters, phonemes, or even words
} CTC paths differ from label sequences in that they:
◦ Add the blank (∅) as an additional label, meaning no (actual) labels are
emitted
◦ Allow repetitions of non-blank labels
Label sequence z: A B C
CTC paths over the observation frames X (expand from labels to paths, collapse from paths to labels):
A A ∅ ∅ B C ∅
∅ A A B ∅ C C
∅ ∅ ∅ A B C ∅
[Figure: LSTM layers over frames t-1, t, t+1, followed by a softmax over words plus the blank ∅]
} Directly from speech to text, no language
model, no decoder, no lexicon…
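The collapse step that maps a CTC path back to its label sequence can be sketched as follows. This is a minimal illustration of the two rules (merge repeated labels, then drop blanks), using the three example paths from the slide; the function name `ctc_collapse` is a hypothetical helper, not an API from any specific toolkit.

```python
from itertools import groupby

BLANK = "∅"


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to its label sequence."""
    # Step 1: merge consecutive repetitions of the same symbol.
    merged = [symbol for symbol, _ in groupby(path)]
    # Step 2: remove the blank symbol.
    return [symbol for symbol in merged if symbol != blank]


paths = [
    ["A", "A", BLANK, BLANK, "B", "C", BLANK],
    [BLANK, "A", "A", "B", BLANK, "C", "C"],
    [BLANK, BLANK, BLANK, "A", "B", "C", BLANK],
]

for path in paths:
    print(" ".join(path), "->", " ".join(ctc_collapse(path)))
```

All three paths collapse to A B C. Because repetitions are merged before blanks are removed, a blank between two identical labels (A ∅ A) is what allows a genuinely repeated label to survive the collapse.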
} Reduce runtime cost without accuracy loss
} Adapt to speakers with a low footprint
} Reduce the accuracy gap between large and
small deep networks
} Enable languages with limited training data
[Xue13]
} The runtime cost of DNN is much larger than that of GMM,
which has been fully optimized in product deployment. We
need to reduce the runtime cost of DNN in order to ship it.
} We propose a new DNN structure that takes advantage of the
low-rank property of the DNN model to compress it.
} How to reduce the runtime cost of DNN? SVD!
} Speaker personalization & AM modularization.
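$$
A_{m \times n} = U_{m \times n} \Sigma_{n \times n} V_{n \times n}^{T}
=
\begin{bmatrix}
u_{11} & \cdots & u_{1n} \\
\vdots & \ddots & \vdots \\
u_{m1} & \cdots & u_{mn}
\end{bmatrix}
\begin{bmatrix}
\epsilon_{11} & \cdots & 0 & \cdots & 0 \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \cdots & \epsilon_{kk} & \cdots & 0 \\
\vdots & \ddots & \vdots & \ddots & \vdots \\
0 & \cdots & 0 & \cdots & \epsilon_{nn}
\end{bmatrix}
\begin{bmatrix}
v_{11} & \cdots & v_{1n} \\
\vdots & \ddots & \vdots \\
v_{n1} & \cdots & v_{nn}
\end{bmatrix}^{T}
$$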
} Number of parameters: mn->mk+nk.
} Runtime cost: O(mn) -> O(mk+nk).
} E.g., m=2048, n=2048, k=192. 80% runtime cost reduction.
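The parameter arithmetic above can be checked with a short NumPy sketch. It assumes the compression keeps only the k largest singular values and replaces one m×n weight matrix with an m×k and a k×n factor; the helper name `low_rank_factors` and the small demo matrix are illustrative assumptions, not the production implementation.

```python
import numpy as np


def low_rank_factors(A, k):
    """Split A (m x n) into an (m x k) and a (k x n) factor via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    left = U[:, :k] * s[:k]   # fold the k largest singular values into U
    right = Vt[:k, :]
    return left, right


# Parameter count for the example on the slide.
m, n, k = 2048, 2048, 192
full_params = m * n
low_rank_params = m * k + n * k
reduction = 1 - low_rank_params / full_params
print(f"{full_params} -> {low_rank_params} parameters ({reduction:.2%} fewer)")

# Small demo on a matrix that is exactly rank 8 (toy sizes for speed).
rng = np.random.default_rng(0)
A = rng.normal(size=(64, 8)) @ rng.normal(size=(8, 48))
left, right = low_rank_factors(A, k=8)
print("max reconstruction error:", np.abs(left @ right - A).max())
```

For m = n = 2048 and k = 192 the count drops from 4,194,304 to 786,432 parameters, a reduction of 81.25%, which matches the roughly 80% quoted above. A real weight matrix is only approximately low-rank, so the truncated product approximates rather than reproduces it.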
} Singular Value Decomposition
[Figure: DNN model and LSTM model evaluated over frames x_{t-1}, x_t, x_{t+1}, with "Copy" marking a reused output]
Split training utterances through frame skipping
Original utterance: x1 x2 x3 x4 x5 x6
Split utterances: x1 x3 x5 and x2 x4 x6
When skipping 1 frame, odd and even frames are picked as
separate utterances
Frame labels are selected accordingly
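The splitting scheme described above can be sketched with plain list slicing. The function name `split_by_frame_skipping` and the label names s1 to s6 are hypothetical placeholders; the frames x1 to x6 follow the slide.

```python
def split_by_frame_skipping(frames, labels, skip=1):
    """Split one utterance into (skip + 1) sub-utterances.

    Sub-utterance i takes frames i, i + skip + 1, i + 2 * (skip + 1), ...
    and the frame labels at the same positions.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    step = skip + 1
    return [(frames[offset::step], labels[offset::step]) for offset in range(step)]


frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["s1", "s2", "s3", "s4", "s5", "s6"]  # placeholder frame labels

for sub_frames, sub_labels in split_by_frame_skipping(frames, labels, skip=1):
    print(sub_frames, sub_labels)
```

With `skip=1` the odd-numbered frames form one training utterance and the even-numbered frames form another, each paired with its own labels.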
[Xue 14]
} Speaker personalization with a DNN model creates a storage
size issue: It is not practical to store an entire DNN model for
each individual speaker during deployment.
} We propose a low-footprint DNN personalization method based
on the SVD structure.
Adapting with 100 utterances
Model: relative WER reduction, number of parameters (M)
Full-size DNN: 0, 30
SVD DNN: 0.36, 7.4
Standard adaptation: 18.64, 7.4
SVD adaptation: 20.86, 0.26
} SVD matrices are used to reduce the number of DNN
parameters and CPU cost.
} Quantization for SSE evaluation is used for single instruction
multiple data processing.
} Frame skipping is used to remove the evaluation of some
frames.
} The industry has a strong interest in running DNN systems on
devices because mobile scenarios are increasingly popular.
} Even with the technologies mentioned above, the large
computational cost is still very challenging due to the limited
processing power of devices.
} A common way to fit CD-DNN-HMM on devices is to reduce
the DNN model size by
◦ reducing the number of nodes in hidden layers
◦ reducing the number of targets in the output layer
} Better accuracy is obtained if we use the output of the large-size DNN
for acoustic likelihood evaluation
} The output of the small-size DNN is far from that of the large-size DNN,
resulting in worse recognition accuracy
} The problem is solved if the small-size DNN can generate output
similar to that of the large-size DNN
◦ Use the standard DNN training method to train a large-size teacher
DNN using transcribed data
◦ Minimize the KL divergence between the output distributions of the
student DNN and the teacher DNN using a large amount of untranscribed data
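The output-distribution criterion can be sketched in NumPy as a KL divergence between teacher and student posteriors, averaged over frames. The direction KL(teacher || student), the toy sizes, and the random logits are assumptions made for illustration; note that no transcription labels appear anywhere in the loss, which is why untranscribed data can be used.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with the usual max-shift for numerical stability."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """Mean over frames of KL(teacher || student) over the output classes."""
    t = np.clip(teacher_probs, eps, 1.0)
    s = np.clip(student_probs, eps, 1.0)
    return float(np.mean(np.sum(t * (np.log(t) - np.log(s)), axis=-1)))


# Toy example: 8 frames, 5 output classes, random logits (assumptions).
rng = np.random.default_rng(0)
teacher_logits = rng.normal(size=(8, 5))
student_logits = rng.normal(size=(8, 5))

teacher_probs = softmax(teacher_logits)
loss_random = kl_divergence(teacher_probs, softmax(student_logits))
loss_matched = kl_divergence(teacher_probs, teacher_probs)

print("student far from teacher:", loss_random)
print("student equal to teacher:", loss_matched)
```

The loss is zero only when the student reproduces the teacher's output distribution, so minimizing it pushes the small network toward the large one on every frame.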
} 2 million parameters for the small-size DNN, compared to 30
million parameters for the teacher DNN.
} The footprint is further reduced to 0.5 million parameters
when combined with SVD.
[Figure: accuracy of the teacher DNN trained with standard sequence training, the small-size DNN trained with standard sequence training, and the student DNN trained with output distribution learning]
[Huang 13]
} Develop a new language in a new scenario with
a small amount of training data.
} Leverage the resource-rich languages to
develop high-quality ASR for resource-limited
languages.
[Figure: DNN for a new language. Input layer: a window of acoustic feature frames from new-language training or testing samples; many hidden layers: shared feature transformation; output layer: new-language senones]
[Figure: relative error reduction, vertical axis 0 to 25, by amount of training data: 3 hrs, 9 hrs, 36 hrs, 139 hrs]