Deep Learning for Speech Recognition in Cortana at AI NEXT Conference

} Review the deep learning trends for automatic
speech recognition (ASR) in industry
◦ Deep Neural Network (DNN)
◦ Long Short-Term Memory (LSTM)
◦ Connectionist Temporal Classification (CTC)
} Describe selected key technologies that make
deep learning models more effective in
production environments

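[Figure: ASR system architecture. Input speech s(n) passes through Feature Analysis (Spectral Analysis) to give features X_n; Pattern Classification (Decoding, Search) uses the Acoustic Model (HMM), Word Lexicon, and Language Model to produce the word sequence W; Confidence Scoring yields the scored result Ŵ, e.g. “Hey” (0.9) “Cortana” (0.8).]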

} Word sequence: Hey Cortana
} Phone sequence: hh ey k ao r t ae n ax
} Triphone sequence: sil-hh+ey hh-ey+k ey-k+ao k-ao+r
ao-r+t r-t+ae t-ae+n ae-n+ax n-ax+sil
} Every triphone is then modeled by a three-state HMM:
sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ......,
n-ax+sil[3]. The key problem is how to evaluate the state
likelihood given the speech signal.
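
The expansion from phones to triphones to HMM states can be sketched in a few lines. This is only an illustration of the labeling scheme on the slide: the function names `to_triphones` and `to_hmm_states` are hypothetical, and a real system would additionally cluster the states into senones.

```python
def to_triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


def to_hmm_states(triphones, states_per_model=3):
    """Attach a 1-based state index to every triphone (three-state HMMs)."""
    return [
        f"{tri}[{s}]"
        for tri in triphones
        for s in range(1, states_per_model + 1)
    ]


phones = "hh ey k ao r t ae n ax".split()
triphones = to_triphones(phones)
states = to_hmm_states(triphones)
print(triphones[0], triphones[-1])  # first and last triphone
print(len(triphones), len(states))  # number of triphones and HMM states
```

Nine phones give nine triphones and 27 HMM states; the acoustic model has to score each of these states against every speech frame.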

[Figure: HMM state sequence sil-hh+ey[1], sil-hh+ey[2], sil-hh+ey[3], hh-ey+k[1], ..., n-ax+sil[3]]

} ZH-CN is improved by 32% within one year!
[Figure: ZH-CN Relative Improvement (CERR, axis 0 to 35) for GMM, MFCC CE DNN, LFB CE DNN, and LFB SE DNN]
CE: Cross Entropy training
SE: Sequence training

DNNs process speech frames independently:
$h_t = \sigma(W_{hx} x_t + b)$

RNNs consider the temporal relation over speech frames:
$h_t = \sigma(W_{hx} x_t + W_{hh} h_{t-1} + b)$
Vulnerable to vanishing and exploding gradients
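
The difference between the frame-independent DNN layer and the recurrent layer can be seen in a toy sketch. The weights and the two-dimensional "frames" below are made-up values chosen only for illustration; the helper names are hypothetical.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(*vectors):
    return [sum(vals) for vals in zip(*vectors)]


def dnn_layer(x_t, W_hx, b):
    # h_t = sigma(W_hx x_t + b): depends on the current frame only.
    return sigmoid(add(matvec(W_hx, x_t), b))


def rnn_layer(frames, W_hx, W_hh, b):
    # h_t = sigma(W_hx x_t + W_hh h_{t-1} + b): h_{t-1} carries the history.
    h = [0.0] * len(b)
    outputs = []
    for x_t in frames:
        h = sigmoid(add(matvec(W_hx, x_t), matvec(W_hh, h), b))
        outputs.append(h)
    return outputs


# Toy values (hypothetical): 2-dim features, 2 hidden units.
W_hx = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # frames 0 and 2 are identical

dnn_out = [dnn_layer(x, W_hx, b) for x in frames]
rnn_out = rnn_layer(frames, W_hx, W_hh, b)
print(dnn_out[0] == dnn_out[2])  # identical frames, identical DNN outputs
print(rnn_out[0] == rnn_out[2])  # RNN outputs differ: the history differs
```

The same input frame produces the same DNN activation wherever it occurs, while the RNN activation also depends on what came before it.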

Memory cells store the history information
Various gates control the information flow inside the LSTM
Well suited to learning both long- and short-term temporal dependencies
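
As a rough illustration of how the gates and the memory cell interact, here is a scalar sketch of one LSTM step. It assumes the commonly used formulation with input, forget, and output gates; the slides do not say which LSTM variant is used in production, and all parameter values are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x_t, h_prev, c_prev, p):
    """One step of a scalar LSTM cell; p maps gate name -> (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, bias = p[name]
        return w_x * x_t + w_h * h_prev + bias

    i = sigmoid(pre("input"))    # input gate: how much new content to write
    f = sigmoid(pre("forget"))   # forget gate: how much old memory to keep
    o = sigmoid(pre("output"))   # output gate: how much memory to expose
    g = math.tanh(pre("cand"))   # candidate content
    c_t = f * c_prev + i * g     # the memory cell stores the history
    h_t = o * math.tanh(c_t)
    return h_t, c_t


# Hypothetical toy parameters.
params = {
    "input": (1.0, 0.0, 0.0),
    "forget": (0.0, 0.0, 5.0),   # large bias: forget gate near 1, memory kept
    "output": (0.0, 0.0, 5.0),
    "cand": (1.0, 0.0, 0.0),
}
h, c = 0.0, 0.0
for x in [2.0, 0.0, 0.0, 0.0]:
    h, c = lstm_step(x, h, c, params)
print(c)  # stays well above zero: the first frame is still remembered
```

Because the cell state is updated additively and scaled by a forget gate close to 1, the contribution of the first frame survives the three silent frames that follow it.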

[Figure: Relative WER reduction (WERR, axis 0.00 to 20.00) of LSTM from DNN on the test sets SMD2015, VS2015, MobileC, Mobile, and Win10C]

The HMM/GMM or HMM/DNN pipeline is highly complex
Multiple training stages: CI phone, CD senones, …
Various resources: lexicon, decision tree questions, …
Many hyper-parameters: number of senones, number of
Gaussians, …
[Figure: training pipeline with the stages CI Phone, CD Senone, and DNN/LSTM, labeled GMM and Hybrid]

LM building also requires large amounts of data and a complicated process
Writing an efficient decoder requires experts with years of experience

[Figure: End-to-End Model producing “Hey Cortana”]
} ASR is a sequence-to-sequence learning problem.
} A simpler paradigm with a single model (and training
stage) is desired.

} CTC is a sequence-to-sequence learning method used to map speech
waveforms directly to characters, phonemes, or even words
} CTC paths differ from label sequences in that they:
◦ Allow repetitions of non-blank labels
◦ Add the blank (∅) as an additional label, meaning no (actual) label is
emitted
Label sequence z: A B C
CTC paths over the observation frames X (each path collapses to z; z expands to the paths):
A A ∅ ∅ B C ∅
∅ A A B ∅ C C
∅ ∅ ∅ A B C ∅
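
The "collapse" direction of the CTC mapping is easy to sketch: merge consecutive repeats, then drop blanks. The function name `ctc_collapse` is hypothetical, and the three paths are the ones shown on the slide.

```python
from itertools import groupby

BLANK = "∅"


def ctc_collapse(path, blank=BLANK):
    """Map a frame-level CTC path to a label sequence.

    Step 1 merges consecutive repeated symbols; step 2 removes blanks.
    """
    merged = [symbol for symbol, _ in groupby(path)]
    return [symbol for symbol in merged if symbol != blank]


paths = [
    "A A ∅ ∅ B C ∅".split(),
    "∅ A A B ∅ C C".split(),
    "∅ ∅ ∅ A B C ∅".split(),
]
for path in paths:
    print(" ".join(path), "->", " ".join(ctc_collapse(path)))
```

The order of the two steps matters: because repeats are merged before blanks are removed, a blank between two identical labels (A ∅ A) keeps them apart, which is how CTC can emit a genuinely doubled label.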

[Figure: LSTM unrolled over frames t-1, t, t+1, followed by a softmax output layer over words plus the blank label ∅]

} Directly from speech to text: no language
model, no decoder, no lexicon, …

} Reduce runtime cost without accuracy loss
} Adapt to speakers with low footprints
} Reduce accuracy gap between large and
small deep networks
} Enable languages with limited training data

} The runtime cost of DNN is much larger than that of GMM,
which has been fully optimized in product deployment. We
need to reduce the runtime cost of DNN in order to ship it.

} We propose a new DNN structure that exploits the low-rank
property of the DNN weight matrices to compress the model

} How can we reduce the runtime cost of the DNN? With SVD!
} speaker personalization & AM modularization.

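$$
A_{m \times n} = U_{m \times n}\, \Sigma_{n \times n}\, V_{n \times n}^{T}
= \begin{bmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \cdots & u_{mn} \end{bmatrix}
\begin{bmatrix} \epsilon_{11} & \cdots & 0 & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & \epsilon_{kk} & \cdots & 0 \\ \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & \cdots & 0 & \cdots & \epsilon_{nn} \end{bmatrix}
\begin{bmatrix} v_{11} & \cdots & v_{1n} \\ \vdots & \ddots & \vdots \\ v_{n1} & \cdots & v_{nn} \end{bmatrix}^{T}
$$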

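} Keeping only the k largest singular values replaces the m×n matrix with an m×k and a k×n matrix.
} Number of parameters: mn -> mk + nk.
} Runtime cost: O(mn) -> O(mk + nk).
} E.g., m=2048, n=2048, k=192: about 80% runtime cost reduction.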
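
The quoted saving can be checked with a quick calculation. This sketch only counts weights per layer (it ignores biases and any overhead of the extra matrix multiply), and the function name `svd_layer_cost` is hypothetical.

```python
def svd_layer_cost(m, n, k):
    """Weight counts before and after replacing an m x n weight matrix
    with an m x k matrix followed by a k x n matrix."""
    full = m * n
    low_rank = m * k + n * k
    return full, low_rank, 1.0 - low_rank / full


full, low_rank, saving = svd_layer_cost(m=2048, n=2048, k=192)
print(full, low_rank, saving)
```

For the slide's numbers this gives 4,194,304 weights before and 786,432 after, a reduction of 81.25%, consistent with the roughly 80% figure quoted.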

} Singular Value Decomposition

[Figure: DNN Model and LSTM Model, each unrolled over frames x_{t-1}, x_t, x_{t+1}, with a "Copy" step between frames]

Split training utterances through frame skipping
Original utterance: x1 x2 x3 x4 x5 x6
Split utterances: x1 x3 x5 and x2 x4 x6
When skipping 1 frame, odd and even frames are picked as
separate utterances
Frame labels are selected accordingly
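
The split described above can be sketched with list slicing. The function name and the label strings are hypothetical; the slide only specifies that odd and even frames become separate utterances and that labels follow their frames.

```python
def split_by_frame_skipping(frames, labels, skip=1):
    """Split one utterance into skip + 1 shorter utterances.

    Sub-utterance j keeps frames j, j + (skip + 1), j + 2 * (skip + 1), ...
    together with the labels at the same positions.
    """
    if len(frames) != len(labels):
        raise ValueError("frames and labels must have the same length")
    step = skip + 1
    return [(frames[j::step], labels[j::step]) for j in range(step)]


frames = ["x1", "x2", "x3", "x4", "x5", "x6"]
labels = ["s1", "s2", "s3", "s4", "s5", "s6"]  # hypothetical frame labels
result = split_by_frame_skipping(frames, labels)
for sub_frames, sub_labels in result:
    print(sub_frames, sub_labels)
```

With `skip=1` the six-frame utterance yields two three-frame utterances, so no training frame is discarded even though each sub-utterance sees only every other frame.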

} Speaker personalization with a deep model creates a storage
size issue: it is not practical to store an entire deep model
for each individual speaker during deployment.

} We propose a low-footprint DNN personalization method based
on the SVD structure.

Adapting with 100 utterances

System               | Relative WER reduction (%) | Number of parameters (M)
Full-size DNN        | 0                          | 30
SVD DNN              | 0.36                       | 7.4
Standard adaptation  | 18.64                      | 7.4
SVD adaptation       | 20.86                      | 0.26

} SVD matrices are used to reduce the number of DNN
parameters and CPU cost.
} Quantization enables SSE evaluation, which uses single
instruction, multiple data (SIMD) processing.
} Frame skipping is used to remove the evaluation of some
frames.

} The industry has a strong interest in running DNN systems on
devices because mobile scenarios are increasingly popular.
} Even with the technologies mentioned above, the large
computational cost is still very challenging due to the limited
processing power of devices.
} A common way to fit CD-DNN-HMM on devices is to reduce
the DNN model size by
◦ reducing the number of nodes in hidden layers
◦ reducing the number of targets in the output layer

} Better accuracy is obtained if we use the output of the large-size DNN for acoustic likelihood evaluation
} The output of the small-size DNN deviates from that of the large-size DNN, resulting in worse recognition accuracy
} The problem is solved if the small-size DNN can generate output similar to that of the large-size DNN

◦ Use the standard DNN training method to train a large-size teacher DNN using transcribed data
◦ Minimize the KL divergence between the output distributions of the student DNN and the teacher DNN using a large amount of untranscribed data
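
The second step can be sketched as a loss over senone posteriors. The direction KL(teacher || student) is an assumption here (the slide only says "between"), and the logits are made-up values for two frames and three senones.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same senones."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def distillation_loss(teacher_logits, student_logits):
    """Average KL(teacher || student) over frames; no transcription needed."""
    losses = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(losses) / len(losses)


# Hypothetical logits for 2 frames and 3 senones.
teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
close_student = [[1.9, 0.6, -1.0], [0.0, 1.4, 0.4]]
far_student = [[-1.0, 0.5, 2.0], [1.5, 0.1, 0.3]]
print(distillation_loss(teacher, close_student))
print(distillation_loss(teacher, far_student))
```

The loss needs only the teacher's output distribution for each frame, which is why untranscribed audio is sufficient for training the student.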

} 2 million parameters for the small-size DNN, compared to 30
million parameters for the teacher DNN.
} The footprint is further reduced to 0.5 million parameters
when combined with SVD.
[Figure: Accuracy of the teacher DNN trained with standard sequence training, the small-size DNN trained with standard sequence training, and the student DNN trained with output distribution learning]

} Develop a new language in a new scenario with
a small amount of training data.

} Leverage the resource-rich languages to
develop high-quality ASR for
resource-limited languages.

[Figure: DNN for a new language. Input Layer: a window of acoustic feature frames; Many Hidden Layers: Shared Feature Transformation; Output Layer: new language senones; fed with new language training or testing samples]

[Figure: relative error reduction (axis 0 to 25) for 3 hrs, 9 hrs, 36 hrs, and 139 hrs of training data]
