Semantic Mask for Transformer
Based End-to-End Speech Recognition
Author: Chengyi Wang, Yu Wu, Yujiao Du⸭, Jinyu Li⸭, Shujie Liu, Liang Lu⸭, Shuo Ren, Guoli Ye⸭, Sheng Zhao⸭, Ming Zhou
 Microsoft Research Asia, Beijing
⸭ Microsoft Speech and Language Group
⸭ Beijing University of Posts and Telecommunications
PAPER PRESENTATION
Whenty Ariyanti
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION ENGINEERING
NATIONAL CENTRAL UNIVERSITY
TAIWAN
March 23, 2020
OUTLINE
01 OVERVIEW
02 SEMANTIC MASKING
• Masking Strategy
• Why Semantic Mask Works?
03 MODEL ARCHITECTURE
• CNN Layer
• Transformer Block
• ASR Training and Decoding
04 EXPERIMENTS
• Librispeech 960h
• TedLium2
05 RESULTS
OVERVIEW
01 Attention-based encoder-decoder models have achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
02 This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch (without assuming prior knowledge such as alignments).
03 The model is, however, prone to overfitting, especially when the amount of training data is limited.
04 Inspired by SpecAugment and BERT, a semantic-mask-based regularization is proposed for training this kind of end-to-end (E2E) model.
05 The idea is to mask the input features corresponding to a particular output token (e.g., a word or a word-piece).
06 This work studies the transformer-based model for ASR and performs experiments on the Librispeech 960h and TedLium2 datasets.
INDEX TERMS :
End-to-End ASR, Transformer, Semantic Mask
BACKGROUND
End-to-End (E2E) acoustic models, particularly those built on the attention-based encoder-decoder framework, have achieved competitive recognition accuracy on a wide range of speech datasets.

End-to-End (E2E)
Learn the mapping from the input acoustic signals to the output transcriptions directly, without decomposing the problem into separate modules such as lexicon modeling, acoustic modeling and language modeling as in the conventional hybrid architecture.

To improve the generalization capacity of the model and the strength of its language modeling power, this study proposes a semantic mask approach (inspired by SpecAugment and BERT).
PROPOSED METHOD
This method masks out the whole patch of features corresponding to an output token (e.g., a word or a word-piece) during training.

This study focuses on the transformer architecture, originally proposed for neural machine translation. Compared with RNNs, the transformer-based encoder captures long-term correlations in a constant number of sequential steps, instead of the many steps of BPTT required by an RNN.

E2E Weaknesses:
• Difficult to tune the strength of each component
• Tends to make grammatical errors (indicating that the language modeling power of the model is weak)
• Mismatch between the training and evaluation data (due to the small amount of training data)
SEMANTIC MASKING
MASKING STRATEGY

Figure 1. An example of semantic mask

01 APPROACH
Requires alignment information in order to perform the token-wise masking (as shown in Figure 1).
02 TOOLKIT
Used the Montreal Forced Aligner, trained on the training data, to perform forced alignment between the acoustic signals and the transcriptions and obtain word-level timing information.
03 TRAINING
Randomly select a percentage of the tokens and mask the corresponding speech segments in each iteration.
04 PROPOSED WORK
Randomly sample 15% of the tokens and set each masked piece to the mean value of the whole utterance; a sketch of this procedure follows below.
05 MASKING STRATEGY
Following the idea of SpecAugment, also adopt the time warp, frequency masking and time masking strategies.
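A minimal sketch of the masking procedure, assuming word-level (start, end) frame alignments from a forced aligner such as the Montreal Forced Aligner; the 15% sampling rate and the utterance-mean fill follow the slides, while the array layout and helper names are illustrative assumptions.

```python
import numpy as np

def semantic_mask(feats, alignments, mask_ratio=0.15, rng=np.random):
    """feats: (T, 83) log-Mel + pitch features; alignments: (start, end) frames per token."""
    feats = feats.copy()
    fill = feats.mean(axis=0)  # mean value of the whole utterance (one reading of the slides)
    n_mask = max(1, int(len(alignments) * mask_ratio))
    # Randomly sample ~15% of the tokens and mask their whole feature patches
    for idx in rng.choice(len(alignments), size=n_mask, replace=False):
        start, end = alignments[idx]
        feats[start:end] = fill  # mask the entire patch belonging to this token
    return feats
```

Because the masked span always covers one whole token, the model cannot reconstruct it from neighboring frames of the same word and must rely on the surrounding linguistic context instead.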
SEMANTIC MASKING
WHY SEMANTIC MASK WORKS?

• Spectrum augmentation (SpecAugment) is similar to this method: both propose to mask the spectrum for E2E model training, but the intuitions behind the two are different.
• SpecAugment randomly masks the spectrum in order to add noise to the source input, making the E2E ASR problem harder and preventing over-fitting in a large E2E model.
• With a semantic mask, the E2E model has to predict the token based on other signals: tokens that have already been generated, or other unmasked speech features (which alleviates over-fitting).
• The semantic mask reduces the hyper-parameter tuning workload of SpecAugment and is more robust when the variance of the input audio length is large.
CNN LAYER
Model Architecture

Figure 2. CNN Layer Architecture

01 Represent input signals as a sequence of log-Mel filter bank features, $X = (x_0, \ldots, x_n)$, where $x_i$ is an 83-dim vector.
02 Use a VGG-like convolution block with layer normalization and a max-pooling function; a sketch follows below.
03 This specific architecture outperforms the Convolution 2D subsampling method.
04 Use a 1D-CNN in the decoder to extract local features, replacing the position embedding.
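A minimal sketch of such a front-end in PyTorch; the channel counts, the placement of layer normalization and the two-block layout are illustrative assumptions in the spirit of ESPnet's VGG2L, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    """VGG-like conv block with layer norm and max-pooling (assumed layout)."""
    def __init__(self, feat_dim=83, out_channels=(64, 128)):
        super().__init__()
        blocks, in_ch = [], 1
        for ch in out_channels:
            blocks += [
                nn.Conv2d(in_ch, ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),  # halves time and frequency
            ]
            in_ch = ch
        self.conv = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        x = self.norm(x).unsqueeze(1)   # add a channel axis: (batch, 1, frames, feat_dim)
        x = self.conv(x)                # (batch, C, frames/4, feat_dim/4)
        b, c, t, f = x.size()
        return x.transpose(1, 2).reshape(b, t, c * f)  # back to frame-level features
```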
TRANSFORMER BLOCK
Model Architecture

01 The transformer module consumes the outputs of the CNN and extracts features with a self-attention mechanism.
02 Suppose that $Q$, $K$, and $V$ are the inputs of a transformer block; its output is calculated as:

$$\mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

03 Multi-head attention is proposed to enable dealing with multiple attentions:

$$\mathrm{Multihead}(Q, K, V) = [H_1 \ldots H_{d_{head}}]\,W_{head}, \quad \text{where } H_i = \mathrm{SelfAttention}(Q_i, K_i, V_i)$$

04 Residual connections, the feed-forward layer and layer normalization are indispensable parts of the Transformer; a sketch of the attention computation follows below.
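A minimal sketch of the two formulas above in PyTorch; the head-splitting layout is the standard Transformer recipe, and the 512/8 sizes echo the base model described later — all assumed rather than taken from the paper's code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_attention(q, k, v):
    """q, k, v: (batch, heads, time, d_k); computes softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):  # base-model sizes from the slides
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.w_head = nn.Linear(d_model, d_model)  # W_head in the formula

    def forward(self, q, k, v):
        b, t, _ = q.size()
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        heads = self_attention(split(self.wq(q)), split(self.wk(k)), split(self.wv(v)))
        concat = heads.transpose(1, 2).contiguous().view(b, t, self.h * self.d_k)
        return self.w_head(concat)  # [H_1 ... H_h] W_head
```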
ASR TRAINING AND DECODING
Model Architecture

01 Both the E2E model decoder and the CTC module predict the frame-wise distribution of $Y$ given the corresponding source $X$, denoted as $P_{s2s}(Y|X)$ and $P_{ctc}(Y|X)$.
02 The two negative log-likelihoods are weighted-averaged to train the model:

$$L = -\alpha \log P_{s2s}(Y|X) - (1 - \alpha) \log P_{ctc}(Y|X)$$

where $\alpha$ is set to 0.7.
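A minimal sketch of this joint objective in PyTorch; the tensor shapes, helper names and the omission of padding handling are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.7  # weight on the attention (s2s) branch, per the slides

def joint_loss(dec_logits, ctc_logits, targets, input_lens, target_lens, blank=0):
    """dec_logits: (B, U, V) decoder outputs; ctc_logits: (B, T, V) encoder outputs."""
    # Attention branch: token-level cross entropy = -log P_s2s(Y|X)
    s2s_nll = F.cross_entropy(
        dec_logits.transpose(1, 2),  # (B, V, U), the layout cross_entropy expects
        targets,                     # (B, U) token ids
        reduction="mean",
    )
    # CTC branch: -log P_ctc(Y|X); ctc_loss wants (T, B, V) log-probabilities
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc_nll = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=blank)
    return ALPHA * s2s_nll + (1 - ALPHA) * ctc_nll
```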
03 Scores of the E2E model $P_{s2s}$, the CTC module $P_{ctc}$ and an RNN-based language model $P_{rnn}$ are combined as a weighted sum of their log probabilities during beam-search decoding.
04 The beam outputs are then rescored with another right-to-left language model $P_{r2l}(Y)$ and the sentence length penalty $\mathrm{Wordcount}(Y)$.
It is common in the NLP community to rerank the outputs of a left-to-right s2s model with a right-to-left language model, since the right-to-left model is more sensitive to errors in the right part of a sentence.
05 $P_{trans\_lm}$ denotes the sentence generative probability given by a Transformer language model, which is also used in rescoring.
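A plausible form of the two decoding objectives described above, with $\lambda$, $\gamma$ and $\beta_i$ as assumed interpolation weights (the slides do not give the exact formulas):

$$\hat{Y} = \arg\max_{Y}\Big[\lambda \log P_{s2s}(Y|X) + (1-\lambda)\log P_{ctc}(Y|X) + \gamma \log P_{rnn}(Y)\Big]$$

$$\mathrm{score}(Y) = \log P_{s2s}(Y|X) + \beta_1 \log P_{r2l}(Y) + \beta_2 \log P_{trans\_lm}(Y) + \beta_3\,\mathrm{Wordcount}(Y)$$

The first pass runs beam search under the joint score; the second pass reranks only the surviving hypotheses, so the expensive right-to-left and Transformer language models score a handful of sentences rather than the full search space.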
EXPERIMENTS: LIBRISPEECH 960h
• The transformer language model for rescoring is trained on the LibriSpeech language model corpus with the GPT-2 base setting.
• The learning rate decreases proportionally to the inverse square root of the step number after the 25000th step (see the sketch below).
• Input signals are represented as a sequence of 80-dim log-Mel filter bank features with 3-dim pitch features.
• The model is trained for 40 epochs on 4 P40 GPUs, which takes 5 days to converge; speed perturbation is applied by changing the audio speed to 0.9, 1.0 and 1.1.
• Base model structure: 12 encoder layers and 6 decoder layers, attention vector size 512 with 8 heads, containing 75M parameters.
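A minimal sketch of this schedule; the 25000-step boundary comes from the slides, while the linear warmup shape and the peak_lr value are assumptions in the spirit of the standard Noam schedule.

```python
def inv_sqrt_lr(step, warmup=25000, peak_lr=1e-3):
    """Learning rate proportional to 1/sqrt(step) after the warmup boundary."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup       # assumed linear warmup
    return peak_lr * (warmup / step) ** 0.5  # inverse-square-root decay afterwards
```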
EXPERIMENTS: TEDLIUM2
• The vocabulary size is set to 1000.
• The corpus consists of 207 hours of speech data with 90k transcripts.
• Utterances with more than 3000 frames or more than 400 characters are discarded (see the sketch below).
• The acoustic features are 80-dim log-Mel filter bank and 3-dim pitch features, normalized by the mean and the standard deviation of the training set.
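A minimal sketch of this filtering rule, assuming each utterance is held as a (features, transcript) pair; only the 3000-frame and 400-character thresholds come from the slides.

```python
def keep_utterance(feats, transcript, max_frames=3000, max_chars=400):
    """True if the utterance survives the TEDLIUM2 filtering rule above."""
    return len(feats) <= max_frames and len(transcript) <= max_chars
```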
RESULTS

Table 1. Comparison on the Librispeech ASR benchmark
RESULTS

ANALYSIS
• All models use the base model setting and are shallow-fused with the RNN language model.

Performance on TEDLIUM2

Table 2. Ablation test of different masking methods. The fourth line is the default setting of SpecAugment; the fifth line uses the word mask to replace the random time mask; and the last line combines both methods on the time axis.

Table 3. Experiment results on TEDLIUM2
CONCLUSION
This study elaborates a new architecture for the E2E model, achieving state-of-the-art performance on the Librispeech test set within the scope of E2E models.

It also presents a semantic mask method for E2E speech recognition, which enables training a model to better consider the whole audio context for disambiguation.
THANK YOU For Your Patience!