Semantic Mask for Transformer
Based End-to-End Speech Recognition
Author: Chengyi Wang, Yu Wu, Yujiao Du⸭, Jinyu Li⸭, Shujie Liu, Liang Lu⸭, Shuo Ren, Guoli Ye⸭, Sheng Zhao⸭, Ming Zhou
 Microsoft Research Asia, Beijing
⸭ Microsoft Speech and Language Group
⸭ Beijing University of Posts and Telecommunications
PAPER PRESENTATION
Whenty Ariyanti
DEPARTMENT OF COMPUTER SCIENCE AND INFORMATION ENGINEERING
NATIONAL CENTRAL UNIVERSITY
TAIWAN
March 23, 2020
OUTLINE
01 OVERVIEW
02 SEMANTIC MASKING
• Masking Strategy
• Why Semantic Mask Works?
03 MODEL ARCHITECTURE
• CNN Layer
• Transformer Block
• ASR Training and Decoding
04 EXPERIMENTS
• Librispeech 960h
• TedLium2
05 RESULTS
OVERVIEW
01 Attention-based encoder-decoder models have achieved impressive results for both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
02 This approach takes advantage of the memorization capacity of neural networks to learn the mapping from the input sequence to the output sequence from scratch (without assuming prior knowledge such as alignments).
03 The model is, however, prone to overfitting, especially when the amount of training data is limited.
04 Inspired by SpecAugment and BERT, a semantic-mask-based regularization is proposed for training this kind of end-to-end (E2E) model.
05 The idea is to mask the input features corresponding to a particular output token (e.g., a word or a word-piece).
06 This work studies the transformer-based model for ASR and performs experiments on the Librispeech 960h and TedLium2 datasets.
INDEX TERMS :
End-to-End ASR, Transformer, Semantic Mask
BACKGROUND
End-to-End (E2E) acoustic models, particularly those built on the attention-based encoder-decoder framework, have achieved competitive recognition accuracy on a wide range of speech datasets.

End-to-End (E2E)
Learn the mapping from the input acoustic signals to the output transcriptions directly, without decomposing the problem into separate modules such as lexicon modeling, acoustic modeling and language modeling as in the conventional hybrid architecture.

To improve the generalization capacity of the model and the strength of its language modeling power, this study proposes a semantic mask approach (inspired by SpecAugment and BERT).
PROPOSED METHOD
This method masks out the whole patch of features corresponding to an output token (e.g., a word or a word-piece) during training.

This study focuses on the transformer architecture, originally proposed for neural machine translation. Compared with RNNs, the transformer-based encoder captures long-term correlations in a constant number of sequential steps, instead of the many steps of BPTT required by an RNN.

E2E Weaknesses:
• Difficult to tune the strength of each component
• Tends to make grammatical errors (indicating that the language modeling power of the model is weak)
• Mismatch between the training and evaluation data (due to the small amount of training data)
SEMANTIC MASKING
MASKING STRATEGY

Figure 1. An example of semantic mask

01 APPROACH
Requires alignment information in order to perform the token-wise masking (as shown in Figure 1).
02 TOOLKIT
Used the Montreal Forced Aligner, trained on the training data, to perform forced alignment between the acoustic signals and the transcriptions and obtain word-level timing information.
03 TRAINING
Randomly select a percentage of the tokens and mask the corresponding speech segments in each iteration.
04 PROPOSED WORK
Randomly sample 15% of the tokens and set each masked piece to the mean value of the whole utterance; a sketch of this procedure follows below.
05 MASKING STRATEGY
Following the idea of SpecAugment, also adopt the time warp, frequency masking and time masking strategies.
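A minimal sketch of the masking procedure, assuming word-level (start, end) frame alignments from a forced aligner such as the Montreal Forced Aligner; the 15% sampling rate and the utterance-mean fill follow the slides, while the array layout and helper names are illustrative assumptions.

```python
import numpy as np

def semantic_mask(feats, alignments, mask_ratio=0.15, rng=np.random):
    """feats: (T, 83) log-Mel + pitch features; alignments: (start, end) frames per token."""
    feats = feats.copy()
    fill = feats.mean(axis=0)  # mean value of the whole utterance (one reading of the slides)
    n_mask = max(1, int(len(alignments) * mask_ratio))
    # Randomly sample ~15% of the tokens and mask their whole feature patches
    for idx in rng.choice(len(alignments), size=n_mask, replace=False):
        start, end = alignments[idx]
        feats[start:end] = fill  # mask the entire patch belonging to this token
    return feats
```

Because the masked span always covers one whole token, the model cannot reconstruct it from neighboring frames of the same word and must rely on the surrounding linguistic context instead.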
SEMANTIC MASKING
WHY SEMANTIC MASK WORKS?

• Spectrum augmentation (SpecAugment) is similar to this method: both propose to mask the spectrum for E2E model training, but the intuitions behind the two are different.
• SpecAugment randomly masks the spectrum in order to add noise to the source input, making the E2E ASR problem harder and preventing over-fitting in a large E2E model.
• With a semantic mask, the E2E model has to predict the token based on other signals: tokens that have already been generated, or other unmasked speech features (which alleviates over-fitting).
• The semantic mask reduces the hyper-parameter tuning workload of SpecAugment and is more robust when the variance of the input audio length is large.
CNN LAYER
Model Architecture

Figure 2. CNN Layer Architecture

01 Represent input signals as a sequence of log-Mel filter bank features, $X = (x_0, \ldots, x_n)$, where $x_i$ is an 83-dim vector.
02 Use a VGG-like convolution block with layer normalization and a max-pooling function; a sketch follows below.
03 This specific architecture outperforms the Convolution 2D subsampling method.
04 Use a 1D-CNN in the decoder to extract local features, replacing the position embedding.
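A minimal sketch of such a front-end in PyTorch; the channel counts, the placement of layer normalization and the two-block layout are illustrative assumptions in the spirit of ESPnet's VGG2L, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VGGFrontEnd(nn.Module):
    """VGG-like conv block with layer norm and max-pooling (assumed layout)."""
    def __init__(self, feat_dim=83, out_channels=(64, 128)):
        super().__init__()
        blocks, in_ch = [], 1
        for ch in out_channels:
            blocks += [
                nn.Conv2d(in_ch, ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(ch, ch, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),  # halves time and frequency
            ]
            in_ch = ch
        self.conv = nn.Sequential(*blocks)
        self.norm = nn.LayerNorm(feat_dim)

    def forward(self, x):               # x: (batch, frames, feat_dim)
        x = self.norm(x).unsqueeze(1)   # add a channel axis: (batch, 1, frames, feat_dim)
        x = self.conv(x)                # (batch, C, frames/4, feat_dim/4)
        b, c, t, f = x.size()
        return x.transpose(1, 2).reshape(b, t, c * f)  # back to frame-level features
```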
TRANSFORMER BLOCK
Model Architecture

01 The transformer module consumes the outputs of the CNN and extracts features with a self-attention mechanism.
02 Suppose that $Q$, $K$, and $V$ are the inputs of a transformer block; its output is calculated as:

$$\mathrm{SelfAttention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$

03 Multi-head attention is proposed to enable dealing with multiple attentions:

$$\mathrm{Multihead}(Q, K, V) = [H_1 \ldots H_{d_{head}}]\,W_{head}, \quad \text{where } H_i = \mathrm{SelfAttention}(Q_i, K_i, V_i)$$

04 Residual connections, the feed-forward layer and layer normalization are indispensable parts of the Transformer; a sketch of the attention computation follows below.
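A minimal sketch of the two formulas above in PyTorch; the head-splitting layout is the standard Transformer recipe, and the 512/8 sizes echo the base model described later — all assumed rather than taken from the paper's code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def self_attention(q, k, v):
    """q, k, v: (batch, heads, time, d_k); computes softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(d_k)
    return torch.matmul(F.softmax(scores, dim=-1), v)

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):  # base-model sizes from the slides
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_k = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)
        self.wk = nn.Linear(d_model, d_model)
        self.wv = nn.Linear(d_model, d_model)
        self.w_head = nn.Linear(d_model, d_model)  # W_head in the formula

    def forward(self, q, k, v):
        b, t, _ = q.size()
        split = lambda x: x.view(b, -1, self.h, self.d_k).transpose(1, 2)
        heads = self_attention(split(self.wq(q)), split(self.wk(k)), split(self.wv(v)))
        concat = heads.transpose(1, 2).contiguous().view(b, t, self.h * self.d_k)
        return self.w_head(concat)  # [H_1 ... H_h] W_head
```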
ASR TRAINING AND DECODING
Model Architecture

01 Both the E2E model decoder and the CTC module predict the frame-wise distribution of $Y$ given the corresponding source $X$, denoted as $P_{s2s}(Y|X)$ and $P_{ctc}(Y|X)$.
02 The two negative log-likelihoods are weighted-averaged to train the model:

$$L = -\alpha \log P_{s2s}(Y|X) - (1 - \alpha) \log P_{ctc}(Y|X)$$

where $\alpha$ is set to 0.7.
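A minimal sketch of this joint objective in PyTorch; the tensor shapes, helper names and the omission of padding handling are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

ALPHA = 0.7  # weight on the attention (s2s) branch, per the slides

def joint_loss(dec_logits, ctc_logits, targets, input_lens, target_lens, blank=0):
    """dec_logits: (B, U, V) decoder outputs; ctc_logits: (B, T, V) encoder outputs."""
    # Attention branch: token-level cross entropy = -log P_s2s(Y|X)
    s2s_nll = F.cross_entropy(
        dec_logits.transpose(1, 2),  # (B, V, U), the layout cross_entropy expects
        targets,                     # (B, U) token ids
        reduction="mean",
    )
    # CTC branch: -log P_ctc(Y|X); ctc_loss wants (T, B, V) log-probabilities
    log_probs = F.log_softmax(ctc_logits, dim=-1).transpose(0, 1)
    ctc_nll = F.ctc_loss(log_probs, targets, input_lens, target_lens, blank=blank)
    return ALPHA * s2s_nll + (1 - ALPHA) * ctc_nll
```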
03 Scores of the E2E model $P_{s2s}$, the CTC module $P_{ctc}$ and an RNN-based language model $P_{rnn}$ are combined as a weighted sum of their log probabilities during beam-search decoding.
04 The beam outputs are then rescored with another right-to-left language model $P_{r2l}(Y)$ and the sentence length penalty $\mathrm{Wordcount}(Y)$.
It is common in the NLP community to rerank the outputs of a left-to-right s2s model with a right-to-left language model, since the right-to-left model is more sensitive to errors in the right part of a sentence.
05 $P_{trans\_lm}$ denotes the sentence generative probability given by a Transformer language model, which is also used in rescoring.
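A plausible form of the two decoding objectives described above, with $\lambda$, $\gamma$ and $\beta_i$ as assumed interpolation weights (the slides do not give the exact formulas):

$$\hat{Y} = \arg\max_{Y}\Big[\lambda \log P_{s2s}(Y|X) + (1-\lambda)\log P_{ctc}(Y|X) + \gamma \log P_{rnn}(Y)\Big]$$

$$\mathrm{score}(Y) = \log P_{s2s}(Y|X) + \beta_1 \log P_{r2l}(Y) + \beta_2 \log P_{trans\_lm}(Y) + \beta_3\,\mathrm{Wordcount}(Y)$$

The first pass runs beam search under the joint score; the second pass reranks only the surviving hypotheses, so the expensive right-to-left and Transformer language models score a handful of sentences rather than the full search space.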
EXPERIMENTS: LIBRISPEECH 960h
• The transformer language model for rescoring is trained on the LibriSpeech language model corpus with the GPT-2 base setting.
• The learning rate decreases proportionally to the inverse square root of the step number after the 25000th step (see the sketch below).
• Input signals are represented as a sequence of 80-dim log-Mel filter bank features with 3-dim pitch features.
• The model is trained for 40 epochs on 4 P40 GPUs, which takes 5 days to converge; speed perturbation is applied by changing the audio speed to 0.9, 1.0 and 1.1.
• Base model structure: 12 encoder layers and 6 decoder layers, attention vector size 512 with 8 heads, containing 75M parameters.
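A minimal sketch of this schedule; the 25000-step boundary comes from the slides, while the linear warmup shape and the peak_lr value are assumptions in the spirit of the standard Noam schedule.

```python
def inv_sqrt_lr(step, warmup=25000, peak_lr=1e-3):
    """Learning rate proportional to 1/sqrt(step) after the warmup boundary."""
    step = max(step, 1)
    if step < warmup:
        return peak_lr * step / warmup       # assumed linear warmup
    return peak_lr * (warmup / step) ** 0.5  # inverse-square-root decay afterwards
```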
EXPERIMENTS: TEDLIUM2
• The vocabulary size is set to 1000.
• The corpus consists of 207 hours of speech data with 90k transcripts.
• Utterances with more than 3000 frames or more than 400 characters are discarded (see the sketch below).
• The acoustic features are 80-dim log-Mel filter bank and 3-dim pitch features, normalized by the mean and the standard deviation of the training set.
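A minimal sketch of this filtering rule, assuming each utterance is held as a (features, transcript) pair; only the 3000-frame and 400-character thresholds come from the slides.

```python
def keep_utterance(feats, transcript, max_frames=3000, max_chars=400):
    """True if the utterance survives the TEDLIUM2 filtering rule above."""
    return len(feats) <= max_frames and len(transcript) <= max_chars
```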
RESULTS

Table 1. Comparison on the Librispeech ASR benchmark
RESULTS

ANALYSIS
• All models use the base model setting and are shallow-fused with the RNN language model.

Performance on TEDLIUM2

Table 2. Ablation test of different masking methods. The fourth line is the default setting of SpecAugment; the fifth line uses the word mask to replace the random time mask; and the last line combines both methods on the time axis.

Table 3. Experiment results on TEDLIUM2
CONCLUSION
This study elaborates a new architecture for the E2E model, achieving state-of-the-art performance on the Librispeech test set within the scope of E2E models.

It also presents a semantic mask method for E2E speech recognition, which enables training a model to better consider the whole audio context for disambiguation.
THANK YOU For Your Patience!