Deep Neural Machine Translation with Linear Associative Unit
Mingxuan Wang, Zhengdong Lu, Jie Zhou, Qun Liu
(ACL 2017)
Satoru Katsumata (B4, Tokyo Metropolitan University)
Abstract
● NMT systems with deep RNN architectures often suffer from severe gradient diffusion.
○ due to the non-linear recurrent activations, which often make the optimization much more
difficult.
● we propose novel linear associative units (LAU).
○ reduce the gradient path inside the recurrent units
● experiment
○ NIST task: Chinese-English
○ WMT14: English-German
English-French
● analysis
○ LAU vs. GRU
○ Depth vs. Width
○ about Length
2
background: gate structure
● LSTM, GRU: capture long-term dependencies
● Residual Network (He et al., 2015)
● Highway Network (Srivastava et al., 2015)
● Fast-Forward Network (Zhou et al., 2016) (F-F connections)
3
Highway Network (H, T: non-linear functions)
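For reference, the Highway Network layer mentioned above mixes a non-linear transform H with the identity path through a transform gate T; in Srivastava et al.'s standard formulation (not reproduced on the slide) it reads:

y = H(x, W_H) \odot T(x, W_T) + x \odot \bigl(1 - T(x, W_T)\bigr)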
background: Gated Recurrent Unit
taking a linear sum between the existing state and the newly computed state
z_t: update gate
r_t: reset gate
4
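The GRU update the slide summarizes, in its standard form (the slide's figure is not reproduced here, and conventions differ across papers on which state z_t gates):

z_t = \sigma(W_z x_t + U_z h_{t-1}), \qquad r_t = \sigma(W_r x_t + U_r h_{t-1})
\tilde{h}_t = \tanh\bigl(W x_t + U (r_t \odot h_{t-1})\bigr)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t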
Model: LAU (Linear Associative Unit)
LAU extends GRU by adding a linear transformation of the input.
f_t and r_t express how much of the non-linear abstraction is produced
by the input x_t and the previous hidden state h_{t-1}.
g_t decides how much of the linear transformation of the input is carried
to the hidden state.
GRU
5
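A minimal PyTorch-style sketch of a LAU-like cell, based only on the description above: a GRU-like non-linear candidate weighted by f_t, the previous state weighted by r_t, and a g_t-gated linear transform of the input. The exact equations and weight names in the paper differ in detail, so treat this purely as an illustration.

```python
import torch
import torch.nn as nn

class LAUCell(nn.Module):
    """Illustrative LAU-style cell (one reading of the slide, not the paper's exact equations)."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.gates = nn.Linear(input_size + hidden_size, 3 * hidden_size)   # produces f_t, r_t, g_t
        self.candidate = nn.Linear(input_size + hidden_size, hidden_size)   # non-linear candidate
        self.linear_input = nn.Linear(input_size, hidden_size, bias=False)  # linear path for x_t

    def forward(self, x_t, h_prev):
        xh = torch.cat([x_t, h_prev], dim=-1)
        f_t, r_t, g_t = torch.sigmoid(self.gates(xh)).chunk(3, dim=-1)
        h_tilde = torch.tanh(self.candidate(xh))        # non-linear abstraction of x_t and h_{t-1}
        # gated non-linear candidate + gated previous state + gated *linear* copy of the input
        return f_t * h_tilde + r_t * h_prev + g_t * self.linear_input(x_t)
```

The last term is the point of the unit: it gives x_t a short, linear path into later hidden states, which is what shortens the gradient path.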
What is good about using LAU?
LAU offers a direct way for input x_t to go to latter hidden state layers.
This mechanism is very useful for translation where the input should sometimes
be directly carried to the next stage of processing without any substantial
composition or nonlinear transformation.
ex. imagine we want to translate a rare entity name such as ‘Bahrain’ to Chinese.
→ LAU is able to retain the embedding of this word in its hidden state.
Otherwise, serious distortion occurs due to lack of training instances.
6
Model: encoder-decoder (DeepLAU)
● vertical stacking
○ only the output of the previous layer of RNN is
fed to the current layer as input.
● bidirectional encoder
○ φ is LAU.
○ the directions are marked by a direction term d,
d = -1 or +1.
when d = -1, processing is in the forward direction;
otherwise, the backward direction.
7
Model: Encoder side
● in order to learn more temporal dependencies, they choose an unusual
bidirectional approach.
● encoding
○ an RNN layer processes the input sequence in forward direction.
○ the output of this layer is taken by an upper RNN layer as input, processed in reverse
direction.
○ Formally, following Equation (9), they set d = (-1)^ℓ
○ the final encoder consists of L_enc layers and produces the output
8
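A rough sketch of the alternating-direction stacking described on the last two slides, with layer ℓ running in direction d = (-1)^ℓ. The cell interface matches the hypothetical LAUCell above; the layer sizes and the zero initial state are assumptions.

```python
import torch

def run_layer(cell, inputs, hidden_size, reverse=False):
    """Run one recurrent layer over a list of (batch, dim) tensors, optionally right-to-left."""
    seq = list(reversed(inputs)) if reverse else list(inputs)
    h = torch.zeros(seq[0].shape[0], hidden_size)
    outputs = []
    for x_t in seq:
        h = cell(x_t, h)
        outputs.append(h)
    # return the outputs in the original left-to-right order
    return list(reversed(outputs)) if reverse else outputs

def deep_alternating_encoder(cells, embeddings, hidden_size):
    """Layer 1 reads the embeddings forward; each upper layer reads the layer below in reverse."""
    outputs = embeddings
    for l, cell in enumerate(cells):                 # l = 0 ... L_enc - 1
        outputs = run_layer(cell, outputs, hidden_size, reverse=(l % 2 == 1))  # d = (-1)^(l+1)
    return outputs                                   # output of the top-most encoder layer
```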
Model: Attention side
α_{t,j} is calculated from the first-layer decoder state at step t-1 (s_{t-1}),
the top-most encoder layer at position j (h_j),
and the previous target word y_{t-1}.
σ(·) is tanh.
9
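Written out in standard additive-attention notation consistent with the description above (the weight names v_a, W_a, U_a, E are assumptions; σ = tanh):

e_{t,j} = v_a^{\top} \tanh\bigl(W_a s_{t-1}^{(1)} + U_a h_j^{(L_{enc})} + E\, y_{t-1}\bigr), \qquad
\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_k \exp(e_{t,k})}, \qquad
c_t = \sum_j \alpha_{t,j}\, h_j^{(L_{enc})}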
Model: Decoder side
The decoder follows Equation (9) with a fixed direction term d = -1.
At the first layer, they use the following input:
At the inference stage, they only utilize the top-most hidden state s_t^(L_dec)
to make the final prediction with a softmax layer:
y_{t-1} is the target word embedding.
10
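The prediction step, in generic form (the output projection W_o, b_o is an assumed parameterization):

p(y_t \mid y_{<t}, x) = \operatorname{softmax}\bigl(W_o\, s_t^{(L_{dec})} + b_o\bigr)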
Experiments: corpus
● NIST Chinese-English
○ training: LDC corpora 1.25M sents, 27.9M Chinese words and 34.5M English words
○ dev: NIST 2002 (MT02) dataset
○ test: NIST 2003 (MT03), NIST 2004 (MT04), NIST 2005 (MT05), NIST 2006 (MT06)
● WMT14 English-German
○ training: WMT14 training corpus 4.5M sents, 91M English words and 87M German words
○ dev: news-test 2012, 2013
○ test: news-test 2014
● WMT14 English-French
○ training: subset of WMT14 training corpus 12M sents, 304M English words and 348M French
words
○ dev: concatenation of news-test 2012 and news-test 2013
○ test: news-test 2014
11
Experiments: setup
● For all experiments
○ dimensions: embeddings, hidden states, and c_t all have size 512
○ optimizer: Adadelta
○ batch size: 128
○ input length limit: 80 words
○ beam size: 10
○ dropout rate: 0.5
○ layers: both encoder and decoder have 4 layers
● settings of each experiment
○ for Chinese-English and English-French, use the 30k most frequent words for both source and target vocabularies
○ for English-German, use the 120k most frequent source words and the 80k most frequent target words
12
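The shared settings above, collected into a plain config dictionary for quick reference (key names are my own, purely illustrative):

```python
# Shared DeepLAU hyper-parameters as listed on the slide (key names are illustrative).
deeplau_config = {
    "embedding_dim": 512,
    "hidden_dim": 512,
    "context_dim": 512,        # c_t
    "optimizer": "Adadelta",
    "batch_size": 128,
    "max_source_length": 80,   # words
    "beam_size": 10,
    "dropout": 0.5,
    "encoder_layers": 4,
    "decoder_layers": 4,
}
```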
Result: Chinese-English, English-French
LAUs apply an adaptive gate function conditioned on the input, which is able to decide
how much linear information should be transferred to the next step.
13
Result: English-German
14
Analysis: LAU vs. GRU, Depth vs. Width
LAU vs. GRU
● rows 3 vs. 7
→ LAU brings improvement.
● rows 3, 4 vs. rows 7, 8
→ GRU decreases BLEU,
but LAU brings improvement.
Depth vs. Width
● when increasing the model depth further,
they failed to see further improvements.
● hidden size (width)
rows 2 vs. 3
→ the improvement is relatively small
→ depth plays a more important role than width in increasing
the complexity of neural networks.
(All analyses use the NIST Chinese-English task.)
15
Analysis: About Length
DeepLAU models yield higher BLEU scores
than the DeepGRU model.
→ a very deep RNN model is good at
modelling the nested latent structures
on relatively complicated sentences.
16
Conclusion
● propose a Linear Associative Unit (LAU)
○ it fuses both linear and non-linear transformations.
● LAUs enable us to build deep neural networks for MT.
● My impressions
○ In the end, is this model deep in the non-recurrent sense or the recurrent sense?
Maybe both readings are reasonable…
○ I was also interested in the size (number of weights) of the model,
so I wish the authors had mentioned it.
17
References
● Srivastava et al. Training Very Deep Networks. NIPS 2015.
● Zhou et al. Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation. arXiv, 2016.
● He et al. Deep Residual Learning for Image Recognition. arXiv, 2015.
● Wu et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv, 2016.
18
