What do Neural Machine
Translation Models Learn
about Morphology?
Yonatan Belinkov, Nadir Durrani, Fahim Dalvi,
Hassan Sajjad and James Glass
@ 8/11 ACL2017 Reading
M1 Hayahide Yamagishi
Introduction
● “Little is known about what and how much NMT models learn
about each language and its features.”
● They try to answer the following questions
1. Which parts of the NMT architecture capture word structure?
2. What is the division of labor between different components?
3. How do different word representations help learn better morphology and
modeling of infrequent words?
4. How does the target language affect the learning of word structure?
● Task: Part-of-Speech tagging and morphological tagging
2
Task
● Part-of-Speech (POS) tagging
○ computer → NN
○ computers → NNS
● Morphological tagging
○ he → 3rd person, singular, masculine, subject
○ him → 3rd person, singular, masculine, object
● Task: hidden states → tag
○ They test each hidden state of the trained NMT model with a classifier.
○ If the classifier’s accuracy is high, the hidden states encode information about word structure.
3
Methodology
1. Training the NMT models (Bahdanau attention, LSTM)
2. Using the trained model as a feature extractor
3. Training a feedforward NN on the state-tag pairs
○ One hidden layer: input layer, hidden layer, output layer
4. Testing the classifier
● “Our goal is not to beat the state-of-the-art on a given task.”
● “We also experimented with a linear classifier and observed
similar trends to the non-linear case.”
4
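A minimal sketch of this probing setup, assuming the encoder states and their tags have already been extracted into tensors; the names, layer sizes, and tag count below are illustrative, not taken from the paper:

```python
# Probing-classifier sketch (assumption: states/tags are pre-extracted
# from a frozen NMT encoder; sizes are illustrative).
import torch
import torch.nn as nn

state_dim, hidden_dim, num_tags = 500, 500, 42  # illustrative sizes

# One hidden layer, as on the slide: input -> hidden -> output.
probe = nn.Sequential(
    nn.Linear(state_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, num_tags),
)
optimizer = torch.optim.Adam(probe.parameters())
loss_fn = nn.CrossEntropyLoss()

def train_step(states, tags):
    """states: (batch, state_dim) hidden states from the frozen NMT model;
    tags: (batch,) gold POS / morphological tag ids."""
    optimizer.zero_grad()
    loss = loss_fn(probe(states), tags)
    loss.backward()
    optimizer.step()
    return loss.item()
```

The NMT model itself stays frozen; only this small classifier is trained, so its accuracy reflects what the hidden states already encode.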
Data
● Language pairs:
○ {Arabic, German, French, Czech} - English
○ Arabic - Hebrew (Both languages are morphologically-rich and similar.)
○ Arabic - German (Both languages are morphologically-rich but different.)
● Parallel corpus: TED
● POS annotated data
○ Gold: included in some datasets
○ Predicted: produced by freely available taggers
5
Char-based Encoder
● Character-aware Neural Language
Model [Kim+, AAAI2016]
● Character-based Neural Machine
Translation [Costa-jussa and Fonollosa,
ACL2016]
● Character embedding
→ word embedding
● The obtained word embeddings are fed into
the word-based RNN-LM.
6
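A rough sketch of the character-to-word embedding step in the spirit of Kim+ (AAAI2016): a CNN over character embeddings, max-pooled over the word. All sizes and names here are illustrative assumptions, not the authors’ code:

```python
# Sketch: character-based word representation (CNN over char embeddings).
import torch
import torch.nn as nn

class CharWordEncoder(nn.Module):
    def __init__(self, num_chars, char_dim=25, num_filters=200, kernel=3):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim, padding_idx=0)
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=kernel, padding=1)

    def forward(self, char_ids):
        # char_ids: (batch, word_len) character indices, one word per row
        x = self.char_emb(char_ids)        # (batch, word_len, char_dim)
        x = x.transpose(1, 2)              # (batch, char_dim, word_len)
        x = torch.relu(self.conv(x))       # (batch, num_filters, word_len)
        return x.max(dim=2).values         # (batch, num_filters) word embedding
```

The resulting word embeddings then play the same role as ordinary word embeddings in the word-level RNN, as described on the slide.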
Effect of word representation (Encoder)
● Word-based vs. Char-based model
● Char-based models are stronger.
7
Impact of word frequency
● Frequent words don’t need the character information.
● “The char-based model is able to learn character n-gram
patterns that are important for identifying word structure.”
8
Confusion matrices
9
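The slide shows confusion matrices of gold vs. predicted tags; as a hedged illustration, such a matrix could be computed from the probing classifier’s predictions like this (the tag lists below are made-up examples):

```python
# Sketch: building a tag confusion matrix from probing predictions.
from sklearn.metrics import confusion_matrix

gold_tags = ["NN", "NNS", "DT+NNS", "NNS"]   # hypothetical gold tags
pred_tags = ["NN", "NNS", "NNS", "NNS"]      # hypothetical predictions
labels = ["NN", "NNS", "DT+NNS"]

cm = confusion_matrix(gold_tags, pred_tags, labels=labels)
print(cm)  # rows: gold tags, columns: predicted tags
```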
Analyzing specific tags
● In Arabic, the determiner “Al-” is attached to the noun as a prefix.
● The char-based model can distinguish “DT+NNS” from “NNS”.
10
Effect of encoder depth
● The LSTM carries the context information → Layer 0 is worse.
● States from layer 1 are more effective than states from layer 2.
11
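To compare layers like this, each layer’s states must be accessible; one way to arrange that is to build the 2-layer encoder as two stacked single-layer LSTMs, as in the sketch below (an illustrative setup, not the authors’ implementation):

```python
# Sketch: exposing per-layer encoder states for probing.
import torch
import torch.nn as nn

emb_dim, hid_dim = 500, 500  # illustrative sizes
layer1 = nn.LSTM(emb_dim, hid_dim, batch_first=True)
layer2 = nn.LSTM(hid_dim, hid_dim, batch_first=True)

def encode(word_embs):
    """word_embs: (batch, seq_len, emb_dim) -- 'layer 0' on the slide.
    Returns the states of layers 0, 1 and 2 for probing."""
    h1, _ = layer1(word_embs)   # layer 1 states
    h2, _ = layer2(h1)          # layer 2 states
    return word_embs, h1, h2
```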
Effect of encoder depth
● Char-based models show similar tendencies.
12
Effect of encoder depth
● BLEU: 2-layer NMT > 1-layer NMT
○ word / char : +1.11 / +0.56
● Layer 1 learns word structure (representation)
● Layer 2 learns word meaning
● Representation alone < representation + meaning: the 2-layer model translates better
13
Effect of target language
● Translating into a morphologically-rich language is harder.
○ Arabic-English: 24.69
○ English-Arabic: 13.37
● “How does the target language affect the learned source
language representations?”
○ “Does translating into a morphologically-rich language require more
knowledge about source language morphology?”
● Experiment: Arabic - {Arabic, Hebrew, German, English}
○ Arabic-Arabic: Autoencoder
14
Result
15
Effect of target languages
● They expected translating into morphologically-rich languages would
make the model learn more about morphology. → Not confirmed
● Probing accuracy doesn’t simply follow the BLEU score
○ The autoencoder (Arabic-Arabic) couldn’t learn morphological representations.
○ If the model only has to recreate the input, it doesn’t need to learn morphology.
○ “A better translation model learns more informative representation.”
● Possible explanation
○ Arabic-English simply translates better than Arabic-Hebrew and Arabic-German.
○ The weaker models may not be able to afford to learn representations of
word structure.
16
Decoder Analysis
● Similar experiments
○ The decoder’s input is the correct previous word (teacher forcing).
○ The char-based decoder’s input is the char-based representation.
○ The char-based decoder’s output is still at the word level.
● Language pairs: Arabic-English and English-Arabic
17
Effect of decoder states
● Decoder states don’t carry much morphological information.
● BLEU doesn’t correlate with the accuracy
○ French-English: 37.8 BLEU / 54.26% accuracy
18
Effect of attention
● Encoder states
○ Task: creating a generic, close to language-independent representation of the
source sentence.
○ With attention, the encoder states serve as a memory the decoder looks back at.
○ When the model generates a noun, the attention attends to the source nouns.
● Decoder states
○ Task: using the encoder’s representation to generate the target sentence in a
specific language.
○ “Without the attention mechanism, the decoder is forced to learn more
informative representations of the target language.”
19
Effect of word representation (Decoder)
● Char-based representations don’t help the decoder
○ The decoder’s predictions are still done at word level.
○ “In Arabic-English the char-based model reduces the number of generated
unknown words in the MT test set by 25%.”
○ “In English-Arabic the number of unknown words remains roughly the same
between word-based and char-based models.”
20
Conclusion
● Their results lead to the following conclusions:
○ Char-based representations are better than word-based ones for learning morphology.
○ Lower layers capture morphology, while deeper layers improve translation
performance.
○ Translating into morphologically-poorer languages leads to better source
representations.
○ The attentional decoder learns impoverished representations that do not
carry much information about morphology.
● “Jointly learning translation and morphology can possibly
lead to better representations and improved translation.”
21
