Paper introduction: Dynamically Composing Domain-Data Selection with Clean-Data Selection by Co-Curricular Learning for Neural Machine Translation (ACL 2019)
Presenter: Hiroki Homma, Komachi Lab
November 25th, 2019
Abstract 1/2
• Noise and domain are important aspects of data quality for NMT.
• Existing research focuses separately on domain-data selection, clean-data selection, or their static combination. → ✘ No dynamic interaction
• This paper introduces a “co-curricular learning” method.
• It composes dynamic domain-data selection with dynamic clean-data selection, enabling transfer learning across both capabilities.
Abstract 2/2
• We apply an EM-style optimization procedure to further refine the “co-curriculum.”
• Experimental results and analysis on two domains demonstrate the effectiveness of the method and the properties of the data scheduled by the co-curriculum.
Introduction 1/4
• NMT has found successful use cases in domain translation and in helping other NLP applications.
• Given a source-side monolingual corpus, how can we use it to improve an NMT model so that it translates same-domain sentences well? Data selection plays an important role.
Introduction 2/4
• Data selection has been a fundamental research topic.
• Use language models to select parallel data → ✘ performs poorly on noisy data, such as large-scale, web-crawled datasets.
• The NMT community has realized how much data noise harms translation quality.
• Denoising methods (Koehn et al., 2018), clean-data selection → ✘ they require trusted parallel data.
Introduction 3/4
(Figure: clean-data selection and domain-data selection.)
Introduction 4/4
• We introduce a method to dynamically combine
clean-data selection and domain-data selection.
• We treat them as independent curricula and compose
them into a “co-curriculum.”
Contributions
1. “Co-curricular learning”, for transfer learning across data quality.
2. A curriculum optimization procedure to refine the co-curriculum.
3. We hope our work contributes towards a better understanding of data properties, such as noise, domain, or “easiness to learn”, and their interaction with the NMT network.
The method extends single-curriculum learning work in NMT and makes the existing domain-data selection method work better with noisy data.
While gaining some improvement with deep models, it surprisingly improves shallow models by 8-10 BLEU points. We find that bootstrapping seems to “regularize” the curriculum and make it easier for a small model to learn on.
Related Work 1/5
• Measuring Domain and Noise in Data
• Domain: Moore and Lewis (2010); Axelrod et al. (2011); van der Wees et al. (2017) — the cross-entropy difference between a general-domain LM 𝜗̃ and an in-domain LM 𝜗̂, computed on a source sentence 𝑥, is used to select domain sentences.
• Noise: Junczys-Dowmunt (2018); Wang et al. (2018) — the noise level of a sentence pair (𝑥, 𝑦) is measured with NMT models and trusted data, where 𝜗̃ is a baseline NMT model and 𝜗̂ is an NMT model trained on noisy data and then on cleaner, trusted data.
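To make the two scoring functions concrete, here is a reconstruction in the standard forms of the cited works (the exact notation and sign conventions are my own choice, oriented so that higher is better; they are not copied from the slides):

```latex
% Domain relevance score (Moore-Lewis cross-entropy difference), oriented so
% that higher means "more in-domain"; \tilde{\vartheta} is the general-domain LM
% and \hat{\vartheta} is the in-domain LM:
\varphi(x) \;=\; H_{\tilde{\vartheta}}(x) \;-\; H_{\hat{\vartheta}}(x)

% Noise / quality score (trusted-data NMT log-probability difference, as in
% Wang et al., 2018), higher means "cleaner"; \tilde{\vartheta} is the baseline
% NMT model and \hat{\vartheta} the model further trained on cleaner data:
\phi(x, y) \;=\; \frac{\log P(y \mid x;\, \hat{\vartheta}) \;-\; \log P(y \mid x;\, \tilde{\vartheta})}{|y|}
```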
Related Work 2/5
• Curriculum Learning for NMT
• Bengio et al. (2009) define curriculum learning (CL): training proceeds through a sequence of training criteria over examples (𝑥, 𝑦) in a dataset 𝐷, each defined by weights 𝑊_t(𝑥, 𝑦) over a training distribution 𝑃(𝑥, 𝑦), with the final criterion matching the final task of interest 𝑚_f.
• Traditional static selection can be seen as a special case of such a curriculum of tasks.
• There has already been rich research in CL for NMT.
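Written out with the symbols on the slide (the equation itself is my reconstruction of Bengio et al.'s standard formulation, not shown on the slide), a curriculum is a sequence of re-weighted training distributions that ends at the target distribution:

```latex
% A curriculum is a sequence of training distributions Q_t over examples (x, y)
% in D, obtained by re-weighting the training distribution P with weights W_t:
Q_t(x, y) \;\propto\; W_t(x, y)\, P(x, y), \qquad 0 \le W_t(x, y) \le 1
% with W_T(x, y) = 1 at the end of training, so that Q_T = P and the model is
% finally trained on the distribution matching the final task of interest m_f.
```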
Related Work 3/5 Curriculum learning (CL)
There has already been rich research in CL for NMT:
• Fine-tuning a baseline on in-domain parallel data is a good strategy (Thompson et al., 2018; Sajjad et al., 2017; Freitag and Al-Onaizan, 2016).
• Domain curriculum (van der Wees et al., 2017).
• Denoising curriculum based on noise level (Wang et al., 2018).
• Linguistically-motivated features and classifying examples into bins for scheduling (Kocmi and Bojar, 2017; Zhang et al., 2018).
• Reinforcement learning over the curriculum (Kumar et al., 2019).
• A CL framework to simplify and speed up training (Platanios et al., 2019).
• Adapting generic NMT models to a specific domain, observing faster training convergence (Zhang et al., 2019).
Related Work 4/5
Therefore, CL is a natural formulation for dynamic online data selection.
Related Work 5/5
• Our work is built on two types of data selection:
• Dynamic domain-data selection → domain curriculum 𝐶domain, which uses the neural LM (NLM) based scoring function (reprinted from the earlier slide).
• Dynamic clean-data selection → denoising curriculum 𝐶denoise, which uses the NMT-based scoring function (reprinted from the earlier slide).
• Ideally, we would have in-domain, trusted parallel data to design a true curriculum, 𝐶true.
More Related Work
• Junczys-Dowmunt (2018) : combine (static) features for data filtering
• Mansour et al. (2011) : combine an n-gram LM and IBM Model 1 for domain-data filtering
• We compose different types of dynamic online selection rather than
combining static features.
• Back translation (BT) is another important approach to using
monolingual data for NMT
• Sennrich et al., 2016
• We use monolingual data to seed data selection, rather than generating
parallel data directly from it.
Problem Setting 1/2
Three data resources:
• A background parallel dataset between languages 𝑋 and 𝑌: large (~100M pairs), diverse, noisy.
• An in-domain monolingual corpus in the source language 𝑋: middle-sized (1K–1M sentences), in the testing domain.
• A small, trusted, out-of-domain (OD) parallel dataset (~1K pairs).
From these we train the scoring functions:
• 𝜙, trained to sort data by noise level into a denoising curriculum.
• 𝜑, trained to sort data by domain relevance into a domain curriculum.
Problem Setting 2/2
• The setup, however, assumes that in-domain, trusted parallel data (data that is both in the testing domain and trusted) does not exist.
• Our goal is to use an easily available monolingual corpus and recycle existing trusted parallel data to reduce the cost of curating in-domain parallel data.
• Composed curriculum: the co-curriculum built from 𝐶domain and 𝐶denoise.
• We hope the composed curriculum comes close to the true curriculum 𝐶true.
Co-Curricular Learning
The idea, with a toy dataset of three examples: a travel-domain example, a noisy travel-domain example, and a well-translated medicine-domain example. The domain curriculum 𝐶domain ranks by domain relevance, the denoising curriculum 𝐶denoise ranks by translation quality, and the co-curriculum favors examples that are both in-domain and well-translated.
Curriculum Mini-Batching
• Dynamic data selection function: returns the top 𝜆(𝑡) of examples in dataset 𝐷, sorted by a scoring function 𝜙, at training step 𝑡 (sketched below).
• Pace function 𝜆(𝑡): controls how aggressively data is discarded over training; it is inspired by the exponential learning rate schedule.
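A minimal Python sketch of this dynamic selection, for illustration only: the function names, the exact exponential form of the pace, and the floor value are my assumptions; only the “top 𝜆(𝑡) of 𝐷 sorted by 𝜙” behaviour and the exponential-decay inspiration come from the slide.

```python
import math
import random

def pace(t, half_life=400_000, floor=0.1):
    """Exponentially decaying selection ratio lambda(t), clipped at a floor.
    The exponential form mirrors an exponential learning-rate schedule."""
    return max(floor, math.exp(-t * math.log(2) / half_life))

def curriculum_minibatch(dataset, score, t, batch_size=128):
    """Return a mini-batch drawn uniformly from the top lambda(t) fraction of
    `dataset`, ranked by the scoring function `score` (higher = better)."""
    keep = max(batch_size, int(len(dataset) * pace(t)))
    keep = min(keep, len(dataset))
    ranked = sorted(dataset, key=score, reverse=True)[:keep]
    return random.sample(ranked, min(batch_size, len(ranked)))
```

In a real system the scores would be precomputed and the ranking cached; the point is only that the pace function decides what fraction of the ranked pool is eligible at step 𝑡.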
We introduce two different co-curricula:
• Mixed Co-Curriculum (𝐶co^mix)
• Cascaded Co-Curriculum (𝐶co^cascade)
Mixed Co-Curriculum (𝐶co^mix)
• Simply adds up the domain scoring function and the denoising scoring function for a sentence pair (𝑥, 𝑦).
• But the values of 𝜑 and 𝜙 may not be on the same scale, or even come from the same family of distributions, so 𝐶co^mix may not be able to enforce either curriculum sufficiently.
• Selection assigns non-zero weights to the retained examples and uses uniform sampling within them.
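Written out (again my reconstruction, not copied from the slide), the mixed co-curriculum ranks by the simple sum of the two scores:

```latex
% Mixed co-curriculum score for a sentence pair (x, y): the domain score and
% the denoising score are added and used as a single ranking criterion.
\phi_{\text{mix}}(x, y) \;=\; \varphi(x) \;+\; \phi(x, y)
% Because \varphi and \phi may live on different scales (or come from different
% families of distributions), neither curriculum may be enforced sufficiently.
```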
Cascaded Co-Curriculum (𝐶co^cascade)
• Defines two selection functions and nests them, with separate data-discarding paces for clean-data selection and for domain-data selection (sketched below).
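A sketch of the cascaded selection under one plausible reading of the slide: the clean-data selection with pace β(𝑡) is applied first and the domain-data selection with pace γ(𝑡) is then applied inside the retained pool; the nesting order, names, and pace forms are my assumptions.

```python
def cascaded_selection(dataset, noise_score, domain_score, t, beta, gamma):
    """Nested dynamic selection: keep the top beta(t) fraction by the denoising
    score, then the top gamma(t) fraction of that pool by the domain score,
    so the overall kept fraction is beta(t) * gamma(t)."""
    by_noise = sorted(dataset, key=noise_score, reverse=True)
    clean_pool = by_noise[: max(1, int(len(by_noise) * beta(t)))]
    by_domain = sorted(clean_pool, key=domain_score, reverse=True)
    return by_domain[: max(1, int(len(by_domain) * gamma(t)))]
```

With the floor values given later in the setup (0.2 and 0.5), the cascade eventually retains the same top 0.1 of the data as the mixed variant's floor, while enforcing each criterion explicitly.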
Curriculum Optimization
(Figure: the curriculum-generation process. The scoring models and the co-curriculum are iteratively updated; the data selected by the co-curriculum serves as large, trusted, in-domain data for the next iteration.)
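A rough sketch of how I read the EM-style curriculum optimization from the figure; the loop structure is an assumption, and `train_denoising_scorer`, `train_domain_scorer`, and `top_co_curriculum_data` are hypothetical placeholder names, not functions from the paper or any library.

```python
def optimize_co_curriculum(background_data, trusted_od_data, mono_in_domain,
                           iterations=3):
    """EM-style refinement (hypothetical sketch): alternately (E) select data
    with the current co-curriculum and (M) re-train the scoring models on it."""
    phi = train_denoising_scorer(background_data, trusted_od_data)   # noise scorer
    varphi = train_domain_scorer(background_data, mono_in_domain)    # domain scorer
    for _ in range(iterations):
        # E-step: the current co-curriculum selects cleaner, more in-domain data.
        selected = top_co_curriculum_data(background_data, phi, varphi)
        # M-step: treat that selection as (pseudo) trusted, in-domain data and
        # re-estimate the scoring models (exactly which components are updated
        # each iteration is an assumption here).
        phi = train_denoising_scorer(background_data, selected)
        varphi = train_domain_scorer(background_data, selected)
    return phi, varphi
```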
Experiments
Setup 1/4
• We consider two background datasets and two test domains → four experiment configurations.
• Background datasets (each with an adapted sentence-piece model, shared 32k sub-words):
  • En→Fr Paracrawl: 300M pairs (noisier).
  • En→Fr WMT14 train: 40M pairs.
• Test domains:
  • En→Fr IWSLT15: spoken language domain.
  • En→Fr WMT14 test: news domain.
• In-domain monolingual corpora (English side): IWSLT15, 220K sentences; WMT14, 28M sentences.
• In-domain parallel data: IWSLT15, 220K pairs (spoken configuration); WMT14 train (news configuration).
• Trusted out-of-domain parallel data: WMT 2010-2011 test sets.
• Validation sets: IWSLT15 and WMT 2012-2013, respectively.
Setup 2/4
• RNN-based NMT (Wu et al., 2016) to train models
• The scoring models (the 𝜗’s parameterizing 𝜑 and 𝜙) are 512 dimensions by 3 layers
• NLMs are realized using NMT models with dummy source sentences (Sennrich
et al., 2016)
• Deep models are 1024 dimensions by 8 layers
• Unless specified, results are reported for deep models.
• truecased, detokenized BLEU with mteval-v14.pl
Setup 3/4
• Training on Paracrawl
• Adam in warmup
• SGD for a total of 3 million steps
• batch size 128
• learning rate 0.5, annealed at step 2 million down to 0.05
• no dropout (due to its large data volume)
• Training on WMT 2014
• batch size 96
• dropout probability 0.2 for a total of 2 million steps
• learning rate 0.5 annealed, at step 1.2 million, down to 0.05
Setup 4/4
• Pace hyper-parameters
  • 𝐻 = 𝐹 = 400k, 𝐺 = 900k
• Floor values
  • 𝜆, 𝛽, 𝛾 decay to top-0.1, top-0.2, and top-0.5 selection ratios, chosen empirically; note 0.1 = 0.2 × 0.5
All single-curriculum experiments use the same pace setting as the mixed co-curriculum (𝐶co^mix)
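As a small sanity check on these numbers, a hypothetical computation: only 𝐻 = 𝐹 = 400k, 𝐺 = 900k and the floor ratios 0.1, 0.2, 0.5 come from the slide; the exponential half-life form and which constant drives which pace are my assumptions.

```python
import math

def ratio(t, half_life, floor):
    # Hypothetical exponential pace, clipped at its floor.
    return max(floor, math.exp(-t * math.log(2) / half_life))

# Floors: lambda -> 0.1 (mixed), beta -> 0.2 and gamma -> 0.5 (cascaded), so at
# the floors the cascade keeps 0.2 * 0.5 = 0.1 of the data, matching the mixed
# curriculum's final selection ratio.
for t in (0, 400_000, 900_000, 2_000_000):
    lam = ratio(t, 400_000, 0.1)                               # assumed H = F = 400k
    cascade = ratio(t, 400_000, 0.2) * ratio(t, 900_000, 0.5)  # assumed G = 900k
    print(f"step {t}: mixed keeps {lam:.2f}, cascade keeps {cascade:.2f}")
```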
Baselines and Oracles 1/2
Baselines:
1. 𝐶random : Baseline model trained on background data with random data sampling.
2. 𝐶domain : Dynamically fine-tunes 𝐶random with a domain curriculum (van der Wees et al., 2017)
3. 𝐶denoise : Dynamically fine-tunes 𝐶random with a denoising curriculum (Wang et al., 2018)
Oracles:
4. 𝐶true : Dynamically fine-tunes 𝐶random with the true curriculum
5. ID fine-tune 𝐶random : Simply fine-tunes 𝐶random with in-domain (ID) parallel data
In most experiments, we fine-tune a warmed-up (baseline) model to compare curricula, for quicker experiment cycles.
Baselines and Oracles 2/2
• 𝐶domain > 𝐶random (P and W denote the Paracrawl and WMT rows of the results table)
• P3 > P2 (Paracrawl is noisy); W3 ≤ W2 (noise compromises the domain capability)
• W3 > W1, and P3 >> P1
• The true curriculum (P4, W4) bounds the performance of 𝐶domain and 𝐶denoise
• Simple in-domain fine-tuning gives good improvements
Co-Curricular Learning
• Per-step cascading works better than mixing, so we use 𝐶co^cascade for the remaining experiments.
• The co-curriculum improves over either constituent curriculum (P2 or P3) and over no CL, and can be close to the true curriculum on noisy data.
• On the cleaner WMT training data, the co-curriculum (W7) improves over either constituent curriculum (W2 and W3), with smaller gains than on Paracrawl.
Effect of Curriculum Optimization 1/5
Shallow models (512 dimensions × 3 layers).
Surprisingly, EM-3 improves the baseline by +10 BLEU on IWSLT15 and +8.2 BLEU on WMT14.
Effect of Curriculum Optimization 2/5
Deep models (1024 dimensions × 8 layers).
EM-style optimization further improves the domain curriculum, but overall it has a small impact on deep models.
Effect of Curriculum Optimization 3/5
Why the difference? – Per-word loss: curriculum learning and optimization push “easier-to-learn” (lower per-word loss) examples to the late curriculum (right) and harder examples (higher per-word loss) to the early curriculum (left).
Effect of Curriculum Optimization 4/5
Why the difference? – Standard deviation: curriculum learning and optimization push “regularized” (lower-variance) examples to the late curriculum and higher-variance examples to the early curriculum.
These properties may be important for a small-capacity model to learn efficiently; “clean” may already have taken most of the headroom for deep models.
Effect of Curriculum Optimization 5/5
Why the difference? – News-domain relevance: the denoising curriculum becomes more aware of the news domain after iterations.
All curves show negative news-domain relevance, indicating a lack of news data in Paracrawl.
The co-curriculum schedules data from hard to easier-to-learn, which seems opposite to general CL; but it also schedules data from less in-domain to cleaner and more in-domain, which is in the spirit of CL.
Retraining
Retraining with a curriculum may work better than fine-
tuning with it, on a large, noisy dataset.
Dynamic vs. Static Data Selection
• How does being dynamic matter?
• What if we retrain on the static data, too?
Curriculum learning works slightly better than fine-tuning a warmed-up model with a top static selection.
Curriculum learning
works better than
retraining with a static,
top selection, especially
when the training
dataset is small.
Discussion 1/3
• Evidence of data-quality transfer.
Curriculum learning in
one domain may enable
curriculum learning in
another.
Discussion 2/3
• Regularizing data without a teacher.
• Section “Effect of Curriculum Optimization” shows that
• the denoising scoring function and its bootstrapped versions
tend to regularize the late curriculum and make the
scheduled data easier for small models to learn on.
• One potential further application of this data property may be in
learning a multitask curriculum where regular data may be
helpful for multiple task distributions to work together in the
same model.
• We could thus regularize data by example selection, without a teacher (future work).
Discussion 3/3
• Pace function hyper-parameters.
• Data-discarding pace functions seem to work best when they
simultaneously decay down to their respective floors.
• Adaptively adjusting them is future work.
Conclusion
• We present a co-curricular learning method to make domain-
data selection work better on noisy data, by dynamically
composing it with clean-data selection.
• We show that the method improves over either constituent
selection and their static combination.
• We further refine the co-curriculum with an EM-style optimization procedure and show its effectiveness, especially on small-capacity models.
• In the future, we would like to extend the method to handle more than two curriculum objectives.