Neural Mask Generator:
Learning to Generate Adaptive Word
Maskings for Language Model Adaptation
Minki Kang1*, Moonsu Han1*, and Sung Ju Hwang1,2
KAIST1, Daejeon, South Korea
AITRICS2, Seoul, South Korea
Background
The recent success of neural language models is based on the scheme of
pre-training once and fine-tuning everywhere.
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Background
Recent language models (LMs) are pre-trained on large and heterogeneous
datasets.
[Figure: a general dataset (e.g. Wikipedia) is used for initial pre-training; a specific-domain dataset is then used for further pre-training.]
[Beltagy et al. 19] SciBERT: A Pretrained Language Model for Scientific Text, EMNLP 2019.
[Lee et al. 20] BioBERT: a pre-trained biomedical language representation model for biomedical text mining, Bioinformatics 2020.
[Gururangan et al. 20] Don’t stop Pre-training: Adapt Language Models to Domains and Tasks, ACL 2020.
Some works propose further pre-training for LM adaptation.
Background
The Masked Language Model (MLM) objective has been shown to be effective
for language model pre-training.

[Original] A myocardial infarction, also known as a heart attack, occurs when blood flow decreases.
[Model Input] A myocardial infarction, also known as a [MASK] attack, occurs when blood flow decreases.
[Model Output] heart
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
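To make the MLM objective concrete, here is a minimal sketch of filling in the [MASK] above with the HuggingFace transformers library; the model and tokenizer names are illustrative choices, not the setup used in this work.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

text = ("A myocardial infarction, also known as a [MASK] attack, "
        "occurs when blood flow decreases.")
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Locate the [MASK] position and decode the highest-scoring token.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
print(tokenizer.decode(logits[0, mask_pos].argmax(dim=-1)))  # likely "heart"
```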
Motivation
Will it be effective to further train the pre-trained language model on a
domain-specific corpus using randomly generated masks?
[Figure: the same sentence, "A myocardial infarction, also known as a heart attack, occurs when blood flow decreases.", passes through the Language Model; some candidate mask words are Important to the domain while others are Trivial, and random masking does not distinguish between them.]
Motivation
Although several heuristic masking policies have been proposed, none
of them is clearly superior to the others.
Original: A myo ##car ##dial in ##farc ##tion occurs when blood flow ...
Whole-word: A [MASK] [MASK] [MASK] in ##farc ##tion occurs when blood flow ...
Span: A myo ##car ##dial in ##farc [MASK] [MASK] [MASK] blood flow ...
Random: A myo [MASK] ##dial [MASK] ##farc ##tion occurs when [MASK] flow ...
In this work, we propose to generate the masks adaptively for the
given domain, by learning the optimal masking policy.
[Joshi et al. 20] SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL 2020.
[Sun et al. 19] Enhanced Representation through Knowledge Integration, arXiv 2019.
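As a point of reference, the Random baseline above can be sketched as below; this is a minimal illustration of BERT-style random masking assuming a 15% masking rate, with hypothetical function names.

```python
import random

def random_mask(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """BERT-style random masking: each token is independently
    chosen as a prediction target with probability mask_rate."""
    masked, labels = [], []
    for tok in tokens:
        if random.random() < mask_rate:
            masked.append(mask_token)
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # not a prediction target
    return masked, labels

tokens = "A myo ##car ##dial in ##farc ##tion occurs when blood flow".split()
print(random_mask(tokens))
```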
Motivation
Our objective is to find the task-dependent masking policy via a
learnable mask generator.
Problem Formulation
Masked Language Model
[Figure: an unannotated text corpus is masked with [MASK] tokens to form a masked text corpus; the language model parameters are trained to recover the original context from the masked context.]
Problem Formulation
Masked Language Model
Masked Context: A myo [MASK] ##dial [MASK] ##farc ##tion occurs when [MASK] flow ...
Original Context: A myo ##car ##dial in ##farc ##tion occurs when blood flow ...
Words (Tokens): $w_1$ = A, $w_2$ = myo, $w_3$ = ##car, $w_4$ = ##dial, $w_5$ = in, $w_6$ = ##farc, $w_7$ = ##tion, ...

$$z_i = \begin{cases} 1, & \text{if the } i\text{-th word is masked} \\ 0, & \text{otherwise} \end{cases}$$
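With this notation, one standard way to write the MLM loss (an assumption consistent with the slide's definitions, not copied from it) is:

$$\mathcal{L}_{\text{MLM}}(\theta) = -\sum_{i=1}^{N} z_i \log p_\theta\big(w_i \mid \hat{w}_{1:N}\big),$$

where $\hat{w}_{1:N}$ is the masked context and $\theta$ are the language model parameters.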
Problem Formulation
Bi-level formulation: Masking
The mask generator, an arbitrary function parameterized by $\lambda$, assigns each word $w_i$ ($i = 1, \dots, N$; e.g. ##car, ..., in) a probability of being masked; sampling from these probabilities yields the list of word indices to be masked, e.g. {3, 5, 10}.
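A minimal sketch of this sampling step, assuming the generator outputs a per-token probability vector; names are hypothetical:

```python
import torch

def sample_mask_indices(probs: torch.Tensor, num_masks: int) -> list:
    # probs: shape (N,), masking probability for each of the N tokens,
    # as produced by the mask generator. Sample `num_masks` distinct
    # indices without replacement.
    return torch.multinomial(probs, num_masks, replacement=False).tolist()

probs = torch.tensor([0.02, 0.03, 0.30, 0.05, 0.25, 0.05, 0.10, 0.05, 0.05, 0.10])
print(sample_mask_indices(probs, 3))  # e.g. [2, 4, 9]
```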
Problem Formulation
Bi-level formulation: Further Pre-Training (Inner Loop)
The language model is further pre-trained on the corpus masked by the policy parameterized by $\lambda$, yielding the further pre-trained language model.
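In symbols, one consistent way to write this inner step (assumed notation, matching the masking function above), with $g_\lambda(\mathcal{D})$ the masked corpus produced by the policy on the unannotated corpus $\mathcal{D}$:

$$\theta'(\lambda) = \arg\min_{\theta} \; \mathcal{L}_{\text{MLM}}\big(\theta;\, g_\lambda(\mathcal{D})\big).$$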
Problem Formulation
Bi-level formulation: Fine-tuning on the task (Inner Loop)
The downstream-task solver model is fine-tuned by minimizing the supervised-learning loss on the training dataset.
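Continuing the same assumed notation, with solver parameters $\phi$ and training set $\mathcal{D}^{\text{train}}$:

$$\big(\theta^*(\lambda), \phi^*(\lambda)\big) = \arg\min_{\theta,\,\phi} \; \mathcal{L}_{\text{task}}\big(\theta, \phi;\, \mathcal{D}^{\text{train}}\big), \quad \theta \text{ initialized at } \theta'(\lambda).$$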
Problem Formulation
Bi-level formulation: Outer-level objective (Outer Loop)
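Under the notation assumed above, the outer-level objective can be written as selecting the masking policy that maximizes downstream test performance:

$$\lambda^* = \arg\max_{\lambda} \; \text{Acc}\big(\theta^*(\lambda), \phi^*(\lambda);\, \mathcal{D}^{\text{test}}\big).$$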
Problem Formulation
Reinforcement learning formulation
Policy: assigns each word $w_i$ ($i = 1, \dots, N$) in the input context a probability of being masked.
Actions: the word indices selected for masking.
Reward: $R$, derived from the accuracy of the fine-tuned solver on the test set.
Problem Formulation
Reinforcement learning formulation
Transition probability: the probability of masking $T$ tokens, one per step.

Example (MDP):
t=1: The cat is cute .
t=2: The [MASK] is cute .
t=3: The [MASK] is [MASK] .

Example (Approximation):
The cat is cute . → The [MASK] is [MASK] . (all masks sampled at once)
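One way to express this approximation (assumed notation): instead of conditioning each masking action on the partially masked state $s_t$, all $T$ actions are sampled in a single step from the per-token distribution at the initial state,

$$\pi_\lambda(a_{1:T} \mid s_1) = \prod_{t=1}^{T} \pi_\lambda(a_t \mid s_t) \;\approx\; \prod_{t=1}^{T} \pi_\lambda(a_t \mid s_1).$$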
Neural Mask Generator
Neural Mask Generator
Training objective
1. Advantage Actor-Critic (A2C)
2. Off-Policy learning with Prioritized Experience Replay
3. Importance Sampling
Neural Mask Generator
Training objective
The objective combines an actor-critic loss over sampled replays with entropy regularization.
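A generic objective consistent with these three ingredients, written as a sketch under assumed notation (the paper's exact loss may differ in details such as the priority weighting):

$$\mathcal{L}(\lambda) = -\,\mathbb{E}_{(s,a,R)\sim \mathcal{B}}\Big[\rho\,\big(R - V_\psi(s)\big)\log \pi_\lambda(a \mid s)\Big] \;-\; \beta\,\mathcal{H}\big(\pi_\lambda(\cdot \mid s)\big),$$

where $\mathcal{B}$ is the prioritized replay buffer, $\rho$ the importance-sampling correction for off-policy samples, $V_\psi$ a learned critic, and $\beta$ the entropy-regularization coefficient.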
Neural Mask Generator
Some practical problems remain for reinforcement learning:
1. Using the full dataset in the inner loop is not feasible.
2. The test dataset is unobservable during training.
We therefore sample small subsets of the data at each episode, as sketched below.
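A minimal sketch of this sampling step, with hypothetical names; a held-out split of the labeled data stands in for the unobservable test set when computing rewards:

```python
import random

def sample_subtask(corpus, labeled, n_corpus, n_train, n_dev):
    # Each episode sees only small random subsets, keeping the
    # inner loop (further pre-train + fine-tune) affordable.
    sub_corpus = random.sample(corpus, n_corpus)
    sub_labeled = random.sample(labeled, n_train + n_dev)
    # The dev split acts as a proxy test set for the reward.
    return sub_corpus, sub_labeled[:n_train], sub_labeled[n_train:]
```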
Neural Mask Generator
The NMG model encounters a different sub-task at every new episode.
[Figure: the sampled sub-task data differ across episodes, while the pre-trained language model (BERT) is the same across episodes; raw accuracies (e.g. 0.6 in Episode 1 vs. 0.35 in Episode 2) are therefore not directly comparable as rewards.]
Neural Mask Generator
We introduce the random policy as an opponent policy.
[Figure: within each episode, the neural policy's accuracy is compared to the random opponent's accuracy on the same sub-task (e.g. Episode 1: 0.6 vs. 0.54; Episode 2: 0.35 vs. 0.4), which makes rewards comparable across episodes.]
Neural Mask Generator
We add another neural policy to induce self-play.

[Figure: the neural player (actions e.g. {1, 5, 7}) competes against both a self-play neural opponent (e.g. {1, 5, 9}) and a random opponent (e.g. {4, 5, 7}); beating an opponent's accuracy (e.g. 0.62 vs. 0.6 and 0.54) yields a positive reward, losing yields a negative reward.]
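As an illustration only (the paper's exact reward shaping may differ, e.g. it could use the accuracy gap itself), a sign-based reward against the strongest opponent could look like:

```python
def reward(player_acc: float, opponent_accs: list) -> float:
    # Assumption: +1 when the neural player beats every opponent
    # (random and self-play) on the same sub-task, -1 when it
    # loses to the best of them, 0 on a tie.
    best = max(opponent_accs)
    if player_acc > best:
        return 1.0
    if player_acc < best:
        return -1.0
    return 0.0
```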
Neural Mask Generator
In each episode, the language model for each policy is initialized.
[Figure: in each of Episodes 1 and 2, the language model is initialized from the original pre-trained checkpoint, then goes through Further Pre-training → Fine-tuning → Evaluation. Other policies are omitted for brevity.]
Neural Mask Generator
Continual adaptation: instead of re-initializing, load the LM from the former episode.
[Figure: Episode 1 starts from the initialized LM; Episode 2 loads the LM adapted in Episode 1 before its own Further Pre-training → Fine-tuning → Evaluation. Other policies are omitted for brevity.]
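Putting the episode structure together, a hedged pseudocode sketch of the whole loop; every helper below is hypothetical, named only to mirror the figures (sample_subtask reuses the earlier sketch):

```python
lm = load_pretrained_lm("bert-base-uncased")      # hypothetical helper
for episode in range(num_episodes):
    corpus, train, dev = sample_subtask(full_corpus, labeled, 1000, 500, 500)
    masked_corpus = apply_masking_policy(mask_generator, corpus)
    # Continual adaptation: reuse `lm` from the previous episode
    # instead of re-initializing it from the original checkpoint.
    lm = further_pretrain(lm, masked_corpus)       # inner loop, step 1
    solver = finetune(lm, train)                   # inner loop, step 2
    acc = evaluate(solver, dev)                    # proxy test accuracy
    update_policy(mask_generator, acc)             # outer loop (RL update)
```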
Experiments
1) Question Answering
• SQuAD v1.1
• emrQA
• NewsQA
2) Text Classification
• IMDb
• ChemProt
Datasets
1) Question Answering
• BERT
• DistilBERT
2) Text Classification
• BERT
Language Models
Experiments
• No Pre-training
• Random Masking (Devlin et al. 19)
• Whole-Random Masking (Devlin et al. 19)
• Span-Random Masking (Joshi et al. 20)
• Entity-Random Masking (Sun et al. 19)
• Punctuation-Random Masking
Baselines
[Joshi et al. 20] SpanBERT: Improving Pre-training by Representing and Predicting Spans, TACL 2020.
[Sun et al. 19] Enhanced Representation through Knowledge Integration, arXiv 2019.
[Devlin et al. 19] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, NAACL 2019
Results
[Text Classification Results] [Ablation Results]
Results
Analysis
[Example from NewsQA]
[Top-6 Part-of-Speech Tags of Masked Words on NewsQA]
Conclusion
• We proposed the Neural Mask Generator (NMG), which learns an adaptive
masking policy to adapt a language model to a new domain.
• We formulated the problem of learning the optimal masking policy as a bi-level
meta-learning framework, optimized with reinforcement learning.
• Experimental results on multiple NLU tasks show that NMG generates
adaptive word maskings for a given domain, yielding performance better than
or at least comparable to the best-performing heuristic masking policy.
Code is available at https://github.com/Nardien/NMG
Thank you