DEBERTA
Decoding-Enhanced BERT with Disentangled Attention
Arxiv 2020.06.05 || ICLR 2021
🤗
Deep Learning Paper Reading Group (4th cohort), NLP Team
진명훈 김수빈 신문종 아이린 최상우
TABLE OF CONTENTS
2
Introduction
Background
3 Contributions (2 Q&A)
Experiment
Conclusion
Introduction
3
• Model proposed in He et al., 2020
• DeBERTa: Decoding-enhanced BERT with Disentangled Attention
• Built on Google's BERT (2018) and Facebook's (now Meta) RoBERTa (2019)
• RoBERTa + disentangled attention + enhanced mask decoder
• Trained with half of the data used for RoBERTa (80GB)
• Introduces Scale Invariant Fine-Tuning (SIFT)
• Merged into 🤗 transformers via PR #5929
• Outperforms RoBERTa on a majority of NLU tasks
• e.g., SQuAD, MNLI and RACE
Background
4
• Positional Information
• Masked Language Model
• Adversarial Training
Background: Positional Information
5
• The standard self-attention mechanism lacks a natural way to encode
word position information
• Add a positional bias (ref: the group's earlier Rotary Embedding presentation); the two flavors are contrasted in the sketch after this list
• Absolute Position Embedding
• Relative Position Embedding
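As a concrete reference, here is a minimal PyTorch sketch (not from the paper) contrasting the two flavors of positional bias; the sizes and module names are purely illustrative.

```python
import torch
import torch.nn as nn

seq_len, dim, n_heads = 8, 64, 4
tok_emb = torch.randn(1, seq_len, dim)                 # content (token) embeddings

# Absolute position embedding: position is mixed into the input representation once
abs_pos = nn.Embedding(512, dim)
positions = torch.arange(seq_len).unsqueeze(0)         # [1, L]
h = tok_emb + abs_pos(positions)                       # [1, L, dim]

# Relative position embedding: a bias on the attention logits that depends only on i - j
rel_bias = nn.Embedding(2 * seq_len - 1, n_heads)
rel_idx = positions - positions.T + (seq_len - 1)      # [L, L], values in [0, 2L-2]
scores = torch.randn(1, n_heads, seq_len, seq_len)     # content-to-content logits
scores = scores + rel_bias(rel_idx).permute(2, 0, 1).unsqueeze(0)
```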
Background: Masked Language Model
6
• Large-scale Transformer-based PLMs are typically pre-trained on large
amounts of text to learn contextual word representations using a self-supervision objective, known as Masked Language Model (MLM)
$\max_\theta \log p_\theta(X \mid \tilde{X}) \approx \max_\theta \sum_{i \in C} \log p_\theta(\tilde{x}_i = x_i \mid \tilde{X})$
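A minimal sketch of this objective, assuming a generic `model` callable that returns per-token vocabulary logits; the 15% rate and the [MASK] handling are simplified relative to real BERT-style masking.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    """Corrupt X into X~ by masking a random subset C, then score log p(x_i | X~) only on C."""
    labels = input_ids.clone()
    is_masked = torch.rand(input_ids.shape) < mask_prob           # the index set C
    corrupted = input_ids.masked_fill(is_masked, mask_token_id)   # X~
    logits = model(corrupted)                                     # [batch, seq_len, vocab]
    labels[~is_masked] = -100                                     # ignore unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```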
Background: Adversarial Training
7
• Train the model on clean data
• Train on clean data plus adversarial samples together
• Generalization performance improves
3 Contributions
8
• Disentangled attention
• Decomposes attention additively, as in Transformer-XL
• Unlike Shaw and Transformer-XL, keeps the position-to-content term
• This also captures how attention changes with the position of the query token
• The position-to-position term is removed, since it is unnecessary under relative position embedding
• Enhanced Mask Decoder
• "A new store opened beside the new mall"
• Absolute position information matters too!
• Scale Invariant Fine-Tuning
• Adversarial training helps model generalization
• In NLP, the variance of embedding-vector norms differs model by model and word by word
• So normalize the word embeddings first, then add the perturbation!
Disentangled Attention
9
• Disentangled Attention: a two-vector approach to content and position embedding
• The paper proposes the formulation below, decomposing each token representation into two vectors, one for its content and one for its position
• This decomposition was in fact already proposed in Transformer-XL!
$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
10
17.06.12 Transformer
18.03.06 Shaw RPE
18.09.12 Music Transformer
19.01.09 Transformer-XL
19.06.19 XLNet
19.10.23 T5
20.06.05 DeBERTa
History: Relative Position Embedding
11
Transformer upgrade: inject position information directly into the layers!
Self-Attention with Relative Position Representations
Music Transformer
History: Relative Position Embedding
12
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
(a) content-to-content, (b) content-to-position, (c) position-to-content, (d) position-to-position
History: Relative Position Embedding
13
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
(a) content-to-content, (b) content-to-position, (c) position-to-content, (d) position-to-position
$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
14
Let's analyze this formula:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
First fix some shorthand: let $Q = W_q E_{x_i}$, $U_q = W_q U_i$, $K = W_k E_{x_j}$, $U_k = W_k U_j$.
The formula then simplifies to
$A_{i,j}^{abs} = Q^T K + Q^T U_k + U_q^T K + U_q^T U_k$
Grouping the terms:
$A_{i,j}^{abs} = (Q + U_q)^T (K + U_k)$
In other words, it can be written as
$A_{i,j}^{abs} = (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j)$
[Figure: toy attention example — tokens $t_1, t_2, t_3$, </s> pass through embedding + projection to queries $q_1..q_4$ ($QW_i^Q \in \mathbb{R}^{4\times 3}$) and keys $k_1..k_4$ ($KW_i^K \in \mathbb{R}^{3\times 4}$), producing a row-normalized attention matrix $Att \in \mathbb{R}^{4\times 4}$.]
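A quick numerical sanity check of the grouping above, with random matrices standing in for the embeddings and projections:

```python
import numpy as np

d, L = 16, 8
rng = np.random.default_rng(0)
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
E_i, E_j = rng.normal(size=(d, L)), rng.normal(size=(d, L))   # content embeddings (one per column)
U_i, U_j = rng.normal(size=(d, L)), rng.normal(size=(d, L))   # absolute position embeddings

Q, Uq = Wq @ E_i, Wq @ U_i      # the shorthands defined above
K, Uk = Wk @ E_j, Wk @ U_j

four_terms = Q.T @ K + Q.T @ Uk + Uq.T @ K + Uq.T @ Uk
factored = (Q + Uq).T @ (K + Uk)                              # = (E_i + U_i)^T Wq^T Wk (E_j + U_j)
assert np.allclose(four_terms, factored)
```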
History: Relative Position Embedding
15
BERT-style learned absolute positions, in the factored form derived on the previous slide:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j = (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j)$
What did Shaw do? Learned relative positions, keeping only the content-to-content and content-to-position terms:
$A_{i,j}^{Shaw\_RPE} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k R_{i-j}$
What did Dai (first author of Transformer-XL) do? Sinusoid relative positions $R_{i-j}$, with global query vectors $u$ and $v$:
$A_{i,j}^{RPE} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}$
History: Relative Position Embedding
Shaw's RPE: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_{m-n}$
Transformer-XL: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T \widetilde{W}_k \tilde{p}_{m-n} + u^T W_k x_n + v^T \widetilde{W}_k \tilde{p}_{m-n}$
T5: $q_m^T k_n = x_m^T W_q^T W_k x_n + b_{m,n}$
History: Relative Position Embedding
Shaw's RPE: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_{m-n}$
Transformer-XL: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T \widetilde{W}_k \tilde{p}_{m-n} + u^T W_k x_n + v^T \widetilde{W}_k \tilde{p}_{m-n}$
T5: $q_m^T k_n = x_m^T W_q^T W_k x_n + b_{m,n}$
DeBERTa: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^T W_q^T W_k x_n$
Disentangled Attention
18
• Existing RPE approaches such as Shaw et al. compute attention weights from only the content-to-content (a) and content-to-position (b) terms
• But attention weights cannot be modeled in one direction only:
• the position-to-content (c) term matters as well!
• With relative position embedding, the position-to-position (d) term is already accounted for, so it is dropped
Disentangled Attention
19
• k: maximum relative distance
• $\delta(i,j) \in [0, 2k)$
• $\delta(i,j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise} \end{cases}$ (see the sketch below)
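A direct transcription of δ(i, j) as a sketch; the 🤗 implementation linked on the next slide vectorizes the same bucketing over the whole sequence.

```python
def delta(i: int, j: int, k: int) -> int:
    """Map the signed distance i - j into a relative-position bucket in [0, 2k)."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```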
Disentangled Attention
20
https://github.com/huggingface/transformers/blob/4210579522f8b288c3ae6c646e8a7f2e3a941c76/src/transformers/models/deberta/modeling_deberta.py#L660
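Below is a minimal single-head sketch of the disentangled score, the three retained terms (c2c, c2p, p2c) plus the 1/sqrt(3d) scaling described in the paper. It is not the linked 🤗 code; `disentangled_scores` and its arguments are illustrative names, and δ is implemented with a simple clamp.

```python
import torch

def disentangled_scores(H, rel_emb, Wq_c, Wk_c, Wq_r, Wk_r, k=2):
    """Single-head sketch: c2c + c2p + p2c attention logits (the p2p term is dropped).

    H:       [L, d]  content hidden states
    rel_emb: [2k, d] relative position embeddings P
    """
    L, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c                  # content projections
    Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r      # relative-position projections

    i = torch.arange(L).unsqueeze(1)
    j = torch.arange(L).unsqueeze(0)
    delta = torch.clamp(i - j + k, 0, 2 * k - 1)        # the bucketing from the previous slide

    c2c = Qc @ Kc.T                                      # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, delta)              # Qc_i · Kr_{delta(i, j)}
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T            # Kc_j · Qr_{delta(j, i)}

    return (c2c + c2p + p2c) / (3 * d) ** 0.5            # scaled by 1/sqrt(3d)

# toy usage
L, d, k = 6, 8, 2
H, P = torch.randn(L, d), torch.randn(2 * k, d)
Ws = [torch.randn(d, d) for _ in range(4)]
att = torch.softmax(disentangled_scores(H, P, *Ws, k=k), dim=-1)   # [L, L]
```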
Enhanced Mask Decoder
21
• DeBERTa is pre-trained with MLM
• MLM exploits both the content and the position information of the context words
• But absolute positions are not taken into account
• e.g.,
• "A new store opened beside the new mall"
• BERT injects absolute positions at the input layer
• DeBERTa injects absolute positions after all the Transformer layers, right before the softmax layer for masked token prediction (sketched below)
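A rough sketch of that decoding path, based on the "DeBERTaForMaskedLM with EDM" diagram later in the deck rather than the actual repo code; plain nn.MultiheadAttention stands in for the disentangled-attention layers, and all names are illustrative.

```python
import torch
import torch.nn as nn

def enhanced_mask_decoder(H, abs_pos_emb, shared_layers, lm_head):
    """H: [B, L, d] final encoder output; abs_pos_emb: [1, L, d] absolute position embeddings.

    Absolute positions enter only here, right before masked-token prediction.
    """
    query = H + abs_pos_emb                  # the "query state (I)" of the later diagram
    for attn in shared_layers:               # n = 2 layers, shared per the slides
        query, _ = attn(query, H, H)         # query carries the positions; key/value stay H
    return lm_head(query)                    # lm_logits over the vocabulary

# toy usage
B, L, d, vocab = 2, 8, 32, 100
H, pos = torch.randn(B, L, d), torch.randn(1, L, d)
layers = [nn.MultiheadAttention(d, num_heads=4, batch_first=True) for _ in range(2)]
logits = enhanced_mask_decoder(H, pos, layers, nn.Linear(d, vocab))   # [B, L, vocab]
```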
Enhanced Mask Decoder
22
Enhanced Mask Decoder
23
https://github.com/microsoft/DeBERTa/blob/master/experiments/language_model/mlm.sh
Enhanced Mask Decoder
24
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Fix the random seed
Return the tokenizer and task object
Eval, test data load
Load train data and get model
Enhanced Mask Decoder
25
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/tasks/mlm_task.py
Enhanced Mask Decoder
26
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/models/masked_language_model.py
Enhanced Mask Decoder
27
https://youtu.be/gcMyKUXbY8s?t=1198
BertForMaskedLM
28
[Diagram: sub-word, token type, and absolute position embeddings are summed at the input (CLS 딥 ##러 MASK 논 MASK 모임 SEP) and passed through a stack of Transformer layers (drawn here with the same disentangled-attention blocks as the next slide); the embeddings and Transformer stack form the BERT Module, each layer yields an encoder output, and lm_head maps the final encoder output to lm_logits, with lm_loss computed against the original tokens (CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
DeBERTaForMaskedLM with EDM
29
[Diagram: the DeBERTa Module sums the sub-word and token type embeddings (and, per (1), optionally the absolute position embedding) and runs the stack of Transformer layers with disentangled attention over the masked input (CLS 딥 ##러 MASK 논 MASK 모임 SEP), producing the encoder output H; the EDM Module (n = 2) adds the absolute position embedding to H to form the query state I, runs two more Transformer layers with disentangled attention using I as query and the encoder output as key/value, and feeds the result through lm_head to get lm_logits and lm_loss against the original tokens (CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
• (1) The absolute position embedding is added at the input only when the position_biased_input option is True
• The pink Transformer layers (the EDM layers) are shared
• lm_head is shared with the word embedding matrix
• According to the authors, leaving EDM out does not affect the convergence of the PLM
• It only slightly affects the perplexity of MLM training
Scale Invariant Fine-Tuning
30
• Virtual adversarial training is a regularization method
• It strengthens the model's generalization performance
• The goal is to add a small perturbation (noise) to the input so that the model still makes the same output prediction under adversarial attack
• In NLP tasks the perturbation is applied to the word embeddings
• However, the norms of the embedding vectors differ model by model and word by word
• The bigger the model, the larger this variance, which makes adversarial training more unstable
• Inspired by layer normalization, SIFT performs adversarial fine-tuning by adding the perturbation to normalized word embeddings
• It was applied only to the 1.5B model; a comprehensive study is left for future work
Scale Invariant Fine-Tuning
31
https://github.com/microsoft/DeBERTa/tree/master/DeBERTa/sift
Scale Invariant Fine-Tuning
32
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Scale Invariant Fine-Tuning
33
[Diagram: the Embedding Module sums the sub-word, token type, and (if position_biased_input) absolute position embeddings and applies LayerNorm; the DeBERTa Module encodes the input (CLS 딥 ##러 ##닝 논 ##문 모임 SEP); a task-specific layer (SuperGLUE) produces per-token predictions (O / B-XX / I-XX), and the loss ℒ(logits, golden_truth) is computed. The SIFT hook attaches at the embedding LayerNorm.]
Scale Invariant Fine-Tuning
34
[Diagram: same pipeline as the previous slide; the SIFT hook wraps the embedding output as LayerNorm(inputs).]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
Scale Invariant Fine-Tuning
35
[Diagram: same pipeline; on top of LayerNorm(inputs), a perturbation delta ~ N(0, 0.02) is added to form the adversarial input, and delta is clamped whenever δ ≥ 0.04 or δ ≤ −0.04 (see the sketch below).]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
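A minimal sketch of the perturbation step shown on this slide; the N(0, 0.02) initialization and the ±0.04 clamp are taken from the slide, while the adversarial update of delta and the actual hook mechanics in sift.py are omitted, and the LayerNorm is applied inline rather than via the embedding-module hook.

```python
import torch
import torch.nn.functional as F

def sift_perturb(embeddings: torch.Tensor, eps: float = 0.04) -> torch.Tensor:
    """Normalize the word embeddings, then add a small clamped perturbation (SIFT-style forward pass)."""
    normed = F.layer_norm(embeddings, embeddings.shape[-1:])   # the scale-invariant part
    delta = torch.empty_like(normed).normal_(0, 0.02)          # delta ~ N(0, 0.02), per the slide
    return normed + delta.clamp(-eps, eps)                     # clamp to [-0.04, 0.04]
```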
Experiment: Pre-training
36
Experiment: Pre-training
37
• Dynamic data batching, as in RoBERTa
• Span masking, as in SpanBERT (see the sketch below)
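A rough sketch of span masking in the spirit of SpanBERT (geometric span lengths, ~15% budget); the exact sampling used for DeBERTa pre-training is not spelled out on the slide.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Span length ~ Geometric(p), truncated at max_len (as in SpanBERT)."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def span_mask(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Mask contiguous spans until roughly mask_prob of the tokens are covered."""
    out = list(tokens)
    budget = max(1, int(len(out) * mask_prob))
    masked = set()
    while len(masked) < budget:
        start = random.randrange(len(out))
        for i in range(start, min(start + sample_span_length(), len(out))):
            masked.add(i)
    for i in masked:
        out[i] = mask_token
    return out

print(span_mask("a new store opened beside the new mall".split()))
```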
Experiment: Pre-training
38
Experiment: Pre-training
39
Experiment: Fine-tuning
40
Experiment: Fine-tuning
41
Experiment: Fine-tuning
42
Experiment: Ablation study
43
Experiment: SuperGLUE
44
Experiment: SIFT
45
Experiment: DeBERTa v2
46
https://huggingface.co/docs/transformers/model_doc/deberta_v2
Experiment: DeBERTa v2
47
Conclusion
48
• Improves on RoBERTa with disentangled attention and the enhanced mask decoder
• Proposes SIFT to improve model generalization on downstream tasks
• Surpasses human performance on the SuperGLUE benchmark in terms of macro score
• Human-level intelligence has not been reached yet, though
• DeBERTa V3 came out recently → it will be wrapped up and presented in the next session.