DEBERTA
Decoding-Enhanced BERT with Disentangled Attention
Arxiv 2020.06.05 || ICLR 2021
🤗
Deep Learning Paper Reading Group (4th cohort), NLP Team
진명훈 김수빈 신문종 아이린 최상우
TABLE OF CONTENTS
2
Introduction
Background
3 Contributions (2 Q&A)
Experiment
Conclusion
Introduction
3
• Model proposed in He et al., 2020
• DeBERTa: Decoding-enhanced BERT with Disentangled Attention
• Built on Google's BERT (2018) and Facebook's (now Meta) RoBERTa (2019)
• RoBERTa + disentangled attention + enhanced mask decoder
• Trained with half of the data used for RoBERTa (80GB)
• Introduces Scale Invariant Fine-Tuning (SIFT)
• Merged into 🤗 transformers via PR #5929
• Outperforms RoBERTa on a majority of NLU tasks
• e.g., SQuAD, MNLI and RACE
Background
4
• Positional Information
• Masked Language Model
• Adversarial Training
Background: Positional Information
5
• The standard self-attention mechanism lacks a natural way to encode
word position information
• Add a positional bias (ref: the group's earlier Rotary Embedding presentation); the two flavors are contrasted in the sketch after this list
• Absolute Position Embedding
• Relative Position Embedding
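As a concrete reference, here is a minimal PyTorch sketch (not from the paper) contrasting the two flavors of positional bias; the sizes and module names are purely illustrative.

```python
import torch
import torch.nn as nn

seq_len, dim, n_heads = 8, 64, 4
tok_emb = torch.randn(1, seq_len, dim)                 # content (token) embeddings

# Absolute position embedding: position is mixed into the input representation once
abs_pos = nn.Embedding(512, dim)
positions = torch.arange(seq_len).unsqueeze(0)         # [1, L]
h = tok_emb + abs_pos(positions)                       # [1, L, dim]

# Relative position embedding: a bias on the attention logits that depends only on i - j
rel_bias = nn.Embedding(2 * seq_len - 1, n_heads)
rel_idx = positions - positions.T + (seq_len - 1)      # [L, L], values in [0, 2L-2]
scores = torch.randn(1, n_heads, seq_len, seq_len)     # content-to-content logits
scores = scores + rel_bias(rel_idx).permute(2, 0, 1).unsqueeze(0)
```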
Background: Masked Language Model
6
• Large-scale Transformer-based PLMs are typically pre-trained on large
amounts of text to learn contextual word representations using a self-supervision objective, known as Masked Language Model (MLM)
$\max_\theta \log p_\theta(X \mid \tilde{X}) \approx \max_\theta \sum_{i \in C} \log p_\theta(\tilde{x}_i = x_i \mid \tilde{X})$
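A minimal sketch of this objective, assuming a generic `model` callable that returns per-token vocabulary logits; the 15% rate and the [MASK] handling are simplified relative to real BERT-style masking.

```python
import torch
import torch.nn.functional as F

def mlm_loss(model, input_ids, mask_token_id, mask_prob=0.15):
    """Corrupt X into X~ by masking a random subset C, then score log p(x_i | X~) only on C."""
    labels = input_ids.clone()
    is_masked = torch.rand(input_ids.shape) < mask_prob           # the index set C
    corrupted = input_ids.masked_fill(is_masked, mask_token_id)   # X~
    logits = model(corrupted)                                     # [batch, seq_len, vocab]
    labels[~is_masked] = -100                                     # ignore unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)), labels.view(-1), ignore_index=-100)
```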
Background: Adversarial Training
7
• Train the model on clean data
• Train on clean data plus adversarial samples together
• Generalization performance improves
3 Contributions
8
• Disentangled attention
• Decomposes attention additively, as in Transformer-XL
• Unlike Shaw and Transformer-XL, keeps the position-to-content term
• This also captures how attention changes with the position of the query token
• The position-to-position term is removed, since it is unnecessary under relative position embedding
• Enhanced Mask Decoder
• "A new store opened beside the new mall"
• Absolute position information matters too!
• Scale Invariant Fine-Tuning
• Adversarial training helps model generalization
• In NLP, the variance of embedding-vector norms differs model by model and word by word
• So normalize the word embeddings first, then add the perturbation!
Disentangled Attention
9
• Disentangled Attention: a two-vector approach to content and position embedding
• The paper proposes the formulation below, decomposing each token representation into two vectors, one for its content and one for its position
• This decomposition was in fact already proposed in Transformer-XL!
$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
10
17.06.12 Transformer
18.03.06 Shaw RPE
18.09.12 Music Transformer
19.01.09 Transformer-XL
19.06.19 XLNet
19.10.23 T5
20.06.05 DeBERTa
History: Relative Position Embedding
11
Transformer upgrade: inject position information directly into the layers!
Self-Attention with Relative Position Representations
Music Transformer
History: Relative Position Embedding
12
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
(a) content-to-content, (b) content-to-position, (c) position-to-content, (d) position-to-position
History: Relative Position Embedding
13
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
XLNet: Generalized Autoregressive Pretraining for Language Understanding
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
(a) content-to-content, (b) content-to-position, (c) position-to-content, (d) position-to-position
$A_{i,j} = \{H_i, P_{i|j}\} \times \{H_j, P_{j|i}\}^T = H_i H_j^T + H_i P_{j|i}^T + P_{i|j} H_j^T + P_{i|j} P_{j|i}^T$
History: Relative Position Embedding
14
Let's analyze this formula:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j$
First fix some shorthand: let $Q = W_q E_{x_i}$, $U_q = W_q U_i$, $K = W_k E_{x_j}$, $U_k = W_k U_j$.
The formula then simplifies to
$A_{i,j}^{abs} = Q^T K + Q^T U_k + U_q^T K + U_q^T U_k$
Grouping the terms:
$A_{i,j}^{abs} = (Q + U_q)^T (K + U_k)$
In other words, it can be written as
$A_{i,j}^{abs} = (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j)$
[Figure: toy attention example — tokens $t_1, t_2, t_3$, </s> pass through embedding + projection to queries $q_1..q_4$ ($QW_i^Q \in \mathbb{R}^{4\times 3}$) and keys $k_1..k_4$ ($KW_i^K \in \mathbb{R}^{3\times 4}$), producing a row-normalized attention matrix $Att \in \mathbb{R}^{4\times 4}$.]
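A quick numerical sanity check of the grouping above, with random matrices standing in for the embeddings and projections:

```python
import numpy as np

d, L = 16, 8
rng = np.random.default_rng(0)
Wq, Wk = rng.normal(size=(d, d)), rng.normal(size=(d, d))
E_i, E_j = rng.normal(size=(d, L)), rng.normal(size=(d, L))   # content embeddings (one per column)
U_i, U_j = rng.normal(size=(d, L)), rng.normal(size=(d, L))   # absolute position embeddings

Q, Uq = Wq @ E_i, Wq @ U_i      # the shorthands defined above
K, Uk = Wk @ E_j, Wk @ U_j

four_terms = Q.T @ K + Q.T @ Uk + Uq.T @ K + Uq.T @ Uk
factored = (Q + Uq).T @ (K + Uk)                              # = (E_i + U_i)^T Wq^T Wk (E_j + U_j)
assert np.allclose(four_terms, factored)
```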
History: Relative Position Embedding
15
BERT-style learned absolute positions, in the factored form derived on the previous slide:
$A_{i,j}^{abs} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k U_j + U_i^T W_q^T W_k E_{x_j} + U_i^T W_q^T W_k U_j = (E_{x_i} + U_i)^T W_q^T W_k (E_{x_j} + U_j)$
What did Shaw do? Learned relative positions, keeping only the content-to-content and content-to-position terms:
$A_{i,j}^{Shaw\_RPE} = E_{x_i}^T W_q^T W_k E_{x_j} + E_{x_i}^T W_q^T W_k R_{i-j}$
What did Dai (first author of Transformer-XL) do? Sinusoid relative positions $R_{i-j}$, with global query vectors $u$ and $v$:
$A_{i,j}^{RPE} = E_{x_i}^T W_q^T W_{k,E} E_{x_j} + E_{x_i}^T W_q^T W_{k,R} R_{i-j} + u^T W_{k,E} E_{x_j} + v^T W_{k,R} R_{i-j}$
History: Relative Position Embedding
Shaw's RPE: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_{m-n}$
Transformer-XL: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T \widetilde{W}_k \tilde{p}_{m-n} + u^T W_k x_n + v^T \widetilde{W}_k \tilde{p}_{m-n}$
T5: $q_m^T k_n = x_m^T W_q^T W_k x_n + b_{m,n}$
History: Relative Position Embedding
Shaw's RPE: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k p_{m-n}$
Transformer-XL: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T \widetilde{W}_k \tilde{p}_{m-n} + u^T W_k x_n + v^T \widetilde{W}_k \tilde{p}_{m-n}$
T5: $q_m^T k_n = x_m^T W_q^T W_k x_n + b_{m,n}$
DeBERTa: $q_m^T k_n = x_m^T W_q^T W_k x_n + x_m^T W_q^T W_k \tilde{p}_{m-n} + \tilde{p}_{m-n}^T W_q^T W_k x_n$
Disentangled Attention
18
• Existing RPE approaches such as Shaw et al. compute attention weights from only the content-to-content (a) and content-to-position (b) terms
• But attention weights cannot be modeled in one direction only:
• the position-to-content (c) term matters as well!
• With relative position embedding, the position-to-position (d) term is already accounted for, so it is dropped
Disentangled Attention
19
• k: maximum relative distance
• $\delta(i,j) \in [0, 2k)$
• $\delta(i,j) = \begin{cases} 0 & \text{for } i - j \le -k \\ 2k - 1 & \text{for } i - j \ge k \\ i - j + k & \text{otherwise} \end{cases}$ (see the sketch below)
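A direct transcription of δ(i, j) as a sketch; the 🤗 implementation linked on the next slide vectorizes the same bucketing over the whole sequence.

```python
def delta(i: int, j: int, k: int) -> int:
    """Map the signed distance i - j into a relative-position bucket in [0, 2k)."""
    if i - j <= -k:
        return 0
    if i - j >= k:
        return 2 * k - 1
    return i - j + k
```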
Disentangled Attention
20
https://github.com/huggingface/transformers/blob/4210579522f8b288c3ae6c646e8a7f2e3a941c76/src/transformers/models/deberta/modeling_deberta.py#L660
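Below is a minimal single-head sketch of the disentangled score, the three retained terms (c2c, c2p, p2c) plus the 1/sqrt(3d) scaling described in the paper. It is not the linked 🤗 code; `disentangled_scores` and its arguments are illustrative names, and δ is implemented with a simple clamp.

```python
import torch

def disentangled_scores(H, rel_emb, Wq_c, Wk_c, Wq_r, Wk_r, k=2):
    """Single-head sketch: c2c + c2p + p2c attention logits (the p2p term is dropped).

    H:       [L, d]  content hidden states
    rel_emb: [2k, d] relative position embeddings P
    """
    L, d = H.shape
    Qc, Kc = H @ Wq_c, H @ Wk_c                  # content projections
    Qr, Kr = rel_emb @ Wq_r, rel_emb @ Wk_r      # relative-position projections

    i = torch.arange(L).unsqueeze(1)
    j = torch.arange(L).unsqueeze(0)
    delta = torch.clamp(i - j + k, 0, 2 * k - 1)        # the bucketing from the previous slide

    c2c = Qc @ Kc.T                                      # content-to-content
    c2p = torch.gather(Qc @ Kr.T, 1, delta)              # Qc_i · Kr_{delta(i, j)}
    p2c = torch.gather(Kc @ Qr.T, 1, delta).T            # Kc_j · Qr_{delta(j, i)}

    return (c2c + c2p + p2c) / (3 * d) ** 0.5            # scaled by 1/sqrt(3d)

# toy usage
L, d, k = 6, 8, 2
H, P = torch.randn(L, d), torch.randn(2 * k, d)
Ws = [torch.randn(d, d) for _ in range(4)]
att = torch.softmax(disentangled_scores(H, P, *Ws, k=k), dim=-1)   # [L, L]
```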
Enhanced Mask Decoder
21
• DeBERTa is pre-trained with MLM
• MLM exploits both the content and the position information of the context words
• But absolute positions are not taken into account
• e.g.,
• "A new store opened beside the new mall"
• BERT injects absolute positions at the input layer
• DeBERTa injects absolute positions after all the Transformer layers, right before the softmax layer for masked token prediction (sketched below)
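A rough sketch of that decoding path, based on the "DeBERTaForMaskedLM with EDM" diagram later in the deck rather than the actual repo code; plain nn.MultiheadAttention stands in for the disentangled-attention layers, and all names are illustrative.

```python
import torch
import torch.nn as nn

def enhanced_mask_decoder(H, abs_pos_emb, shared_layers, lm_head):
    """H: [B, L, d] final encoder output; abs_pos_emb: [1, L, d] absolute position embeddings.

    Absolute positions enter only here, right before masked-token prediction.
    """
    query = H + abs_pos_emb                  # the "query state (I)" of the later diagram
    for attn in shared_layers:               # n = 2 layers, shared per the slides
        query, _ = attn(query, H, H)         # query carries the positions; key/value stay H
    return lm_head(query)                    # lm_logits over the vocabulary

# toy usage
B, L, d, vocab = 2, 8, 32, 100
H, pos = torch.randn(B, L, d), torch.randn(1, L, d)
layers = [nn.MultiheadAttention(d, num_heads=4, batch_first=True) for _ in range(2)]
logits = enhanced_mask_decoder(H, pos, layers, nn.Linear(d, vocab))   # [B, L, vocab]
```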
Enhanced Mask Decoder
22
Enhanced Mask Decoder
23
https://github.com/microsoft/DeBERTa/blob/master/experiments/language_model/mlm.sh
Enhanced Mask Decoder
24
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Fix the random seed
Return the tokenizer and task object
Eval, test data load
Load train data and get model
Enhanced Mask Decoder
25
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/tasks/mlm_task.py
Enhanced Mask Decoder
26
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/models/masked_language_model.py
Enhanced Mask Decoder
27
https://youtu.be/gcMyKUXbY8s?t=1198
BertForMaskedLM
28
[Diagram: sub-word, token type, and absolute position embeddings are summed at the input (CLS 딥 ##러 MASK 논 MASK 모임 SEP) and passed through a stack of Transformer layers (drawn here with the same disentangled-attention blocks as the next slide); the embeddings and Transformer stack form the BERT Module, each layer yields an encoder output, and lm_head maps the final encoder output to lm_logits, with lm_loss computed against the original tokens (CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
DeBERTaForMaskedLM with EDM
29
[Diagram: the DeBERTa Module sums the sub-word and token type embeddings (and, per (1), optionally the absolute position embedding) and runs the stack of Transformer layers with disentangled attention over the masked input (CLS 딥 ##러 MASK 논 MASK 모임 SEP), producing the encoder output H; the EDM Module (n = 2) adds the absolute position embedding to H to form the query state I, runs two more Transformer layers with disentangled attention using I as query and the encoder output as key/value, and feeds the result through lm_head to get lm_logits and lm_loss against the original tokens (CLS 딥 ##러 ##닝 논 ##문 모임 SEP).]
• (1) The absolute position embedding is added at the input only when the position_biased_input option is True
• The pink Transformer layers (the EDM layers) are shared
• lm_head is shared with the word embedding matrix
• According to the authors, leaving EDM out does not affect the convergence of the PLM
• It only slightly affects the perplexity of MLM training
Scale Invariant Fine-Tuning
30
• Virtual adversarial training is a regularization method
• It strengthens the model's generalization performance
• The goal is to add a small perturbation (noise) to the input so that the model still makes the same output prediction under adversarial attack
• In NLP tasks the perturbation is applied to the word embeddings
• However, the norms of the embedding vectors differ model by model and word by word
• The bigger the model, the larger this variance, which makes adversarial training more unstable
• Inspired by layer normalization, SIFT performs adversarial fine-tuning by adding the perturbation to normalized word embeddings
• It was applied only to the 1.5B model; a comprehensive study is left for future work
Scale Invariant Fine-Tuning
31
https://github.com/microsoft/DeBERTa/tree/master/DeBERTa/sift
Scale Invariant Fine-Tuning
32
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/apps/run.py
Scale Invariant Fine-Tuning
33
[Diagram: the Embedding Module sums the sub-word, token type, and (if position_biased_input) absolute position embeddings and applies LayerNorm; the DeBERTa Module encodes the input (CLS 딥 ##러 ##닝 논 ##문 모임 SEP); a task-specific layer (SuperGLUE) produces per-token predictions (O / B-XX / I-XX), and the loss ℒ(logits, golden_truth) is computed. The SIFT hook attaches at the embedding LayerNorm.]
Scale Invariant Fine-Tuning
34
[Diagram: same pipeline as the previous slide; the SIFT hook wraps the embedding output as LayerNorm(inputs).]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
Scale Invariant Fine-Tuning
35
[Diagram: same pipeline; on top of LayerNorm(inputs), a perturbation delta ~ N(0, 0.02) is added to form the adversarial input, and delta is clamped whenever δ ≥ 0.04 or δ ≤ −0.04 (see the sketch below).]
https://github.com/microsoft/DeBERTa/blob/master/DeBERTa/sift/sift.py#L29
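A minimal sketch of the perturbation step shown on this slide; the N(0, 0.02) initialization and the ±0.04 clamp are taken from the slide, while the adversarial update of delta and the actual hook mechanics in sift.py are omitted, and the LayerNorm is applied inline rather than via the embedding-module hook.

```python
import torch
import torch.nn.functional as F

def sift_perturb(embeddings: torch.Tensor, eps: float = 0.04) -> torch.Tensor:
    """Normalize the word embeddings, then add a small clamped perturbation (SIFT-style forward pass)."""
    normed = F.layer_norm(embeddings, embeddings.shape[-1:])   # the scale-invariant part
    delta = torch.empty_like(normed).normal_(0, 0.02)          # delta ~ N(0, 0.02), per the slide
    return normed + delta.clamp(-eps, eps)                     # clamp to [-0.04, 0.04]
```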
Experiment: Pre-training
36
Experiment: Pre-training
37
• Dynamic data batching, as in RoBERTa
• Span masking, as in SpanBERT (see the sketch below)
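A rough sketch of span masking in the spirit of SpanBERT (geometric span lengths, ~15% budget); the exact sampling used for DeBERTa pre-training is not spelled out on the slide.

```python
import random

def sample_span_length(p=0.2, max_len=10):
    """Span length ~ Geometric(p), truncated at max_len (as in SpanBERT)."""
    length = 1
    while random.random() > p and length < max_len:
        length += 1
    return length

def span_mask(tokens, mask_token="[MASK]", mask_prob=0.15):
    """Mask contiguous spans until roughly mask_prob of the tokens are covered."""
    out = list(tokens)
    budget = max(1, int(len(out) * mask_prob))
    masked = set()
    while len(masked) < budget:
        start = random.randrange(len(out))
        for i in range(start, min(start + sample_span_length(), len(out))):
            masked.add(i)
    for i in masked:
        out[i] = mask_token
    return out

print(span_mask("a new store opened beside the new mall".split()))
```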
Experiment: Pre-training
38
Experiment: Pre-training
39
Experiment: Fine-tuning
40
Experiment: Fine-tuning
41
Experiment: Fine-tuning
42
Experiment: Ablation study
43
Experiment: SuperGLUE
44
Experiment: SIFT
45
Experiment: DeBERTa v2
46
https://huggingface.co/docs/transformers/model_doc/deberta_v2
Experiment: DeBERTa v2
47
Conclusion
48
• Improves on RoBERTa with disentangled attention and the enhanced mask decoder
• Proposes SIFT to improve model generalization on downstream tasks
• Surpasses human performance on the SuperGLUE benchmark in terms of macro score
• Human-level intelligence has not been reached yet, though
• DeBERTa V3 came out recently → it will be wrapped up and presented in the next session.