BERT: Pre-training of Deep
Bidirectional Transformers for
Language Understanding
Devlin et al., 2018 (Google AI Language)
Presenter
Phạm Quang Nhật Minh
NLP Researcher
Alt Vietnam
al+ AI Seminar No. 7
2018/12/21
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Research context
• Language model pre-training has been used to
improve many NLP tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained
language representations to downstream tasks
• Feature-based: include pre-trained representations as
additional features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and
fine-tune the pre-trained parameters (e.g., OpenAI GPT,
ULMFiT)
Limitations of current techniques
• Language models used in pre-training are unidirectional,
which restricts the power of the pre-trained
representations
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and
backward language models
• Solution: BERT, Bidirectional Encoder
Representations from Transformers
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original token of each masked word based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Simply fine-tuning the pre-trained BERT model on specific tasks
achieves state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
BERT: Bidirectional Encoder Representations from Transformers (figure slide)
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-BASE: L=12, H=768, A=12, Total Parameters=110M
• (L: number of layers (Transformer blocks), H: the hidden size,
A: the number of self-attention heads)
• BERT-LARGE: L=24, H=1024, A=16, Total Parameters=340M
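To keep the two configurations straight, here is a minimal sketch of the size hyperparameters as a config object (the field names are illustrative, not the official BERT config keys):

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int            # L: number of Transformer blocks
    hidden_size: int           # H
    num_attention_heads: int   # A

BERT_BASE  = BertSize(num_layers=12, hidden_size=768,  num_attention_heads=12)   # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_attention_heads=16)   # ~340M parameters
```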
(Figure) Differences in pre-training model architectures: BERT,
OpenAI GPT, and ELMo
Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding
component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
Vaswani et al. (2017), Attention Is All You Need
(Figure: a stack of Encoder Blocks applied on top of the input sequence)
Inside an Encoder Block
Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
In the BERT experiments, the number of encoder blocks N was
chosen to be 12 (BASE) and 24 (LARGE).
Blocks do not share weights with each other.
Transformer Encoders: Key Concepts
• Self-attention
• Multi-head self-attention
• Position encoding
• Residual connections
• Layer normalization
• Position-wise feed-forward network
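The sketch below shows how one encoder block wires these pieces together: a self-attention sub-layer and a position-wise feed-forward sub-layer, each wrapped in a residual connection followed by layer normalization. It is a simplified NumPy sketch, not BERT's implementation; the attention function is passed in as a placeholder (a real block uses multi-head self-attention, sketched later in this deck).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward network applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU, then project back to d_model

def encoder_block(x, attention_fn, W1, b1, W2, b2):
    """One encoder block: attention and FFN sub-layers, each with residual + layer norm."""
    x = layer_norm(x + attention_fn(x))                        # sub-layer 1: self-attention
    x = layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))   # sub-layer 2: position-wise FFN
    return x

# Toy usage with a placeholder "attention" (uniform averaging over the sequence):
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
dummy_attention = lambda h: h.mean(axis=0, keepdims=True).repeat(len(h), axis=0)
print(encoder_block(x, dummy_attention, W1, b1, W2, b2).shape)   # (5, 8)
```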
Self-Attention
Image source: https://jalammar.github.io/illustrated-transformer/
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs
to an output
• The query, keys, values, and output are all vectors
(Figure: input vectors X1 and X2 are projected into query vectors
q1, q2, key vectors k1, k2 and value vectors v1, v2.)
Use matrices WQ, WK and WV to project the input into
query, key and value vectors.
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where d_k is the dimension of the key vectors.
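A minimal NumPy sketch of scaled dot-product self-attention as defined above (the toy dimensions and the weight matrices WQ, WK, WV are illustrative, not BERT's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 2 tokens, input dim 4, projected to d_k = d_v = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))              # input embeddings X1, X2
W_Q, W_K, W_V = (rng.normal(size=(4, 3)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                         # (2, 3)
```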
Multi-Head Attention
(Figure: the input vectors X1, X2, ... are fed to 8 attention heads
(Head #0 ... Head #7) in parallel; the head outputs are concatenated
and passed through a linear projection that uses a weight matrix WO.)
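A minimal sketch of multi-head attention (reusing softmax and scaled_dot_product_attention from the previous sketch; the 8 heads and the toy dimensions mirror the figure, not BERT's A=12/16 heads):

```python
import numpy as np
# Reuses softmax() and scaled_dot_product_attention() from the previous sketch.

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Run attention in several parallel subspaces, concatenate the heads,
    then project the result back to d_model with W_O."""
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)

# Toy setup: 2 tokens, d_model = 16, 8 heads of size d_k = d_v = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 16))
W_Q = [rng.normal(size=(16, 2)) for _ in range(8)]
W_K = [rng.normal(size=(16, 2)) for _ in range(8)]
W_V = [rng.normal(size=(16, 2)) for _ in range(8)]
W_O = rng.normal(size=(16, 16))                   # 8 heads * 2 dims = 16
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)   # (2, 16)
```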
Position Encoding
• Position Encoding is used to make use of the order of the
sequence
• Since the model contains no recurrence and no convolution
• In Vaswani et al. (2017), the authors used sine and cosine functions of
different frequencies
!"($%&,()) = sin
/01
10000
()
4!"#$%
!"($%&,()56) = cos
/01
10000
()
4!"#$%
• pos is the position and 9 is the dimension
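A NumPy sketch of these sinusoidal encodings (this is the fixed encoding of Vaswani et al.; as the next slide notes, BERT itself uses learned position embeddings instead):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dimensions
    pe[:, 1::2] = np.cos(angle)                       # odd dimensions
    return pe

pe = sinusoidal_position_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64)
```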
Input Representation
• Token Embeddings: use pretrained WordPiece embeddings
• Position Embeddings: use learned position embeddings
• Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N.
Dauphin. Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122v2, 2017.
• A sentence (segment A/B) embedding is added to every token of each sentence
• Use [CLS] for the classification tasks
• Separate sentences by using a special token [SEP]
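A minimal sketch of how the three embeddings are summed per token (the ids, vocabulary size and hidden size below are toy, illustrative values; in BERT all three embedding tables are learned):

```python
import numpy as np

def bert_input_embeddings(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Input representation = token embedding + segment (sentence A/B) embedding
    + learned position embedding, summed element-wise for each token."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Toy sizes (illustrative only): vocab 1000, hidden size 8, max length 32
rng = np.random.default_rng(2)
token_emb = rng.normal(size=(1000, 8))
segment_emb = rng.normal(size=(2, 8))        # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(32, 8))

# "[CLS] <sentence A> [SEP] <sentence B> [SEP]" as made-up ids; 0/1 marks segment A/B
token_ids   = np.array([101, 7, 42, 102, 99, 13, 102])
segment_ids = np.array([  0, 0,  0,   0,  1,  1,   1])
print(bert_input_embeddings(token_ids, segment_ids, token_emb, segment_emb, position_emb).shape)
```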
Task#1: Masked LM
• 15% of the token positions are masked at random
• and the task is to predict each masked word based on its
left and right context
• Not all tokens were masked in the same way
(example sentence “My dog is hairy”)
• 80% were replaced by the [MASK] token: “My dog is
[MASK]”
• 10% were replaced by a random token: “My dog is
apple”
• 10% were left intact: “My dog is hairy”
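A simplified sketch of this 80/10/10 masking procedure (the toy vocabulary and the mask_prob/seed arguments are illustrative; BERT applies this over WordPiece tokens):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "dog", "apple", "hairy", "is", "my"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Pick ~15% of positions; replace 80% with [MASK], 10% with a random token,
    and leave 10% unchanged. Returns corrupted tokens and the prediction targets."""
    rng = random.Random(seed)
    tokens = list(tokens)
    targets = {}                                # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK                    # 80%:  "My dog is [MASK]"
        elif r < 0.9:
            tokens[i] = rng.choice(VOCAB)       # 10%:  "My dog is apple"
        # else: keep the original token           10%:  "My dog is hairy"
    return tokens, targets

print(mask_tokens("my dog is hairy".split(), mask_prob=0.5, seed=0))
```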
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• The pre-training task is a binarized next-sentence
prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = isNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are
flight ##less birds [SEP]
Label = NotNext
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence: sample
two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50%
of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
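A much-simplified sketch of how the (A, B, label) pairs could be generated (real BERT samples multi-sentence spans of up to 512 combined tokens; here single sentences stand in for spans, and the function name is illustrative):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng):
    """Build one (A, B, label) pair: 50% of the time B is the sentence that
    actually follows A ("isNext"), otherwise B is a random corpus sentence."""
    i = rng.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if rng.random() < 0.5:
        return a, doc_sentences[i + 1], "isNext"
    return a, rng.choice(corpus_sentences), "NotNext"

doc = ["the man went to the store", "he bought a gallon of milk", "then he went home"]
corpus = doc + ["penguins are flightless birds", "attention is all you need"]
print(make_nsp_example(doc, corpus, random.Random(0)))
```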
Fine-tuning procedure
• For sequence-level classification task
• Obtain the representation of the input sequence from the
final hidden state at the position of the special
token [CLS]: C ∈ ℝ^H
• Just add a classification layer with parameters W ∈ ℝ^(K×H)
and use softmax to calculate label probabilities:
P = softmax(C W^T)
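A NumPy sketch of this classification head (H=768 and K=3 labels are toy choices, and C is random here rather than a real BERT output):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_sequence(C, W):
    """C: final hidden state at [CLS], shape (H,); W: classification layer, shape (K, H).
    Returns label probabilities P = softmax(C W^T)."""
    return softmax(C @ W.T)

rng = np.random.default_rng(3)
H, K = 768, 3                      # hidden size and number of labels
C = rng.normal(size=H)             # stand-in for BERT's [CLS] hidden state
W = rng.normal(size=(K, H)) * 0.02
print(classify_sequence(C, W))     # probabilities over the K labels, sums to 1
```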
Fine-tuning procedure
• For sequence-level classification task
• All of the parameters of BERT and W are fine-tuned
jointly
• Most model hyperparameters are the same as in
pre-training
• except the batch size, learning rate, and number of
training epochs
Fine-tuning procedure
• Token tagging task (e.g., Named Entity Recognition)
• Feed the final hidden representation T_i ∈ ℝ^H for each
token i into a classification layer for the tagset (NER
label set)
• To make the task compatible with WordPiece
tokenization
• Predict the tag only for the first sub-token of each word
• No prediction is made for the remaining sub-tokens (labeled X)
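A small sketch of which WordPiece positions receive a tag prediction (assuming the usual "##" continuation-piece convention):

```python
def first_subtoken_positions(wordpieces):
    """Positions whose tag is predicted: the first sub-token of each word.
    Continuation pieces (prefixed with '##') receive no prediction ('X')."""
    return [i for i, piece in enumerate(wordpieces) if not piece.startswith("##")]

pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
print(first_subtoken_positions(pieces))   # [0, 1, 3, 4, 5]
```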
Fine-tuning procedure
Single Sentence Tagging Tasks: CoNLL-2003 NER
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Input Question:
Where do water droplets collide with ice
crystals to form precipitation?
• Input Paragraph:
.... Precipitation forms as smaller
droplets coalesce via collision with other rain
drops or ice crystals within a cloud. ...
• Output Answer:
within a cloud
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Represent the input question and paragraph as a single
packed sequence
• The question uses the A embedding and the paragraph uses
the B embedding
• New parameters to be learned in fine-tuning are a start
vector S ∈ ℝ^H and an end vector E ∈ ℝ^H
• Calculate the probability of word i being the start of the
answer span:
P_i = exp(S · T_i) / Σ_j exp(S · T_j)
(and analogously for the end of the span, using E)
• The training objective is the log-likelihood of the correct
start and end positions
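A NumPy sketch of this span-scoring step (T, S and E are random stand-ins; in practice T holds BERT's final hidden states for the paragraph tokens):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_probabilities(T, S, E):
    """T: final hidden states of the paragraph tokens, shape (seq_len, H);
    S, E: start and end vectors, shape (H,). Returns start/end distributions."""
    return softmax(T @ S), softmax(T @ E)

rng = np.random.default_rng(4)
T = rng.normal(size=(20, 768))            # 20 paragraph tokens, hidden size 768
S, E = rng.normal(size=768), rng.normal(size=768)
p_start, p_end = span_probabilities(T, S, E)
print(p_start.argmax(), p_end.argmax())   # predicted start / end token positions
```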
Comparison of BERT and OpenAI GPT
• Training data: OpenAI GPT is trained on BooksCorpus (800M words);
BERT is trained on BooksCorpus (800M) + Wikipedia (2,500M words)
• Special tokens: GPT uses the sentence separator ([SEP]) and
classifier token ([CLS]) only at fine-tuning time;
BERT learns [SEP], [CLS] and sentence A/B embeddings during
pre-training
• Batching: GPT is trained for 1M steps with a batch size of
32,000 words; BERT is trained for 1M steps with a batch size of
128,000 words
• Learning rate: GPT uses the same learning rate of 5e-5 for all
fine-tuning experiments; BERT chooses a task-specific learning
rate which performs best on the development set
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical Train, Dev and Test splits
• Labels for the Test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
GLUE Results
SQuAD v1.1
Reference: https://rajpurkar.github.io/SQuAD-explorer
Named Entity Recognition
SWAG
• The Situations with Adversarial Generations (SWAG) dataset
• The only task-specific parameter is a vector V ∈ ℝ^H
• The probability distribution is the softmax over the four
choices:
P_i = exp(V · C_i) / Σ_{j=1..4} exp(V · C_j)
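A NumPy sketch of this scoring step (each C_i would be the [CLS] hidden state of the i-th (context, choice) pair; random values stand in here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def swag_choice_probabilities(C, V):
    """C: [CLS] hidden states for the four encoded (context, choice) pairs, shape (4, H);
    V: the single task-specific vector, shape (H,). Returns P_i = softmax_i(V . C_i)."""
    return softmax(C @ V)

rng = np.random.default_rng(5)
C = rng.normal(size=(4, 768))             # one encoded sequence per answer choice
V = rng.normal(size=768)
print(swag_choice_probabilities(C, V))    # probability of each of the four choices
```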
SWAG Result
Ablation Studies
• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT
Ablation Studies
• Main findings:
• “Next sentence prediction” (NSP) pre-training is important
for sentence-pair classification tasks
• Bidirectionality (via the MLM pre-training task) contributes
significantly to the performance improvements
Ablation Studies
• Main findings
• Bigger model sizes are better even for small-scale tasks
Conclusions
• Unsupervised pre-training (language model pre-training)
is increasingly adopted across many NLP tasks
• The major contribution of the paper is a deep
bidirectional architecture based on the Transformer
• It advances the state-of-the-art for many important NLP tasks
Links
• TensorFlow code and pre-trained models for BERT:
https://github.com/google-research/bert
• PyTorch Pretrained BERT:
https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras
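As a quick-start sketch, loading a pre-trained model with the Hugging Face pytorch-pretrained-BERT package might look like this (API as remembered from late 2018; the package name and return values are assumptions worth checking against that repository's README):

```python
# pip install pytorch-pretrained-bert   (package name assumed from the repo above)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = ["[CLS]"] + tokenizer.tokenize("my dog is hairy") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled = model(input_ids)   # per-layer hidden states and pooled [CLS] output
print(len(encoded_layers), pooled.shape)        # 12 layers, torch.Size([1, 768]) for bert-base
```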
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than
100 languages (including Vietnamese)
• https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added
around every character in the CJK Unicode range before
applying WordPiece”, which is not a good way to tokenize Japanese
• Use SentencePiece:
https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model
References
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
(2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv
preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv
preprint arXiv:1706.03762.
https://arxiv.org/abs/1706.03762
3. The Annotated Transformer, by harvardnlp:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
4. Dissecting BERT: https://medium.com/dissecting-bert