BERT: Pre-training of Deep
Bidirectional Transformers for
Language Understanding
Devlin et al., 2018 (Google AI Language)
Presenter
Phạm Quang Nhật Minh
NLP Researcher
Alt Vietnam
al+ AI Seminar No. 7
2018/12/21
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Research context
• Language model pre-training has been used to
improve many NLP tasks
• ELMo (Peters et al., 2018)
• OpenAI GPT (Radford et al., 2018)
• ULMFiT (Howard and Ruder, 2018)
• Two existing strategies for applying pre-trained
language representations to downstream tasks
• Feature-based: include pre-trained representations as
additional features (e.g., ELMo)
• Fine-tuning: introduce task-specific parameters and
fine-tune the pre-trained parameters (e.g., OpenAI GPT,
ULMFiT)
Limitations of current techniques
• Language models used in pre-training are unidirectional,
which restricts the power of the pre-trained
representations
• OpenAI GPT uses a left-to-right architecture
• ELMo concatenates independently trained forward and
backward language models
• Solution: BERT, Bidirectional Encoder
Representations from Transformers
BERT: Bidirectional Encoder
Representations from Transformers
• Main ideas
• Propose a new pre-training objective so that a deep
bidirectional Transformer can be trained
• The “masked language model” (MLM): the objective is to
predict the original token of each masked word based only on its
context
• “Next sentence prediction” (NSP)
• Merits of BERT
• Simply fine-tuning the pre-trained BERT model on specific tasks
achieves state-of-the-art performance
• BERT advances the state-of-the-art for eleven NLP tasks
BERT: Bidirectional Encoder Representations from Transformers (figure slide)
Model architecture
• BERT’s model architecture is a multi-layer
bidirectional Transformer encoder
• (Vaswani et al., 2017) “Attention is all you need”
• Two models with different sizes were investigated
• BERT-BASE: L=12, H=768, A=12, Total Parameters=110M
• (L: number of layers (Transformer blocks), H: the hidden size,
A: the number of self-attention heads)
• BERT-LARGE: L=24, H=1024, A=16, Total Parameters=340M
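To keep the two configurations straight, here is a minimal sketch of the size hyperparameters as a config object (the field names are illustrative, not the official BERT config keys):

```python
from dataclasses import dataclass

@dataclass
class BertSize:
    num_layers: int            # L: number of Transformer blocks
    hidden_size: int           # H
    num_attention_heads: int   # A

BERT_BASE  = BertSize(num_layers=12, hidden_size=768,  num_attention_heads=12)   # ~110M parameters
BERT_LARGE = BertSize(num_layers=24, hidden_size=1024, num_attention_heads=16)   # ~340M parameters
```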
(Figure) Differences in pre-training model architectures: BERT,
OpenAI GPT, and ELMo
Transformer Encoders
• The Transformer is an attention-based architecture for NLP
• The Transformer is composed of two parts: an encoding
component and a decoding component
• BERT is a multi-layer bidirectional Transformer encoder
Vaswani et al. (2017), Attention Is All You Need
(Figure: a stack of Encoder Blocks applied on top of the input sequence)
Inside an Encoder Block
Source: https://medium.com/dissecting-bert/dissecting-bert-part-1-d3c3d495cdb3
In the BERT experiments, the number of encoder blocks N was
chosen to be 12 (BASE) and 24 (LARGE).
Blocks do not share weights with each other.
Transformer Encoders: Key Concepts
• Self-attention
• Multi-head self-attention
• Position encoding
• Residual connections
• Layer normalization
• Position-wise feed-forward network
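The sketch below shows how one encoder block wires these pieces together: a self-attention sub-layer and a position-wise feed-forward sub-layer, each wrapped in a residual connection followed by layer normalization. It is a simplified NumPy sketch, not BERT's implementation; the attention function is passed in as a placeholder (a real block uses multi-head self-attention, sketched later in this deck).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def position_wise_ffn(x, W1, b1, W2, b2):
    """Two-layer feed-forward network applied to every position independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2   # ReLU, then project back to d_model

def encoder_block(x, attention_fn, W1, b1, W2, b2):
    """One encoder block: attention and FFN sub-layers, each with residual + layer norm."""
    x = layer_norm(x + attention_fn(x))                        # sub-layer 1: self-attention
    x = layer_norm(x + position_wise_ffn(x, W1, b1, W2, b2))   # sub-layer 2: position-wise FFN
    return x

# Toy usage with a placeholder "attention" (uniform averaging over the sequence):
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5
x = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
dummy_attention = lambda h: h.mean(axis=0, keepdims=True).repeat(len(h), axis=0)
print(encoder_block(x, dummy_attention, W1, b1, W2, b2).shape)   # (5, 8)
```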
Self-Attention
Image source: https://jalammar.github.io/illustrated-transformer/
Self-Attention in Detail
• Attention maps a query and a set of key-value pairs
to an output
• The query, keys, values, and output are all vectors
(Figure: input vectors X1 and X2 are projected into query vectors
q1, q2, key vectors k1, k2 and value vectors v1, v2.)
Use matrices WQ, WK and WV to project the input into
query, key and value vectors.
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where d_k is the dimension of the key vectors.
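A minimal NumPy sketch of scaled dot-product self-attention as defined above (the toy dimensions and the weight matrices WQ, WK, WV are illustrative, not BERT's actual sizes):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of value vectors

# Toy example: 2 tokens, input dim 4, projected to d_k = d_v = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))              # input embeddings X1, X2
W_Q, W_K, W_V = (rng.normal(size=(4, 3)) for _ in range(3))
out = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)                         # (2, 3)
```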
Multi-Head Attention
(Figure: the input vectors X1, X2, ... are fed to 8 attention heads
(Head #0 ... Head #7) in parallel; the head outputs are concatenated
and passed through a linear projection that uses a weight matrix WO.)
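A minimal sketch of multi-head attention (reusing softmax and scaled_dot_product_attention from the previous sketch; the 8 heads and the toy dimensions mirror the figure, not BERT's A=12/16 heads):

```python
import numpy as np
# Reuses softmax() and scaled_dot_product_attention() from the previous sketch.

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """Run attention in several parallel subspaces, concatenate the heads,
    then project the result back to d_model with W_O."""
    heads = [
        scaled_dot_product_attention(X @ wq, X @ wk, X @ wv)
        for wq, wk, wv in zip(W_Q, W_K, W_V)
    ]
    return np.concatenate(heads, axis=-1) @ W_O   # (seq_len, d_model)

# Toy setup: 2 tokens, d_model = 16, 8 heads of size d_k = d_v = 2
rng = np.random.default_rng(1)
X = rng.normal(size=(2, 16))
W_Q = [rng.normal(size=(16, 2)) for _ in range(8)]
W_K = [rng.normal(size=(16, 2)) for _ in range(8)]
W_V = [rng.normal(size=(16, 2)) for _ in range(8)]
W_O = rng.normal(size=(16, 16))                   # 8 heads * 2 dims = 16
print(multi_head_attention(X, W_Q, W_K, W_V, W_O).shape)   # (2, 16)
```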
Position Encoding
• Position Encoding is used to make use of the order of the
sequence
• Since the model contains no recurrence and no convolution
• In Vaswani et al. (2017), the authors used sine and cosine functions of
different frequencies
!"($%&,()) = sin
/01
10000
()
4!"#$%
!"($%&,()56) = cos
/01
10000
()
4!"#$%
• pos is the position and 9 is the dimension
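A NumPy sketch of these sinusoidal encodings (this is the fixed encoding of Vaswani et al.; as the next slide notes, BERT itself uses learned position embeddings instead):

```python
import numpy as np

def sinusoidal_position_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                       # even dimensions
    pe[:, 1::2] = np.cos(angle)                       # odd dimensions
    return pe

pe = sinusoidal_position_encoding(max_len=128, d_model=64)
print(pe.shape)   # (128, 64)
```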
Input Representation
• Token Embeddings: use pretrained WordPiece embeddings
• Position Embeddings: use learned position embeddings
• Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N.
Dauphin. Convolutional sequence to sequence learning. arXiv preprint
arXiv:1705.03122v2, 2017.
• A sentence (segment A/B) embedding is added to every token of each sentence
• Use [CLS] for the classification tasks
• Separate sentences by using a special token [SEP]
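A minimal sketch of how the three embeddings are summed per token (the ids, vocabulary size and hidden size below are toy, illustrative values; in BERT all three embedding tables are learned):

```python
import numpy as np

def bert_input_embeddings(token_ids, segment_ids, token_emb, segment_emb, position_emb):
    """Input representation = token embedding + segment (sentence A/B) embedding
    + learned position embedding, summed element-wise for each token."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

# Toy sizes (illustrative only): vocab 1000, hidden size 8, max length 32
rng = np.random.default_rng(2)
token_emb = rng.normal(size=(1000, 8))
segment_emb = rng.normal(size=(2, 8))        # sentence A = 0, sentence B = 1
position_emb = rng.normal(size=(32, 8))

# "[CLS] <sentence A> [SEP] <sentence B> [SEP]" as made-up ids; 0/1 marks segment A/B
token_ids   = np.array([101, 7, 42, 102, 99, 13, 102])
segment_ids = np.array([  0, 0,  0,   0,  1,  1,   1])
print(bert_input_embeddings(token_ids, segment_ids, token_emb, segment_emb, position_emb).shape)
```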
Task#1: Masked LM
• 15% of the token positions are masked at random
• and the task is to predict each masked word based on its
left and right context
• Not all tokens were masked in the same way
(example sentence “My dog is hairy”)
• 80% were replaced by the [MASK] token: “My dog is
[MASK]”
• 10% were replaced by a random token: “My dog is
apple”
• 10% were left intact: “My dog is hairy”
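A simplified sketch of this 80/10/10 masking procedure (the toy vocabulary and the mask_prob/seed arguments are illustrative; BERT applies this over WordPiece tokens):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "a", "dog", "apple", "hairy", "is", "my"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15, seed=None):
    """Pick ~15% of positions; replace 80% with [MASK], 10% with a random token,
    and leave 10% unchanged. Returns corrupted tokens and the prediction targets."""
    rng = random.Random(seed)
    tokens = list(tokens)
    targets = {}                                # position -> original token to predict
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        targets[i] = tok
        r = rng.random()
        if r < 0.8:
            tokens[i] = MASK                    # 80%:  "My dog is [MASK]"
        elif r < 0.9:
            tokens[i] = rng.choice(VOCAB)       # 10%:  "My dog is apple"
        # else: keep the original token           10%:  "My dog is hairy"
    return tokens, targets

print(mask_tokens("my dog is hairy".split(), mask_prob=0.5, seed=0))
```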
Task#2: Next Sentence Prediction
• Motivation
• Many downstream tasks are based on understanding the
relationship between two text sentences
• Question Answering (QA) and Natural Language Inference (NLI)
• Language modeling does not directly capture that
relationship
• The pre-training task is a binarized next-sentence
prediction task
Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon
[MASK] milk [SEP]
Label = isNext
Input = [CLS] the man [MASK] to the store [SEP] penguin [MASK] are
flight ##less birds [SEP]
Label = NotNext
Pre-training procedure
• Training data: BooksCorpus (800M words) + English
Wikipedia (2,500M words)
• To generate each training input sequence: sample
two spans of text (A and B) from the corpus
• The combined length is ≤ 512 tokens
• 50% of the time B is the actual next sentence that follows A, and 50%
of the time it is a random sentence from the corpus
• The training loss is the sum of the mean masked
LM likelihood and the mean next sentence
prediction likelihood
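A much-simplified sketch of how the (A, B, label) pairs could be generated (real BERT samples multi-sentence spans of up to 512 combined tokens; here single sentences stand in for spans, and the function name is illustrative):

```python
import random

def make_nsp_example(doc_sentences, corpus_sentences, rng):
    """Build one (A, B, label) pair: 50% of the time B is the sentence that
    actually follows A ("isNext"), otherwise B is a random corpus sentence."""
    i = rng.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if rng.random() < 0.5:
        return a, doc_sentences[i + 1], "isNext"
    return a, rng.choice(corpus_sentences), "NotNext"

doc = ["the man went to the store", "he bought a gallon of milk", "then he went home"]
corpus = doc + ["penguins are flightless birds", "attention is all you need"]
print(make_nsp_example(doc, corpus, random.Random(0)))
```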
Fine-tuning procedure
• For sequence-level classification task
• Obtain the representation of the input sequence from the
final hidden state at the position of the special
token [CLS]: C ∈ ℝ^H
• Just add a classification layer with parameters W ∈ ℝ^(K×H)
and use softmax to calculate label probabilities:
P = softmax(C W^T)
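A NumPy sketch of this classification head (H=768 and K=3 labels are toy choices, and C is random here rather than a real BERT output):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify_sequence(C, W):
    """C: final hidden state at [CLS], shape (H,); W: classification layer, shape (K, H).
    Returns label probabilities P = softmax(C W^T)."""
    return softmax(C @ W.T)

rng = np.random.default_rng(3)
H, K = 768, 3                      # hidden size and number of labels
C = rng.normal(size=H)             # stand-in for BERT's [CLS] hidden state
W = rng.normal(size=(K, H)) * 0.02
print(classify_sequence(C, W))     # probabilities over the K labels, sums to 1
```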
Fine-tuning procedure
• For sequence-level classification task
• All of the parameters of BERT and W are fine-tuned
jointly
• Most model hyperparameters are the same as in
pre-training
• except the batch size, learning rate, and number of
training epochs
Fine-tuning procedure
• Token tagging task (e.g., Named Entity Recognition)
• Feed the final hidden representation T_i ∈ ℝ^H for each
token i into a classification layer for the tagset (NER
label set)
• To make the task compatible with WordPiece
tokenization
• Predict the tag only for the first sub-token of each word
• No prediction is made for the remaining sub-tokens (labeled X)
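A small sketch of which WordPiece positions receive a tag prediction (assuming the usual "##" continuation-piece convention):

```python
def first_subtoken_positions(wordpieces):
    """Positions whose tag is predicted: the first sub-token of each word.
    Continuation pieces (prefixed with '##') receive no prediction ('X')."""
    return [i for i, piece in enumerate(wordpieces) if not piece.startswith("##")]

pieces = ["Jim", "Hen", "##son", "was", "a", "puppet", "##eer"]
print(first_subtoken_positions(pieces))   # [0, 1, 3, 4, 5]
```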
Fine-tuning procedure
Single Sentence Tagging Tasks: CoNLL-2003 NER
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Input Question:
Where do water droplets collide with ice
crystals to form precipitation?
• Input Paragraph:
.... Precipitation forms as smaller
droplets coalesce via collision with other rain
drops or ice crystals within a cloud. ...
• Output Answer:
within a cloud
Fine-tuning procedure
• Span-level task: SQuAD v1.1
• Represent the input question and paragraph as a single
packed sequence
• The question uses the A embedding and the paragraph uses
the B embedding
• New parameters to be learned in fine-tuning are a start
vector S ∈ ℝ^H and an end vector E ∈ ℝ^H
• Calculate the probability of word i being the start of the
answer span:
P_i = exp(S · T_i) / Σ_j exp(S · T_j)
(and analogously for the end of the span, using E)
• The training objective is the log-likelihood of the correct
start and end positions
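A NumPy sketch of this span-scoring step (T, S and E are random stand-ins; in practice T holds BERT's final hidden states for the paragraph tokens):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def span_probabilities(T, S, E):
    """T: final hidden states of the paragraph tokens, shape (seq_len, H);
    S, E: start and end vectors, shape (H,). Returns start/end distributions."""
    return softmax(T @ S), softmax(T @ E)

rng = np.random.default_rng(4)
T = rng.normal(size=(20, 768))            # 20 paragraph tokens, hidden size 768
S, E = rng.normal(size=768), rng.normal(size=768)
p_start, p_end = span_probabilities(T, S, E)
print(p_start.argmax(), p_end.argmax())   # predicted start / end token positions
```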
Comparison of BERT and OpenAI GPT
• Training data: OpenAI GPT is trained on BooksCorpus (800M words);
BERT is trained on BooksCorpus (800M) + Wikipedia (2,500M words)
• Special tokens: GPT uses the sentence separator ([SEP]) and
classifier token ([CLS]) only at fine-tuning time;
BERT learns [SEP], [CLS] and sentence A/B embeddings during
pre-training
• Batching: GPT is trained for 1M steps with a batch size of
32,000 words; BERT is trained for 1M steps with a batch size of
128,000 words
• Learning rate: GPT uses the same learning rate of 5e-5 for all
fine-tuning experiments; BERT chooses a task-specific learning
rate which performs best on the development set
Outline
• Research context
• Main ideas
• BERT
• Experiments
• Conclusions
Experiments
• GLUE (General Language Understanding Evaluation)
benchmark
• GLUE distributes canonical Train, Dev and Test splits
• Labels for the Test set are not provided
• Datasets in GLUE:
• MNLI: Multi-Genre Natural Language Inference
• QQP: Quora Question Pairs
• QNLI: Question Natural Language Inference
• SST-2: Stanford Sentiment Treebank
• CoLA: The Corpus of Linguistic Acceptability
• STS-B: The Semantic Textual Similarity Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
• RTE: Recognizing Textual Entailment
• WNLI: Winograd NLI
GLUE Results
SQuAD v1.1
Reference: https://rajpurkar.github.io/SQuAD-explorer
Named Entity Recognition
SWAG
• The Situations with Adversarial Generations (SWAG) dataset
• The only task-specific parameter is a vector V ∈ ℝ^H
• The probability distribution is the softmax over the four
choices:
P_i = exp(V · C_i) / Σ_{j=1..4} exp(V · C_j)
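A NumPy sketch of this scoring step (each C_i would be the [CLS] hidden state of the i-th (context, choice) pair; random values stand in here):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def swag_choice_probabilities(C, V):
    """C: [CLS] hidden states for the four encoded (context, choice) pairs, shape (4, H);
    V: the single task-specific vector, shape (H,). Returns P_i = softmax_i(V . C_i)."""
    return softmax(C @ V)

rng = np.random.default_rng(5)
C = rng.normal(size=(4, 768))             # one encoded sequence per answer choice
V = rng.normal(size=768)
print(swag_choice_probabilities(C, V))    # probability of each of the four choices
```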
SWAG Result
Ablation Studies
• To understand
• Effect of Pre-training Tasks
• Effect of model sizes
• Effect of number of training steps
• Feature-based approach with BERT
Ablation Studies
• Main findings:
• “Next sentence prediction” (NSP) pre-training is important
for sentence-pair classification tasks
• Bidirectionality (via the MLM pre-training task) contributes
significantly to the performance improvements
Ablation Studies
• Main findings
• Bigger model sizes are better even for small-scale tasks
Conclusions
• Unsupervised pre-training (language model pre-training)
is increasingly adopted across many NLP tasks
• The major contribution of the paper is a deep
bidirectional architecture based on the Transformer
• It advances the state-of-the-art for many important NLP tasks
Links
• TensorFlow code and pre-trained models for BERT:
https://github.com/google-research/bert
• PyTorch Pretrained BERT:
https://github.com/huggingface/pytorch-pretrained-BERT
• BERT-pytorch: https://github.com/codertimo/BERT-pytorch
• BERT-keras: https://github.com/Separius/BERT-keras
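As a quick-start sketch, loading a pre-trained model with the Hugging Face pytorch-pretrained-BERT package might look like this (API as remembered from late 2018; the package name and return values are assumptions worth checking against that repository's README):

```python
# pip install pytorch-pretrained-bert   (package name assumed from the repo above)
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

tokens = ["[CLS]"] + tokenizer.tokenize("my dog is hairy") + ["[SEP]"]
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    encoded_layers, pooled = model(input_ids)   # per-layer hidden states and pooled [CLS] output
print(len(encoded_layers), pooled.shape)        # 12 layers, torch.Size([1, 768]) for bert-base
```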
Remark: Applying BERT to non-English languages
• Pre-trained BERT models are provided for more than
100 languages (including Vietnamese)
• https://github.com/google-research/bert/blob/master/multilingual.md
• Be careful with tokenization!
• For Japanese (and Chinese): “spaces were added
around every character in the CJK Unicode range before
applying WordPiece”, which is not a good way to tokenize Japanese
• Use SentencePiece:
https://github.com/google/sentencepiece
• We may need to pre-train our own BERT model
References
1. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K.
(2018). BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding. arXiv
preprint arXiv:1810.04805.
2. Vaswani et al. (2017). Attention Is All You Need. arXiv
preprint arXiv:1706.03762.
https://arxiv.org/abs/1706.03762
3. The Annotated Transformer, by harvardnlp:
http://nlp.seas.harvard.edu/2018/04/03/attention.html
4. Dissecting BERT: https://medium.com/dissecting-bert