BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
https://arxiv.org/abs/1810.04805

Presented by Young Seok Kim, 2018.11.25
Articles & Useful Links
• Official
  • ArXiv: https://arxiv.org/abs/1810.04805
  • Blog: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
  • GitHub: https://github.com/google-research/bert
• Unofficial
  • Lyrn.ai blog: https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
  • Korean blog: https://rosinality.github.io/2018/10/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)
  • PR-049: https://youtu.be/6zGgVIlStXs
  • Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html
• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)
  • Website: https://blog.openai.com/language-unsupervised/
  • Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
• Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” (2018)
  • Website: https://gluebenchmark.com/
  • Paper: https://arxiv.org/abs/1804.07461
Preliminaries
Attention Is All You Need
• Introduced the Transformer module
• Reduced computational complexity with respect to the sequence length (scaled dot-product attention is sketched below)
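As a refresher, here is a minimal NumPy sketch of scaled dot-product attention, the operation at the heart of each Transformer layer that BERT stacks. The shapes and variable names are illustrative and not taken from any BERT implementation.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # pairwise similarities
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                                       # weighted sum of values

# Self-attention: every token attends to every other token in the sequence,
# which is what makes the Transformer encoder bidirectional.
x = np.random.randn(4, 8)                                    # 4 tokens, 8-dim vectors
print(scaled_dot_product_attention(x, x, x).shape)           # (4, 8)
```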
GLUE
• Benchmark introduced in Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” (2018)
• Contains 9 tasks (plus a diagnostic dataset)
BERT

Bidirectional Encoder Representations from Transformers
Motivation
• Traditional RNN / LSTM / GRU units (figure)
• Commonly used bidirectional units (figure)
Problem
• Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself” in a multi-layered context.
Problem
• Diagram: a single Transformer layer maps inputs E1 … EN to outputs T1 … TN.
• Diagram: a multi-layer Transformer stacks Transformer layers between the inputs E1 … EN and the outputs T1 … TN.
Training Method
• Task #1 - Masked Language Model (MLM)
• Task #2 - Next Sentence Prediction (NSP)
Task #1 - Masked LM
• Fill in the blank!
• Formally, a Cloze test (https://en.wikipedia.org/wiki/Cloze_test)
• Similar to CBOW in Word2Vec?
Masked LM Procedure (sketched in code below)
• Choose 15% of the tokens at random, e.g. “hairy” in “My dog is hairy”. For each chosen token:
  • 80%: replace it with [MASK] → “My dog is [MASK]”
  • 10%: keep it unchanged → “My dog is hairy”
  • 10%: replace it with a random word → “My dog is apple”
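A minimal Python sketch of this 80/10/10 procedure, working on whole tokens rather than WordPiece ids; the helper name and the toy vocabulary are illustrative.

```python
import random

def mask_tokens(tokens, vocab, select_prob=0.15):
    """Return (masked_tokens, labels); labels keep the original token at selected positions."""
    masked = list(tokens)
    labels = [None] * len(tokens)              # None = position not selected for prediction
    for i, tok in enumerate(tokens):
        if random.random() >= select_prob:     # select ~15% of tokens at random
            continue
        labels[i] = tok                        # the model is trained to predict the original token
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"               # 80%: replace with the [MASK] token
        elif r < 0.9:
            masked[i] = random.choice(vocab)   # 10%: replace with a random word
        # remaining 10%: keep the token unchanged
    return masked, labels

print(mask_tokens(["my", "dog", "is", "hairy"], vocab=["apple", "blue", "runs"]))
```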
Task #2 - Next Sentence Prediction (NSP)
• Binary classification - [IsNext, NotNext]
• The final pre-trained model achieved 97-98% accuracy on this task.
Embedding
• The first token of every sequence is always the special classification embedding [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.
• Sentence pairs are packed together into a single sequence (see the packing sketch below). The authors separate them in two ways:
  1. Separate with the special token [SEP].
  2. Add a learned sentence embedding to every token of the corresponding sentence.
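A minimal sketch of how a sentence pair is packed into a single input sequence with [CLS], [SEP], and per-sentence (segment) ids, using naive whitespace tokenization instead of BERT's WordPiece.

```python
def pack_pair(sentence_a, sentence_b):
    tokens_a = sentence_a.lower().split()      # BERT actually uses WordPiece tokenization
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Sentence (segment) ids: 0 for sentence A through the first [SEP], 1 for sentence B.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Each token's input embedding is the sum of its token, segment, and position embeddings.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids

tokens, segments, positions = pack_pair("the man went to the store", "he bought a gallon of milk")
print(list(zip(tokens, segments)))
```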
Corpus
• BookCorpus (800M words)
• English Wikipedia (2,500M words)
• Training dataset (pair construction sketched below)
  • 50% - two adjacent sentences
  • 50% - a sentence followed by a random sentence from the corpus
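A minimal sketch of the 50/50 pair construction described above; here `corpus` is a toy list of documents, each a list of sentences, standing in for BookCorpus and Wikipedia.

```python
import random

def make_nsp_pairs(corpus):
    """Build (sentence_a, sentence_b, label) training examples for NSP."""
    pairs = []
    for doc in corpus:
        for i in range(len(doc) - 1):
            if random.random() < 0.5:
                pairs.append((doc[i], doc[i + 1], "IsNext"))        # 50%: the actual next sentence
            else:
                random_sent = random.choice(random.choice(corpus))  # 50%: a random corpus sentence
                pairs.append((doc[i], random_sent, "NotNext"))
    return pairs

corpus = [
    ["the man went to the store", "he bought a gallon of milk", "then he went home"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
for a, b, label in make_nsp_pairs(corpus):
    print(f"{label}: {a} -> {b}")
```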
Differences between BERT and OpenAI GPT

Model      | Corpus                  | [CLS] / [SEP] tokens                | Steps                                     | Learning rate
BERT       | BooksCorpus + Wikipedia | Learned during pre-training         | 1M steps with batch size of 128,000 words | Task-specific fine-tuning learning rate
OpenAI GPT | BooksCorpus             | Only introduced at fine-tuning time | 1M steps with batch size of 32,000 words  | Same learning rate of 5e-5 for all fine-tuning
Results

Results - GLUE benchmark
GLUE Benchmark
• MNLI: Multi-Genre Natural Language Inference
  • Given a pair of sentences, the goal is to predict whether the second sentence is an entailment, contradiction, or neutral with respect to the first sentence.
  • Two versions - MNLI matched, MNLI mismatched
  • Two-sentence classification task (the classification head is sketched below)
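For MNLI and the other GLUE classification tasks below, fine-tuning works the same way: the final hidden state of the [CLS] token is fed to a single new weight matrix followed by a softmax over the labels. A minimal sketch, with a random vector standing in for BERT's [CLS] output.

```python
import numpy as np

H, K = 768, 3                            # hidden size; K labels (MNLI: entailment / contradiction / neutral)
W = np.random.randn(K, H) * 0.02         # the only new task-specific parameters added at fine-tuning

def classify(cls_hidden):
    """cls_hidden: BERT's final hidden state at the [CLS] position, shape (H,)."""
    logits = W @ cls_hidden
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()           # softmax over the K labels

cls_hidden = np.random.randn(H)          # stand-in for the real encoder output
print(classify(cls_hidden))
```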
GLUE Benchmark
• QQP: Quora Question Pairs
  • A binary classification task where the goal is to determine if two questions asked on Quora are semantically equivalent
  • Two-sentence, binary classification task
GLUE Benchmark
• QNLI: Question Natural Language Inference
  • The positive examples are (question, sentence) pairs which do contain the correct answer, and the negative examples are (question, sentence) pairs from the same paragraph which do not contain the answer.
  • Two-sentence, binary classification task
GLUE Benchmark
• SST-2: Stanford Sentiment Treebank
  • A binary single-sentence classification task consisting of sentences extracted from movie reviews with human annotations of their sentiment
  • One-sentence, binary classification task
GLUE Benchmark
• CoLA: Corpus of Linguistic Acceptability
  • A binary single-sentence classification task, where the goal is to predict whether an English sentence is linguistically “acceptable” or not
  • One-sentence, binary classification task
GLUE Benchmark
• STS-B: The Semantic Textual Similarity Benchmark
  • A collection of sentence pairs drawn from news headlines and other sources, annotated with a score from 1 to 5 denoting how similar the two sentences are in semantic meaning
  • Two-sentence, regression task
GLUE Benchmark
• MRPC: Microsoft Research Paraphrase Corpus
  • Consists of sentence pairs automatically extracted from online news sources, with human annotations for whether the sentences in the pair are semantically equivalent
  • Two-sentence, binary classification task
GLUE Benchmark
• RTE: Recognizing Textual Entailment
  • A binary entailment task similar to MNLI, but with much less training data
  • Two-sentence, binary classification task
GLUE Benchmark
• WNLI: Winograd Natural Language Inference
  • A small natural language inference dataset derived from the Winograd Schema Challenge (pronoun resolution recast as sentence-pair entailment)
  • The GLUE webpage notes that there are issues with the construction of this dataset
  • The authors therefore exclude this set
SQuAD v1.1
• The Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs
• Given a question and a paragraph from Wikipedia containing the answer, the task is to predict the answer text span in the paragraph (the span-prediction head is sketched below)
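A minimal sketch of the span-prediction head: fine-tuning adds only a start vector S and an end vector E, and the start/end probability of each paragraph token comes from a softmax over the dot product of S (or E) with that token's final hidden state. Random arrays stand in for BERT's outputs here.

```python
import numpy as np

H, seq_len = 768, 12
S = np.random.randn(H) * 0.02        # start vector, learned during fine-tuning
E = np.random.randn(H) * 0.02        # end vector, learned during fine-tuning
T = np.random.randn(seq_len, H)      # stand-in for BERT's final hidden states, one row per token

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

start_probs = softmax(T @ S)         # P(token i starts the answer span)
end_probs = softmax(T @ E)           # P(token j ends the answer span)

# Pick the highest-scoring span with start <= end.
best_span = max(((i, j) for i in range(seq_len) for j in range(i, seq_len)),
                key=lambda ij: start_probs[ij[0]] * end_probs[ij[1]])
print("predicted span (token indices):", best_span)
```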
Results on SQuAD v1.1
SWAG
• Situations With Adversarial Generations dataset
• Given a sentence from a video captioning dataset, the task is to decide among four choices the most plausible continuation (scoring sketched below).
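A minimal sketch of the multiple-choice setup: each candidate continuation is packed with the given sentence into its own sequence, a single learned vector V scores each sequence's [CLS] representation, and a softmax over the four scores picks the continuation. Random arrays stand in for the encoder outputs.

```python
import numpy as np

H, num_choices = 768, 4
V = np.random.randn(H) * 0.02                  # the only new parameters for this task
cls_vectors = np.random.randn(num_choices, H)  # stand-in: BERT's [CLS] output for each (sentence, choice) pair

scores = cls_vectors @ V                       # one scalar score per candidate continuation
scores = scores - scores.max()
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over the four choices
print("most plausible continuation:", int(probs.argmax()), probs)
```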
SWAG Results
GLUE Results
Ablation Study
Model size
Conclusion
• Unsupervised pre-training is now an integral part of many language understanding systems.
• Models can now be trained with truly deep bidirectional architectures.
• State-of-the-art on almost every NLP task evaluated, in some cases surpassing human performance.
Personal thoughts
• The paper is well written and easy to follow
• SOTA not just on one task/dataset but on almost all tasks
• I think this method is going to be used universally as a baseline for future NLP research
• A more objective comparison between BERT and OpenAI GPT was possible because the baseline (BERT-Base) hyperparameters were chosen to be almost identical to OpenAI GPT's
• The model looks very simple but at the same time is very flexible, adapting to various tasks with simple modifications on the top layer
• Unsupervised pre-training followed by supervised fine-tuning might prevail in many domains.
Thank you!
References
• Images are from either
  • the original papers, or
  • https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15
  • https://colah.github.io/posts/2015-08-Understanding-LSTMs/
  • https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
