Natural Language Processing
Rohit Kate
BERT – Part 1
1
Reading
• Chapter 9 (skip code) & Section 10.2 from
Textbook 1
2
Original Paper
• Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. "BERT: Pre-training of
deep bidirectional transformers for language
understanding." arXiv preprint
arXiv:1810.04805 (2018).
https://arxiv.org/abs/1810.04805
• Later appeared in NAACL-HLT 2019
– More than 84,000 citations since 2018
– Has been transformational in NLP
3
BERT
• Bidirectional Encoder Representations from
Transformers
• Uses only the encoder part of the transformer
architecture
• Generates context-based representations of the
input
• The new representation can then be used for
various NLP tasks
• Pre-trained on language modelling tasks; can then be fine-tuned for other tasks
4
Big Picture
[Figure: big picture. The pre-trained BERT produces a new representation of the input; one or more additional layers, fine-tuned for the task, map this representation to the task output.]
5
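To make the picture concrete, here is a minimal sketch of the pre-trained-plus-fine-tuned setup, assuming PyTorch and the Hugging Face transformers library (the slides do not prescribe any library; the model name and the two-class task layer below are illustrative). The pre-trained encoder produces the new representation, and a small task layer on top of the [CLS] token is the part that gets fine-tuned for the task.

# Minimal sketch, assuming Hugging Face transformers + PyTorch (illustration only)
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")      # pre-trained encoder stack
task_head = torch.nn.Linear(bert.config.hidden_size, 2)    # e.g. a 2-class task layer (new, fine-tuned)

inputs = tokenizer("I arrived late.", return_tensors="pt")
outputs = bert(**inputs)                      # contextual representations of all tokens
cls_repr = outputs.last_hidden_state[:, 0]    # representation of the [CLS] token
task_logits = task_head(cls_repr)             # task output produced by the new layer
print(task_logits.shape)                      # torch.Size([1, 2])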
A New Learning Paradigm in NLP
• Traditionally, for each NLP task:
– Task-specific training data was prepared (typically
small)
– A machine learning model was trained
– The trained model could then perform that task
6
A New Learning Paradigm in NLP
[Figure: traditional learning in NLP. Each NLP task (Tasks 1-4) has its own training data (Training data 1-4), which is used to train a separate model (Models 1-4) from scratch.]
7
A New Learning Paradigm in NLP
• With BERT:
– Pre-trained to learn “general language” on
massive amounts of text through self-supervision
– Fine-tuned on a specific NLP task using the task-
specific training data (typically small)
8
A New Learning Paradigm in NLP
[Figure: learning in NLP using BERT. A LARGE text corpus is used to pre-train BERT once; the pre-trained BERT is then fine-tuned with each task's (typically small) training data, giving Fine-tuned Models 1-4 for NLP Tasks 1-4.]
9
Transfer Learning
• This is also known as transfer learning
– Transfer/adapt the learned knowledge from one
task to another
– For example, tennis → racquetball
• No need to learn from scratch
• A lot of language-related knowledge is picked up during pre-training
– Word-related knowledge
– Grammar-related knowledge
• This is later useful for a specific task
10
Special Tokens
• BERT uses a few extra tokens in its input representation
• [CLS]: Classification token: First token of every sequence
– Its output representation is used as an aggregate representation of the whole sequence (e.g., to classify the sequence)
• [SEP]: Separation token: Used to separate sentences (“sentence” here can mean an arbitrary span of contiguous text, as appropriate for the task)
• [PAD]: Padding token: Used to fill up the rest of the sequence when sequences need to be of a fixed length
Original text: I arrived late. All had left.
With special tokens: [CLS] I arrived late . [SEP] All had left .
Padded to a fixed length: [CLS] I arrived late . [SEP] All had left . [PAD] [PAD]
11
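The special tokens can be seen in practice with a tokenizer. Below is a small illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is part of the slides); note that the real tokenizer also appends a final [SEP] after the second sentence, as in the original paper.

# Illustration only, assuming Hugging Face transformers and bert-base-uncased
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I arrived late.", "All had left.",
                padding="max_length", max_length=14)   # pad the pair to a fixed length
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Expected output, roughly:
# ['[CLS]', 'i', 'arrived', 'late', '.', '[SEP]', 'all', 'had', 'left', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']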
Pre-Training BERT Model
• The weights of the encoders need to be learned by performing some task(s)
– Learn to generate some meaningful representation
• BERT is pre-trained on the following two “fake” tasks
– Masked language modelling
– Next Sentence Prediction
• Neither of these tasks requires any annotated data
– Self-supervised learning
• Just requires a large collection of text documents
– Several gigabytes of text
– Easy to obtain
• Helps it learn general aspects of language
12
Masked Language Modelling
• Randomly mask 15% of words in the corpora and train the
network to predict them
– Classification task with all words in the vocabulary as classes
• Unlike the standard language modelling task, which predicts the next word (called causal LM, and useful for generating text)
• In masked LM, the emphasis is not on generating text but on learning general aspects of a language
• Words from both directions are used for prediction, hence bidirectional
13
I [MASK] late. All had [MASK]. (the model must predict the two masked words)
Masked Language Modelling
• Also known as cloze test in linguistics
• Requires good understanding of the language
– Semantic relationships
• I poured coffee in the ____ .
– Grammatical knowledge
• They ____ happy.
– Word associations
• The balloon ____ in mid-air.
• Also requires some common knowledge
• The sun rises in the ____.
14
Masked Language Modelling
15
[Figure: masked language modelling. The input [CLS] I [MASK] late . [SEP] All had [MASK] . [SEP] passes through the stack of encoders, producing output representations T[CLS], TI, T[MASK], ..., T'[SEP]; a feed-forward layer with a softmax over the vocabulary predicts each masked token, e.g. arrived: 0.5, was: 0.3, ..., cat: 0.01.]
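Masked-word prediction can also be tried out directly. A minimal illustration, assuming the Hugging Face pipeline API and the bert-base-uncased checkpoint (an aside, not part of the lecture):

# Illustration only, assuming Hugging Face transformers
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I [MASK] late."):
    print(pred["token_str"], round(pred["score"], 3))
# Prints the most likely fillers with their softmax probabilities,
# e.g. words like "arrived", "was", "am" near the top.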
Masked Language Modelling
• However, the NLP tasks on which BERT will be
applied (fine-tuned) will not have any [MASK]
tokens
• To mitigate the mismatch between pre-training and fine-tuning data, not all tokens selected for prediction are replaced by the [MASK] token
– 80% are replaced by [MASK]
– 10% are replaced by random tokens
– 10% are kept as original tokens
• If a word selected for masking was split into subwords, then all of its subwords are masked (a sketch of this masking scheme follows below)
16
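A minimal sketch of the 80%/10%/10% masking scheme just described, assuming made-up token ids and vocabulary size (this is an illustration, not BERT's actual pre-training code).

# Sketch of the masking scheme; MASK_ID and VOCAB_SIZE are assumptions
import random

MASK_ID = 103          # assumed id of the [MASK] token
VOCAB_SIZE = 30000     # assumed vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = "not predicted"
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # pick ~15% of positions to predict
            labels[i] = tok                      # the model must recover the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                        # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels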
Next Sentence Prediction (NSP)
• While masked LM helps in learning some aspects of language, it does not help the model learn relationships beyond a single sentence
• Some NLP tasks require the ability to see connections between sentences, e.g. question-answering, textual entailment
• Hence the authors included another pre-training “fake” task at the sentence level
17
Next Sentence Prediction (NSP)
• Given two sentences A and B, predict whether B is the sentence that follows A
– Binary classification: IsNext or NotNext
– The dog barked loudly. The cat was woken.
• IsNext
– The cat jumped over the dog. The sun was bright.
• NotNext
• Data can be trivially generated from a corpus
– Self-supervised learning
– Two adjacent sentences form a positive example (IsNext; 50% of examples)
– A sentence and a random sentence form a negative example (NotNext; 50% of examples)
• Note: Attention is also computed between words across the two sentences
18
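A sketch of how IsNext/NotNext pairs can be generated from plain text, as described above. This is a simplified illustration; the real pre-training data generation also respects document boundaries and sequence lengths.

# Simplified sketch of NSP example generation (illustration only)
import random

def make_nsp_examples(sentences):
    examples = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:                       # 50%: the true next sentence
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:                                           # 50%: a random sentence
            rand = random.choice(sentences)
            while rand == sentences[i + 1]:             # avoid accidentally picking the true next sentence
                rand = random.choice(sentences)
            examples.append((sentences[i], rand, "NotNext"))
    return examples

corpus = ["The dog barked loudly.", "The cat was woken.", "The sun was bright."]
print(make_nsp_examples(corpus))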
Next Sentence Prediction (NSP)
• Designed to help the model understand the relationship between two sentences
– Useful for NLP tasks such as question-answering
– Not captured by language modelling tasks
19
Next Sentence Prediction
20
[Figure: next sentence prediction. The input [CLS] I [MASK] late . [SEP] All had [MASK] . [SEP] passes through the stack of encoders; the output representation of the [CLS] token is fed to a feed-forward layer with a softmax that predicts IsNext: 0.9 vs. NotNext: 0.1.]
Pre-Training
• Pre-training Data:
– BooksCorpus (800M words; free novel books)
– Wikipedia (2500M words)
• The model is trained on both pre-training tasks simultaneously
– The sum of the losses on the two tasks is minimized (see the sketch below)
21
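The sketch below illustrates the joint objective: the masked-LM loss and the NSP loss are computed separately and summed. The logits are dummy tensors standing in for the two output heads; this assumes PyTorch and is an illustration, not the actual BERT training code.

# Illustration of the joint pre-training loss (dummy logits, assumed sizes)
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 12, 30000
mlm_logits = torch.randn(batch, seq_len, vocab_size)    # per-token scores over the vocabulary
nsp_logits = torch.randn(batch, 2)                      # IsNext / NotNext scores from [CLS]

mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = position not masked, ignored
mlm_labels[:, 3] = 2748                                 # pretend token id to recover at position 3
nsp_labels = torch.tensor([0, 1])                       # 0 = IsNext, 1 = NotNext

mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
loss = mlm_loss + nsp_loss                              # minimized jointly during pre-training
print(float(loss))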
Pre-Training
22
[Figure: the overall pre-training setup, taken from Figure 1 of the paper.]
BERT Input Representation
• The input tokens use three embeddings:
– Token embedding
– Segment embedding: An indicator to distinguish between two
sentences
– Position embedding: To indicate positions of tokens in the sequence
• The three embeddings are summed to get the input
representation
23
[Figure: the input representation as the sum of token, segment, and position embeddings; Figure 2 from the paper.]
BERT Input Representation
• Token embeddings:
– These are learned in the embedding layer
• Segment embeddings:
– To distinguish between segments (e.g. first
sentence vs. second sentence)
– These could be something simple, e.g. all 0s for the
first segment and all 1s for the second segment
– Especially important for the NSP task
• Position embeddings:
– Serve the same role as in the transformer architecture (BERT learns them rather than using fixed sinusoidal encodings)
24
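A minimal sketch of the input representation as the sum of the three embeddings, assuming PyTorch; the sizes and token ids below are illustrative assumptions, roughly following the numbers mentioned in the slides.

# Sketch of token + segment + position embeddings (assumed sizes and ids)
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30000, 512, 2, 768
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(num_segments, hidden)     # 0 = first sentence, 1 = second sentence
pos_emb = nn.Embedding(max_len, hidden)          # learned position embeddings

token_ids    = torch.tensor([[101, 1045, 3369, 2397, 102, 2035, 2018, 2187, 102]])  # pretend ids
segment_ids  = torch.tensor([[0,   0,    0,    0,    0,   1,    1,    1,    1  ]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)  # element-wise sum
print(input_repr.shape)   # torch.Size([1, 9, 768])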
Word-Level Tokenization
• Ordinary word-by-word tokenization has the disadvantage of out-of-vocabulary (OOV) words
– Words seen during testing which were not seen during
training, usually rare words
• Normally handled using the special “unknown” (UNK)
token
• Disadvantages:
– Applications like machine translation and conversational
engines cannot handle it
• “I don’t know” is not a good response
– Blankets the actual word
– Treats all unknown words as the same
25
Character-Level Tokenization
• A possible alternative is character-by-character
tokenization
• No problem with OOV
– All characters are known!
• Disadvantages
– Very inefficient
• Short sentence becomes a long sequence of characters
– Low-level processing: the model has to learn that the character sequence t-h-e means the word “the”
26
Subword Tokenization
• Can we do something in between the two extremes of word-level and character-level tokenization?
• BERT uses a subword based tokenization scheme
– Relatively new
– Subword => smaller than words
• dishwasher → dish + wash + er
• The subwords are statistically driven rather than linguistically motivated
– Some subwords may not be meaningful
27
Subword Tokenization
• If a word is in the vocabulary, then use it
• Otherwise, split it into two or more frequent
subwords such that all of them are in the
vocabulary
– The vocabulary contains all individual characters, hence a split is always possible
– Subwords after the first are marked with the ## symbol
– playing → play + ##ing (a sketch of this splitting procedure follows below)
28
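A sketch of the greedy longest-match-first splitting described on this slide, with a tiny made-up vocabulary (WordPiece in spirit; not the actual BERT tokenizer code).

# Greedy longest-match subword splitting; the vocabulary here is invented
def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece            # continuation pieces are marked with ##
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                        # no piece matched (should not happen if
            return ["[UNK]"]                    # all single characters are in the vocabulary)
        start = end
    return pieces

vocab = {"play", "##ing", "##s", "dish", "##wash", "##er"}
print(wordpiece_split("playing", vocab))     # ['play', '##ing']
print(wordpiece_split("dishwasher", vocab))  # ['dish', '##wash', '##er']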
Subword Tokenization
• How to build the vocabulary?
• There are multiple methods; the most widely used is byte-pair encoding (BPE)
– BERT used a different but related method: WordPiece tokenization
• The vocabulary is built up from characters by repeatedly merging the pair of symbols that occurs most frequently in the corpus, until a desired vocabulary size is reached (30K in BERT)
29
Subword Tokenization
30
[Figure: byte-pair encoding example, adapted from Figure 10.6 of Textbook 1. After several merges the current vocabulary contains symbols such as low, _, e, s, t, n, w, er_.]
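For completeness, a compact sketch of the byte-pair-encoding merge loop outlined above, on a tiny made-up corpus of word frequencies (an illustration of the general method, not the exact WordPiece variant BERT uses).

# BPE merge loop sketch; the corpus (word -> frequency) is invented
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    # represent each word as a sequence of symbols, with "_" marking the word end
    words = {tuple(w) + ("_",): f for w, f in word_freqs.items()}
    vocab = {s for w in words for s in w}
    for _ in range(num_merges):
        pairs = Counter()
        for w, f in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += f               # count adjacent symbol pairs, weighted by frequency
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        vocab.add(a + b)
        merged = {}
        for w, f in words.items():
            new_w, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == (a, b):
                    new_w.append(a + b); i += 2   # replace the pair by the merged symbol
                else:
                    new_w.append(w[i]); i += 1
            merged[tuple(new_w)] = f
        words = merged
    return vocab

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4))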
Subword Tokenization
Advantages
• No OOV problem!
– Unknown words are simply split into known subwords
– Unknown word is not blindly blanketed
– Each unknown word is treated differently
• Efficient
– Unlike character-level tokenization, the sequence does not become too long, nor is the processing too low-level
• Complete control over the vocabulary size
• Also used in other transformer models
31