Natural Language Processing
Rohit Kate
BERT – Part 1
1
Reading
• Chapter 9 (skip code) & Section 10.2 from
Textbook 1
2
Original Paper
• Devlin, Jacob, Ming-Wei Chang, Kenton Lee,
and Kristina Toutanova. "BERT: Pre-training of
deep bidirectional transformers for language
understanding." arXiv preprint
arXiv:1810.04805 (2018).
https://arxiv.org/abs/1810.04805
• Later appeared in NAACL-HLT 2019
– More than 84,000 citations since 2018
– Has been transformational in NLP
3
BERT
• Bidirectional Encoder Representations from
Transformers
• Uses only the encoder part of the transformer
architecture
• Generates context-based representations of the
input
• The new representation can then be used for
various NLP tasks
• Pre-trained on language modelling tasks; can then be fine-tuned for other tasks
4
Big Picture
[Figure: big picture. The pre-trained BERT produces a new representation of the input; one or more additional layers, fine-tuned for the task, map this representation to the task output.]
5
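To make the picture concrete, here is a minimal sketch of the pre-trained-plus-fine-tuned setup, assuming PyTorch and the Hugging Face transformers library (the slides do not prescribe any library; the model name and the two-class task layer below are illustrative). The pre-trained encoder produces the new representation, and a small task layer on top of the [CLS] token is the part that gets fine-tuned for the task.

# Minimal sketch, assuming Hugging Face transformers + PyTorch (illustration only)
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")      # pre-trained encoder stack
task_head = torch.nn.Linear(bert.config.hidden_size, 2)    # e.g. a 2-class task layer (new, fine-tuned)

inputs = tokenizer("I arrived late.", return_tensors="pt")
outputs = bert(**inputs)                      # contextual representations of all tokens
cls_repr = outputs.last_hidden_state[:, 0]    # representation of the [CLS] token
task_logits = task_head(cls_repr)             # task output produced by the new layer
print(task_logits.shape)                      # torch.Size([1, 2])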
A New Learning Paradigm in NLP
• Traditionally, for each NLP task:
– Task-specific training data was prepared (typically
small)
– A machine learning model was trained
– The trained model could then perform that task
6
A New Learning Paradigm in NLP
[Figure: traditional learning in NLP. Each NLP task (Tasks 1-4) has its own training data (Training data 1-4), which is used to train a separate model (Models 1-4) from scratch.]
7
A New Learning Paradigm in NLP
• With BERT:
– Pre-trained to learn “general language” on
massive amounts of text through self-supervision
– Fine-tuned on a specific NLP task using the task-
specific training data (typically small)
8
A New Learning Paradigm in NLP
[Figure: learning in NLP using BERT. A LARGE text corpus is used to pre-train BERT once; the pre-trained BERT is then fine-tuned with each task's (typically small) training data, giving Fine-tuned Models 1-4 for NLP Tasks 1-4.]
9
Transfer Learning
• This is also known as transfer learning
– Transfer/adapt the learned knowledge from one
task to another
– For example, tennis → racquetball
• No need to learn from scratch
• A lot of language-related knowledge is picked up during pre-training
– Word-related knowledge
– Grammar-related knowledge
• This is later useful for a specific task
10
Special Tokens
• BERT uses a few extra tokens in its input representation
• [CLS]: Classification token: First token of every sequence
– Its output representation is used as an aggregate representation of the whole sequence (e.g., to classify the sequence)
• [SEP]: Separation token: Used to separate sentences (“sentence” here can mean an arbitrary span of contiguous text, as appropriate for the task)
• [PAD]: Padding token: Used to fill up the rest of the sequence when sequences need to be of a fixed length
Original text: I arrived late. All had left.
With special tokens: [CLS] I arrived late . [SEP] All had left .
Padded to a fixed length: [CLS] I arrived late . [SEP] All had left . [PAD] [PAD]
11
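The special tokens can be seen in practice with a tokenizer. Below is a small illustration, assuming the Hugging Face transformers library and the bert-base-uncased checkpoint (neither is part of the slides); note that the real tokenizer also appends a final [SEP] after the second sentence, as in the original paper.

# Illustration only, assuming Hugging Face transformers and bert-base-uncased
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
enc = tokenizer("I arrived late.", "All had left.",
                padding="max_length", max_length=14)   # pad the pair to a fixed length
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# Expected output, roughly:
# ['[CLS]', 'i', 'arrived', 'late', '.', '[SEP]', 'all', 'had', 'left', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]']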
Pre-Training BERT Model
• The weights of the encoders need to be learned by performing some task(s)
– Learn to generate some meaningful representation
• BERT is pre-trained on the following two “fake” tasks
– Masked language modelling
– Next Sentence Prediction
• Neither of these tasks requires any annotated data
– Self-supervised learning
• Just requires a large collection of text documents
– Several gigabytes of text
– Easy to obtain
• Helps it learn general aspects of language
12
Masked Language Modelling
• Randomly mask 15% of words in the corpora and train the
network to predict them
– Classification task with all words in the vocabulary as classes
• Unlike the standard language modelling task, which predicts the next word (called causal LM, and useful for generating text)
• In masked LM, the emphasis is not on generating text but on learning general aspects of a language
• Words from both directions are used for prediction, hence bidirectional
13
I [MASK] late. All had [MASK]. (the model must predict the two masked words)
Masked Language Modelling
• Also known as cloze test in linguistics
• Requires good understanding of the language
– Semantic relationships
• I poured coffee in the ____ .
– Grammatical knowledge
• They ____ happy.
– Word associations
• The balloon ____ in mid-air.
• Also requires some common knowledge
• The sun rises in the ____.
14
Masked Language Modelling
15
[Figure: masked language modelling. The input [CLS] I [MASK] late . [SEP] All had [MASK] . [SEP] passes through the stack of encoders, producing output representations T[CLS], TI, T[MASK], ..., T'[SEP]; a feed-forward layer with a softmax over the vocabulary predicts each masked token, e.g. arrived: 0.5, was: 0.3, ..., cat: 0.01.]
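Masked-word prediction can also be tried out directly. A minimal illustration, assuming the Hugging Face pipeline API and the bert-base-uncased checkpoint (an aside, not part of the lecture):

# Illustration only, assuming Hugging Face transformers
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("I [MASK] late."):
    print(pred["token_str"], round(pred["score"], 3))
# Prints the most likely fillers with their softmax probabilities,
# e.g. words like "arrived", "was", "am" near the top.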
Masked Language Modelling
• However, the NLP tasks on which BERT will be
applied (fine-tuned) will not have any [MASK]
tokens
• To mitigate the mismatch between pre-training and fine-tuning data, not all tokens selected for prediction are replaced by the [MASK] token
– 80% are replaced by [MASK]
– 10% are replaced by random tokens
– 10% are kept as original tokens
• If a word selected for masking was split into subwords, then all of its subwords are masked (a sketch of this masking scheme follows below)
16
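A minimal sketch of the 80%/10%/10% masking scheme just described, assuming made-up token ids and vocabulary size (this is an illustration, not BERT's actual pre-training code).

# Sketch of the masking scheme; MASK_ID and VOCAB_SIZE are assumptions
import random

MASK_ID = 103          # assumed id of the [MASK] token
VOCAB_SIZE = 30000     # assumed vocabulary size

def mask_tokens(token_ids, mask_prob=0.15):
    inputs, labels = list(token_ids), [-100] * len(token_ids)  # -100 = "not predicted"
    for i, tok in enumerate(token_ids):
        if random.random() < mask_prob:          # pick ~15% of positions to predict
            labels[i] = tok                      # the model must recover the original token
            r = random.random()
            if r < 0.8:                          # 80%: replace with [MASK]
                inputs[i] = MASK_ID
            elif r < 0.9:                        # 10%: replace with a random token
                inputs[i] = random.randrange(VOCAB_SIZE)
            # remaining 10%: keep the original token unchanged
    return inputs, labels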
Next Sentence Prediction (NSP)
• While masked LM helps in learning some aspects of language, it does not help the model learn relationships beyond a single sentence
• Some NLP tasks require the ability to see connections between sentences, e.g. question-answering, textual entailment
• Hence the authors included another pre-training “fake” task at the sentence level
17
Next Sentence Prediction (NSP)
• Given two sentences A and B, predict whether B is the sentence that follows A
– Binary classification: IsNext or NotNext
– The dog barked loudly. The cat was woken.
• IsNext
– The cat jumped over the dog. The sun was bright.
• NotNext
• Data can be trivially generated from a corpus
– Self-supervised learning
– Two adjacent sentences form a positive example (IsNext; 50% of examples)
– A sentence and a random sentence form a negative example (NotNext; 50% of examples)
• Note: Attention is also computed between words across the two sentences
18
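A sketch of how IsNext/NotNext pairs can be generated from plain text, as described above. This is a simplified illustration; the real pre-training data generation also respects document boundaries and sequence lengths.

# Simplified sketch of NSP example generation (illustration only)
import random

def make_nsp_examples(sentences):
    examples = []
    for i in range(len(sentences) - 1):
        if random.random() < 0.5:                       # 50%: the true next sentence
            examples.append((sentences[i], sentences[i + 1], "IsNext"))
        else:                                           # 50%: a random sentence
            rand = random.choice(sentences)
            while rand == sentences[i + 1]:             # avoid accidentally picking the true next sentence
                rand = random.choice(sentences)
            examples.append((sentences[i], rand, "NotNext"))
    return examples

corpus = ["The dog barked loudly.", "The cat was woken.", "The sun was bright."]
print(make_nsp_examples(corpus))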
Next Sentence Prediction (NSP)
• Designed to help the model understand the relationship between two sentences
– Useful for NLP tasks such as question-answering
– Not captured by language modelling tasks
19
Next Sentence Prediction
20
[Figure: next sentence prediction. The input [CLS] I [MASK] late . [SEP] All had [MASK] . [SEP] passes through the stack of encoders; the output representation of the [CLS] token is fed to a feed-forward layer with a softmax that predicts IsNext: 0.9 vs. NotNext: 0.1.]
Pre-Training
• Pre-training Data:
– BooksCorpus (800M words; free novel books)
– Wikipedia (2500M words)
• The model is trained on both pre-training tasks simultaneously
– The sum of the losses on the two tasks is minimized (see the sketch below)
21
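The sketch below illustrates the joint objective: the masked-LM loss and the NSP loss are computed separately and summed. The logits are dummy tensors standing in for the two output heads; this assumes PyTorch and is an illustration, not the actual BERT training code.

# Illustration of the joint pre-training loss (dummy logits, assumed sizes)
import torch
import torch.nn.functional as F

batch, seq_len, vocab_size = 2, 12, 30000
mlm_logits = torch.randn(batch, seq_len, vocab_size)    # per-token scores over the vocabulary
nsp_logits = torch.randn(batch, 2)                      # IsNext / NotNext scores from [CLS]

mlm_labels = torch.full((batch, seq_len), -100, dtype=torch.long)  # -100 = position not masked, ignored
mlm_labels[:, 3] = 2748                                 # pretend token id to recover at position 3
nsp_labels = torch.tensor([0, 1])                       # 0 = IsNext, 1 = NotNext

mlm_loss = F.cross_entropy(mlm_logits.view(-1, vocab_size), mlm_labels.view(-1), ignore_index=-100)
nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
loss = mlm_loss + nsp_loss                              # minimized jointly during pre-training
print(float(loss))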
Pre-Training
22
[Figure: the overall pre-training setup, taken from Figure 1 of the paper.]
BERT Input Representation
• The input tokens use three embeddings:
– Token embedding
– Segment embedding: An indicator to distinguish between two
sentences
– Position embedding: To indicate positions of tokens in the sequence
• The three embeddings are summed to get the input
representation
23
[Figure: the input representation as the sum of token, segment, and position embeddings; Figure 2 from the paper.]
BERT Input Representation
• Token embeddings:
– These are learned in the embedding layer
• Segment embeddings:
– To distinguish between segments (e.g. first
sentence vs. second sentence)
– These could be something simple, e.g. all 0s for the
first segment and all 1s for the second segment
– Especially important for the NSP task
• Position embeddings:
– Serve the same role as in the transformer architecture (BERT learns them rather than using fixed sinusoidal encodings)
24
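A minimal sketch of the input representation as the sum of the three embeddings, assuming PyTorch; the sizes and token ids below are illustrative assumptions, roughly following the numbers mentioned in the slides.

# Sketch of token + segment + position embeddings (assumed sizes and ids)
import torch
import torch.nn as nn

vocab_size, max_len, num_segments, hidden = 30000, 512, 2, 768
tok_emb = nn.Embedding(vocab_size, hidden)
seg_emb = nn.Embedding(num_segments, hidden)     # 0 = first sentence, 1 = second sentence
pos_emb = nn.Embedding(max_len, hidden)          # learned position embeddings

token_ids    = torch.tensor([[101, 1045, 3369, 2397, 102, 2035, 2018, 2187, 102]])  # pretend ids
segment_ids  = torch.tensor([[0,   0,    0,    0,    0,   1,    1,    1,    1  ]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

input_repr = tok_emb(token_ids) + seg_emb(segment_ids) + pos_emb(position_ids)  # element-wise sum
print(input_repr.shape)   # torch.Size([1, 9, 768])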
Word-Level Tokenization
• Ordinary word-by-word tokenization has the disadvantage of out-of-vocabulary (OOV) words
– Words seen during testing which were not seen during
training, usually rare words
• Normally handled using the special “unknown” (UNK)
token
• Disadvantages:
– Applications like machine translation and conversational
engines cannot handle it
• “I don’t know” is not a good response
– Blankets the actual word
– Treats all unknown words as the same
25
Character-Level Tokenization
• A possible alternative is character-by-character
tokenization
• No problem with OOV
– All characters are known!
• Disadvantages
– Very inefficient
• Short sentence becomes a long sequence of characters
– Low-level processing: the model has to learn that the character sequence t-h-e means the word “the”
26
Subword Tokenization
• Can we do something in between the two extremes of word-level and character-level tokenization?
• BERT uses a subword based tokenization scheme
– Relatively new
– Subword => smaller than words
• dishwasher → dish + wash + er
• The subwords are statistically driven rather than linguistically motivated
– Some subwords may not be meaningful
27
Subword Tokenization
• If a word is in the vocabulary, then use it
• Otherwise, split it into two or more frequent
subwords such that all of them are in the
vocabulary
– The vocabulary contains all individual characters, hence a split is always possible
– Subwords after the first are marked with the ## symbol
– playing → play + ##ing (a sketch of this splitting procedure follows below)
28
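A sketch of the greedy longest-match-first splitting described on this slide, with a tiny made-up vocabulary (WordPiece in spirit; not the actual BERT tokenizer code).

# Greedy longest-match subword splitting; the vocabulary here is invented
def wordpiece_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece            # continuation pieces are marked with ##
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:                        # no piece matched (should not happen if
            return ["[UNK]"]                    # all single characters are in the vocabulary)
        start = end
    return pieces

vocab = {"play", "##ing", "##s", "dish", "##wash", "##er"}
print(wordpiece_split("playing", vocab))     # ['play', '##ing']
print(wordpiece_split("dishwasher", vocab))  # ['dish', '##wash', '##er']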
Subword Tokenization
• How to build the vocabulary?
• There are multiple methods; the most widely used is byte-pair encoding (BPE)
– BERT used a different but related method: WordPiece tokenization
• The vocabulary is built up from characters by repeatedly merging the pair of symbols that occurs most frequently in the corpus, until a desired vocabulary size is reached (30K in BERT)
29
Subword Tokenization
30
[Figure: byte-pair encoding example, adapted from Figure 10.6 of Textbook 1. After several merges the current vocabulary contains symbols such as low, _, e, s, t, n, w, er_.]
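For completeness, a compact sketch of the byte-pair-encoding merge loop outlined above, on a tiny made-up corpus of word frequencies (an illustration of the general method, not the exact WordPiece variant BERT uses).

# BPE merge loop sketch; the corpus (word -> frequency) is invented
from collections import Counter

def bpe_merges(word_freqs, num_merges):
    # represent each word as a sequence of symbols, with "_" marking the word end
    words = {tuple(w) + ("_",): f for w, f in word_freqs.items()}
    vocab = {s for w in words for s in w}
    for _ in range(num_merges):
        pairs = Counter()
        for w, f in words.items():
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += f               # count adjacent symbol pairs, weighted by frequency
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]  # most frequent adjacent pair
        vocab.add(a + b)
        merged = {}
        for w, f in words.items():
            new_w, i = [], 0
            while i < len(w):
                if i < len(w) - 1 and (w[i], w[i + 1]) == (a, b):
                    new_w.append(a + b); i += 2   # replace the pair by the merged symbol
                else:
                    new_w.append(w[i]); i += 1
            merged[tuple(new_w)] = f
        words = merged
    return vocab

print(bpe_merges({"low": 5, "lower": 2, "newest": 6, "widest": 3}, num_merges=4))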
Subword Tokenization
Advantages
• No OOV problem!
– Unknown words are simply split into known subwords
– Unknown word is not blindly blanketed
– Each unknown word is treated differently
• Efficient
– Unlike character-level tokenization, the sequence does not become too long, nor is the processing too low-level
• Complete control over the vocabulary size
• Also used in other transformer models
31