Building a Pipeline for State-of-the-Art Natural Language Processing Using Hugging Face Tools
The pipeline for State-of-the-Art NLP
Hugging Face
Agenda
Lysandre DEBUT
Machine Learning Engineer @ Hugging Face,
maintainer and core contributor of
huggingface/transformers
Anthony MOI
Technical Lead @ Hugging Face, maintainer and
core contributor of huggingface/tokenizers
Some slides were adapted from a previous Hugging Face talk by Thomas Wolf, Victor Sanh, and Morgan Funtowicz
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Hugging Face
Hugging Face
Most popular open source NLP
library
▪ 1,000+ research paper mentions
▪ Used in production by 1,000+ companies
Hugging Face
Today’s Menu
Subjects we'll dive into today
● NLP: Transfer learning, transformer networks
● Tokenizers: from text to tokens
● Transformers: from tokens to predictions
Transfer Learning - Transformer networks
One big training to rule them all
NLP took a turn in 2018
▪ Self-supervised training & transfer learning
▪ Large text datasets
▪ Compute power
▪ The arrival of the transformer architecture
Transfer learning
In a few diagrams
Sequential transfer learning
Learn on one task/dataset, transfer to another task/dataset
Pre-training → Adaptation
▪ Pre-training (the computationally intensive step, yielding a general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT
▪ Adaptation (transfer to the target task): text classification, word labeling, question answering, ...
Transformer Networks
Very large models - State of the Art in several tasks
Transformer Networks
● Very large networks
● Can be trained on very big datasets
● Better than previous architectures at maintaining
long-term dependencies
● Require a lot of compute to be trained
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. In NAACL, 2019.
Transformer Networks
Pre-training
Base model → Pre-trained language model
▪ Very large corpus
▪ $$$ in compute
▪ Days of training
Transformer Networks
Fine-tuning
Pre-trained language model → Fine-tuned language model
▪ Small dataset
▪ Training can be done on a single GPU
▪ Easily reproducible
Model Sharing
Reduced compute, cost, energy footprint
From 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT, by Victor Sanh
A deeper look at the inner mechanisms
Pipeline, pre-training, fine-tuning
Tokenizer → Pre-trained model → Adaptation head
Transfer Learning pipeline in NLP
From text to tokens, from tokens to prediction
Jim Henson was a puppeteer
↓ Tokenization
Jim | Henson | was | a | puppet | ##eer
↓ Convert to vocabulary indices
11067 | 5567 | 245 | 120 | 7756 | 9908
↓ Pre-trained model (one hidden-state vector per token)
↓ Task-specific model
True: 0.7886 / False: 0.223
Pre-training
Many currently successful pre-training approaches are based on language modeling: learning to predict Pθ(text) or Pθ(text | other text)
Advantages:
- Doesn’t require human annotation - self-supervised
- Many languages have enough text to learn high capacity models
- Versatile - can be used to learn both sentence and word representations with
a variety of objective functions
The rise of language modeling pre-training
Language Modeling
Objectives - MLM
The pipeline for State-of-the-Art Natural Language Processing
↓ Tokenization
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
↓ Masking
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', [MASK], 'Language', 'Process', '##ing']
↓ Prediction (candidates for [MASK])
'Natural' / 'Artificial' / 'Machine' / 'Processing' / 'Speech'
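As an illustration of MLM in practice (not from the original slides), recent versions of transformers expose masked-token prediction through the fill-mask pipeline; the checkpoint name below is just an example:

from transformers import pipeline

# Masked language modeling: predict the most likely fillers for [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The pipeline for State-of-the-Art [MASK] Language Processing")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))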
Language Modeling
Objectives - CLM
The pipeline for State-of-the-Art Natural Language Processing
↓ Tokenization
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
↓ Prediction (next tokens, left to right)
['Process', '##ing', '(', 'NL', '##P', ')', 'software', 'which', 'will', 'allow', 'a', 'user', 'to', 'develop']
Tokenization
It doesn’t have to be slow
Tokenization
- Convert input strings to a sequence of numbers
Its role in the pipeline
Jim Henson was a puppeteer
→ Jim | Henson | was | a | puppet | ##eer
→ 11067 | 5567 | 245 | 120 | 7756 | 9908
- Goal: Find the most meaningful and smallest possible representation
Some examples
Let's dive into the nitty-gritty
Word-based
Word by word tokenization
Let's do tokenization!
Split on spaces: Let's | do | tokenization!
Split on punctuation: Let | 's | do | tokenization | !
▪ Split on spaces, or following specific rules to obtain words
▪ What to do with punctuation?
▪ Requires large vocabularies: dog != dogs, run != running
▪ Out-of-vocabulary (aka <UNK>) tokens for unknown words
Character
Character by character tokenization
▪ Split on characters individually
▪ Do we include spaces or not?
▪ Smaller vocabularies
▪ But characters don't necessarily carry meaning on their own
▪ End up with a huge amount of tokens to be processed by the model
L e t ‘ s d o t o k e n i z a t i o n !
Byte Pair Encoding
Welcome subword tokenization
▪ First introduced by Philip Gage in 1994, as a compression algorithm
▪ Applied to NLP by Rico Sennrich et al. in "Neural Machine Translation of Rare Words with Subword Units". ACL 2016.
Byte Pair Encoding
Welcome subword tokenization
A B C ... a b c ... ? ! ...
Initial alphabet:
▪ Start with a base vocabulary using Unicode characters seen in the data
▪ Most frequent pairs get merged to a new token:
1. T + h => Th
2. Th + e => The
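To make the merge rule concrete, here is a toy sketch of the BPE training loop (illustrative only, not the huggingface/tokenizers implementation): count symbol pairs, merge the most frequent pair, repeat.

from collections import Counter

# Toy corpus; each word becomes a list of characters plus an end-of-word marker
corpus = ["the", "then", "there", "that"]
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    # Count every adjacent symbol pair across all words
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the pair with the merged token
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(3):
    pair = most_frequent_pair(words)
    print("merging", pair)  # first ('t', 'h') -> 'th', then ('th', 'e') -> 'the'
    words = merge(words, pair)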
Byte Pair Encoding
Welcome subword tokenization
▪ Fewer out-of-vocabulary tokens
▪ Smaller vocabularies
Let’s</w> do</w> token ization</w> !</w>
And a lot more
So many algorithms...
▪ Byte-level BPE as used in GPT-2 (Alec Radford et al. OpenAI)
▪ WordPiece as used in BERT (Jacob Devlin et al. Google)
▪ SentencePiece (Unigram model) (Taku Kudo et al. Google)
Tokenizers
Why did we build it?
▪ Performance
▪ One API for all the different tokenizers
▪ Easy to share and reproduce your work
▪ Easy to use any tokenizer, and re-train it on a new language/dataset
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Strip
▪ Lowercase
▪ Removing diacritics
▪ Deduplication
▪ Unicode normalization (NFD, NFC, NFKC, NFKD)
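For example, the tokenizers library lets you compose these normalization steps; a small sketch (the specific normalizers chosen are illustrative):

from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Compose normalizers: Unicode NFD decomposition, lowercasing, accent stripping
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllò hôw are ü?"))  # -> "hello how are u?"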
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Set of rules to split the input:
- On whitespace
- On punctuation
- Something else?
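In the tokenizers library, a pre-tokenizer such as Whitespace splits on whitespace and punctuation and keeps track of offsets; a small sketch:

from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Let's do tokenization!"))
# -> [('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('do', (6, 8)),
#     ('tokenization', (9, 21)), ('!', (21, 22))]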
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Actual tokenization algorithm:
- BPE
- Unigram
- Word level
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Add special tokens: for example [CLS], [SEP] with BERT
▪ Truncate to match the maximum length of the model
▪ Pad all sequences in a batch to the same length
▪ ...
Tokenizers
Let’s see some code!
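The code screenshots from these slides aren't recoverable from this export; below is a sketch in the same spirit using the public tokenizers API (the exact snippets from the talk may have differed): train a BPE tokenizer from scratch, encode a sentence, and serialize it. The file path is a placeholder.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer with an unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train it on raw text files (path is a placeholder)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.txt"], trainer=trainer)

# Encode a sentence: tokens and vocabulary indices
output = tokenizer.encode("Let's do tokenization!")
print(output.tokens)
print(output.ids)

# A trained tokenizer serializes to a single JSON file, easy to share and reproduce
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")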
Tokenizers
How to install it?
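Installation is a single pip command (wheels with precompiled Rust binaries are published for common platforms):

pip install tokenizers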
Transformers
Using complex models shouldn’t be complicated
Transformers
An explosion of Transformer architectures
BERT
▪ WordPiece tokenization
▪ MLM & NSP
ALBERT
▪ SentencePiece tokenization
▪ MLM & SOP
▪ Repeating layers
GPT-2
▪ Byte-level BPE tokenization
▪ CLM
Same API
Transformers
As flexible as possible
Runs and trains on:
▪ CPU
▪ GPU
▪ TPU
With optimizations:
▪ XLA
▪ TorchScript
▪ Half-precision
▪ Others
All models
BERT & RoBERTa
More to come!
Transformers
Tokenization to prediction
transformers.PreTrainedTokenizer
transformers.PreTrainedModel
The pipeline for State-of-the-Art Natural Language Processing
[[464, 11523, 329, 1812, 12, ..., 15417, 28403]]
Base model → Tensor(batch_size, sequence_length, hidden_size)
With task-specific head → task-specific prediction
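A minimal sketch of those two steps with recent versions of the library (the GPT-2 checkpoint matches the token ids shown above; exact output attributes may vary between versions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# transformers.PreTrainedTokenizer: text -> vocabulary indices
inputs = tokenizer("The pipeline for State-of-the-Art Natural Language Processing",
                   return_tensors="pt")
print(inputs["input_ids"])  # shape (batch_size, sequence_length)

# transformers.PreTrainedModel: indices -> hidden states
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)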
Transformers
Available pre-trained models
transformers.PreTrainedTokenizer
transformers.PreTrainedModel
▪ We publicly host pre-trained tokenizer vocabularies and
model weights
▪ 1611 model/tokenizer pairs at the time of writing
Transformers
Pipelines
transformers.Pipeline
▪ Pipelines handle both the tokenization and prediction
▪ Reasonable defaults
▪ SOTA models
▪ Customizable
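In code, a pipeline is built from a task name, optionally overriding the default model (a sketch; the checkpoint name is an example):

from transformers import pipeline

# Reasonable defaults: the task name alone selects a default pre-trained model
classifier = pipeline("sentiment-analysis")

# Customizable: pass an explicit model (and optionally a tokenizer)
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")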
A few use-cases
That’s where it gets interesting
Transformers
Sentiment analysis/Sequence classification (pipeline)
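The original slide showed a code screenshot; a sketch in the same spirit:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# -> [{'label': 'POSITIVE', 'score': 0.9998...}]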
Transformers
Question Answering (pipeline)
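A sketch of the question-answering pipeline (question and context are illustrative):

from transformers import pipeline

question_answerer = pipeline("question-answering")
result = question_answerer(
    question="What does a pipeline handle?",
    context="Pipelines handle both the tokenization and the prediction.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}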
Transformers
Causal language modeling/Text generation
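A sketch of causal generation with the text-generation pipeline (prompt and length are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The pipeline for State-of-the-Art NLP", max_length=30))
# -> [{'generated_text': '...'}]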
Transformers
Sequence Classification - Under the hood
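The step-by-step screenshots from these slides can't be recovered from this export; the sketch below walks the same path by hand (checkpoint name is an example): tokenize, run the model, turn logits into probabilities, map the prediction to a label.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 1. Tokenize and convert to vocabulary indices
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.",
                   return_tensors="pt")

# 2. Forward pass through the model
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Softmax over the classes, then map the argmax to its label
probabilities = torch.softmax(logits, dim=-1)
predicted_class = probabilities.argmax(dim=-1).item()
print(model.config.id2label[predicted_class],
      probabilities[0, predicted_class].item())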
Transformers
Training models
Example scripts (TensorFlow & PyTorch)
- Named Entity Recognition
- Sequence Classification
- Question Answering
- Language modeling (fine-tuning & from scratch)
- Multiple Choice
Trains on TPU, CPU, GPU
Example scripts for PyTorch Lightning
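The example scripts are built on top of the library's Trainer API; here is a self-contained toy sketch of the same idea (model choice, toy data, and hyperparameters are all illustrative):

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels as a torch Dataset."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# Two-example toy training set
texts = ["I love this!", "This is terrible."]
labels = [1, 0]
train_dataset = ToyDataset(tokenizer(texts, truncation=True, padding=True), labels)

training_args = TrainingArguments(output_dir="./results",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()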
Transformers
We've only scratched the surface
The transformers library covers a lot more ground:
- ELECTRA
- Reformer
- Longformer
- Encoder-decoder architectures
- Translation & Summarization
Transformers + Tokenizers
The full pipeline?
Data → Tokenization → Prediction → Metrics
🤗 nlp → Tokenizers → Transformers → 🤗 nlp
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.