Building a Pipeline for State-of-the-Art Natural Language Processing Using Hugging Face Tools
The pipeline for State-of-the-Art NLP
Hugging Face
Agenda
Lysandre DEBUT
Machine Learning Engineer @ Hugging Face,
maintainer and core contributor of
huggingface/transformers
Anthony MOI
Technical Lead @ Hugging Face, maintainer and
core contributor of huggingface/tokenizers
Some slides were adapted from a previous Hugging Face talk by Thomas Wolf, Victor Sanh, and Morgan Funtowicz
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Hugging Face
Hugging Face
Most popular open source NLP
library
▪ 1,000+ research paper mentions
▪ Used in production by 1,000+ companies
Hugging Face
Today’s Menu
Subjects we'll dive into today
● NLP: Transfer learning, transformer networks
● Tokenizers: from text to tokens
● Transformers: from tokens to predictions
Transfer Learning - Transformer networks
One big training to rule them all
NLP took a turn in 2018
▪ Self-supervised training & transfer learning
▪ Large text datasets
▪ Compute power
▪ The arrival of the transformer architecture
Transfer learning
In a few diagrams
Sequential transfer learning
Learn on one task/dataset, transfer to another task/dataset
Pre-training → Adaptation
▪ Pre-training (the computationally intensive step, yielding a general-purpose model): word2vec, GloVe, skip-thought, InferSent, ELMo, ULMFiT, GPT, BERT, DistilBERT
▪ Adaptation (transfer to the target task): text classification, word labeling, question answering, ...
Transformer Networks
Very large models - State of the Art in several tasks
Transformer Networks
● Very large networks
● Can be trained on very big datasets
● Better than previous architectures at maintaining
long-term dependencies
● Require a lot of compute to be trained
Source: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. In NAACL, 2019.
Transformer Networks
Pre-training
Base model → Pre-trained language model
▪ Very large corpus
▪ $$$ in compute
▪ Days of training
Transformer Networks
Fine-tuning
Pre-trained language model → Fine-tuned language model
▪ Small dataset
▪ Training can be done on a single GPU
▪ Easily reproducible
Model Sharing
Reduced compute, cost, energy footprint
From 🏎 Smaller, faster, cheaper, lighter: Introducing DistilBERT, a
distilled version of BERT, by Victor Sanh
A deeper look at the inner mechanisms
Pipeline, pre-training, fine-tuning
Tokenizer → Pre-trained model → Adaptation head
Transfer Learning pipeline in NLP
From text to tokens, from tokens to prediction
Jim Henson was a puppeteer
↓ Tokenization
Jim | Henson | was | a | puppet | ##eer
↓ Convert to vocabulary indices
11067 | 5567 | 245 | 120 | 7756 | 9908
↓ Pre-trained model (one hidden-state vector per token)
↓ Task-specific model
True: 0.7886 / False: 0.223
Pre-training
Many currently successful pre-training approaches are based on language modeling: learning to predict Pθ(text) or Pθ(text | other text)
Advantages:
- Doesn’t require human annotation - self-supervised
- Many languages have enough text to learn high capacity models
- Versatile - can be used to learn both sentence and word representations with
a variety of objective functions
The rise of language modeling pre-training
Language Modeling
Objectives - MLM
The pipeline for State-of-the-Art Natural Language Processing
↓ Tokenization
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
↓ Masking
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', [MASK], 'Language', 'Process', '##ing']
↓ Prediction (candidates for [MASK])
'Natural' / 'Artificial' / 'Machine' / 'Processing' / 'Speech'
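As an illustration of MLM in practice (not from the original slides), recent versions of transformers expose masked-token prediction through the fill-mask pipeline; the checkpoint name below is just an example:

from transformers import pipeline

# Masked language modeling: predict the most likely fillers for [MASK]
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
predictions = fill_mask("The pipeline for State-of-the-Art [MASK] Language Processing")
for p in predictions:
    print(p["token_str"], round(p["score"], 4))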
Language Modeling
Objectives - CLM
The pipeline for State-of-the-Art Natural Language Processing
↓ Tokenization
['The', 'pipeline', 'for', 'State', '-', 'of', '-', 'the', '-', 'Art', 'Natural', 'Language', 'Process', '##ing']
↓ Prediction (next tokens, left to right)
['Process', '##ing', '(', 'NL', '##P', ')', 'software', 'which', 'will', 'allow', 'a', 'user', 'to', 'develop']
Tokenization
It doesn’t have to be slow
Tokenization
- Convert input strings to a sequence of numbers
Its role in the pipeline
Jim Henson was a puppeteer
→ Jim | Henson | was | a | puppet | ##eer
→ 11067 | 5567 | 245 | 120 | 7756 | 9908
- Goal: Find the most meaningful and smallest possible representation
Some examples
Let's dive into the nitty-gritty
Word-based
Word by word tokenization
Let's do tokenization!
Split on spaces: Let's | do | tokenization!
Split on punctuation: Let | 's | do | tokenization | !
▪ Split on spaces, or following specific rules to obtain words
▪ What to do with punctuation?
▪ Requires large vocabularies: dog != dogs, run != running
▪ Out-of-vocabulary (aka <UNK>) tokens for unknown words
Character
Character by character tokenization
▪ Split on characters individually
▪ Do we include spaces or not?
▪ Smaller vocabularies
▪ But characters don't necessarily carry meaning on their own
▪ End up with a huge amount of tokens to be processed by the model
L e t ‘ s d o t o k e n i z a t i o n !
Byte Pair Encoding
Welcome subword tokenization
▪ First introduced by Philip Gage in 1994, as a compression algorithm
▪ Applied to NLP by Rico Sennrich et al. in "Neural Machine Translation of Rare Words with Subword Units". ACL 2016.
Byte Pair Encoding
Welcome subword tokenization
A B C ... a b c ... ? ! ...
Initial alphabet:
▪ Start with a base vocabulary using Unicode characters seen in the data
▪ Most frequent pairs get merged to a new token:
1. T + h => Th
2. Th + e => The
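To make the merge rule concrete, here is a toy sketch of the BPE training loop (illustrative only, not the huggingface/tokenizers implementation): count symbol pairs, merge the most frequent pair, repeat.

from collections import Counter

# Toy corpus; each word becomes a list of characters plus an end-of-word marker
corpus = ["the", "then", "there", "that"]
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    # Count every adjacent symbol pair across all words
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the pair with the merged token
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i < len(w) - 1 and (w[i], w[i + 1]) == pair:
                out.append(a + b)
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for _ in range(3):
    pair = most_frequent_pair(words)
    print("merging", pair)  # first ('t', 'h') -> 'th', then ('th', 'e') -> 'the'
    words = merge(words, pair)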
Byte Pair Encoding
Welcome subword tokenization
▪ Fewer out-of-vocabulary tokens
▪ Smaller vocabularies
Let’s</w> do</w> token ization</w> !</w>
And a lot more
So many algorithms...
▪ Byte-level BPE as used in GPT-2 (Alec Radford et al. OpenAI)
▪ WordPiece as used in BERT (Jacob Devlin et al. Google)
▪ SentencePiece (Unigram model) (Taku Kudo et al. Google)
Tokenizers
Why did we build it?
▪ Performance
▪ One API for all the different tokenizers
▪ Easy to share and reproduce your work
▪ Easy to use any tokenizer, and re-train it on a new language/dataset
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Strip
▪ Lowercase
▪ Removing diacritics
▪ Deduplication
▪ Unicode normalization (NFD, NFC, NFKC, NFKD)
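For example, the tokenizers library lets you compose these normalization steps; a small sketch (the specific normalizers chosen are illustrative):

from tokenizers import normalizers
from tokenizers.normalizers import NFD, Lowercase, StripAccents

# Compose normalizers: Unicode NFD decomposition, lowercasing, accent stripping
normalizer = normalizers.Sequence([NFD(), Lowercase(), StripAccents()])
print(normalizer.normalize_str("Héllò hôw are ü?"))  # -> "hello how are u?"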
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Set of rules to split the input:
- On whitespace
- On punctuation
- Something else?
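In the tokenizers library, a pre-tokenizer such as Whitespace splits on whitespace and punctuation and keeps track of offsets; a small sketch:

from tokenizers.pre_tokenizers import Whitespace

pre_tokenizer = Whitespace()
print(pre_tokenizer.pre_tokenize_str("Let's do tokenization!"))
# -> [('Let', (0, 3)), ("'", (3, 4)), ('s', (4, 5)), ('do', (6, 8)),
#     ('tokenization', (9, 21)), ('!', (21, 22))]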
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Actual tokenization algorithm:
- BPE
- Unigram
- Word level
The tokenization pipeline
Inner workings
Normalization → Pre-tokenization → Tokenization → Post-processing
▪ Add special tokens: for example [CLS], [SEP] with BERT
▪ Truncate to match the maximum length of the model
▪ Pad all sequences in a batch to the same length
▪ ...
Tokenizers
Let’s see some code!
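The code screenshots from these slides aren't recoverable from this export; below is a sketch in the same spirit using the public tokenizers API (the exact snippets from the talk may have differed): train a BPE tokenizer from scratch, encode a sentence, and serialize it. The file path is a placeholder.

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Build a BPE tokenizer with an unknown token
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train it on raw text files (path is a placeholder)
trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train(files=["wiki.txt"], trainer=trainer)

# Encode a sentence: tokens and vocabulary indices
output = tokenizer.encode("Let's do tokenization!")
print(output.tokens)
print(output.ids)

# A trained tokenizer serializes to a single JSON file, easy to share and reproduce
tokenizer.save("tokenizer.json")
reloaded = Tokenizer.from_file("tokenizer.json")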
Tokenizers
How to install it?
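Installation is a single pip command (wheels with precompiled Rust binaries are published for common platforms):

pip install tokenizers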
Transformers
Using complex models shouldn’t be complicated
Transformers
An explosion of Transformer architectures
BERT
▪ WordPiece tokenization
▪ MLM & NSP
ALBERT
▪ SentencePiece tokenization
▪ MLM & SOP
▪ Repeating layers
GPT-2
▪ Byte-level BPE tokenization
▪ CLM
Same API
Transformers
As flexible as possible
Runs and trains on:
▪ CPU
▪ GPU
▪ TPU
With optimizations:
▪ XLA
▪ TorchScript
▪ Half-precision
▪ Others
All models
BERT & RoBERTa
More to come!
Transformers
Tokenization to prediction
transformers.PreTrainedTokenizer
transformers.PreTrainedModel
The pipeline for State-of-the-Art Natural Language Processing
[[464, 11523, 329, 1812, 12, ..., 15417, 28403]]
Base model → Tensor(batch_size, sequence_length, hidden_size)
With task-specific head → task-specific prediction
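A minimal sketch of those two steps with recent versions of the library (the GPT-2 checkpoint matches the token ids shown above; exact output attributes may vary between versions):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

# transformers.PreTrainedTokenizer: text -> vocabulary indices
inputs = tokenizer("The pipeline for State-of-the-Art Natural Language Processing",
                   return_tensors="pt")
print(inputs["input_ids"])  # shape (batch_size, sequence_length)

# transformers.PreTrainedModel: indices -> hidden states
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)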
Transformers
Available pre-trained models
transformers.PreTrainedTokenizer
transformers.PreTrainedModel
▪ We publicly host pre-trained tokenizer vocabularies and
model weights
▪ 1611 model/tokenizer pairs at the time of writing
Transformers
Pipelines
transformers.Pipeline
▪ Pipelines handle both the tokenization and prediction
▪ Reasonable defaults
▪ SOTA models
▪ Customizable
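In code, a pipeline is built from a task name, optionally overriding the default model (a sketch; the checkpoint name is an example):

from transformers import pipeline

# Reasonable defaults: the task name alone selects a default pre-trained model
classifier = pipeline("sentiment-analysis")

# Customizable: pass an explicit model (and optionally a tokenizer)
classifier = pipeline("sentiment-analysis",
                      model="distilbert-base-uncased-finetuned-sst-2-english")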
A few use-cases
That’s where it gets interesting
Transformers
Sentiment analysis/Sequence classification (pipeline)
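The original slide showed a code screenshot; a sketch in the same spirit:

from transformers import pipeline

classifier = pipeline("sentiment-analysis")
print(classifier("We are very happy to show you the 🤗 Transformers library."))
# -> [{'label': 'POSITIVE', 'score': 0.9998...}]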
Transformers
Question Answering (pipeline)
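A sketch of the question-answering pipeline (question and context are illustrative):

from transformers import pipeline

question_answerer = pipeline("question-answering")
result = question_answerer(
    question="What does a pipeline handle?",
    context="Pipelines handle both the tokenization and the prediction.",
)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}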
Transformers
Causal language modeling/Text generation
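A sketch of causal generation with the text-generation pipeline (prompt and length are illustrative):

from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The pipeline for State-of-the-Art NLP", max_length=30))
# -> [{'generated_text': '...'}]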
Transformers
Sequence Classification - Under the hood
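The step-by-step screenshots from these slides can't be recovered from this export; the sketch below walks the same path by hand (checkpoint name is an example): tokenize, run the model, turn logits into probabilities, map the prediction to a label.

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 1. Tokenize and convert to vocabulary indices
inputs = tokenizer("We are very happy to show you the 🤗 Transformers library.",
                   return_tensors="pt")

# 2. Forward pass through the model
with torch.no_grad():
    logits = model(**inputs).logits

# 3. Softmax over the classes, then map the argmax to its label
probabilities = torch.softmax(logits, dim=-1)
predicted_class = probabilities.argmax(dim=-1).item()
print(model.config.id2label[predicted_class],
      probabilities[0, predicted_class].item())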
Transformers
Training models
Example scripts (TensorFlow & PyTorch)
- Named Entity Recognition
- Sequence Classification
- Question Answering
- Language modeling (fine-tuning & from scratch)
- Multiple Choice
Trains on TPU, CPU, GPU
Example scripts for PyTorch Lightning
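The example scripts are built on top of the library's Trainer API; here is a self-contained toy sketch of the same idea (model choice, toy data, and hyperparameters are all illustrative):

import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ToyDataset(torch.utils.data.Dataset):
    """Wraps tokenizer output and labels as a torch Dataset."""
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# Two-example toy training set
texts = ["I love this!", "This is terrible."]
labels = [1, 0]
train_dataset = ToyDataset(tokenizer(texts, truncation=True, padding=True), labels)

training_args = TrainingArguments(output_dir="./results",
                                  num_train_epochs=3,
                                  per_device_train_batch_size=16)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()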
Transformers
We've only scratched the surface
The transformers library covers a lot more ground:
- ELECTRA
- Reformer
- Longformer
- Encoder-decoder architectures
- Translation & Summarization
Transformers + Tokenizers
The full pipeline?
Data → Tokenization → Prediction → Metrics
🤗 nlp → Tokenizers → Transformers → 🤗 nlp
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.