SigOpt. Confidential.
Efficient BERT
Compress BERT with Multimetric Bayesian Optimization
BERT is great!
Vaswani et al 2017, Devlin et al 2018
But, it is very large
Stanford CS224n
Can we understand the trade-offs
when compressing BERT?
Can we find a model architecture
that works for our specific needs?
Distilling BERT for Question
Answering
The Data: SQuAD 2.0
SQuAD 2.0
SQuAD 2.0’s Unanswerable Questions
How does Distillation work?
[Diagram: data is fed to both the teacher model and the student model. The student is trained on a soft target loss and a hard target loss, producing the trained student model.]
Hinton et al 2015, Intel’s overview
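
A minimal sketch of the two loss terms named in the diagram, following the general formulation in Hinton et al 2015. The soft target loss is written here as a KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature; the hard target loss is ordinary cross-entropy against the label. The function names, the default temperature, and the weight `alpha` are illustrative assumptions, not values from this talk.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_target_loss(student_logits, teacher_logits, temperature):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * (math.log(t) - math.log(s)) for t, s in zip(p_t, p_s) if t > 0)
    return kl * temperature ** 2

def hard_target_loss(student_logits, label):
    """Cross-entropy of the student against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft and hard target losses (alpha weights the soft term)."""
    soft = soft_target_loss(student_logits, teacher_logits, temperature)
    hard = hard_target_loss(student_logits, label)
    return alpha * soft + (1.0 - alpha) * hard

# Toy logits for a 3-class problem (made-up numbers, for illustration only).
teacher = [4.0, 1.0, 0.5]
student = [2.5, 1.5, 0.2]
loss = distillation_loss(student, teacher, label=0)
```

Raising the temperature flattens both distributions, so the student sees more of the teacher's relative preferences among the non-top classes; `alpha` controls how much that signal counts against the label.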
Distilling BERT for Question Answering
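[Diagram: BERT, pre-trained for language modeling and then fine-tuned for SQuAD 2.0, is the teacher. SQuAD 2.0 data is fed to both the teacher and the student model. The student is trained on a soft target loss and a hard target loss, producing the trained student model.]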
For more on distillation: Hinton et al 2015, DistilBERT
Defining the student model
[Diagram: the student model is defined by its architecture parameters and by pre-trained model weights. The weights come from DistilBERT, pre-trained for language understanding on BookCorpus and English Wikipedia.]
DistilBERT, Toronto Book Corpus,
English Wikipedia, SigOpt
What is the Baseline?
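[Diagram: the same distillation setup, with the DistilBERT architecture as the student. BERT, pre-trained for language modeling and fine-tuned for SQuAD 2.0, is the teacher; SQuAD 2.0 data feeds both models; a soft target loss and a hard target loss produce the trained DistilBERT.]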
For more on distillation: Hinton et al 2015, DistilBERT
Baseline model performance
Baseline Exact: 67.07%
Baseline Parameters: 66.3M
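
As a rough check on the parameter figure, the sketch below counts the weights of a DistilBERT-style encoder from its dimensions. The dimensions used (hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions, 6 layers) are assumed from the published DistilBERT base configuration rather than taken from this talk, and the task-specific question-answering head is left out.

```python
def encoder_param_count(n_layers, hidden=768, ffn=3072,
                        vocab=30522, max_positions=512):
    """Approximate parameter count of a DistilBERT-style encoder (no task head)."""
    layer_norm = 2 * hidden  # scale and bias
    # Word embeddings, position embeddings, and the embedding LayerNorm.
    embeddings = vocab * hidden + max_positions * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two linear layers of the feed-forward block, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    return embeddings + n_layers * per_layer

baseline = encoder_param_count(6)
print(baseline)  # 66362880, in line with the 66.3M baseline
```

Under these assumptions each transformer layer accounts for about 7.1M parameters, which is why the number of layers is such a strong lever on model size.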
SigOpt: Enterprise HPO at Scale
[Diagram labels: ML, DL or Simulation Model; Model Evaluation or Backtest; Training Data; Testing Data]
Multimetric Bayesian Optimization
Optimizing for two competing metrics
SigOpt’s Multimetric Optimization
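
When two metrics compete, there is usually no single best model, only a set of models where neither metric can improve without the other getting worse (the Pareto frontier). The sketch below extracts that set from a list of evaluated models. It illustrates the concept only and is not SigOpt's optimizer; the candidate values are made up for illustration and are not results from this talk.

```python
def pareto_frontier(models):
    """Return the models not dominated on (maximize exact, minimize params)."""
    frontier = []
    for m in models:
        dominated = any(
            o["exact"] >= m["exact"] and o["params"] <= m["params"]
            and (o["exact"] > m["exact"] or o["params"] < m["params"])
            for o in models
        )
        if not dominated:
            frontier.append(m)
    return frontier

# Hypothetical candidates: exact-match score (%) and parameters (millions).
candidates = [
    {"name": "a", "exact": 70.0, "params": 66.0},
    {"name": "b", "exact": 66.0, "params": 52.0},
    {"name": "c", "exact": 65.0, "params": 60.0},  # worse than "b" on both metrics
    {"name": "d", "exact": 69.0, "params": 65.0},
]
best = [m["name"] for m in pareto_frontier(candidates)]
print(best)  # ['a', 'b', 'd']
```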
What are our metrics?
[Plot axes: model size (minimize) against model performance (maximize), with the baseline marked at Exact 67.07% and 66.3M parameters.]
Metric Threshold: Dealing with dataset characteristics
[Plot axes: model size (minimize) against model performance (maximize), with the baseline marked at 66.3M parameters and Exact 67.07%, and a metric threshold marked on the plot.]
SigOpt’s Metric Threshold
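
One way to picture a metric threshold is as a cut-off that separates models worth considering from models that are small but not useful. The sketch below applies such a cut-off to a list of observations. The helper name, the threshold value of 50.0, and the observations are all hypothetical; the talk does not state the threshold it used, and this filter is not how SigOpt applies thresholds internally.

```python
def apply_metric_threshold(observations, metric, threshold, objective="maximize"):
    """Keep only the observations that satisfy the threshold on one metric."""
    if objective == "maximize":
        return [o for o in observations if o[metric] >= threshold]
    return [o for o in observations if o[metric] <= threshold]

# Made-up observations: exact-match score (%) and parameters (millions).
observations = [
    {"exact": 68.2, "params": 66.0},
    {"exact": 48.9, "params": 30.0},  # small, but below the cut-off
    {"exact": 61.5, "params": 52.0},
]
viable = apply_metric_threshold(observations, "exact", threshold=50.0)
print(len(viable))  # 2
```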
What are we tuning?
9 model training parameters: SGD parameters, batch size, warm up, weight initialization
6 model architecture parameters: number of layers and attention heads, pruning, dropouts
3 distillation parameters: temperature and loss function weights
The Optimization Cycle
[Diagram: architecture and training parameters define the student model; distillation parameters configure distillation from BERT fine-tuned for SQuAD 2.0, using SQuAD 2.0 data; the trained student model reports validation accuracy and model size back to the optimizer.]
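
The cycle can be sketched as a suggest, evaluate, report loop. In the sketch below, random sampling stands in for the Bayesian optimizer, and `evaluate` is a placeholder that returns synthetic numbers instead of distilling a real student. The parameter names, their ranges, and both formulas inside `evaluate` are assumptions made for illustration, not the configuration or results from this talk.

```python
import random

# Hypothetical search space covering one parameter of each kind.
SEARCH_SPACE = {
    "n_layers": [2, 3, 4, 5, 6],   # architecture
    "learning_rate": (1e-5, 1e-4),  # training
    "temperature": (1.0, 10.0),     # distillation
    "alpha_soft": (0.0, 1.0),       # distillation: weight on the soft target loss
}

def suggest(rng):
    """Stand-in for the optimizer: draw one configuration at random."""
    return {
        "n_layers": rng.choice(SEARCH_SPACE["n_layers"]),
        "learning_rate": rng.uniform(*SEARCH_SPACE["learning_rate"]),
        "temperature": rng.uniform(*SEARCH_SPACE["temperature"]),
        "alpha_soft": rng.uniform(*SEARCH_SPACE["alpha_soft"]),
    }

def evaluate(params):
    """Placeholder for distillation and scoring; returns synthetic metrics."""
    size_m = 23.8 + 7.1 * params["n_layers"]  # rough size in millions
    exact = 50.0 + 3.0 * params["n_layers"] * params["alpha_soft"]  # toy score
    return {"exact": exact, "size": size_m}

def optimization_cycle(budget, seed=0):
    """Run suggest -> evaluate -> report for a fixed number of iterations."""
    rng = random.Random(seed)
    observations = []
    for _ in range(budget):
        params = suggest(rng)
        metrics = evaluate(params)  # both metrics are reported together
        observations.append({"params": params, "metrics": metrics})
    return observations

history = optimization_cycle(budget=10)
```

A Bayesian optimizer would replace `suggest` with a model-based choice informed by `history`, so that each new configuration targets a promising region of the size and performance trade-off.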
Orchestration
[Diagram labels: User’s Workstation; Execute Program; Cluster management; AWS EC2; Optimization at Scale]
So, what were our results?
SigOpt found dozens of viable models
[Plot: the models SigOpt evaluated, with reference lines for Baseline Exact, Baseline Size and the Metric Threshold.]
Choose the model architecture that meets your needs
[Plot axes: Maximize Performance, Minimize Size. Three models are annotated:]
+3.45% on Performance, +0.09% on Size
-0.25% on Performance, -22.47% on Size
+3.19% on Performance, -1.69% on Size
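
To put the size figures in absolute terms, the sketch below converts each percentage into a parameter count, assuming the percentages are relative changes against the 66.3M-parameter baseline. That reading is an assumption; the slide does not say how the percentages were computed.

```python
BASELINE_PARAMS_M = 66.3  # baseline size in millions of parameters

def apply_relative_change(baseline, percent_change):
    """Scale a baseline value by a signed percentage change."""
    return baseline * (1.0 + percent_change / 100.0)

# Size changes quoted for the three annotated models.
size_changes = [0.09, -22.47, -1.69]
sizes_m = [apply_relative_change(BASELINE_PARAMS_M, c) for c in size_changes]
for change, size in zip(size_changes, sizes_m):
    print(f"{change:+.2f}% on size -> about {size:.1f}M parameters")
```

Under that assumption, the -22.47% model would have roughly 51.4M parameters, about 15M fewer than the baseline.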
Some architecture options
[Plot axes: Maximize Performance, Minimize Size. Three models are annotated:]
+3.45% on Performance, +0.09% on Size
-0.25% on Performance, -22.47% on Size
+3.19% on Performance, -1.69% on Size
Architectures shown:
4 layers, 11 attention heads, no dropout, raised temperature, soft target loss weighted more
6 layers, 11 attention heads, no dropout, low temperature, almost all soft target loss
6 layers, 12 attention heads, no dropout, raised temperature, soft target loss weighted more
Let’s take a quick look at the
dashboard
Was the model able to answer
questions?
Model performance
What did misclassifications look like?
Let’s take a look at Warsaw
SQuAD 2.0
Why does it matter?
By using Multimetric Bayesian Optimization, we’re able to easily understand trade-offs made during compression.
By understanding these trade-offs, we’re able to choose a model architecture that best suits our needs.
Learn more about SigOpt
Read our research and product blog.
Check out our YouTube channel to see more videos.
Get free beta access to Experiment Management: join the beta.
Upcoming webinars:
● Introducing Experiment Management - Thursday, July 9 at 10am PT
Thank you!
Here’s the repo to reproduce work

Efficient NLP by Distilling BERT and Multimetric Optimization