Efficient NLP by Distilling BERT and Multimetric Optimization

SigOpt. Confidential.
Efficient BERT
Compress BERT with Multimetric Bayesian Optimization

BERT is great!
Vaswani et al 2017, Devlin et al 2018

But, it is very large
Stanford CS224n

Can we understand the trade-offs when compressing BERT?

Can we find a model architecture that works for our specific needs?

Distilling BERT for Question Answering

The Data: SQuAD 2.0
SQuAD 2.0

SQuAD 2.0’s Unanswerable Questions

How does Distillation work?
Diagram: the same data is fed to a teacher model and a student model. A soft target loss and a hard target loss are combined to train the student, producing the trained student model.
Hinton et al 2015, Intel’s overview
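
To make the two losses concrete, here is a minimal sketch in plain Python, assuming the usual formulation from Hinton et al. 2015: the soft target loss compares temperature-softened student and teacher distributions, and the hard target loss is ordinary cross-entropy against the true label. The function names, the `alpha` weight, and the default values are illustrative assumptions, not taken from the slides.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs):
    """Cross-entropy H(target, predicted) in nats."""
    return -sum(t * math.log(p)
                for t, p in zip(target_probs, predicted_probs) if t > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of soft and hard target losses for one example.

    `alpha` (hypothetical name) is the weight on the soft target loss.
    """
    # Soft target loss: student vs. teacher, both softened by the temperature.
    teacher_soft = softmax(teacher_logits, temperature)
    student_soft = softmax(student_logits, temperature)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    soft_loss = cross_entropy(teacher_soft, student_soft) * temperature ** 2
    # Hard target loss: student at temperature 1 vs. the true label.
    hard_loss = -math.log(softmax(student_logits)[label])
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Made-up logits: a student close to the teacher gets the lower loss.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]
print(distillation_loss(close_student, teacher, label=0)
      < distillation_loss(far_student, teacher, label=0))
```

Raising the temperature flattens both distributions, so the student also learns from the relative scores the teacher gives to wrong answers.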

Distilling BERT for Question Answering
Diagram: the teacher is BERT, pre-trained for language modeling and then fine-tuned for SQuAD 2.0. SQuAD 2.0 is fed to both the teacher and the student model, and the soft target loss and hard target loss together yield the trained student model.
For more on distillation: Hinton et al 2015, DistilBERT

Defining the student model
Diagram: the student model is built from two inputs: pre-trained model weights from DistilBERT (trained on BookCorpus and English Wikipedia for language understanding), and architecture parameters.
DistilBERT, Toronto Book Corpus, English Wikipedia, SigOpt

What is the Baseline?
Diagram: the baseline uses BERT as the teacher and the DistilBERT architecture as the student. SQuAD 2.0 is fed to both, and the soft target loss and hard target loss together yield the trained DistilBERT.
For more on distillation: Hinton et al 2015, DistilBERT

Baseline model performance
Baseline Exact: 67.07%
Baseline Parameters: 66.3M
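
As a rough check on the parameter figure, the sketch below counts the weights of a DistilBERT-style encoder. The dimensions (6 layers, hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions) are the publicly documented DistilBERT-base defaults, not values stated on the slides, and the two-output question answering head is an assumption about how the student was set up.

```python
def distilbert_parameter_count(n_layers=6, dim=768, hidden_dim=3072,
                               vocab_size=30522, max_positions=512,
                               qa_head=True):
    """Count trainable parameters for a DistilBERT-style encoder."""
    layer_norm = 2 * dim                       # scale and bias
    embeddings = vocab_size * dim + max_positions * dim + layer_norm
    attention = 4 * (dim * dim + dim)          # query, key, value, output
    feed_forward = (dim * hidden_dim + hidden_dim) + (hidden_dim * dim + dim)
    per_layer = attention + feed_forward + 2 * layer_norm
    total = embeddings + n_layers * per_layer
    if qa_head:
        total += dim * 2 + 2                   # start and end logits per token
    return total

print(distilbert_parameter_count())            # 66364418
print(distilbert_parameter_count(n_layers=4))  # fewer layers, smaller model
```

Under these assumptions the count is about 66.36 million, in line with the baseline figure above. In this layout the number of attention heads does not change the count, because the heads split the same hidden dimension; removing layers or pruning weights does.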

SigOpt: Enterprise HPO at Scale
Diagram labels: Training Data, ML, DL or Simulation Model, Testing Data, Model Evaluation or Backtest.

Multimetric Bayesian Optimization
Optimizing for two competing metrics
SigOpt’s Multimetric Optimization
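
With two competing metrics there is usually no single best model. Instead there is a set of models where one metric cannot improve without the other getting worse (a Pareto frontier). The sketch below shows how such a frontier can be extracted from a list of results. The metric names and all numbers are made up for illustration, and this is not SigOpt's implementation.

```python
def pareto_frontier(models):
    """Return models not dominated on (maximize exact, minimize params)."""
    def dominates(a, b):
        no_worse = a["exact"] >= b["exact"] and a["params"] <= b["params"]
        strictly_better = a["exact"] > b["exact"] or a["params"] < b["params"]
        return no_worse and strictly_better

    return [m for m in models
            if not any(dominates(other, m) for other in models)]

# Hypothetical observations, made up for illustration only.
candidates = [
    {"name": "a", "exact": 67.0, "params": 66.0},
    {"name": "b", "exact": 69.0, "params": 66.5},
    {"name": "c", "exact": 66.0, "params": 52.0},
    {"name": "d", "exact": 65.0, "params": 60.0},  # dominated by c
]
frontier = pareto_frontier(candidates)
print([m["name"] for m in frontier])  # ['a', 'b', 'c']
```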

What are our metrics?
Plot axes: model size (minimize) against model performance (maximize).
Baseline Exact: 67.07%
Baseline Parameters: 66.3M
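
The performance metric here is the Exact score. A minimal sketch of how an exact-match score can be computed for SQuAD 2.0-style data follows, assuming the normalization steps commonly used in SQuAD evaluation (lowercasing, stripping punctuation and articles, collapsing whitespace) and treating an empty prediction as "no answer". The function names are illustrative.

```python
import re
import string

PUNCTUATION = set(string.punctuation)

def normalize_answer(text):
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any gold answer, else 0."""
    # Unanswerable questions have no gold answers; the right prediction is "".
    if not gold_answers:
        gold_answers = [""]
    target = normalize_answer(prediction)
    return int(any(target == normalize_answer(g) for g in gold_answers))

def exact_score(predictions, references):
    """Percentage of questions answered exactly."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, references))
    return 100.0 * hits / len(predictions)

# Two answerable questions and one unanswerable question (made-up examples).
predictions = ["The Normans", "in 1066", ""]
references = [["Normans"], ["1066"], []]
print(exact_score(predictions, references))
```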

Metric Threshold: Dealing with dataset characteristics
Plot axes: model size (minimize) against model performance (maximize), with a Metric Threshold marked.
Baseline Parameters: 66.3M
Baseline Exact: 67.07%
SigOpt’s Metric Threshold
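
A metric threshold marks which results are worth considering at all. For SQuAD 2.0 this matters because a model can score a non-trivial Exact value just by predicting "no answer" for every question, so very small models with such scores are not useful. The sketch below shows the filtering idea only; the threshold value and observations are hypothetical, since the slides do not state them.

```python
def apply_metric_threshold(observations, metric, threshold, maximize=True):
    """Keep observations whose metric is on the good side of the threshold."""
    if maximize:
        return [o for o in observations if o[metric] >= threshold]
    return [o for o in observations if o[metric] <= threshold]

# Hypothetical values for illustration; not results from the experiment.
observations = [
    {"exact": 50.1, "params": 30.0},
    {"exact": 66.5, "params": 52.0},
    {"exact": 68.2, "params": 66.4},
]
viable = apply_metric_threshold(observations, "exact", threshold=60.0)
print(len(viable))  # 2
```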

What are we tuning?
9 model training parameters: SGD parameters, batch size, warm-up, weight initialization
6 model architecture parameters: number of layers and attention heads, pruning, dropouts
3 distillation parameters: temperature and loss function weights
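
One way to picture the search space is as a grouped dictionary of parameter ranges. The sketch below defines a small illustrative subset and draws one configuration from it. All parameter names and bounds are hypothetical; the slides give only the categories, not the exact 18 parameters or their ranges.

```python
import math
import random

# Hypothetical names and ranges, grouped by the three categories above.
SEARCH_SPACE = {
    "training": {
        "learning_rate": ("log_double", 1e-6, 1e-3),
        "batch_size": ("int", 8, 32),
        "warmup_steps": ("int", 0, 100),
    },
    "architecture": {
        "n_layers": ("int", 1, 6),
        "n_heads": ("int", 1, 12),
        "dropout": ("double", 0.0, 0.5),
    },
    "distillation": {
        "temperature": ("double", 1.0, 10.0),
        "alpha_soft": ("double", 0.0, 1.0),
    },
}

def sample_configuration(space, rng):
    """Draw one configuration uniformly (log-uniformly for log_double)."""
    config = {}
    for group in space.values():
        for name, (kind, low, high) in group.items():
            if kind == "int":
                config[name] = rng.randint(low, high)
            elif kind == "log_double":
                config[name] = math.exp(
                    rng.uniform(math.log(low), math.log(high)))
            else:
                config[name] = rng.uniform(low, high)
    return config

print(sample_configuration(SEARCH_SPACE, random.Random(0)))
```

A Bayesian optimizer would replace the random draw with suggestions informed by earlier results; the shape of the configuration stays the same.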

The Optimization Cycle
Diagram: architecture and training parameters define the student model; distillation parameters configure distillation from BERT on SQuAD 2.0; the trained student model's validation accuracy and model size feed the next round of the cycle.
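
The cycle can be sketched as a suggest, evaluate, record loop. In the sketch below random search stands in for Bayesian optimization and a toy formula stands in for distillation training, so the numbers it produces are meaningless placeholders. All function and parameter names are hypothetical.

```python
import random

def optimization_cycle(suggest, evaluate, budget):
    """Run suggest -> evaluate -> record for `budget` rounds."""
    history = []
    for _ in range(budget):
        params = suggest(history)           # optimizer proposes a configuration
        accuracy, size = evaluate(params)   # distill, then measure both metrics
        history.append({"params": params, "accuracy": accuracy, "size": size})
    return history

# Stand-ins for illustration only: random search replaces Bayesian
# optimization, and a toy formula replaces distillation training.
rng = random.Random(0)

def random_suggest(history):
    return {"n_layers": rng.randint(1, 6),
            "temperature": rng.uniform(1.0, 10.0)}

def toy_evaluate(params):
    size = 24.0 + 7.0 * params["n_layers"]      # made-up size, in millions
    accuracy = 50.0 + 3.0 * params["n_layers"]  # made-up score, not a result
    return accuracy, size

history = optimization_cycle(random_suggest, toy_evaluate, budget=5)
print(len(history))  # 5
```

Passing `history` to `suggest` is what lets a real optimizer use earlier observations when proposing the next configuration.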

Orchestration
Diagram labels: User’s Workstation, Execute Program, AWS EC2, Cluster management, Optimization at Scale.

So, what were our results?

SigOpt found dozens of viable models
Plot annotations: Baseline Exact, Baseline Size, Metric Threshold.

Choose the model architecture that meets your needs
Plot axes: maximize performance, minimize size.
Highlighted models:
+3.45% on Performance, +0.09% on Size
-0.25% on Performance, -22.47% on Size
-1.69% on Size

Some architecture options
Plot axes: maximize performance, minimize size.
Highlighted models:
+0.09% on Size
-0.25% on Performance, -22.47% on Size
-1.69% on Size
Architecture notes:
4 layers, 11 attention heads
No dropout, raised temperature, soft target loss weighted more
No dropout, low temperature, almost all soft target loss
No dropout, raised temperature, soft target loss weighted more

Let’s take a quick look at the dashboard

Was the model able to answer questions?

Model performance

What did misclassifications look like?

Let’s take a look at Warsaw
SQuAD 2.0

Why does it matter?
By using Multimetric Bayesian Optimization, we’re able to easily understand trade-offs made during compression.
By understanding these trade-offs, we’re able to choose a model architecture that best suits our needs.

Learn more about SigOpt
Check out our YouTube channel: see more videos here.
Read our research and product blog.
Get free beta access to Experiment Management: join the beta.
Upcoming webinars:
● Introducing Experiment Management - Thursday, July 9 at 10am PT

Thank you!
Here’s the repo to reproduce this work
