How We Scaled BERT
To Serve 1+ Billion Daily
Requests on CPU
Quoc N. Le, Data Scientist, Roblox
Kip Kaehler, Engineering Manager, Roblox
About Roblox
Deep Learning for Text Classification
● Text classification is a key capability on the Roblox platform
● BERT is a deep learning model that has transformed the Natural Language Processing (NLP) landscape
● The performance (Precision/Recall Area Under Curve) of our text classifiers improved by 10 percentage points when fine-tuning BERT versus classical machine learning
Beyond Accuracy: Latency and Throughput
Latency: speed of a single request
Analogy -> how long it takes for a single person to cross a bridge
We required latency under 20 ms

Throughput: completed requests per second
Analogy -> how many people can cross a bridge in a period of time
We required over 50k requests per second

We want a short, wide bridge: minimize latency and maximize throughput in a real-time environment
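These two targets can be checked with a quick percentile benchmark; a minimal sketch with a stand-in request handler (the timings are hypothetical, not our production numbers):

```python
import time
import random
import statistics

def handle_request():
    # Stand-in for a model inference call (hypothetical timings).
    time.sleep(random.uniform(0.001, 0.003))

latencies = []
for _ in range(50):
    start = time.perf_counter()
    handle_request()
    latencies.append((time.perf_counter() - start) * 1000)  # ms

# Tail latency (p99) matters more than the average for a latency SLA.
p50 = statistics.median(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]
throughput = 1000 / statistics.mean(latencies)  # single-worker req/s
print(f"p50={p50:.1f}ms p99={p99:.1f}ms ~{throughput:.0f} req/s per worker")
```

Fleet-level throughput then scales with the number of workers, as long as each worker stays under the latency budget.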
Our Deep Learning Tech Stack
GPU vs CPU (for our application)

Higher throughput for model training: a GPU (Tesla V100) is ~10x faster than a CPU at processing training examples, due to its efficiency at large batch matrix operations.

Higher throughput for real-time inference: a CPU (Intel Xeon Scalable Processor) gives ~5x more throughput than a GPU, due to CPU-specific optimizations and spreading real-time inference requests across cores (with latency < 20 ms).

(Comparison on cost-equivalent hardware in 2020.)
Which Comes First?

Build an accurate model, then make it fast?
OR
Choose a known fast model, then make it accurate?
Know Your Quoc Le's

Quoc V. Le: has over 85k citations as an AI researcher, according to Google Scholar.

Quoc N. Le: once got kicked out of the Boomtown Casino in Reno for counting cards in blackjack.
Our Scaling Playbook on CPU: Less Is More!
❏ Smaller Model (Distillation)
❏ Smaller Inputs (Dynamic Inputs)
❏ Smaller Weights (Quantization)
❏ Smaller Number of Requests
(Caching)
❏ Smaller Number of Threads per
Core (Thread Tuning)
Where We Started

[Benchmark chart: baseline BERT, before any of the Smaller Model / Smaller Inputs / Smaller Weights optimizations. Benchmarks run on Intel Xeon Scalable Processors.]
Smaller Model (DistilBERT)

[Benchmark chart: latency/throughput after swapping BERT for DistilBERT. Benchmarks run on Intel Xeon Scalable Processors.]

BERT Base has 110M parameters; DistilBERT has 66M parameters.

Tradeoff: < 1% negative impact on “Accuracy” (PR AUC)
[Figure: knowledge distillation — a teacher model (BERT) transfers its knowledge to a student model (DistilBERT) with fewer transformer layers; both take the same text input and end in a predict layer. DistilBERT paper: https://arxiv.org/pdf/1910.01108.pdf]
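Distillation trains the smaller student to match the teacher's softened output distribution. A minimal sketch of the soft-target loss in plain Python (the temperature and logits are illustrative; real distillation training also mixes in the hard-label loss):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T softens the distribution,
    # exposing the teacher's "dark knowledge" about near-miss classes.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, T=2.0):
    # Cross-entropy of the student's softened distribution against the
    # teacher's softened distribution (the soft-target loss).
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

The loss is minimized when the student's distribution matches the teacher's, which is what lets the 66M-parameter student recover most of the 110M-parameter teacher's accuracy.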
Smaller Inputs (Dynamic Shapes)

[Benchmark chart: latency/throughput after adding dynamic input shapes. Benchmarks run on Intel Xeon Scalable Processors.]
Fixed shape inputs: zero-pad until all inputs have the same shape.
Dynamic shape inputs: do not zero-pad.
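The saving from dropping zero-padding is easy to quantify; a back-of-the-envelope sketch with hypothetical chat-message lengths:

```python
# Hypothetical token lengths for a batch of short text inputs.
lengths = [12, 7, 48, 5, 9, 31, 6, 14]

# Fixed shapes: every sequence is zero-padded to the longest in the
# batch, so the model processes max(length) * batch_size tokens.
fixed_tokens = max(lengths) * len(lengths)

# Dynamic shapes: no padding, so the model only processes real tokens.
dynamic_tokens = sum(lengths)

print(fixed_tokens, dynamic_tokens)  # 384 132
```

Since transformer compute grows with sequence length, processing ~2.9x fewer tokens in this hypothetical batch translates directly into latency and throughput gains, and short real-world inputs make the gap even larger.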
Smaller Weights (Quantization)

[Benchmark chart: latency/throughput after adding dynamic quantization. Benchmarks run on Intel Xeon Scalable Processors.]
[Figure: quantization maps 32-bit floating-point weights to 8-bit integers. Image credit: https://towardsdatascience.com/how-to-accelerate-and-compress-neural-networks-with-quantization-edfbbabb6af7]

Tradeoff: < 1% negative impact on “Accuracy” (PR AUC)

“Dynamic Quantization” one-liner
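The one-liner refers to PyTorch's dynamic quantization, which converts Linear-layer weights to int8 at load time and quantizes activations on the fly. A minimal sketch on a stand-in model (the layer sizes here are illustrative; the deck applied this to the fine-tuned DistilBERT):

```python
import torch
import torch.nn as nn

# Stand-in classifier; in the deck this would be the DistilBERT model.
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))

# The "one-liner": weights of all Linear modules become int8, while
# inputs and outputs stay float, so calling code is unchanged.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

out = quantized(torch.randn(1, 768))
```

No retraining is required, which is what makes this optimization nearly free to adopt.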
Smaller Number of Requests to Model (Caching)

[Figure: request flow between the text classification service, the cache, and the DistilBERT model. Image credit: https://peltarion.com/blog/data-science/illustration-3d-bert]

1. Retrieve the text classification result from the cache (we're done if it's there)
2. Else call the deep learning model for the result
3. Add the result to the cache, then return the result to the service
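The three steps above are the classic memoization pattern; a minimal in-process sketch with a stand-in for the model call (a production service would more likely use a shared cache such as Redis in front of the model):

```python
from functools import lru_cache

def classify_with_model(text: str) -> str:
    # Stand-in for the DistilBERT inference call (hypothetical labels).
    return "ok" if "hello" in text else "review"

@lru_cache(maxsize=100_000)
def classify(text: str) -> str:
    # lru_cache implements steps 1-3: return a cached result if present,
    # otherwise call the model, cache the result, and return it.
    return classify_with_model(text)

classify("hello world")  # cache miss: calls the model
classify("hello world")  # cache hit: model is skipped
```

Because many text inputs repeat, every cache hit removes one full model inference from the load.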
Smaller Number of Threads Per Core (Thread Tuning)
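One common way to tune threads for this serving pattern (an illustrative sketch, not necessarily the exact mechanism used here) is to pin the math libraries' thread pools to one thread per worker, so concurrent requests spread across cores instead of contending for them:

```python
import os

# Must be set before the math libraries (NumPy, PyTorch, MKL) are
# imported in the worker process, or the setting is ignored.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# With PyTorch already imported, the equivalent in-process call is:
#   torch.set_num_threads(1)
```

With one thread per worker and one worker per core, each request keeps predictable sub-20 ms latency while the fleet's throughput scales with core count.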
Our Scaling Playbook on CPU: Less Is More!
✓ Smaller Model (Distillation)
✓ Smaller Inputs (Dynamic Inputs)
✓ Smaller Weights (Quantization)
✓ Smaller Number of Requests
(Caching)
✓ Smaller Number of Threads per
Core (Thread Tuning)
30x Improvement in Latency and Throughput on CPU
[Benchmark chart: baseline BERT vs. the combined Smaller Model + Smaller Inputs + Smaller Weights optimizations. Benchmarks run on Intel Xeon Scalable Processors.]
Takeaways
● For certain real-time deep learning applications, it is feasible and natural to super-scale inference on CPU
● The key to scaling is making things smaller, as shown in this presentation
● Many of the optimizations that enabled this scale are easy to implement (one-liners)
● Check out our blog for more details: https://robloxtechblog.com/how-we-scaled-bert-to-serve-1-billion-daily-requests-on-cpus-d99be090db26
Questions? Suggestions?
We are always looking to get more performance from our models. Please reach out to kkaehler@roblox.com
P.S. We are always hiring 🤓
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.