Lessons Learnt Scaling
Model Building @ Zendesk
The More the Merrier
Wai Chee Yau
Staff Software Engineer
Pepper
Model Building @ Zendesk
How we built machine learning models initially
What challenges we faced
How we evolved the model building infrastructure
What lessons we learnt
Scaling Can be Painful
A Few Models
Lots of Models
What is Zendesk?
Zendesk ML Products Journey
Customer Satisfaction
Prediction
2016
3000 models
per week
2017 2018 2019
Customer Satisfaction Prediction
Building Satisfaction Prediction Models in Hadoop
Hadoop
HDFS: Training data (Tickets)
Map Reduce Jobs
Models
400 models per job run
Hadoop Pains
Second ML Product
Answer Bot
Second ML Product - Answer Bot
Customer Satisfaction
Prediction
Answer Bot
2016
3000 models
per week
1 deep learning
model a few months
2017 2018 2019
Answer Bot - Zendesk’s First Deep Learning Adventure
Embedding: Turning words into numbers
[
7.25020394e-02, -1.19434139e-02, 2.35390533e-02,
9.40115377e-03, 8.13035890e-02, 6.50805384e-02,
-4.03035507e-02, -6.47375807e-02, 2.81035509e-02,
-1.87401652e-01, 1.12001531e-01, -2.67665803e-01,
6.60590157e-02, 2.46239230e-02, -3.72320563e-02,
3.12019400e-02, -7.69012272e-02, -1.70350112e-02,
-4.82226498e-02, -8.59876275e-02, 4.28824723e-02,
-9.28599089e-02, -6.01094738e-02, -8.52334574e-02,
8.72100666e-02, 1.91824064e-01, 1.05177149e-01,
-1.12113327e-01, -1.71761960e-01, -1.66820228e-01,
-1.36309946e-02, -3.36700417e-02, 3.18476819e-02,
-1.26342744e-01, -8.29755142e-03, 8.12109783e-02,
-1.25934565e-02, 1.49573416e-01, 2.69240364e-02,
“We are all prisoners of our
phones, thus they are called
cell phones!”
Embed!
Embedding Space, where no one has gone before
Men with heels
They is not
necessarily
plural
Short haired
women
Tenderloin is
the best part
of SF
Smell of weed
Embedding Space, where no one has gone before
Ticket
Third ML Product
Content Cues
Third ML Product - Content Cues
Customer Satisfaction
Prediction
Content Cues
(early access)
Answer Bot
2016
3000 models
per week
1 deep learning
model a few months
Up to 50000
models per day
2017 2018 2019
Content Cues - Summarizing Tickets into Topics
Scaling Challenges with Content Cues
Satisfaction Prediction
3k+ models weekly
Content Cues
Up to 50k models daily
Generate training
data for Content Cues
Feature Generation with Spark
MySQL, Kafka
Data Lake (S3): Tickets (snapshot)
EMR Spark: Filter and Partition by Account
ML Features (S3):
Training Features (account 1)
Training Features (account 2)
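The "filter and partition by account" step can be sketched in plain Python. This is only an illustration of the idea, not the Spark job from the talk: the field names `account_id` and `text`, the empty-text filter, and the `min_tickets` threshold are all assumptions. In Spark, the same grouping would typically be expressed as a write partitioned on the account column.

```python
from collections import defaultdict


def partition_by_account(tickets, min_tickets=1):
    """Group ticket rows by account and drop accounts with too few tickets."""
    by_account = defaultdict(list)
    for ticket in tickets:
        # Filter step (assumed rule): skip tickets with no usable text.
        if not ticket.get("text"):
            continue
        by_account[ticket["account_id"]].append(ticket)
    # Keep only accounts that have enough tickets to train on.
    return {
        account_id: rows
        for account_id, rows in by_account.items()
        if len(rows) >= min_tickets
    }


# Hypothetical sample rows standing in for the ticket snapshot.
tickets = [
    {"account_id": 1, "text": "Change flight"},
    {"account_id": 1, "text": "Cancel flight"},
    {"account_id": 2, "text": "Add frequent flyer number"},
    {"account_id": 2, "text": ""},
]
partitions = partition_by_account(tickets)
```

Each value in `partitions` corresponds to one per-account training feature set, which is what lets every account's model be built as an independent job.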
Group Tickets into
Topics
How to Group Tickets into Topics?
Add frequent flyer
number
Change flight
Cancel Flight
Ticket Summarisation Process
Embed
Text
Cluster
Generate
Titles +
Keywords
Topic 1
Topic 2
Topic n
Input
Ticket
Cluster
snapshots
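The embed, cluster, and generate steps can be illustrated with a toy pipeline. Everything here is a stand-in chosen for brevity: the bag-of-words "embedding", the greedy similarity clustering, the 0.3 threshold, and the keyword-based titles are assumptions, not the models used for Content Cues.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: bag-of-words counts (stand-in for a learned model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed is similar enough."""
    vectors = [embed(t) for t in texts]
    clusters = []  # each cluster is a list of indices into texts
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def describe(texts, members, n_keywords=2):
    """Derive keywords from word frequency; the title is just the keywords."""
    counts = Counter()
    for i in members:
        counts.update(embed(texts[i]))
    keywords = [word for word, _ in counts.most_common(n_keywords)]
    return {
        "title": " ".join(keywords),
        "keywords": keywords,
        "tickets": [texts[i] for i in members],
    }


texts = [
    "change flight date",
    "change my flight",
    "cancel booking refund",
    "refund for cancel booking",
]
clusters = cluster(texts)
topics = [describe(texts, members) for members in clusters]
```

The point of the sketch is the shape of the pipeline: each stage consumes the previous stage's output, so any stage can be swapped for a heavier model without changing the others.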
How to present the UI to the user?
Challenge
Content Cues - Summarizing Tickets into Topics
t-SNE Plot of Ticket Clusters
Balance
Algorithm Complexity
vs
System Performance
Lesson Learnt
How to scale model building?
Challenge
Generic Solution for Offline Batch Model Building
ML Features (S3)
Training Features (account 1)
Scalable Compute To Build Models
ML Models (S3)
Models
Requirements for Model Building
● Scalable and elastic
● Support building batches of models on a recurrent basis
● Support on demand building of models
● Flexibility to use CPU or GPU instances
Scalable Compute Options
AWS Batch for Building Models
AWS Batch
Job Queues: low, medium, high
Compute Environment (spot): Model Build Jobs
Compute Environment (on demand): Model Build Jobs
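A minimal sketch of how a build job might be routed to one of the priority queues. The queue names, the job definition name, the `ACCOUNT_ID` variable, and the rule that on-demand builds get the high-priority queue are assumptions for illustration. The dictionary keys follow the parameters of the AWS Batch `SubmitJob` API, so the result could be passed to boto3's `submit_job` call, but nothing is sent to AWS here.

```python
# Hypothetical queue names, one per priority level shown on the slide.
QUEUES = {
    "low": "model-build-low",
    "medium": "model-build-medium",
    "high": "model-build-high",
}


def build_submit_request(account_id, trigger):
    """Build SubmitJob arguments for one account's model build."""
    # Assumed policy: on-demand builds jump ahead of scheduled batches.
    priority = "high" if trigger == "on_demand" else "low"
    return {
        "jobName": f"build-model-{account_id}",
        "jobQueue": QUEUES[priority],
        "jobDefinition": "model-build",  # hypothetical job definition name
        "containerOverrides": {
            "environment": [
                {"name": "ACCOUNT_ID", "value": str(account_id)},
            ],
        },
    }


request = build_submit_request(42, "on_demand")
scheduled = build_submit_request(7, "scheduled")
```

Which compute environment (spot or on demand) runs the job is decided by the queue's configuration, not by the job itself, so the caller only has to pick a queue.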
Building Models with AWS Batch
AWS Batch
Job Queues
Compute Environments
Model Build Job
Model Serving Service
Submit Job
ML Features (S3)
Training Features (account 1)
ML Models (S3)
SNS + SQS
Models
Airflow
Select Suitable
Instance Types
For Jobs
Lesson Learnt
How to deal with different
data distribution across accounts?
Challenge
Distribution of account and ticket count
Axes: Number of Accounts, Number of Tickets
Static Resource Allocation
Allocate Resources Based on Job Size
Small
Medium
Large
Container resources: vCPU, memory
vCPU: 2 Memory: 2GB
AWS Batch
vCPU: 4 Memory: 5GB
vCPU: 8 Memory: 8GB
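The three container sizes above can be selected from the size of the job. In this sketch the vCPU and memory figures come from the slide, while the ticket-count thresholds and the pairing of each size with small, medium, and large are assumptions. The output uses the `resourceRequirements` shape from AWS Batch container overrides, where memory is given in MiB as a string.

```python
# (max tickets, tier, vCPU, memory in MiB). Thresholds are hypothetical.
TIERS = [
    (10_000, "small", 2, 2 * 1024),
    (100_000, "medium", 4, 5 * 1024),
    (float("inf"), "large", 8, 8 * 1024),
]


def resources_for(ticket_count):
    """Pick the smallest tier whose ticket limit covers this job."""
    for max_tickets, tier, vcpus, memory_mib in TIERS:
        if ticket_count <= max_tickets:
            return {
                "tier": tier,
                "resourceRequirements": [
                    {"type": "VCPU", "value": str(vcpus)},
                    {"type": "MEMORY", "value": str(memory_mib)},
                ],
            }


small = resources_for(500)
medium = resources_for(50_000)
large = resources_for(10_000_000)
```

Sizing the container per job means small accounts no longer reserve resources they never use, which is where the cost and throughput gains of dynamic allocation come from.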
Dynamic Resource Allocation
12x faster to build 50k models
Dynamic vs Static Resource Allocation
Dynamic
Resource Allocation
Optimizes Costs
Lesson Learnt
How to prove scalability?
Challenge
Elastic Compute
With Job Queues
Works Well
Lesson Learnt
How to fix the 0.03% failed jobs?
Challenge
Timing matters
Build latest clusters
Upload Clusters
Input Tickets
Previous Cluster snapshots
ML Models (S3)
Clusters info
Publish clusters
Upload Snapshot
Timing matters
Build latest clusters
Upload snapshots
Upload clusters
Input Tickets
Previous Cluster snapshots
ML Models (S3)
Clusters info
Publish clusters
Keep the ML Code
Idempotent
Lesson Learnt
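One way to read this lesson, sketched under assumptions: if every artifact is written to a key derived only from the account and the build date, and publishing happens last, then a retried job overwrites its own output instead of leaving mixed state. The in-memory `store` dictionary stands in for S3, and the key layout and the trivial "clustering" are hypothetical.

```python
def build_and_publish(store, account_id, build_date, tickets):
    """Write snapshot and clusters under a deterministic prefix, then publish."""
    prefix = f"{account_id}/{build_date}"
    # Stand-in for real clustering: deduplicated, normalised ticket texts.
    clusters = sorted({text.lower() for text in tickets})
    # Upload the snapshot and the clusters before anything is published.
    store[f"{prefix}/snapshot"] = list(tickets)
    store[f"{prefix}/clusters"] = clusters
    # Publish last: readers only ever see a pointer to a complete build.
    store[f"{account_id}/latest"] = prefix
    return prefix


store = {}
tickets = ["Change flight", "change flight", "Cancel flight"]
build_and_publish(store, 1, "2019-04-01", tickets)
first_run = dict(store)
# A retry of the same job leaves the store exactly as it was.
build_and_publish(store, 1, "2019-04-01", tickets)
```

Because a rerun produces the same keys and the same values, a failed job can simply be resubmitted without any cleanup step.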
Overcome out of memory errors
Challenge
Try It Till You Make It
AWS Batch SNS + SQS
Model
Building
Service
If job failed due to out of memory,
resubmit with higher memory limit
Trigger job
Job status
events
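The retry rule on this slide can be sketched as a small decision function that the model building service might run for each job status event. The event shape (a `detail` object with `status` and `container.reason`), the `OutOfMemoryError` marker, the doubling factor, and the memory ceiling are assumptions for illustration.

```python
OOM_MARKER = "OutOfMemoryError"  # assumed substring of the failure reason


def next_action(event, memory_mib, factor=2, max_memory_mib=32 * 1024):
    """Decide what to do with a job status event."""
    detail = event.get("detail", {})
    if detail.get("status") != "FAILED":
        return None  # nothing to do for running or succeeded jobs
    reason = detail.get("container", {}).get("reason", "")
    if OOM_MARKER not in reason:
        # Not a memory problem: retrying with more memory would not help.
        return {"action": "alert", "reason": reason}
    new_memory = memory_mib * factor
    if new_memory > max_memory_mib:
        # Stop escalating once the ceiling is hit, so retries cannot loop forever.
        return {"action": "alert", "reason": "memory ceiling reached"}
    return {"action": "resubmit", "memory_mib": new_memory}


oom_event = {
    "detail": {
        "status": "FAILED",
        "container": {"reason": "OutOfMemoryError: Container killed due to memory usage"},
    }
}
decision = next_action(oom_event, memory_mib=2048)
```

The ceiling matters as much as the retry: without it, a single pathological account could keep requesting larger and larger containers.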
Be Prepared to
Handle Outlier
Memory Usage
Lesson Learnt
How to validate ML models?
Challenge
Model Performance Concerns
Tickets in the topic
not related to each
other
Incorrect grammar
of title
Topic title not
related to tickets
Cluster Quality Checks
Checks
Overlap between:
● title and keywords
● title and tickets
● tickets and the keywords
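The three overlap checks can be sketched with simple token sets. The deck does not say how overlap is measured, so the metric here (fraction of shared lowercase tokens) and the 0.5 pass threshold are assumptions.

```python
def tokens(text):
    """Lowercase whitespace tokens of a text, as a set."""
    return set(text.lower().split())


def overlap(a, b):
    """Fraction of the tokens in a that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


def check_topic(title, keywords, tickets, min_overlap=0.5):
    """Score a topic on the three overlaps and report whether it passes."""
    title_tokens = tokens(title)
    keyword_tokens = {k.lower() for k in keywords}
    ticket_tokens = set().union(*(tokens(t) for t in tickets))
    scores = {
        "title_keywords": overlap(title_tokens, keyword_tokens),
        "title_tickets": overlap(title_tokens, ticket_tokens),
        "keywords_tickets": overlap(keyword_tokens, ticket_tokens),
    }
    passed = all(score >= min_overlap for score in scores.values())
    return scores, passed


tickets = ["change flight date", "change my flight"]
good_scores, good_ok = check_topic("change flight", ["change", "flight"], tickets)
bad_scores, bad_ok = check_topic("billing issue", ["change", "flight"], tickets)
```

A topic that fails any check can be withheld automatically, so a poor title or an incoherent cluster never reaches the user.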
Always Build in
Automatic
Model Validation
Lesson Learnt
Balance model complexity vs system performance
Lessons Learnt
Select suitable instance types for jobs
Elastic compute with job queues works well
Dynamic resource allocation optimizes costs
Keep the ML code idempotent
Be prepared to handle outlier memory usage
Always build in automatic model validation
TM and © 2018 Zendesk Inc. All rights reserved.

The More the Merrier: Scaling Model Building Infrastructure at Zendesk