The More the Merrier: Scaling Model Building Infrastructure at Zendesk

The More the Merrier
Lessons Learnt Scaling Model Building @ Zendesk

Wai Chee Yau
Staff Software Engineer
Pepper

Model Building @ Zendesk
How we built machine learning models initially
What challenges we faced
How we evolved the model building infrastructure
What lessons we learnt

What is Zendesk?

Zendesk ML Products Journey
Customer Satisfaction Prediction: 3000 models per week
2016 2017 2018 2019

Customer Satisfaction Prediction

Building Satisfaction Prediction Models in Hadoop
Hadoop (HDFS): Training data (Tickets) → MapReduce Jobs → Models
400 models per job run

Second ML Product
Answer Bot

Second ML Product - Answer Bot
Customer Satisfaction Prediction: 3000 models per week
Answer Bot: 1 deep learning model every few months
2016 2017 2018 2019

Answer Bot - Zendesk’s First Deep Learning Adventure

Embedding: Turning words into numbers
“We are all prisoners of our phones, thus they are called cell phones!”
Embed!
[
7.25020394e-02, -1.19434139e-02, 2.35390533e-02,
9.40115377e-03, 8.13035890e-02, 6.50805384e-02,
-4.03035507e-02, -6.47375807e-02, 2.81035509e-02,
-1.87401652e-01, 1.12001531e-01, -2.67665803e-01,
6.60590157e-02, 2.46239230e-02, -3.72320563e-02,
3.12019400e-02, -7.69012272e-02, -1.70350112e-02,
-4.82226498e-02, -8.59876275e-02, 4.28824723e-02,
-9.28599089e-02, -6.01094738e-02, -8.52334574e-02,
8.72100666e-02, 1.91824064e-01, 1.05177149e-01,
-1.12113327e-01, -1.71761960e-01, -1.66820228e-01,
-1.36309946e-02, -3.36700417e-02, 3.18476819e-02,
-1.26342744e-01, -8.29755142e-03, 8.12109783e-02,
-1.25934565e-02, 1.49573416e-01, 2.69240364e-02,
...
]
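To make "turning words into numbers" concrete, here is a minimal sketch. It is a toy word-hashing embedding, not the deep learning model used by Answer Bot; the function name `toy_embed` and the dimension of 8 are assumptions made for illustration.

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Map text to a fixed-length vector by hashing its words (toy stand-in)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        index = digest[0] % dim                      # dimension this word touches
        sign = 1.0 if digest[1] % 2 == 0 else -1.0   # spread words over +/- directions
        vec[index] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    # Scale to unit length so texts of different lengths are comparable.
    return [v / norm for v in vec] if norm else vec


vector = toy_embed("We are all prisoners of our phones, thus they are called cell phones!")
print(len(vector))
```

The same text always maps to the same vector. A learned embedding additionally places texts with similar meaning close together, which this toy version does not attempt.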

Embedding Space, where no one has gone before
Men with heels
They is not
necessarily
plural
Short haired
women
Tenderloin is
the best part
of SF
Smell of weed

Embedding Space, where no one has gone before
Ticket

Third ML Product
Content Cues

Third ML Product - Content Cues
Customer Satisfaction Prediction: 3000 models per week
Answer Bot: 1 deep learning model every few months
Content Cues (early access): Up to 50000 models per day
2016 2017 2018 2019

Content Cues - Summarizing Tickets into Topics

Scaling Challenges with Content Cues
Satisfaction Prediction
3k+ models weekly
Content Cues
Up to 50k models daily

Generate training data for Content Cues

Feature Generation with Spark
MySQL (multiple instances), Kafka → Data Lake (S3): Tickets (snapshot)
EMR Spark: Filter and Partition by Account
ML Features (S3): Training Features (account 1), Training Features (account 2)
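The filter-and-partition step can be sketched in plain Python. The slide runs this on EMR Spark; the version below only illustrates the logic on an in-memory list. The field names `account_id` and `text` and the empty-text filter are assumptions, not details from the talk.

```python
from collections import defaultdict


def partition_by_account(tickets, min_text_length=1):
    """Drop tickets with too little text, then group the rest by account."""
    partitions = defaultdict(list)
    for ticket in tickets:
        if len(ticket["text"].strip()) < min_text_length:
            continue  # assumed filter: skip tickets with no usable text
        partitions[ticket["account_id"]].append(ticket)
    return dict(partitions)


tickets = [
    {"account_id": 1, "text": "Change flight"},
    {"account_id": 2, "text": "Cancel flight"},
    {"account_id": 1, "text": "   "},
    {"account_id": 1, "text": "Add frequent flyer number"},
]
features = partition_by_account(tickets)
for account_id, rows in sorted(features.items()):
    print(account_id, len(rows))
```

Writing one feature set per account is what later allows one model build job per account.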

Group Tickets into Topics

How to Group Tickets into Topics?
Add frequent flyer number
Change flight
Cancel flight

Ticket Summarisation Process
Input Tickets → Embed Text → Cluster → Generate Titles + Keywords → Topic 1, Topic 2, ..., Topic n
Cluster snapshots
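A toy version of the embed, cluster, and keyword steps is sketched below. Word sets stand in for learned embeddings and a greedy Jaccard grouping stands in for the real clustering algorithm, which the slides do not name; the threshold of 0.3 is an assumption.

```python
from collections import Counter


def embed(text):
    """Toy 'embedding': the set of lowercase words in the text."""
    return set(text.lower().split())


def similarity(a, b):
    """Jaccard similarity between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(texts, threshold=0.3):
    """Greedy clustering: join the first cluster whose seed text is similar enough."""
    clusters = []  # each cluster is a list of texts; the first entry is the seed
    for text in texts:
        for members in clusters:
            if similarity(embed(text), embed(members[0])) >= threshold:
                members.append(text)
                break
        else:
            clusters.append([text])
    return clusters


def keywords(members, top_n=2):
    """Most frequent words in a cluster, ties broken alphabetically."""
    counts = Counter(word for text in members for word in embed(text))
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


tickets = [
    "change flight date",
    "change flight time",
    "cancel flight",
    "add frequent flyer number",
    "add frequent flyer number to booking",
]
topics = cluster(tickets)
for members in topics:
    print(keywords(members), len(members))
```

With these five tickets the sketch yields three topics, matching the airline example of changing a flight, cancelling a flight, and adding a frequent flyer number.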

How to present the UI to the user?
Challenge

Balance Algorithm Complexity vs System Performance
Lesson Learnt

How to scale model building?
Challenge

Generic Solution for Offline Batch Model Building
ML Features (S3): Training Features (account 1), ...
Scalable Compute To Build Models
ML Models (S3): Models

Requirements for Model Building
● Scalable and elastic
● Support building batches of models on a recurrent basis
● Support on demand building of models
● Flexibility to use CPU or GPU instances

AWS Batch for Building Models
AWS Batch
Job Queues: low, medium, high
Compute Environment (spot)
Compute Environment (on demand)
Model Build Job
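The queue layout can be written down as request parameters. The sketch below builds dictionaries shaped like the arguments of boto3's `create_job_queue` call for AWS Batch, without calling AWS. The queue names, priority numbers, and the mapping of queues to the spot and on-demand environments are assumptions; the slide only shows that three queues and two compute environments exist.

```python
def job_queue_request(name, priority, compute_environments):
    """Build keyword arguments shaped like boto3's batch.create_job_queue call."""
    return {
        "jobQueueName": name,
        "state": "ENABLED",
        "priority": priority,  # a higher number is evaluated first by the scheduler
        "computeEnvironmentOrder": [
            {"order": position, "computeEnvironment": env}
            for position, env in enumerate(compute_environments, start=1)
        ],
    }


# Hypothetical names; the mapping of queues to environments is an assumption.
SPOT = "model-build-spot"
ON_DEMAND = "model-build-on-demand"

queues = [
    job_queue_request("model-build-low", 1, [SPOT]),
    job_queue_request("model-build-medium", 5, [SPOT, ON_DEMAND]),
    job_queue_request("model-build-high", 10, [ON_DEMAND]),
]
for queue in queues:
    print(queue["jobQueueName"], queue["priority"])
```

Each dictionary could be passed as `client.create_job_queue(**request)` on a boto3 Batch client once the compute environments exist.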

Building Models with AWS Batch
Airflow
Model Serving Service
Submit Job
AWS Batch: Job Queues, Compute Environments
ML Features (S3): Training Features (account 1), ...
ML Models (S3): Models
SNS + SQS
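A job submission for one account might look like the sketch below. It builds parameters shaped like boto3's `submit_job` call for AWS Batch without calling AWS. The bucket names, path layout, job definition name, and parameter keys are hypothetical.

```python
def submit_job_request(account_id, product, queue, job_definition):
    """Build keyword arguments shaped like boto3's batch.submit_job call."""
    return {
        "jobName": f"{product}-account-{account_id}",
        "jobQueue": queue,
        "jobDefinition": job_definition,
        # Values substituted into placeholders of the job definition's command.
        "parameters": {
            "features_path": f"s3://ml-features/{product}/account={account_id}/",
            "models_path": f"s3://ml-models/{product}/account={account_id}/",
        },
    }


def scheduled_batch(account_ids, product):
    """Recurring build: one job per account, sent to the low priority queue."""
    return [
        submit_job_request(account_id, product, "model-build-low", "model-build")
        for account_id in account_ids
    ]


batch = scheduled_batch([1, 2, 3], "content-cues")
on_demand = submit_job_request(42, "content-cues", "model-build-high", "model-build")
print(len(batch), on_demand["jobName"])
```

The same request builder serves both requirements: a scheduler submits many requests on a recurring basis, while a service submits a single request on demand.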

Select Suitable Instance Types For Jobs
Lesson Learnt

How to deal with different data distribution across accounts?
Challenge

Distribution of account and ticket count
Number of Accounts
Number of Tickets

Allocate Resources Based on Job Size
Container resources: vCPU, memory
Small: vCPU: 2, Memory: 2GB
Medium: vCPU: 4, Memory: 5GB
Large: vCPU: 8, Memory: 8GB
AWS Batch
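Choosing container resources from the job size can be sketched as a lookup. The vCPU and memory values come from the slide; the ticket-count thresholds that separate small, medium, and large jobs are assumptions, since the slide does not state them. The returned dictionary is shaped like the `containerOverrides` argument of boto3's `submit_job`, where memory is given in MiB.

```python
# (tier name, assumed upper bound on tickets, vCPU, memory in MiB)
TIERS = [
    ("small", 10_000, 2, 2 * 1024),
    ("medium", 100_000, 4, 5 * 1024),
    ("large", None, 8, 8 * 1024),
]


def pick_tier(ticket_count):
    """Return the first tier whose ticket bound covers the job."""
    for name, max_tickets, vcpu, memory_mib in TIERS:
        if max_tickets is None or ticket_count <= max_tickets:
            return name, vcpu, memory_mib
    raise ValueError("no tier configured")


def container_overrides(ticket_count):
    """Resource overrides shaped like the containerOverrides of batch.submit_job."""
    _, vcpu, memory_mib = pick_tier(ticket_count)
    return {
        "resourceRequirements": [
            {"type": "VCPU", "value": str(vcpu)},
            {"type": "MEMORY", "value": str(memory_mib)},
        ]
    }


print(pick_tier(500))
print(container_overrides(1_000_000))
```

Small accounts then share instances densely, while only the few large accounts reserve the bigger containers.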

12x faster to build 50k models
Dynamic vs Static Resource Allocation

Dynamic Resource Allocation Optimizes Costs
Lesson Learnt

How to prove scalability?
Challenge

Elastic Compute With Job Queues Works Well
Lesson Learnt

How to fix the 0.03% failed jobs?
Challenge

Timing matters
Inputs: Input Tickets, Previous Cluster snapshots
Build latest clusters
Upload Clusters
Publish clusters
Upload Snapshot
ML Models (S3): Clusters info

Timing matters
Inputs: Input Tickets, Previous Cluster snapshots
Build latest clusters
Upload snapshots
Upload clusters
Publish clusters
ML Models (S3): Clusters info

Keep the ML Code Idempotent
Lesson Learnt
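One way to read the idempotency lesson is that rerunning a job with the same inputs should leave storage in the same state. The sketch below uses a dictionary as a stand-in for S3 and writes under keys derived only from the inputs. The key layout, the `latest` pointer, and the trivial "clustering" step are assumptions for illustration; the step order follows the second "Timing matters" slide.

```python
def build_and_publish(store, account_id, run_date, tickets):
    """Write snapshot and clusters under keys derived only from the inputs."""
    prefix = f"{account_id}/{run_date}"
    clusters = sorted(set(tickets))                # stand-in for the real clustering
    store[f"{prefix}/snapshot"] = list(clusters)   # 1. upload snapshot
    store[f"{prefix}/clusters"] = list(clusters)   # 2. upload clusters
    store[f"{account_id}/latest"] = prefix         # 3. publish by updating a pointer
    return prefix


store = {}
tickets = ["cancel flight", "change flight", "cancel flight"]
build_and_publish(store, 42, "2019-04-01", tickets)
first_run = dict(store)

# A retry after a partial failure repeats the same writes to the same keys.
build_and_publish(store, 42, "2019-04-01", tickets)
print(store == first_run)
```

Because no key depends on wall-clock time or a random identifier, a retried job overwrites its own output instead of leaving duplicates or half-published state.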

Overcome out of memory errors
Challenge

Try It Till You Make It
Model Building Service: Trigger job → AWS Batch
AWS Batch: Job status events → SNS + SQS → Model Building Service
If job failed due to out of memory, resubmit with higher memory limit
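The resubmit rule can be sketched as a small event handler. The event shape, the `OutOfMemoryError` marker used to detect the failure cause, the doubling factor, and the memory cap are all assumptions; the slide only states that out-of-memory failures are resubmitted with a higher memory limit.

```python
OOM_MARKER = "OutOfMemoryError"  # assumed marker in the failure reason


def handle_job_event(event, resubmit, factor=2, cap_mib=16384):
    """Resubmit a failed job with more memory if it ran out of memory.

    `event` is a simplified, hypothetical shape:
    {"job_name": str, "status": str, "reason": str, "memory_mib": int}
    Returns the new memory limit, or None when nothing was resubmitted.
    """
    if event["status"] != "FAILED" or OOM_MARKER not in event.get("reason", ""):
        return None
    if event["memory_mib"] >= cap_mib:
        return None  # already at the cap; leave it for a person to inspect
    new_memory = min(event["memory_mib"] * factor, cap_mib)
    resubmit(event["job_name"], new_memory)
    return new_memory


resubmitted = []
event = {
    "job_name": "content-cues-account-42",
    "status": "FAILED",
    "reason": "OutOfMemoryError: Container killed due to memory usage",
    "memory_mib": 2048,
}
handle_job_event(event, lambda name, memory: resubmitted.append((name, memory)))
print(resubmitted)
```

The cap keeps a pathological account from escalating forever; jobs that still fail at the cap need a different fix than more memory.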

Be Prepared to Handle Outlier Memory Usage
Lesson Learnt

How to validate ML models?
Challenge

Model Performance Concerns
Tickets in the topic not related to each other
Incorrect grammar of title
Topic title not related to tickets

Cluster Quality Checks
Checks
Overlap between:
● title and keywords
● title and tickets
● tickets and the keywords
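The three overlap checks above can be sketched with word sets. The overlap measure (fraction of one side's words found in the other) and the pass threshold of 0.5 are assumptions; the slides do not say how overlap is computed.

```python
def tokens(text):
    """Lowercase word set of a text."""
    return set(text.lower().split())


def overlap(part, whole):
    """Fraction of the words in `part` that also appear in `whole`."""
    return len(part & whole) / len(part) if part else 0.0


def cluster_checks(title, keywords, tickets, min_overlap=0.5):
    """Score the three overlaps and report whether all reach the threshold."""
    title_tokens = tokens(title)
    keyword_tokens = tokens(" ".join(keywords))
    ticket_tokens = tokens(" ".join(tickets))
    scores = {
        "title_keywords": overlap(title_tokens, keyword_tokens),
        "title_tickets": overlap(title_tokens, ticket_tokens),
        "tickets_keywords": overlap(keyword_tokens, ticket_tokens),
    }
    return scores, all(score >= min_overlap for score in scores.values())


tickets = ["change flight date", "change flight time"]
good_scores, good_ok = cluster_checks("Change flight", ["change", "flight"], tickets)
bad_scores, bad_ok = cluster_checks("Refund status", ["change", "flight"], tickets)
print(good_ok, bad_ok)
```

A topic that fails the checks can be withheld from publishing, so a title unrelated to its tickets never reaches the user.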

Always Build in Automatic Model Validation
Lesson Learnt

Lessons Learnt
● Balance model complexity vs system performance
● Select suitable instance types for jobs
● Elastic compute with job queues works well
● Dynamic resource allocation optimizes costs
● Keep the ML code idempotent
● Be prepared to handle outlier memory usage
● Always build in automatic model validation
