Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC
Online, Offline, Mobile, Email, Social
www.sailthru.com
Cost Effectively Scaling Machine Learning Systems in the Cloud
Agenda:
● Background on me, Sailthru & Sightlines (mercifully short)
● Cost effective resources in the AWS cloud
● Efficient(ish) application design
● Easy maintenance and evolution
● Machine learning details
@jeremystan
[Figure: career timeline plotted on two axes, Capitalism vs. Idealism and Indirect Value vs. Direct Value]
● 2000: Graduate student, Math
● 2005: Consultant, Finance
● 2010: CTO, Ad Tech
● 2015: Chief Data Scientist, Mar Tech
Sailthru
Sightlines
Analytics
- Segmentation
- Forecasting
Personalization
- Recommendations
- Discounting
Optimization
- Frequency
- Channel
Requirements
1. ~5 million users per client
2. JSON formatted user data, siloed across clients
3. Predict varying outcomes: normal, Poisson, binomial, quantile, ...
4. Update models & predictions daily
5. Only really care about predictive performance
6. Scale to 1,000+ clients
Our Cost Effective Scaling Strategy
1. Get really cheap computing power
2. Make it work really, really hard
3. Optimize apps for ease of evolution
4. Set up identical A/B environments
Iterate aggressively based on data:
✓ Features
✓ Efficiency
✓ Scale
[Figure: worked example of compounding improvements. Surviving labels: 10x, 3x, 0.6x, 0.5x, = 9x; "JSON to Features" and "GBM in Memory", each marked "Half our processing"; 1x, 0.2x]
Cost Effective Resources in the AWS Cloud
Cost Effective
r3.8xlarge: 32 vCPU, 244GB RAM
Resource Utilization
● 30% (typical cloud)
● 10% (data center)
● 90% (highly efficient)
Cost Per Hour
● $2.80 (on demand)
● $1.76 (reserved 1yr)
● $1.05 (reserved 3yr)
● $0.28 (spot instance)
Cost per hour after utilization
● Cloud: $9.80
● Data Center: $10.50
● Spot + Mesos + Relay: $0.30
30x more cost efficient! ($10.50 = $1.05 / 10%)
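The slide's parenthetical, $10.50 = $1.05 / 10%, divides the hourly price by utilization. A minimal sketch of that arithmetic follows. The pairing of prices with utilization levels for the cloud and spot rows is an assumption (the slide only spells out the data center case), and with these pairings the formula lands near, but not exactly on, the slide's $9.80 and $0.30.

```python
# Effective hourly cost of useful compute = list price / utilization.
# Prices and utilization levels are the figures quoted on the slide;
# which price goes with which utilization is assumed for two of the rows.

def effective_cost(price_per_hour: float, utilization: float) -> float:
    """Cost per hour of capacity that is actually doing work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization

scenarios = {
    "data center (reserved 3yr price, 10% utilized)": (1.05, 0.10),  # from the slide
    "typical cloud (on demand, 30% utilized)": (2.80, 0.30),         # assumed pairing
    "spot instance (90% utilized)": (0.28, 0.90),                    # assumed pairing
}

for name, (price, utilization) in scenarios.items():
    print(f"{name}: ${effective_cost(price, utilization):.2f} per useful hour")
```

Whatever the exact inputs, the gap comes from two multiplied factors: a cheaper price per hour and a much higher share of each hour spent on real work.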
AWS Spot Instances
[Figure: chart annotated "Your bid", "What you pay", and "All instances died!"]
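The three labels on this slide can be read as a simple billing rule. The sketch below assumes the bid-based spot model in use at the time of the talk: each hour is billed at the market price rather than the bid, and the instance is terminated once the market price rises above the bid. The function name and the price series are made up for illustration.

```python
from typing import Sequence, Tuple

def run_on_spot(bid: float, hourly_prices: Sequence[float]) -> Tuple[float, int]:
    """Return (total_paid, hours_run) for one instance.

    Assumed model: every hour is charged at the market price, and the
    instance dies in the first hour where the market price exceeds the bid.
    """
    total_paid = 0.0
    hours_run = 0
    for price in hourly_prices:
        if price > bid:
            break  # "All instances died!"
        total_paid += price  # "What you pay" is the market price, not the bid
        hours_run += 1
    return total_paid, hours_run

# Hypothetical price series with one spike above the bid.
prices = [0.28, 0.30, 0.27, 3.50, 0.29]
paid, hours = run_on_spot(bid=1.00, hourly_prices=prices)
print(f"ran {hours} hours, paid ${paid:.2f}")
```

Under this model a high bid costs nothing extra while prices are low, but a spike takes out every instance bidding below it at once, which is why the applications have to tolerate sudden loss of workers.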
Mesos
81 “slaves”
4 availability zones
2 instance types
1,360 CPUs
10TB of RAM
94% utilized
$11.90 per hour
$104,244 per year
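The yearly figure is the hourly cluster cost multiplied out over a 365-day year. A short sketch of that arithmetic, plus the per-CPU cost it implies; the per-CPU numbers are derived here and do not appear on the slide.

```python
# Figures from the slide.
CLUSTER_COST_PER_HOUR = 11.90
CPUS = 1360
UTILIZATION = 0.94

HOURS_PER_YEAR = 24 * 365

annual_cost = CLUSTER_COST_PER_HOUR * HOURS_PER_YEAR

# Derived values (not on the slide): cost of one CPU for one hour,
# and the same cost counting only the hours that are actually used.
cost_per_cpu_hour = CLUSTER_COST_PER_HOUR / CPUS
cost_per_used_cpu_hour = cost_per_cpu_hour / UTILIZATION

print(f"annual cost: ${annual_cost:,.0f}")
print(f"per CPU-hour: ${cost_per_cpu_hour:.5f}")
print(f"per utilized CPU-hour: ${cost_per_used_cpu_hour:.5f}")
```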
Mesos + Marathon
[Figure: Mesos slaves (16 CPU and 8 CPU) spread across Zones 1 to 4, a Mesos master, and Apps A, B, and C, each with a queue size]
Applications must be:
● Distributed, to be scheduled wherever Mesos wants
● Fine-grained, to maximize utilization in Mesos
● Idempotent, to handle duplicate runs in case the network is partitioned
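The idempotence requirement can be illustrated with a task that derives its output location from its own identifier, so a duplicate run finds the result already in place and does nothing. This is a minimal sketch using the local filesystem; `run_task`, the task id format, and the payload are hypothetical and not taken from the talk.

```python
import json
import tempfile
from pathlib import Path

def run_task(task_id: str, payload: dict, out_dir: Path) -> bool:
    """Run a task whose output path depends only on task_id.

    Returns True if work was done, False if a previous run already
    produced the output. The computation is deterministic, so even two
    overlapping runs would write identical content.
    """
    out_path = out_dir / f"{task_id}.json"
    if out_path.exists():
        return False  # duplicate run: nothing to do

    result = {"task_id": task_id, "total": sum(payload["values"])}

    # Write to a temporary name, then rename, so readers never see a
    # half-written file.
    tmp_path = out_dir / f"{task_id}.json.tmp"
    tmp_path.write_text(json.dumps(result))
    tmp_path.replace(out_path)
    return True

with tempfile.TemporaryDirectory() as d:
    out = Path(d)
    first = run_task("client42_2015-03-27", {"values": [1, 2, 3]}, out)
    second = run_task("client42_2015-03-27", {"values": [1, 2, 3]}, out)
    stored = json.loads((out / "client42_2015-03-27.json").read_text())

print(first, second, stored["total"])
```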
[Figure: chart over time of available Mesos CPU jiffies, split into "User" and "Idle"]
Doesn’t work for apps with highly variable load
Mesos + Relay
Relay.Mesos
Auto-scaler for distributed applications
github.com/sailthru/relay.mesos
● Allocates resources based on queue size
● Wraps applications inside Mesos slaves
● Can significantly improve cluster utilization
[Figure: charts over time of available Mesos CPU jiffies ("User" vs. "Idle"), labeled "Before Relay" and "After Relay"; diagram with the Mesos Master, Relay.Mesos, Apps A, B, and C, and their queue sizes]
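Allocating resources from queue size can be sketched as a simple proportional rule: ask for enough workers to drain the backlog, within fixed bounds. This is an illustration of the idea only, not Relay.Mesos's actual algorithm or API; the function names and the `tasks_per_worker` parameter are assumptions.

```python
import math

def desired_workers(queue_size: int, tasks_per_worker: int,
                    min_workers: int, max_workers: int) -> int:
    """Number of workers needed to cover the queue, clamped to bounds."""
    wanted = math.ceil(queue_size / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))

def scaling_delta(current_workers: int, queue_size: int,
                  tasks_per_worker: int = 10,
                  min_workers: int = 0, max_workers: int = 50) -> int:
    """Positive: start this many workers. Negative: stop this many."""
    target = desired_workers(queue_size, tasks_per_worker,
                             min_workers, max_workers)
    return target - current_workers

# A bursty queue: the app asks for CPUs only while it has work.
for queue_size, running in [(0, 0), (95, 0), (10_000, 10), (0, 50)]:
    print(queue_size, running, scaling_delta(running, queue_size))
```

An app sized this way releases its CPUs when its queue is empty, which is what lets other apps with pending work use them.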
Efficient(ish) Application Design
Stolos
Distributed task dependency manager
github.com/sailthru/stolos
● Directed acyclic graph
● Parameterizable templates
● Handles queueing
● Ensures idempotence
Application Pipeline (simplified)
[Figure: pipeline diagram with stages labeled Sailthru User API, Mongo, ETL, JSON, Assembly, GBMs, Predict, Upload, Mongo, Analyze Models, and Reports]
Actually much more complex
● ~1,000 clients
● ~10 models
● ~10 steps
● ~100 sub-tasks
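The ideas listed for the dependency manager (a directed acyclic graph, parameterized job names, and skipping work that is already done) can be sketched with Python's standard `graphlib`. This is not Stolos's API. The step names echo the pipeline labels, but the edges between them, the `job_id` format, and the `completed` set are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps it depends on (assumed edges).
dependencies = {
    "etl": set(),
    "assembly": {"etl"},
    "gbms": {"assembly"},
    "predict": {"gbms"},
    "analyze_models": {"gbms"},
    "upload": {"predict"},
    "reports": {"analyze_models"},
}

def job_id(step: str, client: str, date: str) -> str:
    """Parameterized template: one job per (date, client, step)."""
    return f"{date}_{client}_{step}"

def run_pipeline(client: str, date: str, completed: set) -> list:
    """Run steps in dependency order, skipping jobs already completed."""
    ran = []
    for step in TopologicalSorter(dependencies).static_order():
        job = job_id(step, client, date)
        if job in completed:
            continue  # re-running the pipeline repeats no work
        completed.add(job)  # a real system would execute the step here
        ran.append(job)
    return ran

completed_jobs = set()
first_run = run_pipeline("client42", "2015-03-27", completed_jobs)
second_run = run_pipeline("client42", "2015-03-27", completed_jobs)
print(first_run)
print(second_run)
```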
Sampling Strategy
[Figure: for each day, Day 1 through Day N, user JSON flows from Mongo to S3, split into shard 1 through shard 1,000]
● JSON sharded on hash(user)
● Consistent 0.1% of data to a Mesos Slave CPU
● Apps sample more as needed
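Sharding on hash(user) into 1,000 shards means any single shard is a stable sample of roughly 0.1% of users, and reading more shards gives a larger sample that contains the smaller one. A minimal sketch follows; the choice of SHA-256 and the function names are assumptions, since the talk does not say which hash is used.

```python
import hashlib

N_SHARDS = 1000

def shard_for(user_id: str, n_shards: int = N_SHARDS) -> int:
    """Map a user to a shard using a hash that is stable across runs.

    Python's built-in hash() is randomized per process for strings,
    so a cryptographic digest is used instead.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards

def sample_users(user_ids, shards):
    """Keep only users whose shard is in the requested set."""
    wanted = set(shards)
    return [u for u in user_ids if shard_for(u) in wanted]

# Hypothetical user ids.
users = [f"user{i}@example.com" for i in range(100_000)]

one_shard = sample_users(users, {0})         # about 0.1% of users
ten_shards = sample_users(users, range(10))  # about 1%, a superset of the above

print(len(one_shard), len(ten_shards))
```

Because the same user lands in the same shard every day, a worker reading shard 0 on Day 1 and on Day N sees the same users' histories.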
User Profile JSON Data
Each User Radically Different
[Figure: grid with axes "User" and "Feature", first shown with "???"]
tidyjson
Turn JSON into data frames
github.com/sailthru/tidyjson
● Arbitrary JSON into R data.frames
● Guarantees deterministic structure
● Seamless with dplyr and %>%
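The problem tidyjson addresses, turning differently shaped JSON documents into a table with a fixed set of columns, can be illustrated in a few lines. The sketch below is a Python analogue of the idea and not tidyjson's R API; the flattening rule (dotted key paths, lists kept as values) and the sample documents are assumptions.

```python
import json

def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b": value}; other values are kept as-is."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat

def to_table(json_docs):
    """Return (columns, rows) with the same sorted columns for every row.

    Keys missing from a document become None, so the structure does not
    depend on which users happen to have which fields.
    """
    flat_rows = [flatten(json.loads(doc)) for doc in json_docs]
    columns = sorted({col for row in flat_rows for col in row})
    rows = [[row.get(col) for col in columns] for row in flat_rows]
    return columns, rows

# Two hypothetical user profiles with different fields.
docs = [
    '{"id": 1, "geo": {"city": "NYC"}, "opens": 12}',
    '{"id": 2, "purchases": {"count": 3}}',
]
columns, table = to_table(docs)
print(columns)
for row in table:
    print(row)
```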
Why GBMs?
● Predict varying outcomes: normal, Poisson, binomial, quantile, …
● Flexible enough to capture non-linearity & complex interactions: no need to feature engineer for each client
● Minimal number of hyper-parameters: depth, shrinkage, number of trees
● Robust to missing values: no need to impute
Distributing a GBM
$\alpha_1 \cdot \text{tree}_1 + \alpha_2 \cdot \text{tree}_2 + \alpha_3 \cdot \text{tree}_3 + \dots + \alpha_K \cdot \text{tree}_K$
1. Across the sum
Gives bagging, not boosting (iterative)
=> less accurate
2. Within each tree (Spark MLlib, H2O)
A lot of overhead and coordination
=> not efficient for many small GBMs
3. Across the GBMs ✓
50,000 GBMs to build
=> each can be built independently
50,000 = 1,000 clients * 10 models * 5-fold CV
[Figure: Mesos slaves across Zones 1 to 4, with whole GBMs, GBM 1 through GBM 50,000, assigned to them]
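Distributing across the GBMs treats each (client, model, fold) combination as one independent task. The sketch below enumerates those tasks and maps a fitting function over them. `fit_one` is a stand-in that does no real training, and a local thread pool stands in for the Mesos slaves; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def build_tasks(n_clients: int, n_models: int, n_folds: int) -> list:
    """One task per (client, model, cross-validation fold)."""
    return list(product(range(n_clients), range(n_models), range(n_folds)))

def fit_one(task: tuple) -> dict:
    """Stand-in for fitting a single GBM; needs nothing from other tasks."""
    client, model, fold = task
    return {"client": client, "model": model, "fold": fold, "status": "fitted"}

def fit_all(tasks: list, max_workers: int = 4) -> list:
    """No coordination between tasks, so a plain parallel map suffices."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, tasks))

all_tasks = build_tasks(n_clients=1000, n_models=10, n_folds=5)
print(len(all_tasks))  # 1,000 * 10 * 5

# Run a small subset locally.
results = fit_all(build_tasks(n_clients=3, n_models=2, n_folds=5))
print(len(results))
```

Each individual GBM is still built sequentially, tree by tree, so boosting is preserved; the parallelism comes only from the number of models.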
Grid Search
$\alpha_1 \cdot \text{tree}_1 + \alpha_2 \cdot \text{tree}_2 + \alpha_3 \cdot \text{tree}_3 + \dots + \alpha_K \cdot \text{tree}_K$
For each client & model:
1. Grid search over:
a. Depth: size of trees
b. Shrinkage: λ “learning rate” for $\{\alpha_i\}$
2. Cross-validate for optimal # of trees
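The two-step search above can be sketched as a loop over (depth, shrinkage) pairs, choosing for each pair the number of trees with the lowest cross-validated error. The `cv_error` callable is a hypothetical hook for whatever fitting and scoring code is in use, and `toy_cv_error` is a synthetic stand-in with a known minimum, not a real model.

```python
from itertools import product

def grid_search(cv_error, depths, shrinkages, max_trees):
    """Return the best {depth, shrinkage, n_trees, error}.

    cv_error(depth, shrinkage, n_trees) must return the mean
    cross-validated error for that configuration.
    """
    best = None
    for depth, shrinkage in product(depths, shrinkages):
        # Step 2: for this grid point, pick the tree count by CV error.
        n_trees = min(range(1, max_trees + 1),
                      key=lambda n: cv_error(depth, shrinkage, n))
        error = cv_error(depth, shrinkage, n_trees)
        if best is None or error < best["error"]:
            best = {"depth": depth, "shrinkage": shrinkage,
                    "n_trees": n_trees, "error": error}
    return best

def toy_cv_error(depth, shrinkage, n_trees):
    """Synthetic error surface: a smaller learning rate needs more trees,
    too few or too many trees both hurt, and depth 3 is best."""
    optimum = 5 / shrinkage
    return ((n_trees - optimum) / optimum) ** 2 \
        + 0.01 * abs(depth - 3) + 0.1 * shrinkage

best = grid_search(toy_cv_error, depths=[1, 3, 5],
                   shrinkages=[0.1, 0.05], max_trees=200)
print(best)
```

A real implementation would usually read the error at every tree count off a single fit per fold, rather than calling the scorer once per candidate tree count as this sketch does.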
Easy Maintenance & Evolution
Tools Used
● R: Modeling
● Python: ETL
● AWS S3: State
● Zookeeper: Coordination
● Spark: Map Reduce
● Marathon: Running Apps
● Mesos: Sharing
● ELK: Log Mgmt
● Consul: Discovery
● Chef: Automation
● Librato: Monitoring
● Sensu: Alerting
● Asgard: Auto Scaling
● AWS Spot: Compute
Group labels on the slide: Batch Applications, Frameworks, Cluster, Maintenance, Configuration
How we Iterate
[Figure: two environments, A and B, both fed JSON from the Sailthru User API and Mongo; versions v1.0.0, v1.0.1, and v1.0.2 shown across them]
● Tools
● Configuration
● Applications
✓ Check monitoring
✓ Check logging
✓ Check performance
Thank You!
Our team: Divyanshu Vats, Alex Gaudio, Andras Kerekes, Jeremy Stanley
