Jeremy Stanley, EVP/Data Scientist, Sailthru at MLconf NYC

Online, Offline, Mobile, Email, Social
www.sailthru.com
Cost Effectively Scaling Machine Learning Systems in the Cloud
Agenda:
● Background on me, Sailthru & Sightlines (mercifully short)
● Cost effective resources in the AWS cloud
● Efficient(ish) application design
● Easy maintenance and evolution
● Machine learning details

@jeremystan
[Figure: career path plotted on two axes, Idealism vs. Capitalism and Indirect Value vs. Direct Value: Graduate student, Math (2000); Consultant, Finance (2005); CTO, Ad Tech (2010); Chief Data Scientist, Mar Tech (2015)]

Sailthru

Sightlines
Analytics
- Segmentation
- Forecasting
Personalization
- Recommendations
- Discounting
Optimization
- Frequency
- Channel

Requirements
1. ~5 million users per client
2. JSON formatted user data, siloed across clients
3. Predict varying outcomes
normal, poisson, binomial, quantile, ...
4. Update models & predictions daily
5. Only really care about predictive performance
6. Scale to 1,000+ clients

Our Cost Effective Scaling Strategy
1. Get really cheap computing power
2. Make it work really, really hard
3. Optimize apps for ease of evolution
4. Set up identical A/B environments
Iterate aggressively based on data:
✓ Features
✓ Efficiency
✓ Scale
[Figure: "JSON to Features" and "GBM in Memory", each labeled "Half our processing"; multipliers shown: 10x, 3x, 0.6x, 0.5x, 9x, 1x, 0.2x]

Cost Effective
Resources in
the AWS Cloud

Cost Effective r3.8xlarge (32 vCPU, 244GB RAM)
Cost Per Hour:
$2.80 (on demand)
$1.76 (reserved 1yr)
$1.05 (reserved 3yr)
$0.28 (spot instance)
Resource Utilization:
10% (data center)
30% (typical cloud)
90% (highly efficient)
Cost per hour after adjusting for utilization:
Data Center: $10.50 ($10.50 = $1.05 / 10%)
Cloud: $9.80
Spot + Mesos + Relay: $0.30
30x more cost efficient!
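
The utilization adjustment on this slide can be sketched as price divided by utilization, following the slide's own "$10.50 = $1.05 / 10%". The prices and utilization levels come from the slide; pairing on-demand pricing with the 30% "typical cloud" figure and spot pricing with 90% is an assumption made here for illustration. Note that $2.80 / 30% does not reproduce the slide's $9.80 exactly, so the sketch prints computed values rather than restating the slide's.

```python
# Effective cost of one r3.8xlarge per fully utilized hour:
# list price divided by the fraction of the machine doing useful work.

def effective_cost(price_per_hour: float, utilization: float) -> float:
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return price_per_hour / utilization

# (price per hour, utilization). Values are from the slide; the pairing of
# on-demand with 30% and spot with 90% is an assumption.
scenarios = {
    "data center (reserved 3yr, 10%)": (1.05, 0.10),
    "typical cloud (on demand, 30%)": (2.80, 0.30),
    "spot + Mesos + Relay (spot, 90%)": (0.28, 0.90),
}

costs = {name: effective_cost(p, u) for name, (p, u) in scenarios.items()}
for name, cost in costs.items():
    print(f"{name}: ${cost:.2f} per utilized hour")

# Ratio between the most and least expensive scenario.
ratio = max(costs.values()) / min(costs.values())
print(f"ratio: {ratio:.0f}x")
```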

AWS Spot Instances
[Chart: "Your bid" vs. "What you pay"; annotation: "All instances died!"]
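
A minimal sketch of the bid-based spot model the chart refers to: while the market price stays at or below your bid you pay the market price (not the bid), and once the market price rises above the bid the instance is terminated. The price series and the `simulate_spot` helper are hypothetical, and the sketch simplifies by never relaunching after termination.

```python
def simulate_spot(bid, market_prices):
    """Return (hours_run, total_paid) for one instance under a bid-based spot model."""
    hours_run = 0
    total_paid = 0.0
    for price in market_prices:
        if price > bid:
            break  # outbid: the instance is terminated
        hours_run += 1
        total_paid += price  # billed at the market price, not at the bid
    return hours_run, total_paid

# Hypothetical hourly market prices with one spike above the bid.
prices = [0.28, 0.27, 0.30, 0.29, 3.10, 0.28]
hours, paid = simulate_spot(bid=1.00, market_prices=prices)
print(hours, round(paid, 2))
```

A higher bid does not raise the hourly bill in this model; it only delays the point at which a spike kills the instance.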

Mesos
81 “slaves”
4 availability zones
2 instance types
1,360 CPUs
10TB of RAM
94% utilized
$11.90 per hour
$104,244 per year
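
The yearly figure follows from the hourly one if the cluster runs around the clock. A quick check of that arithmetic, plus the implied cost per CPU-hour (a derived number, not one shown on the slide):

```python
HOURS_PER_YEAR = 24 * 365  # assumes the cluster runs all year

cluster_cost_per_hour = 11.90  # from the slide
cpus = 1360                    # from the slide

annual_cost = cluster_cost_per_hour * HOURS_PER_YEAR
cost_per_cpu_hour = cluster_cost_per_hour / cpus

print(f"annual: ${annual_cost:,.0f}")
print(f"per CPU-hour: ${cost_per_cpu_hour:.5f}")
```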

Mesos + Marathon
[Diagram: Mesos slaves (16 CPU and 8 CPU) across Zone 1, Zone 2, Zone 3, Zone 4]

Mesos + Marathon
[Diagram: Mesos Master, Mesos Slaves (16 CPU, 8 CPU), and App A, App B, App C with their queue sizes]
Applications must be:
● Distributed to be scheduled wherever Mesos wants
● Fine Grained to maximize utilization in Mesos
● Idempotent to handle duplicate runs in case network
is partitioned
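
One common way to get the idempotency described above is to key each task's output by its task id and skip the work when that output already exists. The sketch below assumes this pattern; `run_task`, the file layout, and the task id format are hypothetical, not taken from the talk.

```python
import json
import os
import tempfile

def run_task(task_id, out_dir, compute):
    """Run `compute(task_id)` unless its output file already exists."""
    out_path = os.path.join(out_dir, f"{task_id}.json")
    if os.path.exists(out_path):
        return out_path  # duplicate run: nothing left to do
    result = compute(task_id)
    # Write to a temp file and rename, so a crash never leaves partial output.
    # If two copies race, both write the same content and the rename is atomic.
    fd, tmp_path = tempfile.mkstemp(dir=out_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(result, f)
    os.replace(tmp_path, out_path)
    return out_path

calls = []

def fake_compute(task_id):
    calls.append(task_id)  # records how often real work happens
    return {"task": task_id, "rows": 3}

with tempfile.TemporaryDirectory() as d:
    first = run_task("client42_day1", d, fake_compute)
    second = run_task("client42_day1", d, fake_compute)  # scheduled twice

print(len(calls))
```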

Mesos + Marathon
[Diagram: Mesos Master, Mesos Slaves (16 CPU, 8 CPU), and App A, App B, App C with their queue sizes]
[Chart: available Mesos CPU jiffies over time, split into User and Idle]
Doesn’t work for apps with highly variable load

Mesos + Relay
[Charts: available Mesos CPU jiffies, split into User and Idle (two panels)]
Relay.Mesos
Auto-scaler for distributed applications
github.com/sailthru/relay.mesos
● Allocates resources based on queue size
● Wraps applications inside Mesos slaves
● Can significantly improve cluster utilization
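
Allocating resources from queue size can be sketched as a simple proportional rule: ask for enough workers to drain the queue, within fixed bounds. This is an illustration of the idea only, not the algorithm relay.mesos implements; `desired_workers`, `tasks_per_worker`, and the bounds are hypothetical.

```python
import math

def desired_workers(queue_size, tasks_per_worker, min_workers=0, max_workers=100):
    """Workers needed to drain `queue_size` tasks, clamped to [min, max]."""
    if tasks_per_worker <= 0:
        raise ValueError("tasks_per_worker must be positive")
    wanted = math.ceil(queue_size / tasks_per_worker)
    return max(min_workers, min(max_workers, wanted))

def scaling_delta(current_workers, queue_size, tasks_per_worker, max_workers=100):
    """Positive: launch this many workers. Negative: retire this many."""
    target = desired_workers(queue_size, tasks_per_worker, max_workers=max_workers)
    return target - current_workers

# An empty queue releases every worker; a deep queue is capped at the maximum.
for q in (0, 95, 5000):
    print(q, desired_workers(q, tasks_per_worker=10))
```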
[Diagram: Mesos Master with App A, App B, App C and their queue sizes over time, shown before Relay and after Relay (with Relay.Mesos)]

Efficient(ish)
Application
Design

Stolos
Distributed task dependency manager
github.com/sailthru/stolos
● Directed acyclic graph
● Parameterizable templates
● Handles queueing
● Ensures idempotency
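
A dependency manager built on a directed acyclic graph with parameterizable templates can be sketched with Python's standard `graphlib`. This is not Stolos's API: the step names are borrowed from the pipeline slide, but the edges between them and the `expand` helper are assumptions for illustration.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Template: step -> set of steps it depends on (assumed wiring).
pipeline = {
    "etl": set(),
    "assembly": {"etl"},
    "gbms": {"assembly"},
    "analyze_models": {"gbms"},
    "predict": {"gbms"},
    "upload": {"predict"},
    "reports": {"analyze_models"},
}

def expand(template, client_id):
    """Instantiate the template for one client by suffixing every node."""
    return {
        f"{step}:{client_id}": {f"{dep}:{client_id}" for dep in deps}
        for step, deps in template.items()
    }

# A valid execution order: every step appears after all of its dependencies.
order = list(TopologicalSorter(expand(pipeline, "client_1")).static_order())
print(order)
```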
Application Pipeline (simplified)
[Diagram: Sailthru User API, Mongo, ETL, JSON, Assembly, GBMs, Analyze Models, Predict, Upload, Mongo, Reports]
Actually much more complex
● ~1,000 clients
● ~10 models
● ~10 steps
● ~100 sub-tasks

Sampling Strategy
[Diagram: Mongo, JSON, S3; shard 1 through shard 1,000; Day 1 through Day N]
JSON sharded on hash(user)
Consistent 0.1% of data to a Mesos Slave CPU
Apps sample more as needed
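
A sketch of sharding on hash(user) with 1,000 shards, where one shard is one way to read the "consistent 0.1%" sample and reading more shards is how an app would sample more. The choice of MD5 as the stable hash and the helpers `shard_of`, `sample_shards`, and `in_sample` are assumptions; the talk does not say which hash was used.

```python
import hashlib

N_SHARDS = 1000

def shard_of(user_id, n_shards=N_SHARDS):
    """Stable shard number in 1..n_shards (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_shards + 1

def sample_shards(fraction, n_shards=N_SHARDS):
    """The first k shards, so a larger sample always contains the smaller one."""
    k = max(1, round(fraction * n_shards))
    return range(1, k + 1)

def in_sample(user_id, fraction):
    return shard_of(user_id) in sample_shards(fraction)

users = [f"user{i}" for i in range(5000)]  # hypothetical user ids
small = [u for u in users if in_sample(u, 0.001)]  # one shard, about 0.1%
large = [u for u in users if in_sample(u, 0.01)]   # ten shards, about 1%
print(len(small), len(large))
```

Because the shard depends only on the user id, the same users land in the same shard on every day, which keeps a sample consistent over time.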

User Profile JSON Data

Each User Radically Different
[Figure: User vs. Feature, annotated "???"]
tidyjson
Turn JSON into data frames
github.com/sailthru/tidyjson
● Arbitrary JSON into R data.frames
● Guarantees deterministic structure
● Seamless with dplyr and %>%
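
tidyjson is an R package, so the following is only a Python analogue of the idea, not its API: flatten arbitrary JSON into rows with a fixed set of columns (document id, path, value), so that radically different users still produce the same table shape. The `tidy_rows` helper and the sample user fields are hypothetical.

```python
import json

def tidy_rows(doc_id, value, path=()):
    """Yield (doc_id, path, value) for every scalar leaf, in a deterministic order."""
    if isinstance(value, dict):
        for key in sorted(value):  # sorted keys keep the output order stable
            yield from tidy_rows(doc_id, value[key], path + (str(key),))
    elif isinstance(value, list):
        for i, item in enumerate(value):
            yield from tidy_rows(doc_id, item, path + (str(i),))
    else:
        yield (doc_id, ".".join(path), value)

# Two users with completely different fields (made-up data).
users = [
    '{"id": "u1", "geo": {"city": "NYC"}, "purchases": [{"price": 20}, {"price": 35}]}',
    '{"id": "u2", "opens": 7}',
]

rows = [row for i, raw in enumerate(users) for row in tidy_rows(i, json.loads(raw))]
for row in rows:
    print(row)
```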

Why GBMs?
● Predict varying outcomes
normal, poisson, binomial, quantile, …
● Flexible enough to capture non-linearity & complex interactions
no need to feature engineer for each client
● Minimal number of hyper-parameters
depth, shrinkage, number of trees
● Robust to missing values
no need to impute
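
To make the three hyper-parameters concrete, here is a minimal gradient boosting sketch for squared error only, using depth-1 trees (stumps) on a single feature. It is a teaching sketch, not the GBM implementation used in the talk, and it does not cover the poisson, binomial, or quantile losses listed above. It assumes the feature has at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on one feature (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # both sides of each split are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lmean) ** 2 for r in left) + sum((r - rmean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def fit_gbm(xs, ys, n_trees=50, shrinkage=0.1):
    """Each new stump fits the residuals of the model built so far."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + shrinkage * tree(x) for x, p in zip(xs, preds)]
    return lambda x: base + shrinkage * sum(t(x) for t in trees)

xs = [i / 10 for i in range(40)]
ys = [x ** 2 for x in xs]  # a non-linear target, no feature engineering
model = fit_gbm(xs, ys, n_trees=50, shrinkage=0.1)

mean_y = sum(ys) / len(ys)
mse_mean = sum((y - mean_y) ** 2 for y in ys) / len(ys)
mse_model = sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
print(mse_mean, mse_model)
```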

Distributing a GBM
GBM = α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
1. Across the sum
Gives bagging, not boosting (iterative)
=> less accurate
2. Within each tree (Spark MLlib, H2O)
A lot of overhead and coordination
=> not efficient for many small GBMs
3. Across the GBMs ✓
50,000 GBMs to build
=> each can be built independently
50,000 = 1,000 clients * 10 models * 5-fold CV
[Diagram: GBM 1 … GBM 50,000 distributed across Mesos Slaves]
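
Distributing across the GBMs means the unit of work is one (client, model, fold) combination with no coordination between units. A sketch under that reading: the task list is enumerated up front and mapped over a worker pool. The local thread pool stands in for Mesos slaves and `build_gbm` is a placeholder that trains nothing; both are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

def gbm_tasks(n_clients, n_models, n_folds):
    """Every independent GBM to build, as (client, model, fold) tuples."""
    return list(product(range(n_clients), range(n_models), range(n_folds)))

def build_gbm(task):
    """Placeholder for training one GBM; returns only an identifier."""
    client, model, fold = task
    return f"client{client}/model{model}/fold{fold}"

all_tasks = gbm_tasks(n_clients=1000, n_models=10, n_folds=5)
print(len(all_tasks))

# No task depends on another, so any pool of workers can take them in any order.
with ThreadPoolExecutor(max_workers=4) as pool:
    built = list(pool.map(build_gbm, all_tasks[:20]))
print(built[:3])
```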

Grid Search
GBM = α1 * tree 1 + α2 * tree 2 + α3 * tree 3 + … + αK * tree K
For each client & model:
1. Grid search over:
a. Depth: size of trees
b. Shrinkage: λ “learning rate” for {αi}
2. Cross-validate for optimal # of trees
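
The two steps above can be sketched as a loop over (depth, shrinkage) pairs, where each pair yields a cross-validated error curve over the number of trees and the minimum of that curve picks the tree count. The `grid_search` helper and the synthetic `toy_curve` are hypothetical; in practice the curve would come from fitting GBMs on the CV folds.

```python
from itertools import product

def grid_search(cv_error_curve, depths, shrinkages, max_trees):
    """Return the best depth, shrinkage, and number of trees by CV error.

    `cv_error_curve(depth, shrinkage, max_trees)` must return a list whose
    entry k-1 is the mean cross-validated error after k trees.
    """
    best = None
    for depth, shrinkage in product(depths, shrinkages):
        curve = cv_error_curve(depth, shrinkage, max_trees)
        n_trees = min(range(len(curve)), key=curve.__getitem__) + 1
        error = curve[n_trees - 1]
        if best is None or error < best["cv_error"]:
            best = {"depth": depth, "shrinkage": shrinkage,
                    "n_trees": n_trees, "cv_error": error}
    return best

def toy_curve(depth, shrinkage, max_trees):
    """Synthetic U-shaped curve: error falls with more trees, then overfits."""
    return [1.0 / (1 + shrinkage * k * depth) + 0.0005 * k * depth
            for k in range(1, max_trees + 1)]

best = grid_search(toy_curve, depths=[1, 2, 4], shrinkages=[0.01, 0.1], max_trees=200)
print(best)
```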

Easy
Maintenance
& Evolution

Tools Used
Groups shown: Batch Applications; Frameworks; Maintenance
R: Modeling
Python: ETL
AWS S3: State
Zookeeper: Coordination
Spark: Map Reduce
Marathon: Running Apps
Mesos: Cluster Sharing
ELK: Log Mgmt
Consul: Discovery
Chef: Configuration Automation
Librato: Monitoring
Sensu: Alerting
Asgard: Auto Scaling
AWS Spot: Compute

How we Iterate
[Diagram: Sailthru User API, Mongo, JSON, and environments A and B]
● Tools
● Configuration
● Applications
Versions: v1.0.0, v1.0.1, v1.0.2
✓ Check monitoring
✓ Check logging
✓ Check performance

Thank You! Our team:
Divyanshu Vats, Alex Gaudio, Andras Kerekes, Jeremy Stanley
