Managing Thousands of Spark Workers in Cloud Environment
#HWCSAIS14
Yuhao Zheng & Boduo Li, DataVisor
How To
• Peak Scale (# of Spark executors): 1 thousand → 4 thousand (4X)
• Cloud Cost (annual cost in USD): 15 million → 3 million (5X)
• Ops Cost (weekly man-hours): 40 → 10 (4X)
DataVisor
• Focus on fraud detection
• Founded in Dec 2013
• 100-person team
Coordinated Online Attacks
[Diagram: Crime Ring, Malicious Accounts, Launch Attacks, Loss: >50B/Year]
Attack types:
• Transaction Fraud
• Fake Review
• Promotion Abuse
DataVisor: UML Fraud Detection
Unsupervised Machine Learning
• Early Detection
– Catch incubated accounts
• High Coverage and Accuracy
– Detect all bad users in a campaign
• Unknown Attack Detection
– Catch unknown suspicious activities
UML is Expensive
User Events: Register, Profile, Login, Transaction, …
Data Clean, Feature Ext.
Feature Pool: Thousands of Features
• Behavior Pattern: application_freq, application_time, …
• Profile Pattern: work_year_distrib, marriage_distrib, promoter_info, …
• Device Pattern: deviceid_distrib, ip_usage, devicetype_var, …
• …
Clustering Analysis
• Association probability: $P(u, v) = \sum_i w_i \cdot f_i(x_u, x_v)$
• Clustering probability: $C_k = \sum_k P_k(u, v)$
• …
Huge Data Volume
3 Billion+ user accounts
600 Billion+ events and growing
3 Petabytes of data
Schedule Dependency
[Diagram: Original Data → dependency graph of Pipeline Modules → Detection Result]
~20 modules / client
Naïve Solution: Single Cluster
[Diagram: Spark Applications submitted to one Spark Cluster with a master (M) and slaves (S)]
• Static cluster
• No auto-scale
Estimated Annual Cost: 15 Million
Problems of Single Static Cluster
Application   Executor Memory   # Executors
1             2 GB              2
2             6 GB              80
3             12 GB             48
[Diagram: Cluster Size. Running: 12GB and 2GB executors, with wasted memory. Queueing: a 12GB app executor.]
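To make the mismatch concrete, the sketch below totals the memory requested by the three applications in the table and estimates the waste if every executor slot were sized for the largest executor. The uniform-slot model is an assumption made for illustration; the slide does not say how slots were actually sized.

```python
# Executor requirements from the table: (application, memory in GB, executor count).
apps = [(1, 2, 2), (2, 6, 80), (3, 12, 48)]

# Assumption for illustration: every slot is sized for the largest executor.
SLOT_GB = max(mem for _, mem, _ in apps)

# Memory the applications actually ask for.
requested_gb = sum(mem * count for _, mem, count in apps)

# Memory left unused when a smaller executor occupies a full-size slot.
wasted_gb = sum((SLOT_GB - mem) * count for _, mem, count in apps)

print(f"requested: {requested_gb} GB, wasted under uniform slots: {wasted_gb} GB")
```

Under this assumed model, most of the waste comes from the 6 GB application, because it has the most executors.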
Improvement: Multiple Clusters
[Diagram: Small Applications run on a cluster of 2GB Executors; Large Applications run on a cluster of 12GB Executors; each cluster has a master (M) and slaves (S)]
Estimated Annual Cost: 12 Million
Further Reduce Cost
Cloud cost
• Spot instances
• Smaller cluster
Operational cost
• Loss of spot
• Job failure
Estimated Annual Cost: 8 Million
Drawbacks of Static Allocation
Fixed Size
Always On
Over capacity
Under capacity
Maintenance Cost
Human Cost
Cloud Cost
Can We Go Dynamic?
Why Not?
More Requirements
• Product features
– Affect module dependencies
• Job priority
– SLA assurance
[Diagram legend: High priority, Normal priority; Product A, Product B]
DataVisor SparkGen
[Diagram: Prod Jobs, Prod Job Scheduler, Dev Jobs, Developers, Spark Resource Manager, and multiple clusters, each with a master (M) and slaves (S)]
Estimated Annual Cost: 3 Million
Cost Equations
Cost = Machine Cost + Human Cost
Machine Cost = Machine Up Time × Unit Price
Human Cost = Operation Overhead
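The cost equations can be written down directly as a small model. The sketch below is only an illustration: the class name, function names, and the numbers in the usage lines are hypothetical, not figures from the talk.

```python
from dataclasses import dataclass


@dataclass
class Fleet:
    machine_hours: float  # total machine up time, in hours
    unit_price: float     # price per machine-hour, in USD


def machine_cost(fleet: Fleet) -> float:
    # Machine Cost = Machine Up Time x Unit Price
    return fleet.machine_hours * fleet.unit_price


def total_cost(fleet: Fleet, human_cost: float) -> float:
    # Cost = Machine Cost + Human Cost
    return machine_cost(fleet) + human_cost


# Hypothetical numbers: halving up time halves the machine cost,
# while the human cost term is unchanged.
before = total_cost(Fleet(machine_hours=1000, unit_price=0.5), human_cost=200)
after = total_cost(Fleet(machine_hours=500, unit_price=0.5), human_cost=200)
print(before, after)
```

Each term can be reduced on its own: up time, unit price, and operation overhead.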
Reduce Machine Up Time
Single Static Cluster
⊕ One-time launch
⊖ Low utilization
⊖ Idle time
⊖ Limited concurrency
⊖ Inter-job interference
⊖ High maintenance overhead
Multiple Static Clusters
⊕ One-time launch
Moderate utilization
⊖ Idle time
⊖ Limited concurrency
⊖ Inter-job interference
⊖ High maintenance overhead
One Job Per Cluster
⊖ Per-job launch
⊕ High utilization
⊕ No idle time
⊕ Dynamic max concurrency
⊕ No inter-job interference
⊕ Low maintenance overhead
60% Saving
[Chart: Utilization (0 to 1) over Time, showing Launch, Job, Idle periods, and Under-utilized Resource]
Reduce Launch Time
• Pre-built AMI
– Systems & libs (dockerized)
– Pre-configured (non-runtime)
• Concurrent master/slave initialization
• Result: 30 min → 3 min
[Diagram: Amazon Machine Image (AMI) containing Docker containers for Spark, Ganglia, and Libs]
Slave Initialization (2 phases)
• Phase 1
– Launch instance
– Upload runtime configuration
– Start services (local)
• Phase 2 (Require Master Ready)
– Start services (connect to master)
[Diagram: Sequential Launch Time: Master Init, then phases 1 and 2 of each slave in turn. Concurrent Launch Time: phase 1 of the slaves overlaps Master Init, followed by phase 2.]
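The two-phase slave initialization can be sketched with threads: phase 1 of every slave runs while the master is still initializing, and only phase 2 waits for the master. This is a minimal simulation under stated assumptions; the function names, sleep durations, and slave names are hypothetical stand-ins for the real launch steps.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

master_ready = threading.Event()
events = []                      # (name, step) records, in completion order
events_lock = threading.Lock()


def record(name, step):
    with events_lock:
        events.append((name, step))


def init_master(seconds=0.2):
    time.sleep(seconds)          # stand-in for launching and configuring the master
    record("master", "ready")
    master_ready.set()


def init_slave(name, phase1_seconds=0.05, phase2_seconds=0.05):
    time.sleep(phase1_seconds)   # phase 1: launch, upload config, start local services
    record(name, "phase1")
    master_ready.wait()          # phase 2 requires the master to be ready
    time.sleep(phase2_seconds)   # phase 2: start services that connect to the master
    record(name, "phase2")


def launch_cluster(slave_names):
    # One worker per node, so the master and all slaves initialize concurrently.
    with ThreadPoolExecutor(max_workers=len(slave_names) + 1) as pool:
        futures = [pool.submit(init_master)]
        futures += [pool.submit(init_slave, n) for n in slave_names]
        for f in futures:
            f.result()           # re-raise any initialization error


launch_cluster(["slave-1", "slave-2"])
print(events)
```

The event acts as the "Require Master Ready" barrier: slaves never start phase 2 early, but they also never sit idle during the master's own setup.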
Maximize Job Concurrency
[Diagram: dependency graph of jobs A to M, with two timelines comparing Sequential execution and One Job Per Cluster]
• 2X lower latency
• Eliminate prioritization issue
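The latency gain from running each job on its own cluster can be illustrated by comparing the sum of all job run times with the longest dependency chain. The graph and durations below are hypothetical (the slide's actual dependencies are not recoverable from the text), and cluster launch time is ignored.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: job -> set of jobs it depends on.
deps = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}, "E": {"C"}}
# Hypothetical run time of each job, in hours.
hours = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 3}


def sequential_latency(hours):
    # One job at a time: latency is the sum of all run times.
    return sum(hours.values())


def concurrent_latency(deps, hours):
    # Each job starts as soon as all of its dependencies have finished,
    # so latency is the length of the longest dependency chain.
    finish = {}
    for job in TopologicalSorter(deps).static_order():
        start = max((finish[d] for d in deps[job]), default=0)
        finish[job] = start + hours[job]
    return max(finish.values())


print(sequential_latency(hours), concurrent_latency(deps, hours))
```

Because no job waits for a free slot on a shared cluster, there is also no need to decide which queued job goes first.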
Reduce Unit Price
• Spot Slaves (75% Saving)
• Reserved Masters (40% Saving)
[Chart: r4.2xlarge hourly price in USD (axis 0 to 0.6) for Spot, Reserved, and On Demand]
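As a worked example of the two savings, the sketch below computes the hourly price of one cluster with a reserved master and spot slaves. The on-demand price and cluster size in the usage lines are hypothetical, and the function assumes one master per cluster with all nodes of the same instance type.

```python
def cluster_hourly_price(on_demand_price, num_slaves,
                         spot_saving=0.75, reserved_saving=0.40):
    """Hourly price of one cluster: a reserved master plus spot slaves."""
    master = on_demand_price * (1 - reserved_saving)
    slaves = num_slaves * on_demand_price * (1 - spot_saving)
    return master + slaves


# Hypothetical on-demand price of 0.50 USD per hour and 10 slaves.
discounted = cluster_hourly_price(0.50, num_slaves=10)
all_on_demand = 0.50 * (10 + 1)
print(f"{discounted:.2f} vs {all_on_demand:.2f} USD per hour")
```

With many slaves per master, the blended saving approaches the spot saving, since slaves dominate the bill.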
Reduce Operation Overhead
• One Job Per Cluster
– Dynamic scale out
– No inter-job interference
– Easy patch/re-launch clusters
– Spot Fleet
• Higher availability (diversified)
• Maintain minimum capacity
[Diagram: fleet diversified across zone a, r4.2xlarge; zone b, r4.8xlarge; zone c, r4.xlarge; zone b, r4.4xlarge; zone b, r4.2xlarge]
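A diversified fleet that maintains its capacity might be described as follows. The dictionary follows the shape of the `SpotFleetRequestConfig` argument accepted by boto3's EC2 `request_spot_fleet` call; the call itself is left out so the sketch runs without AWS credentials. The helper name, image ID, role ARN, region prefix, and target capacity are placeholders, not values from the talk.

```python
def spot_fleet_config(image_id, fleet_role_arn, target_capacity, pools):
    """Build a Spot Fleet request spread across (zone, instance type) pools."""
    return {
        "IamFleetRole": fleet_role_arn,
        "AllocationStrategy": "diversified",  # spread instances across pools
        "Type": "maintain",                   # replace interrupted instances
        "TargetCapacity": target_capacity,
        "LaunchSpecifications": [
            {
                "ImageId": image_id,
                "InstanceType": instance_type,
                "Placement": {"AvailabilityZone": zone},
            }
            for zone, instance_type in pools
        ],
    }


# Pools mirroring the slide; the "us-west-2" region prefix is an assumption.
pools = [
    ("us-west-2a", "r4.2xlarge"),
    ("us-west-2b", "r4.8xlarge"),
    ("us-west-2c", "r4.xlarge"),
    ("us-west-2b", "r4.4xlarge"),
    ("us-west-2b", "r4.2xlarge"),
]
config = spot_fleet_config(
    image_id="ami-00000000",  # placeholder AMI
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",  # placeholder
    target_capacity=100,
    pools=pools,
)
print(len(config["LaunchSpecifications"]), "pools")
```

Since the instance types differ in size, a real request would probably also set a weighted capacity per launch specification; that is omitted here for brevity.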
Why Not YARN?
• Compared to One Job Per Cluster
– Single point of failure (Master)
– Slower to scale
– One more system to configure / maintain
Job Scheduler
[Diagram: Simple Per-client Spec, Product Features, Auto Generate Dependency, Spark Resource Manager]
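One way dependencies could be generated from a simple per-client spec is sketched below. Everything here is hypothetical: the module names, the feature-to-module mapping, and the spec format are invented for illustration, and the talk does not describe the actual SparkGen spec.

```python
# Hypothetical catalog: module -> modules it depends on.
MODULE_DEPS = {
    "ingest": [],
    "feature_extraction": ["ingest"],
    "clustering": ["feature_extraction"],
    "detection": ["clustering"],
    "review_report": ["detection"],
}
# Hypothetical mapping: product feature -> modules it needs.
FEATURE_MODULES = {
    "fraud_detection": ["detection"],
    "reporting": ["review_report"],
}


def generate_dependencies(client_spec):
    """Return the modules a client needs, each with its upstream modules."""
    needed = {}
    stack = [m for f in client_spec["features"] for m in FEATURE_MODULES[f]]
    while stack:
        module = stack.pop()
        if module not in needed:
            needed[module] = list(MODULE_DEPS[module])
            stack.extend(MODULE_DEPS[module])  # pull in upstream modules too
    return needed


spec = {"client": "demo", "features": ["fraud_detection"]}
graph = generate_dependencies(spec)
print(sorted(graph))
```

The client spec only lists product features; the modules and their ordering follow from the catalog, so enabling a feature changes the generated graph without hand-editing any schedule.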
Results
• Peak Scale (# of Spark executors): 1 thousand → 4 thousand (4X)
• Cloud Cost (annual cost in USD): 15 million → 3 million (5X)
• Ops Cost (weekly man-hours): 40 → 10 (4X)
• Pipeline Latency (end-to-end hours): 12 → 6 (2X)
Q & A
www.datavisor.com
