WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Panagiotis Garefalakis (Imperial College London)
Konstantinos Karanasos (Microsoft)
Peter Pietzuch (Imperial College London)
Cooperative Task Execution
for Apache Spark
#UnifiedAnalytics #SparkAISummit
Evolution of analytics
> 2010: Batch frameworks
> 2014: Stream frameworks
> 2018: Unified stream/batch frameworks, and frameworks with hybrid stream/batch applications (Structured Streaming)
4#UnifiedAnalytics #SparkAISummit
InferenceJob
Low-latency
responses
Trained
Model
Historical
data
Real-time
data
TrainingJob
Stream
Batch
Iterate
Application
Job Stages
Unified application example
Unified applications
Stream/Batch = unified applications combining
> latency-sensitive (stream) jobs with
> latency-tolerant (batch) jobs
as part of the same application

Advantages
> Sharing application logic & state
> Result consistency
> Sharing of computation
Structured Streaming API
> Unified processing platform on top of Spark SQL
fast, scalable, fault tolerant
> Large ecosystem of data sources
integrate with many storage systems
> Rich, unified, high-level APIs
deal with complex data and complex workloads

// Batch: train the model on historical data
val trainData = context.read("malicious-train-data")
val pipeline = new Pipeline().setStages(Array(
  new OneHotEncoderEstimator(),
  new VectorAssembler(),
  new Classifier(/* select estimator */)))
val pipelineModel = pipeline.fit(trainData)

// Stream: apply the model to real-time data
val streamingData = context
  .readStream("kafkaTopic")
  .schema(/* input schema */)
  .groupBy("userId")
val streamRates = pipelineModel
  .transform(streamingData)
streamRates.start() /* start streaming */
Scheduling of Stream/Batch jobs
Requirements
> Latency: execute the inference job with minimum delay
> Throughput: batch jobs should not be compromised
> Efficiency: achieve high cluster resource utilization

Challenge: schedule stream/batch jobs to satisfy their diverse requirements
Stream/Batch application scheduling
[Diagram: the application code is submitted through the Spark context to the driver, whose DAG scheduler runs the jobs. Running example for the following slides: an inference (stream) job with two stages of two short tasks each (duration T), and a training (batch) job with a stage of four long tasks (3T) followed by a stage of three short tasks (T).]
Stream/Batch scheduling
> Static allocation: dedicate resources to each job
[Gantt chart: with executors statically partitioned between the jobs, cores sit idle in one partition (wasted resources) while tasks wait in the other.]
Resources cannot be shared across jobs
> FIFO: first job runs to completion
[Gantt chart: with shared executors, the long batch tasks (3T) occupy all cores first and the stream tasks queue behind them.]
Long batch jobs increase stream job latency
> FAIR: weighted sharing of resources across jobs
[Gantt chart: cores are split between the jobs, but stream tasks still queue behind already-running batch tasks.]
Better packing, but non-optimal latency
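For reference, the FAIR baseline compared here is Spark's built-in fair scheduler, enabled with spark.scheduler.mode=FAIR plus an allocation file, and selected per job via sc.setLocalProperty("spark.scheduler.pool", "streaming"). A minimal fairscheduler.xml (pool names and weights are illustrative, not from the talk):

```xml
<?xml version="1.0"?>
<allocations>
  <!-- Pool for latency-sensitive stream jobs: higher weight -->
  <pool name="streaming">
    <schedulingMode>FAIR</schedulingMode>
    <weight>4</weight>
    <minShare>2</minShare>
  </pool>
  <!-- Pool for latency-tolerant batch jobs -->
  <pool name="batch">
    <schedulingMode>FIFO</schedulingMode>
    <weight>1</weight>
    <minShare>0</minShare>
  </pool>
</allocations>
```

Even with weighted pools, a stream task cannot start until a core frees up, which is the queuing effect shown above.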
> KILL: avoid queuing by preempting batch tasks
[Gantt chart: running batch tasks (3T) are killed to make room for stream tasks and must later re-execute from scratch.]
Better latency, at the expense of extra work
> NEPTUNE: minimize queueing and wasted work
[Gantt chart: running batch tasks are suspended when stream tasks arrive and resumed afterwards (e.g. a 3T task runs for T, pauses while a stream task executes, then completes its remaining 2T), so no work is lost.]
Challenges
> How to minimize queuing for latency-sensitive jobs and wasted work?
Implement suspendable tasks
> How to natively support stream/batch applications?
Provide a unified execution framework
> How to satisfy different stream/batch application requirements and high-level objectives?
Introduce custom scheduling policies
NEPTUNE
Execution framework for Stream/Batch applications
> Supports suspendable tasks
> Unified execution framework on top of Structured Streaming
> Introduces pluggable scheduling policies
Spark tasks
> Tasks: apply a function to a partition of data
> Subroutines that run in an executor to completion
> Preemption problem:
  > Loss of progress (kill)
  > Unpredictable preemption times (checkpointing)
[Diagram: the task runs as a subroutine on the executor stack, holding its function, context, iterator, and state until it returns a value.]
Suspendable tasks
> Idea: use coroutines
> Separate stacks store task state
> Yield points hand control back to the executor
> Cooperative preemption:
  > Suspend and resume in milliseconds
  > Work-preserving
  > Transparent to the user
[Diagram: the task function, context, and iterator live on a separate coroutine stack; the executor calls into the coroutine, and the coroutine yields control back with a value.]
https://github.com/storm-enroute/coroutines
Suspendable tasks: subroutine vs. coroutine

Subroutine:
val collect: (TaskContext, Iterator[T]) => Array[T] =
  (context: TaskContext, itr: Iterator[T]) => {
    val result = new mutable.ArrayBuffer[T]
    while (itr.hasNext) {
      result.append(itr.next)
    }
    result.toArray
  }

Coroutine (yields an Int at each suspension point, returns Array[T]):
val collect = coroutine { (context: TaskContext, itr: Iterator[T]) => {
  val result = new mutable.ArrayBuffer[T]
  while (itr.hasNext) {
    result.append(itr.next)
    if (context.isPaused())
      yieldval(0)
  }
  result.toArray
} }
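The implementation above uses Scala coroutines; the same cooperative pattern can be sketched language-neutrally with Python generators, where a hypothetical is_paused callback stands in for TaskContext.isPaused() and each yield is a suspension point that preserves the task's partial state:

```python
def collect(itr, is_paused):
    """Suspendable collect: drains an iterator, yielding control back
    to the 'executor' whenever the scheduler has requested a pause."""
    result = []
    for item in itr:
        result.append(item)
        if is_paused():
            yield None          # suspension point: hand control back
    yield result                # final value once the iterator is drained

# Toy "executor" that pauses the task exactly once, then resumes it.
calls = {"n": 0}
def is_paused():
    calls["n"] += 1
    return calls["n"] == 2      # request a pause after the 2nd element

task = collect(iter([1, 2, 3, 4]), is_paused)
out = next(task)                # runs until the pause request
assert out is None              # suspended, partial result kept on the frame
out = next(task)                # resumes with state intact
print(out)                      # [1, 2, 3, 4]
```

No work is repeated on resume: the generator frame keeps the partial result, which is exactly the work-preserving property the KILL policy lacks.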
Execution framework
[Architecture diagram: application and job priorities flow through the incrementalizer and optimizer into the DAG scheduler; the task scheduler consults a pluggable scheduling policy to launch tasks on executors. On an executor, an arriving high-priority task can suspend a running low-priority task (Running to Paused) and take over its slot.]
Scheduling Policies
> Idea: policies trigger task suspension and resumption
  > Guarantee that stream tasks bypass batch tasks
  > Satisfy higher-level objectives, e.g. balancing cluster load
  > Avoid starvation by suspending a task only up to a bounded number of times
> Load-balancing: equalize the number of tasks per node & reduce preemption
> Cache-aware load-balancing: respect task locality preferences in addition to load-balancing
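A minimal sketch of the load-balancing policy's placement decision, in Python for illustration (the data layout, field names, and the "queued" fallback are assumptions, not Neptune's actual interfaces):

```python
def place_task(task, executors, max_suspensions=3):
    """Hypothetical Neptune-style load-balancing placement.

    Picks the least-loaded executor (equalizing tasks per node). If that
    executor is full and the new task is a stream task, suspend one of its
    batch tasks, but only up to max_suspensions times per victim, which
    bounds starvation of batch work."""
    target = min(executors, key=lambda e: len(e["running"]))
    if len(target["running"]) < target["cores"]:
        target["running"].append(task)           # free core: just launch
        return "launched"
    if task["priority"] == "stream":
        for victim in target["running"]:
            if (victim["priority"] == "batch"
                    and victim.get("suspensions", 0) < max_suspensions):
                victim["suspensions"] = victim.get("suspensions", 0) + 1
                target["running"].remove(victim)
                target["paused"].append(victim)  # work-preserving suspension
                target["running"].append(task)
                return "suspended-and-launched"
    return "queued"                              # batch tasks wait for a core
```

The cache-aware variant would additionally prefer executors named in the task's locality preferences before falling back to the least-loaded one.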
Implementation
> Built as an extension to Apache Spark 2.4.0 (code to be open-sourced)
> Ported all ResultTask and ShuffleMapTask functionality across programming interfaces to coroutines
> Extended Spark's DAG scheduler to allow job stages with different requirements (priorities)
Demo
> Run a simple unified application with
  > A high-priority latency-sensitive job
  > A low-priority latency-tolerant job
> Schedule them with default Spark and with Neptune
> Goal: show the benefit of Neptune and its ease of use
Azure deployment
> Cluster
– 75 nodes with 4 cores and 32 GB of memory each
> Workloads
– TPC-H decision-support benchmark
– Yahoo Streaming Benchmark: ad analytics on a stream of ad impressions
– LDA: ML training/inference application uncovering hidden topics from a group of documents
Benefit of NEPTUNE in stream latency
[Bar chart: streaming latency (s) under static allocation, FIFO, FAIR, KILL, Neptune with load balancing (LB), Neptune with cache-aware load balancing (CLB), and full isolation; annotated latency reductions range from 13% to 61%.]
NEPTUNE achieves latencies comparable to the ideal for the latency-sensitive jobs
Suspension mechanism effectiveness
[Chart: pause latency, resume latency, and task runtime for TPC-H queries Q1-Q22 at scale factor 10, in ms on a log scale.]
> TPC-H: task runtime distribution for each query ranges from 100s of milliseconds to 10s of seconds
> TPC-H: tasks continuously transition between Paused and Resumed states until completion
Tasks can effectively pause and resume with sub-millisecond latency
Impact of resource demands on performance
[Chart: streaming latency (s) and batch throughput (M events/s) as the share of cores used for streaming grows from 0% to 100%; batch throughput drops by only about 1.5%.]
Resources are shared efficiently, with low impact on batch throughput
Summary
Neptune supports complex unified applications with diverse job requirements!
> Suspendable tasks using coroutines
> Pluggable scheduling policies
> Continuous analytics
Thank you!
Questions?
Panagiotis Garefalakis
pgaref@imperial.ac.uk
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT