SPARK STREAMING:
PUSHING THE THROUGHPUT LIMITS,
THE REACTIVE WAY
François Garillot, Gerard Maas
Who Are We?
Gerard Maas
Data Processing Team Lead
François Garillot
work done at
Spark Streaming at
@maasg @huitseeker
Spark Streaming (Refresher)
[Diagram: a DStream[T] is a sequence of RDD[T], one per batch interval (t0, t1, t2, t3, ..., ti, ti+1). Transformations turn each RDD[T] into an RDD[U]; actions then run on each RDD[U].]
Spark Streaming (Refresher)
Spark API for Streams
Fault-tolerant
High Throughput
Scalable
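The refresher model (a stream as one batch per interval, transformed batch by batch, then consumed by an action) can be mimicked with plain Scala collections. This is a toy sketch, not the Spark API: `Batch`, `Stream`, `transform` and `foreachBatch` are hypothetical names standing in for RDD, DStream, transformations and actions.

```scala
// Toy model only: plain Scala collections stand in for Spark's RDD and DStream.
object DStreamSketch {
  // One micro-batch plays the role of an RDD[T]; a stream holds one batch per interval.
  type Batch[T] = Vector[T]
  type Stream[T] = Vector[Batch[T]]

  // A transformation turns every Batch[T] into a Batch[U], interval by interval.
  def transform[T, U](stream: Stream[T])(f: Batch[T] => Batch[U]): Stream[U] =
    stream.map(f)

  // An action consumes each batch, typically for its side effect (e.g. writing it out).
  def foreachBatch[U](stream: Stream[U])(action: Batch[U] => Unit): Unit =
    stream.foreach(action)

  // Three intervals of input; the last interval received nothing.
  val lines: Stream[String] = Vector(Vector("a b", "c"), Vector("d e f"), Vector.empty)

  // Number of words on each line, computed per batch.
  val wordCounts: Stream[Int] =
    transform(lines)(batch => batch.map(_.split(" ").length))
}
```

Calling `DStreamSketch.foreachBatch(DStreamSketch.wordCounts)(println)` prints one line per interval, including the empty one.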
Scheduling
[Diagram: consumers feed Spark Streaming, which hands one batch (#0, #1, #2, ...) per batch interval (t0, t1, t2) to Spark.]
Stable case: Process Time < Batch Interval
Otherwise batches wait in the queue: Scheduling Delay
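A small simulation may make the stability condition concrete. It is a sketch under simplifying assumptions (batches are processed one at a time, batch i becomes ready at i × batchInterval), and `schedulingDelays` is a hypothetical helper, not a Spark function.

```scala
object SchedulingDelaySketch {
  /** Scheduling delay (ms) of each batch.
    * Assumption: batch i is ready at i * batchIntervalMs and can only start
    * once the previous batch has finished processing. */
  def schedulingDelays(batchIntervalMs: Long, processingTimesMs: List[Long]): List[Long] = {
    var previousFinish = 0L
    processingTimesMs.zipWithIndex.map { case (processing, i) =>
      val readyAt = i * batchIntervalMs
      val startedAt = math.max(readyAt, previousFinish)
      previousFinish = startedAt + processing
      startedAt - readyAt // time the batch spent waiting in the queue
    }
  }
}
```

With a 1000 ms batch interval, batches that each take 800 ms never wait, while batches that each take 1500 ms wait 0, 500, 1000, 1500 ms and so on: the delay grows without bound.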
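From Streams to μbatches
[Diagram: a consumer feeds Spark Streaming, which cuts the input into blocks every blockInterval and groups them into batches (#0, #1) every batchInterval for Spark.]
#partitions = receivers × batchInterval / blockInterval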
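The partition formula can be checked with a few lines of arithmetic. `partitionsPerBatch` is a hypothetical helper and the values in the usage note are illustrative.

```scala
object PartitionCount {
  /** #partitions = receivers x batchInterval / blockInterval (both intervals in ms). */
  def partitionsPerBatch(receivers: Int, batchIntervalMs: Long, blockIntervalMs: Long): Long = {
    require(blockIntervalMs > 0, "blockInterval must be positive")
    receivers * batchIntervalMs / blockIntervalMs
  }
}
```

For example, one receiver with a 2000 ms batch interval and a 200 ms block interval yields 10 partitions per batch; three receivers yield 30.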
From Streams to μbatches
[Diagram: batch #0 becomes an RDD whose partitions are spread across the Spark executors.]
From Streams to μbatches
spark.streaming.blockInterval = batchInterval × receivers / (partitionFactor × sparkCores)
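The block interval rule can be turned into a small calculator. This sketch reads partitionFactor as the target number of partitions per core, which is an interpretation of the slide rather than something it states; `blockIntervalMs` is a hypothetical helper.

```scala
object BlockIntervalSketch {
  /** blockInterval = batchInterval x receivers / (partitionFactor x sparkCores), in ms.
    * Assumption: partitionFactor is the desired number of partitions per core. */
  def blockIntervalMs(batchIntervalMs: Long, receivers: Int,
                      partitionFactor: Int, sparkCores: Int): Long = {
    require(partitionFactor > 0 && sparkCores > 0, "partitionFactor and sparkCores must be positive")
    batchIntervalMs * receivers / (partitionFactor.toLong * sparkCores)
  }
}
```

With a 10 s batch interval, 2 receivers, a partitionFactor of 2 and 8 cores, the result is 1250 ms, which gives 2 × 10000 / 1250 = 16 partitions per batch, i.e. 2 per core.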
The Importance of Caching
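dstream.foreachRDD { rdd =>
  rdd.cache() // cache the RDD before iterating: each filter below would otherwise recompute it
  keys.foreach { key =>
    // keyOf: placeholder name for the function that extracts an element's key
    rdd.filter(elem => keyOf(elem) == key).saveAsFooBar(...)
  }
  rdd.unpersist() // release the cached data once every key has been written
}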
Intervals
(Read TD’s “Adaptive Stream Processing using Dynamic Batch Sizing” before drawing any conclusions!)
[Plots: curves labelled O(n²) and O(n)]
The Receiver model
spark.streaming.receiver.maxRate
Fault tolerance? Write-ahead log (WAL)
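As a configuration sketch, the two receiver-side knobs could be passed to spark-submit as follows. The property keys are standard Spark Streaming settings; the values and the `asSubmitArgs` helper are illustrative assumptions.

```scala
object ReceiverSettings {
  // Illustrative values; tune them for your own workload.
  val settings: Seq[(String, String)] = Seq(
    // Upper bound on records per second, per receiver.
    "spark.streaming.receiver.maxRate" -> "10000",
    // Persist received data to a write-ahead log (requires a checkpoint directory).
    "spark.streaming.receiver.writeAheadLog.enable" -> "true"
  )

  // Render key/value pairs as spark-submit "--conf key=value" arguments.
  def asSubmitArgs(kv: Seq[(String, String)]): Seq[String] =
    kv.flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```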
Direct Kafka Stream
compute(offsets)
Kafka: The Receiver-less Model
Simplified Parallelism
Efficiency
Exactly-once semantics
Fewer degrees of freedom
val directKafkaStream = KafkaUtils.createDirectStream[
  [key class], [value class], [key decoder class], [value decoder class]](
  streamingContext, [map of Kafka parameters], [set of topics to consume])
spark.streaming.kafka.maxRatePerPartition
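Since spark.streaming.kafka.maxRatePerPartition is expressed in records per second per Kafka partition, it caps the size of each batch. A minimal sketch of that arithmetic, with `maxRecordsPerBatch` as a hypothetical helper and illustrative numbers:

```scala
object DirectStreamBudget {
  /** Upper bound on the records in one batch of a direct Kafka stream.
    * maxRatePerPartition is in records per second, per Kafka partition. */
  def maxRecordsPerBatch(maxRatePerPartition: Long, kafkaPartitions: Int,
                         batchIntervalMs: Long): Long =
    maxRatePerPartition * kafkaPartitions * batchIntervalMs / 1000
}
```

For instance, 1000 records/s per partition over 8 partitions with a 2 s batch interval allows at most 16000 records per batch.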
Demo
Reactive Principles
Reactive Streams: composable back-pressure
Spark Streaming Made
Reactive
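The idea of back-pressure, feeding the observed processing speed back into the ingestion rate, can be sketched with a simple proportional controller. This is a deliberately simplified illustration, not Spark's own rate estimator; `nextRate`, `gain` and `minRate` are hypothetical names and defaults.

```scala
object BackPressureSketch {
  /** New ingestion bound (records/s), computed when a batch completes.
    * Simplified proportional feedback: move the bound towards the rate
    * the last batch was actually processed at. */
  def nextRate(currentRate: Double, recordsInBatch: Long, processingTimeMs: Long,
               gain: Double = 1.0, minRate: Double = 100.0): Double = {
    require(processingTimeMs > 0, "processing time must be positive")
    // Records per second the job actually sustained on the last batch.
    val processingRate = recordsInBatch.toDouble / processingTimeMs * 1000
    // Positive when data is ingested faster than it is processed.
    val error = currentRate - processingRate
    math.max(minRate, currentRate - gain * error)
  }
}
```

If the current bound is 10000 records/s and the last batch of 15000 records took 2000 ms (7500 records/s), the bound drops to 7500 with a gain of 1.0, or to 8750 with a gain of 0.5. The floor `minRate` keeps ingestion from stalling entirely.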
Demo
Putting it together
Pain point: Data Locality
- Where is your job getting executed?
spark.locality.wait & spark.streaming.blockInterval
- On Mesos, it’s worse (SPARK-4940)
Resources
Backpressure in Spark Streaming:
http://blog.garillot.net/post/121183250481/a-quick-update-on-spark-streaming-work-since-i
Virdata’s Spark Streaming tuning guide:
http://www.virdata.com/tuning-spark/
TD’s paper on dynamic batch sizing:
http://dl.acm.org/citation.cfm?id=2670995
Diving into Spark Streaming’s Execution Model:
https://databricks.com/blog/2015/07/30/diving-into-spark-streamings-execution-model.html
Spark Streaming / Storm Trident numerical comparison:
https://www.cs.utoronto.ca/~patricio/docs/Analysis_of_Real_Time_Stream_Processing_Systems_Considering_Latency.pdf
Kafka direct approach:
https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md
Thanks!
Gerard Maas
@maasg
François Garillot
@huitseeker
