WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Landon Robinson & Jack Chapa
Spark Streaming
Headaches and Breakthroughs in Building
Continuous Applications
Who We Are
Landon Robinson
Data Engineer
Jack Chapa
Data Engineer
Big Data Team @ SpotX
But first… why are we here?
Because Spark Streaming...
• is very powerful
• can supercharge your infrastructure
• … and can be very complex!
Lots of headaches and breakthroughs!
Takeaways
Leave with a few actionable items that we wish we had known
when we started with Spark Streaming.
Focus Areas
● Streaming Basics
● Testing
● Monitoring & Alerts
● Batch Intervals & Resources
● Helpful Configurations
● Backpressure
● Data Enrichment & Transformations
Our Company
The Trusted Platform For
Premium Publishers and
Broadcasters
We Process a Lot of Data
Data:
- 220 MM+ Total Files/Blocks
- 8 PB+ HDFS Space
- 20 TB+ new data daily
- 100MM+ records/minute
- 300+ Data Nodes
Apps:
- Thousands of daily Spark apps
- Hundreds of daily user queries
- Multiple 24/7 Streaming apps
Spark Streaming is Key for Us
Our uses include:
- Rapid ingestion of data into the warehouse for querying
- Machine learning on near-live data streams
- Ability to react to and impact live situations
- Accelerated processing / updating of metadata
- Real-time visualization of data streams and processing
Spark Streaming Basics
a brief overview
Spark Streaming Basics
Spark Streaming is an extension of Spark that
enables scalable, high-throughput, fault-tolerant
processing of live data streams.
• Stream == a continuous, live flow of data
– Topic == Kafka’s name for a stream
• DStream == a sequence of RDDs formed from reading a data stream
• Batch == a self-contained job within your Streaming app that processes one segment of the stream
Testing
Rapid development and
testing of Spark apps
Use Spark in Local Mode
You can start building Spark Streaming apps
in minutes, using Spark locally!
On your local machine
• No cluster needed!
• Great for rough testing
We Recommend:
IntelliJ Community Edition
• with SBT: For dependency
management
Use Spark in Local Mode
In your build.sbt:
• src/test/scala => “provided”
• src/main/scala => “compiled”
The Scala Build Tool is your friend!
Simply:
• Import Spark libraries
• Invoke a Context and/or Session
• Set master to local[*] or local[n]
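A minimal sketch of that setup (versions and names here are illustrative, not from the original deck):

// build.sbt -- "provided" keeps Spark out of your assembly JAR for cluster
// deploys, while sbt still puts it on the compile and test classpaths
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.4.3" % "provided"
)

// In your app or test: a local master needs no cluster
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[*]")               // all local cores; local[n] for exactly n
  .setAppName("local-dev")
val ssc = new StreamingContext(conf, Seconds(5))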
Example Unit Test using just a SparkContext
Invoke a local session:
• In your unit test classes
• Test logic on small datasets
Add to your deployment pipeline
for a nice pre-release gut check!
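For example, a simple ScalaTest suite along these lines (the test logic here is a hypothetical stand-in):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class DedupeLogicTest extends FunSuite {
  test("dedupe removes repeated ids") {
    val spark = SparkSession.builder()
      .master("local[*]")              // local mode: no cluster needed
      .appName("unit-test")
      .getOrCreate()

    val input  = spark.sparkContext.parallelize(Seq(1, 1, 2, 3, 3))
    val output = input.distinct().collect().sorted

    assert(output.sameElements(Array(1, 2, 3)))
    spark.stop()
  }
}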
Unit Testing
Spark Streaming apps can easily be unit tested
- Using .queueStream()
- Using a Spark testing library
Libraries
- spark-testing-base
- sscheck
- spark-tests
Use Cases
- DStream actions
- Business Logic
- Integration
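A rough sketch of the .queueStream() approach (our example, with assumed batch logic):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("queueStream-test")
val ssc  = new StreamingContext(conf, Seconds(1))

// queueStream turns a queue of RDDs into a DStream: one RDD per batch
val rddQueue = mutable.Queue[RDD[Int]]()
val stream   = ssc.queueStream(rddQueue)
stream.reduce(_ + _).print()           // the logic under test

rddQueue += ssc.sparkContext.makeRDD(1 to 10)
ssc.start()
ssc.awaitTerminationOrTimeout(3000)    // let a few batches run, then stop
ssc.stop()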
Example Library: spark-testing-base
- Easy to Use
- Helpful wrappers
- Integrates w/ scalatest
- Minimal code required
- Clock management
- Runs alongside other tests
GitHub: https://github.com/holdenk/spark-testing-base
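A test in the style of the spark-testing-base README, using its StreamingSuiteBase:

import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.FunSuite

class TokenizeTest extends FunSuite with StreamingSuiteBase {
  def tokenize(lines: DStream[String]): DStream[String] =
    lines.flatMap(_.split(" "))

  test("tokenize splits each line into words") {
    // one inner List per batch, for both input and expected output
    val input    = List(List("hi"), List("hi holden"), List("bye"))
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    testOperation[String, String](input, tokenize _, expected, ordered = false)
  }
}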
Monitoring
Tracking and visualizing
performance of your
app
Monitoring is Awesome
It can reveal:
• How your app is performing
• Problems + Bugs!
And provide opportunities to:
• See and address issues
• Observe behavior visually
But monitoring can be tough to implement!
Monitoring (a less-than-ideal approach)
You could do it all in the app...
Example: Looping over RDDs to:
• Count records
• Track Kafka offsets
• Measure processing time / delays
But it’s less than ideal...
• Calculating performance significantly impacts performance… not great.
• All of these metrics are already calculated by Spark!
Monitoring and Visualization (using Listeners)
Use Spark Listeners to access
metrics in the background!
Let Spark do the hard work:
• Batch duration, delays
• Record throughput
• Stream position recovery
Come to our talk: Spark Listeners:
A Crash Course in Fast, Easy
Monitoring!
• Room 3016 | Today @ 5:30 PM
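A bare-bones sketch of the idea (ssc is your StreamingContext; what you do with the metrics is up to you, we just print):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Spark has already computed these -- just read them
    println(s"records=${info.numRecords} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)}ms")
  }
}

ssc.addStreamingListener(new BatchStatsListener)   // register before ssc.start()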
Kafka Offset Recovery
Saving your place
elsewhere
Writing Offsets to MySQL
Inside the Spark listener class, after a batch completes, you can access an object generated by Spark containing the offsets you just processed. Take those offsets and back them up to a DB...
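A hedged sketch of what that listener might look like for the Kafka direct stream (table name and schema are our assumptions):

import java.sql.Connection
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class OffsetBackupListener(db: Connection) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // The Kafka direct stream reports the OffsetRanges it processed
    // in each stream's input-info metadata, under the "offsets" key
    batch.batchInfo.streamIdToInputInfo.values.foreach { info =>
      info.metadata.get("offsets").foreach { o =>
        o.asInstanceOf[List[OffsetRange]].foreach { range =>
          val stmt = db.prepareStatement(
            "REPLACE INTO kafka_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
          stmt.setString(1, range.topic)
          stmt.setInt(2, range.partition)
          stmt.setLong(3, range.untilOffset)
          stmt.executeUpdate()
          stmt.close()
        }
      }
    }
  }
}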
Reading Offsets from MySQL
Your offsets are now stored in a DB after each batch completes. Whenever your app restarts, it reads those offsets from the DB and starts processing where it last left off!
Example: Reading Offsets from MySQL
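A sketch of the restart side, feeding the saved offsets into the direct stream (the helper, table, and connection details are ours; topics and kafkaParams are assumed to be defined elsewhere):

import java.sql.DriverManager
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

def loadOffsets(jdbcUrl: String): Map[TopicPartition, Long] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val rs = conn.createStatement()
      .executeQuery("SELECT topic, kafka_partition, until_offset FROM kafka_offsets")
    val offsets = scala.collection.mutable.Map[TopicPartition, Long]()
    while (rs.next())
      offsets += new TopicPartition(rs.getString(1), rs.getInt(2)) -> rs.getLong(3)
    offsets.toMap
  } finally conn.close()
}

val fromOffsets = loadOffsets("jdbc:mysql://dbhost/offsets")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, fromOffsets))  // resume where we left off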
Timing Logging (around actions)
- Record timing info for fast troubleshooting
- Escalate alarms to the appropriate team
- Quickly resolve issues while the app continues running
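One way to wrap timing around actions (a hypothetical helper; dstream and writeToWarehouse are stand-ins for your stream and sink):

def timed[T](label: String)(action: => T): T = {
  val start  = System.currentTimeMillis()
  val result = action
  println(s"$label took ${System.currentTimeMillis() - start} ms")  // or log/alert
  result
}

dstream.foreachRDD { rdd =>
  val count = timed("count")(rdd.count())   // time each action separately
  timed("write")(writeToWarehouse(rdd))
}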
React
How do I react to this monitoring?
● Heartbeats
● Scheduled Monitor Jobs
○ Version Updates
○ Ensure Running
○ Act on failure/problem
● Monitoring Alarms
● Look at them!
Batch Intervals
Optimizing for speed
and resource efficiency
Setting Appropriate Batch Intervals
An appropriate batch interval is key to an app that is quick and efficient. You want batches that process faster than the interval, but not so much faster that resources sit idle and are wasted!
The effectiveness of an interval is affected by:
• Resource allocation (CPU + RAM)
• Quantity of work
• Quantity of data
Setting Appropriate Batch Intervals
Consider these questions:
How quickly do I need to process data?
• Can I slow it down to save resources?
What is my resource budget / allocation?
• Can I increase? Can I cut back?
• Bigger interval = more time to process
• … but also more data to process
• Smaller interval = the opposite
Setting Appropriate Batch Intervals
Tips for finding an optimal combination:
Start small!
a. Short batch interval (seconds)
b. Modest resources
Then increase whichever of the two (a or b) you have in more flexible supply.
Again: processing time < interval = good.
Comfortably less, not significantly less.
Additional Resource Notes
- Scale down when possible
- Free up resources or save on cloud utilization spend
- Avoid preemption
- Use resource pools with prioritization
- With preemption disabled if you can
- Set appropriate # of partitions for Kafka topics
- Higher volume == higher partition count
- Higher partition count == greater parallelization
Helpful Configuration Settings
Configuring your app to
be performant and
efficient
Helpful Configuration Settings
Spark
• spark.memory.useLegacyMode = true
– spark.storage.memoryFraction=0.03
• spark.submit.deployMode = cluster
• spark.serializer = org.apache.spark.serializer.KryoSerializer
• spark.rdd.compress = true
– spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
• spark.shuffle.service.enabled = true
• spark.streaming.blockInterval = 300 (milliseconds)
Kafka
• enable.auto.commit = ‘false’
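Set in code, those might look like this (you can equally pass them as --conf flags to spark-submit; deploy mode is normally given there as --deploy-mode cluster):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.streaming.blockInterval", "300")   // milliseconds

// Kafka consumer side: commit offsets yourself (see Offset Recovery above)
val kafkaParams = Map[String, Object](
  "enable.auto.commit" -> (false: java.lang.Boolean)
)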
Backpressure
Use Case: you have irregular spikes in message throughput from Kafka topics.
• Backpressure dynamically alters the rate at which data is received per batch from Kafka.
• It prevents the app from being overwhelmed at startup and during peak load.
Settings:
• spark.streaming.backpressure.enabled = true
• spark.streaming.kafka.maxRatePerPartition = 20000
– max rate (messages/second) at which each Kafka partition will be read
• PID Rate Estimator: can be used to tweak the rate based on batch performance
– spark.streaming.backpressure.pid.*
Source: https://www.linkedin.com/pulse/enable-back-pressure-make-your-spark-streaming-production-lan-jiang/
Transformations
Bringing streaming and
static data together
Transformations (Streaming + Static)
transform()
● Allows RDD-level access to data
● Use case: joining with another RDD
updateStateByKey() / mapWithState()
● Apply a function to each key; useful for keeping track of state
● Use case: maintaining state between batches (e.g. a rolling join of two streams)
reduceByKey()
● Reduce a keyed RDD with an appropriate function
● Use case: deduping, aggregations
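As a sketch, here is a running count per key across batches with mapWithState() (the stream name and types are assumed; stateful ops also require ssc.checkpoint(...) to be set):

import org.apache.spark.streaming.{State, StateSpec}

def updateCount(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(total)                 // carry the total into the next batch
  (key, total)
}

val counts = keyedStream              // DStream[(String, Long)], assumed
  .reduceByKey(_ + _)                 // aggregate within the batch first
  .mapWithState(StateSpec.function(updateCount _))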
Joining Streaming and Static Data
Using the transform() method on a DStream:
Applies an RDD-to-RDD function to every RDD of the DStream.
• Useful for arbitrary RDD operations on a DStream
• Great for enriching streaming data with supplemental static data
Source: https://hadoopsters.net/2017/11/26/how-to-join-static-data-with-streaming-data-dstream-in-spark/
val transactions = … // streaming dataset (DStream)
val transaction_details = … // static dataset (RDD)
val complete_transaction_data = transactions.transform(live_transaction =>
  live_transaction.join(transaction_details))
Effective Static Joining
How do we handle static and persistent data?
Driver:
● Broadcast if small enough
● Read on driver every batch, then join
Worker:
● Connect on the worker (lazy val connection object)
● Useful for persisting data
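A sketch of the lazy-connection pattern (URL and enrichment logic are placeholders):

import java.sql.{Connection, DriverManager}

object DbConnection {
  // one connection per executor JVM, created on first use on the worker
  // (never serialized from the driver)
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:mysql://dbhost/enrichment")
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = DbConnection.connection   // reused across batches on this executor
    records.foreach { r => /* enrich or persist r using conn */ }
  }
}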
Review
Streaming isn’t always easy… but here are some great takeaways!
• Testing: Use Spark locally w/ unit tests
• Monitoring: Use listeners & react
• Batch Intervals & Resources: Be thoughtful!
• Configuration: Lots of awesome ones!
• Transformations: Do more with your streaming data!
• Offset Recovery: Stop worrying and love offset management!
Contact Us
Landon Robinson
• lrobinson@spotx.tv
Jack Chapa
• jchapa@spotx.tv
hadoopsters.dev
https://gist.github.com/hadoopsters
Q & A
