WIFI SSID:SparkAISummit | Password: UnifiedAnalytics
Landon Robinson & Jack Chapa
Spark Streaming
Headaches and Breakthroughs in Building
Continuous Applications
Who We Are
Landon Robinson
Data Engineer
Jack Chapa
Data Engineer
Big Data Team @ SpotX
But first… why are we here?
Because Spark Streaming...
• is very powerful
• can supercharge your infrastructure
• … and can be very complex!
Lots of headaches and breakthroughs!
Takeaways
Leave with a few actionable items that we wish we had known
when we started with Spark Streaming.
Focus Areas
● Streaming Basics
● Testing
● Monitoring & Alerts
● Batch Intervals & Resources
● Helpful Configurations
● Backpressure
● Data Enrichment & Transformations
Our Company
The Trusted Platform For
Premium Publishers and
Broadcasters
We Process a Lot of Data
Data:
- 220 MM+ Total Files/Blocks
- 8 PB+ HDFS Space
- 20 TB+ new data daily
- 100MM+ records/minute
- 300+ Data Nodes
Apps:
- Thousands of daily Spark apps
- Hundreds of daily user queries
- Multiple 24/7 Streaming apps
Spark Streaming is Key for Us
Our uses include:
- Rapid ingestion of data into the warehouse for querying
- Machine learning on near-live data streams
- Ability to react to and impact live situations
- Accelerated processing / updating of metadata
- Real-time visualization of data streams and processing
Spark Streaming Basics
a brief overview
Spark Streaming Basics
Spark Streaming is an extension of Spark that
enables scalable, high-throughput, fault-tolerant
processing of live data streams.
• Stream == a continuous, live flow of data
– Topic == Kafka’s name for a stream
• DStream == a sequence of RDDs formed from reading a data stream
• Batch == a self-contained job within your Streaming app that processes one segment of the stream
Testing
Rapid development and
testing of Spark apps
Use Spark in Local Mode
You can start building Spark Streaming apps
in minutes, using Spark locally!
On your local machine
• No cluster needed!
• Great for rough testing
We Recommend:
IntelliJ Community Edition
• with SBT: For dependency
management
Use Spark in Local Mode
In your build.sbt:
• src/test/scala => “provided”
• src/main/scala => “compiled”
The Scala Build Tool is your friend!
Simply:
• Import Spark libraries
• Invoke a Context and/or Session
• Set master to local[*] or local[n]
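A minimal sketch of that setup (versions and names here are illustrative, not from the original deck):

// build.sbt -- "provided" keeps Spark out of your assembly JAR for cluster
// deploys, while sbt still puts it on the compile and test classpaths
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "2.4.3" % "provided",
  "org.apache.spark" %% "spark-streaming" % "2.4.3" % "provided"
)

// In your app or test: a local master needs no cluster
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setMaster("local[*]")               // all local cores; local[n] for exactly n
  .setAppName("local-dev")
val ssc = new StreamingContext(conf, Seconds(5))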
Example Unit Test using just a SparkContext
Invoke a local session:
• In your unit test classes
• Test logic on small datasets
Add to your deployment pipeline
for a nice pre-release gut check!
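For example, a simple ScalaTest suite along these lines (the test logic here is a hypothetical stand-in):

import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class DedupeLogicTest extends FunSuite {
  test("dedupe removes repeated ids") {
    val spark = SparkSession.builder()
      .master("local[*]")              // local mode: no cluster needed
      .appName("unit-test")
      .getOrCreate()

    val input  = spark.sparkContext.parallelize(Seq(1, 1, 2, 3, 3))
    val output = input.distinct().collect().sorted

    assert(output.sameElements(Array(1, 2, 3)))
    spark.stop()
  }
}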
Unit Testing
Spark Streaming apps can easily be unit tested
- Using .queueStream()
- Using a Spark testing library
Libraries
- spark-testing-base
- sscheck
- spark-tests
Use Cases
- DStream actions
- Business Logic
- Integration
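A rough sketch of the .queueStream() approach (our example, with assumed batch logic):

import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setMaster("local[2]").setAppName("queueStream-test")
val ssc  = new StreamingContext(conf, Seconds(1))

// queueStream turns a queue of RDDs into a DStream: one RDD per batch
val rddQueue = mutable.Queue[RDD[Int]]()
val stream   = ssc.queueStream(rddQueue)
stream.reduce(_ + _).print()           // the logic under test

rddQueue += ssc.sparkContext.makeRDD(1 to 10)
ssc.start()
ssc.awaitTerminationOrTimeout(3000)    // let a few batches run, then stop
ssc.stop()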
Example Library: spark-testing-base
- Easy to Use
- Helpful wrappers
- Integrates w/ scalatest
- Minimal code required
- Clock management
- Runs alongside other tests
GitHub: https://github.com/holdenk/spark-testing-base
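A test in the style of the spark-testing-base README, using its StreamingSuiteBase:

import com.holdenkarau.spark.testing.StreamingSuiteBase
import org.apache.spark.streaming.dstream.DStream
import org.scalatest.FunSuite

class TokenizeTest extends FunSuite with StreamingSuiteBase {
  def tokenize(lines: DStream[String]): DStream[String] =
    lines.flatMap(_.split(" "))

  test("tokenize splits each line into words") {
    // one inner List per batch, for both input and expected output
    val input    = List(List("hi"), List("hi holden"), List("bye"))
    val expected = List(List("hi"), List("hi", "holden"), List("bye"))
    testOperation[String, String](input, tokenize _, expected, ordered = false)
  }
}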
Monitoring
Tracking and visualizing
performance of your
app
Monitoring is Awesome
It can reveal:
• How your app is performing
• Problems + Bugs!
And provide opportunities to:
• See and address issues
• Observe behavior visually
But monitoring can be tough to implement!
Monitoring (a less-than-ideal approach)
You could do it all in the app...
Example: Looping over RDDs to:
• Count records
• Track Kafka offsets
• Measure processing time / delays
But it’s less than ideal...
• Calculating performance significantly impacts performance… not great.
• All of these metrics are already calculated by Spark!
Monitoring and Visualization (using Listeners)
Use Spark Listeners to access
metrics in the background!
Let Spark do the hard work:
• Batch duration, delays
• Record throughput
• Stream position recovery
Come to our talk: Spark Listeners:
A Crash Course in Fast, Easy
Monitoring!
• Room 3016 | Today @ 5:30 PM
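A bare-bones sketch of the idea (ssc is your StreamingContext; what you do with the metrics is up to you, we just print):

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class BatchStatsListener extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    // Spark has already computed these -- just read them
    println(s"records=${info.numRecords} " +
      s"schedulingDelay=${info.schedulingDelay.getOrElse(-1L)}ms " +
      s"processingTime=${info.processingDelay.getOrElse(-1L)}ms")
  }
}

ssc.addStreamingListener(new BatchStatsListener)   // register before ssc.start()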
Kafka Offset Recovery
Saving your place
elsewhere
Writing Offsets to MySQL
Inside the Spark listener class, after a batch completes, you can access an object generated by Spark containing the offsets you just processed. Take those offsets and back them up to a DB...
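A hedged sketch of what that listener might look like for the Kafka direct stream (table name and schema are our assumptions):

import java.sql.Connection
import org.apache.spark.streaming.kafka010.OffsetRange
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class OffsetBackupListener(db: Connection) extends StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    // The Kafka direct stream reports the OffsetRanges it processed
    // in each stream's input-info metadata, under the "offsets" key
    batch.batchInfo.streamIdToInputInfo.values.foreach { info =>
      info.metadata.get("offsets").foreach { o =>
        o.asInstanceOf[List[OffsetRange]].foreach { range =>
          val stmt = db.prepareStatement(
            "REPLACE INTO kafka_offsets (topic, kafka_partition, until_offset) VALUES (?, ?, ?)")
          stmt.setString(1, range.topic)
          stmt.setInt(2, range.partition)
          stmt.setLong(3, range.untilOffset)
          stmt.executeUpdate()
          stmt.close()
        }
      }
    }
  }
}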
Reading Offsets from MySQL
Your offsets are now stored in a DB after each batch completes. Whenever your app restarts, it reads those offsets from the DB and starts processing where it last left off!
Example: Reading Offsets from MySQL
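A sketch of the restart side, feeding the saved offsets into the direct stream (the helper, table, and connection details are ours; topics and kafkaParams are assumed to be defined elsewhere):

import java.sql.DriverManager
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

def loadOffsets(jdbcUrl: String): Map[TopicPartition, Long] = {
  val conn = DriverManager.getConnection(jdbcUrl)
  try {
    val rs = conn.createStatement()
      .executeQuery("SELECT topic, kafka_partition, until_offset FROM kafka_offsets")
    val offsets = scala.collection.mutable.Map[TopicPartition, Long]()
    while (rs.next())
      offsets += new TopicPartition(rs.getString(1), rs.getInt(2)) -> rs.getLong(3)
    offsets.toMap
  } finally conn.close()
}

val fromOffsets = loadOffsets("jdbc:mysql://dbhost/offsets")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, fromOffsets))  // resume where we left off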
Timing Logging (around actions)
- Record timing info for fast troubleshooting
- Escalate alarms to the appropriate team
- Quickly resolve issues while the app continues running
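One way to wrap timing around actions (a hypothetical helper; dstream and writeToWarehouse are stand-ins for your stream and sink):

def timed[T](label: String)(action: => T): T = {
  val start  = System.currentTimeMillis()
  val result = action
  println(s"$label took ${System.currentTimeMillis() - start} ms")  // or log/alert
  result
}

dstream.foreachRDD { rdd =>
  val count = timed("count")(rdd.count())   // time each action separately
  timed("write")(writeToWarehouse(rdd))
}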
React
How do I react to this monitoring?
● Heartbeats
● Scheduled Monitor Jobs
○ Version Updates
○ Ensure Running
○ Act on failure/problem
● Monitoring Alarms
● Look at them!
Batch Intervals
Optimizing for speed
and resource efficiency
Setting Appropriate Batch Intervals
An appropriate batch interval is key to an app that is quick and efficient. You want batches that process faster than the interval, but not so much faster that resources sit idle and are wasted!
The effectiveness of an interval is affected by:
• Resource allocation (CPU + RAM)
• Quantity of work
• Quantity of data
Setting Appropriate Batch Intervals
Consider these questions:
How quickly do I need to process data?
• Can I slow it down to save resources?
What is my resource budget / allocation?
• Can I increase? Can I cut back?
• Bigger interval = more time to process
• … but also more data to process
• Smaller interval = the opposite
Setting Appropriate Batch Intervals
Tips for finding an optimal combination:
Start small!
a. Short batch interval (seconds)
b. Modest resources
Then increase whichever of the two (a or b) you have in more flexible supply.
Again: processing time < interval = good.
Comfortably less, not significantly less.
Additional Resource Notes
- Scale down when possible
- Free up resources or save on cloud utilization spend
- Avoid preemption
- Use resource pools with prioritization
- With preemption disabled if you can
- Set appropriate # of partitions for Kafka topics
- Higher volume == higher partition count
- Higher partition count == greater parallelization
Helpful Configuration Settings
Configuring your app to
be performant and
efficient
Helpful Configuration Settings
Spark
• spark.memory.useLegacyMode = true
– spark.storage.memoryFraction=0.03
• spark.submit.deployMode = cluster
• spark.serializer = org.apache.spark.serializer.KryoSerializer
• spark.rdd.compress = true
– spark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec
• spark.shuffle.service.enabled = true
• spark.streaming.blockInterval = 300 (milliseconds)
Kafka
• enable.auto.commit = ‘false’
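Set in code, those might look like this (you can equally pass them as --conf flags to spark-submit; deploy mode is normally given there as --deploy-mode cluster):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.rdd.compress", "true")
  .set("spark.io.compression.codec", "org.apache.spark.io.SnappyCompressionCodec")
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.streaming.blockInterval", "300")   // milliseconds

// Kafka consumer side: commit offsets yourself (see Offset Recovery above)
val kafkaParams = Map[String, Object](
  "enable.auto.commit" -> (false: java.lang.Boolean)
)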
Backpressure
Use Case: you have irregular spikes in message throughput from Kafka topics.
• Backpressure dynamically alters the rate at which data is received per batch from Kafka.
• It prevents the app from being overwhelmed at startup and during peak load.
Settings:
• spark.streaming.backpressure.enabled = true
• spark.streaming.kafka.maxRatePerPartition = 20000
– max rate (messages/second) at which each Kafka partition will be read
• PID Rate Estimator: can be used to tweak the rate based on batch performance
– spark.streaming.backpressure.pid.*
Source: https://www.linkedin.com/pulse/enable-back-pressure-make-your-spark-streaming-production-lan-jiang/
Transformations
Bringing streaming and
static data together
Transformations (Streaming + Static)
transform()
● Allows RDD-level access to data
● Use case: joining with another RDD
updateStateByKey() / mapWithState()
● Apply a function to each key; useful for keeping track of state
● Use case: maintaining state between batches (e.g. a rolling join of two streams)
reduceByKey()
● Reduce a keyed RDD with an appropriate function
● Use case: deduping, aggregations
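As a sketch, here is a running count per key across batches with mapWithState() (the stream name and types are assumed; stateful ops also require ssc.checkpoint(...) to be set):

import org.apache.spark.streaming.{State, StateSpec}

def updateCount(key: String, value: Option[Long], state: State[Long]): (String, Long) = {
  val total = state.getOption.getOrElse(0L) + value.getOrElse(0L)
  state.update(total)                 // carry the total into the next batch
  (key, total)
}

val counts = keyedStream              // DStream[(String, Long)], assumed
  .reduceByKey(_ + _)                 // aggregate within the batch first
  .mapWithState(StateSpec.function(updateCount _))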
Joining Streaming and Static Data
Using the transform() method on a DStream:
Applies an RDD-to-RDD function to every RDD of the DStream.
• Useful for arbitrary RDD operations on a DStream
• Great for enriching streaming data with supplemental static data
Source: https://hadoopsters.net/2017/11/26/how-to-join-static-data-with-streaming-data-dstream-in-spark/
val transactions = … // streaming dataset (DStream)
val transaction_details = … // static dataset (RDD)
val complete_transaction_data = transactions.transform(live_transaction =>
  live_transaction.join(transaction_details))
Effective Static Joining
How do we handle static and persistent data?
Driver:
● Broadcast if small enough
● Read on driver every batch, then join
Worker:
● Connect on the worker (lazy val connection object)
● Useful for persisting data
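A sketch of the lazy-connection pattern (URL and enrichment logic are placeholders):

import java.sql.{Connection, DriverManager}

object DbConnection {
  // one connection per executor JVM, created on first use on the worker
  // (never serialized from the driver)
  lazy val connection: Connection =
    DriverManager.getConnection("jdbc:mysql://dbhost/enrichment")
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = DbConnection.connection   // reused across batches on this executor
    records.foreach { r => /* enrich or persist r using conn */ }
  }
}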
Review
Streaming isn’t always easy… but here are some great takeaways!
• Testing: Use Spark locally w/ unit tests
• Monitoring: Use listeners & react
• Batch Intervals & Resources: Be thoughtful!
• Configuration: Lots of awesome ones!
• Transformations: Do more with your streaming data!
• Offset Recovery: Stop worrying and love offset management!
Contact Us
Landon Robinson
• lrobinson@spotx.tv
Jack Chapa
• jchapa@spotx.tv
hadoopsters.dev
https://gist.github.com/hadoopsters
Q & A
