Spark Streaming Intro @KTech

Intro to Spark Streaming
(pandemic edition)
Oleg Korolenko for RSF Talks @Ktech, March 2020

image credits: @Matt Turck - Big Data Landscape 2017

Agenda
1. Some streaming concepts (quickly)
2. Streaming models: microbatching vs one-record-at-a-time models
3. Windowing, watermarks, state management
4. Operations on state and joins
5. Sources and sinks

Not in this talk
» Spark as distributed compute engine
» I will not cover specific integrations (like with Kafka)
» I will not compare it to some specific streaming solutions

API hell
- DStreams (deprecated)
- Continuous mode (experimental from 2.3)
- Structured Streaming (the way to go, in this talk)

Streaming concepts: Data
Data in motion vs data at rest (in the past)
Potentially unbounded vs known size

Spark streaming - Concept
» serves small batches of data collected from the stream
» provides them at fixed time intervals (from 0.5 secs)
» performs computation
image credits: Spark official doc

Microbatching
An application of the Bulk Synchronous Parallelism (BSP) model.
Consists of:
1. A split distribution of asynchronous work (tasks)
2. A synchronous barrier, coming in at fixed intervals (stages)

Model: Microbatching
Transforms a batch-like query into a series of
incremental execution plans
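A minimal sketch of the idea in plain Scala (no Spark): a batch-like query (count per key) is run as a series of incremental steps, one per microbatch. The names `Record`, `toMicroBatches`, `step` and `run` are hypothetical, and the record timestamp stands in for arrival time.

```scala
object MicroBatchSketch {
  // Hypothetical record: a key plus a timestamp in milliseconds
  final case class Record(key: String, timestampMs: Long)

  // Cut the stream into microbatches of a fixed interval
  def toMicroBatches(records: Seq[Record], intervalMs: Long): Seq[Seq[Record]] =
    records
      .groupBy(r => r.timestampMs / intervalMs)
      .toSeq
      .sortBy(_._1)
      .map(_._2)

  // One incremental step: merge the counts of a batch into the running result
  def step(running: Map[String, Long], batch: Seq[Record]): Map[String, Long] =
    batch.foldLeft(running) { (acc, r) =>
      acc.updated(r.key, acc.getOrElse(r.key, 0L) + 1L)
    }

  // The "batch-like query" executed as a series of incremental steps
  def run(records: Seq[Record], intervalMs: Long): Map[String, Long] =
    toMicroBatches(records, intervalMs).foldLeft(Map.empty[String, Long])(step)
}
```

The result after the last step equals what a single batch count over all records would give; the barrier between steps is where the engine can adapt.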

One-record-at-a-time-processing
Dataflow programming
- computation is a graph of data flowing between operations
- computations are black boxes to one another (vs Catalyst in Spark)
In: Apache Flink, Google Dataflow

Model: One-record-at-a-time-processing
processing user functions by pipelining
- deploys functions as pipelines in a cluster
- flows data through pipelines
- pipeline steps are parallelized (differently, depending on operators)

Microbatching vs One-at-a-time
Despite higher latency, microbatching has its PROS:
1. sync boundaries give the ability to adapt (e.g. task recovery from failure if an executor is down, scaling executors, etc.)
2. data is available as a set at every microbatch (we can inspect, adapt, drop, get stats)
3. easier model that looks like data at rest

Spark streaming API
» API on top of Spark SQL DataFrame, Dataset APIs
// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option(...)
  .load()
socketDF.isStreaming // Returns true for DataFrames that have streaming sources

Spark streaming API, behind the lines
[DataFrame/Dataset] =>
[Logical plan] =>
[Optimized plan] =>
[Series of incremental execution plans]

Triggering
Run only once:
import org.apache.spark.sql.streaming.Trigger
val onceStream = data
  .writeStream
  .format("console")
  .queryName("Once")
  .trigger(Trigger.Once())
  .start() // nothing runs until the query is started

Triggering
Scheduled execution based on processing time:
val processingTimeStream = data
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("20 seconds"))
  .start()
If processing of the previous batch hasn't finished when the interval elapses, the next batch starts immediately after it completes.

Processing
We can use the usual Spark transformation and aggregation APIs,
but where are the streaming semantics there?

credits: https://twitter.com/bgeerdink/status/776003500656517120

Processing: Windowing API
val avgBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()

Tumbling window
eventsDF
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
image credits: @DataBricks Engineering blog
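What a tumbling window does to each event can be sketched in plain Scala (no Spark). This is an illustration of the assignment rule only, with hypothetical names (`Window`, `windowFor`, `countPerWindow`) and timestamps as epoch milliseconds; it is not how Spark implements `window`.

```scala
object TumblingWindowSketch {
  // Half-open window [startMs, endMs)
  final case class Window(startMs: Long, endMs: Long)

  // Each event falls into exactly one window of the given size
  def windowFor(eventTimeMs: Long, sizeMs: Long): Window = {
    val start = eventTimeMs - java.lang.Math.floorMod(eventTimeMs, sizeMs)
    Window(start, start + sizeMs)
  }

  // Equivalent of groupBy(window(...)).count() on an in-memory list
  def countPerWindow(eventTimesMs: Seq[Long], sizeMs: Long): Map[Window, Int] =
    eventTimesMs
      .groupBy(t => windowFor(t, sizeMs))
      .map { case (w, ts) => w -> ts.size }
}
```

With a 5 minute size, events at minutes 1 and 4 share the window starting at 0, while an event at minute 7 lands in the window starting at minute 5.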

Sliding window
eventsDF
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
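With a slide shorter than the window size, one event belongs to several overlapping windows. A plain Scala sketch of that assignment (hypothetical names `Window` and `windowsFor`, epoch milliseconds, not Spark's implementation):

```scala
object SlidingWindowSketch {
  // Half-open window [startMs, endMs)
  final case class Window(startMs: Long, endMs: Long)

  // All windows of length sizeMs, starting every slideMs, that contain the event
  def windowsFor(eventTimeMs: Long, sizeMs: Long, slideMs: Long): List[Window] = {
    // Latest window start that is not after the event
    val lastStart = eventTimeMs - java.lang.Math.floorMod(eventTimeMs, slideMs)
    Iterator
      .iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start + sizeMs > eventTimeMs)
      .map(start => Window(start, start + sizeMs))
      .toList
      .reverse // earliest window first
  }
}
```

For a 10 minute window sliding every 5 minutes, an event at minute 12 is counted in both [5, 15) and [10, 20).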

Late events

Watermarks
"all input data with event times less than X have been observed"
eventsDF
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
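The watermark rule can be sketched in plain Scala: the watermark trails the maximum event time seen so far by a fixed delay, and events older than it are treated as late. The names `Tracker` and `partitionLate` are hypothetical, and as a simplifying assumption the watermark is updated per event here, whereas Spark advances it between microbatches.

```scala
object WatermarkSketch {
  // Tracks the maximum event time observed so far
  final case class Tracker(delayMs: Long, maxEventTimeMs: Long = Long.MinValue) {
    // Watermark = max event time seen minus the allowed delay
    def watermarkMs: Long =
      if (maxEventTimeMs == Long.MinValue) Long.MinValue
      else maxEventTimeMs - delayMs

    def isLate(eventTimeMs: Long): Boolean = eventTimeMs < watermarkMs

    def observe(eventTimeMs: Long): Tracker =
      copy(maxEventTimeMs = math.max(maxEventTimeMs, eventTimeMs))
  }

  // Split events (in arrival order) into accepted and too-late ones
  def partitionLate(eventTimesMs: Seq[Long], delayMs: Long): (List[Long], List[Long]) = {
    val init = (Tracker(delayMs), List.empty[Long], List.empty[Long])
    val (_, accepted, late) = eventTimesMs.foldLeft(init) {
      case ((tracker, acc, dropped), t) =>
        if (tracker.isLate(t)) (tracker, acc, t :: dropped)
        else (tracker.observe(t), t :: acc, dropped)
    }
    (accepted.reverse, late.reverse)
  }
}
```

With a delay of 10, events arriving as 12, 25, 14, 16 give a watermark of 15 after the second event, so 14 is late and 16 is still accepted.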

Watermarks

Stateful processing
Work with data in the context of what we have already seen in the stream

State management

State management and checkpoints
Backed by an HDFS-compatible file system (e.g. HDFS or S3) to store state
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata

Operations - State
mapGroupsWithState // we produce a single result per group
flatMapGroupsWithState // we produce 0 or N results in output

Example: Domain
import java.sql.Timestamp
// Input events
val weatherEvents: Dataset[WeatherEvent]
// Weather station event
case class WeatherEvent(
  stationId: String,
  timestamp: Timestamp,
  temp: Double
)
// Weather avg temp output
case class WeatherEventAvg(
  stationId: String,
  start: Timestamp,
  end: Timestamp,
  avgTemp: Double
)

Compute using state
val weatherEventsMovingAvg = weatherEvents
  // group by station
  .groupByKey(_.stationId)
  // processing timeout
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)

Mapping function
def mappingFunction(
  key: String,
  values: Iterator[WeatherEvent],
  groupState: GroupState[List[WeatherEvent]]
): WeatherEventAvg = {
  // update the state with the new events
  val updatedState = ...
  // update the group state
  groupState.update(updatedState)
  // compute new event output using updated state
  WeatherEventAvg(key, ts1, ts2, tempAvg)
}
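One possible shape for the elided state update and average, sketched in plain Scala without Spark. Everything here is an assumption for illustration: the types `Reading` and `Average`, the epoch-millisecond timestamps, and the choice to keep only the last `maxSize` readings per station.

```scala
object MovingAverageSketch {
  // Hypothetical simplified versions of the event and output types
  final case class Reading(stationId: String, timestampMs: Long, temp: Double)
  final case class Average(stationId: String, startMs: Long, endMs: Long, avgTemp: Double)

  // New state: old state plus incoming events, trimmed to the most recent maxSize
  def updateState(state: List[Reading], incoming: Seq[Reading], maxSize: Int): List[Reading] =
    (state ++ incoming).sortBy(_.timestampMs).takeRight(maxSize)

  // Output computed from the updated state; None when there is nothing to average
  def average(stationId: String, state: List[Reading]): Option[Average] =
    if (state.isEmpty) None
    else Some(
      Average(
        stationId,
        state.head.timestampMs,
        state.last.timestampMs,
        state.map(_.temp).sum / state.size
      )
    )
}
```

Inside a real mapping function the result of `updateState` would be what gets passed to `groupState.update`.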

Write to a sink and start the stream
// define the sink for the stream
weatherEventsMovingAvg
  .writeStream
  .format("kafka") // determines that the kafka sink is used
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  .option("checkpointLocation", "/path/checkpoint")
  // stream will start processing events from sources and write to sink
  .start()

Operations - Joins
» stream join stream
» stream join batch

Sources
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
» TCP sockets

Working with sources
image credits: Stream Processing with Apache Spark @OReilly

Offsets in checkpoints
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata

Sinks
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
Experimentation:
- Memory, Console
Custom:
- foreach (implement ForeachWriter to integrate with anything)

Failure recovery
» Spark uses checkpoints
Write Ahead Log (WAL)
» for Spark Streaming, when we receive data from sources we buffer it
» we need to store additional metadata to register offsets etc.
» we save offsets and data to be able to replay it from sources

"Exactly once" delivery guarantee
Combination of
replayable sources
idempotent sinks
processing checkpoints
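Why the sink has to be idempotent can be shown with a small plain Scala sketch. It assumes a replayed batch keeps the same batch id as the original attempt; the `Sink` class and its methods are hypothetical and not a Spark API.

```scala
object IdempotentSinkSketch {
  // A sink that remembers which batch ids it has already committed
  final class Sink {
    private var committed = Map.empty[Long, Seq[String]]

    // Writing the same batch id twice has no further effect
    def write(batchId: Long, rows: Seq[String]): Unit =
      if (!committed.contains(batchId)) {
        committed = committed.updated(batchId, rows)
      }

    // Everything written so far, in batch order
    def contents: Seq[String] =
      committed.toSeq.sortBy(_._1).flatMap(_._2)
  }
}
```

After a failure the engine replays the last uncommitted batch from the source; because the replay carries the same id, the output ends up containing each row once.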

Reads and refs
1. Streaming 102: The World Beyond Batch (article) by Tyler Akidau, 2016
2. Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri, O'Reilly, April 2019
3. Stream Processing with Apache Spark by Francois Garillot and Gerard Maas, O'Reilly, 2019
4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale (whitepaper) by Matei Zaharia, Berkeley
5. Event-time Aggregation and Watermarking in Apache Spark's Structured Streaming by Tathagata Das, Databricks engineering blog

Thanks !
