Intro to Spark Streaming
(pandemic edition)
Oleg Korolenko for RSF Talks @Ktech, March 2020
image credits: @Matt Turck - Big Data Landscape 2017
Agenda
1. Some streaming concepts (quickly)
2. Streaming models: microbatching vs one-record-at-a-time
3. Windowing, watermarks, state management
4. Operations on state and joins
5. Sources and sinks
Not in this talk
» Spark as a distributed compute engine
» Specific integrations (such as Kafka)
» Comparisons with other specific streaming solutions
API hell
- DStreams (deprecated)
- Continuous mode (experimental from 2.3)
- Structured Streaming (the way to go, in this talk)
Streaming concepts: Data
Data in motion vs data at rest (in the past)
Potentially unbounded vs known size
Spark streaming - Concept
» collects data from the stream into small batches
» serves them at fixed time intervals (from 0.5 seconds)
» performs the computation on each batch
image credits: Spark official doc
Microbatching
An application of the Bulk Synchronous Parallelism (BSP) model.
Consists of:
1. A split distribution of asynchronous work (tasks)
2. A synchronous barrier, coming in at fixed intervals (stages)
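A minimal sketch of the BSP idea in plain Scala, not Spark's scheduler: each partition is summed by an asynchronous task, and a barrier waits for all of them before the next microbatch runs. The partition layout and the `superstep` name are assumptions made for illustration.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object BspSketch {
  // One superstep: one asynchronous task per partition (hypothetical layout).
  def superstep(partitions: Seq[Seq[Int]]): Seq[Int] = {
    val tasks: Seq[Future[Int]] = partitions.map(p => Future(p.sum))
    // Synchronous barrier: nothing proceeds until every task has finished.
    Await.result(Future.sequence(tasks), 10.seconds)
  }

  def main(args: Array[String]): Unit = {
    val microbatches = Seq(
      Seq(Seq(1, 2), Seq(3)),
      Seq(Seq(4), Seq(5, 6))
    )
    // Each microbatch is fully processed before the next one starts.
    microbatches.foreach(batch => println(superstep(batch).sum))
  }
}
```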
Model: Microbatching
Transforms a batch-like query into a series of
incremental execution plans
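To illustrate "incremental", here is a sketch in plain Scala under simplifying assumptions: the batch-like query is a count per key, and each microbatch only folds its new records into the previous result. `IncrementalCount` and `step` are hypothetical names; Spark derives such plans itself from the logical query.

```scala
object IncrementalCount {
  // Result carried between microbatches: count per key so far.
  type Counts = Map[String, Long]

  // One incremental step: fold only the new batch into the previous result.
  def step(previous: Counts, batch: Seq[String]): Counts =
    batch.foldLeft(previous) { (acc, key) =>
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b"), Seq("a"), Seq("b", "b"))
    // scanLeft yields the result table after each microbatch.
    batches.scanLeft(Map.empty[String, Long])(step).tail.foreach(println)
  }
}
```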
One-record-at-a-time processing
Dataflow programming
- computation is a graph of data flowing between operations
- operations are black boxes to each other (vs Catalyst in Spark, which sees the whole query)
Used in: Apache Flink, Google Dataflow
Model: One-record-at-a-time processing
Processes user functions by pipelining:
- deploys functions as pipelines in a cluster
- flows data through the pipelines
- pipeline steps are parallelized (differently, depending on the operators)
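The pipelining idea can be sketched with lazy Scala iterators. This is only an analogy on a single thread, with hypothetical user functions `parse` and `isEven`; a real engine also distributes and parallelizes the steps.

```scala
object PipelineSketch {
  // Two hypothetical user functions chained into a pipeline.
  def parse(line: String): Int = line.trim.toInt
  def isEven(n: Int): Boolean = n % 2 == 0

  // Iterators are lazy: a record goes through the steps
  // before the next record is pulled from the source.
  def pipeline(source: Iterator[String]): Iterator[Int] =
    source.map(parse).filter(isEven).map(_ * 10)

  def main(args: Array[String]): Unit =
    pipeline(Iterator("1", "2", "3", "4")).foreach(println)
}
```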
Microbatching vs one-at-a-time
Pros of microbatching, despite its higher latency:
1. Sync boundaries give the ability to adapt (e.g. recover tasks when an executor is down, scale executors)
2. Data is available as a set at every microbatch (we can inspect, adapt, drop, get stats)
3. An easier model that looks like data at rest
Spark streaming API
» An API on top of the Spark SQL DataFrame and Dataset APIs
// Read text from a socket
val socketDF = spark
  .readStream
  .format("socket")
  .option(...) // the socket source expects "host" and "port" options
  .load()
socketDF.isStreaming // returns true for DataFrames that have streaming sources
Spark streaming API, behind the scenes
[DataFrame/Dataset] =>
[Logical plan] =>
[Optimized plan] =>
[Series of incremental execution plans]
Triggering
Run only once:
import org.apache.spark.sql.streaming.Trigger
val onceStream = data
  .writeStream
  .format("console")
  .queryName("Once")
  .trigger(Trigger.Once()) // process the available data in one batch, then stop
Triggering
Scheduled execution based on processing time:
val processingTimeStream = data
  .writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("20 seconds"))
If processing of a batch has not finished when the next interval is due, the next batch starts immediately after it completes.
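The scheduling rule can be sketched as a small model in plain Scala. It is a simplification and not Spark's code: times are in milliseconds, the first batch starts at time 0, and `startTimes` is a hypothetical helper.

```scala
object TriggerSketch {
  // Start time (ms) of each batch, given the trigger interval and how long
  // each batch took. A batch starts at the next interval boundary, or right
  // after the previous batch if that one overran the boundary.
  def startTimes(intervalMs: Long, durationsMs: Seq[Long]): Seq[Long] =
    durationsMs.scanLeft(0L) { (start, duration) =>
      val finished = start + duration
      val nextBoundary = (start / intervalMs + 1) * intervalMs
      math.max(finished, nextBoundary)
    }.init // drop the start time of the batch after the last one

  def main(args: Array[String]): Unit =
    // 20 s interval; the second batch takes 30 s and overruns.
    println(startTimes(20000L, Seq(5000L, 30000L, 5000L)))
}
```

Here the third batch starts at 50 s, right after the slow second batch, instead of waiting for the 60 s boundary.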
Processing
We can use the usual Spark transformation and aggregation APIs,
but where are the streaming semantics in there?
credits: https://twitter.com/bgeerdink/status/776003500656517120
Processing: Windowing API
// counts events per sensor type in 1-minute windows
val countBySensorTypeOverTime = sensorStream
  .select($"timestamp", $"sensorType")
  .groupBy(window($"timestamp", "1 minute", "1 minute"), $"sensorType")
  .count()
Tumbling window
eventsDF
  .groupBy(window($"eventTime", "5 minutes"))
  .count()
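What the tumbling window does to each event can be sketched in plain Scala, assuming non-negative event times in epoch milliseconds. `windowStart` and `countPerWindow` are hypothetical helpers, not Spark functions.

```scala
object TumblingWindow {
  // Start of the tumbling window [start, start + size) that contains the event.
  def windowStart(eventTimeMs: Long, sizeMs: Long): Long =
    eventTimeMs - (eventTimeMs % sizeMs)

  // Equivalent of groupBy(window(...)).count() on an in-memory collection.
  def countPerWindow(eventTimesMs: Seq[Long], sizeMs: Long): Map[Long, Int] =
    eventTimesMs
      .groupBy(t => windowStart(t, sizeMs))
      .map { case (start, events) => start -> events.size }

  def main(args: Array[String]): Unit = {
    val fiveMinutes = 5 * 60 * 1000L
    println(countPerWindow(Seq(0L, 60000L, 299999L, 300000L, 610000L), fiveMinutes))
  }
}
```

Every event lands in exactly one window, which is what distinguishes tumbling from sliding windows.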
image credits: @DataBricks Engineering blog
Sliding window
eventsDF
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
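With a slide shorter than the window size, one event belongs to several windows. A plain-Scala sketch of that assignment, assuming non-negative epoch-millisecond timestamps and a hypothetical `windowStarts` helper:

```scala
object SlidingWindow {
  // Starts of all sliding windows [start, start + size) containing the event,
  // where windows begin at multiples of the slide.
  def windowStarts(eventTimeMs: Long, sizeMs: Long, slideMs: Long): Seq[Long] = {
    val lastStart = eventTimeMs - (eventTimeMs % slideMs)
    Iterator.iterate(lastStart)(_ - slideMs)
      .takeWhile(start => start > eventTimeMs - sizeMs)
      .toList
      .reverse
  }

  def main(args: Array[String]): Unit = {
    val minute = 60 * 1000L
    // An event at minute 12 falls into the windows starting at minutes 5 and 10.
    println(windowStarts(12 * minute, 10 * minute, 5 * minute).map(_ / minute))
  }
}
```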
image credits: @DataBricks Engineering blog
Late events
image credits: @DataBricks Engineering blog
Watermarks
"all input data with event times less than X have been observed"
eventsDF
  .withWatermark("eventTime", "10 minutes") // set before the aggregation
  .groupBy(window($"eventTime", "10 minutes", "5 minutes"))
  .count()
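A simplified model of the watermark rule in plain Scala: the watermark trails the largest event time seen so far by the allowed delay, and events older than it are dropped. This sketch updates the watermark per event, whereas Spark updates it between microbatches; `dropLate` is a hypothetical helper.

```scala
object WatermarkSketch {
  // Returns the events that are kept. An event is late when it is older than
  // (max event time seen before it) - delay.
  def dropLate(eventTimesMs: Seq[Long], delayMs: Long): Seq[Long] = {
    val start = (Option.empty[Long], Vector.empty[Long])
    val (_, kept) = eventTimesMs.foldLeft(start) { case ((maxSeen, acc), t) =>
      val watermark = maxSeen.map(_ - delayMs)
      val isLate = watermark.exists(t < _)
      val newMax = Some(maxSeen.fold(t)(math.max(_, t)))
      (newMax, if (isLate) acc else acc :+ t)
    }
    kept
  }

  def main(args: Array[String]): Unit = {
    val minute = 60 * 1000L
    val events = Seq(12L, 14L, 3L, 25L, 13L, 16L).map(_ * minute)
    println(dropLate(events, 10 * minute).map(_ / minute))
  }
}
```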
Watermarks
image credits: @DataBricks Engineering blog
Stateful processing
Work with data in the context of what we have already seen in the stream
State management
image credits: @DataBricks Engineering blog
State management and checkpoints
Backed by an HDFS-compatible file system (for example HDFS or S3) that stores the state
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Operations - State
mapGroupsWithState // produces a single result per group
flatMapGroupsWithState // produces zero or more results per group
Example: Domain
import java.sql.Timestamp
// Input events
val weatherEvents: Dataset[WeatherEvent]
// Weather station event
case class WeatherEvent(
  stationId: String,
  timestamp: Timestamp,
  temp: Double
)
// Average temperature output
case class WeatherEventAvg(
  stationId: String,
  start: Timestamp,
  end: Timestamp,
  avgTemp: Double
)
Compute using state
val weatherEventsMovingAvg = weatherEvents
  // group by station
  .groupByKey(_.stationId)
  // keep state per station, with a processing-time timeout
  .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(mappingFunction)
Mapping function
def mappingFunction(
  key: String,
  values: Iterator[WeatherEvent],
  groupState: GroupState[List[WeatherEvent]]
): WeatherEventAvg = {
  // merge the new events into the previous state (elided on the slide)
  val updatedState = ...
  // store the updated state for this key
  groupState.update(updatedState)
  // compute the output from the updated state (ts1, ts2 and tempAvg are derived from it)
  WeatherEventAvg(key, ts1, ts2, tempAvg)
}
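The elided state logic could look like the following plain-Scala sketch. It is one possible choice, not the talk's implementation: it assumes the state keeps only the `keep` most recent events per station and that at least one event is available. `updateState` and `average` are hypothetical helpers, and the case classes are repeated to keep the sketch self-contained.

```scala
import java.sql.Timestamp

object MovingAvgSketch {
  case class WeatherEvent(stationId: String, timestamp: Timestamp, temp: Double)
  case class WeatherEventAvg(stationId: String, start: Timestamp, end: Timestamp, avgTemp: Double)

  // Hypothetical state rule: merge new events, keep the `keep` most recent ones.
  def updateState(
      state: List[WeatherEvent],
      values: Iterator[WeatherEvent],
      keep: Int
  ): List[WeatherEvent] =
    (state ++ values).sortBy(_.timestamp.getTime).takeRight(keep)

  // Output computed from the updated state; assumes the state is not empty.
  def average(key: String, state: List[WeatherEvent]): WeatherEventAvg =
    WeatherEventAvg(
      key,
      state.head.timestamp,
      state.last.timestamp,
      state.map(_.temp).sum / state.size
    )
}
```

Inside a real mapping function, the result of `updateState` would be passed to `groupState.update` before computing the output.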
Write to a sink and start the stream
// define the sink for the stream
weatherEventsMovingAvg
  .writeStream
  .format("kafka") // selects the Kafka sink
  .option("kafka.bootstrap.servers", kafkaBootstrapServer)
  // the Kafka sink also needs a target topic and a "value" column (not shown on the slide)
  .option("checkpointLocation", "/path/checkpoint")
  // the stream starts processing events from the sources and writes to the sink
  .start()
Operations - Joins
» stream join stream
» stream join batch
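A stream-batch join can be pictured as joining every microbatch against the same static table. A plain-Scala sketch with made-up data (the `stations` lookup and the `Reading` type are assumptions for illustration), showing inner-join behaviour:

```scala
object StreamStaticJoin {
  case class Reading(stationId: String, temp: Double)

  // Static (batch) side: a hypothetical station-to-city lookup.
  val stations: Map[String, String] = Map("s1" -> "Paris", "s2" -> "Lyon")

  // Inner join of one microbatch with the static side;
  // readings from unknown stations are dropped.
  def joinBatch(batch: Seq[Reading]): Seq[(String, Double)] =
    batch.flatMap(r => stations.get(r.stationId).map(city => (city, r.temp)))

  def main(args: Array[String]): Unit = {
    val microbatches = Seq(
      Seq(Reading("s1", 11.5), Reading("s9", 3.0)),
      Seq(Reading("s2", 14.0))
    )
    microbatches.foreach(batch => println(joinBatch(batch)))
  }
}
```

A stream-stream join additionally has to buffer past records of both sides as state, which is where watermarks help bound that state.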
Sources
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
» TCP sockets
Working with sources
image credits: Stream Processing with Apache Spark @OReilly
Offsets in checkpoints
.
|-- commits/
|-- offsets/
|-- sources/
|-- state/
`-- metadata
Sinks
» File-based: JSON, CSV, Parquet, ORC, and plain text
» Kafka, Kinesis, Flume
Experimentation:
- Memory, Console
Custom:
- foreach (implement ForeachWriter to integrate with anything)
Failure recovery
» Spark uses checkpoints
Write Ahead Log (WAL)
» in Spark Streaming, when we receive data from sources we buffer it
» we need to store additional metadata to register offsets etc.
» we save offsets and data to be able to replay them from the sources
"Exactly once" delivery guarantee
A combination of:
replayable sources
idempotent sinks
processing checkpoints
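How the three ingredients combine can be sketched in plain Scala. This is a toy model, not Spark's mechanism: the source is a vector read by offset, the sink is a map keyed by offset, and the checkpoint is a single committed offset. `Job`, `runBatch` and `failBeforeCommit` are hypothetical names.

```scala
import scala.collection.mutable

object ExactlyOnceSketch {
  // `source` is replayable: records can be re-read by offset.
  final class Job(source: Vector[String]) {
    // Idempotent sink: writing the same offset twice leaves one entry.
    val sink: mutable.Map[Int, String] = mutable.Map.empty[Int, String]
    // Processing checkpoint: the next offset to read after a restart.
    var committedOffset: Int = 0

    // Processes one batch. `failBeforeCommit` simulates a crash after the
    // sink write but before the checkpoint is updated.
    def runBatch(size: Int, failBeforeCommit: Boolean): Unit = {
      val end = math.min(committedOffset + size, source.length)
      (committedOffset until end).foreach(i => sink.put(i, source(i).toUpperCase))
      if (!failBeforeCommit) committedOffset = end
    }
  }

  def main(args: Array[String]): Unit = {
    val job = new Job(Vector("a", "b", "c", "d"))
    job.runBatch(2, failBeforeCommit = false)
    job.runBatch(2, failBeforeCommit = true)  // crash: offsets 2 and 3 are not committed
    job.runBatch(2, failBeforeCommit = false) // replay: the same offsets are written again
    println(job.sink.toSeq.sortBy(_._1))
  }
}
```

The replayed batch overwrites the same keys, so the sink ends up with each record once even though two of them were processed twice.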
Reads and refs
1. Streaming 102: The World Beyond Batch (article) by Tyler Akidau, 2016
2. Stream Processing with Apache Flink by Fabian Hueske and Vasiliki Kalavri, O'Reilly, April 2019
3. Stream Processing with Apache Spark by Francois Garillot and Gerard Maas, O'Reilly, 2019
4. Discretized Streams: Fault-Tolerant Streaming Computation at Scale (whitepaper) by Matei Zaharia et al., Berkeley
5. Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming by Tathagata Das, Databricks engineering blog
Thanks!
