Solving the sessionization problem with Apache Spark batch and streaming processing
Bartosz Konieczny
@waitingforcode
About me
Bartosz Konieczny
#dataEngineer #ApacheSparkEnthusiast #AWSuser
#waitingforcode.com #becomedataengineer.com
#@waitingforcode #github.com/bartosz25
#canalplus #Paris
Sessions

"user activity followed by a closing action or a period of inactivity"
© https://pixabay.com/users/maxmann-665103/ from https://pixabay.com
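To ground the code that follows, here is a minimal sketch of what one input visit log could look like. Only user_id and event_time are confirmed by the slides' code; the page field and the schema derivation are assumptions:

import java.sql.Timestamp

// Hypothetical input log shape; only user_id and event_time appear
// in the talk's code, page is an assumed payload field.
case class Visit(user_id: Long, event_time: Timestamp, page: String)

object Visit {
  // the slides reference Visit.Schema when reading the JSON input
  val Schema = org.apache.spark.sql.Encoders.product[Visit].schema
}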
Batch architecture

[diagram] data producer → input logs (streaming broker) → sync consumer → input logs (DFS); the orchestrator <triggers> the sessions generator, which combines the input logs with the previous window raw sessions (DFS) and writes the output sessions (DFS).
Streaming architecture

[diagram] data producer → input logs (streaming broker) → sessions generator, which <uses> a checkpoint location (metadata, state) and writes the output sessions (DFS).
Batch implementation
The code

val previousSessions = loadPreviousWindowSessions(sparkSession, previousSessionsDir)
val sessionsInWindow = sparkSession.read.schema(Visit.Schema)
  .json(inputDir)
val joinedData = previousSessions.join(sessionsInWindow,
    sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
  .groupByKey(log => SessionGeneration.resolveGroupByKey(log))
  .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound))
  .cache()

joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
  .flatMap(state => state.toSessionOutputState)
  .coalesce(50).write.mode(SaveMode.Overwrite)
  .option("compression", "gzip")
  .json(outputDir)
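The SessionGeneration helpers are not shown on the slide; the real implementations live in the sessionization-demo repository linked in Resources. As a minimal sketch, resolveGroupByKey could look like this, assuming the full outer join leaves one side of each row null:

import org.apache.spark.sql.Row

// Hypothetical sketch, not the talk's actual code: after the full outer
// join only one side of the row may carry the key, so take whichever is set.
object SessionGeneration {
  def resolveGroupByKey(log: Row): Long =
    if (!log.isNullAt(log.fieldIndex("userId"))) log.getAs[Long]("userId") // previous window side
    else log.getAs[Long]("user_id") // new input logs side
}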
Full outer join

val joinedData = previousSessions.join(sessionsInWindow,
    sessionsInWindow("user_id") === previousSessions("userId"), "fullouter")
  .groupByKey(log => SessionGeneration.resolveGroupByKey(log))
  .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound))

(remaining read and write code as on the previous slide)
[diagram] processing logic: previous window active sessions + new input logs ⇒ full outer join
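loadPreviousWindowSessions is also left out of the slides; a hedged sketch, assuming the previous run wrote its still-active sessions as JSON and that the very first window simply reads an empty directory:

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

// Hypothetical helper, not the talk's actual code.
def loadPreviousWindowSessions(sparkSession: SparkSession,
                               previousSessionsDir: String): Dataset[SessionIntermediaryState] = {
  import sparkSession.implicits._
  sparkSession.read
    .schema(Encoders.product[SessionIntermediaryState].schema) // explicit schema, also works on empty input
    .json(previousSessionsDir)
    .as[SessionIntermediaryState]
}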
Watermark simulation

(same read and join code as the previous slides; the key element is the windowUpperBound passed to the generation)

  .flatMapGroups(SessionGeneration.generate(TimeUnit.MINUTES.toMillis(5), windowUpperBound))

case class SessionIntermediaryState(userId: Long, … expirationTimeMillisUtc: Long, isActive: Boolean)
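Batch has no built-in watermark, so the window upper bound plays its role: a session that saw no activity before the end of the processed window gets closed. A minimal sketch of that check, assuming expirationTimeMillisUtc is the last event time plus the inactivity timeout:

// Hypothetical check inside SessionGeneration.generate: closing a session
// whose expiration falls before the window's upper bound mimics what a
// streaming watermark does with expired state.
def isStillActive(state: SessionIntermediaryState, windowUpperBound: Long): Boolean =
  state.expirationTimeMillisUtc > windowUpperBound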
Save modes

(same read and join code as the previous slides; the key element is the write mode)

joinedData.filter("isActive = true").write.mode(SaveMode.Overwrite).json(outputDir)
joinedData.filter(state => !state.isActive)
  .flatMap(state => state.toSessionOutputState)
  .coalesce(50).write.mode(SaveMode.Overwrite)
  .option("compression", "gzip")
  .json(outputDir)

SaveMode.Append ⇒ duplicates & invalid results (e.g. multiplied revenue!)
SaveMode.ErrorIfExists ⇒ failures & maintenance burden
SaveMode.Ignore ⇒ no data & old data present in case of reprocessing
SaveMode.Overwrite ⇒ always fresh data & easy maintenance
Streaming implementation
The code

val dataFrame = sparkSession.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafkaConfiguration.broker).option(...)
  .load()

val query = dataFrame.selectExpr("CAST(value AS STRING)")
  .select(functions.from_json($"value", Visit.Schema).as("data"))
  .select($"data.*").withWatermark("event_time", "3 minutes")
  .groupByKey(row => row.getAs[Long]("user_id"))
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())(
    mapStreamingLogsToSessions(sessionTimeout))

val writeQuery = query.writeStream.outputMode(OutputMode.Update())
  .option("checkpointLocation", s"s3://my-checkpoint-bucket")
  .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
    BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
  })

watermark - late events & state expiration
stateful processing - sessions generation
checkpoint - fault-tolerance
Checkpoint - fault-tolerance

[diagram] t1 query: load the state for t0, load the offsets to process & write them, process the data, write the processed offsets, write the state; everything goes through the checkpoint location (state store, offset log, commit log).

val writeQuery = query.writeStream.outputMode(OutputMode.Update())
  .option("checkpointLocation", s"s3://sessionization-demo/checkpoint")
  .foreachBatch((dataset: Dataset[SessionIntermediaryState], batchId: Long) => {
    BatchWriter.writeDataset(dataset, s"${outputDir}/${batchId}")
  })
  .start()
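For orientation, a Structured Streaming checkpoint location holds roughly the following entries (the exact layout can vary between Spark versions):

checkpoint/
  metadata    query identifier
  offsets/    offset log: what each micro-batch intends to process
  commits/    commit log: micro-batches that completed successfully
  sources/    source-specific metadata
  state/      state store versions (delta and snapshot files)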
Checkpoint - fault-tolerance

[diagram] t2 query: load the state and the offsets written for t1, process the data, confirm the processed offsets & the next watermark, commit the state (the state files are partition-based); again everything goes through the checkpoint location (state store, offset log, commit log).
Stateful processing

[diagram] The mapping function does get/put/remove and update/remove against the state store, which writes updates, finalizes files, makes snapshots, and recovers state from the checkpoint location.

def mapStreamingLogsToSessions(timeoutDurationMs: Long)(key: Long, logs: Iterator[Row],
    currentState: GroupState[SessionIntermediaryState]): SessionIntermediaryState = {
  if (currentState.hasTimedOut) {
    val expiredState = currentState.get.expire
    currentState.remove()
    expiredState
  } else {
    val newState = currentState.getOption.map(state => state.updateWithNewLogs(logs, timeoutDurationMs))
      .getOrElse(SessionIntermediaryState.createNew(logs, timeoutDurationMs))
    currentState.update(newState)
    currentState.setTimeoutTimestamp(currentState.getCurrentWatermarkMs() + timeoutDurationMs)
    currentState.get
  }
}
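The state class methods called above (expire, updateWithNewLogs, createNew) are not shown in the talk; a hedged sketch of what they could do, with the field names assumed from the batch slides:

import org.apache.spark.sql.Row

// Hypothetical sketch, not the talk's actual class (see the demo repository).
case class SessionIntermediaryState(userId: Long,
                                    expirationTimeMillisUtc: Long,
                                    isActive: Boolean) {
  // Mark the session as closed once its timeout fired.
  def expire: SessionIntermediaryState = copy(isActive = false)

  // Extend the session with the latest micro-batch's logs.
  def updateWithNewLogs(logs: Iterator[Row], timeoutMs: Long): SessionIntermediaryState = {
    val lastEventMs = logs.map(_.getAs[java.sql.Timestamp]("event_time").getTime).max
    copy(expirationTimeMillisUtc = lastEventMs + timeoutMs)
  }
}

object SessionIntermediaryState {
  def createNew(logs: Iterator[Row], timeoutMs: Long): SessionIntermediaryState = {
    val materialized = logs.toSeq
    val lastEventMs = materialized.map(_.getAs[java.sql.Timestamp]("event_time").getTime).max
    SessionIntermediaryState(materialized.head.getAs[Long]("user_id"),
      lastEventMs + timeoutMs, isActive = true)
  }
}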
Stateful processing

[diagram] .mapGroupsWithState(...) does get/put/remove against the state store:

state store = TreeMap[Long, ConcurrentHashMap[UnsafeRow, UnsafeRow]]
in-memory storage for the most recent versions

The store writes updates, finalizes files, and makes snapshots in the checkpoint location (1.delta, 2.delta, 3.snapshot), and recovers state from it.
Watermark

val sessionTimeout = TimeUnit.MINUTES.toMillis(5)
val query = dataFrame.selectExpr("CAST(value AS STRING)")
  .select(functions.from_json($"value", Visit.Schema).as("data"))
  .select($"data.*")
  .withWatermark("event_time", "3 minutes")
  .groupByKey(row => row.getAs[Long]("user_id"))
  .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout())(
    Mapping.mapStreamingLogsToSessions(sessionTimeout))
Watermark - late events

[diagram] on-time events reach .mapGroupsWithState(...); events behind the watermark (late events) are dropped before the stateful operation.
Watermark - expired state

State representation [simplified]: {value, TTL configuration}

Algorithm:
1. Update all states with new data → possibly extending the TTL
2. Retrieve the TTL configuration for the query → here: the watermark
3. Retrieve all states that expired → no new data in this query & TTL expired
4. Call mapGroupsWithState on them with the hasTimedOut param = true & no new data (Iterator.empty)

// full implementation: org.apache.spark.sql.execution.streaming.FlatMapGroupsWithStateExec.InputProcessor
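A hedged pseudocode rendering of the two passes described above; groupsWithNewData, stateFor, and expiredKeys are hypothetical names, and the real logic lives in the InputProcessor referenced just before:

// Sketch only, simplified from the algorithm above, not Spark's actual code.
// Pass 1: states that received data in this micro-batch.
for ((key, newLogs) <- groupsWithNewData)
  mapStreamingLogsToSessions(timeoutMs)(key, newLogs, stateFor(key)) // hasTimedOut = false
// Pass 2: states with no new data whose timeout fell behind the watermark.
for (key <- expiredKeys(currentWatermarkMs))
  mapStreamingLogsToSessions(timeoutMs)(key, Iterator.empty, stateFor(key)) // hasTimedOut = true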
Data reprocessing
Batch
reschedule your job
© https://pics.me.me/just-one-click-and-the-zoo-is-mine-8769663.png
Streaming
State store

1. Restored state is the most recent snapshot → 1.delta, 2.delta, 3.snapshot
2. Restored state is not the most recent snapshot but a snapshot exists → 1.delta, 2.delta, 3.snapshot, 4.delta
3. Restored state is not the most recent snapshot and a snapshot doesn't exist → 1.delta, 2.delta, 3.delta, 4.delta
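Those three cases suggest the following restore logic; this is a hedged sketch (availableSnapshots, loadSnapshot, applyDelta are hypothetical names), not Spark's actual implementation:

// Sketch: restore = load the latest snapshot at or below the target
// version, then replay the remaining delta files.
def restoreState(targetVersion: Long): Unit = {
  val snapshotVersion = availableSnapshots.filter(_ <= targetVersion).sorted.lastOption
  val firstDelta = snapshotVersion.map { version => loadSnapshot(version); version + 1 }.getOrElse(1L)
  (firstDelta to targetVersion).foreach(applyDelta) // delta replay, possibly from scratch (case 3)
}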
State store configuration

spark.sql.streaming.stateStore:
→ .minDeltasForSnapshot
→ .maintenanceInterval

spark.sql.streaming:
→ .maxBatchesToRetainInMemory

Checkpoint configuration

spark.sql.streaming.minBatchesToRetain
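For illustration, these settings could be applied when building the session; the values below are just the documented defaults, not tuning advice:

import org.apache.spark.sql.SparkSession

val sparkSession = SparkSession.builder()
  .appName("sessionization-demo")
  // number of delta files before the maintenance task makes a snapshot
  .config("spark.sql.streaming.stateStore.minDeltasForSnapshot", 10)
  // how often the state store maintenance task runs
  .config("spark.sql.streaming.stateStore.maintenanceInterval", "60s")
  // state versions kept in memory by the provider
  .config("spark.sql.streaming.maxBatchesToRetainInMemory", 2)
  // checkpoint entries (offsets/commits) kept for recovery
  .config("spark.sql.streaming.minBatchesToRetain", 100)
  .getOrCreate()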
Few takeaways
● yet another TDD acronym - Trade-Off Driven Development
○ simplicity for latency
○ simplicity for accuracy
○ scaling for latency
● AWS
○ Kinesis - short retention period = reprocessing boundary, connector
○ S3 - trade reliability for performance
○ EMR - transient cluster
○ Redshift - COPY
● Apache Spark
○ watermarks everywhere - batch simulation
○ state store configuration
○ restore mechanism
○ overwrite idempotent mode
Resources

● https://github.com/bartosz25/sessionization-demo
● https://www.waitingforcode.com/tags/spark-ai-summit-europe-2019-articles
Thank you! Bartosz Konieczny
@waitingforcode / github.com/bartosz25 / waitingforcode.com
Canal+ @canaltechteam