Paris Carbone<parisc@kth.se> - KTH Royal Institute of Technology
Stephan Ewen<stephan@data-artisans.com> - data Artisans
Gyula Fóra<gyula.fora@king.com> - King Digital Entertainment Ltd
Seif Haridi<haridi@kth.se> - KTH Royal Institute of Technology
Stefan Richter<s.richter@data-artisans.com> - data Artisans
Kostas Tzoumas<kostas@data-artisans.com> - data Artisans
State Management in Apache Flink®
Consistent Stateful Distributed Stream Processing
@vldb17
Overview
• The Apache Flink System Architecture
• Pipelined Consistent Snapshots
• Operations with Snapshots
• Large Scale Deployments and Evaluation
The Apache Flink Framework
• Libraries: SQL, Table, CEP, Graphs, ML
• Core API: DataStream, DataSet
• Runner: Dataflow Runtime
• Setup: Cluster, Backend, Metrics
Distributed Architecture
• Client: sends the optimised logical graph to the Job Manager
• Job Manager: scheduling, state partitioning, snapshot coordination
• Zookeeper: passive failover, snapshot metadata
• Task Managers: memory management, local snapshot execution, flow control
• Task Managers run physical long-running tasks with locally managed state
• External Snapshot Store (e.g., HDFS): receives partial snapshots
Snapshot Usages
1. End-to-End Guarantees
2. Reconfiguration
3. Version Control
4. Isolation
Stateful Processing
• Tasks are invoked per input record
• Tasks read/write managed state through logical operations (collections)
• Local State Backend (physical operations): In-Memory (Heap), or Embedded Off-heap+Disk Key-Value Store (RocksDB)
• state = f(input)
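The model on this slide can be sketched in a few lines of plain Python. This is an illustration only, not Flink's API: `HeapBackend` and `CountTask` are hypothetical names, and the backend is assumed to expose nothing more than `get` and `put`. The task is invoked once per record and touches its state only through those logical operations, so the physical storage could be swapped without changing the task.

```python
class HeapBackend:
    """Hypothetical in-memory (heap) backend: physical storage is a dict."""

    def __init__(self):
        self._store = {}

    def get(self, key, default=None):
        return self._store.get(key, default)

    def put(self, key, value):
        self._store[key] = value


class CountTask:
    """Hypothetical task that counts records per user."""

    def __init__(self, backend):
        # Managed state: the task only sees logical get/put operations.
        self.state = backend

    def process(self, record):
        # Invoked once per input record; the new state is a function of
        # the previous state and the input.
        key = record["user"]
        count = self.state.get(key, 0) + 1
        self.state.put(key, count)
        return key, count


task = CountTask(HeapBackend())
outputs = [task.process({"user": u}) for u in ["bob", "alice", "bob"]]
```

An off-heap or disk-based backend would only need to provide the same two methods.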
• A stream processor consumes input streams and keeps local states
• Divide computation into epochs
• Capture all local states after completing an epoch
• Can roll back input and state to a captured point in the past
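The epoch idea can be illustrated with a small, single-process sketch. Everything here is an assumption made for illustration: `EpochProcessor` is a hypothetical class, the input is a Python list indexed by offset, and an epoch is simply a fixed number of records. The point it shows is that a captured (offset, state) pair is enough to roll back both input and state and then replay.

```python
import copy


class EpochProcessor:
    """Hypothetical word counter that snapshots after every epoch."""

    def __init__(self, epoch_size):
        self.epoch_size = epoch_size
        self.state = {}       # local state: word -> count
        self.offset = 0       # next input position to read
        self.snapshots = []   # captured (offset, state) pairs

    def run(self, log, until):
        while self.offset < until:
            word = log[self.offset]
            self.state[word] = self.state.get(word, 0) + 1
            self.offset += 1
            if self.offset % self.epoch_size == 0:
                # Epoch complete: capture input position and state together.
                self.snapshots.append((self.offset, copy.deepcopy(self.state)))

    def rollback(self):
        # Restore input position and state from the latest captured point.
        self.offset, state = self.snapshots[-1]
        self.state = copy.deepcopy(state)


log = ["a", "b", "a", "c", "a", "b"]
p = EpochProcessor(epoch_size=2)
p.run(log, until=5)   # simulated failure partway through the third epoch
p.rollback()          # back to the end of the second epoch (offset 4)
p.run(log, until=6)   # replay from the captured offset
```

After the replay every record has been counted exactly once, even though the record at offset 4 was processed twice in total.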
A Synchronous Approach
• The master drains epoch 1; states are copied to the Snapshot Store
• The master drains epoch 2; states are copied to the Snapshot Store
Synchronous Snapshots
• In use: Storm Trident and Spark Streaming
• A conservative approach, equivalent to batching
• Can cause unnecessary latency (master coordination)
• Processing is no longer continuous
• Forces many tasks to be idle
• Instead, in Apache Flink snapshots are pipelined
Pipelined Snapshots
• Insert markers
• Epoch alignment
• Async state copy to the Snapshot Store
• Snapshot completes
[Animation: tasks A, B, C, D, E; their states appear in the Snapshot Store in the order B, A, C, D, E]
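Epoch alignment is the step that is easiest to get wrong, so here is a minimal sketch of it under stated assumptions. `AligningTask` and `MARKER` are hypothetical names, channels are FIFO and identified by an integer, and the state copy is done inline rather than asynchronously. Once a marker arrives on a channel, that channel is held back until markers have arrived on all channels; only then is the state captured and the marker forwarded.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical epoch marker


class AligningTask:
    """Hypothetical task that sums integers from several FIFO channels."""

    def __init__(self, num_channels):
        self.num_channels = num_channels
        self.total = 0           # local state
        self.blocked = set()     # channels whose marker has arrived
        self.buffered = deque()  # (channel, item) pairs held back
        self.snapshots = []      # one captured state per epoch
        self.output = []         # emitted records and markers

    def receive(self, channel, item):
        if channel in self.blocked:
            # Belongs to a later epoch: hold it until alignment completes.
            self.buffered.append((channel, item))
        elif item == MARKER:
            self.blocked.add(channel)
            if len(self.blocked) == self.num_channels:
                self._complete_alignment()
        else:
            self.total += item
            self.output.append(self.total)

    def _complete_alignment(self):
        self.snapshots.append(self.total)  # state copy (inline in this sketch)
        self.output.append(MARKER)         # forward the marker downstream
        self.blocked.clear()
        # Re-deliver held-back items in arrival order.
        pending, self.buffered = self.buffered, deque()
        for channel, item in pending:
            self.receive(channel, item)


task = AligningTask(num_channels=2)
task.receive(0, 1)
task.receive(0, MARKER)  # channel 0 is now blocked
task.receive(0, 10)      # next-epoch record: buffered
task.receive(1, 2)       # still the current epoch on channel 1: processed
task.receive(1, MARKER)  # alignment completes, state is captured
```

The captured state reflects exactly the records that preceded the markers on both channels; the buffered record is processed only afterwards.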
Pipelined Snapshots (cycles)
Problem: we cannot wait indefinitely for records in cycles
Solution: log in-flight records within a cycle in the snapshot. Replay upon recovery.
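A rough sketch of the cycle handling, under simplifying assumptions: a single hypothetical `LoopHead` task with one regular input and one feedback channel, integer records, and an in-memory snapshot. Instead of waiting for the marker to travel around the loop, the task captures its state as soon as the marker arrives on the regular input and logs every feedback record until the marker returns.

```python
MARKER = "MARKER"  # hypothetical epoch marker


class LoopHead:
    """Hypothetical task at the head of a cycle; sums all records."""

    def __init__(self):
        self.total = 0         # local state
        self.logging = False   # True while the marker is inside the cycle
        self.snapshot = None   # captured state plus logged in-flight records

    def on_input(self, value):
        self.total += value

    def on_input_marker(self):
        # Capture state immediately; do not wait for the feedback channel.
        self.snapshot = {"state": self.total, "in_flight": []}
        self.logging = True

    def on_feedback(self, item):
        if item == MARKER:
            self.logging = False  # marker came back: the log is complete
            return
        if self.logging:
            # In-flight record of the snapshotted epoch: log it.
            self.snapshot["in_flight"].append(item)
        self.total += item

    def recover(self):
        # Restore the captured state, then replay the logged records.
        self.total = self.snapshot["state"]
        for item in self.snapshot["in_flight"]:
            self.total += item
        self.logging = False


head = LoopHead()
head.on_input(5)
head.on_input_marker()
head.on_feedback(2)       # in flight within the cycle: logged
head.on_input(100)        # next epoch on the regular input: not logged
head.on_feedback(3)       # logged
head.on_feedback(MARKER)  # logging stops
head.on_feedback(7)       # not logged
total_before_failure = head.total
head.recover()
```

Records that were not logged (100 and 7 here) are assumed to be regenerated by replaying the input from the snapshot point.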
Technique Highlights
• Offers exactly-once processing guarantees
• Issued periodically or externally by the user
• Naturally respects flow control mechanisms
• Channel state logging limited to cycles only
• Multiple epoch snapshots can be pipelined
• Can offer weaker at-least-once processing guarantees by simply dropping alignment (weaker guarantee vs. no alignment cost)
Exactly-Once: Input and Processing
Important Assumptions
• Input streams are persisted with offset indexes (e.g., Kafka, Kinesis)
• Data Channels are FIFO and reliable (no loss)
Each epoch either completes or repeats
Exactly-Once Output
• Idempotency ~ repeated operations can be tolerated after recovery/rollback (works for mutable stores).
• Transactional Processing ~ requires a two-phase coordination. A snapshot completion eventually leads to an external commit (e.g., Flink’s HDFS RollingSink*)
[Figure: epoch n in-progress, epochs n-1 and n-2 pending, epoch n-3 committed]
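The transactional variant can be sketched as a sink that stages output per epoch and publishes it only after the corresponding snapshot completes. This is a simplified illustration, not the RollingSink implementation: `TwoPhaseSink` and its method names are hypothetical, and "external commit" is modelled as appending to a list.

```python
class TwoPhaseSink:
    """Hypothetical sink that stages output per epoch before committing."""

    def __init__(self):
        self.in_progress = []  # records of the current epoch
        self.pending = {}      # epoch -> records awaiting snapshot completion
        self.committed = []    # records visible to external readers

    def write(self, record):
        self.in_progress.append(record)

    def on_marker(self, epoch):
        # Phase 1: the epoch ends locally; stage its output as pending.
        self.pending[epoch] = self.in_progress
        self.in_progress = []

    def on_snapshot_complete(self, epoch):
        # Phase 2: the snapshot is complete, so commit this epoch and any
        # earlier epochs that are still pending.
        for e in sorted(k for k in self.pending if k <= epoch):
            self.committed.extend(self.pending.pop(e))

    def on_recovery(self, last_completed_epoch):
        # Output beyond the last completed snapshot is discarded; replaying
        # the input will produce it again.
        self.in_progress = []
        self.pending = {e: r for e, r in self.pending.items()
                        if e <= last_completed_epoch}
        self.on_snapshot_complete(last_completed_epoch)


sink = TwoPhaseSink()
sink.write("a")
sink.on_marker(1)
sink.write("b")
sink.on_marker(2)
sink.write("c")
sink.on_snapshot_complete(1)              # epoch 1 becomes visible
sink.on_recovery(last_completed_epoch=1)  # epoch 2 and "c" are discarded
```

External readers never observe "b" or "c" until the epochs that produced them are covered by a completed snapshot.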
Dataflow Reconfiguration
[Figure: a pipeline takes snap-1 and snap-2, is stopped, and continues with changed parallelism, taking snap-3, …]
Problem: How is state repartitioned from a snapshot?
Reconfiguration: The Issue
• Case I (full scan): Scan Remote Storage for Responsible Keys. Too slow.
• Case II: Include Key Locations in Snapshot Metadata. Too much.
[Figure: stored state 0x100: bob, …, 0x449: alice; metadata bob: 0x100, carol: 0x344, …, alice: 0x449, chuck: 0x630, …]
Reconfiguration: Key Groups
Pre-partition state in hash(K) space, into key-groups
• Snapshot Metadata: Contains a reference per stored Key-Group (less metadata)
• Reconfiguration: Contiguous key-group allocation to available tasks (less IO)
Note: number of key groups controls trade-off between metadata to keep and reconfiguration speed
[Figure: keys such as bob and alice placed into key-groups in hash(K) space]
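One way to realise contiguous key-group allocation is sketched below. The function names are hypothetical, `zlib.crc32` merely stands in for hash(K), and the exact rounding is an assumption rather than a statement of Flink's implementation. What matters is that a key's group never depends on the parallelism, and that each task owns one contiguous range of groups.

```python
import zlib


def key_group(key, num_key_groups):
    # crc32 stands in for hash(K); it is stable across runs and processes.
    return zlib.crc32(key.encode("utf-8")) % num_key_groups


def task_for_key_group(group, num_key_groups, parallelism):
    # Contiguous allocation: low groups go to task 0, the next range to
    # task 1, and so on.
    return group * parallelism // num_key_groups


def key_group_range(task, num_key_groups, parallelism):
    # The contiguous range of groups owned by `task` (ceiling division).
    start = (task * num_key_groups + parallelism - 1) // parallelism
    end = ((task + 1) * num_key_groups + parallelism - 1) // parallelism
    return range(start, end)


# Eight key groups: rescaling from 2 tasks to 4 tasks only changes which
# contiguous range each task reads from the snapshot.
ranges_p2 = [list(key_group_range(t, 8, 2)) for t in range(2)]
ranges_p4 = [list(key_group_range(t, 8, 4)) for t in range(4)]
bob_group = key_group("bob", 8)
```

With this layout the snapshot metadata needs one reference per key group, and a restored task reads only the groups in its own range.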
Version Control
[Figure: fork and update; Pipeline v.1 is forked into Pipeline v.2 and Pipeline v.3]
Isolation Levels
External query: select from facebook.userID, clients.name … inner join clients on …
• read-committed (snapshot)
• read-uncommitted (dirty read on latest state)
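The two isolation levels can be illustrated with a toy state store. `QueryableState` and its methods are hypothetical and say nothing about how Flink serves external queries; the sketch only shows the difference between reading the last completed snapshot and reading the latest state.

```python
import copy


class QueryableState:
    """Hypothetical store exposing two isolation levels to external queries."""

    def __init__(self):
        self.live = {}       # latest state, possibly mid-epoch
        self.committed = {}  # state as of the last completed snapshot

    def update(self, key, value):
        self.live[key] = value

    def complete_snapshot(self):
        self.committed = copy.deepcopy(self.live)

    def query(self, key, isolation="read-committed"):
        if isolation == "read-committed":
            # Reads the snapshot: never sees effects that may be rolled back.
            return self.committed.get(key)
        if isolation == "read-uncommitted":
            # Dirty read on the latest state.
            return self.live.get(key)
        raise ValueError(f"unknown isolation level: {isolation}")


store = QueryableState()
store.update("bob", 1)
store.complete_snapshot()
store.update("bob", 2)  # not yet covered by a snapshot

committed_view = store.query("bob")                            # 1
dirty_view = store.query("bob", isolation="read-uncommitted")  # 2
```

A read-committed query trades freshness for consistency with what would survive a rollback.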
Large Scale Deployment at King
[Plot: Total Snapshotting Time (sec, 0 to 250) vs. Global State Size (GB, 100 to 500); total time / snapshot (alignment + async copies); annotation: ~runtime overhead]
[Plot: Total Alignment Time (msec, 0 to 1400) vs. Parallelism (30, 50, 70) for PROC, WIN, OUT; annotation: alignment cost; • #shuffles (keyby) • parallelism]
Teaser: More paper highlights
• We can use the same technique to coordinate externally managed state with snapshots.
• Epoch markers can act as on-the-fly reconfiguration points.
• Internals of asynchronous and incremental snapshots.
