Stream processing with
Apache Flink™
Kostas Tzoumas
@kostas_tzoumas
The rise of stream processing
2
Why streaming
3
Data availability over time (which data? when? who?):
- 2000, Data Warehouse: strict schema, load rate, BI access
- 2008, Batch: some schema, load rate, programmable
- 2015, Streaming: some schema, ingestion rate, programmable
What does streaming enable?
4
1. Data integration
cf. Kleppmann: "Turning the DB inside out with Samza"
2. Low-latency applications
• Fresh recommendations, fraud detection, etc.
• Internet of Things, intelligent manufacturing
• Results "right here, right now"
3. Batch < Streaming
New stack next to/inside Hadoop
5
Files → Batch processors → High-latency apps
Event streams → Stream processors → Low-latency apps
Streaming data architectures
6
Stream platform architecture
7
Data sources (server logs, transaction logs, sensor logs) feed two layers:
- A layer that gathers and backs up streams, offers streams for consumption, and provides stream recovery
- A layer that analyzes and correlates streams, creates derived streams and state, and provides these to upstream systems
Example: Bouygues Telecom
8
Apache Flink primer
9
What is Flink
10
[Figure: the Flink stack]
- Libraries: Gelly, Table, ML, SAMOA
- Compatibility layers and integrations: Hadoop M/R, Dataflow, Dataflow (WiP), MRQL, Table, Cascading (WiP), Storm (WiP), Zeppelin
- APIs: DataSet (Java/Scala), DataStream (Java/Scala)
- Runtime: streaming dataflow runtime
- Deployment: Local, Cluster, YARN, Tez, Embedded
Motivation for Flink
11
Flink: an engine that can natively support all of these workloads:
- Stream processing
- Batch processing
- Machine learning at scale
- Graph analysis
Stream processing in Flink
12
What is a stream processor?
Basics:
1. Pipelining
2. Stream replay
State:
3. Operator state
4. Backup and restore
App development:
5. High-level APIs
6. Integration with batch
Large deployments:
7. High availability
8. Scale-in and scale-out
13
See http://data-artisans.com/stream-processing-with-flink.html
Pipelining
14
Basic building block to “keep the data moving”
Note: pipelined systems do not
usually transfer individual tuples,
but buffers that batch several tuples!
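To make the buffering note concrete, here is a toy sketch in Scala (a didactic model with hypothetical names, not Flink's actual network stack): records are grouped into small buffers and shipped downstream when a buffer fills up or a flush timeout expires, so latency stays bounded while per-record overhead stays low.

import scala.collection.mutable.ArrayBuffer

// Toy model of a pipelined channel: batch records into buffers, ship on size or timeout.
class BufferedChannel[T](capacity: Int, flushIntervalMs: Long, ship: Seq[T] => Unit) {
  private val buffer = new ArrayBuffer[T](capacity)
  private var lastFlush = System.currentTimeMillis()

  def send(record: T): Unit = {
    buffer += record
    val timedOut = System.currentTimeMillis() - lastFlush >= flushIntervalMs
    if (buffer.size >= capacity || timedOut) flush()
  }

  def flush(): Unit = {
    if (buffer.nonEmpty) ship(buffer.toList)   // hand a whole buffer downstream, not single tuples
    buffer.clear()
    lastFlush = System.currentTimeMillis()
  }
}

// Usage: ship buffers of up to 4 records, or whatever accumulated after 100 ms.
val channel = new BufferedChannel[String](4, 100, batch => println(s"shipping $batch"))
(1 to 10).foreach(i => channel.send(s"event-$i"))
channel.flush()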
Operator state
- User-defined state
• Flink transformations (map/reduce/etc.) are long-running operators; feel free to keep objects around
• Hooks to include that state in the system's checkpoints (see the sketch after this slide)
- Windowed streams
• Time, count, and data-driven windows
• Managed by the system (currently WiP)
- Managed state (WiP)
• State interface for operators
• Backed up and restored by the system with a pluggable state backend (HDFS, Ignite, Cassandra, …)
15
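As a concrete illustration of user-defined state with checkpoint hooks, here is a minimal sketch against the Checkpointed interface of that era's DataStream API (a simplified example, not from the deck; the interface was later deprecated, and exact signatures may differ across versions):

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.streaming.api.checkpoint.Checkpointed
import org.apache.flink.util.Collector

// A long-running operator that keeps a simple count as user-defined state.
// The two hooks let the system back up and restore that state at checkpoints.
class CountingFlatMap extends RichFlatMapFunction[String, (String, Long)]
    with Checkpointed[java.lang.Long] {

  private var count: Long = 0L

  override def flatMap(value: String, out: Collector[(String, Long)]): Unit = {
    count += 1
    out.collect((value, count))
  }

  // Called when a checkpoint barrier reaches this operator: return the state to back up.
  override def snapshotState(checkpointId: Long, checkpointTimestamp: Long): java.lang.Long =
    count

  // Called on recovery: reinstall the last successfully checkpointed state.
  override def restoreState(state: java.lang.Long): Unit = {
    count = state
  }
}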
Streaming fault tolerance
- Ensure that operators see all events
• "At least once"
• Solved by replaying a stream from a checkpoint, e.g., from a past Kafka offset
- Ensure that operators do not perform duplicate updates to their state
• "Exactly once"
• Several solutions
16
Exactly once approaches
- Discretized streams (Spark Streaming)
• Treat streaming as a series of small atomic computations
• A "fast track" to fault tolerance, but does not separate business logic from recovery
- MillWheel (Google Cloud Dataflow)
• State updates and derived events are committed as an atomic transaction to a high-throughput transactional store
• Needs a very high-throughput transactional store
- Chandy-Lamport distributed snapshots (Flink)
17
Distributed snapshots in Flink
Superimpose the checkpointing mechanism on the execution, instead of using the execution as the checkpointing mechanism
18
19
JobManager
Register checkpoint
barrier on master
Replay will start from here
20
JobManager
Barriers "push" prior events
(assumes in-order delivery in individual channels)
Operator checkpointing
starting
Operator checkpointing
finished
Operator checkpointing in
progress
21
JobManager
Operator checkpointing takes a snapshot of the state after all data prior to the barrier have updated it. Checkpoints are currently one-off and synchronous; incremental and asynchronous checkpoints are WiP.
State backup
Pluggable mechanism: currently either the JobManager (for small state) or a file system (HDFS/Tachyon); in-memory grids are WiP.
22
JobManager
Operators with multiple inputs need to wait for the barriers from all inputs to arrive before they checkpoint their state (a simplified alignment sketch follows this walkthrough).
23
JobManager
State snapshots at the sinks signal the successful end of this checkpoint.
At failure, recovering the last checkpointed state and restarting the sources from the last barrier guarantees at-least-once processing.
State backup
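The walkthrough above can be condensed into a small, self-contained model of barrier alignment (a didactic sketch, not Flink's implementation; all names are hypothetical): an operator holds back records from input channels whose barrier has already arrived, snapshots its state once barriers from all channels are in, forwards the barrier, and then replays what it held back.

import scala.collection.mutable

sealed trait Message
case class Record(value: String) extends Message
case class Barrier(checkpointId: Long) extends Message

// Didactic model of barrier alignment for a single operator with several input channels.
class AligningOperator(numChannels: Int, snapshotState: () => Unit, emit: Message => Unit) {
  private val aligned = mutable.Set.empty[Int]                          // channels whose barrier has arrived
  private val heldBack = mutable.Map.empty[Int, mutable.Queue[Record]]  // records buffered per aligned channel

  def receive(channel: Int, msg: Message): Unit = msg match {
    case r: Record if aligned.contains(channel) =>
      // This channel is already past the barrier: buffer its records until alignment completes.
      heldBack.getOrElseUpdate(channel, mutable.Queue.empty[Record]) += r

    case r: Record =>
      emit(r)                                         // normal pipelined processing

    case b: Barrier =>
      aligned += channel
      if (aligned.size == numChannels) {              // barriers from all inputs have arrived
        snapshotState()                               // checkpoint the operator state
        emit(b)                                       // forward the barrier downstream
        heldBack.values.foreach(_.foreach(emit))      // replay the held-back records
        heldBack.clear(); aligned.clear()
      }
  }
}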
Benefits of Flink’s approach
- Data processing does not block
• Can checkpoint at any interval you like to balance overhead against recovery time
- Separates business logic from recovery
• The checkpointing interval is a config parameter, not a variable in the program (as in discretization); see the snippet below
- Can support richer windows
• Session windows, event time, etc.
- Best of all worlds: true streaming latency, exactly-once semantics, and low overhead for recovery
24
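Since the checkpointing interval is just a configuration knob, switching the mechanism on is a one-liner on the execution environment (a minimal sketch; the 5-second interval is an arbitrary example):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Draw a distributed snapshot every 5 seconds; the program logic itself is unchanged.
env.enableCheckpointing(5000)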
DataStream API
25
DataStream API (streaming):

case class Word(word: String, frequency: Int)

val lines: DataStream[String] = env.fromSocketStream(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .window(Time.of(5, SECONDS)).every(Time.of(1, SECONDS))
  .groupBy("word").sum("frequency")
  .print()

DataSet API (batch):

val lines: DataSet[String] = env.readTextFile(...)

lines.flatMap { line => line.split(" ")
    .map(word => Word(word, 1)) }
  .groupBy("word").sum("frequency")
  .print()
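Both snippets assume an execution environment named env (StreamExecutionEnvironment for the DataStream program, ExecutionEnvironment for the DataSet program), and a streaming job only runs once execute() is called. A minimal sketch of the surrounding boilerplate (the job name is arbitrary):

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment

// ... build the DataStream program shown above on `env` ...

env.execute("Streaming WordCount")   // nothing is processed until execute() is called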
Roadmap
- Short-term (3-6 months)
• Graduate the DataStream API from beta
• Fully managed windows and user-defined state with pluggable backends
• Table API for streams (towards StreamSQL)
- Long-term (6+ months)
• Highly available master
• Dynamic scale-in/out
• FlinkML and Gelly for streams
• Full batch + stream unification
26
Closing
27
tl;dr: what was this about?
- Streaming is the next logical step in data infrastructure
- Many new "fast data" platforms are being built next to or inside Hadoop; they will need a stream processor
- The case for Flink as a stream processor:
• Proper engine foundation
• Attractive APIs and libraries
• Integration with batch
• Large (and growing!) community
28
Apache Flink: community
29
One of the most active big
data projects after one year
in the Apache Software
Foundation
I Flink, do you?
30
If you find this exciting,
get involved and start a discussion on Flink's mailing list,
or stay tuned by
subscribing to news@flink.apache.org,
following flink.apache.org/blog, and
@ApacheFlink on Twitter
31
flink-forward.org
Spark & Friends meetup
June 16
Bay Area Flink meetup
June 17
Editor's Notes

  • #8: What are the technologies that enable streaming? The open source leaders in this space are Apache Kafka (which solves the integration problem) and Apache Flink (which solves the analytics problem, removing the final barrier). Kafka and Flink combined can remove the batch barriers from the infrastructure, creating a truly real-time analytics platform.
  • #30: Other data points: Google (Cloud Dataflow), Hortonworks, Cloudera, Adatao, Concurrent, Confluent. We have been part of this open source movement with Apache Flink. Flink is a streaming dataflow engine that can run in Hadoop clusters. Flink has grown a lot over the past year, both in terms of code and community. We have added domain-specific libraries, a streaming API with streaming backend support, and more. Tremendous growth.
    Flink has also grown in community. The project is by now a very established Apache project; it has more than 140 contributors (placing it in the top 5 of Apache big data projects), and several companies are starting to experiment with it. At data Artisans we are supporting two production installations (ResearchGate and Bouygues Telecom) and are helping a number of companies that are testing Flink (e.g., Spotify, King.com, Amadeus, and a group at Yahoo). Huawei and Intel have started contributing to Flink, and interest from vendors is picking up (e.g., Adatao, Huawei, Hadoop vendors). All of this is the result of purely organic growth with very little marketing investment from data Artisans.