Real-time Stream Processing with Apache Flink @ Hadoop Summit

Marton Balassi – data Artisans
Gyula Fora - SICS
Flink committers
mbalassi@apache.org / gyfora@apache.org

Stream Processing
§  Data stream: Infinite sequence of data arriving in a continuous fashion.
§  Stream processing: Analyzing and acting on real-time streaming data using continuous queries.
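
To make the two definitions concrete, here is a minimal sketch in plain Scala (no Flink dependency). The source `readings` and the query `runningSum` are hypothetical names chosen for illustration; the point is only that a continuous query emits a result per incoming element and never waits for the end of the input.

```scala
object ContinuousQuery {
  // Hypothetical unbounded source: an endless sequence of readings 1, 2, 3, ...
  def readings: Iterator[Int] = Iterator.from(1)

  // A "continuous query": lazily emits the running sum after every element.
  // scanLeft starts with the seed 0, which is dropped from the output.
  def runningSum(in: Iterator[Int]): Iterator[Int] =
    in.scanLeft(0)(_ + _).drop(1)
}
```

Because the iterator is lazy, `ContinuousQuery.runningSum(ContinuousQuery.readings).take(5).toList` can be evaluated even though the source never ends.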

Streaming landscape
Apache Storm
• True streaming, low latency - lower throughput
• Low level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing on top of batch system, high throughput - higher latency
• Functional API (DStreams), restricted by batch runtime
Apache Samza
• True streaming built on top of Apache Kafka, state is a first-class citizen
• Slightly different stream notion, low level API
Apache Flink
• True streaming with adjustable latency-throughput trade-off
• Rich functional API exploiting streaming runtime; e.g. rich windowing semantics

Apache Storm
§  True streaming, low latency - lower throughput
§  Low level API (Bolts, Spouts) + Trident
§  At-least-once processing guarantees
Issues
§  Costly fault tolerance
§  Serialization
§  Low level API

Spark Streaming
§  Stream processing emulated on a batch system
§  High throughput - higher latency
§  Functional API (DStreams)
§  Exactly-once processing guarantees
Issues
§  Restricted streaming semantics
§  Windowing
§  High latency

Apache Samza
§  True streaming built on top of Apache Kafka
§  Slightly different stream notion, low level API
§  At-least-once processing guarantees with state
Issues
§  High disk IO
§  Low level API

Apache Flink
§  True streaming with adjustable latency and throughput
§  Rich functional API exploiting streaming runtime
§  Flexible windowing semantics
§  Exactly-once processing guarantees with (small) state
Issues
§  Limited state size
§  HA issue

What is Flink
A "use-case complete" framework to unify batch and stream processing
(Figure: event logs and historic data feed into Flink, which covers ETL, relational, graph analysis, machine learning and streaming analysis workloads.)

What is Flink
An engine that puts equal emphasis on streaming and batch
(Figure: real-time data streams, i.e. event logs from Kafka, RabbitMQ, ..., feed low latency windowing, aggregations, ...; historic data from HDFS, JDBC, ... feeds ETL, graphs, machine learning, relational, ...)

Flink stack
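(Figure: the Flink stack, as of the current Flink master plus a few PRs. Components shown: Python, Gelly, Table, FlinkML, SAMOA, Hadoop M/R; DataSet (Java/Scala) with the Batch Optimizer and DataStream (Java/Scala) with the Streaming Optimizer, each producing a dataflow; the Flink Runtime; deployment as Local, Remote, Yarn, Tez, Embedded.)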

Overview of the API
§  Data stream sources
•  File system
•  Message queue connectors
•  Arbitrary source functionality
§  Stream transformations
•  Basic transformations: Map, Reduce, Filter, Aggregations…
•  Binary stream transformations: CoMap, CoReduce…
•  Windowing semantics: Policy based flexible windowing (Time, Count, Delta…)
•  Temporal binary stream operators: Joins, Crosses…
•  Native support for iterations
§  Data stream outputs
§  For the details please refer to the programming guide:
•  http://flink.apache.org/docs/latest/streaming_guide.html
(Figure: example dataflow built from Src, Map, Filter, Merge, Reduce, Sum and Sink operators.)
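
The binary transformations in the list above may be the least familiar item, so here is a rough sketch of the CoMap idea in plain Scala rather than with the Flink API. It assumes that the two input streams are modelled as one interleaved sequence of `Either` values; `coMap`, `map1` and `map2` are illustrative names, not Flink signatures.

```scala
object CoMapSketch {
  // Two inputs of different types A and B are mapped to one common output type R.
  // Elements of the first input arrive as Left, elements of the second as Right.
  def coMap[A, B, R](in: Seq[Either[A, B]])(map1: A => R, map2: B => R): Seq[R] =
    in.map {
      case Left(a)  => map1(a) // element from the first stream
      case Right(b) => map2(b) // element from the second stream
    }
}
```

The design point is that each input keeps its own type and its own function, while downstream operators see a single stream of `R`.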

Use-case: Financial analytics
§  Reading from multiple inputs
•  Merge stock data from various sources
§  Window aggregations
•  Compute simple statistics over windows of data
§  Data driven windows
•  Define arbitrary windowing semantics
§  Combine with sentiment analysis
•  Enrich your analytics with social media feeds (Twitter)
§  Streaming joins
•  Join multiple data streams
§  Detailed explanation and source code on our blog
•  http://flink.apache.org/news/2015/02/09/streaming-example.html

Reading from multiple inputs
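    // Record type for one stock quote
    case class StockPrice(symbol: String, price: Double)

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // Read "SYMBOL,price" lines from a socket and parse them into StockPrice records
    val socketStockStream = env.socketTextStream("localhost", 9999)
      .map(x => {
        val split = x.split(",")
        StockPrice(split(0), split(1).toDouble)
      })

    // Generated streams for two more symbols (generateStock is not shown on the slide)
    val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
    val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

    // Merge the three streams into a single stream of stock prices
    val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)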
(Figure: the socket delivers "HDP, 23.8" and "HDP, 26.6"; the generated sources emit StockPrice(SPX, 2113.9) and StockPrice(FTSE, 6931.7); the merged stream contains StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6).)
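
The parse-and-merge step can be tried without a Flink cluster. The sketch below is plain Scala and only approximates the slide: `parse` mirrors the map function applied to the socket lines (with an added `trim`, which is an assumption), and `merge` is a hypothetical stand-in that concatenates finite samples, whereas a real merge interleaves elements in arrival order.

```scala
object MergeInputs {
  case class StockPrice(symbol: String, price: Double)

  // Parse one "SYMBOL, price" line into a StockPrice record.
  def parse(line: String): StockPrice = {
    val split = line.split(",")
    StockPrice(split(0).trim, split(1).trim.toDouble)
  }

  // Stand-in for merge: concatenates finite samples of several streams.
  def merge(streams: Seq[StockPrice]*): Seq[StockPrice] = streams.flatten
}
```

For example, `MergeInputs.parse("HDP, 23.8")` gives `StockPrice("HDP", 23.8)`, the first record from the socket in the slide's example data.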

Window aggregations
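    // Sliding window: the last 10 seconds of data, evaluated every 5 seconds
    val windowedStream = stockStream
      .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

    // Lowest price over all stocks in the window
    val lowest = windowedStream.minBy("price")

    // Highest price per stock symbol
    val maxByStock = windowedStream.groupBy("symbol").maxBy("price")

    // Mean price per stock symbol (mean is a user-defined function not shown on the slide)
    val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)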

(Figure: for a window containing StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6), lowest yields StockPrice(HDP, 23.8); maxByStock yields StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7) and StockPrice(HDP, 26.6); rollingMean yields StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7) and StockPrice(HDP, 25.2).)
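
What the three aggregations compute on a single window can be sketched with ordinary Scala collections. This is not the Flink windowing API: `windowAt` is a hypothetical helper that picks the contents of one sliding window from timestamped events (timestamps in seconds are an assumption), and the aggregation functions assume a non-empty window.

```scala
object WindowAggregations {
  case class StockPrice(symbol: String, price: Double)

  // Contents of the sliding window that ends at `end` and covers the previous `size` seconds.
  def windowAt[T](events: Seq[(Long, T)], end: Long, size: Long): Seq[T] =
    events.collect { case (ts, e) if ts > end - size && ts <= end => e }

  // Lowest price over all stocks in the window.
  def lowest(window: Seq[StockPrice]): StockPrice = window.minBy(_.price)

  // Highest price per symbol.
  def maxByStock(window: Seq[StockPrice]): Map[String, StockPrice] =
    window.groupBy(_.symbol).map { case (sym, ps) => sym -> ps.maxBy(_.price) }

  // Mean price per symbol.
  def meanByStock(window: Seq[StockPrice]): Map[String, Double] =
    window.groupBy(_.symbol).map { case (sym, ps) => sym -> ps.map(_.price).sum / ps.size }
}
```

With a 10 second window evaluated every 5 seconds, each event falls into two consecutive windows, which is why `windowAt` is called once per evaluation time rather than once per event.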

Data-driven windows
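    case class Count(symbol: String, count: Int)

    // Delta window per stock with a 0.05 threshold on priceChange
    // (priceChange, defaultPrice and sendWarning are not shown on the slide)
    val priceWarnings = stockStream.groupBy("symbol")
      .window(Delta.of(0.05, priceChange, defaultPrice))
      .mapWindow(sendWarning _)

    // Count the warnings per stock over 30 seconds
    val warningsPerStock = priceWarnings.map(Count(_, 1))
      .groupBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")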
(Figure: of the inputs StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6), the pair StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6) triggers a warning, which is counted as Count(HDP, 1).)
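
A data-driven trigger of this kind can be approximated in a few lines of plain Scala. The sketch makes several assumptions that the slide does not confirm: `priceChange` is taken to be the relative change between two quotes, the first quote seen for a symbol plays the role of `defaultPrice`, and a warning is represented by the symbol name alone.

```scala
object DeltaWindow {
  case class StockPrice(symbol: String, price: Double)

  // Assumed delta function: relative price change between two quotes.
  def priceChange(first: StockPrice, last: StockPrice): Double =
    math.abs(last.price / first.price - 1)

  // Emits a symbol whenever its price has moved more than `threshold` away from
  // the quote that opened the current window; that quote then opens a new window.
  def warnings(quotes: Seq[StockPrice], threshold: Double): Seq[String] = {
    val start = (Map.empty[String, StockPrice], Vector.empty[String])
    val (_, out) = quotes.foldLeft(start) {
      case ((anchors, acc), q) =>
        anchors.get(q.symbol) match {
          case Some(a) if priceChange(a, q) > threshold =>
            (anchors.updated(q.symbol, q), acc :+ q.symbol)
          case Some(_) => (anchors, acc)
          case None    => (anchors.updated(q.symbol, q), acc)
        }
    }
    out
  }
}
```

On the slide's sample data the move from 23.8 to 26.6 is a change of roughly 12%, so only HDP produces a warning at a 0.05 threshold.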

Combining with a Twitter stream
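    // generateTweets is not shown on the slide
    val tweetStream = env.addSource(generateTweets _)

    // Split tweets into words and keep the upper-cased words that are known stock symbols
    val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
      .map(_.toUpperCase())
      .filter(symbols.contains(_))

    // Count the mentions per stock over 30 seconds
    val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
      .window(Time.of(30, SECONDS))
      .sum("count")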

(Figure: the tweets "hdp is on the rise!" and "I wish I bought more YHOO and HDP stocks" yield Count(HDP, 2) and Count(YHOO, 1).)
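
The same split, upper-case, filter and count steps can be run on the two sample tweets with plain Scala collections. This sketch treats one 30 second window as a finite `Seq`; the function name `mentions` and the contents of the symbol set are assumptions for illustration.

```scala
object TweetMentions {
  case class Count(symbol: String, count: Int)

  // Counts how often each known stock symbol is mentioned in a window of tweets.
  def mentions(tweets: Seq[String], symbols: Set[String]): Set[Count] =
    tweets.flatMap(_.split(" ").toSeq)  // tweets to words
      .map(_.toUpperCase)               // normalise case, so "hdp" matches "HDP"
      .filter(symbols.contains)         // keep known symbols only
      .groupBy(identity)
      .map { case (sym, hits) => Count(sym, hits.size) }
      .toSet
}
```

Note that splitting on a single space keeps punctuation attached to words, so a mention written as "HDP!" would not be counted; the slide's pipeline has the same property.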

Streaming joins
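    // Join warning counts and tweet counts per symbol over 30-second windows
    val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
      .onWindow(30, SECONDS)
      .where("symbol")
      .equalTo("symbol") { (c1, c2) => (c1.count, c2.count) }

    // computeCorrelation is a user-defined function not shown on the slide
    val rollingCorrelation = tweetsAndWarning
      .window(Time.of(30, SECONDS))
      .mapWindow(computeCorrelation _)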

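(Figure: joining the tweet counts Count(HDP, 2) and Count(YHOO, 1) with the warning count Count(HDP, 1) yields the pair (1,2); the example correlation value shown is 0.5.)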
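
Within one window, the join step amounts to an inner join on the symbol. The sketch below shows that part only, in plain Scala, and leaves the correlation out because `computeCorrelation` is not defined on the slide; the function name `join` and the finite `Seq` inputs are assumptions.

```scala
object WindowJoin {
  case class Count(symbol: String, count: Int)

  // Inner join of two windows on symbol; emits (warning count, tweet count) pairs.
  def join(warnings: Seq[Count], tweets: Seq[Count]): Seq[(Int, Int)] =
    for {
      w <- warnings
      t <- tweets
      if w.symbol == t.symbol
    } yield (w.count, t.count)
}
```

Symbols that appear on only one side, such as YHOO in the slide's sample data, produce no output pair.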

Fault tolerance
§  Exactly once semantics
•  Asynchronous barrier snapshotting
•  Checkpoint barriers streamed from the sources
•  Operator state checkpointing + source backup
•  Pluggable backend for state management
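(Figure: snapshot barriers travel with the data through the operators and trigger checkpoints, involving the job manager (JM) and the state manager (SM). Legend: operator, snapshot barrier, event channel, data channel, checkpoint.)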
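
The barrier idea can be illustrated with a toy, single-input operator in plain Scala. This is a heavily simplified sketch under stated assumptions, not Flink's implementation: there is one operator that sums its input, the source can replay everything after a given barrier, and all names (`Record`, `Barrier`, `run`, `recover`) are hypothetical.

```scala
object BarrierSnapshot {
  sealed trait Event
  case class Record(value: Int) extends Event
  case class Barrier(checkpointId: Int) extends Event

  // A summing operator. Returns the final state plus the snapshots taken so far,
  // keyed by checkpoint id. A snapshot is the state at the moment the barrier arrives.
  def run(events: Seq[Event], initial: Int = 0): (Int, Map[Int, Int]) =
    events.foldLeft((initial, Map.empty[Int, Int])) {
      case ((sum, snaps), Record(v))   => (sum + v, snaps)
      case ((sum, snaps), Barrier(id)) => (sum, snaps.updated(id, sum))
    }

  // Recovery: restore the snapshot of checkpoint `id` and replay only the
  // events that the source emitted after that barrier.
  def recover(events: Seq[Event], id: Int, snapshot: Int): Int = {
    val afterBarrier = events.dropWhile(_ != Barrier(id)).drop(1)
    run(afterBarrier, snapshot)._1
  }
}
```

Restoring the snapshot and replaying from the barrier gives the same sum as a failure-free run, so every record affects the state exactly once. Multi-input operators, which have to align barriers from several channels, are outside this sketch.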

Performance
§  Performance optimizations
•  Efficient serialization due to strongly typed topologies
•  Operator chaining (thread sharing/no serialization)
•  Different automatic query optimizations
§  Competitive performance
•  ~ 1.5m events / sec / core
•  For comparison, Storm promises ~ 1m tuples / sec / node
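
As a loose analogy for operator chaining, two map steps can be fused into one function so that each element passes from one step to the next by a plain method call in the same thread. The sketch below uses ordinary Scala function composition; it says nothing about how Flink decides which operators to chain, and `parse`, `twice` and `chained` are illustrative names.

```scala
object OperatorChaining {
  // Two logical operators...
  val parse: String => Int = _.trim.toInt
  val twice: Int => Int = _ * 2

  // ...fused into one: the intermediate Int is handed over directly,
  // with no serialization and no thread hand-off in between.
  val chained: String => Int = parse andThen twice
}
```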

Roadmap
§  Persistent, high-throughput state backend
§  Job manager high availability
§  Application libraries
•  General statistics over streams
•  Pattern matching
•  Machine learning pipelines library
•  Streaming graph processing library
§  Integration with other frameworks
•  Zeppelin (Notebook)
•  SAMOA (Online ML)

Summary
§  Flink is a use-case complete framework to unify batch and stream processing
§  True streaming runtime with high-level APIs
§  Flexible, data-driven windowing semantics
§  Competitive performance
§  We are just getting started!

Flink Community
(Chart: unique git contributors over time; x-axis from Jul-09 to May-16, y-axis from 0 to 120.)

flink.apache.org
@ApacheFlink
