Marton Balassi – data Artisans
Gyula Fora - SICS
Flink committers
mbalassi@apache.org / gyfora@apache.org
Real-time Stream Processing
with Apache Flink
Stream Processing
§  Data stream: Infinite sequence of data arriving in a continuous fashion.
§  Stream processing: Analyzing and acting on real-time streaming data,
using continuous queries
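The two definitions above can be illustrated with a small sketch in plain Scala (no Flink dependency). The `Reading` type, the synthetic source, and the running-average query are hypothetical names chosen for illustration: the source never terminates, and the "continuous query" emits an updated result for every incoming element.

```scala
object ContinuousQuerySketch {
  // Hypothetical event type for illustration; not part of any Flink API.
  final case class Reading(sensor: String, value: Double)

  // An unbounded source: this iterator never ends, mimicking an infinite data stream.
  def readings(): Iterator[Reading] =
    Iterator.from(0).map(i => Reading(s"sensor-${i % 3}", (i % 10).toDouble))

  // A "continuous query": a running average that emits one updated result per element.
  def runningAverage(in: Iterator[Reading]): Iterator[Double] =
    in.scanLeft((0.0, 0L)) { case ((sum, n), r) => (sum + r.value, n + 1) }
      .drop(1) // skip the initial empty accumulator
      .map { case (sum, n) => sum / n }

  def main(args: Array[String]): Unit =
    // Only a finite prefix of the infinite result stream can ever be observed.
    runningAverage(readings()).take(5).foreach(println)
}
```

Because the iterator is lazy, the query is evaluated element by element as data "arrives", which is the key difference from a batch job that waits for a complete input.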
Streaming landscape
Apache Storm
• True streaming, low latency - lower throughput
• Low level API (Bolts, Spouts) + Trident
Spark Streaming
• Stream processing on top of batch system, high throughput - higher latency
• Functional API (DStreams), restricted by batch runtime
Apache Samza
• True streaming built on top of Apache Kafka, state is first class citizen
• Slightly different stream notion, low level API
Apache Flink
• True streaming with adjustable latency-throughput trade-off
• Rich functional API exploiting streaming runtime; e.g. rich windowing semantics
Apache Storm
§  True streaming, low latency - lower throughput
§  Low level API (Bolts, Spouts) + Trident
§  At-least-once processing guarantees
Issues
§  Costly fault tolerance
§  Serialization
§  Low level API
Spark Streaming
§  Stream processing emulated on a batch system
§  High throughput - higher latency
§  Functional API (DStreams)
§  Exactly-once processing guarantees
Issues
§  Restricted streaming semantics
§  Windowing
§  High latency
Apache Samza
§  True streaming built on top of Apache Kafka
§  Slightly different stream notion, low level API
§  At-least-once processing guarantees with state
Issues
§  High disk IO
§  Low level API
Apache Flink
§  True streaming with adjustable latency and throughput
§  Rich functional API exploiting streaming runtime
§  Flexible windowing semantics
§  Exactly-once processing guarantees with (small) state
Issues
§  Limited state size
§  HA issue
Apache Flink
What is Flink
A "use-case complete" framework to
unify batch and stream processing
[Figure labels: Event logs, Historic data, ETL, Relational, Graph analysis, Machine learning, Streaming analysis, Flink]
  
What is Flink
An engine that puts equal emphasis on streaming and batch
[Figure labels: Real-time data streams and event logs (Kafka, RabbitMQ, ...); Historic data (HDFS, JDBC, ...); Flink; ETL, Graphs, Machine Learning, Relational, …; Low latency windowing, aggregations, ...]
Flink stack
(*current Flink master + few PRs)
•  Libraries and front ends: Python, Gelly, Table, FlinkML, SAMOA, Dataflow, Hadoop M/R
•  APIs: DataSet (Java/Scala), DataStream (Java/Scala)
•  Optimizers: Batch Optimizer, Streaming Optimizer
•  Flink Runtime
•  Deployment: Local, Remote, Yarn, Tez, Embedded
Flink Streaming
Overview of the API
§  Data stream sources
•  File system
•  Message queue connectors
•  Arbitrary source functionality
§  Stream transformations
•  Basic transformations: Map, Reduce, Filter, Aggregations…
•  Binary stream transformations: CoMap, CoReduce…
•  Windowing semantics: Policy-based flexible windowing (Time, Count, Delta…)
•  Temporal binary stream operators: Joins, Crosses…
•  Native support for iterations
§  Data stream outputs
§  For the details please refer to the programming guide:
•  http://flink.apache.org/docs/latest/streaming_guide.html
[Figure: example dataflow with the operators Src, Src, Merge, Map, Filter, Reduce, Sum, Sink]
Use-case: Financial analytics
§  Reading from multiple inputs
•  Merge stock data from various sources
§  Window aggregations
•  Compute simple statistics over windows of data
§  Data-driven windows
•  Define arbitrary windowing semantics
§  Combine with sentiment analysis
•  Enrich your analytics with social media feeds (Twitter)
§  Streaming joins
•  Join multiple data streams
§  Detailed explanation and source code on our blog
•  http://flink.apache.org/news/2015/02/09/streaming-example.html
Reading from multiple inputs
case class StockPrice(symbol: String, price: Double)
val env = StreamExecutionEnvironment.getExecutionEnvironment

// Read "SYMBOL,price" lines from a socket and parse them
val socketStockStream = env.socketTextStream("localhost", 9999)
  .map(x => { val split = x.split(",")
              StockPrice(split(0), split(1).toDouble) })

// Generated price streams for two indices
val SPX_Stream = env.addSource(generateStock("SPX")(10) _)
val FTSE_Stream = env.addSource(generateStock("FTSE")(20) _)

// Merge all sources into a single stream
val stockStream = socketStockStream.merge(SPX_Stream, FTSE_Stream)
[Figure: numbered callouts (1)-(4) linking the code to sample data: raw socket lines "HDP, 23.8" and "HDP, 26.6"; StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7); and the stream StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6)]
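As a rough, Flink-free illustration of the parse-and-merge step, the sketch below models each input as an in-memory `Seq`. The `parse` and `merge` helpers are hypothetical and only mimic the idea; unlike the slide code, `parse` returns an `Option` so that malformed lines are dropped instead of throwing. In a real stream the interleaving of merged inputs is not deterministic; here it is plain concatenation.

```scala
object MergeInputsSketch {
  final case class StockPrice(symbol: String, price: Double)

  // Parse a "SYMBOL, price" line; malformed lines yield None (an assumption, not slide behavior).
  def parse(line: String): Option[StockPrice] =
    line.split(",").map(_.trim) match {
      case Array(symbol, price) =>
        scala.util.Try(StockPrice(symbol, price.toDouble)).toOption
      case _ => None
    }

  // Union of several same-typed inputs, modelled here as simple concatenation.
  def merge[A](streams: Seq[A]*): Seq[A] = streams.flatten

  def main(args: Array[String]): Unit = {
    val socket = Seq("HDP, 23.8", "HDP, 26.6").flatMap(parse)
    val spx    = Seq(StockPrice("SPX", 2113.9))
    val ftse   = Seq(StockPrice("FTSE", 6931.7))
    merge(socket, spx, ftse).foreach(println)
  }
}
```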
  
Window aggregations
// 10-second windows, evaluated every 5 seconds
val windowedStream = stockStream
  .window(Time.of(10, SECONDS)).every(Time.of(5, SECONDS))

val lowest = windowedStream.minBy("price")
val maxByStock = windowedStream.groupBy("symbol").maxBy("price")
val rollingMean = windowedStream.groupBy("symbol").mapWindow(mean _)
[Figure: numbered callouts (1)-(4) linking the code to sample data. Window contents: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6). Lowest: StockPrice(HDP, 23.8). Max by stock: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 26.6). Rolling mean: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 25.2)]
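To make the sliding-window semantics concrete, here is a minimal sketch in plain Scala over an in-memory list. The `Timed` wrapper, the explicit timestamps, and the helper names are assumptions made for illustration; the slide API handles time implicitly and incrementally, whereas this sketch recomputes each window from scratch.

```scala
object SlidingWindowSketch {
  final case class StockPrice(symbol: String, price: Double)
  // Hypothetical timestamped wrapper (milliseconds); not part of the slide API.
  final case class Timed(tsMillis: Long, value: StockPrice)

  val sample: Seq[Timed] = Seq(
    Timed(1000, StockPrice("SPX", 2113.9)),
    Timed(2000, StockPrice("FTSE", 6931.7)),
    Timed(3000, StockPrice("HDP", 23.8)),
    Timed(7000, StockPrice("HDP", 26.6)))

  // Elements whose timestamp falls into [endMillis - sizeMillis, endMillis).
  def window(events: Seq[Timed], endMillis: Long, sizeMillis: Long): Seq[StockPrice] =
    events.collect { case Timed(ts, v) if ts >= endMillis - sizeMillis && ts < endMillis => v }

  def lowest(w: Seq[StockPrice]): Option[StockPrice] =
    if (w.isEmpty) None else Some(w.minBy(_.price))

  def maxByStock(w: Seq[StockPrice]): Map[String, StockPrice] =
    w.groupBy(_.symbol).map { case (s, ps) => s -> ps.maxBy(_.price) }

  def meanByStock(w: Seq[StockPrice]): Map[String, Double] =
    w.groupBy(_.symbol).map { case (s, ps) => s -> ps.map(_.price).sum / ps.size }

  def main(args: Array[String]): Unit =
    // 10-second windows, evaluated every 5 seconds.
    for (end <- 5000L to 15000L by 5000L) {
      val w = window(sample, end, 10000L)
      println(s"window ending at $end ms: lowest=${lowest(w)}, mean=${meanByStock(w)}")
    }
}
```

Note that consecutive windows overlap: an element with timestamp 7000 ms is part of both the window ending at 10 s and the one ending at 15 s.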
  
Data-driven windows
case class Count(symbol: String, count: Int)

// Per symbol, close a window whenever the price changes by more than 5%
val priceWarnings = stockStream.groupBy("symbol")
  .window(Delta.of(0.05, priceChange, defaultPrice))
  .mapWindow(sendWarning _)

// Count the warnings per symbol over 30-second windows
val warningsPerStock = priceWarnings.map(Count(_, 1)).groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
[Figure: numbered callouts (1)-(4) linking the code to sample data: StockPrice(SPX, 2113.9), StockPrice(FTSE, 6931.7), StockPrice(HDP, 23.8), StockPrice(HDP, 26.6); the HDP prices StockPrice(HDP, 23.8) and StockPrice(HDP, 26.6) yield Count(HDP, 1)]
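The delta policy can be sketched without Flink as follows. The definition of `priceChange` (relative change between two prices) and the use of the first observed price as the initial reference are assumptions for illustration; the slide passes a `defaultPrice` instead and does not show either helper.

```scala
object DeltaWindowSketch {
  final case class StockPrice(symbol: String, price: Double)

  // Assumed delta function: relative price change between two data points.
  def priceChange(last: StockPrice, current: StockPrice): Double =
    math.abs(current.price / last.price - 1.0)

  // For the prices of ONE symbol: emit a warning each time the price has moved more
  // than `threshold` relative to the last trigger point, then reset the reference.
  def warnings(prices: Seq[StockPrice], threshold: Double): Seq[String] =
    if (prices.isEmpty) Seq.empty
    else {
      val (_, out) = prices.tail.foldLeft((prices.head, Vector.empty[String])) {
        case ((reference, acc), p) =>
          if (priceChange(reference, p) > threshold) (p, acc :+ p.symbol)
          else (reference, acc)
      }
      out
    }

  // Group by symbol, apply the delta policy per group, keep symbols with warnings.
  def warningsPerStock(prices: Seq[StockPrice], threshold: Double): Map[String, Int] =
    prices.groupBy(_.symbol)
      .map { case (s, ps) => s -> warnings(ps, threshold).size }
      .filter(_._2 > 0)

  def main(args: Array[String]): Unit = {
    val prices = Seq(StockPrice("SPX", 2113.9), StockPrice("FTSE", 6931.7),
      StockPrice("HDP", 23.8), StockPrice("HDP", 26.6))
    println(warningsPerStock(prices, 0.05))
  }
}
```

The point of a data-driven window is that the window boundary depends on the values themselves rather than on time or element count: here the move from 23.8 to 26.6 is roughly 12%, which exceeds the 5% threshold.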
  
Combining with a Twitter stream
val tweetStream = env.addSource(generateTweets _)

// Split tweets into words and keep only known stock symbols
val mentionedSymbols = tweetStream.flatMap(tweet => tweet.split(" "))
  .map(_.toUpperCase())
  .filter(symbols.contains(_))

// Count the mentions per symbol over 30-second windows
val tweetsPerStock = mentionedSymbols.map(Count(_, 1)).groupBy("symbol")
  .window(Time.of(30, SECONDS))
  .sum("count")
[Figure: numbered callouts (1)-(4) linking the code to sample data: the tweets "hdp is on the rise!" and "I wish I bought more YHOO and HDP stocks" yield Count(HDP, 2) and Count(YHOO, 1)]
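The same split, normalize, filter, and count pipeline can be tried on an in-memory batch of tweets. This is only a sketch in plain Scala: the set of known symbols is an assumption, and a single call stands in for one 30-second window.

```scala
object TweetCountSketch {
  final case class Count(symbol: String, count: Int)

  // Split tweets into words, upper-case them, and keep only known stock symbols.
  def mentionedSymbols(tweets: Seq[String], symbols: Set[String]): Seq[String] =
    tweets.flatMap(_.split(" ")).map(_.toUpperCase).filter(symbols.contains)

  // Count mentions per symbol for one batch (standing in for one time window).
  def tweetsPerStock(tweets: Seq[String], symbols: Set[String]): Set[Count] =
    mentionedSymbols(tweets, symbols)
      .groupBy(identity)
      .map { case (s, mentions) => Count(s, mentions.size) }
      .toSet

  def main(args: Array[String]): Unit = {
    val tweets = Seq("hdp is on the rise!", "I wish I bought more YHOO and HDP stocks")
    tweetsPerStock(tweets, Set("HDP", "YHOO", "SPX", "FTSE")).foreach(println)
  }
}
```

Splitting on a single space is as naive as in the slide code: a mention followed directly by punctuation (for example "HDP!") would not match.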
  
Streaming joins
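// Join warning counts and tweet counts per symbol over 30-second windows
val tweetsAndWarning = warningsPerStock.join(tweetsPerStock)
  .onWindow(30, SECONDS)
  .where("symbol")
  .equalTo("symbol"){ (c1, c2) => (c1.count, c2.count) }

// Correlate the two counts over 30-second windows
val rollingCorrelation = tweetsAndWarning
  .window(Time.of(30, SECONDS))
  .mapWindow(computeCorrelation _)
[Figure: numbered callouts (1)-(2) linking the code to sample data: Count(HDP, 2), Count(YHOO, 1) joined with Count(HDP, 1) gives (1,2); correlation value shown: 0.5]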
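A windowed join followed by a correlation can be sketched in plain Scala as below. The slide does not show `computeCorrelation`, so the Pearson coefficient used here is an assumption and may differ from the measure used in the talk; both helper names are hypothetical.

```scala
object JoinCorrelationSketch {
  final case class Count(symbol: String, count: Int)

  // Inner join of two window contents on symbol: (warningCount, tweetCount).
  def joinOnSymbol(warnings: Seq[Count], tweets: Seq[Count]): Seq[(Int, Int)] =
    for { w <- warnings; t <- tweets if w.symbol == t.symbol } yield (w.count, t.count)

  // Pearson correlation of the joined pairs (an assumed stand-in for computeCorrelation).
  // Returns None for fewer than two pairs or when either side has zero variance.
  def pearson(pairs: Seq[(Int, Int)]): Option[Double] = {
    val n = pairs.size
    if (n < 2) None
    else {
      val xs = pairs.map(_._1.toDouble)
      val ys = pairs.map(_._2.toDouble)
      val mx = xs.sum / n
      val my = ys.sum / n
      val cov = xs.zip(ys).map { case (x, y) => (x - mx) * (y - my) }.sum
      val vx = xs.map(x => (x - mx) * (x - mx)).sum
      val vy = ys.map(y => (y - my) * (y - my)).sum
      if (vx == 0.0 || vy == 0.0) None else Some(cov / math.sqrt(vx * vy))
    }
  }

  def main(args: Array[String]): Unit = {
    val joined = joinOnSymbol(Seq(Count("HDP", 1)), Seq(Count("HDP", 2), Count("YHOO", 1)))
    println(joined)
    println(pearson(Seq((1, 2), (2, 4), (3, 6))))
  }
}
```

Only symbols present in both windows survive the join, which is why the YHOO tweet count has no partner in the sample data.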
  
Fault tolerance
§  Exactly once semantics
•  Asynchronous barrier snapshotting
•  Checkpoint barriers streamed from the sources
•  Operator state checkpointing + source backup
•  Pluggable backend for state management
[Figure: checkpointing diagram with checkpoints numbered 1, 2, 3. Legend: JM = Job manager, SM = State manager, Operator, Snapshot barrier, Event channel, Data channel, Checkpoint]
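The idea behind barrier-based checkpointing can be sketched for a single operator with a single input, in plain Scala. This is a deliberately simplified model and not Flink's implementation: real barrier snapshotting aligns barriers across multiple input channels and writes state asynchronously to a backend. All names below are hypothetical.

```scala
object BarrierSnapshotSketch {
  sealed trait StreamElement
  final case class Record(value: Int) extends StreamElement
  // A barrier injected by the source; it separates checkpoint n from checkpoint n + 1.
  final case class Barrier(checkpointId: Long) extends StreamElement

  // Hypothetical stateful operator: a running sum plus the snapshots taken so far.
  final case class Operator(sum: Long = 0L, snapshots: Map[Long, Long] = Map.empty) {
    def process(e: StreamElement): Operator = e match {
      case Record(v)   => copy(sum = sum + v)
      // On a barrier, persist the current state under the checkpoint id.
      case Barrier(id) => copy(snapshots = snapshots + (id -> sum))
    }
  }

  def run(start: Operator, in: Seq[StreamElement]): Operator =
    in.foldLeft(start)((op, e) => op.process(e))

  // Recovery: restore the snapshotted state, then replay what the source emitted
  // after that barrier (the "source backup" part).
  def recover(in: Seq[StreamElement], checkpointId: Long, snapshot: Long): Operator = {
    val afterBarrier = in.dropWhile(_ != Barrier(checkpointId)).drop(1)
    run(Operator(sum = snapshot), afterBarrier)
  }

  def main(args: Array[String]): Unit = {
    val input = Seq(Record(1), Record(2), Barrier(1), Record(3), Barrier(2), Record(4))
    val done = run(Operator(), input)
    println(s"failure-free: sum=${done.sum}, snapshots=${done.snapshots}")
    println(s"recovered from checkpoint 1: sum=${recover(input, 1L, done.snapshots(1L)).sum}")
  }
}
```

The recovered run ends with the same state as the failure-free run, which is the effect "exactly once" refers to: every record is reflected in the operator state exactly one time.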
  
Performance
§  Performance optimizations
•  Efficient serialization due to strongly typed topologies
•  Operator chaining (thread sharing/no serialization)
•  Various automatic query optimizations
§  Competitive performance
•  ~ 1.5m events / sec / core
•  For comparison, Storm promises ~ 1m tuples / sec / node
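Operator chaining can be pictured as function composition. In the hypothetical sketch below (plain Scala, not Flink internals), the unchained variant pushes every intermediate value through a serialization round trip, as it would when crossing a thread or network boundary, while the chained variant fuses both operators into one function call.

```scala
import java.nio.ByteBuffer

object OperatorChainingSketch {
  // Two logical operators of a pipeline.
  val parse: String => Int = _.trim.toInt
  val times2: Int => Int = _ * 2

  // Simulated hand-over between separately running operators.
  def serialize(v: Int): Array[Byte] = ByteBuffer.allocate(4).putInt(v).array()
  def deserialize(bytes: Array[Byte]): Int = ByteBuffer.wrap(bytes).getInt()

  // Unchained: the intermediate result is serialized and deserialized between operators.
  val unchained: String => Int = s => times2(deserialize(serialize(parse(s))))

  // Chained: both operators run in the same call, with no serialization in between.
  val chained: String => Int = parse.andThen(times2)

  def main(args: Array[String]): Unit =
    println(chained(" 21 ") == unchained(" 21 "))
}
```

Both variants compute the same result; chaining only removes the per-record hand-over cost.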
Roadmap
§  Persistent, high-throughput state backend
§  Job manager high availability
§  Application libraries
•  General statistics over streams
•  Pattern matching
•  Machine learning pipelines library
•  Streaming graph processing library
§  Integration with other frameworks
•  Zeppelin (Notebook)
•  SAMOA (Online ML)
Summary
§  Flink is a use-case complete framework to unify batch
and stream processing
§  True streaming runtime with high-level APIs
§  Flexible, data-driven windowing semantics
§  Competitive performance
§  We are just getting started!
Flink Community
[Figure: unique git contributors over time; x-axis from Jul-09 to May-16, y-axis from 0 to 120]
flink.apache.org
@ApacheFlink

Real-time Stream Processing with Apache Flink @ Hadoop Summit
