Kostas Tzoumas
@kostas_tzoumas
Hadoop Summit San Jose
June 6, 2016
Streaming in the Wild with Apache Flink™
2
Streaming technology is enabling the
obvious: continuous processing on data that
is continuously produced
Hint: you are already doing streaming
Why embrace streaming?
 Monitor your business and react in real time
 Implement robust continuous applications
 Adopt a decentralized architecture
 Consolidate analytics infrastructure
3
React in real time
4
Streaming versus real-time
 Streaming != Real-time
 E.g., streaming that is not real time:
continuous applications with large
windows
 E.g., real-time that is not streaming: very
fast data warehousing queries
 However: streaming applications can be
fast
5
[Diagram: two sets labelled "Streaming" and "Real time"]
How real-time is Flink?
6
[Charts: Yahoo! benchmark* and data Artisans benchmarks**]
* https://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at
** http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ and http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
When and why does this matter?
 Immediate reaction to life
• E.g., generate alerts on
anomaly/pattern/special event
 Avoid unnecessary tradeoffs
• Even if application is not latency-critical
• With Flink you do not pay a price for latency!
7
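As an illustration of alerting on a special event as soon as it happens, here is a minimal sketch in plain Python. It is not Flink code: the class `RateAlert`, its parameters, and the sample timestamps are all hypothetical, and it assumes events arrive in timestamp order.

```python
from collections import deque


class RateAlert:
    """Signals when more than `limit` events fall within `span` seconds."""

    def __init__(self, limit, span):
        self.limit = limit
        self.span = span
        self.recent = deque()  # timestamps still inside the sliding window

    def on_event(self, timestamp):
        self.recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while self.recent and self.recent[0] <= timestamp - self.span:
            self.recent.popleft()
        # The decision is made per event, not at the end of a batch.
        return len(self.recent) > self.limit


alert = RateAlert(limit=3, span=10)
fired = [ts for ts in [1, 2, 3, 20, 21, 22, 23, 40] if alert.on_event(ts)]
print(fired)
```

The alert fires on the event that crosses the threshold, so the reaction time is bounded by processing latency rather than by a batch schedule.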
Bouygues Telecom – LUX
8
One of the largest telcos in France. System (among others) used for real-time diagnostics and alarming.
Read more: http://data-artisans.com/flink-at-bouygues-html/
Robust continuous
applications
9
Continuous application
 A production data application that needs to
be live 24/7 feeding other systems (perhaps
customer-facing)
 Needs to be efficient, consistent, correct, and
manageable
 Stream processing is a great way to
implement continuous applications robustly
10
Continuous apps with “batch”
11
[Diagram: file 1, file 2, file 3 arrive over time; a Scheduler runs Job 1, Job 2, Job 3; results go to Serve & store]
Continuous apps with “lambda”
12
[Diagram: file 1, file 2; a Scheduler runs Job 1, Job 2; a Streaming job runs alongside; results go to Serve & store]
Problems with batch and λ
 Way too many moving parts (and code duplication)
 Implicit treatment of time
 Out of order event handling
 Implicit batch boundaries
13
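The time-related problems above can be made concrete with a small sketch in plain Python (not Flink API; the event data and function names are hypothetical). It counts the same events twice: once by the file they arrived in, and once by the window their own timestamp falls into. The late event ends up in the wrong count under the first scheme.

```python
from collections import Counter

HOUR = 3600  # window size in seconds

# (event_time_seconds, arrival_batch) pairs. The third event belongs to
# hour 0 by its timestamp but arrived late, in the file for hour 1.
events = [
    (100, "file 1"),
    (3500, "file 1"),
    (3590, "file 2"),  # out of order: event time is still in hour 0
    (3700, "file 2"),
]


def count_by_batch(events):
    """Implicit time: the batch boundary decides which count an event joins."""
    return Counter(batch for _, batch in events)


def count_by_event_time(events, window=HOUR):
    """Explicit time: the event's own timestamp decides the window."""
    return Counter(ts // window for ts, _ in events)


by_batch = count_by_batch(events)
by_time = count_by_event_time(events)
print(dict(by_batch))
print(dict(by_time))
```

Because the window is a parameter rather than a property of how files are cut, changing the granularity is a one-argument change, e.g. `count_by_event_time(events, window=300)`.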
Continuous apps with streaming
14
[Diagram: a single Streaming job feeding Serve & store]
Extending the Yahoo! benchmark
 Work of Jamie Grier, inspired by a real continuous
application at Twitter
15
http://data-artisans.com/extending-the-yahoo-streaming-benchmark/
What is the use case?
 Counting!
• Tweet impressions or ad views
 Most analytics is continuous counting and
aggregations grouped by dimensions
• E.g., anomaly detection
16
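A minimal sketch of what "continuous counting grouped by dimensions" can look like, in plain Python rather than Flink. The class `ContinuousCounter`, the 60-second window, and the ad keys are assumptions made for illustration only.

```python
from collections import defaultdict

WINDOW = 60  # seconds per window; an illustrative choice


class ContinuousCounter:
    """Keeps running counts per (key, window) and updates them per event."""

    def __init__(self, window=WINDOW):
        self.window = window
        self.counts = defaultdict(int)

    def _window_start(self, timestamp):
        return (timestamp // self.window) * self.window

    def on_event(self, key, timestamp):
        # Assign the event to a window by its own timestamp, then increment.
        self.counts[(key, self._window_start(timestamp))] += 1

    def query(self, key, timestamp):
        # Counts can be read at any time, not only after a batch finishes.
        return self.counts.get((key, self._window_start(timestamp)), 0)


counter = ContinuousCounter()
for key, ts in [("ad-1", 5), ("ad-2", 10), ("ad-1", 59), ("ad-1", 61)]:
    counter.on_event(key, ts)

print(counter.query("ad-1", 0), counter.query("ad-1", 60))
```

An anomaly detector in this style would compare each window's count against an expected range for the same key.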
Requirements
 Performance: millions of events/sec, millions of
keys
 Correctness: counts correlated with timestamps
 Consistency: counts should be correct under
failures
 Manageability: ability to pause & restart,
reprocess, change code, etc.
17
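The consistency requirement can be illustrated with a single-process analogy in plain Python. This is not how Flink implements fault tolerance; the function `run` and its parameters are hypothetical. The idea shown is that a checkpoint stores the counts together with the input offset they reflect, so that after a crash the job rewinds to that offset and no event is counted twice or lost.

```python
import copy
from collections import Counter


def run(events, checkpoint_every=2, fail_at=None):
    """Count events by key; optionally crash once at position `fail_at`."""
    state = Counter()
    checkpoint = (0, Counter())  # (input offset, snapshot of the counts)
    offset = 0
    failed = False
    while offset < len(events):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Simulated crash: discard in-memory state, restore the snapshot,
            # and rewind the input to the checkpointed offset.
            offset, snapshot = checkpoint
            state = copy.deepcopy(snapshot)
            continue
        state[events[offset]] += 1
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(state))
    return state


events = ["a", "b", "a", "c", "a"]
print(run(events))
print(run(events, fail_at=3))
```

Both runs produce the same counts, which is the property "counts should be correct under failures" asks for. The sketch relies on the input being replayable from an offset.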
Before Flink
 Performance: 1000s of cores needed to sustain
workload
 Correctness: time handled in application code (or
not)
 Consistency: approximate results during the day,
exact results once a day (lambda)
 Manageability: acceptable
18
After Flink
 Performance: 10s of cores needed to sustain
workload
 Correctness: time handled by framework
 Consistency: correct results on demand
 Manageability: acceptable
19
Results (yet to be beaten!)
 Same program as Yahoo! benchmark
 30x over Storm, plus consistent results
20
Manageability
 Flink savepoints (Flink 1.0): consistent
snapshots of stateful applications
• Planned downtime for code upgrades,
maintenance, migration, debugging, etc.
 Monitoring (Flink 1.1)
 Dynamic scaling (Flink 1.2+)
21
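To show the idea behind a savepoint-based code upgrade, here is a toy sketch in plain Python. It does not use Flink's savepoint mechanism or file format; `CountingJob`, its methods, and the JSON layout are hypothetical. The job's state and input position are saved together, the job is stopped, and an upgraded version resumes from the saved point.

```python
import json


class CountingJob:
    """Toy stateful job: counts events per key from a replayable input."""

    def __init__(self, keep=lambda key: True):
        self.keep = keep   # the "code" that may change between versions
        self.offset = 0    # how far into the input the state reflects
        self.counts = {}

    def process(self, events, limit=None):
        end = len(events) if limit is None else min(limit, len(events))
        while self.offset < end:
            key = events[self.offset]
            if self.keep(key):
                self.counts[key] = self.counts.get(key, 0) + 1
            self.offset += 1

    def savepoint(self):
        # Serialize state and input position together so they stay consistent.
        return json.dumps({"offset": self.offset, "counts": self.counts})

    @classmethod
    def from_savepoint(cls, data, keep=lambda key: True):
        saved = json.loads(data)
        job = cls(keep=keep)
        job.offset = saved["offset"]
        job.counts = saved["counts"]
        return job


events = ["view", "click", "view", "bot", "view"]

v1 = CountingJob()
v1.process(events, limit=3)   # run for a while
snapshot = v1.savepoint()     # planned downtime starts here

# Upgraded code: same state, but the new version ignores "bot" events.
v2 = CountingJob.from_savepoint(snapshot, keep=lambda key: key != "bot")
v2.process(events)            # resume where v1 stopped
print(v2.counts)
```

The upgraded job keeps the counts accumulated by the old version and applies the new logic only to events after the saved offset.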
Decentralized architecture
22
Streaming and microservices
23
[Diagram: App, App, App; local state, local state; Archive]
A decentralized architecture favors
a streaming-based data
infrastructure with local application
state
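A minimal sketch of applications keeping local state derived from a shared stream, in plain Python. `EventLog`, `App`, and the stock/order events are hypothetical stand-ins, not part of Flink or of any architecture described in the talk. Each app folds the same events into its own state, and a restarted instance rebuilds that state by replaying the archive.

```python
class EventLog:
    """Append-only log that doubles as the archive."""

    def __init__(self):
        self.events = []

    def append(self, event):
        self.events.append(event)

    def read_from(self, offset):
        return self.events[offset:]


class App:
    """A service that derives its own local state from the shared stream."""

    def __init__(self, log, apply):
        self.log = log
        self.apply = apply  # how this app folds an event into its state
        self.state = {}
        self.offset = 0

    def catch_up(self):
        for event in self.log.read_from(self.offset):
            self.apply(self.state, event)
            self.offset += 1


def track_stock(state, event):
    item, qty = event
    state[item] = state.get(item, 0) + qty


def track_order_count(state, event):
    item, qty = event
    if qty < 0:  # a negative quantity marks an order in this toy model
        state[item] = state.get(item, 0) + 1


log = EventLog()
for event in [("shoe", 10), ("shoe", -2), ("hat", 5), ("shoe", -1)]:
    log.append(event)

stock = App(log, track_stock)
orders = App(log, track_order_count)
stock.catch_up()
orders.catch_up()

# A new or restarted instance rebuilds the same state by replaying the archive.
rebuilt = App(log, track_stock)
rebuilt.catch_up()
print(stock.state, orders.state)
```

No app reads another app's state; the stream is the only shared component.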
Zalando
24
Slides at http://www.slideshare.net/ZalandoTech/flink-in-zalandos-world-of-microservices-62376341
Zalando
25
Transitioning from monolithic
architecture to microservices
New BI stack
26
Flink @ Zalando (present & future)
 Business process monitoring
• Check if Zalando platform works
• Order & delivery velocities
• SLAs of related events
 Continuous ETL
• Transformation, combination, pre-aggregation
• Data cleansing and validation
 Complex Event Processing
 Sales monitoring
27
Consolidate analytics
28
Stream Processing as a Service
 How do we make stream processing more
accessible to the data analyst?
 More familiar interfaces
• Flink 1.1 includes the first version of SQL for
static data sets and data streams
 Easier deployment
29
King.com
30
King.com - RBEA
 RBEA – a platform
designed to make
stream processing
available inside
King.com
 Data scientists submit
scripts in Groovy
 Flink backend executes
these scripts
31
https://techblog.king.com/rbea-scalable-real-time-analytics-king/
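The slides do not show RBEA's actual interface, so the following is only a sketch of the general pattern in plain Python: users submit small scripts, and a shared backend runs every script against every event. `ScriptBackend`, the `(event, emit)` signature, and the sample events are all hypothetical; Python callables stand in for the Groovy scripts.

```python
class ScriptBackend:
    """Runs every registered script against every incoming event."""

    def __init__(self):
        self.scripts = {}
        self.outputs = {}

    def submit(self, name, script):
        # `script` is any callable taking (event, emit); it stands in for a
        # user-submitted script in this sketch.
        self.scripts[name] = script
        self.outputs[name] = []

    def on_event(self, event):
        for name, script in self.scripts.items():
            script(event, self.outputs[name].append)


def big_spenders(event, emit):
    if event["type"] == "purchase" and event["amount"] >= 10:
        emit(event["user"])


def level_ups(event, emit):
    if event["type"] == "level_up":
        emit((event["user"], event["level"]))


backend = ScriptBackend()
backend.submit("big_spenders", big_spenders)
backend.submit("level_ups", level_ups)

for event in [
    {"type": "purchase", "user": "u1", "amount": 12},
    {"type": "level_up", "user": "u2", "level": 3},
    {"type": "purchase", "user": "u3", "amount": 2},
]:
    backend.on_event(event)

print(backend.outputs)
```

The script authors only write the per-event logic; deployment, state, and scaling stay with the backend.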
Netflix
 Netflix plans to offer
Stream Processing as a
Service internally in the
company
 Currently testing Flink
and Apache Beam
32
http://www.slideshare.net/mdaxini/netflix-keystone-streaming-data-pipeline-scale-in-the-clouddbtb2016-62076009
Closing
33
Disclaimer
 A lot of this presentation is based on the work of
very talented engineers building data products
with Flink
 Special thanks to:
• Amine Abdessemed (Bouygues Telecom)
• Mihail Vieru, Javier Lopez (Zalando)
• Gyula Fora, Mattias Andersson (King.com)
• Monal Daxini (Netflix)
34
More Flink tales at Hadoop Summit
35
Xiaowei Jiang
Blink: Improved Runtime for Flink and its
Application in Alibaba Search
Wednesday, June 29, 2016, 2:10PM - 2:50PM
210C
Stephan Ewen
Turning the Stream Processor into a Database:
Building Online Applications on Streams
Thursday, June 30, 2016, 12:20PM - 1:00PM
212
Flink Forward 2016, Berlin
Submission deadline: June 30, 2016 (watch website)
Early bird deadline: July 15, 2016
www.flink-forward.org
We are hiring!
data-artisans.com/careers


Editor's Notes

  • #14: 3 systems (batch), or 5 systems (streaming). Need to add a new system for millisecond alerts. What if I want to count every 5 minutes, not 1 hour? Just ignores out-of-order events. What if I want to do sessions?