DON'T CROSS THE STREAMS!
STREAMING AND APACHE FLINK
Senior Data Consultant
Dublin
JOHN GORMAN
amberhand
WHAT WE WILL COVER
It's all about pain!
Streaming and Related Terminology
Stream Processing Engines
Apache Flink
It started with a pain...a software pain
Things were big, slow & shaky....and getting worse!
The calm before the storm
Batch Processing (High Latency, inability to reason about
time)
Coupled systems prevented fast delivery of single change
requirements
Processing large distributed data
Messaging incorporated business logic (Service Bus)
Customers demanded immediate insight/action
Event Ordering/Timing, Consistency, Data Lineage
Lack of Fault Tolerant Systems
Someone noticed the need to change some time back...
Oh! The other Michael Hammer...
Ref: Michael Hammer - Harvard Business Review 1990
“We cannot achieve breakthroughs in
performance by cutting fat or automating
existing processes. Rather, we must
challenge old assumptions and shed the
old rules that made the business
underperform in the first place.”
Ref: Michael Hammer - Harvard Business Review 1990
“These rules of work design are based on
assumptions about technology, people,
and organisational goals that no longer
hold”
So...Software Legends set out to fix it...
THE PERFECT STORM
Elements of the "Perfect Storm"
Elements of the "Perfect Storm" contd.
Can something save us?
Streams!
WHAT ARE STREAMS?
Unbounded Events flowing from a Producer to a Consumer
Any event that happens internal or external to your
company is fair game for inclusion in a stream!
Streaming obliterates old working habits rather than
automating them
When did you last drop a DVD back to your video store ?
Convenience of streaming films won out
Anyone using Dublin Bus still carry a timetable?
Realtime with Context is needed...
SOME OTHER COMMON STREAM EXAMPLES
Log files
User website clicks
Finance stocks
Social media streams
Ideal Stream Characteristics
Low Latency (Time required to produce some result)
High Throughput (Number of results produced in time)
Persisted for reuse
Fault Tolerant
Scalable Event Production (i.e. Partitioning)
Scalable Event Consumption (i.e. Consumer Groups)
Consumer manages state (offsets)
Handle Back Pressure
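To make partitioning, consumer-managed offsets and replay concrete, here is a minimal in-memory sketch. It is not Kafka or any real client API: `PartitionedLog`, `Consumer` and their methods are hypothetical names invented for illustration, and a real system would persist and replicate the log.

```python
from collections import defaultdict
from zlib import crc32


class PartitionedLog:
    """Toy append-only log split into partitions (hypothetical, in-memory)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # The same key always maps to the same partition, so per-key order is kept.
        p = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p

    def read(self, partition, offset, max_records=10):
        # Reading never removes records; the log stays available for reuse.
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """The consumer, not the log, remembers how far it has read."""

    def __init__(self, log):
        self.log = log
        self.offsets = defaultdict(int)  # partition -> next offset to read

    def poll(self, partition, max_records=10):
        records = self.log.read(partition, self.offsets[partition], max_records)
        self.offsets[partition] += len(records)
        return records

    def seek(self, partition, offset):
        # Rewinding the offset replays history from that point.
        self.offsets[partition] = offset
```

Because the offset lives with the consumer, two consumers can read the same partition at different speeds, and either one can call `seek` to replay from an earlier position without affecting the other.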
Benefits of streams
Ability to augment and enrich data streams
Duality of Streams and Tables (Only Streams Work)
Replay from a defined offset
Stream outputs can become stream inputs (unix pipes!)
Data first - Processing Later (Fast feature creation)
Stream your monitoring (Logs, Ops Metrics, Business KPI
etc.)
Benefits of streams contd.
Location in Time Testing (Bugs In Code)
Replication for Scale
Cross/Join prior unrelated sources (i.e. Time, Context -
Analytics)
Point of Record Stream (produce suitable Materialized
Views)
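Two of the benefits above, chaining stream stages like unix pipes and deriving a table (materialized view) from a stream, can be sketched with plain Python generators. The function names, the event fields (`user`, `clicks`, `region`) and the lookup table are assumptions made up for this example.

```python
def enrich(events, region_by_user):
    # A stream stage: consume one stream, emit an enriched stream.
    for event in events:
        yield {**event, "region": region_by_user.get(event["user"], "unknown")}


def materialize(events):
    # Fold the stream into a table: running click total per region.
    table = {}
    for event in events:
        table[event["region"]] = table.get(event["region"], 0) + event["clicks"]
        # Each table snapshot is emitted as an event, so the view is itself a stream.
        yield dict(table)
```

Usage is a pipe: `materialize(enrich(clicks, regions))`. The last snapshot yielded is the current state of the table, and replaying the same input rebuilds the same table.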
MOST POPULAR STREAMING TOOLS
Apache Kafka
Amazon Kinesis - Based on Kafka Ideas
MapR Streams - Uses Kafka API (adds resilience features)
Can these Streams handle the load ?
Apache Kafka Data Handling at LinkedIn
LinkedIn Engineering Blog March 20, 2015
We have the stream! Now what?
Enter the Stream Processing Engine
What is a Stream Processing Engine ?
8 Requirements of a Real-Time Stream Processing Engine
(Michael Stonebraker)
1. Keep the data moving
2. Query using SQL on Stream
3. Handle Stream Imperfections (Delayed, Missing, Out-Of-Order Data)
4. Generate Predictable Outcomes
5. Integrate Stored and Streaming Data
6. Guarantee Data Safety and Availability
7. Partition and Scale Applications Automatically
8. Process and Respond Instantaneously
OK - Engines on... What can we do with it ?
Stream Processing Engine - Use Cases
Lineage, Auditing, History (Immutable)
Internet of Things (Sensor data)
Realtime Monitoring (Failure Prevention)
Autonomous Cars
Fraud/Anomaly Detection
Health devices (Fitbit, cardiac pacemakers, etc.)
For System of record (Infinite persistence)
Digital Marketing
Network monitoring
Realtime pricing / analytics
Stream Processing Engine - Use Cases Contd...
Intelligence and Surveillance
Risk management (Realtime Asset Coverage)
E-commerce (Realtime customer retention)
Fraud detection (Card, Insurance)
Smart order routing
Transaction cost analysis
Pricing and analytics
Market data management
Algorithmic trading
Data warehouse augmentation
Streaming does not mandate BigData
Streaming does not mandate RealTime processing
...but many application types may mandate either or both
Ok great - Let's dig into an engine...
APACHE FLINK
Apache Flink Components
Apache Flink Architecture
Source: DataArtisans (BerlinBuzzwords 2016)
Job Manager UI - (For Job Submission & Monitoring)
Job Manager UI - (Plan and Scheduling)
WAIT! Let's clear a few things up...
Pipelining & Backpressure
Time Semantics (Event, Ingestion, Processing etc.)
Windows (count, rolling, session, custom)
Watermarks, Triggers (Inserted into stream)
Checkpoints (Async Recovery - Choice of state store
backend)
"Exactly Once" semantics (no need to ask whether a failure
happened on send, process, or return)
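How event time, windows and watermarks fit together can be shown in a few lines of plain Python. This is a simplified sketch, not Flink's API or its exact firing rule: the function name and parameters are invented here, the watermark is taken as "largest event time seen minus an allowed out-of-orderness", and a window fires once the watermark reaches its end.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_size, max_out_of_orderness):
    """Count events per event-time window.

    events: iterable of (event_time, key) in arrival order, possibly out of order.
    Returns (results, late): results is a list of (window_start, count) in
    firing order, late holds events that arrived after their window had fired.
    """
    open_windows = defaultdict(int)  # window_start -> count so far
    results, late = [], []
    watermark = float("-inf")

    for event_time, key in events:
        window_start = (event_time // window_size) * window_size
        if window_start + window_size <= watermark:
            late.append((event_time, key))  # its window has already fired
        else:
            open_windows[window_start] += 1

        # Watermark: no event older than this is expected any more.
        watermark = max(watermark, event_time - max_out_of_orderness)

        # Fire every open window whose end is at or before the watermark.
        ready = sorted(w for w in open_windows if w + window_size <= watermark)
        for start in ready:
            results.append((start, open_windows.pop(start)))

    # End of input: flush the windows that are still open.
    for start in sorted(open_windows):
        results.append((start, open_windows.pop(start)))
    return results, late
```

With a window size of 10 and an out-of-orderness of 5, an event stamped 7 that arrives after one stamped 12 is still counted in window 0, while an event stamped 3 that arrives after one stamped 15 is reported as late.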
Apache Flink - Features out of the box!
Support for Event Time and Out-of-Order Events
Exactly-once Semantics for Stateful Computations
Highly flexible Streaming Windows & CEP
Continuous Streaming Model with Backpressure (Buffers)
Fault-tolerance via Lightweight Distributed Snapshots
One Runtime for Streaming and Batch Processing
Memory Management & Custom Serialization
Iterations and Delta Iterations
Program Optimizer
SQL (Batch and Streams) due soon in 1.1
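The idea behind exactly-once state with snapshots can be sketched for a single operator: save the source offset together with the operator state, and on failure restore both and replay. Flink's real mechanism is a distributed snapshot across many operators; the code below is only a single-process illustration, and `CountingOperator`, `run` and `fail_at` are hypothetical names.

```python
import copy


class CountingOperator:
    """Stateful operator: running count per key."""

    def __init__(self):
        self.counts = {}

    def process(self, key):
        self.counts[key] = self.counts.get(key, 0) + 1


def run(source, checkpoint_every, fail_at=None):
    """Process a replayable source; optionally crash once at index fail_at.

    A checkpoint stores the source offset together with the operator state,
    so after a restore every record affects the state exactly once.
    """
    op = CountingOperator()
    checkpoint = (0, {})  # (next offset to read, copy of the state)
    offset = 0
    failed = False
    while offset < len(source):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            # Simulated crash: in-memory state is lost, roll back to the snapshot.
            offset, saved = checkpoint
            op = CountingOperator()
            op.counts = copy.deepcopy(saved)
            continue
        op.process(source[offset])
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, copy.deepcopy(op.counts))
    return op.counts
```

Records between the last checkpoint and the crash are processed twice, but because the state is rolled back with the offset, the final counts match a run with no failure. That is the sense in which "exactly once" applies to state rather than to how often a record is touched.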
But I'm only here for the Machine Learning and Graph
Processing!!...
Machine Learning in Flink with FlinkML
* Apache Samoa Project - Streaming Machine Learning that works on top of Flink
** Apache Mahout - Batch based Machine Learning that works on top of Flink
Graph Processing in Flink?
"Gelly" is Apache Flink's Graph Analysis API
Iterative Graph processing abstractions on top of Flink
1. Vertex-Centric Iterations (like Pregel, Giraph)
2. Scatter-Gather Iterations
3. Gather-Sum-Apply (like PowerGraph)
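A vertex-centric iteration runs in supersteps: active vertices send messages to their neighbours, and each vertex updates its own value from the messages it receives. The sketch below applies that model to single-source shortest paths in plain Python. It is not Gelly code; the function name and the adjacency-dict input are assumptions, and edge weights are assumed non-negative.

```python
import math


def vertex_centric_sssp(edges, source):
    """Single-source shortest paths computed in supersteps.

    edges: dict mapping vertex -> list of (neighbour, weight).
    Returns a dict mapping every vertex to its distance from source.
    """
    vertices = set(edges) | {v for outs in edges.values() for v, _ in outs}
    distance = {v: math.inf for v in vertices}
    distance[source] = 0.0
    active = {source}

    while active:
        # Messaging phase: every active vertex messages its out-neighbours.
        inbox = {}
        for u in active:
            for v, weight in edges.get(u, []):
                inbox.setdefault(v, []).append(distance[u] + weight)

        # Update phase: a vertex stays active only if its value improved.
        active = set()
        for v, messages in inbox.items():
            best = min(messages)
            if best < distance[v]:
                distance[v] = best
                active.add(v)
    return distance
```

The loop ends when a superstep improves no vertex, which is the usual halting condition for this style of iteration. Unreachable vertices keep a distance of infinity.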
GELLY SUPPORTS
1. Graph Properties (numberOfVertices etc...)
2. Transformations (map, difference, join...)
3. Mutations (Add/Remove vertices/edges...)
4. Batch and Streams - Java, Scala
* External "Gradoop" Project adds further features on top of Flink
Graph Processing with Gelly - Algorithms
PageRank
Single Source Shortest Path
Label Propagation
Weakly Connected Components
Community Detection
Planned Algorithms
Triangle Count
HITS
Affinity Propagation
Graph Summarization
Planned Algorithms - Attribution: Vasia Kalavri
Ecosystem Integration
Data Source/Sinks via Connectors (Kafka, JDBC, S3, etc.)
Storm and Cascading & MapReduce support
Machine Learning - Apache Samoa (Streaming ML),
Apache Mahout (Batch)
Graph - Gradoop
Python API, Scala REPL, Apache Zeppelin Support
DataFlow Model - Apache Beam (API Abstraction + Flink
"Runner")
Apache Beam - Data Flow Model Support in Flink
Supported Distributions / Deployment Options
HortonWorks - Ambari Service (Confirmed full support on
the way)
Cloudera - Not Supported to my knowledge (Discussion
forums ref BigTop)
MapR - Not part of their MapR converged data platform
Amazon EMR (Yarn - Single Instance, Session)
Google Compute Engine (Yarn Support & Hosted
Competitor -> Cloud Dataflow)
Via Apache Myriad on Mesos (Native support coming in
1.2)
Some DataStream API Code (Setup)
* Code courtesy of DataArtisans on github
Some DataStream Code (Destination Sink & Running)
Sometimes, crossing the streams is the solution you need...
Crossing the streams with DataStream API
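The code from this slide is not in the transcript. As a stand-in, here is a plain-Python sketch of what "crossing" two streams usually means: joining records that share a key and fall into the same tumbling event-time window. It is not the DataStream API and works on bounded lists rather than live streams; the function name and the record layout are assumptions for illustration.

```python
from collections import defaultdict


def tumbling_window_join(left, right, window_size):
    """Join two keyed streams on (window, key).

    left, right: iterables of (event_time, key, value).
    Returns a sorted list of (window_start, key, left_value, right_value).
    """
    def bucket(stream):
        # Group values by the window they fall into and by key.
        buckets = defaultdict(list)
        for event_time, key, value in stream:
            window_start = (event_time // window_size) * window_size
            buckets[(window_start, key)].append(value)
        return buckets

    left_buckets, right_buckets = bucket(left), bucket(right)
    joined = []
    for (window_start, key), left_values in left_buckets.items():
        for lv in left_values:
            # Pair with every right-hand value in the same window and key.
            for rv in right_buckets.get((window_start, key), []):
                joined.append((window_start, key, lv, rv))
    return sorted(joined)
```

Records with the same key in different windows are not paired, which is what bounds the state a streaming join has to keep.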
Crossing the streams with CEP Library
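The CEP code from this slide is also missing from the transcript. The kind of rule a CEP library expresses, "event A followed by event B for the same key within a time limit", can be sketched in plain Python as below. This is not the Flink CEP API; the function, its parameters and the event kinds used in the usage note are hypothetical.

```python
def detect_followed_by(events, first, second, within):
    """Find `first` followed by `second` for the same key within `within`
    time units.

    events: iterable of (event_time, key, kind) in time order.
    Returns a list of (key, first_time, second_time) matches.
    """
    pending = {}  # key -> time of the most recent unmatched `first` event
    matches = []
    for event_time, key, kind in events:
        if kind == first:
            pending[key] = event_time
        elif kind == second and key in pending:
            start = pending.pop(key)
            # Only report the pair if the second event came soon enough.
            if event_time - start <= within:
                matches.append((key, start, event_time))
    return matches
```

For example, calling it with `first="temp_warning"`, `second="power_spike"` and `within=10` reports a key only when a power spike follows a temperature warning for that key within 10 time units. A real CEP engine adds richer patterns, event-time handling and fault-tolerant state on top of this idea.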
Proposed Flink 1.1 SQL API
* Code courtesy of DataArtisans on github
Flink Furthering Yahoo Benchmarks
Apache Flink Adoption
What's Next For Flink?
Queryable State (Database inversion! Kafka log, RocksDB)
Release of 1.1+
Dynamic Scaling, Resource Elasticity (i.e. for catchup)
Production Hardening (1,000 node cluster Alibaba)
Stream SQL (Apache Calcite)
CEP Enhancements (large sized async state snapshotting)
Mesos Support
More Connectors
API enhancements (joins, slowly changing inputs)
Security (data encryption, Kerberos with Kafka)
Email: john.gorman@amberhand.ie
LinkedIn: johnpgorman
THANK YOU
ACKNOWLEDGEMENTS
Bank Of Ireland - Event and Venue
Hadoop User Group Ireland - Community Building
Data Artisans - Images, Code and Community Support
Anne Ebeling - Dublin Artwork
RESOURCES
APACHE FLINK TRAINING
APACHE FLINK TAXI STREAM EXAMPLE
BACK PRESSURE IN FLINK
CEP MONITORING SAMPLE
RUNNING FLINK ON YARN
STREAMING 101 BY TYLER AKIDAU
STREAMING 102 BY TYLER AKIDAU
MAPR FREE EBOOK ON STREAMING ARCHITECTURE
