© 2019 Ververica
The Apache Flink® Conference
Stream Processing | Event Driven | Real Time
San Francisco, April 1-2, 2019
A big thanks to our Sponsors
Platinum
Gold
Silver
Community
Flink Fest
Media
A big thanks to our Program Committee
Fabian Hueske, Dean Wampler, Sonali Sharma, Eric Sammer, Stefan Richter, Tyler Akidau, Jamie Grier
A big thanks to our Speakers
Flink Forward App
Rate speakers and sessions with
a chance to win a Flybrix drone!
Get involved!
Apache Flink User Survey
Win a trip to one of the next Flink Forward
conferences of your choice!
Flink Forward Survey
Help us improve the quality of Flink
Forward. We appreciate your feedback!
Community Contribution
Sign up as a content contributor for blog
posts or speaking opportunities.
Flink Forward Social Feed - View & Engage!
Get involved
Kostas Tzoumas
INTRODUCING VERVERICA
Founded in 2014 by the original creators of Apache Flink to
commercialize the open source project and support the
community
Why?
• Alibaba has been the largest user of Flink and second largest
contributor for years
• Deeply committed to open source and creating technological impact
• Joining forces made a lot of sense for the two teams in order to collaborate
even closer and accelerate their contributions to Flink
Flink at Alibaba (a few examples)
Taobao is the largest e-commerce platform globally, with more than 600 million
monthly active users. Every time a user logs into the Taobao app, they see a
landing page personalized for them and based on the latest real-time activity
on the platform. Using Flink for real-time machine learning at Taobao has
resulted in a more than 20% increase in the purchase conversion rate. At peak
during Singles Day last year, the system processed over 1.7 billion events/sec.
In the Hangzhou City Brain project, Flink is used to process data from a variety
of sensors (traffic cameras, map applications, etc.) in real time and to manage
traffic signals at 128 intersections. The City Brain project has halved travel
times for ambulances and commuters. Traffic accidents can be detected
immediately, and help can reach the accident site within 5 minutes.
What’s in a name?
verum (“real” in Latin)
Understanding the truth about the world by
getting the real-time view
What is Ververica?
1. Double down on the open source community and improve its health
and diversity
2. Contribute a number of innovations to the open source project starting
with Alibaba’s Blink for batch processing
3. Create an ecosystem and foundation for the commercial success of
Flink projects and products across the world
Our #1 goal is to position Apache Flink for the next 10 years of its life
Ververica Commercial Products
Full continuation of our commercial products and
services
• Ververica Platform including Apache Flink,
Application Manager, and Streaming Ledger
• Apache Flink Training and Consulting Services
• Enterprise Support
A lot of innovation coming here as well leveraging
existing work in Alibaba Cloud
Announcing: Ververica Partner Program
We are looking for partners to help us develop the broader Flink
ecosystem
• Ververica Platform Partner
Preferred partners of our commercial products around the globe
• Ververica Services Partner
Service provider on Apache Flink certified by Ververica
Sign up here! ververica.com/partner-program
Stephan Ewen
Xiaowei Jiang
Robert Metzger
From Stream Processor to
Unified Data Processing System
Use Cases Presented Today
Apache Flink and Public Clouds
cloud
on-prem
Data Processing Applications
more lag time
Batch Processing
Continuous Processing
Event-driven Applications
Transactional Processing
more real time
Stream Processing
Data Pipelines
Streaming Analytics
Batch Processing & Continuous Streaming
Analytics & Applications
Flink community's focus over the last releases
Recent Features
Time-versioned Joins
MATCH_RECOGNIZE
Schema Upgrades
SELECT *
FROM TaxiRides
MATCH_RECOGNIZE (
  PARTITION BY driverId
  ORDER BY rideTime
  MEASURES
    S.rideId AS sRideId
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (S M{2,} E)
  DEFINE
    S AS S.isStart = true,
    M AS M.rideId <> S.rideId,
    E AS E.isStart = false
         AND E.rideId = S.rideId
)
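To make the pattern in the query above concrete, here is a minimal plain-Java sketch of what `PATTERN (S M{2,} E)` selects within one driver's partition. It does not use Flink; the `Ride` class and the method name are hypothetical and exist only for illustration, and the input list is assumed to be ordered by `rideTime` already.

```java
import java.util.ArrayList;
import java.util.List;

class RidePatternSketch {
    // Hypothetical row type; the field names mirror the columns used in the query.
    static final class Ride {
        final long rideId;
        final boolean isStart;

        Ride(long rideId, boolean isStart) {
            this.rideId = rideId;
            this.isStart = isStart;
        }
    }

    // Scans one driver's rows (already ordered by rideTime) for S M{2,} E
    // and returns S.rideId for every match.
    static List<Long> matchStartRideIds(List<Ride> rows) {
        List<Long> matches = new ArrayList<>();
        int i = 0;
        while (i < rows.size()) {
            Ride s = rows.get(i);
            if (!s.isStart) {            // S must be a start event
                i++;
                continue;
            }
            int j = i + 1;
            // M: consume every following row that belongs to a different ride
            while (j < rows.size() && rows.get(j).rideId != s.rideId) {
                j++;
            }
            int mCount = j - (i + 1);
            // E: the row that stopped the loop has S's rideId; it must be an end event
            boolean endFound = j < rows.size() && !rows.get(j).isStart;
            if (mCount >= 2 && endFound) {
                matches.add(s.rideId);
                i = j + 1;               // AFTER MATCH SKIP PAST LAST ROW
            } else {
                i++;                     // no match starting at this row
            }
        }
        return matches;
    }
}
```

Because the conditions for M and E are mutually exclusive, the greedy scan needs no backtracking here; a general pattern matcher would.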
"Stream Processing takes on ACID"
by Seth Wiesman
11am, Nikko I
SQL and SQL Ecosystem / tools
Machine Learning
Graphs
Batch Performance
Batch Fault Tolerance
Interactive Queries
Dashboards
The Relationship between
Batch and Streaming
Everything Streams
That is about 60% of the truth…
The remaining 40% of the truth
Continuous Streaming: Data is incomplete. Latency SLAs. Completeness and latency are a tradeoff.
Batch Processing: Data is as complete as it gets within the job. No low-latency SLAs.
Streaming versus Batch Join
DataStream API: 2x RocksDB LSM-Trees; push-based operators; low-latency; minimize in-flight data
DataSet API: 1x Hybrid Hash Join; pull-based operators; flexible data flow control; high latency; no checkpoints
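The contrast between the two join styles can be sketched in plain Java without Flink. In this sketch, in-memory hash maps stand in for the RocksDB state and for the hybrid hash join's build table; all class and method names are hypothetical, and spilling, checkpointing, and state cleanup are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Batch: one input is fully available, so build ONE hash table and probe it.
    // Rows are {key, value} pairs.
    static List<String> batchHashJoin(List<String[]> build, List<String[]> probe) {
        Map<String, List<String>> table = new HashMap<>();
        for (String[] row : build) {
            table.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row[1]);
        }
        List<String> out = new ArrayList<>();
        for (String[] row : probe) {
            for (String left : table.getOrDefault(row[0], List.of())) {
                out.add(row[0] + ":" + left + "," + row[1]);
            }
        }
        return out;
    }

    // Streaming: neither input ever ends, so BOTH sides keep keyed state and
    // every arriving record is pushed through immediately.
    static class StreamingJoin {
        private final Map<String, List<String>> leftState = new HashMap<>();
        private final Map<String, List<String>> rightState = new HashMap<>();

        List<String> onLeft(String key, String value) {
            leftState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            List<String> out = new ArrayList<>();
            for (String right : rightState.getOrDefault(key, List.of())) {
                out.add(key + ":" + value + "," + right);
            }
            return out;
        }

        List<String> onRight(String key, String value) {
            rightState.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
            List<String> out = new ArrayList<>();
            for (String left : leftState.getOrDefault(key, List.of())) {
                out.add(key + ":" + left + "," + value);
            }
            return out;
        }
    }
}
```

Both variants produce the same join results for the same data; the streaming version pays for its low latency with a second state store.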
Exploiting the Batch Special Case
See also: "Towards Flink 2.0: Rethinking the stack and APIs to unify Batch & Stream"
by Aljoscha Krettek, 2pm, Nikko II/III
Planner/Optimizer
Continuous Operators, Streaming Scheduler Rules: core operators, cover all cases
Additional Bounded Operators, Additional Scheduling Strategies: if (bounded && non-incremental), activates additional optimizer choices
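The rule on this slide can be read as a simple planner decision. The following is only a sketch under that reading: the enum values and the method are hypothetical names, not Flink's planner API.

```java
class BoundedPlanSketch {
    // Hypothetical operator names for illustration only.
    enum JoinOperator { STREAMING_JOIN, HYBRID_HASH_JOIN }

    // The streaming operator covers every case; the bounded-only operator
    // becomes an extra choice when the input is bounded and results need
    // not be produced incrementally.
    static JoinOperator chooseJoin(boolean bounded, boolean incremental) {
        if (bounded && !incremental) {
            return JoinOperator.HYBRID_HASH_JOIN;
        }
        return JoinOperator.STREAMING_JOIN;
    }
}
```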
Stream Processing,
Analytics, and Applications
Process Function (events, state, time): Stateful Event-Driven Applications
DataStream API (streams, windows): Stream- & Batch Data Processing
Stream SQL / Tables (dynamic tables): High-level Analytics API
// combine the readings of each sensor over 5-second windows
val stats = stream
  .keyBy("sensor")
  .timeWindow(Time.seconds(5))
  .reduce((a, b) => a.add(b))
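What the keyed 5-second window above computes can be sketched in plain Java without Flink. The `Reading` class and the method name are hypothetical; the sketch assumes non-negative millisecond timestamps and tumbling windows aligned to the epoch, and it processes a finite list at once instead of emitting a result when each window closes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TumblingWindowSketch {
    static final long WINDOW_MS = 5_000L;

    // Hypothetical event type for illustration.
    static final class Reading {
        final String sensor;
        final long timestampMs;
        final long value;

        Reading(String sensor, long timestampMs, long value) {
            this.sensor = sensor;
            this.timestampMs = timestampMs;
            this.value = value;
        }
    }

    // Returns sensor -> (window start in ms -> sum of values in that window).
    static Map<String, Map<Long, Long>> sumPerSensorAndWindow(List<Reading> readings) {
        Map<String, Map<Long, Long>> result = new HashMap<>();
        for (Reading r : readings) {
            // keyBy("sensor"): group by key; timeWindow: assign to a 5 s bucket
            long windowStart = r.timestampMs - (r.timestampMs % WINDOW_MS);
            result.computeIfAbsent(r.sensor, k -> new TreeMap<>())
                  .merge(windowStart, r.value, Long::sum);
        }
        return result;
    }
}
```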
def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = {
  // work with event and state
  (event, state.value) match { … }
  out.collect(…)   // emit events
  state.update(…)  // modify state
  // schedule a timer callback
  ctx.timerService.registerEventTimeTimer(event.timestamp + 500)
}
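The interplay of keyed state and event-time timers in the function above can be simulated in plain Java. This is a sketch only: the class and its methods are hypothetical, the running sum per key is an assumed example of "state", and timer de-duplication and fault tolerance are omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class KeyedTimerSketch {
    // Keyed state: a running sum per key (an assumed example of state).
    private final Map<String, Long> state = new HashMap<>();
    // Event-time timers: firing time -> keys that registered a timer for it.
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();

    void processElement(String key, long timestamp, long value) {
        state.merge(key, value, Long::sum);                       // modify state
        timers.computeIfAbsent(timestamp + 500, t -> new ArrayList<>())
              .add(key);                                          // schedule a timer callback
    }

    // Timers fire once the watermark reaches their timestamp.
    List<String> advanceWatermark(long watermark) {
        List<String> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            for (String key : timers.pollFirstEntry().getValue()) {
                fired.add(key + "=" + state.get(key));            // emit in the callback
            }
        }
        return fired;
    }
}
```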
How we showed the API stack in the past…
Applications (physical) | Analytics (declarative)
DataStream API | Table API
Types are Java / Scala classes | Logical Schema for Tables
Transformation Functions | Declarative Language (SQL, Table DSL)
Explicit control over State | State implicit in operations
Explicit control over Time | SLAs define when to trigger
Executes as described | Automatic Optimization
Rethinking the Flink Stack
Stream Operator & DAG API
Runtime
DataSet (deprecated) | DataStream | Table / SQL
Still possible to mix and match within a program
SQL, Notebooks,
and Machine Learning
Adding a new Table API / SQL Query Processor (Blink)
Table API & SQL -> Logical Plan -> Physical Plan -> JobGraph
Pluggable Catalog
Query Optimizer: SubQuery Decorrelation, Filter/Project Push-Down, Join Reorder, …
Functional Improvements
Sub-Query, ANSI Syntax, Data Type, Join, Advanced Aggregates, Window Operator, Over Window, TPC-H/TPC-DS
Performance Improvements
Query Execution
Performant Operators: Operator codegen, HashAgg/Local-global Agg, Improved HashJoin, Semi/Anti join, Vectorization
Resource Optimizations: Stats based estimation, Dynamic memory allocation
Expression Optimizations: Record Format, Operate binary data, JVM intrinsics, Hot method codegen
Query Optimizer
Rich Stats: NDV, NULL count, Avg length, Max length, Min, Max
Cost Based: Join order, Join type, Agg strategy, …
Advanced Rules: Subplan reuse, Join condition expansion, Shuffle removal, Distinct Agg rewrite, …
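Of the items on this slide, local-global aggregation is easy to illustrate. The following plain-Java sketch shows the general technique for a COUNT grouped by key; it is not Blink's implementation, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LocalGlobalAggSketch {
    // Local phase: each partition pre-aggregates its own rows before the shuffle.
    static Map<String, Long> localCount(List<String> partitionKeys) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : partitionKeys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    // Global phase: merge the (usually much smaller) partial results.
    static Map<String, Long> globalMerge(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((key, count) -> total.merge(key, count, Long::sum));
        }
        return total;
    }
}
```

The benefit is that only one partial record per key and partition crosses the network instead of every input row, which also softens the effect of skewed keys.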
Batch SQL Benchmark
[Chart: Blink vs. Spark on TPC-DS queries at 1T, 10T, and 30T scale; labels: 2000S, 20000S, 50000S ?]
Flink SQL in Production at Alibaba
[Figure labels: 1.7B, 10K, 10K, Sub-Second, 100TB]
Table API
Common Implementation, Modular/Composable Applications, Dynamic Query Logic, Interactive Programming, Ease of Use
Blink SQL Merge Plan
June, 2019: TableAPI Refactor (FLIP-32)
July, 2019: Initial Blink Runner Merge, Flink 1.9 Release
Oct, 2019: Full Merge
Table API Layer: Flink Runner, Blink Runner
Hive Integration
Flink + Hive (HMS, Data)
For More? Integrate Flink with Hive Ecosystem
Xuefu Zhang & Bowen Li, Alibaba, 12:20pm - 1:00pm, Carmel
• Zeppelin Integration
Zeppelin support for Flink
Proposal for Machine Learning
Unified Interface, ML algorithms, Common Utilities
For More?
When Table meets AI: Build Flink AI Ecosystem on Table API
Shaoxuan Wang, Alibaba, 4:30pm - 5:10pm, Nikko II & II
High performance ML library based on Flink
Xu Yang, Alibaba, 2:50pm - 3:10pm, Carmel
Clustering: K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM)
Regression: Linear regression, Lasso regression, Ridge regression, Generalized linear regression, Survival regression, Isotonic regression
Classifier: Binomial logistic regression, Multinomial logistic regression, Multilayer perceptron classifier, Linear Support Vector Machine, Naive Bayes, Random Forest, GBDT, Decision Tree
Others: Collaborative filtering, FP-Growth, PrefixSpan
The Apache Flink Community
A growing Apache Flink Community
… not only Flink’s codebase that is growing massively …
230 Emails / week
Launch of a new Chinese language user support mailing list
Growing the Contributors Community
• Cleanup & reorganization of the Jira
components
• Flinkbot: Improve pull request reviews and
labeling
• Discussions about improving the contribution
workflow
• PMC mentoring new committer candidates
• Flink Community Packages website
Flink Community Packages
Closing
The Apache Flink community is
more active than ever
Apache Flink continues to evolve with the
Stream Processing space.
Seamlessly integrate analytics, machine learning, applications
and very fast batch processing on top of stream processing
Thank you!

KEYNOTE Flink Forward San Francisco 2019: From Stream Processor to a Unified Data Processing System - Stephan Ewen & Xiaowei Jiang & Robert Metzger

  • 1. © 2019 Ververica Presenter Name & Title PRESENTATION TITLE The Apache Flink® Conference Stream Processing | Event Driven | Real Time San Francisco 1-2, 2019
  • 2. © 2019 Ververica2 A big thanks to our Sponsors Platinum Gold Silver Community Flink Fest Media
  • 3. © 2019 Ververica3 A big thanks to our Program Committee Fabian Hueske Dean Wampler Sonali SharmaEric SammerStefan RichterTyler Akidau Jamie Grier
  • 4. © 2019 Ververica4 A big thanks to our Speakers
  • 5. © 2019 Ververica5 Flink Forward App Rate speakers and sessions with a chance to win a Flybrix drone! Get involved! Apache Flink User Survey Win a trip to one of the next Flink Forward conferences of your choice! Flink Forward Survey Help us improve the quality of Flink Forward. We appreciate your feedback! Community Contribution Sign up as a content contribut or for blog posts or speaking oppor tunities. Flink Forward Social Feed - View & Engage! Get involved
  • 6. © 2019 Ververica Kostas Tzoumas INTRODUCING VERVERICA
  • 7. © 2019 Ververica7 Founded in 2014 by the original creators of Apache Flink to commercialize the open source project and support the community
  • 9. © 2019 Ververica9 Why? • Alibaba has been the largest user of Flink and second largest contributor for years • Deeply committed to open source and creating technological impact • Joining forces made a lot of sense for the two teams in order to collaborate even closer and accelerate their contributions to Flink
  • 10. © 2019 Ververica10 Flink at Alibaba (few examples) Taobao is the largest e-commerce platform globally with more than 600 million monthly active users. Every time a user logs into the Taobao app they see a different landing page personalized for the user and depending on the latest real- time activity in the platform. Using Flink for real-time machine learning at Taobao has resulted in over 20% increase in purchase conversion rate. At peak during Singles Day last year, the system processed over 1.7 billion events/sec. In the Hangzhou City Brain Project, Flink is used to process in real-time data from a variety of sensors (traffic cameras, map applications, etc), and manage traffic signals in 128 intersections. The City Brain project has halved traveling times for ambulances and commuters. Traffic accidents can be detected immediately, and help can reach the accident site within 5 minutes.
  • 11. © 2019 Ververica11 What’s in a name? verum (“real” in Latin) Understanding the truth about the world by getting the real-time view
  • 12. © 2019 Ververica12 What is Ververica? 1. Double down on the open source community and improve its health and diversity 2. Contribute a number of innovations to the open source project starting with Alibaba’s Blink for batch processing 3. Create an ecosystem and foundation for the commercial success of Flink projects and products across the world Our #1 goal is to position Apache Flink for the next 10 years of its life
  • 13. © 2019 Ververica13 Ververica Commercial Products Full continuation of our commercial products and services • Ververica Platform including Apache Flink, Application Manager, and Streaming Ledger • Apache Flink Training and Consulting Services • Enterprise Support A lot of innovation coming here as well leveraging existing work in Alibaba Cloud
  • 14. © 2019 Ververica14 Announcing: Ververica Partner Program We are looking for partners to help us develop the broader Flink ecosystem • Ververica Platform Partner Preferred partners of our commercial products around the globe • Ververica Services Partner Service provider on Apache Flink certified by Ververica Sign up here! ververica.com/partner-program
  • 15. © 2019 Ververica Stephan Ewen Xiaowei Jiang Robert Metzger From Stream Processor to Unified Data Processing System
  • 16. © 2019 Ververica16 Use Cases Presented Today
  • 17. © 2019 Ververica17 Apache Flink and Public Clouds cloud on-prem
  • 18. © 2019 Ververica Data Processing Applications
  • 19. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Stream Processing Data Pipelines Streaming Analytics
  • 20. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Stream Processing Data Pipelines Streaming Analytics Batch Processing & Continuous Streaming Analytics & Applications
  • 21. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Stream Processing Data Pipelines Streaming Analytics Flink community's focus over the last releases
  • 22. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Recent Features Data Pipelines Streaming Analytics Time-versioned Joins MATCH_RECOGNIZE Schema Upgrades SELECT * FROM TaxiRides MATCH_RECOGNIZE ( PARTITION BY driverId ORDER BY rideTime MEASURES S.rideId as sRideId AFTER MATCH SKIP PAST LAST ROW PATTERN (S M{2,} E) DEFINE S AS S.isStart = true, M AS M.rideId <> S.rideId, E AS E.isStart = false AND E.rideId = S.rideId) A B A X Y
  • 23. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Stream Processing Data Pipelines Streaming Analytics "Steam Processing takes on ACID" by Seth Wiesman 11am, Nikko I
  • 24. © 2019 Ververica more lag time Batch Processing Continuous Processing Event-driven Applications Transactional Processing more real time Stream Processing Data Pipelines Streaming Analytics SQL and SQL Ecosystem / tools Machine Learning Graphs Batch Performance Batch Fault Tolerance Interactive Queries Dashboards
  • 25. © 2019 Ververica The Relationship between Batch and Streaming
  • 26. © 2019 Ververica26 Everything Streams That is about 60% of the truth…
  • 27. © 2019 Ververica27 The remaining 40% of the truth Continuous Streaming Batch Processing Data is incomplete Latency SLAs Completeness and Latency is a tradeoff Data is as complete as it gets within the job No Low Latency SLAs
  • 28. © 2019 Ververica28 The remaining 40% of the truth Continuous Streaming Batch Processing Data is incomplete Latency SLAs Completeness and Latency is a tradeoff Data is as complete as it gets within the job No Low Latency SLAs
  • 29. © 2019 Ververica29 Streaming versus Batch Join
  • 30. © 2019 Ververica30 Streaming versus Batch Join 2x RocksDB LSM-Trees 1x Hybrid Hash Join DataStream API DataSet API push-based operators low-latency minimize in-flight data pull-based operators flexible data flow control high latency no checkpoints
  • 31. © 2019 Ververica31 Exploiting the Batch Special Case Planner/Optimizer See also: "Towards Flink 2.0: Rethinking the stack and APIs to unify Batch & Stream" by Aljoscha Krettek, 2pm, Nikko II/III Continuous Operators Streaming Scheduler Rules Additional Bounded Operators Additional Scheduling Strategies if (bounded && non-incremental) activates additional optimizer choices Core operators, cover all cases
  • 32. © 2019 Ververica Stream Processing, Analytics, and Applications
  • 33. © 2019 Ververica33 Process Function (events, state, time) DataStream API (streams, windows) Stream SQL / Tables (dynamic tables) Stream- & Batch Data Processing High-level Analytics API Stateful Event- Driven Applications val stats = stream .keyBy("sensor") .timeWindow(Time.seconds(5)) .sum((a, b) -> a.add(b)) def processElement(event: MyEvent, ctx: Context, out: Collector[Result]) = { // work with event and state (event, state.value) match { … } out.collect(…) // emit events state.update(…) // modify state // schedule a timer callback ctx.timerService.registerEventTimeTimer(event.timestamp + 500) } How we showed the API stack in the past…
  • 34. © 2019 Ververica Applications (physical) Analytics (declarative) DataStream API Table API Types are Java / Scala classes Logical Schema for Tables Transformation Functions Declarative Language (SQL, Table DSL) State implicit in operationsExplicit control over State Explicit control over Time SLAs define when to trigger Executes as described Automatic Optimization
  • 35. © 2019 Ververica35 Rethinking the Flink Stack Stream Operator & DAG API Runtime DataSet (deprecated) DataStream Table / SQL Still possible to mix and match within a program
  • 36. © 2019 Ververica SQL, Notebooks, and Machine Learning
  • 37. © 2019 Ververica JobGraphPhysical PlanTable API & SQL Logical Plan Pluggable Catalog Query Optimizer SubQuery Decorrelation Filter/Project Push-Down Join Reorder … Adding a new Table API / SQL Query Processor (Blink)
  • 38. Functional Improvements: ANSI syntax, sub-query, data type, join, advanced aggregates, window operator, over window, TPC-H/TPC-DS.
  • 39. Performance Improvements. Query Execution: performant operators (operator codegen, HashAgg/local-global Agg, improved HashJoin, semi/anti join, vectorization); resource optimizations (stats-based estimation, dynamic memory allocation); expression optimizations (record format, operate on binary data, JVM intrinsics, hot method codegen). Query Optimizer: rich stats (NDV, NULL count, avg length, max length, min, max); cost-based choices (join order, join type, agg strategy, …); advanced rules (subplan reuse, join condition expansion, shuffle removal, distinct agg rewrite, …).
  • 40. Batch SQL Benchmark (chart: Blink versus Spark on 1T, 10T, and 30T TPC-DS queries; chart labels 2000S, 20000S, 50000S, ?)
  • 41. Flink SQL in Production at Alibaba: 1.7B; 10K; 10K; sub-second; 100TB.
  • 43. Blink SQL Merge Plan. June 2019: Table API refactor (FLIP-32). July 2019: initial Blink runner merge; Flink 1.9 release. Oct 2019: full merge. Components: Table API layer, Flink runner, Blink runner.
  • 44. Hive Integration: Flink + Hive (diagram labels: HMS, Data, Flink, Hive). For more: "Integrate Flink with Hive Ecosystem", Xuefu Zhang & Bowen Li, Alibaba, 12:20pm - 1:00pm, Carmel.
  • 45. Zeppelin Integration: Zeppelin support for Flink
  • 46. Proposal for Machine Learning: unified interface, ML algorithms, common utilities. Clustering: K-means, Latent Dirichlet allocation (LDA), Bisecting k-means, Gaussian Mixture Model (GMM). Regression: Linear regression, Lasso regression, Ridge regression, Generalized linear regression, Survival regression, Isotonic regression. Classifier: Binomial logistic regression, Multinomial logistic regression, Multilayer perceptron classifier, Linear Support Vector Machine, Naive Bayes, Random Forest, GBDT, Decision Tree. Others: Collaborative filtering, FP-Growth, PrefixSpan. For more: "When Table meets AI: Build Flink AI Ecosystem on Table API", Shaoxuan Wang, Alibaba, 4:30pm - 5:10pm, Nikko II & II; "High performance ML library based on Flink", Xu Yang, Alibaba, 2:50pm - 3:10pm, Carmel.
  • 47. The Apache Flink Community
  • 48. A growing Apache Flink Community … not only Flink’s codebase that is growing massively … 230 emails / week
  • 50. Launch of a new Chinese language user support mailing list
  • 51. Growing the Contributors Community: cleanup and reorganization of the Jira components; Flinkbot to improve pull request reviews and labeling; discussions about improving the contribution workflow; PMC mentoring new committer candidates; Flink Community Packages website.
  • 52. Flink Community Packages
  • 54. The Apache Flink community is more active than ever. Apache Flink continues to evolve with the stream processing space. Seamlessly integrate analytics, machine learning, applications, and very fast batch processing on top of stream processing.
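
The DataStream snippet on slide 33 (key by sensor, five-second time window, then an aggregate) can be illustrated without Flink. The following is a minimal sketch in plain Java, not Flink code: the Reading class, its fields, and sumPerWindow are hypothetical names introduced here, and the sketch assumes all events are already in memory, so it ignores watermarks, late data, and state backends.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TumblingWindowSum {

    /** One sensor reading; the class and field names are hypothetical. */
    static final class Reading {
        final String sensor;
        final long timestampMillis;
        final double value;

        Reading(String sensor, long timestampMillis, double value) {
            this.sensor = sensor;
            this.timestampMillis = timestampMillis;
            this.value = value;
        }
    }

    /**
     * Sums readings per sensor and per tumbling window.
     * Outer key: sensor id. Inner key: window start in milliseconds.
     */
    static Map<String, Map<Long, Double>> sumPerWindow(List<Reading> readings, long windowMillis) {
        Map<String, Map<Long, Double>> result = new TreeMap<>();
        for (Reading r : readings) {
            // Align the timestamp down to the start of its window.
            long windowStart = r.timestampMillis - Math.floorMod(r.timestampMillis, windowMillis);
            result.computeIfAbsent(r.sensor, k -> new TreeMap<>())
                  .merge(windowStart, r.value, Double::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Reading> readings = List.of(
            new Reading("a", 1_000L, 1.0),
            new Reading("a", 4_999L, 2.0),
            new Reading("a", 5_000L, 4.0),
            new Reading("b", 2_000L, 8.0));
        // Five-second windows, mirroring Time.seconds(5) on the slide.
        System.out.println(sumPerWindow(readings, 5_000L));
    }
}
```

In this sketch the two readings for sensor "a" below 5,000 ms fall into the window starting at 0, while the reading at exactly 5,000 ms opens the next window. A streaming engine computes the same grouping incrementally and emits each window once event time passes its end.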

Editor's Notes

  • #39: ANSI SQL syntax. Support for rich data types, including varchar, varbinary, and complex types such as structs …
  • #42: Sub-second latency. Peak throughput: 1.7B events/s. Largest batch job input size: 100TB. Flink cluster size: 10K. Flink jobs: 10K.
  • #43: Table API is the choice for analytics workloads: shared implementation with Flink SQL; modular/composable code; dynamic queries; interactive programming; ease of use.
  • #44: 2019.2-2019.6: FLIP-32: refactor the Table API, decouple the API from the implementation, and support multiple runners side by side. 2019.3-2019.6 (concurrently, starting one month later): initial Blink runner merge, supporting the major functionality of Blink SQL and production ready. 2019.6-2019.7: Flink 1.9.0 released. 2019.7-2019.10: continue to merge and improve the Blink runner; fully merge all Blink code. 2019.10-2019.11: new Flink release (1.10.0/2.0.0?).
  • #46: Supported execution modes: Local, Remote, and Yarn. Supports Table API, Batch SQL, and Stream SQL. Visualizes static and dynamic tables. Supports savepoints in Flink jobs. Supports advanced functionality in ZeppelinContext.
  • #47: Unified the interface of algorithm functions: Table serves as input and output of ML algorithms; support for ML Pipeline. Implemented some major ML algorithms: design and refine the algorithm implementations for higher performance; support large-scale data, running and validating in a production environment; find and solve performance and stability problems; achieve performance equal to or better than Spark ML. Accumulated utilities for Flink ML developers: basic local library (linear algebra, probabilistic, etc.); distributed computing library (statistics, parallel sort, etc.); refined data transmission, for example memory caching of training data.
  • #49: Like the codebase, Flink’s community is growing. Now among the top 5 projects within Apache in terms of community size; the numbers are all looking very positive. 1st number: dev@ activity (developer community growth): more contributions, project future discussions. 2nd number: pageviews (user community growth): constant growth over the past years. We expect these to grow in the coming years, also because we’ve started onboarding the huge Chinese Flink community …
  • #50: … 1300 people attended Flink Forward China in December this year. There’s been an effort in China to translate the Flink docs. We are now collaborating on translating Flink’s official material: the Flink website is almost done, the Flink docs are coming now. A great opportunity for Flink: documentation (and contribution opportunities) for one of the most active Flink communities in the world.
  • #51: Flink has thousands of users; there is a WeChat (or DingTalk) group with thousands of members in China. Goals: foster further growth of Flink in China; bring them into the Apache world and offer infrastructure in an Apache way; make the Chinese community more visible to other Flink users and within the ASF. English obviously stays the official language for all developer discussions, and we are open to adding more languages.
  • #52: Better Jira processes: cleanup and reorganization of the components. Flinkbot, helping with the review of pull requests and labeling PRs into components. Discussion about a stricter Jira workflow, to make sure people work on things that have consensus to be accepted. The PMC is actively working on onboarding new committers to help with the amount of incoming pull requests and the overall growing project.
  • #53: Started a project for a Flink Ecosystem Portal, where people can submit their own Flink extensions for connectors, metrics reporters, file systems, API utilities, etc. This is hosted at Apache Infra, run by the Flink project, and based on open source. Reach out to me during the conference if you want to talk community, such as starting your own meetup group, ideas for how to improve the community, etc.