Fabian Hueske
@fhueske
Flink Forward Berlin
September 13th, 2017
Stream Analytics with SQL
on Apache Flink®
Original creators of Apache Flink®
Providers of dA Platform 2, including open source Apache Flink + dA Application Manager
The DataStream API
▪ Flink’s DataStream API is very expressive
  • Application logic implemented as user-defined functions
  • Windows, triggers, evictors, state, timers, async calls, …
▪ Many applications follow similar patterns
  • Do not require the expressiveness of the DataStream API
  • Can be specified more concisely and easily with a DSL
Q: What’s the most popular DSL for data processing?
A: SQL!
Apache Flink’s relational APIs
▪ Standard SQL & LINQ-style Table API
▪ Unified APIs for batch & streaming data
  A query specifies exactly the same result regardless of whether its input is static batch data or streaming data.
▪ Common translation layers
  • Optimization based on Apache Calcite
  • Type system & code generation
  • Table sources & sinks
Show me some code!
Table API:
tableEnvironment
  .scan("clicks")
  .filter('url.like("https://www.xyz.com%"))
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

SQL:
SELECT user, COUNT(url) AS cnt
FROM clicks
WHERE url LIKE 'https://www.xyz.com%'
GROUP BY user

“clicks” can be a
- file,
- database table,
- stream, …
What if “clicks” is a file?
clicks:
user  cTime     url
Mary  12:00:00  https://…
Bob   12:00:00  https://…
Mary  12:00:02  https://…
Liz   12:00:03  https://…

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

result:
user  cnt
Mary  2
Bob   1
Liz   1

Q: What if we get more click data?
A: We run the query again.
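To make the batch case concrete, here is a minimal Python sketch (plain Python, not Flink code) that evaluates the GROUP BY query over the four click rows shown above. The `clicks` list and the `count_per_user` helper are illustrative names chosen for this sketch.

```python
from collections import Counter

# The four click rows from the slide: (user, cTime, url).
clicks = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]

def count_per_user(rows):
    """Batch equivalent of:
    SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user"""
    counts = Counter()
    for user, _c_time, url in rows:
        if url is not None:  # COUNT(url) skips NULL values
            counts[user] += 1
    return dict(counts)

print(count_per_user(clicks))  # {'Mary': 2, 'Bob': 1, 'Liz': 1}
```

When more click data arrives, the batch approach simply calls `count_per_user` again on the larger input, which is exactly the "run the query again" answer on the slide.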
What if “clicks” is a stream?
▪ We want the same results as for batch input!
▪ Can we query a stream with SQL as well?
SQL was not designed for streams
▪ Relations are bounded (multi-)sets. ↔ Streams are infinite sequences.
▪ A DBMS can access all data. ↔ Streaming data arrives over time.
▪ SQL queries return a result and complete. ↔ Streaming queries continuously emit results and never complete.
DBMSs run queries on streams
▪ Materialized views (MVs) are similar to regular views, but persisted to disk or memory
  • Used to speed up analytical queries
  • MVs need to be updated when the base tables change
▪ MV maintenance is very similar to SQL on streams
  • Base table updates are a stream of DML statements
  • The MV definition query is evaluated on that stream
  • The MV is the query result and is continuously updated
Continuous Queries in Flink
▪ Core concept is a “Dynamic Table”
  • Dynamic tables change over time
▪ Queries on dynamic tables
  • produce new dynamic tables (which are updated based on input)
  • do not terminate
▪ Stream ↔ Dynamic table conversions
Stream → Dynamic Table
▪ Append mode
  • Stream records are appended to the table
  • The table grows as more data arrives

Stream records:
Mary, 12:00:00, ./home
Bob, 12:00:00, ./cart
Mary, 12:00:05, ./prod?id=1
Liz, 12:01:00, ./home
Bob, 12:01:30, ./prod?id=3
Mary, 12:01:45, ./prod?id=7

Resulting table:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Bob   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7
…     …
Stream → Dynamic Table
▪ Upsert mode
  • Stream records have (composite) key attributes
  • Records are inserted, or update existing records with the same key

Stream records:
Mary, 2017-03-01
Bob, 2017-03-15
Mary, 2017-04-01
Liz, 2017-05-01
Bob, 2017-06-01
Mary, 2017-07-01

Resulting table:
user  lastLogin
Mary  2017-07-01
Bob   2017-06-01
Liz   2017-05-01
…
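The difference between append mode and upsert mode can be sketched in a few lines of plain Python. This is only an illustration of the two conversion semantics using the login records above; `apply_append` and `apply_upsert` are hypothetical helper names, not Flink API calls.

```python
def apply_append(table, record):
    """Append mode: every stream record becomes a new row."""
    table.append(record)

def apply_upsert(table, key, value):
    """Upsert mode: insert the record, or overwrite the row with the same key."""
    table[key] = value

# Stream records from the slide: (user, lastLogin).
logins = [
    ("Mary", "2017-03-01"),
    ("Bob", "2017-03-15"),
    ("Mary", "2017-04-01"),
    ("Liz", "2017-05-01"),
    ("Bob", "2017-06-01"),
    ("Mary", "2017-07-01"),
]

append_table = []   # grows by one row per record
upsert_table = {}   # holds one row per key (user)
for user, last_login in logins:
    apply_append(append_table, (user, last_login))
    apply_upsert(upsert_table, user, last_login)

print(len(append_table))  # 6
print(upsert_table)       # {'Mary': '2017-07-01', 'Bob': '2017-06-01', 'Liz': '2017-05-01'}
```

The append table keeps all six records, while the upsert table ends with one row per user holding the latest login, matching the table on the slide.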
Querying a Dynamic Table
clicks:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:00:05  ./prod?id=1
Liz   12:01:00  ./home
Liz   12:01:30  ./prod?id=3
Mary  12:01:45  ./prod?id=7

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

result (cnt values over time as clicks arrive):
user  cnt
u1    1 → 2 → 3
u2    1
u3    1 → 2

Rows of result table are updated.
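A continuous query over an append-mode dynamic table can be pictured as a loop that updates the result table once per incoming record. The following is a plain-Python sketch of that idea, not Flink's implementation; `continuous_count` is an illustrative name.

```python
def continuous_count(click_stream):
    """Yield a snapshot of the result table after each incoming click.

    Mirrors: SELECT user, COUNT(url) AS cnt FROM clicks GROUP BY user
    evaluated continuously instead of once.
    """
    result = {}
    for user, _c_time, _url in click_stream:
        result[user] = result.get(user, 0) + 1  # update this user's row in place
        yield dict(result)

clicks = [
    ("Mary", "12:00:00", "./home"),
    ("Bob", "12:00:00", "./cart"),
    ("Mary", "12:00:05", "./prod?id=1"),
    ("Liz", "12:01:00", "./home"),
    ("Liz", "12:01:30", "./prod?id=3"),
    ("Mary", "12:01:45", "./prod?id=7"),
]

snapshots = list(continuous_count(clicks))
print(snapshots[0])   # {'Mary': 1}
print(snapshots[-1])  # {'Mary': 3, 'Bob': 1, 'Liz': 2}
```

The last snapshot equals what a batch run over the same six rows would return, which is the "same result for batch and streaming input" property the talk asks for.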
What about windows?
Table API:
tableEnvironment
  .scan("clicks")
  .window(Tumble over 1.hour on 'cTime as 'w)
  .groupBy('w, 'user)
  .select('user, 'w.end as 'endT, 'url.count as 'cnt)

SQL:
SELECT user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
  user
Computing window aggregates

clicks:
user  cTime     url
Mary  12:00:00  ./home
Bob   12:00:00  ./cart
Mary  12:02:00  ./prod?id=2
Mary  12:55:00  ./home
Bob   13:01:00  ./prod?id=4
Liz   13:30:00  ./cart
Liz   13:59:00  ./home
Mary  14:00:00  ./prod?id=1
Liz   14:02:00  ./prod?id=8
Bob   14:30:00  ./prod?id=7
Bob   14:40:00  ./home

SELECT
  user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY
  user,
  TUMBLE(cTime, INTERVAL '1' HOUR)

result:
user  endT      cnt
u1    13:00:00  3
u2    13:00:00  1
u2    14:00:00  1
u3    14:00:00  2
u1    15:00:00  1
u2    15:00:00  2
u3    15:00:00  1

Rows are appended to result table.
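The window result above can be reproduced with a short plain-Python sketch of a one-hour tumbling-window count. It is an illustration only: the helper names are made up for this example, and windows are assumed to be aligned to midnight.

```python
from collections import Counter
from datetime import datetime, timedelta

def tumble_end(ts, size):
    """Return the (exclusive) end of the tumbling window that contains ts."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    index = (ts - midnight) // size  # assumption: windows are aligned to midnight
    return midnight + (index + 1) * size

def window_counts(rows, size=timedelta(hours=1)):
    """Count clicks per (window end, user), like GROUP BY user, TUMBLE(cTime, ...)."""
    counts = Counter()
    for user, c_time, _url in rows:
        ts = datetime.strptime(c_time, "%H:%M:%S")
        end = tumble_end(ts, size).strftime("%H:%M:%S")
        counts[(end, user)] += 1
    return sorted((end, user, cnt) for (end, user), cnt in counts.items())

clicks = [
    ("Mary", "12:00:00", "./home"), ("Bob", "12:00:00", "./cart"),
    ("Mary", "12:02:00", "./prod?id=2"), ("Mary", "12:55:00", "./home"),
    ("Bob", "13:01:00", "./prod?id=4"), ("Liz", "13:30:00", "./cart"),
    ("Liz", "13:59:00", "./home"), ("Mary", "14:00:00", "./prod?id=1"),
    ("Liz", "14:02:00", "./prod?id=8"), ("Bob", "14:30:00", "./prod?id=7"),
    ("Bob", "14:40:00", "./home"),
]

for end, user, cnt in window_counts(clicks):
    print(end, user, cnt)
```

Each click falls into exactly one window, so a row for a closed window never changes again. That is why the result can be appended rather than updated.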
Why are results always appended?
▪ cTime is an event-time attribute
  • Guarded by watermarks
  • Internally represented as a special type
  • User-facing as TIMESTAMP
▪ Special plans for queries that operate on event-time attributes

SELECT user,
  TUMBLE_END(cTime, INTERVAL '1' HOUR) AS endT,
  COUNT(url) AS cnt
FROM clicks
GROUP BY TUMBLE(cTime, INTERVAL '1' HOUR),
  user
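Why watermarks lead to append-only results can be sketched as follows: a window is held back until the watermark passes its end, and then its final counts are emitted exactly once. This is a simplified plain-Python model, not Flink's watermark mechanism. The `max_delay` parameter and the rule "watermark = latest event time minus `max_delay`" are assumptions made for this sketch, and late records are simply dropped.

```python
from collections import defaultdict

HOUR = 3600  # window size in seconds

def t(hours, minutes):
    """Helper: time of day as seconds since midnight."""
    return hours * 3600 + minutes * 60

def windowed_counts(events, max_delay=0):
    """events: iterable of (user, event_time_seconds).

    Returns emitted rows (window_end, user, cnt) in emission order.
    A window is emitted once, when the watermark reaches its end.
    """
    open_windows = defaultdict(lambda: defaultdict(int))  # window_end -> user -> cnt
    emitted = []
    watermark = float("-inf")
    for user, ts in events:
        window_end = (ts // HOUR + 1) * HOUR
        if window_end <= watermark:
            continue  # late record: window already finalized (dropped in this sketch)
        open_windows[window_end][user] += 1
        watermark = max(watermark, ts - max_delay)
        # Finalize every window whose end the watermark has reached.
        for end in sorted(w for w in open_windows if w <= watermark):
            for u, cnt in sorted(open_windows.pop(end).items()):
                emitted.append((end, u, cnt))
    return emitted

events = [
    ("Mary", t(12, 0)), ("Bob", t(12, 0)), ("Mary", t(12, 2)), ("Mary", t(12, 55)),
    ("Bob", t(13, 1)), ("Liz", t(13, 30)), ("Liz", t(13, 59)), ("Mary", t(14, 0)),
]

for row in windowed_counts(events):
    print(row)
```

In this run the 13:00 and 14:00 windows are emitted, while the window ending at 15:00 stays open because no later event has advanced the watermark past it. Emitted rows are never revised, so downstream consumers only see appends.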
Dynamic Table → Stream
▪ Converting a dynamic table into a stream
  • Dynamic tables might update or delete existing rows
  • Updates must be encoded in the outgoing stream
▪ Conversion of tables to streams is inspired by DBMS logs
  • DBMSs use logs to restore databases (and tables)
  • REDO logs store new records to redo changes
  • UNDO logs store old records to undo changes
Dynamic Table → Stream: REDO/UNDO
clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Bob   ./prod?id=3

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Emitted stream (+ INSERT / - DELETE), in order of emission:
+ Mary,1
+ Bob,1
- Mary,1
+ Mary,2
+ Liz,1
- Bob,1
+ Bob,2
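The REDO/UNDO encoding can be sketched in plain Python: every update of a result row is emitted as a delete of the old row followed by an insert of the new one. `to_retract_stream` is an illustrative name for this sketch, not a Flink function.

```python
def to_retract_stream(click_stream):
    """Encode changes of the per-user count as '+' (insert) and '-' (delete) messages."""
    counts = {}
    messages = []
    for user, _url in click_stream:
        old = counts.get(user)
        if old is not None:
            messages.append(("-", user, old))        # UNDO: retract the previous row
        counts[user] = (old or 0) + 1
        messages.append(("+", user, counts[user]))   # REDO: add the new row
    return messages

clicks = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Bob", "./prod?id=3"),
]

for op, user, cnt in to_retract_stream(clicks):
    print(op, f"{user},{cnt}")
```

Five clicks produce seven messages here, because each of the two updates (Mary and Bob) costs an extra delete message. The encoding works without a key, at the price of that extra traffic.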
Dynamic Table → Stream: REDO
clicks:
user  url
Mary  ./home
Bob   ./cart
Mary  ./prod?id=1
Liz   ./home
Liz   ./prod?id=3
Mary  ./prod?id=7

SELECT
  user,
  COUNT(url) AS cnt
FROM clicks
GROUP BY user

Emitted stream (* UPSERT by KEY / - DELETE by KEY), in order of emission:
* Mary,1
* Bob,1
* Mary,2
* Liz,1
* Liz,2
* Mary,3
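With a unique key (here `user`), an update can be encoded as a single upsert message. The plain-Python sketch below shows the encoding and how a consumer could rebuild the table from it; both function names are illustrative and not part of Flink.

```python
def to_upsert_stream(click_stream):
    """Encode changes of the per-user count as '*' (upsert by key) messages."""
    counts = {}
    messages = []
    for user, _url in click_stream:
        counts[user] = counts.get(user, 0) + 1
        messages.append(("*", user, counts[user]))  # one message per change
    return messages

def materialize(messages):
    """Rebuild the result table on the consumer side from upsert/delete messages."""
    table = {}
    for op, key, value in messages:
        if op == "*":
            table[key] = value      # insert or overwrite by key
        elif op == "-":
            table.pop(key, None)    # delete by key
    return table

clicks = [
    ("Mary", "./home"),
    ("Bob", "./cart"),
    ("Mary", "./prod?id=1"),
    ("Liz", "./home"),
    ("Liz", "./prod?id=3"),
    ("Mary", "./prod?id=7"),
]

messages = to_upsert_stream(clicks)
print(len(messages))          # 6
print(materialize(messages))  # {'Mary': 3, 'Bob': 1, 'Liz': 2}
```

Six clicks yield six messages, one per change, which is the saving over the insert/delete encoding. The trade-off is that the consumer must know the key to apply them.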
Can we run any query on a dynamic table?
▪ No, there are space and computation constraints ☹
▪ State size may not grow infinitely as more data arrives
  Problematic, because state grows with every new sessionId:
  SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;
▪ A change of an input table may only trigger a partial re-computation of the result table
  Problematic, because one changed row can alter the rank of many other rows:
  SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users;
Bounding the size of query state
▪ Adapt the semantics of the query
  • Aggregate data of the last 24 hours. Discard older data.

  SELECT sessionId, COUNT(url) AS cnt
  FROM clicks
  WHERE last(cTime, INTERVAL '1' DAY)
  GROUP BY sessionId

▪ Trade the accuracy of the result for size of state
  • Remove state for keys that became inactive.
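The second strategy above, removing state for inactive keys, can be sketched in plain Python. This illustrates the trade-off only and is not Flink's state cleanup. The `retention_seconds` parameter and the function name are assumptions made for this example. (The `last(...)` predicate in the slide's query appears to be illustrative syntax and is not modeled here.)

```python
def count_with_state_retention(events, retention_seconds):
    """events: iterable of (session_id, event_time_seconds) in time order.

    Keeps a running count per session, but drops the state of any key
    that has been idle for longer than retention_seconds.
    """
    state = {}    # session_id -> (count, last_seen)
    outputs = []  # (session_id, cnt) emitted after each event
    for session_id, ts in events:
        # Evict keys that have been inactive for too long.
        idle = [k for k, (_, seen) in state.items() if ts - seen > retention_seconds]
        for key in idle:
            del state[key]
        count, _ = state.get(session_id, (0, ts))
        state[session_id] = (count + 1, ts)
        outputs.append((session_id, count + 1))
    return outputs, state

events = [("s1", 0), ("s1", 10), ("s2", 20), ("s1", 200)]
outputs, state = count_with_state_retention(events, retention_seconds=100)
print(outputs)  # [('s1', 1), ('s1', 2), ('s2', 1), ('s1', 1)]
print(state)    # {'s1': (1, 200)}
```

The last output shows the cost of the bounded state: session `s1` was evicted after being idle, so its fourth event restarts the count at 1 instead of continuing at 3.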
Current state of SQL & Table API
▪ Flink’s relational APIs are rapidly evolving
  • Lots of interest from the community and many contributors
  • Used in production at large scale by Alibaba and others
▪ Features released in Flink 1.3
  • GroupBy & Over windowed aggregates
  • Non-windowed aggregates (with update changes)
  • User-defined aggregation functions
▪ Features coming with Flink 1.4
  • Windowed joins
  • Reworked connector APIs
What can be built with this?
▪ Continuous ETL
  • Continuously ingest data
  • Process with transformations & window aggregates
  • Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, …
What can be built with this?
▪ Dashboards, reporting & event-driven architectures
  • Flink updates query results with low latency
  • Result can be written to KV store, DBMS, compacted Kafka topic
  • Maintain result table as queryable state
Wrap-up!
▪ Table API & SQL support many streaming use cases
  • High-level / declarative specification
  • Automatic optimization and translation
  • Efficient execution
  • Scalar, table, aggregation UDFs for flexibility
▪ Updating results enable many exciting applications
▪ Check it out!
Thank you!
@fhueske
@ApacheFlink
@dataArtisans
Available on O’Reilly Early Release!
We are hiring!
data-artisans.com/careers
Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing with Apache Flink's Relational APIs
Tables are materialized streams
▪ A table is the materialization of a stream of modifications
  • SQL DML statements: INSERT, UPDATE, and DELETE
  • DBMSs process statements by modifying tables

Stream of modifications:
INSERT (u1, Mary, "2017-03-01")
INSERT (u2, Peter, "2017-05-01")
DELETE WHERE (user = u2)
UPDATE (lastLogin = "2017-06-01") WHERE (user = u1)

Table (rows as they change while the statements are applied):
user  name   lastLogin
u1    Mary   2017-03-01 → 2017-06-01
u2    Peter  2017-05-01 (deleted)
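The idea that a table is the materialization of a stream of modifications can be sketched by replaying the four statements above against an empty table. This is plain Python for illustration; the tuple encoding of the statements and the `apply_dml` helper are assumptions of this sketch.

```python
def apply_dml(table, statement):
    """Apply one modification to a table keyed by user id."""
    op = statement[0]
    if op == "INSERT":
        _, user, name, last_login = statement
        table[user] = {"name": name, "lastLogin": last_login}
    elif op == "UPDATE":
        _, user, last_login = statement
        if user in table:  # UPDATE touches only existing rows
            table[user]["lastLogin"] = last_login
    elif op == "DELETE":
        _, user = statement
        table.pop(user, None)
    else:
        raise ValueError(f"unknown statement: {op}")

# The modification stream from the slide, in the order listed.
statements = [
    ("INSERT", "u1", "Mary", "2017-03-01"),
    ("INSERT", "u2", "Peter", "2017-05-01"),
    ("DELETE", "u2"),
    ("UPDATE", "u1", "2017-06-01"),
]

users = {}
for statement in statements:
    apply_dml(users, statement)

print(users)  # {'u1': {'name': 'Mary', 'lastLogin': '2017-06-01'}}
```

Replaying the same statements always yields the same table, which is what lets a stream of modifications and a table be treated as two views of the same data.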

Editor's Notes

  • #3: A little bit about myself: I am a committer for Apache Flink and a software engineer at data Artisans, the original creators of Apache Flink and the providers of the dA Platform.