1
Streams and Tables: Two Sides of the Same Coin
Matthias J. Sax¹,², Guozhang Wang¹, Matthias Weidlich², Johann-Christoph Freytag²
¹ Confluent Inc., Palo Alto (CA)
matthias@confluent.io
guozhang@confluent.io
² Humboldt-Universität zu Berlin
mjsax@informatik.hu-berlin.de
matthias.weidlich@hu-berlin.de
freytag@informatik.hu-berlin.de
@MatthiasJSax
Twelfth International Workshop on Real-Time Business Intelligence and Analytics
27 August, 2018, Rio de Janeiro
2
Count Clicks per Page
[Figure: a distributed data source emits the click records BIRTE(3), VLDB(4), VLDB(7), which are merged into a single input data stream]
3
Count Clicks per Page
[Figure: a distributed data source emits the click records BIRTE(3), VLDB(4), VLDB(7), which are merged into a single input data stream]
• Arrival order is non-deterministic
• Event-time semantics implies out-of-order data
4
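Ordering: Common Approaches
• Buffer and re-order the input data stream before it reaches the stream processing system (SPS)
• Punctuations/watermarks embedded in the input data stream (e.g., time=3)
[Figure: input data stream BIRTE(3), VLDB(4), VLDB(7) fed into an SPS under each approach]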
5
Design Space
[Figure: design space with the three dimensions Cost, Correctness/Completeness, and Latency]
• Buffering and Reordering (Ref: CQL [1], Trill [2])
• Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
6
Problem Statement
How to design a model
• for the evaluation of expressive operators
• with low latency over potentially unordered data streams
• that can be implemented by means of distributed online algorithms?
7
High-Level Proposal
• To reduce latency, we need to avoid any processing delays
  • Process data in arrival order
  • Emit current result immediately
  • Law et al. [5]: cannot handle out-of-order data
• To handle out-of-order data, we need to be able to update/refine previous results
  • Data streams must allow for update records
  • Update/delete records by Babu and Widom [6]: no operator semantics defined
  • Borealis [7]: replays data stream after “updating/reordering”; very high cost
8
Data Model
• Offset: physical order (arrival/processing order)
• Timestamp: logical order (event-time)
• Key: optional for grouping
• Value: payload
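The four attributes above can be sketched as a small Java type. This is only an illustration, assuming Java 16+ records; the names `DataModelSketch`, `StreamRecord`, `byOffset`, and `byTimestamp` are hypothetical and are not taken from the paper or from Kafka.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DataModelSketch {
    // Hypothetical record type mirroring the four attributes on the slide.
    record StreamRecord(long offset, long timestamp, String key, String value) {}

    // Physical order: the order in which records arrived (by offset).
    static List<StreamRecord> byOffset(List<StreamRecord> records) {
        List<StreamRecord> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingLong(StreamRecord::offset));
        return sorted;
    }

    // Logical order: the order in which the events happened (by event-time).
    static List<StreamRecord> byTimestamp(List<StreamRecord> records) {
        List<StreamRecord> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparingLong(StreamRecord::timestamp));
        return sorted;
    }
}
```

The two orders differ as soon as a record with a smaller timestamp arrives at a larger offset, which is exactly the out-of-order case the model has to handle.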
9
Stream Processing Operators
• Stateless, order agnostic
  • filter, projection, flatMap
  • No special handling necessary
• Stateful, order sensitive
  • aggregation, joins, windowing
  • Need to handle out-of-order data
10
Data Stream Aggregation
• Model output of (windowed) aggregations as table
  • State is not internal but first-class citizen
  • Update stateful operator continuously
  • Emit changelog stream to downstream operators
• Streams, Tables, and Changelogs
  • Define operator semantics over changelogs and updating tables
  • Temporal operator semantics
11
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()

record stream: BIRTE

countTable:
url    count
BIRTE  1

changelog stream: <BIRTE,1>
12
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()

record stream: BIRTE, VLDB

countTable:
url    count
BIRTE  1
VLDB   1

changelog stream: <BIRTE,1>, <VLDB,1>
13
Example: Count Clicks per Page
countTable = stream.groupBy(r->url).count()

record stream: BIRTE, VLDB, VLDB

countTable:
url    count
BIRTE  1
VLDB   2

changelog stream: <BIRTE,1>, <VLDB,1>, <VLDB,2>
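The counting example can be replayed in plain Java. The following is a minimal sketch rather than the Kafka Streams implementation: `ClickCountSketch` and its members are hypothetical names, the table is an in-memory map, and the changelog is a list with one entry per processed record.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ClickCountSketch {
    // The table is first-class state: the latest count per url.
    final Map<String, Long> countTable = new LinkedHashMap<>();
    // The changelog stream: one <url,count> update per processed record.
    final List<Map.Entry<String, Long>> changelog = new ArrayList<>();

    // Process one click record in arrival order and emit the updated count immediately.
    void process(String url) {
        long updated = countTable.merge(url, 1L, Long::sum);
        changelog.add(Map.entry(url, updated));
    }

    public static void main(String[] args) {
        ClickCountSketch counter = new ClickCountSketch();
        for (String url : List.of("BIRTE", "VLDB", "VLDB")) {
            counter.process(url);
        }
        System.out.println(counter.countTable); // {BIRTE=1, VLDB=2}
        System.out.println(counter.changelog);  // [BIRTE=1, VLDB=1, VLDB=2]
    }
}
```

Replaying the changelog from the start reproduces the table, and each table update yields one changelog record; this is the stream-table duality in its simplest form.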
14
Example: Processing a Changelog Stream
countTable2 = countTable.filter(url='VLDB').toTable()

countTable:
url    count
BIRTE  1
VLDB   2

changelog stream: <BIRTE,1>, <VLDB,1>, <VLDB,2>

countTable2:
url    count
VLDB   2
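Filtering a changelog and materializing the result as a table can be sketched as follows. The class and method names are hypothetical. The handling of non-matching updates (removing any earlier row for that key) is an assumption made here for value-dependent predicates; it is not spelled out on the slide.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

class ChangelogFilterSketch {
    // Apply each changelog record as an upsert, keeping only updates that pass the predicate.
    static Map<String, Long> filterToTable(List<Map.Entry<String, Long>> changelog,
                                           BiPredicate<String, Long> predicate) {
        Map<String, Long> table = new LinkedHashMap<>();
        for (Map.Entry<String, Long> update : changelog) {
            if (predicate.test(update.getKey(), update.getValue())) {
                table.put(update.getKey(), update.getValue());
            } else {
                // Assumption: a row that no longer matches is dropped from the result.
                table.remove(update.getKey());
            }
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> changelog = List.of(
            Map.entry("BIRTE", 1L), Map.entry("VLDB", 1L), Map.entry("VLDB", 2L));
        System.out.println(filterToTable(changelog, (url, count) -> url.equals("VLDB"))); // {VLDB=2}
    }
}
```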
15
Example: Windowed Count
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize

record stream: BIRTE(3), VLDB(7), VLDB(4)

countTable:
window ID   count
<BIRTE,0>   1
<VLDB,0>    1
<VLDB,5>    1

changelog stream: <<BIRTE,0>,1>, <<VLDB,5>,1>, <<VLDB,0>,1>
16
Example: Out-of-Order Data
countTable = stream.groupBy(r->url).windowedBy(5sec).count()
windowID = <groupingKey,windowStartTimestamp>
windowStartTimestamp = floor(recordTimestamp / windowSize) * windowSize

record stream: BIRTE(1)

countTable:
window ID   count
<BIRTE,0>   2
<VLDB,0>    1
<VLDB,5>    1

changelog stream: <<BIRTE,0>,2>
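The windowed count, including the late record BIRTE(1), can be traced with a short sketch. It assumes tumbling 5-second windows whose start is the timestamp rounded down to a multiple of the window size (which is what window IDs such as <VLDB,5> imply), and it assumes the arrival order BIRTE(3), VLDB(7), VLDB(4), BIRTE(1). All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class WindowedCountSketch {
    static final long WINDOW_SIZE = 5; // seconds, as in windowedBy(5sec)

    // Table keyed by window ID <groupingKey,windowStartTimestamp>.
    final Map<String, Long> countTable = new LinkedHashMap<>();
    // Changelog records of the form <<groupingKey,windowStart>,count>.
    final List<String> changelog = new ArrayList<>();

    // Window start: the record timestamp rounded down to a multiple of the window size.
    static long windowStart(long timestamp) {
        return Math.floorDiv(timestamp, WINDOW_SIZE) * WINDOW_SIZE;
    }

    // Records are processed in arrival order; a late record simply updates its (older) window.
    void process(String url, long timestamp) {
        String windowId = "<" + url + "," + windowStart(timestamp) + ">";
        long updated = countTable.merge(windowId, 1L, Long::sum);
        changelog.add("<" + windowId + "," + updated + ">");
    }

    public static void main(String[] args) {
        WindowedCountSketch counter = new WindowedCountSketch();
        counter.process("BIRTE", 3);
        counter.process("VLDB", 7);
        counter.process("VLDB", 4);  // out of order with respect to VLDB(7)
        counter.process("BIRTE", 1); // out of order with respect to BIRTE(3)
        System.out.println(counter.changelog);
        // [<<BIRTE,0>,1>, <<VLDB,5>,1>, <<VLDB,0>,1>, <<BIRTE,0>,2>]
    }
}
```

No buffering or reordering is needed: the late BIRTE(1) record refines the earlier result for window <BIRTE,0> through an ordinary update record.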
17
Duality of Streams and Tables
18
Design Space
[Figure: design space with the three dimensions Cost, Correctness/Completeness, and Latency]
• Buffering and Reordering (Ref: CQL [1], Trill [2])
• Punctuations/Watermarks (Ref: Li et al. [3], Krishnamurthy et al. [4])
• Dual Streaming Model
  • continuous updates / changelogs
  • decouple latency from correctness
  • trade off latency and cost
  • trade off cost and completeness (retention time)
19
Stream-Table Transformations
See the paper for details…
20
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
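// Imports from org.apache.kafka.streams, org.apache.kafka.streams.kstream, and java.util.Arrays omitted
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> textLines = builder.stream("TextLinesTopic");
// A windowed aggregation is keyed by <word, window>, hence Windowed<String>
KTable<Windowed<String>, Long> wordCounts = textLines
    .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .windowedBy(TimeWindows.of(5_000L)) // window size in milliseconds
    .count();
// Writing windowed keys and Long values requires matching serdes to be configured
wordCounts.toStream().to("WordsWithCountsTopic");
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();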
21
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
CREATE TABLE click_count_per_url AS
  SELECT url, count(*)
  FROM click_stream
  WINDOW TUMBLING (SIZE 1 MINUTE)
  WHERE url LIKE '%confluent%' OR url LIKE '%hu-berlin%'
  GROUP BY url;
22
Implementation
• Implemented in Apache Kafka (v0.10)
• Kafka Streams / Streams API
• Leveraged in Confluent’s KSQL
• Widely adopted in industry
23
Summary
• We propose the Dual Streaming Model
• Handles out-of-order data within the processing model
• Optimized for low latency
• Streams and tables are dual
• Allows trading off processing cost, latency, and completeness
• Adopted in industry via Kafka Streams and KSQL
24
Thank You
We are hiring!
25
References
[1] Arvind Arasu, Shivnath Babu, and Jennifer Widom. 2003. CQL: A Language for Continuous Queries over Streams and Relations. Database Programming Languages, 9th Int. WS. 1–19.
[2] Badrish Chandramouli et al. 2014. Trill: A High-performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401–412.
[3] Jin Li et al. 2005. Semantics and Evaluation Techniques for Window Aggregates in Data Streams. Proc. of the ACM SIGMOD Int. Conf. on Management of Data. 311–322.
[4] Sailesh Krishnamurthy et al. 2010. Continuous Analytics over Discontinuous Streams. Proc. of the 2010 ACM SIGMOD Int. Conf. on Management of Data. 1081–1092.
[5] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. 2004. Query Languages and Data Models for Database Sequences and Data Streams. Proc. of the 30th Int. Conf. on Very Large Data Bases. 492–503.
[6] Shivnath Babu and Jennifer Widom. 2001. Continuous Queries over Data Streams. SIGMOD Record 30, 3 (2001), 109–120.
[7] Daniel Abadi et al. 2005. The Design of the Borealis Stream Processing Engine. CIDR, 2nd Biennial Conf. on Innovative Data Systems Research. 277–289.
26
Data Stream Types
27
Evolving Table
record stream: BIRTE(3), VLDB(7), VLDB(4), BIRTE(1)

table v3:
window ID   count
<BIRTE,0>   1

table v4:
window ID   count
<BIRTE,0>   1
<VLDB,0>    1

table v7:
window ID   count
<BIRTE,0>   1
<VLDB,0>    1
<VLDB,5>    1
28
Evolving Table
record stream: BIRTE(3), VLDB(7), VLDB(4), BIRTE(1)

table v1:
window ID   count
<BIRTE,0>   1

table v3:
window ID   count
<BIRTE,0>   2

table v4:
window ID   count
<BIRTE,0>   2
<VLDB,0>    1

table v7:
window ID   count
<BIRTE,0>   2
<VLDB,0>    1
<VLDB,5>    1
29
Table-Table Join
30
Stream-Stream Join
• Sliding window join, i.e., band join
• Window size specifies an additional timestamp-based join predicate
SELECT * FROM stream1, stream2
WHERE
  stream1.key = stream2.key
  AND stream1.ts - windowSize <= stream2.ts
  AND stream2.ts <= stream1.ts + windowSize
[Figure: stream1 and stream2 on a shared time axis, with the band of width windowSize around a stream1 record]
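The band-join predicate can be written out in Java as follows. This is a nested-loop sketch over two finite lists, chosen only to make the predicate explicit; it is not an online algorithm and not how a stream processor would evaluate the join. `BandJoinSketch`, `Event`, and `Joined` are hypothetical names, and records require Java 16+.

```java
import java.util.ArrayList;
import java.util.List;

class BandJoinSketch {
    record Event(String key, long ts, String value) {}
    record Joined(Event left, Event right) {}

    // Join records with equal keys whose timestamps are at most windowSize apart.
    static List<Joined> join(List<Event> stream1, List<Event> stream2, long windowSize) {
        List<Joined> result = new ArrayList<>();
        for (Event a : stream1) {
            for (Event b : stream2) {
                boolean sameKey = a.key().equals(b.key());
                boolean inBand = a.ts() - windowSize <= b.ts() && b.ts() <= a.ts() + windowSize;
                if (sameKey && inBand) {
                    result.add(new Joined(a, b));
                }
            }
        }
        return result;
    }
}
```

Because the predicate is symmetric in the two timestamps, the result does not depend on which of the two records arrives first.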
31
Stream-Table Join
• Temporal table “lookup” join
• For each stream record, look up a matching table record
• Join condition: streamRecord.key == tableRecord.key
• The join is temporal in the sense that the “correct” table version must be used
• i.e., the youngest table version that is before the stream record's timestamp
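A temporal lookup of this kind can be sketched with a sorted map of versions per key. The names are hypothetical, and one detail is an assumption: `floorEntry` selects the youngest version at or before the stream record's timestamp, whereas a strictly "before" reading would use `lowerEntry` instead.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

class TemporalLookupSketch {
    // key -> (version timestamp -> value); each table update adds a version.
    final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    void updateTable(String key, long timestamp, String value) {
        versions.computeIfAbsent(key, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Find the youngest table version for this key that is not newer than the stream record.
    Optional<String> lookup(String key, long streamTimestamp) {
        TreeMap<Long, String> history = versions.get(key);
        if (history == null) {
            return Optional.empty();
        }
        Map.Entry<Long, String> entry = history.floorEntry(streamTimestamp);
        return entry == null ? Optional.empty() : Optional.of(entry.getValue());
    }
}
```

Keeping older versions is what allows an out-of-order stream record to be joined against the table state that was valid at its event time; how long versions are kept corresponds to the retention time trade-off.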
32
Stream-Table Join