Crossing the Streams:
Foreign-Key Joins with Kafka Streams
John Roesler
Software Engineer @ Confluent
Agenda
01. The missing join: Foreign-Key Join
02. The current join: Equi-Join
03. The problem with FK Join
04. The solution for FK Join
05. Testing
06. Case Study: Bazaarvoice
Foreign-Key Join
albums (AlbumId [primary key], Title, ArtistId)
tracks (TrackId, Name, AlbumId [foreign key], Composer, Bytes, UnitPrice)
SELECT * FROM Tracks
JOIN Albums ON Tracks.AlbumId = Albums.AlbumId
Foreign-Key Join
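KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
    tracks.join(albums,
        Track::getAlbumId,          // foreign-key extractor: Track -> AlbumId
        TrackWithAlbum::joiner);    // value joiner: (Track, Album) -> TrackWithAlbum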
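The semantics of this join can be sketched without Kafka Streams at all. The plain-Java example below is a minimal illustration, not the library's implementation: FkJoinSketch, fkJoin, and the sample data are hypothetical names, and each track value is reduced to just its album ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiFunction;
import java.util.function.Function;

class FkJoinSketch {
    // Inner foreign-key join over two in-memory tables: each left row is
    // matched to the right row whose primary key equals the foreign key
    // extracted from the left value. Left rows without a match are dropped.
    static <K, V, KO, VO, VR> Map<K, VR> fkJoin(
            Map<K, V> left,
            Map<KO, VO> right,
            Function<V, KO> foreignKeyExtractor,
            BiFunction<V, VO, VR> joiner) {
        Map<K, VR> result = new LinkedHashMap<>();
        for (Map.Entry<K, V> row : left.entrySet()) {
            VO match = right.get(foreignKeyExtractor.apply(row.getValue()));
            if (match != null) {
                result.put(row.getKey(), joiner.apply(row.getValue(), match));
            }
        }
        return result;
    }

    // Hypothetical sample data: trackId -> albumId, and albumId -> title.
    static Map<String, String> demo() {
        Map<String, String> tracks = new LinkedHashMap<>();
        tracks.put("t1", "a1");
        tracks.put("t2", "a1");
        tracks.put("t3", "a9"); // references an album that does not exist
        Map<String, String> albums = new LinkedHashMap<>();
        albums.put("a1", "First Album");
        return fkJoin(tracks, albums,
                albumId -> albumId,
                (albumId, title) -> albumId + ":" + title);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Many tracks can point at the same album, so one right-hand row may appear in several results; that fan-out is what separates this join from a key-to-key join.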
02. The current join: Equi-Join
Equi Join
track-meta (TrackId, Name, AlbumId, Composer, Bytes)
JOIN
track-pricing (TrackId, UnitPrice)
produces
tracks (TrackId, Name, AlbumId, Composer, Bytes, UnitPrice)
Equi Join
KTable<TrackId, TrackMeta> tracksMetadata = …
KTable<TrackId, TrackStore> tracksPricing = …
KTable<TrackId, Track> tracks =
    tracksMetadata.join(tracksPricing,
        Track::joiner);    // value joiner: (TrackMeta, TrackStore) -> Track
Big Data Processing == Partitioning
Input: A: 9, B: 2, C: 4, A: 6, D: 8
Partition 0: A: 9, C: 4, A: 6
Partition 1: B: 2, D: 8
Partitioned Equi Join
Left: A: 9, B: 2, C: 4, A: 6, D: 8
Right: A: α, B: β, C: γ, A: ξ, D: σ
Partition 0
  Left: A: 9, C: 4, A: 6
  Right: A: α, C: γ, A: ξ
  Join: A: (9,α), C: (4,γ), A: (6,ξ)
Partition 1
  Left: B: 2, D: 8
  Right: B: β, D: σ
  Join: B: (2,β), D: (8,σ)
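The partitioned equi join can be sketched in plain Java: both tables are split by the same key with the same partitioner, so each partition can join locally and never needs a row from another partition. This is a minimal sketch under stated assumptions: PartitionedEquiJoinSketch and partitionOf are hypothetical, the partitioner is a toy hash (not Kafka's default), and only the latest value per key is kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PartitionedEquiJoinSketch {
    static final int PARTITIONS = 2;

    // Assumption: a toy hash partitioner, not Kafka's default partitioner.
    static int partitionOf(String key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // Splits a table into PARTITIONS smaller tables by key.
    static List<Map<String, String>> partition(Map<String, String> table) {
        List<Map<String, String>> parts = new ArrayList<>();
        for (int i = 0; i < PARTITIONS; i++) {
            parts.add(new TreeMap<String, String>());
        }
        for (Map.Entry<String, String> row : table.entrySet()) {
            parts.get(partitionOf(row.getKey())).put(row.getKey(), row.getValue());
        }
        return parts;
    }

    // Joins partition p of the left table only against partition p of the right table.
    static Map<String, String> partitionedJoin(Map<String, String> left,
                                               Map<String, String> right) {
        List<Map<String, String>> leftParts = partition(left);
        List<Map<String, String>> rightParts = partition(right);
        Map<String, String> joined = new TreeMap<>();
        for (int p = 0; p < PARTITIONS; p++) {
            for (Map.Entry<String, String> row : leftParts.get(p).entrySet()) {
                String match = rightParts.get(p).get(row.getKey());
                if (match != null) {
                    joined.put(row.getKey(), "(" + row.getValue() + "," + match + ")");
                }
            }
        }
        return joined;
    }

    // Latest values per key from the slide, with Greek letters spelled out.
    static Map<String, String> demo() {
        Map<String, String> left = new TreeMap<>();
        left.put("A", "6");
        left.put("B", "2");
        left.put("C", "4");
        left.put("D", "8");
        Map<String, String> right = new TreeMap<>();
        right.put("A", "xi");
        right.put("B", "beta");
        right.put("C", "gamma");
        right.put("D", "sigma");
        return partitionedJoin(left, right);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Every key finds its match even though no partition looks outside itself, because equal keys always hash to the same partition on both sides.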
03. The problem with FK Join
Partitioned Foreign-Key Join?
Left: A: 9, B: 2, C: 4, A: 6, D: 9
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
Partition 0
  Left: A: 9, C: 4, A: 6
  Right: ?
  Join: ?
Partition 1
  Left: B: 2, D: 9
  Right: ?
  Join: ?
Partitioned Foreign-Key Join
Left: A: 9, B: 2, C: 4, A: 6, D: 8
  Partition 0: A: 9, C: 4, A: 6
  Partition 1: B: 2, D: 8
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
  Partition 0: 9: α, 9: σ
  Partition 1: 4: β, 3: γ, 6: ξ
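The mismatch can be checked mechanically. The sketch below partitions each left record by its own key and asks where its foreign key would land on the right side. CoPartitioningCheckSketch, partitionOf, and misplaced are hypothetical names, and the toy hash partitioner is an assumption, so the partition numbers need not match the slide.

```java
import java.util.ArrayList;
import java.util.List;

class CoPartitioningCheckSketch {
    static final int PARTITIONS = 2;

    // Assumption: a toy hash partitioner, not Kafka's default partitioner.
    static int partitionOf(Object key) {
        return Math.floorMod(key.hashCode(), PARTITIONS);
    }

    // Returns the left records whose foreign key lives in a different
    // partition than the record itself.
    static List<String> misplaced(String[] leftKeys, int[] foreignKeys) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < leftKeys.length; i++) {
            if (partitionOf(leftKeys[i]) != partitionOf(foreignKeys[i])) {
                result.add(leftKeys[i] + ":" + foreignKeys[i]);
            }
        }
        return result;
    }

    static List<String> demo() {
        // Left records from the slide: A: 9, B: 2, C: 4, A: 6, D: 8.
        return misplaced(new String[] {"A", "B", "C", "A", "D"},
                         new int[] {9, 2, 4, 6, 8});
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Any record this check flags cannot be joined locally, which is why partitioning both tables by their own keys is not enough for a foreign-key join.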
04. The solution for FK Join
Partitioned Foreign-Key Join
Left: A: 9, B: 9, C: 4, A: 6, D: 8
Right: 9: α, 4: β, 3: γ, 6: ξ, 9: σ
subscribe (Left to Right)
Subscriptions: 9: A, 9: B, 4: C, 6: A, 8: D
update (Right to Left)
updates: A: α, B: α, C: β, A: ξ, D: null
Join: A: (9,α), B: (9,α), C: (4,β), A: (6,ξ), D: (8,null)
Partitioned Foreign-Key Join (step by step)
Start: Left holds A: 9, Right holds 9: α, Subscriptions holds 9: A, and Join holds A: (9,α).
● B: 9 is added to Left.
● Left sends the subscription 9: B to Right (subscribe).
● Right stores it; Subscriptions now holds 9: A and 9: B.
● Right sends the update B: α back to Left (update).
● The update B: α arrives at the join.
● Join emits B: (9,α); Join now holds A: (9,α) and B: (9,α).
● Right receives 9: β, which replaces 9: α.
● Right sends the updates A: β and B: β to the subscribers of key 9 (update).
● The updates A: β and B: β arrive at the join.
● Join emits A: (9,β) and B: (9,β).
End: Left holds A: 9 and B: 9, Right holds 9: β, Subscriptions holds 9: A and 9: B, and Join holds A: (9,β) and B: (9,β).
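The subscribe/update exchange can be simulated in a single process. The sketch below is an illustration of the idea on the slides, not the Kafka Streams implementation: FkJoinProtocolSketch and its methods are hypothetical, direct method calls stand in for messages sent between partitions, and deletes, ordering races, and fault tolerance are left out. Dropping a stale subscription when a left record changes its foreign key is an assumption the slides do not show.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class FkJoinProtocolSketch {
    final Map<String, Integer> left = new TreeMap<>();                // left key -> foreign key
    final Map<Integer, String> right = new TreeMap<>();               // right key -> right value
    final Map<Integer, Set<String>> subscriptions = new TreeMap<>();  // right key -> subscribed left keys
    final Map<String, String> join = new TreeMap<>();                 // left key -> joined result

    // A left record arrives: subscribe on the right side, which answers
    // with its current value for that foreign key.
    void updateLeft(String leftKey, int foreignKey) {
        Integer previous = left.put(leftKey, foreignKey);
        if (previous != null && previous != foreignKey) {
            subscriptions.get(previous).remove(leftKey); // assumption: drop the stale subscription
        }
        Set<String> subscribers = subscriptions.get(foreignKey);
        if (subscribers == null) {
            subscribers = new LinkedHashSet<>();
            subscriptions.put(foreignKey, subscribers);
        }
        subscribers.add(leftKey);
        receiveUpdate(leftKey, right.get(foreignKey));
    }

    // A right record arrives: push the new value to every subscriber of its key.
    void updateRight(int rightKey, String value) {
        right.put(rightKey, value);
        Set<String> subscribers = subscriptions.get(rightKey);
        if (subscribers != null) {
            for (String leftKey : subscribers) {
                receiveUpdate(leftKey, value);
            }
        }
    }

    // Back on the left side: combine the left record with the right value.
    void receiveUpdate(String leftKey, String rightValue) {
        join.put(leftKey, "(" + left.get(leftKey) + "," + rightValue + ")");
    }

    // Replays the walkthrough, with Greek letters spelled out.
    static Map<String, String> demo() {
        FkJoinProtocolSketch sketch = new FkJoinProtocolSketch();
        sketch.updateRight(9, "alpha");
        sketch.updateLeft("A", 9);
        sketch.updateLeft("B", 9);
        sketch.updateRight(9, "beta");
        return sketch.join;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Each left record costs one small subscription entry on the right side, and a right-side change is sent only to the records that subscribed to that key.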
05. Testing
Testing
KTable<TrackId, Track> tracks = …
KTable<AlbumId, Album> albums = …
KTable<TrackId, TrackWithAlbum> tracksWithAlbums =
    tracks.join(albums,
        Track::getAlbumId,
        TrackWithAlbum::joiner);

try (TopologyTestDriver driver = new TopologyTestDriver(...)) {
    // Test topics wrap the topology's input and output topics.
    var trackInput = driver.createInputTopic(...);
    var albumInput = driver.createInputTopic(...);
    var result = driver.createOutputTopic(...);

    // Two tracks reference album "a1"; the album arrives last.
    trackInput.pipeInput("t1", new Track("a1"));
    trackInput.pipeInput("t2", new Track("a1"));
    albumInput.pipeInput("a1", new Album(...));

    // Both tracks end up joined to the same album.
    assertThat(
        result.readKeyValuesToMap(),
        is(Map.of(
            "t1", pair(track1, album1),
            "t2", pair(track2, album1)
        ))
    );
}
06. Case Study: Bazaarvoice
Case Study: Bazaarvoice
● Early Relational Streaming adopter
○ In-house streaming platform
○ Periodic bulk DB query jobs
○ Spark, Hadoop, etc.
● Large dataset, healthy update rate
○ 100s of Millions of Products
○ 100s of Billions of Reviews
○ Updates: 10s of Millions a day, at least
○ Views: ludicrous
● Join-heavy workload (high cardinality)
○ Product -> Review fan-out can be 100s of Millions
Case Study: Bazaarvoice
● Product
○ Name
○ Description
○ URL
○ Average Rating
● Review
○ ProductId
○ Text
○ Rating
○ Product Name
Average Rating (aggregation)
KTable<ReviewId, Review> reviews;
KTable<ProductId, Product> products;
KTable<ProductId, Double> avgRatings =
    reviews
        .groupBy(Review::getProductId)
        .reduce(averageRatings);
KTable<ProductId, ViewProduct> result =
    avgRatings.join(products);
Data flow: reviews -> groupBy(productId) -> reduce(avg) -> avgRatings; avgRatings JOIN products -> result
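The slide's reduce(averageRatings) is shorthand. Two averages cannot be merged without knowing how many ratings each one covers, so a table aggregation usually keeps a sum and a count and derives the average from them. The plain-Java sketch below shows that bookkeeping with an "add" step and a "remove" step, in the spirit of the adder and subtractor a grouped-table aggregation takes; AverageRatingSketch and RatingStats are hypothetical names.

```java
class AverageRatingSketch {
    // Running aggregate for one product: sum and count allow the average
    // to be updated incrementally when a review is added or retracted.
    static final class RatingStats {
        final double sum;
        final long count;

        RatingStats(double sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        // Adder: a review with this rating now belongs to the product.
        RatingStats add(double rating) {
            return new RatingStats(sum + rating, count + 1);
        }

        // Subtractor: a review with this rating was changed or removed.
        RatingStats remove(double rating) {
            return new RatingStats(sum - rating, count - 1);
        }

        double average() {
            return count == 0 ? 0.0 : sum / count;
        }
    }

    static double demo() {
        RatingStats stats = new RatingStats(0.0, 0)
                .add(5.0)
                .add(3.0)
                .add(4.0);
        // One review is edited from 3 stars to 1 star: retract the old
        // rating, then add the new one.
        stats = stats.remove(3.0).add(1.0);
        return stats.average();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Retract-then-add is what keeps the average correct when a review is edited or moved to another product, rather than only when new reviews arrive.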
Product Name (join)
KTable<ProductId, Set<ReviewId>> productReviews =
    reviews
        .groupBy(Review::getProductId)
        .reduce(collectReviewIdsSet);       // all reviews for each product
KTable<ProductId,
       Pair<String, Set<ReviewId>>> toExplode =
    products
        .mapValues(Product::getName)
        .join(productReviews);              // all reviews and product name for each product
KTable<ReviewId, String> reviewsToProductNames =
    toExplode.flatMap( name, reviewSet ->   // product name for each review
        for (reviewId : reviewSet)
            forward(reviewId, name);
    );
KTable<ReviewId, ViewReview> result =
    reviews.join(reviewsToProductNames);
Costs called out on the slides: repartition (twice); store and transmit entire set.
Product Name (join)
KTable<ProductId, String> productNames =
    products.mapValues(Product::getName);
KTable<ReviewId, ViewReview> result =
    reviews.join(productNames,
        Review::getProductId);    // foreign-key extractor (value joiner elided)
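A back-of-the-envelope sketch of why "store and transmit entire set" hurts at high fan-out. It assumes, as a simplification, that the set-collecting workaround re-emits a product's whole review-ID set each time one review is added, with no caching or batching, while the foreign-key join sends one subscription per review. FanOutCostSketch and both method names are hypothetical, and the numbers are illustrative, not measurements.

```java
class FanOutCostSketch {
    // Workaround: after the k-th review arrives, a set of k IDs is emitted,
    // so the total is 1 + 2 + ... + n review IDs.
    static long idsSentCollectingSets(int reviews) {
        long total = 0;
        for (int setSize = 1; setSize <= reviews; setSize++) {
            total += setSize;
        }
        return total;
    }

    // Foreign-key join: each review sends a single subscription.
    static long idsSentWithSubscriptions(int reviews) {
        return reviews;
    }

    public static void main(String[] args) {
        int reviews = 1000;
        System.out.println("collecting sets: " + idsSentCollectingSets(reviews));
        System.out.println("subscriptions:   " + idsSentWithSubscriptions(reviews));
    }
}
```

Under these assumptions the workaround grows quadratically with the number of reviews per product, while the subscription approach grows linearly.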
Coming soon to ksqlDB!
SELECT * FROM
  Reviews JOIN Products
  ON Reviews.ProductID = Products.ID
Thanks to the authors of KIP-213!
● Jan Filipiak (Oct 2017)
● Adam Bellemare (July 2018)
● Accepted Oct 2019
● Released in 2.4.0 Dec 2019
Thank you!
john@confluent.io
vvcephei@apache.org
cnfl.io/meetups cnfl.io/slack cnfl.io/blog

Crossing the Streams: the New Streaming Foreign-Key Join Feature in Kafka Streams (John Roesler, Confluent) Kafka Summit 2020
