A Brief History of
Stream Processing
TimeSeries Meetup, Tallinn, Estonia, Europe, Earth,
Milky Way, Universe … 42
Who I Am
Riccardo Tommasini

Research Fellow @ UT

Future Assistant Professor

Enthusiast!
Timeline
2002 2005 2006 2007 2008 2010 2014 2015 2018
Precision not Recall
The Origin
The Vision
A Brief History of Stream Processing
Stream Processing + AI
(Deductive)
The Title
Stream Processing + AI
(Inductive)
A Brief History of Stream Processing
Big Stream Processing
Starts
• S. Murthy et al. Pulsar – Real-Time Analytics at Scale. Technical report, eBay, 2015.
• Summingbird: A Framework for Integrating Batch and Online MapReduce Computations.
• Trill: A High-Performance Incremental Query Processor for Diverse Analytics.
• TelegraphCQ: Continuous Dataflow Processing.
• NiagaraCQ: A Scalable Continuous Query System for Internet Databases.
• Aurora: A New Model and Architecture for Data Stream Management
Our Focus
Is this too
ambitious?
A (Query) Language Perspective
It addresses the problem of manipulating and managing data streams. It stresses the operations that are necessary to build applications. Some assumptions are made about the underlying system(s).
A System Perspective
It addresses the problem of dealing with unbounded data. It discusses the system primitives that are necessary to guarantee low latency and fault tolerance in the presence of *uncertainty*, e.g., late arrivals.
A new hybrid hope perspective
It addresses the problems in between: how do we map the abstractions of a high-level language to the underlying system?
Goals
CQL (Continuous Query Language)
• Exploit the lessons learnt in RDBMS theory.
• Simple, clear, yet powerful language.
• Efficiently combine streams and relations.
DSM (Dual Stream Model)
• Low latency is paramount.
• Take data distribution and ordering into account.
• Avoid implicit operator semantics.
DFM (Dataflow Model)
• Correctness first, with session support.
• Latency requirements vs. resource cost should dictate system choice.
• Single processing model for bounded and unbounded data.
Time
CQL
• DSMS Clock Time, i.e., the timestamp assigned by the DSMS.
• Source Time, i.e., the timestamp assigned by the source.
• Source Heartbeats are used to declare that no element with a lower timestamp will be sent.
DSM
• Logical Order, induced by the timestamp assigned by the data source.
• Physical Order, induced by the timestamp assigned by the system at ingestion.
DFM
• Event Time is the time at which the event itself actually occurred.
• Processing Time is the time at which an event is observed during processing.
• Watermark, i.e., a global progress metric that tracks the skew between event time (ET) and processing time (PT).
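The relation between event time, processing time, and a watermark can be sketched in a few lines. This is only an illustration under stated assumptions: the watermark here is a simple heuristic that trails the largest event time seen so far by a fixed bound, and `Event`, `classify`, and `max_delay` are hypothetical names, not part of any of the three models.

```python
from dataclasses import dataclass


@dataclass
class Event:
    key: str
    event_time: int       # when the event occurred (assigned by the source)
    processing_time: int  # when the system observed it


def classify(events, max_delay):
    """Return (event, watermark_before_arrival, is_late) in processing-time order.

    Assumption: watermark = (largest event time seen so far) - max_delay.
    """
    watermark = float("-inf")
    max_seen = float("-inf")
    out = []
    for ev in sorted(events, key=lambda e: e.processing_time):
        is_late = ev.event_time < watermark
        out.append((ev, watermark, is_late))
        max_seen = max(max_seen, ev.event_time)
        watermark = max_seen - max_delay
    return out


events = [
    Event("a", event_time=1, processing_time=10),
    Event("b", event_time=5, processing_time=11),
    Event("c", event_time=2, processing_time=12),  # out of order, within the bound
    Event("d", event_time=9, processing_time=13),
    Event("e", event_time=3, processing_time=14),  # behind the watermark: late
]
late_keys = [ev.key for ev, _, late in classify(events, max_delay=4) if late]
```

With `max_delay=4`, only element `e` arrives after the watermark has passed its event time; a larger bound would tolerate more disorder at the price of higher latency.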
Abstractions
CQL
• Streams
• (Time-Varying) Relations*
DSM
• Streams
• Change-log Streams
• Record Streams
• Tables
DFM
• PCollections (bounded and unbounded)
CQL in 3 Slides
A Stream S is a possibly infinite multiset of elements <s,t>, where s is a tuple belonging to the schema of S and t is a timestamp.
A Relation R is a set of tuples (d1, d2, ..., dn), where each element dj is a member of Dj, a data domain (1).
(1) A data domain is the set of all values that a data element may contain.
Ok, CQL in 5 Slides
A Stream S is a possibly infinite multiset of elements <s,t>, where s is a tuple belonging to the schema of S and t is a timestamp.
A Relation R is a mapping from each time instant in the time domain T to a finite but unbounded bag of tuples belonging to the schema of R.
CQL in 5 Slides
[Diagram: streams, relations, and the operator classes between them]
• Streams: infinite, unbounded sequences of elements <s,τ>.
• Relations: finite bags of tuples; a relation is a mapping T → R, written R(t).
• Stream-to-relation: sliding windows.
• Relation-to-relation: relational algebra (almost).
• Relation-to-stream: *Stream operators.
CQL in 6 Slides
Stream-to-Relation Operators:

• Sliding Window:

FROM S [RANGE 5 Minutes]

• Parametric Sliding Window:

FROM S [RANGE 5 Minutes SLIDE 1 Minute]

• Partitioned Window:

FROM S [PARTITION BY A1, ..., An ROWS m]
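A time-based sliding window turns a stream into a time-varying relation, which can be sketched as follows. This is a minimal illustration rather than CQL's reference semantics: the function name `sliding_window`, the `until` horizon, and the half-open boundary convention (t - width, t] are assumptions made for the example.

```python
def sliding_window(stream, width, slide, until):
    """Map each evaluation time t (a multiple of `slide`) to the bag of
    tuples whose timestamp falls in the interval (t - width, t]."""
    relation = {}
    for t in range(slide, until + 1, slide):
        relation[t] = [s for (s, ts) in stream if t - width < ts <= t]
    return relation


# A toy stream of <s, t> elements; timestamps are in minutes.
S = [("s1", 1), ("s2", 2), ("s3", 4), ("s4", 5), ("s5", 7)]

# Analogue of: FROM S [RANGE 5 Minutes SLIDE 1 Minute], evaluated up to minute 7.
R = sliding_window(S, width=5, slide=1, until=7)
```

Here `R` plays the role of the relation R(t): `R[5]` holds the four elements of the first five minutes, while `R[7]` has already dropped `s1` and `s2`.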
CQL in 6 Slides
R2R operator
[Figure: a sliding window W(ω,β) over the elements s1 ... s12 of stream S along the time axis t; ω is the window width and β is the slide.]
CQL in 6 Slides
Relation-to-Stream Operators:

• Rstream: streams out all the data in the last step.

• Istream: streams out the data in the last step that was not in the previous step, i.e., streams out what is new.

• Dstream: streams out the data in the previous step that is not in the last step, i.e., streams out what is old.
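The three relation-to-stream operators can be sketched as bag differences between two consecutive states of a relation. This is a hedged illustration: the lowercase function names and the use of `collections.Counter` as a bag are choices made for the example, not CQL syntax.

```python
from collections import Counter


def rstream(prev, curr):
    """Everything in the current step."""
    return list(curr)


def istream(prev, curr):
    """Tuples in the current step that were not in the previous one (what is new)."""
    return list((Counter(curr) - Counter(prev)).elements())


def dstream(prev, curr):
    """Tuples in the previous step that are no longer present (what is old)."""
    return list((Counter(prev) - Counter(curr)).elements())


prev = ["a", "b", "b"]  # relation at the previous step (a bag, so duplicates count)
curr = ["b", "c"]       # relation at the last step

new_tuples = istream(prev, curr)
old_tuples = dstream(prev, curr)
all_tuples = rstream(prev, curr)
```

Because relations are bags, one of the two copies of `b` counts as deleted even though `b` is still present in the last step.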
DFM in X<42* Slides
*overestimation
Reasoning about time

• What results are being computed?

• Where in event time are they being computed?

• When in processing time are they being materialised?

• How do “earlier results” relate to “later refinements”?
DFM in X Slides
(WHAT) Processing Model
A Stream is represented as a possibly unbounded collection of key-value pairs, i.e., a PCollection<K,V>.

PTransforms are PCollection-to-PCollection operations.
DFM in X Slides
(WHERE) Windowing Model
DFM in X Slides
(WHERE) Windowing Model
Wall of Text Disclaimer
• The DFM windowing model requires more extensive primitives than CQL's.

• Window Assignment, i.e., each element is assigned to a corresponding window.

• Window Merge:

  • Drop Timestamp: only the window interval is relevant from here on.
  • GroupBy Key: to enable parallel execution.
  • Window Merge (the merge logic depends on the window strategy).
  • GroupByWindow: to ensure elements are processed in sequence, window-wise.
  • ExpandToElements: assigns a valid timestamp to the elements.
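The windowing steps listed above can be sketched for session windows, where merging actually matters. This is a simplified, single-process illustration: the function names, the `gap` parameter, and the tuple layout are assumptions for the example, and the final ExpandToElements step is left out.

```python
from collections import defaultdict


def assign_windows(elements, gap):
    """Window Assignment: each element gets a proto-session [ts, ts + gap)."""
    return [(key, value, (ts, ts + gap)) for key, value, ts in elements]


def merge_windows(windows):
    """Window Merge for sessions: fuse overlapping intervals."""
    merged = []
    for start, end in sorted(windows):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def group_by_key_and_window(elements, gap):
    by_key = defaultdict(list)  # GroupBy Key
    for key, value, window in assign_windows(elements, gap):
        # Drop Timestamp: from here on only the window interval is kept.
        by_key[key].append((value, window))
    result = {}
    for key, items in by_key.items():
        sessions = merge_windows([w for _, w in items])
        grouped = {s: [] for s in sessions}  # GroupByWindow
        for value, (start, _) in items:
            for s in sessions:
                if s[0] <= start < s[1]:
                    grouped[s].append(value)
                    break
        result[key] = grouped
    return result


elements = [("k1", "v1", 1), ("k1", "v2", 3), ("k1", "v3", 10), ("k2", "v4", 2)]
sessions = group_by_key_and_window(elements, gap=5)
```

For key `k1`, the proto-sessions [1, 6) and [3, 8) overlap and are merged into [1, 8), while the element at time 10 starts a session of its own.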
DFM in X Slides
(WHEN) Triggering Model
• In order to build unaligned event-time windows (windows that apply across a subset of the data), DFM is forced to decouple the reporting of the window content.

• To do so, DFM introduces an orthogonal model that signals when a window is ready to be processed.

• Triggers solve this issue by allowing DFM users to write arbitrary logic that signals the completion of a window.
DFM in X Slides
(HOW) Triggering Model (part 2)
Wall of Text Disclaimer
In addition to the triggering semantics, DFM introduces different refinement modes to deal with late arrivals:

• Discarding, i.e., upon triggering, window contents are discarded and later results bear no relation to previous results.

• Accumulating, i.e., upon triggering, window contents are left intact in a persistent state and later results become a refinement of previous results.

• Accumulating & Retracting, i.e., it extends the accumulating semantics with a retraction of the previous value.
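The difference between the three refinement modes shows up when a window fires more than once. The sketch below assumes a summing window and represents each firing as a "pane" of newly arrived values; `refine`, the mode strings, and the representation of a retraction as a negative number are all choices made for the illustration.

```python
def refine(panes, mode):
    """Emit one output list per trigger firing for a summing window.

    `panes` lists the values that arrived between consecutive firings.
    """
    outputs, total = [], 0
    for pane in panes:
        pane_sum = sum(pane)
        if mode == "discarding":
            # State is thrown away: each result covers only the new values.
            outputs.append([pane_sum])
        elif mode == "accumulating":
            # State is kept: each result refines the previous one.
            total += pane_sum
            outputs.append([total])
        elif mode == "accumulating_retracting":
            previous = total
            total += pane_sum
            # Retract the previous value (if any), then emit the new one.
            outputs.append(([-previous] if outputs else []) + [total])
        else:
            raise ValueError(mode)
    return outputs


panes = [[5, 1], [3], [8]]  # three firings of the same window
discarding = refine(panes, "discarding")
accumulating = refine(panes, "accumulating")
retracting = refine(panes, "accumulating_retracting")
```

A downstream consumer that sums everything it receives gets the right total (17) from the discarding and retracting outputs, but would over-count with plain accumulation.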
– Do the math
“X ~ 9 (7 + 2 WoT)”
Was this too
ambitious?
ZZZZ
A Brief History of Stream Processing
Dual Stream Model
The design space
Should I Take a Step back?
A Conceptual View of Kafka
• Producers send messages to topics

• Consumers read messages from topics

• Messages are key-value pairs

• Topics are streams of messages

• The Kafka cluster manages topics
Courtesy of Emanuele Della Valle - http://emanueledellavalle.org

A Logical View of Kafka
• Brokers are the main storage and messaging components of the Kafka cluster
Courtesy of Emanuele Della Valle - http://emanueledellavalle.org
Reconciling the two views of Kafka
• Topics are partitioned across brokers

• Producers shard messages over the partitions of a given topic

• Typically, the message key determines which partition a message is assigned to
Courtesy of Emanuele Della Valle - http://emanueledellavalle.org
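Key-based sharding can be sketched as hashing the key modulo the number of partitions. This is an illustration only: `partition_for` is a hypothetical helper, and CRC32 is a stand-in for whatever hash a real Kafka client uses.

```python
import zlib


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in hash (CRC32); not the hash a real Kafka client applies.
    return zlib.crc32(key) % num_partitions


messages = [(b"alice", b"1"), (b"bob", b"2"), (b"alice", b"3")]

# One list per partition of a hypothetical 3-partition topic.
partitions = {p: [] for p in range(3)}
for key, value in messages:
    partitions[partition_for(key, 3)].append((key, value))
```

Since the partition depends only on the key, all messages with the same key land in the same partition and keep their relative order there.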
Topic partitioning invites distributed consumption
• Different Consumers can read data from the same Topic

• By default, each Consumer will receive all the messages in the Topic

• Multiple Consumers can be combined into a Consumer Group

• Consumer Groups provide scaling capabilities

• Each Consumer is assigned a subset of Partitions for consumption
Courtesy of Emanuele Della Valle - http://emanueledellavalle.org
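How a Consumer Group splits the partitions of a topic can be sketched with a round-robin assignment. This is a simplification for illustration: `assign` is a hypothetical function, and real Kafka assignment strategies are configurable and handle rebalancing, which is ignored here.

```python
def assign(partitions, consumers):
    """Spread partitions over the consumers of one group, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


# Four partitions shared by a group of two consumers.
group = assign(partitions=[0, 1, 2, 3], consumers=["c1", "c2"])

# A consumer outside any group would instead read every partition.
standalone = assign(partitions=[0, 1, 2, 3], consumers=["solo"])
```

Each partition is read by exactly one consumer of the group, so adding consumers (up to the number of partitions) spreads the load.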
Dual Stream Model
The intuition
https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
Dual Stream Model
The Truth about Streams
• Stream: an unbounded and ordered sequence of key-value pairs.
  • Change-log Stream: a stream whose records are updates to a table; records are identified by a primary key.
  • Record Stream: a stream whose records are facts; records are not identified by a primary key.
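The two readings of a stream can be contrasted on the same sequence of key-value pairs. A minimal sketch with made-up data: as a record stream every pair is an independent fact, while as a change-log stream each pair overwrites the row with the same key.

```python
from collections import Counter

records = [("alice", 1), ("bob", 4), ("alice", 3)]

# Read as a record stream: every record is an independent fact.
facts_per_key = Counter(key for key, _ in records)
total = sum(value for _, value in records)

# Read as a change-log stream: each record upserts the row with that key.
table = {}
for key, value in records:
    table[key] = value
```

Under the first reading `alice` contributed two facts summing to 4; under the second, her current value is simply 3.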
Dual Stream Model
The Truth about Tables
A table is a collection of table versions, one version for each point in time, using the timestamp as a version number.
T(1) = {T5 = {⟨A,7.2⟩}}
T(2) = {T5 = {⟨A,7.2⟩}, T6 = {⟨B,14.7⟩}}
T(3) = {T5 = {⟨A,7.2⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}}
T(4) = {T3 = {⟨B,12.1⟩}, T5 = {⟨A,7.2⟩, ⟨B,12.1⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}}
T(5) = {T3 = {⟨B,12.1⟩}, T5 = {⟨A,7.2⟩, ⟨B,12.1⟩}, T6 = {⟨A,8.9⟩, ⟨B,14.7⟩}, T8 = {⟨A,8.9⟩, ⟨B,16.7⟩}}
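One possible reading of table versions is sketched below: version T_t holds, for every key, the value of the update with the largest timestamp not greater than t, so an out-of-order update also patches the later versions it is still visible in. The update sequence is inferred from the example and is an assumption, as are the function and variable names.

```python
def table_versions(updates):
    """Return {timestamp: table version} for updates given in arrival order.

    Each update is (key, value, timestamp). Version T_t holds, for every key,
    the value of the update with the largest timestamp <= t.
    """
    # Stable sort: among equal timestamps, arrival order is preserved.
    by_time = sorted(updates, key=lambda u: u[2])
    versions = {}
    for t in sorted({ts for _, _, ts in updates}):
        # Later entries overwrite earlier ones, so the newest value wins.
        versions[t] = {k: v for k, v, ts in by_time if ts <= t}
    return versions


updates = [
    ("A", 7.2, 5),
    ("B", 14.7, 6),
    ("A", 8.9, 6),
    ("B", 12.1, 3),  # arrives late: its timestamp is older than what was seen
    ("B", 16.7, 8),
]
final = table_versions(updates)
after_four = table_versions(updates[:4])
```

Under this reading, `final` has the four versions T3, T5, T6, and T8 listed for T(5), and the late update for B shows up in T5 without disturbing T6.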
Dual Stream Model
The Truth about Operations
Stateless Operations
Dual Stream Model
The Truth about Operations
Stateful Operations
Dual Stream Model
The Complete Picture
https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
Dual Stream Model
Result Correctness vs Runtime Cost
The DSM simplifies reasoning about transformations, but it does not solve the unboundedness problem.

We would still need infinite memory to process an infinite stream.

The DSM introduces the retention time to make the trade-off explicit.
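The trade-off can be sketched with a toy aggregation that only keeps state for a bounded retention period. This is an illustration under assumptions: `aggregate_with_retention` and its eviction rule (drop anything older than the largest timestamp seen minus `retention`) are hypothetical and much simpler than a real implementation.

```python
def aggregate_with_retention(records, retention):
    """Count records per timestamp, keeping state only for `retention` time units."""
    counts, dropped = {}, []
    stream_time = float("-inf")  # largest timestamp observed so far
    for key, ts in records:
        stream_time = max(stream_time, ts)
        if ts < stream_time - retention:
            dropped.append((key, ts))  # too old: the state it needs is gone
            continue
        counts[ts] = counts.get(ts, 0) + 1
        # Evict state that fell out of the retention period.
        for old in [t for t in counts if t < stream_time - retention]:
            del counts[old]
    return counts, dropped


records = [("a", 1), ("b", 2), ("c", 10), ("d", 3), ("e", 9)]

counts, dropped = aggregate_with_retention(records, retention=5)
counts_long, dropped_long = aggregate_with_retention(records, retention=10)
```

With a retention of 5, the late record `d` is dropped and only two entries stay in memory; with a retention of 10, nothing is dropped but all five entries are kept.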
Mapping to Kafka
Minimal
Concept | Partitioned | Unbounded | Ordering | Mutable | Unique key constraint | Schema
Topic | Yes | Yes | Yes | No | No | No (raw bytes)
Stream | Yes | Yes | Yes | No | No | Yes
Table | Yes | Yes | No | Yes | Yes | Yes

Concept | Kafka Streams | KSQL | Java | Scala | Python
Topic | - | - | List/Stream | List/Stream[(Array[Byte], Array[Byte])] | []
Stream | KStream | STREAM | List/Stream | List/Stream[(K, V)] | []
Table | KTable | TABLE | HashMap | mutable.Map[K, V] | {}
https://www.michael-noll.com/blog/2018/04/05/of-stream-and-tables-in-kafka-and-stream-processing-part1/
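The Python column of the mapping can be spelled out as a short analogy: a topic is a list of raw byte pairs, a stream is the same list with a schema applied, and a table is a dictionary. The data and the decoding step are made up for the illustration.

```python
# Topic ~ []: an ordered list of raw (key, value) byte pairs, no schema.
topic = [(b"alice", b"1"), (b"bob", b"4"), (b"alice", b"3")]

# Stream ~ []: the same records with a schema applied (str keys, int values).
stream = [(key.decode("utf-8"), int(value)) for key, value in topic]

# Table ~ {}: mutable, one row per unique key; the latest value wins.
table = dict(stream)

# Going back: every row of the table can be emitted as a change record.
changelog = list(table.items())
```

Replaying `changelog` with `dict(changelog)` rebuilds the same table, which is the stream-table duality in miniature.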
• CQL
• Esper
• Kafka Streams
• Stream Duality
• DataFlow
• KSQL
• Flink
• Models & Issues
• SECRET Model
• Complex Event Processing
• Stream Processing + AI
Meetup, Tallinn
First Event April/March
By Confluent, so free beers
Come and talk about your Kafka experience!
Thanks!
Questions?

rictomm.me

@rictomm

this@rictomm.me

I’m Hiring PhD Students!!
