Data Analytics @ Scale:
Implementing stateful stream processing
Michael Kanevsky
VP Innovation
michaelk@codevalue.net
@mkanevsky
http://codevalue.net
About me
Michael Kanevsky
 Distributed Systems
 Big Data & Machine Learning
 Embedded & IoT
Agenda
 Why now?
 Streaming basics
 Stream processing frameworks 101
 Building a streaming application
 A look at Apache Flink + Kafka Streams
Big & Fast Data
 Typical “Big Data” challenges
 Data Volume
 High velocity data streams
 Result relevance
 Offline vs Online
Scale: Data vs Latency
[Figure: data size (MBs, GBs, TBs) plotted against processing latency (10s, 1s, 100ms, 10ms, 1ms, 100us), with regions labeled "Piece of Cake!", "Mostly Feasible" and "Good Luck…"]
 Improving throughput = scaling out
 Communication latencies
 Much harder for some applications
Many natural streaming applications
IoT: Telemetry & sensor data
User activities & clickstreams
Financial data processing
Calculating aggregates on data
What makes a good case for streaming?
Continuous data
Time anchoring
Low latency requirement
Architecting for analytic workload
[Diagram: Processing 1: Store (File 1, File 2, … File N) → Processing 2: Batch → KV Store]
Sample event:
{
  "customerId": 12345,
  "transactionSum": 99.90,
  "vendorId": "04ec092a-421a-4481",
  "ts": "2019-05-30T09:20:50.100+02:00"
}
Architecting for analytic workload #2
[Diagram: the batch path (Processing 1: Store → Processing 2: Batch → KV Store) alongside a streaming path (Partitioner → Stream Processing → KV Store) that computes SUM and AVG over the last 30 min]
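To make the streaming path concrete, here is a minimal plain-Java sketch of a per-customer SUM and AVG over the last 30 minutes, plus a key-based partitioner. It is not taken from the deck or from any framework: the class `RollingAggregator`, its method names, the use of cents for amounts, and the assumption that events for one customer arrive in timestamp order are all illustrative.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: per-customer SUM and AVG over the last 30 minutes.
// Names are illustrative and do not come from any streaming framework.
class RollingAggregator {
    private static final long WINDOW_MILLIS = 30L * 60L * 1000L;

    // State kept for one key (one customer).
    private static final class State {
        final Deque<long[]> events = new ArrayDeque<>(); // {timestampMillis, amountCents}
        long sumCents = 0;
    }

    private final Map<String, State> stateByCustomer = new HashMap<>();

    // Adds an event, then evicts events at or before (timestamp - 30 min).
    // Assumes events for one customer arrive in timestamp order.
    void add(String customerId, long timestampMillis, long amountCents) {
        State s = stateByCustomer.computeIfAbsent(customerId, k -> new State());
        s.events.addLast(new long[] {timestampMillis, amountCents});
        s.sumCents += amountCents;
        while (!s.events.isEmpty()
                && s.events.peekFirst()[0] <= timestampMillis - WINDOW_MILLIS) {
            s.sumCents -= s.events.pollFirst()[1];
        }
    }

    long sumCents(String customerId) {
        State s = stateByCustomer.get(customerId);
        return s == null ? 0 : s.sumCents;
    }

    double avgCents(String customerId) {
        State s = stateByCustomer.get(customerId);
        if (s == null || s.events.isEmpty()) {
            return 0.0;
        }
        return (double) s.sumCents / s.events.size();
    }

    // Partitioner: the same key always maps to the same partition,
    // so all state for one customer lives on one worker.
    static int partitionFor(String customerId, int partitions) {
        return Math.floorMod(customerId.hashCode(), partitions);
    }
}
```

Keeping a running sum and evicting from the front of the queue makes each update cheap; a real framework would also have to handle out-of-order events and persist this state.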
Lambda Architecture
System complexity
Duplicate data state
Cascading failures
But get:
Low latency
Accuracy
Complex persistent state
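The "duplicate data state" cost can be illustrated with a small plain-Java sketch of a serving layer that merges a batch view with a streaming view. This is a simplified assumption-laden illustration, not the deck's implementation: `LambdaServingLayer` and its methods are hypothetical, and it assumes every batch recomputation covers all events seen up to the moment it is swapped in.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical serving layer for a Lambda-style setup (illustrative names only).
class LambdaServingLayer {
    private final Map<String, Long> batchView = new HashMap<>(); // rebuilt by the batch job
    private final Map<String, Long> speedView = new HashMap<>(); // updated by the stream job

    // Called when a batch run finishes. Assumes the batch covers every event
    // seen so far, so the streaming view can be reset.
    void onBatchRecomputed(Map<String, Long> freshTotals) {
        batchView.clear();
        batchView.putAll(freshTotals);
        speedView.clear();
    }

    // Called for every event that arrives after the last batch run.
    void onStreamEvent(String customerId, long amountCents) {
        speedView.merge(customerId, amountCents, Long::sum);
    }

    // A query has to combine both views: this is the duplicated state.
    long total(String customerId) {
        return batchView.getOrDefault(customerId, 0L)
                + speedView.getOrDefault(customerId, 0L);
    }
}
```

The same total is maintained twice, by two code paths, which is where the complexity and failure coupling listed above come from.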
Stream Processing Infrastructures
[Logos of stream processing frameworks, including Apache Gearpump]
Our Scenario: Fraud detection
Event Sources Alerts
Persistence
Business Logic
Client API
- CustomerId (guid)
- Timestamp
- Transaction Sum
- Vendor Id
Notifications
Spark Streaming
 Supports two APIs:
 Spark Streaming (RDD-like DStream)
 Structured Streaming (DataFrames/Datasets)
 Better in cloud
 Easier code reuse between batch and stream
 Supports Java/Scala/Python/R/SQL
 Requires dedicated cluster (RM/containerized SA)
 JAR-based execution
[Diagram: Zookeeper cluster, Kafka Broker, Resource Manager, Spark Cluster (Master, Worker, Worker), Shared Persistence]
Apache Flink
- Entered Apache incubation in 2014; since graduated to a top-level Apache project
- Versatile API stack
- Exactly once message processing
- High throughput + low latency
- Flexible state persistence (RocksDB)
- Stream & Batch Semantics
[Diagram: two JobManagers, two TaskManagers (each with task slots: Slot I, Slot II, …), Zookeeper cluster, Storage]
Apache Flink processing modes
API stack, from low level to high level: Stateful Stream Processing, DataStream API, Table API, SQL

DataStream API:
stream
    .keyBy("customerId")
    .timeWindow(Time.seconds(30))
    .sum("transactionSum");

Table API:
Table customerTransactions = …
Table transactionAggregates = customerTransactions
    .select("customerId, transactionSum, ts")
    .window(
        Tumble.over("30.seconds")
            .on("ts")
            .as("hmWindow"))
    // the alias in groupBy must match the window alias above
    .groupBy("hmWindow, customerId")
    .select("transactionSum.sum as totalAmount");

SQL:
// ts must be a time attribute (for example declared as ts.rowtime) for windowing
tableEnv.registerDataStream("transactions", stream,
    "customerId, transactionSum, ts");
Table result = tableEnv.sqlQuery(
    "SELECT customerId, TUMBLE_END(ts, INTERVAL '30' SECOND), " +
    "SUM(transactionSum) AS totalAmount " +
    "FROM transactions " +
    "GROUP BY customerId, TUMBLE(ts, INTERVAL '30' SECOND)");
Kafka Streams
 Built on Apache Kafka as a client library
 Can support low latency processing
 Kafka persistence + local RocksDB persistence
 Good DSL with flexible topology support
 Supports “Exactly Once” semantics
 Limited to Kafka -> Kafka use cases
 Available since Kafka 0.10; fully featured as of Kafka 2.x
[Diagram: Zookeeper cluster, Kafka Broker, Kafka Streams App with Local Persistence]
Time -> Data Windows
Tumbling Windows Sliding Windows
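The difference between the two window types comes down to how many windows one event belongs to. The following plain-Java sketch computes window membership from a timestamp; `WindowAssigner` and its methods are hypothetical helpers written for this illustration, and windows are assumed to be half-open intervals [start, start + size) aligned to time zero.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: computes which windows an event timestamp falls into.
class WindowAssigner {
    // Tumbling windows do not overlap, so each event belongs to exactly one.
    static long tumblingWindowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    // Sliding windows overlap: a new window starts every slideMillis, so one
    // event can belong to several windows. Returns the starts of all windows
    // [start, start + size) that contain the timestamp, newest first.
    static List<Long> slidingWindowStarts(long timestampMillis,
                                          long sizeMillis,
                                          long slideMillis) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestampMillis - Math.floorMod(timestampMillis, slideMillis);
        for (long start = lastStart;
             start > timestampMillis - sizeMillis;
             start -= slideMillis) {
            starts.add(start);
        }
        return starts;
    }
}
```

With a 30-second window sliding every 10 seconds, each event lands in three windows, which is why sliding windows cost more state than tumbling ones.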
Stream Processing: Time dimension
Two approaches:
 True Streaming
 Micro batching
[Figure: event time plotted against processing time, comparing the ideal line with what happens in practice]
Three timelines:
 Event time
 Ingest time
 Processing time
One watermark!
(to rule them all)
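A watermark is the system's estimate of how far event time has progressed. One common heuristic is "largest event time seen so far minus an allowed delay"; the plain-Java sketch below shows that heuristic only. `WatermarkTracker` and the fixed out-of-orderness bound are assumptions made for illustration, not the API of Flink or Kafka Streams.

```java
// Hypothetical sketch of a bounded-out-of-orderness watermark.
class WatermarkTracker {
    private final long maxOutOfOrdernessMillis;
    private long maxEventTimeSeen = Long.MIN_VALUE;

    WatermarkTracker(long maxOutOfOrdernessMillis) {
        this.maxOutOfOrdernessMillis = maxOutOfOrdernessMillis;
    }

    // Records an event and reports whether it arrived behind the
    // watermark that was current before this event (a late event).
    boolean observe(long eventTimeMillis) {
        boolean late = eventTimeMillis < currentWatermark();
        maxEventTimeSeen = Math.max(maxEventTimeSeen, eventTimeMillis);
        return late;
    }

    // Watermark = largest event time seen so far minus the allowed delay.
    long currentWatermark() {
        return maxEventTimeSeen == Long.MIN_VALUE
                ? Long.MIN_VALUE
                : maxEventTimeSeen - maxOutOfOrdernessMillis;
    }

    // A window [start, end) can be finalized once the watermark reaches its end.
    boolean canCloseWindow(long windowEndMillis) {
        return currentWatermark() >= windowEndMillis;
    }
}
```

The allowed delay is the trade-off knob: a larger value tolerates more disorder but delays every window result by the same amount.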
Code: Flink
Stream Processing: High Availability
[Diagram: Stream Processing cluster with a JobManager and six Task Executors consuming messages (MSG), an API entry point, State Persistence and Job Persistence]
Stream Processing: Delivery Guarantees
System guarantees:
 At Most Once
 At Least Once
 Exactly Once
[Diagram: Message → Processing → Failure → "Give up" or "Try again"]
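The "give up" and "try again" branches map onto the first two guarantees, and the third can be approximated by making the handler idempotent. The plain-Java sketch below illustrates this; `DeliveryDemo` and its members are hypothetical. Note the swapped-in technique: deduplicating by message ID is only one way to get an exactly-once effect, and it is not how Flink or Kafka Streams implement it (they rely on checkpoints and transactions).

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Hypothetical sketch of the three delivery guarantees.
class DeliveryDemo {
    // At most once: a single attempt; on failure the message is dropped.
    static boolean atMostOnce(String message, Consumer<String> handler) {
        try {
            handler.accept(message);
            return true;
        } catch (RuntimeException e) {
            return false; // give up
        }
    }

    // At least once: retry until the handler succeeds. If a failed attempt
    // had partial side effects, the handler may apply them more than once.
    // Returns the number of attempts that were needed.
    static int atLeastOnce(String message, Consumer<String> handler, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                handler.accept(message);
                return attempt;
            } catch (RuntimeException e) {
                // try again
            }
        }
        throw new IllegalStateException("gave up after " + maxAttempts + " attempts");
    }

    // Exactly-once effect: redelivered messages are ignored by remembering
    // which message IDs have already been applied.
    static final class IdempotentCounter {
        private final Set<String> seenIds = new HashSet<>();
        private long total = 0;

        void apply(String messageId, long amount) {
            if (seenIds.add(messageId)) {
                total += amount;
            }
        }

        long total() {
            return total;
        }
    }
}
```

In this sketch the set of seen IDs grows without bound; a production system would need to expire it or use offsets instead.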
Streaming Architecture
[Diagram: Event Publishers and μServices → Distributed Queue (Kafka/Kinesis/Event Hub/…) → Stream Processing (Data Sources → Processors → Data Sinks, with State Persistence) → Data Persistence (File/BLOB, KV Store, MQ)]
Code: Kafka Streams
Stream Processing Frameworks
                 Streaming   API        Infra    State                        Batch support
Spark Streaming  μ-batches   Framework  SA + RM  In memory (by default)       via Spark
Kafka Streams    Native      Library    Kafka    Kafka + persisted (RocksDB)  n/a
Apache Flink     Native      Framework  SA + RM  Persisted (RocksDB)          Out of the box
Advanced features: State API
• Flink provides access to state via the Queryable State client [Beta]
• Kafka Streams supports interactive queries on state, including local window stores
Advanced features: Checkpoints & Snapshots
Checkpointing = Distributed state persistence
- Mainly used for resilience and resumable processing
- Specifies a point in the data flow at which all state is synchronized
In Spark, state is persisted internally (in a HashMap backed by persistent storage)
Kafka Streams uses Kafka for persistence + RocksDB for local state
Flink uses RocksDB for state + provides a snapshots API
[Figure: data flow inside Flink with checkpoint indicators]
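The core idea of a checkpoint, saving operator state together with the input position so that processing can resume from both, can be sketched in plain Java. This is a single-operator, in-memory illustration under stated assumptions: `CheckpointedSum` and `Snapshot` are hypothetical names, and real frameworks coordinate snapshots across many operators and write them to durable storage.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical single-operator checkpoint: state plus input position.
class CheckpointedSum {
    private Map<String, Long> sums = new HashMap<>();
    private long nextOffset = 0; // position of the next input record to read

    // Immutable copy of the state at one point in the input.
    static final class Snapshot {
        final Map<String, Long> sums;
        final long nextOffset;

        Snapshot(Map<String, Long> sums, long nextOffset) {
            this.sums = new HashMap<>(sums); // defensive copy
            this.nextOffset = nextOffset;
        }
    }

    void process(String key, long amount) {
        sums.merge(key, amount, Long::sum);
        nextOffset++;
    }

    Snapshot checkpoint() {
        return new Snapshot(sums, nextOffset);
    }

    // After a failure: roll state back and resume reading from the saved
    // offset, so records after the checkpoint are replayed exactly once.
    void restore(Snapshot snapshot) {
        sums = new HashMap<>(snapshot.sums);
        nextOffset = snapshot.nextOffset;
    }

    long sum(String key) {
        return sums.getOrDefault(key, 0L);
    }

    long nextOffset() {
        return nextOffset;
    }
}
```

Because state and offset are saved together, replaying the input from `nextOffset` after a restore does not double-count anything processed before the checkpoint.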
Recap
- Ideas behind Streaming
- Architecture of stream processing
- Streaming frameworks overview
- Flink example
- Kafka Streams example
Q&A
Michael Kanevsky
VP Innovation
michaelk@codevalue.net
http://codevalue.net



Editor's Notes

  • #4: Image: http://turnoff.us/image/en/machine-learning-class.png
  • #15: Somewhat lacking in maturity (2.2.0); hard to get low latency
  • #25: Image: https://www.sccpre.cat/png/big/73/737561_funny-cat-png.png
  • #31: Image:https://svgsilh.com/image/1296377.html