Kafka Streams
Distributed, fault-tolerant stream processing
A little bit of history
● Data resided within operational databases.
● Demand grew for data analysis on a centralized warehouse dedicated to this purpose.
● ETL processes emerged.
● ETL - Extract, Transform, Load
Changes in the ETL process
● Data integration - integrating data between sources and destinations
● Single-server databases have been replaced by distributed data platforms
● The rise of big data forced ETL tools to handle more than just databases and data warehouses
● Today data comes from a wide range of sources: logs, sensors, metrics
● This demands a change in approach toward continuous processing
● Processing needs to handle high throughput with low latency
Traditional ETL drawbacks
● Originally designed for the "niche" problem of connecting operational databases and data warehouses in a "batch" fashion
● Time-consuming and resource-intensive
● The "T" in Transform really stood for data cleansing rather than complex transformations, which could include data enrichment
● Requires a global schema
It gets even messier...
● EAI - Enterprise Application Integration
● A rising need for real-time integration between the different applications in our architecture
● Traditionally solved by enterprise message queues
● These worked well at small scale but not at large scale
● The result: an inability to handle the volume and variety of modern data such as logs, sensors, real-time transactions, etc.
To summarize...
So what are we looking for?
● The ability to process high-volume, high-diversity data
● A real-time model from the get-go that supports continuous processing
● A transition to an "event-centric" paradigm (pub/sub)
● A forward-compatible data architecture: the ability to add multiple destinations that process the data differently
● Low latency
Keep looking….
● To enable forward compatibility, the "T" in ETL first needs to be redefined.
● Move from data cleansing to data transformations
● Moreover, transformations such as data enrichment should not run on the data warehouse, but rather as continuous transformations on the streaming platform
● To achieve that, we obviously need join, aggregation, and windowing abilities
● To summarize: we need to extract clean data once, transform it in many ways, and load it to different destinations
Stream Processing
● Stream processing is really all about transformations on a continuous stream of data
● Transformations take the form of filters, maps, joins, and aggregations
● We can divide stream processing into two paradigms: Real-Time MapReduce and Event-Driven Microservices
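The transformation shapes above (filter, map, aggregation) can be sketched with plain java.util.stream operations over a finite list. This is a conceptual stand-in, not the Kafka Streams API itself, and the "view:"-prefixed event format is invented for illustration; in Kafka Streams the same shapes appear as KStream operations running continuously over an unbounded stream:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class TransformDemo {
    // filter + map + aggregation over a finite "stream" of page-view events.
    public static Map<String, Long> viewsPerUser(List<String> events) {
        return events.stream()
                .filter(e -> e.startsWith("view:"))          // filter: keep only views
                .map(e -> e.substring("view:".length()))     // map: extract the user id
                .collect(Collectors.groupingBy(u -> u,       // group + aggregate: count per user
                        Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(viewsPerUser(
                List.of("view:alice", "click:bob", "view:alice", "view:bob")));
    }
}
```

The key difference in a streaming platform is that the aggregation never "finishes": counts are continuously updated as new records arrive.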
Real Time MapReduce
● MapReduce has been with us for quite a long time
● The main challenge is fitting MapReduce to modern needs by building a real-time, continuous MapReduce layer, for example:
Real Time MapReduce
● Processing jobs run on a centralized, dedicated cluster
● Requires custom packaging for each platform and its respective deployment
● Most suitable for long-running analytics on a large multi-tenant cluster, or for machine/deep learning purposes
● Couples dev teams tightly to DevOps teams
● Business logic is divided between two layers, with some of the logic expressed in a processing job that must be deployed on the real-time MapReduce cluster
● At large scale this can cause a lot of friction
Event Driven Micro Services
● This paradigm correlates with the event-centric paradigm, where your streaming platform acts as a central nervous system
● The microservices layer also acts as the stream processing units
● Just Kafka and your app, via an embedded library
● Input and output are always streams
Brave new world - new ETL
Kafka Streams Application Overview
● An application that uses the Kafka Streams API is just an ordinary Java application
● This makes packaging and deployment as easy as it should be
● Built on top of Kafka's fault-tolerance capabilities
● Streams are partitioned and replicated
● Stream tasks are also fault tolerant: if a task runs on a machine that fails, the Streams platform will automatically restart the task on one of the remaining instances
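As a sketch of what such an ordinary Java application looks like, here is the canonical word-count topology built with the Streams DSL. It assumes the kafka-streams library is on the classpath and a broker is reachable at localhost:9092; the topic names text-input and word-counts are placeholders:

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");
        lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
             .groupBy((key, word) -> word)        // repartition by word
             .count()                             // stateful aggregation (uses a state store)
             .toStream()
             .to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Deployment is just `java -jar`: there is no job-submission cluster, so scaling out means starting another copy of the same process.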
Kafka Streams Application Overview
● The ability to run multiple instances of a Streams application
● Instances run independently and automatically discover each other
● The ability to elastically add or remove app instances during live processing
● When an instance fails, the other instances take over its work
Stream Processors
● Stream processors are nodes in the processor topology
● They represent computational steps in the topology, which basically means they are responsible for the data transformations
● Transformations include: map, filter, aggregations, joins, and windowing
● These processors come out of the box with the Streams API
● Processors receive data records from upstream processors, apply a transformation, and send records to downstream processors
Stream Processors
● Two special types of processors:
○ Source Processor - produces an input stream for the topology by consuming records from one or more Kafka topics; this stream is then forwarded to one or more downstream processors. This processor sits at the root of the topology, so it is not connected to any upstream processor.
○ Sink Processor - has no downstream processors; it sends its output stream to a specified Kafka topic.
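The source → processors → sink flow can be sketched as a toy pipeline of plain Java functions. This is a conceptual model only: here a source list stands in for the input topic, each Function is a processor node, and a sink list stands in for the output topic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class TopologyDemo {
    // A toy linear processor topology over string records.
    public static List<String> run(List<String> sourceTopic,
                                   List<Function<String, String>> processors) {
        List<String> sinkTopic = new ArrayList<>();
        for (String record : sourceTopic) {      // source processor: consume records
            for (Function<String, String> p : processors) {
                record = p.apply(record);        // intermediate processors: transform
            }
            sinkTopic.add(record);               // sink processor: write the output
        }
        return sinkTopic;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("hello", "world"),
                List.of(String::toUpperCase, s -> s + "!")));
    }
}
```

A real topology is a DAG rather than a straight line, and source/sink processors speak to Kafka topics instead of in-memory lists, but the dataflow shape is the same.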
Processor Topology
State Stores
● State stores are used to store and query data
● They are really the backbone that enables "stateful stream processing"
● The Kafka Streams DSL automatically creates and uses state stores whenever they are required for stateful operations such as joins, aggregations, and windowing
● State stores can be backed by a RocksDB database or an in-memory hash map
● Kafka Streams offers robust fault tolerance and recovery for local state stores
● Each state store is replicated by a changelog topic
● These changelog topics are also partitioned, enabling each task that accesses a store to restore only its own partition of the changelog
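The store/changelog relationship can be sketched as a map backed by an append-only log: every update is written to the log, and replaying the log rebuilds the store after a failure. This is a toy model of what Kafka Streams does with its (compacted) changelog topics, not the actual implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StateStoreDemo {
    final Map<String, Long> store = new HashMap<>();  // local state (RocksDB / in-memory map)
    final List<String[]> changelog;                   // stands in for the changelog topic

    StateStoreDemo(List<String[]> changelog) {
        this.changelog = changelog;
    }

    void put(String key, long value) {
        store.put(key, value);
        changelog.add(new String[]{key, Long.toString(value)});  // every update is logged
    }

    // After a crash, a fresh instance rebuilds its state by replaying the log.
    static StateStoreDemo restore(List<String[]> changelog) {
        StateStoreDemo s = new StateStoreDemo(changelog);
        for (String[] update : changelog) {
            s.store.put(update[0], Long.parseLong(update[1]));   // last write wins
        }
        return s;
    }

    public static void main(String[] args) {
        List<String[]> log = new ArrayList<>();
        StateStoreDemo original = new StateStoreDemo(log);
        original.put("alice", 1);
        original.put("alice", 2);   // only the latest value matters, hence log compaction
        System.out.println(StateStoreDemo.restore(log).store);
    }
}
```

Because only the latest value per key matters on replay, the real changelog topics can be log-compacted without losing information.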
Fault Tolerance
● Kafka Streams comes with fault-tolerance capabilities that are integrated into Kafka itself
● Kafka streams are partitioned and replicated, just as Kafka topics are
● Stream tasks are monitored internally, so if a task runs on a machine that failed, Kafka Streams will automatically detect it and restart the task on another app instance
● As mentioned before, state stores are also fault tolerant, maintaining a replicated changelog for each store that tracks the state's updates
● These changelogs are also partitioned, so any task that requires state can restore just the partitions it needs
Fault Tolerance
● Log compaction is enabled on the state stores' replicated changelogs, which prevents these changelog topics from growing indefinitely
Threading Model
● Kafka Streams allows configuring the number of threads the library can use to parallelize processing
● Each thread can run one or more stream tasks
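The thread count is an ordinary configuration property; `num.stream.threads` is the real Kafka Streams config key (it defaults to 1). A minimal sketch of building such a configuration with plain java.util.Properties (the application id and broker address are placeholders):

```java
import java.util.Properties;

public class ThreadingConfigDemo {
    // Kafka Streams reads "num.stream.threads" from its configuration;
    // each stream thread then runs one or more stream tasks.
    public static Properties streamsProps(int threads) {
        Properties props = new Properties();
        props.setProperty("application.id", "my-streams-app");
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("num.stream.threads", Integer.toString(threads));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(streamsProps(4).getProperty("num.stream.threads"));
    }
}
```

These properties would be passed to the KafkaStreams constructor; raising the thread count increases parallelism within a single instance, while adding instances scales out across machines.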
Nimrod Ticotzner nimrod.ticozner@mentory.io
https://www.linkedin.com/in/nimrod-ticozner/
THANK YOU!
Editor's Notes

  • #3: ETL - extract data from databases, transform it into the destination warehouse's schema, load it into a central data warehouse. b2 - analysis ran on a separate data warehouse so as not to affect operational database performance, which resulted in analysis after a meaningful time gap instead of in "real time".
  • #5: b1 - there was also a need for EAI, Enterprise Application Integration (referenced in a couple of slides). b3 - data enrichment can really only be implemented with joins and aggregations.