Kafka Streams at Scale
Deepak Goyal
Customer Backbone (CBB)
Walmart Labs
Stateful processing of multi-million events in real time
CBB Data Pipeline
[Diagram: Walmart, Front Kafka, event-filtering, CBB Internal Kafka, and Interested Consumers. The processing stage is labeled "customer state updates, feature extraction, model-inferencing etc."]
400 Kafka Streams (KS) instances, >1M events per second (eps)
Kafka Streams
• Ease of development
• Library, not a framework
• High processing throughput
• Distributed-persistent storage
• Exactly-once processing semantics
@walmartlabs
Kafka Streams Architecture Overview
[Diagram: App Instance. Three identical units are shown, each with a task (consumer, processor, RocksDB, producer) and an HTTP server.]
Event Flow
[Diagram: the Kafka cluster holds an input topic and a change-log topic, each shown as partition 0 with copies 0’ and 0’’. In the Kafka Streams cluster, task-0 is active and task-0’ and task-0’’ are stand-bys; each task has its own RocksDB.]
Kafka Streams’ Challenges
1. Fault Recovery
2. Horizontal Scalability
3. Cloud Readiness
4. RocksDB
5. Large Clusters
Challenge 1. Fault Recovery
Default bootstrap vs Cold bootstrap
Default Bootstrap
Change-log topic as the source of truth
• Slow stand-by recovery
• Log-compacted change-log topics
• Inefficient disk usage
[Diagram: Default Bootstrap. In the app cluster, task-0 is active and task-0’ is a stand-by, each with its own RocksDB. A new stand-by task-0’ starts with an empty RocksDB. In the Kafka cluster, the change-log topic is labeled "the source of truth" and feeds the "stand-by recovery".]
Cold Bootstrap
Active task as the source of truth
• Lightning-fast stand-by recovery
• Efficient disk usage
• Cross-cloud cold bootstraps
[Diagram: Cold Bootstrap. In the app cluster, task-0 is active and task-0’ is a stand-by, each with its own RocksDB. A new stand-by task-0’ starts with an empty RocksDB. Labels: "the source of truth", "enhanced stand-by recovery", "the new source of truth".]
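The difference between the two bootstrap modes can be illustrated with a toy model. This is only a sketch under stated assumptions: the slides do not say how state is copied from the active task, so the snapshot-plus-offset scheme, the function names, and the sample data below are all hypothetical, and a plain dict stands in for RocksDB.

```python
from typing import Dict, List, Tuple

Record = Tuple[str, int]  # (key, value) change-log entry


def default_bootstrap(changelog: List[Record]) -> Tuple[Dict[str, int], int]:
    """Rebuild state by replaying the whole change-log from offset 0."""
    state: Dict[str, int] = {}
    for key, value in changelog:
        state[key] = value
    return state, len(changelog)  # (restored state, records replayed)


def cold_bootstrap(
    active_snapshot: Dict[str, int],
    snapshot_offset: int,
    changelog: List[Record],
) -> Tuple[Dict[str, int], int]:
    """Copy the active task's state, then replay only the change-log tail."""
    state = dict(active_snapshot)  # stands in for copying the store's files
    tail = changelog[snapshot_offset:]
    for key, value in tail:
        state[key] = value
    return state, len(tail)


changelog = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
# Hypothetical snapshot of the active task taken after the first 4 records.
snapshot, snapshot_offset = {"a": 3, "b": 2, "c": 4}, 4

full_state, full_replayed = default_bootstrap(changelog)
cold_state, cold_replayed = cold_bootstrap(snapshot, snapshot_offset, changelog)
```

Both paths end with the same state; the cold path replays only the records written after the snapshot, which is where the recovery speed-up in this model comes from.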
Challenge 2. Horizontal Scalability
• Dynamic Repartitioning
• Scaling Lookups
Repartitioning Logic
2 partitions:
partition-0 0,2,4,6,8,10
partition-1 1,3,5,7,9,11
4 partitions, with partitions 2 and 3 holding copies of partitions 0 and 1:
partition-0 0,2,4,6,8,10
partition-1 1,3,5,7,9,11
partition-2 0,2,4,6,8,10
partition-3 1,3,5,7,9,11
4 partitions, each holding only its own keys:
partition-0 0,4,8
partition-1 1,5,9
partition-2 2,6,10
partition-3 3,7,11
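The key lists on the repartitioning slide are consistent with a modulo rule, which can be sketched as follows. This is an illustration, not the talk's implementation: the assumption is that key k lives on partition k % N, so when N doubles, new partition p + N only ever needs keys that old partition p already held. The function names are hypothetical.

```python
from typing import Dict, List


def assign(keys: List[int], num_partitions: int) -> Dict[int, List[int]]:
    """Group keys by partition using key % num_partitions."""
    partitions: Dict[int, List[int]] = {p: [] for p in range(num_partitions)}
    for key in keys:
        partitions[key % num_partitions].append(key)
    return partitions


def scale_up(old: Dict[int, List[int]]) -> Dict[int, List[int]]:
    """Double the partition count: clone each old partition, then prune."""
    n = len(old)
    # Step 1: new partition p + n starts as a full copy of old partition p.
    cloned = {p: list(old[p % n]) for p in range(2 * n)}
    # Step 2: each partition keeps only the keys it now owns.
    return {p: [k for k in ks if k % (2 * n) == p] for p, ks in cloned.items()}


keys = list(range(12))
two = assign(keys, 2)
four = scale_up(two)
```

Because every key a new partition owns is already present in the partition it was cloned from, no data has to move between unrelated partitions in this model.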
Dynamic Repartitioning
Scaling up from 2 to 4 partitions
[Diagram: app cluster with tasks labeled "task-0 (active)", "task-0’ (stand-by)", "task-0’’ or 2 (future stand-by)", "task-2 (becomes active)", and "task-2’ (new stand-by, empty RocksDB)"; each task has its own RocksDB.]
Scaling Lookups
• Queryable stand-by
• Akka server (non-blocking I/O)
• Partition-specific lookups
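A partition-specific lookup that can also be served by stand-bys might be routed roughly like this. The sketch is hypothetical throughout: CRC32 stands in for whatever partitioner the application really uses, and the host names and topology map are invented for illustration.

```python
import zlib
from typing import Dict, List


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition (CRC32 stands in for the real partitioner)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def hosts_for(key: str, num_partitions: int,
              hosts_by_partition: Dict[int, List[str]]) -> List[str]:
    """Return the hosts (active first, then stand-bys) that can serve a key."""
    return hosts_by_partition[partition_for(key, num_partitions)]


# Hypothetical topology: the first host of each partition is the active task.
topology = {
    0: ["host-a:8080", "host-b:8080"],
    1: ["host-b:8080", "host-c:8080"],
}
candidates = hosts_for("customer-42", 2, topology)
```

A caller can then query any host in `candidates`, so read traffic is spread across the active task and its stand-bys instead of hitting one instance.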
Challenge 3. Cloud Readiness
• Rack/AZ-aware task assignment
• Partial partition assignment revocation
Rack/AZ Aware Task Assignment
[Diagram: three availability zones (AZ1, AZ2, AZ3) hosting task-0 (active), task-0’ and task-0’’ (stand-bys), task-1 (active), and task-1’ and task-1’’ (stand-bys), each with its own RocksDB.]
StickyTaskAssignor using new config RACK_ID_CONFIG = "rack.id";
https://github.com/apache/kafka/pull/4785
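The goal of rack/AZ-aware assignment is that the active task and its stand-bys never share a zone. The round-robin placement below is a simplified stand-in, not the StickyTaskAssignor logic from the pull request; the function name and inputs are hypothetical.

```python
from typing import Dict, List


def assign_replicas(tasks: List[str], racks: List[str],
                    replicas: int) -> Dict[str, List[str]]:
    """Place each task's replicas (active first) on distinct racks."""
    if replicas > len(racks):
        raise ValueError("need at least as many racks as replicas per task")
    placement: Dict[str, List[str]] = {}
    for i, task in enumerate(tasks):
        # Rotate the starting rack per task so active tasks spread out too.
        placement[task] = [racks[(i + r) % len(racks)] for r in range(replicas)]
    return placement


placement = assign_replicas(["task-0", "task-1"], ["AZ1", "AZ2", "AZ3"], 3)
```

With one active and two stand-bys per task across three zones, losing any single zone still leaves two replicas of every task available.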
Partial Partition Assignment Revocation
[Diagram: three availability zones (AZ1, AZ2, AZ3) with tasks labeled "task-0 (active)", "task-1 (active)", "task-0’ (stand-by)", "task-2 (active)", "task-2’ (stand-by)", and "task-1 (stand-by)"; each task has its own RocksDB.]
Challenge 4. RocksDB
Enhancements to RocksDB Store
• Column Family support
• Eliminated synchronized GETs
• Queryable in suspended and restoration state
Challenge 5. Large Clusters
• Rebalance time
• Overriding broker defaults
• Overriding stream defaults
Rebalance Time Reduction
• Bottleneck: Group Leader Broker
• Partition Assignment Info
• Compression
• Better Encoding
• 24x smaller
https://github.com/apache/kafka/pull/6162
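The effect of compressing the partition assignment info can be sketched with a generic codec. This does not reproduce the encoding in the pull request: JSON and zlib are stand-ins, the payload shape is invented, and the 4-partitions-per-instance figure is an assumption (only the 400-instance count comes from the deck).

```python
import json
import zlib
from typing import Dict, List


def encode_assignment(assignment: Dict[str, List[int]]) -> bytes:
    """Serialize the assignment compactly, then compress it."""
    raw = json.dumps(assignment, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 9)


def decode_assignment(blob: bytes) -> Dict[str, List[int]]:
    """Reverse encode_assignment."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical assignment: 400 instances, 4 partitions each.
assignment = {f"instance-{i}": list(range(4 * i, 4 * i + 4)) for i in range(400)}
raw_size = len(json.dumps(assignment).encode("utf-8"))
blob = encode_assignment(assignment)
```

Assignment metadata is highly repetitive, so even a general-purpose compressor shrinks it noticeably; the exact ratio depends on the real payload and encoding.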
Broker Configs
Overriding Broker Defaults
• message.max.bytes
• replica.fetch.max.bytes
• socket.request.max.bytes
• offsets.load.buffer.size
• min.insync.replicas
Streams Configs
Overriding Streams Defaults
• acks (producer)
• linger.ms (producer)
• auto.offset.reset (consumer)
• state.cleanup.delay.ms (streams)
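Kafka Streams accepts settings for its embedded clients under the `producer.` and `consumer.` prefixes, so the overrides listed above could be assembled as in the sketch below. The helper name is hypothetical and the values are placeholders, not the settings used in the talk.

```python
from typing import Dict


def streams_overrides(producer: Dict[str, str], consumer: Dict[str, str],
                      streams: Dict[str, str]) -> Dict[str, str]:
    """Merge overrides, prefixing embedded-client settings."""
    props: Dict[str, str] = dict(streams)
    props.update({f"producer.{k}": v for k, v in producer.items()})
    props.update({f"consumer.{k}": v for k, v in consumer.items()})
    return props


# Values below are placeholders, not the settings used in the talk.
props = streams_overrides(
    producer={"acks": "all", "linger.ms": "100"},
    consumer={"auto.offset.reset": "earliest"},
    streams={"state.cleanup.delay.ms": "600000"},
)
```

The resulting map can be written out as a properties file or passed to the application's configuration loader.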
Benchmark
Kafka Cluster: 17 (8-core) instances
Streams Cluster: 100 (2-core) instances
Processing Rate: 2.3M events per second
Up Next
• Feature Extraction and Model Inferencing
• Cold Bootstrap from other stand-by
• Cold Bootstrap and Repartitioning for DSL
• TTL support for State Stores
• Merge Operator for RocksJava
• Multi-Tenancy
keep-streaming...
@deepak-iiit Walmart
We are hiring!

Kafka Streams at Scale (Deepak Goyal, Walmart Labs) Kafka Summit London 2019