Streams don’t fail me now
Robustness Features in Kafka Streams

Lucas Brutschy
Software Engineer @ Confluent, Committer @ Apache Kafka
Agenda
● Basics
● Fail-over
● Error Handling: (1) Deserialization Errors, (2) Business Logic Failures, (3) Production Errors
● Upgrades & Evolution
Kafka Streams basics
Kafka Streams tl;dr
● Java library for stream processing
● Part of Apache Kafka
● Consume from and produce to Kafka
● Highly scalable, fault-tolerant
https://github.com/responsivedev/awesome-kafka-streams
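For orientation, a minimal application might look like the following sketch (the class name, topic names, application id, and broker address are all placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class HelloStreams {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hello-streams");     // consumer group & internal topic prefix
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> v.toUpperCase())                              // stateless transformation
               .to("output", Produced.with(Serdes.String(), Serdes.String()));

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));    // clean shutdown (see later slides)
    }
}
HelloStreams.java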
Scaling out
Kafka Streams fail-over
What if a node suddenly disappears?
Losing a node
● Restart the node like any other service (K8s)
● The rebalance protocol will move work to healthy nodes
● Problem: bringing back the state
Restoration
● The changelog topic serves as a backup of the local state
● Restoration blocks processing and can be slow
● K8s: use StatefulSets to make restoration less common
Standby Tasks
● Standby tasks keep an up-to-date copy of the state by reading the changelog topic
● Only copying bytes, no processing
● Quick failover, but increased cost

num.standby.replicas = 1
kafka-streams-1.properties
Across racks / data centers

Configuring Rack-awareness

client.tag.zone: mordor-west-1a
rack.aware.assignment.tags: zone
kafka-streams-1.properties

client.tag.zone: mordor-west-1a
rack.aware.assignment.tags: zone
kafka-streams-2.properties

client.tag.zone: mordor-west-1b
rack.aware.assignment.tags: zone
kafka-streams-3.properties

client.tag.zone: mordor-west-1b
rack.aware.assignment.tags: zone
kafka-streams-4.properties

⇒ KIP-708: Rack aware StandbyTask assignment for Kafka Streams
Minimizing cross-AZ traffic
● Cross-AZ traffic is slow and expensive
● Writes go to the leader, but reads should be co-located

client.rack: mordor-west-1a
kafka-streams-1.properties

⇒ KIP-392: Allow consumers to fetch from closest replica
⇒ KIP-881: Rack-aware Partition Assignment for Kafka Consumers
⇒ KIP-925: Rack aware task assignment in Kafka Streams
Okay, we can replace nodes now and restore state. What else can go wrong?

Record processing failures
Poison pills
Record processing failures come in three flavors:
(1) Deserialization errors
(2) Business logic failures
(3) Production errors, serialization errors
Poison pill: a record that triggers a failure; retries and restarts won’t help
Dead Letter Queue (DLQ)
● Unblocks processing, but recovery can be difficult
● Still needs monitoring
● Recovery strategy depends on the problem and is typically manual
● Sometimes, stopping processing is better
Dead Letter Queue (DLQ)
Map<String, KStream<String, Result<String, Integer>>> branches = stream
    .mapValues(string -> {
        try {
            return new Result<String, Integer>(Integer.parseInt(string));
        } catch (Exception exception) {
            return new Result<String, Integer>(string, exception);
        }
    })
    .split(Named.as("result-"))  // the prefix is prepended to the branch names used as map keys
    .branch((k, v) -> v.isSuccess, Branched.as("success"))
    .defaultBranch(Branched.as("failure"));
branches.get("result-success").mapValues(x -> x.result).to("output");
branches.get("result-failure").to("dlq");
StreamsApp.java
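The Result wrapper used above is not a Kafka Streams class; a minimal hand-rolled version could look like this (a sketch, with the field and constructor shapes assumed from the usage above):

// Hypothetical helper: carries either a parsed value or the original input plus the error.
public class Result<I, O> {
    public final boolean isSuccess;
    public final O result;          // set on success
    public final I original;        // set on failure
    public final Exception error;   // set on failure

    public Result(final O result) {
        this.isSuccess = true;
        this.result = result;
        this.original = null;
        this.error = null;
    }

    public Result(final I original, final Exception error) {
        this.isSuccess = false;
        this.result = null;
        this.original = original;
        this.error = error;
    }
}
Result.java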
⇒ Built-in DLQ available in Spring Cloud Stream, Michelin’s Kstreamplify
⇒ KIP for built-in DLQ in Streams coming
Deserialization Exception Handlers
A custom exception handler decides FAIL / CONTINUE when deserialization throws, e.g.:

com.google.gson.JsonSyntaxException: java.lang.IllegalStateException:
Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $

default.deserialization.exception.handler: myapp.DeserializationExceptionHandler
kafka-streams-1.properties

Typical implementations:
● LogAndFailExceptionHandler (default)
● LogAndContinueExceptionHandler
  ○ Pitfall: a Schema Registry authorization problem ⇒ skipped records
● Append to DLQ (Spring Cloud Stream, KStreamplify)
⇒ KIP-161: streams deserialization exception handlers
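A sketch of a custom handler (the classic handler signature is shown; newer Kafka versions add an ErrorHandlerContext-based variant, and the JsonSyntaxException check is just illustrative):

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

// Skip records that are genuinely malformed, fail on everything else, so that
// e.g. a Schema Registry authorization problem does not silently skip records.
public class MyDeserializationExceptionHandler implements DeserializationExceptionHandler {
    @Override
    public DeserializationHandlerResponse handle(final ProcessorContext context,
                                                 final ConsumerRecord<byte[], byte[]> record,
                                                 final Exception exception) {
        if (exception instanceof com.google.gson.JsonSyntaxException) {
            return DeserializationHandlerResponse.CONTINUE; // poison pill: log/route and move on
        }
        return DeserializationHandlerResponse.FAIL;         // unexpected: stop processing
    }

    @Override
    public void configure(final Map<String, ?> configs) { }
}
MyDeserializationExceptionHandler.java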
Production Exception Handlers
A custom exception handler decides FAIL / CONTINUE when producing fails, e.g.:

org.apache.kafka.common.errors.RecordTooLargeException: The message is
5292482 bytes when serialized, which is larger than 1048576 ...

default.production.exception.handler: myapp.ProductionExceptionHandler
kafka-streams-1.properties

Typical implementations:
● Always FAIL (default)
● Always CONTINUE: prioritize availability
● Update metrics for monitoring
● Append to DLQ (Spring Cloud Stream, KStreamplify)
⇒ KIP-210 - Provide for custom error handling when Kafka Streams fails to produce
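A sketch of a custom handler (again the classic signature; newer versions add an ErrorHandlerContext-based variant):

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

// Drop oversized records (ideally after logging them or routing their metadata
// to a DLQ), but fail on all other production errors.
public class MyProductionExceptionHandler implements ProductionExceptionHandler {
    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        if (exception instanceof RecordTooLargeException) {
            return ProductionExceptionHandlerResponse.CONTINUE;
        }
        return ProductionExceptionHandlerResponse.FAIL;
    }

    @Override
    public void configure(final Map<String, ?> configs) { }
}
MyProductionExceptionHandler.java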
Stream Thread Exception Handler
● Custom exception handler for all uncaught exceptions
● Possible decisions: REPLACE_THREAD/SHUTDOWN_CLIENT/SHUTDOWN_APPLICATION
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.setUncaughtExceptionHandler((exception) ->
StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
StreamsApp.java
Typical implementations:
● Always SHUTDOWN_CLIENT (default)
● Always REPLACE_THREAD: prioritize availability
● Limit the number of REPLACE_THREAD responses in a time window (see the sketch below)
● Only return REPLACE_THREAD for a subset of transient exceptions
⇒ KIP-671: Introduce Kafka Streams Specific Uncaught Exception Handler
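A sketch of the rate-limited variant; the window size and replacement limit are assumptions to tune per application:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

// Allow at most MAX_REPLACEMENTS thread replacements per window, then shut down
// the client so a crash loop does not go unnoticed.
public class RateLimitedExceptionHandler implements StreamsUncaughtExceptionHandler {
    private static final int MAX_REPLACEMENTS = 5;
    private static final Duration WINDOW = Duration.ofMinutes(10);
    private final Deque<Instant> replacements = new ArrayDeque<>();

    @Override
    public synchronized StreamThreadExceptionResponse handle(final Throwable exception) {
        final Instant now = Instant.now();
        // Drop replacement timestamps that fell out of the window.
        while (!replacements.isEmpty() && replacements.peekFirst().isBefore(now.minus(WINDOW))) {
            replacements.removeFirst();
        }
        if (replacements.size() >= MAX_REPLACEMENTS) {
            return StreamThreadExceptionResponse.SHUTDOWN_CLIENT;
        }
        replacements.addLast(now);
        return StreamThreadExceptionResponse.REPLACE_THREAD;
    }
}
RateLimitedExceptionHandler.java

Register it with kafkaStreams.setUncaughtExceptionHandler(new RateLimitedExceptionHandler()).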
Retrying timeouts
● No retries inside Kafka clients
○ Retry configurations for clients are ignored
○ Would cause other tasks to be blocked during retries
○ Exception: admin client always retries for max.poll.interval.ms / 2
● Every Kafka client operation is retried (at least once) until the per-task timeout expires
⇒ KIP-572: Improve timeouts and retries in Kafka Streams
task.timeout.ms = 300000
kafka-streams-1.properties
Fail-over solved, broken records are being dealt with.
Are we done yet?
Upgrading Kafka Streams & Evolving topologies
Ways to evolve & upgrade
● Offline upgrade (with reset); see the reset sketch below the table
  ○ Stop all instances
  ○ Use kafka-streams-application-reset to reset internal topics and offsets
  ○ Clean state directories
  ○ Start all instances on the new version
● Rolling bounce: replace application instances one by one

            Offline upgrade with reset   Rolling bounce
Upgrade     ✓                            ✓*
Evolve      ✓                            if topology compatible

* Check the upgrade guide: https://kafka.apache.org/37/documentation/streams/upgrade-guide
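A sketch of the reset step (the application id, broker address, and input topic are placeholders; older releases spell the broker flag --bootstrap-servers):

bin/kafka-streams-application-reset.sh \
  --application-id my-streams-app \
  --bootstrap-server localhost:9092 \
  --input-topics input

Local state directories can then be cleaned by deleting the state dir or calling KafkaStreams#cleanUp() before the first start on the new version.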
A correct rolling bounce
(1) Make sure to persist the state store (standby tasks alone won’t help here), e.g. with K8s PersistentVolumes
(2) Kafka Streams does a lot of useful things during shutdown:
● Flush caches, close RocksDB
● Wait until all produce requests are sent
● Commit offsets & transactions
● Write a checkpoint file
● Explicitly leave the consumer group (important for static membership)
⇒ Give it enough time after sending the terminate signal (e.g. increase the termination grace period in K8s); see the sketch below
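A minimal sketch of the shutdown side in Java; the 60-second close timeout is an assumption, and it (together with the K8s terminationGracePeriodSeconds) should exceed your worst-case shutdown time:

KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(() ->
    // Blocks until the clean shutdown completes or the timeout expires.
    kafkaStreams.close(Duration.ofSeconds(60))
));
StreamsApp.java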
What’s compatible in a “compatible topology”?
● Matching set of subtopologies
● Same set of input topics
● Compatible state: key/value format, naming
● Compatible key/value schemas, partitioning, and naming for internal (repartition and changelog) topics
Solving naming problems
By default, processors and state stores get generated, position-dependent names:

KStream<String, String> stream = builder.stream("input");
stream.groupByKey()
      .count()
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
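The topology descriptions shown here can be printed from any application via Topology#describe():

// Print sources, processors, generated store names and subtopology structure.
Topology topology = builder.build();
System.out.println(topology.describe());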
Solving naming problems
Adding an upstream filter shifts the generated indices, so the state store gets a new name (and with it, a new changelog topic):

KStream<String, String> stream = builder.stream("input");
stream.filter((k, v) -> v != null && v.length() >= 6)
      .groupByKey()
      .count()
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-FILTER-0000000001
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-AGGREGATE-0000000003
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-AGGREGATE-0000000003 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000002])
    ...
Solving naming problems
Naming the store explicitly with Materialized.as(...) keeps the store (and changelog) name stable, regardless of upstream changes:

KStream<String, String> stream = builder.stream("input");
stream.filter((k, v) -> v != null && v.length() >= 6)
      .groupByKey()
      .count(Materialized.as("Purchase_count_store"))
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-FILTER-0000000001
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-AGGREGATE-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [Purchase_count_store])
    ...
Examples: What’s compatible
● Changing a filter condition
● Changing a mapValues or map transformation without changing the key or value type
● Evolving schemas (protobuf etc.) in a backward-compatible way
● Adding an independent branch to the topology for the existing input topics, without introducing new repartitioning steps

New logic will only apply to new records. Test in pre-prod first.
Examples: What’s not compatible
● Changing the number of partitions of input, repartition or changelog topics
  ⇒ Breaks the existing partitioning of existing data
  ⇒ Topics need to be manually repartitioned (or reset) offline
● Changing the type of the key or value before repartitioning
  ⇒ Incompatible records in the repartition topic
  ⇒ “Draining” the repartition topics can be attempted to change the repartition format
● Adding or removing input topics
  ⇒ The partitioner will fail to handle a rolling upgrade
  ⇒ An offline upgrade without reset is possible
Manual judgement and mitigations are required; automatic streaming-logic upgrades remain largely unsolved.
Recap
● Basics
● Fail-over
● Error Handling: (1) Deserialization Errors, (2) Business Logic Failures, (3) Production Errors
● Upgrades & Evolution
Kafka Streams tl;dr
builder.stream("stocks_trades", Consumed.with(Serdes.String(), tradeSerde)) // input topic + deserialization
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMillis(5000)))     // stateful transformations
    .aggregate(
        AverageAgg::new,
        (k, v, avg) -> {
            avg.sumPrice += v.price;
            avg.countTrades++;
            return avg;
        },
        Materialized.with(Serdes.String(), averageSerde)
    )
    .toStream()
    .mapValues(v -> v.sumPrice / v.countTrades)                             // stateless transformation
    .to("average_trades", Produced.with(windowedSerde, Serdes.Double()));   // output topic + serialization