Streams don’t fail me now
Robustness Features in Kafka Streams

Lucas Brutschy
Software Engineer @ Confluent, Committer @ Apache Kafka
Agenda
● Basics
● Fail-over
● Error Handling: (1) Deserialization Errors, (2) Business Logic Failures, (3) Production Errors
● Upgrades & Evolution
Kafka Streams basics
Kafka Streams tl;dr
● Java library for stream processing
● Part of Apache Kafka
● Consume from and produce to Kafka
● Highly scalable, fault-tolerant
https://github.com/responsivedev/awesome-kafka-streams
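For orientation, a minimal application might look like the following sketch (the class name, topic names, application id, and broker address are all placeholders):

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;

public class HelloStreams {
    public static void main(final String[] args) {
        final Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hello-streams");     // consumer group & internal topic prefix
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker address

        final StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input", Consumed.with(Serdes.String(), Serdes.String()))
               .mapValues(v -> v.toUpperCase())                              // stateless transformation
               .to("output", Produced.with(Serdes.String(), Serdes.String()));

        final KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));    // clean shutdown (see later slides)
    }
}
HelloStreams.java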
Scaling out
Kafka Streams fail-over
What if a node suddenly disappears?
Losing a node
● Restart the node like any other service (K8s)
● The rebalance protocol will move work to healthy nodes
● Problem: bringing back the state
Restoration
● The changelog topic serves as a backup of the local state
● Restoration blocks processing and can be slow
● K8s: use StatefulSets to make restoration less common
Standby Tasks
● Standby tasks keep an up-to-date copy of the state by reading the changelog topic
● Only copying bytes, no processing
● Quick failover, but increased cost

num.standby.replicas = 1
kafka-streams-1.properties
Across racks / data centers

Configuring Rack-awareness

client.tag.zone: mordor-west-1a
rack.aware.assignment.tags: zone
kafka-streams-1.properties

client.tag.zone: mordor-west-1a
rack.aware.assignment.tags: zone
kafka-streams-2.properties

client.tag.zone: mordor-west-1b
rack.aware.assignment.tags: zone
kafka-streams-3.properties

client.tag.zone: mordor-west-1b
rack.aware.assignment.tags: zone
kafka-streams-4.properties

⇒ KIP-708: Rack aware StandbyTask assignment for Kafka Streams
Minimizing cross-AZ traffic
● Cross-AZ traffic is slow and expensive
● Writes go to the leader, but reads should be co-located

client.rack: mordor-west-1a
kafka-streams-1.properties

⇒ KIP-392: Allow consumers to fetch from closest replica
⇒ KIP-881: Rack-aware Partition Assignment for Kafka Consumers
⇒ KIP-925: Rack aware task assignment in Kafka Streams
Okay, we can replace nodes now and restore state. What else can go wrong?

Record processing failures
Poison pills
Record processing failures come in three flavors:
(1) Deserialization errors
(2) Business logic failures
(3) Production errors, serialization errors
Poison pill: a record that triggers a failure; retries and restarts won’t help
Dead Letter Queue (DLQ)
● Unblocks processing, but recovery can be difficult
● Still needs monitoring
● Recovery strategy depends on the problem and is typically manual
● Sometimes, stopping processing is better
Dead Letter Queue (DLQ)
Map<String, KStream<String, Result<String, Integer>>> branches = stream
    .mapValues(string -> {
        try {
            return new Result<String, Integer>(Integer.parseInt(string));
        } catch (Exception exception) {
            return new Result<String, Integer>(string, exception);
        }
    })
    .split(Named.as("result-"))  // the prefix is prepended to the branch names used as map keys
    .branch((k, v) -> v.isSuccess, Branched.as("success"))
    .defaultBranch(Branched.as("failure"));
branches.get("result-success").mapValues(x -> x.result).to("output");
branches.get("result-failure").to("dlq");
StreamsApp.java
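The Result wrapper used above is not a Kafka Streams class; a minimal hand-rolled version could look like this (a sketch, with the field and constructor shapes assumed from the usage above):

// Hypothetical helper: carries either a parsed value or the original input plus the error.
public class Result<I, O> {
    public final boolean isSuccess;
    public final O result;          // set on success
    public final I original;        // set on failure
    public final Exception error;   // set on failure

    public Result(final O result) {
        this.isSuccess = true;
        this.result = result;
        this.original = null;
        this.error = null;
    }

    public Result(final I original, final Exception error) {
        this.isSuccess = false;
        this.result = null;
        this.original = original;
        this.error = error;
    }
}
Result.java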
⇒ Built-in DLQ available in Spring Cloud Stream, Michelin’s Kstreamplify
⇒ KIP for built-in DLQ in Streams coming
Deserialization Exception Handlers
A custom exception handler decides FAIL / CONTINUE when deserialization throws, e.g.:

com.google.gson.JsonSyntaxException: java.lang.IllegalStateException:
Expected BEGIN_OBJECT but was STRING at line 1 column 1 path $

default.deserialization.exception.handler: myapp.DeserializationExceptionHandler
kafka-streams-1.properties

Typical implementations:
● LogAndFailExceptionHandler (default)
● LogAndContinueExceptionHandler
  ○ Pitfall: a Schema Registry authorization problem ⇒ skipped records
● Append to DLQ (Spring Cloud Stream, KStreamplify)
⇒ KIP-161: streams deserialization exception handlers
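A sketch of a custom handler (the classic handler signature is shown; newer Kafka versions add an ErrorHandlerContext-based variant, and the JsonSyntaxException check is just illustrative):

import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.errors.DeserializationExceptionHandler;
import org.apache.kafka.streams.processor.ProcessorContext;

// Skip records that are genuinely malformed, fail on everything else, so that
// e.g. a Schema Registry authorization problem does not silently skip records.
public class MyDeserializationExceptionHandler implements DeserializationExceptionHandler {
    @Override
    public DeserializationHandlerResponse handle(final ProcessorContext context,
                                                 final ConsumerRecord<byte[], byte[]> record,
                                                 final Exception exception) {
        if (exception instanceof com.google.gson.JsonSyntaxException) {
            return DeserializationHandlerResponse.CONTINUE; // poison pill: log/route and move on
        }
        return DeserializationHandlerResponse.FAIL;         // unexpected: stop processing
    }

    @Override
    public void configure(final Map<String, ?> configs) { }
}
MyDeserializationExceptionHandler.java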
Production Exception Handlers
A custom exception handler decides FAIL / CONTINUE when producing fails, e.g.:

org.apache.kafka.common.errors.RecordTooLargeException: The message is
5292482 bytes when serialized, which is larger than 1048576 ...

default.production.exception.handler: myapp.ProductionExceptionHandler
kafka-streams-1.properties

Typical implementations:
● Always FAIL (default)
● Always CONTINUE: prioritize availability
● Update metrics for monitoring
● Append to DLQ (Spring Cloud Stream, KStreamplify)
⇒ KIP-210 - Provide for custom error handling when Kafka Streams fails to produce
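A sketch of a custom handler (again the classic signature; newer versions add an ErrorHandlerContext-based variant):

import java.util.Map;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RecordTooLargeException;
import org.apache.kafka.streams.errors.ProductionExceptionHandler;

// Drop oversized records (ideally after logging them or routing their metadata
// to a DLQ), but fail on all other production errors.
public class MyProductionExceptionHandler implements ProductionExceptionHandler {
    @Override
    public ProductionExceptionHandlerResponse handle(final ProducerRecord<byte[], byte[]> record,
                                                     final Exception exception) {
        if (exception instanceof RecordTooLargeException) {
            return ProductionExceptionHandlerResponse.CONTINUE;
        }
        return ProductionExceptionHandlerResponse.FAIL;
    }

    @Override
    public void configure(final Map<String, ?> configs) { }
}
MyProductionExceptionHandler.java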
Stream Thread Exception Handler
● Custom exception handler for all uncaught exceptions
● Possible decisions: REPLACE_THREAD/SHUTDOWN_CLIENT/SHUTDOWN_APPLICATION
KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.setUncaughtExceptionHandler((exception) ->
StreamsUncaughtExceptionHandler.StreamThreadExceptionResponse.REPLACE_THREAD);
StreamsApp.java
Typical implementations:
● Always SHUTDOWN_CLIENT (default)
● Always REPLACE_THREAD: prioritize availability
● Limit the number of REPLACE_THREAD responses in a time window (see the sketch below)
● Only return REPLACE_THREAD for a subset of transient exceptions
⇒ KIP-671: Introduce Kafka Streams Specific Uncaught Exception Handler
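A sketch of the rate-limited variant; the window size and replacement limit are assumptions to tune per application:

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Deque;
import org.apache.kafka.streams.errors.StreamsUncaughtExceptionHandler;

// Allow at most MAX_REPLACEMENTS thread replacements per window, then shut down
// the client so a crash loop does not go unnoticed.
public class RateLimitedExceptionHandler implements StreamsUncaughtExceptionHandler {
    private static final int MAX_REPLACEMENTS = 5;
    private static final Duration WINDOW = Duration.ofMinutes(10);
    private final Deque<Instant> replacements = new ArrayDeque<>();

    @Override
    public synchronized StreamThreadExceptionResponse handle(final Throwable exception) {
        final Instant now = Instant.now();
        // Drop replacement timestamps that fell out of the window.
        while (!replacements.isEmpty() && replacements.peekFirst().isBefore(now.minus(WINDOW))) {
            replacements.removeFirst();
        }
        if (replacements.size() >= MAX_REPLACEMENTS) {
            return StreamThreadExceptionResponse.SHUTDOWN_CLIENT;
        }
        replacements.addLast(now);
        return StreamThreadExceptionResponse.REPLACE_THREAD;
    }
}
RateLimitedExceptionHandler.java

Register it with kafkaStreams.setUncaughtExceptionHandler(new RateLimitedExceptionHandler()).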
Retrying timeouts
● No retries inside Kafka clients
○ Retry configurations for clients are ignored
○ Would cause other tasks to be blocked during retries
○ Exception: admin client always retries for max.poll.interval.ms / 2
● Every Kafka client operation is retried (at least once) until the per-task timeout expires
⇒ KIP-572: Improve timeouts and retries in Kafka Streams
task.timeout.ms = 300000
kafka-streams-1.properties
Fail-over solved, broken records are being dealt with.
Are we done yet?
Upgrading Kafka Streams & Evolving topologies
Ways to evolve & upgrade
● Offline upgrade (with reset); see the reset sketch below the table
  ○ Stop all instances
  ○ Use kafka-streams-application-reset to reset internal topics and offsets
  ○ Clean state directories
  ○ Start all instances on the new version
● Rolling bounce: replace application instances one by one

            Offline upgrade with reset   Rolling bounce
Upgrade     ✓                            ✓*
Evolve      ✓                            if topology compatible

* Check the upgrade guide: https://kafka.apache.org/37/documentation/streams/upgrade-guide
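A sketch of the reset step (the application id, broker address, and input topic are placeholders; older releases spell the broker flag --bootstrap-servers):

bin/kafka-streams-application-reset.sh \
  --application-id my-streams-app \
  --bootstrap-server localhost:9092 \
  --input-topics input

Local state directories can then be cleaned by deleting the state dir or calling KafkaStreams#cleanUp() before the first start on the new version.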
A correct rolling bounce
(1) Make sure to persist the state store (standby tasks alone won’t help here), e.g. with K8s PersistentVolumes
(2) Kafka Streams does a lot of useful things during shutdown:
● Flush caches, close RocksDB
● Wait until all produce requests are sent
● Commit offsets & transactions
● Write a checkpoint file
● Explicitly leave the consumer group (important for static membership)
⇒ Give it enough time after sending the terminate signal (e.g. increase the termination grace period in K8s); see the sketch below
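A minimal sketch of the shutdown side in Java; the 60-second close timeout is an assumption, and it (together with the K8s terminationGracePeriodSeconds) should exceed your worst-case shutdown time:

KafkaStreams kafkaStreams = new KafkaStreams(builder.build(), props);
kafkaStreams.start();
Runtime.getRuntime().addShutdownHook(new Thread(() ->
    // Blocks until the clean shutdown completes or the timeout expires.
    kafkaStreams.close(Duration.ofSeconds(60))
));
StreamsApp.java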
What’s compatible in a “compatible topology”?
● Matching set of subtopologies
● Same set of input topics
● Compatible state: key/value format, naming
● Compatible key/value schemas, partitioning, and naming for internal (repartition and changelog) topics
Solving naming problems
By default, processors and state stores get generated, position-dependent names:

KStream<String, String> stream = builder.stream("input");
stream.groupByKey()
      .count()
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-AGGREGATE-0000000002
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000001])
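The topology descriptions shown here can be printed from any application via Topology#describe():

// Print sources, processors, generated store names and subtopology structure.
Topology topology = builder.build();
System.out.println(topology.describe());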
Solving naming problems
Adding an upstream filter shifts the generated indices, so the state store gets a new name (and with it, a new changelog topic):

KStream<String, String> stream = builder.stream("input");
stream.filter((k, v) -> v != null && v.length() >= 6)
      .groupByKey()
      .count()
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-FILTER-0000000001
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-AGGREGATE-0000000003
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-AGGREGATE-0000000003 (stores: [KSTREAM-AGGREGATE-STATE-STORE-0000000002])
    ...
Solving naming problems
Naming the store explicitly with Materialized.as(...) keeps the store (and changelog) name stable, regardless of upstream changes:

KStream<String, String> stream = builder.stream("input");
stream.filter((k, v) -> v != null && v.length() >= 6)
      .groupByKey()
      .count(Materialized.as("Purchase_count_store"))
      .toStream()
      .to("output");

Topologies:
   Sub-topology: 0
    Source: KSTREAM-SOURCE-0000000000 (topics: [input])
      --> KSTREAM-FILTER-0000000001
    Processor: KSTREAM-FILTER-0000000001 (stores: [])
      --> KSTREAM-AGGREGATE-0000000002
      <-- KSTREAM-SOURCE-0000000000
    Processor: KSTREAM-AGGREGATE-0000000002 (stores: [Purchase_count_store])
    ...
Examples: What’s compatible
● Changing a filter condition
● Changing a mapValues or map transformation without changing the key or value type
● Evolving schemas (protobuf etc.) in a backward-compatible way
● Adding an independent branch to the topology for the existing input topics, without introducing new repartitioning steps

New logic will only apply to new records. Test in pre-prod first.
Examples: What’s not compatible
● Changing the number of partitions of input, repartition or changelog topics
  ⇒ Breaks the existing partitioning of existing data
  ⇒ Topics need to be manually repartitioned (or reset) offline
● Changing the type of the key or value before repartitioning
  ⇒ Incompatible records in the repartition topic
  ⇒ “Draining” the repartition topics can be attempted to change the repartition format
● Adding or removing input topics
  ⇒ The partitioner will fail to handle a rolling upgrade
  ⇒ An offline upgrade without reset is possible
Manual judgement and mitigations are required; automatic streaming-logic upgrades remain largely unsolved.
Recap
● Basics
● Fail-over
● Error Handling: (1) Deserialization Errors, (2) Business Logic Failures, (3) Production Errors
● Upgrades & Evolution
Kafka Streams tl;dr
builder.stream("stocks_trades", Consumed.with(Serdes.String(), tradeSerde)) // input topic + deserialization
    .groupByKey()
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMillis(5000)))     // stateful transformations
    .aggregate(
        AverageAgg::new,
        (k, v, avg) -> {
            avg.sumPrice += v.price;
            avg.countTrades++;
            return avg;
        },
        Materialized.with(Serdes.String(), averageSerde)
    )
    .toStream()
    .mapValues(v -> v.sumPrice / v.countTrades)                             // stateless transformation
    .to("average_trades", Produced.with(windowedSerde, Serdes.Double()));   // output topic + serialization