Beyond the DSL - Unlocking the power of Kafka Streams with the Processor API

1
Beyond the DSL - #process
If you’re PAPI and you know it, merge your streams!
Antony Stubbs
Solution Architect EMEA
Confluent
MÜNCHEN - 09. OKTOBER 2018

3
Kafka Streams DSL - the Easy Path

5
Quick Scientific™ survey
Who in the audience uses Kafka in prod?
- Stream processing frameworks?
- Kafka Streams?
- PAPI?

6
Antony Stubbs - New Zealand Made
● @psynikal
● github.com/astubbs
● Confluent for ~2 years
● Consultant in EMEA
● Kafka Streams is my favourite

8
Topologies, Trolls and Troglodytes

9
What is the DSL?
KStream<Integer, Integer> input = builder.stream("numbers-topic");
// Stateless computation
KStream<Integer, Integer> doubled = input.mapValues(v -> v * 2);
// Stateful computation
KTable<Integer, Integer> sumOfOdds = input
    .filter((k, v) -> v % 2 != 0)
    .selectKey((k, v) -> 1)
    .groupByKey()
    .reduce((v1, v2) -> v1 + v2, "sum-of-odds");

11
What is #process?
Flexibility

13
What is #process?
Power
But with great power...
but not that much… :)

14
What is #process?
KStream#process({magic})

15
What is #process?
interface Processor<K, V> {
    void process(K key, V value);
}

16
What is #transform?
interface Transformer<K, V, R> {
    R transform(K key, V value);
}

17
What is #process?
interface Processor<K, V> {
    void init(ProcessorContext context);
    void process(K key, V value);
    <... snip ...>
}

19
PAPI vs DSL
When should you use which?
● “It depends”
● DSL
○ Easy
○ Can do a lot with the blocks
■ Clever with data structures
○ If it fits nicely, use it
● PAPI can be more advanced, but also super flexible
○ Build reusable processors for your company
○ Doesn’t have the “nice” utility functions like count - but that’s the point
○ Can “shoot yourself in the foot”
○ Be responsible
● Don’t bend over backwards to fit your model to the DSL

20
A Combination - the Best of Both Worlds
ppvStream.groupByKey().reduce({ newState, ktableEntry ->
<.... reduce function …>
}).toStream().transform({
new Transformer<>() {
<.... do something complicated…>
}
}).mapValues({ v ->
<.... map function …>
}).to("output-deltas-v2")

21
What is a State Store? IQ?
A local database - RocksDB
● K/V Store
● High speed
● Spills to disk
An optimisation?
- Moves the state to the processing
What are Interactive Queries (IQ)?
By nature of Kafka
● Distributed
● Redundant
● Backed onto Kafka (optionally)
● Highly available (Kafka + Standby tasks)

22
Working With Processors and State Stores
Simple:
- Deduplication (vs EOS)
- Secondary indices
- Need to do something periodically
- TTL Caches
- Synchronous state checking
Advanced:
- State recalculation
- Expiring data efficiently
- Global KTable triggered joins (DSL work around)
- Probabilistic counting with custom store implementations...
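The deduplication item above can be sketched as follows. This is a minimal illustration only: a plain HashMap stands in for a key/value state store, and the class and method names are hypothetical, not part of Kafka Streams. In a real processor the lookup would go against a store obtained from the ProcessorContext, and the "seen" entries would need a TTL so the store does not grow without bound.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; a HashMap stands in for a key/value state store.
final class Deduplicator {
    // Event id -> timestamp at which the id was first seen.
    private final Map<String, Long> seen = new HashMap<>();

    // Returns true when the event is new and should be forwarded downstream,
    // false when the id has already been recorded (a duplicate).
    boolean firstTimeSeen(String eventId, long timestampMs) {
        return seen.putIfAbsent(eventId, timestampMs) == null;
    }
}
```

Inside a processor, the record would only be forwarded when firstTimeSeen returns true.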

24
State Subselects - Compound Keys
State stores have the #put, #get and #range methods - this brings some new magic...
Order Items
- Select all order items from my state store, for this order key
- Avoids building larger and larger compounded values, which take longer and longer to update - use individual entries instead
Time
- Timestamp compounded with key
- Great for retrieving entries within computable time windows (hint hint)
- Great for scanning over a range of entries
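A minimal sketch of the order items idea, assuming a key layout of "orderId|itemId" (the layout, class and method names are illustrative, not from the talk). A TreeMap stands in for a sorted state store; a real Kafka Streams store orders keys by their serialized bytes, so the serde has to preserve this ordering for a range scan to behave the same way.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical stand-in for a sorted key/value store with compound keys.
final class OrderItemStore {
    private final NavigableMap<String, Integer> store = new TreeMap<>();

    // Compound key: order id, a separator, then the item id.
    // Assumes order ids never contain the '|' separator.
    private static String key(String orderId, String itemId) {
        return orderId + "|" + itemId;
    }

    void put(String orderId, String itemId, int quantity) {
        store.put(key(orderId, itemId), quantity);
    }

    // Range scan over every entry whose key starts with "<orderId>|".
    // '}' is the character directly after '|', so "<orderId>}" works as an
    // exclusive upper bound for that prefix.
    List<String> itemsOf(String orderId) {
        String from = orderId + "|";
        String to = orderId + "}";
        return new ArrayList<>(store.subMap(from, true, to, false).keySet());
    }
}
```

Each order item is its own small entry, so adding an item never rewrites a growing aggregate value.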

25
State Subselects - Secondary Indices
● Think ~“tables”...
● Serving an “Interactive Query” for a few possible fields...
○ Can’t construct compound keys for multiple combinations
○ One State Store per combination
○ Upon inserting the primary key/value
■ Also insert into the extra stores the field<->key mappings
■ Upon query, query against the appropriate store that holds the mappings for the requested field
■ Collect the possibly many values (keys) and retrieve the entire object(s) from the primary store
KEY → VALUE
TYPE+KEY → KEY
EMAIL → KEY
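The EMAIL → KEY mapping above might look like the sketch below. It is an assumption-laden illustration: two in-memory maps stand in for the primary and secondary state stores, and the User and IndexedUserStore names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical value type held in the primary store.
final class User {
    final String key;
    final String email;
    User(String key, String email) { this.key = key; this.email = email; }
}

// Two maps stand in for two state stores: KEY -> VALUE and EMAIL -> KEYs.
final class IndexedUserStore {
    private final Map<String, User> primary = new HashMap<>();
    private final Map<String, TreeSet<String>> emailIndex = new HashMap<>();

    // Insert into the primary store and keep the secondary index in step.
    void put(User user) {
        User previous = primary.put(user.key, user);
        if (previous != null) {
            // Drop the stale EMAIL -> KEY mapping left by the old value.
            TreeSet<String> stale = emailIndex.get(previous.email);
            if (stale != null) {
                stale.remove(user.key);
                if (stale.isEmpty()) emailIndex.remove(previous.email);
            }
        }
        emailIndex.computeIfAbsent(user.email, e -> new TreeSet<>()).add(user.key);
    }

    // Query the index first, then fetch the full objects from the primary store.
    List<User> findByEmail(String email) {
        List<User> result = new ArrayList<>();
        for (String key : emailIndex.getOrDefault(email, new TreeSet<>())) {
            result.add(primary.get(key));
        }
        return result;
    }
}
```

One such index store would be needed per queryable field or field combination.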

29
Case Study - leveraging state stores
WIPERS = ON WIPERS = OFF WIPERS = ?

32
Dynamic,
dependent,
aggregate,
on demand recalculation,
from -
out of order data.
So, need to go back and recalculate...

33
DSL despair!
Can’t update aggregates outside of the aggregate that has been triggered...
Potential messy DSL solution (bend over backwards) - synthetic events!
- Publish an event back to the topic with the correct time stamp and information needed to retrigger the other aggregates
- You need to calculate / all / the correct time stamps
- Pollutes the stream with fake data
- Is unnatural / smells
- Breaks out of the KS transaction (EOS)

34
Enter #process()
Keep track of our aggregates ourselves
- Need to calculate our own time buckets
- Query our store by time for possible future buckets
- All kept within the KS context (no producer break out for synthetic events)
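One possible reading of "calculate our own time buckets" is sketched below; the talk does not show its implementation, so treat every name here as hypothetical. A TreeMap stands in for the state store, buckets are fixed-size and keyed by their start time, and timestamps are assumed to be non-negative.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical bucketed aggregate; a TreeMap stands in for a state store.
final class BucketedSums {
    private final long bucketSizeMs;
    // Bucket start time -> running sum for that bucket.
    private final NavigableMap<Long, Long> sums = new TreeMap<>();

    BucketedSums(long bucketSizeMs) { this.bucketSizeMs = bucketSizeMs; }

    // Start of the fixed-size bucket that contains this event time.
    long bucketStart(long eventTimeMs) {
        return eventTimeMs - (eventTimeMs % bucketSizeMs);
    }

    // Adds the value to the bucket its event time belongs to, regardless of
    // arrival order, and returns the bucket's new sum.
    long add(long eventTimeMs, long value) {
        return sums.merge(bucketStart(eventTimeMs), value, Long::sum);
    }

    // All buckets at or after the bucket of the given event time: the
    // candidates to revisit when an out-of-order record changes earlier state.
    NavigableMap<Long, Long> bucketsFrom(long eventTimeMs) {
        return sums.tailMap(bucketStart(eventTimeMs), true);
    }
}
```

Because the processor owns the store, a late record can update its own bucket and then walk the later buckets directly, with no synthetic events published back to the topic.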

35
Case Study - leveraging state stores

36
Punc’tuation.?
● What are Punctuations?
● What is Wall Time?
● What is Event Time?

37
● DSL has window retention periods
○ We need state - but for how long?
○ KTable TTL? (Bounded vs unbounded keyset)
#process window retention period you may ask?
● #process TTL?
○ Using punctuation - scan periodically through the state store and delete all buckets that are beyond our retention period
○ Do TTL on “KTable” type data
● How? Compound keys...
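A sketch of TTL via compound keys, under stated assumptions: the key is a zero-padded timestamp followed by the business key, a TreeMap stands in for the sorted state store, and purge is what a periodically scheduled punctuation would call. All names are hypothetical. Note that with the timestamp first, a lookup by business key alone would need a second store or a different key layout.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical TTL store; a TreeMap stands in for a sorted state store.
final class ExpiringStore {
    private final long retentionMs;
    private final NavigableMap<String, String> store = new TreeMap<>();

    ExpiringStore(long retentionMs) { this.retentionMs = retentionMs; }

    // Zero-padded timestamp first, so lexicographic order equals time order.
    // Assumes non-negative timestamps.
    private static String compoundKey(long timestampMs, String key) {
        return String.format("%019d|%s", timestampMs, key);
    }

    void put(long timestampMs, String key, String value) {
        store.put(compoundKey(timestampMs, key), value);
    }

    // Deletes every entry older than the retention period in one range
    // operation and returns how many entries were removed.
    int purge(long nowMs) {
        String cutoff = String.format("%019d", nowMs - retentionMs);
        NavigableMap<String, String> expired = store.headMap(cutoff, false);
        int removed = expired.size();
        expired.clear();
        return removed;
    }

    int size() { return store.size(); }
}
```

The purge only touches the expired prefix of the key space rather than scanning every entry.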

38
Future Event Triggers
Expiring credit cards in real time or some other future timed action
- don’t want to poll all entries and check action times
- need to be able to expire tasks
Time as secondary index (either compound key or secondary state store)
- range select on all keys with time value before now
- take action
- emit action taken event (context.forward to specific node or emit)
- delete entry
- poll state store with range select every ~second,
- or schedule next punctuator to run at timestamp of next event
- need to update
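The range-select loop above could be sketched like this, assuming time is kept as a secondary index in its own sorted store (a TreeMap here). The class and method names are hypothetical, and "taking action" is reduced to returning the ids that became due.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical future-action index; a TreeMap stands in for a sorted store.
final class FutureActions {
    // Due timestamp -> ids of the entries to act on at that time.
    private final TreeMap<Long, List<String>> dueIndex = new TreeMap<>();

    void schedule(long dueAtMs, String id) {
        dueIndex.computeIfAbsent(dueAtMs, t -> new ArrayList<>()).add(id);
    }

    // Range select on everything due at or before now, collect the ids to act
    // on, then delete those entries so they never fire twice.
    List<String> fireDue(long nowMs) {
        List<String> fired = new ArrayList<>();
        Map<Long, List<String>> due = dueIndex.headMap(nowMs, true);
        for (List<String> ids : due.values()) {
            fired.addAll(ids);
        }
        due.clear();
        return fired;
    }

    // Timestamp of the next pending action, or -1 when nothing is scheduled.
    long nextDueAt() {
        return dueIndex.isEmpty() ? -1L : dueIndex.firstKey();
    }
}
```

A punctuator could call fireDue every second or so, or use nextDueAt to decide when it next needs to run.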

39
Database Changelog Derivation
Problem:
● DB CDC doesn’t emit deltas, only full state
● Can’t see what’s changed in the document
Solution:
● Derive deltas from full state, stored in a stateful stream processor
● Can use KTable tuples
Issue:
● No TTL - enter PAPI
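Deriving deltas from full state might be sketched as below. This is an illustration under assumptions: documents are flattened to field/value string maps, a HashMap stands in for the state store holding the last full state per document, and the names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical delta deriver; a HashMap stands in for a state store.
final class DeltaDeriver {
    // Document id -> last full state seen for that document.
    private final Map<String, Map<String, String>> lastState = new HashMap<>();

    // Stores the new full state and returns only the fields whose value
    // changed since the previous one; a removed field maps to null.
    Map<String, String> deltaFor(String id, Map<String, String> fullState) {
        Map<String, String> previous = lastState.put(id, new HashMap<>(fullState));
        if (previous == null) {
            previous = new HashMap<>();
        }
        Map<String, String> delta = new TreeMap<>();
        for (Map.Entry<String, String> e : fullState.entrySet()) {
            if (!Objects.equals(previous.get(e.getKey()), e.getValue())) {
                delta.put(e.getKey(), e.getValue());
            }
        }
        for (String field : previous.keySet()) {
            if (!fullState.containsKey(field)) {
                delta.put(field, null);
            }
        }
        return delta;
    }
}
```

The stored last state is exactly the data that needs a TTL once a document stops changing.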

40
Distributed one to MANY MANY MANY late (maybe) joins
What’s the problem?
- Effectively re-joining missed joins once the right hand side arrives
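The slide gives little detail, so the following is only one way to read "re-joining missed joins": left-side records that find no match are parked in a store and joined once the right-hand side arrives. Two HashMaps stand in for state stores, the join result is a simple string concatenation, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical late joiner; HashMaps stand in for two state stores.
final class LateJoiner {
    private final Map<String, String> rightSide = new HashMap<>();
    private final Map<String, List<String>> unjoinedLeft = new HashMap<>();

    // A left-side record either joins immediately or is parked until the
    // right-hand side shows up. Returns the join results to forward.
    List<String> onLeft(String joinKey, String leftValue) {
        List<String> out = new ArrayList<>();
        String right = rightSide.get(joinKey);
        if (right != null) {
            out.add(leftValue + "+" + right);
        } else {
            unjoinedLeft.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(leftValue);
        }
        return out;
    }

    // When the right-hand side arrives, re-join every parked left record.
    List<String> onRight(String joinKey, String rightValue) {
        rightSide.put(joinKey, rightValue);
        List<String> out = new ArrayList<>();
        List<String> parked = unjoinedLeft.remove(joinKey);
        if (parked != null) {
            for (String left : parked) {
                out.add(left + "+" + rightValue);
            }
        }
        return out;
    }
}
```

With many left records per key, the parked list would be better stored as individual compound-key entries than as one growing value.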

52
Optimising Topologies
Sometimes it’s useful to avoid some DSL overheads
● Combine operators
● Avoid repartitioning in some cases
● etc...
Beware inconsistent hashing...

53
Speaking of topology optimisation...
Two phase topology building
- First optimisation is reusing intermediate rekey topics
- Avoids branched on demand rekey further down the DAG by detecting the rekey and moving it up immediately
- manually achievable by forcing rekey straight away with #through
Global topology optimisation coming in 2.1

54
Now I did ask about KSQL after all...
Check out KSQL if you haven’t already…
● Abstraction over Kafka Streams
● Languages outside of the JVM
● Non programmers
● Among others...
KSQL User Defined Functions in CP 5.0!
● Parallels Processors combined with the DSL: you can now insert more complex functionality into KSQL
○ E.g. train a machine learning model and use it as a UDF in KSQL

55
Where to next? & We’re hiring!
● github.com/confluentinc/kafka-streams-examples
● “Ben Stopford” on youtube.com
● Kafka Streams playlist on confluentinc youtube
● Consulting services? Contact sales@confluent.io
Further reading
● confluent.io/resources/
● docs.confluent.io/current/streams/
● confluentinc on Youtube
● github.com/astubbs
● @psynikal
Come find me for Q&A later...
Don’t be afraid of #process and do drop down from the DSL for some operations!
Join us! https://www.confluent.io/careers/

56
THANK YOU!
Learn more:
confluent.io/download
confluent.io/product/ksql/
confluent.io/confluent-cloud/
