Data Policies for the Kafka-API with
WebAssembly
https://github.com/vectorizedio/redpanda
background
● i’ve worked on streaming sys. for 12+ years
● developer, founder & CEO of Vectorized, hacking on
Redpanda, a modern streaming platform for mission
critical workloads.
● previously, principal engineer at Akamai; co-founder &
CTO of concord.io, a high performance stream
processing engine built in C++ and acquired by Akamai
in 2016
alex gallego
@emaxerrno
keystone problem in streaming
The Data
What you
care about
keystone problem in streaming: consume what you want
The Data
What you
care about
The Good Parts Of Streaming
○ Streaming data is immutable
■ Good for understandability
■ Built-in auditing
○ Ability to replay data
○ The essence of a streaming platform is to
decouple producers from consumers
■ Focuses on Data vs Who, What and
When produced it
○ Proven at scale
The *Not so* Good Parts Of Streaming
○ Consume *at least* as much as you
produce
■ In the same order
■ common to consume 10x more than you produce
○ Up front architecture cost of splitting your
streams, types, and data contracts for
things like privacy, etc…
○ Can easily saturate your network (simply
shifting the bottleneck)
■ Forces developers to create
specialized clusters
● Mission Critical Prod
● Analytics/Dashboard Prod
● ML sample clusters
keystone problem in streaming: consume what you want
The Data
What you
care about 🥳🤯🎉 - hooray!
performance
improvement
2007 2011 2020
SSD: $2500/TB
typical instance 4 cores
SSD $200/TB - 1000x faster, 10x cheaper
225 core VMs - 30x more cores
100Gbps NICs - 100x more throughput
first open source solutions take advantage of cheap disk
disaggregate compute and storage
modern hardware + cloud native
30x taller computers + 1000x faster disks
thread per core
architecture
● explicit scheduling everywhere
○ IO groups
○ x-core groups (smp)
○ memory throttling
● ONLY supports async interfaces
○ requires library re-writes for
threading model to work
well
future<>
● viral primitive (like actors, Orleans, Akka, Pony, etc) - mix, map-reduce, filter,
chain, fail, complete, generate, fulfill, sleep, expire futures, etc
● fundamentally about program structure. w/ concurrent structure, parallelism
is a free variable
● one pinned thread per core - must express parallelism and concurrency
explicitly
● no locks on the hotpath - network of SPSC queues
async-only
cooperative scheduling framework
new way to build software:
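To make the "network of SPSC queues" bullet concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer. The class name `SpscQueue` and its methods are hypothetical and not taken from Redpanda. The sketch only shows the index-ownership idea (the producer alone advances `tail`, the consumer alone advances `head`); a real cross-core queue would also need atomic loads and stores with the right memory ordering, which single-threaded JavaScript does not model.

```javascript
// Hypothetical SPSC ring buffer: one side only pushes, the other only pops.
class SpscQueue {
  constructor(capacity) {
    // One slot is always left empty so that "full" and "empty" differ.
    this.slots = new Array(capacity + 1);
    this.head = 0; // next index to read; written only by the consumer
    this.tail = 0; // next index to write; written only by the producer
  }

  // Producer side: returns false instead of blocking when the queue is full.
  tryPush(item) {
    const next = (this.tail + 1) % this.slots.length;
    if (next === this.head) return false;
    this.slots[this.tail] = item;
    this.tail = next;
    return true;
  }

  // Consumer side: returns undefined when the queue is empty.
  tryPop() {
    if (this.head === this.tail) return undefined;
    const item = this.slots[this.head];
    this.slots[this.head] = undefined;
    this.head = (this.head + 1) % this.slots.length;
    return item;
  }
}
```

Because neither side ever writes the other side's index, no lock is needed on the hot path; a full queue simply reports back-pressure to the producer.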
no virtual memory
buddy allocator
● preallocate 100% of mem; split across
N-cores for thread-local allocation/access
● create pools by dividing the memory one
layer above/2 and creating a new pool
● large allocations (above 64KB) are not
pooled
● buddy allocator pools for all object sizes
below 64KB
● full free-lists are recycled
● difficult to use this technique in practice,
and requires developer
retraining/accounting for every single byte
present in the system at all times
○ forces developer to pay additional attention
to all hash-maps, allocations, pooling, etc
Pools
of 8KB
Pools of 16KB
Pools of 64KB
Pool 0 - large object pool;
above 64KB+1
memory/2
memory/2
...
memory core local (usually around 2GB+)
memory global/N cores…
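The allocation rules above can be sketched as a few small functions. This is an illustration under stated assumptions, not Redpanda's allocator: `MIN_BLOCK`, `perCoreBytes`, `sizeClass`, and `buddyOf` are hypothetical names, the 32-byte minimum block is invented for the example, and treating a request of exactly 64KB as pooled is an assumption (the slide only says "below" and "above" 64KB).

```javascript
const KB = 1024;
const LARGE_THRESHOLD = 64 * KB; // from the slide: larger requests are not pooled
const MIN_BLOCK = 32; // assumption: smallest block size, chosen for illustration

// Memory is preallocated up front and split evenly across cores.
function perCoreBytes(totalBytes, cores) {
  return Math.floor(totalBytes / cores);
}

// Pick where a request goes: a power-of-two buddy pool or the large-object pool.
function sizeClass(bytes) {
  if (bytes > LARGE_THRESHOLD) return { pool: "large", blockSize: bytes };
  let block = MIN_BLOCK;
  while (block < bytes) block *= 2; // round up to the next power of two
  return { pool: "buddy", blockSize: block };
}

// Classic buddy trick: a block's buddy is found by flipping the bit equal
// to the block size in its offset, so two free buddies can be merged.
function buddyOf(offset, blockSize) {
  return offset ^ blockSize;
}
```

For example, with 16GiB preallocated on an 8-core machine, `perCoreBytes` gives each core 2GiB, which lines up with the "usually around 2GB+" core-local figure in the diagram.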
iobuf - TPC buffer management
src: https://vectorized.io/blog/tpc-buffers/
request pipelining per partition
● parallelism model == number of
cores/pthreads in the system
● read full request metadata and assign
subrequest to physical core
● for all non-overlapping cores, execute in
parallel
● for all overlapping cores, pipeline per
*partition* (enqueue writes in order)
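A minimal sketch of that planning step, assuming a request has already been parsed into subrequests that each name a partition. The function `planRequest` and the `partition % cores` placement rule are hypothetical; the slides do not say how partitions map to cores.

```javascript
// Group subrequests by owning core, then by partition, preserving arrival order.
// Different cores can run their groups in parallel; within one core, each
// partition's list is a pipeline that must be applied in order.
function planRequest(subrequests, cores) {
  const perCore = new Map(); // core -> Map(partition -> ordered subrequests)
  for (const sub of subrequests) {
    const core = sub.partition % cores; // assumption: simple modulo placement
    if (!perCore.has(core)) perCore.set(core, new Map());
    const perPartition = perCore.get(core);
    if (!perPartition.has(sub.partition)) perPartition.set(sub.partition, []);
    perPartition.get(sub.partition).push(sub);
  }
  return perCore;
}

const plan = planRequest(
  [
    { partition: 0, id: "a" },
    { partition: 1, id: "b" },
    { partition: 0, id: "c" },
    { partition: 4, id: "d" },
  ],
  4
);
```

Here partitions 0 and 4 both land on core 0, so their writes are queued on that core, while partition 1 is handled by core 1 in parallel. Within partition 0, "a" stays ahead of "c".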
core-local metadata piggybacking
(...pandabacking?)
● maintain core-local metadata cache of
○ bytes written per partition (for future
readers)
○ latencies from the remote core (could be
highly contended and we need TCP
backpressure)
○ per TCP-connection read-ahead pointers
on disk for O(1) access/assignment
copy-on-read cache
x-shard metadata for low
latency access
core-local v8::isolate per topic/partition/policy
● maintain a v8::isolate *per core*
● maintain v8::context per topic/partition
○ thread_local v8::isolate
■ Low latency access
■ Preemption
● Timebound
● CPU Cycles
● Cooperative scheduling
○ No cross-core communications
○ Small memory footprint
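The isolate/context split can be approximated with Node.js's built-in `vm` module, which is used here only as a stand-in for the embedded V8 C++ API. In this sketch the Node process plays the role of the per-core isolate, each topic/partition gets its own context, and the `timeout` option illustrates the time-bound preemption bullet. The helper names `contextFor` and `runPolicy`, and the `state` global, are hypothetical. Note that `vm` is not a security boundary.

```javascript
const vm = require("vm");

// One context per topic/partition, all living inside the same isolate.
const contexts = new Map();

function contextFor(topic, partition) {
  const key = `${topic}/${partition}`;
  let ctx = contexts.get(key);
  if (!ctx) {
    // Each context gets its own globals, so policies cannot see each other.
    ctx = vm.createContext({ state: { seen: 0 } });
    contexts.set(key, ctx);
  }
  return ctx;
}

// Run policy source in the partition's context with a time budget.
// Throws if the script runs longer than budgetMs milliseconds.
function runPolicy(source, topic, partition, budgetMs) {
  const script = new vm.Script(source);
  return script.runInContext(contextFor(topic, partition), { timeout: budgetMs });
}
```

Calling `runPolicy("state.seen += 1; state.seen", "orders", 0, 50)` twice increments the counter for `orders/0` only, while a runaway script such as `while (true) {}` is interrupted once its budget expires.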
applying a .wasm or .js to a Kafka Topic
> bin/kafka-topics.sh \
  --alter --topic my_topic_name \
  --config x-data-policy={...}
(redpanda has a tool called `rpk`
that is similar to kafka-topics.sh)
● Must be a pure function
○ limit of global state per core is 1MB
● On TCP connection
○ Look up associated data policy for the
topic
○ Instantiate a v8::context
○ Perform reads from disk, and subsystems
as normal
○ *before sending tcp bytes
■ Call v8::isolate
■ Swap v8::context
■ Transform payload
■ Re-checksum payload
■ Return new RecordBatch
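The read-path steps above can be sketched end to end. Everything here is a simplified model: the `policies` registry, `serveFetch`, and the `{ records, crc }` batch shape are hypothetical, and the checksum covers only the concatenated record values rather than the real Kafka RecordBatch byte layout. The checksum algorithm itself is CRC-32C (Castagnoli), which is what the Kafka v2 record batch format uses.

```javascript
// CRC-32C (Castagnoli) lookup table, reflected polynomial 0x82F63B78.
const CRC32C_TABLE = new Uint32Array(256);
for (let i = 0; i < 256; i++) {
  let c = i;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0x82f63b78 ^ (c >>> 1) : c >>> 1;
  CRC32C_TABLE[i] = c >>> 0;
}

function crc32c(bytes) {
  let crc = 0xffffffff;
  for (const b of bytes) crc = CRC32C_TABLE[(crc ^ b) & 0xff] ^ (crc >>> 8);
  return (crc ^ 0xffffffff) >>> 0;
}

// Simplified: checksum over the concatenated record values only.
function checksum(records) {
  return crc32c(Uint8Array.from(records.flatMap((r) => Array.from(r.value))));
}

// Hypothetical registry: topic name -> pure function over a single record.
const policies = new Map();

// Called after the normal disk read and before bytes go out on the wire.
function serveFetch(topic, batch) {
  const policy = policies.get(topic);
  if (!policy) return batch; // no policy attached: send the batch unchanged
  const records = batch.records.map(policy); // transform payload
  return { ...batch, records, crc: checksum(records) }; // re-checksum
}

// Example policy: uppercase ASCII letters in the record value.
const uppercase = (record) => ({
  ...record,
  value: record.value.map((c) => (c >= 97 && c <= 122 ? c - 32 : c)),
});
policies.set("lowercase", uppercase);
```

The checksum has to be recomputed because the transformed payload no longer matches the checksum stored with the batch on disk, and a Kafka client would reject the batch otherwise.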
flow
import { InlineTransform } from "@vectorizedio/InlineTransform";
const transform = new InlineTransform();
transform.topics([{"input": "lowercase", "output": "uppercase"}]);
...
// Uppercase every ASCII lowercase letter in the record value (a byte array).
const uppercase = async (record) => {
  const newRecord = {
    ...record,
    value: record.value.map((char) => {
      // 'a' is 97 and 'z' is 122; the bounds are inclusive so both are converted
      if (char >= 97 && char <= 122) {
        return char - 32;
      } else {
        return char;
      }
    }),
  };
  return newRecord;
};
(diagram labels: lowercase, UPPERCASE, Filter)
*using the Kafka-API in your favorite
programming language (JS, Py, Java, C++, etc)
full compatibility with all your tools. No code
changes
check out the code for yourself!
● https://github.com/vectorizedio/redpanda
● ask the maintainers questions at https://vectorized.io/slack
● say hi on twitter https://twitter.com/vectorizedio
● wasm+kafka-api https://vectorized.io/blog/wasm-architecture/

Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
