© Instaclustr Pty Limited, 2024
30 Of My Favourite
Open Source
Technologies
In 30 Minutes
Paul Brebner
Open Source Technology Evangelist
Paul Brebner (Netherlands 30
minutes bike parking zone)
What do they have in Common?
• Instaclustr provides some as
managed services
• They are complementary and
can be used together
• And I’ve used them to build
realistic demo applications
over the last 7 years
A Strange Toy I Found At The Shop
• What’s that?!
• An escaped “Pokemon”!
• When my kids were growing up Pokemon lived inside a
“Game Boy”
Format
• Name, Overview, Superpower(s), Watch out for …
• E.g. “Pokemon”
• Name: Charmander
• What: A fire Lizard
• Superpower: Evolves to Charizard, a flying fire breathing lizard
• Watch out for: Water
+ Use Cases and What’s New?
Countdown!
Flickr CCL + Wikimedia CCL
1. Apache Cassandra
Office Typing
Pool, 1918
Wikipedia Public Domain
Apache Cassandra
• What?
• NoSQL Horizontally Scalable Key-Value Database
• Superpowers
• Fast Writes (lots of typewriters)
• Wide Column Store
• Clustering Columns, good for hierarchical data modelling (E.g. Geospatial)
• In-built multi-DC replication
• My Use Cases
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Global low-latency Fintech
Apache Cassandra
• Watch Out For
• CQL != SQL
• Different data model
§ Design for reads
§ De-normalization is normal
• Weaker consistency guarantees than traditional SQL databases
• Reads are slower than writes
• What’s New?
• Vector Search in 5.0
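A minimal Python sketch of the "design for reads" point above. It is only an in-memory stand-in, not a Cassandra client: the `sensor_id` partition key, the `timestamp` clustering column and the `write`/`latest` helpers are hypothetical names chosen for illustration. Rows are grouped by partition key and kept sorted by the clustering column, so the one query the table was designed for becomes a cheap slice.

```python
from bisect import insort
from collections import defaultdict

# Hypothetical in-memory stand-in for one Cassandra table (illustration only).
# Partition key: sensor_id. Clustering column: timestamp (rows kept sorted).
table = defaultdict(list)

def write(sensor_id, timestamp, value):
    """Insert a row; rows in a partition stay ordered by the clustering column."""
    insort(table[sensor_id], (timestamp, value))

def latest(sensor_id, n):
    """The query the table was designed for: newest n readings of one sensor."""
    return table[sensor_id][-n:][::-1]

write("s1", 30, 2.5)
write("s1", 10, 1.0)
write("s2", 20, 9.9)
write("s1", 20, 1.5)
print(latest("s1", 2))  # [(30, 2.5), (20, 1.5)]
```

A different query (say, all sensors above a threshold) would need its own, de-normalized table rather than a join.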
2. Apache Spark
Car Factory
Assembly Line
Wikimedia Public Domain
Apache Spark
• What?
• Cluster batch/stream processing, analytics and ML
• Superpowers
• In-memory → fast
• Good support for ML
§ + Cassandra (wide columns) as a feature store
• Good for heavy transformation operations at scale
• My Use Cases
ML of Cassandra Monitoring Data
Apache Spark
Apache Cassandra
MLlib
DataFrames
Spark Streaming
Apache Spark
• Watch Out For
• Lots of RAM, else OOM (Out-of-Memory Errors)
• Spark Streaming is near real-time (micro-batch)
• What’s New?
• Spark 3.4 adds Spark Connect, which decouples client applications from the Spark server
• Ocean for Apache Spark
(Spot by NetApp)
3. Apache Zeppelin
Graf Zeppelin
exploring the
Arctic, 1931
Wikimedia Public Domain
Apache Zeppelin
• What?
• Web-based notebook for data exploration
• Superpowers
• Interactive “notebook” style tool
• Supports Apache Spark
Apache Zeppelin
• Watch Out For
• Sufficient Zeppelin resources
• We don’t support it anymore
• What’s New?
• Jupyter Notebook!
§ Good Kafka and Cassandra integration
The Galilean moons of Jupiter (Wikimedia CCL)
4. Apache Lucene
A Librarian using
a card catalogue
(1940)
Library of Congress Public
Domain
Apache Lucene
• What?
• Fast Full-featured Search Engine
• Superpowers
• Lucene plugin + Cassandra for enhanced Cassandra search
§ Works as a Cassandra secondary index
§ Supports vector search too
• Watch Out For
• Performance
• We currently support it: https://github.com/instaclustr/cassandra-lucene-index
• My Use Cases
Geospatial Anomaly Detection
Apache Cassandra
Apache Lucene Plugin
Geospatial searches
Wikimedia Public Domain
5. Apache Kafka
Postal Delivery
Service
Railway Post
Office:
Mail bags
snatched by
speeding train
Wikimedia CCL
Apache Kafka
• What?
• Distributed publish-subscribe messaging system
• Superpowers
• Fast
• Highly distributed and horizontally scalable, available and durable
• Buffering and message replay
• My Use Cases
Xmas Tree Lights Simulation
“Kongo” IoT Logistics Simulation
Apache Kafka
Guava Event Bus
Real-time logistics
Tracking and checking
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Apache Kafka
• Watch Out For
• Too many topics/partitions impacts throughput
• What’s New?
• KRaft (replacing ZooKeeper) for faster meta-data operations
§ And maybe even faster data workloads
• Tiered Storage (3.6)
• End-to-end client monitoring (3.7)
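The "buffering and message replay" superpower can be illustrated with a toy append-only log. This is a sketch in plain Python, not the Kafka client API: `TopicPartition`, `append` and `read` are hypothetical names. Messages are kept after they are read, so a consumer can rewind to an earlier offset and read them again.

```python
class TopicPartition:
    """Toy append-only log standing in for a single Kafka partition."""

    def __init__(self):
        self.log = []

    def append(self, message):
        """Producer side: append and return the message's offset."""
        self.log.append(message)
        return len(self.log) - 1

    def read(self, offset, max_messages=10):
        """Consumer side: read from any retained offset; nothing is removed."""
        return self.log[offset:offset + max_messages]

partition = TopicPartition()
for event in ["order-1", "order-2", "order-3"]:
    partition.append(event)

first_pass = partition.read(0)  # a consumer reads everything once
replayed = partition.read(1)    # later it rewinds to offset 1 and reads again
print(first_pass, replayed)
```

Because reading does not delete, slow consumers are buffered by the log itself, and several consumer groups can each keep their own offset.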
6. Apache Kafka Streams
Niagara Falls
Daredevil
Shutterstock
Apache Kafka Streams
• What?
• Stream processing API and client for Kafka
• From/to Kafka cluster
• Superpowers
• Complex stateful stream processing operations (e.g. joins)
• Over time windows and multiple topics and state stores
• My Use Cases
Kafka Streams IoT Application
Truck Overload
Apache Kafka Streams
• Watch Out For
• Complex stream topologies
• Debugging is tricky
• Performance
• What’s New?
• Alternatives (E.g. Apache Flink, RisingWave, etc)
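To make "stateful processing over time windows" concrete, here is a plain-Python sketch of a tumbling-window aggregation, loosely modelled on the Truck Overload use case. It does not use the Kafka Streams DSL, and it is an aggregation rather than a join; the event shape, `WINDOW_MS` and `MAX_LOAD_KG` are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_MS = 60_000   # assumed window size
MAX_LOAD_KG = 1_000  # assumed truck capacity

def overloaded_trucks(events):
    """events: iterable of (timestamp_ms, truck_id, weight_kg) load events.

    Sums weight per truck per tumbling window and returns the
    (window_start_ms, truck_id) pairs whose total exceeds MAX_LOAD_KG.
    """
    totals = defaultdict(int)  # the "state store": (window_start, truck) -> kg
    for ts, truck, kg in events:
        window_start = ts - ts % WINDOW_MS
        totals[(window_start, truck)] += kg
    return sorted(key for key, kg in totals.items() if kg > MAX_LOAD_KG)

events = [
    (1_000, "truck-1", 600),
    (2_000, "truck-2", 400),
    (30_000, "truck-1", 500),  # truck-1 now carries 1100 kg in window 0
    (61_000, "truck-2", 700),  # new window, so truck-2 is not overloaded
]
print(overloaded_trucks(events))  # [(0, 'truck-1')]
```

The dictionary plays the role of a state store; in a real topology that state is what makes debugging and performance tuning harder.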
7. Apache Kafka Connect
Telephone
Switchboard
Operators
Connecting Calls
Wikimedia Public Domain
Apache Kafka Connect
• What?
• Kafka API for streaming from source to sink systems
• Via Kafka cluster
• Superpowers
• Heterogeneous integration
• Code-free – just connector configuration
• Independently scalable
§ connectors run on independent Kafka Connect cluster
• My Use Cases
Zero-code Data Pipelines
REST Tidal Data to PostgreSQL + Superset
REST Tidal Data to OpenSearch
OpenSearch sink
connector
Apache Kafka Connect
• Watch Out For
• Open-source connector evaluation and selection
• Error handling
• Source/sink system scalability
• What’s New?
• Debezium
8. Kafka MirrorMaker 2 (MM2)
Head of Kafka
Replicated tiers
move
Shutterstock
Kafka MirrorMaker 2
• What?
• Replicates Kafka topics between clusters
• Superpowers
• Uses Kafka Connect (but reads/writes from/to Kafka clusters)
• Topic renaming, prevents loops
• Complex bi-directional topologies
• Many use cases for multiple Kafka clusters:
§ Cluster migration
§ Geographical distribution
§ Low latency, redundancy
§ Fan-out architectures
§ Edge computing, etc
Kafka MirrorMaker 2
• Watch Out For
• Bi-directional flow requires TWO Kafka Connect Clusters
• Duplicate events (from overlapping topic subscriptions)
• Use topic renaming and the default source cluster alias to
§ Prevent cycles and infinite topic creation
• What’s New?
• For me, automated consumer offset sync across clusters
§ In 2.7.0 (2020)!
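How topic renaming prevents loops can be sketched in a few lines of Python. By default MirrorMaker 2 names a replicated topic by prefixing it with the source cluster alias (for example `east.orders`). The helper functions below are hypothetical and only model that idea; they are not the MM2 implementation.

```python
SEPARATOR = "."

def remote_topic(source_alias, topic):
    """Name the replicated topic gets on the target cluster."""
    return source_alias + SEPARATOR + topic

def origin_aliases(topic, known_aliases):
    """Cluster aliases a topic name has already passed through, newest first."""
    aliases = []
    while SEPARATOR in topic:
        prefix, rest = topic.split(SEPARATOR, 1)
        if prefix not in known_aliases:
            break
        aliases.append(prefix)
        topic = rest
    return aliases

def should_replicate(topic, target_alias, known_aliases):
    """Skip topics that originated from (or passed through) the target cluster."""
    return target_alias not in origin_aliases(topic, known_aliases)

clusters = {"east", "west"}
print(remote_topic("east", "orders"))                     # east.orders
print(should_replicate("orders", "west", clusters))       # True
print(should_replicate("west.orders", "west", clusters))  # False: would loop back
```

Without the prefix, a bi-directional flow would copy `orders` back and forth forever; with it, the origin of every topic is visible in its name.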
9. Apache Camel
Camel Train In
Broome, WA
(Adobe Stock by scottimage)
Apache Camel - Kafka Connectors
• What?
• Apache Camel – Integration framework
• Apache Camel Kafka Connectors – open source Kafka connectors
• Superpowers
• Large number of open source Kafka Connectors – 172 (officially), 179 sources and sinks
• Auto-generated from Camel components
Apache Camel - Kafka Connectors
• Watch Out For
• Configuration!
§ Need to read (1) Camel component, (2) basic connector configuration, and (3) connector-specific documentation
• Some connectors are both sources and sinks (source or sink depends on configuration)
• What’s New?
• Kamelets!
§ Can appear in the configuration
10. Kafka Parallel Consumer
Jacquard Loom,
Berlin
(Paul Brebner)
Kafka Parallel Consumer
• What?
• Multi-threaded Kafka Consumer
• Superpowers
• Multi-threaded, cf. the default single-threaded consumer
• Higher concurrency with fewer consumers and partitions
• Use Cases
• Low latency, High Throughput
• Slow consumers
• Replacement for my multiple pool consumer hack
Kafka Parallel Consumer
• Watch Out For
• Configure for
§ Ordering mode
• Partition → Key → Unordered (increasing concurrency)
§ Max threads
• What’s New?
• Choice of commit modes
§ Consumer Asynchronous, Synchronous and Producer Transactions
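Why concurrency increases from Partition to Key to Unordered ordering can be shown with a small calculation. This is a simplified model in plain Python, not the Parallel Consumer library: `max_concurrency` is a hypothetical helper that counts how many records could be in flight at once if ordering must be preserved per partition, per key, or not at all.

```python
def max_concurrency(records, ordering, max_threads):
    """Upper bound on records that can be processed at the same time.

    records: list of (partition, key) pairs currently available.
    ordering: "PARTITION", "KEY" or "UNORDERED".
    """
    if ordering == "PARTITION":
        units = len({partition for partition, _ in records})
    elif ordering == "KEY":
        units = len({(partition, key) for partition, key in records})
    elif ordering == "UNORDERED":
        units = len(records)
    else:
        raise ValueError(f"unknown ordering mode: {ordering}")
    return min(units, max_threads)

batch = [(0, "a"), (0, "b"), (0, "a"), (1, "c"), (1, "c"), (1, "d")]
for mode in ("PARTITION", "KEY", "UNORDERED"):
    print(mode, max_concurrency(batch, mode, max_threads=16))
# PARTITION 2, KEY 4, UNORDERED 6
```

The max threads setting caps all three modes, which is why both settings need tuning together.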
11. Apache ZooKeeper
12. Apache Curator
Being a
ZooKeeper in
Australia can be
risky!
(Shutterstock)
Apache ZooKeeper
• What?
• Distributed systems coordination and meta-data management
• Superpowers
• High consistency, availability and performance (reads)
• Use Cases
• Until recently, used in Kafka, Pulsar, etc
Apache ZooKeeper (and Curator)
Meet the Dining Philosophers
Wikipedia CCL
Apache ZooKeeper
• Watch Out For
• Low-level
• Apache Curator (high level ZK client) is better with
§ Leader Latch
§ Shared Lock
§ Shared Counter
• Scalability limitations
• Slow for writes, max cluster size is 7 servers
• What’s New?
• KRaft – Kafka based RAFT implementation
§ For meta-data management and leader election
§ Faster meta-data operations, more partitions etc. Potentially faster data workloads
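One reason writes are slow is that a ZooKeeper write must be acknowledged by a majority of the ensemble. The arithmetic is sketched below; the function names are just illustrative.

```python
def quorum_size(servers):
    """Smallest majority of an ensemble of the given size."""
    return servers // 2 + 1

def tolerated_failures(servers):
    """Servers that can fail while a majority is still available."""
    return servers - quorum_size(servers)

for n in (3, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
# 3 2 1
# 5 3 2
# 7 4 3
```

Adding servers tolerates more failures but makes every write wait for more acknowledgements.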
13. Kubernetes
Greek Triremes
ruled the seas
Captained by
Helmsmen
(Kubernetes)
(Wikipedia CCL)
Kubernetes
• What?
• Automation of containerized applications
• Superpowers
• Available on public clouds, E.g. AWS EKS
• Ephemeral Pods are the unit of concurrency
• Easy to scale applications (more or less Pods)
• My Use Cases
Anomaly Detection: 19 Million
checks/day
Apache Cassandra
Apache Kafka
Kubernetes
And more
Kubernetes
• Watch Out For
• Pod and resource scaling
§ Easy to create many Pods
• With insufficient or lots of resources
• Tuning the application can be tricky
§ Optimize the number of Pods vs Kafka consumers/partitions, Cassandra database
connections, etc
• What’s New?
• Operators
§ E.g. Strimzi for Kafka
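The Pods versus Kafka consumers/partitions trade-off comes down to simple arithmetic: within one consumer group, each partition is read by at most one consumer, so consumers beyond the partition count sit idle. A small sketch, where `consumer_utilisation` is a hypothetical helper and every Pod is assumed to run the same number of consumers in a single group:

```python
def consumer_utilisation(pods, consumers_per_pod, partitions):
    """Return (active, idle) consumer counts for one consumer group."""
    total = pods * consumers_per_pod
    active = min(total, partitions)
    return active, total - active

print(consumer_utilisation(pods=10, consumers_per_pod=2, partitions=12))  # (12, 8)
print(consumer_utilisation(pods=5, consumers_per_pod=2, partitions=12))   # (10, 0)
```

Scaling to 10 Pods here wastes resources on 8 idle consumers unless the topic also gets more partitions.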
14. Prometheus
15. Grafana
Counting on an
Abacus
(Wikimedia Public Domain)
Prometheus + Grafana
• What?
• Prometheus: Monitoring and Alerting
• Grafana: Graphing
• Superpowers
• Instrumentation or Agents (Exporters) to expose application metrics
• Time series data with counter, gauge, histogram and summary metrics
• My Use Cases
• Monitoring and scaling/optimization/debugging
§ Anomaly Detector (Cassandra, Kafka, Kubernetes) application
§ Kafka Connect data pipelines
• Instaclustr’s Monitoring API has a Prometheus version
Prometheus + Grafana
• Watch Out For
• Need to run a Prometheus server
• Configuring Prometheus with Kubernetes is tricky
§ use Prometheus Operator
• What’s New?
• Since I used it, Grafana has switched to the AGPL license
§ modified code has to be open sourced
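On the counter metric type mentioned earlier: a counter only goes up, except when the instrumented process restarts and it resets to zero. The sketch below shows one simplified way to turn raw counter samples into an increase while tolerating resets. It is plain Python, not the Prometheus client or PromQL, and it ignores the extrapolation a real query engine may apply.

```python
def counter_increase(samples):
    """Total increase across samples, treating any drop as a counter reset."""
    increase = 0
    for previous, current in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            increase += current  # reset: the counter restarted from zero
    return increase

print(counter_increase([10, 15, 22, 3, 8]))  # 5 + 7 + 3 + 5 = 20
```

A gauge, by contrast, can go up and down freely, so its raw value is what gets graphed.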
16. OpenTracing
17. OpenTelemetry
18. Jaeger (and others)
X-Ray Vision!
Public Domain
OpenTracing/OpenTelemetry
• What?
• OpenTracing: End-to-end distributed tracing
• Superpowers
• End-to-end distributed application visibility
§ Traces have Spans
• Visualisation of system topology and times
OpenTracing/OpenTelemetry
• Watch Out For
• Originally used OpenTracing and Jaeger
• Manual instrumentation
• What’s New?
• OpenTelemetry is the new standard
§ Tracing, metrics and logs
§ Automatic instrumentation
§ Lots of open-source visualization tools
• Jaeger, SigNoz, Uptrace, OpenSearch
§ Used in new client monitoring KIP-714, Kafka 3.7.0
SigNoz Service Map for
Toy+Boxes application
19. PostgreSQL
Elephant vs. Tree
Elephants are
Powerful
Adobe
PostgreSQL
• What?
• Powerful SQL Database
• Superpowers
• SQL + Object Database
• Extensible
• JSONB+GIN indexes (efficient storage and search of JSON)
PostgreSQL
• Watch Out For
• Scalability
§ Vertical; limited horizontal
• Benefits from connection pooling
• What’s New?
• PGVector (vector similarity search)
• Significant performance improvement
§ on NetApp Azure Files
• FerretDB (MongoDB front-end)
20. Apache Superset
All superheroes
(B) are a superset
of those who use
weapons (A)
(Shutterstock)
Apache Superset
• What?
• Powerful data visualization tool
• Superpowers
• Reads from SQL sources
• Lots of visualization and graph types including geospatial
• My Use Case
• Visualization of tidal data from a Kafka Connect pipeline
§ Easy integration with PostgreSQL + JSONB
21. OpenSearch
22. Dashboard
Library of Congress
Card Division 1919
(City block long)
(Library of Congress Public
Domain)
OpenSearch + Dashboard
• What?
• Open-source fork of Elasticsearch
• Based on Lucene → powerful + scalable text searching
• Superpowers
• Ingestion, indexing and searching of JSON documents
• Integrated dashboard for visualization
• Computational linguistics support:
§ Stemming, Lemmatization, Levenshtein Fuzzy Queries,
N-grams, Slop, Partial matching!
• My Use Cases
• Sink and visualization for the Kafka Connect tidal data processing pipeline
OpenSearch + Dashboard
• Watch Out For
• Default mappings and ingestion may not work
§ E.g. geospatial data needs custom mappings and ingest pipelines
• Reindexing
• Kafka Connect Sink → OpenSearch throughput
§ Needed the BULK API
§ Needed the BULK API
• What’s New?
• Vector Search
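Two of the computational linguistics features listed earlier, Levenshtein fuzzy matching and N-grams, are easy to show in plain Python. This is a sketch of the underlying ideas only, not how OpenSearch or Lucene implement them; `fuzzy_match` and its `max_edits=2` threshold are assumptions for illustration.

```python
def levenshtein(a, b):
    """Minimum number of single-character inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]

def ngrams(text, n):
    """All contiguous character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def fuzzy_match(term, candidates, max_edits=2):
    """Candidates within max_edits of term (assumed threshold)."""
    return [c for c in candidates if levenshtein(term, c) <= max_edits]

print(levenshtein("tidal", "tidel"))  # 1
print(ngrams("tide", 2))              # ['ti', 'id', 'de']
print(fuzzy_match("tidal", ["tidel", "title", "total", "kafka"]))
# ['tidel', 'total']
```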
23. Redis
Look! Up in the sky!
It’s an in-memory
key-value store!
It’s a database!
It’s Redis!
(Shutterstock)
Redis
• What?
• Fast (in-memory) Data Structures server
• Superpowers
• Lots of data types
§ Keys, Strings, Lists, Hashes, Sets, Sorted sets, bitmaps, geospatial, streams, time series,
HyperLogLogs (approximate counting)
• Pub/Sub
§ Connected and disconnected delivery
• Client-side caching for ultra-low latency – e.g. Redisson client
Redis
• Watch Out For
• Pipeline tuning impacts throughput
• Often used as a cache to reduce load on backend database
§ I.e. efficiency, not necessarily improved latency
• As other factors may dominate
• What’s New?
• Redis Functions
§ Code executed on the server (Redis 7)
• License change (7.4 source-available)
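The "cache to reduce load on backend database" pattern can be sketched with two dictionaries standing in for Redis and the backend database. This is an illustrative cache-aside reader, not a Redis client: the `MetricsReader` class and the key format are hypothetical, and populating the cache on a miss is an assumption about how the cache is filled.

```python
class MetricsReader:
    """Cache-aside reads: try the cache first, fall back to the database on a miss."""

    def __init__(self, cache, database):
        self.cache = cache        # stands in for Redis
        self.database = database  # stands in for the backend database
        self.database_reads = 0

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.database_reads += 1
        value = self.database[key]
        self.cache[key] = value   # populate so the next read is a hit
        return value

reader = MetricsReader(cache={}, database={"node-1:cpu": 0.42})
reader.get("node-1:cpu")  # miss: read from the database, then cached
reader.get("node-1:cpu")  # hit: served from the cache
print(reader.database_reads)  # 1
```

The counter makes the benefit visible: two reads, one database hit. Whether latency also improves depends on what else dominates the request.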
24. Uber’s Cadence
Railway Signal “man”
(Signalwoman!)
(Wikimedia Public Domain)
Uber’s Cadence
• What?
• Scalable code-as-workflows engine
• Superpowers
• Sequenced, stateful, long-running, scheduled steps
• Scalable and reliable using event-sourcing
§ Workflows are failproof, history is replayed until the point of failure and resumed
• My Use Cases
Drone Delivery Application
Kafka Microservices
Integration of fast/slow systems
Uber’s Cadence
• Watch Out For
• Uses Apache Cassandra and OpenSearch backends
• Code must be deterministic (replayed on failure)
§ Use special functions for non-deterministic functions
• What’s New?
• Potential use cases
§ Scalable push notifications (Uber)
§ ML workflows
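Why workflow code must be deterministic, and how non-deterministic steps can still be used, is sketched below in plain Python. `WorkflowContext` and `side_effect` are hypothetical stand-ins written for this example, not the Cadence client API. The idea is that results of non-deterministic calls are recorded in the history the first time, and a replay reads them back instead of calling the function again.

```python
import random

class WorkflowContext:
    """Records results of non-deterministic steps so a replay returns the same values."""

    def __init__(self, history=None):
        self.history = list(history or [])  # recorded results, in call order
        self.position = 0

    def side_effect(self, fn):
        if self.position < len(self.history):
            result = self.history[self.position]  # replaying: reuse recorded result
        else:
            result = fn()                          # first execution: run and record
            self.history.append(result)
        self.position += 1
        return result

def delivery_workflow(ctx):
    """Deterministic apart from the values obtained through ctx.side_effect."""
    drone = ctx.side_effect(lambda: random.choice(["drone-1", "drone-2", "drone-3"]))
    eta = ctx.side_effect(lambda: random.randint(5, 30))
    return f"{drone} arrives in {eta} min"

first = WorkflowContext()
result = delivery_workflow(first)

# After a simulated crash, a new worker replays the recorded history.
replayed = WorkflowContext(history=first.history)
assert delivery_workflow(replayed) == result
```

Calling `random` directly inside `delivery_workflow` would make the replay diverge from the original run, which is the failure mode the determinism rule guards against.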
25. Debezium
Animal speed transformation (Shutterstock)
Debezium
• What?
• Change Data Capture (CDC)
• Superpowers
• Captures slow database state changes
• Turns them into fast Kafka events
• Uses Kafka: Kafka Connect, and/or DB-specific “Connectors”
• Can be used to replicate databases (same type), or send events to different sink
systems
• My Use Cases
• Debezium Cassandra Connector (doesn’t use Kafka Connect, writes to Kafka directly)
• Debezium PostgreSQL Connector (Kafka source connector)
Debezium
• Watch Out For
• The DB specific connectors need to be configured/run in the DB
• Debezium change data format is complex
§ Actual content depends on the source DB
• Schemas may be inline or just an ID
• May include schema changes
• Tricky to find Kafka Connect sink connectors that work correctly
• Duplicates and ordering issues, latency and scalability challenges
• Schema IDs require a Kafka Schema Registry
• What’s New?
• GA on Instaclustr’s managed Cassandra (Dec 2023)
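To show what consuming the change data format involves, here is a sketch that applies change events to an in-memory replica. It assumes the `before`/`after`/`op` envelope used by Debezium's relational connectors, already unwrapped from any schema wrapper, with `op` values `c` (create), `u` (update), `d` (delete) and `r` (snapshot read). The `id` primary key and the `status` column are made up for the example, and, as noted above, actual content depends on the source DB.

```python
def apply_change(replica, event):
    """Apply one change event to a dict keyed by primary key 'id'."""
    op = event["op"]
    if op in ("c", "r", "u"):  # create, snapshot read, update
        row = event["after"]
        replica[row["id"]] = row
    elif op == "d":            # delete: only 'before' is populated
        replica.pop(event["before"]["id"], None)
    else:
        raise ValueError(f"unsupported op: {op}")

events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"},
     "after": {"id": 1, "status": "shipped"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]
replica = {}
for event in events:
    apply_change(replica, event)
print(replica)  # {1: {'id': 1, 'status': 'shipped'}}
```

Applying events keyed by primary key makes duplicates harmless, but out-of-order delivery would still corrupt the replica, which is why the ordering issues above matter.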
26. Karapace
Karapace in the
driver's seat!
(Shutterstock)
Karapace
• What?
• Open-source Kafka Schema Registry
• Superpowers
• Adds schemas to schemaless Kafka
• Supports multiple schema formats
§ Avro, Protobuf and JSON Schemas
• Kafka cluster is not directly involved
§ Karapace enforces schema checks for clients only
• Use Cases
• Debezium
Karapace
• Watch Out For
• Auto vs. manual schema registration – manual is safer in production
• Schema compatibility, compatibility modes, and evolution: complex!
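A taste of why schema evolution is complex: the sketch below checks one compatibility mode, backward compatibility, in a deliberately simplified form. It is plain Python, not the Karapace API or a full Avro resolver; it ignores type promotion, unions and aliases, and the `backward_compatible` helper and field layout are assumptions for illustration.

```python
def backward_compatible(old_fields, new_fields):
    """Can a reader using the new schema decode data written with the old one?

    Each argument maps field name -> {"type": ..., optionally "default": ...}.
    Simplified: types must match exactly and added fields need a default.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            if old_fields[name]["type"] != spec["type"]:
                return False
        elif "default" not in spec:
            return False
    return True

v1 = {"id": {"type": "long"}, "status": {"type": "string"}}
v2 = {**v1, "priority": {"type": "int", "default": 0}}
v3 = {**v1, "priority": {"type": "int"}}
print(backward_compatible(v1, v2))  # True
print(backward_compatible(v1, v3))  # False
```

Forward and full compatibility reverse or combine this check, and transitive modes apply it against every earlier version, not just the latest.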
27. FerretDB
Fish/Shark?
(Adobe)
FerretDB
• What?
• Open-source MongoDB proxy for PostgreSQL
• Superpowers
• Compatible with MongoDB drivers on the front-end
• Pluggable backends including PostgreSQL (using JSONB/GIN indexes)
• Query Pushdown for efficiency/performance
28. RisingWave
Wave processing
(Adobe)
RisingWave
• What?
• Stream processing database – also as a Service
• Superpowers
• Stateful stream processing
§ Using Cloud Native Storage
§ Potential replacement for Kafka Streams
• PostgreSQL compatible
§ Works with Apache Superset
• My Use Cases
Santa’s Elves Toy + Box Packing
Streaming joins to match toys and boxes (Adobe)
Service Map using OpenTelemetry + SigNoz
RisingWave
• Watch Out For
• SQL != Kafka Streams DSL
• Kafka keys not propagated
• Windowing has different semantics
29. TensorFlow
What does the
future hold?
(Adobe)
TensorFlow
• What?
• Neural network ML library
• Superpowers
• Supports incremental ML
• From streaming Kafka data
• My Use Cases
ML Over Streaming Kafka
Data – With Concept Drift
Kafka Streams
TensorFlow
• Watch Out For
• ML over streaming spatiotemporal data with concept drifts is tricky
§ Time/space bias
• Wild model accuracy oscillation
§ Concept shift can result in very low-accuracy models initially
• Train/use Multiple Models
[Chart: Concept Drift - incremental training (time vs accuracy). Series: same model, reset model, guessing]
30. Yours Here
Invent your own
(DeepAI)
Integration Example 1
Our Customer Facing Monitoring
Before:
Spark and API
requests
→ High load on
Cassandra
Integration Example 1
Our Customer Facing Monitoring
After:
Kafka + Kafka
Streams + Redis
Reduced
Cassandra Load
Recent metrics
served from Redis,
or Cassandra on
cache miss
PostgreSQL
1 – get meta-data
2 – get data from Redis
3 – or from Cassandra
20k Nodes
Thanks to my colleague
Kuangda He
for this information
Integration Example 2
Drone Delivery Demo
Kafka
Streams
Customers
Order
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Demo/POC
Integration Example 2
Drone Delivery Prod?
Kafka
Streams
Customers
Order
PostgreSQL
Drone Operations
Order Tracking
Shops
Busy warnings
Uses Cassandra+OpenSearch
ML over streaming data
Drone/order locations cached in Redis
Read-through or write-behind
Kafka sink
connectors
www.instaclustr.com
info@instaclustr.com
@instaclustr
THANK
YOU!
© Instaclustr Pty Limited, 2024

More Related Content

PDF
Architecting Applications With Multiple Open Source Big Data Technologies
PDF
Superpower Your Apache Kafka Applications Development with Complementary Open...
PDF
Insta clustr seattle kafka meetup presentation bb
PDF
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
PPTX
Architecting an Open Source AI Platform 2018 edition
PDF
Hadoop Technologies
PDF
Lessons Learned: Using Spark and Microservices
PPTX
Kafka vs Spark vs Impala in bigdata .pptx
Architecting Applications With Multiple Open Source Big Data Technologies
Superpower Your Apache Kafka Applications Development with Complementary Open...
Insta clustr seattle kafka meetup presentation bb
A Big Data Lake Based on Spark for BBVA Bank-(Oscar Mendez, STRATIO)
Architecting an Open Source AI Platform 2018 edition
Hadoop Technologies
Lessons Learned: Using Spark and Microservices
Kafka vs Spark vs Impala in bigdata .pptx

Similar to 30 Of My Favourite Open Source Technologies In 30 Minutes (20)

PPTX
Apache Spark in Industry
PDF
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
PPTX
Introduction to AWS Big Data
PPTX
Real time Analytics with Apache Kafka and Apache Spark
PDF
NetflixOSS Open House Lightning talks
PDF
Emerging trends in data analytics
PDF
Started with-apache-spark
PDF
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
PDF
20081022cca
PPTX
Big Data Retrospective - STL Big Data IDEA Jan 2019
PPTX
PPTX
AWS Big Data Demystified #1: Big data architecture lessons learned
PDF
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
PDF
Webinar - Big Data: Let's SMACK - Jorg Schad
PDF
Kafka spark cassandra webinar feb 16 2016
PDF
Kafka spark cassandra webinar feb 16 2016
PDF
Reference architecture for Internet Of Things
DOCX
Big Data A La Carte Menu
PDF
Getting insights from IoT data with Apache Spark and Apache Bahir
PDF
Reference architecture for Internet of Things
Apache Spark in Industry
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...
Introduction to AWS Big Data
Real time Analytics with Apache Kafka and Apache Spark
NetflixOSS Open House Lightning talks
Emerging trends in data analytics
Started with-apache-spark
The Impact of Hardware and Software Version Changes on Apache Kafka Performan...
20081022cca
Big Data Retrospective - STL Big Data IDEA Jan 2019
AWS Big Data Demystified #1: Big data architecture lessons learned
IMCSummit 2015 - Day 2 Developer Track - A Reference Architecture for the Int...
Webinar - Big Data: Let's SMACK - Jorg Schad
Kafka spark cassandra webinar feb 16 2016
Kafka spark cassandra webinar feb 16 2016
Reference architecture for Internet Of Things
Big Data A La Carte Menu
Getting insights from IoT data with Apache Spark and Apache Bahir
Reference architecture for Internet of Things
Ad

More from Paul Brebner (20)

PPTX
Streaming More For Less With Apache Kafka Tiered Storage
PDF
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
PDF
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
PDF
Spinning your Drones with Cadence Workflows and Apache Kafka
PDF
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
PDF
Scaling Open Source Big Data Cloud Applications is Easy/Hard
PDF
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
PDF
A Visual Introduction to Apache Kafka
PDF
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
PDF
Grid Middleware – Principles, Practice and Potential
PDF
Grid middleware is easy to install, configure, secure, debug and manage acros...
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
PPTX
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
PPTX
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
PPTX
0b101000 years of computing: a personal timeline - decade "0", the 1980's
PDF
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
PPTX
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
PDF
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
PPTX
How to Improve the Observability of Apache Cassandra and Kafka applications...
Streaming More For Less With Apache Kafka Tiered Storage
Why Apache Kafka Clusters Are Like Galaxies (And Other Cosmic Kafka Quandarie...
Apache ZooKeeper and Apache Curator: Meet the Dining Philosophers
Spinning your Drones with Cadence Workflows and Apache Kafka
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...
Scaling Open Source Big Data Cloud Applications is Easy/Hard
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/Hard
A Visual Introduction to Apache Kafka
Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka a...
Grid Middleware – Principles, Practice and Potential
Grid middleware is easy to install, configure, secure, debug and manage acros...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and...
0b101000 years of computing: a personal timeline - decade "0", the 1980's
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...
ApacheCon2019 Talk: Kafka, Cassandra and Kubernetes at Scale – Real-time Ano...
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...
How to Improve the Observability of Apache Cassandra and Kafka applications...
Ad

Recently uploaded (20)

PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
Designing Intelligence for the Shop Floor.pdf
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PPTX
Reimagine Home Health with the Power of Agentic AI​
PDF
How to Migrate SBCGlobal Email to Yahoo Easily
PPTX
CHAPTER 2 - PM Management and IT Context
PDF
System and Network Administration Chapter 2
PDF
PTS Company Brochure 2025 (1).pdf.......
PPTX
Transform Your Business with a Software ERP System
PPTX
Computer Software and OS of computer science of grade 11.pptx
PPTX
L1 - Introduction to python Backend.pptx
PDF
System and Network Administraation Chapter 3
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PDF
Design an Analysis of Algorithms I-SECS-1021-03
PPTX
assetexplorer- product-overview - presentation
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PPTX
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
PDF
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
PDF
Digital Strategies for Manufacturing Companies
PDF
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf
2025 Textile ERP Trends: SAP, Odoo & Oracle
Designing Intelligence for the Shop Floor.pdf
Navsoft: AI-Powered Business Solutions & Custom Software Development
Reimagine Home Health with the Power of Agentic AI​
How to Migrate SBCGlobal Email to Yahoo Easily
CHAPTER 2 - PM Management and IT Context
System and Network Administration Chapter 2
PTS Company Brochure 2025 (1).pdf.......
Transform Your Business with a Software ERP System
Computer Software and OS of computer science of grade 11.pptx
L1 - Introduction to python Backend.pptx
System and Network Administraation Chapter 3
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
Design an Analysis of Algorithms I-SECS-1021-03
assetexplorer- product-overview - presentation
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
Oracle E-Business Suite: A Comprehensive Guide for Modern Enterprises
Adobe Premiere Pro 2025 (v24.5.0.057) Crack free
Digital Strategies for Manufacturing Companies
Raksha Bandhan Grocery Pricing Trends in India 2025.pdf

30 Of My Favourite Open Source Technologies In 30 Minutes

  • 1. © Instaclustr Pty Limited, 2024 30 Of My Favourite Open Source Technologies Paul Brebner Open Source Technology Evangelist
  • 2. © Instaclustr Pty Limited, 2024 30 Of My Favourite Open Source Technologies In 30 Minutes Paul Brebner Open Source Technology Evangelist Paul Brebner (Netherlands 30 minutes bike parking zone)
  • 3. © Instaclustr Pty Limited, 2024
  • 4. © Instaclustr Pty Limited, 2024 What do they have in Common? • Instaclustr provides some as managed services • They are complementary and can be used together • And I’ve used them to build realistic demo applications over the last 7 years
  • 5. © Instaclustr Pty Limited, 2024 A Strange Toy I Found At The Shop • What’s that?! • An escaped “Pokemon”! • When my kids were growing up Pokemon lived inside a “Game Boy”
  • 6. © Instaclustr Pty Limited, 2024 Format • Name, Overview, Superpower(s), Watch out for … • E.g. “Pokemon” • Name: Charmander • What: A fire Lizard • Superpower: Evolves to Charizard, a flying fire breathing lizard • Watch out for: Water + Use Cases and What’s New?
  • 7. © Instaclustr Pty Limited, 2024 Countdown! Flicker CCL + Wikimedia CCL
  • 8. © Instaclustr Pty Limited, 2024 1. Apache Cassandra Office Typing Pool, 1918 Wikipedia Public Domain
  • 9. © Instaclustr Pty Limited, 2024 Apache Cassandra • What? • NoSQL Horizontally Scalable Key-Value Database • Superpowers • Fast Writes (lots of typewriters) • Wide Column Store • Clustering Columns, good for hierarchical data modelling (E.g. Geospatial) • In-built multi-DC replication • My Use Cases
  • 10. © Instaclustr Pty Limited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 11. © Instaclustr Pty Limited, 2024 Global low-latency Fintech
  • 12. © Instaclustr Pty Limited, 2024 Apache Cassandra • Watch Our For • CQL != SQL • Different data model § Design for reads § De-normalization is normal • Consistency < traditional SQL databases • Reads are slower • What’s New? • Vector Search in 5.0
  • 13. © Instaclustr Pty Limited, 2024 2. Apache Spark Car Factory Assembly Line Wikimedia Public Domain
  • 14. © Instaclustr Pty Limited, 2024 Apache Spark • What? • Cluster batch/stream processing, analytics and ML • Superpowers • In-memory à fast • Good support for ML § + Cassandra (wide columns) as a feature store • Good for heavy transformation operations at scale • My Use Cases
  • 15. © Instaclustr Pty Limited, 2024 ML of Cassandra Monitoring Data Apache Spark Apache Cassandra MLlib DataFrames Spark Streaming
  • 16. © Instaclustr Pty Limited, 2024 Apache Spark • Watch Our For • Lots of RAM, else OOM (Out-of-Memory Errors) • Spark Streaming is near real-time (micro-batch) • What’s New? • 3.4 has Spark Connect for decoupled client-servers • Ocean for Apache Spark (Spot by NetApp)
  • 17. © Instaclustr Pty Limited, 2024 3. Apache Zeppelin Graf Zeppelin exploring the Arctic, 1931 Wikimedia Public Domain
  • 18. © Instaclustr Pty Limited, 2024 Apache Zeppelin • What? • Web-based notebook for data exploration • Superpowers • Interactive “notebook” style tool • Supports Apache Spark
  • 19. © Instaclustr Pty Limited, 2024 Apache Zeppelin • Watch Our For • Sufficient Zeppelin resources • We don’t support it anymore • What’s New? • Jupyter Notebook! § Good Kafka and Cassandra integration The Galilean moons of Jupiter (Wikimedia CCL)
  • 20. © Instaclustr Pty Limited, 2024 4. Apache Lucene A Librarian using a card catalogue (1940) Library of Congress Public Domain
  • 21. © Instaclustr Pty Limited, 2024 Apache Lucene • What? • Fast Full-featured Search Engine • Superpowers • Lucene plugin + Cassandra for enhanced Cassandra search § Works as a Cassandra secondary index § Supports Vector Search too • Watch Out For • Performance • We currently support it: https://github.com/instaclustr/cassandra-lucene-index • My Use Cases
  • 22. © Instaclustr Pty Limited, 2024 Geospatial Anomaly Detection Apache Cassandra Apache Lucene Plugin Geospatial searches Wikimedia Public Domain
  • 23. © Instaclustr Pty Limited, 2024 5. Apache Kafka Postal Delivery Service Railway Post Office: Mail bags snatched by speeding train Wikimedia CCL
  • 24. © Instaclustr Pty Limited, 2024 Apache Kafka • What? • Distributed publish-subscribe messaging system • Superpowers • Fast • Highly distributed and horizontally scalable, available and durable • Buffering and message replay • My Use Cases
  • 25. © Instaclustr Pty Limited, 2024 Xmas Tree Lights Simulation
  • 26. © Instaclustr Pty Limited, 2024 “Kongo” IoT Logistics Simulation Apache Kafka Guava Event Bus Real-time logistics Tracking and checking
  • 27. © Instaclustr Pty Limited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 28. © Instaclustr Pty Limited, 2024 Apache Kafka • Watch Out For • Too many topics/partitions reduces throughput • What’s New? • KRaft (replacing ZooKeeper) for faster meta-data operations § And maybe even faster data workloads • Tiered Storage (3.6) • End-to-end client monitoring (3.7)
  • 29. © Instaclustr Pty Limited, 2024 6. Apache Kafka Streams Niagara Falls Daredevil Shutterstock
  • 30. © Instaclustr Pty Limited, 2024 Apache Kafka Streams • What? • Stream processing API and client for Kafka • From/to Kafka cluster • Superpowers • Complex stateful stream processing operations (e.g. joins) • Over time windows and multiple topics and state stores • My Use Cases
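A stateful join over a time window can be sketched in plain Python as below. This is a simplified illustration, not the Kafka Streams API: the function name, the event tuples and the sample data are hypothetical, and it ignores state stores, late events and partitioning. It pairs events from two streams that share a key and whose timestamps differ by at most `max_diff`.

```python
from collections import defaultdict

def join_within(left, right, max_diff):
    """Join (key, timestamp, value) events with equal keys and close timestamps."""
    right_by_key = defaultdict(list)
    for key, ts, value in right:
        right_by_key[key].append((ts, value))

    joined = []
    for key, ts, value in left:
        for right_ts, right_value in right_by_key[key]:
            if abs(ts - right_ts) <= max_diff:
                joined.append((key, value, right_value))
    return joined

# Hypothetical events: goods assigned to trucks, and truck status updates.
goods = [("truck-1", 100, "goods-A"), ("truck-2", 105, "goods-B")]
status = [("truck-1", 103, "loaded"), ("truck-2", 200, "loaded")]
matches = join_within(goods, status, max_diff=10)
```

Only the `truck-1` pair falls inside the window; the `truck-2` events are 95 time units apart and are not joined.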
  • 31. © Instaclustr Pty Limited, 2024 Kafka Streams IoT Application Truck Overload
  • 32. © Instaclustr Pty Limited, 2024 Apache Kafka Streams • Watch Out For • Complex stream topologies • Debugging is tricky • Performance • What’s New? • Alternatives (E.g. Apache Flink, RisingWave, etc)
  • 33. © Instaclustr Pty Limited, 2024 7. Apache Kafka Connect Telephone Switchboard Operators Connecting Calls Wikimedia Public Domain
  • 34. © Instaclustr Pty Limited, 2024 Apache Kafka Connect • What? • Kafka API for streaming from source to sink systems • Via Kafka cluster • Superpowers • Heterogeneous integration • Code-free – just connector configuration • Independently scalable § connectors run on independent Kafka Connect cluster • My Use Cases
  • 35. © Instaclustr Pty Limited, 2024 Zero-code Data Pipelines REST Tidal Data to PostgreSQL + Superset REST Tidal Data to OpenSearch OpenSearch sink connector
  • 36. © Instaclustr Pty Limited, 2024 Apache Kafka Connect • Watch Out For • Open-source connector evaluation and selection • Error handling • Source/sink system scalability • What’s New? • Debezium
  • 37. © Instaclustr Pty Limited, 2024 8. Kafka MirrorMaker 2 (MM2) Head of Kafka Replicated tiers move Shutterstock
  • 38. © Instaclustr Pty Limited, 2024 Kafka MirrorMaker 2 • What? • Replicates Kafka topics between clusters • Superpowers • Uses Kafka Connect (but reads/writes from/to Kafka clusters) • Topic renaming, prevents loops • Complex bi-directional topologies • Many use cases for multiple Kafka clusters: § Cluster migration § Geographical distribution § Low latency, redundancy § Fan-out architectures § Edge computing, etc
  • 39. © Instaclustr Pty Limited, 2024 Kafka MirrorMaker 2 • Watch Out For • Bi-directional flow requires TWO Kafka Connect Clusters • Duplicate events (from overlapping topic subscriptions) • Use topic renaming and the default source cluster alias to § Prevent cycles and infinite topic creation • What’s New? • For me, automated consumer offset sync across clusters § In 2.7.0 (2020)!
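How topic renaming prevents cycles can be sketched as follows. This is a simplified model in plain Python, not MirrorMaker 2 code: the function names and cluster aliases are hypothetical, it only checks the most recent hop, and it assumes a leading dotted segment is a cluster alias only when replication added it.

```python
SEPARATOR = "."

def remote_topic_name(source_alias, topic):
    """Name of the replicated topic on the target cluster."""
    return f"{source_alias}{SEPARATOR}{topic}"

def should_replicate(topic, target_alias):
    """Skip topics that already came from the target cluster (a cycle)."""
    return not topic.startswith(target_alias + SEPARATOR)

# Bi-directional flow between two hypothetical clusters, "us" and "eu".
on_eu = remote_topic_name("us", "orders")   # "orders" copied from us to eu
copy_back = should_replicate(on_eu, "us")   # would eu send it back to us?
```

Without the prefix, `orders` would bounce between the clusters indefinitely; with it, the copy on `eu` is recognisable as having come from `us` and is not sent back.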
  • 40. © Instaclustr Pty Limited, 2024 9. Apache Camel Camel Train In Broome, WA (Adobe Stock by scottimage)
  • 41. © Instaclustr Pty Limited, 2024 Apache Camel - Kafka Connectors • What? • Apache Camel – Integration framework • Apache Camel Kafka Connectors – open source Kafka connectors • Superpowers • Large number of open source Kafka Connectors – 172 (officially), 179 sources and sinks • Auto-generated from Camel components
  • 42. © Instaclustr Pty Limited, 2024 Apache Camel - Kafka Connectors • Watch Out For • Configuration! § Need to read (1) Camel component, (2) basic connector configuration, and (3) connector-specific documentation • Some connectors are both sources and sinks (source or sink depends on configuration) • What’s New? • Kamelets! § Can appear in the configuration
  • 43. © Instaclustr Pty Limited, 2024 10. Kafka Parallel Consumer Jacquard Loom, Berlin (Paul Brebner)
  • 44. © Instaclustr Pty Limited, 2024 Kafka Parallel Consumer • What? • Multi-threaded Kafka Consumer • Superpowers • Multi-threaded (cf. the default single-threaded consumer) • Higher concurrency with fewer consumers and partitions • Use Cases • Low latency, High Throughput • Slow consumers • Replacement for my multiple pool consumer hack
  • 45. © Instaclustr Pty Limited, 2024 Kafka Parallel Consumer • Watch Out For • Configure for § Ordering mode • Partition → Key → Unordered (Increasing concurrency) § Max threads • What’s New? • Choice of commit modes § Consumer Asynchronous, Synchronous and Producer Transactions
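Why concurrency increases from Partition to Key to Unordered ordering can be shown with a small sketch. It is plain Python rather than the Parallel Consumer library; the function name, mode strings and sample records are hypothetical. Records in the same "lane" must be processed in order, while separate lanes may run on different threads.

```python
from collections import defaultdict

def ordering_lanes(records, mode):
    """Group record indexes into lanes; each lane is processed sequentially."""
    lanes = defaultdict(list)
    for index, (partition, key) in enumerate(records):
        if mode == "partition":
            lane = partition            # order preserved per partition
        elif mode == "key":
            lane = (partition, key)     # order preserved per key
        elif mode == "unordered":
            lane = index                # no ordering guarantee
        else:
            raise ValueError(f"unknown mode: {mode}")
        lanes[lane].append(index)
    return dict(lanes)

# Four hypothetical records as (partition, key) pairs.
records = [(0, "a"), (0, "b"), (0, "a"), (1, "c")]
lane_counts = {m: len(ordering_lanes(records, m)) for m in ("partition", "key", "unordered")}
```

The number of lanes is an upper bound on useful parallelism, which is then capped by the configured maximum number of threads.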
  • 46. © Instaclustr Pty Limited, 2024 11. Apache ZooKeeper 12. Apache Curator Being a ZooKeeper in Australia can be risky! (Shutterstock)
  • 47. © Instaclustr Pty Limited, 2024 Apache ZooKeeper • What? • Distributed systems coordination and meta-data management • Superpowers • High consistency, availability and performance (reads) • Use Cases • Until recently, used in Kafka, Pulsar, etc
  • 48. © Instaclustr Pty Limited, 2024 Apache ZooKeeper (and Curator) Meet the Dining Philosophers Wikipedia CCL
  • 49. © Instaclustr Pty Limited, 2024 Apache ZooKeeper • Watch Out For • Low-level • Apache Curator (high level ZK client) is better with § Leader Latch § Shared Lock § Shared Counter • Scalability limitations • Slow for writes, max cluster size is 7 servers • What’s New? • KRaft – Kafka based RAFT implementation § For meta-data management and leader election § Faster meta-data operations, more partitions etc. Potentially faster data workloads
  • 50. © Instaclustr Pty Limited, 2024 13. Kubernetes Greek Triremes ruled the seas Captained by Helmsmen (Kubernetes) (Wikipedia CCL)
  • 51. © Instaclustr Pty Limited, 2024 Kubernetes • What? • Automation of containerized applications • Superpowers • Available on public clouds, E.g. AWS EKS • Ephemeral Pods are the unit of concurrency • Easy to scale applications (more or fewer Pods) • My Use Cases
  • 52. © Instaclustr Pty Limited, 2024 Anomaly Detection: 19 Million checks/day Apache Cassandra Apache Kafka Kubernetes And more
  • 53. © Instaclustr Pty Limited, 2024 Kubernetes • Watch Out For • Pod and resource scaling § Easy to create many Pods • With insufficient or excessive resources • Tuning the application can be tricky § Optimize the number of Pods vs Kafka consumers/partitions, Cassandra database connections, etc • What’s New? • Operators § E.g. Strimzi for Kafka
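The Pods versus consumers/partitions trade-off comes from the fact that, within one consumer group, a Kafka partition is assigned to at most one consumer, so consumers beyond the partition count sit idle. A minimal sketch of that arithmetic, with a hypothetical function name and example numbers:

```python
def consumer_utilisation(pods, consumers_per_pod, partitions):
    """Count active and idle consumers for one consumer group on one topic."""
    consumers = pods * consumers_per_pod
    active = min(consumers, partitions)  # at most one consumer per partition
    return {"consumers": consumers, "active": active, "idle": consumers - active}

# Hypothetical deployment: 10 Pods, 2 consumers each, topic with 12 partitions.
result = consumer_utilisation(pods=10, consumers_per_pod=2, partitions=12)
```

Scaling Pods past this point adds resource cost without adding Kafka read throughput, so either the partition count or the Pod count needs adjusting.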
  • 54. © Instaclustr Pty Limited, 2024 14. Prometheus 15. Grafana Counting on an Abacus (Wikimedia Public Domain)
  • 55. © Instaclustr Pty Limited, 2024 Prometheus + Grafana • What? • Prometheus: Monitoring and Alerting • Grafana: Graphing • Superpowers • Instrumentation or Agents (Exporters) to expose application metrics • Time series data with counter, gauge, histogram and summary metrics • My Use Cases • Monitoring and scaling/optimization/debugging § Anomaly Detector (Cassandra, Kafka, Kubernetes) application § Kafka Connect data pipelines • Instaclustr’s Monitoring API has a Prometheus version
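The difference between the counter and gauge metric types can be sketched in plain Python. These classes are illustrative stand-ins, not the Prometheus client library, and the `per_second_rate` helper is a hypothetical simplification of what a query over two counter samples computes.

```python
class Counter:
    """Only ever increases, e.g. total anomaly checks performed."""
    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount

class Gauge:
    """Can go up or down, e.g. current consumer lag."""
    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)

def per_second_rate(earlier, later, seconds):
    """Average rate of increase between two counter samples."""
    return (later - earlier) / seconds

checks = Counter()
checks.inc(500)
sample_1 = checks.value
checks.inc(1500)
sample_2 = checks.value
rate = per_second_rate(sample_1, sample_2, seconds=60)

lag = Gauge()
lag.set(42)
lag.set(7)
```

Counters are usually graphed as rates, while gauges are graphed directly.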
  • 56. © Instaclustr Pty Limited, 2024 Prometheus + Grafana • Watch Out For • Need to run a Prometheus server • Configuring Prometheus with Kubernetes is tricky § use Prometheus Operator • What’s New? • Since I first used it, Grafana has become AGPL licensed § modified code has to be open sourced
  • 57. © Instaclustr Pty Limited, 2024 16. OpenTracing 17. OpenTelemetry 18. Jaeger (and others) X-Ray Vision! Public Domain
  • 58. © Instaclustr Pty Limited, 2024 OpenTracing/OpenTelemetry • What? • OpenTracing: End-to-end distributed tracing • Superpowers • End-to-end distributed application visibility § Traces have Spans • Visualisation of system topology and times
  • 59. © Instaclustr Pty Limited, 2024 OpenTracing OpenTelemetry • Watch Out For • Originally used OpenTracing and Jaeger • Manual instrumentation • What’s New? • OpenTelemetry is the new standard § Tracing, metrics and logs § Automatic instrumentation § Lots of open-source visualization tools • Jaeger, SigNoz, Uptrace, OpenSearch § Used in new client monitoring KIP-714, Kafka 3.7.0
  • 60. © Instaclustr Pty Limited, 2024 SigNoz Service Map for Toy+Boxes application
  • 61. © Instaclustr Pty Limited, 2024 19. PostgreSQL Elephant vs. Tree Elephants are Powerful Adobe
  • 62. © Instaclustr Pty Limited, 2024 PostgreSQL • What? • Powerful SQL Database • Superpowers • SQL + Object Database • Extensible • JSONB+GIN indexes (efficient storage and search of JSON)
  • 63. © Instaclustr Pty Limited, 2024 PostgreSQL • Watch Out For • Scalability § Vertical; limited horizontal • Benefits from connection pooling • What’s New? • PGVector (vector similarity search) • Significant performance improvement § on NetApp Azure Files • FerretDB (MongoDB front-end)
  • 64. © Instaclustr Pty Limited, 2024 20. Apache Superset All superheroes (B) are a superset of those who use weapons (A) (Shutterstock)
  • 65. © Instaclustr Pty Limited, 2024 Apache Superset • What? • Powerful data visualization tool • Superpowers • Reads from SQL sources • Lots of visualization and graph types including geospatial • My Use Case • Visualization of tidal data from Kafka connect pipeline § Easy integration with PostgreSQL + JSONB
  • 66. © Instaclustr Pty Limited, 2024 21. OpenSearch 22. Dashboard Library of Congress Card Division 1919 (City block long) (Library of Congress Public Domain)
  • 67. © Instaclustr Pty Limited, 2024 OpenSearch + Dashboard • What? • Open-source version of Elasticsearch • Based on Lucene → powerful + scalable text searching • Superpowers • Ingestion, indexing and searching of JSON documents • Integrated dashboard for visualization • Computational linguistics support: § Stemming, Lemmatization, Levenshtein Fuzzy Queries, N-grams, Slop, Partial matching! • My Use Cases • Sink and visualization for Kafka connect tidal data processing pipeline
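The idea behind Levenshtein fuzzy queries is that a term matches if it is within a small number of single-character edits of the query. A minimal sketch of the classic edit-distance calculation follows; it is plain Python, not OpenSearch code, it does not count transpositions as a single edit, and the `fuzzy_match` helper and its `max_edits` parameter are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def fuzzy_match(query, term, max_edits):
    """True if term is within max_edits edits of the query."""
    return levenshtein(query, term) <= max_edits
```

For example, `fuzzy_match("casandra", "cassandra", 1)` is true because one inserted letter separates the two spellings.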
  • 68. © Instaclustr Pty Limited, 2024 OpenSearch + Dashboard • Watch Out For • Default mappings and ingestion may not work § E.g. geospatial data needs custom mappings and ingest pipelines • Reindexing • Kafka Connect Sink → OpenSearch throughput § Needed the BULK API • What’s New? • Vector Search
  • 69. © Instaclustr Pty Limited, 2024 23. Redis Look! Up in the sky! It’s an in-memory key-value store! It’s a database! It’s Redis! (Shutterstock)
  • 70. © Instaclustr Pty Limited, 2024 Redis • What? • Fast (in-memory) Data Structures server • Superpowers • Lots of data types § Keys, Strings, Lists, Hashes, Sets, Sorted sets, bitmaps, geospatial, streams, time series, HyperLogLogs (approximate counting) • Pub/Sub § Connected and disconnected delivery • Client-side caching for ultra-low latency – e.g. Redisson client
  • 71. © Instaclustr Pty Limited, 2024 Redis • Watch Out For • Pipeline tuning impacts throughput • Often used as a cache to reduce load on backend database § I.e. Efficiency not improved latency • As other factors may dominate • What’s New? • Redis Functions § Code executed on the server (Redis 7) • License change (7.4 source-available)
  • 72. © Instaclustr Pty Limited, 2024 24. Uber’s Cadence Railway Signal “man” (Signalwoman!) (Wikimedia Public Domain)
  • 73. © Instaclustr Pty Limited, 2024 Uber’s Cadence • What? • Scalable code-as-workflows engine • Superpowers • Sequenced, stateful, long-running, scheduled steps • Scalable and reliable using event-sourcing § Workflows are failproof, history is replayed until the point of failure and resumed • My Use Cases
  • 74. © Instaclustr Pty Limited, 2024 Drone Delivery Application Kafka Microservices Integration of fast/slow systems
  • 75. © Instaclustr Pty Limited, 2024 Uber’s Cadence • Watch Out For • Uses Apache Cassandra and OpenSearch backends • Code must be deterministic (replayed on failure) § Use special functions for non-deterministic functions • What’s New? • Potential use cases § Scalable push notifications (Uber) § ML workflows
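Why workflow code must be deterministic, and how "special functions" deal with non-deterministic steps, can be sketched in plain Python. This is a toy model of event-sourced replay, not the Cadence client API: `WorkflowContext`, `side_effect` and `delivery_workflow` are hypothetical names. A non-deterministic result is recorded in the history the first time, and replay reads it back instead of recomputing it.

```python
import random

class WorkflowContext:
    """Holds the recorded history and a cursor used during replay."""
    def __init__(self, history=None):
        self.history = list(history or [])
        self._cursor = 0

    def side_effect(self, fn):
        """Run fn once and record it; on replay return the recorded result."""
        if self._cursor < len(self.history):
            result = self.history[self._cursor]  # replaying: reuse the record
        else:
            result = fn()                        # first execution: record it
            self.history.append(result)
        self._cursor += 1
        return result

def delivery_workflow(ctx):
    # Random choices are wrapped so that replay sees the same values.
    drone_id = ctx.side_effect(lambda: random.randint(1, 1000))
    eta_minutes = ctx.side_effect(lambda: random.randint(5, 30))
    return drone_id, eta_minutes

first_run = WorkflowContext()
original = delivery_workflow(first_run)

# Simulate recovery after a failure: replay from the recorded history.
replay = WorkflowContext(first_run.history)
recovered = delivery_workflow(replay)
```

Calling `random.randint` directly in the workflow would give different values on replay, so the recovered workflow would diverge from its own history.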
  • 76. © Instaclustr Pty Limited, 2024 25. Debezium Animal speed transformation (Shutterstock)
  • 77. © Instaclustr Pty Limited, 2024 Debezium • What? • Change Data Capture (CDC) • Superpowers • Captures slow database state changes • Turns them into fast Kafka events • Uses Kafka: Kafka Connect, and/or DB-specific “Connectors” • Can be used to replicate databases (same type), or send events to different sink systems • My Use Cases • Debezium Cassandra Connector (doesn’t use Kafka Connect, writes to Kafka directly) • Debezium PostgreSQL Connector (Kafka source connector)
  • 78. © Instaclustr Pty Limited, 2024 Debezium • Watch Out For • The DB specific connectors need to be configured/run in the DB • Debezium change data format is complex § Actual content depends on the source DB • Schemas may be inline or just an ID • May include schema changes • Tricky to find Kafka Connect sink connectors that work correctly • Duplicates and ordering issues, latency and scalability challenges • Schema IDs require a Kafka Schema Registry • What’s New? • GA on Instaclustr’s managed Cassandra (Dec 2023)
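To give a feel for why sinks need care with the change data format, here is a minimal sketch of applying change events to a key-value sink. It assumes a simplified envelope with `op`, `before` and `after` fields in the style of Debezium's relational connectors; as the slide notes, the actual content depends on the source DB, so the event shapes, the `apply_change` function and the `id` key field are all assumptions for illustration.

```python
def apply_change(sink, event, key_field="id"):
    """Apply one simplified change event to a dict acting as the sink table."""
    op = event["op"]
    if op in ("c", "r", "u"):   # create, snapshot read, update
        row = event["after"]
        sink[row[key_field]] = row
    elif op == "d":             # delete: only the "before" image is present
        sink.pop(event["before"][key_field], None)
    else:
        raise ValueError(f"unsupported op: {op}")

# Hypothetical change events for one table.
events = [
    {"op": "c", "before": None, "after": {"id": 1, "status": "new"}},
    {"op": "c", "before": None, "after": {"id": 2, "status": "new"}},
    {"op": "u", "before": {"id": 1, "status": "new"}, "after": {"id": 1, "status": "shipped"}},
    {"op": "d", "before": {"id": 2, "status": "new"}, "after": None},
]

sink = {}
for event in events:
    apply_change(sink, event)
```

Writing by key makes the sink idempotent, which helps with duplicate events, but it still relies on events for the same key arriving in order.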
  • 79. © Instaclustr Pty Limited, 2024 26. Karapace Karapace in the driver's seat! (Shutterstock)
  • 80. © Instaclustr Pty Limited, 2024 Karapace • What? • Open-source Kafka Schema Registry • Superpowers • Adds Schemas to Schemaless Kafka • Supports multiple schema formats § Avro, Protobuf and JSON Schemas • Kafka cluster is not directly involved § Karapace enforces schema checks for clients only • Use Cases • Debezium
  • 81. © Instaclustr Pty Limited, 2024 Karapace • Watch Out For • Auto vs. manual schema registration – manual is safer in production • Schema compatibility, compatibility modes, and evolution: complex!
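One of the compatibility modes, backward compatibility, means that consumers using the new schema can still read data written with the old one. A rough sketch of that check for flat record schemas is below. It is plain Python, not Karapace code; the dictionary-based schema representation and the function name are hypothetical, and it ignores type promotion, unions, aliases and nested records.

```python
def is_backward_compatible(old_fields, new_fields):
    """Can a reader with new_fields decode records written with old_fields?

    Each argument maps a field name to a spec such as
    {"type": "string"} or {"type": "string", "default": "unknown"}.
    """
    for name, spec in new_fields.items():
        if name in old_fields:
            if old_fields[name]["type"] != spec["type"]:
                return False  # type changed: old data cannot be decoded
        elif "default" not in spec:
            return False      # new field with no default: old data lacks it
    return True               # fields dropped by the reader are simply ignored

old = {"id": {"type": "long"}, "note": {"type": "string"}}
with_default = {"id": {"type": "long"}, "region": {"type": "string", "default": "unknown"}}
without_default = {"id": {"type": "long"}, "region": {"type": "string"}}
```

Here `with_default` passes and `without_default` fails, which is the kind of difference that makes manual registration the safer choice in production.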
  • 82. © Instaclustr Pty Limited, 2024 27. FerretDB Fish/Shark? (Adobe)
  • 83. © Instaclustr Pty Limited, 2024 FerretDB • What? • Open-source MongoDB proxy for PostgreSQL • Superpowers • Compatible with MongoDB drivers on the front-end • Pluggable backends including PostgreSQL (using JSONB/GIN indexes) • Query Pushdown for efficiency/performance
  • 84. © Instaclustr Pty Limited, 2024 28. RisingWave Wave processing (Adobe)
  • 85. © Instaclustr Pty Limited, 2024 RisingWave • What? • Stream processing database – also as a Service • Superpowers • Stateful stream processing § Using Cloud Native Storage § Potential replacement for Kafka Streams • PostgreSQL compatible § Works with Apache Superset • My Use Cases
  • 86. © Instaclustr Pty Limited, 2024 Santa’s Elves Toy + Box Packing Streaming joins to match toys and boxes (Adobe) Service Map using OpenTelemetry + SigNoz
  • 87. © Instaclustr Pty Limited, 2024 RisingWave • Watch Out For • SQL != Kafka Streams DSL • Kafka keys not propagated • Windowing has different semantics
  • 88. © Instaclustr Pty Limited, 2024 29. TensorFlow What does the future hold? (Adobe)
  • 89. © Instaclustr Pty Limited, 2024 TensorFlow • What? • Neural network ML library • Superpowers • Supports incremental ML • From streaming Kafka data • My Use Cases
  • 90. © Instaclustr Pty Limited, 2024 ML Over Streaming Kafka Data – With Concept Drift Kafka Streams
  • 91. © Instaclustr Pty Limited, 2024 TensorFlow • Watch Out For • ML over streaming spatiotemporal data with concept drifts is tricky § Time/space bias • Wild model accuracy oscillation § Concept shift can result in very low-accuracy models initially • Train/use Multiple Models [Chart: Concept Drift - incremental training (time vs accuracy); series: same model, reset model, guessing]
  • 92. © Instaclustr Pty Limited, 2024 30. Yours Here Invent your own (DeepAI)
  • 93. © Instaclustr Pty Limited, 2024 Integration Example 1 Our Customer Facing Monitoring Before: Spark and API requests → High load on Cassandra
  • 94. © Instaclustr Pty Limited, 2024 Integration Example 1 Our Customer Facing Monitoring After: Kafka + Kafka Streams + Redis Reduced Cassandra Load Recent metrics served from Redis, or Cassandra on cache miss PostgreSQL 1 – get meta-data 2 – get data from Redis 3 – or from Cassandra 20k Nodes Thanks to my colleague Kuangda He for this information
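The "served from Redis, or Cassandra on cache miss" pattern can be sketched with two dictionaries standing in for the cache and the database. This is an illustration only, with no Redis or Cassandra client involved; the function name, metric keys and values are hypothetical, and it omits expiry and eviction.

```python
def get_metric(key, cache, store):
    """Return (value, source): try the cache first, fall back to the store."""
    if key in cache:
        return cache[key], "cache"
    value = store[key]    # cache miss: read from the backing database
    cache[key] = value    # populate the cache for later reads
    return value, "store"

# Hypothetical data: the store holds all metrics, the cache only recent ones.
store = {"node-1:cpu": 0.42, "node-2:cpu": 0.87}
cache = {"node-1:cpu": 0.42}

hit = get_metric("node-1:cpu", cache, store)
miss = get_metric("node-2:cpu", cache, store)
repeat = get_metric("node-2:cpu", cache, store)
```

Every request answered from the cache is one the database never sees, which is where the reduced Cassandra load comes from.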
  • 95. © Instaclustr Pty Limited, 2024 Integration Example 2 Drone Delivery Demo Kafka Streams Customers Order Shops Busy warnings Uses Cassandra+OpenSearch ML over streaming data Demo/POC
  • 96. © Instaclustr Pty Limited, 2024 Integration Example 2 Drone Delivery Prod? Kafka Streams Customers Order PostgreSQL Drone Operations Order Tracking Shops Busy warnings Uses Cassandra+OpenSearch ML over streaming data Drone/order locations cached in Redis Read-through or write-behind Kafka sink connectors