What is Apache Kafka?
How it’s similar to the databases you know and love, and how it’s not.
Kenny Gorman
Founder and CEO
www.eventador.io
www.kennygorman.com
@kennygorman
I am a database nerd
I have done database foo for my whole career, going on 25 years.
Sybase, Oracle DBA, PostgreSQL DBA, MySQL aficionado, MongoDB early adopter, founded two companies based on data technologies.
Broke lots of stuff, lost data before, recovered said data, stayed up many nights, on-call shift horror stories.
Apache Kafka is really cool; as fellow database nerds, you will appreciate it.
‘02 had hair ^
Now… lol
Kafka
Comparison with the databases you are familiar with
Kafka
Apache Kafka is an open-source stream processing and pub/sub message platform developed by the Apache Software Foundation, written in Scala and Java.
The project aims blah blah blah pub/sub message queue architected as a distributed transaction log,[3] blah blah blah to process streaming data. Blah blah blah.
The design is heavily influenced by transaction logs.[4]
Why Kafka?
- High Performance Streaming Data
- Persistent
- Distributed
- Fault Tolerant
- K.I.S.S.
- Many Modern Use Cases
Pub/Sub Messaging Attributes
- It’s a stream of data. A boundless stream of data.
Image: https://kafka.apache.org
{"temperature": 29}
{"temperature": 29}
{"temperature": 30}
{"temperature": 29}
{"temperature": 29}
{"temperature": 30}
{"temperature": 29}
{"temperature": 29}
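The "boundless stream" idea can be sketched with a plain Python generator. This is only an illustration: `temperature_stream` and its seeded random values are made up for the example and have nothing to do with a real Kafka client.

```python
import itertools
import json
import random


def temperature_stream(seed=0):
    """Yield JSON-encoded temperature readings forever (a boundless stream)."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    while True:
        yield json.dumps({"temperature": rng.choice([29, 30])})


# A reader never sees "the end" of the stream; it just takes what it needs.
first_eight = list(itertools.islice(temperature_stream(), 8))
for message in first_eight:
    print(message)
```

Unlike a table, there is no final row to reach: the reader decides when to stop.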
Logical Data Organization
PostgreSQL      MongoDB            Kafka
Database        Database           Topic Files
Fixed Schema    Non Fixed Schema   Key/Value Message
Table           Collection         Topic
Row             Document           Message
Column          Name/Value Pairs   -
-               Shard              Partition
Storage Architecture
PostgreSQL                       MongoDB                            Kafka
Stores data in files on disk     Stores data in files on disk       Stores data in files on disk
Has journal for recovery (WAL)   Has journal for recovery (Oplog)   Is a commit log
FS + Buffer Cache                FS for caching *                   FS for caching
Random Access, Indexing          Random Access, Indexing            Sequential access
Topics
- Core to design of Kafka
- Partitioning
- Consumers and Consumer Groups
- Offsets ~= High Water Mark
Image: https://kafka.apache.org
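For database folks, the consumer group and offset bullets may be easier to picture as code. The toy below is a sketch under heavy assumptions: `TopicLog` is a hypothetical in-memory class with a single partition, not part of any Kafka library. It only shows that each consumer group keeps its own read position on the same log.

```python
class TopicLog:
    """Toy single-partition topic: an append-only list plus per-group offsets."""

    def __init__(self):
        self._messages = []       # the log; entries are never updated in place
        self._group_offsets = {}  # consumer group name -> next offset to read

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1  # offset assigned to the new message

    def poll(self, group, max_messages=10):
        start = self._group_offsets.get(group, 0)
        batch = self._messages[start:start + max_messages]
        self._group_offsets[group] = start + len(batch)  # advance this group only
        return batch


log = TopicLog()
for reading in (29, 29, 30):
    log.append({"temperature": reading})

dashboard = log.poll("dashboard")            # reads all three messages
alerts = log.poll("alerts", max_messages=1)  # reads only the first message
log.append({"temperature": 31})
dashboard_next = log.poll("dashboard")       # picks up where "dashboard" left off
```

Reading does not remove anything from the log, which is why two groups can consume the same topic at different speeds.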
Performance
- Kafka topics are glorified distributed write-ahead logs
- Append only
- k/v pairs where the key decides the partition it lives in
- Sendfile system call optimization
- Client-controlled routing
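"The key decides the partition" can be sketched as a stable hash of the key modulo the partition count. This is an assumption-laden illustration: `zlib.crc32` stands in for the hash (the real clients use their own hash function), and `NUM_PARTITIONS`, `partition_for` and `send` are hypothetical names.

```python
import zlib

NUM_PARTITIONS = 3  # hypothetical partition count for the sketch


def partition_for(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 here, as a stand-in)."""
    return zlib.crc32(key) % num_partitions


partitions = [[] for _ in range(NUM_PARTITIONS)]  # one append-only list each


def send(key: bytes, value: bytes) -> int:
    """Route the message on the client side and append it to its partition."""
    p = partition_for(key)
    partitions[p].append((key, value))  # append only; nothing is rewritten
    return p


first = send(b"sensor-1", b'{"temperature": 29}')
second = send(b"sensor-1", b'{"temperature": 30}')
other = send(b"sensor-2", b'{"temperature": 29}')
```

Because the same key always lands in the same partition, messages for one key keep their relative order.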
Availability and Fault Tolerance
- Topics are replicated among any number of servers (brokers)
- Topics can be configured individually
- Topic partitions are the unit of replication
The unit of replication is the topic partition. Under non-failure conditions, each partition in Kafka has a single leader and zero or more followers.
MongoDB: Majority Consensus (Raft-like in 3.2)
Kafka: ISR set vote, stored in ZK
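A rough way to picture the leader, followers and in-sync replica (ISR) set is the toy model below. It is a deliberately simplified sketch, not how the brokers implement it: the `Partition` class, its fields and the "first in-sync replica wins" rule are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Partition:
    """Toy replication state for one topic partition."""
    leader: Optional[int]                       # broker id of the current leader
    replicas: List[int]                         # all broker ids holding a copy
    isr: Set[int] = field(default_factory=set)  # replicas currently in sync

    def fail_broker(self, broker_id: int) -> None:
        """Drop a failed broker from the ISR and pick a new leader if needed."""
        self.isr.discard(broker_id)
        if broker_id == self.leader:
            # Only replicas that are still in sync are eligible to lead.
            candidates = [b for b in self.replicas if b in self.isr]
            self.leader = candidates[0] if candidates else None


p = Partition(leader=1, replicas=[1, 2, 3], isr={1, 2, 3})
p.fail_broker(2)  # follower lost: leader unchanged, ISR shrinks
p.fail_broker(1)  # leader lost: broker 3 is the only in-sync candidate
```

The point of the sketch is that a new leader comes from the in-sync set rather than from a majority vote across all replicas.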
Application Programming Interfaces
Insert
  PostgreSQL:
    sql = "insert into mytable .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.save({"baz": 1})
  Kafka:
    producer.send("mytopic", "{'baz':1}")
Query
  PostgreSQL:
    sql = "select * from …"
    cursor = db.execute(sql)
    for record in cursor:
        print(record)
  MongoDB:
    db.mytable.find({"baz": 1})
  Kafka:
    consumer = get_from_topic("mytopic")
    for message in consumer:
        print(message)
Update
  PostgreSQL:
    sql = "update mytable set .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.update({"baz": 1}, {"baz": 2})
  Kafka:
    -
Delete
  PostgreSQL:
    sql = "delete from mytable .."
    db.execute(sql)
    db.commit()
  MongoDB:
    db.mytable.remove({"baz": 1})
  Kafka:
    -
Typical RDBMS
import psycopg2.extras

# database_connect() and orgid are supplied by the application
conn = database_connect()
cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
cur.execute(
    """
    SELECT a.lastname, a.firstname, a.email,
           a.userid, a.password, a.username, b.orgname
    FROM users a, orgs b
    WHERE a.orgid = b.orgid
    AND a.orgid = %(orgid)s
    """,
    {"orgid": orgid},
)
results = cur.fetchall()  # the whole result set, one dict per row
for result in results:
    print(result)
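Publishing
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:1234')
for _ in range(100):
    # send() is asynchronous; messages are buffered and sent in batches
    producer.send('foobar', b'some_message_bytes')
producer.flush()  # block until the buffered messages have been sent
- Flush frequency/batch
- Partition keys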
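The "flush frequency/batch" trade-off can be sketched without a broker. The class below is a hypothetical toy, not the client's internals: `BatchingBuffer`, `batch_size` and `linger_seconds` are names invented for the example. It ships a batch when enough messages have piled up or when the oldest pending message has waited too long.

```python
import time


class BatchingBuffer:
    """Toy producer-side buffer: flush on batch size or after a linger time."""

    def __init__(self, send_batch, batch_size=3, linger_seconds=0.5):
        self._send_batch = send_batch  # callable that ships a list of messages
        self._batch_size = batch_size
        self._linger_seconds = linger_seconds
        self._pending = []
        self._oldest = None            # arrival time of the oldest pending message

    def send(self, message, now=None):
        now = time.monotonic() if now is None else now
        if not self._pending:
            self._oldest = now
        self._pending.append(message)
        full = len(self._pending) >= self._batch_size
        stale = now - self._oldest >= self._linger_seconds
        # There is no background timer here; staleness is only checked on send().
        if full or stale:
            self.flush()

    def flush(self):
        if self._pending:
            self._send_batch(list(self._pending))
            self._pending.clear()


shipped = []
buffer = BatchingBuffer(shipped.append, batch_size=3)
for i in range(7):
    buffer.send(f"message-{i}".encode(), now=0.0)
buffer.flush()  # ship the final partial batch before exiting
```

Larger batches mean fewer round trips but more latency, and more data at risk if the process dies before a flush.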
Subscribing (Consume)
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers='localhost:9092')
consumer.subscribe(['my-topic'])  # subscribe() takes a list of topic names
for msg in consumer:
    print(msg)
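Subscribing (Consume)
# This loop appears to follow the confluent-kafka client's poll/commit style.
# consumer, running, msg_process and MIN_COMMIT_COUNT are defined by the application.
try:
    msg_count = 0
    while running:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        msg_process(msg)  # application-specific processing
        msg_count += 1
        if msg_count % MIN_COMMIT_COUNT == 0:
            # `async` is a reserved word in Python 3.7+, so the keyword is `asynchronous`
            consumer.commit(asynchronous=False)
finally:
    # Shut down consumer
    consumer.close()
- Continuous ‘cursor’
- Offset management
- Partition assignment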
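Committing every few messages has a consequence worth seeing once: after a crash, a consumer resumes from the last committed offset and may process some messages twice. The simulation below is a sketch with made-up names (`consume`, `crash_after`) and a plain list standing in for a partition.

```python
MIN_COMMIT_COUNT = 3  # commit after every third processed message


def consume(log, committed_offset, crash_after=None):
    """Process messages from committed_offset; return (processed, new_committed)."""
    processed = []
    for offset in range(committed_offset, len(log)):
        if crash_after is not None and len(processed) == crash_after:
            break  # simulate the consumer dying mid-stream
        processed.append(log[offset])
        if len(processed) % MIN_COMMIT_COUNT == 0:
            committed_offset = offset + 1  # next offset to read after a restart
    return processed, committed_offset


log = [f"m{i}" for i in range(8)]
first_run, committed = consume(log, 0, crash_after=5)
second_run, committed_after = consume(log, committed)
```

Here the first run handles m0 through m4 but only commits up to m2, so the second run starts again at m3. That is at-least-once delivery, and it is why message processing should tolerate duplicates.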
Tooling
- No simple command console like psql or mongo shell
- BOFJCiS
- kafkacat, jq
- Shell scripts, MirrorMaker, etc.
- PrestoDB
Settings and Tunables
PostgreSQL:
- Shared Buffers
- WAL/recovery
MongoDB (MMAPv1):
- directoryPerDB
- FS tuning
Kafka:
- Xmx ~ 90% memory
- log.retention.hours
https://kafka.apache.org/documentation
Contact
We are hiring!
www.eventador.io
@kennygorman
