Apache Kafka
A distributed publish-subscribe messaging system
Neha Narkhede, LinkedIn
nnarkhede@linkedin.com, 11/11/11
Outline
• Introduction to pub-sub
• Kafka at LinkedIn
• Hadoop and Kafka
• Design
• Performance
What is pub-sub?
[Diagram: producers call publish(topic, msg) on the publish-subscribe system, which holds Topic 1, Topic 2 and Topic 3; consumers subscribe to topics and receive each msg.]
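The publish(topic, msg) and subscribe calls in the diagram can be illustrated with a toy in-memory hub. This is only a sketch of the pattern: the PubSub class, its method signatures, and the topic names are assumptions made for illustration and are not Kafka's API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class PubSub:
    """Toy in-memory publish-subscribe hub (illustrative, not Kafka's API)."""

    def __init__(self) -> None:
        # topic name -> callbacks registered by consumers
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, msg: str) -> int:
        """Deliver msg to every subscriber of topic; return how many received it."""
        for callback in self._subscribers[topic]:
            callback(msg)
        return len(self._subscribers[topic])


hub = PubSub()
inbox_a: List[str] = []
inbox_b: List[str] = []
hub.subscribe("topic-1", inbox_a.append)
hub.subscribe("topic-1", inbox_b.append)
hub.subscribe("topic-2", inbox_b.append)

hub.publish("topic-1", "hello")   # reaches both inboxes
hub.publish("topic-2", "world")   # reaches inbox_b only
```

The producer never names a consumer, which is the decoupling the slide is pointing at: subscribers can be added to a topic without changing the publishing code.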
Motivation
• Activity tracking
• Operational metrics
Kafka
• Distributed
• Persistent
• High throughput
Kafka at LinkedIn
[Diagram: Frontend and Service boxes; a Kafka cluster of three Brokers; and boxes labeled Real time monitoring, Hadoop, DWH, Security systems, and News feed.]
Hadoop Data Load for Kafka
[Diagram: a live data center with Frontends, a Kafka cluster, and Real time consumers; an offline data center with a Kafka cluster and Dev and PROD Hadoop clusters.]
Multi DC data deployments
[Diagram: live data centers with Kafka and Real time consumers; offline data centers with Kafka, Hadoop clusters, and the DWH.]
Volume
• 20B events/day
• 3 terabytes/day
• 150K events/sec
Message queues (ActiveMQ, TIBCO)
• Low throughput
• Secondary indexes
• Tuned for low latency
Log aggregators (Flume, Scribe)
• Focus on HDFS
• Push model
• No rewindable consumption
KAFKA
What Kafka offers
• Very high performance
• Elastically scalable
• Low operational overhead
• Durable, highly available (coming soon)
Architecture
[Diagram: Producers, four Brokers, Consumers, and ZK (ZooKeeper).]
Efficiency #1: simple storage
• Each topic has an ever-growing log
• A log == a list of files
• A message is addressed by a log offset
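The storage idea above (a log is a list of files, and a message is found by its offset) can be sketched as follows. This is a minimal illustration under stated assumptions, not Kafka's on-disk format: the SegmentedLog class, the 4-byte length prefix, the use of byte positions as offsets, and the in-memory bytearrays standing in for files are all choices made for the example.

```python
import bisect
from typing import List, Tuple


class SegmentedLog:
    """Append-only log split into segments, addressed by byte offset.

    Segments are in-memory bytearrays standing in for files on disk.
    """

    def __init__(self, segment_bytes: int = 64) -> None:
        self.segment_bytes = segment_bytes
        self.base_offsets: List[int] = [0]      # first logical offset of each segment
        self.segments: List[bytearray] = [bytearray()]
        self.next_offset = 0

    def append(self, payload: bytes) -> int:
        """Append one message and return the offset that addresses it."""
        record = len(payload).to_bytes(4, "big") + payload
        if self.segments[-1] and len(self.segments[-1]) + len(record) > self.segment_bytes:
            # Roll to a new segment; it starts at the current end of the log.
            self.base_offsets.append(self.next_offset)
            self.segments.append(bytearray())
        offset = self.next_offset
        self.segments[-1] += record
        self.next_offset += len(record)
        return offset

    def read(self, offset: int) -> Tuple[bytes, int]:
        """Return (payload, offset of the next message)."""
        # Binary search for the segment whose base offset is <= offset.
        i = bisect.bisect_right(self.base_offsets, offset) - 1
        pos = offset - self.base_offsets[i]
        segment = self.segments[i]
        size = int.from_bytes(segment[pos:pos + 4], "big")
        payload = bytes(segment[pos + 4:pos + 4 + size])
        return payload, offset + 4 + size


log = SegmentedLog(segment_bytes=32)
offsets = [log.append(f"event-{i}".encode()) for i in range(5)]
payload, next_offset = log.read(offsets[3])
```

Because the only lookup structure is a sorted list of segment base offsets, finding a message needs no per-message index, which is the contrast with the secondary indexes of the message queues mentioned earlier.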
Efficiency #2: careful transfer
• Batch send and receive
• No message caching in JVM
• Rely on file system buffering
• Zero-copy transfer: file -> socket
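The file -> socket transfer can be shown with Python's socket.sendfile(), which uses the operating system's sendfile call where available. Kafka itself runs on the JVM, so this is an analogy rather than its code; the helper name send_log_chunk, the temporary file, and the local socket pair are assumptions made for the demo.

```python
import os
import socket
import tempfile


def send_log_chunk(sock: socket.socket, path: str, offset: int, count: int) -> int:
    """Send count bytes of the file at path, starting at offset, over sock.

    socket.sendfile() uses os.sendfile() where the platform supports it, so the
    bytes go from the file system cache to the socket without being copied into
    the application; elsewhere it falls back to ordinary send() calls.
    """
    with open(path, "rb") as f:
        return sock.sendfile(f, offset=offset, count=count)


# Demo: a local socket pair stands in for the broker-to-consumer connection.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"m0m1m2m3m4m5")
    log_path = tmp.name

broker_side, consumer_side = socket.socketpair()
sent = send_log_chunk(broker_side, log_path, offset=4, count=6)
received = consumer_side.recv(64)

broker_side.close()
consumer_side.close()
os.remove(log_path)
```

The request is expressed as (offset, count) against a file, which also matches the batched, offset-addressed reads described on the storage slide.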
Multi subscribers
• 1 file system operation per request
• Consumption is cheap
• SLA-based message retention
• Rewindable consumption
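One way to picture cheap multi-subscriber and rewindable consumption is a consumer that keeps its own read position in a shared log. The sketch below assumes that model; the Consumer class, its poll and rewind methods, and the list used as a log are hypothetical names for illustration only.

```python
from typing import List


class Consumer:
    """Consumer that owns its read position in a shared, append-only log."""

    def __init__(self, log: List[str]) -> None:
        self.log = log          # shared with other consumers, never modified here
        self.position = 0       # index of the next message to read

    def poll(self, max_messages: int) -> List[str]:
        batch = self.log[self.position:self.position + max_messages]
        self.position += len(batch)
        return batch

    def rewind(self, position: int) -> None:
        """Move back to an earlier position to re-read retained messages."""
        self.position = position


log = ["a", "b", "c", "d"]
fast, slow = Consumer(log), Consumer(log)

first_pass = fast.poll(4)   # fast reads everything
slow_batch = slow.poll(2)   # slow is unaffected by fast's progress
fast.rewind(1)
replay = fast.poll(2)       # fast re-reads from position 1
```

Since reading does not change the log, adding a subscriber costs only one more position, and messages stay available for replay until retention removes them.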
Guarantees
• Data integrity checks
• At-least-once delivery
• In-order delivery, per partition
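At-least-once delivery is commonly obtained by recording the consumed position only after a message has been processed. The sketch below assumes that approach and simulates a crash with an exception; the consume function, the handler, and the crash flag are hypothetical and only illustrate why a message can be seen twice but not skipped.

```python
from typing import Callable, List


def consume(log: List[str], committed: int, handle: Callable[[str], None]) -> int:
    """Process messages from the committed position; return the new committed position.

    The position advances only after handle() returns, so a crash inside
    handle() leaves the message uncommitted and it is delivered again on restart.
    """
    for msg in log[committed:]:
        try:
            handle(msg)
        except RuntimeError:
            break               # simulated crash: stop without committing msg
        committed += 1          # commit after successful processing
    return committed


log = ["m0", "m1", "m2"]
seen: List[str] = []
crash_once = {"armed": True}


def handler(msg: str) -> None:
    seen.append(msg)            # side effect happens before the crash
    if msg == "m1" and crash_once["armed"]:
        crash_once["armed"] = False
        raise RuntimeError("consumer crashed after processing m1")


position = consume(log, 0, handler)         # stops at m1, which stays uncommitted
position_after_crash = position
position = consume(log, position, handler)  # restart: m1 is delivered again
```

The duplicate of m1 is the price of this ordering; messages are still handled in log order, which corresponds to the per-partition ordering guarantee.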
Automatic load balancing
[Diagram: Producers, two Brokers, and Consumers.]
Auditing
# events published = # events consumed
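The audit condition can be checked by counting events per topic and time window on both sides and reporting any window where the counts differ. This is a sketch under assumptions: the 10-minute window, the (topic, timestamp) event shape, and the function names are illustrative and are not taken from the slides.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Event = Tuple[str, int]  # (topic, event timestamp in seconds)


def count_by_window(events: Iterable[Event], window_secs: int = 600) -> Counter:
    """Count events per (topic, window start)."""
    counts: Counter = Counter()
    for topic, ts in events:
        counts[(topic, ts - ts % window_secs)] += 1
    return counts


def audit(published: Iterable[Event], consumed: Iterable[Event]) -> Dict[Tuple[str, int], int]:
    """Return published-minus-consumed for each window where the counts differ."""
    p, c = count_by_window(published), count_by_window(consumed)
    return {key: p[key] - c[key] for key in p.keys() | c.keys() if p[key] != c[key]}


published = [("page_view", 5), ("page_view", 30), ("page_view", 700), ("login", 10)]
consumed = [("page_view", 5), ("page_view", 30), ("login", 10)]
discrepancies = audit(published, consumed)
```

A positive difference points to events that were published but not yet consumed (loss or lag); a negative one points to duplicates, which at-least-once delivery allows.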
Performance
• 2 Linux boxes
  – 16 cores at 2.0 GHz
  – 6 × 7200 rpm SATA drives in RAID 10
  – 24 GB memory
  – 1 Gb network link
• 200-byte messages
• Producer batch size: 200 messages
Basic performance metrics
• Producer batch size = 40K
• Consumer batch size = 1MB
• 100 topics, broker flush interval = 100K
– Producer throughput = 90 MB/sec
– Consumer throughput = 60 MB/sec
– Consumer latency = 220 ms
Latency vs throughput
[Chart: consumer latency in ms (0 to 250) against producer throughput in MB/sec (0 to 100); 100 topics, 1 producer, 1 broker.]
Scalability
Throughput in MB/s (10 topics, broker flush interval 100K)
• 1 broker: 101
• 2 brokers: 190
• 3 brokers: 293
• 4 brokers: 381
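How close these figures are to linear scaling can be checked with a few lines of arithmetic. The numbers come from the slide (101, 190, 293 and 381 MB/s for 1 to 4 brokers); the efficiency measure, total throughput divided by broker count times the single-broker figure, is an assumption chosen for illustration.

```python
# Throughput in MB/s by broker count, as read from the Scalability chart.
throughput_mb_s = {1: 101, 2: 190, 3: 293, 4: 381}

baseline = throughput_mb_s[1]
for brokers, total in throughput_mb_s.items():
    per_broker = total / brokers
    efficiency = total / (brokers * baseline)   # 1.0 would be perfectly linear
    print(f"{brokers} broker(s): {per_broker:6.2f} MB/s per broker, efficiency {efficiency:.2f}")

efficiency_at_4 = throughput_mb_s[4] / (4 * baseline)
```

Under this measure, four brokers deliver roughly 94% of ideal linear scaling.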
Throughput vs Unconsumed data
[Chart: throughput in msg/s (0 to 200,000) against unconsumed data in GB (10 to 1039); 1 topic, broker flush interval 10K.]
State of the system
• 4 clusters per colo, 4 servers each
• 850 socket connections per server
• 20 TB
• 430 topics
• Batched frontend to offline datacenter latency => 6-10 secs
• Frontend to Hadoop latency => 5 min
State of the system
• Successfully deployed in production at LinkedIn and other startups
• Apache incubator inclusion
• 0.7 Release
  – Compression
  – Cluster mirroring
Replication
Some project ideas
• Security
• Long poll
• More compression codecs
• Locality of consumption
Team
• Jay Kreps
• Jun Rao
• Neha Narkhede
• Joel Koshy
• Chris Burroughs
THANK YOU
http://incubator.apache.org/kafka/index.html
kafka-users@incubator.apache.org
http://www.linkedin.com/in/nehanarkhede
@nehanarkhede
#kafka
API

Editor's Notes

  • #5: Multiple subscribers, decoupling. Some examples: email notification on adding a connection, etc.
  • #9: Every service at LinkedIn uses Kafka. 10-15 real-time consumers use Kafka.
  • #11: Give some background on – basically generating business reports. Tracking data is crucial for measuring user engagement: unique member visits per day, ad metrics, CTR to ad publishers. Also data analytics: new models for PYMK. Collapse multiple boxes into one. Show multiple Hadoop clusters. If you track some new data, it will automatically reach Hadoop in a few minutes.
  • #14: Messaging systems tuned for very low latency, but not high volume
  • #18: One topic is enough
  • #29: Add compression line here instead
  • #30: Batching on frontends, batching on both kafka clusters, tuned for throughput
  • #35: Kafka webpage here
  • #36: Test setup: a single producer sends 10 million messages of 200 bytes each; single consumer thread; flush interval 10K. ActiveMQ: syncOnWrite=false, KahaDB. RabbitMQ: mandatory=true, immediate=false, file persistence. Metric: producer throughput in messages/second. Kafka produces at 50K messages/second with a batch size of 1 and 400K messages/second with a batch size of 50. These numbers are orders of magnitude higher than ActiveMQ's and at least twice RabbitMQ's. Reasons: no producer ACKs; a more compact storage format (9 bytes/message overhead in Kafka, 144 bytes/message in ActiveMQ); and the cost of maintaining heavy index structures (one thread in ActiveMQ spent most of its time accessing a B-Tree to maintain message metadata and state).
  • #37: Kafka consumes at 22K messages/sec, more than 4 times the rate of ActiveMQ and RabbitMQ. Reasons: a more efficient storage format, so fewer bytes are transferred from server to consumer; the broker in ActiveMQ and RabbitMQ had to maintain delivery status for each message; one of the ActiveMQ threads kept writing KahaDB pages to disk during the test; Kafka uses the sendfile API to reduce transmission overhead.
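The figures quoted in the notes on producer throughput can be put in proportion with a short calculation. The inputs (200-byte messages, 9 and 144 bytes of per-message overhead, 50K and 400K messages/second) are taken from those notes; expressing overhead as a fraction of stored bytes is an assumption made for illustration.

```python
PAYLOAD_BYTES = 200                                  # message size used in the test
OVERHEAD_BYTES = {"Kafka": 9, "ActiveMQ": 144}       # per-message overhead from the notes


def overhead_fraction(overhead: int, payload: int = PAYLOAD_BYTES) -> float:
    """Fraction of stored bytes that is overhead rather than payload."""
    return overhead / (payload + overhead)


fractions = {name: overhead_fraction(b) for name, b in OVERHEAD_BYTES.items()}
batching_speedup = 400_000 / 50_000                  # batch size 50 vs batch size 1

for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%} of stored bytes are overhead")
print(f"Batching speedup: {batching_speedup:.0f}x")
```

For 200-byte messages this puts overhead at about 4% of stored bytes for Kafka against about 42% for ActiveMQ, and the move from a batch size of 1 to 50 corresponds to an 8x gain in producer throughput.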