Apache Kafka at LinkedIn
KAFKA Team, Data Infrastructure
©2013 LinkedIn Corporation. All Rights Reserved.
About Me
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Why Did We Build Kafka?
We Have a Lot of Data
• User activity tracking
• Page views, ad impressions, etc
• Server logs and metrics
• Syslogs, request-rates, etc
• Messaging
• Emails, news feeds, etc
• Computation derived
• Results of Hadoop / data warehousing, etc
… and We Build Products on Data
Newsfeed
Recommendation
People you may know
Recommendation
Search
Metrics and Monitoring
System and application metrics/logging
… and a LOT of Monitoring
The Problem:
How to integrate this variety of data
and make it available to all products?
Life back in 2010:
Point-to-Point Pipelines
Example: User Activity Data Flow
What We Want
• A centralized data pipeline
Apache Kafka
We tried some off-the-shelf systems, but…
What We REALLY Want
• A centralized data pipeline that is
• Elastically scalable
• Durable
• High-throughput
• Easy to use
Apache Kafka
• A distributed pub-sub messaging system
• Scales out from the ground up
• Persists messages to disk
• High throughput (tens of MB/sec per server)
Life Since Kafka in Production
• Developed and maintained by 5 devs + 2 SREs
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Roadmap
• Q & A
Key Idea #1:
Data-parallelism leads to scale-out
Distribute Clients across Partitions
• Produce/consume requests are randomly balanced among brokers (see the sketch below)
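To make the idea concrete, here is a minimal, self-contained sketch (not the actual Kafka 0.8 partitioner) of how a producer can spread messages across partitions: keyless sends go to a random partition, while keyed sends hash the key so related messages stay together. The class and method names are invented for illustration.

import java.util.concurrent.ThreadLocalRandom;

// Conceptual sketch only: spreading produce requests across partitions,
// and therefore across the brokers that host them.
public class PartitionPicker {
    private final int numPartitions;

    public PartitionPicker(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    // Keyless sends may go to any partition, balancing load across brokers.
    public int pickRandom() {
        return ThreadLocalRandom.current().nextInt(numPartitions);
    }

    // Keyed sends hash the key so messages with the same key land on the same partition.
    public int pickByKey(String key) {
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}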
Key Idea #2:
Disks are fast when used sequentially
Store Messages as a Log
• Appends are effectively O(1)
• Reads from a known offset are still fast when the data is cached
[Figure: Partition i of Topic A as an append-only log (offsets 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, ...); the producer writes at the tail while Consumer1 and Consumer2 each read from offset 7.]
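A minimal sketch of the log abstraction behind this slide (an in-memory stand-in, not Kafka's on-disk segment format): appends go to the tail and return the next offset, and each consumer reads forward from an offset it tracks itself.

import java.util.ArrayList;
import java.util.List;

// Conceptual sketch of a partition as an append-only sequence of messages.
public class PartitionLog {
    private final List<String> messages = new ArrayList<>();

    // Append to the tail; effectively O(1). Returns the offset of the new message.
    public synchronized long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Read a batch starting from a known offset; the consumer advances its own
    // offset by the number of messages it received.
    public synchronized List<String> readFrom(long offset, int maxMessages) {
        int from = (int) Math.min(offset, messages.size());
        int to = Math.min(from + maxMessages, messages.size());
        return new ArrayList<>(messages.subList(from, to));
    }
}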
Key Idea #3:
Batching makes best use of network/IO
Batch Transfer
• Batched send and receive
• Batched compression
• No message caching in the JVM
• Zero-copy from file to socket via Java NIO (see the sketch below)
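The zero-copy bullet refers to Java NIO's FileChannel.transferTo(), which lets the kernel move bytes from the page cache directly to the socket without copying them through the JVM heap. Below is a hedged sketch of that call; the file path, host, and port are placeholders, not values from the talk.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Sketch of zero-copy transfer from a log segment file to a consumer socket.
public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        try (FileChannel segment = FileChannel.open(Paths.get("/tmp/example-segment.log"),
                                                    StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9999))) {
            long position = 0;
            long remaining = segment.size();
            while (remaining > 0) {
                // transferTo() hands the copy to the kernel (sendfile on Linux).
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}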
The API (0.8)
Producer:
send(topic, message)
Consumer:
Iterable stream = createMessageStreams(…).get(topic)
for (message: stream) {
// process the message
}
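The snippet above is the simplified API shown on the slide. For context, a rough sketch of how the 0.8-era Java clients were typically wired up is below; the broker list, ZooKeeper address, group id, and topic name are placeholders, and error handling is omitted.

import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;
import kafka.javaapi.producer.Producer;
import kafka.message.MessageAndMetadata;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

// Sketch of the 0.8-era clients behind the simplified API above.
public class SimpleKafka08Example {
    public static void main(String[] args) {
        String topic = "page-views";

        // Producer side: send(topic, message)
        Properties pprops = new Properties();
        pprops.put("metadata.broker.list", "broker1:9092,broker2:9092");
        pprops.put("serializer.class", "kafka.serializer.StringEncoder");
        Producer<String, String> producer = new Producer<>(new ProducerConfig(pprops));
        producer.send(new KeyedMessage<String, String>(topic, "hello kafka"));
        producer.close();

        // Consumer side: createMessageStreams(...).get(topic)
        Properties cprops = new Properties();
        cprops.put("zookeeper.connect", "zk1:2181");
        cprops.put("group.id", "example-group");
        ConsumerConnector connector = Consumer.createJavaConsumerConnector(new ConsumerConfig(cprops));
        Map<String, List<KafkaStream<byte[], byte[]>>> streams =
            connector.createMessageStreams(Collections.singletonMap(topic, 1));
        for (MessageAndMetadata<byte[], byte[]> record : streams.get(topic).get(0)) {
            // process the message
            System.out.println(new String(record.message()));
        }
    }
}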
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka Usage at LinkedIn
• Mainly used for tracking user-activity and metrics data
• 16 - 32 brokers in each cluster (615+ total brokers)
• 527 billion messages/day
• 7500+ topics, 270k+ partitions
• Byte rates:
• Writes: 97 TB/day
• Reads: 430 TB/day
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Problems
• Hundreds of message types
• Thousands of fields
• What do they all mean?
• What happens when they change?
Standardized Schema on Avro
• Schema
• Message structure contract
• Performance gain
• Workflow
• Check in schema
• Auto compatibility check
• Code review
• “Ship it!”
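To illustrate the schema-as-contract idea, here is a hedged example using Avro's generic API; the record name and fields below are invented for the example and are not LinkedIn's actual event schemas. The compact binary encoding (no field names on the wire) is where the "performance gain" bullet comes from.

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

// Hedged illustration of an Avro-defined tracking event used as a message contract.
public class AvroEventExample {
    private static final String SCHEMA_JSON =
        "{\"type\":\"record\",\"name\":\"PageViewEvent\",\"fields\":["
      + "{\"name\":\"memberId\",\"type\":\"long\"},"
      + "{\"name\":\"pageKey\",\"type\":\"string\"},"
      + "{\"name\":\"timestamp\",\"type\":\"long\"}]}";

    public static byte[] encode() throws IOException {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Build a record that conforms to the checked-in schema.
        GenericRecord event = new GenericData.Record(schema);
        event.put("memberId", 12345L);
        event.put("pageKey", "profile");
        event.put("timestamp", System.currentTimeMillis());

        // Binary-encode the record before handing the bytes to the producer.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(event, encoder);
        encoder.flush();
        return out.toByteArray();
    }
}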
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Kafka to Hadoop
Hadoop ETL (Camus)
• Map/Reduce job does data load
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently
• Open sourced:
– https://github.com/linkedin/camus
Agenda
• Overview of Kafka
• Kafka Design
• Kafka Usage at LinkedIn
• Pipeline deployment
• Schema for data cleanliness
• O(1) ETL
• Auditing for correctness
• Roadmap
• Q & A
Does it really work?
“All published messages must be delivered to all consumers (quickly)”
Audit Trail
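The audit-trail diagram itself is not reproduced here, but the underlying idea can be sketched as follows (a simplified illustration, not LinkedIn's actual audit implementation): each tier counts the messages it sees per topic and time bucket, and counts from adjacent tiers are compared to detect loss or lag. The bucket size and class names are arbitrary choices for the example.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of an audit counter kept by one tier of the pipeline
// (producer, per-datacenter consumer, Hadoop load, ...).
public class AuditCounter {
    private static final long BUCKET_MS = 10 * 60 * 1000;  // 10-minute buckets (arbitrary choice)

    // key "topic@bucketStart" -> message count observed by this tier
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    public void record(String topic, long eventTimeMs) {
        long bucket = (eventTimeMs / BUCKET_MS) * BUCKET_MS;
        counts.computeIfAbsent(topic + "@" + bucket, k -> new LongAdder()).increment();
    }

    // A bucket checks out when the downstream tier's count matches the upstream tier's.
    public static boolean matches(AuditCounter upstream, AuditCounter downstream,
                                  String topic, long bucketStartMs) {
        String key = topic + "@" + bucketStartMs;
        long up = upstream.counts.getOrDefault(key, new LongAdder()).sum();
        long down = downstream.counts.getOrDefault(key, new LongAdder()).sum();
        return up == down;
    }
}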
More Features in Kafka 0.8
• Intra-cluster replication (0.8.0)
• High availability
• Reduced latency
• Log compaction (0.8.1)
• State storage
• Operational tools (0.8.2)
• Topic management
• Automated leader rebalance
• etc ..
Check out our page for more: http://kafka.apache.org/
Kafka 0.9
• Clients Rewrite
• Remove ZK dependency
• Even better throughput
• Security
• More operability, multi-tenancy ready
• Transactional Messaging
• From at-least-once to exactly-once
Check out our page for more: http://kafka.apache.org/
Kafka Users: Next Maybe You?
Acknowledgements
Questions? Guozhang Wang
guwang@linkedin.com
www.linkedin.com/in/guozhangwang
Backup Slides
Real-time Analysis with Kafka
• Analytics from Hadoop can be slow
• Production -> Kafka: tens of milliseconds
• Kafka -> Hadoop: < 1 minute
• ETL in Hadoop: ~ 45 minutes
• MapReduce in Hadoop: maybe hours
Real-time Analysis with Kafka
• Solution No. 1: directly consuming from Kafka
• Solution No. 2: other storage than HDFS
• Spark, Shark
• Pinot, Druid, FastBit
• Solution No. 3: stream processing
• Apache Samza
• Storm
How Fast Can Kafka Go?
• Bottleneck #1: network bandwidth
• Producer: 100 Mb/s for 1 Gig-Ethernet
• Consumers can be slower due to multiple subscribers
• Bottleneck #2: disk space
• Data may be deleted before being consumed at peak times
• Configurable time/size-based retention policy
• Bottleneck #3: ZooKeeper
• Mainly due to offset commits; will be lifted in 0.9
Intra-cluster Replication
• Pick CA within the datacenter (failover < 10ms)
• Network partitions are rare
• Latency is less of an issue
• Separate data replication and consensus
• Consensus => ZooKeeper
• Replication => primary-backup (f replicas tolerate f-1 failures)
• Configurable ACK (durability vs. latency)
• More details:
• http://www.slideshare.net/junrao/kafka-replication-apachecon2013
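As a small illustration of the "configurable ACK" point, the 0.8 producer exposed the request.required.acks setting to trade durability against latency; the broker list below is a placeholder.

import java.util.Properties;

// Sketch of the configurable-ACK trade-off in the 0.8 producer config:
//   0  -> fire and forget (lowest latency, weakest durability)
//   1  -> wait for the partition leader to acknowledge
//   -1 -> wait for all in-sync replicas (strongest durability, highest latency)
public class AckConfigExample {
    public static Properties durableProducerConfig() {
        Properties props = new Properties();
        props.put("metadata.broker.list", "broker1:9092,broker2:9092");
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "-1");   // favor durability over latency
        return props;
    }
}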
Replication Architecture
[Figure: replication architecture with producers and consumers connected to a cluster of four brokers, which coordinate through ZooKeeper (ZK).]
Editor's Notes
  • #6: Data-serving websites, LinkedIn has a lot of data
  • #9: Based on relevance
  • #12: We have this variety of data and we need to build all these products around such data.
  • #13: We have this variety of data and we need to build all these products around such data.
  • #15: Messaging: ActiveMQ; User activity: in-house log aggregation; Logging: Splunk; Metrics: JMX => Zenoss; Database data: Databus, custom ETL
  • #18: ActiveMQ: they do not fly
  • #21: Now you maybe wondering why it works so well? For example, why it can be both highly durable by persisting data to disks while still maintaining high throughput?
  • #24: Topic = message stream. Topics have partitions; partitions are distributed across brokers.
  • #25: Do not be afraid of disks
  • #26: File system caching
  • #28: And finally, after all these tricks, the client interface we expose to users is very simple.
  • #30: Now I will switch gears and talk a little bit about Kafka usage at LinkedIn.
  • #31: 21st, October.
  • #33: Multi-colo
  • #43: 99.99%
  • #44: 0.8.2: delete topic, automated leader rebalancing, controlled shutdown, offset management, parallel recovery, min.isr and clean leader election
  • #46: Non-Java/Scala clients: C, C++, .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc. (https://cwiki.apache.org/confluence/display/KAFKA/Clients). Python - pure Python implementation with full protocol support; consumer and producer implementations included; GZIP and Snappy compression supported. C - high-performance C library with full protocol support. C++ - native C++ library with protocol support for Metadata, Produce, Fetch, and Offset. Go (aka golang) - pure Go implementation with full protocol support; consumer and producer implementations included; GZIP and Snappy compression supported. Ruby - pure Ruby; consumer and producer implementations included; GZIP and Snappy compression supported; Ruby 1.9.3 and up (CI runs MRI 2. Clojure - Clojure DSL for the Kafka API. JavaScript (Node.js) - Node.js client in a pure JavaScript implementation; stdin & stdout.
  • #47: Non-Java/Scala clients: C, C++, .NET, Go, Clojure, Ruby, Node.js, PHP, Python, Erlang, HTTP REST, command line, etc.