Introduction to Kafka
Akash Vacher
2015/12/5
▪ Akash Vacher
SRE,
Data Infrastructure Streaming (Bengaluru)
LinkedIn
SRE?
▪ Site Reliability Engineers
– Administrators
– Architects
– Developers
▪ Keep the site running, always
Agenda
▪ Kafka Overview
▪ Some facts and figures
▪ Basic Kafka concepts
▪ Some use cases
▪ Q and A
Kafka Overview
▪ High-throughput distributed messaging system
▪ Kafka guarantees:
– At-least-once delivery
– Strong ordering (per partition)
▪ Developed at LinkedIn and open sourced in early 2011
▪ Implemented in Scala and Java
Kafka users
Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
Attributes of a Kafka Cluster
• Disk Based
• Durable
• Scalable
• Low Latency
• Finite Retention
Motivation
▪ Unified platform to handle all real-time data feeds
▪ High throughput
▪ Stream Processing
▪ Horizontally scalable
Before
After
How is Kafka used at LinkedIn?
▪ Monitoring (inGraphs)
▪ User tracking
▪ Email and SMS notifications
▪ Stream processing (Samza)
▪ Database Replication
Facts and figures
▪ Over 1,300,000,000,000 messages are produced to Kafka every day at LinkedIn
▪ 300 terabytes of inbound and 900 terabytes of outbound traffic
▪ 4.5 million messages per second on a single cluster
▪ Kafka runs on ~1300 servers at LinkedIn
Building blocks
The humble log
Anatomy of a topic
Consumer groups
Bird’s eye view
Kafka in action
[Diagram: a producer and a consumer connected through Kafka brokers hosting topic A, partitions P0 and P1 (with a replica of P0), coordinated by Zookeeper]
Performance recipe
▪ OS page cache
▪ Linear IO, never fear the file system!
▪ The sendfile() system call
▪ Message batching
Operating Kafka
▪ Broker Hardware
– Cisco C240, quad-core Intel Xeon, 64 GB RAM, 14-disk RAID 10
▪ Zookeeper Hardware
– 5 + 1 ensemble, 64 GB RAM, 500 GB SSD
Operating Kafka
▪ Monitoring
– Under Replicated Partitions
– Unclean leader election
– Lag monitoring
– Burrow
▪ Cluster rebalance
– Size-wise rebalance
– Partition-wise rebalance
Kafka at LinkedIn
▪ Multiple data centers
▪ Mirror data
▪ Cluster Types
– Tracking
– Metrics
– Queuing
▪ Data transport from applications to Hadoop, and back
Metrics collection
▪ Building Blocks
– Sensors
– RRD
– Front end
▪ Facts & Figures
– 320,000,000 metrics collected per minute
– 530 TB of disk space
– Over 210,000 metrics collected per service
InGraphs
Kafka for database replication - Master-slave
Kafka for database replication - Multi-master
How Can You Get Involved?
▪ http://kafka.apache.org
▪ Join the mailing lists
– users@kafka.apache.org
▪ irc.freenode.net - #apache-kafka
▪ Contribute
Questions?
Editor's Notes

  • #2: Kafka – a high-throughput messaging system
  • #4: SRE stands for Site Reliability Engineering. SRE combines several roles that fit together into one operations position. Foremost, we are administrators: we manage all of the systems in our area. We are also architects: we do capacity planning for our deployments, plan out our infrastructure in new datacenters, and make sure all the pieces fit together. And we are also developers: we identify the tools we need, both to make our jobs easier and to keep our users happy, and we write and maintain them. At the end of the day, our job is to keep the site running, always.
  • #6: Kafka is a distributed, partitioned, replicated commit log. Kafka guarantees at-least-once delivery of messages and strong ordering on a per-partition basis.
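To make the at-least-once guarantee concrete, here is a minimal Java producer sketch (not from the deck; the broker address and topic name are hypothetical) with the client settings that lean on it:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AtLeastOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        // acks=all waits for the in-sync replicas before acknowledging, and
        // retries resend on transient failures: together they give at-least-once
        // delivery (duplicates are possible, message loss is not).
        props.put("acks", "all");
        props.put("retries", 5);
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("page-views", "member-42", "profile_view"));
        }
    }
}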
  • #7: Some of the companies powered by Kafka. Source: https://cwiki.apache.org/confluence/display/KAFKA/Powered+By
  • #8: Kafka allows retention of data, which is a huge plus as it makes bootstrapping a new service from a past point in time easy. There is durability due to redundancy at the partition level, and the system is horizontally scalable. Most of the reads that hit the Kafka brokers are served from memory, which results in low-latency reads for any consumer that is relatively caught up. Data expiry rules are customizable.
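As a sketch of how finite retention and durability are configured per topic (this uses the Java AdminClient, which postdates this 2015 deck; the topic name and sizing are illustrative):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateRetainedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // 8 partitions for parallelism, replication factor 3 for durability;
            // retention.ms gives the topic finite retention of 4 days.
            NewTopic topic = new NewTopic("page-views", 8, (short) 3)
                    .configs(Collections.singletonMap(
                            "retention.ms", String.valueOf(4L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}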
  • #9: Apache Kafka was built at LinkedIn with a specific purpose in mind: to serve as a central repository of data streams. There were two major motivations. 1) The first problem was how to transport data between systems: we had lots of data systems, and each of these needed reliable feeds of data in a geographically distributed environment. 2) The second part of this problem was the need to do richer analytical data processing—the kind of thing that would normally happen in a data warehouse or Hadoop cluster—but with very low latency. It was evident that a system that catered to both of the above needs would have to have high throughput and be horizontally scalable as well.
  • #10: Initially, our approach was very ad hoc: we built custom piping between systems and applications on an as-needed basis and shoehorned any asynchronous processing into request-response web services. Over time this setup got more and more complex, as we ended up building pipelines between all kinds of different systems.
  • #11: After we introduced Kafka, the producers and the consumers became completely decoupled, and this allowed services to just connect to a central system for all their data production/consumption needs without worrying about the other services which may be consuming/producing this data.
  • #12: We have many use cases of Kafka at LinkedIn; here are summaries of a few of them. Every application emits metrics into Kafka, and we have systems that read and store this data to generate graphs and thresholds. We track all website activities: clicks, page views, and experiments which we turn on for subsets of users. Each time you visit LinkedIn, many different services are called to generate the page you are looking at; each service sends a message to Kafka with details of that request. We then analyze all of that data with a Samza job that allows us to build a full call tree for the particular request, and we can use this data to troubleshoot issues on the site. Samza, by the way, is another open source product developed at LinkedIn that our team supports. All of the emails that get sent out from LinkedIn go through Kafka at least once, and often a few times: they are often generated in Hadoop, sent to a production system using Kafka, which then decorates the emails with additional information and sends them back into Kafka for another application to read and turn into an actual email. We stream changes to our search indexes in real time through Kafka to allow us to update search results in real time. We also use Kafka combined with Apache Samza to standardize things like job titles, phone numbers, and addresses. We are also currently exploring the use case of using Kafka to replicate databases: the rough idea is that a stream of transactions received by a database can be copied over through Kafka to another database and replayed in the same order to achieve the same state as the first database.
  • #13: All of the previous use cases I described, and many more, add up to a ton of data: 1.3 trillion messages per day. As is evident, the total read traffic is almost thrice the write traffic. This is where data retention really shines, as Kafka does not have to push the data to consumers every time it is read: the data resides on disk, and any consumer can access it and start reading from a Kafka cluster. We replicate most of the data between datacenters to keep applications in sync.
  • #15: The log is a simple data structure. Writes happen at the tail, and messages are in chronological order from head to tail. Offsets make it easy to move around in the stream, which allows read scalability.
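That "easy movement by offset" can be sketched with the Java consumer (the current client API, newer than this deck; topic, partition, and offset are hypothetical): assign a partition, seek to a past position, and re-read from there.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class SeekByOffset {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("enable.auto.commit", "false"); // no group semantics needed here
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("page-views", 0);
            consumer.assign(Collections.singletonList(p0));
            // The log is addressable by offset: jump to any retained position
            // and re-read from there. This is what makes re-consumption cheap.
            consumer.seek(p0, 12345L);
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("offset=%d value=%s%n", r.offset(), r.value());
            }
        }
    }
}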
  • #16: A “message” is a discrete unit of data within Kafka. Clients who send data into Kafka are called producers, and clients who read data from Kafka are called consumers. Every message that gets sent to Kafka belongs to a topic; this allows different types of data to be sent into a single cluster. The topic is then divided into multiple partitions for parallelism. These partitions exist across the Kafka servers (brokers) that make up the Kafka cluster. This diagram depicts how data is written into partitions.
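A small sketch of keyed writes (topic, key, and broker address are hypothetical): the default partitioner hashes the record key, so one member's events all land in the same partition and stay ordered relative to each other.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class KeyedWrites {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            for (String event : new String[] {"page_view", "click", "logout"}) {
                // Same key => same partition: the metadata shows every send
                // landing in one partition with increasing offsets.
                RecordMetadata m = producer.send(
                        new ProducerRecord<>("page-views", "member-42", event)).get();
                System.out.printf("partition=%d offset=%d%n", m.partition(), m.offset());
            }
        }
    }
}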
  • #17: Messaging traditionally has two models: queuing and publish-subscribe. In a queue, a pool of consumers may read from a server and each message goes to one of them; in publish-subscribe the message is broadcast to all consumers. Kafka offers a single consumer abstraction that generalizes both of these—the consumer group. Consumers label themselves with a consumer group name, and each message published to a topic is delivered to one consumer instance within each subscribing consumer group. Consumer instances can be in separate processes within a single host, or on separate machines. If all the consumer instances have the same consumer group, then this works just like a traditional queue balancing load over the consumers. If all the consumer instances have different consumer groups, then this works like publish-subscribe and all messages are broadcast to all consumers.
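A sketch of the consumer-group abstraction (group and topic names are hypothetical): start several copies of this process with the same group.id for queue semantics, or give each copy its own group.id for publish-subscribe.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GroupedConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        // All instances sharing this group.id split the topic's partitions
        // among themselves; instances in different groups each see every message.
        props.put("group.id", "email-notifier");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page-views"));
            while (true) {
                for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            r.partition(), r.offset(), r.value());
                }
            }
        }
    }
}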
  • #18: This shows how the data flows through a cluster.
  • #19: Kafka is a publish-subscribe messaging system, in which there are four components: - Broker (what we call the Kafka server) - Zookeeper (which serves as a data store for information about the cluster and consumers) - Producer (sends data into the system) - Consumer (reads data out of the system) Data is organized into topics (here we show a topic named “A”) and topics are split into partitions (we have partitions 0 and 1 here). A “message” is a discrete unit of data within Kafka. Producers create messages and send them into the system. The broker stores them, and any number of consumers can then read those messages. In order to provide scalability, we have multiple brokers. By spreading out the partitions, we can handle more messages in any topic. This also provides redundancy. We can now replicate partitions on separate brokers. When we do this, one broker is the designated “leader” for each partition. This is the only broker that producers and consumers connect to for that partition. The brokers that hold the replicas are designated “followers” and all they do with the partition is keep it in sync with the leader. When a broker fails, one of the brokers holding an in-sync replica takes over as the leader for the partition. The producer and consumer clients have logic built-in to automatically rebalance and find the new leader when the cluster changes like this. When the original broker comes back online, it gets its replicas back in sync, and then it functions as the follower.
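The leader/follower layout described above can be inspected from a client; a sketch for the topic "A" in the diagram (broker address hypothetical):

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;

public class DescribeTopicA {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("enable.auto.commit", "false"); // metadata lookup only
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Each partition has one leader broker (serving producers and
            // consumers) plus follower replicas that stay in sync with it.
            for (PartitionInfo p : consumer.partitionsFor("A")) {
                System.out.printf("partition=%d leader=%s replicas=%d in-sync=%d%n",
                        p.partition(), p.leader(), p.replicas().length,
                        p.inSyncReplicas().length);
            }
        }
    }
}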
  • #20: Kafka is incredibly fast for a few reasons. Most reads never actually hit the disk, since consumers are usually caught up. Head seek time is reduced thanks to linear IO. On a read, Kafka utilizes the sendfile() system call, which allows the data to be written directly to a socket without first being loaded into the application; this reduces context switching. Batching allows higher throughput and better compression.
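Batching is mostly a matter of producer configuration; a sketch with illustrative values (these drop into the producer examples above):

import java.util.Properties;

public class ThroughputTunedConfig {
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        // Accumulate up to 64 KB per partition and wait up to 10 ms so sends
        // go out as large batches: fewer requests, more linear IO, and each
        // batch compresses as a single unit.
        props.put("batch.size", 65536);
        props.put("linger.ms", 10);
        props.put("compression.type", "gzip");
        return props;
    }
}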
  • #21: We run Kafka on hardware with lots of disk spindles in a RAID 10 configuration. We put our Zookeeper clusters on SSDs, which brought our average request latency down to zero milliseconds.
  • #22: We monitor Kafka in several different ways with tooling developed by the SRE team. Lag monitoring: lag is defined as the number of messages between the newest message available in Kafka and the last message the consumer has read. Under-replicated partitions: this is the count of follower replicas which have fallen behind the leader. This metric is reported per broker; in the healthy state it should always be zero. Unclean leader elections: when one of these happens, data has been lost. This occurs when there is a leader failure and no follower was in sync at that time. Burrow is a tool developed and open sourced by one of the Kafka SREs at LinkedIn. It is our new way of monitoring lag within Kafka, which uses velocity calculations to determine if a consumer is falling behind. We have also developed tooling to ensure all brokers within a cluster are doing the same amount of work. In the size-based balance we ensure that each broker has the same amount of data on disk; if they are not within our defined threshold, we move the optimal number of partitions around to make it balanced. In the partition-based balance we ensure that each broker has the same number of partitions, rebalancing the same way when brokers fall outside the threshold.
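A minimal sketch of the lag calculation itself (a real monitor like Burrow is far more sophisticated; the group, topic, and partition here are hypothetical):

import java.util.Collections;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("group.id", "email-notifier"); // the group being monitored
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            TopicPartition p0 = new TopicPartition("page-views", 0);
            Set<TopicPartition> parts = Collections.singleton(p0);
            // Lag = newest offset in the partition minus the group's committed offset.
            long end = consumer.endOffsets(parts).get(p0);
            OffsetAndMetadata committed = consumer.committed(p0);
            long lag = end - (committed == null ? 0L : committed.offset());
            System.out.printf("lag for %s = %d messages%n", p0, lag);
        }
    }
}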
  • #23: Cluster types: user activities on LinkedIn sites are tracked, and these data flow into the tracking clusters. LinkedIn has multiple colos, and users are served from different colos based on their unique ID; the tracking data goes to the local tracking clusters. We have an aggregator cluster, which gets the data aggregated from the multiple colos using mirror makers. The downstream applications which process the tracking data consume from the aggregate clusters. The OS and applications generate metrics, and these metrics are used for understanding the state of the system; these values are pumped into a separate metrics cluster (more about metrics in the next slide). The queuing cluster is used for the traditional queuing scenarios, when you have multiple applications and you want to coordinate their activities.
  • #24: We at LinkedIn use Kafka for pumping metrics into our graphing engine, InGraphs. The basic idea is that we have services which expose a certain set of metrics using MBeans, which are picked up by sensors, processed, and pumped into Kafka. These enriched metrics are all consumed by a service which filters metrics by tags and pushes this data into RRDs. These RRDs are used to generate graphs which are served to the end user.
  • #25: This is just a sample screenshot of final graphs in InGraphs. Different colors correspond to different hosts.
  • #26: One new use case for Kafka at LinkedIn is database replication. In this diagram we show how this is done. The database on the left streams its transaction log into Kafka. The data replicator consumes the transaction log stream from Kafka and replays it into the database on the right. This is a great method for doing cross-datacenter replication of databases. One of the obvious advantages over traditional master-slave database replication is the decoupling of the two databases. To initially start the secondary database, you must first create a backup snapshot of the data in DB1 and load it into DB2. After that, DB2 can listen to the transaction log stream via the data replicator and stay in sync.
  • #27: This also works for a master-master relationship, where you stream the transactions originating in the second colo back to the database in the first colo. Additional filtering logic is added to the data replicator to ensure that a loop is not created; in other words, a transaction originating in colo A needs to be mirrored to colo B but should not be replicated back to colo A.
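A heavily simplified sketch of such a data replicator (everything here is hypothetical: the topic name, the use of the record key as an origin-colo tag, and the replay step; a real replicator must also handle ordering, batching, and failure recovery):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DataReplicator {
    static final String LOCAL_COLO = "colo-A"; // hypothetical datacenter tag

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // hypothetical broker address
        props.put("group.id", "db-replicator");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("db-transaction-log"));
            while (true) {
                for (ConsumerRecord<String, String> txn : consumer.poll(Duration.ofSeconds(1))) {
                    // Loop prevention for multi-master: a transaction that
                    // originated in this colo is never replayed back into it.
                    if (LOCAL_COLO.equals(txn.key())) {
                        continue;
                    }
                    applyToDatabase(txn.value()); // hypothetical replay step
                }
            }
        }
    }

    static void applyToDatabase(String transaction) {
        // Replay the transaction against the local database, in partition order.
    }
}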
  • #28: So how can you get more involved in the Kafka community? The most obvious answer is to go to kafka.apache.org. From there you can: 1) join the mailing lists, either on the development or the user side; 2) dive into the source repository, and work on and contribute your own tools back.