ELK @ LinkedIn
Scaling ELK with Kafka
Introduction
Tin Le (tinle@linkedin.com)
Senior Site Reliability Engineer
Formerly part of the Mobile SRE team, responsible for servers
handling mobile app (iOS, Android, Windows, RIM, etc.)
traffic.
Now responsible for guiding ELK @ LinkedIn as a whole
Problems
● Multiple data centers, tens of thousands of servers,
hundreds of billions of log records
● Logging, indexing, searching, storing, visualizing and
analysing all of those logs all day every day
● Security (access control, storage, transport)
● Scaling to more DCs, more servers, and even more
logs…
● ARRGGGGHHH!!!!!
Solutions
● Commercial
o Splunk, Sumo Logic, HP ArcSight Logger, Tibco,
XpoLog, Loggly, etc.
● Open Source
o Syslog + Grep
o Graylog
o Elasticsearch
o etc.
Criteria
● Scalable - horizontally, by adding more nodes
● Fast - as close to real time as possible
● Inexpensive
● Flexible
● Large user community (support)
● Open source
The winner is...
Splunk ???
ELK!
ELK at LinkedIn
● 100+ ELK clusters across 20+ teams and 6
data centers
● Some of our larger clusters have:
o More than 32 billion docs (30+ TB)
o Daily indices average 3.0 billion docs (~3 TB)
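The figures above imply a rough per-document size and retention window. The sketch below just does that arithmetic. It assumes decimal units (1 TB = 10^12 bytes) and that the daily and whole-cluster figures describe the same cluster, which the slide does not state; the function names are illustrative.

```python
# Back-of-the-envelope sizing from the slide's figures (decimal units assumed).
TB = 10**12

daily_docs = 3.0e9        # "Daily indices average 3.0 billion docs"
daily_bytes = 3 * TB      # "~3 TB"
cluster_docs = 32e9       # "More than 32 billion docs"
cluster_bytes = 30 * TB   # "30+ TB"


def avg_doc_bytes(total_bytes: float, docs: float) -> float:
    """Average stored bytes per indexed document."""
    return total_bytes / docs


def days_of_data(total_docs: float, docs_per_day: float) -> float:
    """Rough number of daily indices a cluster of this size would hold."""
    return total_docs / docs_per_day


if __name__ == "__main__":
    print(f"daily index:   ~{avg_doc_bytes(daily_bytes, daily_docs):.0f} bytes/doc")
    print(f"whole cluster: ~{avg_doc_bytes(cluster_bytes, cluster_docs):.0f} bytes/doc")
    print(f"implied retention: ~{days_of_data(cluster_docs, daily_docs):.1f} days")
```

Under those assumptions a document averages roughly 1 KB, and a cluster of this size holds on the order of ten to eleven daily indices.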
ELK + Kafka
Summary: ELK is a popular open source application stack for
visualizing and analyzing logs. ELK is currently used across
many teams within LinkedIn. The architecture we use is made up of
four components: Elasticsearch, Logstash, Kibana and Kafka.
● Elasticsearch: Distributed real-time search and analytics engine
● Logstash: Collect and parse all data sources into an easy-to-read
JSON format
● Kibana: Elasticsearch data visualization engine
● Kafka: Data transport, queue, buffer and short term storage
What is Kafka?
● Apache Kafka is a high-throughput distributed
messaging system
o Invented at LinkedIn and open sourced in 2011
o Fast, Scalable, Durable, and Distributed by Design
o Links for more:
▪ http://kafka.apache.org
▪ http://data.linkedin.com/opensource/kafka
Kafka at LinkedIn
● Common data transport
● Available and supported by dedicated team
o 875 Billion messages per day
o 200 TB/day In
o 700 TB/day Out
o Peak Load
▪ 10.5 Million messages/s
▪ 18.5 Gigabits/s Inbound
▪ 70.5 Gigabits/s Outbound
Logging using Kafka at LinkedIn
● Dedicated cluster for logs in each data center
● Individual topics per application
● Defaults to 4 days of transport level retention
● Not currently replicating between data centers
● Common logging transport for all services, languages
and frameworks
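The speaker notes mention a common logging library that automatically sends WARN-and-above records to Kafka, with one topic per application. A minimal sketch of that idea in Python follows. The `send` callable, the `logs_<app>` topic naming and the JSON field names are all assumptions made for illustration; the deck does not describe LinkedIn's actual library or topic scheme, and a real service would wrap a Kafka producer client here.

```python
import json
import logging
from typing import Callable

# Hypothetical transport: send(topic, payload_bytes).
SendFn = Callable[[str, bytes], None]


class KafkaLogHandler(logging.Handler):
    """Forward WARNING-and-above records to a per-application topic."""

    def __init__(self, app_name: str, send: SendFn) -> None:
        # The handler level filters out DEBUG and INFO records.
        super().__init__(level=logging.WARNING)
        self.app_name = app_name
        self.topic = f"logs_{app_name}"  # hypothetical naming scheme
        self._send = send

    def emit(self, record: logging.LogRecord) -> None:
        try:
            event = {
                "app": self.app_name,
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
                "timestamp": record.created,
            }
            self._send(self.topic, json.dumps(event).encode("utf-8"))
        except Exception:
            # Never let a logging failure crash the service.
            self.handleError(record)


if __name__ == "__main__":
    sent = []  # stand-in for a Kafka producer
    log = logging.getLogger("demo")
    log.setLevel(logging.DEBUG)
    log.addHandler(KafkaLogHandler("mobile-frontend", lambda t, p: sent.append((t, p))))
    log.info("request served")       # below WARNING: not forwarded
    log.warning("disk almost full")  # forwarded
    print(sent)
```

Injecting the transport keeps the sketch runnable without a broker and makes the level filter easy to test.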
ELK Architectural Concerns
● Network Concerns
o Bandwidth
o Network partitioning
o Latency
● Security Concerns
o Firewalls and ACLs
o Encrypting data in transit
● Resource Concerns
o A misbehaving application can swamp production resources
Multi-colo ELK Architecture
[Diagram. Labels shown: LinkedIn Services, Log Transport, Kafka, ELK Search Clusters. Each data center (DC1, DC2, DC3) has its own Services, Kafka and ELK Search Clusters. The Corp Data Centers host the Tribes and the ELK Dashboard.]
ELK Search Architecture
[Diagram. Components shown: Users, Kibana, Elasticsearch (tribe), Elasticsearch (master), Kafka, and two Logstash instances, each paired with an Elasticsearch (data node).]
Operational Challenges
● Data, lots of it.
o Transporting, queueing, storing, securing,
reliability…
o Ingesting & Indexing fast enough
o Scaling infrastructure
o Which data? (right data needed?)
o Formats, mapping, transformation
▪ Data from many sources: Java, Scala, Python, Node.js, Go
Operational Challenges...
● Centralized vs Siloed Cluster Management
● Aggregated views of data across the entire
infrastructure
● Consistent view (trace up/down app stack)
● Scaling - horizontally or vertically?
● Monitoring, alerting, auto-remediating
The future of ELK at LinkedIn
● More ELK clusters being used by even more teams
● Clusters with 300+ billion docs (300+ TB)
● Daily indices average 10+ billion docs (10 TB); move to
hourly indices
● ~5,000 shards per cluster
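A quick sketch of what these projections imply for index and shard sizes, plus one way an hourly index name could be derived. Decimal units are assumed, the even spread of data across shards and hours is an idealization, and the `logstash` prefix and date pattern simply extend Logstash's default daily naming (`logstash-YYYY.MM.dd`) by an hour field; none of this is stated on the slide.

```python
from datetime import datetime, timezone

TB = 10**12
GB = 10**9


def hourly_index_name(prefix: str, ts: datetime) -> str:
    """Index name with an hour suffix, e.g. logstash-2015.06.01.13 (ts in UTC)."""
    return f"{prefix}-{ts.strftime('%Y.%m.%d.%H')}"


def avg_shard_bytes(cluster_bytes: float, shards: int) -> float:
    """Average shard size if data were spread evenly across shards."""
    return cluster_bytes / shards


if __name__ == "__main__":
    hourly_bytes = 10 * TB / 24
    print(f"one hourly index: ~{hourly_bytes / GB:.0f} GB")
    print(f"average shard:    ~{avg_shard_bytes(300 * TB, 5000) / GB:.0f} GB")
    print(hourly_index_name("logstash", datetime(2015, 6, 1, 13, tzinfo=timezone.utc)))
```

Splitting a 10 TB day into 24 indices gives roughly 417 GB per index, which is the kind of reduction that makes hourly indices attractive at this volume.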
Extra slides
The next two slides contain example Logstash
configs showing how we use the pipe input plugin
with the Kafka Console Consumer (KCC), and how to
monitor Logstash using the metrics filter.
KCC pipe input config
input {
  pipe {
    type => "mobile"
    # One shell command; the trailing backslashes continue it across lines.
    command => "/opt/bin/kafka-console-consumer/kafka-console-consumer.sh \
      --formatter com.linkedin.avro.KafkaMessageJsonWithHexFormatter \
      --property schema.registry.url=http://schema-server.example.com:12250/schemaRegistry/schemas \
      --autocommit.interval.ms=60000 \
      --zookeeper zk.example.com:12913/kafka-metrics \
      --topic log_stash_event \
      --group logstash1"
    # Each line the consumer prints is parsed as one JSON event.
    codec => "json"
  }
}
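Conceptually, the pipe input runs an external command and turns each line of its standard output into an event, here decoded as JSON. The Python sketch below mimics that flow so the behavior is easy to see. It is an illustration, not Logstash's implementation: it takes an argv list rather than a shell string, the function name is made up, and the demo command is a stand-in for the Kafka Console Consumer.

```python
import json
import subprocess
import sys
from typing import Dict, Iterator, Sequence


def read_json_events(command: Sequence[str], event_type: str) -> Iterator[Dict]:
    """Run `command` and yield one event dict per line of its stdout."""
    with subprocess.Popen(list(command), stdout=subprocess.PIPE, text=True) as proc:
        assert proc.stdout is not None
        for line in proc.stdout:
            line = line.strip()
            if not line:
                continue
            try:
                event = json.loads(line)
                if not isinstance(event, dict):
                    raise ValueError("expected a JSON object")
            except ValueError:
                # Keep the raw line and tag it, as Logstash's json codec does.
                event = {"message": line, "tags": ["_jsonparsefailure"]}
            # Like the `type` setting: only applied if the event has none.
            event.setdefault("type", event_type)
            yield event


if __name__ == "__main__":
    # Stand-in for the consumer: prints one JSON line and one bad line.
    demo_cmd = [sys.executable, "-c",
                "print('{\"message\": \"hello\"}'); print('not json')"]
    for ev in read_json_events(demo_cmd, "mobile"):
        print(ev)
```

Because events are read straight from a process pipe, throughput depends on how fast the consumer process can print and the reader can parse, with no plugin layer in between.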
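Monitoring Logstash metrics
filter {
  metrics {
    meter => "events"
    add_tag => "metric"
  }
}
output {
  # Only the periodic metric events carry the "metric" tag.
  if "metric" in [tags] {
    stdout {
      codec => line {
        # Field syntax as on the slide; newer Logstash releases write %{[events][rate_1m]}.
        format => "Rate: %{events.rate_1m}"
      }
    }
  }
}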
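The "Rate: ..." lines printed by this config can feed a simple watchdog that flags a Logstash instance as stalled when the rate drops to zero or the lines stop arriving. The sketch below shows only that decision logic. The thresholds and function names are assumptions for illustration, and this is not the internal monitoring tool mentioned in the speaker notes.

```python
import re
from typing import Optional

# Matches the line codec format used above, e.g. "Rate: 1234.5".
RATE_LINE = re.compile(r"^Rate: (?P<rate>[0-9]+(?:\.[0-9]+)?(?:[eE][-+]?[0-9]+)?)\s*$")


def parse_rate(line: str) -> Optional[float]:
    """Return the 1-minute event rate from a metric line, or None if it does not match."""
    match = RATE_LINE.match(line)
    return float(match.group("rate")) if match else None


def is_stalled(last_rate: Optional[float],
               seconds_since_last_line: float,
               max_silence_s: float = 60.0,
               min_rate: float = 0.0) -> bool:
    """True if no rate was ever seen, the metric went silent, or the rate is too low."""
    if last_rate is None or seconds_since_last_line > max_silence_s:
        return True
    return last_rate <= min_rate


if __name__ == "__main__":
    print(parse_rate("Rate: 1234.5"))   # a healthy metric line
    print(is_stalled(0.0, 5.0))         # running, but no events flowing
    print(is_stalled(250.0, 120.0))     # metric lines stopped arriving
```

Checking for silence as well as a zero rate covers both failure modes described in the notes: the process stopping entirely, and the process running while no data goes through.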

ELK at LinkedIn - Kafka, scaling, lessons learned


Editor's Notes

  • #3: 50+% of site traffic comes in via mobile.
  • #4: Many applications. Mobile frontend logs: average 2.4 TB in size (3+ billion docs).
  • #5: Evaluated, and used by some teams. Some acquisitions use commercial solutions. Commercial solutions are cost-prohibitive.
  • #8: Expect storage size to increase as we migrate to using doc_values
  • #9: Leveraging open source stack. Large community. Leveraging common data transport. Rock solid, proven, dedicated support team.
  • #10: Fast: a single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. Scalable: Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines, which allows data streams larger than any single machine can handle and allows clusters of coordinated consumers. Durable: messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages without performance impact. Distributed by design: Kafka has a modern cluster-centric design that offers strong durability and fault-tolerance guarantees.
  • #11: These numbers are from the LinkedIn Kafka presentation at ApacheCon 2015: over 1,100 brokers across 50+ clusters, over 32,000 topics, and over 350 thousand partitions, not including replication.
  • #12: Log retention is long enough to cover a 3-day weekend, so we can re-index from Kafka if we encounter issues. Dedicated log clusters isolate traffic from other clusters. Two years ago, we used our own Kafka input plugin, then switched to KCC via the pipe input for performance reasons. Monitoring: it's important to monitor Logstash nodes. LS has a bug where an error in any of its inputs, filters or outputs will stop the entire LS process. See the metrics filter config at the end of these slides.
  • #13: We’ve chosen to keep all of our clients local to the clusters and use a tiered architecture due to several major concerns. The primary concern is around the networking itself. Kafka enables multiple consumers to read the same topic, which means if we are reading remotely, we are copying messages over expensive inter-datacenter connections multiple times. We also have to handle problems like network partitioning in every client. Granted, you can have a partition even within a single datacenter, but it happens much more frequently when you are dealing with large distances. There’s also the concern of latency in connections – distance increases latency. Latency can cause interesting problems in client applications, and I like life to be boring. There are also security concerns around talking across datacenters. If we keep all of our clients local, we do not have to worry about ACL problems between the clients and the brokers (and Zookeeper as well). We can also deal with the problem of encrypting data in transit much more easily. This is one problem we have not worried about as much, but it is becoming a big concern now. The last concern is over resource usage. Everything at LinkedIn talks to Kafka, and a problem that takes out a production cluster is a major event. It could mean we have to shift traffic out of the datacenter until we resolve it, or it could result in inconsistent behavior in applications. Any application could overwhelm a cluster, but there are some, such as applications that run in Hadoop, that are more prone to this. By keeping those clients talking to a cluster that is separate from the front end, we mitigate resource contention.
  • #14: For security reasons, data/logs generated in each DC stay there and are indexed by the local ELK cluster. Aggregated views are provided via Tribe nodes. All Logstash instances use common filters to catch the most common data leakage. How services log to Kafka: we imposed a common logging library. All services use this library, which automatically logs to Kafka at WARN or above.
  • #15: General architecture for each ELK cluster. Dedicated masters. Tribe client node (HTTP services).
  • #16: Data: reliable transport, storing, queueing, consuming, indexing. Some data (Java service logs, for example) is not in the right format. Solutions to data: Kafka as transport, storage queue, and backbone; more Logstash instances and more Kafka partitions. Using KCC, we can consume faster than ES can index. To increase indexing speed: more ES nodes (horizontal scaling), more shards (to distribute work), and customized templates.
  • #17: Using local Kafka log clusters instead of aggregated metrics. Tribe to aggregate clusters. We use an internal tool called Nurse to monitor and auto-remediate (restart) hung or dead instances of LS and ES.
  • #18: These numbers are estimates based on growth rate and plans. Besides logs, we have other application use cases internally.
  • #20: This is how we use the Logstash pipe input plugin to call out to the Kafka Console Consumer. This currently gives us the highest ingestion throughput.
  • #21: It's important to monitor Logstash nodes. LS has a bug where an error in any of its inputs, filters or outputs will stop the entire LS process. You can use the Logstash metrics filter to make sure that LS is still processing data. Sometimes LS runs but no data goes through; this will let you know when that happens.
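Note #16 pairs "more Logstash instances" with "more Kafka partitions". The reason is that, within one consumer group, each partition is read by at most one consumer, so instances beyond the partition count sit idle. The sketch below illustrates this with a simple round-robin assignment. It is a simplified model for illustration, not Kafka's actual assignment algorithm, and the instance names are made up.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, consumers: List[str]) -> Dict[str, List[int]]:
    """Spread partitions round-robin over the consumers of one group."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment: Dict[str, List[int]] = {name: [] for name in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # Four partitions, two Logstash instances: both stay busy.
    print(assign_partitions(4, ["ls1", "ls2"]))
    # Two partitions, three instances: the third has nothing to read.
    print(assign_partitions(2, ["ls1", "ls2", "ls3"]))
```

In this model, adding a third instance to a two-partition topic adds no consumption capacity until the topic gets more partitions.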