ELK
Log processing at Scale
#DevOpsDays 2015, Singapore
@DevOpsDaysSG
Angad Singh
About me
DevOps at Viki, Inc - a global video streaming site with subtitles.
Previously a Twitter SRE, National University of Singapore
Twitter @angadsg, GitHub @angad
Elasticsearch - Log Indexing and Searching
Logstash - Log Ingestion plumbing
Kibana - Frontend
Metrics vs Logging
Metrics
● Numeric timeseries data
● Actionable
● Counts, statistical (p90, p99, etc.)
● Scalable, cost-effective solutions already available
Logging
● Useful for debugging
● Catch-all
● Full-text searching
● Computationally intensive, harder to scale
Alerting and Monitoring at Viki
Deeper-level debugging with application logs
Success rate alert for service X
Logs
● Application logs - stack traces, handled exceptions
● Access logs - status codes, URI, HTTP method at all levels of the stack
● Client logs - direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)
● Log format standardized to JSON - easy to add or remove fields
● Request tracing across services using a unique ID assigned at the load balancer
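A minimal sketch of the JSON log format and request tracing described above: one log event is serialized as a single JSON line, and a request ID is carried along so the same request can be followed across services. The field names (`timestamp`, `service`, `level`, `message`, `request_id`), the `X-Request-Id` header name, and the service name in the demo are assumptions for illustration; the talk does not give Viki's actual schema.

```python
import json
import uuid
from datetime import datetime, timezone


def get_request_id(headers):
    """Reuse the ID assigned upstream (e.g. by the load balancer), else create one.

    The header name "X-Request-Id" is an assumption, not taken from the talk.
    """
    return headers.get("X-Request-Id") or uuid.uuid4().hex


def make_log_line(service, level, message, request_id, **extra):
    """Serialize one log event as a single JSON line."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        "request_id": request_id,
    }
    # Adding or removing a field is a dict update; no log parser has to change.
    event.update(extra)
    return json.dumps(event, sort_keys=True)


if __name__ == "__main__":
    rid = get_request_id({"X-Request-Id": "abc123"})
    print(make_log_line("videos-api", "error", "upstream timeout", rid, status=504))
```

Because every service logs the same `request_id`, a single search on that field returns the whole path of a request through the stack.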
Logstash
● Log aggregator
● Log preprocessing (filtering etc.)
● 3-stage pipeline
● Input > Filter > Output
Elasticsearch
● Full-text searching and indexing
● Built on top of Apache Lucene
● RESTful web interface
● Horizontally scalable
Kibana
● Frontend
● Visualizations, dashboards
● Supports geo visualizations
● Uses the ES REST API
Scaling ELK Stack - DevOpsDays Singapore
Logstash
Input (any stream)
● local file
● queue
● tcp, udp
● twitter
● etc.
Filter (mutation)
● add/remove field
● parse as json
● ruby code
● parse geoip
● etc.
Output
● elasticsearch
● redis
● queue
● file
● pagerduty
● etc.
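The Input > Filter > Output flow can be illustrated with three small Python generators. This is only a sketch of the idea, not how Logstash is implemented or configured: the function names, the `pipeline` field, and the removed `password` field are hypothetical, and the output stage writes to a plain list rather than to Elasticsearch. The `_jsonparsefailure` tag mirrors the tag Logstash's json filter applies on parse errors.

```python
import json


def read_input(lines):
    """Input stage: yield raw events from any iterable of lines (file, socket, queue)."""
    for line in lines:
        line = line.strip()
        if line:
            yield line


def apply_filters(raw_events):
    """Filter stage: parse each event as JSON, then add and remove fields."""
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            # Keep unparseable lines, tagged so they can be found later.
            event = {"message": raw, "tags": ["_jsonparsefailure"]}
        if not isinstance(event, dict):
            # Valid JSON that is not an object (e.g. a bare number) is wrapped.
            event = {"message": raw}
        event["pipeline"] = "sketch"   # add a field (hypothetical)
        event.pop("password", None)    # remove a field (hypothetical)
        yield event


def write_output(events, sink):
    """Output stage: hand each event to a sink and return how many were written."""
    count = 0
    for event in events:
        sink.append(event)
        count += 1
    return count


if __name__ == "__main__":
    sink = []
    source = ['{"status": 200, "password": "x"}', "", "not json"]
    written = write_output(apply_filters(read_input(source)), sink)
    print(written, sink)
```

Chaining generators keeps each stage independent, which is the property that lets any input be paired with any filter and any output.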
Logstash Forwarder
● Golang program that sits next to the log files and speaks the lumberjack protocol
● Forwards logs from a file to a Logstash server
● Removes the need for a buffer (such as Redis or a queue) for logs pending ingestion into Logstash
● Runs as a Docker container with /var/log volume-mounted; configuration stored in Consul
● Application containers with /var/log volume-mounted to /var/log/docker/<container>/application.log
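The core of forwarding logs from a file is remembering how far the file has been read and shipping only complete new lines. The Python sketch below shows that bookkeeping only; it is not the Logstash Forwarder implementation and does not speak the lumberjack protocol. The function name `read_new_lines` is hypothetical, and log rotation is deliberately ignored.

```python
def read_new_lines(path, offset):
    """Return (complete lines appended since `offset`, new offset).

    A trailing partial line (no newline yet) is left in place and picked up
    on a later call, so half-written events are never shipped.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Consume only up to and including the last newline.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

A forwarder would call this in a loop, send the returned lines to a Logstash server, and persist the offset only after the send succeeds.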
Logstash pool with HAProxy
4 x Logstash machines (8 cores, 16 GB RAM).
7 x Logstash processes per machine: 5 for application logs, 2 for HTTP client logs.
Fronted by HAProxy for both the lumberjack protocol and HTTP.
Easily scalable by adding more machines and spinning up more Logstash processes.
[Diagram: Application Service Container 1 and Application Service Container 2 next to a Logstash-Forwarder container; /var/log mounted to /var/log/docker/ on the host]
Elasticsearch Hardware
12 cores, 64 GB RAM, RAID 0 across 2 x 3 TB 7200 rpm disks.
20 nodes, 20 shards, 3 replicas (plus 1 primary).
~300 GB per day x 4 copies (3 replicas + 1 primary); about 3 months of data on 120 TB.
Average 6k-8k logs per second, peak 25k logs per second.
https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
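The storage figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers from the slide; the function names are hypothetical, decimal units (1 TB = 1000 GB) are assumed, and index overhead, free-space headroom, and compression are ignored.

```python
def raw_capacity_tb(nodes, disks_per_node, disk_tb):
    """Total raw disk across the cluster (RAID 0 adds no redundancy overhead)."""
    return nodes * disks_per_node * disk_tb


def daily_footprint_tb(daily_gb, replicas):
    """Disk consumed per day once the primary and every replica are counted."""
    copies = replicas + 1  # replicas plus the primary
    return daily_gb * copies / 1000.0


def retention_days(capacity_tb, per_day_tb):
    """How many days of data fit before the disks are full."""
    return capacity_tb / per_day_tb


if __name__ == "__main__":
    capacity = raw_capacity_tb(nodes=20, disks_per_node=2, disk_tb=3)
    per_day = daily_footprint_tb(daily_gb=300, replicas=3)
    print(capacity, per_day, retention_days(capacity, per_day))
```

With 20 nodes of 2 x 3 TB each, raw capacity is 120 TB; 300 GB x 4 copies is 1.2 TB per day; and 120 / 1.2 is 100 days, which matches the roughly 3 months quoted on the slide.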
Hardware Tuning
● Keep the heap below 30.5 GB - the JVM uses compressed object pointers only below that heap size.
● Sweet spot - 64 GB of RAM, with half left available for Lucene file buffers.
● SSD or RAID 0 (or multiple data path directories, which behave similarly to RAID 0).
● On SSDs, set the I/O scheduler to deadline instead of cfq.
● RAID 0 - no need to worry about disk failures, since machines can easily be replaced thanks to multiple copies of the data.
● Disable swap.
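The two memory rules above (give about half the RAM to the heap, and stay under the compressed-pointer threshold) combine into a simple sizing rule. The helper below is a hypothetical sketch of that rule using the 30.5 GB figure quoted on the slide; it returns an upper bound for the heap, and the real heap should be set just under it.

```python
COMPRESSED_POINTER_LIMIT_GB = 30.5  # threshold quoted on the slide


def memory_split_gb(ram_gb, limit_gb=COMPRESSED_POINTER_LIMIT_GB):
    """Return (heap upper bound, memory left for Lucene file buffers) in GB."""
    heap = min(ram_gb / 2.0, limit_gb)
    return heap, ram_gb - heap


if __name__ == "__main__":
    for ram in (16, 64, 128):
        print(ram, memory_split_gb(ram))
```

On a 64 GB machine this gives a heap bound of 30.5 GB and leaves 33.5 GB for file buffers. Beyond 64 GB the heap stops growing, so extra RAM only benefits the file cache.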
Elasticsearch Configuration
● 20 days of indexes kept open, based on available memory; the rest are closed and opened on demand.
● Field data - cache used while sorting and aggregating data.
● Circuit breaker - cancels requests that would need a large amount of memory, preventing OOM. Clear the cache via http://elasticsearch:9200/_cache/clear if field data gets very close to the memory limit.
● Shards >= number of nodes
● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)
And also...
Prevent split brain, and the data loss it causes, by setting the minimum number of master-eligible nodes to (n/2 + 1), where n is the number of master-eligible nodes.
Set a higher ulimit for the Elasticsearch process.
A daily cron job deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days.
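The quorum rule and the daily cron job's age thresholds can be sketched as follows. This only computes what should happen; it makes no calls to Elasticsearch or Curator. The function names are hypothetical, and the daily index naming `logstash-YYYY.MM.DD` (Logstash's default pattern) is an assumption, since the talk does not name the indices.

```python
from datetime import date, datetime


def minimum_master_nodes(master_eligible):
    """Quorum size from the slide: floor(n / 2) + 1."""
    return master_eligible // 2 + 1


def index_age_days(index_name, today, prefix="logstash-"):
    """Age in days of a daily index such as 'logstash-2015.10.01'."""
    day = datetime.strptime(index_name[len(prefix):], "%Y.%m.%d").date()
    return (today - day).days


def plan_action(age_days):
    """Map an index age to the cron job's action, oldest threshold first."""
    if age_days > 90:
        return "delete"
    if age_days > 20:
        return "close"
    if age_days > 2:
        return "forcemerge"
    return "keep"


if __name__ == "__main__":
    today = date(2015, 10, 10)
    for name in ("logstash-2015.10.09", "logstash-2015.10.01", "logstash-2015.06.01"):
        print(name, plan_action(index_age_days(name, today)))
    print(minimum_master_nodes(3))
```

Checking the thresholds from oldest to newest ensures each index gets exactly one action per run. A real job would also skip indices that have already been merged.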
Monitoring
Marvel - Official plugin from Elasticsearch
KOPF - Index management plugin
CAT APIs - REST APIs to view cluster information
Curator - Data management
Thanks
email: angad@viki.com
twitter: @angadsg
