Troubleshooting your Elasticsearch cluster like a Support Engineer
Imma Valls
Support Engineer, Elastic
@eyeveebee
http://eyeveebee.net
Cluster down!
Know Your Tools
How can we approach troubleshooting?
The Emergency Room model
Troubleshooting by Example: Red Cluster
TRIAGE
[Urgent Severity] Red cluster
Vital signs
➔ Cluster in red health
➔ No ingest into any indices
Symptoms
➔ Beats fail to ingest
➔ Cluster is responsive; search and the REST API still work
TRIAGE
[Urgent Severity] Red cluster
What happened?
➔ Out of the blue, no changes
Any attempts to fix it?
➔ No
Next steps
➔ Share a support diagnostics bundle, which captures the output of the relevant REST API calls
https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files
https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/elastic-rest.yml
https://github.com/elastic/support-diagnostics
> ./diagnostics.sh --host https://localhost -u elastic -p --port 9200 --ssl --type api --noVerify
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is the cluster red?
➔ REST API calls - CAT Indices API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/rest-apis.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-indices.html
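A minimal sketch of this step, assuming the CAT indices response has already been fetched as JSON (for example with `GET _cat/indices?format=json&h=health,status,index`). The sample rows and index names are made up for illustration; the cluster can also do the filtering itself with `GET _cat/indices?v&health=red`.

```python
import json

# Hypothetical response body; index names are illustrative only.
SAMPLE_INDICES = json.loads("""
[
  {"health": "green",  "status": "open", "index": "logs-000001"},
  {"health": "red",    "status": "open", "index": "logs-000002"},
  {"health": "yellow", "status": "open", "index": "metrics-000001"}
]
""")


def red_indices(rows):
    """Return the names of the indices whose health is red."""
    return [row["index"] for row in rows if row.get("health") == "red"]


print(red_indices(SAMPLE_INDICES))
```

Any index returned here is the next thing to inspect at shard level.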
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is an index red?
➔ Check shards that are not started: INITIALIZING or UNASSIGNED
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-shards.html
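The check above could be sketched as follows, assuming rows fetched with `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`. The sample rows are hypothetical. An index is red when at least one of its primary shards is not assigned, so the second helper keeps only primaries.

```python
# Hypothetical CAT shards rows; values are illustrative only.
SAMPLE_SHARDS = [
    {"index": "logs-000002", "shard": "0", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "logs-000002", "shard": "0", "prirep": "r",
     "state": "UNASSIGNED", "unassigned.reason": "ALLOCATION_FAILED"},
    {"index": "logs-000001", "shard": "0", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
]

NOT_STARTED = {"INITIALIZING", "UNASSIGNED"}


def shards_not_started(rows):
    """Keep only shards that are INITIALIZING or UNASSIGNED."""
    return [row for row in rows if row["state"] in NOT_STARTED]


def primaries_not_started(rows):
    """Return (index, shard number) for every primary that is not started."""
    return [
        (row["index"], int(row["shard"]))
        for row in shards_not_started(rows)
        if row["prirep"] == "p"
    ]


print(primaries_not_started(SAMPLE_SHARDS))
```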
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is a shard UNASSIGNED?
➔ Cluster allocation explain API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-allocation-explain.html
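As a sketch, the allocation explain request for one specific shard takes an `index`, a `shard` number, and a `primary` flag in its body. The helper below only builds that body; the index name is hypothetical and nothing is sent to a cluster.

```python
import json


def allocation_explain_body(index, shard, primary):
    """Build the request body for GET _cluster/allocation/explain."""
    return {"index": index, "shard": shard, "primary": primary}


# Hypothetical target: primary shard 0 of an illustrative index.
explain_body = allocation_explain_body("logs-000002", 0, True)
print("GET _cluster/allocation/explain")
print(json.dumps(explain_body, indent=2))
```

The response explains, per node, why the shard can or cannot be allocated there.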
DIAGNOSTIC
[Urgent Severity] Red cluster
Have we used all the cluster storage?
➔ Use CAT Allocation API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-allocation.html
DIAGNOSTIC
[Urgent Severity] Red cluster
Interpret data
➔ Cluster reached its flood stage disk watermark
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/modules-cluster.html#disk-based-shard-allocation
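To illustrate how CAT allocation output maps to watermarks, here is a small sketch that assumes the default thresholds (low 85%, high 90%, flood stage 95%) and rows fetched with `GET _cat/allocation?format=json`. Node names and percentages are made up, and a cluster with customized watermarks would need different constants.

```python
# Default disk watermarks, in percent of disk used (assumed, not read from a cluster).
LOW, HIGH, FLOOD_STAGE = 85, 90, 95

# Hypothetical CAT allocation rows; the UNASSIGNED row carries no disk data.
SAMPLE_ALLOCATION = [
    {"node": "node-1", "disk.percent": "96"},
    {"node": "node-2", "disk.percent": "91"},
    {"node": "node-3", "disk.percent": "60"},
    {"node": "UNASSIGNED", "disk.percent": None},
]


def watermark_stage(disk_percent):
    """Name the highest watermark that the given disk usage exceeds."""
    if disk_percent > FLOOD_STAGE:
        return "flood_stage"
    if disk_percent > HIGH:
        return "high"
    if disk_percent > LOW:
        return "low"
    return "ok"


def stages_by_node(rows):
    """Map each node with disk data to its watermark stage."""
    return {
        row["node"]: watermark_stage(int(row["disk.percent"]))
        for row in rows
        if row["disk.percent"] is not None
    }


print(stages_by_node(SAMPLE_ALLOCATION))
```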
DIAGNOSTIC
[Urgent Severity] Red cluster
Interpret data
➔ Existing indices are blocked for write
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-get-settings.html
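The write block applied at flood stage is the index setting `index.blocks.read_only_allow_delete`. A sketch of spotting it, assuming index settings were fetched with `flat_settings=true` (for example `GET _all/_settings?flat_settings=true`); the sample response and index names are hypothetical.

```python
BLOCK_SETTING = "index.blocks.read_only_allow_delete"

# Hypothetical, abridged index settings response with flat setting keys.
SAMPLE_SETTINGS = {
    "logs-000001": {"settings": {BLOCK_SETTING: "true"}},
    "logs-000002": {"settings": {"index.number_of_shards": "1"}},
}


def write_blocked_indices(settings_response):
    """Return the sorted names of indices carrying the flood-stage write block."""
    return sorted(
        name
        for name, body in settings_response.items()
        if body.get("settings", {}).get(BLOCK_SETTING) == "true"
    )


print(write_blocked_indices(SAMPLE_SETTINGS))
```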
TREATMENT
[Urgent Severity] Red cluster
Fixing the root cause
➔ Delete indices to increase available storage
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-delete-index.html
Do we have snapshots? We can restore later.
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-restore.html
➔ Add nodes or increase storage capacity (easier on cloud)
TREATMENT
[Urgent Severity] Red cluster
Temporary Hotfix
➔ Alter the cluster settings to temporarily allow a higher disk usage
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-update-settings.html
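One way to sketch the hotfix is as two bodies for `PUT _cluster/settings`: one that raises the three disk watermarks and one that resets them to their defaults by sending `null`. The percentages below are illustrative assumptions, not recommendations, and the helpers only build the bodies.

```python
import json

WATERMARK_PREFIX = "cluster.routing.allocation.disk.watermark."


def raise_watermarks(low="90%", high="95%", flood_stage="97%"):
    """Body for PUT _cluster/settings that temporarily raises the watermarks."""
    return {
        "persistent": {
            WATERMARK_PREFIX + "low": low,
            WATERMARK_PREFIX + "high": high,
            WATERMARK_PREFIX + "flood_stage": flood_stage,
        }
    }


def reset_watermarks():
    """Body that restores the defaults; None is serialized as JSON null."""
    return {
        "persistent": {
            WATERMARK_PREFIX + name: None
            for name in ("low", "high", "flood_stage")
        }
    }


print(json.dumps(raise_watermarks(), indent=2))
print(json.dumps(reset_watermarks(), indent=2))
```

Sending the reset body once disk space has been freed keeps the hotfix from becoming permanent.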
TREATMENT
[Urgent Severity] Red cluster
Remove write block on the indices
➔ Once we have enough disk, remove the index block if needed
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-update-settings.html
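A sketch of the manual removal: setting `index.blocks.read_only_allow_delete` to `null` through the update index settings API clears the block. The helper returns the method, path, and body rather than sending anything; targeting `_all` is an assumption made for the example. Since Elasticsearch 7.4 the block is released automatically once disk usage falls below the high watermark, which is why this step is only needed in some cases.

```python
import json


def remove_write_block_request(target="_all"):
    """Return (method, path, body) that clears the flood-stage write block."""
    body = {"index.blocks.read_only_allow_delete": None}
    return "PUT", f"/{target}/_settings", body


method, path, body = remove_write_block_request()
print(method, path)
print(json.dumps(body))
```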
TREATMENT
[Urgent Severity] Red cluster
Bonus track
➔ If shards are corrupted and there are no snapshots, we can force allocation, accepting potential data loss
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-reroute.html#cluster-reroute-api-request-body
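For reference, a sketch of the body for `POST _cluster/reroute` using the `allocate_stale_primary` or `allocate_empty_primary` command. Both require `accept_data_loss` to be set explicitly. The index and node names are hypothetical, and the helper only builds the body.

```python
import json


def force_allocate_primary(index, shard, node, empty=False):
    """Build a reroute body that forces a primary onto the given node.

    empty=False reuses a stale shard copy found on the node;
    empty=True creates an empty primary, discarding the shard's data.
    """
    command = "allocate_empty_primary" if empty else "allocate_stale_primary"
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Hypothetical index and node names.
reroute_body = force_allocate_primary("logs-000002", 0, "node-1")
print(json.dumps(reroute_body, indent=2))
```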
DISCHARGE
[Urgent Severity] Red cluster
Takeaways
➔ Proactively monitor disk usage on each node / Alerts
Aim for about 75% used storage to stay on the safe side (< 85%)
➔ Plan for data retention / deletion with ILM or Data Tiers
Index Lifecycle Management (ILM) can help automate this
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/data-tiers.html
➔ Snapshot / Snapshot Lifecycle Management (SLM) for backups
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-lifecycle-management.html
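The retention takeaway could be sketched as an ILM policy body for `PUT _ilm/policy/<name>` with a hot phase that rolls over and a delete phase that removes old indices. The ages and sizes are illustrative assumptions; real values depend on ingest volume and available disk.

```python
import json


def retention_policy(rollover_max_age="7d", rollover_max_size="50gb",
                     delete_after="30d"):
    """Build an ILM policy: roll over in hot, delete after a minimum age."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


ilm_policy = retention_policy()
print(json.dumps(ilm_policy, indent=2))
```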
SUMMARY
[Urgent Severity] Red cluster
Triage
➔ Cluster health is red
➔ Stopped ingesting
➔ Search works
Diagnostic
➔ Support diagnostics
➔ CAT APIs
➔ Allocation Explain
➔ Cluster and index settings
➔ Diagnosis: reached flood stage disk watermark
Treatment
➔ Delete indices
➔ Add data node/s
➔ Update index settings / allow write
Discharge
➔ Proactively monitor disk usage (alerts)
➔ Snapshots
➔ Index Lifecycle Management deletes old data and manages replicas
➔ Data Tiers with Cold or Frozen tiers
Troubleshooting by Example: Unbalanced CPU usage
TRIAGE
[High Severity] Unbalanced CPU Usage
Vital signs
➔ Green cluster
➔ Monitoring alerts on high CPU usage
Symptoms
➔ Unbalanced CPU usage between nodes
➔ CPU pressure switches to different nodes over time
➔ Some ingest delays
TRIAGE
[High Severity] Unbalanced CPU usage
What happened?
➔ Was there any benchmarking done before going into production?
➔ Any changes in data ingest volumes?
Any attempts to fix it?
➔ Tried adding nodes, did not help
Next steps
➔ Export monitoring data and share a support diagnostics
https://github.com/elastic/support-diagnostics#extracting-time-series-diagnostics-from-monitoring
https://github.com/elastic/support-diagnostics
DIAGNOSTIC
[High Severity] Unbalanced CPU Usage
Why high CPU usage?
➔ Monitoring
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/monitoring-production.html
DIAGNOSTIC
[High Severity] Unbalanced CPU Usage
Why high CPU usage?
➔ REST API calls - CAT Shards and Get Index settings API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-shards.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-get-settings.html
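A sketch of what to look for in the CAT shards output: how the shards of the actively written index are spread across nodes. The rows below are hypothetical and assume the `logs-201998` index from this example has 1 primary and 1 replica on a 3-node cluster, so one node receives no indexing work for it.

```python
from collections import Counter

# Hypothetical rows from GET _cat/shards?format=json&h=index,shard,prirep,state,node
SAMPLE_CAT_SHARDS = [
    {"index": "logs-201998", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "node-1"},
    {"index": "logs-201998", "shard": "0", "prirep": "r",
     "state": "STARTED", "node": "node-2"},
    {"index": "logs-201997", "shard": "0", "prirep": "p",
     "state": "STARTED", "node": "node-3"},
]


def shards_per_node(rows, index):
    """Count the started shards of one index on each node."""
    return Counter(
        row["node"]
        for row in rows
        if row["index"] == index and row["state"] == "STARTED"
    )


def nodes_without_shards(rows, index, all_nodes):
    """Return the nodes that hold no shard of the given index."""
    holders = shards_per_node(rows, index)
    return sorted(node for node in all_nodes if node not in holders)


print(nodes_without_shards(SAMPLE_CAT_SHARDS, "logs-201998",
                           ["node-1", "node-2", "node-3"]))
```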
DIAGNOSTIC
[High Severity] Unbalanced CPU Usage
Why high CPU usage?
➔ Hot threads API to confirm that the CPU is spent on write threads
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-nodes-hot-threads.html
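A rough sketch of this confirmation step: build the hot threads request path, then count how many of the reported busy threads belong to the `write` thread pool. The sample output is hypothetical and heavily abridged (real output also includes stack traces), and the node name is made up.

```python
from urllib.parse import urlencode


def hot_threads_path(node_id=None, threads=3, interval="500ms",
                     thread_type="cpu"):
    """Build the path and query string for the nodes hot threads API."""
    base = f"/_nodes/{node_id}/hot_threads" if node_id else "/_nodes/hot_threads"
    query = urlencode({"threads": threads, "interval": interval,
                       "type": thread_type})
    return f"{base}?{query}"


# Hypothetical, abridged hot threads output.
SAMPLE_HOT_THREADS = """\
::: {node-1}
   95.3% (476.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][write][T#2]'
   80.1% (400.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][write][T#1]'
    3.2% (16ms out of 500ms) cpu usage by thread 'elasticsearch[node-1][management][T#1]'
"""


def count_pool_threads(text, pool):
    """Count reported hot threads that belong to the named thread pool."""
    return sum(
        1
        for line in text.splitlines()
        if "cpu usage by thread" in line and f"[{pool}]" in line
    )


print(hot_threads_path("node-1"))
print(count_pool_threads(SAMPLE_HOT_THREADS, "write"))
```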
TREATMENT
[High Severity] Unbalanced CPU Usage
Fixing the root cause
➔ Increase the number of primary shards
In the example: 3 nodes → 3 primary shards for the hot index logs-201998
➔ How? Change the index template and roll over the index if using ILM
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/index-templates.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-rollover-index.html
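The change could be sketched as a composable index template body (for `PUT _index_template/<name>`) that sets 3 primary shards, followed by a rollover so that a new index picks up the template. The `logs-*` pattern and the `logs` alias are assumptions for illustration; the number of shards on an existing index cannot be changed in place, which is why the rollover is needed.

```python
import json


def template_with_primaries(pattern, primaries, replicas=1):
    """Build a composable index template body with the given shard counts."""
    return {
        "index_patterns": [pattern],
        "template": {
            "settings": {
                "index.number_of_shards": primaries,
                "index.number_of_replicas": replicas,
            }
        },
    }


def rollover_path(alias):
    """Path for POST <alias>/_rollover, which creates the next index."""
    return f"/{alias}/_rollover"


# Hypothetical pattern and alias; 3 primaries to match the 3 nodes.
template_body = template_with_primaries("logs-*", 3)
print(json.dumps(template_body, indent=2))
print("POST", rollover_path("logs"))
```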
TREATMENT
[High Severity] Unbalanced CPU usage
Bonus track
➔ Use index.routing.allocation.total_shards_per_node with caution
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/allocation-total-shards.html
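One possible way to pick a value, shown as a sketch: divide the index's total shard count (primaries plus replicas) by the number of data nodes and round up. This formula is an assumption for illustration, not an official recommendation. The setting is a hard limit, so a value this tight can leave shards unassigned when a node is lost, which is the reason for the caution.

```python
import json
import math


def total_shards_per_node(primaries, replicas, data_nodes):
    """Smallest per-node limit that still lets every shard be allocated."""
    total_shards = primaries * (1 + replicas)
    return math.ceil(total_shards / data_nodes)


def settings_body(limit):
    """Body for PUT <index>/_settings applying the per-node limit."""
    return {"index.routing.allocation.total_shards_per_node": limit}


# 3 primaries with 1 replica each on 3 data nodes: 6 shards, 2 per node.
limit = total_shards_per_node(primaries=3, replicas=1, data_nodes=3)
print(json.dumps(settings_body(limit)))
```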
DISCHARGE
[High Severity] Unbalanced CPU Usage
Takeaways
➔ Proactively monitor CPU usage on each node / Alerts
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/monitoring-production.html
➔ Benchmark with tools like ES Rally
https://esrally.readthedocs.io/en/stable/
https://benchmarks.elastic.co/index.html
https://esrally.readthedocs.io/en/stable/adding_tracks.html
SUMMARY
[High Severity] Unbalanced CPU Usage
Triage
➔ Cluster health is green
➔ Unbalanced high CPU usage, switching nodes over time
Diagnostic
➔ Monitoring
➔ CAT shards API
➔ Index settings
➔ Diagnosis: ingest is hot on an index with 1 primary shard
Treatment
➔ Create index settings
➔ Index templates to add additional primary shards
Discharge
➔ Proactively monitor CPU usage (alerts)
➔ Benchmark (ES Rally) before going into production or whenever data volumes change
Wrapping up
Triage incidents
➔ How critical is it?
➔ Do we need urgent care, or is there a workaround to stabilize?
Have your tools ready
➔ REST APIs / Support diagnostics
➔ Monitoring & Alerts
➔ Log analysis: use Kibana!
➔ Search Elastic Discuss, Stack Overflow, Elastic GitHub repos, etc.
Lessons learned
➔ Follow best practices
➔ Prevent future incidents: proactively investigate unexpected logs, etc.
Thank You

