Community Conference 2021
Troubleshooting your Elasticsearch cluster like a Support Engineer
Janko Strassburg, Imma Valls
Sr. Support Engineers, Elastic
@jankopueh, @eyeveebee
Cluster down!
https://safetyposter.com/products/simpsons-safety-poster-medical-emergency-know-what-to-do
How can we approach troubleshooting?
The hospital Emergency Room model
Most Common Issues?
Troubleshooting by Example
TRIAGE
[Urgent Severity] Red cluster
Vital signs
➔ Cluster in red health
➔ No ingest into any indices
Symptoms
➔ Beats fail to ingest
➔ Cluster is responsive, search and REST API still work
TRIAGE
[Urgent Severity] Red cluster
What happened?
➔ Out of the blue, no changes
Any attempts to fix it?
➔ No
Next steps
➔ Share a support diagnostics bundle, which collects the output of REST API calls
https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files
https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/elastic-rest.yml
https://github.com/elastic/support-diagnostics
> ./diagnostics.sh --host https://localhost -u elastic -p --port 9200 --ssl --type api --noVerify
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is the cluster red?
➔ REST API calls - CAT Indices API
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/rest-apis.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-indices.html
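A minimal sketch of this step, assuming the CAT Indices output was requested as JSON (`GET _cat/indices?format=json&h=index,health,status`) and already loaded into Python. The index names in the sample are made up for illustration, and fetching the response from the cluster is left out.

```python
import json

# Hypothetical sample of `GET _cat/indices?format=json&h=index,health,status`.
SAMPLE_INDICES = json.loads("""
[
  {"index": "filebeat-2021.02.01", "health": "green", "status": "open"},
  {"index": "filebeat-2021.02.02", "health": "red", "status": "open"},
  {"index": "metricbeat-2021.02.02", "health": "yellow", "status": "open"}
]
""")


def red_indices(rows):
    """Return the names of indices whose health is red, sorted by name."""
    return sorted(row["index"] for row in rows if row.get("health") == "red")


if __name__ == "__main__":
    print(red_indices(SAMPLE_INDICES))
```

The same filter can be applied on the server side with the `health=red` query parameter; the client-side version is shown only because it also works on a saved diagnostics file.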
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is an index red?
➔ Check shards that are not started: INITIALIZING or UNASSIGNED
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-shards.html
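One way to sketch this check, assuming the CAT Shards output was requested as JSON with `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`. The rows below are invented sample data, not output from a real cluster.

```python
# Hypothetical rows from
# `GET _cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason`.
SAMPLE_SHARDS = [
    {"index": "filebeat-2021.02.02", "shard": "0", "prirep": "p",
     "state": "UNASSIGNED", "unassigned.reason": "INDEX_CREATED"},
    {"index": "filebeat-2021.02.01", "shard": "0", "prirep": "p",
     "state": "STARTED", "unassigned.reason": None},
    {"index": "filebeat-2021.02.01", "shard": "0", "prirep": "r",
     "state": "INITIALIZING", "unassigned.reason": None},
]

NOT_STARTED = {"INITIALIZING", "UNASSIGNED"}


def shards_not_started(rows):
    """Return (index, shard, prirep, state, reason) for shards that are not started."""
    return [
        (r["index"], r["shard"], r["prirep"], r["state"], r.get("unassigned.reason"))
        for r in rows
        if r["state"] in NOT_STARTED
    ]


if __name__ == "__main__":
    for shard in shards_not_started(SAMPLE_SHARDS):
        print(shard)
```

An unassigned primary (`prirep` of `p`) is what turns an index red; unassigned replicas alone leave it yellow.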
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is a shard UNASSIGNED?
➔ Cluster allocation explain API
https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-allocation-explain.html
DIAGNOSTIC
[Urgent Severity] Red cluster
Why is a shard UNASSIGNED?
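A rough sketch of working with the allocation explain API: one helper builds the request body for `GET _cluster/allocation/explain`, another pulls out the deciders that answered NO. The sample response is heavily trimmed and its node name and explanation text are invented; only the field names follow the documented response shape.

```python
def explain_request(index, shard, primary):
    """Body for `GET _cluster/allocation/explain` targeting one specific shard."""
    return {"index": index, "shard": shard, "primary": primary}


def blocking_deciders(explanation):
    """Return (node_name, decider, explanation) for every decider that said NO."""
    blockers = []
    for node in explanation.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                blockers.append(
                    (node.get("node_name"), decider.get("decider"), decider.get("explanation"))
                )
    return blockers


# Hypothetical, trimmed response for illustration only.
SAMPLE_EXPLANATION = {
    "index": "filebeat-2021.02.02",
    "shard": 0,
    "primary": True,
    "current_state": "unassigned",
    "can_allocate": "no",
    "node_allocation_decisions": [
        {
            "node_name": "node-1",
            "node_decision": "no",
            "deciders": [
                {"decider": "disk_threshold", "decision": "NO",
                 "explanation": "the node is above the low watermark"},
            ],
        }
    ],
}

if __name__ == "__main__":
    print(explain_request("filebeat-2021.02.02", 0, True))
    print(blocking_deciders(SAMPLE_EXPLANATION))
```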
DIAGNOSTIC
[Urgent Severity] Red cluster
Have we used all the cluster storage?
➔ Use CAT Allocation API
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-allocation.html
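As an illustration, the CAT Allocation output (assumed here to be fetched with `GET _cat/allocation?format=json`) can be sorted to show the fullest nodes first. Node names and numbers in the sample are made up.

```python
# Hypothetical rows from `GET _cat/allocation?format=json` (columns trimmed).
SAMPLE_ALLOCATION = [
    {"node": "node-2", "shards": "110", "disk.percent": "91"},
    {"node": "node-1", "shards": "120", "disk.percent": "96"},
    # Unassigned shards are reported on a row without disk figures.
    {"node": "UNASSIGNED", "shards": "12", "disk.percent": None},
]


def disk_usage_by_node(rows):
    """Return (node, used_percent) pairs, fullest node first.

    Rows without a disk percentage (such as the UNASSIGNED row) are skipped.
    """
    usage = []
    for row in rows:
        percent = row.get("disk.percent")
        if percent is None:
            continue
        usage.append((row["node"], int(percent)))
    return sorted(usage, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for node, percent in disk_usage_by_node(SAMPLE_ALLOCATION):
        print(f"{node}: {percent}% used")
```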
DIAGNOSTIC
[Urgent Severity] Red cluster
Interpret data
➔ Cluster reached its disk high watermark
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/modules-cluster.html#disk-based-shard-allocation
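To make the watermark levels concrete, here is a small sketch that classifies a node's disk usage against the default thresholds of 85% (low), 90% (high), and 95% (flood stage). The function name is hypothetical, and treating a value exactly at the threshold as exceeded is a simplification made for this example.

```python
def watermark_stage(disk_percent, low=85.0, high=90.0, flood_stage=95.0):
    """Classify used-disk percentage against the disk watermarks.

    Defaults mirror the Elasticsearch defaults; pass other values if the
    cluster settings were changed.
    """
    if disk_percent >= flood_stage:
        # Indices with a shard on this node get a read-only-allow-delete block.
        return "flood_stage"
    if disk_percent >= high:
        # Elasticsearch tries to relocate shards away from this node.
        return "high"
    if disk_percent >= low:
        # No further shards are allocated to this node.
        return "low"
    return "ok"


if __name__ == "__main__":
    for percent in (70, 86, 91, 96):
        print(percent, watermark_stage(percent))
```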
DIAGNOSTIC
[Urgent Severity] Red cluster
Interpret data
➔ Existing indices are blocked for write
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cluster-get-settings.html
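The block applied at flood stage is the index setting `index.blocks.read_only_allow_delete`. A sketch for spotting affected indices, assuming the index settings were fetched in flat form (for example `GET _all/_settings/index.blocks.*?flat_settings=true`); the sample response and index names are invented.

```python
BLOCK_SETTING = "index.blocks.read_only_allow_delete"

# Hypothetical, trimmed response for illustration only.
SAMPLE_SETTINGS = {
    "filebeat-2021.02.01": {"settings": {BLOCK_SETTING: "true"}},
    "filebeat-2021.02.02": {"settings": {BLOCK_SETTING: "true"}},
    "metricbeat-2021.02.02": {"settings": {}},
}


def write_blocked_indices(settings_response):
    """Return the sorted names of indices that carry the flood-stage write block."""
    blocked = []
    for name, body in settings_response.items():
        value = body.get("settings", {}).get(BLOCK_SETTING, "false")
        if str(value).lower() == "true":
            blocked.append(name)
    return sorted(blocked)


if __name__ == "__main__":
    print(write_blocked_indices(SAMPLE_SETTINGS))
```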
TREATMENT
[Urgent Severity] Red cluster
Fixing the root cause
➔ Delete indices to increase available storage
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/indices-delete-index.html
Do we have snapshots? We can restore later.
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/snapshot-restore.html
➔ Add nodes or increase storage capacity (easier on cloud)
TREATMENT
[Urgent Severity] Red cluster
Temporary Hotfix
➔ Alter the cluster settings to temporarily allow a higher disk usage
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cluster-update-settings.html
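A sketch of what such a temporary override could look like as a body for `PUT _cluster/settings`. The percentages are example values chosen for illustration, not a recommendation, and the helper names are hypothetical. Setting a value to `null` afterwards restores the default.

```python
import json

PREFIX = "cluster.routing.allocation.disk.watermark."


def watermark_override(low, high, flood_stage):
    """Body for `PUT _cluster/settings` that sets all three watermarks transiently."""
    return {
        "transient": {
            PREFIX + "low": low,
            PREFIX + "high": high,
            PREFIX + "flood_stage": flood_stage,
        }
    }


def watermark_reset():
    """Body that removes the override again (JSON null resets a setting)."""
    return watermark_override(None, None, None)


if __name__ == "__main__":
    # Example values only; keep low <= high <= flood_stage.
    print(json.dumps(watermark_override("90%", "95%", "97%"), indent=2))
    print(json.dumps(watermark_reset(), indent=2))
```

Using `transient` rather than `persistent` fits the "temporary" intent here, since transient settings do not survive a full cluster restart.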
TREATMENT
[Urgent Severity] Red cluster
Remove write block on the indices
➔ Once we have enough disk space, remove the index block if needed
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/indices-update-settings.html
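For illustration, removing the block amounts to setting `index.blocks.read_only_allow_delete` to `null` through the update index settings API. The helper below only builds the request; its name and the index pattern are assumptions made for this sketch.

```python
import json


def remove_write_block_request(index_pattern):
    """Return (method, path, body) that clears the flood-stage write block."""
    body = {"index.blocks.read_only_allow_delete": None}
    return ("PUT", f"/{index_pattern}/_settings", body)


if __name__ == "__main__":
    method, path, body = remove_write_block_request("filebeat-*")
    print(method, path)
    print(json.dumps(body))
```

Recent 7.x releases release this block automatically once disk usage drops back below the high watermark, which is why the slide says "if needed".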
TREATMENT
[Urgent Severity] Red cluster
Bonus track
➔ If shards are corrupted and there are no snapshots, we can force allocation, accepting potential data loss
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cluster-reroute.html#cluster-reroute-api-request-body
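A hedged sketch of the request body for `POST _cluster/reroute` using the `allocate_stale_primary` or `allocate_empty_primary` commands. Both require `accept_data_loss` to be set explicitly. The helper name, index, and node name are hypothetical.

```python
import json

FORCE_COMMANDS = {"allocate_stale_primary", "allocate_empty_primary"}


def force_allocate_body(command, index, shard, node):
    """Body for `POST _cluster/reroute` that force-allocates one primary shard.

    allocate_stale_primary reuses an out-of-date shard copy found on the node;
    allocate_empty_primary creates an empty shard and discards its previous data.
    """
    if command not in FORCE_COMMANDS:
        raise ValueError(f"unsupported command: {command}")
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


if __name__ == "__main__":
    body = force_allocate_body("allocate_stale_primary", "filebeat-2021.02.02", 0, "node-1")
    print(json.dumps(body, indent=2))
```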
DISCHARGE
[Urgent Severity] Red cluster
Takeaways
➔ Proactively monitor disk usage on each node / Alerts
Aim for 75% used storage to be on the safe side (< 85%)
➔ Plan for data retention / deletion with ILM or Data Tiers
Index Lifecycle Management (ILM) can help automate this
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/data-tiers.html
➔ Snapshot / Snapshot Lifecycle Management (SLM) for backups
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/snapshot-lifecycle-management.html
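As a sketch of the retention idea, an ILM policy can roll indices over in the hot phase and delete them after a fixed age. The body below would be sent with `PUT _ilm/policy/<policy-name>`; the ages and sizes are placeholder values and the helper name is hypothetical.

```python
import json


def retention_policy(rollover_max_age, rollover_max_size, delete_after):
    """Body for `PUT _ilm/policy/<name>`: roll over in hot, delete after `delete_after`."""
    return {
        "policy": {
            "phases": {
                "hot": {
                    "actions": {
                        "rollover": {
                            "max_age": rollover_max_age,
                            "max_size": rollover_max_size,
                        }
                    }
                },
                "delete": {
                    # min_age is counted from rollover for rolled-over indices.
                    "min_age": delete_after,
                    "actions": {"delete": {}},
                },
            }
        }
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(json.dumps(retention_policy("7d", "50gb", "30d"), indent=2))
```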
SUMMARY
[Urgent Severity] Red cluster
Triage
➔ Cluster health is red
➔ Stopped ingesting
➔ Search works
Diagnostic
➔ Support diagnostics
➔ CAT APIs
➔ Allocation Explain
➔ Cluster and index settings
➔ Reached high disk watermark
Treatment
➔ Delete indices
➔ Add data node/s
➔ Update index settings / allow write
Discharge
➔ Proactively monitor disk usage (alerts)
➔ Snapshots
➔ Index Lifecycle Management deletes old data and manages replicas
➔ Data Tiers with Cold Tier
More Tools & Resources
Monitoring
➔ Monitoring in production - dedicated cluster
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/monitoring-production.html
Monitoring
➔ Nodes’ memory usage
Monitoring
➔ Ingest and Search queues and rejections
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-thread-pool.html
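A small sketch for spotting rejections, assuming the thread pool data was fetched with `GET _cat/thread_pool/write,search?format=json&h=node_name,name,active,queue,rejected`. The rows are invented sample data.

```python
# Hypothetical rows from
# `GET _cat/thread_pool/write,search?format=json&h=node_name,name,active,queue,rejected`.
SAMPLE_THREAD_POOLS = [
    {"node_name": "node-1", "name": "write", "active": "8", "queue": "200", "rejected": "1523"},
    {"node_name": "node-1", "name": "search", "active": "2", "queue": "0", "rejected": "0"},
    {"node_name": "node-2", "name": "write", "active": "1", "queue": "0", "rejected": "0"},
]


def pools_with_rejections(rows):
    """Return (node, pool, rejected) for every thread pool that has rejected tasks."""
    return [
        (row["node_name"], row["name"], int(row["rejected"]))
        for row in rows
        if int(row["rejected"]) > 0
    ]


if __name__ == "__main__":
    for node, pool, rejected in pools_with_rejections(SAMPLE_THREAD_POOLS):
        print(f"{node} {pool}: {rejected} rejected")
```

The `rejected` counter is cumulative since node start, so comparing two samples taken some time apart shows whether rejections are still happening.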
Monitoring
➔ Example: High CPU usage
Monitoring
➔ Example: High CPU usage
And CAT APIs again!
➔ Example: High CPU usage
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cat-shards.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/cluster-nodes-hot-threads.html
Log Analysis
➔ Elasticsearch Logging
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/logging.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.11/configuring-filebeat.html
Sizing - how many shards per node and what size
➔ https://www.elastic.co/guide/en/elasticsearch/reference/7.11/size-your-shards.html
➔ https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
➔ https://www.elastic.co/guide/en/cloud/current/ec-reference-hardware.html
➔ https://benchmarks.elastic.co/
➔ https://esrally.readthedocs.io/
Storage
➔ https://www.elastic.co/blog/how-to-design-your-elasticsearch-data-storage-architecture-for-scale
➔ https://www.elastic.co/guide/en/elasticsearch/reference/7.11/tune-for-disk-usage.html
JVM Heap - do not go over ~30 GB of heap
➔ https://www.elastic.co/blog/a-heap-of-trouble
Hot/Warm/Cold architectures for time series data
➔ https://www.elastic.co/blog/optimizing-costs-elastic-cloud-hot-warm-index-lifecycle-management
Common Resources Shared
Tuning for search - slow searches
➔ https://www.elastic.co/blog/advanced-tuning-finding-and-fixing-slow-elasticsearch-queries
➔ https://www.elastic.co/guide/en/elasticsearch/reference/7.11/tune-for-search-speed.html
Tuning for ingest - use bulk!
➔ https://www.elastic.co/guide/en/elasticsearch/reference/7.11/tune-for-indexing-speed.html
➔ https://www.elastic.co/guide/en/elasticsearch/reference/7.11/docs-bulk.html
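To illustrate "use bulk", here is a minimal sketch that builds the newline-delimited JSON body expected by `POST _bulk` (sent with `Content-Type: application/x-ndjson`). The helper name, index name, and documents are made up for the example.

```python
import json


def bulk_index_body(index, docs):
    """Build an NDJSON body for `POST _bulk` that indexes each doc into `index`.

    Every document is preceded by an action line, and the body must end
    with a newline.
    """
    if not docs:
        return ""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = bulk_index_body("logs-example", [{"message": "one"}, {"message": "two"}])
    print(body, end="")
```

Sending many documents per request this way avoids the per-request overhead of indexing them one at a time; the best batch size depends on document size and is worth measuring.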
Upgrading the Stack - be prepared, test, and take snapshots!
➔ https://www.elastic.co/webinars/expert-tips-for-upgrading-the-elk-stack
➔ https://www.elastic.co/guide/en/elastic-stack/7.11/upgrading-elastic-stack.html
Secure the Stack
➔ https://www.elastic.co/blog/configuring-ssl-tls-and-https-to-secure-elasticsearch-kibana-beats-and-logstash
Optimize Mappings
➔ https://www.elastic.co/blog/strings-are-dead-long-live-strings
Wrapping Up
Triage incidents
➔ How critical is it?
➔ Do we need urgent care or is there a workaround to stabilize?
Have tools ready
➔ REST APIs / Support diagnostics
➔ Monitoring & Alerts
➔ Log Analysis / Kibana Discover
➔ Search Elastic Discuss, Stack Overflow, Elastic GitHub repos, etc.
Lessons learned
➔ Follow best practices
➔ Prevent future incidents - proactively investigate unexpected logs, etc.
Q & A
Thank You
