Troubleshooting your Elasticsearch cluster like an Elastic Support Engineer

Troubleshooting your Elasticsearch
cluster like a Support Engineer
Imma Valls
Support Engineer, Elastic
@eyeveebee
http://eyeveebee.net

How can we approach troubleshooting?

Troubleshooting by
Example
Red Cluster

TRIAGE
[Urgent Severity] Red cluster
Vital signs
➔ Cluster in red health
➔ No ingest into any indices
Symptoms
➔ Beats fail to ingest
➔ Cluster is responsive, search and REST API still work

TRIAGE
What happened?
➔ Out of the blue, no changes
Any attempts to fix it?
➔ No
Next steps
➔ Share a support diagnostics bundle, which captures the output of the relevant REST API calls
https://www.elastic.co/blog/why-does-elastic-support-keep-asking-for-diagnostic-files
https://github.com/elastic/support-diagnostics/blob/main/src/main/resources/elastic-rest.yml
https://github.com/elastic/support-diagnostics
> ./diagnostics.sh --host localhost -u elastic -p --port 9200 --ssl --type api --noVerify

DIAGNOSTIC
Why is the cluster red?
➔ REST API calls - CAT Indices API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/rest-apis.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-indices.html
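A minimal sketch of this step, assuming the CAT indices output was fetched as JSON (`GET _cat/indices?format=json`). The sample rows and index names are made up for illustration and do not come from the talk.

```python
def red_indices(cat_indices_rows):
    """Return the names of indices whose health column is 'red'."""
    return [row["index"] for row in cat_indices_rows if row.get("health") == "red"]


# Hypothetical sample of `GET _cat/indices?format=json` output.
cat_indices_sample = [
    {"health": "green", "status": "open", "index": "logs-000001"},
    {"health": "red", "status": "open", "index": "logs-000002"},
    {"health": "yellow", "status": "open", "index": "metrics-000001"},
]

print(red_indices(cat_indices_sample))
```

The same filter can also be applied server side with the `health=red` query parameter of the CAT indices API.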

DIAGNOSTIC
Why is an index red?
➔ Check shards that are not started: INITIALIZING or UNASSIGNED
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-shards.html
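As a rough illustration, the check can be scripted against `GET _cat/shards?format=json` output. The rows below are hypothetical; only the column names (`index`, `shard`, `prirep`, `state`) are taken from the CAT shards API.

```python
NOT_STARTED_STATES = {"INITIALIZING", "UNASSIGNED"}


def shards_not_started(cat_shards_rows):
    """Return (index, shard, prirep, state) for shards that are not started."""
    return [
        (row["index"], row["shard"], row["prirep"], row["state"])
        for row in cat_shards_rows
        if row["state"] in NOT_STARTED_STATES
    ]


# Hypothetical sample rows; index and node names are illustrative only.
cat_shards_sample = [
    {"index": "logs-000002", "shard": "0", "prirep": "p", "state": "UNASSIGNED", "node": None},
    {"index": "logs-000002", "shard": "0", "prirep": "r", "state": "UNASSIGNED", "node": None},
    {"index": "logs-000001", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
]

for shard in shards_not_started(cat_shards_sample):
    print(shard)
```

An unassigned primary (`prirep` of `p`) is what turns an index red; unassigned replicas alone would only turn it yellow.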

DIAGNOSTIC
Why is a shard UNASSIGNED?
➔ Cluster allocation explain API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-allocation-explain.html
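A sketch of the request for one specific shard, assuming an unassigned primary shard 0 of a hypothetical index `logs-000002`. The helper only builds the request description; how it is sent (Kibana Dev Tools, curl, a client library) is left open.

```python
import json


def allocation_explain_request(index, shard, primary):
    """Describe a cluster allocation explain call for one specific shard."""
    return {
        "method": "GET",
        "path": "/_cluster/allocation/explain",
        "body": {"index": index, "shard": shard, "primary": primary},
    }


# "logs-000002" is a made-up index name used only for illustration.
explain_request = allocation_explain_request("logs-000002", 0, True)
print(explain_request["method"], explain_request["path"])
print(json.dumps(explain_request["body"], indent=2))
```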

DIAGNOSTIC
Have we used all the cluster storage?
➔ Use CAT Allocation API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-allocation.html
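One way to read the CAT allocation output, assuming it was fetched with `GET _cat/allocation?format=json`. The node names and percentages are invented for this example.

```python
def disk_percent_by_node(cat_allocation_rows):
    """Map node name to disk used percent, fullest node first."""
    usage = {}
    for row in cat_allocation_rows:
        percent = row.get("disk.percent")
        if percent is None:
            # The row that groups unassigned shards carries no disk figures.
            continue
        usage[row["node"]] = int(percent)
    return dict(sorted(usage.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample rows.
cat_allocation_sample = [
    {"node": "node-1", "shards": "40", "disk.percent": "96"},
    {"node": "node-2", "shards": "41", "disk.percent": "95"},
    {"node": "node-3", "shards": "39", "disk.percent": "97"},
    {"node": "UNASSIGNED", "shards": "2", "disk.percent": None},
]

print(disk_percent_by_node(cat_allocation_sample))
```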

DIAGNOSTIC
Interpret data
➔ Cluster reached its flood stage disk watermark
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/modules-cluster.html#disk-based-shard-allocation
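To interpret a disk percentage, it can be compared with the default disk watermarks (low 85%, high 90%, flood stage 95%). This sketch assumes the defaults are in effect and treats "at or above" as exceeded, which is a simplification of the exact boundary handling.

```python
# Default watermarks, highest first. A cluster may override these settings.
DEFAULT_WATERMARKS = (("flood_stage", 95), ("high", 90), ("low", 85))


def watermark_exceeded(disk_percent, watermarks=DEFAULT_WATERMARKS):
    """Return the highest watermark reached, or None if below all of them."""
    for name, threshold in watermarks:
        if disk_percent >= threshold:
            return name
    return None


for percent in (75, 86, 91, 96):
    print(percent, watermark_exceeded(percent))
```

Above the low watermark no new shards are allocated to the node, above the high watermark Elasticsearch tries to move shards away, and at flood stage a write block is put on every index with a shard on that node.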

DIAGNOSTIC
Interpret data
➔ Existing indices are blocked for write
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-get-settings.html
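The flood stage block shows up in the index settings as `index.blocks.read_only_allow_delete`. A small sketch that scans settings fetched with `GET _all/_settings?flat_settings=true`; the response below is a hypothetical, trimmed-down sample.

```python
def write_blocked_indices(flat_index_settings):
    """Return indices that carry the read_only_allow_delete block."""
    blocked = []
    for index, payload in flat_index_settings.items():
        settings = payload.get("settings", {})
        # Setting values come back as strings in the settings API response.
        if settings.get("index.blocks.read_only_allow_delete") == "true":
            blocked.append(index)
    return sorted(blocked)


# Hypothetical sample response.
index_settings_sample = {
    "logs-000001": {
        "settings": {
            "index.blocks.read_only_allow_delete": "true",
            "index.number_of_shards": "1",
        }
    },
    "logs-000002": {"settings": {"index.number_of_shards": "1"}},
}

print(write_blocked_indices(index_settings_sample))
```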

TREATMENT
Fixing the root cause
➔ Delete indices to increase available storage
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-delete-index.html
Do we have snapshots? If so, we can restore the deleted indices later.
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-restore.html
➔ Add nodes or increase storage capacity (easier on cloud)

TREATMENT
Temporary Hotfix
➔ Alter the cluster settings to temporarily allow a higher disk usage
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-update-settings.html
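A sketch of what such a hotfix body could look like. The percentages are illustrative values, not a recommendation from the talk, and the use of `transient` rather than `persistent` is an assumption made here because the change is meant to be temporary.

```python
import json

PREFIX = "cluster.routing.allocation.disk.watermark."


def temporary_watermark_settings(low, high, flood_stage):
    """Build a cluster settings body that raises the disk watermarks."""
    if not (low < high < flood_stage <= 100):
        raise ValueError("expected low < high < flood_stage <= 100")
    return {
        "transient": {
            PREFIX + "low": f"{low}%",
            PREFIX + "high": f"{high}%",
            PREFIX + "flood_stage": f"{flood_stage}%",
        }
    }


# Illustrative values only.
watermark_body = temporary_watermark_settings(90, 95, 97)
print("PUT /_cluster/settings")
print(json.dumps(watermark_body, indent=2))
```

Setting the same keys to `null` later restores the defaults once the root cause is fixed.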

TREATMENT
Remove write block on the indices
➔ Once we have enough disk, remove the index block if needed
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-update-settings.html
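Removing the block comes down to resetting `index.blocks.read_only_allow_delete` to `null` through the update index settings API. A minimal sketch, with `logs-*` as a hypothetical index pattern:

```python
import json


def remove_write_block_request(index_pattern):
    """Describe the settings update that clears the flood stage write block."""
    return {
        "method": "PUT",
        "path": f"/{index_pattern}/_settings",
        # None serializes to JSON null, which resets the setting.
        "body": {"index.blocks.read_only_allow_delete": None},
    }


# "logs-*" is an example pattern, not taken from the talk.
unblock_request = remove_write_block_request("logs-*")
print(unblock_request["method"], unblock_request["path"])
print(json.dumps(unblock_request["body"]))
```

Since 7.4 Elasticsearch releases this block on its own once disk usage drops below the high watermark, which is why the manual step is only needed in some cases.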

TREATMENT
Bonus track
➔ If shards are corrupted and there are no snapshots, we can force allocation, accepting potential data loss
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-reroute.html#cluster-reroute-api-request-body
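A hedged sketch of the reroute body for this last-resort step. The reroute API offers `allocate_stale_primary` (reuse an out-of-date copy) and `allocate_empty_primary` (start from an empty shard); the index and node names are hypothetical, and the helper deliberately refuses to build the body unless data loss is acknowledged.

```python
import json


def force_allocate_primary(index, shard, node, stale_copy, accept_data_loss=False):
    """Build a POST /_cluster/reroute body that forces a primary allocation."""
    if accept_data_loss is not True:
        raise ValueError("forcing a primary can lose data; pass accept_data_loss=True")
    command = "allocate_stale_primary" if stale_copy else "allocate_empty_primary"
    return {
        "commands": [
            {
                command: {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    "accept_data_loss": True,
                }
            }
        ]
    }


# Hypothetical index and node names.
reroute_body = force_allocate_primary(
    "logs-000002", 0, "node-1", stale_copy=True, accept_data_loss=True
)
print("POST /_cluster/reroute")
print(json.dumps(reroute_body, indent=2))
```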

DISCHARGE
Takeaways
➔ Proactively monitor disk usage on each node / Alerts
Aim for about 75% used storage to stay on the safe side (below the 85% low watermark)
➔ Plan for data retention / deletion with ILM or Data Tiers
Index Lifecycle Management (ILM) can help automate this
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/data-tiers.html
➔ Snapshot / Snapshot Lifecycle Management (SLM) for backups
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/snapshot-lifecycle-management.html
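To make the retention and backup takeaways concrete, here is a sketch of an ILM policy that rolls over and later deletes indices, and an SLM policy that takes a daily snapshot. Policy names, ages, the cron schedule, and the repository name are all assumptions chosen for illustration.

```python
import json


def ilm_delete_policy(rollover_max_age, delete_after):
    """Body for PUT /_ilm/policy/<name>: roll over, then delete old indices."""
    return {
        "policy": {
            "phases": {
                "hot": {"actions": {"rollover": {"max_age": rollover_max_age}}},
                "delete": {"min_age": delete_after, "actions": {"delete": {}}},
            }
        }
    }


def slm_daily_policy(repository, expire_after):
    """Body for PUT /_slm/policy/<id>: daily snapshot of all indices."""
    return {
        "schedule": "0 30 1 * * ?",  # every day at 01:30
        "name": "<daily-snap-{now/d}>",
        "repository": repository,
        "config": {"indices": ["*"]},
        "retention": {"expire_after": expire_after},
    }


# Illustrative values: 7 day rollover, 30 day retention, repository "my_repository".
ilm_body = ilm_delete_policy("7d", "30d")
slm_body = slm_daily_policy("my_repository", "30d")
print(json.dumps(ilm_body, indent=2))
print(json.dumps(slm_body, indent=2))
```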

SUMMARY
Triage
➔ Cluster health is red
➔ Stopped ingesting
➔ Search works
Diagnostic
Support diagnostics
➔ CAT APIs
➔ Allocation Explain
➔ Cluster and index settings
Reached flood stage disk watermark
Treatment
➔ Delete indices
➔ Add data node/s
➔ Update index settings / allow write
Discharge
➔ Proactively monitor disk usage (alerts)
➔ Snapshots
➔ Index Lifecycle Management deletes old data and manages replicas
➔ Data Tiers with Cold or Frozen Tiers

Troubleshooting by
Example
Unbalanced CPU usage

TRIAGE
[High Severity] Unbalanced CPU usage
Vital signs
➔ Green cluster
➔ Monitoring alerts: high CPU usage
Symptoms
➔ Unbalanced CPU usage between different nodes
➔ CPU pressure switches to different nodes over time
➔ Some ingest delays

TRIAGE
[High Severity] Unbalanced CPU usage
What happened?
➔ Was there any benchmarking done before going into production?
➔ Any changes in data ingest volumes?
Any attempts to fix it?
➔ Tried adding nodes, did not help
Next steps
➔ Export monitoring data and share a support diagnostics bundle
https://github.com/elastic/support-diagnostics#extracting-time-series-diagnostics-from-monitoring
https://github.com/elastic/support-diagnostics

DIAGNOSTIC
Why high CPU usage?
➔ Monitoring
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/monitoring-production.html

DIAGNOSTIC
Why high CPU usage?
➔ REST API calls - Cat Shards and Get Index settings API
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cat-shards.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-get-settings.html
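A sketch of how the CAT shards output can show where the hot index lives. It assumes `GET _cat/shards?format=json` rows and a three-node cluster with hypothetical node names; only the index name `logs-201998` comes from the example in this talk.

```python
from collections import Counter


def shard_copies_per_node(cat_shards_rows, index):
    """Count started shard copies (primaries and replicas) of one index per node."""
    return Counter(
        row["node"]
        for row in cat_shards_rows
        if row["index"] == index and row["state"] == "STARTED"
    )


# Hypothetical rows: one primary and one replica on a three-node cluster.
hot_index_rows = [
    {"index": "logs-201998", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-2"},
    {"index": "logs-201998", "shard": "0", "prirep": "r", "state": "STARTED", "node": "node-3"},
    {"index": "logs-201997", "shard": "0", "prirep": "p", "state": "STARTED", "node": "node-1"},
]

print(shard_copies_per_node(hot_index_rows, "logs-201998"))
```

With a single primary shard, indexing work for the hot index can only land on the nodes that hold a copy of it, so in this sample `node-1` takes no write load for `logs-201998`.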

DIAGNOSTIC
Why high CPU usage?
➔ Hot threads API to confirm CPU usage is on write
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/cluster-nodes-hot-threads.html
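A rough way to summarize `GET _nodes/hot_threads` output by thread pool. This assumes the 7.x plain-text line format, in which thread names look like `elasticsearch[<node>][<pool>][T#n]`; the excerpt below is invented, and threads named differently (merge threads, for instance) are not handled.

```python
import re
from collections import Counter

# Captures the thread pool name from a 7.x style hot threads line.
THREAD_LINE = re.compile(r"cpu usage by thread 'elasticsearch\[[^\]]+\]\[([^\]]+)\]")


def busy_threads_by_pool(hot_threads_text):
    """Count reported hot threads per thread pool."""
    return Counter(match.group(1) for match in THREAD_LINE.finditer(hot_threads_text))


# Hypothetical excerpt of hot threads output for one node.
hot_threads_sample = """
   90.1% (450.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-2][write][T#1]'
   85.3% (426.5ms out of 500ms) cpu usage by thread 'elasticsearch[node-2][write][T#2]'
    4.0% (20ms out of 500ms) cpu usage by thread 'elasticsearch[node-2][search][T#1]'
"""

print(busy_threads_by_pool(hot_threads_sample))
```

A result dominated by the `write` pool supports the idea that the CPU is being spent on indexing.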

TREATMENT
➔ Increase primary shards
In the example: 3 nodes → 3 primary shards for hot index logs-201998
➔ How? Change the index template and roll over the index if using ILM
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/index-templates.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index-lifecycle-management.html
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/indices-rollover-index.html
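A minimal sketch of the template change, assuming a composable index template. The template name `logs-template`, the pattern `logs-*`, and the rollover alias `logs` are hypothetical; a real template would also carry the other settings the indices need (such as the ILM policy name), because a PUT replaces the whole template.

```python
import json


def index_template_body(index_patterns, primary_shards):
    """Body for PUT /_index_template/<name> that sets the primary shard count."""
    if primary_shards < 1:
        raise ValueError("an index needs at least one primary shard")
    return {
        "index_patterns": index_patterns,
        "template": {"settings": {"index.number_of_shards": primary_shards}},
    }


# 3 data nodes in the example, so 3 primary shards for new hot indices.
template_body = index_template_body(["logs-*"], 3)
print("PUT /_index_template/logs-template")
print(json.dumps(template_body, indent=2))
# The template only affects new indices, so roll over to start using it.
print("POST /logs/_rollover")
```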

TREATMENT
[High Severity] Unbalanced CPU usage
Bonus track
➔ Use index.routing.allocation.total_shards_per_node with caution
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/allocation-total-shards.html
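One reason for the caution: if the limit is set lower than the shard copies can be spread, some shards stay unassigned. The sketch below computes the smallest value that still lets every copy be allocated, under the assumption that all data nodes are available; it is arithmetic for illustration, not a sizing rule from the talk.

```python
import json
import math


def min_total_shards_per_node(primaries, replicas, data_nodes):
    """Smallest per-node limit that still fits all shard copies of one index."""
    if data_nodes < 1:
        raise ValueError("need at least one data node")
    copies = primaries * (1 + replicas)
    return math.ceil(copies / data_nodes)


# Example from the slides: 3 primaries on 3 nodes, assuming 1 replica each.
limit = min_total_shards_per_node(primaries=3, replicas=1, data_nodes=3)
print(json.dumps({"index.routing.allocation.total_shards_per_node": limit}))
```

With the limit at this minimum there is no headroom: losing a single node leaves shard copies with nowhere to go, so a slightly higher value may be the safer choice.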

DISCHARGE
Takeaways
➔ Proactively monitor CPU usage on each node / Alerts
https://www.elastic.co/guide/en/elasticsearch/reference/7.14/monitoring-production.html
➔ Benchmark with tools like ES Rally
https://esrally.readthedocs.io/en/stable/
https://benchmarks.elastic.co/index.html
https://esrally.readthedocs.io/en/stable/adding_tracks.html

SUMMARY
Triage
➔ Cluster health is green
➔ Unbalanced high CPU usage switching nodes over time
Diagnostic
➔ Monitoring
➔ CAT shards API
➔ Index settings
Ingest is hot on an index with 1 primary shard
Treatment
➔ Create index settings
➔ Index templates to add additional primary shards
Discharge
➔ Proactively monitor CPU usage (alerts)
➔ Benchmark (ES Rally) before going into production or if there are any changes in the data volumes

Wrapping up
Triage incidents
➔ How critical is it?
➔ Do we need urgent care or is there a workaround to stabilize?
Have your tools ready
➔ REST APIs / Support diagnostics
➔ Monitoring & Alerts
➔ Log Analysis: Use Kibana!
➔ Search Elastic discuss, Stack Overflow, Elastic GitHub repos, etc.
Lessons learned
➔ Follow best practices
➔ Prevent future incidents: proactively investigate unexpected logs, etc.
