Elasticsearch
Timed Data Analyses
By Alaa Elhadba
@aelhadba
Table of Contents
- Hot-Cold Architecture
- Data High Availability
- Data design at large scale
- Search Execution
- Time framed indices
- Aggregations
Hot-Cold Architecture
Hot Data Nodes
Perform indexing
Hold most recent data
Use SSD storage; writing is an I/O-intensive operation
Cold Data Nodes
Handle read only operations
Can use large spinning disks
Hot-Cold Configuration
Hot node (elasticsearch.yml):
node.box_type: hot
Cold node (elasticsearch.yml):
node.box_type: cold
[Diagram: Shard 1 and Shard 2, each on its own node]
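The node attribute alone does not place any data; an index has to ask for a tier. A minimal sketch, assuming the index-level setting index.routing.allocation.require.box_type is used to match the node.box_type attribute above (the helper name is hypothetical):

```python
def allocation_settings(box_type):
    """Build an index-settings body that pins an index to one tier."""
    if box_type not in ("hot", "cold"):
        raise ValueError("box_type must be 'hot' or 'cold'")
    # index.routing.allocation.require.<attr> restricts the index's shards
    # to nodes whose custom attribute <attr> has the given value.
    return {"index.routing.allocation.require.box_type": box_type}


# A new index starts on the hot nodes (body for index creation) ...
create_body = {"settings": allocation_settings("hot")}
# ... and is moved later by updating the same setting
# (body for PUT <index>/_settings).
move_body = allocation_settings("cold")
```

Changing the setting on a live index should make Elasticsearch relocate the shards to the matching nodes on its own.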
Data Availability
[Diagram: cluster nodes in Availability Zone 1 and Availability Zone 2]
Availability Zone / Rack failure? Shard Allocation Awareness
Shard Allocation Awareness
node.rack_id: rack_1 (on nodes in Availability Zone 1)
node.rack_id: rack_2 (on nodes in Availability Zone 2)
cluster.routing.allocation.awareness.attributes: rack_id
● Data replication spans AZs
● No two copies of the same shard on the same rack
● Elasticsearch is fully aware of shard distribution
● Awareness can be set per cluster or per index
● Elasticsearch prefers local shards
● Always balance your nodes across AZs
● Routing allocation awareness can be updated on a live cluster
● Use Forced Awareness to avoid the extra load of reallocating missing shards
Make sure you can handle the load with fewer nodes!
Forced Awareness
● Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.
● Avoids the extra load of reallocating unassigned shards after a rack failure.
● Leaves no single point of failure in your system.
● Make sure you can handle the load with fewer nodes.
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2
cluster.routing.allocation.awareness.attributes: zone
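The difference between plain and forced awareness can be sketched with a toy model. This is not Elasticsearch's allocator, only an illustration under the assumption that each zone may hold at most ceil(copies / number of known zones) copies of a shard; the function name is hypothetical.

```python
def allocate_copies(num_copies, live_zones, forced_zones=None):
    """Toy model: spread the copies of ONE shard over zones.

    Returns (copies placed per live zone, number of unassigned copies).
    """
    # Plain awareness only counts zones that are currently alive.
    # Forced awareness counts every listed zone, even one that is down.
    known = forced_zones if forced_zones else live_zones
    per_zone_cap = -(-num_copies // len(known))  # ceiling division
    placed, remaining = {}, num_copies
    for zone in live_zones:
        take = min(per_zone_cap, remaining)
        placed[zone] = take
        remaining -= take
    return placed, remaining


# One primary plus one replica, zone2 has failed.
plain = allocate_copies(2, ["zone1"])
forced = allocate_copies(2, ["zone1"], forced_zones=["zone1", "zone2"])
```

In this model the plain case packs both copies into the surviving zone, while the forced case leaves one copy unassigned until zone2 returns, which is the reallocation load the bullets above refer to.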
Data design at large scale
Searching
[Diagram: a query is sent to Shards 1-4, each on its own node, and the results are combined]
How to avoid asking all shards? Routing
I know my shards!
Routing
PUT my_index/my_type/my_id?routing=shard1
GET my_index/_search?routing=shard1,shard2
● Avoid calling all shards
● Dedicated shards per purpose
● Talk to one dedicated shard
● Reduce network traffic
● Better performance
● Handle sharding on your own
But: once in, never out
● Routing must always be specified
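Note that shard1 and shard2 in the requests above are routing values, not shard names. Elasticsearch picks the shard roughly as hash(routing) % number_of_primary_shards. A minimal sketch of that idea, assuming a hypothetical index with four primary shards and using CRC32 as a stand-in for Elasticsearch's internal hash function:

```python
import zlib

NUM_PRIMARY_SHARDS = 4  # hypothetical index with four primary shards


def shard_for(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Map a routing value to a shard number: hash(routing) % num_shards."""
    # Stand-in hash; Elasticsearch uses its own hash function internally.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


def shards_to_query(routing_values):
    """Shards that a routed search has to visit."""
    return sorted({shard_for(value) for value in routing_values})


targets = shards_to_query(["shard1", "shard2"])
```

The same routing value always maps to the same shard, which is why a document indexed with a routing value has to be read, updated and deleted with that value too.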
Routing
[Diagram: daily indices 21.06.2016, 20.06.2016 and 19.06.2016, each split into numbered shards]
I MUST KNOW EVERYTHING!
Talking to data
Aliasing
[Diagram: the aliases today, yesterday and 3_days_ago point at the daily indices 21.06.2016, 20.06.2016 and 19.06.2016; when index 22.06.2016 is added, index 19.06.2016 drops out]
I MUST KNOW! It's better performance
It's a data problem!
Aliasing + Routing
[Diagram: daily indices 22.06.2016, 21.06.2016 and 20.06.2016 behind the aliases today, yesterday and 3_days_ago, plus the aliases today_returns and recent_returns]
It's a data problem!
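A sketch of how such aliases could be rotated each day with a single POST /_aliases request. The index naming scheme (orders-YYYY.MM.DD), the routing value returns and the type field are assumptions for illustration; only the alias names come from the slides.

```python
from datetime import date, timedelta


def index_name(day, prefix="orders-"):
    # Hypothetical naming scheme: one index per day.
    return prefix + day.strftime("%Y.%m.%d")


def rotate_alias_actions(today):
    """Body for POST /_aliases that moves 'today' and 'yesterday' forward."""
    yesterday = today - timedelta(days=1)
    two_days_ago = today - timedelta(days=2)
    return {"actions": [
        {"remove": {"index": index_name(yesterday), "alias": "today"}},
        {"add": {"index": index_name(today), "alias": "today"}},
        {"remove": {"index": index_name(two_days_ago), "alias": "yesterday"}},
        {"add": {"index": index_name(yesterday), "alias": "yesterday"}},
        # Alias + routing: a filtered alias bound to one routing value,
        # so the application never has to pass routing itself.
        {"add": {"index": index_name(today), "alias": "today_returns",
                 "routing": "returns",
                 "filter": {"term": {"type": "returns"}}}},
    ]}


body = rotate_alias_actions(date(2016, 6, 22))
```

All actions in one request are applied together, so the application keeps querying today without ever seeing a gap.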
Aliasing + Routing + Search
[Diagram labels: Index, Shard, Alias, Shard slice]
Search Execution Preference
By default, Elasticsearch spreads searches across primary and replica shard copies in a round-robin manner, so each copy receives a similar share of queries.
_primary               Query only primary shards (latest data from the index, or to optimize for the write path)
_primary_first         Query primary shards first, if available
_replica               Query only replica shards
_replica_first         Query replica shards first, if available
_local                 Query shards available on the current node
_only_node:node_id     Query a specific node
_only_nodes:*          Query only a set of nodes
_prefer_node:node_id   Query a preferred node
_shards:1,3            Query specific shards; can be combined with another preference, e.g. _shards:1,3;_local
GET _search?preference=_replica
Time Framed Indices
Data Flow
Hot ➔ Cold ➔ Closed ➔ Backed up ➔ Trashed (over time)
Closing/Opening Index
➔ Closing an index
◆ Removes all shard allocations from the cluster
◆ But keeps the index data around
◆ Helps reduce the resources used on the cluster
◆ Consumes only disk space
➔ Opening an index
◆ Makes a closed index available again
◆ Note: these are not millisecond operations; opening an index can take from a few seconds to a couple of minutes
◆ Flushing before closing reduces the opening time
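The flush, close and open steps map to three REST endpoints. A minimal sketch that only builds the (method, path) pairs; the helper names and the index name are hypothetical, and sending the requests is left to whatever HTTP client is in use.

```python
def close_index_calls(index):
    """REST calls (method, path) to flush an index and then close it."""
    return [
        ("POST", "/%s/_flush" % index),  # flush first to shorten reopening
        ("POST", "/%s/_close" % index),
    ]


def open_index_calls(index):
    """REST call (method, path) to reopen a closed index."""
    return [("POST", "/%s/_open" % index)]


calls = close_index_calls("orders-2016.06.01")
```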
Index Templates
- Order lets one template override another
- Settings let you scale at any time
- Aliases can be defined at index creation
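The three points can be seen in one template body. A sketch assuming the Elasticsearch 2.x template format, where the index name pattern sits under the template key (later versions use index_patterns); the pattern logs-* and the alias logs_all are hypothetical.

```python
def daily_logs_template(shards=3, replicas=1):
    """Body for PUT /_template/<name>; all names here are hypothetical."""
    return {
        "template": "logs-*",  # index name pattern (ES 2.x key)
        "order": 1,            # a higher order overrides lower-order templates
        "settings": {
            # Only affects indices created after the template changes.
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "aliases": {"logs_all": {}},  # added to every new matching index
    }


template = daily_logs_template(shards=5)
```

Because the settings only apply to newly created indices, raising the shard count here takes effect with the next time frame and leaves existing indices untouched.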
Time framed indices lifecycle
1. Use Index templates to generate mappings for new indices
2. Use aliases to decouple your application from data logic
3. Use hot nodes for fresh data
4. Move old data to cold nodes
5. Close old indices before deletion
6. Change your time frame at any point to scale (monthly, weekly, ...)
7. Use Routing if you have too many shards in a big cluster
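The lifecycle above can be sketched as a function that decides which phase a daily index is in. The thresholds are hypothetical and would be chosen per use case.

```python
from datetime import date

# Hypothetical retention thresholds, in days.
HOT_DAYS, COLD_DAYS, CLOSED_DAYS = 2, 14, 30


def phase(index_day, today):
    """Return the lifecycle phase of the daily index created on index_day."""
    age = (today - index_day).days
    if age < HOT_DAYS:
        return "hot"      # fresh data, still being written
    if age < COLD_DAYS:
        return "cold"     # read-only, moved to cold nodes
    if age < CLOSED_DAYS:
        return "closed"   # kept on disk, no cluster resources
    return "delete"


today = date(2016, 6, 22)
phases = [phase(d, today) for d in (date(2016, 6, 21), date(2016, 6, 19),
                                    date(2016, 6, 1), date(2016, 5, 1))]
```

A scheduled job could run such a check once per time frame and issue the matching settings update, close or delete request.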
Aggregations
Aggregation Types: Buckets, Metrics, Pipeline
Nested Bucket Aggregations
Aggregation Query
- Better caching
- Fetch relevant documents
- First segmentation
- Nested segmentation
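The query from the slide image is not in the transcript, so the following is only a sketch of a request with the same four parts. Field names and values are hypothetical; interval is the Elasticsearch 2.x parameter name for date_histogram.

```python
def returns_per_customer_per_day():
    """Search body with a nested bucket aggregation (hypothetical fields)."""
    return {
        # No hits, aggregations only; such requests are, as far as I know,
        # the ones eligible for the shard request cache.
        "size": 0,
        # Fetch relevant documents with a filter instead of scoring them.
        "query": {"bool": {"filter": {"term": {"type": "returns"}}}},
        "aggs": {
            "by_customer": {  # first segmentation
                "terms": {"field": "customer_name.raw"},
                "aggs": {
                    "per_day": {  # nested segmentation
                        "date_histogram": {"field": "created_at",
                                           "interval": "day"}
                    }
                },
            }
        },
    }


body = returns_per_customer_per_day()
```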
Doc Values
- Why do we need this?
- Sorting, Aggregations, Some Scripting
- Doc Values
- Builds a columnar-style data structure on disk
- Created at indexing time, stored as part of the segment
- Read like other pieces of the Lucene index
- Doesn't take up heap space
- Uses the file system cache
- Default for not_analyzed string and numeric fields in 2.0+
Raw Fields
- Use customer_name.raw for aggregations
- Use customer_name for search
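A raw sub-field is declared as a multi-field in the mapping. A sketch in the Elasticsearch 2.x style (string with not_analyzed); from 5.x on the equivalent would be a text field with a keyword sub-field.

```python
def customer_name_mapping():
    """ES 2.x-style multi-field mapping for customer_name."""
    return {
        "customer_name": {
            "type": "string",  # analyzed: use for full-text search
            "fields": {
                "raw": {
                    "type": "string",
                    # Exact value, backed by doc values: use for aggregations.
                    "index": "not_analyzed",
                }
            },
        }
    }


mapping = customer_name_mapping()
```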
Metrics Aggregations
- Avg Aggregation
- Cardinality Aggregation
- Extended Stats Aggregation
- Max Aggregation
- Min Aggregation
- Percentiles Aggregation
- Percentile Ranks Aggregation
- Scripted Metric Aggregation
- Stats Aggregation
- Sum Aggregation
- Top hits Aggregation
- Value Count Aggregation
Extended Stats Aggregation
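What extended stats adds on top of the basic stats can be illustrated in plain Python. This is a local sketch of the same quantities, assuming population variance; it does not call Elasticsearch.

```python
import math


def extended_stats(values):
    """Compute the main extended_stats figures for a list of numbers."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    variance = sum_of_squares / count - avg * avg  # population variance
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": math.sqrt(variance),
    }


stats = extended_stats([1, 2, 3, 4])
```

Every figure can be derived from count, sum and sum of squares, so each shard only has to return those few numbers along with its min and max.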
Aggregation Search
[Diagram: the query is sent to Shards 1-4, each on its own node, and the results are combined]
Scripted Metric Aggregation
- init_script: executed first; allows initialization of variables.
- map_script: executed once for each collected document.
- combine_script: executed once on each shard after document collection is complete.
- reduce_script: executed once on the coordinating node after all shards have returned their results.
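The four phases can be mimicked in plain Python to show where each one runs. This is a toy model that sums one field; the shard contents and the field name are made up.

```python
def scripted_metric(shards, field):
    """Toy model of the four script phases, summing `field` over all docs."""
    shard_results = []
    for docs in shards:
        state = {"values": []}                  # init_script: set up variables
        for doc in docs:
            state["values"].append(doc[field])  # map_script: once per document
        shard_results.append(sum(state["values"]))  # combine_script: per shard
    return sum(shard_results)  # reduce_script: once, on the coordinating node


shards = [[{"amount": 10}, {"amount": 5}], [{"amount": 7}], []]
total = scripted_metric(shards, "amount")
```

Only the small combined value leaves each shard, which keeps the data sent to the coordinating node independent of the number of documents.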
Buckets Aggregations
- Children Aggregation
- Date Histogram Aggregation
- Date Range Aggregation
- Filter Aggregation
- Filters Aggregation
- Global Aggregation
- Histogram Aggregation
- Missing Aggregation
- Range Aggregation
- Reverse nested Aggregation
- Sampler Aggregation
- Significant Terms Aggregation
- Terms Aggregation
Date Histogram Aggregation
Date Range Aggregation
Don’t forget!
Round your dates
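Rounding matters because an unrounded now changes on every request, while a rounded expression stays the same for the whole day and can be served from cache. A sketch of a date_range aggregation with day-rounded bounds, using a hypothetical field name:

```python
def last_week_date_range(field="created_at"):
    """date_range aggregation body with day-rounded bounds."""
    return {
        "date_range": {
            "field": field,  # hypothetical field name
            "ranges": [
                # "/d" rounds down to the start of the day, so the bounds
                # are identical for every request issued on that day.
                {"from": "now-7d/d", "to": "now/d"}
            ],
        }
    }


agg = last_week_date_range()
```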
Missing Aggregation
Range Aggregation
Histogram Aggregation
Pipeline Aggregations
Parent
- Takes the output of its parent aggregation and computes new buckets or new aggregations to add to the existing buckets.
Sibling
- Takes the output of a sibling aggregation and computes a new aggregation at the same level as that sibling.
Sibling Pipeline Aggregations
- min_bucket
- max_bucket
- sum_bucket
- avg_bucket
- stats_bucket
- extended_stats_bucket
- percentiles_bucket
Average Aggregation
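A sketch of a sibling pipeline aggregation: avg_bucket sits next to a date histogram and averages one of its metrics over all buckets. Field and aggregation names are hypothetical; interval is the Elasticsearch 2.x parameter name.

```python
def monthly_sales_with_average():
    """Aggregation body with an avg_bucket sibling (hypothetical fields)."""
    return {
        "sales_per_month": {
            "date_histogram": {"field": "date", "interval": "month"},
            "aggs": {"sales": {"sum": {"field": "price"}}},
        },
        # Sibling pipeline: same level as sales_per_month; buckets_path
        # points at the "sales" metric inside each of its buckets.
        "avg_monthly_sales": {
            "avg_bucket": {"buckets_path": "sales_per_month>sales"}
        },
    }


aggs = monthly_sales_with_average()
```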
Parent Pipeline Aggregation
- moving_avg
- derivative
- cumulative_sum
- bucket_script
- bucket_selector
- serial_diff
Cumulative Sum Aggregation
Derivative Aggregation
Moving Average Aggregation
Prediction
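What the cumulative sum, derivative and moving average compute per bucket can be shown on a plain list of bucket values. This is a local illustration with made-up numbers; the moving average here is a simple unweighted one that includes the current bucket, and Elasticsearch's window alignment and models may differ.

```python
def cumulative_sum(values):
    """Running total over the buckets."""
    out, running = [], 0
    for v in values:
        running += v
        out.append(running)
    return out


def derivative(values):
    """Change from the previous bucket; the first bucket has none."""
    return [None] + [b - a for a, b in zip(values, values[1:])]


def moving_avg(values, window):
    """Unweighted average over up to `window` most recent buckets."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


monthly_sales = [10, 20, 60, 30]  # made-up bucket values
```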
Bucket Selector Aggregation
Bucket Script Aggregation
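Bucket script and bucket selector both run a small script per bucket: the first adds a computed value to each bucket, the second decides whether the bucket is kept. A plain-Python sketch of the two ideas with made-up buckets; it does not use Elasticsearch's script syntax.

```python
def bucket_script(buckets, name, fn):
    """Add a computed metric `name` to every bucket."""
    return [dict(b, **{name: fn(b)}) for b in buckets]


def bucket_selector(buckets, predicate):
    """Keep only the buckets for which the predicate holds."""
    return [b for b in buckets if predicate(b)]


buckets = [  # made-up per-day buckets
    {"key": "2016-06-20", "sales": 200, "returns": 10},
    {"key": "2016-06-21", "sales": 100, "returns": 30},
    {"key": "2016-06-22", "sales": 50, "returns": 5},
]
with_rate = bucket_script(buckets, "return_rate",
                          lambda b: 100 * b["returns"] / b["sales"])
high_rate = bucket_selector(with_rate, lambda b: b["return_rate"] > 8)
```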
The End
