Optimize Is (Not) Bad For You
Deep Dive Into The Segment Merge Abyss
Rafał Kuć
Sematext Group, Inc.
Agenda
• Segments – where, what & how
• Writing segments
• Modifying segments
• Segment merging – what, where, how, why
• Force merging
• Force merging & SolrCloud
• Performance considerations
• Specialized merge policies
https://github.com/sematext/lr/tree/master/2017/optimize
Sematext & I
cloud metrics & logs
Solr Collection Architecture
Zookeeper
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
SOLR
shard shard
Solr Shard Architecture
TLOG
Segment Segment Segment
Segment
Lucene Segment
Segment Info
Field Names
Stored Field Values
Point Values
Term Dictionary
Term Frequency
Term Proximity
Normalization
Per-Document Values
Live Documents
Inside the Segment – Term Dictionary
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
TERM        DOCID
lucene      <1>, <2>
revolution  <1>, <2>
washington  <1>
boston      <2>
Files: _1.tim, _1.tip
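The term-to-document mapping on this slide can be sketched in a few lines of Python. This is only a toy model under stated assumptions: whitespace tokenization, lowercasing, and the `build_term_dictionary` helper are illustrative choices, not Lucene's analysis chain or its on-disk format.

```python
from collections import defaultdict

def build_term_dictionary(docs):
    """Map each lowercased title term to the sorted doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, title in docs.items():
        for term in title.lower().split():
            postings[term].add(doc_id)
    # Terms are kept in sorted order here to mimic a sorted term dictionary.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# The two example documents from the slide (Title field only).
docs = {
    1: "Lucene Revolution Washington",
    2: "Lucene Revolution Boston",
}

term_dictionary = build_term_dictionary(docs)
for term, doc_ids in term_dictionary.items():
    print(term, doc_ids)
```

Looking up `term_dictionary["lucene"]` returns both document IDs, which is the lookup a term query performs inside each segment.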
Inside the Segment – Doc Values
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID  FIELD  VALUE
1      Title  Lucene Revolution Washington
1      City   Washington D.C.
2      Title  Lucene Revolution Boston
2      City   Boston
Files: _1.dvd, _1.dvm
Inside the Segment – Stored Fields
Doc1 Title: Lucene Revolution Washington, City: Washington D.C.
Doc2 Title: Lucene Revolution Boston, City: Boston
DOCID  VALUE
1      Title: Lucene Revolution Washington
       City: Washington D.C.
2      Title: Lucene Revolution Boston
       City: Boston
Files: _1.fdx, _1.fdt
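The difference between doc values and stored fields is mostly one of layout. The sketch below assumes plain Python dictionaries stand in for both structures; `fetch_document` and `sort_by_field` are hypothetical helper names, and real segments use compressed binary files rather than in-memory maps.

```python
# The two example documents from the slides.
docs = {
    1: {"Title": "Lucene Revolution Washington", "City": "Washington D.C."},
    2: {"Title": "Lucene Revolution Boston", "City": "Boston"},
}

# Stored fields: row-oriented, one record per document.
stored_fields = {doc_id: dict(fields) for doc_id, fields in docs.items()}

# Doc values: column-oriented, one doc ID -> value mapping per field.
doc_values = {}
for doc_id, fields in docs.items():
    for name, value in fields.items():
        doc_values.setdefault(name, {})[doc_id] = value

def fetch_document(doc_id):
    """Returning a whole document reads one row of stored fields."""
    return stored_fields[doc_id]

def sort_by_field(field):
    """Sorting touches a single column and never loads whole documents."""
    column = doc_values[field]
    return sorted(column, key=column.get)

print(fetch_document(2))
print(sort_by_field("City"))
```

The row layout suits returning search results, while the column layout suits sorting, faceting, and other per-field scans.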
Inside the Segment – Compound File System
_1.fdt
_1.fdx
_1.fnm
_1.nvd
_1.nvm
_1.si
_1.Lucene50_0.doc
_1.Lucene50_0.pos
_1.Lucene50_0.tim
_1.Lucene50_0.tip
_1.Lucene50_0.dvd
_1.Lucene50_0.dvm
_2.cfs
_2.cfe
Indexing
[diagram sequence; label: level/tier]
Deletes
Deletes – After Merge
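Because segments are immutable, a delete is recorded as a flag rather than by rewriting data, and the space is reclaimed only when a merge writes a new segment. A minimal sketch of that idea follows, assuming a hypothetical `Segment` class with a per-document live flag; it mirrors the "Live Documents" part of a segment but is not Lucene's implementation.

```python
class Segment:
    def __init__(self, docs):
        self.docs = list(docs)                # written once, never modified
        self.live = [True] * len(self.docs)   # one "live" flag per document

    def delete(self, doc):
        # A delete only flips the live flag; the document data stays in place.
        for i, d in enumerate(self.docs):
            if d == doc:
                self.live[i] = False

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]

def merge(segments):
    """A merge copies only live documents into the new segment."""
    return Segment(d for seg in segments for d in seg.live_docs())

seg1 = Segment(["doc1", "doc2", "doc3"])
seg2 = Segment(["doc4", "doc5"])
seg1.delete("doc2")

print(len(seg1.docs))   # the deleted document still occupies space
merged = merge([seg1, seg2])
print(merged.docs)      # after the merge it is gone
```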
Atomic Updates
$ curl -XPOST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"tags" : {
"add" : [ "solr" ]
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene" ],
"awesome" : true
}
apply changes
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
delete old document
index the updated document
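The retrieve, apply, delete, and re-index sequence of a regular atomic update can be sketched as follows. This is a simplified model under stated assumptions: a dictionary stands in for the index, `deleted_versions` stands in for documents marked as deleted in older segments, and `atomic_update` is a hypothetical helper that handles only the `add` and `set` modifiers.

```python
import copy

# A dictionary stands in for the index; the real document lives in a segment.
index = {"3": {"id": "3", "tags": ["lucene"], "awesome": True}}
deleted_versions = []  # old versions that a later merge would drop

def atomic_update(index, update):
    doc_id = update["id"]
    # 1. Retrieve the current document (from its stored fields).
    current = copy.deepcopy(index[doc_id])
    # 2. Apply the requested modifiers to the copy.
    for field, modifiers in update.items():
        if field == "id":
            continue
        for op, value in modifiers.items():
            if op == "add":
                current.setdefault(field, []).extend(value)
            elif op == "set":
                current[field] = value
            else:
                raise ValueError(f"unsupported modifier in this sketch: {op}")
    # 3. Delete the old document (here: remember it as a deleted version).
    deleted_versions.append(index.pop(doc_id))
    # 4. Index the updated document as a new one.
    index[doc_id] = current
    return current

updated = atomic_update(index, {"id": "3", "tags": {"add": ["solr"]}})
print(updated)
```

Every such update leaves a deleted copy behind, which is why update-heavy indices accumulate deleted documents until merges remove them.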
Atomic Updates – In Place
Works only on numeric, doc values based fields
Fields must be neither indexed nor stored
Doesn’t require delete/index
Supports only the inc and set modifiers
$ curl -XPOST -H 'Content-Type: application/json' \
  'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[
{
"id" : "3",
"views" : {
"inc" : 100
}
}
]'
retrieve document
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true
}
apply changes
{
"id" : 3,
"tags" : [ "lucene", "solr" ],
"awesome" : true,
"views" : 100
}
update doc values
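By contrast, an in-place update changes only the doc values column of one numeric field and leaves the rest of the document alone. A rough sketch follows, assuming a dictionary models the `views` column and that `inc` on a missing value starts from 0 (as the slide's result of 100 suggests); `in_place_update` is a hypothetical helper.

```python
# Stored part of the document; an in-place update never touches it.
stored_docs = {"3": {"id": "3", "tags": ["lucene", "solr"], "awesome": True}}

# Doc values column for the numeric, non-indexed, non-stored "views" field.
views_doc_values = {}

def in_place_update(doc_id, column, modifiers):
    """Apply inc/set directly to one doc values column."""
    for op, value in modifiers.items():
        if op == "inc":
            column[doc_id] = column.get(doc_id, 0) + value
        elif op == "set":
            column[doc_id] = value
        else:
            raise ValueError(f"in-place updates support only inc and set, got: {op}")
    return column[doc_id]

before = copy_of_stored = dict(stored_docs["3"])
new_value = in_place_update("3", views_doc_values, {"inc": 100})
print(new_value)
print(stored_docs["3"] == copy_of_stored)  # the stored document is unchanged
```

No delete and re-index takes place, so this kind of update does not add deleted documents to the segment.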
Search – Importance of Segments
Immutable – write once read many
More segments – slower search speed
Fewer segments – faster searches
Fewer segments – smaller shard size
Rapid segment changes – worse I/O cache usage
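One way to see why the segment count matters: a query has to consult every segment and combine the partial results. The sketch below assumes each segment is a small term-to-doc-IDs dictionary; the `search` helper and the lookup counter are illustrative only.

```python
def search(segments, term):
    """Look the term up in every segment and combine the hits."""
    hits = []
    lookups = 0
    for segment in segments:
        lookups += 1                      # one dictionary lookup per segment
        hits.extend(segment.get(term, []))
    return sorted(hits), lookups

# The same four documents spread over four segments ...
many_segments = [{"lucene": [1]}, {"lucene": [2]}, {"solr": [3]}, {"lucene": [4]}]
# ... and after merging into a single segment.
one_segment = [{"lucene": [1, 2, 4], "solr": [3]}]

print(search(many_segments, "lucene"))
print(search(one_segment, "lucene"))
```

Both calls return the same hits, but the merged index answers with a single lookup instead of four.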
Taking Control
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
Segment Warmer
<mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
Taking Control – Default Indexing Throughput
throughput < 5k/sec @ ~14GB
Taking Control – Max Merged Segment Size
Merge Policy Factory
<int name="maxMergedSegmentMB">5120</int>
Lower value: higher indexing throughput, smaller segments
Higher value: better search latency (depends), more merges
Taking Control – Lowering Max Merged Size
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">10</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">512</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Taking Control – Lowering Max Segment Size
throughput < 5k/sec @ ~15.5GB
11% throughput increase
Taking Control – Merge At Once
Merge Policy Factory
<int name="maxMergeAtOnce">10</int>
Lower value: better search latency (depends)
Higher value: higher indexing throughput
Taking Control – Lowering Merge At Once
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">2</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">10</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">5120</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Taking Control – Lowering Merge At Once
throughput < 5k/sec @ ~13GB
8% throughput decrease
Taking Control – Merge At Once Explicit
Merge Policy Factory
<int name="maxMergeAtOnceExplicit">30</int>
Controls the number of segments merged at once during a force merge
Taking Control – Segments Per Tier
Merge Policy Factory
<int name="segmentsPerTier">10</int>
A lower value means more merging but fewer segments
Together with maxMergeAtOnce, it can smooth out I/O spikes
For better indexing throughput, set maxMergeAtOnce < segmentsPerTier
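The trade-off between merge work and segment count can be illustrated with a deliberately simplified model. This is not TieredMergePolicy's actual selection algorithm: the sketch assumes every flush writes a segment of size 1, that a "tier" is the set of equally sized segments, and that a full tier merges `max_merge_at_once` segments into one. The `simulate` function and its parameters are hypothetical.

```python
from collections import Counter

def simulate(num_flushes, segments_per_tier, max_merge_at_once):
    """Return (segment count, units rewritten by merges) for a toy tier model."""
    if not 2 <= max_merge_at_once <= segments_per_tier:
        raise ValueError("expected 2 <= max_merge_at_once <= segments_per_tier")
    tiers = Counter()    # segment size (in flush units) -> number of segments
    merge_writes = 0     # total units rewritten by merges
    for _ in range(num_flushes):
        tiers[1] += 1    # every flush writes one new small segment
        size = 1
        # Cascade upwards: a full tier merges into one segment of the next tier.
        while tiers[size] >= segments_per_tier:
            tiers[size] -= max_merge_at_once
            merged_size = size * max_merge_at_once
            tiers[merged_size] += 1
            merge_writes += merged_size
            size = merged_size
    return sum(tiers.values()), merge_writes

relaxed = simulate(999, segments_per_tier=10, max_merge_at_once=10)
aggressive = simulate(999, segments_per_tier=2, max_merge_at_once=2)
print("segmentsPerTier=10:", relaxed)
print("segmentsPerTier=2: ", aggressive)
```

In this toy model the lower setting ends with fewer segments but rewrites far more data, which matches the direction of the slide's advice, not its measured numbers.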
Taking Control – Combined Together
Merge Policy Factory
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
  <int name="maxMergeAtOnce">30</int>
  <int name="maxMergeAtOnceExplicit">30</int>
  <int name="segmentsPerTier">30</int>
  <int name="floorSegmentMB">2048</int>
  <int name="maxMergedSegmentMB">512</int>
  <double name="noCFSRatio">0.1</double>
  <int name="maxCFSSegmentSizeMB">2048</int>
  <double name="reclaimDeletesWeight">2.0</double>
  <double name="forceMergeDeletesPctAllowed">10.0</double>
</mergePolicyFactory>
Taking Control – Combined Together
throughput < 5k/sec @ ~15GB
but look at read difference
Taking Control – Default vs Combined Read/Write
[charts: default settings vs. combined changes settings]
Taking Control – Reclaim Deletes Weight
Merge Policy Factory
<double name="reclaimDeletesWeight">2.0</double>
Controls how strongly merge selection favors segments with deleted documents
Increase it to prioritize merging segments with deleted documents
Taking Control – No CFS Ratio
Merge Policy Factory
<double name="noCFSRatio">0.1</double>
Controls which merged segments use the compound file format: a merged segment larger than this fraction of the total index size is left as non-compound
To completely disable CFS, set it to 0.0
Taking Control – Merge Scheduler
Merge Scheduler
<mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
  <int name="maxMergeCount">4</int>
  <int name="maxThreadCount">4</int>
</mergeScheduler>
maxMergeCount controls the maximum number of concurrent merges
maxThreadCount controls the number of threads dedicated to merging
For spinning drives set maxThreadCount to 1
For SSD set maxThreadCount to min(4, #CPUs / 2)
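The thread-count rule of thumb is easy to turn into a helper. The function below is a hypothetical sketch, not part of Solr or Lucene; it adds a floor of 1 (an assumption beyond the slide) so that a single-core machine never gets a value of 0.

```python
import os

def suggested_max_thread_count(spinning_disk, cpu_count=None):
    """Suggest a maxThreadCount value following the rule of thumb above."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1   # os.cpu_count() may return None
    if spinning_disk:
        return 1                          # avoid competing seeks on one spindle
    # SSD: min(4, #CPUs / 2), never below 1 (the floor is an added assumption).
    return max(1, min(4, cpu_count // 2))

print(suggested_max_thread_count(spinning_disk=True, cpu_count=16))
print(suggested_max_thread_count(spinning_disk=False, cpu_count=16))
print(suggested_max_thread_count(spinning_disk=False, cpu_count=4))
```

The result would go into the `maxThreadCount` setting of the merge scheduler configuration.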
Optimize aka Force Merge
Forces segment merge – usually very expensive
Desired number of segments can be specified
Done on all shards at the same time (by default)
Can be very bad or very good – depending on the use case
$ curl 'http://solr:8983/solr/lr/update?optimize=true&maxSegments=1&waitFlush=false'
Force Merge – The Good
Improves search speed (fewer segments)
Removes deleted documents
Shrinks the index by pruning duplicated data
Reduces number of used files
Force Merge – The Bad
Invalidates operating system I/O cache
Very expensive to perform – rewrites all segments
Not efficient on changing data
May cause performance issues
Will cause temporary increase of disk usage (up to 3x)
Force Merge – SolrCloud Performance Example
Force Merge – Legacy
Index on the master server
Force merge on the master server
Replicate after optimize is done
[diagram: Documents are indexed on the Solr Master, which is force merged; three Solr Slaves pull after optimize]
Force Merge – SolrCloud (Solr 7 – pull replicas)
Create collection
Force merge
Solr will do the rest
[diagram: four Solr nodes hosting Primary 1, Primary 2, Pull Replica 1 and Pull Replica 2]
Force Merge – SolrCloud (NRT replicas, pre 7.0)
Ask yourself if you really need force merge
Create collection on part of the nodes
Index
Force merge
Create replicas
[diagram: four Solr nodes; Documents are indexed into Primary 1 and Primary 2, the primaries are optimized, then Replica 1 and Replica 2 are created]
Specialized Merge Policy Example – Sorting
Sorting Merge Policy Factory Example
<mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
  <str name="sort">timestamp desc</str>
  <str name="wrapper.prefix">inner</str>
  <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  <int name="inner.maxMergeAtOnce">10</int>
  <int name="inner.segmentsPerTier">10</int>
  <double name="inner.noCFSRatio">0.1</double>
</mergePolicyFactory>
Pre-sorts data during merge, which gives:
- faster range queries
- faster data retrieval
- possibility of early query termination
- a convenient layout for time-based data
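Early termination is the easiest of these benefits to illustrate. The sketch below assumes every segment already stores its documents sorted by timestamp descending, as the `sort` setting above requests; the `newest` helper is hypothetical and only models the idea, not Solr's query execution.

```python
import heapq
from itertools import islice

def newest(segments, k):
    """Return the k newest timestamps from segments sorted by timestamp desc."""
    # Only the first k entries of each sorted segment can reach the overall top k,
    # so reading of every segment stops early.
    heads = [list(islice(segment, k)) for segment in segments]
    examined = sum(len(head) for head in heads)
    merged = heapq.merge(*heads, reverse=True)
    return list(islice(merged, k)), examined

# Three pre-sorted segments holding ten timestamps in total.
segments = [
    [90, 70, 50, 30, 10],
    [95, 60, 20],
    [80, 40],
]

top, examined = newest(segments, 2)
print(top, examined)
```

With unsorted segments every document would have to be examined to find the newest ones; here only the head of each segment is read.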
http://sematext.com/jobs
You love like we do?
You want to work with ?
Want to work with open source?
You want to do fun stuff?
Get in touch
Rafał
rafal.kuc@sematext.com
@kucrafal
http://sematext.com
@sematext http://sematext.com/jobs
Come talk to us
at the booth
Thank You

More Related Content

PDF
はじめての検索エンジン&Solr 第13回Solr勉強会
PPTX
はじめてのElasticsearchクラスタ
PDF
Vectorized Query Execution in Apache Spark at Facebook
PPTX
Neural Search Comes to Apache Solr
PDF
Elasticsearch の検索精度のチューニング 〜テストを作って高速かつ安全に〜
PDF
Azure Database for PostgreSQL 入門 (PostgreSQL Conference Japan 2021)
PDF
SolrとElasticsearchを比べてみよう
PPTX
技術勉強会(Solr入門編)
はじめての検索エンジン&Solr 第13回Solr勉強会
はじめてのElasticsearchクラスタ
Vectorized Query Execution in Apache Spark at Facebook
Neural Search Comes to Apache Solr
Elasticsearch の検索精度のチューニング 〜テストを作って高速かつ安全に〜
Azure Database for PostgreSQL 入門 (PostgreSQL Conference Japan 2021)
SolrとElasticsearchを比べてみよう
技術勉強会(Solr入門編)

What's hot (20)

PDF
AWS で Presto を徹底的に使いこなすワザ
PDF
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
PPTX
LIFULL HOME'SでのSolrの構成と運用の変遷
PDF
Solr Query Parsing
PDF
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
PDF
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
PDF
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
PDF
20190521 AWS Black Belt Online Seminar Amazon Simple Email Service (Amazon SES)
PPTX
PostgreSQLのロール管理とその注意点(Open Source Conference 2022 Online/Osaka 発表資料)
PDF
20190410 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...
PDF
Amazon Redshift 概要 (20分版)
PPTX
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
PDF
並行処理初心者のためのAkka入門
PPTX
Introduction to Elasticsearch
PPTX
검색엔진이 데이터를 다루는 법 김종민
PDF
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
PDF
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
PDF
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
PDF
IoT時代におけるストリームデータ処理と急成長の Apache Flink
PDF
事例から見る規模別クラウド・データベースの選び方 (Oracle Database) (Oracle Cloudウェビナーシリーズ: 2021年6月30日)
AWS で Presto を徹底的に使いこなすワザ
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
LIFULL HOME'SでのSolrの構成と運用の変遷
Solr Query Parsing
At least onceってぶっちゃけ問題の先送りだったよね #kafkajp
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
20190521 AWS Black Belt Online Seminar Amazon Simple Email Service (Amazon SES)
PostgreSQLのロール管理とその注意点(Open Source Conference 2022 Online/Osaka 発表資料)
20190410 AWS Black Belt Online Seminar Amazon Elastic Container Service for K...
Amazon Redshift 概要 (20分版)
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
並行処理初心者のためのAkka入門
Introduction to Elasticsearch
검색엔진이 데이터를 다루는 법 김종민
20190319 AWS Black Belt Online Seminar Amazon FSx for Lustre
全文検索サーバ Fess 〜 全文検索システム構築時の悩みどころ
Amazon Redshift 아키텍처 및 모범사례::김민성::AWS Summit Seoul 2018
IoT時代におけるストリームデータ処理と急成長の Apache Flink
事例から見る規模別クラウド・データベースの選び方 (Oracle Database) (Oracle Cloudウェビナーシリーズ: 2021年6月30日)
Ad

Viewers also liked (6)

PDF
Effective Hive Queries
PDF
Cross Datacenter Replication in Apache Solr 6
PDF
SolrCloud on Hadoop
PDF
Solr on Docker - the Good, the Bad and the Ugly
PDF
Best practices for highly available and large scale SolrCloud
PDF
How to Run Solr on Docker and Why
Effective Hive Queries
Cross Datacenter Replication in Apache Solr 6
SolrCloud on Hadoop
Solr on Docker - the Good, the Bad and the Ugly
Best practices for highly available and large scale SolrCloud
How to Run Solr on Docker and Why
Ad

Similar to Solr Search Engine: Optimize Is (Not) Bad for You (20)

PDF
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
PPTX
04 data accesstechnologies
PDF
Beyond full-text searches with Lucene and Solr
PPTX
IT talk SPb "Full text search for lazy guys"
PDF
Interactive Questions and Answers - London Information Retrieval Meetup
PDF
20150210 solr introdution
ODP
Dev8d Apache Solr Tutorial
PDF
Presto at Tivo, Boston Hadoop Meetup
ODP
Letting In the Light: Using Solr as an External Search Component
PDF
Rails and the Apache SOLR Search Engine
PDF
Rapid prototyping search applications with solr
PDF
Web analytics at scale with Druid at naver.com
PDF
Rapid Prototyping with Solr
PDF
Rapid Prototyping with Solr
PPTX
Developing on SQL Azure
PDF
Lessons Learned While Scaling Elasticsearch at Vinted
PDF
[2D1]Elasticsearch 성능 최적화
PDF
[2 d1] elasticsearch 성능 최적화
PPTX
Top 5 things to know about sql azure for developers
PDF
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자
Optimize Is (Not) Bad For You - Rafał Kuć, Sematext Group, Inc.
04 data accesstechnologies
Beyond full-text searches with Lucene and Solr
IT talk SPb "Full text search for lazy guys"
Interactive Questions and Answers - London Information Retrieval Meetup
20150210 solr introdution
Dev8d Apache Solr Tutorial
Presto at Tivo, Boston Hadoop Meetup
Letting In the Light: Using Solr as an External Search Component
Rails and the Apache SOLR Search Engine
Rapid prototyping search applications with solr
Web analytics at scale with Druid at naver.com
Rapid Prototyping with Solr
Rapid Prototyping with Solr
Developing on SQL Azure
Lessons Learned While Scaling Elasticsearch at Vinted
[2D1]Elasticsearch 성능 최적화
[2 d1] elasticsearch 성능 최적화
Top 5 things to know about sql azure for developers
[제1회 루씬 한글분석기 기술세미나] solr로 나만의 검색엔진을 만들어보자

More from Sematext Group, Inc. (20)

PDF
Tweaking the Base Score: Lucene/Solr Similarities Explained
PDF
OOPs, OOMs, oh my! Containerizing JVM apps
PPTX
Is observability good for your brain?
PDF
Introducing log analysis to your organization
PDF
Monitoring and Log Management for
PDF
Introduction to solr
PDF
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
PDF
Elasticsearch for Logs & Metrics - a deep dive
PDF
Tuning Solr & Pipeline for Logs
PPTX
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
PDF
Top Node.js Metrics to Watch
PPT
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
PDF
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
PDF
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
PDF
Docker Logging Webinar
PDF
Docker Monitoring Webinar
PDF
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
PDF
Side by Side with Elasticsearch & Solr, Part 2
PPTX
Tuning Elasticsearch Indexing Pipeline for Logs
PDF
Solr Anti Patterns
Tweaking the Base Score: Lucene/Solr Similarities Explained
OOPs, OOMs, oh my! Containerizing JVM apps
Is observability good for your brain?
Introducing log analysis to your organization
Monitoring and Log Management for
Introduction to solr
Building Resilient Log Aggregation Pipeline with Elasticsearch & Kafka
Elasticsearch for Logs & Metrics - a deep dive
Tuning Solr & Pipeline for Logs
Running High Performance & Fault-tolerant Elasticsearch Clusters on Docker
Top Node.js Metrics to Watch
Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)
From Zero to Production Hero: Log Analysis with Elasticsearch (from Velocity ...
Docker Logging Webinar
Docker Monitoring Webinar
Metrics, Logs, Transaction Traces, Anomaly Detection at Scale
Side by Side with Elasticsearch & Solr, Part 2
Tuning Elasticsearch Indexing Pipeline for Logs
Solr Anti Patterns

Recently uploaded (20)

PDF
Electronic commerce courselecture one. Pdf
PPTX
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
PDF
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
PPTX
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
PDF
Modernizing your data center with Dell and AMD
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
CIFDAQ's Market Insight: SEC Turns Pro Crypto
PPTX
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
PDF
NewMind AI Monthly Chronicles - July 2025
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PPTX
Big Data Technologies - Introduction.pptx
PDF
Advanced Soft Computing BINUS July 2025.pdf
PDF
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
GamePlan Trading System Review: Professional Trader's Honest Take
PDF
cuic standard and advanced reporting.pdf
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
PDF
The Rise and Fall of 3GPP – Time for a Sabbatical?
Electronic commerce courselecture one. Pdf
Effective Security Operations Center (SOC) A Modern, Strategic, and Threat-In...
7 ChatGPT Prompts to Help You Define Your Ideal Customer Profile.pdf
breach-and-attack-simulation-cybersecurity-india-chennai-defenderrabbit-2025....
Modernizing your data center with Dell and AMD
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Machine learning based COVID-19 study performance prediction
CIFDAQ's Market Insight: SEC Turns Pro Crypto
Detection-First SIEM: Rule Types, Dashboards, and Threat-Informed Strategy
NewMind AI Monthly Chronicles - July 2025
Per capita expenditure prediction using model stacking based on satellite ima...
Big Data Technologies - Introduction.pptx
Advanced Soft Computing BINUS July 2025.pdf
Build a system with the filesystem maintained by OSTree @ COSCUP 2025
Spectral efficient network and resource selection model in 5G networks
GamePlan Trading System Review: Professional Trader's Honest Take
cuic standard and advanced reporting.pdf
Advanced methodologies resolving dimensionality complications for autism neur...
Bridging biosciences and deep learning for revolutionary discoveries: a compr...
The Rise and Fall of 3GPP – Time for a Sabbatical?

Solr Search Engine: Optimize Is (Not) Bad for You

  • 1. Optimize Is (Not) Bad For You Deep Dive Into The Segment Merge Abyss Rafał Kuć Sematext Group, Inc.
  • 2. Agenda • Segments – where, what & how • Writing segments • Modifying segments • Segment merging – what, where, how, why • Force merging • Force merging & SolrCloud • Performance considerations • Specialized merge policies https://github.com/sematext/lr/tree/master/2017/optimize
  • 6. 6 01 Solr Collection Architecture Zookeeper SOLR shard shard SOLR shard shard SOLR shard shard SOLR shard shard
  • 9. 9 01 Lucene Segment Segment Info Field Names Stored Field Values Point Values Term Dictionary Term Frequency Term Proximity Normalization Per Document Vals Live Documents
  • 10. 1 01 Inside the Segment – Term Dictionary TERM DOCID lucene <1>, <2> revolution <1>, <2> washington <1> boston <2> _1.tim Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston _1.tip
  • 11. 1 01 Inside the Segment – Doc Values Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID FIELD VALUE 1 Title Lucene Revolution Washington 1 City Washington D.C. 2 Title Lucene Revolution Boston 2 City Boston _1.dvd _1.dvm
  • 12. 1 01 Inside the Segment – Stored Fields Doc1 Title: Lucene Revolution Washington, City: Washington D.C Doc2 Title: Lucene Revolution Boston, City: Boston DOCID VALUE 1 Title: Lucene Revolution Washington City: Washington D.C 2 Title: Lucene Revolution Boston City: Boston _1.fdx _1.fdt
  • 13. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 14. 1 01 Inside the Segment – Compound File System _1.fdt _1.fdx _1.fnm _1.nvd _1.nvm _1.si _1.Lucene50_0.doc _1.Lucene50_0.pos _1.Lucene50_0.tim _1.Lucene50_0.tip _1.Lucene50_0.dvd _1.Lucene50_0.dvm
  • 15. Inside the Segment – Compound File System
    Segment _1 as separate files: _1.fdt, _1.fdx, _1.fnm, _1.nvd, _1.nvm, _1.si, _1.Lucene50_0.doc, _1.Lucene50_0.pos, _1.Lucene50_0.tim, _1.Lucene50_0.tip, _1.Lucene50_0.dvd, _1.Lucene50_0.dvm
    Segment _2 as a compound file: _2.cfs, _2.cfe
  • 29. Atomic Updates
    $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "tags" : { "add" : [ "solr" ] } } ]'
    Step: retrieve document
    { "id" : 3, "tags" : [ "lucene" ], "awesome" : true }
  • 30. Atomic Updates (same request). Step: apply changes
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 31. Atomic Updates (same request). Step: delete old document
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 32. Atomic Updates (same request)
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
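The atomic update steps on slides 29 to 32 can be modelled on a plain dictionary. This is a toy sketch under stated assumptions, not Solr code: the in-memory `index` dict and the `atomic_update` helper are hypothetical, and the final re-index step is included because an atomic update writes the full updated document back.

```python
import copy


def atomic_update(index, update):
    """Apply an atomic update to a toy in-memory index (dict keyed by id)."""
    doc_id = update["id"]
    old_doc = index[doc_id]                       # 1. retrieve document
    new_doc = copy.deepcopy(old_doc)
    for field, ops in update.items():             # 2. apply changes
        if field == "id":
            continue
        for op, value in ops.items():
            if op == "add":
                values = value if isinstance(value, list) else [value]
                new_doc.setdefault(field, []).extend(values)
            elif op == "set":
                new_doc[field] = value
            else:
                raise ValueError(f"unsupported modifier in this sketch: {op}")
    del index[doc_id]                             # 3. delete old document
    index[doc_id] = new_doc                       # 4. index the updated document
    return new_doc


index = {"3": {"id": "3", "tags": ["lucene"], "awesome": True}}
result = atomic_update(index, {"id": "3", "tags": {"add": ["solr"]}})
print(result)
```

The point of the model is that the whole document is rewritten even though only one field changed, which is why such updates produce deleted documents in existing segments.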
  • 33. Atomic Updates – In Place
    - Works on numeric, doc values based fields
    - Fields must be neither indexed nor stored
    - Doesn't require delete/index
    - Supports only the inc and set modifiers
    $ curl -XPOST -H 'Content-Type: application/json' 'http://localhost:8983/solr/lr/update?commit=true' --data-binary '[ { "id" : "3", "views" : { "inc" : 100 } } ]'
  • 34. Atomic Updates – In Place (same request). Step: retrieve document
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true }
  • 35. Atomic Updates – In Place (same request). Step: apply changes
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 }
  • 36. Atomic Updates – In Place (same request). Step: update doc values
    { "id" : 3, "tags" : [ "lucene", "solr" ], "awesome" : true, "views" : 100 }
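The in-place conditions from slide 33 can be written as a single predicate. This sketch checks only the conditions listed on the slide; the `FieldDef` class and `can_update_in_place` function are hypothetical names, and a real Solr schema may impose further requirements that are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldDef:
    """Hypothetical, simplified view of a schema field."""
    numeric: bool
    doc_values: bool
    indexed: bool
    stored: bool


# Modifiers the slide lists as supported for in-place updates.
IN_PLACE_MODIFIERS = {"inc", "set"}


def can_update_in_place(field, modifier):
    """True if the field and modifier satisfy the conditions from the slide."""
    return (
        field.numeric
        and field.doc_values
        and not field.indexed
        and not field.stored
        and modifier in IN_PLACE_MODIFIERS
    )


views = FieldDef(numeric=True, doc_values=True, indexed=False, stored=False)
tags = FieldDef(numeric=False, doc_values=False, indexed=True, stored=True)
print(can_update_in_place(views, "inc"), can_update_in_place(tags, "set"))
```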
  • 41. Search – Importance of Segments
    - Immutable – write once, read many
    - More segments – slower search speed
    - Fewer segments – faster searches
    - Fewer segments – smaller shard size
    - Rapid segment changes – worse I/O cache usage
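Because segments are immutable, a delete only marks a document as no longer live, and the space is reclaimed when segments are merged. The toy model below illustrates why fewer, merged segments mean a smaller shard; the `Segment` class and `merge` function are assumptions for illustration and are far simpler than Lucene's real structures.

```python
class Segment:
    """Toy segment: documents are written once; deletes only flip a live flag."""

    def __init__(self, docs):
        self.docs = tuple(docs)               # never modified after creation
        self.live = [True] * len(self.docs)   # stand-in for the live documents bitset

    def delete(self, doc_id):
        for i, doc in enumerate(self.docs):
            if doc["id"] == doc_id:
                self.live[i] = False

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]


def merge(segments):
    """Write a new segment containing only the live documents of the inputs."""
    merged = []
    for seg in segments:
        merged.extend(seg.live_docs())
    return Segment(merged)


seg1 = Segment([{"id": 1}, {"id": 2}])
seg2 = Segment([{"id": 3}])
seg1.delete(2)

stored_before = len(seg1.docs) + len(seg2.docs)   # deleted doc still occupies space
merged = merge([seg1, seg2])
stored_after = len(merged.docs)                   # deleted doc dropped by the merge
print(stored_before, stored_after)
```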
  • 44. Taking Control
    Merge Policy Factory
    <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
      <int name="maxMergeAtOnce">10</int>
      <int name="maxMergeAtOnceExplicit">30</int>
      <int name="segmentsPerTier">10</int>
      <int name="floorSegmentMB">2048</int>
      <int name="maxMergedSegmentMB">5120</int>
      <double name="noCFSRatio">0.1</double>
      <int name="maxCFSSegmentSizeMB">2048</int>
      <double name="reclaimDeletesWeight">2.0</double>
      <double name="forceMergeDeletesPctAllowed">10.0</double>
    </mergePolicyFactory>
    Merge Scheduler
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler" />
    Segment Warmer
    <mergedSegmentWarmer class="org.apache.lucene.index.SimpleMergedSegmentWarmer" />
  • 45. Taking Control – Default Indexing Throughput: uses the Merge Policy Factory configuration from slide 44, unchanged
  • 46. Taking Control – Default Indexing Throughput: throughput < 5k/sec @ ~14GB
  • 47. Taking Control – Max Merged Segment Size (maxMergedSegmentMB, 5120 in the slide 44 configuration)
    - Lower: higher indexing throughput, smaller segments
    - Higher: better search latency (depends), more merges
  • 48. Taking Control – Lowering Max Merged Size: slide 44 configuration with <int name="maxMergedSegmentMB">512</int>
  • 49. Taking Control – Lowering Max Segment Size: throughput < 5k/sec @ ~15.5GB, an 11% throughput increase
  • 50. Taking Control – Merge At Once (maxMergeAtOnce, 10 in the slide 44 configuration)
    - Lower: better search latency (depends)
    - Higher: higher indexing throughput
  • 51. Taking Control – Lowering Merge At Once: slide 44 configuration with <int name="maxMergeAtOnce">2</int>
  • 52. Taking Control – Lowering Merge At Once: throughput < 5k/sec @ ~13GB, an 8% throughput decrease
  • 53. Taking Control – Merge At Once Explicit (maxMergeAtOnceExplicit, 30 in the slide 44 configuration): controls the number of segments merged at once during a force merge
  • 54. Taking Control – Segments Per Tier (segmentsPerTier, 10 in the slide 44 configuration)
    - A lower value means more merging but fewer segments
    - Together with maxMergeAtOnce, it can smooth out I/O spikes
    - For better indexing throughput, set maxMergeAtOnce < segmentsPerTier
  • 55. Taking Control – Combined Together: slide 44 configuration with
    <int name="maxMergeAtOnce">30</int>
    <int name="segmentsPerTier">30</int>
    <int name="maxMergedSegmentMB">512</int>
  • 56. Taking Control – Combined Together: throughput < 5k/sec @ ~15GB, but look at the read difference
  • 58. Taking Control – Default vs Combined Read/Write: default settings vs. combined settings
  • 59. Taking Control – Reclaim Deletes Weight (reclaimDeletesWeight, 2.0 in the slide 44 configuration)
    - Controls the importance of merging segments with deleted documents
    - Increase it to prioritize merging segments with deleted documents
  • 60. Taking Control – No CFS Ratio (noCFSRatio, 0.1 in the slide 44 configuration)
    - Controls the compound file system (CFS) segment ratio
    - To disable CFS completely, set it to 0.0
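One way to read noCFSRatio together with maxCFSSegmentSizeMB is as a rule for when a newly merged segment is written in compound format. The sketch below follows my understanding of how Lucene's merge policy makes that decision and should be treated as an approximation, not the library's code; the function name and the megabyte-based arguments are assumptions.

```python
def use_compound_file(merged_mb, total_index_mb,
                      no_cfs_ratio=0.1, max_cfs_segment_mb=2048):
    """Approximate rule for writing a merged segment in compound (CFS) format.

    Defaults mirror the values shown in the slide configuration.
    """
    if no_cfs_ratio == 0.0:
        return False                  # CFS completely disabled
    if merged_mb > max_cfs_segment_mb:
        return False                  # segment too large for CFS
    if no_cfs_ratio >= 1.0:
        return True                   # always use CFS (within the size cap)
    # Otherwise only segments that are small relative to the whole index use CFS.
    return merged_mb <= no_cfs_ratio * total_index_mb


print(use_compound_file(50, 1000))    # small segment in a 1000 MB index
print(use_compound_file(500, 1000))   # half of the index, too large a share
```

Under this reading, a ratio of 0.1 keeps only segments of up to 10% of the index size in compound format, which limits the number of open files for small segments without paying the CFS overhead on large ones.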
  • 61. Taking Control – Merge Scheduler: maxMergeCount controls the maximum number of concurrent merges
  • 64. Taking Control – Merge Scheduler: maxThreadCount controls the number of threads dedicated to merging
    - For spinning drives, set maxThreadCount to 1
    - For SSDs, set maxThreadCount to min(4, #CPUs / 2)
    Merge Scheduler
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <int name="maxMergeCount">4</int>
      <int name="maxThreadCount">4</int>
    </mergeScheduler>
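The maxThreadCount guidance above reduces to a small helper. This is a sketch of the slide's rule of thumb only; the function name is hypothetical, and the lower bound of 1 for single-core machines is my own assumption, since the slide's formula would otherwise yield 0.

```python
import os


def recommended_max_thread_count(spinning_disk, cpu_count=None):
    """Rule of thumb from the slide: 1 for spinning drives, min(4, #CPUs / 2) for SSDs."""
    if spinning_disk:
        return 1
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    # max(1, ...) is an added assumption so that a 1-CPU machine still gets one thread.
    return max(1, min(4, cpus // 2))


print(recommended_max_thread_count(spinning_disk=True, cpu_count=16))
print(recommended_max_thread_count(spinning_disk=False, cpu_count=16))
```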
  • 69. Optimize aka Force Merge
    - Forces a segment merge – usually very expensive
    - The desired number of segments can be specified
    - Done on all shards at the same time (by default)
    - Can be very bad or very good, depending on the use case
    $ curl 'http://solr:8983/solr/lr/update?optimize=true&numSegments=1&waitFlush=false'
  • 73. Force Merge – The Good
    - Improves search speed (fewer segments)
    - Removes deleted documents
    - Shrinks the index by pruning duplicated data
    - Reduces the number of files in use
  • 78. Force Merge – The Bad
    - Invalidates the operating system I/O cache
    - Very expensive to perform – rewrites all segments
    - Not efficient on changing data
    - May cause performance issues
    - Will cause a temporary increase in disk usage (up to 3x)
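The disk usage warning suggests checking free space before triggering a force merge. The sketch below assumes that "up to 3x" means the index may temporarily occupy three times its current size, so roughly twice the index size must be free; both that interpretation and the function name are assumptions.

```python
def has_headroom_for_force_merge(index_bytes, free_bytes, peak_factor=3.0):
    """Return True if free space covers the assumed temporary growth.

    peak_factor=3.0 models the 'up to 3x' figure: the index may peak at
    3x its size, so (peak_factor - 1) * index_bytes of extra space is needed.
    """
    extra_needed = (peak_factor - 1.0) * index_bytes
    return free_bytes >= extra_needed


GB = 1024 ** 3
print(has_headroom_for_force_merge(10 * GB, 25 * GB))   # 20 GB needed, 25 GB free
print(has_headroom_for_force_merge(10 * GB, 15 * GB))   # 20 GB needed, 15 GB free
```

On a real node the free space could come from `shutil.disk_usage(path).free`, with `path` pointing at the volume that holds the index.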
  • 79. Force Merge – SolrCloud Performance Example
  • 83. Force Merge – Legacy (Solr master with three Solr slaves)
    - Index documents on the master server
    - Force merge on the master server
    - Replicate after the optimize is done (the slaves pull after the optimize)
  • 84. Force Merge – SolrCloud (Solr 7 – pull replicas)
    - Create collection
    - Force merge
    - Solr will do the rest
    Layout: four Solr nodes hosting Primary 1, Primary 2, Pull Replica 1 and Pull Replica 2
  • 89. Force Merge – SolrCloud (NRT replicas, pre 7.0)
    - Ask yourself if you really need force merge
    - Create collection on part of the nodes (Primary 1, Primary 2)
    - Index documents
    - Force merge
    - Create replicas (Replica 1, Replica 2)
  • 91. Specialized Merge Policy Example – Sorting
    Sorting Merge Policy Factory Example
    <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
      <str name="sort">timestamp desc</str>
      <str name="wrapper.prefix">inner</str>
      <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
      <int name="inner.maxMergeAtOnce">10</int>
      <int name="inner.segmentsPerTier">10</int>
      <double name="inner.noCFSRatio">0.1</double>
    </mergePolicyFactory>
    Pre-sorts data during merge for:
    - faster range queries
    - faster data retrieval
    - possibility of early query termination
    - convenient for time based data
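Early termination follows from the sort order: if every segment is already sorted by timestamp descending, a "newest N" query needs to look at no more than the first N entries of each segment. The sketch below illustrates the idea on plain lists of timestamps; it is not how Solr implements it, and `top_n_newest` is a hypothetical name.

```python
import heapq
from itertools import islice


def top_n_newest(segments, n):
    """Return the n largest timestamps, reading at most n entries per segment.

    Assumes each segment is a list of timestamps sorted in descending order,
    mirroring the 'timestamp desc' sort in the configuration above.
    """
    heads = [list(islice(seg, n)) for seg in segments]   # early termination per segment
    return heapq.nlargest(n, (ts for head in heads for ts in head))


segments = [[9, 7, 5, 1], [8, 6, 2]]
newest = top_n_newest(segments, 3)
print(newest)
```

Without the pre-sorted segments, every document in every segment would have to be examined to find the newest entries.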
  • 92. http://sematext.com/jobs You love like we do? You want to work with ? Want to work with open source? You want to do fun stuff?
  • 93. Get in touch: Rafał, rafal.kuc@sematext.com, @kucrafal. http://sematext.com, @sematext, http://sematext.com/jobs. Come talk to us at the booth.