Solr vs. Elasticsearch 
Case by Case 
Alexandre Rafalovitch @arafalov 
@SolrStart 
www.solr-start.com
Meet the FRENEMIES 
Friends (common) 
• Based on Lucene 
• Full-text search 
• Structured search 
• Queries, filters, caches 
• Facets/stats/enumerations 
• Cloud-ready 
Elasticsearch*
* Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
Enemies (differences) 
• Download size 
• AdminUI vs. Marvel 
• Configuration vs. Magic 
• Nested documents 
• Chains vs. Plugins 
• Types and Rivers 
• OpenSource vs. Commercial 
• Etc.
This used to be Solr (now in Lucene/ES) 
• Field types
• Dismax/eDismax
• Many of the analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…)
• Multi-valued field cache
• ….
(source: http://heliosearch.org/lucene-solr-history/ )
• Disclaimer: Nowadays, Elasticsearch hires awesome Lucene hackers
Basically - sisters 
Source: https://www.flickr.com/photos/franzfume/11530902934/
[Chart: Solr vs. Elasticsearch, axis 0 to 300; series: Download, Expanded, First run]
Solr: Chubby or Rubenesque? 
[Chart: Elasticsearch+plugins vs. Solr, axis 0.00 to 300.00; segments: Code, Examples, Documentation, ES-Admin, ES-ICU, Extract/Tika, UIMA, Map-Reduce, Test Framework]
Elasticsearch setup 
Source: https://www.flickr.com/photos/deborah-is-lola/6815624125/
• Admin UI:
bin/plugin -i elasticsearch/marvel/latest
• Tika/Extraction:
bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.4.1
• ICU (Unicode components):
bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1
• JDBC River (like DataImportHandler subset):
bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip
• JavaScript scripting support:
bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.4.1
• On each node….
• Without dependency management (jars = rabbits)
Index a document - Elasticsearch 
1. Set up an index/collection
2. Define fields and types
3. Index content (using Marvel Sense):
POST /test1/hello 
{ 
"msg": "Happy birthday", 
"names": ["Alex", "Mark"], 
"when": "2014-11-01T10:09:08" 
} 
Alternative: 
PUT /test1/hello/id1 
{ 
"msg": "Happy birthday", 
"names": ["Alex", "Mark"], 
"when": "2014-11-01T10:09:08" 
} 
An index, type and definitions are created automatically 
So, where is our document:
GET /test1/hello/_search
{ 
"took": 1, 
"timed_out": false, 
"_shards": { 
"total": 5, 
"successful": 5, 
"failed": 0 
}, 
"hits": { 
"total": 1, 
"max_score": 1, 
"hits": [ 
{ 
"_index": "test1", 
"_type": "hello", 
"_id": "AUmIk4LDF4XvfpxnVJ2g", 
"_score": 1, 
"_source": { 
"msg": "Happy birthday", 
"names": [ 
"Alex", 
"Mark" 
], 
"when": "2014-11-01T10:09:08" 
}} 
] 
}}
Behind the scenes 
GET /test1/hello/_search 
….. 
{ 
"_index": "test1", 
"_type": "hello", 
"_id": "AUmIk4LDF4XvfpxnVJ2g", 
"_score": 1, 
"_source": { 
"msg": "Happy birthday", 
"names": [ 
"Alex", 
"Mark" 
], 
"when": "2014-11-01T10:09:08" 
} 
…. 
GET /test1/hello/_mapping
{ 
"test1": { 
"mappings": { 
"hello": { 
"properties": { 
"msg": { 
"type": "string" 
}, 
"names": { 
"type": "string" 
}, 
"when": { 
"type": "date", 
"format": "dateOptionalTime" 
}}}}}}
Basic search in Elasticsearch 
GET /test1/hello/_search 
….. 
{ 
"_index": "test1", 
"_type": "hello", 
"_id": "AUmIk4LDF4XvfpxnVJ2g", 
"_score": 1, 
"_source": { 
"msg": "Happy birthday", 
"names": [ 
"Alex", 
"Mark" 
], 
"when": "2014-11-01T10:09:08" 
} 
…. 
• GET /test1/hello/_search?q=foobar – no results
• GET /test1/hello/_search?q=Alex – YES on names?
• GET /test1/hello/_search?q=alex – YES lower case
• GET /test1/hello/_search?q=happy – YES on msg?
• GET /test1/hello/_search?q=2014 – YES???
• GET /test1/hello/_search?q="birthday alex" – YES
• GET /test1/hello/_search?q="birthday mark" – NO
Issues:
1. Where are we actually searching?
2. Why do lower-case searches work?
3. What's so special about Alex?
All about _all and why strings are tricky 
• By default, we search in the field _all 
• What's an _all field in Solr terms? 
<field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/> 
<copyField source="*" dest="_all"/> 
• And the default mapping for Elasticsearch "string" type is like: 
<fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0" > 
<analyzer> 
<tokenizer class="solr.StandardTokenizerFactory"/> 
<filter class="solr.LowerCaseFilterFactory"/> 
</analyzer> 
</fieldType> 
• Elasticsearch equivalent to Solr's solr.StrField is: 
{"type" : "string", "index" : "not_analyzed"}
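The three "Issues" from the previous slide follow from the two definitions above. A minimal Python sketch, not Elasticsearch code: `analyze` is a rough regex stand-in for StandardTokenizer plus LowerCaseFilter, and `build_all` assumes the field values are copied into _all in document order with no position gap (the positionIncrementGap="0" above). Both function names are hypothetical.

```python
import re

def analyze(text):
    """Rough stand-in for StandardTokenizer + LowerCaseFilter (an approximation)."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def build_all(doc):
    """Copy every field value into one token stream, with no gap between values."""
    tokens = []
    for value in doc.values():
        values = value if isinstance(value, list) else [value]
        for v in values:
            tokens.extend(analyze(str(v)))
    return tokens

def phrase_match(tokens, phrase):
    """True if the analyzed phrase occurs at consecutive positions."""
    wanted = analyze(phrase)
    n = len(wanted)
    return any(tokens[i:i + n] == wanted for i in range(len(tokens) - n + 1))

doc = {
    "msg": "Happy birthday",
    "names": ["Alex", "Mark"],
    "when": "2014-11-01T10:09:08",
}
all_tokens = build_all(doc)

print("alex" in all_tokens)                        # lower-cased at index time
print("2014" in all_tokens)                        # the date was copied as text
print(phrase_match(all_tokens, "birthday alex"))   # adjacent across fields
print(phrase_match(all_tokens, "birthday mark"))   # "alex" sits in between
```

With a zero position gap, the last token of msg and the first token of names are adjacent, which is why the phrase "birthday alex" matches and "birthday mark" does not.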
Can Solr do the same kind of magic? 
• curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type: application/json' -d @msg.json
curl 'http://localhost:8983/solr/collection1/select'
{ 
"responseHeader":{ 
"status":0, 
"QTime":18, 
"params":{}}, 
"response":{"numFound":1,"start":0,"docs":[ 
{ 
"msg":["Happy birthday"], 
"names":["Alex", "Mark"], 
"when":["2014-11-01T10:09:08Z"], 
"_id":"e9af682d-e775-42f2-90a5-c932b5fbb691", 
"_version_":1484096406012559360}] 
}} 
curl 'http://localhost:8983/solr/collection1/schema/fields'
{ 
"responseHeader":{ 
"status":0, 
"QTime":1}, 
"fields":[ 
{"name":"_all", "type":"es_string", 
"multiValued":true, 
"indexed":true, "stored":false}, 
{"name":"_id", "type":"string", 
"multiValued":false, 
"indexed":true, "required":true, 
"stored":true, "uniqueKey":true}, 
{"name":"_version_", "type":"long", 
"indexed":true, "stored":true}, 
{"name":"msg", "type":"es_string"}, 
{"name":"names", "type":"es_string"}, 
{"name":"when", "type":"tdates"}]} 
• Output slightly reformatted
Nearly the same magic 
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> 
<!-- UUIDUpdateProcessorFactory will generate an id if none is 
present in the incoming document --> 
<processor class="solr.UUIDUpdateProcessorFactory" /> 
<processor class="solr.LogUpdateProcessorFactory"/> 
<processor class="solr.DistributedUpdateProcessorFactory"/> 
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/> 
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/> 
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/> 
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/> 
<processor class="solr.ParseDateFieldUpdateProcessorFactory"> 
<arr name="format"> 
<str>yyyy-MM-dd'T'HH:mm:ss</str> 
<str>yyyyMMdd'T'HH:mm:ss</str> 
</arr> 
</processor> 
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory"> 
<str name="defaultFieldType">es_string</str> 
<lst name="typeMapping"> 
<str name="valueClass">java.lang.Boolean</str> 
<str name="fieldType">booleans</str> 
</lst> 
<lst name="typeMapping"> 
<str name="valueClass">java.util.Date</str> 
<str name="fieldType">tdates</str> 
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain> 
Not quite the same magic:
• URP chain happens before copyField
• Date/Ints are converted first
• copyField converts content back to string
• _all field also gets copy of _id and _version_
• All auto-mapped fields HAVE to be multivalued
• No (ES-style) types, just collections
• Unable to reproduce cross-field search
• Still rough around the edges
• Requires dynamic schema, so adding new types becomes a challenge
• Auto-mapping is NOT recommended for production
• Dynamic fields solution is still more mature
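The type-guessing part of the chain can be pictured as a sequence of parsers applied in order, followed by a lookup from the resulting value class to a field type. A toy Python sketch under stated assumptions: it is not Solr code, the function names are hypothetical, and only the two typeMapping entries visible on the slide (Boolean and Date) are modelled, so every other value falls back to the default type.

```python
from datetime import datetime

# Python equivalents of the two date formats configured on the slide.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y%m%dT%H:%M:%S"]

def parse_boolean(v):
    if isinstance(v, str) and v.lower() in ("true", "false"):
        return v.lower() == "true"
    return v

def parse_long(v):
    if isinstance(v, str):
        try:
            return int(v)
        except ValueError:
            return v
    return v

def parse_double(v):
    if isinstance(v, str):
        try:
            return float(v)
        except ValueError:
            return v
    return v

def parse_date(v):
    if isinstance(v, str):
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(v, fmt)
            except ValueError:
                continue
    return v

# Same order as the Parse* processors in the chain above.
CHAIN = [parse_boolean, parse_long, parse_double, parse_date]

# Only the mappings shown on the slide; anything else gets the default.
TYPE_MAPPING = {bool: "booleans", datetime: "tdates"}
DEFAULT_FIELD_TYPE = "es_string"

def run_chain(value):
    """Pass a raw value through every parser; each one converts or leaves it alone."""
    for step in CHAIN:
        value = step(value)
    return value

def field_type(value):
    """Pick the field type for a new field from the class of the parsed value."""
    return TYPE_MAPPING.get(type(run_chain(value)), DEFAULT_FIELD_TYPE)

print(field_type("2014-11-01T10:09:08"))
print(field_type("true"))
print(field_type("Happy birthday"))
```

Because conversion happens in the chain, before copyField runs, the value that reaches _all is already typed; that ordering is the first difference listed above.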
Explicit mapping - Solr 
• In schema.xml (or dynamic equivalent) 
• Uses Java Factories 
• Related content (e.g. stopwords) is usually in separate files (recently added REST-managed)
• French example: 
<fieldType name="text_fr" class="solr.TextField" 
positionIncrementGap="100"> 
<analyzer> 
<tokenizer class="solr.StandardTokenizerFactory"/> 
<filter class="solr.ElisionFilterFactory" ignoreCase="true" 
articles="lang/contractions_fr.txt"/> 
<filter class="solr.LowerCaseFilterFactory"/> 
<filter class="solr.StopFilterFactory" ignoreCase="true" 
words="lang/stopwords_fr.txt" format="snowball" /> 
<filter class="solr.FrenchLightStemFilterFactory"/> 
</analyzer> 
</fieldType>
Explicit mapping - Elasticsearch 
• Created through PUT command 
• Also can be stored in config/default-mapping.json or config/mappings/[index_name] 
• Mappings for all types in one index should be compatible to avoid problems 
• Usually uses predefined mapping names. Has many names, including for languages
• Explicit mapping is through named cross-references, rather than duplicated in-place stack (like Solr)
• Related content is usually also in the definition. Sometimes in file (e.g. stopwords_path – needs to be on all nodes)
• French example (next slide):
Explicit mapping – Elasticsearch - French 
{ 
"settings": { 
"analysis": { 
"filter": { 
"french_elision": { 
"type": "elision", 
"articles": [ "l", "m", "t", "qu", 
"n", "s", "j", "d", "c", "jusqu", "quoiqu", 
"lorsqu", "puisqu" 
] 
}, 
"french_stop": { 
"type": "stop", 
"stopwords": "_french_" 
}, 
"french_keywords": { 
"type": "keyword_marker", 
"keywords": [] 
}, 
"french_stemmer": { 
"type": "stemmer", 
"language": "light_french" 
} 
}, 
…. 
"analyzer": { 
"french": { 
"tokenizer": "standard", 
"filter": [ 
"french_elision", 
"lowercase", 
"french_stop", 
"french_keywords", 
"french_stemmer" 
] 
} 
} 
} 
} 
}
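The "named cross-references" style can be sketched as a registry of filters plus an analyzer that lists filter names. This is an illustrative Python toy, not Elasticsearch behaviour: tokenization is a plain whitespace split, the stop list is a tiny hypothetical subset rather than the real `_french_` list, and the keyword-marker and stemmer steps are left out.

```python
# Articles copied from the french_elision definition above.
ARTICLES = {"l", "m", "t", "qu", "n", "s", "j", "d", "c",
            "jusqu", "quoiqu", "lorsqu", "puisqu"}

# Hypothetical, tiny stand-in for the built-in _french_ stop list.
STOPWORDS = {"le", "la", "les", "de", "et"}

def elision(tokens):
    """Drop a leading article joined by an apostrophe (l'avion -> avion)."""
    out = []
    for tok in tokens:
        head, sep, tail = tok.partition("'")
        out.append(tail if sep and head.lower() in ARTICLES else tok)
    return out

def lowercase(tokens):
    return [t.lower() for t in tokens]

def stop(tokens):
    return [t for t in tokens if t not in STOPWORDS]

# Filters are defined once under a name...
FILTERS = {"french_elision": elision, "lowercase": lowercase, "french_stop": stop}

# ...and the analyzer refers to them by name, in order.
FRENCH_ANALYZER = ["french_elision", "lowercase", "french_stop"]

def analyze(text, filter_names=FRENCH_ANALYZER):
    tokens = text.split()  # simplification of the standard tokenizer
    for name in filter_names:
        tokens = FILTERS[name](tokens)
    return tokens

print(analyze("L'avion et le train"))
```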
Default analyzer - Elasticsearch 
Indexing
1. the analyzer defined in the field mapping, else
2. the analyzer defined in the _analyzer field of the document, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer
Query
1. the analyzer defined in the query itself, else
2. the analyzer defined in the field mapping, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer
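Both lists are "first one that is set wins" chains. A minimal Python sketch of that lookup order; the function and parameter names are hypothetical and only mirror the six steps on the slide.

```python
def first_defined(candidates):
    """Return the first candidate that is set, else the standard analyzer."""
    for candidate in candidates:
        if candidate is not None:
            return candidate
    return "standard"

def resolve_index_analyzer(field_mapping=None, doc_analyzer=None,
                           type_default=None, index_default=None,
                           node_default=None):
    # Steps 1-5 of the indexing list, in order; step 6 is the fallback.
    return first_defined([field_mapping, doc_analyzer, type_default,
                          index_default, node_default])

def resolve_query_analyzer(query_analyzer=None, field_mapping=None,
                           type_default=None, index_default=None,
                           node_default=None):
    # Steps 1-5 of the query list, in order; step 6 is the fallback.
    return first_defined([query_analyzer, field_mapping, type_default,
                          index_default, node_default])

print(resolve_index_analyzer())
print(resolve_index_analyzer(index_default="french"))
print(resolve_query_analyzer(query_analyzer="keyword", field_mapping="french"))
```

Note the asymmetry: a field indexed with the mapping's analyzer can still be queried with a different analyzer named in the query itself.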
Index many documents – Elasticsearch 
POST /test3/entries/_bulk
{ "index": {"_id": "1" } }
{"msg": "Hello", "names": ["Jack", "Jill"]}
{ "index": {"_id": "2" } }
{"msg": "Goodbye", "names": "Jason"}
{ "delete" : {"_id" : "3" } }
NOTE: Rivers (similar to DIH) MAY be deprecated. Use Logstash instead (180Mb on disk, including 2 jRuby runtimes !!!)
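The bulk body is newline-delimited JSON: one action line, optionally followed by one source line, each on a single line, with a trailing newline at the end. A small Python sketch that builds the body shown above; `bulk_body` is a hypothetical helper and nothing is sent to a server.

```python
import json

def bulk_body(actions):
    """Build a bulk request body from (action, metadata, source) triples.

    source is None for actions such as delete that carry no document.
    """
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
])
print(body)
```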
Index many documents - Solr 
JSON - simple
[ 
{ 
"_id": "1", 
"msg": "Hello", 
"names": ["Jack", "Jill"] 
}, 
{ 
"_id": "2", 
"msg": "Goodbye", 
"names": "Jason" 
} 
] 
JSON – with commands
{ 
"add": { "doc": { 
"_id": "1", 
"msg": "Hello", 
"names": ["Jack", "Jill"] 
} }, 
"add": { "doc": { 
"_id": "2", 
"msg": "Goodbye", 
"names": "Jason" 
} }, 
"delete": { "_id":3 } 
} 
Also:
• CSV
• XML
• XML+XSLT
• JSON+transform (4.10)
• DataImportHandler
• Map-Reduce
External tools
• Logstash (owned by ES)
Comparing search - Search 
• Same but different 
• Same: vast majority of the features 
come from Lucene 
• Different: representation of search 
parameters 
• Solr: URL query with many – cryptic – 
parameters 
• Elasticsearch: 
• Search lite: URL query with a 
limited set of parameters (basic 
Lucene query) 
• Query DSL: JSON with multi-leveled 
structure 
[Diagram labels: Lucene, Impl, ES only, Solr only]
Search compared – Simple searches 
{ 
"msg": "Happy birthday", 
"names": ["Alex", "Mark"], 
"when": "2014-11-01T10:09:08" 
} 
{ 
"msg": "Happy New Year", 
"names": ["Jack", "Jill"], 
"when": "2015-01-01T00:00:01" 
} 
{ 
"msg": "Goodbye", 
"names": ["Jack", "Jason"], 
"when": "2015-06-01T00:00:00" 
} 
Elasticsearch (Marvel Sense GET):
• /test1/hello/_search – all
• /test1/hello/_search?q=happy birthday Alex – 2
• /test1/hello/_search?q=names:Alex – 1
Solr (GET http://localhost:8983/solr/…):
• /collection1/select – all
• /collection1/select?q=happy birthday Alex – 2
• /collection1/select?q=names:Alex – 1
Search Compared – Query DSL 
Elasticsearch
GET /test1/hello/_search 
{ 
"query": { 
"query_string": { 
"fields": ["msg^5", "names"], 
"query": "happy birthday Alex", 
"minimum_should_match": "100%" 
} 
} 
} 
Solr 
…/collection1/select 
?q=happy birthday Alex 
&defType=dismax 
&qf=msg^5 names 
&mm=100%
Search Compared – Query DSL - combo 
Elasticsearch
GET /test1/hello/_search 
{ 
"size" : 1, 
"query": { 
"filtered": { 
"query": { 
"query_string": { 
"query": "jack" 
}}, 
"filter": { 
"range": { 
"when": { 
"gte": "now" 
}}}}}} 
Solr 
…/collection1/select 
?q=jack 
&fq=when:[NOW TO *] 
&rows=1 
Search future entries about Jack. Return only the best one.
Parent/Child structures 
Elasticsearch
Inner objects
• Mapping: Object
• Dynamic mapping (default)
• NOT separate Lucene docs
• Map to flattened multivalued fields
• Search matches against value from ANY of inner objects
{
"followers.age": [19, 26],
"followers.name": [alex, lisa]
}
Nested objects
• Mapping: nested
• Explicit mapping
• Lucene block storage
• Inner documents are hidden
• Cannot return inner docs only
• Can do nested & inner
Parent and Child
• Mapping: _parent
• Explicit references
• Separate documents
• In-memory join
• SLOW
Solr
Nested objects
• Lucene block storage
• All documents are visible
• Child JSON is less natural
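The inner-object behaviour is easiest to see by doing the flattening by hand. A Python sketch, not Elasticsearch code: `flatten_inner` and `matches` are hypothetical helpers, and pairing alex with 19 and lisa with 26 is an assumption, since the slide shows only the flattened arrays.

```python
def flatten_inner(field, objects):
    """Flatten a list of inner objects into multivalued dotted fields."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def matches(flat, criteria):
    """Each criterion matches if ANY value of that flattened field equals it."""
    return all(value in flat.get(name, []) for name, value in criteria.items())

followers = [
    {"name": "alex", "age": 19},  # assumed pairing
    {"name": "lisa", "age": 26},  # assumed pairing
]
flat = flatten_inner("followers", followers)
print(flat)

# No single follower is named lisa AND aged 19, yet the flattened document matches.
print(matches(flat, {"followers.name": "lisa", "followers.age": 19}))
```

The association between values of one inner object is lost after flattening; keeping it is what the nested mapping (separate hidden Lucene documents in one block) is for.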
Cloud deployment – quick take 
1. General concepts are similar: 
• Node discovery 
• Sharding 
• Replication 
• Routing 
2. Implementations are very, very different (layer above Lucene)
3. Solr uses Apache Zookeeper
4. Elasticsearch has its own algorithms
5. No time to discuss
6. Let's focus on the critical path: Node discovery/cloud-state management
7. Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests
Jepsen test of Zookeeper
Use Zookeeper. It’s mature, well-designed, and battle-tested.
Jepsen test of Elasticsearch 
If you are an Elasticsearch user (as I am): good luck.
Innovator’s dilemma 
• Solr's usual attitude 
• An amazingly useful product for many different uses 
• And wants everybody to know it 
• …Right in the collection1 example 
• “You will need all this eventually, might as well learn it first” 
• Elasticsearch is small and shiny (“trust us, the magic exists”) 
• Elasticsearch + Logstash + Kibana => power-punch triple combo 
• Especially when comparing to Solr (and not another commercial solution) 
• Feature release process 
• Elasticsearch: kimchy: “LGTM” (Looks good to me) 
• Solr: full Apache process around it 
• Solr – needs to buckle down and focus on onboarding experience 
• Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)
Solr vs. Elasticsearch 
Case by Case 
Alexandre Rafalovitch 
www.solr-start.com 
@arafalov 
@SolrStart

Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN

  • 1. Solr vs. Elasticsearch Case by Case Alexandre Rafalovitch @arafalov @SolrStart www.solr-start.com
  • 2. Meet the FRENEMIES Friends (common) • Based on Lucene • Full-text search • Structured search • Queries, filters, caches • Facets/stats/enumerations • Cloud-ready Elas%csearch* * Elas%csearch is a trademark of Elas%csearch BV, registered in the U.S. and in other countries. Enemies (differences) • Download size • AdminUI vs. Marvel • Configuration vs. Magic • Nested documents • Chains vs. Plugins • Types and Rivers • OpenSource vs. Commercial • Etc.
  • 3. This used to be Solr (now in Lucene/ES) • Field types • Dismax/eDismax • Many of analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…) • Mul%-­‐valued field cache • …. (source: hOp://heliosearch.org/lucene-­‐solr-­‐history/ ) • Disclaimer: Nowadays, Elas%csearch hires awesome Lucene hackers
  • 4. Basically - sisters Source: hOps://www.flickr.com/photos/franzfume/11530902934/ First run Expanded Download 300 250 200 150 100 50 0 Solr Elas%csearch
  • 5. Solr: Chubby or Rubenesque? 0.00 50.00 100.00 150.00 200.00 250.00 300.00 Elas%csearch+plugins Solr Code Examples Documenta%on ES-­‐Admin ES-­‐ICU Extract/Tika UIMA Map-­‐Reduce Test Framework
  • 6. Elasticsearch setup Source: hOps://www.flickr.com/photos/deborah-­‐is-­‐lola/6815624125/ • Admin UI: bin/plugin -i elasticsearch/marvel/latest • Tika/Extraction: bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/ 2.4.1 • ICU (Unicode components): bin/plugin -install elasticsearch/elasticsearch-analysis-icu/ 2.4.1 • JDBC River (like DataImportHandler subset): bin/plugin --install jdbc --url http://xbib.org/repository/ org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/ 1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip • JavaScript scripting support: bin/plugin -install elasticsearch/elasticsearch-lang-javascript/ 2.4.1 • On each node…. • Without dependency management (jars = rabbits)
  • 7. Index a document - Elasticsearch 1. Setup an index/collection 2. Define fields and types 3. Index content (using Marvel sense): POST /test1/hello { "msg": "Happy birthday", "names": ["Alex", "Mark"], "when": "2014-11-01T10:09:08" } Alternative: PUT /test1/hello/id1 { "msg": "Happy birthday", "names": ["Alex", "Mark"], "when": "2014-11-01T10:09:08" } An index, type and definitions are created automatically So, where is our document: GET /test1/hello/_search { "took": 1, "timed_out": false, "_shards": { "total": 5, "successful": 5, "failed": 0 }, "hits": { "total": 1, "max_score": 1, "hits": [ { "_index": "test1", "_type": "hello", "_id": "AUmIk4LDF4XvfpxnVJ2g", "_score": 1, "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" }} ] }}
  • 8. Behind the scenes GET /test1/hello/_search ….. { "_index": "test1", "_type": "hello", "_id": "AUmIk4LDF4XvfpxnVJ2g", "_score": 1, "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" } …. GET /test1/hello/_mapping { "test1": { "mappings": { "hello": { "properties": { "msg": { "type": "string" }, "names": { "type": "string" }, "when": { "type": "date", "format": "dateOptionalTime" }}}}}}
  • 9. Basic search in Elasticsearch GET /test1/hello/_search ….. { "_index": "test1", "_type": "hello", "_id": "AUmIk4LDF4XvfpxnVJ2g", "_score": 1, "_source": { "msg": "Happy birthday", "names": [ "Alex", "Mark" ], "when": "2014-11-01T10:09:08" } …. • GET /test1/hello/_search?q=foobar – no results • GET /test1/hello/_search?q=Alex – YES on names? • GET /test1/hello/_search?q=alex – YES lower case • GET /test1/hello/_search?q=happy – YES on msg? • GET /test1/hello/_search?q=2014 – YES??? • GET /test1/hello/_search?q="birthday alex" – YES • GET /test1/hello/_search?q="birthday mark" – NO Issues: 1. Where are we actually searching? 2. Why are lower-­‐case searches work? 3. What's so special about Alex?
  • 10. All about _all and why strings are tricky • By default, we search in the field _all • What's an _all field in Solr terms? <field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/> <copyField source="*" dest="_all"/> • And the default mapping for Elasticsearch "string" type is like: <fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0" > <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> • Elasticsearch equivalent to Solr's solr.StrField is: {"type" : "string", "index" : "not_analyzed"}
  • 11. Can Solr do the same kind of magic? • curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type: application/json' -d @msg.json curl 'hOp://localhost:8983/solr/collec%on1/select' { "responseHeader":{ "status":0, "QTime":18, "params":{}}, "response":{"numFound":1,"start":0,"docs":[ { "msg":["Happy birthday"], "names":["Alex", "Mark"], "when":["2014-11-01T10:09:08Z"], "_id":"e9af682d-e775-42f2-90a5-c932b5fbb691", "_version_":1484096406012559360}] }} curl 'hOp://localhost:8983/solr/collec%on1/schema/fields' { "responseHeader":{ "status":0, "QTime":1}, "fields":[ {"name":"_all", "type":"es_string", "multiValued":true, "indexed":true, "stored":false}, {"name":"_id", "type":"string", "multiValued":false, "indexed":true, "required":true, "stored":true, "uniqueKey":true}, {"name":"_version_", "type":"long", "indexed":true, "stored":true}, {"name":"msg", "type":"es_string"}, {"name":"names", "type":"es_string"}, {"name":"when", "type":"tdates"}]} • Output slightly re-­‐formated
  • 12. Nearly the same magic <updateRequestProcessorChain name="add-unknown-fields-to-the-schema"> <!-- UUIDUpdateProcessorFactory will generate an id if none is present in the incoming document --> <processor class="solr.UUIDUpdateProcessorFactory" /> <processor class="solr.LogUpdateProcessorFactory"/> <processor class="solr.DistributedUpdateProcessorFactory"/> <processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/> <processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/> <processor class="solr.ParseLongFieldUpdateProcessorFactory"/> <processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/> <processor class="solr.ParseDateFieldUpdateProcessorFactory"> <arr name="format"> <str>yyyy-MM-dd'T'HH:mm:ss</str> <str>yyyyMMdd'T'HH:mm:ss</str> </arr> </processor> <processor class="solr.AddSchemaFieldsUpdateProcessorFactory"> <str name="defaultFieldType">es_string</str> <lst name="typeMapping"> <str name="valueClass">java.lang.Boolean</str> <str name="fieldType">booleans</str> </lst> <lst name="typeMapping"> <str name="valueClass">java.util.Date</str> <str name="fieldType">tdates</str> </lst> <processor class="solr.RunUpdateProcessorFactory"/> </updateRequestProcessorChain> Not quite the same magic: • URP chain happens before copyField • Date/Ints are converted first • copyText converts content back to string • _all field also gets copy of _id and _version • All auto-­‐mapped fields HAVE to be mul%valued • No (ES-­‐Style) types, just collec%ons • Unable to reproduce cross-­‐field search • S%ll rough around the edges • Requires dynamic schema, so adding new types becomes a challenge • Auto-­‐mapping is NOT recommended for produc%on • Dynamic fields solu%on is s%ll more mature
  • 13. Explicit mapping - Solr • In schema.xml (or dynamic equivalent) • Uses Java Factories • Related content (e.g. stopwords) is usually in separate files (REST-managed resources were added recently) • French example: <fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.ElisionFilterFactory" ignoreCase="true" articles="lang/contractions_fr.txt"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_fr.txt" format="snowball" /> <filter class="solr.FrenchLightStemFilterFactory"/> </analyzer> </fieldType>
  • 14. Explicit mapping - Elasticsearch • Created through PUT command • Also can be stored in config/default-mapping.json or config/mappings/[index_name] • Mappings for all types in one index should be compatible to avoid problems • Usually uses predefined mapping names. Has many names, including for languages • Explicit mapping is through named cross-references, rather than duplicated in-place stack (like Solr) • Related content is usually also in the definition. Sometimes in file (e.g. stopwords_path – needs to be on all nodes) • French example (next slide):
  • 15. Explicit mapping – Elasticsearch - French { "settings": { "analysis": { "filter": { "french_elision": { "type": "elision", "articles": [ "l", "m", "t", "qu", "n", "s", "j", "d", "c", "jusqu", "quoiqu", "lorsqu", "puisqu" ] }, "french_stop": { "type": "stop", "stopwords": "_french_" }, "french_keywords": { "type": "keyword_marker", "keywords": [] }, "french_stemmer": { "type": "stemmer", "language": "light_french" } }, …. "analyzer": { "french": { "tokenizer": "standard", "filter": [ "french_elision", "lowercase", "french_stop", "french_keywords", "french_stemmer" ] } } } } }
  • 16. Default analyzer - Elasticsearch Indexing: 1. the analyzer defined in the field mapping, else 2. the analyzer defined in the _analyzer field of the document, else 3. the default analyzer for the type, which defaults to 4. the analyzer named default in the index settings, which defaults to 5. the analyzer named default at node level, which defaults to 6. the standard analyzer Query: 1. the analyzer defined in the query itself, else 2. the analyzer defined in the field mapping, else 3. the default analyzer for the type, which defaults to 4. the analyzer named default in the index settings, which defaults to 5. the analyzer named default at node level, which defaults to 6. the standard analyzer
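Both lists above are "first one that is set wins" chains that end in the standard analyzer. A small Python sketch of that fallback logic, purely illustrative: the function and parameter names are hypothetical and this is not how Elasticsearch implements the lookup internally.

```python
def first_configured(candidates):
    """Return (source, analyzer) for the first candidate that is set."""
    for source, analyzer in candidates:
        if analyzer is not None:
            return source, analyzer
    # Step 6 on the slide: nothing configured, so the standard analyzer is used.
    return "built-in", "standard"


def index_time_analyzer(field_mapping=None, doc_analyzer=None,
                        type_default=None, index_default=None, node_default=None):
    """Fallback order used when indexing a document."""
    return first_configured([
        ("field mapping", field_mapping),
        ("_analyzer field", doc_analyzer),
        ("type default", type_default),
        ("index settings", index_default),
        ("node settings", node_default),
    ])


def query_time_analyzer(query=None, field_mapping=None,
                        type_default=None, index_default=None, node_default=None):
    """Fallback order used when analyzing a query."""
    return first_configured([
        ("query", query),
        ("field mapping", field_mapping),
        ("type default", type_default),
        ("index settings", index_default),
        ("node settings", node_default),
    ])
```

For example, `query_time_analyzer(query="english", field_mapping="french")` picks the analyzer named in the query, while `index_time_analyzer()` with nothing configured falls all the way through to the standard analyzer.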
  • 17. Index many documents – Elasticsearch POST /test3/entries/_bulk { "index": {"_id": "1" } } {"msg": "Hello", "names": ["Jack", "Jill"]} { "index": {"_id": "2" } } {"msg": "Goodbye", "names": "Jason"} { "delete" : {"_id" : "3" } } NOTE: Rivers (similar to DIH) MAY be deprecated. Use Logstash instead (180 MB on disk, including 2 JRuby runtimes!!!)
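The bulk body above is newline-delimited JSON: one action line, followed by a source line for actions that carry a document, with a newline after the last line. A minimal Python sketch that assembles such a body; `build_bulk_body` and its tuple layout are hypothetical helpers, and nothing is sent to a server here.

```python
import json


def build_bulk_body(actions):
    """Serialize (action, metadata, source) triples as newline-delimited JSON."""
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            # "index" carries a document on the next line; "delete" does not.
            lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


actions = [
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
]
bulk_body = build_bulk_body(actions)
```

The resulting string is what would be POSTed to `/test3/entries/_bulk` from the slide.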
  • 18. Index many documents - Solr JSON - simple [ { "_id": "1", "msg": "Hello", "names": ["Jack", "Jill"] }, { "_id": "2", "msg": "Goodbye", "names": "Jason" } ] JSON – with commands { "add": { "doc": { "_id": "1", "msg": "Hello", "names": ["Jack", "Jill"] } }, "add": { "doc": { "_id": "2", "msg": "Goodbye", "names": "Jason" } }, "delete": { "_id":3 } } Also: • CSV • XML • XML+XSLT • JSON+transform (4.10) • DataImportHandler • Map-Reduce External tools • Logstash (owned by ES)
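Note that the "with commands" form repeats the "add" key inside one JSON object. Most JSON libraries cannot produce that from a plain dictionary, because later keys overwrite earlier ones. A small Python sketch of one way around this, assuming the commands are kept as an ordered list of pairs; `build_command_body` is a hypothetical helper, not part of any Solr client.

```python
import json


def build_command_body(commands):
    """Serialize (command, payload) pairs into one JSON object, keeping repeated keys."""
    parts = [f"{json.dumps(name)}: {json.dumps(payload)}" for name, payload in commands]
    return "{" + ", ".join(parts) + "}"


commands = [
    ("add", {"doc": {"_id": "1", "msg": "Hello", "names": ["Jack", "Jill"]}}),
    ("add", {"doc": {"_id": "2", "msg": "Goodbye", "names": "Jason"}}),
    ("delete", {"_id": 3}),
]
command_body = build_command_body(commands)

# Decoding into a dict silently drops the first "add";
# object_pairs_hook=list keeps every key/value pair in order.
as_dict = json.loads(command_body)
as_pairs = json.loads(command_body, object_pairs_hook=list)
```

The simple array form on the same slide avoids the issue entirely, since each document is a separate array element.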
  • 19. Comparing search - Search • Same but different • Same: vast majority of the features come from Lucene • Different: representation of search parameters • Solr: URL query with many – cryptic – parameters • Elasticsearch: • Search lite: URL query with a limited set of parameters (basic Lucene query) • Query DSL: JSON with multi-leveled structure (Legend: Lucene implementation, ES only, Solr only)
  • 20. Search compared – Simple searches { "msg": "Happy birthday", "names": ["Alex", "Mark"], "when": "2014-11-01T10:09:08" } { "msg": "Happy New Year", "names": ["Jack", "Jill"], "when": "2015-01-01T00:00:01" } { "msg": "Goodbye", "names": ["Jack", "Jason"], "when": "2015-06-01T00:00:00" } Elasticsearch (Marvel Sense GET): • /test1/hello/_search – all • /test1/hello/_search?q=happy birthday Alex – 2 • /test1/hello/_search?q=names:Alex – 1 Solr (GET http://localhost:8983/solr/…): • /collection1/select – all • /collection1/select?q=happy birthday Alex – 2 • /collection1/select?q=names:Alex – 1
  • 21. Search Compared – Query DSL Elasticsearch GET /test1/hello/_search { "query": { "query_string": { "fields": ["msg^5", "names"], "query": "happy birthday Alex", "minimum_should_match": "100%" } } } Solr …/collection1/select ?q=happy birthday Alex &defType=dismax &qf=msg^5 names &mm=100%
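The two requests above carry the same three pieces of information: the query text, the boosted field list, and the minimum-match rule. A Python sketch that builds both representations from one set of inputs makes the correspondence explicit; the function names and the `boosts` dictionary are illustrative assumptions, not part of either product's client API.

```python
def boosted_fields(boosts):
    """Render {"msg": 5, "names": 1} as ["msg^5", "names"]."""
    return [f"{name}^{boost}" if boost != 1 else name for name, boost in boosts.items()]


def to_elasticsearch(query, boosts, min_match):
    """Query DSL body using query_string, as on the slide."""
    return {
        "query": {
            "query_string": {
                "fields": boosted_fields(boosts),
                "query": query,
                "minimum_should_match": min_match,
            }
        }
    }


def to_solr(query, boosts, min_match):
    """Request parameters for the dismax parser, as on the slide."""
    return {
        "q": query,
        "defType": "dismax",
        "qf": " ".join(boosted_fields(boosts)),
        "mm": min_match,
    }


boosts = {"msg": 5, "names": 1}
es_body = to_elasticsearch("happy birthday Alex", boosts, "100%")
solr_params = to_solr("happy birthday Alex", boosts, "100%")
```

Here `fields` maps to `qf`, `minimum_should_match` maps to `mm`, and the choice of `query_string` on one side corresponds to `defType=dismax` on the other.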
  • 22. Search Compared – Query DSL - combo Elasticsearch GET /test1/hello/_search { "size" : 1, "query": { "filtered": { "query": { "query_string": { "query": "jack" }}, "filter": { "range": { "when": { "gte": "now" }}}}}} Solr …/collection1/select ?q=jack &fq=when:[NOW TO *] &rows=1 Search future entries about Jack. Return only the best one.
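The Solr form above is shown unencoded; characters such as spaces, brackets and the asterisk in `fq` need URL encoding when the request is built programmatically. A Python sketch using only the standard library, with hypothetical helper names; the Elasticsearch body uses the `filtered` query exactly as on the slide (later Elasticsearch releases replaced it with a `bool` query and a `filter` clause).

```python
from urllib.parse import parse_qs, urlencode


def solr_query_string(text, filter_query, rows):
    """URL-encode the main query, the filter query and the row limit."""
    return urlencode({"q": text, "fq": filter_query, "rows": rows})


def es_filtered_body(text, date_field, lower_bound, size):
    """Equivalent Query DSL body: scored query plus a non-scoring range filter."""
    return {
        "size": size,
        "query": {
            "filtered": {
                "query": {"query_string": {"query": text}},
                "filter": {"range": {date_field: {"gte": lower_bound}}},
            }
        },
    }


solr_qs = solr_query_string("jack", "when:[NOW TO *]", 1)
es_body_combo = es_filtered_body("jack", "when", "now", 1)

# Decoding the query string gives back the readable parameters from the slide.
decoded = parse_qs(solr_qs)
```

In both systems the date restriction is a filter rather than part of the scored query: `fq` in Solr, the `filter` clause in the Query DSL.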
  • 23. Parent/Child structures Elasticsearch: Inner objects • Mapping: Object • Dynamic mapping (default) • NOT separate Lucene docs • Map to flattened multivalued fields • Search matches against value from ANY of inner objects { "followers.age": [19, 26], "followers.name": [alex, lisa] } Nested objects • Mapping: nested • Explicit mapping • Lucene block storage • Inner documents are hidden • Cannot return inner docs only • Can do nested & inner Parent and Child • Mapping: _parent • Explicit references • Separate documents • In-memory join • SLOW Solr: Nested objects • Lucene block storage • All documents are visible • Child JSON is less natural
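The "matches against value from ANY of inner objects" point is easiest to see by flattening by hand. A Python sketch, assuming a sample document in which alex is 19 and lisa is 26 (the slide only shows the flattened arrays, so the pairing is an assumption); the helper names are hypothetical.

```python
def flatten_inner_objects(doc, field):
    """Collapse a list of inner objects into multivalued 'field.key' arrays."""
    flat = {}
    for inner in doc[field]:
        for key, value in inner.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def matches_flattened(flat, conditions):
    """Inner-object semantics: each condition may match a different inner object."""
    return all(value in flat.get(name, []) for name, value in conditions.items())


def matches_nested(doc, field, conditions):
    """Nested semantics: all conditions must hold within the same inner object."""
    return any(
        all(inner.get(key) == value for key, value in conditions.items())
        for inner in doc[field]
    )


doc = {"followers": [{"name": "alex", "age": 19}, {"name": "lisa", "age": 26}]}
flat = flatten_inner_objects(doc, "followers")

# "A follower named alex who is 26" does not exist, yet the flattened form matches.
false_positive = matches_flattened(flat, {"followers.name": "alex", "followers.age": 26})
nested_result = matches_nested(doc, "followers", {"name": "alex", "age": 26})
```

The flattened form loses the association between a name and its age, which is the problem that nested objects (Lucene block storage) address in both Elasticsearch and Solr.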
  • 24. Cloud deployment – quick take 1. General concepts are similar: • Node discovery • Sharding • Replication • Routing 2. Implementations are very, very different (layer above Lucene) 3. Solr uses Apache ZooKeeper 4. Elasticsearch has its own algorithms 5. No time to discuss 6. Let's focus on the critical path: Node discovery/cloud-state management 7. Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests
  • 25. Jepsen test of ZooKeeper “Use Zookeeper. It’s mature, well-designed, and battle-tested.”
  • 26. Jepsen test of Elasticsearch “If you are an Elasticsearch user (as I am): good luck.”
  • 27. Innovator’s dilemma • Solr's usual attitude • An amazingly useful product for many different uses • And wants everybody to know it • …Right in the collection1 example • “You will need all this eventually, might as well learn it first” • Elasticsearch is small and shiny (“trust us, the magic exists”) • Elasticsearch + Logstash + Kibana => power-punch triple combo • Especially when comparing to Solr (and not another commercial solution) • Feature release process • Elasticsearch: kimchy: “LGTM” (Looks good to me) • Solr: full Apache process around it • Solr – needs to buckle down and focus on onboarding experience • Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)
  • 28. Solr vs. Elasticsearch Case by Case Alexandre Rafalovitch www.solr-start.com @arafalov @SolrStart