Solr vs. Elasticsearch, Case by Case: Presented by Alexandre Rafalovitch, UN

Solr vs. Elasticsearch
Case by Case
Alexandre Rafalovitch @arafalov
@SolrStart
www.solr-start.com

Meet the FRENEMIES
Friends (common)
• Based on Lucene
• Full-text search
• Structured search
• Queries, filters, caches
• Facets/stats/enumerations
• Cloud-ready
Elasticsearch*
* Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
Enemies (differences)
• Download size
• AdminUI vs. Marvel
• Configuration vs. Magic
• Nested documents
• Chains vs. Plugins
• Types and Rivers
• OpenSource vs. Commercial
• Etc.

This used to be Solr (now in Lucene/ES)
• Field types
• Dismax/eDismax
• Many of the analysis filters (WordDelimiterFilter, Soundex, Regex, HTML, kstem, Trim…)
• Multi-valued field cache
• ….
(source: http://heliosearch.org/lucene-solr-history/)
• Disclaimer: Nowadays, Elasticsearch hires awesome Lucene hackers

Basically - sisters
Source: https://www.flickr.com/photos/franzfume/11530902934/
[Chart: Download, Expanded, and First run sizes for Solr vs. Elasticsearch; axis 0–300]

Solr: Chubby or Rubenesque?
[Chart: size breakdown for Elasticsearch+plugins vs. Solr; axis 0.00–300.00; segments: Code, Examples, Documentation, ES-Admin, ES-ICU, Extract/Tika, UIMA, Map-Reduce, Test Framework]

Elasticsearch setup
Source: https://www.flickr.com/photos/deborah-is-lola/6815624125/
• Admin UI:
bin/plugin -i elasticsearch/marvel/latest
• Tika/Extraction:
bin/plugin -install elasticsearch/elasticsearch-mapper-attachments/2.4.1
• ICU (Unicode components):
bin/plugin -install elasticsearch/elasticsearch-analysis-icu/2.4.1
• JDBC River (like DataImportHandler subset):
bin/plugin --install jdbc --url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-river-jdbc/1.3.4.4/elasticsearch-river-jdbc-1.3.4.4-plugin.zip
• JavaScript scripting support:
bin/plugin -install elasticsearch/elasticsearch-lang-javascript/2.4.1
• On each node….
• Without dependency management (jars = rabbits)

Index a document - Elasticsearch
1. Set up an index/collection
2. Define fields and types
3. Index content (using Marvel Sense):
POST /test1/hello
{
"msg": "Happy birthday",
"names": ["Alex", "Mark"],
"when": "2014-11-01T10:09:08"
}
Alternative:
PUT /test1/hello/id1
{
"when": "2014-11-01T10:09:08"
}
An index, type and definitions are created automatically
So, where is our document:
GET /test1/hello/_search
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "test1",
"_type": "hello",
"_id": "AUmIk4LDF4XvfpxnVJ2g",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}}
]
}}

Behind the scenes
GET /test1/hello/_search
…..
{
"_index": "test1",
"_type": "hello",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}
….
GET /test1/hello/_mapping
{
"test1": {
"mappings": {
"hello": {
"properties": {
"msg": {
"type": "string"
},
"names": {
"type": "string"
},
"when": {
"type": "date",
"format": "dateOptionalTime"
}}}}}}

Basic search in Elasticsearch
…..
{
"_index": "test1",
"_type": "hello",
"_score": 1,
"_source": {
"names": [
"Alex",
"Mark"
],
"when": "2014-11-01T10:09:08"
}
….
• GET /test1/hello/_search?q=foobar – no results
• GET /test1/hello/_search?q=Alex – YES on names?
• GET /test1/hello/_search?q=alex – YES lower case
• GET /test1/hello/_search?q=happy – YES on msg?
• GET /test1/hello/_search?q=2014 – YES???
• GET /test1/hello/_search?q="birthday alex" – YES
• GET /test1/hello/_search?q="birthday mark" – NO
Issues:
1. Where are we actually searching?
2. Why do lower-case searches work?
3. What's so special about Alex?

All about _all and why strings are tricky
• By default, we search in the field _all
• What's an _all field in Solr terms?
<field name="_all" type="es_string" multiValued="true" indexed="true" stored="false"/>
<copyField source="*" dest="_all"/>
• And the default mapping for Elasticsearch "string" type is like:
<fieldType name="es_string" class="solr.TextField" multiValued="true" positionIncrementGap="0" >
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
• Elasticsearch equivalent to Solr's solr.StrField is:
{"type" : "string", "index" : "not_analyzed"}
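The three issues from the previous slide can be reproduced with a toy model of the _all field. This is only a sketch, not Elasticsearch code: `analyze`, `build_all_field` and `phrase_match` are hypothetical helpers, the regex is a rough stand-in for StandardTokenizer plus LowerCaseFilter (the real tokenizer may split the date differently), and it assumes values are copied into _all in document order.

```python
import re

def analyze(text):
    # Rough stand-in for StandardTokenizer + LowerCaseFilter (assumption):
    # split on anything that is not a letter or digit, then lowercase.
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", str(text))]

def build_all_field(doc, position_gap=0):
    # Copy every value into one multivalued field, recording token positions.
    # position_gap plays the role of positionIncrementGap between values.
    postings = []
    pos = 0
    for value in doc.values():
        for item in (value if isinstance(value, list) else [value]):
            if postings:
                pos += position_gap
            for token in analyze(item):
                postings.append((token, pos))
                pos += 1
    return postings

def phrase_match(postings, phrase):
    # A phrase matches when its tokens sit at consecutive positions.
    wanted = analyze(phrase)
    if not wanted:
        return False
    present = set(postings)
    return any(
        all((tok, start + i) in present for i, tok in enumerate(wanted))
        for (_, start) in postings
    )

doc = {
    "msg": "Happy birthday",
    "names": ["Alex", "Mark"],
    "when": "2014-11-01T10:09:08",
}
all_field = build_all_field(doc)            # gap of 0, as in es_string
gapped_field = build_all_field(doc, 100)    # gap of 100, as in Solr's text_* types
```

With a gap of 0, "birthday" (end of msg) and "alex" (start of names) end up at adjacent positions, so the phrase matches across fields; "mark" is one position further away, so "birthday mark" does not. With a gap of 100 neither phrase matches.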

Can Solr do the same kind of magic?
• curl 'http://localhost:8983/solr/collection1/update/json/docs' -H 'Content-type: application/json' -d @msg.json
curl 'http://localhost:8983/solr/collection1/select'
{
"responseHeader":{
"status":0,
"QTime":18,
"params":{}},
"response":{"numFound":1,"start":0,"docs":[
{
"msg":["Happy birthday"],
"names":["Alex", "Mark"],
"when":["2014-11-01T10:09:08Z"],
"_id":"e9af682d-e775-42f2-90a5-c932b5fbb691",
"_version_":1484096406012559360}]
}}
curl 'http://localhost:8983/solr/collection1/schema/fields'
{
"responseHeader":{
"status":0,
"QTime":1},
"fields":[
{"name":"_all", "type":"es_string",
"multiValued":true,
"indexed":true, "stored":false},
{"name":"_id", "type":"string",
"multiValued":false,
"indexed":true, "required":true,
"stored":true, "uniqueKey":true},
{"name":"_version_", "type":"long",
"indexed":true, "stored":true},
{"name":"msg", "type":"es_string"},
{"name":"names", "type":"es_string"},
{"name":"when", "type":"tdates"}]}
• Output slightly re-formatted

Nearly the same magic
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema">

<processor class="solr.UUIDUpdateProcessorFactory" />
<processor class="solr.LogUpdateProcessorFactory"/>
<processor class="solr.DistributedUpdateProcessorFactory"/>
<processor class="solr.RemoveBlankFieldUpdateProcessorFactory"/>
<processor class="solr.ParseBooleanFieldUpdateProcessorFactory"/>
<processor class="solr.ParseLongFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDoubleFieldUpdateProcessorFactory"/>
<processor class="solr.ParseDateFieldUpdateProcessorFactory">
<arr name="format">
<str>yyyy-MM-dd'T'HH:mm:ss</str>
<str>yyyyMMdd'T'HH:mm:ss</str>
</arr>
</processor>
<processor class="solr.AddSchemaFieldsUpdateProcessorFactory">
<str name="defaultFieldType">es_string</str>
<lst name="typeMapping">
<str name="valueClass">java.lang.Boolean</str>
<str name="fieldType">booleans</str>
</lst>
<lst name="typeMapping">
<str name="valueClass">java.util.Date</str>
<str name="fieldType">tdates</str>
</lst>
</processor>
<processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Not quite the same magic:
• URP chain happens before copyField
• Date/Ints are converted first
• copyField converts content back to string
• _all field also gets copy of _id and _version_
• All auto-mapped fields HAVE to be multivalued
• No (ES-style) types, just collections
• Unable to reproduce cross-field search
• Still rough around the edges
• Requires dynamic schema, so adding new types becomes a challenge
• Auto-mapping is NOT recommended for production
• Dynamic fields solution is still more mature
Explicit mapping - Solr
• In schema.xml (or dynamic equivalent)
• Uses Java Factories
• Related content (e.g. stopwords) is usually in separate files (REST-managed resources were added recently)
• French example:
<fieldType name="text_fr" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory" ignoreCase="true"
articles="lang/contractions_fr.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="lang/stopwords_fr.txt" format="snowball" />
<filter class="solr.FrenchLightStemFilterFactory"/>
</analyzer>
</fieldType>

Explicit mapping - Elasticsearch
• Created through PUT command
• Also can be stored in config/default-mapping.json or config/mappings/[index_name]
• Mappings for all types in one index should be compatible to avoid problems
• Usually uses predefined mapping names. Has many names, including for languages
• Explicit mapping is through named cross-references, rather than duplicated in-place stack (like Solr)
• Related content is usually also in the definition. Sometimes in file (e.g. stopwords_path – needs to be on all nodes)
• French example (next slide):

Explicit mapping – Elasticsearch - French
{
"settings": {
"analysis": {
"filter": {
"french_elision": {
"type": "elision",
"articles": [ "l", "m", "t", "qu",
"n", "s", "j", "d", "c", "jusqu", "quoiqu",
"lorsqu", "puisqu"
]
},
"french_stop": {
"type": "stop",
"stopwords": "_french_"
},
"french_keywords": {
"type": "keyword_marker",
"keywords": []
},
"french_stemmer": {
"type": "stemmer",
"language": "light_french"
}
},
….
"analyzer": {
"french": {
"tokenizer": "standard",
"filter": [
"french_elision",
"lowercase",
"french_stop",
"french_keywords",
"french_stemmer"
]
}
}
}
}
}

Default analyzer - Elasticsearch
Indexing
1. the analyzer defined in the field mapping, else
2. the analyzer defined in the _analyzer field of the document, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer
Query
1. the analyzer defined in the query itself, else
2. the analyzer defined in the field mapping, else
3. the default analyzer for the type, which defaults to
4. the analyzer named default in the index settings, which defaults to
5. the analyzer named default at node level, which defaults to
6. the standard analyzer
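Both lists are a "first one defined wins" lookup. A minimal sketch of the indexing-time order, assuming plain dicts and strings as inputs (`resolve_analyzer`, `index_time_candidates` and their argument names are hypothetical, not Elasticsearch APIs):

```python
def resolve_analyzer(candidates):
    # Walk the (source, analyzer_name) pairs in priority order and
    # return the first one that is actually set.
    for source, name in candidates:
        if name:
            return source, name
    return "built-in", "standard"

def index_time_candidates(field_mapping, document, type_default,
                          index_default, node_default):
    # Priority order from the "Indexing" list above.
    return [
        ("field mapping", field_mapping.get("analyzer")),
        ("document _analyzer field", document.get("_analyzer")),
        ("type default", type_default),
        ("index settings default", index_default),
        ("node default", node_default),
    ]

explicit = resolve_analyzer(
    index_time_candidates({"analyzer": "french"}, {}, None, "whitespace", None))
nothing_set = resolve_analyzer(
    index_time_candidates({}, {}, None, None, None))
```

The query-time list would differ only in its first two entries (the query itself, then the field mapping).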

Index many documents – Elasticsearch
POST /test3/entries/_bulk
{ "index": {"_id": "1" } }
{"msg": "Hello", "names": ["Jack", "Jill"]}
{ "index": {"_id": "2" } }
{"msg": "Goodbye", "names": "Jason"}
{ "delete" : {"_id" : "3" } }
NOTE: Rivers (similar to DIH) MAY be deprecated. Use Logstash instead (180 MB on disk, including 2 JRuby runtimes!!!)
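The bulk body is newline-delimited JSON: one action line, then (for index actions) one source line, with a newline after the last line. A small sketch that builds the body shown above; `bulk_body` is a hypothetical helper and nothing is sent to a server here.

```python
import json

def bulk_body(actions):
    # actions: iterable of (action_name, metadata, source_or_None).
    # Each JSON object must stay on a single line, and the body
    # must end with a newline.
    lines = []
    for action, meta, source in actions:
        lines.append(json.dumps({action: meta}))
        if source is not None:
            lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"

body = bulk_body([
    ("index", {"_id": "1"}, {"msg": "Hello", "names": ["Jack", "Jill"]}),
    ("index", {"_id": "2"}, {"msg": "Goodbye", "names": "Jason"}),
    ("delete", {"_id": "3"}, None),
])
```

The resulting string would be the payload for POST /test3/entries/_bulk.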

Index many documents - Solr
JSON - simple
[
{
"_id": "1",
"msg": "Hello",
"names": ["Jack", "Jill"]
},
{
"_id": "2",
"msg": "Goodbye",
"names": "Jason"
}
]
JSON – with commands
{
"add": { "doc": {
"_id": "1",
"msg": "Hello",
"names": ["Jack", "Jill"]
} },
"add": { "doc": {
"_id": "2",
"msg": "Goodbye",
"names": "Jason"
} },
"delete": { "_id":3 }
}
Also:
• CSV
• XML
• XML+XSLT
• JSON+transform (4.10)
• DataImportHandler
• Map-Reduce
External tools
• Logstash (owned by ES)

Comparing search - Search
• Same but different
• Same: vast majority of the features come from Lucene
• Different: representation of search parameters
• Solr: URL query with many – cryptic – parameters
• Elasticsearch:
• Search lite: URL query with a limited set of parameters (basic Lucene query)
• Query DSL: JSON with multi-leveled structure
[Diagram labels: Lucene, Impl, ES only, Solr only]

Search compared – Simple searches
{
"when": "2014-11-01T10:09:08"
}
{
"msg": "Happy New Year",
"names": ["Jack", "Jill"],
"when": "2015-01-01T00:00:01"
}
{
"msg": "Goodbye",
"names": ["Jack", "Jason"],
"when": "2015-06-01T00:00:00"
}
Elasticsearch (Marvel Sense GET):
• /test1/hello/_search – all
• /test1/hello/_search?q=happy birthday Alex – 2
• /test1/hello/_search?q=names:Alex – 1
Solr (GET http://localhost:8983/solr/…):
• /collection1/select – all
• /collection1/select?q=happy birthday Alex – 2
• /collection1/select?q=names:Alex – 1

Search Compared – Query DSL
Elasticsearch
{
"query": {
"query_string": {
"fields": ["msg^5", "names"],
"query": "happy birthday Alex",
"minimum_should_match": "100%"
}
}
}
Solr
…/collection1/select
?q=happy birthday Alex
&defType=dismax
&qf=msg^5 names
&mm=100%
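The two requests above carry the same three pieces of information (query text, boosted fields, minimum match), just packaged differently. A sketch that builds both from one set of inputs; the function names are hypothetical, and no request is actually sent.

```python
import json
from urllib.parse import urlencode

def es_query_string(query, fields, mm):
    # Elasticsearch: a nested JSON body for the query DSL.
    return {"query": {"query_string": {
        "fields": fields,
        "query": query,
        "minimum_should_match": mm,
    }}}

def solr_dismax_params(query, fields, mm):
    # Solr: flat URL parameters; qf is a space-separated field list.
    return urlencode({
        "q": query,
        "defType": "dismax",
        "qf": " ".join(fields),
        "mm": mm,
    })

fields = ["msg^5", "names"]
es_body = json.dumps(es_query_string("happy birthday Alex", fields, "100%"))
solr_qs = solr_dismax_params("happy birthday Alex", fields, "100%")
```

Note that the Solr side needs URL escaping (`^` and `%` in particular), which the slide omits for readability.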

Search Compared – Query DSL - combo
Elasticsearch
{
"size" : 1,
"query": {
"filtered": {
"query": {
"query_string": {
"query": "jack"
}},
"filter": {
"range": {
"when": {
"gte": "now"
}}}}}}
Solr
…/collection1/select
?q=jack
&fq=when:[NOW TO *]
&rows=1
Search future entries about Jack. Return only the best one.

Parent/Child structures
Elasticsearch
Inner objects
• Mapping: Object
• Dynamic mapping (default)
• NOT separate Lucene docs
• Map to flattened multivalued fields
• Search matches against value from ANY of inner objects
{
"followers.age": [19, 26],
"followers.name": [alex, lisa]
}
Nested objects
• Mapping: nested
• Explicit mapping
• Lucene block storage
• Inner documents are hidden
• Cannot return inner docs only
• Can do nested & inner
Parent and Child
• Mapping: _parent
• Explicit references
• Separate documents
• In-memory join
• SLOW
Solr
Nested objects
• Lucene block storage
• All documents are visible
• Child JSON is less natural
Cloud deployment – quick take
1. General concepts are similar:
• Node discovery
• Sharding
• Replication
• Routing
2. Implementations are very, very different (layer above Lucene)
3. Solr uses Apache Zookeeper
4. Elasticsearch has its own algorithms
5. No time to discuss
6. Let's focus on the critical path: Node discovery/cloud-state management
7. Use a 3rd party analysis: Kyle Kingsbury's Jepsen tests

Jepsen test of Zookeeper
Use Zookeeper. It’s mature, well-designed, and battle-tested.

Jepsen test of Elasticsearch
If you are an Elasticsearch user (as I am): good luck.

Innovator’s dilemma
• Solr's usual attitude
• An amazingly useful product for many different uses
• And wants everybody to know it
• …Right in the collection1 example
• “You will need all this eventually, might as well learn it first”
• Elasticsearch is small and shiny (“trust us, the magic exists”)
• Elasticsearch + Logstash + Kibana => power-punch triple combo
• Especially when comparing to Solr (and not another commercial solution)
• Feature release process
• Elasticsearch: kimchy: “LGTM” (Looks good to me)
• Solr: full Apache process around it
• Solr – needs to buckle down and focus on onboarding experience
• Solr is getting better (e.g. listen to SolrCluster podcast of October 24, 2014)

Solr vs. Elasticsearch
Case by Case
Alexandre Rafalovitch
www.solr-start.com
@arafalov
@SolrStart
