Solr 101

SOLR 101
JavaZone 2012, Oslo, Sébastien Muller, Findwise

Agenda
 Introductions
 Enterprise search
 What is Solr, why choose it?
 Solr Terminology
 Main Solr Features
 How it works
 Anatomy of a Query
 Scalability
 Case studies
 SpareBank1
 Komplett Group

Enterprise Search
 Search has become mission critical for most enterprises
 Intranet
 Web presence
 E-commerce

 Exponential growth of data

 Cost of not finding information
 Knowledge (sharing)
 Time
 Money

 Information blackhole

What is Solr?

Official definition:

“Solr is an open source enterprise search platform based on the Lucene Java search library, with an HTTP interface using XML, JSON or other formats. It provides hit highlighting, faceted search, caching, replication, a web administration interface and many more features. It runs in a Java servlet container such as Apache Tomcat.”

http://lucene.apache.org/solr

What is Solr?
 Open-source search engine with no license fees
 Uses Apache Lucene library and adds enterprise search server features and capabilities
 Web based application that processes requests and returns responses via HTTP
 Easy scalability and great performance
 Industry-tested worldwide
 Modern solution architecture based on XML and Java – easy to work with
 Well integrated with the ecosystem around Big Data, such as Hadoop (also Nutch, Tika).

Why choose Solr?
 “Buy” > Build

 Open source vs. Commercial solution
 Open source software is free
 Licensed software can be very expensive

 High quality and easily modifiable relevancy

 Very fast query and indexing performance

 Highly flexible data processing/transformation

Why choose Solr?
 Some challenges unique to open source…
 No guaranteed support or bug fixing from community
 No formal quality control or support for upgrades
 Limited support for less experienced developers

 Some benefits unique to open source…
 Widely used and tested
 Access to source code
 Access to development versions and unreleased patches

 Ultimately search is a specialised field and requires specialists

Solr Terminology
 Index(ing)
 Inverted index
 Document
 Field
 Stored and/or indexed fields
 Analysis
 Tokenization
 Filters
 Terms
 Query
 Filter
 Function
 Facet
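
Several of these terms fit together in a few lines of code. The following is a minimal sketch, not how Lucene is implemented: a crude analysis step (tokenization plus a lowercase filter) turns a field into terms, and an inverted index maps each term back to the documents containing it. The function names and the sample documents are made up for illustration.

```python
from collections import defaultdict

def analyze(text):
    """Rough stand-in for an analysis chain: tokenize on whitespace,
    then strip surrounding punctuation and lowercase each token."""
    tokens = text.split()
    terms = [t.strip(".,;:!?\"'()").lower() for t in tokens]
    return [t for t in terms if t]

def build_inverted_index(docs, field):
    """Map each term to the sorted ids of the documents whose `field` contains it."""
    index = defaultdict(set)
    for doc in docs:
        for term in analyze(doc.get(field, "")):
            index[term].add(doc["id"])
    return {term: sorted(ids) for term, ids in index.items()}

# Hypothetical sample documents, each with an id and one indexed field.
docs = [
    {"id": "1", "title": "Apache Solr in Action"},
    {"id": "2", "title": "Lucene in Action"},
    {"id": "3", "title": "Enterprise search with Solr"},
]
index = build_inverted_index(docs, "title")
print(index["solr"])  # ids of the documents whose title contains the term "solr"
```

A query for a term is then a dictionary lookup rather than a scan over every document, which is the point of inverting the index.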

Main Solr Features
 Full text search
 Field search
 Number and date searching
 Facets
 Spelling assistance – “Did you mean…?”
 Replication
 Master/Slave architecture

 Related hits
 Query completion
 Admin GUI

How it works
 Easy configuration through XML
 schema.xml
 solrconfig.xml

 Documents are POSTed via HTTP to Solr
 Add/update
 Delete
 Commit

 Queries and responses are also sent via HTTP
 Choice of formats
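
As a rough illustration of the add/delete/commit cycle, the sketch below builds the three HTTP POST requests without sending them. The host, port and `/update` path are assumptions about a local installation, and the JSON message format shown is the one accepted by more recent Solr releases; older versions may expect XML or a separate `/update/json` handler. `build_update_request` is a hypothetical helper, not part of any Solr client.

```python
import json
import urllib.request

# Assumed update URL of a local Solr instance; adjust host, port and core as needed.
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_update_request(payload, url=SOLR_UPDATE_URL):
    """Build (but do not send) an HTTP POST carrying a JSON update message."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Add/update: a list of documents. Delete: by id. Commit: make the changes searchable.
add_req = build_update_request([{"id": "1", "title": "Solr 101"}])
delete_req = build_update_request({"delete": {"id": "1"}})
commit_req = build_update_request({"commit": {}})

# To actually send a request to a running Solr:
#   urllib.request.urlopen(add_req)
```

Nothing becomes visible to searches until a commit is issued, which is why commit appears as its own step.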

Anatomy of a Query
 Common parameters
 start, rows, fl, fq, sort
 http://wiki.apache.org/solr/CommonQueryParameters

?q=*:*&start=0&rows=10&fl=title&fq=collection:popular&sort=title asc

 Slightly more advanced
 &facet
 &qf
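
The common parameters can be assembled programmatically. This is a small sketch using only the Python standard library; the base URL of the select handler is an assumption about a local installation, and `build_query_url` is a hypothetical helper.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed location of the select handler on a local Solr instance.
SELECT_URL = "http://localhost:8983/solr/select"

def build_query_url(q, start=0, rows=10, fl=None, fq=None, sort=None, base=SELECT_URL):
    """Assemble a Solr query URL from the common query parameters."""
    params = [("q", q), ("start", start), ("rows", rows)]
    if fl:
        params.append(("fl", fl))      # fields to return
    if fq:
        params.append(("fq", fq))      # filter query, does not affect scoring
    if sort:
        params.append(("sort", sort))  # e.g. "title asc"
    # urlencode percent-escapes characters such as '*', ':' and spaces.
    return base + "?" + urlencode(params)

url = build_query_url("*:*", fl="title", fq="collection:popular", sort="title asc")
print(url)
print(parse_qs(urlsplit(url).query))  # decoded parameters, for inspection
```

Letting `urlencode` handle the escaping avoids hand-built URLs that break on spaces or colons in the query.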

What is Faceting?
 Navigation/discovery technique

 Tally of docs for each distinct field value

 Parameters
 &facet=true
 &facet.field=category
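
To make "tally of docs for each distinct field value" concrete, here is a minimal sketch of what a field facet computes, done in plain Python over a handful of made-up documents. Solr computes these counts from its index rather than by scanning documents, and in a real query they are restricted to the current result set; `facet_counts` is a hypothetical name.

```python
from collections import Counter

def facet_counts(docs, field):
    """Tally how many documents carry each distinct value of `field`."""
    counts = Counter()
    for doc in docs:
        value = doc.get(field)
        if value is not None:
            counts[value] += 1
    # Most frequent values first, ties broken alphabetically.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical result set.
docs = [
    {"id": "1", "category": "laptop"},
    {"id": "2", "category": "phone"},
    {"id": "3", "category": "laptop"},
    {"id": "4"},  # no category, so it is not counted
]
print(facet_counts(docs, "category"))
```

Each (value, count) pair is what a search interface typically renders as a clickable navigation link, with the click adding a matching `fq` filter.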

And so much more…

Scalability
 Architecture goals:
 More queries per second (qps)
 Faster query execution
 Bigger indexes
 Faster indexing

 Scaling options
 Multicore
 Replication
 Sharding

Scalability - Multicore
 Running more than one Solr core (index) in one Solr webapp

<!-- solr.xml: one <core> entry per index hosted in the webapp -->
<solr persistent="true" sharedLib="lib">
  <cores adminPath="/admin/cores">
    <core name="core0" instanceDir="core0" />
    <core name="core1" instanceDir="core1" />
  </cores>
</solr>

 http://localhost:8080/solr/admin/cores?action=...
 STATUS
 CREATE
 SWAP
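
The three actions can be spelled out as full URLs. This sketch only builds the URLs; the host and port follow the slide, `core0` and `core1` are the cores from the configuration above, and `core2` is a hypothetical new core. The parameter names (`core`, `name`, `instanceDir`, `other`) are those of the CoreAdmin handler as I understand it, so check them against the documentation for your Solr version.

```python
from urllib.parse import urlencode

# Admin path as configured by adminPath="/admin/cores" on the slide.
CORE_ADMIN_URL = "http://localhost:8080/solr/admin/cores"

def core_admin_url(action, **params):
    """Build a CoreAdmin URL: the action first, then its parameters in sorted order."""
    query = [("action", action)] + sorted(params.items())
    return CORE_ADMIN_URL + "?" + urlencode(query)

status_url = core_admin_url("STATUS", core="core0")                       # inspect one core
create_url = core_admin_url("CREATE", name="core2", instanceDir="core2")  # register a new core
swap_url = core_admin_url("SWAP", core="core0", other="core1")            # exchange two cores

for url in (status_url, create_url, swap_url):
    print(url)
```

SWAP is the interesting one operationally: a freshly built index can be prepared in a spare core and then exchanged with the live one.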

Scalability - Replication
 Basic architecture – indexing/querying handled by one instance

 1:1 Master/slave
 Indexing
 Querying

 1:N Master/slaves
 Different user groups

Scalability - Sharding
 Distributed index
 N masters with index split between them
 Simple hashing to choose index

 Sharding + replication
 N masters with M slaves each
 More shards = faster execution time
 More slaves = higher average QPS

&shards=solr1:8983/solr,solr2:8983/solr&indent=true&q=ipod+solr
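
"Simple hashing to choose index" might look like the sketch below at indexing time: hash the document id, take it modulo the number of shards, and send the document to that master. The choice of CRC32 is an assumption made for the example, not something the slides specify, and the shard addresses are the placeholders from the query above.

```python
import zlib

# Shard addresses from the example query; treat them as placeholders.
SHARDS = ["solr1:8983/solr", "solr2:8983/solr"]

def shard_for(doc_id, shards=SHARDS):
    """Pick the shard (master) a document is indexed on by hashing its id."""
    # crc32 is stable across runs, unlike Python's built-in hash() for strings.
    return shards[zlib.crc32(doc_id.encode("utf-8")) % len(shards)]

def shards_param(shards=SHARDS):
    """Value of the shards parameter for a distributed query over all shards."""
    return ",".join(shards)

print(shard_for("doc-42"))  # always the same shard for the same id
print(shards_param())
```

One caveat of plain modulo hashing: changing the number of shards moves most documents to a different shard, so adding a shard generally means reindexing.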

SpareBank1 - Background
 SpareBank1 Gruppen
 19 individual localised bank portals and one parent front page

 Boost 25 umbrella project
 Semantic URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
 New search interface
 Banking app

 CMS with no easy way of tracking individual banks’ publications
 Mass duplicates
 Access to irrelevant data

SpareBank1 - Requirements
 Customer requirements: “bedre portal søk” (“better portal search”)

SpareBank1 - Requirements
 Basic search features include

 High quality relevance and precision

 Relevant faceting


 Spell check and suggestions

 Search analytics

SpareBank1 – Live Demo

https://www2.sparebank1.no/

Komplett - Background
 Komplett NO, SE, DK… inWarehouse.se, MPX

 Existing Solr solution
 Mile long query with boosting per field

 Poor relevance
 Peripherals/accessories ranked higher than products

 Limited faceting

 No query completion or spellcheck

 Sloooooow indexing

Komplett - Requirements
 Superior and customisable relevance model

 Much more comprehensive indexing of products and specifications

 Spellcheck


 So much more faceting

Sébastien Muller
sebastien.muller@findwise.com
