Search Engine-Building with Lucene and Solr, Part 1 (SoCal Code Camp LA 2013)

Search Engine-Building
with Lucene and Solr
Part 1
Kai Chan
SoCal Code Camp, November 2013

Overview
● why Lucene/Solr?
● what are Lucene and Solr?
● how to use Lucene and Solr
○ setup
○ indexing
○ searching

● resources
● demo
● questions/answers

How to Make Your Data Searchable
● pay someone to do it
● use some solution someone else has written
● write some solution yourself

How to Search - One Approach
for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the results (or not)
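
The loop above can be sketched in Python as follows. This is only an illustration of the naive approach, not anything Lucene or Solr does; the function name `linear_search` and the sample documents are assumptions made for the example.

```python
def linear_search(documents, query):
    """Return the IDs of documents whose content contains the query as a substring."""
    results = []
    for doc_id, content in enumerate(documents):
        # Every search reads every document in full
        if query in content:
            results.append(doc_id)
    return results


# Hypothetical sample data
docs = ["it is what it is", "what is it", "it is a banana"]
print(linear_search(docs, "what"))
print(linear_search(docs, "banana"))
```

Because every query touches the whole dataset, the cost grows linearly with the amount of text.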

How to Search - Problems
● slow
○ reads the whole dataset for each search

● not scalable
○ if your dataset grows by 10x,
your search slows down by 10x

● how to show the most relevant documents
first?
○ list of results can be quite long
○ users have limited time and patience

Inverted Index - Introduction
● like the "index" at the end of books
● a map of one of the following types
○ term → document list
○ term → <document, position> list

documents:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

inverted index (without positions):
"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

inverted index (with positions):
"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}

Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
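
A positional inverted index like the one above can be built in a few lines. The sketch below is in Python and assumes the simplest possible tokenization (splitting on whitespace); real analyzers in Lucene do much more. The function name `build_inverted_index` is made up for this example.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a list of (document ID, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # Assumption: terms are separated by whitespace only
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


documents = ["it is what it is", "what is it", "it is a banana"]
index = build_inverted_index(documents)

# Positional postings for one term
print(index["is"])

# Dropping the positions gives the plain term -> document list form
print(sorted({doc_id for doc_id, _ in index["what"]}))
```

Looking up a term is then a single dictionary access instead of a scan over all documents.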

Inverted Index - Speed
● term list
○ typically very small
○ grows slowly

● term lookup
○ O(1) to O(log(number of terms))

● for a particular term
○ document lists: very small
○ document + position lists: still small

● few terms per query

Inverted Index - Relevance
● information in the index enables:
○ determination (scoring) of relevance of each
document to the query
○ comparison of relevance among documents
○ sorting by (decreasing) relevance
■ i.e. the most relevant document first
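
To make the idea of scoring concrete, here is a small Python sketch that ranks documents with a simple term-frequency times inverse-document-frequency weight. This is an assumption chosen for illustration; it is not Lucene's actual scoring formula, and the function name `rank` is hypothetical.

```python
import math
from collections import Counter


def rank(documents, query):
    """Return (document ID, score) pairs, most relevant first."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for terms in tokenized:
        doc_freq.update(set(terms))

    scores = []
    for doc_id, terms in enumerate(tokenized):
        term_counts = Counter(terms)
        score = 0.0
        for term in query.split():
            if term_counts[term]:
                # Rare terms weigh more than common ones
                idf = math.log(1 + n_docs / doc_freq[term])
                score += term_counts[term] * idf
        if score > 0:
            scores.append((doc_id, score))

    # Sort by decreasing score; ties keep document order
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


documents = ["it is what it is", "what is it", "it is a banana"]
ranking = rank(documents, "what banana")
print([doc_id for doc_id, _ in ranking])
```

In this toy data "banana" occurs in fewer documents than "what", so the document containing it is ranked first.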

Lucene vs. Solr - Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls
methods in the Lucene API
● provides building blocks for custom features

Lucene vs. Solr - Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene

Lucene:
● machine running Java VM
○ your application
■ Lucene code, libraries
■ index

Solr:
● client talks to Solr over HTTP
● machine running Java VM
○ servlet container (e.g. Tomcat, Jetty)
■ Solr: Solr code, Lucene code, libraries
■ index

Workflow
Setup → Indexing → Search

Workflow - Setup
● servlet configuration
○ e.g. port number, max POST size
○ you can usually use the default settings

● Solr configuration
○ e.g. data directory, deduplication, language
identification, highlighting
○ you can usually use the default settings

● schema definition
○ defines fields in your documents
○ you can use the default settings if you name your
fields in a certain way

How Data Are Organized
● collection: contains documents
● document: contains fields
● field
○ name (e.g. "title" or "price")
○ content (e.g. "please read" or 30)
○ type
○ options

Example: a collection whose documents share a set of fields
● fields: subject, date, from, reply-to, text
● a document need not contain every field
(e.g. subject or reply-to may be absent)

Example: a collection whose documents have different fields
● document 1: subject, date, from, text
● document 2: title, SKU, price, description
● document 3: first name, last name, phone, address

Solr Field Definition
● field
○ name (e.g. "subject")
○ type (e.g. "text_general")
○ options (e.g. indexed="true" stored="true")

● field type
○ text: "string", "text_general"
○ numeric: "int", "long", "float", "double"

● options
○ indexed: content can be searched
○ stored: content can be returned at search-time
○ multivalued: multiple values per field & document

Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
name     type           indexed   stored   multiValued
*_i      int            true      true     false
*_l      long           true      true     false
*_f      float          true      true     false
*_d      double         true      true     false
*_s      string         true      true     false
*_ss     string         true      true     true
*_t      text_general   true      true     false
*_txt    text_general   true      true     true
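
The naming convention in the table can be mimicked with simple pattern matching. The Python sketch below is only a model of the idea: the table contents come from the slide, while the function name `resolve_dynamic_field` and the first-match lookup are assumptions (Solr's own matching rules for overlapping patterns may differ).

```python
from fnmatch import fnmatchcase

# (pattern, type, multiValued); all of these are indexed and stored
DYNAMIC_FIELDS = [
    ("*_i", "int", False),
    ("*_l", "long", False),
    ("*_f", "float", False),
    ("*_d", "double", False),
    ("*_s", "string", False),
    ("*_ss", "string", True),
    ("*_t", "text_general", False),
    ("*_txt", "text_general", True),
]


def resolve_dynamic_field(name):
    """Return (type, multiValued) for a field name, or None if no pattern matches."""
    for pattern, field_type, multi_valued in DYNAMIC_FIELDS:
        if fnmatchcase(name, pattern):
            return field_type, multi_valued
    return None


print(resolve_dynamic_field("amount_i"))
print(resolve_dynamic_field("tag_ss"))
print(resolve_dynamic_field("title"))
```

A name such as "title" matches no pattern, so it would need an explicit field definition in the schema.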

Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
○ source: "title", "author", "description"
○ destination: "text"
○ searching the "text" field has the effect of searching
all the other three fields
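
The effect of a copy field can be pictured with a small Python sketch. It only models the outcome on a single document; in Solr the copying is declared in the schema and happens at index time. The function name `apply_copy_fields` and the sample values are hypothetical.

```python
def apply_copy_fields(document, sources, destination):
    """Return a copy of the document with the source values gathered in the destination field."""
    copied = dict(document)
    # The destination ends up holding one value per source field present
    copied[destination] = [document[name] for name in sources if name in document]
    return copied


# Hypothetical document
doc = {
    "title": "Solr basics",
    "author": "K. Chan",
    "description": "indexing and searching",
}
indexed = apply_copy_fields(doc, ["title", "author", "description"], "text")
print(indexed["text"])
```

Since the destination receives several values, it makes sense for the catch-all field to be multivalued.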

Indexing - UpdateRequestHandler
● upload (POST) content or file to
http://host:port/solr/update
● formats: XML, JSON, CSV

XML:
<add>
  <doc>
    <field name="id">apple</field>
    <field name="compName">Apple</field>
    <field name="address">1 Infinite Way, Cupertino CA</field>
  </doc>
  <doc>
    <field name="id">asus</field>
    <field name="compName">ASUS Computer</field>
    <field name="address">800 Corporate Way Fremont, CA 94539</field>
  </doc>
</add>

JSON:
[
  {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
  {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
]

CSV:
id,compName_s,address_s
apple,Apple,"1 Infinite Way, Cupertino CA"
asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
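
As a rough illustration, the JSON form of the upload could be prepared from Python with the standard library. The sketch assumes a Solr instance at localhost:8983, that the update handler accepts JSON when the Content-Type is application/json, and that `commit=true` is wanted so the documents become searchable right away. The function name `build_update_request` is made up for this example, and the request is only built, not sent.

```python
import json
import urllib.request


def build_update_request(docs, base_url="http://localhost:8983/solr"):
    """Build (but do not send) a POST request that uploads documents as JSON."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url=base_url + "/update?commit=true",  # assumption: commit immediately
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


docs = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]
request = build_update_request(docs)
print(request.get_method(), request.full_url)

# With a running Solr, the request could then be sent with:
# urllib.request.urlopen(request)
```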

Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
○ RDBMS (JDBC)
○ e-mail (IMAP)
○ XML data locally (file) or remotely (HTTP)

● transformers
○ extract data (RegEx, XPath)
○ manipulate data (strip HTML tags)

Indexing - ExtractingRequestHandler
● allows indexing of different formats
○ e.g. PDF, MS Word, XML

● uses Apache Tika to extract text and
metadata
○ Tika: a framework for different file format parsers
(e.g. PDFBox for PDF, Apache POI for MS Word)

● maps extracted text to the “content” field
● maps metadata (e.g. MIME type) to different
fields

Searching - Basics
● send request to http://host:port/solr/select
● parameters
○ q - main query
○ fq - filter query
○ defType - query parser (e.g. lucene, edismax)
○ fl - fields to return
○ sort - sort criteria
○ wt - response writer (e.g. xml, json)
○ indent - set to true for pretty-printing

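example request:
http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
○ search handler's URL: http://localhost:8983/solr/select
○ main query: q=title:tablet
○ fields to return: fl=title,price,inStock
○ sort criteria: sort=price+asc
○ response writer: wt=json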
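
A request URL of this kind can be assembled from its parameters with Python's standard library, which also takes care of URL-encoding. This is a sketch assuming the default host and port from the slides; the function name `build_search_url` is hypothetical, and the sort value spells out a direction (`price asc`) as the sort syntax requires.

```python
from urllib.parse import urlencode


def build_search_url(params, base_url="http://localhost:8983/solr/select"):
    """Return the search handler URL with URL-encoded query parameters."""
    return base_url + "?" + urlencode(params)


url = build_search_url({
    "q": "title:tablet",          # main query
    "fl": "title,price,inStock",  # fields to return
    "sort": "price asc",          # sort criteria
    "wt": "json",                 # response writer
})
print(url)
```

The encoded form replaces ":" and "," with percent escapes and the space with "+", which is equivalent to the hand-written URL.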

Searching - Query Syntax - Field
● search a specific field
○ field_name:value

● if field omitted, Solr uses default field:
○ df parameter in URL
○ defaultSearchField setting in schema.xml
○ "text"

Searching - Query Syntax - Term
● a term by itself: matches documents that
contain that term
○ e.g. tablet

Searching - Query Syntax - Boolean
● “conventional” boolean operators supported
○ AND &&
○ OR ||
○ NOT !
● e.g. a AND b
○ all of a, b must occur
● e.g. a OR b
○ at least one of a, b must occur
● e.g. a AND NOT b
○ a must occur and b must not occur

● Lucene/Solr's boolean operators are not true
boolean operators
● e.g. a OR b OR c does not behave like
(a OR b) OR c
○ instead, a OR b OR c means at least one of a, b, c
must occur

● parentheses are supported

● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "at least one must"
(by default)
○ e.g. a b c
■ at least one of a, b, c must occur

● operators can mix
○ e.g. +a b c d -e
■ a must occur
■ b, c, d are optional once a "+" clause is present;
matching any of them only raises the score
■ e must not occur
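
The prefix rules can be modeled in a few lines of Python. The sketch assumes Lucene's usual behavior for boolean clauses: unprefixed clauses are required ("at least one must") only when the query has no "+" clause, and otherwise they influence scoring only. It handles single-word clauses only, and the function name `matches` is made up for this example.

```python
def matches(query, document_terms):
    """Decide whether a set of document terms satisfies a +/- prefixed query."""
    required, prohibited, optional = [], [], []
    for clause in query.split():
        if clause.startswith("+"):
            required.append(clause[1:])
        elif clause.startswith("-"):
            prohibited.append(clause[1:])
        else:
            optional.append(clause)

    if any(term not in document_terms for term in required):
        return False
    if any(term in document_terms for term in prohibited):
        return False
    if required:
        # Assumption: optional clauses only affect the score here
        return True
    # No "+" clause: at least one optional clause must occur
    return any(term in document_terms for term in optional)


print(matches("+a b c d -e", {"a", "b"}))
print(matches("+a b c d -e", {"a", "e"}))
print(matches("a b c", {"c"}))
```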

Searching - Query Syntax - Phrase
● phrases are enclosed by double-quotes
● e.g. +"the phrase"
○ the phrase must occur

● e.g. -"the phrase"
○ the phrase must not occur

Searching - Query Syntax - Boost
● manually assign different weights to clauses
● gives more weight to a field
○ e.g. title:a^10 body:a

● gives more weight to a word
○ e.g. title:a title:b^10

● gives phrases more weight than words
○ e.g. title:(+a +b) title:"a b"^10

Searching - Query Syntax - Range
● matches field values within a range
○ inclusive range - denoted by square brackets
○ exclusive range - denoted by curly brackets

● e.g. age:[10 TO 20]
○ matches the field "age" with the value in 10..20

● string or numeric comparison, depending on
the field's type
● open-ended range supported
● e.g. age:[10 TO *]
○ matches the field "age" with the value 10 or larger

Searching - Query Syntax - EDisMax
● suitable for user-generated queries
○ does not complain about the syntax
○ searches for individual words across several fields
("disjunction")
○ uses max score of a word in all fields for scoring
("max")

● configurable (in solrconfig.xml)
○ what fields to search the words in
○ boosting of these fields
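
The "disjunction" and "max" ideas can be illustrated with a toy scorer in Python. This is a deliberately simplified model: the per-field score is just a boosted term count, whereas the real parser uses Lucene's scoring and further options not shown here. The function name `dismax_score` and the sample document are hypothetical.

```python
def dismax_score(document, query_words, field_boosts):
    """Sum, over the query words, the best boosted score found in any field."""
    total = 0.0
    for word in query_words:
        # "disjunction": look for the word in every configured field
        per_field = [
            boost * document.get(field, "").split().count(word)
            for field, boost in field_boosts.items()
        ]
        # "max": only the best-scoring field counts for this word
        total += max(per_field, default=0.0)
    return total


# Hypothetical document and boosts
doc = {"title": "tablet case", "body": "a case for your tablet tablet"}
boosts = {"title": 10.0, "body": 1.0}
print(dismax_score(doc, ["tablet", "case"], boosts))
```

With these boosts a match in the title outweighs repeated matches in the body.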

Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
○ syntax: fieldName (asc|desc)
○ e.g. sort by ascending price (i.e. lowest price first):
price asc
○ e.g. sort by descending date (i.e. newest date first):
date desc

Sorting
● special field names
○ use score for score and _docid_ for document ID
○ e.g. sort by ascending score:
score asc
○ e.g. sort by descending document ID:
_docid_ desc

Sorting
● multiple fields and orders: separate by
commas
○ e.g. sort by descending starRating and ascending
price:
starRating desc, price asc

Sorting
● cannot use multivalued fields
● overrides the default sorting behavior
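
What a multi-field sort specification means can be shown with plain Python lists. The sketch parses a string in the `fieldName (asc|desc)` syntax and applies it to in-memory dictionaries; it does not reflect how Solr sorts internally. The function name `sort_documents` and the product data are hypothetical.

```python
def sort_documents(documents, sort_spec):
    """Sort dictionaries by a spec such as 'starRating desc, price asc'."""
    clauses = [clause.split() for clause in sort_spec.split(",")]
    ordered = list(documents)
    # Python's sort is stable, so applying the keys from last to first
    # makes the first clause the primary criterion
    for field, direction in reversed(clauses):
        ordered.sort(key=lambda doc: doc[field], reverse=(direction == "desc"))
    return ordered


# Hypothetical products
products = [
    {"name": "A", "starRating": 4, "price": 30},
    {"name": "B", "starRating": 5, "price": 50},
    {"name": "C", "starRating": 5, "price": 20},
]
result = sort_documents(products, "starRating desc, price asc")
print([p["name"] for p in result])
```

The two five-star products come first, and the cheaper of them leads.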

Faceted Search
● facet values: (distinct) values or (generally non-overlapping)
ranges of a field
● displaying facets
○ show possible values
○ let users narrow down their searches easily

[figure: a facet and its facet values (5 of them)]

Faceted Search
● set facet parameter to true - enables
faceting
● other parameters
○ facet.field - use the field's values as facets
■ return <value, count> pairs
○ facet.query - use the given queries as facets
■ return <query, count> pairs
○ facet.sort - set the ordering of the facets;
■ can be "count" or "index"
○ facet.offset and facet.limit - used for
pagination of facets
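
Field faceting boils down to counting distinct values among the matching documents. The Python sketch below imitates the `<value, count>` pairs together with the sort, offset and limit options on an in-memory list; the function name `facet_field`, its keyword arguments and the sample data are assumptions for illustration, not Solr's API.

```python
from collections import Counter


def facet_field(documents, field, sort="count", offset=0, limit=None):
    """Return (value, count) pairs for one field of the given documents."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    if sort == "count":
        # Highest count first; ties broken by value
        pairs = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    else:
        # "index": order by the value itself
        pairs = sorted(counts.items())
    end = None if limit is None else offset + limit
    return pairs[offset:end]


# Hypothetical search results
hits = [
    {"brand": "zeta"}, {"brand": "acme"}, {"brand": "beta"},
    {"brand": "zeta"}, {"brand": "beta"}, {"brand": "zeta"},
]
print(facet_field(hits, "brand"))
print(facet_field(hits, "brand", sort="index"))
print(facet_field(hits, "brand", offset=1, limit=1))
```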

Resources - Books
● Lucene in Action
○ written by 3 committers and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/

● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/

● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4-cookbook/book

Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/

● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/

Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/

● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/

● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide

Getting Started
● download Solr
○ requires Java 6 or newer to run

● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar

● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml

● use the Solr admin interface
○ http://localhost:8983/solr/
