Search Engine-Building
with Lucene and Solr
Part 1
Kai Chan
SoCal Code Camp, November 2013
Overview
● why Lucene/Solr?
● what are Lucene and Solr?
● how to use Lucene and Solr
○ setup
○ indexing
○ searching

● resources
● demo
● questions/answers
How to Make Your Data Searchable
● pay someone to do it
● use some solution someone else has written
● write some solution yourself
How to Search - One Approach
for each document d {
    if (query is a substring of d's content) {
        add d to the list of results
    }
}
sort the results (or not)
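A minimal Python sketch of this approach, assuming the documents fit in a list of strings. The sample documents and the name linear_search are illustrative, not from the talk.

```python
# Naive search: scan every document and test for a substring match.
# The sample documents and the function name are illustrative assumptions.

def linear_search(documents, query):
    """Return the IDs of documents whose content contains the query string."""
    results = []
    for doc_id, content in enumerate(documents):
        if query in content:  # substring test over the document's content
            results.append(doc_id)
    return results


documents = ["it is what it is", "what is it", "it is a banana"]
print(linear_search(documents, "what"))    # [0, 1]
print(linear_search(documents, "banana"))  # [2]
```

Every call visits every document, so the cost of one search grows with the size of the dataset.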
How to Search - Problems
● slow
○ reads the whole dataset for each search

● not scalable
○ if your dataset grows by 10x,
your search slows down by 10x

● how to show the most relevant documents
first?
○ list of results can be quite long
○ users have limited time and patience
Inverted Index - Introduction
● like the "index" at the end of books
● a map of one of the following types
○ term → document list
○ term → <document, position> list
documents:
T[0] = "it is what it is"
T[1] = "what is it"
T[2] = "it is a banana"

inverted index (without positions):
"a":      {2}
"banana": {2}
"is":     {0, 1, 2}
"it":     {0, 1, 2}
"what":   {0, 1}

inverted index (with positions):
"a":      {(2, 2)}
"banana": {(2, 3)}
"is":     {(0, 1), (0, 4), (1, 1), (2, 1)}
"it":     {(0, 0), (0, 3), (1, 2), (2, 0)}
"what":   {(0, 2), (1, 0)}
Credit: Wikipedia (http://en.wikipedia.org/wiki/Inverted_index)
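The two index types can be built in a few lines of Python. This is a sketch that assumes whitespace tokenization and in-memory dicts; the function names are illustrative, and a real Lucene index is built and stored quite differently.

```python
from collections import defaultdict

documents = ["it is what it is", "what is it", "it is a banana"]


def build_index(docs):
    """Map each term to the set of IDs of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.split():  # whitespace tokenization (a simplification)
            index[term].add(doc_id)
    return dict(index)


def build_positional_index(docs):
    """Map each term to a list of (document ID, word position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        for position, term in enumerate(text.split()):
            index[term].append((doc_id, position))
    return dict(index)


index = build_index(documents)
positional = build_positional_index(documents)
print(sorted(index["is"]))  # [0, 1, 2]
print(positional["what"])   # [(0, 2), (1, 0)]
```

The results for the three sample documents match the listing on the slide.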
Inverted Index - Speed
● term list
○ typically very small
○ grows slowly

● term lookup
○ O(1) to O(log(number of terms))

● for a particular term
○ document lists: very small
○ document + position lists: still small

● few terms per query
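A small sketch of why lookups stay cheap, assuming the index is a plain Python dict (one hash lookup per term) that holds the three-document example. search_all is a hypothetical helper, not a Lucene or Solr function.

```python
# Hypothetical in-memory index, copied from the three-document example.
index = {
    "a": {2},
    "banana": {2},
    "is": {0, 1, 2},
    "it": {0, 1, 2},
    "what": {0, 1},
}


def search_all(index, terms):
    """Return the IDs of documents that contain every query term."""
    postings = [index.get(term, set()) for term in terms]  # one lookup per term
    if not postings:
        return set()
    # Start from the shortest list so intermediate results stay small.
    postings.sort(key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
    return result


print(sorted(search_all(index, ["what", "is"])))    # [0, 1]
print(sorted(search_all(index, ["banana", "it"])))  # [2]
print(sorted(search_all(index, ["apple"])))         # []
```

Only the document lists of the query terms are read; the documents themselves are never scanned.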
Inverted Index - Relevance
● information in the index enables:
○ determination (scoring) of relevance of each
document to the query
○ comparison of relevance among documents
○ sorting by (decreasing) relevance
■ i.e. the most relevant document first
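One way to picture scoring is a textbook TF-IDF weight computed from the same counts an index holds. The sketch below is an assumption made for illustration: the formula is a simplified textbook variant, not Lucene's actual scoring function, and the names score and rank are hypothetical.

```python
import math
from collections import Counter

documents = ["it is what it is", "what is it", "it is a banana"]
tokenized = [text.split() for text in documents]

# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))


def score(query_terms, tokens):
    """Sum tf * idf over the query terms found in one document."""
    counts = Counter(tokens)
    total = 0.0
    for term in query_terms:
        if counts[term]:
            tf = counts[term] / len(tokens)  # share of the document
            idf = 1.0 + math.log(len(tokenized) / doc_freq[term])  # rarity
            total += tf * idf
    return total


def rank(query):
    """Return the IDs of matching documents, most relevant first."""
    terms = query.split()
    scored = [(score(terms, tokens), doc_id)
              for doc_id, tokens in enumerate(tokenized)]
    scored = [pair for pair in scored if pair[0] > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(rank("what"))    # [1, 0]
print(rank("banana"))  # [2]
```

Document 1 ranks above document 0 for "what" because the term makes up a larger share of the shorter document.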
Lucene vs. Solr - Lucene
● full-text search library
● creates, updates and reads from the index
● takes queries and produces search results
● your application creates objects and calls
methods in the Lucene API
● provides building blocks for custom features
Lucene vs. Solr - Solr
● full-text search platform
● uses Lucene for indexing and search
● REST-like API over HTTP
● different output formats (e.g. XML, JSON)
● provides some features not built into Lucene
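Lucene:
[diagram: a machine running a Java VM hosts your application; the application contains the Lucene code and libraries and accesses the index]

Solr:
[diagram: a machine running a Java VM hosts a servlet container (e.g. Tomcat, Jetty); the container runs Solr, which contains the Solr code, the Lucene code and libraries and accesses the index; a client talks to Solr over HTTP]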
Workflow
Setup → Indexing → Search
Workflow - Setup
● servlet configuration
○ e.g. port number, max POST size
○ you can usually use the default settings

● Solr configuration
○ e.g. data directory, deduplication, language
identification, highlighting
○ you can usually use the default settings

● schema definition
○ defines fields in your documents
○ you can use the default settings if you name your
fields in a certain way
How Data Are Organized
[diagram: a collection contains documents; each document contains fields]
field
● name (e.g. "title" or "price")
● content (e.g. "please read" or 30)
● type
● options
[diagram: a collection of three documents with the fields subject, date, from, reply-to and text; subject and reply-to each appear in two of the three documents]
[diagram: a collection of three documents with different fields: one with subject, date, from, text; one with title, SKU, price, description; one with first name, last name, phone, address]
Solr Field Definition
● field
○ name (e.g. "subject")
○ type (e.g. "text_general")
○ options (e.g. indexed="true" stored="true")

● field type
○ text: "string", "text_general"
○ numeric: "int", "long", "float", "double"

● options
○ indexed: content can be searched
○ stored: content can be returned at search-time
○ multivalued: multiple values per field & document
Solr Dynamic Field
● define field by naming convention
● "amount_i": int, indexed, stored
● "tag_ss": string, indexed, stored, multivalued
name    type          indexed  stored  multiValued
*_i     int           true     true    false
*_l     long          true     true    false
*_f     float         true     true    false
*_d     double        true     true    false
*_s     string        true     true    false
*_ss    string        true     true    true
*_t     text_general  true     true    false
*_txt   text_general  true     true    true
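The naming convention can be pictured as a suffix lookup. The sketch below copies the patterns from the table; the resolve function and the first-match rule are assumptions made for illustration and do not show how Solr itself matches dynamic fields.

```python
from fnmatch import fnmatchcase

# (pattern, type, multiValued) rows copied from the dynamic field table;
# every row is indexed and stored.
DYNAMIC_FIELDS = [
    ("*_i", "int", False),
    ("*_l", "long", False),
    ("*_f", "float", False),
    ("*_d", "double", False),
    ("*_s", "string", False),
    ("*_ss", "string", True),
    ("*_t", "text_general", False),
    ("*_txt", "text_general", True),
]


def resolve(field_name):
    """Return (type, multiValued) for the first matching pattern, or None."""
    for pattern, field_type, multi_valued in DYNAMIC_FIELDS:
        if fnmatchcase(field_name, pattern):
            return field_type, multi_valued
    return None


print(resolve("amount_i"))  # ('int', False)
print(resolve("tag_ss"))    # ('string', True)
print(resolve("title"))     # None
```

A name such as "title" matches no pattern, so it would need an explicit field definition in the schema.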
Solr Copy Field
● copy one or more fields into another field
● can be used to define a catch-all field
○ source: "title", "author", "description"
○ destination: "text"
○ searching the "text" field has the effect of searching
all three source fields
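The effect of a catch-all field can be sketched in plain Python. The document values and the helpers apply_copy_fields and matches are hypothetical; in Solr the copying is declared in the schema and happens at indexing time.

```python
def apply_copy_fields(doc, sources, destination):
    """Return a copy of doc with the source values gathered into destination."""
    result = dict(doc)
    result[destination] = [doc[name] for name in sources if name in doc]
    return result


def matches(doc, field, word):
    """True if any value of the field contains the word (case-insensitive)."""
    values = doc.get(field, [])
    if isinstance(values, str):
        values = [values]
    return any(word.lower() in value.lower().split() for value in values)


# Made-up sample document.
doc = {"title": "Solr basics", "author": "Jane Doe", "description": "An intro"}
indexed = apply_copy_fields(doc, ["title", "author", "description"], "text")

print(matches(indexed, "text", "intro"))   # True
print(matches(indexed, "title", "intro"))  # False
```

The word "intro" occurs only in the description, yet a search on the single "text" field finds it.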
Indexing - UpdateRequestHandler
● upload (POST) content or file to
http://host:port/solr/update
● formats: XML, JSON, CSV
XML:
<add>
  <doc>
    <field name="id">apple</field>
    <field name="compName">Apple</field>
    <field name="address">1 Infinite Way, Cupertino CA</field>
  </doc>
  <doc>
    <field name="id">asus</field>
    <field name="compName">ASUS Computer</field>
    <field name="address">800 Corporate Way Fremont, CA 94539</field>
  </doc>
</add>

JSON:
[
  {"id":"apple","compName_s":"Apple","address_s":"1 Infinite Way, Cupertino CA"},
  {"id":"asus","compName_s":"Asus Computer","address_s":"800 Corporate Way Fremont, CA 94539"}
]

CSV:
id,compName_s,address_s
apple,Apple,"1 Infinite Way, Cupertino CA"
asus,Asus Computer,"800 Corporate Way Fremont, CA 94539"
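The JSON and CSV payloads can be generated from the same records with Python's standard library. This is a sketch: the localhost URL is taken from the examples elsewhere in these slides, the Content-Type header is an assumption about how the handler tells the formats apart, and the request is only built, not sent.

```python
import csv
import io
import json
import urllib.request

records = [
    {"id": "apple", "compName_s": "Apple",
     "address_s": "1 Infinite Way, Cupertino CA"},
    {"id": "asus", "compName_s": "Asus Computer",
     "address_s": "800 Corporate Way Fremont, CA 94539"},
]

# JSON: a list of objects, one per document.
json_body = json.dumps(records, indent=2)

# CSV: a header row with the field names, then one row per document.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["id", "compName_s", "address_s"],
                        lineterminator="\n")
writer.writeheader()
writer.writerows(records)
csv_body = buffer.getvalue()

# Assumed endpoint: a local Solr instance at the URL used in these slides.
request = urllib.request.Request(
    "http://localhost:8983/solr/update",
    data=json_body.encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; that needs a running Solr.

print(csv_body)
```

The csv module quotes the address values because they contain commas, which gives the same rows as the CSV example.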
Indexing - DataImportHandler
● has its own config file (data-config.xml)
● import data from various sources
○ RDBMS (JDBC)
○ e-mail (IMAP)
○ XML data locally (file) or remotely (HTTP)

● transformers
○ extract data (RegEx, XPath)
○ manipulate data (strip HTML tags)
Indexing - ExtractingRequestHandler
● allows indexing of different formats
○ e.g. PDF, MS Word, XML

● uses Apache Tika to extract text and
metadata
○ Tika: a framework for different file format parsers
(e.g. PDFBox for PDF, Apache POI for MS Word)

● maps extracted text to the “content” field
● maps metadata (e.g. MIME type) to different
fields
Searching - Basics
● send request to http://host:port/solr/select
● parameters
○ q - main query
○ fq - filter query
○ defType - query parser (e.g. lucene, edismax)
○ fl - fields to return
○ sort - sort criteria
○ wt - response writer (e.g. xml, json)
○ indent - set to true for pretty-printing
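http://localhost:8983/solr/select?q=title:tablet&fl=title,price,inStock&sort=price+asc&wt=json
● http://localhost:8983/solr/select - search handler's URL
● q=title:tablet - main query
● fl=title,price,inStock - fields to return
● sort=price+asc - sort criteria
● wt=json - response writer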
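A request URL like this can be assembled with Python's urllib.parse, which takes care of escaping. This is a sketch that assumes a local Solr at the address shown on the slide; the sort value carries an explicit direction, following the fieldName (asc|desc) syntax.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

base_url = "http://localhost:8983/solr/select"  # search handler's URL
params = {
    "q": "title:tablet",          # main query
    "fl": "title,price,inStock",  # fields to return
    "sort": "price asc",          # sort criteria
    "wt": "json",                 # response writer
}

# urlencode escapes reserved characters and joins the pairs with "&".
url = base_url + "?" + urlencode(params)
print(url)

# Decoding the query string gives back the original parameter values.
print(parse_qs(urlsplit(url).query))
```

Building the query string from a dict avoids hand-escaping spaces and punctuation in the parameter values.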
Searching - Query Syntax - Field
● search a specific field
○ field_name:value

● if field omitted, Solr uses default field:
○ df parameter in URL
○ defaultSearchField setting in schema.xml
○ "text"
Searching - Query Syntax - Term
● a term by itself: matches documents that
contain that term
○ e.g. tablet
Searching - Query Syntax - Boolean
● “conventional” boolean operators supported
○ AND &&
○ OR ||
○ NOT !
● e.g. a AND b
○ all of a, b must occur
● e.g. a OR b
○ at least one of a, b must occur
● e.g. a AND NOT b
○ a must occur and b must not occur
Searching - Query Syntax - Boolean
● Lucene/Solr's boolean operators are not true
boolean operators
● e.g. a OR b OR c does not behave like
(a OR b) OR c
○ instead, a OR b OR c means at least one of a, b, c
must occur

● parentheses are supported
Searching - Query Syntax - Boolean
● "+" prefix means "must"
● "-" prefix means "must not"
● no prefix means "at least one must"
(by default)
○ e.g. a b c
■ at least one of a, b, c must occur

● operators can mix
○ e.g. +a b c d -e
■ a must occur
■ b, c, d are optional once a "+" clause is present;
documents that contain them score higher
■ e must not occur
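The prefix rules can be sketched as set operations on a document's terms. The sketch assumes the rule from Lucene's BooleanQuery that unprefixed terms are required only when the query has no "+" term; with a "+" term they are optional and only influence the score. The names parse and matches are illustrative, and scoring is left out.

```python
def parse(query):
    """Split a query such as '+a b c d -e' into must, should, must-not terms."""
    must, should, must_not = set(), set(), set()
    for token in query.split():
        if token.startswith("+"):
            must.add(token[1:])
        elif token.startswith("-"):
            must_not.add(token[1:])
        else:
            should.add(token)
    return must, should, must_not


def matches(query, doc_terms):
    """Decide whether a document (given as a set of terms) matches the query."""
    must, should, must_not = parse(query)
    if must_not & doc_terms:
        return False                 # a "-" term occurs
    if must:
        return must <= doc_terms     # unprefixed terms are then optional
    return bool(should & doc_terms)  # no "+" term: at least one must occur


print(matches("a b c", {"c", "x"}))        # True
print(matches("a b c", {"x"}))             # False
print(matches("+a b c d -e", {"a", "b"}))  # True
print(matches("+a b c d -e", {"a", "e"}))  # False
print(matches("+a b c d -e", {"b", "c"}))  # False
```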
Searching - Query Syntax - Phrase
● phrases are enclosed by double-quotes
● e.g. +"the phrase"
○ the phrase must occur

● e.g. -"the phrase"
○ the phrase must not occur
Searching - Query Syntax - Boost
● manually assign different weights to clauses
● gives more weight to a field
○ e.g. title:a^10 body:a

● gives more weight to a word
○ e.g. title:a title:b^10

● gives phrases more weight than words
○ e.g. title:(+a +b) title:"a b"^10
Searching - Query Syntax - Range
● matches field values within a range
○ inclusive range - denoted by square brackets
○ exclusive range - denoted by curly brackets

● e.g. age:[10 TO 20]
○ matches the field "age" with the value in 10..20

● string or numeric comparison, depending on
the field's type
● open-ended range supported
● e.g. age:[10 TO *]
○ matches the field "age" with the value 10 or larger
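The range rules can be sketched with one comparison function. in_range is a hypothetical helper; None stands for the open end "*", and the two flags stand for square (inclusive) or curly (exclusive) brackets.

```python
def in_range(value, lower, upper, include_lower=True, include_upper=True):
    """Test a value against a range; None stands for the open end '*'."""
    if lower is not None:
        if value < lower or (value == lower and not include_lower):
            return False
    if upper is not None:
        if value > upper or (value == upper and not include_upper):
            return False
    return True


print(in_range(10, 10, 20))                       # True:  [10 TO 20]
print(in_range(10, 10, 20, include_lower=False,
               include_upper=False))              # False: {10 TO 20}
print(in_range(35, 10, None))                     # True:  [10 TO *]

# The field's type decides the comparison: as numbers 100 is outside
# 10..20, but as strings "100" sorts between "10" and "20".
print(in_range(100, 10, 20))        # False
print(in_range("100", "10", "20"))  # True
```

The last two calls show why a numeric field type matters for range queries on numbers.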
Searching - Query Syntax - EDisMax
● suitable for user-generated queries
○ does not complain about the syntax
○ searches for individual words across several fields
("disjunction")
○ uses max score of a word in all fields for scoring
("max")

● configurable (in solrconfig.xml)
○ what fields to search the words in
○ boosting of these fields
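The "disjunction max" idea can be sketched with made-up numbers. Everything here is an assumption for illustration: the per-field scores are invented inputs, the function names are hypothetical, and the tie argument only mimics the idea of a tie-breaker that lets the non-maximal fields contribute a little. It is not Solr's implementation.

```python
def word_score(field_scores, boosts, tie=0.0):
    """Combine one word's per-field scores: the maximum plus tie times the rest."""
    weighted = [score * boosts.get(field, 1.0)
                for field, score in field_scores.items()]
    if not weighted:
        return 0.0
    best = max(weighted)
    return best + tie * (sum(weighted) - best)


def query_score(per_word_scores, boosts, tie=0.0):
    """Add up the per-word scores of a multi-word query."""
    return sum(word_score(scores, boosts, tie)
               for scores in per_word_scores.values())


boosts = {"title": 2.0, "body": 1.0}  # hypothetical field boosts
per_word_scores = {                   # invented scores of each word per field
    "cheap": {"body": 0.25},
    "tablet": {"title": 0.5, "body": 0.75},
}

print(query_score(per_word_scores, boosts))           # 1.25
print(query_score(per_word_scores, boosts, tie=0.5))  # 1.625
```

With the title boost, "tablet" is scored by its title match (0.5 x 2.0) even though its raw body score is higher.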
Sorting
● default: sorting by decreasing score
● custom sorting rules: use the sort parameter
○ syntax: fieldName (asc|desc)
○ e.g. sort by ascending price (i.e. lowest price first):
price asc
○ e.g. sort by descending date (i.e. newest date first):
date desc
Sorting
● special field names
○ use score for score and _docid_ for document ID
○ e.g. sort by ascending score:
score asc
○ e.g. sort by descending document ID
_docid_ desc
Sorting
● multiple fields and orders: separate by
commas
○ e.g. sort by descending starRating and ascending
price:
starRating desc, price asc
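The effect of a multi-field sort can be reproduced in Python on a small list. The product data and the helpers parse_sort and sort_documents are made up for illustration; Solr applies the sort on the server.

```python
# Made-up sample documents.
products = [
    {"name": "A", "starRating": 4, "price": 30},
    {"name": "B", "starRating": 5, "price": 50},
    {"name": "C", "starRating": 5, "price": 20},
    {"name": "D", "starRating": 4, "price": 10},
]


def parse_sort(spec):
    """Turn 'starRating desc, price asc' into [(field, descending), ...]."""
    criteria = []
    for clause in spec.split(","):
        field, order = clause.split()
        if order not in ("asc", "desc"):
            raise ValueError("order must be asc or desc: " + clause)
        criteria.append((field, order == "desc"))
    return criteria


def sort_documents(docs, spec):
    """Apply the criteria from last to first; Python's sort is stable."""
    result = list(docs)
    for field, descending in reversed(parse_sort(spec)):
        result.sort(key=lambda doc: doc[field], reverse=descending)
    return result


ordered = sort_documents(products, "starRating desc, price asc")
print([doc["name"] for doc in ordered])  # ['C', 'B', 'D', 'A']
```

The first criterion decides the order; the second only breaks ties among documents with the same starRating.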
Sorting
● cannot use multivalued fields
● overrides the default sorting behavior
Faceted Search
● facet values: (distinct) values or (generally non-overlapping) ranges of a field
● displaying facets
○ show possible values
○ let users narrow down their searches easily
[screenshot: a facet and its facet values (5 of them)]
Faceted Search
● set facet parameter to true - enables
faceting
● other parameters
○ facet.field - use the field's values as facets
■ return <value, count> pairs
○ facet.query - use the given queries as facets
■ return <query, count> pairs
○ facet.sort - set the ordering of the facets;
■ can be "count" or "index"
○ facet.offset and facet.limit - used for
pagination of facets
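What these parameters return can be pictured with a small in-memory sketch. The documents and the helpers facet_field and facet_query are hypothetical; they mimic the <value, count> and <query, count> pairs, not Solr's implementation. The queries are plain Python predicates labelled with the query text they stand for.

```python
from collections import Counter

# Made-up sample documents.
documents = [
    {"title": "tablet A", "brand": "Acme", "price": 150},
    {"title": "tablet B", "brand": "Acme", "price": 320},
    {"title": "tablet C", "brand": "Bolt", "price": 90},
    {"title": "tablet D", "brand": "Core", "price": 480},
]


def facet_field(docs, field, sort="count", offset=0, limit=None):
    """Return <value, count> pairs for a field, ordered and paginated."""
    counts = Counter(doc[field] for doc in docs if field in doc)
    if sort == "count":
        pairs = sorted(counts.items(), key=lambda p: (-p[1], p[0]))
    else:  # "index": order by the value itself
        pairs = sorted(counts.items())
    end = None if limit is None else offset + limit
    return pairs[offset:end]


def facet_query(docs, queries):
    """Return <query, count> pairs; each query is a (label, predicate) pair."""
    return [(label, sum(1 for doc in docs if predicate(doc)))
            for label, predicate in queries]


print(facet_field(documents, "brand"))
# [('Acme', 2), ('Bolt', 1), ('Core', 1)]
print(facet_field(documents, "brand", sort="index", offset=1, limit=2))
# [('Bolt', 1), ('Core', 1)]
print(facet_query(documents, [
    ("price:[0 TO 200]", lambda d: 0 <= d["price"] <= 200),
    ("price:[201 TO *]", lambda d: d["price"] >= 201),
]))
# [('price:[0 TO 200]', 2), ('price:[201 TO *]', 2)]
```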
Resources - Books
● Lucene in Action
○ written by 3 committers and PMC members
○ somewhat outdated (2010; covers Lucene 3.0)
○ http://www.manning.com/hatcher3/

● Solr in Action
○ early access; coming out later this year
○ http://www.manning.com/grainger/

● Apache Solr 4 Cookbook
○ common problems and useful tips
○ http://www.packtpub.com/apache-solr-4-cookbook/book
Resources - Books
● Introduction to Information Retrieval
○ not specific to Lucene/Solr, but about IR concepts
○ free e-book
○ http://nlp.stanford.edu/IR-book/

● Managing Gigabytes
○ indexing, compression and other topics
○ accompanied by MG4J - a full-text search software
○ http://mg4j.di.unimi.it/
Resources - Web
● official websites
○ Lucene Core - http://lucene.apache.org/core/
○ Solr - http://lucene.apache.org/solr/

● mailing lists
● Wiki sites
○ Lucene Core - http://wiki.apache.org/lucene-java/
○ Solr - http://wiki.apache.org/solr/

● reference guides
○ API Documentation for Lucene and Solr
○ Apache Solr Reference Guide
Getting Started
● download Solr
○ requires Java 6 or newer to run

● Solr comes bundled/configured with Jetty
○ <Solr directory>/example/start.jar

● "exampledocs" directory contains sample
documents
○ <Solr directory>/example/exampledocs/post.jar
○ java -Durl=http://localhost:8983/solr/update -jar post.jar *.xml

● use the Solr admin interface
○ http://localhost:8983/solr/
