Full Text Search with Lucene

Full Text Search
David LeBer
Align Software Inc.

How?

• Wild card database queries

• Database implementations

• Third party search engines

• Text indexing libraries

Wild Card Queries

SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'

• Easy


• Slow

• Hard to optimize

• Difficult to rank

Database Implementations

• MySQL FULLTEXT index and MATCH queries

• PostgreSQL tsvector & tsquery


• Fairly Easy


• Database specific SQL

• May include additional limitations
(e.g., MySQL before 5.6: MyISAM tables only)

• Functionality defined by the DB engine

Third Party Search Engines

• Google indexing / searching of your content


• Easy

• Matches user expectations


• Content must be available for indexing

• Loss of control

• Enhances the Google hegemony

Text Indexing Library

• Lucene


• Complete control

• Database independent

• Flexible search behaviour

• Ranked results


• Adds complexity

• Additional query language

• Parallel index

Lucene Overview

• Open Source - part of the Apache Project

• Very flexible

• Wickedly fast

• Index based

Lucene : Installing

• Add the Lucene jars to your classpath

• Use ERIndexing

Lucene : Tasks

• Indexing

• Searching

Indexing : Steps

• Conversion (to plain text)

• Analysis (clean and convert the text to tokens)

• Index (save the tokens to the index)
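
The three steps above can be sketched in plain Java. This is a toy inverted index for illustration only, not how Lucene stores its index; the class TinyIndexer and its methods are hypothetical names, and the markup stripping and tokenizing rules are simplifying assumptions.

```java
import java.util.*;

// Illustrative only: a toy inverted index, not Lucene's index format or API.
final class TinyIndexer {
    // term -> ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    // Conversion: strip markup so only plain text remains (very naive).
    static String toPlainText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Analysis: lowercase the text and split it into letter/digit tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Index: record each token against the document id.
    void add(int docId, String html) {
        for (String token : analyze(toPlainText(html))) {
            postings.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    // Look up the ids of the documents containing an (already lowercased) term.
    SortedSet<Integer> docsFor(String term) {
        return postings.getOrDefault(term, Collections.emptySortedSet());
    }

    public static void main(String[] args) {
        TinyIndexer index = new TinyIndexer();
        index.add(1, "<h1>WebObjects</h1> and Lucene");
        index.add(2, "<p>Apple ships WebObjects</p>");
        System.out.println(index.docsFor("webobjects")); // [1, 2]
        System.out.println(index.docsFor("lucene"));     // [1]
    }
}
```

In Lucene itself these roles are played by your own conversion code, an Analyzer, and an IndexWriter.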

Indexing : Parts

• Index - either file or memory based

• Document - represents a unique object added to the index

• Field - identifies a chunk of data in the document

Indexing : Classes

• IndexWriter

• Directory

• Analyzer

• Document

• Field

Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
// Open a file-system directory to hold the index
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// true = create a new index, replacing any existing one
IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true,
    IndexWriter.MaxFieldLength.UNLIMITED);

Indexing : Field Parameters

• Stored or not

• Analyzed or not, with or without norms

• Include position, offset, and term frequency

Indexing : Analyzers

• SimpleAnalyzer

• StopAnalyzer

• StandardAnalyzer

• ...
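
To give a feel for how the choice of analyzer changes the tokens, here is a plain-Java imitation of two of the behaviours above. It does not use Lucene classes; AnalyzerSketch and its short stop list are assumptions made for this example, and StandardAnalyzer's grammar-based tokenizing is not imitated.

```java
import java.util.*;

// Plain-Java imitation of two analyzer behaviours; it does not use Lucene classes.
final class AnalyzerSketch {
    // Hypothetical stop list for this example only.
    static final Set<String> STOP_WORDS =
        new HashSet<>(Arrays.asList("a", "an", "and", "of", "the", "to", "with"));

    // Like SimpleAnalyzer: break the text at non-letters and lowercase each token.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) tokens.add(t.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Like StopAnalyzer: the same tokens with stop words dropped.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Full Text Search with Lucene and the WO-2011 demo";
        System.out.println(simple(text)); // [full, text, search, with, lucene, and, the, wo, demo]
        System.out.println(stop(text));   // [full, text, search, lucene, wo, demo]
    }
}
```

Whichever analyzer builds the index should normally also be used when parsing queries, so that both sides produce the same tokens.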

Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);

Indexing : Fun with indexes

• Multiple Access

Searching : Steps

• Clean the user input

• Create a Query

• Query the Index

• Return the results
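
The four steps above, sketched against a hand-built toy index in plain Java. TinySearch is a hypothetical name, the index is a simple map of term to document ids, and every query is treated as an AND of its terms; a real Lucene search would go through QueryParser and IndexSearcher instead.

```java
import java.util.*;

// Illustrative only: the four search steps against a hand-built toy index.
final class TinySearch {
    // Step 1: clean the user input (trim, lowercase, split on whitespace).
    static List<String> clean(String input) {
        String trimmed = input.trim().toLowerCase(Locale.ROOT);
        return trimmed.isEmpty()
            ? Collections.<String>emptyList()
            : Arrays.asList(trimmed.split("\\s+"));
    }

    // Steps 2-4: treat the terms as an AND query, intersect their postings, return the ids.
    static SortedSet<Integer> search(Map<String, Set<Integer>> index, String input) {
        SortedSet<Integer> hits = null;
        for (String term : clean(input)) {
            Set<Integer> docs = index.getOrDefault(term, Collections.<Integer>emptySet());
            if (hits == null) hits = new TreeSet<>(docs);
            else hits.retainAll(docs);
        }
        return hits == null ? new TreeSet<Integer>() : hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new HashMap<>();
        index.put("webobjects", new HashSet<>(Arrays.asList(1, 2)));
        index.put("apple", new HashSet<>(Arrays.asList(2, 3)));
        System.out.println(search(index, "  WebObjects Apple ")); // [2]
    }
}
```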

Searching : Search Classes
• IndexReader

• IndexSearcher

• Query

• QueryParser

• TopDocs/ScoreDoc

• Document

Searching : Query Types
• TermQuery

• RangeQuery

• PrefixQuery

• BooleanQuery

• PhraseQuery

• WildcardQuery

• FuzzyQuery

Searching : QueryParser
• 'webobjects' - contains an exact match - TermQuery

• 'webobjects apple', 'webobjects OR apple' - an OR Query

• +webobjects +apple / webobjects AND apple - an AND Query

• title:webobjects - Contains the term in title field

• title:webobjects -subject:iTunes / title:webobjects AND NOT subject:iTunes

• (webobjects OR apple) AND iTunes

Searching : QueryParser

• title:"apple webobjects" - Phrase Query

• title:"apple webobjects"~5 - slop of 5

• webobj* - Prefix Query

• webobjicts~ - Fuzzy Query

• lastmodified:[1/1/10 TO 1/1/11] - Range Query
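
Fuzzy matching such as webobjicts~ is based on Levenshtein (edit) distance: the number of single-character insertions, deletions, or substitutions needed to turn one term into another. A minimal sketch of that measure in plain Java follows; EditDistance is a hypothetical helper, not a Lucene class, and how Lucene turns the distance into a similarity threshold is not shown.

```java
// Illustrative only: classic Levenshtein distance computed with two rolling rows.
final class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            // Swap rows so prev always holds the last completed row.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("webobjicts", "webobjects")); // 1
    }
}
```

The misspelled term is one substitution away from the indexed term, which is why a fuzzy query can still find it.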

Performing a Search

Query query = ... // assume exists
// true = open the index read-only
IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
// Collect the 10 highest-scoring hits
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_29,
    "content", analyzer);
Query query = queryParser.parse(queryString);

“The more times a query term appears in a
document relative to the number of times the term
appears in all the documents in the collection, the
more relevant that document is to the query”
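
The idea in the quote can be made concrete with the textbook TF-IDF formula: term frequency in the document multiplied by the log of (total documents / documents containing the term). The sketch below is an assumption-laden simplification, not Lucene's actual scoring, which adds normalization and boost factors; TfIdfSketch and the sample corpus are made up for this example.

```java
import java.util.*;

// Illustrative only: textbook TF-IDF, not Lucene's scoring formula.
final class TfIdfSketch {
    // score = (occurrences of term in doc) * ln(total docs / docs containing the term)
    static double score(String term, List<String> doc, List<List<String>> corpus) {
        long tf = doc.stream().filter(term::equals).count();
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        if (tf == 0 || df == 0) return 0.0;
        return tf * Math.log((double) corpus.size() / df);
    }

    public static void main(String[] args) {
        List<List<String>> corpus = Arrays.asList(
            Arrays.asList("apple", "webobjects", "webobjects"),
            Arrays.asList("apple", "itunes"),
            Arrays.asList("apple", "lucene"));
        List<String> first = corpus.get(0);
        // "apple" is in every document, so it says nothing about relevance.
        System.out.println(score("apple", first, corpus)); // 0.0
        // "webobjects" is frequent here and rare elsewhere, so it scores higher.
        System.out.println(score("webobjects", first, corpus) > score("apple", first, corpus)); // true
    }
}
```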

Boost

• While Indexing
    • Document
    • Field

• While Searching
    • Query

ERIndexing : Strengths

• Hides some of the complexity of integrating Lucene with WO

• Offers lots of utility and helper methods

• Speaks WebObjects collection classes

• Simplifies index creation

ERIndexing : Weaknesses

• Hides some of the complexity of integrating Lucene with WO

• Not fully baked

• Auto indexing may be dangerous

Beyond Lucene

• Solr

• Compass

• ElasticSearch

Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/
