Full Text Search
David LeBer
Align Software Inc.
What is full text search?
Full Text Search with Lucene
How?

•   Wild card database queries

•   Database implementations

•   Third party search engines

•   Text indexing libraries
Wild Card Queries

SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%Some String%'
Wild Card Queries



•   Easy
Wild Card Queries


•   Slow

•   Hard to optimize

•   Difficult to rank
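
One way to picture the cost: with a wildcard at the front of the pattern, an ordinary column index usually cannot help, so every row has to be examined. The sketch below mimics that with a plain in-memory list. WildcardScan and like are hypothetical names for illustration, not a database API, and the comparison here is case-sensitive, whereas a real LIKE depends on the database collation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a table scan; not a real database API.
final class WildcardScan {

    // Mimics: SELECT * FROM SOME_TABLE WHERE SOME_COLUMN LIKE '%needle%'
    // Every row is examined, so the cost grows with the size of the table.
    static List<String> like(List<String> rows, String needle) {
        List<String> matches = new ArrayList<>();
        for (String row : rows) {
            if (row.contains(needle)) {
                matches.add(row);
            }
        }
        return matches; // table order only, no relevance ranking
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "WebObjects and Lucene", "Apple iTunes", "Lucene in Action");
        System.out.println(like(rows, "Lucene"));
    }
}
```

The result comes back in table order, which is why ranking is difficult: nothing in the scan says which match is the better one.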
Database Implementations


•   MySQL FULLTEXT index and MATCH queries

•   PostgreSQL tsvector & tsquery
Database Implementations



•   Fairly Easy
Database Implementations

•   Database specific SQL

•   May include additional limitations
    (e.g. MySQL before 5.6: MyISAM tables only)

•   Functionality defined by the DB engine
Third Party Search Engines



•   Google indexing / searching of your content
Third Party Search Engines


•   Easy

•   Matches user expectations
Third Party Search Engines


•   Content must be available for indexing

•   Loss of control

•   Enhances the Google hegemony
Text Indexing Library



•   Lucene
Text Indexing Library

•   Complete control

•   Database independent

•   Flexible search behaviour

•   Ranked results
Text Indexing Library


•   Adds complexity

•   Additional query language

•   Parallel index
Lucene Overview

•   Open Source - part of the Apache Project

•   Very flexible

•   Wickedly fast

•   Index based
Lucene : Installing


•   Add the Lucene jars to your classpath

•   Use ERIndexing
Lucene : Tasks


•   Indexing

•   Searching
Indexing
What is Indexing?
Indexing : Steps


•   Conversion (to plain text)

•   Analysis (clean and convert the text to tokens)

•   Index (save the tokens to the index)
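
The three steps can be sketched in plain Java, without Lucene, to show how they fit together. This is a simplified illustration under stated assumptions: IndexingSteps and its methods are hypothetical names, the tag stripping is a naive stand-in for a real converter, and the map of tokens to document ids is only a toy version of an index.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Simplified illustration of the three indexing steps; not Lucene's implementation.
final class IndexingSteps {

    // Step 1, conversion: reduce the source to plain text.
    // Naive tag stripping stands in for a real converter here.
    static String toPlainText(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    // Step 2, analysis: lowercase the text and split it into tokens.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Step 3, index: record which document ids contain each token.
    static void index(Map<String, Set<Integer>> index, int docId, List<String> tokens) {
        for (String token : tokens) {
            index.computeIfAbsent(token, k -> new TreeSet<>()).add(docId);
        }
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        index(index, 1, analyze(toPlainText("<h1>Full Text Search</h1>")));
        index(index, 2, analyze(toPlainText("<p>Search with Lucene</p>")));
        System.out.println(index);
    }
}
```

Looking up a token is then a single map access instead of a scan over every document.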
Indexing : Parts


•   Index - either file or memory based

•   Document - represents a unique object added to the index

•   Field - identifies a chunk of data in the document
Indexing : Classes

•   IndexWriter

•   Directory

•   Analyzer

•   Document

•   Field
Creating an Index

URL indexDirectoryURL = ... // assume exists
File indexFile = new File(indexDirectoryURL.getPath());
FSDirectory indexDirectory = FSDirectory.open(indexFile);
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_CURRENT);
// 'true' creates a new index, replacing any existing one in the directory
IndexWriter indexWriter = new IndexWriter(indexDirectory, analyzer, true,
                                IndexWriter.MaxFieldLength.UNLIMITED);
Indexing : Field Parameters


•   Stored or not

•   Analyzed or not, with and without norms

•   Include position, offset, and term frequency
Indexing : Analyzers

•   SimpleAnalyzer

•   StopAnalyzer

•   StandardAnalyzer

•   ...
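
The difference between the first two analyzers can be approximated in plain Java. This is a rough sketch, not the Lucene classes: AnalyzerSketch is a hypothetical name and the stop word list is an assumed, abbreviated one. SimpleAnalyzer splits on non-letters and lowercases; StopAnalyzer does the same and then drops stop words. StandardAnalyzer uses a more elaborate tokenizer and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Rough approximations of two analyzers' behaviour; not the Lucene classes themselves.
final class AnalyzerSketch {

    // Assumed, abbreviated stop word list for illustration only.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "is", "of", "the", "to", "with"));

    // Like SimpleAnalyzer: split on anything that is not a letter, then lowercase.
    static List<String> simple(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Like StopAnalyzer: the same tokens with stop words removed.
    static List<String> stop(String text) {
        List<String> tokens = simple(text);
        tokens.removeAll(STOP_WORDS);
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Full Text Search with Lucene 2.9";
        System.out.println(simple(text));
        System.out.println(stop(text));
    }
}
```

Whichever analyzer is used for indexing should normally be used for parsing queries as well, otherwise the query terms may not match the indexed tokens.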
Adding a Document

String value = ... // assume exists
Document doc = new Document();
Field docField = new Field("title", value,
                            Field.Store.YES, Field.Index.ANALYZED);
doc.add(docField);
...
indexWriter.addDocument(doc);
Indexing : Fun with indexes



•   Multiple Access
Searching
What is Searching?
Searching : Steps

•   Clean the user input

•   Create a Query

•   Query the Index

•   Return the results
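
The four steps can be walked through with a toy in-memory index. This is a minimal sketch assuming a simple OR over the cleaned terms; SearchSteps and its methods are hypothetical names, not Lucene API, and no scoring is applied.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy search over an in-memory inverted index; names are illustrative, not Lucene API.
final class SearchSteps {
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final Map<Integer, String> stored = new HashMap<>();

    void add(int docId, String text) {
        stored.put(docId, text);
        for (String term : clean(text)) {
            postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Step 1: clean the user input the same way the indexed text was cleaned.
    static List<String> clean(String input) {
        List<String> terms = new ArrayList<>();
        for (String term : input.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Steps 2 to 4: treat the terms as an OR query, look each one up,
    // and return the stored text of every matching document.
    List<String> search(String input) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : clean(input)) {
            hits.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        List<String> results = new ArrayList<>();
        for (int docId : hits) {
            results.add(stored.get(docId));
        }
        return results;
    }

    public static void main(String[] args) {
        SearchSteps index = new SearchSteps();
        index.add(1, "WebObjects and Lucene");
        index.add(2, "Apple iTunes");
        System.out.println(index.search("  LUCENE!! "));
    }
}
```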
Searching : Search Classes
•   IndexReader

•   IndexSearcher

•   Query

•   QueryParser

•   TopDocs/ScoreDocs

•   Document
Searching : Query Types
•   TermQuery

•   RangeQuery

•   PrefixQuery

•   BooleanQuery

•   PhraseQuery

•   WildcardQuery

•   FuzzyQuery
Searching : QueryParser
•   'webobjects' - contains an exact match - TermQuery

•   'webobjects apple', 'webobjects OR apple' - an OR Query

•   +webobjects +apple / webobjects AND apple - an AND Query

•   title:webobjects - Contains the term in title field

•   title:webobjects -subject:iTunes / title:webobjects AND NOT
    subject:iTunes

•   (webobjects OR apple) AND iTunes
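
Conceptually, the boolean operators above combine the sets of documents that match each term. The sketch below shows that idea with plain set operations; BooleanOps is a hypothetical name, the document ids are made up for illustration, and this is not how Lucene evaluates or scores a BooleanQuery internally.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Boolean operators expressed as set operations on document ids (illustrative only).
final class BooleanOps {

    static Set<Integer> and(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b); // intersection
        return result;
    }

    static Set<Integer> or(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b); // union
        return result;
    }

    static Set<Integer> andNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b); // difference
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: the ids of the documents containing each term.
        Set<Integer> webobjects = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> apple = new TreeSet<>(Arrays.asList(2, 3, 4));
        Set<Integer> itunes = new TreeSet<>(Arrays.asList(3, 4));

        System.out.println(and(webobjects, apple));             // +webobjects +apple
        System.out.println(andNot(webobjects, itunes));         // webobjects AND NOT iTunes
        System.out.println(and(or(webobjects, apple), itunes)); // (webobjects OR apple) AND iTunes
    }
}
```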
Searching : QueryParser

•   title:"apple webobjects" - Phrase Query

•   title:"apple webobjects"~5 - slop of 5

•   webobj* - Prefix Query

•   webobjicts~ - Fuzzy Query

•   lastmodified:[1/1/10 TO 1/1/11] - Range Query
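
Prefix and fuzzy matching can both be pictured as operations on the sorted list of indexed terms. The sketch below is a simplification with hypothetical names (TermMatching, prefix, fuzzy): the prefix lookup is a range over a sorted set, and the fuzzy lookup uses a fixed maximum Levenshtein edit distance, which is an assumption made here rather than Lucene's own similarity threshold.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

// Term matching sketches over a sorted term dictionary; not Lucene's implementation.
final class TermMatching {

    // Prefix query: terms starting with the prefix form a contiguous range in sorted order.
    static SortedSet<String> prefix(TreeSet<String> terms, String prefix) {
        return terms.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    // Levenshtein edit distance: insertions, deletions and substitutions to turn a into b.
    static int editDistance(String a, String b) {
        int[] previous = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            previous[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            int[] current = new int[b.length() + 1];
            current[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int substitution = previous[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                current[j] = Math.min(substitution, Math.min(previous[j], current[j - 1]) + 1);
            }
            previous = current;
        }
        return previous[b.length()];
    }

    // Fuzzy query: keep the terms within maxEdits of the (possibly misspelled) input.
    static List<String> fuzzy(TreeSet<String> terms, String input, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (editDistance(term, input) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(
                Arrays.asList("apple", "itunes", "webobjects", "webobjective"));
        System.out.println(prefix(terms, "webobj"));       // like webobj*
        System.out.println(fuzzy(terms, "webobjicts", 1)); // like webobjicts~
    }
}
```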
Performing a Search

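Query query = ... // assume exists
// indexDirectory is the FSDirectory opened earlier; 'true' opens it read-only
IndexSearcher searcher = new IndexSearcher(indexDirectory, true);
// collect the 10 highest-scoring hits
TopScoreDocCollector collector = TopScoreDocCollector.create(10, true);
searcher.search(query, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;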
Using a QueryParser

QueryParser queryParser = new QueryParser(Version.LUCENE_29,
                                          "content", analyzer);
Query query = queryParser.parse(queryString);
Demo
Scoring
“The more times a query term appears in a document relative to the number of times the term appears in all the documents in the collection, the more relevant that document is to the query”
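
The idea in the quote is usually called TF-IDF. The sketch below is the textbook version with a common smoothed form of the inverse document frequency; it is a simplification, and Lucene's actual scoring formula includes further factors such as norms and boosts. TfIdf and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Textbook TF-IDF, a simplification of the idea in the quote.
final class TfIdf {

    // Term frequency: how many times the term occurs in one document.
    static int termFrequency(List<String> doc, String term) {
        int count = 0;
        for (String token : doc) {
            if (token.equals(term)) {
                count++;
            }
        }
        return count;
    }

    // Inverse document frequency: terms found in fewer documents weigh more.
    static double inverseDocumentFrequency(List<List<String>> docs, String term) {
        int containing = 0;
        for (List<String> doc : docs) {
            if (doc.contains(term)) {
                containing++;
            }
        }
        return Math.log((double) docs.size() / (1 + containing)) + 1.0;
    }

    static double score(List<List<String>> docs, List<String> doc, String term) {
        return termFrequency(doc, term) * inverseDocumentFrequency(docs, term);
    }

    public static void main(String[] args) {
        List<String> d1 = Arrays.asList("lucene", "lucene", "search");
        List<String> d2 = Arrays.asList("lucene", "index");
        List<String> d3 = Arrays.asList("apple", "itunes");
        List<List<String>> docs = Arrays.asList(d1, d2, d3);
        for (List<String> doc : docs) {
            System.out.println(doc + " -> " + score(docs, doc, "lucene"));
        }
    }
}
```

In this made-up collection the first document mentions "lucene" twice and so scores higher than the second, while the third does not contain the term and scores zero.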
Boost

•   While Indexing

    •   Document

    •   Field

•   While Searching

    •   Query
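
A boost can be thought of as a multiplier on part of the score. The sketch below assumes exactly that and nothing more: each field's raw score is multiplied by its boost, defaulting to 1.0, and the results are summed. BoostSketch, the field names and the raw scores are all hypothetical, and Lucene combines boosts with other scoring factors in ways not shown here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Boost as a per-field multiplier on a raw score (illustrative assumption only).
final class BoostSketch {

    // Combined score: each field's raw score times its boost (1.0 when no boost is set).
    static double combine(Map<String, Double> fieldScores, Map<String, Double> boosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> entry : fieldScores.entrySet()) {
            total += entry.getValue() * boosts.getOrDefault(entry.getKey(), 1.0);
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Double> fieldScores = new LinkedHashMap<>();
        fieldScores.put("title", 1.0);   // hypothetical raw scores
        fieldScores.put("content", 0.5);

        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("title", 2.0);        // a title match counts double

        System.out.println(combine(fieldScores, boosts));
    }
}
```

In the QueryParser syntax a query-time boost is written with a caret, for example title:webobjects^2.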
Luke
Demo
ERIndexing
ERIndexing : Strengths

•   Hides some of the complexity of integrating Lucene with WO

•   Offers lots of utility and helper methods

•   Speaks WebObjects collection classes

•   Simplifies index creation
ERIndexing : Weaknesses


•   Hides some of the complexity of integrating Lucene with WO

•   Not fully baked

•   Auto indexing may be dangerous
Demo
Beyond Lucene


•   Solr

•   Compass

•   ElasticSearch
Q&A
Lucene: http://lucene.apache.org
Luke: http://code.google.com/p/luke/
Solr: http://lucene.apache.org/solr/
Compass: http://www.compass-project.org/overview.html
ElasticSearch: http://www.elasticsearch.com/
