Custom Analyzer in Lucene
Lucene/Solr Meetup
Ganesh.M
http://www.linkedin.com/in/gmurugappan
• Introduction to Analyzer
• Why we require Custom Analyzer
• Use case / Scenario
• Writing custom analyzer
• Know your analyzer
• Analyzer: analyzes the given text and returns tokens, using a Tokenizer and TokenFilters
• Tokenizer: breaks the text into tokens according to the rules of the language
– WhitespaceTokenizer divides text at whitespace
– LetterTokenizer divides text at non-letters
– CJKTokenizer tokenizes Chinese, Japanese and Korean text
• TokenFilter: adds, stems or deletes tokens
– StopFilter removes stop words
– PorterStemFilter reduces each token to its stem
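To make the difference between the first two tokenizers concrete, here is a minimal plain-Java sketch. It does not use any Lucene classes; `TokenizerSketch` and its two methods are hypothetical names that only imitate the splitting rules described above.

```java
import java.util.ArrayList;
import java.util.List;

class TokenizerSketch {
    // Imitates WhitespaceTokenizer: split wherever whitespace occurs.
    static List<String> splitAtWhitespace(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().split("\\s+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    // Imitates LetterTokenizer: a token is a maximal run of letters.
    static List<String> splitAtNonLetters(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(c);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Hello world-2013";
        System.out.println(splitAtWhitespace(text)); // [Hello, world-2013]
        System.out.println(splitAtNonLetters(text)); // [Hello, world]
    }
}
```

The same input yields different tokens, which is why the choice of tokenizer decides what can later be matched at search time.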
• Take the text
“The quick brown fox jumps over lazy dog”
Using StandardAnalyzer with the default stop word set of Lucene 3.x/4.x, it generates the following tokens (lowercased, with the stop word “the” removed):
quick brown
fox jumps
over lazy
dog
Know your analyzer
• It is important to choose the best analyzer for each of your fields.
• If you choose the wrong one, searches may not return the expected results.
• If the results are not what you expect, check your Analyzer and query parser.
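Lucene 3.x: The code below prints the tokens generated by a given analyzer
import java.io.StringReader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// SimpleAnalyzer splits at non-letters and lowercases each token
Analyzer analyzer = new SimpleAnalyzer();
TokenStream ts = analyzer.tokenStream("Field", new StringReader("Hello world-2013 "));
// TermAttribute is the 3.x API; later versions use CharTermAttribute
TermAttribute term = ts.addAttribute(TermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println("token: " + term.term());
}
ts.end();
ts.close();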
The purpose of a Custom Analyzer
• Existing analyzers do not always serve our purpose; sometimes we need to analyze text in a different way.
• A custom analyzer can reuse the existing built-in filters.
• It can also be used for parsing queries.
Use case
• Synonym Injection / Abbreviation Expansion
– Add synonyms at indexing time.
– When parsing a resume, add related content for a keyword. If you find the text “lucene/solr”, you could add “information retrieval” and “search engine”.
– If you are searching medical documents, chat messages, etc., you need to expand abbreviations / codes at indexing time.
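The idea of injecting extra terms at indexing time can be sketched in plain Java, without Lucene. `ExpansionSketch`, `expand` and the sample dictionary entries are hypothetical and only illustrate the use case above. A real Lucene filter would normally also place injected terms at the same position as the original token, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class ExpansionSketch {
    // Keeps every original token and appends its expansion, if one is configured.
    static List<String> expand(List<String> tokens, Map<String, List<String>> expansions) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(token);
            List<String> extra = expansions.get(token.toLowerCase(Locale.ROOT));
            if (extra != null) {
                out.addAll(extra);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: an abbreviation and a related-content entry.
        Map<String, List<String>> expansions = new HashMap<>();
        expansions.put("bp", Arrays.asList("blood", "pressure"));
        expansions.put("lucene", Arrays.asList("information", "retrieval", "search", "engine"));

        System.out.println(expand(Arrays.asList("patient", "bp", "normal"), expansions));
        // [patient, bp, blood, pressure, normal]
    }
}
```

Because the expansion happens while indexing, a query for "blood pressure" can match a document that only ever contained "bp".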
• Strip XML / HTML tags and index only the content
<Address>
<Street>123, MG Road</Street>
<City>Bangalore</City>
<State>Karnataka</State>
</Address>
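As a rough illustration of what "index only the content" means for the address record, the following plain-Java sketch removes anything that looks like a tag. `TagStripSketch` is a hypothetical name and the regular expression is deliberately naive: it ignores entities, comments and CDATA, all of which a real char filter has to handle.

```java
class TagStripSketch {
    // Replaces every <...> run with a space, then collapses repeated whitespace.
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", " ").replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String xml = "<Address><Street>123, MG Road</Street>"
                + "<City>Bangalore</City><State>Karnataka</State></Address>";
        System.out.println(stripTags(xml)); // 123, MG Road Bangalore Karnataka
    }
}
```

Only the remaining text would then be passed on to the tokenizer.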
• Break an email ID / URL into multiple tokens
– Sachin Tendulkar <sachin.tendulkar123@gmail.com>
– Should be analyzed as
  • sachin
  • tendulkar
  • sachin
  • tendulkar123
  • gmail
  • com
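The token list above can be reproduced with a small plain-Java sketch that lowercases the text and splits it at every character that is not an ASCII letter or digit. `EmailTokenSketch` is a hypothetical name, and the splitting rule is an assumption chosen to match the slide's expected output, not the rule of any particular Lucene tokenizer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class EmailTokenSketch {
    // Lowercases the input and splits at runs of non-alphanumeric characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!part.isEmpty()) {
                tokens.add(part);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Sachin Tendulkar <sachin.tendulkar123@gmail.com>"));
        // [sachin, tendulkar, sachin, tendulkar123, gmail, com]
    }
}
```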
HTMLAnalyzer in Lucene 4.5
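import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.util.Version;

public class HTMLAnalyzer extends Analyzer {
    // Strip markup in initReader rather than in createComponents, so the
    // filter still applies when Lucene reuses the components for later text.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        // Split at whitespace, then lowercase every token
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(Version.LUCENE_45, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_45, tokenizer);
        return new TokenStreamComponents(tokenizer, result);
    }
}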
HTMLAnalyzer in Solr
<fieldType name="text_html" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" escapedTags="a, title"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  </analyzer>
</fieldType>
SynonymAnalyzer
• SynonymAnalyzer injects synonyms as part of the indexed content, using Lucene 3.3
• Check out the code:
https://github.com/geekganesh/SynonymAnalyzer
PerFieldAnalyzerWrapper
• IndexWriter / IndexWriterConfig takes only one Analyzer and uses it for all fields.
• When there are multiple fields and each should be indexed with its own analyzer, use PerFieldAnalyzerWrapper.
• PerFieldAnalyzerWrapper assigns a different analyzer per field; it is passed to IndexWriter in place of a single analyzer.
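The dispatch idea behind the wrapper can be modelled in plain Java: one default analysis function plus a map from field name to a field-specific function. This is only a sketch of the concept and does not use Lucene classes; `PerFieldSketch`, `keyword` and `lowerWhitespace` are hypothetical names. In Lucene itself the wrapper is built in a similar way, from a default Analyzer and a map of field names to Analyzers.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class PerFieldSketch {
    private final Function<String, List<String>> defaultAnalyzer;
    private final Map<String, Function<String, List<String>>> perField;

    PerFieldSketch(Function<String, List<String>> defaultAnalyzer,
                   Map<String, Function<String, List<String>>> perField) {
        this.defaultAnalyzer = defaultAnalyzer;
        this.perField = perField;
    }

    // Uses the analyzer registered for the field, else falls back to the default.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, defaultAnalyzer).apply(text);
    }

    // Treats the whole value as one token, e.g. for identifiers.
    static List<String> keyword(String text) {
        return Collections.singletonList(text);
    }

    // Lowercases and splits at whitespace, e.g. for free text.
    static List<String> lowerWhitespace(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    public static void main(String[] args) {
        Map<String, Function<String, List<String>>> perField = new HashMap<>();
        perField.put("id", PerFieldSketch::keyword);
        PerFieldSketch wrapper = new PerFieldSketch(PerFieldSketch::lowerWhitespace, perField);

        System.out.println(wrapper.analyze("id", "DOC-42 A"));      // [DOC-42 A]
        System.out.println(wrapper.analyze("body", "Hello World")); // [hello, world]
    }
}
```

The caller hands over a single object, just as IndexWriter receives a single Analyzer, while each field still gets its own analysis chain.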
