Solr Indexing and Analysis Tricks
@ErikHatcher
Senior Solutions Architect, LucidWorks
Erik Hatcher's Relevant Professional Bio

•  Lucene/Solr committer
•  Apache Software Foundation member
•  Co-founder, Senior Solutions Architect, and Janitor at LucidWorks
•  Creator of Blacklight
•  Co-author of "Ant in Action" and "Lucene in Action"
Abstract

This session will introduce and demonstrate several
techniques for enhancing the search experience by
augmenting documents during indexing. First we'll survey
the analysis components available in Solr, and then we'll
delve into using Solr's update processing pipeline to
modify documents on the way in. The session will build
on Erik's "Poor Man's Entity Extraction" blog post at
http://www.searchhub.org/2013/06/27/poor-mans-entity-extraction-with-solr/
Poor Man’s Entity Extraction

•  acronyms: a searchable/filterable/facetable (but not stored)
field containing all three+ letter CAPS acronyms
•  key_phrases: a searchable/filterable/facetable (but also not
stored) field containing any key phrases matching a
provided list
•  links: a stored field containing http(s) links
•  extracted_locations: lat/long points that are mentioned in
the document content, indexed as geographically savvy
points, stored on the document for easy use in maps or
elsewhere
example_data.txt
The DUB airport is at 53.421389,-6.27
See also:
http://en.wikipedia.org/wiki/Dublin_Airport
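
To make the links field concrete, here is a minimal sketch that pulls http(s) URLs out of the sample text with a plain JavaScript regular expression. The `extractLinks` name and the pattern itself are assumptions for illustration, not the ones used in the talk.

```javascript
// Hypothetical helper: collect every http(s) URL found in a block of text.
function extractLinks(text) {
  // Assumed pattern: the scheme followed by a run of non-whitespace characters.
  var links_regexp = /https?:\/\/\S+/g;
  return text.match(links_regexp) || [];
}

// The sample document from the slide, as one string.
var sample = "The DUB airport is at 53.421389,-6.27\n" +
             "See also:\n" +
             "http://en.wikipedia.org/wiki/Dublin_Airport";

var links = extractLinks(sample);
console.log(links);
```

A pattern this loose would also swallow trailing punctuation such as a closing parenthesis, so real content would likely need a stricter expression.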
End results
Challenges and needs
•  Analyzers and Query Parsers
   –  Analysis != query parsing
      •  Query parsers generally analyze “chunks” of the query expression and
         combine the results in various ways
   –  Synergy, working in conjunction
•  Query parsing
   –  q=Lucene Revolution in Dubhlinn
   –  q="Lucene Revolution"
   –  q=lucene [AND/OR] revolution
   –  On which field(s)? Which query parser?
•  Analysis
   –  +((content:lucen) (content:revolut) (content:dublin)) [from edismax]
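
The split between the two stages can be shown with a toy model: the "parser" breaks the query expression into chunks, hands each chunk to an "analyzer", and combines whatever terms come back. Everything below is a stand-in. The stopword, synonym, and stem tables are made up to mimic the example, and real Solr analysis chains and edismax are far richer than this.

```javascript
// Toy analysis chain standing in for a Solr field type (all tables are made up).
var STOPWORDS = new Map([["in", true]]);
var SYNONYMS = new Map([["dubhlinn", "dublin"]]);
// Lookup table used in place of a real stemmer.
var STEMS = new Map([["lucene", "lucen"], ["revolution", "revolut"]]);

// "Analysis": turn one chunk of text into zero or more index terms.
function analyze(chunk) {
  var term = chunk.toLowerCase();
  if (STOPWORDS.has(term)) return [];                      // stopword removed
  if (SYNONYMS.has(term)) term = SYNONYMS.get(term);       // synonym mapping
  if (STEMS.has(term)) term = STEMS.get(term);             // "stemming"
  return [term];
}

// "Query parsing": split the expression into chunks, analyze each, combine.
function parse(q, field) {
  var clauses = [];
  q.split(/\s+/).forEach(function (chunk) {
    analyze(chunk).forEach(function (term) {
      clauses.push("(" + field + ":" + term + ")");
    });
  });
  return "+(" + clauses.join(" ") + ")";
}

var parsed = parse("Lucene Revolution in Dubhlinn", "content");
console.log(parsed);
```

The parser decides how chunks are combined and which field they target; the analyzer only ever sees one chunk at a time.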
Extracting with copyField
•  copyField content => acronyms
   –  Note that the destination of a copyField generally should not be stored
      (stored="false")

•  "caps" field type
   –  PatternCaptureGroupFilterFactory with pattern="((?:[A-Z]\.?){3,})"

•  "The Dublin airport, DUB, is at…"
   => DUB

•  Results could be suitable for faceting, searching, and boosting, but the results are
   not "stored" values (only indexed terms)
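
The capture pattern can be tried outside Solr. The sketch below applies it as a plain JavaScript regular expression, assuming the dot is meant as a literal, optional period (written `\.`) so that forms like "U.S.A." also qualify. It only exercises the regex; in Solr the filter runs on tokens produced by the field type's tokenizer, so results can differ.

```javascript
// The capture-group pattern from the "caps" field type, as a JavaScript regex:
// three or more capital letters, each optionally followed by a period.
var caps_regexp = /((?:[A-Z]\.?){3,})/g;

// Sample sentence (made up for this sketch, modeled on the slide's example).
var text = "The Dublin airport, DUB, is at 53.421389,-6.27";

// Single capitals such as the "T" in "The" do not reach the three-letter minimum.
var acronyms = text.match(caps_regexp) || [];
console.log(acronyms);

// Dotted acronyms are picked up by the optional period.
var dotted = "Flights to the U.S.A. leave daily".match(caps_regexp) || [];
console.log(dotted);
```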
Extracting with ScriptUpdateProcessor
•  An update processor can manipulate (add, modify, delete) document fields
   –  Field values can be stored
•  update.chain=script
   –  With post.jar:
      •  java -Dauto -Dparams=update.chain=script -jar post.jar
   –  Or make the update chain the default
      •  <updateRequestProcessorChain default="true"…

// basic lat/long pattern matching, e.g. "38.1384683,-78.4527887"
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;
var extracted_locations = getMatches(location_regexp, content);
doc.setField("extracted_locations", extracted_locations);
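
The snippet above relies on a `getMatches` helper and on `doc` and `content` variables that the slide does not show. One way they might be wired together is sketched below. The body of `getMatches` is an assumption, and the small stub object exists only so the sketch runs outside Solr; inside Solr, `cmd.solrDoc` would be a real SolrInputDocument.

```javascript
// Assumed helper: return every match of a global regex in a string.
function getMatches(regexp, value) {
  var matches = [];
  var m;
  regexp.lastIndex = 0; // restart scanning on every call
  while ((m = regexp.exec(value)) !== null) {
    matches.push(m[0]);
  }
  return matches;
}

// basic lat/long pattern matching, e.g. "38.1384683,-78.4527887"
var location_regexp = /(-?\d{1,2}\.\d{2,7},-?\d{1,3}\.\d{2,7})/g;

// Hook that the script update processor calls for each added document.
function processAdd(cmd) {
  var doc = cmd.solrDoc;
  var content = String(doc.getFieldValue("content"));
  doc.setField("extracted_locations", getMatches(location_regexp, content));
}

// Stub standing in for Solr's objects (illustration only, not part of Solr).
var fields = { content: "The DUB airport is at 53.421389,-6.27" };
var stubCmd = {
  solrDoc: {
    getFieldValue: function (name) { return fields[name]; },
    setField: function (name, value) { fields[name] = value; }
  }
};

processAdd(stubCmd);
console.log(fields.extracted_locations);
```

A complete update script also has to define the other hooks (processDelete, processMergeIndexes, processCommit, processRollback, finish), and how a JavaScript array is converted when handed to `setField` depends on the script engine in use.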
Analysis in ScriptUpdateProcessor
var analyzer =
  req.getCore().getLatestSchema()
    .getFieldTypeByName("<field type>")
    .getAnalyzer();
doc.setField("token_ss",
  getAnalyzerResult(analyzer, null, content));
getAnalyzerResult
// Run fieldValue through the analyzer and return the resulting terms as an array.
function getAnalyzerResult(analyzer, fieldName, fieldValue) {
  var result = [];
  var token_stream =
    analyzer.tokenStream(fieldName, new java.io.StringReader(fieldValue));
  // The attribute instance is reused; its value changes on each incrementToken()
  var term_att = token_stream.getAttribute(
    Packages.org.apache.lucene.analysis.tokenattributes.CharTermAttribute);
  token_stream.reset();
  while (token_stream.incrementToken()) {
    result.push(term_att.toString());
  }
  token_stream.end();
  token_stream.close();
  return result;
}
Using analysis externally

•  http://localhost:8983/solr/collection1/analysis/field
–  ?analysis.fieldvalue=Dubhlinn
–  &analysis.fieldtype=just_synonyms
•  => dublin
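
For calling the analysis handler from client code, the request URL can be assembled as in the sketch below. The `analysisUrl` function is hypothetical; the base URL, field type name, and value are the ones from the slide, and `wt=json` is added on the assumption that a JSON response is wanted.

```javascript
// Hypothetical helper: build a field-analysis request URL for a Solr core.
function analysisUrl(base, fieldType, fieldValue) {
  return base + "/analysis/field" +
    "?analysis.fieldvalue=" + encodeURIComponent(fieldValue) +
    "&analysis.fieldtype=" + encodeURIComponent(fieldType) +
    "&wt=json"; // assumption: ask Solr for a JSON response
}

var url = analysisUrl(
  "http://localhost:8983/solr/collection1", "just_synonyms", "Dubhlinn");
console.log(url);
```

Encoding the value matters once the text to analyze contains spaces or punctuation.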