Analyze This!


                              Tom Hill
                              Lucid Imagination
                              Webinar 1/28/2010



    Lucid Imagination, Inc.
Analyze This!




                 Analysis
         Basics, Tips and Tools



                          Lucid Imagination, Inc.




Page 2
                                                    © 2010 Lucid Imagination, Inc.
Overview
         We’ll be covering:
           What is analysis, and why do you care?
           Some common problems with analysis
           Tools for troubleshooting
             Analyzer Tool
             Schema Browser
             Luke
           Existing Analyzers, Filters and Tokenizers




What is Analysis?

         • Converting your text into terms
              Solr does NOT search your text
              Solr searches the set of terms created by analysis
              Problems happen when the terms are not what you think they
              are
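A minimal sketch of the point above, assuming a toy analyzer (not Solr's) that drops apostrophes, lowercases, and splits on anything that is not a letter or digit. The function names are hypothetical. It shows that a lookup only succeeds against the analyzed terms, never the original text.

```python
import re

def analyze(text):
    """Toy analyzer: drop apostrophes, lowercase, split on non-alphanumerics."""
    text = text.replace("’", "").replace("'", "")
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

# The index holds terms, not the original text.
index_terms = set(analyze("Don’t panic"))

def matches(query):
    """A query matches only if every analyzed query term is in the index."""
    terms = analyze(query)
    return bool(terms) and all(t in index_terms for t in terms)

print(sorted(index_terms))   # the terms that can actually be searched
print(matches("DON'T"))      # analyzed the same way, so it matches
print(matches("don"))        # "don" was never produced as a term
```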




Examples

                       Don’t => dont

                       iPhone => i phone
                                   iphon
                       τα πρώτα δείγματα => πρωτα δειγματα
                       The quick brown fox jumps => The quick brown fox jumps



Different Effects of Analysis
                There are many ways to analyze a run of text.
                       Break on whitespace, punctuation, caseChanges, numb3rs
                       Stemming (shoes -> shoe)
                       Removing/replacing unwanted words/symbols
                       Combining words
                       Adding new words (synonyms)
                       And many more


Copy Fields (1)


              It’s common to want to index data more than one way
              You might store an analyzed version of a field for searching
                And store an unanalyzed version for faceting or sorting
              You might store a stemmed and non-stemmed version of a field
                To boost precise matches




Copy Fields (2)


              It’s also common to copy to a common destination field
                For example: “alltext”
              Note this copies from the SOURCE of the copied field
                Not the analyzed version of the copied field
              <copyField source="cat" dest="text"/>
               <copyField source="name" dest="text"/>
               <copyField source="manu" dest="text"/>
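A rough sketch of the copy behaviour described above, assuming two hypothetical analyzers: "cat" keeps its value as one term, the other fields lowercase and split on whitespace. The copy takes the raw source value and then applies the destination field's own analysis.

```python
def analyze_text(value):
    """Hypothetical analyzer for text fields: lowercase and split on whitespace."""
    return value.lower().split()

def analyze_string(value):
    """Hypothetical unanalyzed field: the whole value is one unmodified term."""
    return [value]

ANALYZERS = {"cat": analyze_string, "name": analyze_text,
             "manu": analyze_text, "text": analyze_text}
COPY_FIELDS = [("cat", "text"), ("name", "text"), ("manu", "text")]

def index_document(doc):
    """Return {field: terms}. Copies use the raw source value, then dest's analyzer."""
    raw = {field: [value] for field, value in doc.items()}
    for source, dest in COPY_FIELDS:
        if source in doc:
            raw.setdefault(dest, []).append(doc[source])
    return {field: [t for v in values for t in ANALYZERS[field](v)]
            for field, values in raw.items()}

terms = index_document({"cat": "Hard Drive", "name": "Maxtor DiamondMax"})
print(terms["cat"])   # one unanalyzed term, useful for faceting
print(terms["text"])  # raw values re-analyzed by the destination field
```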

What could go wrong?

         • Lots of things
              You can’t find things
              You find too much
              Poor query or indexing performance




Common Scenario #1

              Someone sets up Solr for the first time
              Adds some data
              Then posts to the mailing list, and says “why can’t I find my
              data?”
              The problem’s basic, but it’s useful to know how to identify it.




“When I Search For ‘fox’…”

“…I Find Nothing”

“But, If I look at the index”

“It’s right there”




Analysis Tool

               Your first stop for figuring out analysis problems




Analysis Tool Demo




Stored vs. Indexed

               Solr can store both analyzed and un-analyzed content
               But you knew that …
                 “stored” vs. “indexed” in the field definition
               How can you see what is actually indexed?
                 …that is, the terms you can search for




Schema Browser
              Schema Browser lets you examine the fields and how they are
              configured.
              It also allows you to examine the terms in the index




Schema Browser Demo




How Many of You Just Copied the Example Schema?

          • Just because it works for one person’s data, doesn’t mean it
            works for yours.
          • Take the time to look at the output




Luke

                 Lucene Index Exploration Tool
                 Allows you to look at (and modify) the contents of an index




Luke Main Screen




Luke Document “Reconstruction”




Close-up from last slide

               solr null_1 enterpris search server
               null_100 apach softwar foundat null_100 softwar null_100 search
                 null_100 advanc
               full fulltext|text search capabl use
               lucen null_100 optim null_1 high …




Position Increment Gap

               The null_100 entries are how Luke represents the position
               increment gap between the values of a multi-valued field.
               (The null_1 entries are single empty positions, most likely
               where a stop word was removed.)
               The example had
               <field name="text">Solr, the Enterprise Search Server</field>
               <field name="text">Apache Software Foundation</field>
               Using a position increment gap prevents phrase queries from
               matching across different values of a field
               Without the gap “Server Apache” would be a valid phrase.
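A minimal sketch of why the gap matters, assuming a toy positional index (lowercase, whitespace split, no stop words or stemming). The function names are hypothetical and this is not how Lucene stores positions internally. It only shows that adding a gap between values keeps "Server Apache" from looking adjacent.

```python
def index_positions(values, gap):
    """Assign a position to each term; add `gap` extra positions between values."""
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap
        for term in value.lower().split():
            positions.setdefault(term, []).append(pos)
            pos += 1
    return positions

def phrase_matches(positions, phrase):
    """True if the (non-empty) phrase's terms occur at consecutive positions."""
    terms = phrase.lower().split()
    return any(
        all(start + i in positions.get(t, []) for i, t in enumerate(terms))
        for start in positions.get(terms[0], [])
    )

values = ["Enterprise Search Server", "Apache Software Foundation"]
print(phrase_matches(index_positions(values, gap=0), "Server Apache"))
print(phrase_matches(index_positions(values, gap=100), "Server Apache"))
print(phrase_matches(index_positions(values, gap=100), "Search Server"))
```

With no gap the first check succeeds across the value boundary; with a gap of 100 only phrases inside a single value match.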

Analysis Can Affect Performance

               Analysis doesn’t just produce success/failure on a search
               It can affect the query processing speed, too.




Slow Searches

              They index 500,000 books
              Multiple languages in one field
                So they can’t do stemming or stop words
              Their worst case query was:
              “The lives and literature of the beat generation”
              It took 2 minutes to run.
              The query requires checking every doc containing “the” & “and”
                And the position info for each occurrence




Bi-grams

              Bi-grams combine adjacent terms
              “The lives and literature” becomes
              “The lives” “lives and” “and literature”
              Only have to check documents that contain the pair adjacent to
              each other.
              Only have to look at position information for the pair
              But can triple the size of the index
                Word indexed by itself
                Indexed both with preceding term, and following term
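The bi-gram idea can be sketched in a few lines, assuming plain whitespace tokenization. The function name is hypothetical. It also shows where the index growth comes from: each word is indexed alone and again inside the pairs it belongs to.

```python
def bigrams(text):
    """Return each pair of adjacent words as a single term."""
    words = text.split()
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

text = "The lives and literature"
pairs = bigrams(text)
print(pairs)

# Indexing the single words as well as the pairs is what inflates the index.
all_terms = text.split() + pairs
print(len(text.split()), len(all_terms))
```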



Common Grams

              Form bi-grams only for common terms
              “The” occurs 2 billion times. “The lives” occurs 360k.
              Used only the 32 most common terms
              Average response went from 460 ms to 68 ms.
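A rough sketch of the common-grams idea, assuming a hypothetical three-word common list (the 32 terms used in the case above are not given) and an underscore as the joining character. Pairs are only formed when at least one of the two words is common, so rare-word pairs add nothing to the index.

```python
COMMON = {"the", "and", "of"}  # hypothetical list, for illustration only

def common_grams(text, common=COMMON):
    """Emit every word, plus a joined pair wherever a common word is involved."""
    words = text.lower().split()
    terms = []
    for i, word in enumerate(words):
        terms.append(word)
        if i + 1 < len(words) and (word in common or words[i + 1] in common):
            terms.append(f"{word}_{words[i + 1]}")
    return terms

print(common_grams("The lives and literature"))
print(common_grams("beat generation"))  # no common words, so no pairs
```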




Implied Phrase Queries

               Another example involved a query with “L’art”
               This turns into a phrase query, “L art”, with the default config.
                 PhraseQuery(text:"l art")
               Turning it into the single token ‘L art’ is much more efficient.
                 Occurs in far fewer documents than “L”
                 Is a term query, not a phrase query.



Multiple Languages

              Generally, we suggest keeping different languages in their own
              fields
              This lets you have an analyzer for each language
                Stemming, stop words, etc.
              If you don’t know the total number of languages, you can use
              dynamic fields.
                That allows you to accept them, but not to dynamically stem, etc.


Analysis And Query Parsing

               What happens when parsing a query in Solr?
                You may have many fields, with different analyzers
                Which Analyzer gets used?




Analysis And Query Parsing

               QueryParser splits the query
                 Understands quotes, parens and whitespace
               Gives the resulting pieces to the correct analyzer
                 Explicit or Default




Analysis And Query Parsing

               To see what happens to your query
                 Use the “Full Interface” section of the admin interface
                   Check ‘debug: enable’
                 Or just add “&debugQuery=on” to the end of your query string
               We’re using the Lucene Query Parser
               Dismax does different things.
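For scripted checks, the debug parameter can be appended programmatically. This is a small sketch; the host, port and /solr/select path are assumptions about a local example install, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

def debug_url(base_url, query):
    """Build a select URL with debugQuery=on appended to the query string."""
    return f"{base_url}?{urlencode({'q': query, 'debugQuery': 'on'})}"

# Assumed local example instance; adjust to your own server.
url = debug_url("http://localhost:8983/solr/select", "title:foo bar")
print(url)
```

The response to such a request includes the parsed form of the query alongside the results.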


Seeing the results of query parsing




Query Examples

              title:foo bar
                Becomes: +title:foo +text:bar
                “foo” goes to the title field analyzer, “bar” to the default field analyzer
              manu:"foo_bar baz"
                Becomes: manu:"foo bar baz"
                Note the _ got removed. The whole string goes to the manu analyzer
                Phrase query
              title:(foo bar)
                Becomes: title:foo title:bar
                foo and bar are passed separately to title’s analyzer

Components of an Analyzer





              CharFilters
              Tokenizers
              TokenFilters




CharFilters

               Used to clean up/regularize characters before passing to
               the Tokenizer
               Remove accents, etc.: MappingCharFilter
               They can also do complex things; we’ll look at
               HTMLStripCharFilter later.




Tokenizers

               Convert text to tokens (terms)
               Only one per analyzer
               Many Options
                 WhitespaceTokenizer
                 StandardTokenizer
                 PatternTokenizer
                 More…

TokenFilters

               Process the tokens produced by the Tokenizer
               Can be many of them per field




Some example TokenFilters that come with Solr/Lucene

               There are way too many to list them all
               We’re just going to go through a few of them




Reversing Filter

               Why?
                 Leading wildcards require traversing the whole index
               Reverse the order, and leading wildcards become trailing
                 *cats => stac*
               Only have to check terms that start with stac, instead of the
               whole index.
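The reversal trick can be sketched with a sorted list standing in for the term dictionary. The function names are hypothetical, and a real index is more involved. A leading wildcard becomes a prefix scan over the reversed terms.

```python
from bisect import bisect_left

def build_reversed_index(terms):
    """Store each term reversed, sorted so that shared prefixes are contiguous."""
    return sorted(term[::-1] for term in terms)

def leading_wildcard(reversed_terms, suffix):
    """Answer *suffix by scanning only reversed terms starting with the reversed suffix."""
    prefix = suffix[::-1]
    i = bisect_left(reversed_terms, prefix)
    hits = []
    while i < len(reversed_terms) and reversed_terms[i].startswith(prefix):
        hits.append(reversed_terms[i][::-1])
        i += 1
    return hits

reversed_terms = build_reversed_index(["cats", "bobcats", "dogs", "cast"])
print(leading_wildcard(reversed_terms, "cats"))  # *cats => stac*
```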



Phonetic Analysis

               Creates a phonetic representation of the text, for “sounds like”
               matching
               PhoneticFilterFactory. Uses one of
                 Metaphone
                 Double Metaphone
                 Soundex
                 Refined Soundex
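To give a feel for what a phonetic encoder produces, here is a simplified American Soundex written from the textbook rules. It is a sketch for illustration, not the encoder implementation Solr ships, and the function name is hypothetical.

```python
def soundex(word):
    """Simplified American Soundex: first letter plus up to three digit codes."""
    codes = {}
    for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                           ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in letters:
            codes[ch] = digit
    letters_only = [c for c in word.upper() if c.isalpha()]
    if not letters_only:
        return ""
    result = letters_only[0]
    prev = codes.get(letters_only[0], "")
    for ch in letters_only[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so a repeated code after a vowel counts
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # names that sound alike share a code
```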

Synonyms

              Synonym filter allows you to include alternate words that the
              user can use when searching
              For example, theater, theatre
                Useful for movie titles, where words are deliberately mis-spelled
              Don’t over-use synonyms
                It helps recall, but lowers precision
              Produces tokens at the same token position
                “local theater company”
                       theatre
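The "same token position" point can be sketched as follows, assuming a hypothetical one-entry synonym map and whitespace tokenization. Because the synonym shares the original word's position, a phrase query works with either spelling.

```python
SYNONYMS = {"theater": ["theatre"]}  # hypothetical synonym map

def with_synonyms(text, synonyms=SYNONYMS):
    """Return (position, term) pairs; synonyms share the original term's position."""
    tokens = []
    for position, word in enumerate(text.lower().split()):
        tokens.append((position, word))
        tokens.extend((position, alt) for alt in synonyms.get(word, []))
    return tokens

for position, term in with_synonyms("local theater company"):
    print(position, term)
```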




HTML text extraction

               Removes HTML tags, attributes, comments
               XML processing instructions
               Removes <script> and <style> contents
               Replaces entities
               HTMLStripCharFilterFactory
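A minimal sketch of the same kind of clean-up using Python's standard html.parser. This is a stand-in to show the effect, not the Solr char filter itself; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, skipping tags, comments and <script>/<style> contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are replaced in text
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

def strip_html(markup):
    """Return only the text content of an HTML fragment."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

print(strip_html("<p>Fish &amp; chips<!-- note --></p><script>var x = 1;</script>"))
```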




Spell Checking

               Spell checker starts by analyzing the source terms into n-grams
               From the Lucene Wiki:




Spell Checking

               You don’t actually have to know that to use the spell checker
               But I think it’s kind of cool
               Use Luke to explore the index generated by the spell checker.




And many more

              Regular expression Tokenizer
              Stemmers for many languages
               Persian, Hindi, Chinese, Japanese, etc.
               Third party/commercial stemmers available, too.
              SnowballPorterFilter




Recap

              If you can’t find it, and you are sure it’s there:
                  It’s likely an analysis problem
              Three main tools for troubleshooting analysis
                  Analysis tool
                  Schema browser
                  Luke
              Look at your index, documents and the output of your analyzers
              periodically.




Additional Resources

               Lucid Imagination Solr Reference Guide
                 LucidImagination.com/downloads
               Lucene in Action Second Edition
                 This isn’t published yet, but you can get the early access version
                 from manning.com/hatcher3
               http://www.hathitrust.org/blog
               Solr wiki on Analysis
                 Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
               Luke - http://code.google.com/p/luke/



Questions

              If we have time, we’ll take some questions




Thanks!
                Tom Hill
          LucidImagination.com





Analyze this! tips and tricks on getting the lucene solr analyzer to index and search your content right

  • 1. Analyze This! Tom Hill Lucid Imagination Webinar 1/28/2010 Lucid Imagination, Inc.
  • 2. Analyze This! Analysis Basics, Tips and Tools Lucid Imagination, Inc. Page 2 © 2010 Lucid Imagination, Inc.
  • 3. Overview. We’ll be covering: What is analysis, and why do you care? Some common problems with analysis. Tools for troubleshooting: Analyzer Tool, Schema Browser, Luke. Existing Analyzers, Filters and Tokenizers.
  • 4. What is Analysis? Converting your text into terms. Solr does NOT search your text. Solr searches the set of terms created by analysis. Problems happen when the terms are not what you think they are.
  • 5. Examples: Don’t => dont; iPhone => i phone, or iphon; τα πρώτα δείγματα => πρωτα δειγματα; The quick brown fox jumps => The quick brown fox jumps.
  • 6. Different Effects of Analysis. There are many ways to analyze a run of text: break on whitespace, punctuation, caseChanges, numb3rs; stemming (shoes -> shoe); removing/replacing unwanted words/symbols; combining words; adding new words (synonyms); and many more.
  • 7. Copy Fields 1. It’s common to want to index data more than one way. You might store an analyzed version of a field for searching, and an unanalyzed version for faceting or sorting. You might store a stemmed and a non-stemmed version of a field, to boost precise matches.
  • 8. Copy Fields 2. It’s also common to copy to a common destination field, for example “alltext”. Note that this copies from the SOURCE of the copied field, not the analyzed version of the copied field: <copyField source="cat" dest="text"/> <copyField source="name" dest="text"/> <copyField source="manu" dest="text"/>
  • 9. What could go wrong? Lots of things: you can’t find things; you find too much; poor query or indexing performance.
  • 10. Common Scenario #1. Someone sets up Solr for the first time, adds some data, then posts to the mailing list and says “why can’t I find my data?” The problem’s basic, but it’s useful to know how to identify it.
  • 11. “When I Search For ‘fox’…”
  • 12. “…I Find Nothing”
  • 13. “But, If I look at the index”
  • 14. “It’s right there”
  • 15. Analysis Tool. Your first stop for figuring out analysis problems.
  • 16. Analysis Tool
  • 17. Analysis Tool Demo
  • 18. Stored vs. Indexed. Solr can store both analyzed and un-analyzed content. But you knew that … “stored” vs. “indexed” in the field definition. How can you see what is actually indexed? …that is, the terms you can search for.
  • 19. Schema Browser. Schema Browser lets you examine the fields and how they are configured. It also allows you to examine the terms in the index.
  • 20. Schema Browser
  • 21. Schema Browser
  • 22. Schema Browser Demo
  • 23. How Many of You Just Copied the Example Schema? Just because it works for one person’s data doesn’t mean it works for yours. Take the time to look at the output.
  • 24. Luke. Lucene Index Exploration Tool. Allows you to look at (and modify) the contents of an index.
  • 25. Luke Main Screen
  • 26. Luke Document “Reconstruction”
  • 27. Luke Document “Reconstruction”
  • 28. Close-up from last slide: solr null_1 enterpris search server null_100 apach softwar foundat null_100 softwar null_100 search null_100 advanc full fulltext|text search capabl use lucen null_100 optim null_1 high …
  • 29. Position Increment Gap. The null_xxx entries are how Luke represents the position increment between instances of multi-valued fields. The example had <field name="text">Solr, the Enterprise Search Server</field> <field name="text">Apache Software Foundation</field>. Using a position increment gap prevents phrase queries from matching across different values of a field. Without the gap, “Server Apache” would be a valid phrase.
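The effect of the gap on phrase matching can be sketched in a few lines of Python. This is a simplified model, not Lucene's implementation: the helper names (`index_positions`, `phrase_matches`, `analyze`) are hypothetical, the analyzer only lowercases and strips commas, and the gap of 100 is taken from the null_100 entries in the close-up.

```python
def analyze(text):
    # Toy analyzer (assumption): split on whitespace, drop commas, lowercase.
    return [word.strip(",").lower() for word in text.split()]

def index_positions(values, gap):
    """Assign a position to every token of a multi-valued field."""
    positions = {}
    pos = 0
    for i, value in enumerate(values):
        if i > 0:
            pos += gap  # extra jump between two values of the same field
        for token in analyze(value):
            positions.setdefault(token, []).append(pos)
            pos += 1
    return positions

def phrase_matches(positions, first, second):
    # A two-word phrase matches when `second` sits right after `first`.
    return any(p + 1 in positions.get(second, [])
               for p in positions.get(first, []))

values = ["Solr, the Enterprise Search Server", "Apache Software Foundation"]
no_gap = index_positions(values, gap=0)
with_gap = index_positions(values, gap=100)

print(phrase_matches(no_gap, "server", "apache"))    # True: matches across values
print(phrase_matches(with_gap, "server", "apache"))  # False: the gap separates them
print(phrase_matches(with_gap, "search", "server"))  # True: same value still matches
```

Phrases inside a single value are unaffected; only matches that span two values are blocked.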
  • 30. Analysis Can Affect Performance. Analysis doesn’t just produce success/failure on a search. It can affect the query processing speed, too.
  • 31. Slow Searches. They index 500,000 books, with multiple languages in one field, so they can’t do stemming or stop words. Their worst-case query was “The lives and literature of the beat generation”. It took 2 minutes to run. The query requires checking every doc containing “the” & “and”, and the position info for each occurrence.
  • 32. Bi-grams. Bi-grams combine adjacent terms: “The lives and literature” becomes “The lives”, “lives and”, “and literature”. Only have to check documents that contain the pair adjacent to each other. Only have to look at position information for the pair. But this can triple the size of the index: each word is indexed by itself, and indexed both with the preceding term and with the following term.
  • 33. Common Grams. Form bi-grams only for common terms. “The” occurs 2 billion times; “The lives” occurs 360k times. Used only the 32 most common terms. Average response went from 460 ms to 68 ms.
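The difference between plain bi-grams and common grams can be sketched as follows. This is an illustration only, loosely modeled on the idea described in the slides rather than on Solr's actual filter; the `COMMON` word list and function names are assumptions made for the example.

```python
# Hypothetical list of common terms; the deck's real list had 32 entries.
COMMON = {"the", "and", "of"}

def bigrams(tokens):
    """Pair every token with its right-hand neighbour."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

def common_grams(tokens, common=COMMON):
    """Keep every single token, and add a bi-gram only where
    at least one side of the pair is a common term."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens):
            nxt = tokens[i + 1]
            if tok in common or nxt in common:
                out.append(f"{tok} {nxt}")
    return out

tokens = "the lives and literature".split()
print(bigrams(tokens))
# ['the lives', 'lives and', 'and literature']
print(common_grams("the beat generation".split()))
# ['the', 'the beat', 'beat', 'generation']
```

Pairs of rare words such as "beat generation" produce no extra term, which is why the index grows far less than with full bi-grams.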
  • 34. Implied Phrase Queries. Another example involved a query with “L’art”. This turns into a phrase query, “L art”, with the default config: PhraseQuery(text:"l art"). Turning it into the single token ‘L art’ is much more efficient: it occurs in far fewer documents than “L”, and it is a term query, not a phrase query.
  • 35. Multiple Languages. Generally, we suggest keeping different languages in their own fields. This lets you have an analyzer for each language (stemming, stop words, etc.). If you don’t know the total number of languages, you can use dynamic fields. That allows you to accept them, but not to dynamically stem, etc.
  • 36. Analysis And Query Parsing. What happens when parsing a query in Solr? You may have many fields, with different analyzers. Which Analyzer gets used?
  • 37. Analysis And Query Parsing. QueryParser splits the query. It understands quotes, parens and whitespace, and gives the resulting pieces to the correct analyzer (explicit or default).
  • 38. Analysis And Query Parsing. To see what happens to your query, use the “Full Interface” section of the admin interface and check ‘debug: enable’, or just add “&debugQuery=on” to the end of your query string. We’re using the Lucene Query Parser; Dismax does different things.
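As a small sketch of the second option, the snippet below builds a query URL with `debugQuery=on` appended. The base URL is an assumption (a local Solr with the default select handler); adjust host, port and path to your own installation.

```python
from urllib.parse import urlencode

# Assumed location of a local Solr select handler; not taken from the slides.
BASE = "http://localhost:8983/solr/select"

def debug_url(query, base=BASE):
    """Return a URL that runs `query` with query-parsing debug output enabled."""
    return base + "?" + urlencode({"q": query, "debugQuery": "on"})

print(debug_url("title:foo bar"))
# http://localhost:8983/solr/select?q=title%3Afoo+bar&debugQuery=on
```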
  • 39. Seeing the results of query parsing
  • 40. Seeing the results of query parsing
  • 41. Query Examples. title:foo bar becomes +title:foo +text:bar (“foo” goes to the title field analyzer, “bar” to the default field analyzer). manu:"foo_bar baz" becomes manu:"foo bar baz" (note that _ got removed; the whole string goes to the manu analyzer, as a phrase query). title:(foo bar) becomes title:foo title:bar (foo and bar are passed separately to title’s analyzer).
  • 42. Components of an Analyzer
  • 43. Components of an Analyzer: CharFilters, Tokenizers, TokenFilters.
  • 44. CharFilters. Used to clean up/regularize characters before passing to the Tokenizer. Remove accents, etc.: MappingCharFilter. They can also do complex things; we’ll look at HTMLStripCharFilter later.
  • 45. Tokenizers. Convert text to tokens (terms). Only one per analyzer. Many options: WhitespaceTokenizer, StandardTokenizer, PatternTokenizer, and more.
  • 46. TokenFilters. Process the tokens produced by the Tokenizer. Can be many of them per field.
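The three stages can be pictured as a simple pipeline: char filters rewrite the raw text, exactly one tokenizer splits it, and any number of token filters rework the tokens. The Python below is only a toy model of that flow; the function names and the stop-word list are made up for illustration and do not correspond to Solr classes.

```python
import unicodedata

def strip_accents(text):
    # Char filter: works on characters, before any tokens exist.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def whitespace_tokenizer(text):
    # Tokenizer: the single step that turns text into tokens.
    return text.split()

def lowercase_filter(tokens):
    # Token filter: transforms each token.
    return [t.lower() for t in tokens]

def stop_filter(tokens, stop_words=frozenset({"the", "a"})):
    # Token filter: removes tokens (hypothetical stop-word list).
    return [t for t in tokens if t not in stop_words]

def analyze(text, char_filters, tokenizer, token_filters):
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

terms = analyze("The Café Society",
                char_filters=[strip_accents],
                tokenizer=whitespace_tokenizer,
                token_filters=[lowercase_filter, stop_filter])
print(terms)  # ['cafe', 'society']
```

Order matters: swapping `lowercase_filter` and `stop_filter` would let "The" slip past the lowercase-only stop list.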
  • 47. Some example TokenFilters that come with Solr/Lucene. There are way too many to list them all. We’re just going to go through a few of them.
  • 48. Reversing Filter. Why? Leading wildcards require traversing the whole index. Reverse the order, and leading wildcards become trailing: *cats => stac*. Only have to check terms that start with stac, instead of the whole index.
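A rough sketch of why reversing helps: once the reversed terms are kept in sorted order, a leading-wildcard query becomes a prefix lookup that can jump straight to the matching range. The term list and function name below are invented for the example; a real index has its own term dictionary.

```python
import bisect

# Hypothetical indexed terms, stored reversed and sorted.
terms = ["cats", "dogs", "bobcats", "catalog", "wildcats"]
reversed_index = sorted(t[::-1] for t in terms)

def leading_wildcard(suffix, index=reversed_index):
    """Answer the query *suffix by scanning only reversed terms
    that start with the reversed suffix."""
    prefix = suffix[::-1]                        # "cats" -> "stac"
    start = bisect.bisect_left(index, prefix)    # jump to the first candidate
    hits = []
    for rev in index[start:]:
        if not rev.startswith(prefix):
            break                                # sorted order: nothing further can match
        hits.append(rev[::-1])
    return hits

print(leading_wildcard("cats"))  # ['cats', 'bobcats', 'wildcats']
```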
  • 49. Phonetic Analysis. Creates a phonetic representation of the text, for “sounds like” matching. PhoneticFilterFactory uses one of: Metaphone, Double Metaphone, Soundex, Refined Soundex.
  • 50. Synonyms. Synonym filter allows you to include alternate words that the user can use when searching, for example theater, theatre. Useful for movie titles, where words are deliberately misspelled. Don’t over-use synonyms: it helps recall, but lowers precision. Produces tokens at the same token position: in “local theater company”, “theatre” is added at the same position as “theater”.
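What "same token position" means can be shown with a toy filter that emits (position, token) pairs. The synonym table and function name are assumptions for the example, not Solr's synonym file format or API.

```python
# Hypothetical synonym table.
SYNONYMS = {"theater": ["theatre"]}

def synonym_filter(tokens, synonyms=SYNONYMS):
    """Emit each token with its position, plus any synonyms at that same position."""
    out = []
    for pos, tok in enumerate(tokens):
        out.append((pos, tok))
        for alt in synonyms.get(tok, []):
            out.append((pos, alt))  # stacked on the original token's position
    return out

print(synonym_filter("local theater company".split()))
# [(0, 'local'), (1, 'theater'), (1, 'theatre'), (2, 'company')]
```

Because "theatre" shares position 1 with "theater", the phrase "local theatre company" lines up with the indexed positions just as the original wording does.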
  • 51. HTML text extraction. Removes HTML tags, attributes, comments and XML processing directives. Removes <script> and <style> contents. Replaces entities. HTMLStripCharFilterFactory.
  • 52. Spell Checking. Spell checker starts by analyzing the source terms into n-grams. From the Lucene Wiki:
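To give a feel for the n-gram idea, here is a minimal Python sketch that splits words into character n-grams and picks the dictionary word sharing the most grams with a misspelling. It is a simplification for illustration: the scoring (set overlap), the word list and the function names are assumptions, and the real spell checker is more involved.

```python
def char_ngrams(word, n):
    """All contiguous character n-grams of a word."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def ngram_overlap(a, b, n=2):
    # Share of n-grams the two words have in common (Jaccard overlap).
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Hypothetical dictionary of correctly spelled source terms.
dictionary = ["lucene", "lucid", "luke", "solr"]

def suggest(misspelled, words=dictionary, n=2):
    """Return the dictionary word with the highest n-gram overlap."""
    return max(words, key=lambda w: ngram_overlap(misspelled, w, n))

print(char_ngrams("lucene", 3))  # ['luc', 'uce', 'cen', 'ene']
print(suggest("lucen"))          # lucene
```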
  • 53. Spell Checking. You don’t actually have to know that to use the spell checker, but I think it’s kind of cool. Use Luke to explore the index generated by the spell checker.
  • 54. And many more. Regular expression Tokenizer. Stemmers for many languages: Persian, Hindi, Chinese, Japanese, etc. Third party/commercial stemmers available, too. SnowballPorterFilter.
  • 55. Recap. If you can’t find it, and you are sure it’s there, it’s likely an analysis problem. Three main tools for troubleshooting analysis: Analysis tool, Schema browser, Luke. Look at your index, documents and the output of your analyzers periodically.
  • 56. Additional Resources. Lucid Imagination Solr Reference Guide: LucidImagination.com/downloads. Lucene in Action, Second Edition: this isn’t published yet, but you can get the early access version from manning.com/hatcher3. http://www.hathitrust.org/blog. Solr wiki on Analysis: Wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. Luke: http://code.google.com/p/luke/
  • 57. Questions. If we have time, we’ll take some questions.
  • 58. Thanks! Tom Hill, LucidImagination.com