Finite-State Queries in Lucene
             Robert Muir
          rmuir@apache.org
Agenda
• Introduction to Lucene
• Improving inexact matching:
  – Background
  – Regular Expression, Wildcard, Fuzzy Queries
• Additional use cases:
  – Language support: expansion versus
    stemming
  – Improved spellchecking
• Other ongoing developments in Lucene
Introduction to Lucene
• Open Source Search Engine Library
  – Not just Java, ported to other languages too.
  – Commercial support via several companies
• Just the library
  – Embed for your own uses, e.g. Eclipse
  – For a search server, see Solr
  – For web search + crawler, see Nutch
• Website: http://lucene.apache.org
Inverted Index
• Like a TreeMap<Terms, Documents>
• Before 2.9, queries only operated on one
  “subtree”.
  – For example, Regex and Fuzzy exhaustively
    evaluate all terms, unless you give them a
    “constant prefix”.
• TermQuery is just a special case: it looks at
  one leaf node.
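
The TreeMap analogy above can be made concrete with a minimal sketch. It is a toy model, not Lucene's index format: `ToyInvertedIndex` and its method names are hypothetical, and the prefix range assumes terms never contain the character U+FFFF.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.function.Predicate;

// Toy inverted index (hypothetical): a sorted map from term to document ids.
final class ToyInvertedIndex {
    private final TreeMap<String, List<Integer>> postings = new TreeMap<>();

    void add(int docId, String... terms) {
        for (String term : terms) {
            postings.computeIfAbsent(term, t -> new ArrayList<>()).add(docId);
        }
    }

    // TermQuery analogue: one lookup, i.e. one leaf.
    List<Integer> termQuery(String term) {
        List<Integer> docs = postings.get(term);
        return docs != null ? docs : Collections.<Integer>emptyList();
    }

    // Constant-prefix analogue: one contiguous "subtree" of the sorted term space.
    // Assumes no term contains U+FFFF, so every term with this prefix sorts below the upper bound.
    SortedMap<String, List<Integer>> prefixRange(String prefix) {
        return postings.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    // No constant prefix: the matcher has to be tested against every term.
    List<String> scan(Predicate<String> matcher) {
        List<String> hits = new ArrayList<>();
        for (String term : postings.keySet()) {
            if (matcher.test(term)) hits.add(term);
        }
        return hits;
    }
}
```

A lookup touches one entry, a prefix range touches one contiguous slice, and `scan` touches everything, which is the cost the later slides set out to avoid.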
Lucene 2.9: Fast Numeric Ranges
• Indexes at different levels of precision.
• Enumerates multiple subtrees.
  – But typically this is a small number: e.g. 15
• Query APIs improved to support this.

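[Figure: trie of indexed numeric terms at three levels of precision.
  Level 1: 4, 5, 6
  Level 2: 42, 44, 52, 63, 64
  Level 3: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]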
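
To illustrate how a range can be answered by enumerating a few subtrees, here is a small sketch that splits an inclusive range into aligned decimal blocks, following the decimal digits of the figure. The class `DecimalRangeCover` and the "x" notation are assumptions made for illustration; Lucene's own numeric fields are understood to work on bit-level precision steps rather than decimal digits, so treat this as an analogy only.

```java
import java.util.ArrayList;
import java.util.List;

final class DecimalRangeCover {
    // Splits the inclusive range [lo, hi] into aligned decimal blocks.
    // A block is its leading digits plus one 'x' per unspecified digit,
    // e.g. "5xx" stands for 500..599. Assumes 0 <= lo and hi < 10^digits.
    static List<String> cover(int lo, int hi, int digits) {
        List<String> blocks = new ArrayList<>();
        long next = lo;
        while (next <= hi) {
            long size = 1;
            int wild = 0;
            // Grow the block while it stays aligned and inside the range.
            while (wild < digits && next % (size * 10) == 0 && next + size * 10 - 1 <= hi) {
                size *= 10;
                wild++;
            }
            StringBuilder block = new StringBuilder(Long.toString(next));
            while (block.length() < digits) block.insert(0, '0'); // left-pad with zeros
            block.setLength(digits - wild);                       // keep the fixed leading digits
            for (int i = 0; i < wild; i++) block.append('x');
            blocks.add(block.toString());
            next += size;
        }
        return blocks;
    }
}
```

For the range 423..642 with three digits, this yields 22 blocks (423 to 429, 43x to 49x, 5xx, 60x to 63x, 640 to 642) instead of 220 individual values. Each block corresponds to one term indexed at a coarser precision.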
Automaton Queries
• Only explore subtrees that can lead to an
  accept state of some finite state machine.
• AutomatonQuery traverses the term
  dictionary and the state machine in parallel
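
The parallel traversal can be sketched with a character trie standing in for the term dictionary and a hand-built DFA. Everything here (`TrieDfaWalk`, `Node`, `Dfa`) is hypothetical and far simpler than Lucene's term dictionary and automaton classes. The point is only that a subtree is skipped as soon as the DFA has no transition for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class TrieDfaWalk {
    // Term dictionary as a character trie; isTerm marks the end of an indexed term.
    static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    // Tiny DFA: explicit transitions per state plus an optional "any character" fallback.
    static final class Dfa {
        static final int DEAD = -1;
        final List<Map<Character, Integer>> byChar = new ArrayList<>();
        final List<Integer> anyChar = new ArrayList<>();
        final List<Boolean> accepting = new ArrayList<>();

        int addState(boolean accept) {
            byChar.add(new HashMap<>());
            anyChar.add(DEAD);
            accepting.add(accept);
            return byChar.size() - 1;
        }
        void on(int from, char c, int to) { byChar.get(from).put(c, to); }
        void onAny(int from, int to) { anyChar.set(from, to); }
        int step(int state, char c) {
            Integer to = byChar.get(state).get(c);
            return to != null ? to : anyChar.get(state);
        }
    }

    static Node buildTrie(String... terms) {
        Node root = new Node();
        for (String term : terms) {
            Node node = root;
            for (char c : term.toCharArray()) {
                node = node.children.computeIfAbsent(c, k -> new Node());
            }
            node.isTerm = true;
        }
        return root;
    }

    // Walks trie and DFA in lockstep; returns matching terms in sorted order.
    // visited[0] counts the trie nodes entered, to show how much was pruned.
    static List<String> intersect(Node root, Dfa dfa, int start, int[] visited) {
        List<String> out = new ArrayList<>();
        walk(root, dfa, start, new StringBuilder(), out, visited);
        return out;
    }

    private static void walk(Node node, Dfa dfa, int state, StringBuilder prefix,
                             List<String> out, int[] visited) {
        visited[0]++;
        if (node.isTerm && dfa.accepting.get(state)) out.add(prefix.toString());
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            int next = dfa.step(state, e.getKey());
            if (next == Dfa.DEAD) continue; // the DFA has no transition: skip this subtree
            prefix.append(e.getKey());
            walk(e.getValue(), dfa, next, prefix, out, visited);
            prefix.setLength(prefix.length() - 1);
        }
    }
}
```

With a DFA for the wildcard `?oo?ar` and the terms foo, foobar, foobaz, goocar and zebra, the walk returns foobar and goocar, abandons `zebra` after its first letter, and never enters the final node of `foobaz`.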
Another way to think of it
• Index as a state machine that recognizes Terms
  and transduces matching Documents.
• AutomatonQuery represents a user’s search
  need as a FSM.
• The intersection of the two emits search results.
Query API improvements
• Automata might need to do many seeks
  around the term dictionary.
  – Depends on what is in term dictionary
  – Depends on state machine structure
• MultiTermQuery API further improved
  – Easier and more efficient to skip around.
  – Explicitly supports seeking.
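
Why seeking matters can be shown with a sorted set standing in for the term dictionary. This sketch is not the MultiTermQuery API: `SeekingEnumeration` is hypothetical, and the seek targets are passed in by hand as prefixes, whereas an automaton-driven enumerator would derive them from the state machine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.function.Predicate;

final class SeekingEnumeration {
    // Outcome of one enumeration.
    static final class Result {
        final List<String> matches = new ArrayList<>();
        int examined; // terms actually tested against the predicate
    }

    // Exhaustive strategy: test every term in the dictionary.
    static Result scanAll(NavigableSet<String> terms, Predicate<String> accepts) {
        Result r = new Result();
        for (String term : terms) {
            r.examined++;
            if (accepts.test(term)) r.matches.add(term);
        }
        return r;
    }

    // Seeking strategy: jump to each prefix the pattern can start with, read terms
    // only while they still carry that prefix, then seek to the next prefix.
    static Result seekByPrefixes(NavigableSet<String> terms, List<String> sortedPrefixes,
                                 Predicate<String> accepts) {
        Result r = new Result();
        for (String prefix : sortedPrefixes) {
            for (String term : terms.tailSet(prefix, true)) { // seek
                if (!term.startsWith(prefix)) break;            // left the subtree
                r.examined++;
                if (accepts.test(term)) r.matches.add(term);
            }
        }
        return r;
    }
}
```

How many seeks are needed depends on both inputs, as the slide notes: more alternatives in the pattern mean more seek targets, and a denser dictionary means more terms under each target.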
Regex, Wildcard, Fuzzy
• Without constant prefix, exhaustive
  – Regex: (http|ftp)://foo.com
  – Wildcard: ?oo?ar
  – Fuzzy: foobar~
• Re-implemented as automata queries
  – Just parsers that produce a DFA
  – Improved performance and scalability
  – (http|ftp)://foo.com examines 2 terms.
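
As an illustration of "just parsers that produce an automaton", the sketch below treats a wildcard pattern as a state machine in which state i means "the first i pattern characters are matched". It tracks the set of active states, which amounts to determinizing on the fly rather than building a DFA up front. `WildcardAutomaton` is a hypothetical class, not Lucene's wildcard parser, and it assumes `?` matches exactly one character and `*` matches any run of characters.

```java
import java.util.BitSet;

final class WildcardAutomaton {
    private final String pattern;

    WildcardAutomaton(String pattern) {
        this.pattern = pattern;
    }

    // A '*' at position i may match nothing, so state i also activates state i + 1.
    private void closure(BitSet states) {
        for (int i = states.nextSetBit(0); i >= 0 && i < pattern.length();
             i = states.nextSetBit(i + 1)) {
            if (pattern.charAt(i) == '*') states.set(i + 1);
        }
    }

    boolean matches(String term) {
        BitSet current = new BitSet(pattern.length() + 1);
        current.set(0);
        closure(current);
        for (int k = 0; k < term.length() && !current.isEmpty(); k++) {
            char c = term.charAt(k);
            BitSet next = new BitSet(pattern.length() + 1);
            for (int i = current.nextSetBit(0); i >= 0 && i < pattern.length();
                 i = current.nextSetBit(i + 1)) {
                char p = pattern.charAt(i);
                if (p == '*') {
                    next.set(i);         // '*' consumes c and stays in place
                } else if (p == '?' || p == c) {
                    next.set(i + 1);     // single-character step
                }
            }
            closure(next);
            current = next;
        }
        // Accept only if the state "whole pattern matched" is active at the end.
        return current.get(pattern.length());
    }
}
```

For example, `new WildcardAutomaton("?oo?ar").matches("foobar")` is true, while the same pattern rejects `foobaz`. Once the pattern is a state machine, it can be driven against the term dictionary instead of being tested term by term.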
Additional/Advanced Use Cases
Stemming
• Stemmers work at index and query time
  – walked, walking -> walk
  – Can increase retrieval effectiveness
• Some problems
  – Mistakes: international -> intern
  – Must determine language of documents
  – Multilingual cases can get messy
  – Tuning is difficult: must re-index
  – Unfriendly: wildcards on stemmed terms…
Expansion instead
• Don’t remove data at index time
  – Expand the query instead.
  – Single field now works well for all queries:
    exact match, wildcard, expanded, etc.
• Simplifies search configuration
  – Tuning relevance is easier, no re-indexing.
  – No need to worry about language ID for docs.
  – Multilingual case is much simpler.
Automata expansion
• Natural fit for morphology
• Use set operations
  – Minus to subtract exact match case
  – Union to search multiple languages
• Efficient operation
  – Doesn’t explode for languages with complex
    morphology
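
The set operations can be sketched with term predicates built from `java.util.regex`. This is a stand-in: the talk combines automata before touching the index, whereas this version only tests membership term by term. `ExpansionSets` is hypothetical, and the suffix lists in the usage note are made up for illustration, not the rule set used in the talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

final class ExpansionSets {
    // Language of "term, optionally followed by one of the given suffixes".
    static Predicate<String> withSuffixes(String term, String... suffixes) {
        StringBuilder regex = new StringBuilder(Pattern.quote(term));
        if (suffixes.length > 0) {
            regex.append("(?:");
            for (int i = 0; i < suffixes.length; i++) {
                if (i > 0) regex.append('|');
                regex.append(Pattern.quote(suffixes[i]));
            }
            regex.append(")?");
        }
        Pattern p = Pattern.compile(regex.toString());
        return s -> p.matcher(s).matches();
    }

    static Predicate<String> exact(String term) {
        return term::equals;
    }

    // Union: accept a term if either expansion accepts it (e.g. two languages).
    static Predicate<String> union(Predicate<String> a, Predicate<String> b) {
        return a.or(b);
    }

    // Minus: accept what a accepts except what b accepts (e.g. drop the exact match).
    static Predicate<String> minus(Predicate<String> a, Predicate<String> b) {
        return a.and(b.negate());
    }

    static List<String> select(Iterable<String> terms, Predicate<String> language) {
        List<String> hits = new ArrayList<>();
        for (String term : terms) {
            if (language.test(term)) hits.add(term);
        }
        return hits;
    }
}
```

With `withSuffixes("walk", "s", "ed", "ing")` as the expansion, `minus(expansion, exact("walk"))` keeps walks, walked and walking but drops walk itself, for instance to weight inflected matches differently from the exact match.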
Experimental results
• 125k docs English test collection
   • Results are for TD queries
• Inverted the “S-Stemmer”
   • 6 declarative rewrite rules to regex
• Competitive with traditional stemming.
                 No Stemming   Porter    S-Stem    Automaton S-Stem
      MAP          0.4575      0.5069    0.5029        0.4979
      MRR          0.8070      0.7862    0.7587        0.8220
      # Terms     336,675     280,061   305,710       336,675
TODO:
• Support expansion models in Lucene, too.
• Language-specific resources
  – lucene-hunspell could provide these
• Language-independent tokenization
  – Unicode rules go a long way.
• Scoring that doesn’t need stopwords
  – For now, use CommonGrams!
Spellchecking




• Lucene spellchecker builds a separate
  index to find correction candidates
• Perhaps our fuzzy enumeration is now fast
  enough for small edit distances (e.g. 1,2)
  to just use the index directly.
• Could simplify configurations, especially
  distributed ones.
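
Finding correction candidates directly in the term dictionary can be sketched as follows. The talk refers to fuzzy enumeration driven by an automaton. This sketch swaps in a simpler technique, carrying one row of the Levenshtein dynamic-programming table down a character trie, and the class `FuzzyTrie` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FuzzyTrie {
    private static final class Node {
        final TreeMap<Character, Node> children = new TreeMap<>();
        boolean isTerm;
    }

    private final Node root = new Node();

    void add(String term) {
        Node node = root;
        for (char c : term.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.isTerm = true;
    }

    // All indexed terms within maxEdits (Levenshtein distance) of query, in sorted order.
    List<String> candidates(String query, int maxEdits) {
        int[] row = new int[query.length() + 1];
        for (int i = 0; i < row.length; i++) row[i] = i; // distance from the empty prefix
        List<String> out = new ArrayList<>();
        walk(root, row, query, maxEdits, new StringBuilder(), out);
        return out;
    }

    // row[i] = edit distance between the current trie prefix and query[0..i).
    private void walk(Node node, int[] row, String query, int maxEdits,
                      StringBuilder prefix, List<String> out) {
        if (node.isTerm && row[query.length()] <= maxEdits) out.add(prefix.toString());
        int best = row[0];
        for (int v : row) best = Math.min(best, v);
        if (best > maxEdits) return; // no extension of this prefix can get back within range
        for (Map.Entry<Character, Node> e : node.children.entrySet()) {
            char c = e.getKey();
            int[] next = new int[row.length];
            next[0] = row[0] + 1;
            for (int i = 1; i < next.length; i++) {
                int substitute = row[i - 1] + (query.charAt(i - 1) == c ? 0 : 1);
                next[i] = Math.min(substitute, Math.min(next[i - 1] + 1, row[i] + 1));
            }
            prefix.append(c);
            walk(e.getValue(), next, query, maxEdits, prefix, out);
            prefix.setLength(prefix.length() - 1);
        }
    }
}
```

The pruning step is what keeps small edit distances cheap: most of the dictionary is abandoned after a few characters, so no separate candidate index is needed in this toy setting.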
Ongoing developments in Lucene
Community
• Merging Lucene and Solr development
  – Still two separate released “products”!!!
  – Share mailing list and code repository
  – Solr dev code in sync with Lucene dev code
• Benefits to both Lucene and Solr users
  – Lucene features exposed to Solr faster
  – Solr features available to Lucene users
Indexing
• Flexible Indexing
  – Customize the format of the index
  – Decreased RAM usage
  – Faster IndexReader open and faster seeking
• Future
  – Serialize custom attributes to the index
  – More RAM savings
  – Improved index compression
  – Faster Near-Real-Time performance
Text Analysis
• Improved Unicode Support
  – Unicode 4/Supplementary support
  – ICU integration (to support Unicode 5.2)
• Improved Language Support
  – More languages
  – Faster indexing performance
  – Easier packaging and ease-of-use
• Improvements to Analysis API
Relevance and Scoring
• Improved relevance benchmarking
  – Open Relevance Project
  – Quality Benchmarking Package
• Future: More flexible scoring
  – How to support BM25, DFR, more vector
    space?
  – Some packages/patches are available, but additional
    index statistics are needed to keep this simple
Other
• Improvements to Spatial support
  – See Chris Male’s talk here:
    http://vimeo.com/10204365
  – Progress in both Lucene and Solr
• Steps towards a more modular
  architecture
  – Smaller Lucene core
  – Separate modules with more functionality
Questions?
Backup slides