Nested Documents in Lucene
High-performance support for parent/child document relations
mark@searcharea.co.uk

Problem
The Lucene data model is based on Documents, Fields and Terms. However, many real-world data structures cannot be properly represented when collapsed into a single Lucene document.
[Figure: a data structure collapsed into a single Lucene document]

Problem: “Cross-matching”
When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused, e.g. more than one qualification in a resume.
Source data: John; A1 in Maths; E1 in Science
Single Lucene document:
  Name: John
  Grade: A1, E1
  Subject: Maths, Science
False match for query: Grade:A1 AND Subject:Science

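The false match can be reproduced with a toy model, sketched here in Python rather than Lucene code. It assumes each field is simply a set of terms; `flat_doc` and `matches` are illustrative names, not part of any Lucene API.

```python
# Toy model of a single flattened document: each field is just a set of terms.
# Illustration only, not the Lucene API.
flat_doc = {
    "Name": {"John"},
    "Grade": {"A1", "E1"},             # A1 belongs to Maths, E1 to Science
    "Subject": {"Maths", "Science"},   # but that pairing is lost here
}

def matches(doc, **required):
    """AND query: every named field must contain the required term."""
    return all(term in doc.get(field, set()) for field, term in required.items())

# John never got an A1 in Science, yet the flattened document matches.
false_match = matches(flat_doc, Grade="A1", Subject="Science")
print(false_match)  # True
```

Once the grades and subjects share one document, nothing records which grade went with which subject, so no query on these two fields can tell the pairs apart.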
Unacceptable solution #1
One modeling approach is to store related items in the same field and use proximity operators in queries.
Source data: John; A1 in Maths; E1 in Science
Single Lucene document:
  Name: John
  GradeAndSubject: A1 Maths …. E1 Science
Example query: GradeAndSubject:"A1 Science"~2
Problems:
  • Slow
  • Not scalable with number of fields
  • Loss of fieldnames as context in query
  • Proximity distances must grow
  • Only one choice of Analyzer for given field

Unacceptable solution #2
Use numbered fieldnames to group related structures.
Source data: John; A1 in Maths; E1 in Science
Single Lucene document:
  Name: John
  Grade1: A1
  Subject1: Maths
  Grade2: E1
  Subject2: Science
Example query: (Grade1:A1 AND Subject1:Science) OR (Grade2:A1 AND Subject2:Science) …
Problems:
  • Slow
  • Not scalable with number of nested structures
  • More numbered fieldnames = more query complexity and more unique tokens in index

Solution: Nested documents
The existing Lucene codebase can be used to simply store multiple “nested” documents to represent arbitrarily complex structures. Related documents are just added in sequence.
Source data: John; A1 in Maths; E1 in Science
Documents, in index order:
  Name: John
  Grade: A1, Subject: Maths
  Grade: E1, Subject: Science
But how to query?

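A minimal sketch of "added in sequence", again as a toy Python model rather than Lucene code. It assumes the parent is written first and its children follow immediately, and that the parent carries a `docType` marker as the later slides show. The function name `to_block` and the second resume ("Mary") are made up for illustration.

```python
def to_block(resume):
    """Flatten one resume into a parent document followed by its children."""
    parent = {"docType": "resume", "Name": resume["name"]}
    children = [
        {"Grade": q["grade"], "Subject": q["subject"]}
        for q in resume["qualifications"]
    ]
    return [parent] + children

resumes = [
    {"name": "John", "qualifications": [
        {"grade": "A1", "subject": "Maths"},
        {"grade": "E1", "subject": "Science"},
    ]},
    # Hypothetical second resume, added only to show block boundaries.
    {"name": "Mary", "qualifications": [
        {"grade": "A1", "subject": "Science"},
    ]},
]

index = []  # the list position stands in for the Lucene document number
for resume in resumes:
    index.extend(to_block(resume))

for doc_id, doc in enumerate(index):
    print(doc_id, doc)
```

Each grade now sits in the same small document as its own subject, so the pairing that the flattened model lost is kept.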
Solution: Nested Document Queries
Nested documents need to be queried using a new NestedDocumentQuery class which understands document relationships.
Documents, in index order:
  Name: John, docType: resume
  Grade: A1, Subject: Maths
  Grade: E1, Subject: Science
The new NestedDocumentQuery:
  • Executes child search using any arbitrary Lucene Query object, e.g. Boolean, fuzzy, numeric etc.
  • Reports any matches as a match on the parent document, not the child
  • Super-fast evaluation of joins between child and parent
  • Requires an indexed field to identify parent documents

Solution: Example Query
Find resume of person called “John” with A1 grade in Maths.
Documents, in index order:
  Name: John, docType: resume
  Grade: A1, Subject: Maths
  Grade: E1, Subject: Science
The NestedDocumentQuery wrapper simply translates the stream of reported matches from the child-level query criteria into matches on the parent for evaluation of all the parent-level logic.

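The translation from child matches to parent matches can be sketched as follows. This is a toy Python model under stated assumptions, not the proposed NestedDocumentQuery class: parents are marked with `docType: resume`, every block starts with its parent, and the helpers `is_parent`, `nested_query` and `find_resumes` plus the "Mary" resume are hypothetical.

```python
index = [
    {"docType": "resume", "Name": "John"},   # doc 0: parent
    {"Grade": "A1", "Subject": "Maths"},     # doc 1: child of doc 0
    {"Grade": "E1", "Subject": "Science"},   # doc 2: child of doc 0
    {"docType": "resume", "Name": "Mary"},   # doc 3: parent (hypothetical)
    {"Grade": "A1", "Subject": "Science"},   # doc 4: child of doc 3
]

def is_parent(doc):
    return doc.get("docType") == "resume"

def nested_query(index, child_predicate):
    """Run the child criteria, then report each hit against its parent."""
    parents = set()
    for doc_id, doc in enumerate(index):
        if is_parent(doc) or not child_predicate(doc):
            continue
        parent_id = doc_id
        while not is_parent(index[parent_id]):  # nearest preceding parent
            parent_id -= 1
        parents.add(parent_id)
    return parents

def find_resumes(index, name, grade, subject):
    """Parent-level logic (Name) combined with child-level logic (Grade, Subject)."""
    by_child = nested_query(
        index, lambda d: d["Grade"] == grade and d["Subject"] == subject
    )
    return sorted(p for p in by_child if index[p]["Name"] == name)

print(find_resumes(index, "John", "A1", "Maths"))    # [0]
print(find_resumes(index, "John", "A1", "Science"))  # []
```

The second call shows the cross-matching problem gone: the only A1 in Science belongs to a different parent, so John's resume is not returned.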
Solution: Join speed
Unlike in a database, a join (child to parent) is blisteringly fast:
  1) Match reported on document #356,675
  2) Index directly into cached BitSet at position #356,675
  3) Find first prior set bit, e.g. position #356,670
  4) Attribute match to doc #356,670
[Figure labels: ParentQuery, NestedDocumentQuery, ChildQuery; parent BitSet 100000100000000100000001000000010000001000010000000001000000100000100001]
The BitSet for defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index).

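The four steps amount to a "previous set bit" lookup. The sketch below models the cached BitSet with a plain Python integer, which is an assumption made for illustration only. In Java the same lookup is available as `java.util.BitSet.previousSetBit`. The neighbouring parent positions 356,660 and 356,680 are invented, and `build_parent_bits` and `parent_of` are hypothetical helper names.

```python
def build_parent_bits(parent_doc_ids):
    """One bit per document; a set bit marks a parent document."""
    bits = 0
    for doc_id in parent_doc_ids:
        bits |= 1 << doc_id
    return bits

def parent_of(parent_bits, child_doc_id):
    """Return the nearest set bit at or before child_doc_id, or -1 if none."""
    # Keep only bits 0..child_doc_id; the highest remaining bit is the parent.
    masked = parent_bits & ((1 << (child_doc_id + 1)) - 1)
    return masked.bit_length() - 1

# 356,670 comes from the slide; the other two positions are made up.
parent_bits = build_parent_bits([356_660, 356_670, 356_680])

print(parent_of(parent_bits, 356_675))  # 356670
```

No per-child pointer to the parent is stored anywhere: the relationship is implied entirely by document order and the one-bit-per-document parent mask.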
Other advantages
  • Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website)
  • Nesting levels can be arbitrarily deep
  • Very powerful multi-child queries possible, e.g. find people likely to know person X using resume’s employment histories (multiple employer names/urls and related date-ranges)
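As a rough illustration of the first point, the sketch below caps the number of child hits kept per parent. It is a toy Python model, not Lucene code, and it assumes the hits arrive best-ranked first as (parent, child) pairs. The function `limit_per_parent` and the site and page names are hypothetical.

```python
from collections import defaultdict

def limit_per_parent(hits, max_per_parent):
    """Keep at most max_per_parent child hits per parent, preserving rank order.

    hits: iterable of (parent_id, child_id) pairs, best-ranked first.
    """
    kept = []
    seen = defaultdict(int)  # parent_id -> number of hits already kept
    for parent_id, child_id in hits:
        if seen[parent_id] < max_per_parent:
            seen[parent_id] += 1
            kept.append((parent_id, child_id))
    return kept

hits = [
    ("siteA", "page1"),
    ("siteA", "page2"),
    ("siteB", "page9"),
    ("siteA", "page3"),  # third hit from siteA, dropped when the cap is 2
]
print(limit_per_parent(hits, 2))
# [('siteA', 'page1'), ('siteA', 'page2'), ('siteB', 'page9')]
```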


Proposal for nested document support in Lucene

“Lucene is not a database”, but…..
Structure matters
Many data sources are a mix of structured and unstructured content (e.g. microformats). This is unlikely to change. Lucene has historically been about unstructured text but has steadily been adding structured capability (Trie, spatial, facets) and become a great solution for hybrid data. However, support for modeling and querying non-trivial data structures is missing currently.
Relationships matter
This proposal is not to recreate the full capabilities of a SQL database with arbitrary relationships. However, we can benefit greatly from providing simple parent-child relationships.
We have some unique capabilities
  • Parent-child joins are very fast
  • Unlike SQL we can return partial, relevance-ranked matches
  • Probably more akin to XML databases than SQL databases

Next steps
Existing code/unit tests can be released to Lucene project if there is sufficient interest. This software has been deployed in production on large datasets.
The matching approach is reliant on parents and children being held in the same Lucene index segment. Additional control is needed to enforce this more rigorously, either by:
  • Adding more user-control over IndexWriter segment creation where applications understand/control parent-child dependencies, OR
  • Making Lucene aware of parent-child relationships, e.g. new method Document.add(Document)
Query parser support
  • XML Query Parser support is available
  • End-user Query parser could add new syntax, e.g. +candidateLocale:UK +child(grade:A1 AND subject:music)

Thoughts?
Feedback encouraged on dev@lucene.apache.org