Proposal for nested document support in Lucene

Nested Documents in Lucene

High-performance support for parent/child document relations

mark@searcharea.co.uk

Problem:

The Lucene data model is based on Documents, Fields and Terms. However, many real-world data structures cannot be properly represented when collapsed into a single Lucene document.

Problem: “Cross-matching”

When two or more data structures of the same type are jumbled up into a single Lucene field, matching logic becomes confused, e.g. more than one qualification in a resume.

Example: John's resume lists A1 in Maths and E1 in Science. Collapsed into a single Lucene document:

Name: John
Grade: A1, E1
Subject: Maths, Science

False match for query: Grade:A1 AND Subject:Science
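The false match can be reproduced without Lucene. The sketch below is only an illustration in plain Java: the class and method names are hypothetical, and a map of field name to value set stands in for a flattened document.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrossMatchDemo {
    /** Field-level AND over a flattened document: each field is just a bag of values. */
    static boolean matchesFlattened(Map<String, Set<String>> doc, String grade, String subject) {
        return doc.getOrDefault("Grade", Set.of()).contains(grade)
                && doc.getOrDefault("Subject", Set.of()).contains(subject);
    }

    /** The intended logic: grade and subject must belong to the same qualification. */
    static boolean matchesSameQualification(List<String[]> qualifications, String grade, String subject) {
        for (String[] q : qualifications) { // q[0] = grade, q[1] = subject
            if (q[0].equals(grade) && q[1].equals(subject)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> flattened = Map.of(
                "Name", Set.of("John"),
                "Grade", Set.of("A1", "E1"),
                "Subject", Set.of("Maths", "Science"));
        List<String[]> qualifications = List.of(
                new String[] {"A1", "Maths"},
                new String[] {"E1", "Science"});

        // Grade:A1 AND Subject:Science
        System.out.println(matchesFlattened(flattened, "A1", "Science"));              // true (the false match)
        System.out.println(matchesSameQualification(qualifications, "A1", "Science")); // false
    }
}
```

The flattened form has lost the pairing between each grade and its subject, so no query over those two fields can recover it.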

Unacceptable solution #1

One modeling approach is to store related items in the same field and use proximity operators in queries.

Name: John
GradeAndSubject: A1 Maths …. E1 Science

Example query: GradeAndSubject:"A1 Science"~2

Slow

Not scalable with number of fields

Loss of fieldnames as context in query

Proximity distances must grow

Only one choice of Analyzer for a given field

Unacceptable solution #2

Use numbered fieldnames to group related structures.

Name: John
Grade1: A1
Subject1: Maths
Grade2: E1
Subject2: Science

Example query: (Grade1:A1 AND Subject1:Science) OR (Grade2:A1 AND Subject2:Science) …

Slow

Not scalable with number of nested structures

More numbered fieldnames = more query complexity and more unique tokens in index

Solution: Nested documents

The existing Lucene codebase can be used to simply store multiple “nested” documents to represent arbitrarily complex structures. Related documents are just added in sequence.

Document: Name: John
Document: Grade: A1, Subject: Maths
Document: Grade: E1, Subject: Science

But how to query?....
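A rough sketch of the "added in sequence" layout, in plain Java rather than Lucene code. It assumes the parent is added first with its children directly after it, that list position stands in for the document id, and that a docType field marks parents. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.Map;

class NestedBlockSketch {
    /** Appends one parent and its children as a contiguous run of documents. */
    static void addBlock(List<Map<String, String>> index,
                         Map<String, String> parent,
                         List<Map<String, String>> children) {
        index.add(parent);      // assumption: parent first
        index.addAll(children); // children follow immediately, in sequence
    }

    /** One bit per document; a bit is set when the document carries the parent marker field. */
    static BitSet parentBits(List<Map<String, String>> index, String markerField) {
        BitSet bits = new BitSet(index.size());
        for (int doc = 0; doc < index.size(); doc++) {
            if (index.get(doc).containsKey(markerField)) {
                bits.set(doc);
            }
        }
        return bits;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = new ArrayList<>();
        addBlock(index,
                Map.of("Name", "John", "docType", "resume"),
                List.of(Map.of("Grade", "A1", "Subject", "Maths"),
                        Map.of("Grade", "E1", "Subject", "Science")));

        System.out.println(index.size());                 // 3
        System.out.println(parentBits(index, "docType")); // {0}
    }
}
```

Because each block is contiguous, the relationship between a child and its parent is carried entirely by document order, with no extra join key stored on the children.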

Solution: Nested Document Queries

Nested documents need to be queried using a new NestedDocumentQuery class which understands document relationships.

[Diagram: parent document (Name: John, docType: resume) with child documents (Grade: A1, Subject: Maths) and (Grade: E1, Subject: Science)]

The new NestedDocumentQuery:

Executes the child search using any arbitrary Lucene Query object, e.g. Boolean, fuzzy, numeric etc.

Reports any matches as a match on the parent document, not the child

Super-fast evaluation of joins between child and parent

Requires an indexed field to identify parent documents

Solution: Example Query

Find resume of person called “John” with A1 grade in Maths

[Diagram: parent document (Name: John, docType: resume) with child documents (Grade: A1, Subject: Maths) and (Grade: E1, Subject: Science)]

The NestedDocumentQuery wrapper simply translates the stream of matches reported by the child-level query criteria into matches on the parent, where all the parent-level logic is evaluated.
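The translation step can be illustrated in plain Java. This is a sketch of the idea only, not the proposed NestedDocumentQuery: the class and method names are hypothetical, child and parent matches are given as precomputed doc ids, and a second resume (Jane's) is invented to show a parent being filtered out.

```java
import java.util.BitSet;

class ChildToParentSketch {
    /**
     * Maps each matching child doc id onto its parent (the nearest parent bit at or
     * before it) and keeps only parents that also satisfy the parent-level criteria.
     */
    static BitSet joinToParents(int[] childMatches, BitSet parentBits, BitSet parentLevelMatches) {
        BitSet result = new BitSet();
        for (int child : childMatches) {
            int parent = parentBits.previousSetBit(child);
            if (parent >= 0 && parentLevelMatches.get(parent)) {
                result.set(parent);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Doc ids: 0 = John's resume, 1 = A1/Maths, 2 = E1/Science,
        //          3 = Jane's resume (hypothetical), 4 = A1/Maths
        BitSet parentBits = new BitSet();
        parentBits.set(0);
        parentBits.set(3);

        int[] childMatches = {1, 4};    // child docs matching Grade:A1 AND Subject:Maths
        BitSet nameIsJohn = new BitSet();
        nameIsJohn.set(0);              // parent docs matching Name:John

        System.out.println(joinToParents(childMatches, parentBits, nameIsJohn)); // {0}
    }
}
```

Both resumes contain a matching child, but only John's parent document also passes the parent-level Name criterion, so only doc 0 is reported.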

Solution: Join speed

Unlike in a database, a join (child to parent) is blisteringly fast:

1) Match reported on document #356,675

2) Index directly into cached BitSet at position #356,675

3) Find first prior set bit, e.g. position #356,670

4) Attribute match to doc #356,670

[Diagram: query tree showing ParentQuery, NestedDocumentQuery and ChildQuery, next to the cached BitSet of parent documents]

The BitSet defining parents is obtained from a Filter and can be cached aggressively with minimal memory cost (one bit per document in the index).
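The lookup steps map closely onto java.util.BitSet, whose previousSetBit method returns the nearest set bit at or before a given index. The sketch below is plain Java, not the proposed Lucene code. The class name is hypothetical and the second parent position is made up for illustration.

```java
import java.util.BitSet;

class ParentLookupSketch {
    /** Returns the parent doc id for a matched doc, or -1 if no parent bit is set at or before it. */
    static int parentOf(BitSet parentBits, int matchedDoc) {
        return parentBits.previousSetBit(matchedDoc);
    }

    public static void main(String[] args) {
        BitSet parentBits = new BitSet();  // one bit per document; set bits mark parents
        parentBits.set(356_670);
        parentBits.set(356_680);           // position of the next parent: hypothetical

        // Steps 1-4: the match on doc #356,675 is attributed to parent #356,670.
        System.out.println(parentOf(parentBits, 356_675)); // 356670
    }
}
```

The join needs no lookup table or stored key: it is a backwards scan over a cached bit array, and the scan is bounded by the number of children between the match and its parent.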

Other advantages

Parent-child document relationships can also be used to limit child results from any one parent (e.g. efficiently control the max number of pages returned from any one website)

Nesting levels can be arbitrarily deep

Very powerful multi-child queries are possible, e.g. find people likely to know person X using resumes' employment histories (multiple employer names/URLs and related date-ranges)
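As a rough illustration of the first point, the same parent lookup can cap the number of child hits kept per parent. This is a plain Java sketch with hypothetical names, assuming child hits arrive as doc ids in ranked order and that parents (websites) precede their children (pages).

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class PerParentLimitSketch {
    /** Keeps at most maxPerParent child hits for each parent, preserving the incoming order. */
    static List<Integer> limitPerParent(List<Integer> rankedChildHits, BitSet parentBits, int maxPerParent) {
        Map<Integer, Integer> hitsPerParent = new HashMap<>();
        List<Integer> kept = new ArrayList<>();
        for (int child : rankedChildHits) {
            int parent = parentBits.previousSetBit(child);
            int count = hitsPerParent.merge(parent, 1, Integer::sum); // hits seen so far for this parent
            if (count <= maxPerParent) {
                kept.add(child);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Doc ids: 0 = site A, 1-3 = pages of A, 4 = site B, 5 = page of B
        BitSet parentBits = new BitSet();
        parentBits.set(0);
        parentBits.set(4);

        List<Integer> ranked = List.of(2, 1, 5, 3); // page hits, best first
        System.out.println(limitPerParent(ranked, parentBits, 2)); // [2, 1, 5]
    }
}
```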
