Document relations
                              Martijn van Groningen
                                 @mvgroningen




Tuesday, November 6, 12
Overview
     • Background
     • Document relations with joining.
     • Various solutions in Lucene and Elasticsearch



Background - Lucene model
     • Lucene is document based.
     • Lucene doesn’t store information about relations
       between documents.


     • Data often holds relations.
     • Good free text search over relational data.

Background - Common solutions
     • Compound documents.
           •     May result in documents with many fields.

     • Subsequent searches.
           •     May cause a lot of network overhead.




     • Non Lucene based approach:
           •     Use Lucene in combination with a relational database.




Background - Example
     • Product
           •     Name
           •     Description

     • Product-item
           •     Color
           •     Size
           •     Price




Background - Example
     • Compound Product & Product-items document.
     • Each product-item has its own field prefix.
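
A minimal sketch of the compound-document idea, assuming a hypothetical prefix scheme (`item1_`, `item2_`, ...) that is not taken from the slides. It shows how the number of fields grows with every product-item.

```python
def to_compound_document(product, items):
    """Flatten a product and its product-items into one flat document."""
    doc = {"name": product["name"], "description": product["description"]}
    for i, item in enumerate(items, start=1):
        for field, value in item.items():
            # Hypothetical prefix scheme: item1_color, item2_size, ...
            doc[f"item{i}_{field}"] = value
    return doc


product = {"name": "Polo shirt", "description": "Made of 100% cotton"}
items = [
    {"color": "red", "size": "s", "price": 999},
    {"color": "red", "size": "m", "price": 1099},
]
compound = to_compound_document(product, items)
```

Two product-items already give eight fields; a search for "red" and "m" then has to target every prefixed field pair.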




Background - Other solutions
     • Lucene offers solutions for 'relational'-like
       search.
           •     Joining
           •     Result grouping


     • Elasticsearch builds on top of the joining
       capabilities.


     • These solutions aren't naturally supported.

Joining
     • Join support available since Lucene 3.4
           •     Not a SQL join!


     • Two distinct joining types:
           •     Index time join
           •     Query time join


     • Joining provides a solution to handle document
       relations.


Joining - What is out there?
     • Index time:
           •     Lucene’s block join implementation.
           •     Elasticsearch’s nested filter, query and facets.
                •         Built on top of Lucene’s block join support.


     • Query time:
           •     Lucene’s query time join utility.
           •     Solr’s join query parser.
           •     Elasticsearch’s various parent-child queries and filters.



Index time join
                            And nested documents.




Joining - Block join query
     • Lucene block join queries:
           •     ToParentBlockJoinQuery
           •     ToChildBlockJoinQuery


     • Lucene collector:
           •     ToParentBlockJoinCollector


     • Index time join requires block indexing.

Joining - Block indexing
     • Atomically adding documents.
           •     A block of documents.


     • Each document in the block is assigned a sequential
       Lucene document id.


     • IndexWriter#addDocuments(docs);
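
A toy model of block indexing, not Lucene's implementation: `ToyIndex` and its `add_documents` method are hypothetical stand-ins, and the position in a Python list plays the role of the Lucene document id. It only illustrates that the documents of one block receive consecutive ids.

```python
class ToyIndex:
    """Hypothetical in-memory stand-in for an index segment."""

    def __init__(self):
        self.docs = []  # list position plays the role of the doc id

    def add_documents(self, block):
        """Append a whole block and return the consecutive ids it received."""
        first_id = len(self.docs)
        self.docs.extend(block)
        return list(range(first_id, len(self.docs)))


index = ToyIndex()
# Children first, parent last (illustrative data).
ids_a = index.add_documents([{"color": "red"}, {"color": "blue"}, {"name": "Polo shirt"}])
ids_b = index.add_documents([{"color": "green"}, {"name": "Jeans"}])
```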


Joining - Block indexing
     • Index doesn't record blocks.
     • App is responsible for identifying block documents.
     • Segment merging doesn’t re-order documents in a
       segment.


     • Adding a document to a block requires you to
       reindex the whole block.



Joining - block join query

      • Parent is the last document in a block.




Block join - ToChildBlockJoinQuery
     • [Code slides not preserved: marking parent documents, then adding two blocks.]
     • Parent filter marks the parent documents.

     • Child query is executed in the parent space.
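
A minimal sketch of how a parent filter can map child hits to parents, assuming the block layout described earlier (children first, parent last). The function name and the boolean list are illustrative and not Lucene's API.

```python
def join_to_parents(is_parent, matching_children):
    """Map matching child doc ids to the parent that closes their block."""
    parents = []
    for child_id in sorted(matching_children):
        parent_id = child_id
        # The parent is the first marked document at or after the child.
        while not is_parent[parent_id]:
            parent_id += 1
        if not parents or parents[-1] != parent_id:
            parents.append(parent_id)
    return parents


# Two blocks: docs 0-2 are children of parent 3, doc 4 is a child of parent 5.
is_parent = [False, False, False, True, False, True]
```

For example, `join_to_parents(is_parent, {0, 2, 4})` returns `[3, 5]`: two child hits in the first block collapse into one parent hit.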




Block join & Elasticsearch
     • In Elasticsearch exposed as nested objects.
     • Documents are constructed as JSON.
           •     JSON’s nested structure works nicely with block
                 indexing.


     • Elasticsearch takes care of block indexing and also
       keeps track of the nested documents.




Elasticsearch’s nested support
     • Support for a nested type in mappings.
     • Nested query.
     • Nested filter.
     • Nested facets.


Nested type
     • The nested type enables Lucene’s block indexing.
     • In the mapping below, products is the index, product is the type
       and offers is the nested field.

                          curl -XPUT 'localhost:9200/products' -d '{
                             "mappings" : {
                               "product" : {
                                  "properties" : {
                                    "offers" : { "type" : "nested" }
                                  }
                               }
                             }
                          }'




Indexing nested objects
     • The request targets the products index and the product type;
       offers holds the nested objects.

                          curl -XPOST 'localhost:9200/products/product' -d '{
                             "name" : "Polo shirt",
                             "description" : "Made of 100% cotton",
                             "offers" : [
                                  {
                                     "color" : "red",
                                     "size" : "s",
                                     "price" : 999
                                  },
                                  {
                                     "color" : "red",
                                     "size" : "m",
                                     "price" : 1099
                                  },
                                  {
                                     "color" : "blue",
                                     "size" : "s",
                                     "price" : 999
                                  }
                             ]
                          }'




Nested query
     • "path" is the nested field path in the mapping.
     • "score_mode" : "total" sums the scores of the individual nested matches.
     • The query asks for color blue and size m in the same offer;
       color red would match the previous document.

     curl -XPOST 'localhost:9200/products/product/_search' -d '{
        "query" : {
          "nested" : {
              "path" : "offers",
              "score_mode" : "total",
              "query" : {
                  "bool" : {
                       "must" : [
                           {
                               "term" : {
                                   "color" : "blue"
                               }
                           },
                           {
                               "term" : {
                                   "size" : "m"
                               }
                           }
                       ]
                  }
              }
          }
        }
     }'
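
A small sketch of why the nested type matters for this query, using the three offers of the Polo shirt. Both functions are illustrative models, not Elasticsearch code: the first mimics a flat document where all offer values share the same fields, the second requires both terms to match inside one offer.

```python
offers = [
    {"color": "red", "size": "s", "price": 999},
    {"color": "red", "size": "m", "price": 1099},
    {"color": "blue", "size": "s", "price": 999},
]


def flattened_match(offers, color, size):
    """Without nesting, the values of all offers end up in the same fields."""
    colors = {o["color"] for o in offers}
    sizes = {o["size"] for o in offers}
    return color in colors and size in sizes


def nested_match(offers, color, size):
    """With nesting, both terms must match inside the same offer."""
    return any(o["color"] == color and o["size"] == size for o in offers)
```

The flat model reports a match for blue and m even though no such offer exists; the nested model does not, while red and m matches in both.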




Nested facets
     • A facet for the nested field offers.

     curl -XPOST 'localhost:9200/products/product/_search' -d '{
        "facets" : {
          "color" : {
             "terms_stats" : {
                "key_field" : "size",
                "value_field" : "price"
             },
             "nested" : "offers"
          }
        }
     }'

     • Response (counts 2 nested documents for term s):

     "facets":{
         "color":{
             "_type":"terms_stats",
             "missing":0,
             "terms":[
                 {
                     "term":"s",
                     "count":2,
                     "total_count":2,
                     "min":999.0,
                     "max":999.0,
                     "total":1998.0,
                     "mean":999.0
                 },
                 ...
             ]
         }
     }
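
The numbers in the response can be checked by hand. The sketch below recomputes a terms_stats-style summary over the three nested offers of the Polo shirt; `terms_stats` here is a plain Python function written for illustration, not the Elasticsearch implementation.

```python
from collections import defaultdict

offers = [
    {"color": "red", "size": "s", "price": 999},
    {"color": "red", "size": "m", "price": 1099},
    {"color": "blue", "size": "s", "price": 999},
]


def terms_stats(docs, key_field, value_field):
    """Group docs by key_field and summarise value_field per group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc[key_field]].append(float(doc[value_field]))
    return {
        term: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "total": sum(values),
            "mean": sum(values) / len(values),
        }
        for term, values in groups.items()
    }


stats = terms_stats(offers, "size", "price")
```

`stats["s"]` has count 2, min and max 999.0, total 1998.0 and mean 999.0, which agrees with the facet entry for term s.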




Query time join
                           and parent & child relations.




Query time joining
     • Documents are joined during query time.
           •     More expensive, but more flexible.


     • Two types of query time joins:
           •     Parent child joining.
           •     Field based joining.




Lucene’s query time join
     • Query time joining is executed in two phases.
     • Field based joining:
           •     ‘from’ field
           •     ‘to’ field




     • Doesn’t require block indexing.
Query time join - JoinUtil
     • First phase collects all the terms in the fromField
       for the documents that match with the original
       query.


     • The second phase returns the documents that
       match with the collected terms from the previous
       phase in the toField.


     • One public method:
           •     JoinUtil#createJoinQuery(...)
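
The two phases can be sketched in a few lines. This is a toy model over a list of dicts, not the JoinUtil API: `join_query`, the `id` and `product_id` field names and the sample documents are assumptions made for the illustration.

```python
def join_query(docs, from_field, to_field, from_query):
    """Toy two-phase, field based join over a list of dict documents."""
    # Phase 1: collect the from_field terms of documents matching the query.
    terms = {d[from_field] for d in docs if from_field in d and from_query(d)}
    # Phase 2: return documents whose to_field holds one of those terms.
    return [d for d in docs if d.get(to_field) in terms]


docs = [
    {"id": "1", "name": "Polo shirt"},
    {"id": "2", "name": "Jeans"},
    {"product_id": "1", "color": "red", "size": "m"},
    {"product_id": "2", "color": "blue", "size": "s"},
]
# Find products that have an offer in size m.
products = join_query(docs, "product_id", "id", lambda d: d.get("size") == "m")
```

Because the join only compares field values, nothing ties the two sides to the same block or segment.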



Joining - JoinUtil




                          Refers to the product id.



Joining - JoinUtil



                                           Join utility




 • Result will contain one product.
 • Possible to do ‘join’ across indices.

Elasticsearch’s query time join
     • A parent child solution.
     • Not related to Lucene’s query time join.
     • Support consists of:
           •     The _parent field.
           •     The top_children query.
           •     The has_parent & has_child filter & query.
           •     Scoped facets.


The _parent field
     • Points to the parent type.
     • Mapping attribute to be defined on the child type.
                           curl -XPUT 'localhost:9200/products' -d '{
                              "mappings" : {
                                "offer" : {
                                   "_parent" : {
                                     "type" : "product"
                                   }
                                }
                              }
                           }'



     • Elasticsearch uses the _parent field to build an id
       cache.
           •       Makes parent/child queries & filters fast.



Indexing parent & child documents
     • Parent document:
              curl -XPOST 'localhost:9200/products/product/1' -d '{
                 "name" : "Polo shirt",
                 "description" : "Made of 100% cotton"
              }'

     • Child documents (the parent parameter is the id of the parent
       document; it is also used for routing):

             curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
                "color" : "red",
                "size" : "s",
                "price" : 999
             }'

             curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
                "color" : "red",
                "size" : "m",
                "price" : 1099
             }'

             curl -XPOST 'localhost:9200/products/offer?parent=1' -d '{
                "color" : "blue",
                "size" : "s",
                "price" : 999
             }'
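
A rough sketch of why the parent id doubles as the routing value: if the shard is derived from the routing value, a child routed by its parent's id lands on the same shard as the parent. The CRC32 hash, the shard count and the child id `offer-7` are assumptions for the illustration; this is not Elasticsearch's actual hash function.

```python
import zlib


def shard_for(doc_id, number_of_shards, parent=None):
    """Pick a shard from the routing value (parent id if given, else own id)."""
    routing = parent if parent is not None else doc_id
    # Assumption: CRC32 stands in for whatever stable hash the engine uses.
    return zlib.crc32(routing.encode("utf-8")) % number_of_shards


parent_shard = shard_for("1", 5)                    # product with id 1
child_shard = shard_for("offer-7", 5, parent="1")   # offer indexed with ?parent=1
```

Both calls hash the same routing value, so `parent_shard` and `child_shard` are always equal.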




The ‘top_children’ query
     • "type" is the child type, "query" the child query and "score" the score mode.

      curl -XPOST 'localhost:9200/products/_search' -d '{
         "query" : {
           "top_children" : {
              "type" : "offer",
              "query" : {
                 "term" : {
                   "size" : "m"
                 }
              },
              "score" : "sum"
           }
         }
      }'

     • Internally the child query is potentially executed
       several times in order to get enough parent hits.
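
The repeated execution can be pictured as a loop that fetches more child hits until enough distinct parents show up. The sketch below is a simplified model under stated assumptions: the function, the `factor` and `incremental_factor` parameters and their default values are illustrative choices here, and slicing a pre-sorted list stands in for re-running the child query.

```python
def top_children(child_hits, wanted_parents, factor=5, incremental_factor=2):
    """child_hits: list of (parent_id, score) pairs, sorted by score descending."""
    fetch = max(1, wanted_parents * factor)
    while True:
        parents = {}
        for parent_id, score in child_hits[:fetch]:
            # Score mode "sum": add up the scores of the fetched children.
            parents[parent_id] = parents.get(parent_id, 0.0) + score
        if len(parents) >= wanted_parents or fetch >= len(child_hits):
            ranked = sorted(parents.items(), key=lambda kv: kv[1], reverse=True)
            return ranked[:wanted_parents]
        # Not enough distinct parents yet: run the child query again, asking for more.
        fetch *= incremental_factor


child_hits = [("p1", 3.0), ("p1", 2.0), ("p1", 1.0), ("p2", 0.5)]
```

With `factor=1` and two wanted parents, the first pass sees only children of p1, so a second pass with a larger fetch size is needed before p2 appears.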



The ‘has_child’ query
     • "type" is the child type and "query" the child query.

      curl -XPOST 'localhost:9200/products/_search' -d '{
         "query" : {
           "has_child" : {
              "type" : "offer",
              "query" : {
                "term" : {
                   "size" : "m"
                }
              }
           }
         }
      }'

     • Doesn’t map the child scores into the matching
       parent doc. Works as a filter.


     • The has_parent query matches child documents
       instead.
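
A toy model of the two directions, using in-memory dicts rather than Elasticsearch: `has_child` returns parent ids with at least one matching child and ignores child scores, `has_parent` returns the children of matching parents. The sample data and function signatures are assumptions for the illustration.

```python
parents = {"1": {"name": "Polo shirt"}, "2": {"name": "Jeans"}}
children = [
    {"parent": "1", "size": "s"},
    {"parent": "1", "size": "m"},
    {"parent": "2", "size": "s"},
]


def has_child(child_query):
    """Parent ids with at least one matching child; child scores are ignored."""
    return sorted({c["parent"] for c in children if child_query(c)})


def has_parent(parent_query):
    """Children whose parent matches the parent query."""
    return [c for c in children if parent_query(parents[c["parent"]])]
```

`has_child(lambda c: c["size"] == "m")` yields only parent 1, however many of its children match.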

Scoped facets
     • Execute facets inside a specific scope.

     curl -XPOST 'localhost:9200/products/_search' -d '{
        "query" : {
           "has_child" : {
             "type" : "offer",
             "query" : {
                "term" : {
                  "size" : "m"
                }
             },
             "_scope" : "my_scope"
           }
        },
        "facets" : {
           "color" : {
             "terms_stats" : {
                "key_field" : "size",
                "value_field" : "price"
             },
             "scope" : "my_scope"
           }
        }
     }'
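
A rough sketch of what a scoped facet computes, assuming the facet runs over the child documents matched inside the named scope rather than over the parent hits. The function and the reduced statistics (count and total only) are simplifications for illustration.

```python
children = [
    {"parent": "1", "color": "red", "size": "s", "price": 999},
    {"parent": "1", "color": "red", "size": "m", "price": 1099},
    {"parent": "1", "color": "blue", "size": "s", "price": 999},
]


def scoped_terms_stats(children, child_query, key_field, value_field):
    """Aggregate only over the children that matched the scoped child query."""
    stats = {}
    for child in children:
        if not child_query(child):
            continue
        entry = stats.setdefault(child[key_field], {"count": 0, "total": 0.0})
        entry["count"] += 1
        entry["total"] += child[value_field]
    return stats


facet = scoped_terms_stats(children, lambda c: c["size"] == "m", "size", "price")
```

Only the single size m offer falls inside the scope, so the facet has one term with count 1 and total 1099.0.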




Conclusion
     • Block join & nested objects are fast and efficient,
       but lack flexibility.


     • Query time and parent child join are flexible at the
       cost of performance and memory.
           •     Field based query time joining is the most flexible.
           •     Parent child based joining is the fastest.


     • Faceting in combination with document relations
       gives a nice analytical view.


Any questions?



