SlideShare a Scribd company logo
Working with deeply nested documents in
Apache Solr
Anshum Gupta
Apache Lucene/Solr PMC member & committer
IBM Watson
2
Anshum Gupta
• Apache Lucene/Solr committer and PMC member
• Search guy @ IBM Watson.
• Interested in search and related stuff.
• Apache Lucene since 2006 and Solr since 2010.
3
Agenda
• Hierarchical Data/Nested Documents
• Indexing Nested Documents
• Querying Nested Documents
• Faceting on Nested Documents
Thanks to my fellow IBMer Alisa Zhila for working on this with me!
Hierarchical Documents
5
• Social media comments, Email threads,
Annotated data - AI
• Relationship between documents
• Possibility to flatten
Need for nested data
EXAMPLE: Blog Post with Comments
Peter Navarro outlines the Trump economic plan
Tyler Cowen, September 27, 2016 at 3:07am
Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased
exports and reduced imports.
1 Ray Lopez September 27, 2016 at 3:21 am
I’ll be the first to say this, but the analysis is flawed.
{negative}
2 Brian Donohue September 27, 2016 at 9:20 am
The math checks out. Solid.
{positive}
examples from http://marginalrevolution.com
6
• Can not flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- IMPOSSIBLE!!!
Nested Documents
EXAMPLE: Data Flattening
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Comment_authors: [Ray Lopez, Brian Donohue]
Comment_dates: [September 27, 2016 at 3:21 am,
September 27, 2016 at 9:20 am]
Comment_texts: ["I’ll be the first to say this, but the analysis is
flawed.", "The math checks out. Solid."]
Comment_sentiments: [negative, positive]
7
• Can not flatten, need to retain context
• Relationship between documents
• Get all 'positive comments' to 'posts about
Trump' -- POSSIBLE!!! (stay tuned)
Nested Documents
EXAMPLE: Hierarchical Documents
Type: Post
Title: Peter Navarro outlines the Trump economic plan
Author: Tyler Cowen
Date: September 27, 2016 at 3:07am
Body: Trump proposes eliminating America’s $500 billion
trade deficit through a combination of increased exports and
reduced imports.
Type: Comment
Author: Ray Lopez
Date: September 27, 2016 at 3:21 am
Text: I’ll be the first to say this, but the analysis is flawed.
Sentiment: negative
Type: Comment
Author: Brian Donohue
Date: September 27, 2016 at 9:20 am
Text: The math checks out. Solid.
Sentiment: positive
8
• Blog Post Data with Comments and Replies
from http://marginalrevolution.com (cured)
• 2 posts, 2-3 comments per post, 0-3 replies
per comment
• Extracted keywords & sentiment data
• 4 levels of "nesting"
• Too big to show on slides
• Data + Scripts + Demo Queries:
• https://github.com/alisa-ipn/solr-
revolution-2016-nested-demo
Running Example
Indexing Nested Documents
10
• Nested XML
• JSON Documents
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
Sending Documents to Solr
11
solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml
Sending Documents to Solr: Nested XML
<add>
<doc>
<field name="type">post</field>
<field name="author"> "Alex Tabarrok"</field>
<field name="title">"The Irony of Hillary Clinton’s Data Analytics"</
field>
<field name="body">"Barack Obama’s campaign adopted data but
Hillary Clinton’s campaign has been molded by data from birth."</field>
<field name="id">"12015-24204"</field>
<doc>
<field name="type">comment</field>
<field name="author">"Todd"</field>
<field name="text">"Clinton got out data-ed and out organized in
2008 by Obama. She seems at least to learn over time, and apply the
lessons learned to the real world."</field>
<field name="sentiment">"positive"</field>
<field name="id">"29798-24171"</field>
<doc>
<field name="type">reply</field>
<field name="author">"The Other Jim"</field>
<field name="text">"No, she lost because (1) she is thoroughly
detested person and (2) the DNC decided Obama should therefore
win."</field>
<field name="sentiment">"negative"</field>
<field name="id">"29798-21232"</field>
</doc>
</doc>
</doc>
</add>
12
• Add _childDocument_ tags for all children
• Pre-process field names to FQNs
• Lose information, or add that as meta-data during pre-processing
solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr
Sending Documents to Solr: JSON Documents
[{ "path": "1.posts",
"id": "28711",
"author": "Alex Tabarrok",
"title": "The Irony of Hillary Clinton’s Data Analytics",
"body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign
has been molded by data from birth.",
"_childDocuments_": [
{
"path": "2.posts.comments",
"id": "28711-19237",
"author": "Todd",
"text": "Clinton got out data-ed and out organized in 2008 by Obama. She
seems at least to learn over time, and apply the lessons learned to the real world.",
"sentiment": "positive",
"_childDocuments_": [
{
"path": "3.posts.comments.replies",
"author": "The Other Jim",
"id": "28711-12444",
"sentiment": "negative",
"text": "No, she lost because (1) she is thoroughly detested person and
(2) the DNC decided Obama should therefore win."
}]}]}]
13
• JSON Document endpoint (6x only) - /update/json/docs
• Field name mappings
• Child Document splitting - Enhanced support coming soon.
solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs?
split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data-
binary @small-example-data.json -H ‘Content-type:application/json'
NOTE: All documents must contain a unique ID.
Sending Documents to Solr: JSON Endpoint
14
• Update Request Processors don’t work with nested documents
• Example:
• UUID update processor does not auto-add an id for a child document.
• Workaround:
• Take responsibility at the client layer to handle the computation for nested
documents.
• Change the update processor in Solr to handle nested documents.
Update Processors and Nested Documents
15
• The entire block needs reindexing
• Forgot to add a meta-data field that might be useful? Complete reindex
• Store everything in Solr IF
• it’s too expensive to reconstruct the doc from original data source
• No access to data anymore e.g. streaming data
Re-Indexing Your Documents
16
• Various ways to index nested documents
• Need to re-index entire block
Nested Document Indexing Summary
Let’s ask some interesting questions
18
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"path":["3.posts.comments.keywords"],
"text":["Trump"]},
{
"path":["2.posts.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."],
"path":["1.posts"]},
{
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]}
Easy question first
Find all documents that mention Trump
q=text:Trump
19
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."],
"path":["2.posts.comments"]}
Returning certain types of documents
Find all comments and replies that mention Trump
q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump
Recipe:
At the data pre-processing stage, add a field that indicates document type
and also its path in the hierarchy (-- stay tuned):
20
{
"path":["3.posts.comments.keywords"],
"sentiment":["positive"],
"text":["Hillary"]},
{
"path":["4.posts.comments.replies.keywords"],
"sentiment":["negative"],
"text":["Hillary"]},
{
"path":["2.posts.keywords"],
"text":["Hillary"]}
Returning similar type from different level
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
Recipe:
Use wild-cards in the field that stores the hierarchy path
Cross-Level Querying
22
{
"path":["3.posts.comments.keywords"],
"sentiment":["positive"],
"text":["Hillary"]},
{
"path":["4.posts.comments.replies.keywords"],
"sentiment":["negative"],
"text":["Hillary"]},
{
"path":["2.posts.keywords"],
"text":["Hillary"]}
Recap so far...
Find all keywords that are Hillary
q=path:*.keywords AND text:Hillary
We're querying precisely for documents
which we provide a search condition for
Query
Level 3
Result
Level 3
Query
Level 4
Result
Level 4
Query
Level 2
Result
Level 2
23
Returning parents by querying children:
Block Join Parent Query
Find all comments whose keywords detected positive sentiment towards Hillary
q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive
Query
Level 3
Result
Level 2
{
"author":["Brian Donohue"],
"text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering,
but he was actually able to find his feet and score some points."],
"path":["2.posts.comments"]},
{
"author":["Todd"],
"text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to
learn over time, and apply the lessons learned to the real world."],
"path":["2.posts.comments"]}
24
{
"sentiment":["negative"],
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"sentiment":["neutral"],
"text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S.
asset values?"],
"path":["3.posts.comments.replies"]},
{
"sentiment":["positive"],
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"]}
Returning children by querying parents:
Block Join Child Query
Find replies to negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies
Query
Level 2
Result
Level 3
Block Join Child Query + Filter Query
A bit counterintuitive and non-symmetrical to the BJPQ
25
Returning all document's descendants
Block Join Child Query
Find all descendants of negative comments
q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative
Query
Level 2
Results
Level 3
Results
Level 4
{
"path":["4.posts.comments.replies.keywords"],
"id":"17413-13550",
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"],
"id":"17413-66188"},
{
"path":["3.posts.comments.keywords"],
"id":"12413-12487",
"text":["Hillary"]},
{
"text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see
a fantasy in person?"],
"path":["3.posts.comments.replies"],
"id":"12413-10998"}
Issue: no grouping by parent
What if we want to bring the whole sub-structure?
26
Find all negative comments and return them with all their descendants
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*]
Query
Level 2
Result
Level 2
sub-
hierarchy
Returning document with all descendants:
ChildDocTransformer
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."],
"path":["3.posts.comments.replies"]},
{
"path":["4.posts.comments.replies.keywords"],
"text":["U.S."]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Issue: the "sub-hierarchy" is flat
• Returns all descendant documents along with the queried document
• flattens the sub-hierarchy
• Workarounds:
• Reconstruct the document using path ("path":["3.posts.comments.replies"])
information in case you want the entire subtree (result post-processing)
• use childFilter in case you want a specific level
27
“This transformer returns all descendant documents of each parent document matching your query in
a flat list nested inside the matching parent document." (ChildDocTransformer cwiki)
Returning document with all descendants:
ChildDocTransformer
28
Find all negative comments and return them with all replies to them
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*
childFilter=path:3.posts.comments.replies]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]},
{
"text":["So then I guess he will also eliminate the current account surplus? What
will happen to U.S. asset values?"],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with specific descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ Level 3:replies
29
Find all negative comments and return them with all their descendants that mention Trump
q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump]
{
"sentiment":["negative"],
"text":["I’ll be the first to say this, but the analysis is flawed."],
"path":["2.posts.comments"],
"_childDocuments_":[
{
"path":["4.posts.comments.replies.keywords"],
"text":["Trump"]},
{
"text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is
funnier."],
"path":["3.posts.comments.replies"]}
]
},
...
Returning document with queried descendants:
ChildDocTransformer + childFilter
Query
Level 2:comments
Result
Level 2:comments
+ sub-levels
Issue: cannot use boolean expressions in childFilter query
30
Cross-Level Querying Mechanisms:
• Block Join Parent Query
• Block Join Children Query
• ChildDocTransformer
Good points:
• overlapping & complementary features
• good capabilities of querying direct ancestors/descendants
• possible to query on siblings of different type
Drawbacks:
• need for data-preprocessing for better querying flexibility
• limited support of querying over non-directly related branches (overcome with graphs?)
• flattening nested data (additional post-processing is needed for reconstruction)
Nested Document Querying Summary
Faceting on Nested Documents
32
• Solr allows faceting on nested documents!
• Two mechanisms for faceting:
• Faceting with JSON Facet API (since Solr 5.3)
• Block Join Faceting (since Solr 5.5)
Faceting on Nested Documents
33
q=path:2.posts.comments AND sentiment:positive&
json.facet={
most_liked_authors : {
type: terms,
field: author,
domain: { blockParent : "path:1.posts"}
}}
Faceting on parents by descendants
JSON Facet API: Parent Domain
Count authors of the posts that received positive comments
"most_liked_authors":{
"buckets":[
{
"val":"Alex Tabarrok",
"count":1},
{
"val":"Tyler Cowen",
"count":1}
]
}
Query
Level 2
Facet
Level 1
34
Faceting on descendants by ancestors
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
"top_keywords":{
"buckets":[{
"val":"hillary",
"count":4,
"counts_by_posts":2},
{
"val":"trump",
"count":3,
"counts_by_posts":2},
{
"val":"dnc",
"count":1,
"counts_by_posts":1},
{
"val":"obama",
"count":2,
"counts_by_posts":1},
{
"val":"u.s",
"count":1,
"counts_by_posts":1}
]}
35
q=path:1.posts&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:1.posts" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_posts desc",
facet: {
counts_by_posts: "unique(_root_)"
}}}}}
Faceting on descendants by ancestors
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
36
Faceting on descendants by top-level ancestor
JSON Facet API: Child Domain
Distribution of keywords that appear in comments and replies by the top-level posts
Query
Level 1
Facet
Descendant
Levels
Issue: only the top-ancestor gets the unique "_root_" field by default
q=path:1.posts&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:1.posts" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_posts desc",
facet: {
counts_by_posts: "unique(_root_)"
}}}}}
37
q=path:2.posts.comments&rows=0&
json.facet={
filter_by_child_type :{
type:query,
q:"path:*comments*keywords",
domain: { blockChildren : "path:2.posts.comments" },
facet:{
top_keywords : {
type: terms,
field: text,
sort: "counts_by_comments desc",
facet: {
counts_by_comments: "unique(2.posts.comments-id)"
}}}}}
Faceting on descendants by intermediate ancestors
JSON Facet API: Child Domain + unique fields
Distribution of keywords that appear in comments and replies by the comments
Query
Level 2
Facet
Descendant
Levels
At pre-processing, introduce unique fields for each level
38
Faceting on descendants by intermediate ancestors
JSON Facet API: Child Domain + unique fields
Distribution of keywords that appear in comments and replies by the comments
Query
Level 2
Facet
Descendant
Levels
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
{
"val":"Obama",
"count":2,
"counts_by_comments":1},
{
"val":"U.S.",
"count":1,
"counts_by_comments":1}
]}
Now let's try the same using Block Join Faceting
40
• Experimental Feature
• Needs to be turned on explicitly in solrconfig.xml
More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting
Block Join Faceting
41
bjqfacet?q={!parent which=path:2.posts.comments}
path:*.comments*keywords&rows=0&facet=true&child.facet.field=text
Faceting on descendants by ancestors #2:
Block Join Faceting on Children Domain
Distribution of keywords that appear in comments and replies by the comments
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
Query
Level 2
Facet
Descendant
Levels
bjqfacet request handler instead of query
42
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
{
"val":"Obama",
"count":2,
"counts_by_comments":1},
{
"val":"U.S.",
"count":1,
"counts_by_comments":1}
]}
Distribution of keywords that appear in comments and replies by the comments
43
Output Comparison
Block Join Facet JSON Facet API
"facet_fields":{
"text":[
"dnc",1,
"hillary",3,
"obama",1,
"trump",3,
"u.s",1
]
}
"top_keywords":{
"buckets":[{
"val":"Hillary",
"count":4,
"counts_by_comments":3},
{
"val":"Trump",
"count":3,
"counts_by_comments":3},
{
"val":"DNC",
"count":1,
"counts_by_comments":1},
...
Distribution of keywords that appear in comments and replies by the comments
Output is sorted in alphabetical
order. It cannot be changed
facet:{
top_keywords : {
...
sort: "counts_by_comments desc"
}}}
44
JSON Facet API:
• Experimental - but more mature
• More developed and established feature
• bulky JSON syntax
• faceting on children by non-top level ancestors requires introducing unique branch
identifiers similar to "_root_" on each level
Block Join Facet:
• Experimental feature
• Lacks controls: sorting, limit...
• traditional query-style syntax
• proper handling of faceting on children by non-top level ancestors
Hierarchical Faceting Summary
45
• Returning hierarchical structure
• JSON facet rollups is in the works - SOLR-8998
• Graph query might replace a lot of functionalities of cross-level querying - No
distributed support right now.
• There’s more but the community would love to have more people involved!
Community Roadmap
Thank you!
Anshum Gupta anshum@apache.org | @anshumgupta
https://github.com/alisa-ipn/solr-revolution-2016-nested-demo

More Related Content

PDF
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
PDF
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
PPTX
JSON in Solr: from top to bottom
PDF
Solr Query Parsing
PPTX
Building a real time, solr-powered recommendation engine
PDF
Making Structured Streaming Ready for Production
PPTX
Understanding SQL Trace, TKPROF and Execution Plan for beginners
PDF
Apache Calcite: One planner fits all
Working with Deeply Nested Documents in Apache Solr: Presented by Anshum Gupt...
Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and Mo...
JSON in Solr: from top to bottom
Solr Query Parsing
Building a real time, solr-powered recommendation engine
Making Structured Streaming Ready for Production
Understanding SQL Trace, TKPROF and Execution Plan for beginners
Apache Calcite: One planner fits all

What's hot (20)

PDF
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
PDF
Dense Retrieval with Apache Solr Neural Search.pdf
PDF
Introduction To Apache Lucene
PDF
ksqlDB - Stream Processing simplified!
PDF
Hive Bucketing in Apache Spark with Tejas Patil
PDF
Producer Performance Tuning for Apache Kafka
PPTX
Introduction to laravel framework
PDF
The Google Bigtable
ODP
ORM, JPA, & Hibernate Overview
PDF
Recursive Query Throwdown
PDF
Flink SQL: The Challenges to Build a Streaming SQL Engine
PPTX
Kafka Connect - debezium
PDF
From my sql to postgresql using kafka+debezium
PDF
Pinot: Near Realtime Analytics @ Uber
PDF
MongoDB Fundamentals
PDF
From Lucene to Elasticsearch, a short explanation of horizontal scalability
PDF
Introduction to elasticsearch
PDF
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
PDF
Cosco: An Efficient Facebook-Scale Shuffle Service
PDF
Apache Calcite Tutorial - BOSS 21
Zeus: Uber’s Highly Scalable and Distributed Shuffle as a Service
Dense Retrieval with Apache Solr Neural Search.pdf
Introduction To Apache Lucene
ksqlDB - Stream Processing simplified!
Hive Bucketing in Apache Spark with Tejas Patil
Producer Performance Tuning for Apache Kafka
Introduction to laravel framework
The Google Bigtable
ORM, JPA, & Hibernate Overview
Recursive Query Throwdown
Flink SQL: The Challenges to Build a Streaming SQL Engine
Kafka Connect - debezium
From my sql to postgresql using kafka+debezium
Pinot: Near Realtime Analytics @ Uber
MongoDB Fundamentals
From Lucene to Elasticsearch, a short explanation of horizontal scalability
Introduction to elasticsearch
Bucketing 2.0: Improve Spark SQL Performance by Removing Shuffle
Cosco: An Efficient Facebook-Scale Shuffle Service
Apache Calcite Tutorial - BOSS 21
Ad

Viewers also liked (20)

PDF
Working with deeply nested documents in Apache Solr
PPTX
Solr vs. Elasticsearch - Case by Case
PDF
What's New in Apache Solr 4.10
PDF
The Seven Deadly Sins of Solr - By Jay Hill
PPTX
Battle of the Giants Round 2 - Apache Solr vs. Elasticsearch
PDF
Facettensuche mit Lucene und Solr
PDF
Warum 'ne Datenbank, wenn wir Elasticsearch haben?
PDF
Apache Solr 5.0 and beyond
PDF
Understanding the Solr security framework - Lucene Solr Revolution 2015
PDF
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...
PDF
Webinar: Fusion for Business Intelligence
PDF
Webinar: Search and Recommenders
PDF
Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...
PDF
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
PDF
What's new in Solr 5.0
PDF
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
PDF
Grouping and Joining in Lucene/Solr
PPTX
Scaling SolrCloud to a large number of Collections
PDF
it's just search
PDF
Ease of use in Apache Solr
Working with deeply nested documents in Apache Solr
Solr vs. Elasticsearch - Case by Case
What's New in Apache Solr 4.10
The Seven Deadly Sins of Solr - By Jay Hill
Battle of the Giants Round 2 - Apache Solr vs. Elasticsearch
Facettensuche mit Lucene und Solr
Warum 'ne Datenbank, wenn wir Elasticsearch haben?
Apache Solr 5.0 and beyond
Understanding the Solr security framework - Lucene Solr Revolution 2015
Downtown SF Lucene/Solr Meetup: Developing Scalable User Search for PlayStati...
Webinar: Fusion for Business Intelligence
Webinar: Search and Recommenders
Building a Solr Continuous Delivery Pipeline with Jenkins: Presented by James...
Webinar: Fusion 2.3 Preview - Enhanced Features with Solr & Spark
What's new in Solr 5.0
Solr JDBC: Presented by Kevin Risden, Avalon Consulting
Grouping and Joining in Lucene/Solr
Scaling SolrCloud to a large number of Collections
it's just search
Ease of use in Apache Solr
Ad

Similar to Working with deeply nested documents in Apache Solr (20)

PPTX
Adding data sources to the reporter
PPTX
I want to know more about compuerized text analysis
PPTX
Duke at SplunkLive! Charlotte
PPTX
An Introduction to Working With the Activity Stream
PPTX
Back to Basics Webinar 3 - Thinking in Documents
PPTX
Back to Basics Webinar 3: Schema Design Thinking in Documents
PDF
Mikkel Heisterberg - An introduction to developing for the Activity Stream
PDF
Scaling Recommendations, Semantic Search, & Data Analytics with solr
PPT
Data Journalism (City Online Journalism wk8)
PPTX
60 reporting tips in 60 minutes - SQLBits 2018
PPTX
Week 2 - Interactive News Editing and Producing
PPT
Webofdata
PPTX
Conceptos básicos. seminario web 3 : Diseño de esquema pensado para documentos
PDF
Vizwik part 3 data
PDF
UBC STAT545 2014 Cm001 intro to-course
PDF
CPRS Ottawa-Gatineau - Measuring Social Media Workshop - Sean Howard - thornl...
ODP
Searching Relational Data with Elasticsearch
PDF
What I tell myself before visualizing
PDF
Test Trend Analysis : Towards robust, reliable and timely tests
PPTX
Adding data sources to the reporter
I want to know more about compuerized text analysis
Duke at SplunkLive! Charlotte
An Introduction to Working With the Activity Stream
Back to Basics Webinar 3 - Thinking in Documents
Back to Basics Webinar 3: Schema Design Thinking in Documents
Mikkel Heisterberg - An introduction to developing for the Activity Stream
Scaling Recommendations, Semantic Search, & Data Analytics with solr
Data Journalism (City Online Journalism wk8)
60 reporting tips in 60 minutes - SQLBits 2018
Week 2 - Interactive News Editing and Producing
Webofdata
Conceptos básicos. seminario web 3 : Diseño de esquema pensado para documentos
Vizwik part 3 data
UBC STAT545 2014 Cm001 intro to-course
CPRS Ottawa-Gatineau - Measuring Social Media Workshop - Sean Howard - thornl...
Searching Relational Data with Elasticsearch
What I tell myself before visualizing
Test Trend Analysis : Towards robust, reliable and timely tests

Recently uploaded (20)

PPTX
L1 - Introduction to python Backend.pptx
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
PDF
Nekopoi APK 2025 free lastest update
PDF
Adobe Illustrator 28.6 Crack My Vision of Vector Design
PPTX
Introduction to Artificial Intelligence
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PPTX
Odoo POS Development Services by CandidRoot Solutions
PPTX
Essential Infomation Tech presentation.pptx
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PDF
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PDF
PTS Company Brochure 2025 (1).pdf.......
PDF
System and Network Administration Chapter 2
PDF
How Creative Agencies Leverage Project Management Software.pdf
PPTX
ai tools demonstartion for schools and inter college
PPTX
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PPTX
Reimagine Home Health with the Power of Agentic AI​
PDF
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf
L1 - Introduction to python Backend.pptx
2025 Textile ERP Trends: SAP, Odoo & Oracle
T3DD25 TYPO3 Content Blocks - Deep Dive by André Kraus
Nekopoi APK 2025 free lastest update
Adobe Illustrator 28.6 Crack My Vision of Vector Design
Introduction to Artificial Intelligence
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
Odoo POS Development Services by CandidRoot Solutions
Essential Infomation Tech presentation.pptx
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Internet Downloader Manager (IDM) Crack 6.42 Build 42 Updates Latest 2025
PTS Company Brochure 2025 (1).pdf.......
System and Network Administration Chapter 2
How Creative Agencies Leverage Project Management Software.pdf
ai tools demonstartion for schools and inter college
Agentic AI : A Practical Guide. Undersating, Implementing and Scaling Autono...
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
Reimagine Home Health with the Power of Agentic AI​
EN-Survey-Report-SAP-LeanIX-EA-Insights-2025.pdf

Working with deeply nested documents in Apache Solr

  • 1. Working with deeply nested documents in Apache Solr Anshum Gupta Apache Lucene/Solr PMC member & committer IBM Watson
  • 2. 2 Anshum Gupta • Apache Lucene/Solr committer and PMC member • Search guy @ IBM Watson. • Interested in search and related stuff. • Apache Lucene since 2006 and Solr since 2010.
  • 3. 3 Agenda • Hierarchical Data/Nested Documents • Indexing Nested Documents • Querying Nested Documents • Faceting on Nested Documents Thanks to my fellow IBMer Alisa Zhila for working on this with me!
  • 5. 5 • Social media comments, Email threads, Annotated data - AI • Relationship between documents • Possibility to flatten Need for nested data EXAMPLE: Blog Post with Comments Peter Navarro outlines the Trump economic plan Tyler Cowen, September 27, 2016 at 3:07am Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. 1 Ray Lopez September 27, 2016 at 3:21 am I’ll be the first to say this, but the analysis is flawed. {negative} 2 Brian Donohue September 27, 2016 at 9:20 am The math checks out. Solid. {positive} examples from http://marginalrevolution.com
  • 6. 6 • Can not flatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- IMPOSSIBLE!!! Nested Documents EXAMPLE: Data Flattening Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Comment_authors: [Ray Lopez, Brian Donohue] Comment_dates: [September 27, 2016 at 3:21 am, September 27, 2016 at 9:20 am] Comment_texts: ["I’ll be the first to say this, but the analysis is flawed.", "The math checks out. Solid."] Comment_sentiments: [negative, positive]
  • 7. 7 • Can not flatten, need to retain context • Relationship between documents • Get all 'positive comments' to 'posts about Trump' -- POSSIBLE!!! (stay tuned) Nested Documents EXAMPLE: Hierarchical Documents Type: Post Title: Peter Navarro outlines the Trump economic plan Author: Tyler Cowen Date: September 27, 2016 at 3:07am Body: Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports. Type: Comment Author: Ray Lopez Date: September 27, 2016 at 3:21 am Text: I’ll be the first to say this, but the analysis is flawed. Sentiment: negative Type: Comment Author: Brian Donohue Date: September 27, 2016 at 9:20 am Text: The math checks out. Solid. Sentiment: positive
  • 8. 8 • Blog Post Data with Comments and Replies from http://marginalrevolution.com (cured) • 2 posts, 2-3 comments per post, 0-3 replies per comment • Extracted keywords & sentiment data • 4 levels of "nesting" • Too big to show on slides • Data + Scripts + Demo Queries: • https://github.com/alisa-ipn/solr- revolution-2016-nested-demo Running Example
  • 10. 10 • Nested XML • JSON Documents • Add _childDocument_ tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing • JSON Document endpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. Sending Documents to Solr
  • 11. 11 solr-6.2.1$ bin/post -c demo-xml ./data/example-data.xml Sending Documents to Solr: Nested XML <add> <doc> <field name="type">post</field> <field name="author"> "Alex Tabarrok"</field> <field name="title">"The Irony of Hillary Clinton’s Data Analytics"</ field> <field name="body">"Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth."</field> <field name="id">"12015-24204"</field> <doc> <field name="type">comment</field> <field name="author">"Todd"</field> <field name="text">"Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."</field> <field name="sentiment">"positive"</field> <field name="id">"29798-24171"</field> <doc> <field name="type">reply</field> <field name="author">"The Other Jim"</field> <field name="text">"No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win."</field> <field name="sentiment">"negative"</field> <field name="id">"29798-21232"</field> </doc> </doc> </doc> </add>
  • 12. 12 • Add _childDocument_ tags for all children • Pre-process field names to FQNs • Lose information, or add that as meta-data during pre-processing solr-6.2.1$ bin/post -c demo-solr-json ./data/small-example-data-solr.json -format solr Sending Documents to Solr: JSON Documents [{ "path": "1.posts", "id": "28711", "author": "Alex Tabarrok", "title": "The Irony of Hillary Clinton’s Data Analytics", "body": "Barack Obama’s campaign adopted data but Hillary Clinton’s campaign has been molded by data from birth.", "_childDocuments_": [ { "path": "2.posts.comments", "id": "28711-19237", "author": "Todd", "text": "Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world.", "sentiment": "positive", "_childDocuments_": [ { "path": "3.posts.comments.replies", "author": "The Other Jim", "id": "28711-12444", "sentiment": "negative", "text": "No, she lost because (1) she is thoroughly detested person and (2) the DNC decided Obama should therefore win." }]}]}]
  • 13. 13 • JSON Document endpoint (6x only) - /update/json/docs • Field name mappings • Child Document splitting - Enhanced support coming soon. solr-6.2.1$ curl 'http://localhost:8983/solr/gettingstarted/update/json/docs? split=/|/posts|/posts/comments|/posts/comments/replies&commit=true' --data- binary @small-example-data.json -H ‘Content-type:application/json' NOTE: All documents must contain a unique ID. Sending Documents to Solr: JSON Endpoint
  • 14. 14 • Update Request Processors don’t work with nested documents • Example: • UUID update processor does not auto-add an id for a child document. • Workaround: • Take responsibility at the client layer to handle the computation for nested documents. • Change the update processor in Solr to handle nested documents. Update Processors and Nested Documents
  • 15. 15 • The entire block needs reindexing • Forgot to add a meta-data field that might be useful? Complete reindex • Store everything in Solr IF • it’s too expensive to reconstruct the doc from original data source • No access to data anymore e.g. streaming data Re-Indexing Your Documents
  • 16. 16 • Various ways to index nested documents • Need to re-index entire block Nested Document Indexing Summary
  • 17. Let’s ask some interesting questions
  • 18. 18 { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "path":["3.posts.comments.keywords"], "text":["Trump"]}, { "path":["2.posts.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Trump proposes eliminating America’s $500 billion trade deficit through a combination of increased exports and reduced imports."], "path":["1.posts"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]} Easy question first Find all documents that mention Trump q=text:Trump
  • 19. 19 { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "text":["No one goes to Clinton rallies while tens of thousands line up to see Trump, data-mining leads to a fantasy view of the World."], "path":["2.posts.comments"]} Returning certain types of documents Find all comments and replies that mention Trump q=(path:2.posts.comments OR path:3.posts.comments.replies) AND text:Trump Recipe: At the data pre-processing stage, add a field that indicates document type and also its path in the hierarchy (-- stay tuned):
  • 22. 22 { "path":["3.posts.comments.keywords"], "sentiment":["positive"], "text":["Hillary"]}, { "path":["4.posts.comments.replies.keywords"], "sentiment":["negative"], "text":["Hillary"]}, { "path":["2.posts.keywords"], "text":["Hillary"]} Recap so far... Find all keywords that are Hillary q=path:*.keywords AND text:Hillary We're querying precisely for documents which we provide a search condition for Query Level 3 Result Level 3 Query Level 4 Result Level 4 Query Level 2 Result Level 2
  • 23. 23 Returning parents by querying children: Block Join Parent Query Find all comments whose keywords detected positive sentiment towards Hillary q={!parent which="path:2.posts.comments"}path:3.posts.comments.keywords AND text:Hillary AND sentiment:positive Query Level 3 Result Level 2 { "author":["Brian Donohue"], "text":["Hillary was impressive, for sure, and Trump spent time spluttering and floundering, but he was actually able to find his feet and score some points."], "path":["2.posts.comments"]}, { "author":["Todd"], "text":["Clinton got out data-ed and out organized in 2008 by Obama. She seems at least to learn over time, and apply the lessons learned to the real world."], "path":["2.posts.comments"]}
  • 24. 24 { "sentiment":["negative"], "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "sentiment":["neutral"], "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]}, { "sentiment":["positive"], "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"]} Returning children by querying parents: Block Join Child Query Find replies to negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative&fq=path:3.posts.comments.replies Query Level 2 Result Level 3 Block Join Child Query + Filter Query A bit counterintuitive and non-symmetrical to the BJPQ
  • 25. 25 Returning all document's descendants Block Join Child Query Find all descendants of negative comments q={!child of="path:2.posts.comments"}path:2.posts.comments AND sentiment:negative Query Level 2 Results Level 3 Results Level 4 { "path":["4.posts.comments.replies.keywords"], "id":"17413-13550", "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"], "id":"17413-66188"}, { "path":["3.posts.comments.keywords"], "id":"12413-12487", "text":["Hillary"]}, { "text":["Agreed why spend time data-mining for a fantasy view of the world , when instead you can see a fantasy in person?"], "path":["3.posts.comments.replies"], "id":"12413-10998"} Issue: no grouping by parent What if we want to bring the whole sub-structure?
  • 26. 26 Find all negative comments and return them with all their descendants q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.*] Query Level 2 Result Level 2 sub- hierarchy Returning document with all descendants: ChildDocTransformer { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "path":["4.posts.comments.replies.keywords"], "text":["U.S."]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Issue: the "sub-hierarchy" is flat
  • 27. • Returns all descendant documents along with the queried document • flattens the sub-hierarchy • Workarounds: • Reconstruct the document using path ("path":["3.posts.comments.replies"]) information in case you want the entire subtree (result post-processing) • use childFilter in case you want a specific level 27 “This transformer returns all descendant documents of each parent document matching your query in a flat list nested inside the matching parent document." (ChildDocTransformer cwiki) Returning document with all descendants: ChildDocTransformer
  • 28. 28 Find all negative comments and return them with all replies to them q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=path:3.posts.comments.replies] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]}, { "text":["So then I guess he will also eliminate the current account surplus? What will happen to U.S. asset values?"], "path":["3.posts.comments.replies"]} ] }, ... Returning document with specific descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + Level 3:replies
  • 29. 29 Find all negative comments and return them with all their descendants that mention Trump q=path:2.posts.comments AND sentiment:negative&fl=*,[child parentFilter=path:2.* childFilter=text:Trump] { "sentiment":["negative"], "text":["I’ll be the first to say this, but the analysis is flawed."], "path":["2.posts.comments"], "_childDocuments_":[ { "path":["4.posts.comments.replies.keywords"], "text":["Trump"]}, { "text":["LOL. I enjoyed Trump during last night’s stand-up bit, but this is funnier."], "path":["3.posts.comments.replies"]} ] }, ... Returning document with queried descendants: ChildDocTransformer + childFilter Query Level 2:comments Result Level 2:comments + sub-levels Issue: cannot use boolean expressions in childFilter query
  • 30. 30 Cross-Level Querying Mechanisms: • Block Join Parent Query • Block Join Children Query • ChildDocTransformer Good points: • overlapping & complementary features • good capabilities of querying direct ancestors/descendants • possible to query on siblings of different type Drawbacks: • need for data-preprocessing for better querying flexibility • limited support of querying over non-directly related branches (overcome with graphs?) • flattening nested data (additional post-processing is needed for reconstruction) Nested Document Querying Summary
  • 31. Faceting on Nested Documents
  • 32. 32 • Solr allows faceting on nested documents! • Two mechanisms for faceting: • Faceting with JSON Facet API (since Solr 5.3) • Block Join Faceting (since Solr 5.5) Faceting on Nested Documents
  • 33. 33 q=path:2.posts.comments AND sentiment:positive& json.facet={ most_liked_authors : { type: terms, field: author, domain: { blockParent : "path:1.posts"} }} Faceting on parents by descendants JSON Facet API: Parent Domain Count authors of the posts that received positive comments "most_liked_authors":{ "buckets":[ { "val":"Alex Tabarrok", "count":1}, { "val":"Tyler Cowen", "count":1} ] } Query Level 2 Facet Level 1
  • 34. 34 Faceting on descendants by ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"hillary", "count":4, "counts_by_posts":2}, { "val":"trump", "count":3, "counts_by_posts":2}, { "val":"dnc", "count":1, "counts_by_posts":1}, { "val":"obama", "count":2, "counts_by_posts":1}, { "val":"u.s", "count":1, "counts_by_posts":1} ]}
  • 35. 35 q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}} Faceting on descendants by ancestors JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels
  • 36. 36 Faceting on descendants by top-level ancestor JSON Facet API: Child Domain Distribution of keywords that appear in comments and replies by the top-level posts Query Level 1 Facet Descendant Levels Issue: only the top-ancestor gets the unique "_root_" field by default q=path:1.posts&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:1.posts" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_posts desc", facet: { counts_by_posts: "unique(_root_)" }}}}}
  • 37. 37 q=path:2.posts.comments&rows=0& json.facet={ filter_by_child_type :{ type:query, q:"path:*comments*keywords", domain: { blockChildren : "path:2.posts.comments" }, facet:{ top_keywords : { type: terms, field: text, sort: "counts_by_comments desc", facet: { counts_by_comments: "unique(2.posts.comments-id)" }}}}} Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels At pre-processing, introduce unique fields for each level
  • 38. 38 Faceting on descendants by intermediate ancestors JSON Facet API: Child Domain + unique fields Distribution of keywords that appear in comments and replies by the comments Query Level 2 Facet Descendant Levels "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]}
  • 39. Now let's try the same using Block Join Faceting
  • 40. 40 • Experimental Feature • Needs to be turned on explicitly in solrconfig.xml More info: https://cwiki.apache.org/confluence/display/solr/BlockJoin+Faceting Block Join Faceting
  • 41. 41 bjqfacet?q={!parent which=path:2.posts.comments} path:*.comments*keywords&rows=0&facet=true&child.facet.field=text Faceting on descendants by ancestors #2: Block Join Faceting on Children Domain Distribution of keywords that appear in comments and replies by the comments "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } Query Level 2 Facet Descendant Levels bjqfacet request handler instead of query
  • 42. 42 Output Comparison Block Join Facet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, { "val":"Obama", "count":2, "counts_by_comments":1}, { "val":"U.S.", "count":1, "counts_by_comments":1} ]} Distribution of keywords that appear in comments and replies by the comments
  • 43. 43 Output Comparison Block Join Facet JSON Facet API "facet_fields":{ "text":[ "dnc",1, "hillary",3, "obama",1, "trump",3, "u.s",1 ] } "top_keywords":{ "buckets":[{ "val":"Hillary", "count":4, "counts_by_comments":3}, { "val":"Trump", "count":3, "counts_by_comments":3}, { "val":"DNC", "count":1, "counts_by_comments":1}, ... Distribution of keywords that appear in comments and replies by the comments Output is sorted in alphabetical order. It cannot be changed facet:{ top_keywords : { ... sort: "counts_by_comments desc" }}}
  • 44. 44 JSON Facet API: • Experimental - but more mature • More developed and established feature • bulky JSON syntax • faceting on children by non-top level ancestors requires introducing unique branch identifiers similar to "_root_" on each level Block Join Facet: • Experimental feature • Lacks controls: sorting, limit... • traditional query-style syntax • proper handling of faceting on children by non-top level ancestors Hierarchical Faceting Summary
  • 45. 45 • Returning hierarchical structure • JSON facet rollups is in the works - SOLR-8998 • Graph query might replace a lot of functionalities of cross-level querying - No distributed support right now. • There’s more but the community would love to have more people involved! Community Roadmap
  • 46. Thank you! Anshum Gupta anshum@apache.org | @anshumgupta https://github.com/alisa-ipn/solr-revolution-2016-nested-demo