Combining Inverted Indices and
Structured Search for
Ad-hoc Object Retrieval
Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab - University of Fribourg - Switzerland
{firstname.lastname}@unifr.ch
SIGIR2012 - Monday, August 13th 2012
2
Motivation
• Many search engine queries are about entities.
• An increasingly large amount of entity data is available online.
• Often represented as huge graphs
• e.g., the LOD cloud, Google Knowledge Graph, Facebook social graph.
• Globally unique entity identifiers (e.g., URIs).
• Hard to discover and/or memorize.
3
Ad-hoc Object Retrieval
(informal definition)
• “Given the description of an entity, give me back its identifier”
• Description can be keywords (e.g., “Harry Potter”).
• More than one identifier per entity (e.g., dbpedia +
freebase).
• How to evaluate returned results?
Ad-hoc Object Retrieval
(formal definition by Pound et al.)
• Input: unstructured query q
and data graph G.
• Output: ranked list of
resource identifiers (URIs)
from G.
• Evaluation: results (URIs)
scored by a judge with
access to all the information
contained in or linked to the
resource.
• Standard collections exist.
1. http://ex.plode.us/tag/harry+potter
2. http://www.vox.com/explore/interests/harry%20potter
3. http://www.flickr.com/groups/harrypotterandthedeathlyhallows/
4. http://harrypotter.wizards.pro/
http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows
http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk
http://harrypotter.wizards.pro/
http://ebiquity.umbc.edu/person/html/Harry/Chen/
http://dbpedia.org/resource/Ceramist
5
Overview of Our Solution
Inverted indices on
the LOD Cloud...
...and RDF store
containing the data.
Simple NLP techniques,
Autocompletion,
Pseudo-relevance feedback
BM25,
BM25F
6
A Simple Example
Query: SIGIR
• Query processing: pseudo-relevance feedback, NLP techniques, query auto-completion.
• II + ranking function(s) (How to build the II?):
1. http://dbpedia.org/…/SIGIR
2. http://dbpedia.org/…/IRAQ
3. …
• Graph traversals (Which properties should we follow?) and final ranking function (How to rank new results?):
1. http://dbpedia.org/…/SIGIR
2. http://freebase.com/…/sigir
3. http://dbpedia.org/…/IRAQ
…
7
Outline
1. Inverted Indices
2. Graph Based Entity Search
1. Object Properties vs Datatype Properties
2. Properties to Follow
3. Experimental Results
1. Experimental Setting
2. IR Techniques: Experimental Results
3. Evaluation of the Hybrid Approaches
4. Overhead of the Graph Traversal
8
1. Inverted Indices (IIs)
• Simple inverted index:
• index all literals attached to each
node in the input graph.
• “movie” → http://…types/film
• Structured inverted index with three
fields:
• URI - tokenized URIs identifying
entities.
• Label - manually selected datatype properties pointing to textual
descriptions of the entity (e.g., label, title, name, full-name, …).
• Attributes - all other literals.
BM25(F), query auto-completion, query extension, relevance feedback
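The two index layouts above can be sketched in Python. This is a minimal illustration, not the authors' implementation: the toy graph is assumed to be a list of (subject, predicate, object, is_literal) tuples, and the tokenizer, the `LABEL_PREDICATES` set, the example.org URIs, and all function names are hypothetical. Only the last component of each URI is tokenized for the URI field, which is also an assumption.

```python
import re
from collections import defaultdict

# Hypothetical set of "label-like" datatype properties (assumption).
LABEL_PREDICATES = {"label", "title", "name", "full-name"}

def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def local_name(uri):
    """Last path or fragment component of a URI."""
    return re.split(r"[/#]", uri.rstrip("/"))[-1]

def build_simple_index(triples):
    """Simple II: token -> set of URIs that have a literal containing the token."""
    index = defaultdict(set)
    for subj, _pred, obj, is_literal in triples:
        if is_literal:
            for tok in tokenize(obj):
                index[tok].add(subj)
    return index

def build_structured_index(triples):
    """Structured II: field -> token -> set of URIs, with three fields."""
    index = {f: defaultdict(set) for f in ("uri", "label", "attributes")}
    for subj, pred, obj, is_literal in triples:
        # URI field: tokens of the entity's own identifier.
        for tok in tokenize(local_name(subj)):
            index["uri"][tok].add(subj)
        if not is_literal:
            continue
        # Label field for selected properties, attributes field for all other literals.
        field = "label" if local_name(pred) in LABEL_PREDICATES else "attributes"
        for tok in tokenize(obj):
            index[field][tok].add(subj)
    return index

# Toy data with made-up URIs.
triples = [
    ("http://example.org/types/film", "http://example.org/prop/label", "Movie", True),
    ("http://example.org/types/film", "http://example.org/prop/comment", "A motion picture", True),
]
simple = build_simple_index(triples)
structured = build_structured_index(triples)
```

Keeping the fields separate is what would allow a fielded ranking function such as BM25F to weight a match in the label differently from a match in an arbitrary attribute.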
9
2. Graph-Based Entity Search
• Take top-N docs from the IR results.
• Follow links/properties (p1, p2, …, p_m) and get new URIs.
• Filter new results by text similarity with respect to the user query: sim(e, q) > τ?
• Assign scores (e.g., 0.284, 1.428, 0.556) and produce the merged, re-ranked results.
Scoring functions:
count sim > τ,
avg sim > τ,
sum sim,
avg sim,
sum BM25 - ε
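The steps above can be sketched as a small Python function. This is a hedged illustration under several assumptions: `jaccard` is only a stand-in for the unspecified sim(e, q), the graph is a plain dictionary rather than an RDF store, the identifiers and scores are toy values, and the "sum BM25 - ε" scoring is read here as "sum of the BM25 scores of the top-N entities that lead to the new entity, minus ε", which is one possible interpretation of the slide.

```python
def jaccard(a, b):
    """Token-overlap similarity in [0, 1]; a stand-in for sim(e, q)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def graph_expand(query, ir_results, neighbours, labels, n=3, tau=0.0, eps=0.01):
    """Merge IR results with entities reached by following properties.

    ir_results: list of (uri, bm25_score), best first.
    neighbours: dict uri -> URIs reachable via the followed properties.
    labels:     dict uri -> text used by the similarity filter.
    """
    original = dict(ir_results)
    new_scores = {}
    for uri, bm25 in ir_results[:n]:                # take top-N docs
        for new_uri in neighbours.get(uri, []):     # follow links/properties
            if new_uri in original:
                continue                            # already retrieved by the II
            if jaccard(labels.get(new_uri, ""), query) > tau:   # filter
                new_scores[new_uri] = new_scores.get(new_uri, 0.0) + bm25
    merged = dict(original)
    # Assumed reading of "sum BM25 - eps": rank just below the entry points.
    merged.update({u: s - eps for u, s in new_scores.items()})
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with made-up identifiers and scores.
ir_results = [("ex:SIGIR", 2.0), ("ex:IRAQ", 1.0), ("ex:other", 0.5)]
neighbours = {"ex:SIGIR": ["fb:sigir", "ex:conference"]}
labels = {"fb:sigir": "SIGIR", "ex:conference": "academic conference"}
ranked = graph_expand("sigir", ir_results, neighbours, labels)
```

In this toy run, `fb:sigir` passes the filter and lands directly below the entity that led to it, while `ex:conference` shares no token with the query and is dropped.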
10
2.1. Object Properties vs Datatype Properties
• Object Properties:
• connect different entities
• explore all the graph
• Datatype properties:
• give additional info about
entities
• explore just the
neighborhood of a node
11
2.2. Properties to Follow
• RDF graph queried with SPARQL queries.
• Scope 1 queries vs Scope 2 queries.
• Set of predicates to follow selected using:
• Common sense (e.g., sameAs)
• Statistics from the data
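As a rough illustration of how such queries might look, the Python sketch below builds SPARQL strings for a given entity and predicate set. The slides do not define the two scopes, so this assumes "Scope 1" means following one property hop and "Scope 2" means following two; the function names and the example.org URI are hypothetical, and the queries use the SPARQL 1.1 `VALUES` clause.

```python
# owl:sameAs, the "common sense" predicate mentioned above.
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

def _values(predicates):
    """Render predicate URIs for a SPARQL VALUES clause."""
    return " ".join(f"<{p}>" for p in predicates)

def scope1_query(entity_uri, predicates):
    """Entities one hop away from entity_uri (assumed meaning of Scope 1)."""
    return (
        "SELECT DISTINCT ?new WHERE { "
        f"VALUES ?p {{ {_values(predicates)} }} "
        f"<{entity_uri}> ?p ?new . "
        "FILTER(isIRI(?new)) }"
    )

def scope2_query(entity_uri, predicates):
    """Entities two hops away from entity_uri (assumed meaning of Scope 2)."""
    vals = _values(predicates)
    return (
        "SELECT DISTINCT ?new WHERE { "
        f"VALUES ?p1 {{ {vals} }} VALUES ?p2 {{ {vals} }} "
        f"<{entity_uri}> ?p1 ?mid . ?mid ?p2 ?new . "
        "FILTER(isIRI(?new)) }"
    )

# Hypothetical entry point returned by the inverted index.
q1 = scope1_query("http://example.org/entity/SIGIR", [SAME_AS])
q2 = scope2_query("http://example.org/entity/SIGIR", [SAME_AS])
```

The `isIRI` filter keeps only object-property targets, since literals cannot be returned as entity identifiers.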
12
Properties to Follow:
Two Examples
Entry point
given by the II
13
3. Experimental Results
14
3.1 Experimental Setting
• SemSearch 2010 and 2011 test sets:
• Billion Triple Challenge 2009 (BTC2009)
• 1.3 billion RDF triples crawled from the LOD cloud.
• 92 and 50 queries, respectively.
• Evaluation of systems with depth-10 pooling by means of
crowdsourcing.
• Measures taken into consideration: Mean Average Precision (MAP),
Normalized Discounted Cumulative Gain (NDCG), early Precision
(P10)
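For readers unfamiliar with these measures, here is a minimal Python sketch of their textbook definitions for a single query. It is not the evaluation code used for the experiments. `relevances` is an assumed list of graded judgments in rank order, and NDCG is normalized against the ideal ordering of the retrieved grades only, which is a simplification.

```python
import math

def precision_at_k(relevances, k=10):
    """Fraction of the top-k results that are relevant (grade > 0)."""
    return sum(1 for r in relevances[:k] if r > 0) / k

def average_precision(relevances, num_relevant):
    """Mean of the precision values at the ranks of the relevant results.

    num_relevant is the total number of relevant entities for the query.
    MAP is the mean of this value over all queries.
    """
    hits, total = 0, 0.0
    for rank, r in enumerate(relevances, start=1):
        if r > 0:
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0

def ndcg_at_k(relevances, k=10):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(grades):
        return sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0

# Toy ranking: relevant results at ranks 1 and 3.
rels = [1, 0, 1, 0]
p4 = precision_at_k(rels, k=4)
ap = average_precision(rels, num_relevant=2)
ndcg = ndcg_at_k(rels)
```

For the toy ranking, precision at 4 is 0.5 and average precision is (1/1 + 2/3) / 2 = 5/6.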
15
Completing Relevance by
Crowdsourcing Judgements
• We obtained relevance judgments for unjudged entities in
the top-10 results of our runs by using Amazon MTurk.
• To be fair we used the same design and settings that were
used for the AOR task of SemSearch.
16
3.2. IR Techniques: Experimental Results
Our baseline.
18
3.3. Evaluation of Hybrid Approaches
N = 3, τ = 0,
score = sum BM25 - ε
19
3.4. Overhead of the Graph
traversal
• Time in milliseconds
needed for each part of the
hybrid approaches.
• Measures taken on a single
machine with cold cache.
Surprisingly small
overhead (17% for best
results).
20
Conclusions
• AOR = “Given the description of an entity, give me back its identifier”
• Disappointing results using simple IR techniques for the AOR task.
• Hybrid system for AOR:
• combining classic IR techniques + structured database storing graph
data.
• Our evaluation shows that the new approach leads to significantly better
results (up to +25% MAP over BM25 baseline).
• For the best working configuration found, the overhead caused by the
graph traversal part is limited (17% more than running the chosen
baseline).
21
Thank you for your attention
• You can find the new relevance judgments at
http://diuf.unifr.ch/xi/HybridAOR.
• More info at www.exascale.info.
• In the following days you’ll find our paper, this presentation,
and the new crowdsourced relevance judgements at
www.exascale.info/AOR.



Editor's Notes

  • #3: lot of search engines queries are about entities (more than a half) there is the task...
  • #9: tell that literals are strings attached to some node
  • #10: just the only scoring function
  • #12: tell what same as is
  • #13: The data is a graph; the inverted index gives us an entry point, and then we walk
  • #15: TREC-like collection/test set; depth-10 pooling, everybody here knows it!
  • #17: Say that simple index is “or”; UL, LA, ULA is “and”. Say disappointment with first result with BM25: we tried to do just II but it didn’t work, and then we decided to go for graph… NO GOOGLE
  • #19: Compare JUST s_1 with s_2 (lower recall but higher precision)
  • #20: s2_3 doesn’t follow wikilinks. Indices and database were resident on the machine. We didn’t focus on efficiency