Combining Inverted Indices and Structured Search for Ad-hoc Object Retrieval

Alberto Tonon, Gianluca Demartini, Philippe Cudré-Mauroux
eXascale Infolab - University of Fribourg - Switzerland
{firstname.lastname}@unifr.ch
SIGIR2012 - Monday, August 13th 2012

2
Motivation
• Many search engine queries are about entities.
• An increasingly large amount of entity data is available online.
• Often represented as huge graphs, e.g., the LOD cloud, Google Knowledge Graph, Facebook social graph.
• Entities have globally unique identifiers (e.g., URIs).
• These are hard to discover and/or memorize.

3
Ad-hoc Object Retrieval (informal definition)
• “Given the description of an entity, give me back its identifier.”
• The description can be keywords (e.g., “Harry Potter”).
• There can be more than one identifier per entity (e.g., DBpedia + Freebase).
• How to evaluate returned results?

Ad-hoc Object Retrieval (formal definition by Pound et al.)
• Input: unstructured query q and data graph G.
• Output: ranked list of resource identifiers (URIs) from G.
• Evaluation: results (URIs) scored by a judge with access to all the information contained in or linked to the resource.
• Standard collections exist.
1. http://ex.plode.us/tag/harry+potter
1. http://www.vox.com/explore/interests/harry%20potter
1. http://www.flickr.com/groups/harrypotterandthedeathlyhallows/
1. http://harrypotter.wizards.pro/
http://dbpedia.org/resource/Harry_Potter_and_the_Deathly_Hallows
http://www.aktors.org/scripts/EdinburghPeople.dome#stephenp.aiai.ed.ac.uk
http://harrypotter.wizards.pro/
http://ebiquity.umbc.edu/person/html/Harry/Chen/
http://dbpedia.org/resource/Ceramist

5
Overview of Our Solution
• Inverted indices on the LOD Cloud...
• ...and an RDF store containing the data.
• Simple NLP techniques, autocompletion, pseudo-relevance feedback.
• BM25, BM25F.

6
A Simple Example
• Query: SIGIR
• Query processing: pseudo-relevance feedback, NLP techniques, query auto-completion.
• II + ranking function(s):
1. http://dbpedia.org/…/SIGIR
2. http://dbpedia.org/…/IRAQ
3. …
• Graph traversals, final ranking function:
1. http://dbpedia.org/…/SIGIR
2. http://freebase.com/…/sigir
3. http://dbpedia.org/…/IRAQ
…
• Open questions: How to build the II? Which properties should we follow? How to rank new results?

7
Outline
1. Inverted Indices
2. Graph-Based Entity Search
2.1. Object Properties vs Datatype Properties
2.2. Properties to Follow
3. Experimental Results
3.1. Experimental Setting
3.2. IR Techniques: Experimental Results
3.3. Evaluation of the Hybrid Approaches
3.4. Overhead of the Graph Traversal

8
1. Inverted Indices (IIs)
• Simple inverted index:
• Index all literals attached to each node in the input graph.
• “movie” → http://…types/film
• Structured inverted index with three fields:
• URI - tokenized URIs identifying entities.
• Label - manually selected datatype properties pointing to textual descriptions of the entity (e.g., label, title, name, full-name, …).
• Attributes - all other literals.
BM25(F), query auto-completion, query extension, relevance feedback.
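A minimal sketch of how the three-field structured index could be built, assuming the input is a list of (subject URI, predicate URI, literal) tuples. The tokenizer, the example.org URIs, and the `LABEL_PREDICATES` set are illustrative assumptions; only the field names and the example label properties come from the slide.

```python
import re
from collections import defaultdict

# Assumption: label-like predicates are matched by local name; the slide
# lists label, title, name and full-name as examples.
LABEL_PREDICATES = {"label", "title", "name", "full-name"}


def tokenize(text):
    """Lowercase and split on anything that is not a digit or a-z."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def local_name(uri):
    """Return the last path or fragment component of a URI."""
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


def build_structured_index(triples):
    """Map field -> token -> set of entity URIs containing that token."""
    index = {f: defaultdict(set) for f in ("uri", "label", "attributes")}
    for subject, predicate, literal in triples:
        # URI field: tokens of the entity identifier itself.
        for token in tokenize(subject):
            index["uri"][token].add(subject)
        # Label field for selected predicates, Attributes for all others.
        field = "label" if local_name(predicate) in LABEL_PREDICATES else "attributes"
        for token in tokenize(literal):
            index[field][token].add(subject)
    return index


# Hypothetical toy data.
TRIPLES = [
    ("http://example.org/resource/Harry_Potter", "http://example.org/ns/label", "Harry Potter"),
    ("http://example.org/resource/Harry_Potter", "http://example.org/ns/abstract", "A series of fantasy novels"),
    ("http://example.org/types/film", "http://example.org/ns/comment", "A movie"),
]
INDEX = build_structured_index(TRIPLES)
```

Keeping the fields separate is what allows a field-weighted ranking function such as BM25F to weight a match in the label differently from a match in the other attributes.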

9
2. Graph-Based Entity Search
• Take top-N docs from the IR results.
• Follow links/properties (p1, p2, …, p_m) and get new URIs.
• Filter new results by text similarity w.r.t. the user query: sim(e, q) > τ?
• Assign scores (e.g., 0.284, 1.428, 0.556) and produce the merged, re-ranked results.
• Scoring functions: count sim > τ, avg sim > τ, sum sim, avg sim, sum BM25 - ε.
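The pipeline on this slide can be sketched as follows. This is one possible reading, not the authors' implementation: `text_sim` uses Jaccard token overlap as a stand-in for sim(e, q), the "sum BM25 - ε" score is interpreted as the sum of the BM25 scores of the entry points that lead to a new entity minus ε, and the graph, labels, and URIs are hypothetical toy data.

```python
from collections import defaultdict


def text_sim(text, query):
    """Jaccard overlap between token sets; a stand-in for sim(e, q)."""
    a, b = set(text.lower().split()), set(query.lower().split())
    return len(a & b) / len(a | b) if (a | b) else 0.0


def hybrid_rank(query, ir_results, graph, labels, properties,
                n=3, tau=0.0, epsilon=0.01):
    """Expand the top-n IR results through the graph and re-rank.

    ir_results: list of (uri, bm25_score), best first.
    graph: dict mapping (uri, property) to a list of linked URIs.
    labels: dict mapping a URI to the text used for the similarity filter.
    """
    bm25 = dict(ir_results)
    merged = dict(ir_results)
    entry_points = defaultdict(set)
    for uri, _ in ir_results[:n]:                    # take top-N docs
        for prop in properties:                      # follow properties
            for new_uri in graph.get((uri, prop), []):
                if new_uri in bm25:
                    continue                         # already an IR result
                if text_sim(labels.get(new_uri, ""), query) > tau:
                    entry_points[new_uri].add(uri)   # passes sim(e, q) > tau
    for new_uri, sources in entry_points.items():
        # Assumed reading of "sum BM25 - epsilon".
        merged[new_uri] = sum(bm25[s] for s in sources) - epsilon
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data.
IR_RESULTS = [("http://example.org/db/SIGIR", 1.5), ("http://example.org/db/IRAQ", 0.5)]
GRAPH = {
    ("http://example.org/db/SIGIR", "sameAs"): ["http://example.org/fb/sigir"],
    ("http://example.org/db/IRAQ", "sameAs"): ["http://example.org/db/Baghdad"],
}
LABELS = {
    "http://example.org/fb/sigir": "SIGIR conference",
    "http://example.org/db/Baghdad": "Baghdad",
}
RANKED = hybrid_rank("sigir", IR_RESULTS, GRAPH, LABELS, ["sameAs"])
```

Subtracting ε places a newly discovered entity just below the entry point that led to it, so the graph step can promote new URIs without displacing the IR result they came from.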

10
2.1. Object Properties vs Datatype Properties
• Object properties:
• connect different entities
• explore the whole graph
• Datatype properties:
• give additional info about entities
• explore just the neighborhood of a node
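To make the distinction concrete, here is a small sketch that separates the predicates in a set of triples by the kind of object they point to. The `URI` and `Literal` wrapper types and the example.org subjects are assumptions for illustration; owl:sameAs and foaf:name are used as familiar examples of an object property and a datatype property.

```python
from typing import NamedTuple


class URI(NamedTuple):
    value: str


class Literal(NamedTuple):
    value: str


def split_properties(triples):
    """Return (object_properties, datatype_properties) seen in the triples."""
    object_props, datatype_props = set(), set()
    for _subject, predicate, obj in triples:
        if isinstance(obj, URI):
            object_props.add(predicate)    # links to another entity
        else:
            datatype_props.add(predicate)  # attaches a literal value
    return object_props, datatype_props


SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical toy data.
TRIPLES = [
    (URI("http://example.org/a"), SAME_AS, URI("http://example.org/b")),
    (URI("http://example.org/a"), FOAF_NAME, Literal("Entity A")),
]
OBJECT_PROPS, DATATYPE_PROPS = split_properties(TRIPLES)
```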

11
2.2. Properties to Follow
• RDF graph queried with SPARQL queries.
• Scope 1 queries vs Scope 2 queries.
• Set of predicates to follow selected using:
• Common sense (e.g., sameAs)
• Statistics from the data
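The slide does not define the two scopes, so the sketch below assumes that a scope 1 query follows a selected predicate one hop from the entry point and a scope 2 query follows two hops. The helper names are hypothetical; the generated strings use the standard SPARQL 1.1 `VALUES` clause and `isIRI` function.

```python
def _values(var, predicates):
    """Build a SPARQL VALUES clause restricting `var` to the given predicates."""
    listed = " ".join(f"<{p}>" for p in predicates)
    return f"VALUES {var} {{ {listed} }}"


def scope1_query(entry_uri, predicates):
    """One hop: entities directly linked from the entry point (assumed meaning)."""
    return (
        "SELECT DISTINCT ?e WHERE { "
        f"{_values('?p', predicates)} "
        f"<{entry_uri}> ?p ?e . "
        "FILTER(isIRI(?e)) }"
    )


def scope2_query(entry_uri, predicates):
    """Two hops: entities reached through an intermediate node (assumed meaning)."""
    return (
        "SELECT DISTINCT ?e WHERE { "
        f"{_values('?p1', predicates)} {_values('?p2', predicates)} "
        f"<{entry_uri}> ?p1 ?mid . ?mid ?p2 ?e . "
        "FILTER(isIRI(?e)) }"
    )


SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
Q1 = scope1_query("http://example.org/db/SIGIR", [SAME_AS])
Q2 = scope2_query("http://example.org/db/SIGIR", [SAME_AS])
```

The entry point passed to either function would be a URI returned by the inverted index; the predicate list is where the common-sense and data-driven selection from the slide would plug in.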

12
Properties to Follow: Two Examples
Entry point given by the II.

14
3.1 Experimental Setting
• SemSearch 2010 and 2011 test sets:
• Billion Triple Challenge 2009 (BTC2009)
• 1.3 billion RDF triples crawled from the LOD cloud.
• 92 and 50 queries, respectively.
• Evaluation of systems with depth-10 pooling by means of crowdsourcing.
• Measures taken into consideration: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), early precision (P10).
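For reference, the three measures can be computed as in the sketch below. It uses textbook definitions with linear gains and a log2 discount for NDCG; the exact variant used in the SemSearch evaluation may differ, and the toy ranking and judgments are hypothetical.

```python
import math


def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for uri in ranked[:k] if uri in relevant) / k


def average_precision(ranked, relevant):
    """Mean of the precision values at the rank of each relevant result."""
    hits, total = 0, 0.0
    for rank, uri in enumerate(ranked, start=1):
        if uri in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs, qrels):
    """runs: {query: ranked URIs}; qrels: {query: set of relevant URIs}."""
    aps = [average_precision(runs[q], qrels.get(q, set())) for q in runs]
    return sum(aps) / len(aps) if aps else 0.0


def ndcg(ranked, gains, k=10):
    """gains: {uri: graded relevance}; linear gain, log2 rank discount."""
    dcg = sum(gains.get(uri, 0) / math.log2(i + 1)
              for i, uri in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0


# Hypothetical toy ranking and judgments.
RANKED = ["a", "b", "c", "d"]
AP = average_precision(RANKED, {"a", "c"})
P2 = precision_at_k(RANKED, {"a", "c"}, k=2)
NDCG_PERFECT = ndcg(["a", "c"], {"a": 2, "c": 1})
NDCG_ACTUAL = ndcg(RANKED, {"a": 2, "c": 1})
```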

15
Completing Relevance Judgments by Crowdsourcing
• We obtained relevance judgments for unjudged entities in the top-10 results of our runs by using Amazon MTurk.
• To be fair, we used the same design and settings that were used for the AOR task of SemSearch.
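Finding which results still need a judgment is a simple set difference per query. The sketch below assumes runs and existing judgments are held in plain dictionaries; the function name and data layout are illustrative, not taken from the original evaluation scripts.

```python
def unjudged_top_k(runs, qrels, k=10):
    """Return {query_id: [uri, ...]} for top-k results lacking a judgment.

    runs: {query_id: ranked list of URIs}.
    qrels: {query_id: {uri: relevance grade}} from the existing collection.
    """
    missing = {}
    for query_id, ranked in runs.items():
        judged = qrels.get(query_id, {})
        todo = [uri for uri in ranked[:k] if uri not in judged]
        if todo:
            missing[query_id] = todo
    return missing


# Hypothetical toy data.
RUNS = {"q1": ["a", "b", "c"], "q2": ["d"]}
QRELS = {"q1": {"a": 1, "c": 0}, "q2": {"d": 2}}
TO_JUDGE = unjudged_top_k(RUNS, QRELS)
```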

16
3.2. IR Techniques: Experimental Results
Our baseline.

18
3.3. Evaluation of Hybrid Approaches
N = 3, τ = 0, score = sum BM25 - ε

19
3.4. Overhead of the Graph Traversal
• Time in milliseconds needed for each part of the hybrid approaches.
• Measures taken on a single machine with cold cache.
• Surprisingly small overhead (17% for best results).

20
Conclusions
• AOR = “Given the description of an entity, give me back its identifier.”
• Simple IR techniques give disappointing results on the AOR task.
• Hybrid system for AOR:
• combines classic IR techniques with a structured database storing graph data.
• Our evaluation shows that the new approach leads to significantly better results (up to +25% MAP over the BM25 baseline).
• For the best working configuration found, the overhead caused by the graph traversal part is limited (17% more than running the chosen baseline).

21
Thank you for your attention
• You can find the new relevance judgments at http://diuf.unifr.ch/xi/HybridAOR.
• More info at www.exascale.info.
• In the following days you’ll find our paper, this presentation, and the new crowdsourced relevance judgments at www.exascale.info/AOR.
