Searching Linked Data
From Finding Relevant Sources to Computing Answers
Invited Presentation @ International Workshop on Scalable Semantic Computing,
Hangzhou, China, November 2010.

Thanh Tran, Günter Ladwig, Veli Bicer, Lei Zhang, Daniel Herzig, Yongtao
Ma, Andreas Wagner, Rudi Studer from AIFB Institute, KIT

Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

Agenda

 Searching Linked Data
 Opportunities & challenges

 Keyword Query Routing
 Problem Definition

 Summary Models

 Experiments

 Linked Data Query Processing
 Combining Top-down & Bottom-up

 Stream-based Query Processing

 Corrective Source Ranking

 Conclusions


Linked Data

- 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
- As of 09-2010, plus other linked data not covered by the LOD cloud

Opportunities
“Articles from awarded researchers at Stanford”

 Freebase contains data about people
 DBPedia contains information about awards
 DBLP contains bibliographic data

 More complex information needs
 More precise results
 More integrated results

Problems
“Articles from awarded researchers at Stanford”

 Large number of unknown,
unexplored & irrelevant sources!
 What is in there?
 What is out there?
 What is relevant?

USABILITY: Formulating queries is a hard task!
• Which data sources?
• Which schema elements?

SCALABILITY: Processing queries is expensive!
• Process against all data sources?
• Explore all links to other sources?

(z). ∃x, y. prizes(x, Turing Award) ∧ worksAt(x, y) ∧ name(y, Stanford) ∧ publication(x, z)
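
To make the meaning of this query concrete, here is a minimal Python sketch that evaluates it over a handful of facts. The facts, the identifiers (per1, uni1, pub1, ...) and the function name `answers` are hypothetical and chosen only for illustration; a real system would evaluate the query against the linked data sources rather than an in-memory set.

```python
# Hypothetical facts as (predicate, subject, object) tuples.
FACTS = {
    ("prizes", "per1", "Turing Award"),
    ("worksAt", "per1", "uni1"),
    ("name", "uni1", "Stanford"),
    ("publication", "per1", "pub1"),
    ("publication", "per1", "pub2"),
    ("worksAt", "per4", "uni1"),      # per4 works at Stanford but has no prize
    ("publication", "per4", "pub3"),
}


def answers(facts):
    """Return every z for which some x, y satisfy all four query atoms."""
    result = set()
    for pred, x, award in facts:
        # prizes(x, Turing Award)
        if pred != "prizes" or award != "Turing Award":
            continue
        # worksAt(x, y)
        for y in [o for p, s, o in facts if p == "worksAt" and s == x]:
            # name(y, Stanford)
            if ("name", y, "Stanford") in facts:
                # publication(x, z): collect the bindings of z
                result |= {o for p, s, o in facts
                           if p == "publication" and s == x}
    return result


print(sorted(answers(FACTS)))
```

Only the publications of per1 qualify, because per4 satisfies the worksAt and name atoms but not the prizes atom.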


Searching Linked Data

 Given the needs (expressed as sets of keywords),
 are there answers in linked data?
 which combinations of data sources produce them?
 how to incorporate related unexplored linked sources?

Keyword Query Routing to Relevant Linked Data Sources
 Identify valid combination of sources
 Identify schema elements

Focused, Adaptive and Stream-based Linked Data Query Processing
 Let user choose combination of sources
 Focus on this combination of sources and explore related linked sources (c.f. LARKC)




LOD Data Graph
 Web data modeled as a set of interlinked data graphs
 Each data graph represents a source
 Data graph vs. schema graph vs. source graph
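
As a rough illustration of how a source graph relates to the data graph, the sketch below derives source-level links from data-level edges. It assumes, hypothetically, that every node identifier carries its source as a prefix and that two sources are linked whenever a data edge crosses from one to the other; the triples and the names `source_of` and `source_graph` are invented for this example and are not taken from the talk.

```python
# Hypothetical triples; the prefix before ":" names the source owning a node.
TRIPLES = [
    ("freebase:per1", "employ", "freebase:uni1"),
    ("dblp:pub1", "author", "freebase:per1"),
    ("dblp:per2", "sameAs", "freebase:per1"),
    ("dblp:per2", "sameAs", "dbpedia:per3"),
    ("dbpedia:per3", "prizes", "dbpedia:prize1"),
]


def source_of(node):
    """Return the source prefix of a node identifier (assumed convention)."""
    return node.split(":", 1)[0]


def source_graph(triples):
    """Keep one (source, predicate, source) link per cross-source data edge."""
    links = set()
    for subj, pred, obj in triples:
        src, dst = source_of(subj), source_of(obj)
        if src != dst:  # edges inside a single source do not link sources
            links.add((src, pred, dst))
    return links


for link in sorted(source_graph(TRIPLES)):
    print(link)
```

Edges that stay within one source (employ, prizes) disappear at this level; only the cross-source author and sameAs edges remain.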

[Figure: example data graphs for the sources Freebase, DBLP and DBPedia. Nodes uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1 and prize2 are connected by employ, author, prizes and sameAs edges and carry name, title and label values such as "Stanford University", "John McCarthy" and "Turing Award".]


LOD Schema Graph


[Figure: schema graph with the classes University, Person, Article, Author, Written Work, Person and Prize.]


LOD Source Graph


[Figure: source graph in which the sources are connected by author and sameAs links.]


Keyword Query Answers
User information need „stanford article award“

[Figure: the example data graph annotated for this query; visible labels include Article, type, title, name, label, author and prizes.]




Problem Definition

 A keyword query result (also called a Steiner graph) is a
subgraph of the data graph that contains, for every keyword, a
matching data element (called a keyword element), such that
these elements are pairwise connected by a path.

 A d-max Steiner graph is a Steiner graph in which the paths
between keyword elements have length d-max or less.

 Keyword query routing: compute valid sets of data sources,
called keyword routing plans. A plan is valid if the union of its
sources produces a non-empty set of keyword query results.
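
The definitions above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the routing algorithm of the talk: the per-source triples are hypothetical, keyword matching is plain case-insensitive substring matching, edges are treated as undirected, and "d-max" is read as an upper bound on the shortest-path length between every pair of keyword elements. The names `SOURCES`, `build_graph`, `distances_from` and `is_valid_plan` are invented for this example.

```python
from collections import deque
from itertools import product

# Hypothetical toy data: each source maps to (subject, predicate, object) triples.
SOURCES = {
    "Freebase": [
        ("fb:uni1", "name", "Stanford University"),
        ("fb:per1", "employ", "fb:uni1"),
        ("fb:per1", "sameAs", "dblp:per2"),
    ],
    "DBLP": [
        ("dblp:pub1", "type", "Article"),
        ("dblp:pub1", "author", "dblp:per2"),
        ("dblp:per2", "sameAs", "dbp:per3"),
    ],
    "DBPedia": [
        ("dbp:per3", "prizes", "dbp:prize1"),
        ("dbp:prize1", "label", "Turing Award"),
    ],
}


def build_graph(plan):
    """Undirected adjacency map over the union of the plan's sources."""
    adj = {}
    for source in plan:
        for subj, _, obj in SOURCES[source]:
            adj.setdefault(subj, set()).add(obj)
            adj.setdefault(obj, set()).add(subj)
    return adj


def distances_from(adj, start):
    """Breadth-first search: shortest path length from start to each node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def is_valid_plan(plan, keywords, d_max):
    """True if the union of sources contains a d-max Steiner graph."""
    adj = build_graph(plan)
    # Keyword elements: nodes whose text contains the keyword (assumption).
    matches = [[n for n in adj if kw.lower() in n.lower()] for kw in keywords]
    if any(not m for m in matches):
        return False  # some keyword has no matching element at all
    dist = {n: distances_from(adj, n) for m in matches for n in m}
    # Try every choice of one keyword element per keyword.
    for combo in product(*matches):
        if all(dist[a].get(b, d_max + 1) <= d_max
               for a in combo for b in combo):
            return True
    return False


query = ["stanford", "article", "award"]
print(is_valid_plan(["Freebase", "DBLP", "DBPedia"], query, d_max=6))
print(is_valid_plan(["Freebase", "DBLP"], query, d_max=6))
```

Enumerating every combination of keyword elements is exponential in the number of keywords, so this brute-force check is only meant to make the definition of a valid plan tangible on toy data.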


A Valid Keyword Routing Plan
User information need „stanford article award“

[Figure: the example data graph annotated with a routing plan for this query; visible labels include Article, type, title, name, label, author and prizes.]