Distributed Keyword Search over RDF via MapReduce
Roberto De Virgilio and Antonio Maccioni
Semantic Web is distributed
The (Semantic) Web is distributed
We have distributed and cloud-based infrastructures for data processing
Linked (Open) Data is becoming popular also among non-expert users
Distributed Keyword Search
MapReduce: a simple programming model 
Map+Reduce 
Map:
Accepts an input key/value pair
Emits intermediate key/value pairs
Reduce:
Accepts an intermediate key/value pair
Emits output key/value pairs
[Figure: "Very big data" flows through the Map and Reduce stages to produce a Result]
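The Map/Reduce contract above can be sketched in plain Python. The classic word-count job below is only an illustration of the programming model; the function names and the in-memory shuffle are stand-ins for what a real MapReduce engine provides, not part of the authors' system:

```python
from collections import defaultdict

def map_fn(_, line):
    # Accepts an input key/value pair, emits intermediate key/value pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Accepts an intermediate key with its values, emits an output pair.
    yield word, sum(counts)

def run_job(records):
    # In-memory stand-in for the shuffle phase of a real MR engine.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

print(run_job([(0, "very big data"), (1, "big data")]))
# → {'very': 1, 'big': 2, 'data': 2}
```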
The problem 
[Figure: an RDF graph with publications pub1 (year 2008) and pub2, conference conf1 (name SIGMOD), and researchers aut1, aut2, aut3 (names Bernstein, Koltsidis, Buneman), connected by type, name, author, year, editedBy and acceptedBy edges]
The problem 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph, with the vertices matching the query keywords highlighted]
The problem 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph; the answer sub-graph connecting Bernstein, 2008 and SIGMOD is highlighted]
Existing Approach 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph and query as before]
Existing Approach
[Figure: candidate solutions S1, S2, S3, S4, …, Sn plotted against a relevance axis]
Existing Approach
drawbacks:
low specificity
high overlapping
high computational cost
centralized computation
[Figure: the solutions are sorted by relevance (S2, S1, S4, S3, …, Sn) and only then the Top-k are selected]
Desired direction
[Figure: solutions S1, S2, …, Sk produced along the relevance axis]
Desired direction
strong points:
linear computational cost
monotonic ranked result
low overlapping
distributed computation
[Figure: the Top-k solutions S1, S2, …, Sk are emitted already ordered by relevance]
From Graph parallel to Data parallel
Graph Data Indexing
Breadth First Search over the RDF graph, then Distributed Storage of the resulting paths:
Path Store 1, Path Store 2, …, Path Store j
[Figure: the example RDF graph is traversed via BFS and its paths are partitioned across the path stores]
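The indexing step can be sketched as a BFS that unrolls the graph into root-to-leaf paths. The adjacency-list encoding below and the tiny excerpt of the example graph are illustrative assumptions, not the authors' storage format:

```python
from collections import deque

# Tiny excerpt of the example graph: node -> [(edge_label, target)].
graph = {
    "pub1": [("author", "aut1"), ("year", "2008"), ("acceptedBy", "conf1")],
    "aut1": [("name", "Bernstein")],
    "conf1": [("name", "SIGMOD")],
}

def bfs_paths(graph, root):
    # BFS from a root; emit root-to-leaf paths as node/edge-label sequences.
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        edges = graph.get(path[-1], [])
        if not edges:
            yield path
            continue
        for label, target in edges:
            queue.append(path + [label, target])

for p in bfs_paths(graph, "pub1"):
    print("-".join(p))
# e.g. pub1-author-aut1-name-Bernstein
```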
Paths and Templates
Query = {Bernstein, 2008, SIGMOD}
p1: [pub1-year-2008]
p2: [pub1-author-aut1-name-Bernstein]
p3: [pub1-acceptedBy-conf1-name-SIGMOD]
p4: [pub2-year-2008]
p5: [pub2-editedBy-conf1-name-SIGMOD]
Path: [pub1-acceptedBy-conf1-name-SIGMOD]
Template: [#-acceptedBy-#-name-#]
[Figure: the matching paths are retrieved from the distributed path stores (Path Store 1, …, Path Store j)]
Paths Clustering
Paths sharing the same template are grouped into the same cluster; the number in brackets is the position (rank) of the path within its cluster:
cl1 (template [#-year-#]): p1 [1], p4 [2]
cl2 (template [#-author-#-name-#]): p2 [1]
cl3 (template [#-acceptedBy-#-name-#]): p3 [1]
cl4 (template [#-editedBy-#-name-#]): p5 [1]
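Clustering by template can be sketched as: mask every node of a path with the wildcard `#`, keep the edge labels, and group paths under the masked key. The list encoding of paths below is an illustrative assumption:

```python
from collections import defaultdict

# Paths as alternating node/edge-label sequences, as in the running example.
paths = {
    "p1": ["pub1", "year", "2008"],
    "p2": ["pub1", "author", "aut1", "name", "Bernstein"],
    "p3": ["pub1", "acceptedBy", "conf1", "name", "SIGMOD"],
    "p4": ["pub2", "year", "2008"],
    "p5": ["pub2", "editedBy", "conf1", "name", "SIGMOD"],
}

def template_of(path):
    # Nodes sit at even positions; replace them with the wildcard '#'.
    return "-".join("#" if i % 2 == 0 else x for i, x in enumerate(path))

clusters = defaultdict(list)
for pid, path in paths.items():
    clusters[template_of(path)].append(pid)

print(clusters["#-year-#"])             # → ['p1', 'p4']
print(clusters["#-editedBy-#-name-#"])  # → ['p5']
```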
Scoring
The score of a sub-graph with respect to a query Q depends on two key factors: its topology and the relevance of the information carried by its vertices and edges. The topology is evaluated in terms of the length of its paths. The relevance of the information carried by its vertices is evaluated through an implementation of TF/IDF.
Unlike all current approaches, we are independent from the scoring function: we do not impose a monotonic, aggregative, nor an "ad-hoc for the case" scoring function.
Given a query Q = {q1, . . . , qn}, a graph G and a sub-graph sg in G, the score of sg with respect to Q is:
score(sg, Q) = (α(sg) / ωstr(sg)) · Σ_{q∈Q} (ρ(q) · ωct(sg, q))
where:
• α(sg) is the relative relevance of sg within G;
• ρ(q) is the weight associated to each keyword q with respect to the query Q;
• ωct(sg, q) is the content weight of q considering sg;
• ωstr(sg) is the structural weight of sg.
Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
Scoring
topology: how strictly the keywords are connected, evaluated via the length of the paths
Scoring
relevance: the information carried by the nodes, evaluated via an implementation of TF/IDF
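The scoring formula can be sketched directly. The concrete weight functions below are placeholder assumptions (the paper uses a TF/IDF implementation for content weights and path-length-based structural weights); only the shape of the formula is taken from the slides:

```python
def score(sg, query, alpha, w_str, rho, w_ct):
    # score(sg, Q) = alpha(sg) / w_str(sg) * sum_q rho(q) * w_ct(sg, q)
    return alpha(sg) / w_str(sg) * sum(rho(q) * w_ct(sg, q) for q in query)

# Placeholder weights: every keyword equally important, structural weight
# proportional to the total length of the paths in sg, content weight 1
# if the keyword appears in sg and 0 otherwise.
sg = {"paths": ["pub1-year-2008", "pub1-author-aut1-name-Bernstein"]}
query = ["Bernstein", "2008", "SIGMOD"]

s = score(
    sg, query,
    alpha=lambda g: 1.0,
    w_str=lambda g: sum(len(p.split("-")) for p in g["paths"]),
    rho=lambda q: 1.0,
    w_ct=lambda g, q: 1.0 if any(q in p for p in g["paths"]) else 0.0,
)
print(s)  # → 0.25  (two of three keywords matched, total path length 8)
```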
Building solutions
Two different strategies:
linear: runs in linear time and one round of MapReduce
monotonic: produces a monotonically ranked result in quadratic time and 2k rounds of MapReduce
Linear Strategy (Map)
Map: iterates over the clusters;
key: position of the path in the cluster
value: the path itself
Emitted pairs: <1, p1>, <2, p4>, <1, p2>, <1, p3>, <1, p5>
[Figure: the clusters cl1–cl4 with their ranked paths, and the key/value pairs emitted for each path]
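The map phase above is a one-liner over the clusters: emit each path keyed by its position. A sketch, reusing the cluster layout of the running example (the dictionary encoding is an assumption):

```python
# Clusters as ordered lists of path ids, as in the running example.
clusters = {
    "cl1": ["p1", "p4"],   # template [#-year-#]
    "cl2": ["p2"],         # template [#-author-#-name-#]
    "cl3": ["p3"],         # template [#-acceptedBy-#-name-#]
    "cl4": ["p5"],         # template [#-editedBy-#-name-#]
}

def linear_map(clusters):
    # key: 1-based position of the path in its cluster; value: the path.
    for paths in clusters.values():
        for position, path in enumerate(paths, start=1):
            yield position, path

print(sorted(linear_map(clusters)))
# → [(1, 'p1'), (1, 'p2'), (1, 'p3'), (1, 'p5'), (2, 'p4')]
```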
Linear Strategy (Reduce) 
Reduce: each machine receives the list of paths sharing a key and computes their connected components;
key: position of the path in the cluster
value: the path itself
<1, p1>, <2, p4>, <1, p2> 
<1, p3>, <1, p5> 
<1, {p1, p2, p3, p5}> 
<2, {p4}> 
Each connected component is a final solution to output 
S1: {p1, p2, p3, p5} 
S2: {p4}
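The reduce phase groups the paths that share a key into connected components, where two paths are connected when they touch a common node. A sketch, with the node sets of each path hard-coded from the running example (the reducer for key 2 would see only p4 and output it alone):

```python
# Nodes touched by each path with key 1, taken from the running example.
path_nodes = {
    "p1": {"pub1", "2008"},
    "p2": {"pub1", "aut1", "Bernstein"},
    "p3": {"pub1", "conf1", "SIGMOD"},
    "p5": {"pub2", "conf1", "SIGMOD"},
}

def connected_components(path_nodes):
    # Paths sharing at least one node end up in the same component.
    components = []
    for pid, nodes in path_nodes.items():
        merged_ids, merged_nodes = {pid}, set(nodes)
        kept = []
        for comp_ids, comp_nodes in components:
            if comp_nodes & merged_nodes:
                merged_ids |= comp_ids
                merged_nodes |= comp_nodes
            else:
                kept.append((comp_ids, comp_nodes))
        components = kept + [(merged_ids, merged_nodes)]
    return [sorted(ids) for ids, _ in components]

print(connected_components(path_nodes))
# → [['p1', 'p2', 'p3', 'p5']]
```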
Linear Strategy: solutions 
Each connected component is a final solution to output 
S1: {p1, p2, p3, p5} S2: {p4} 
[Figure: S1 is the sub-graph connecting Bernstein, 2008 and SIGMOD through pub1, aut1, conf1 and pub2; S2 is the single path pub2-year-2008]
Monotonic Strategy (Map round 1)
Map: iterates over the clusters, taking only the best path of each cluster;
key: position of the path in the cluster
value: the path itself
Emitted pairs: <1, p1>, <1, p2>, <1, p3>, <1, p5>
[Figure: only the top-ranked path of each cluster cl1–cl4 is emitted; p4, ranked second in its cluster, stays behind]
Monotonic Strategy (Reduce round 1) 
Reduce: each machine receives the list of paths sharing a key and computes global connected components;
key: position of the path in the cluster
value: the path itself
<1, p1>, <1, p2> 
<1, p3>, <1, p5> 
<1, {p1, p2, p3, p5}> 
Then it invokes a new round of MapReduce
Monotonic Strategy (Map round 2) 
Map: Dispatches the connected components to different 
machines; 
key: occurrence of connected component 
value: connected component itself 
<1, {p1, p2, p3, p5}>
Monotonic Strategy (Reduce round 2) 
Reduce: each machine receives a connected component and iterates over its paths;
τ-test: a variant of the TA (Threshold Algorithm)
it determines whether any path in the connected component must be left out of the optimum solution
<1, {p1, p2, p3, p5}>
S = {p1, p2, p3}
discarded: {p5}
Discarded paths and discarded connected components are reinserted into the clusters
It is possible to prove that the pivoted normalization weighting method (SIM) [11], which inspired most of the IR-based scoring functions, satisfies Properties 1 and 2. For the sake of simplicity, we discuss the properties by referring to the data structures used in this paper.

Property 1. Given a query Q and a path p, score(p, Q) = score({p}, Q).

This property states that the score of a path p is equal to the score of the solution S containing only that same path (i.e. {p}). It means that every path must be evaluated as the solution containing exactly that path. Consequently, if score(p1, Q) > score(p2, Q) then score({p1}, Q) > score({p2}, Q). Analogously, extending Property 1, we provide the following.

Property 2. Given a query Q, a set of paths P in which p is the most relevant path (i.e. ∀pj ∈ P we have score(p, Q) ≥ score(pj, Q)) and its power set P*, we have score(S = Pi, Q) ≤ score(S = {p}, Q) ∀Pi ∈ P*.

In other words, given the set P containing the candidate paths to be included in the solution, the scores of all possible solutions generated from P (i.e. P*) are bounded by the score of the most relevant path p of P. This property is coherent with and generalizes the Threshold Algorithm (TA) [6]. Contrary to TA, we do not use an aggregative function, nor do we assume the aggregation to be monotone. TA introduces a mechanism to optimize the number of steps n needed to compute the best k objects (where it could be n ≥ k), while our framework produces k optimal solutions in k steps. To verify the monotonicity we apply a so-called τ-test to determine which paths of a connected component cc should be inserted into an optimum solution optS ⊂ cc. The τ-test is supported by Theorem 1. Firstly, we have to take into consideration the paths that can be used to form more solutions in the next iterations of the process: in our framework they are still within the set of clusters CL. Then, let us consider the path ps with the highest score in CL and the path py with the highest score in cc ∖ optS. We define the threshold τ as τ = max{score(ps, Q), score(py, Q)}. The threshold τ can be considered the upper-bound score for the potential solutions to generate in the next iterations of the algorithm. Now, we provide the following:

Theorem 1. Given a query Q, a scoring function satisfying Property 1 and Property 2, a connected component cc, a subset optS ⊂ cc representing an optimum solution, and a candidate path px ∈ cc ∖ optS, S = optS ∪ {px} is still optimum iff score(S, Q) ≥ τ.

Necessary condition. Let us assume that S = optS ∪ {px} is an optimum solution. We must verify whether the score of this solution is still greater than τ. Recalling the definition of τ, we can have two cases:
– τ = score(ps, Q) ≥ score(py, Q). In this case score(ps, Q) represents the upper bound for the scoring of the possible solutions to generate in the next steps. […]

Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
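Once the scores of ps, py and the candidate solution are known, the τ-test of Theorem 1 reduces to a single comparison. A minimal sketch; the numeric scores are made up for illustration, and the scoring function itself is assumed to be computed elsewhere:

```python
def tau_test(score_S, score_ps, score_py):
    # tau is the upper bound for solutions still constructible later:
    # ps = best-scored path left in the clusters CL,
    # py = best-scored path left in cc \ optS.
    tau = max(score_ps, score_py)
    # S = optS ∪ {px} remains optimum iff score(S, Q) >= tau (Theorem 1).
    return score_S >= tau

# Made-up scores: in the first case the extended solution still beats
# anything that could be built later, so px is accepted; in the second
# it does not, so px is discarded and reinserted into the clusters.
print(tau_test(score_S=0.8, score_ps=0.6, score_py=0.5))  # → True
print(tau_test(score_S=0.4, score_ps=0.6, score_py=0.5))  # → False
```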
Monotonic Strategy: solutions 
Each connected component is a final solution to output 
S1: {p1, p2, p3} S2: {p4, p5} 
[Figure: S1 is the sub-graph over pub1, aut1 and conf1 connecting Bernstein, 2008 and SIGMOD; S2 is the sub-graph over pub2 and conf1 connecting 2008 and SIGMOD via editedBy]
Experiments 
We deployed YaaniiMR on EC2 clusters. 
cc1.4xlarge: 10, 50 and 100 nodes. 
YaaniiMR runs on the Hadoop Distributed File System (HDFS) version 1.1.1 and the HBase data store version 0.94.3.
The performance of our system has been measured with respect to data loading, memory footprint, and query execution.
Data Loading
[Chart: upload time (sec) on 10, 50 and 100 nodes; DBPedia (300M triples): 45.49, 50.49 and 56.43 sec; Billion (2008 edition: 1200M triples): 314.26, 348.84 and 389.88 sec]
Memory Consumption
[Chart: size per node (MB, log scale) on 10 nodes for Mondial (17k triples), DbPedia (300M triples) and Billion (2008 edition: 1200M triples); overhead: 1.1, 1.5 and 1.8; space consumption: 1.4, 349.3 and 1781.1 MB]
End-to-end job runtimes 
[Coffman et al. Benchmark] 
L = Linear strategy M = Monotonic strategy 
[Chart: job runtime (sec, 0–6), split into query execution and overhead, for L and M on Mondial (17k triples), Dbpedia (300M triples) and Billion (1200M triples), over 10, 50 and 100 nodes]
Questions?

More Related Content

PDF
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
PDF
Graph-to-Text Generation and its Applications to Dialogue
PDF
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
PDF
The DE-9IM Matrix in Details using ST_Relate: In Picture and SQL
PDF
Spatial Indexing
PDF
Decentralized Evolution and Consolidation of RDF Graphs
PPT
Planning Evacuation Routes with the P-graph Framework
PDF
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Graph-to-Text Generation and its Applications to Dialogue
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
The DE-9IM Matrix in Details using ST_Relate: In Picture and SQL
Spatial Indexing
Decentralized Evolution and Consolidation of RDF Graphs
Planning Evacuation Routes with the P-graph Framework
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools

Viewers also liked (7)

PDF
Modelgraphdb
PDF
SPCData: la nuvola dei dati della Pubblica Amministrazione Italiana
PDF
Finding All Maximal Cliques in Very Large Social Networks
PDF
Converting Relational to Graph Databases
PPTX
File Format Benchmarks - Avro, JSON, ORC, & Parquet
PPTX
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
PDF
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Modelgraphdb
SPCData: la nuvola dei dati della Pubblica Amministrazione Italiana
Finding All Maximal Cliques in Very Large Social Networks
Converting Relational to Graph Databases
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Ad

Similar to Distributed Keyword Search over RDF via MapReduce (20)

PPTX
Lgm saarbrucken
PDF
Estimation of Signal Models for Aerospace and Industrial Applications slides
PDF
Extended Property Graphs and Cypher on Gradoop
PDF
PageRank on an evolving graph - Yanzhao Yang : NOTES
PDF
pi-Lisco: Parallel and Incremental Stream-Based Point-Cloud Clustering
PPTX
Locally densest subgraph discovery
PDF
A Subgraph Pattern Search over Graph Databases
PDF
Compiling openCypher graph queries with Spark Catalyst
PDF
Propel your Performance: AgensGraph, the multi-model database
PDF
쉽게 설명하는 GAN (What is this? Gum? It's GAN.)
PDF
Incremental Graph Queries for Cypher
PDF
ingraph: Live Queries on Graphs
PPTX
[NS][Lab_Seminar_250421]SignGraph: A Sign Sequence is Worth Graphs of Nodes.pptx
PDF
Scalable and Adaptive Graph Querying with MapReduce
PPTX
SPgen: A Benchmark Generator for Spatial Link Discovery Tools
PDF
Map-Side Merge Joins for Scalable SPARQL BGP Processing
PDF
The 2nd graph database in sv meetup
PDF
Challenge@RuleML2015 Modeling Object-Relational Geolocation Knowledge in PSOA...
PPT
4900514.ppt
PPTX
Compact Representation of Large RDF Data Sets for Publishing and Exchange
Lgm saarbrucken
Estimation of Signal Models for Aerospace and Industrial Applications slides
Extended Property Graphs and Cypher on Gradoop
PageRank on an evolving graph - Yanzhao Yang : NOTES
pi-Lisco: Parallel and Incremental Stream-Based Point-Cloud Clustering
Locally densest subgraph discovery
A Subgraph Pattern Search over Graph Databases
Compiling openCypher graph queries with Spark Catalyst
Propel your Performance: AgensGraph, the multi-model database
쉽게 설명하는 GAN (What is this? Gum? It's GAN.)
Incremental Graph Queries for Cypher
ingraph: Live Queries on Graphs
[NS][Lab_Seminar_250421]SignGraph: A Sign Sequence is Worth Graphs of Nodes.pptx
Scalable and Adaptive Graph Querying with MapReduce
SPgen: A Benchmark Generator for Spatial Link Discovery Tools
Map-Side Merge Joins for Scalable SPARQL BGP Processing
The 2nd graph database in sv meetup
Challenge@RuleML2015 Modeling Object-Relational Geolocation Knowledge in PSOA...
4900514.ppt
Compact Representation of Large RDF Data Sets for Publishing and Exchange
Ad

Recently uploaded (20)

PPTX
A Presentation on Artificial Intelligence
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Machine Learning_overview_presentation.pptx
PDF
Approach and Philosophy of On baking technology
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Getting Started with Data Integration: FME Form 101
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
Network Security Unit 5.pdf for BCA BBA.
A Presentation on Artificial Intelligence
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Unlocking AI with Model Context Protocol (MCP)
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Per capita expenditure prediction using model stacking based on satellite ima...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Machine Learning_overview_presentation.pptx
Approach and Philosophy of On baking technology
Advanced methodologies resolving dimensionality complications for autism neur...
SOPHOS-XG Firewall Administrator PPT.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Getting Started with Data Integration: FME Form 101
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Programs and apps: productivity, graphics, security and other tools
20250228 LYD VKU AI Blended-Learning.pptx
A comparative analysis of optical character recognition models for extracting...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
Machine learning based COVID-19 study performance prediction
Network Security Unit 5.pdf for BCA BBA.

Distributed Keyword Search over RDF via MapReduce

  • 1. Distributed Keyword Search over RDF via MapReduce Roberto De Virgilio and Antonio Maccioni
  • 2. Semantic Web is distributed The (Semantic) Web is distributed We have distributed and cloud-based Infrastructures for data processing The Linked (Open) Data are getting popular also to non-expert users
  • 4. MapReduce: a simple programming model Map+Reduce Map: Accepts input key/value pair Emits intermediate key/ value pair Reduce: Accepts intermediate key/value pair Emits output key/value pair Very big data Result M A P R E D U C E
  • 5. The problem Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 6. The problem Query: Bernstein 2008 SIGMOD Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 7. The problem Bernstein type type name name author year year author type author type SIGMOD Koltsides Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type Koltsidis Query: Bernstein 2008 SIGMOD
  • 8. Existing Approach Bernstein type type name name author year year author type author type SIGMOD Koltsides Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type Koltsidis Query: Bernstein 2008 SIGMOD
  • 10. Existing Approach S1 S2 S3 S4 - - - Sn relevance
  • 11. drawbacks: low specificity high overlapping high computational cost centralized computation Existing Approach S2 S1 S4 S3 - - - Sn Top-k relevance
  • 13. Desired direction S1 S2 - - - Sk relevance
  • 14. Desired direction strong points: linear computational cost monotonic ranked result low overlapping distributed computation S1 S2 - - - Sk Top-k relevance
  • 15. From Graph parallel to Data parallel
  • 16. Graph Data Indexing Breadth First Search Distributed Storage Path! Store 1! Path! Store 2! Path! Store j! Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 17. Paths and Templates year Query = {Bernstein, 2008, SIGMOD} pub1 2008 Path! Store 1! Path! Store 2! Path! Store j! author name pub1 aut1 Bernstein acceptedBy name pub1 conf1 SIGMOD year pub2 2008 editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 [pub1-acceptedBy-conf1-name-SIGMOD] [# -acceptedBy- # -name- #] Path Template
  • 18. Paths Clustering year author name acceptedBy name year pub1 2008 year author name pub1 aut1 Bernstein year acceptedBy name editedBy name year pub1 conf1 SIGMOD author name year author name year pub2 2008 acceptedBy name author name acceptedBy name editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD acceptedBy name year year pub2 2008 year editedBy name editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 pub2 conf1 SIGMOD p1 p2 p3 p4 p5 cl1 [1] [2] cl2 [1] cl3 [1] cl4 [1]
  • 19. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 20. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with topology: how keywords are strictly connected length of the paths respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 21. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with topology: how keywords are strictly connected length of the paths relevance: information carried from nodes implementation of TF/IDF respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 22. Building solutions different strategies: linear time complexity in one round of MR monotonically in a quadratic time and 2k rounds of MR
  • 23. Linear Strategy (Map) — Map: iterates over the clusters; key: position of the path in the cluster; value: the path itself. On the running example the paths are p1: pub1→2008, p2: pub1→aut1→Bernstein, p3: pub1→conf1→SIGMOD, p4: pub2→2008, p5: pub2→conf1→SIGMOD, and the mappers emit <1, p1>, <2, p4>, <1, p2> and <1, p3>, <1, p5>. [Figure: clusters cl1-cl4 and their paths]
  • 24. Linear Strategy (Reduce) — Reduce: each machine receives a list of paths, out of which it computes connected components; key: position of the path in the cluster; value: the path itself. The input <1, p1>, <2, p4>, <1, p2> and <1, p3>, <1, p5> yields <1, {p1, p2, p3, p5}> and <2, {p4}>. Each connected component is a final solution to output: S1 = {p1, p2, p3, p5}, S2 = {p4}.
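The map/reduce pair of slides 23-24 can be simulated in a few lines of Python. This is an in-memory sketch, not actual Hadoop code; the vertex sets for p1-p5 are taken from the running example.

```python
from collections import defaultdict

# Vertices touched by each path in the running example.
PATHS = {
    "p1": {"pub1", "2008"},
    "p2": {"pub1", "aut1", "Bernstein"},
    "p3": {"pub1", "conf1", "SIGMOD"},
    "p4": {"pub2", "2008"},
    "p5": {"pub2", "conf1", "SIGMOD"},
}

def map_phase(clusters):
    """Map: emit <position of the path in its cluster, path id>."""
    for cluster in clusters:
        for position, path in enumerate(cluster, start=1):
            yield position, path

def reduce_phase(pairs):
    """Reduce: group by position, then merge paths sharing a vertex
    into connected components -- each component is a final solution."""
    groups = defaultdict(list)
    for position, path in pairs:
        groups[position].append(path)
    for position, paths in sorted(groups.items()):
        components = []  # list of sets of path ids
        for p in paths:
            touching = [c for c in components
                        if PATHS[p] & set.union(*(PATHS[q] for q in c))]
            merged = {p}
            for c in touching:
                merged |= c
                components.remove(c)
            components.append(merged)
        yield position, components

clusters = [["p1", "p4"], ["p2"], ["p3"], ["p5"]]  # cl1..cl4 of slide 23
solutions = dict(reduce_phase(map_phase(clusters)))
# solutions[1] == [{"p1", "p2", "p3", "p5"}], solutions[2] == [{"p4"}]
```

The merge step plays the role of the per-reducer connected-component computation: paths that share at least one vertex end up in the same solution.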
  • 25. Linear Strategy: solutions — Each connected component is a final solution to output: S1 = {p1, p2, p3, p5}, S2 = {p4}. [Figure: the two solution graphs]
  • 26. Monotonic Strategy (Map round 1) — Map: iterates over the clusters, taking only the best ones; key: position of the path in the cluster; value: the path itself. The mappers emit <1, p1>, <1, p2> and <1, p3>, <1, p5> (p4 is not emitted in this round). [Figure: clusters cl1-cl4 and their paths]
  • 27. Monotonic Strategy (Reduce round 1) — Reduce: each machine receives a list of paths, out of which it computes global connected components; key: position of the path in the cluster; value: the path itself. The input <1, p1>, <1, p2> and <1, p3>, <1, p5> yields <1, {p1, p2, p3, p5}>. Then a new round of MapReduce is invoked.
  • 28. Monotonic Strategy (Map round 2) — Map: dispatches the connected components to different machines; key: occurrence of the connected component; value: the connected component itself. Here the single component <1, {p1, p2, p3, p5}> is dispatched. [Figure: the paths p1-p5 of the component]
  • 29. Monotonic Strategy (Reduce round 2) — Reduce: each machine receives a connected component and iterates over its paths; the tau-test, a variant of the TA algorithm, determines whether some path in the connected component is in excess and must be left out. On <1, {p1, p2, p3, p5}> it yields S = {p1, p2, p3}, discarding {p5}. Discarded paths and discarded connected components are reinserted in the clusters.
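The two rounds above can be mimicked end-to-end on the running example. The sketch below is a simplified, in-memory rendering with made-up path scores; the real system distributes this over 2k MapReduce rounds and uses the actual scoring function, so `SCORES` and the simplified τ-test are illustrative assumptions only.

```python
# Toy driver for the monotonic strategy: round 1 builds a connected
# component from the best clusters, round 2 runs a simplified tau-test
# and the discarded paths go back into the clusters for the next rounds.
SCORES = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.6, "p5": 0.5}  # made up

def round1(clusters):
    """Take the best path of each cluster and form one connected component."""
    return {max(c, key=SCORES.get) for c in clusters if c}

def round2(component, clusters):
    """Tau-test (simplified): extend the solution with a path only while
    its score stays at or above the best score still available elsewhere."""
    solution, discarded = set(), set(component)
    for p in sorted(component, key=SCORES.get, reverse=True):
        outside = [SCORES[q] for c in clusters for q in c] + \
                  [SCORES[q] for q in discarded - {p}]
        tau = max(outside, default=0.0)       # threshold of the tau-test
        if SCORES[p] >= tau:
            solution.add(p)
            discarded.remove(p)
    return solution, discarded

clusters = [["p1", "p4"], ["p2"], ["p3"], ["p5"]]
component = round1(clusters)                  # {"p1", "p2", "p3", "p5"}
remaining = [[p for p in c if p not in component] for c in clusters]
solution, discarded = round2(component, remaining)
# solution == {"p1", "p2", "p3"}, discarded == {"p5"}, matching slide 29
```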
  • 30. Tau-test — Property 1: given a query Q and a path p, score(p,Q) = score({p},Q). This property states that the score of a path p is equal to the score of the solution S containing only that same path (i.e. {p}); every path must be evaluated as the solution containing exactly that path. Consequently, if score(p1,Q) > score(p2,Q) then score({p1},Q) > score({p2},Q). Property 2: given a query Q, a set of paths P in which p is the most relevant path (i.e. ∀pj ∈ P, score(p,Q) ≥ score(pj,Q)) and P* is its power set, we have score(S = Pi, Q) ≤ score(S = {p}, Q) ∀Pi ∈ P*. In other words, given the set P containing the candidate paths to be included in the solution, the scores of all possible solutions generated from P (i.e. P*) are bounded by the score of the most relevant path p of P. This property is coherent with and generalizes the Threshold Algorithm (TA) [6]. It is possible to prove that the pivoted normalization weighting method (SIM) [11], which inspired most IR scoring functions, satisfies Properties 1 and 2. Contrarily to TA, we do not use an aggregative function, nor do we assume the aggregation to be monotone: TA introduces a mechanism to optimize the number of steps n needed to compute the best k objects (where it can be n ≫ k), while our framework produces k optimal solutions in k steps. To verify monotonicity we apply a so-called τ-test to determine which paths of a connected component cc should be inserted into an optimum solution optS ⊂ cc; the τ-test is supported by Theorem 1. First, we take into consideration the paths that can still be used to form more solutions in the next iterations of the process: in our framework they are still within the set of clusters CL. Let ps be the path with the highest score in CL and py the path with the highest score in cc ∖ optS; we define the threshold τ = max{score(ps,Q), score(py,Q)}. The threshold τ can be considered the upper-bound score for the potential solutions to be generated in the next iterations of the algorithm. Theorem 1: given a query Q, a scoring function satisfying Property 1 and Property 2, a connected component cc, a subset optS ⊂ cc representing an optimum solution and a candidate path px ∈ cc ∖ optS, S = optS ∪ {px} is still optimum iff score(S,Q) ≥ τ. (Necessary condition: assume S = optS ∪ {px} is an optimum solution; we must verify that the score of this solution is still greater than τ. Recalling the definition of τ, we can have two cases; when τ = score(ps,Q) ≥ score(py,Q), score(ps,Q) represents the upper bound for the scores of the possible solutions to be generated in the next steps.) [Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013: 338-353]
  • 31. Monotonic Strategy: solutions — Each connected component is a final solution to output: S1 = {p1, p2, p3}, S2 = {p4, p5}. [Figure: the two solution graphs]
  • 32. Experiments — We deployed YaaniiMR on EC2 clusters (cc1.4xlarge instances; 10, 50 and 100 nodes). YaaniiMR runs on the Hadoop Distributed File System (HDFS) version 1.1.1 and the HBase data store version 0.94.3. The performance of our system has been measured with respect to data loading, memory footprint, and query execution.
  • 33. Data Loading — [Chart: upload time (sec) on 10, 50 and 100 nodes; DBpedia (300M triples): 45.49, 50.49, 56.43; Billion Triples, 2008 edition (1200M triples): 314.26, 348.84, 389.88]
  • 34. Memory Consumption — [Chart, 10-node cluster, log scale, size per node (MB): space consumption 1.4 for Mondial (17k triples), 349.3 for DBpedia (300M triples), 1781.1 for Billion (1200M triples, 2008 edition), with overhead of 1.1, 1.5 and 1.8 respectively]
  • 35. End-to-end job runtimes [Coffman et al. benchmark] — L = Linear strategy, M = Monotonic strategy. [Chart: job runtime (sec, 0-6), split into query execution and overhead, for L and M on Mondial (17k triples), DBpedia (300M triples) and Billion (1200M triples), with 10, 50 and 100 nodes]