Distributed Keyword Search over RDF via MapReduce
Roberto De Virgilio and Antonio Maccioni
Semantic Web is distributed
The (Semantic) Web is distributed
We have distributed and cloud-based infrastructures for data processing
Linked (Open) Data is becoming popular also among non-expert users
Distributed Keyword Search
MapReduce: a simple programming model 
Map+Reduce 
Map:
Accepts an input key/value pair
Emits intermediate key/value pairs
Reduce:
Accepts an intermediate key/value pair
Emits output key/value pairs
[Figure: "Very big data" flows through the Map and Reduce stages to produce a Result]
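The Map/Reduce contract above can be sketched in plain Python. The classic word-count job below is only an illustration of the programming model; the function names and the in-memory shuffle are stand-ins for what a real MapReduce engine provides, not part of the authors' system:

```python
from collections import defaultdict

def map_fn(_, line):
    # Accepts an input key/value pair, emits intermediate key/value pairs.
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    # Accepts an intermediate key with its values, emits an output pair.
    yield word, sum(counts)

def run_job(records):
    # In-memory stand-in for the shuffle phase of a real MR engine.
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    out = {}
    for k, vs in groups.items():
        for rk, rv in reduce_fn(k, vs):
            out[rk] = rv
    return out

print(run_job([(0, "very big data"), (1, "big data")]))
# → {'very': 1, 'big': 2, 'data': 2}
```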
The problem 
[Figure: an RDF graph with publications pub1 (year 2008) and pub2, conference conf1 (name SIGMOD), and researchers aut1, aut2, aut3 (names Bernstein, Koltsidis, Buneman), connected by type, name, author, year, editedBy and acceptedBy edges]
The problem 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph, with the vertices matching the query keywords highlighted]
The problem 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph; the answer sub-graph connecting Bernstein, 2008 and SIGMOD is highlighted]
Existing Approach 
Query: {Bernstein, 2008, SIGMOD}
[Figure: the same RDF graph and query as before]
Existing Approach
[Figure: candidate solutions S1, S2, S3, S4, …, Sn plotted against a relevance axis]
Existing Approach
drawbacks:
low specificity
high overlapping
high computational cost
centralized computation
[Figure: the solutions are sorted by relevance (S2, S1, S4, S3, …, Sn) and only then the Top-k are selected]
Desired direction
[Figure: solutions S1, S2, …, Sk produced along the relevance axis]
Desired direction
strong points:
linear computational cost
monotonic ranked result
low overlapping
distributed computation
[Figure: the Top-k solutions S1, S2, …, Sk are emitted already ordered by relevance]
From Graph parallel to Data parallel
Graph Data Indexing
Breadth First Search over the RDF graph, then Distributed Storage of the resulting paths:
Path Store 1, Path Store 2, …, Path Store j
[Figure: the example RDF graph is traversed via BFS and its paths are partitioned across the path stores]
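The indexing step can be sketched as a BFS that unrolls the graph into root-to-leaf paths. The adjacency-list encoding below and the tiny excerpt of the example graph are illustrative assumptions, not the authors' storage format:

```python
from collections import deque

# Tiny excerpt of the example graph: node -> [(edge_label, target)].
graph = {
    "pub1": [("author", "aut1"), ("year", "2008"), ("acceptedBy", "conf1")],
    "aut1": [("name", "Bernstein")],
    "conf1": [("name", "SIGMOD")],
}

def bfs_paths(graph, root):
    # BFS from a root; emit root-to-leaf paths as node/edge-label sequences.
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        edges = graph.get(path[-1], [])
        if not edges:
            yield path
            continue
        for label, target in edges:
            queue.append(path + [label, target])

for p in bfs_paths(graph, "pub1"):
    print("-".join(p))
# e.g. pub1-author-aut1-name-Bernstein
```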
Paths and Templates
Query = {Bernstein, 2008, SIGMOD}
p1: [pub1-year-2008]
p2: [pub1-author-aut1-name-Bernstein]
p3: [pub1-acceptedBy-conf1-name-SIGMOD]
p4: [pub2-year-2008]
p5: [pub2-editedBy-conf1-name-SIGMOD]
Path: [pub1-acceptedBy-conf1-name-SIGMOD]
Template: [#-acceptedBy-#-name-#]
[Figure: the matching paths are retrieved from the distributed path stores (Path Store 1, …, Path Store j)]
Paths Clustering
Paths sharing the same template are grouped into the same cluster; the number in brackets is the position (rank) of the path within its cluster:
cl1 (template [#-year-#]): p1 [1], p4 [2]
cl2 (template [#-author-#-name-#]): p2 [1]
cl3 (template [#-acceptedBy-#-name-#]): p3 [1]
cl4 (template [#-editedBy-#-name-#]): p5 [1]
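Clustering by template can be sketched as: mask every node of a path with the wildcard `#`, keep the edge labels, and group paths under the masked key. The list encoding of paths below is an illustrative assumption:

```python
from collections import defaultdict

# Paths as alternating node/edge-label sequences, as in the running example.
paths = {
    "p1": ["pub1", "year", "2008"],
    "p2": ["pub1", "author", "aut1", "name", "Bernstein"],
    "p3": ["pub1", "acceptedBy", "conf1", "name", "SIGMOD"],
    "p4": ["pub2", "year", "2008"],
    "p5": ["pub2", "editedBy", "conf1", "name", "SIGMOD"],
}

def template_of(path):
    # Nodes sit at even positions; replace them with the wildcard '#'.
    return "-".join("#" if i % 2 == 0 else x for i, x in enumerate(path))

clusters = defaultdict(list)
for pid, path in paths.items():
    clusters[template_of(path)].append(pid)

print(clusters["#-year-#"])             # → ['p1', 'p4']
print(clusters["#-editedBy-#-name-#"])  # → ['p5']
```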
Scoring
The score of a sub-graph with respect to a query Q depends on two key factors: its topology and the relevance of the information carried by its vertices and edges. The topology is evaluated in terms of the length of its paths. The relevance of the information carried by its vertices is evaluated through an implementation of TF/IDF.
Unlike all current approaches, we are independent from the scoring function: we do not impose a monotonic, aggregative, nor an "ad-hoc for the case" scoring function.
Given a query Q = {q1, . . . , qn}, a graph G and a sub-graph sg in G, the score of sg with respect to Q is:
score(sg, Q) = (α(sg) / ωstr(sg)) · Σ_{q∈Q} (ρ(q) · ωct(sg, q))
where:
• α(sg) is the relative relevance of sg within G;
• ρ(q) is the weight associated to each keyword q with respect to the query Q;
• ωct(sg, q) is the content weight of q considering sg;
• ωstr(sg) is the structural weight of sg.
Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
Scoring
topology: how strictly the keywords are connected, evaluated via the length of the paths
Scoring
relevance: the information carried by the nodes, evaluated via an implementation of TF/IDF
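The scoring formula can be sketched directly. The concrete weight functions below are placeholder assumptions (the paper uses a TF/IDF implementation for content weights and path-length-based structural weights); only the shape of the formula is taken from the slides:

```python
def score(sg, query, alpha, w_str, rho, w_ct):
    # score(sg, Q) = alpha(sg) / w_str(sg) * sum_q rho(q) * w_ct(sg, q)
    return alpha(sg) / w_str(sg) * sum(rho(q) * w_ct(sg, q) for q in query)

# Placeholder weights: every keyword equally important, structural weight
# proportional to the total length of the paths in sg, content weight 1
# if the keyword appears in sg and 0 otherwise.
sg = {"paths": ["pub1-year-2008", "pub1-author-aut1-name-Bernstein"]}
query = ["Bernstein", "2008", "SIGMOD"]

s = score(
    sg, query,
    alpha=lambda g: 1.0,
    w_str=lambda g: sum(len(p.split("-")) for p in g["paths"]),
    rho=lambda q: 1.0,
    w_ct=lambda g, q: 1.0 if any(q in p for p in g["paths"]) else 0.0,
)
print(s)  # → 0.25  (two of three keywords matched, total path length 8)
```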
Building solutions
Two different strategies:
linear: runs in linear time and one round of MapReduce
monotonic: produces a monotonically ranked result in quadratic time and 2k rounds of MapReduce
Linear Strategy (Map)
Map: iterates over the clusters;
key: position of the path in the cluster
value: the path itself
Emitted pairs: <1, p1>, <2, p4>, <1, p2>, <1, p3>, <1, p5>
[Figure: the clusters cl1–cl4 with their ranked paths, and the key/value pairs emitted for each path]
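The map phase above is a one-liner over the clusters: emit each path keyed by its position. A sketch, reusing the cluster layout of the running example (the dictionary encoding is an assumption):

```python
# Clusters as ordered lists of path ids, as in the running example.
clusters = {
    "cl1": ["p1", "p4"],   # template [#-year-#]
    "cl2": ["p2"],         # template [#-author-#-name-#]
    "cl3": ["p3"],         # template [#-acceptedBy-#-name-#]
    "cl4": ["p5"],         # template [#-editedBy-#-name-#]
}

def linear_map(clusters):
    # key: 1-based position of the path in its cluster; value: the path.
    for paths in clusters.values():
        for position, path in enumerate(paths, start=1):
            yield position, path

print(sorted(linear_map(clusters)))
# → [(1, 'p1'), (1, 'p2'), (1, 'p3'), (1, 'p5'), (2, 'p4')]
```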
Linear Strategy (Reduce) 
Reduce: each machine receives the list of paths sharing a key and computes their connected components;
key: position of the path in the cluster
value: the path itself
<1, p1>, <2, p4>, <1, p2> 
<1, p3>, <1, p5> 
<1, {p1, p2, p3, p5}> 
<2, {p4}> 
Each connected component is a final solution to output 
S1: {p1, p2, p3, p5} 
S2: {p4}
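The reduce phase groups the paths that share a key into connected components, where two paths are connected when they touch a common node. A sketch, with the node sets of each path hard-coded from the running example (the reducer for key 2 would see only p4 and output it alone):

```python
# Nodes touched by each path with key 1, taken from the running example.
path_nodes = {
    "p1": {"pub1", "2008"},
    "p2": {"pub1", "aut1", "Bernstein"},
    "p3": {"pub1", "conf1", "SIGMOD"},
    "p5": {"pub2", "conf1", "SIGMOD"},
}

def connected_components(path_nodes):
    # Paths sharing at least one node end up in the same component.
    components = []
    for pid, nodes in path_nodes.items():
        merged_ids, merged_nodes = {pid}, set(nodes)
        kept = []
        for comp_ids, comp_nodes in components:
            if comp_nodes & merged_nodes:
                merged_ids |= comp_ids
                merged_nodes |= comp_nodes
            else:
                kept.append((comp_ids, comp_nodes))
        components = kept + [(merged_ids, merged_nodes)]
    return [sorted(ids) for ids, _ in components]

print(connected_components(path_nodes))
# → [['p1', 'p2', 'p3', 'p5']]
```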
Linear Strategy: solutions 
Each connected component is a final solution to output 
S1: {p1, p2, p3, p5} S2: {p4} 
[Figure: S1 is the sub-graph connecting Bernstein, 2008 and SIGMOD through pub1, aut1, conf1 and pub2; S2 is the single path pub2-year-2008]
Monotonic Strategy (Map round 1)
Map: iterates over the clusters, taking only the best path of each cluster;
key: position of the path in the cluster
value: the path itself
Emitted pairs: <1, p1>, <1, p2>, <1, p3>, <1, p5>
[Figure: only the top-ranked path of each cluster cl1–cl4 is emitted; p4, ranked second in its cluster, stays behind]
Monotonic Strategy (Reduce round 1) 
Reduce: each machine receives the list of paths sharing a key and computes global connected components;
key: position of the path in the cluster
value: the path itself
<1, p1>, <1, p2> 
<1, p3>, <1, p5> 
<1, {p1, p2, p3, p5}> 
Then it invokes a new round of MapReduce
Monotonic Strategy (Map round 2) 
Map: Dispatches the connected components to different 
machines; 
key: occurrence of connected component 
value: connected component itself 
<1, {p1, p2, p3, p5}>
Monotonic Strategy (Reduce round 2) 
Reduce: each machine receives a connected component and iterates over its paths;
τ-test: a variant of the TA (Threshold Algorithm)
it determines whether any path in the connected component must be left out of the optimum solution
<1, {p1, p2, p3, p5}>
S = {p1, p2, p3}
discarded: {p5}
Discarded paths and discarded connected components are reinserted into the clusters
It is possible to prove that the pivoted normalization weighting method (SIM) [11], which inspired most of the IR-based scoring functions, satisfies Properties 1 and 2. For the sake of simplicity, we discuss the properties by referring to the data structures used in this paper.

Property 1. Given a query Q and a path p, score(p, Q) = score({p}, Q).

This property states that the score of a path p is equal to the score of the solution S containing only that same path (i.e. {p}). It means that every path must be evaluated as the solution containing exactly that path. Consequently, if score(p1, Q) > score(p2, Q) then score({p1}, Q) > score({p2}, Q). Analogously, extending Property 1, we provide the following.

Property 2. Given a query Q, a set of paths P in which p is the most relevant path (i.e. ∀pj ∈ P we have score(p, Q) ≥ score(pj, Q)) and its power set P*, we have score(S = Pi, Q) ≤ score(S = {p}, Q) ∀Pi ∈ P*.

In other words, given the set P containing the candidate paths to be included in the solution, the scores of all possible solutions generated from P (i.e. P*) are bounded by the score of the most relevant path p of P. This property is coherent with and generalizes the Threshold Algorithm (TA) [6]. Contrary to TA, we do not use an aggregative function, nor do we assume the aggregation to be monotone. TA introduces a mechanism to optimize the number of steps n needed to compute the best k objects (where it could be n ≥ k), while our framework produces k optimal solutions in k steps. To verify the monotonicity we apply a so-called τ-test to determine which paths of a connected component cc should be inserted into an optimum solution optS ⊂ cc. The τ-test is supported by Theorem 1. Firstly, we have to take into consideration the paths that can be used to form more solutions in the next iterations of the process: in our framework they are still within the set of clusters CL. Then, let us consider the path ps with the highest score in CL and the path py with the highest score in cc ∖ optS. We define the threshold τ as τ = max{score(ps, Q), score(py, Q)}. The threshold τ can be considered the upper-bound score for the potential solutions to generate in the next iterations of the algorithm. Now, we provide the following:

Theorem 1. Given a query Q, a scoring function satisfying Property 1 and Property 2, a connected component cc, a subset optS ⊂ cc representing an optimum solution, and a candidate path px ∈ cc ∖ optS, S = optS ∪ {px} is still optimum iff score(S, Q) ≥ τ.

Necessary condition. Let us assume that S = optS ∪ {px} is an optimum solution. We must verify whether the score of this solution is still greater than τ. Recalling the definition of τ, we can have two cases:
– τ = score(ps, Q) ≥ score(py, Q). In this case score(ps, Q) represents the upper bound for the scoring of the possible solutions to generate in the next steps. […]

Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
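Once the scores of ps, py and the candidate solution are known, the τ-test of Theorem 1 reduces to a single comparison. A minimal sketch; the numeric scores are made up for illustration, and the scoring function itself is assumed to be computed elsewhere:

```python
def tau_test(score_S, score_ps, score_py):
    # tau is the upper bound for solutions still constructible later:
    # ps = best-scored path left in the clusters CL,
    # py = best-scored path left in cc \ optS.
    tau = max(score_ps, score_py)
    # S = optS ∪ {px} remains optimum iff score(S, Q) >= tau (Theorem 1).
    return score_S >= tau

# Made-up scores: in the first case the extended solution still beats
# anything that could be built later, so px is accepted; in the second
# it does not, so px is discarded and reinserted into the clusters.
print(tau_test(score_S=0.8, score_ps=0.6, score_py=0.5))  # → True
print(tau_test(score_S=0.4, score_ps=0.6, score_py=0.5))  # → False
```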
Monotonic Strategy: solutions 
Each connected component is a final solution to output 
S1: {p1, p2, p3} S2: {p4, p5} 
[Figure: S1 is the sub-graph over pub1, aut1 and conf1 connecting Bernstein, 2008 and SIGMOD; S2 is the sub-graph over pub2 and conf1 connecting 2008 and SIGMOD via editedBy]
Experiments 
We deployed YaaniiMR on EC2 clusters. 
cc1.4xlarge: 10, 50 and 100 nodes. 
YaaniiMR runs on the Hadoop Distributed File System (HDFS) version 1.1.1 and the HBase data store version 0.94.3.
The performance of our system has been measured with respect to data loading, memory footprint, and query execution.
Data Loading
[Chart: upload time (sec) on 10, 50 and 100 nodes; DBPedia (300M triples): 45.49, 50.49 and 56.43 sec; Billion (2008 edition: 1200M triples): 314.26, 348.84 and 389.88 sec]
Memory Consumption
[Chart: size per node (MB, log scale) on 10 nodes for Mondial (17k triples), DbPedia (300M triples) and Billion (2008 edition: 1200M triples); overhead: 1.1, 1.5 and 1.8; space consumption: 1.4, 349.3 and 1781.1 MB]
End-to-end job runtimes 
[Coffman et al. Benchmark] 
L = Linear strategy M = Monotonic strategy 
[Chart: job runtime (sec, 0–6), split into query execution and overhead, for L and M on Mondial (17k triples), Dbpedia (300M triples) and Billion (1200M triples), over 10, 50 and 100 nodes]
Questions?

More Related Content

PDF
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
PDF
Graph-to-Text Generation and its Applications to Dialogue
PDF
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
PDF
The DE-9IM Matrix in Details using ST_Relate: In Picture and SQL
PDF
Spatial Indexing
PDF
Decentralized Evolution and Consolidation of RDF Graphs
PPT
Planning Evacuation Routes with the P-graph Framework
PDF
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal ...
Graph-to-Text Generation and its Applications to Dialogue
Competence-Level Prediction and Resume & Job Description Matching Using Conte...
The DE-9IM Matrix in Details using ST_Relate: In Picture and SQL
Spatial Indexing
Decentralized Evolution and Consolidation of RDF Graphs
Planning Evacuation Routes with the P-graph Framework
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools

Viewers also liked (7)

PDF
Modelgraphdb
PDF
SPCData: la nuvola dei dati della Pubblica Amministrazione Italiana
PDF
Finding All Maximal Cliques in Very Large Social Networks
PDF
Converting Relational to Graph Databases
PPTX
File Format Benchmarks - Avro, JSON, ORC, & Parquet
PPTX
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
PDF
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Modelgraphdb
SPCData: la nuvola dei dati della Pubblica Amministrazione Italiana
Finding All Maximal Cliques in Very Large Social Networks
Converting Relational to Graph Databases
File Format Benchmarks - Avro, JSON, ORC, & Parquet
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Everyday I'm Shuffling - Tips for Writing Better Spark Programs, Strata San J...
Ad

Similar to Distributed Keyword Search over RDF via MapReduce (20)

PPTX
Lgm saarbrucken
PDF
Estimation of Signal Models for Aerospace and Industrial Applications slides
PDF
Extended Property Graphs and Cypher on Gradoop
PDF
PageRank on an evolving graph - Yanzhao Yang : NOTES
PDF
pi-Lisco: Parallel and Incremental Stream-Based Point-Cloud Clustering
PPTX
Locally densest subgraph discovery
PDF
A Subgraph Pattern Search over Graph Databases
PDF
Compiling openCypher graph queries with Spark Catalyst
PDF
Propel your Performance: AgensGraph, the multi-model database
PDF
쉽게 설명하는 GAN (What is this? Gum? It's GAN.)
PDF
Incremental Graph Queries for Cypher
PDF
ingraph: Live Queries on Graphs
PPTX
[NS][Lab_Seminar_250421]SignGraph: A Sign Sequence is Worth Graphs of Nodes.pptx
PDF
Scalable and Adaptive Graph Querying with MapReduce
PPTX
SPgen: A Benchmark Generator for Spatial Link Discovery Tools
PDF
Map-Side Merge Joins for Scalable SPARQL BGP Processing
PDF
The 2nd graph database in sv meetup
PDF
Challenge@RuleML2015 Modeling Object-Relational Geolocation Knowledge in PSOA...
PPT
4900514.ppt
PPTX
Compact Representation of Large RDF Data Sets for Publishing and Exchange
Lgm saarbrucken
Estimation of Signal Models for Aerospace and Industrial Applications slides
Extended Property Graphs and Cypher on Gradoop
PageRank on an evolving graph - Yanzhao Yang : NOTES
pi-Lisco: Parallel and Incremental Stream-Based Point-Cloud Clustering
Locally densest subgraph discovery
A Subgraph Pattern Search over Graph Databases
Compiling openCypher graph queries with Spark Catalyst
Propel your Performance: AgensGraph, the multi-model database
쉽게 설명하는 GAN (What is this? Gum? It's GAN.)
Incremental Graph Queries for Cypher
ingraph: Live Queries on Graphs
[NS][Lab_Seminar_250421]SignGraph: A Sign Sequence is Worth Graphs of Nodes.pptx
Scalable and Adaptive Graph Querying with MapReduce
SPgen: A Benchmark Generator for Spatial Link Discovery Tools
Map-Side Merge Joins for Scalable SPARQL BGP Processing
The 2nd graph database in sv meetup
Challenge@RuleML2015 Modeling Object-Relational Geolocation Knowledge in PSOA...
4900514.ppt
Compact Representation of Large RDF Data Sets for Publishing and Exchange
Ad

Recently uploaded (20)

PPTX
A Presentation on Artificial Intelligence
PDF
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
PDF
Per capita expenditure prediction using model stacking based on satellite ima...
PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PPTX
Machine Learning_overview_presentation.pptx
PDF
Approach and Philosophy of On baking technology
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
Building Integrated photovoltaic BIPV_UPV.pdf
PDF
Getting Started with Data Integration: FME Form 101
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PPTX
Programs and apps: productivity, graphics, security and other tools
PPTX
20250228 LYD VKU AI Blended-Learning.pptx
PDF
A comparative analysis of optical character recognition models for extracting...
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
Digital-Transformation-Roadmap-for-Companies.pptx
PDF
Machine learning based COVID-19 study performance prediction
PDF
Network Security Unit 5.pdf for BCA BBA.
A Presentation on Artificial Intelligence
TokAI - TikTok AI Agent : The First AI Application That Analyzes 10,000+ Vira...
Unlocking AI with Model Context Protocol (MCP)
Optimiser vos workloads AI/ML sur Amazon EC2 et AWS Graviton
Per capita expenditure prediction using model stacking based on satellite ima...
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Machine Learning_overview_presentation.pptx
Approach and Philosophy of On baking technology
Advanced methodologies resolving dimensionality complications for autism neur...
SOPHOS-XG Firewall Administrator PPT.pptx
Building Integrated photovoltaic BIPV_UPV.pdf
Getting Started with Data Integration: FME Form 101
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Programs and apps: productivity, graphics, security and other tools
20250228 LYD VKU AI Blended-Learning.pptx
A comparative analysis of optical character recognition models for extracting...
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
Digital-Transformation-Roadmap-for-Companies.pptx
Machine learning based COVID-19 study performance prediction
Network Security Unit 5.pdf for BCA BBA.

Distributed Keyword Search over RDF via MapReduce

  • 1. Distributed Keyword Search over RDF via MapReduce Roberto De Virgilio and Antonio Maccioni
  • 2. Semantic Web is distributed The (Semantic) Web is distributed We have distributed and cloud-based Infrastructures for data processing The Linked (Open) Data are getting popular also to non-expert users
  • 4. MapReduce: a simple programming model Map+Reduce Map: Accepts input key/value pair Emits intermediate key/ value pair Reduce: Accepts intermediate key/value pair Emits output key/value pair Very big data Result M A P R E D U C E
  • 5. The problem Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 6. The problem Query: Bernstein 2008 SIGMOD Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 7. The problem Bernstein type type name name author year year author type author type SIGMOD Koltsides Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type Koltsidis Query: Bernstein 2008 SIGMOD
  • 8. Existing Approach Bernstein type type name name author year year author type author type SIGMOD Koltsides Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type Koltsidis Query: Bernstein 2008 SIGMOD
  • 10. Existing Approach S1 S2 S3 S4 - - - Sn relevance
  • 11. drawbacks: low specificity high overlapping high computational cost centralized computation Existing Approach S2 S1 S4 S3 - - - Sn Top-k relevance
  • 13. Desired direction S1 S2 - - - Sk relevance
  • 14. Desired direction strong points: linear computational cost monotonic ranked result low overlapping distributed computation S1 S2 - - - Sk Top-k relevance
  • 15. From Graph parallel to Data parallel
  • 16. Graph Data Indexing Breadth First Search Distributed Storage Path! Store 1! Path! Store 2! Path! Store j! Bernstein type type name name author year year author type author type SIGMOD Koltsidis Buneman Publication pub1 2008 pub2 conf1 Conference aut1 aut2 aut3 Researcher name name editedBy acceptedBy type type
  • 17. Paths and Templates year Query = {Bernstein, 2008, SIGMOD} pub1 2008 Path! Store 1! Path! Store 2! Path! Store j! author name pub1 aut1 Bernstein acceptedBy name pub1 conf1 SIGMOD year pub2 2008 editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 [pub1-acceptedBy-conf1-name-SIGMOD] [# -acceptedBy- # -name- #] Path Template
  • 18. Paths Clustering year author name acceptedBy name year pub1 2008 year author name pub1 aut1 Bernstein year acceptedBy name editedBy name year pub1 conf1 SIGMOD author name year author name year pub2 2008 acceptedBy name author name acceptedBy name editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD acceptedBy name year year pub2 2008 year editedBy name editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 editedBy name pub2 conf1 SIGMOD p1 p2 p3 p4 p5 pub1 2008 pub1 aut1 Bernstein pub1 conf1 SIGMOD pub2 2008 pub2 conf1 SIGMOD p1 p2 p3 p4 p5 cl1 [1] [2] cl2 [1] cl3 [1] cl4 [1]
  • 19. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 20. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with topology: how keywords are strictly connected length of the paths respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 21. to a query Q on two key factors: its topology and the relevance of the information carried from its vertices and edges. The topology is Scoring evaluated in terms of the length of its paths. The relevance of the information carried from its vertices Unlike is all evaluated current approaches, through an we implementation are independent from of the TF/IDF. Given scoring a query function: Q = we {q1, do not . . . impose , qn}, a a graph monotonic, G and aggregative a sub-graph nor sg in an G, “ad-the hoc score for the of case” sg with scoring respect function. to Q is: score(sg,Q) = ↵(sg) !str(sg) · X q2Q (⇢(q) · !ct(sg, q)) where: • ↵(sg) is the relative relevance of sg within G; • ⇢(q) is the weight associated to each keyword q with topology: how keywords are strictly connected length of the paths relevance: information carried from nodes implementation of TF/IDF respect to the query Q; • !ct(sg, q) is the content weight of q considering sg; • !str(sg) is the structural weight of sg. The relevance of sg is measured as follows: Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013:338-353
  • 22. Building solutions different strategies: linear time complexity in one round of MR monotonically in a quadratic time and 2k rounds of MR
  • 23. Linear Strategy (Map) — Map: iterates over the clusters; key: position of the path in the cluster; value: the path itself. On the running example the paths are p1: pub1→2008, p2: pub1→aut1→Bernstein, p3: pub1→conf1→SIGMOD, p4: pub2→2008, p5: pub2→conf1→SIGMOD, and the mappers emit <1, p1>, <2, p4>, <1, p2> and <1, p3>, <1, p5>. [Figure: clusters cl1-cl4 and their paths]
  • 24. Linear Strategy (Reduce) — Reduce: each machine receives a list of paths, out of which it computes connected components; key: position of the path in the cluster; value: the path itself. The input <1, p1>, <2, p4>, <1, p2> and <1, p3>, <1, p5> yields <1, {p1, p2, p3, p5}> and <2, {p4}>. Each connected component is a final solution to output: S1 = {p1, p2, p3, p5}, S2 = {p4}.
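The map/reduce pair of slides 23-24 can be simulated in a few lines of Python. This is an in-memory sketch, not actual Hadoop code; the vertex sets for p1-p5 are taken from the running example.

```python
from collections import defaultdict

# Vertices touched by each path in the running example.
PATHS = {
    "p1": {"pub1", "2008"},
    "p2": {"pub1", "aut1", "Bernstein"},
    "p3": {"pub1", "conf1", "SIGMOD"},
    "p4": {"pub2", "2008"},
    "p5": {"pub2", "conf1", "SIGMOD"},
}

def map_phase(clusters):
    """Map: emit <position of the path in its cluster, path id>."""
    for cluster in clusters:
        for position, path in enumerate(cluster, start=1):
            yield position, path

def reduce_phase(pairs):
    """Reduce: group by position, then merge paths sharing a vertex
    into connected components -- each component is a final solution."""
    groups = defaultdict(list)
    for position, path in pairs:
        groups[position].append(path)
    for position, paths in sorted(groups.items()):
        components = []  # list of sets of path ids
        for p in paths:
            touching = [c for c in components
                        if PATHS[p] & set.union(*(PATHS[q] for q in c))]
            merged = {p}
            for c in touching:
                merged |= c
                components.remove(c)
            components.append(merged)
        yield position, components

clusters = [["p1", "p4"], ["p2"], ["p3"], ["p5"]]  # cl1..cl4 of slide 23
solutions = dict(reduce_phase(map_phase(clusters)))
# solutions[1] == [{"p1", "p2", "p3", "p5"}], solutions[2] == [{"p4"}]
```

The merge step plays the role of the per-reducer connected-component computation: paths that share at least one vertex end up in the same solution.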
  • 25. Linear Strategy: solutions — Each connected component is a final solution to output: S1 = {p1, p2, p3, p5}, S2 = {p4}. [Figure: the two solution graphs]
  • 26. Monotonic Strategy (Map round 1) — Map: iterates over the clusters, taking only the best ones; key: position of the path in the cluster; value: the path itself. The mappers emit <1, p1>, <1, p2> and <1, p3>, <1, p5> (p4 is not emitted in this round). [Figure: clusters cl1-cl4 and their paths]
  • 27. Monotonic Strategy (Reduce round 1) — Reduce: each machine receives a list of paths, out of which it computes global connected components; key: position of the path in the cluster; value: the path itself. The input <1, p1>, <1, p2> and <1, p3>, <1, p5> yields <1, {p1, p2, p3, p5}>. Then a new round of MapReduce is invoked.
  • 28. Monotonic Strategy (Map round 2) — Map: dispatches the connected components to different machines; key: occurrence of the connected component; value: the connected component itself. Here the single component <1, {p1, p2, p3, p5}> is dispatched. [Figure: the paths p1-p5 of the component]
  • 29. Monotonic Strategy (Reduce round 2) — Reduce: each machine receives a connected component and iterates over its paths; the tau-test, a variant of the TA algorithm, determines whether some path in the connected component is in excess and must be left out. On <1, {p1, p2, p3, p5}> it yields S = {p1, p2, p3}, discarding {p5}. Discarded paths and discarded connected components are reinserted in the clusters.
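The two rounds above can be mimicked end-to-end on the running example. The sketch below is a simplified, in-memory rendering with made-up path scores; the real system distributes this over 2k MapReduce rounds and uses the actual scoring function, so `SCORES` and the simplified τ-test are illustrative assumptions only.

```python
# Toy driver for the monotonic strategy: round 1 builds a connected
# component from the best clusters, round 2 runs a simplified tau-test
# and the discarded paths go back into the clusters for the next rounds.
SCORES = {"p1": 0.9, "p2": 0.8, "p3": 0.7, "p4": 0.6, "p5": 0.5}  # made up

def round1(clusters):
    """Take the best path of each cluster and form one connected component."""
    return {max(c, key=SCORES.get) for c in clusters if c}

def round2(component, clusters):
    """Tau-test (simplified): extend the solution with a path only while
    its score stays at or above the best score still available elsewhere."""
    solution, discarded = set(), set(component)
    for p in sorted(component, key=SCORES.get, reverse=True):
        outside = [SCORES[q] for c in clusters for q in c] + \
                  [SCORES[q] for q in discarded - {p}]
        tau = max(outside, default=0.0)       # threshold of the tau-test
        if SCORES[p] >= tau:
            solution.add(p)
            discarded.remove(p)
    return solution, discarded

clusters = [["p1", "p4"], ["p2"], ["p3"], ["p5"]]
component = round1(clusters)                  # {"p1", "p2", "p3", "p5"}
remaining = [[p for p in c if p not in component] for c in clusters]
solution, discarded = round2(component, remaining)
# solution == {"p1", "p2", "p3"}, discarded == {"p5"}, matching slide 29
```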
  • 30. Tau-test — Property 1: given a query Q and a path p, score(p,Q) = score({p},Q). This property states that the score of a path p is equal to the score of the solution S containing only that same path (i.e. {p}); every path must be evaluated as the solution containing exactly that path. Consequently, if score(p1,Q) > score(p2,Q) then score({p1},Q) > score({p2},Q). Property 2: given a query Q, a set of paths P in which p is the most relevant path (i.e. ∀pj ∈ P, score(p,Q) ≥ score(pj,Q)) and P* is its power set, we have score(S = Pi, Q) ≤ score(S = {p}, Q) ∀Pi ∈ P*. In other words, given the set P containing the candidate paths to be included in the solution, the scores of all possible solutions generated from P (i.e. P*) are bounded by the score of the most relevant path p of P. This property is coherent with and generalizes the Threshold Algorithm (TA) [6]. It is possible to prove that the pivoted normalization weighting method (SIM) [11], which inspired most IR scoring functions, satisfies Properties 1 and 2. Contrarily to TA, we do not use an aggregative function, nor do we assume the aggregation to be monotone: TA introduces a mechanism to optimize the number of steps n needed to compute the best k objects (where it can be n ≫ k), while our framework produces k optimal solutions in k steps. To verify monotonicity we apply a so-called τ-test to determine which paths of a connected component cc should be inserted into an optimum solution optS ⊂ cc; the τ-test is supported by Theorem 1. First, we take into consideration the paths that can still be used to form more solutions in the next iterations of the process: in our framework they are still within the set of clusters CL. Let ps be the path with the highest score in CL and py the path with the highest score in cc ∖ optS; we define the threshold τ = max{score(ps,Q), score(py,Q)}. The threshold τ can be considered the upper-bound score for the potential solutions to be generated in the next iterations of the algorithm. Theorem 1: given a query Q, a scoring function satisfying Property 1 and Property 2, a connected component cc, a subset optS ⊂ cc representing an optimum solution and a candidate path px ∈ cc ∖ optS, S = optS ∪ {px} is still optimum iff score(S,Q) ≥ τ. (Necessary condition: assume S = optS ∪ {px} is an optimum solution; we must verify that the score of this solution is still greater than τ. Recalling the definition of τ, we can have two cases; when τ = score(ps,Q) ≥ score(py,Q), score(ps,Q) represents the upper bound for the scores of the possible solutions to be generated in the next steps.) [Roberto De Virgilio, Antonio Maccioni, Paolo Cappellari: A Linear and Monotonic Strategy to Keyword Search over RDF Data. ICWE 2013: 338-353]
  • 31. Monotonic Strategy: solutions — Each connected component is a final solution to output: S1 = {p1, p2, p3}, S2 = {p4, p5}. [Figure: the two solution graphs]
  • 32. Experiments — We deployed YaaniiMR on EC2 clusters (cc1.4xlarge instances; 10, 50 and 100 nodes). YaaniiMR runs on the Hadoop Distributed File System (HDFS) version 1.1.1 and the HBase data store version 0.94.3. The performance of our system has been measured with respect to data loading, memory footprint, and query execution.
  • 33. Data Loading — [Chart: upload time (sec) on 10, 50 and 100 nodes; DBpedia (300M triples): 45.49, 50.49, 56.43; Billion Triples, 2008 edition (1200M triples): 314.26, 348.84, 389.88]
  • 34. Memory Consumption — [Chart, 10-node cluster, log scale, size per node (MB): space consumption 1.4 for Mondial (17k triples), 349.3 for DBpedia (300M triples), 1781.1 for Billion (1200M triples, 2008 edition), with overhead of 1.1, 1.5 and 1.8 respectively]
  • 35. End-to-end job runtimes [Coffman et al. benchmark] — L = Linear strategy, M = Monotonic strategy. [Chart: job runtime (sec, 0-6), split into query execution and overhead, for L and M on Mondial (17k triples), DBpedia (300M triples) and Billion (1200M triples), with 10, 50 and 100 nodes]