Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]

Rya: Optimizations to Support Real
Time Graph Queries on Accumulo
Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for
public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015

Acknowledgements
 This work is the collective effort of:
 Parsons’ Rya Team, sponsored by the Department of
the Navy, Office of Naval Research
 Rya Founders: Roshan Punnoose, Adina Crainiceanu,
and David Rapp

Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary

Background: Rya and RDF
 Rya: Resource Description Framework (RDF)
Triplestore built on top of Accumulo
 RDF: W3C standard for representing
linked/graph data
 Represents data as statements (assertions) about
resources
– Serialized as triples in {subject, predicate, object}
form
– Example:
• {Caleb, worksAt, Parsons}
• {Caleb, livesIn, Virginia}
[Figure: graph in which node Caleb is linked to Parsons by a worksAt edge and to Virginia by a livesIn edge]

Background: SPARQL
 RDF Queries are described using SPARQL
 SPARQL Protocol and RDF Query Language
 SQL-like syntax for finding triples matching
specific patterns
 Look for subgraphs that match triple statement patterns
 Joins are performed when there are variables common
to two or more statement patterns
SELECT ?people WHERE {
?people <worksAt> <Parsons>.
?people <livesIn> <Virginia>.
}

Rya Architecture
 Open RDF Interface for interacting with RDF data
stored on Accumulo
 Open RDF (Sesame): Open
Source Java framework for
storing and querying RDF
data
 Open RDF provides several
interfaces/abstractions
central to interacting with
an RDF datastore
– SAIL interface for interacting with underlying persisted
RDF model
– SAIL: Storage And Inference Layer
[Figure: architecture diagram with the layers SPARQL, Rya Open RDF, Rya QueryPlanner (query processing in SAIL layer), and Accumulo (data storage layer)]

Storage: Triple Table Index
 3 Tables
 SPO : subject, predicate, object
 POS : predicate, object, subject
 OSP : object, subject, predicate
 Store triples in the RowID of the table
 Store graph name in the Column Family
 Advantages:
 Native lexicographical sorting of row keys → fast range queries
 All patterns can be translated into a scan of one of these tables
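
As a rough illustration of how a statement pattern could be routed to one of the three tables, the sketch below picks the table whose key order puts the bound parts first, so the pattern becomes a prefix range scan. The function name `choose_table`, the `SEP` delimiter, and the exact routing rules are assumptions made for this example; the slides do not spell them out.

```python
# Hypothetical sketch: route a statement pattern to SPO, POS, or OSP.
# None marks a variable position; SEP is an assumed row-key delimiter.
SEP = "\x00"

def choose_table(subject=None, predicate=None, obj=None):
    """Return (table_name, row_prefix) for a statement pattern."""
    s, p, o = subject, predicate, obj
    if s is not None and o is not None and p is None:
        return "OSP", SEP.join([o, s])       # object and subject bound
    if s is not None:
        parts = [s]
        if p is not None:
            parts.append(p)
            if o is not None:
                parts.append(o)
        return "SPO", SEP.join(parts)        # subject-led prefixes
    if p is not None:
        parts = [p] + ([o] if o is not None else [])
        return "POS", SEP.join(parts)        # predicate-led prefixes
    if o is not None:
        return "OSP", o                      # only the object bound
    return "SPO", ""                         # nothing bound: full scan

# ?x <worksAt> <Parsons> becomes a POS scan over the prefix "worksAt, Parsons"
table, prefix = choose_table(predicate="worksAt", obj="Parsons")
```

Every combination of bound positions maps to a prefix of one of the three key orders, which is why three tables are enough.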

Rya Query Execution
 Implemented OpenRDF Sesame SAIL API
 Parse queries, generate initial query plan, execute plan
 Triple patterns map to range queries in Accumulo
SELECT ?x WHERE { ?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>. }
Step 1: POS Table – scan range
…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…
Step 2: for each ?x, SPO – index lookup
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
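
The two steps can be mimicked in memory with sorted lists standing in for the Accumulo tables. This is only a toy sketch: `prefix_scan` and `run_query` are names invented for the example, and a real deployment would use Accumulo scanners rather than Python lists.

```python
from bisect import bisect_left

# Toy stand-ins for the sorted POS and SPO tables, using the rows on the slide.
POS = sorted([
    ("worksAt", "Netflix", "Dan"),
    ("worksAt", "OfficeMax", "Zack"),
    ("worksAt", "Parsons", "Bob"),
    ("worksAt", "Parsons", "Greta"),
    ("worksAt", "Parsons", "John"),
])
SPO = sorted([
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"),
    ("John", "livesIn", "Virginia"),
])

def prefix_scan(table, prefix):
    """Yield the rows of a sorted table that start with `prefix` (a range scan)."""
    i = bisect_left(table, prefix)
    while i < len(table) and table[i][:len(prefix)] == prefix:
        yield table[i]
        i += 1

def run_query():
    # Step 1: POS range scan for (worksAt, Parsons, ?x)
    candidates = [row[2] for row in prefix_scan(POS, ("worksAt", "Parsons"))]
    # Step 2: one SPO lookup per candidate for (?x, livesIn, Virginia)
    return [x for x in candidates
            if any(True for _ in prefix_scan(SPO, (x, "livesIn", "Virginia")))]

result = run_query()
```

Step 2 issues one lookup per binding returned by step 1, so the size of the first scan drives the total work.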

More Complicated Example of Rya Query Execution
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)
…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation, Alice
…
Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)
…
Greta, commuteMethod, bike
…
John, commuteMethod, Bus
…

Challenges in Query Execution
 Scalability and Responsiveness
 Massive amounts of data
 Potentially large numbers of comparisons
 Consider the Previous Example:
 Default query execution: comparing each “?x” returned from first
statement pattern query to all subsequent triple patterns
 There are 8.3 million Virginia residents, about 15,000 Parsons
employees, and 750,000 people who commute via bike.
 Only 100 people who work at Parsons commute via bike while 1000
people who work at Parsons live in Virginia.
Poor query execution plans can result in simple queries
taking minutes as opposed to milliseconds
[Figure: three versions of the SELECT ?x query, compared side by side ("vs.")]
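
To see why ordering matters, the counts quoted above can be plugged into a deliberately simple cost model: the first pattern costs one scanned row per match, and each later pattern costs one lookup per binding that survived so far. The model and the names `CARD`, `PAIR`, and `rows_touched` are assumptions for this sketch, not Rya's actual accounting.

```python
# Match counts quoted on the slide.
CARD = {
    "livesIn Virginia": 8_300_000,
    "worksAt Parsons": 15_000,
    "commuteMethod bike": 750_000,
}
# Bindings surviving the join of two patterns (only the pairs the slide gives).
PAIR = {
    frozenset(["worksAt Parsons", "livesIn Virginia"]): 1_000,
    frozenset(["worksAt Parsons", "commuteMethod bike"]): 100,
}

def rows_touched(order):
    """Rows scanned for the first pattern plus one lookup per surviving ?x."""
    first, second, _third = order
    scan = CARD[first]                               # range scan
    second_lookups = CARD[first]                     # one lookup per ?x from the scan
    third_lookups = PAIR[frozenset([first, second])] # one lookup per surviving ?x
    return scan + second_lookups + third_lookups

good = rows_touched(("worksAt Parsons", "commuteMethod bike", "livesIn Virginia"))
bad = rows_touched(("livesIn Virginia", "worksAt Parsons", "commuteMethod bike"))
```

Under this model, starting from worksAt Parsons touches roughly 30 thousand rows, while starting from livesIn Virginia touches more than 16 million.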

Rya Query Optimizations
 Goal: Optimize query execution (joins) to better
support real time responsiveness
 Three Approaches:
 Reduce the number of joins: Pattern Based Indices
– Pre-calculate common joins
 Limit data in joins: Use more stats to improve query
planning
– Cardinality estimation on individual statement patterns
– Join selectivity estimation on pairs of statement patterns
 Make joins more efficient: Distribute the Join Processing
– Distribute processing using Spark SQL or MapReduce
– Use Hash Joins and Intersecting Iterators
– Work on this approach is just beginning

Rya Query Optimizations Using Cardinalities
 Goal: Optimize ordering of query execution to
reduce the number of comparison operations
 Order execution based on the number of triples that
match each triple pattern
SELECT ?x WHERE {
?x <livesIn> Virginia. (8.3M matches)
?x <worksAt> Parsons. (15k matches)
?x <commuteMethod> bike. (750k matches)
}

Rya Cardinality Usage
 Maintain cardinalities on the following triple pattern
element combinations:
 Single elements: Subject, Predicate, Object
 Composite elements: Subject-Predicate, Subject-Object,
Predicate-Object
 Computed periodically using MapReduce
 Row ID:
– <CardinalityType><TripleElements>
• OBJECT, Parsons
• PREDICATEOBJECT, worksAt, Parsons
 Cardinality stored in the value
 Sparse table: Only store cardinalities above a threshold
 Only need to recompute cardinalities if the
distribution of the data changes significantly
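
A minimal sketch of how such a sparse cardinality table might be consulted when ordering statement patterns. The `SEP` delimiter, the `THRESHOLD` value, the type names other than OBJECT and PREDICATEOBJECT, and the use of the threshold as a fallback for missing rows are all assumptions for illustration; the counts are the ones quoted earlier in the deck.

```python
SEP = "\x00"       # assumed delimiter; the slide shows only <CardinalityType><TripleElements>
THRESHOLD = 1000   # hypothetical cutoff below which no row is stored

def row_id(subject=None, predicate=None, obj=None):
    """Build a row ID such as PREDICATEOBJECT, worksAt, Parsons."""
    named = [("SUBJECT", subject), ("PREDICATE", predicate), ("OBJECT", obj)]
    bound = [(name, value) for name, value in named if value is not None]
    if not bound or len(bound) == 3:
        raise ValueError("cardinalities are kept for one or two bound elements")
    kind = "".join(name for name, _ in bound)
    return SEP.join([kind] + [value for _, value in bound])

# Sparse table: row ID -> count (normally filled by the periodic MapReduce job).
cardinality_table = {
    row_id(predicate="livesIn", obj="Virginia"): 8_300_000,
    row_id(predicate="worksAt", obj="Parsons"): 15_000,
    row_id(predicate="commuteMethod", obj="bike"): 750_000,
}

def cardinality(pattern):
    # A missing row means the count fell below the threshold, so the
    # threshold serves as an upper bound in this sketch.
    return cardinality_table.get(row_id(**pattern), THRESHOLD)

def order_by_cardinality(patterns):
    """Evaluate the most selective (smallest) pattern first."""
    return sorted(patterns, key=cardinality)

patterns = [
    {"predicate": "livesIn", "obj": "Virginia"},
    {"predicate": "worksAt", "obj": "Parsons"},
    {"predicate": "commuteMethod", "obj": "bike"},
]
ordered = order_by_cardinality(patterns)
```

With these counts the plan starts at worksAt Parsons, then commuteMethod bike, then livesIn Virginia.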

Limitations of Cardinality Approach
 Consider a more complicated query
 Cardinality approach does not take into account
number of results returned by joins
 Solution lies in estimating the “join selectivity” for
each pair of triple patterns
SELECT ?x WHERE {
?vehicle <vehicleType> SUV.
?x <owns> ?vehicle.
}
Match counts listed for the query's statement patterns: 2.1M, 15k, 750k, 8.3M, 254M

Rya Query Optimizations Using Join Selectivity
Query optimized using only Cardinality Info:
SELECT ?x WHERE {
?x <owns> ?vehicle.
}
Query optimized using Cardinality and Join Selectivity Info:
SELECT ?x WHERE {
?x <owns> ?vehicle.
}
 Join Selectivity measures number of results returned by joining two
triple patterns
 Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas
Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
 Due to computational complexity, estimate of join selectivity for triple
patterns is pre-computed and stored in Accumulo
 Join selectivity estimated by computing the number of results obtained
when each triple pattern is joined with the full table

Join Selectivity: General Algorithm
 For statement patterns <?x, p1, o1> and <?x, p2, o2>, with ?x a
variable and p1, o1, p2, o2 constants, estimate the number of results of their join
 Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>)
give the number of results returned by joining a statement pattern with
the full table along the subject component
 Full table join statistics precomputed and stored in index
 Join statistics for each triple pattern computed using following equation:
 Use analogous definition if variables appear in predicate or object position
 Join selectivity statistics used with cardinalities to generate more
efficient query plans

Join Selectivity: Integration into Rya
 Join Selectivity estimates used to optimize Rya queries
through a greedy algorithm approach
 Query constructed starting with first triple pattern to be
evaluated (the pattern with the smallest cardinality) and then
patterns are added based on minimization of a cost function
 Cost function
 C = leftCard + rightCard + leftCard*rightCard*selectivity
 C measures number of entries Accumulo must scan and the
number of comparisons required to perform the join
 Selectivity set to one if two triple patterns share no common
variables, otherwise precomputed estimates used
 Ensures that patterns with common variables are grouped
together
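
The greedy construction can be sketched as follows. The cost function is the one stated above; everything else is an assumption made for the example: the helper names, the rule that a candidate's selectivity is the smallest estimate against any already-planned pattern sharing a variable, the use of leftCard * rightCard * selectivity as the size of the intermediate result, and the sample cardinality for the SUV pattern and the selectivity value.

```python
def variables(pattern):
    """Variables are the terms written with a leading '?'."""
    return {term for term in pattern if term.startswith("?")}

def pair_selectivity(plan, right, sel_table):
    """Smallest precomputed estimate against any planned pattern that shares a
    variable with `right`; 1.0 when nothing is shared or no estimate exists."""
    estimates = [sel_table[frozenset((left, right))]
                 for left in plan
                 if variables(left) & variables(right)
                 and frozenset((left, right)) in sel_table]
    return min(estimates, default=1.0)

def join_cost(left_card, right_card, selectivity):
    # C = leftCard + rightCard + leftCard * rightCard * selectivity
    return left_card + right_card + left_card * right_card * selectivity

def greedy_plan(patterns, card, sel_table):
    remaining = list(patterns)
    first = min(remaining, key=lambda p: card[p])  # smallest cardinality first
    remaining.remove(first)
    plan, left_card = [first], card[first]
    while remaining:
        best = min(remaining, key=lambda r: join_cost(
            left_card, card[r], pair_selectivity(plan, r, sel_table)))
        # Assumed estimate of the intermediate result size after the join.
        left_card = left_card * card[best] * pair_selectivity(plan, best, sel_table)
        plan.append(best)
        remaining.remove(best)
    return plan

A = ("?x", "worksAt", "Parsons")
B = ("?x", "livesIn", "Virginia")
C = ("?v", "vehicleType", "SUV")
card = {A: 15_000, B: 8_300_000, C: 2_000_000}   # C's count is hypothetical
sel_table = {frozenset((A, B)): 1e-8}            # hypothetical estimate
plan = greedy_plan([B, C, A], card, sel_table)
```

Sorting by cardinality alone would place C second. Because C shares no variable with A, its selectivity is one and the product term dominates, so the cost function picks B instead.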

Construction of Selectivity Tables
 For the pattern <?x, p1, o1>, associate each RDF triple of
the form <c, p1, o1> with the cardinality |<c,?y,?z>| and
then sum the results
 Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits
the key-value pair (c, (p1, o1))
 Map Job 2 processes the cardinality table and emits the key-value
pair (c, |<c,?y,?z>|), which consists of the constant c
and its single-component subject cardinality for the table
 Map Job 3 merges the results from jobs 1 and 2 by emitting
the key-value pair ((p1, o1), |<c,?y,?z>|)
 Map Job 4 sums the cardinalities from those key-value pairs
containing (p1, o1) as a key, and the result is written to the
selectivity table
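
The four jobs can be imitated in a single process to check the arithmetic. This is a sketch only: plain Python lists and dicts stand in for the MapReduce stages, the function name and sample triples are invented, and the subject cardinalities are counted directly from the triples rather than read from the cardinality table.

```python
from collections import Counter, defaultdict

def build_selectivity_table(spo_triples):
    # Job 1: for each triple <c, p1, o1>, emit (c, (p1, o1)).
    job1 = [(s, (p, o)) for s, p, o in spo_triples]
    # Job 2: emit (c, |<c, ?y, ?z>|), the subject cardinality of c.
    job2 = Counter(s for s, _, _ in spo_triples)
    # Job 3: merge on c and emit ((p1, o1), |<c, ?y, ?z>|).
    job3 = [(po, job2[c]) for c, po in job1]
    # Job 4: sum the cardinalities per (p1, o1) key.
    table = defaultdict(int)
    for po, n in job3:
        table[po] += n
    return dict(table)

triples = [
    ("Caleb", "worksAt", "Parsons"),
    ("Caleb", "livesIn", "Virginia"),
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
]
selectivity = build_selectivity_table(triples)
```

For (worksAt, Parsons) the result is 5: joining <?x, worksAt, Parsons> with the full table on the subject returns Caleb's two triples plus Greta's three.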

Query Optimizations Using Pre-Computed Joins
 Reduce joins by pre-computing common joins
 Approach taken from: Heese, Ralf, et al. "Index Support for
SPARQL." European Semantic Web Conference, Innsbruck,
Austria. 2007.
SELECT ?x WHERE {
?x <owns> ?vehicle.
}
Pre-compute using
batch processing
and look up during
query execution

Query Optimizations Using Pre-Computed Joins
1. Pre-compute a portion of the query using MapReduce
2. Store SPARQL describing the query along with pre-computed values in Accumulo
3. Normalize query variables to match stored SPARQL variables during query execution
Query:
SELECT ?x WHERE {
?x <owns> ?vehicle.
}
Stored SPARQL:
SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}
Index Result Table:
…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
…
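
One way to picture the variable-normalization step is a small backtracking search for a renaming of the stored query's variables that makes every stored pattern appear in the incoming query. This is an illustrative sketch, not Rya's implementation: `find_mapping`, `unify`, and the incoming query's full pattern list are assumptions.

```python
def is_var(term):
    return term.startswith("?")

def unify(stored_pat, query_pat, mapping):
    """Extend `mapping` (stored variable -> query variable) so the two patterns agree."""
    for s_term, q_term in zip(stored_pat, query_pat):
        if is_var(s_term):
            if not is_var(q_term):
                return False
            if s_term in mapping:
                if mapping[s_term] != q_term:
                    return False
            elif q_term in mapping.values():
                return False          # keep the renaming one-to-one
            else:
                mapping[s_term] = q_term
        elif s_term != q_term:
            return False              # constants must match exactly
    return True

def find_mapping(stored, query, mapping=None):
    """Return a renaming under which all stored patterns occur in query, or None."""
    if mapping is None:
        mapping = {}
    if not stored:
        return mapping
    head, rest = stored[0], stored[1:]
    for candidate in query:
        trial = dict(mapping)
        if unify(head, candidate, trial):
            result = find_mapping(rest, query, trial)
            if result is not None:
                return result
    return None

stored_patterns = [("?person", "livesIn", "Virginia"),
                   ("?person", "owns", "?car"),
                   ("?car", "vehicleType", "SUV")]
stored_columns = ("?person", "?car")
index_rows = [("Aaron", "ToyotaRav4"), ("Caleb", "JeepCherokee"), ("Puja", "HondaCRV")]

# Incoming query patterns, assumed for illustration.
query_patterns = [("?x", "livesIn", "Virginia"),
                  ("?x", "owns", "?vehicle"),
                  ("?vehicle", "vehicleType", "SUV")]

mapping = find_mapping(stored_patterns, query_patterns)
# Re-label the pre-computed rows with the incoming query's variable names.
bindings = [{mapping[col]: val for col, val in zip(stored_columns, row)}
            for row in index_rows]
```

Once a mapping is found, the matched patterns can be answered from the index result table instead of being joined at query time.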

Query Optimization Results
 Ran 14 queries against the Lehigh University Benchmark (LUBM)
dataset (33.34 million triples)
 LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
– Remaining queries were executed 12 times
 Cluster Specs:
– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and
48 GB RAM
 Results indicate that cardinality and join selectivity optimizations provide
improved or comparable performance

Summary
 Cardinality estimation and join selectivity can
improve query response times for ad hoc queries
 Effects of join selectivity are more apparent for
complex queries over large datasets
 Pre-computed joins are extremely useful for
optimizing common queries
 Potentially avoid large number of join operations
 Maintaining pre-computed join indices is difficult

Useful Links
 SPARQL
 http://www.w3.org/TR/rdf-sparql-query/
 http://jena.apache.org/tutorials/sparql.html
 RDF
 http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
 Rya
 https://github.com/LAS-NCSU/rya
– Source on github: Provides documentation and sample client code
– Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
 Rya Working Group
– Monthly telecon / update on progress, issues, upcoming features
– Email Puja Valiyil (puja.valiyil@parsons.com) to join (US Citizens only)
 Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view
 Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
 Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the
clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
 Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya.
Information Systems Journal (2013).
http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf

Next Steps
 Maintaining pre-computed join indices
 Dynamically determining potential pre-computed
joins
 Distributing query planning and execution
 SPARK SQL
 Rya backed by other datastores
 Fully open sourcing Rya

Sample LUBM Queries (1 of 3)
Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}
}
Query 3
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}

Sample LUBM Queries (2 of 3)
Query 7
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}
Query 8
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}

Sample LUBM Queries (3 of 3)
Query 9
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}
Query 11
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}
