Rya: Optimizations to Support Real
Time Graph Queries on Accumulo
Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu
DISTRIBUTION STATEMENT A. Approved for
public release; distribution is unlimited.
ONR Case Number 43-279-15 JB.01.2015
Acknowledgements
 This work is the collective effort of:
 Parsons’ Rya Team, sponsored by the Department of
the Navy, Office of Naval Research
 Rya Founders: Roshan Punnoose, Adina Crainiceanu,
and David Rapp
Overview
 Rya Overview
 Query Execution within Rya
 Query Optimizations
 Results
 Summary
Background: Rya and RDF
 Rya: Resource Description Framework (RDF)
Triplestore built on top of Accumulo
 RDF: W3C standard for representing
linked/graph data
 Represents data as statements (assertions) about
resources
– Serialized as triples in {subject, predicate, object}
form
– Example:
• {Caleb, worksAt, Parsons}
• {Caleb, livesIn, Virginia}
[Figure: graph with node Caleb connected to Parsons by a worksAt edge and to Virginia by a livesIn edge]
Background: SPARQL
 RDF Queries are described using SPARQL
 SPARQL Protocol and RDF Query Language
 SQL-like syntax for finding triples matching
specific patterns
 Look for subgraphs that match triple statement patterns
 Joins are performed when there are variables common
to two or more statement patterns
SELECT ?people WHERE {
?people <worksAt> <Parsons>.
?people <livesIn> <Virginia>.
}
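As an illustration of how shared variables turn statement patterns into joins, here is a minimal in-memory sketch. It is not Rya or OpenRDF code: the function names are hypothetical, terms are plain strings, and the person "Dana" is made-up data added so that the join has something to filter out.

```python
# Variables are strings starting with "?"; everything else is a constant.
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`, or return None."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Evaluate statement patterns left to right, joining on shared variables."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for binding in solutions
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return solutions


triples = [
    ("Caleb", "worksAt", "Parsons"),
    ("Caleb", "livesIn", "Virginia"),
    ("Dana", "worksAt", "Parsons"),   # hypothetical extra data
    ("Dana", "livesIn", "Maryland"),  # hypothetical extra data
]
query = [("?people", "worksAt", "Parsons"), ("?people", "livesIn", "Virginia")]
result = evaluate(query, triples)
```

Because both patterns use ?people, the second pattern only keeps bindings produced by the first, which is the join described above.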
Rya Architecture
 Open RDF Interface for interacting with RDF data
stored on Accumulo
 Open RDF (Sesame): Open
Source Java framework for
storing and querying RDF
data
 Open RDF provides several
interfaces/abstractions
central to interacting with
an RDF datastore
– SAIL interface for interacting with underlying persisted
RDF model
– SAIL: Storage And Inference Layer
[Figure: architecture diagram with components SPARQL, Rya Open RDF, Rya QueryPlanner, and Accumulo; annotations: query processing in SAIL layer, data storage layer]
Storage: Triple Table Index
 3 Tables
 SPO : subject, predicate, object
 POS : predicate, object, subject
 OSP : object, subject, predicate
 Store triples in the RowID of the table
 Store graph name in the Column Family
 Advantages:
 Native lexicographical sorting of row keys → fast range queries
 All patterns can be translated into a scan of one of these tables
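To make the last point concrete, the sketch below maps a triple pattern to one of the three tables and a row-key prefix. The slides do not spell out Rya's exact selection rules, so the rule set here is an assumption: pick the table whose key order starts with the bound elements. The function name is hypothetical, and None stands for a variable.

```python
def choose_index(subject=None, predicate=None, obj=None):
    """Return (table, prefix) for a pattern; None marks a variable position."""
    s, p, o = subject is not None, predicate is not None, obj is not None
    if s and p and o:
        return "SPO", (subject, predicate, obj)
    if s and p:
        return "SPO", (subject, predicate)
    if p and o:
        return "POS", (predicate, obj)
    if o and s:
        return "OSP", (obj, subject)
    if s:
        return "SPO", (subject,)
    if p:
        return "POS", (predicate,)
    if o:
        return "OSP", (obj,)
    return "SPO", ()  # nothing bound: scan a whole table


# ?x <worksAt> <Parsons> is answered by a POS prefix scan.
table, prefix = choose_index(predicate="worksAt", obj="Parsons")
```

Every combination of bound positions is a prefix of one of the three key orders, which is why three tables are enough.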
Rya Query Execution
 Implemented OpenRDF Sesame SAIL API
 Parse queries, generate initial query plan, execute plan
 Triple patterns map to range queries in Accumulo
SELECT ?x WHERE { ?x <worksAt> <Parsons>.
?x <livesIn> <Virginia>. }
Step 1: POS Table – scan range
…
worksAt, Netflix, Dan
worksAt, OfficeMax, Zack
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
…
Step 2: for each ?x, SPO – index lookup
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
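The two steps can be imitated with sorted Python lists standing in for the POS and SPO tables. This is only a sketch of the access pattern, not the Accumulo API: the NUL delimiter, the row-key layout, and the helper names are assumptions made for the example.

```python
import bisect

SEP = "\x00"  # hypothetical delimiter between row-key components


def row(*parts):
    return SEP.join(parts)


def prefix_scan(sorted_rows, prefix):
    """Return all rows starting with `prefix`, using the sort order as the index."""
    lo = bisect.bisect_left(sorted_rows, prefix)
    hi = bisect.bisect_left(sorted_rows, prefix + "\uffff")
    return sorted_rows[lo:hi]


def contains(sorted_rows, key):
    """Exact-key lookup by binary search."""
    i = bisect.bisect_left(sorted_rows, key)
    return i < len(sorted_rows) and sorted_rows[i] == key


pos_table = sorted([
    row("worksAt", "Netflix", "Dan"),
    row("worksAt", "OfficeMax", "Zack"),
    row("worksAt", "Parsons", "Bob"),
    row("worksAt", "Parsons", "Greta"),
    row("worksAt", "Parsons", "John"),
])
spo_table = sorted([
    row("Bob", "livesIn", "Georgia"),
    row("Greta", "livesIn", "Virginia"),
    row("John", "livesIn", "Virginia"),
])

# Step 1: one range scan over POS for the prefix (worksAt, Parsons).
candidates = [r.split(SEP)[2] for r in prefix_scan(pos_table, row("worksAt", "Parsons") + SEP)]
# Step 2: one SPO lookup per candidate binding of ?x.
answers = [x for x in candidates if contains(spo_table, row(x, "livesIn", "Virginia"))]
```

Step 2 costs one lookup per row returned by Step 1, so the size of the first scan drives the total work.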
More Complicated Example of Rya Query Execution
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
Step 1: POS Table – scan range for worksAt, Parsons (?x worksAt Parsons)
…
worksAt, Netflix, Dan
worksAt, Parsons, Bob
worksAt, Parsons, Greta
worksAt, Parsons, John
worksAt, PlayStation, Alice
…
Step 2: For each ?x, SPO Table lookup (?x livesIn Virginia)
…
Bob, livesIn, Georgia
…
Greta, livesIn, Virginia
…
John, livesIn, Virginia
…
Step 3: For each remaining ?x, SPO Table lookup (?x commuteMethod bike)
…
Greta, commuteMethod, bike
…
John, commuteMethod, Bus
…
Challenges in Query Execution
 Scalability and Responsiveness
 Massive amounts of data
 Potentially large numbers of comparisons
 Consider the Previous Example:
 Default query execution: comparing each “?x” returned from first
statement pattern query to all subsequent triple patterns
 There are 8.3 million Virginia residents, about 15,000 Parsons
employees, and 750,000 people who commute via bike.
 Only 100 people who work at Parsons commute via bike while 1000
people who work at Parsons live in Virginia.
Poor query execution plans can result in simple queries
taking minutes as opposed to milliseconds
SELECT ?x WHERE {
?x <livesIn> Virginia.
?x <worksAt> Parsons.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <livesIn> Virginia.
?x <commuteMethod> bike.
}
vs.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
}
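A rough count of the work in each ordering, using only the figures quoted on this slide. The cost model is a simplifying assumption: the first pattern is a range scan, and each later pattern costs one index lookup per binding that survived the previous step.

```python
# Figures quoted on the slide.
VIRGINIA = 8_300_000
PARSONS = 15_000
PARSONS_AND_VIRGINIA = 1_000
PARSONS_AND_BIKE = 100


def work(first_scan, after_second):
    """Rows scanned for pattern 1, plus one lookup per binding for patterns 2 and 3."""
    return {"scanned": first_scan, "lookups": first_scan + after_second}


plans = {
    "livesIn, worksAt, commuteMethod": work(VIRGINIA, PARSONS_AND_VIRGINIA),
    "worksAt, livesIn, commuteMethod": work(PARSONS, PARSONS_AND_VIRGINIA),
    "worksAt, commuteMethod, livesIn": work(PARSONS, PARSONS_AND_BIKE),
}
```

Under this model the first ordering performs about 8.3 million lookups, while the other two perform about 16,000 and 15,100, which is the gap between minutes and milliseconds that the slide describes.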
Rya Query Optimizations
 Goal: Optimize query execution (joins) to better
support real time responsiveness
 Three Approaches:
 Reduce the number of joins: Pattern Based Indices
– Pre-calculate common joins
 Limit data in joins: Use more stats to improve query
planning
– Cardinality estimation on individual statement patterns
– Join selectivity estimation on pairs of statement patterns
 Make joins more efficient: Distribute the Join Processing
– Distribute processing using Spark SQL or MapReduce
– Use Hash Joins and Intersecting Iterators
– Work in this area is just beginning
Rya Query Optimizations Using Cardinalities
 Goal: Optimize ordering of query execution to
reduce the number of comparison operations
 Order execution based on the number of triples that
match each triple pattern
SELECT ?x WHERE {
?x <worksAt> Parsons. # 15k matches
?x <commuteMethod> bike. # 750k matches
?x <livesIn> Virginia. # 8.3M matches
}
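Ordering by cardinality amounts to a sort on the per-pattern match counts. The sketch below uses the counts from this slide; the dictionary and variable names are hypothetical, and the unoptimized input order is chosen only for illustration.

```python
# Match counts per statement pattern, as quoted on the slide.
cardinality = {
    "?x <livesIn> Virginia": 8_300_000,
    "?x <worksAt> Parsons": 15_000,
    "?x <commuteMethod> bike": 750_000,
}

# The order in which a user might have written the patterns.
as_written = [
    "?x <livesIn> Virginia",
    "?x <worksAt> Parsons",
    "?x <commuteMethod> bike",
]

# Evaluate the most selective pattern first.
ordered = sorted(as_written, key=cardinality.get)
```

The pattern with the fewest matches becomes the initial range scan, so every later lookup runs against the smallest possible set of bindings.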
Rya Cardinality Usage
 Maintain cardinalities for the following combinations of
triple pattern elements:
 Single elements: Subject, Predicate, Object
 Composite elements: Subject-Predicate, Subject-Object,
Predicate-Object
 Computed periodically using MapReduce
 Row ID:
– <CardinalityType><TripleElements>
• OBJECT, Parsons
• PREDICATEOBJECT, worksAt, Parsons
 Cardinality stored in the value
 Sparse table: Only store cardinalities above a threshold
 Only need to recompute cardinalities if the
distribution of the data changes significantly
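The table layout can be sketched with an in-memory counter in place of the MapReduce job. Only the OBJECT and PREDICATEOBJECT type names appear on the slide; the other four names are assumed by analogy, tuples stand in for the concatenated row ID, and the threshold semantics (keep counts strictly above the threshold) are an assumption.

```python
from collections import Counter


def build_cardinality_table(triples, threshold=0):
    """Count every single and composite element combination; keep counts above threshold."""
    counts = Counter()
    for s, p, o in triples:
        counts[("SUBJECT", s)] += 1
        counts[("PREDICATE", p)] += 1
        counts[("OBJECT", o)] += 1
        counts[("SUBJECTPREDICATE", s, p)] += 1
        counts[("SUBJECTOBJECT", s, o)] += 1
        counts[("PREDICATEOBJECT", p, o)] += 1
    # Sparse table: rows at or below the threshold are simply not stored.
    return {row_id: n for row_id, n in counts.items() if n > threshold}


triples = [
    ("Bob", "worksAt", "Parsons"),
    ("Greta", "worksAt", "Parsons"),
    ("John", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
    ("Greta", "livesIn", "Virginia"),
    ("John", "livesIn", "Virginia"),
]
full_table = build_cardinality_table(triples)
sparse_table = build_cardinality_table(triples, threshold=1)
```

A query planner reading the sparse table would need to treat a missing row as "at most the threshold", which is still enough to rank rare patterns ahead of common ones.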
Limitations of Cardinality Approach
 Consider a more complicated query
 Cardinality approach does not take into account
number of results returned by joins
 Solution lies in estimating the “join selectivity” for
each pair of triple patterns
SELECT ?x WHERE {
?x <worksAt> Parsons. # 15k matches
?x <commuteMethod> bike. # 750k matches
?vehicle <vehicleType> SUV. # 2.1M matches
?x <livesIn> Virginia. # 8.3M matches
?x <owns> ?vehicle. # 254M matches
}
Rya Query Optimizations Using Join Selectivity
Query optimized using only Cardinality Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?vehicle <vehicleType> SUV.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
}
Query optimized using Cardinality and Join Selectivity Info:
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
 Join Selectivity measures number of results returned by joining two
triple patterns
 Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas
Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008
 Due to computational complexity, estimate of join selectivity for triple
patterns is pre-computed and stored in Accumulo
 Join selectivity estimated by computing the number of results obtained
when each triple pattern is joined with the full table
Join Selectivity: General Algorithm
 For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a
variable and p1, o1 , p2, o2 constant, estimate the number of results
 Sel(<?x, p1, o1> ⋈ <?x, ?y, ?z>) and Sel(<?x, p2, o2> ⋈ <?x, ?y, ?z>)
give the number of results returned by joining a statement pattern with
the full table along the subject component
 Full table join statistics precomputed and stored in index
 Join statistics for each triple pattern computed using the following equation:
[Equation image not preserved in the slide text]
 Use analogous definition if variables appear in predicate or object position
 Join selectivity statistics used with cardinalities to generate more
efficient query plans
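The slide text does not say how the two full-table join counts are combined. One plausible combination, assumed here and not confirmed by the slides, treats the two patterns as independent in the style of the RDF-3X approach cited earlier: each pattern's selectivity against the full table is its join count divided by (pattern cardinality × table size), and the pairwise selectivity is the product of the two. All function and parameter names are hypothetical.

```python
def full_table_selectivity(join_count, pattern_card, total_triples):
    """Selectivity of joining one pattern with the full table (assumed definition)."""
    return join_count / (pattern_card * total_triples)


def estimated_selectivity(join1, card1, join2, card2, total_triples):
    """Independence assumption: multiply the two full-table selectivities."""
    return (full_table_selectivity(join1, card1, total_triples)
            * full_table_selectivity(join2, card2, total_triples))


def estimated_join_size(join1, card1, join2, card2, total_triples):
    """Expected number of results of joining the two patterns."""
    return card1 * card2 * estimated_selectivity(join1, card1, join2, card2, total_triples)


# Toy numbers: 7 triples in total; pattern 1 matches 3 triples and joins with
# 7 full-table rows; pattern 2 matches 2 triples and joins with 5 rows.
sel = estimated_selectivity(join1=7, card1=3, join2=5, card2=2, total_triples=7)
size = estimated_join_size(join1=7, card1=3, join2=5, card2=2, total_triples=7)
```

The estimate depends only on per-pattern statistics, which is what allows it to be precomputed once per pattern rather than once per pair of patterns.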
Join Selectivity: Integration into Rya
 Join Selectivity estimates used to optimize Rya queries
through a greedy algorithm approach
 Query constructed starting with first triple pattern to be
evaluated (the pattern with the smallest cardinality) and then
patterns are added based on minimization of a cost function
 Cost function
 C = leftCard + rightCard + leftCard*rightCard*selectivity
 C measures number of entries Accumulo must scan and the
number of comparisons required to perform the join
 Selectivity set to one if two triple patterns share no common
variables, otherwise precomputed estimates used
 Ensures that patterns with common variables are grouped
together
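A minimal sketch of the greedy ordering with the cost function above. Two things are assumptions, since the slides leave them open: the left operand is taken to be the pattern most recently added to the plan, and every pair of patterns that shares a variable gets the same made-up selectivity of 1e-8 instead of a precomputed estimate. Cardinalities are the ones from the earlier example query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    text: str
    variables: frozenset
    cardinality: int


def selectivity(left, right, shared_estimate=1e-8):
    """1.0 when no variable is shared; otherwise a stand-in for the stored estimate."""
    return shared_estimate if left.variables & right.variables else 1.0


def join_cost(left, right):
    """C = leftCard + rightCard + leftCard * rightCard * selectivity."""
    return (left.cardinality + right.cardinality
            + left.cardinality * right.cardinality * selectivity(left, right))


def greedy_order(patterns):
    remaining = list(patterns)
    current = min(remaining, key=lambda p: p.cardinality)  # smallest cardinality first
    remaining.remove(current)
    plan = [current]
    while remaining:
        best = min(remaining, key=lambda p: join_cost(current, p))
        remaining.remove(best)
        plan.append(best)
        current = best
    return plan


patterns = [
    Pattern("?x <worksAt> Parsons", frozenset({"?x"}), 15_000),
    Pattern("?x <commuteMethod> bike", frozenset({"?x"}), 750_000),
    Pattern("?vehicle <vehicleType> SUV", frozenset({"?vehicle"}), 2_100_000),
    Pattern("?x <livesIn> Virginia", frozenset({"?x"}), 8_300_000),
    Pattern("?x <owns> ?vehicle", frozenset({"?x", "?vehicle"}), 254_000_000),
]
plan = [p.text for p in greedy_order(patterns)]
```

With these inputs the vehicleType pattern is pushed to the end: joining it before ?x <owns> ?vehicle would be a cross product, and the leftCard*rightCard term with selectivity 1 makes that far more expensive than any join on a shared variable.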
Construction of Selectivity Tables
 For the pattern <?x, p1, o1>, associate each RDF triple of
the form <c, p1, o1> with the cardinality |<c,?y,?z>| and
then sum the results
 Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits
the key-value pair (c, (p1, o1))
 Map Job 2 processes the cardinality table and emits the key-value
pair (c, |<c,?y,?z>|), which consists of the constant c
and its single-component subject cardinality for the table
 Map Job 3 merges the results from jobs 1 and 2 by emitting
the key-value pair ((p1, o1), |<c,?y,?z>|)
 Map Job 4 sums the cardinalities from those key-value pairs
containing (p1, o1) as a key, and the result is written to the
selectivity table
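The four jobs can be followed on a toy dataset with plain Python collections in place of MapReduce. This is a sketch of the data flow only: in the deck, job 2 reads the subject cardinalities from the cardinality table, whereas here they are counted directly from the triples, and the function name is hypothetical.

```python
from collections import defaultdict


def build_selectivity_table(spo_triples):
    # Job 1: for each triple <c, p1, o1>, emit (c, (p1, o1)).
    job1 = [(s, (p, o)) for s, p, o in spo_triples]

    # Job 2: emit (c, |<c, ?y, ?z>|), the subject cardinality of each constant c.
    subject_card = defaultdict(int)
    for s, _, _ in spo_triples:
        subject_card[s] += 1

    # Job 3: merge jobs 1 and 2 on c, emitting ((p1, o1), |<c, ?y, ?z>|).
    job3 = [(po, subject_card[c]) for c, po in job1]

    # Job 4: sum the cardinalities per (p1, o1) key.
    table = defaultdict(int)
    for po, card in job3:
        table[po] += card
    return dict(table)


triples = [
    ("Greta", "worksAt", "Parsons"),
    ("Greta", "livesIn", "Virginia"),
    ("Greta", "commuteMethod", "bike"),
    ("John", "worksAt", "Parsons"),
    ("John", "livesIn", "Virginia"),
    ("Bob", "worksAt", "Parsons"),
    ("Bob", "livesIn", "Georgia"),
]
selectivity_table = build_selectivity_table(triples)
```

Each entry is the number of rows produced when the pattern <?x, p1, o1> is joined with the full table on the subject, for example 3 + 2 + 2 = 7 for (worksAt, Parsons).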
Query Optimizations Using Pre-Computed Joins
 Reduce joins by pre-computing common joins
 Approach taken from: Heese, Ralf, et al. "Index Support for
SPARQL." European Semantic Web Conference, Innsbruck,
Austria. 2007.
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Pre-compute using
batch processing
and look up during
query execution
Query Optimizations Using Pre-Computed Joins
SELECT ?x WHERE {
?x <worksAt> Parsons.
?x <commuteMethod> bike.
?x <livesIn> Virginia.
?x <owns> ?vehicle.
?vehicle <vehicleType> SUV.
}
Stored SPARQL:
SELECT ?person ?car
WHERE {
?person <livesIn> Virginia.
?person <owns> ?car.
?car <vehicleType> SUV.
}
Index Result Table:
…
Aaron, ToyotaRav4
Caleb, JeepCherokee
Puja, HondaCRV
…
1. Pre-compute a portion of the query
using MapReduce
2. Store SPARQL describing the query
along with pre-computed values in
Accumulo
3. Normalize query variables to match
stored SPARQL variables during
query execution
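Step 3 can be sketched as a small unification problem: find a renaming of the stored query's variables under which each stored pattern appears in the incoming query. The backtracking matcher below is an illustrative assumption, not Rya's implementation, and all function names are hypothetical.

```python
def is_var(term):
    return term.startswith("?")


def unify(index_pat, query_pat, mapping):
    """Extend mapping (stored variable -> query variable), or return None."""
    extended = dict(mapping)
    for i_term, q_term in zip(index_pat, query_pat):
        if is_var(i_term):
            if not is_var(q_term) or extended.setdefault(i_term, q_term) != q_term:
                return None
        elif i_term != q_term:
            return None
    # Two stored variables must not collapse onto one query variable.
    if len(set(extended.values())) != len(extended):
        return None
    return extended


def match_index(index_pats, query_pats, mapping=None):
    """Backtracking search for a renaming that embeds the stored patterns in the query."""
    mapping = {} if mapping is None else mapping
    if not index_pats:
        return mapping
    first, rest = index_pats[0], index_pats[1:]
    for q in query_pats:
        extended = unify(first, q, mapping)
        if extended is not None:
            found = match_index(rest, query_pats, extended)
            if found is not None:
                return found
    return None


stored = [("?person", "livesIn", "Virginia"), ("?person", "owns", "?car"),
          ("?car", "vehicleType", "SUV")]
query = [("?x", "worksAt", "Parsons"), ("?x", "commuteMethod", "bike"),
         ("?x", "livesIn", "Virginia"), ("?x", "owns", "?vehicle"),
         ("?vehicle", "vehicleType", "SUV")]
index_rows = [("Aaron", "ToyotaRav4"), ("Caleb", "JeepCherokee"), ("Puja", "HondaCRV")]

renaming = match_index(stored, query)
# Re-label the pre-computed rows with the incoming query's variable names.
bindings = [{renaming["?person"]: person, renaming["?car"]: car} for person, car in index_rows]
```

Once the renaming is found, the three matched patterns can be replaced by a scan of the index result table, leaving only the worksAt and commuteMethod patterns to be joined at query time.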
Query Optimization Results
 Ran 14 queries against the Lehigh University Benchmark (LUBM)
dataset (33.34 million triples)
 LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
– Remaining queries were executed 12 times
 Cluster Specs:
– 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and
48 GB RAM
 Results indicate that cardinality and join selectivity optimizations provide
improved or comparable performance
Summary
 Cardinality estimation and join selectivity can
improve query response times for ad hoc queries
 Effects of join selectivity are more apparent for
complex queries over large datasets
 Pre-computed joins are extremely useful for
optimizing common queries
 Potentially avoid large number of join operations
 Maintaining pre-computed join indices is difficult
Questions?
BACK-UP
Useful Links
 SPARQL
 http://www.w3.org/TR/rdf-sparql-query/
 http://jena.apache.org/tutorials/sparql.html
 RDF
 http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
 Rya
 https://github.com/LAS-NCSU/rya
– Source on github: Provides documentation and sample client code
– Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
 Rya Working Group
– Monthly telecon / update on progress, issues, upcoming features
– Email Puja Valiyil (puja.valiyil@parsons.com) to join (US Citizens only)
 Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view
 Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
 Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the
clouds. Proceedings of the 1st International Workshop on Cloud Intelligence.
http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
 Punnoose R., Crainiceanu A., Rapp D. 2013. SPARQL in the Clouds Using Rya.
Information Systems Journal.
http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
Next Steps
 Maintaining pre-computed join indices
 Dynamically determining potential pre-computed
joins
 Distributing query planning and execution
 Spark SQL
 Rya backed by other datastores
 Fully open sourcing Rya
Sample LUBM Queries (1 of 3)
Query 1
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:GraduateStudent .
?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>}
}
Query 3
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Publication .
?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>}
}
Sample LUBM Queries (2 of 3)
Query 7
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Course .
?X ub:takesCourse ?Y .
<http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y}
}
Query 8
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Department .
?X ub:memberOf ?Y .
?Y ub:subOrganizationOf <http://www.University0.edu> .
?X ub:emailAddress ?Z}
}
Sample LUBM Queries (3 of 3)
Query 9
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X ?Y ?Z WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:Student .
?Y rdf:type ub:Faculty .
?Z rdf:type ub:Course .
?X ub:advisor ?Y .
?Y ub:teacherOf ?Z .
?X ub:takesCourse ?Z}
}
Query 11
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
SELECT ?X WHERE
{ GRAPH <http://LUBM>
{?X rdf:type ub:ResearchGroup .
?X ub:subOrganizationOf <http://www.University0.edu>}
}


Accumulo Summit 2015: Rya: Optimizations to Support Real Time Graph Queries on Accumulo [Frameworks]

  • 1. Rya: Optimizations to Support Real Time Graph Queries on Accumulo Dr. Caleb Meier, Puja Valiyil, Aaron Mihalik, Dr. Adina Crainiceanu DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. ONR Case Number 43-279-15 JB.01.2015
  • 2. 22 Acknowledgements  This work is the collective effort of:  Parsons’ Rya Team, sponsored by the Department of the Navy, Office of Naval Research  Rya Founders: Roshan Punnoose, Adina Crainiceanu, and David Rapp
  • 3. 33 Overview  Rya Overview  Query Execution within Rya  Query Optimizations  Results  Summary
  • 4. 44 Background: Rya and RDF  Rya: Resource Description Framework (RDF) Triplestore built on top of Accumulo  RDF: W3C standard for representing linked/graph data  Represents data as statements (assertions) about resources – Serialized as triples in {subject, predicate, object} form – Example: • {Caleb, worksAt, Parsons} • {Caleb, livesIn, Virginia} Caleb Parsons Virginia worksAt livesIn
  • 5. 55 Background: SPARQL  RDF Queries are described using SPARQL  SPARQL Protocol and RDF Query Language  SQL-like syntax for finding triples matching specific patterns  Look for subgraphs that match triple statement patterns  Joins are performed when there are variables common to two or more statement patterns SELECT ?people WHERE { ?people <worksAt> <Parsons>. ?people <livesIn> <Virginia>. }
  • 6. 66 Rya Architecture  Open RDF Interface for interacting with RDF data stored on Accumulo  Open RDF (Sesame): Open Source Java framework for storing and querying RDF data  Open RDF Provides several interfaces/abstractions central for interacting with a RDF datastore – SAIL interface for interacting with underlying persisted RDF model – SAIL: Storage And Inference Layer Data storage layer Query processing in SAIL layer SPARQL Rya Open RDF Rya QueryPlanner Accumulo
  • 7. 77 Storage: Triple Table Index  3 Tables  SPO : subject, predicate, object  POS : predicate, object, subject  OSP : object, subject, predicate  Store triples in the RowID of the table  Store graph name in the Column Family  Advantages:  Native lexicographical sorting of row keys  fast range queries  All patterns can be translated into a scan of one of these tables
  • 8. 88 Overview  Rya Overview  Query Execution within Rya  Query Optimizations  Results  Summary
  • 9. 99 … worksAt, Netflix, Dan worksAt, OfficeMax, Zack worksAt, Parsons, Bob worksAt, Parsons, Greta worksAt, Parsons, John … Rya Query Execution  Implemented OpenRDF Sesame SAIL API  Parse queries, generate initial query plan, execute plan  Triple patterns map to range queries in Accumulo SELECT ?x WHERE { ?x <worksAt> <Parsons>. ?x <livesIn> <Virginia>. } Step 1: POS Table – scan range … Bob, livesIn, Georgia … Greta, livesIn, Virginia … John, livesIn, Virginia … Step 2: for each ?x, SPO – index lookup
  • 10. 1010 More Complicated Example of Rya Query Execution Step 2: For each ?x, SPO Table lookup … Greta, commuteMethod, bike … John, commuteMethod, Bus … Step 3: For each remaining ?x, SPO Table lookup Step 1: POS Table – scan range for worksAt, Parsons ?x livesIn Virginia?x worksAt Parsons ?x commuteMethod bike … worksAt, Netflix, Dan worksAt, Parsons, Bob worksAt, Parsons, Greta worksAt, Parsons, John worksAt, PlayStation, Alice … … Bob, livesIn, Georgia … Greta, livesIn, Virginia … John, livesIn, Virginia … SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <livesIn> Virginia. ?x <commuteMethod> bike. }
  • 11. 1111 Challenges in Query Execution  Scalability and Responsiveness  Massive amounts of data  Potentially large amounts of comparisons  Consider the Previous Example:  Default query execution: comparing each “?x” returned from first statement pattern query to all subsequent triple patterns  There are 8.3 million Virginia residents, about 15,000 Parsons employees, and 750,000 people who commute via bike.  Only 100 people who work at Parsons commute via bike while 1000 people who work at Parsons live in Virginia. Poor query execution plans can result in simple queries taking minutes as opposed to milliseconds SELECT ?x WHERE { ?x <livesIn> Virginia. ?x <worksAt> Parsons. ?x <commuteMethod> bike. } SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <livesIn> Virginia. ?x <commuteMethod> bike. } SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?x <livesIn> Virginia. } vs. vs.
  • 12. 1212 Overview  Rya Overview  Query Execution within Rya  Query Optimizations  Results  Summary
  • 13. 1313 Rya Query Optimizations  Goal: Optimize query execution (joins) to better support real time responsiveness  Three Approaches:  Reduce the number of joins: Pattern Based Indices – Pre-calculate common joins  Limit data in joins: Use more stats to improve query planning – Cardinality estimation on individual statement patterns – Join selectivity estimation on pairs of statement patterns  Make joins more efficient: Distribute the Join Processing – Distribute processing using SPARK SQL or MapReduce – Use Hash Joins and Intersecting Iterators – Just beginning to start looking at this
  • 14. 1414 Rya Query Optimizations Using Cardinalities  Goal: Optimize ordering of query execution to reduce the number of comparison operations  Order execution based on the number of triples that match each triple pattern SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?x <livesIn> Virginia. } 8.3M matches 15k matches 750k matches
  • 15. 1515 Rya Cardinality Usage  Maintain cardinalities on the following triple patterns element combinations:  Single elements: Subject, Predicate, Object  Composite elements: Subject-Predicate, Subject-Object, Predicate-Object  Computed periodically using MapReduce  Row ID: – <CardinalityType><TripleElements> • OBJECT, Parsons • PREDICATEOBJECT, worksAt, Parsons  Cardinality stored in the value  Sparse table: Only store cardinalities above a threshold  Only need to recompute cardinalities if the distribution of the data changes significantly
  • 16. 1616 Limitations of Cardinality Approach  Consider a more complicated query  Cardinality approach does not take into account number of results returned by joins  Solution lies in estimating the “join selectivity” for a each pair of triples SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?vehicle <vehicleType> SUV. ?x <livesIn> Virginia. ?x <owns> ?vehicle. } 2.1M matches 15k matches 750k matches 8.3M matches 254M matches
  • 17. 1717 Rya Query Optimizations Using Join Selectivity Query optimized using only Cardinality Info: Query optimized using Cardinality and Join Selectivity Info: SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?vehicle <vehicleType> SUV. ?x <livesIn> Virginia. ?x <owns> ?vehicle. } SELECT ?x WHERE { ?x <worksAt> Parsons. ?x <commuteMethod> bike. ?x <livesIn> Virginia. ?x <owns> ?vehicle. ?vehicle <vehicleType> SUV. }  Join Selectivity measures number of results returned by joining two triple patterns  Approach taken from: RDF-3X: a RISC-style Engine for RDF by Thomas Neumann and Gerhard Weikum in JDMR (formerly Proc. VLDB) 2008  Due to computational complexity, estimate of join selectivity for triple patterns is pre-computed and stored in Accumulo  Join selectivity estimated by computing the number of results obtained when each triple pattern is joined with the full table
  • 18. 1818 Join Selectivity: General Algorithm  For statement patterns <?x, p1, o1> and <?x, p2, o2> with ?x a variable and p1, o1 , p2, o2 constant, estimate the number of results  Sel(<?x, p1, o1> <?x, ?y, ?z>) and Sel(<?x, p2, o2> <?x, ?y, ?z>) give number of results returned by joining a statement pattern with the full table along the subject component  Full table join statistics precomputed and stored in index  Join statistics for each triple pattern computed using following equation:  Use analogous definition if variables appear in predicate or object position  Join selectivity statistics used with cardinalities to generate more efficient query plans
  • 19. 1919 Join Selectivity: Integration into Rya  Join Selectivity estimates used to optimize Rya queries through a greedy algorithm approach  Query constructed starting with first triple pattern to be evaluated (the pattern with the smallest cardinality) and then patterns are added based on minimization of a cost function  Cost function  C = leftCard + rightCard + leftCard*rightCard*selectivity  C measures number of entries Accumulo must scan and the number of comparisons required to perform the join  Selectivity set to one if two triple patterns share no common variables, otherwise precomputed estimates used  Ensures that patterns with common variables are grouped together
  • 20. Construction of Selectivity Tables
      • For the pattern <?x, p1, o1>, associate each RDF triple of the form <c, p1, o1> with the cardinality |<c, ?y, ?z>| and then sum the results
      • Given a triple <c, p1, o1> in the SPO table, Map Job 1 emits the key-value pair (c, (p1, o1))
      • Map Job 2 processes the cardinality table and emits the key-value pair (c, |<c, ?y, ?z>|), which pairs the constant c with its single-component subject cardinality in the table
      • Map Job 3 merges the results of Jobs 1 and 2 by emitting the key-value pair ((p1, o1), |<c, ?y, ?z>|)
      • Map Job 4 sums the cardinalities from the key-value pairs that have (p1, o1) as key, and writes the result to the selectivity table
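The four jobs above can be mimicked in memory to see what ends up in the selectivity table. The sketch below is a single-process stand-in, not the MapReduce code used by Rya: the function name is hypothetical, and the subject cardinalities are counted directly from the triples rather than read from a precomputed cardinality table.

```python
from collections import defaultdict


def build_selectivity_table(triples):
    """Simulate the four jobs for patterns of the form <?x, p, o>."""
    # Job 1: for each triple <c, p, o>, emit (c, (p, o)).
    job1 = [(subject, (predicate, obj)) for subject, predicate, obj in triples]

    # Job 2: emit (c, |<c, ?y, ?z>|), the number of triples with subject c.
    # Assumption: counted here instead of read from a cardinality table.
    subject_cardinality = defaultdict(int)
    for subject, _, _ in triples:
        subject_cardinality[subject] += 1

    # Job 3: merge on c and emit ((p, o), |<c, ?y, ?z>|).
    job3 = [(po, subject_cardinality[c]) for c, po in job1]

    # Job 4: sum the cardinalities that share the same (p, o) key.
    table = defaultdict(int)
    for po, card in job3:
        table[po] += card
    return dict(table)


triples = [
    ("Caleb", "worksAt", "Parsons"),
    ("Caleb", "livesIn", "Virginia"),
    ("Caleb", "owns", "JeepCherokee"),
    ("Puja", "worksAt", "Parsons"),
    ("Puja", "livesIn", "Virginia"),
]

table = build_selectivity_table(triples)
print(table[("worksAt", "Parsons")])
```

For the pattern <?x, worksAt, Parsons> the table holds 5: Caleb is the subject of three triples and Puja of two, which is exactly the number of rows produced by joining that pattern with <?x, ?y, ?z> on the subject.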
  • 21. Query Optimizations Using Pre-Computed Joins
      • Reduce joins by pre-computing common joins
      • Approach taken from: Heese, Ralf, et al. "Index Support for SPARQL." European Semantic Web Conference, Innsbruck, Austria. 2007.
        SELECT ?x WHERE {
          ?x <worksAt> Parsons.
          ?x <commuteMethod> bike.
          ?x <livesIn> Virginia.
          ?x <owns> ?vehicle.
          ?vehicle <vehicleType> SUV.
        }
      Pre-compute using batch processing and look up during query execution
  • 22. Query Optimizations Using Pre-Computed Joins
        SELECT ?x WHERE {
          ?x <worksAt> Parsons.
          ?x <commuteMethod> bike.
          ?x <livesIn> Virginia.
          ?x <owns> ?vehicle.
          ?vehicle <vehicleType> SUV.
        }
      Stored SPARQL
        SELECT ?person ?car WHERE {
          ?person <livesIn> Virginia.
          ?person <owns> ?car.
          ?car <vehicleType> SUV.
        }
      Index Result Table
        ...
        Aaron, ToyotaRav4
        Caleb, JeepCherokee
        Puja, HondaCRV
        ...
      1. Pre-compute a portion of the query using MapReduce
      2. Store SPARQL describing the query along with pre-computed values in Accumulo
      3. Normalize query variables to match stored SPARQL variables during query execution
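Step 3 amounts to finding a renaming of the stored SPARQL variables under which every stored pattern appears in the incoming query. The backtracking matcher below is a minimal sketch of that idea under simplifying assumptions: patterns are plain (subject, predicate, object) tuples, index variables may only map to query variables, and distinct index variables must map to distinct query variables. The function names are hypothetical and this is not Rya's matching code.

```python
def is_variable(term):
    return term.startswith("?")


def extend_mapping(mapping, index_pattern, query_pattern):
    """Try to extend the variable mapping so index_pattern equals query_pattern.
    Return the extended mapping, or None if the patterns cannot be unified."""
    extended = dict(mapping)
    for index_term, query_term in zip(index_pattern, query_pattern):
        if is_variable(index_term):
            if not is_variable(query_term):
                return None
            if extended.setdefault(index_term, query_term) != query_term:
                return None
        elif index_term != query_term:
            return None
    # Distinct index variables must map to distinct query variables.
    if len(set(extended.values())) != len(extended):
        return None
    return extended


def match_index(index_patterns, query_patterns):
    """Return a dict from index variables to query variables if every index
    pattern occurs in the query under that renaming; otherwise None."""

    def search(position, mapping, used):
        if position == len(index_patterns):
            return mapping
        for i, query_pattern in enumerate(query_patterns):
            if i in used:
                continue
            extended = extend_mapping(mapping, index_patterns[position], query_pattern)
            if extended is not None:
                result = search(position + 1, extended, used | {i})
                if result is not None:
                    return result
        return None

    return search(0, {}, frozenset())


stored = [
    ("?person", "livesIn", "Virginia"),
    ("?person", "owns", "?car"),
    ("?car", "vehicleType", "SUV"),
]
query = [
    ("?x", "worksAt", "Parsons"),
    ("?x", "commuteMethod", "bike"),
    ("?x", "livesIn", "Virginia"),
    ("?x", "owns", "?vehicle"),
    ("?vehicle", "vehicleType", "SUV"),
]

mapping = match_index(stored, query)
print(mapping)
```

For the slide's example the matcher maps ?person to ?x and ?car to ?vehicle, so the three matched patterns could be answered from the Index Result Table while the worksAt and commuteMethod patterns are still evaluated against the triple tables.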
  • 23. Overview
      • Rya Overview
      • Query Execution within Rya
      • Query Optimizations
      • Results
      • Summary
  • 24. Query Optimization Results
      • Ran 14 queries against the Lehigh University Benchmark (LUBM) dataset (33.34 million triples)
      • LUBM queries 2, 5, 9, and 13 were discarded after 3 runs due to query complexity
          – Remaining queries were executed 12 times
      • Cluster Specs:
          – 8 worker nodes, each has 2 x 6-Core Xeon E5-2440 (2.4GHz) Processors and 48 GB RAM
      • Results indicate that cardinality and join selectivity optimizations provide improved or comparable performance
  • 25. Summary
      • Cardinality estimation and join selectivity can improve query response times for ad hoc queries
      • Effects of join selectivity are more apparent for complex queries over large datasets
      • Pre-computed joins are extremely useful for optimizing common queries
      • Potentially avoid large number of join operations
      • Maintaining pre-computed join indices is difficult
  • 28. Useful Links
      • SPARQL
          – http://www.w3.org/TR/rdf-sparql-query/
          – http://jena.apache.org/tutorials/sparql.html
      • RDF
          – http://www.w3.org/TR/2014/NOTE-rdf11-primer-20140225/
      • Rya
          – https://github.com/LAS-NCSU/rya
          – Source on github: Provides documentation and sample client code
          – Email Aaron Mihalik (aaron.mihalik@parsons.com) for access (US Citizens only)
      • Rya Working Group
          – Monthly telecon / update on progress, issues, upcoming features
          – Email Puja Valiyil puja.valiyil@parsons.com to join (US Citizens only)
      • Open RDF Tutorial: http://openrdf.callimachus.net/sesame/tutorials/getting-started.docbook?view
      • Open RDF Javadoc: http://openrdf.callimachus.net/sesame/2.7/apidocs/index.html
      • Punnoose R., Crainiceanu A., Rapp D. 2012. Rya: a scalable RDF triple store for the clouds. Proceedings of the 1st International Workshop on Cloud Intelligence. http://www.usna.edu/Users/cs/adina/research/Rya_CloudI2012.pdf
      • Roshan Punnoose, Adina Crainiceanu, David Rapp. SPARQL in the Clouds Using Rya. Information Systems Journal (2013). http://www.usna.edu/Users/cs/adina/research/Rya_ISjournal2013.pdf
  • 29. Next Steps
      • Maintaining pre-computed join indices
      • Dynamically determining potential pre-computed joins
      • Distributing query planning and execution
      • SPARK SQL
      • Rya backed by other datastores
      • Fully open sourcing Rya
  • 30. Sample LUBM Queries (1 of 3)
      Query 1
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:GraduateStudent .
            ?X ub:takesCourse <http://www.Department0.University0.edu/GraduateCourse0>
          }
        }
      Query 3
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:Publication .
            ?X ub:publicationAuthor <http://www.Department0.University0.edu/AssistantProfessor0>
          }
        }
  • 31. Sample LUBM Queries (2 of 3)
      Query 7
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X ?Y WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Course .
            ?X ub:takesCourse ?Y .
            <http://www.Department0.University0.edu/AssociateProfessor0> ub:teacherOf ?Y
          }
        }
      Query 8
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X ?Y ?Z WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Department .
            ?X ub:memberOf ?Y .
            ?Y ub:subOrganizationOf <http://www.University0.edu> .
            ?X ub:emailAddress ?Z
          }
        }
  • 32. Sample LUBM Queries (3 of 3)
      Query 9
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X ?Y ?Z WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:Student .
            ?Y rdf:type ub:Faculty .
            ?Z rdf:type ub:Course .
            ?X ub:advisor ?Y .
            ?Y ub:teacherOf ?Z .
            ?X ub:takesCourse ?Z
          }
        }
      Query 11
        PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
        PREFIX ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
        SELECT ?X WHERE {
          GRAPH <http://LUBM> {
            ?X rdf:type ub:ResearchGroup .
            ?X ub:subOrganizationOf <http://www.University0.edu>
          }
        }

Editor's Notes

  • #2: Abstract
    The Resource Description Framework (RDF) is a standard model for expressing graph data for the World Wide Web. Developed by the W3C, RDF and related technologies such as OWL and SKOS provide a rich vocabulary for exchanging graph data in a machine understandable manner. As the size of available data continues to grow, there has been an increased desire for methods of storing very large RDF graphs within big data architectures. Rya is a government open source scalable RDF triple store built on top of Apache Accumulo. Originally developed by the Laboratory for Telecommunication Sciences and US Naval Academy, Rya is currently being used by a number of government agencies for storing, inferencing, and querying large amounts of RDF data. As Rya’s user base has grown, there has been a stronger requirement for near real time query responsiveness over massive RDF graphs. In this talk, we detail several query optimization strategies the Rya team has pursued to better satisfy this requirement. We describe recent work allowing for the use of additional indices to eliminate large common joins within complex SPARQL queries. Additionally, we explain a number of statistics based optimizations to improve query planning. Specifically, we detail extensions to existing methods of estimating the selectivity of individual statement patterns (cardinality) and the selectivity of joining two statement patterns (join selectivity) to better fit a “big data” paradigm and utilize Accumulo. Finally, we share preliminary performance evaluation results for the optimizations that have been pursued.
    Speaker
    Dr. Caleb Meier, Engineer/Algorithm Developer, Parsons Corporation
    Dr. Meier received a PhD from the University of California San Diego (UCSD) in Mathematics in 2012. For the past two years, he was a postdoctoral fellow at UCSD's Math department specializing in non-linear elliptic systems of partial differential equations. He received his undergraduate degree in Mathematics from Yale University in 2006. Dr. Meier is currently working as an engineer at Parsons Corporation, specializing in query optimization algorithms for large scale RDF graphs. He is an expert in semantic technologies, Accumulo, the Hadoop Ecosystem, and is actually more fun to be around than his bio suggests.
    Schedule: 2:45-3:20 on April 29, 2015
  • #10: Find all US citizens that travel to Iran
  • #17: Triple patterns containing no common variables can still be joined, but doing so creates a cross (Cartesian) product. Among triple patterns with similar cardinalities and common variables, how should they be joined to obtain the best execution plan?
  • #22: Term “Pattern Based Index” taken from: Heese, Ralf, et al. "Index support for sparql." European Semantic Web Conference, Innsbruck, Austria. 2007.
    Issues
      – Query planning is difficult
      – Potentially exponentially increase index size
      – Maintaining an external index