MPTStore: A Fast, Scalable, and Stable Resource Index

Aaron Birkland and Chris Wilper. Open Repositories 2007, San Antonio, TX.

Background: RDF in Fedora. A natural fit: object-object relationships; object properties; exposure to services (as a graph). Resource Index introduced: Fedora 2.0 (January 2005).

Background: RDF in Fedora. Challenges. Scalability: few triplestores are designed for 100M+ triples. Performance: Jena vs. Kowari (Jena ran out of memory); Kowari vs. Sesame Native (Sesame was slow on complex queries). Stability: frequent "rebuilds".

Motivation: The NSDL Use Case. The NSDL has a moderately large repository: 4.7 million objects and 250 million triples. It also has a large volume of writes, driven by periodic OAI harvests: primarily mixed ingests and datastream modifications, with highly concurrent reads and writes.

Motivation: The NSDL Use Case. Additionally, the NSDL has data model constraints that must be enforced: existential/referential constraints on objects (e.g. "foreign key" constraints), and uniqueness constraints on some object properties.

Motivation: The NSDL Use Case. These constraints center primarily on RELS-EXT content: relationships to other NSDL objects (forming a graph), and literal-valued properties of the object itself.

<foxml:datastream ID="RELS-EXT" ...>
  ...
  <example:id>PLUGH-XYZZY</example:id>
  <example:objectType>Resource</example:objectType>
  <example:memberOf rdf:resource="info:fedora/demo:73" />
</foxml:datastream>
...
example:id: must be globally unique.
The object referenced by example:memberOf: 1) must exist; 2) must be 'Active'; 3) must be objectType 'Aggregation'.

Motivation: The NSDL Use Case. No suitable constraint enforcement mechanisms exist in Fedora itself. Our approach: enforce the content model in middleware; serialize access where we have to; query the RI before each ingest or modify.

The Challenge. Querying the RI to determine the correct repository state proved to be the most difficult aspect. To achieve acceptable performance with Kowari, triple writes are buffered and executed in large, infrequent chunks. Triples waiting in these buffers are invisible to outside queries.
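The visibility problem can be illustrated with a toy model. This is a minimal sketch, not Kowari's implementation: the class `BufferedTripleStore`, its `flush_threshold` parameter, and the example triple are all hypothetical names chosen for illustration.

```python
class BufferedTripleStore:
    """Toy store: writes sit in a buffer until flush() makes them queryable."""

    def __init__(self, flush_threshold=1000):
        self.flush_threshold = flush_threshold
        self._pending = []     # written, but not yet visible to queries
        self._visible = set()  # what queries can see

    def add(self, s, p, o):
        self._pending.append((s, p, o))
        # Only flush once enough writes have accumulated.
        if len(self._pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self._visible.update(self._pending)
        self._pending.clear()

    def contains(self, s, p, o):
        # Queries never look at the pending buffer.
        return (s, p, o) in self._visible


store = BufferedTripleStore(flush_threshold=1000)
triple = ("info:fedora/demo:1",
          "http://ns.example.org/rels#memberOf",
          "info:fedora/demo:73")
store.add(*triple)
seen_before_flush = store.contains(*triple)
store.flush()
seen_after_flush = store.contains(*triple)
```

A middleware check that runs between `add` and `flush` would conclude the triple does not exist, which is why querying the RI before an ingest could give the wrong answer.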

The Challenge. Possible solution: flush the buffer after every write operation. New problem: flushed updates with Kowari are very expensive, taking multiple seconds per operation. This was incompatible with NSDL processing volume, and was a real showstopper.

The Challenge. Other difficulties the NSDL had with Kowari: RI corruption under concurrent use; RI corruption after abnormal shutdowns; scalability (performance became noticeably worse with increasing repository size); steep memory requirements.

The Challenge. Searching for a solution: other triplestores (e.g. Jena, Sesame) had been considered for Fedora in the past and rejected for various reasons. An RDBMS seemed attractive: efficient transactions, very stable, generally speedy. However, the "one big table" paradigm did not give us the desired scalability in initial tests.

Our Solution. Mapped predicate tables: one table per predicate, containing indexed 'subject' and 'object' values, plus a mapping table containing metadata that correlates each predicate URI with a particular database table.

[Figure: Predicate Mapping and Triples. The mapping table tmap (columns pkey, p) holds keys 1 and 2 for the predicates <info:fedora/fedora-def:model#disseminates> and <http://ns.example.org/rels#memberOf>. The triples table t1 (columns s, o) holds subject and object values drawn from <info:fedora/demo:1> through <info:fedora/demo:4>.]
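The mapped-predicate layout can be sketched with Python's built-in sqlite3 module. This is an illustration under stated assumptions, not MPTStore's actual schema or code: the table and column names (`tmap`, `pkey`, `p`, `t1`, `s`, `o`) follow the slide, while the helper functions `table_for` and `add_triple` and the choice of which subjects pair with which objects are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mapping table: one row per predicate URI.
conn.execute("CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL)")


def table_for(conn, predicate):
    """Return the table name for a predicate, creating it on first use."""
    row = conn.execute("SELECT pkey FROM tmap WHERE p = ?", (predicate,)).fetchone()
    if row is not None:
        return f"t{row[0]}"
    cur = conn.execute("INSERT INTO tmap (p) VALUES (?)", (predicate,))
    name = f"t{cur.lastrowid}"  # table name derived from the integer key
    conn.execute(f"CREATE TABLE {name} (s TEXT NOT NULL, o TEXT NOT NULL)")
    conn.execute(f"CREATE INDEX {name}_s ON {name} (s)")
    conn.execute(f"CREATE INDEX {name}_o ON {name} (o)")
    return name


def add_triple(conn, s, p, o):
    """Insert one triple into the table that belongs to its predicate."""
    conn.execute(f"INSERT INTO {table_for(conn, p)} (s, o) VALUES (?, ?)", (s, o))


MEMBER_OF = "http://ns.example.org/rels#memberOf"
add_triple(conn, "info:fedora/demo:1", MEMBER_OF, "info:fedora/demo:2")
add_triple(conn, "info:fedora/demo:3", MEMBER_OF, "info:fedora/demo:4")

# A query with a known predicate touches exactly one small, indexed table.
members = conn.execute(
    "SELECT s FROM t1 WHERE o = ?", ("info:fedora/demo:2",)
).fetchall()
```

Because the predicate is encoded in the table name, each row stores only a subject and an object, and adding a triple is a single-row insert.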

Our Solution. Benefits: low-cost adds and deletes; queries with known predicates are very fast; complex queries benefit because the RDBMS planner has finer-grained statistics and query plans; flexible data partitioning.

Our Solution. Disadvantages: the predicate-to-table mapping must be managed; complex queries require more effort to formulate; with a naïve approach, the cost of simple queries with an unbound predicate scales linearly with the number of predicates.
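The linear scaling can be seen in a small sketch of the naïve approach, again using sqlite3 as a stand-in database. The function `triples_about`, the sample rows, and the `objectType` predicate URI are assumptions made for illustration; the source only shows the `example:` prefix, not its namespace.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL);
CREATE TABLE t1 (s TEXT NOT NULL, o TEXT NOT NULL);
CREATE TABLE t2 (s TEXT NOT NULL, o TEXT NOT NULL);
INSERT INTO tmap VALUES (1, 'http://ns.example.org/rels#memberOf');
INSERT INTO tmap VALUES (2, 'http://ns.example.org/rels#objectType');
INSERT INTO t1 VALUES ('info:fedora/demo:1', 'info:fedora/demo:73');
INSERT INTO t2 VALUES ('info:fedora/demo:1', 'Resource');
INSERT INTO t2 VALUES ('info:fedora/demo:73', 'Aggregation');
""")


def triples_about(conn, subject):
    """Answer (subject, ?p, ?o). The predicate is unbound, so every
    predicate table has to be visited once."""
    results = []
    tables_visited = 0
    mapping = conn.execute("SELECT pkey, p FROM tmap ORDER BY pkey").fetchall()
    for pkey, predicate in mapping:
        tables_visited += 1
        rows = conn.execute(f"SELECT o FROM t{pkey} WHERE s = ?", (subject,))
        for (o,) in rows:
            results.append((subject, predicate, o))
    return results, tables_visited


found, visited = triples_about(conn, "info:fedora/demo:1")
```

The number of tables visited equals the number of rows in `tmap`, regardless of how many triples match.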

Our Solution. Observations: the total number of distinct predicates is much lower than the number of subjects or objects (the NSDL has ~50); unbound-predicate queries are less common; the NSDL is heavily biased towards a high volume of writes and simple queries.

Our Solution. Enter MPTStore: a Java library that handles all mapping and accounting behind the scenes; provides an API for performing triple writes and queries; and translates queries from a particular language (e.g. SPO, SPARQL) into SQL statements.
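The translation step can be sketched for the simplest case, a single triple pattern with a bound predicate. This is a hypothetical illustration in Python, not MPTStore's Java API: the function `pattern_to_sql` and the `table_map` dictionary (standing in for the mapping table) are invented names.

```python
def pattern_to_sql(subject, predicate, obj, table_map):
    """Translate one triple pattern with a bound predicate into (sql, params).

    None marks an unbound subject or object. table_map maps a predicate
    URI to its table name.
    """
    table = table_map[predicate]
    conditions, params = [], []
    if subject is not None:
        conditions.append("s = ?")
        params.append(subject)
    if obj is not None:
        conditions.append("o = ?")
        params.append(obj)
    sql = f"SELECT s, o FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql, params


MEMBER_OF = "http://ns.example.org/rels#memberOf"
# "Which objects are members of demo:73?"  i.e. (?s, memberOf, demo:73)
sql, params = pattern_to_sql(None, MEMBER_OF, "info:fedora/demo:73",
                             {MEMBER_OF: "t1"})
```

Values are passed as bound parameters rather than spliced into the SQL string; only the table name, which comes from the mapping and not from user input, is interpolated.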

Our Solution. Designed to expose transaction/connection semantics: calling code has to provide the JDBC connection for adding and querying triples. Thus there is a clear path to using advanced transactional capabilities offered by the JDBC driver (such as XA).
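The benefit of caller-supplied connections is that a constraint check and the write it guards can share one transaction. The sketch below uses Python's sqlite3 in place of JDBC and is not MPTStore's API: the functions `add_triple`, `exists`, and `ingest_member`, and the table names `t_type` and `t_member`, are hypothetical.

```python
import sqlite3


def add_triple(conn, table, s, o):
    """Write on the caller's connection; never commit or roll back here."""
    conn.execute(f"INSERT INTO {table} (s, o) VALUES (?, ?)", (s, o))


def exists(conn, table, s, o):
    """Check for a triple on the caller's connection."""
    row = conn.execute(
        f"SELECT 1 FROM {table} WHERE s = ? AND o = ?", (s, o)
    ).fetchone()
    return row is not None


def ingest_member(conn, member, aggregation):
    """Add a memberOf triple only if the target is a known Aggregation.
    The caller's code, not the store functions, decides when to commit."""
    try:
        if not exists(conn, "t_type", aggregation, "Aggregation"):
            raise ValueError(f"{aggregation} is not an existing Aggregation")
        add_triple(conn, "t_member", member, aggregation)
        conn.commit()
    except Exception:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_type (s TEXT NOT NULL, o TEXT NOT NULL)")
conn.execute("CREATE TABLE t_member (s TEXT NOT NULL, o TEXT NOT NULL)")
add_triple(conn, "t_type", "info:fedora/demo:73", "Aggregation")
conn.commit()

ingest_member(conn, "info:fedora/demo:1", "info:fedora/demo:73")
try:
    ingest_member(conn, "info:fedora/demo:2", "info:fedora/demo:99")
    rejected = False
except ValueError:
    rejected = True
member_count = conn.execute("SELECT COUNT(*) FROM t_member").fetchone()[0]
```

Because the store functions never commit on their own, the same pattern would extend to a transaction that spans several writes or, with a suitable driver, several resources.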

Results. MPTStore performance is well suited to the NSDL use case. Adds and modifies were significantly faster than in the Kowari case, and were unaffected by database size. SPO queries were on par with Kowari in the unbound (common) case.

Results. Bonus: the NSDL team was already very familiar with RDBMS administration (performance tuning, backups, etc.). Stored data is transparent and "hackable": ad-hoc SQL queries and analysis are relatively simple.

Results. Fedora bonus: the ability to easily analyze the database helped us track down our own middleware bugs (improving Kowari performance).

Fast, Immediate Updates. [Graph: average ms per datastream modification.] MPTStore achieves virtually the same performance whether buffering or not. Complete test details are in the Fedora 2.2 docs.

RI: Future Directions. External Resource Index: event-based (JMS) updates to an external triplestore, analogous to GSearch index updates; may be asynchronous; may index other datastreams. Make full use of triplestore capabilities without compromising the core repository: inference (e.g. krule, RACER); native APIs.

RI: Future Directions. Internal (synchronous) Resource Index. Assumption: XA transactions. Option A: MPTStore only. Pro: simple, synchronous, JDBC (no need for middleware). Con: basic queries (no iTQL, maybe SPARQL-Lite). Option B: Mulgara or MPTStore. Pro: richer queries when using Mulgara (iTQL). Con: complexity (need for XA-aware middleware?).

Thank You. More information: http://mptstore.sourceforge.net/ http://www.fedora.info/download/2.2/ http://tripletest.sourceforge.net/
