MPTStore: A Fast, Scalable, and Stable Resource Index
Aaron Birkland and Chris Wilper
Open Repositories 2007, San Antonio, TX

Background: RDF in Fedora
- A natural fit:
  - Object-object relationships
  - Object properties
  - Exposure to services (as a graph)
- Resource Index introduced: Fedora 2.0 (January '05)

Background: RDF in Fedora
Challenges:
- Scalability
  - Few triplestores are designed for 100M+ triples
- Performance
  - Jena vs. Kowari (Jena: out of memory)
  - Kowari vs. Sesame Native (Sesame: slow complex queries)
- Stability
  - Frequent "rebuilds"

Motivation: The NSDL Use Case
- The NSDL has a moderately large repository
  - 4.7 million objects
  - 250 million triples
- ...and has a large volume of writes
  - Driven by periodic OAI harvests
  - Primarily mixed ingests and datastream modifications
  - Highly concurrent reads and writes

Motivation: The NSDL Use Case
- Additionally, NSDL has data model constraints that must be enforced
  - Existential/referential constraints on objects (e.g. "foreign key" constraints)
  - Uniqueness constraints on some object properties

Motivation: The NSDL Use Case
- These constraints primarily center on RELS-EXT content:
  - Relationships to other NSDL objects (forming a graph)
  - Literal-valued properties of the object itself

Example RELS-EXT content and the constraints on it:
  <foxml:datastream ID="RELS-EXT" ...>
    ...
    <example:id>PLUGH-XYZZY</example:id>
    <example:objectType>Resource</example:objectType>
    <example:memberOf rdf:resource="info:fedora/demo:73" />
    ...
  </foxml:datastream>
- example:id: must be globally unique
- example:memberOf: the referenced object...
  1) Must exist
  2) Must be 'Active'
  3) Must be objectType 'Aggregation'

Motivation: The NSDL Use Case
- No suitable constraint enforcement mechanisms exist in Fedora itself
- Our approach:
  - Enforce content model in middleware
  - Serialize access where we have to
  - Query RI before ingest or modify

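The "query RI before ingest or modify" step can be sketched roughly as follows. This is only an illustration, not the NSDL middleware: the index is modeled as an in-memory set of triples, the predicate names are taken from the RELS-EXT example, and `example:state` is an assumed stand-in for however the object state is actually exposed in the Resource Index.

```python
# Hypothetical predicate names modeled on the RELS-EXT example;
# STATE is an assumption, not a real Fedora predicate.
ID = "example:id"
MEMBER_OF = "example:memberOf"
OBJECT_TYPE = "example:objectType"
STATE = "example:state"


def query(index, s=None, p=None, o=None):
    """Return all triples in the index matching the pattern (None = wildcard)."""
    return [
        t for t in index
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def check_ingest(index, subject, new_triples):
    """Return a list of constraint violations for the object about to be written."""
    problems = []
    for s, p, o in new_triples:
        if p == ID:
            # Uniqueness: no other object may already carry this id.
            owners = {t[0] for t in query(index, p=ID, o=o)} - {subject}
            if owners:
                problems.append(f"id {o!r} already used")
        elif p == MEMBER_OF:
            # Referential: the target must exist, be Active, and be an Aggregation.
            if not query(index, s=o):
                problems.append(f"{o} does not exist")
            elif not query(index, s=o, p=STATE, o="Active"):
                problems.append(f"{o} is not Active")
            elif not query(index, s=o, p=OBJECT_TYPE, o="Aggregation"):
                problems.append(f"{o} is not an Aggregation")
    return problems
```

A check like this is only trustworthy if the index reflects every earlier write, which is why the visibility of buffered triples matters so much in the slides that follow.
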
The Challenge
- Querying the RI to determine correct repository state proved to be the most difficult aspect
  - To achieve acceptable performance with Kowari, triple writes are buffered and executed in large, infrequent chunks
  - Triples waiting in these buffers are invisible to outside queries

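The visibility problem can be illustrated with a toy model. This is not how Kowari or Fedora implement buffering; the class and its `flush_threshold` parameter are invented here purely to show why a constraint check can miss a triple that was written moments earlier.

```python
class BufferedIndex:
    """Toy index: writes wait in a buffer until flush() applies them as one batch."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.committed = set()  # triples visible to queries
        self.pending = []       # buffered triples, not yet visible

    def add(self, triple):
        self.pending.append(triple)
        if len(self.pending) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.committed.update(self.pending)
        self.pending.clear()

    def contains(self, triple):
        # Queries consult only the committed set, never the buffer.
        return triple in self.committed


index = BufferedIndex(flush_threshold=1000)
triple = ("info:fedora/demo:74", "example:id", "PLUGH-XYZZY")
index.add(triple)
seen_before_flush = index.contains(triple)  # False: still in the buffer
index.flush()
seen_after_flush = index.contains(triple)   # True: now visible
```

A uniqueness check issued between `add` and `flush` would wrongly conclude that the id is free.
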
The Challenge
- Possible solution: flush the buffer after every write operation
- New problem: flushed updates with Kowari are very expensive
  - Multiple seconds per operation
  - This was incompatible with NSDL processing volume
- This was a real showstopper...

The Challenge
Other difficulties the NSDL had with Kowari:
- RI corruption under concurrent use
- RI corruption with abnormal shutdowns
- Scalability: performance became noticeably worse with increasing repository size
- Steep memory requirements

The Challenge
Searching for a solution...
- Other triple stores (e.g. Jena, Sesame) were considered for Fedora in the past and rejected for various reasons
- RDBMS seemed attractive: efficient transactions, very stable, generally speedy
- The "one big table" paradigm did not seem to give us the desired scalability in initial tests

Our Solution
Mapped predicate tables:
- One table per predicate, containing indexed 'subject' and 'object' values
- A mapping table containing metadata correlating each predicate URI to a particular db table

Triples: table t1, columns s and o, holding <info:fedora/demo:1>, <info:fedora/demo:2>, <info:fedora/demo:3>, <info:fedora/demo:4>
Predicate Mapping: table tmap, columns p and pkey
  <info:fedora/fedora-def:model#disseminates>  1
  <http://ns.example.org/rels#memberOf>  2

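A minimal sketch of the mapped-predicate-table layout, using Python's built-in sqlite3 module as a stand-in for the RDBMS. The table and column names (`tmap`, `pkey`, `p`, `t1`, `s`, `o`) follow the diagram; the helper functions `create_store`, `table_for`, and `add_triple` are hypothetical and are not MPTStore's API or its exact DDL.

```python
import sqlite3


def create_store(conn):
    """Create the predicate-mapping table (pkey -> predicate URI)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tmap ("
        "pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL)"
    )


def table_for(conn, predicate):
    """Return the table name for a predicate, creating the table on first use."""
    row = conn.execute("SELECT pkey FROM tmap WHERE p = ?", (predicate,)).fetchone()
    if row is not None:
        return f"t{row[0]}"
    pkey = conn.execute("INSERT INTO tmap (p) VALUES (?)", (predicate,)).lastrowid
    table = f"t{pkey}"
    # One table per predicate, with subject and object both indexed.
    conn.execute(f"CREATE TABLE {table} (s TEXT NOT NULL, o TEXT NOT NULL)")
    conn.execute(f"CREATE INDEX {table}_s ON {table} (s)")
    conn.execute(f"CREATE INDEX {table}_o ON {table} (o)")
    return table


def add_triple(conn, s, p, o):
    """Adding a triple is a single-row insert into that predicate's table."""
    conn.execute(f"INSERT INTO {table_for(conn, p)} (s, o) VALUES (?, ?)", (s, o))
```

Table names are derived from the integer key rather than from the predicate URI, so no untrusted text is ever interpolated into SQL.
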
Our Solution
Benefits:
- Low-cost adds and deletes
- Queries with known predicates are very fast
- Complex queries benefit because the RDBMS planner has finer-grained statistics and query plans
- Flexible data partitioning

Our Solution
Disadvantages:
- Need to manage the predicate-to-table mapping
- Complex queries require more effort to formulate
- With a naïve approach, simple unbound queries scale linearly with the number of predicates

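The linear scaling of unbound-predicate queries can be seen in a small sketch. It assumes the `tmap`/`tN` layout from the diagram, uses sqlite3 as a stand-in database, and uses a hypothetical helper name and made-up `ex:` predicates. Answering the pattern (subject, ?, ?) means visiting every predicate table once.

```python
import sqlite3


def find_by_subject(conn, subject):
    """Answer the pattern (subject, ?, ?) by visiting every predicate table."""
    mapping = conn.execute("SELECT pkey, p FROM tmap ORDER BY pkey").fetchall()
    results = []
    for pkey, predicate in mapping:  # one lookup per predicate table
        rows = conn.execute(f"SELECT o FROM t{pkey} WHERE s = ?", (subject,))
        results.extend((subject, predicate, o) for (o,) in rows)
    return results


# Tiny fixture: two predicates, hence two tables to visit.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tmap (pkey INTEGER PRIMARY KEY, p TEXT UNIQUE NOT NULL);
    CREATE TABLE t1 (s TEXT, o TEXT);
    CREATE TABLE t2 (s TEXT, o TEXT);
    INSERT INTO tmap VALUES (1, 'ex:memberOf'), (2, 'ex:objectType');
    INSERT INTO t1 VALUES ('demo:1', 'demo:73');
    INSERT INTO t2 VALUES ('demo:1', 'Resource'), ('demo:73', 'Aggregation');
""")
triples = find_by_subject(conn, "demo:1")
```

With roughly 50 predicates this is 50 small indexed lookups, which helps explain why the cost stays tolerable when the number of distinct predicates is low.
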
Our Solution
Observations:
- The total number of distinct predicates is much lower than the number of subjects or objects; NSDL has ~50
- Unbound-predicate queries are less common
- NSDL is heavily biased towards a high volume of writes and simple queries

Our Solution
Enter MPTStore:
- Java library that handles all mapping and accounting behind the scenes
- API for performing triple writes and queries
- Translates queries from a particular language (e.g. SPO, SPARQL) into SQL statements

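To make the translation step concrete, here is a rough sketch of turning a single subject-predicate-object pattern into parameterized SQL over the mapped predicate tables. It is not MPTStore's translator (which is a Java library); `spo_to_sql` and the `ex:` predicate names are hypothetical, and `None` is used here to mark an unbound term.

```python
def spo_to_sql(pattern, predicate_tables):
    """Translate an (s, p, o) pattern into (sql, params); None marks an unbound term.

    predicate_tables maps predicate URI -> table name, as read from the mapping
    table. A bound predicate that is not in the mapping raises KeyError.
    """
    s, p, o = pattern
    # A bound predicate selects one table; an unbound one fans out over all of them.
    targets = predicate_tables if p is None else {p: predicate_tables[p]}
    selects, params = [], []
    for predicate, table in targets.items():
        conditions = []
        if s is not None:
            conditions.append("s = ?")
        if o is not None:
            conditions.append("o = ?")
        where = f" WHERE {' AND '.join(conditions)}" if conditions else ""
        # The predicate itself is returned as a bound constant column.
        selects.append(f"SELECT s, ? AS p, o FROM {table}{where}")
        params.append(predicate)
        params.extend(v for v in (s, o) if v is not None)
    return " UNION ALL ".join(selects), params


tables = {"ex:memberOf": "t1", "ex:objectType": "t2"}
bound_sql, bound_params = spo_to_sql(("demo:1", "ex:memberOf", None), tables)
unbound_sql, unbound_params = spo_to_sql((None, None, "Resource"), tables)
```

The bound-predicate case touches exactly one table, while the unbound case produces one `SELECT` per predicate joined with `UNION ALL`.
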
Our Solution
Designed to expose transaction/connection semantics:
- Calling code has to provide the JDBC connection for adding and querying triples
- Thus, there is a clear path to using advanced transactional capabilities offered by the JDBC driver (such as XA)

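The value of caller-supplied connections can be sketched as follows. Python's sqlite3 connection stands in for a JDBC connection, and `add_unique` and `ingest` are hypothetical names, not MPTStore calls. The point is that the write helper never commits or rolls back by itself, so the caller decides where the transaction begins and ends and can run the constraint check and the write as one unit.

```python
import sqlite3


def add_unique(conn, table, s, o):
    """Insert (s, o) on the caller's connection; never commits or rolls back."""
    clash = conn.execute(
        f"SELECT 1 FROM {table} WHERE o = ? AND s <> ?", (o, s)
    ).fetchone()
    if clash:
        raise ValueError(f"value {o!r} is already in use")
    conn.execute(f"INSERT INTO {table} (s, o) VALUES (?, ?)", (s, o))


def ingest(conn, rows):
    """Caller-defined unit of work: either every row is written, or none are."""
    try:
        for s, o in rows:
            add_unique(conn, "t1", s, o)
        conn.commit()
        return True
    except ValueError:
        conn.rollback()
        return False


conn = sqlite3.connect(":memory:")
# t1 plays the role of the table for a uniquely valued predicate such as example:id.
conn.execute("CREATE TABLE t1 (s TEXT NOT NULL, o TEXT NOT NULL)")
conn.commit()

first = ingest(conn, [("demo:1", "PLUGH-XYZZY")])
second = ingest(conn, [("demo:2", "NEW-ID"), ("demo:3", "PLUGH-XYZZY")])
```

The second batch fails on its second row, and the rollback also discards the first row of that batch, so the table still holds only the triple from the first batch.
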
Results
- MPTStore performance is well suited to the NSDL use case
- Adds and modifies were significantly faster than in the Kowari case, and were unaffected by database size
- SPO queries were on par with Kowari in the unbound (common) case

Results
Bonus:
- The NSDL team was very familiar with RDBMS administration: performance tuning, backups, etc.
- Stored data is transparent and "hackable": ad-hoc SQL queries and analysis are relatively simple

Results
Fedora bonus:
- Ability to easily analyze the database helped us track down our own middleware bugs (improved Kowari performance)

Fast, Immediate Updates
- Graph shows average ms per datastream modification
- MPTStore achieves virtually the same performance whether buffering or not
- Complete test detail is in the Fedora 2.2 docs

RI: Future Directions
External Resource Index:
- Event-based (JMS) updates to an external triplestore
  - Analogous to GSearch index updates
  - May be asynchronous
  - May index other datastreams
- Make full use of triplestore capabilities without compromising the core repository
  - Inference (e.g. krule, RACER)
  - Native APIs

RI: Future Directions
Internal (Synchronous) Resource Index:
- Assumption: XA transactions
- Option A: MPTStore only
  - Pro: simple, synchronous, JDBC (no need for middleware)
  - Con: basic queries (no iTQL, maybe SPARQL-Lite)
- Option B: Mulgara or MPTStore
  - Pro: richer queries when using Mulgara (iTQL)
  - Con: complexity (need for XA-aware middleware?)

Thank You
More information:
- http://mptstore.sourceforge.net/
- http://www.fedora.info/download/2.2/
- http://tripletest.sourceforge.net/
